As a trainer of the DAU-CDAT Training (Certified Data & Analytics Tester) people sometimes ask me questions about the training. Curious? Below you’ll find the Top questions.
Note: do you have a question yourself? Just let me know and I’m happy to answer (or to find someone who should know the anwer).
#1) I am a data engineer, not a tester. I am worried that I don’t understand the testing lingo. Should I worry?
No, that should not be a problem! In the world of Testing, people have different “dialects” when it comes to the testing terminology. Therefore, trainers will always make sure that everyone is on the same page, first. This will give you the opportunity to learn the terms that are used in CDAT. To get even more prepared, you could read ISTQB and/or TMap related books or take a training about testing in general.
#2) Will I learn to automate tests for a data centric environments?
As CDAT is a Foundation training, the focus will be on learning what to test, why to test this, how to elicit test cases (test design) and how to create a testing approach that you can use to report remaining risks in clear wordings to your client. The “how” question, specifically tools and test automation, will be tackled in (future) trainings in the DAU curriculum.
#3) Can I take the exam with Brightest without attending a training?
Yes, you can try and take the exam with Brightest without attending the training: DAU is not forcing you to attend the training before you can enter the exam (like some other certifications do). Please contact us for exam vouchers, should you decide to do so. Keep in mind that many attendees feel that the training adds value as they return to their work place with new concepts and practical tools. If that sounds good to you, then please take the time (2 days or 4 evenings) and join the training.
When dealing with data centric systems, e.g. when analyzing and testing data warehouse or business intelligence environments, you’ll probably come across various errors. These errors can be due to insufficient analysis of the source data, developing the wrong transformations or even by using the system in the wrong way. To explain and categorize these common mistakes, you’ll find a summary of common mistakes below.
Types of errors
Homonyms and Synonyms in columns/field names
Technical differences between systems
Wrong usage of source systems
Fields/Columns filled with technically correct values but not providing any real business
History issues
Wrong mappings
Wrong transformations
Homonyms/Synonyms
When various concepts are known under the same description, we call this a “homonym”. For instance, an “area” may be used to depict a quantity of a two-dimensional surface (e.g. measured in square meters) but it may also refer to a region (e.g. a neighborhood, borough or zone).
When two different names are used for the same concept, we call these “synonyms”. For instance, when analyzing concepts like “contact” and “client”, it is important to figure out if the meaning of the concepts are 100% identical or that slight differences apply, based on the usage in the source systems. E.g. for some departments, contacts may be treated as a customer only after they have actually put in their first order with the company.
When analyzing, building and testing interfaces between systems, data integration and reports, this is something to take a good look at.
Technical Differences between Systems
When transferring (loading, moving) data from one system to another and integrating these values, issues may arise due to technical differences between systems. In the design, development and test of the interfaces and data integration, this needs attention.
Character Sets
One example is the different implementation of treating characters, i.e. storing specific characters in a different way on the various systems. Computers are good at storing numbers (1’s and 0’s) but for storing texts, character sets were developed. These may be incomplete and not able to store all relevant characters – leading to simplifying texts. E.g. the ASCII character set is able to store “plain English” characters but to store all letters, punctuation and symbols of other languages, the Unicode character set is required.
Physical Storage of Data
Related to the issue above, is the difference in storing files on different systems like mainframes, Windows based system and a Unix/Linux based systems. Files may be treated in a different way where it comes to technical features like “end of line” characters and the way specific characters are stored.
Differences in Data Types
When storing data, you often get to choose from different data types, like character, numbers, dates and currencies. Various Database management systems (DBMS’s) and integration tools treat data types in different ways. These differences may be inherent to the system, others may be configured by technical staff. E.g. if a source system only supports storing text and target systems have ways to store numbers, dates and texts, some kind of transformation is needed. The analysis of the possibilities in the source systems is crucial: is “7,001” to be treated the same as “7.001”? Data profiling tools can be helpful. Make sure ask the right questions during review of the specifications, to test variations in the unit- and system tests and to have “rich” acceptance test sets that cover all kinds of different values in the source systems.
Wrong use in/of Source Systems
Users are extremely resourceful when it comes to using their tools (i.e. our source systems) in the most efficient manner. This will sometimes lead to using in a way it was not meant to, e.g. by filling “free format” fields with data that should have been stored elsewhere. Or by team A using field X in a different way than team B, making it difficult to analyze the data in that certain field (X).
Values are Technically Correct – but have low Business Value
When analyzing data sets, you may come across tables which contain a lot of “generic” of default values. Consider a table with customers having a column called “MarketSegment”. If most hold the value “99 Various Market Segments”, the business value of having such a field in a report is quite low.
When analyzing data in order to build meaningful reports, this should be mentioned to end users in an early stage – as they will probably record this as a finding or bug at the time of an acceptance test.
The following situation is even worse: when users decide to enter blanks or certain values like “X” or “0,0” in fields that should have been filled in a correct manner. In this case, the value is not only too generic (like with “99 Various Market Segments”) but plain wrong. If not addressed, it may lead to wrong business decisions. E.g. when car crashes are not registered on their precise location, due to users entering wrong geo locations, dangerous streets and crossings may not get improved in time.
History Issues
Some source systems are designed is such a way that only the current situation is stored, and does not retain “history”. Should history be required this may lead to (at least) two kinds of issues.
The first issue arises when end users get creative in storing information that they want to keep in fields that weren’t meant for this purpose, like in a “Remarks” field. Should data analysts overlook the data in this field, e.g. because the technical documentation says it’s “just for remarks”, this may lead to issues. Data profiling should be applied, to find out if these situations occur in the source systems. Even if the issue is addressed, retrieving the relevant data from such fields can be hard and error prone.
The second issue happens when there is uncertainty about the actual validity of a value, as the original values get overwritten in the source systems. Is the shown value “the latest value” or the “original value”? Input from the user community is required.
Note: A solution may consist of developers building a system for “change data capture” (CDC), registering each and every change in the source system and figuring out what to do with it based on technical and business rules.
In data warehousing, history is often kept by the use of Slowly Changing Dimensions type 2 (SCD2). Tables have columns added with a begin- and end date, showing the period in which the situation is valid. When an attribute changes value, a new row is added with a new begin date (and the old record gets “closed”). More variants of SCD exist, and when developing solutions based on SCD, errors can be introduced as well. Be sure to have a test plan in place to cover the basic situations when working with SCD.
CustomerID
Short Name
ValidFrom
ValidUntil
1
Dorsek
01-01-1980
14-02-2021
1
Dörsek
15-02-2021
<null>
2
Johnson
01-01-1980
<null>
3
Smith
01-01-1980
17-09-2020
3
Smith-Jones
18-09-2020
<null>
“Customers” table with SCD2 based History columns
Wrong Mappings
Mappings can be implemented in a wrong way, in many ways.
First, due to a human errors, columns may be mapped to other columns than intended (in the design). E.g. mapping “StudentName” in the student database to “BANKACCOUNT” instead of “LASTNAME”. Replacing the manual mapping with more automated mappings tackle many of these issues and many of the wrong mappings are easily discovered by a visual check. Should a mapping hold many similar fields, this may become more difficult, but even so: checking it is very valuable, as solving issues in Acceptance Tests are more costly than bugs that are found early in the process.
A second type of error does not concern mapping the actual fields, but concerns the contents of the fields. E.g. when mapping dates from TEXT and DATE fields to other DATE and DATETIME, extra attention is required from developers and testers. Does “20230305” stand for 3 May 2023 or 5 March 2023? Are DATETIME fields (e.g. “20230305000000”) actually holding date/time with much precision – or only the date?
If several systems hold similar values, these may be integrated, but small differences are lurking around the corner: if system A will store “sex” coded as 1 (male), 2 (female) and 0 (unknown and other) and system B will store texts “boy”, “girl”, “man”, “woman”, “military”, “disabled”, “other” and <blank>, then business rules will need to be applied to make sure that the data in Data Marts and reports will be meaningful.
Suggestions
This is just a limited number of issues that may occur. You’ll probably have found some of these, and others, yourself. Feel free to share your experiences and report issues that you’re experiencing in the field of data.
Profiteer nu van de Black Friday Deal van Verified en ontvang je examenvoucher t.w.v. EUR 200,- helemaal gratisbij je inschrijving voor “Certified Data & Analytics Tester”.
Dit betekent dat je voor de tweedaagse training inclusief het officiële examen van DAU/Brightest nu geen EUR 1195,- maar slechts EUR 995,- betaalt (ex. BTW).
Deze actie gaat per direct in en is geldig tot en met vrijdag 25 november 2022 (23:59 CET).
Schrijf je nu in en vermeld in het commentaarveld “Black Friday” om aanspraak te maken op deze korting.
Samen studeren is toch véél leuker dan in je eentje? En nu levert het je ook nog eens een mooie korting op!
Hoe werkt het?
Als je je vanaf nu samen met een vriend/collega/kennis registreert voor de training “Certified Data & Analytics Tester” via het formulier onderaan deze pagina, dan krijgen jullie allebei 10% korting op de reguliere cursusprijs (ex. examen). Dat wil zeggen: je betaalt per persoon geen EUR 995,- maar EUR 895,50 voor deze training.
Let erop dat je in het opmerkingenveld aangeeft met wie je samen de training aanvraagt (naam en achternaam).
Deze actie is geldig tot 7 februari 2022.
Details
De korting wordt verstrekt op de factuur van de training zodra beide deelnemers zich daadwerkelijk via het deelnameformulier voor de training hebben ingeschreven.
De actie is geldig voor nieuwe deelnemers die zich na 10 januari 2022 aanmelden tot 7 februari 2022.
Om het examen van Data & Analytics United (DAU) af te leggen, heb je een voucher nodig van Brightest. Deze kun je ook aanschaffen via Verified maar op de vouchers kunnen we geen korting verstrekken. Als je in het aanmeldingsformulier aangeeft dat je de voucher graag ontvangt, zullen we deze ook meenemen bij de factuur. Je kunt de voucher ook tijdens of na de training aanvragen.
Op 26 september heeft Armando Dörsek namens Verified het “Superlot” van de Grote Clubactie van zijn buurmeisje Luna in ontvangst mogen nemen.
De opbrengst van het Superlot gaat naar spelmaterialen voor de jeugd van de Apeldoornse Mixed Hockey Club (AMHC), waar Luna speelt in de jeugd. We hopen dat er nog veel bedrijven hetzelfde doen en daarmee de club van Luna steunen!
In juni 2021 heeft Armando Dörsek, onze consultant op gebied van testen, kwaliteit en compliance, een presentatie voor het KNVI verzorgd. Deze heeft als titel: “10 Beginnersfouten in het Datalandschap (waar Testen het verschil kan maken)”. De deelnemers konden door middel van stemming besluiten welke vijf (van tien) onderwerpen die avond toegelicht zouden worden.
Onderwerpen
De deelnemers konden kiezen uit de volgende onderwerpen:
Meer is niet altijd beter…
(over gebruik van productiedata)
Doet u mij maar een testertje!
(over specifieke vaardigheden en eigenschappen)
Onderschatting van kaders…
(over security en privacy)
Datakwaliteit, dat kunnen we niet echt meten?
(over data-kwaliteitsattributen)
Datakwaliteit? Garbage In = Garbage Out, toch?
(over eigenaarschap van data, datakwaliteit en effecten op data-projecten)
We herstellen die data hier – omdat het kan.
(ten koste waarvan corrigeer je data in een datawarehouse)
Onze Helden: de Bugfixers
(over de delicate balans tussen snel en goed)
Slepen met Data
(over het effect van “meerdere waarheden” – naast het risico op datalekken)
Wil de Echte Klant opstaan?
(over acceptatietesten in een dataproject)
Risico’s? Eh…
(over het nut van risicogebaseerd testen in een dataproject).
Opname
De presentatie (opname) is beschikbaar op de site van de KNVI.
Blogreeks
Op basis van de onderwerpen zal Armando Dörsek een blogreeks schrijven, te starten met de onderwerpen die bij KNVI gepresenteerd zijn.
Houd de site in de gaten!
Trainings for the second half of 2021 are currently being planned. Verified will be organizing trainings in Berlin, Copenhagen, Utrecht and Schiphol Amsterdam. If you’re participating in one of these trainings, you’ll get a free (*) exam voucher, worth EUR 190,-.
This way, planning ahead is really beneficial!
*) Rules apply
Valid for all CDAT trainings from July through December 2021, organized by Verified and enrolled before 1 July 2021 through the Verified-website. The free voucher is part of a (paid) training and will be sent to the attendee immediately after full payment and upon completing the training.
Daarnaast blijft het mogelijk om in company trainingen en trainingen op maat aan te vragen. Vragen? Neem vooral contact met Verified op, we staan je graag te woord.
Ongeveer twee jaar geleden legden Jaap de Roos (DataConnected), Rogier Ammerlaan (Ammerlaan Trainingen) en Armando Dörsek (Verified.nl) de basis voor de training van Certified Data & Analytics Tester (CDAT). Na het verzorgen van een reeks trainingen in Nederland werd het tijd voor enige vernieuwing: binnenkort volgt hier meer nieuws.
(Een tipje van de sluier: uw certificaat (er)kent men straks ook buiten Nederland…)
Testing of Data & Analytics environments is increasingly supported by test tools. At Verified, we wonder what your experience is with one of these tools – and if you propose certain tools that have helped you ?
Please let us know, by leaving a message under this post, through the Contact Form, LinkedIn or Twitteror contact us in any way you prefer. If you like, we can reply with a listing of more than 15 tools that should help testers and analysts in testing and quality issues.
Note that for BI Testing, an add on is available. Connectors are available for Cognos, Tableau, SAP Business Objects, Microstrategy and OBIEE. Features include comparison of reports when upgrading or migrating, querying report meta data.
ICEDQ
iCEDQ is a commercial ETL Test Tool, which runs in the cloud, on premise or hybrid.
It offers a rule engine module, which features:
Checksum rules (e.g. row counts)
Reconciliation rules (e.g. missing records)
Validity rules (SCD Type 2, Format Checks)
Script rules (e.g. execute DML/DDL, pre and post actions).
iCEDQ currently integrates with Jenkins, TFS, Jira, HP ALM and ServiceNow.
QualiDI focuses on ETL testing from sources to targets.
It provides data validation and comparison of data, has a centralized repository of requirements, test cases, test results and has a API trigger based automation of test execution.
There is Business Rules Engine, enabling non-technical users to write rules instead of queries. It integrates with Bitwise Test Data Management Suite for providing test data.
QualiDi is a commercial product, a 30 day trial is available
SSIS Tester is offered by bytesoftwo and focuses on testing Microsoft SSIS.
SSIS Tester is a testing framework built on top of SQL Server Integration Services. It enables you to test packages, tasks and precedence constraints. It supports two types of tests:
unit tests
integration tests.
SSIS Tester helps you to develop you ETL process in the test driven manner and to identify errors early in the development process. It offers “data taps” and fake source or targe components, enabling developers to work independently. Tests target packages or particular tasks.
More information on SSIS Tester on MSDN can be found here:
In a few weeks I will publish the next Blog post, asking for new experiences. Thanks for reading this post!
Omdat deze tekst gericht is op een internationaal publiek, heb ik ervoor gekozen om de bovenstaande tekst in het Engels op te stellen. Stelt u prijs op een vertaling naar het Nederlands? Neemt u dan aub contact met ons op.
Below you can find the first 4 tools of this list, with a short description and links to the tool providers.
We are curious for your experiences when using these tools. What have they brought to your (testing) team? What is a Big Plus of using the tool – and with which features are you less impressed? Is an active user community available, or a help desk that makes a difference? Please let us know, by leaving a message under this post, through the Contact Form, LinkedIn or Twitter.
Note: We are not being sponsored by one of the products or their integration partners.
Data Cleaner
Data Cleaner is software which is used for Data Profiling.
It can also be used in the process of creating logical and physical test cases as it offers extensive support to analyze and browse through data sets.
Data Cleaner also offers features to clean, transform and export data sets. Rules can be built and reused.
Data Cleaner is free, additional (paid) features are available.
It can be downloaded to your desktop (Windows, Mac) environment from sourceforge or the community website:
SAS Unit is a framework for the unit testing of SAS programs (SAS EG 9.2, 9.3, 9.4).
Amongst others, SAS Unit offers features to assess test coverage of SAS Macros, running tests in batch mode, performance testing of SAS programs.
SAS Unit uses assertions to test the values of macro variables, contents of SAS data sets, relations between data sets, the existence of files incl. log files and – messages.
SAS Unit is open source software and built from SAS macros.
Informatica offers a testing solution called Data Validation Option for PowerCenter (DVO).
The DataValidator for PowerCenter product was originally developed by a company DVO SOFTWARE. It is now available under the Informatica brand as Informatica PowerCenter Data Validation Option.
DVO is a custom tool built on top of Informatica PowerCenter. DVO integrates with the Informatica PowerCenter Repository and Integration Services and enables developers and business analysts to create rules to test the data being transformed during the data integration process.
Informatica mentions the following ETL Testing Use Cases: Production Validation Testing, Source to Target Testing and Application upgrades.
Amongst others, it features:
An easy-to-use GUI interface to test the rules created for data validations for multiple projects.
No programming skills needed to create validation tests.
A repository with reporting capabilities to provide a complete audit trail of all tests and their results.
Reading data definitions from PowerCenter metadata repositories and can easily deal with data definition changes.
Note that DVO tests data only, not mappings or workflows. Testing mappings is unit testing, which is different from data validation.
As DVO is a commercial product it may be attractive to Informatica Power Center users. At this moment, no free trial seems available (Feb 9, 2020).
Datagaps offers several products, e.g. ETL Validator, BI Validator, Data Flow and Test Data Manager.
The Datagaps ETL Validator comes pre-packaged with an ETL engine capable of extracting and comparing millions of records from multiple data sources while executing test cases in parallel.
ETL Validator also has a Visual Test Case Builder with drag-and-drop capabilities and a Query Builder that enables defining tests without manual typing.
Key features include enterprise collaboration, flat file testing, data profile testing, baselining, data quality testing, and database metadata testing.
Omdat deze tekst gericht is op een internationaal publiek, heb ik ervoor gekozen om de bovenstaande tekst in het Engels op te stellen. Stelt u prijs op een vertaling naar het Nederlands? Neemt u dan aub contact met ons op.
Het waren dagen met goede discussies en nieuwe inzichten voor iedereen. We hebben hard gewerkt en plezier gehad met uiteindelijk een prima resultaat. Zes van de zeven deelnemers zijn in één keer geslaagd voor het examen. De docenten ontvingen een mooie 8 als beoordeling.
Testen & Toetsen in de wereld van BI, Datawarehousing en Analytics
We gebruiken cookies om ervoor te zorgen dat onze site zo soepel mogelijk draait. Als je doorgaat met het gebruiken van deze site, gaan we er vanuit dat je ermee instemt.OKPrivacybeleid