Tagarchief: issues

Analyzing and Testing Data Environments – Common Issues

Analyzing and Testing Data Environments – Common Issues

Armando Dörsek © 2023

Introduction

When dealing with data centric systems, e.g. when analyzing and testing data warehouse or business intelligence environments, you’ll probably come across various errors. These errors can be due to insufficient analysis of the source data, developing the wrong transformations or even by using the system in the wrong way. To explain and categorize these common mistakes, you’ll find a summary of common mistakes below. 

Types of errors

  1. Homonyms and Synonyms in columns/field names
  2. Technical differences between systems
  3. Wrong usage of source systems
  4. Fields/Columns filled with technically correct values but not providing any real business
  5. History issues
  6. Wrong mappings
  7. Wrong transformations

Homonyms/Synonyms

When various concepts are known under the same description, we call this a “homonym”. For instance, an “area” may be used to depict a quantity of a two-dimensional surface (e.g. measured in square meters) but it may also refer to a region (e.g. a neighborhood, borough or zone).

When two different names are used for the same concept, we call these “synonyms”. For instance, when analyzing concepts like “contact” and “client”, it is important to figure out if the meaning of the concepts are 100% identical or that slight differences apply, based on the usage in the source systems. E.g. for some departments, contacts may be treated as a customer only after they have actually put in their first order with the company.

When analyzing, building and testing interfaces between systems, data integration and reports, this is something to take a good look at.

Technical Differences between Systems

When transferring (loading, moving) data from one system to another and integrating these values, issues may arise due to technical differences between systems. In the design, development and test of the interfaces and data integration, this needs attention.

Character Sets

One example is the different implementation of treating characters, i.e. storing specific characters in a different way on the various systems. Computers are good at storing numbers (1’s and 0’s) but for storing texts, character sets were developed. These may be incomplete and not able to store all relevant characters – leading to simplifying texts. E.g. the ASCII character set is able to store “plain English” characters but to store all letters, punctuation and symbols of other languages, the Unicode character set is required. 

Physical Storage of Data

Related to the issue above, is the difference in storing files on different systems like mainframes, Windows based system and a Unix/Linux based systems. Files may be treated in a different way where it comes to technical features like “end of line” characters and the way specific characters are stored.

Differences in Data Types

When storing data, you often get to choose from different data types, like character, numbers, dates and currencies. Various Database management systems (DBMS’s) and integration tools treat data types in different ways. These differences may be inherent to the system, others may be configured by technical staff. E.g. if a source system only supports storing text and target systems have ways to store numbers, dates and texts, some kind of transformation is needed. The analysis of the possibilities in the source systems is crucial: is “7,001” to be treated the same as “7.001”? Data profiling tools can be helpful. Make sure ask the right questions during review of the specifications, to test variations in the unit- and system tests and to have “rich” acceptance test sets that cover all kinds of different values in the source systems.

Wrong use in/of Source Systems

Users are extremely resourceful when it comes to using their tools (i.e. our source systems) in the most efficient manner. This will sometimes lead to using in a way it was not meant to, e.g. by filling “free format” fields with data that should have been stored elsewhere. Or by team A using field X in a different way than team B, making it difficult to analyze the data in that certain field (X).

Values are Technically Correct – but have low Business Value

When analyzing data sets, you may come across tables which contain a lot of “generic” of default values. Consider a table with customers having a column called “MarketSegment”. If most hold the value “99 Various Market Segments”, the business value of having such a field in a report is quite low.

When analyzing data in order to build meaningful reports, this should be mentioned to end users in an early stage – as they will probably record this as a finding or bug at the time of an acceptance test. 

The following situation is even worse: when users decide to enter blanks or certain values like “X” or “0,0” in fields that should have been filled in a correct manner. In this case, the value is not only too generic (like with “99 Various Market Segments”) but plain wrong. If not addressed, it may lead to wrong business decisions. E.g. when car crashes are not registered on their precise location, due to users entering wrong geo locations, dangerous streets and crossings may not get improved in time.

History Issues

Some source systems are designed is such a way that only the current situation is stored, and does not retain “history”. Should history be required this may lead to (at least) two kinds of issues.

The first issue arises when end users get creative in storing information that they want to keep in fields that weren’t meant for this purpose, like in a “Remarks” field. Should data analysts overlook the data in this field, e.g. because the technical documentation says it’s “just for remarks”, this may lead to issues. Data profiling should be applied, to find out if these situations occur in the source systems. Even if the issue is addressed, retrieving the relevant data from such fields can be hard and error prone.

The second issue happens when there is uncertainty about the actual validity of a value, as the original values get overwritten in the source systems. Is the shown value “the latest value” or the “original value”? Input from the user community is required.

Note: A solution may consist of developers building a system for “change data capture” (CDC), registering each and every change in the source system and figuring out what to do with it based on technical and business rules.

In data warehousing, history is often kept by the use of Slowly Changing Dimensions type 2 (SCD2). Tables have columns added with a begin- and end date, showing the period in which the situation is valid. When an attribute changes value, a new row is added with a new begin date (and the old record gets “closed”). More variants of SCD exist, and when developing solutions based on SCD, errors can be introduced as well. Be sure to have a test plan in place to cover the basic situations when working with SCD.

CustomerIDShort NameValidFromValidUntil
1Dorsek01-01-198014-02-2021
1Dörsek15-02-2021<null>
2Johnson01-01-1980<null>
3Smith01-01-198017-09-2020
3Smith-Jones18-09-2020<null>
“Customers” table with SCD2 based History columns

Wrong Mappings

Mappings can be implemented in a wrong way, in many ways.

First, due to a human errors, columns may be mapped to other columns than intended (in the design). E.g. mapping “StudentName” in the student database to “BANKACCOUNT” instead of “LASTNAME”. Replacing the manual mapping with more automated mappings tackle many of these issues and many of the wrong mappings are easily discovered by a visual check. Should a mapping hold many similar fields, this may become more difficult, but even so: checking it is very valuable, as solving issues in Acceptance Tests are more costly than bugs that are found early in the process.

A second type of error does not concern mapping the actual fields, but concerns the contents of the fields. E.g. when mapping dates from TEXT and DATE fields to other DATE and DATETIME, extra attention is required from developers and testers. Does “20230305” stand for 3 May 2023 or 5 March 2023? Are DATETIME fields (e.g. “20230305000000”) actually holding date/time with much precision – or only the date?

If several systems hold similar values, these may be integrated, but small differences are lurking around the corner: if system A will store “sex” coded as 1 (male), 2 (female) and 0 (unknown and other) and system B will store texts “boy”, “girl”, “man”, “woman”, “military”, “disabled”, “other” and <blank>, then business rules will need to be applied to make sure that the data in Data Marts and reports will be meaningful.


Suggestions

This is just a limited number of issues that may occur. You’ll probably have found some of these, and others, yourself. Feel free to share your experiences and report issues that you’re experiencing in the field of data.