In the days of punch-card computing, data validation received greater focus than it does now. The notion of GIGO (Garbage In, Garbage Out) was afforded much greater respect than it is now. The validation of data was meticulous and bad data was a rare phenomenon. But it isn’t rare now.
So why is that?
First let’s acknowledge that corporate computer systems are immensely more complex than they were 40, 30 or even 20 years ago. Data is passed from application to application far more often that it ever used to be. So there is a much greater opportunity for data quality problems to accumulate over time.
There are many reasons for this. Here’s ten:
1. Less care is taken: The validation of data on input may still be done meticulously in core corporate systems. But it never is, for example, when someone in some department (for good reasons) sets up a little Excel or Microsoft Access database which eventually becomes an important data source for corporate systems.
2. Unverified data sources: Nowadays, data from outside the organization is often imported and used, with little vetting of the data quality and with no means of enforcing data quality if there is a clear issue. You take what you get and live with it.
3. System/application initialization: We rarely build completely new applications. Usually we have to import some data from a previous applications or two, and that frequently causes data quality problems, which emerge at a later date.
4. System and data consolidations: This is particularly a problem in the wake of mergers and acquisitions, but consolidations take place at other times too. Usually because there is an urgency involved, with too little attention paid to data quality and getting the data right from the get go. So data anomalies and errors get baked into the consolidation.
5. Data feeds: Data feeds, whether batch or real-time and whether from within the company or without are a common cause of data quality issues. Sometimes, especially with real-time feeds, no one thinks that data quality issues will arise and thus they do nothing to prevent them.
6. Accidental purging/archive: Purging and archiving data often demands precision and rarely experiences it, leading to data anomalies that don’t immediately make their presence felt. Such problems can prove to be difficult to fix.
7. Skills loss: We rarely think of the company’s staff as having much to do with data quality, but in reality conscientious staff often act as a data quality check in some of the applications they use. Such skills are rarely passed on, so when staff leave the data quality they enforced leaves with them.
8. New Data Usage: Data structures are usually just designed for use by one or two applications, but in time, the need arises for some other application to “have at” the data. Commonly, this exposes data structure mismatches, which in turn lead to data errors.
9. Software entropy: This is usually the greatest overall cause of data quality problems. Over time programs change and some program error or other leads to data corruption, or possibly some change is made in an application to a data structure and nobody thinks to make corresponding changes in other applications that will eventually use the data. The “entropy” of the data gradually increases and data quality deteriorates with the passage of time.
10. Data cleansing done poorly: And finally we have data quality itself. This can be done well or badly and if it is done badly it is likely to create new problems as well as fix old ones.
This last point is what the rest of this article will now focus on. Review the above list of sources of data quality problems and you should conclude fairly quickly that data quality will never be a quick fix because the sources of error are so diverse. In my view, data quality needs to be baked into the whole computing environment for it to be effective. But that’s easier said than done.
Assume for a moment that we know there is a serious data quality problem with our Data Warehouse. So we employ a very fast data quality tool with the idea of cleaning the data in situ. There are some very fast data quality products, like for example, Pervasive’s Data Quality Suite, that have been built to be highly parallelized. But even they will not work in no time at all. You are unlikely to be able to halt all use of the Data Warehouse while the data quality software struts its stuff, so you will probably need to break the task into chunks.
Imagine that we do that and we fix all problems (which is unlikely) and we complete the work, then we now have clean data. Well, yes….
But even if the data users are happier than before, and they should be, we haven’t fixed any of the sources of corruption, we’ve merely eliminated some symptoms. Wait a year and the Data Warehouse may well be in as much of a mess as before. The point is that data quality is not one-shot-and-we’re-done. We need to identify and fix all sources of data corruption and, unfortunately, how each is fixed is entirely contextual. It may be lousy programming, it may be lousy testing, it may be DBA error; it may be many things.
If we are going to truly fix the problem, we need to clean all the data across the organization that is corrupt and we will probably need to schedule to do that regularly, because even if we fix every source of data corruption we can identify, we won’t fix everything and anyway, it won’t stop data corruption happening in the future.
The main point here is that a data quality initiative needs to establish a data quality infrastructure. Such an infrastructure needs to be properly designed to identify points in the whole network of corporate systems where data needs to be regularly cleaned, so that the consequences of bad data are avoided. It needs to regularly report on data corruption so that all sources of corruption, new and old, are quickly identified – and it needs to establish a coherent set of data cleansing rules. It is not a trivial undertaking, by any means.
In my next blog I’ll tackle the related area of MDM and data governance, and when I do that, I’ll have much more to say about data cleansing rules.