We can attack this thorny question in many ways. But let’s set the scene by suggesting that BI isn’t really about technology, at least not conceptually. Admittedly, “Business Intelligence” is a lousy term, because it is not exactly self-defining. Nevertheless, pretty much everyone would agree that Business Intelligence is about getting and analyzing information about an organization’s operation in order to manage it better and possibly improve its processes.
As such, business intelligence has been a business activity ever since the snake established his famous apple franchise in the Garden of Eden. The first real BI systems were accounting systems, which analyzed the movement and deployment of cash. Later came the order-to-cash systems, manufacturing systems, and stock management systems.
These were not BI systems per se, of course; they were transactional systems. But none of them could work without some kind of reporting function being part of what they did. You could say that this was the age of silo applications, each of which had BI built in as best it could be.
The Fundamental Data Flow
BI was truly born when companies began building data warehouses with the very definite goal of handling a whole series of reporting requests in a centralized manner. Reporting software, typified by the likes of Crystal Reports, became a user self-service phenomenon, to some degree. And there followed OLAP software, and data mining software, and data visualization software, and dashboards, and so forth. God was in his heaven and all was right with the world.
The common delusion was that a kind of data flow could be established which would feed the wildest ambitions of anyone within the organization who wanted to analyze data. I’ve depicted it in the illustration below.
I could have made the central part of this diagram a lot more complex by including MDM, EII, and other processes and data stores that are sometimes found there. However, what I’d like to draw attention to is the right-hand end of this data flow, which depicts the plethora of user tools available to manipulate data in one way or another. As far as I know there is nothing much wrong with any of these tools. They all have their uses and they will all function admirably if you can just deliver the right data to them. The problem lies in the data delivery.
At the other end of the diagram we depict the self-evident fact that data exists only because it is created and maintained in operational systems somewhere. In general, this is the data that the BI tools need to get at, and all we have to do is move it to the right place, in the right way, on a timely basis to feed the BI tools (a sketch of this flow follows the list below). Sadly, the obstacles to this are formidable. Let’s list them.
- The data is dirty (to some degree).
- Operational systems don’t agree on the definition of some data entities.
- Operational systems are always changing, and their data definitions sometimes change with them.
- New operational systems are added regularly.
- Semi-structured data (content, etc.) is rarely well defined, yet it features in many systems.
- There are multiple data warehouses, even multiple MDM repositories.
- There are guerrilla systems (built at departmental levels or lower using MS Access or even MS Excel) which often contain important data.
- There are external systems where we can’t necessarily get at decent metadata definitions and which also keep changing.
- The whole data preparation process often takes far too long. Indeed, some data (in streaming systems) needs to be captured before it is ever written to a database.
- The amount of data just keeps on growing.
- The demand for data by users seems to increase all the time.
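To make the flow concrete before going on, here is a minimal sketch of the extract-clean-transform-load work that the “piece in the middle” has to perform. It is illustrative only: the source tables, column names, and exchange rate are all hypothetical, and SQLite running in memory stands in for both the operational systems and the warehouse.

```python
import sqlite3

# A minimal sketch of the extract-clean-transform-load flow described
# above. All table and column names here are hypothetical.

src = sqlite3.connect(":memory:")  # stands in for the operational systems
src.executescript("""
    CREATE TABLE orders_eu (id INTEGER, customer TEXT, amount_eur REAL);
    CREATE TABLE orders_us (order_no INTEGER, cust_name TEXT, amt_usd REAL);
    INSERT INTO orders_eu VALUES (1, ' Acme GmbH ', 120.0);
    INSERT INTO orders_us VALUES (7, 'Acme Inc', 95.5);
""")

wh = sqlite3.connect(":memory:")  # stands in for the data warehouse
wh.execute(
    "CREATE TABLE orders (source TEXT, id INTEGER, customer TEXT, amount_usd REAL)"
)

EUR_TO_USD = 1.1  # assumed fixed rate, purely for illustration

def clean(name: str) -> str:
    """'Dirty data' in miniature: strip the stray whitespace."""
    return name.strip()

# Extract from each source, reconcile the differing definitions
# (column names, currencies), and load into one conformed table.
for oid, cust, amt in src.execute("SELECT id, customer, amount_eur FROM orders_eu"):
    wh.execute("INSERT INTO orders VALUES ('eu', ?, ?, ?)",
               (oid, clean(cust), round(amt * EUR_TO_USD, 2)))
for oid, cust, amt in src.execute("SELECT order_no, cust_name, amt_usd FROM orders_us"):
    wh.execute("INSERT INTO orders VALUES ('us', ?, ?, ?)", (oid, clean(cust), amt))

for row in wh.execute("SELECT * FROM orders"):
    print(row)  # what a BI tool at the right-hand end finally sees
```

Even in this toy form, the sketch shows where the pain lives: every reconciliation rule (the renamed columns, the currency conversion, the cleanup of dirty names) is a decision someone has to make and then maintain as the sources change.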
The list above is probably not the full inventory of obstacles, but it will do for the moment. It would be very pleasant if, by magic, we could conjure up a “piece in the middle” between operational systems and BI tools which just worked perfectly: one that cleaned the data and presented it to whoever wanted it, where it was wanted, in a timely manner. But even if we could construct such a modern-day marvel, we would have to introduce some truly onerous procedures in order to keep it working. Every time a new application was introduced, or an existing application was changed, we would have to prevent it from running until we had captured everything we needed (API changes, metadata changes) for our miracle data management system, so that it would not degrade. And that simply isn’t feasible.
Setting aside whether it is even possible to build such a data integration system, the sorry truth is that we wouldn’t be able to keep it current. There are too many changes taking place; far too many applications and far too many groups of users and developers doing their own thing for it to be practical. And that means we shouldn’t aim for such a universal system at all; we should simply try to do what works and what is maintainable. In this situation, perfection is the enemy of the practical, and practical is what matters.
I’m not trying to pour cold water on data integration here, or on any aspect of it. Data integration is of the utmost importance. Certain facts are unavoidable:
- Data needs to be cleaned.
- Data needs to be integrated (in many situations for many reasons).
- Metadata needs to be exposed and made available wherever possible.
- Metadata conflicts need to be reconciled for the sake of data integration (a toy sketch of this follows the list).
- Data transformations need to be carried out for the sake of data integration.
- Data needs to be transported in a timely manner for some applications.
- Applications change. New applications appear.
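As promised, here is a toy sketch of metadata reconciliation. Everything in it is hypothetical (the system names, field names, and canonical definitions); the point is only that each system describes the same entity differently, and something has to detect the disagreements before integration can happen.

```python
# Toy metadata reconciliation. All system names, field names, and
# canonical definitions here are hypothetical.

# The agreed, canonical definition of a customer record.
CANONICAL = {"customer_id": int, "customer_name": str}

# What each operational system actually calls (and types) those fields.
SYSTEM_METADATA = {
    "crm":     {"customer_id": ("CustID", int),  "customer_name": ("FullName", str)},
    "billing": {"customer_id": ("cust_no", str), "customer_name": ("name", str)},
}

def find_conflicts(system: str) -> list[str]:
    """List fields whose local type disagrees with the canonical definition."""
    conflicts = []
    for field, canonical_type in CANONICAL.items():
        local_name, local_type = SYSTEM_METADATA[system][field]
        if local_type is not canonical_type:
            conflicts.append(
                f"{system}.{local_name} is {local_type.__name__}, "
                f"canonical {field} is {canonical_type.__name__}"
            )
    return conflicts

for system in SYSTEM_METADATA:
    for conflict in find_conflicts(system):
        print(conflict)  # e.g. billing.cust_no is str, canonical customer_id is int
```

In real projects the canonical definitions and mappings live in a repository rather than in code, but the underlying job, comparing every system’s local definitions against an agreed standard and resolving the differences, is the same.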
What I don’t subscribe to is the idea that a large MDM project will deliver anything more maintainable than a pragmatic series of data integration projects aimed at specific targets. Indeed, from what I’ve seen, it is likely to deliver far less.
The Bottom Line
In reality, it’s not that BI is too hard for us; it’s that data integration is difficult, very difficult. But most of all, the “universal” solution implicitly promised by some MDM initiatives is simply a mirage. That’s the bit that’s too hard for us.