The growth of Big Data technology is unprecedented, as is the associated need for data integration. Indeed, this is more than a trend; it’s an outright revolution in the way we deal with data and how data flows between core business systems and/or data stores. The time is ripe to make a science around how to leverage core business data in better and more strategic ways. It’s also time to rethink your data integration strategy, and make sure you have the right data integration infrastructure in place to support this movement. I have a few ideas for you to consider.
First, let’s back up a bit. What’s new is the data storage technology, such as Hadoop (which is a collection of products), along with the emerging world of cloud computing. I’m a pretty conservative fellow when it comes to leveraging new technology, but if you’re an enterprise that continues to maintain silos of data, then you need to look at the value of Big Data technology along with a sound data integration strategy, and you need to look at it right now. Not acting means you could be missing out on both the operational and strategic value of this technology, and this could perhaps harm your business.
What’s changing is the way we view data. We’re gravitating toward a state where we need to run analytics across very diverse data sets, structured and unstructured. Most traditional BI solutions were designed to operate on relational data and other forms of very structured data. We define the schema first, and then load the data. That’s a clear limitation, but it seemed natural at the time.
These days, user organizations continue to struggle to obtain BI value from the wide range of unstructured data types (text, logs, clickstream, documents, etc.). However, the Hadoop set of technologies is good at making sense of this hodgepodge of unstructured data, and thus allows us to leverage that data for analytics. This includes the ability to leverage data analytics in real time, perhaps embedded into business processes, as well as the ability to perform ad hoc analysis, such as supporting business decisions through BI.
With these newer technologies, data is structured at query or analysis time. Thus, you don’t need to define how the data will be used until you actually need to use it. This means you’ll avoid expensive schema changes to support new data views. You’ll also avoid something worse: the old standby of copying data from store to store just to change its structure, as with reporting databases, and the other ways we end up duplicating data for the sake of analysis.
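To make the idea of structuring data at read time concrete, here is a minimal sketch in Python. It is not how Hadoop itself works internally; it only illustrates the principle. The sample events, the field names, and the `read_with_schema` helper are all hypothetical, invented for this illustration.

```python
import json

# Hypothetical raw clickstream events, stored exactly as they arrived.
# No schema was defined before loading; records may carry different fields.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/pricing", "ms": 340, "referrer": "ad"}',
    '{"user": "u1", "page": "/pricing", "ms": 95}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: project each raw record onto `fields`.

    Fields missing from a record come back as None rather than failing.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two different views over the same stored data, with no reload or copy.
page_views = list(read_with_schema(RAW_EVENTS, ["page", "ms"]))
referrals = [
    row
    for row in read_with_schema(RAW_EVENTS, ["user", "referrer"])
    if row["referrer"] is not None
]
```

The point of the sketch is that the stored data never changes. Each analysis brings its own structure, so a new view costs a new read rather than a schema change or another copy of the data.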
New ways of approaching data analytics or BI come with new ways of dealing with data integration issues. The emerging trends include:
• Focus on dealing with data replication and consistency, not semantic transformation.
• Focus on performance.
• Focus on data governance.
Focus on dealing with data replication and consistency, not semantic transformation, refers to the fact that we now focus on the mere replication of data from data store to data store (say, a transactional database to an HDFS cluster), and not as much on dealing with semantic differences between the stores. This is due to the fact that we can apply structure to the data after it has been moved or copied (as covered above), versus having to set up static structures on the target before it can house data that is available to the BI engines.
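As a rough illustration of replication that checks consistency but does no semantic transformation, consider the sketch below. It makes some loud assumptions: local files stand in for the source export and the target store, the file names and the `replicate`, `checksum`, and `demo` functions are hypothetical, and a SHA-256 comparison is just one simple way to confirm that the copy matches the original.

```python
import hashlib
import tempfile
from pathlib import Path


def replicate(source: Path, target: Path, chunk_size: int = 1 << 20) -> str:
    """Copy bytes unchanged and return the SHA-256 of the bytes written."""
    digest = hashlib.sha256()
    with source.open("rb") as src, target.open("wb") as dst:
        while chunk := src.read(chunk_size):
            dst.write(chunk)  # no parsing, no mapping, no restructuring
            digest.update(chunk)
    return digest.hexdigest()


def checksum(path: Path) -> str:
    """Return the SHA-256 of a file's full contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def demo() -> bool:
    """Replicate a small hypothetical export and verify the replica."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "orders.csv"
        target = Path(tmp) / "orders_replica.csv"
        source.write_text("id,total\n1,9.99\n2,24.50\n")
        written = replicate(source, target)
        # Consistency check: the replica must match what was read and sent.
        return written == checksum(source) == checksum(target)
```

Nothing in the copy step knows or cares what the columns mean. Structure is applied later, at analysis time, which is what keeps the integration layer simple.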
Focus on performance means you account for the probability that data moving from store to store will increase in size and complexity. Thus, the data integration solution you choose needs to deliver speeds that keep up with your data analytics requirements. Data integration performance is more of a design issue than a problem that can be solved by tossing technology at it. I suggest that some planning go into performance, rather than relying on faster networks and more processing power alone.
Focus on data governance means that we keep an eye on both the data and who’s using the data. This means dealing with data governance inside and outside of your data integration technology solution, which is to say with people and processes as well as technology. Those who establish data governance programs before moving into sophisticated data analytics have a much greater chance of delivering value to the business during operations.
I think most people understand that things are changing in the world of data. It’s becoming bigger, less structured, and better able to provide more timely and valuable information. In that same light, data integration technology and processes should be considered strategic to the success of BI, and should provide the ability to derive more meaning from data. We all should rejoice at this prospect.