Approaching Big Data and Data Integration


I’ve been talking about big data, and the need for big data integration, for some time now.  The concept is rather simple.  If we are moving to a big data platform, typically within a cloud and typically to consolidate enterprise data, then we must have a workable data integration strategy and the right technology.

Let’s consider the business case.  There is a deluge of data within most enterprises, to the point where it’s almost built into the culture of the business that some information is just not accessible or, more likely, exists in very different systems or silos.  There is no way for the information to be presented in a functional business context, whether you’re considering operational data or the data required to make business decisions.

Moreover, there is no single version of the truth when it comes to data within most enterprises.  For instance, hospital systems often store redundant patient data in many places, and thus the data is typically of poor quality and not relied upon.  Or a manufacturing company has no single version of a customer, so orders are often incorrect and money is lost.

The business case here is easy to make, considering that the inefficiencies are obvious.  So, the movement to correct this problem, and the cost required to do so, should be rather easy to justify.

Correcting this problem takes two steps.  The first is migrating from operationally focused silos of data to much larger database management systems, hence the interest in big data.  The second is using data integration technology to facilitate the free flow of information from the existing data sets to the big data system.

The movement to big data systems these days is largely driven by the commoditization of technology and the availability of cheap, massively scalable platforms, such as those provided by cloud computing providers.  Another factor is the ability to manage data at speeds once not considered possible, given the ability of systems such as Hadoop to distribute massive data searching operations across hundreds or thousands of servers within a cluster.  What once took days now takes hours or minutes.  This is why big data is now more operationally viable.


There are some foundational concepts that come into play when considering an integration strategy around big data.  They include:

  • The size of the data sets, which can reach the petabyte range.
  • The data types, which in many cases are unstructured.
  • Complexities around big data interfaces.
  • Considerations around data governance.

The size of the data raises issues not only for migration, and there are plenty, but also for the operational side of data integration, which includes real-time and batch-oriented updates and edits to the data.  In many respects, data integration around big data creates a strategic infrastructure that drives the business value discussed above.

Thus, integration technology should be able to handle larger volumes of data consumption, transformation, and loading.  This means that massive amounts of data should be able to flow through the integration technology without adding noticeable latency to the system.

Also, the data integration technology should be able to deal with different types of data structures.  In many big data systems the data will be unstructured, but it still needs to be managed and integrated with other systems that may only deal with structured data.
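To make the unstructured-to-structured point concrete, here is a minimal Python sketch.  It assumes, purely for illustration, that the unstructured input is free-form text lines and that some of them follow a "date LEVEL message" layout; the pattern, the field names, and the to_record function are hypothetical and not taken from any particular product.  Lines that don't match are kept in a raw field rather than dropped, so a downstream structured system still receives a record with a fixed set of fields.

```python
import re

# Hypothetical layout for illustration: "<YYYY-MM-DD> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def to_record(line: str) -> dict:
    """Turn one free-form text line into a record with a fixed set of fields."""
    text = line.strip()
    match = LINE_PATTERN.match(text)
    if match is None:
        # Unparseable input is preserved, not discarded, so nothing is lost.
        return {"date": None, "level": None, "message": None, "raw": text}
    record = match.groupdict()
    record["raw"] = text
    return record


if __name__ == "__main__":
    for sample in ["2012-05-01 ERROR disk full", "note left by the night shift"]:
        print(to_record(sample))
```

The design choice worth noting is that every record has the same keys regardless of whether parsing succeeded, which is what lets a structured target consume the output without special cases.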

The interfaces may be a challenge as well.  The data integration technology that you leverage with big data systems should be able to communicate through these interfaces reliably and without adding latency.  It is not enough to just provide access to this data; data integration technology should be able to deal with both the complexities of the interfaces and the inevitable issues, for example by leveraging a sound exception handling subsystem.
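One common shape for that kind of exception handling is retry with backoff plus a "dead letter" list for records that never get through.  The Python sketch below is one possible approach under stated assumptions, not a description of any specific product: deliver_all, the send callable, and the max_attempts and base_delay parameters are all hypothetical names introduced here.

```python
import time
from typing import Callable, Iterable, List, Tuple


def deliver_all(
    records: Iterable[str],
    send: Callable[[str], None],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Send each record, retrying failures; return (delivered, dead_letters)."""
    delivered: List[str] = []
    dead_letters: List[Tuple[str, str]] = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                send(record)  # hypothetical call into the target interface
                delivered.append(record)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Give up on this record but keep the pipeline moving.
                    dead_letters.append((record, str(exc)))
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return delivered, dead_letters
```

A failing record ends up in dead_letters with its error message instead of halting the whole flow, so it can be inspected and replayed later.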

Finally, there should be some data governance capabilities offered by the data integration solution for a big data system, including the ability to deal with schema and content changes without causing cascading issues in other integrated systems.  Considering the amount of data and the complexity of that data, these types of subsystems are mandatory.
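As a rough illustration of catching schema changes before they cascade, the Python sketch below compares an incoming record against an expected schema and reports what drifted.  It assumes a schema can be expressed as a simple mapping of field name to Python type; diff_schema and the field names in the example are hypothetical, and real governance tooling would do considerably more.

```python
def diff_schema(expected: dict, record: dict) -> dict:
    """Report fields that were added, went missing, or changed type."""
    expected_names = set(expected)
    record_names = set(record)
    retyped = sorted(
        name
        for name in expected_names & record_names
        # None is treated as "no value" rather than as a type change.
        if record[name] is not None and not isinstance(record[name], expected[name])
    )
    return {
        "added": sorted(record_names - expected_names),
        "missing": sorted(expected_names - record_names),
        "retyped": retyped,
    }


if __name__ == "__main__":
    customer_schema = {"id": int, "name": str}  # hypothetical expected schema
    incoming = {"id": "42", "name": "Ann", "email": "ann@example.com"}
    print(diff_schema(customer_schema, incoming))
```

An integration flow could run a check like this at the boundary and route drifting records for review instead of passing them on to downstream systems unannounced.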

It’s not a question of whether you will move to big data, but when.  I say that with a great deal of forethought and an understanding that this is not just a trend but a huge movement.  The technology is cheap, it works, and it solves a problem that most enterprises and government agencies have.  Moreover, it meshes nicely with our desire to leverage platforms that are delivered on-demand, such as the movement to cloud-delivered infrastructure.

However, this also puts more importance on the need for a sound integration strategy and efficient data integration technology.  You’re bridging the new with the old, as well as extending the value of newer big data systems.

Welcome to Pervasive Software's Data Integration Blog

Log in

Lost your password?

Register For This Site


Join us as we spread the word.