I was pleased to see a Pervasive Software presentation at GigaOM Structure 2010 on “Shattering the Barriers to Processing Big Data.” This was the first time I saw a data integration company drive right at this topic, and there clearly needs to be some thought leadership here.
My head has been in the cloud for years, and I come from a database background. It is clear to me that combining the cloud’s ability to provide massive amounts of on-demand commodity computing power with a database architecture built to exploit that power means data processing on a scale we have never seen, at price points this low.
The notion of “Big Data” using a highly distributed query mechanism is not at all new. We’ve talked about “shared-nothing” database queries since I taught database design at the local college. However, we haven’t had good platforms for this technology, since it requires massive numbers of distributed systems to be effective. And then along comes the cloud.
So, what’s new? First is the ability to manage large data sets more efficiently than traditional relational technology has in the past. The approach is called MapReduce, a software framework brought to us by companies like Google (Nasdaq: GOOG), Yahoo Inc. (Nasdaq: YHOO), and Facebook to support large distributed data sets on clusters of commodity computers. The power of MapReduce is that it can process both structured and unstructured data through a distributed, shared-nothing query-processing system.
In the “Map” step, a master node accepts the request, divides it into sub-problems, and distributes them among any number of worker nodes. In the “Reduce” step, the master node collects the results from the worker nodes and combines them to produce the answer to the original request. The power of this architecture is the simplicity of MapReduce, meaning it’s both easy to understand and easy to implement.
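The split-and-combine idea can be sketched in a few lines of Python. This is a single-process illustration of the pattern, not Hadoop’s actual API; the word-count task and function names are my own for demonstration:

```python
from collections import defaultdict

def map_phase(documents):
    """'Map' step: break the job into sub-problems and emit (key, value) pairs.
    In a real cluster, each document would be handled by a different worker node."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def reduce_phase(pairs):
    """'Reduce' step: combine the workers' intermediate results into the final answer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data in the cloud", "the cloud scales big data"]
result = reduce_phase(map_phase(docs))
print(result["data"])  # the word "data" appears twice across the documents
```

The appeal is exactly what makes the architecture powerful: the programmer writes only the two small functions, and the framework handles distributing them across commodity machines.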
There are open-source implementations of MapReduce, such as Hadoop, which continues to gain popularity as a very efficient approach to managing large data sets. You can find Hadoop in private and public clouds. Those who deal with large amounts of structured and unstructured data for business operations or business intelligence find this technology to be a huge value, allowing them to make sense of the oceans of data that many businesses gather right now.
So, what technology can assist you in leveraging “Big Data”? Pervasive DataRush, for example, is a parallel dataflow platform that eliminates performance bottlenecks in “Big Data” preparation and analytics. What’s unique about this technology is that it fully leverages the parallel processing capabilities of multicore processors and SMP systems to deliver high-end performance, and does so without the complexity and cost of a large, hard-to-manage cluster. Sounds like MapReduce, right? Not really. I don’t consider MapReduce or Hadoop competitors of DataRush; DataRush is designed to work in conjunction with Hadoop, exploiting the multiple cores within a single system. This kind of technology, which leverages the value of MapReduce and Hadoop but does so in a supported product, will drive the movement to “Big Data,” if you ask me.
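DataRush’s internals are proprietary, but the general idea of exploiting multiple cores on a single machine, rather than a cluster, can be illustrated with standard-library Python. The `score` function here is a hypothetical stand-in for per-record analytics work:

```python
from concurrent.futures import ProcessPoolExecutor

def score(record):
    # stand-in for real per-record analytics work (parsing, matching, scoring)
    return record * record

if __name__ == "__main__":
    records = list(range(10))
    # fan the records out across the machine's cores -- no cluster required
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(score, records))
    print(sum(results))  # 285
```

The design point is the same one the product makes: when one box has eight or sixteen cores, a lot of “Big Data” preparation can be parallelized locally before anything touches a distributed cluster.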
So, what does “Big Data” have to do with data integration? Those looking to move to technologies such as MapReduce and Hadoop who don’t have a good data integration strategy to go with their “Big Data” strategy will quickly need to get one. There are many issues to consider around data integration, including:
The integration between the different database models. While most data lives in traditional relational databases, such as Oracle and DB2, Hadoop and other cloud-based data stores are more object-based. Thus, there has to be some translation of data as it flows from one model to the other. Translation between models can be complex and difficult to manage, whether you do it programmatically or through manual processes.
The ability to manage the volumes of data. “Big Data” means large volumes of data will be flowing between systems and data stores, and thus you’ll need technology that can manage and handle the high volumes. In many cases this will be gigabytes of information flowing through data integration technology, which can cause saturation and integrity issues if not managed properly.
The ability to manage quality and integrity. Often overlooked is the need to deal with data quality and integrity at the integration layer, ensuring that the data flowing between “Big Data” and other systems is valid and of good quality. The trouble with translation between models, as discussed above, is that quality and integrity issues are likely to arise, and the data integration technology should be able to manage them as well.
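To make the first issue concrete, here is a minimal sketch of translating a relational row into the denormalized, object-style record a Hadoop job typically consumes. The tables, column names, and `to_document` helper are all hypothetical, invented purely for illustration:

```python
# Hypothetical "customers" table, keyed by primary key
customers = {101: {"id": 101, "name": "Acme Corp"}}

# Hypothetical "orders" row holding a foreign key into customers
order_row = {"order_id": 5, "customer_id": 101, "total": 99.50}

def to_document(row):
    """Flatten a relational row into an object-style document:
    the foreign-key join becomes nesting inside one record."""
    return {
        "order_id": row["order_id"],
        "total": row["total"],
        "customer": customers[row["customer_id"]],
    }

print(to_document(order_row)["customer"]["name"])  # Acme Corp
```

Even this toy case hints at where the complexity comes from: every join, null, and type mismatch in the relational model needs an explicit decision when it becomes a nested document, which is exactly the work a data integration layer should absorb.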
The rise of the cloud presents many opportunities to both commercial enterprises and government, including the ability to manage huge amounts of data and to bring that data into the lens of the business user. However, these opportunities come with problems that need to be addressed, such as data integration. As you move toward the cloud, and toward “Big Data,” I urge you to put these issues on your radar right now.