It began with Google and Yahoo!, both of which had lots of data, partly because they were offering search services, and the web, even in its early days, was big data by any reasonable metric. And when I write “reasonable metric” I have to pause for a while, because when you think about it there is no reasonable metric for so-called big data. There are several ways to view this.
1. Moore’s Law Cubed
First off, databases have always had a tendency to grow, simply because we accumulate data as time slips by. A large database in the early 1990s was measured in the tens of megabytes. Seems odd, doesn’t it, given that a Photoshop file can be that big, but back in the day (1992) even mainframes weren’t that powerful or fast. By about 1998 we were measuring databases by the gigabyte. A gigabyte is smaller than about an hour of video, and that doesn’t seem too big either, but back in the day (1998) Unix servers didn’t run at lightning speeds. Then by about 2004 we were building terabyte databases. A terabyte is big, isn’t it? I’ve got a terabyte drive on my Mac and it definitely holds a lot of information, but that drive cost less than $100. Back in the day (2004) you couldn’t get a terabyte drive without coughing up serious dollars. Then by 2010 the petabyte databases started to emerge, and that seems like only the day before yesterday.
Can you see where I’m going with this? While Moore’s Law has been merrily increasing computer power by a factor of roughly 10 every 6 years (doubling every two years compounds to about that), we have been expanding databases by a factor of 1,000 every 6 years. And if we continue at this rate (and I for one believe we will for a while) then we’ll see exabyte databases by 2016.
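The arithmetic behind that projection is easy to check. A minimal sketch, using only the milestone sizes quoted above (gigabyte in 1998, terabyte in 2004, petabyte in 2010) and extrapolating one more 6-year step:

```python
# Database growth model from the text: a factor of 1000 every 6 years,
# versus roughly 10x for compute power over the same span.
data_growth_per_6y = 1000
compute_growth_per_6y = 10

# Milestones quoted above, in bytes.
milestones = {1998: 1e9, 2004: 1e12, 2010: 1e15}

# Extrapolate one 6-year step beyond the 2010 petabyte era.
projected_2016 = milestones[2010] * data_growth_per_6y
print(projected_2016)  # 1e+18 bytes, i.e. an exabyte
```

The gap between the two growth rates (1000x versus 10x per 6 years) is the whole story: data is outrunning compute by a factor of 100 every 6 years.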
2. Application Morphology
If big data means anything at all (and to be honest, I don’t think it does, but I’m willing to pretend), then perhaps it applies to the biggest databases around. But the big databases that some companies are growing right now are not the same databases, in terms of what they contain, as the big databases back in the day.
The big databases of yesteryear were transactional databases – those famed “mission-critical” ones. But further forward in the day, the big databases were the data warehouses that IT people dreamed would hold all the corporation’s data – and which, to be honest, didn’t. And we surely spread that data around, into data marts and even unto spreadsheets. And that was structured data, based on the aggregation of all the jolly transactional databases we had built.
But the terabyte databases weren’t of the same ilk. Moving further forward in the day, we discover such databases composed of the primary transactions of a business, whether those were retail sales of individual items, telephone calls or financial transactions. And they were analytical databases, which we could pepper with statistical algorithms to our heart’s content, discovering data gold mines hiding somewhere among the trillions of bytes.
In the main, the petabyte databases of today don’t contain that kind of data – although, I admit, a few do. Some of these databases are presided over by the new kids on the block, like Facebook, Twitter and LinkedIn. And some belong to not-so-new kids on the block, like Amazon, eBay, Google and Microsoft. And others belong to the big banks, telcos and retailers. What they are recording and analyzing tends to be “event data” rather than transaction data. Just to explain, if you are not familiar with the idea of event data: buying a book on Amazon is a transaction, but clicking on an Amazon web page is an event. And events are interesting if you can deduce things from them, such as which books you are likely to buy in the future.
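The distinction is easy to see in the shape of the records themselves. Here is an illustrative sketch (the field names are hypothetical, not any vendor’s actual schema): a transaction records a completed purchase, while an event records a single interaction along the way.

```python
# A transaction: something of direct business value happened.
transaction = {
    "type": "purchase",
    "user_id": 42,
    "item": "book-1861978769",
    "amount_usd": 24.99,
}

# An event: the user merely did something, here viewing a page.
event = {
    "type": "page_view",
    "user_id": 42,
    "item": "book-1861978769",
    "timestamp": "2012-05-01T10:15:00Z",
}
```

One purchase may be preceded by dozens or hundreds of such events, which is one reason event data dwarfs transaction data – and it is the event stream, not the transaction, that hints at what you are likely to buy next.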
So the applications have changed. The “big data” applications are new applications in the main.
We also have the fact that we’ve lived in the relational database era for well over a decade, and that era was defined in many ways by database engines that were engineered for particular kinds of workload. Such databases were never capable of massive scale-out. They were built to run on tightly coupled clusters of Unix boxes. There is now a growing population of new products that have been built from day one for scale-out. “Come the hour, come the man,” so to speak.
This is especially the case with Hadoop, a hugely parallel open-source platform that many vendors are augmenting with their own software, and which happily runs on grids of commodity Linux servers that can even reside in the cloud. Because we now have such tools, we can process and analyze data – log data, streaming data and event data – that it was previously impossible to do much with.
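The map-and-reduce style that Hadoop popularized can be sketched in a few lines of plain Python – a toy, single-process stand-in for what Hadoop does in parallel across a whole grid (the log format here is made up for illustration):

```python
from collections import Counter

# Toy log lines standing in for clickstream/event data.
log_lines = [
    "2012-05-01 user=42 action=page_view item=book-1861978769",
    "2012-05-01 user=7 action=page_view item=book-1861978769",
    "2012-05-01 user=42 action=add_to_cart item=book-1861978769",
]

# Map step: emit an (action, 1) pair for each log line.
pairs = []
for line in log_lines:
    for field in line.split():
        if field.startswith("action="):
            pairs.append((field.split("=", 1)[1], 1))

# Reduce step: sum the counts per action key. On Hadoop the map and
# reduce steps run on many machines at once, with a shuffle in between;
# here a Counter plays both the shuffle and the sum.
counts = Counter()
for action, n in pairs:
    counts[action] += n

print(counts)  # Counter({'page_view': 2, 'add_to_cart': 1})
```

The point of the pattern is that the map step touches each record independently, so the work parallelizes trivially across however many commodity servers you can afford – which is precisely what the old tightly coupled cluster designs could not do.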
The Era of Big Data Is Yet To Come
If “Moore’s Law Cubed” continues, and I expect it to, we will indeed be running databases measured in exabytes in about 4 years’ time. So what will this data be?
Well, we sailed through the Internet revolution and the data grew accordingly, and we are currently in the midst of the mobile revolution, which has ratcheted up the data we need to gather, if for no other reason than because of capabilities like Facebook and Twitter with hundreds of millions of users. The revolution that follows this will be the embedded chip revolution.
This means sensors of every kind deployed everywhere: in offices and houses, in supply lines whether of oil, water or bandwidth, in transport of every description, on streets and in factories and on everyday objects of all kinds, and even within human bodies. This is event data writ large. There will be trillions of such embedded chips pumping out data night and day and dropping it into huge scale-out databases. If what we have now is “big data”, then this will be “gargantuan data”, “ginormous data”, “humongous data”…
It won’t be over until the really, really, really fat database sings.