NoSQL became a buzzword in the last 12 months. Why?
The short explanation is that a number of new databases don’t depend entirely on SQL (we can think of these as Not Only SQL databases), and some don’t depend on SQL at all (we can think of these as absolutely No SQL databases). Some of these products are gaining traction (examples include Aster Data, MongoDB, Apache Cassandra and MarkLogic). These are database products that IT users are happy to buy into and that are distinctly different from the typical relational databases we have grown accustomed to.
Some of this has been a consequence of significant changes to computer hardware.
Servers are now multicore, which means that each CPU chip carries more than one processor (or core). The trend to multicore chips began when heat output made it impractical to keep increasing the clock speed of chips. The viable clock rate topped out at about 3 gigahertz for desktop devices and a little higher – in the 4-5 gigahertz range – for servers. Servers have more efficient cooling systems than desktops, so higher clock rates are possible.
So the chip manufacturers started adding extra cores to their CPU chips, continuing to improve raw chip performance by other means. This had the effect of turning a single server with a single CPU chip into something akin to an SMP server (a server built with multiple CPUs). That in turn had the effect of increasing the raw power of a genuine SMP server. Low-cost SMP servers typically have 4 CPUs, and if each has, say, 6 cores, then that gives you 24 processors. That’s more like a grid of computers than an SMP server, and if you assemble several such servers, that’s even more like a computer grid.
Traditional relational databases were engineered for SMP configurations, and they are, to some degree, being superseded by shared-nothing scale-out architectures that are fundamentally grid-oriented.
The Other Hardware Components
It’s now possible to configure hundreds of gigabytes of memory on powerfully configured servers, which means that much more data can be held in memory. This makes it possible to pin fairly large database tables in memory, giving a welcome kick to performance for some applications. This in turn has emphasized the importance of data compression as a database performance technique.
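To see why compression and in-memory tables go together, here is a minimal sketch of dictionary encoding, one common way to shrink a column that holds few distinct values. It is an illustration only, not how any particular product does it; the column contents and the function names are made up for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup list."""
    codes = {}      # value -> integer code
    lookup = []     # integer code -> value
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(lookup)
            lookup.append(v)
        encoded.append(codes[v])
    return lookup, encoded


def dictionary_decode(lookup, encoded):
    """Turn the integer codes back into the original values."""
    return [lookup[c] for c in encoded]


# Hypothetical column: a "country" field with only three distinct values.
column = ["Germany", "France", "Germany", "Germany", "France", "Spain"] * 1000
lookup, encoded = dictionary_encode(column)

raw_chars = sum(len(v) for v in column)
# bytes() works here only because every code is below 256.
packed_bytes = len(bytes(encoded)) + sum(len(v) for v in lookup)
print(raw_chars, packed_bytes)
```

The smaller the encoded column, the more of the table fits in memory, which is the point the paragraph above is making.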
Additionally, Solid State Disks (SSDs) are now coming to the fore, so many of the software techniques and tuning parameters associated with spinning disk that are found in relational databases are becoming unimportant. Even with spinning disk, random reads are too slow and can be largely eliminated. Instead, a database can either read serially from disk or, where random reads are frequent, pin the table in memory. That in turn means that data is best organized to suit serial reads of disks – i.e. tables need to be partitioned across multiple disks and multiple I/O channels.
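The partitioning idea can be sketched in a few lines. In this illustration, in-memory lists stand in for disks, rows are dealt out round-robin, and each partition is scanned from start to finish by its own worker. The function names and the round-robin scheme are assumptions chosen for the example; real products use their own partitioning rules, and Python threads here show the structure rather than real I/O parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, num_partitions):
    """Deal rows out round-robin so each partition holds a similar share."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def scan_partition(partition, predicate):
    """Read one partition front to back (a serial scan), keeping matches."""
    return [row for row in partition if predicate(row)]


def parallel_scan(partitions, predicate):
    """Scan every partition concurrently, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(lambda p: scan_partition(p, predicate), partitions)
    matches = []
    for part in results:
        matches.extend(part)
    return matches


# Hypothetical table of 12 rows spread over 4 "disks".
rows = [{"id": i, "amount": i * 10} for i in range(12)]
parts = partition_rows(rows, 4)
big_orders = parallel_scan(parts, lambda r: r["amount"] >= 100)
print(sorted(r["id"] for r in big_orders))
```

Note that the merged result comes back in partition order rather than original row order, which is why the usage line sorts it.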
Convulsions In The Data Landscape
This hardware evolution engendered the recent clutch of column-store databases (ParAccel, Vertica, 1010data, etc.) that better suit modern servers if you need to process large amounts of data. However, the kind of data that many organizations need to process has changed too – to be clear, I mean new kinds of data rather than the traditional well-structured data.
On one hand, unstructured data has come to the fore. This is particularly the case with social networks and large web sites, whose data always has a significant textual element. Relational databases are ill-suited to unstructured data of this type, especially when it is nested within hierarchical structures like documents or web pages. On the other hand we have the relatively new phenomenon of “machine generated data” – data that is produced directly from log files or telephone systems or sensors in industrial environments or RFID tags. This data is well structured, but it is voluminous and in many cases consists of one big table rather than a complex set of related entities.
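To make the “one big table” shape concrete, and to show why a column store suits it, here is a small sketch. The sensor readings, field names and helper functions are all invented for illustration; this is not the storage format of any product named above.

```python
# Hypothetical machine-generated readings: one wide table, no related entities.
rows = [
    {"sensor": "s1", "ts": 1, "temp": 20.0, "status": "ok"},
    {"sensor": "s2", "ts": 1, "temp": 22.0, "status": "ok"},
    {"sensor": "s1", "ts": 2, "temp": 24.0, "status": "warn"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_row_store(rows, field):
    """Row layout: every whole row is visited just to read one field."""
    return sum(row[field] for row in rows) / len(rows)


def average_column_store(columns, field):
    """Column layout: only the single column the query needs is scanned."""
    values = columns[field]
    return sum(values) / len(values)


columns = to_columns(rows)
print(average_row_store(rows, "temp"), average_column_store(columns, "temp"))
```

Both functions give the same answer. The difference is how much data has to be touched to get it, which matters when the table runs to billions of rows.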
New Kinds Of Database
There is variety in the new database types, which we can classify in the following way:
- Column-store products: We have already mentioned these.
- Hadoop-based products: HBase is an open-source database for the Hadoop environment, and there is also Aster Data (recently acquired by Teradata), which provides its own version of MapReduce. MapReduce is a parallel framework with which programmers can select and analyze very large volumes of data quickly. The query mechanism here is not SQL, so these products qualify as NoSQL databases.
- Big Table products: Some of these are just key-value stores, in the sense that the database holds a single table and each row in the table includes a single key. Such products are usually built for parallelism. Google invented Bigtable, and access to Google’s version is available through the Google App Engine.
- Document stores: You could think of these as similar to object databases in that they are good at storing and retrieving objects or items within objects (a document is an object of a kind). Most of these (MarkLogic and MongoDB are examples) are schema-less in the sense that they do not provide the kind of database schema that a relational database does. They are also truly NoSQL in that they don’t use SQL to access data – although they may use XML or a simple data template in order to get at the data lodged within the “document.”
- Graph Databases: These are adept at storing and querying data that includes complex relationships (e.g. who knows who, and what is their network of contacts?). Such relationships are best represented in the form of a graph, and SQL is not required or used.
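For readers who have not met MapReduce, the non-SQL query style mentioned in the list above can be sketched in a single process. This is only an illustration of the map, group and reduce steps; it is not the Hadoop or Aster Data API, and a real framework would spread the work across many machines. The log lines and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share a key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def status_mapper(line):
    """Emit (status_code, 1) for one hypothetical web log line."""
    yield line.split()[-1], 1


def count_reducer(key, values):
    """Add up the 1s emitted for a given status code."""
    return sum(values)


log_lines = ["GET /home 200", "GET /missing 404", "POST /login 200"]
counts = map_reduce(log_lines, status_mapper, count_reducer)
print(counts)
```

The programmer supplies only the mapper and the reducer. Because each record is mapped independently and each key is reduced independently, the framework is free to run those steps in parallel, which is where the speed on very large volumes comes from.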
NoSQL databases are here to stay, because they scale and because they all provide capabilities that the legacy relational databases either don’t provide or don’t provide well.
We have a rash of new products, but we also seem to be experiencing the resurrection of some “old” products, and with good reason. None of the data processing activities here are new. So there’s no reason why a well-established but niche object database product, like Versant, cannot claim to be a “NoSQL document store” – it does nowadays, and it can certainly be used as one. Similarly, there’s no reason why the well-seasoned Btrieve-based Pervasive PSQL cannot legitimately claim to be a “NoSQL key-value store”, because it can happily play that role.
The Bottom Line
The shift in hardware technology, along with new kinds of data to process, has brought us “back to the future” – with a little bit of parallelism thrown in. If the current generation of programmers forsakes SQL, or at least ceases to regard it so highly, then we will probably see the resurrection of older niche products as well as this plethora of new ones.