When we speak of data, most of the time we’re talking about well-behaved and well-defined data that is stored conveniently in databases. We refer to this proudly as “structured data,” and well we might, because unstructured data is far less convenient.
An often-quoted statistic is that “only 20 percent of data is structured.” I’ve never been able to find the source of that statistic, so I suspect it’s a lie based on the Pareto principle (the 80-20 rule), which states that 80 percent of the effects come from 20 percent of the causes. Someone made it up – or perhaps it was vaguely true a few decades ago.
If you take the 400 GB stored on my Mac, almost 100 percent of it is unstructured. On servers the proportion must be lower, but think of all those web sites, email systems, document management systems, log files and raw file systems: unstructured data has to account for way more than 80 percent of the total.
But why is all this data unstructured?
The dirty truth is that databases themselves are no good at storing lumpy data. Whenever some new kind of application emerges, its data is not stored in a well-structured database; it’s stored in some programmer-invented, metadata-deficient file. But why is this? Are all programmers anarchists?
Not at all. Only 80 percent of programmers are anarchists.
Databases came into existence over 40 years ago because of the limitations of file systems. A database was a more effective mechanism for storing data, for many reasons. The main one was that databases deliberately made metadata (data definition data) available, so that many different programs could use the same data store. The situation further improved with the emergence of a standard data access language: SQL. This meant that the programmer no longer needed to think about how data was stored – for applications that used databases.
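To make that concrete, here is a minimal sketch of the self-describing property, using Python’s built-in `sqlite3` as a stand-in for any SQL database (the table and column names are invented for illustration). Any program can discover the schema through the database itself, without knowing how the bytes are laid out on disk:

```python
import sqlite3

# An in-memory database as a stand-in for any SQL store (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (id INTEGER PRIMARY KEY, party TEXT, signed_on TEXT)"
)

# The key point: the database describes its own data. PRAGMA table_info
# returns one row per column: (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(contracts)")]
print(columns)  # [('id', 'INTEGER'), ('party', 'TEXT'), ('signed_on', 'TEXT')]
conn.close()
```

Contrast that with a proprietary file format, where the only “schema” lives in the head of the programmer who invented it.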
Sadly, the IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never tried to provide such a file type, even for their own domain. The independent software vendors (ISVs) that write the majority of software we use were never going to pay license fees to database vendors. Consequently, ISVs continually invented new types of files for the data they stored.
But in time, good Open Source databases became available: MySQL, Firebird, PostgreSQL et al. So the ISVs didn’t use them either – for four reasons:
- For some data, such databases would just get in the way – hampering performance and adding nothing.
- With proprietary file types you can lock the customer in, hold him hostage and torture him if necessary.
- There’s no point in making your data available at the item level in your standard product, because it prevents you from profiting by making your data available at the item level later on with an add-on.
- 80 percent of programmers are anarchists.
Let me just drive this point home by asking “How many different file types are there for graphical files?”
Here’s a list of 42 different file types, 29 of which are bitmap files and 13 of which are vector files.
- Bitmaps: BMP, CD5, ECW, Exif, FITS, GIF, ICNS, ILBM, IMG, JPEG 2000, JPEG/JFIF, JPEG XR, PBM, PCD, PCX, PGC, PGF, PGM, PNG, PNM, PPM, PSD, PSP, RAW, SID, TGA, TIFF, WEBP, XCF.
- Vector graphics: AI, CDR, CGM, EPS, ODG, PDF, PGML, SVG, SWF, VML, WMF/EMF, XAR, XPS.
This is not by any stretch of the imagination an exhaustive list, but it is an exhausting list – if you happen to be writing software to cater for them all.
This ain’t no technological breakdown. Oh no, this is the road to hell
What happened over time with graphics files happened earlier with text files and is happening all over again with video files. The lyrics from Chris Rea’s rock song “The Road to Hell” seem oddly appropriate, but we don’t need to abandon all hope yet.
Nowadays ISVs pay more respect to metadata than they used to. And even when metadata is not exposed, it’s usually possible, for example, to strip text information from a file, because even the most proprietary of ISVs didn’t think to encrypt the text data as well as lock it away in a proprietary structure.
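Stripping text from an opaque file can be as crude as scanning for runs of printable characters – roughly what the Unix `strings` utility does. The sketch below is a minimal, assumption-laden illustration (the “proprietary” blob is made up):

```python
import re

def strip_text(blob: bytes, min_len: int = 4) -> list[str]:
    """Pull runs of printable ASCII out of an opaque binary blob,
    roughly what the Unix `strings` utility does."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, blob)]

# A made-up proprietary file: magic bytes, embedded text, binary junk.
blob = b"\x89PROP\x00\x01Invoice 2024-017\x00\xff\xfeAcme Corp\x00"
print(strip_text(blob))
# ['PROP', 'Invoice 2024-017', 'Acme Corp']
```

Note the false positive: the magic number “PROP” comes out looking like text, which is exactly the kind of noise that makes unstructured data lumpy to work with.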
Right now there’s a good deal of activity in accessing and using text – and blending that with access to structured data. Unfortunately, making sense of text data isn’t that easy. You know where you are with structured data; you know what it means. But you’re not certain where you are with text, or graphics, or video. You cannot easily determine, automatically, what it means.
You can identify and isolate some meaning from it, but by its very nature it is lumpy – because it has no standard structure. So you cannot process it in the same way that you process structured data. That would be fine if important facts were not embedded in this lumpy data. But they are.
Consider a digitized legal contract. Could we ask the following question of any software that was capable of reading it:
What has this contract committed our organization to?
Of course we couldn’t. But actually, that’s the most important information that the document contains. We can provide similar examples for graphical data (Is this diagram corporate IP?) or video data (Do we know what that individual is doing?).
The curse of unstructured data is that it’s really difficult to add structure to it in a way that exposes the full importance of all the data it contains. The joy of unstructured data is that it is possible to derive some important data from it automatically and add it as metadata. We can analyze text and deduce some things about what it means. We can analyze photographs and, by matching, determine who some of the people are.
But in the end the data is still lumpy, and that means sometimes – horror of horrors – we’re going to need human beings to properly understand it.