In my last blog post I noted that time is an integral aspect of data. We made a distinction between data pertaining to an entity (a person, a product, a company, and so on) and data pertaining to an event (a sale, a delivery, a customer complaint). In both cases time is critical to understanding the data, especially when you collect the data together for the sake of analysis.
It seems rather strange, then, that many databases don’t automatically time stamp the data. It’s true that a time stamp on its own is not enough to pin down the time of something. For accuracy’s sake, the database needs to record both the time when an event took place and the time when it recorded the event. The two can differ, and the difference may be important. Clearly you cannot respond to an event if you do not know it has occurred, and if you discover that events happen a while before your systems learn about them, you may discover that your systems are not capturing the data you need in a timely manner.
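The two-timestamp idea can be sketched in a few lines of Python. The names here (`Event`, `capture`, `capture_lag`) are illustrative, not taken from any particular database; the point is simply that each record carries both the time the event occurred and the time the system recorded it, so the gap between the two is measurable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    """A record carrying two timestamps: when the event occurred,
    and when the system recorded it."""
    payload: str
    event_time: datetime     # when the event actually happened
    recorded_time: datetime  # when our system learned about it


def capture(payload: str, event_time: datetime) -> Event:
    # Stamp the record with the current time at the moment of capture.
    return Event(payload, event_time, datetime.now(timezone.utc))


def capture_lag(e: Event) -> timedelta:
    """The gap between occurrence and capture; a consistently large lag
    suggests the pipeline is not collecting data in a timely manner."""
    return e.recorded_time - e.event_time
```

With only one timestamp, `capture_lag` could not be computed at all, and the question of whether the systems are keeping up would be unanswerable.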
It is worse when we allow data to be updated. Updating is destructive: the new value of one or several items replaces what went before, and the old values are lost. Data ought to be captured as a time series for that reason alone. Then we would have no updates, only the addition of new data. Neither would we have data deletion, just the flagging of data to indicate that it is no longer current and ceased to be current at a specific date and time.
The importance of time in respect of data runs even deeper than that. Consider for example, a supermarket shopping trolley. Naturally the goal of the supermarket is that you exit the premises with the greatest value of goods it can tempt you into buying. Indeed it may be even more complex than that. The supermarket probably wants you to exit the premises with the most profitable (for it) collection of goods as well.
Nevertheless, supermarkets are miles away from achieving such a goal. They tempt you into buying goods by virtue of displaying them well, but they simply do not know your method of shopping. In particular, they do not know the order in which you selected your trolley full of groceries. Right now they have no way of capturing that data, because the data they collect only tells them what is in the trolley, not the time each item entered the trolley, and hence not the order in which you selected the goods.
And, where I live at least, whenever you pay for the goods at the supermarket they ask you, completely mechanically, “Did you find everything all right?” as though, if you hadn’t found something, they would fix the situation. But everyone answers in the affirmative because by the time they reach the checkout, they are no longer shopping, they are checking out. It is a fairly useless question, really, at that point in time.
Time and Time Again
If we consider data from the perspective of it being a time series, then it automatically violates one of the fundamental ideas of relational databases – that data has no inherent order. In this, the relational database theory is quite wrong. Data clearly does have a natural order in time. One consequence of this is that new types of database have come into existence purely to deal with time series. The OLAP databases themselves were spawned by the problem of time, because once you add in the dimension of time, you have cubes not tables.
But let us consider events. A good focus here is the stock market. Stocks are being bought and sold all the time and the price bounces up and down according to supply and demand. The movement of stock prices correlates to some degree with the market itself. When the market as a whole falls, as reflected in some market index, most of the stocks fall. When a sector of the market falls, the tech sector for example, most stocks in that sector fall. Many trading operations watch such movements to discover relationships and take advantage of price moves accordingly. The analysis that is done is almost completely a time-based analysis. They also look for broad trends, and for the points where those trends can be exploited.
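The kind of time-based analysis described above can be illustrated with a short sketch: a rolling correlation between a stock’s returns and an index’s returns over a sliding time window. This is a simplified toy, not any trading desk’s actual method, and the function names are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length return series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def rolling_correlation(stock_returns, index_returns, window):
    """Correlation of a stock with an index over each sliding window
    of observations -- a basic form of the time-based analysis traders
    use to spot stocks that move with (or against) the market."""
    return [pearson(stock_returns[i:i + window],
                    index_returns[i:i + window])
            for i in range(len(stock_returns) - window + 1)]
```

Note that none of this is possible unless the underlying observations are ordered in time, which is exactly the point: the order of the data is part of the data.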
We could look at a business in the same manner. There is a whole series of events happening within the business and they are, in one way or another, related. Why not analyze this data in the same way that stock market data is analyzed? Maybe individual customers, or whole groups of them, are exhibiting some kind of trend. Maybe the costs of some of the inputs to our business are exhibiting some kind of trend. Maybe we can find correlations in the behavior of some of our suppliers. And maybe we can mix that data with data that spans the market our business operates in.
If you’re thinking that this is predictive analytics, then you are right. And if you think that such activity can deliver huge benefits to a business, you are right too.
But right now we are a long way from easily supporting predictive analytics, and the reason is that our databases, to a great degree, do not treat time as a vital dimension of data. In fact, if we took an inventory of all the data we capture, we would discover that in many instances the dimension of time is simply not captured, or else not captured in a convenient manner.
That needs to change.