Bitemporal Data - SQL on Steroids

2020-11-01 on this blog

This article was originally published in Visionen Issue 5/2020, the official magazine of VIS.

Bitemporal Data - SQL on Steroids

When trying to solve problems, computer scientists tend to make up a completely different problem, of which the original problem happens to be a more or less trivial subcase (call it the Spirit of Ueli). This time, my problem was to implement a "Recent Activity" feature for a database application. While doing research I came across a rather old book ¹ from the 90s describing an inspiring technique for database modelling incidentally solving the "Recent Activity" question.

To have an illustrative data model I will partly introduce the Stocks data model ² which I develop to manage my food stock. There are FoodItems which are of a certain kind of Food and stored in a Location. In a traditional, nontemporal data model, buying a banana means a SQL insert on the FoodItem table referencing the Food "Banana" and the Location, say, "Pantry". Moving a glass of jam from the pantry to the fridge is a SQL update on the FoodItem table, changing the foreign key stored_in of the FoodItem to reference the Location "Fridge". Finally, eating a delicious bar of chocolate induces a SQL delete of the corresponding food item. So far, so DMDB.

In this model you can answer questions like "What is stored in the fridge?" or "Are there any bagels left?". If you read carefully you will note they refer to the present state of the food stock, or are so called current queries. They act on the data at the instant of "now". Questions like "For how long have we run short of chocolate?" or modelling state like "When we will have done the shopping this weekend, we will have 5 new bars of chocolate" are not possible to express in a nontemporal data model. Following the terminology from the book ¹ these are statements on the valid time of the data.

To extend our data model from above to incorporate valid time, every row of data has to store a period of validity in which the data is considered to be valid. This is easily modelled by adding two columns to each table, valid_time_start and valid_time_end which capture the start and end of validity of the respective row. This is our first step to extend the data model to incorporate a new dimension of time. Instead of a nontemporal approach where any state of the data is a snapshot of reality at one point in time, we now store all such snapshots in the same database.

So how do you operate on such a data model? Bad news ahead, temporal data will cause quite a blow up to the SQL code and lead to long sequences of instructions for even "simple" operations. Let's walk through the same examples as above. Inserting the banana into the pantry is almost the same as before. For the valid_time_start you choose freely when the data entered your pantry. valid_time_end should be set to "Whenever I desire to eat a banana" which no SQL data type is able to represent. Instead the pragmatic solution tells to either set it to 'infinity' (if you have the privilege to develop on PostgreSQL) or any date in remote future, say 9999-12-31 (this way you create a new Y2K problem ³. The team maintaining the database in this remote future will curse you for being that naive. Ignore them and be proud of having created a system which has lasted for ages and survived the test of time).

So how do I move the glass of jam into the fridge now? The "row" you want to update is now a chronologically ordered sequence of rows. The data can be in a state where, the jam was in the pantry for one week, then placed into the cupboard for another week and only yesterday put onto the table to be opened today. A generic SQL update operation will have to cut a slice from this chronological sequence of rows and insert a new row starting now and end at the end of the slice (or infinity). So for the jam, we would locate the row stating the jam is on the table. We now terminate this row by setting its valid_time_end to now and insert a new row with a valid_time_start of now and a valid_time_end of infinity. If the database held another row stating "tomorrow, put the jam back into the cupboard" we would now delete this row, saying "this event will not occur tomorrow".

Deleting a row (also known as eating) is done by retrieving the FoodItem row which is valid now and set the valid_time_end to now. Then any future row of that item must be deleted.

Summing up we can now cope with both situations I used to show the shortcomings of nontemporal data models before. E.g. the statement about buying chocolate next weekend can be expressed as

insert into FoodItem
select <next-saturday> as valid_time_start,
      'infinity' as valid_time_end,
      5 as of_type,         -- 5 is the id of chocolate
      1 as stored_in        -- 1 is the id of the pantry

The other query is omitted for exercise and brevity.

This is an outline on valid-time data models. I did a lot of simplifications and we didn't even talk about key constraints (see p.117ff ¹ for details). Going back to the motivation, namely showing an activity feed of all the modifications to my data, I now wonder if the valid-time data can serve this purpose. It turns out it cannot. An example is the data we inserted earlier where we planned to purchase chocolate this weekend. Rumours have spread that chocolate cures cancer so brokers have jacked up the chocolate price and you cannot afford it any longer. You thus have to correct the data by deleting the row containing your scheduled chocolate. While the event "Chocolate has been bought" will not occur the event "You cancelled the scheduled chocolate purchase" has indeed occurred but is not visible in the data. You would need both the state of the data before the deletion as well as after the deletion to recognise this event. What if this information was part of the data model as well?

I now introduce the second dimension of time to the table justifying the term bitemporal. The formerly introduced valid time is accompanied by the transaction time which for each row captures the period of time, when the row was considered to be true. This allows us to rewind the valid-time state of the database to any point in the past, enabling us to compare the aforementioned two different states of the data. Thus any row of data will now contain 4 timestamps: two for the valid-time stating the period of validity and another two, stating the period during which we considered this row true. Let's look at a very small bitemporal table to appreciate its beauty:

`valid_time_start`	`valid_time_end`	`tr_time_start`	`tr_time_end`	`stored_in`
2020-10-01	'infinity'	2020-10-01	2020-10-02	1
2020-10-01	2020-10-03	2020-10-02	'infinity'	1
2020-10-03	'infinity'	2020-10-02	2020-10-05	2
2020-10-03	2020-10-06	2020-10-05	'infinity'	2

The rows are sorted by insertion order and all consider the same FoodItem. The first row states that on 2020-10-01 we knew a food item was stored in the pantry (ID 1). At that time we had no other plans, so valid_time_end is infinity. The second and third rows were part of the same transaction: On 2020-10-02 we agreed to ourselves to move the item to the fridge (ID 2) but to do so only one day after, on 2020-10-03. Thus the second row stores almost the same information as the first, namely that the item originates from the pantry. But the second row also knows when the item left the pantry. To keep both states of knowledge about our data we don't update the first row but instead we update a clone of it. The third row captures the new information, namely the move to the fridge. Finally, the fourth row marks our decision on 2020-10-05 to eat the item on 2020-10-06. Thus the "fridge age" will have ended on 2020-10-06 (valid_time_end) but this is known already on 2020-10-05.

You will by now realise the immense amount of information which is compressed into this data model. The most important difference to the valid-time model before is that no data is ever deleted from the table. Any row which is no longer true is terminated in its tr_time_end. The table thus becomes an archive of all its prior states weaved into the table itself. To maintain it two simple rules must be followed:

When inserting rows they carry "now" in their tr_time_start and 'infinity' in their tr_time_end. Any other values were a lie to your archive.
The only legal modification to a bitemporal row is to set its tr_time_end to "now" to mark it is now obsolete.

Let's operate on the table. The insertion is almost the same as before, simply respect rule 1 from above. Updating a row is more involved. I will only describe a current update where new information about the state of "now" is added and believed to be valid forever. This will involve a total of 5 SQL operations. First I insert the updated row with a valid time starting now extending to the next row's valid_time_end while following rule 1 for transaction time. Then I will terminate the row which is currently (with respect to transaction time) true "now" (wrt valid time) by terminating its valid time at "now". This opens up the slice for our updated data in the same way as above. Next we "delete" the row currently true "now" by following rule 2. The last two operations take care of the future. First we insert all future rows of this row, but with the new columns' values we want to update. And finally we "delete" all existing future rows by terminating their transaction period to "now". In the book this transaction takes 30 lines of SQL (see p.289f ¹. I'm convinced that the book has a mistake there. If you think so too, don't hesitate to contact me).

The last modification I want to show is a current deletion where we decide that some information is invalid starting from "now" until forever. This only requires two SQL statements: We insert a new row by copying all values from the currently valid (wrt valid time) and true (wrt transaction time) row but setting the valid_time_end to "now". After that we "delete" any row which overlaps the valid time at "now" by setting the tr_time_end to "now" (also see p.293f ¹).

Now we know how to keep our bitemporal data model up to date. And still all the nontemporal data which we started with is present in the data. E.g. to get the nontemporal view of the food items "now", run the following query:

select * from FoodItem
where valid_time_start <= current_timestamp
and current_timestamp < valid_time_end
and tr_time_end = 'infinity';

Line 2 and 3 limit our view to one point in valid time, line 4 gives us the data considered true now.

In the finale I can now show you the query which allowed me to implement the feature showing what activity has occurred on the database, which was the goal of all this mind-bending:

select * from FoodItem i1
left outer join FoodItem i2
on i1.tr_time_start = i2.tr_time_start and i1.id = i2.id
where (i1.tr_time_start = min_time
        or i2.id is not null
        or i1.tr_time_end = 'infinity')
and not (i1.tr_time_start = min_time or
                (i1.valid_time_end = 'infinity'
                and i1.tr_time_end = i1.valid_time_end))

-- where min_time was extracted for readability to be:

select min(i1.tr_time_start)
from FoodItem i3
where i3.id = i1.id

This is a simplified version of the query used in the android client ⁴. The result set should contain one row for each event (insert, update, delete) which occurred on the table. For updates it should contain both data about the old row as well as the new row. To achieve this, I run a self-join on the FoodItem table where two rows match if they refer to the same ID and if they are part of the same event which is conveniently just the tr_time_start. The first part of the where filter makes sure we only have interesting data: Either the row should be the first of its ID by checking it has the earliest transaction time (and is thus an insert), or it is an update containing data for i2, or it is still true now. The second part removes rows which at some point in valid time proclaimed eternal validity (which are terminated during an event but simply duplicate the event also represented by other joined row pairs) except if it's the insertion event which should be kept. I don't claim this is the most elegant query to come up with, but it has proven in practice.

Now there is still a lot to discover on bitemporal data and I only scratched the surface here. For example it was only during implementing it in Stocks when I realised I can now massively reduce network bandwidth as clients can ask the server to "give me all data changes since date t" instead of obtaining all data on the table.

Of course I can also warmly recommend the book ¹ which introduces the concepts very carefully and makes a fool of the SQL-92 standard and its deficiencies all the time. The concepts are applicable today despite their age. It also has interesting info boxes on the concept of time in general (Can you tell what happened on October 12, 1582 in Rome?).

If you have any questions or remarks, don't hesitate to contact me at veenj@vis.ethz.ch, especially on the presumable error in the book I mentioned above.

R. T. Snodgrass: Developing Time-Oriented Database Applications in SQL

https://github.com/F1rst-Unicorn/stocks/blob/master/manual/development/spec.md

https://en.wikipedia.org/wiki/Year_2000_problem

⁴

https://github.com/F1rst-Unicorn/stocks/blob/master/android-client/app/src/main/java/de/njsm/stocks/android/db/dao/EventDao.java#L245