This article was originally published in Visionen Issue 5/2020, the official magazine of VIS.
Bitemporal Data - SQL on Steroids
When trying to solve problems, computer scientists tend to make up a completely different problem, of which the original problem happens to be a more or less trivial subcase (call it the Spirit of Ueli). This time, my problem was to implement a “Recent Activity” feature for a database application. While doing research I came across a rather old book 1 from the 90s describing an inspiring technique for database modelling incidentally solving the “Recent Activity” question.
To have an illustrative data model I will partly introduce the Stocks data model
2 which I develop to manage my food stock. There are
FoodItems which are of
a certain kind of
Food and stored in a
Location. In a traditional,
nontemporal data model, buying a banana means a SQL
insert on the
FoodItem table referencing the
Food “Banana” and the
“Pantry”. Moving a glass of jam from the pantry to the fridge is a SQL
FoodItem table, changing the foreign key
stored_in of the
to reference the
Location “Fridge”. Finally, eating a delicious bar of
chocolate induces a SQL
delete of the corresponding food item. So far, so
In this model you can answer questions like “What is stored in the fridge?” or “Are there any bagels left?”. If you read carefully you will note they refer to the present state of the food stock, or are so called current queries. They act on the data at the instant of “now”. Questions like “For how long have we run short of chocolate?” or modelling state like “When we will have done the shopping this weekend, we will have 5 new bars of chocolate” are not possible to express in a nontemporal data model. Following the terminology from the book 1 these are statements on the valid time of the data.
To extend our data model from above to incorporate valid time, every row of
data has to store a period of validity in which the data is considered to be
valid. This is easily modelled by adding two columns to each table,
valid_time_end which capture the start and end of
validity of the respective row. This is our first step to extend the data model
to incorporate a new dimension of time. Instead of a nontemporal approach where
any state of the data is a snapshot of reality at one point in time, we now
store all such snapshots in the same database.
So how do you operate on such a data model? Bad news ahead, temporal data will
cause quite a blow up to the SQL code and lead to long sequences of instructions
for even “simple” operations. Let’s walk through the same examples as above.
Inserting the banana into the pantry is almost the same as before. For the
valid_time_start you choose freely when the data entered your pantry.
valid_time_end should be set to “Whenever I desire to eat a banana” which no
SQL data type is able to represent. Instead the pragmatic solution tells to
either set it to
'infinity' (if you have the privilege to develop on
PostgreSQL) or any date in remote future, say
9999-12-31 (this way you create
a new Y2K problem 3. The team maintaining the database in this remote future
will curse you for being that naive. Ignore them and be proud of having created
a system which has lasted for ages and survived the test of time).
So how do I move the glass of jam into the fridge now? The “row” you want to
update is now a chronologically ordered sequence of rows. The data can be in a
state where, the jam was in the pantry for one week, then placed into
the cupboard for another week and only yesterday put onto the table to be opened
today. A generic SQL update operation will have to cut a slice from this
chronological sequence of rows and insert a new row starting now and end at
the end of the slice (or infinity). So for the jam, we would locate the row
stating the jam is on the table. We now terminate this row by setting its
valid_time_end to now and insert a new row with a
now and a
valid_time_end of infinity. If the database held another row
stating “tomorrow, put the jam back into the cupboard” we would now delete this
row, saying “this event will not occur tomorrow”.
Deleting a row (also known as eating) is done by retrieving the
which is valid now and set the
valid_time_end to now. Then any future row of
that item must be deleted.
Summing up we can now cope with both situations I used to show the shortcomings of nontemporal data models before. E.g. the statement about buying chocolate next weekend can be expressed as
insert into FoodItem select <next-saturday> as valid_time_start, 'infinity' as valid_time_end, 5 as of_type, -- 5 is the id of chocolate 1 as stored_in -- 1 is the id of the pantry
The other query is omitted for exercise and brevity.
This is an outline on valid-time data models. I did a lot of simplifications and we didn’t even talk about key constraints (see p.117ff 1 for details). Going back to the motivation, namely showing an activity feed of all the modifications to my data, I now wonder if the valid-time data can serve this purpose. It turns out it cannot. An example is the data we inserted earlier where we planned to purchase chocolate this weekend. Rumours have spread that chocolate cures cancer so brokers have jacked up the chocolate price and you cannot afford it any longer. You thus have to correct the data by deleting the row containing your scheduled chocolate. While the event “Chocolate has been bought” will not occur the event “You cancelled the scheduled chocolate purchase” has indeed occurred but is not visible in the data. You would need both the state of the data before the deletion as well as after the deletion to recognise this event. What if this information was part of the data model as well?
I now introduce the second dimension of time to the table justifying the term bitemporal. The formerly introduced valid time is accompanied by the transaction time which for each row captures the period of time, when the row was considered to be true. This allows us to rewind the valid-time state of the database to any point in the past, enabling us to compare the aforementioned two different states of the data. Thus any row of data will now contain 4 timestamps: two for the valid-time stating the period of validity and another two, stating the period during which we considered this row true. Let’s look at a very small bitemporal table to appreciate its beauty:
The rows are sorted by insertion order and all consider the same
first row states that on 2020-10-01 we knew a food item was stored in the pantry
(ID 1). At that time we had no other plans, so
valid_time_end is infinity. The
second and third rows were part of the same transaction: On 2020-10-02 we agreed
to ourselves to move the item to the fridge (ID 2) but to do so only one day
after, on 2020-10-03. Thus the second row stores almost the same information as
the first, namely that the item originates from the pantry. But the second row
also knows when the item left the pantry. To keep both states of knowledge about
our data we don’t update the first row but instead we update a clone of it. The
third row captures the new information, namely the move to the fridge. Finally,
the fourth row marks our decision on 2020-10-05 to eat the item on 2020-10-06.
Thus the “fridge age” will have ended on 2020-10-06 (
valid_time_end) but this
is known already on 2020-10-05.
You will by now realise the immense amount of information which is compressed
into this data model. The most important difference to the valid-time model
before is that no data is ever deleted from the table. Any row which is no
longer true is terminated in its
tr_time_end. The table thus becomes
an archive of all its prior states weaved into the table itself. To maintain it
two simple rules must be followed:
When inserting rows they carry “now” in their
tr_time_startand ‘infinity’ in their
tr_time_end. Any other values were a lie to your archive.
The only legal modification to a bitemporal row is to set its
tr_time_endto “now” to mark it is now obsolete.
Let’s operate on the table. The insertion is almost the same as before, simply
respect rule 1 from above. Updating a row is more involved. I will only
describe a current update where new information about the state of “now” is
added and believed to be valid forever. This will involve a total of 5 SQL
operations. First I insert the updated row with a valid time starting now
extending to the next row’s
valid_time_end while following rule 1 for
transaction time. Then I will terminate the row which is currently (with respect
to transaction time) true “now” (wrt valid time) by terminating its valid time
at “now”. This opens up the slice for our updated data in the same way as above.
Next we “delete” the row currently true “now” by following rule 2. The last two
operations take care of the future. First we insert all future rows of this
row, but with the new columns' values we want to update. And finally we
“delete” all existing future rows by terminating their transaction period to
“now”. In the book this transaction takes 30 lines of SQL (see p.289f 1. I’m
convinced that the book has a mistake there. If you think so too, don’t hesitate
to contact me).
The last modification I want to show is a current deletion where we decide
that some information is invalid starting from “now” until forever. This only
requires two SQL statements: We insert a new row by copying all values
from the currently valid (wrt valid time) and true (wrt transaction time) row
but setting the
valid_time_end to “now”. After that we “delete” any row which
overlaps the valid time at “now” by setting the
tr_time_end to “now” (also see
Now we know how to keep our bitemporal data model up to date. And still all the nontemporal data which we started with is present in the data. E.g. to get the nontemporal view of the food items “now”, run the following query:
select * from FoodItem where valid_time_start <= current_timestamp and current_timestamp < valid_time_end and tr_time_end = 'infinity';
Line 2 and 3 limit our view to one point in valid time, line 4 gives us the data considered true now.
In the finale I can now show you the query which allowed me to implement the feature showing what activity has occurred on the database, which was the goal of all this mind-bending:
select * from FoodItem i1 left outer join FoodItem i2 on i1.tr_time_start = i2.tr_time_start and i1.id = i2.id where (i1.tr_time_start = min_time or i2.id is not null or i1.tr_time_end = 'infinity') and not (i1.tr_time_start = min_time or (i1.valid_time_end = 'infinity' and i1.tr_time_end = i1.valid_time_end)) -- where min_time was extracted for readability to be: select min(i1.tr_time_start) from FoodItem i3 where i3.id = i1.id
This is a simplified version of the query used in the android client 4. The
result set should contain one row for each event (insert, update, delete) which
occurred on the table. For updates it should contain both data about the old row
as well as the new row. To achieve this, I run a self-join on the
table where two rows match if they refer to the same ID and if they are part
of the same event which is conveniently just the
first part of the
where filter makes sure we only have interesting data:
Either the row should be the first of its ID by checking it has the earliest
transaction time (and is thus an insert), or it is an update containing data for
i2, or it is still true now. The second part removes rows which at some point
in valid time proclaimed eternal validity (which are terminated during an event
but simply duplicate the event also represented by other joined row pairs)
except if it’s the insertion event which should be kept. I don’t claim this is
the most elegant query to come up with, but it has proven in practice.
Now there is still a lot to discover on bitemporal data and I only scratched the surface here. For example it was only during implementing it in Stocks when I realised I can now massively reduce network bandwidth as clients can ask the server to “give me all data changes since date t” instead of obtaining all data on the table.
Of course I can also warmly recommend the book 1 which introduces the concepts very carefully and makes a fool of the SQL-92 standard and its deficiencies all the time. The concepts are applicable today despite their age. It also has interesting info boxes on the concept of time in general (Can you tell what happened on October 12, 1582 in Rome?).
If you have any questions or remarks, don’t hesitate to contact me at firstname.lastname@example.org, especially on the presumable error in the book I mentioned above.