Why a log structured database?
My attention was recently caught by the blog post Damn Cool Algorithms: Log Structured Storage. The white paper presenting RethinkDB gives a more exhaustive view of the benefits of this data structure, and of some disadvantages too. The LWN.net article Log-structured file systems: There's one in every SSD covers the use of log structures in SSD file systems.
The performance benefit comes mainly from constraining write operations to the end of the file: reads can be served from memory caches, but writes cannot. With writes at random locations, the disk's write head must move into position (seek), which has a huge latency compared to transistor state changes or data transmission speed.
Reducing disk head movements may thus yield a significant performance increase. Note that this is no longer true with SSDs, but other constraints come into play where a log structured database may still be attractive (evenly distributed and grouped writes).
The Record Index
As you may guess, writing data to the end of the file implies that modified records are copied. The record offset then changes, which requires an update of the index too. If the index, generally tree structured, is also stored in the log database, this results in a cascade of changes which increases the amount of data to write to disk.
This makes a log structured database less attractive, especially if the index is a BTree of record keys. A BTree key index is not very compact and is not trivial to manipulate, especially if keys are of varying length.
I finally found a better solution derived from the white paper presenting The PrimeBase XT Transactional Engine, which describes a log structured table with ACID properties for an RDBMS, and, more recently, from the article Using Uninitialized Memory for Fun and Profit, which describes a simple data structure built on an uninitialized array.
The idea is to use an intermediate record index, which is basically a table of record offsets and sizes. The entry's index in the table is the record identifier and is used as the key to locate the record in the file. A record identifier is associated with a record for its lifetime and may be reused for a new record after the record has been deleted.
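As a minimal sketch of this intermediate record index, assuming an in-memory table and a free list for recycling identifiers (all names are illustrative, not from the original papers):

```python
class RecordIndex:
    """Table mapping record identifiers to (offset, size) pairs."""

    def __init__(self):
        self.entries = []   # entries[record_id] = (offset, size), or None if deleted
        self.free = []      # record identifiers available for reuse

    def allocate(self, offset, size):
        """Assign a record identifier to a newly written record,
        reusing a freed identifier when one is available."""
        if self.free:
            rec_id = self.free.pop()
            self.entries[rec_id] = (offset, size)
        else:
            rec_id = len(self.entries)
            self.entries.append((offset, size))
        return rec_id

    def update(self, rec_id, offset, size):
        """Point an existing identifier at the record's new location
        after the modified record was appended to the log."""
        self.entries[rec_id] = (offset, size)

    def delete(self, rec_id):
        """Drop the record and make its identifier reusable."""
        self.entries[rec_id] = None
        self.free.append(rec_id)

    def locate(self, rec_id):
        """Return the (offset, size) of the record's latest version."""
        return self.entries[rec_id]
```

Because lookup is a plain array access on the identifier, locating a record never requires a key comparison, which is what makes this index so much cheaper than a key BTree.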
Benefits of the record index
The record index is stored as a tree where non-leaf nodes hold the offsets of the lower level nodes. Changing an offset in a leaf node still implies a change in all the nodes up to the root of the tree, but this index is much more compact than a conventional BTree associating each record key with its offset and size. The record identifier doesn't need to be stored in the index because it is implied by its relative position in it.
Another benefit of this intermediate record index is that the record key index now refers to the record identifier, which doesn't change when the record is modified. It is then possible to have multiple indexes to the records, or to use record identifiers inside the user data to support record reference graphs (i.e. linked lists, etc.).
By storing the record identifier along with the record data, the garbage collector or the crash recovery process can easily determine whether a record is valid or not. It simply has to compare the record's offset and size with the ones found in the record index. If they are the same, the record is the latest valid version.
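The validity test can be sketched as follows, assuming the record index behaves like a mapping from record identifier to (offset, size) (the function name is mine, for illustration):

```python
def is_valid(record_index, rec_id, offset, size):
    """Return True if the record copy found at (offset, size) in the log
    is the latest version of rec_id according to the record index.
    A stale copy or a deleted record fails the comparison."""
    entry = record_index.get(rec_id)   # None if the record was deleted
    return entry == (offset, size)
```

A stale copy left behind by an update still carries its identifier, but the index now points elsewhere, so the comparison fails and the copy can be collected.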
Snapshots and recovery points
The dirty pages of the record index need to be saved only at snapshot time. In case of a process or system crash, the database is restored to the last saved snapshot. A snapshot corresponds to a coherent state of the database, and one is saved whenever the user closes the database. Restoring the database to a snapshot's saved state boils down to truncating the file after the last valid record.
If snapshot saving is very frequent and crash recovery very rare, it is possible to use lightweight snapshots. For such a snapshot, only a small record is appended to the record stream, tagging the point in the file where the snapshot occurred. When the database is recovered at some saved snapshot point, the recovery process can continue beyond that point by replaying all the changes up to the last valid lightweight snapshot. The state of the database is then restored to the latest lightweight snapshot, but with slightly more effort than a plain saved snapshot recovery.
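The replay step can be sketched as follows, assuming the log after the last full snapshot is modeled as a list of entries, each either a record write `("put", rec_id, offset, size)` or a lightweight snapshot marker `("snap",)` (this encoding is my own, for illustration):

```python
def recover(record_index, log_entries):
    """Replay log entries written after the last full snapshot,
    committing changes only up to the last snapshot marker.
    Entries after the last marker are discarded as incoherent."""
    pending = []
    for entry in log_entries:
        if entry[0] == "snap":
            # A marker means a coherent state: commit everything replayed so far.
            for _, rec_id, offset, size in pending:
                record_index[rec_id] = (offset, size)
            pending = []
        else:
            pending.append(entry)
    return record_index
```

Buffering changes until a marker is seen ensures the restored state always lands exactly on a lightweight snapshot, never between two of them.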
For the garbage collector (GC), the classical method may be applied, which consists of opening a secondary log file and progressively copying valid records into it in the background while the database is in use. A database backup is as simple as copying the file.
When the lifetime of records varies a lot, it might be better to use generational log files, an algorithm borrowed from memory garbage collectors. The idea is to avoid repeatedly copying long-lived records because of the garbage generated by short-lived or frequently changing records. Records are grouped according to their change frequency into separate log structured databases.
A first log structured database contains all new or changed records. The garbage collector then progresses at the same speed as records are written to the end of the file. Every valid record it finds is copied into a second generation log file. These records have lasted a full GC cycle without a change. Additional generations may be added for even slower changing records.
The use of multiple log files will induce some disk head movements, but this is balanced by saving the effort of repeatedly copying long-lived records.
It is not my intent to implement this any time soon. I just wanted to document the method, which seems to be the canonical way to handle the record index problem, and for which I couldn't find a description on the web.