Log structured database 03/01/2010
The distributed information system (DIS) needs a database to store its information and a key value database is all it needs. Today, Tokyo Cabinet seems the best choice for such type of database. My attention was recently caught by the blog post Damn cool Algorithms: log structured storage. The white paper presenting RethinkDB provides a more exhaustive view of the benefits of this data structure and some disadvantages too. The LWN.net article Log-structured file systems: There's one in every SSD covers the use of log structure in SSD file systems. While surfing the web to get more informations on log structured database, I found the following blog note presenting the experimental YDB log structured database with some interesting benchmark showing that YDB is roughly 5.6 time faster than Tokyo Cabinet and 8 time faster than Berkeley DB with random writes. These numbers justify some deeper investigation. The performance benefit is mainly due to constraining write operations to the end of the file because read access can benefit from memory caches, writes not. With random location writes, the disk writing head needs first to move into position (seek) and this takes time because the movement is performed by a motor, not by blitz fast transistor flip flops or data transmissions. Reducing disk head movements may thus yield a significant performance increase. Note that this won't be true with SSD disks anymore, but other constrains come in play too where a log structured database may still be attractive. As you may guess writing data to the end of the file implies that modified records are copied to the end of the file. The record offset is then modified and the record index needs to be updated accordingly. If the record index is also stored in the log database, it result in cascade of changes and associated writes. This makes log structured database much less attractive, especially if the index is a BTree of record keys. A BTree key index is not very compact and not trivial to manipulate, especially if keys are of varying length. I finally found a much better solution derived from reading the white paper presenting the The PrimeBase XT Transactional Engine describing a log structured table with ACID property for an RDMS table, and more recently the article Using Uninitialized Memory for Fun and Profit describing a simple data structure to use an uninitialized array. The idea is to use an intermediate record index which is basically a table. The entry index is the record identifier and the entry contains the record offset and size. The record identifier is than an arbitrary integer that will remain the same for the lifetime of the record. It is reused for a new record when the record is deleted. The record is stored as a BTree that contains no empty space and no key since the key is the value's relative position in the tree. It is the most compact index and will save a lot of save and effort when saved or loaded from disk. The record key index will then associate the key with the record identifier. Its changes will be limited to key insertions and deletions. A hash table or BTree index can then be used. The other benefit of record identifier is that they are compact and efficient references to other records that can be stored inside user records to build record graphs or aggregations. The garbage collector can efficiently determine if the examined record is a valid record or not by simply comparing the offset stored in the record index. If the offset is different, the record is an old version. Saving the modified records of the index provides a snapshot of the database. A crash recovery needs only to locate the last saved record index and ignore whatever has been written after it. Closing the database implies making a snapshot of the database so that opening it is immediate. A snapshot requires to save all dirty nodes of the record index and may thus imply significant time and storage space. It can be optimized by delaying snapshot saving and store lightweight snapshots instead. A lightweight snapshot is simply a special record saved in the log signaling that this point is a log file truncation point candidate. A crashed database is then restored by loading the last saved record index and replaying all changes performed until the last lightweight snapshot. The record index can use a simple round robin cache with a dirty bit telling if the page needs to be written to disk or not. For the garbage collector (GC) the classical method may be applied which consist in opening a secondary log file and progressively copy valid records into it in background while it is used. A database backup is as simple as copying the file. It is not my intent to implement this shortly. I just wanted to document the method which seems the canonical way to implement the record index and for which I couldn't find a description on the web. My intent would be to implement generational log files, an algorithm used with memory garbage collector. The idea is to avoid that the GC keeps copying constant records indefinitely while fast changing records generate a lot of garbage. To do so, the database would have different record generation log files. The first log file contains newly created or modified records. When the GC finds a valid record in it, it is copied into the second generation log file. These records are likely to change much less frequently or last longer. The same process can be applied with a third and fourth generation log file. The higher the generation, the less garbage is generated. Main write activity is concentrated in the first generation log file. The 8 fallacies of distributed computing 10/17/2009
The following two paragraphs are the introductory paragraphs of the document Fallacies of distributed computing (pdf) by Arnon Rotem-Gal-Oz that presents the 8 fallacies of distributed computed. "Distributed systems already exist for a long tThe software industry has been writing distributed systems for several decades. Two examples include The US Department of Defense ARPANET (which eventually evolved into the Internet) which was established back in 1969 and the SWIFT protocol (used for money transfers) was also established in the same time frame [Britton2001]. Nevertheless, In 1994, Peter Deutsch, a sun fellow at the time, drafted 7 assumptions architects and designers of distributed systems are likely to make, which prove wrong in the long run - resulting in all sorts of troubles and pains for the solution and architects who made the assumptions. In 1997 James Gosling added another such fallacy [JDJ2004]. The assumptions are now collectively known as the "The 8 fallacies of distributed computing" [Gosling]:
While in the process of designing a new distributed information system, it a good idea to check how it position itself regarding these 8 fallacies. The network is reliable DIS uses TCP which was designed to be reliable and robust. Reliable means that data is transmitted uncorrupted to the other end and robust means that it may resist to a certain amount of errors. There is however a limit to the robustness of a TCP connection, and in some conditions connection to a remote service may even not be possible. DITP, the communication protocol of DIS, is of course designed to handle connection failures. Higher level and distributed services will have to take it in account too. Making a distribute information system robust implies to anticipate connection failures at any stage of the communication. For instance, a flock of servers designed to synchronize with each other may suddenly be partitioned in two or more unconnected flocks because of a network failure, and be connected back together later. The latency is zero Latency was a major focus in the design of the DITP protocol because DIS is intended to be used for World Area Network (WAN) applications. DITP reduces latency impact by supporting asynchronous requests. These requests are batched and processes sequentially by the server in the order of emission. If a request in the batch is aborted by an exception, subsequent requests of the batch are ignored. This provides a fundamental functionality to support transactional applications. In addition to this, DIS may also support the ability to send code to be executed by a remote service. This provides the same functionality as JavaScript code embedded in web pages and executed by browsers, allowing to implement powerful and impressive web 2.0 applications. With DIS, remote code execution is taken care by services made available by the server manager if he wants to support them. The services may then process different types of pseudo-codes: JavaScript, Haxe, JVM, Python, ... Many different pseudo-codes services may then coexist and evolve independently of DIS. Such functionality is of course also exposed to security issues. See the secure network fallacy for an insight on how DIS addresses it. Bandwidth is infinite This fallacy is the rational of the Information Data Representation (IDR) design. It uses binary and native data representation. In addition to be very fast and easy to Marshall, it is also very compact. DITP supports also user defined processing of transmitted data so that compression algorithms may be applied to them. DITP is also multiplexing concurrent communication channels in the same connections, allowing to apply different transmitted data processing to each channel. By choosing the channel the user may decide to compress transmitted data or not. The network is secure A distributed system designed for a world wide usage must obviously take security in account. This means securing the transmitted data by mean of authentication and cyphering, as well as authenticating communicating parties and enforce access or action restriction rules. Communication security is provided by the DITP protocol by mean of the user specified transmitted data processing. As data compression, these can also handle data authentication and cyphering. Different authentication and cyphering methods and algorithms can coexist in DIS and may evolve independently of the DITP protocol. Authentication and access control may use conventional passwords methods as well as user identification certificates. But instead of using x509 certificates, DIS uses IDR encoded certificates corresponding to instances of certificate classes. Users may then derive their own certificates with class inheritance. They may extend the information carried in the certificate or combine different certificate types together. An authentication based on password checking or user identity certificate matching doesn't scale well for a world wide distributed system because they need to access a reference database. With distributed services, accessing a remote database introduces latencies and replicating it (i.e. caches) weakens its security by multiplying the number breach points. The authentication mechanism favored in DIS uses member certificates. These certificates are like club or company member access cards. When trying to access a service, the user present the corresponding certificate and the service needs simply to check the certificate validity. With such authentication mechanism, the service can be scattered all over the Internet and remain lightweight as is required for embedded applications (i.e. smart phones, car computers, ...). The authentication domain can also handle billions of members as well and easily as a few ones. Member certificates may be extended to carry specific informations and connection parameters. Topology doesn't change The ability to handle network topology changes initiated the conception of DIS in 1992. It is thus designed from the start to address this issue in a simple, robust and efficient way. It is not a coincidence that the DIS acronym resembles the one of DNS. DIS is a distributed information system as the DNS is a distributed naming system. DIS uses the proven architecture of the DNS and applies it to generic information with additional functionalities like allowing to remotely manage the information. The DNS is known to be a corner stone of the network topology change solution, as will be DIS. There is one administrator As the DNS, DIS supports a distributed administration. Information domain administrator have full liberty and authority in the way they organize and manage their information domain as long as the interface to DIS respects some standard rules. As for the DNS, there will be a central administration that defines the operational rules and control their application. If DIS becomes a broadly adopted system, the central administration will be composed of members elected democratically and coordinated with the Internet governance administration if such structures happens to be created. Transport cost is zero The transport cost is indeed not zero but most of it is distributed and shared by the users. There remains however a residual cost for the central services and administration for which a revenue has to be identified. The DIS system will allow to obtain such a revenue and there is a rational reason why it ought to. Imposing a financial cost to some domains or features of DIS which are limited or artificially limited resources provides a mean to apply a perceptible pressure on its misbehaving users (i.e. spam). The network is homogeneous DITP is designed to support different types of underlying transport connections. The information published in DIS is treated like an opaque byte block and may be of any type as well as its description language. It may be XML with its DTD description, binary with C like description syntax, python pickles or anything else. Of course it will also contain IDR encoded information with its Information Type Description. Conclusion The conclusion is that DIS, DITP and IDR have been designed without falling on any of the common fallacies. This is partly due to the long maturation process of its conception. While this may be considered as a shortcoming, it may also be its strength since it allowed to examine all aspects wisely with time. Here is a (long) blog note I would recommend reading : "Snakes on the web" written by Jackob Kaplan-Moss (September 4, 2009). It is a talk given at PyCon Argentina and PyCon Brazil, 2009. It presents an analysis on the current situation of web edition and desirable future system properties. My impression, and this is not a coincidence, is that DIS matches most of these requirements since it was designed to address the short comings of the actual systems. DITP and The black triangle 07/11/2009
A hacker news submission references the "The black triangle" blog note. I can only backup the author since I have experienced this many time. "A note on distributed computing" (1994) 07/07/2009
What is wrong with HTTP ? 06/25/2009
Here is a document presenting a review on what is good an bad with HTTP. It provides some light on the choices I made for DIS. I couldn't identify the author's name in the text. Sorry. Median value selection algorithm 06/09/2009
At work I'm currently working on tomographic reconstruction algorithms. I have to implement a Bayesian iterative algorithm that requires to select the median value in a set of the 27 float values from a cube of 3x3x3 voxels. This operation must be performed for each voxel and for each iteration. We have to expect 256 million voxels to process for each iteration, but "only" 60 to 100 iterations. When sketching out your business model or marketing strategy, read the following blog note or referenced book. There are easy ways to increase your efficiency. Object deserialization handling 02/07/2009
In the last month I rewrote the IDR prototype from scratch and translated the IDR specification document in English. During this process I made a few enhancements in the IDR encoding. I removed an ambiguity with exceptions decoding in some very unlikely situations. The other change was to integrate the update of IEEE 754 specification in 2008 that now defines four types of floating point values, 2 Bytes, 4 Bytes, 8 Bytes and 16 Bytes. It may take some time until these types reach your desk, but IDR should better stick to the standards. So these will be the floating point encodings supported by IDR. I just found a relevant question on StackOverflow asking for a good general purpose binary protocol. If you are interested in distributed information systems and their protocols, this topic might be a good read. |


RSS Feed