Distributed Information System (DIS)

Log structured database

3/1/2010

1 Comment

 
The distributed information system (DIS) needs a database to store its information, and a simple key value database would do the job. Today, Tokyo Cabinet seems the best choice for this type of database.

Why a log structured database?

My attention was recently caught by the blog post Damn cool Algorithms: log structured storage. The white paper presenting RethinkDB provides a more exhaustive view of the benefits of this data structure, and of some disadvantages too. The LWN.net article Log-structured file systems: There's one in every SSD covers the use of log structures in SSD file systems.

While surfing the web to get more information on log structured databases, I found the following blog note presenting the experimental YDB log structured database, with some interesting benchmarks showing that YDB is roughly 5.6 times faster than Tokyo Cabinet and 8 times faster than Berkeley DB for random writes. These numbers justify some deeper investigation.

The performance benefit comes mainly from constraining write operations to the end of the file: read accesses can benefit from memory caches, but writes cannot. With writes at random locations, the disk head needs to move into position (seek), and this has a huge latency compared to transistor state changes or data transmission speed.

Reducing disk head movements may thus yield a significant performance increase. Note that this won't be true anymore with SSD disks, but other constraints come into play where a log structured database may still be attractive (evenly distributed and grouped writes).

The Record Index

As you may guess, writing data to the end of the file implies that modified records are copied. The record offset then changes, which implies an update of the index too. If the index, generally tree structured, is also stored in the log database, the result is a cascade of changes which increases the amount of data to write to disk.

This makes a log structured database less attractive, especially if the index is a BTree of record keys. A BTree key index is not very compact and not trivial to manipulate, especially if keys are of varying length.

I finally found a better solution, derived from reading the white paper presenting the PrimeBase XT Transactional Engine, which describes a log structured table with ACID properties for an RDBMS, and more recently the article Using Uninitialized Memory for Fun and Profit, which describes a simple data structure built on an uninitialized array.

The idea is to use an intermediate record index, which is basically a table of record offsets and sizes. The entry's index in this table is the record identifier and is used as the key to locate the record in the file. The record identifier is associated with a record for its lifetime and may be reused for a new record after the record has been deleted.
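To make this more concrete, here is a minimal C++ sketch of such a record index. The names and layout are only illustrative, not actual DIS code.

    #include <cstdint>
    #include <vector>

    // One entry of the record index: where the current copy of the record
    // lives in the log file.
    struct RecordEntry {
        uint64_t offset;   // position of the record in the log file
        uint32_t size;     // size of the record in bytes
    };

    // The record identifier is simply the entry's position in this table,
    // so it never needs to be stored in the index itself.
    class RecordIndex {
    public:
        // Allocate an identifier for a new record, reusing freed slots.
        uint32_t allocate(uint64_t offset, uint32_t size) {
            if (!free_ids_.empty()) {
                uint32_t id = free_ids_.back();
                free_ids_.pop_back();
                entries_[id] = {offset, size};
                return id;
            }
            entries_.push_back({offset, size});
            return static_cast<uint32_t>(entries_.size() - 1);
        }

        // A modified record is appended at a new offset: only this entry changes.
        void update(uint32_t id, uint64_t offset, uint32_t size) {
            entries_[id] = {offset, size};
        }

        void remove(uint32_t id) {
            entries_[id] = {};          // mark the slot unused
            free_ids_.push_back(id);    // the identifier may be reused later
        }

        const RecordEntry& lookup(uint32_t id) const { return entries_[id]; }

    private:
        std::vector<RecordEntry> entries_;
        std::vector<uint32_t> free_ids_;
    };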

Benefits of the record index

The record index is stored as a tree where non-leaf nodes hold the offsets of the lower level nodes of the tree. Changing an offset in a leaf node will still imply a change in all the nodes up to the root of the tree, but the index is much more compact than a conventional BTree associating the record key with its offset and size. The record identifier doesn't need to be stored in the index because it is implied by the entry's position in it.

Another benefit of this intermediate record index is that the record key index now refers to the record identifier, which doesn't change when the record is modified. It is then possible to have multiple indexes to the records, or to use the record identifier inside the user data to support record reference graphs (i.e. linked lists, etc.).

By storing the record identifier along with the record data, the garbage collector or the crash recovery process can easily determine whether a record is valid or not. It simply has to compare the record offset and size with the ones found in the record index. If they are the same, the record is the latest valid version.
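Continuing the illustration, the validity check boils down to comparing the location where the copy was found with the location remembered by the record index (names are again illustrative):

    #include <cstdint>

    // Location of a record as remembered by the record index.
    struct IndexEntry { uint64_t offset; uint32_t size; };

    // A record identifier is stored along with the record data in the log, so a
    // scanner that finds a record at some offset can ask the index whether this
    // copy is still the current one.
    bool is_latest_version(const IndexEntry& indexed,
                           uint64_t found_offset, uint32_t found_size) {
        return indexed.offset == found_offset && indexed.size == found_size;
    }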

Snapshots and recovery points

The dirty pages of the record index only need to be saved at snapshot time. In case of a process or system crash, the database should be restored to the last saved snapshot. A snapshot corresponds to a coherent state of the database, and a snapshot is saved any time the user closes the database. Restoring the database to a saved snapshot state boils down to truncating the file after the last valid record.

If snapshot saving is very frequent and crash recovery very rare, it is possible to use lightweight snapshots. For such a snapshot, only a small record is appended to the record stream, tagging the point in the file where the snapshot occurred. When the database is recovered at some saved snapshot point, the recovery process can continue beyond that point by replaying all the changes up to the last valid lightweight snapshot. The state of the database can then be restored to the latest lightweight snapshot, at a slightly bigger cost than a saved snapshot recovery.
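The recovery logic can be sketched as follows, with simple in-memory structures standing in for the real index pages and log file (purely illustrative, not actual DIS code):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical in-memory form of the record index: id -> (offset, size).
    struct Slot { uint64_t offset; uint32_t size; };

    // One entry of the log tail replayed during recovery: either a record
    // change or a lightweight snapshot marker appended to the record stream.
    struct LogEntry {
        bool     snapshot_marker;
        uint32_t record_id;        // meaningful when !snapshot_marker
        uint64_t offset;
        uint32_t size;
    };

    // Recovery sketch: starting from the index saved at the last full snapshot,
    // replay the changes found after it, committing the replayed state each
    // time a lightweight snapshot marker is met. Returns how many tail entries
    // belong to the recovered state; the file can be truncated after that point.
    size_t recover(std::vector<Slot>& index, const std::vector<LogEntry>& tail)
    {
        std::vector<Slot> working = index;   // state being replayed
        size_t kept = 0;
        for (size_t i = 0; i < tail.size(); ++i) {
            const LogEntry& e = tail[i];
            if (e.snapshot_marker) {
                index = working;             // coherent state reached: commit it
                kept = i + 1;
            } else {
                if (e.record_id >= working.size())
                    working.resize(e.record_id + 1, Slot{0, 0});
                working[e.record_id] = Slot{e.offset, e.size};
            }
        }
        return kept;
    }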

Garbage collector

For the garbage collector (GC), the classical method may be applied, which consists in opening a secondary log file and progressively copying valid records into it in the background while the database is in use. A database backup is then as simple as copying the file.

When the lifetime of records varies a lot, it might be better to use generational log files, an algorithm used by memory garbage collectors. The idea is to avoid repeatedly copying long lived records because of the garbage generated by other short lived or frequently changing records. Records are grouped according to their change frequency into separate log structured databases.

A first log structured database contains all new or changed records. The garbage collector then progresses at the same speed as records are written to the end of the file. Every valid record it finds is copied into a second generation log file. These records have lasted a full GC cycle without a change. Additional generation databases may be added for even slower changing records.
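In sketch form, the promotion between generations could look like this (illustrative names; the real collector would of course work on the log files through the record index):

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <vector>

    // Illustrative record copy as seen by the garbage collector.
    struct Record {
        uint32_t id;
        std::vector<uint8_t> data;
        bool still_valid;          // the record index still points at this copy
    };

    // Generational sketch: a record that survives one full GC cycle over the
    // youngest log is promoted to the next generation, so long lived records
    // stop being recopied along with the churn of frequently changing ones.
    void collect_generation(std::deque<Record>& young, std::deque<Record>& older) {
        size_t cycle = young.size();           // one pass over the current content
        for (size_t i = 0; i < cycle; ++i) {
            Record r = std::move(young.front());
            young.pop_front();
            if (r.still_valid)
                older.push_back(std::move(r)); // unchanged for a whole cycle: promote
            // superseded or deleted copies are simply dropped (space reclaimed)
        }
    }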

The use of multiple log files will induce some disk head movements, but this is balanced by saving the effort of repeatedly copying records that never change.

Conclusion

It is not my intent to implement this shortly. I just wanted to document the method, which seems to be the canonical way to handle the record index problem, and for which I couldn't find a description on the web.

The 8 fallacies of distributed computing

10/17/2009

4 Comments

 
The following two paragraphs are the introductory paragraphs of the document Fallacies of distributed computing (pdf) by Arnon Rotem-Gal-Oz, which presents the 8 fallacies of distributed computing.

"Distributed systems already exist for a long tThe software industry has been writing distributed systems for several decades. Two examples include The US Department of Defense ARPANET (which eventually evolved into the Internet) which was established back in 1969 and the SWIFT protocol (used for money transfers) was also established in the same time frame [Britton2001].

Nevertheless, In 1994, Peter Deutsch, a sun fellow at the time, drafted 7 assumptions architects and designers of distributed systems are likely to make, which prove wrong in the long run - resulting in all sorts of troubles and pains for the solution and architects who made the assumptions. In 1997 James Gosling added another such fallacy [JDJ2004]. The assumptions are now collectively known as the "The 8 fallacies of distributed computing" [Gosling]:
  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
..."

While in the process of designing a new distributed information system, it is a good idea to check how it positions itself regarding these 8 fallacies.

The network is reliable

DIS uses TCP, which was designed to be reliable and robust. Reliable means that data is transmitted uncorrupted to the other end, and robust means that it may resist a certain amount of errors. There is however a limit to the robustness of a TCP connection, and in some conditions connecting to a remote service may not even be possible.

DITP, the communication protocol of DIS, is of course designed to handle connection failures. Higher level and distributed services will have to take this into account too.

Making a distributed information system robust implies anticipating connection failures at any stage of the communication. For instance, a flock of servers designed to synchronize with each other may suddenly be partitioned into two or more unconnected flocks because of a network failure, and be connected back together later.

The latency is zero

Latency was a major focus in the design of the DITP protocol because DIS is intended to be used for Wide Area Network (WAN) applications. DITP reduces the impact of latency by supporting asynchronous requests. These requests are batched and processed sequentially by the server in the order of emission. If a request in the batch is aborted by an exception, the subsequent requests of the batch are ignored. This provides a fundamental building block to support transactional applications.
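The server-side semantics of such a batch can be pictured with a small sketch (illustrative only; the real DITP request and reply objects are not shown here):

    #include <exception>
    #include <functional>
    #include <string>
    #include <vector>

    struct Reply { bool ok; std::string text; };

    // Requests of a batch are executed in emission order; an exception aborts
    // the failing request and the remainder of the batch is ignored.
    std::vector<Reply> process_batch(
        const std::vector<std::function<std::string()>>& requests)
    {
        std::vector<Reply> replies;
        for (const auto& request : requests) {
            try {
                replies.push_back({true, request()});
            } catch (const std::exception& e) {
                replies.push_back({false, e.what()});
                break;   // subsequent requests of the batch are ignored
            }
        }
        return replies;
    }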

In addition to this, DIS may also support the ability to send code to be executed by a remote service. This provides the same functionality as JavaScript code embedded in web pages and executed by browsers, which allows implementing powerful and impressive web 2.0 applications.

With DIS, remote code execution is taken care of by services made available by the server manager, if he wants to support them. The services may then process different types of pseudo-code: JavaScript, Haxe, JVM, Python, ... Many different pseudo-code services may then coexist and evolve independently of DIS. Such functionality is of course also exposed to security issues. See the secure network fallacy for an insight into how DIS addresses it.

Bandwidth is infinite

This fallacy is the rationale behind the design of the Information Data Representation (IDR). It uses a binary and native data representation. In addition to being very fast and easy to marshal, it is also very compact.

DITP also supports user defined processing of transmitted data, so that compression algorithms may be applied to it. DITP also multiplexes concurrent communication channels in the same connection, allowing different transmitted data processing to be applied to each channel. By choosing the channel, the user may decide whether or not to compress the transmitted data.

The network is secure

A distributed system designed for world wide usage must obviously take security into account. This means securing the transmitted data by means of authentication and cyphering, as well as authenticating the communicating parties and enforcing access or action restriction rules.

Communication security is provided by the DITP protocol by means of the user specified transmitted data processing. Like data compression, this can also handle data authentication and cyphering. Different authentication and cyphering methods and algorithms can coexist in DIS and may evolve independently of the DITP protocol.

Authentication and access control may use conventional password methods as well as user identification certificates. But instead of using X.509 certificates, DIS uses IDR encoded certificates corresponding to instances of certificate classes. Users may then derive their own certificates with class inheritance. They may extend the information carried in the certificate or combine different certificate types together.

Authentication based on password checking or user identity certificate matching doesn't scale well for a world wide distributed system because it needs access to a reference database. With distributed services, accessing a remote database introduces latencies, and replicating it (i.e. caches) weakens its security by multiplying the number of breach points.

The authentication mechanism favored in DIS uses member certificates. These certificates are like club or company member access cards. When trying to access a service, the user presents the corresponding certificate and the service simply needs to check the certificate's validity.

With such an authentication mechanism, the service can be scattered all over the Internet and remain lightweight, as is required for embedded applications (i.e. smart phones, car computers, ...). The authentication domain can handle billions of members as easily as a few. Member certificates may be extended to carry specific information and connection parameters.

Topology doesn't change

The ability to handle network topology changes initiated the conception of DIS in 1992. It is thus designed from the start to address this issue in a simple, robust and efficient way. It is not a coincidence that the DIS acronym resembles the DNS acronym: DIS is a distributed information system as the DNS is a distributed naming system. DIS uses the proven architecture of the DNS and applies it to generic information, with additional functionality like the ability to remotely manage the information. The DNS is known to be a cornerstone of the solution to network topology changes, as DIS will be.

There is one administrator

Like the DNS, DIS supports distributed administration. Information domain administrators have full liberty and authority in the way they organize and manage their information domain, as long as the interface to DIS respects some standard rules. As for the DNS, there will be a central administration that defines the operational rules and controls their application. If DIS becomes a broadly adopted system, the central administration will be composed of democratically elected members and coordinated with the Internet governance administration, if such a structure happens to be created.

Transport cost is zero

The transport cost is indeed not zero, but most of it is distributed and shared by the users. There remains however a residual cost for the central services and administration, for which a revenue has to be identified. The DIS system will allow such a revenue to be obtained, and there is a rational reason why it ought to.

Imposing a financial cost on some domains or features of DIS which are limited or artificially limited resources provides a means to apply perceptible pressure on misbehaving users (i.e. spam).

The network is homogeneous

DITP is designed to support different types of underlying transport connections. The information published in DIS is treated like an opaque byte block and may be of any type, as may its description language. It may be XML with its DTD description, binary with a C-like description syntax, Python pickles or anything else. Of course it will also contain IDR encoded information with its Information Type Description.

Conclusion

The conclusion is that DIS, DITP and IDR have been designed without falling into any of the common fallacies. This is partly due to the long maturation process of their conception. While this may be considered a shortcoming, it may also be a strength since it allowed all aspects to be examined wisely over time.

DIS development roadmap

11/12/2008

0 Comments

 

The following figure shows the kernel components of the Distributed Information System, the road map and how far along I am today. The items in black are implemented and operational, and the items in gray still need to be implemented. Progress is going clockwise :).

OID An OID is to DIS what the URL is to the web. It is a unique, binary encoded and non reusable reference to a piece of information published in the distributed information system. It was the first tile I designed and implemented. Its simplicity is inversely proportional to the time and effort required to invent it, because I had to explore and compare many different possible and existing solutions.

IDR It is to DIS what HTML or XML is to the web. IDR is the Information Data Representation used in DIS. It is a stream oriented encoding with support for object serialization and exceptions. The prototype implementation is currently being fully rewritten. It still misses the ability to specify the encoding version and a formalization of the data description. The latter is required to be able to display data in a human readable format or to automatically generate data manipulation functions or containers mapped to different programming languages.

DITP It is to DIS what HTTP is to the web. It is the protocol used to exchange information or invoke remote actions in DIS. It is very simple, modular and extensible through the use of dynamically configurable data processing tasks. Support for compression, authentication or encryption is then provided by a kind of plugin. The protocol uses the object oriented model with remote method invocation. The current prototype does not yet support concurrent asynchronous method invocation.

DIS DIS stands here for Distributed Information Service and is not to be confused with Distributed Information System. It is fundamental to DIS, so the confusion is not really a problem. This service combines the properties of DNS and LDAP and would be a new kind of service on the Internet. I can't disclose more on it because it is still in development. A first prototype has been implemented, unfortunately proving the need to support data description.

SEC This part covers authentication and access control in DIS. It requires a functional DIS service. An interesting feature is that it is designed to scale up, so that a service could cope with millions of different users without having to keep track of millions of accounts and passwords.

IDX It is a service simply mapping human readable UTF-8 strings to OID references. It is equivalent to the list of named entries in a directory. Like any other service, its access is controlled by ACLs and it can thus be modified remotely with the appropriate privileges. An index may be huge, with multiple alternate entry points, exactly like the DNS but exclusively as a flat name space. The OID associated with the UTF-8 string is stored in an object, so that polymorphism allows associating images (icons) and other information with entries by extension.

DIR It is a graph of IDX services with one root entry. Services or information published in DIS can then be referenced by a human readable path in the IDX graph relative to the root.



It is an ambitious project but, I am convinced, its added value is worth the effort. I wish I could work full time on this project with the help of some other developers, but this would require funding I don't have access to for now.

An application would help demonstrate the added value of the system. I'm still looking for one with an optimal balance between development effort and success potential.


Debugging DIS

9/8/2008

0 Comments

 

While developing a prototype of DIS to check and validate it, I'm frequently confronted with bugs.

One reason is that the complexity of the program is comparable to that of a compiler. Individual encoding rules are simple, but they can be used in an infinite set of combinations.


After testing many different debuggers on Linux, my conclusion is that none of them is as good as the Visual C++ debugger on Windows. So when I have to develop new code which may require debugging, I always develop it with Visual C++. Once it is validated, I move it to Linux.

The biggest difference is in the capability to explore data structures, STL containers, and other application specific data. If it weren't for Visual C++, I would have completely dropped Windows. It was thus a very smart marketing move of Microsoft to make Visual C++ available for free.

But Visual C++ is not yet perfect. So when not debugging, I prefer working on Linux. One feature I'm really missing in Visual C++ is one provided by Eclipse.

With object oriented programming, one usually stores one class per file. I guess it is to simplify locating the class definition: simply look for a file with the same name as the class. The downside of this is that we end up with the code spread across many files. When browsing the code, I often meet a method call and would like to see its implementation.

To do this I have to switch to, or open, the corresponding file and locate the method definition in it. Once examined, I may want to go back to where I was before.

Eclipse has this smart and powerful ability to change an identifier into a hypertext link. One simply moves the pointer over the identifier while pressing the control key. The identifier changes into a hypertext link (underlined blue text) and a click moves you directly to the method implementation.

In Visual C++, you get a context menu when right-clicking on an identifier, and then you have to locate and click the "Go To Definition" menu command. Two clicks.


Making DITP flexible, versatile and simple

9/4/2008

0 Comments

 

DITP is flexible, versatile and simple because it uses the inter-object communication model, not only for the user's communication with its service, but also to set up and configure the connection data processing (i.e. authentication, encryption, compression, logging, tunneling, ...).


Inter-object communication

By adopting the inter-object communication model, users can create any type of service (remote object) they want. They can also extend or refine their capabilities by using inheritance with polymorphism, and preserve backward compatibility at the same time.

This makes DITP versatile and flexible, but any other inter-object communication protocol could claim the same.

Configuring and setting up the connection

What makes DITP different is that the connection configuration and setup are also performed by using the object oriented model. The different algorithms used are controlled by specific services and the client controls them by invoking the appropriate methods.

This makes the protocol very flexible and versatile since the algorithms can be combined and configured in any way. It is easy to add support for new algorithms, and there is no constraint on the transactions required to configure them.

This design choice basically factorizes and parameterizes the protocol. What is left for DITP to define is how to open a connection, how to exchange messages between clients and services, and how to set up a new client-service binding.

Opening the DITP connection

Opening a DITP connection implies a very simple transaction where the client and the service side exchange a four byte message. If both messages contain the expected value, the connection is considered opened. It can hardly be made simpler.
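As a rough illustration, the whole open transaction could look like this over a connected POSIX socket. The four byte value and its layout are placeholders here, not the actual DITP wire format.

    #include <cstdint>
    #include <cstring>
    #include <unistd.h>   // read(), write() on a connected socket (POSIX)

    // Placeholder magic value for illustration only; the real DITP open
    // message content is not shown in this note.
    static const uint8_t kOpenMagic[4] = {'D', 'I', 'T', 'P'};

    // Client side: send the open message first, then check the server's reply.
    bool open_ditp_client(int fd) {
        if (write(fd, kOpenMagic, 4) != 4) return false;
        uint8_t reply[4];
        if (read(fd, reply, 4) != 4) return false;
        return std::memcmp(reply, kOpenMagic, 4) == 0;   // connection opened
    }

    // Server side: expect the client's four bytes, then answer with its own.
    bool open_ditp_server(int fd) {
        uint8_t msg[4];
        if (read(fd, msg, 4) != 4) return false;
        if (std::memcmp(msg, kOpenMagic, 4) != 0) return false;
        return write(fd, kOpenMagic, 4) == 4;
    }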

When the connection is opened, the client and service implicitly attach a channel control service to the connection. This service has very few methods: one is used to close the connection and another to request the attachment of another service whose type and identity are given as arguments.

That is all it takes to have an operational DITP server. The exchanged messages also have a very simple structure, but they will be described in another note because they have another original feature that allows minimizing latency.

Once the connection is opened, if the client wants to secure the connection by adding authentication or encryption, it requests the attachment of the corresponding services and configures them by calling their methods.


This is why I claim that DITP is versatile, flexible and simple.


Optimizing DITP connection open

8/25/2008

0 Comments

 

The DITP protocol has been designed to minimize the time required to set up an operational connection. This is achieved by a simple method made explicit in the following figure.

In several common protocols, such as SMTP or POP3, the server is expected to send a greeting before the client can respond and proceed by sending its first request.

The time the client has to wait for this greeting message is usually dominated by the round trip time. In a LAN the round trip time is less than a millisecond, but on the Internet it will take many tens of milliseconds, and sometimes hundreds of milliseconds if the server is on another continent.

By simply swapping the orientation of the DITP open transaction, we save one round trip time delay before the client can send its first transaction request. Another advantage of this method is that the server can use a very narrow timeout window for the arrival of the DITP connection setup request. This protects against some types of DoS attacks.

There are two additional things we can observe from the previous figure.

1.- A greeting based protocol like SMTP could be optimized by allowing the client to send its first data without having to wait for the server greeting.

2.- The round trip time due to TCP could be avoided if we could combine it with the DITP connection setup and the first two requests. There is clearly room for improvement in this layer, but it is out of scope for this project.


Time value encoding in DIS

5/26/2008

0 Comments

 

One fundamental question is the encoding of a time value. A time value has two types of use: one as a time stamp and the other as a general time reference.

Requirements

On one hand, a time stamp is required to have a well defined and controlled precision, while the covered time span can be limited (i.e. +/- 200 years). On the other hand, a general time reference needs to be applicable to a very large time span, with fewer constraints on the precision.

Options

For the time reference value one could use a double precision float representation with seconds as units. All arithmetic operations are provided right out of the box and generally hardwired in the processor. Conversion to calendar time is trivial since one simply has to extract the integer part of the value and convert it to a time_t value. From there one can use the common calendar time conversion and formatting functions.
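For instance, a small sketch of the conversion described above:

    #include <cmath>
    #include <cstdio>
    #include <ctime>

    // General time reference sketch: a double holding seconds since the epoch.
    // The integer part converts directly to time_t for calendar formatting.
    void print_time_reference(double seconds_since_epoch) {
        double int_part;
        double fraction = std::modf(seconds_since_epoch, &int_part);
        std::time_t t = static_cast<std::time_t>(int_part);
        char buf[64];
        std::strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S", std::gmtime(&t));
        std::printf("%s +%.6f s\n", buf, fraction);
    }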

For time stamps, using integers seems preferable. But we still have a choice between a split encoding like the timeval structure, a 64-bit fixed point encoding, or an integer with a very small time unit (i.e. nanoseconds).

Discussion

There is not much to discuss about the absolute time. Using a double precision float is an optimal solution. For time stamps however we have three different solutions.

From my experience, I've seen that a split time encoding like the timeval structure is not convenient to use when dealing with time arithmetic. It is even error prone if the user has to program the operations himself.

I also tried to implement a fixed point time encoding class with the decimal point between bits 29 and 30. But this is tricky to get right and some operations are not trivial to implement correctly. This is because fractional computation requires normalization and proper handling of rounding errors.

A 64-bit integer using nanoseconds as time units is apparently the simplest and most straightforward time stamp encoding. Converting to seconds is done with a simple 64-bit integer division, which is also hardwired in most recent processors. Conversion to other time units like microseconds, milliseconds, days or weeks is as accurate and simple. Multiplication or division by decimal scalar values is also trivial.

Another advantage of 64-bit integer nanosecond values is that there is no need for special functions to do the conversions or operations. A programmer can easily figure out what to do and use conventional arithmetic operations.

With a 64-bit signed integer with nanosecond units, the covered time span is over a +/- 292 year range. One can thus afford to keep the current time_t January 1970 epoch and push back the wrapping limit far away.
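A small sketch of what this gives in practice, using nothing but plain integer arithmetic (illustrative names):

    #include <cstdint>
    #include <ctime>

    // Time stamp sketch: a signed 64-bit count of nanoseconds since the
    // January 1970 epoch (covers roughly +/- 292 years).
    using TimeStampNs = int64_t;

    constexpr TimeStampNs kNsPerSecond      = 1000000000LL;
    constexpr TimeStampNs kNsPerMillisecond = 1000000LL;
    constexpr TimeStampNs kNsPerDay         = 86400LL * kNsPerSecond;

    // Plain integer divisions and multiplications cover the conversions.
    inline std::time_t to_time_t(TimeStampNs ts)       { return ts / kNsPerSecond; }
    inline int64_t     to_milliseconds(TimeStampNs ts) { return ts / kNsPerMillisecond; }
    inline TimeStampNs from_time_t(std::time_t t)      { return TimeStampNs(t) * kNsPerSecond; }

    // Example: a delay of two and a half days expressed as a time stamp difference.
    constexpr TimeStampNs kTwoAndHalfDays = 2 * kNsPerDay + kNsPerDay / 2;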

Conclusion

In DIS, we'll thus use a double precision float for the general time reference value and a 64-bit integer with nanosecond units for time stamps and delay encoding.

Note: I've seen the use of a double precision float for time encoding in some Windows operating system APIs. I have yet to see the use of a 64-bit signed integer with nanosecond units. It would make sense as an upgrade of time_t, which is required since we are getting close to the wrapping limit.

Update: It has been brought to my attention that Java stores time values in a signed 64-bit integer with milliseconds as time units relative to January 1, 1970. The covered time span is thus +/- 290 million years. I'll stay with nanosecond units for time stamps.


Logo for DIS

6/5/2007

0 Comments

 

DIS finally has its logo. It was clear at the beginning of the design process that it should put forward the world wide networking feature of DIS, which is key to its working principle. I looked at various networking grids but couldn't find anything original and explicit enough.

It is fortuitous that I saw an instance of the UVG120 grid, which was first described in the article "The Planetary Grid: A New Synthesis" written by William Becker and Dr. Bethe Hagens in 1984. I adopted this grid for the logo with the written permission of Dr. Bethe Hagens.

I find this grid beautiful because of its apparent randomness and its subliminal regularity, resulting from combining the vertices of a dodecahedron and an icosahedron mapped on a sphere. The fact that these volumes embody the divine proportion may contribute to the impression of beauty.

The logo was designed by the graphic designer Johan VINET. He runs the company grafxtory but is better known as lordyoyo, with his blog offering copious tutorials and enlightening information on graphic design. He has a creative and yet professional approach, with a pleasant, benevolent patience that I challenged with my demands and attention to detail.


    Author

    Christophe Meessen is a computer science engineer working in France.

    Any suggestions to make DIS more useful? Tell me by using the contact page.
