Distributed Information System (DIS)

DITP and The black triangle

7/11/2009


A Hacker News submission references the "Black Triangle" blog post. I can only back the author up, since I have experienced this many times.

In short, in some programs the visible part is merely a black triangle, while the invisible part may be complex or may have required a lot of effort to achieve. The black triangle is then generally just a simple visual example proving that the underlying system works.

That is the current state of progress of DITP. I'm working to make the black triangle visible. Along the way I'm also writing the protocol specification so that the protocol can be reviewed and implemented by third parties in other languages or libraries.

The black triangle is like the first fruit of a tree that may sometimes take a long time to grow to the point of being able to bear fruit.


Object deserialization handling

2/7/2009


In the last month I rewrote the IDR prototype from scratch and translated the IDR specification document into English. During this process I made a few enhancements to the IDR encoding. I removed an ambiguity in exception decoding in some very unlikely situations. The other change was to integrate the 2008 update of the IEEE 754 specification, which now defines four floating point formats: 2, 4, 8 and 16 bytes. It may take some time until these types reach your desk, but IDR should stick to the standards, so these will be the floating point encodings supported by IDR.

Besides these, there was a much bigger problem left in the object deserialization API. The problem is to determine what to do when the decoder doesn't recognize the class type of a serialized object. The solution I came up with is very satisfying since it matches all the requirements I had. It remains to check its usage convenience with real examples.

The problem

Object deserialization is a process in which the decoder reconstructs the serialized object aggregate. To do so it has to reconstruct each object of the aggregate and restore their pointers to each other. Objects are reconstructed by using object factories, a classic design pattern. An object factory is an object that "knows" how to reconstruct some types of objects.

The decoder thus has a collection of factories to which it delegates the reconstruction of the different types of objects found in the serialized aggregate. But what happens if the decoder can't find an appropriate factory for some type of serialized object? In some use cases this should be considered an error, but in others it might be an acceptable and even desirable situation.

Consider for instance exceptions. In IDR an exception is an object and is handled as such. It is pointless, and even impossible, for a decoder to have a factory for every possible exception in the world. It is enough for the decoder to have a factory for the common exception base classes and, of course, the ones it has to deal with. It should then be enough to reconstruct the object as an instance of the parent class it has a factory for, a process called object slicing.

The worst case is when the decoder may not even slice the object because none of the parent classes are "known" to the decoder. In this case the best the decoder can do is to ignore the object and set all references to it to NULL. We'll call this process object pruning. As with slicing, it may be considered an acceptable and even desirable behavior in some use cases (e.g. optional properties), and an error in others, since the lobotomized data structure may end up too crippled or even invalid.

The problem is thus to define the appropriate behavior of the decoder when slicing or pruning occurs. In some cases it is an error, in others not, and in some cases it depends on what part of the aggregate the slicing or pruning took place in.

The solution

The decision whether it is an error or not is obviously context specific and thus has to be put in the hands of the user. So the problem boils down to determining how the user would be able to select the appropriate behavior.

The solution I came up with was to provide three object deserialization methods.

1. A strict object decoder that would throw an exception and abort object decoding as soon as a missing object factory is detected. With this you get an exact reconstruction or a failure.

2. A lax object decoder that would slice and prune at will and return whatever comes out of it, and nothing else. This object decoder would for instance be used for exceptions.

3. Another lax object decoder, like the previous one, but that would also return feedback on the missing object factories. The feedback on slicing would be an associative index mapping sliced object references to the list of their unrecognized class types. The feedback on pruning would be a list of the different types of pruned objects, with a list of the unrecognized class types and the number of instances pruned.

The latter method would make it possible and easy for the user to determine whether slicing and pruning occurred, what the missing factories are, and to test for specific objects whether slicing took place and to what extent. Since this method gives an easy way to test whether slicing or pruning took place, the strict object decoder may seem unnecessary. The reason for its presence is that it can stop the decoding process as soon as a missing factory is detected, and thus avoid wasting resources when an exact reconstruction is required and no feedback is needed.

I'm very satisfied with this solution because it keeps the API simple with only a small effort on the decoder implementation. What I still need to validate is how convenient it is to use.


DIS development roadmap

11/12/2008


The following figure shows the kernel components of the Distributed Information System, the road map, and how far I am today. The items in black are implemented and operational, and the items in gray still need to be implemented. Progress is going clockwise :).

OID: An OID is to DIS what the URL is to the web. It is a unique, binary encoded and non-reusable reference to a piece of information published in the distributed information system. It was the first tile I designed and implemented. Its simplicity is inversely proportional to the time and effort required to invent it, because I had to explore and compare many different possible and existing solutions.

IDR: It is to DIS what HTML or XML is to the web. IDR is the Information Data Representation used in DIS. It is a stream oriented encoding with support for object serialization and exceptions. The prototype implementation is currently being fully rewritten. It still misses the ability to specify the encoding version, as well as a formalization of data description. The latter is required to be able to display data in a human readable format, or to automatically generate data manipulation functions or containers mapped to different programming languages.

DITP: It is to DIS what HTTP is to the web. It is the protocol used to exchange information or invoke remote actions in DIS. It is very simple, modular and extensible through the use of dynamically configurable data processing tasks. Support for compression, authentication or encryption is thus provided by a kind of plugin. The protocol uses the object oriented model with remote method invocation. The current prototype does not yet support concurrent asynchronous method invocation.

DIS: DIS stands here for Distributed Information Service, not to be confused with Distributed Information System. It is fundamental to DIS, so the name collision is not really a problem. This service combines the properties of DNS and LDAP and would be a new kind of service on the Internet. I can't disclose more on it because it is still in development. A first prototype has been implemented, unfortunately proving the need to support data description.

SEC: This part covers authentication and access control in DIS. It requires a functional DIS service. An interesting feature is that it is designed to scale up, so that a service could cope with millions of different users without having to keep track of millions of accounts and passwords.

IDX: It is a service simply mapping human readable UTF-8 strings to OID references, equivalent to the list of named entries in a directory. Like any other service, its access is controlled by ACLs and it can thus be modified remotely with the appropriate privileges. An index may be huge, with multiple alternate entry points, exactly like the DNS but exclusively as a flat name space. The OID associated with the UTF-8 string is stored in an object, so that polymorphism allows images (icons) and other information to be associated with entries by extension.

DIR: It is a graph of IDX services with one root entry. Services or information published in DIS can then be referenced by a human readable path in the IDX graph relative to the root.



It is an ambitious project but, I am convinced, its added value is worth the effort. I wish I could work full time on this project with the help of some other developers, but this would require funding I don't have access to for now.

An application would help demonstrate the added value of the system. I'm still looking for one with an optimal balance of development effort and success potential.


Debugging DIS

9/8/2008


While developing a prototype of DIS to check and validate it, I'm frequently confronted with bugs.

One reason is that the complexity of the program is comparable to that of a compiler. Individual encoding rules are simple, but they can be used in an infinite set of combinations.


After testing many different debuggers on Linux, my conclusion is that none of them is as good as the one in Visual C++ on Windows. So when I have to develop new code which may require debugging, I always develop it with Visual C++. Once it is validated, I move it to Linux.

The biggest difference is in the capability to explore data structures, STL containers, and other application specific data. If it weren't for Visual C++, I would have completely dropped Windows. It was thus a very smart marketing move of Microsoft to make Visual C++ available for free.

But Visual C++ is not yet perfect, so when not debugging, I prefer working on Linux. One feature I'm really missing in Visual C++ is one provided by Eclipse.

With object oriented programming, one usually stores one class per file. I guess it is to simplify locating the class definition: simply look for a file with the same name as the class. The downside is that we end up with the code spread across many files. But when browsing the code, I often come across a method call and would like to see its implementation.

To do this I have to switch to, or open, the corresponding file and locate the method definition in it. Once I have examined it, I may want to go back to where I was before.

Eclipse has this smart and powerful ability to change an identifier into a hypertext link. One simply moves the pointer over the identifier while pressing the control key. The identifier changes into a hypertext link (underlined blue text) and a click moves you directly to the method implementation.

In Visual C++, you get a context menu when clicking on an identifier, and then you have to locate and click the "go to definition" menu command. Two clicks.


Progress status and cardinality encoding....

5/12/2008


It is time for a new communication duty on my project. It's still making steady progress, but not as fast as I would like. I now use the latest version of libgc. I spent most of my time last month searching for the source of a major memory leak. I finally found out that it was caused by STL containers. I changed the code and now use gc_allocator; even strings had to be changed. Now the client and server run without any memory leak. I thought of changing language (e.g. to D) but I didn't want to cut myself off from the C++ community as potential users. So I had to sort out the problem, and I finally did.

The client service communication model is now finalized and under test. The system is ready to support transmitted data processing (compression, authentication and encryption).

In this note I'll explain the encoding of a key data type I call a cardinality. A cardinality is the number of elements in a sequence. Because of its efficient encoding, I extended its use to other types of information. What makes a cardinality different from a classical unsigned integer is that small values are much more frequent than big values. Consider for instance strings: most strings will be less than 256 bytes in length, but from time to time big strings may show up.

We could thus benefit from an encoding that is compact for small values and possibly bigger for big values. I spent some time investigating all the possible encodings I could find, and the most interesting ones were found in BER and in ICE.

BER's cardinality encoding

BER stands for Basic Encoding Rules and is used in SNMP and X.509 certificate encoding. In this encoding the cardinality value is chopped into 7 bit chunks which are stored in as many bytes, in big endian order. The leading bytes with value 0 are dropped. The most significant bit of every byte is set to one, except in the last byte where it is set to 0, which signals the last byte.

I implemented a BER encoder and decoder a long time ago and, as attractive as this encoding might be, it is not trivial to implement and it requires some bit manipulation on bytes, which makes it suboptimal in terms of performance.

ICE's cardinality encoding

ICE is an inter-object communication protocol which is really worth looking at, and possibly using until DITP is available ;). Check their web site for more information.

ICE was apparently initially inspired by IIOP, which simply doesn't have a cardinality: sequence lengths are encoded as straight 4 byte integers. But ICE encodes class and method names as strings, which are generally short, so the 4 byte encoding overhead was very visible. The designers of ICE's encoding therefore added a small change that significantly increased its efficiency.

If the cardinality value is less than 255, the value is stored in a single byte. Otherwise a byte with the value 255 is written, followed by the 4 byte value encoding the cardinality. Encoding and decoding are trivial and much more efficient than the BER encoding. It is not as compact for values bigger than or equal to 255, but the overhead is insignificant when considering the amount of data that follows (i.e. the string).

IDR's cardinality encoding

IDR extends the idea found in ICE by supporting 8 byte encoding and the intermediate 2 byte integer size. As with ICE, if the value is smaller than 255 the cardinality is encoded as a single byte. If it is less than 65535, it is encoded in a total of three bytes: the first byte holds the value 255 and the cardinality value is stored in the following 2 bytes as a classical unsigned integer. If the value is bigger still, it is followed by a 4 byte value, and so on up to the 8 byte integer.

The encoding is as trivial and efficient as that of ICE, provided we take care to detect and encode small values first. It is more compact for values up to 65535, and then becomes less efficient because big values have a long encoding. But as already pointed out, this overhead is insignificant when considering the amount of data that follows.

Use of cardinality in IDR

The cardinality encoding is so efficient that its use has been extended to other types of information. Here is a short list of them.

- Object references in serialized object aggregates are encoded as cardinalities, because a reference is simply the object's index number in the sequence of serialized objects. The value 0 is reserved for the null reference, so the first serialized object is identified by the reference value 1. Small reference values are expected to be the most frequent.

- Methods are identified by a number instead of a string, as is commonplace in inter-object communication protocols. The encoding is much more compact and efficient to handle and, on the server side, method call dispatching becomes simple and efficient because the identifier can be used as an index into a method pointer table. Encoding the method identifier as a cardinality was an obvious choice, since small values will be the norm.

- A channel can host multiple concurrent client-service connections. These bindings are identified by a unique number, starting at 1 and incrementing for each new binding. We can expect the most frequent binding identifiers to have small values, so the cardinality encoding imposed itself.

Progress may be slow, but the benefit is that taking the time to think through and explore the possible solutions for every choice yields a better thought out product.


Progress status

12/18/2007


Progress on the design and the prototype implementation is going on. I now have a working prototype of the inter-object communication system. This helps me test and refine the design. I also regularly review and update the specification documents.

On DITP, the current point of focus is to find a good way to manage PDU (Protocol Data Unit) processing such as compression, authentication or enciphering. The user must be able to select and set them up in a snap, while keeping the system as versatile and flexible as possible.

On IDR, the current point of focus is a refinement of signed information encoding. A straightforward implementation is to simply append the signature to the signed information. But this annihilates all the benefits of the stream oriented encoding. Besides, an invalid signature or invalid data must be detected as early as possible. A solution has been identified, but fitting it nicely into the current encoding requires some more investigation.

A design process is a difficult task because we have zillions of decisions to make. The more complex the design, the more decisions there are to make, and the more likely we are to make a mistake somewhere. The two heuristics I use to minimize this risk are, first, to keep the design as simple as possible and, second, to minimize the constraints on usage. The former is popular, the latter much less so.


Progress on cryptography

7/20/2007


The low level C++ wrapper class for cryptographic functions is now finalized. I use XySSL as the low level C cryptographic library. XySSL is an open source project of Christophe Devine, a French computer scientist specialized in security. XySSL will support the VIA PadLock cryptographic engine, which is good news since VIA servers are cheap, cool and low power computers.

The signing algorithm is parameterized so that one can easily switch to a stronger model if needed. For now we'll use the deterministic PKCS#1 signature scheme described in RFC 3447 (RSASSA-PKCS1-v1_5) because it is stream friendly. The signature scheme described in IEEE 1363a adds a salt to the hash computation. The salt is some random bytes that are hashed before the information to sign.

The problem with this is that the salt is not available when starting to decode the information. To get it, we would have to put the signature in front of the information. But then it is the signature generation that would not be stream friendly: one would have to first serialize the data into some buffer in order to compute the hash value and encode the signature. This breaks the stream processing model.

It is not clear to me how this salt adds any security to the signature. Please add a comment if you have some hints on this. It seems that picking a stronger hash function with a longer digest, or combining the output of multiple hash functions, would contribute more to security than the salt value.
 


Progress status

7/7/2007


Progress is good on multiple fronts.

- I never managed to make libgc (the C++ garbage collector) work with code compiled in release mode (VC2003). I spent some time debugging it without success. Version 7.0 has just been released, but the problem is still there. So I had to solve it. I finally found the cause and made a quick hack to get my code to work. The author has been notified and I hope the bug will be definitively fixed in the next release.

- In the meantime I also investigated various cryptographic packages to use for the prototype. There are quite a few out there. OpenSSL is the one I'll pick because it best fits my requirements. But it needs a C++ wrapper to make its use as simple and convenient as the other C++ cryptographic packages.

- The signed and multi-signed information data encoding format is now finalized. It was not trivial because the requirements were quite tricky to meet. Its properties are attractive, but this must still be implemented and tested to validate its usability.


Progress status

5/31/2007


Progress is slow because I have a daily job and family duties.

I wasted a few days on the garbage collector, which doesn't work when code is compiled in release mode. The problem is due to the garbage collector and an initialization failure. Since it only shows up in release mode, it is hard to debug.

I am in the process of extending the IDR API to support pre-encoded IDR data insertion and extraction. This will be used to implement a remote IDR data storage service, event dispatching, relays or broadcasting, or simply to optimize the generation of frequently emitted requests or responses. The need to support exceptions in this process requires some attention.


DITP status

5/14/2007


DITP was designed early, together with IDR. Since IDR is now operational and ready for thorough testing, DITP could be implemented and tested.

The DITP core is very simple and easy to implement, and a first prototype is already operational. This gave me the opportunity to test the readability and lightness of the IDR API, which needed some refinements.

This also allowed me to check what is needed to build server and client objects, which of course have to be made as simple as possible.

A critical and time consuming task was adding garbage collector support to C++. I'll discuss the C++ and GC issue in another post. I used libgc v6.8 and yesterday managed to get a running client-server prototype with full reliance on the GC and no memory leaks. Since the client and server may host live objects, they have to support multithreading.

So far, so good! ... as the guy said at each storey while falling from a building.


    Author

    Christophe Meessen is a computer science engineer working in France.

    Any suggestions to make DIS more useful? Tell me by using the contact page.
