It is time for another progress report on my project. It is still making steady progress, though not as fast as I would like. I now use the latest version of libgc. I spent most of my time last month searching for the source of a major memory leak. I finally found out that it was caused by STL containers. I changed the code to use gc_allocator; even strings had to be changed. Now the client and server run without any memory leak. I thought of changing language (e.g. to D), but I didn't want to cut myself off from the C++ community as potential users. So I had to sort out the problem, and I finally did.
The client-service communication model is now finalized and under test. The system is ready to support processing of transmitted data (compression, authentication and encryption).
In this note I'll explain the encoding of a key data type I call a cardinality. A cardinality is the number of elements in a sequence. Because of its efficient encoding, I extended its use to other types of information. What makes a cardinality different from a classical unsigned integer is that small values are much more frequent than big values. Consider strings, for instance: most strings are less than 256 bytes long, but from time to time a big string may show up.
We could thus benefit from an encoding that is compact for small values, at the cost of being bigger for big values. I spent some time investigating all the possible encodings I could find, and the most interesting ones were in BER and in ICE.
BER's cardinality encoding
BER stands for Basic Encoding Rules and is used in SNMP and in X.509 certificate encoding. In this encoding, the cardinality value is chopped into 7-bit chunks, which are stored in as many bytes in big-endian order. Leading chunks with value 0 are dropped. The most significant bit of each byte is set to 1, except in the last byte, where it is set to 0 to signal the end of the encoding.
I implemented a BER encoder and decoder a long time ago and, as attractive as this encoding might be, it is not trivial to implement: it requires bit manipulation on every byte, which makes it suboptimal in terms of performance.
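To illustrate, here is a minimal sketch of an encoder and decoder for this base-128 scheme, written from the description above (the function names and the use of std::vector are mine, not part of any existing implementation):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch of the BER base-128 scheme described above: 7-bit chunks,
// most significant chunk first, high bit set on all bytes but the last.
void berEncode(std::vector<uint8_t>& out, uint64_t v) {
    uint8_t chunks[10];          // a 64-bit value needs at most ten 7-bit chunks
    int n = 0;
    do {
        chunks[n++] = v & 0x7F;  // gather chunks, least significant first
        v >>= 7;
    } while (v != 0);
    while (n > 1)
        out.push_back(chunks[--n] | 0x80);  // high chunks carry the marker bit
    out.push_back(chunks[0]);               // last byte: marker bit cleared
}

uint64_t berDecode(const uint8_t*& p, const uint8_t* end) {
    uint64_t v = 0;
    while (p != end) {
        uint8_t b = *p++;
        v = (v << 7) | (b & 0x7F);
        if ((b & 0x80) == 0)     // marker bit cleared: this was the last byte
            return v;
    }
    throw std::runtime_error("truncated cardinality");
}
```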
ICE's cardinality encoding
ICE is an inter-object communication protocol that is really worth looking at, and perhaps even using until DITP is available ;). Check their web site for more information.
ICE was apparently inspired by IIOP, which simply doesn't have a cardinality: sequence lengths are encoded as plain 4-byte integers. But ICE encodes class and method names as strings, which are generally short, so the overhead of the 4-byte encoding was very visible. The designers of ICE's encoding therefore introduced a small change that significantly increased its efficiency.
If the cardinality value is less than 255, it is stored in a single byte. Otherwise a byte with the value 255 is written, followed by the cardinality encoded as a 4-byte value. Encoding and decoding are trivial and much more efficient than the BER encoding. It is not as compact for values greater than or equal to 255, but that overhead is insignificant compared to the amount of data that follows (e.g. the string content).
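A sketch of what this scheme looks like in code (the names are mine, and the byte order of the 4-byte form is an assumption of this sketch, not taken from the ICE specification):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the ICE-style size encoding described above.
void iceEncode(std::vector<uint8_t>& out, uint32_t v) {
    if (v < 255) {
        out.push_back(static_cast<uint8_t>(v));   // common case: a single byte
        return;
    }
    out.push_back(255);                           // escape marker
    for (int shift = 24; shift >= 0; shift -= 8)  // 4-byte value, big-endian here
        out.push_back(static_cast<uint8_t>(v >> shift));
}

uint32_t iceDecode(const uint8_t*& p) {
    uint8_t b = *p++;
    if (b < 255)
        return b;                                 // single-byte form
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v = (v << 8) | *p++;                      // 4-byte form after the marker
    return v;
}
```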
IDR's cardinality encoding
IDR extends the idea found in ICE by adding an 8-byte encoding and the intermediate 2-byte integer size. As with ICE, if the value is smaller than 255, the cardinality is encoded as a single byte. If it is smaller than 65535, it is encoded in a total of three bytes: the first byte holds the value 255 and the cardinality value is stored in the following 2 bytes as a classical unsigned integer. If the value is bigger still, the maximum 2-byte value in turn acts as an escape and is followed by a 4-byte value, and so on up to the 8-byte integer.
The encoding is as trivial and efficient as ICE's, provided we take care to detect and encode small values first. It is more compact for values up to 65535, and becomes less compact beyond that because big values have a longer encoding. But as already pointed out, this overhead is insignificant compared to the amount of data that follows.
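Here is a sketch of the encoder under my reading of the description above, where each maximum value (255, 65535, ...) acts as an escape to the next larger integer size; the big-endian byte order is again an assumption of this sketch:

```cpp
#include <cstdint>
#include <vector>

// Helper: append v in the given number of bytes, big-endian (an assumption).
static void putBytes(std::vector<uint8_t>& out, uint64_t v, int bytes) {
    for (int shift = (bytes - 1) * 8; shift >= 0; shift -= 8)
        out.push_back(static_cast<uint8_t>(v >> shift));
}

// Sketch of the IDR cardinality encoding: small values are tested first,
// and each escape value chains to the next larger integer size.
void idrEncode(std::vector<uint8_t>& out, uint64_t v) {
    if (v < 0xFF)       { putBytes(out, v, 1); return; }  // 1 byte
    putBytes(out, 0xFF, 1);                               // escape to 2 bytes
    if (v < 0xFFFF)     { putBytes(out, v, 2); return; }  // 3 bytes total
    putBytes(out, 0xFFFF, 2);                             // escape to 4 bytes
    if (v < 0xFFFFFFFF) { putBytes(out, v, 4); return; }  // 7 bytes total
    putBytes(out, 0xFFFFFFFF, 4);                         // escape to 8 bytes
    putBytes(out, v, 8);                                  // 15 bytes total
}
```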
Use of cardinality in IDR
The cardinality encoding is so efficient that its use has been extended to other types of information. Here is a short list of them.
- Object references in serialized object aggregates are encoded as cardinalities, because a reference is simply the index of the object in the sequence of serialized objects. The value 0 is reserved for the null reference, so the first serialized object is identified by the reference value 1. Small reference values are expected to be the most frequent.
- Methods are identified by a number instead of a string, as is commonplace with inter-object communication protocols. The encoding is much more compact and efficient to handle and, on the server side, method call dispatching becomes simple and efficient because the identifier can be used as an index into a method pointer table (see the sketch after this list). Encoding the method identifier as a cardinality was an obvious choice since small values will be the norm.
- A channel can host multiple concurrent client-service connections. These bindings are identified by a unique number, starting at 1 and incremented for each new binding. We can expect the most frequently used binding identifiers to have small values, so the cardinality encoding imposed itself here too.
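To make the dispatching point concrete, here is a hypothetical sketch of server-side method dispatch using the decoded identifier as a table index (the types and names are illustrative, not DITP's actual API):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical dispatcher: the decoded method identifier indexes
// directly into a table of handlers, giving O(1) dispatch.
struct ServiceDispatcher {
    std::vector<std::function<void()>> methods;  // index = method identifier

    void dispatch(uint64_t methodId) const {
        if (methodId < methods.size())
            methods[methodId]();  // direct table lookup, no string comparison
        // else: report an unknown-method error to the client
    }
};
```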
Progress may be slow, but the benefit is that taking the time to think through and explore the possible solutions for every choice yields a better thought-out product.