Distributed Information System (DIS)
  • Home
  • The blog
  • Contact

Numbering schema yielding identical lexical and numerical ordering

2/4/2012

2 Comments

 
Picture
It may be desirable in some situation to be able to assign a numerical reference (integer) to a resource with the particular property that the string  representation of the reference preserves the numerical ordering. This blog post presents a numbering method that has this property. The proposed numbering schema achieves this goal and avoids adding zero's or spaces in front of the numbers, thus keeping the strings short. The price to pay is that there will be gaps between the numbering sequence. The numbers in these gaps are invalid numbers in this numbering schema and may be easily recognized and used for error detection.

The problem: You probably experienced that sorting a list of strings representing the integer sequence "1", "2", "3", ..., "10", "11", ... "20", "21", ... yields the weird result "1", "10", "11", ... "2", "20", "21", ... "3", ... This shows up, for instance, when naming files by numbers. We get this result because strings are sorted in lexicographical order which means they are ordered by digit value, one by one from left to right. So in a lexicographical order, "10" is smaller than "2" which is the opposite of the numerical order.

In the situations where this is an unacceptable nuisance we have a set of solutions we could pick one from.

Use a specially crafted sorting algorithm able to detect that it deals with numbers in ASCII representation instead of text strings. In some context, changing the sorting algorithm is not possible (i.e. file names). 

Another possibility is to add some zero's or spaces in front of the number in its ASCII representation. The problem with this method is to know how many zero's or spaces should be added. There should be at least as many as the number of digits in the biggest number we need to represent. In some context it is not possible to know the biggest number we will have to dealt with and this introduces a highest value constrain which is preferable to avoid if possible.

The solution: The proposed solution is to use a numbering schema where we simply add in front the number a digits in the number. For instance the number 234 has 3 digits. This number would then be coded as "3123" in the proposed schema where the 3 (shown in bold) is added in front of the number.

The number is valid if the string contains only digits and the first digit is length minus one. The value 0 is represented as 0. For negative numbers, if you need them, the number of digits must be inserted between the minus sign (-) and the number.

There is also an upper limit in the maximum number of digits the number can have. The biggest number that may be represented with this numbering schema is 10 billion minus one. 

With this numbering schema, the sequence "1", "2", "3", ..., "10", "11", ... "20", "21" becomes "11", "12", "13", ..., "210", "211", ... "220", "221" with the added digit in front shown in bold. The lexicographical sorting of this number sequence will preserve this order.

The price to pay is that the numbering sequence is not compact. It has gaps containing invalid numbers (i.e. 23, 123,... ). This may be considered an inconvenient but has also the benefit to make it possible to detect errors and invalid values.

Generating such number sequence is trivial as well as checking their validity.

Application example: I "invented" this coding schema when looking for an optimal way to numerically reference resources assigned incrementally for a web service (i.e. userId, documentId, imageId, ....). The numbering provides a direct mapping with a numerical table index value as well as a compact string representation. The size of the reference would grow smoothly as needed with the number of references.

Another application is as document id in NoSQL databases like CouchDB, MongoDB, etc. Keep the id compact and sorted.

Using a Base64 like encoding

A more compact coding would use a base64 like encoding. Conversion between the ASCII and binary encoding would not be as straightforward, but identifiers would be much more compact and still preserve the sorting of ASCII and binary representation.

To generate such encoding, split the binary representation in groups of 6 bits, starting from the less significant bit (right most) toward the most significant bit. Then replace all the left most chunks that have all bits to zero with a single chunk coding the number of 6bits chunks left. For instance  ...00000|110010|010011 becomes 000010|110010|010011 because there are only two significant chunks in the number and 2 is encoded with 6 bits as 000010. The last step is to replace each 6 bit chunks in the resulting chunk sequence with the ASCII codes provided in the following table.

Picture
Mapping between chunk's 6 bit binary integer value and ASCII letters used for encoding
The resulting encoding is very similar to Base64 encoding but has the particular properties to preserve the sorting order of the chunk integer value and the associated ASCII value as well as using ASCII codes that may be used in URLs or filenames. Except for the value 0, the ASCII representation will never start with a '-'. 

Conversion between the ASCII representation and the binary representation is more complicated, especially when it has to be done by humans. Though a benefit of this coding is that its ASCII representation will be short for small numbers. The ASCII coding will have n+1 letters for numbers with n significant chunks. For up to 24 bit numbers (over 16 millions values), the longest ASCII encoding will be 5 letters.
2 Comments

What is wrong with HTTP ?

6/25/2009

0 Comments

 

Here is a document presenting a review on what is good an bad with HTTP. It provides some light on the choices I made for DIS. I couldn't identify the author's name in the text. Sorry. 

What is wrong with HTTP ?


In this essay, the first of a pair on browser apps, I explore how they are better than traditional desktop apps in some ways, but worse in others. Some of the disadvantages of browser apps are deeply rooted in the use of HTTP URLs for naming. In the second essay, I will present a design sketch for a new platform, are placement for HTTP combining both styles' advantages.Right now, we're seeing a massive shift to browser apps, largely server-side browser apps. As I warned in "People, places, things,and ideas," [18] this move to server-side browser apps imperils our software freedom; I outlined how to solve this problem in "The equivalent of free software for online services." [19] This pair of essays represents more detail on this problem and proposed solution.

read more ...

0 Comments

Median value selection algorithm

6/9/2009

6 Comments

 

At work I'm currently working on tomographic reconstruction algorithms. I have to implement a Bayesian iterative algorithm that requires to select the median value in a set of the 27 float values from a cube of 3x3x3 voxels. This operation must be performed for each voxel and for each iteration. We have to expect 256 million voxels to process for each iteration, but "only" 60 to 100 iterations.

Trying to find the most efficient algorithm I came up with a new algorithm considering the one described on the select algorithm page of wikipedia. I then submitted a question on StackOverflow to get some feedback. And I did get valuable feedback. I was first pointed to the C++ nth_element function I didn't know at the time. It was also suggest to optimize by sharing intermediate information which is indeed a smart thing to do to get a better initial guess on the median value.

After multiple tests and changes to the code I finally reached what seem to be a very efficient algorithm. The ratio is so good that I'm still unsure about it, but I checked everything. It could be due to memory cache or particularly favorable parallelization opportunities with sse instructions. I don't know.

Here is the outline of the algorithm. We have 27 values and have to find the value that split the set in two with 13 values smaller or equal to it and 13 values bigger or equal to it. This value is called the median value.

The fundamental idea of the algorithm is to use a heap data structure which has its smallest or biggest value at the top. Such data structure is very efficient for adding an element or to extract its top most element. It can also be very easily mapped into an array.

The algorithm uses two heaps initially empty, with a common top value which is the median value. Each heap has a capacity of at most 14 elements, where one is common to the two heaps, their top value and also the median value.
The algorithm proceed in two phases. In the first phase, the algorithm picks a value as initial median value guess. Subsequent values are then compare with this median value and added to the corresponding heap until one heap becomes full and contains 14 elements.

At this point the median value is a value in the full heap or in the remaining set of values to process. The second phase of the algorithm then starts where the remaining values are processed. Values that would not be inserted in the full heap are ignored. The other values are inserted in the full heap after deleting its top most value. The heap then gets a new top most value and thus also a new median value. When all the remaining elements have been processed, the median of the 27 value set is the top most value of the full heap.

Here are some benchmark results. See StackOverflow for more detailed information.

HeapSort        :2.287 
QuickSort       :2.297 
QuickMedian1 :0.967 
HeapMedian1  :0.858 
NthElement     :0.616 
QuickMedian2 :1.178 
HeapMedian2  :0.597 
HeapMedian3  :0.015  <-- best

It thus seem that HeapMedian3 is 33 times faster than NthElement. I used a 3GHz Intel E8400 processor and the Intel C++ compiler with options -03 and -xS for benchmarking.

Here is the code :

EDIT: The code has been removed because it was invalid. I tested it on a single random float value sequence were it gave a valid result by coincidence. A new blog post provides the correct code.

6 Comments

Stay Hungry, Stay Foolish [Steve Jobs]

2/16/2008

0 Comments

 

This is a talk of Steve Jobs at Stanford  in 2005. It's really worth  viewing.

0 Comments

My explanation of social networking success

1/28/2008

0 Comments

 

In some discussion group we just settled, we were asked to write down what would be the most important for us in our lives. And to our big surprise 90% youngsters in the group used the words "recognition" or "social recognition". There wasn't any significant discussion before that could have biased this informal survey. It was surprising enough to catch my attention.

20 years later, here is how I explain this to myself. It is known for many years now that our brain has special centers responsible of fundamental drives in our behavior. These centers have been detected and localized by changes in people's behavior when altered by an accident or a disease. The drives are related to hunger, libido, aggressivity/love, etc.


Assuming that mankind is the product of an evolution driven by natural selection, the existence of such drive centers makes sense. It ensures perennity of our species from the biological and the physiological perspective. Maybe the social perspective did not get the same attention.

To me, the existence of a drive for social recognition would also make sense because it pushes toward social coherence and cohesion and thus also to stability while being flexible enough to preserve the capability of our society to evolve and adapt to context changes.

This social recognition drive is thus equivalent to sexual drive, with all the possible related pathology. Social recognition compulsion can for instance lead to mythomania or eccentric behavior.

So my impression, if you allow me to push the logic to its extrema, is that social network sites are a variant of porn sites. They just tickle the brain centers driving our behavior.

I would thus say that social networking is directly related to the social recognition drive, which is itself a kind of attractive magnetic force required to get social cohesion and coherence. If this is true, animals would show the same fundamental drive, and why not, ETs too.

0 Comments

The network effect...

8/18/2007

0 Comments

 

DIS is designed to address many shortcomings of existing internet services. But it is also subject to the network effect which means that its value is function of the number of users.

It's like a TV broadcasting system. Whatever the quality of the technology, without interesting broadcasts it won't attract users and without users it won't attract producers and without producers it won't get attractive broadcasts and then again no users. But when the feedback in such system becomes positive we get a virtuous circle which amplifies its spinning by itself. And once the virtuous circle is spinning one can tap its energy with advertisement, freemium or even a fee if the traction is strong enough.

To bootstrap it, one has to pick the right TV broadcast which has to be one that people want to see while also simple and cheap to produce. As you may guess this is a strategic choice.

Starting such type of system is not easy, but once rolling it is quite safe because it is difficult to stop and provides thus a protection against competitors. A good example is Google failing to catch on You Tube which ended up buying it.

Like the TV broadcasting system, DIS offers the liberty degree to pick the right bootstrapping application. It has to be something that people want while being also cheap and simple to produce.

The business model is also strategic. Offering a free service means not tapping energy from the virtuous circle which is good to get maximum acceleration. But at some point we need to tap energy from it so that we can invest it in development of new features and make it more attractive.

The business model and the bootstrapping application are thus key to success, but don't expect me to present them here before they are in production.

0 Comments

Pictures from Marseille.... (France)

8/16/2007

0 Comments

 

This picture of Marseille was taken from the top of Marseilleveyre, a small mountain south of Marseille. The small harbor in front is the harbor of Pointe Rouge. The Vieux Port, center of Marseille, and Notre Dame de la Garde, are not visible here.

One of the islands in the bay host the Château d'If that was a prison for sometime. This place was made famous by the roman "Le Comte de Monte Cristo" of Alexandre Dumas.

0 Comments

Pictures from France...

8/16/2007

0 Comments

 

I just wanted to share with you two nice pictures taken during my summer holidays. 

The picture shows some vineyards and a very old church in a small village close to Blaye. The wine of this region is labeled côte de grave and famous. The river Gironde is just behind the small hill in the background. The famous Bordeau wine region Medoc is just on the other side of the Gironde.

0 Comments

DIS as Web 3.0 root

8/8/2007

0 Comments

 

Eric Schmidt, Google CEO, presented his view of the web evolution track at the Seoul Digital Forum.

DIS was conceived with this view in mind. The protocol and key service are the keystone of such system.

There are other properties that Eric Schmidt didn't evoke but that require a better response than provided by existing systems.

0 Comments

Suggested reading: Hacknot !

7/8/2007

0 Comments

 

An enlightening reading on hacking! 'To those about to hack"

As people may have understood by now, I'm more of an Abe than a George...

0 Comments
<<Previous

    Author

    Christophe Meessen is a  computer science engineer working in France.

    Any suggestions to make DIS more useful ? Tell me by using the contact page.

    Categories

    All
    Business Model
    Database
    Dis
    Ditp
    Dvcs
    Git
    Gob
    Idr
    Misc
    Murphys Law
    Programming Language
    Progress Status
    Startup
    Suggested Reading
    Web Site

    Archives

    December 2017
    November 2015
    September 2015
    February 2013
    December 2012
    November 2012
    May 2012
    February 2012
    March 2010
    October 2009
    September 2009
    July 2009
    June 2009
    May 2009
    February 2009
    January 2009
    November 2008
    September 2008
    August 2008
    July 2008
    May 2008
    April 2008
    March 2008
    February 2008
    January 2008
    December 2007
    October 2007
    August 2007
    July 2007
    June 2007
    May 2007

    RSS Feed

    Live traffic feed
    You have no departures or arrivals yet. Wait a few minutes and check again.
    Powered by FEEDJIT
Powered by Create your own unique website with customizable templates.