Distributed Information System (DIS)
  • Home
  • The blog
  • Contact

Numbering schema yielding identical lexical and numerical ordering

2/4/2012

2 Comments

 
Picture
It may be desirable in some situation to be able to assign a numerical reference (integer) to a resource with the particular property that the string  representation of the reference preserves the numerical ordering. This blog post presents a numbering method that has this property. The proposed numbering schema achieves this goal and avoids adding zero's or spaces in front of the numbers, thus keeping the strings short. The price to pay is that there will be gaps between the numbering sequence. The numbers in these gaps are invalid numbers in this numbering schema and may be easily recognized and used for error detection.

The problem: You probably experienced that sorting a list of strings representing the integer sequence "1", "2", "3", ..., "10", "11", ... "20", "21", ... yields the weird result "1", "10", "11", ... "2", "20", "21", ... "3", ... This shows up, for instance, when naming files by numbers. We get this result because strings are sorted in lexicographical order which means they are ordered by digit value, one by one from left to right. So in a lexicographical order, "10" is smaller than "2" which is the opposite of the numerical order.

In the situations where this is an unacceptable nuisance we have a set of solutions we could pick one from.

Use a specially crafted sorting algorithm able to detect that it deals with numbers in ASCII representation instead of text strings. In some context, changing the sorting algorithm is not possible (i.e. file names). 

Another possibility is to add some zero's or spaces in front of the number in its ASCII representation. The problem with this method is to know how many zero's or spaces should be added. There should be at least as many as the number of digits in the biggest number we need to represent. In some context it is not possible to know the biggest number we will have to dealt with and this introduces a highest value constrain which is preferable to avoid if possible.

The solution: The proposed solution is to use a numbering schema where we simply add in front the number a digits in the number. For instance the number 234 has 3 digits. This number would then be coded as "3123" in the proposed schema where the 3 (shown in bold) is added in front of the number.

The number is valid if the string contains only digits and the first digit is length minus one. The value 0 is represented as 0. For negative numbers, if you need them, the number of digits must be inserted between the minus sign (-) and the number.

There is also an upper limit in the maximum number of digits the number can have. The biggest number that may be represented with this numbering schema is 10 billion minus one. 

With this numbering schema, the sequence "1", "2", "3", ..., "10", "11", ... "20", "21" becomes "11", "12", "13", ..., "210", "211", ... "220", "221" with the added digit in front shown in bold. The lexicographical sorting of this number sequence will preserve this order.

The price to pay is that the numbering sequence is not compact. It has gaps containing invalid numbers (i.e. 23, 123,... ). This may be considered an inconvenient but has also the benefit to make it possible to detect errors and invalid values.

Generating such number sequence is trivial as well as checking their validity.

Application example: I "invented" this coding schema when looking for an optimal way to numerically reference resources assigned incrementally for a web service (i.e. userId, documentId, imageId, ....). The numbering provides a direct mapping with a numerical table index value as well as a compact string representation. The size of the reference would grow smoothly as needed with the number of references.

Another application is as document id in NoSQL databases like CouchDB, MongoDB, etc. Keep the id compact and sorted.

Using a Base64 like encoding

A more compact coding would use a base64 like encoding. Conversion between the ASCII and binary encoding would not be as straightforward, but identifiers would be much more compact and still preserve the sorting of ASCII and binary representation.

To generate such encoding, split the binary representation in groups of 6 bits, starting from the less significant bit (right most) toward the most significant bit. Then replace all the left most chunks that have all bits to zero with a single chunk coding the number of 6bits chunks left. For instance  ...00000|110010|010011 becomes 000010|110010|010011 because there are only two significant chunks in the number and 2 is encoded with 6 bits as 000010. The last step is to replace each 6 bit chunks in the resulting chunk sequence with the ASCII codes provided in the following table.

Picture
Mapping between chunk's 6 bit binary integer value and ASCII letters used for encoding
The resulting encoding is very similar to Base64 encoding but has the particular properties to preserve the sorting order of the chunk integer value and the associated ASCII value as well as using ASCII codes that may be used in URLs or filenames. Except for the value 0, the ASCII representation will never start with a '-'. 

Conversion between the ASCII representation and the binary representation is more complicated, especially when it has to be done by humans. Though a benefit of this coding is that its ASCII representation will be short for small numbers. The ASCII coding will have n+1 letters for numbers with n significant chunks. For up to 24 bit numbers (over 16 millions values), the longest ASCII encoding will be 5 letters.
2 Comments

About me...

6/27/2007

0 Comments

 

I've added a photo of my self so you can associate a face to DIS. The city where I live and work is Marseille which is in Provence (south of France). View it with Google maps.

Below you can see the calanque of Sugiton, just next to Marseille, that you can reach after a 30 minute walk through a pine wood from Luminy.

0 Comments

License

5/13/2007

0 Comments

 

By reading and using any information published on this web site you implicitely and totally agree to fully assume any possible consequences of this activity.

The author owns total and exclusive intellectual property on DIS and its covered technology as on all the information published on this web site. Note that it doesn't exclude free use of it.

If the technology discribed as part of DIS, and claimed to be an exclusive invention of the author of this web site, appears to be covered in part, or totally, by any patents, you must immediately inform the author of it and provide the justifying documentation.

0 Comments

DITP Blog opened !

5/13/2007

0 Comments

 

It was time to create this web site so I could explain what is DIS and, what I value most, get your feedback to make DIS more useful. Please leave a comment or send a mail.

0 Comments

    Author

    Christophe Meessen is a  computer science engineer working in France.

    Any suggestions to make DIS more useful ? Tell me by using the contact page.

    Categories

    All
    Business Model
    Database
    Dis
    Ditp
    Dvcs
    Git
    Gob
    Idr
    Misc
    Murphys Law
    Programming Language
    Progress Status
    Startup
    Suggested Reading
    Web Site

    Archives

    December 2017
    November 2015
    September 2015
    February 2013
    December 2012
    November 2012
    May 2012
    February 2012
    March 2010
    October 2009
    September 2009
    July 2009
    June 2009
    May 2009
    February 2009
    January 2009
    November 2008
    September 2008
    August 2008
    July 2008
    May 2008
    April 2008
    March 2008
    February 2008
    January 2008
    December 2007
    October 2007
    August 2007
    July 2007
    June 2007
    May 2007

    RSS Feed

    Live traffic feed
    You have no departures or arrivals yet. Wait a few minutes and check again.
    Powered by FEEDJIT
Powered by Create your own unique website with customizable templates.