<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" >

<channel><title><![CDATA[Distributed Information System (DIS) - The blog]]></title><link><![CDATA[http://www.disnetwork.info/the-blog]]></link><description><![CDATA[The blog]]></description><pubDate>Sun, 28 Dec 2025 15:34:08 -0800</pubDate><generator>Weebly</generator><item><title><![CDATA[Median value selection (Fixed)]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/median-value-selection-fixed]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/median-value-selection-fixed#comments]]></comments><pubDate>Wed, 20 Dec 2017 16:26:08 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/median-value-selection-fixed</guid><description><![CDATA[In 2009 I presented a heap based median selection algorithm. It was original, and was apparently very fast when compiled with the Intel compiler (icc). Since I don't have the Intel compiler anymore, I can't test its performance. It's slower than the nthElement code given below when compiled with g++-7 -O3.Here is the fixed code.float fixedHeapMedian (float *a) {&nbsp; const unsigned char HEAP_LEN = 13;&nbsp; float left[HEAP_LEN], right[HEAP_LEN], *p, median;&nbsp; unsigned char nLeft, nRight;&n [...] ]]></description><content:encoded><![CDATA[<div class="paragraph">In 2009 I presented a <a href="http://www.disnetwork.info/the-blog/median-value-selection-algorithm">heap based median selection algorithm</a>. It was original, and was apparently very fast when compiled with the Intel compiler (icc). Since I don't have the Intel compiler anymore, I can't test its performance. 
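The nthElement comparison code mentioned in this post does not appear in the feed entry; a plausible reconstruction (my own sketch, assuming std::nth_element over a fixed 27-value input, as in fixedHeapMedian) might look like:

```cpp
#include <algorithm>

// Hypothetical sketch of the std::nth_element baseline, not the author's
// exact benchmark code. Note: std::nth_element reorders the input array.
float nthElementMedian(float *a) {
    std::nth_element(a, a + 13, a + 27); // 14th smallest of 27 = median
    return a[13];
}
```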
It's slower than the nthElement code given below when compiled with g++-7 -O3.<br /><br />Here is the fixed code.<br /><br />float fixedHeapMedian (float *a) {<br />&nbsp; const unsigned char HEAP_LEN = 13;<br />&nbsp; float left[HEAP_LEN], right[HEAP_LEN], *p, median;<br />&nbsp; unsigned char nLeft, nRight;<br /><br />&nbsp; // pick first value as median candidate<br />&nbsp; p = a;<br />&nbsp; median = *p++;<br />&nbsp; nLeft = nRight = 0;<br /><br />&nbsp; for (;;) {<br />&nbsp;&nbsp;&nbsp; //dumpState(left, nLeft, median, right, nRight, p, 27 - (p-a));<br />&nbsp;&nbsp;&nbsp; //assert(stateIsValid(left, nLeft, median, right, nRight));<br /><br />&nbsp;&nbsp;&nbsp; // get next value<br />&nbsp;&nbsp;&nbsp; float val = *p++;<br /><br />&nbsp;&nbsp;&nbsp; // if value is smaller than median, append to left heap<br />&nbsp;&nbsp;&nbsp; if (val &lt;= median) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // move biggest value to the top of left heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned char child = nLeft++, parent = (child - 1) / 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (child &amp;&amp; val &gt; left[parent]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; left[child] = left[parent];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = parent;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = (parent - 1) / 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; left[child] = val;<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // if left heap is full<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (nLeft == HEAP_LEN) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //cout &lt;&lt; "---" &lt;&lt; endl;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // for each remaining value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (unsigned char nVal = 27-(p - a); nVal; --nVal) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //dumpState(left, nLeft, median, right, nRight, p, nVal);<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //assert(stateIsValid(left, nLeft, median, right, nRight));<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // get next value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; val = *p++;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // discard values falling in other heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &gt;= median) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // if val is bigger than biggest in heap, val is new median<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &gt;= left[0]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; median = val;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // biggest heap value becomes new median<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; median = left[0];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // insert val in heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = 0;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (child &lt; HEAP_LEN) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (left[child-1] &gt; left[child]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = child-1;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &gt;= left[child]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
break;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; left[parent] = left[child];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = child;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = (parent + 1) * 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; left[parent] = val;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return median;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp; } else {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // move smallest value to the top of right heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned char child = nRight++, parent = (child - 1) / 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (child &amp;&amp; val &lt; right[parent]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; right[child] = right[parent];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = parent;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = (parent - 1) / 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; right[child] = val;<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // if right heap is full<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (nRight == HEAP_LEN) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //cout &lt;&lt; "---" &lt;&lt; endl;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // for each remaining value<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (unsigned char nVal = 27-(p - a); nVal; --nVal) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //dumpState(left, nLeft, median, right, nRight, p, nVal);<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; //assert(stateIsValid(left, nLeft, median, right, nRight));<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // get next value<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; val = *p++;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // discard values falling in other heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &lt;= median) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // if val is smaller than smallest in heap, val is new median<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &lt;= right[0]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; median = val;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // heap top value becomes new median<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; median = right[0];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // insert val in heap<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = 0;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while (child &lt; HEAP_LEN) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (right[child-1] &lt; right[child]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = child-1;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (val &lt;= right[child]) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; right[parent] = right[child];<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; parent = child;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; child = (parent + 1) * 2;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; right[parent] = val;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return median;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp; }<br />&nbsp; }<br />}</div>]]></content:encoded></item><item><title><![CDATA[C source code for MSB encoding and decoding]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/c-source-code-for-msb-encoding-and-decoding]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/c-source-code-for-msb-encoding-and-decoding#comments]]></comments><pubDate>Mon, 02 Nov 2015 09:54:50 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/c-source-code-for-msb-encoding-and-decoding</guid><description><![CDATA[For a detailed explanation see Efficiently encoding variable-length integers in&nbsp;C/C++.#include &lt;stdint.h&gt;#include &lt;string.h&gt;// Little endian encodingsize_t encodeMSBlittleEndian(uint64_t value, uint8_t* out) {&nbsp;&nbsp;&nbsp; uint8_t *p = out;&nbsp;&nbsp;&nbsp; while (value &gt; 127) {&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *p++ = value | 0x80;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; value &gt;&gt;= 7;&nbsp;&nbsp;&nbsp; }&nbsp;&nbsp;&nbsp; *p++ = value;&nbsp;&nbsp;&nbsp;  [...] 
]]></description><content:encoded><![CDATA[<div class="paragraph" style="text-align:left;"><span>For a detailed explanation see </span><a href="http://techoverflow.net/blog/2013/01/25/efficiently-encoding-variable-length-integers-in-cc/">Efficiently encoding variable-length integers in&nbsp;C/C++</a>.<br /><br />#include &lt;stdint.h&gt;<br />#include &lt;string.h&gt;<br /><br /><span>// Little endian encoding<br /><strong>size_t encodeMSBlittleEndian(uint64_t value, uint8_t* out)</strong> {<br />&nbsp;&nbsp;&nbsp; uint8_t *p = out;<br />&nbsp;&nbsp;&nbsp; while (value &gt; 127) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *p++ = value | 0x80;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; value &gt;&gt;= 7;<br />&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp; *p++ = value;<br />&nbsp;&nbsp;&nbsp; return p - out;<br />}<br /><br />// Little endian decoding<br /><strong>size_t decodeMSBlittleEndian(uint64_t *value, uint8_t* in)</strong> {<br />&nbsp;&nbsp;&nbsp; // locate end of int<br />&nbsp;&nbsp;&nbsp; uint8_t *p = in;<br />&nbsp;&nbsp;&nbsp; while (*p++ &amp; 0x80);<br />&nbsp;&nbsp;&nbsp; size_t size = p - in;<br />&nbsp;&nbsp;&nbsp; //decode int<br />&nbsp;&nbsp;&nbsp; uint64_t ret = 0;<br />&nbsp;&nbsp;&nbsp; do {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ret = (ret &lt;&lt; 7) | (*--p &amp; 0x7F);<br />&nbsp;&nbsp;&nbsp; } while (p != in);<br />&nbsp;&nbsp;&nbsp; *value = ret;<br />&nbsp;&nbsp;&nbsp; return size;<br />}<br /><br />Note that little endian encoding makes encoding fast but requires more work to decode. 
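A quick round-trip check of the little-endian pair (the two functions are copied from this post so the sketch is self-contained):

```cpp
#include <cstdint>
#include <cstddef>

// Copy of the little-endian encoder from this post.
size_t encodeMSBlittleEndian(uint64_t value, uint8_t* out) {
    uint8_t *p = out;
    while (value > 127) {
        *p++ = value | 0x80; // low 7 bits plus continuation flag
        value >>= 7;
    }
    *p++ = value;
    return p - out;
}

// Copy of the little-endian decoder from this post.
size_t decodeMSBlittleEndian(uint64_t *value, uint8_t* in) {
    uint8_t *p = in;
    while (*p++ & 0x80);  // locate end of int
    size_t size = p - in;
    uint64_t ret = 0;
    do {                  // decode from most significant byte down
        ret = (ret << 7) | (*--p & 0x7F);
    } while (p != in);
    *value = ret;
    return size;
}
```

For example, the value 300 encodes to the two bytes 0xAC 0x02 and decodes back to 300.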
When encoding the integer once and decoding it many times, big endian encoding should be favored.</span><br /><br />// Big endian encoding<br /><strong>size_t encodeMSBbigEndian(<span>uint64_t</span> value, uint8_t* out)</strong> {<br />&nbsp;&nbsp;&nbsp; uint8_t buf[10], *p = buf + 10; // a 64 bit value needs up to 10 bytes<br />&nbsp;&nbsp;&nbsp; *--p = value &amp; 0x7F;<br />&nbsp;&nbsp;&nbsp; while (value &gt;&gt;= 7) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *--p = value | 0x80;<br />&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp; size_t size = buf + 10 - p;<br />&nbsp;&nbsp;&nbsp; memcpy(out, p, size);<br />&nbsp;&nbsp;&nbsp; return size;<br />}<br /><br />// Big endian decoding<br /><strong>size_t decodeMSBbigEndian(<span>uint64_t</span> *value, uint8_t* in)</strong> {<br />&nbsp;&nbsp;&nbsp; uint8_t *p = in;<br />&nbsp;&nbsp;&nbsp; <span>uint64_t</span> ret = *p &amp; 0x7F;<br />&nbsp;&nbsp;&nbsp; while (*p &amp; 0x80) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ret = (ret &lt;&lt; 7) | (*++p &amp; 0x7F);<br />&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp; *value = ret;<br />&nbsp;&nbsp;&nbsp; return p - in + 1;<br />}<br /></div>]]></content:encoded></item><item><title><![CDATA[Presenting the Timez data type]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/presenting-the-timez-data-type]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/presenting-the-timez-data-type#comments]]></comments><pubDate>Mon, 21 Sep 2015 13:02:39 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/presenting-the-timez-data-type</guid><description><![CDATA[I'm currently working on an implementation of the Date Time stamp I described in this post and that I decided to name Timez. It seemed trivial to implement at first, but I then discovered the particular property of the system time regarding leap seconds and the problem it represents.Before explaining the problem, let me briefly explain what a Timez is. 
The idea is simple but brilliant (thanks).Clocks displaying the local time of different locations of the world. [http://sapling-inc.com]The Timez stamp [...] ]]></description><content:encoded><![CDATA[<div class="paragraph" style="text-align:left;">I'm currently working on an implementation of the Date Time stamp I described in <a title="" href="http://www.disnetwork.info/the-blog/date-time-stamp-binary-encoding">this post</a> and that I decided to name <em><a title="" href="https://github.com/chmike/timez">Timez</a>.</em> It seemed trivial to implement at first, but I then discovered the particular property of the system time regarding leap seconds and the problem it represents.<br><br>Before explaining the problem, let me briefly explain what a <em>Timez</em> is. The idea is simple but brilliant (thanks).<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:5px;padding-bottom:10px;margin-left:0;margin-right:10px;text-align:left"><a href='http://sapling-inc.com/'><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/3094209_orig.png" alt="Photo" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%">Clocks displaying the local time of different locations of the world. [http://sapling-inc.com]</div></div></div><h2 class="wsite-content-title" style="text-align:left;">The <em style="">Timez</em> stamp</h2><div class="paragraph" style="text-align:justify;">A time stamp is generated at a particular location on the surface of the globe. It thus has a specific local time offset relative to the UTC time. A user in a different time zone may want to<br><ol style=""><li style="">view the stamp time with the local time where it was produced ;</li><li style="">sort stamps by UTC time, thus ignoring the local time offset of its origin ;</li><li style="">view the stamp time with his own local time offset&nbsp; ;</li><li style="">view the stamp time with the local time offset of another location in the world. 
&nbsp;</li></ol>Use cases are for instance a web forum with messages of people from different time zones. Messages have to be sorted by time regardless of the sender's local time offset, etc. Another use case is a messaging system like mail. ISO 8601 defines a standard ASCII time representation convenient for humans, but it is neither compact nor efficient.<br><br>The solution I came up with is to combine into a 64 bit signed integer a time expressed as the number of microseconds relative to an epoch, and the local time offset, expressed in minutes, of the place where the stamp was generated.<br></div><div><div class="wsite-image wsite-image-border-none" style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:right"><a><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/4965361_orig.png" alt="Picture" style="width:auto;max-width:100%"></a><div style="display:block;font-size:90%"></div></div></div><div class="paragraph" style="text-align:left;">The number of microseconds is a signed integer. When it is negative, it represents the number of microseconds left to elapse to reach the epoch. When positive, it is the number of microseconds elapsed since the epoch. The time range covered by this value is +/- 142 years relative to the epoch's year.<br><br>The time offset is an unsigned integer. It is the time offset in minutes plus 1024, and its range is +/- 17 hours (+/-1023 minutes). If the bits of the time offset field are all zero, the time offset value is -1024 and the <em>Timez</em> value is invalid or undefined.<br><br>Up to here, it's all simple and straightforward. 
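The packing just described can be sketched in a few lines of C (the helper names below are mine, not the actual Timez API; right-shifting a negative value assumes an arithmetic shift, as produced by common compilers):

```cpp
#include <cstdint>

// Pack a signed microsecond count (53-bit field) with a local time
// offset in minutes (11-bit field, stored as offset + 1024).
int64_t timezPack(int64_t usec, int offsetMinutes) {
    return usec * 2048 + (offsetMinutes + 1024);
}

// Extract the signed microsecond count (assumes arithmetic right shift).
int64_t timezMicroseconds(int64_t tz) { return tz >> 11; }

// Extract the local time offset in minutes.
int timezOffsetMinutes(int64_t tz) { return (int)(tz & 0x7FF) - 1024; }
```

Multiplying by 2048 rather than shifting left avoids undefined behavior for negative microsecond counts; the low 11 bits of the product are zero, so adding the offset field never disturbs the time field.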
So I started implementing the <em>Timez</em> data type in C.<br></div><h2 class="wsite-content-title" style="text-align:left;">Time in <em>Timez</em> is without leap seconds correction<br></h2><div class="paragraph" style="text-align:left;">The time without leap second correction is called the TAI (<a title="" href="https://en.wikipedia.org/wiki/International_Atomic_Time">International Atomic Time</a>) time. The GPS (<a title="" href="https://en.wikipedia.org/wiki/Global_Positioning_System">Global Positioning System</a>) time is also uncorrected by leap seconds. They differ by their epoch only. I decided that <em>Timez</em> refers to these clocks to avoid the problems resulting from the leap second correction.<br><br>But to my great surprise, there is actually no way to get the time in POSIX (all Unix flavors) or Windows without leap second corrections, a.k.a. TAI or GPS time. The system time you get with the functions <em>time()</em>, <em>gettimeofday()</em> or&nbsp; <em>clock_gettime()</em> is the number of seconds elapsed since 1970-01-01T00:00:00 UTC <strong>minus the leap seconds</strong>.<br><br>Investigating this further, it appears that the time handling problem on computers is actually a rabbit hole. I learned a lot in the process, but also wasted a significant amount of time. It is frustrating because it clearly results from lagging standard definitions and support.<br><br>Happily, things are changing, but only very slowly. Since Linux Kernel 3.10 there is a new clock id that can be used by <em>clock_gettime()</em> named <em>CLOCK_TAI</em>. But on my computer it currently still returns the same time as <em>CLOCK_REALTIME</em>. Apparently you need a version of NTP higher than 4.2.6 to get the <em>CLOCK_TAI</em> clock adjusted.<br><br>What is then still missing is a conversion between leap second corrected time and uncorrected time. 
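Such a conversion can be sketched with a hard-coded leap second table. The table below is deliberately partial (it omits the 1972-2009 entries) and the helper name is illustrative, not part of any OS API:

```cpp
#include <cstdint>

// Partial table: POSIX (UTC) second at which a new TAI-UTC offset applies.
// A real implementation needs every leap second entry since 1972.
struct LeapEntry { int64_t utc; int offset; };
static const LeapEntry leapTable[] = {
    {63072000, 10},    // 1972-01-01, TAI-UTC = 10 s
    {1230768000, 34},  // 2009-01-01
    {1341100800, 35},  // 2012-07-01
    {1435708800, 36},  // 2015-07-01
    {1483228800, 37},  // 2017-01-01
};

// Convert a leap second corrected (POSIX/UTC) second count into
// uncorrected TAI seconds by adding the applicable TAI-UTC offset.
int64_t utcToTai(int64_t utcSec) {
    int offset = 0;
    for (const LeapEntry &e : leapTable)
        if (utcSec >= e.utc) offset = e.offset;
    return utcSec + offset;
}
```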
I plan to provide such a function so that <em>Timez</em> can be used with operating systems that don't provide TAI or GPS time. I'll unfortunately have to hard-code the table of leap seconds because there is no easy access to a dynamically updated table.<br><br>If you want to learn more about leap seconds I suggest reading <a title="" href="https://en.wikipedia.org/wiki/Leap_second#Proposal_to_abolish_leap_seconds">this section in Wikipedia</a>.&nbsp; An interesting part is about the proposal to drop the leap second correction. I also encourage you to watch this short <span style="">video on <a style="" title="" href="https://www.youtube.com/watch?v=-5wpm-gesOY">the Time &amp; Time zone problem</a> from <em>Computerphile</em>.</span><br></div><h2 class="wsite-content-title" style="text-align:left;">The epoch of <em>Timez</em><br></h2><div class="paragraph" style="text-align:left;">This was a difficult decision to make. <em>CLOCK_TAI</em> uses the same epoch, 1970-01-01T00:00:00 UTC, as <em>CLOCK_REALTIME</em>. This means that a negative count of seconds covers the period before this epoch. With 64 bit integers to encode the count of seconds, this is not a problem.<br><br>With 53 bits encoding the number of microseconds, we are short. We can only cover +/- 142 years around the epoch. By picking the same epoch as <em>CLOCK_TAI</em> we would only have ~100 years left until the <em>Timez</em> time counter would wrap.<br><br>I then identified three options.<br><ol style=""><li style="">Epoch = 1970-01-01T00:00:00 UTC + 2^52 : the covered time range is then from 1970 to 2254 ;</li><li style="">Epoch = 2050-01-01T00:00:00 TAI : the covered time range is then from 1908 to 2192 ;<br></li><li style="">Epoch = 1970-01-01T00:00:00 UTC + 2^52 - (2^31) * 1000000: the covered time range is then from 1902 to 2186.</li></ol><br>Option 1 would have the advantage of pushing the wrapping limit farthest into the future. The disadvantage is that it can't represent time before 1970. 
The epoch offset is a value easy to remember.<br><br>Option 2 would have the advantage of allowing time in the past to be represented. But the epoch offset would be an obscure integer magic number corresponding to the number of microseconds between 1970 and 2050.<br>&nbsp;<br>Option 3 has the advantage of covering the time span of 32 bit signed integer time_t values. The <em>Timez</em> would thus be backward compatible with the <em>time_t</em> values. The epoch offset is still a magic number but more easily obtained than the one of option 2. However, conversion between corrected and uncorrected time is not well defined before 1972.<br><br>Considering the pros and cons of the different options, I chose option 1. The <em>Timez</em> epoch is <strong>1970-01-01T00:00:00 UTC + 2^52</strong>. The value <strong style="">2^52</strong> is the timez epoch offset relative to the POSIX time epoch. Note that 1970-01-01T00:00:00 UTC is 1970-01-01T00:00:10 TAI.<br><br>To convert a <em>CLOCK_TAI</em> value to a <em>Timez</em> microsecond count, use the following expression:<br><span style=""></span><span style=""></span></div><div><div id="771289813717539233" align="left" style="width: 100%; overflow-y: hidden;" class="wcustomhtml"><code>#define TIMEZ_EPOCH 0x10000000000000LL<br>struct timespec tp;<br>if (clock_gettime(CLOCK_TAI, &amp;tp) ) /*fail*/ ;<br>int64_t t = (int64_t)tp.tv_sec * 1000000 + tp.tv_nsec / 1000 - TIMEZ_EPOCH;</code></div></div>]]></content:encoded></item><item><title><![CDATA[hostname and hostname --fqdn mystery]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/hostname-and-hostname-fqdn-mystery]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/hostname-and-hostname-fqdn-mystery#comments]]></comments><pubDate>Sun, 10 Feb 2013 14:10:45 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/hostname-and-hostname-fqdn-mystery</guid><description><![CDATA[I have just installed a fresh Ubuntu 
12.04 LTS server named home. Why not. Preparing the installation of PHP with fastcgi and nginx, inspired by this tutorial, I was puzzled by the fact that I get home with the commands hostname and hostname -f. I expected to receive the fully qualified domain name when using the hostname -f command.It took some time and manual page reading to find out that this result is normal. Indeed the /etc/hostname file must contain the server name, and not the fully qualifi [...] ]]></description><content:encoded><![CDATA[<div class="paragraph" style="text-align:left;">I have just installed a fresh Ubuntu 12.04 LTS server named home. <span>Why not. </span>Preparing the installation of PHP with fastcgi and nginx, inspired by <a href="http://library.linode.com/web-servers/nginx/php-fastcgi/ubuntu-12.04-precise-pangolin">this</a> tutorial, I was puzzled by the fact that I get <em>home</em> with the commands <em>hostname</em> and <em>hostname -f</em>. <br /><br /><span></span>I expected to receive the fully qualified domain name when using the <em style="">hostname -f </em><span style="">command.</span><br /><br /><span>It took some time and manual page reading to find out that this result is normal. Indeed the <em>/etc/hostname</em> file must contain the server name, and </span>not the fully qualified domain name. <br /><br /><span>The right command to use to get the fully qualified domain names is </span><br /><span></span><strong><em>hostname --all-fqdns</em></strong> and not  <em style="">hostname -f</em>. There is no need to change the<em> /etc/hosts</em> file. It should contain <br /><em>127.0.0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; localhost </em><br /><em>127.0.1.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; home</em><br /><span>This </span><em style="">127.0.1.1 </em>is weird, but it was set so by default. 
<br /></div>]]></content:encoded></item><item><title><![CDATA[Date time stamp binary encoding]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/date-time-stamp-binary-encoding]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/date-time-stamp-binary-encoding#comments]]></comments><pubDate>Tue, 18 Dec 2012 11:04:58 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/date-time-stamp-binary-encoding</guid><description><![CDATA[Infinite Clock II by Robbert van der Steeg A date time stamp is a reference in time. This post considers only date time stamps used as time references in computer systems with a limited time span like now +/- 100 years. It presents a binary encoding with microsecond unit resolution for absolute time encoding including time zone information or relative time intervals for arithmetic time computation.    IntroductionOperating  systems classically represent time as an integer value corresponding to   [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:left;height:0px'></span><span style='z-index:10;position:relative;float:left;;clear:left;margin-top:0px;*margin-top:0px'><a href='http://www.flickr.com/photos/robbie73/5925546380/'><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/7328051.jpg" style="margin-top: 5px; margin-bottom: 10px; margin-left: 0px; margin-right: 10px; border-width:1px;padding:3px;" alt="Photo" class="galleryImageBorder" /></a><div style="display: block; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;">Infinite Clock II by Robbert van der Steeg</div></span> <div class="paragraph" style="text-align:left;display:block;"><br /><span></span>A date time stamp is a reference in time. This post considers only date time stamps used as time references in computer systems with a limited time span like now +/- 100 years. 
<br /><br /><span></span>It presents a binary encoding with microsecond unit resolution for absolute time encoding including time zone information, or relative time intervals for arithmetic time computation. <br /><span></span><br /><br /><span style=""></span></div> <hr style="width:100%;clear:both;visibility:hidden;"></hr>  <div class="paragraph" style="text-align:left;"><strong style="">Introduction</strong><br /><br />Operating systems classically represent time as an integer value corresponding to the number of seconds elapsed since 1970-01-01 00:00. Unfortunately the coarse time resolution granularity and absence of time zone information make it inconvenient to use as a time reference for world wide communicating applications. <br /><br /><strong style="">Rationale</strong><br /><br />The rationale of this encoding choice is to favor efficient date time comparison and local time computation, or UTC time and time zone extraction, with simple to remember and trivial operations. Arithmetic operations on time should also be straightforward.<br /><br /><strong style="">Time zone encoding<br /></strong><br />As per <a style="" title="" href="http://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a>, the international normalization of date time representation, the time offset relative to UTC has a minute granularity. According to this <a style="" title="" href="http://archives.postgresql.org/pgsql-hackers/2012-05/msg01464.php">bug report</a> the smallest time zone offset value relative to UTC may be&nbsp; -15:56:00 in Asia/Manila and the biggest 15:13:42 in America/Metlakatla. We may round this to -16:00 to +15:59. This time span represents 2 x 960 = 1920 minutes. Thus 11 bits are sufficient to encode the time zone. The value is encoded as an unsigned integer relative to 1024. Thus -40 is encoded as 1024 - 40 = 984 and +40 as 1024 + 40 = 1064. An hour is 60 minutes, thus 2:04 is encoded as 2 x 60 + 4 = 124. 
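The offset encoding just described can be written out as follows (the helper names are mine, for illustration only):

```cpp
// Encode a UTC offset in minutes into the 11 bit field (offset + 1024).
unsigned encodeTzOffset(int minutes) { return (unsigned)(minutes + 1024); }

// Decode the 11 bit field back into a signed offset in minutes.
int decodeTzOffset(unsigned field) { return (int)field - 1024; }
```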
<br /><br /><strong style="">Time encoding</strong><br /><br />If  we use 64 bit integers, this leaves 53 bits for time encoding. The  obvious choice is to use the UTC time as universal reference and the  time elapsed since 1970-01-01 00:00 in some unit to get an  integer representation. This provides a well normalized and easy to  remember time reference. It also simplify conversion from the existing  (old) 32bit system time encoding. Reserving one bit as sign bit  so that a 64 bit signed integer data type can be used, we have 52 bits  left. Using microsecond time units, the time value can be in a year  range of 1970 +/- 142. This leaves 100 years left ahead of us. <br /><br /><strong style="">Encoding summary</strong><br /><br />The  time is encoded in a 64 bit signed integer. The 53 most significant  bits represent a signed time delay in microsecond time units. <br /><br />When  the value represent a time interval or the result of some time  computation the 11 less significant bits are 0 so that conventional  signed integer arithmetic operations can be use for time computation. The only constrain is with time interval division where the the 11 less significant bits of the result must be cleared. &nbsp; <br /><br />When the time is an absolute time, the 53 most significant bits encode the time interval relative to the 1970-01-01  00:00 UTC time. The 11 less significant bits encode the local time  offset relative to the UTC time in minute units and added by 1024 so  that as it is encoded as an unsigned integer value. The value 0 (-1024)  is not a valid time offset value. <br /><br /><strong style="">Time operations</strong><br /><br /><ul style=""><li style="">Testing if a time value is an absolute time or an interval is performed by testing if the 11 less significant bits are all 0. 
</li><li style="">To perform time computation, first clear the 11 less significant bits then use conventional integer addition, subtraction and multiplication arithmetic operations.&nbsp;</li><li style="">To perform time interval division, use the normal integer division operation and clear the 11 less significant bits of the result.<br /></li><li style="">Comparing  absolute times can be done as conventional integer comparison as well  for time intervals. Comparing absolute time, with time interval won't  make sense unless the time interval is relative to the 1970-01-01 00:00 UTC time.</li><li style="">Extracting  the UTC time zone in minute units is performed by clearing the 53 most  significant bits and subtracting 1024 to the resulting value.</li><li style="">Conversion to double precision floats with second units is trivial and without loss of precision, but it will lack the time zone information. <br /></li></ul><br /><strong style="">Final remarks</strong><br /><br />This encoding is trivial to understand and to manipulate by using conventional integer arithmetics, comparison or bit wise operations. Its  value may represent an absolute time or a time interval with the  possibility to distinguish between these two types of value. Time comparison or  arithmetic operations in this representation is more efficient than by  using double float encoding. <br /><br />This encoding is perfectly suited for  date time stamping in the defined limited range and using such encoded date as  indexed key in a database or when sorting stamped information is needed. It allows  to display any absolute time using the ISO 8601 convention or any  country specific representation. <br /><br />However this time encoding has two limitations which are minor weakness. The  first limitation is the restricted time span covered by the encoding.  The second limitation is the inability to encode summer or winter time  saving information. 
The latter does not impair absolute time comparison because the UTC time is used as reference. The problem is just the inability to determine whether the time zone offset includes daylight saving time or not. But this is also the case with the <a style="" title="" href="http://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a> representation. </div>]]></content:encoded></item><item><title><![CDATA[Base32 encoding proposal]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/base32-encoding-proposal]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/base32-encoding-proposal#comments]]></comments><pubDate>Mon, 26 Nov 2012 17:20:32 GMT</pubDate><category><![CDATA[Uncategorized]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/base32-encoding-proposal</guid><description><![CDATA[Base64 is a popular data encoding used to represent binary data as a sequence of ASCII characters. What is less popular is the Base32 encoding because it generates a less compact encoding. However, Base32 encoding has the benefit of providing an encoding that is easy to handle "manually" by humans. I suggest using Base32 encoding to provide a compact identifier encoding that users have to remember and may have to spell out to other people. It is for this reason that I would prefer to use such  [...] ]]></description><content:encoded><![CDATA[<div class="paragraph" style="text-align:left;"><a href="http://en.wikipedia.org/wiki/Base64">Base64</a> is a popular data encoding used to represent binary data as a sequence of ASCII characters.<span> What is less popular is the <a href="http://en.wikipedia.org/wiki/Base32">Base32</a> encoding because it generates a less compact encoding. </span><br /><br />However, Base32 encoding has the benefit of providing an encoding that is easy to handle "manually" by humans. I suggest using Base32 encoding to provide a compact identifier encoding that users have to remember and may have to spell out to other people. 
<br /><br />It is for this reason that I would prefer to use such a Base32 encoding for the user identifier keys of a web service. <br /><br /><span>With 4 Base32 ASCII codes one can encode one value in a million. With 5 </span>Base32 <span>ASCII codes, we can encode one value in 32 million and with 6 </span>Base32 <span>ASCII codes we can encode one value in a billion. </span>I can't wait for my users to need 5 or even 6 ASCII codes in their identifiers.<br /><span></span><br /><span></span>My proposed encoding is the same as <a href="http://en.wikipedia.org/wiki/Base32#Crockford.27s_Base32">Crockford's Base32 alphabet</a> with the difference that the letter <strong>U</strong> is preserved and the letter <strong>W</strong> is removed. I guess the letter U was removed to avoid confusion with two 1s in sequence. But I find that removing the W is preferable because it is less convenient to memorize and spell out. <br /><br /><span></span>Note: with a Base64 encoding, only 5 Base64 ASCII codes are needed for one value in a billion, but the complexity of remembering, spelling out and distinguishing upper and lower case letters makes it inefficient. <br /></div>]]></content:encoded></item><item><title><![CDATA[IDR encoding compared to Go language encoding]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/idr-encoding-compared-to-go-language-encoding]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/idr-encoding-compared-to-go-language-encoding#comments]]></comments><pubDate>Sat, 12 May 2012 10:05:56 GMT</pubDate><category><![CDATA[gob]]></category><category><![CDATA[idr]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/idr-encoding-compared-to-go-language-encoding</guid><description><![CDATA[As the author of the IDR encoding (yet unpublished), I was very curious to see how it compares to the data encoding proposed in the Go language designed by the Google team (gobs of data). 
There are two fundamental differences between the two. Value encoding Gobs encodes a value as a tag byte followed by a compact byte encoding of the value. The tag identifies the type of the value and its encoded byte length. The byte encoding drops trailing 0 bytes of the value.IDR uses the most common computer in [...] ]]></description><content:encoded><![CDATA[<div class="paragraph" style='text-align:left;'>As the author of the IDR encoding (yet unpublished), I was very curious to see how it compares to the data encoding proposed in the Go language designed by the Google team (<a title="" href="http://blog.golang.org/2011/03/gobs-of-data.html">gobs of data</a>). <br /><br /><span>There are two fundamental differences between the two. </span><br /><span></span><br /><span style="font-weight: bold; text-decoration: underline;">Value encoding</span> <br /><span></span><br /><span>Gobs encodes a value as a tag byte followed by a compact byte encoding of the value. </span>The tag identifies the type of the value and its encoded byte length. The byte encoding drops trailing 0 bytes of the value.<br /><br /><span>IDR uses the most common computer internal representation </span>of data as encoding and thus requires no marshaling work. <br /><br /><span style="font-weight: bold;">Advantages</span> <br /> <br /><span></span><span>Gobs has two major benefits. The first benefit is that the type of data is provided with the value, which allows anyone to decode the values of a message without prior knowledge of its content. </span>The second benefit of such an encoding is that data can be split in blocks anywhere since decoding is processed byte after byte. <br /><br /><span></span>IDR has the advantage of fast and trivial marshaling as in RPC and IIOP. <br /><br /><span style="font-weight: bold;">Disadvantages</span><br /><br /><span></span>The price to pay with Gobs is the additional tag byte and the marshaling work. 
<span>With IDR, it is the code complexity needed to ensure the atomicity of the base values </span>if a data stream needs to be split, and the absence of base value type information with the data. <br /><span></span><br /><span></span><span style="font-weight: bold; text-decoration: underline;">Type encoding</span> <br /><br /><span>Gobs provides the maximum type information with the message so that it is self describing. This makes the encoding more complex since conciseness competes with expressiveness. </span><br /><br /><span>RPC, IIOP and ICE rely on the context to determine the type of encoded data. </span>Since these encodings mainly target use in communication, this optimization makes sense to some extent.<br /><br /><span>IDR precedes any message with a type reference. The type reference is a key to a distributed database similar to the DNS from which a description of the data contained in the message may be obtained. It is possible to obtain a concise form allowing a program to parse the data efficiently </span>or a detailed expressive form with comments to be used by humans. <br /><br /><span>The IDR data type description strategy seems the most efficient because the data type description is written once. But the decoupling of the type description from the data </span>exposes to the risk of losing access to the data description if it gets deleted. <br /><br /><span style="font-weight: bold; text-decoration: underline;">Conclusion</span><br /><span></span><br /><span>There are some good and bad points on both sides and there is no easy way to merge the good points into a new optimal encoding. </span><br /><br /><span></span><span>My experience is that the IDR encoding, while simple and efficient on some aspects, was quite complex to develop. 
</span><br /><br /><span></span><span>Today I still favor IDR's choice because of the marshaling efficiency.</span><span> <span style="font-weight: bold;">Olivier Pisano</span> managed to translate the C++ IDR library to the D language in a very short time. So maybe it is just the conception and validation of IDR that took so much time. </span><span></span><br /><span></span><br /><span>I very much like the smart encoding </span>of the base values in Go, but not the choice to force all floating point values to be encoded as double precision floats (64 bit). I hope they'll change that. <br /><br />There are other differences between IDR and Gob which have not been detailed here. What they have in common is that both may use their encoding to support persistence. IDR may use it with its distributed database. <br /><br /><br /><span></span><br /><span></span> </div>]]></content:encoded></item><item><title><![CDATA[Numbering schema yielding identical lexical and numerical ordering]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/numbering-schema-yielding-identical-lexical-and-numerical-ordering]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/numbering-schema-yielding-identical-lexical-and-numerical-ordering#comments]]></comments><pubDate>Sat, 04 Feb 2012 10:57:53 GMT</pubDate><category><![CDATA[misc]]></category><category><![CDATA[web site]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/numbering-schema-yielding-identical-lexical-and-numerical-ordering</guid><description><![CDATA[ It may be desirable in some situations to be able to assign a numerical reference (integer) to a resource with the particular property that the string representation of the reference preserves the numerical ordering. This blog post presents a numbering method that has this property. The proposed numbering schema achieves this goal and avoids adding zeros or spaces in front of the numbers, thus keeping the strings short. 
The price to pay is that there will be gaps between the numbering  [...] ]]></description><content:encoded><![CDATA[<span class='imgPusher' style='float:left;height:0px'></span><span style=' float: left; z-index: 10; position: relative; ;clear:left;margin-top:0px;*margin-top:0px'><a><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/4815516.jpg" style="margin-top: 5px; margin-bottom: 10px; margin-left: 0px; margin-right: 10px; border-width:1px;padding:3px;" alt="Picture" class="galleryImageBorder" /></a><div style="display: block; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;"></div></span> <div  class="paragraph editable-text" style=" text-align: left; display: block; ">It may be desirable in some situations to be able to assign a numerical reference (integer) to a resource with the particular property that the string representation of the reference preserves the numerical ordering. This blog post presents a numbering method that has this property. <span></span>The proposed numbering schema achieves this goal and avoids adding zeros or spaces in front of the numbers, thus keeping the strings short. The price to pay is that there will be gaps in the numbering sequence. The numbers in these gaps are invalid numbers in this numbering schema and may be easily recognized and used for error detection. <br /><br /><span style="font-weight: bold;">The problem</span>: You probably experienced that sorting a list of strings representing the integer sequence "1", "2", "3", ..., "10", "11", ... "20", "21", ... yields the weird result "1", "10", "11", ... "2", "20", "21", ... "3", ... This shows up, for instance, when naming files by numbers. We get this result because strings are sorted in lexicographical order, which means they are ordered by digit value, one by one from left to right. So in lexicographical order, "10" is smaller than "2", which is the opposite of the numerical order. 
<br /><br />In the situations where this is an unacceptable nuisance, we have a set of solutions to pick from. <br /><br />One is to use a specially crafted sorting algorithm able to detect that it deals with numbers in ASCII representation instead of text strings. In some contexts, changing the sorting algorithm is not possible (e.g. file names).&nbsp; <br /><br />  Another possibility is to add some zeros or spaces in front of the number in its ASCII representation. The problem with this method is to know how many zeros or spaces should be added. There should be at least as many characters as the number of digits in the biggest number we need to represent. In some contexts it is not possible to know the biggest number we will have to deal with, and this introduces a highest value constraint which is preferable to avoid if possible. <br /><br /><span style="font-weight: bold;">The solution</span>:  The proposed solution is to use a numbering schema where we simply prepend to the number its number of digits. For instance the number 234 has 3 digits. This number would then be coded as "<span style="font-weight: bold;">3</span>234" in the proposed schema where the 3 (shown in bold) is added in front of the number.<br /><br /><span>The number is valid if the string contains only digits and the first digit equals the string length minus one. </span>The value 0 is represented as 0. For negative numbers, if you need them, the number of digits must be inserted between the minus sign (-) and the number. <br /><br />There is also an upper limit on the number of digits the number can have. The biggest number that may be represented with this numbering schema is one billion minus one (999999999, nine digits).&nbsp; <br /><br /> With this numbering schema, the sequence "1", "2", "3", ..., "10", "11", ... "20", "21" becomes "11", "12", "13", ..., "210", "211", ... "220", "221" with the digit count added in front. The lexicographical sorting of this number sequence will preserve the numerical order. 
<br /><br />The price to pay is that the numbering sequence is not compact. It has gaps containing invalid numbers (e.g. 23, 123, ...). This may be considered an inconvenience but it also has the benefit of making it possible to detect errors and invalid values. <br /><br /><span></span>Generating such numbers is trivial, as is checking their validity. <br /><br /><span style="font-weight: bold;">Application example</span>: I "invented" this coding schema when looking for an optimal way to numerically reference resources assigned incrementally for a web service (e.g. userId, documentId, imageId, ...). The numbering provides a direct mapping with a numerical table index value as well as a compact string representation. The size of the reference would grow smoothly as needed with the number of references. <br /><br /><span>Another application is as document id in NoSQL databases like CouchDB, MongoDB, etc. It keeps the ids compact and sorted. </span><br /><br /><span style="font-weight: bold;">Using a Base64 like encoding</span><br /><br /><span></span><span>A</span><span> more compact coding would use a Base64 like encoding</span>. Conversion between the ASCII and binary encoding would not be as straightforward, but identifiers would be much more compact and still preserve the sorting of the ASCII and binary representations. <br /><br />To generate such an encoding, split the binary representation in groups of 6 bits, starting from the least significant bit (right most) toward the most significant bit. Then replace all the left most chunks that have all bits to zero with a single chunk coding the number of 6 bit chunks left. For instance&nbsp; ...00000|110010|010011 becomes 000010|110010|010011 because there are only two significant chunks in the number and 2 is encoded with 6 bits as 000010. The last step is to replace each 6 bit chunk in the resulting chunk sequence with the ASCII codes provided in the following table. 
<span></span><br /></div> <hr  style=" clear: both; visibility: hidden; width: 100%; "></hr>  <div ><div class="wsite-image wsite-image-border-thin " style="padding-top:10px;padding-bottom:10px;margin-left:0;margin-right:0;text-align:center"> <a> <img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/1141398_orig.png" alt="Picture" style="width:100%;max-width:578px" /> </a> <div style="display:block;font-size:90%">Mapping between a chunk's 6 bit binary integer value and the ASCII letters used for encoding</div> </div></div>  <div  class="paragraph editable-text" style=" text-align: left; ">The resulting encoding is very similar to Base64 encoding but has the particular property of preserving the sorting order between the chunk integer values and the associated ASCII values, as well as using ASCII codes that may be used in URLs or filenames. Except for the value 0, the ASCII representation will never start with a '-'.&nbsp; <br /><br /><span>Conversion between the ASCII representation and the binary representation is more complicated, especially when it has to be done by humans. </span>A benefit of this coding is that its ASCII representation will be short for small numbers. The ASCII coding will have n+1 letters for numbers with n significant chunks. For up to 24 bit numbers (over 16 million values), the longest ASCII encoding will be 5 letters. 
<span></span><br /><span></span></div>  ]]></content:encoded></item><item><title><![CDATA[Distributed Version Control System (DVCS) usage model]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/development-model-using-distributed-version-constrol-system-dvcs]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/development-model-using-distributed-version-constrol-system-dvcs#comments]]></comments><pubDate>Sun, 28 Mar 2010 11:00:07 GMT</pubDate><category><![CDATA[dvcs]]></category><category><![CDATA[git]]></category><category><![CDATA[suggested reading]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/development-model-using-distributed-version-constrol-system-dvcs</guid><description><![CDATA[Subversion has been my software version control system for years now. It is simple and straightforward but is inappropriate for some usage patterns that require sharing intermediate development code between developers or combining an official release version track with one or more development tracks. Distributed Version Control Systems such as Git, Mercurial or Bazaar solve these problems. The best way to understand this is by reading Vincent Driessen's blog post titled "A successful Git branching [...] ]]></description><content:encoded><![CDATA[<span  style=" position: relative; float: left; z-index: 10; "><a><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/7510252.jpg" style="margin-top: 5px; margin-bottom: 10px; margin-left: 0px; margin-right: 10px; border-width:1px;padding:3px;" alt="Picture" class="galleryImageBorder" /></a><div style="display: block; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;"></div></span><div  class="paragraph" style=" text-align: left; display: block; ">Subversion has been my software version control system for years now. 
It is simple and straightforward but is inappropriate for some usage patterns that require sharing intermediate development code between developers or combining an official release version track with one or more development tracks. <br /><br />Distributed Version Control Systems such as <a href="http://git-scm.com/">Git</a>, <a href="http://mercurial.selenic.com/">Mercurial </a>or <a href="http://bazaar.canonical.com/en/">Bazaar</a> solve these problems. The best way to understand this is by reading Vincent Driessen's blog post titled "<a href="http://nvie.com/git-model">A successful Git branching model</a>". It presents a usage model for Distributed Version Control Systems (DVCS) using git, but it works as well with Mercurial or Bazaar. <br /><br />The <a href="http://hginit.com/">Mercurial tutorial</a> provided by Joel Spolsky gives a very good introduction which explains why DVCSs are better than centralized version control systems like Subversion. <br /><br />I still have to choose between the three. For now my preference is Git for technical reasons. The ergonomic aspect is important too, but for this I usually rely on desktop integrated tools like <a href="http://code.google.com/p/tortoisegit/">TortoiseGit</a>. I'm currently a very happy user of <a href="http://rabbitvcs.org/">RabbitVCS </a>which currently supports only Subversion. 
I hope they will support Git or Mercurial soon.<br /></div><hr  style=" visibility: hidden; width: 100%; clear: both; "></hr>]]></content:encoded></item><item><title><![CDATA[Log structured database]]></title><link><![CDATA[http://www.disnetwork.info/the-blog/log-structured-database]]></link><comments><![CDATA[http://www.disnetwork.info/the-blog/log-structured-database#comments]]></comments><pubDate>Mon, 01 Mar 2010 12:29:31 GMT</pubDate><category><![CDATA[database]]></category><category><![CDATA[dis]]></category><guid isPermaLink="false">http://www.disnetwork.info/the-blog/log-structured-database</guid><description><![CDATA[The distributed information system (DIS) needs a database to store its  information and a simple key value database would do the job. Today, Tokyo Cabinet seems the best  choice for such type of database.Why a log structured database ?  My attention was recently caught by the blog post Damn  cool Algorithms: log structured storage. The white paper presenting  RethinkDB  provides a more exhaustive view of the benefits of this data structure  and some disadvantages too. The LWN.net article Log-str [...] ]]></description><content:encoded><![CDATA[<span  style=" float: left; z-index: 10; position: relative; "><a><img src="http://www.disnetwork.info/uploads/3/8/0/1/38014/695970.jpg" style="margin-top: 5px; margin-bottom: 10px; margin-left: 0px; margin-right: 10px; border-width:1px;padding:3px;" alt="Picture" class="galleryImageBorder" /></a><div style="display: block; font-size: 90%; margin-top: -10px; margin-bottom: 10px; text-align: center;"></div></span><div  class="paragraph" style=" text-align: left; display: block; ">The distributed information system (DIS) needs a database to store its  information and a simple key value database would do the job. 
Today, <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> seems the best choice for this type of database.<br /><br /><span style="font-weight: bold;">Why a log structured database?</span><br />  <br />My attention was recently caught by the blog post <a title="Links  active once published" href="http://blog.notdot.net/2009/12/Damn-Cool-Algorithms-Log-structured-storage">Damn cool Algorithms: log structured storage</a>. The white paper presenting <a href="http://www.rethinkdb.com/papers/whitepaper.pdf">RethinkDB</a> provides a more exhaustive view of the benefits of this data structure and some disadvantages too. The LWN.net article <a title="Links active  once published" href="http://lwn.net/Articles/353411/">Log-structured file systems: There's one in every SSD</a> covers the use of log structure in SSD file systems.<br /></div><hr  style=" width: 100%; visibility: hidden; clear: both; "></hr><div  class="paragraph" style=" text-align: left; ">While surfing the web to get more information on log structured databases, I found the following blog note presenting the experimental <a href="http://www.lshift.net/blog/2009/08/21/yet-another-key-value-database">YDB</a> log structured database with some interesting benchmarks showing that YDB is roughly 5.6 times faster than Tokyo Cabinet and 8 times faster than Berkeley DB with random writes. These numbers justify some deeper investigation.<br /><br />The performance benefit is mainly due to constraining write operations to the end of the file, because read access can benefit from memory caches while writes cannot. With random location writes, the disk writing head needs to move into position (seek) and this has a huge latency compared to transistor state changes or data transmission speed.<br /><br />Reducing disk head movements may thus yield a significant performance increase. 
Note that this won't be true with SSD disks anymore, but other constraints come into play where a log structured database may still be attractive (evenly distributed and grouped writes).&nbsp; <br /><br /><span style="font-weight: bold;">The Record Index</span><br /><br />As you may guess, writing data to the end of the file implies that modified records are copied. The record offset is then modified, which implies an update of the index too. If the index, generally tree structured, is also stored in the log database, this results in a cascade of changes which increases the amount of data to write to disk. <br /><br />This makes log structured databases less attractive, especially if the index is a BTree of record keys. A BTree key index is not very compact and not trivial to manipulate, especially if keys are of varying length.<br /><br />I finally found a better solution derived from reading the white paper <a href="http://www.primebase.org/download/pbxt_white_paper.pdf">The PrimeBase XT Transactional Engine</a> describing a log structured table with <a href="http://en.wikipedia.org/wiki/ACID">ACID</a> properties for an RDBMS table, and more recently the article <a href="http://research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html">Using Uninitialized Memory for Fun and Profit</a> describing a simple data structure to use an uninitialized array. <br /><br />The idea is to use an intermediate <span style="font-style: italic;">record index</span> which is basically a table of record offsets and sizes. The entry index in the table is the record identifier and is used as key to locate the record in the file. The record identifier is associated with a record for its lifetime and may be reused for a new record after the record has been deleted. 
<br /><br /><span style="font-weight: bold;">Benefits of the record index</span><br /><br />The record index is stored as a tree index where non-leaf nodes hold the offsets to the lower level nodes of the tree. Changing an offset in a leaf node will still imply a change in all the nodes up to the root of the tree, but the index is much more compact than a conventional BTree associating the record key with its offset and size. The record identifier doesn't need to be stored in the index because it is its relative position in it.&nbsp; <br /><br />Another benefit of this intermediate record index is that the record key index now refers to the record identifier, which doesn't change when the record is modified. It is then possible to have multiple indexes to the records or to use the record identifier inside the user data to support record reference graphs (i.e. linked lists, etc.).<br /><br />By storing the record identifier along with the record data, the garbage collector or the crash recovery process can easily determine if a record is valid or not. It simply has to compare the record offset and size with the ones found in the record index. If they are the same, the record is the latest valid version. <br /><br /><span style="font-weight: bold;">Snapshots and recovery points</span><br /><br />The dirty pages of the record index only need to be saved at snapshot time. In case of process or system crash, the database should be restored to the last saved snapshot. A snapshot corresponds to a coherent state of the database. A snapshot is saved any time the user closes the database. Restoring the database to some snapshot saved state boils down to truncating the file after the last valid record of the file. <br /><br />If snapshot saving is very frequent and crash recovery very rare, it is possible to use lightweight snapshots. For such a snapshot, only a small record is appended to the record stream, tagging the point in the file where the snapshot occurred. 
When the database is recovered at some saved snapshot point, the recovery process can continue beyond that recovery point by replaying all the changes until the last valid lightweight snapshot. The state of the database can then be restored to the latest lightweight snapshot, but with a slightly bigger effort than a saved snapshot recovery. <br /><br /><span style="font-weight: bold;">Garbage collector</span><br /><br />For the garbage collector (GC) the classical method may be applied, which consists in opening a secondary log file and progressively copying valid records into it in the background while the database is used. A database backup is then as simple as copying the file. <br /><br />When the lifetime of records varies a lot, it might be better to use generational log files, an algorithm used with memory garbage collectors. The idea is to avoid repeatedly copying constant records because of the garbage generated by other records with a short lifetime or frequent changes. To do so, records are grouped according to their change frequency into separate log structured databases.&nbsp; <br /><br />A first log structured database contains all new or changed records. The garbage collector then progresses at the same speed as records are written to the end of the file. Every valid record it finds is copied into a second generation record log file. These records have lasted a GC cycle without a change. Additional generation databases may be added for even slower changing records. <br /><br />The use of multiple log files will induce some disk writing head movements, but this will be balanced by saving the effort to repeatedly copy constant records. <br /><br /><span style="font-weight: bold;">Conclusion</span> <br /><br />It is not my intent to implement this shortly. I just wanted to document the method, which seems to be the canonical way to handle the record index problem and for which I couldn't find a description on the web.<br /> </div>]]></content:encoded></item></channel></rss>