Archive for November, 2006

Sorting IP Addresses

Thursday, November 30th, 2006

Sorting IP addresses in the commonly used decimal dotted quad format, e.g., is a problem because the individual components need not be zero-padded. An ordinary lexicographic sort therefore produces the wrong result. For example, should sort before but a standard sort will produce the opposite ordering because the leading “1″ of “160″ ranks ahead of the “2″ of “26″.

A solution to this problem using GNU sort is given by Paul Heinlein here. His approach is to break the address into four fields, using each as a separate key. This works nicely so long as the data that one wishes to sort consist of nothing but IP addresses. Unfortunately, it doesn’t work if the IP addresses are just one field of a more complex record, so that the key selection facilities of the sort program must be used to isolate the IP addresses.

If the only tool available is a standard sort utility like GNU sort, the only way around this is to write an ad hoc program to convert the IP addresses to integers, so that becomes 184 + (256 * 160) + (256 * 256 * 130) + (256 * 256 * 256 * 158) = 2,659,360,952 and becomes 2,659,326,648, or to restructure the record with the individual components of the IP address unpacked as separate fields. An easier approach is to use a sort utility such as msort that provides the option of hybrid numeric/lexicographic comparison.

A hybrid comparison is one that breaks strings containing a mixture of numbers and other characters into the parts that represent integers and the rest, treats each piece as a separate key, and compares the integral parts numerically but the others lexicographically. This is the sort of comparison that yields the desired results with things like section numbers, where, for example, we want “9a” to sort before “9b” and that in turn to sort before “10a”. A hybrid comparison breaks these each into two fields, e.g. 9 a and 10 b. It first compares 9 with 10 numerically, so that 9 comes first.

A hybrid comparison applied to IP addresses has the effect of breaking them into their components, e.g. 158 . 130 . 26 . 184. Each IP address is broken into seven fields, but since the second, fourth, and sixth fields are the same in all cases, only the numerical parts influence the ordering, and since they are ranked numerically, the lack of zero-padding does not matter.