Notes on Sorted Data
December 1, 2024 · Amit Prasad
When building systems that need to store and compare data in a sorted order, we often run into challenges with how to represent our data while preserving the ordering we want. Here are some common patterns (and pitfalls) I’ve observed when working with ordered data.
For most of this post, we’ll be considering a generic store that needs to store, compare, and retrieve data which is stored as “raw bytes” or “byte strings”. Bytes are typically compared by value, which leads to a “byte-lexicographical” ordering. This is a common comparator method seen in real-world databases, KV-stores, and other data systems.
Integers
Integers are generally easy to store and compare… right? Well, not always!
For starters, fixed-width integers are simple to store but can waste space for small numbers — which are often prevalent, depending on your data. More importantly, if you’re dumping integers directly from memory, you might run into endianness issues when reinterpreting integers as arrays of bytes during comparisons/ordering. A 32-bit integer, 0x12345678, could be stored as:
- 0x12 0x34 0x56 0x78 in big-endian or “network order” (used in TCP, older architectures)
- 0x78 0x56 0x34 0x12 in little-endian (used by both x86 & ARM, among others)
Comparing byte-by-byte in little endian wouldn’t preserve ordering between two numbers!
Alright, big-endian. But we’re still storing 32 bits (or 64, or 128!) for every number, no matter how big the number actually is. Here, we borrow ideas from serialization (“wire”) formats, like Protocol Buffers, Cap’n Proto, or Flatbuffers, and use variable-length integers (varints).
Here’s a typical varint implementation, using the protobuf-style encoding where each byte uses its most significant bit as a continuation flag:
Value | Encoded | First Byte |
---|---|---|
0x100000 | 0x80 0x80 0x40 | 0x80 |
0x2001 | 0x81 0x40 | 0x81 |
While 0x100000 > 0x2001, their encoded forms don’t preserve this ordering when compared byte-by-byte: 0x80 < 0x81, so the comparison concludes the opposite. Because the least-significant group of bits comes first, the first byte of the encoded form doesn’t contain information about the “magnitude” of the number, which completely breaks ordering.
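To make the failure concrete, here’s a minimal sketch of a protobuf-style varint encoder (the function name is mine, not from any particular library), using a pair of values where byte comparison gets the order backwards:

```python
def varint_encode(n: int) -> bytes:
    """Protobuf-style varint: 7 bits per byte, least-significant group
    first, with the MSB of each byte set when more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

a = varint_encode(0x100000)  # b'\x80\x80\x40'
b = varint_encode(0x2001)    # b'\x81\x40'
# 0x100000 > 0x2001, but the encodings compare the other way around:
assert a < b
```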
Here’s a better approach that preserves ordering:
- Count the minimum number of bytes needed to represent the number (i.e., the number of bytes needed to store the highest set bit)
- Store this count in the first byte
- Store the number in the remaining count bytes, in big-endian order
For example:
number: 0x0B_B8 (3000)
bytes: 2
encoded: [0x02, 0x0B, 0xB8]
This is called length-prefixing (or “length-delimiting”, in other contexts).
This encoding preserves ordering:
- If the length-prefix of one number is smaller, the number itself must be smaller, and vice-versa.
- If the length-prefixes are equal, we’re simply comparing the following bytes as numbers. As long as we’re using big-endian encoding, this works out.
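As a sketch, assuming unsigned integers of at most 255 bytes (so the count fits in the prefix byte):

```python
def encode_ordered(n: int) -> bytes:
    """Length-prefix + big-endian payload: byte-lexicographic order
    on the encodings matches numeric order on the values."""
    length = max(1, (n.bit_length() + 7) // 8)  # bytes to hold the highest set bit
    return bytes([length]) + n.to_bytes(length, "big")

assert encode_ordered(3000) == b'\x02\x0b\xb8'
# Plain byte comparison now agrees with numeric comparison:
vals = [0, 1, 255, 256, 3000, 0x2000, 0x100000]
assert sorted(vals, key=encode_ordered) == sorted(vals)
```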
However, we’re now spending an entire extra byte on the prefix. Most languages support at most 128-bit numbers. 128 bits is only 16 bytes, so let’s shrink the length prefix to match. A 5-bit prefix is awkward to work with and wastes the values 17–31 that it could express, so let’s use 4 bits instead and see how that works out:
- Count the minimum bytes needed to represent the number, as before, but with the top 4 bits excluded from the calculation
- Store this count in a 4-bit prefix
- Store the top 4 bits of the number in the remaining bits of the first byte
- Store the remaining bits of the number in the following bytes
For example:
number: 0x0B_B8 (3000)
bytes: 2 (but the top 4 bits move into the first byte, so subtract 1) -> 1
first byte: 0x1 << 4 | 0xB = 0x1B
encoded: [0x1B, 0xB8]
Since we can only represent a length of 15 bytes (using 4 bits), we can’t store 128-bit numbers, but we can store 124-bit (120 + 4 from the first byte) numbers with this (simple) format. If we really wanted the full 128-bit precision, at the cost of an extra byte when the number uses most of the 128 bits, we could apply the continuation-bit approach when the length prefix is 0b1111 = 15, and only in the last byte of the number. This means we can store 123-bit numbers easily, but we use an extra byte for 124- to 128-bit numbers.
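Here’s one way the basic 4-bit-prefix scheme might look (a sketch; the helper name and exact bit-packing are my own, and the 15-byte extension is left out):

```python
def encode_nibble_prefix(n: int) -> bytes:
    """4-bit length prefix and 4 bits of payload share the first byte;
    the remaining bytes are big-endian. Handles 0 <= n < 2**124."""
    assert 0 <= n < 1 << 124
    # Bytes needed beyond the 4 payload bits carried by the first byte.
    extra = max(0, (n.bit_length() - 4 + 7) // 8)
    first = (extra << 4) | (n >> (8 * extra))  # prefix nibble | top nibble
    rest = n & ((1 << (8 * extra)) - 1)
    return bytes([first]) + rest.to_bytes(extra, "big")

assert encode_nibble_prefix(3000) == b'\x1b\xb8'
vals = [0, 15, 16, 3000, 0x2000, 0x100000]
assert sorted(vals, key=encode_nibble_prefix) == sorted(vals)
```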
If our numbers were much larger, we could length-prefix our length-prefix, a technique reminiscent of “bootstrapping” from more theoretical CS contexts.
Signed (Negative) Integers
If you’re storing negative numbers and want them to be correctly ordered, be sure to keep in mind how negative numbers are represented in memory. Most systems use “two’s complement”, with the most significant bit as a sign bit. This means that -1 is represented as 0b1...1, and -2 as 0b1...0. If we compare the bytes directly, we’d find that (in 8-bit) -1 looks like 255, whilst 0 is still 0, and a naive byte comparison would find that -1 is greater than 0! To work around this, we want to remap signed integers to unsigned integers such that comparing the bytes directly will give us the correct ordering.
How do we do this remapping? At first glance, we could shift everything up by the magnitude of the minimum representable negative number, so that MIN becomes 0 and MAX (signed) becomes MAX (unsigned). For example, in 8-bit, the minimum representable signed number is -128, so we’d add 128 to all signed numbers to remap them to unsigned numbers.
Unfortunately, two’s complement means that we can’t actually represent +128 in a signed 8-bit type, so we’d have to do some error-prone type-casting shenanigans: subtract the minimum (- -128), or add the maximum plus one in two operations (+ 127 + 1). Alternatively, we can simply XOR with the minimum, num ^ MIN, which flips the sign bit and can be shown to achieve the same result.
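A sketch of the remapping in 8-bit (masking emulates fixed-width unsigned arithmetic, since Python integers are unbounded):

```python
BITS = 8
SIGN = 1 << (BITS - 1)  # 0x80: the bit pattern of MIN (-128) in 8-bit

def remap_signed(n: int) -> int:
    """Map a signed value to an unsigned one so that unsigned (byte)
    comparison matches signed order. XOR with the sign bit is the same
    as subtracting MIN modulo 2**BITS."""
    return (n & ((1 << BITS) - 1)) ^ SIGN

assert remap_signed(-128) == 0    # MIN  -> lowest
assert remap_signed(-1) == 127
assert remap_signed(0) == 128
assert remap_signed(127) == 255   # MAX (signed) -> MAX (unsigned)
vals = list(range(-128, 128))
assert sorted(vals, key=remap_signed) == vals
```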
Decimals and Floating Point
Floating point numbers are tricky because their binary representation doesn’t match their natural ordering. The IEEE-754 format places the sign bit first (like two’s complement), meaning negative numbers appear after positive ones when comparing raw bytes. Additionally, we have to deal with the exponent after the sign bit. A negative number with a large exponent is “more negative”, or smaller, even if it appears bigger.
Thus we end up with the following rules for floats (treating the raw bits as an unsigned integer):
- If positive, do as we did with integers and simply num ^ MIN, flipping the sign bit to remap the positives “higher” in the byte ordering
- If negative, we need to reverse the ordering so that larger exponents (and mantissas) compare smaller. Flipping every bit (~num) does the trick: it reverses the ordering among the negatives, and since the sign bit becomes 0, it also places them all below the positives
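Putting both rules together for 64-bit doubles (a sketch of this well-known trick; NaNs are deliberately left out):

```python
import struct

def float_key(x: float) -> bytes:
    """Order-preserving byte key for an IEEE-754 double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits & (1 << 63):
        bits ^= (1 << 64) - 1  # negative: flip every bit
    else:
        bits ^= 1 << 63        # positive: flip just the sign bit
    return struct.pack(">Q", bits)

vals = [-3.5, -1.0, -0.5, 0.0, 0.5, 1.0, 3.5]
assert sorted(vals, key=float_key) == vals
```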
Arbitrary Data (Strings, Bytes, etc.)
When storing arbitrary data (like strings) that needs to be compared, length-prefixing seems attractive but has a critical flaw. Consider:
"abcd" → "4abcd"
"def" → "3def"
While “abcd” < “def” lexicographically, their length-prefixed versions compare incorrectly because “4” > “3”, and our comparator preemptively concludes that “abcd” > “def”.
So length-prefixing is out if we need ordering. Instead, borrow from C, and use a terminator byte (or bytes). This comes with the annoying caveat that you can’t use the terminator in your data without further escaping / special handling, though.
In fact, a null-terminator is really nice: We can use C-style strings directly, and we can use the same method directly in composite data (tuples, etc.), which we’ll see below.
Composite Data (Tuples, Structs, etc.)
Consider the following tuples:
("12", "34")
("123", "4")
Serialized, we would get something like the following bytes:
# ASCII: 0x31 = '1', 0x32 = '2', ...
[0x31, 0x32, 0x33, 0x34]
[0x31, 0x32, 0x33, 0x34]
Both cases are indistinguishable! Typical definitions of tuple ordering would compare the first element, then the second, and so on, which would have our first tuple be lexicographically-before the second. How can we fix this?
Null terminators! We can use the same idea from strings to separate our composite data. Since the null terminator is always considered less than any other byte (other than itself), we can use it to separate elements of the tuple in a way that causes ordering to be enforced:
[0x31, 0x32, *0x00*, 0x33, 0x34]
[0x31, 0x32, 0x33, *0x00*, 0x34]
Now, when we compare the two, we’ll see the null-terminator in the first tuple and immediately conclude that it’s lexicographically-before the second tuple. We may still have issues if our data can contain null characters.
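A sketch of the tuple encoding, with one common fix for embedded nulls: escape 0x00 inside the data as 0x00 0xFF (this is what FoundationDB’s tuple layer does), which keeps the lone 0x00 terminator comparing smaller than any escaped null:

```python
def encode_tuple(parts: tuple) -> bytes:
    """Null-terminate each element, escaping embedded 0x00 as 0x00 0xFF
    so that the data itself may contain nulls."""
    out = bytearray()
    for p in parts:
        out += p.replace(b"\x00", b"\x00\xff")
        out += b"\x00"  # terminator: compares below every escaped byte
    return bytes(out)

a = encode_tuple((b"12", b"34"))  # b'12\x0034\x00'
b = encode_tuple((b"123", b"4"))  # b'123\x004\x00'
assert a < b  # ("12", "34") correctly sorts before ("123", "4")
```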
This works, as opposed to the following alternative methods:
- Length-prefixing arbitrarily-sized data (as we saw earlier) has the advantage of being able to jump quickly past each part, but is not ordered
- Using some other human-readable separator (e.g. / or .) is useful for e.g. URLs, but is not necessarily ordered as correctly as a null-terminator
That about covers most of the interesting (and common) patterns that I’ve come across. If you have personally seen any interesting patterns worth including here, or have any suggestions/corrections, please let me know.