Versioned Nodes Byte Trees

Byte trees are a general purpose format for storing byte-oriented data in versioned nodes.

Format

The general idea is a tree of nodes. Each node may hold some data in its body, which is logically followed by the data held by the nodes linked to by its parts.

Body

The body directly holds zero or more bytes of the stored data.

Parts

The data section of each part is an 8-bit variable length quantity, indicating the total length to use of the linked node’s data. As the size must be greater than zero, the stored value is one less than the actual value.

Note: That value may not match the actual size of the data under the linked node. Extra data is ignored and insufficient data is padded with zeros, much like holes in a sparse file.

VLQ8

VLQ8 is a non-self-terminating encoding for natural numbers. Conceptually it maps natural numbers to known-length byte strings sorted from shortest to longest, and then lexiographically. That is, 0 is the empty byte string, 1 is the byte string containing a single zero byte, 257 is the byte string containing two empty bytes, and so forth.

toVLQ8 = rec [] where
  rec xs x = if x == 0 then xs else
    rec (((x-1) `mod` 256) : xs) ((x-1) `div` 256)

Links

Links should be empty.

Discussion

There are several possible strategies for writing data into a byte tree. Branch nodes may or may not contain data and the branching style is arbitrary.

Storing data in the branch nodes decreases both the total number of nodes and the time to access data near the beginning (and scattered throughout). However, that also increases write costs, as that data must be rewritten when its children change.

All the usual tradeoffs apply to the branching styles - balanced trees have better worst-case charactaristics, but specifically unbalanced trees (like leaf nodes as the first and last children of each branch) can handle some common access patterns faster (on average).

published 2025-06-14 updated 2025-07-07

← Versioned Nodes Prior Art
There are a variety of similar systems and designs to versioned nodes. However, they all are designed for narrower applications, and do not quite implement the paradigm.