try·st·imu·li

Algebraic Tag Length Value

atlv is a generic binary data encoding that preserves the data’s leaf-spine structure while providing customizable tags and dictating minimal semantics.

Unlike traditional TLV encodings there are no hard upper bounds and the complete tree structure can be parsed without knowledge of the types. Unlike JSON there is significant extensibility with tagging, every distinct value has a single representation, and it includes first class support for binary data and tagged unions. Unlike CBOR and ASN.1 it is extremely simple, providing only the four types necessary to compactly encode arbitrary structures and data.

Overview

value:   binary | array | union
binary:  len=vlq(00) byte[len]
array:   len=vlq(01) value[len]
union:   tag=vlq(10) value
vlq(YY): YYxxxxxx | 11xxxxxx vlq(YY)
byte:    xxxxxxxx

Each atlv value starts with a VLQ. The two high bits in the last byte of the VLQ determine whether the value is a binary, union, or array:

  • 00 - binary - the VLQ is a length stating how many bytes follow.
  • 01 - array - the VLQ is a length static that many values follow.
  • 10 - union - the VLQ is a tag.

Theory

There are four fundamental building blocks of algebraic data types:

  • Void corresponds to 0, and represents impossible data, as such it has no representation.
  • Unit corresponds to 1, it represents trivial data.
  • Sum corresponds to addition, it represents data with two alternatives.
  • Product corresponds to multiplication, it represents data with two parts.

Sum and product can be straightforwardly extended to an arbitrary number of alternatives/parts. A side effect of this is that a sum of zero alternatives in the same as a void, and a product of zero parts is the same as a unit.

Modern computers, on the other hand, are built on sequences of 8-bit bytes. While this is conceptually equivalent to a 256-way sum, or the product of two 16-way sums, four 4-way sums, or eight 2-way sums, of units, it is much more compact to encode byte-oriented data directly.

This results in the three constructors of atlv. Binaries handle byte vectors. Arrays handle generalized products. Unions handle generalized sums.

Details

binary

binary:  len=vlq(00) byte[len]

binaries encode binary blobs, strings, numbers, enumerated values, and similar variable length values.

there are two general approaches to encoding unsigned numbers with atlv. the simple approach only works for fixed-width numbers, store the number directly in fixed-width big-endian. this is often faster when the numbers are evenly distributed. the other approach is shorter on average, but requires more transcoding. use a variable length quantity with all eight bits of each byte. (the length is encoded first, so there’s no need to mark the last value). this has the advantage of working with variable length (and simply different) number sizes.

union

union:   tag=vlq(10) value

unions wrap a tag around another value. this makes them natural for encoding sum types,

array

array:   len=vlq(01) value[len]

arrays encode “normal” containers for arrays, records, key-value pairs, etc.

Variable Length Quantities

vlq(YY): YYxxxxxx | 11xxxxxx vlq(YY)

Variable-length quantities encode all the simple quantities in atlv. This allows most lengths and tags to be encoded as a single byte, while not enforcing a fixed maximum.

These variable-length quantities are a sequence of one or more bytes, where the initial bytes have the high two bits set, and the final byte has at least one of them cleared. When interpreted as a quantity, each byte is taken a base-64 digit, with the initial digit being the most significant. Redundancy is eliminated offsetting the raw value in multi-digit sequence to start at one more than maximum value of the sequence one digit shorter.

So where the largest single byte sequence is 3f (encoding 63), the smallest double byte sequence is c0 00 (encoding 64), and the largest double byte sequence is ff 3f (encoding 4158).

The top two bits in the final byte of a vlq encodes one of three values, which is used to distinguish the three constructors.

One side feature of this VLQ encoding is that the sort order of the quantities is preserved.

published