Algebraic Tag Length Value
atlv
is a generic binary data encoding that preserves the data’s
leaf-spine structure while providing customizable tags and dictating
minimal semantics.
Unlike traditional TLV encodings there are no hard upper bounds and the complete tree structure can be parsed without knowledge of the types. Unlike JSON there is significant extensibility with tagging, every distinct value has a single representation, and it includes first class support for binary data and tagged unions. Unlike CBOR and ASN.1 it is extremely simple, providing only the four types necessary to compactly encode arbitrary structures and data.
Overview
value: binary | array | union
binary: len=vlq(00) byte[len]
array: len=vlq(01) value[len]
union: tag=vlq(10) value
vlq(YY): YYxxxxxx | 11xxxxxx vlq(YY)
byte: xxxxxxxx
Each atlv
value starts with a VLQ. The two high bits in the last
byte of the VLQ determine whether the value is a binary, union, or
array:
- 00 - binary - the VLQ is a length stating how many bytes follow.
- 01 - array - the VLQ is a length static that many values follow.
- 10 - union - the VLQ is a tag.
Theory
There are four fundamental building blocks of algebraic data types:
Void
corresponds to 0, and represents impossible data, as such it has no representation.Unit
corresponds to 1, it represents trivial data.Sum
corresponds to addition, it represents data with two alternatives.Product
corresponds to multiplication, it represents data with two parts.
Sum and product can be straightforwardly extended to an arbitrary number of alternatives/parts. A side effect of this is that a sum of zero alternatives in the same as a void, and a product of zero parts is the same as a unit.
Modern computers, on the other hand, are built on sequences of 8-bit bytes. While this is conceptually equivalent to a 256-way sum, or the product of two 16-way sums, four 4-way sums, or eight 2-way sums, of units, it is much more compact to encode byte-oriented data directly.
This results in the three constructors of atlv. Binaries handle byte vectors. Arrays handle generalized products. Unions handle generalized sums.
Details
binary
binary: len=vlq(00) byte[len]
binaries encode binary blobs, strings, numbers, enumerated values, and similar variable length values.
there are two general approaches to encoding unsigned numbers with atlv. the simple approach only works for fixed-width numbers, store the number directly in fixed-width big-endian. this is often faster when the numbers are evenly distributed. the other approach is shorter on average, but requires more transcoding. use a variable length quantity with all eight bits of each byte. (the length is encoded first, so there’s no need to mark the last value). this has the advantage of working with variable length (and simply different) number sizes.
union
union: tag=vlq(10) value
unions wrap a tag around another value. this makes them natural for encoding sum types,
array
array: len=vlq(01) value[len]
arrays encode “normal” containers for arrays, records, key-value pairs, etc.
Variable Length Quantities
vlq(YY): YYxxxxxx | 11xxxxxx vlq(YY)
Variable-length quantities encode all the simple quantities in atlv
.
This allows most lengths and tags to be encoded as a single byte, while
not enforcing a fixed maximum.
These variable-length quantities are a sequence of one or more bytes, where the initial bytes have the high two bits set, and the final byte has at least one of them cleared. When interpreted as a quantity, each byte is taken a base-64 digit, with the initial digit being the most significant. Redundancy is eliminated offsetting the raw value in multi-digit sequence to start at one more than maximum value of the sequence one digit shorter.
So where the largest single byte sequence is 3f
(encoding 63), the
smallest double byte sequence is c0 00
(encoding 64), and the largest
double byte sequence is ff 3f
(encoding 4158).
The top two bits in the final byte of a vlq encodes one of three values, which is used to distinguish the three constructors.
One side feature of this VLQ encoding is that the sort order of the quantities is preserved.