try·st·imu·li

13.79508

every so often i need to render some data in human-typeable form. sometimes that leads me to my one-word-per-byte encoding. other times it leads me to the various n characters per m bytes encodings.

assuming we are encoding evenly distributed data, and that we want to be able to seek within that data without reading from the front, we generally break the text to encode or decode into fixed-sized blocks. each block gets transformed into another fixed size. the ratio of those fixed sizes is approximately ratio of the logarithm of the alphabet sizes.

we can rearrange that to find alphabet sizes for different block ratios, 256 (the alphabet size of a byte) raised to various powers, rounded up.

1 2 3 4 5 6 7 8
1 256
2 16 256
3 41 256
4 16 64 256
5 28 85 256
6 16 41 102 256
7 24 53 115 256
8 16 32 64 128 256
9 22 41 75 138
10 16 28 49 85
11 21 35 57
12 16 26 41
13 20 31
14 16 24
15 20
16 16

the ones i see around the most are base 16, 64, 32, and 85. there are at least two proposals for base 41 around.

base 28? the only code i see out there turns seven nibbles into six digits, which could also be done with base 26. but there are some nice properties available - case-insensitivity is with an alphabet of “@ABCDEFGHIJKLMNOPQRSTUVWXYZ[” or “`abcdefghijklmnopqrstuvwxyz{”.

i want:

  1. fast and simple constant time implementation
  2. embeddable in url fragments without further encoding
  3. preserves sorting
  4. relatively easy to narrate
  5. compact as possible given the other constraints
  • base 16 (a-p or e-t) fulfills everything except 5
  • base 24 (a-x) is a bit more complicated but a bit more compact
  • base 28 (@-[?) is equally complicated, and a bit more compact
  • base 41 could be done with (A-Oa-z) or similar, still more complicated, and a bit harder to narrate
published