Foreword

This document contains the complete specification draft for the UDT 0 protocol. It is complete, though I might make a few last changes before declaring it finished. For something more usable in implementing your own code to read or write the protocol, see my hacker's reference.

Contents

  1. Overview
  2. Components of the Protocol
    1. Magic Number
    2. Variable-length Quantity Numbers (VarQty)
    3. Metadata
    4. Type Tables
    5. Packing Modes
  3. Type System
    1. Enums
    2. Ranges
    3. Unions
    4. Records
    5. Arrays
    6. Pointers
  4. Type Definitions
  5. Predefined Types
  6. Examples

Overview

The UDT protocol is designed along the principles of compact encoding, strong typing of the data, and enabling the reader to ignore uninteresting data elements for speed purposes. The first two principles go hand in hand, since the more you know about an element's possible values, the more details you can omit when encoding it. The third principle is a bit at odds with the other two, but UDT provides the option to either optimize for size, or for speed, depending on how the data types are specified.

I wrote UDT because I feel that XML fails at all three of these points. The very nature of XML prevents it from being a compact representation, thanks to all-text data, and the redundant type-name in the closing tag. XML does have a data type for every element, but it is just a name, and can only be interpreted by programs that know what the tag names mean. XML Schemas partially solve this problem, but schemas are difficult to use since they are stored in separate files, and have less exact meaning than UDT's types. Finally, in XML, every character must be processed by the reader, which requires lots of unnecessary effort if the user was only interested in a specific element.

Components of the Protocol

The protocol is composed of a Magic number, variable-length quantity numbers which I call the VarQty, fixed-length numbers, metadata blocks, type tables, type definitions, and a set of predefined types. The protocol also defines several packing modes, which can affect alignment of the data, and whether certian parts (like metadata or type tables) will be written for an element. Each of these will be discussed in detail.

Specifically, the sequence of parts in a UDT file are the magic Number, the protocol version, the file's metadata block, the file's type table, the root element's metadata block, the root's type, and then the root element itself, which will likely contain many child elements.

Magic Number

If you are unfamiliar with the concept of the Magic Number, Wikipedia has a nice paragraph about it. The magic number for UDT is "UDT\n\0\0"^D^Z. In hex, the magic number is: 55 44 54 0A 00 00 04 1A That particularly cryptic magic number was chosen because it begins with the letters UDT, has a null terminator as a warning for C-style string misuse, and ends with both a UNIX and a DOS end of file character as a warning for text programs. In short, any program accidentally using text routines to read a UDt stream will fail on the first 8 bytes, rather than randomly at other parts of the stream. Its also 8 bytes long, so you can store it in a 64-bit int and do a compare on it.

Variable-length Quantity Numbers (VarQty)

The 'varQty' type is used throughout the protocol and is sort of a generic way of writing positive integers. For the most part, it is used for the indexes and element counts. Implementing this encoding is essential for using the UDT protocol, but also very easy. Note that this definition is recursive in order to facilitate unbelievably large numbers.

First byteExtension bytesEncoded lengthNumber size
0 [     7     ]
127
1 0 0 [   5   ]
[ 8 ]               
2213
1 0 1 [   5   ]
[   16   ]          
3221
1 1 0 [   5   ]
[      24      ]    
4229
1 1 1 [ ~5=N  ]
[   8 x (N+4)  ]    
5-36232 .. 22048
1 1 1 1 1 1 1 1
[varQty=N] [ 8 x N ]
2-Inf20 .. Inf

2X+8 2X+1 2X 20 (Ordering of bits in the result)
Example...

The "~" on the 5 bits in the second to last case simply means "almost" 5 bits, because one of the possible values (11111b) means something different. The =N part means assign this value the name "N", and then use it in the following formula. Also notice the +4 on that encoding- this is to prevent redundancy with the three previous encodings. In the last encoding, the varQty=N means to read a VarQty at that point and use it in the following formula.

A simple implementation might look like this, assuming you name it ReadVarQty() and have another function named ReadNQty(byte_length):

  1. Read 1 byte of input.
  2. If it is less than 0x80, return it.
  3. else...
    1. make a copy named FLAG, shift right by 5 and AND with 3.
    2. AND the original with 0x1F.
    3. if FLAG is 0, return the next byte ORed with (original shift-left 8).
    4. if FLAG is 1, return ReadNQty(2) ORed with (original shift-left 16).
    5. if FLAG is 2, return ReadNQty(3) ORed with (original shift-left 24).
    6. if FLAG is 3 ...
      1. if original is 0x1F return ReadNQty(ReadVarQty);
      2. else return ReadNQty(original+4);

Writing can be even easier than reading (if you’re not worried about size efficiency) : just always use the N-Byte mode, and choose something that matches your machine’s word size. For instance, if you have 32-bit integers, write 0xE0 followed by the bytes of your integer in big-endian order. Otherwise you can easily run some comparisons to find out how many bytes you need.

Version Number

The version number is a simple VarQty. The initial version of the protocol is #0. There are no major/minor notations, because there are no minor versions of the protocol. If someone wants to add something to the protocol, it had better be important enough to be worth incrementing the main version number. Note that while the number 0 can be encoded in one byte, it can also be encoded as a multiple-byte form of VarQty.

Metadata

In UDT, Metadata is simply a variable-length array of TypeAny, with Normal packing mode. Thus, I refer to them as "Metadata Blocks" since they are always arrays. The type of a metadata block is in fact describable by the protocol itself, and the library uses this definition to read metadata blocks. The type is not given a place on the default type stack, however, and so it doesn't have a type code and can't be used in user data.

Metadata appears before each element in a Tuple (Record or Array) if the current packing mode is Normal, Indexed, or Streamed. It also appears twice at the start of every UDT stream, first to describe the stream itself, and then to describe the root element. An empty metadata element can be written as a simple 0x00 byte. In BytePack and BitPack modes, the metadata is assumed to be empty, and not written at all, saving one byte per element.

Type Tables

Originally, I had placed the type definitions in the metadata blocks. However, it was cumbersome to check each metadata element for a possible type definition, so I moved them to their own table. Type tables are a variable-length array of type Typedef. They appear at the start of each tuple, rather than before each element of a tuple. This prevents a lot of empty type tables getting written for atomic (non-tuple) elements.

Each element of the type table is converted to a type object in the Reader, and pushed onto the type stack (and thus assigned a type code). Type tables have the special property that they are allowed to forward-reference type codes, so long as they become valid within the type table. This makes it possible to use circular references to specify infinite-sized types, which the library must catch and report as an error.

Packing Modes

This is now divided into two concepts: the bit/byte alignment of data in the UDT stream, and the way in which a tuple is encoded. I will refer to the first as "Alignment", and the second as "Tuple Coding Mode"

At this point, the protocol supports two Alignments: Byte-Alignment and Bit-Alignment. When numbers are written whose bit-lengths are not divisible by 8, Byte alignment causes them to be right-aligned in the smallest number of bytes that can contain the number. By right-alignment I mean that the least significant bit of the number is placed in the least significant bit of the least significant byte, which in big-endian encoding is the last byte to be written. With Bit-Alignment, the number is left-aligned, meaning the most significant bits of the number are packed into the least significant bits of any unfinished byte from earlier writes, and then bytes are written as they are filled in this manner. At the end of bit-alignment mode, the unfinished byte, if it exists, is written as-is, with the least-significant bits undefined.

Alignment only affects encoding size if it is applied to a tuple whose elements' bit-lengths are not a multiple of 8. Since Union, Record, Array are the only three types that fall into this category, they are the only ones that have a Alignment property. However, since it wouldn't make much sense to change back to byte-alignment in the middle of a bit-aligned element, this property is encoded as either "Inherit" or "BitAlign". The root element of the UDT stream is byte-aligned, so byte-alignment is used only if all ancestor elements of the current element were specified as "Inherit". However, I don't expect that bit-alignment will be used very often, since there is a small speed penalty, and almost all of the predefined types use "Inherit".

Tuple Coding Modes describe which parts of a tuple are written.
The total list of parts of a tuple are:

ComponentDescription
Element CountThe number of elements in the tuple
Type TableAn array of typedefs which become valid types in the context of this tuple
IndexAn array of numbers describing byte-distance between elements
For each element:
  MetadataArray of freeform data elements describing the element
  TypeType code, or inline type definition
  DataData; encoding determined by the data type

Fields that are predetermined by the type will not be written. For example, in an array of TInt32, no type codes are written for the elements, because the element type is forced by the tuple's type. Another common one is the Element Count, which is never written for record-style tuples, since the number of elements is always known. As a final example, the index only contains sizes of elements whose size can't be determined by their type.

UDT currently supports four Tuple Coding Modes:

ModeTuple Components Written
IndexedAll of the above parts are written (if not predetermined)
NormalAll parts except the index are written. This saves the extra computation needed to calculate the index, but removes the ability to seek to an element of the tuple within the stream.
TightNo Indexes, Type-tables, or Metadata are written.
IndefiniteSame as Normal, except that in tuples where an Element Count is required the elements are "streamed", by using a single boolean bit (or byte, depending on Alignment) before each element to indicate whether there is another element. This is intended to facilitate sending result sets before the number of rows is known.

Type System

UDT has a very robust type system. My goal was to allow the programmer to not only specify all the types that their favorite language had available, and the user-defined types like classes or records, but also describe the kinds of hacks that are commonly done with integers. I wanted to encode these hacks in some way that makes them compatible between systems or between versions of an application. For example, people often encode symbolic constants in unused ranges of their integer values, such as "a number from 0..5000, or INFINITY = -1". In UDT, this could be described as Union(Enum(["Infinity",INFINITY]), Range(TWhole, 0, 5000)) (Note that the encoding of this would differ from the memory representation, but it would be encoded as a simple number, and require the same number of bits.)

UDT has four funamental types (TWhole, TNegWhole, TAny, TType) and six user-definable type classes (Enum, Range, Union, Array, Record, Pointer). TWhole is the set of whole integers, from 0..Inf, and TNegWhole is the opposite. TAny is a theoretical Union of all types (defined and undefined), and TType is the type of a type. Enum is a specific enumeration of either values or symbols. Range is a subset of contiguous elements from another type (which cannot be an array or record, or selector-Union). Union is exactly what the name implies, although it has two different encodings depending on which types were unioned. Array is a tuple whose elements are all the same type, and which can have fixed or variable length. Record is a fixed-length tuple whose elements are keyed with symbols (element names). Pointer is a number describing a bit-offset within the stream, where the referenced value was written.

UDT has a seventh type definition called "Synonym" which is really just an optimization for duplicating a type definition. This is useful for assigning new meanings to existing types, and uses only two bytes (in most cases). UDT also has a moderate-sized list of predefined types, to provide a nice standard set of common types for people to use, and to prevent the basic stuff from needing to be defined in every single UDT stream. These are given in the table at the bottom of this document.

Enums

Enums are the typical deal where a programmer wants to select from a list of values using an integer. In UDT, enumeration lists are either "Symbolic", meaning an array of strings, or an array of values, which can be any valid UDT encoding of data. Values with an Enum type are encoded as you might expect: sinply write the element's index as a fixed-length quantity, using as many bits as are needed for the largest value. For instance, an enumeration of 17 values is written as a 5-bit quantity.

Enums are "scalar" types, meaning that they can be concatenated within a Union type, or used in a Range type.

Since the values used for the enumeration might not be the same as were used by the host language (like, in C you have the option to specify your own integer values for each enum constant) the UDT library creates a two-way mapping from host values to protocol values. This enables enum constants to be shared accross versions of a program where the value wasn't preserved.

Ranges

Ranges are a sort of optimized "subset" type. Ranges simply specify a base type, and then an inclusive range within it. Ranges can be ascending or descending, but only ascending ranges can be infinite.

Ranges are "scalar" types, meaning that they can be concatenated within a Union type, or used in a Range type.

Unions

Unions are simply a type containing the union of all the values that belong to a list of subtypes. There are two internal implementations of unions: concatenation-unions, and selector-unions. The concatenation implementation are chosen whenever the sub-types are all "scalar" and none (except optionally the last one) are infinite. The encoding of a concatenation is done by concatenating the ranges of values of the subtypes, and writing one large quantity which then indicates both the subtype and the value. For instance, suppose type A = Range(TWhole, 0, 1) and B = Range(TWhole, 5, 7). If C = Union([a, b]), then to encode value 6 of type TWhole as a value of type C, the library will write 3.

The selector encoding of a union simply writes a small fixed-length quantity specifying which sub-type the value comes from, followed by the encoding used by that sub-type. This mode is compatible with all sets of sub-types, and is used whevener the types can't be encoded using concatenation.

Records

Records are a kind of tuple where the number of elements are known, the type of each element is known (or possibly TAny), and each element is identified by a string. Records (as with Arrays) have a Tuple Coding Mode, which determines which bits of metadata get written, and determine the general efficiency of the encoding.

Arrays

Arrays are a tuple where all elements have the same type (possibly TAny), and the number of elements is optionally a constant.

Pointers

Pointers are a VarQty that specify a bit-address within the UDT stream. Pointers are typed, with the exception that TAny may point to values of any type (rather than just values declared under a TAny field). Pointers use a bit-address, since some data elements could be bit-packed inbetween other elements.

Resolving pointers within a UDT Stream can be tricky, and is currently not implemented. One approach would be to make two passes through the stream; first to find which addresses matter, and second to record which elements were at those addresses. Another approach is to optimize the first method by adding hints in the metadata about which elements are the targets of pointers, but this requires more work during encoding. However, the ONLY approach, currently, is to not use pointers and instead try to arrange data as a tree. ;-)

Type Definitions

User-defined types are encoded into the type tables of a UDT stream using the UDT protocol itself. Therefore, each definable type has a corresponding type describing its definition, and TType is the union of all these typedef types.

The following are the definitions used internally by the protocol to encode type definitions:

Type NameDefinition
TType Union([ TEnumDef, TRangeDef, TUnionDef, TBitAlignUnionDef, TRecordDef, TArrayDef, TPointerDef, TSynonymDef ])
TEnumDef Union( [
   Array(Tight, VarLength, TAny),
   Array(Tight, VarLength, TIdentifier)
])
TRangeDef Record(Tight, [
   ["BaseType", TType]
   ["From", TWhole]
   ["To", TWholeOrInf]
])
TUnionDefArray( Tight, VarLength, TType )
TBitalignUnionDefSynonym(TUnionDef)
TRecordDef Record( Tight, [
   ["Mode", TupleCodingMode],
   ["Fields", Array(Tight, VarLength, TFieldDef ) ],
])
TArrayDef Record( Tight, [
   ["Mode", TupleCodingMode],
   ["Length", TOptionalLen],
])
TPointerDefSynonym(TType)
TSynonymDefSynonym(TType)
TAlignmentAndMode Record( Tight&BitAlign, [
   ["Alignment", TAlignment],
   ["Mode", TTupleCodingMode ],
])
TWholeOrInfUnion([ Enum(["Infinity"]), TWhole ])
TFieldDef Record( Tight, [
   ["Name", TIdentifier],
   ["Value", TAny],
])
TAlignmentEnum(["Inherit", "BitAlign"])
TTupleCodingModeEnum(["Indexed", "Normal", "Tight", "Indefinite"])
TOptionalLenUnion([ Enum(["VarLength"]), TWhole ])

Note that TBitalignUnionDef is a slight hack. The proper thing to do would be to make Union a Record, with fields of Alignment and SubTypes. However, by making a separate type, I save a byte on union typedefs (and a little complexity), and the the hacker inside of me couldn't pass it up.

Predefined Type List

UDT writes numeric codes to specify types. The code is the index of that type in the current type stack. The following types are pushed onto the type stack before the first element is read:

CodeType NameDescription
0x00TInlineTypedef[not actually a type] Used to indicate inline typedef instead of a type code.
0x01TAny [fundamental type] Encoded as {TypeCode, Data}
0x02TType [fundamental type] Encoded as a VarQty of the type code. (or an inline typedef)
0x03TWhole [fundamental type] Encoded as a VarQty
0x04TNegWhole [fundamental type] Encoded as a VarQty
0x05TNull Enum([])
0x06TBool Enum(["False","True"])
0x07TByte Range(TWhole, 0, 0xFF)
0x08TInt8 Union( Range(TWhole, 0, 0x7F), Range(TNegWhole, 0x80, 1) )
0x09TInt16u Range( TWhole, 0, 0xFFFF)
0x0ATInt16 Union( Range(TWhole, 0, 0x7FFF), Range(TNegWhole, 0x8000, 1) )
0x0BTInt32u Range( TWhole, 0, 0xFFFFFFFF)
0x0CTInt32 Union( Range(TWhole, 0, 0x7FFFFFFF), Range(TNegWhole, 0x80000000, 1) )
0x0DTInt64u Range( TWhole, 0, 0xFFFFFFFFFFFFFFFF)
0x0ETInt64 Union( Range(TWhole, 0, 0x7FFFFFFFFFFFFFFF), Range(TNegWhole, 0x8000000000000000, 1) )
0x0FTBigInt Synonym(TByteArray)
0x10TFloat32 Record(TupleCoding.TIGHT, [
    ["Sign", Enum(["Positive", "Negative"])],
    ["Exponent", Union([
        Enum(["Denormalized"]),
        Range(TNegWhole, 0xFE, 1),
        Range(TWhole, 0, 0xFE),
        Enum(["Inf/NaN"])
    ])],
    ["Fraction", Range(TWhole, 0, 0x7FFFFF)]
], Alignment.BITPACK)
0x11TFloat64 Record(TupleCoding.TIGHT, [
    ["Sign", Enum(["Positive", "Negative"])],
    ["Exponent", Union([
        Enum(["Denormalized"]),
        Range(TNegWhole, 0x3FE, 1),
        Range(TWhole, 0, 0x3FE),
        Enum(["Inf/NaN"])
    ])],
    ["Fraction", Range(TWhole, 0, 0x1FFFFFFFFFFFFF)]
], Alignment.BITPACK)
0x12TUTF8 Array(Packing.TIGHT, VAR_LENGTH, TByte)
0x13TUTF16 Array(Packing.TIGHT, VAR_LENGTH, TInt16u)
0x14TIdentifier Synonym(TUTF8)
0x15TTypeName Synonym(TIdentifier)
0x16TBoolArray Array(Packing.TIGHT, VAR_LENGTH, TBool, Alignment.BITPACK)
0x17TByteArray Array(Packing.TIGHT, VAR_LENGTH, TBool)
0x18TStringArray Array(Packing.TIGHT, VAR_LENGTH, TUTF8)
0x19TArrayOfAny Array(Packing.NORMAL, VAR_LENGTH, TAny)
0x1ATStreamOfAny Array(Packing.INDEFINITE, VAR_LENGTH, TAny)
0x1BTMapEntry Record(Packing.TIGHT, [["Key", TAny], ["Value", TAny]])

Note that most of these types have optimized encoding and decoding routines, and may be handled differently by the UDT library than if you were to re-spacify the type. Making synononyms of these types grants the same optimized implementation to the new type.

Examples