So, what is it?
The "Universal data tree" (UDT) project is the combination of a file format / communication protocol, a
library that reads it, and tools that use the library to perform common tasks. The file format is designed
to be flexible in representing data of just about any kind, the library should allow programmers to easily
read and write this format, and the tools are meant to encourage people to use this format for all their
data storage needs by making it easy to see/modify/use the data inside.
Who is this for?
UDT is for programmers, mainly. In a broader sense, it might be for users, in the same way that
HTML is used by the general population, or in the way that Windows’ regedit.exe is for users.
I also hope that in time, UNIX systems will adopt it as a replacement for text data, for example by
printing 'ls' output in UDT and adding UDT support to shells so that they can display it automagically.
Why should I care?
Well, assuming you are the intended audience, you have probably needed to store data in a file and
then read it back out. If you are any kind of an application developer, you have needed
to do this on a large scale in probably every project you have ever worked on. Some simple examples
are things like storing "last window location" settings, or writing a list of macros into
the settings file, or reading a slew of user preferences from their personal config file.
If you have developed anything that runs exclusively on Windows, you may have discovered how
addictively convenient it is to stuff this junk into the registry rather than invent yet another file
format or write boring tedious code to smash binary data into a text format like INI or XML (and then
get it back out). I have gotten rather attached to the registry, but its three main weaknesses are that
- it is shared among every app in the system, which makes it unsuitable for storing large content,
- it is usable exclusively on Windows platforms,
- and the data cannot be transferred in the binary format (it is usually imported/exported in a text
protocol similar to INI).
But doesn’t XML do this already?
Yes, but slowly, inefficiently, and crudely. XML is great, as long as your data is almost exclusively
text with little formatting. Storing any general data types, like integers, floating point numbers,
image data, sound data, or complex data types with named members like "struct Rectangle { int top,
bottom, left, right; };" can be downright painful. In memory on a 32-bit system, that struct would
occupy 16 bytes. The following string of XML takes 57 bytes, not to mention that if you have to embed
it within another XML document without having it interpreted as part of that document (like I did in
writing this page) it requires a significant amount of escaping, bringing the total up to 103 characters.
<rectangle top="23" bottom="800" left="100" right="1024">
So, given that UDT allows me to define custom types and then use them in a binary fashion, that rect
would only use 17 bytes in a UDT file. UDT has no escaping requirements, so embedding UDT data takes
no effort or cost at all.
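The byte counts quoted in this answer are easy to check. The sketch below is only a sanity check, not UDT code: it assumes the struct's four ints are 32-bit and little-endian, and it uses Python's standard html.escape as a stand-in for the escaping rules. The 17-byte UDT figure is not reproduced here, since that depends on the actual protocol encoding.

```python
import html
import struct

# The XML string quoted in the text.
xml = '<rectangle top="23" bottom="800" left="100" right="1024">'

# Escape it for embedding inside another XML/HTML document:
# '<' becomes '&lt;', '>' becomes '&gt;', '"' becomes '&quot;'.
escaped = html.escape(xml, quote=True)

# The same rectangle as four packed 32-bit integers
# (top, bottom, left, right). "<" means little-endian with no padding.
packed = struct.pack("<4i", 23, 800, 100, 1024)

print(len(xml))      # 57
print(len(escaped))  # 103
print(len(packed))   # 16
```

The escaped form grows by 46 characters: 3 each for the two angle brackets and 5 for each of the eight quotes.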
But Java has serialization, why not just use that?
Java serialization is exactly what I wanted, except that it dies if you change your classes in the
slightest way, and it is also extremely reliant on the Java language. Any general purpose use of
serialization would require the same version of class files and a Java interpreter on each end of a
communication link. This is totally unacceptable for general use. Essentially, UDT is like having
the important size, type, and structure data from a class file embedded in the data stream. Also,
it is completely unrelated to Java so there aren’t any licenses to run into, other than the LGPL :-)
Well, nobody likes binary protocols because you can’t edit them with vi/vim/emacs/notepad
Heaven forbid that we ever leave the era of line-based ASCII data formats. I mean really, people,
parsing through a sequence of characters looking for the newline is most certainly less efficient
than knowing the length of the line in advance. I think it’s time we moved to some new tools that
enable editing of binary data trees. If the right people tackle this project, we could end up with
treemax, or tvim, or something. Also, for those of you who are nervous about losing things like
sed or cat or cut, I should mention that I’m already planning for equivalents to those commands
in the standard set of UDT tools. Also, for the graphically inclined, I’m planning to do the visual
tree editor in Java before the others (since the Java lib will be done first). I might make a native
Windows version later, once the Delphi library is finished (and then probably port it to Kylix).
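The efficiency point made earlier in this answer (knowing a length in advance versus scanning for a newline) can be illustrated with a generic length-prefixed record. This is a minimal sketch and not the UDT encoding: the 4-byte little-endian length prefix and the names write_record and read_record are assumptions made up for the example.

```python
import io
import struct

LENGTH = struct.Struct("<I")  # assumed 4-byte little-endian length prefix


def write_record(stream, payload: bytes) -> None:
    """Write the payload length, then the raw payload, with no escaping."""
    stream.write(LENGTH.pack(len(payload)))
    stream.write(payload)


def read_record(stream) -> bytes:
    """Read one record without inspecting any payload byte."""
    header = stream.read(LENGTH.size)
    if len(header) < LENGTH.size:
        raise EOFError("no more records")
    (length,) = LENGTH.unpack(header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated record")
    return payload


buf = io.BytesIO()
# Newlines and zero bytes are just data; nothing needs escaping.
write_record(buf, b"line one\nstill the same record")
write_record(buf, b"\x00\x01\x02")

buf.seek(0)
first = read_record(buf)
second = read_record(buf)
```

The reader never searches the payload for a delimiter, which is also why arbitrary binary content can sit in the stream unmodified.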
Ok, you’ve got my attention, so what are the features?
Thought you’d never ask :-) So, for starters,
- UDT is completely structure based and uses a binary encoding. When
reading a UDT file there is no parsing involved; this means there is no need for escape sequences
and that any raw data can be written into the data stream without modification. (clever people can have
their data buffer passed directly to the "write" function). It is not necessary to write any encoding
or decoding code in many cases.
- Data structures can be defined within the protocol, allowing for tight packing of specified
data types. Data can also be loosely packed, making it possible to store meta info like the names
of record elements.
- The protocol is designed with a wide variety of needs in mind; it should be equally usable for
user program settings and video data storage.
- The reader has indexes available to allow it to skip over data that it doesn’t care about.
This conflicts with streaming a bit, but my solution was to make the outermost tuple (array)
streamable on an element-by-element basis, and everything below that must be indexed.
- The protocol was designed with low overhead in mind. I have taken great pains to ensure
that data can be written in just about as many bytes as it occupies in memory.
- The protocol is relatively simple to read and write. What I mean by this is that I plan to
keep the entire Java library under 1000 lines of code. Also, if you only need it for specific
things like saving an array of C/C++ structs to a file, you could write up your header using a text
editor, ASCII chart, and protocol spec, and then put this into a C/C++ program as a constant block
of data. When you wanted to write a file, you could simply write this chunk of data followed by a
record count followed by the binary data in your structs. To edit this data, you could just jump
to an offset and write a new record over the old record.
- It supports the creation of just about any data type you can think up, from arrays of bit-packed 2-bit
integers to imaginary numbers in biginteger representation to enums to sets. Also, it does this without
making a ridiculous number of special cases: the only native data types are bit, tuple, quantity, typedef.
From this I derive things like Int32, Byte, String, UTF8String, and the like.
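The "constant header, then record count, then raw structs" idea from the list above can be sketched roughly as follows. This is an illustration under loud assumptions: HEADER is a made-up placeholder and not a valid UDT header (the real bytes would come from the protocol spec), the 4-byte record count is a guess, and the function names are hypothetical. Only the overall shape (fixed prefix, fixed-size records, overwrite by offset) comes from the text.

```python
import io
import struct

# Placeholder for the hand-written type header described in the list.
# These bytes are NOT a real UDT header; they only stand in for one.
HEADER = b"HYPOTHETICAL-HEADER"

COUNT = struct.Struct("<I")   # record count; the width is an assumption
RECT = struct.Struct("<4i")   # top, bottom, left, right as 32-bit ints


def record_offset(index: int) -> int:
    """Byte offset of a record: fixed prefix plus index times record size."""
    return len(HEADER) + COUNT.size + index * RECT.size


def write_file(stream, rects) -> None:
    """Write the constant header, the record count, then the packed records."""
    stream.write(HEADER)
    stream.write(COUNT.pack(len(rects)))
    for rect in rects:
        stream.write(RECT.pack(*rect))


def overwrite_record(stream, index: int, rect) -> None:
    """Edit in place: jump to the offset and write a new record over the old."""
    stream.seek(record_offset(index))
    stream.write(RECT.pack(*rect))


def read_record(stream, index: int):
    """Read a single record by seeking straight to it."""
    stream.seek(record_offset(index))
    return RECT.unpack(stream.read(RECT.size))


f = io.BytesIO()
write_file(f, [(23, 800, 100, 1024), (0, 10, 0, 10)])
overwrite_record(f, 1, (5, 50, 5, 50))
```

Because every record has the same size, editing one record touches only its own 16 bytes and leaves the rest of the file alone.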
So how about some details on the file format?
All the details of the file format can be found in the
protocol documentation.
I’ll warn you though, it’s dry stuff. Skip to the examples section if you want to see the bytes dance.