Binary format

Though the first cut of any serialization is likely to reuse an existing lower-level scheme, it’s a bit more universal to describe things in raw binary frames.

The aim of this format is to allow multiple versions to be stored in the same file, including versions that the client may not know how to read. We’ll address how we could manage newer versions later.

Variants

I think a reader should scan through variants to find the version desired.

Up front, the file needs to declare its scheme with a TOKEN, which may be SIMPLE or PAGED.

Simple

For a SIMPLE stream, we’d just write out one version at a time.

Paged

The basic idea of a paged stream is that it represents data for one or more versions that have pages in common. The writer does a scan phase to identify common segments and writes each of them out once, and the version map then lists the page numbers that make up each version. Assuming a page size of 256 bytes, let’s say that a negative page number means a complete page, and a positive page number means a partial page whose first byte is the page length.
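To make the scan phase concrete, here is a rough sketch of how a writer might deduplicate pages and build the version map. None of this is settled: the names `build_pages` and `read_version` are made up, page numbers are assumed to be 1-based (so the sign is always meaningful), and a partial page is assumed to be zero-padded out to 256 bytes.

```python
PAGE_SIZE = 256  # page size assumed in the text

def build_pages(versions):
    """Scan all versions, store each distinct page once, and return
    (pages, version_map). `versions` maps a version id to its bytes."""
    pages = []        # unique pages in write order
    index = {}        # page bytes -> 1-based page number (assumption)
    version_map = {}
    for vid, data in versions.items():
        refs = []
        for off in range(0, len(data), PAGE_SIZE):
            chunk = data[off:off + PAGE_SIZE]
            full = len(chunk) == PAGE_SIZE
            if not full:
                # Partial page: first byte is the length, then the payload,
                # then zero padding (the padding is an assumption).
                chunk = (bytes([len(chunk)]) + chunk).ljust(PAGE_SIZE, b"\0")
            if chunk not in index:
                pages.append(chunk)
                index[chunk] = len(pages)
            number = index[chunk]
            # Negative = complete page, positive = length-prefixed page.
            refs.append(-number if full else number)
        version_map[vid] = refs
    return pages, version_map

def read_version(pages, refs):
    """Reassemble one version from its list of signed page numbers."""
    out = bytearray()
    for ref in refs:
        page = pages[abs(ref) - 1]
        out += page if ref < 0 else page[1:1 + page[0]]
    return bytes(out)
```

Two versions that differ only in their middle page would then share the first and last pages, and a reader that only understands one of them can skip the rest.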

Negotiated

A negotiated form is for socket connections. The reader can offer a list of known versions and the writer can pick the highest it knows. At that point, it becomes a simple stream.

But this might be as simple as adding headers to an HTTP request that would identify versions the client handles, and the server could then pick the optimal stream.
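The selection step itself is tiny. As a sketch, assuming versions are comparable values such as integers (`pick_version` is a hypothetical name, not part of any spec):

```python
def pick_version(reader_offers, writer_known):
    """Return the highest version both sides know, or None if they share none."""
    common = set(reader_offers) & set(writer_known)
    return max(common) if common else None
```

What happens when there is no common version is left open here; returning `None` just pushes that decision to the caller.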

Writing values

In the future, there may be alternative representations of scalar forms. A version should indicate the Tenet spec it’s compatible with, and we can imagine the first version assumes all values have a single, canonical binary representation.

One point to make is that these have a strict schema, so the reader always knows the type of data it’s reading. The only reason for a flag is to handle unbound [1] data types; I’m anticipating the first version will only have unbound types.

Integers

An integer can have a lead byte containing a flag:

  • 0: 6 bits
  • 1: 14 bits
  • 2: 30 bits
  • 3: long, up to 2^30 words

All ints are signed.
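Here is one way the three fixed-width forms could be laid out. This is only a sketch under assumptions the text doesn't pin down: the flag sits in the top two bits of the lead byte, the payload is big-endian two's complement, and the long form (flag 3) is left out entirely. The names `encode_int` and `decode_int` are hypothetical.

```python
# Assumed layout: flag in the top two bits, then 6, 14 or 30 payload bits.
_SIZES = {0: 1, 1: 2, 2: 4}  # flag -> total encoded size in bytes

def encode_int(value):
    """Encode a signed int in the smallest form that fits."""
    for flag, size in _SIZES.items():
        bits = size * 8 - 2
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            payload = value & ((1 << bits) - 1)  # two's complement in `bits` bits
            return ((flag << bits) | payload).to_bytes(size, "big")
    raise NotImplementedError("flag 3 (long form) is not sketched here")

def decode_int(data):
    """Decode one int from the start of `data`; return (value, bytes_used)."""
    flag = data[0] >> 6
    if flag == 3:
        raise NotImplementedError("flag 3 (long form) is not sketched here")
    size = _SIZES[flag]
    bits = size * 8 - 2
    raw = int.from_bytes(data[:size], "big") & ((1 << bits) - 1)
    if raw >= 1 << (bits - 1):  # sign bit set, so the value is negative
        raw -= 1 << bits
    return raw, size
```

With this layout, anything from -32 to 31 fits in a single byte, which should cover a lot of small counters and enum-like values.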

Strings

UTF-8 is definitely ubiquitous enough to justify simply using it until there’s a demand for an alternative.

The high bit of the length prefix selects its size:

  • 0: up to 127 chars
  • 1: up to 2^63 chars
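A sketch of that length prefix, with some guesses filled in: the short form is a single byte, the long form is assumed to be eight big-endian bytes with the top bit set, and the count is assumed to be UTF-8 bytes rather than code points. `encode_str` and `decode_str` are made-up names.

```python
def encode_str(text):
    """Length-prefix a string: 1-byte header if short, 8-byte header otherwise."""
    body = text.encode("utf-8")
    n = len(body)  # counting bytes, not code points (assumption)
    if n <= 127:
        header = bytes([n])                        # high bit clear
    else:
        header = ((1 << 63) | n).to_bytes(8, "big")  # high bit set, 63-bit length
    return header + body

def decode_str(data):
    """Decode one string from the start of `data`; return (text, bytes_used)."""
    if data[0] & 0x80:
        n = int.from_bytes(data[:8], "big") & ((1 << 63) - 1)
        start = 8
    else:
        n = data[0]
        start = 1
    end = start + n
    return data[start:end].decode("utf-8"), end
```

Counting bytes keeps the reader simple, since it can slice the payload without decoding it first; counting characters would need a different reader.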

Homogeneous containers

Simple non-scalars are uniformly typed sequences; mappings are sequences of pairs. Either way, the same concept of a variable count applies.
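As a sketch, a sequence is just a count followed by the encoded elements, and a mapping is a count followed by key/value pairs. The count encoding below (one byte up to 127, otherwise eight bytes with the high bit set) is an assumption borrowed from the string length prefix, and all three function names are hypothetical.

```python
def encode_count(n):
    """Variable count: 1 byte up to 127, else 8 bytes with the high bit set."""
    return bytes([n]) if n <= 127 else ((1 << 63) | n).to_bytes(8, "big")

def encode_seq(items, encode_item):
    """Count, then each element encoded with the same element encoder."""
    return encode_count(len(items)) + b"".join(encode_item(x) for x in items)

def encode_map(mapping, encode_key, encode_value):
    """Count of pairs, then each key immediately followed by its value."""
    pairs = list(mapping.items())
    body = b"".join(encode_key(k) + encode_value(v) for k, v in pairs)
    return encode_count(len(pairs)) + body
```

Because the schema is strict, the element encoder is known in advance and nothing per-element needs to say what type follows.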

Complex non-scalars

  • A union’s tag is simply represented as a number followed by the variant.
  • A tuple will simply be the sequence of attributes.

The number of tags is known at compile time, so the writer can use a single byte in most cases. I’d be surprised if the compiler can even handle tens of thousands of tags at this point.
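Roughly, that could look like the following. The widening rule (one byte for up to 256 tags, two bytes beyond that) is my assumption about how the writer would pick the tag size, and `encode_union` and `encode_tuple` are hypothetical names.

```python
def encode_union(tag, payload, tag_count):
    """Tag number followed by the already-encoded variant.

    `tag_count` is the number of tags known at compile time; it decides
    the tag width (assumed: 1 byte up to 256 tags, 2 bytes up to 65536).
    """
    if tag_count > 65536:
        raise ValueError("more tags than this sketch supports")
    if not 0 <= tag < tag_count:
        raise ValueError("tag out of range")
    width = 1 if tag_count <= 256 else 2
    return tag.to_bytes(width, "big") + payload

def encode_tuple(encoded_attrs):
    """A tuple is the attributes back to back, in schema order."""
    return b"".join(encoded_attrs)
```

Since reader and writer share the schema, both can derive the tag width from the same tag count without storing it in the stream.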


  1. Unbound really means types with very large bounds.