April 2007


Nate has outgrown his crib. When I put him to bed last night, he swung himself out and refused to stay in the “little bed”. Two or three times before, we’d found him out of his crib, but he didn’t do it often and would thump down hard and cry. Now he can do it reliably. He’s tall for his age and physically adept. Like a firefighter exiting a window, he grabs a corner, pulls himself onto the rail, swings one leg over, rolls off, then hangs down and drops to his feet facing the crib. He does it faster than I can put him back in. I was something of an escape artist when I was a toddler, so I’m not entirely surprised.

So last night was his first night in the “big bed”. There was screaming and crying. He pointed to the little bed and then to the big bed. I tucked him in and he climbed out. After a few tries in each, I put on the mean daddy face and with the mean daddy voice told him to pick. No crying. Go. To. Sleep. He chose the big bed and stayed in all night. He even slept late and this morning I dismantled the crib.

We’re not looking forward to naps. Nate hates naps; he acts as if he might miss something if he sleeps. The kid will fall over before going off for a nap if we don’t force him into bed. Now that he’s in the big bed it might be worse. I might have to put a latch on his door to keep him in.

I’ve been working with a large collection of files (several hundred GB worth) containing serialized data structures. The decision to use a marshaled structure was made years ago and, for small collections used only by their author, it was probably the easiest choice. What I’ve realized is:

  • others need access to the data
  • the marshaling is tied to the language and, to some degree, library version
  • marshaling/unmarshaling is relatively slow
  • the marshaled data is bloated and repeats metadata
  • the hierarchical data structure is unnecessary, resulting in…
  • most of the unmarshaled structure is thrown away
  • damaged files are not easily salvaged

I considered XML, JSON and S-Expressions, but the most portable, efficient representation I could come up with is one of the oldest and worst defined: CSV, comma separated values. I say worst because, while everyone knows what it is, there is no “standard”, only reference implementations (Microsoft Excel, for example) which diverge from one another. Nor is there any definition for including metadata, except a convention of using the first line for field names. Still, it meets my requirements. Going to a CSV representation saves me 51%-54% on disk and I can use fast, C-based libraries.
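
A rough sketch of what that conversion can look like, in Python purely for brevity. The record shape, field names and values here are invented for illustration; the real structures are not shown in this post. The idea is simply to keep the handful of fields that are actually used, write the field names once on the first line, and discard the rest of the hierarchy.

```python
import csv
import io

# Hypothetical nested records standing in for the unmarshaled structures.
records = [
    {"id": 1, "meta": {"source": "a", "version": 2}, "values": {"x": 1.5, "y": 2.5}},
    {"id": 2, "meta": {"source": "a", "version": 2}, "values": {"x": 3.0, "y": 4.0}},
]

# Only these fields survive; the repeated "meta" block is dropped.
FIELDS = ["id", "x", "y"]


def flatten(record):
    """Pull the fields of interest out of the nested structure."""
    return {
        "id": record["id"],
        "x": record["values"]["x"],
        "y": record["values"]["y"],
    }


def to_csv(recs):
    """Write a header line followed by one flat row per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in recs:
        writer.writerow(flatten(rec))
    return buf.getvalue()


csv_text = to_csv(records)

# Any stock CSV reader can now get at the data; values come back as strings.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

Reading back yields plain strings, so any type conversion is left to the consumer. That is part of the price of a format with no agreed way to carry metadata.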

The type of data in the file is important and the field names do not uniquely identify it, so I thought of including a comment line at the file start. This breaks normal CSV implementations that expect either data or field names at the first line. I chose to use an “eye-catcher” as the first field which encodes the unique type. This wastes space, adding slightly less than 10% to the file size for a field that never varies within a given file. That is still a significant savings over serialized structures but unsatisfying. What I’d like to do is store a comment or additional metadata once. Searching for a better solution, I happened across Creativyst Table Format which has the goals:
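
The eye-catcher approach might be sketched as follows, again in Python. The tag value `TYPE:READINGS-V1`, the column names and the helper functions `write_typed_csv` and `read_typed_csv` are all hypothetical, made up for this example. The first line is still an ordinary field-name row, so standard CSV readers are not confused, and every data row repeats the type tag in its first field.

```python
import csv
import io

# Hypothetical type tag; it never varies within a given file.
EYE_CATCHER = "TYPE:READINGS-V1"
FIELDS = ["type", "id", "x", "y"]


def write_typed_csv(rows, eye_catcher=EYE_CATCHER):
    """Write a header line, then each row prefixed with the type tag."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(FIELDS)
    for row in rows:
        writer.writerow([eye_catcher, *row])
    return buf.getvalue()


def read_typed_csv(text, expected=EYE_CATCHER):
    """Return (field names, rows) with the tag stripped, checking it on every row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    data = []
    for lineno, row in enumerate(reader, start=2):
        if not row or row[0] != expected:
            raise ValueError(f"line {lineno}: unexpected type tag {row[:1]!r}")
        data.append(row[1:])
    return header[1:], data


text = write_typed_csv([(1, 1.5, 2.5), (2, 3.0, 4.0)])
names, data = read_typed_csv(text)
```

The repeated tag is exactly the wasted space described above. In exchange, each row can be checked on its own, which may help a little when picking through a damaged file.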

  1. More functional than CSV
  2. Less overhead than XML
  3. Simplicity

and, true, it does all that. It is a well-written specification. Best of all, it neatly supports what I want to do. I could bodge together a library to read and write a basic form of it (and I still may), but as far as I know no reference implementations exist for the languages I’m concerned with. I would lose portability, and it is unreasonable to require every random colleague to use my code or write their own parser just to access this data. So it’s a far better idea, but not suitable for my situation at this time.

Which is disappointing. It should be popularized but I’m not in a position to do it. My hope is that someone reads this and cobbles together an Open Source reference implementation. Having ready implementations in Perl and Java would ease adoption and make decisions like mine simple: use the best data exchange format available.