Data munging and a spec that improves on CSV
Apr 3, 2007
I’ve been working with a large collection of files (several hundred GB worth) containing serialized data structures. The decision to use a marshaled structure was made years ago and, for small collections used only by their author, it was probably the easiest choice. What I’ve realized is:
- others need access to the data
- the marshaling is tied to the language and, to some degree, library version
- marshaling/unmarshaling is relatively slow
- the marshaled data is bloated and repeats metadata
- the hierarchical data structure is unnecessary, resulting in…
- most of the unmarshaled structure is thrown away
- damaged files are not easily salvaged
I considered XML, JSON and S-expressions, but the most portable, efficient representation I could come up with is one of the oldest and worst defined: CSV, comma-separated values. I say worst because, while everyone knows what it is, there is no “standard”, only reference implementations that diverge (Microsoft Excel, for example), and no way to include metadata beyond the convention of using the first line for field names. Still, it meets my requirements. Going to a CSV representation saves me 51–54% on disk and lets me use fast, C-based libraries.
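To make the flattening concrete, here is a minimal Python sketch. The record layout and the field names (`id`, `name`, `score`) are hypothetical and not taken from my actual data; the point is only that the nested structure is reduced to the fields that are really used, written one row per record with the field names on the first line.

```python
import csv
import io

# Hypothetical nested records; names and values are illustrative only.
records = [
    {"id": 1, "meta": {"name": "alpha", "unused": [1, 2, 3]}, "score": 0.5},
    {"id": 2, "meta": {"name": "beta, gamma", "unused": []}, "score": 1.25},
]

FIELDS = ["id", "name", "score"]


def flatten(record):
    """Keep only the fields that are actually used, as one flat row."""
    return {
        "id": record["id"],
        "name": record["meta"]["name"],
        "score": record["score"],
    }


def write_csv(records, stream):
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()  # first line holds the field names, by convention
    for record in records:
        writer.writerow(flatten(record))


def read_csv(stream):
    # Every value comes back as a string; CSV carries no type information.
    return list(csv.DictReader(stream))


buffer = io.StringIO(newline="")
write_csv(records, buffer)
buffer.seek(0)
rows = read_csv(buffer)
```

Note that the embedded comma in the second name survives the round trip because the writer quotes it, which is exactly the kind of detail on which hand-rolled CSV code tends to diverge.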
The type of data in the file is important and the field names do not uniquely identify it, so I thought of including a comment line at the start of the file. This breaks normal CSV implementations, which expect either data or field names on the first line. Instead, I chose to use an “eye-catcher” as the first field, which encodes the unique type. This wastes space, adding slightly less than 10% to the file size for a field that never varies within a given file. That is still a significant saving over serialized structures, but it is unsatisfying. What I’d like to do is store a comment or additional metadata once. Searching for a better solution, I happened across the Creativyst Table Format, which has these goals:
- More functional than CSV
- Less overhead than XML
- Simplicity
and, true, it does all that. It is a well-written specification. Best of all, it neatly supports what I want to do. I could bodge together a library to read and write a basic form of it (and I still may), but as far as I know no reference implementations exist for the languages I’m concerned with. I would lose portability, and it is unreasonable to require every colleague to use my code or write their own parser just to access this data. So it’s a far better idea, but not suitable for my situation at this time.
Which is disappointing. It should be popularized but I’m not in a position to do it. My hope is that someone reads this and cobbles together an Open Source reference implementation. Having ready implementations in Perl and Java would ease adoption and make decisions like mine simple: use the best data exchange format available.
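In the meantime, the eye-catcher workaround described earlier can be sketched in Python roughly as follows. The tag `EVT1`, the field names and the helper names `write_tagged` and `read_tagged` are all hypothetical, chosen only for illustration. The idea is that every data row carries the type tag as its first field, so an ordinary CSV parser still sees well-formed rows, and a reader can reject a file of the wrong type.

```python
import csv
import io

EYE_CATCHER = "EVT1"             # hypothetical type tag, not from the real data
FIELDS = ["type", "id", "name"]  # hypothetical field names


def write_tagged(rows, stream):
    """Write rows with the type tag repeated as the first field of each row."""
    writer = csv.writer(stream)
    writer.writerow(FIELDS)
    for row in rows:
        writer.writerow([EYE_CATCHER, *row])


def read_tagged(stream, expected=EYE_CATCHER):
    """Yield data rows (minus the tag), rejecting rows of the wrong type."""
    reader = csv.reader(stream)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        if not row or row[0] != expected:
            raise ValueError(f"line {line_no}: expected type {expected!r}")
        yield dict(zip(header[1:], row[1:]))


buffer = io.StringIO(newline="")
write_tagged([(1, "alpha"), (2, "beta")], buffer)
buffer.seek(0)
rows = list(read_tagged(buffer))
```

The repeated tag is the wasted space mentioned above: it is stored on every row even though it never varies within a file, which is what a format with a place for one-time metadata would avoid.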