I was reviewing some notes from last year and found an off-hand remark a colleague made about software testing:
> “There is no such thing as no time, just no priority.”
I’ve been working with a large collection of files (several hundred GB worth) containing serialized data structures. The decision to use a marshaled structure was made years ago and, for small collections used only by the author, it was probably the easiest choice. What I’ve realized is:
- others need access to the data
- the marshaling is tied to the language and, to some degree, library version
- marshaling/unmarshaling is relatively slow
- the marshaled data is bloated and repeats metadata
- the hierarchical data structure is unnecessary, resulting in…
- most of the unmarshaled structure is thrown away
- damaged files are not easily salvaged
I considered XML, JSON and S-Expressions but the most portable, efficient representation I could come up with is one of the oldest and worst defined: CSV, comma-separated values. I say worst because, while everyone knows what it is, there is no “standard”, only reference implementations that diverge (Microsoft Excel, for example), and no definition for including metadata except a convention of using the first line for field names. Still, it meets my requirements. Going to a CSV representation saves me 51%-54% on disk and I can use fast, C-based libraries.
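As a rough illustration of the conversion, here is a minimal Python sketch that keeps only the needed leaf fields of a nested record and writes them as flat CSV rows with the field names on the first line. The record layout and the names `records`, `FIELDS` and `to_csv` are invented for the example; the real data and tooling are not shown in this post.

```python
import csv
import io

# Hypothetical nested records standing in for the unmarshaled structures;
# the layout and field names are assumptions made for this example.
records = [
    {"meta": {"host": "a1", "version": 3}, "sample": {"ts": 1132000000, "value": 4.2}},
    {"meta": {"host": "a1", "version": 3}, "sample": {"ts": 1132000060, "value": 4.7}},
]

# Only these leaf fields are kept; the repeated metadata is dropped.
FIELDS = ["ts", "value"]


def to_csv(recs):
    """Write the selected leaf fields as flat CSV rows, header line first."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in recs:
        writer.writerow({name: rec["sample"][name] for name in FIELDS})
    return out.getvalue()


csv_text = to_csv(records)
print(csv_text, end="")
```

Dropping the per-record metadata is where the space goes: each row carries only the values, and the field names appear once.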
The type of data in the file is important and the field names do not uniquely identify it, so I thought of including a comment line at the start of the file. This breaks normal CSV implementations, which expect either data or field names on the first line. Instead I chose to use an “eye-catcher” as the first field, which encodes the unique type. This wastes space, adding slightly less than 10% to the file size for a field that never varies within a given file. That is still a significant saving over serialized structures, but unsatisfying. What I’d like to do is store a comment or additional metadata once. Searching for a better solution, I happened across the Creativyst Table Format, which has these goals:
- More functional than CSV
- Less overhead than XML
and, true, it does all that. It is a well-written specification. Best of all, it neatly supports what I want to do. I could bodge together a library to read and write a basic form of it (and I still may) but as far as I know no reference implementations exist for the languages I’m concerned with. I lose portability, and it is unreasonable to require every random colleague to use my code or write their own parser just to access this data. So it’s a far better idea, but not suitable for my situation at this time.
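For comparison, the plain-CSV “eye-catcher” workaround described earlier might look roughly like the Python sketch below. It is an illustration under assumptions, not the code actually in use: the tag value `SAMPLE_V1`, the `type` column name and the helper names are all hypothetical.

```python
import csv
import io

# Hypothetical type tag; the real eye-catcher value is not given in the post.
EYE_CATCHER = "SAMPLE_V1"


def write_tagged(rows, fields):
    """Write rows as CSV with the type tag repeated as the first field."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["type"] + list(fields))
    for row in rows:
        writer.writerow([EYE_CATCHER] + list(row))
    return out.getvalue()


def read_tagged(text, expected):
    """Return (field names, rows), rejecting rows whose tag does not match."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for lineno, row in enumerate(reader, start=2):
        if not row or row[0] != expected:
            raise ValueError(f"line {lineno}: expected type {expected!r}")
        rows.append(row[1:])
    return header[1:], rows


text = write_tagged([(1, 2.5), (2, 3.5)], ["ts", "value"])
fields, rows = read_tagged(text, EYE_CATCHER)
```

Any ordinary CSV reader still sees a header line followed by data, which is the point of the convention; the cost is the tag repeated on every row.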
Which is disappointing. It should be popularized but I’m not in a position to do it. My hope is that someone reads this and cobbles together an Open Source reference implementation. Having ready implementations in Perl and Java would ease adoption and make decisions like mine simple: use the best data exchange format available.
Came across this interesting observation:
> For many years I have been asking new clients to tell me who their best-performing people are. And then I ask: “What are they assigned to?” Almost without exception, the performers are assigned to problems… Almost invariably, the opportunities are left to fend for themselves.
> Peter F. Drucker, writing for [The Wall Street Journal](http://online.wsj.com/public/article/SB113208353287697881.html)
So you have to ask yourself, “What are you working on?”
We got our “numbers” today and several of my colleagues received promotions. A co-worker told me that a former boss of his used to remark, “Any day the firm hands you a check on a discretionary basis is a good one.”
He’s right, it is a good day.
Luck of the calendar and I have on-call for Thanksgiving. It’s not a holiday in Europe or the Far East. I’ve been logged in and working since 9am. Whee.
I just hope it stays quiet.
I just got back from a few days off, arrived at JFK, and now I turn around and fly out for work, leaving from LGA. This is not how I like to plan these things but I have meetings tomorrow and the idea of a 6am flight didn’t sound so good. I’ll catch up on the 1000+ backlog of work email over the next couple of days.