Data munging and a spec that improves on CSV

2007-04-03

I’ve been working with a large collection of files (several hundred GB worth) containing serialized data structures. The decision to use a marshaled structure was made years ago and, for small collections used only by the author, was probably the easiest option. What I’ve realized is:

  • others need access to the data
  • the marshaling is tied to the language and, to some degree, library version
  • marshaling/unmarshaling is relatively slow
  • the marshaled data is bloated and repeats metadata
  • the hierarchical data structure is unnecessary, resulting in…
  • most of the unmarshaled structure is thrown away
  • damaged files are not easily salvaged

I considered XML, JSON, and S-expressions, but the most portable, efficient representation I could come up with is one of the oldest and worst defined: CSV, comma-separated values. I say worst because, while everyone knows what CSV is, there is no “standard,” only reference implementations (which diverge; Microsoft Excel is a case in point) and no definition for including metadata beyond the convention of using the first line for field names. Still, it meets my requirements. Going to a CSV representation saves me 51%-54% on disk, and I can use fast, C-based libraries.
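To make the comparison concrete, here is a minimal sketch of the CSV representation described above, using Python’s C-accelerated `csv` module. The field names and record values are invented stand-ins; the real data and schema are not given in the post.

```python
import csv
import io

# Hypothetical records as they might look after unmarshaling;
# the field names and values are stand-ins for illustration.
records = [
    {"timestamp": "2007-04-03T12:00:00", "host": "node01", "value": "42"},
    {"timestamp": "2007-04-03T12:05:00", "host": "node02", "value": "17"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["timestamp", "host", "value"])
writer.writeheader()          # the conventional first line of field names
writer.writerows(records)

print(buf.getvalue())
```

Because field names appear once rather than being repeated in every serialized record, the per-record overhead drops to just the delimiters, which is where the disk savings come from.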

The type of data in a file is important, and the field names do not uniquely identify it, so I first thought of including a comment line at the start of the file. That breaks normal CSV implementations, which expect either data or field names on the first line. Instead, I chose to use an “eye-catcher” as the first field, which encodes the unique type. This wastes space, adding slightly less than 10% to the file size for a field that never varies within a given file. That is still a significant savings over serialized structures, but unsatisfying: what I’d like to do is store a comment or additional metadata once. Searching for a better solution, I happened across the Creativyst Table Format (CTX), which has the goals:
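The eye-catcher approach can be sketched in a few lines. The tag value and row contents below are hypothetical; the point is that the type marker is a constant first column, checked on read and then discarded.

```python
import csv
import io

# Hypothetical type tag; by design it never varies within a given file.
EYE_CATCHER = "SENSORLOG.V1"

rows = [["2007-04-03", "node01", "42"],
        ["2007-04-03", "node02", "17"]]

# Writing: prepend the eye-catcher to every record.
buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    writer.writerow([EYE_CATCHER] + row)

# Reading: verify the type on each record, then drop the constant column.
buf.seek(0)
for row in csv.reader(buf):
    assert row[0] == EYE_CATCHER, "unexpected file type"
    data = row[1:]
```

The repeated constant column is exactly the ~10% overhead complained about above; a format with file-level metadata would store the tag once instead.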

  1. More functional than CSV
  2. Less overhead than XML
  3. Simplicity

and, true to its goals, it does all that. It is a well-written specification. Best of all, it neatly supports what I want to do. I could bodge together a library to read and write a basic form of it (and I still may), but as far as I know no reference implementations exist for the languages I’m concerned with. I would lose portability, and it is unreasonable to impose on every random colleague the requirement that they use my code or write their own parser just to access this data. So it’s a far better idea, but not suitable for my situation at this time.

Which is disappointing. It should be popularized but I’m not in a position to do it. My hope is that someone reads this and cobbles together an Open Source reference implementation. Having ready implementations in Perl and Java would ease adoption and make decisions like mine simple: use the best data exchange format available.


Comment [1]

2007-04-13 13:32, Creativyst


I just wanted to swoop in here to say thanks for the mention.

I appreciate your kind words and agree with your critique. I developed the convention because most of the jobs I'm getting these days are web-based and there were two basic needs:

  1. a decent way to transport data back and forth between web applications and the desktop applications on a client's computer, and

  2. a decent way to move data between multiple applications produced for the same client, or into retrievable archives.

As you've pointed out, CTX is not a good choice for #1, which is best handled by mimicking Microsoft's brand of CSV as closely as possible. As inadequate as MS-CSV is, supporting it does let clients download database tables produced on-line, and analyze them using their own desktop applications (Excel, Access, etc.).

Fortunately, the developers of Open-Office saw the importance of having a CSV file that was MS-compatible for their desktop apps as well. They even cited my CSV article when developing it, calling MS-CSV "The 800 pound gorilla of CSV".

I really developed CTX because of all the inadequacies I was tripping over while documenting the MS-CSV format. It became almost an obsessive compulsive desire to produce a transport format that addressed all the issues. Enter CTX, an efficient transport for my own client applications. It certainly didn't seem like it would hurt to put it up on the web and share it (though some out on the web disagree). Perhaps others with similar needs will use it.

It is important to note, as you have pointed out, that it is NOT a good choice as a transport format that can be used by a wide array of outside applications. As painful as it is to say it, CSV is the best choice for that. This issue is also addressed in the CTX article's FAQ.


P.S. I found your comment while looking through the inlinks at Yahoo.
