Introduction

json-to-multicsv is a little program to convert a JSON file to one or more CSV files in a way that preserves the hierarchical structure of nested objects and lists. It's the kind of dime-a-dozen data munging tool that's too trivial to talk about, but I'll write a bit anyway, for a couple of reasons.

The first one is that I spent an hour looking for an existing tool that did this and didn't find one. There are lots of converters to other formats, all of which seem to assume the JSON is effectively going to be a list of records, but none that support arbitrary nesting. Did I just somehow manage to miss all the good ones? Or is this truly something that nobody has ever needed to do?

Second, this is as good an excuse as any to start talking a bit about some patterns in how command line programs get told what to do (I'd use the word "configured", except that's not quite right).

What and why?

I needed to produce some data for someone else to analyze, but the statistics package they were using could not import JSON files with any non-trivial structure. Instead the data needed to be provided as multiple CSV files that can be joined together by the appropriate columns.

As a simplified example, instead of this:

{
  "item 1": {
    "title": "The First Item",
    "genres": ["sci-fi", "adventure"],
    "rating": {
      "mean": 9.5,
      "votes": 190
    }
  },
  "item 2": {
    "title": "The Second Item",
    "genres": ["history", "economics"],
    "rating": {
      "mean": 7.4,
      "votes": 865
    },
    "sales": [
      { "count": 76, "country": "us" },
      { "count": 13, "country": "de" },
      { "count": 4, "country": "fi" }
    ]
  }
}

My "customer" needed this:

item.csv
    item._key,item.rating.mean,item.rating.votes,item.title
    "item 1",9.5,190,"The First Item"
    "item 2",7.4,865,"The Second Item"

item.genres.csv
    genres,item._key,item.genres._key
    sci-fi,"item 1",1
    adventure,"item 1",2
    history,"item 2",1
    economics,"item 2",2

item.sales.csv
    item._key,item.sales._key,sales.count,sales.country
    "item 2",1,76,us
    "item 2",2,13,de
    "item 2",3,4,fi

One way to do this would have been to just change the program I used to produce the output. That would have been a bit annoying, since the CSV output codepath would have been almost completely separate from the JSON one (which was basically just a JSON::encode_json on the natural data structure). It's almost easier to just have a generic converter than one specific to that one app (the documentation is as long as the program itself). The only question is how to configure the generic mechanism for the specific case.

How command line tools get run

Could this "just work" out of the box with no settings at all? Not really; there are multiple ways of interpreting the data. A compound value could mean either the addition of more columns (the ratings in the example) or rows in another CSV file (the sales in the example). Consistently choosing the first interpretation would not work at all, while consistently choosing the second would produce really awkward entity-attribute-value-style output.

Ok, so some configuration is needed. What kind of options do we have for doing that? Command line flags tend to be the simplest to start with, though they'll often eventually become complex either by developing ordering dependencies between flags (to express different semantics) or by the values developing some kind of complicated internal structure.

Both of those actually happen for this tool. To run it, you need to pass in multiple --path command line options, each containing a pair of a pattern and the action to take for values whose path matches that pattern. (Only the first matching action is taken.) For the above example those flags were:

   --path /:table:item
   --path /*/rating:column
   --path /*/sales:table:sales
   --path /*/genres:table:genres
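
Put together, the full invocation would look something like this (the input filename is invented for the example, and the exact script name and input-file option are from memory, so check them against the tool's documentation):

   json-to-multicsv.pl --file items.json \
       --path /:table:item \
       --path /*/rating:column \
       --path /*/sales:table:sales \
       --path /*/genres:table:genres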

Scalar values have an automatic fallback handler that just outputs the value as a column, but for compound values not finding a match is an error. In those cases the error message prints out some suggestions for what command line arguments could be added to resolve the error, for example:

Don't know how to handle object at /*/appendix/. Suggestions:
 --path /*/appendix/:table:name
 --path /*/appendix/:column
 --path /*/appendix/:row
 --path /*/appendix/:ignore
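
To make the mechanism a bit more concrete, here's a rough Python sketch of the kind of recursion involved. It's a simplified toy rather than the actual implementation: the input filename is invented, the pattern matching is plain fnmatch, and the output file naming differs slightly from the real tool's.

import csv
import fnmatch
import json
import sys

# Each (pattern, action) pair plays the role of one --path option.
RULES = [
    ("/",         ("table", "item")),
    ("/*/rating", ("column", None)),
    ("/*/sales",  ("table", "sales")),
    ("/*/genres", ("table", "genres")),
]

def lookup(path):
    """Return the action of the first pattern that matches the path, if any."""
    for pattern, action in RULES:
        if fnmatch.fnmatchcase(path, pattern):
            return action
    return None

def convert(value, path, row, prefix, tables):
    """Walk the JSON tree, filling in one dict per output table row."""
    if not isinstance(value, (dict, list)):
        row[prefix] = value                      # scalar fallback: plain column
        return
    kind, name = lookup(path) or (None, None)
    items = value.items() if isinstance(value, dict) else enumerate(value, 1)
    if kind == "table":                          # children become rows of `name`
        for key, child in items:
            child_path = path.rstrip("/") + "/" + str(key)
            # New rows inherit only the key columns of their parent row.
            child_row = {k: v for k, v in row.items() if k.endswith("._key")}
            child_row[prefix + "._key"] = key
            if isinstance(child, dict):          # object rows, e.g. items, sales
                for fkey, fval in child.items():
                    convert(fval, child_path + "/" + fkey, child_row,
                            name + "." + fkey, tables)
            else:                                # scalar rows, e.g. genres
                child_row[name] = child
            tables.setdefault(name, []).append(child_row)
    elif kind == "column":                       # fields become extra columns
        for key, child in items:
            convert(child, path.rstrip("/") + "/" + str(key), row,
                    prefix + "." + str(key), tables)
    else:
        sys.exit("Don't know how to handle compound value at " + path)

tables = {}
with open("items.json") as f:                    # hypothetical input file
    convert(json.load(f), "/", {}, "item", tables)

for name, rows in tables.items():
    fields = sorted({column for r in rows for column in r})
    with open(name + ".csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)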

The next option would be feeding some kind of schema file to the tool, which would then be used to guide the process. For example, if the schema says that a type of object has a static set of fields, those fields are probably columns. If it has an unknown set of keys, it's probably more like tabular data.
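
As a hypothetical illustration of that distinction, a hand-written JSON Schema fragment for the example data might look something like this: the top-level object with arbitrary keys (additionalProperties) reads like table rows, while the inner objects with fixed property lists read like sets of columns.

{
  "type": "object",
  "additionalProperties": {
    "type": "object",
    "properties": {
      "title":  { "type": "string" },
      "rating": {
        "type": "object",
        "properties": {
          "mean":  { "type": "number" },
          "votes": { "type": "integer" }
        }
      }
    }
  }
}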

The problem is that writing the schema would be a bit of a pain, and it would be much harder for the conversion tool to guide the user through an iterative process of getting the schema definition right. One could maybe generate a schema file from the data file itself, and edit any bits that the autodetection gets wrong. Schema generators do exist, for example jsonschema.net, but at least that one doesn't have enough knobs to tweak to even get this basic example right. And the mistakes are such that fixing them would take a fair bit of work. Reliable automated schema generation would make for some pretty epic yak shaving in the context of this tiny tool.

Maybe if people really did write JSON schemas for everything it would make sense to use that existing infrastructure. But I've never seen one of those in the wild, the spec is complicated, and JSON schemas are not particularly well suited to this use case. (Really you'd want a custom schema format, but then it's completely guaranteed that there's no pre-existing schema file to use).

And here's the thing... It's not just this specific case. It never feels like any kind of declarative schema is the right solution. In a couple of decades of writing data munging scripts I can remember just a single case of basing the solution on an external description of the data. And that single exception had several people working on the tool full time. Sure, it's great to have a schema of some sort for your data interchange or storage format, for use in validation, code generation, automated generation of example data, or other things like that. But for actually processing it? It's just an incredibly rare pattern.

And finally, could this be a use case for a special purpose language? If schemas feel like a rarity, little languages are the opposite. Especially in classic Unix they are ubiquitous.

As a recovering programming language addict, I have to be deeply suspicious every time a new language looks like the right solution. Is it really? Or is this just an excuse to fall off the wagon again, and implement a language? (Not a big language, man. Just a little one, to take the edge off).

It's also clear that the general idea of a JSON processing language is solid. Some already exist (e.g. jq), but there could be room for multiple approaches. Writing sample programs to see what a language for JSON processing and transformation might look like was a fun way to spend a couple of hours on the boring "no internet" leg of a train journey. ("It could have this awk-like structure of toplevel pattern matching clauses, but on paths instead of rows of text, and with a recursive main loop instead of a streaming one, and and and...").

If I kind of wanted to write this, the idea is good, and an initial implementation is not an unreasonable amount of work, why not do it? Well, even if a script written in this hypothetical language to translate from hierarchical to tabular data would have been pretty simple, it would still have been a program that the user of the tool needs to write in a dodgy DSL. And since the language would have been much more generic than a mere conversion tool, it would also have been impossible to guide the user through a process of iteratively building the right configuration (as is now done via the error messages).

In all likelihood it'd mean that nobody else would ever use the tool for the original purpose. The less powerful and less flexible version is just going to be more useful purely due to simplicity.

So sanity prevailed this time. But tune in for the next post for an earlier example of where my self-control failed.