Friday, September 28, 2018

One CSV, 30 stories (for small values of 30)

While re-reading some older posts on anonymity (of which more later, probably), and updating the occasional broken link, I happened to click through on the credit link on my profile picture.  Said link is still in fine fettle and, while it hasn't been updated in a while, and one of the more recent posts is Paul Downey chastising himself for just that, there's still plenty of interesting material there, including the (current) last post, now nearly three years old, with a brilliant observation on "scope creep".

What caught my attention in particular was the series One CSV, thirty stories, which took on the "do 30 Xs in 30 days" kind of challenge in an effort to kickstart the blog.  Taken literally, it wasn't a great success -- there only ended up being 21 stories, and there hasn't been much on the blog since -- but purely from a blogging point of view I'd say the experiment was a fine success.

Downey takes a single, fairly large, CSV file containing records of land sales transactions from the UK and proceeds to turn this raw data into useful and interesting information.  The analysis starts with basic statistics such as how many transactions there are (about 19 million), how many years they cover (20) and how much money changed hands (about £3 trillion) and ends up with some nifty visualizations showing changes in activity from day to day within the week, over the course of the year and over decades.

This is all done with off-the-shelf tools, starting with old-school Unix commands that date back to the 70s and then pulling together various free source from off the web.  Two of Downey's recurring themes, which were very much evident to me when we worked together on standards committees, um, a few years ago, are also very much in evidence here: A deep commitment to open data and software, and an equally strong conviction that one can and should be able to do significant things with data using basic and widely available tools.

A slogan that pops up a couple of times in the stories is "Making things open makes them better".  In this spirit, all the code and data used is publicly available.  Even better, though, the last story, Mistakes were made, catches the system in the act of improving itself due to its openness.  On a smaller scale, reader suggestions are incorporated in real time and several visualizations benefit from collaboration with colleagues.

There's even a "hack day" in the middle.  If anything sums up Downey's ideal of how technical collaboration should work, it's this: "My two favourite hacks had multidisciplinary teams build something, try it with users, realise it was the wrong thing, so built something better as a result. All in a single day!"  It's one thing to believe in open source, agile development and teamwork in the abstract.  The stories show them in action.

As to the second theme, the whole series, from the frenetic "30 things in 30 days" pace through to the actual results, shows an admirable sort of impatience:  Let's not spend a lot of time spinning up the shiniest tools on a Big Data server farm.  I've got a laptop.  It's got some built-in commands.  I've got some data.  Let's see what we can find out.

Probably my favorite example is the use of geolocation in Postcodes.  It would be nice to see sales transactions plotted on a map of the UK.  Unfortunately, we don't have one of those handy, and they're surprisingly hard to come by and integrate with, but never mind.  Every transaction is tagged with a "northing" and "easting", basically latitude and longitude, and there are millions of them.  Just plot them spatially and, voila, a map of England and Wales, incidentally showing clearly that the data set doesn't cover Scotland or Northern Ireland.

I wouldn't say that just anyone could do the same analyses in 30 days, but neither is there any deep wizardry going on.  If you've taken a couple of courses in computing, or done a moderate amount of self-study, you could almost certainly figure out how the code in the stories works and do some hacking on it yourself (in which case, please contribute anything interesting back to the repository).  And then go forth and hack on other interesting public data sets, or, if you're in a position to do so, make some interesting data public yourself (but please consult with your local privacy expert first).

In short, these stories are an excellent model of what the web was meant to be: open, collaborative, lightweight and fast.

Technical content aside, there are also several small treasures in the prose, from Wikipedia links on a variety of subjects to a bit on the connection between the cover of Joy Division's Unknown Pleasures and the discovery of pulsars by Jocelyn Bell Burnell et. al..

Finally, one of the pleasures of reading the stories was their sheer Englishness (and, if I understand correctly, their Northeast Englishness in particular).   The name of the blog is whatfettle.  I've already mentioned postcodes, eastings and northings, but the whole series is full of Anglicisms -- whilsta spot of breakfastcock-a-hoop, if you are minded, splodgy ... Not all of these may be unique to the British Isles, but the aggregate effect is unmistakeable.

I hesitate to even mention this for fear of seeming to make fun of someone else's way of speaking, but that's not what I'm after at all.   This isn't cute or quaint, it's just someone speaking in their natural manner.  The result is located or even embodied.  On the internet, anyone could be anywhere, and we all tend to pick up each other's mannerisms.  But one fundamental aspect of the web is bringing people together from all sorts of different backgrounds.  If you buy that, then what's the point if no one's background shows through?

2 comments:

earl said...

This speaks to my prejudice in favor of technique over technology. And the concept of agility seems to be just the attitude of any good designer that your best weapon is your critical sense, and your compulsion to discard anything that isn't going to work.

David Hull said...

See the followup post.