
Friday, September 28, 2018

One CSV, 30 stories (for small values of 30)

While re-reading some older posts on anonymity (of which more later, probably) and updating the occasional broken link, I happened to click through on the credit link on my profile picture.  Said link is still in fine fettle.  It hasn't been updated in a while -- one of the more recent posts is Paul Downey chastising himself for just that -- but there's still plenty of interesting material there, including the (current) last post, now nearly three years old, with a brilliant observation on "scope creep".

What caught my attention in particular was the series One CSV, thirty stories, which took on the "do 30 Xs in 30 days" kind of challenge in an effort to kickstart the blog.  Taken literally, it wasn't a great success -- there only ended up being 21 stories, and there hasn't been much on the blog since -- but purely from a blogging point of view I'd say the experiment was indeed a success.

Downey takes a single, fairly large, CSV file containing records of land sales transactions from the UK and proceeds to turn this raw data into useful and interesting information.  The analysis starts with basic statistics such as how many transactions there are (about 19 million), how many years they cover (20) and how much money changed hands (about £3 trillion) and ends up with some nifty visualizations showing changes in activity from day to day within the week, over the course of the year and over decades.

This is all done with off-the-shelf tools, starting with old-school Unix commands that date back to the 70s and then pulling together various free and open-source tools from the web.  Two of Downey's recurring themes, which were clear to me when we worked together on standards committees, um, a few years ago, are very much in evidence here: a deep commitment to open data and software, and an equally strong conviction that one can and should be able to do significant things with data using basic and widely available tools.
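Purely as illustration, here's the kind of first pass involved, sketched in Python rather than the shell pipelines Downey actually used.  The file name and column positions below are made up for the example, not the real layout of the data:

    import csv
    from datetime import datetime

    count = 0
    total = 0
    years = set()

    # "transactions.csv" and the column positions are hypothetical stand-ins.
    with open("transactions.csv", newline="") as f:
        for row in csv.reader(f):
            price = int(row[1])                                 # assumed: sale price
            when = datetime.strptime(row[2][:10], "%Y-%m-%d")   # assumed: date of transfer
            count += 1
            total += price
            years.add(when.year)

    print(f"{count:,} transactions over {len(years)} years, £{total:,} in total")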

A slogan that pops up a couple of times in the stories is "Making things open makes them better".  In this spirit, all the code and data used is publicly available.  Even better, though, the last story, Mistakes were made, catches the system in the act of improving itself due to its openness.  On a smaller scale, reader suggestions are incorporated in real time and several visualizations benefit from collaboration with colleagues.

There's even a "hack day" in the middle.  If anything sums up Downey's ideal of how technical collaboration should work, it's this: "My two favourite hacks had multidisciplinary teams build something, try it with users, realise it was the wrong thing, so built something better as a result. All in a single day!"  It's one thing to believe in open source, agile development and teamwork in the abstract.  The stories show them in action.

As to the second theme, the whole series, from the frenetic "30 things in 30 days" pace through to the actual results, shows an admirable sort of impatience:  Let's not spend a lot of time spinning up the shiniest tools on a Big Data server farm.  I've got a laptop.  It's got some built-in commands.  I've got some data.  Let's see what we can find out.

Probably my favorite example is the use of geolocation in Postcodes.  It would be nice to see sales transactions plotted on a map of the UK.  Unfortunately, we don't have one of those handy, and they're surprisingly hard to come by and integrate with, but never mind.  Every transaction is tagged with a "northing" and "easting", basically latitude and longitude, and there are millions of them.  Just plot them spatially and, voila, a map of England and Wales, incidentally showing clearly that the data set doesn't cover Scotland or Northern Ireland.
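The trick is nothing more than a scatter plot.  A rough sketch, again in Python and with made-up file and column names:

    import csv
    import matplotlib.pyplot as plt

    xs, ys = [], []
    # File and column names are hypothetical stand-ins for the real data.
    with open("transactions-with-coords.csv", newline="") as f:
        for row in csv.DictReader(f):
            xs.append(float(row["easting"]))
            ys.append(float(row["northing"]))

    plt.figure(figsize=(6, 9))
    plt.scatter(xs, ys, s=0.1, alpha=0.2)   # millions of tiny translucent dots
    plt.gca().set_aspect("equal")           # don't stretch the "map"
    plt.axis("off")
    plt.savefig("dot-map.png", dpi=200)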

I wouldn't say that just anyone could do the same analyses in 30 days, but neither is there any deep wizardry going on.  If you've taken a couple of courses in computing, or done a moderate amount of self-study, you could almost certainly figure out how the code in the stories works and do some hacking on it yourself (in which case, please contribute anything interesting back to the repository).  And then go forth and hack on other interesting public data sets, or, if you're in a position to do so, make some interesting data public yourself (but please consult with your local privacy expert first).

In short, these stories are an excellent model of what the web was meant to be: open, collaborative, lightweight and fast.

Technical content aside, there are also several small treasures in the prose, from Wikipedia links on a variety of subjects to a bit on the connection between the cover of Joy Division's Unknown Pleasures and the discovery of pulsars by Jocelyn Bell Burnell et al.

Finally, one of the pleasures of reading the stories was their sheer Englishness (and, if I understand correctly, their Northeast Englishness in particular).   The name of the blog is whatfettle.  I've already mentioned postcodes, eastings and northings, but the whole series is full of Anglicisms -- whilst, a spot of breakfast, cock-a-hoop, if you are minded, splodgy ... Not all of these may be unique to the British Isles, but the aggregate effect is unmistakeable.

I hesitate to even mention this for fear of seeming to make fun of someone else's way of speaking, but that's not what I'm after at all.   This isn't cute or quaint, it's just someone speaking in their natural manner.  The result is located or even embodied.  On the internet, anyone could be anywhere, and we all tend to pick up each other's mannerisms.  But one fundamental aspect of the web is bringing people together from all sorts of different backgrounds.  If you buy that, then what's the point if no one's background shows through?

Thursday, September 10, 2009

Tools of choice

In real life I'm a software developer. That doesn't figure in much here, probably because as far as the web is concerned I'm an ordinary user, not a developer. However, one place I use the web is at work. No, not to browse fascinating articles from the blogosphere, unless the article happens to answer a particular vexing question I'm dealing with. My web use at work basically boils down to
  • gmail
  • a web-based bug tracking system
  • searches now and then for answers to vexing questions
  • researching and downloading open source software
This last bullet item has significantly changed the way software developers work, at least in the Java corner of the world where I dwell. Your mileage may vary, but for a large and still-growing set of typical problems, downloading a package and using it is likely to be a better option than rolling your own. You can't beat the price, and the time between "that looks interesting" and actually using the package is generally measured in minutes. There's no obligation and if the package is not quite right, the source code is right there.

Some examples from the toolkit I use at work
  • Java itself and its libraries are now essentially open source.
  • The Eclipse IDE. Now, I realize that IDE wars are to our time what editor wars were to the previous generation (um, that would be my generation, I guess), but Eclipse is the one I happen to use for a variety of reasons. One caveat: Eclipse is not just an IDE. It's really a whole platform. It slices. It dices. It has distros like Linux has distros. If you're not careful you can end up with a bloated mess. If you pick and choose, though, you can end up with a very nice, usable, though still memory-hungry tool.
  • Subversion for version control. Again, other worthy choices are available.
  • JUnit. The value here is not so much the code as the mere fact of putting something out there as a framework for writing unit tests. That said, I've had no complaints about the code.
  • Apache Ant for builds. I actually don't use Ant directly these days, but I rely on it behind the scenes. Having seen one too many Makefiles that ate Chicago, I have no plans to go back to make.
  • Apache in general for a variety of useful libraries, including networking (Mina) and general utilities (Commons)
  • A new favorite for taming Swing: MiG Layout. If you've ever considered fleeing to a tall mountain in Nepal rather than hassle another mysterious problem with GridBagLayout and its little friends, check MiG out. Your life will become better.
Naturally this is just a particular, idiosyncratic view of what's out there. If you run Linux (as I do at home), there's the whole GNU/Linux/git/gcc/autoconf/gmake/... toolchain. If you like Perl, or Ruby, or Python, each is its own little universe. Any way you slice it, the amount of stuff out there is impressive.

Back at the blog, is this a real live example of disruptive technology? If so, what is the disruptor? Is it the concept of open source? Is the enabling technology the internet, the web, or some combination of both? How much does it matter that much of the internet and web as we know it rests on open source/free software? Why am I carefully saying "open source" here and not "free"? How many threes are there in a dozen?

All interesting questions except perhaps the last, but not ones I'm going to tackle just now.

[I still use Java, Eclipse and JUnit.  I'd now recommend git over Subversion for version control.  For various reasons, I don't have much occasion to use the rest of the list these days. --D.H. Dec 2015]

Friday, May 22, 2009

In which I reach no particular conclusion about open source

I'd originally expected to file this under "not really about the web but I'm posting it anyway," but all roads lead back to the web. Perhaps it's more germane than I'd originally thought. Nonetheless, you may wish to skip the geekly details (which I've indented) and go straight to the lack of conclusion at the bottom.
I've been experimenting a bit with video capture on Ubuntu as a means of smashing old analog tapes to bits. To that end, I bought an inexpensive video capture device that takes video in one end and puts USB out the other. It worked out of the box, sort of, on an aging Windows box, but seemed to drop frames, probably because the aging Windows box lacked the horsepower. So I tried plugging the thing into my Ubuntu box.

At first, nothing at all happened. The kernel wouldn't recognize the device as anything but a random USB device. A little googling (see, I told you the web was involved) and a look at dmesg showed why: the device wasn't in the driver's list of supported devices. But at least I could hack the driver that comes with the distro to put it on the list. As it happens, all I really needed to do was change one byte of the driver (better fixes were possible, but that was enough to make the driver recognize the device).

Ah, but while modern distros are still hackable -- and have to be, to qualify as open/free -- they're not shipped that way. A modern distro is a bunch of binaries along with the artifacts needed for their care and feeding. Source is separate. So ... download the source and requisite tools, and find out the preferred build command; Ubuntu conveniently provides this on a page heavily larded with "are you really sure you need to do this?"

The preferred build command rebuilds everything, as it's aimed more at someone trying to create a package for a distro than at a casual developer. In my case all I needed to do was change one byte of one driver. Further, I couldn't figure out where the giant build had put the results of my one-byte change. Somewhere, probably. After a while, using different instructions on the Ubuntu page, which looked much more like what I was used to, I was able to build a driver that recognized the device.

Unfortunately, that version doesn't quite work, for reasons I no longer recall. More googling determines that the latest version of the driver supports the card directly without the problem. Like many major distros, Ubuntu doesn't ship with the latest and greatest version of many components, so it's not a surprise that there would be a newer one. In this particular case, Ubuntu lags a bit farther behind the latest because the main developer, who has ready access to the actual chipsets and so is pretty well technically qualified, has had some sort of dispute with -- I forget who, but some segment of the community.

However, the source is readily available, and even better, the driver is nice and self-contained (they generally are), so I can rebuild it quickly and modprobe it in. Sure enough, I do that and it "just works". The device is recognized. The other problem I'd been having is gone.

But the picture looks funny. It looks like NTSC is being interpreted as PAL, or something similar. Sure enough (after stumbling down several blind alleys), I check the source code and notice the driver expects the card to be speaking PAL. Not a surprise since the main developer lives in Europe. One three-line copy/paste later, the grabber is working fine. I post the patch to what looks like a relevant forum (look ma, I'm an open source developer!) and feel pretty good about myself.

But, while I can watch the incoming video on screen just fine, I can't figure out how to record it to disk. Which is what I came here for. There are approximately 5,923 different video programs to choose from. OK, more like a half dozen. On the one hand, there is Kino, which works just fine for devices with a FireWire connection, but doesn't seem to know anything about the USB family. Likewise with dvgrab. There appears to be some combination of kernel modules that will get you around this, but I haven't chased that down yet.

On the other hand are the approximately 5,922 programs for watching TV on your computer, which assume you have a USB device hooked up to a TV tuner. Each of them has its own quirks and requires its own special bit of hand-holding to get something showing on the screen, but the ones that can display seem to have trouble saving the video and the ones that might be able to do that can't seem to talk to the device.

That's where I am at the moment. I'm sure I'll chase down the last bit pretty quickly, but an out-of-the-box experience it wasn't.
So ... are closed systems inherently better? You don't see problems like this on Windows, partly because the manufacturer always ships a Windows driver along with the device and often ships a compatible application for good measure. It's even less of a problem on the Mac. Simply place the device in the same room as the Mac and the Mac will install the appropriate drivers, figure out what you want to watch, draw a nice facsimile of brushed chrome around the video window and fix you a latte.

In comparison to that, the Linux experience is pure chaos. In particular, even if I'd just grabbed the driver source and installed it to begin with, delving through C code seems a poor way to say simple things like "the device ID is actually 1234:abcd and not 1234:5678" and "no really, this card also understands NTSC."

Except ...

My experience is that modern distros, for the most part, "just work." I've been running Ubuntu for years now, and this is the first time I've found any need to recompile anything. Conversely, it's certainly possible to have driver problems of the same sort under Windows. Given that the driver ships with the device, detecting the device and figuring out what it supports are much easier. The problem is how the device driver interacts with the rest of the system, and that can vary depending on which of the zillions of different setups you actually have.

The Mac gets around this by tightly controlling the hardware and the software around it. This works, but the flip side is that some aspects of the system are fundamentally closed. For this and other reasons, Macs are considerably more expensive.

Yes, this particular corner of Linux seems fairly messy, particularly with the USB/FireWire split -- why should I care what kind of wire the video's coming over? -- and the apparent disconnect between the driver developer and the rest of the kernel community.

But these aren't open source problems. They're software problems. Any sufficiently large software organization is going to have occasional arbitrary distinctions and political friction. The threshold for "sufficiently large" here is probably a handful of people. The more relevant question is to what extent open source is more or less liable to have such problems. Dunno.

Against that, you have the fundamental advantage of being able to fix it yourself if you need to. It's annoying that things don't just work out of the box, and annoying that the most effective way of fixing the problem involved digging around in the driver source, but at least I could do that. In the proprietary world, you're generally stuck waiting for the next release [which, to be sure, has always worked before and nearly worked this time].

Is hackability worth the trouble? For an everyday user, having to fire up obscure tools, or even a command line, is not really acceptable. The real benefit is that any everyday user might also be a qualified developer who could help with a problem. Hackability makes it much easier for that person to get involved. The benefit to the everyday user is indirect: the larger pool of developers means a better system down the line.

So. Conclusions, or lack thereof? Not much, but maybe this: open source and the web together are powerful, but not all-powerful. But then, neither is anything else.

Friday, March 6, 2009

The point of the postscript

In a previous rumination over Open Source, I quoted Linus's original Usenet post announcing the beginnings of Linux. I left it sitting there somewhat ambiguously, and since a certain amount of ambiguity can be useful, and since this is a blog, not a wiki, I'm going to leave it that way in the original and comment on it here instead.

The point I was making is this: The post strongly suggests Linus thought he was just throwing stuff at the net, but this is highly ironic in light of what actually played out. Little did he know ...

One might go further and speculate about why one would want to put something like that out and invite feedback. My personal guess is that Linus's post cautiously understates what he thought his as-yet-unnamed OS might become. This, in turn, vastly underestimates what it actually has become.

Sunday, March 1, 2009

Revolution OS and thereabouts

OK, so I just watched Revolution OS (on the Roku/Netflix box, of course), which I'd been putting off out of concern it might be more propaganda than information. The opening minute or so did little to allay that, but it turned out to be a pretty good documentary, and as even-handed as you could expect from something that interviewed Open Source folks exclusively. It did this by getting in touch with several of the principals, including rms, esr and Linus, and pretty much just letting them talk. This is often a good idea, especially when the principals involved are thoughtful, creative, articulate and intellectually curious.

What emerged was a clear picture of the history of Free Software/Open Source, how "Open Source" came to be the dominant name, and the essential differences between the two: Free Software advocates want all software to be free because it's a Good Thing. Open Source advocates want particular software to be open because it's a Useful Thing. It may not surprise the attentive reader that I tilt toward the latter.

There are ironies along the way: a small one in Netscape adopting Open Source not through grassroots activism by engineers -- though this did occur -- but because it was eventually imposed from the top down by management; a large one in that the entire Open Source movement, which is at best indifferent to rms's central goal of making all software free, depends crucially on GNU code and even more crucially on the GPL [more precisely: on the GPL and licenses directly influenced by it]. Rms himself points this out in his acceptance of the Linus Torvalds award at the 1999 LinuxWorld. Linus's daughters trot back and forth behind him on stage all the while.

If you're looking for a spirited debate over Open Source vs. not-open, you won't find it -- except for an early quote from Bill Gates (who did not directly participate), there are no dissenting voices. If you're looking for a knock-down, drag-out Linux vs. Windows fight, as the marketing collateral implies, you won't find that either. And a good thing. Revolution OS is much more a chance to put a human face on the names you see floating around and get an idea of what they were thinking. Considered from that point of view, it succeeds nicely.


But I didn't really set out to write a movie review here. I really just wanted to share something amusing I ran across while chasing a link from a link from a page I looked up out of curiosity after watching the movie. This is from Jamie Zawinski, who has done more Open Source development than most of us, I would wager. Zawinski says:
But now I've taken my leave of that whole sick, navel-gazing mess we called the software industry. Now I'm in a more honest line of work: now I sell beer.

Specifically, I own the DNA Lounge nightclub in San Francisco. However, it takes quite a lot of software to keep the place running, because we do audio and video webcasts twenty-four hours a day, and because the club contains a number of anonymous internet kiosks. So all that code is also available.

This all sounds fine and noble, and I like the design decisions, but I have to wonder: Just how anonymous can an internet kiosk be in a nightclub full of webcams? Checking sports scores without having to establish an account anywhere? Sure. Plotting world domination? Maybe not so much.

By the way, this turns out to be post number 300. I made a production of 100 and 200, but from here on out I probably won't until some more significant milestone. Hmm ... 100π is about 314 ...

Friday, February 27, 2009

Dear Mozilla, again

Did I also mention it sure would be nice if I could tell which Firefox tab is running a (possibly invisible) Java or Flash script that's eating my CPU, or which ones are gobbling up memory?

Didn't I also make a comment at some point about a modern browser being its own operating system in all but name, but without the usual administrative tools? I can't be bothered to search back for it, but I'm pretty sure I did ...

If said features are already available (maybe as a plug-in), I would be glad to hear about them and praise their authors. If not, no, I'm not going to even think about diving into the code to implement them. I will remain a selfish, greedy little user sponging off the efforts of selfless volunteers.

I'll be in good company.

Thursday, February 26, 2009

Open source vs open source

This piece arose out of a discussion on a mailing list I'm on. The discussion was about, among other things, why open source projects fail. By the time I finished a particular email, I realized I was basically blogging (not coincidentally, my output to that particular list dropped considerably when I started blogging here). Here's what I wrote, toned down a bit and edited lightly to take the later discussion into account:

It occurs to me there are two kinds of open source. First, there is stuff someone put together and threw out to the web. Much of this is of limited value. There might be a good idea in there, but chances are it's not really new, or it's not fully worked out. It might or might not be well coded. It's probably indifferently documented. The developer works on it whenever, so there's no release schedule. For my money, such projects don't fail -- they never really start.

Then there are projects like Apache, or Mozilla, or Eclipse or Red Hat, or the various Google offerings, that are developed and supported full time by some sort of durable entity [Later in the discussion I unwisely called this a corporate entity, in a legalistic sense. This proved to have too many overtones, so I switched to "institution," which is still not ideal.] Generally there's commercial money in it one way or another, but the important quality is that there are numbered, scheduled releases and there are multiple people working on it as part of their day job.

But what about Linux, Python, Perl, the GNU tools and such? They may have started out in the first category [Re-reading, no. They weren't just thrown out on the web. They were generally well along before the world saw them -- but see the Postscript below for an interesting twist], but at some fairly early point the person behind them made a conscious decision to move to the second. Linus could have decided "Hey, that kernel thing was cool. I think I'll do something else," but instead he spearheaded the move from 0.9.x to 1.0.x (note -- release numbers) and has stuck with it up to 2.6.x. The Linux kernel arguably has one of the least formal and most distributed development processes of the major projects I'm aware of, but even then there is a single gateway for significant changes and if Linus should get hit by a bus, there are known people who could take over.

Python's Benevolent Dictator for Life, Guido van Rossum, works at Google, where he spends half his time on Python (evidently 50% for Guido is 20% for everyone else). BDFL Larry Wall's early development of Perl took place during his employ at JPL. While Perl is another of the less formal examples, there is still an elaborate structure around the development and release of Perl.

Rms took the more formal route, founding a foundation and drafting the famous GNU licenses. Being rms, he secured his own money for it, some of it thanks to a grant from another foundation, the MacArthur Foundation. This is not a path lightly traveled, but the destination, again, is an institution dedicated to supporting the software.

In short, whenever something significant has happened in open source, it's because someone explicitly pushed for it. The actual coding, testing, documentation etc. might be distributed and more or less volunteer, but at the bottom (or top, if you prefer), there is a small, single point of control. This point of control tends to become institutional, that is, an entity distinct from any individual, fairly quickly.

There is a "religious" aspect to open source that many people, including myself, instinctively distrust. It relates to the notion that open source stuff "just happens" and that if you just throw stuff out on the web, or maybe even just hope it will happen, you'll magically get a robust, coherent and useful product. Experience shows, not surprisingly, that this just doesn't happen. Generally, a project has to start with a robust, coherent and useful product before a culture can emerge around it.

PostScript: My pondering on this topic keeps returning to Linux. On the one hand, it definitely supports the idea that good free software doesn't just emerge out of the web, but requires the ferocious dedication of a single person or small group. On the other hand, it has had a less romantic version of the idea of informal, distributed development in it from the beginning. Linus's git source control system, which deliberately has no central repository, is a more recent manifestation. In this light, it's interesting, not to mention somewhat amusing, to read Linus's original Usenet post to the world:

Hello everybody out there using minix -

I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. This has been brewing since april [i.e., a few months], and is starting to get ready. I'd like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I've currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I'll get something practical within a few months, and I'd like to know what features most people would want. Any suggestions are welcome, but I won't promise I'll implement them :-)

Linus (torvalds@kruuna.helsinki.fi)

PS. Yes – it's free of any minix code, and it has a multi-threaded fs. It is NOT portable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.

Thursday, February 12, 2009

Dear Mozilla

Is there any chance that when Firefox starts and several things require passwords, it could ask for the master password once, instead of over and over again? I'm pretty sure it doesn't change.

Just asking.

(This is a known problem, since 2006. So much for the "all bugs are shallow in the open source world" theory.)

Wednesday, April 30, 2008

Two ways to digitize books

I recently said that I didn't expect everything in print to be available online digitally any time soon. One reason is that the fundamental question of who gets paid when and how for copyrighted material is far from settled. Another is the sheer volume of books out there.

All I know is what I read in the paper, um, I mean the online version of the paper, but as I understand it there are two competing approaches to this at the moment. Google and Microsoft will come in and digitize your books for you. All they ask is that they retain certain rights to the digital version, like the exclusive right to index it online.

The Open Content Alliance, on the other hand, will digitize the book and make it available to all. But it will cost you: the $30 or so it takes to digitize a book is split among you, the alliance and its benefactors.

Despite the cost, many research libraries are finding it more in keeping with their mission to make the digital content available without restriction. This will be an interesting test of the "information wants to be free" theory.

Wednesday, February 6, 2008

Tourism today

Here's Richard Stallman talking to an audience in Stockholm, at the Kungliga Tekniska Hogskolan (Royal Institute of Technology), in October 1986. I've edited a bit to draw out the point I want to make. Please see the original transcript for further detail.
Now "tourism" is a very old tradition at the AI lab, that went along with our other forms of anarchy, and that was that we'd let outsiders come and use the machine. Now in the days where anybody could walk up to the machine and log in as anything he pleased this was automatic: if you came and visited, you could log in and you could work. Later on we formalized this a little bit, as an accepted tradition specially when the Arpanet began and people started connecting to our machines from all over the country.

Now what we'd hope for was that these people would actually learn to program and they would start changing the operating system . If you say this to the system manager anywhere else he'd be horrified. If you'd suggest that any outsider might use the machine, he'll say ``But what if he starts changing our system programs?'' But for us, when an outsider started to change the system programs, that meant he was showing a real interest in becoming a contributing member of the community.

We would always encourage them to do this. [...] So we would always hope for tourists to become system maintainers, and perhaps then they would get hired, after they had already begun working on system programs and shown us that they were capable of doing good work.

But the ITS machines had certain other features that helped prevent this from getting out of hand, one of these was the ``spy'' feature, where anybody could watch what anyone else was doing. And of course tourists loved to spy, they think it's such a neat thing, it's a little bit naughty you see, but the result is that if any tourist starts doing anything that causes trouble there's always somebody else watching him.

So pretty soon his friends would get very mad because they would know that the continued existence of tourism depended on tourists being responsible. So usually there would be somebody who would know who the guy was, and we'd be able to let him leave us alone. And if we couldn't, then what we would [do] was we would turn off access from certain places completely, for a while, and when we turned it back on, he would have gone away and forgotten about us. And so it went on for years and years and years.
In sum:
  • Everyone can change the system. In fact, everyone is openly encouraged to change the system.
  • People who make good changes rise in the ranks and eventually help run the place.
  • Everyone can see what people are up to, and in particular what changes they're making.
  • Vandals are locked out and generally go away after a bit (likely to be replaced by new, nearly identical vandals). Sites can be blocked if blocking individuals doesn't work.
One of rms's main themes is that this is The Way Things Should Be. Now, it's all well and good to have this sort of semi-anarchic meritocracy in the hallowed halls of academia, with access physically limited to those who worked in the lab or wandered in off the street. It might even work on the early Arpanet, several orders of magnitude smaller than today's wild and woolly web. But surely it'll never work on a big scale in the commercial world. After all, one of rms's other main themes is that commercialism killed the AI lab (at least in the form he describes) and threatens environments like it.

The obvious counter-argument is that open source software (a term, I should mention, that rms disfavors) works quite well on the same basic principles, albeit not always strictly according to the FSF model. It's a good point, but the average open source project is a fairly small system. I doubt there are many with more than a few dozen active participants at any given time. Such projects also tend to have a limited audience, a limited pool of potential contributors, or both.

However, there is at least one very large and prominent system, with hundreds of thousands of participants, that the bullet points above describe almost as though I'd written them with it in mind (which I did): Wikipedia.

Wednesday, December 26, 2007

80% of the solution in a fraction of the time

As can happen, I set out to write this piece once already, only to end up with a slightly different one. Here's another take, bringing Wikipedia into the picture.

First, let me say I like Wikipedia. A quick scan will show I refer to it all the time. I see it as a default starting point for information on a particular topic (as opposed to a narrowly-focused search for a given document or type of document). I don't see it as definitive, but I don't think that's really its job.

Wikipedia would seem a perfect test case for Eric S. Raymond's formulation of Linus's Law ("Given enough eyeballs, all bugs are shallow"). But -- as Wikipedia's page on Raymond dutifully reports -- Raymond himself has said, well, here's how it came out in a New Yorker article:
Even Eric Raymond, the open-source pioneer whose work inspired Wales, argues that “ ‘disaster’ is not too strong a word” for Wikipedia. In his view, the site is “infested with moonbats.” (Think hobgoblins of little minds, varsity division.) He has found his corrections to entries on science fiction dismantled by users who evidently felt that he was trespassing on their terrain. “The more you look at what some of the Wikipedia contributors have done, the better Britannica looks,” Raymond said. He believes that the open-source model is simply inapplicable to an encyclopedia. For software, there is an objective standard: either it works or it doesn’t. There is no such test for truth.
Let's start right there. Software doesn't simply either work or not. You can't even put it on some sort of linear, objective "goodness" scale. Even in cases where you'd think software is cut and dried, it isn't. Did you test that sort routine with all N! combinations of N elements? Of course you didn't. Did you rigorously prove its correctness? How do you know your correctness proof is correct? Don't laugh: Mathematicians routinely find holes in each other's proofs, in some cases even after publication.
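To put a number on that, here's a quick back-of-envelope sketch (Python, purely for illustration) of why exhaustive testing of even a sort routine only works for toy inputs:

    from itertools import permutations
    from math import factorial

    def exhaustively_test_sort(n):
        """Check sorted() against every ordering of n distinct elements."""
        return all(sorted(p) == list(range(n)) for p in permutations(range(n)))

    print(exhaustively_test_sort(8))   # 40,320 cases: feasible
    print(f"{factorial(20):,}")        # 2,432,902,008,176,640,000 cases: not so much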

But most software is nowhere near this regime. Often we don't even know exactly what we're trying to write when we set out to write it (thus much of the emphasis on "agile" development techniques). In the case of something like a game, or even a website design, most of what we're after is a subjectively good experience, not something objectively testable (though ironically games seem to put a bigger premium on basic correctness, since bugs spoil the illusion).

It's not even completely clear when software doesn't work. If a piece of code is supposed to do X and Y, but in fact does Y and Z, does it work? It does if I need it to do Y or Z. What if it hangs when you try to do X, but there's an easy work-around? What if it hangs at random 10% of the time when you try to do X, but that's tolerable and nothing else does X at all? What if it does X if a coin flip comes up heads, but might not if it doesn't? I'm not making that one up. See this Wikipedia article (of course) for more info. What if it's an operating system and it just plain hangs some of the time? Not that that would ever happen.
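For the coin-flip case, the classic example of the species is a randomized algorithm. Here's a minimal sketch of a Fermat primality test -- my illustration, not necessarily what that link pointed to. Each trial can be fooled, but the odds of a wrong answer shrink with every extra flip:

    import random

    def probably_prime(n, trials=20):
        """Fermat test: True means "almost certainly prime". Rare composites
        (the Carmichael numbers) can still fool it."""
        if n < 4:
            return n in (2, 3)
        for _ in range(trials):
            a = random.randrange(2, n - 1)
            if pow(a, n - 1, n) != 1:   # a witnesses that n is composite
                return False
        return True

    print(probably_prime(97))   # prime: True
    print(probably_prime(91))   # 7 * 13: almost certainly False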

All of this to say that I doubt that software and encyclopedia entries make such different demands on their development process. And as a corollary, I think the results are about the same. Namely, there are excellent results in some cases, reasonable but not excellent results in many cases, and occasional out-and-out garbage.

Here's what I think goes on, roughly, in both cases:
  • Someone comes up with an idea. That person may be an expert in the field, or may just have what looks like a neat idea.
  • The original person produces a first draft, or perhaps just a "stub", or an "enhancement request".
  • If no one with the expertise to take it further is persuaded to do so, it stays right there indefinitely, or may even be purged from the system (perhaps to re-appear later).
  • If the idea has legs, one or more people take it up and make improvements.
  • Typically, one of three stable states is reached:
    • It's perfect. Nothing more than minor cosmetic changes can be added. New ideas along the same lines typically become their own projects.
    • It's good enough for everyone currently involved. That may not be particularly good, but no one can be persuaded to go further. This may be the case right out of the gate, or after several rounds of fixes by a single originator.
    • It's not good enough, but there is no agreement on how to take it further. Work may grind to a halt as competing fixes go in and come out, or the project may split into two similar projects, with better or worse sharing of common material and effort.
Thinking it over, this process is not unique to open source. The magic of the open approach is that the bigger the pool of participants, the bigger the chance that an idea with legs will get supporters and get fleshed out, and the faster it will get to a stable state. In our imperfect world, that stable state is generally short of perfection. Put the two together and you have 80% of the solution in a fraction of the time.

That said, there are some differences between prose and software. I've argued above that software isn't hard and fast. It's soft, in other words. But prose is even softer. As a result, there is greater potential for disagreement on where to go, and in case of disagreement, there looks to be a better chance of thrashing back and forth with competing fixes, as opposed to moving forward but with separate (and to some extent redundant) solutions.

Wikipedia does seem to attract more vandals, but this is not necessarily because it's not software. It may also be because it openly invites frequent edits from a very large pool and changes are moderated after the fact. Open software projects, particularly critical pieces like kernels and basic tools, tend to require changes to pass by a small group of gatekeepers before being checked in. Conversely, some wikis are moderated.

As usual, this is all just my rough "figuring it out as I go along" guess, not anything with actual numbers behind it, but that's my story and I'm sticking to it for now.

Saturday, December 22, 2007

Eyeballs and shallow bugs

Eric S. Raymond has asserted that "given enough eyeballs, all bugs are shallow", a principle he calls Linus's law after Linus Torvalds (my fingers want to type "Linux Torvalds").

How many eyeballs are enough? How many eyeballs are available? What does it take to get a "shallow" bug fixed, checked in and tested? What's a bug, anyway? A few meditations:

How many eyeballs are enough?
Suppose an excellent kernel hacker has a 90% chance of nailing a given bug. What are the chances that two excellent kernel hackers can nail the bug? Well, it's not 180%. It will range from 90% (if the second doesn't know anything new about the particular problem) to 100% (if the second knows everything the first one doesn't).

If the two are completely independent sources of information there's a 99% chance one or the other will nail it. But what are the odds two people got to be excellent kernel hackers by completely independent routes? Now, what are the odds that a bunch of excellent kernel hackers, quite a few more reasonable kernel hackers and a horde of non-specialists could nail the bug? Pretty good, I'd say, but not 100%.
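The arithmetic behind those numbers, for the fully independent case (real reviewers overlap, so treat this as an upper bound on the benefit of more eyeballs):

    def p_nailed(p_each, n):
        """Chance at least one of n fully independent reviewers nails the bug."""
        return 1 - (1 - p_each) ** n

    print(p_nailed(0.9, 1))    # 0.9
    print(p_nailed(0.9, 2))    # 0.99, the two-hacker case above
    print(p_nailed(0.3, 10))   # ~0.97: many weaker eyeballs add up, if independent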

How many are there?
I've been doing software for a while now. I've built a few Linux kernels (roughly the equivalent of changing a tire on a car), and I've looked at small portions of the source (roughly the equivalent of opening the hood, pointing and grunting). If some subtle race condition should creep into the next version of the kernel, the odds that I could contribute something useful to the conversation are approximately zero (the automotive equivalent might be, say, being able to help fix a problem in a Formula One engine design).

The most qualified people in such a situation are the small and dedicated core of kernel maintainers and whoever put in the changes that turned up the problem. These may well be the same people. I know a fair bit about software in general, and even a bit about race conditions in general, but I know essentially nothing about the details of the kernel, the design decisions behind particular parts, the hidden pitfalls and so forth.

This being open source, much of that information is available, directly or indirectly. The limiting factor is the ability to absorb all of the above. This takes not only skill but time and dedication. The natural consequence is that, for at least some bugs, there just aren't enough eyeballs available to make them "shallow". Instead, someone will have to expend considerable brain sweat figuring out what happened.

[Another not-infrequent case: Lots of people see a bug, but no one can quite nail down what's causing it, much less suggest a fix. Filing good bug reports takes practice, just like writing good code does. Eliminating all the variables in a typical desktop environment takes time, even for someone with lots of practice. As a result, the people who could fix the bug don't have enough information to go on and probably have bigger fish to fry.]

What does it take to get a shallow bug fixed and tested?
Suppose that the broken code in question (kernel or otherwise) has passed by enough eyeballs that someone has said "Hey, that's easy, you just need to ..." That person puts in a fix. Are we done? No. At a minimum, someone needs to test the fix, preferably someone other than the fixer. Someone should also look over the code change and make sure it fits in well with the existing code. And so forth. Open source doesn't remove the need for good software hygiene. If anything, it increases it.

What's a bug, anyway?
Suppose not just one person steps up with a fix to some bug. Suppose two or three people do. Unfortunately, they don't exactly agree on the fix. Maybe one wants to patch around the problem, one has a small re-write that may also fix some other problems, and another thinks the application wouldn't have such bugs if it were structured differently. Someone else might even argue that nothing needs to be fixed at all.

Expediency will tend to favor the patch, and expediency is often right. The small re-write has a chance if the proponent can convince enough people that it's a good thing. The re-structured system will probably need to be a whole new project, potentially splitting up the pool of qualified eyeballs.


So does this mean that open source is a crock? Not at all. Most of the problems I've pointed out here aren't open source things. They're software things. Open source offers a number of potential advantages in dealing with them. One I think may be overlooked is that writing a system so that anyone anywhere can check it out and build it, and so that several people can work on it simultaneously and largely independently, enforces a certain discipline that's useful anyway. If your code's a mess or no one else can build it and run it, you're not going to get as many collaborators.

On the other hand, open source isn't a magical solution to all of life's problems, and there are arguably cases where you just need someone to say "Today we work on X," or "We will not do Y." Strictly speaking, that kind of control is a separate question from whether the source is freely available, but Linus's law assumes that eyeballs are not being commanded to look elsewhere.

So is Linus's law a crock? Not at all. It captures a useful principle. But like most snappy aphorisms, it only captures an ideal in a world that's considerably messier and more intricate.

[A few years later, a couple of striking examples turned up, in the form of Heartbleed and Shellshock -- D.H. Dec 2018]