Monday, January 4, 2010

Google Books vs. the Library of Alexandria

I received The Case for Books, by Robert Darnton, as a present this year from someone saying they'd sent me a book case. As I'm not above the occasional bad pun myself, I went ahead and read it. I'm glad I did, though as usual I didn't completely agree with all of the positions put forth.

Robert Darnton is (among other things) the director of the Harvard University Library, and assumed that role just as negotiations over Google Books were heating up. He was also a proponent of Harvard's open access policy, whereby most research publications become publicly available on the web unless the authors specifically opt out, and spearheaded Gutenberg-e, which sought to get top-quality dissertations in history reworked and expanded into electronic book form. In short, he knows a little something about books, both electronic and traditional.

On top of this, he writes clearly and engagingly. If you want to stop right now and get the book in order to see what such a person is thinking about the web and books, I'd completely understand. In any case, I would like to devote this and a few upcoming posts to examining Darnton's thoughts in more detail.


The book consists of previously-published essays on various topics, grouped into Future, Present and Past in that order. The first essay forthrightly answers the question of what effect the Google Books settlement will have on the world of books: "No one knows," he says, "because the settlement is so complex that it is difficult to perceive the legal and economic contours in the lay of the land." Now that's an answer I can trust, particularly coming from someone intimately involved in the process.

Darnton's thesis here is that, while it's perfectly fine for businesses to make money and for authors to make money through copyrights, it's the responsibility of the public at large to push back in the direction of making the web more democratic. Darnton argues this is necessary for a number of reasons, some less widely publicized than others. In the context of Google Books, several seem particularly noteworthy:
  • Google is obtaining an effective monopoly on access to a large number of digitized books. Google means not to use its awesome power for evil, but it's awesome power nonetheless.
  • Google's terms of access, while generally reasonable and in particular "a boon to the small-town, Carnegie-library readers," are "hedged with restrictions." For example, access is limited to one terminal per library, no matter the size of the library.
  • Market forces cannot act correctively should Google or some successor in future decades choose to favor profit over access because "Students, faculty and patrons of public libraries will not pay for the subscriptions. The payment will come from the libraries[.]"
  • (This I had not realized at all) The libraries Google is digitizing "will not come close to exhausting the stock of books in the United States," and "contrary to what one might expect, there is little redundancy in the holdings of the five libraries [Google has partnered with.]"
    • There are about 543 million volumes in the research libraries of the United States, of which Google intends initially to digitize 15 million.
    • 60% of the books being digitized by Google exist in only one of the five libraries.
    • Google has not even begun to digitize the libraries' special collections.
The monopoly argument is the elephant in the room. I've met people who work at Google. They're serious, dedicated, scary-smart geeks who want to build awesome software that makes the world a better place. I have no doubt that this culture extends from Larry and Sergey to the chefs in the kitchen. But a publicly-traded company is a publicly-traded company and a monopoly is a monopoly.

As to restrictions, the particular example of one terminal per library can probably be finessed or possibly renegotiated. The larger problem is that there are many restrictions, larger and smaller, worked into the 100+ pages of the agreement by various parties with various axes to grind. At worst, such a web of legalese can have a chilling effect, as no one wants to run afoul of something they signed up to but don't completely understand. At best, it's an annoying mess.

The market disconnect is a particularly sore subject to Darnton, who has seen a nasty and at least superficially similar scenario play out with academic journals. Students and faculty rely on journals both for research and for getting published (the preferred alternative to perishing), but don't and generally couldn't pay for them directly. As a result, the publishers have been able to steadily jack up the subscription prices, into the tens of thousands of dollars per year.

Bear in mind that the publishers do not pay the authors, who are academics struggling to publish, so profit margins are rather on the high side. University libraries have to carry the journals, so instead they cut back on other core activities, such as buying books. The claim here is not that Google intends to do the same, but that there is no effective market mechanism to push back against it, and no guarantee that Google's current corporate culture will outlive Google the corporate entity.

The problem with Google only digitizing a small slice of the pie is that people will tend to think that, since Google indexes the public web, and in fact the public web can for most intents and purposes be defined as that which Google indexes, Google Books has similar scope. Since Google's digitizing is in essentially random order, it's quite possible that someone researching a given topic will turn up a small and unrepresentative sample of the published information on that topic without realizing it.

This should get better with time, but obviously it's going to take quite a bit of time before even the majority of published books have been digitized. In the meantime, the only antidote is old-fashioned library research, not that that's such a bad thing.

Darnton also puts forth several arguments I find less convincing:
  • "Companies decline rapidly in the fast-changing environment of electronic technology"
  • "As in the case of microfilm, there is no guarantee Google's copies will last."
  • "Google will make mistakes [in digitizing, tagging, etc.]"
  • Google is not a library. It doesn't have the expertise to tell people, say, which editions of Shakespeare's plays are more likely to represent what Shakespeare actually wrote.
  • Digitizing cannot capture everything in a paper book.
I'm not convinced electronic technology has much to do with Google's decline or lack thereof. Companies have imploded ever since there were companies. Google, however, is clearly in the category of Microsoft and IBM -- just don't say that in Mountain View -- and not Flooz or WebVan.com. Darnton's objection here is not ownership of the database per se, which seems a more relevant issue to me. Rather, it is that "Google may disappear or be eclipsed by an even greater technology, which could make its database as outdated and inaccessible as many of our old floppy disks and CD-ROMs."

This misses two qualitative differences between an old floppy disk with a document file produced by some long-unsupported word processor and Google's data. The first is that Google's data is not tied to any particular physical medium. I'm not intimately familiar with Google's infrastructure, but I would be shocked if digitized books were not already hosted at multiple physical locations. As hardware gets replaced and upgraded, more copies will get made. Google has to do this sort of thing to ensure high availability.

Second, Google's formats, even if they include Google secret sauce, are not going to be orphaned. If some future disaster should take out all known copies of Google's source code (making a local copy to hack on is a standard part of development) and prevent anyone familiar with the formats from remembering them, we would surely have much bigger fish to fry. Likewise, if Google's digitized books are no longer useful because of some new and better technology ... great.

But the main reason that I'm not concerned about Google's fallibility, or its inability even to pretend to be a library in anything but the most superficial sense, or the fiasco of microfilm (and yeesh, was it a fiasco), is that unlike the case of microfilm and early US newspapers, no one is getting rid of the originals. If you want to look at the Bad Quarto of Shakespeare, you can still go to Special Collections and see it, or at least you're no less able to than you were before. If you can't, you can at least look at a picture, which you couldn't before.

So, on the whole, I share Darnton's position of speaking "as a Google enthusiast, although I worry about its monopolistic tendencies," and I hope that despite the potential pitfalls, things will work out well. At the very least, it's beneficial to publicly air and discuss the issues. Granted, it might have been better to have discussed them more widely and sooner.

Indeed, this is one of Darnton's major regrets: "By spreading the cost in various ways [...] we could have provided authors and publishers with a legitimate income, while maintaining an open access repository or one in which access was based on reasonable fees. We could have created a National Digital Library -- the twenty-first century equivalent of the Library of Alexandria. It is too late now. [... W]orse, we are allowing a question of public policy -- the control of access to information -- to be determined by private lawsuit."

"While the authorities slept, Google took the initiative. It did not seek to settle its affairs in court. It went about its business, scanning books in libraries[.]"

1 comment:

earl said...

Good post. I'm looking forward to reading those that follow, and, ultimately, the book itself.

Couldn't help but notice that your proofreader missed a couple, in the paragraph just following the bullet points.