Tuesday, May 12, 2009

Google Print: A Numbers Game

The following post was originally published in March 2007, but I recently saw an uptick in views, which prompted me to look at it again. Given the excitement over the Google Settlement, I thought it would be interesting to post it once more.


The following post is written by Andrew Grabois who worked with me at Bowker and has (among other things) compiled bibliographic stats out of the Books In Print database for a number of years. His contact details are at the bottom of this article.


On February 6th, Google announced that the Princeton University library system had agreed to participate in their Book Search Library Project. According to the announcement, Princeton and Google will identify one million works in the public domain for digitization. This follows the January 19th announcement that the University of Texas libraries, the fifth-largest library in the U.S., also climbed on board the Library Project. Very quietly, the number of major research libraries participating in the project has more than doubled to twelve in the last two years. The seven new libraries will add millions of printed items to the tens of millions already held by the original five, and more fuel to the legal fire surrounding Google’s plan to scan library holdings and make the full texts searchable on the web.

The public discussion has been mostly one-sided, with Google supporters trying to hold the moral high ground. Their basic argument goes something like this: the universe of published works in the U.S. consists of some 32 million books; while 80 percent of those books were published after 1923, and are therefore potentially protected by copyright, only 3 million are still in print and available for sale. As a result, mountains of books have been unnecessarily consigned to obscurity.

No one has yet challenged the basic assumptions supporting this argument. Perhaps they’ve been scared off by Google’s reputation for creating clever algorithms that “organize the world’s information”. This one, though, doesn’t stand up to serious scrutiny.

The figures used by supporters of the Library Project come from a 2005 study undertaken by the Online Computer Library Center (OCLC), the largest consortium of libraries in the U.S. According to the OCLC study, its 20,000 member libraries hold 31,923,000 print books; the original five research libraries participating in the Google library scanning project hold over 18 million.

OCLC did not actually count physical books. They searched their massive database of one billion library holdings and isolated 55 million catalog records describing “language-based monographs”. This was further refined (eliminating duplicates) to 32 million “unique manifestations”, not including government publications, theses and dissertations. The reality of library classification, however, is such that “monographs” often include things like pamphlets, unbound documents, reports, manuals, and ephemera that we don’t usually think of as commercially published books.
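
To make that reduction concrete, here is a minimal sketch (in Python) of the two steps OCLC describes: filter holdings down to "language-based monographs," then collapse duplicates into unique manifestations. The record fields and the matching key are my own illustrative inventions, not OCLC's actual methodology, and note how the "monograph" bucket still sweeps in things like manuals:

    # Illustrative sketch only: the fields and matching key are hypothetical.
    # OCLC's real record-matching is far more sophisticated than this.
    holdings = [
        {"title": "Moby-Dick", "author": "Melville", "year": 1851, "type": "monograph"},
        {"title": "Moby-Dick", "author": "Melville", "year": 1851, "type": "monograph"},
        {"title": "Harpoon Maintenance Manual", "author": "Anon.", "year": 1900, "type": "monograph"},
        {"title": "Annual Census Report", "author": "U.S. Gov't", "year": 1901, "type": "government"},
    ]

    # Step 1: isolate "language-based monographs" (government publications,
    # theses, and dissertations excluded). Pamphlets, manuals, and ephemera
    # survive this cut even though they are not commercially published books.
    monographs = [h for h in holdings if h["type"] == "monograph"]

    # Step 2: collapse duplicate records into "unique manifestations".
    unique = {(h["title"], h["author"], h["year"]) for h in monographs}

    print(len(holdings), "holdings ->", len(monographs), "monographs ->", len(unique), "unique")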

The notion that 32 million U.S.-published books languish on library shelves is absurd. Just do the math. That works out to more than 80,000 new books published every year since the first English settlement in Jamestown in 1607. Historical book production figures clearly show that the 80,000 threshold was not crossed until the 1980s, after hovering around 10,000 for the fifty years between 1910 and 1958. The OCLC study showed, moreover, that member libraries added a staggering 17 million items (half of all print collections) since 1980. That averages out to 680,000 new print items acquired every year for 25 years, or more than the combined national outputs of the U.S., U.K., China, and Japan in 2004.
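
That back-of-envelope math is easy to reproduce. A quick check, using only the figures already cited above (the OCLC study dates to 2005):

    # Back-of-envelope check of the figures cited above.
    total_claimed = 32_000_000            # OCLC's "unique manifestations"
    years = 2005 - 1607                   # study year minus Jamestown settlement
    print(round(total_claimed / years))   # ~80,400 new books per year

    items_added_since_1980 = 17_000_000   # half of all print collections
    print(items_added_since_1980 // 25)   # 680,000 items acquired per year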

Not only will Google have to sift through printed collections to identify books, and then determine if they are in the public domain, but they will also have to separate out those published in the U.S. (assuming that their priority is scanning U.S.-based English-language books) from the sea of books published elsewhere. The OCLC study clearly showed that most printed materials held by U.S. libraries were not published in the U.S. The study counted more than 400 languages system-wide, and more than 3 million print materials published in French and German alone in the original Google Five. English-language print materials accounted for only 52% of holdings system-wide, and 49% in the Google Five. Since more than a few works were probably published in the United Kingdom, the total number of English-language books published in the U.S. will constitute less than half of all print collections, both system-wide and in Google libraries.

So how many U.S.-published books are there in our libraries? Annual book production figures show that some 4 million books have been published in the 125 years since such figures began to be compiled regularly in 1880. If, very conservatively, we add an additional 1.5 million books to cover the pre-1880 years, and another 1.5 million to cover books published after 1880 that might have been missed, we get a much more realistic total of 7 million.
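
Spelled out as a simple sum, with the conservative allowances labeled:

    counted_since_1880 = 4_000_000    # annual production figures, 1880-2005
    pre_1880_allowance = 1_500_000    # conservative estimate for earlier years
    post_1880_missed = 1_500_000      # books the annual figures may have missed
    print(counted_since_1880 + pre_1880_allowance + post_1880_missed)  # 7,000,000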

Using the lower baseline for published books tells a very different story than the dark one (that the universe of books consists of works that are out of print, in the public domain, or “orphaned” in copyright limbo) told by Google and their supporters. With some 3 million U.S. books in print, the inconvenient truth here is that roughly 40% of all books ever published in the U.S. (3 million in-print titles out of a total of about 7 million) could still be protected by copyright. That would appear to jibe with the OCLC finding that 75% of print items held by U.S. libraries were published after 1945, and 50% after 1974.
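
The percentage follows directly from the two totals above. A quick check, assuming the 7 million baseline just derived:

    in_print = 3_000_000         # U.S. books currently in print
    total_us_books = 7_000_000   # the baseline estimated above
    print(f"{in_print / total_us_books:.0%}")   # ~43%, i.e. "some 40%"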

If we’re going to have a debate that may end up rewriting copyright law, let’s have one based on facts, not wishful thinking.


Andrew Grabois is a consultant to the publishing industry. He has compiled U.S. book production statistics since 1999. He can be reached at the following email address: agrabois@yahoo.com

Clarification update from Andrew: My post is not intended to be a criticism of the OCLC study ("Anatomy of Aggregate Collections: The Example of Google Print for Libraries") by Brian Lavoie et al., which is a valuable and timely look at print collections held by OCLC member libraries. What I am attempting to do here is point out how friends of the Google library project have misinterpreted the paper and cherry-picked findings and conclusions out of context to support their arguments.

1 comment:

Marion Gropen said...

Now, that makes much, much more sense than other takes on this situation that I've seen. Their numbers just seemed off, but I never dug into them. I'm so glad that you not only did, but posted about it.