Showing posts sorted by relevance for query grabois.

Wednesday, November 10, 2010

Books about Presidents

These USA Today snapshots were the best and most effective marketing and PR we did at Bowker when I was there; we got more mileage out of them than anything else. At the time, Andrew Grabois compiled the stats; this time it is Roy Crego.

Wednesday, December 05, 2007

Reading Stutters

Sometime contributor to this blog Andrew Grabois writes about the recent reports on reading over at beneaththecover.com. He discusses the results of the recent National Endowment for the Arts study and the Progress in International Reading Literacy Study:
Now, on the heels of the NEA’s gloomy assessment, comes the Progress in International Reading Literacy Study (PIRLS). Based on tests given to 215,000 10-year-olds from 45 countries and provinces, and data gleaned from background surveys of pupils, their parents and teachers, the findings tell the same sad story. Since 2001, the U.S. dropped from 4th place to 18th place; the U.K., from 3rd to 19th. The average scores for U.S. and U.K. students did not drop as much as their places on the new list would suggest, but they didn’t make any progress compared with the spectacular improvement shown by 10-year-olds in Russia, some Canadian provinces, Hong Kong, and Singapore. The best that can be said is that the average scores for children from the world’s two largest book markets were above the international mean. So far, there’s been no official response to our relatively poor showing.

Here is the link.

Monday, May 14, 2007

Increasing Traffic and New Authors for Personanondata

The past four months have seen a rapid rise in traffic to Personanondata, for which I am very grateful to the readers who have found me and stayed with me. I have also benefitted from links from a variety of blogs and websites, which have raised awareness and interest. Significantly, there have also been links from industry-leading trades such as Mediabistro/Galleycat, Publishers Lunch, Library Journal, and Book Business Magazine, and these have served to endorse some of what I have published.

But it is not enough (for me), and I would like to encourage all my readers to tell people about the site and hopefully build some discussion around some of the themes I talk about. (I have exhausted my contacts and don't wish to bother them too frequently).

I am also interested in publishing material from other people in the industry with a point of view. Over the past three months I have published articles by Andrew Grabois, John Dupuis, Michael Healy, and Michael Holdsworth. All have been well read on the site, and I hope they will all return at some point, but I would also like to include more perspectives. Along those lines, if anyone is interested in blogging about sessions at BookExpo in a few weeks, please let me know.

Thanks for the support.

Tuesday, May 12, 2009

Google Print: A Numbers Game

The following post was originally published in March 2007, but I recently saw an uptick in views, which prompted me to look at it again. Given the excitement over the Google Settlement, I thought it would be interesting to post it once more.


The following post is written by Andrew Grabois who worked with me at Bowker and has (among other things) compiled bibliographic stats out of the Books In Print database for a number of years. His contact details are at the bottom of this article.


On February 6th, Google announced that the Princeton University library system agreed to participate in their Book Search Library Project. According to the announcement, Princeton and Google will identify one million works in the public domain for digitization. This follows the January 19th announcement that the University of Texas libraries, the fifth largest library in the U.S., also climbed on board the Library Project. Very quietly, the number of major research libraries participating in the project has more than doubled to twelve in the last two years. The seven new libraries will add millions of printed items to the tens of millions already held by the original five, and more fuel to the legal fire surrounding Google’s plan to scan library holdings and make the full texts searchable on the web.

The public discussion has been mostly one-sided, with Google supporters trying to hold the high moral ground. Their basic argument goes something like this: The universe of published works in the U.S. consists of some 32 million books. While 80 percent of these books were published after 1923, and are therefore potentially protected by copyright, only 3 million of them are still in print and available for sale. As a result, mountains of books have been unnecessarily consigned to obscurity.

No one has yet challenged the basic assumptions supporting this argument. Perhaps they’ve been scared off by Google’s reputation for creating clever algorithms that “organize the world’s information”. This one, though, doesn’t stand up to serious scrutiny.

The figures used by supporters of the Library Project come from a 2005 study undertaken by the Online Computer Library Center (OCLC), the largest consortium of libraries in the U.S. According to the OCLC study, its 20,000 member libraries hold 31,923,000 print books; the original five research libraries participating in the Google library scanning project hold over 18 million.

OCLC did not actually count physical books. They searched their massive database of one billion library holdings and isolated 55 million catalog records describing “language-based monographs”. This was further refined (eliminating duplicates) to 32 million “unique manifestations”, not including government publications, theses and dissertations. The reality of library classification, however, is such that “monographs” often include things like pamphlets, unbound documents, reports, manuals, and ephemera that we don’t usually think of as commercially published books.

The notion that 32 million U.S.-published books languish on library shelves is absurd. Just do the math. That works out to more than 80,000 new books published every year since the first English settlement in Jamestown in 1607. Historical book production figures clearly show that the 80,000 threshold was not crossed until the 1980s, after hovering around 10,000 for the fifty years from 1910 to 1958. The OCLC study showed, moreover, that member libraries have added a staggering 17 million items (half of all print collections) since 1980. That averages out to 680,000 new print items acquired every year for 25 years, or more than the combined national outputs of the U.S., U.K., China, and Japan in 2004.
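The back-of-envelope arithmetic above can be checked directly. The sketch below uses the article's round totals; the 2005 endpoint (the year of the OCLC study) is an assumption for illustration:

```python
# Sanity check of the production-rate argument.

TOTAL_BOOKS = 32_000_000              # OCLC "unique manifestations"
YEARS_SINCE_JAMESTOWN = 2005 - 1607   # study year minus first English settlement

per_year = TOTAL_BOOKS / YEARS_SINCE_JAMESTOWN
print(f"Implied output: {per_year:,.0f} books/year")  # well above 80,000

ITEMS_SINCE_1980 = 17_000_000         # half of all print collections, per OCLC
acquired_per_year = ITEMS_SINCE_1980 / 25
print(f"Implied acquisitions: {acquired_per_year:,.0f} items/year")  # 680,000
```

Both figures dwarf documented historical output, which is the point of the argument.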

Not only will Google have to sift through printed collections to identify books, and then determine if they are in the public domain, but they will also have to separate out those published in the U.S. (assuming that their priority is scanning U.S.-based English-language books) from the sea of books published elsewhere. The OCLC study clearly showed that most printed materials held by U.S. libraries were not published in the U.S. The study counted more than 400 languages system-wide, and more than 3 million print materials published in French and German alone in the original Google Five. English-language print materials accounted for only 52% of holdings system-wide, and 49% in the Google Five. Since more than a few works were probably published in the United Kingdom, the total number of English-language books published in the U.S. will constitute less than half of all print collections, both system-wide and in Google libraries.

So how many U.S.-published books are there in our libraries? Annual book production figures show that some 4 million books have been published in the 125 years since figures were regularly compiled in 1880. If, very conservatively, we add an additional 1.5 million books to cover the pre-1880 years, and another 1.5 million to cover books published after 1880 that might have been missed, we get a much more realistic total of 7 million.

Using the lower baseline for published books tells a very different story than the dark one (that the universe of books consists of works that are out of print, in the public domain, or “orphaned” in copyright limbo) told by Google and their supporters. With some 3 million U.S. books in print, the inconvenient truth here is that roughly 40% of all books ever published in the U.S. could still be protected by copyright. That would appear to jibe with the OCLC finding that 75% of print items held by U.S. libraries were published after 1945, and 50% after 1974.
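The alternative baseline and the copyright share follow directly from the figures in the two paragraphs above; this sketch simply reproduces that arithmetic (the rounding to "40%" is the author's):

```python
# Grabois's lower baseline for U.S.-published books (in millions).
since_1880 = 4.0   # annual production figures, compiled from 1880 on
pre_1880 = 1.5     # conservative allowance for pre-1880 output
missed = 1.5       # post-1880 books the figures might have missed
baseline = since_1880 + pre_1880 + missed
print(baseline)    # 7.0 million

in_print = 3.0     # U.S. books still in print
share = in_print / baseline
print(f"{share:.0%} of U.S. books could still be in copyright")  # ~43%, i.e. "some 40%"
```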

If we’re going to have a debate that may end up rewriting copyright law, let’s have one based on facts, not wishful thinking.


Andrew Grabois is a consultant to the publishing industry. He has compiled U.S. book production statistics since 1999. He can be reached at the following email address: agrabois@yahoo.com

Clarification update from Andrew: My post is not intended to be a criticism of the OCLC study ("Anatomy of Aggregate Collections: The Example of Google Print for Libraries") by Brian Lavoie et al, which is a valuable and timely look at print collections held by OCLC member libraries. What I am attempting to do here is point out how friends of the Google library project have misinterpreted the paper and cherry-picked findings and conclusions out of context to support their arguments.

Wednesday, March 07, 2007

Google Print: A Numbers Game



Related articles:
Google Book Project (3/6/07)
Qualified Metadata (2/22/07)