Wednesday, March 07, 2007

Google Books Experienced

Via Lorcan Dempsey, and, as he did, I will let this be told by the author, Peter Brantley of the California Digital Library (correction: he is with the Digital Library Federation). Astounding writing. Link.

Google Print: A Numbers Game

The following post was written by Andrew Grabois, who worked with me at Bowker and has (among other things) compiled bibliographic statistics from the Books In Print database for a number of years. His contact details are at the bottom of this article.


On February 6th, Google announced that the Princeton University library system agreed to participate in their Book Search Library Project. According to the announcement, Princeton and Google will identify one million works in the public domain for digitization. This follows the January 19th announcement that the University of Texas libraries, the fifth largest library in the U.S., also climbed on board the Library Project. Very quietly, the number of major research libraries participating in the project has more than doubled to twelve in the last two years. The seven new libraries will add millions of printed items to the tens of millions already held by the original five, and more fuel to the legal fire surrounding Google’s plan to scan library holdings and make the full texts searchable on the web.

The public discussion has been mostly one-sided, with Google supporters trying to hold the high moral ground. Their basic argument goes something like this: the universe of published works in the U.S. consists of some 32 million books. While 80 percent of these books were published after 1923, and are therefore potentially protected by copyright, only 3 million of them are still in print and available for sale. As a result, mountains of books have been unnecessarily consigned to obscurity.

No one has yet challenged the basic assumptions supporting this argument. Perhaps they’ve been scared off by Google’s reputation for creating clever algorithms that “organize the world’s information”. This one, though, doesn’t stand up to serious scrutiny.

The figures used by supporters of the Library Project come from a 2005 study undertaken by the Online Computer Library Center (OCLC), the largest consortium of libraries in the U.S. According to the OCLC study, its 20,000 member libraries hold 31,923,000 print books; the original five research libraries participating in the Google library scanning project hold over 18 million.

OCLC did not actually count physical books. They searched their massive database of one billion library holdings and isolated 55 million catalog records describing “language-based monographs”. This was further refined (eliminating duplicates) to 32 million “unique manifestations”, not including government publications, theses and dissertations. The reality of library classification, however, is such that “monographs” often include things like pamphlets, unbound documents, reports, manuals, and ephemera that we don’t usually think of as commercially published books.
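To make that counting method concrete, here is a minimal sketch (the records and field names are hypothetical, not OCLC's actual data model) of how many holdings collapse into one "unique manifestation" per work, and why the monograph category can still sweep in items we would not call commercially published books.

```python
# Minimal sketch (hypothetical records): duplicate holdings collapse to one
# "unique manifestation" per work, and the monograph category still admits
# pamphlets, reports, and other non-trade items.
holdings = [
    {"title": "Moby-Dick", "author": "Melville", "kind": "book"},
    {"title": "Moby-Dick", "author": "Melville", "kind": "book"},       # duplicate holding
    {"title": "Soil Survey, Adams County", "author": "", "kind": "report"},
]

manifestations = {(h["title"], h["author"]) for h in holdings}
print(len(manifestations))  # 2 unique "manifestations", only one of them a trade book
```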

The notion that 32 million U.S.-published books languish on library shelves is absurd. Just do the math. That works out to more than 80,000 new books published every year since the first English settlement in Jamestown in 1607. Historical book production figures clearly show that the 80,000 threshold was not crossed until the 1980s, after hovering around 10,000 for the nearly fifty years between 1910 and 1958. The OCLC study showed, moreover, that member libraries added a staggering 17 million items (half of all print collections) since 1980. That averages out to 680,000 new print items acquired every year for 25 years, or more than the combined national outputs of the U.S., U.K., China, and Japan in 2004.
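A quick back-of-the-envelope check, using only the figures quoted above, shows where those per-year numbers come from:

```python
# Back-of-the-envelope check of the per-year figures cited above.
claimed_books = 32_000_000            # OCLC "unique manifestations"
years_since_jamestown = 2007 - 1607   # 400 years
print(claimed_books / years_since_jamestown)   # 80,000 books per year

added_since_1980 = 17_000_000         # half of all print collections, per OCLC
print(added_since_1980 / 25)          # 680,000 items per year over 25 years
```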

Not only will Google have to sift through printed collections to identify books, and then determine if they are in the public domain, but they will also have to separate out those published in the U.S. (assuming that their priority is scanning U.S.-based English-language books) from the sea of books published elsewhere. The OCLC study clearly showed that most printed materials held by U.S. libraries were not published in the U.S. The study counted more than 400 languages system-wide, and more than 3 million print materials published in French and German alone in the original Google Five. English-language print materials accounted for only 52% of holdings system-wide, and 49% in the Google Five. Since more than a few works were probably published in the United Kingdom, the total number of English-language books published in the U.S. will constitute less than half of all print collections, both system-wide and in Google libraries.
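To illustrate the kind of sifting described above, here is a minimal sketch (the record format and field names are hypothetical, not Google's or OCLC's) of filtering catalog records down to U.S.-published, English-language books old enough to be safely in the public domain:

```python
# Minimal sketch (hypothetical record format) of the sifting described above:
# keep only U.S.-published, English-language books old enough to be in the
# public domain under the 1923 cutoff.
PUBLIC_DOMAIN_CUTOFF = 1923

def is_candidate(record):
    """Return True for a record worth scanning under these assumptions."""
    return (
        record.get("type") == "book"             # exclude pamphlets, reports, ephemera
        and record.get("country") == "US"
        and record.get("language") == "eng"
        and record.get("year", 9999) < PUBLIC_DOMAIN_CUTOFF
    )

records = [
    {"type": "book", "country": "US", "language": "eng", "year": 1899},
    {"type": "book", "country": "FR", "language": "fre", "year": 1950},
    {"type": "pamphlet", "country": "US", "language": "eng", "year": 1880},
]
print([r for r in records if is_candidate(r)])   # only the 1899 U.S. book survives
```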

So how many U.S.-published books are there in our libraries? Annual book production figures show that some 4 million books have been published in the 125 years since 1880, when figures began to be compiled regularly. If, very conservatively, we add an additional 1.5 million books to cover the pre-1880 years, and another 1.5 million to cover books published after 1880 that might have been missed, we get a much more realistic total of 7 million.

Using the lower baseline for published books tells a very different story than the dark one (that the universe of books consists of works that are out of print, in the public domain, or “orphaned” in copyright limbo) told by Google and their supporters. With some 3 million U.S. books in print, the inconvenient truth here is that 40% of all books ever published in the U.S. could still be protected by copyright. That would appear to jibe with the OCLC finding that 75% of print items held by U.S. libraries were published after 1945, and 50% after 1974.
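The arithmetic behind the 7 million estimate and the 40% figure can be laid out in a few lines (a sketch of the article's own reasoning, using only the numbers given above):

```python
# Checking the arithmetic behind the 7 million estimate and the "40%" claim.
since_1880 = 4_000_000     # annual production figures, 1880 onward
pre_1880   = 1_500_000     # conservative allowance for earlier books
missed     = 1_500_000     # allowance for post-1880 books not captured
total_us_books = since_1880 + pre_1880 + missed
print(total_us_books)               # 7,000,000

in_print = 3_000_000
print(in_print / total_us_books)    # ~0.43, the "roughly 40%" figure cited above
```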

If we’re going to have a debate that may end up rewriting copyright law, let’s have one based on facts, not wishful thinking.


Andrew Grabois is a consultant to the publishing industry. He has compiled U.S. book production statistics since 1999. He can be reached at the following email address: agrabois@yahoo.com

Clarification update from Andrew: My post is not intended to be a criticism of the OCLC study ("Anatomy of Aggregate Collections: The Example of Google Print for Libraries") by Brian Lavoie et al, which is a valuable and timely look at print collections held by OCLC member libraries. What I am attempting to do here is point out how friends of the Google library project have misinterpreted the paper and cherry-picked findings and conclusions out of context to support their arguments.


Related articles:
Google Book Project (3/6/07)
Qualified Metadata (2/22/07)

Tuesday, March 06, 2007

Google Book Project

Thomas Rubin, the associate general counsel for Microsoft, lambasted Google’s approach to copyright protection, characterizing it as ‘cavalier’, in comments delivered at the Association of American Publishers conference in New York. Those of us in publishing have a first-hand understanding of this opinion, and other segments of the media are rapidly coming to the realization that even obvious content ownership isn’t enough to preclude Google from adopting, and more importantly making money off, content under copyright. Google is probably the only company that was willing to take the significant legal risks associated with the purchase of YouTube, for example.

Publishers have elected to sue Google to protect their content rights and the content rights of their authors. At the same time, publishers have engaged with Google by participating in the Google Scholar program. Here publishers are equal partners, and (I assume) negotiations for the acquisition of content by Google were conducted in good faith, and the results have been good to great for both parties (Springer, Cambridge University). It is also no bad thing that Google’s content (digitization) programs have spurred other similar content initiatives, particularly those of some of the larger trade and academic publishers.

The continued area of friction is the digitization project that Google initiated to scan all the books in as many libraries as are willing to participate. This is where publishers got upset. They were not consulted nor asked permission, they cannot approve the quality of the scanning, they will not participate in any revenue generated, and they cannot take for granted that the availability of the scanned book will not undercut any potential revenues they may generate on their own. The books in question are the majority of those published after 1925 or so (it's actually 1923: thanks to Shatzkin for noticing my error), which are still likely to be under copyright protection of some sort.

Having said that, let's get one thing straight: having all books which exist in library stacks (or deep storage) available in electronic form, so that they can be indexed, searched, reassembled, found at all and generally used as a resource in an easy way, is a good thing, an important step forward and an opportunity for libraries and library patrons. Ideally, it would lead to one platform (network) providing equal access to high-quality, indexed e-book content which any library patron would be able to access via their local library. Sadly, while the vision is still viable, the execution represented by the Google library program is not going to get us there.

Setting aside the copyright issue, the Google library program has been going on now for approximately 24 months, and results and feedback are starting to show that the reality of the program is not living up to its promise. According to this post from Tim O’Reilly, the scans are not of high quality and, importantly, are not sufficient to support academic research. Assuming this is universally true (?), the program represents a fantastic opportunity lost for patrons, libraries and Google. BowerBird, via O’Reilly, states:

umichigan is putting up the o.c.r. from its google scans, for the public-domain books anyway, so the other search engines will be able to scrape that text with ease. what you will find, though, if you look at it (for even as little as a minute or two) is that the quality is so inferior it's almost worthless
Could Google suffer more embarrassment as disillusionment grows over the program? Perhaps, but I doubt it will force them to rethink their methodology. It would represent a huge act of humility for Google to ‘return to the table’ with publishers and libraries to work with them to rethink the project, with the intention of agreeing on the copyright issues and agreeing on a better way to process and tag the content. To suggest that they become less a content repository and more a navigator or ‘switchboard’, which is how O’Reilly phrases it, is beyond expectation; however, were they to change course in this way they would immediately reap benefits with all segments of the publishing and library communities. O’Reilly, a strong supporter of the Google program, believes the search engines (Google, Yahoo, others) will ‘lose’ if they continue to create content repositories that are not ‘open’.

Ironically, the lawsuit by the AAP could actually have a beneficial impact on the process of digitization. As some have noted, we may have underestimated the difficulty of finding relevant materials and resources once there is more content to search (this assumes full text is available for search). Initiatives are underway, particularly by the Library of Congress, to address the bibliographic (metadata) requirements of a world with lots more content, and perhaps some of these bibliographic activities will lead to a better approach to digitization of the more recent content (post-1923). Regrettably, some believe that, since there may be only one chance to scan the materials in libraries, we may have lost the opportunity to make these (older) materials accessible to users in an easy way.


Tomorrow: just what is the universe of titles in the post-1923 ‘bucket’? The supporters of the Google project speak about a universe of 30 million books, but deeper analysis suggests the number is wildly exaggerated.

Shaffer Announced as Chairman of Knovel

Dave Shaffer (an old boss of mine) has been named Chairman of the Board of Knovel. Knovel is an information publishing company that focuses on science and engineering. The company is in the process of developing an integrated platform of information products and tools that enable users to fold Knovel content into their daily workflow. This is little different from what most information publishers are trying to do.

From the press release:
Chris Forbes, CEO of Knovel said: "David brings Knovel years of experience in managing companies that have successfully delivered high value information and productivity solutions to end users. We look forward to his guidance as Knovel takes the next steps to dominate the engineering information market." David Shaffer continued: "I am excited to join a company that is a passionate leader in driving value for end users in this large and important market. This new role is a natural extension of my career at Thomson."

Monday, March 05, 2007

Bear Stearns Media Conference

For those interested in the publishing company participants in the current (Mon/Thur) Bear Stearns media conference at the Breakers in Palm Beach, here is the approximate schedule:

Monday 3/5
Moody's - 10:00am Webinar
Meredith - 11:00am Webinar
CBS - 12:20pm Webinar
Thomson - 2:40pm Webinar
McGraw-Hill - 3:20pm Webinar

Tuesday 3/6
NYTimes - 9:25am Webinar
Primedia - 10:05am Webinar
Dow Jones - 10:40am Webinar
Scripps - 2:40pm Webinar

Reader's Digest Closes Sale

RD announced over the weekend that it had closed the sale of the company to Ripplewood Holdings and that Mary Berner has been appointed President and CEO of the company. She had previously been at Conde Nast and Fairchild.

Press Releases: Berner, Deal

Sunday, March 04, 2007

Social Bookmarking

Many of you are more than familiar with LibraryThing and similar sites that enable cataloging of book titles with tags that are meaningful and useful to the user. Tucked away in the 'Your Money' section of The NY Times was another puff piece about how great LibraryThing is for people like us who like books and reading. The likely origin of this article is a blog post by Tim at LibraryThing which compared the tags on books at LT with the tags on books at Amazon.com.
Amazon visitors have not taken to tagging Amazon's books in significant numbers. With thousands of times the traffic, Amazon produced a tenth as many tags as LibraryThing. What's going on?

He goes on to talk about the power of numbers, which inevitably feeds on itself: as more people tag, more people find it useful, which leads to a larger community. Passion also plays a part, because the overwhelming majority of active 'taggers' also have more books. Those with fewer than 200 titles rarely tag books. (Since fees kick in above this number, these users are also more financially committed to LibraryThing.)
Critical mass is important, even if we can't pinpoint the line. Ten tags are never enough; a thousand almost always is. Unfortunately, Amazon's low numbers translate into a broader failure to reach critical mass. With ten times as many tags overall, LibraryThing has fifteen times as many books with 100 tags, and 35 times as many with over 200 tags.

It is a very interesting article on the power of community and worth a read.

Lorcan Dempsey suggests these types of sites also represent examples of the 'Network effect':
Regular readers of this blog will not be surprised to hear me say that they also highlight a structural issue for libraries, how to provide services at the network level. These new services are network level services. They are aimed at the general user, not an audience circumscribed by region, or funding or institution. And, additionally, they provide an integrated service, moving the user quickly through whatever steps are needed to complete a task.

The other component of interest to me is that sites like LibraryThing are becoming sources of viable bibliographic information, and indeed the NY Times article suggests that LibraryThing may be selling or licensing some of its tagging data:
For example, he is in the process of selling some of his recommendations data, which is based in part on tagging statistics, to other sites that sell books and book information.

At Bowker one of our most important and costly editorial tasks was to assign subject classifications to titles so that they could be found either in our own products or in the products of our customers. With well over 200,000 titles per year this is an expensive exercise, and while it can be automated (and was), the process suffers from obvious limitations. Firstly, the subject classification methodology is quite rigid and is not always intuitive. Secondly, subjects change over time, and books previously categorized could benefit from additional or changed subjects. Thirdly, the application of subjects is subjective and can pose a limitation on the subjects applied to a title. A subject expert wants to be as accurate as possible in applying the subject classification, for relevancy and integrity. In Books In Print, we had an average of slightly more than two subjects per title over approximately five million records. Many had more, and a few had more than 15, but the average was two. Many users of LibraryThing, myself included, have placed more than two tags against most titles.
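As a rough illustration of how community tags could augment that two-subjects-per-title average, here is a minimal sketch (hypothetical titles, tags and field names, not Bowker's or LibraryThing's actual data) that keeps the controlled subject headings and appends the most common community tags:

```python
# Minimal sketch (hypothetical data): enrich a title's two editorially
# assigned subjects with the most common tags a community has supplied.
from collections import Counter

assigned_subjects = ["Fiction / Historical", "Naval history"]   # controlled vocabulary
community_tags = ["napoleonic wars", "aubrey-maturin", "sailing",
                  "historical fiction", "napoleonic wars", "series"]

tag_counts = Counter(community_tags)
enriched = assigned_subjects + [tag for tag, _ in tag_counts.most_common(3)]
print(enriched)   # controlled subjects first, then the top community tags
```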

The ability of a community to supply a range of subjects that reflect the work, what's important about the book, and what becomes important in the wider world may provide a vast addition to traditional bibliographic database work. I believe the social network applications and the structured approaches will work in concert, but increasingly it is the social aspect that needs to be actively solicited for inclusion in the traditional bibliographic databases.