Thursday, November 26, 2009

The Character of Books

OCLC recently published (LINK) an analysis of the characteristics of book titles published since 1923, and the results are quite interesting. It follows some work OCLC did immediately after Google announced its library scanning project (referenced here). This second analysis was undertaken both out of curiosity and in answer to many 'private queries':
Discussions of Google Books and other digitization efforts tend to treat in-copyright print books as an amorphous collection, with little elaboration or detail on what this important collection of materials actually looks like. How many titles are involved? What is the distribution of their publication dates? What general observations can be made about their content? This article examines these and other questions in regard to the collection of US-published print books represented in WorldCat. Many of these questions were posed to the authors in private inquiries; these inquiries, along with the keen interest in digitization that continues to spark debate on blogs and listservs, suggested that a general publication addressing the characteristics of in-copyright print books could provide helpful context for ongoing discussions.
No doubt some of these private inquiries revolved around estimating the number and character of orphan works, but since answering those queries would be problematic, the analysis focuses on in-copyright and potentially in-copyright works.

Here is a small sample from the report, taken from a section of particular interest to me:

The percentages reported in Table 2 indicate that about 14 percent of the US-published aggregate print book collection was published before 1923, and therefore is, with reasonable certainty, in the public domain according to US copyright law. A further 17 percent were published between 1923 and 1963; for these, copyright status cannot be ascertained without investigating each individual title. Some portion of these materials will be in the public domain – in particular, those whose copyright was not renewed. The rest will still be under copyright. Recent statistics from the HathiTrust indicate that about 60 percent of candidate materials for digitization published between 1923 and 1963 reverted to the public domain, either because copyright was not renewed, the book was published without a copyright notice, or for other reasons.7 Applying this fraction to the US-published aggregate print book collection in WorldCat suggests that approximately 1.6 million manifestations are public domain, while the remaining 1 million are still in copyright.

The HathiTrust result is based on academic library holdings, while the aggregate print book collection in WorldCat represents the holdings of a variety of institution types (although as Table 1 indicates, academic libraries hold the largest portion). A more general, but much earlier study by the US Copyright Office in 1960 found that only 7 percent of books registered for copyright in 1931-32 had had their copyright renewed within the prescribed 28 year period after initial registration. The remainder of the books would have reverted to the public domain.8 Both the HathiTrust and Copyright Office results suggest that of the print books published between 1923 and 1963, a majority – and perhaps a substantial majority – are likely to be in the public domain.
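For anyone who wants to check the arithmetic in the passage above, here is a rough sketch in Python. The 2.6 million total for 1923-1963 is my own assumption, inferred from the quoted split (1.6 million plus 1 million); the excerpt does not state it directly. Applying the Copyright Office's 1931-32 renewal rate to the whole period is also only an illustration of what a "substantial majority" might look like, not a figure from the report.

```python
# Back-of-the-envelope check of the 1923-1963 figures quoted above.
# ASSUMPTION: the 2.6 million total is inferred from the quoted split
# (1.6 million public domain + 1 million in copyright); it is not
# stated directly in the excerpt.
TITLES_1923_1963 = 2_600_000


def split_by_public_domain_share(total, public_domain_share):
    """Return (public_domain, in_copyright) counts for a given share."""
    public_domain = round(total * public_domain_share)
    return public_domain, total - public_domain


# HathiTrust: about 60 percent of candidates reverted to the public domain.
hathi_pd, hathi_ic = split_by_public_domain_share(TITLES_1923_1963, 0.60)

# Copyright Office (1960): only 7 percent of 1931-32 registrations were
# renewed, so roughly 93 percent lapsed. ASSUMPTION: extending that rate
# to the whole 1923-1963 span is purely illustrative.
co_pd, co_ic = split_by_public_domain_share(TITLES_1923_1963, 0.93)

print(f"HathiTrust rate: {hathi_pd:,} public domain, {hathi_ic:,} in copyright")
print(f"Copyright Office rate: {co_pd:,} public domain, {co_ic:,} in copyright")
```

At the 60 percent rate the split rounds to the report's 1.6 million and 1 million; at the much higher lapse rate implied by the Copyright Office study, the in-copyright remainder would be far smaller.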

More from the report.

Also, if you didn't see it, here is a link to my analysis estimating the number of orphan titles.

