Wednesday, September 09, 2009

580,388 Orphan Works – Give or Take

Clearly one of the most (if not the most) contentious issues regarding the Google Book Settlement (GBS) centers on the nebulous community of “orphans and orphan titles”. And yet, through the entirety of the discussion since the Google Book Settlement agreement was announced, no one has attempted to define how many orphans there really are. Allow me: 580,388. How do I know? Well, I admit, I do my share of guesswork to get to this estimate, but I believe my analysis is based on key facts from which I have extrapolated a conclusion. Interestingly, I completed this analysis starting from two very different points and the first results were separated by only 3,000 works (before I made some minor adjustments).

Before I delve into my analysis, it might be useful to make some observations about the current discussion on the number of orphans. First, when commentators discuss this issue, they refer to the ‘millions’ of orphan titles. This is both deliberate obfuscation and lazy reporting: most notably, the real issue is not titles but the number of works. My analysis attempts to identify the number of ‘works’; titles are a multiple of works. A work will often have multiple manifestations or derivations (paperback, library version, large print, etc.) and thus, while the statement that there may be ‘millions of orphan titles’ may be partially correct, it is entirely misleading when the true measure applicable to the GBS discussion is how many orphan works exist. It is the owner (or parent) of the work we want to find.

To many reporters and commentators, suggesting there are millions of orphans makes sense because of the sheer number of books scanned by Google but, again, this is laziness. Because Google has scanned 7-10 million titles then, so the logic goes, there must be ‘millions of orphans’. However, as a 2005 report (which I understand they are updating) by OCLC noted, all types of disclaimers should be applied to this universe of titles such as titles in foreign languages, titles distributed in the US, titles published in the UK, to name a few. Accounting for these disclaimers significantly reduces the population of titles at the core of this Orphan discussion. These points were made in the 2005 OCLC report (although they were not looking specifically at orphans) when they looked at the overlap in title holdings among the first five Google libraries. (And if you like this stuff, this was pretty interesting). Prognosticators unfamiliar with the industry may also believe there are millions and millions of published titles since, well, there are just lots and lots in their local B&N and town library.

The two methods I chose to try to estimate the population of orphans relied, firstly, on data from Bowker’s BooksinPrint and OCLC’s Worldcat databases and, secondly, on industry data published by Bowker since 1880 on title output. I accessed BooksinPrint via NYPL (Bowker cut off my sub) and Worldcat is free via the web. The Bowker title data has been published and referred to numerous times over the years and I found this data via Google Book Search; I also purchased an old copy of The Bowker Annual from Alibris.

In using these databases, my goal was to determine whether there are consistencies across the two databases that I could then apply to the Google title counts. In addition to the ‘raw data’ I extracted from the databases, OCLC (Dempsey) also noted some specific numbers of ‘books’ in their database (91mm), titles from the US (13mm) and non-corporate ‘Authors’ (4mm). Against the title counts from both sets of data, I attributed percentages which I then applied to the Google universe of titles (7mm). (My analysis also 'limits' these numbers to print books excluding for example dissertations).

In order to complete the analysis to determine a specific orphan population, I reduced my raw results based on best guess estimates for non-books in the count, public domain titles and titles where the copyright status is known. These final calculations result in a potential orphan population of 600,000 works. I also stress-tested this calculation by manipulating my percentages resulting in a possible universe of 1.6mm orphan works. This latter estimate is (in my view) illogical as I will show in my second analysis.

An important point should be made here. I am calculating the potential orphan population, not the number of orphans. These numbers represent a total before any effort is made to find the copyright holder. These efforts are already underway and will get easier once there is money collected by the Book Rights Registry to be distributed.

My second approach emanated from my desire to validate the first approach. If I could determine how many works had been published each year since 1924 then I could attribute percentages to this annual output based on my estimate of how likely it was that the copyright status would be in doubt. Simply put, my supposition was that the older the work, the more likely it was that it could be an orphan.

Bowker has consistently calculated the number of works published in the US since 1880 (give or take) and the methodology for these calculations remained consistent through the mid-1990s. According to their numbers, approximately 2mm works were published between 1920 and 2000. Unsurprisingly, a look at the distribution of these numbers confirms that the bulk of those works were published recently. If there were (only) 2mm works published since the 1920s, it is impossible to conclude there are millions of orphan works.

To complete this analysis, I aggressively estimated the percentage of works published each decade since 1920 which could be orphan works. The analysis suggests a total of 580K potential orphan works which, as a subset of the approximately 2mm works published in the US during this period, seems a reasonable estimate. My objective to ‘validate’ my first approach (using OCLC and BIP data) shows that both approaches, using different methodology, reach similar conclusions.
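The decade-by-decade method described above can be sketched in a few lines of Python. This is only an illustration of the calculation's shape: the function name, the decade output figures and the orphan percentages below are placeholder assumptions, not the Bowker numbers or the percentages actually used to reach 580K.

```python
def estimate_potential_orphans(works_by_decade, orphan_share_by_decade):
    """Sum (works published x assumed orphan share) across decades."""
    total = 0.0
    for decade, works in works_by_decade.items():
        share = orphan_share_by_decade[decade]
        if not 0.0 <= share <= 1.0:
            raise ValueError(f"share for {decade} must be between 0 and 1")
        total += works * share
    return round(total)


# Hypothetical inputs for illustration only (NOT the Bowker output data).
works_by_decade = {1920: 100_000, 1950: 200_000, 1980: 500_000}

# Hypothetical shares: older decades are assumed more likely to be orphans.
orphan_share_by_decade = {1920: 0.5, 1950: 0.25, 1980: 0.1}

estimate = estimate_potential_orphans(works_by_decade, orphan_share_by_decade)
print(f"Potential orphan works: {estimate:,}")
```

Even with these made-up inputs, the sketch shows why a low percentage applied to a large recent decade can yield as many potential orphans as a high percentage applied to a small early decade.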

There are several conclusions that can be drawn from this analysis. Firstly, since the universe of works is finite then, beyond a certain point, the Google scanning operation will begin to find ‘new’ orphans at a decreasing rate. I don’t know if this number is 5mm scanned titles or 12mm but my estimate is 7mm because, according to Worldcat, there are 3mm authors to 12mm titles. If you apply this ratio to the Bowker estimate of total works published, the number is around 7-8mm titles. Secondly, publishing output accelerated in the latter part of the 20th century which means that, while my estimates in percentage terms of the number of latter day orphans were comparably lower than the percentages applied in the early part of the century, the base number of published titles is much higher, therefore the number of possible orphans is higher. Common sense dictates that it will be far easier to find the parents of these later ‘orphans’.
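The ratio argument in the first conclusion is simple arithmetic, sketched here in Python. The variable names are illustrative, and treating the Worldcat authors-to-titles ratio as a stand-in for titles per work is an assumption made for this sketch; the inputs are the rounded figures quoted in the post.

```python
# Rounded figures quoted in the post.
worldcat_authors = 3_000_000   # authors, per Worldcat
worldcat_titles = 12_000_000   # titles, per Worldcat
bowker_works = 2_000_000       # approx. US works published 1920-2000, per Bowker

# Assumption: titles per author is used as a proxy for titles per work.
titles_per_author = worldcat_titles / worldcat_authors

# Scale the Bowker work count by that multiple to get an implied title count.
implied_titles = bowker_works * titles_per_author

print(f"Titles per author: {titles_per_author:.1f}")
print(f"Implied titles: {implied_titles:,.0f}")
```

With these rounded inputs the multiple is 4 and the implied universe is 8mm titles, the upper end of the 7-8mm range.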

In the aggregate, the 600K potential orphans may still seem high against a “work” population of 2.2mm (roughly 27%). I disagree, given the distribution of the ‘orphan’ works (above paragraph) and because I have made no allowance for the BRR’s effort to find and identify the parents. In my view, true orphans will be a much lower number than 600,000, which leads me to my final point. Money collected on behalf of unidentified orphan owners will eventually be disbursed to cover costs of BRR or to other publishers. There has been some controversy on this point and it derives, again, from the idea that there are millions of orphans and thus the pool of undisbursed revenues will be huge. The true numbers don’t support this conclusion. There will not be a huge pool of royalty revenues to be ultimately disbursed to publishers who don’t ‘deserve’ this windfall because there won’t be very many true orphans. The other point here is that royalty revenues will be calculated on usage and, almost by definition, true orphan titles for the most part are not going to be popular titles and therefore will not generate significant revenues in comparison with all other titles.

This analysis is not definitive, it is directional. Until someone else can present an argument that examines the true numbers and works in more detail, I think this analysis is more useful to the Google Settlement discussion than referring by rote to the ‘millions of orphans’. The prevailing approach is lazy, misleading and inaccurate.

(Thanks to Mike Shatzkin who encouraged me to think about analysis and helped me conclude it. Grateful thanks to others who also helped review the post).


Nick W-W said...

That's an extraordinary and incisive analysis and as I can think of few people better placed to do it than you, I am inclined to accept it. It feeds into my concern that the revenue pool from the orphan works won't be enough to sustain the workings of the Book Registry. As you suggest, by definition these are abandoned works. Won't a quick look at the pages via Google Book Search be enough to impede purchase? And so what then? The registry is an expensive proposition.

Nick W-W

Eric said...


Could you clarify how you dealt with US and non-US books? Does 580K include non-US published works?

MC said...

Eric, The historic Bowker numbers captured the books published by US publishers (not including books distributed by US publishers) thus they capture US output. (The OCLC numbers are less 'clean' in that respect). The 580K number is US published works excluding non-US published works.

MC said...

Nick, I think the registry is the only way true Orphans could be identified. It is not my expectation that their operating budget depends on money from non-distributed Orphan revenues so in that respect I don't think my conclusion that those monies will be less than people may think will have a negative impact on the BRR.

Bruce Albrecht said...

Does your estimate include stories and articles that were published in magazines but are too short to have been published on their own? If not, then I suspect the real total of orphan works really is in the millions.

Kent Larsen said...

Bruce has a good point, and if you include these shorter works, he may well be right.

BUT, there is also an important issue in U.S. copyright law that will significantly reduce your estimate -- you don't include post 1923 works in the public domain.

Under U.S. law at the time, published works needed to be renewed every 28 years. If they were not renewed, they fell into the public domain. Only in the case of works published after 1964 can you be sure that they are not in the public domain.

In fact, unless the work kept selling, it was unusual for its copyright to be renewed. So your estimates for what percentage of works are "orphan" works for the decades before 1964 are too high. It could well be that 75% or more of books during this period are in the public domain because they weren't renewed.