Re-post: Massive Data Sets

Originally posted June 24th, 2008

Large publishers like Elsevier, Macmillan and Kluwer have spent the past 20 to 30 years consolidating journal publishing under their umbrellas and building virtually unassailable positions in numerous vertical publishing segments. The open access movement has had only minimal impact on the prospects for these businesses, and there is little indication that even market forces will reduce their commanding positions. Much of the consolidation has already occurred, although occasionally a large concentration of journals does come on the market. Even so, it is unlikely that a new publisher would be able to build a significant position in any meaningful segment, because all the important titles already belong to one of the major players.

Journals publish the outcome of the intellectual activity of the article authors. In some cases, access is provided to the data that backs up the investigation, but almost invariably this data remains in the dark. Some publishers have experimented with allowing journal readers to play with the data, but this does not appear to be a developing trend. Data's day may come, however. Several months ago (via Brantley) I read of yet another initiative at Google:
Sources at Google have disclosed that the humble domain,, will soon provide a home for terabytes of open-source scientific datasets. The storage will be free to scientists and access to the data will be free for all. The project, known as Palimpsest and first previewed to the scientific community at the Science Foo camp at the Googleplex last August, missed its original launch date this week, but will debut soon.
The article on their web site is brief, and my immediate thoughts had little to do with the gist of the story. What struck me was that this could become the next land grab for publishers and perhaps other parties interested in gaining access to the raw data supporting all types of research. As publishers develop platforms supporting their publishing and (now) service offerings, will they see maintaining these data sets as integral to that policy? I believe so, and I suspect that in agreements with the authors, institutions and associations that own these journals, publishers like Elsevier will also require the 'deposit' of the raw data supporting each article. In return, the offerings on the publisher's 'platform' would enable analysis, synthesis and data storage, all of benefit to their authors. But the story may be more comprehensive than simply rounding out their existing titles with more data.

The current power publishers in the journal segment may find themselves competing with new players, including Google, in their efforts to gain access to data sets that have historically been treated as supplemental or even not considered relevant to research. In addition, sources of massive data sets are growing with the introduction of every new consumer product and with exponential growth in web traffic. In the NYTimes today is an article about a number of new companies that are analyzing massive amounts of data to produce market reports and business analysis. From the article:
Just this month, the journal Nature published a paper that looked at cellphone data from 100,000 people in an unnamed European country over six months and found that most follow very predictable routines. Knowing those routines means that you can set probabilities for them, and track how they change. It’s hard to make sense of such data, but Sense Networks, a software analytics company in New York, earlier this month released Macrosense, a tool that aims to do just that. Macrosense applies complex statistical algorithms to sift through the growing heaps of data about location and to make predictions or recommendations on various questions — where a company should put its next store, for example. Gregory Skibiski, 34, the chief executive and a co-founder of Sense, says the company has been testing its software with a major retailer, a major financial services firm and a large hedge fund.
As noted in the article, this data (growing rapidly to massive status) has been hard to manipulate, but that problem is diminishing rapidly. As it diminishes, we will see more and more companies, groups and even individuals recognize the value of their data and begin to negotiate access to it. All of the large information publishers will see themselves playing a significant role in this market as they gather data sets around market segments, just as they did with journals. If they don't, they could undercut the value of their journal collections by being forced to separate the result and analysis from the data. Signing agreements for access to these data sets (cell phone data in the example above) will enable journal publishers to concentrate research even further by making access to this information a pre-condition of publication in the respective journal. Either way, the providers of these data sets are likely to be looking at a new and significant revenue source.

