
Thursday, March 10, 2016

EBSCO's Tim Collins on eBooks, Libraries, and Why Search "Has Never Been More Important"

Interesting interview from Scholarly Kitchen with Tim Collins.  Here's a clip:
Many libraries are starting to see that, while they may spend less on ebooks for a couple of years by using STLs, they are often left with lower annual budgets (if they spend less in one year their budget declines the next) and a much less robust ebook collection to offer their users (as they don’t own as many books). While some libraries may feel like this is okay as they can enable their patrons to search ‘all’ ebooks via Demand Driven Acquisition (DDA) models without actually buying them, we worry about this logic as it assumes that publishers will continue to make all of their content available for searching via DDA at no cost to users. We don’t see this as a valid assumption as, if DDA results in reducing ebook budgets even further, we wonder whether publishers will be able to afford to make their ebooks available under this model.
We can see why book publishers worked with these models as they wanted to support their customers. But, if these models result in budget reductions, which result in publishers not being able to fulfill their mission of publishing the world’s research so that it can be consumed, we don’t see them being sustainable.   We understand that this view may not be welcomed or shared by all libraries, but we see the logic being sound. Business models need to work for both customers and vendors in order for them to be sustainable. There was much great discussion on this subject at the recent Charleston Conference and in related articles published in Against the Grain by both publishers and librarians.

Friday, February 25, 2011

Yahoo (and Now Google) and The Semantic Web

News from Google about their use of microformats reminded me of something Yahoo announced over two years ago. First, here is a snip from the Google announcement:

That’s a tough problem with the current web, according to Google’s Jack Menzel, the company’s product management director for search, despite the apparent ease that Watson had besting human Jeopardy opponents.

“We are still grasping for the Holy Grail of natural language search,” Menzel said. “We take the approach that the internet exists, and it is so big and Wild West-like that you have to take it for what it is. It is this giant immutable thing that will do its own thing, despite what you want it to do.”

The dream of a structured web has proven nearly impossible to create in practice as it requires coordination on building specs and then that web page builders take the time to mark their pages up in complicated XML. A more grassroots effort, known as Microformats, has had more success by focusing on just a few kinds of data and making innovative use of HTML, the lingua franca of the web, to simplify publishing meta-data. Google introduced its own suggestions of how websites could start publishing Google-friendly meta-data in 2009 (such as how many stars a rating is), with its so-called Rich Snippets.

And now for the first time, a mainstream search engine is built entirely around webpages that use microformats and other structured data.

So, for instance, Google is able to show a searcher only Pho recipes that use tofu and take less than half an hour to make, not by searching for pages that include the words "Pho," "Tofu," and "Recipe," but by actually knowing that a recipe for something called "Pho" has the ingredient "Tofu" and a listed cooking time (for example, the publisher wraps the cooking time, say "30 minutes," in a defined HTML tag, and the search engine then interprets that markup in the search results).
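To make the mechanism concrete, here is a minimal sketch of the idea: the publisher labels the recipe fields in HTML and a crawler reads the labels instead of guessing from bare keywords. The class names and markup below are only loosely modeled on microformat conventions (they are not Google's actual Rich Snippets vocabulary), and BeautifulSoup is just an assumed parsing tool.

```python
# Sketch of structured recipe markup and how a crawler might read it.
# Class names ("hrecipe", "fn", "ingredient", "duration") are illustrative,
# loosely modeled on microformat conventions, not an official vocabulary.
from bs4 import BeautifulSoup

page = """
<div class="hrecipe">
  <h1 class="fn">Vegetarian Pho</h1>
  <span class="ingredient">Tofu</span>
  <span class="ingredient">Rice noodles</span>
  <span class="duration" title="PT30M">30 minutes</span>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
recipe = soup.find(class_="hrecipe")
name = recipe.find(class_="fn").get_text()
ingredients = [i.get_text() for i in recipe.find_all(class_="ingredient")]
minutes = int(recipe.find(class_="duration")["title"].strip("PTM"))  # crude ISO 8601 parse

# Because the markup is labeled, a search engine can answer a structured query
# ("Pho recipes with tofu under 30 minutes") rather than matching bare keywords.
if "Tofu" in ingredients and minutes <= 30:
    print(f"Match: {name} ({minutes} min)")
```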

Here is the repost from March 14, 2008 (and some of what I comment on still applies):

In their continued strategic realignment and adoption of open standards, Yahoo has announced they are supporting a number of semantic web standards that will enable third parties (publishers) to augment and enhance native Yahoo search results. From Techcrunch:
A few details are being disclosed now, and Yahoo promises more in a few weeks. They are saying that they will support a number of microformats at the start: hCard, hCalendar, hReview, hAtom and XFN. They will support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others. They will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, Yahoo will support the Amazon A9 OpenSearch specification with extensions for structured queries to deep web data.
There is a lot to get excited about in the Yahoo announcement(s) - and judging by the comments on the Techcrunch post, others are interested as well - but perhaps the most important point is that the weight of Yahoo will press for faster adoption of some of these standards. In particular, microformats, if adopted by publishers, could change the rules of content syndication and lead to far wider distribution of publisher content. This in turn would lead to higher pass-through traffic generating product or advertising sales for publishers.

I only became aware of microformats in the past six months or so, but the concept derives from a practical problem. How often - like me - have you been frustrated by the need to copy down an address, the details of a book review, a resume, or even a cooking recipe? Microformats can standardize the manner in which these items are published to the web, enabling users like me to access and use this content as uniform packages of information or data, independent of the publisher. For example, if I wanted to create a list of recipes from ten different cookbooks, I could assemble them and they would all appear together in consistent form. That would be a huge practical improvement on copy and paste.
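The address example can be sketched the same way. The snippet below uses class names from the hCard format mentioned in the Yahoo announcement above, but the markup itself and the parsing approach are invented for illustration, not taken from any particular publisher.

```python
# Sketch: extracting a contact published as an hCard into one uniform record.
# Class names follow common hCard conventions (vcard, fn, org, adr, locality);
# the markup is invented for illustration.
from bs4 import BeautifulSoup

page = """
<div class="vcard">
  <span class="fn">Jane Smith</span>
  <span class="org">Example Press</span>
  <div class="adr">
    <span class="street-address">100 Main St</span>
    <span class="locality">New York</span>
    <span class="postal-code">10001</span>
  </div>
</div>
"""

card = BeautifulSoup(page, "html.parser").find(class_="vcard")
contact = {
    field: card.find(class_=field).get_text()
    for field in ("fn", "org", "street-address", "locality", "postal-code")
}
print(contact)  # one uniform package, regardless of which site published it
```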

Imagine then how this could impact publishers that publish information that could be disaggregated. This could include every topic from travel to technology to cooking to sewing and knitting. A single dress pattern (Mrs PND believes no one sews anymore, but no matter), which typically existed alongside many others in a book or magazine, can now be extracted, indexed and monetized. Just think how many discrete elements could exist at a typical publishing house if they were disaggregated from their 'mother' products.

While the benefits of the Yahoo initiatives can be debated, at their core is a potential transformation in the manner in which information is produced and disseminated. The technology is not new, but the weight of Yahoo could propel its adoption, and that would be a good thing. While publishers will be slow and cautious (some would say cumbersome), they should become the biggest beneficiaries of this initiative: they own massive quantities of content that has traditionally been packaged in ways that discourage narrow use. Microformats and the resulting syndication models will open up those content repositories and fundamentally change publishing.

Perhaps I could also add that these changes may enable publishers to reestablish stronger influence over the distribution of their content - influence that has been lost in the physical world to Amazon and B&N. Maybe and perhaps...

Friday, September 24, 2010

Re-post: Massive Data Sets

Originally posted June 24th, 2008

Large publishers like Elsevier, Macmillan and Kluwer spent the past 20-30 years consolidating journal publishing under their umbrellas and building virtually unassailable positions in numerous vertical publishing segments. The open access movement has had only minimal impact on the prospects for these businesses, and there is little indication even market forces will reduce their commanding positions. Most of the consolidation has already occurred; occasionally a large concentration of journals comes on the market, but it is unlikely that a new publisher would be able to build a significant position in any meaningful segment because all the important titles already belong to one of the major players.

Journals publish the outcome of the intellectual activity of the article authors. In some cases access is provided to the data that backs up the investigation, but more often this data remains in the dark. Some publishers have experimented with allowing journal readers to play with the data, but this does not appear to be a developing trend. Data's day may come, however. Several months ago (via Brantley) I read of yet another initiative at Google.
Sources at Google have disclosed that the humble domain, http://research.google.com/, will soon provide a home for terabytes of open-source scientific datasets. The storage will be free to scientists and access to the data will be free for all. The project, known as Palimpsest and first previewed to the scientific community at the Science Foo camp at the Googleplex last August, missed its original launch date this week, but will debut soon.
The article on their web site is brief, and my immediate thought had little to do with the gist of the story: here could develop the next land grab for publishers, and perhaps for other parties interested in gaining access to the raw data supporting all types of research. As publishers develop platforms supporting their publishing and (now) service offerings, will they see maintaining these data sets as integral to that strategy? I believe so, and I suspect that in agreements with the authors, institutions and associations that own these journals, publishers like Elsevier will also require the 'deposit' of the raw data supporting each article. In return, the offerings on the publisher's 'platform' would enable analysis, synthesis and data storage, all of benefit to their authors. But the story may be more comprehensive than simply rounding out their existing titles with more data.

The current power publishers in the journal segment may find themselves competing with new players, including Google, in their efforts to gain access to data sets that may have been historically supplemental or even not considered relevant to research. In addition, sources of massive data sets are growing with the introduction of every new consumer product and exponential web traffic growth. In the NYTimes today is an article about a number of new companies that are analyzing massive amounts of data to produce market reports and business analysis. From the article:
Just this month, the journal Nature published a paper that looked at cellphone data from 100,000 people in an unnamed European country over six months and found that most follow very predictable routines. Knowing those routines means that you can set probabilities for them, and track how they change. It’s hard to make sense of such data, but Sense Networks, a software analytics company in New York, earlier this month released Macrosense, a tool that aims to do just that. Macrosense applies complex statistical algorithms to sift through the growing heaps of data about location and to make predictions or recommendations on various questions — where a company should put its next store, for example. Gregory Skibiski, 34, the chief executive and a co-founder of Sense, says the company has been testing its software with a major retailer, a major financial services firm and a large hedge fund.
As noted in the article, the data (growing rapidly to massive status) has been hard to manipulate, but this obstacle is diminishing quickly. As it diminishes we will see more and more companies, groups and even individuals recognize the value of their data and begin to negotiate access to it. All of the large information publishers will see themselves playing a significant role in this market as they gather data sets around market segments, just as they did with journals. If they don't, they could undercut the value of their journal collections by being forced to separate the results and analysis from the underlying data. Signing agreements for access to these data sets (cell phone data in the example above) will enable journal publishers to concentrate research even further by making access to this information a pre-condition to publication in the respective journal. Either way, the providers of these data sets are likely to be looking at a new and significant revenue source.

Friday, September 17, 2010

Repost: 'Qualified Metadata' - What Does it All Mean?

Originally posted on 2/22/2007. I was speaking to someone this afternoon about this topic, and it reminded me a little of this post.


Earlier this month I spoke about how a data provider may be able to carve out a place for itself as the single provider of catalog information for a particular industry. This data, representing 'base level' descriptive information (in the book world we call it bibliographic data), would be widely disseminated across the Internet to facilitate trade in products, materials and services and would be provided by one data supplier. Other data suppliers - one layer up, if you will - would also make use of this base-level information but add value-added data elements that would be particularly important to segments of the supply chain. The most obvious example in books would be subject and categorization data, which aids in discovery of the item described. Another set of data elements could reflect more descriptive information about a publisher over and above basic address and contact details. In this second post of my series, I take a look at the library environment.

In a recent article in D-Lib (January 07), Karen Markey of the University of Michigan looks at how the library online catalog experience needs to change in order for users to receive more relevant and authoritative sources of information to support their research needs. She goes on to quote Deanna Marcum of the Library of Congress: "the detailed attention that we have paid to descriptive cataloguing may no longer be justified...retooled catalogers could give more time to authority control, subject analysis, [and] resource identification and evaluation." Markey proposes redesigning the library catalog to embrace three things:
  1. post-Boolean probabilistic searching to ensure precision in online catalogs that contain full text
  2. subject cataloguing that takes advantage of a user's ability to recognize what they do and don't want
  3. qualification cataloguing to enable users to customize retrieval based on level of understanding or expertise
New search technologies such as MarkLogic, FAST and the search tool behind WorldCat offer some of these capabilities but are generally not accessible to the average user. For example, some of these tools enable flexibility in the relative importance given to elements within a record; manipulating the importance of audience level in a WorldCat search would 'skew' the result set toward higher or lower comprehension titles based on the bias given to one or the other.
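To illustrate what 'manipulating the importance of audience level' could look like, here is a generic sketch of field-weighted relevance scoring. It is not WorldCat's or MarkLogic's actual API; the records, fields and weights are all made up for the example.

```python
# Generic sketch of field-weighted relevance: raising the weight on one field
# ("audience_level") skews the ranked results toward, say, scholarly titles.
# Records and weights are invented; no real catalog API is being called.

records = [
    {"title": "Astronomy for Beginners", "subject_match": 1.0, "audience_level": 0.2},
    {"title": "Stellar Spectroscopy",    "subject_match": 0.9, "audience_level": 0.9},
    {"title": "Our Friend the Moon",     "subject_match": 0.8, "audience_level": 0.1},
]

def rank(records, subject_weight, audience_weight):
    """Score each record as a weighted sum of its field values and sort descending."""
    scored = [
        (r["subject_match"] * subject_weight + r["audience_level"] * audience_weight, r["title"])
        for r in records
    ]
    return sorted(scored, reverse=True)

print(rank(records, subject_weight=1.0, audience_weight=0.0))  # plain subject relevance
print(rank(records, subject_weight=1.0, audience_weight=2.0))  # biased toward higher-comprehension titles
```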

Perhaps the most compelling point Markey raises in her article supporting increased attention to "qualification metadata" is the 30 to 1 'rule'.
The evidence pertains to the 30-to-1 ratios that characterize access to stores of information (Dolby and Resnikoff, 1971). With respect to books, titles and subject headings are 1/30 the length of a table of contents, tables of contents are 1/30 the length of a back-of-the-book index, and the back-of-the-book index is 1/30 the length of a text. Similar 30 to 1 ratios are reported for the journal article, card catalog, and college class. "The persistence of these ratios suggests that they represent the end result of a shaking down process, in which, through experience, people became most comfortable when access to information is staged in 30-to-1 ratios" (Bates, 2003, 27). Recognizing the implications of the 30-to-1 rule, Atherton (1978) demonstrated the usefulness of an online catalog that filled the two 30-to-1 gaps between subject headings and full-length texts with tables of contents and back-of-the-book indexes.
Once I read this it was obvious to me that we may not have thought through the implications of projects such as Google Print on retrieval. These initiatives will result in huge (big, big, big) increases in the amount of stuff researchers and students will have to wade through to find items that are even remotely relevant to what they are looking for. In the case of students, unless appropriate tools and descriptive data are made available, we will only compound the 'it's good enough' mentality, and they will never see anything but Google Search as useful.
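To put the 30-to-1 ratios in concrete terms, here is the cascade for an assumed 300,000-word monograph (the book length is my assumption, not a figure from the studies Markey cites):

```python
# The 30-to-1 cascade applied to an assumed 300,000-word monograph:
# each access layer is roughly 1/30 the size of the layer below it.
full_text = 300_000                           # words in the text (assumed)
index = full_text // 30                       # ~10,000-word back-of-the-book index
table_of_contents = index // 30               # ~333-word table of contents
title_and_headings = table_of_contents // 30  # ~11 words of title/subject headings

print(full_text, index, table_of_contents, title_and_headings)
# 300000 10000 333 11
```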

Markey's article is worth a read if you are interested in this type of stuff, but I think her viewpoint is a starting point for any bibliographic agency or catalog operation in defining their strategy for the next ten years. Most bibliographers understand that base-level data is a commodity. The only value a provider can supply here is consistency and one-stop shopping, and the barriers to entry are lowered every day. I am of the view (see my first article on this subject) that the agency that can demonstrably deliver consistent data should do so as a loss leader in order to corner the market on base-level data and then generate a (closed) market for value-added and descriptive (qualification) metadata. There are indications that markets may be heading in this direction (Global Data Synchronization - which I will address next) with incumbent data providers reluctantly following.

Providing relevancy in search is a holy grail of sorts, and descriptive data is key to it. In the library environment, if the current level of resources were reallocated to building the deeper bibliographic information we need, the traffic in and out of library catalogs would be tremendous. If no one steps in to provide this needed descriptive data, the continuing explosion of resources will be irrelevant because no one will be directed to the most relevant stuff. Serendipity would rule. The data would also prove valuable and important to the search providers (Google, etc.) because they also want to provide relevance; having libraries and the library community execute on this task would be somewhat ironic given the current decline in use of the online library catalog.

Tuesday, May 05, 2009

Library Associations Address Issues in Google Settlement

Fellow traveler Peter Brantley has posted on Scribd the submission by ALA, ARL and ACRL addressing their concerns about the Google Settlement agreement. This document was submitted to the court last week. Here are some of the notable passages, but the document is only 22 pages long for those looking for a quick read.

On how important the database may become:
Notwithstanding these deficiencies in the ISD, an institutional subscription will provide an authorized user with online access to the full text of as many as 20 million books. Students and faculty members at higher education institutions with institutional subscriptions will be able to access the ISD from any computer -- from home, a dorm room, or an office. Accordingly, it is possible that faculty and students at institutions of higher education will come to view the institutional subscription as an indispensable research tool. They might insist that their institution’s library purchase such a subscription. The institution’s administration might also insist that the library purchase an institutional subscription so that the institution can remain competitive with other institutions of higher education in terms of the recruitment and retention of faculty and students.
And this in regard to market power:
However, as likely consumers of this essential research facility, the Library Associations cannot overlook the possibility that the Registry or Google might abuse the control the Settlement confers upon them. Abuse of this control would threaten fundamental library values of access, equity, privacy, and intellectual freedom.
This with respect to pricing:
Google will have the incentive to negotiate vigorously with the Registry to set the price of the institutional subscription as low as possible to maximize the number of authorized users with access to the ISD. Nonetheless, Google’s business model, at least with respect to the institutional subscription, may change, and at some point in the future it may seek a profit maximizing price structure that has the effect of reducing access.

Significantly, the predominant model for pricing of scientific, technical, and medical journals in the online environment has been based on low volume and high prices. Major commercial publishers have been content with strategies that maximize profits by selling subscriptions to few customers at high cost. Typically these customers are academic and research libraries. Therefore, the Registry and Google may seek to emulate this strategy in the market for institutional subscriptions.
On privacy:
Evidently, in the Settlement negotiations the class representatives insisted on these measures to protect the security of digital copies of their books; but no one demanded protection of user privacy. Users of the services enabled by the Settlement also cannot rely on competitive forces to preserve their privacy. In the online environment, competition is perhaps the most powerful force that can help to insure user privacy. If a user does not like one search engine firm’s privacy policy, he can switch to another search engine. Similarly, a user has many choices among online retailers, email providers, social networks, and Internet access providers. The competitive pressure often forces at least a minimal level of privacy protection. However, with the services enabled by the Settlement, there will be no competitive pressure protecting user privacy.
They worry about intellectual freedom and censorship:
While Google on its own might not choose to exclude books, it probably will find itself under pressure from state and local governments or interest groups to censor books that discuss topics such as alternative lifestyles or evolution. After all, the Library Project will allow minors to access up to 20% of the text of millions of books from the computers in their bedrooms and to read the full text of these books from the public access terminals in their libraries.
And on issues with new affiliated services:
Although the Settlement permits the Registry to license the rights it possesses to third parties such as Amazon, the Settlement does not require it to do so. Nor does it provide standards to govern the terms by which the Registry would license these rights. This means that the Registry could refuse to license the rights to Google competitors on terms comparable to those provided to Google under the Settlement. The Registry, therefore, could prevent the development of competitive services.

Tuesday, April 21, 2009

Overview of Open Access Book Projects

Writing in the UM Journal of Electronic Publishing, Peter Suber provides a complete rundown of open access initiatives. The article mainly covers journals; however, he does pay significant attention to open access initiatives concerning books. (LINK)

Here is a sample:
2008 wasn’t the first year that academic book publishers published OA monographs or discovered the synergy of OA and POD (print on demand). But in 2008 the OA-POD model moved from the periphery to the mainstream and became a serious alternative more often than an experiment. We saw OA monographs or OA imprints from Amsterdam UP, Athabasca UP, Bauhaus-Universität Weimar, Caltech, Columbia UP, Hamburg UP, Potsdam UP, the Universidad Católica Argentina, the American Veterinary Medical Association, the Forum for Public Health in South Eastern Europe, the Institut français du Proche-Orient, and the Society of Biblical Literature.
He goes on to name many more programs. He also discusses the Google scanning project:
The settlement could mean that fair use will never be a workable rationale for large-scale book scanning projects, even if Google’s original fair-use claim was strong (as I believe it was). Future scanners may have to pay for permission, in part because Google paid and in part because the new commercial opportunities arising from the settlement itself will weigh against fair-use claims. At the same time, it means that users will have vastly improved online access to books under copyright but out of print (20% previews rather than short snippets), free full-text searching for a much larger number of books, free full-text access from selected terminals in libraries, free text-mining of full texts for some institutional users, and easier priced access to full-text digital editions.

Sunday, April 19, 2009

The Google Settlement's Vast Supply of Content

For most public and academic institutions, amassing a collection of seven million volumes would be a pipe dream of extraordinary proportion and, while there are many legitimate concerns and arguments regarding the imminent resolution of the case between Google, The Authors Guild and the AAP, the benefit to the public (via libraries) is hard to ignore. Take an average cost of $10 per book purchased, add the costs of making it shelf-ready, of checking it in and out, and the capital expense of storing it, and there is simply no way libraries could afford to acquire a collection this comprehensive.

We don't know what the pricing will be to libraries (Mike Shatzkin and I are attempting to make some estimates), but the methodology for pricing is unlikely to differ substantially from the way existing databases are offered to public and academic libraries. Allowance will be made for institution budgets, school enrollment, population served, etc., and both Google and the Book Rights Registry (AG & AAP) will be interested in maximizing penetration so that their revenues are optimized. There may be built-in protection against extortionate pricing since both Google and the BRR want to maximize views, which argues for pricing that achieves the widest potential audience for the database. Library penetration will not be 100% but it will be high, since libraries - particularly academics and large publics - will feel compelled to purchase access to this content to support their patrons. In fact, not having it will cause more consternation and deliberation.

Many libraries will see licensing this content as an opportunity to put their research capabilities on par with the top order of academic libraries. After all, this content comes from a who's who of top flight public and academic institutions. A small agricultural college in west Texas may never have had the resources to purchase a deep repository of content supporting their core curriculum but here they have the opportunity to do just that.

Opposition to this agreement is building in advance of the early May decision and, while I personally support its adoption, I am troubled that the scanning of this material is now treated as a fait accompli and has thus become the starting point for establishing agreement. The resolution should have addressed the core issues of fair use and copyright, but it has not; because those issues remain unaddressed, Google is left with a certain (some may say excessive) market power, and non-participants to this agreement are left open to possible future copyright violation.

As has been pointed out (and openly supported by Google), orphan works legislation is not precluded or superseded by the agreement between the AG, AAP and Google. What strikes me as odd, however, is the lack of attention any member of Congress has paid to this particular issue. To my knowledge no Congressional representative has come out either in support of the Google settlement or become newly interested in orphan works legislation. Given the intensity of the attention paid to this issue in the publishing and library community, it would seem that if Congress is still not interested in addressing orphan works legislation then it never will be. That is effectively the situation we had before the parties agreed to the settlement (and for many years past). I hope Congress does take up orphan works legislation, but in the meantime I also hope a lot of students and researchers make extensive use of this vast supply of content.

Monday, March 24, 2008

Google Print Integration

I always wondered at my own immediate need for 10,000 e-book titles available on things like the Kindle and Sony e-Reader. Give me an e-book library of my librarything.com titles and then I might be interested. The idea that I could browse the full text of my collection on librarything has far more relevance for me than an e-book catalog that's just BIG. And what do you know? We are almost there, because librarything.com announced an integration with Google Book Search several weeks ago, and on the site a user can link to the text of many of the titles in their collection. The links aren't universal, but as a taste of what is surely inevitable it is a great step forward.

Other companies are jumping on the API bandwagon. ExLibris announced they have integrated a link to 'About this Book' pages on Google Book Search. From their press release:
Using a new “viewability” application programming interface (API) supported by Google Book Search, library patrons can now enhance their findings with Google Book Search features such as full text, book previews, cover thumbnails, and a mashup from Google Maps linking pages in a book describing a specific place to its location on the world map. Use of this “viewability” API has been added to the Ex Libris Primo® discovery and delivery solution, SFX® context-sensitive link resolver, and the Aleph® and Voyager® integrated library systems.

In the ILS world everyone plays follow the leader, so the links should start appearing in all the other vendors' products if they haven't already. Libraries have long had the ability to gather similar content (though not full text) from Amazon.com. Many have done this successfully to augment (prettify) their catalogs, but the Google option will prove compelling both because of the potential breadth of content in the 'About this Book' package and because of the limited commercial nature of the Google Book program. The Google Book program could become the primary distribution mechanism for publishers into libraries: imagine every ILS using the Google API and publishers making their titles available via a subscription/lending module. All of this at very low capital expense for publishers.
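For those curious what the 'viewability' call looks like in practice, here is a minimal sketch of how a catalog page might query it for a single ISBN. The endpoint and response fields follow the Google Book Search dynamic links documentation as I understand it, so treat the exact field names (preview, preview_url, thumbnail_url) as assumptions to verify rather than gospel.

```python
# Sketch: ask Google Book Search whether a title is viewable and get links/thumbnail.
# Endpoint and response fields are based on the public "viewability" (dynamic links)
# API as commonly documented; verify field names before relying on them.
import json
import urllib.request

def viewability(isbn: str) -> dict:
    url = (
        "https://books.google.com/books"
        f"?bibkeys=ISBN:{isbn}&jscmd=viewapi&callback=cb"
    )
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode("utf-8")
    # Response is JSONP: cb({...}); strip the callback wrapper before parsing.
    payload = body[body.index("(") + 1 : body.rindex(")")]
    return json.loads(payload).get(f"ISBN:{isbn}", {})

info = viewability("0451526538")
print(info.get("preview"))        # e.g. "partial" or "noview"
print(info.get("preview_url"))    # link an OPAC could surface next to the record
print(info.get("thumbnail_url"))  # cover image for "prettifying" the catalog
```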

The other interesting aspect of the Ex Libris implementation is the integration with the SFX link resolver. How this develops could also be interesting for the discovery of journals and articles.

Over on Exact Editions, Adam had some related thoughts on this.

I also had the additional thought that, from what I saw at their presentations last year, it may be Microsoft that has the better publisher workbench/toolkit for managing access to content. Where they are in their relationships with library intermediaries is anyone's guess, however.

Friday, February 01, 2008

Google Search By Year

I came across this a few weeks ago and thought it was very interesting. Google has a place where you can see some of their experimentation with new search interfaces. In the Google search box enter the following: joseph conrad view:timeline and you will see a timeline version of the life of Joseph Conrad. It works for all kinds of things: try replacing Joseph Conrad with Vietnam War. If you play with this a little you will see that you can narrow down searches within years. As far as I can tell it doesn't do months.



Some of you will recall an interface that OCLC has worked on for authors that is similar. WorldCat Identities looks like this: Conrad



Fellow traveler Peter Brantley reminded me of the Google interface by referring me to an article on Arstechnica.com. In this post they look at six of the experimental interfaces.


PS. Within three clicks on Google I was reading a review/appreciation of The Red Badge of Courage written by Conrad himself. Again, yet another reason to want to be a high school student today.

Microsoft to Buy Yahoo for $44 Billion

The dam has finally burst. Will Microsoft be able to pull off the deal to buy Yahoo and then, more importantly, will they be able to make a success of the integration? This could be one of the most exciting news stories of the year. Will this be welcomed by Yahoo? Is this the big deal that Terry Semel was said to be working on only yesterday? Could Yahoo look for some other combination - with Ebay, say - and act defensively to stop the acquisition? The current offer is very expensive - 60% over yesterday's closing share price.

AP
Timesonline.
NYTimes

Friday, November 30, 2007

ACAP is Implemented

At a conference in New York yesterday, World Association of Newspapers President Gavin O'Reilly updated the content community on the status of the ACAP initiative. ACAP is a technology that updates the manner in which web search robots crawl and index material on the web. The ACAP protocol aims to create a more balanced approach to gathering web content by enabling content owners to 'publish' specific rights information applicable to their content, which can then be read by the search tool. Rather than limiting the amount of free content available to web users, content owners participating in this initiative believe the ultimate outcome will be to make more content available by bringing content out from behind subscription walls.

All content owners are being encouraged to implement version 1 of the protocol and Times Online announced that they have implemented ACAP on their site. From the Associated Press:
The proposal, unveiled by a consortium of publishers at the global headquarters of The Associated Press, seeks to have those extra commands — and more — apply across the board. Sites, for instance, could try to limit how long search engines may retain copies in their indexes, or tell the crawler not to follow any of the links that appear within a Web page. The current system doesn't give sites "enough flexibility to express our terms and conditions on access and use of content," said Angela Mills Wade, executive director of the European Publishers Council, one of the organizations behind the proposal. "That is not surprising. It was invented in the 1990s and things move on."

Personally, I was initially skeptical about this initiative, but they have delivered on their timetable, retained their broad support and even have some in the search community actively supporting it.
ACAP organizers tested their system with French search engine Exalead Inc. but had only informal discussions with others. Google, Yahoo and Microsoft Corp. sent representatives to the announcement, and O'Reilly said their "lack of public endorsement has not meant any lack of involvement by them." Danny Sullivan, editor in chief of the industry Web site Search Engine Land, said robots.txt "certainly is long overdue for some improvements."
Associated Press

Sunday, July 15, 2007

News Update: Week 7/9

Deals, M/A:
Apparently News of Murdoch's Dow Jones Purchase was Premature: MSNBC
Not so Smooth: Pearson are Selling Les Echos: WSJ
Soon to be everything except moot court, LN buys services provider: Dayton Bus Jrnal
Visant (educational publisher) for sale: Reuters
In case you were wondering, Proquest are now Voyager: PRNews

Publishing:
Publishers create web travel aids: Galleycat

Retailing:
Looking for Divine Intervention for Christian Retailing: Washington Post

Search:
Topic Specific Search Engines are the Next Best Thing: Economist

Library:
Book Industry Council (UK) looks to improve Library Supply Chain: PN

Sport:
Beckham arrives (and Posh): LA Times (It is all a bit silly)
Belmar Five: 34:10

Wednesday, June 27, 2007

Publishers Fight Back - 2

I don't seem to hear too much about the Automated Content Access Protocol (ACAP), which is being developed by a group of content producers under the aegis of the World Association of Newspapers, but the group has released a press release about the progress of the initiative. To refresh, ACAP is a new standard that allows online content providers to automatically communicate information to search engine operators and others on how their content can be used. From the press release:
  • ACAP is building on existing technology including Robots Exclusion Protocol and is using established methods for defining standard permissions semantics.
  • Collaboration and support for the project has been overwhelming: the list of 28 organisations continues to grow and represents a worldwide interest in the project (partners are listed below).
  • Work is now underway to prepare ACAP for the post-pilot stage -- to hand over a long-term sustainable model to a pre-existing governance organisation or to set up its own ACAP governance organisation.
Effectively, the group is establishing a standard way to lock (or make available) the content on content providers' web sites. This tool will allow publishers to select the content that they want crawled and thereby better control access to their content. The tool is being developed so that, when in place, a web crawler will be able to read the permissions information and act accordingly. No human intervention is required other than the publisher setting the initial parameters.
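As a rough sketch of what 'reading the permissions information' might look like from the crawler's side, here is a toy parser for ACAP-style extensions to robots.txt. The directive names follow the general pattern ACAP has proposed (prefixed allow/disallow fields), but the specific fields and the sample policy are my assumptions, not the published specification.

```python
# Toy sketch of a crawler reading ACAP-style permissions from a robots.txt-like file.
# Directive names are illustrative approximations of ACAP's robots.txt extension,
# not the official specification.

SAMPLE = """
ACAP-crawler: *
ACAP-allow-crawl: /news/
ACAP-disallow-crawl: /subscription/
ACAP-disallow-index: /subscription/
"""

def parse_acap(text: str) -> dict:
    """Group ACAP-prefixed directives into {directive: [paths]}."""
    rules: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line or not line.lower().startswith("acap-"):
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        rules.setdefault(field.lower(), []).append(value)
    return rules

def may_crawl(rules: dict, path: str) -> bool:
    """Honor the publisher's stated permissions: a disallow match blocks crawling."""
    return not any(path.startswith(p) for p in rules.get("acap-disallow-crawl", []))

rules = parse_acap(SAMPLE)
print(may_crawl(rules, "/news/today.html"))           # True
print(may_crawl(rules, "/subscription/archive.html")) # False
```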

“What we seek to do together is create the foundations for what is surely the highest aspiration that publishers, aggregators, search engines and politicians could have for the content industry - namely an increasingly healthy, profitable and vibrant sector which drives knowledge and diverse thinking throughout the internet and the world and which creates new opportunities for everyone," said Gavin O’Reilly, President of the World Association of Newspapers.


One hopes it is all not a bit late....

Prior Post

Friday, February 02, 2007

It's not Mr. Dewey's Search Engine Anymore

Read/WriteWeb is an excellent blog dedicated to all things web 2.0-ish. You could spend a lot of time here catching up with what's new in web development, new companies and new approaches. Here is their description:
Read/WriteWeb is a popular weblog that provides Web Technology news, reviews and analysis. It began publishing on April 20, 2003 and is now one of the most widely read and respected Web 2.0 blogs. Read/WriteWeb is ranked among Technorati’s Top 100 blogs in the world. The site is edited by Richard MacManus, a recognized thought leader in the Internet industry.

I say check it out. And specifically look at this recent article on search engines - that is, the other 100. It is fascinating to look at some of the examples he links to, and I recommend MsDewey (who is significantly more attractive than you might imagine) and LivePlasma. If you get to MsDewey, search for book or books several times in a row and see what happens. It is fascinating. (Note: I just went back and MsDewey seemed to get stumped; regardless, a benefit is to remain transfixed on MsDewey.)

Wednesday, January 31, 2007

Jimmy Wales Discusses Wikia Search

Against the backdrop of a very friendly audience, Jimmy Wales gave his first public talk this evening about his new search project. In fairness, given his work with Wikipedia it would be difficult to imagine a non-friendly audience on any occasion, and when the topic is basically giving stuff of real value away for free, it's unlikely there will be too many boos and groans. Jimmy wasn't particularly controversial; he is on a quest to make search transparent, participatory and free. (Nice picture of someone's head - sorry).

The meeting at NYU was both an official class meeting and a hosted meeting of the FreeCulture society. Purely by chance, I found out about it by glancing at a copy of AM New York this morning as I came back from a breakfast meeting. This is the great thing about New York: these types of things go on all the time, like no other place on earth. (I will have another post tomorrow about another meeting I attended earlier in the week.)


Wales suggested that he was taken aback by the attention the media world has given this initiative, and he claims that he accidentally dropped the hint about it at the end of last year. I have some doubts about that story. As he explained it, the search 'tool' will become a legitimate competitor to the commercial providers, particularly Google and Yahoo. He even suggested that some second-tier search providers have approached him to offer assistance and support, and he reasons that these companies recognize that a legitimate competitor to Google et al is good for all the non-major players. He didn't directly state that an objective is to make basic search a commodity, but this does seem to be the central objective of the initiative. Value-added services would then ride along or on top of basic search, thereby providing unique business offerings.

With respect to the three core criteria he views as essential to the initiative: all algorithms will be published, testable and researchable, which supports his transparency goal. Establishing a participatory environment will depend on the relevancy and usefulness of the engine. As one student suggested, if the tool sucks then no one will participate, to which Wales noted that he is in the process of hiring the best researchers in search technology and is well aware that the first release has to be impressive. He also went on to say that they want to include the best elements of Wikipedia participation coupled with a trusted network of key participants. Within Wikipedia there is a core group of 1,000-2,000 contributors who act less like gatekeepers and more like collaborators. Lastly, the search tool will be free, which he defined by reciting the four freedoms of software: the ability to copy, modify, redistribute and redistribute with modifications.


Other than the fact that I was in a room full of undergraduates and feeling very old, this was a very interesting discussion. Questions towards the end reflected concern over privacy issues and why Google and the other services are not 'free'. I was curious about why the Wikipedia model hasn't yet transferred well to the worlds of educational and journal publishing, because so far those initiatives have appeared indifferent, but I didn't get the chance to ask.

Tuesday, January 30, 2007

Google TV: Is it for real? UPDATED

Update:

I think the following YouTube video, from the same source as the GoogleTV episode, proves GoogleTV is a hoax. The video is about recharging alkaline batteries and, to quote, "it is called electrical tape because it conducts electricity."

Well it was fun to think about....

In the run-up to the lawsuit by US publishers against Google, an article was circulated (I forget where it was published) to all of us on the AAP board describing a meeting between Google execs and NBC in Los Angeles. Apparently, Google had been storing the NBC feed for months and presented their proud new ideas for TV over the internet. NBC was not impressed and the idea was buried.

Now comes word of GoogleTV, which, if true, could be the realization of that earlier idea: delivering on-demand television to your computer. How cool would that be? Just think: in six months you could use the browser on your new iPhone to watch the program of your choice whenever and wherever you want. Extending that, your iPhone becomes the distribution device for your own TV, radio/music and movie channel. The YOU channel.

Again, it could all be a well-constructed hoax. It could also be a hoax that is very close to the truth.

Here are links:

http://www.youtube.com/watch?v=wNjlGm-YIKg

http://www.youtube.com/watch?v=7MulSMSJV-U

http://www.youtube.com/watch?v=N-eCO5L9wXQ

Thursday, January 25, 2007

Joining the Network

Ebay and Amazon.com are network-level applications. What defines them as such is that they raise to the level of a platform fundamental processes, data pools and transaction information that previously existed at a local user or store level. In doing so they achieve radical economies of scale, which are made available to all comers. Their benefit in doing so is to create a marketplace from which they receive transaction fees and charges across a huge network. Participants (vendors) benefit because they get access to state-of-the-art applications, databases and the marketplace itself for a fraction of the cost of developing these assets themselves.

Ebay and Amazon are the most obvious network-level players, but others are becoming increasingly prevalent and increasingly have at their core a set of integrated web service applications. Google, for example, is known mostly as a search platform; however, they are investing in many types of applications - from calendar functions to spreadsheets to blog software - that in effect create a potential network-level desktop. It is entirely possible in the Google environment to become completely untethered from your traditional 'physical' desktop, and increasingly this will be the way people work. As a consequence of the developing Google network we will benefit from more integration of communication - calendars, email, blogging - between users. Who knows what level of integration may result once browser type, hardware and productivity application no longer matter.

The beauty of the network-level application is that it can function as a component of a workflow or as the workflow itself. The applications are built to standard specifications and are interchangeable, upgradeable and reusable. Web services are the most common facility by which modular software components are brought together to produce a workflow application. By definition, these web service applications are not tied to any specific operating system or programming language.

Amazon.com has aggressively promoted web services (and was an early adopter). Most online booksellers would face much higher monthly operating expenses and would also not have access to other seller tools (comparative pricing, availability information, etc.) were it not for Amazon's web services. Simple cover art is available to all online retailers via the Amazon web services widget. It would be difficult and time-consuming for a small book retailer to scan and upload cover art, and the fact that this and many other functions have been moved from the local store level to the network is an example of a network effect. Anyone who has sold on Ebay over the past ten years will recognize how the process gets easier and easier as Ebay has added content, applications and services that the average garage sale seller could never develop by themselves. As Amazon has done, they too have developed web services applications that others can use on their own auction or retail sales sites.
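As an illustration of what renting these functions from the network looks like, here is a sketch of a small retailer's product page pulling cover art and availability from hosted services rather than building that infrastructure locally. The endpoints and response shapes are entirely hypothetical - invented for the example - and are not Amazon's actual web services calls.

```python
# Hypothetical sketch of a small bookseller composing network-level services.
# The endpoints below are invented for illustration; they are not real APIs.
import json
import urllib.request

COVER_SERVICE = "https://covers.example.com/isbn/{isbn}.jpg"        # hypothetical
AVAILABILITY_SERVICE = "https://inventory.example.com/isbn/{isbn}"  # hypothetical

def product_page_data(isbn: str) -> dict:
    """Assemble a product page from rented services instead of local infrastructure."""
    cover_url = COVER_SERVICE.format(isbn=isbn)  # image hosted by the network service
    with urllib.request.urlopen(AVAILABILITY_SERVICE.format(isbn=isbn)) as resp:
        availability = json.load(resp)           # e.g. {"in_stock": true, "price": 10.0}
    return {"isbn": isbn, "cover": cover_url, **availability}

# print(product_page_data("9780140449136"))  # would only work against real endpoints
```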

The network effect is coming to the library, publisher and bookseller markets. (There is still some opportunity in bookselling, despite Amazon.) It is interesting to think further about the network effect on publishing, but I think we are seeing the first stages of a radical change with the development of self-publishing houses (AuthorHouse, Lulu, iUniverse), publishing applications (Blurb, Picaboo) and blogging tools (Wetpaint, WordPress, Blogger), all of which represent a very different way of publishing. I think it is the beginning of the death of publishing as we know it, but by no means the death of publishing.

Similarly, in the library community, software vendors sell expensive software implementations for local catalogs (OPACs) which are proprietary and often islands of information with only minimal integration with the outside world. More often than not the applications themselves are filled with features and gee-whiz stuff that no one needs or uses. Raise all of this functionality up to the network level and a library is able to select the applications it wants as component parts and assemble them as it pleases. Due to the increasing strength and decreasing cost of communications and bandwidth, the library can run the critical tools it needs via a set of network-level applications. Importantly, it does this without a large, expensive investment in hardware or software, and it gets continual access to software development improvements.

It is an interesting time, and I will be thinking more about the network-level impact on bookselling and libraries.

Wednesday, January 24, 2007

Spooky Cry for Help? All the books Art Garfunkel has read

Unfairly, a part of me thinks this is the work of a troubled mind. On the other hand, I am having difficulty remembering what books I read back in the 1970s, and if I had been this diligent then I wouldn't be having this problem. He presaged Librarything by decades.

Interestingly, 52 people have tagged this web page in delicious. I think most are in awe.

A new take on personal cataloging of books and the like is gurulib, which I have yet to test out but which has a similar purpose to Librarything. We all need a little competition. This site is free for the moment, although Librarything was hardly expensive.

Tuesday, January 23, 2007

Google News and Other News

The Sunday Times reports that Google is looking at developing 'a system' to allow e-book downloads to laptops and PDAs. Hummm. Not too revolutionary, other than to wonder: if they put their weight behind it, will it blast e-books into the stratosphere?

Pearson were cagey all year about their results, suggesting that the second half of the year would be much harder. Well, they are on target for record earnings. Their stock declined.

Here are the Edgar Award nominees.

The French apparently love James Bond.

There was a Google love-in at the New York Public Library last week. Predictably, publishers are wrong and Google is right.

If you see the local Barnes & Noble windows filled with Chocolate in a few weeks you will know why. In these competitions they never say where the people who live in New York get to go if they win. I guess the presumption is we get to go to Omaha. It's just not fair.

Monday, January 22, 2007

Archiving Special Collections.

Visit the British Library web site and page through some amazing texts coupled with an audio recording about the work. It is very cool.

As more and more libraries start to digitize their 'special' collections (and not via Google either), it will be interesting to see how they 'display' these collections in the online context. I believe small, thriving businesses will develop that help libraries create online or electronic shows - the online version of the material that typically shows up in the glass display cases in the library lobby.

It is great to have all this 'special' collection material available for research and access, but for the casual library patron some filtering and explanation/analysis is important for enjoyment and value. This is why I think you will find digital archivists selling their services to libraries to create these representative packages or shows of the material. These digital archivists will present the best parts of the collection and 'design' or curate these online shows. Viewers will be able to access the material via the web, but perhaps they could also view the material in the library via web-enabled kiosks.

There is so much of this material coming online. (OCLC has three different programs designed to collect this material.) I think only a small percentage of the material has thus far been digitized, and we run the risk of having a glut of material that, if not organized and presented to the typical patron, may never be seen. This would be akin to the local public library collection that sits undisturbed in a special viewing room that isn't open to the public.

The need to create logical presentations of this valuable material draws attention to the importance of the local librarian in selecting or editing it. The process also creates new opportunities to add to the material in ways that were perhaps discouraged or difficult before. This is especially the case with oral histories. Creating a visual representation of the material in a special collection and then encouraging some long-lived local patrons to add their oral histories to the presentation would add significantly to the relevance and importance of the collection.

As I said, I think this is coming because there is an obvious need for services to support this type of activity.