Google Book Search, Open Content Alliance and Live Search Books notes

From PLN

The basics:

  • Google Book Search (GBS, previously Google Print) seeks to support search access to all books, present and past, through a combination of arrangements with current publishers and the Google Library Project (GLP) to digitize older books. Several million books appear to be part of GBS as of the end of 2007. Google has been criticized for the proprietary nature of its project, for secrecy (including confidential contracts with libraries--but, thanks to state laws, some of these contracts are now public knowledge), for the way GBS works and, more recently, for substantial flaws in the scans themselves. It has also been sued for copyright infringement by one group of publishers and by another group of writers.
  • The Open Content Alliance (OCA), led by the Internet Archive, seeks to digitize print books--books in the public domain, at least at first--held in library collections and make them available for searching, online reading and downloading. OCA began with a substantial publicity push and appeared to have Yahoo! as a lead partner along with many libraries and corporations and a commitment to openness. To date (late 2007), the most visible results of OCA are a beta Open Library site (oddly, the original Open Library site offers only information and access to a handful of books), portions of Microsoft's Live Search Books and (metadata searching only) the Internet Archive; only a few hundred thousand items appear to be available at this point. (The most recent press release from OCA says more than 100,000 freely available items; IA includes some 314,000 text items as of late December 2007, but only some of these come from OCA.) OCA digitizing is claimed to be high quality and appears to be generally superior to GLP scanning.
  • Microsoft is a partner in OCA but has also established separate scanning/digitizing agreements, both feeding Live Search Books (which also supports access to current books through publisher agreements). As of late 2007, close to a million books appear to be part of Live Search Books. Public domain books are fully viewable and downloadable. Microsoft scans appear equivalent to OCA scans. Update May 2008: Microsoft is shutting down Live Search Books and, in the process, making all its scans of public domain books fully public domain, available through OCA and elsewhere.
  • GBS and OCA agreements (and, as far as is known, LSB agreements) are nonexclusive and all scanning methods are designed to avoid damage to books (with fragile books withheld from current projects). Several libraries and groups of libraries are involved in more than one project.
  • Both GBS and LSB incorporate WorldCat.org searches into book results. As of this writing, the OCA and Open Library sites do not.
  • GBS, LSB, and Open Library all offer full downloads for (some) public domain works. GBS adds a page of requests for handling such downloads; after being attacked for those "privatizing" conditions, Google made it clear that these were requests, not demands.

Leader's Digest

by Leslie Dillon

Google Books API

Leader's Digest March 2008

Google’s new Books API (application programming interface) helps people gain access to the books Google has digitized. Books Viewability lets web developers “locate titles on Google Book Search and automatically embed links to those books on their own sites.”

The earliest adopters are libraries that are linking their OPACs to Google Books. Ann Arbor (MI) District Library converted its catalog in less than a day. So now, if someone searches the AADL catalog for The Iliad, for example, they’ll find among the results a link saying “Look inside this book at Google Books.” Clicking on that link will take them to Google’s Iliad entry, which “includes references, reviews, and popular passages, not to mention a searchable text of the work.”
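To make the catalog-linking idea concrete, here is a minimal sketch of how a catalog page might build a viewability lookup for a list of ISBNs. The endpoint and the `bibkeys`, `jscmd` and `callback` parameter names follow the request format as I understand it from Google's documentation and should be checked there; the function name, the callback name and the ISBN are placeholders, not anything AADL or Google published.

```python
from urllib.parse import urlencode

# Assumed endpoint for the viewability lookup; verify against Google's documentation.
BOOKS_ENDPOINT = "https://books.google.com/books"


def viewability_url(isbns, callback="handleBooks"):
    """Build a lookup URL for a list of ISBNs (hypothetical helper).

    The response is expected to be a script that calls `callback` with
    per-book details such as a link to the book's Google Books page.
    """
    bibkeys = ",".join(f"ISBN:{isbn}" for isbn in isbns)
    query = {"bibkeys": bibkeys, "jscmd": "viewapi", "callback": callback}
    # Keep ':' and ',' readable in the query string instead of percent-encoding them.
    return f"{BOOKS_ENDPOINT}?{urlencode(query, safe=':,')}"


# Placeholder ISBN, used only to show the shape of the URL.
print(viewability_url(["0140445927"]))
```

A catalog would load that URL from the record page and, if the response says the book can be viewed, show the “Look inside this book” link. This sketch only builds the URL and makes no network request.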

Some experts have responded to the new API with skepticism. Google’s Book Search has been controversial, and some worry that this will increase Google’s power even further. Several university libraries are already using the Books Viewability API, and Ex Libris and LibraryThing have integrated Google Books searching into their products.

But, as Ann Arbor’s IT and product development associate Eli Neiburger says, people “have come to expect a lot more out of book searching than the staid old card catalog could provide. … There are a lot of commercial products … that do this, but Google has a much broader reach.”

(Michael LoPresti, "Google Books reaches out with new API," Information Today NewsBreaks, http://newsbreaks.infotoday.com/default.asp, Mar. 20, 2008.)

Google's moon shot

Leader's Digest February 2007

If you haven't already read Jeffrey Toobin's article in the February 5 issue of the New Yorker, I heartily recommend that you read it, but if you don't have the time, here's what I took away from a pretty close reading. Toobin describes Google's quest for the universal library, looks at the Google Books Project, which aims to digitize all the books described in WorldCat "inside of ten years," includes some explanations of Sergey Brin and Larry Page's goals, clearly explains the legal issues behind publishers' lawsuits and describes how Google Book Search works and the business model behind it.

Before Google, Brin and Page worked on Stanford's Digital Library Technology Project. The project participants believed that "putting things on dead trees was obsolete and getting it all into a searchable, digital format...had to be accomplished someday." But, according to Brin, they were less interested in making it easy "to obtain the full text of books online than in making accessible the information those books contained." What they wanted was "comprehensiveness of a search...having the really high-quality information."

Google hosted a recent conference on the future of publishing, whose message, Toobin believes, can be best summed up by a quote from Charles Darwin: "It is not the strongest of the species that survive, nor the most intelligent, but the ones most responsive to change." One former publishing executive described Google as the gatekeeper. They're reaching audiences that publishers haven't.

So it makes sense to use a search engine to help sell books. By now, Google has formed partnerships with almost all the major American publishers. But in spite of that, some of those very publishers have sued Google, particularly over books that are still copyrighted, or of uncertain status and out of print. Google's defense is that its use of these books is "transformative," that Google Book Search is a different product from the original books. Being able to search a book isn't the same as making the book available. Most of those involved in the legal dispute believe there will be a settlement. Unfortunately, though, "a settlement could insulate Google from competitors, which would be especially troubling, because the company has already proved that when it comes to searches, [Google] is not infallible." YouTube, not Google, got video search right, and Technorati, not Google, got blog search right. So if Google doesn't get book search right and the lawsuit is settled, the settlement could eliminate the competition that might.

Toobin's interviews included Google's chief engineer for book scanning, Dan Clancy, who explained that previous book-scanning efforts were constrained by budget and scale, and they had to spend all kinds of time "figuring out which were the perfect 10,000 books, so they spent as much time in selection as in scanning." Because there hadn't been any need to build a machine that could scan 30 million books, Google had to build it themselves. Google doesn't discuss its proprietary scanning technology, but instead of investing in page-turning technology, they've hired people to run the machines. That's at least in part because automatic page turners are designed for the normal book, but there's no such thing as a normal book. Google also won't discuss how much the books project is costing, but using Microsoft's claim that it will spend $2.5 million to scan 100,000 books, Toobin estimates it'll cost Google $800 million to scan 32 million books (the number in WorldCat), "a major, but hardly extravagant expenditure for a multibillion dollar corporation." Clancy said the book project's biggest challenge was to "get somebody something that they are actually interested in, inside a book... Web sites are part of a network," but books aren't. "There's a huge research challenge, to understand the relationship between books."
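The arithmetic behind that $800 million figure is easy to check. A minimal sketch, using only the numbers quoted above (the variable names are mine):

```python
# Figures quoted in the article.
microsoft_cost_dollars = 2_500_000   # Microsoft's stated spend
microsoft_books = 100_000            # books covered by that spend
worldcat_books = 32_000_000          # number of books in WorldCat, per Toobin

# Implied cost per book, then scaled up to the WorldCat total.
cost_per_book = microsoft_cost_dollars / microsoft_books
estimated_total = cost_per_book * worldcat_books

print(f"Cost per book: ${cost_per_book:,.2f}")
print(f"Estimated total: ${estimated_total:,.0f}")
```

The estimate assumes Google's per-book cost matches Microsoft's claimed $25, which neither company has confirmed.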

Toobin makes it clear that "the central truth about Google Book Search" is that it is a business. And while huge profits from the books project aren't likely, Google has often made money from unlikely sources. Also, while there is "nothing evil about Google Book Search...there is nothing inherently virtuous about it" either. "Google has succeeded because...it has developed excellent products." (Jeffrey Toobin, "Google's moon shot," The New Yorker, Feb. 5, 2007.)

Cites & Insights

by Walt Crawford

What's not here: Google's Library Project

from Bibs & Blather, Cites & Insights 5:1, January 2005

The Google Library Project was announced near the end of 2004, with enormous amounts of library blog and list commentary--more than I thought an announcement deserved. I offered three quick notes, including the following paragraph, which I still regard as the most cogent thing that can be said about GBS versus libraries:

Google’s project spells doom for neither libraries nor print books. The sky is not falling, now or six years from now. Your library probably has a lot of post-1922 books, none of which can be made freely and wholly available on Google without publisher agreement. Your library should do a lot more than just hand people books one page at a time. Publishers that have posted books online have generally found that print sales increase as a result. The Google project has every chance of increasing library use and sales of print books. If I had to bet, I’d bet on that outcome as a success for the Google project.

Google and Gorman

Net Media Perspective: Google and Gorman, from Cites & Insights 5:6, April 2005

The first portion of this essay discussed "prototypical reactions"--still extremely early in the project, with a fair amount of "the sky is falling" language and accusations that GLP "means libraries are being commercialized." There were even assumptions that Google was "disbinding" books for faster scanning--which would be disastrous if true, but was absolutely false. Some more sensible reactions included Dorothea Salo's warning that scanning and OCR don't turn a print book into a usable digital object--and it's pretty clear that the results of Google's library scans are frequently not usable digital objects. (The remainder of the essay discusses Michael Gorman's unfortunate attack on blogs and bloggers, and has no relevance to this discussion--or much of any other discussion, truth be told, except possibly on the duties of leaders to think about what they say and write.)

What's next? Academic libraries in a Google environment

a report on an ACRL program by Joy Weese Moll, from Cites & Insights 5:7, May 2005.

For a brief time, Cites & Insights included contributed program reports. This one, based on a session at the 2005 ACRL National Conference, includes excellent summaries of discussions by Google's Adam Smith on Google Print (the earlier name for GBS) and Google Scholar and by John Price Wilkin of the University of Michigan on the library perspective.

The report's worth reading in its entirety. A couple of key paragraphs, in both cases Moll's reporting:

Adam Smith on Google Print's goals: Google’s motivation for Google Print is to enhance the quality of the search. Smith believes that Google Print does not signal the beginning of the end for libraries, that the roles of Google and libraries are complementary, and that Google Print will help the user discover and use library resources.
John Price Wilkin on transformative possibilities: Wilkin asked that we begin to consider the transformative implications of the Google Print project. He wondered about broad social issues like the effect of wide, efficient, democratizing access to information. He says that the project has already proven to be a factor in driving clarification of intellectual property rights, including the orphan copyright issue.
Wilkin also wondered about the transformative implications of Google projects on libraries. What are the possibilities for a cooperative, universal library? What are the implications for library-as-place given the paradox of rising gate counts as more information goes on-line? If libraries cede the generalist role to Google, how can they facilitate specialization in service? How can Google Print and Google Scholar free up resources for related issues like institutional repositories and scholarly communication?

Perspectives: OCA and GLP

Two essays from Cites & Insights 5:14, December 2005

By this time, the Open Content Alliance had been announced to considerable fanfare. The first of these two essays asks some questions about Project Gutenberg, GLP, OCA and libraries--and got one answer somewhat wrong (Project Gutenberg does include some digital books, although it's primarily just the texts). That first essay was primarily an attempt to make distinctions between book-length etext and actual digital books, including some of the organizational qualities that make books more than text. Note that Google Print became Google Book Search "between the time these essays were first written and the time they appeared"--thus, in late October or early November 2005.

The second essay, "OCA and GLP 2: Steps on the digitization road," is more relevant to this discussion. It's a long essay (more than 9,000 words) covering the early history of OCA, comparisons between the two projects and--extensively--the two lawsuits against GBS and a range of opinions as to their validity and issues raised. Too much to summarize here, but valuable as early history.

OCA and GLP redux

"Followup/Feedback Perspective" in Cites & Insights 6:1, January 2006.

This lengthy piece clarified the Project Gutenberg situation and added some new information (or information I'd missed) on OCA, but it's mostly more commentaries and other notes on GBS. Again, too long and complex to summarize here, but a summary of all my pre-2008 notes on OCA and GBS appears in the January 2008 Cites & Insights, out at the start of the new year (at http://citesandinsights.info/civ8i1.pdf, which will be a dead link until 1/2/08). On the other hand, one major section of this article does not appear in the summary, as I decided it was pointless to argue with a particular university professor.

Discovering books: The OCA/GBS saga continues

Perspective in Cites & Insights 6:6, Spring 2006.

"The short version could be one paragraph. New members continue to join the Open Content Alliance, with affiliated projects such as Alouette, involving 27 major Canadian academic research libraries, and a group of committees have formed to plan OCA’s future. The Google Library Project keeps scanning, the lawsuits haven’t been settled, Google continues to be more opaque than seems necessary—and Google Book Search generates lots of articles and discussions."

The essay discusses the purported 2006 agenda of OCA, the somewhat related Million Book Project, and--at some length--GBS items. The Million Book Project, headquartered in India, can't be faulted for modest ambitions or schedule:

Note this assertion at the Indian center: “The technological advances today make it possible to think in terms of storing all the knowledge of the human race in digital form by the year 2008.” I find that a trifle optimistic. It appears that the project is becoming affiliated with OCA, to some extent. It clearly can’t be accused of being Anglocentric: Of the 600,000 books scanned, roughly 135,000 are in English.

As of late 2007, the Million Book Project (www.ulib.org) has apparently scanned 1.5 million books, including some 400,000 in English. The OCA/Internet Archive connection doesn't seem to amount to much: Fewer than 30,000 "Universal Library" books (Universal Library is the ambitious name used by Million Book Project) are available at IA, with the project's own site (www.ulib.org) offering much broader access.

Scan this book?

A Perspective in Cites & Insights 6:9, July 2006.

I could dismiss this piece as silly-season coverage--except that the essay discussed appeared in the New York Times. I refer to Kevin Kelly's remarkable "pile of technological determinism" (same title as the heading above, substituting an exclamation point for the question mark). Much of the relatively brief Perspective considers a few reactions to the essay (including John Updike's vivid reaction) and an awful anti-book screed by Jeff Jarvis including a statement that sounds just right coming from a former TV Guide columnist: "Print is where words go to die." I won't bother you with that nonsensical piece, but will offer a few paragraphs of my comments on Kelly's much higher-profile article (naturally--it appeared in a widely-read print newspaper instead of on a blog):

Kelly’s essay claims that the various book scanning projects are “assembling the universal library page by page,” quite an ambitious claim for OCA, Google Library Project, and friends. He goes on to say that this “planetary source of all written material” will “transform the nature of what we now call the book and the libraries that hold them”—toward Kelly’s “Eden of everything” and “away from the paradigm of the physical paper tome.” He assures us that search technology will enable us “to grab and read any book ever written,” surely not a likely outcome of any current projects—and that “with tomorrow’s technology” his estimate of “the entire works of humankind” (he says 50 petabytes) will “fit onto your iPod.”
A bit later, he seems to assert that nobody prints out web PDF documents; people “happily read” them on computers. He claims “still more people now spend hours watching movies on microscopic cellphone screens”—without any apparent evidence. Then he launches into his fevered dreams of books “reading” one another, a future where “no book will be an island.” Somehow, indexing every word—or, as he puts it, as each word is “cross-linked, clustered, cited, extracted, indexed, analyzed, annotated, remixed, reassembled” they will be “woven deeper into the culture than ever before” as “every page reads all the other pages.” Whew.
Tags will “serve better than out-of-date schemes like the Dewey Decimal System.” Every book, “including fiction, will become a web of names and a community of ideas.” Kelly throws in more figures: there are 100 billion web pages with 10 links each, making a trillion “electrified connections”—and, for those who find those numbers suspiciously neat, raising a question as to whether Kelly just makes this stuff up.
What happens when all books become “a single liquid fabric of interconnected words and ideas”? He says this will “deepen our grasp of history” and cultivate a “new sense of authority”—or, just maybe, it could leave us drowning in interlinked trivia.
There’s more, a lot more. Kelly loves universalisms. It’s “obvious to all that copyright now existed primarily to protect a threatened business model.” “No one doubts electronic books will make money eventually.” [Emphases added] He also loves oppositions, contrasting “people of the book” with “people of the screen.” He assures us that digital technology “has now disrupted all business models based on mass-produced copies.” He suggests authors should make their livings through performances and sponsorship, giving up any chance of royalties as such. We’re told “copies don’t count any more.” Not that books matter much anyway: “The only way for books to retain their waning authority in our culture is to wire their texts into the universal library.”

Book searching: OCA/GBS update

Perspective in Cites & Insights 7:1, January 2007.

This piece, the last significant C&I coverage of GBS or OCA until the January 2008 issue, notes a range of commentary on both projects, the short-lived Google Librarian Newsletter, early experience with Live Search Books--and some interesting comparison tests on what you can find in GBS and LSB. It's short enough to read directly. One of my few conclusions was that LSB needed to add "find in a library" functionality to be really useful for book discovery--and that's happened.

Discovering books: An OCA and GBS retrospective

Perspective in Cites & Insights 8:1, January 2008.

Possibly more than you want to read about the Open Content Alliance and Google Book Search, including summaries of the previous articles (above) and notes from 2007. Really too long to read in HTML form.

Related articles

  • Libraries in the new age - Harvard's Robert Darnton sees a bright future for libraries as Google Book Search makes book learning more accessible.
  • Searching notes - Notes on searching and search tool development in general.
