Even before its debut, a lot was written about Google Print, and The Open Library managed to snag some column inches as well when the project announced that it had funding to scan 150,000+ books that had fallen out of copyright in the US.
While I can’t pretend that this is going to be in any way a scholarly take on the two services, I do want to discuss how they differ from each other and from the much older Project Gutenberg archive of digitized books. I want to do so in more detail than I was seeing in article after article that barely discussed what actually using the sites was like.
This article about the Open Library gives you an overview of how the books end up in digital form, and it’s doubtful that there is a lot of difference between how this is done in Google’s project and The Open Library. If you are at all interested in the mechanics of the process, it’s a good read and not particularly long.
Note: Because The Open Library is, for now, only handling books that are in the public domain, I picked the Henry James novel An International Episode for most of the examples below. In its original version it had illustrations mixed with text, it is long out of copyright, and the same edition is available from both Google and The Open Library, making for easy comparison of how the two services deal with the same book.
Unfortunately most of the press given to Google’s online library has been devoted to the controversy about the fact that they are digitizing not just public domain books and those books voluntarily contributed but also those still in copyright.
If you haven’t had a chance to read about the controversy I’ve added some links to articles at the end of this one. I come down squarely on one side on this and it’s Google’s side. However, what I think of that is completely irrelevant to how the services deal with books online because they are going to offer pretty much the same features and UI for all books.
Google’s take on books is, not surprisingly, very search-centric. They don’t offer any kind of browsing interface to find books, and you’ll find that once you’ve gotten yourself into a book, even one which is in the public domain, there is no mechanism for jumping to particular pages. Here’s the opening of the story from James shown in Google. Note that the available navigation is limited to flipping back and forth by a single page or jumping directly to things like the table of contents.
Note that if you got here via a search for some text which appeared on this particular page, that text will be highlighted, even though the page is actually an image. That’s a nice touch. Other things to note are that there are links to find and buy the book via various services but no opportunity to download the book in its entirety, even though it is in the public domain, nor to zoom into the page or isolate the page image for printing. It’s all about finding and nothing else.
That lack of zooming capability doesn’t seem so harsh when you are dealing with the nice big print of an 1892 James novel but let’s look at a much newer book (Pirates 1660-1730) in the Google Print collection.
That’s not so nice, is it? Notice that the print is starting to actually obscure itself due to the lack of resolution. In a book with small type on the entire page or perhaps just in spots (e.g. footnotes or text below illustrations) this is likely to sink below the level of readability.
The Open Library
Searching? Not so much. “two young englishmen” produces no results at The Open Library, even when you’re staring at an open copy of An International Episode. Not so good. It only seems to support single word searches and if you try to move your cursor within the search field with the left and right arrows, it ends up turning pages in the book instead of moving the cursor.
The big thing to note here is that the book looks like a book. It has the slightly yellowed pages of a hundred-plus-year-old copy. The pages haven’t been digitally bleached white in an attempt to display only the content of a page. Obviously, if the damage to a page was severe or you wanted to print a page without all the extras (perhaps you have no desire to waste all your yellow ink reproducing the details of an old page), you might appreciate having every page stark white. I’m not big on that though. I like the fact that the pages “turn”, and they seem to do so very quickly, and best of all I like what you get when you click “Print”.
Download a PDF? Download a DjVu version of the book?!? Fantastic. You can even pay to have a copy printed at Lulu.
DjVu was a wonderfully promising compression system for printed text materials. It excels at books, comics, catalogs, anything which combines illustrations, page characteristics (like folds, dings, page color, etc.) and printed text. The original creator didn’t have a lot of luck licensing it and eventually open sourced their implementation. Now we see it being put to excellent use here to give us high quality book pages at a smaller file size than a corresponding JPG image of the same page would require.
A brief word about how the veteran at online books deals with the same volume.
At Project Gutenberg, it’s all about the words. Literally. It’s rare to find any book with its original illustrations or even a note to indicate whether it had any. Usually what you can get for any book is a straightforward text file with the original text of the book. It’s great if all you want to do is read the book or even produce your own version of a book. Let’s say you’d like to create your own illustrated Alice in Wonderland. Would you want to start with searching at Google Print? No, that’s useless. Images of the original Alice at The Open Library? Interesting reference material, but ultimately you need the words, just the words, and you are only going to get that from Project Gutenberg.
Gutenberg is there for specialized needs like the one I just mentioned but also for those cases where there aren’t any illustrations to be preserved (a multitude of books) or you need to have the book in a form where actual page images are not practical such as reading from a phone or PDA.
Strange Behavior in Google Print
I thought I’d mention that Google seemed to do something really strange when I was looking for the same pages that I had found in the Open Library. I did a search for “two young Englishmen had occasion” since that appeared on a page that I thought would be a good comparison between Google and the Open Library. Unfortunately the results weren’t what I hoped. Doing a search on that phrase returns two books. One is a Penguin book of James stories, which lacks the illustration that is in the Open Library version. The other instance is from a book that comments on the passage. Good info but not what I’m looking for.
A whole series of searches for Henry James, Henry James as an author, “an international episode”, etc. frustratingly failed to deliver what I wanted. Eventually I found the same book in Google’s Print library with the same page and illustration (search “two young englishmen” henry james), but it thought the book was copyrighted (even though on the same page it correctly showed the publication date as 1892) and would only show me snippets from within the page. With more work and different searches I got to what seemed in every detail to be the exact same book, but this time it recognized that the book had fallen into the public domain. Just for kicks I have linked to the page of explanation that Google links to from copyrighted texts which they won’t show in their entirety. But note that when I tried to find the “copyrighted” version of the 1892 text again as I wrote this, I was unable to do so. It may have just been a temporary glitch.
BTW, the reason the initial query I tried didn’t work was due to the hyphen that appeared in a word break at the end of a line. If I search for “two young englishmen had oc-casion” then Google is able to find it. Oops. That’s not terribly helpful if I don’t have the original text sitting in front of me to see where they might have inserted word breaks…
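To make the hyphen problem concrete, here is a minimal Python sketch of the kind of cleanup that would let the plain phrase match. It is only an illustration under assumptions: the function names `dehyphenate` and `normalize` are made up for this example, the sample page text is a stand-in for OCR output with the line break still in it, and nothing here describes how Google actually indexes its scans.

```python
import re

def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line.

    Assumption: the raw page text still contains the line break, so
    "oc-\ncasion" can be told apart from a hyphen in the middle of a line.
    """
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

def normalize(text: str) -> str:
    """Lowercase, rejoin line-end hyphenation, and collapse whitespace."""
    return " ".join(dehyphenate(text).lower().split())

# Hypothetical OCR output for one page, with a word broken across two lines.
page_text = "Two young Englishmen had oc-\ncasion to visit."

query = "two young englishmen had occasion"
found = normalize(query) in normalize(page_text)
```

The obvious weakness is that a genuinely hyphenated word that happens to break at the end of a line (say "well-" followed by "known") would be joined into one word too, so a real system would probably need a dictionary check or would index both forms.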
Either Way, We Are In The Infancy
Even though searching (with significant glitches) can get you to something specific you are already aware of, where is the browsing environment? How do I just walk down the “aisles” of these tens or hundreds of thousands of books we’ll have in the next twelve months and find something to read? Is there a library interface that lets me say, I’m interested in mystery short stories and it will show me: a) what is available and b) what might actually be worth reading? Project Gutenberg sure won’t let you do it. Your available methods for tackling the 17,000 free books and magazines in their collection are author, title, language, and how recently it was added. Not exactly a browser’s bonanza.
I recently wanted to locate a magazine I had helped convert to electronic form for Project Gutenberg via the Distributed Proofreaders website. I thought it might be fun to read the rest of an article I had worked on. It took more than a half hour to find something that should have been as simple as looking at the titles of all the magazines in Gutenberg’s collection. No such categorization existed. Nor could I filter by the age of the material, or the country of origin.
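The kind of filtering I was wishing for is not complicated. As a rough sketch only, assuming each catalog entry carried a type, a year and a country (fields that, as far as I could tell, were not exposed), it could look like the Python below. The `Record` class, the `browse` function and the sample entries are all hypothetical placeholders, not real catalog data or any site's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    title: str
    kind: str      # "book" or "magazine"
    year: int      # year of original publication
    country: str   # country of origin

def browse(catalog, kind=None, before=None, country=None):
    """Return records matching every filter that was given, sorted by title."""
    results = []
    for record in catalog:
        if kind is not None and record.kind != kind:
            continue
        if before is not None and record.year >= before:
            continue
        if country is not None and record.country != country:
            continue
        results.append(record)
    return sorted(results, key=lambda r: r.title)

# Placeholder entries, invented purely for illustration.
CATALOG = [
    Record("Sample Magazine B", "magazine", 1921, "UK"),
    Record("Sample Novel C", "book", 1892, "US"),
    Record("Sample Magazine A", "magazine", 1885, "US"),
]

magazine_titles = [r.title for r in browse(CATALOG, kind="magazine")]
old_us_magazines = browse(CATALOG, kind="magazine", before=1900, country="US")
```

With something like this, "show me the titles of all the magazines" becomes a single call instead of a half hour of hunting.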