The Book-Scanning Project

Foolishly, I complained.

"I've got too many books," I said. "Not too many, of course. But I have now reached the point where I can forget that I own a book."

"You should catalog your collection," someone said. "Keep the database on your Palm."

"Hell no. It would take forever, and then it would keep taking forever every time I bought more books. I don't need anything else to remember regularly."

"You know," someone said, "I bet you could get a bar-code scanner cheap, and scan the ISBN codes. Then it would be a Small Matter of Programming to look up the titles on the Net."

My reply was offensive and heartfelt.

Caveat

This code doesn't work any more. It hasn't worked since about three months after I wrote it, which was the year 2000. All the web sites that it scans have changed their format.

I apologize for not keeping this stuff up to date, but I don't use it any more. There are great shareware alternatives out there, and probably some open-source ones, but I haven't investigated them.

Research

The project requires several stages. In my usual logical, methodical way, I started in the middle.

Looking up ISBNs - Converting UPC to ISBN - Buying a scanner - Querying for Titles - Afterword

Or, if you don't care about the historical details, here's How To Make It Work For You.

You probably want to check the Page of Updates to This Document, since some details have changed since I actually did this myself.

And finally... the Book List. Also the sublist of recent acquisitions.


Looking up ISBNs

A Google search for "isbn search" turned up one big winner: http://isbn.nu/. This site lets you enter an ISBN -- or even a raw UPC number -- and then looks up the book at several on-line booksellers, and gives you a price comparison table.

Useful, but not what I was looking for. Since the site is just referring queries to Amazon.com (and so on), I might as well go to those sites directly. I don't need price comparisons. Possibly, if Amazon turns out to have an incomplete ISBN database, I might want to come back to isbn.nu rather than design queries for several booksellers myself. But Amazon probably has the goods, right?

The Amazon search form is large, ugly, and full of crap I don't need. I was resigned to figuring out all its details when someone on rec.arts.sf.written pointed out that a URL such as http://www.amazon.com/exec/obidos/ISBN=1565922867/ would work as well. Bingo!
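
For the record, building that query URL takes about one line. Here is a minimal sketch in Python; the function name is made up for this example, and the URL pattern is simply the one quoted above, which (per the Caveat) no longer works.

```python
def amazon_isbn_url(isbn):
    """Build the old-style Amazon lookup URL for a ten-character ISBN."""
    # Drop hyphens and spaces; the URL wants the bare ISBN.
    bare = isbn.replace("-", "").replace(" ", "").upper()
    return "http://www.amazon.com/exec/obidos/ISBN=" + bare + "/"

print(amazon_isbn_url("1-56592-286-7"))
```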

The snag:

At first, Amazon seemed to be finding less than ten percent of the ISBNs I was submitting. Ack! Blah! Disaster! No, I was just generating the ISBN checksum digit wrong, ten times out of eleven. (Naturally, the one book I had within reach was the eleventh case -- that's why I didn't catch the error.)

In fact, Amazon finds some 85% of my test run. By checking both Amazon.com and Chapters.ca, I pushed that up to some 92%. Amazon.co.uk gave me a few more. The remaining intractables weren't found by any other web database I tried, so I expect that's the best I'm going to get.

(Bowker provides on-line access to Books Out Of Print... for thirty bucks per week. Oh well.)

Converting UPC to ISBN

The bar-code on a book is a thirteen-digit number, starting with "978". (Apparently, the first three digits of a UPC indicate the country of origin. "978" is officially Bookland. Cool, eh? The UPC number for a book is sometimes called the "Bookland EAN".)

An ISBN is a ten-digit number (the last digit may be "X").

A conversion must be found.

Fortunately, it's trivial. The http://isbn.nu/ search form actually does it in JavaScript:

if (isbn.indexOf("978") == 0) {
   isbn = isbn.substr(3,9);
   var xsum = 0;
   var add = 0;
   var i = 0;
   for (i = 0; i < 9; i++) {
        add = isbn.substr(i,1);
        xsum += (10 - i) * add;
   }
   xsum %= 11;
   xsum = 11 - xsum;
   if (xsum == 10) { xsum = "X"; }
   if (xsum == 11) { xsum = "0"; }
   isbn += xsum;
}

Not that JavaScript isn't annoying and stupid, of course. (I keep it turned off, so I can't even use the UPC search function on that web site.) But it saved me from looking up the details of UPC and ISBN checksums.

(Small footnote: Chris Taylor contributes the Java code equivalent to the JavaScript above.)
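
A rough Python sketch of the same conversion, for comparison. The function names are invented for this example and are not taken from the scripts described later; the arithmetic follows the JavaScript above (weights 10 down to 2, then 11 minus the remainder, with 10 written as "X" and 11 as "0").

```python
def isbn10_check_digit(body):
    """Return the check character for the first nine digits of an ISBN."""
    # Weights run 10, 9, ..., 2 across the nine digits.
    total = sum((10 - i) * int(ch) for i, ch in enumerate(body))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def ean_to_isbn10(ean):
    """Convert a 13-digit Bookland EAN (978...) to a ten-character ISBN."""
    if len(ean) != 13 or not ean.isdigit() or not ean.startswith("978"):
        raise ValueError("not a 978 Bookland EAN: %r" % ean)
    body = ean[3:12]  # drop the 978 prefix and the EAN's own check digit
    return body + isbn10_check_digit(body)

print(ean_to_isbn10("9781565922860"))  # the ISBN from the Amazon URL above
```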

The snag:

This was a doozy.

Did I say the bar-code was a thirteen-digit number, starting with 978? I lied. Some books have that EAN code. Others have a true UPC code, which is twelve digits. Other books have both (the EAN is often inside the cover).

Moreover, either kind of code can have a five-digit extension. On an EAN, the extension gives the book's suggested retail price. On a UPC, the second half of the main barcode gives the price, and the extension gives half the ISBN, and the other half of the ISBN is...

Missing. I told you the snag was a doozy.

The first half of the main UPC barcode is a publisher number, which corresponds to an ISBN prefix. You have to look it up in a table. Combine the prefix with the five-digit extension, tack on the checksum, and you have the full ISBN.
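
Once a table exists, the lookup itself is short. This sketch assumes the publisher number is the first six digits of the UPC and that it maps to a four-digit ISBN prefix (which is what the "XXXXXX=YYYY" lines mentioned later suggest). The function names and the table entry are made up for illustration; the entry is not a real publisher code.

```python
def isbn10_check_digit(body):
    """Return the check character for the first nine digits of an ISBN."""
    total = sum((10 - i) * int(ch) for i, ch in enumerate(body))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def upc_to_isbn10(scan, upc_map):
    """scan is 17 digits: a 12-digit UPC followed by its 5-digit extension."""
    if len(scan) != 17 or not scan.isdigit():
        raise ValueError("expected UPC plus extension: %r" % scan)
    upc_prefix = scan[:6]      # assumed: publisher number (first half of the UPC)
    extension = scan[12:]      # half of the ISBN
    isbn_prefix = upc_map[upc_prefix]  # raises KeyError for an unknown publisher
    body = isbn_prefix + extension
    if len(body) != 9:
        raise ValueError("prefix and extension do not make nine digits")
    return body + isbn10_check_digit(body)

# Hypothetical table entry, for illustration only.
example_map = {"099999": "0804"}
print(upc_to_isbn10("09999900699042957", example_map))
```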

But you can't get such a table anywhere. I don't think Bowker sells one -- they're in charge of ISBNs, not UPC publisher numbers.

How to deal? The obvious suggestion (which wasn't obvious to me until Christopher Davis suggested it, thank you Christopher) is to use those books that have both kinds of barcodes. When you scan, scan both whenever possible. The clever scripts can then use that information to build up a table of correspondences.
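
A sketch of how that table-building might go, under the same assumptions as before: a six-digit UPC publisher number, a four-digit ISBN prefix, and the scanner appending the extension directly to the UPC. The names here are invented and the sample UPC is made up; this is not the actual upcfind.py.

```python
def learn_prefix(ean_scan, upc_scan):
    """Given two scans of the same book, return (upc_prefix, isbn_prefix).

    Returns None when the scans don't look like a matching pair.
    """
    if not (ean_scan.isdigit() and len(ean_scan) in (13, 18)
            and ean_scan.startswith("978")):
        return None
    if not (upc_scan.isdigit() and len(upc_scan) == 17):
        return None
    isbn_body = ean_scan[3:12]   # nine ISBN digits, no check digit
    extension = upc_scan[12:]    # should equal the tail of the ISBN body
    if isbn_body[4:] != extension:
        return None              # probably two different books
    return upc_scan[:6], isbn_body[:4]

def build_upc_map(pairs):
    """Build a {upc_prefix: isbn_prefix} table from (EAN, UPC) scan pairs."""
    upc_map = {}
    for ean_scan, upc_scan in pairs:
        learned = learn_prefix(ean_scan, upc_scan)
        if learned is not None:
            upc_map[learned[0]] = learned[1]
    return upc_map

# The UPC below is invented; only the EAN/extension relationship matters.
print(build_upc_map([("9780804429573", "09999900699042957")]))
```

The consistency check (extension equals the last five digits of the ISBN body) is there so that a mis-paired scan doesn't poison the table.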

(Why does this silly EAN/UPC system exist? Basically, I'm told, because mass-market books (mostly paperbacks) are sold in mass-market outlets, like grocery stores and drugstores. Mass-market outlets often have ancient, creaky old scanners which only understand UPC codes.)

(In the distant future -- 2005, specifically -- all scanners will be smart, and publishers can start putting the EAN on every book, even mass-market paperbacks. Of course, everyone's collection will still be full of books without EANs. Life is hard.)

Barcode Scanners

Back to Google. A search for "barcode scanners" turned up several sources. The cheap one on the list -- well, a cheap one -- was Custom Sensors Inc.

(Okay, I don't actually see CSI on the Google result list now. What the hell, I got there somehow.)

CSI sells a couple of relevant toys. They have a pistol-grip point-and-zap scanner (CCD-8000), and a smaller wand scanner (MT-605). These cost about a hundred bucks each, give or take, depending on model. (They have more expensive models too, but I assume the typical book-scanner has spent all his money on books.)

(I went for the wand, on the theory that simpler is better. Also, a bit cheaper.)

One must mind the interfaces. Both of these products come in several forms. You can get them with an RS-232 connector, or a "keyboard wedge". The latter is a clever interface that plugs into the keyboard port of a computer, so that the barcodes you scan appear just as if you'd typed them on the keyboard. The keyboard wedge means that you don't need any data-capture software; just start up any text editor.

I actually wanted the thing to work on both my older Macintoshes, which use ADB connectors, and my PowerBook, which has USB. CSI sells a separate USB adaptor gizmo (MT-606). This converts one of the keyboard wedge interfaces to a USB connection. Thirty bucks. However, my older Macintoshes lose; the manufacturer no longer makes the wand with a Mac ADB interface.

The snag:

None, other than obsolescence of ADB. The order was easy; I called them, named a model, paid by credit card, and it arrived within a week.

I did have to be careful to program the scanner correctly. (How do you program a bar-code scanner? Right! You point it at a special table of bar-codes! I love it.) I set it up to read EAN and UPC, always including the first and last digit, and optionally including the five-digit extension.
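
With the scanner set up that way, each scanned line has one of four lengths, and telling them apart is easy. A minimal sketch, assuming the scanner types the extension immediately after the main code with no separator (the function name is made up):

```python
def classify_scan(line):
    """Split a scanned line into (kind, main_code, extension)."""
    code = line.strip()
    if code.isdigit():
        if len(code) in (13, 18):      # EAN, with or without extension
            return ("EAN", code[:13], code[13:])
        if len(code) in (12, 17):      # UPC, with or without extension
            return ("UPC", code[:12], code[12:])
    return ("unknown", code, "")

print(classify_scan("978156592286051995"))
```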

Querying for Titles

Presume that I've got a simple text file, containing UPC numbers, one per line. I need a program which will convert those to ISBNs, send them to Amazon via HTTP, and parse Amazon's HTML response page.

This is a job for Perl!

Well, I don't know Perl. Perl is fuggly. I've put off learning it this long; I have no great desire to wade in now. Someone (a different someone :-) suggested Python. Python is simple. The dumbass whitespace formatting is annoying, but not in a way that makes the language harder to learn. Python it was.

(Footnote: It strikes me that one could modify the Python compiler to ignore whitespace, and use -- for example -- ":" by itself as a block terminator. Since the compiler is part of the run-time environment, this may even be trivial. It would make a lot of people happier with the language, wouldn't it?)

I decided to split the task into its parts. The first script, upcfind.py, goes over a list of scanned codes (both EAN and UPC) and updates a master table called upc-map. This table, as described above, maps UPC prefixes to ISBN prefixes.

The second script, makeisbn.py, reads the same list of scanned codes; it spits out a list of ISBNs. (If it finds any ISBNs in the original list, it leaves them alone. Any line that looks entirely confusing stays in the list, but the script puts a "#" mark before it so that later programs can ignore it.) This script uses the upc-map table, of course. It's also smart enough to find instances where you scanned both UPC and EAN barcodes of the same book, and only spit out the ISBN once.
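
The behavior described above can be sketched roughly as follows. This is an illustration, not the actual makeisbn.py: the names are invented, and to keep it short it handles only ISBNs and 978 EANs, so a UPC scan would get flagged with "#" here instead of being looked up in upc-map.

```python
import re

ISBN_RE = re.compile(r"^[0-9]{9}[0-9Xx]$")

def isbn10_check_digit(body):
    total = sum((10 - i) * int(ch) for i, ch in enumerate(body))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def to_isbn_list(lines):
    """Turn scanned lines into ISBNs; flag anything confusing with '#'."""
    seen = set()
    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            out.append(line)               # existing comments pass through
        elif ISBN_RE.match(line):
            isbn = line.upper()            # already an ISBN: leave it alone
            if isbn not in seen:
                seen.add(isbn)
                out.append(isbn)
        elif line.isdigit() and len(line) in (13, 18) and line.startswith("978"):
            body = line[3:12]
            isbn = body + isbn10_check_digit(body)
            if isbn not in seen:           # same book scanned twice: emit once
                seen.add(isbn)
                out.append(isbn)
        else:
            out.append("# " + line)        # later programs can ignore this
    return out

print(to_isbn_list(["9781565922860", "1565922867", "hello world"]))
```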

The third script, shelve.py, is the one that actually hits the Internet. It reads the list of ISBNs, and writes two output lists. The "out-err" file is a list of ISBNs that the databases (Amazon and Chapters) didn't manage to find. The "out-good" file has three lines for every ISBN successfully queried. The three lines are ISBN, author, and title.
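
The two-file output is simple enough to sketch without touching the network. Here the lookup is passed in as a function, so a dictionary can stand in for the web queries; the function names and the sample author and title are placeholders, not real data or the real shelve.py.

```python
import io

def shelve(isbns, lookup, good, err):
    """Write found books to 'good' (three lines each) and misses to 'err'."""
    for raw in isbns:
        isbn = raw.strip()
        if not isbn or isbn.startswith("#"):
            continue                       # skip blanks and flagged lines
        result = lookup(isbn)              # returns (author, title) or None
        if result is None:
            err.write(isbn + "\n")
        else:
            author, title = result
            good.write(isbn + "\n" + author + "\n" + title + "\n")

# Placeholder data standing in for a bookseller's database.
fake_db = {"1565922867": ("Example Author", "Example Title")}
good, err = io.StringIO(), io.StringIO()
shelve(["1565922867", "# junk", "080442957X"], fake_db.get, good, err)
print(good.getvalue(), end="")
print(err.getvalue(), end="")
```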

The fourth script, collate.py, is only relevant if you want to turn the data files into a JFile database. (JFile is a shareware database app for the Palm.) The collate.py script just takes one or more data files, strips out blank lines and comments, sorts the data, and adds the one-line header which you need to convert a text file to a JFile PDB. (JFile comes with Windows tools to import data; I wrote jtrans, which does the same job on Unix.)
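
A very rough sketch of that strip-sort-header step. It assumes the data files hold the three-line records described above (ISBN, author, title) and sorts by author, then title; the real collate.py's sort order and output layout aren't spelled out here, so treat the details as guesses. The header text is the "Author / Title" line mentioned in the how-to section below.

```python
def collate(lines, header="Author / Title"):
    """Sort three-line (ISBN, author, title) records and add a header line."""
    # Drop blank lines and '#' comments before grouping.
    kept = [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]
    if len(kept) % 3 != 0:
        raise ValueError("data is not a whole number of three-line records")
    records = [tuple(kept[i:i + 3]) for i in range(0, len(kept), 3)]
    # Assumed ordering: by author, then title, ignoring case.
    records.sort(key=lambda rec: (rec[1].lower(), rec[2].lower()))
    out = [header]
    for isbn, author, title in records:
        out.extend([isbn, author, title])
    return out

# Placeholder records, not real books.
sample = ["0000000002", "Smith, B", "Zebra", "", "# note",
          "0000000001", "Adams, A", "Apple"]
print("\n".join(collate(sample)))
```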

The guts of these scripts are straightforward. Python has library modules for reading lines of text, manipulating them, sending HTTP queries, and returning the results. A bit of regexp cleverness was needed to parse Amazon's HTML, but nothing painful.
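
As an illustration of the regexp side, here is a small sketch using the author pattern quoted in the 8/28/00 note further down. The HTML fragment is invented for the example (it is a guess at the general shape of a link, not a copy of Amazon's page), and the function name is made up.

```python
import re
from urllib.parse import unquote_plus

# Pattern as quoted in the 8/28/00 note: capture up to the next '/' or '"'.
AUTHOR_RE = re.compile('/Author=([^/"]*)')

def find_authors(html):
    """Return the URL-decoded author names found in link targets."""
    return [unquote_plus(raw) for raw in AUTHOR_RE.findall(html)]

# Hypothetical fragment, for illustration only.
sample = '<a href="/exec/obidos/Author=Doe%2C%20Jane/">Jane Doe</a>'
print(find_authors(sample))
```

Matching on link targets like this is exactly why the scripts broke when the site changed its format: the pattern depends on the page's URL scheme, not on anything stable.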

(Of course, it's possible that Amazon will change its response page format. They may even remove the ISBN query URL entirely. I don't guarantee that these scripts will work. They worked for me, is all.)

The snag:

I won't even try to go into it. These scripts went through several revisions before I came up with this set of tools. They're still awkward, but I've been able to use them. Honest.

Afterword

Well, it's the end of a long weekend, and the database is in my Palm 3. What do I conclude about the experience?

Different publishers hyphenate ISBNs in all sorts of inconsistent ways. Ignore the punctuation, and look for ten digits. (And don't trust the older "SBN" too far -- sometimes you can get a valid ISBN by prepending a zero, but in general you can't.)
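
A sketch of that "ignore the punctuation" advice, with invented function names. It strips everything except digits and "X", tries a leading zero on nine-character input (the SBN case), and then checks the checksum. A checksum pass only says the number is well-formed, not that it is the right book.

```python
import re

def isbn10_is_valid(isbn):
    """True if the ten characters satisfy the ISBN checksum."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    # Weights 10 down to 1; 'X' in the last place counts as 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return total % 11 == 0

def normalize_isbn(text):
    """Return the bare ten-character ISBN, or None if it doesn't check out."""
    bare = re.sub(r"[^0-9Xx]", "", text).upper()
    if len(bare) == 9:
        bare = "0" + bare      # old SBN: try prepending a zero
    return bare if isbn10_is_valid(bare) else None

print(normalize_isbn("1 56592 286-7"))
```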

Scanning books is a lot of work. Do not imagine that this project consists of a few minutes of beep-beep-beep, followed by four scripts and you're done. I have six well-filled bookcases (figure 250 books each), and scanning in all the books in a bookcase takes something like 45 minutes. Could be faster if you're lucky, but if you have any quantity of old or obscure books, you'll be typing a lot of comments by hand -- that eats time.

Then, after the scripts are run, you have to go back and fill in missing titles, fix typos and outright mistakes, and generally massage the data. Make sure authors' names and series titles are spelled consistently through the data -- that sort of thing. That's another 45 minutes per bookshelf.

So, overall, I probably blew eight or nine hours on this project -- not counting the programming time. Was this worthwhile?

Hell yes. I could probably type in the titles and authors of 250 books in less than ninety minutes... but it would be a two-person job: one to read off books, one to type the data in. (Try to do both, and it's a running battle whether you drop from exhaustion, turning from the computer to the bookshelf and back, or just petrify your eyeballs from the focussing strain.)

And the typing job would be awful and tedious, even by itself. And you'd have that editing-proofing-massaging stage to do anyway.

So this work certainly saved me folks-hours. It saved me effort; editing a generated list for mistakes is much easier than generating the list yourself, even if it takes nearly as much time. And, I shouldn't even need to say, the prospect of doing the job geekly got me to do it -- I would never have started if the only option had been dronely.

After afterword:

Skip the Penguin has sent in htmlmake.sh, a one-line shell script to mangle the output of collate.py into an HTML page. I've added it to the script package.

8/28/00: Dan Poirier reports that the Amazon web searcher has to be jiggered. In shelve.py, line 73, change re.compile('/Author=([^/"]*)') to re.compile('&field-author=([^/"]*)'). I haven't tried this myself.

8/29/00: Radio Shack is giving away free barcode scanners as part of some marketing program I don't understand. Skip the Penguin has put up a page about using the CueCat scanner for your own purposes (including cataloging books). Linux and Windows instructions included. Or see the Lineo page for another Linux CueCat driver.

Further updates are on the Page of Updates to This Document, since appending them here in the middle doesn't make sense.


Skip the History, Make it Work

(Like I said, it isn't actually going to work. I am preserving this section for historic interest. --Z)

Actually, if you skipped the sections above, you're going to get confused. But I'll try to hit the highlights.

  1. Buy a bar-code scanner.
  2. Program the scanner.
  3. Grab the Python scripts.
  4. Scan a lot of books.
  5. Run upcfind.py on your scanned list (to deduce UPC prefix mappings).
  6. Run makeisbn.py on your scanned list (to turn the barcodes into ISBNs).
  7. Go back to your bookshelf, look over the output of makeisbn.py, and fill in any comments, barcodes, or XXXXXX=YYYY lines that are necessary.
  8. Run shelve.py on the ISBN list (which actually looks up the titles and authors).
  9. Go back to your bookshelf yet again, and go over the data file.
  10. Go through the entire process again, with the next bookshelf.
  11. Now you have a bunch of data files. Import them into whatever database you plan to keep them in.
  12. If you want to transfer the data to a JFile database on a PalmOS PDA, do the following:
    1. Run collate.py on all the data files you've created.
      • collate.py datafilename1 datafilename2 datafilename3 > dbfile
      • This simply concatenates the files, strips out any remaining comments, sorts the entries, and adds an "Author / Title" header line.
    2. Run jtrans to convert dbfile to a PalmOS PDB file.
      • jtrans -e -o -n Books dbfile dbfile.pdb
      • If you have JFile Pro, leave out the "-o" flag.
    3. Use your regular Palm tools to upload dbfile.pdb to your Palm.
  13. If you want to transform the data to an HTML page, do the following:
    1. Run collate.py and htmlmake.sh on all the data files you've created.
      • collate.py datafilename1 datafilename2 datafilename3 | htmlmake.sh > books.html
  14. Check my Book-Scanning Updates Page to see if there are any important details I left out of this document.

Last updated October 2, 2000.

The Book List
List of Recently-Acquired Books
Recent Updates to This Document
