The Book-Scanning Project

Foolishly, I complained.

"I've got too many books," I said. "Not too many, of course. But I have now reached the point where I can forget that I own a book."

"You should catalog your collection," someone said. "Keep the database on your Palm."

"Hell no. It would take forever, and then it would keep taking forever every time I bought more books. I don't need anything else to remember regularly."

"You know," someone said, "I bet you could get a bar-code scanner cheap, and scan the ISBN codes. Then it would be a Small Matter of Programming to look up the titles on the Net."

My reply was offensive and heartfelt.

Caveat

This code doesn't work any more. It hasn't worked since about three months after I wrote it, which was the year 2000. All the web sites that it scans have changed their format.

I apologize for not keeping this stuff up to date, but I don't use it any more. There are great shareware alternatives out there, and probably some open-source ones, but I haven't investigated them.

Research

The project requires several stages:

A bar-code scanner
...which can be hooked up to a computer I own
A program to convert the UPC bar-code number to an ISBN
A web site or other Internet resource to look up books by ISBN
A program to take a list of ISBNs, send them to that web site, and store the results
A program to convert the list of titles and authors to a format that can be viewed on a Palm 3.

In my usual logical, methodical way, I started in the middle.

Looking up ISBNs - Converting UPC to ISBN - Buying a scanner - Querying for Titles - Afterword

Or, if you don't care about the historical details, here's How To Make It Work For You.

You probably want to check the Page of Updates to This Document, since some details have changed since I actually did this myself.

And finally... the Book List. Also the sublist of recent acquisitions.

Looking up ISBNs

A Google search for "isbn search" turned up one big winner: http://isbn.nu/. This site lets you enter an ISBN -- or even a raw UPC number -- and then looks up the book at several on-line booksellers, and gives you a price comparison table.

Useful, but not what I was looking for. Since the site is just referring queries to Amazon.com (and so on), I might as well go to those sites directly. I don't need price comparisons. Possibly, if Amazon turns out to have an incomplete ISBN database, I might want to come back to isbn.nu rather than design queries for several booksellers myself. But Amazon probably has the goods, right?

The Amazon search form is large, ugly, and full of crap I don't need. I was resigned to figuring out all its details when someone on rec.arts.sf.written pointed out that a URL such as http://www.amazon.com/exec/obidos/ISBN=1565922867/ would work as well. Bingo!

The snag:

At first, Amazon seemed to be finding less than ten percent of the ISBNs I was submitting. Ack! Blah! Disaster! No, I was just generating the ISBN checksum digit wrong, ten times out of eleven. (Naturally, the one book I had within reach was the eleventh case -- that's why I didn't catch the error.)
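A checksum bug like that one is easy to catch by validating each ISBN before it goes anywhere near Amazon. What follows is a minimal sketch, assuming the usual ISBN-10 rule: weight the ten characters 10 down to 1, count a final "X" as 10, and require the total to be divisible by 11. The function name is made up for the example; it is not part of the scripts described later.

```python
def is_valid_isbn10(isbn):
    """Return True if isbn is ten characters with a correct ISBN-10 check digit."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in "0123456789":
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10  # 'X' is only legal as the final check character
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10("1565922867"))  # the ISBN from the Amazon URL above
print(is_valid_isbn10("1565922868"))  # same digits, deliberately wrong check digit
```

Running a check like this over the generated list would have flagged ten ISBNs out of eleven immediately, instead of leaving Amazon to report them as missing.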

In fact, Amazon finds some 85% of my test run. By checking both Amazon.com and Chapters.ca, I pushed that up to some 92%. Amazon.co.uk gave me a few more. The remaining intractables weren't found by any other web database I tried, so I expect that's the best I'm going to get.
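The try-one-site-then-the-next approach can be sketched roughly as follows. The Amazon URL pattern is the one quoted above; everything else (the function names, and the idea of passing in each site as a callable that returns a record or None) is an assumption made for illustration, not a description of how shelve.py is laid out. The stand-in lookups below never touch the network.

```python
def amazon_url(isbn):
    """Build the direct-lookup URL in the form quoted in the text."""
    return "http://www.amazon.com/exec/obidos/ISBN=%s/" % isbn


def first_hit(isbn, lookups):
    """Try each (name, lookup) pair in order; return (name, record) for the first hit.

    Each lookup is any callable taking an ISBN and returning a record,
    or None if that site doesn't know the book.
    """
    for name, lookup in lookups:
        record = lookup(isbn)
        if record is not None:
            return name, record
    return None, None


# Hypothetical stand-ins for real site queries, so the sketch runs offline.
def fake_amazon(isbn):
    return None  # pretend Amazon doesn't have it


def fake_chapters(isbn):
    return {"isbn": isbn, "title": "Example Title", "author": "Example Author"}


sources = [("amazon", fake_amazon), ("chapters", fake_chapters)]
print(amazon_url("1565922867"))
print(first_hit("1565922867", sources))
```

Keeping the list of sites as data means a third bookseller can be added without touching the loop.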

(Bowker provides on-line access to Books Out Of Print... for thirty bucks per week. Oh well.)

Converting UPC to ISBN

The bar-code on a book is a thirteen-digit number, starting with "978". (Apparently, the first three digits of a UPC indicate the country of origin. "978" is officially Bookland. Cool, eh? The UPC number for a book is sometimes called the "Bookland EAN".)

An ISBN is a ten-digit number (the last digit may be "X").

A conversion must be found.

Fortunately, it's trivial. The http://isbn.nu/ search form actually does it in JavaScript:

if (isbn.indexOf("978") == 0) {
   isbn = isbn.substr(3,9);
   var xsum = 0;
   var add = 0;
   var i = 0;
   for (i = 0; i < 9; i++) {
        add = isbn.substr(i,1);
        xsum += (10 - i) * add;
   }
   xsum %= 11;
   xsum = 11 - xsum;
   if (xsum == 10) { xsum = "X"; }
   if (xsum == 11) { xsum = "0"; }
   isbn += xsum;
}

Not that JavaScript isn't annoying and stupid, of course. (I keep it turned off, so I can't even use the UPC search function on that web site.) But it saved me from looking up the details of UPC and ISBN checksums.

(Small footnote: Chris Taylor contributes the Java code equivalent to the JavaScript above.)
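For comparison, here is a rough Python equivalent of the JavaScript above. It is a sketch, not the code from makeisbn.py; the function name is invented for the example, and it assumes the input is a clean thirteen-digit string.

```python
def ean_to_isbn10(ean):
    """Convert a 13-digit Bookland EAN (starting with 978) to a 10-character ISBN."""
    if len(ean) != 13 or not (ean.isascii() and ean.isdigit()):
        raise ValueError("expected thirteen digits: %r" % ean)
    if not ean.startswith("978"):
        raise ValueError("not a 978- EAN: %r" % ean)
    body = ean[3:12]  # drop the 978 prefix and the EAN's own check digit
    total = sum((10 - i) * int(digit) for i, digit in enumerate(body))
    check = (11 - total % 11) % 11  # a value from 0 through 10
    return body + ("X" if check == 10 else str(check))


print(ean_to_isbn10("9781565922860"))  # 1565922867
```

The final "% 11" folds the JavaScript's two special cases into one line: a raw result of 11 becomes 0, and only 10 needs to be spelled "X".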

The snag:

This was a doozy.

Did I say the bar-code was a thirteen-digit number, starting with 978? I lied. Some books have that EAN code. Others have a true UPC code, which is twelve digits. Other books have both (the EAN is often inside the cover).

Moreover, either kind of code can have a five-digit extension. On an EAN, the extension gives the book's suggested retail price. On a UPC, the second half of the main barcode gives the price, and the extension gives half the ISBN, and the other half of the ISBN is...

Missing. I told you the snag was a doozy.

The first half of the main UPC barcode is a publisher number, which corresponds to an ISBN prefix. You have to look it up in a table. Combine the prefix with the five-digit extension, tack on the checksum, and you have the full ISBN.
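Spelled out in Python, the combining step looks something like the sketch below. It assumes the publisher number is the first six digits of the twelve-digit UPC and that the table maps it to a four-digit ISBN prefix; both are assumptions about the layout, and real prefixes may not all be that length. The single table entry is made up so the example runs, and is not a real publisher mapping.

```python
def isbn10_check_digit(body):
    """Check character for the first nine digits of an ISBN-10."""
    total = sum((10 - i) * int(digit) for i, digit in enumerate(body))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def upc_to_isbn10(upc, extension, prefix_map):
    """Rebuild an ISBN from a 12-digit UPC plus its 5-digit extension.

    prefix_map maps a UPC publisher number to an ISBN prefix.
    Raises KeyError if the publisher number isn't in the table.
    """
    if len(upc) != 12 or len(extension) != 5:
        raise ValueError("expected a 12-digit UPC and a 5-digit extension")
    isbn_prefix = prefix_map[upc[:6]]  # assumption: first six digits
    body = isbn_prefix + extension
    if len(body) != 9:
        raise ValueError("prefix plus extension should come to nine digits")
    return body + isbn10_check_digit(body)


# Made-up mapping, for illustration only.
example_map = {"012345": "0123"}
print(upc_to_isbn10("012345678900", "45678", example_map))
```

The KeyError is the interesting case: it is exactly the "you have to look it up in a table" problem, surfacing as an exception.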

But you can't get such a table anywhere. I don't think Bowker sells one -- they're in charge of ISBNs, not UPC publisher numbers.

How to deal? The obvious suggestion (which wasn't obvious to me until Christopher Davis suggested it, thank you Christopher) is to use those books that have both kinds of barcodes. When you scan, scan both whenever possible. The clever scripts can then use that information to build up a table of correspondences.
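The double-scan trick boils down to a few lines. This sketch assumes the UPC's first six digits are the publisher number, that the five-digit extension is the tail end of the ISBN, and that whatever is left at the front of the ISBN is the prefix. The function name is hypothetical, and the barcodes are placeholder numbers, not scans of a real book.

```python
def deduce_prefix(upc, upc_extension, ean):
    """From a UPC scan and an EAN scan of the same book, work out one table entry.

    Returns (upc_prefix, isbn_prefix), or None if the two scans don't
    look like the same book.
    """
    if len(upc) != 12 or len(upc_extension) != 5:
        return None
    if len(ean) != 13 or not ean.startswith("978"):
        return None
    isbn_body = ean[3:12]  # nine ISBN digits, without any check digit
    if not isbn_body.endswith(upc_extension):
        return None  # extension doesn't match: probably different books
    return upc[:6], isbn_body[:4]


upc_map = {}
pair = deduce_prefix("012345678900", "45678", "9780123456789")
if pair is not None:
    upc_prefix, isbn_prefix = pair
    upc_map[upc_prefix] = isbn_prefix
print(upc_map)
```

The extension check is what makes it reasonably safe to pair up adjacent lines of a scan file: two unrelated books are unlikely to agree on those five digits.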

(Why does this silly EAN/UPC system exist? Basically, I'm told, because mass-market books (mostly paperbacks) are sold in mass-market outlets, like grocery stores and drugstores. Mass-market outlets often have ancient, creaky old scanners which only understand UPC codes.)

(In the distant future -- 2005, specifically -- all scanners will be smart, and publishers can start putting the EAN on every book, even mass-market paperbacks. Of course, everyone's collection will still be full of books without EANs. Life is hard.)

Barcode Scanners

Back to Google. A search for "barcode scanners" turned up several sources. The cheap one on the list -- well, a cheap one -- was Custom Sensors Inc.

(Okay, I don't actually see CSI on the Google result list now. What the hell, I got there somehow.)

CSI sells a couple of relevant toys. They have a pistol-grip point-and-zap scanner (CCD-8000), and a smaller wand scanner (MT-605). These cost about a hundred bucks each, give or take, depending on model. (They have more expensive models too, but I assume the typical book-scanner has spent all his money on books.)

(I went for the wand, on the theory that simpler is better. Also, a bit cheaper.)

One must mind the interfaces. Both of these products come in several forms. You can get them with an RS-232 connector, or a "keyboard wedge". The latter is a clever interface that plugs into the keyboard port of a computer, so that the barcodes you scan appear just as if you'd typed them on the keyboard. The keyboard wedge means that you don't need any data-capture software; just start up any text editor.

I actually wanted the thing to work on both my older Macintoshes, which use ADB connectors, and my PowerBook, which has USB. CSI sells a separate USB adaptor gizmo (MT-606). This converts one of the keyboard wedge interfaces to a USB connection. Thirty bucks. However, my older Macintoshes lose; the manufacturer no longer makes the wand with a Mac ADB interface.

The snag:

None, other than the obsolescence of ADB. The order was easy; I called them, named a model, paid by credit card, and it arrived within a week.

I did have to be careful to program the scanner correctly. (How do you program a bar-code scanner? Right! You point it at a special table of bar-codes! I love it.) I set it up to read EAN and UPC, always including the first and last digit, and optionally including the five-digit extension.

Querying for Titles

Presume that I've got a simple text file, containing UPC numbers, one per line. I need a program which will convert those to ISBNs, send them to Amazon via HTTP, and parse Amazon's HTML response page.

This is a job for Perl!

Well, I don't know Perl. Perl is fuggly. I've put off learning it this long; I have no great desire to wade in now. Someone (a different someone :-) suggested Python. Python is simple. The dumbass whitespace formatting is annoying, but not in a way that makes the language harder to learn. Python it was.

(Footnote: It strikes me that one could modify the Python compiler to ignore whitespace, and use -- for example -- ":" by itself as a block terminator. Since the compiler is part of the run-time environment, this may even be trivial. It would make a lot of people happier with the language, wouldn't it?)

I decided to split the task into its parts. The first script, upcfind.py, goes over a list of scanned codes (both EAN and UPC) and updates a master table called upc-map. This table, as described above, maps UPC prefixes to ISBN prefixes.

The second script, makeisbn.py, reads the same list of scanned codes; it spits out a list of ISBNs. (If it finds any ISBNs in the original list, it leaves them alone. Any line that looks entirely confusing stays in the list, but the script puts a "#" mark before it so that later programs can ignore it.) This script uses the upc-map table, of course. It's also smart enough to find instances where you scanned both UPC and EAN barcodes of the same book, and only spit out the ISBN once.
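To give a feel for the sorting-out this involves, here is a sketch of classifying one line of a scan file. It is not the real makeisbn.py. The formats are the ones described in this document (twelve digits for a UPC, thirteen starting with 978 for an EAN, each with an optional five-digit extension after a space; ten characters for a typed-in ISBN; XXXXXX=YYYY for a manual prefix mapping; # for comments). The function name and the labels it returns are invented for the example.

```python
import re

# Patterns for the line formats described in the text.
PATTERNS = [
    ("ean", re.compile(r"^978\d{10}( \d{5})?$")),
    ("upc", re.compile(r"^\d{12}( \d{5})?$")),
    ("isbn", re.compile(r"^\d{9}[\dXx]$")),
    ("mapping", re.compile(r"^\d{6}=\d{4}$")),
]


def classify(line):
    """Return (kind, data, comment) for one line of a scan file."""
    data, _, comment = line.partition("#")
    data = data.strip()
    comment = comment.strip()
    if not data:
        return ("comment" if comment else "blank", "", comment)
    for kind, pattern in PATTERNS:
        if pattern.match(data):
            return (kind, data, comment)
    return ("unknown", data, comment)


for sample in ["9780123456789 12345", "012345678900 12345",
               "0123456789  # typed by hand", "# no barcode at all", "hello"]:
    print(classify(sample))
```

Anything that comes back as "unknown" is the kind of line that would get a "#" put in front of it, so later stages can skip it without losing it.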

The third script, shelve.py, is the one that actually hits the Internet. It reads the list of ISBNs, and writes two output lists. The "out-err" file is a list of ISBNs that the databases (Amazon and Chapters) didn't manage to find. The "out-good" file has three lines for every ISBN successfully queried. The three lines are ISBN, author, and title.

The fourth script, collate.py, is only relevant if you want to turn the data files into a JFile database. (JFile is a shareware database app for the Palm.) The collate.py script just takes one or more data files, strips out blank lines and comments, sorts the data, and adds the one-line header which you need to convert a text file to a JFile PDB. (JFile comes with Windows tools to import data; I wrote jtrans, which does the same job on Unix.)
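A sketch of the collating step, reconstructed from the description above rather than taken from collate.py itself. It assumes each data line is author, tab, title; the exact header text is a guess, and the sample rows are placeholders.

```python
def collate(lines, header="Author\tTitle"):
    """Drop blank lines and # comments, sort what's left, and put a header on top.

    The header text is an assumption; the real import format may differ.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        entries.append(line)
    entries.sort(key=str.lower)  # case-insensitive sort on the whole line
    return [header] + entries


# Placeholder data standing in for two per-bookshelf data files.
shelf_one = ["Smith, Zed\tSecond Book\n", "# unidentified paperback\n", "\n"]
shelf_two = ["Adams, Ann\tFirst Book\n"]
for row in collate(shelf_one + shelf_two):
    print(row)
```

Because the author comes first on each line, a plain sort of the lines gives an author-ordered list for free.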

The guts of these scripts are straightforward. Python has library modules for reading lines of text, manipulating them, sending HTTP queries, and returning the results. A bit of regexp cleverness was needed to parse Amazon's HTML, but nothing painful.
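As an illustration of the kind of regexp involved, the sketch below pulls author names out of a page using the '/Author=' pattern that turns up in the update notes further down. The HTML fragment is invented for the example and is not a real Amazon response; the URL-decoding step is likewise an assumption about how the names were encoded.

```python
import re
from urllib.parse import unquote

AUTHOR_RE = re.compile('/Author=([^/"]*)')

# Invented fragment, shaped only loosely like a bookseller's result page.
sample_html = (
    '<a href="/exec/obidos/Author=Doe%2C%20Jane/">Jane Doe</a>, '
    '<a href="/exec/obidos/Author=Roe%2C%20Richard/">Richard Roe</a>'
)


def find_authors(html):
    """Return the decoded author names found in author links, in page order."""
    return [unquote(raw) for raw in AUTHOR_RE.findall(html)]


print(find_authors(sample_html))
```

Scraping like this is brittle by nature: the pattern depends entirely on the site's link format, so any redesign breaks it.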

(Of course, it's possible that Amazon will change its response page format. They may even remove the ISBN query URL entirely. I don't guarantee that these scripts will work. They worked for me, is all.)

The snag:

I won't even try to go into it. These scripts went through several revisions before I came up with this set of tools. They're still awkward, but I've been able to use them. Honest.

Afterword

Well, it's the end of a long weekend, and the database is in my Palm 3. What do I conclude about the experience?

Different publishers hyphenate ISBNs in all sorts of inconsistent ways. Ignore the punctuation, and look for ten digits. (And don't trust the older "SBN" too far -- sometimes you can get a valid ISBN by prepending a zero, but in general you can't.)
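In code, "ignore the punctuation, and look for ten digits" might look like the following sketch. It assumes hyphens and spaces are the only punctuation worth stripping, allows an optional leading "ISBN" label, and accepts X as the last character. It makes no attempt at the nine-digit SBN case. The function name is made up for the example.

```python
import re


def extract_isbn10(text):
    """Strip hyphens and spaces, then return the ten ISBN characters, or None."""
    squeezed = re.sub(r"[\s-]+", "", text).upper()
    if squeezed.startswith("ISBN"):
        squeezed = squeezed[4:].lstrip(":")
    if re.fullmatch(r"\d{9}[\dX]", squeezed):
        return squeezed
    return None


for printed in ["1-56592-286-7", "ISBN 0 8044 2957 X", "1565922867", "56592-286-7"]:
    print(printed, "->", extract_isbn10(printed))
```

The last sample is a nine-digit number, which comes back as None rather than being padded with a guess.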

Scanning books is a lot of work. Do not imagine that this project consists of a few minutes of beep-beep-beep, followed by four scripts and you're done. I have six well-filled bookcases (figure 250 books each), and scanning in all the books in a bookcase takes something like 45 minutes. Could be faster if you're lucky, but if you have any quantity of old or obscure books, you'll be typing a lot of comments by hand -- that eats time.

Then, after the scripts are run, you have to go back and fill in missing titles, fix typos and outright mistakes, and generally massage the data. Make sure authors' names and series titles are spelled consistently through the data -- that sort of thing. That's another 45 minutes per bookshelf.

So, overall, I probably blew eight or nine hours on this project -- not counting the programming time. Was this worthwhile?

Hell yes. I could probably type in the titles and authors of 250 books in less than ninety minutes... but it would be a two-person job: one to read off books, one to type the data in. (Try to do both, and it's a running battle whether you drop from exhaustion, turning from the computer to the bookshelf and back, or just petrify your eyeballs from the focussing strain.)

And the typing job would be awful and tedious, even by itself. And you'd have that editing-proofing-massaging stage to do anyway.

So this work certainly saved me folks-hours. It saved me effort; editing a generated list for mistakes is much easier than generating the list yourself, even if it takes nearly as much time. And, I shouldn't even need to say, the prospect of doing the job geekly got me to do it -- I would never have started if the only option had been dronely.

After afterword:

Skip the Penguin has sent in htmlmake.sh, a one-line shell script to mangle the output of collate.py into an HTML page. I've added it to the script package.

8/28/00: Dan Poirier reports that the Amazon web searcher has to be jiggered. In shelve.py, line 73, change re.compile('/Author=([^/"]*)') to re.compile('&field-author=([^/"]*)'). I haven't tried this myself.

8/29/00: Radio Shack is giving away free barcode scanners as part of some marketing program I don't understand. Skip the Penguin has put up a page about using the CueCat scanner for your own purposes (including cataloging books). Linux and Windows instructions included. Or see the Lineo page for another Linux CueCat driver.

Further updates go on the updates page, since appending them here in the middle doesn't make sense.

Skip the History, Make it Work

(Like I said, it isn't actually going to work. I am preserving this section for historic interest. --Z)

Actually, if you skipped the sections above, you're going to get confused. But I'll try to hit the highlights.

Buy a bar-code scanner.
- I got a Custom Sensors Inc. MT-605.
- The CCD-8000 may also work.
- If your machine uses a PS/2 keyboard, ask for that model of keyboard wedge. If your machine uses USB devices, get the MT-606 adaptor also.
- Radio Shack is giving away free CueCat scanners; see the After Afterword section for links on making that work.
Program the scanner.
- Use the following options:
  - Accept UPC-A and EAN-13.
    - I turned off the ability to accept most other barcodes, to reduce the possibility of confusion, but I don't think this is really necessary.
  - For both UPC-A and EAN-13, send both first digit and last (check) digit.
  - If you have an option to convert EAN-13 to ISBN, turn it off. (The makeisbn.py script handles this.)
  - Accept five-digit supplement codes. (Again, I turned off two-digit supplement codes, but this probably isn't necessary.)
  - Set recognition of supplement codes to be optional ("transmit if present"), not mandatory ("must be present"). This is important, because not all barcodes have the extension, and they're not necessary for EAN codes.
  - Put a space between the main barcode and the five-digit supplement.
- Now test the scanner. A recent mass-market paperback should have both UPC and EAN barcodes (back cover and inside front cover.)
  - The UPC code should come out with the format "012345678900 12345" -- twelve digits, a space, five digits.
  - The EAN code should come out with the format "9780123456789 12345" -- thirteen digits (beginning with 978), a space, five digits.
Grab the Python scripts.
- This tar file contains all four scripts, plus the upc-map file. You should download the tar file, rather than trying to download the four scripts individually, because Python's dumb whitespace formatting is susceptible to breakage.
Scan a lot of books.
- One bookshelf at a time worked best for me.
- Scan straight into your favorite text editor.
- The EAN (978-) barcode is more reliable than the UPC barcode.
- But it's best to scan both, where both are available.
  - Especially if you have books from many publishers.
  - Generally, only paperbacks will have both barcodes -- not all paperbacks either, just some.
  - Scan the codes consecutively. The scripts will get confused if the two barcodes for the same book are several lines apart.
  - It doesn't matter which code you scan first.
- Make sure you get the five-digit extension on all UPC barcodes. (The extension doesn't matter for EAN codes.)
- If you can't get a good scan, or if a book just doesn't have barcodes, you can type the ISBN into the scan file yourself. Ten digits, no spaces or hyphens.
- You can also put a comment on any line.
  - Type a hash sign (#) after any scanned number; you can put any text after the hash sign.
  - The comment will be preserved by the scripts you run later. That's why it's handy; you can remind yourself what book that line represents, in case the Internet lookup fails.
- If a book has no ISBN at all, use a comment on a line by itself.
  - Just enter a line that starts with a hash sign. Any text can follow the hash sign.
  - Later, after all the Internet lookup is finished, you can use the comment to add the book manually to your database.
- You can also add a line of the form XXXXXX=YYYY (no spaces or other characters). This indicates that the six-digit UPC prefix XXXXXX maps to the ISBN prefix YYYY.
  - You generally won't need to add lines like this -- upc-map comes with common values already in place, and upcfind.py can deduce more from double-scanned books.
  - But if makeisbn.py complains (see below), you may want to find the book in question and enter the data manually, using this sort of line.
Run upcfind.py on your scanned list (to deduce UPC prefix mappings).
- upcfind.py scanfilename
- Make sure the upc-map file is in the current directory, so upcfind.py can read and update it.
- The program will warn you of potential problems.
  - The most likely problem is: line L: UPC prefix X is already in the list as Y -- not Z
  - There's not much you can do about this. In some cases, you just can't reliably get the ISBN from the UPC.
  - If possible, figure out which book is referred to, and get an EAN scan from every book from the same publisher / series / whatever.
  - If not, at least try to add comments, so that you can type in the title and author manually later on.
Run makeisbn.py on your scanned list (to turn the barcodes into ISBNs).
- makeisbn.py scanfilename > isbnfilename
- Again, make sure the upc-map file is in the current directory.
- The output file will contain warnings of any problems.
  - UPC barcode requires five-digit extension -- indicates that you scanned a UPC code without the extension.
    - You'll have to re-scan it, or add the digits to that line by hand, or add a comment indicating what book it is.
  - Unrecognized format -- indicates that the line doesn't seem to be a UPC, an EAN, or an ISBN.
  - Unknown UPC prefix X -- indicates that the prefix wasn't found in the upc-map table.
    - You should add a comment.
    - If the book has both barcodes, you could also rescan the book, getting both codes. This would register the prefix in upc-map, thus solving the problem for every book from that publisher.
    - You could also add an XXXXXX=YYYY line to your file. See above.
Go back to your bookshelf, look over the output of makeisbn.py, and fill in any comments, barcodes, or XXXXXX=YYYY lines that are necessary.
- If you've added anything to your scan file, you should re-run upcfind.py, to re-update upc-map. Then re-run makeisbn.py, to make a more complete ISBN list.
- Rinse and repeat, until there are no more errors.
Run shelve.py on the ISBN list (which actually looks up the titles and authors).
- shelve.py isbnfilename > datafilename
- This will take a while; at least a couple of seconds per book, even on a fast Net connection.
- The output is a list of tab-separated lines -- author's name, followed by a single tab character, followed by the title.
- Comment lines start with a hash sign (#), and have no tab.
- The queries are summarized on the screen, in addition to being stored in datafilename. (More precisely, the progress is sent to standard error, and the final results are sent to standard output.)
Go back to your bookshelf yet again, and go over the data file.
- Replace all the comment lines with actual titles and authors. Be careful to use a single tab per line.
- Also, you probably want to edit the data that's already there. The Web databases often have typos, and the author lists include names you may want to delete -- illustrators' names, and astoundingly bad alternate spellings of the real author.
Go through the entire process again, with the next bookshelf.
- After the first bookshelf, I had most of the UPC mappings I needed. So on subsequent shelves, I didn't double-scan -- I only scanned the EAN code. For the common publishers, anyway.
Now you have a bunch of data files. Import them into whatever database you plan to keep them in.
If you want to transfer the data to a JFile database on a PalmOS PDA, do the following:
1. Run collate.py on all the data files you've created.
  - collate.py datafilename1 datafilename2 datafilename3 > dbfile
  - This simply concatenates the files, strips out any remaining comments, sorts the entries, and adds an "Author / Title" header line.
2. Run jtrans to convert dbfile to a PalmOS PDB file.
  - jtrans -e -o -n Books dbfile dbfile.pdb
  - If you have JFile Pro, leave out the "-o" flag.
3. Use your regular Palm tools to upload dbfile.pdb to your Palm.
If you want to transform the data to an HTML page, do the following:
1. Run collate.py and htmlmake.sh on all the data files you've created.
  - collate.py datafilename1 datafilename2 datafilename3 | htmlmake.sh > books.html
Check my Book-Scanning Updates Page to see if there are any important details I left out of this document.

Last updated October 2, 2000.

