Fun With Markov Chains

I am often asked about my message signature, which has been appearing at the bottom of email and Usenet postings for years now:

"And Aholibamah bare Jeush, and Jaalam, and Korah: these were the borogoves..."

Except for the last word, this is from Genesis 36:5. Aholibamah was one of Esau's wives. However, Esau fathered no borogoves, neither mimsy nor erky ones. So what's up here?

Early in my college career, I wrote a little program to scramble text. It's a well-known algorithm. For example, it's available in Emacs as "Dissociated Press". But I didn't know that at the time, so I wrote my own. (It's probably faster in C than in elisp, anyhow.)

The Program

(You'll note that it's a bit clunky and definitely pre-ANSI. Sorry about that. Early effort. And the Mac versions are quick-and-dirty, too.)

Here's the idea: scan through the text and create a huge frequency table of what words come after what other words. That is, say the words "Up the" occur six times in the text; four of those times they're followed by "tree", and twice by "ladder". We record this information in the table. And so on for every pair of words in the text.

Then, we generate output which is random, but has the same distribution of word groups. This is very easy given the table. Start with a seed, which is the first two words of the original text; for convenience, let's say this is "Up the". Then we roll a six-sided die with "tree" written on four sides and "ladder" written on two sides. Say it comes up "ladder"; print that word next. Now we shift one word ahead, and find "the ladder" in the table, and roll the die again. Repeat until we have as many words as we're in the mood for.

To add to the effect, my program counts punctuation, spaces, and line breaks as part of a "word". So those elements are also reproduced in the output with the same frequency as in the original.

It's also possible to do the same trick with letter groups instead of word groups. The effect is more surreal; instead of switching tracks erratically in the middle of sentences, the text jumps around in the middle of words.

To use this program, compile it -- if you can -- and then type something like:

chan -n 5000 -d 2 < input.txt

In this case, it will produce 5000 words of output, checking two-word groups. (The explanation above describes two-word generation. If you type "-d 3", the program will find three-word groups, and so on. Greater depths produce output that makes more sense, but they require more input text and take more time to process.)

The Mac executables are hardwired to 5000 words and two-word groups, since a control dialogue box would have been real work. These executables are just the original source code jammed into my Glk portable I/O library.

What I Did With It

Anyway, around spring break of 1991, I was inspired by the brand-new NeXT machines down the hall at CMU. See, they had a number of public-domain texts installed on them -- including the Bible, Alice in Wonderland, and a pile of Shakespeare. So I concatenated together a few things, fed them into chan.c, and asked for 5000 words of output. (With "-d 3", if I recall correctly.)

These files are entirely unedited; this is exactly how they came out of the program. Somehow they manage to make a weird sort of sense, much more often than you'd expect. And they nearly always switch into or out of Alice mode at just the right time.

My signature is about 90% of the way down the Revelation of St. Alice. (It first appeared on Usenet on Jan 15, 1992 -- thanks Google!)

Recently (September 2000) someone suggested Robert Chambers' The King in Yellow and The Egyptian Book of the Dead. Therefore:


Ten years later, I learned that there's a use for this nonsense. Apparently people are researching schemes to categorize texts by their letter-frequency tables. That is, you create frequency tables for several texts, and then compare the tables to determine which texts are most similar to each other.

This is nuts, of course, but it seems to work. I wrote a simple test, and my program was able to determine that the two Alice books were much more similar to each other than they were to Hamlet. Damnably slow program, mind you. But this could have great applications for document search engines on the Web. Who knows?

Here is a bibliography of Research on N-Grams in Information Retrieval, compiled by Jim Mayfield. Poke through it if you want more information on this stuff.

Last updated September 25, 2000.
