"And Aholibamah bare Jeush, and Jaalam, and Korah: these were the borogoves..."
Except for the last word, this is from Genesis 36:5. Aholibamah was one of Esau's wives. However, Esau fathered no borogoves, neither mimsy nor erky ones. So what's up here?
Early in my college career, I wrote a little program to scramble text. It's a well-known algorithm. For example, it's available in Emacs as "Dissociated Press". But I didn't know that at the time, so I wrote my own. (It's probably faster in C than in elisp, anyhow.)
Here's the idea: scan through the text and create a huge frequency table of what words come after what other words. That is, say the words "Up the" occur six times in the text; four of those times they're followed by "tree", and twice by "ladder". We record this information in the table. And so on for every pair of words in the text.
Then, we generate output which is random, but has the same distribution of word groups. This is very easy given the table. Start with a seed, which is the first two words of the original text; for convenience, let's say this is "Up the". Then we roll a six-sided die with "tree" written on four sides and "ladder" written on two sides. Say it comes up "ladder"; print that word next. Now we shift one word ahead, and find "the ladder" in the table, and roll the dice again. Repeat until we have as many words as we're in the mood for.
To add to the effect, my program counts punctuation, spaces, and line breaks as part of a "word". So those elements are also reproduced in the output with the same frequency as in the original.
It's also possible to do the same trick with letter groups instead of word groups. The effect is more surreal; instead of switching tracks erratically in the middle of sentences, the text jumps around in the middle of words.
To use this program, compile it -- if you can -- and then type something like:
chan -n 5000 -d 2 < input.txt
In this case, it will produce 5000 words of output, checking two-word groups. (The explanation above describes two-word generation. If you type "-d 3", the program will find three-word groups, and so on. Greater depths make more sense, but they require more input text and take more time to process.)
The Mac executables are hardwired to 5000 words and two-word groups, since a control dialogue box would have been real work. These executables are just the original source code jammed into my Glk portable I/O library.
My signature is about 90% of the way down the Revelation of St. Alice. (It first appeared on Usenet on Jan 15, 1992 -- thanks Google!)
Recently (September 2000) someone suggested Robert Chambers' The King in Yellow and The Egyptian Book of the Dead. Therefore:
This is nuts, of course, but it seems to work. I wrote a simple test, and my program was able to determine that the two Alice books were much more similar to each other than they were to Hamlet. Damnably slow program, mind you. But this could have great applications for document search engines on the Web. Who knows?
Here is a bibliography of Research on N-Grams in Information Retrieval, compiled by Jim Mayfield. Poke through it if you want more information on this stuff.
Zarfhome (map) (down)