While digging through some of my old manuscripts (intending to publish some otherwise useless articles of fiction here) I discovered two nearly identical revisions of a story. From a cursory reading I was unable to tell which one was written first, so I ran the two files through the UNIX “diff” program, which, as you might expect, spits out the differences between the files. Due to the odd format in which the files were saved, the result was striking. I loaded the diff into a text editor, deleted most of the control characters and over half of the lines (keeping most of them intact), to give my very first cut-up.
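(Incidentally, if you wanted to reproduce the accident deliberately, a few lines of Python will do it. This is only a sketch of the procedure described above; the filenames are placeholders, and the "keep about half the lines" step is just a coin flip per line.)

```python
# A rough reconstruction of the accidental cut-up: diff two revisions
# of a text, keep only the changed lines, then randomly discard about
# half of them. The filenames are placeholders, not the actual drafts.
import difflib
import random

with open("draft1.txt") as f1, open("draft2.txt") as f2:
    a, b = f1.readlines(), f2.readlines()

# ndiff marks lines with '- ', '+ ', '  ', or '? '; keep only the
# lines that differ between the two revisions.
changed = [line[2:] for line in difflib.ndiff(a, b)
           if line.startswith(("- ", "+ "))]

# Randomly discard roughly half of what's left.
cutup = [line for line in changed if random.random() < 0.5]
print("".join(cutup), end="")
```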
W.S. Burroughs (1914-1997), of course, was the great popularizer of the Cut-Up method (he credited its invention to Brion Gysin), which in turn has proved far more fascinating to computer scientists than to littérateurs. This is hugely ironic, because Burroughs’s grandfather, William Seward Burroughs (1857-1898), was the inventor of the adding machine and the founder of the Burroughs Corporation, which, like IBM, eventually moved into computers.
Burroughs was interested in Cut-Up because he thought he could divine implicit meanings from his writings; he wanted to access his subconscious, or perhaps even the soul of the universe. Computer geeks are interested in Cut-Up because the ability to generate new text algorithmically looks a lot like artificial intelligence. Of course, with Cut-Up the new text still derives most of its meaning and structure from the original, but the principle can be refined to produce text that is even further removed from the training sample (the text or texts used as input to the algorithm). Some examples:
- dadadodo, by Jamie Zawinski. Uses Markov chains to produce a probabilistic description of language. Dadadodo analyzes text and generates rules that look like “‘chains’ follows ‘Markov’ with a probability of 0.9”. When new text is generated, if the last word chosen was “Markov”, dadadodo will probably follow it with “chains”. (A sketch of the technique appears after this list.)
- the Chomskybot, by John Lawler. Somewhat less sophisticated than dadadodo, the Chomskybot constructs sentences from a dictionary of initiating, subject, verbal, and terminating phrases (chosen from Chomsky’s writings, I believe). The effect, however, is actually more believable than dadadodo’s, because each phrase is grammatically correct. (A sketch of this template approach also follows the list.)
- and the huge collection of “chatterbots”, of which the most famous is Eliza. Most of these are based on keyword identification: the appearance of a keyword in the input text (which in this case is whatever a user types into the program, as if conversing via IM) causes the program to generate an appropriate response. Some chatterbots, however, are capable of generating their own rules. Apparently a variant of Alice, a more sophisticated cousin of Eliza, is being used to troll for pedophiles on chat boards. (A keyword-matching sketch follows the list as well.)
- and this collection of computer-generated writing, which I don’t really have time to comment on at present.
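To make the dadadodo entry concrete, here is a minimal word-bigram Markov chain in Python. This is a sketch of the general technique, not dadadodo’s actual code, and the training filename is a placeholder:

```python
# A minimal word-bigram Markov chain, in the spirit of dadadodo
# (a sketch of the technique, not dadadodo's implementation).
import random
from collections import defaultdict

def train(text):
    """Record, for each word, every word that followed it.
    Repeats in the list encode the follow probabilities."""
    chains = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        chains[prev].append(nxt)
    return chains

def generate(chains, length=20):
    """Walk the chain: choose each next word in proportion to how
    often it followed the current word in the training sample."""
    word = random.choice(list(chains))
    out = [word]
    for _ in range(length - 1):
        followers = chains.get(word)
        if not followers:                     # dead end: restart at random
            word = random.choice(list(chains))
        else:
            word = random.choice(followers)   # frequency-weighted choice
        out.append(word)
    return " ".join(out)

# "training_sample.txt" is a placeholder for whatever text you train on.
chains = train(open("training_sample.txt").read())
print(generate(chains))
```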
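The Chomskybot’s phrase-template approach is even simpler to sketch. The four phrase lists below are invented stand-ins, not Lawler’s actual dictionaries:

```python
# Phrase-template generation in the manner of the Chomskybot.
# The phrase lists are invented stand-ins, not the real dictionaries.
import random

initiating  = ["It may be, then, that", "To characterize a linguistic level L,"]
subjects    = ["the notion of level of grammaticalness",
               "a descriptively adequate grammar"]
verbals     = ["cannot be arbitrary in", "delimits"]
terminating = ["the strong generative capacity of the theory.",
               "a corpus of utterance tokens."]

def sentence():
    # One phrase from each slot. Every phrase is grammatical on its
    # own, which is why the output reads more fluently than a
    # word-level Markov chain's.
    return " ".join(random.choice(slot) for slot in
                    (initiating, subjects, verbals, terminating))

print(sentence())
```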
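And the keyword trick behind Eliza and most chatterbots fits in a dozen lines. The rules here are toy examples, not Eliza’s actual script:

```python
# Keyword-driven response selection, the mechanism behind Eliza and
# most simple chatterbots. These rules are toy examples.
import random

rules = {
    "mother":   ["Tell me more about your family.",
                 "How do you feel about your mother?"],
    "computer": ["Do machines worry you?"],
    "dream":    ["What does that dream suggest to you?"],
}
fallback = ["Please go on.", "I see.", "How does that make you feel?"]

def respond(line):
    # The first keyword found in the input picks a canned response;
    # with no match, fall back to a vague, conversation-sustaining prompt.
    for keyword, responses in rules.items():
        if keyword in line.lower():
            return random.choice(responses)
    return random.choice(fallback)

print(respond("I had a dream about my mother"))
```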
Now, what all these programs have in common is that they are some mixture of probabilistic rules and deterministic rules. The more probabilistic the program, the more flexible it is; dadadodo is almost entirely probabilistic, and it doesn’t care what language you use to train it, or whether it’s a language at all. Deterministic rules can be based on grammar (e.g., sentences must contain at least one subject and one verb) or on fixed sets of words (as in the Chomskybot). They tend to produce more grammatically correct text, but at the expense of flexibility.
Anyway, my thought is that perhaps the problem of text generation is still a little too hard. I haven’t seen much advancement since dadadodo came out. Perhaps an easier problem is to get an AI to learn how to revise a text; instead of training it on static pieces of text, use both the text and its subsequent revisions. It might be a faster way to learn grammar, and more similar to human language acquisition, which almost certainly involves taking into account the corrections others make to our mistakes.
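As a rough illustration of what the training data might look like (this is only a sketch of the data-preparation step, not a working learner, and the filenames are placeholders): difflib can already align a draft with its revision into before/after pairs.

```python
# Align a draft with its revision into (original, corrected) pairs
# that a model could, in principle, train on. Filenames are placeholders.
import difflib

draft    = open("draft.txt").readlines()
revision = open("revision.txt").readlines()

matcher = difflib.SequenceMatcher(None, draft, revision)
pairs = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "replace":   # a passage the author rewrote
        pairs.append(("".join(draft[i1:i2]), "".join(revision[j1:j2])))

for before, after in pairs:
    print("BEFORE:", before.strip())
    print("AFTER: ", after.strip())
```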
last modified: 2004-07-14 23:20:22 -0400