
Doug McIlroy's Wordlists

Date: Wed, 1 Jan 2003 15:28:24 -0500
From: Doug McIlroy <doug@cs.dartmouth.edu>
To: webmaster@puzzlers.org
Subject: origins of some word lists

I was interested to find three Unix files that I made back in the 1970s alive and well at puzzlers.org: web2, web2a, and unixdict.

The tape from which web2 and web2a were made is often attributed to the Air Force. I seem to have lost documentation about the tape, but I do have a record of a work product of a project that used (and likely made) it: "Advanced character recognition techniques study, report No. 4, Appendix D. Tetragrams legal with respect to Webster's International Dictionary (unabridged second edition)." U.S. Army Electronic Research and Development Laboratory, Fort Monmouth and RCA Data System Center, Bethesda. Contract No. DA 36-039-AMC-00112(E), (1963). This document was a sizable booklet, as more than 60,000 tetragrams occur in the dictionary.

The tape contained word lists of several specialized dictionaries in addition to the Unabridged. I don't recall whether the tetragram list was included. There was also trivially derivable information such as one wouldn't think of recording today. To the best of my recollection, each word was accompanied by its reversal and by its pattern word (first letter transliterated as A, next distinct letter as B, and so on).
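As an illustration of the pattern word described above, here is a minimal Python sketch. It assumes words consist only of letters with at most 26 distinct ones; the function names pattern_word and tape_record are hypothetical and are not taken from the tape or its documentation.

```python
def pattern_word(word: str) -> str:
    """First letter becomes A, the next distinct letter B, and so on."""
    mapping = {}  # letter -> pattern letter, in order of first appearance
    out = []
    for ch in word.lower():
        if ch not in mapping:
            mapping[ch] = chr(ord("A") + len(mapping))
        out.append(mapping[ch])
    return "".join(out)


def tape_record(word: str) -> tuple:
    """Hypothetical record layout: the word, its reversal, its pattern word."""
    return word, word[::-1], pattern_word(word)


if __name__ == "__main__":
    print(tape_record("letter"))
```

Words that share a pattern word, such as "banana" and any other word of the shape ABCBCB, are candidates for one another in a simple substitution cipher, which is presumably why the pattern was worth storing.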

Web2 is known to have many lacunae, and a few mistakes. I used to correct errors that came to my attention, so there are small differences among versions of the file--but no version control to identify them.

Web2 filled a whole removable disk on the original PDP11 Unix system at Bell Labs. It saw a lot of recreational uses, one of the earliest being Dennis Ritchie's creation of a definitive list of anagrams, in a heroic feat of file juggling.
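For comparison, an anagram list is easy to build today when the whole word list fits in memory. The sketch below is not Ritchie's method, which had to work through files; it simply groups words by their sorted letters. The function name anagram_classes is an assumption made for illustration.

```python
from collections import defaultdict


def anagram_classes(words):
    """Group words that are anagrams of one another.

    Two words are anagrams exactly when their letters, sorted, are equal,
    so the sorted letters serve as a key for each class.
    """
    groups = defaultdict(list)
    for w in sorted(set(words)):  # drop duplicates, fix the order
        key = "".join(sorted(w.lower()))
        groups[key].append(w)
    # Keep only classes with at least two members.
    return sorted(g for g in groups.values() if len(g) > 1)


if __name__ == "__main__":
    sample = ["listen", "silent", "enlist", "google", "tinsel", "act", "cat"]
    for group in anagram_classes(sample):
        print(" ".join(group))
```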

I made Unixdict, generally known as /usr/dict/words, in about 1978 for the spelling checker, "spell", as recounted in M. D. McIlroy, Development of a spelling list, IEEE Trans. on Communications 30 (1982) 91-99. We had discovered that Webster's Collegiate was a poor authority for a spelling checker: it contained a lot of irrelevant rarities, and lacked many very common words, especially proper names. Webster's unabridged was even worse, being infested with esoterica, especially unknown short words that could easily arise as typing errors. Moreover, it was desirable to fit the dictionary in main memory, so as to minimize file processing.

The core of the Unix spelling list was Webster's 7th Collegiate intersected with the Brown Corpus (H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English, Brown U. Press, 1967), the former attesting authority and the latter currency. The list was aggressively pruned by eliminating forms derivable by prefixing, suffixing, or inflection. To the result were added the most common names in a large phone directory, as well as words that were caught during real runs of "spell". In compressed form the list fit in the 64K address space of a PDP11. (That kind of shoehorning, of course, is no longer necessary.)
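The intersect-then-prune idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the procedure of the 1982 paper: the suffix list is hypothetical and far cruder than real affix rules, prefixes are ignored, and the names core_list and prune_derivable are invented for the example.

```python
# Hypothetical suffix list, for illustration only.
SUFFIXES = ("s", "ed", "ing", "ly")


def core_list(dictionary_words, corpus_words):
    """Words attested by both the dictionary (authority) and the corpus (currency)."""
    return set(dictionary_words) & set(corpus_words)


def prune_derivable(words, suffixes=SUFFIXES):
    """Drop any word that is another listed word plus one of the suffixes."""
    words = set(words)
    kept = set()
    for w in words:
        derivable = any(
            w.endswith(s) and len(w) > len(s) and w[: -len(s)] in words
            for s in suffixes
        )
        if not derivable:
            kept.add(w)
    return kept


if __name__ == "__main__":
    dictionary = {"walk", "walked", "walking", "quick", "quickly", "zyzzyva", "the"}
    corpus = {"walk", "walked", "walking", "quick", "quickly", "the", "Bell"}
    print(sorted(prune_derivable(core_list(dictionary, corpus))))
```

A checker built on such a pruned list has to apply the same affix rules in reverse when looking a word up, which trades a little computation for a much smaller list.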

On the original Bell Labs Unix system, /usr/dict/words always contained Webster's Collegiate. Copyright restrictions, however, prevented us from exporting that file. Since the spelling list came anyway as part of the system source, it was substituted for Webster's Collegiate in the export version of Unix.

Doug McIlroy