Dictionary files: are they all created equal? [updated]

Over the weekend, some friends and I got together to apply to play WarTron. You can read more about that in the previous blog post.

This particular blog post is about research into Unix dictionary files. It grew out of a postmortem conversation about the WarTron application puzzles, but is mostly tangential to WarTron itself, though there is one minor spoiler (two words in a cryptogram are revealed).

One of the puzzles we had to solve was an unusual sort of cryptogram. In case you do not know, a cryptogram is a message in which each letter of the alphabet is substituted by a different letter: A might be Q, B might be L, C might be R, and so on. There are a few strategies you typically use to solve these kinds of puzzles. You can look at letter frequencies: in large bodies of text, E is the most common letter, followed by T, A, O, I, N, etc. You can also look for patterns. Three-letter words are frequently something like “the” or “and,” and the “th” prefix of “the” may be found on other words (“this,” “that,” “there”), which can further confirm your guesses.
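
Counting letter frequencies is quick to script. Here is a minimal Ruby sketch (not something we used on the puzzle) that tallies whatever text you feed it:

#!/usr/bin/ruby
# Tally letter frequencies in text from stdin (or from files named as
# arguments), then print them most-common-first.
freq = Hash.new(0)
ARGF.read.downcase.scan(/[a-z]/) { |c| freq[c] += 1 }
freq.sort_by { |_, n| -n }.each { |letter, n| puts "#{letter}: #{n}" }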

With this puzzle, we did not have a particularly large corpus to work with, so those tricks were not terribly useful. What we did have were several words with letters repeated within them. For instance, we had something that looked a bit like:

X Y _ _ X Y 

I wrote a quick and dirty Ruby script to churn through the Unix dictionary file for potential matches and came up with about 40 possibilities. They included things like decade, decide, decode, delude, emblem, eraser, church, rewire, and (my favorite) tomato. We started down the path using “decode” because it felt right, but had some difficulty further along.

The really perplexing word looked something like this:

A B C C A _ _ B C _ 

With so many repeats, it felt like it should be an easy match. Some looser constraints (just the doubled-up “C”) turned up some good words like freehanded and summertime, but adding in the full set of constraints matched nothing.
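
For what it is worth, a generalized pattern matcher is not much more code. Here is a sketch written after the fact (not the script we used during the game), with “.” standing in for the blanks:

#!/usr/bin/ruby
# Match dictionary words against a letter pattern. Repeated pattern
# letters must map to the same character, distinct pattern letters must
# map to distinct characters, and "." is a wildcard.
def matches?(pattern, word)
  return false unless word.length == pattern.length
  mapping = {}
  pattern.chars.zip(word.chars).each do |p, w|
    next if p == "."
    return false if mapping.key?(p) && mapping[p] != w
    return false if !mapping.key?(p) && mapping.values.include?(w)
    mapping[p] = w
  end
  true
end

File.foreach("/usr/share/dict/words") do |line|
  word = line.chomp.downcase
  puts word if matches?("abcca..bc.", word)
end

(That is the pattern that came up empty for us.)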

We eventually did solve the puzzle (thanks, Ariock, for coming up with “passphrase”), but it left me wondering how comprehensive the Unix dictionary really is. Do different Unixes have different dictionaries? How do they compare?

I do know from previous “work” in playing around with Unix word lists that there are a lot of what I’d call “noise words” in there. These are uncommon, specialized words typically related to science or medicine that you just do not find in day-to-day life. If at all. For instance, aside from this blog, when was the last time you used or even read the word “xerodermatic?” Not at all, you say?

The problem with word lists is that you must strike a balance between being comprehensive and having too many obscure words. That balance is a sort of signal-to-noise ratio. If I mean to type “this” and accidentally type “thir,” will the spell-checker flag it? On the particular Unix variant I am using now, both are valid words in the dictionary file. When solving or creating puzzles, or simply playing with words, obscure matches just give you more crud to sift through.
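
(You can check your own system with a couple of lines; this assumes the usual /usr/share/dict/words location:)

#!/usr/bin/ruby
# Quick membership test: is "thir" really in the dictionary?
found = File.foreach("/usr/share/dict/words").any? { |line| line.chomp == "thir" }
puts found ? "in the dictionary" : "not found"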

The following tables are strictly quantitative comparisons.
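
(Counts like these are easy to regenerate. They amount to wc -cl on each file; a Ruby equivalent, assuming one word per line:)

#!/usr/bin/ruby
# Print "bytes, words, filename" for each word list named on the
# command line, assuming one word per line.
ARGV.each do |path|
  puts "#{File.size(path)} bytes, #{File.foreach(path).count} words, #{File.basename(path)}"
end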

Mac OS X (10.7, Lion)

This is basically the same as FreeBSD. Its dictionary file is web2, the word list from Webster’s Second New International Dictionary. Many distributions use versions of this file.

      706 bytes,     150 words, connectives
    8,546 bytes,   1,308 words, propernames
2,493,109 bytes, 235,886 words, web2
1,012,731 bytes,  76,205 words, web2a

The connectives file includes: the, of, and, to, a, in, that, is, was, he, etc.

The web2a file includes multi-word phrases and hyphenated words, e.g. able-minded, above-quoted, young lady, moth-balled, etc.

Ubuntu 10.04 (lucid)

      706 bytes,     150 words, connectives
   10,009 bytes,   1,516 words, propernames
2,486,824 bytes, 234,937 words, web2
1,012,730 bytes,  76,205 words, web2a

CentOS 5.5

4,950,996 bytes, 479,623 words, linux.words

Ubuntu 12.04

938,848 bytes, 99,171 words, american-english
938,969 bytes, 99,156 words, british-english

Solaris 10

206,663 bytes, 25,143 words, words

AIX (unknown version)

This dictionary is a list of word hashes (the hashed spelling lists used by spell). It’s great for quick lookups, but kind of useless for wordplay because you don’t have the actual words, just one-way hashes.

 53,096 bytes, ??? words, hlista (American)
 53,088 bytes, ??? words, hlistb (British)
  6,340 bytes, ??? words, hstop

CentOS 6.2

4,953,699 bytes, 479,829 words, linux.words

Scrabble Dictionary

SOWPODS is the word list used for international tournament play. TWL (the Official Tournament and Club Word List, used in North America) is essentially a subset of it.

2,707,018 bytes, 267,753 words, sowpods.txt
1,763,171 bytes, 178,693 words, twl.txt

Words With Friends Dictionary

I extracted this word list from the iPhone application resource bundle a couple of years ago. I know they sometimes add words (e.g. “zen”) but am not sure how often. At any rate, the count is at least in the right order of magnitude.

1,743,377 bytes, 172,853 words, wordswithfriends.txt

Results

Surprisingly, none of these contained “passphrase,” not even the web2a list of compound words.

As for a qualitative comparison, I am not sure how best to approach that task.

I’m inclined to lean toward the Scrabble dictionary, since it is larger than most of the others and yet is constrained to a reasonable set of words. The CentOS dictionary is intriguing and makes me wonder why it has over twice as many words as the other dictionaries. It might make a good fallback dictionary if the Scrabble one fails me.

I’ve noticed the Unix dictionaries have lots of 2- and 3-letter entries that are not real words. I think most are abbreviations (of state names, for instance). They end up being a lot of noise for certain applications, such as anagramming. For instance, the Scrabble list has 124 two-letter and 1,292 three-letter words. The full CentOS list has 1,272 and 6,221, respectively. That’s a lot of noise.
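
(Numbers like those are easy to generate with a quick tally; a sketch, assuming one word per line:)

#!/usr/bin/ruby
# Tally word lengths in a word list and report the short ones.
counts = Hash.new(0)
File.foreach(ARGV[0]) { |line| counts[line.chomp.length] += 1 }
puts "2-letter: #{counts[2]}, 3-letter: #{counts[3]}"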

As always, if you are doing any brute-force decrypting, you may want to supplement your dictionary with potential crib words, depending on your situation. For instance, with WarTron, we would add the name of the game (WarTron), a character’s first and last name (Professor Lisa Goto), and so on. In this particular puzzle it would not have helped. The only non-dictionary word (aside from the arguable “passphrase”) that appeared was a proper noun: a fellow’s last name that hadn’t entered into WarTron’s story world and was about as tangential to it as my dive into Unix dictionary files.
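
(Supplementing is just appending to the list and de-duplicating; a sketch, with example crib words for illustration:)

#!/usr/bin/ruby
# Merge puzzle-specific crib words into the system word list.
cribs = %w[wartron goto passphrase]   # example cribs, for illustration
words = File.foreach("/usr/share/dict/words").map { |line| line.chomp.downcase }
combined = (words + cribs).uniq.sort
puts combined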

Go forth and spell!

Update: a few people asked about the Ruby code. It’s laughably primitive, but looks a little something like this:

#!/usr/bin/ruby
# Find six-letter words matching the pattern X Y _ _ X Y.
File.foreach("/usr/share/dict/words") do |line|
    word = line.chomp
    next if word.length != 6
    # Positions 0 and 4 share a letter (X); positions 1 and 5 share another (Y).
    next unless word[0] == word[4] && word[1] == word[5]
    # X must not repeat anywhere else...
    next if [1, 2, 3].any? { |pos| word[0] == word[pos] }
    # ...and neither may Y.
    next if [2, 3].any? { |pos| word[1] == word[pos] }
    puts word
end


5 thoughts on “Dictionary files: are they all created equal? [updated]”

  1. For a list of good words, I recommend the data that powers http://nutrimatic.org/. You can play with its source code. Or you can grab a list of 100K “popular” English words as part of the Burninators’ “puzz” project: http://code.google.com/p/puzz/source/browse/trunk/data/nutrimatic-100K.txt. The words are sorted by “popularity” (roughly, the number of mentions in Wikipedia), so if you grab the first 50K or so, you have some nice common words.

    Another favorite word-and-phrase list is the Collaborative Word List project. It’s a long list of words and phrases, each with a score that roughly reflects clue-ability. It’s not free; you should buy a Cruciverb.com membership first. But it might be worth it 🙂 I blogged about it at http://lahosken.san-francisco.ca.us/new/2012/01/31/more-tyro-crossword-construction-ramblings-some/

  2. Yeah, frequency data is crucial, because it gets you away from the tension between “too big” and “too small”. You can use a word list as big as you like, but sort your output by score, and you get the good stuff first.

    Wikipedia frequency isn’t perfect, though. Certain things like obscure-to-us geographical names in faraway countries get many, many references. Still, it’s miles better than a flat list.

    At this point, if Nutrimatic can’t answer the question (it can’t answer this one — no backreferences), I use puzz. We have a handy “word” class which subclasses string but adds “.frequency” and such. So you can do:

    import words
    print sorted([w for w in words.all if …],
                 key=lambda w: -w.frequency)

    … to do your grepping. I recently added “.phonemes” as well, which is handy for pronunciation-based stuff.
