I have been given a coding problem to solve recently, which is a classic Hangman game. The basic principle is, given the number of letters and a fixed number of attempts for a word, only one letter could be guessed in one attempt. If the guessed letter does exist in the word, it will be placed in all the correct positions of the word. Otherwise, the number of wrong guesses accumulates, until it reaches the limit. If you get all the letters right before the attempts run out, you win(at this exact word).

My initial response was that I should start with letters of the highest frequency in English words. This was what I found, so I started with E, T, A, O. Those letters did have a high frequency of appearing. But after E, T, A, O, none of those letters was a right guess. That’s when I looked up on mighty stackoverflow to find this answer by Tom Medley, which was a life saver. Tom’s idea was to construct a regular expression to match possible words from a dictionary, based on previous guesses. For example, we guessed E, T, A for a five-letter word, and E exists in the word such as *e***, so the possible words in the dictionary are words with pattern [^eta]e[^eta][^eta][^eta], which means five-letter words that the first and third thru fifth letters are not e, t or a, but the second letter is e. From there, we will get a list of possible words. And by counting all the letter’s occurrences in those words, we will find the letter with highest possibility of appearance, which will be the next best guess.

This method significantly improved my results. But that’s when I found the longer the word is, the more difficult for the regular expression to find a possible match. When the word is 10+ letters long, it was close to impossible. The dictionary comes with UNIX system(under /usr/share/dict/) has 200,000+ words, which is actually a pretty huge words list, but “decentralisation” doesn’t exist in it and the dictionary instead has “decentralization”. I did also notice a lot longer words end with -ess or -ly and -ation were not recognized. So apparently a larger dictionary is needed. That’s when I started to look up stemmers, snowball for example. That was really a pain. Right before I committed a suicide, I found SCOWL, which would allow me to generate a words list with spelling variations, British, American, Canadian. You name it. Both variations “decentralisation” and “decentralization” from the above case were in the dictionary. That gave me a words list of 700,000+ words. From there, not a single word with 8+ letters was missed.

Now the game looks much more fun. One last problem to solve, is words with 4 or 5 letters are now easier to missed than longer words. That definitely has something to do with the first a few letters to guess - and I used E,T,A,O for all possible length of words. Guessing vows at first actually makes sense, as at any length, very few words has no vows with them. However, for different lengths of words, the vows most likely to appear might be different. That’s when I found this article on lifehacker, which comes with a table for popularity of letters in dictionary words grouped by the length of those words. And even better, a table of Optimal calling order for different length of words. The orders for 4 and 5 letter words were nevertheless not that optimal, so I only kept the first three letters for these two lengths. Combined with regular expression, it produced the best result I was able to manage(in a reasonale period of time).

Note: the strategy mentioned might only work for “real” words, meaning words could be found in dictionaries. The size and nature of the dictionary used to generate the words for guessing in the game might also have an impact. A good way to know is to play around with it, get a few words, especially the spelling variations, there you would know how difficult the dictionary is. Then improve your strategy from there. That applies to any real problems(binary search tree problems are generally not among them), you start with a crappy strategy, but you understand more about the problem you are solving, then you go back and improve your strategy. These are just like “iterations” in product development.