Trie-ing to optimize text search in F#
Any evil genius knows that if you want to stay on top you need to exercise your mind. Most evil geniuses stay strong by lifting weights with their telekinetic abilities, however I prefer video games. PopCap.com’s Bookworm is one mental exercise I am particularly fond of.
Bookworm is simple. You have 7x7 hexagonal grid of letters and your goal is to create words by forming chains of adjacent tiles. Once you pick the word, the tiles are remove from the board and new ones drop down to fill the void.
The following screen shot shows a five letter word being selected. Obviously it is harder to find longer words, and thus they are worth more points.
Eventually you start getting Green, Gold, and Diamond tiles worth bonus points. However, if you make too many short words, typically those with only three-letters, you get flaming tiles. Each round a flaming tile exists on the screen it will burn through the tile below it. Once a flaming tile reaches the bottom of the screen it’s game over, man.
Mental exercise is important to keep any evil genius, well, a genius. However, a true evil genius would try to cheat. And that’s exactly what I intend to do in this blog post: write a program to find the longest word on a given Bookworm board.
Hexgrid Data Structure
In order to cheat at any game, first you need to write code to represent the game state. Which in the case of Bookworm is the hexagonal grid. Creating our grid data structure isn’t going to be super interesting, but bear with me as it is required in order to do word searches later.
We’ll start by creating an interface to represent each tile on the board, which has a letter, set of neighbors, and position. Then we create a class for the actual letter grid.
/// Representing a tile on the board
[<AllowNullLiteral>]
type ITile =
abstract Letter : char
abstract Neighbors : seq<ITile>
abstract Location : int * int
/// Data structure for a hex grid of letters
type LetterGrid(tileColumns : string[]) =
// Store the grid of letters in an array. Note that it is jagged. In
// the flash version of Bookworm even columns have 8-tiles instead of 7.
let mutable m_columns : char[][] = null
// Cache of ITiles corresponding to tiles on the grid
let mutable m_tileCache = Map.empty
Armed with these simple types, we can go about populating our LetterGrid class’ ITile cache. The following F# snippet shows using F# object expressions to create new instances of ITile. Note the recursive calls to getTile and then filtering out any tiles which are off the board. We can represent off-board tiles by simply having getTile return null. However, this requires some additional work.
By default any types defined in F# cannot be null, even interfaces like ITile. However, we can override this behavior by simply adding the [<AllowNullLiteral>] attribute.
/// Get an instance of an ITile from the given index. (Generating it if necessary.)
let rec getTile colIdx rowIdx =
if not <| isValidLocation colIdx rowIdx then
null
else
if Map.containsKey (colIdx, rowIdx) m_tileCache then
m_tileCache.[(colIdx, rowIdx)]
else
let newTile =
{
new ITile with
member this.Letter = getTileLetter colIdx rowIdx
member this.Neighbors =
let neighbors =
if (colIdx + 1) % 2 = 1 then
// Odd column
seq {
yield getTile (colIdx - 1) (rowIdx + 0) // NW
yield getTile (colIdx + 0) (rowIdx - 1) // N
yield getTile (colIdx + 1) (rowIdx + 0) // NE
yield getTile (colIdx + 1) (rowIdx + 1) // SE
yield getTile (colIdx + 0) (rowIdx + 1) // S
yield getTile (colIdx - 1) (rowIdx + 1) // SW
}
else
// Even column
seq {
yield getTile (colIdx - 1) (rowIdx - 1) // NW
yield getTile (colIdx + 0) (rowIdx - 1) // N
yield getTile (colIdx + 1) (rowIdx - 1) // NE
yield getTile (colIdx + 1) (rowIdx + 0) // SE
yield getTile (colIdx + 0) (rowIdx + 1) // S
yield getTile (colIdx - 1) (rowIdx + 0) // SW
}
// Filter out null (neighbors off the board)
neighbors |> Seq.filter ( (<>) null )
member this.Location = (colIdx, rowIdx)
}
m_tileCache <- Map.add (colIdx, rowIdx) newTile m_tileCache
newTile
Crappy Word Detection
Great! Now that we have a way to represent the hex grid, all we need to do is start searching for words. First, we need a dictionary. Floating around on the internet(s) you can find a copy of the Scrabble Tournament Word List (TWL). The TWL contains all legal Scrabble words, however Bookworm only honors a subset of them. (For example, most curse words are not recognized.)
Once we have a word list, perhaps represented as a Set<string> we just need to start trying out all combinations of word chains and seeing if they are valid words, right? Wrong! That is a terrible idea and here is why:
Bookworm accepts words between 3 and 12-letters long. The board starts with 52 tiles, and each tile has six neighbors. Ignoring the fact that a tile cannot be used twice in the same word chain, we get:
which WolframAlpha tells me is 22,638,535,920, and that is just word chains! Once we are done permuting every single word chain we then have to look it up in our dictionary. Which would be O(log n), where n is the number of words. O(log n) is quite fast, but multiplied by 22 billion is still way too big.
What we need is a way to efficiently prune our search space. If you ever have a word prefixed with “AKVWOP”, and you know it isn’t a word, then it doesn’t make sense to continue looking in the vague hope of finding “AKVWOPS”, “AWKVWOPED”, “AWKVWOPING”, etc.
The Trie Data Structure
Fortunately, there is a classic data structure for doing this called a trie. (Pronounced ‘try’, as in I was trie-ing to be clever when I titled this post.) It is also referred to as a prefix tree. The data structure can be explained quite well with the following picture:
You start with the root node, and the edge to any of its children requires a letter. For example ‘t’. The child of the root node associated with ‘t’, has a child using the letter ‘o’ and another child associated with the letter ‘e’. Since there is no child associated with the letter ‘q’, you can infer that there is no leaf nodes prefix with the string “tq”.
If we constructed a trie for the entire contents of the dictionary, we would then have an efficient way to determine if a given substring prefixes any valid word. (By simply checking if such a node exists in the tree or not.)
The code for implementing this in F# is straightforward. Each trie node stores a dictionary potentially mapping a letter to a child node, or the it is at a terminal node. Meaning that the prefix is a valid word, and there are no further words with that prefix.
type TrieNode =
// Letters go to other nodes, flag if it is a word node
| Node of IDictionary<char, TrieNode> * bool
// You are at the end of a legal word
| Eow
member this.IsLegalWord =
match this with
| Node(_, true)
| Eow -> true
| _ -> false
To generate our tree, we use the following function which takes a sequence of words, represented as F# lists of characters, and converts it into a TrieNode. I’ve added a glut of comments to help explain things. The only bit of black magic is the use of the dict function.
dict works by taking a seq<’k, ‘v> and produces an IDictionary<’k, ‘v>.
/// seq<char list> -> TrieNode
let rec generateTrieNode words =
// Does this represent an EOW node for anything?
let hasEowNode = Seq.exists (fun word -> word = []) words
if Seq.isEmpty words then
Eow
else
(* Now group all words by their first letter.
[ "cat"; "cats"; "dogs" ]
=>
seq {
('c', seq { "at"; "ats" })
('d', seq { "ogs" })
}
*)
let nextWordFragments =
// [""; "a"; "act"; "acts"; "actor"; ... ]
words
// ["a"; "act"; "acts"; "actor"; ... ]
|> Seq.filter (fun word -> word <> [])
// seq { ("a", seq { "act"; "acts"; "actor"; ... }); ... }
|> Seq.groupBy (fun word -> Seq.head word)
// seq { ("a", seq { "ct"; "cts"; "ctor"; ... }); ... }
|> Seq.map (fun (firstLetter, words) -> (firstLetter, words |> Seq.map List.tail))
// seq { ("a", seq ("c", seq { ... } ... }
|> Seq.map(fun (firstLetter, rest) -> (firstLetter, generateTrieNode rest))
// seq { ("a", seq ...); ("A", seq ...) ... }
|> Seq.fold
(fun acc (letter, trieNode) ->
(Char.ToUpper(letter), trieNode) :: (Char.ToLower(letter), trieNode) :: acc
)
[]
Node(dict nextWordFragments, hasEowNode)
Once you have the trie tree in memory, you can check if a string is a valid word or not by simply walking down the tree.
let isWord word tree =
let letters = word |> Seq.toList
let rec isWord currentNode letters =
match letters, currentNode with
// When you are out of letters, check if you are on
// an end-of-word node
| [], Eow
| [], Node(_, true)
-> true
// Otherwise, walk down the tree...
| nextLetter :: rest, Node(childNodes, _)
-> let hasNode, childNode = childNodes.TryGetValue(nextLetter)
if hasNode then
isWord childNode rest
else
false
| _ -> false
isWord tree letters
Searching
Searching through potential word chains can be parallelized trivially, but I’ll save that for a future blog post. For now we will use the following approach:
- Start with each tile on the board and put them into a stack of ‘search states’
- While the search stack is not empty do
- Pop the search stack
- Check that state to see if there are any legal words from that position (adjacent tiles leading to valid word prefixes)
- Push any legal, subsequent steps to our search stack
Here is the F# code. Note that each state is represented by TrieNode * ITile list. This represents the current word prefix node in the trie tree and the list of tiles forming the word chain. We then seed that search stack with word chains starting at every tile in the board.
let statesToExplore = new Stack<TrieNode * ITile list>()
// Start with searching all tiles
gameGrid.AllTiles
|> Seq.iter (fun tile ->
let firstStepDownTrieTree =
match trieRoot with
| Node(nextSteps, _) -> nextSteps.[tile.Letter]
| _ -> failwith "Shouldn't get here"
statesToExplore.Push(firstStepDownTrieTree, [tile]))
// Now explore the entire search space
while statesToExplore.Count > 0 do
let trieNode, prevLetters = statesToExplore.Pop()
match searchStep trieNode prevLetters with
| None -> ()
| Some(nextSearchStates) ->
// Add possible states to explore to our stack
Seq.iter (statesToExplore.Push) nextSearchStates
All we need is a function to take some given search state – current node in the trie tree and tiles used so far – and then check all of the neighboring tiles if they can continue down the trie tree towards valid words.
/// Given a spot in the Trie tree, a game tile, and a list of previously used tiles used
/// to form a word, return Some(seq< all possible steps>). If no legal words are possible
/// from this state, return None.
let searchStep (nodeInTrieTree : TrieNode) (tilesInChain : ITile list) =
// Are we at a legal word in the Trie tree?
if nodeInTrieTree.IsLegalWord then
foundWords.Add(getWordText tilesInChain)
let currentTile = List.head tilesInChain
match nodeInTrieTree with
// Cannot make any valid words from this point, no purpose in stepping
| Eow -> None
// If you can form legal words beond where we are at in the tree, continue
// searching.
| Node(possibleLetterSteps, _)
-> let possibleSteps =
currentTile.Neighbors
// Filter out neighbors already used for the word chain
|> Seq.filter
(fun neighborTile ->
not (List.exists
(fun (prevTile : ITile) -> prevTile.Location = neighborTile.Location)
tilesInChain
)
)
// Filter out neighbors which couldn't lead to avalid word
|> Seq.filter
(fun neighborTile -> possibleLetterSteps.ContainsKey(neighborTile.Letter))
// Now map them to the next 'states' to search
|> Seq.map
(fun neighborTile ->
( // Node in trie tree
possibleLetterSteps.[neighborTile.Letter],
// Tiles used in the word chain
neighborTile :: tilesInChain
)
)
if not (Seq.isEmpty possibleSteps) then
Some(possibleSteps)
else
None
And that’s it! We can use this efficient word search cheat to our heart’s content. Attached to this post is the source code.
I used BookwormBot9000 to spot this gem on my iPhone version of the game. But that begs the question what the hell is a madrilene?
Disclaimer
All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. So in other words, if cheating at word games leads your brain to atrophy and turn to mush, don’t blame me or Microsoft.