Spell autocorrect with edit distance – Code

Google provides a powerful spell correct for validating the keywords we type into the input text box. It checks against a dictionary. If it doesn’t find the keyword in the dictionary, it suggests a most likely replacement. To do this it associates with every word in the dictionary a frequency, the percent of the time that word is expected to appear in a large document. When a word is misspelled (i.e. it is not found in the dictionary) Google suggests a “similar” word whose frequency is larger or equal to any other “similar” word. Here details of spell autocorrect with edit distance and trie are provided.

First we can implement the dictionary as a Trie. A Trie is a tree-based data structure designed to store items that are sequences of characters from an alphabet. Each Trie-Node stores a count and a sequence of Nodes, one for each element in the alphabet. Each Trie-Node has a single parent except for the root of the Trie which does not have a parent. A sequence of characters from the alphabet is stored in the Trie as a path in the Trie from the root to the last character of the sequence.

Edit Distances is the key part to define the similarity of two words. In computer science, it is a way to measure the number of different letters between two strings. There are four Edit Distances:
● Deletion Distance: 1 from “bird” are “ird”, “brd”, “bid”, and “bir”.
● Transposition Distance: 1 from “house” are “ohuse”, “huose”, “hosue” and “houes”.
● Alteration Distance: 1 from “top” are “aop”, “bop”, …, “zop”, “tap”, “tbp”, …, “tzp”, “toa”, “tob”, …, and “toz”.
● Insertion Distance: 1 from “ask” are “aask”, “bask”, “cask”, … “zask”
Edit distance can be efficiently implemented by using Dynamic programming.

edit distance dynamic programming

Now lets clarify the rules of “most similar” based on Edit distance:
1. It has the “closest” edit distance from the input string.
2. If multiple words have the same edit distance, the most similar word is the one with the closest edit distance that is found the most times in the dictionary.
3. If multiple words are the same edit distance and have the same count/frequency, the most similar word is the one that is first alphabetically.

Google Interview Question:
Implement a spell autocorrect with edit distance as in Google search.

Java Code:

edit distance:
bird -> ird: 1
ohuse -> house: 1
zopper -> top: 4
ask -> askhim: 3
suggest of aop is top
suggest of bloat is boat
suggest of reah is read

O Notation:
Time complexity: O(S),
S is number of words in dictionary
Space complexity: O(K*N),
K is number of edit distances, N is max number of similar words

Download SpellAutocorrector.zip
Spelling corrector (GitHub)
Autocomplete with trie

Comments are closed