Keyword matching with Aho-Corasick

So, for example, say I have a lot of different keywords and I want to find out whether any of them are mentioned in some long text. I also want to catch them when they are part of another word, e.g., when two words are accidentally written together (greenplants). This is a considerable amount of searching.

Brute Force

One way to search for keywords would be to take each keyword and go through the text, or the other way around. This results in a lot of comparison operations, especially if the text is large: on the order of (number of keywords × number of words in the text).

You could first count which words are the most common in the kind of texts you work with and search for those first. If you search line by line and stop at the first keyword from your list that matches, this somewhat reduces the number of operations.
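
A minimal sketch of this brute-force approach (the keywords and text are just placeholders):

keywords = ['he', 'she', 'hers', 'her']
text = 'he is here'

found = []
for word in text.split():  # compare every keyword against every word
    for key in keywords:
        if key in word:  # substring check also catches partial matches
            found.append(key)

print(found)  # ['he', 'he', 'her']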

Dictionary

A better way would be to make a dictionary from the keywords, where the key is the keyword and the value is whatever you want to match it with (e.g., a category). Now you only need to take each word in a line/text and look it up in the dictionary (on average O(1) per word). Nice, now you only search as many times as there are words in the text. However, if you want to find partial matches, you still need to make substrings from the words or add some other additional logic.
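
A minimal sketch of the dictionary approach (again with placeholder keywords); note how the partial match gets lost:

categories = {'he': 1, 'she': 1, 'hers': 1, 'her': 1}  # keyword -> category
text = 'he is here'

found = [(word, categories[word]) for word in text.split() if word in categories]
print(found)  # [('he', 1)] - 'here' is missed without extra substring logic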

Automaton

Now, the best way is just to take a line and find all the keywords as you go through it, much as you would when reading it. This can be done with an automaton. A nice algorithm for searching in text is the Aho-Corasick algorithm. This algorithm takes a list of keywords and generates an automaton from them. With this you can match your keywords in time linear in the length of the text – which is about as good as it gets.

Aho-Corasick

Python has a simple Aho-Corasick library, pyahocorasick. According to its documentation, it can work with unicode in Python 3, whereas in Python 2 it can only work with byte strings.

pip install pyahocorasick

Now, let's take some keywords and make an automaton. I have added a category for each keyword and stored them as tuples (keyword, category). In this case all keywords are in the same category.

keywords = [
    ('he', 1),
    ('she', 1),
    ('hers', 1),
    ('her', 1)
]
text = [
    ' he is here ',
    ' this is she ',
    ' this is hers ',
    ' her bag is big '
]

First, keywords and their respective categories are added to a trie structure, from which an automaton will be generated.

import ahocorasick as ahc

def make_aho_automaton(keywords):
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

A = make_aho_automaton(keywords)

Next, let’s make a function which finds keywords in a text. Searching in pyahocorasick happens via the automaton.iter(line) method. It returns an iterator over (end_index, value) pairs, where end_index is the index of the last character of the match and value is the (category, keyword) tuple we stored.

def find_keywords(line, A):
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

Searching for keywords yields a list of the found keywords. Because the automaton was generated from keywords with no spaces around them, it finds overlapping keywords ('here' contains 'he' as well as 'her').

for line in text:
    print(line, ':', find_keywords(line, A))

# he is here : ['he', 'he', 'her']
# this is she : ['she', 'he']
# this is hers : ['he', 'her', 'hers']
# her bag is big : ['he', 'her']

Let’s remake the automaton, this time with spaces around the keywords:

keywords = [
    (' he ', 1),
    (' she ', 1),
    (' hers ', 1),
    (' her ', 1)
]

A_spaces = make_aho_automaton(keywords)

Now the A_spaces automaton finds only fully matching words. Running the same loop as before:
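
for line in text:
    print(line, ':', find_keywords(line, A_spaces))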

# he is here  : [' he ']
# this is she  : [' she ']
# this is hers  : [' hers ']
# her bag is big  : [' her ']

Replacing keywords

A useful function would be one that replaces the found keywords with some characters for anonymization. Here the function returns a boolean array the length of the line, telling me which indices are to be hidden.

def find_keyword_locations(line, A):
    line_indices = [False for x in line]
    for end_index, (cat, keyw) in A.iter(line):
        start_index = end_index - len(keyw) + 2  # start index after first space
        for i in range(start_index, end_index):  # end index excluding last space
            line_indices[i] = True
    return line_indices

Let us try to replace the keywords with dashes or remove them altogether:

new_text_removed = []
new_text_replaced = []
for line in text:
    line_indices = find_keyword_locations(line, A_spaces)
    line = list(line)  # convert the string to a list of characters
    new_line = "".join([line[i] if not x else '' for i, x in enumerate(line_indices)])
    new_text_removed.append(new_line)
    new_line = "".join([line[i] if not x else '-' for i, x in enumerate(line_indices)])
    new_text_replaced.append(new_line)

print(text)
print(new_text_removed)
print(new_text_replaced)

Original text:

[' he is here ', ' this is she ', ' this is hers ', ' her bag is big ']

Removed/replaced keywords in text:

['  is here ', ' this is  ', ' this is  ', '  bag is big ']
[' -- is here ', ' this is --- ', ' this is ---- ', ' --- bag is big ']

The last function seems pretty useful for text anonymization tasks, for example if you need to anonymize personal names/last names.

To conclude, an automaton can search for selected keywords really fast. It is easy to use if you need to match whole words. Partial matching can be achieved by not adding spaces around the keywords; however, you then have to choose the keywords carefully so as not to match accidental parts of other words. I would consider making two automata – one for matching whole words, another for matching parts of words – as sketched below.
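
A minimal sketch of that two-automata idea, reusing make_aho_automaton and find_keywords from above (the split of the keywords into whole-word and partial lists is just an illustration):

# Keywords safe to match as parts of words vs. those needing whole-word matching.
partial_keywords = [('he', 1), ('she', 1)]
whole_word_keywords = [(' her ', 1), (' hers ', 1)]

A_partial = make_aho_automaton(partial_keywords)
A_whole = make_aho_automaton(whole_word_keywords)

line = ' she has hers '
print(find_keywords(line, A_partial))  # ['she', 'he', 'he']
print(find_keywords(line, A_whole))    # [' hers ']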

Full code can be found below, or on GitHub: