So, for example, I have around a lot of different keywords and I want to find if any of these have been mentioned in some long text. I also want catch them if they are a part of another word like accidentally written two words together (e.g., greenplants). This is considerable amount of searching.
One way to search for keywords would be to take each keyword and go though the text or the another way around. This results in a lot of comparison operations, especially if the text is large. (keywords * words in text).
You could first count which words are the most common in the type of texts you use and search for those first. If you are searching line by line in a text and match it by the first encountered keyword from your list this would somewhat reduce the number of operations.
A better way would be to make a dictionary from keywords, where the key is the keyword and the value is anything you want to match it with ( a category). Now you only need to take each word in a line /text and match it in the the dictionary (on average – O(1) for one word) . Nice now you only search as many times ar you have words in the text. However, if you want to search for a partial match, you still need to make substrings from words or add some other additional logic.
Now, the best way is just to take a line and find all the keywords as you go though it, similar as you would read it. This can be done with an automaton. A nice algorithm for searching in text is Aho-Corasick algorithm. This algorithm takes a list of keywords and generates an automaton from them. With this you can match your keywords in linear time – which is about as good as it gets.
Python has a simple Aho-Corasick library pyahocorasick. According to pyahocorasick, this lib can work with unicode in Python 3, whereas Python 2 can only work with string bytes.
pip install pyahocorasick
Now, lets take some keywords and make an automaton. I have added a category for each keyword and stored that in a tuple (keyw, category). In this case all keywords are in the same category.
keywords = [ ('he', 1), ('she', 1), ('hers', 1), ('her', 1) ] text = [ ' he is here ', ' this is she ', ' this is hers ', ' her bag is big ' ]
First, keywords and their respective categories are added to a trie structure, from which an automaton will be generated.
import ahocorasick as ahc def make_aho_automaton(keywords): A = ahc.Automaton() # initialize for (key, cat) in keywords: A.add_word(key, (cat, key)) # add keys and categories A.make_automaton() # generate automaton return A A = make_aho_automaton(keywords)
Next, let’s make a function which finds keywords in text. Searching in pyahocorasick happens via automaton.iter(line) function. This returns an iterable object with the end index of the word matched and the tuple of the respective keyword and its category.
def find_keywords(line, A): found_keywords =  for end_index, (cat, keyw) in A.iter(line): found_keywords.append(keyw) return found_keywords
Searching for keywords yields a list of the found keywords. Because the automaton was generated from keywords with no spaces around them, it finds overlapping keywords (here contains he as well as her).
new_text =  for line in text: print line, ':', find_keywords(line, A) # he is here : ['he', 'he', 'her'] # this is she : ['she', 'he'] # this is hers : ['he', 'her', 'hers'] # her bag is big : ['he', 'her']
Let’s remake the automaton:
keywords = [ (' he ', 1), (' she ', 1), (' hers ', 1), (' her ', 1) ] A_spaces = make_aho_automaton(keywords)
Now the A_spaces automaton finds only fully matching words:
# he is here : [' he '] # this is she : [' she '] # this is hers : [' hers '] # her bag is big : [' her ']
A useful function would be to replace the found keywords with some characters for anonymization. Here the function returns an array in the length of the line telling me which indices are to be hidden.
def find_keyword_locations(line, A): line_indices = [False for x in line] for end_index, (cat, keyw) in A.iter(line): start_index = end_index - len(keyw) + 2 # start index after first space for i in range(start_index, end_index): # end index excluding last space line_indices[i] = True return line_indices
Let us try to replace the keywords with a dash – or remove the altogether:
new_text_removed =  new_text_replaced =  for line in text: line_indices = find_keyword_locations(line, A_spaces) line = list(line) # split string into list new_line = "".join([line[i] if not x else '' for i, x in enumerate(line_indices)]) new_text_removed.append(new_line) new_line = "".join([line[i] if not x else '-' for i, x in enumerate(line_indices)]) new_text_replaced.append(new_line) print text print new_text_removed print new_text_replaced
[' he is here ', ' this is she ', ' this is hers ', ' her bag is big ']
Removed/replaced keywords in text:
[' is here ', ' this is ', ' this is ', ' bag is big '] [' -- is here ', ' this is --- ', ' this is ---- ', ' --- bag is big ']
The last function seems pretty useful for text anonymization tasks if you, for example, need to anonymize person names/lastnames.
To conclude, an automaton can search really fast for selected keywords. It is easy to use it you need to match whole words. Partial matching can be archived by not adding spaces around the keywords. However, you have to choose carefully to not match accidental parts of words. I would consider making two automata – one for matching whole words, another for matching parts of words.
Full code can be found below, or on GitHub: