Category Archives: text processing

Keyword matching with Aho-Corasick

Say I have a lot of different keywords and I want to find out whether any of them are mentioned in some long text. I also want to catch them when they are part of another word, for example when two words are accidentally written together (e.g., greenplants). This is a considerable amount of searching.

Brute Force

One way to search for keywords would be to take each keyword and go through the text, or the other way around. This results in a lot of comparison operations, especially if the text is large (roughly keywords * words in text).
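As a rough sketch (with hypothetical keywords and text), the brute-force approach is just a nested loop:

keywords = ['he', 'she', 'hers', 'her']
text = 'he is here'

found = []
for keyword in keywords:     # one scan of the text per keyword
    if keyword in text:      # substring search, so parts of words match too
        found.append(keyword)

print found
# >> ['he', 'her']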

You could first count which words are the most common in the type of texts you use and search for those first. If you search line by line and stop at the first keyword from your list that matches, this somewhat reduces the number of operations.

Dictionary

A better way would be to make a dictionary from the keywords, where the key is the keyword and the value is whatever you want to match it with (a category). Now you only need to take each word in a line or text and look it up in the dictionary (on average O(1) per word). Nice, now you only search as many times as there are words in the text. However, if you want to search for a partial match, you still need to make substrings from words or add some other additional logic.
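Here is a minimal sketch of the dictionary lookup (hypothetical keywords again):

categories = {'he': 1, 'she': 1, 'hers': 1, 'her': 1}  # keyword -> category

line = 'she is here'
found = [word for word in line.split() if word in categories]  # O(1) lookup per word

print found
# >> ['she']  # 'here' is not matched: whole words only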

Automaton

Now, the best way is to take a line and find all the keywords as you go through it, much as you would read it. This can be done with an automaton. A nice algorithm for searching in text is the Aho-Corasick algorithm. It takes a list of keywords and generates an automaton from them. With this you can match your keywords in linear time – which is about as good as it gets.

Aho-Corasick

Python has a simple Aho-Corasick library, pyahocorasick. According to its documentation, it can work with unicode in Python 3, whereas in Python 2 it only works with byte strings.

pip install pyahocorasick

Now, let’s take some keywords and make an automaton. I have added a category for each keyword and stored both in a tuple (keyw, category). In this case all keywords are in the same category.

keywords = [
    ('he', 1),
    ('she', 1),
    ('hers', 1),
    ('her', 1)
]
text = [
    ' he is here ',
    ' this is she ',
    ' this is hers ',
    ' her bag is big '
]

First, keywords and their respective categories are added to a trie structure, from which an automaton will be generated.

import ahocorasick as ahc

def make_aho_automaton(keywords):
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

A = make_aho_automaton(keywords)

Next, let’s make a function which finds keywords in text. Searching in pyahocorasick happens via the automaton.iter(line) method. This returns an iterable of tuples, each holding the end index of the matched keyword and the (category, keyword) tuple stored with it.

def find_keywords(line, A):
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

Searching for keywords yields a list of the found keywords. Because the automaton was generated from keywords with no spaces around them, it finds overlapping keywords (the word here contains he as well as her).

for line in text:
    print line, ':', find_keywords(line, A)

# he is here : ['he', 'he', 'her']
# this is she : ['she', 'he']
# this is hers : ['he', 'her', 'hers']
# her bag is big : ['he', 'her']

Let’s remake the automaton:

keywords = [
    (' he ', 1),
    (' she ', 1),
    (' hers ', 1),
    (' her ', 1)
]

A_spaces = make_aho_automaton(keywords)

Now the A_spaces automaton finds only fully matching words:

# he is here  : [' he ']
# this is she  : [' she ']
# this is hers  : [' hers ']
# her bag is big  : [' her ']

Replacing keywords

A useful function would be one that replaces the found keywords with some characters for anonymization. Here the function returns an array the length of the line, telling me which indices are to be hidden.

def find_keyword_locations(line, A):
    line_indices = [False for x in line]
    for end_index, (cat, keyw) in A.iter(line):
        start_index = end_index - len(keyw) + 2  # start index after first space
        for i in range(start_index, end_index):  # end index excluding last space
            line_indices[i] = True
    return line_indices

Let us try to replace the keywords with dashes or remove them altogether:

new_text_removed = []
new_text_replaced = []
for line in text:
    line_indices = find_keyword_locations(line, A_spaces)
    line = list(line) # convert string to a list of characters
    new_line = "".join([line[i] if not x else '' for i, x in enumerate(line_indices)])
    new_text_removed.append(new_line)
    new_line = "".join([line[i] if not x else '-' for i, x in enumerate(line_indices)])
    new_text_replaced.append(new_line)

print text
print new_text_removed
print new_text_replaced

Original text:

[' he is here ', ' this is she ', ' this is hers ', ' her bag is big ']

Removed/replaced keywords in text:

['  is here ', ' this is  ', ' this is  ', '  bag is big ']
[' -- is here ', ' this is --- ', ' this is ---- ', ' --- bag is big ']

The last function seems pretty useful for text anonymization tasks, for example if you need to anonymize first or last names of persons.

To conclude, an automaton can search really fast for selected keywords. It is easy to use if you need to match whole words. Partial matching can be achieved by not adding spaces around the keywords. However, you have to choose the keywords carefully so as not to match accidental parts of words. I would consider making two automata – one for matching whole words, another for matching parts of words – as sketched below.
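For example, reusing make_aho_automaton and find_keywords from above, the two automata could be combined like this (the keyword lists are made up for illustration):

whole_word_keywords = [(' he ', 1), (' she ', 1)]
partial_keywords = [('greenplant', 2)]  # catches 'greenplants', 'greenplantation', ...

A_whole = make_aho_automaton(whole_word_keywords)
A_partial = make_aho_automaton(partial_keywords)

for line in [' she waters the greenplants ']:
    print find_keywords(line, A_whole) + find_keywords(line, A_partial)
# >> [' she ', 'greenplant']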

Full code can be found below, or on GitHub:

Dealing with JSON Encoded as JSON

The other day I had to fix a file containing JSON that had been encoded as JSON (JSON nested in JSON). It looked something like this:

"'{\"animal_list\": [{\"type\": \"mammal\", \"description\": \"Tall, with brown spots, lives in Savanna\", \"name\": \"Giraffa camelopardalis\"},{\"type\": \"mammal\", \"description\": \"Big, grey, with big ears, smart\", \"name\": \"Loxodonta africana\"},{\"type\": \"reptile\", \"description\": \"Green, changes color, lives in \"East Africa\"\", \"name\": \"Trioceros jacksonii\"}]}'"

It was all on one line and considerably big – when properly formatted, it contained around 1.5 million entries.

Quick’n’Dirty

The quick and dirty way to decode this would be to:

  • Put a newline between entries (for readability)
  • Remove the quotation marks (" and ') from the beginning and end
  • Un-escape the quotation marks
newfile = open('animals_edited.json', 'w')
with open('animals.json', 'r') as jsonf:
    for line in jsonf:   # there will be only one line
        newline = line.replace('},{', '},\n{')     # newline between entries
        newline = newline.strip('\"').strip('\'')  # drop the outer quotes
        newline = newline.replace('\\\"', '\"')    # un-escape the quotes
        print newline
        newfile.write(newline)
newfile.close()

Now the JSON looks better but there is a problem:

{"animal_list": [{"type": "mammal", "description": "Tall, with brown spots, lives in Savanna", "name": "Giraffa camelopardalis"},
{"type": "mammal", "description": "Big, grey, with big ears, smart", "name": "Loxodonta africana"},
{"type": "reptile", "description": "Green, changes color, lives in "East Africa"", "name": "Trioceros jacksonii"}]}

By un-escaping all the quotation marks, the quotes inside the description field that actually need to stay escaped are now un-escaped too. The challenge now is to fix these description fields.

One way is to take the line, find “description” in it, and split the line into parts – the beginning of the line, the description value, and the rest of the line. With the description value isolated, we can escape its quotation marks and glue all the parts back together:

newfile = open('animals_cleaned.json', 'w')
with open('animals_edited.json', 'r') as jsonf:
    for line in jsonf:
        start, end = line.split('\"description\": \"')
        descr_val, end = end.split('", "')
        # escape the quotes inside the description value
        descr_val = descr_val.replace('"', '\\"')
        descr = '\"description\": \"' + descr_val + '", "'
        newline = start + descr + end
        newfile.write(newline)
newfile.close()

Now the JSON imports successfully:

import json

with open('animals_cleaned.json', 'r') as nf:
    json_object = json.load(nf)

print len(json_object['animal_list'])
#>> 3

To sum up, this was a brute-force way of decoding JSON encoded as JSON; the solution took me about 10 minutes and was good enough to fix the problem. Alternatively, one could use a json.loads() object_hook as explained in the “encode nested JSON in JSON” answer.
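For reference, had the inner quotes been escaped correctly, a double decode would have worked without any string surgery – a sketch (the strip mirrors the stray single quotes in the file above):

import json

with open('animals.json', 'r') as f:
    raw = f.read()

inner = json.loads(raw)              # outer decode returns the inner JSON as a string
data = json.loads(inner.strip("'"))  # on the broken file this raises ValueError
                                     # because of the unescaped quotes around "East Africa"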

See the full code below or on GitHub:

Working With Numpy Matrices

When I started working with natural language processing, I used the default Python lists. But soon enough, with bigger experiments and more data, I ran out of RAM. Python lists are not optimized for memory space, so on to Numpy.

Numpy arrays are much like arrays in C – generally you create the array in the size you need beforehand and then fill it. Merging or appending is not recommended, as Numpy will allocate one empty array of the combined size and then copy the contents of both arrays into it.
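For example, preallocating and filling in place avoids the repeated copying (a small sketch):

import numpy as np

rows, cols = 1000, 50

data = np.zeros((rows, cols))   # allocate once, up front
for i in range(rows):
    data[i] = np.arange(cols)   # fill row i in place

# the slow alternative: every vstack allocates a new array and copies everything
# data = np.vstack([data, np.arange(cols)])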

Here are some ways Numpy arrays (ndarray) can be manipulated:

Create ndarray

Some ways to create numpy matrices are:

  • Create an ndarray from a Python list:

import numpy as np

numbers = [1, 2, 3]
c = np.asarray(numbers)
  • Create an ndarray in the size you need, filled with ones, zeros, or uninitialized values:

# Array items as ndarray
c = np.array([1, 2, 3])

# A 2x2 2d array; shapes are given in the format (rows, columns)
shape = (2, 2)

# Uninitialized (arbitrary) values
c = np.empty(shape)

d = np.ones(shape)
e = np.zeros(shape)

# Creating an ndarray from a list of lists
c = np.array([[1., 2.], [1., 2.]])

# Creating a new uninitialized array in the shape of c
d = np.empty_like(c)

Slice

Sometimes I need to select only some of the columns or rows in a 2d matrix. For example, take these matrices:

a = np.asarray([[1,1,2,3,4], # 1st row
                [2,6,7,8,9], # 2nd row
                [3,6,7,8,9], # 3rd row
                [4,6,7,8,9], # 4th row
                [5,6,7,8,9]  # 5th row
              ])

b = np.asarray([[1,1],
                [1,1]])

# Select rows in the format a[start:end]; omitting start or end means the full range
y = a[:1]  # 1st row
y = a[0:1] # 1st row
y = a[2:5] # rows 3 to 5

# Select columns in the format a[:, start:end]
x = a[:, -1]  # -1 means the last column
x = a[:, 1:3] # columns 2 and 3

Merge arrays

Merging numpy arrays is not advised, because internally numpy will create a big empty array and then copy the contents into it. It is best to create an array of the intended size at the beginning and then just fill it up. However, sometimes merging cannot be avoided. For those cases, numpy has some built-in functions:

Concatenate 

1d arrays:

a = np.array([1, 2, 3])
b = np.array([5, 6])
print np.concatenate([a, b, b])  
# >>  [1 2 3 5 6 5 6]

2d arrays (a 1d array cannot be concatenated with a 2d one, so a 2d version of b is needed):

a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5, 6]])  # 2d: one row, two columns

# axis=0 - concatenate along rows
print np.concatenate((a2, b2), axis=0)
# >>   [[1 2]
#       [3 4]
#       [5 6]]

# axis=1 - concatenate along columns, but first b2 needs to be transposed:
print b2.T
#>> [[5]
#    [6]]

print np.concatenate((a2, b2.T), axis=1)
#>> [[1 2 5]
#    [3 4 6]]

Append – append values to the end of an array

1d arrays:

# 1d arrays
print np.append(a, a2)
# >> [1 2 3 1 2 3 4]

print np.append(a, a)
# >> [1 2 3 1 2 3]

2d arrays – the arrays must have the same number of dimensions and matching shapes along the other axis:

print np.append(a2, b2, axis=0)
# >> [[1 2]
#     [3 4]
#     [5 6]]

print np.append(a2, b2.T, axis=1)
# >> [[1 2 5]
#     [3 4 6]]

Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

print np.hstack([a, b])
# >> [1 2 3 5 6]

print np.vstack([a, a])
# >> [[1 2 3]
#     [1 2 3]]

2d arrays:

print np.hstack([a2,a2]) # arrays must match shape
# >> [[1 2 1 2]
#     [3 4 3 4]]

print np.vstack([a2, b])
# >> [[1 2]
#     [3 4]
#     [5 6]]

Types in ndarray

Besides the default float, numpy can hold all the common types. If any of the numbers in an array is a float, all numbers will be converted to float:

a = np.array([1, 2, 3.3])
print a
# >> [ 1.   2.   3.3]

But you can easily cast the type to int, float or another type:

print a.astype(int)
# >> [1 2 3]

String arrays need to be created as arrays with the type S1 for strings of length 1, S2 for length 2, and so on. numpy.chararray() creates an array with this type. You need to specify the shape of the array and itemsize – the length of each string.

chararray = np.chararray([3,3], itemsize=3)
chararray[:] = 'abc' # assign value to all fields
print chararray
#>> [['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']]

Read/write to file

Write a numpy array to a file in plain text form with numpy.savetxt() and load it back with numpy.loadtxt():

a2 = np.array([
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
])

np.savetxt('test.txt', a2, delimiter=',')
a2_new = np.loadtxt('test.txt', delimiter=',')
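Note that numpy.savetxt() writes values in floating-point notation by default; the fmt parameter keeps the integers readable:

np.savetxt('test.txt', a2, delimiter=',', fmt='%d')  # write values as plain integers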

Writing sparse matrices

However, in machine learning you often have a large, sparse matrix (one with a lot of zero values). Reading and writing such matrices is faster, and the resulting file smaller, if you use the svmlight format:

from sklearn.datasets import dump_svmlight_file, load_svmlight_file

matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
]

labels = [1,1,1,1,1,2,2]


dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

# The file looks like this:

# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 2 0:1 5:3 13:1 14:2
# 2 0:1 5:3 13:1 14:2

svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

Use .toarray() to get the matrix back from the svmlight Compressed Sparse Row format:

svm_loaded[0].toarray() # matrix element at index 0
svm_loaded[1] # labels at index 1
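As a quick sanity check, the round trip can be compared with the original data:

import numpy as np

print np.array_equal(svm_loaded[0].toarray(), np.array(matrix))
# >> True
print list(svm_loaded[1]) == labels
# >> True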

To sum up, this post looked at how to:

  • create numpy arrays,
  • slice arrays,
  • merge arrays,
  • use the basic types of numpy arrays,
  • read and write arrays to a file,
  • read and write sparse matrices in svmlight format.

This was just an introduction to numpy matrices, covering how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.

Cleaning Text for Natural Language Processing Tasks in Machine Learning in Python

Often when I work with text I need it to be clean – that is, to remove gibberish, symbols and words I don’t need, and to make all letters lowercase.

For example, a “dirty” line of text:

text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

Using Python 2.7:

1) Read the line from list:

for line in text:
    # do something with line

or read from file:

with open('file.txt', 'r') as f:
    for line in f:
        # do something with line

2) Decode the line from utf8 bytes to unicode to work with special symbols:

line = line.decode('utf8')

3) Remove the symbols you don’t need. With replace() you can chain as many replace operations as you want:

line = line.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')

4) Remove numbers. Here you can use the regex \d+. Because the dots have already been removed, we only need to check for standalone whole numbers:

import re

line = re.sub(r"(^|\W)\d+($|\W)", " ", line)

This regex matches digits preceded by the start of the line ^ or a non-word character \W, and followed by the end of the line $ or a non-word character, and replaces the match with a space.

Alternatively, you can just check whether a word evaluates to a number with a simple function: is_digit() attempts to turn the string into an int. If it succeeds, the function returns True.

def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False

Use this function on each word in the line by splitting the line on spaces with line.split(). The new_line list will hold only those words that are not numbers. At the end, the list is joined back into a string.

new_line = []
for word in line.split():
    if not is_digit(word):
        new_line.append(word)
line = " ".join(new_line)

5) Now only lowercasing and the special characters remain. Since the goal is plain Latin text, the special characters need to be turned into Latin ones. This can be done using the Transliterate Python package or by hand. Here is a simple transliteration dictionary made from a list of character pairs:

cedilla2latin = [[u'Á', u'A'], [u'á', u'a'], [u'Č', u'C'], [u'č', u'c'], [u'Š', u'S'], [u'š', u's']]
tr = dict([(char, latin) for (char, latin) in cedilla2latin])

In this way you can have multiple symbols stand for one special symbol (like German [u'ä', u'ae']).
With the dictionary I can recreate letters in Latin:

def transliterate(line):
    new_line = ""
    for letter in line:
        if letter in tr:
            new_line += tr[letter]
        else:
            new_line += letter
    return new_line

And call the transliterate function:

line = transliterate(line)

6) After clearing away the unnecessary symbols, I can finally lowercase the line:

line = line.lower()

The line is now reduced to simple Latin characters:

print line
>> this is dirty text a phone number money some date like and weird carakters

If you need to retain some numbers or check for other fields, go ahead and write more specific regexes. Keep in mind, however, that Python’s regex engine uses backtracking, which can make matching quadratic or worse in the worst case. This can slow you down, especially if you are working with millions of lines.

As a side note, a more general cleaning method that leaves only Latin characters is to check the ASCII value of each letter with ord():

def get_latin(line):
    # keep A-Z (ASCII 65-90) and a-z (ASCII 97-122); replace everything else with a space
    return ' '.join(''.join([i if 65 <= ord(i) <= 90 or 97 <= ord(i) <= 122 else ' ' for i in line]).split())
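Note that, unlike transliterate(), this simply drops the special characters instead of converting them:

print get_latin(u'weird Čárákterš and 123')
# >> weird r kter and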

Full code of the above description is available below or here: