Dealing with JSON Encoded as JSON

The other day I had to fix a file containing JSON that had been encoded as JSON (JSON nested in JSON). It looked something like this:

"'{\"animal_list\": [{\"type\": \"mammal\", \"description\": \"Tall, with brown spots, lives in Savanna\", \"name\": \"Giraffa camelopardalis\"},{\"type\": \"mammal\", \"description\": \"Big, grey, with big ears, smart\", \"name\": \"Loxodonta africana\"},{\"type\": \"reptile\", \"description\": \"Green, changes color, lives in \"East Africa\"\", \"name\": \"Trioceros jacksonii\"}]}'"

It was all on one line and fairly big – around 1.5 million entries when properly formatted.

Quick’n’Dirty

The quick and dirty way to decode this would be to:

  • Put a newline between entries (for readability)
  • Remove the quotation marks (" and ') from the beginning and end
  • Un-escape the quotation marks
newfile = open('animals_editted.json', 'w')
with open('animals.json', 'r') as jsonf:
    for line in jsonf:   # there will be only one line
        newline = line.replace('},{', '},\n{')     # newline between entries
        newline = newline.strip('\"').strip('\'')  # strip outer quotes
        newline = newline.replace('\\\"', '\"')    # un-escape quotation marks
        print newline
        newfile.write(newline)
newfile.close()

Now the JSON looks better but there is a problem:

{"animal_list": [{"type": "mammal", "description": "Tall, with brown spots, lives in Savanna", "name": "Giraffa camelopardalis"},
{"type": "mammal", "description": "Big, grey, with big ears, smart", "name": "Loxodonta africana"},
{"type": "reptile", "description": "Green, changes color, lives in "East Africa"", "name": "Trioceros jacksonii"}]}

By un-escaping all the quotation marks, the quotes that actually need to stay escaped inside the description field are now un-escaped as well. The challenge now is to fix these description fields.

One way is to take each line, find "description" in it and split it into parts – the beginning of the line, the description value, and the end of the line. Once we have the description value isolated, we can escape the quotation marks in it and glue all the parts back together:

newfile = open('animals_cleaned.json', 'w')
with open('animals_editted.json', 'r') as jsonf:
    for line in jsonf:
        # split only on the first occurrence of each separator
        start, end = line.split('\"description\": \"', 1)
        descr_val, end = end.split('", "', 1)
        # escape the quotes inside the description value
        descr_val = descr_val.replace('"', '\\"')
        descr = '\"description\": \"' + descr_val + '", "'
        newline = start + descr + end
        newfile.write(newline)
newfile.close()

Now the JSON is importing successfully:

import json

with open('animals_cleaned.json', 'r') as nf:
    json_object = json.load(nf)
print len(json_object['animal_list'])
#>> 3

To sum up, this was a brute-force way of decoding JSON encoded as JSON; it took me about 10 minutes and was good enough to fix the problem. Alternatively, one could use a json.loads() hook (object_hook) as explained in the "encode nested JSON in JSON" answer.
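
For comparison, when the nested JSON is escaped correctly (true double-encoded JSON), no string surgery is needed at all – you can simply decode twice. A minimal sketch, assuming a hypothetical file animals_valid.json that contains such a correctly escaped string:

import json

with open('animals_valid.json', 'r') as f:   # hypothetical, correctly escaped file
    inner_string = json.load(f)   # first pass returns the inner JSON as a string
data = json.loads(inner_string)   # second pass decodes the actual object

print len(data['animal_list'])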

See full code below or on github:

Iris Dataset and Xgboost Simple Tutorial

I recently had the opportunity to start using xgboost, a machine learning algorithm that is fast and shows good results. Here I will be using multiclass prediction with the Iris dataset from scikit-learn.

Installing Anaconda and xgboost

In order to work with the data, I need to install various scientific libraries for Python. The best way I have found is to use Anaconda: it installs all the libraries at once and makes it easy to add new ones. You can download the installer for Windows, but if you want to install it on a Linux server, you can just copy-paste this into the terminal:

wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
bash 

After this, use conda to install pip which you will need for installing xgboost. It is important to install it using Anaconda (in Anaconda’s directory), so that pip installs other libs there as well:

conda install -y pip libgcc

Now, a very important step: install xgboost Python Package dependencies beforehand. I install these ones from experience:

sudo apt-get install -y make g++ build-essential gfortran libatlas-base-dev liblapacke-dev python-dev python-setuptools libsm6 libxrender1

I also upgrade virtualenv to avoid trouble with Python versions:

pip install --upgrade virtualenv

And finally I can install xgboost with pip (keep fingers crossed):

pip install xgboost

This command installs the latest xgboost version, but if you want to use a previous one, just specify it with:

pip install xgboost==0.4a30

Now test whether everything has gone well – type python in the terminal and try to import xgboost:

import xgboost as xgb

If you see no errors – perfect.

Xgboost Demo with the Iris Dataset

Here I will use the Iris dataset to show a simple example of how to use Xgboost.

First you load the dataset from sklearn, where X will be the data and y – the class labels:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

Then you split the data into train and test sets with an 80-20 split:

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next you need to create the Xgboost-specific DMatrix data format from the numpy arrays. Xgboost can work with numpy arrays directly, load data from svmlight files, and handle other formats as well. Here is how to work with numpy arrays:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

If you want to use svmlight for less memory consumption, first dump the numpy array into svmlight format and then just pass the filename to DMatrix:

import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')

Now, for Xgboost to work, you need to set the parameters:

param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of training iterations

Different datasets perform better with different parameters; the result can be really low with one set of params and really good with another. You can look at this Kaggle script to see how to search for the best ones. As a rule of thumb, try eta 0.1, 0.2 or 0.3, max_depth in the range of 2 to 10, and num_round around a few hundred.
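
As a rough illustration of such a search (not the Kaggle script itself), here is a minimal sketch that loops over a few eta and max_depth values and scores each combination with xgboost's built-in cross-validation, xgb.cv. It assumes a reasonably recent xgboost where xgb.cv returns a pandas DataFrame with a test-mlogloss-mean column:

# A rough parameter sweep; the value ranges follow the rule of thumb above.
best_params, best_score = None, float('inf')
for eta in [0.1, 0.2, 0.3]:
    for max_depth in [2, 4, 6, 8, 10]:
        candidate = {'max_depth': max_depth, 'eta': eta, 'silent': 1,
                     'objective': 'multi:softprob', 'num_class': 3}
        cv = xgb.cv(candidate, dtrain, num_boost_round=200, nfold=5, seed=42)
        score = cv['test-mlogloss-mean'].min()  # best round for this combination
        if score < best_score:
            best_params, best_score = candidate, score

print best_params, best_score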

Train

Finally the training can begin. You just type:

bst = xgb.train(param, dtrain, num_round)
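
If you want to keep an eye on the evaluation metric while training, xgb.train also accepts an evals list of (DMatrix, name) pairs; a minimal variant of the call above:

# Optional: print the multiclass logloss on train and test at every round
bst = xgb.train(param, dtrain, num_round, evals=[(dtrain, 'train'), (dtest, 'test')])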

To see how the model looks you can also dump it in human readable form:

bst.dump_model('dump.raw.txt')

And it looks something like this (f0, f1, f2 are features):

booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=0.426036
    2:leaf=-0.218845
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=-0.213018
    2:[f3<1.75] yes=3,no=4,missing=3
        3:[f2<4.95] yes=5,no=6,missing=5
            5:leaf=0.409091
            6:leaf=-9.75349e-009
        4:[f2<4.85] yes=7,no=8,missing=7
            7:leaf=-7.66345e-009
            8:leaf=-0.210219
....

You can see that each tree is no deeper than 3 levels as set in the params.

Use the model to predict classes for the test set:

preds = bst.predict(dtest)

But the predictions look something like this:

[[ 0.00563804 0.97755206 0.01680986]
 [ 0.98254657 0.01395847 0.00349498]
 [ 0.0036375 0.00615226 0.99021029]
 [ 0.00564738 0.97917044 0.0151822 ]
 [ 0.00540075 0.93640935 0.0581899 ]
....

Here each column represents class 0, 1 or 2. For each row you need to select the column with the highest probability:

import numpy as np
best_preds = np.asarray([np.argmax(line) for line in preds])
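
Equivalently, numpy can do the same selection in one vectorized call, which is a bit faster on large prediction arrays:

best_preds = np.argmax(preds, axis=1)  # index of the highest probability per row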

Now you get a nice list with predicted classes:

[1, 0, 2, 1, 1, ...]

Determine the precision of this prediction:

from sklearn.metrics import precision_score

print precision_score(y_test, best_preds, average='macro')
# >> 1.0

Perfect! Now save the model for later use:

from sklearn.externals import joblib

joblib.dump(bst, 'bst_model.pkl', compress=True)
# bst = joblib.load('bst_model.pkl') # load it later
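
Alternatively, xgboost's Booster has its own save/load methods for a native binary format; a minimal sketch (the file name is just an example):

bst.save_model('bst_model.bin')       # xgboost's native binary format

bst2 = xgb.Booster()                  # create an empty booster
bst2.load_model('bst_model.bin')      # and load the saved model into it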

Now you have a working model saved for later use and ready for more predictions.

See the full code on github or below:

Working With Numpy Matrices

When I started working with natural language processing, I used the default Python lists. But soon enough, with bigger experiments and more data, I ran out of RAM. Python lists are not optimized for memory, so I moved on to Numpy.

Numpy arrays are much like arrays in C – generally you create the array in the size you need beforehand and then fill it. Merging and appending are not recommended, as Numpy will create one new array in the combined size of the arrays being merged and then copy the contents into it.
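
For instance, here is a minimal sketch of preallocating an array once and then filling it row by row, instead of growing it inside a loop (the sizes are arbitrary):

import numpy as np

rows, cols = 1000, 5
result = np.empty((rows, cols))    # allocate the full size once
for i in range(rows):
    result[i] = np.arange(cols)    # fill one row at a time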

Here are some ways Numpy arrays (ndarray) can be manipulated:

Create ndarray

Some ways to create numpy matrices are:

  • Convert a Python list into an ndarray:
import numpy as np

a_list = [1, 2, 3]
c = np.asarray(a_list)
  • Create an ndarray in the size you need filled with ones, zeros or random values:
# Array items as ndarray 
c = np.array([1, 2, 3])

# A 2x2 2d array shape for the arrays in the format (rows, columns)
shape = (2, 2)

# Random values
c = np.empty(shape)

d  = np.ones(shape)
e = np.zeros(shape)
# Creating ndarray from list
c = np.array([[1., 2.,],[1., 2.]])

# Creating new array in the shape of c, filled with 0
d = np.empty_like(c)

Slice

Sometimes I need to select only a part of all columns or rows in a 2d matrix. For example, matrices:

a = np.asarray([[1,1,2,3,4], # 1st row
                [2,6,7,8,9], # 2nd row
                [3,6,7,8,9], # 3rd row
                [4,6,7,8,9], # 4th row
                [5,6,7,8,9]  # 5th row
              ])

b = np.asarray([[1,1],
                [1,1]])

# Select row in the format a[start:end], if start or end omitted it means all range.
y = a[:1]  # 1st row
y = a[0:1] # 1st row
y = a[2:5] # select rows from 3rd to 5th row

# Select column in the format a[start:end, column_number]
x = a[:, -1] # -1 means first from the end
x = a[:,1:3] # select cols from 2nd col until 3rd

Merge arrays

Merging numpy arrays is not advised, because internally numpy will create a new, bigger empty array and then copy the contents into it. It would be best to create the array in its intended size at the beginning and then just fill it up. However, sometimes you cannot avoid merging. In this case, numpy has some built-in functions:

Concatenate 

1d arrays:

a = np.array([1, 2, 3])
b = np.array([5, 6])
print np.concatenate([a, b, b])  
# >>  [1 2 3 5 6 5 6]

2d arrays (for np.concatenate both arrays must have the same number of dimensions, so here the 1d array b is replaced by a 2d row b2):

a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5, 6]])  # 2d version of b

# axis=0 - concatenate along rows
print np.concatenate((a2, b2), axis=0)
# >>   [[1 2]
#       [3 4]
#       [5 6]]

# axis=1 - concatenate along columns, but first b2 needs to be transposed:
print b2.T
#>> [[5]
#    [6]]
print np.concatenate((a2, b2.T), axis=1)
#>> [[1 2 5]
#    [3 4 6]]

Append – append values to the end of an array

1d arrays:

# 1d arrays
print np.append(a, a2)
# >> [1 2 3 1 2 3 4]

print np.append(a, a)
# >> [1 2 3 1 2 3]

2d arrays – with an explicit axis, both arrays must have the same number of dimensions and matching shapes along the other axis:

print np.append(a2, b2, axis=0)
# >> [[1 2]
#     [3 4]
#     [5 6]]

print np.append(a2, b2.T, axis=1)
# >> [[1 2 5]
#     [3 4 6]]

Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

print np.hstack([a, b])
# >> [1 2 3 5 6]

print np.vstack([a, a])
# >> [[1 2 3]
#     [1 2 3]]

2d arrays:

print np.hstack([a2,a2]) # arrays must match shape
# >> [[1 2 1 2]
#     [3 4 3 4]]

print np.vstack([a2, b])
# >> [[1 2]
#     [3 4]
#     [5 6]]

Types in ndarray

Numpy arrays can hold all the common data types, not only the default float. If any of the numbers in the array is a float, all of them will be converted to float:

a = np.array([1, 2, 3.3])
print a
# >> [ 1.   2.   3.3]

But you can easily cast the type to int, float or other:

print a.astype(int)
# >> [1 2 3]

String arrays need to be created with the type S1 for strings of length 1, S2 for length 2 and so on. numpy.chararray() creates an array with this type; you need to specify the shape of the array and itemsize – the length of each string.

chararray = np.chararray([3,3], itemsize=3)
chararray[:] = 'abc' # assign the value to all fields
print chararray
#>> [['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']]

Read/write to file

Write a numpy array to a file in plain text form with numpy.savetxt() and load it back with numpy.loadtxt():

a2 = np.array([
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
])

np.savetxt('test.txt', a2, delimiter=',')
a2_new = np.loadtxt('test.txt', delimiter=',')
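
If the file does not need to be human-readable, numpy's binary .npy format (numpy.save() / numpy.load()) is usually faster and preserves the exact dtype; a minimal equivalent:

np.save('test.npy', a2)          # binary .npy file
a2_new = np.load('test.npy')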

Writing sparse matrices

However, in machine learning, if you have a large sparse matrix (with a lot of values that are 0), reading and writing it is faster and the file is smaller if you use the svmlight format:

from sklearn.datasets import dump_svmlight_file, load_svmlight_file

matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
]

labels = [1,1,1,1,1,2,2]


dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

# The file looks like this:

# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 2 0:1 5:3 13:1 14:2
# 2 0:1 5:3 13:1 14:2

svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

Use .toarray() to get matrix back from the svmlight Compressed Sparse Row format:

svm_loaded[0].toarray() # matrix element at index 0
svm_loaded[1] # labels at index 1

To sum up, this post looked at how to:

  • create numpy arrays,
  • slice arrays,
  • merge arrays,
  • work with the basic types in numpy arrays,
  • read and write arrays to a file,
  • read and write sparse matrices in svmlight format.

This was just an introduction to numpy matrices – how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.

Cleaning Text for Natural Language Processing Tasks in Machine Learning in Python

Often when I work with text I need it to be clean: to remove gibberish, symbols and words I don't need, and to make all letters lowercase.

For example, a “dirty” line of text:

text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

Using Python 2.7:

1) Read the line from list:

for line in text:
    # do something with line

or read from file:

with open('file.txt', 'r') as f:
    for line in f:
        # do something with line

2) Decode the line to utf8 from a string of bytes to work with special symbols:

line = line.decode('utf8')

3) Remove the symbols you don’t need. With replace() you can stack as many replace operations as you want.

line = line.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')

4) Remove numbers. Here you can use the regex \d+. Because dots have already been removed, we only need to check for whole numbers.

import re

line = re.sub(r"(^|\W)\d+($|\W)", " ", line)

This regex matches digits that are preceded by the start of the line ^ or a non-word character \W and followed by the end of the line $ or a non-word character, and replaces the whole match with a space.

Alternatively, you can just check whether a word evaluates to a number with a simple function – is_digit() attempts to turn a string into an int; if it succeeds, the function returns True.

def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False

Use this function on each word in the line by splitting the line on spaces with line.split(). The new_line list will hold only the words that are not numbers. At the end the list is joined back into a string.

new_line = []
for word in line.split():
    if not is_digit(word):
        new_line.append(word)
line = " ".join(new_line)

5) Now only lowercasing and the special characters remain. As lowercasing only supports Latin letters, the special characters first need to be turned into Latin ones. This can be done using the Transliterate Python package or by hand. Here is a simple transliteration dictionary made from lists of character pairs:

cedilla2latin = [[u'Á', u'A'], [u'á', u'a'], [u'Č', u'C'], [u'č', u'c'], [u'Š', u'S'], [u'š', u's']]
tr = dict([(a[0], a[1]) for (a) in cedilla2latin])

In this way you can have multiple Latin characters stand in for one special symbol (like German [u'ä', u'ae']).
With the dictionary I can recreate the letters in Latin:

def transliterate(line):
    new_line = ""
    for letter in line:
        if letter in tr:
            new_line += tr[letter]
        else:
            new_line += letter
    return new_line

And call the transliterate function:

line = transliterate(line)

6) After clearing away unnecessary symbols, finally I can lowercase the line:

line = line.lower()

And finally the line is reduced to simple Latin characters.

print line
>> this is dirty text a phone number money some date like and weird carakters

If you need to retain some numbers or check for other fields, go ahead and write more specific regexes. However, regexes in Python use backtracking, which can make matching roughly quadratic in the length of the input. This can slow you down, especially if you are working with millions of lines.
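
One small mitigation when you do run regexes over millions of lines is to compile the pattern once, outside the loop, instead of passing the pattern string every time; a minimal sketch reusing the digit pattern from above:

import re

digit_re = re.compile(r"(^|\W)\d+($|\W)")   # compiled once

def remove_numbers(line):
    return digit_re.sub(" ", line)          # reused for every line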

As a side note, a more general cleaning method that leaves only Latin characters is to check the ASCII value of each letter with ord().

def get_latin(line):
    # keep only A-Z (codes 65-90) and a-z (codes 97-122), replace everything else
    # with a space, then split and re-join to collapse repeated whitespace
    return ' '.join(''.join([i if ord(i) >= 65 and ord(i) <= 90 or ord(i) >= 97 and ord(i) <= 122 else ' ' for i in line]).split())

Full code of the above description is available below or here: