All posts by ieva

Basic Docker command set

Docker is a container platform. A Docker container is an isolated environment that runs on a shared operating system. So it’s very useful (and fast compared to virtual machines) to isolate different programs, data pipelines, etc., especially if they depend on different packages.

There are many tutorials on how to use Docker containers. Here are my notes on how to get started with Docker.

Installation

Install Docker as described in Docker’s homepage.

On Ubuntu, it’s as easy as running

curl -fsSL get.docker.com | sudo bash

Run

Start the Docker daemon (on Windows or Mac), then type docker in the terminal to make sure all is good. Now we can try to run a Docker container.

First you need to get a Docker image with your intended software on it. The easiest way is to download a ready made image from Docker Hub. There you can find the most popular configuration/setup that people have submitted as images.

To download an image to your computer you use pull. 

# get the latest version of ubuntu distribution
docker pull ubuntu
# get a specific version
docker pull ubuntu:xenial

Then use run to create a running container from the image.  run  will also pull the image if you don’t have it. So you can also skip pull.

# run the image and give it something to do (a command)
docker run --name my_server ubuntu:xenial sleep 100
# run a database and name the container
docker run --name my_db postgres
# run a database in the background (detach)
docker run -d --name my_db postgres

Check out the list of containers in your computer:

# see all running containers
docker ps
# see all running and stopped containers
docker ps -a

When the container has finished the command you provided, it will stop. This state is like a computer that is switched off. You can rerun the container with start. It will execute the same command as before, and you can use the -a (attach) flag to see the output.

# start the stopped container again with previous cmd
docker start -a my_server

If your container is running a database or some kind of continuous service then it will not stop unless you explicitly stop it.

# start a container with postgres DB, give it a name
docker run -d --name my_db postgres 
# stop a running container
docker stop my_db

Interactive mode

If you want to open a terminal inside the container and write commands there, you have to run the image in interactive mode, which requires the flags -i (interactive: keep listening to input on stdin) and -t (open a terminal interface). The default command in the ubuntu images is /bin/bash.

# both are the same for ubuntu
docker run -it ubuntu
docker run -it ubuntu /bin/bash

Type exit to exit the container. After this the container will stop too. If you want to keep your container running, then add the -d (detach) flag, which sends it to the background. To attach to this container you use exec, which executes commands in a running container: add the interactive flags and a command to run inside (a bash terminal).

docker run -i -t -d --name my_server ubuntu 
# attach to running container, exit won't kill it
docker exec -i -t my_server /bin/bash
# stop the container afterwards
docker stop my_server

You can add flags to your containers when you run them. Some useful ones are:

--rm # remove the container after it is stopped

Creating images

The main idea of Docker is to create your own special environments and reuse them. This means saving a running, configured/set up container to an image.

First set up the container.

# creating container
docker run -i -t --name my-server ubuntu:xenial
# install/set up environment with Python
root@628f79fad524:/# apt-get update
root@628f79fad524:/# apt-get install python
root@628f79fad524:/# echo 'print "Hello!"' >> hello.py
root@628f79fad524:/# exit 

Saving the container to an image.

docker commit my-server ubuntu-with-python
# see that it's created
docker images

# use the new image to start up container
docker run -it ubuntu-with-python
# see that the environment is set up
root@3555c0e3a292:/# python hello.py
Hello!
root@3555c0e3a292:/# exit

You can run the command also when you run the container.

docker run ubuntu-with-python python hello.py
Hello!

Multiple Docker containers linked (deprecated)

When you place a database in one container and your program in another, you can make the database accessible with links. You just need to pass the name of the database container with the --link argument when you run the container with your app:

docker run --name postgres_db -d postgres 
docker run -it --link postgres_db:db ubuntu

#inside the ubuntu container 
root@e3a977c99abb:/# apt-get update
root@e3a977c99abb:/# apt-get install netcat -y
# check the connection
root@e3a977c99abb:/# nc db 5432

Containers on a network

The recommended way to establish communication between containers is with a network. Use this network when creating the containers. It is possible to add it later when the containers are running too.

# create a new network
docker network create my_network
# see that it's there
docker network ls

# run containers on the network
docker run -d --name pos_db --network=my_network postgres
docker run -i -t --network=my_network ubuntu

#inside the ubuntu container
root@5434eb0cd678:/# apt-get update
root@5434eb0cd678:/# apt-get install netcat -y
# check the connection (containers on the network are reachable by name)
root@5434eb0cd678:/# nc pos_db 5432

Exit/shut down

To stop/shut down a container means to simply shut it down as you would with a power button to a computer. The progress on disk will remain, but the progress in memory will be lost.

To exit a container when you are inside it in the interactive mode you simply type exit  or hit ctrl+d.

To stop a container type docker stop <container name>:

# stop a running container
docker stop my-server

Clean the PC

Working with docker creates some artifacts. All the pulled images as well as stopped containers are stored in your computer and over time can take up some space. To remove containers type docker rm <container name> :

# remove one stopped container
docker rm my_server
# stop all containers -q lists container IDs
docker stop $(docker ps -a -q)
# remove all stopped containers
docker rm $(docker ps -a -q)

# remove a running container with --force (-f): 
docker rm -f my_server 
# remove all stopped/running containers
docker rm  -f $(docker ps -a -q)

You might want to remove the pulled/saved images as well:

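# remove specific image
docker rmi ubuntu:latest
# remove dangling (untagged) images
docker image prune
# remove all images that don't have an associated container
docker image prune -a
# or try to remove every image (images still used by a container are refused)
docker rmi $(docker images -q)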

This Docker overview looked at simple commands that you can run from the terminal. If you want to use Docker in production you should use Dockerfiles, an automated way to build Docker images.

Simple KNN implementation in Python 2.7

This is a simple KNN implementation for supervised learning. It deals with examples with known classes. You can find K-means clustering implementation in my next post to come.

Dummy dataset

First let’s make some dummy data: training examples with labels, and test examples with some approximate labels. I chose 2 classes based on x and y coordinates (one is more on the positive side, the other on the negative).

X_train = [
    [1, 1],
    [1, 2],
    [2, 4],
    [3, 5],
    [1, 0],
    [0, 0],
    [1, -2],
    [-1, 0],
    [-1, -2],
    [-2, -2]
]

y_train = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

X_test = [
    [5, 5],
    [0, -1],
    [-5, -5]
]

y_test = [1, 2, 2]

KNN theory

Supervised KNN works by finding the numerically “closest” neighbors and looking at those neighbors' classes. It then takes the class that is most popular among the k neighbors and says that this is most probably the class of the data point you are asking about.

One of the ways to calculate the “distance” is to do just that: calculate the Euclidean distance from geometry. For two points with two parameters, A(a1, a2) and B(b1, b2), the formula looks like this:

d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2)

It can work with many more features, not just two: you just keep summing the squared differences of the corresponding features one after the other.

Euclidean distance in code

In code it would look like this:

import math

def euclidean_dist(A, B):
    return math.sqrt(sum([(A[i]-B[i])**2 for i, _ in enumerate(A)]) )

The function euclidean_dist takes as input two points regardless of how many features they have and outputs the Euclidean distance.
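As a quick sanity check, here is a minimal sketch of the same distance written with zip instead of indexing. The name euclidean_dist_zip and the sample points are only illustrative, and the snippet is written so that it also runs on Python 3.

```python
import math

def euclidean_dist_zip(A, B):
    # pair up corresponding features and sum the squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))

# two features: the classic 3-4-5 triangle
d2 = euclidean_dist_zip([0, 0], [3, 4])        # 5.0
# three features work the same way
d3 = euclidean_dist_zip([1, 2, 3], [1, 2, 5])  # 2.0
print(d2)
print(d3)
```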

KNN in code

Now we just have to find the distance from each test set element to all of the training set elements and get the most popular class in the closest neighbor classes.  Here function knn does that.

def knn(X_train, y_train, X_test, k=1):
    y_test = []
    for test_row in X_test:
        eucl_dist = [euclidean_dist(train_row, test_row) for train_row in X_train]
        # indices of the k closest training rows; sorting indices (instead of
        # looking distances up with .index) also works when distances are tied
        closest_knn = sorted(range(len(eucl_dist)), key=lambda i: eucl_dist[i])[:k]
        closest_labels_knn = [y_train[x] for x in closest_knn]
        y_test.append(get_most_common_item(closest_labels_knn))
    return y_test

Finally, we need a helper function to find the most popular item in an array. I chose a solution where I put the class as a key in a dictionary and keep increasing the value for the key to count occurrences:

from collections import defaultdict
from operator import itemgetter

def get_most_common_item(array):
    count_dict = defaultdict(int)
    for key in array:
        count_dict[key] += 1
    key, count = max(count_dict.iteritems(), key=itemgetter(1))
    return key
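If you don't want to count by hand, the standard library can do it too. This is a small alternative sketch using collections.Counter; the name most_common_item is made up for this example and it is not the helper used in the rest of the post.

```python
from collections import Counter

def most_common_item(items):
    # most_common(1) returns a list with a single (item, count) pair
    return Counter(items).most_common(1)[0][0]

print(most_common_item([1, 2, 2]))  # 2
```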

Now we can test this out with our custom dataset setting neighbor count to 2:

print knn(X_train, y_train, X_test, k=2)  # output: [1, 1, 2]

And we can see that this somewhat matched my imagined labels [1, 2, 2].
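To put a number on "somewhat matched", a rough sketch of an accuracy check could look like this. The accuracy helper is my own illustration, and the two label lists are simply copied from the text above.

```python
def accuracy(y_true, y_pred):
    # fraction of positions where the prediction equals the expected label
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(matches) / len(y_true)

y_test = [1, 2, 2]   # imagined labels from the dummy dataset
y_pred = [1, 1, 2]   # output of knn(..., k=2) reported above
acc = accuracy(y_test, y_pred)
print(acc)  # 2 of the 3 labels match
```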

You can find the whole code on my GitHub repository.

Keyword matching with Aho-Corasick

For example, I have a lot of different keywords and I want to find out whether any of them are mentioned in some long text. I also want to catch them if they are part of another word, like two words accidentally written together (e.g., greenplants). This is a considerable amount of searching.

Brute Force

One way to search for keywords would be to take each keyword and go through the text, or the other way around. This results in a lot of comparison operations (keywords * words in text), especially if the text is large.

You could first count which words are the most common in the type of texts you use and search for those first. If you are searching line by line in a text and match it by the first encountered keyword from your list this would somewhat reduce the number of operations.
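A minimal sketch of the brute force idea, assuming plain substring checks are good enough. The function name and the sample keywords are only for illustration.

```python
def brute_force_search(keywords, text):
    # check every keyword against the text: one substring scan per keyword
    return [kw for kw in keywords if kw in text]

keywords = ['green', 'plants', 'blue']
found = brute_force_search(keywords, 'we sell greenplants here')
print(found)  # ['green', 'plants']
```

It does catch the two words written together, but the text is scanned once for every keyword.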

Dictionary

A better way would be to make a dictionary from the keywords, where the key is the keyword and the value is anything you want to match it with (e.g., a category). Now you only need to take each word in a line or text and look it up in the dictionary (on average O(1) for one word). Nice: now you only search as many times as you have words in the text. However, if you want to search for a partial match, you still need to make substrings from the words or add some other additional logic.
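Here is a small sketch of the dictionary approach. The categories and the function name dict_search are hypothetical; the point is just to show the lookup and where it falls short.

```python
keyword_categories = {'green': 'color', 'plants': 'flora'}  # hypothetical categories

def dict_search(line, keyword_categories):
    # one dictionary lookup (average O(1)) per whitespace-separated word
    return [(w, keyword_categories[w]) for w in line.split() if w in keyword_categories]

exact = dict_search('green plants grow', keyword_categories)
joined = dict_search('greenplants grow', keyword_categories)
print(exact)   # [('green', 'color'), ('plants', 'flora')]
print(joined)  # [] - the joined word is not a key, so nothing matches
```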

Automaton

Now, the best way is to take a line and find all the keywords as you go through it, similar to how you would read it. This can be done with an automaton. A nice algorithm for searching in text is the Aho-Corasick algorithm. It takes a list of keywords and generates an automaton from them. With this you can match your keywords in time linear in the length of the text (plus the number of matches), which is about as good as it gets.
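To make the idea less abstract, here is a rough pure-Python sketch of an Aho-Corasick style automaton: a trie of the keywords plus failure links that say where to continue when the next character does not fit. This is only my illustration of the technique, not how any particular library implements it, and the names build_automaton and search are made up.

```python
from collections import deque

def build_automaton(keywords):
    # per state: outgoing edges, a failure link and the keywords ending there
    goto, fail, out = [{}], [0], [[]]
    for word in keywords:                      # 1) build the trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())            # 2) failure links, breadth-first
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # keywords that end at the fallback state also end here
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    state, found = 0, []
    for i, ch in enumerate(text):              # a single pass over the text
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            found.append((i, word))            # i is the end index of the match
    return found

automaton = build_automaton(['he', 'she', 'hers', 'her'])
matches = search('this is hers', automaton)
print(matches)  # [(9, 'he'), (10, 'her'), (11, 'hers')]
```

The text is read once, character by character, no matter how many keywords there are.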

Aho-Corasick

Python has a simple Aho-Corasick library, pyahocorasick. According to pyahocorasick, this lib works with unicode in Python 3, whereas in Python 2 it only works with byte strings.

pip install pyahocorasick

Now, let's take some keywords and make an automaton. I have added a category for each keyword and stored both in a tuple (keyword, category). In this case all keywords are in the same category.

keywords = [
    ('he', 1),
    ('she', 1),
    ('hers', 1),
    ('her', 1)
]
text = [
    ' he is here ',
    ' this is she ',
    ' this is hers ',
    ' her bag is big '
]

First, keywords and their respective categories are added to a trie structure, from which an automaton will be generated.

import ahocorasick as ahc

def make_aho_automaton(keywords):
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

A = make_aho_automaton(keywords)

Next, let’s make a function which finds keywords in text. Searching in pyahocorasick happens via automaton.iter(line) function. This returns an iterable object with the end index of the word matched and the tuple of the respective keyword and its category.

def find_keywords(line, A):
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
        found_keywords.append(keyw)
    return found_keywords

Searching for keywords yields a list of the found keywords. Because the automaton was generated from keywords with no spaces around them, it also finds keywords inside other words (“here” contains “he” as well as “her”).

for line in text:
    print line, ':', find_keywords(line, A)

# he is here : ['he', 'he', 'her']
# this is she : ['she', 'he']
# this is hers : ['he', 'her', 'hers']
# her bag is big : ['he', 'her']

Let’s remake the automaton, this time with spaces around the keywords:

keywords = [
    (' he ', 1),
    (' she ', 1),
    (' hers ', 1),
    (' her ', 1)
]

A_spaces = make_aho_automaton(keywords)

Now the A_spaces automaton finds only fully matching words:

# he is here  : [' he ']
# this is she  : [' she ']
# this is hers  : [' hers ']
# her bag is big  : [' her ']

Replacing keywords

A useful function would be to replace the found keywords with some characters for anonymization. Here the function returns a list as long as the line, telling me which indices are to be hidden.

def find_keyword_locations(line, A):
    line_indices = [False for x in line]
    for end_index, (cat, keyw) in A.iter(line):
        start_index = end_index - len(keyw) + 2  # start index after first space
        for i in range(start_index, end_index):  # end index excluding last space
            line_indices[i] = True
    return line_indices

Let us try to replace the keywords with a dash or remove them altogether:

new_text_removed = []
new_text_replaced = []
for line in text:
    line_indices = find_keyword_locations(line, A_spaces)
    line = list(line) # split string into list
    new_line = "".join([line[i] if not x else '' for i, x in enumerate(line_indices)])
    new_text_removed.append(new_line)
    new_line = "".join([line[i] if not x else '-' for i, x in enumerate(line_indices)])
    new_text_replaced.append(new_line)

print text
print new_text_removed
print new_text_replaced

Original text:

[' he is here ', ' this is she ', ' this is hers ', ' her bag is big ']

Removed/replaced keywords in text:

['  is here ', ' this is  ', ' this is  ', '  bag is big ']
[' -- is here ', ' this is --- ', ' this is ---- ', ' --- bag is big ']

The last function seems pretty useful for text anonymization tasks if you, for example, need to anonymize person names/lastnames.
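The index arithmetic is easy to get wrong, so here is a self-contained sketch of the same masking logic that runs without pyahocorasick. The matches list is a hand-written stand-in for what A_spaces.iter(line) would yield, and mask_matches is a hypothetical helper name.

```python
def mask_matches(line, matches, fill='-'):
    # matches: (end_index, keyword) pairs, where every keyword is padded
    # with one space on both sides, as in the A_spaces automaton
    hide = [False] * len(line)
    for end_index, keyw in matches:
        start_index = end_index - len(keyw) + 2   # skip the leading space
        for i in range(start_index, end_index):   # stop before the trailing space
            hide[i] = True
    return ''.join(fill if h else ch for ch, h in zip(line, hide))

line = ' he is here '
matches = [(3, ' he ')]   # hand-written stand-in for A_spaces.iter(line)
replaced = mask_matches(line, matches)
removed = mask_matches(line, matches, '')
print(repr(replaced))  # ' -- is here '
print(repr(removed))   # '  is here '
```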

To conclude, an automaton can search really fast for selected keywords. It is easy to use if you need to match whole words. Partial matching can be achieved by not adding spaces around the keywords. However, you have to choose carefully so that you do not match accidental parts of words. I would consider making two automata: one for matching whole words, another for matching parts of words.

Full code can be found on GitHub.

Generating Graphs on Server with no UI in Python

So you run the analysis on a Linux server because the data is huge. Now you want to generate a nice graph or plot to see what's going on.

The script

Using the Iris Dataset I want to generate a plot showing the distribution of the flower examples like this:

[Figure: Iris plot]

To reduce features from 4 to 2 I use sklearn’s  truncated singular value decomposition (which also works on sparse matrices):

import matplotlib.pyplot as plt
from sklearn import datasets 
from sklearn.decomposition import TruncatedSVD 

iris = datasets.load_iris() 
X = iris.data 
y = iris.target  # Labels

# Visualize result using SVD
svd = TruncatedSVD(n_components=2) 
X_reduced = svd.fit_transform(X)

# Initialize scatter plot with x and y axis values
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=25)
plt.savefig("iris_plot.png")
plt.show()

When running this script on Windows everything seems fine.

Setting up server

On a Linux server with no graphical interface or UI there are no tools for the server to generate a picture. On a clean server I need to install sklearn and its dependencies as well as matplotlib. To make this easier I use Anaconda:

wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
bash 

However, running the script now throws an error:

# ImportError: libSM.so.6: cannot open shared object file: No such file or directory

So I also install these two libs:

sudo apt-get install -y libsm6 libxrender1

After this, running the script throws:

# RuntimeError: Invalid DISPLAY variable

Generate plot

To enable the server to generate the plot I switch to a different ‘backend’. Pyplot enables various backends for different file formats. To generate and save a .png image I use ‘agg’ backend:

plt.switch_backend('agg')

With this I can create my plot and save it. As a side note, the .show() method will not display anything when the backend is set to ‘agg’.

The full script code can be seen on GitHub.

Dealing with JSON Encoded as JSON

The other day I had to fix a file containing JSON that had been encoded as JSON (JSON nested in JSON). It looked something like this:

"'{\"animal_list\": [{\"type\": \"mammal\", \"description\": \"Tall, with brown spots, lives in Savanna\", \"name\": \"Giraffa camelopardalis\"},{\"type\": \"mammal\", \"description\": \"Big, grey, with big ears, smart\", \"name\": \"Loxodonta africana\"},{\"type\": \"reptile\", \"description\": \"Green, changes color, lives in \"East Africa\"\", \"name\": \"Trioceros jacksonii\"}]}'"

It was all on one line and considerably big: around 1.5 million entries when properly formatted.

Quick’n’Dirty

The quick and dirty way to decode this would be to:

  • Put a newline between each new entry (for readability)
  • Remove the quotation marks (" and ') from the beginning and the end
  • Un-escape the quotation marks

with open('animals.json', 'r') as jsonf, open('animals_editted.json', 'w') as newfile:
    for line in jsonf:   # there will be only one line
        newline = line.replace('},{', '},\n{')
        # drop a trailing newline first, then the wrapping " and '
        newline = newline.strip().strip('"').strip("'")
        newline = newline.replace('\\"', '"')
        print newline
        newfile.write(newline)

Now the JSON looks better but there is a problem:

{"animal_list": [{"type": "mammal", "description": "Tall, with brown spots, lives in Savanna", "name": "Giraffa camelopardalis"},
{"type": "mammal", "description": "Big, grey, with big ears, smart", "name": "Loxodonta africana"},
{"type": "reptile", "description": "Green, changes color, lives in "East Africa"", "name": "Trioceros jacksonii"}]}

By un-escaping all the quotation marks, those that need to stay escaped in the description field are now un-escaped as well. Now the challenge is to fix these description fields.

One way is to take the line, find “description” in it and split it into parts: the beginning of the line, the description value, and the end of the line. Once we have the description value, we can escape the quotation marks there and glue all the parts back together:

newfile = open('animals_cleaned.json', 'w')
with open('animals_editted.json', 'r') as jsonf:
    for line in jsonf: 
        start, end = line.split('\"description\": \"')
        descr_val, end = end.split('", "')  
        # escape the quotes
        descr_val = descr_val.replace('"', '\\"')  
        descr = '\"description\": \"' + descr_val + '", "' 
        newline = start + descr + end
        newfile.write(newline)
    newfile.close()

Now the JSON is importing successfully:

import json

with open('animals_cleaned.json', 'r') as nf:
    json_object = json.load(nf)
print len(json_object['animal_list'])
#>> 3

To sum up, this was a brute force way of decoding JSON encoded as JSON. This solution took me about 10 minutes and was good enough to fix the problem. An alternative would be to use a json.loads() hook, as explained in the “encode nested JSON in JSON” answer.
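For comparison, when the nested JSON is well formed, decoding twice with the standard json module is usually enough. The sketch below builds its own small double-encoded sample (the record is made up). It would not have worked on my file as is, because of the extra single quotes and the quotes inside the description that end up un-escaped.

```python
import json

# a made-up record, encoded twice to imitate JSON nested in JSON
record = {"animal_list": [{"name": "Giraffa camelopardalis", "type": "mammal"}]}
double_encoded = json.dumps(json.dumps(record))

inner_text = json.loads(double_encoded)   # first pass gives back a plain string
decoded = json.loads(inner_text)          # second pass gives back the dictionary
print(decoded == record)                  # True
```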

See the full code on GitHub.

Iris Dataset and Xgboost Simple Tutorial

I had the opportunity to start using the xgboost machine learning algorithm. It is fast and shows good results. Here I will use it for multiclass prediction with the iris dataset from scikit-learn.

Installing Anaconda and xgboost

In order to work with the data, I need to install various scientific libraries for python. The best way I have found is to use Anaconda. It simply installs all the libs and helps to install new ones. You can download the installer for Windows, but if you want to install it on a Linux server, you can just copy-paste this into the terminal:

wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
bash Anaconda2-4.0.0-Linux-x86_64.sh -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc
bash 

After this, use conda to install pip which you will need for installing xgboost. It is important to install it using Anaconda (in Anaconda’s directory), so that pip installs other libs there as well:

conda install -y pip libgcc

Now, a very important step: install xgboost Python Package dependencies beforehand. I install these ones from experience:

sudo apt-get install -y make g++ build-essential gfortran libatlas-base-dev liblapacke-dev python-dev python-setuptools libsm6 libxrender1

I upgrade my python virtual environment to have no trouble with python versions:

pip install --upgrade virtualenv

And finally I can install xgboost with pip (keep fingers crossed):

pip install xgboost

This command installs the latest xgboost version, but if you want to use a previous one, just specify it with:

pip install xgboost==0.4a30

Now test if everything has gone well: type python in the terminal and try to import xgboost:

import xgboost as xgb

If you see no errors – perfect.

Xgboost Demo with the Iris Dataset

Here I will use the Iris dataset to show a simple example of how to use Xgboost.

First you load the dataset from sklearn, where X will be the data, y – the class labels:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

Then you split the data into train and test sets with 80-20% split:

# in scikit-learn 0.18 and newer, import this from sklearn.model_selection instead
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next you need to create the Xgboost specific DMatrix data format from the numpy array. Xgboost can work with numpy arrays directly, or load data from svmlight files and other formats. Here is how to work with numpy arrays:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

If you want to use svmlight for less memory consumption, first dump the numpy array into svmlight format and then just pass the filename to DMatrix:

import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')

Now for the Xgboost to work you need to set the parameters:

param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this dataset
num_round = 20  # the number of training iterations

Different datasets perform better with different parameters. The result can be really low with one set of params and really good with others. You can look at this Kaggle script to see how to search for the best ones. Generally, try eta values of 0.1, 0.2 and 0.3, max_depth in the range of 2 to 10 and num_round around a few hundred.
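A simple way to organise such a search is to generate all the combinations first. The sketch below only builds the candidate parameter dictionaries with itertools.product; the variable names are my own, and the training and scoring of each candidate (for example with xgb.train on a held-out split) is left out on purpose.

```python
from itertools import product

etas = [0.1, 0.2, 0.3]
max_depths = range(2, 11)   # 2 to 10 inclusive

param_grid = []
for eta, max_depth in product(etas, max_depths):
    param_grid.append({
        'max_depth': max_depth,
        'eta': eta,
        'objective': 'multi:softprob',
        'num_class': 3,
    })

print(len(param_grid))   # 3 eta values x 9 depths = 27 candidates
print(param_grid[0])
```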

Train

Finally the training can begin. You just type:

bst = xgb.train(param, dtrain, num_round)

To see how the model looks you can also dump it in human readable form:

bst.dump_model('dump.raw.txt')

And it looks something like this (f0, f1, f2 are features):

booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=0.426036
    2:leaf=-0.218845
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
    1:leaf=-0.213018
    2:[f3<1.75] yes=3,no=4,missing=3
        3:[f2<4.95] yes=5,no=6,missing=5
            5:leaf=0.409091
            6:leaf=-9.75349e-009
        4:[f2<4.85] yes=7,no=8,missing=7
            7:leaf=-7.66345e-009
            8:leaf=-0.210219
....

You can see that each tree is no deeper than 3 levels as set in the params.

Use the model to predict classes for the test set:

preds = bst.predict(dtest)

But the predictions look something like this:

[[ 0.00563804 0.97755206 0.01680986]
 [ 0.98254657 0.01395847 0.00349498]
 [ 0.0036375 0.00615226 0.99021029]
 [ 0.00564738 0.97917044 0.0151822 ]
 [ 0.00540075 0.93640935 0.0581899 ]
....

Here each column represents class number 0, 1, or 2. For each line you need to select that column where the probability is the highest:

import numpy as np
best_preds = np.asarray([np.argmax(line) for line in preds])

Now you get a nice list with predicted classes:

[1, 0, 2, 1, 1, ...]
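The same selection can also be done without a Python loop by asking numpy for the argmax along each row. A small sketch, using the five rows printed above as sample data:

```python
import numpy as np

# first five rows of the predictions shown above
preds = np.array([
    [0.00563804, 0.97755206, 0.01680986],
    [0.98254657, 0.01395847, 0.00349498],
    [0.0036375, 0.00615226, 0.99021029],
    [0.00564738, 0.97917044, 0.0151822],
    [0.00540075, 0.93640935, 0.0581899],
])
best_preds = np.argmax(preds, axis=1)   # index of the largest value in each row
print(best_preds)   # [1 0 2 1 1]
```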

Determine the precision of this prediction:

from sklearn.metrics import precision_score

print precision_score(y_test, best_preds, average='macro')
# >> 1.0
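In case it is not obvious what average='macro' means: precision is computed per class and the per-class values are then averaged without weighting. Here is a simplified pure-Python sketch of that idea. The function name and the sample labels are made up, and unlike sklearn it only looks at the classes that occur in y_true.

```python
def macro_precision(y_true, y_pred):
    # precision of a class = correct predictions of it / all predictions of it
    per_class = []
    for label in sorted(set(y_true)):
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        correct = sum(1 for t in predicted if t == label)
        per_class.append(float(correct) / len(predicted) if predicted else 0.0)
    # 'macro' averages the per-class values without weighting by class size
    return sum(per_class) / len(per_class)

y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]
score = macro_precision(y_true, y_pred)
print(score)   # (1.0 + 0.5 + 1.0) / 3
```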

Perfect! Now save the model for later use:

from sklearn.externals import joblib

joblib.dump(bst, 'bst_model.pkl', compress=True)
# bst = joblib.load('bst_model.pkl') # load it later

Now you have a working model saved for later use, and ready for more prediction.

See the full code on GitHub.

Working With Numpy Matrices

At the beginning, when I started working with natural language processing, I used the default Python lists. But soon enough, with bigger experiments and more data, I ran out of RAM. Python lists are not optimized for memory space, so on to Numpy.

Numpy arrays are much like arrays in C: generally you create the array in the size you need beforehand and then fill it. Merging and appending are not recommended, as Numpy will create one empty array in the size of the arrays being merged and then just copy the contents into it.
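A minimal sketch of the "allocate first, fill later" pattern. The sizes and the values written into the rows are arbitrary and only there for illustration.

```python
import numpy as np

n_rows, n_cols = 4, 3   # sizes chosen only for illustration
matrix = np.zeros((n_rows, n_cols))   # allocate once, up front

for i in range(n_rows):
    # fill row i in place instead of appending a new row
    matrix[i, :] = [i, i * 2, i * 3]

print(matrix.shape)   # (4, 3)
print(matrix[3])
```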

Here are some ways Numpy arrays (ndarray) can be manipulated:

Create ndarray

Some ways to create numpy matrices are:

  • Create an ndarray from a Python list:
import numpy as np

values = [1, 2, 3]  # not named list, which would shadow the built-in
c = np.asarray(values)

# Array items as ndarray
c = np.array([1, 2, 3])

# Creating 2d ndarray from nested lists
c = np.array([[1., 2.], [1., 2.]])

  • Create an ndarray in the size you need filled with ones, zeros or uninitialized values:
# A 2x2 2d array shape for the arrays in the format (rows, columns)
shape = (2, 2)

# Uninitialized values (whatever happens to be in memory)
c = np.empty(shape)

d = np.ones(shape)
e = np.zeros(shape)

# Creating new array in the shape of c, with uninitialized values
d = np.empty_like(c)
# Creating new array in the shape of c, filled with 0
d = np.zeros_like(c)

Slice

Sometimes I need to select only a part of all columns or rows in a 2d matrix. For example, matrices:

a = np.asarray([[1,1,2,3,4], # 1st row
                [2,6,7,8,9], # 2nd row
                [3,6,7,8,9], # 3rd row
                [4,6,7,8,9], # 4th row
                [5,6,7,8,9]  # 5th row
              ])

b = np.asarray([[1,1],
                [1,1]])

# Select row in the format a[start:end], if start or end omitted it means all range.
y = a[:1]  # 1st row
y = a[0:1] # 1st row
y = a[2:5] # select rows from 3rd to 5th row

# Select column in the format a[start:end, column_number]
x = a[:, -1] # -1 means first from the end
x = a[:,1:3] # select cols from 2nd col until 3rd
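It helps to look at the shapes that come back from these slices. A short check with the same matrix a as above:

```python
import numpy as np

a = np.asarray([[1, 1, 2, 3, 4],
                [2, 6, 7, 8, 9],
                [3, 6, 7, 8, 9],
                [4, 6, 7, 8, 9],
                [5, 6, 7, 8, 9]])

print(a[:1].shape)      # (1, 5) - a row slice stays 2d
print(a[2:5].shape)     # (3, 5)
print(a[:, -1])         # [4 9 9 9 9] - a single column comes back as 1d
print(a[:, 1:3].shape)  # (5, 2)
```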

Merge arrays

Merging numpy arrays is not advised because internally numpy will create a big empty array and then copy the contents into it. It would be best to create the intended size at the beginning and then just fill it up. However, sometimes you cannot avoid merging. In this case, numpy has some built-in functions:

Concatenate 

1d arrays:

a = np.array([1, 2, 3])
b = np.array([5, 6])
print np.concatenate([a, b, b])  
# >>  [1 2 3 5 6 5 6]

2d arrays:

a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5, 6]])  # a 2d array with one row (the 1d b would raise an error here)

# axis=0 - concatenate along rows
print np.concatenate((a2, b2), axis=0)
# >>   [[1 2]
#       [3 4]
#       [5 6]]

# axis=1 - concatenate along columns, but first b2 needs to be transposed:
b2.T
#>> [[5]
#    [6]]
np.concatenate((a2, b2.T), axis=1)
#>> [[1 2 5]
#    [3 4 6]]

Append – append values to the end of an array

1d arrays:

# 1d arrays
print np.append(a, a2)
# >> [1 2 3 1 2 3 4]

print np.append(a, a)
# >> [1 2 3 1 2 3]

2d arrays – both arrays must be 2d and match in shape along the other axis, so the 1d array b is reshaped first:

print np.append(a2, b.reshape(1, 2), axis=0)
# >> [[1 2]
#     [3 4]
#     [5 6]]

print np.append(a2, b.reshape(2, 1), axis=1)
# >> [[1 2 5]
#     [3 4 6]]

Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

print np.hstack([a, b])
# >> [1 2 3 5 6]

print np.vstack([a, a])
# >> [[1 2 3]
#     [1 2 3]]

2d arrays:

print np.hstack([a2,a2]) # arrays must match shape
# >> [[1 2 1 2]
#     [3 4 3 4]]

print np.vstack([a2, b])
# >> [[1 2]
#     [3 4]
#     [5 6]]

Types in ndarray

Besides the default float, numpy can hold all the common types. If any of the numbers in an array is a float, all numbers will be converted to float:

a = np.array([1, 2, 3.3])
print a
# >> [ 1.   2.   3.3]

But you can easily cast the type to int, float or other:

print a.astype(int)
# >> [1 2 3]
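One thing to keep in mind is that casting to int drops the fractional part instead of rounding. A small sketch with made-up values; np.rint is used here as one way to round first.

```python
import numpy as np

a = np.array([1.7, -1.7, 3.3])
truncated = a.astype(int)            # drops the fractional part
rounded = np.rint(a).astype(int)     # rounds to the nearest integer first
print(truncated)   # [ 1 -1  3]
print(rounded)     # [ 2 -2  3]
```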

String arrays need to be created as arrays with the type S1 for strings with length 1, S2 for length 2 and so on. numpy.chararray() creates an array with this type. You need to specify the shape of the array and itemsize, the length of each string.

chararray = np.chararray([3,3], itemsize=3)
chararray[:] = 'abc' # assign value to all fields
print chararray
#>> [['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']]

Read/write to file

Write a numpy array to a file in plain text form with numpy.savetxt() and load it with numpy.loadtxt():

a2 = np.array([
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
])

np.savetxt('test.txt', a2, delimiter=',')
a2_new = np.loadtxt('test.txt', delimiter=',')
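Note that the array comes back as floats even if you saved integers. Here is a small round-trip sketch that writes to a temporary directory; the tiny array and the path are only for the example.

```python
import os
import tempfile
import numpy as np

a2 = np.array([[1, 2, 1], [2, 1, 2]])
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

np.savetxt(path, a2, delimiter=',')        # the default format writes floats
a2_new = np.loadtxt(path, delimiter=',')   # loads as float64 by default

print(a2_new.dtype)
print(np.array_equal(a2, a2_new))          # True - the values survive the round trip
a2_int = a2_new.astype(int)                # cast back if integers are needed
```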

Writing sparse matrices

However, in machine learning if you have a large, sparse matrix (with a lot of values that are 0), reading and writing large matrices is faster and the file is smaller if you use the svmlight format:

from sklearn.datasets import dump_svmlight_file, load_svmlight_file

matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
]

labels = [1,1,1,1,1,2,2]


dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

# The file looks like this:

# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 2 0:1 5:3 13:1 14:2
# 2 0:1 5:3 13:1 14:2

svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

load_svmlight_file returns a tuple: the feature matrix in Compressed Sparse Row format and the array of labels. Use .toarray() to get the dense matrix back:

svm_loaded[0].toarray() # feature matrix is at index 0
svm_loaded[1] # labels are at index 1
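
To make the file format concrete, here is a rough sketch of how one dense row maps to one line of the file. The helper to_svmlight_line is hypothetical and only for illustration. In practice dump_svmlight_file does this for you.

```python
def to_svmlight_line(label, row):
    """Format one dense row as '<label> <index>:<value> ...', skipping zeros."""
    pairs = ['%d:%g' % (i, v) for i, v in enumerate(row) if v != 0]
    return ' '.join([str(label)] + pairs)


row = [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
line = to_svmlight_line(2, row)
print(line)
# prints: 2 0:1 5:3 13:1 14:2
```

Only four of the fifteen values end up in the file, which is where the size saving comes from.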

To sum up, this post looked at how to:

  • create numpy arrays,
  • slice arrays,
  • merge arrays,
  • work with the basic types of numpy arrays,
  • read and write arrays to file,
  • read and write sparse matrices in svmlight format.

This was just an introduction to numpy matrices: how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.

Cleaning Text for Natural Language Processing Tasks in Machine Learning in Python

Often when I work with text I need it to be clean. That means removing gibberish or symbols/words I don’t need and making all letters lowercase.

For example, a “dirty” line of text:

text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

Using Python 2.7:

1) Read the line from list:

for line in text:
    # do something with line

or read from file:

with open('file.txt', 'r') as f:
    for line in f:
        # do something with line

2) Decode the line from a UTF-8 string of bytes to a unicode string to work with special symbols:

line = line.decode('utf8')
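
A small sketch of what decoding changes, using a made-up three-letter sample. Before decoding, each accented letter takes two bytes, so per-character operations would see the halves separately:

```python
raw = b'\xc4\x8c\xc3\xa1r'   # the UTF-8 bytes of "Čár"
decoded = raw.decode('utf8')

n_bytes = len(raw)       # 5: two bytes each for Č and á, one for r
n_chars = len(decoded)   # 3: one item per character
print(n_bytes)
print(n_chars)
```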

3) Remove the symbols you don’t need. With replace() you can stack as many replace operations as you want.

line = line.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')
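
If the chain of replace() calls gets long, a regex character class is one possible alternative. This is a sketch of the same step, not a required change, and the sample string is made up:

```python
import re

sample = 'TEXT: +0012, 3.333'

# Chained replace(), as in the step above
chained = sample.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')

# One pass with a character class listing the same four symbols
with_regex = re.sub(r'[+.,:]', ' ', sample)

print(chained == with_regex)
```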

4) Remove numbers. Here you can use the regex \d+. Because dots have already been removed we only need to check for whole numbers.

import re
line = re.sub(r"\b\d+\b", " ", line)

This regex replaces every run of digits that stands on its own as a word with a space. \b is a zero-width word boundary, so the spaces around a number are not consumed and neighbouring numbers like 09 08 2016 are all removed.
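
To see why the boundaries matter, here is a small check on a made-up sample string. It compares a pattern that consumes the characters around the number, (^|\W)\d+($|\W), with the zero-width \b\d+\b:

```python
import re

sample = "date 09 08 2016 end"

# Consuming the neighbours: the space eaten by one match is no longer
# available to the next one, so every other number survives.
consuming = re.sub(r"(^|\W)\d+($|\W)", " ", sample)

# Word boundaries are zero-width, so each number is matched on its own.
bounded = re.sub(r"\b\d+\b", " ", sample)

print(consuming.split())
# prints: ['date', '08', 'end']
print(bounded.split())
# prints: ['date', 'end']
```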

Alternatively you can check whether a word evaluates to a number with a simple function – is_digit() attempts to turn a string into an int. If it succeeds, the function returns True.

def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False

Use this function on each word in the line by splitting the line on whitespace with line.split(). The new_line list will hold only those words that are not numbers. At the end the list is joined back into a string.

new_line = []
for word in line.split():
    if not is_digit(word):
        new_line.append(word)
line = " ".join(new_line)
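
The same filter can be written more compactly with a generator expression. A sketch with a made-up sample line; note that int() only accepts whole numbers, so a token like 3e5 is kept:

```python
def is_digit(word):
    try:
        int(word)
        return True
    except ValueError:
        return False


sample = "call 001234561234 on 09 08 2016 or pay 3e5"
filtered = " ".join(w for w in sample.split() if not is_digit(w))
print(filtered)
# prints: call on or pay 3e5
```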

5) Now only lowercasing and special characters remain. Since the goal is plain Latin text, the special characters need to be converted to Latin letters. This can be done using the transliterate Python package or by hand. Here is a simple transliteration dictionary made from a list of character pairs:

cedilla2latin = [[u'Á', u'A'], [u'á', u'a'], [u'Č', u'C'], [u'č', u'c'], [u'Š', u'S'], [u'š', u's']]
tr = dict(cedilla2latin)

In this way you can also have multiple symbols stand for one special symbol (like German [u'ä', u'ae']).
With the dictionary I can recreate the letters in Latin:

def transliterate(line):
    new_line = ""
    for letter in line:
        if letter in tr:
            new_line += tr[letter]
        else:
            new_line += letter
    return new_line

And call the transliterate function:

line = transliterate(line)
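
If you do not want to maintain the dictionary by hand, a different technique is to decompose each character with the standard unicodedata module and drop the combining marks. This is a swapped-in alternative, not the method used above, and strip_accents is a hypothetical name. The sketch assumes Python 3 (or a source file declared as UTF-8):

```python
import unicodedata


def strip_accents(text):
    # NFKD splits "Č" into "C" plus a combining caron, and so on
    decomposed = unicodedata.normalize('NFKD', text)
    # keep everything except the combining marks
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


result = strip_accents(u'Čárákterš')
print(result)
# prints: Carakters
```

Unlike the dictionary, this cannot map one symbol to several letters (ä becomes a, not ae), so the hand-made table is still the better choice when you need such rules.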

6) After clearing away unnecessary symbols, finally I can lowercase the line:

line = line.lower()

And finally the line is reduced to simple Latin characters.

print line
>> this is dirty text a phone number money some date like and weird carakters
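
Put together, steps 3 to 6 can be sketched as one function. This is a compact variant under a few assumptions, not the full script: it targets Python 3, expects an already decoded string, uses a regex character class for step 3 and str.isdigit() for step 4. The names TR and clean are only for illustration.

```python
import re

TR = {u'Á': u'A', u'á': u'a', u'Č': u'C', u'č': u'c', u'Š': u'S', u'š': u's'}


def clean(line):
    line = re.sub(r'[+.,:]', ' ', line)                    # 3) drop symbols
    words = [w for w in line.split() if not w.isdigit()]   # 4) drop numbers
    line = u' '.join(words)
    line = u''.join(TR.get(ch, ch) for ch in line)         # 5) transliterate
    return line.lower()                                    # 6) lowercase


text = (u'This is dirty TEXT: A phone number +001234561234, moNey 3.333, '
        u'some date like 09.08.2016 and weird Čárákterš.')
cleaned = clean(text)
print(cleaned)
# prints: this is dirty text a phone number money some date like and weird carakters
```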

If you need to retain some numbers or check for other fields, go ahead and write more specific regexes. However, Python's re module uses a backtracking engine, so some patterns (for example, ones with nested quantifiers) can take far longer than linear time on long lines. This can slow you down, especially if you are working with millions of lines.

As a side note, a more general cleaning method that leaves only Latin characters is to check the ASCII value of each character with ord().

def get_latin(line):
    # keep A-Z (65-90) and a-z (97-122), turn everything else into a space
    kept = [c if 65 <= ord(c) <= 90 or 97 <= ord(c) <= 122 else ' ' for c in line]
    return ' '.join(''.join(kept).split())
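
The same filter can also be sketched with a regex character class. keep_latin is a hypothetical name and the sample is made up. Note what happens to accented letters: they are dropped rather than converted, so run the transliteration step first if you want to keep them.

```python
import re


def keep_latin(line):
    # replace every run of characters outside A-Z and a-z with one space
    return re.sub(r'[^A-Za-z]+', ' ', line).strip()


kept = keep_latin(u'moNey 3.333, weird Čárákterš.')
print(kept)
# prints: moNey weird r kter
```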

Full code of the above description is available below or here: