All posts by ieva

Simple KNN implementation in Python 2.7

This is a simple KNN implementation for supervised learning. It deals with examples with known classes. You can find K-means clustering implementation in my next post to come.

Dummy dataset

First let’s make some dummy data with training examples and labels and test examples some approximate labels. I chose 2 classes based on x and y coordinates (one is more on the positive side, the other – negative).

X_train = [
    [1, 1],
    [1, 2],
    [2, 4],
    [3, 5],
    [1, 0],
    [0, 0],
    [1, -2],
    [-1, 0],
    [-1, -2],
    [-2, -2]

y_train = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

X_test = [
    [5, 5],
    [0, -1],
    [-5, -5]

y_test = [1, 2, 2]

KNN theory

Right,  supervised KNN works by finding the numerically “closest” neighbors and seeing what those neighbors classes are like. Then taking the one that’s more popular among the n neighbors and saying that’s most probably the class for the data point you’re asking about.

One of the ways to calculate the “distance” is to do just that – calculate the distance with Euclidean distance from geometry. The formula looks like this for two points with two parameters A(a1, a2) and B(b1, b2) :

It can work with many more features not just two, you just keep summing the corresponding squared subtractions one after the other.

Euclidean distance in code

In code it would look like this:

import math

def euclidean_dist(A, B):
    return math.sqrt(sum([(A[i]-B[i])**2 for i, _ in enumerate(A)]) )

The function euclidean_dist takes as input two points regardless of how many features they have and outputs the Euclidean distance.

KNN in code

Now we just have to find the distance from each test set element to all of the training set elements and get the most popular class in the closest neighbor classes.  Here function knn does that.

def knn(X_train, y_train, X_test, k=1):
    y_test = []
    for test_row in X_test:
        eucl_dist = [euclidean_dist(train_row, test_row) for train_row in X_train]
        sorted_eucl_dist = sorted(eucl_dist)
        closest_knn = [eucl_dist.index(sorted_eucl_dist[i]) for i in xrange(0, k)] if k > 1 else [eucl_dist.index(min(eucl_dist))]
        closest_labels_knn = [y_train[x] for x in closest_knn]
    return y_test

Finally, we need a helper function to find the most popular item in an array. I chose a solution where I put the class as a key in a dictionary and keep increasing the value for the key to count occurrences:

from collections import defaultdict
from operator import itemgetter

def get_most_common_item(array):
    count_dict = defaultdict(int)
    for key in array:
        count_dict[key] += 1
    key, count = max(count_dict.iteritems(), key=itemgetter(1))
    return key

Now we can test this out with our custom dataset setting neighbor count to 2:

print knn(X_train, y_train, X_test, k=2)  # output: [1, 1, 2]

And we can see that this somewhat matched my imagined labels [1, 2, 2].

You can find the whole core on my Github repository or here below:

Keyword matching with Aho-Corasick

So, for example, I have around a lot of different keywords and I want to find if any of these have been mentioned in some long text. I also want catch them if they are a part of another word like accidentally written two words together (e.g., greenplants). This is considerable amount of searching.

Brute Force

One way to search for keywords would be to take each keyword and go though the text or the another way around. This results in a lot of comparison operations, especially if the text is large.  (keywords * words in text).

You could first count which words are the most common in the type of texts you use and search for those first. If you are searching line by line in a text and match it by the first encountered keyword from your list this would somewhat reduce the number of operations.


A better way would be to make a dictionary from keywords, where the key is the keyword and the value is anything you want to match it with ( a category). Now you only need to take each word in a line /text and match it in the the dictionary (on average – O(1) for one word) . Nice now you only search as many times ar you have  words in the text. However, if you want to search for a partial match, you still need  to make substrings from words or add some other additional logic.


Now, the best way is just to take a line and find all the keywords as you go though it, similar as you would read it. This can be done with an automaton. A nice algorithm for searching in text is Aho-Corasick algorithm. This algorithm takes a list of keywords and generates an automaton from them. With this you can match your keywords in linear time – which is about as good as it gets.


Python has a simple Aho-Corasick library pyahocorasick.  According to pyahocorasick, this lib can work with unicode in Python 3, whereas Python 2 can only work with string bytes.

pip install pyahocorasick

Now, lets take some keywords and make an automaton. I have added a category for each keyword and stored that in a tuple (keyw, category). In this case all keywords are in the same category.

keywords = [
    ('he', 1),
    ('she', 1),
    ('hers', 1),
    ('her', 1)
text = [
    ' he is here ',
    ' this is she ',
    ' this is hers ',
    ' her bag is big '

First, keywords and their respective categories are added to a trie structure, from which an automaton will be generated.

import ahocorasick as ahc

def make_aho_automaton(keywords):
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

A = make_aho_automaton(keywords)

Next, let’s make a function which finds keywords in text. Searching in pyahocorasick happens via automaton.iter(line) function. This returns an iterable object with the end index of the word matched and the tuple of the respective keyword and its category.

def find_keywords(line, A):
    found_keywords = []
    for end_index, (cat, keyw) in A.iter(line):
    return found_keywords

Searching for keywords yields a list of the found keywords. Because the automaton was generated from keywords with no spaces around them, it finds overlapping keywords (here  contains he as well as her).

new_text = []
for line in text:
    print line, ':', find_keywords(line, A)

# he is here : ['he', 'he', 'her']
# this is she : ['she', 'he']
# this is hers : ['he', 'her', 'hers']
# her bag is big : ['he', 'her']

Let’s remake the automaton:

keywords = [
    (' he ', 1),
    (' she ', 1),
    (' hers ', 1),
    (' her ', 1)

A_spaces = make_aho_automaton(keywords)

Now the A_spaces automaton finds only fully matching words:

# he is here  : [' he ']
# this is she  : [' she ']
# this is hers  : [' hers ']
# her bag is big  : [' her ']

Replacing keywords

A useful function would be to replace the found keywords with some characters for anonymization. Here the function returns an array in the length of the line telling me which indices are to be hidden.

def find_keyword_locations(line, A):
    line_indices = [False for x in line]
    for end_index, (cat, keyw) in A.iter(line):
        start_index = end_index - len(keyw) + 2  # start index after first space
        for i in range(start_index, end_index):  # end index excluding last space
            line_indices[i] = True
    return line_indices

Let us try to replace the keywords with a dash or remove the altogether:

new_text_removed = []
new_text_replaced = []
for line in text:
    line_indices = find_keyword_locations(line, A_spaces)
    line = list(line) # split string into list
    new_line = "".join([line[i] if not x else '' for i, x in enumerate(line_indices)])
    new_line = "".join([line[i] if not x else '-' for i, x in enumerate(line_indices)])

print text
print new_text_removed
print new_text_replaced

Original text:

[' he is here ', ' this is she ', ' this is hers ', ' her bag is big ']

Removed/replaced keywords in text:

['  is here ', ' this is  ', ' this is  ', '  bag is big ']
[' -- is here ', ' this is --- ', ' this is ---- ', ' --- bag is big ']

The last function seems pretty useful for text anonymization tasks if you, for example, need to anonymize person names/lastnames.

To conclude, an automaton can search really fast for selected keywords. It is easy to use it you need to match whole words. Partial matching can be archived by not adding spaces around the keywords. However, you have to choose carefully to not match accidental parts of words. I would consider making two automata – one for matching whole words, another for matching parts of words.

Full code can be found below, or on GitHub:

Generating Graphs on Server with no UI in Pyhton

So you run the analysis on Linux server because the data is  huge. Now you want to generate a nice graph or plot to see whats going on.

The script

Using the Iris Dataset I want to generate a plot showing the distribution of the flower examples like this:

Iris plot
Iris plot

To reduce features from 4 to 2 I use sklearn’s  truncated singular value decomposition (which also works on sparse matrices):

import matplotlib.pyplot as plt
from sklearn import datasets 
from sklearn.decomposition import TruncatedSVD 

iris = datasets.load_iris() 
X = 
y =  # Labels

# Visualize result using SVD
svd = TruncatedSVD(n_components=2) 
X_reduced = svd.fit_transform(X)

# Initialize scatter plot with x and y axis values
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=25)

When running this script on Windows everything seems fine.

Setting up server

On a Linux server with no graphical interface or UI there are no tools for the server to generate a picture. On a clean server I need to install sklearn and its dependencies as well as mathplotlib. To make this easier I use Anaconda:

bash -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc

However, running the script now trows an error:

# ImportError: cannot open shared object file: No such file or directory

So I install also these two libs:

# sudo apt-get install -y libsm6 libxrender1

After this running the script trows:

# RuntimeError: Invalid DISPLAY variable

Generate plot

To enable the server to generate the plot I switch to a different ‘backend’. Pyplot enables various backends for different file formats. To generate and save a .png image I use ‘agg’ backend:


With this I can create my plot and save it. As a side note the .show() method will not be executed when the backend is set to ‘agg’.

The full script code can be seen below or on github:

Dealing with JSON Encoded as JSON

The other day I had to fix a file containing JSON that had been encoded as JSON (JSON nested in JSON). It looked something like this:

"'{\"animal_list\": [{\"type\": \"mammal\", \"description\": \"Tall, with brown spots, lives in Savanna\", \"name\": \"Giraffa camelopardalis\"},{\"type\": \"mammal\", \"description\": \"Big, grey, with big ears, smart\", \"name\": \"Loxodonta africana\"},{\"type\": \"reptile\", \"description\": \"Green, changes color, lives in \"East Africa\"\", \"name\": \"Trioceros jacksonii\"}]}'"

It was all in one line, considerably big  – when properly formatted with around 1.5 million entries.


The quick and dirty way to decode this would be to:

  • Put newline between each new entry (for readability)
  • Remove the quotation marks (” and ‘) from the beginning, end
  • Un-escape the quotation marks
newfile = open('animals_editted.json', 'w')
with open('animals.json', 'r') as jsonf:
    for line in jsonf:   # there will be only one line
        newline = line.replace('},{', '},\n{')
        newline = newline.strip('\"').strip('\'')
        newline = newline.replace('\\\"', '\"')
        print newline

Now the JSON looks better but there is a problem:

{"animal_list": [{"type": "mammal", "description": "Tall, with brown spots, lives in Savanna", "name": "Giraffa camelopardalis"},
{"type": "mammal", "description": "Big, grey, with big ears, smart", "name": "Loxodonta africana"},
{"type": "reptile", "description": "Green, changes color, lives in "East Africa"", "name": "Trioceros jacksonii"}]}

By  escaping all the quotation marks also those that need to be escaped in the description field are now un-escaped. Now the challenge is to fix these description fields.

One way is to take the line, find “description” in it and split the parts – beginning of the line, description value, and end of the file. If we have the description value, we can escape the quotation marks there and glue back together all the parts:

newfile = open('animals_cleaned.json', 'w')
with open('animals_editted.json', 'r') as jsonf:
    for line in jsonf: 
        start, end = line.split('\"description\": \"')
        descr_val, end = end.split('", "')  
        # escape the quotes
        descr_val = descr_val.replace('"', '\\"')  
        descr = '\"description\": \"' + descr_val + '", "' 
        newline = start + descr + end

Now the JSON is importing successfully:

with open('animals_cleaned.json', 'r') as nf:
    json_object = json.load(nf)
print len(json_object['animal_list'])
#>> 3

To sum up, this was a brute force way of decoding JSON encoded as JSON, this solution took me about 10 minutes and was good enough to fix the problem. Alternatively, other solutions could be to use json.reads() hook as explained in “encode nested JSON in JSON” answer.

See full code below or on github:

Iris Dataset and Xgboost Simple Tutorial

I had the opportunity to start using xgboost machine learning algorithm, it is fast and shows good results. Here I will be using multiclass prediction with the iris dataset from scikit-learn.

Installing Anaconda and xgboost

In order to work with the data, I need to install various scientific libraries for python. The best way I have found is to use Anaconda. It simply installs all the libs and helps to install new ones. You can download the installer for Windows, but if you want to install it on a Linux server, you can just copy-paste this into the terminal:

bash -b -p $HOME/anaconda
echo 'export PATH="$HOME/anaconda/bin:$PATH"' >> ~/.bashrc

After this, use conda to install pip which you will need for installing xgboost. It is important to install it using Anaconda (in Anaconda’s directory), so that pip installs other libs there as well:

conda install -y pip libgcc

Now, a very important step: install xgboost Python Package dependencies beforehand. I install these ones from experience:

sudo apt-get install -y make g++ build-essential gfortran libatlas-base-dev liblapacke-dev python-dev python-setuptools libsm6 libxrender1

I upgrade my python virtual environment to have no trouble with python versions:

pip install --upgrade virtualenv

And finally I can install xgboost with pip (keep fingers crossed):

pip install xgboost

This command installs the latest xgboost version, but if you want to use a previous one, just specify it with:

pip install xgboost==0.4a30

Now test if everything is has gone well – type python in the terminal and try to import xgboost:

import xgboost as xgb

If you see no errors – perfect.

Xgboost Demo with the Iris Dataset

Here I will use the Iris dataset to show a simple example of how to use Xgboost.

First you load the dataset from sklearn, where X will be the data, y – the class labels:

from sklearn import datasets

iris = datasets.load_iris()
X =
y =

Then you split the data into train and test sets with 80-20% split:

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Next you need to create the Xgboost specific DMatrix data format from the numpy array. Xgboost can work with numpy arrays directly, load data from svmlignt files and other formats. Here is how to work with numpy arrays:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

If you want to use svmlight for less memory consumption, first dump the numpy array into svmlight format and then just pass the filename to DMatrix:

import xgboost as xgb
from sklearn.datasets import dump_svmlight_file

dump_svmlight_file(X_train, y_train, 'dtrain.svm', zero_based=True)
dump_svmlight_file(X_test, y_test, 'dtest.svm', zero_based=True)
dtrain_svm = xgb.DMatrix('dtrain.svm')
dtest_svm = xgb.DMatrix('dtest.svm')

Now for the Xgboost to work you need to set the parameters:

param = {
    'max_depth': 3,  # the maximum depth of each tree
    'eta': 0.3,  # the training step for each iteration
    'silent': 1,  # logging mode - quiet
    'objective': 'multi:softprob',  # error evaluation for multiclass training
    'num_class': 3}  # the number of classes that exist in this datset
num_round = 20  # the number of training iterations

Different datasets perform better with different parameters. The result can be really low with one set of params and really good with others. You can look at this Kaggle script how to search for the best ones. Generally try with eta 0.1, 0.2, 0.3, max_depth in range of 2 to 10 and num_round around few hundred.


Finally the training can begin. You just type:

bst = xgb.train(param, dtrain, num_round)

To see how the model looks you can also dump it in human readable form:


And it looks something like this (f0, f1, f2 are features):

0:[f2<2.45] yes=1,no=2,missing=1
0:[f2<2.45] yes=1,no=2,missing=1
    2:[f3<1.75] yes=3,no=4,missing=3
        3:[f2<4.95] yes=5,no=6,missing=5
        4:[f2<4.85] yes=7,no=8,missing=7

You can see that each tree is no deeper than 3 levels as set in the params.

Use the model to predict classes for the test set:

preds = bst.predict(dtest)

But the predictions look something like this:

[[ 0.00563804 0.97755206 0.01680986]
 [ 0.98254657 0.01395847 0.00349498]
 [ 0.0036375 0.00615226 0.99021029]
 [ 0.00564738 0.97917044 0.0151822 ]
 [ 0.00540075 0.93640935 0.0581899 ]

Here each column represents class number 0, 1, or 2. For each line you need to select that column where the probability is the highest:

import numpy as np
best_preds = np.asarray([np.argmax(line) for line in preds])

Now you get a nice list with predicted classes:

[1, 0, 2, 1, 1, ...]

Determine the precision of this prediction:

from sklearn.metrics import precision_score

print precision_score(y_test, best_preds, average='macro')
# >> 1.0

Perfect! Now save the model for later use:

 from sklearn.externals import joblib

joblib.dump(bst, 'bst_model.pkl', compress=True)
# bst = joblib.load('bst_model.pkl') # load it later

Now you have a working model saved for later use, and ready for more prediction.

See the full code on github or  below:

Working With Numpy Matrices

At the beginning when I started working with natural language processing, I used the default Python lists. But soon enough with bigger experiments and more data I run out of RAM. Python lists are not optimized for memory space so onto Numpy.

Numpy arrays are much like in C – generally you create the array the size you need beforehand and then fill it. Merging, appending is not recommended as Numpy will create one empty array in the size of arrays being merged  and then just copy the contents into it.

Here are some ways Numpy arrays (ndarray) can be manipulated:

Create ndarray

Some ways to create numpy matrices are:

aimport numpy as np

list = [1, 2, 3]
c = np.asarray(list)
  • Create an ndarray in the size you need filled with ones, zeros or random values:
# Array items as ndarray 
c = np.array([1, 2, 3])

# A 2x2 2d array shape for the arrays in the format (rows, columns)
shape = (2, 2)

# Random values
c = np.empty(shape)

d  = np.ones(shape)
e = np.zeros(shape)
# Creating ndarray from list
c = np.array([[1., 2.,],[1., 2.]])

# Creating new array in the shape of c, filled with 0
d = np.empty_like(c)


Sometimes I need to select only a part of all columns or rows in a 2d matrix. For example, matrices:

a = np.asarray([[1,1,2,3,4], # 1st row
                [2,6,7,8,9], # 2nd row
                [3,6,7,8,9], # 3rd row
                [4,6,7,8,9], # 4th row
                [5,6,7,8,9]  # 5th row

b = np.asarray([[1,1],

# Select row in the format a[start:end], if start or end omitted it means all range.
y = a[:1]  # 1st row
y = a[0:1] # 1st row
y = a[2:5] # select rows from 3rd to 5th row

# Select column in the format a[start:end, column_number]
x = a[:, -1] # -1 means first from the end
x = a[:,1:3] # select cols from 2nd col until 3rd

Merge arrays

Merging  numpy arrays is not advised because because internally numpy will create empty big array and then copy the contents into it. It would be best to create the intended size at the beginning and then just fill it up. However, sometimes you cannot avoid merging. In this case, numpy has some built-in functions:


1d arrays:

a = np.array([1, 2, 3])
b = np.array([5, 6])
print np.concatenate([a, b, b])  
# >>  [1 2 3 5 6 5 6]

2d arrays:

a2 = np.array([[1, 2], [3, 4]])

# axis=0 - concatenate along rows
print np.concatenate((a2, b), axis=0)
# >>   [[1 2]
#       [3 4]
#       [5 6]]

# axis=1 - concatenate along columns, but first b needs to be transposed:
#>> [[5]
#    [6]]
np.concatenate((a2, b.T), axis=1)
#>> [[1 2 5]
#    [3 4 6]]

Append – append values to the end of an array

1d arrays:

# 1d arrays
print np.append(a, a2)
# >> [1 2 3 1 2 3 4]

print np.append(a, a)
# >> [1 2 3 1 2 3]

2d arrays – both arrays must match the shape of rows:

print np.append(a2, b, axis=0)
# >> [[1 2]
#     [3 4]
#     [5 6]]

print np.append(a2, b.T, axis=1)
# >> [[1 2 5]
#     [3 4 6]]

Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

print np.hstack([a, b])
# >> [1 2 3 5 6]

print np.vstack([a, a])
# >> [[1 2 3]
#     [1 2 3]]

2d arrays:

print np.hstack([a2,a2]) # arrays must match shape
# >> [[1 2 1 2]
#     [3 4 3 4]]

print np.vstack([a2, b])
# >> [[1 2]
#     [3 4]
#     [5 6]]

Types in ndarray

Without the default float, numpy can hold all the common types. If any of numbers in array is float, all numbers will be converted to float:

a = np.array([1, 2, 3.3])
print a
# >> [ 1.   2.   3.3]

But you can easily cast the type to int, float or other:

print a.astype(int)
# >> [1 2 3]

String arrays need to be created as arrays with the type S1 for string with length 1, S2 for length of 2 and so on .  numpy.chararray() creates array with this type. You need to specify the shape of the array and itemsize – the length of each string.

chararray = np.chararray([3,3], itemsize=3)
chararray[:] = 'abc' # assing value to all fields 
print chararray
#>> [['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']]

Read/write to file

Write numpy array to a file with numpy.savetext() in plain text form and load it with  numpy.loadtext():

a2 = np.array([
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]

np.savetxt('test.txt', a2, delimiter=',')
a2_new = np.loadtxt('test.txt', delimiter=',')

Writing sparse matrices

However, in machine learning if you have a large, sparse matrix (with a lot of values that are 0), reading and writing large matrices is faster and the file is smaller if you use the svmlight format:

from sklearn.datasets import dump_svmlight_file, load_svmlight_file

matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]

labels = [1,1,1,1,1,2,2]

dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

# The file looks like this:

# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 2 0:1 5:3 13:1 14:2
# 2 0:1 5:3 13:1 14:2

svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

Use .toarray() to get matrix back from the svmlight Compressed Sparse Row format:

svm_loaded[0].toarray() # matrix element at index 0
svm_loaded[1] # labels at index 1

To sum up, this post looked at how to :

  • create numpy arrays,
  • slice arrays,
  • merge arrays,
  • basic types of numpy arrays,
  • reading and writing arrays to file,
  • reading and writing sparse matrices to svmlight format.

This was just an introduction into numpy matrices on how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.

Cleaning Text for Natural Language Processing Tasks in Machine Learning in Python

Often when I work with text I need it to be clean. That is to remove gibberish or symbols/words I don’t need and to make all letters lowercase.

For example, a “dirty” line of text:

text = ['This is dirty TEXT: A phone number +001234561234, moNey 3.333, some date like 09.08.2016 and weird Čárákterš.']

Using Python2.7:

1) Read the line from list:

for line in text:
    # do something with line

or read from file:

with open('file.txt', 'r') as f:
    for line in f:
        # do something with line

2) Decode the line to utf8 from a string of bytes to work with special symbols:

line = line.decode('utf8')

3) Remove the symbols you don’t need. With replace() you can stack as many replace operations as you want.

line = line.replace('+', ' ').replace('.', ' ').replace(',', ' ').replace(':', ' ')

4) Remove numbers. Here you can use regex \d+. Because dots have already been removed we only need to check for whole numbers.

line = re.sub("(^|\W)\d+($|\W)", " ", line)

This regex matches the start of line ^ or whitespace, digits, end of line $ or whitespace to a space.

Alternatively you can just check if a word evaluates to a number by a simple function – is_digit() attempts to turn a string into int. If it succeeds, then the function returns true.

def is_digit(word):
        return True
    except ValueError:
        return False

Use this function on each word in the line by splitting the line on space with line.split(). New line array will hold only those words that are not numbers. At the end the array is joined together to a string.

new_line = []
for word in line.split():
    if not is_digit(word):
line = " ".join(new_line)

5) Now only lowercase and special characters remain. As lowercase only supports Latin letters, the special characters need to be turned to Latin. This can be done using Transliterate Python package or by hand. Here is a simple transliteration dictionary made from lists of character pairs:

cedilla2latin = [[u'Á', u'A'], [u'á', u'a'], [u'Č', u'C'], [u'č', u'c'], [u'Š', u'S'], [u'š', u's']]
tr = dict([(a[0], a[1]) for (a) in cedilla2latin])

In this way you can have multiple simbols to stand for one special symbol (like German [u’ä’, u’ae’]).
With the dictionary I can recreate letters in Latin:

def transliterate(line):
    new_line = ""
        for letter in line:
            if letter in tr:
                new_line += tr[letter]
                new_line += letter
    return new_line

And call the transliterate function:

line = transliterate(line)

6) After clearing away unnecessary symbols, finally I can lowercase the line:

line = line.lower()

And finally the line is reduced to simple Latin characters.

print line
>> this is dirty text a phone number money some date like and weird carakters

If you need to retain some numbers or check for other fields then go ahead and write more specific regexes.  However, regexes in Python use backtracking that makes them n-squared in terms of speed. This can slow you down especially if you are working with millions of lines.

As a side note a more general cleaning method that leaves only Latin characters can be to check for the ASCII value of each letter with ord().

def get_latin(line):
    return ' '.join(''.join([i if ord(i) >=65 and ord(i) <=90 or  ord(i) >= 97 and ord(i) <= 122 else ' ' for i in line]).split())

Full code of the above description is available below or here: