Category Archives: numpy

Working With Numpy Matrices

At the beginning when I started working with natural language processing, I used the default Python lists. But soon enough with bigger experiments and more data I run out of RAM. Python lists are not optimized for memory space so onto Numpy.

Numpy arrays are much like in C – generally you create the array the size you need beforehand and then fill it. Merging, appending is not recommended as Numpy will create one empty array in the size of arrays being merged  and then just copy the contents into it.

Here are some ways Numpy arrays (ndarray) can be manipulated:

Create ndarray

Some ways to create numpy matrices are:

aimport numpy as np

list = [1, 2, 3]
c = np.asarray(list)
  • Create an ndarray in the size you need filled with ones, zeros or random values:
# Array items as ndarray 
c = np.array([1, 2, 3])

# A 2x2 2d array shape for the arrays in the format (rows, columns)
shape = (2, 2)

# Random values
c = np.empty(shape)

d  = np.ones(shape)
e = np.zeros(shape)
# Creating ndarray from list
c = np.array([[1., 2.,],[1., 2.]])

# Creating new array in the shape of c, filled with 0
d = np.empty_like(c)

Slice

Sometimes I need to select only a part of all columns or rows in a 2d matrix. For example, matrices:

a = np.asarray([[1,1,2,3,4], # 1st row
                [2,6,7,8,9], # 2nd row
                [3,6,7,8,9], # 3rd row
                [4,6,7,8,9], # 4th row
                [5,6,7,8,9]  # 5th row
              ])

b = np.asarray([[1,1],
                [1,1]])

# Select row in the format a[start:end], if start or end omitted it means all range.
y = a[:1]  # 1st row
y = a[0:1] # 1st row
y = a[2:5] # select rows from 3rd to 5th row

# Select column in the format a[start:end, column_number]
x = a[:, -1] # -1 means first from the end
x = a[:,1:3] # select cols from 2nd col until 3rd

Merge arrays

Merging  numpy arrays is not advised because because internally numpy will create empty big array and then copy the contents into it. It would be best to create the intended size at the beginning and then just fill it up. However, sometimes you cannot avoid merging. In this case, numpy has some built-in functions:

Concatenate 

1d arrays:

a = np.array([1, 2, 3])
b = np.array([5, 6])
print np.concatenate([a, b, b])  
# >>  [1 2 3 5 6 5 6]

2d arrays:

a2 = np.array([[1, 2], [3, 4]])

# axis=0 - concatenate along rows
print np.concatenate((a2, b), axis=0)
# >>   [[1 2]
#       [3 4]
#       [5 6]]

# axis=1 - concatenate along columns, but first b needs to be transposed:
b.T
#>> [[5]
#    [6]]
np.concatenate((a2, b.T), axis=1)
#>> [[1 2 5]
#    [3 4 6]]

Append – append values to the end of an array

1d arrays:

# 1d arrays
print np.append(a, a2)
# >> [1 2 3 1 2 3 4]

print np.append(a, a)
# >> [1 2 3 1 2 3]

2d arrays – both arrays must match the shape of rows:

print np.append(a2, b, axis=0)
# >> [[1 2]
#     [3 4]
#     [5 6]]

print np.append(a2, b.T, axis=1)
# >> [[1 2 5]
#     [3 4 6]]

Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

print np.hstack([a, b])
# >> [1 2 3 5 6]

print np.vstack([a, a])
# >> [[1 2 3]
#     [1 2 3]]

2d arrays:

print np.hstack([a2,a2]) # arrays must match shape
# >> [[1 2 1 2]
#     [3 4 3 4]]

print np.vstack([a2, b])
# >> [[1 2]
#     [3 4]
#     [5 6]]

Types in ndarray

Without the default float, numpy can hold all the common types. If any of numbers in array is float, all numbers will be converted to float:

a = np.array([1, 2, 3.3])
print a
# >> [ 1.   2.   3.3]

But you can easily cast the type to int, float or other:

print a.astype(int)
# >> [1 2 3]

String arrays need to be created as arrays with the type S1 for string with length 1, S2 for length of 2 and so on .  numpy.chararray() creates array with this type. You need to specify the shape of the array and itemsize – the length of each string.

chararray = np.chararray([3,3], itemsize=3)
chararray[:] = 'abc' # assing value to all fields 
print chararray
#>> [['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']
#    ['abc' 'abc' 'abc']]

Read/write to file

Write numpy array to a file with numpy.savetext() in plain text form and load it with  numpy.loadtext():

a2 = np.array([
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
])

np.savetxt('test.txt', a2, delimiter=',')
a2_new = np.loadtxt('test.txt', delimiter=',')

Writing sparse matrices

However, in machine learning if you have a large, sparse matrix (with a lot of values that are 0), reading and writing large matrices is faster and the file is smaller if you use the svmlight format:

from sklearn.datasets import dump_svmlight_file, load_svmlight_file

matrix = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
    [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
]

labels = [1,1,1,1,1,2,2]


dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

# The file looks like this:

# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 1 0:1 13:1 14:2
# 2 0:1 5:3 13:1 14:2
# 2 0:1 5:3 13:1 14:2

svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

Use .toarray() to get matrix back from the svmlight Compressed Sparse Row format:

svm_loaded[0].toarray() # matrix element at index 0
svm_loaded[1] # labels at index 1

To sum up, this post looked at how to :

  • create numpy arrays,
  • slice arrays,
  • merge arrays,
  • basic types of numpy arrays,
  • reading and writing arrays to file,
  • reading and writing sparse matrices to svmlight format.

This was just an introduction into numpy matrices on how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.