When I first started working with natural language processing, I used the default Python lists. But soon enough, with bigger experiments and more data, I ran out of RAM. Python lists are not optimized for memory use, so I moved on to Numpy.
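To get a feel for the difference, here is a rough sketch that compares the memory taken by a list of integers with an ndarray holding the same values. The size `n` and the variable names are my own choices for illustration, and the exact byte counts depend on the platform and Python version, so no numbers are shown.

```python
import sys
import numpy as np

n = 100_000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers, and every int is a separate Python object,
# so both the list itself and its items have to be counted.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# The ndarray keeps the raw values in one contiguous buffer: n * 8 bytes.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```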

Numpy arrays are much like arrays in C: generally you create the array in the size you need beforehand and then fill it. Merging and appending are not recommended, as Numpy will create a new empty array in the combined size of the arrays being merged and then copy the contents into it.
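A minimal sketch of the "create first, fill later" pattern, next to the appending approach it replaces. The names `features` and `grown` and the toy row values are assumptions made up for this example.

```python
import numpy as np

n_rows, n_cols = 4, 3

# Preallocate once, then fill row by row.
features = np.zeros((n_rows, n_cols))
for i in range(n_rows):
    features[i] = [i, i * 2, i * 3]

# For comparison: np.append returns a new array on every call,
# so the data collected so far is copied again in each iteration.
grown = np.empty((0, n_cols))
for i in range(n_rows):
    grown = np.append(grown, [[i, i * 2, i * 3]], axis=0)
```

Both arrays end up with the same contents; the preallocated version simply avoids the repeated copying.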

Here are some ways Numpy arrays (ndarray) can be manipulated:

### Create ndarray

Some ways to create numpy matrices are:

**Cast** from a Python list with numpy.asarray():

    import numpy as np

    values = [1, 2, 3]  # not named `list`, which would shadow the built-in
    c = np.asarray(values)

**Create an ndarray in the size** you need, filled with ones, zeros or arbitrary (uninitialized) values:

    # Array items as ndarray
    c = np.array([1, 2, 3])

    # Shape of a 2x2 2d array, in the format (rows, columns)
    shape = (2, 2)

    c = np.empty(shape)  # uninitialized: whatever happens to be in memory
    d = np.ones(shape)   # filled with ones
    e = np.zeros(shape)  # filled with zeros

You can also create an array **in the shape of another array** with numpy.empty_like():

    # Creating ndarray from list
    c = np.array([[1., 2.], [1., 2.]])

    # Creating a new array in the shape of c; the contents are uninitialized
    # (use np.zeros_like(c) if it has to be filled with 0)
    d = np.empty_like(c)
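Since `empty` and `empty_like` leave the contents uninitialized, it can help to see them side by side with the variants that do fill the array. This is only a sketch; `template` and the other variable names are made up for the example, and the seed is arbitrary.

```python
import numpy as np

template = np.array([[1., 2.], [3., 4.]])

scratch = np.empty_like(template)       # uninitialized: contents are arbitrary
zeros = np.zeros_like(template)         # all 0.0
ones = np.ones_like(template)           # all 1.0
sevens = np.full(template.shape, 7.0)   # any constant value

# Actual random values come from a random generator, not from np.empty.
rng = np.random.default_rng(seed=0)
random_values = rng.random(template.shape)  # uniform in [0, 1)
```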

### Slice

Sometimes I need to **select** only a part of all **columns** or **rows** in a 2d matrix. For example, matrices:

    a = np.asarray([[1,1,2,3,4],   # 1st row
                    [2,6,7,8,9],   # 2nd row
                    [3,6,7,8,9],   # 3rd row
                    [4,6,7,8,9],   # 4th row
                    [5,6,7,8,9]    # 5th row
                    ])
    b = np.asarray([[1,1],
                    [1,1]])

    # Select rows in the format a[start:end]; if start or end is omitted,
    # the range extends to that edge of the array.
    y = a[:1]   # 1st row
    y = a[0:1]  # 1st row
    y = a[2:5]  # rows 3 to 5

    # Select columns in the format a[:, column_number] or a[:, start:end]
    x = a[:, -1]   # -1 means the first from the end, i.e. the last column
    x = a[:, 1:3]  # columns 2 and 3
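Row and column ranges can also be combined in one expression. The sketch below uses a made-up 5x5 matrix holding 1 to 25 so the selected values are easy to follow. It also shows one thing worth knowing: a basic slice is a view on the original data, not a copy.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, 5)  # 5x5 matrix holding 1..25

block = a[1:3, 2:4]   # rows 2-3, columns 3-4 -> [[8, 9], [13, 14]]
last_col = a[:, -1]   # last column as a 1d array
every_other = a[::2]  # rows 1, 3 and 5

# Basic slices are views: writing through one changes the original array.
view = a[0, :2]
view[:] = 0           # a[0] is now [0, 0, 3, 4, 5]

# Take a copy when the original must stay untouched.
safe = a[1, :2].copy()
safe[:] = -1          # a[1] is still [6, 7, 8, 9, 10]
```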

### Merge arrays

Merging numpy arrays is **not advised** because internally numpy will create a big empty array and then copy the contents into it. It would be best to create the array in the intended size at the beginning and then just fill it up. However, sometimes you cannot avoid merging. In this case, numpy has some **built-in functions:**

#### Concatenate

1d arrays:

    a = np.array([1, 2, 3])
    b = np.array([5, 6])
    print(np.concatenate([a, b, b]))
    # >> [1 2 3 5 6 5 6]

2d arrays:

    a2 = np.array([[1, 2], [3, 4]])
    b2 = np.array([[5, 6]])  # 2d row; a 1d array cannot be joined to a 2d one

    # axis=0 - concatenate along rows
    print(np.concatenate((a2, b2), axis=0))
    # >> [[1 2]
    #     [3 4]
    #     [5 6]]

    # axis=1 - concatenate along columns, but first b2 needs to be transposed:
    b2.T
    # >> [[5]
    #     [6]]
    print(np.concatenate((a2, b2.T), axis=1))
    # >> [[1 2 5]
    #     [3 4 6]]

#### Append – append values to the end of an array

1d arrays:

    # Without an axis, np.append flattens both inputs into a 1d array
    print(np.append(a, a2))
    # >> [1 2 3 1 2 3 4]
    print(np.append(a, a))
    # >> [1 2 3 1 2 3]

2d arrays (when an axis is given, both arrays must have the same shape except along that axis):

    b2 = np.array([[5, 6]])  # 2d row, shape (1, 2)

    print(np.append(a2, b2, axis=0))
    # >> [[1 2]
    #     [3 4]
    #     [5 6]]
    print(np.append(a2, b2.T, axis=1))
    # >> [[1 2 5]
    #     [3 4 6]]

#### Hstack (stack horizontally) and vstack (stack vertically)

1d arrays:

    print(np.hstack([a, b]))
    # >> [1 2 3 5 6]
    print(np.vstack([a, a]))
    # >> [[1 2 3]
    #     [1 2 3]]

2d arrays:

    print(np.hstack([a2, a2]))  # arrays must match in shape
    # >> [[1 2 1 2]
    #     [3 4 3 4]]
    print(np.vstack([a2, b]))
    # >> [[1 2]
    #     [3 4]
    #     [5 6]]
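When merging cannot be avoided and the pieces arrive one at a time, one way to limit the copying is to collect them in a plain Python list and merge once at the end. This is a sketch under that assumption; the loop only stands in for whatever step produces the batches, and `chunks` and `merged` are names made up for the example.

```python
import numpy as np

chunks = []
for start in range(0, 9, 3):
    # Stand-in for a batch produced by some earlier processing step
    chunks.append(np.arange(start, start + 3).reshape(1, 3))

# One call: the result is allocated once and each chunk is copied once.
merged = np.concatenate(chunks, axis=0)
print(merged)
# >> [[0 1 2]
#     [3 4 5]
#     [6 7 8]]
```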

### Types in ndarray

Besides the default float, numpy arrays can hold all the common **types.** If any of the numbers in an array is a float, all numbers will be converted to float:

    a = np.array([1, 2, 3.3])
    print(a)
    # >> [1.  2.  3.3]

But you can easily **cast the type** to int, float or other:

    print(a.astype(int))
    # >> [1 2 3]
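Note that casting to int drops the fractional part rather than rounding. A small sketch with made-up values shows the difference. It also shows that the type can be chosen at creation time, which matters for memory when the values are known to be small.

```python
import numpy as np

a = np.array([1.7, -1.7, 2.5])

truncated = a.astype(int)         # fractional part dropped: [ 1 -1  2]
rounded = np.rint(a).astype(int)  # nearest integer, ties to even: [ 2 -2  2]

# One byte per item instead of the platform's default integer size
small = np.array([1, 2, 3], dtype=np.int8)
print(small.dtype, small.nbytes)
# >> int8 3
```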

**String arrays** need to be created as arrays with the type S1 for strings of length 1, S2 for length 2, and so on. `numpy.chararray()` creates an array with this type. You need to specify the shape of the array and itemsize, the length of each string.

    chars = np.chararray([3, 3], itemsize=3)
    chars[:] = 'abc'  # assign the value to all fields
    print(chars)
    # In Python 3 the S type holds bytes, so the items print with a b prefix:
    # >> [[b'abc' b'abc' b'abc']
    #     [b'abc' b'abc' b'abc']
    #     [b'abc' b'abc' b'abc']]
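The same fixed-width types can also be requested directly through the `dtype` argument of the ordinary creation functions. The sketch below uses `np.full` with `S3` (bytes) and `U3` (unicode text); the variable names are made up for the example.

```python
import numpy as np

byte_strings = np.full((2, 2), b'abc', dtype='S3')  # bytes, 3 per item
text_strings = np.full((2, 2), 'abc', dtype='U3')   # unicode, 3 characters per item

# The width is fixed: a longer value is cut to fit when it is assigned.
text_strings[0, 0] = 'abcdef'
print(text_strings[0, 0])
# >> abc
```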

### Read/write to file

Write a numpy array to a file in plain text form with numpy.savetxt() and load it with numpy.loadtxt():

    a2 = np.array([
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
        [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1]
    ])
    np.savetxt('test.txt', a2, delimiter=',')
    a2_new = np.loadtxt('test.txt', delimiter=',')
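By default `savetxt` writes every value as a float in scientific notation, and `loadtxt` returns floats. For integer data, the `fmt` and `dtype` arguments keep the file compact and the type intact. This is a sketch: the file name `counts.txt` and the temporary directory are only there to keep the example self-contained.

```python
import os
import tempfile
import numpy as np

counts = np.array([[1, 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'counts.txt')

    # fmt='%d' writes plain integers instead of values like 1.000000e+00
    np.savetxt(path, counts, fmt='%d', delimiter=',')

    with open(path) as f:
        text = f.read()  # "1,2,3\n4,5,6\n"

    # dtype=int makes loadtxt return integers instead of floats
    restored = np.loadtxt(path, dtype=int, delimiter=',')
```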

#### Writing sparse matrices

However, in machine learning if you have a large, **sparse matrix** (with a lot of values that are 0), reading and writing large matrices is faster and the file is smaller if you use the **svmlight format**:

    from sklearn.datasets import dump_svmlight_file, load_svmlight_file

    matrix = [
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2],
        [1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 2]
    ]
    labels = [1, 1, 1, 1, 1, 2, 2]

    dump_svmlight_file(matrix, labels, 'svmlight.txt', zero_based=True)

    # The file looks like this:
    # 1 0:1 13:1 14:2
    # 1 0:1 13:1 14:2
    # 1 0:1 13:1 14:2
    # 1 0:1 13:1 14:2
    # 1 0:1 13:1 14:2
    # 2 0:1 5:3 13:1 14:2
    # 2 0:1 5:3 13:1 14:2

    svm_loaded = load_svmlight_file('svmlight.txt', zero_based=True)

Use .toarray() to get matrix back from the svmlight Compressed Sparse Row format:

    svm_loaded[0].toarray()  # the matrix is the element at index 0
    svm_loaded[1]            # the labels are at index 1
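One detail to watch for on loading: the file only lists non-zero entries, so the number of columns is inferred from the highest index it contains. If the last columns of the matrix are all zeros, passing `n_features` keeps the original width. Here is a round-trip sketch with a made-up 3x5 matrix whose last column is empty; the file name and temporary directory are assumptions for the example.

```python
import os
import tempfile
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1, 0, 0, 2, 0],
              [0, 0, 3, 0, 0],
              [4, 0, 0, 0, 0]], dtype=float)
y = np.array([1, 2, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.svmlight')
    dump_svmlight_file(X, y, path, zero_based=True)

    # The loader returns a (sparse matrix, labels) tuple, so it can be unpacked.
    X_sparse, y_loaded = load_svmlight_file(
        path, n_features=X.shape[1], zero_based=True)

X_dense = X_sparse.toarray()  # back to a regular ndarray, shape (3, 5)
```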

To sum up, this post looked at:

- creating numpy arrays,
- slicing arrays,
- merging arrays,
- the basic types of numpy arrays,
- reading and writing arrays to file,
- reading and writing sparse matrices in svmlight format.

This was just an introduction to numpy matrices: how to get started and do basic manipulations. More information can be found in this MIT guide book as well as in the official documentation.