Getting started with NumPy

In this article we will take a look at NumPy, which is a library for numerical calculations in Python.

Why might you be interested in NumPy? For one, if you want to do data science or machine learning, NumPy is invaluable: it is used to do calculations on large sets of numerical data. But to understand topics such as machine learning, you first have to understand the fundamentals behind them.

My intention is for this article to be the first in a series that guides you from the fundamentals to the heights of data science.

Installing

Installing NumPy is, well, as easy as pie! You can get it through the Python Package Index:

pip install numpy

And you’re done. If you want to learn machine learning, you can instead install a pre-built distribution such as Anaconda, which already includes the required libraries such as NumPy.

Creating NumPy arrays

After this brief introduction it is time to see what NumPy is capable of. We will go through a few examples starting small and then getting bigger to get the experience needed in order to understand NumPy's might.

First of all let's define a simple Python list holding some values:

>>> values = [1, 2.4, 234, 112, 345]

Now, this is really just a plain old Python list, and on its own it gives us none of NumPy's features. So let's convert it to the right format:

>>> import numpy as np
>>> array = np.array(values)

As you can see, the conversion is done with the function np.array. You can also see that the terminology is different: NumPy uses the term array instead of list. This is fine, because Python lists play the role that arrays play in books, articles and other programming languages such as C or Java.

Now we can do different things on this new array:

>>> array
array([   1. ,    2.4,  234. ,  112. ,  345. ])
>>> print(array)
[   1.     2.4  234.   112.   345. ]
>>> array.dtype
dtype('float64')

The first interesting thing you see is how arrays are represented when printing to the console—the entries are evenly spaced to match the format of the longest element. One more interesting observation is that every number is a floating point number even though there was only one floating point number in the original list. And we can verify this by looking at the dtype of our new array.
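To see this type promotion in isolation, here is a small sketch. The variable names ints and mixed are mine and only serve the illustration. I print the kind of the integer dtype rather than its name, because the exact integer width can differ between platforms.

```python
import numpy as np

# A list holding only integers becomes an integer array.
ints = np.array([1, 2, 3])

# A single float in the list is enough to turn every element into a float.
mixed = np.array([1, 2.4, 234, 112, 345])

print(ints.dtype.kind)  # i (a signed integer type; the width depends on the platform)
print(mixed.dtype)      # float64
```

An array has exactly one dtype, so NumPy picks a type that can hold all the values from the list.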

Arrays in NumPy have shapes. The shape describes the size of an array along each of its dimensions:

>>> array.shape
(5,)

The example has only one dimension with 5 elements. NumPy can handle multiple dimensions too, as we will see shortly.
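As a quick sketch of how shape relates to the number of dimensions and the number of elements, compare a flat array with a small nested one. The names flat and grid are only illustrative; ndim and size are standard array attributes.

```python
import numpy as np

flat = np.array([1, 2.4, 234, 112, 345])
# A list of lists gives a two-dimensional array.
grid = np.array([[0, 1, 2], [3, 4, 5]])

# shape: size along each dimension, ndim: number of dimensions,
# size: total number of elements.
print(flat.shape, flat.ndim, flat.size)  # (5,) 1 5
print(grid.shape, grid.ndim, grid.size)  # (2, 3) 2 6
```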

Sometimes it is cumbersome to write out the Python list we want to convert, for example for multidimensional arrays. Fortunately there are other ways to create arrays that do not need a pre-defined list as in the example above.

One method creates a NumPy array from a range of numbers. This is handy for examples, but real data is seldom a plain range:

>>> np.arange(25)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
      17, 18, 19, 20, 21, 22, 23, 24])
>>> np.arange(25).reshape(5,5)
array([[ 0,  1,  2,  3,  4],
      [ 5,  6,  7,  8,  9],
      [10, 11, 12, 13, 14],
      [15, 16, 17, 18, 19],
      [20, 21, 22, 23, 24]])

The arange function creates an array from a range of numbers. But the exciting method is reshape—it shapes the array into a new shape along the given dimensions. In this case you have to take care to use the right shape because if the array does not fit into the given shape you get an exception.

>>> np.arange(24).reshape(5,5)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ValueError: cannot reshape array of size 24 into shape (5,5)

Here we tried to fit 24 elements into a 5x5 2d array (also known as a 2d matrix), but this does not work because a 5x5 array needs exactly 25 elements.
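If you do not want to work out every dimension by hand, reshape also accepts -1 for one dimension and calculates it from the number of elements. The following is a minimal sketch with illustrative names. The element count still has to divide evenly, otherwise you get the same kind of ValueError as above.

```python
import numpy as np

numbers = np.arange(24)

# -1 asks NumPy to work out this dimension from the array size: 24 / 4 = 6.
table = numbers.reshape(4, -1)
print(table.shape)  # (4, 6)

# 24 elements cannot be split into 5 equal rows, so this still fails.
try:
    numbers.reshape(5, -1)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print(error)
```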

Besides this simple example, you can create arrays with evenly spaced values, even with floating point differences between the elements. For example, say you want a range of numbers between 10 and 20 with a step of 0.4. In plain Python it is not easy to create such a list, but with NumPy it’s a piece of cake:

>>> range(10, 20, 0.4)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer
>>> list(map(lambda x: x/10, range(100, 200, 4)))
[10.0, 10.4, 10.8, 11.2, 11.6, 12.0, 12.4, 12.8, 13.2, 13.6, 14.0, 14.4, 14.8, 15.2, 15.6, 16.0, 16.4, 16.8, 17.2, 17.6, 18.0, 18.4, 18.8, 19.2, 19.6]
>>> np.arange(10, 20, 0.4)
array([ 10. ,  10.4,  10.8,  11.2,  11.6,  12. ,  12.4,  12.8,  13.2,
       13.6,  14. ,  14.4,  14.8,  15.2,  15.6,  16. ,  16.4,  16.8,
       17.2,  17.6,  18. ,  18.4,  18.8,  19.2,  19.6])

You can see that 0.4 is not a valid step for range in Python, so I had to use a workaround that involves scaling the bounds to integers, mapping and an error-prone division. With NumPy this goes smoothly.
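One caveat about floating point steps: the NumPy documentation notes that with a non-integer step it is often better to use np.linspace, where you give the number of elements instead of the step. Here is a sketch of the same range built that way. The endpoint 19.6 and the count 25 are taken from the output above.

```python
import numpy as np

# 25 evenly spaced values from 10 to 19.6, both ends included.
values = np.linspace(10, 19.6, 25)

print(len(values))            # 25
print(values[0], values[-1])  # 10.0 19.6

# Same numbers as the arange call, up to floating point rounding.
print(np.allclose(values, np.arange(10, 20, 0.4)))  # True
```

With linspace the number of elements is fixed up front, so rounding in the step cannot add or drop an element at the end.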

Now let's imagine we need a 3x3x3 3d matrix filled with ones. Because we know Python, we can quickly come up with a solution which we can reuse for other datasets in the future:

>>> np.array([1]*3*3*3).reshape(3,3,3)
array([[[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]],

      [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]],

      [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]])

However, NumPy already has a built-in function for this and similar purposes. I suggest you use it instead of the Python-to-NumPy conversion above:

>>> np.ones((3,3,3))
array([[[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]],

      [[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]],

      [[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]]])

This creates a 3d matrix with floating point numbers. If you need only integers, you have to provide the right dtype to the ones function to use that datatype:

>>> np.ones((3,3,3), dtype=np.int64)
array([[[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]],

      [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]],

      [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]])

The np.int64 class defines Python-compatible 64 bit integers. Naturally there are other integer types available, but to maintain Python compatibility I suggest this one. It has a drawback, of course: each element takes 64 bits, which is more space than a 16 bit integer would need, for example.
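To put numbers on the space difference, you can compare the itemsize (bytes per element) and nbytes (bytes for all elements) attributes of two arrays. This is only a sketch; the names big and small are mine.

```python
import numpy as np

big = np.ones((3, 3, 3), dtype=np.int64)
small = np.ones((3, 3, 3), dtype=np.int16)

# 27 elements each, at 8 bytes and 2 bytes per element.
print(big.itemsize, big.nbytes)      # 8 216
print(small.itemsize, small.nbytes)  # 2 54
```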

Besides ones there is a function called zeros, which fills the matrix with zeros, and there is empty, which creates an empty array... or at least it’s empty as far as NumPy is concerned:

>>> empty_array = np.empty((3,6))
>>> empty_array
array([[  0.00000000e+000,   0.00000000e+000,   2.16211348e-314,
         2.16211778e-314,   2.16211782e-314,   2.16211787e-314],
      [  2.16211791e-314,   2.16211795e-314,   2.16211800e-314,
         2.16211804e-314,   2.16211808e-314,   2.16211813e-314],
      [  2.16211817e-314,   2.16211821e-314,   2.16211826e-314,
         2.16211830e-314,   2.16211834e-314,   2.16211839e-314]])

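As you can see, this array contains very small values, but it does contain values, so it is not really empty. The empty function only reserves memory without initializing it, so the contents are whatever happened to be in that memory. Take care every time you use the empty function.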
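If you need predictable contents, a safer route is to use zeros, or np.full for any other constant. If you do use empty, fill the array before you read from it. A small sketch with illustrative names:

```python
import numpy as np

# Every element is guaranteed to be 0.0.
zeros = np.zeros((3, 6))

# Every element is guaranteed to be 7.0.
sevens = np.full((3, 6), 7.0)

# np.empty only reserves memory, so write to it before reading.
scratch = np.empty((3, 6))
scratch.fill(0.0)

print(np.array_equal(zeros, scratch))  # True
```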

Accessing elements

Now we have arrays and we want to access their elements. There are two ways to do this. One is already known from Python: indexing.

>>> array = np.arange(10, 40, 0.4).reshape(5,5,3)
>>> array[1][2][2]
19.20000000000001

This is OK as long as you deal with few dimensions. As the number of dimensions grows, you have to write a lot of brackets around your indexes. Fortunately NumPy has a solution for this too: you can provide the indexes of all dimensions as a comma-separated list inside one pair of brackets:

>>> array[1,2,2]
19.20000000000001
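The following sketch checks that both notations return the same element. It also shows that the comma-separated form is really a tuple, so you can keep a position in a variable. The name position is only illustrative.

```python
import numpy as np

array = np.arange(10, 40, 0.4).reshape(5, 5, 3)

# Chained brackets: each step produces an intermediate array.
chained = array[1][2][2]
# Comma-separated indexes: one lookup.
direct = array[1, 2, 2]
# The comma-separated form is a tuple, so it can be stored in a variable.
position = (1, 2, 2)
from_tuple = array[position]

print(chained == direct == from_tuple)  # True
```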

Sometimes you do not want to access only one field of your data set but whole rows or columns. To achieve this we can use slicing, just as in normal Python code:

>>> array = np.arange(10, 40, 0.4).reshape(5,5,3)
>>> array[:, 2, 2]
array([ 13.2, 19.2, 25.2, 31.2, 37.2])

In this example we have taken the third element of the third row in every matrix. If we are interested in the first row of every matrix, we can get it the following way:

>>> array[:, 0, :]
array([[ 10. ,  10.4,  10.8],
      [ 16. ,  16.4,  16.8],
      [ 22. ,  22.4,  22.8],
      [ 28. ,  28.4,  28.8],
      [ 34. ,  34.4,  34.8]])
>>> array[:, 0]
array([[ 10. ,  10.4,  10.8],
      [ 16. ,  16.4,  16.8],
      [ 22. ,  22.4,  22.8],
      [ 28. ,  28.4,  28.8],
      [ 34. ,  34.4,  34.8]])

As you can see, you can omit the trailing colon if you want all the elements of the last dimension. Naturally this does not work with the first dimension: if you leave out the first colon and write array[, 0], you get a SyntaxError.
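If you want everything from the leading dimensions without writing out each colon, NumPy also accepts an Ellipsis (three dots) in the index. Here is a minimal sketch that compares the short and the long forms; the variable names are mine.

```python
import numpy as np

array = np.arange(10, 40, 0.4).reshape(5, 5, 3)

# All colons written out, then the trailing colon omitted.
first_rows = array[:, 0, :]
first_rows_short = array[:, 0]

# The Ellipsis stands for all the dimensions that are not listed.
first_columns = array[..., 0]
first_columns_long = array[:, :, 0]

print(np.array_equal(first_rows, first_rows_short))       # True
print(np.array_equal(first_columns, first_columns_long))  # True
print(first_columns.shape)                                # (5, 5)
```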

We will see more examples of the slicing features of NumPy in the next article.

Data loading

If you do data science, your data comes from an external source; you do not create an array by hand every time you want to calculate something.

Most of the time you get your data in the form of a CSV file. Let's load the following data contained in the example.csv file:

10.00,10.40,10.80,11.20,11.60,12.00,12.40,12.80,13.20,13.60,14.00,14.40,14.80,15.20,15.60,16.00,16.40,16.80,17.20,17.60,18.00,18.40,18.80,19.20,19.60
20.00,20.40,20.80,21.20,21.60,22.00,22.40,22.80,23.20,23.60,24.00,24.40,24.80,25.20,25.60,26.00,26.40,26.80,27.20,27.60,28.00,28.40,28.80,29.20,29.60
30.00,30.40,30.80,31.20,31.60,32.00,32.40,32.80,33.20,33.60,34.00,34.40,34.80,35.20,35.60,36.00,36.40,36.80,37.20,37.60,38.00,38.40,38.80,39.20,39.60

Yes, this is the same data from the previous section but I have reshaped it to a 3x25 2d matrix.
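If you want to follow along, here is one way such a file could be produced with np.savetxt. This is a sketch and not necessarily how the file above was made. It writes into a temporary directory (my assumption, to avoid overwriting anything), so adjust the path if you want example.csv next to your script.

```python
import os
import tempfile

import numpy as np

# The same 75 values as before, arranged as 3 rows of 25 columns.
data = np.arange(10, 40, 0.4).reshape(3, 25)

# Assumed location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.csv")

# fmt="%.2f" writes every value with two decimals, like 10.00 and 10.40.
np.savetxt(path, data, delimiter=",", fmt="%.2f")

with open(path) as handle:
    first_line = handle.readline().strip()
print(first_line[:17])  # 10.00,10.40,10.80
```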

Again, we have different options to load this file into a NumPy array. I'll leave out the Python solutions which convert the result into a NumPy array and will focus on the built-in versions of NumPy:

>>> csv = np.genfromtxt('example.csv', delimiter=",")
>>> csv
array([[ 10. ,  10.4,  10.8,  11.2,  11.6,  12. ,  12.4,  12.8,  13.2,
        13.6,  14. ,  14.4,  14.8,  15.2,  15.6,  16. ,  16.4,  16.8,
        17.2,  17.6,  18. ,  18.4,  18.8,  19.2,  19.6],
      [ 20. ,  20.4,  20.8,  21.2,  21.6,  22. ,  22.4,  22.8,  23.2,
        23.6,  24. ,  24.4,  24.8,  25.2,  25.6,  26. ,  26.4,  26.8,
        27.2,  27.6,  28. ,  28.4,  28.8,  29.2,  29.6],
      [ 30. ,  30.4,  30.8,  31.2,  31.6,  32. ,  32.4,  32.8,  33.2,
        33.6,  34. ,  34.4,  34.8,  35.2,  35.6,  36. ,  36.4,  36.8,
        37.2,  37.6,  38. ,  38.4,  38.8,  39.2,  39.6]])

>>> csv_2 = np.loadtxt('example.csv', delimiter=',')
>>> csv_2
array([[ 10. ,  10.4,  10.8,  11.2,  11.6,  12. ,  12.4,  12.8,  13.2,
       13.6,  14. ,  14.4,  14.8,  15.2,  15.6,  16. ,  16.4,  16.8,
       17.2,  17.6,  18. ,  18.4,  18.8,  19.2,  19.6],
     [ 20. ,  20.4,  20.8,  21.2,  21.6,  22. ,  22.4,  22.8,  23.2,
       23.6,  24. ,  24.4,  24.8,  25.2,  25.6,  26. ,  26.4,  26.8,
       27.2,  27.6,  28. ,  28.4,  28.8,  29.2,  29.6],
     [ 30. ,  30.4,  30.8,  31.2,  31.6,  32. ,  32.4,  32.8,  33.2,
       33.6,  34. ,  34.4,  34.8,  35.2,  35.6,  36. ,  36.4,  36.8,
       37.2,  37.6,  38. ,  38.4,  38.8,  39.2,  39.6]])

You can see two ways to import CSV files with NumPy. The two calls look the same because both take the file name and the delimiter as parameters, but the functions differ if we look at their documentation.

The genfromtxt function is a bit more complex: it has more parameters with which to fine-tune your loading, for example for handling missing values. In the next article we will use a CSV file with real data and format it to see how we can use it with NumPy.
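One place where the difference shows is missing data. The sketch below uses a made-up two-line sample with an empty field, read from a StringIO object instead of a file. With genfromtxt the gap becomes nan or a fill value of your choice, while loadtxt with the same arguments refuses to parse it.

```python
from io import StringIO

import numpy as np

# Hypothetical sample: the second field of the first row is missing.
text = "1.0,,3.0\n4.0,5.0,6.0\n"

# By default genfromtxt turns the missing float into nan.
with_nan = np.genfromtxt(StringIO(text), delimiter=",")

# filling_values replaces missing entries with a value you choose.
with_zero = np.genfromtxt(StringIO(text), delimiter=",", filling_values=0.0)

# loadtxt cannot convert the empty field to a float.
try:
    np.loadtxt(StringIO(text), delimiter=",")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

print(np.isnan(with_nan[0, 1]))  # True
print(with_zero[0, 1])           # 0.0
print(loadtxt_failed)            # True
```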

Conclusion

We have taken a very brief look at NumPy. Next time we will dive deeper and do some advanced slicing and we will use a CSV file which has different types of values in its columns, as you would expect from a real-life project.

By Gabor Laszlo Hajba | 3/8/2017 | General