Home >Blog

Introduction to Pandas - Data Analysis in Python

In this article I will continue the previous series where we introduced NumPy. Now we will take a look at Pandas, the de-facto standard for data handling with Python.

We ran into some limitations while using NumPy, for instance loading from a CSV file required every column's contents to be strings if there was one column containing a non-number entry. Now we will take a step further and see how Pandas approaches this problem and what can we do with it.

Some words about Pandas

Pandas aims to integrate the functionality of NumPy and matplotlib, to give you a convenient tool to do data analytics and visualization. And beside the integration it does a bit more—it makes the usage better.

To install Pandas simply execute the following command on your command line:

pip install pandas

This will load and install Pandas with all its dependencies, like NumPy. After it’s done we can start our exploration.

The data structures in Pandas are capable of holding elements of any type: Series, DataFrame and Panel. The common factor is that the data structures are labeled.

We will use DataFrames most of the time through this tutorial series but here is a brief introduction to each of them:

Series -- a one-dimensional array-like labeled object, which is capable of holding any type of object.
DataFrame -- a two-dimensional labeled data structure where columns can be of different types.
Panel -- a three-dimensional data structure. You can think of them like dictionaries of DataFrames.

Basic functionality

Let's get started with exploring the basic functionality of Pandas.

>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([1,2,3,4,5, np.nan, "a string", 6])
>>> series
0           1
1           2
2           3
3           4
4           5
5         NaN
6    a string
7           6
dtype: object

In the example above we have created a Series object which contains a lot of different values. One interesting thing to note is that we can reference to not-a-number elements using NumPy's nan symbol which tells us that the element is not a number but it can be used as one numerical type pointing out to be not a number. The type of the series is object which refers to its mixed contents because we have a string included.

If we use only numerical types we get the basic NumPy dtype, float for our series:

>>> series = pd.Series([1,2,np.nan, 4])
>>> series
0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

It does not matter if the data includes a nan—the series will be treated as a series of numbers.

But Series are not what we are interested in but DataFrames because this resembles a 2d array similar to a CSV file or a relational database table:

>>> df = pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3))
>>> df
  0  1  2
0  1  2  3
1  4  5  6
>>> df.dtypes
0    int32
1    int32
2    int32
dtype: object

The default setting shows the numeric index of the rows and the columns but it can be changed to give more sense to the data:

>>> df = pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3), columns=list('ABC'), index=list('XY'))
>>> df
  A  B  C
X  1  2  3
Y  4  5  6

As you can see, the index argument provides the list to use for the rows, while the list provided through the columns argument can be used to alter the column indexes.

If we have bigger datasets, the head method can be useful too. It displays the first n rows provided as an argument, if you do not provide an argument, the default of 5 will be used:

>>> df2 = pd.DataFrame(np.arange(1, 7501).reshape(500,15))
>>> df2.head(2)
  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
>>> df2.head()
  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
2  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45
3  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
4  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75

The opposite is the tail method. It shows the last n lines of the DataFrame, if you omit the argument, the default of 5 will be used:

>>> df2.tail()
      0     1     2     3     4     5     6     7     8     9     10    11  \
495  7426  7427  7428  7429  7430  7431  7432  7433  7434  7435  7436  7437
496  7441  7442  7443  7444  7445  7446  7447  7448  7449  7450  7451  7452
497  7456  7457  7458  7459  7460  7461  7462  7463  7464  7465  7466  7467
498  7471  7472  7473  7474  7475  7476  7477  7478  7479  7480  7481  7482
499  7486  7487  7488  7489  7490  7491  7492  7493  7494  7495  7496  7497

      12    13    14
495  7438  7439  7440
496  7453  7454  7455
497  7468  7469  7470
498  7483  7484  7485
499  7498  7499  7500
>>> df2.tail(1)
      0     1     2     3     4     5     6     7     8     9     10    11  \
499  7486  7487  7488  7489  7490  7491  7492  7493  7494  7495  7496  7497

      12    13    14
499  7498  7499  7500

As you can see, the columns are split among multiple lines if they get too broad but this won't bother us in the future because we won't print our data to the console but will use visualizations to show what we can gather out if it.

Describing data

This is the functionality we will use very often if we have a new dataset to analyze:

>>> df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33,25))
>>> df3.describe()
             0          1          2          3          4          5   \
count  33.000000  33.000000  33.000000  33.000000  33.000000  33.000000
mean   49.000000  49.120000  49.240000  49.360000  49.480000  49.600000
std    29.008619  29.008619  29.008619  29.008619  29.008619  29.008619
min     1.000000   1.120000   1.240000   1.360000   1.480000   1.600000
25%    25.000000  25.120000  25.240000  25.360000  25.480000  25.600000
50%    49.000000  49.120000  49.240000  49.360000  49.480000  49.600000
75%    73.000000  73.120000  73.240000  73.360000  73.480000  73.600000
max    97.000000  97.120000  97.240000  97.360000  97.480000  97.600000

             6          7          8          9     ...             15  \
count  33.000000  33.000000  33.000000  33.000000    ...      33.000000
mean   49.720000  49.840000  49.960000  50.080000    ...      50.800000
std    29.008619  29.008619  29.008619  29.008619    ...      29.008619
min     1.720000   1.840000   1.960000   2.080000    ...       2.800000
25%    25.720000  25.840000  25.960000  26.080000    ...      26.800000
50%    49.720000  49.840000  49.960000  50.080000    ...      50.800000
75%    73.720000  73.840000  73.960000  74.080000    ...      74.800000
max    97.720000  97.840000  97.960000  98.080000    ...      98.800000

             16         17         18         19         20         21  \
count  33.000000  33.000000  33.000000  33.000000  33.000000  33.000000
mean   50.920000  51.040000  51.160000  51.280000  51.400000  51.520000
std    29.008619  29.008619  29.008619  29.008619  29.008619  29.008619
min     2.920000   3.040000   3.160000   3.280000   3.400000   3.520000
25%    26.920000  27.040000  27.160000  27.280000  27.400000  27.520000
50%    50.920000  51.040000  51.160000  51.280000  51.400000  51.520000
75%    74.920000  75.040000  75.160000  75.280000  75.400000  75.520000
max    98.920000  99.040000  99.160000  99.280000  99.400000  99.520000

             22         23         24
count  33.000000  33.000000  33.000000
mean   51.640000  51.760000  51.880000
std    29.008619  29.008619  29.008619
min     3.640000   3.760000   3.880000
25%    27.640000  27.760000  27.880000
50%    51.640000  51.760000  51.880000
75%    75.640000  75.760000  75.880000
max    99.640000  99.760000  99.880000

[8 rows x 25 columns]

As you can see, calling the describe method on the DataFrame prints out a brief summary on each column in the dataset: the number of elements (count), their mean, standard deviation (std), the minimum and maximum values, and some values in-between.

Indexing and Slicing

Slicing DataFrames works using the iloc attribute:

>>> df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33,25))
>>> df3.iloc[:5,:10]
     0      1      2      3      4     5      6      7      8      9
0   1.0   1.12   1.24   1.36   1.48   1.6   1.72   1.84   1.96   2.08
1   4.0   4.12   4.24   4.36   4.48   4.6   4.72   4.84   4.96   5.08
2   7.0   7.12   7.24   7.36   7.48   7.6   7.72   7.84   7.96   8.08
3  10.0  10.12  10.24  10.36  10.48  10.6  10.72  10.84  10.96  11.08
4  13.0  13.12  13.24  13.36  13.48  13.6  13.72  13.84  13.96  14.08

In the example above we have selected the first 5 rows and the first 10 columns. We can implement the head() and tail() methods using the default line-count of 5 with iloc too:

>>> df3.iloc[-5:] # df3.tail(5)
     0      1      2      3      4     5      6      7      8      9   ...    \
28  85.0  85.12  85.24  85.36  85.48  85.6  85.72  85.84  85.96  86.08  ...
29  88.0  88.12  88.24  88.36  88.48  88.6  88.72  88.84  88.96  89.08  ...
30  91.0  91.12  91.24  91.36  91.48  91.6  91.72  91.84  91.96  92.08  ...
31  94.0  94.12  94.24  94.36  94.48  94.6  94.72  94.84  94.96  95.08  ...
32  97.0  97.12  97.24  97.36  97.48  97.6  97.72  97.84  97.96  98.08  ...

     15     16     17     18     19    20     21     22     23     24
28  86.8  86.92  87.04  87.16  87.28  87.4  87.52  87.64  87.76  87.88
29  89.8  89.92  90.04  90.16  90.28  90.4  90.52  90.64  90.76  90.88
30  92.8  92.92  93.04  93.16  93.28  93.4  93.52  93.64  93.76  93.88
31  95.8  95.92  96.04  96.16  96.28  96.4  96.52  96.64  96.76  96.88
32  98.8  98.92  99.04  99.16  99.28  99.4  99.52  99.64  99.76  99.88

[5 rows x 25 columns]

>>> df3.iloc[:5] # df3.head(5)
    0      1      2      3      4     5      6      7      8      9   ...    \
0   1.0   1.12   1.24   1.36   1.48   1.6   1.72   1.84   1.96   2.08  ...
1   4.0   4.12   4.24   4.36   4.48   4.6   4.72   4.84   4.96   5.08  ...
2   7.0   7.12   7.24   7.36   7.48   7.6   7.72   7.84   7.96   8.08  ...
3  10.0  10.12  10.24  10.36  10.48  10.6  10.72  10.84  10.96  11.08  ...
4  13.0  13.12  13.24  13.36  13.48  13.6  13.72  13.84  13.96  14.08  ...

    15     16     17     18     19    20     21     22     23     24
0   2.8   2.92   3.04   3.16   3.28   3.4   3.52   3.64   3.76   3.88
1   5.8   5.92   6.04   6.16   6.28   6.4   6.52   6.64   6.76   6.88
2   8.8   8.92   9.04   9.16   9.28   9.4   9.52   9.64   9.76   9.88
3  11.8  11.92  12.04  12.16  12.28  12.4  12.52  12.64  12.76  12.88
4  14.8  14.92  15.04  15.16  15.28  15.4  15.52  15.64  15.76  15.88

[5 rows x 25 columns]

But Pandas takes an extra step and allows us to access data through labels in DataFrames. In this example it’s not very spectacular because the labels of the dataset are the same as their position. However using the rename method we can easily rename the columns of the DataFrame and we can use those indices to access the values:

>>> df4 = df3.rename(columns=lambda c: chr(65+c))
>>> df4.loc[:5, 'A':'D']
     A      B      C      D
0   1.0   1.12   1.24   1.36
1   4.0   4.12   4.24   4.36
2   7.0   7.12   7.24   7.36
3  10.0  10.12  10.24  10.36
4  13.0  13.12  13.24  13.36
5  16.0  16.12  16.24  16.36
>>> df4.loc[:5, ('A','D')]
     A      D
0   1.0   1.36
1   4.0   4.36
2   7.0   7.36
3  10.0  10.36
4  13.0  13.36
5  16.0  16.36

The most interesting part is the lambda function provided for the columns' names. That's because you need either a dictionary-like parameter or a function which renames the labels. Creating a dictionary would bloat the example so a simple anonymous function is ideal. The chr() function returns the character representation of the provided number, chr(65) equals the character A. The parameter c is the current header of the axis (in this case the columns), which is a number from 0 to 24.

This is something cool for future usage where we deal with data from CSV files.

Reading CSV files

Now it is time for some real-world-like examples. We will use the same dataset already known from the latest NumPy article. If you did not read it or just do not want to look it up here is the file available to download: Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City. I saved it as baby_names.csv.

It contains names given to babies in New York city, based on gender and ethnicity. Let's start by reading in the file:

>>> import numpy as np
>>> import pandas as pd
>>> baby_names = pd.read_csv('baby_names.csv')
>>> baby_names.head()
  BRTH_YR    GNDR    ETHCTY         NM  CNT  RNK
0     2011  FEMALE  HISPANIC  GERALDINE   13   75
1     2011  FEMALE  HISPANIC        GIA   21   67
2     2011  FEMALE  HISPANIC     GIANNA   49   42
3     2011  FEMALE  HISPANIC    GISELLE   38   51
4     2011  FEMALE  HISPANIC      GRACE   36   53
>>> baby_names.tail()
      BRTH_YR  GNDR              ETHCTY       NM  CNT  RNK
13957     2014  MALE  WHITE NON HISPANIC   Yousef   18   94
13958     2014  MALE  WHITE NON HISPANIC  Youssef   24   88
13959     2014  MALE  WHITE NON HISPANIC    Yusuf   16   96
13960     2014  MALE  WHITE NON HISPANIC  Zachary   90   39
13961     2014  MALE  WHITE NON HISPANIC      Zev   49   65

You can see, that reading a CSV file happens with the read_csv function. The result is a DataFrame—and unlike with NumPy we did not have to tell Pandas to read in all the data as Strings. This means we have columns containing numeric values as numeric types in the DataFrame. We can verify this if we look at the description of the DataFrame:

>>> baby_names.describe()
           BRTH_YR           CNT           RNK
count  13962.000000  13962.000000  13962.000000
mean    2011.880318     34.531657     56.944349
std        1.134940     40.176370     25.361691
min     2011.000000     10.000000      1.000000
25%     2011.000000     13.000000     38.000000
50%     2011.000000     20.000000     59.000000
75%     2013.000000     36.000000     78.000000
max     2014.000000    426.000000    102.000000

You can see that descrcibe() only selects the columns which have numeric data. Naturally some information makes no sense, like the RNK column (only the count is useful) or the mean and std in BRTH_YR.

You may have noticed, that the columns in the DataFrame have labels which are identical to the headers in the CSV file. This is a nice feature in Pandas which becomes useful when you do not need to know the index of the column you want to access.

DataFrames can be sorted along columns or axis and the shape will be maintained after the sort. Now we have a real dataset so let's try out sorting. Let's find the first five names with the lowest and the highest count in all the data (remember: we have years from 2011 to 2014—the sort will only count the min and max based on each year's data and not sum up the same names over the years):

>>> baby_names.head()
  BRTH_YR    GNDR    ETHCTY         NM  CNT  RNK
0     2011  FEMALE  HISPANIC  GERALDINE   13   75
1     2011  FEMALE  HISPANIC        GIA   21   67
2     2011  FEMALE  HISPANIC     GIANNA   49   42
3     2011  FEMALE  HISPANIC    GISELLE   38   51
4     2011  FEMALE  HISPANIC      GRACE   36   53
>>> baby_names.sort_values(by='CNT').head()
      BRTH_YR    GNDR                      ETHCTY          NM  CNT  RNK
8744      2012  FEMALE              WHITE NON HISP         MAE   10   83
6746      2011  FEMALE          WHITE NON HISPANIC       LEILA   10   81
2389      2011    MALE                    HISPANIC       ALLAN   10   94
11009     2013    MALE  ASIAN AND PACIFIC ISLANDER      Martin   10   57
11013     2013    MALE  ASIAN AND PACIFIC ISLANDER  Maximilian   10   57
>>> baby_names.sort_values(by='CNT', ascending=False).head()
     BRTH_YR  GNDR    ETHCTY      NM  CNT  RNK
1504     2011  MALE  HISPANIC  JAYDEN  426    1
5430     2011  MALE  HISPANIC  JAYDEN  426    1
7393     2011  MALE  HISPANIC  JAYDEN  426    1
3505     2011  MALE  HISPANIC  JAYDEN  426    1
9385     2012  MALE  HISPANIC  JAYDEN  364    1

As you can see we have identified a problem in the dataset—we have entries which have the same content. This makes our data less usable as it is and we have to tidy it up to gain any real insights. But this is the good thing in data science and we will look at an approach how to filter out duplicates.

Conclusion

We have seen that Pandas is the next step to data analytics with Python after NumPy because it lets us handle data in a better way. However we cannot disregard NumPy because Pandas builds on top of NumPy and matplotlib to give us a single point where you can do your data analysis and visualization. Pandas really leverages the functionality if we import CSV files with mixed content—you do not have to worry about the conversions.

In the next article we will tidy our data from duplicates and do some data munging to visualize our dataset with matplotlib provide insights for the user.

Find and compare thousands of development tools with DiscoverSDK.

By Gabor Laszlo Hajba | 3/20/2017 | General