Python and Pandas - Part 2
In this article we follow on from the previous one on Pandas and take a deeper look at how to use this library for further data analysis. We will look at identifying duplicates and tidying up our data so that we can create useful insights later.
We will continue to use the dataset of Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City between 2011 and 2014.
We have seen in the previous article that we can access DataFrame columns through the loc and iloc indexers, providing either an index or the name of the column.
But there is an alternative: you can use the dot (.) operator on the DataFrame and reference the column name directly. This returns the selected column as a Series:
>>> baby_names.BRTH_YR.head()
0    2011
1    2011
2    2011
3    2011
4    2011
Name: BRTH_YR, dtype: int64
It is perfectly fine to stick with either approach. Sometimes I use direct attribute access, other times I go with loc/iloc slicing.
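As a small sketch of that equivalence (on a toy DataFrame standing in for baby_names):

```python
import pandas as pd

# Toy DataFrame standing in for baby_names
df = pd.DataFrame({'BRTH_YR': [2011, 2011, 2012],
                   'NM': ['Gia', 'Grace', 'Ava']})

# Attribute (dot) access, bracket access and loc all return the same Series
s1 = df.BRTH_YR
s2 = df['BRTH_YR']
s3 = df.loc[:, 'BRTH_YR']

print(s1.equals(s2) and s2.equals(s3))  # True
```

Note that attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute such as count or shape; in those cases you have to fall back to brackets or loc.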
Typos and Other Entry Errors
This is a hard topic, because typos are small mistakes that are easy to miss, and it is difficult to come up with an automated solution to fix them. However, sometimes you can spot problems in the dataset that are easy to solve, like different spellings of the same value. If you have taken a look at the baby_names.csv file, you probably noticed that the ethnicity appears in different forms:
>>> import numpy as np
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACI', 'ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISP',
       'BLACK NON HISPANIC', 'HISPANIC', 'WHITE NON HISP',
       'WHITE NON HISPANIC'], dtype=object)
In some cases the longer values have been trimmed. To avoid ambiguity later, we will bring every entry to its long form:
>>> baby_names.loc[baby_names.ETHCTY == 'ASIAN AND PACI', 'ETHCTY'] = 'ASIAN AND PACIFIC ISLANDER'
>>> baby_names.loc[baby_names.ETHCTY == 'BLACK NON HISP', 'ETHCTY'] = 'BLACK NON HISPANIC'
>>> baby_names.loc[baby_names.ETHCTY == 'WHITE NON HISP', 'ETHCTY'] = 'WHITE NON HISPANIC'
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'HISPANIC',
       'WHITE NON HISPANIC'], dtype=object)
To do this we create a selector for each of the three values that come in two flavors, apply it to the DataFrame's loc, and assign the new value, which replaces the old values with the new ones. At the end we verify that the values were really updated and the short forms are gone from the dataset.
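The same clean-up can also be sketched with Series.replace and a mapping dict, which scales better if more short forms turn up (shown here on a toy frame with the same column name):

```python
import pandas as pd

# Toy frame with the same short/long forms as the dataset
df = pd.DataFrame({'ETHCTY': ['ASIAN AND PACI', 'BLACK NON HISP',
                              'HISPANIC', 'WHITE NON HISP']})

short_to_long = {
    'ASIAN AND PACI': 'ASIAN AND PACIFIC ISLANDER',
    'BLACK NON HISP': 'BLACK NON HISPANIC',
    'WHITE NON HISP': 'WHITE NON HISPANIC',
}

# replace() maps listed values and leaves everything else untouched
df['ETHCTY'] = df['ETHCTY'].replace(short_to_long)
print(sorted(df['ETHCTY'].unique()))
```

The advantage is that the whole mapping lives in one place instead of three separate loc assignments.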
Another problem is that the names are in all capital letters. We can build on the example above to improve the dataset further:
>>> baby_names.NM.head()
0    GERALDINE
1          GIA
2       GIANNA
3      GISELLE
4        GRACE
Name: NM, dtype: object
>>> baby_names.NM = baby_names.NM.str.capitalize()
>>> baby_names.NM.head()
0    Geraldine
1          Gia
2       Gianna
3      Giselle
4        Grace
Name: NM, dtype: object
You can see that we’ve overwritten the values in the column with their capitalized forms. This will help later when eliminating duplicates and when displaying the data.
Identifying Duplicate Data and Clearing the Dataset
This use case is really useful. Every time you get a new dataset you should keep an eye out for duplicated rows, and if you find some, ask your customer what to do with the duplicated information. Most of the time the answer will be to tidy up the dataset and then re-run your analysis.
Duplicates can be identified with the duplicated() method of the DataFrame class. This method returns a Series of True and False values that you can use to display duplicated rows:
>>> baby_names.duplicated().head()
0    False
1    False
2    False
3    False
4    False
dtype: bool
>>> baby_names[baby_names.duplicated()].head()
      BRTH_YR    GNDR              ETHCTY        NM  CNT  RNK
1963     2011  FEMALE  WHITE NON HISPANIC  Scarlett   31   60
1964     2011  FEMALE  WHITE NON HISPANIC    Serena   16   75
1965     2011  FEMALE  WHITE NON HISPANIC    Shaina   16   75
1966     2011  FEMALE  WHITE NON HISPANIC  Shaindel   15   76
1967     2011  FEMALE  WHITE NON HISPANIC   Shaindy   48   44
You can see that when we pass the Series containing the identified duplicates to the original DataFrame as a slice argument, we get back all the rows (a subset of the DataFrame) which are duplicated. The excerpt does not show anything interesting; it looks as if the values were not duplicates at all. But if we sort the results, we will see that there is a problem:
>>> baby_names[baby_names.duplicated()].sort_values(by='NM').head()
      BRTH_YR    GNDR              ETHCTY       NM  CNT  RNK
2582     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
4082     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
2766     2011  FEMALE            HISPANIC  Aaliyah   63   30
4266     2011  FEMALE            HISPANIC  Aaliyah   63   30
6045     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
We can see that there are duplicated rows in the dataset. If we dig a bit deeper, we find that all the duplicates come from the year 2011:
>>> baby_names[baby_names.duplicated()].describe()
       BRTH_YR          CNT          RNK
count   5889.0  5889.000000  5889.000000
mean    2011.0    34.602140    56.494142
std        0.0    40.513511    25.012470
min     2011.0    10.000000     1.000000
25%     2011.0    13.000000    37.000000
50%     2011.0    20.000000    60.000000
75%     2011.0    36.000000    77.000000
max     2011.0   426.000000    97.000000
Now it’s time to tidy our list. Fortunately, Pandas thought about this too: DataFrames have a method called drop_duplicates() which, as its name suggests, drops the duplicates and returns the tidied DataFrame. There are arguments you can pass along to fine-tune the tidying, for example to drop the duplicates in place, but it is good practice to keep the original dataset around while you explore your data, because sometimes you need the original version, and it won't be easy to reproduce it after you’ve applied a lot of transformations.
>>> clean_baby_names = baby_names.drop_duplicates()
>>> clean_baby_names.describe()
           BRTH_YR          CNT          RNK
count  8073.000000  8073.000000  8073.000000
mean   2012.522482    34.480243    57.272761
std       1.118043    39.931078    25.609985
min    2011.000000    10.000000     1.000000
25%    2012.000000    13.000000    38.000000
50%    2013.000000    20.000000    59.000000
75%    2014.000000    36.000000    78.000000
max    2014.000000   426.000000   102.000000
>>> clean_baby_names[clean_baby_names.NM == 'Aaliyah']
       BRTH_YR    GNDR              ETHCTY       NM  CNT  RNK
1110      2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
1294      2011  FEMALE            HISPANIC  Aaliyah   63   30
8031      2012  FEMALE  BLACK NON HISPANIC  Aaliyah   55   10
8202      2012  FEMALE            HISPANIC  Aaliyah   68   26
10031     2013  FEMALE  BLACK NON HISPANIC  Aaliyah   73    3
10212     2013  FEMALE            HISPANIC  Aaliyah   56   33
12072     2014  FEMALE  BLACK NON HISPANIC  Aaliyah   67    4
12259     2014  FEMALE            HISPANIC  Aaliyah   55   36
I have assigned the cleaned data to a new variable. Now the data looks clean, because the mean of the BRTH_YR column lies between 2012 and 2013, the middle of our dataset.
The drop_duplicates() method treats rows as duplicates only if all of their columns have the same values. Therefore we can rest assured that we did not lose any information in the process.
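As a side note, drop_duplicates() also takes a keep argument that controls which copy survives; a small sketch on toy data:

```python
import pandas as pd

# Rows 0 and 1 are full duplicates of each other
df = pd.DataFrame({'NM': ['Aaliyah', 'Aaliyah', 'Mia'],
                   'CNT': [69, 69, 20]})

# By default the first occurrence is kept...
print(df.drop_duplicates().index.tolist())              # [0, 2]
# ...with keep='last' the last occurrence survives instead
print(df.drop_duplicates(keep='last').index.tolist())   # [1, 2]
```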
Alternatively, we can select which columns are used to identify duplicates. For example, imagine that a name is entered twice for the same year and ethnicity but with different counts and ranks. Such rows would remain in the dataset, which we do not want because they would skew our analysis. To identify such entries, we can fine-tune our query:
>>> clean_baby_names.duplicated().sum()
0
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY']).sum()
65
Wow, it seems we have found duplicated data! Let's verify that these are really duplicates before we delete them:
>>> clean_baby_names[clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY'])].sort_values('NM').head()
       BRTH_YR    GNDR              ETHCTY      NM  CNT  RNK
13344     2014    MALE            HISPANIC  Alexis   40   68
1132      2011  FEMALE  BLACK NON HISPANIC   Angel   16   39
9551      2012    MALE  WHITE NON HISPANIC   Ariel   20   84
9268      2012    MALE            HISPANIC   Ariel   18   92
1340      2011  FEMALE            HISPANIC   Ariel   17   71
>>> clean_baby_names[(clean_baby_names.NM == 'Alexis') & (clean_baby_names.ETHCTY == 'HISPANIC') & (clean_baby_names.BRTH_YR == 2014)]
       BRTH_YR    GNDR    ETHCTY      NM  CNT  RNK
12272     2014  FEMALE  HISPANIC  Alexis   13   70
13344     2014    MALE  HISPANIC  Alexis   40   68
With this filter we get only one result for the name Alexis, which is suspicious, so we fetch all entries for Alexis where the year is 2014 and the ethnicity is Hispanic. Now we see that there are two results, one for females and one for males, so these are not real duplicates. We could keep checking each name in the list to verify that they are indeed split between MALE and FEMALE, but it is simpler to add the gender column to our duplicate filter:
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY', 'GNDR']).sum()
0
OK, no more duplicates. Naturally, if you have good domain knowledge you can identify names that can only be male or female and filter or aggregate them accordingly.
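If you lack that domain knowledge, one hedged way to sketch the check is to group by name and count distinct genders; any name with more than one is genuinely used for both in the data (toy values, column names as in the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'NM':   ['Alexis', 'Alexis', 'Grace', 'Liam'],
    'GNDR': ['FEMALE', 'MALE', 'FEMALE', 'MALE'],
})

# Count how many distinct genders each name appears with
genders_per_name = df.groupby('NM')['GNDR'].nunique()

# Names appearing for more than one gender are not duplicates but unisex names
unisex = genders_per_name[genders_per_name > 1].index.tolist()
print(unisex)  # ['Alexis']
```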
Handling NaN and N/A
Sometimes you have to deal with datasets where some fields are not filled in, but your calculations need a value there. Machine Learning is the obvious example, but we do not have to go that far: aggregation functions need values too. To solve this problem Pandas provides the fillna method.
As an example, let's change some values in our baby_names DataFrame to missing:
>>> import numpy as np
>>> na_baby_names = clean_baby_names.copy()
>>> na_baby_names.loc[na_baby_names.NM == 'Riley', 'CNT'] = np.nan
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.describe()
           BRTH_YR          CNT
count  8073.000000  8049.000000
mean   2012.522482    34.527519
std       1.118043    39.978346
min    2011.000000    10.000000
25%    2012.000000    13.000000
50%    2013.000000    20.000000
75%    2014.000000    36.000000
max    2014.000000   426.000000
>>> na_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8049 non-null float64
RNK        8073 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 441.5+ KB
I have created a copy of the cleaned dataset to keep the original data intact. Looking at the describe() results, we see that the RNK column is missing, because it now holds string values and cannot be used for statistical analysis, and that the count of the CNT column is smaller than that of the BRTH_YR column, because there are some NaN values. Calling the info() method gives us a more detailed description.
>>> na_baby_names[na_baby_names.CNT.isnull()].head()
      BRTH_YR    GNDR              ETHCTY     NM  CNT  RNK
156      2011  FEMALE            HISPANIC  Riley  NaN   76
473      2011  FEMALE  WHITE NON HISPANIC  Riley  NaN   63
1250     2011  FEMALE  BLACK NON HISPANIC  Riley  NaN   40
1608     2011    MALE            HISPANIC  Riley  NaN   86
1889     2011    MALE  WHITE NON HISPANIC  Riley  NaN   93
To fix this we will fill in the missing values with the mean count of each year's names. I know this is not the best approach, but it is better than guessing a value or leaving the fields as NaN:
>>> for y in np.unique(na_baby_names.loc[(na_baby_names.CNT.isnull()), 'BRTH_YR']):
...     na_baby_names.loc[(na_baby_names.CNT.isnull()) & (na_baby_names.BRTH_YR == y), 'CNT'] = na_baby_names.loc[(na_baby_names.BRTH_YR == y), 'CNT'].mean().astype(int)
...
>>> na_baby_names.describe()
           BRTH_YR          CNT
count  8073.000000  8073.000000
mean   2012.522482    34.525579
std       1.118043    39.918905
min    2011.000000    10.000000
25%    2012.000000    13.000000
50%    2013.000000    20.000000
75%    2014.000000    36.000000
max    2014.000000   426.000000
The solution iterates over the unique years that have missing values and sets each missing value to the integer result (cropping the decimal part) of the mean of the other counts from the same year.
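The same per-year fill can also be sketched without an explicit loop, using groupby and transform to compute each year's mean in one pass (shown on toy values; astype(int) mirrors the cropping above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'BRTH_YR': [2011, 2011, 2011, 2012, 2012],
    'CNT':     [10.0, np.nan, 30.0, np.nan, 40.0],
})

# Per-year mean of CNT, broadcast back to the original row order
year_mean = df.groupby('BRTH_YR')['CNT'].transform('mean')

# Fill only the missing values, cropping the decimal part like above
df['CNT'] = df['CNT'].fillna(year_mean.astype(int))
print(df['CNT'].tolist())  # [10.0, 20.0, 30.0, 40.0, 40.0]
```

transform returns a Series aligned with the original index, so fillna can match each NaN with the mean of its own year.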
Now it is time to handle the empty values in the RNK column. Again, there are different approaches depending on your dataset. Here we know that the column should hold only numeric (more specifically, integer) values. One option is to replace the empty fields with np.nan:
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK'] = np.nan
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK']
Series([], Name: RNK, dtype: object)
>>> na_baby_names.loc[(na_baby_names.RNK.isnull()), 'RNK']
249      NaN
723      NaN
936      NaN
7888     NaN
8255     NaN
8552     NaN
9079     NaN
9273     NaN
9559     NaN
9890     NaN
10068    NaN
10273    NaN
10567    NaN
10915    NaN
11085    NaN
11285    NaN
11579    NaN
11926    NaN
12107    NaN
12323    NaN
12611    NaN
13170    NaN
13359    NaN
Name: RNK, dtype: object
Well, this is a very basic approach and only works for the empty string. If the column contains other non-numeric strings too, we would have to track down each of those values and set them to NaN individually, and that is a no-go. Fortunately, Pandas has a solution for this:
>>> import pandas as pd
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.RNK = pd.to_numeric(na_baby_names.RNK)
>>> na_baby_names.loc[(na_baby_names.NM == 'Avery'), 'RNK'].head()
249     NaN
723     NaN
936     NaN
7888    NaN
8255    NaN
Name: RNK, dtype: float64
The to_numeric function converts the values of a Series (or of a DataFrame's columns) to numeric values. Empty strings are treated as missing and become NaN; for other non-numeric strings you can pass errors='coerce' to convert them to NaN as well instead of raising an exception.
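A small sketch of that behaviour: the 'n/a' entry below would raise with the default errors='raise', so errors='coerce' is the safe way to turn anything unparsable into NaN:

```python
import pandas as pd

s = pd.Series(['12', '', 'n/a', '7'])

# coerce: anything that cannot be parsed as a number becomes NaN
converted = pd.to_numeric(s, errors='coerce')
print(converted.dtype)  # float64, because NaN forces a float column
```

After the conversion you can fill or drop the NaN values with the techniques shown above.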
Information on Datasets
We already know how to access basic information such as the shape or the number of elements of a DataFrame. But Pandas offers more insight through the info() method. It gives us memory usage data too, which helps to identify large datasets and to decide whether to filter in place or purge unnecessary data.
>>> clean_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8073 non-null int64
RNK        8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 441.5+ KB
The + symbol indicates that the true memory usage can be higher, because Pandas does not count the memory used by the values in columns with dtype=object. But we can tell Pandas to report the real memory usage by passing the memory_usage argument with the value 'deep' to the info() method:
>>> clean_baby_names.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8073 non-null int64
RNK        8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 1.9 MB
Now we see that our dataset eats almost 2 MB. That is not so bad, but remember, we are working with an example dataset. Stepping out into real-life projects, you can get significantly larger data, and then you have to consider the memory usage of your application.
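If the aggregate number from info() is not detailed enough, the memory_usage() method gives a per-column breakdown and accepts the same deep flag; a quick sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'BRTH_YR': [2011, 2012, 2013],
    'NM': ['Geraldine', 'Gia', 'Gianna'],
})

shallow = df.memory_usage()         # counts only the pointers for object columns
deep = df.memory_usage(deep=True)   # counts the string contents as well

# The object column grows under deep accounting, the int column does not
print(bool(deep['NM'] > shallow['NM']))            # True
print(bool(deep['BRTH_YR'] == shallow['BRTH_YR'])) # True
```

This is handy when deciding which column to drop or convert (for example to a categorical dtype) to save memory.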
We have seen that real datasets come with unintentional flaws. We need to validate a dataset before jumping in and doing analysis, and to that end we have seen how to identify duplicates and clean up missing data.
In the next article we will continue our journey and will learn how to aggregate our data. We will use this knowledge to identify and remove outliers in the dataset which could flaw our analysis.
Be sure to use DiscoverSDK’s search to find all the best development tools in one place.