Python and Pandas - Part 2
In this article we follow on from the previous one on Pandas and take a deeper look at how to use this library for further data analysis. We will look at identifying duplicates and tidying up our data so that we can create useful insights later.
We will continue to use the dataset of Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City between 2011 and 2014.
We have seen in the previous article that we can access DataFrame columns through the loc and iloc indexers, providing either an index or the name of the column.
But there is an alternative: you can use the dot (.) operator on the DataFrame and reference the column name directly. This returns the selected column as a Series:
>>> baby_names.BRTH_YR.head()
0    2011
1    2011
2    2011
3    2011
4    2011
Name: BRTH_YR, dtype: int64
It is perfectly fine to stick with either approach. Sometimes I use direct attribute access, other times I go with loc/iloc slicing.
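As a small sketch of that equivalence (on a toy DataFrame standing in for baby_names):

```python
import pandas as pd

# Toy DataFrame standing in for baby_names
df = pd.DataFrame({'BRTH_YR': [2011, 2011, 2012],
                   'NM': ['Gia', 'Grace', 'Ava']})

# Attribute (dot) access, bracket access and loc all return the same Series
s1 = df.BRTH_YR
s2 = df['BRTH_YR']
s3 = df.loc[:, 'BRTH_YR']

print(s1.equals(s2) and s2.equals(s3))  # True
```

Note that attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute such as count or shape; in those cases you have to fall back to brackets or loc.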
Typos and Other Entry Errors
This is a hard topic, because typos are small mistakes that are easy to miss, and it is difficult to come up with an automated solution to fix them. However, sometimes you can spot problems in the dataset that are easy to solve, like different spellings of the same value. If you have taken a look at the baby_names.csv file, you probably noticed that the ethnicity appears in different forms:
>>> import numpy as np
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACI', 'ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISP',
       'BLACK NON HISPANIC', 'HISPANIC', 'WHITE NON HISP',
       'WHITE NON HISPANIC'], dtype=object)
In some cases the longer values have been trimmed. To avoid ambiguity later, we will bring every entry to its long form:
>>> baby_names.loc[baby_names.ETHCTY == 'ASIAN AND PACI', 'ETHCTY'] = 'ASIAN AND PACIFIC ISLANDER'
>>> baby_names.loc[baby_names.ETHCTY == 'BLACK NON HISP', 'ETHCTY'] = 'BLACK NON HISPANIC'
>>> baby_names.loc[baby_names.ETHCTY == 'WHITE NON HISP', 'ETHCTY'] = 'WHITE NON HISPANIC'
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'HISPANIC',
       'WHITE NON HISPANIC'], dtype=object)
To do this we create a selector for each of the three values that come in two flavors, apply it to the DataFrame's loc, and assign the new value, which replaces the old values with the new ones. At the end we verify that the values were really updated and the short forms are gone from the dataset.
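The same clean-up can also be sketched with Series.replace and a mapping dict, which scales better if more short forms turn up (shown here on a toy frame with the same column name):

```python
import pandas as pd

# Toy frame with the same short/long forms as the dataset
df = pd.DataFrame({'ETHCTY': ['ASIAN AND PACI', 'BLACK NON HISP',
                              'HISPANIC', 'WHITE NON HISP']})

short_to_long = {
    'ASIAN AND PACI': 'ASIAN AND PACIFIC ISLANDER',
    'BLACK NON HISP': 'BLACK NON HISPANIC',
    'WHITE NON HISP': 'WHITE NON HISPANIC',
}

# replace() maps listed values and leaves everything else untouched
df['ETHCTY'] = df['ETHCTY'].replace(short_to_long)
print(sorted(df['ETHCTY'].unique()))
```

The advantage is that the whole mapping lives in one place instead of three separate loc assignments.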
Another problem is that the names are in all capital letters. We can build on the example above to improve the dataset further:
>>> baby_names.NM.head()
0    GERALDINE
1          GIA
2       GIANNA
3      GISELLE
4        GRACE
Name: NM, dtype: object
>>> baby_names.NM = baby_names.NM.str.capitalize()
>>> baby_names.NM.head()
0    Geraldine
1          Gia
2       Gianna
3      Giselle
4        Grace
Name: NM, dtype: object
You can see that we’ve overwritten the values in the column with their capitalized forms. This will help later when eliminating duplicates and when displaying the data.
Identifying Duplicate Data and Clearing the Dataset
This use case is really useful. Every time you get a new dataset you should keep an eye out for duplicated rows, and if you find some, ask your customer what to do with the duplicated information. Most of the time the answer will be to tidy up the dataset and then re-run your analysis.
Duplicates can be identified with the duplicated() method of the DataFrame class. This method returns a Series of True and False values that you can use to display duplicated rows:
>>> baby_names.duplicated().head()
0    False
1    False
2    False
3    False
4    False
dtype: bool
>>> baby_names[baby_names.duplicated()].head()
      BRTH_YR    GNDR              ETHCTY        NM  CNT  RNK
1963     2011  FEMALE  WHITE NON HISPANIC  Scarlett   31   60
1964     2011  FEMALE  WHITE NON HISPANIC    Serena   16   75
1965     2011  FEMALE  WHITE NON HISPANIC    Shaina   16   75
1966     2011  FEMALE  WHITE NON HISPANIC  Shaindel   15   76
1967     2011  FEMALE  WHITE NON HISPANIC   Shaindy   48   44
You can see that when we pass the Series containing the identified duplicates to the original DataFrame as a slice argument, we get back all the rows (a subset of the DataFrame) which are duplicated. The excerpt does not show anything interesting; it looks as if the values were not duplicates at all. But if we sort the results, we will see that there is a problem:
>>> baby_names[baby_names.duplicated()].sort_values(by='NM').head()
      BRTH_YR    GNDR              ETHCTY       NM  CNT  RNK
2582     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
4082     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
2766     2011  FEMALE            HISPANIC  Aaliyah   63   30
4266     2011  FEMALE            HISPANIC  Aaliyah   63   30
6045     2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
We can see that there are duplicated rows in the dataset. If we dig a bit deeper, we find that all the duplicates come from the year 2011:
>>> baby_names[baby_names.duplicated()].describe()
       BRTH_YR          CNT          RNK
count   5889.0  5889.000000  5889.000000
mean    2011.0    34.602140    56.494142
std        0.0    40.513511    25.012470
min     2011.0    10.000000     1.000000
25%     2011.0    13.000000    37.000000
50%     2011.0    20.000000    60.000000
75%     2011.0    36.000000    77.000000
max     2011.0   426.000000    97.000000
Now it’s time to tidy our list. Fortunately, Pandas thought about this too: DataFrames have a method called drop_duplicates() which, as its name suggests, drops the duplicates and returns the tidied DataFrame. There are arguments you can pass along to fine-tune the tidying, for example to drop the duplicates in place, but it is good practice to keep the original dataset around while you explore your data, because sometimes you need the original version, and it won't be easy to reproduce it after you’ve applied a lot of transformations.
>>> clean_baby_names = baby_names.drop_duplicates()
>>> clean_baby_names.describe()
           BRTH_YR          CNT          RNK
count  8073.000000  8073.000000  8073.000000
mean   2012.522482    34.480243    57.272761
std       1.118043    39.931078    25.609985
min    2011.000000    10.000000     1.000000
25%    2012.000000    13.000000    38.000000
50%    2013.000000    20.000000    59.000000
75%    2014.000000    36.000000    78.000000
max    2014.000000   426.000000   102.000000
>>> clean_baby_names[clean_baby_names.NM == 'Aaliyah']
       BRTH_YR    GNDR              ETHCTY       NM  CNT  RNK
1110      2011  FEMALE  BLACK NON HISPANIC  Aaliyah   69    5
1294      2011  FEMALE            HISPANIC  Aaliyah   63   30
8031      2012  FEMALE  BLACK NON HISPANIC  Aaliyah   55   10
8202      2012  FEMALE            HISPANIC  Aaliyah   68   26
10031     2013  FEMALE  BLACK NON HISPANIC  Aaliyah   73    3
10212     2013  FEMALE            HISPANIC  Aaliyah   56   33
12072     2014  FEMALE  BLACK NON HISPANIC  Aaliyah   67    4
12259     2014  FEMALE            HISPANIC  Aaliyah   55   36
I have assigned the cleaned data to a new variable. Now the data looks clean, because the mean of the BRTH_YR column lies between 2012 and 2013, the middle of our dataset.
The drop_duplicates() method treats rows as duplicates only if all of their columns have the same values. Therefore we can rest assured that we did not lose any information in the process.
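As a side note, drop_duplicates() also takes a keep argument that controls which copy survives; a small sketch on toy data:

```python
import pandas as pd

# Rows 0 and 1 are full duplicates of each other
df = pd.DataFrame({'NM': ['Aaliyah', 'Aaliyah', 'Mia'],
                   'CNT': [69, 69, 20]})

# By default the first occurrence is kept...
print(df.drop_duplicates().index.tolist())              # [0, 2]
# ...with keep='last' the last occurrence survives instead
print(df.drop_duplicates(keep='last').index.tolist())   # [1, 2]
```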
Alternatively, we can select which columns are used to identify duplicates. For example, imagine that a name is entered twice for the same year and ethnicity but with different counts and ranks. Such rows would remain in the dataset, which we do not want because they would skew our analysis. To identify such entries, we can fine-tune our query:
>>> clean_baby_names.duplicated().sum()
0
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY']).sum()
65
Wow, it seems we have found duplicated data! Let's verify that these are really duplicates before we delete them:
>>> clean_baby_names[clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY'])].sort_values('NM').head()
       BRTH_YR    GNDR              ETHCTY      NM  CNT  RNK
13344     2014    MALE            HISPANIC  Alexis   40   68
1132      2011  FEMALE  BLACK NON HISPANIC   Angel   16   39
9551      2012    MALE  WHITE NON HISPANIC   Ariel   20   84
9268      2012    MALE            HISPANIC   Ariel   18   92
1340      2011  FEMALE            HISPANIC   Ariel   17   71
>>> clean_baby_names[(clean_baby_names.NM == 'Alexis') & (clean_baby_names.ETHCTY == 'HISPANIC') & (clean_baby_names.BRTH_YR == 2014)]
       BRTH_YR    GNDR    ETHCTY      NM  CNT  RNK
12272     2014  FEMALE  HISPANIC  Alexis   13   70
13344     2014    MALE  HISPANIC  Alexis   40   68
With this filter we get only one result for the name Alexis, which is suspicious, so we fetch all entries for Alexis where the year is 2014 and the ethnicity is Hispanic. Now we see that there are two results, one for females and one for males, so these are not real duplicates. We could keep checking each name in the list to verify that they are indeed split between MALE and FEMALE, but it is simpler to add the gender column to our duplicate filter:
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR', 'ETHCTY', 'GNDR']).sum()
0
OK, no more duplicates. Naturally, if you have good domain knowledge you can identify names that can only be male or female and filter or aggregate them accordingly.
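If you lack that domain knowledge, one hedged way to sketch the check is to group by name and count distinct genders; any name with more than one is genuinely used for both in the data (toy values, column names as in the dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'NM':   ['Alexis', 'Alexis', 'Grace', 'Liam'],
    'GNDR': ['FEMALE', 'MALE', 'FEMALE', 'MALE'],
})

# Count how many distinct genders each name appears with
genders_per_name = df.groupby('NM')['GNDR'].nunique()

# Names appearing for more than one gender are not duplicates but unisex names
unisex = genders_per_name[genders_per_name > 1].index.tolist()
print(unisex)  # ['Alexis']
```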
Handling NaN and N/A
Sometimes you have to deal with datasets where some fields are not filled in, but your calculations need a value there. Machine Learning is the obvious example, but we do not have to go that far: aggregation functions need values too. To solve this problem Pandas provides the fillna method.
As an example, let's change some values in our baby_names DataFrame to missing:
>>> import numpy as np
>>> na_baby_names = clean_baby_names.copy()
>>> na_baby_names.loc[na_baby_names.NM == 'Riley', 'CNT'] = np.nan
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.describe()
           BRTH_YR          CNT
count  8073.000000  8049.000000
mean   2012.522482    34.527519
std       1.118043    39.978346
min    2011.000000    10.000000
25%    2012.000000    13.000000
50%    2013.000000    20.000000
75%    2014.000000    36.000000
max    2014.000000   426.000000
>>> na_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8049 non-null float64
RNK        8073 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 441.5+ KB
I have created a copy of the cleaned dataset to keep the original data intact. Looking at the describe() results, we see that the RNK column is missing, because it now holds string values and cannot be used for statistical analysis, and that the count of the CNT column is smaller than that of the BRTH_YR column, because there are some NaN values. Calling the info() method gives us a more detailed description.
>>> na_baby_names[na_baby_names.CNT.isnull()].head()
      BRTH_YR    GNDR              ETHCTY     NM  CNT  RNK
156      2011  FEMALE            HISPANIC  Riley  NaN   76
473      2011  FEMALE  WHITE NON HISPANIC  Riley  NaN   63
1250     2011  FEMALE  BLACK NON HISPANIC  Riley  NaN   40
1608     2011    MALE            HISPANIC  Riley  NaN   86
1889     2011    MALE  WHITE NON HISPANIC  Riley  NaN   93
To fix this we will fill in the missing values with the mean count of each year's names. I know this is not the best approach, but it is better than guessing a value or leaving the fields as NaN:
>>> for y in np.unique(na_baby_names.loc[(na_baby_names.CNT.isnull()), 'BRTH_YR']):
...     na_baby_names.loc[(na_baby_names.CNT.isnull()) & (na_baby_names.BRTH_YR == y), 'CNT'] = na_baby_names.loc[(na_baby_names.BRTH_YR == y), 'CNT'].mean().astype(int)
...
>>> na_baby_names.describe()
           BRTH_YR          CNT
count  8073.000000  8073.000000
mean   2012.522482    34.525579
std       1.118043    39.918905
min    2011.000000    10.000000
25%    2012.000000    13.000000
50%    2013.000000    20.000000
75%    2014.000000    36.000000
max    2014.000000   426.000000
The solution iterates over the unique years that have missing values and sets each missing value to the integer result (cropping the decimal part) of the mean of the other counts from the same year.
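The same per-year fill can also be sketched without an explicit loop, using groupby and transform to compute each year's mean in one pass (shown on toy values; astype(int) mirrors the cropping above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'BRTH_YR': [2011, 2011, 2011, 2012, 2012],
    'CNT':     [10.0, np.nan, 30.0, np.nan, 40.0],
})

# Per-year mean of CNT, broadcast back to the original row order
year_mean = df.groupby('BRTH_YR')['CNT'].transform('mean')

# Fill only the missing values, cropping the decimal part like above
df['CNT'] = df['CNT'].fillna(year_mean.astype(int))
print(df['CNT'].tolist())  # [10.0, 20.0, 30.0, 40.0, 40.0]
```

transform returns a Series aligned with the original index, so fillna can match each NaN with the mean of its own year.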
Now it is time to handle the empty values in the RNK column. Again, there are different approaches depending on your dataset. Here we know that the column should hold only numeric (more specifically, integer) values. One option is to replace the empty fields with np.nan:
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK'] = np.nan
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK']
Series([], Name: RNK, dtype: object)
>>> na_baby_names.loc[(na_baby_names.RNK.isnull()), 'RNK']
249      NaN
723      NaN
936      NaN
7888     NaN
8255     NaN
8552     NaN
9079     NaN
9273     NaN
9559     NaN
9890     NaN
10068    NaN
10273    NaN
10567    NaN
10915    NaN
11085    NaN
11285    NaN
11579    NaN
11926    NaN
12107    NaN
12323    NaN
12611    NaN
13170    NaN
13359    NaN
Name: RNK, dtype: object
Well, this is a very basic approach and only works for the empty string. If the column contains other non-numeric strings too, we would have to track down each of those values and set them to NaN individually, and that is a no-go. Fortunately, Pandas has a solution for this:
>>> import pandas as pd
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.RNK = pd.to_numeric(na_baby_names.RNK)
>>> na_baby_names.loc[(na_baby_names.NM == 'Avery'), 'RNK'].head()
249     NaN
723     NaN
936     NaN
7888    NaN
8255    NaN
Name: RNK, dtype: float64
The to_numeric function converts the values of a Series (or of a DataFrame's columns) to numeric values. Empty strings are treated as missing and become NaN; for other non-numeric strings you can pass errors='coerce' to convert them to NaN as well instead of raising an exception.
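A small sketch of that behaviour: the 'n/a' entry below would raise with the default errors='raise', so errors='coerce' is the safe way to turn anything unparsable into NaN:

```python
import pandas as pd

s = pd.Series(['12', '', 'n/a', '7'])

# coerce: anything that cannot be parsed as a number becomes NaN
converted = pd.to_numeric(s, errors='coerce')
print(converted.dtype)  # float64, because NaN forces a float column
```

After the conversion you can fill or drop the NaN values with the techniques shown above.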
Information on Datasets
We already know how to access basic information such as the shape or the number of elements of a DataFrame. But Pandas offers more insight through the info() method. It gives us memory usage data too, which helps to identify large datasets and to decide whether to filter in place or purge unnecessary data.
>>> clean_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8073 non-null int64
RNK        8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 441.5+ KB
The + symbol indicates that the true memory usage can be higher, because Pandas does not count the memory used by the values in columns with dtype=object. But we can tell Pandas to report the real memory usage by passing the memory_usage argument with the value 'deep' to the info() method:
>>> clean_baby_names.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR    8073 non-null int64
GNDR       8073 non-null object
ETHCTY     8073 non-null object
NM         8073 non-null object
CNT        8073 non-null int64
RNK        8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 1.9 MB
Now we see that our dataset eats almost 2 MB. That is not so bad, but remember, we are working with an example dataset. Stepping out into real-life projects, you can get significantly larger data, and then you have to consider the memory usage of your application.
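If the aggregate number from info() is not detailed enough, the memory_usage() method gives a per-column breakdown and accepts the same deep flag; a quick sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'BRTH_YR': [2011, 2012, 2013],
    'NM': ['Geraldine', 'Gia', 'Gianna'],
})

shallow = df.memory_usage()         # counts only the pointers for object columns
deep = df.memory_usage(deep=True)   # counts the string contents as well

# The object column grows under deep accounting, the int column does not
print(bool(deep['NM'] > shallow['NM']))            # True
print(bool(deep['BRTH_YR'] == shallow['BRTH_YR'])) # True
```

This is handy when deciding which column to drop or convert (for example to a categorical dtype) to save memory.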
We have seen that real datasets come with unintentional flaws. We need to validate a dataset before jumping in and doing analysis, and to that end we have seen how to identify duplicates and clean up missing data.
In the next article we will continue our journey and will learn how to aggregate our data. We will use this knowledge to identify and remove outliers in the dataset which could flaw our analysis.
Be sure to use DiscoverSDK’s search to find all the best development tools in one place.