Most of the times real-world data is rarely clean and homogeneous. In many cases, dataset of interest will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.
In this module, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. We refer the missing data as null, NaN, or NA values in general.
Before we start, lets make sure the Pandas and matplotlib packages are installed.
!pip install pandas matplotliboutputRequirement already satisfied: pandas in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages Requirement already satisfied: python-dateutil>=2.5.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas) Requirement already satisfied: numpy>=1.9.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas) Requirement already satisfied: pytz>=2011k in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas) Requirement already satisfied: six>=1.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib) Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) �[33mYou are using pip version 9.0.1, however version 18.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command.�[0m
# Now import pandas into your notebook as pd
import pandas as pdNow again import surveys.csv dataset into our notebook as we did in previous lesson.
surveys_df = pd.read_csv("surveys.csv")A mask can be useful to locate where a particular subset of values exist or
don't exist - for example, NaN, or "Not a Number" values. To understand masks,
we also need to understand BOOLEAN objects in Python.
Boolean values include True or False. For example,
# set value of x to be 5
x = 5x > 5outputFalse
x == 5outputTrue
Let's identify all locations in the survey data that have
null (missing or NaN) data values. We can use the isnull method to do this.
The isnull method will compare each cell with a null value. If an element
has a null value, it will be assigned a value of True in the output object.
pd.isnull(surveys_df).head().dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | True |
| 1 | False | False | False | False | False | False | False | False | True |
| 2 | False | False | False | False | False | False | False | False | True |
| 3 | False | False | False | False | False | False | False | False | True |
| 4 | False | False | False | False | False | False | False | False | True |
To select the rows where there are null values, we can use the mask as an index to subset our data as follows:
# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)].dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
| record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
| 5 | 6 | 7 | 16 | 1977 | 1 | PF | M | 14.0 | NaN |
| 6 | 7 | 7 | 16 | 1977 | 2 | PE | F | NaN | NaN |
| 7 | 8 | 7 | 16 | 1977 | 1 | DM | M | 37.0 | NaN |
| 8 | 9 | 7 | 16 | 1977 | 1 | DM | F | 34.0 | NaN |
| 9 | 10 | 7 | 16 | 1977 | 6 | PF | F | 20.0 | NaN |
| 10 | 11 | 7 | 16 | 1977 | 5 | DS | F | 53.0 | NaN |
| 11 | 12 | 7 | 16 | 1977 | 7 | DM | M | 38.0 | NaN |
| 12 | 13 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
| 13 | 14 | 7 | 16 | 1977 | 8 | DM | NaN | NaN | NaN |
| 14 | 15 | 7 | 16 | 1977 | 6 | DM | F | 36.0 | NaN |
| 15 | 16 | 7 | 16 | 1977 | 4 | DM | F | 36.0 | NaN |
| 16 | 17 | 7 | 16 | 1977 | 3 | DS | F | 48.0 | NaN |
| 17 | 18 | 7 | 16 | 1977 | 2 | PP | M | 22.0 | NaN |
| 18 | 19 | 7 | 16 | 1977 | 4 | PF | NaN | NaN | NaN |
| 19 | 20 | 7 | 17 | 1977 | 11 | DS | F | 48.0 | NaN |
| 20 | 21 | 7 | 17 | 1977 | 14 | DM | F | 34.0 | NaN |
| 21 | 22 | 7 | 17 | 1977 | 15 | NL | F | 31.0 | NaN |
| 22 | 23 | 7 | 17 | 1977 | 13 | DM | M | 36.0 | NaN |
| 23 | 24 | 7 | 17 | 1977 | 13 | SH | M | 21.0 | NaN |
| 24 | 25 | 7 | 17 | 1977 | 9 | DM | M | 35.0 | NaN |
| 25 | 26 | 7 | 17 | 1977 | 15 | DM | M | 31.0 | NaN |
| 26 | 27 | 7 | 17 | 1977 | 15 | DM | M | 36.0 | NaN |
| 27 | 28 | 7 | 17 | 1977 | 11 | DM | M | 38.0 | NaN |
| 28 | 29 | 7 | 17 | 1977 | 11 | PP | M | NaN | NaN |
| 29 | 30 | 7 | 17 | 1977 | 10 | DS | F | 52.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35187 | 35188 | 11 | 10 | 2002 | 10 | NaN | NaN | NaN | NaN |
| 35256 | 35257 | 12 | 7 | 2002 | 22 | PB | M | 26.0 | NaN |
| 35259 | 35260 | 12 | 7 | 2002 | 21 | PB | F | 24.0 | NaN |
| 35277 | 35278 | 12 | 7 | 2002 | 20 | AH | NaN | NaN | NaN |
| 35279 | 35280 | 12 | 7 | 2002 | 16 | PB | M | 28.0 | NaN |
| 35322 | 35323 | 12 | 8 | 2002 | 11 | AH | NaN | NaN | NaN |
| 35328 | 35329 | 12 | 8 | 2002 | 11 | PP | M | NaN | 16.0 |
| 35370 | 35371 | 12 | 8 | 2002 | 14 | AH | NaN | NaN | NaN |
| 35378 | 35379 | 12 | 8 | 2002 | 15 | PB | F | 26.0 | NaN |
| 35384 | 35385 | 12 | 8 | 2002 | 10 | NaN | NaN | NaN | NaN |
| 35387 | 35388 | 12 | 29 | 2002 | 1 | DO | M | 35.0 | NaN |
| 35403 | 35404 | 12 | 29 | 2002 | 2 | NL | F | 30.0 | NaN |
| 35448 | 35449 | 12 | 29 | 2002 | 20 | OT | F | 20.0 | NaN |
| 35452 | 35453 | 12 | 29 | 2002 | 20 | PB | M | 28.0 | NaN |
| 35457 | 35458 | 12 | 29 | 2002 | 20 | AH | NaN | NaN | NaN |
| 35477 | 35478 | 12 | 29 | 2002 | 24 | AH | NaN | NaN | NaN |
| 35485 | 35486 | 12 | 29 | 2002 | 16 | DO | M | 37.0 | NaN |
| 35495 | 35496 | 12 | 31 | 2002 | 4 | PB | NaN | NaN | NaN |
| 35510 | 35511 | 12 | 31 | 2002 | 11 | DX | NaN | NaN | NaN |
| 35511 | 35512 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
| 35512 | 35513 | 12 | 31 | 2002 | 11 | US | NaN | NaN | NaN |
| 35514 | 35515 | 12 | 31 | 2002 | 11 | SF | F | 27.0 | NaN |
| 35519 | 35520 | 12 | 31 | 2002 | 9 | SF | NaN | 24.0 | 36.0 |
| 35527 | 35528 | 12 | 31 | 2002 | 13 | US | NaN | NaN | NaN |
| 35529 | 35530 | 12 | 31 | 2002 | 13 | OT | F | 20.0 | NaN |
| 35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
| 35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
| 35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
| 35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
4873 rows × 9 columns
Notice that we have 4873 observations/rows that contain one or more missing values.
Thats roughly 14% of data contains missing values.
We have used [] convension to select subset of data.
More information about slicing and indexing can be found out here.
(axis=1) is a numpy convention to specify columns.
Note that the weight column of our DataFrame contains many null or NaN
values. Next, we will explore ways of dealing with this.
If we look at the weight column in the surveys
data we notice that there are NaN (Not a Number) values. NaN values are undefined
values that cannot be represented mathematically. Pandas, for example, will read
an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
were to average the weight column without replacing our NaNs, Python would know to skip
over those cells.
Dealing with missing data values is always a challenge. It's sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we're lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.
Let's explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):
## How many missing values are there in weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])output3266
# How many rows have weight values?
len(surveys_df[surveys_df.weight> 0])output32283
We can replace all NaN values with zeroes using the .fillna() method (after
making a copy of the data so we don't lose our work):
# Creat a new DataFrame using copy
df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.
surveys_df['weight'].mean()output42.672428212991356
df1['weight'].mean()output38.751976145601844
We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())df1['weight'].mean()output42.672428212991356
We've learned about using manipulating data to get desired outputs. But we've also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the columns that have full data. First, let's reload the data so we're not mixing up all of our previous manipulations.
df_na = surveys_df.dropna()If you now type df_na, you should observe that the resulting DataFrame has 30676 rows
and 9 columns, much smaller than the 35549 row original.
We can now use the to_csv command to do export a DataFrame in CSV format. Note that the code
below will by default save the data into the current working directory. We can
save it to a different folder by adding the foldername and a slash before the filename:
df1.to_csv('foldername/out.csv').
We use 'index=False' so that
pandas doesn't include the index number for each line.
# Write DataFrame to CSV
df_na.to_csv('output/surveys_complete.csv', index=False)What we've learned:
- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use
to_csvto write manipulated data to a file.
We can run isnull on a particular column too. What does the code below do?
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)output0 NaN 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN 10 NaN 11 NaN 12 NaN 13 NaN 14 NaN 15 NaN 16 NaN 17 NaN 18 NaN 19 NaN 20 NaN 21 NaN 22 NaN 23 NaN 24 NaN 25 NaN 26 NaN 27 NaN 28 NaN 29 NaN .. 35138 NaN 35168 NaN 35187 NaN 35256 NaN 35259 NaN 35277 NaN 35279 NaN 35322 NaN 35370 NaN 35378 NaN 35384 NaN 35387 NaN 35403 NaN 35448 NaN 35452 NaN 35457 NaN 35477 NaN 35485 NaN 35495 NaN 35510 NaN 35511 NaN 35512 NaN 35514 NaN 35527 NaN 35529 NaN 35530 NaN 35543 NaN 35544 NaN 35545 NaN 35548 NaN Name: weight, Length: 3266, dtype: float64
Let's take a minute to look at the statement above. We are using the Boolean
object pd.isnull(surveys_df['weight']) as an index to surveys_df. We are
asking Python to select rows that have a NaN value of weight.