missing_values.md

Handling Missing Data

Most of the times real-world data is rarely clean and homogeneous. In many cases, dataset of interest will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.

In this module, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. We refer the missing data as null, NaN, or NA values in general.

Before we start, lets make sure the Pandas and matplotlib packages are installed.

!pip install pandas matplotlib

output

Requirement already satisfied: pandas in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages
Requirement already satisfied: matplotlib in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: numpy>=1.9.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas)
Requirement already satisfied: six>=1.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib)
Requirement already satisfied: setuptools in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib)
�[33mYou are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.�[0m

# Now import pandas into your notebook as pd
import pandas as pd

Now again import surveys.csv dataset into our notebook as we did in previous lesson.

surveys_df = pd.read_csv("surveys.csv")

Using masks to identify a specific condition

A mask can be useful to locate where a particular subset of values exist or don't exist - for example, NaN, or "Not a Number" values. To understand masks, we also need to understand BOOLEAN objects in Python.

Boolean values include True or False. For example,

# set value of x to be 5
x = 5

x > 5

output

False

x == 5

output

True

Finding Missing Values

Let's identify all locations in the survey data that have null (missing or NaN) data values. We can use the isnull method to do this. The isnull method will compare each cell with a null value. If an element has a null value, it will be assigned a value of True in the output object.

pd.isnull(surveys_df).head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	False	False	False	False	False	False	False	False	True
1	False	False	False	False	False	False	False	False	True
2	False	False	False	False	False	False	False	False	True
3	False	False	False	False	False	False	False	False	True
4	False	False	False	False	False	False	False	False	True

How to select rows with missing data

To select the rows where there are null values, we can use the mask as an index to subset our data as follows:

# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	record_id	month	day	year	site_id	species_id	sex	hindfoot_length	weight
0	1	7	16	1977	2	NL	M	32.0	NaN
1	2	7	16	1977	3	NL	M	33.0	NaN
2	3	7	16	1977	2	DM	F	37.0	NaN
3	4	7	16	1977	7	DM	M	36.0	NaN
4	5	7	16	1977	3	DM	M	35.0	NaN
5	6	7	16	1977	1	PF	M	14.0	NaN
6	7	7	16	1977	2	PE	F	NaN	NaN
7	8	7	16	1977	1	DM	M	37.0	NaN
8	9	7	16	1977	1	DM	F	34.0	NaN
9	10	7	16	1977	6	PF	F	20.0	NaN
10	11	7	16	1977	5	DS	F	53.0	NaN
11	12	7	16	1977	7	DM	M	38.0	NaN
12	13	7	16	1977	3	DM	M	35.0	NaN
13	14	7	16	1977	8	DM	NaN	NaN	NaN
14	15	7	16	1977	6	DM	F	36.0	NaN
15	16	7	16	1977	4	DM	F	36.0	NaN
16	17	7	16	1977	3	DS	F	48.0	NaN
17	18	7	16	1977	2	PP	M	22.0	NaN
18	19	7	16	1977	4	PF	NaN	NaN	NaN
19	20	7	17	1977	11	DS	F	48.0	NaN
20	21	7	17	1977	14	DM	F	34.0	NaN
21	22	7	17	1977	15	NL	F	31.0	NaN
22	23	7	17	1977	13	DM	M	36.0	NaN
23	24	7	17	1977	13	SH	M	21.0	NaN
24	25	7	17	1977	9	DM	M	35.0	NaN
25	26	7	17	1977	15	DM	M	31.0	NaN
26	27	7	17	1977	15	DM	M	36.0	NaN
27	28	7	17	1977	11	DM	M	38.0	NaN
28	29	7	17	1977	11	PP	M	NaN	NaN
29	30	7	17	1977	10	DS	F	52.0	NaN
...	...	...	...	...	...	...	...	...	...
35187	35188	11	10	2002	10	NaN	NaN	NaN	NaN
35256	35257	12	7	2002	22	PB	M	26.0	NaN
35259	35260	12	7	2002	21	PB	F	24.0	NaN
35277	35278	12	7	2002	20	AH	NaN	NaN	NaN
35279	35280	12	7	2002	16	PB	M	28.0	NaN
35322	35323	12	8	2002	11	AH	NaN	NaN	NaN
35328	35329	12	8	2002	11	PP	M	NaN	16.0
35370	35371	12	8	2002	14	AH	NaN	NaN	NaN
35378	35379	12	8	2002	15	PB	F	26.0	NaN
35384	35385	12	8	2002	10	NaN	NaN	NaN	NaN
35387	35388	12	29	2002	1	DO	M	35.0	NaN
35403	35404	12	29	2002	2	NL	F	30.0	NaN
35448	35449	12	29	2002	20	OT	F	20.0	NaN
35452	35453	12	29	2002	20	PB	M	28.0	NaN
35457	35458	12	29	2002	20	AH	NaN	NaN	NaN
35477	35478	12	29	2002	24	AH	NaN	NaN	NaN
35485	35486	12	29	2002	16	DO	M	37.0	NaN
35495	35496	12	31	2002	4	PB	NaN	NaN	NaN
35510	35511	12	31	2002	11	DX	NaN	NaN	NaN
35511	35512	12	31	2002	11	US	NaN	NaN	NaN
35512	35513	12	31	2002	11	US	NaN	NaN	NaN
35514	35515	12	31	2002	11	SF	F	27.0	NaN
35519	35520	12	31	2002	9	SF	NaN	24.0	36.0
35527	35528	12	31	2002	13	US	NaN	NaN	NaN
35529	35530	12	31	2002	13	OT	F	20.0	NaN
35530	35531	12	31	2002	13	PB	F	27.0	NaN
35543	35544	12	31	2002	15	US	NaN	NaN	NaN
35544	35545	12	31	2002	15	AH	NaN	NaN	NaN
35545	35546	12	31	2002	15	AH	NaN	NaN	NaN
35548	35549	12	31	2002	5	NaN	NaN	NaN	NaN

4873 rows × 9 columns

Explaination

Notice that we have 4873 observations/rows that contain one or more missing values. Thats roughly 14% of data contains missing values.

We have used [] convension to select subset of data.

More information about slicing and indexing can be found out here.

(axis=1) is a numpy convention to specify columns.

Note that the weight column of our DataFrame contains many null or NaN values. Next, we will explore ways of dealing with this.

If we look at the weight column in the surveys data we notice that there are NaN (Not a Number) values. NaN values are undefined values that cannot be represented mathematically. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, Python would know to skip over those cells.

Dealing with missing values.

Dealing with missing data values is always a challenge. It's sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we're lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.

Where Are the NaN's?

Let's explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):

## How many missing values are there in weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])

output

3266

# How many rows have weight values?
len(surveys_df[surveys_df.weight> 0])

output

32283

We can replace all NaN values with zeroes using the .fillna() method (after making a copy of the data so we don't lose our work):

# Creat a new DataFrame using copy
df1 = surveys_df.copy()

# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.

surveys_df['weight'].mean()

output

42.672428212991356

df1['weight'].mean()

output

38.751976145601844

Extra Information

We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.

df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())

df1['weight'].mean()

output

42.672428212991356

Writing Out Data to CSV

We've learned about using manipulating data to get desired outputs. But we've also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the columns that have full data. First, let's reload the data so we're not mixing up all of our previous manipulations.

df_na = surveys_df.dropna()

If you now type df_na, you should observe that the resulting DataFrame has 30676 rows and 9 columns, much smaller than the 35549 row original.

We can now use the to_csv command to do export a DataFrame in CSV format. Note that the code below will by default save the data into the current working directory. We can save it to a different folder by adding the foldername and a slash before the filename: df1.to_csv('foldername/out.csv'). We use 'index=False' so that pandas doesn't include the index number for each line.

# Write DataFrame to CSV
df_na.to_csv('output/surveys_complete.csv', index=False)

Recap

What we've learned:

What NaN values are, how they might be represented, and what this means for your work
How to replace NaN values, if desired
How to use to_csv to write manipulated data to a file.

Extra

We can run isnull on a particular column too. What does the code below do?

# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)

output

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12      NaN
13      NaN
14      NaN
15      NaN
16      NaN
17      NaN
18      NaN
19      NaN
20      NaN
21      NaN
22      NaN
23      NaN
24      NaN
25      NaN
26      NaN
27      NaN
28      NaN
29      NaN
         ..
35138   NaN
35168   NaN
35187   NaN
35256   NaN
35259   NaN
35277   NaN
35279   NaN
35322   NaN
35370   NaN
35378   NaN
35384   NaN
35387   NaN
35403   NaN
35448   NaN
35452   NaN
35457   NaN
35477   NaN
35485   NaN
35495   NaN
35510   NaN
35511   NaN
35512   NaN
35514   NaN
35527   NaN
35529   NaN
35530   NaN
35543   NaN
35544   NaN
35545   NaN
35548   NaN
Name: weight, Length: 3266, dtype: float64

Let's take a minute to look at the statement above. We are using the Boolean object pd.isnull(surveys_df['weight']) as an index to surveys_df. We are asking Python to select rows that have a NaN value of weight.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling Missing Data

Using masks to identify a specific condition

Finding Missing Values

How to select rows with missing data

Explaination

Dealing with missing values.

Where Are the NaN's?

Extra Information

Writing Out Data to CSV

Recap

Extra

FilesExpand file tree

missing_values.md

Latest commit

History

missing_values.md

File metadata and controls

Handling Missing Data

Using masks to identify a specific condition

Finding Missing Values

How to select rows with missing data

Explaination

Dealing with missing values.

Where Are the NaN's?

Extra Information

Writing Out Data to CSV

Recap

Extra