Skip to content

Commit 34f1667

Browse files
committed
new tutorial, this one on pandas for covid data exploration
1 parent 27a4c94 commit 34f1667

File tree

5 files changed

+321
-1
lines changed

5 files changed

+321
-1
lines changed
Lines changed: 316 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,316 @@
1+
title: Learning pandas by Exploring COVID-19 Data
2+
slug: learn-pandas-basic-commands-explore-covid-19-data
3+
meta: Use the pandas data analysis tool to explore the free COVID-19 data set provided by the European Centre for Disease Prevention and Control.
4+
category: post
5+
date: 2020-03-28
6+
modified: 2020-03-28
7+
newsletter: False
8+
headerimage: /img/200328-covid-19-pandas/header.png
9+
headeralt: pandas logo. Copyright the PyData Foundation.
10+
11+
12+
The
13+
[European Centre for Disease Prevention and Control](https://www.ecdc.europa.eu/en)
14+
provides
15+
[daily-updated worldwide COVID-19 data](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)
16+
that is easy to download in JSON, CSV or XML formats. In this tutorial,
17+
we will use the [pandas](/pandas.html) data analysis tool on the
18+
comma-separated values (CSV) data to learn some of the basic pandas
19+
commands and explore what is contained within the data set.
20+
21+
22+
## Configuring our development environment
23+
Make sure you have Python 3 installed. As of right now,
24+
[Python 3.8.2](https://www.python.org/downloads/) is the latest.
25+
26+
During this tutorial we're also going to use
27+
[pandas](https://pandas.pydata.org/).
28+
29+
Install it now into a new virtual environment with the following
30+
commands:
31+
32+
```bash
33+
python -m venv covidpandas
34+
source covidpandas/bin/activate
35+
36+
pip install pandas
37+
```
38+
39+
We're now ready to get the COVID-19 data and start analyzing it with
40+
pandas.
41+
42+
43+
## Obtaining the COVID-19 data
44+
Go to the
45+
[download today’s data on the geographic distribution of COVID-19 cases worldwide](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)
46+
page in your web browser. It should look something like the following
47+
screenshot.
48+
49+
<img src="/img/200328-covid-19-pandas/covid-19-data-download.png" width="100%" class="shot rnd outl" alt="Download the CSV version of the COVID-19 data.">
50+
51+
There should be a link to download the
52+
data in CSV format, but the organization has changed the page layout
53+
several times in the past few weeks, which makes it difficult to find
54+
formats other than Excel (XLSX). If you have trouble obtaining the
55+
CSV version, just download
56+
[this one from GitHub](https://raw.githubusercontent.com/fullstackpython/blog-code-examples/master/pandas-covid-19/covid-19-cases-march-28-2020.csv)
57+
which is pegged to a copy downloaded on March 28th, 2020.
58+
59+
60+
61+
## Importing the CSV into pandas
62+
We have the data in a CSV now we need to import it into a pandas
63+
DataFrame.
64+
65+
Start by running the Python REPL:
66+
67+
```bash
68+
python
69+
70+
>>>
71+
```
72+
73+
The REPL is ready to go, now we need to import pandas so we can read
74+
the data we downloaded.
75+
76+
```python
77+
from pandas import read_csv
78+
79+
df = read_csv("covid-19-cases-march-28-2020.csv")
80+
```
81+
82+
Don't worry if you get an error like
83+
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7...`.
84+
Run this command instead which explicitly sets the file encoding
85+
so pandas can properly read the CSV.
86+
87+
```python
88+
# make sure the file name of the csv matches your file's name!
89+
df = read_csv("covid-19-cases-march-28-2020.csv", encoding="ISO-8859-1")
90+
```
91+
92+
We now have our data loaded into a
93+
[pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
94+
and can start running code to poke and prod and what's inside the
95+
data set.
96+
97+
98+
## Running pandas commands
99+
Let's first take a peek at what a sample of the data looks like. I
100+
typically run the `head` and `tail` functions when I open something
101+
up to find out what are contained in the first five and last five rows.
102+
103+
```python
104+
df.head()
105+
```
106+
107+
You should see six lines of output: one as the columns header and the
108+
first five rows of data from the CSV:
109+
110+
```
111+
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018
112+
0 28/03/2020 28 3 2020 16 1 Afghanistan AF AFG 37172386.0
113+
1 27/03/2020 27 3 2020 0 0 Afghanistan AF AFG 37172386.0
114+
2 26/03/2020 26 3 2020 33 0 Afghanistan AF AFG 37172386.0
115+
3 25/03/2020 25 3 2020 2 0 Afghanistan AF AFG 37172386.0
116+
4 24/03/2020 24 3 2020 6 1 Afghanistan AF AFG 37172386.0
117+
```
118+
119+
The `tail` function looks at the last five rows in a DataFrame.
120+
121+
```
122+
df.tail()
123+
```
124+
125+
`tail` output will look something like this:
126+
127+
```
128+
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018
129+
7315 25/03/2020 25 3 2020 0 0 Zimbabwe ZW ZWE 14439018.0
130+
7316 24/03/2020 24 3 2020 0 1 Zimbabwe ZW ZWE 14439018.0
131+
7317 23/03/2020 23 3 2020 0 0 Zimbabwe ZW ZWE 14439018.0
132+
7318 22/03/2020 22 3 2020 1 0 Zimbabwe ZW ZWE 14439018.0
133+
7319 21/03/2020 21 3 2020 1 0 Zimbabwe ZW ZWE 14439018.0
134+
```
135+
136+
Note that you can also pass an integer into `head` or `tail` like
137+
`df.head(10)` to get the first or last **n** number of rows.
138+
139+
It looks like based on the `tail` function we have around 7320 rows of
140+
data (since the first row is 0 indexed). We can confirm how much
141+
data is in each column with the `count` function.
142+
143+
```
144+
df.count()
145+
```
146+
147+
`count`'s output will look like:
148+
149+
```
150+
dateRep 7320
151+
day 7320
152+
month 7320
153+
year 7320
154+
cases 7320
155+
deaths 7320
156+
countriesAndTerritories 7320
157+
geoId 7306
158+
countryterritoryCode 7254
159+
popData2018 7311
160+
dtype: int64
161+
```
162+
163+
What if we want to look at one of those columns and find, for example,
164+
the highest value of cases?
165+
166+
```
167+
df.cases.max()
168+
```
169+
170+
In this data set we get 18695 as the output. What about looking at
171+
standard statistical measures across all columns? That's where the
172+
`describe` function comes in handy.
173+
174+
```
175+
df.describe()
176+
```
177+
178+
`describe` presents standard statistical measures such as min, max,
179+
median and mean for everything in your data set. In this case we
180+
receive as output:
181+
182+
```
183+
day month year cases deaths popData2018
184+
count 7320.000000 7320.000000 7320.000000 7320.000000 7320.000000 7.311000e+03
185+
mean 16.828142 2.249454 2019.990847 80.870355 3.687158 7.130483e+07
186+
std 8.322981 1.256463 0.095239 608.270244 35.327689 2.140624e+08
187+
min 1.000000 1.000000 2019.000000 -9.000000 0.000000 1.000000e+03
188+
25% 10.000000 1.000000 2020.000000 0.000000 0.000000 4.137309e+06
189+
50% 18.000000 2.000000 2020.000000 0.000000 0.000000 1.072767e+07
190+
75% 24.000000 3.000000 2020.000000 5.000000 0.000000 5.139301e+07
191+
max 31.000000 12.000000 2020.000000 18695.000000 971.000000 1.392730e+09
192+
```
193+
194+
How about a quick view into whether or not columns' data are correlated
195+
with each other? The `corr` function is what we need.
196+
197+
```
198+
df.corr()
199+
```
200+
201+
For our data set, `corr` outputs:
202+
203+
```
204+
day month year cases deaths popData2018
205+
day 1.000000 0.203006 -0.163665 0.063629 0.060075 -0.040677
206+
month 0.203006 1.000000 -0.745912 0.062494 0.052707 -0.039131
207+
year -0.163665 -0.745912 1.000000 0.012715 0.010032 -0.006294
208+
cases 0.063629 0.062494 0.012715 1.000000 0.716968 0.136580
209+
deaths 0.060075 0.052707 0.010032 0.716968 1.000000 0.082229
210+
popData2018 -0.040677 -0.039131 -0.006294 0.136580 0.082229 1.000000
211+
```
212+
213+
Not surprisingly, we see 1.000000 correlation between a column and itself.
214+
We'd have to worry if we didn't see that result! For other columns it may
215+
not make sense to look at their correlation. This is where you need to
216+
think about the data. There is often correlation between completely unrelated
217+
columns just because the data is structured a certain way.
218+
219+
If you are a developer like me without a rigorous background in statistics
220+
(Stats 200 in college was a **long** time ago), you may need to brush up
221+
on your stats knowledge before you are able to say whether something in the
222+
data matters or not.
223+
224+
Let's keep going exploring the data. We can select columns and determine how
225+
many unique items are held within it. For example, how many unique countries
226+
and territories are listed?
227+
228+
```
229+
df.countriesAndTerritories.nunique()
230+
```
231+
232+
In this case the result should be 196.
233+
234+
235+
## Asking questions of the data
236+
Those functions are fine for basic querying to learn what's in the
237+
data set, but how do we ask real questions by stringing together some
238+
commands?
239+
240+
We now know there are 7320 rows in this set since we used the `count`
241+
function above. Each row represents a single day within a country. Now
242+
to ask a question. How many days across these countries were there 10
243+
or more cases reported?
244+
245+
Let's create a new dataframe named df2 with the rows that only have
246+
10 or more cases reported on that day, then count the number of rows
247+
within it.
248+
249+
```
250+
df2 = df[df['cases']>=10]
251+
df2.count()
252+
```
253+
254+
That should give us the value 1531. There have been 1531 instances
255+
of 10 or more COVID-19 cases reported on a single day, across the
256+
196 countries or terrorities listed. But the 1531 is hard to explain
257+
to people. We should pick out a single country and show how many times
258+
10 or more cases were reported on one day. How about a smaller
259+
country like Vietnam that is not being reported on as much as China,
260+
the United States or Italy?
261+
262+
```
263+
df2[df2['countriesAndTerritories']=='Vietnam']
264+
```
265+
266+
This will give us the full output of data by column:
267+
268+
```
269+
dateRep day month year cases deaths countriesAndTerritories geoId countryterritoryCode popData2018
270+
7217 28/03/2020 28 3 2020 16 0 Vietnam VN VNM 95540395.0
271+
7219 26/03/2020 26 3 2020 14 0 Vietnam VN VNM 95540395.0
272+
7220 25/03/2020 25 3 2020 11 0 Vietnam VN VNM 95540395.0
273+
7222 23/03/2020 23 3 2020 24 0 Vietnam VN VNM 95540395.0
274+
7226 19/03/2020 19 3 2020 15 0 Vietnam VN VNM 95540395.0
275+
```
276+
277+
We can also use the `count` function here to confirm there have been
278+
five days in which 10 or more new cases have been reported in Vietnam
279+
so far:
280+
281+
```
282+
df2[df2['countriesAndTerritories']=='Vietnam'].count()
283+
```
284+
285+
We get the output of 5 for the columns. Unfortunately, when you look at
286+
the full data it appears these rows are all very recent and the virus
287+
is just beginning to spread more widely there. Let's hope they along
288+
with every other country is able to turn the tide, flatten the curve
289+
and keep more people from getting sick as we continue onwards.
290+
291+
That's a good spot to leave off, but we covered a lot of pandas ground
292+
in this tutorial!
293+
294+
295+
## What's next?
296+
We just imported and took a look at what's in the European Centre
297+
for Disease Prevention and Control's COVID-19 data set using
298+
[pandas](/pandas.html). That was a quick tour of some basic pandas
299+
commands and I strongly recommend you peruse the
300+
[DataFrame documentation list](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
301+
to learn about all of the other handy functions that this tool
302+
provides to developers.
303+
304+
You can also get an idea of what to code next in your Python project by
305+
reading the
306+
[Full Stack Python table of contents page](/table-of-contents.html).
307+
308+
Questions? Contact me via Twitter
309+
[@fullstackpython](https://twitter.com/fullstackpython)
310+
or [@mattmakai](https://twitter.com/mattmakai). I'm also on GitHub with
311+
the username [mattmakai](https://github.com/mattmakai).
312+
313+
Something wrong with this post? Fork
314+
[this page's source on GitHub](https://github.com/mattmakai/fullstackpython.com/blob/master/content/posts/200328-explore-covid-pandas.markdown)
315+
and submit a pull request.
316+
305 KB
Loading
51.8 KB
Loading

theme/templates/article-sidebar.html

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,6 @@
11
{% if article.author == "Kevin Whinnery" %}
22
{% include "sponsor/twilioquest.html" %}
33
{% endif %}
4-
{% include "sponsor/carbon.html" %}
54
<div class="pn">
65
<div class="pnh">
76
<h3><a href="/table-of-contents.html">More Resources</a></h3>
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
<a href="/pandas.html" class="lgi">pandas</a>
2+
<a href="/data.html" class="lgi">Data</a>
3+
<a href="/numpy.html" class="lgi">NumPy</a>
4+
<a href="/data-analysis.html" class="lgi">Data analysis</a>
5+
<a href="https://pandas.pydata.org/docs/" class="lgi">Official pandas Docs {% include "blog/external-link.html" %}</a>

0 commit comments

Comments
 (0)