title: Learning pandas by Exploring COVID-19 Data
slug: learn-pandas-basic-commands-explore-covid-19-data
meta: Use the pandas data analysis tool to explore the free COVID-19 data set provided by the European Centre for Disease Prevention and Control.
category: post
date: 2020-03-28
modified: 2020-03-28
newsletter: False
headerimage: /img/200328-covid-19-pandas/header.png
headeralt: pandas logo. Copyright the PyData Foundation.


The
[European Centre for Disease Prevention and Control](https://www.ecdc.europa.eu/en)
provides
[daily-updated worldwide COVID-19 data](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)
that is easy to download in JSON, CSV or XML formats. In this tutorial,
we will use the [pandas](/pandas.html) data analysis tool on the
comma-separated values (CSV) data to learn some of the basic pandas
commands and explore what is contained within the data set.


## Configuring our development environment
Make sure you have Python 3 installed. As of right now,
[Python 3.8.2](https://www.python.org/downloads/) is the latest.

During this tutorial we're also going to use
[pandas](https://pandas.pydata.org/).

Install it now into a new virtual environment with the following
commands:

```bash
python -m venv covidpandas
source covidpandas/bin/activate

pip install pandas
```

We're now ready to get the COVID-19 data and start analyzing it with
pandas.


## Obtaining the COVID-19 data
Go to the
[download today’s data on the geographic distribution of COVID-19 cases worldwide](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)
page in your web browser. It should look something like the following
screenshot.

<img src="/img/200328-covid-19-pandas/covid-19-data-download.png" width="100%" class="shot rnd outl" alt="Download the CSV version of the COVID-19 data.">

There should be a link to download the
data in CSV format, but the organization has changed the page layout
several times in the past few weeks, which makes it difficult to find
formats other than Excel (XLSX). If you have trouble obtaining the
CSV version, just download
[this one from GitHub](https://raw.githubusercontent.com/fullstackpython/blog-code-examples/master/pandas-covid-19/covid-19-cases-march-28-2020.csv),
which is pegged to a copy downloaded on March 28th, 2020.


## Importing the CSV into pandas
We have the data in a CSV file; now we need to import it into a pandas
DataFrame.

Start by running the Python REPL:

```bash
python

>>>
```

The REPL is ready to go. Now we need to import pandas so we can read
the data we downloaded.

```python
from pandas import read_csv

df = read_csv("covid-19-cases-march-28-2020.csv")
```

Don't worry if you get an error like
`UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7...`.
Run this command instead, which explicitly sets the file encoding
so pandas can properly read the CSV:

```python
# make sure the file name of the CSV matches your file's name!
df = read_csv("covid-19-cases-march-28-2020.csv", encoding="ISO-8859-1")
```

We now have our data loaded into a
[pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
and can start running code to poke and prod at what's inside the
data set.

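Before running more specific commands, it can help to confirm a
frame's overall shape and column types with `shape` and `dtypes`. Here
is a minimal, self-contained sketch using a tiny hand-built stand-in
frame (the column names mirror the real data set, but the values are
made up for illustration; on the real `df` you would simply run
`df.shape` and `df.dtypes`):

```python
import pandas as pd

# A tiny stand-in frame: same column names as the real CSV, made-up values.
demo_df = pd.DataFrame({
    "dateRep": ["28/03/2020", "27/03/2020"],
    "cases": [16, 0],
    "deaths": [1, 0],
    "countriesAndTerritories": ["Afghanistan", "Afghanistan"],
})

print(demo_df.shape)   # tuple of (row count, column count)
print(demo_df.dtypes)  # the data type pandas inferred for each column
```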

## Running pandas commands
Let's first take a peek at what a sample of the data looks like. I
typically run the `head` and `tail` functions when I open something
up to find out what is contained in the first five and last five rows.

```python
df.head()
```

You should see six lines of output: one for the column headers and the
first five rows of data from the CSV:

```
      dateRep  day  month  year  cases  deaths countriesAndTerritories geoId countryterritoryCode  popData2018
0  28/03/2020   28      3  2020     16       1             Afghanistan    AF                  AFG   37172386.0
1  27/03/2020   27      3  2020      0       0             Afghanistan    AF                  AFG   37172386.0
2  26/03/2020   26      3  2020     33       0             Afghanistan    AF                  AFG   37172386.0
3  25/03/2020   25      3  2020      2       0             Afghanistan    AF                  AFG   37172386.0
4  24/03/2020   24      3  2020      6       1             Afghanistan    AF                  AFG   37172386.0
```

The `tail` function looks at the last five rows in a DataFrame.

```python
df.tail()
```

`tail` output will look something like this:

```
         dateRep  day  month  year  cases  deaths countriesAndTerritories geoId countryterritoryCode  popData2018
7315  25/03/2020   25      3  2020      0       0                Zimbabwe    ZW                  ZWE   14439018.0
7316  24/03/2020   24      3  2020      0       1                Zimbabwe    ZW                  ZWE   14439018.0
7317  23/03/2020   23      3  2020      0       0                Zimbabwe    ZW                  ZWE   14439018.0
7318  22/03/2020   22      3  2020      1       0                Zimbabwe    ZW                  ZWE   14439018.0
7319  21/03/2020   21      3  2020      1       0                Zimbabwe    ZW                  ZWE   14439018.0
```

Note that you can also pass an integer into `head` or `tail` like
`df.head(10)` to get the first or last **n** number of rows.

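Here is a quick self-contained sketch of those row-count arguments,
using a tiny hand-built frame with made-up numbers rather than the
real data set:

```python
import pandas as pd

# Seven made-up daily case counts, indexed 0 through 6 by default.
demo_df = pd.DataFrame({"cases": [16, 0, 33, 2, 6, 5, 1]})

print(demo_df.head(3))  # first three rows (indexes 0, 1, 2)
print(demo_df.tail(2))  # last two rows (indexes 5 and 6)
```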
Based on the `tail` output, it looks like we have 7320 rows of data
(since the rows are zero-indexed). We can confirm how much data is in
each column with the `count` function.

```python
df.count()
```

`count`'s output will look like:

```
dateRep                    7320
day                        7320
month                      7320
year                       7320
cases                      7320
deaths                     7320
countriesAndTerritories    7320
geoId                      7306
countryterritoryCode       7254
popData2018                7311
dtype: int64
```

What if we want to look at one of those columns and find, for example,
the highest number of cases?

```python
df.cases.max()
```

In this data set we get 18695 as the output. What about looking at
standard statistical measures across all columns? That's where the
`describe` function comes in handy.

```python
df.describe()
```

`describe` presents standard statistical measures such as min, max,
median and mean for everything in your data set. In this case we
receive as output:

```
               day        month         year         cases       deaths   popData2018
count  7320.000000  7320.000000  7320.000000   7320.000000  7320.000000  7.311000e+03
mean     16.828142     2.249454  2019.990847     80.870355     3.687158  7.130483e+07
std       8.322981     1.256463     0.095239    608.270244    35.327689  2.140624e+08
min       1.000000     1.000000  2019.000000     -9.000000     0.000000  1.000000e+03
25%      10.000000     1.000000  2020.000000      0.000000     0.000000  4.137309e+06
50%      18.000000     2.000000  2020.000000      0.000000     0.000000  1.072767e+07
75%      24.000000     3.000000  2020.000000      5.000000     0.000000  5.139301e+07
max      31.000000    12.000000  2020.000000  18695.000000   971.000000  1.392730e+09
```

How about a quick view into whether columns' data are correlated
with each other? The `corr` function is what we need.

```python
df.corr()
```

For our data set, `corr` outputs:

```
                  day     month      year     cases    deaths  popData2018
day          1.000000  0.203006 -0.163665  0.063629  0.060075    -0.040677
month        0.203006  1.000000 -0.745912  0.062494  0.052707    -0.039131
year        -0.163665 -0.745912  1.000000  0.012715  0.010032    -0.006294
cases        0.063629  0.062494  0.012715  1.000000  0.716968     0.136580
deaths       0.060075  0.052707  0.010032  0.716968  1.000000     0.082229
popData2018 -0.040677 -0.039131 -0.006294  0.136580  0.082229     1.000000
```

Not surprisingly, we see a correlation of 1.000000 between a column and
itself; we'd have to worry if we didn't see that result! For other
columns it may not make sense to look at their correlation. This is
where you need to think about the data. There is often correlation
between completely unrelated columns just because the data is
structured a certain way.

If you are a developer like me without a rigorous background in statistics
(Stats 200 in college was a **long** time ago), you may need to brush up
on your stats knowledge before you can say whether something in the
data matters or not.

Let's keep exploring the data. We can select a column and determine how
many unique items it holds. For example, how many unique countries
and territories are listed?

```python
df.countriesAndTerritories.nunique()
```

In this case the result should be 196.

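If you want to see the values themselves rather than just their count,
the related `unique` function returns them as an array. Here is a
minimal sketch on a hand-built column (the country names here are just
examples, not the full data set):

```python
import pandas as pd

# A hand-built column with repeated country names, for illustration only.
demo_df = pd.DataFrame({
    "countriesAndTerritories": ["Vietnam", "Italy", "Vietnam", "Zimbabwe"],
})

print(demo_df["countriesAndTerritories"].nunique())  # 3 distinct values
print(demo_df["countriesAndTerritories"].unique())   # the values themselves
```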

## Asking questions of the data
Those functions are fine for basic querying to learn what's in the
data set, but how do we ask real questions by stringing together some
commands?

We now know there are 7320 rows in this set since we used the `count`
function above. Each row represents a single day within a country. Now
to ask a question: how many days across these countries were there 10
or more cases reported?

Let's create a new DataFrame named `df2` with only the rows that have
10 or more cases reported on that day, then count the number of rows
within it.

```python
df2 = df[df['cases'] >= 10]
df2.count()
```

That should give us the value 1531. There have been 1531 instances
of 10 or more COVID-19 cases reported on a single day, across the
196 countries or territories listed. But 1531 on its own is hard to
explain to people. We should pick out a single country and show how
many times 10 or more cases were reported on one day. How about a
smaller country like Vietnam that is not being reported on as much as
China, the United States or Italy?

```python
df2[df2['countriesAndTerritories']=='Vietnam']
```

This will give us the full output of data by column:

```
         dateRep  day  month  year  cases  deaths countriesAndTerritories geoId countryterritoryCode  popData2018
7217  28/03/2020   28      3  2020     16       0                 Vietnam    VN                  VNM   95540395.0
7219  26/03/2020   26      3  2020     14       0                 Vietnam    VN                  VNM   95540395.0
7220  25/03/2020   25      3  2020     11       0                 Vietnam    VN                  VNM   95540395.0
7222  23/03/2020   23      3  2020     24       0                 Vietnam    VN                  VNM   95540395.0
7226  19/03/2020   19      3  2020     15       0                 Vietnam    VN                  VNM   95540395.0
```

We can also use the `count` function here to confirm there have been
five days in which 10 or more new cases have been reported in Vietnam
so far:

```python
df2[df2['countriesAndTerritories']=='Vietnam'].count()
```

We get the output of 5 for the columns. Unfortunately, when you look at
the full data it appears these rows are all very recent and the virus
is just beginning to spread more widely there. Let's hope that Vietnam,
along with every other country, is able to turn the tide, flatten the
curve and keep more people from getting sick as we continue onwards.

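Another way to string commands together is with `groupby`, which
aggregates rows per country instead of filtering them one condition at
a time. Here is a self-contained sketch of the pattern using a tiny
hand-built frame with made-up case counts (the same chain would work
on the real `df`):

```python
import pandas as pd

# Made-up daily rows for two countries, only to show the aggregation pattern.
demo_df = pd.DataFrame({
    "countriesAndTerritories": ["Vietnam", "Vietnam", "Italy", "Italy"],
    "cases": [16, 14, 5959, 5974],
})

# Sum the cases per country, then sort so the largest total comes first.
totals = (demo_df.groupby("countriesAndTerritories")["cases"]
                 .sum()
                 .sort_values(ascending=False))
print(totals)
```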
That's a good spot to leave off, but we covered a lot of pandas ground
in this tutorial!


## What's next?
We just imported and took a look at what's in the European Centre
for Disease Prevention and Control's COVID-19 data set using
[pandas](/pandas.html). That was a quick tour of some basic pandas
commands, and I strongly recommend you peruse the
[DataFrame documentation list](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
to learn about all of the other handy functions that this tool
provides to developers.

You can also get an idea of what to code next in your Python project by
reading the
[Full Stack Python table of contents page](/table-of-contents.html).

Questions? Contact me via Twitter
[@fullstackpython](https://twitter.com/fullstackpython)
or [@mattmakai](https://twitter.com/mattmakai). I'm also on GitHub with
the username [mattmakai](https://github.com/mattmakai).

Something wrong with this post? Fork
[this page's source on GitHub](https://github.com/mattmakai/fullstackpython.com/blob/master/content/posts/200328-explore-covid-pandas.markdown)
and submit a pull request.