| 37 | +# [Handy Python Libraries for Formatting and Cleaning Data](https://blog.modeanalytics.com/python-data-cleaning-libraries/) |
| 39 | + |
| 40 | +August 23, 2016 | [Melissa Bierly](http://www.twitter.com/melissa_bierly) -- |
| 41 | +Content Marketing at Mode |
| 42 | + |
| 43 | +The real world is messy, and so too is its data. So messy that a [recent |
| 44 | +survey](http://visit.crowdflower.com/data-science-report.html) reported that data |
| 45 | +scientists spend 60% of their time cleaning data. Unfortunately, 57% of them |
| 46 | +also find it to be the least enjoyable aspect of their job. |
| 47 | + |
| 48 | +Cleaning data may be time-consuming, but lots of tools have cropped up to make |
| 49 | +this crucial duty a little more bearable. The Python community offers a host |
| 50 | +of libraries for making data orderly and legible—from styling DataFrames to |
| 51 | +anonymizing datasets. |
| 52 | + |
| 53 | +Let us know which libraries you find useful—we're always looking to prioritize |
| 54 | +which libraries to add to [Mode Python |
| 55 | +Notebooks](https://about.modeanalytics.com/python/). |
| 56 | + |
| 57 | + _Too bad cleaning isn't as fun for data |
| 59 | +scientists as it is for this little guy._ |
| 60 | + |
| 61 | +## Dora |
| 62 | + |
| 63 | +Dora is designed for exploratory analysis; specifically, automating the most |
| 64 | +painful parts of it, like feature selection and extraction, visualization, |
| 65 | +and—you guessed it—data cleaning. Cleansing functions include: |
| 66 | + |
| 67 | + * Reading data with missing and poorly scaled values |
| 68 | + * Imputing missing values |
| 69 | + * Scaling values of input variables |
| 70 | + |
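The scaling step in the list above can be sketched with the standard library alone. This is a minimal illustration, not Dora's API: the helper name `standardize` is hypothetical, and zero-mean, unit-variance scaling is assumed as the scaling method.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation.

    Hypothetical helper for illustration; assumes the values are not all equal.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# A poorly scaled input column: incomes in dollars.
incomes = [20000, 40000, 40000, 60000]
scaled = standardize(incomes)
```

After scaling, variables measured in very different units land on a comparable range, which helps models that are sensitive to feature magnitude.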
| 71 | +**Created by:** [Nathan Epstein](https://twitter.com/epstein_n) |
| 72 | +**Where to learn more:** <https://github.com/NathanEpstein/Dora> |
| 73 | + |
| 74 | +## datacleaner |
| 75 | + |
| 76 | +Surprise, surprise, datacleaner cleans your data—but only once it's in a |
| 77 | +[pandas DataFrame](https://community.modeanalytics.com/python/tutorial/pandas-dataframe/). |
| 78 | +From creator Randy Olson: “datacleaner is not magic, and it won't |
| 79 | +take an unorganized blob of text and automagically parse it out for you.” |
| 80 | + |
| 81 | +It will, however, drop rows with missing values, replace missing values with |
| 82 | +the mode or median on a column-by-column basis, and encode non-numeric |
| 83 | +variables with numerical equivalents. This library is fairly new, but since |
| 84 | +DataFrames are fundamental to analysis in Python, it's worth checking out. |
| 85 | + |
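As a rough picture of what that per-column cleanup involves, the sketch below fills gaps with the median for a numeric column and the mode otherwise, then encodes text labels as integers. It is written against plain lists with the standard library rather than datacleaner or pandas, and the helper names `fill_missing` and `encode_labels` are hypothetical.

```python
from statistics import median, mode


def fill_missing(column):
    """Fill None with the median (numeric column) or the mode (otherwise)."""
    observed = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    fill = median(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in column]


def encode_labels(column):
    """Map each distinct value to an integer code, assigned in sorted order."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]


income = fill_missing([30, None, 50, 90])            # numeric: uses the median
color = fill_missing(["red", "blue", None, "red"])   # text: uses the mode
encoded = encode_labels(color)                       # blue -> 0, red -> 1
```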
| 86 | +**Created by:** [Randy Olson](https://twitter.com/randal_olson) |
| 87 | +**Where to learn more:** <https://github.com/rhiever/datacleaner> |
| 88 | + |
| 89 | +## PrettyPandas |
| 90 | + |
| 91 | +DataFrames are powerful, but they don't produce the kind of tables you'd want |
| 92 | +to show your boss. PrettyPandas makes use of the [pandas Style |
| 93 | +API](http://pandas.pydata.org/pandas-docs/stable/style.html) to transform |
| 94 | +DataFrames into presentation-worthy tables. Create summaries, add styling, and |
| 95 | +format numbers, columns, and rows. Added bonus: robust, easy-to-read |
| 96 | +[documentation](http://prettypandas.readthedocs.io/en/latest/). |
| 97 | + |
| 98 | +**Created by:** [Henry Hammond](https://twitter.com/henryhammond92) |
| 99 | +**Where to learn more:** <https://github.com/HHammond/PrettyPandas> |
| 100 | + |
| 101 | +## tabulate |
| 102 | + |
| 103 | +tabulate lets you print small, nice-looking tables with just one function |
| 104 | +call. It's handy for making tables more readable with column alignment by |
| 105 | +decimal, number formatting, headers, and more. |
| 106 | + |
| 107 | +One of the coolest features is the ability to output data in a variety of |
| 108 | +formats like HTML, LaTeX, or PHP Markdown Extra, so you can continue working with |
| 109 | +your tabular data in another tool or language. |
| 110 | + |
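To make the idea of an aligned, pipe-delimited table concrete, here is a small standard-library sketch that right-aligns numeric columns and left-aligns text. It does not use tabulate; `pipe_table` is a hypothetical helper, and fixing floats at two decimals is an assumption made so the decimal points line up.

```python
def pipe_table(headers, rows):
    """Render rows as a Markdown-style pipe table (hypothetical helper)."""
    columns = range(len(headers))
    # A column is numeric if every value in it is an int or float.
    numeric = [all(isinstance(r[i], (int, float)) for r in rows) for i in columns]
    # Fixed two decimals for floats so decimal points line up.
    cells = [[f"{c:.2f}" if isinstance(c, float) else str(c) for c in r] for r in rows]
    widths = [max([len(headers[i])] + [len(r[i]) for r in cells]) for i in columns]

    def line(row):
        padded = [c.rjust(w) if num else c.ljust(w)
                  for c, w, num in zip(row, widths, numeric)]
        return "| " + " | ".join(padded) + " |"

    # Alignment row: a trailing colon right-aligns, a leading colon left-aligns.
    rule = "|" + "|".join(
        "-" * (w + 1) + ":" if num else ":" + "-" * (w + 1)
        for w, num in zip(widths, numeric)
    ) + "|"
    return "\n".join([line(headers), rule] + [line(r) for r in cells])


table = pipe_table(["item", "price"], [["apple", 1.5], ["watermelon", 12.25]])
print(table)
```

With tabulate installed, a single call along the lines of `tabulate(rows, headers=headers, tablefmt="pipe")` is meant to cover this kind of output without any hand-written padding logic.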
| 111 | +**Created by:** Sergey Astanin |
| 112 | +**Where to learn more:** <https://pypi.python.org/pypi/tabulate> |
| 113 | + |
| 114 | +## scrubadub |
| 115 | + |
| 116 | +Data scientists in fields like healthcare and finance regularly have to |
| 117 | +anonymize datasets. scrubadub removes [personally identifiable information |
| 118 | +(PII)](https://en.wikipedia.org/wiki/Personally_identifiable_information) from |
| 119 | +free text, such as: |
| 120 | + |
| 121 | + * Names (proper nouns) |
| 122 | + * Email addresses |
| 123 | + * URLs |
| 124 | + * Phone numbers |
| 125 | + * username/password combinations |
| 126 | + * Skype usernames |
| 127 | + * Social security numbers |
| 128 | + |
| 129 | +The documentation does a good job of showing ways in which you might want to |
| 130 | +customize scrubadub's behavior, like defining new PII types or excluding |
| 131 | +certain kinds of PII from being scrubbed. |
| 132 | + |
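The general approach of replacing detected PII with a placeholder can be sketched with regular expressions. This is not scrubadub's implementation or API: the two patterns are deliberately simple, the `scrub` helper is hypothetical, and the `{{TYPE}}` placeholder style is chosen here for illustration.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs more care.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scrub(text):
    """Replace each match with a {{TYPE}} placeholder (hypothetical helper)."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub("{{" + label + "}}", text)
    return text


cleaned = scrub("Contact jane.doe@example.com, SSN 123-45-6789.")
print(cleaned)
```

Names are much harder to catch with patterns like these, which is one reason to reach for a dedicated library rather than hand-rolled expressions.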
| 133 | +**Created by:** [Datascope Analytics](http://datascopeanalytics.com/) |
| 134 | +**Where to learn more:** <http://scrubadub.readthedocs.io/en/stable/index.html> |
| 135 | + |
| 136 | +## Arrow |
| 137 | + |
| 138 | +Let's be honest: working with dates and times in Python is a pain. Local |
| 139 | +timezones aren't automatically recognized. It takes several lines of |
| 140 | +unpleasant code to convert timezones and timestamps. |
| 141 | + |
| 142 | +Arrow aims to fix these problems and plug functionality gaps to help you |
| 143 | +handle dates and times with less code and fewer imports. Unlike Python's |
| 144 | +standard library, Arrow is timezone-aware and UTC by default. You can convert |
| 145 | +timezones or parse strings using one line of code. |
| 146 | + |
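For context, here is roughly what a parse-and-convert looks like with only the standard library, which is the kind of boilerplate Arrow aims to collapse. The timestamp and the fixed UTC-7 offset (standing in for US Pacific daylight time) are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Parse a timestamp string; the result is "naive" (no time zone attached).
naive = datetime.strptime("2016-08-23 12:00:00", "%Y-%m-%d %H:%M:%S")

# Declare it to be UTC explicitly.
utc_time = naive.replace(tzinfo=timezone.utc)

# Convert to a fixed UTC-7 offset (Pacific Daylight Time on this date).
pacific = timezone(timedelta(hours=-7), "PDT")
local_time = utc_time.astimezone(pacific)

print(local_time.isoformat())  # 2016-08-23T05:00:00-07:00
```

Arrow's documentation shows the same steps chained in one expression, along the lines of `arrow.get("2016-08-23T12:00:00").to("US/Pacific")`.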
| 147 | +**Created by:** [Chris Smith](https://twitter.com/crsmithdev) |
| 148 | +**Where to learn more:** <http://arrow.readthedocs.io/en/latest/> |
| 149 | + |
| 150 | +## Beautifier |
| 151 | + |
| 152 | +Beautifier's mission is simple: clean and prettify URLs and email addresses. |
| 153 | +You can parse emails by domain and username, and URLs by domain and parameters |
| 154 | +(e.g. UTMs or tokens). |
| 155 | + |
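The kind of splitting described here can be approximated with the standard library. The sketch below is not Beautifier's API; `split_email` and `split_url` are hypothetical helpers, and the address and URL are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def split_email(address):
    """Return (username, domain), splitting on the last '@'."""
    username, _, domain = address.rpartition("@")
    return username, domain


def split_url(url):
    """Return (domain, params), keeping the first value of each query parameter."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, params


username, domain = split_email("jane@example.com")
host, params = split_url("https://example.com/page?utm_source=blog&utm_medium=email")
```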
| 156 | +**Created by:** [Sachin Philip Mathew](https://twitter.com/sachin_philip) |
| 157 | +**Where to learn more:** <https://github.com/sachinvettithanam/beautifier> |
| 158 | + |
| 159 | +## ftfy |
| 160 | + |
| 161 | +ftfy (fixes text for you) takes in bad Unicode and outputs good Unicode. |
| 162 | +Basically, it fixes all the junk characters. `â€œquotesâ€\x9d` becomes |
| 163 | +`"quotes"`; `Ã¼` becomes `ü`; `&lt;3` becomes `<3`. If you work with text on |
| 164 | +a daily basis, this library is, as one user says, “a handy piece of magic.” |
| 165 | + |
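To show the kind of repair involved, the sketch below reverses one common failure, UTF-8 text that was mistakenly decoded as Latin-1, and unescapes an HTML entity. It uses only the standard library and is far cruder than ftfy, which relies on heuristics to decide when a fix is safe; `fix_mojibake` is a hypothetical helper.

```python
import html


def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; leave unchanged


# Simulate the damage: encode correctly, then decode with the wrong codec.
broken = "ü".encode("utf-8").decode("latin-1")
fixed = fix_mojibake(broken)

# HTML entities are a separate problem with a separate fix.
heart = html.unescape("&lt;3")
```

Text that was never damaged passes through unchanged in the common case, but a blind round trip like this can misfire on legitimate Latin-1 text, which is the gap a purpose-built library is meant to close.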
| 166 | +**Created by:** [Luminoso](http://www.luminoso.com/) |
| 167 | +**Where to learn more:** <https://github.com/LuminosoInsight/python-ftfy> |
| 168 | + |
| 169 | +## Further resources for wrangling data |
| 170 | + |
| 171 | +Here are a few of our favorite reads on munging/wrangling/cleansing data. |
| 172 | + |
| 173 | + * [What every data scientist should know about data anonymization](https://github.com/krasch/presentations/blob/master/pydata_Berlin_2016.pdf) (Katharina Rasch) |
| 174 | + * [Cleaning data in Python](https://data.library.utoronto.ca/cleaning-data-python) (University of Toronto Map & Data Library) |
| 175 | + * [Data Cleaning with Python - MoMA's Artwork Collection](https://www.dataquest.io/blog/data-cleaning-with-python/) (Dataquest) |