District Data Labs

Data Exploration with Python, Part 3

Fri, 31 Mar 2017 15:47:38 -0400

This is the third post in our Data Exploration with Python series. Before reading this post, make sure to check out Part 1 and Part 2!

Preparing yourself and your data like we have done thus far in this series is essential to analyzing your data well. However, the most exciting part of Exploratory Data Analysis (EDA) is actually getting in there, exploring the data, and discovering insights. That's exactly what we are going to start doing in this post.

We will begin with the cleaned and prepped vehicle fuel economy data set that we ended up with at the end of the last post. This version of the data set contains:

The higher-level categories we created via category aggregations.
The quintiles we created by binning our continuous variables.
The clusters we generated via k-means clustering based on numeric variables.

Now, without further ado, let's embark on our insight-finding mission!

Making Our Data Smaller: Filter + Aggregate

One of the fundamental ways to extract insights from a data set is to reduce the size of the data so that you can look at just a piece of it at a time. There are two ways to do this: filtering and aggregating. With filtering, you are essentially removing either rows or columns (or both rows and columns) in order to focus on a subset of the data that interests you. With aggregation, the objective is to group records in your data set that have similar categorical attributes and then perform some calculation (count, sum, mean, etc.) on one or more numerical fields so that you can observe and identify differences between records that fall into each group.

To begin filtering and aggregating our data set, we could write a function like the one below to aggregate based on a group_field that we provide, counting the number of rows in each group. To make things more intuitive and easier to interpret, we will also sort the data from most frequent to least and format it in a pandas data frame with appropriate column names.

def agg_count(df, group_field):
    grouped = df.groupby(group_field, as_index=False).size()
    grouped.sort(ascending = False)

    grouped = pd.DataFrame(grouped).reset_index()
    grouped.columns = [group_field, 'Count']
    return grouped

Now that we have this function in our toolkit, let's use it. Suppose we were looking at the Vehicle Category field in our data set and were curious about the number of vehicles in each category that were manufactured last year (2016). Here is how we would filter the data and use the agg_count function to transform it to show what we wanted to know.

vehicles_2016 = vehicles[vehicles['Year']==2016]
category_counts = agg_count(vehicles_2016, 'Vehicle Category')

This gives us what we want in tabular form, but we could take it a step further and visualize it with a horizontal bar chart.

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Category (2016) \n')

Now that we know how to do this, we can filter, aggregate, and plot just about anything in our data set with just a few lines of code. For example, here is the same metric but filtered for a different year (1985).

vehicles_1985 = vehicles[vehicles['Year']==1985]
category_counts = agg_count(vehicles, 'Vehicle Category')

ax = sns.barplot(data=category_counts, x='Count', y='Vehicle Category')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Category (1985) \n')

If we wanted to stick with the year 2016 but drill down to the more granular Vehicle Class, we could do that as well.

class_counts = agg_count(vehicles_2016, 'Vehicle Class')

ax = sns.barplot(data=class_counts, x='Count', y='Vehicle Class')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Class (2016) \n')

We could also look at vehicle counts by manufacturer.

make_counts = agg_count(vehicles_2016, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Vehicles Manufactured by Make (2016) \n')

What if we wanted to filter by something other than the year? We could do that by simply creating a different filtered data frame and passing that to our agg_count function. Below, instead of filtering by Year, I've filtered on the Fuel Efficiency field, which contains the fuel efficiency quintiles we generated in the last post. Let's choose the Very High Efficiency value so that we can see how many very efficient vehicles each manufacturer has made.

very_efficient = vehicles[vehicles['Fuel Efficiency']=='Very High Efficiency']
make_counts = agg_count(very_efficient, 'Make')

ax = sns.barplot(data=make_counts, x='Count', y='Make')
ax.set(xlabel='\n Number of Vehicles Manufactured')
sns.plt.title('Very Fuel Efficient Vehicles by Make \n')

What if we wanted to perform some other calculation, such as averaging, instead of counting the number of records that fall into each group? We can just create a new function called agg_avg that calculates the mean of a designated numerical field.

def agg_avg(df, group_field, calc_field):
    grouped = df.groupby(group_field, as_index=False)[calc_field].mean()
    grouped = grouped.sort(calc_field, ascending = False)
    grouped.columns = [group_field, 'Avg ' + str(calc_field)]
    return grouped

We can then simply swap out the agg_count function with our new agg_avg function and indicate what field we would like to use for our calculation. Below is an example showing the average fuel efficiency, represented by the Combined MPG field, by vehicle category.

category_avg_mpg = agg_avg(vehicles_2016, 'Vehicle Category', 'Combined MPG')

ax = sns.barplot(data=category_avg_mpg, x='Avg Combined MPG', y='Vehicle Category')
ax.set(xlabel='\n Average Combined MPG')
sns.plt.title('Average Combined MPG by Category (2016) \n')

Pivoting the Data for More Detail

Up until this point, we've been looking at our data at a pretty high level, aggregating up by a single variable. Sure, we were able to drill down from Vehicle Category to Vehicle Class to get a more granular view, but we only looked at the data one hierarchical level at a time. Next, we're going to go into further detail by taking a look at two or three variables at a time. The way we are going to do this is via pivot tables and their visual equivalents, pivot heatmaps.

First, we will create a pivot_count function, similar to the agg_count function we created earlier, that will transform whatever data frame we feed it into a pivot table with the rows, columns, and calculated field we specify.

def pivot_count(df, rows, columns, calc_field):
    df_pivot = df.pivot_table(values=calc_field, 
                              index=rows, 
                              columns=columns, 
                              aggfunc=np.size
                             ).dropna(axis=0, how='all')
    return df_pivot

We will then use this function on our vehicles_2016 data frame and pivot it out with the Fuel Efficiency quintiles we created in the last post representing the rows, the Engine Size quintiles representing the columns, and then counting the number of vehicles that had a Combined MPG value.

effic_size_pivot = pivot_count(vehicles_2016,'Fuel Efficiency',
                               'Engine Size','Combined MPG')

This is OK, but it would be faster to analyze visually. Let's create a heatmap that will color the magnitude of the counts and present us with a more intuitive view.

sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
sns.plt.title('Fuel Efficiency vs. Engine Size (2016) \n')

Just like we did earlier with our horizontal bar charts, we can easily filter by a different year and get a different perspective. For example, here's what this heatmap looks like for 1985.

effic_size_pivot = pivot_count(vehicles_1985,'Fuel Efficiency',
                               'Engine Size','Combined MPG')

fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Engine Size')
sns.plt.title('Fuel Efficiency vs. Engine Size (1985) \n')

With these pivot heatmaps, we are not limited to just two variables. We can pass a list of variables for any of the axes (rows or columns), and it will display all the different combinations of values for those variables.

effic_size_category = pivot_count(vehicles_2016,
                                  ['Engine Size','Fuel Efficiency'],
                                  'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,10))
sns.heatmap(effic_size_category, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
sns.plt.title('Fuel Efficiency + Engine Size vs. Vehicle Category (2016) \n')

In this heatmap, we have Engine Size and Fuel Efficiency combinations represented by the rows, and we've added a third variable (the Vehicle Category) across the columns. So now we can see a finer level of detail about what types of cars had what size engines and what level of fuel efficiency last year.

As a final example for this section, let's create a pivot heatmap that plots Make against Vehicle Category for 2016. We saw earlier, in the bar chart that counted vehicles by manufacturer, that BMW made the largest number of specific models last year. This pivot heatmap will let us see how those counts are distributed across vehicle categories, giving us a better sense of each auto company's current offerings in terms of the breadth vs. depth of vehicle types they make.

effic_size_pivot = pivot_count(vehicles_2016, 'Make',
                               'Vehicle Category','Combined MPG')

fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(effic_size_pivot, annot=True, fmt='g')
ax.set(xlabel='\n Vehicle Category')
sns.plt.title('Make vs. Vehicle Category (2016) \n')

Visualizing Changes Over Time

So far in this post, we've been looking at the data at given points in time. The next step is to take a look at how the data has changed over time. We can do this relatively easily by creating a multi_line function that accepts a data frame and x/y fields and then plots them on a multiline chart.

def multi_line(df, x, y):
    ax = df.groupby([x, y]).size().unstack(y).plot(figsize=(15,8), cmap="Set2")

Let's use this function to visualize our vehicle categories over time. The resulting chart shows the number of vehicles in each category that were manufactured each year.

multi_line(vehicles, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('Vehicle Categories Over Time \n')

We can see from the chart that Small Cars have generally dominated across the board and that there was a small decline in the late 90s that then started to pick up again in the early 2000s. We can also see the introduction and increase in popularity of SUVs starting in the late 90s, and the decline in popularity of trucks in recent years.

If we wanted to, we could zoom in and filter for specific manufacturers to see how their offerings have changed over the years. Since BMW had the most number of vehicles last year and we saw in the pivot heatmap that those were mostly small cars, let's filter for just their vehicles to see whether they have always made a lot of small cars or if this is more of a recent phenomenon.

bmw = vehicles[vehicles['Make'] == 'BMW']

multi_line(bmw, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('BMW Vehicle Categories Over Time \n')

We can see in the chart above that they started off making a reasonable number of small cars, and then seemed to ramp up production of those types of vehicles in the late 90s. We can contrast this with a company like Toyota, who started out making a lot of small cars back in the 1980s and then seemingly made a decision to gradually manufacture less of them over the years, focusing instead on SUVs, pickup trucks, and midsize cars.

toyota = vehicles[vehicles['Make'] == 'Toyota']

multi_line(toyota, 'Year', 'Vehicle Category')
ax.set(xlabel='\n Year')
sns.plt.title('Toyota Vehicle Categories Over Time \n')

Examining Relationships Between Variables

The final way we are going to explore our data in this post is by examining the relationships between numerical variables in our data. Doing this will provide us with better insight into which fields are highly correlated, what the nature of those correlations are, what typical combinations of numerical values exist in our data, and which combinations are anomalies.

For looking at relationships between variables, I often like to start with a scatter matrix because it gives me a bird's eye view of the relationships between all the numerical fields in my data set. With just a couple lines of code, we can not only create a scatter matrix, but we can also factor in a layer of color that can represent, for example, the clusters we generated at the end of the last post.

select_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                   'City MPG','Highway MPG','Combined MPG',
                   'CO2 Emission Grams/Mile', 'Fuel Cost/Year', 'Cluster Name']

sns.pairplot(vehicles[select_columns], hue='Cluster Name', size=3)

From here, we can see that there are some strong positive linear relationships in our data, such as the correlations between the MPG fields, and also among the fuel cost, barrels, and CO2 emissions fields. There are also some hyperbolic relationships in there as well, particularly between the MPG fields and engine displacement, fuel cost, barrels, and emissions. Additionally, we can also get a sense of the size of our clusters, how they are distributed, and the level of overlap we have between them.

Once we have this high-level overview, we can zoom in on anything that we think looks interesting. For example, let's take a closer look at Engine Displacement plotted against Combined MPG.

sns.lmplot('Engine Displacement', 'Combined MPG', data=vehicles, 
           hue='Cluster Name', size=8, fit_reg=False)

In addition to being able to see that there is a hyperbolic correlation between these two variables, we can see that our Small Very Efficient cluster resides in the upper left, followed by our Midsized Balanced cluster that looks smaller and more compact than the others. After that, we have our Large Moderately Efficient cluster and finally our Large Inefficient cluster on the bottom right.

We can also see that there are a few red points at the very top left and a few purple points at the very bottom right that we may want to investigate further to get a sense of what types of vehicles we are likely to see at the extremes. Try identifying some of those on your own by filtering the data set like we did earlier in the post. While you're at it, try creating additional scatter plots that zoom in on other numerical field combinations from the scatter matrix above. There are a bunch of other insights to be found in this data set, and all it takes is a little exploration!

Conclusion

We have covered quite a bit in this post, and I hope I've provided you with some good examples of how, with just a few tools in your arsenal, you can embark on a robust insight-finding expedition and discover truly interesting things about your data. Now that you have some structure in your process and some tools for exploring data, you can let your creativity run wild a little and come up with filter, aggregate, pivot, and scatter combinations that are most interesting to you. Feel free to experiment and post any interesting insights you're able to find in the comments!

Also, make sure to stay tuned because in the next (and final) post of this series, I'm going to cover how to identify and think about the different networks that are present in your data and how to explore them using graph analytics. Click the subscribe button below and enter your email so that you receive a notification as soon as it's published!

District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!

Basics of Entity Resolution

Sat, 11 Mar 2017 13:55:00 -0500

Entity resolution (ER) is the task of disambiguating records that correspond to real world entities across and within datasets. The applications of entity resolution are tremendous, particularly for public sector and federal datasets related to health, transportation, finance, law enforcement, and antiterrorism.

Unfortunately, the problems associated with entity resolution are equally big — as the volume and velocity of data grow, inference across networks and semantic relationships between entities becomes increasingly difficult. Data quality issues, schema variations, and idiosyncratic data collection traditions can all complicate these problems even further. When combined, such challenges amount to a substantial barrier to organizations’ ability to fully understand their data, let alone make effective use of predictive analytics to optimize targeting, thresholding, and resource management.

Naming Your Problem

Let us first consider what an entity is. Much as the key step in machine learning is to determine what an instance is, the key step in entity resolution is to determine what an entity is. Let's define an entity as a unique thing (a person, a business, a product) with a set of attributes that describe it (a name, an address, a shape, a title, a price, etc.). That single entity may have multiple references across data sources, such as a person with two different email addresses, a company with two different phone numbers, or a product listed on two different websites. If we want to ask questions about all the unique people, or businesses, or products in a dataset, we must find a method for producing an annotated version of that dataset that contains unique entities.

How can we tell that these multiple references point to the same entity? What if the attributes for each entity aren't the same across references? What happens when there are more than two or three or ten references to the same entity? Which one is the main (canonical) version? Do we just throw the duplicates away?

Each question points to a single problem, albeit one that frequently goes unnamed. Ironically, one of the problems in entity resolution is that even though it goes by a lot of different names, many people who struggle with entity resolution do not know the name of their problem.

The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

Deduplication: eliminating duplicate (exact) copies of repeated data.
Record linkage: identifying records that reference the same entity across different sources.
Canonicalization: converting data with more than one possible representation into a standard form.

Entity resolution is not a new problem, but thanks to Python and new machine learning libraries, it is an increasingly achievable objective. This post will explore some basic approaches to entity resolution using one of those tools, the Python Dedupe library. In this post, we will explore the basic functionalities of Dedupe, walk through how the library works under the hood, and perform a demonstration on two different datasets.

About Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data. It isn't the only tool available in Python for doing entity resolution tasks, but it is the only one (as far as we know) that conceives of entity resolution as it's primary task. In addition to removing duplicate entries from within a single dataset, Dedupe can also do record linkage across disparate datasets. Dedupe also scales fairly well — in this post we demonstrate using the library with a relatively small dataset of a few thousand records and a very large dataset of several million.

How Dedupe Works

Effective deduplication relies largely on domain expertise. This is for two main reasons: first, because domain experts develop a set of heuristics that enable them to conceptualize what a canonical version of a record should look like, even if they've never seen it in practice. Second, domain experts instinctively recognize which record subfields are most likely to uniquely identify a record; they just know where to look. As such, Dedupe works by engaging the user in labeling the data via a command line interface, and using machine learning on the resulting training data to predict similar or matching records within unseen data.

Testing Out Dedupe

Getting started with Dedupe is easy, and the developers have provided a convenient repo with examples that you can use and iterate on. Let's start by walking through the csv_example.py from the dedupe-examples. To get Dedupe running, we'll need to install unidecode, future, and dedupe.

In your terminal (we recommend doing so inside a virtual environment):

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

pip install unidecode
pip install future
pip install dedupe

Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py

Blocking and Affine Gap Distance

Let's imagine we own an online retail business, and we are developing a new recommendation engine that mines our existing customer data to come up with good recommendations for products that our existing and new customers might like to buy. Our dataset is a purchase history log where customer information is represented by attributes like name, telephone number, address, and order history. The database we've been using to log purchases assigns a new unique ID for every customer interaction.

But it turns out we're a great business, so we have a lot of repeat customers! We'd like to be able to aggregate the order history information by customer so that we can build a good recommender system with the data we have. That aggregation is easy if every customer's information is duplicated exactly in every purchase log. But what if it looks something like the table below?

How can we aggregate the data so that it is unique to the customer rather than the purchase? Features in the data set like names, phone numbers, and addresses will probably be useful. What is notable is that there are numerous variations for those attributes, particularly in how names appear — sometimes as nicknames, sometimes even misspellings. What we need is an intelligent and mostly automated way to create a new dataset for our recommender system. Enter Dedupe.

When comparing records, rather than treating each record as a single long string, Dedupe cleverly exploits the structure of the input data to instead compare the records field by field. The advantage of this approach is more pronounced when certain feature vectors of records are much more likely to assist in identifying matches than other attributes. Dedupe lets the user nominate the features they believe will be most useful:

fields = [
    {'field' : 'Name', 'type': 'String'},
    {'field' : 'Phone', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Address', 'type': 'String', 'has missing' : True},
    {'field' : 'Purchases', 'type': 'String'},
    ]

Dedupe scans the data to create tuples of records that it will propose to the user to label as being either matches, not matches, or possible matches. These uncertainPairs are identified using a combination of blocking , affine gap distance, and active learning.

Blocking is used to reduce the number of overall record comparisons that need to be made. Dedupe's method of blocking involves engineering subsets of feature vectors (these are called 'predicates') that can be compared across records. In the case of our people dataset above, the predicates might be things like:

the first three digits of the phone number
the full name
the first five characters of the name
a random 4-gram within the city name

Records are then grouped, or blocked, by matching predicates so that only records with matching predicates will be compared to each other during the active learning phase. The blocks are developed by computing the edit distance between predicates across records. Dedupe uses a distance metric called affine gap distance, which is a variation on Hamming distance that makes subsequent consecutive deletions or insertions cheaper.

Therefore, we might have one blocking method that groups all of the records that have the same area code of the phone number. This would result in three predicate blocks: one with a 202 area code, one with a 334, and one with NULL. There would be two records in the 202 block (IDs 452 and 821), two records in the 334 block (IDs 233 and 699), and one record in the NULL area code block (ID 720).

The relative weight of these different feature vectors can be learned during the active learning process and expressed numerically to ensure that features that will be most predictive of matches will be heavier in the overall matching schema. As the user labels more and more tuples, Dedupe gradually relearns the weights, recalculates the edit distances between records, and updates its list of the most uncertain pairs to propose to the user for labeling.

Once the user has generated enough labels, the learned weights are used to calculate the probability that each pair of records within a block is a duplicate or not. In order to scale the pairwise matching up to larger tuples of matched records (in the case that entities may appear more than twice within a document), Dedupe uses hierarchical clustering with centroidal linkage. Records within some threshold distance of a centroid will be grouped together. The final result is an annotated version of the original dataset that now includes a centroid label for each record.

Active Learning

You can see that dedupe is a command line application that will prompt the user to engage in active learning by showing pairs of entities and asking if they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Active learning is the so-called special sauce behind Dedupe. As in most supervised machine learning tasks, the challenge is to get labeled data that the model can learn from. The active learning phase in Dedupe is essentially an extended user-labeling session, which can be short if you have a small dataset and can take longer if your dataset is large. You are presented with four options:

You can experiment with typing the y, n, and u keys to flag duplicates for active learning. When you are finished, enter f to quit.

(y)es: confirms that the two references are to the same entity
(n)o: labels the two references as not the same entity
(u)nsure: does not label the two references as the same entity or as different entities
(f)inished: ends the active learning session and triggers the supervised learning phase

As you can see in the example above, some comparisons decisions are very easy. The first contains zero for zero hits on all four attributes being examined, so the verdict is most certainly a non-match. On the second, we have a 3/4 exact match, with the fourth being fuzzy in that one entity contains a piece of the matched entity; Ryerson vs. Chicago Public Schools Ryerson. A human would be able to discern these as two references to the same entity, and we can label it as such to enable the supervised learning that comes after the active learning.

The csv_example also includes an evaluation script that will enable you to determine how successfully you were able to resolve the entities. It's important to note that the blocking, active learning and supervised learning portions of the deduplication process are very dependent on the dataset attributes that the user nominates for selection. In the csv_example, the script nominates the following four attributes:

fields = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'Exact', 'has missing' : True},
    {'field' : 'Phone', 'type': 'String', 'has missing' : True},
    ]

A different combination of attributes would result in a different blocking, a different set of uncertainPairs, a different set of features to use in the active learning phase, and almost certainly a different result. In other words, user experience and domain knowledge factor in heavily at multiple phases of the deduplication process.

Something a Bit More Challenging

In order to try out Dedupe with a more challenging project, we decided to try out deduplicating the White House visitors' log. Our hypothesis was that it would be interesting to be able to answer questions such as "How many times has person X visited the White House during administration Y?" However, in order to do that, it would be necessary to generate a version of the list that contained unique entities. We guessed that there would be many cases where there were multiple references to a single entity, potentially with slight variations in how they appeared in the dataset. We also expected to find a lot of names that seemed similar but in fact referenced different entities. In other words, a good challenge!

The data set we used was pulled from the WhiteHouse.gov website, a part of the executive initiative to make federal data more open to the public. This particular set of data is a list of White House visitor record requests from 2006 through 2010. Here's a snapshot of what the data looks like via the White House API.

The dataset includes a lot of columns, and for most of the entries, the majority of these fields are blank:

Database Field	Field Description
NAMELAST	Last name of entity
NAMEFIRST	First name of entity
NAMEMID	Middle name of entity
UIN	Unique Identification Number
BDGNBR	Badge Number
Type of Access	Access type to White House
TOA	Time of arrival
POA	Post on arrival
TOD	Time of departure
POD	Post on departure
APPT_MADE_DATE	When the appointment date was made
APPT_START_DATE	When the appointment date is scheduled to start
APPT_END_DATE	When the appointment date is scheduled to end
APPT_CANCEL_DATE	When the appointment date was canceled
Total_People	Total number of people scheduled to attend
LAST_UPDATEDBY	Who was the last person to update this event
POST	Classified as 'WIN'
LastEntryDate	When the last update to this instance
TERMINAL_SUFFIX	ID for terminal used to process visitor
visitee_namelast	The visitee's last name
visitee_namefirst	The visitee's first name
MEETING_LOC	The location of the meeting
MEETING_ROOM	The room number of the meeting
CALLER_NAME_LAST	The authorizing person for the visitor's last name
CALLER_NAME_FIRST	The authorizing person for the visitor's first name
CALLER_ROOM	The authorizing person's room for the visitor
Description	Description of the event or visit
RELEASE_DATE	The date this set of logs were released to the public

Loading the Data

Using the API, the White House Visitor Log Requests can be exported in a variety of formats to include, .json, .csv, and .xlsx, .pdf, .xlm, and RSS. However, it's important to keep in mind that the dataset contains over 5 million rows. For this reason, we decided to use .csv and grabbed the data using requests:

import requests

def getData(url,fname):
    """
    Download the dataset from the webpage.
    """
    response = requests.get(url)
    with open(fname, 'w') as f:
        f.write(response.content)

DATAURL = "https://open.whitehouse.gov/api/views/p86s-ychb/rows.csv?accessType=DOWNLOAD"
ORIGFILE = "fixtures/whitehouse-visitors.csv"

getData(DATAURL,ORIGFILE)

Once downloaded, we can clean it up and load it into a database for more secure and stable storage.

Tailoring the Code

Next, we'll discuss what is needed to tailor a dedupe example to get the code to work for the White House visitors log dataset. The main challenge with this dataset is its sheer size. First, we'll need to import a few modules and connect to our database:

import csv
import psycopg2
from dateutil import parser
from datetime import datetime

conn = None

DATABASE = your_db_name
USER = your_user_name
HOST = your_hostname
PASSWORD = your_password

try:
    conn = psycopg2.connect(database=DATABASE, user=USER, host=HOST, password=PASSWORD)
    print ("I've connected")
except:
    print ("I am unable to connect to the database")
cur = conn.cursor()

The other challenge with our dataset are the numerous missing values and datetime formatting irregularities. We wanted to be able to use the datetime strings to help with entity resolution, so we wanted to get the formatting to be as consistent as possible. The following script handles both the datetime parsing and the missing values by combining Python's dateutil module and PostgreSQL's fairly forgiving 'varchar' type.

This function takes the csv data in as input, parses the datetime fields we're interested in ('lastname','firstname','uin','apptmade','apptstart','apptend', 'meeting_loc'.), and outputs a database table that retains the desired columns. Keep in mind this will take a while to run.

def dateParseSQL(nfile):
    cur.execute('''CREATE TABLE IF NOT EXISTS visitors_er
                  (visitor_id SERIAL PRIMARY KEY,
                  lastname    varchar,
                  firstname   varchar,
                  uin         varchar,
                  apptmade    varchar,
                  apptstart   varchar,
                  apptend     varchar,
                  meeting_loc varchar);''')
    conn.commit()
    with open(nfile, 'rU') as infile:
        reader = csv.reader(infile, delimiter=',')
        next(reader, None)
        for row in reader:
            for field in DATEFIELDS:
                if row[field] != '':
                    try:
                        dt = parser.parse(row[field])
                        row[field] = dt.toordinal()  # We also tried dt.isoformat()
                    except:
                        continue
            sql = "INSERT INTO visitors_er(lastname,firstname,uin,apptmade,apptstart,apptend,meeting_loc) \
                   VALUES (%s,%s,%s,%s,%s,%s,%s)"
            cur.execute(sql, (row[0],row[1],row[3],row[10],row[11],row[12],row[21],))
            conn.commit()
    print ("All done!")


dateParseSQL(ORIGFILE)

About 60 of our rows had ASCII characters, which we dropped using this SQL command:

delete from visitors where firstname ~ '[^[:ascii:]]' OR lastname ~ '[^[:ascii:]]';

For our deduplication script, we modified the PostgreSQL example as well as Dan Chudnov's adaptation of the script for the OSHA dataset.

import tempfile
import argparse
import csv
import os

import dedupe
import psycopg2
from psycopg2.extras import DictCursor

Initially, we wanted to try to use the datetime fields to deduplicate the entities, but dedupe was not a big fan of the datetime fields, whether in isoformat or ordinal, so we ended up nominating the following fields:

KEY_FIELD = 'visitor_id'
SOURCE_TABLE = 'visitors'

FIELDS =  [{'field': 'firstname', 'variable name': 'firstname',
               'type': 'String','has missing': True},
              {'field': 'lastname', 'variable name': 'lastname',
               'type': 'String','has missing': True},
              {'field': 'uin', 'variable name': 'uin',
               'type': 'String','has missing': True},
              {'field': 'meeting_loc', 'variable name': 'meeting_loc',
               'type': 'String','has missing': True}
              ]

We modified a function Dan wrote to generate the predicate blocks:

def candidates_gen(result_set):
    lset = set
    block_id = None
    records = []
    i = 0
    for row in result_set:
        if row['block_id'] != block_id:
            if records:
                yield records

            block_id = row['block_id']
            records = []
            i += 1

            if i % 10000 == 0:
                print ('{} blocks'.format(i))

        smaller_ids = row['smaller_ids']
        if smaller_ids:
            smaller_ids = lset(smaller_ids.split(','))
        else:
            smaller_ids = lset([])

        records.append((row[KEY_FIELD], row, smaller_ids))

    if records:
        yield records

And we adapted the method from the dedupe-examples repo to handle the active learning, supervised learning, and clustering steps:

def find_dupes(args):
    deduper = dedupe.Dedupe(FIELDS)

    with psycopg2.connect(database=args.dbname,
                          host='localhost',
                          cursor_factory=DictCursor) as con:
        with con.cursor() as c:
            c.execute('SELECT COUNT(*) AS count FROM %s' % SOURCE_TABLE)
            row = c.fetchone()
            count = row['count']
            sample_size = int(count * args.sample)

            print ('Generating sample of {} records'.format(sample_size))
            with con.cursor('deduper') as c_deduper:
                c_deduper.execute('SELECT visitor_id,lastname,firstname,uin,meeting_loc FROM %s' % SOURCE_TABLE)
                temp_d = dict((i, row) for i, row in enumerate(c_deduper))
                deduper.sample(temp_d, sample_size)
                del(temp_d)

            if os.path.exists(args.training):
                print ('Loading training file from {}'.format(args.training))
                with open(args.training) as tf:
                    deduper.readTraining(tf)

            print ('Starting active learning')
            dedupe.convenience.consoleLabel(deduper)

            print ('Starting training')
            deduper.train(ppc=0.001, uncovered_dupes=5)

            print ('Saving new training file to {}'.format(args.training))
            with open(args.training, 'w') as training_file:
                deduper.writeTraining(training_file)

            deduper.cleanupTraining()

            print ('Creating blocking_map table')
            c.execute("""
                DROP TABLE IF EXISTS blocking_map
                """)
            c.execute("""
                CREATE TABLE blocking_map
                (block_key VARCHAR(200), %s INTEGER)
                """ % KEY_FIELD)

            for field in deduper.blocker.index_fields:
                print ('Selecting distinct values for "{}"'.format(field))
                c_index = con.cursor('index')
                c_index.execute("""
                    SELECT DISTINCT %s FROM %s
                    """ % (field, SOURCE_TABLE))
                field_data = (row[field] for row in c_index)
                deduper.blocker.index(field_data, field)
                c_index.close()

            print ('Generating blocking map')
            c_block = con.cursor('block')
            c_block.execute("""
                SELECT * FROM %s
                """ % SOURCE_TABLE)
            full_data = ((row[KEY_FIELD], row) for row in c_block)
            b_data = deduper.blocker(full_data)

            print ('Inserting blocks into blocking_map')
            csv_file = tempfile.NamedTemporaryFile(prefix='blocks_', delete=False)
            csv_writer = csv.writer(csv_file)
            csv_writer.writerows(b_data)
            csv_file.close()

            f = open(csv_file.name, 'r')
            c.copy_expert("COPY blocking_map FROM STDIN CSV", f)
            f.close()

            os.remove(csv_file.name)

            con.commit()

            print ('Indexing blocks')
            c.execute("""
                CREATE INDEX blocking_map_key_idx ON blocking_map (block_key)
                """)
            c.execute("DROP TABLE IF EXISTS plural_key")
            c.execute("DROP TABLE IF EXISTS plural_block")
            c.execute("DROP TABLE IF EXISTS covered_blocks")
            c.execute("DROP TABLE IF EXISTS smaller_coverage")

            print ('Calculating plural_key')
            c.execute("""
                CREATE TABLE plural_key
                (block_key VARCHAR(200),
                block_id SERIAL PRIMARY KEY)
                """)
            c.execute("""
                INSERT INTO plural_key (block_key)
                SELECT block_key FROM blocking_map
                GROUP BY block_key HAVING COUNT(*) > 1
                """)

            print ('Indexing block_key')
            c.execute("""
                CREATE UNIQUE INDEX block_key_idx ON plural_key (block_key)
                """)

            print ('Calculating plural_block')
            c.execute("""
                CREATE TABLE plural_block
                AS (SELECT block_id, %s
                FROM blocking_map INNER JOIN plural_key
                USING (block_key))
                """ % KEY_FIELD)

            print ('Adding {} index'.format(KEY_FIELD))
            c.execute("""
                CREATE INDEX plural_block_%s_idx
                    ON plural_block (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            c.execute("""
                CREATE UNIQUE INDEX plural_block_block_id_%s_uniq
                ON plural_block (block_id, %s)
                """ % (KEY_FIELD, KEY_FIELD))

            print ('Creating covered_blocks')
            c.execute("""
                CREATE TABLE covered_blocks AS
                    (SELECT %s,
                            string_agg(CAST(block_id AS TEXT), ','
                            ORDER BY block_id) AS sorted_ids
                     FROM plural_block
                     GROUP BY %s)
                 """ % (KEY_FIELD, KEY_FIELD))

            print ('Indexing covered_blocks')
            c.execute("""
                CREATE UNIQUE INDEX covered_blocks_%s_idx
                    ON covered_blocks (%s)
                """ % (KEY_FIELD, KEY_FIELD))
            print ('Committing')

            print ('Creating smaller_coverage')
            c.execute("""
                CREATE TABLE smaller_coverage AS
                    (SELECT %s, block_id,
                        TRIM(',' FROM split_part(sorted_ids,
                        CAST(block_id AS TEXT), 1))
                        AS smaller_ids
                     FROM plural_block
                     INNER JOIN covered_blocks
                     USING (%s))
                """ % (KEY_FIELD, KEY_FIELD))
            con.commit()

            print ('Clustering...')
            c_cluster = con.cursor('cluster')
            c_cluster.execute("""
                SELECT *
                FROM smaller_coverage
                INNER JOIN %s
                    USING (%s)
                ORDER BY (block_id)
                """ % (SOURCE_TABLE, KEY_FIELD))
            clustered_dupes = deduper.matchBlocks(
                    candidates_gen(c_cluster), threshold=0.5)

            print ('Creating entity_map table')
            c.execute("DROP TABLE IF EXISTS entity_map")
            c.execute("""
                CREATE TABLE entity_map (
                    %s INTEGER,
                    canon_id INTEGER,
                    cluster_score FLOAT,
                    PRIMARY KEY(%s)
                )""" % (KEY_FIELD, KEY_FIELD))

            print ('Inserting entities into entity_map')
            for cluster, scores in clustered_dupes:
                cluster_id = cluster[0]
                for key_field, score in zip(cluster, scores):
                    c.execute("""
                        INSERT INTO entity_map
                            (%s, canon_id, cluster_score)
                        VALUES (%s, %s, %s)
                        """ % (KEY_FIELD, key_field, cluster_id, score))

            print ('Indexing head_index')
            c_cluster.close()
            c.execute("CREATE INDEX head_index ON entity_map (canon_id)")
            con.commit()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dbname', dest='dbname', default='whitehouse', help='database name')
    parser.add_argument('-s', '--sample', default=0.10, type=float, help='sample size (percentage, default 0.10)')
    parser.add_argument('-t', '--training', default='training.json', help='name of training file')
    args = parser.parse_args()
    find_dupes(args)

Active Learning Observations

We ran multiple experiments:

Test 1: lastname, firstname, meeting_loc => 447 (15 minutes of training)
Test 2: lastname, firstname, uin, meeting_loc => 3385 (5 minutes of training) - one instance that had 168 duplicates

We observed a lot of uncertainty during the active learning phase, mostly because of how enormous the dataset is. This was particularly pronounced with names that seemed more common to us and that sounded more domestic since those are much more commonly occurring in this dataset. For example, are two records containing the name Michael Grant the same entity?

Additionally, we noticed that there were a lot of variations in the way that middle names were captured. Sometimes they were concatenated with the first name, other times with the last name. We also observed what seemed to be many nicknames or that could have been references to separate entities: KIM ASKEW vs. KIMBERLEY ASKEW and Kathy Edwards vs. Katherine Edwards (and yes, dedupe does preserve variations in case). On the other hand, since nicknames generally appear only in people's first names, when we did see a short version of a first name paired with an unusual or rare last name, we were more confident in labeling those as a match.

Other things that made the labeling easier were clearly gendered names (e.g. Brian Murphy vs. Briana Murphy), which helped us to identify separate entities in spite of very small differences in the strings. Some names appeared to be clear misspellings, which also made us more confident in our labeling two references as matches for a single entity (Davifd Culp vs. David Culp). There were also a few potential easter eggs in the dataset, which we suspect might actually be aliases (Jon Doe and Ben Jealous).

One of the things we discovered upon multiple runs of the active learning process is that the number of fields the user nominates to Dedupe for use has a great impact on the kinds of predicate blocks that are generated during the initial blocking phase. Thus, the comparisons that are presented to the trainer during the active learning phase. In one of our runs, we used only the last name, first name, and meeting location fields. Some of the comparisons were easy:

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

lastname : KUZIEMKO
firstname : ILYANA
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Some were hard:

lastname : Desimone
firstname : Daniel
meeting_loc : OEOB

lastname : DeSimone
firstname : Daniel
meeting_loc : WH

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Results

What we realized from this is that there are two different kinds of duplicates that appear in our dataset. The first kind of duplicate is one that generated via (likely mistaken) duplicate visitor request forms. We noticed that these duplicate entries tended to be proximal to each other in terms of visitor_id number, have the same meeting location and the same uin (which confusingly, is not a unique guest identifier but appears to be assigned to every visitor within a unique tour group). The second kind of duplicate is what we think of as the frequent flier — people who seem to spend a lot of time at the White House like staffers and other political appointees.

During the dedupe process, we computed there were 332,606 potential duplicates within the data set of 1,048,576 entities. For this particular data, we would expect these kinds of figures, knowing that people visit for repeat business or social functions.

Within-Visit Duplicates

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

lastname : Ryan
meeting_loc : OEOB
firstname : Patrick
uin : U62671

Across-Visit Duplicates (Frequent Fliers)

lastname : TANGHERLINI
meeting_loc : OEOB
firstname : DANIEL
uin : U02692

lastname : TANGHERLINI
meeting_loc : NEOB
firstname : DANIEL
uin : U73085

lastname : ARCHULETA
meeting_loc : WH
firstname : KATHERINE
uin : U68121

lastname : ARCHULETA
meeting_loc : OEOB
firstname : KATHERINE
uin : U76331

Conclusion

In this beginners guide to Entity Resolution, we learned what it means to identify entities and their possible duplicates within and across records. To further examine this data beyond the scope of this blog post, we would like to determine which records are true duplicates. This would require additional information to canonicalize these entities, thus allowing for potential indexing of entities for future assessments. Ultimately we discovered the importance of entity resolution across a variety of domains, such as counter-terrorism, customer databases, and voter registration.

Please return to the District Data Labs blog for upcoming posts on entity resolution and discussion about a number of other important topics to the data science community. Upcoming post topics from our research group include string matching algorithms, data preparation, and entity identification!

Data Exploration with Python, Part 2

Tue, 07 Feb 2017 03:04:00 -0500

This is the second post in our Data Exploration with Python series. Before reading this post, make sure to check out Data Exploration with Python, Part 1!

Mise en place (noun): In a professional kitchen, the disciplined organization and preparation of equipment and food before service begins.

When performing exploratory data analysis (EDA), it is important to not only prepare yourself (the analyst) but to prepare your data as well. As we discussed in the previous post, a small amount of preparation will often save you a significant amount of time later on. So let's review where we should be at this point and then continue our exploration process with data preparation.

In Part 1 of this series, we were introduced to the data exploration framework we will be using. As a reminder, here is what that framework looks like.

We also introduced the example data set we are going to be using to illustrate the different phases and stages of the framework. Here is what that looks like.

We then familiarized ourselves with our data set by identifying the types of information and entities encoded within it. We also reviewed several data transformation and visualization methods that we will use later to explore and analyze it. Now we are at the last stage of the framework's Prep Phase, the Create stage, where our goal will be to create additional categorical fields that will make our data easier to explore and allow us to view it from new perspectives.

Renaming Columns to be More Intuitive

Before we dive in and start creating categories, however, we have an opportunity to improve our categorization efforts by examining the columns in our data and making sure their labels intuitively convey what they represent. Just as with the other aspects of preparation, changing them now will save us from having to remember what displ or co2TailpipeGpm mean when they show up on a chart later. In my experience, these small, detail-oriented enhancements to the beginning of your process usually compound and preserve cognitive cycles that you can later apply to extracting insights.

We can use the code below to rename the columns in our vehicles data frame.

vehicles.columns = ['Make','Model','Year','Engine Displacement','Cylinders',
                    'Transmission','Drivetrain','Vehicle Class','Fuel Type',
                    'Fuel Barrels/Year','City MPG','Highway MPG','Combined MPG',
                    'CO2 Emission Grams/Mile','Fuel Cost/Year']

Thinking About Categorization

Now that we have changed our column names to be more intuitive, let's take a moment to think about what categorization is and examine the categories that currently exist in our data set. At the most basic level, categorization is just a way that humans structure information — how we hierarchically create order out of complexity. Categories are formed based on attributes that entities have in common, and they present us with different perspectives from which we can view and think about our data.

Our primary objective in this stage is to create additional categories that will help us further organize our data. This will prove beneficial not only for the exploratory analysis we will conduct but also for any supervised machine learning or modeling that may happen further down the data science pipeline. Seasoned data scientists know that the better your data is organized, the better downstream analyses you will be able to perform and the more informative features you will have to feed into your machine learning models.

In this stage of the framework, we are going to create additional categories in 3 distinct ways:

Category Aggregations
Binning Continuous Variables
Clustering

Now that we have a better idea of what we are doing and why, let's get started.

Aggregating to Higher-Level Categories

The first way we are going to create additional categories is by identifying opportunities to create higher-level categories out of the variables we already have in our data set. In order to do this, we need to get a sense of what categories currently exist in the data. We can do this by iterating through our columns and printing out the name, the number of unique values, and the data type for each.

def unique_col_values(df):
    for column in df:
        print("{} | {} | {}".format(
            df[column].name, len(df[column].unique()), df[column].dtype
        ))

unique_col_values(vehicles)

Make | 126 | object
Model | 3491 | object
Year | 33 | int64
Engine Displacement | 65 | float64
Cylinders | 9 | float64
Transmission | 43 | object
Drivetrain | 7 | object
Vehicle Class | 34 | object
Fuel Type | 13 | object
Fuel Barrels/Year | 116 | float64
City MPG | 48 | int64
Highway MPG | 49 | int64
Combined MPG | 45 | int64
CO2 Emission Grams/Mile | 550 | float64
Fuel Cost/Year | 58 | int64

From looking at the output, it is clear that we have some numeric columns (int64 and float64) and some categorical columns (object). For now, let's focus on the six categorical columns in our data set.

Make: 126 unique values
Model: 3,491 unique values
Transmission: 43 unique values
Drivetrain: 7 unique values
Vehicle Class: 34 unique values
Fuel Type: 13 unique values

When aggregating and summarizing data, having too many categories can be problematic. The average human is said to have the ability to hold 7 objects at a time in their short-term working memory. Accordingly, I have noticed that once you exceed 8-10 discrete values in a category, it becomes increasingly difficult to get a holistic picture of how the entire data set is divided up.

What we want to do is examine the values in each of our categorical variables to determine where opportunities exist to aggregate them into higher-level categories. The way this is typically done is by using a combination of clues from the current categories and any domain knowledge you may have (or be able to acquire).

For example, imagine aggregating by Transmission, which has 43 discrete values in our data set. It is going to be difficult to derive insights due to the fact that any aggregated metrics are going to be distributed across more categories than you can hold in short-term memory. However, if we examine the different transmission categories with the goal of finding common features that we can group on, we would find that all 43 values fall into one of two transmission types, Automatic or Manual.

Let's create a new Transmission Type column in our data frame and, with the help of the loc method in pandas, assign it a value of Automatic where the first character of Transmission is the letter A and a value of Manual where the first character is the letter M.

AUTOMATIC = "Automatic"
MANUAL = "Manual"

vehicles.loc[vehicles['Transmission'].str.startswith('A'),
             'Transmission Type'] = AUTOMATIC

vehicles.loc[vehicles['Transmission'].str.startswith('M'),
             'Transmission Type'] = MANUAL

We can apply the same logic to the Vehicle Class field. We originally have 34 vehicle classes, but we can distill those down into 8 vehicle categories, which are much easier to remember.

small = ['Compact Cars','Subcompact Cars','Two Seaters','Minicompact Cars']
midsize = ['Midsize Cars']
large = ['Large Cars']

vehicles.loc[vehicles['Vehicle Class'].isin(small), 
             'Vehicle Category'] = 'Small Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(midsize), 
             'Vehicle Category'] = 'Midsize Cars'

vehicles.loc[vehicles['Vehicle Class'].isin(large), 
             'Vehicle Category'] = 'Large Cars'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Station'), 
             'Vehicle Category'] = 'Station Wagons'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Truck'), 
             'Vehicle Category'] = 'Pickup Trucks'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Special Purpose'), 
             'Vehicle Category'] = 'Special Purpose'

vehicles.loc[vehicles['Vehicle Class'].str.contains('Sport Utility'), 
             'Vehicle Category'] = 'Sport Utility'

vehicles.loc[(vehicles['Vehicle Class'].str.lower().str.contains('van')),
             'Vehicle Category'] = 'Vans & Minivans'

Next, let's look at the Make and Model fields, which have 126 and 3,491 unique values respectively. While I can't think of a way to get either of those down to 8-10 categories, we can create another potentially informative field by concatenating Make and the first word of the Model field together into a new Model Type field. This would allow us to, for example, categorize all Chevrolet Suburban C1500 2WD vehicles and all Chevrolet Suburban K1500 4WD vehicles as simply Chevrolet Suburbans.

vehicles['Model Type'] = (vehicles['Make'] + " " +
                          vehicles['Model'].str.split().str.get(0))

Finally, let's look at the Fuel Type field, which has 13 unique values. On the surface, that doesn't seem too bad, but upon further inspection, you'll notice some complexity embedded in the categories that could probably be organized more intuitively.

vehicles['Fuel Type'].unique()

array(['Regular', 'Premium', 'Diesel', 'Premium and Electricity',
       'Premium or E85', 'Premium Gas or Electricity', 'Gasoline or E85',
       'Gasoline or natural gas', 'CNG', 'Regular Gas or Electricity',
       'Midgrade', 'Regular Gas and Electricity', 'Gasoline or propane'],
        dtype=object)

This is interesting and a little tricky because there are some categories that contain a single fuel type and others that contain multiple fuel types. In order to organize this better, we will create two sets of categories from these fuel types. The first will be a set of columns that will be able to represent the different combinations, while still preserving the individual fuel types.

vehicles['Gas'] = 0
vehicles['Ethanol'] = 0
vehicles['Electric'] = 0
vehicles['Propane'] = 0
vehicles['Natural Gas'] = 0

vehicles.loc[vehicles['Fuel Type'].str.contains(
        'Regular|Gasoline|Midgrade|Premium|Diesel'),'Gas'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('E85'),'Ethanol'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('Electricity'),'Electric'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('propane'),'Propane'] = 1

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),'Natural Gas'] = 1

As it turns out, 99% of the vehicles in our database have gas as a fuel type, either by itself or combined with another fuel type. Since that is the case, let's create a second set of categories - specifically, a new Gas Type field that extracts the type of gas (Regular, Midgrade, Premium, Diesel, or Natural) each vehicle accepts.

vehicles.loc[vehicles['Fuel Type'].str.contains(
        'Regular|Gasoline'),'Gas Type'] = 'Regular'

vehicles.loc[vehicles['Fuel Type'] == 'Midgrade',
             'Gas Type'] = 'Midgrade'

vehicles.loc[vehicles['Fuel Type'].str.contains('Premium'),
             'Gas Type'] = 'Premium'

vehicles.loc[vehicles['Fuel Type'] == 'Diesel',
             'Gas Type'] = 'Diesel'

vehicles.loc[vehicles['Fuel Type'].str.contains('natural|CNG'),
             'Gas Type'] = 'Natural'

An important thing to note about what we have done with all of the categorical fields in this section is that, while we created new categories, we did not overwrite the original ones. We created additional fields that will allow us to view the information contained within the data set at different (often higher) levels. If you need to drill down to the more granular original categories, you can always do that. However, now we have a choice whereas before we performed these category aggregations, we did not.

Creating Categories from Continuous Variables

The next way we can create additional categories in our data is by binning some of our continuous variables - breaking them up into different categories based on a threshold or distribution. There are multiple ways you can do this, but I like to use quintiles because it gives me one middle category, two categories outside of that which are moderately higher and lower, and then two extreme categories at the ends. I find that this is a very intuitive way to break things up and provides some consistency across categories. In our data set, I've identified 4 fields that we can bin this way.

Binning essentially looks at how the data is distributed, creates the necessary number of bins by splitting up the range of values (either equally or based on explicit boundaries), and then categorizes records into the appropriate bin that their continuous value falls into. Pandas has a qcut method that makes binning extremely easy, so let's use that to create our quintiles for each of the continuous variables we identified.

efficiency_categories = ['Very Low Efficiency', 'Low Efficiency',
                         'Moderate Efficiency','High Efficiency',
                         'Very High Efficiency']

vehicles['Fuel Efficiency'] = pd.qcut(vehicles['Combined MPG'],
                                      5, efficiency_categories)

engine_categories = ['Very Small Engine', 'Small Engine','Moderate Engine',
                     'Large Engine', 'Very Large Engine']

vehicles['Engine Size'] = pd.qcut(vehicles['Engine Displacement'],
                                  5, engine_categories)

emission_categories = ['Very Low Emissions', 'Low Emissions',
                        'Moderate Emissions','High Emissions',
                        'Very High Emissions']

vehicles['Emissions'] = pd.qcut(vehicles['CO2 Emission Grams/Mile'],
                                 5, emission_categories)

fuelcost_categories = ['Very Low Fuel Cost', 'Low Fuel Cost',
                       'Moderate Fuel Cost','High Fuel Cost',
                       'Very High Fuel Cost']

vehicles['Fuel Cost'] = pd.qcut(vehicles['Fuel Cost/Year'],
                                5, fuelcost_categories)

Clustering to Create Additional Categories

The final way we are going to prepare our data is by clustering to create additional categories. There are a few reasons why I like to use clustering for this. First, it takes multiple fields into consideration together at the same time, whereas the other categorization methods only consider one field at a time. This will allow you to categorize together entities that are similar across a variety of attributes, but might not be close enough in each individual attribute to get grouped together.

Clustering also creates new categories for you automatically, which takes much less time than having to comb through the data yourself identifying patterns across attributes that you can form categories on. It will automatically group similar items together for you.

The third reason I like to use clustering is because it will sometimes group things in ways that you, as a human, may not have thought of. I'm a big fan of humans and machines working together to optimize analytical processes, and this is a good example of value that machines bring to the table that can be helpful to humans. I'll write more about my thoughts on that in future posts, but for now, let's move on to clustering our data.

The first thing we are going to do is isolate the columns we want to use for clustering. These are going to be columns with numeric values, as the clustering algorithm will need to compute distances in order to group similar vehicles together.

cluster_columns = ['Engine Displacement','Cylinders','Fuel Barrels/Year',
                   'City MPG','Highway MPG','Combined MPG',
                   'CO2 Emission Grams/Mile', 'Fuel Cost/Year']

Next, we want to scale the features we are going to cluster on. There are a variety of ways to normalize and scale variables, but I'm going to keep things relatively simple and just use Scikit-Learn's MaxAbsScaler, which will divide each value by the max absolute value for that feature. This will preserve the distributions in the data and convert the values in each field to a number between 0 and 1 (technically -1 and 1, but we don't have any negatives).

from sklearn import preprocessing
scaler = preprocessing.MaxAbsScaler()

vehicle_clusters = scaler.fit_transform(vehicles[cluster_columns])
vehicle_clusters = pd.DataFrame(vehicle_clusters, columns=cluster_columns)

Now that our features are scaled, let's write a couple of functions. The first function we are going to write is a kmeans_cluster function that will k-means cluster a given data frame into a specified number of clusters. It will then return a copy of the original data frame with those clusters appended in a column named Cluster.

from sklearn.cluster import KMeans

def kmeans_cluster(df, n_clusters=2):
    model = KMeans(n_clusters=n_clusters, random_state=1)
    clusters = model.fit_predict(df)
    cluster_results = df.copy()
    cluster_results['Cluster'] = clusters
    return cluster_results

Our second function, called summarize_clustering is going to count the number of vehicles that fall into each cluster and calculate the cluster means for each feature. It is going to merge the counts and means into a single data frame and then return that summary to us.

def summarize_clustering(results):
    cluster_size = results.groupby(['Cluster']).size().reset_index()
    cluster_size.columns = ['Cluster', 'Count']
    cluster_means = results.groupby(['Cluster'], as_index=False).mean()
    cluster_summary = pd.merge(cluster_size, cluster_means, on='Cluster')
    return cluster_summary

We now have functions for what we need to do, so the next step is to actually cluster our data. But wait, our kmeans_cluster function is supposed to accept a number of clusters. How do we determine how many clusters we want?

There are a number of approaches for figuring this out, but for the sake of simplicity, we are just going to plug in a couple of numbers and visualize the results to arrive at a good enough estimate. Remember earlier in this post where we were trying to aggregate our categorical variables to less than 8-10 discrete values? We are going to apply the same logic here. Let's start out with 8 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 8)
cluster_summary = summarize_clustering(cluster_results)

After running the couple of lines of code above, your cluster_summary should look similar to the following.

By looking at the Count column, you can tell that there are some clusters that have significantly more records in them (ex. Cluster 7) and others that have significantly fewer (ex. Cluster 3). Other than that, though, it is difficult to notice anything informative about the summary. I don't know about you, but to me, the rest of the summary just looks like a bunch of decimals in a table.

This is a prime opportunity to use a visualization to discover insights faster. With just a couple import statements and a single line of code, we can light this summary up in a heatmap so that we can see the contrast between all those decimals and between the different clusters.

import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

In this heatmap, the rows represent the features and the columns represent the clusters, so we can compare how similar or differently columns look to each other. Our goal for clustering these features is to ultimately create meaningful categories out of the clusters, so we want to get to the point where we can clearly distinguish one from the others. This heatmap allows us to do this quickly and visually.

With this goal in mind, it is apparent that we probably have too many clusters because:

Clusters 3, 4, and 7 look pretty similar
Clusters 2 and 5 look similar as well
Clusters 0 and 6 are also a little close for comfort

From the way our heatmap currently looks, I'm willing to bet that we can cut the number of clusters in half and get clearer boundaries. Let's re-run the clustering, summary, and heatmap code for 4 clusters and see what kind of results we get.

cluster_results = kmeans_cluster(vehicle_clusters, 4)
cluster_summary = summarize_clustering(cluster_results)

sns.heatmap(cluster_summary[cluster_columns].transpose(), annot=True)

These clusters look more distinct, don't they? Clusters 1 and 3 look like they are polar opposites of each other, cluster 0 looks like it’s pretty well balanced across all the features, and cluster 2 looks like it’s about half-way between Cluster 0 and Cluster 1.

We now have a good number of clusters, but we still have a problem. It is difficult to remember what clusters 0, 1, 2, and 3 mean, so as a next step, I like to assign descriptive names to the clusters based on their properties. In order to do this, we need to look at the levels of each feature for each cluster and come up with intuitive natural language descriptions for them. You can have some fun and can get as creative as you want here, but just keep in mind that the objective is for you to be able to remember the characteristics of whatever label you assign to the clusters.

Cluster 1 vehicles seem to have large engines that consume a lot of fuel, process it inefficiently, produce a lot of emissions, and cost a lot to fill up. I'm going to label them Large Inefficient.
Cluster 3 vehicles have small, fuel efficient engines that don't produce a lot of emissions and are relatively inexpensive to fill up. I'm going to label them Small Very Efficient.
Cluster 0 vehicles are fairly balanced across every category, so I'm going to label them Midsized Balanced.
Cluster 2 vehicles have large engines but are more moderately efficient than the vehicles in Cluster 1, so I'm going to label them Large Moderately Efficient.

Now that we have come up with these descriptive names for our clusters, let's add a Cluster Name column to our cluster_results data frame, and then copy the cluster names over to our original vehicles data frame.

cluster_results['Cluster Name'] = ''
cluster_results['Cluster Name'][cluster_results['Cluster']==0] = 'Midsized Balanced'
cluster_results['Cluster Name'][cluster_results['Cluster']==1] = 'Large Inefficient'
cluster_results['Cluster Name'][cluster_results['Cluster']==2] = 'Large Moderately Efficient'
cluster_results['Cluster Name'][cluster_results['Cluster']==3] = 'Small Very Efficient'

vehicles = vehicles.reset_index().drop('index', axis=1)
vehicles['Cluster Name'] = cluster_results['Cluster Name']

Conclusion

In this post, we examined several ways to prepare a data set for exploratory analysis. First, we looked at the categorical variables we had and attempted to find opportunities to roll them up into higher-level categories. After that, we converted some of our continuous variables into categorical ones by binning them into quintiles based on how relatively high or low their values were. Finally, we used clustering to efficiently create categories that automatically take multiple fields into consideration. The result of all this preparation is that we now have several columns containing meaningful categories that will provide different perspectives of our data and allow us to acquire as many insights as possible.

Now that we have these meaningful categories, our data set is in really good shape, which means that we can move on to the next phase of our data exploration framework. In the next post, we will cover the first two stages of the Explore Phase and demonstrate various ways to visually aggregate, pivot, and identify relationships between fields in our data. Make sure to subscribe to the DDL blog so that you get notified when we publish it!

District Data Labs

Data Exploration with Python, Part 3

Making Our Data Smaller: Filter + Aggregate

Pivoting the Data for More Detail

Visualizing Changes Over Time

Examining Relationships Between Variables

Conclusion

Basics of Entity Resolution

Naming Your Problem

About Dedupe

How Dedupe Works

Testing Out Dedupe

Blocking and Affine Gap Distance

Active Learning

Something a Bit More Challenging

Loading the Data

Tailoring the Code

Active Learning Observations

Results

Within-Visit Duplicates

Across-Visit Duplicates (Frequent Fliers)

Conclusion

Recommended Reading

Data Exploration with Python, Part 2

Renaming Columns to be More Intuitive

Thinking About Categorization

Aggregating to Higher-Level Categories

Creating Categories from Continuous Variables

Clustering to Create Additional Categories

Conclusion