Just enough SQL


  Machine Learning in Python

SQL

SQL stands for Structured Query Language. It is the language of relational databases. So, before we get to SQL, let’s understand what a relational database is. The next couple of pages will introduce databases at a very high level.

What is a Database

A database is software that is used to store data. There are many types of databases (relational, graph, document etc.) but we are going to focus only on relational databases, as they are the most popular database format. Also, SQL is specifically used for relational databases.

Relational databases store data in tabular format. For example, let’s see how an employee’s data could be structured in a relational database.

This is how we humans would like to see data. However, this flat layout does not scale; large amounts of data need to be stored in a way that is more suitable for fast reads and writes. That is where the concept of data normalization comes in. Normalization is a subject in itself and we will not be discussing it here. However, the process is pretty intuitive overall. Here is how the same data would be structured in a well-designed database.
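Normalization is easier to see with a toy example. Here is a hypothetical sketch (the table and values below are made up for illustration): a flat table with a repeating salary history is split into two smaller tables linked by the employee number, and a join reconstructs the original view.

```python
import pandas as pd

# One flat, human-readable table (hypothetical data).
flat = pd.DataFrame({
    "emp_no":     [1, 1, 2],
    "first_name": ["Georgi", "Georgi", "Bezalel"],
    "salary":     [60000, 66000, 72000],
})

# Normalized: employee attributes live in one table...
employees = flat[["emp_no", "first_name"]].drop_duplicates()
# ...and the repeating salary history in another, linked by emp_no.
salaries = flat[["emp_no", "salary"]]

# Joining the two tables reconstructs the original flat view.
restored = employees.merge(salaries, on="emp_no")
```

The duplication of the first name disappears in the normalized form, which is exactly the kind of saving that matters at scale.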

There are many databases in use today. For example, MySQL, Oracle, Microsoft SQL Server, PostgreSQL etc. In this chapter, we will be focusing on the MySQL database. Here is a quick instruction set on How to install MySQL Database.

By default, the MySQL database comes with very few data tables. MySQL provides a big employee database (with 300K employee records) that you can download from the web and load into your database. You can use it to practice data extraction. The data above is taken from that employee database.

What is SQL

SQL is the language of relational databases. It can be used to create, extract, manipulate and delete data in the database. For example, here is a quick SQL statement to extract the first 10 rows of data from a database table EMPLOYEES.

SELECT * FROM employees LIMIT 10;

The “*” indicates that all columns need to be selected. As you can see, the syntax is pretty intuitive. It is deliberately designed to be like English.

SQL Connectors

As discussed previously, there are a variety of databases. Each of these databases has a specific connector that you have to install. Since we will be working with a MySQL database, we will install the MySQL connector.

pip install mysql-connector-python

Typically, you will be given the database details like below.

Server Address : xx.xx.xx.xx
port           : 33xx
schema         : xxxx

user id        : xxxx
password       : xxxx

import mysql.connector

db = mysql.connector.connect(
  host     = "localhost",
  user     = "ajaytech",
  password = "ajaytech"
)

db

Database interactions are typically handled using something called a cursor. A cursor is just a pointer to a set of data retrieved from the database. It is up to us to iterate over the retrieved data and get what we want. Typically this is done using a loop. So, this is basically a two-step process:

  1. Execute an SQL statement and get the result into a cursor
  2. Iterate over the cursor to get the data
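These two steps work the same way across Python database drivers, since they all follow the DB-API cursor interface. Here is a minimal, runnable sketch of the pattern using the standard library's sqlite3 module as a stand-in, so it works even without a MySQL server:

```python
import sqlite3

# Step 1 - execute an SQL statement; the cursor now points at the result set.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
cur.execute("INSERT INTO employees VALUES (10001, 'Georgi'), (10002, 'Bezalel')")

# Step 2 - iterate over the cursor; each row comes back as a tuple.
cur.execute("SELECT * FROM employees")
rows = [row for row in cur]
print(rows)   # [(10001, 'Georgi'), (10002, 'Bezalel')]
```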

For example, let’s do these two steps to list all the databases. Each database is essentially a collection of tables.

Step 1 – Get the list of tables into a cursor

cur = db.cursor()

cur.execute("SHOW DATABASES")

Step 2 – Iterate over the cursor to get the list of databases

for database in cur:
  print(database)
('information_schema',)
('mysql',)
('performance_schema',)
('sakila',)
('sys',)
('world',)

Once we know the list of databases, we have to select the database first. Once we do that, we can freely go about executing the select statements on that particular database.

cur.execute("use world")

List all the tables in the database.

cur.execute("show tables")

for table in cur:
  print(table)

('city',)
('country',)
('countrylanguage',)

Let’s pick a table – say country. Now, let’s extract all the columns in that table. They will become the columns of our Pandas dataframe.

cur.execute("show columns from country")
column_names = []
for column in cur:
  column_names.append(column[0])

column_names

['Code',
 'Name',
 'Continent',
 'Region',
 'SurfaceArea',
 'IndepYear',
 'Population',
 'LifeExpectancy',
 'GNP',
 'GNPOld',
 'LocalName',
 'GovernmentForm',
 'HeadOfState',
 'Capital',
 'Code2']
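Looping over SHOW COLUMNS works, but DB-API cursors also expose the column names of the last SELECT through their description attribute, where the first element of each entry is the name. A small sketch, again using sqlite3 as a stand-in for the MySQL connector:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE country (Code TEXT, Name TEXT, Continent TEXT)")

# After a SELECT, cursor.description holds one entry per column;
# index 0 of each entry is the column name.
cur.execute("SELECT * FROM country")
column_names = [desc[0] for desc in cur.description]
print(column_names)   # ['Code', 'Name', 'Continent']
```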

Once we got the column names, let’s get the actual data from the table.

cur.execute("select * from country")

import pandas as pd

rows = []
for data in cur:
  rows.append(list(data))

country_data = pd.DataFrame(rows)

country_data.head()
0       1       2       3       4       5       6       7       8       9       10      11      12      13      14
0       ABW     Aruba   North America   Caribbean       193.0   NaN     103000  78.4    828.0   793.0   Aruba   Nonmetropolitan Territory of The Netherlands    Beatrix         129.0   AW
1       AFG     Afghanistan     Asia    Southern and Central Asia       652090.0        1919.0  22720000        45.9    5976.0  NaN     Afganistan/Afqanestan   Islamic Emirate         Mohammad Omar   1.0     AF
2       AGO     Angola  Africa  Central Africa  1246700.0       1975.0  12878000        38.3    6648.0  7984.0  Angola  Republic        José Eduardo dos Santos        56.0    AO
3       AIA     Anguilla        North America   Caribbean       96.0    NaN     8000    76.1    63.2    NaN     Anguilla        Dependent Territory of the UK   Elisabeth II    62.0    AI
4       ALB     Albania         Europe  Southern Europe         28748.0         1912.0  3401200         71.6    3205.0  2500.0  Shqipëria      Republic        Rexhep Mejdani  34.0    AL

Great !!! We just need one last step before we finish turning the table into a Pandas dataframe – set the column names that we already extracted in a previous step.

country_data.columns = column_names

country_data.head()

Code    Name    Continent       Region  SurfaceArea     IndepYear       Population      LifeExpectancy  GNP     GNPOld  LocalName       GovernmentForm  HeadOfState     Capital         Code2
0       ABW     Aruba   North America   Caribbean       193.0   NaN     103000  78.4    828.0   793.0   Aruba   Nonmetropolitan Territory of The Netherlands    Beatrix         129.0   AW
1       AFG     Afghanistan     Asia    Southern and Central Asia       652090.0        1919.0  22720000        45.9    5976.0  NaN     Afganistan/Afqanestan   Islamic Emirate         Mohammad Omar   1.0     AF
2       AGO     Angola  Africa  Central Africa  1246700.0       1975.0  12878000        38.3    6648.0  7984.0  Angola  Republic        José Eduardo dos Santos        56.0    AO
3       AIA     Anguilla        North America   Caribbean       96.0    NaN     8000    76.1    63.2    NaN     Anguilla        Dependent Territory of the UK   Elisabeth II    62.0    AI
4       ALB     Albania         Europe  Southern Europe         28748.0         1912.0  3401200         71.6    3205.0  2500.0  Shqipëria      Republic        Rexhep Mejdani  34.0    AL

Instead of going the cursor route, you can also choose an ORM that abstracts away the metadata associated with the database table. A detailed description of ORMs is once again a subject in itself and we will not be discussing it today.

ORM stands for Object Relational Mapper. Instead of just focusing on SQL, an ORM lets us use an object-oriented paradigm to extract data from a relational database. Just think of an ORM as an OO wrapper around SQL. Here is a quick visual to show the same.

One such ORM is SQLAlchemy.

pip install SQLAlchemy
pip install PyMySQL

The connection string below uses the PyMySQL driver, which is why PyMySQL is installed as well.

import pandas as pd

import sqlalchemy as db

engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/world')
df = pd.read_sql_table("country",engine)
df.head()

Code    Name    Continent       Region  SurfaceArea     IndepYear       Population      LifeExpectancy  GNP     GNPOld  LocalName       GovernmentForm  HeadOfState     Capital         Code2
0       ABW     Aruba   North America   Caribbean       193.0   NaN     103000  78.4    828.0   793.0   Aruba   Nonmetropolitan Territory of The Netherlands    Beatrix         129.0   AW
1       AFG     Afghanistan     Asia    Southern and Central Asia       652090.0        1919.0  22720000        45.9    5976.0  NaN     Afganistan/Afqanestan   Islamic Emirate         Mohammad Omar   1.0     AF
2       AGO     Angola  Africa  Central Africa  1246700.0       1975.0  12878000        38.3    6648.0  7984.0  Angola  Republic        José Eduardo dos Santos        56.0    AO
3       AIA     Anguilla        North America   Caribbean       96.0    NaN     8000    76.1    63.2    NaN     Anguilla        Dependent Territory of the UK   Elisabeth II    62.0    AI
4       ALB     Albania         Europe  Southern Europe         28748.0         1912.0  3401200         71.6    3205.0  2500.0  Shqipëria      Republic        Rexhep Mejdani  34.0    AL

If you wanted specific columns, you could specify them using the columns attribute.

df = pd.read_sql_table("country",engine, columns=["Code","Name"])
df.head()
        Code    Name
0       ABW     Aruba
1       AFG     Afghanistan
2       AGO     Angola
3       AIA     Anguilla
4       ALB     Albania

If you want to write the SQL yourself, you can use the read_sql_query() method.

df = pd.read_sql_query("select code, name from country",engine)
df.head()
        Code    Name
0       ABW     Aruba
1       AFG     Afghanistan
2       AGO     Angola
3       AIA     Anguilla
4       ALB     Albania

Essentially, the same result.
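pandas also offers a generic read_sql() method that dispatches to read_sql_table() when given a table name (which requires an SQLAlchemy engine) and to read_sql_query() when given a query. A quick sketch against an in-memory SQLite database:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Code TEXT, Name TEXT)")
conn.execute("INSERT INTO country VALUES ('ABW', 'Aruba'), ('AFG', 'Afghanistan')")

# read_sql with a query string behaves exactly like read_sql_query.
df = pd.read_sql("SELECT Code, Name FROM country", conn)
print(df.shape)   # (2, 2)
```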

Loading Data

Instead of working on the sample tables provided as part of the MySQL installation (like country, city etc.), let’s load a bigger dataset into the system. MySQL provides one such database right on their website – and loading it is just as easy. Just search Google for “Employee database mysql” or click here. The link to download the actual data is available here. Once there, click on the Clone or Download button and then click on the Download Zip button as shown below.

Unzip the file and navigate into the directory.

In the address bar, type in cmd and that will open the command prompt at the current directory.

like so..

Once inside, type in

mysql -u ajaytech -p -t < employees.sql

where “ajaytech” is the user name you have chosen while installing MySQL – like so..

You have to use the full path of the mysql command if it is not added to your PATH, as shown above

"C:\Program Files\MySQL\MySQL Server 8.0\bin\mysql"

Once the data has been loaded, refresh your MySQL Workbench and you should be able to see the database (schema).

You should be able to drill-down to find the tables. For example, here is a snapshot of the EMPLOYEES table.

This one is a pretty decent sized database – 300K odd employees as you can see from the screenshot below.

Selecting Data

For the purpose of extracting data for Data Science & Machine Learning, we will mostly be SELECTing data. So, our focus will mainly be on the SELECT statement (as opposed to UPDATEs, DELETEs, CREATEs or other administrative aspects of the database).

SELECT Statement

This is probably the most used SQL statement in Data Science or otherwise. We have seen an example of how to select data from the country table above. In the following sections, we will focus on the employees database and how to extract data from the tables inside it.

Here is a snapshot of the SELECT statement syntax in its most generic format. As you can see, it is pretty intense.

Luckily, we only have to deal with very few of these options. Let’s start with the basics.

df = pd.read_sql_query("select emp_no, first_name from employees.employees",engine)
df.head()

        emp_no  first_name
0       10001   Georgi
1       10002   Bezalel
2       10003   Parto
3       10004   Chirstian
4       10005   Kyoichi

As you can see, this one took some time. That is because we have pulled in all the records. A quick look at the shape of the dataframe should make this clear.

df.shape
(300024, 2)

Let’s try to just pull in the first 100 records.

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees.employees LIMIT 100",engine)
df.head()
        emp_no  first_name
0       10001   Georgi
1       10002   Bezalel
2       10003   Parto
3       10004   Chirstian
4       10005   Kyoichi

This time, the select ran pretty quickly, right? That is because we limited the rows retrieved to just 100. You can quickly verify this with the dataframe’s shape.

df.shape

(100, 2)

To make the syntax easy to scan, we have capitalized the SQL keywords in the select statement.

SELECT emp_no, first_name FROM employees.employees LIMIT 100

as opposed to

select emp_no, first_name from employees.employees limit 100

That way you know what the SQL statement is doing by quickly looking for the capitalized keywords. Also, there is a quick way to beautify SQL statements in the MySQL Workbench. Just type in the statement and click on the “Beautify/Format” button.

Look at the way the SQL Workbench has formatted it. It took care of capitalization, tabbing and formatting. This form of the SQL statement is more useful in cases where the statement is pretty large.

So far, we have seen three keywords:

  • SELECT – The main keyword to select data from database tables.
  • FROM – used to specify the tables from which the data needs to be extracted.
  • LIMIT – used to limit the number of rows to extract.

Here is a quick visual

In case you are wondering why we specify the schema or database every time we do a select, that is because there are multiple schemas/databases in any database system. To use a specific database by default, without having to qualify each table name, use the following command.

USE employees

If you do this in the workbench, the corresponding database is highlighted.

In Python, when you create the connection using the ORM, specify the database.

import pandas as pd
import mysql.connector

import sqlalchemy as db

engine = db.create_engine('mysql+pymysql://ajaytech:ajaytech@localhost:3306/employees')

And from that point on, you can just specify the table (without specifying the database).

df = pd.read_sql_query("SELECT emp_no, first_name FROM employees LIMIT 100",engine)
df.head()
emp_no  first_name
0       10001   Georgi
1       10002   Bezalel
2       10003   Parto
3       10004   Chirstian
4       10005   Kyoichi

Great ! Let’s move on to the next clause – WHERE. Next to FROM, this is probably the most important clause in the SELECT statement.

df = pd.read_sql_query("SELECT emp_no, first_name, birth_date FROM employees WHERE emp_no = 10002",engine)
df.head()

emp_no  first_name      birth_date
0       10002   Bezalel         1964-06-02
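When the filter value comes from a variable, it is safer to let the driver substitute it via the params argument of read_sql_query() than to paste it into the query string. The placeholder syntax depends on the driver (pymysql uses %(name)s, sqlite3 uses ?); here is a runnable sketch with sqlite3 standing in:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
conn.execute("INSERT INTO employees VALUES (10001, 'Georgi'), (10002, 'Bezalel')")

# The driver fills in the ? placeholder, quoting the value correctly.
df = pd.read_sql_query(
    "SELECT emp_no, first_name FROM employees WHERE emp_no = ?",
    conn,
    params=(10002,),
)
print(df.iloc[0]["first_name"])   # Bezalel
```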

Let’s add some more filter conditions using the WHERE and AND clauses.

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE gender = 'M' AND \
                             emp_no <= 10003",engine)
df.head()

emp_no  first_name      gender
0       10001   Georgi  M
1       10003   Parto   M

You can use the comparison operators you are used to from other programming languages:

  • Less than (<) , Greater than (>)
  • Greater than or equal to (>=) , Less than or equal to (<=)
  • Not equal to ( != or <> )

BETWEEN clause

df = pd.read_sql_query("SELECT emp_no, first_name, gender  \
                       FROM employees \
                       WHERE emp_no BETWEEN 10001 AND 10003",engine)
df.head()

        emp_no  first_name      gender
0       10001   Georgi          M
1       10002   Bezalel         F
2       10003   Parto           M

IN Clause

What if you wanted to select specific employee numbers?

df = pd.read_sql_query("SELECT *  \
                       FROM employees \
                       WHERE emp_no IN (10001, 10003,10004)",engine)
df.head()

emp_no  birth_date      first_name      last_name       gender  hire_date
0       10001   1953-09-02      Georgi  Facello         M       1986-06-26
1       10003   1959-12-03      Parto   Bamford         M       1986-08-28
2       10004   1954-05-01      Chirstian       Koblick         M       1986-12-01

LIKE clause

df = pd.read_sql_query("SELECT *  FROM employees WHERE first_name LIKE 'G%%' AND emp_no < 10010 ",engine)
df.head()

emp_no  birth_date      first_name      last_name       gender  hire_date
0       10001   1953-09-02      Georgi  Facello         M       1986-06-26

The % in the pattern is a wildcard that matches any sequence of characters. It is doubled (%%) here because the underlying driver treats a bare % in the query string as a parameter marker, so a literal percent sign has to be escaped.

ORDER BY clause

df = pd.read_sql_query("select * from employees where emp_no < 10005 order by emp_no desc ",engine)
df.head()

emp_no  birth_date      first_name      last_name       gender  hire_date
0       10004   1954-05-01      Chirstian       Koblick         M       1986-12-01
1       10003   1959-12-03      Parto   Bamford         M       1986-08-28
2       10002   1964-06-02      Bezalel         Simmel  F       1985-11-21
3       10001   1953-09-02      Georgi  Facello         M       1986-06-26

AS clause

df = pd.read_sql_query("SELECT  emp_no AS employee_number,first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()
    employee_number     first_name
0       10004   Chirstian
1       10003   Parto
2       10002   Bezalel
3       10001   Georgi

Aggregate Functions

MIN function

df = pd.read_sql_query("SELECT  emp_no ,MIN(birth_date),first_name \
                        FROM employees WHERE emp_no < 10005 ORDER BY emp_no desc ",engine)
df.head()

        emp_no  MIN(birth_date) first_name
0       10001   1953-09-02      Georgi

The other aggregate functions supported by MySQL are

  • AVG () – Average
  • SUM () – Summation
  • COUNT () – Row count
  • MAX () – Maximum value

AVG aggregate function

df = pd.read_sql_query("SELECT  AVG(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

       AVG(salary)
0       69713.25

MAX aggregate function

df = pd.read_sql_query("SELECT  MAX(salary) FROM salaries \
                        WHERE emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()
      MAX(salary)
0       88958
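The remaining two aggregate functions, COUNT and SUM, work the same way. A self-contained sketch using sqlite3 in place of the MySQL salaries table, with salary values borrowed from the examples above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
conn.execute("INSERT INTO salaries VALUES (10001, 88958), (10002, 72527), (10003, 43311)")

# COUNT(*) tallies rows; SUM adds up a column.
row = conn.execute("SELECT COUNT(*), SUM(salary) FROM salaries").fetchone()
print(row)   # (3, 204796)
```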

Table Joins

Suppose you wanted to find out the salary of employee 10001, and print the following information

  • emp_no
  • first_name, last_name
  • salary

How would you do it? The data is available in two different tables

  • EMPLOYEES
  • SALARIES

You would have to join information from both of these tables to get the final output you need. Here is how you do it.

df = pd.read_sql_query("SELECT  first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

first_name      last_name       salary
0       Georgi  Facello         88958
1       Bezalel         Simmel  72527
2       Parto   Bamford         43311
3       Chirstian       Koblick         74057

Say you wanted to output the employee number as well – try this.

df = pd.read_sql_query("SELECT  emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

---------------------------------------------------------------------------
InternalError: (pymysql.err.InternalError) (1052, "Column 'emp_no' in field list is ambiguous")

SQL is complaining that column ’emp_no’ in the field list is ambiguous. That is because both tables (employees, salaries) have a column called emp_no – and MySQL doesn’t know which one we are referring to. Let’s make this clear for the database.

df = pd.read_sql_query("SELECT  employees.emp_no, first_name, last_name, salary \
                        FROM employees JOIN salaries \
                        WHERE employees.emp_no = salaries.emp_no AND \
                              employees.emp_no < 10005 AND \
                              to_date = '9999-01-01'",engine)
df.head()

emp_no  first_name      last_name       salary
0       10001   Georgi  Facello         88958
1       10002   Bezalel         Simmel  72527
2       10003   Parto   Bamford         43311
3       10004   Chirstian       Koblick         74057

The default JOIN used to combine tables is an INNER JOIN – and most of the time, a regular inner join is good enough.
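The joins above use the older implicit style, where the join condition sits in the WHERE clause. The same inner join is usually written with an explicit ON clause, which keeps the join condition separate from the row filters. A runnable sketch with sqlite3 standing in for MySQL, using a couple of rows from the examples above:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
conn.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER)")
conn.execute("INSERT INTO employees VALUES (10001, 'Georgi'), (10002, 'Bezalel')")
conn.execute("INSERT INTO salaries VALUES (10001, 88958), (10002, 72527)")

# The ON clause carries the join condition; WHERE keeps the filters.
df = pd.read_sql_query(
    "SELECT employees.emp_no, first_name, salary "
    "FROM employees JOIN salaries ON employees.emp_no = salaries.emp_no "
    "WHERE employees.emp_no < 10002",
    conn,
)
print(df.values.tolist())   # [[10001, 'Georgi', 88958]]
```

Both styles produce the same rows; the explicit form simply makes the intent easier to read as queries grow.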