LEARNING PYTHON 
FROM DATA 
Mosky 
THIS SLIDE 
• The online version is at 
https://speakerdeck.com/mosky/learning-python-from-data. 
• The examples are at 
https://github.com/moskytw/learning-python-from-data-examples. 
MOSKY

• I am working at Pinkoi.
• I've taught Python for 100+ hours.
• A speaker at
  COSCUP 2014, PyCon SG 2014, PyCon APAC 2014,
  OSDC 2014, PyCon APAC 2013, ...
• The author of the Python packages:
  MoSQL, Clime, ZIPCodeTW, ...
• http://mosky.tw/
SCHEDULE

• Warm-up
• Packages - Install the packages we need.
• CSV - Download a CSV from the Internet and handle it.
• HTML - Parse HTML source code and write a Web crawler.
• SQL - Save data into a SQLite database.
• The End
FIRST OF ALL, 
PYTHON IS AWESOME! 
2 OR 3?

• Use Python 3!
• But it actually depends on the libs you need.
• https://python3wos.appspot.com/
• We will go ahead with Python 2.7,
  but I will also introduce the changes in Python 3.
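Some of the differences between the two versions can be seen directly. A short sketch (runnable under Python 3; the Python 2 behavior is noted in the comments):

```python
# print is a function in Python 3, a statement in Python 2:
print("hello")            # Python 2: print "hello"

# / is true division in Python 3; in Python 2, 2 / 3 is integer division (0):
assert 2 / 3 != 0         # Python 3: 0.666...
assert 2 // 3 == 0        # // is floor division in both versions

# range() returns a lazy object in Python 3, a list in Python 2:
assert list(range(3)) == [0, 1, 2]

# raw_input() was renamed to input() in Python 3.
```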
THE ONLINE RESOURCES

• The Python Official Doc
  • http://docs.python.org
  • The Python Tutorial
  • The Python Standard Library
• My Past Slides
  • Programming with Python - Basic
  • Programming with Python - Adv.
THE BOOKS

• Learning Python by Mark Lutz
• Programming in Python 3 by Mark Summerfield
• Python Essential Reference by David Beazley
PREPARATION

• Did you say "hello" to Python?
• If no, visit
  http://www.slideshare.net/moskytw/programming-with-python-basic.
• If yes, open your Python shell.
WARM-UP 
The things you must know. 
MATH & VARS

2 + 3
2 - 3
2 * 3
2 / 3, -2 / 3

(1+10)*10 / 2

2.0 / 3

2 % 3

2 ** 3

x = 2
y = 3
z = x + y
print z

'#' * 10
FOR

for i in [0, 1, 2, 3, 4]:
    print i

items = [0, 1, 2, 3, 4]
for i in items:
    print i

for i in range(5):
    print i

chars = 'SAHFI'
for i, c in enumerate(chars):
    print i, c

words = ('Samsung', 'Apple', 'HP', 'Foxconn', 'IBM')
for c, w in zip(chars, words):
    print c, w
IF

for i in range(1, 10):
    if i % 2 == 0:
        print '{} is divisible by 2'.format(i)
    elif i % 3 == 0:
        print '{} is divisible by 3'.format(i)
    else:
        print '{} is not divisible by 2 nor 3'.format(i)
WHILE

while 1:
    n = int(raw_input('How big a pyramid do you want? '))
    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue
    break
TRY

while 1:

    try:
        n = int(raw_input('How big a pyramid do you want? '))
    except ValueError as e:
        print 'It must be a number: {}'.format(e)
        continue

    if n <= 0:
        print 'It must be greater than 0: {}'.format(n)
        continue

    break
LOOP ... ELSE

for n in range(2, 100):
    for i in range(2, n):
        if n % i == 0:
            break
    else:
        print '{} is a prime!'.format(n)
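The else clause of a loop runs only when the loop finishes without hitting break. The same primality check can be rewritten as a function, which makes the no-break case explicit (a sketch; the function name is ours, not from the slides):

```python
def is_prime(n):
    """Trial division, mirroring the for ... else slide."""
    for i in range(2, n):
        if n % i == 0:
            return False   # a divisor was found -- the "break" case
    return True            # no divisor found -- the "else" case

primes = [n for n in range(2, 20) if is_prime(n)]
print(primes)   # → [2, 3, 5, 7, 11, 13, 17, 19]
```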
A PYRAMID 
* 
*** 
***** 
******* 
********* 
*********** 
************* 
*************** 
***************** 
******************* 
A FATTER PYRAMID
* 
***** 
********* 
************* 
******************* 
YOUR TURN! 
LIST COMPREHENSION

[
    n
    for n in range(2, 100)
    if not any(n % i == 0 for i in range(2, n))
]
PACKAGES 
import is important. 
GET PIP - UN*X

• Debian family
  • # apt-get install python-pip
• Red Hat family
  • # yum install python-pip
• Mac OS X
  • # easy_install pip
GET PIP - WIN*

• Follow the steps in
  http://stackoverflow.com/questions/4750806/how-to-install-pip-on-windows.
• Or just use easy_install to install it.
  easy_install should be found at C:\Python27\Scripts.
• Or find the Windows installer on the Python Package Index.
3RD-PARTY PACKAGES

• requests - Python HTTP for Humans
• lxml - Pythonic XML processing library
• uniout - Print object representations in readable chars.
• clime - Convert a module into a CLI program w/o any config.
YOUR TURN! 
CSV
Let's start by making an HTTP request!
HTTP GET

import requests

#url = 'http://stats.moe.gov.tw/files/school/101/u1_new.csv'
url = ('https://raw.github.com/moskytw/'
       'learning-python-from-data-examples/master/sql/schools.csv')

print requests.get(url).content

#print requests.get(url).text
FILE

save_path = 'school_list.csv'

with open(save_path, 'w') as f:
    f.write(requests.get(url).content)

with open(save_path) as f:
    print f.read()

with open(save_path) as f:
    for line in f:
        print line,
DEF

from os.path import basename

def save(url, path=None):

    if not path:
        path = basename(url)

    with open(path, 'w') as f:
        f.write(requests.get(url).content)
CSV

import csv
from os.path import exists

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
+ UNIOUT

import csv
from os.path import exists
import uniout  # You want this!

if not exists(save_path):
    save(url, save_path)

with open(save_path) as f:
    for row in csv.reader(f):
        print row
NEXT

with open(save_path) as f:
    next(f)  # skip the unwanted lines
    next(f)
    for row in csv.reader(f):
        print row
DICT READER

with open(save_path) as f:
    next(f)
    next(f)
    for row in csv.DictReader(f):
        print row

# We now have a great output. :)
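To see what DictReader yields without downloading anything, here is a self-contained sketch with an in-memory file (the column names are made up for illustration; Python 3's io.StringIO shown):

```python
import csv
import io

# An in-memory stand-in for the downloaded CSV file.
data = io.StringIO(
    "id,name,county\n"
    "1,First School,Taipei\n"
    "2,Second School,Tainan\n"
)

# DictReader uses the first line as the keys for every row.
rows = list(csv.DictReader(data))
print(rows[0]['name'])    # → First School
print(rows[1]['county'])  # → Tainan
```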
DEF AGAIN

def parse_to_school_list(path):
    school_list = []
    with open(path) as f:
        next(f)
        next(f)
        for school in csv.DictReader(f):
            school_list.append(school)

    return school_list[:-2]
+ COMPREHENSION

def parse_to_school_list(path='schools.csv'):
    with open(path) as f:
        next(f)
        next(f)
        school_list = [school for school in csv.DictReader(f)][:-2]

    return school_list
+ PRETTY PRINT

from pprint import pprint

pprint(parse_to_school_list(save_path))

# AWESOME!
PYTHONIC

school_list = parse_to_school_list(save_path)

# hmmm ...

for school in school_list:
    print school['School Name']

# It is more Pythonic! :)

print [school['School Name'] for school in school_list]
GROUP BY

from itertools import groupby

# You MUST sort it.
keyfunc = lambda school: school['County']
school_list.sort(key=keyfunc)

for county, schools in groupby(school_list, keyfunc):
    for school in schools:
        print '%s %r' % (county, school)
    print '---'
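groupby only groups adjacent items, which is why the sort is mandatory. A tiny self-contained sketch (the sample data is made up):

```python
from itertools import groupby

schools = [
    {'name': 'A', 'county': 'Taipei'},
    {'name': 'B', 'county': 'Tainan'},
    {'name': 'C', 'county': 'Taipei'},
]

keyfunc = lambda s: s['county']
schools.sort(key=keyfunc)     # without this, 'Taipei' would appear twice

grouped = {
    county: [s['name'] for s in group]
    for county, group in groupby(schools, keyfunc)
}
print(grouped)   # → {'Tainan': ['B'], 'Taipei': ['A', 'C']}
```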
DOCSTRING

'''It contains some useful functions for parsing data
from the government.'''

def save(url, path=None):
    '''It saves data from `url` to `path`.'''
    ...

--- Shell ---

$ pydoc csv_docstring
CLIME

if __name__ == '__main__':
    import clime.now

--- Shell ---

$ python csv_clime.py
usage: basename <p>
   or: parse-to-school-list <path>
   or: save [--path] <url>

It contains some useful functions for parsing data from
the government.
DOC TIPS

help(requests)

print dir(requests)

print '\n'.join(dir(requests))
YOUR TURN! 
HTML 
Have fun with the final crawler. ;) 
LXML

import requests
from lxml import etree

content = requests.get('http://clbc.tw').content
root = etree.HTML(content)

print root
CACHE

from os.path import exists

cache_path = 'cache.html'

if exists(cache_path):
    with open(cache_path) as f:
        content = f.read()
else:
    content = requests.get('http://clbc.tw').content
    with open(cache_path, 'w') as f:
        f.write(content)
SEARCHING

head = root.find('head')
print head

head_children = head.getchildren()
print head_children

metas = head.findall('meta')
print metas

title_text = head.findtext('title')
print title_text
XPATH

titles = root.xpath('/html/head/title')
print titles[0].text

title_texts = root.xpath('/html/head/title/text()')
print title_texts[0]

as_ = root.xpath('//a')
print as_
print [a.get('href') for a in as_]
MD5

from hashlib import md5

message = ('There should be one-- and preferably '
           'only one --obvious way to do it.')

print md5(message).hexdigest()

# Actually, it has nothing to do with HTML.
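One Python 3 change worth noting here: hashlib hashes bytes, not text, so the message must be encoded first. A sketch that runs under Python 3:

```python
from hashlib import md5

message = ('There should be one-- and preferably '
           'only one --obvious way to do it.')

# Python 2 accepts a str directly; Python 3 requires bytes, so encode first.
digest = md5(message.encode('utf-8')).hexdigest()
print(digest)

assert len(digest) == 32   # an MD5 digest is 32 hex characters
```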
DEF GET

from os import makedirs
from os.path import exists, join

def get(url, cache_dir_path='cache/'):

    if not exists(cache_dir_path):
        makedirs(cache_dir_path)

    cache_path = join(cache_dir_path, md5(url).hexdigest())

    ...
DEF FIND_URLS

def find_urls(content):
    root = etree.HTML(content)
    return [
        a.attrib['href'] for a in root.xpath('//a')
        if 'href' in a.attrib
    ]
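If lxml is not installed, a stdlib-only find_urls can be sketched with the built-in HTML parser — a different technique from the XPath version above (the module is html.parser in Python 3, HTMLParser in Python 2):

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs:
                self.urls.append(attrs['href'])

def find_urls(content):
    parser = LinkParser()
    parser.feed(content)
    return parser.urls

print(find_urls('<a href="http://mosky.tw/">m</a> <a>no href</a>'))
# → ['http://mosky.tw/']
```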
BFS 1/2

NEW = 0
QUEUED = 1
VISITED = 2

def search_urls(url):

    url_queue = [url]
    url_state_map = {url: QUEUED}

    while url_queue:

        url = url_queue.pop(0)
        print url
BFS 2/2

        # continue the previous page
        try:
            found_urls = find_urls(get(url))
        except Exception as e:
            url_state_map[url] = e
            print 'Exception: %s' % e
        except KeyboardInterrupt:
            return url_state_map
        else:
            for found_url in found_urls:
                if not url_state_map.get(found_url, NEW):
                    url_queue.append(found_url)
                    url_state_map[found_url] = QUEUED
            url_state_map[url] = VISITED
DEQUE

from collections import deque
...

def search_urls(url):
    url_queue = deque([url])
    ...
    while url_queue:

        url = url_queue.popleft()
        print url
        ...
YIELD

...

def search_urls(url):
    ...
    while url_queue:

        url = url_queue.pop(0)
        yield url
        ...
        except KeyboardInterrupt:
            print url_state_map
            return
        ...
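A generator such as this search_urls yields URLs lazily: nothing runs until a value is requested. A toy generator shows the consumption pattern:

```python
def count_down(n):
    """A toy generator: yields values one at a time, like search_urls yields URLs."""
    while n > 0:
        yield n
        n -= 1

gen = count_down(3)
print(next(gen))    # → 3; the body runs only up to the first yield
print(list(gen))    # → [2, 1]; consuming the rest

# In the crawler you would iterate the same way:
#   for url in search_urls('http://clbc.tw'):
#       print(url)
```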
YOUR TURN! 
SQL 
How about saving the CSV file into a db? 
TABLE

CREATE TABLE schools (
    id TEXT PRIMARY KEY,
    name TEXT,
    county TEXT,
    address TEXT,
    phone TEXT,
    url TEXT,
    type TEXT
);

DROP TABLE schools;
CRUD

INSERT INTO schools (id, name) VALUES ('1', 'The First');
INSERT INTO schools VALUES (...);

SELECT * FROM schools WHERE id='1';
SELECT name FROM schools WHERE id='1';

UPDATE schools SET id='10' WHERE id='1';

DELETE FROM schools WHERE id='10';
COMMON PATTERN

import sqlite3

db_path = 'schools.db'
conn = sqlite3.connect(db_path)
cur = conn.cursor()

cur.execute('''CREATE TABLE schools (
    ...
)''')
conn.commit()

cur.close()
conn.close()
ROLLBACK

...

try:
    cur.execute('...')
except:
    conn.rollback()
    raise
else:
    conn.commit()

...
PARAMETERIZED QUERY

...

rows = ...

for row in rows:
    cur.execute('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', row)

conn.commit()

...
EXECUTEMANY

...

rows = ...

cur.executemany('INSERT INTO schools VALUES (?, ?, ?, ?, ?, ?, ?)', rows)

conn.commit()

...
FETCH

...
cur.execute('select * from schools')

print cur.fetchone()

# or
print cur.fetchall()

# or
for row in cur:
    print row
...
TEXT FACTORY

# SQLite only: lets you pass 8-bit strings as parameters.

...

conn = sqlite3.connect(db_path)
conn.text_factory = str

...
ROW FACTORY

# SQLite only: lets you convert tuples into dicts. It is
# `DictCursor` in some other connectors.

def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d

...
conn.row_factory = dict_factory
...
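The whole pattern, from connect to fetch, can be tried against a throwaway in-memory database. This sketch combines executemany, commit, and the dict_factory above (the table is shortened to two columns for brevity):

```python
import sqlite3

def dict_factory(cursor, row):
    # Map each column name from cursor.description to its value in the row.
    return {col[0]: row[idx] for idx, col in enumerate(cursor.description)}

# ':memory:' gives a temporary in-memory database, handy for experiments.
conn = sqlite3.connect(':memory:')
conn.row_factory = dict_factory
cur = conn.cursor()

cur.execute('CREATE TABLE schools (id TEXT PRIMARY KEY, name TEXT)')
cur.executemany('INSERT INTO schools VALUES (?, ?)',
                [('1', 'First'), ('2', 'Second')])
conn.commit()

cur.execute('SELECT * FROM schools ORDER BY id')
rows = cur.fetchall()
print(rows)   # → [{'id': '1', 'name': 'First'}, {'id': '2', 'name': 'Second'}]

cur.close()
conn.close()
```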
MORE

• Python DB API 2.0
• MySQLdb - MySQL connector for Python
• Psycopg2 - PostgreSQL adapter for Python
• SQLAlchemy - the Python SQL toolkit and ORM
• MoSQL - Build SQL from common Python data structures.
THE END

• You learned how to ...
  • make an HTTP request
  • load a CSV file
  • parse an HTML file
  • write a Web crawler
  • use SQL with SQLite
• and a lot of techniques today. ;)
