Powered by Python - PyCon Germany 2016

Powered by Python
Summarizing hotel reviews for 100 million travelers
Steffen Wenz, CTO
steffen@trustyou.com

10,000 hotels
use TrustYou Analytics to
analyze their guest reviews.
100 million travelers
see our data on Google,
Hotels.com, Kayak …
actually it’s probably more.

Architecture ;-)
[Diagram: Hadoop Cluster (Hortonworks Distribution); labels: Big Data, Python, Machine Learning, NLP, Scraping, API, Magic, Love]

Python on Hadoop:
… possible, but not natural

Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *

Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
# parentheses allow the method chain to span several lines
years_hist = (sc.textFile("blame")
    .flatMap(lambda line: re.findall(year_re, line))
    .map(lambda year: (year, 1))
    .reduceByKey(op.add))
output = years_hist.collect()
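
To see what the Spark job computes without a cluster, here is a minimal local sketch of the same flatMap / map / reduceByKey steps in plain Python. The function name years_histogram and the sample lines (copied from the blame output above) are only for illustration; this is not how Spark executes the job.

```python
import re
from collections import Counter

# Same pattern as the Spark job: capture the year of an ISO-style date.
YEAR_RE = re.compile(r"(\d{4})-\d{2}-\d{2}")

def years_histogram(lines):
    """Count date matches per year (local stand-in for the Spark job)."""
    counts = Counter()
    for line in lines:
        # flatMap step: one line may yield zero or more years
        for year in YEAR_RE.findall(line):
            # map + reduceByKey step: add 1 to that year's running total
            counts[year] += 1
    return dict(counts)

sample = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    "badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include \"pg",
]
print(years_histogram(sample))
```

Spark distributes the same per-line work across the cluster and merges the per-key counts, which is why the job can be written as three chained transformations.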

Grammars & Parsing
Or: Why you should have paid attention in
compilers class

Grammars and Parsing
$ less Grammar/Grammar
...
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcde
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
...
Parsing: given an input string, determine which
grammar production rules generate it

>>> import nltk
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NOUN COP ADJ
... OPINION -> ADJ NOUN
... NOUN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... ADJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
...
(OPINION (ADJ great) (NOUN rooms))
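
To make the idea of "finding the production rules that generate a string" concrete, here is a small sketch of a top-down parser for the toy grammar from the slide, written without nltk. The names GRAMMAR, parse and parse_sentence are made up for this example, and the approach is a naive one chosen for readability, not the chart-parsing algorithm nltk uses.

```python
# Toy grammar from the slide as plain data: nonterminal -> list of alternatives.
GRAMMAR = {
    "OPINION": [["NOUN", "COP", "ADJ"], ["ADJ", "NOUN"]],
    "NOUN": [["hotel"], ["rooms"]],
    "COP": [["is"], ["are"]],
    "ADJ": [["great"], ["terrible"]],
}

def parse(symbol, tokens, pos=0):
    """Yield (tree, next_pos) for every way symbol can derive tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # Terminal: it must match the next token exactly.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for alternative in GRAMMAR[symbol]:
        # Expand the alternative left to right, keeping all partial matches.
        partials = [([], pos)]
        for part in alternative:
            partials = [
                (children + [subtree], end)
                for children, start in partials
                for subtree, end in parse(part, tokens, start)
            ]
        for children, end in partials:
            yield (symbol, children), end

def parse_sentence(sentence):
    """Return all complete parse trees for a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("OPINION", tokens) if end == len(tokens)]

print(parse_sentence("great rooms"))
# [('OPINION', [('ADJ', ['great']), ('NOUN', ['rooms'])])]
```

This naive recursion would loop forever on a left-recursive rule, which is one reason real parsers use chart-based or table-driven algorithms instead.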

Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040,
-0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166,
0.3312, -0.0928, -0.0967,
-0.0199, -0.2498, -0.4445,
-0.0445,
# ...

Fun with Word2Vec
>>> import gensim
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]
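
Similarity queries like the ones above rank words by the cosine similarity of their vectors. Here is a minimal sketch of that ranking in plain Python; the three-dimensional toy vectors are invented for illustration (real models use hundreds of dimensions), and the helper names are not gensim's API.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, 0.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(vectors, word, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors: the two programming words point in a similar direction.
toy = {
    "python": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.2, 0.1],
    "hiking": [0.0, 0.1, 0.9],
}
print(most_similar(toy, "python", topn=1))
```

Words that occur in similar contexts end up with vectors pointing in similar directions, so their cosine similarity is close to 1.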

ML @ TrustYou
● gensim doc2vec model to create hotel embedding
● Used - together with other features - for various classifiers

Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs

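import luigi

class MyTask(luigi.Task):
    def output(self):
        # luigi.Target is abstract; LocalTarget is the concrete file target
        return luigi.LocalTarget("/to/make/this/file")
    def requires(self):
        # INeedThisTask and AndAlsoThisTask are placeholder task classes
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]
    def run(self):
        # ... then ...
        # I do this to make it!
        with self.output().open("w") as f:
            f.write("...")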
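
The dependency resolution and "resume failed jobs" behavior can be illustrated with a toy scheduler in plain Python. This is a sketch of the general idea only, not Luigi's implementation: ToyTask, build and the task names are hypothetical, and a task counts as done simply when its output file exists.

```python
import os
import tempfile

class ToyTask:
    """Hypothetical stand-in for a batch task: done when its output file exists."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.path = os.path.join(workdir, name + ".out")
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as f:
            f.write("output of " + self.name + "\n")

def build(task, log):
    """Run dependencies first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, log)
    task.run()
    log.append(task.name)

workdir = tempfile.mkdtemp()
scrape = ToyTask("scrape", workdir)
analyze = ToyTask("analyze", workdir, requires=[scrape])
report = ToyTask("report", workdir, requires=[analyze])

first_run = []
build(report, first_run)    # runs scrape, analyze, report in dependency order
second_run = []
build(report, second_run)   # nothing to do: all outputs already exist
print(first_run, second_run)
```

Because completeness is tied to the output file, rerunning a pipeline after a failure only redoes the tasks whose outputs are still missing.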

steffen@trustyou.com or www.trustyou.com/careers
We’re hiring
web developers & data engineers!
