Powered by Python
Summarizing hotel reviews for 100 million travelers
Steffen Wenz, CTO
steffen@trustyou.com
10,000 hotels use TrustYou Analytics to analyze their guest reviews.
100 million travelers see our data on Google, Hotels.com, Kayak … actually it’s probably more.
Architecture ;-)
Hadoop Cluster
(Hortonworks Distribution)
Big Data Python
Machine Learning
NLP
Scraping API
MagicLove
Hadoop:
… slow & massive
Python on Hadoop:
… possible, but not natural
Let’s try Spark!
$ # how old is the C code in CPython?
$ git clone https://github.com/python/cpython && cd cpython
$ find . -name "*.c" -exec git blame {} \; > blame
$ head blame
dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 3)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 5) #include "to
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 6) #include "no
daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 7)
badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 8) /* Forward *
Let’s try Spark!
import operator as op, re
# sc: SparkContext, connection to cluster
year_re = r"(\d{4})-\d{2}-\d{2}"
years_hist = sc.textFile("blame") \
    .flatMap(lambda line: re.findall(year_re, line)) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(op.add)
output = years_hist.collect()
What happened here?
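One way to read the pipeline: each step is an ordinary list transformation that Spark runs partition by partition across the cluster, and nothing executes until collect() is called. The sketch below replays the same three steps locally in plain Python, without Spark, on three of the blame lines shown earlier. It is only an illustration of the data flow under that assumption, not how Spark executes the job.

```python
import operator as op
import re

year_re = r"(\d{4})-\d{2}-\d{2}"

# Three lines taken from the `head blame` output above.
lines = [
    "dc5dbf61 (Guido van Rossum 1991-02-19 12:39:46 +0000 1)",
    "daadddf7 (Guido van Rossum 1990-10-14 12:07:46 +0000 2) /* List a no",
    'badc12f6 (Guido van Rossum 1990-12-20 15:06:42 +0000 4) #include "pg',
]

# flatMap: every line yields zero or more years, flattened into one list.
years = [year for line in lines for year in re.findall(year_re, line)]

# map: turn each year into a (key, value) pair.
pairs = [(year, 1) for year in years]

# reduceByKey: combine all values that share a key with op.add.
counts = {}
for year, count in pairs:
    counts[year] = op.add(counts[year], count) if year in counts else count

# collect: bring the result back as a list of (year, count) tuples.
output = sorted(counts.items())
print(output)
```

On the full blame file the result has the same shape: one (year, number of lines) pair per year.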
Grammars & Parsing
Or: Why you should have paid attention in
compilers class
Grammars and Parsing
$ less Grammar/Grammar
...
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcde
async_stmt: ASYNC (funcdef | with_stmt | for_stmt)
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
while_stmt: 'while' test ':' suite ['else' ':' suite]
for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]
...
Parsing: given an input string, work out which
grammar production rules generate it
>>> grammar = nltk.CFG.fromstring("""
... OPINION -> NOUN COP ADJ
... OPINION -> ADJ NOUN
... NOUN -> 'hotel' | 'rooms'
... COP -> 'is' | 'are'
... ADJ -> 'great' | 'terrible'
... """)
>>> parser = nltk.ChartParser(grammar)
>>> sent = nltk.word_tokenize("great rooms")
>>> for tree in parser.parse(sent):
...     print(tree)
(OPINION (ADJ great) (NOUN rooms))
Grammars and Parsing
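To make "work out which production rules generate the input" concrete, here is a minimal sketch of a top-down backtracking parser for the same toy grammar, in plain Python. It is not how nltk.ChartParser works internally; the GRAMMAR dict and the function names parse, expand and parse_sentence are made up for this illustration. It would loop forever on a left-recursive grammar, which this one is not.

```python
# The toy grammar from above: each nonterminal maps to its alternative productions.
GRAMMAR = {
    "OPINION": [["NOUN", "COP", "ADJ"], ["ADJ", "NOUN"]],
    "NOUN": [["hotel"], ["rooms"]],
    "COP": [["is"], ["are"]],
    "ADJ": [["great"], ["terrible"]],
}


def parse(symbol, tokens, start=0):
    """Yield (tree, next_position) for every way symbol matches tokens[start:]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the token at the current position.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(symbol, production, tokens, start)


def expand(symbol, production, tokens, start):
    """Match the parts of one production left to right, keeping every alternative."""
    partial = [([], start)]
    for part in production:
        partial = [
            (children + [subtree], end)
            for children, pos in partial
            for subtree, end in parse(part, tokens, pos)
        ]
    for children, end in partial:
        yield (symbol, *children), end


def parse_sentence(sentence):
    """Return all OPINION trees that consume the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("OPINION", tokens) if end == len(tokens)]


print(parse_sentence("great rooms"))
print(parse_sentence("hotel is great"))
print(parse_sentence("rooms great"))  # not in the language: no trees
```

Trees come back as nested tuples, so the first call returns the same structure that nltk printed above.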
Word2Vec
● Map words to vectors
● “Step up” from bag-of-words model
● ‘Cats’ and ‘dogs’ should be similar - because they occur in similar contexts
>>> m["python"]
array([-0.1351, -0.1040,
-0.0823, -0.0287, 0.3709,
-0.0200, -0.0325, 0.0166,
0.3312, -0.0928, -0.0967,
-0.0199, -0.2498, -0.4445,
-0.0445,
# ...
Fun with Word2Vec
>>> # trained from 100k meetup descriptions!
>>> m = gensim.models.Word2Vec.load("data/word2vec")
>>> m.most_similar(positive=["python"])[:3]
[(u'javascript', 0.83), (u'php', 0.82), (u'django', 0.81)]
>>> m.doesnt_match(["python", "c++", "javascript"])
'c++'
>>> m.most_similar(positive=["ladies"])[:3]
[(u'girls', 0.81), (u'mamas', 0.74), (u'gals', 0.73)]
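The scores returned by most_similar are cosine similarities between word vectors, and doesnt_match picks the word that is least similar to the average of the group. The sketch below re-implements both ideas in plain Python. The three-dimensional vectors are invented for illustration and have nothing to do with the trained model, and the helper functions only approximate what gensim does.

```python
import math

# Hypothetical toy vectors, made up for this example (real models use 100+ dimensions).
vectors = {
    "python": [0.9, 0.1, 0.0],
    "javascript": [0.8, 0.2, 0.1],
    "c++": [0.1, 0.9, 0.2],
}


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def most_similar(word):
    """Rank all other words by cosine similarity to word, best first."""
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)


def doesnt_match(words):
    """Return the word least similar to the mean of the unit-length vectors."""
    units = {w: [x / norm(vectors[w]) for x in vectors[w]] for w in words}
    mean = [sum(column) / len(words) for column in zip(*units.values())]
    return min(words, key=lambda w: cosine(units[w], mean))


print(most_similar("python"))
print(doesnt_match(["python", "c++", "javascript"]))
```

With these made-up vectors "javascript" ranks first for "python" and "c++" is the odd one out, but only because the toy data was chosen that way.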
ML @ TrustYou
● gensim doc2vec model to create hotel embedding
● Used - together with other features - for various classifiers
Luigi
● Build complex pipelines of batch jobs
○ Dependency resolution
○ Parallelism
○ Resume failed jobs
import luigi

class MyTask(luigi.Task):
    def output(self):
        # LocalTarget is the concrete file target; luigi.Target itself is abstract
        return luigi.LocalTarget("/to/make/this/file")
    def requires(self):
        return [
            INeedThisTask(),
            AndAlsoThisTask("with_some arg")
        ]
    def run(self):
        # ... then ...
        # I do this to make it!
        pass
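Dependency resolution and resuming both follow from one rule: a task counts as done when its output exists. The sketch below imitates that rule in plain Python without Luigi. The Task base class, the Extract and Summarize tasks, and the build function are hypothetical stand-ins written for this example, not Luigi's API, and parallelism is left out.

```python
import os
import tempfile


class Task:
    """Hypothetical stand-in for a pipeline task: output path, requirements, run."""

    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def requires(self):
        return []

    def complete(self):
        # A task is done as soon as its output file exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class Extract(Task):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("great rooms\n")


class Summarize(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def run(self):
        source = self.requires()[0].output()
        with open(source) as src, open(self.output(), "w") as dst:
            dst.write(src.read().upper())


def build(task, log):
    """Run missing requirements first, then the task; skip anything already complete."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(Summarize(workdir), first_run)   # runs Extract, then Summarize
build(Summarize(workdir), second_run)  # outputs exist, so nothing runs again
print(first_run, second_run)
```

Deleting only Summarize.txt and calling build again would rerun Summarize alone, which is the "resume failed jobs" behaviour in miniature.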
steffen@trustyou.com or www.trustyou.com/careers
We’re hiring
web developers & data engineers!

Powered by Python - PyCon Germany 2016
