Technologies for
Data Analytics Platform
YAPC::Asia Tokyo 2015 - Aug 22, 2015
Who are you?
• Masahiro Nakagawa
• github: @repeatedly
• Treasure Data Inc.
• Fluentd / td-agent developer
• https://jobs.lever.co/treasure-data
• I love OSS :)
• D Language, MessagePack, The organizer of several meetups, etc…
Why do we analyze data?
Reporting
Monitoring
Exploratory data analysis
Confirmatory data analysis
etc…
Need data, data, data!
That means we need a data analysis platform for our own requirements.
Data Analytics Flow
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring
Let's launch a platform!
• Easy to use and maintain
• Single server
• An RDBMS is popular and has a huge ecosystem
[Diagram: data sources → ETL (Extract + Transform + Load) → RDBMS ← queries]
Oops! An RDBMS is not good for data analytics against large data volumes.
We need more speed and scalability!
Let's consider a Parallel RDBMS instead!
Parallel RDBMS
• Optimized for OLAP workload
• Columnar storage, Shared nothing, etc…
• Netezza, Teradata, Vertica, Greenplum, etc…
[Architecture: a query goes to the Leader Node, which distributes work across the Compute Nodes]
time                 code  method
2015-12-01 10:02:36  200   GET
2015-12-01 10:22:09  404   GET
2015-12-01 10:36:45  200   GET
2015-12-01 10:49:21  200   POST
…                    …     …
• A good data format for analytics workloads
• Read only selected columns, efficient compression
• Not good for insert / update

Columnar Storage
[Figure: the same table stored row-wise (one unit per record) vs column-wise (one unit per column)]
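To make the row-vs-columnar trade-off concrete, here is a minimal Python sketch (the rows are the sample access-log records above; names are illustrative): with a columnar layout, a query that only looks at "code" never has to read "time" or "method".

# Minimal sketch: the same access-log rows stored row-wise vs column-wise.
rows = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]

# Row storage: each unit is one full record.
def count_errors_row(rows):
    # must read every field of every row just to inspect "code"
    return sum(1 for time, code, method in rows if code >= 400)

# Columnar storage: each unit is one column, so a query touching only
# "code" never reads "time" or "method" (and each column compresses well).
columns = {
    "time":   [r[0] for r in rows],
    "code":   [r[1] for r in rows],
    "method": [r[2] for r in rows],
}

def count_errors_columnar(columns):
    return sum(1 for code in columns["code"] if code >= 400)

print(count_errors_row(rows), count_errors_columnar(columns))  # -> 1 1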
Okay, queries are now processed normally.
No silver bullet
• Performance depends on data modeling and queries
• distkey and sortkey are important
• should reduce data transfer and IO cost
• queries should take advantage of these keys
• There are some problems
• Cluster scaling, metadata management, etc…
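A rough, hedged sketch of why distkey matters (node count and tables are made-up assumptions): when both tables are hash-distributed on the join key, matching rows land on the same node and the join needs no cross-node data transfer.

# Sketch: hash-distributing two tables by their join key (the "distkey").
# With the same distkey, matching rows of both tables land on the same
# node, so a join can run locally; distributing by another column would
# force a shuffle between nodes.
NUM_NODES = 3  # arbitrary

def node_for(key):
    return hash(key) % NUM_NODES

users  = [{"user_id": i, "name": "user%d" % i} for i in range(6)]
events = [{"user_id": i % 6, "action": "click"} for i in range(12)]

placement = {n: {"users": [], "events": []} for n in range(NUM_NODES)}
for u in users:
    placement[node_for(u["user_id"])]["users"].append(u)
for e in events:
    placement[node_for(e["user_id"])]["events"].append(e)

# Every event can be joined to its user locally on each node.
for n, data in placement.items():
    local_ids = {u["user_id"] for u in data["users"]}
    assert all(e["user_id"] in local_ids for e in data["events"])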
Performance is good :)
But we often want to change the schema for new workloads.
Now it is hard to maintain the schema and its data…
Okay, let's separate data sources into multiple layers for a reliable platform.
Schema on Write (RDBMS)
• Write data according to a schema to improve query performance
• Pros:
• minimum query overhead
• Cons:
• Need to design the schema and workload beforehand
• Data load is an expensive operation
Schema on Read (Hadoop)
• Write data without a schema and map the schema at query time
• Pros:
• Robust against schema and workload changes
• Data load is a cheap operation
• Cons:
• High overhead at query time
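A minimal Python sketch of the two approaches (field names are just examples, not a real schema): schema on write validates and casts records at load time, while schema on read keeps raw JSON and maps fields when the query runs.

import json

raw_logs = ['{"time": "2015-12-01 10:02:36", "code": "200", "method": "GET"}',
            '{"time": "2015-12-01 10:22:09", "code": "404"}']  # schema drifts

# Schema on Write: parse, validate and cast BEFORE loading; queries are
# cheap, but every record must fit the schema designed up front.
def load_with_schema(line):
    rec = json.loads(line)
    return {"time": rec["time"], "code": int(rec["code"]), "method": rec["method"]}

# Schema on Read: store the raw line as-is; map fields when the query runs,
# so missing or new fields are tolerated at the cost of per-query overhead.
def query_codes(raw):
    for line in raw:
        rec = json.loads(line)            # parsing happens at query time
        yield int(rec.get("code", 0))

print(list(query_codes(raw_logs)))  # [200, 404]
# load_with_schema(raw_logs[1]) would fail: "method" is missing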
Data Lake
• Schema management is hard
• Volume keeps increasing and the format changes often
• There are lots of log types
• A feasible approach is storing raw data and converting it before analysis
• A Data Lake is a single storage for any logs
• Note that there is no clear definition yet
Data Lake Patterns
• Use a DFS, e.g. HDFS, for log storage
• ETL or data processing with the Hadoop ecosystem
• Logs can be converted by ingestion tools beforehand
• Use a Data Lake storage and related tools
• These storages support the Hadoop ecosystem
Apache Hadoop
• Distributed computing framework
• First implementation based on Google MapReduce
[Figures: HDFS architecture and the MapReduce data flow]
http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/
http://nosqlessentials.com/
Cool!
Data load becomes robust!
ELT: Extract and Load the raw data first, then Transform it later (raw data → transformed data)
Apache Tez
• Low-level framework for YARN applications
• Hive, Pig, new query engines, and more
• Task- and DAG-based processing flow
[Figure: a Task consists of Input, Processor and Output; tasks are connected into a DAG]
MapReduce vs Tez
[Figure: MapReduce runs the query below as several MR stages, writing intermediate results to HDFS between stages; Tez runs it as a single DAG of tasks (GROUP BY a.x and GROUP BY b.x feed a JOIN, followed by ORDER BY) with no intermediate HDFS writes]

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
  ON (g1.x = g2.x)
ORDER BY avg;

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/9
Superstition
• "HDFS and YARN have a SPOF"
• Recent versions have no SPOF in either MapReduce 1 or MapReduce 2
• "Can't build it from scratch"
• Really? Treasure Data builds Hadoop on CircleCI. Cloudera, Hortonworks and MapR do too.
• They also check the dependent toolchain.
Which Hadoop package should we use?
• A distribution from a Hadoop vendor is better
• CDH by Cloudera
• HDP by Hortonworks
• MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem, the Apache community edition becomes an option.
• For example, Treasure Data has patches and wants to use the patched version.
Good :)
In addition, we want to collect data in an efficient way!
Ingestion tools
• There are two execution models!
• Bulk load:
• For high throughput
• Most tools transfer data in batches, in parallel
• Streaming load:
• For low latency
• Most tools transfer data in micro-batches
Bulk load tools
• Embulk
• Pluggable bulk data loader for various inputs and outputs
• Write plugins in Java and JRuby
• Sqoop
• Data transfer between Hadoop and RDBMS
• Included in some distributions
• Or a dedicated bulk loader for each data store
Streaming load tools
• Fluentd
• Pluggable, JSON-based streaming collector
• Lots of plugins on RubyGems
• Flume
• Mainly for the Hadoop ecosystem: HDFS, HBase, …
• Included in some distributions
• Or Logstash, Heka, Splunk, etc…
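For example, an application can stream events into Fluentd from Python; this is a hedged sketch using the fluent-logger package, where the tag, host and port are assumptions for a local td-agent.

# Sketch using the fluent-logger package (pip install fluent-logger).
# Tag prefix, host and port below are assumptions for a local td-agent.
from fluent import sender, event

sender.setup('app', host='localhost', port=24224)

# Each call sends one JSON event to Fluentd, which buffers and routes it
# to its outputs (HDFS, S3, Treasure Data, ...) according to its config.
event.Event('access', {'code': 200, 'method': 'GET', 'path': '/index.html'})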
Data ingestion also

becomes robust and efficient!
It works! But…
we want to issue ad-hoc queries against the entire data set.
We can't wait for the data to be loaded into a database.
You can use an MPP query engine on top of your data stores.
MPP query engine
• It doesn't have its own storage, unlike a parallel RDBMS
• Follows the "Schema on Read" approach
• data distribution depends on the backend
• data schema also depends on the backend
• Some products are called "SQL on Hadoop"
• Presto, Impala, Apache Drill, etc…
• It has its own execution engine; it doesn't use MapReduce.
• Distributed query engine for interactive queries against various data sources and large data sets
• Pluggable connectors for joining multiple backends
• You can join MySQL and HDFS data in one query
• Lots of useful functions for data analytics
• window functions, approximate queries, machine learning, etc…
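As an illustration, an interactive query can be sent to Presto from Python; a hedged sketch using the PyHive client follows, where the coordinator host and the catalog/schema/table names are assumptions.

# Sketch using PyHive's Presto client (pip install pyhive).
# Host, port and the catalog/schema/table names are assumptions.
from pyhive import presto

cursor = presto.connect(host='presto-coordinator.example.com', port=8080).cursor()

# One query can combine connectors, e.g. a Hive table with a MySQL table.
cursor.execute("""
    SELECT h.user_id, count(*) AS cnt, max(m.plan) AS plan
    FROM hive.web.access_log h
    JOIN mysql.crm.users m ON h.user_id = m.id
    GROUP BY h.user_id
""")
for row in cursor.fetchall():
    print(row)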
Batch analysis platform + Visualization platform:
HDFS and Hive handle daily/hourly batch jobs; results go into PostgreSQL, etc., which serves interactive queries from commercial BI tools and dashboards.
The same two-platform setup (HDFS + Hive for batch, PostgreSQL, etc. for interactive queries, commercial BI tools, dashboards) has drawbacks:
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can't query against "live" data directly
Adding Presto: HDFS and Hive still handle the daily/hourly batch jobs, while Presto serves interactive queries for the dashboard directly against HDFS instead of going through PostgreSQL, etc.
SQL on any data sets: Presto serves interactive queries over HDFS/Hive, Cassandra, MySQL and commercial DBs, while Hive handles the daily/hourly batch jobs; dashboards and commercial BI tools (✓ IBM Cognos ✓ Tableau ✓ ...) sit on top.
This is the data analysis platform.
[Presto architecture: a Client sends queries to the Coordinator, which uses the Discovery Service and distributes work to Workers; Connector Plugins give Workers access to Storage / Metadata]
Execution Model
MapReduce: map and reduce tasks write intermediate data to disk and wait between stages.
Presto: all stages are pipelined
✓ No wait time
✓ No fault-tolerance
memory-to-memory data transfer
✓ No disk IO
✓ Data chunks must fit in memory
Okay, we now have a combination of low-latency and batch processing.
That resolves our concern! But…
we also need quick estimates.
Currently, there are several stream processing frameworks.
Let's try them!!
Apache Storm
• Distributed realtime processing framework
• Low latency: one tuple at a time
• Trident mode uses micro-batches
https://storm.apache.org/
Norikra
• Schema-less CEP engine for stream processing
• Uses SQL-like queries (Esper EPL)
• Not distributed, unlike Storm, for now
[Figure: event streams go in, Norikra emits calculated results]
Great! We can get insights in both streaming and batch ways :)
One more thing: we can make data transfer more reliable for multiple data streams with a distributed queue.
Apache Kafka
• Distributed messaging system
• Producer - Broker - Consumer pattern
• Pull model, replication, etc…
[Figure: applications push messages to the brokers; consumers pull them]
Push vs Pull
• Push:
• Easy to transfer data to multiple destinations
• Hard to control the stream ratio across multiple streams
• Pull:
• Easy to control the stream ratio
• Must manage consumers correctly
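A hedged sketch of the push/pull pattern with the kafka-python client (broker address and topic name are assumptions): the producer pushes to the broker, and the consumer pulls at its own pace.

# Sketch using kafka-python (pip install kafka-python).
# Broker address and topic are assumptions.
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push events to the broker as they arrive.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('access_log', b'{"code": 200, "method": "GET"}')
producer.flush()

# Consumer side: pull from the broker at its own pace, so a slow consumer
# (e.g. a batch loader) does not back-pressure the producers.
consumer = KafkaConsumer('access_log',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=1000)
for message in consumer:
    print(message.value)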
This is a modern analytics platform
Seems complex and hard to
maintain?
Let’s use useful services!
Amazon Redshift
• Parallel RDBMS on AWS
• Re-use traditional parallel RDBMS know-how
• Scaling is easier than with traditional systems
• Combining it with Amazon EMR is popular:
1. Store data into S3
2. EMR processes the S3 data
3. Load the processed data into Redshift
• EMR provides the Hadoop ecosystem
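A hedged Python sketch of steps 1 and 3 (bucket, cluster endpoint, credentials and table names are assumptions; the EMR job in step 2 is omitted), using boto3 for S3 and psycopg2 for the Redshift COPY:

# Sketch of the S3 -> (EMR) -> Redshift flow. Bucket, paths, credentials
# and table names are assumptions; the EMR processing step is omitted.
import boto3
import psycopg2

# 1. Store raw data in S3 (EMR jobs then read/write s3://my-bucket/processed/).
boto3.client('s3').upload_file('access.log', 'my-bucket', 'raw/access.log')

# 3. Load the processed output into Redshift with COPY.
conn = psycopg2.connect(host='my-cluster.redshift.amazonaws.com',
                        port=5439, dbname='analytics',
                        user='admin', password='secret')
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY access_log FROM 's3://my-bucket/processed/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS JSON 'auto'
    """)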
Using AWS Services
Google BigQuery
• Distributed query engine and scalable storage
• Tree model, columnar storage, etc…
• Separates storage from workers
• High-performance queries thanks to Google's infrastructure
• Lots of workers
• Storage / IO layer on Colossus
• Can't manage parallel RDBMS properties like distkey, but it works well in most cases.
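As an example, a hedged sketch with the google-cloud-bigquery client (project, dataset and table names are assumptions); the query runs on BigQuery's workers without any distkey-style tuning:

# Sketch using the google-cloud-bigquery client
# (pip install google-cloud-bigquery); project/dataset/table are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project='my-project')
query = """
    SELECT method, COUNT(*) AS cnt
    FROM `my-project.logs.access_log`
    GROUP BY method
    ORDER BY cnt DESC
"""
for row in client.query(query):   # executed by BigQuery's workers
    print(row['method'], row['cnt'])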
BigQuery architecture
Using GCP Services
Treasure Data
• Cloud-based end-to-end data analytics service
• Hive, Presto, Pig and Hivemall on one big repository
• Lots of ingestion and output options, scheduling, etc…
• No stream processing for now
• The service concept is a Data Lake
• JSON-based schema-less storage
• The execution model is similar to BigQuery
• Separates storage from workers
• Can't specify parallel RDBMS properties
Using Treasure Data Service
Resource Model Trade-off

Fully guaranteed
  Pros: stable execution, easy to control resources
  Cons: no boost mechanism
Guaranteed with multi-tenancy
  Pros: stable execution, good scalability
  Cons: less controllable resources
Fully multi-tenanted
  Pros: boosted performance, great scalability
  Cons: unstable execution
MS Azure also has useful services:
DataHub, SQL DWH, DataLake,
Stream Analytics, HDInsight…
Use a service or build a platform?
• Consider using a service first
• AWS, GCP, MS Azure, Treasure Data, etc…
• The important factor is the data analytics, not the platform
• Do you have enough resources to maintain it?
• If a specific analytics platform is a differentiator, building your own is better
• Use state-of-the-art technologies
• Hard to implement on existing platforms
Conclusion
• Many software products and services for data analytics
• Lots of trade-offs: performance, complexity, connectivity, execution model, etc.
• SQL is the primary language for data analytics
• Focus on your goal!
• Is the data analytics platform your business core? If not, consider using services first.
A cloud service for the entire data pipeline!
Appendix
Apache Spark
• Another distributed computing framework
• Mainly for in-memory computing with a DAG
• Clean RDD- and DataFrame-based APIs
• Combining it with Hadoop is popular
http://slidedeck.io/jmarin/scala-talk
Apache Flink
• Streaming-based execution engine
• Supports both batch and pipelined processing
• Hadoop and Spark are batch based
https://ci.apache.org/projects/flink/flink-docs-master/
Batch vs Pipelined
Batch (staged): tasks are grouped into stages (stage1 → stage2 → stage3), writing to disk and waiting between stages.
Pipelined: all stages are pipelined
✓ No wait time
✓ Fault tolerance with checkpointing
memory-to-memory data transfer
✓ uses disk if needed
Visualization
• Tableau
• Popular BI tool in many areas
• Awesome GUI, easy to use, lots of charts, etc.
• Metric Insights
• Dashboard for many metrics
• Scheduled queries, custom handlers, etc.
• Chartio
• Cloud-based BI tool
How to manage job dependencies?
We want to issue Job X after Job A and Job B have finished.
Data pipeline tools
• There are some important features
• Manage job dependencies
• Handle job failures and retries
• Easy to define the topology
• Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
Luigi
• Python module for building job pipelines
• Write Python code and run it.
• A task is defined as a Python class
• Easy to manage with a VCS
• Needs some extra tools
• scheduled jobs, job history, etc…

import luigi

class T1(luigi.Task):
    def requires(self):
        return []                           # dependencies (upstream tasks)
    def output(self):
        return luigi.LocalTarget('t1.out')  # store result (example target)
    def run(self):
        with self.output().open('w') as f:  # task body
            f.write('done')
Airflow
• Python- and DAG-based workflow engine
• Write Python code, but it is for defining a DAG
• A task is defined by an Operator
• There are good features
• Management web UI
• Task information is stored in a database
• Celery-based distributed execution

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator  # example operator; any Operator works
dag = DAG('example', start_date=datetime(2015, 8, 22))
t1 = DummyOperator(task_id='t1', dag=dag)
t2 = DummyOperator(task_id='t2', dag=dag)
t2.set_upstream(t1)
