Integrating deep learning
libraries with Apache Spark
Joseph K. Bradley
O’Reilly AI Conference NYC
June 29, 2017
About me
Software engineer at Databricks
Apache Spark committer & PMC member
Ph.D. in Machine Learning from Carnegie Mellon University
About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Deep Learning and Apache Spark
Deep Learning frameworks w/ Spark bindings
•  Caffe (CaffeOnSpark)
•  Keras (Elephas)
•  MXNet
•  Paddle
•  TensorFlow (TensorFlowOnSpark, TensorFrames)
•  CNTK (mmlspark)
Extensions to Spark for specialized hardware
•  Blaze (UCLA & Falcon Computing Solutions)
•  IBM Conductor with Spark
Native Spark
•  BigDL
•  DeepDist
•  DeepLearning4J
•  MLlib
•  SparkCL
•  SparkNet
Deep Learning and Apache Spark
2016: the year of emerging solutions for Spark + Deep Learning
No consensus
•  Many approaches for libraries
–  integrate with Spark
–  build on Spark
–  modify Spark
•  Official Spark MLlib support is limited (perceptron-like networks)
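As a point of reference, a minimal sketch of that built-in support, using MLlib's MultilayerPerceptronClassifier API (a train_df DataFrame with "features" and "label" columns is assumed):

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # MLlib's built-in "perceptron-like" network: a feedforward multilayer perceptron.
    # layers = [input size, hidden layer size, number of classes]
    mlp = MultilayerPerceptronClassifier(layers=[4, 8, 3], maxIter=100, seed=42)
    model = mlp.fit(train_df)  # assumed: DataFrame with "features" vector and "label"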
One Framework to Rule Them All?
Databricks’ perspective
•  Databricks: hosted Spark platform on public cloud
•  GPUs for compute-intensive workloads
•  Customers use many Deep Learning frameworks: TensorFlow, Keras, BigDL,
CNTK, MXNet, Theano, Caffe and more
This talk
•  Lessons learned from supporting many Deep Learning frameworks
•  Multiple ways to integrate Deep Learning & Spark
•  High-level APIs vs. low-level cluster/dev issues
Outline
Deep Learning in data pipelines
Recurring patterns in Apache Spark + Deep Learning integrations
Integrating Spark + DL: APIs
Integrating Spark + DL: clusters
DL in data pipelines
Typical pipeline: Data collection → ETL → Featurization → Training → Validation → Export, Serving
•  ETL and featurization are IO intensive: large cluster, high memory/CPU ratio
•  Training is compute intensive: small cluster, low memory/CPU ratio
Outline
Deep Learning in data pipelines
Recurring patterns in Apache Spark + DL integrations
Integrating Spark + DL: APIs
Integrating Spark + DL: clusters
Recurring patterns
Pipeline: Data collection → ETL → Featurization → Training → Validation → Export, Serving
•  DL for featurization → other ML for training
•  ETL/featurization on cluster → DL training on 1 beefy machine
•  Distributed model tuning: local DL training for each parameter set
•  Distributed DL training
Recurring patterns
•  DL for featurization → other ML for training
•  ETL/featurization on cluster → DL training on 1 beefy machine
•  Distributed model tuning: local DL training for each parameter set
•  Distributed DL training
APIs: Deep Learning Transformer & Deep Learning Estimator
Clusters: flexible cluster deployments
Outline
Deep Learning in data pipelines
Recurring patterns in Apache Spark + Deep Learning integrations
Integrating Spark + DL: APIs
Integrating Spark + DL: clusters
Recurring patterns: APIs
Deep learning is used in many parts of a data pipeline:
•  Featurization
•  Training
•  Local
•  Distributed: 1 model per machine
•  Distributed: 1 model per cluster
•  Prediction
Our answer:
•  Use familiar MLlib APIs
•  Leverage existing deep learning frameworks
New library: Deep Learning Pipelines
• Simple API for deep learning, based on MLlib Pipelines
• Scales common tasks with Transformers and Estimators
• Exposes deep learning models in Spark DataFrames & SQL
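As a taste of the API, a minimal sketch using the early release of the sparkdl package (readImages and DeepImagePredictor per the project README; the image directory path is a placeholder):

    from sparkdl import readImages, DeepImagePredictor

    # Load a directory of images into a DataFrame with an "image" column.
    images_df = readImages("/data/images")  # placeholder path

    # Apply a pre-trained ImageNet model as an MLlib Transformer, at cluster scale.
    predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels",
                                   modelName="InceptionV3",
                                   decodePredictions=True, topK=5)
    predictions_df = predictor.transform(images_df)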
Example: Image classification
Example: Identify the James Bond cars
Good application for Deep Learning
• Neural networks are very good at image analysis
• Can work with complex situations: invariance to rotations, incomplete data
Transfer Learning
•  Start from a deep network pre-trained for image classification. Its final SoftMax layer is a classifier that outputs class probabilities (e.g., GIANT PANDA 0.9, RED PANDA 0.05, RACCOON 0.01, …).
•  Drop that final classifier layer and keep the rest of the network as a featurizer: DeepImageFeaturizer.
•  Train a simple classifier on top of the deep features for the new task (see the sketch below).
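A minimal sketch of that recipe with Deep Learning Pipelines plus MLlib (API names per the early-release README; train_images_df with an "image" column and a numeric "label" column is assumed):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from sparkdl import DeepImageFeaturizer

    # Pre-trained InceptionV3, minus its final layer, as an MLlib Transformer.
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    # A simple classifier trained on the deep features.
    lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3,
                            labelCol="label")
    model = Pipeline(stages=[featurizer, lr]).fit(train_images_df)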
Deep Learning Pipelines
MLlib Pipeline APIs for
•  Featurization (Transformers)
•  Training (Estimators)
Conversion of deep learning models to SQL UDFs
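For the SQL UDF conversion, a sketch along the lines of the early release's Keras helper (the registerKerasImageUDF name follows the project's initial docs; treat the exact import path and signature as assumptions):

    from keras.applications import InceptionV3
    from sparkdl import registerKerasImageUDF

    # Expose a pre-trained Keras model to Spark SQL as a UDF.
    registerKerasImageUDF("inceptionV3_predict", InceptionV3(weights="imagenet"))

    # Any SQL user can now score images:
    spark.sql("SELECT image, inceptionV3_predict(image) AS predictions FROM image_table")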
Early release of Deep Learning Pipelines
https://github.com/databricks/spark-deep-learning
MLlib Pipeline APIs for Deep Learning
Growing set of API integrations:
•  Spark MLlib (Apache)
•  Deep Learning Pipelines (Databricks)
•  mmlspark (Microsoft)
•  Others are under active development!
General trend in Spark + Machine Learning:
•  xgboost
•  many Spark Packages (spark-packages.org)
Outline
Deep Learning in data pipelines
Recurring patterns in Apache Spark + Deep Learning integrations
Integrating Spark + DL: APIs
Integrating Spark + DL: clusters
Recurring patterns: clusters
Flexible cluster deployments are critical.
• Handle different parts of workflow
•  Featurization
•  Training 1 model
•  Training many models
• Handle different data & model dimensions
•  CPUs vs. GPUs
•  Local vs. distributed
Diving deeper: communication
Spark as a scheduler
•  Data-parallel tasks
•  Data stored outside Spark (see the sketch after this list)
Embedded Deep Learning transforms
•  Data-parallel tasks
•  Data stored in DataFrames/RDDs
Cooperative frameworks
•  Multiple passes over data
•  Heavy and/or specialized communication
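A minimal sketch of the "Spark as a scheduler" pattern, applied to distributed model tuning: each Spark task trains one model locally. train_local_model is a hypothetical stand-in for a single-machine DL training call:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dl-grid-tuning").getOrCreate()
    sc = spark.sparkContext

    def train_local_model(params):
        # Hypothetical stand-in for single-machine DL training
        # (e.g., a Keras fit() call); returns a validation score.
        return 1.0 / (1.0 + params["lr"] * params["layers"])

    # One Spark task per hyperparameter setting; training data lives outside Spark.
    param_grid = [{"lr": lr, "layers": n} for lr in (0.01, 0.1) for n in (2, 4)]
    scores = (sc.parallelize(param_grid, len(param_grid))
                .map(lambda p: (p, train_local_model(p)))
                .collect())
    best_params, best_score = max(scores, key=lambda kv: kv[1])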
Streaming data through DL
Primary storage choices:
•  Cold layer (HDFS, S3)
•  Local storage (files, Spark’s on-disk persistence)
•  Memory (Spark RDDs, DataFrames)
Find out if you are I/O constrained or processor-constrained
•  How big is your dataset? MNIST or ImageNet?
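In PySpark terms, the three choices look roughly like this (the S3 path is a placeholder; a DataFrame holds only one storage level, so pick one persist call per dataset):

    from pyspark import StorageLevel

    # Cold layer: read straight from S3/HDFS on every pass.
    df = spark.read.parquet("s3://bucket/training-data/")  # placeholder path

    # Local storage: spill to executor disks for cheap repeated passes.
    df.persist(StorageLevel.DISK_ONLY)

    # Memory: fastest for repeated passes, if the dataset fits.
    # (Alternative to DISK_ONLY, not in addition to it.)
    # df.persist(StorageLevel.MEMORY_ONLY)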
Cooperative frameworks
Use Spark for data input. Alternative communication layer.
Examples:
•  Skymind’s DeepLearning4J
•  DistML and other Parameter Server projects
[Diagram: an input RDD (partitions 1…n) is handed to a black-box communication layer outside Spark, which produces an output RDD (partitions 1…m).]
Bypass Spark communication (skeleton below):
•  Lose reproducibility/determinism of RDDs and DataFrames
•  But that’s OK: “reproducibility is worth a factor of 2” (Léon Bottou)
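A skeleton of this pattern: Spark feeds partitions of input data, while gradient exchange happens over the framework's own channel. ExternalTrainerClient is illustrative, not a real library, and data_rdd is an assumed input RDD:

    class ExternalTrainerClient:
        # Hypothetical client for an out-of-band communication layer
        # (parameter server, AllReduce ring, etc.).
        def push(self, record):
            pass              # hand one record to the external trainer
        def finish(self):
            return 0.0        # e.g., the final local training loss

    def feed_partition(rows):
        client = ExternalTrainerClient()
        for row in rows:
            client.push(row)
        yield client.finish()

    # Spark handles data input and scheduling; training traffic bypasses Spark.
    local_losses = data_rdd.mapPartitions(feed_partition).collect()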
Simplified deployment
• Pre-installed GPU drivers on machine
• Docker image with full GPU SDK
GPU integration (in Databricks)
Software stack (bottom to top):
•  GPU hardware
•  Linux kernel + NV kernel driver
•  Container (nvidia-docker, lxc, etc.):
–  NV kernel driver (userspace interface)
–  CUDA, CuBLAS, CuDNN
–  Deep learning libraries (TensorFlow, etc.), JCUDA
–  Python / JVM clients
Flexible cluster setup
•  CPUs or GPUs
•  Transfer & share data across clusters
Monitoring Spark + DL integrations
The best method depends on task granularity.
•  Around tasks
•  Inside (long-running) tasks
Accumulators in tasks (see the sketch below)
•  Throughput or failure rate within tasks
External systems
•  Logging: TensorBoard
•  General system: Grafana, Graphite, Prometheus, etc.
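A minimal, runnable sketch of the accumulator approach; model_predict is a hypothetical stand-in for per-record DL inference:

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("monitoring-sketch").getOrCreate().sparkContext
    failures = sc.accumulator(0)

    def model_predict(record):
        # Hypothetical stand-in for DL inference on one record.
        if record is None:
            raise ValueError("bad record")
        return record * 2.0

    def predict_or_count_failure(record):
        try:
            return model_predict(record)
        except Exception:
            failures.add(1)   # accumulators report back to the driver
            return None

    predictions = sc.parallelize([1.0, None, 3.0]).map(predict_or_count_failure).collect()
    print("failed records:", failures.value)   # -> 1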
Resources
Early release of Deep Learning Pipelines
https://github.com/databricks/spark-deep-learning
Recent blog posts
http://databricks.com/blog
Deep Learning Pipelines, TensorFrames, GPU acceleration, getting started with Deep Learning, Intel’s BigDL
Docs for Deep Learning on Databricks
http://docs.databricks.com
Getting started, Spark integration
Deep Learning without Deep Pockets
For users
•  Life’s getting better.
•  MLlib APIs for Deep Learning integrations (Deep Learning Pipelines)
•  Flexible cluster deployments
For developers
•  Challenges remain.
•  Communication patterns for Deep Learning
•  Maintaining GPU software stacks
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
•  Collaborative cloud environment
•  Free version (community edition)
DATABRICKS RUNTIME 3.0
•  Apache Spark - optimized for the cloud
•  Caching and optimization layer - DBIO
•  Enterprise security - DBES
Try for free today.
databricks.com
Thank you!
Twitter: @jkbatcmu → I’ll share my slides.
http://databricks.com/try
