TensorFlow on Android
“freedom” Koan-Sin Tan

freedom@computer.org

COSCUP 2017, Taipei, Taiwan
Who Am I
• A software engineer working for a SoC company

• An old open source user, learned to use Unix on a
VAX-11/780 running 4.3BSD

• Learned a bit about TensorFlow and how it works on
Android

• Sent a couple of PRs while I was learning to use
TensorFlow to classify images
TensorFlow
• “An open-source software library for Machine
Intelligence”

• An open-source library for deep neural network
learning

• https://www.tensorflow.org/
https://github.com/tensorflow/tensorflow/graphs/contributors
• My first impression of TensorFlow

• Hey, that’s scary. How come you see so many
compiler warnings when building such a popular
open-source library?

• think about: WebKit, llvm/clang, linux kernel, etc.

• Oh, Google has yet another build system and it’s
written in Java
How TensorFlow Works
• TensorFlow is dataflow programming

• a program modeled as a directed
acyclic graph

• node/vertex: operation

• edge: flow of data (tensor in
TensorFlow)

• operations don’t execute right
away

• operations execute when data
are available to ALL inputs
In [1]: import tensorflow as tf
In [2]: node1 = tf.constant(3.0)
   ...: node2 = tf.constant(4.0)
   ...: print(node1, node2)
   ...:
(<tf.Tensor 'Const:0' shape=() dtype=float32>, <tf.Tensor 'Const_1:0' shape=() dtype=float32>)
In [3]: sess = tf.Session()
   ...: print(sess.run([node1, node2]))
   ...:
[3.0, 4.0]
In [4]: a = tf.add(3, 4)
   ...: print(sess.run(a))
   ...:
7
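The deferred-execution idea on this slide can be mimicked in a few lines of plain Python. This is only a toy sketch and not how TensorFlow is implemented: `Node`, `constant`, `add`, and `run` are hypothetical names made up for illustration. Building a node only records the operation and its inputs; nothing is computed until `run` is called, and a node is evaluated only after all of its inputs have values.

```python
class Node:
    """A graph vertex: an operation plus the nodes feeding its inputs."""

    def __init__(self, op, inputs=()):
        self.op = op                # callable that computes this node's value
        self.inputs = list(inputs)  # upstream nodes; edges carry their values


def constant(value):
    # A constant has no inputs; its operation just returns the value.
    return Node(lambda: value)


def add(a, b):
    # Building the node does NOT compute anything yet.
    return Node(lambda x, y: x + y, [a, b])


def run(node, cache=None):
    """Evaluate a node once all of its inputs have been evaluated."""
    if cache is None:
        cache = {}
    if id(node) not in cache:
        args = [run(n, cache) for n in node.inputs]  # inputs first
        cache[id(node)] = node.op(*args)
    return cache[id(node)]


node1 = constant(3.0)
node2 = constant(4.0)
total = add(node1, node2)  # only describes the computation
result = run(total)        # the work happens here, loosely like sess.run()
print(result)
```

The cache plays the role of a per-run value store, so a node shared by several consumers is computed once per `run` call.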
TensorFlow on Android
• https://www.tensorflow.org/mobile/

• ongoing effort “to reduce the code footprint,
and supporting quantization and lower
precision arithmetic that reduce model size”

• Looks good

• some questions

• how to build ARMv8 binaries, with latest
NDK?

• --cxxopt="-std=c++11" --cxxopt="-Wno-c++11-narrowing" --cxxopt="-DTENSORFLOW_DISABLE_META"

• Inception models (e.g., V3) are relatively slow
on Android devices

• is there any benchmark or profiling tool?

• it turns out YES
• bazel build -c opt --linkopt="-ldl" --cxxopt="-std=c++11" --cxxopt="-Wno-c++11-narrowing" --cxxopt="-DTENSORFLOW_DISABLE_META" --crosstool_top=//external:android/crosstool --cpu=arm64-v8a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain //tensorflow/examples/android:tensorflow_demo --fat_apk_cpu=arm64-v8a
• bazel build -c opt --cxxopt="-std=c++11" --cxxopt="-DTENSORFLOW_DISABLE_META" --crosstool_top=//external:android/crosstool --cpu=arm64-v8a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain //tensorflow/examples/android:tensorflow_demo --fat_apk_cpu=arm64-v8a
• in case you want to know how to do it with an older NDK
• The TensorFlow benchmark
can benchmark a compute
graph and its individual
operators

• both on desktop and
Android
• however, it doesn't deal with
real input(s)
• I saw label_image when reading an article on
quantization

• label_image didn't build for Android

• still image decoders (jpg, png, and gif) are
not included

• So,

• made it run

• added a quick and dirty BMP decoder

• To hack more quickly (compiling TensorFlow
on MT8173 board running Debian is slow), I
wrote a Python script to mimic what the C++
program does

[1] https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image
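To give a feel for why BMP is the easy format to add when JPEG, PNG, and GIF decoders are unavailable, here is a minimal Python sketch of a decoder. It is not the decoder that went into TensorFlow; `decode_bmp` is a hypothetical helper, and it assumes the simplest case only: an uncompressed 24-bit file with a 40-byte BITMAPINFOHEADER.

```python
import struct


def decode_bmp(data):
    """Decode an uncompressed 24-bit BMP into (width, height, rows).

    rows is a top-down list of rows; each row is a list of (R, G, B) tuples.
    """
    # 14-byte file header: magic, file size, two reserved fields, pixel offset.
    magic, _size, _r1, _r2, offset = struct.unpack_from("<2sIHHI", data, 0)
    if magic != b"BM":
        raise ValueError("not a BMP file")
    # First fields of the info header that follows the file header.
    _hdr, width, height, _planes, bpp, compression = struct.unpack_from(
        "<IiiHHI", data, 14)
    if bpp != 24 or compression != 0:
        raise ValueError("only uncompressed 24-bit BMP is handled here")

    row_size = (width * 3 + 3) // 4 * 4  # rows are padded to 4 bytes
    top_down = height < 0                 # negative height: top-down storage
    height = abs(height)

    rows = []
    for y in range(height):
        start = offset + y * row_size
        row = []
        for x in range(width):
            b, g, r = data[start + 3 * x:start + 3 * x + 3]  # stored as BGR
            row.append((r, g, b))
        rows.append(row)
    if not top_down:
        rows.reverse()                    # default BMP order is bottom-up
    return width, height, rows


# Build a tiny 2x2 test image in memory (bottom row is stored first).
bottom = bytes([0, 0, 255, 255, 255, 255]) + b"\x00\x00"  # red, white, padding
top = bytes([255, 0, 0, 0, 255, 0]) + b"\x00\x00"         # blue, green, padding
header = struct.pack("<2sIHHI", b"BM", 54 + 16, 0, 0, 54)
info = struct.pack("<IiiHHIIiiII", 40, 2, 2, 1, 24, 0, 16, 0, 0, 0, 0)
sample = header + info + bottom + top

width, height, rows = decode_bmp(sample)
print(width, height, rows)
```

The whole format fits in two fixed-size headers plus raw pixels, which is what makes a quick decoder feasible without pulling in an image library.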
Quantization
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
https://www.tensorflow.org/performance/quantization
Quantized nodes
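As a rough illustration of the kind of 8-bit quantization discussed in the pages linked above, the sketch below maps floats linearly from a [min, max] range onto the integers 0..255 and back. It is a simplified sketch, not TensorFlow's implementation; `quantize` and `dequantize` are hypothetical helper names and the weight values are made up.

```python
def quantize(values, lo, hi):
    """Map floats in [lo, hi] linearly onto integers 0..255."""
    scale = (hi - lo) / 255.0
    return [min(255, max(0, int(round((v - lo) / scale)))) for v in values]


def dequantize(codes, lo, hi):
    """Map 8-bit codes back to approximate floats."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]


weights = [-10.0, -2.5, 0.0, 7.3, 30.0]  # made-up example values
lo, hi = min(weights), max(weights)
codes = quantize(weights, lo, hi)
restored = dequantize(codes, lo, hi)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error)
```

Each value is stored in one byte instead of four, at the price of a rounding error of at most half a quantization step.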
label_image
flounder:/data/local/tmp $ ./label_image --graph=inception_v3_2016_08_28_frozen.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : main.cc:250 military uniform (653): 0.834119
native : main.cc:250 mortarboard (668): 0.0196274
native : main.cc:250 academic gown (401): 0.00946237
native : main.cc:250 pickelhaube (716): 0.00757228
native : main.cc:250 bulletproof vest (466): 0.0055856
flounder:/data/local/tmp $ ./label_image --graph=quantized_graph.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : main.cc:250 military uniform (653): 0.930771
native : main.cc:250 mortarboard (668): 0.00730017
native : main.cc:250 bulletproof vest (466): 0.00365008
native : main.cc:250 pickelhaube (716): 0.00365008
native : main.cc:250 academic gown (401): 0.00365008
gemmlowp
• GEMM (GEneral Matrix Multiplication)

• The Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks
for performing basic vector and matrix operations

• The Level 1 BLAS (1979) perform scalar, vector and vector-vector operations,

• the Level 2 BLAS (1988) perform matrix-vector operations, and 

• the Level 3 BLAS (1990) perform matrix-matrix operations: {S,D,C,Z}GEMM and others

• Lowp: low-precision

• narrower than single-precision floating point (< 32 bits); more precisely, "low-precision" in
gemmlowp means that the input and output matrix entries are integers on at most 8 bits

• Why GEMM

• Optimized

• FC, Convolution (im2col, see next page)
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
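The im2col trick mentioned above can be sketched in plain Python: every k x k patch of the image is unrolled into one row of a matrix, the filters become the columns of a second matrix, and the convolution turns into a single matrix multiplication. This is an unoptimized illustration with stride 1 and no padding; `im2col` and `gemm` here are hypothetical helpers, not the gemmlowp or Eigen routines.

```python
def im2col(image, k):
    """Unroll every k x k patch of an H x W x C image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            patch = []
            for dy in range(k):
                for dx in range(k):
                    patch.extend(image[y + dy][x + dx])  # all C channels
            rows.append(patch)
    return rows  # shape: (out_h * out_w) x (k * k * C)


def gemm(a, b):
    """Plain matrix multiply: (m x p) times (p x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# 3x3 image with 1 channel.
image = [[[1], [2], [3]],
         [[4], [5], [6]],
         [[7], [8], [9]]]
# Two 2x2 filters, flattened: one row per patch position, one column per filter.
filters = [[1, 0],   # top-left weight
           [0, 0],   # top-right weight
           [0, 0],   # bottom-left weight
           [0, 1]]   # bottom-right weight
patches = im2col(image, 2)
output = gemm(patches, filters)
print(output)
```

Filter 0 simply picks the top-left pixel of each patch and filter 1 the bottom-right one, which makes the result easy to check by eye. The patch matrix duplicates overlapping pixels, so the approach trades memory for the ability to reuse a heavily optimized GEMM.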
Quantization is tricky
• Yes, we see https://www.tensorflow.org/performance/quantization

• tensorflow/tools/quantization/

• There are other utilities

• tensorflow/tools/graph_transforms/

• Inception V3 model,

• Floating point numbers

./benchmark_model --output_layer=InceptionV3/Predictions/Reshape_1 --input_layer_shape=1,299,299,3
• avg: around 840 ms for a 299x299x3 photo 

• Quantized one

./benchmark_model --graph=quantized_graph.pb --output_layer=InceptionV3/Predictions/Reshape_1 --input_layer_shape=1,299,299,3
• If we tried a recent one, oops, > 1.2 seconds
Current status of TF
• Well, public status

• Google does have internal branches; I ran into one during the review of the BMP decoder

• CPU ARMv7 and ARMv8

• Qualcomm Hexagon DSP, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/hvx

• Eigen and gemmlowp

• Basic XLA works

• Not all operations are supported

• the ‘name = "mobile_srcs"’ in tensorflow/core/BUILD

• "//tensorflow/core/kernels:android_core_ops" and "//tensorflow/core/kernels:android_extended_ops" in tensorflow/core/kernels/BUILD

• C++ and Java API (the TensorFlow site lists Python, C++, Java, and Go now)

• I am far away from Java, don't know how good the API is

• “A word of caution: the APIs in languages other than Python are not yet covered by the API stability promises."

• You may find something like RSTensorFlow and tf-coriander, but AFAICT they are far away from complete
Arch including distributed
training
• The architecture figure of
TensorFlow shows important
components, including the
distributed parts
https://www.tensorflow.org/extend/architecture
TensorFlow Architecture
AndroidNN is coming to town
Android Neural Network API
• New API for Neural Network

• Being added to the Android
framework

• Wraps hardware accelerators
(GPU, DSP, ISP, etc.)
from Google I/O 2017 video
• New TensorFlow runtime
• Optimized for mobile and
embedded apps

• Runs TensorFlow models on
device

• Leverage Android NN API

• Soon to be open sourced
from Google I/O 2017 video
Comparing with CoreML
stack
• No GPU/GPGPU support yet.
Hopefully, Android NN will
help.

• Although Google is so good at
ML/DL and various
applications, we don’t see
good application framework(s)
on Android yet.
Simple CoreML Exercise
• A simple app that uses InceptionV3 to classify images from
the photo roll or camera

• in Objective-C 

• in Swift

• Work continuously on camera

• in Objective-C

• in Swift
Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as Mobilenet [1]
changed almost everything

• Depthwise separable convolution “factorizes” a standard convolution
into a depthwise convolution and a 1 × 1 convolution called a
pointwise convolution, which greatly reduces computational
complexity.

• Depthwise separable convolution is not that new [2], but pure
depthwise separable convolution-based networks such as Xception
and MobileNet demonstrated its power

[1] https://arxiv.org/abs/1704.04861

[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
Depthwise Separable Convolution
[Figure: standard convolution filters (D_K × D_K × M, N filters); depthwise convolution filters (D_K × D_K × 1, M filters); 1×1 convolutional filters, i.e. pointwise convolution (1 × 1 × M, N filters)]
https://arxiv.org/abs/1704.04861
MobileNet
• D_K: kernel size

• D_F: input size

• M: input channel size

• N: output channel size
https://arxiv.org/abs/1704.04861
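With the symbols defined above, the cost comparison from the MobileNet paper can be worked through numerically. A standard convolution costs D_K · D_K · M · N · D_F · D_F multiply-adds, while the depthwise separable version costs D_K · D_K · M · D_F · D_F (depthwise) plus M · N · D_F · D_F (pointwise), so the ratio is 1/N + 1/D_K². The sketch below just evaluates these formulas; the function names and the example layer sizes are assumptions chosen for illustration, and D_F is taken as the feature map width and height with stride 1.

```python
def standard_conv_cost(d_k, d_f, m, n):
    # D_K * D_K * M * N * D_F * D_F multiply-adds
    return d_k * d_k * m * n * d_f * d_f


def separable_conv_cost(d_k, d_f, m, n):
    depthwise = d_k * d_k * m * d_f * d_f  # one D_K x D_K filter per input channel
    pointwise = m * n * d_f * d_f          # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Example layer sizes, chosen only for illustration.
d_k, d_f, m, n = 3, 14, 512, 512
std = standard_conv_cost(d_k, d_f, m, n)
sep = separable_conv_cost(d_k, d_f, m, n)
ratio = sep / std  # equals 1/N + 1/D_K**2
print(std, sep, ratio)
```

For 3 × 3 kernels the ratio is dominated by the 1/9 term, so the separable form needs roughly 8 to 9 times fewer multiply-adds per layer.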
MobileNet on Nexus 9
• “largest” Mobilenet model http://download.tensorflow.org/models/mobilenet_v1_1.0_224_frozen.tgz

• benchmark_model: ./benchmark_model --graph=frozen_graph.pb --output_layer=MobilenetV1/Predictions/Reshape_1

• around 120 ms

• Smallest one

• mobilenet_v1_0.25_128: ~25 ms
flounder:/data/local/tmp $ ./label_image --graph=mobilenet_10_224.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt --output_layer=MobilenetV1/Predictions/Reshape_1 --input_width=224 --input_height=224
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : main.cc:250 military uniform (653): 0.871238
native : main.cc:250 bow tie (458): 0.0575709
native : main.cc:250 ping-pong ball (723): 0.0113924
native : main.cc:250 suit (835): 0.0110482
native : main.cc:250 bearskin (440): 0.00586033
flounder:/data/local/tmp $ ./label_image --graph=mobilenet_025_128.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt --output_layer=MobilenetV1/Predictions/Reshape_1 --input_width=128 --input_height=128
can't determine number of CPU cores: assuming 4
can't determine number of CPU cores: assuming 4
native : main.cc:250 suit (835): 0.310988
native : main.cc:250 bow tie (458): 0.197784
native : main.cc:250 military uniform (653): 0.121169
native : main.cc:250 academic gown (401): 0.0309299
native : main.cc:250 mortarboard (668): 0.0242411
Recap
• TensorFlow may not be great on Android yet

• New techniques and NN models are changing the status quo

• Android NN, XLA, MobileNet

• big.LITTLE and other system software optimizations may
still be needed
Questions?
Backup
MobileNet on iPhone
• Find a Caffe model, e.g., the one

• Or, use a converted one

label_image in Python
• I am not familiar with Protocol Buffer