TensorFlow on Android

“freedom” Koan-Sin Tan

freedom@computer.org

COSCUP 2017, Taipei, Taiwan

Who Am I
• A software engineer working for an SoC company

• An old open source user, learned to use Unix on a
VAX-11/780 running 4.3BSD

• Learned a bit about TensorFlow and how it works on
Android

• Sent a couple of PRs while I was learning to use
TensorFlow to classify images

TensorFlow
• “An open-source software library for Machine
Intelligence”

• An open-source library for deep neural network
learning

• https://www.tensorflow.org/

https://github.com/tensorflow/tensorflow/graphs/contributors

• My first impression of TensorFlow

• Hey, that’s scary. How come you see so many
compiler warnings when building such a popular
open-source library?

• think about: WebKit, LLVM/Clang, the Linux kernel, etc.

• Oh, Google has yet another build system and it’s
written in Java

How TensorFlow Works
• TensorFlow is dataflow programming

• a program is modeled as a directed
acyclic graph

• node/vertex: operation

• edge: flow of data (a tensor in
TensorFlow)

• operations don’t execute right
away

• operations execute when data
are available on ALL inputs
In [1]: import tensorflow as tf
In [2]: node1 = tf.constant(3.0)
   ...: node2 = tf.constant(4.0)
   ...: print(node1, node2)
   ...:
(<tf.Tensor 'Const:0' shape=() dtype=float32>, <tf.Tensor 'Const_1:0' shape=() dtype=float32>)
In [3]: sess = tf.Session()
   ...: print(sess.run([node1, node2]))
   ...:
[3.0, 4.0]
In [4]: a = tf.add(3, 4)
   ...: print(sess.run(a))
   ...:
7
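
The deferred execution shown in the session above can be mimicked in a few lines of plain Python. This is only a toy sketch of the dataflow idea, not how TensorFlow is implemented; Node, constant, add, and run are hypothetical names made up for this illustration.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, op, inputs=(), name=""):
        self.op = op                # callable, invoked only when the node is evaluated
        self.inputs = list(inputs)  # incoming edges (other nodes)
        self.name = name


def constant(value, name="Const"):
    return Node(lambda: value, name=name)


def add(a, b, name="Add"):
    return Node(lambda x, y: x + y, inputs=[a, b], name=name)


def run(fetch, cache=None):
    """Evaluate `fetch` once all of its inputs have produced values."""
    cache = {} if cache is None else cache
    if fetch not in cache:
        args = [run(node, cache) for node in fetch.inputs]
        cache[fetch] = fetch.op(*args)
    return cache[fetch]


node1 = constant(3.0)
node2 = constant(4.0)
total = add(node1, node2)   # builds the graph; nothing is computed yet
print(run(total))           # 7.0
```

Building `total` only records the operation and its edges; the addition happens inside `run`, which plays the role that `sess.run` plays in the session above.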

TensorFlow on Android
• https://www.tensorflow.org/mobile/

• ongoing effort “to reduce the code footprint,
and supporting quantization and lower
precision arithmetic that reduce model size”

• Looks good

• some questions

• how to build ARMv8 binaries with the latest
NDK?

• --cxxopt="-std=c++11" --cxxopt="-Wno-c++11-narrowing" --cxxopt="-DTENSORFLOW_DISABLE_META"

• Inception models (e.g., V3) are relatively slow
on Android devices

• is there any benchmark or profiling tool?

• it turns out YES

• bazel build -c opt --linkopt="-ldl" --cxxopt="-std=c++11" --cxxopt="-Wno-c++11-narrowing" --cxxopt="-DTENSORFLOW_DISABLE_META" --crosstool_top=//external:android/crosstool --cpu=arm64-v8a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain //tensorflow/examples/android:tensorflow_demo --fat_apk_cpu=arm64-v8a

• bazel build -c opt --cxxopt="-std=c++11" --cxxopt="-DTENSORFLOW_DISABLE_META" --crosstool_top=//external:android/crosstool --cpu=arm64-v8a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain //tensorflow/examples/android:tensorflow_demo --fat_apk_cpu=arm64-v8a

• in case you want to know how to do it with an older NDK

• The TensorFlow benchmark tool
(benchmark_model) can benchmark
a compute graph and its
individual operators

• both on desktop and
Android

• however, it doesn't deal with
real input(s)

• I saw label_image [1] when reading an article on
quantization

• label_image didn't build for Android

• still image decoders (jpg, png, and gif) are
not included

• So,

• made it run

• added a quick and dirty BMP decoder

• To hack more quickly (compiling TensorFlow
on MT8173 board running Debian is slow), I
wrote a Python script to mimic what the C++
program does

[1] https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/label_image
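
To give an idea of what a "quick and dirty" BMP decoder involves, here is a minimal Python sketch. It is not the decoder that went into TensorFlow; decode_bmp24 is a hypothetical helper, and it assumes an uncompressed 24-bit file with a BITMAPINFOHEADER.

```python
import struct


def decode_bmp24(data):
    """Decode an uncompressed 24-bit BMP into (width, height, rows of RGB tuples)."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # File header: the offset of the pixel data is stored at byte 10.
    (pixel_offset,) = struct.unpack_from("<I", data, 10)
    # Info header fields starting at byte 18: width, height, planes, bpp, compression.
    width, height, _planes, bpp, compression = struct.unpack_from("<iiHHI", data, 18)
    if bpp != 24 or compression != 0:
        raise ValueError("only uncompressed 24-bit BMP is handled here")

    bottom_up = height > 0                 # positive height: rows stored bottom-up
    height = abs(height)
    row_size = (width * 3 + 3) // 4 * 4    # rows are padded to a multiple of 4 bytes

    rows = []
    for y in range(height):
        start = pixel_offset + y * row_size
        row = []
        for x in range(width):
            b, g, r = data[start + 3 * x : start + 3 * x + 3]   # pixels are stored as BGR
            row.append((r, g, b))
        rows.append(row)
    if bottom_up:
        rows.reverse()                     # return rows top-down
    return width, height, rows
```

The two details that are easy to get wrong are the BGR byte order and the bottom-up row order; both matter because the model expects RGB pixels starting from the top-left corner.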

Quantization
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
https://www.tensorflow.org/performance/quantization
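
The basic idea behind 8-bit quantization is to map a float range [min, max] linearly onto the integers 0..255. The Python sketch below shows only that mapping under simple assumptions (a known range, round to nearest); quantize and dequantize are hypothetical helpers, not TensorFlow functions, and the range -10.0 to 30.0 is just an example.

```python
def quantize(values, lo, hi):
    """Map floats in [lo, hi] linearly onto the integers 0..255."""
    scale = (hi - lo) / 255.0
    return [min(255, max(0, int(round((v - lo) / scale)))) for v in values]


def dequantize(codes, lo, hi):
    """Recover approximate floats from 8-bit codes."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]


weights = [-10.0, -2.5, 0.0, 7.3, 30.0]
codes = quantize(weights, -10.0, 30.0)
restored = dequantize(codes, -10.0, 30.0)
print(codes)
print(restored)
```

Each restored value is within half a quantization step of the original, which is where the accuracy loss of a quantized model comes from.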

label_image
flounder:/data/local/tmp $ ./label_image --graph=inception_v3_2016_08_28_frozen.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt
can't determine number of CPU cores: assuming 4
native : main.cc:250 military uniform (653): 0.834119
native : main.cc:250 mortarboard (668): 0.0196274
native : main.cc:250 academic gown (401): 0.00946237
native : main.cc:250 pickelhaube (716): 0.00757228
native : main.cc:250 bulletproof vest (466): 0.0055856
flounder:/data/local/tmp $ ./label_image --graph=quantized_graph.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt
native : main.cc:250 bulletproof vest (466): 0.00365008
native : main.cc:250 pickelhaube (716): 0.00365008

gemmlowp
• GEMM (GEneral Matrix Multiplication)

• The Basic Linear Algebra Subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations

• The Level 1 BLAS (1979) perform scalar, vector and vector-vector operations,

• the Level 2 BLAS (1988) perform matrix-vector operations, and

• the Level 3 BLAS (1990) perform matrix-matrix operations: {S,D,C,Z}GEMM and others

• Lowp: low-precision

• less than single-precision floating point numbers (< 32 bits); actually, "low-precision" in
gemmlowp means that the input and output matrix entries are integers on at most 8 bits

• Why GEMM

• Optimized

• FC, Convolution (im2col, see next page)

https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
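
Why a convolution can be handed to GEMM is easier to see in code. The Python sketch below unrolls image patches into rows (im2col) and then does one matrix multiplication. It is a toy version for a single-channel image with stride 1 and no padding; im2col and gemm are hypothetical helpers written for this illustration, not gemmlowp or TensorFlow code.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (list of rows) into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
        for y in range(h - k + 1)
        for x in range(w - k + 1)
    ]


def gemm(a, b):
    """Naive matrix multiplication: (m x p) times (p x n)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[1, 0, 0, 1], [0, 1, 1, 0]]           # two 2 x 2 filters, flattened row-major
patches = im2col(image, 2)                       # 4 patches x 4 values
weights = [list(col) for col in zip(*kernels)]   # 4 values x 2 filters
print(gemm(patches, weights))                    # [[6, 6], [8, 8], [12, 12], [14, 14]]
```

Each output row is one patch position and each column is one filter, so all filters are applied to all positions in a single GEMM call, which is the part an optimized library can speed up.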

Quantization is tricky
• Yes, we see https://www.tensorflow.org/performance/quantization

• tensorflow/tools/quantization/

• There are other utilities

• tensorflow/tools/graph_transforms/

• Inception V3 model,

• Floating point numbers

./benchmark_model --output_layer=InceptionV3/Predictions/Reshape_1 --input_layer_shape=1,299,299,3
• avg: around 840 ms for a 299x299x3 photo

• Quantized one

./benchmark_model --graph=quantized_graph.pb --output_layer=InceptionV3/Predictions/Reshape_1 --input_layer_shape=1,299,299,3
• If we tried a recent one, oops, > 1.2 seconds

Current status of TF
• Well, public status

• Google does have internal branches; I ran into one during the review of the BMP decoder

• CPU ARMv7 and ARMv8

• Qualcomm Hexagon DSP, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/hvx

• Eigen and gemmlowp

• Basic XLA works

• Not all operations are supported

• the 'name = "mobile_srcs"' in tensorflow/core/BUILD

• "//tensorflow/core/kernels:android_core_ops", "//tensorflow/core/kernels:android_extended_ops" in tensorflow/core/kernels/BUILD

• C++ and Java API (the TensorFlow site lists Python, C++, Java, and Go now)

• I am far from a Java person, so I don't know how good the API is

• "A word of caution: the APIs in languages other than Python are not yet covered by the API stability promises."

• You may find something like RSTensorFlow and tf-coriander, but AFAICT they are far from complete

Architecture, including distributed training
• The architecture figure of
TensorFlow shows the important
components, including the
distributed parts
https://www.tensorflow.org/extend/architecture

Android Neural Network API
• New API for Neural Network

• Being added to the Android
framework

• Wraps hardware accelerators
(GPU, DSP, ISP, etc.)
from Google I/O 2017 video

• New TensorFlow runtime
• Optimized for mobile and
embedded apps

• Runs TensorFlow models on
device

• Leverage Android NN API

• Soon to be open sourced
from Google I/O 2017 video

Comparing with the CoreML stack
• No GPU/GPGPU support yet.
Hopefully, Android NN will
help.

• Although Google is very good at
ML/DL and their various
applications, we don’t see
good application framework(s)
on Android yet.

Simple CoreML Exercise
• Simple app that uses InceptionV3 to classify images from
the photo roll or camera

• in Objective-C

• in Swift

• Works continuously on camera input

• in Objective-C

• in Swift

Depthwise Separable Convolution
• CNNs with depthwise separable convolution such as MobileNet [1]
changed almost everything

• Depthwise separable convolution “factorizes” a standard convolution
into a depthwise convolution and a 1 × 1 convolution called a
pointwise convolution. Thus it greatly reduces computational
complexity.

• Depthwise separable convolution is not that new [2], but pure
depthwise separable convolution-based networks such as Xception
and MobileNet demonstrated its power

[1] https://arxiv.org/abs/1704.04861

[2] L. Sifre. “Rigid-motion scattering for image classification”, PhD thesis, 2014
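
The factorization described above can be sketched in plain Python: first one K × K filter per input channel (depthwise), then a 1 × 1 convolution that mixes channels (pointwise). This is a toy illustration with stride 1 and no padding; depthwise_conv and pointwise_conv are hypothetical helpers, and the tensor layouts are assumptions made for the example.

```python
def depthwise_conv(inp, dw):
    """inp: [M][H][W] input, dw: [M][K][K], one filter per input channel."""
    k = len(dw[0])
    out_h = len(inp[0]) - k + 1
    out_w = len(inp[0][0]) - k + 1
    return [
        [
            [
                sum(inp[c][y + dy][x + dx] * dw[c][dy][dx]
                    for dy in range(k) for dx in range(k))
                for x in range(out_w)
            ]
            for y in range(out_h)
        ]
        for c in range(len(inp))
    ]


def pointwise_conv(inp, pw):
    """inp: [M][H][W], pw: [N][M]; a 1 x 1 convolution that mixes channels."""
    return [
        [
            [sum(pw[n][c] * inp[c][y][x] for c in range(len(inp)))
             for x in range(len(inp[0][0]))]
            for y in range(len(inp[0]))
        ]
        for n in range(len(pw))
    ]


inp = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]       # M = 2 channels, 2 x 2 each
dw = [[[1, 1], [1, 1]], [[1, 0], [0, 1]]]        # one 2 x 2 filter per channel
pw = [[1, 1], [1, -1], [2, 0]]                   # N = 3 output channels
print(pointwise_conv(depthwise_conv(inp, dw), pw))   # [[[23]], [[-3]], [[20]]]
```

The depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, which is why the pair is so much cheaper than one full convolution.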

[Figure: standard convolution filters, depthwise convolution filters, and 1×1 convolutional filters (pointwise convolution), which together form a depthwise separable convolution. Source: https://arxiv.org/abs/1704.04861]

MobileNet
• D_K: kernel size

• D_F: input size

• M: input channel size

• N: output channel size
https://arxiv.org/abs/1704.04861
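
With the symbols above, the MobileNet paper gives the cost of a standard convolution as D_K · D_K · M · N · D_F · D_F multiply-adds, and the cost of a depthwise separable one as D_K · D_K · M · D_F · D_F + M · N · D_F · D_F. The short Python sketch below just evaluates those two formulas; the function names and the sample layer sizes are chosen for illustration only.

```python
def standard_conv_cost(d_k, d_f, m, n):
    """Multiply-adds of a standard convolution."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k, d_f, m, n):
    """Multiply-adds of a depthwise convolution followed by a pointwise one."""
    depthwise = d_k * d_k * m * d_f * d_f
    pointwise = m * n * d_f * d_f
    return depthwise + pointwise


d_k, d_f, m, n = 3, 14, 512, 512   # example layer: 3 x 3 kernel, 14 x 14 map, 512 channels
std = standard_conv_cost(d_k, d_f, m, n)
sep = depthwise_separable_cost(d_k, d_f, m, n)
print(std, sep, sep / std)
```

The ratio simplifies to 1/N + 1/D_K², so with 3 × 3 kernels the separable version needs roughly 8 to 9 times fewer multiply-adds.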

MobileNet on Nexus 9
• “largest” MobileNet model: http://download.tensorflow.org/models/mobilenet_v1_1.0_224_frozen.tgz

• benchmark_model: ./benchmark_model --graph=frozen_graph.pb --output_layer=MobilenetV1/Predictions/Reshape_1

• around 120 ms

• Smallest one

• mobilenet_v1_0.25_128: ~25 ms

flounder:/data/local/tmp $ ./label_image --graph=mobilenet_10_224.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt --output_layer=MobilenetV1/Predictions/Reshape_1 --input_width=224 --input_height=224
native : main.cc:250 bow tie (458): 0.0575709
native : main.cc:250 ping-pong ball (723): 0.0113924
native : main.cc:250 suit (835): 0.0110482
native : main.cc:250 bearskin (440): 0.00586033
flounder:/data/local/tmp $ ./label_image --graph=mobilenet_025_128.pb --image=grace_hopper.bmp --labels=imagenet_slim_labels.txt --output_layer=MobilenetV1/Predictions/Reshape_1 --input_width=128 --input_height=128
native : main.cc:250 suit (835): 0.310988
native : main.cc:250 bow tie (458): 0.197784

Recap
• TensorFlow may not be great on Android yet

• New techniques and NN models are changing the status quo

• Android NN, XLA, MobileNet

• big.LITTLE and other system software optimization may
still be needed

MobileNet on iPhone
• Find a Caffe model, e.g., the one

• Or, use a converted one
