GetStarted_yarn_object_storage
This guide explains how to use datasets stored in cloud object storage, such as OpenStack Swift, during training instead of data stored locally.
To do this, we reconfigure both Hadoop and Spark with OpenStack credentials (auth URL, username, password, region, etc.) in the core-site.xml file.
Cloud object storage differs from a traditional file system. Once the dataset is uploaded to Swift, you can access it using the format swift://<container-name>.PROVIDER/path (for example, swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb).
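If you still need to upload the dataset, one option is the python-swiftclient CLI. This is a minimal sketch, assuming the swift client is installed, your OpenStack credentials (OS_AUTH_URL, OS_USERNAME, etc.) are exported in the shell, and the container/directory names below are placeholders:
# upload a local lmdb directory into a Swift container (created if it does not exist)
swift upload MNISTlmdb mnist_train_lmdb
# confirm the uploaded objects are present
swift list MNISTlmdb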
For simplicity, the wiki is separated into sections. Section II installs Hadoop 2.7.1 and Spark 2.0.0 and adds the necessary jar files to the Hadoop classpath. Section III configures Hadoop and Spark with OpenStack credentials. Finally, in Section IV, we use the GetStarted_yarn guide to start a YARN cluster and train a CaffeOnSpark model using data stored in Swift.
Please follow Steps 1 - 4 of GetStarted_yarn to build CaffeOnSpark.
- Update to Hadoop version 2.7.1 and Spark version 2.0.0, and update the Hadoop classpath to include the hadoop-openstack-2.7.1.jar file (a quick check for the jar is shown below).
$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-hadoop.sh
export HADOOP_HOME=$(pwd)/hadoop-2.7.1
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tools/lib/*
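To sanity-check this step, you can confirm that the OpenStack connector jar is present under the tools directory the classpath points at (the jar name below assumes the Hadoop 2.7.1 download):
ls ${HADOOP_HOME}/share/hadoop/tools/lib/ | grep openstack
This should print hadoop-openstack-2.7.1.jar.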
$CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/local-setup-spark.sh
export SPARK_HOME=$(pwd)/spark-2.0.0-bin-hadoop2.7
If you cannot ssh to localhost without a passphrase, execute the following commands:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
- Copy the new configuration template and update $HADOOP_HOME/etc/hadoop/core-site.xml.
sudo cp $CAFFE_ON_SPARK/scripts/scripts_object_storage/openstack_swift/core-site.xml.template $HADOOP_HOME/etc/hadoop/
Edit the core-site.xml.template file. All properties starting with fs.swift (AUTH URL, USERNAME, etc.) mentioned in $HADOOP_HOME/etc/hadoop/core-site.xml.template must be updated. The PROVIDER name should be changed to any custom, preferred name. Please refer to the Spark documentation and the OpenStack documentation for more information. An illustrative set of values is shown after the rename step below.
Rename core-site.xml.template to core-site.xml.
sudo mv $HADOOP_HOME/etc/hadoop/core-site.xml.template $HADOOP_HOME/etc/hadoop/core-site.xml
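As a rough illustration of what the filled-in entries look like, here is a sketch for a provider named chameleoncloud using Keystone v2 authentication. The property names follow the hadoop-openstack conventions, but the values are placeholders, not working credentials; fill in whatever properties your template actually lists:
<!-- placeholder values; replace with your OpenStack credentials -->
<property>
  <name>fs.swift.service.chameleoncloud.auth.url</name>
  <value>https://YOUR-KEYSTONE-HOST:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.username</name>
  <value>YOUR_USERNAME</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.password</name>
  <value>YOUR_PASSWORD</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.tenant</name>
  <value>YOUR_TENANT</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.region</name>
  <value>YOUR_REGION</value>
</property>
<property>
  <name>fs.swift.service.chameleoncloud.public</name>
  <value>true</value>
</property>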
- Copy the core-site.xml file from Hadoop to Spark's config folder $SPARK_HOME/conf/.
sudo cp $HADOOP_HOME/etc/hadoop/core-site.xml $SPARK_HOME/conf/
After making the necessary changes for GPU or CPU training in the data/lenet_memory_solver.prototxt and data/cifar10_quick_solver.prototxt files, follow GetStarted_yarn Step 8 to initiate the training.
Make sure to change the source location to the Swift path, in the format swift://<container-name>.PROVIDER/path, in the data/lenet_memory_train_test.prototxt and data/cifar10_quick_train_test.prototxt files (for example, source: "swift://MNISTlmdb.chameleoncloud/mnist_train_lmdb").
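For reference, here is a minimal sketch of a Step 8 launch command for the MNIST example. The executor count, library paths, and output locations are assumptions for a small YARN cluster; consult GetStarted_yarn for the exact command used in your setup:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
        -devices ${DEVICES} \
        -connection ethernet \
        -model hdfs:///mnist.model \
        -output hdfs:///mnist_features_result
Because the net prototxt now points at swift://..., the executors read the training lmdb directly from object storage, while the model and extracted features are still written to HDFS in this sketch.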
Please note that the current implementation uses LMDB datasets. Since LMDB is not a distributed format, this approach should be limited to small and medium-sized datasets. Please look into Spark DataFrames for large datasets; GetStarted_EC2 covers how to convert LMDB to a DataFrame.
## Appendix
Assuming that there is an object named imagenet_label.txt in a container called testcontainer and the PROVIDER name is set to chameleoncloud, do the following to verify that Hadoop and Spark can read from Swift.
hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt
The output should look like:
-rw-rw-rw- 1 741401 2016-10-08 22:18 swift://testcontainer.chameleoncloud/imagenet_label.txt
Next, for Spark, run the following in spark-shell:
scala> val data = sc.textFile("swift://testcontainer.chameleoncloud/imagenet_label.txt")
data: org.apache.spark.rdd.RDD[String] = swift://testcontainer.chameleoncloud/imagenet_label.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> data.count()
res1: Long = 21842
If you run into errors, use HADOOP_ROOT_LOGGER=DEBUG,console to get verbose output from Hadoop commands. For example:
HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -ls swift://testcontainer.chameleoncloud/imagenet_label.txt