These notes are based on the older Mahout wiki page https://cwiki.apache.org/MAHOUT/mahout-on-amazon-ec2.html, which is unfortunately outdated.
Note: part 1 of this post explains how to perform the same installation on an Ubuntu-based machine.
The full procedure should take around 2-3 hours.. :-(
1) Start high performance instance from amazon aws console
CentOS AMI ID: ami-7ea24a17 (x86_64). AMI Name: Basic Cluster Instances HVM CentOS 5.4. Description: Minimal CentOS 5.4, 64-bit architecture, and HVM-based virtualization for use with Amazon EC2 Cluster Instances.
2) Log in to the instance (right mouse click on the running instance in the AWS console)
3) Install some required stuff
sudo yum update
sudo yum upgrade
sudo yum install python-setuptools
sudo easy_install "simplejson"
4) Install boto (unfortunately I was not able to install it using easy_install directly)
wget http://boto.googlecode.com/files/boto-1.8d.tar.gz
tar xvzf boto-1.8d.tar.gz
cd boto-1.8d
sudo easy_install .
5) Install maven2 (unfortunately I was not able to install it using yum)
wget http://www.trieuvan.com/apache/maven/binaries/apache-maven-2.2.1-bin.tar.gz
tar xvzf apache-maven-2.2.1-bin.tar.gz
sudo cp -R apache-maven-2.2.1 /usr/local/
sudo ln -s /usr/local/apache-maven-2.2.1/bin/mvn /usr/local/bin/
6) Download and install Hadoop
wget http://apache.cyberuse.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xvzf hadoop-0.20.2.tar.gz
sudo mv hadoop-0.20.2 /usr/local/
add the following to $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jre-openjdk/
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
add the following to $HADOOP_HOME/conf/core-site.xml and also to $HADOOP_HOME/conf/mapred-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Edit the file $HADOOP_HOME/conf/hdfs-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/data/tmp/</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/data/tmp2/</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/data/tmp3/</value>
</property>
</configuration>
Note: the directory /home/data does not exist, and you will have to create it when starting the instance using the commands:

# mkdir -p /home/data
# mount -t ext3 /dev/sdb /home/data/

The reason for this setup is that the root partition has only 10GB, while /dev/sdb has 800GB.
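Since this mount has to be redone every time the instance starts, the two commands above can be wrapped in a small idempotent helper. This is just a sketch: the function name is made up, and the device name /dev/sdb and mount point are the ones from this guide — confirm them on your own instance.

```shell
# Sketch of an idempotent helper for the mkdir/mount step above.
# Safe to run on every boot: it only mounts if the device exists
# and the directory is not already a mount point.
mount_data_dir() {
  data_dir="$1"
  device="$2"
  mkdir -p "$data_dir"
  if [ -b "$device" ] && ! grep -q " $data_dir " /proc/mounts; then
    mount -t ext3 "$device" "$data_dir"
  fi
}

# Usage, matching the commands above (run as root):
# mount_data_dir /home/data /dev/sdb
```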
Set up authorized keys for localhost login w/o passwords (the name node itself is formatted later, in the run step):

# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
7) Download and build Mahout

# svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
# cd mahout
# mvn clean install
# cd ..
# sudo mv mahout /usr/local/mahout-0.4
8) Add the following to your .profile
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_HOME=/usr/local/hadoop-0.20.2
export HADOOP_CONF_DIR=/usr/local/hadoop-0.20.2/conf
export MAHOUT_HOME=/usr/local/mahout-0.4/
export MAHOUT_VERSION=0.4-SNAPSHOT
export MAVEN_OPTS=-Xmx1024m
Verify that the paths in .profile point to the exact versions you downloaded.
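One quick way to verify is a shell check that each exported directory actually exists. A sketch — the function name is made up, and the variable list assumes the exports shown above:

```shell
# Sanity-check the install paths exported in .profile. Prints OK or
# MISSING for each directory and returns non-zero if any is absent.
check_install_paths() {
  status=0
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "OK: $dir"
    else
      echo "MISSING: $dir"
      status=1
    fi
  done
  return $status
}

# Usage, after sourcing .profile:
# check_install_paths "$JAVA_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR" "$MAHOUT_HOME"
```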
9) Run Hadoop, just to prove you can, and test Mahout by building the Reuters dataset on it. Finally, delete the files and shut it down.
# $HADOOP_HOME/bin/hadoop namenode -format
# $HADOOP_HOME/bin/start-all.sh
# jps
// you should see all 5 Hadoop processes (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
# cd $MAHOUT_HOME
# ./examples/bin/build-reuters.sh
# $HADOOP_HOME/bin/stop-all.sh
# rm -rf /tmp/*
// delete the Hadoop files
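Eyeballing jps output is easy to get wrong, so here is a small helper that reports each expected daemon explicitly. A sketch only — the function name is invented; the daemon list is the five processes named above for Hadoop 0.20:

```shell
# Report up/DOWN for each expected Hadoop 0.20 daemon, given the
# output of `jps` as its single argument.
check_daemons() {
  jps_output="$1"
  for d in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    if echo "$jps_output" | grep -qw "$d"; then
      echo "$d: up"
    else
      echo "$d: DOWN"
    fi
  done
}

# Usage:
# check_daemons "$(jps)"
```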
Edit $HADOOP_HOME/conf/mapred-site.xml to include the following:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>
10) Allow Hadoop to run even if you will work on a different EC2 machine:
echo "NoHostAuthenticationForLocalhost yes" >>~/.ssh/config
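Since this setup may be re-run when rebuilding the image, it is safer to append the line only if it is not already there. A hedged sketch — append_once is a made-up helper name, not part of the original guide:

```shell
# Append a line to a file only if an identical line is not already
# present, so re-running the setup does not duplicate it.
append_once() {
  file="$1"
  line="$2"
  touch "$file"
  grep -qxF "$line" "$file" || echo "$line" >> "$file"
}

# Usage, equivalent to the echo above but safe to repeat:
# append_once ~/.ssh/config "NoHostAuthenticationForLocalhost yes"
```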
11) Now bundle the image.
Using the Amazon AWS console, select the running instance, right mouse click and choose bundle EBS image. Enter an image name and description. The machine will then reboot and the image will be created.