Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/Yarn.
The play-hadoop.yml playbook deploys the base system. Addons, such as Pig,
Spark, etc., are deployed using the playbooks in the addons directory.
Requirements:
- git
- GitHub account with uploaded SSH keys (due to use of submodules)
- Python, pip, virtualenv, libffi-dev, pkg-config
- Nodes accessible by SSH to admin-privileged account
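For reference, on an Ubuntu/Debian control machine these prerequisites can typically be installed as sketched below; the package names are assumptions for a Python 2-era release and vary between distributions:

    $ sudo apt-get update
    $ sudo apt-get install git python-pip python-virtualenv libffi-dev pkg-config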
- Clone this repository (you must have a GitHub account and uploaded your SSH public key):

      $ git clone --recursive git://github.com/futuresystems/big-data-stack.git
      $ cd big-data-stack

- Create a virtualenv:

      $ virtualenv venv && source venv/bin/activate

- Install the dependencies:

      (venv) $ pip install -r requirements.txt

- Generate the inventory file (a sketch of the result appears after this list):

      (venv) $ python mk-inventory -n bds- 10.0.0.10 10.0.0.11 >inventory.txt

- Sanity check:

      (venv) $ ansible all -m ping

  If this fails, ensure that the nodes are SSH-accessible and that the remote user in ansible.cfg is correct (alternatively, override it with the -u $REMOTE_USERNAME flag). You can pass -v to increase verbosity (repeat for more detail, e.g. -vvvv).

- Deploy:

      (venv) $ ansible-playbook play-hadoop.yml addons/spark.yml  # ... etc
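For orientation, the file written by mk-inventory is an ordinary Ansible inventory that assigns each node name to role groups. The sketch below only illustrates the general shape: the [frontends] group is referenced later in this document, but the other group name and the host-variable layout shown here are assumptions rather than the exact output.

    # illustrative shape of inventory.txt -- group names other than [frontends] are placeholders
    [frontends]
    bds-0 ansible_ssh_host=10.0.0.10

    [datanodes]
    bds-0 ansible_ssh_host=10.0.0.10
    bds-1 ansible_ssh_host=10.0.0.11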
Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }

To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
- Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with

      $ ssh-keygen -lf ~/.ssh/id_rsa

  and make sure it is in your list of keys!

- Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option, otherwise you will get errors:

      $ git clone --recursive https://github.com/futuresystems/big-data-stack.git

- Install the requirements using:

      $ pip install -r requirements.txt

- Launch a virtual cluster and obtain the SSH-able IP addresses.

- Generate the inventory and variable files using ./mk-inventory. For example:

      $ ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} >inventory.txt

  will define the inventory for a four-node cluster whose nodes are named $USER-myclusterN (with N from 0..3).

- Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu. You can alternatively override the user by passing -u $NODE_USERNAME to the ansible commands. (An illustrative excerpt appears after this list.)

- Ensure ssh_config is to your liking.

- Run ansible all -m ping to make sure all nodes can be managed.

- Run ansible-playbook play-hadoop.yml to install the base system.

- Run ansible-playbook addons/{pig,spark}.yml (and so on) to install the Pig and Spark addons.

- Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster.
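For reference, the relevant parts of ansible.cfg look roughly like the excerpt below. The values shown are illustrative assumptions, not the exact file shipped with the repository; remote_user in particular should match the admin account on your node images, and the ssh_args line is just one common way of pointing Ansible at the bundled ssh_config:

    [defaults]
    inventory = inventory.txt
    remote_user = ubuntu              # assumed default; change for non-Ubuntu images

    [ssh_connection]
    ssh_args = -F ssh_config          # assumption: reuse the repository's ssh_config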
Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections.
This will make the deployment go faster.
For example:
    $ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...
The hadoop user is present on all the nodes and is the hadoop administrator.
If you need to change anything on HDFS, it must be done as hadoop.
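For example, a minimal session on the frontend node might look like the following; the host name and HDFS paths are placeholders:

    $ ssh <frontend-node>            # an address from the [frontends] group
    $ sudo su - hadoop
    $ hdfs dfs -ls /                 # inspect HDFS as the administrator
    $ hdfs dfs -mkdir -p /user/alice # e.g. create a home directory for another user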
Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:
$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt
See the examples directory:
- nist_fingerprint: fingerprint analysis using Spark with results pushed to HBase
Please see the LICENSE file in the root directory of the repository.
- Fork the repository
- Add yourself to the CONTRIBUTORS.yml file
- Submit a pull request to the unstable branch
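A possible workflow is sketched below; it assumes you work from a feature branch on your fork, and the branch and remote names are placeholders:

    $ git clone --recursive git@github.com:YOUR-USERNAME/big-data-stack.git
    $ cd big-data-stack
    $ git checkout -b my-change origin/unstable   # start from the unstable branch
    # ... make your changes and add yourself to CONTRIBUTORS.yml ...
    $ git commit -a
    $ git push origin my-change
    # finally, open a pull request on GitHub targeting the unstable branch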