Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/Yarn.
The play-hadoop.yml playbook deploys the base system. Addons, such as Pig,
Spark, etc., are deployed using the playbooks in the addons directory.
Requirements:
- git
- GitHub account with uploaded SSH keys (due to use of submodules)
- Python, pip, virtualenv, libffi-dev, pkg-config
- Nodes accessible by SSH to admin-privileged account
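For reference, on an Ubuntu/Debian control machine these prerequisites can typically be installed as sketched below; the package names are assumptions for a Python 2-era release and vary between distributions:

    $ sudo apt-get update
    $ sudo apt-get install git python-pip python-virtualenv libffi-dev pkg-config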
- Clone this repository (you must have a GitHub account and uploaded your SSH public key):

      $ git clone --recursive git://github.com/futuresystems/big-data-stack.git
      $ cd big-data-stack

- Create a virtualenv:

      $ virtualenv venv && source venv/bin/activate

- Install the dependencies:

      (venv) $ pip install -r requirements.txt

- Generate the inventory file (a sketch of the result appears after this list):

      (venv) $ python mk-inventory -n bds- 10.0.0.10 10.0.0.11 >inventory.txt

- Sanity check:

      (venv) $ ansible all -m ping

  If this fails, ensure that the nodes are SSH-accessible and that the remote user in ansible.cfg is correct (alternatively, override it with the -u $REMOTE_USERNAME flag). You can pass -v to increase verbosity (repeat for more detail, e.g. -vvvv).

- Deploy:

      (venv) $ ansible-playbook play-hadoop.yml addons/spark.yml  # ... etc
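For orientation, the file written by mk-inventory is an ordinary Ansible inventory that assigns each node name to role groups. The sketch below only illustrates the general shape: the [frontends] group is referenced later in this document, but the other group name and the host-variable layout shown here are assumptions rather than the exact output.

    # illustrative shape of inventory.txt -- group names other than [frontends] are placeholders
    [frontends]
    bds-0 ansible_ssh_host=10.0.0.10

    [datanodes]
    bds-0 ansible_ssh_host=10.0.0.10
    bds-1 ansible_ssh_host=10.0.0.11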
Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }

To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
- Make sure your public key is added to github.com. IMPORTANT: check the fingerprint with

      $ ssh-keygen -lf ~/.ssh/id_rsa

  and make sure it is in your list of keys!

- Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option, otherwise you will get errors:

      $ git clone --recursive https://github.com/futuresystems/big-data-stack.git

- Install the requirements using:

      $ pip install -r requirements.txt

- Launch a virtual cluster and obtain the SSH-able IP addresses.

- Generate the inventory and variable files using ./mk-inventory. For example:

      $ ./mk-inventory -n $USER-mycluster 192.168.10{1,2,3,4} >inventory.txt

  will define the inventory for a four-node cluster whose nodes are named $USER-myclusterN (with N from 0..3).

- Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu. You can alternatively override the user by passing -u $NODE_USERNAME to the ansible commands. (An illustrative excerpt appears after this list.)

- Ensure ssh_config is to your liking.

- Run ansible all -m ping to make sure all nodes can be managed.

- Run ansible-playbook play-hadoop.yml to install the base system.

- Run ansible-playbook addons/{pig,spark}.yml (and so on) to install the Pig and Spark addons.

- Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster.
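For reference, the relevant parts of ansible.cfg look roughly like the excerpt below. The values shown are illustrative assumptions, not the exact file shipped with the repository; remote_user in particular should match the admin account on your node images, and the ssh_args line is just one common way of pointing Ansible at the bundled ssh_config:

    [defaults]
    inventory = inventory.txt
    remote_user = ubuntu              # assumed default; change for non-Ubuntu images

    [ssh_connection]
    ssh_args = -F ssh_config          # assumption: reuse the repository's ssh_config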
Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections.
This will make the deployment go faster.
For example:
    $ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...
The hadoop user is present on all the nodes and is the hadoop administrator.
If you need to change anything on HDFS, it must be done as hadoop.
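For example, a minimal session on the frontend node might look like the following; the host name and HDFS paths are placeholders:

    $ ssh <frontend-node>            # an address from the [frontends] group
    $ sudo su - hadoop
    $ hdfs dfs -ls /                 # inspect HDFS as the administrator
    $ hdfs dfs -mkdir -p /user/alice # e.g. create a home directory for another user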
Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:
$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt
See the examples directory:
- nist_fingerprint: fingerprint analysis using Spark with results pushed to HBase
Please see the LICENSE file in the root directory of the repository.
- Fork the repository
- Add yourself to the CONTRIBUTORS.yml file
- Submit a pull request to the unstable branch
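A possible workflow is sketched below; it assumes you work from a feature branch on your fork, and the branch and remote names are placeholders:

    $ git clone --recursive git@github.com:YOUR-USERNAME/big-data-stack.git
    $ cd big-data-stack
    $ git checkout -b my-change origin/unstable   # start from the unstable branch
    # ... make your changes and add yourself to CONTRIBUTORS.yml ...
    $ git commit -a
    $ git push origin my-change
    # finally, open a pull request on GitHub targeting the unstable branch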