Big Data Pipeline
Lambda Architecture - Streaming (Real-Time) Layer
with Apache Kafka, Apache Hadoop, Apache Spark, and Apache Cassandra
on the Amazon Web Services Cloud Platform
BIG Data Pipeline stages: Ingest -> Store -> Process -> Visualize [diagram]
BIG Data Streaming (Real-Time) Layer Pipeline [architecture diagram]
Sources: AngularJS web app (clickstream data), Apache web logs, log/data files
Ingest/Stream: Apache Kafka, TCP sockets
Process: Spark Streaming and Spark SQL on a Spark cluster
Store: S3 / HDFS, Apache Cassandra
Visualize: AngularJS web app, interactive queries
Install Kafka: 3-Node Cluster on AWS
3 EC2 instances for the Kafka cluster
Repeat the following commands on all 3 EC2 instances:
cat /etc/*-release   # confirm the OS release
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
java -version        # verify Java 8 is installed
mkdir kafka
cd kafka
wget http://download.nextag.com/apache/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz
tar -zxvf kafka_2.11-0.10.0.0.tgz
cd kafka_2.11-0.10.0.0
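(Note: the webupd8team Oracle Java PPA has since been discontinued; on current Ubuntu releases, sudo apt-get install openjdk-8-jdk is a working substitute.)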
ZooKeeper ==> 172.31.48.208 (private IP) / 52.91.1.93 (public IP)
Kafka-datanode1 ==> 172.31.63.203 (private IP) / 54.173.215.211 (public IP)
Kafka-datanode2 ==> 172.31.9.25 (private IP) / 54.226.29.194 (public IP)
Modify config/server.properties on kafka-datanode1 and kafka-datanode2
Kafka-datanode1 (set the following properties in config/server.properties)
ubuntu@ip-172-31-63-203:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
# broker.id must be unique for each broker in the cluster
broker.id=1
# bind to the instance's private IP
listeners=PLAINTEXT://172.31.63.203:9092
# public address advertised to clients connecting from outside the VPC
advertised.listeners=PLAINTEXT://54.173.215.211:9092
zookeeper.connect=52.91.1.93:2181
Kafka-datanode2 (set the following properties in config/server.properties)
ubuntu@ip-172-31-9-25:~/kafka/kafka_2.11-0.10.0.0$ vi config/server.properties
broker.id=2
listeners=PLAINTEXT://172.31.9.25:9092
advertised.listeners=PLAINTEXT://54.226.29.194:9092
zookeeper.connect=52.91.1.93:2181
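To confirm that both brokers registered with ZooKeeper (a quick sanity check, not part of the original deck), the zookeeper-shell utility that ships with Kafka can be run from any node; a healthy two-broker cluster should list [1, 2]:
bin/zookeeper-shell.sh 52.91.1.93:2181 ls /brokers/ids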
Launch ZooKeeper / datanode1 / datanode2
1) Start ZooKeeper (on the ZooKeeper node)
bin/zookeeper-server-start.sh config/zookeeper.properties
2) Start the broker on Kafka-datanode1
bin/kafka-server-start.sh config/server.properties
3) Start the broker on Kafka-datanode2
bin/kafka-server-start.sh config/server.properties
4) Create the topic and start a console consumer
bin/kafka-topics.sh --zookeeper 52.91.1.93:2181 --create --topic data --partitions 1 --replication-factor 2
bin/kafka-console-consumer.sh --zookeeper 52.91.1.93:2181 --topic data --from-beginning
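For a quick end-to-end check before wiring up the Java producer, the standard Kafka console producer (not shown in the deck) can publish test messages that should appear in the consumer above:
bin/kafka-console-producer.sh --broker-list 54.173.215.211:9092,54.226.29.194:9092 --topic data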
Java - Kafka Producer Sample Application

package com.himanshu;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DataProducer {

    public static void main(String[] args) {
        // Topic to publish to; could also be read from args[0]
        String topicName = "data";

        // Producer configuration
        Properties props = new Properties();
        // Public addresses of the two Kafka brokers
        props.put("bootstrap.servers", "54.173.215.211:9092,54.226.29.194:9092");
        // Wait for the full set of in-sync replicas to acknowledge each record
        props.put("acks", "all");
        // Do not retry failed requests
        props.put("retries", 0);
        // Batch size in bytes
        props.put("batch.size", 16384);
        // Wait up to 1 ms so records can be batched together
        props.put("linger.ms", 1);
        // Total memory (bytes) the producer may use for buffering
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<String, String>(props);

        String csvFile = "/Users/himanshu/Documents/workspace/KafkaProducer/src/com/himanshu/invoice.txt";
        BufferedReader br = null;
        String lineInvoice;

        try {
            br = new BufferedReader(new FileReader(csvFile));
            // Send each CSV line of the invoice file as one Kafka message
            while ((lineInvoice = br.readLine()) != null) {
                producer.send(new ProducerRecord<String, String>(topicName, lineInvoice));
                // send() is asynchronous: this confirms hand-off to the producer
                // buffer, not delivery to the broker
                System.out.println("Message sent successfully....");
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            producer.close();
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
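To build and run the producer, the kafka-clients 0.10.0.0 jar (and its slf4j dependencies) must be on the classpath, for example via a Maven dependency on org.apache.kafka:kafka-clients:0.10.0.0.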
Sample data sent to the Kafka cluster from the Java Kafka producer (CSV file) [screenshot]
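The invoice file itself appears only as a screenshot; purely illustrative lines in the same spirit (invented for this writeup, not the actual data) might look like:
INV1001,2016-07-01,widget-a,3,9.99
INV1002,2016-07-01,widget-b,1,24.50
INV1003,2016-07-02,widget-a,2,9.99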
Messages received by the console consumer on kafka-datanode1 [screenshot]
Real-Time Streaming with Apache Kafka, Apache Spark, and Apache Cassandra
1) Launch the Kafka cluster (ZooKeeper / kafka-datanode1 / kafka-datanode2)
2) Execute the Python / Kafka Spark job (a sketch of such a job follows below)
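The deck shows the Python Spark job only as screenshots, so here is a minimal sketch of what such a job could look like; it assumes Spark 1.6/2.0-era Spark Streaming with the spark-streaming-kafka package plus the DataStax cassandra-driver, and the Cassandra host, keyspace (invoices), and table (invoice_lines) are invented placeholders, not taken from the deck:

# Hypothetical sketch (not the deck's actual job): consume the "data" topic
# from the two Kafka brokers and write each CSV line into Cassandra.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaSparkCassandra")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct stream from the two brokers; topic "data" matches the deck
stream = KafkaUtils.createDirectStream(
    ssc, ["data"],
    {"metadata.broker.list": "54.173.215.211:9092,54.226.29.194:9092"})

def save_partition(lines):
    # One Cassandra session per partition; host/keyspace/table are placeholders
    from cassandra.cluster import Cluster
    cluster = Cluster(["127.0.0.1"])  # replace with your Cassandra node
    session = cluster.connect("invoices")
    for line in lines:
        session.execute(
            "INSERT INTO invoice_lines (id, line) VALUES (uuid(), %s)",
            (line,))
    cluster.shutdown()

# Each Kafka record arrives as a (key, value) pair; keep the CSV value
stream.map(lambda kv: kv[1]) \
      .foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

ssc.start()
ssc.awaitTermination()

Submit it with spark-submit and the Kafka integration package matching your Spark version (for example --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0).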
Sample data sent to the Kafka cluster from the Java Kafka producer (CSV file) [screenshot]
Python Spark job processing data from the AWS Kafka cluster [screenshot]
Processed data stored in the AWS Cassandra cluster [screenshot]
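To inspect the results in cqlsh, a minimal schema and query matching the assumed names from the sketch above (again placeholders, not the deck's actual schema) would be:
CREATE KEYSPACE invoices WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
CREATE TABLE invoices.invoice_lines (id uuid PRIMARY KEY, line text);
SELECT * FROM invoices.invoice_lines LIMIT 10;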
Apache Spark UI showing the Python Spark Streaming application [screenshot]
Thank You
hkbhadraa@gmail.com
