
Tweets are streamed using Kafka into a Scala project where they are analyzed for sentiment using the Stanford NLP library. Results of the sentiment analysis are classified as POSITIVE, NEGATIVE, or NEUTRAL. A message containing the sentiment is sent back to Kafka through topicA, and the results are visualized using Kibana and Elasticsearch. The keywords…
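Stanford CoreNLP scores sentence sentiment on a small integer scale, which a project like this would then bucket into the three labels above. The sketch below is a hedged illustration of that mapping only; the `SentimentLabel` object, its thresholds, and the 0–4 scale cut-offs are assumptions, not taken from this repository's source.

```scala
// Hypothetical mapping from a CoreNLP-style sentiment class (0 = very
// negative … 4 = very positive) to the three labels used in this project.
// The exact thresholds are an assumption for illustration.
object SentimentLabel {
  def classify(sentimentClass: Int): String = sentimentClass match {
    case 0 | 1 => "NEGATIVE"
    case 2     => "NEUTRAL"
    case 3 | 4 => "POSITIVE"
    case _     => "NEUTRAL" // fall back for unexpected scores
  }
}
```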


codebankss/structured-streaming


Spark Streaming with Twitter and Kafka

Requirements

  1. Java (v1.8.0_261)
  2. Scala (v2.11.8)
  3. Spark (2.4.0)
  4. sbt
  5. Kafka (v2.11-2.4.0)
  6. Elasticsearch (v7.9.3)
  7. Kibana (v7.9.3-darwin-x86_64)
  8. Logstash (v7.9.3)

Execution

How to Implement

The following instructions are for Linux/macOS.

Assumptions: all the required tools (sbt, Elasticsearch, Kafka, Kibana, and Logstash) have been downloaded, and a Twitter developer account has been created.

To run the project, download the folder and follow these steps in a terminal:

STEP 1: Change directory (cd) to the Kafka folder and start the ZooKeeper server by running: `sh bin/zookeeper-server-start.sh config/zookeeper.properties`

STEP 2: In another terminal, cd to the Kafka folder and start the Kafka server: `sh bin/kafka-server-start.sh config/server.properties`

STEP 3: In another terminal, cd to the Kafka folder and create a topic: `sh bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic topicA`

STEP 4: Now start a Kafka console consumer and check the connection logs: `sh bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topicA --from-beginning`

STEP 5: Now cd to elasticsearch/bin and start the server with `./elasticsearch`. Elasticsearch is now running at localhost:9200.

STEP 6: Now run the Kibana server by cd'ing to kibana/bin and running `./kibana`. Kibana is now available at localhost:5601. In Kibana, go to Dev Tools and run `GET /_cat/health?v&pretty`. Ideally, the cluster health should be at 100%.

STEP 7: The Logstash configuration is created as ‘logstash-simple.conf’ and located in the project (kafka) folder. To load the configuration, copy the path of the project, cd to logstash/bin, and run `./logstash -f pathToProject/logstash-simple.conf`. The logstash-simple.conf configuration is as follows:

```
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["topicA"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index3"
  }
}
```

STEP 8: Now create the fat JAR with sbt by cd'ing to the project (kafka) folder and running `sbt assembly`. The JAR is created at pathToProject/target/scala-2.11/kafka-assembly-0.1.jar. The project is now compiled.
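Building a fat JAR with `sbt assembly` implies the sbt-assembly plugin is configured in the project. As a minimal sketch of what that build definition might look like — the plugin version and the dependency list here are assumptions, not taken from this repository — the setup is roughly:

```scala
// project/plugins.sbt — assumed sbt-assembly setup (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt — illustrative only; actual names and versions may differ
name := "kafka"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark itself is supplied by spark-submit at run time, hence "provided"
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0",
  "edu.stanford.nlp" % "stanford-corenlp" % "3.9.2"
)
```

With this in place, `sbt assembly` produces the kafka-assembly-0.1.jar path used in the next step.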

STEP 9: Now run the compiled project as follows: `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 --class TwitterSentimentAnalysis pathToProject/target/scala-2.11/kafka-assembly-0.1.jar topicA accessKey secretKey tokenAccessKey tokenSecretKey`, where pathToProject is the path where the project is downloaded, and accessKey, secretKey, tokenAccessKey, and tokenSecretKey are the keys from your Twitter developer account.
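The five trailing arguments to spark-submit arrive in the application's `args` array in the order shown above. The sketch below illustrates that mapping only; the `TwitterArgs` and `TwitterAuth` names are hypothetical, not taken from the repository's source.

```scala
// Hypothetical sketch of how the five command-line arguments map to a
// Kafka topic plus Twitter credentials. Argument order mirrors the
// spark-submit invocation: topic accessKey secretKey tokenAccessKey tokenSecretKey
object TwitterArgs {
  final case class TwitterAuth(accessKey: String, secretKey: String,
                               tokenAccessKey: String, tokenSecretKey: String)

  def parse(args: Array[String]): (String, TwitterAuth) = {
    require(args.length == 5,
      "usage: <topic> <accessKey> <secretKey> <tokenAccessKey> <tokenSecretKey>")
    (args(0), TwitterAuth(args(1), args(2), args(3), args(4)))
  }
}
```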

STEP 10: Once the project is running and data is streaming, go to localhost:5601 and create an index pattern ‘index3’. Once the index is created, you can see the incoming data in the ‘Discover’ section and create various visualisations. Note: if you are not able to create an index because of a server error, run the following two commands, in this order, in a new terminal window while the project is still running.

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

Now, try creating index again.

Visualization

The analyzed data is read by the Kafka consumer and can be verified in the consumer logs.

Data from the consumer log, viewed in the Kibana ‘Discover’ section, shows tweets coming in at about 650 per minute on average for our keywords.

The following pie chart shows the tweet counts based on the classification.

The following Timelion graph shows the classification of tweets by count per minute.

Summary

By looking at the visualizations and analysis, we can say that approximately 78% of the tweets are classified as NEGATIVE, about 14% as NEUTRAL, and 8% as POSITIVE. At any given point in time, negative tweets about ‘elections’ considerably outnumber positive or neutral ones, and the counts vary at a roughly constant rate over time.
