Apache Kafka + Apache Spark = ♡
Let's check how Kafka integrates with Spark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
2
Apache Spark
3
The distributed data processing ecosystem
4
SQL, Structured Streaming, Streaming, GraphX, MLlib
Python, Scala, Java, R, SQL
Kubernetes, Hadoop YARN, Mesos
AWS (EMR), GCP (Dataproc), Azure (HDInsight), Databricks
Maintainers
5
Apache Spark Structured Streaming
6
Streaming query execution - micro-batch
7
load state for t1 query → load offsets to process & write them for t1 query → process data → confirm processed offsets & next watermark → commit state for t2
checkpoint location: state store (partition-based), offset log, commit log
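To make the loop above concrete, here is a minimal sketch (not from the deck) of how the micro-batch cadence is driven from user code: the trigger on the write query sets the batch interval and checkpointLocation hosts the offset log, commit log and state store. The Dataset df, the interval and the path are hypothetical placeholders.
import org.apache.spark.sql.streaming.Trigger

// df is a hypothetical streaming Dataset, e.g. loaded with sparkSession.readStream
val microBatchQuery = df.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))   // start a new micro-batch at most every 10 seconds
  .option("checkpointLocation", "/tmp/checkpoint") // offset log, commit log, state store
  .format("console")
  .start()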
Streaming query execution - continuous (experimental)
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log)
Streaming query execution - continuous (experimental)
tasks (long-running, one per partition): process data (task 1, task 2, task 3) and report processed offsets to the epoch coordinator
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log)
Streaming query execution - continuous (experimental)
tasks (long-running, one per partition): process data (task 1, task 2, task 3) and report processed offsets to the epoch coordinator
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log) if all tasks processed their offsets within the epoch
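For comparison, a minimal sketch (assuming the query only applies map-like operations, a requirement of continuous mode) of switching the same write query to the experimental continuous execution; only the trigger changes, and the epoch interval below is a placeholder.
import org.apache.spark.sql.streaming.Trigger

// same hypothetical df as before; offsets are now committed per epoch by the epoch coordinator
val continuousQuery = df.writeStream
  .trigger(Trigger.Continuous("1 second"))                    // epoch checkpoint interval
  .option("checkpointLocation", "/tmp/checkpoint-continuous")
  .format("console")
  .start()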
Popular data transformations
11
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
Popular data transformations
12
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]
def mapGroups[U : Encoder](f: (K, Iterator[V]) => U): Dataset[U]
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def reduce(func: (T, T) => T): T
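A short sketch of how a few of these transformations chain together, assuming a hypothetical Dataset[String] called letters (shown on a static Dataset for simplicity):
import sparkSession.implicits._

// letters: Dataset[String] - hypothetical input
val lengthsPerFirstLetter = letters
  .filter(letter => letter.nonEmpty)                     // filter
  .map(letter => (letter.head.toString, letter.length))  // map
  .groupByKey(pair => pair._1)                           // groupByKey
  .mapGroups((firstLetter, pairs) => (firstLetter, pairs.map(_._2).sum)) // mapGroups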
Structured Streaming pipeline example
13
// data source
val loadQuery = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("client.id", "simple_kafka_spark_app")
  .option("subscribePattern", "ss_starting_offset.*")
  .option("startingOffsets", "earliest")
  .load()
// data processing logic
val processingLogic = loadQuery.selectExpr("CAST(value AS STRING)").as[String]
  .filter(letter => letter.nonEmpty)
  .map(letter => letter.size)
  .select($"value".as("letter_length"))
  .agg(Map("letter_length" -> "sum"))
// data sink
val writeQuery = processingLogic.writeStream.outputMode("update")
  .option("checkpointLocation", "/tmp/kafka-sample")
  .format("console")
writeQuery.start().awaitTermination()
Apache Kafka data source
14
Kafka data source configuration
15
⇢ Where?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
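A hedged sketch of the three mutually exclusive subscription styles; the broker address, topic names and partition list are placeholders:
// exactly one of subscribe / subscribePattern / assign per source
sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribe", "topic_a,topic_b")                      // fixed list of topics
  // .option("subscribePattern", "topic_.*")                   // topics matching a regex
  // .option("assign", """{"topic_a":[0,1],"topic_b":[2]}""")  // explicit topic/partitions
  .load()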
Kafka data source configuration
16
⇢ Where?
⇢ What?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
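As in the speaker notes, offsets can also be given per topic/partition as JSON (-2 = earliest, -1 = latest); a sketch with placeholder topics, shown as a batch read because endingOffsets only applies to batch queries:
val batchFrame = sparkSession.read.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")  // batch only
  .load()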
Kafka data source configuration
17
⇢ Where?
⇢ What?
⇢ How?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
fail on data loss (failOnDataLoss, streaming), max reading rate control (maxOffsetsPerTrigger), Spark partitions number (minPartitions)
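A sketch of the corresponding options on the source (the values are arbitrary placeholders; minPartitions needs a recent Spark version):
sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribePattern", "ss_starting_offset.*")
  .option("failOnDataLoss", "false")       // streaming: warn instead of failing the query on data loss
  .option("maxOffsetsPerTrigger", "1000")  // max reading rate per micro-batch
  .option("minPartitions", "10")           // desired minimum number of Spark partitions
  .load()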
Kafka input schema
18
key [binary], value [binary], topic [string], partition [int], offset [long], timestamp [long], timestampType [int]
Kafka input schema
19
key [binary], value [binary], topic [string], partition [int], offset [long], timestamp [long], timestampType [int]
val query = dataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .groupByKey(row => row.getAs[String]("key"))
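A sketch showing that the metadata columns from the schema above can be kept alongside the decoded payload:
val enrichedFrame = dataFrame.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset", "timestamp", "timestampType"
)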
From the fetch to the reading - micro-batch
20
● initialize offsets to process - asks the Apache Kafka broker for the next offsets to process and the max offsets in partition (no maxOffsetsPerTrigger); data loss checks, skewness optimization
● distribute offsets to executors - data locality
● create data consumer if needed
● poll data - as long as the read offset < max offset for topic/partition; data loss checks
● checkpoint processed offsets - if no fatal failure
● repeat if new data available
Data loss protection - conditions
21
deleted partitions
Data loss protection - conditions
22
deleted partitions, expired records (metadata consumer)
Data loss protection - conditions
23
deleted partitions, expired records (metadata consumer), new partitions with missing offsets
Data loss protection - conditions
24
deleted partitions, expired records (metadata consumer), new partitions with missing offsets, expired records (data consumer)
Apache Kafka data sink
25
Delivery semantics
26
at-least once
At-least once - why?
27
protected def checkForErrors(): Unit = {
  if (failedWrite != null) {
    throw failedWrite
  }
}
KafkaRowWriter
At-least once - why?
28
private val callback = new Callback() {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (failedWrite == null && e != null) {
      failedWrite = e
    }
  }
}
KafkaRowWriter
At-least once - why?
29
def write(row: InternalRow): Unit = {
  checkForErrors()
  sendRow(row, producer)
}
KafkaStreamDataWriter
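In practice this means a retried task can resend records that already reached the broker. A minimal sketch (not from the deck; eventsToWrite, the topic and the paths are hypothetical) of a Kafka sink query where such duplicates may appear after a failure and a restart from the checkpoint:
// eventsToWrite is a hypothetical streaming DataFrame with a "value" column
val kafkaSinkQuery = eventsToWrite
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("topic", "output_letters")                          // single output topic
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()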
Output generation
30
1 or multiple topics
1 or multiple outputs - how?
31
private def createProjection = {
  val topicExpression = topic.map(Literal(_)).orElse {
    inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME)
  }.getOrElse {
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
  // ...
KafkaRowWriter
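Concretely, when the "topic" option is absent, each row routes itself through its own topic attribute; a hedged sketch reusing the hypothetical eventsToWrite frame from the previous sink example (topic names are placeholders):
// no "topic" option: the per-row "topic" column decides where each record goes
val multiTopicQuery = eventsToWrite
  .selectExpr(
    "CAST(value AS STRING) AS value",
    "CASE WHEN length(value) > 5 THEN 'long_letters' ELSE 'short_letters' END AS topic")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("checkpointLocation", "/tmp/kafka-sink-multi")
  .start()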
Summary
32
● micro-batch oriented
● low latency (continuous mode) is a work-in-progress effort
● fault-tolerance with the checkpoint mechanism
● batch and streaming supported
● an alternative to other streaming approaches
Resources
● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
● Structured streaming support for consuming from Kafka:
https://issues.apache.org/jira/browse/SPARK-15406
● Github data generator: https://github.com/bartosz25/data-generator
● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo
● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming
33
Thank you!
@waitingforcode / waitingforcode.com
34

Editor's Notes

  • #8 Ask if everybody is aware of the watermark. Explain the idea of the state store + where it can be stored. Explain the checkpoint location + where it can be stored (HDFS-compatible fs).
  • #9 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #10 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #11 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #12 limit is useless since it will stop returning data as soon as it's reached
  • #13 limit is useless since it will stop returning data as soon as it's reached
  • #14 Is the code used in the transformation distributed only once, for the first query, or is it compiled & distributed for every query?
  • #16 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #17 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #18 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #19 HEADERS and 3.0!
  • #20 TODO: extract_json; no schema registry, even though there was a blog post by Xebia about integrating it
  • #21 Explain that it can be different → V1 vs V2 data source. Say that it doesn't happen for the next query because data is stored in memory, unless the check on data loss. Poll data = seek + poll; poll data => explain data loss checks. Consumer on the executor lifecycle ⇒ Is it closed after the batch read? In fact, it depends whether there are new topic/partitions. If it's not the case, it's reused; if yes, a new one is created. An exception ⇒ continuous streaming mode always recreates a new consumer! EXPLAIN the difference between the micro-batch and continuous readers.
  • #27 explain why not transactions (see comment from wfc)
  • #28 say that KafkaRowWriter is shared by V1 and V2 data sinks