Apache Kafka + Apache Spark = ♡
Let's check how Kafka integrates with Spark
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode
#github.com/bartosz25 /data-generator /spark-scala-playground ...
2
Apache Spark
3
The distributed data processing ecosystem
4
SQL, Structured Streaming, Streaming, GraphX, MLlib
Python, Scala, Java, R, SQL
Kubernetes, Hadoop YARN, Mesos
AWS (EMR), GCP (Dataproc), Azure (HDInsight), Databricks
Maintainers
5
Apache Spark Structured Streaming
6
Streaming query execution - micro-batch
7
load state for t1 query → load offsets to process & write them for t1 query → process data → confirm processed offsets & next watermark → commit state for t2
checkpoint location: state store (partition-based), offset log, commit log
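To make the loop above concrete, here is a minimal sketch (not from the deck) of how the micro-batch cadence is driven from user code: the trigger on the write query sets the batch interval and checkpointLocation hosts the offset log, commit log and state store. The Dataset df, the interval and the path are hypothetical placeholders.
import org.apache.spark.sql.streaming.Trigger

// df is a hypothetical streaming Dataset, e.g. loaded with sparkSession.readStream
val microBatchQuery = df.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))   // start a new micro-batch at most every 10 seconds
  .option("checkpointLocation", "/tmp/checkpoint") // offset log, commit log, state store
  .format("console")
  .start()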
Streaming query execution - continuous (experimental)
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log)
Streaming query execution - continuous (experimental)
tasks (long-running, one per partition): process data (task 1, task 2, task 3) and report processed offsets to the epoch coordinator
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log)
Streaming query execution - continuous (experimental)
tasks (long-running, one per partition): process data (task 1, task 2, task 3) and report processed offsets to the epoch coordinator
epoch coordinator: orders offsets logging and persists offsets to the checkpoint location (offset log, commit log) if all tasks processed their offsets within the epoch
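For comparison, a minimal sketch (assuming the query only applies map-like operations, a requirement of continuous mode) of switching the same write query to the experimental continuous execution; only the trigger changes, and the epoch interval below is a placeholder.
import org.apache.spark.sql.streaming.Trigger

// same hypothetical df as before; offsets are now committed per epoch by the epoch coordinator
val continuousQuery = df.writeStream
  .trigger(Trigger.Continuous("1 second"))                    // epoch checkpoint interval
  .option("checkpointLocation", "/tmp/checkpoint-continuous")
  .format("console")
  .start()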
Popular data transformations
11
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
Popular data transformations
12
def select(cols: Column*): DataFrame
def as(alias: String): Dataset[T]
def map[U : Encoder](func: T => U): Dataset[U]
def filter(condition: Column): Dataset[T]
def groupByKey[K: Encoder](func: T => K): KeyValueGroupedDataset[K, T]
def limit(n: Int): Dataset[T]
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U]
def mapGroups[U : Encoder](f: (K, Iterator[V]) => U): Dataset[U]
def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def reduce(func: (T, T) => T): T
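A short sketch of how a few of these transformations chain together, assuming a hypothetical Dataset[String] called letters (shown on a static Dataset for simplicity):
import sparkSession.implicits._

// letters: Dataset[String] - hypothetical input
val lengthsPerFirstLetter = letters
  .filter(letter => letter.nonEmpty)                     // filter
  .map(letter => (letter.head.toString, letter.length))  // map
  .groupByKey(pair => pair._1)                           // groupByKey
  .mapGroups((firstLetter, pairs) => (firstLetter, pairs.map(_._2).sum)) // mapGroups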
Structured Streaming pipeline example
13
// data source
val loadQuery = sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("client.id", "simple_kafka_spark_app")
  .option("subscribePattern", "ss_starting_offset.*")
  .option("startingOffsets", "earliest")
  .load()
// data processing logic
val processingLogic = loadQuery.selectExpr("CAST(value AS STRING)").as[String]
  .filter(letter => letter.nonEmpty)
  .map(letter => letter.size)
  .select($"value".as("letter_length"))
  .agg(Map("letter_length" -> "sum"))
// data sink
val writeQuery = processingLogic.writeStream.outputMode("update")
  .option("checkpointLocation", "/tmp/kafka-sample")
  .format("console")
writeQuery.start().awaitTermination()
Apache Kafka data source
14
Kafka data source configuration
15
⇢ Where?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
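A hedged sketch of the three mutually exclusive subscription styles; the broker address, topic names and partition list are placeholders:
// exactly one of subscribe / subscribePattern / assign per source
sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribe", "topic_a,topic_b")                      // fixed list of topics
  // .option("subscribePattern", "topic_.*")                   // topics matching a regex
  // .option("assign", """{"topic_a":[0,1],"topic_b":[2]}""")  // explicit topic/partitions
  .load()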
Kafka data source configuration
16
⇢ Where?
⇢ What?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
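As in the speaker notes, offsets can also be given per topic/partition as JSON (-2 = earliest, -1 = latest); a sketch with placeholder topics, shown as a batch read because endingOffsets only applies to batch queries:
val batchFrame = sparkSession.read.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")  // batch only
  .load()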
Kafka data source configuration
17
⇢ Where?
⇢ What?
⇢ How?
kafka.bootstrap.servers + (subscribe, subscribePattern, assign)
startingOffsets, endingOffsets - topic/partition or global
fail on data loss (failOnDataLoss, streaming), max reading rate control (maxOffsetsPerTrigger), Spark partitions number (minPartitions)
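A sketch of the corresponding options on the source (the values are arbitrary placeholders; minPartitions needs a recent Spark version):
sparkSession.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("subscribePattern", "ss_starting_offset.*")
  .option("failOnDataLoss", "false")       // streaming: warn instead of failing the query on data loss
  .option("maxOffsetsPerTrigger", "1000")  // max reading rate per micro-batch
  .option("minPartitions", "10")           // desired minimum number of Spark partitions
  .load()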
Kafka input schema
18
key [binary], value [binary], topic [string], partition [int], offset [long], timestamp [long], timestampType [int]
Kafka input schema
19
key [binary], value [binary], topic [string], partition [int], offset [long], timestamp [long], timestampType [int]
val query = dataFrame.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .groupByKey(row => row.getAs[String]("key"))
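A sketch showing that the metadata columns from the schema above can be kept alongside the decoded payload:
val enrichedFrame = dataFrame.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset", "timestamp", "timestampType"
)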
From the fetch to the reading - micro-batch
20
● initialize offsets to process - asks the Apache Kafka broker for the next offsets to process and the max offsets in partition (no maxOffsetsPerTrigger); data loss checks, skewness optimization
● distribute offsets to executors - data locality
● create data consumer if needed
● poll data - as long as the read offset < max offset for topic/partition; data loss checks
● checkpoint processed offsets - if no fatal failure
● repeat if new data available
Data loss protection - conditions
21
deleted partitions
Data loss protection - conditions
22
deleted partitions, expired records (metadata consumer)
Data loss protection - conditions
23
deleted partitions, expired records (metadata consumer), new partitions with missing offsets
Data loss protection - conditions
24
deleted partitions, expired records (metadata consumer), new partitions with missing offsets, expired records (data consumer)
Apache Kafka data sink
25
Delivery semantics
26
at-least once
At-least once - why?
27
protected def checkForErrors(): Unit = {
  if (failedWrite != null) {
    throw failedWrite
  }
}
KafkaRowWriter
At-least once - why?
28
private val callback = new Callback() {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    if (failedWrite == null && e != null) {
      failedWrite = e
    }
  }
}
KafkaRowWriter
At-least once - why?
29
def write(row: InternalRow): Unit = {
  checkForErrors()
  sendRow(row, producer)
}
KafkaStreamDataWriter
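In practice this means a retried task can resend records that already reached the broker. A minimal sketch (not from the deck; eventsToWrite, the topic and the paths are hypothetical) of a Kafka sink query where such duplicates may appear after a failure and a restart from the checkpoint:
// eventsToWrite is a hypothetical streaming DataFrame with a "value" column
val kafkaSinkQuery = eventsToWrite
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("topic", "output_letters")                          // single output topic
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()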
Output generation
30
1 or multiple topics
1 or multiple outputs - how?
31
private def createProjection = {
  val topicExpression = topic.map(Literal(_)).orElse {
    inputSchema.find(_.name == TOPIC_ATTRIBUTE_NAME)
  }.getOrElse {
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
  // ...
KafkaRowWriter
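Concretely, when the "topic" option is absent, each row routes itself through its own topic attribute; a hedged sketch reusing the hypothetical eventsToWrite frame from the previous sink example (topic names are placeholders):
// no "topic" option: the per-row "topic" column decides where each record goes
val multiTopicQuery = eventsToWrite
  .selectExpr(
    "CAST(value AS STRING) AS value",
    "CASE WHEN length(value) > 5 THEN 'long_letters' ELSE 'short_letters' END AS topic")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "210.0.0.20:9092")
  .option("checkpointLocation", "/tmp/kafka-sink-multi")
  .start()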
Summary
32
● micro-batch oriented
● low latency (continuous mode) is a work-in-progress effort
● fault-tolerance with the checkpoint mechanism
● batch and streaming supported
● an alternative to other streaming approaches
Resources
● Kafka on Spark documentation: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
● Structured streaming support for consuming from Kafka:
https://issues.apache.org/jira/browse/SPARK-15406
● Github data generator: https://github.com/bartosz25/data-generator
● Kafka + Spark pipeline example: https://github.com/bartosz25/sessionization-demo
● Kafka + Spark series: https://www.waitingforcode.com/tags/kafka-spark-structured-streaming
33
Thank you!
@waitingforcode / waitingforcode.com
34

Editor's Notes

  • #8 Ask if everybody is aware of the watermark. Explain the idea of the state store + where it can be stored. Explain the checkpoint location + where it can be stored (HDFS-compatible fs).
  • #9 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #10 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #11 https://databricks.com/wp-content/uploads/2018/03/image2-2.png
  • #12 limit is useless since it will stop returning data as soon as it's reached
  • #13 limit is useless since it will stop returning data as soon as it's reached
  • #14 Is the code used in the transformation distributed only once, for the first query, or is it compiled & distributed for every query?
  • #16 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #17 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #18 .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""") .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""") → but endingOffsets only applies to batch processing! .option("startingOffsets", "earliest") .option("endingOffsets", "latest") optionals > failOnDataLoss > maxOffsetsPerTrigger > minPartitions
  • #19 HEADERS and 3.0!
  • #20 TODO: extract_json; no schema registry, even though there was a blog post by Xebia about integrating it
  • #21 Explain that it can be different → V1 vs V2 data source. Say that it doesn't happen for the next query because data is stored in memory, unless the check on data loss. Poll data = seek + poll; poll data => explain data loss checks. Consumer on the executor lifecycle ⇒ Is it closed after the batch read? In fact, it depends whether there are new topic/partitions. If it's not the case, it's reused; if yes, a new one is created. An exception ⇒ continuous streaming mode always recreates a new consumer! EXPLAIN the difference between the micro-batch and continuous readers.
  • #27 explain why not transactions (see comment from wfc)
  • #28 say that KafkaRowWriter is shared by V1 and V2 data sinks