How to manage the offsets read from Kafka with Spark Structured Streaming - Scala

I have a Spark Structured Streaming job that needs to read data from a Kafka topic and do some aggregation. The job has to be restarted daily, but when it restarts with startingOffsets="latest", I lose the data that arrives during the restart window. If I set startingOffsets="earliest", the job reads the whole topic instead of picking up where the last run left off. Can anyone tell me how to configure the offsets so the job resumes exactly where the previous streaming job stopped?
I'm using Spark 2.4.0 and Kafka 2.1.1. I have tried setting a checkpoint location for the write query, but Spark doesn't seem to pick up the Kafka offsets from it, so it keeps reading from either the latest or the earliest offset depending on startingOffsets.
Here is the config my Spark job uses to read from Kafka:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", topic)
  .option("startingOffsets", offset)
  .option("enable.auto.commit", "false")
  .load()
For example, say the Kafka topic has 10 messages with offsets from 1 to 10, Spark has just finished processing message 5, and then it restarts. How can I make Spark continue reading from message 5, not from 1 or 11?

It seems that with a bit of code I can capture the offsets I need and save them to some reliable storage such as Cassandra. When the streaming job starts, I just read the saved offsets back and pass them to startingOffsets.
This is the code that lets me get the offsets I need:
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress")
    println("Starting offset: " + queryProgress.progress.sources(0).startOffset)
    println("Ending offset: " + queryProgress.progress.sources(0).endOffset)
    // the logic to save these offsets goes here
  }
})
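For completeness, here is a minimal sketch of feeding the saved offsets back in on restart. loadOffsetsFromStorage is a hypothetical helper standing in for whatever store (e.g. Cassandra) holds the offsets, and the JSON string follows the documented per-partition startingOffsets format. Keep in mind that startingOffsets only applies when a query starts without an existing checkpoint; if the checkpoint directory is intact, the query resumes from the offsets recorded there.

// A sketch, not production code: resume from the offsets saved by the listener above.
// loadOffsetsFromStorage() is a hypothetical helper returning e.g. """{"my-topic":{"0":5}}"""
val savedOffsets: Option[String] = loadOffsetsFromStorage()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", topic)
  // fall back to "earliest" only if nothing has been saved yet
  .option("startingOffsets", savedOffsets.getOrElse("earliest"))
  .load()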

Related

Kafka as readstream source always returns 0 messages in the first iteration

I have a Structured Streaming job with Kafka as the source and Delta as the sink. Each batch is processed inside a foreachBatch.
The problem I am facing is that I need this Structured Streaming job to be triggered just once, but on that initial run Kafka always returns no records.
This is how I have configured the Structured Streaming process:
var kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "latest")
  .option("groupid", my_group_id)
  .option("minOffsetsPerTrigger", "20")
  .load()

val kafka_stream_payload = kafka_stream.selectExpr("cast(value as string) as msg")

kafka_stream_payload
  .writeStream
  .format("console")
  .queryName("my_query")
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => process_micro_batch(batchDF) }
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()
I tried to configure the Kafka readStream to pick up a minimum of 20 new messages by using "minOffsetsPerTrigger", "20". However, the first iteration always returns 0 new messages.
If I remove the .trigger(Trigger.AvailableNow()) option, the second (and following) iterations read an average of 200 new Kafka messages.
Is there a reason why I am getting 0 records during the first iteration, and how can I configure the source stream to enforce a minimum number of new messages?
Since you configured .option("startingOffsets", "latest"), it is expected to get 0 messages on the first iteration if no new messages have arrived in the Kafka topic by the time the query starts. Try .option("startingOffsets", "earliest") instead (note that in Structured Streaming this option takes the place of the consumer's auto.offset.reset setting),
or make sure data is being published to the Kafka topic continuously before you start your consumer.
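As a sketch of the suggested change, reusing the asker's variables (kafka_bootstrap_config, kafka_topic, process_micro_batch) plus an assumed checkpoint path; with a stable checkpoint, "earliest" only matters on the very first run and later runs resume from the recorded offsets:

val kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "earliest")   // only used when no checkpoint exists yet
  .load()

kafka_stream.selectExpr("cast(value as string) as msg")
  .writeStream
  .queryName("my_query")
  .option("checkpointLocation", "/tmp/checkpoints/my_query")   // assumed path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => process_micro_batch(batchDF) }
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()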

Reading names of the indexes from kafka topic doesn't work

I have a problem writing a Spark job. The problem goes as follows: I need to process all records from an Elasticsearch index. Spark seems like a good match for this, so I wrote the following code:
val dataset: Dataset<Row> = session.read()
    .format("org.elasticsearch.spark.sql")
    .option("es.read.field.include", "orgUUID,serializedEventKey,involvedContactURNs,crmAssociationSmartURNs")
    .option("es.read.field.as.array.include", "involvedContactURNs,crmAssociationSmartURNs")
    .load(index)
dataset.foreach(transform)
This code works without problems and does everything as expected. The problem is that index (the name of the Elasticsearch index) is not known a priori; I have to read the index names from a Kafka topic. So I added the following loop:
while (true) {
    val records = kafkaClient.poll(Duration.ofMillis(1000))
    if (!records.isEmpty) {
        records.forEach { record ->
            val index = record.value()
            // Here comes the code from above to process the index
        }
    }
}
This somehow doesn't work: the same records get read multiple times from the same Kafka topic. I understand that Spark spawns multiple executors behind the scenes, but all the Kafka clients share the same group ID, and according to the Kafka documentation only one consumer in a group should be able to read a given partition. That is the first mystery I would like explained.
That is not the end of my adventure, though. I decided to use Spark Structured Streaming to read from Kafka and went with the following:
val df = session.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test_consumers")
    .option("startingOffsets", "earliest")
    .load()
and after that
val temp = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
val temp1 = temp.map(retrieve, Encoders.STRING())

val retrieve: MapFunction<Row, String> = MapFunction { row ->
    val index = row.getAs("value")
    // Here the initial block of code to process the Elasticsearch index
    index
}
This failed with:
Caused by: org.apache.spark.SparkException: Writing job aborted.
at the line dataset.foreach(transform) from the initial code block.
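Not from the thread, but a sketch of the pattern this error usually points to: spark.read (the batch Elasticsearch read) cannot be called from code that runs on the executors, such as the map above, so one workaround is to bring the index names back to the driver inside foreachBatch and run the batch read there. The sketch is in Scala, and processIndex plus the checkpoint path are placeholders standing in for the asker's own code.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholder for the per-index logic from the first code block.
def processIndex(spark: SparkSession, index: String): Unit = {
  val dataset = spark.read
    .format("org.elasticsearch.spark.sql")
    .option("es.read.field.include", "orgUUID,serializedEventKey,involvedContactURNs,crmAssociationSmartURNs")
    .option("es.read.field.as.array.include", "involvedContactURNs,crmAssociationSmartURNs")
    .load(index)
  dataset.foreach(transform)   // same transform as in the original code
}

val indexNames = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test_consumers")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING) AS index")

indexNames.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // collect() brings the (small) set of index names to the driver,
    // where spark.read can be used safely
    batch.select("index").distinct().collect()
      .foreach(row => processIndex(spark, row.getString(0)))
  }
  .option("checkpointLocation", "/tmp/checkpoints/index-names")   // assumed path
  .start()
  .awaitTermination()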

Structured Streaming metrics understanding

I'm quite new to Structured Streaming and would like to understand the main Spark metrics in a bit more detail.
I have a Structured Streaming process in Databricks that reads events from one Event Hub, reads values from those events, creates a new DataFrame and writes that DataFrame to a second Event Hub.
The event that comes from the first Event Hub is an Event Grid event from which I read a URL (produced when a blob is added to a storage account); inside a foreachBatch I create a new DataFrame and write it to the second Event Hub.
The code has the following structure:
val streamingInputDF =
  spark.readStream
    .format("eventhubs")
    .options(eventHubsConf.toMap)
    .load()
    .select($"body".cast("string"))

def get_func(batchDF: DataFrame, batchID: Long): Unit = {
  batchDF.persist()
  for (row <- batchDF.rdd.collect) { // necessary to read the file with spark.read...
    val file_url = "/mnt/" + path
    // create a df from the read url
    val df = spark
      .read
      .option("rowTag", "Transaction")
      .xml(file_url)
    if (!df.rdd.isEmpty) {
      // some filtering
      val eh_df = df.select(col(...).as(...), ...)
      val eh_jsoned = eh_df.toJSON.withColumnRenamed("value", "body")
      // write to Event Hub
      eh_jsoned.select("body")
        .write
        .format("eventhubs")
        .options(eventHubsConfWrite.toMap)
        .save()
    }
  }
  batchDF.unpersist()
}

val query_test = streamingInputDF
  .writeStream
  .queryName("query_test")
  .foreachBatch(get_func _)
  .start()
I have tried adding the maxEventsPerTrigger(100) parameter, but this greatly increases the time between the data arriving in the Storage Account and it being consumed in Databricks.
The value for maxEventsPerTrigger was chosen arbitrarily, just to test the behaviour.
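For reference, with the Azure Event Hubs connector this limit is normally set on the EventHubsConf object rather than as a reader option. A minimal sketch, assuming the connection string is already available as connectionString and that eventHubsConf is the same object passed to readStream above:

// Sketch only: cap how many events each micro-batch pulls from the Event Hub.
import org.apache.spark.eventhubs.EventHubsConf

val eventHubsConf = EventHubsConf(connectionString)
  .setMaxEventsPerTrigger(100)   // upper bound on events processed per trigger

val streamingInputDF = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()
  .select($"body".cast("string"))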
Having seen the metrics, why does the batch duration increase so much while the processing rate and input rate stay similar?
What approach should I consider to improve the process?
I'm running it from a Databricks 7.5 notebook, with Spark 3.0.1 and Scala 2.12.
Thank you all very much in advance.
NOTE:
The XML files all have the same size.
The first Event Hub has 20 partitions.
The data input rate to the first Event Hub is 2 events/sec.

Termination of a Structured Streaming query using Databricks

I would like to understand whether running a cell in a Databricks notebook with the code below and then cancelling it means that the stream reading is over, or whether it requires some explicit closing.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .option("subscribe", "topic1")
  .load()

display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)])
Non-display Mode
For this type of approach, it's best to issue this command in a cell:
streamingQuery.stop()
where the query was started like this:
val streamingQuery = streamingDF                  // Start with our "streaming" DataFrame
  .writeStream                                    // Get the DataStreamWriter
  .queryName(myStreamName)                        // Name the query
  .trigger(Trigger.ProcessingTime("3 seconds"))   // Configure for a 3-second micro-batch
  .format("parquet")                              // Specify the sink type, a Parquet file
  .option("checkpointLocation", checkpointPath)   // Specify the location of checkpoint files & W-A logs
  .outputMode("append")                           // Write only new data to the "file"
  .start(outputPathDir)
Otherwise the query keeps running, which is the whole idea of streaming.
I would not stop the cluster for this, as that takes down all the streams at once.
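If you no longer hold a reference to the query object, the standard StreamingQueryManager API can stop everything that is active; a small sketch:

// Stop every active streaming query in this SparkSession
spark.streams.active.foreach { q =>
  println(s"Stopping ${q.name} (${q.id})")
  q.stop()
}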
Databricks display Mode
Databricks has written a nice set of utilities for this, but you need to do their course to get them. My folly.
display is a Databricks-specific function. It needs a named stream, like this:
display(myDF, streamName = myStreamName)
then proceed as follows in a separate cell:
println("Looking for %s".format(myStreamName))
for (stream <- spark.streams.active) // Loop over all active streams
if (stream.name == myStreamName) // Single out your stream
{val s = spark.streams.get(stream.id)
s.stop()
}
This will stop the display approach, which writes to a memory sink.

Spark structured streaming - UNION two or more streaming sources

I'm using Spark 2.3.2 and running into an issue doing a union of two or more streaming sources from Kafka. Each of these is a streaming source from Kafka that I've already transformed and stored in a DataFrame.
Ideally I'd like to store the results of this unioned DataFrame in Parquet format in HDFS, or potentially even write it back into Kafka. The ultimate goal is to store these merged events with as low a latency as possible.
val finalDF = flatDF1
  .union(flatDF2)
  .union(flatDF3)

val query = finalDF.writeStream
  .format("parquet")
  .outputMode("append")
  .option("path", hdfsLocation)
  .option("checkpointLocation", checkpointLocation)
  .option("failOnDataLoss", false)
  .start()

query.awaitTermination()
When doing a writeStream to the console instead of Parquet I get the expected results, but the example above causes an assertion failure:
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:42)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$populateStartOffsets(MicroBatchExecution.scala:185)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:124)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
here is the class and assertion that is failing:
case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: Option[OffsetSeqMetadata] = None) {
  assert(sources.size == offsets.size)
Is this because the checkpoint is only storing the offsets for one of the DataFrames? Looking through the Spark Structured Streaming documentation, it seemed like it was possible to do joins/unions of streaming sources in Spark 2.2 or later.
First, please clarify how your case class OffsetSeq is related to the code with the unions of the DataFrames.
Next, checkpointing is a real issue when performing this union and then writing to Kafka with writeStream. Splitting it into multiple writeStreams, each with its own checkpoint, confuses batch IDs because of the union operation. Using a single writeStream on the union of the DataFrames also fails with checkpointing, since the checkpoint appears to track all the sources that produced the DataFrames before the union and cannot distinguish which row/record came from which DataFrame/model.
For writing to Kafka from structured streaming unioned DataFrames, it is best to use writeStream with foreach and a ForeachWriter, with the Kafka producer inside the process method. No checkpointing is needed then; the application just uses temporary checkpoint files, which are set to be deleted when appropriate (set "forceDeleteTempCheckpointLocation" to true in the session builder).
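For reference, a minimal sketch of that session-builder setting, assuming a Spark version where the full config key spark.sql.streaming.forceDeleteTempCheckpointLocation is available (Spark 3.x); the app name is a placeholder:

// Sketch only: ask Spark to delete the temporary streaming checkpoint
// directory when the query stops.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("union-streams-to-kafka")   // placeholder app name
  .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
  .getOrCreate()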
Anyway, I have just set up Scala code to union an arbitrary number of streaming DataFrames and then write to a Kafka producer. It appears to work well once all the Kafka producer code is placed in the ForeachWriter process method, so that it can be serialized by Spark.
val output = dataFrameModelArray.reduce(_ union _)

val stream: StreamingQuery = output
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = {
      true
    }
    def process(row: Row): Unit = {
      // note: this creates a new producer for every record processed
      val producer: KafkaProducer[String, String] = new KafkaProducer[String, String](props)
      val record = new ProducerRecord[String, String](producerTopic, row.getString(0), row.getString(1))
      producer.send(record)
    }
    def close(errorOrNull: Throwable): Unit = {
    }
  })
  .start()
More logic can be added to the process method if needed.
Note that prior to the union, all the DataFrames to be unioned have been converted into key/value string columns; the value is a JSON string of the message data to be sent via the Kafka producer. This is also very important to get right before the union is attempted.
svcModel.transform(query)
  .select($"key", $"uuid", $"currentTime", $"label", $"rawPrediction", $"prediction")
  .selectExpr("key", "to_json(struct(*)) AS value")
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
where svcModel is a dataframe in the dataFrameModelArray.