Kafka as readStream source always returns 0 messages in the first iteration - Scala

I have a Structured Streaming job with Kafka as the source and Delta as the sink. Each batch is processed inside a foreachBatch.
The problem I am facing is that I need this Structured Streaming query to be triggered just once, but in that initial run Kafka always returns no records.
This is how I have configured the Structured Streaming process:
var kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "latest")
  .option("groupid", my_group_id)
  .option("minOffsetsPerTrigger", "20")
  .load()
val kafka_stream_payload = kafka_stream.selectExpr("cast (value as string) as msg ")
kafka_stream_payload
  .writeStream
  .format("console")
  .queryName("my_query")
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => process_micro_batch(batchDF) }
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()
I tried to configure the Kafka readStream to pick up a minimum of 20 new messages by using "minOffsetsPerTrigger", "20". However, on the first iteration it keeps returning 0 new messages.
If I remove the .trigger(Trigger.AvailableNow()) option, the second (and subsequent) iterations read an average of 200 new Kafka messages.
Is there a reason why I am getting 0 records during the first iteration, and how can I configure the source stream to enforce a minimum number of new messages?

Since you configured .option("startingOffsets", "latest"), it is possible to get 0 messages on the first iteration if no messages have arrived in the Kafka topic by the time the query starts. Try .option("startingOffsets", "earliest") instead,
or add ("auto.offset.reset", "earliest"),
or make sure data is being published to the Kafka topic continuously before you start your consumer.
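A minimal sketch of the first suggestion, reusing the option names from the question (only startingOffsets changes):
val kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "earliest")   // read the whole topic instead of only messages arriving after the query starts
  .option("minOffsetsPerTrigger", "20")
  .load()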

Related

Reading names of the indexes from a Kafka topic doesn't work

I have a problem writing a Spark job. The problem goes as follows: I need to process all records from an Elasticsearch index. Spark seems like a good match for this, so I wrote the following code
val dataset: Dataset<Row> = session.read()
  .format("org.elasticsearch.spark.sql")
  .option("es.read.field.include", "orgUUID,serializedEventKey,involvedContactURNs,crmAssociationSmartURNs")
  .option("es.read.field.as.array.include", "involvedContactURNs,crmAssociationSmartURNs")
  .load(index)
dataset.foreach(transform)
This code works without problems and does everything as expected - the problem is that index (the name of the Elasticsearch index) is not known a priori. I have to read the index names from a Kafka topic, so I added the following loop
while (true) {
  val records = kafkaClient.poll(Duration.ofMillis(1000))
  if (!records.isEmpty) {
    records.forEach { record ->
      val index = record.value()
      // Here comes the code from above to process the index
    }
  }
}
This somehow doesn't work - the same records get read multiple times from the same Kafka topic. I understand that Spark spawns multiple executors behind the scenes, but all the Kafka clients share the same group ID, and according to the Kafka documentation only one of them should be able to read - that is the first mystery I would like explained.
That is not the end of my adventure, though. I decided to use Spark Streaming to read from Kafka and went with the following
val df = session.readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test_consumers")
  .option("startingOffsets", "earliest")
  .load()
and after that
val temp = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
val temp1 = temp.map(retrieve, Encoders.STRING())
val retrieve: MapFunction<Row, String> = MapFunction { row ->
  val index = row.getAs("value")
  // Here goes the initial block of code to process the Elasticsearch index
  index
}
This failed with
Caused by: org.apache.spark.SparkException: Writing job aborted.
at the line dataset.foreach(transform) in the initial code block.

Read from a Kafka topic, process the data, and write back to a Kafka topic using Scala and Spark

Hi, I'm reading from a Kafka topic and I want to process the data received from Kafka (tokenization, filtering out unnecessary data, removing stop words) and finally write it back to another Kafka topic.
// read from kafka
val readStream = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
  .load()
val df = readStream.selectExpr("CAST(value AS STRING)")
df.show(false)
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc. and return cleanedDataframe as the final result
// write back to kafka
val writeStream = cleanedDataframe
  .writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("topic", "writing.val")
  .start()
writeStream.awaitTermination()
Then I am getting the below error
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with
writeStream.start();;
Then I edited my code as follows to read from Kafka and write to the console
// read from kafka
val readStream = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
  .load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)")
val query = df.writeStream
  .outputMode("append")
  .format("console")
  .start().awaitTermination()
// then perform the data processing part as mentioned in the first half
With the second approach, data was continuously displayed in the console, but it never ran through the data processing part. How can I read from a Kafka topic, perform some actions (tokenization, removing stop words) on the received data, and finally write it back to a new Kafka topic?
EDIT
The stack trace points at df.show(false) in the code above when the error occurs.
There are two common problems in your current implementation:
1. Applying show in a streaming context
2. Code after awaitTermination will not be executed
To 1.
The method show is an action (as opposed to a transformation) on a DataFrame. As you are dealing with streaming DataFrames, this causes an error, because streaming queries need to be started with start (just as the exception text is telling you).
To 2.
The method awaitTermination is a blocking call, which means that the code after it will not be executed until the query terminates.
Overall Solution
If you want to read from and write to Kafka and, in between, want to understand what data is being processed by showing it in the console, you can do the following:
// read from kafka
val readStream = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
  .load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)")
df.writeStream
  .outputMode("append")
  .format("console")
  .start()
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc. and return cleanedDataframe as the final result
// write back to kafka
// the columns `key` and `value` of the DataFrame `cleanedDataframe` will be used for producing the message into the Kafka topic
val writeStreamKafka = cleanedDataframe
  .writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("topic", "writing.val")
  .start()
existingSparkSession.streams.awaitAnyTermination()
Note the existingSparkSession.streams.awaitAnyTermination() at the very end of the code, instead of calling awaitTermination directly after the start. Also remember that the columns key and value of the DataFrame cleanedDataframe will be used for producing the messages to the Kafka topic; a key column is not required, though.
In addition, if you are using checkpointing (recommended), you need to set two different locations: one for the console stream and one for the Kafka output stream (see the sketch below). It is important to keep in mind that the two streaming queries run independently.
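A minimal sketch of adding the two checkpoint locations to the code above (the paths are placeholders):
// console stream with its own checkpoint location
df.writeStream
  .outputMode("append")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/console-debug")
  .start()

// kafka output stream with a separate checkpoint location
cleanedDataframe.writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("topic", "writing.val")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-output")
  .start()

existingSparkSession.streams.awaitAnyTermination()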

How to create a Spark Dataset when the transformation is not 1:1 but 1:many

I'm writing a Structured Streaming Spark application where I read from a Kafka queue and process the messages received. The end result I want is a Dataset[MyMessage] (where MyMessage is a custom object) which I want to queue to another Kafka topic. The thing is, each input message from the consumer Kafka queue can yield multiple MyMessage objects, so the transformation is not 1:1 but 1:many.
So I'm doing
val messagesDataSet: Dataset[List[MyMessage]] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .option("failOnDataLoss", false)
  .option("startingOffsets", "offset1")
  .load()
  .select($"value")
  .mapPartitions { r => createMessages(r) }
def createMessages(row: Iterator[Row]): List[MyMessage] = {
  // ...
}
Obviously, messagesDataSet is a Dataset[List[MyMessage]]. Is there a way I can get just a Dataset[MyMessage]?
Or is there a way to take a Dataset[List[MyMessage]] and then write each MyMessage object to another Kafka topic? (That's my end goal, after all.)
Try:
messagesDataSet.flatMap(identity)
You can create multiple values using mapPartitions (so it works similarly to flatMap), but you have to return an Iterator:
def createMessages(row: Iterator[Row]): Iterator[MyMessage] = {
  row.map(/*...*/) // you need to return an Iterator here
}
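For the second part of the question (writing each MyMessage to another Kafka topic), a minimal sketch could look like the following; the output topic, checkpoint path, and the toJson serialization are assumptions, not part of the original question:
import spark.implicits._

// flatten Dataset[List[MyMessage]] into Dataset[MyMessage]
val flattened = messagesDataSet.flatMap(identity)

flattened
  .map(_.toJson)   // hypothetical serialization of MyMessage to a String payload
  .toDF("value")   // the Kafka sink expects a `value` column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")                                    // placeholder output topic
  .option("checkpointLocation", "/tmp/checkpoints/my-messages") // placeholder path
  .start()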

Termination of a Structured Streaming query using Databricks

I would like to understand whether running a cell in a Databricks notebook with the code below and then cancelling it means that the stream reading is over, or whether it requires some explicit closing.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .option("subscribe", "topic1")
  .load()
display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)])
Non-display Mode
It's best to issue this command in a cell:
streamingQuery.stop()
for this type of approach:
val streamingQuery = streamingDF                  // Start with our "streaming" DataFrame
  .writeStream                                    // Get the DataStreamWriter
  .queryName(myStreamName)                        // Name the query
  .trigger(Trigger.ProcessingTime("3 seconds"))   // Configure for a 3-second micro-batch
  .format("parquet")                              // Specify the sink type, a Parquet file
  .option("checkpointLocation", checkpointPath)   // Specify the location of checkpoint files & W-A logs
  .outputMode("append")                           // Write only new data to the "file"
  .start(outputPathDir)
Otherwise it continues to run - which is the idea of streaming.
I would not stop the cluster for this, as that would affect all the streams.
Databricks display Mode
Databricks has written a nice set of utilities, but you need to do the course to get them. My folly.
display is a Databricks-specific function. It needs a format like:
display(myDF, streamName = "myQuery")
then proceed as follows in a separate cell:
println("Looking for %s".format(myStreamName))
for (stream <- spark.streams.active) // Loop over all active streams
if (stream.name == myStreamName) // Single out your stream
{val s = spark.streams.get(stream.id)
s.stop()
}
This will stop the display approach, which writes to a memory sink.

How to manage the offsets read from Kafka with Spark Structured Streaming

I have a Spark Structured Streaming job that needs to read data from a Kafka topic and do some aggregation. The job needs to be restarted daily, but when it restarts, if I set startingOffsets="latest" I lose the data that arrives during the restart window. If I set startingOffsets="earliest" then the job reads all data from the topic, not from where the previous run left off. Can anyone help me configure the offsets so that the job resumes exactly where the last run left off?
I'm using Spark 2.4.0 and Kafka 2.1.1. I have tried setting a checkpoint location for the write job, but it seems like Spark doesn't check the offsets of the Kafka messages, so it keeps reading from the latest or earliest offset depending on startingOffsets.
Here is how I configure Spark to read from Kafka:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", topic)
  .option("startingOffsets", offset)
  .option("enable.auto.commit", "false")
  .load()
For example, say the Kafka topic has 10 messages with offsets from 1 to 10, Spark has just finished processing message 5, and then it restarts. How can I make Spark continue reading from message 5, not from 1 or 11?
It seems like, with some code, I can capture the offsets that I need and save them to some reliable storage like Cassandra. Then, when the streaming job starts, I just need to read the saved offsets and pass them to startingOffsets.
This is the code that helps me get the offsets I need:
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
    println("Query started: " + queryStarted.id)
  }
  override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
    println("Query terminated: " + queryTerminated.id)
  }
  override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
    println("Query made progress")
    println("Starting offset: " + queryProgress.progress.sources(0).startOffset)
    println("Ending offset: " + queryProgress.progress.sources(0).endOffset)
    // the logic to save these offsets goes here
  }
})
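A minimal sketch of feeding the saved offsets back in on restart; readSavedOffsetsAsJson() is a hypothetical helper that returns the offsets persisted by the listener above, formatted as the JSON string that startingOffsets accepts (e.g. {"my-topic":{"0":42,"1":17}}):
// read the offsets saved by the listener above (hypothetical helper)
val savedOffsets: String = readSavedOffsetsAsJson() // e.g. """{"my-topic":{"0":42,"1":17}}"""

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", host)
  .option("subscribe", topic)
  .option("startingOffsets", savedOffsets) // resume from the persisted offsets instead of "earliest"/"latest"
  .load()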