Spark Structured Streaming with Kafka throws an error after running for a while - Scala

I am observing weird behaviour while running a Spark Structured Streaming program. I am using an S3 bucket for metadata checkpointing.
The Kafka topic has 310 partitions.
When I start the streaming job for the first time, Spark creates a new file, named after the batch ID, in the offsets directory at the checkpoint location after each batch completes. After a few batches complete successfully, the Spark job fails after a few retries with this warning:
"WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "false"."
The weird thing is that the previous batch's offset file contains entries for all 310 partitions, but the current batch reads only some of them (see the warning above).
I reran the job with .option("failOnDataLoss", false) and got the same warning, this time without the job failing. Spark processed the correct offsets for a few partitions, and for the rest it read from the starting offset (0).
There were no connection issues between Spark and Kafka when the error occurred (we also checked the Kafka logs).
Could someone help with this? Am I doing something wrong or missing something?
Below are the read and write stream code snippets.
import org.apache.spark.sql.{Dataset, Row, SaveMode}
import org.apache.spark.sql.streaming.Trigger
val kafkaDF = ss.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
  .option("subscribe", logs)
  .option("fetchOffset.numRetries", 5)
  .option("maxOffsetsPerTrigger", 30000000)
  .load()
// logDS is presumably derived from kafkaDF; that step is not shown in the question.
val query = logDS
  .writeStream
  .foreachBatch { (batchDS: Dataset[Row], batchId: Long) =>
    // Repartition by the table's partition columns, then append to the Hive table.
    batchDS
      .repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz"))
      .write
      .mode(SaveMode.Append)
      .partitionBy("date", "abc", "xyz")
      .format("parquet")
      .saveAsTable(hiveTableName /*"default.logs"*/)
  }
  .trigger(Trigger.ProcessingTime("1800 seconds"))
  .option("checkpointLocation", s3bucketpath)
  .start()
Thanks in advance.

Related

How can I stop Kafka from retrying connections?

I have a structured streaming application that reads messages from Kafka and then shuts down. However, if the kafka broker is down or unreachable, my application will attempt to reconnect indefinitely. I would prefer that it shuts down after 2-3 retries and let my orchestration engine handle alerting and retrying.
It seems like Kafka does not expose any configuration options on the consumer side that would allow us to stop attempting reconnects after a certain time period or retry count.
https://github.com/apache/kafka/blob/6ab4d047d563e0fe42a7c0ed6f10ddecda135595/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
The only configurations that I found in the ConsumerConfig that might assist us with this are the following:
kafka.reconnect.backoff.ms
kafka.reconnect.backoff.max.ms
kafka.socket.connection.setup.timeout.max.ms
Here is my Structured Streaming code for reference:
import org.apache.spark.sql.streaming.Trigger
val checkpointUrl = "s3://mybucket/experiment/checkpoint/"
val url = "s3://mybucket/experiment/triggerOnce/"
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "X.X.X.X:9092")
  .option("kafka.reconnect.backoff.max.ms", 1000)
  .option("kafka.socket.connection.setup.timeout.max.ms", 10000)
  .option("subscribe", "experiment3")
  .load()
  .writeStream
  .format("csv")
  .option("checkpointLocation", checkpointUrl)
  .trigger(Trigger.Once())
  .start(url)
  .awaitTermination()
I am currently looking into whether SparkListeners can be used to intervene in the reconnection logic.
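One possible workaround, sketched here under assumptions rather than taken from the question, is to probe the broker with a bounded number of plain TCP connection attempts before starting the query, and to exit if none succeeds. The object name BrokerPreflight and all parameter names are hypothetical. The check only covers reachability at startup; it does not change how the Kafka client reconnects once the query is running.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object BrokerPreflight {
  /** Tries to open a TCP connection to host:port at most maxAttempts times.
    * Returns true as soon as one attempt succeeds, false if all attempts fail. */
  def reachable(host: String, port: Int, maxAttempts: Int,
                timeoutMs: Int, backoffMs: Long): Boolean =
    (1 to maxAttempts).exists { attempt =>
      val ok = Try {
        val socket = new Socket()
        try socket.connect(new InetSocketAddress(host, port), timeoutMs)
        finally socket.close()
      }.isSuccess
      // Wait before the next attempt, but not after the last one.
      if (!ok && attempt < maxAttempts) Thread.sleep(backoffMs)
      ok
    }
}
```

A job could call BrokerPreflight.reachable("X.X.X.X", 9092, 3, 2000, 1000) before readStream and call sys.exit(1) when it returns false, leaving alerting and retrying to the orchestration engine.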

Spark structured streaming acknowledge messages

I am using Spark Structured Streaming to read from a Kafka topic (say topic1) and a Kafka sink to write to another topic (topic1-result). I can see that the messages are not removed from topic1 after they have been written to the other topic.
// Subscribe to 1 topic
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.load()
//SINK to another topic
val ds = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("checkpointLocation", "/tmp/checkpoint1")
.option("topic", "topic1-result")
.start()
The documentation says we cannot use auto-commit with Structured Streaming:
enable.auto.commit: Kafka source doesn’t commit any offset.
How can I acknowledge messages and remove the processed messages from the topic (topic1)?
Two considerations:
Messages are not removed from Kafka when you commit. When a consumer commits, Kafka advances the stored offset for that consumer group on the topic. The messages themselves stay in the topic until the retention time configured for the topic expires.
Indeed, the Kafka source does not commit. Instead, the stream stores the offset of the next message to read in its checkpoint directory, so when the stream restarts it resumes from the last stored offset.
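The two points above can be illustrated with a toy model. This is not Kafka or Spark code: PartitionLog and Positions are hypothetical names, and the model only shows that committing moves a pointer while the log itself stays unchanged.

```scala
object OffsetModel {
  // A topic partition modelled as an append-only log; reading never deletes anything.
  final case class PartitionLog(messages: Vector[String]) {
    def readFrom(offset: Long): Vector[String] = messages.drop(offset.toInt)
  }

  // The stored position per consumer group (for Structured Streaming, think of
  // the offset kept in the checkpoint directory instead of in Kafka).
  final case class Positions(byGroup: Map[String, Long]) {
    def commit(group: String, nextOffset: Long): Positions =
      copy(byGroup = byGroup.updated(group, nextOffset))
  }
}
```

After committing offset 2 for a group, a log holding three messages still holds three messages; only the group's next read starts at the third one. Removal is left to the topic's retention settings.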

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have a streaming application, implemented with Spark Structured Streaming, that reads data from Kafka topics and writes it to an HDFS location.
Sometimes the application fails with this exception:
_spark_metadata/0 doesn't exist while compacting batch 9
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
We have not been able to resolve this issue.
The only solution I found is to delete the checkpoint location files, which makes the job read the topic from the beginning as soon as the application runs again. This is not feasible for a production application.
Does anyone have a solution for this error that does not require deleting the checkpoint, so that the job can continue from where the last run failed?
Sample code of the application:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", <server list>)
  .option("subscribe", <topic>)
  .load()
[...] // do some processing
dfProcessed.writeStream
  .format("csv")
  .option("format", "append")
  .option("path", hdfsPath)
  .option("checkpointLocation", "")
  .outputMode("append")
  .start()
The error message
_spark_metadata/n.compact doesn't exist when compacting batch n+10
can show up when you
process some data into a FileSink with checkpoint enabled, then
stop your streaming job, then
change the output directory of the FileSink while keeping the same checkpointLocation, then
restart the streaming job
Quick Solution (not for production)
Just delete the files in checkpointLocation and restart the application.
Stable Solution
As you do not want to delete your checkpoint files, you could simply copy the missing spark metadata files from the old File Sink output path to the new output Path. See below to understand what are the "missing spark metadata files".
Background
To understand why this IllegalStateException is thrown, we need to look at what happens behind the scenes in the file output path. Let outPathBefore be the name of this path. While your streaming job is running and processing data, it creates a folder outPathBefore/_spark_metadata. In that folder there is one file per micro-batch, named after the micro-batch identifier, which lists the (partitioned) files the data has been written to, e.g.:
/home/mike/outPathBefore/_spark_metadata$ ls
0 1 2 3 4 5 6 7
In this case we have details for 8 micro batches. The content of one of the files looks like
/home/mike/outPathBefore/_spark_metadata$ cat 0
v1
{"path":"file:///tmp/file/before/part-00000-99bdc705-70a2-410f-92ff-7ca9c369c58b-c000.csv","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
By default, these files are compacted on every tenth micro-batch, meaning that the contents of the files 0, 1, 2, ..., 9 are stored in a compacted file called 9.compact.
This procedure continues for the subsequent ten batches, i.e. in micro-batch 19 the job combines 9.compact, 10, 11, ..., 18 with the entries of batch 19 itself and writes the result to 19.compact.
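The compaction schedule described above can be modelled in a few lines. This is a sketch of the behaviour as explained here, assuming the compact interval of 10 shown in the error message; CompactionModel and its method names are illustrative and are not Spark's API.

```scala
object CompactionModel {
  // Assumption: compact interval of 10, as reported in the error message.
  val DefaultCompactInterval = 10

  /** True if the metadata log is compacted in this micro-batch (9, 19, 29, ... for interval 10). */
  def isCompactionBatch(batchId: Long, interval: Int = DefaultCompactInterval): Boolean =
    (batchId + 1) % interval == 0

  /** Names of the files that must already exist in _spark_metadata
    * before the compaction in micro-batch batchId can succeed. */
  def requiredFiles(batchId: Long, interval: Int = DefaultCompactInterval): Seq[String] = {
    require(isCompactionBatch(batchId, interval), s"$batchId is not a compaction batch")
    val start = math.max(0L, batchId - interval)
    (start until batchId).map { id =>
      if (isCompactionBatch(id, interval)) s"$id.compact" else id.toString
    }
  }
}
```

For batch 9 the model asks for the files 0 to 8, and for batch 19 it asks for 9.compact followed by 10 to 18, which matches the two file names that appear in the exceptions quoted in the question.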
Now, imagine you had the streaming job running until micro batch 15 which means the job has created the following files:
/home/mike/outPathBefore/_spark_metadata/0
/home/mike/outPathBefore/_spark_metadata/1
...
/home/mike/outPathBefore/_spark_metadata/8
/home/mike/outPathBefore/_spark_metadata/9.compact
/home/mike/outPathBefore/_spark_metadata/10
...
/home/mike/outPathBefore/_spark_metadata/15
After the fifteenth micro batch you stopped the streaming job and changed the output path of the File Sink to, say, outPathAfter. As you keep the same checkpointLocation the streaming job will continue with micro-batch 16. However, it now creates the metadata files in the new out path:
/home/mike/outPathAfter/_spark_metadata/16
/home/mike/outPathAfter/_spark_metadata/17
...
This is where the exception is thrown: on reaching micro-batch 19, the job tries to compact the ten most recent files in the _spark_metadata folder. It finds only the files 16, 17 and 18, but not 9.compact, 10 and so on. Hence the error message:
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
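The stable solution mentioned earlier, copying the missing metadata files from the old output path to the new one, could be sketched as follows for a local filesystem. MetadataCopy is a hypothetical helper, not part of Spark, and it assumes both output paths are local directories; on HDFS or S3 the same copy would have to go through the Hadoop FileSystem API or a command-line copy instead. Stop the streaming job before running it.

```scala
import java.nio.file.{Files, Path}

object MetadataCopy {
  /** Copies every file that exists in oldOut/_spark_metadata but not in
    * newOut/_spark_metadata. Files already present in the new directory are
    * left untouched. Returns the names of the copied files, sorted. */
  def copyMissing(oldOut: Path, newOut: Path): Seq[String] = {
    val src = oldOut.resolve("_spark_metadata")
    val dst = newOut.resolve("_spark_metadata")
    Files.createDirectories(dst)
    // listFiles() returns null if src does not exist or is not a directory.
    val candidates = Option(src.toFile.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    candidates
      .filter(_.isFile)
      .filterNot(f => Files.exists(dst.resolve(f.getName)))
      .map { f =>
        Files.copy(f.toPath, dst.resolve(f.getName))
        f.getName
      }
      .sorted
  }
}
```

With the example paths used above, the call would be MetadataCopy.copyMissing(java.nio.file.Paths.get("/home/mike/outPathBefore"), java.nio.file.Paths.get("/home/mike/outPathAfter")), which brings 9.compact and the files 10 to 15 into the new folder next to 16, 17 and 18.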
Documentation
The Structured Streaming Programming Guide states, under Recovery Semantics after Changes in a Streaming Query:
"Changes to output directory of a file sink are not allowed: sdf.writeStream.format("parquet").option("path", "/somePath") to sdf.writeStream.format("parquet").option("path", "/anotherPath")"
Databricks has also written some details in the article Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
The error is caused by checkpointLocation, because it still holds information about old or deleted data. You just need to delete the checkpointLocation folder.
See also: https://kb.databricks.com/streaming/file-sink-streaming.html
Example:
df.writeStream
.format("parquet")
.outputMode("append")
.option("checkpointLocation", "D:/path/dir/checkpointLocation")
.option("path", "D:/path/dir/output")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
You need to delete the checkpointLocation directory.
This article introduces the mechanism and gives a good way to recover from a deleted _spark_metadata folder in Spark Structured Streaming:
https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j
"Create dummy log files:
If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.
In our example, this could be done like this:
for i in {0..1}; do echo v1 > "/tmp/destination/_spark_metadata/$i"; done
This will create the files
/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
Now, the query can be restarted and should finish without errors."
Since my previous output folder was no longer recoverable, I tried this dummy-file approach, which can get rid of the IllegalStateException: _spark_metadata/... doesn't exist exception.

Spark Structured Streaming write errors

I'm running into some odd errors when I consume and sink Kafka messages. I'm running 2.3.0, and I know this worked in an earlier version.
val event = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
val filesink_query = outputdf.writeStream
.partitionBy(<some column>)
.format("parquet")
.option("path", <some path in EMRFS>)
.option("checkpointLocation", "/tmp/ingestcheckpoint")
.trigger(Trigger.ProcessingTime(10.seconds))
.outputMode(OutputMode.Append)
.start
java.lang.IllegalStateException: /tmp/outputagent/_spark_metadata/0 doesn't exist when compacting batch 9 (compactInterval: 10)
I'm rather confused. Is this a bug in the newest version of Spark?
The issue seemed to be related to using s3n instead of s3a and to having checkpoints only on HDFS, not on S3. This is highly annoying since I would like to avoid hard-coding DNS names or IPs in my code.

Offset Management For Apache Kafka With Apache Spark Batch

I'm writing a Spark (v2.2) batch job that reads from a Kafka topic. The Spark jobs are scheduled with cron.
I can't use Spark Structured Streaming because non-time-based windows are not supported.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "kafka_topic")
  .load()
I need to set the offset for the Kafka topic so that the next batch job knows where to start. How can I do that?
If you are using KafkaUtils to create the stream, you can pass the starting offsets as a parameter:
val inputDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets))
Hoping this helps !
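The question uses the batch reader (spark.read.format("kafka")) rather than KafkaUtils. For that reader, the Kafka source accepts the options startingOffsets and endingOffsets as a JSON string of the form {"topic":{"partition":offset}}. A minimal sketch of building that string from offsets saved by the previous run is shown below. KafkaOffsetsJson is a hypothetical helper, and where the offsets are persisted between cron runs (a file, a database table) is left to the job.

```scala
object KafkaOffsetsJson {
  /** Builds the JSON string expected by the Kafka source options
    * "startingOffsets" and "endingOffsets" from (topic, partition) -> offset. */
  def toJson(offsets: Map[(String, Int), Long]): String =
    offsets
      .groupBy { case ((topic, _), _) => topic }
      .toSeq
      .sortBy(_._1)
      .map { case (topic, parts) =>
        val body = parts.toSeq
          .map { case ((_, partition), offset) => partition -> offset }
          .sortBy(_._1)
          .map { case (partition, offset) => s""""$partition":$offset""" }
          .mkString(",")
        s""""$topic":{$body}"""
      }
      .mkString("{", ",", "}")
}
```

The result would then be passed as .option("startingOffsets", KafkaOffsetsJson.toJson(savedOffsets)) on the reader, where savedOffsets is whatever the previous run stored after it finished.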