I'm running PySpark using a Spark cluster in local mode and I'm trying to write a streaming DataFrame to a Kafka topic.
When I run the query, I get the following message:
java.lang.IllegalStateException: Set(topicname-0) are gone. Some data may have been missed..
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
This is my code:
query = (
output_stream
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "ratings-cleaned")
.option("checkpointLocation", "checkpoints-folder")
.start()
)
sleep(2)
print(query.status)
This error message typically shows up when some messages/offsets have been removed from the source topic since the last run of the query. The removal happens due to the topic's cleanup policy, e.g. its retention time.
Imagine your topic has messages with offsets 0, 1, 2 which have all been processed by the application. The checkpoint files store that last offset 2 so that the query knows to continue with offset 3 the next time it starts.
After some time, messages with offsets 3, 4, 5 were produced to the topic, but messages with offsets 0, 1, 2, 3 were removed from it due to its retention policy.
Now, when restarting your Spark Structured Streaming job, it tries to fetch offset 3 based on its checkpoint files but realises that only the message with offset 4 is available. In exactly that case it will throw this exception.
You can solve this by
setting .option("failOnDataLoss", "false") in your readStream operation, or
deleting the existing checkpoint files.
According to the Structured Streaming + Kafka Integration Guide the option failOnDataLoss is described as:
"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data."
On top of the answers above, Bartosz Konieczny posted a more detailed explanation. The first part of the error message names a set of topic partitions (hence the -0 at the end) that are gone, meaning the partition the Spark query had subscribed to has been deleted. My guess is the Kafka setup was restarted. The Spark query is using an existing checkpoint folder that assumes the Kafka setup was not restarted.
This error message hints at issues with the checkpoints. During development, this can be caused by using an old checkpoints folder with an updated query.
If this is in a development environment and you don't need to save the state of the previous query, you can just remove the checkpoints folder (checkpoints-folder in the code example) and rerun your query.
Related
I am developing a Spark Structured Streaming process for a real-time application.
I need to read current Kafka messages without any delay.
Messages older than 30 seconds are not relevant in this project.
I am reading old messages with a big delay from the current timestamp (minutes); it seems that Spark Structured Streaming is not properly applying the startingOffsets property set to latest.
I guess that the problem is the HDFS checkpoint location of the topic I write to ...
I do not want to read old messages; only the current ones matter!
I have tested many different configurations, Kafka properties, etc., but nothing worked.
Here is my code and the relevant config (kafka.bootstrap.servers and kafka.ssl.* properties are not included here but exist).
Spark version: 2.4.0-cdh6.3.3
Consumer properties used at readStream
offsets.retention.minutes -> 1,
groupIdPrefix -> 710b6fb4-4454-4a52-819e-f565e047ecb7,
subscribe -> topic_x,
consumer.override.auto.offset.reset -> latest,
failOnDataLoss -> false,
startingOffset -> latest,
offsets.retention.check.interval.ms -> 1000
Reader Kafka topic readerB
val readerB = spark
.readStream
.format("kafka")
.options(consumerProperties)
.load()
Producer properties used at writeStream
topic -> topic_destination,
checkpointLocation -> checkPointLocation
Write stream block
val sendToKafka = dtXXXKafka
.select(to_sr(struct($"*"), xxx_out_topic, xxx_out_topic, schemaRegistryNamespace).alias("value"))
.writeStream
.format("kafka")
.options(producerProperties(properties, xxx_agg_out_topic, xxx_agg_out_localcheckpoint_hdfs))
.outputMode("append")
.start()
The startingOffsets property is applied only when a query is started, that is, only for the first batch of your streaming query. Afterwards, this property is ignored and the batches are defined according to the checkpoint data.
In your case, the streaming query starts by reading only the freshest data from your Kafka topic (thanks to the startingOffset -> latest setting). But the second batch (and all the following batches) will be defined according to the checkpoint; in other words, they will start exactly where the previous batch ended. The offsets folder in the checkpoint contains the ending offsets (exclusive) of each batch, so batch X starts from the offsets saved for batch X-1. This is how exactly-once delivery semantics are achieved in Spark Structured Streaming.
docs: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
If you want to process only the data from the last 30 seconds, I would suggest filtering the DataFrame on the timestamp field to include only the data from the desired time frame.
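As a rough sketch (assuming readerB from above is the Kafka source DataFrame; the Kafka source attaches a timestamp column to every record), the filter could look like this:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// keep only records whose Kafka timestamp is at most ~30 seconds old at processing time
val recentOnly = readerB
  .filter(col("timestamp").cast("long") >= unix_timestamp() - 30)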
I have a doubt regarding Spark checkpoints. I have a Spark Streaming application and I am managing checkpoints in HDFS using the following approach:
val checkpointDirectory = "hdfs://192.168.0.1:8020/markingChecksPoints"
df.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF
.write
.cassandraFormat(
"table",
"keyspace",
"clustername"
)
.mode(SaveMode.Append)
.save()
}
.outputMode(OutputMode.Append())
.option("checkpointLocation", checkpointDirectory)
.start()
Now when I run the application, I get 4 folders in the checkpoint directory:
commits
offsets
metadata
sources
In the offsets folder I have a file for each batch of consumed offsets, which looks like this:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1574680172097,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"30"}}
{"datatopicname":{"23":441210,"8":3094007,"17":44862,"26":0,"11":4302147,"29":0,"2":3758094,"20":6273,"5":4620156,"14":15375428,"4":4511998,"13":10652363,"22":1247616,"7":1787900,"16":1239584,"25":0,"10":3441724,"1":1808759,"28":0,"19":4123,"27":0,"9":3293762,"18":68,"12":4439364,"3":5910468,"21":182,"15":13510271,"6":2510314,"24":0,"0":40337}}
So now my question is: in case of failure or any other scenario, how can I modify my checkpoint directory so that when the application is restarted it picks up from a particular point?
I understand that whenever we restart the application it will automatically pick up from the checkpoint, which is fine, but in case I want to start it from a specific value or change it, what should I do?
Shall I simply edit the last created file in "offsets"?
Or delete the checkpoint directory and restart the application with a custom starting point for the first run, so that a new checkpoint directory will be created from then on?
There could be more files in the checkpointLocation directory (like state for stateful operators), but their role is exactly what you're asking for - in case of failure the stream processing engine of Spark Structured Streaming is going to resume a streaming query based on the checkpointed metadata.
Since these files are internal it's not recommended to amend the files in any way.
You can though (and perhaps that's one of the reasons why they're human-readable). Whether you change the existing offsets files or you create them from scratch does not really matter. The engine sees no difference. If the files are in proper format, they're going to be used. Otherwise, the checkpoint location will be of no use.
I'd rather use data source-specific options, e.g. startingOffsets for the kafka data source, to (re)start from a specific offset.
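For illustration, a sketch of such a restart (the broker address is a placeholder; the per-partition offsets are the kind of values found in the offsets file above), pointed at a fresh checkpoint location so that startingOffsets actually takes effect:

val restarted = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "datatopicname")
  // JSON map of topic -> partition -> starting offset; -2 means earliest, -1 means latest
  .option("startingOffsets", """{"datatopicname":{"0":40337,"1":1808759}}""")
  .load()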
The documentation says:
enable.auto.commit: Kafka source doesn’t commit any offset.
Hence my question is, in the event of a worker or partition crash/restart:
if startingOffsets is set to latest, how do we not lose messages?
if startingOffsets is set to earliest, how do we not reprocess all messages?
This seems to be quite important. Any indication on how to deal with it?
I also ran into this issue.
You're right in your observations on the 2 options i.e.
potential data loss if startingOffsets is set to latest
duplicate data if startingOffsets is set to earliest
However...
There is the option of checkpointing by adding the following option:
.writeStream
.<something else>
.option("checkpointLocation", "path/to/HDFS/dir")
.<something else>
In the event of a failure, Spark would go through the contents of this checkpoint directory and recover the state before accepting any new data.
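Putting both pieces together, a hedged sketch (broker, topic, and paths are made-up names): startingOffsets only governs the very first run, and once a checkpoint exists every restart resumes exactly from the committed batch offsets, so neither of the two scenarios above occurs.

val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest") // only used when no checkpoint exists yet
  .load()

events.writeStream
  .format("parquet")
  .option("path", "/data/events")
  .option("checkpointLocation", "/checkpoints/events") // restarts resume from here
  .start()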
I found this useful reference on the same.
Hope this helps!
I have a weird issue with trying to read data from Kafka using Spark structured streaming.
My use case is to be able to read from a topic from the largest/latest offset available.
My read configs:
val data = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "some xyz server")
.option("subscribe", "sampletopic")
.option("auto.offset.reset", "latest")
.option("startingOffsets", "latest")
.option("kafkaConsumer.pollTimeoutMs", 20000)
.option("failOnDataLoss","false")
.option("maxOffsetsPerTrigger",20000)
.load()
My write configs:
data
.writeStream
.outputMode("append")
.queryName("test")
.format("parquet")
.option("checkpointLocation", "s3://somecheckpointdir")
.start("s3://outpath").awaitTermination()
Versions used:
spark-core_2.11 : 2.2.1
spark-sql_2.11 : 2.2.1
spark-sql-kafka-0-10_2.11 : 2.2.1
I have done my research online and in the Kafka documentation (https://kafka.apache.org/0100/documentation.html).
I am using the new consumer APIs and, as the documentation suggests, I just need to set auto.offset.reset to "latest" or startingOffsets to "latest" to ensure that my Spark job starts consuming from the latest offset available per partition in Kafka.
I am also aware that the setting auto.offset.reset only kicks in when a new query is started for the first time and not on a restart of an application, in which case it will continue to read from the last saved offset.
I am using S3 for checkpointing my offsets, and I see them being generated under s3://somecheckpointdir.
The issue I am facing is that the Spark job always reads from the earliest offset even though the latest option is specified in the code, when the application is started for the first time, and I see this in the Spark logs:
auto.offset.reset = earliest being used. I have not seen posts related to this particular issue.
I would like to know if I am missing something here and if someone has seen this behavior before. Any help/direction will indeed be useful. Thank you.
All Kafka configurations should be set with the kafka. prefix. Hence the correct option key is kafka.auto.offset.reset.
You should never set auto.offset.reset. Instead, "set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off." [1]
[1] http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
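Applied to the code in the question (keeping its placeholder broker and topic names), the read could look roughly like this, with auto.offset.reset dropped entirely:

val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "sampletopic")
  .option("startingOffsets", "latest") // only applies when the query starts without a checkpoint
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 20000)
  .load()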
Update:
So I have done some testing on a local Kafka instance with a controlled set of messages going into Kafka. I see that the expected behavior works fine when the property startingOffsets is set to earliest or latest.
But the logs always show the property being picked up as earliest, which is a little misleading:
auto.offset.reset=earliest, even though I am not setting it.
Thank you.
You cannot set auto.offset.reset in Spark Structured Streaming, as per the documentation. To start from the latest offsets you just need to set the source option startingOffsets to specify where to start instead (earliest or latest). Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed.
The documentation clearly says that the following fields can't be set, and the Kafka source or sink will throw an exception:
group.id
auto.offset.reset
key.deserializer
value.deserializer
key.serializer
value.serializer
enable.auto.commit
interceptor.classes
For Structured Streaming, you can set startingOffsets to earliest so that you consume from the earliest available offset every time. The following will do the trick:
.option("startingOffsets", "earliest")
However note that this is effective just for newly created queries:
startingOffsets
The start point when a query is started, either "earliest" which is
from the earliest offsets, "latest" which is just from the latest
offsets, or a json string specifying a starting offset for each
TopicPartition. In the json, -2 as an offset can be used to refer to
earliest, -1 to latest. Note: For batch queries, latest (either
implicitly or by using -1 in json) is not allowed. For streaming
queries, this only applies when a new query is started, and that
resuming will always pick up from where the query left off. Newly
discovered partitions during a query will start at earliest.
Alternatively, you might also choose to change the consumer group every time:
.option("kafka.group.id", "newGroupID")
I have a question regarding Flink Kafka Consumer (FlinkKafkaConsumer09).
I've been using this version of connector:
flink-connector-kafka-0.9_2.11-1.1.2 (connector version is 0.9, Scala version is 2.11, Flink version is 1.1.2)
I gather communication data from Kafka within 5-minute tumbling windows. From what I've seen, the windows are aligned with system time (for example, windows end at 12:45, 12:50, 12:55, 13:00, etc.).
After window is closed, its records are processed/aggregated and sent via Sink operator to database.
Simplified version of my program:
env.addSource(new FlinkKafkaConsumer09<>(topicName,jsonMapper, properties))
.keyBy("srcIp", "dstIp", "dstPort")
.window(TumblingEventTimeWindows.of(Time.of(5, TimeUnit.MINUTES)))
.apply(new CounterSum<>())
.addSink(new DbSink(...));
However I need to commit offsets to Kafka. From what I've read, the only way with FlinkKafkaConsumer09 is to turn on checkpointing. I do it like this:
env.enableCheckpointing(300000); // 5 minutes
Checkpointing stores the state of all operators. After a checkpoint is complete, the offsets are committed to Kafka.
My checkpoints are stored via FsStateBackend in the taskmanager's local file system (the first problem - older checkpoint data are not deleted; I saw some bugs being reported for this).
The second problem is when the checkpoint is triggered. If it is triggered at the beginning of the window, the resulting checkpoint file is small; on the other hand, when it is triggered just before the window closes, the resulting state is large (for example 50 MB), because there are already many communication records in this window. The checkpoint process usually takes less than 1-2 s, but when the checkpoint is triggered after the window has closed, while aggregations are being processed and results written to the DB sink, the checkpoint process takes 45 s.
But the whole point is that I don't need state checkpointing at all. All I need is to commit the offsets to Kafka after the window has closed, been processed, and its resulting data written to the DB (or at the beginning of the next window). If a failover occurred, Flink would fetch the last offset from Kafka and read the data from the last 5-minute interval again. Because the last failed result was not sent to the DB, no duplicate data would reach the DB, and rereading the last 5-minute interval is negligible overhead.
So basically I have 2 questions:
Is there any way to turn checkpointing off and only commit offsets as described above?
If not, is there any way to align checkpointing with the start of the window? I read the Flink documentation - there is a feature called savepoints (i.e. manual checkpoints), but it is meant to be used from the command line. I would need to trigger a savepoint from code at window start - the state would be small and the checkpoint process would be quick.
In order to commit offsets to Kafka without relying on checkpointing, set the property enable.auto.commit=true and a commit interval via auto.commit.interval.ms=300000 in the Properties that are passed to the FlinkKafkaConsumer09 constructor:
Properties properties = new Properties();
// ... bootstrap.servers, group.id, deserializers, etc. as before ...
properties.setProperty("enable.auto.commit", "true");
properties.setProperty("auto.commit.interval.ms", "300000"); // commit every 5 minutes
env.addSource(new FlinkKafkaConsumer09<>(topicName, jsonMapper, properties))
This will only commit your offsets and will not interfere with checkpointing.