Spark Structured Streaming Kafka Integration Offset management - scala

The documentation says:
enable.auto.commit: Kafka source doesn’t commit any offset.
Hence my question is: in the event of a worker or partition crash/restart,
if startingOffsets is set to latest, how do we not lose messages?
if startingOffsets is set to earliest, how do we not reprocess all messages?
This seems to be quite important. Any indication on how to deal with it?

I also ran into this issue.
You're right in your observations on the two options, i.e.
potential data loss if startingOffsets is set to latest
duplicate data if startingOffsets is set to earliest
However...
There is the option of checkpointing by adding the following option:
.writeStream
.<something else>
.option("checkpointLocation", "path/to/HDFS/dir")
.<something else>
In the event of a failure, Spark will go through the contents of this checkpoint directory and recover the state before accepting any new data.
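For illustration, here is a minimal end-to-end sketch in Scala; the servers, topic, and sink are placeholders of my own, not anything from the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()

// Read from Kafka; startingOffsets applies only the first time this query runs.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "some-topic")                   // placeholder
  .option("startingOffsets", "latest")
  .load()

// On restart with the same checkpointLocation, Spark resumes from the offsets
// recorded there instead of re-applying startingOffsets.
val query = input.writeStream
  .format("console")
  .option("checkpointLocation", "path/to/HDFS/dir")    // same dir across restarts
  .start()

query.awaitTermination()

With the same checkpointLocation kept across restarts, the query resumes from the recorded offsets, so neither startingOffsets setting causes loss or full reprocessing after the first run.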
I found this useful reference on the same.
Hope this helps!

Related

kafka offset in spark

Kafka enable.auto.commit is set to false and Spark version is 2.4
If using the latest offset, do we need to manually find the last offset details and mention them in createDirectStream() in the Spark application? Or will it automatically take the latest offset? In any case, do we need to find the last offset details manually?
Is there any difference between using SparkSession.readStream.format("kafka")... and KafkaUtils.createDirectStream()?
When using the earliest offset option, will it pick up the offset automatically?
Here is my attempt to answer your questions
Ques 1: enable.auto.commit is a Kafka-related parameter and, if set to false, requires offsets to be committed (read: updated) manually; Spark does this for you through the checkpoint directory. If your application restarts, it will look into the checkpoint directory and start reading from the last committed offset + 1. The same is mentioned here, https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-enable-auto-commit.html, by Jacek Laskowski. There is no need to specify the offset anywhere as part of your Spark application; all you need is the checkpoint directory. Also, remember that offsets are maintained per partition of a topic, per consumer, so it would be unreasonable for Spark to expect developers/users to provide them.
Ques 2: spark.readStream is a generic method to read data from streaming sources such as a TCP socket, Kafka topics, etc., while KafkaUtils is a dedicated class for integrating Spark with Kafka, so I assume it is more optimised if you are using Kafka topics as a source. I usually use KafkaUtils on my own, though I haven't done any performance benchmarks. Both can subscribe to more than one topic: KafkaUtils takes a collection of topics, while readStream accepts a comma-separated list in its subscribe option (see the sketch below).
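A minimal Scala sketch contrasting the two APIs; the servers, topics, and group id are illustrative placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// DStream API: KafkaUtils.createDirectStream, multiple topics via a collection.
val ssc = new StreamingContext(new SparkConf().setAppName("dstream-demo"), Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                     // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val dstream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("topicA", "topicB"), kafkaParams))

// Structured API: multiple topics via a comma-separated subscribe option.
val spark = SparkSession.builder().appName("structured-demo").getOrCreate()
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicA,topicB")
  .load()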
Ques 3: earliest offset means your consumer will start reading from the oldest record available. For example, if your topic is new (no cleanup has happened) or cleanup is not configured for the topic, it will start from offset 0. In case cleanup is configured and all records up to offset 2000 have been removed, records will be read from offset 2001, even though the topic may have records up to offset 10000 (this assumes there is only one partition; in topics with multiple partitions the offset values will differ). See the section on batch queries here, https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html, for more details.
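If you do need explicit control over where to start, startingOffsets also accepts a JSON map of per-partition offsets. A minimal sketch, assuming an existing SparkSession named spark and an illustrative topic name:

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "mytopic")                       // illustrative topic
  // Start partition 0 at offset 2001 and partition 1 at earliest (-2).
  .option("startingOffsets", """{"mytopic":{"0":2001,"1":-2}}""")
  .load()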
If you take a look at the documentation of the Kafka connector for Spark, you can find most of the answers.
Here is the documentation about the startingOffsets option for the Kafka connector; the last part is about streaming queries.
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
If checkpointed offsets are available, the query will always pick them up; otherwise it will ask Kafka for the earliest or latest offset. This should be true for both kinds of streams: direct and structured streams both honor stored offsets.
I see that you mentioned the enable.auto.commit option, and I just want to make sure you're aware of the following quote from the same documentation site I provided above.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception: enable.auto.commit: Kafka source doesn’t commit any offset.

PySpark and Kafka "Set are gone. Some data may have been missed.."

I'm running PySpark using a Spark cluster in local mode and I'm trying to write a streaming DataFrame to a Kafka topic.
When I run the query, I get the following message:
java.lang.IllegalStateException: Set(topicname-0) are gone. Some data may have been missed..
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you don't want your streaming query to fail on such cases, set the
source option "failOnDataLoss" to "false".
This is my code:
query = (
    output_stream
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "ratings-cleaned")
    .option("checkpointLocation", "checkpoints-folder")
    .start()
)
sleep(2)
print(query.status)
This error message typically shows up when some messages/offsets were removed from the source topic since the last run of the query, usually due to the topic's cleanup policy, such as the retention time.
Imagine your topic has messages with offsets 0, 1, 2, which have all been processed by the application. The checkpoint files store that last offset 2, so the query remembers to continue with offset 3 the next time it starts.
After some time, messages with offsets 3, 4, 5 were produced to the topic, but messages with offsets 0, 1, 2, 3 were removed from the topic due to its retention.
Now, when restarting your Spark Structured Streaming job, it tries to fetch offset 3 based on its checkpoint files but realises that only the messages from offset 4 onwards are available. Exactly in that case it will throw this exception.
You can solve this by
setting .option("failOnDataLoss", "false") in your readStream operation (see the sketch below), or
deleting the existing checkpoint files.
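For the first option, a minimal Scala sketch (the PySpark call is the same option on readStream; servers and topic are placeholders of my own):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("failondataloss-demo").getOrCreate()
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "ratings")                       // placeholder
  // Keep the query running when checkpointed offsets are no longer in Kafka.
  .option("failOnDataLoss", "false")
  .load()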
According to the Structured Streaming + Kafka Integration Guide the option failOnDataLoss is described as:
"Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data."
On top of the answers above, Bartosz Konieczny posted a more detailed reason. The first part of the error message names the set of topic partitions that are gone (hence the -0 at the end of the topic name). That means the partition to which the Spark cluster subscribed has been deleted. My guess is the Kafka setup was restarted. The Spark queries are using some default checkpoint folder that assumes the Kafka setup was not restarted.
This error message hints at issues with the checkpoints. During development, this can be caused by using an old checkpoints folder with an updated query.
If this is in a development environment and you don't need to save the state of the previous query, you can just remove the checkpoints folder (checkpoints-folder in the code example) and rerun your query.

Apache Storm with Kafka offset management

I have built a sample topology with Storm using Kafka as a source. Here is a problem for which I need a solution.
Every time I kill a topology and start it again, the topology starts processing from the beginning.
Suppose Message A in Topic X was processed by the topology, and then I kill the topology.
Now when I submit the topology again and Message A is still there in Topic X, it is processed again.
Is there a solution, maybe some sort of offset management, to handle this situation?
You shouldn't use storm-kafka for new code, it is deprecated since the underlying client API is deprecated in Kafka, and removed as of 2.0.0. Instead, use storm-kafka-client.
With storm-kafka-client you want to set a group id, and a first poll offset strategy.
KafkaSpoutConfig.builder(bootstrapServers, "your-topic")
.setProp(ConsumerConfig.GROUP_ID_CONFIG, "kafkaSpoutTestGroup")
.setFirstPollOffsetStrategy(UNCOMMITTED_EARLIEST)
.build();
The above will make your spout start at the earliest offset the first time you start it, and then it will pick up where it left off if you restart it. The group id is used by Kafka to recognize the spout when it restarts, so it can get the stored offset checkpoint back. Other offset strategies will behave differently; you can check the Javadoc for the FirstPollOffsetStrategy enum.
The spout will periodically checkpoint how far it has got; there is a setting in the config to control this. The checkpointing is controlled by the setProcessingGuarantee setting in the config, which can be set to at-least-once (only checkpoint acked offsets), at-most-once (checkpoint before the spout emits the message), and no-guarantee (checkpoint periodically, ignoring acks).
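A hedged Scala sketch of both settings together; the servers, topic, and group id are placeholders, and the import paths assume Storm 2.x (in 1.x, FirstPollOffsetStrategy is nested inside KafkaSpoutConfig):

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.storm.kafka.spout.{FirstPollOffsetStrategy, KafkaSpoutConfig}
import org.apache.storm.kafka.spout.KafkaSpoutConfig.ProcessingGuarantee

val spoutConfig = KafkaSpoutConfig
  .builder("localhost:9092", "your-topic")                        // placeholders
  .setProp(ConsumerConfig.GROUP_ID_CONFIG, "kafkaSpoutTestGroup") // placeholder
  .setFirstPollOffsetStrategy(FirstPollOffsetStrategy.UNCOMMITTED_EARLIEST)
  // AT_LEAST_ONCE: commit only acked offsets; AT_MOST_ONCE: commit before emit;
  // NO_GUARANTEE: commit periodically, ignoring acks.
  .setProcessingGuarantee(ProcessingGuarantee.AT_LEAST_ONCE)
  .build()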
Take a look at one of the example topologies included with Storm https://github.com/apache/storm/blob/dc56e32f3dcdd9396a827a85029d60ed97474786/examples/storm-kafka-client-examples/src/main/java/org/apache/storm/kafka/spout/KafkaSpoutTopologyMainNamedTopics.java#L93.
Make sure when creating your spoutconfig that it has a fixed spout id by which it can identify itself after a restart.
From official Storm site:
Important: When re-deploying a topology make sure that the settings
for SpoutConfig.zkRoot and SpoutConfig.id were not modified, otherwise
the spout will not be able to read its previous consumer state
information (i.e. the offsets) from ZooKeeper -- which may lead to
unexpected behavior and/or to data loss, depending on your use case.
Since I am facing a similar issue, I'll take advantage of this thread and ask. I have code like this:
KafkaTridentSpoutConfig.Builder kafkaSpoutConfigBuilder = KafkaTridentSpoutConfig.builder(bootstrapServers, topic);
kafkaSpoutConfigBuilder.setProp(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, fetchSizeBytes);
kafkaSpoutConfigBuilder.setProp(ConsumerConfig.GROUP_ID_CONFIG, clientId);
kafkaSpoutConfigBuilder.setProp(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
kafkaSpoutConfigBuilder.setProp(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
return new KafkaTridentSpoutOpaque(kafkaSpoutConfigBuilder.build());
But every time I restart the Storm local cluster, messages are read from the beginning. If I check offsets directly in Kafka for the particular group, there is no lag. It's as if the offsets from Kafka are not read.
Using Kafka 2.8, Storm 2.2.0. I didn't have this issue with Storm 0.9.X.
Any idea?
Thanks!

Spark structured streaming query always starts with auto.offset.reset=earliest even though auto.offset.reset=latest is set

I have a weird issue with trying to read data from Kafka using Spark structured streaming.
My use case is to be able to read from a topic from the largest/latest offset available.
My read configs:
val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "sampletopic")
  .option("auto.offset.reset", "latest")
  .option("startingOffsets", "latest")
  .option("kafkaConsumer.pollTimeoutMs", 20000)
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 20000)
  .load()
My write configs:
data
  .writeStream
  .outputMode("append")
  .queryName("test")
  .format("parquet")
  .option("checkpointLocation", "s3://somecheckpointdir")
  .start("s3://outpath")
  .awaitTermination()
Versions used:
spark-core_2.11 : 2.2.1
spark-sql_2.11 : 2.2.1
spark-sql-kafka-0-10_2.11 : 2.2.1
I have done my research online and read the Kafka documentation (https://kafka.apache.org/0100/documentation.html).
I am using the new consumer APIs, and as the documentation suggests I just need to set auto.offset.reset to "latest" or startingOffsets to "latest" to ensure that my Spark job starts consuming from the latest offset available per partition in Kafka.
I am also aware that auto.offset.reset only kicks in when a new query is started for the first time and not on a restart of the application, in which case it will continue to read from the last saved offset.
I am using S3 for checkpointing my offsets, and I see them being generated under s3://somecheckpointdir.
The issue I am facing is that the Spark job always reads from the earliest offset when the application is started for the first time, even though the latest option is specified in the code, and I see this in the Spark logs:
auto.offset.reset = earliest being used. I have not seen posts related to this particular issue.
I would like to know if I am missing something here and if someone has seen this behavior before. Any help/direction will indeed be useful. Thank you.
All Kafka configurations should be set with the kafka. prefix. Hence the correct option key would be kafka.auto.offset.reset.
You should never set auto.offset.reset. Instead, "set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off." [1]
[1] http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
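Applied to the read configuration from the question, that means dropping auto.offset.reset entirely and keeping only startingOffsets; a sketch reusing the question's own placeholders and its SparkSession named spark:

val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "sampletopic")
  // auto.offset.reset removed: Structured Streaming manages offsets itself,
  // and startingOffsets governs only the very first run of the query.
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 20000)
  .load()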
Update:
So I have done some testing on a local Kafka instance with a controlled set of messages going into Kafka. I see that the expected behavior works fine when the property startingOffsets is set to earliest or latest.
But the logs always show the property being picked up as earliest, which is a little misleading:
auto.offset.reset=earliest, even though I am not setting it.
Thank you.
You cannot set auto.offset.reset in Spark Streaming, as per the documentation. To start from the latest offsets, you just need to set the source option startingOffsets to specify where to start (earliest or latest). Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This ensures that no data is missed when new topics/partitions are dynamically subscribed.
It clearly says that the following fields can't be set, and the Kafka source or sink will throw an exception:
group.id
auto.offset.reset
key.deserializer
value.deserializer
key.serializer
value.serializer
enable.auto.commit
interceptor.classes
For Structured Streaming, you can set startingOffsets to earliest so that the query consumes from the earliest available offset. The following will do the trick:
.option("startingOffsets", "earliest")
However, note that this is effective only for newly created queries:
startingOffsets
The start point when a query is started, either "earliest" which is
from the earliest offsets, "latest" which is just from the latest
offsets, or a json string specifying a starting offset for each
TopicPartition. In the json, -2 as an offset can be used to refer to
earliest, -1 to latest. Note: For batch queries, latest (either
implicitly or by using -1 in json) is not allowed. For streaming
queries, this only applies when a new query is started, and that
resuming will always pick up from where the query left off. Newly
discovered partitions during a query will start at earliest.
Alternatively, you might also choose to change the consumer group every time (note that kafka.group.id can only be set in newer Spark versions; in Spark 2.x, group.id is among the forbidden options listed above):
.option("kafka.group.id", "newGroupID")

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific topics in Kafka on a daily basis.
For spark streaming, I know that createDirectStream only needs to include a list of topics and some configuration information as arguments.
However, I realized that createRDD would have to include all of the topic, partitions, and offset information.
I want to make batch processing as convenient as streaming in spark.
Is it possible?
I suggest you read this text from Cloudera.
This example shows you how to get the data from Kafka exactly once, persisting the offsets in Postgres to take advantage of its ACID architecture.
So I hope that will solve your problem.
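To address the convenience question directly, here is a hedged sketch of the batch path with KafkaUtils.createRDD; the servers, topic, partitions, and offset bounds are made-up placeholders you would load from your offset store (e.g. the Postgres table mentioned above):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, OffsetRange}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val sc = new SparkContext(new SparkConf().setAppName("kafka-daily-batch"))

// Kafka client settings; createRDD takes a java.util.Map.
val kafkaParams = new java.util.HashMap[String, Object]()
kafkaParams.put("bootstrap.servers", "localhost:9092")                  // placeholder
kafkaParams.put("key.deserializer", classOf[StringDeserializer].getName)
kafkaParams.put("value.deserializer", classOf[StringDeserializer].getName)
kafkaParams.put("group.id", "daily-batch")                              // placeholder

// One OffsetRange per topic partition: [fromOffset, untilOffset).
// In practice, load these bounds from wherever you persisted them.
val offsetRanges = Array(
  OffsetRange("mytopic", 0, 0L, 10000L),
  OffsetRange("mytopic", 1, 0L, 10000L)
)

val rdd = KafkaUtils.createRDD[String, String](sc, kafkaParams, offsetRanges, PreferConsistent)
println(s"Read ${rdd.count()} records")

Unlike createDirectStream, the batch API makes the offset bookkeeping explicit, which is exactly why persisting offsets (as the Cloudera text describes) is the key to making daily batch reads as convenient as streaming.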