I am currently using spark-streaming-kafka-0-10_2.11 to connect my Spark application to the Kafka queue. For streams everything works fine. For one specific scenario, however, I just need the whole content of the Kafka queue exactly once; for this I got the suggestion to use KafkaUtils.createRDD instead (SparkStreaming: Read Kafka Stream and provide it as RDD for further processing).
However, for spark-streaming-kafka-0-10_2.11 I cannot figure out how to get the earliest and latest offsets for my Kafka topic, which I need in order to build the OffsetRange I have to hand off to the createRDD method.
What is the recommended way to get those offsets without opening a stream? Any help would be greatly appreciated.
After reading several discussions I am able to get the earliest or latest offset for a specific partition with:
import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
import kafka.common.TopicAndPartition
import kafka.consumer.SimpleConsumer

val consumer = new SimpleConsumer(host, port, timeout, bufferSize, "offsetfetcher")
val topicAndPartition = new TopicAndPartition(topic, initialPartition)
// Ask the broker for the earliest available offset of this partition
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
offsets.head // the earliest offset for this partition
but still, I do not know how to replicate the behaviour of --from-beginning from the kafka-console-consumer.sh CLI with the KafkaUtils.createRDD approach.
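One way this could be done without opening a stream is to use the plain new-consumer API to look up the beginning and end offsets per partition and build one OffsetRange each. This is only a minimal sketch, not the asker's code: the topic name, broker address and group id are placeholders, `sc` is assumed to be an existing SparkContext, and beginningOffsets/endOffsets require kafka-clients 0.10.1 or newer.

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

val topic = "mytopic"                                   // hypothetical topic name
val kafkaParams: Map[String, Object] = Map(
  "bootstrap.servers"  -> "localhost:9092",             // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "offset-lookup")              // hypothetical group id

// Use a short-lived consumer only to look up the partition boundaries.
val consumer = new KafkaConsumer[String, String](kafkaParams.asJava)
val partitions = consumer.partitionsFor(topic).asScala
  .map(info => new TopicPartition(topic, info.partition))
val earliest = consumer.beginningOffsets(partitions.asJava).asScala
val latest   = consumer.endOffsets(partitions.asJava).asScala
consumer.close()

// One OffsetRange per partition, covering everything currently in the topic.
val offsetRanges = partitions.map { tp =>
  OffsetRange.create(tp.topic, tp.partition, earliest(tp), latest(tp))
}.toArray

val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams.asJava, offsetRanges, LocationStrategies.PreferConsistent)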
Related
I need some help with Kafka Streams. I have started a Kafka Streams application which is streaming one topic from the very first offset. The topic holds a huge amount of data, so I want to implement a mechanism in my application, using Kafka Streams, that notifies me when the topic has been read completely up to the very last offset.
In the Kafka Streams 2.8.0 API I found the method allLocalStorePartitionLags, which returns a map of store names to another map keyed by partition, containing the lag information for each partition. It covers all store partitions (active or standby) local to this Streams instance. That is quite useful for the case above, when I have a single node running the streams application.
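For reference, a minimal sketch (not the asker's code) of how that lag map could be inspected on one node; the running KafkaStreams instance is passed in and assumed to exist:

import scala.collection.JavaConverters._
import org.apache.kafka.streams.KafkaStreams

// True once every store partition hosted locally by this instance reports zero lag.
def localLagIsZero(streams: KafkaStreams): Boolean =
  streams.allLocalStorePartitionLags().asScala   // store name -> (partition -> LagInfo)
    .values
    .flatMap(_.asScala.values)
    .forall(_.offsetLag() == 0L)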
But in my case the system is distributed: there are 3 application nodes and 10 topic partitions, which means each node has at least 3 partitions of the topic to read from.
I need help here. How can I implement this functionality so that I get notified when the topic has been read completely, from partition 0 to partition 9? Please note that I don't have the option of using a database here as of now.
Other approaches to achieve the goal are also welcome. Thank you.
I was able to obtain the lag information from the AdminClient API. The code below retrieves the end offsets and the current (committed) offsets for each partition of the topics read by the given streams application, i.e. its applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);

// Current (committed) offsets of the streams application's consumer group.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();

// All topic partitions the group has committed offsets for.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();

// End (latest) offsets for each of those partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
        .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));

// Lag per partition = end offset - committed offset; the topic has been read
// completely once every partition reports a lag of 0.
Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = listOffsetsResult.all().get();
I am fairly new to Flink and Kafka and have some data aggregation jobs written in Scala which run in Apache Flink; the jobs consume data from Kafka, perform aggregation, and produce results back to Kafka.
I need the jobs to consume data from any new Kafka topic that matches a pattern and is created while the job is running. I got this working by setting the following properties for my consumer:
import java.util.Properties
import java.util.regex.Pattern
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

val properties = new Properties()
properties.setProperty("bootstrap.servers", "my-kafka-server")
properties.setProperty("group.id", "my-group-id")
properties.setProperty("zookeeper.connect", "my-zookeeper-server")
properties.setProperty("security.protocol", "PLAINTEXT")
properties.setProperty("flink.partition-discovery.interval-millis", "500")
properties.setProperty("enable.auto.commit", "true")
properties.setProperty("auto.offset.reset", "earliest")

val consumer = new FlinkKafkaConsumer011[String](Pattern.compile("my-topic-start-.*"), new SimpleStringSchema(), properties)
The consumer works fine and consumes data from existing topics which start with "my-topic-start-".
When I publish data to a new topic, say "my-topic-start-test1", for the first time, my consumer does not recognise the topic until 500 milliseconds after it was created, which matches the properties above.
When the consumer does identify the topic, it does not read the first data record published but starts reading from subsequent records, so effectively I lose the first data record every time data is published to a new topic.
Is there a setting I am missing or is it how Kafka works? Any help would be appreciated.
Thanks
Shravan
I think part of the issue is that my producer was creating the topic and publishing the message in one go, so by the time the consumer discovered the new partition, that message had already been produced.
As a temporary solution I updated my producer to create the topic if it does not exist and only then publish a message (making it a two-step process), and this works; a rough sketch of that workaround follows.
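For illustration only (not the exact job code), such a two-step producer might look like this; the broker address, topic name and producer settings are placeholders:

import java.util.{Collections, Properties}
import java.util.concurrent.ExecutionException
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.errors.TopicExistsException

val props = new Properties()
props.setProperty("bootstrap.servers", "my-kafka-server")   // same broker as above
props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val admin = AdminClient.create(props)
val producer = new KafkaProducer[String, String](props)

// Step 1: make sure the topic exists before the first record is sent, so the
// Flink consumer's partition discovery can see it.
def ensureTopic(topic: String): Unit =
  try {
    admin.createTopics(Collections.singleton(new NewTopic(topic, 1, 1.toShort))).all().get()
  } catch {
    case e: ExecutionException if e.getCause.isInstanceOf[TopicExistsException] => // already there, fine
  }

ensureTopic("my-topic-start-test1")

// Step 2: publish, ideally after waiting at least one partition-discovery interval (500 ms here).
Thread.sleep(1000)
producer.send(new ProducerRecord[String, String]("my-topic-start-test1", "first record"))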
Would be nice to have a more robust consumer side solution though :)
I've ended up in a (strange) situation where, briefly, I don't want to consume any new records from Kafka, so I'd like to pause the Spark Streaming consumption (InputDStream[ConsumerRecord]) for all partitions in the topic, do some operations, and finally resume consuming records.
First of all... is this possible?
I've been trying something like this:
var consumer: KafkaConsumer[String, String] = _
consumer = new KafkaConsumer[String, String](properties)
consumer.subscribe(java.util.Arrays.asList(topicName))
consumer.pause(consumer.assignment())
...
consumer.resume(consumer.assignment())
but I got this:
println(s"Assigned partitions: $consumer.assignment()") --> []
println(s"Paused partitions: ${consumer.paused()}") --> []
println(s"Partitions for: ${consumer.partitionsFor(topicNAme)}") --> [Partition(topic=topicAAA, partition=0, leader=1, replicas=[1,2,3], partition=1, ... ]
Any help to understand what I'm missing and why I'm getting empty results when it's clear the consumer has partitions assigned will be welcomed!
Versions:
Kafka: 0.10
Spark: 2.3.0
Scala: 2.11.8
Yes, it is possible.
Add checkpointing to your code and pass a persistent storage path (local disk, S3, HDFS);
whenever you start/resume your job it will pick up the Kafka consumer group info, including the consumer offsets, from the checkpoint and start processing from where it was stopped.
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
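As a rough illustration (not a full job), the factory passed to getOrCreate could look like the sketch below; the app name, batch interval, checkpoint path and the Kafka stream construction are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs:///tmp/my-app-checkpoints"    // hypothetical persistent path

// Only invoked when no checkpoint exists yet; on restart the context
// is rebuilt from the checkpoint data instead.
def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")   // hypothetical app name
  val ssc  = new StreamingContext(conf, Seconds(10))          // hypothetical batch interval
  ssc.checkpoint(checkpointDirectory)                         // offsets + DAG state go here

  // Build the Kafka direct stream and the processing logic here,
  // e.g. with KafkaUtils.createDirectStream(...).

  ssc
}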
Spark checkpointing is a mechanism that saves not only the offsets but also the serialized state of your DAG, i.e. of your stages and jobs. So whenever you restart your job with new code it will:
Read and process the serialized data
Clean the cached DAG stages if there are any code changes in your Spark App
Resume processing the new data with the latest code.
Reading from disk here is a one-time operation that Spark needs in order to load the Kafka offsets, the DAG, and the old, incompletely processed data.
Once that is done, it keeps saving the data to disk at the default or a specified checkpoint interval.
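If you want to control that interval for a particular DStream rather than rely on the default, a one-line sketch (with a hypothetical stream and interval) would be:

import org.apache.spark.streaming.Seconds

// `stream` is the DStream built inside functionToCreateContext (assumed);
// checkpoint its RDDs every 30 seconds instead of at the default interval.
stream.checkpoint(Seconds(30))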
Spark Streaming provides an option to specify the Kafka group id, but Spark Structured Streaming does not.
For example, I have a topic "test" with 4 partitions. I start the streaming app and, after some time, the app crashes. I want to specify the offset for consumption; I want to consume data from the offset after the last read.
But I can't find anything that helps me achieve this using the kafka-streams API.
P.S. We are using kafka-2.1.0
I have a weird issue with trying to read data from Kafka using Spark structured streaming.
My use case is to be able to read from a topic from the largest/latest offset available.
My read configs:
val data = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "some xyz server")
.option("subscribe", "sampletopic")
.option("auto.offset.reset", "latest")
.option("startingOffsets", "latest")
.option("kafkaConsumer.pollTimeoutMs", 20000)
.option("failOnDataLoss","false")
.option("maxOffsetsPerTrigger",20000)
.load()
My write configs:
data
.writeStream
.outputMode("append")
.queryName("test")
.format("parquet")
.option("checkpointLocation", "s3://somecheckpointdir")
.start("s3://outpath").awaitTermination()
Versions used:
spark-core_2.11 : 2.2.1
spark-sql_2.11 : 2.2.1
spark-sql-kafka-0-10_2.11 : 2.2.1
I have done my research online and read [the Kafka documentation](https://kafka.apache.org/0100/documentation.html).
I am using the new consumer APIs, and as the documentation suggests I just need to set auto.offset.reset to "latest" or startingOffsets to "latest" to ensure that my Spark job starts consuming from the latest offset available per partition in Kafka.
I am also aware that the setting auto.offset.reset only kicks in when a new query is started for the first time and not on a restart of an application in which case it will continue to read from the last saved offset.
I am using S3 for checkpointing my offsets, and I see them being generated under s3://somecheckpointdir.
The issue I am facing is that, when the application is started for the first time, the Spark job always reads from the earliest offset even though latest is specified in the code, and I see this in the Spark logs:
auto.offset.reset = earliest being used. I have not seen posts related to this particular issue.
I would like to know if I am missing something here and if someone has seen this behavior before. Any help/direction will indeed be useful. Thank you.
All Kafka-specific configurations have to be set with the kafka. prefix, so the option key would have to be kafka.auto.offset.reset.
You should never set auto.offset.reset. Instead, "set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off." [1]
[1] http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
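Applied to the read configuration from the question, a minimal corrected version (per the documentation above, with the same placeholder broker and topic) would simply drop auto.offset.reset and keep startingOffsets:

val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "sampletopic")
  .option("startingOffsets", "latest")   // only honoured when the query starts without a checkpoint
  .option("failOnDataLoss", "false")
  .option("maxOffsetsPerTrigger", 20000)
  .load()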
Update:
So I have done some testing on a local Kafka instance with a controlled set of messages going into Kafka. I see that the expected behavior works fine when the property startingOffsets is set to earliest or latest.
But the logs always show the property being picked up as earliest, which is a little misleading:
auto.offset.reset=earliest, even though I am not setting it.
Thank you.
You cannot set auto.offset.reset in Spark Structured Streaming, as per the documentation. To start from the latest offsets you just need to set the source option startingOffsets to specify where to start instead (earliest or latest). Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This ensures that no data is missed when new topics/partitions are dynamically subscribed.
The documentation clearly says that the following fields can't be set, and the Kafka source or sink will throw an exception if they are:
group.id
auto.offset.reset
key.deserializer
value.deserializer
key.serializer
value.serializer
enable.auto.commit
interceptor.classes
For Structured Streaming you can set startingOffsets to earliest so that you consume from the earliest available offset every time. The following will do the trick:
.option("startingOffsets", "earliest")
However note that this is effective just for newly created queries:
startingOffsets
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
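To make the json form concrete, here is a hypothetical per-partition specification for the 4-partition "test" topic mentioned earlier (the offset numbers are made up): partitions 0 and 1 start at explicit offsets, partition 2 at earliest (-2), partition 3 at latest (-1).

val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "test")
  .option("startingOffsets", """{"test":{"0":1500,"1":3000,"2":-2,"3":-1}}""")
  .load()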
Alternatively, you might also choose to change the consumer group every time:
.option("kafka.group.id", "newGroupID")