Is Spark aware of new partitions that get added in Kafka? - Scala

We recently had an issue where some of the Kafka partitions were lost and the job continued without failing. In the meantime, new Kafka partitions were added. It looks like our Spark Streaming job was not restarted and was not receiving any data from the new partitions, until we noticed the discrepancy in the counts. We restarted the jobs and everything was fine. So my question is: doesn't the spark-kafka streaming API check from time to time whether new partitions were added? Is there any special setting to enable that?

AFAIK, Spark's Kafka consumer will not automatically rebalance its consumer group when new topics/partitions are added.
That's one of the benefits often listed when comparing Spark Streaming with Kafka Streams: Kafka Streams will rebalance automatically.
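For context, here is a minimal sketch of how such a job is typically wired up with the spark-streaming-kafka-0-10 direct API (broker address, group id and topic name are placeholders). The subscription is resolved when the stream is created, so partitions added later are generally only picked up after a restart:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-direct-example")
val ssc = new StreamingContext(conf, Seconds(30))

// Placeholder connection settings for the direct consumer.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-group"
)

// The topic (and partition) set is fixed here, at stream creation time.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)

Restarting the job, as you did, recreates the stream against the current partition list.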

Related

Kafka - message not getting consumed from only one of the partitions

We have a single-node Kafka service running Kafka 2.13. For one of the topics we have configured 20 partitions, and there is just one consumer group consuming from this topic. This setup has been working fine for a long time. Recently we have been seeing Kafka rebalance this consumer group frequently. After a while the consumers start consuming again, but on one of the partitions the current offset doesn't move forward at all, indicating that messages are stuck in that partition. Messages from the other partitions are consumed without any issues. The logs from the Kafka service don't show any problem. Any hints on what is going wrong and how to identify/rectify it?
A possible explanation is that the specific partition has a message that takes a long time to process. That could explain both the frequent rebalancing (the consumer exceeds the allowed poll interval and gets kicked out of the group) and the stuck offset.
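If that is the case, a common mitigation (assuming a plain KafkaConsumer; these are the standard consumer config names) is to poll fewer records at a time and allow a longer gap between polls, so slow processing doesn't get the consumer evicted from the group:

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

val props = new Properties()
// Fewer records per poll() means less work between consecutive polls ...
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "50")          // default is 500
// ... and a longer max.poll.interval.ms tolerates slow messages before a rebalance is triggered.
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000")  // default is 300000 (5 minutes)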

Topic and partition discovery for Kafka consumer

I am fairly new to Flink and Kafka and have some data aggregation jobs written in Scala which run in Apache Flink; the jobs consume data from Kafka, perform an aggregation, and produce the results back to Kafka.
I need the jobs to consume data from any new Kafka topic that matches a pattern and is created while the job is running. I got this working by setting the following properties for my consumer:
val properties = new Properties()
properties.setProperty("bootstrap.servers", "my-kafka-server")
properties.setProperty("group.id", "my-group-id")
properties.setProperty("zookeeper.connect", "my-zookeeper-server")
properties.setProperty("security.protocol", "PLAINTEXT")
properties.setProperty("flink.partition-discovery.interval-millis", "500")
properties.setProperty("enable.auto.commit", "true")
properties.setProperty("auto.offset.reset", "earliest")
val consumer = new FlinkKafkaConsumer011[String](Pattern.compile("my-topic-start-.*"), new SimpleStringSchema(), properties)
The consumer works fine and consumes data from existing topics whose names start with "my-topic-start-".
When I publish data to a new topic, say "my-topic-start-test1", for the first time, my consumer does not recognise the topic until up to 500 milliseconds after it was created, which matches the discovery interval set in the properties.
When the consumer discovers the topic, it does not read the first record published; it starts reading from subsequent records, so effectively I lose the first record every time data is published to a new topic.
Is there a setting I am missing, or is this just how Kafka works? Any help would be appreciated.
Thanks
Shravan
I think part of the issue is that my producer was creating the topic and publishing the message in one go, so by the time the consumer discovered the new partition, that message had already been produced.
As a temporary solution I updated my producer to create the topic if it does not exist and then publish the message (a two-step process), and this works.
It would be nice to have a more robust consumer-side solution though :)
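A rough sketch of that two-step producer using the Kafka AdminClient (broker address and topic name are the placeholders from above):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val bootstrap = "my-kafka-server"
val topic = "my-topic-start-test1"

// Step 1: create the topic explicitly if it doesn't exist yet.
val adminProps = new Properties()
adminProps.put("bootstrap.servers", bootstrap)
val admin = AdminClient.create(adminProps)
if (!admin.listTopics().names().get().contains(topic)) {
  admin.createTopics(Collections.singleton(new NewTopic(topic, 1, 1.toShort))).all().get()
}
admin.close()

// Step 2: publish the message once the topic is known to exist
// (ideally after waiting at least the partition-discovery interval).
val producerProps = new Properties()
producerProps.put("bootstrap.servers", bootstrap)
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProps)
producer.send(new ProducerRecord[String, String](topic, "key", "value")).get()
producer.close()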

Multiple Flink pipelines for the same Kafka topic

Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of data that our Flink setup is not configured to handle. We don't want to size our Flink pipeline and cluster to always support the maximum possible load; we want to scale dynamically according to the load (budget reasons $$$).
Solutions we thought of
One way to do so is to add/remove nodes to the Flink cluster and change the parallelism of the Flink pipeline operators. This will require stopping the Flink job with a snapshot, reconfiguring the parallelism and restarting with new parallelism.
This would be great but we cannot allow ourselves the downtime it produces. We have to scale up/down without downtime.
If we used regular Kafka consumers it would be as simple as adding a consumer (assuming we have enough Kafka partitions), and Kafka would redistribute the topic partitions between all the consumers.
The Flink Kafka consumer manages the partition assignment and the offsets on its own, which enables exactly-once semantics (which we don't need). The drawback is that a single Flink job always uses all of the topic's partitions.
We thought we could create another Flink instance that subscribes to the same topic with the same group and let Kafka distribute the partitions between them. But for that we would need the Flink Kafka consumer to let Kafka manage which partitions are assigned to which consumer.
What are we looking for
We couldn't find a library that contains such a consumer, or a configuration option in the existing consumer. We could write it on our own (it isn't that difficult), but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology so it can write Kafka messages it can't process quickly enough to a second, overflow Kafka topic. Both the input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing, or maybe you would have a flag in the topology that you can set externally while the topology is running. That's a design detail you can work through, and it has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion, while writing the rest of the messages it can't handle to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the writing to the overflow topic happens very early in the topology, you could chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
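A minimal sketch of what the routing part could look like in Scala, assuming Flink 1.x with the 0.11 Kafka connector (topic names, properties, and the OverflowRoutingJob/overflowEnabled flag are hypothetical; exact import paths vary by Flink version). It uses a side output to divert records to the overflow topic:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}
import org.apache.flink.util.Collector

object OverflowRoutingJob {
  // Hypothetical switch; a static field only works on a single JVM. In a real
  // deployment this decision would come from a broadcast/control stream or a
  // backlog-delay measurement so that every task manager sees it.
  @volatile var overflowEnabled = false

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "my-kafka-server")
    props.setProperty("group.id", "my-group-id")

    val overflowTag = OutputTag[String]("overflow")

    val routed = env
      .addSource(new FlinkKafkaConsumer011[String]("input-topic", new SimpleStringSchema(), props))
      .process(new ProcessFunction[String, String] {
        override def processElement(value: String,
                                     ctx: ProcessFunction[String, String]#Context,
                                     out: Collector[String]): Unit = {
          if (overflowEnabled) ctx.output(overflowTag, value)  // divert excess load
          else out.collect(value)                              // normal processing path
        }
      })

    // Records on the main path go through the usual (stateless) processing ...
    routed.addSink(new FlinkKafkaProducer011[String]("output-topic", new SimpleStringSchema(), props))

    // ... while diverted records land on the overflow topic for a second job instance to consume.
    routed.getSideOutput(overflowTag)
      .addSink(new FlinkKafkaProducer011[String]("overflow-topic", new SimpleStringSchema(), props))

    env.execute("overflow-routing")
  }
}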

Kafka Stream program is reprocessing the already processed events

I forwarded a few events to Kafka and started my Kafka Streams program. My program processed the events and completed. After some time I stopped my Kafka Streams application and started it again. I observed that my Kafka Streams program was reprocessing the already processed events.
As per my understanding, Kafka Streams internally maintains the offsets for the input topics per application id, but here it is reprocessing the already processed events.
How can I verify up to which offset Kafka Streams processing was done? How does Kafka Streams persist these bookmarks? On what basis, and from which Kafka offset, will Kafka Streams start reading events from Kafka?
If Kafka Streams throws an exception, does it reprocess the already processed events?
Please clarify my doubts and help me understand.
Kafka Streams internally uses a KafkaConsumer, and all running instances form a consumer group using application.id as the group.id. Offsets are committed to the Kafka cluster at regular, configurable intervals. Thus, on restart with the same application.id, Kafka Streams should pick up the latest committed offsets and continue processing from there.
You can check the committed offsets, as for any other consumer group, with the bin/kafka-consumer-groups.sh tool.
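As a rough illustration (broker address and application id are placeholders; the property names are the standard Kafka Streams / consumer configs), the settings that govern this behaviour look like:

import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")   // also used as the consumer group.id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "10000")        // how often processed offsets are committed
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")     // only applies when no committed offset exists

To see what has actually been committed, something along the lines of bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-streams-app lists the current and log-end offsets per partition.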

spark streaming cannot receive data from kafka if send some message to kafka beforehand

I produce some messages first, and these messages are persisted on disk by Kafka's brokers. Then I start the Spark Streaming program to process this data, but I don't receive anything in Spark Streaming, and there are no error logs.
However, if I produce messages while the Spark Streaming program is running, it does receive the data.
Can Spark Streaming only receive real-time data from Kafka?
To control which data is consumed when a new consumer stream starts, you should provide auto.offset.reset as part of the properties used to create the Kafka stream.
auto.offset.reset can take the following values:
earliest => the Kafka topic will be consumed from the earliest offset available
latest => the Kafka topic will be consumed starting at the current latest offset
Also note that depending on the Kafka consumer model you are using (receiver-based or direct), the behavior of a restarted Spark Streaming job will be different.
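For the direct (Kafka 0.10+) API this is just another entry in the parameter map, for example (broker address and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "backfill-group",
  // Start a brand-new consumer group from the oldest retained messages,
  // so data produced before the streaming job started is also consumed.
  "auto.offset.reset" -> "earliest"
)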