This question already has answers here:
kafka consumer to dynamically detect topics added
(4 answers)
Closed 1 year ago.
I have created a KafkaConsumer, then called subscribe on it (using a Pattern), and then poll.
I noticed that if my consumer is running and I later create a new topic that matches the pattern, the consumer will not consume its data! If I restart my app, the consumer does get the data from the newly created topics.
Is this expected? How can I solve it?
The KafkaConsumer refreshes the metadata of its subscriptions based on the consumer configuration metadata.max.age.ms, which defaults to 5 minutes.
You could reduce this configuration so that your consumer also starts consuming newly created topics that match your pattern, without waiting that long.
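To illustrate, here is a minimal sketch (the broker address, group id and pattern are made up) that lowers metadata.max.age.ms so a pattern-subscribed consumer picks up new matching topics within about 30 seconds:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
props.put(ConsumerConfig.GROUP_ID_CONFIG, "pattern-consumer");          // hypothetical group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
// Refresh topic metadata every 30 seconds instead of the 5 minute default,
// so topics created later that match the pattern are picked up sooner.
props.put(ConsumerConfig.METADATA_MAX_AGE_CONFIG, "30000");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Pattern.compile("events-.*"));                       // hypothetical pattern
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    records.forEach(r -> System.out.println(r.topic() + ": " + r.value()));
}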
I need some help with Kafka Streams. I have started a Kafka Streams application which is streaming one topic from the very first offset. The topic holds a huge amount of data, so I want to implement a mechanism in my application, using Kafka Streams, so that I can get notified when the topic has been read completely, up to the very last offset.
I have read the Kafka Streams 2.8.0 API and found a method, allLocalStorePartitionLags, which returns a map of store names to another map of partitions containing the lag information for each partition. It returns lag information for all store partitions (active or standby) local to this Streams instance. This method is quite useful in the case above, when I have one node running the stream application.
But in my case the system is distributed: there are 3 application nodes and 10 topic partitions, which means each node has at least 3 partitions of the topic to read from.
I need help here. How can I implement this functionality so that I get notified when the topic has been read completely, from partition 0 to partition 9? Please note that I don't have the option to use a database here as of now.
Other approaches to achieve the goal are also welcome. Thank you.
I was able to get the lag information from the AdminClient API. The code below returns the end offsets and the current offsets for each partition of the topics read by the given streams application, i.e. applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);

// Current (committed) offsets of the consumer group.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap =
        listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();

// All topic partitions the group has committed offsets for.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();

// End (latest) offsets for each of those partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
        .collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));
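To turn those two results into a completion check, one option (a sketch continuing the snippet above, not part of the original code) is to subtract the committed offset from the end offset for each partition and treat the topic as fully read once every lag reaches zero:

Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets = listOffsetsResult.all().get();

boolean caughtUp = true;
for (TopicPartition tp : topicPartitions) {
    long committed = topicPartitionOffsetAndMetadataMap.get(tp).offset();
    long end = endOffsets.get(tp).offset();
    long lag = end - committed;
    System.out.printf("%s-%d lag=%d%n", tp.topic(), tp.partition(), lag);
    if (lag > 0) {
        caughtUp = false;
    }
}
// caughtUp == true means every partition (0..9) has been read up to its latest offset.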
This question already has answers here:
Delete message after consuming it in KAFKA
(5 answers)
Closed 2 years ago.
I'm learning Kafka and I'd like someone to help me understand one thing.
A "producer" sends a message to a Kafka topic. It stays there for some time (7 days by default, right?).
But a "consumer" receives that message, and there is not much sense in keeping it there forever.
I expected these messages to disappear once the consumer gets them.
Otherwise, when I connect to Kafka again, I will download the same messages again, so I have to manage duplicate avoidance.
What's the logic behind it?
Regards
"Producer" send message to Kafka topic. It stays there some time (7 days by default, right?).
Yes, a Producer send the data to a Kafka topic. Each topic has its own configurable cleanup.policy. By default it is set to a retention period of 7 days. You can also configure the retention of the topic based on byte size.
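For example (a sketch; the topic name, partition count and sizes are made up, and kafkaProperties is assumed to hold bootstrap.servers), retention can be set per topic when creating it with the AdminClient:

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

try (AdminClient admin = AdminClient.create(kafkaProperties)) {
    NewTopic topic = new NewTopic("my-topic", 3, (short) 1)
            .configs(Map.of(
                    "cleanup.policy", "delete",        // remove old segments (the default policy)
                    "retention.ms", "604800000",       // keep data for 7 days
                    "retention.bytes", "1073741824")); // or cap each partition at ~1 GiB
    admin.createTopics(Collections.singleton(topic)).all().get();
}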
But "consumer" receives such message and there is not much sense to keep it there eternally.
Kafka can be seen as a Publisher/Subscribe messaging system (although mainly being a streaming platform). It has the great benefit that more than one single Consumer can read the same messages of a topic. Compared to other messaging systems the data is not deleted after acknowledged by a consumer.
Otherwise, when I connect to Kafka again, I will download the same messages again. So I have to manage duplicate avoidance.
Kafka has the concept of "Offsets" and "ConsumerGroups" and I highly recommend to get familiar with them as they are essential when working with Kafka. Each consumer is part of a ConsumerGroup and each message in a topic has a unique identifer called "offset". An offset is like a unique identifer that stays with the same message for its life-time.
Each ConsumerGroup keeps track of the messages (offsets) that it already consumed. Now, if you do not want to read the same messages again your ConsumerGroup just have to commit those offsets and it will not read them again.
That way you will not consume duplicates, but still other consumers (with a differen ConsumerGroup) are able to read all messages again.
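As a sketch of how that plays out in code (the group name, topic and broker address are made up): each group.id commits its own offsets, so restarting a consumer in the same group resumes after the last committed offset, while a consumer with a different group.id reads everything from the beginning again.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing");              // the ConsumerGroup this consumer belongs to
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");      // we commit explicitly below
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singleton("orders"));           // hypothetical topic
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(r -> System.out.println(r.offset() + ": " + r.value()));
        consumer.commitSync(); // mark these offsets as consumed for group "billing"
    }
}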
I have the following topology:
topology.addSource(WS_CONNECTION_SOURCE, new StringDeserializer(), new WebSocketConnectionEventDeserializer(),
                utilService.getTopicByType(TopicType.CONNECTION_EVENTS_TOPIC))
        .addProcessor(SESSION_PROCESSOR, WSUserSessionProcessor::new, WS_CONNECTION_SOURCE)
        .addStateStore(sessionStoreBuilder, SESSION_PROCESSOR)
        .addSink(WS_STATUS_SINK, utilService.getTopicByType(TopicType.ONLINE_STATUS_TOPIC),
                stringSerializer, stringSerializer, SESSION_PROCESSOR)
        // WS session routing
        .addSource(WS_NOTIFICATIONS_SOURCE, new StringDeserializer(), new StringDeserializer(),
                utilService.getTopicByType(TopicType.NOTIFICATION_TOPIC))
        .addProcessor(WS_NOTIFICATIONS_ROUTE_PROCESSOR, SessionRoutingEventGenerator::new, WS_NOTIFICATIONS_SOURCE)
        .addSink(WS_NOTIFICATIONS_DELIVERY_SINK, new NodeTopicNameExtractor(), WS_NOTIFICATIONS_ROUTE_PROCESSOR)
        .addStateStore(userConnectedNodesStoreBuilder, WS_NOTIFICATIONS_ROUTE_PROCESSOR, SESSION_PROCESSOR);
As you can see, there are two source topics. The state store is built from the first topic, and the second flow reads from that state store. When I start the topology, I see that the stream threads are assigned the same partitions of both source topics (co-partitioning). I assume this is because the state store is accessed by the second topic's flow.
This works fine functionally, but there is a performance problem. When there is a surge in the volume of input data on the first source topic, which updates the state store, processing of the second topic is delayed.
For me, the second topic should be processed as fast as possible; a delay in processing the first topic is fine.
I am thinking of the following strategy:
Current configuration:
WS_CONNECTION_SOURCE - 30 partitions
WS_NOTIFICATIONS_SOURCE - 30 partitions
streamThreads: 10
appInstances: 3
New configuration:
WS_CONNECTION_SOURCE - 15 partitions
WS_NOTIFICATIONS_SOURCE - 30 partitions
streamThreads: 10
appInstances: 3
Since there is no co-partitioning, tasks would have to use interactive queries to access the store.
The idea is that out of the 10 threads, 5 threads will only process the second topic, which should alleviate the current problem when there is a surge on the first topic.
Here are my questions:
1. Is this strategy correct: avoid co-partitioning and use interactive queries?
2. Is there a chance that Kafka will assign 10 partitions of WS_CONNECTION_SOURCE to one instance (since there are 10 threads per instance), so that another instance won't get any?
3. Is there any better approach to solve the performance problem?
State stores and Interactive Queries are Kafka Streams abstractions.
To use Interactive Queries you have to define a state store (using the Kafka Streams API), and that forces you to have the same number of partitions for the input topics.
I think your solution will not work. Interactive Queries are for exposing the ability to query state stores from outside Kafka Streams (not for access from within the Processor API).
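For context, this is the kind of access Interactive Queries are designed for, i.e. querying a store from outside the topology (the store name, key and the kafkaStreams instance are hypothetical):

// Query a store from outside the topology, e.g. from a REST handler.
ReadOnlyKeyValueStore<String, String> store = kafkaStreams.store(
        StoreQueryParameters.fromNameAndType("session-store", QueryableStoreTypes.<String, String>keyValueStore()));
String nodeForUser = store.get("user-42");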
Maybe you can review your SESSION_PROCESSOR source code, extract more of the work out of the other sub-topology's path, publish the result to an intermediate topic, and then build the state store based on that topic.
Additionally:
Currently Kafka Streams doesn't support prioritization of input topics. There is a KIP about priorities for source topics: KIP-349. Unfortunately, the linked Jira ticket was closed as Won't Fix (https://issues.apache.org/jira/browse/KAFKA-6690).
Can the following situation happen: when I create a consumer and immediately invoke consumer.poll(1000), the consumer doesn't consume messages that the topic contains?
Is it possible that the same situation happens when I have a topic (with 4 partitions) and add a new consumer to the same consumer group, so that the old consumer's oldConsumer.poll(100) call returns 0 records?
I didn't find a description of this behaviour in the documentation, but I can reproduce it on my local machine many times.
Our use case is to delete stale/unused topics from Kafka, i.e. if a topic (on all partitions) doesn't have any new message in the last 7 days, we consider it stale/unused and delete it.
Many Google results suggested adding a timestamp to the messages and then parsing it. For new topics and messages that solution would work, but our existing topics and messages don't have any timestamp in them.
How can I get this working?
Requesting offsets with kafka.api.OffsetRequest.LatestTime() will give you the offset of the latest message added to a partition. You can use the SimpleConsumer API to determine which offset to read from.
For more details, take a look at the wiki page.
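For reference, here is a sketch along the lines of the classic SimpleConsumer example from that wiki page (the host, port, topic and client id are made up). It fetches the latest offset of one partition; since LatestTime() gives you an offset rather than a timestamp, one possible approach is to record this value periodically (e.g. daily) and treat partitions whose latest offset has not moved in 7 days as stale.

import java.util.Collections;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 100_000, 64 * 1024, "stale-topic-check");
TopicAndPartition tp = new TopicAndPartition("my-topic", 0);
kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
        Collections.singletonMap(tp, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 1)),
        kafka.api.OffsetRequest.CurrentVersion(), "stale-topic-check");
OffsetResponse response = consumer.getOffsetsBefore(request);
long latestOffset = response.offsets("my-topic", 0)[0]; // compare against the value stored on the previous run
consumer.close();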