High Performing Kafka Consumer

We have a requirement to consume from a Kafka topic. The topic is provided by the producer team and we have no control over it. The producer publishes a huge volume of messages which our consumer is unable to keep up with. However, we only require 5-10% of the volume produced. Currently the consumer deserializes each message and, based on certain attributes, drops 90-95% of them. The consumer is behind by 5-10 lakh (500K-1M) messages most of the time during the day. We even tried 5 consumers with 30 threads each, but without much success.
Is there any way we can subscribe the consumer to the topic with some filter criteria, so we only receive the messages we are interested in?
Any help or guidance would be highly appreciated.

It is not possible to filter messages without consuming, and at least partially deserializing, all of them.

Broker-side filtering is not supported, though it has been discussed for a long time (https://issues.apache.org/jira/browse/KAFKA-6020).
You mentioned that you do not control the producer. However, if you can get the producer team to add the attribute you filter on to a message header, you can save yourself the parsing of the message body. You still need to read all the messages, but the parsing can be CPU-intensive, so skipping it helps with lag.
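A minimal sketch of that approach, assuming the producer sets a header named event-type (the header name, topic name, and group id are illustrative, not from the original post): the consumer keeps values as raw bytes and inspects the header first, so the body is only parsed for the small fraction of messages that pass the filter.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class HeaderFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "filtering-consumer");
        // Keep values as raw bytes so we never pay the deserialization cost
        // for the 90-95% of messages we are going to drop anyway.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("the-topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Hypothetical header set by the producer; check it before parsing.
                    Header type = record.headers().lastHeader("event-type");
                    if (type == null
                            || !"interesting".equals(new String(type.value(), StandardCharsets.UTF_8))) {
                        continue; // drop without ever parsing the body
                    }
                    // Only now deserialize/parse record.value() and process it.
                }
            }
        }
    }
}
```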

Related

How to consume Kafka's messages on a single consumer?

I need to implement a system where, when the application starts, one thread consumes all the messages that were generated while the service was shut down; in parallel, the application must consume new messages, starting after the last message read by the thread that is in charge of the old messages.
Is there a solution to this problem in Kafka?
I'm not mentioning the language I'm using because I think this is a Kafka feature.
EDIT:
Suppose the service was down from 00:00 and we start the machine, with its consumers, at 18:00: the consumer assigned to read old messages must take all messages from 00:00 to 18:00, and in parallel the other consumers start reading messages from 18:00 onward.
This is how consumers work by default. You also have to be mindful of message retention: if that process doesn't restart within a certain amount of time, you might lose messages. Kafka can retain data forever, but it costs money, so you need to find out what the right retention period is for you.
From your comment, what you describe (multiple consumers consuming the same messages) happens when they have different consumer group ids. If you use the same consumer group, messages won't be processed twice during normal operation.
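A minimal sketch of that default behaviour, assuming a topic named events (the topic and group names are illustrative): every instance shares one group.id, so on startup each one resumes from the group's last committed offsets and works through the backlog before keeping up with new messages.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CatchUpConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Same group.id across all instances: Kafka resumes each partition
        // from the last committed offset, so messages produced during
        // downtime are not skipped.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service");
        // Only used when the group has no committed offset yet.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Processes the backlog first, then new messages as they arrive.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```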
I need to warn you: Kafka is a very complex technology; do not use it unless you properly understand how consumers and producers work in detail. I would suggest you read, at a bare minimum, Kafka: The Definitive Guide before using it, unless you are OK with all kinds of failure scenarios.
Also, by default Kafka guarantees at-least-once delivery. If you want to be sure that you process messages exactly once, please read Exactly-Once Semantics Are Possible: Here’s How Kafka Does It, and know that this also depends on what you do while processing messages. If you touch a database, it might be better to use something in the DB that guarantees uniqueness (a kind of idempotency), so that each message is processed only once.
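A minimal sketch of that DB-side idempotency, assuming PostgreSQL and a hypothetical processed_events table keyed by a unique message ID (none of these names come from the original answer): a redelivered message simply collides with the unique constraint and is ignored.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentWriter {
    // Duplicate deliveries hit the unique key and are silently skipped.
    private static final String UPSERT =
        "INSERT INTO processed_events (message_id, payload) VALUES (?, ?) " +
        "ON CONFLICT (message_id) DO NOTHING";

    public static void write(Connection conn, String messageId, String payload) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(UPSERT)) {
            stmt.setString(1, messageId);
            stmt.setString(2, payload);
            stmt.executeUpdate(); // returns 0 for a duplicate, 1 for a new row
        }
    }
}
```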

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes and processes messages from a Kafka topic produced by a producer. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before the commit, what will happen to the messages that were processed but not committed? Will there be duplicated records? And how do I resolve this duplication, if it happens?
Kafka supports exactly-once semantics, which guarantee that records get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics; which one to use can be decided on the basis of your use case.
If you're concerned that your messages should not get lost by the consumer service, you should go with the at-least-once delivery semantic.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the Kafka message, the message will be re-streamed once your consumer service is back up and running, because the offset for the partition was not committed. Committing an offset for a partition happens once the consumer has processed the message; in simple words, it records that the offset has been processed, so Kafka will not resend committed messages for that partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or where deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate write to the database can be rejected.
There are mainly three types of delivery semantics.
At most once:
Offsets are committed as soon as the message is received by the consumer.
It's a bit risky: if the processing goes wrong, the message is lost.
At least once:
Offsets are committed after the message is processed, so it's usually the preferred one.
If the processing goes wrong, the message will be read again, since its offset has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent (yes, your application has to handle duplicates; Kafka won't help here). Idempotent means that processing the same message again will not impact your system. See the sketch after this list.
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
That's not your case.
You can choose among these semantics as per your requirement.
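A minimal sketch of at-least-once consumption with manual commits, assuming a topic named orders (the topic and group names are illustrative): auto-commit is disabled and the offset is committed only after processing succeeds, so a crash mid-batch causes redelivery rather than loss.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must be idempotent: a crash here means redelivery
                }
                // Offsets are committed only after the whole batch is processed.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processing offset=%d value=%s%n", record.offset(), record.value());
    }
}
```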

What is the difference between pulsar and kafka in regards to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. Then what is meant by "no shared consumption" on this page (https://streaml.io/blog/pulsar-streaming-queuing), which describes the differences between Kafka and Pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic; which consumer gets a given message is completely random.
The issue with implementing this messaging pattern in Kafka lies in the way that Kafka consumers mark that they’ve consumed a message. Kafka consumers use what’s called a high watermark for consumer offsets. That means a consumer can only say, “I’ve processed up to this point,” rather than, “I’ve processed this message.”
Consider the scenario in which multiple Kafka consumers from the same consumer group are processing the same topic partition and one of the consumers fails due to an exception while the others succeed. Because Kafka does not have a built-in way to acknowledge a single message, and only uses a high-water mark, the failed message would be erroneously marked as consumed, when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
To avoid this situation, you would need just a single consumer per partition, which limits the consumption throughput of the topic and in turn requires you to increase the number of partitions to meet your throughput needs.
There is a detailed explanation in this blog post.

How to read messages from kafka consumer group without consuming?

I'm managing a Kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue: how do I read only those messages within the group which haven't been read yet, while keeping those messages readable by the other consumers in the group which actually process them? Any help would be appreciated.
In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the highwater mark offset per partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
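A rough Java equivalent of that lag computation, assuming a consumer group named my-group (the group name and broker address are illustrative): fetch the group's committed offsets via the AdminClient, fetch the log-end offsets via a throwaway consumer, and subtract.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import java.util.Map;
import java.util.Properties;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {
            // Last committed offset per partition for the group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("my-group")
                     .partitionsToOffsetAndMetadata().get();
            // Highwater mark (log-end offset) per partition.
            Map<TopicPartition, Long> ends = consumer.endOffsets(committed.keySet());
            committed.forEach((tp, offset) ->
                System.out.printf("%s lag=%d%n", tp, ends.get(tp) - offset.offset()));
        }
    }
}
```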
In Kafka, a partition can be consumed by only one consumer in a group, i.e. if your topic has 10 partitions and you spawn 20 consumers with the same groupId, then only 10 will be connected to Kafka and the remaining 10 will sit idle. A new consumer will be picked up by Kafka only if one of the existing consumers dies or stops polling from the topic.
AFAIK, I don't think you can do what I understand you want to do within a single consumer group. You can obviously create another groupId and process messages based on the information gathered by the first consumer group.
Kafka now has a KStream.peek() method; see the proposal "Add KStream peek method".
It's not 100% clear to me from the docs that this prevents the peeked message from being consumed off the topic, but I can't see how you could use it in any crash-safe, robust way unless it does. (A usage sketch follows the links below.)
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages
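For reference, a minimal sketch of peek() in the Streams DSL, assuming a topic named events (the topic names and application id are illustrative): peek() observes each record as a side effect, e.g. for logging, while the record continues through the topology unchanged.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class PeekExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "peek-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")
               // peek() lets you observe each record without modifying
               // the stream; the record still flows to the sink below.
               .peek((key, value) -> System.out.printf("saw %s=%s%n", key, value))
               .to("events-copy");

        new KafkaStreams(builder.build(), props).start();
    }
}
```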
I think you can use the publish-subscribe model: give each consumer its own group, so it has its own offset and can consume all the messages for itself.

multiplexing consumer and producer in kafka

In my Kafka consumer threads (high-level consumer), after I consume a message I apply some business logic to it and forward it to a web service. But this web service may be down sometimes, and since I have already consumed the message from Kafka and the offset has moved forward, I would miss that object.
One way to get rid of this problem is to disable autocommit (the high-level consumer stores offsets in ZooKeeper) and commit offsets programmatically, but I expect that to be a very costly operation. I will be producing to Kafka at about 2000 TPS, which may increase later.
Another way, which I am not sure is a good idea, is to produce the consumed object back to Kafka again if I face any problem, but I didn't find any post related to this in all my googling. Is this not even worth considering?
Can you please give me some insights on handling this situation?
Thanks
You can post back the failed message to the same topic, or to another one of your choice.
If you use the same topic, you will push the messages to the end of the topic and they will be picked up after the others (so if order matters to you, don't do this). Also, if the action you perform before sending the message is not idempotent, you will have to add something to identify these records so the action doesn't get performed twice.
If you use a failed_topic, you can push the messages that you couldn't send there, and when the WS is healthy again create a consumer that consumes all the messages from that topic and sends them to the WS. A sketch of this pattern is below.
Hope it helps!
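A minimal sketch of that failed-topic pattern, assuming topics named events and events_failed and a hypothetical callWebService helper (none of these names come from the original answer): if the web-service call throws, the record is re-published to the failed topic for later replay.

```java
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class FailedTopicForwarder {
    private final KafkaProducer<String, String> producer;

    public FailedTopicForwarder() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    public void handle(String key, String value) {
        try {
            callWebService(value); // hypothetical WS call that may fail
        } catch (Exception e) {
            // Park the message on the failed topic; a separate consumer
            // replays events_failed once the WS is healthy again.
            producer.send(new ProducerRecord<>("events_failed", key, value));
        }
    }

    private void callWebService(String value) { /* forward to the WS */ }
}
```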
Moving such messages to an error queue and retrying them later is a well-known approach.
See the Dead Letter Channel pattern.