Apache Pulsar vs Kafka - do consumers pull (poll) messages off the topics?


I know that in Kafka the consumer pulls messages off the broker's topics.
I get the feeling that Pulsar works the same way, considering that the receive method blocks, but I can't find confirmation. Can someone point me to a reference or correct me?
Thanks

Pulsar's Documentation clearly explains how message consumption works:
The Pulsar Consumer origin reads messages from one or more topics in
an Apache Pulsar cluster.
The Pulsar Consumer origin subscribes to Pulsar topics, processes
incoming messages, and then sends acknowledgements back to Pulsar as
the messages are read.
Messages can be received from brokers either synchronously (sync) or asynchronously (async).
The receive method receives messages synchronously. The consumer process is blocked until a message becomes available. For example:
Message msg = consumer.receive();
An asynchronous receive will return immediately with a value of type CompletableFuture that completes once a new message is available. For example,
CompletableFuture<Message> asyncMessage = consumer.receiveAsync();

From the Pulsar documentation:
There is a queue at the consumer side to receive messages pushed from the broker. You can configure the queue size with the receiverQueueSize parameter (the default size is 1000). Each time consumer.receive() is called, a message is dequeued from the buffer.
So the broker pushes messages to a queue on the consumer side. When the receive method is invoked, a message is dequeued and returned.
The Pulsar consumer sends a permit request to the broker to ask for more messages once half of the queue has been consumed. This is described here.
In short, as described here
Pulsar also uses a push-based approach but with an API that simulates consumer pulls.
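To make the push-plus-permits idea concrete, here is a minimal toy model in plain Java. It is not Pulsar's client code: ToyBroker, ToyConsumer, flow, and onPush are hypothetical names invented for this sketch, and the half-queue refill rule is simply taken from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of push delivery with flow-control permits; NOT Pulsar's real client code. */
final class ReceiverQueueSketch {

    /** Hypothetical stand-in for a broker serving one subscription. */
    static final class ToyBroker {
        private final Deque<String> backlog = new ArrayDeque<>();
        private ToyConsumer consumer;
        private int availablePermits = 0;
        int flowRequests = 0; // number of permit requests received so far

        void publish(String message) {
            backlog.addLast(message);
            dispatch();
        }

        /** The consumer grants permits; the broker may push that many more messages. */
        void flow(ToyConsumer from, int permits) {
            consumer = from;
            flowRequests++;
            availablePermits += permits;
            dispatch();
        }

        /** Pushes messages while both permits and backlog remain. */
        private void dispatch() {
            while (consumer != null && availablePermits > 0 && !backlog.isEmpty()) {
                availablePermits--;
                consumer.onPush(backlog.removeFirst());
            }
        }
    }

    /** Hypothetical consumer with a local receiver queue. */
    static final class ToyConsumer {
        private final Deque<String> receiverQueue = new ArrayDeque<>();
        private final ToyBroker broker;
        private final int receiverQueueSize;
        private int consumedSinceLastFlow = 0;

        ToyConsumer(ToyBroker broker, int receiverQueueSize) {
            this.broker = broker;
            this.receiverQueueSize = receiverQueueSize;
            broker.flow(this, receiverQueueSize); // initial permits let the broker fill the queue
        }

        void onPush(String message) {
            receiverQueue.addLast(message);
        }

        /** Looks like a pull to the caller, but only dequeues from the local buffer. */
        String receive() {
            String message = receiverQueue.pollFirst();
            if (message == null) {
                return null; // a real blocking receive would wait here instead
            }
            consumedSinceLastFlow++;
            if (consumedSinceLastFlow >= receiverQueueSize / 2) {
                int permits = consumedSinceLastFlow;
                consumedSinceLastFlow = 0;
                broker.flow(this, permits); // ask for more once half the queue is consumed
            }
            return message;
        }

        int buffered() {
            return receiverQueue.size();
        }
    }
}
```

In this model the caller only ever touches receive(), which is why the API feels like a pull even though every message arrived by push.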

Related

KAFKA client library (confluent-kafka-go): synchronisation between consumer and producer in the case of auto.offset.reset = latest

I have a use case where I want to implement synchronous request / response on top of kafka. For example when the user sends an HTTP request, I want to produce a message on a specific kafka input topic that triggers a dataflow eventually resulting in a response produced on an output topic. I want then to consume the message from the output topic and return the response to the caller.
The workflow is:
HTTP Request -> produce message on input topic -> (consume message from input topic -> app logic -> produce message on output topic) -> consume message from output topic -> HTTP Response.
To implement this case, upon receiving the first HTTP request I want to be able to create on the fly a consumer that will consume from the output topic, before producing a message on the input topic. Otherwise there is a possibility that messages on the output topic are "lost". Consumers in my case have a random group.id and have auto.offset.reset = latest for application reasons.
My question is how I can make sure that the consumer is ready before producing messages. I make sure that I call SubscribeTopics before producing messages, but in my tests so far, when there are no committed offsets and Kafka resets offsets to latest, there is a possibility that messages are lost and never read by my consumer, because Kafka sometimes thinks that the consumer registered after the messages were produced.
My workaround so far is to sleep for a bit after I create the consumer, to allow Kafka to finish the offset reset workflow before I produce messages.
I have also tried to implement logic in a rebalance callback (triggered by a consumer subscribing to a topic), in which I call assign with offset = latest for the topic partition, but this doesn't seem to have fixed my issue.
Hopefully there is a better solution out there than sleep.

Most HTTP client libraries have an implicit timeout. There's no guarantee that your consumer will ever consume an event, or that a downstream producer will send data to the "response topic".
Instead, have your initial request immediately return a 202 Accepted status (or 400, for example, if you do request validation) with some tracking ID. Then require clients to poll with GET requests by ID for status updates, answering either with a 404 status or with 200 plus a status field in the response body.
You'll need a database to store the intermediate state.
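The accept-then-poll flow can be sketched without any HTTP or Kafka library. Everything named here (RequestTrackerSketch, submit, complete, poll, the "PENDING" and "DONE:" strings) is hypothetical, and the in-memory map only stands in for the database mentioned above.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of accept-then-poll; the map is an assumed stand-in for a real database. */
final class RequestTrackerSketch {

    /** What an HTTP handler would send back: a status code plus a body. */
    static final class Response {
        final int httpStatus;
        final String body;

        Response(int httpStatus, String body) {
            this.httpStatus = httpStatus;
            this.body = body;
        }
    }

    private final Map<String, String> stateById = new ConcurrentHashMap<>();

    /** POST handler: store PENDING state and answer 202 Accepted with the tracking ID. */
    Response submit() {
        String trackingId = UUID.randomUUID().toString();
        stateById.put(trackingId, "PENDING");
        // A real handler would produce the request to the input topic at this point.
        return new Response(202, trackingId);
    }

    /** Called by whatever consumes the output topic once a result arrives. */
    void complete(String trackingId, String result) {
        stateById.computeIfPresent(trackingId, (id, old) -> "DONE:" + result);
    }

    /** GET handler: 404 for unknown IDs, otherwise 200 with the current state. */
    Response poll(String trackingId) {
        String state = stateById.get(trackingId);
        if (state == null) {
            return new Response(404, "unknown tracking ID");
        }
        return new Response(200, state);
    }
}
```

Because the HTTP response no longer waits on the output topic, a slow or missing downstream message shows up as a request that stays PENDING rather than as a client timeout.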

Consumer timeout during rebalance

When a consumer drops from a group and a rebalance is triggered, I understand no messages are consumed -
But does an in-flight request for messages stay queued past the max wait time?
Or does Kafka send any payload back during the rebalance?
UPDATE
For clarification, I'm referring specifically to the consumer polling process.
From my understanding, when one of the consumers drops from the consumer group, a rebalance of the partitions to consumers is performed.
During the rebalance, will an error be sent back to the consumer if it has already polled and is waiting for the max wait time to pass?
Or does Kafka wait the max time and send an empty payload?
Or does Kafka queue the request past the max wait time until the rebalance is complete?
Bottom line - I'm trying to explain periodic timeouts from consumers.
This may be in the docs, but I'm not sure where to find it.

Kafka producers don't send messages directly to their consumers; they send them to the brokers.
The in-flight requests belong to the producer, not to the consumer.
Whether or not a consumer leaves a group and triggers a rebalance is immaterial to the behaviour of the producer.
Producer messages are queued in a buffer, batched, optionally compressed, and sent to the Kafka broker according to the configuration.
In-flight requests are the maximum number of unacknowledged requests
the client will send on a single connection before blocking.
Note that "ack" here means acknowledgement by the broker, not by the consumer.
Does Kafka send any payload back during the rebalance?
The Kafka broker doesn't notify producers of any rebalance.

Confused about preventing duplicates with new Kafka idempotent producer API

My app has 5+ consumers consuming from five partitions on a Kafka topic (using Kafka version 11). Each of my consumers produces a message to another topic, then saves some state to the database, then does a MANUAL_IMMEDIATE acknowledgement and moves on to the next message.
I'm trying to solve the scenario where a consumer emits successfully to the outbound topic and then we have a failure or lose the consumer. When another consumer takes over the partition, it will emit ANOTHER message to the outbound topic. This is bad :(
I discovered that Kafka now has idempotent producers, but from what I read the guarantee only holds for a single producer session.
"When producer restarts, new PID gets assigned. So the idempotency is promised only for a single producer session" - (blog) - https://hevodata.com/blog/kafka-exactly-once
This seems largely useless to me. In my use case the whole point is that when I replay a message on another consumer, it does not duplicate the outbound message.
Is there something I'm missing?

When using transactions, you shouldn't use ANY consumer-based mechanism, manual or otherwise, to commit the offsets.
Instead, you use the producer to send the offsets to the transaction so the offset commit is part of the transaction.
If configured with a KafkaTransactionManager or ChainedKafkaTransactionManager, the Spring listener container will send the offsets to the transaction when the listener exits normally.
If you don't use a Kafka transaction manager, you need to use the KafkaTemplate (or Producer if you are using the native APIs) to send the offsets to the transaction.
Using the consumer to commit the offset is not part of the transaction, so things will not work as expected.
When using a transaction manager, the listener container binds the Producer to the thread so any downstream KafkaTemplate operations participate in the transaction that the consumer starts. See the documentation.

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The producer publishes to one topic; the data is picked up and some processing is done. The consumer reads from another topic that reports whether the processing of the data from topic 1 was successful, i.e. topic 2 carries success or failure messages. Right now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous, i.e. once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.

Apache Kafka, and publish/subscribe messaging in general, seek to decouple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC), where the producer and consumer are explicitly coupled together. The standard Apache Kafka producer/consumer APIs do not support this message exchange pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses correlation IDs, consumption ACKs, and request/response messages to build an interface that behaves as you wish.
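A correlation-ID wrapper of that kind could look roughly like the sketch below. It uses only the JDK: the requestSender callback stands in for a producer writing to topic 1, and onReply is meant to be called from the loop that consumes topic 2. All names are hypothetical and no Kafka API is shown.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Sketch of request/response matching by correlation ID; Kafka itself is abstracted away. */
final class CorrelationSketch {

    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Assumed hook: (correlationId, payload) -> produce the request to topic 1.
    private final BiConsumer<String, String> requestSender;

    CorrelationSketch(BiConsumer<String, String> requestSender) {
        this.requestSender = requestSender;
    }

    /** Registers the pending reply before sending, so a fast reply cannot be missed. */
    CompletableFuture<String> request(String payload) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(correlationId, reply);
        requestSender.accept(correlationId, payload);
        return reply;
    }

    /** Call for every message read from topic 2; returns false for unknown IDs. */
    boolean onReply(String correlationId, String payload) {
        CompletableFuture<String> reply = pending.remove(correlationId);
        if (reply == null) {
            return false; // reply for a request this instance never sent, or a duplicate
        }
        reply.complete(payload);
        return true;
    }
}
```

The producing side would then wait on the returned future, ideally with a timeout such as reply.get(30, TimeUnit.SECONDS), before publishing the next set of data, since nothing guarantees that a reply ever arrives.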

Short answer: you can't do that; Kafka doesn't provide that support.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means waiting until your message has been received and acknowledged by the broker.
If you want to do that, flush after every send.

Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data?

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2ed, (2015). Chapter 3, paragraph Kafka Design fundamentals says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand it correctly that Kafka sends the offset first and then the consumer uses the list of offsets to pull request the data it has not consumed yet?
Thanks in advance,
Nick

On startup, KafkaConsumer issues an offset lookup request to the brokers for the consumer group configured on this consumer. If valid offsets are returned, those are used. Otherwise, the consumer picks an initial offset according to the auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in memory within the consumer. Each poll() sends the current offset to the broker, and when the broker replies the consumer updates its in-memory offsets.
Additionally, the in-memory offsets are committed/acked to the broker from time to time. This can happen automatically within poll() if auto commit is enabled; otherwise a commit call (commitSync() or commitAsync()) must be made explicitly to send the offsets to the broker for reliable storage.
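The difference between the in-memory position and the committed offset can be illustrated with a small toy model. This is not the KafkaConsumer API: ToyPartition and ToyConsumer are hypothetical, there is a single partition, and the boolean flag stands in for auto.offset.reset (true for latest, false for earliest).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of consumer offset handling; NOT the real KafkaConsumer. */
final class OffsetSketch {

    /** One partition's log plus the committed offsets stored on the broker side per group. */
    static final class ToyPartition {
        final List<String> log = new ArrayList<>();
        final Map<String, Long> committedByGroup = new HashMap<>();
    }

    static final class ToyConsumer {
        private final ToyPartition partition;
        private final String groupId;
        private long position; // in-memory only; lost if the consumer dies

        ToyConsumer(ToyPartition partition, String groupId, boolean resetToLatest) {
            this.partition = partition;
            this.groupId = groupId;
            Long committed = partition.committedByGroup.get(groupId);
            if (committed != null) {
                position = committed; // resume where the group left off
            } else {
                position = resetToLatest ? partition.log.size() : 0; // auto.offset.reset stand-in
            }
        }

        /** Fetches records starting at the in-memory position and advances it. */
        List<String> poll(int maxRecords) {
            int from = (int) position;
            int to = Math.min(partition.log.size(), from + maxRecords);
            List<String> records = new ArrayList<>(partition.log.subList(from, to));
            position = to;
            return records;
        }

        /** Stores the next offset to read, so a restarted consumer can resume from it. */
        void commit() {
            partition.committedByGroup.put(groupId, position);
        }

        long position() {
            return position;
        }
    }
}
```

In this model the broker never hands out a list of offsets up front; the consumer asks for data starting at its own position, and the "acknowledgement" is just the commit that records how far the group has read.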
