How do I specify how often to poll? - apache-kafka

I'm working with Logstash's Kafka input plugin, and I don't see any setting that would let me specify how often the consumer should poll.
If I check the consumer configuration, I don't see anything there either.
How does it work then?
How does the consumer know when to try to pull data in?

There are a variety of configurable settings on that plugin, but poll_timeout_ms appears to be the one that controls the poll interval; it defaults to 100 ms. The Kafka plugin maintains an open TCP connection to the Kafka cluster, so you don't pay connection-establishment overhead for a fast polling interval.
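Under the hood the plugin wraps the standard Kafka Java consumer, so the behaviour is essentially a continuous poll loop: poll() blocks for at most the configured timeout, returns whatever records are available (possibly none), and is immediately called again. A minimal Java sketch of that loop, assuming a broker at localhost:9092 and a hypothetical topic my-topic:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollLoopSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "logstash-like-consumer");  // hypothetical group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                // poll() waits at most 100 ms for new records (the role played by
                // poll_timeout_ms), then returns and the loop polls again right away.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```

So there is no scheduled "pull interval" as such: the consumer is effectively always asking for data, and the timeout only bounds how long each individual poll blocks when nothing is available.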

Related

Kafka consumer disconnect Event

Is there any way to detect a crash or shutdown of a consumer?
I want the Kafka server to publish an event to all Kafka clients (publishers, consumers, ...) when that situation occurs.
Is it possible?
Kafka keeps track of the consumed offsets per consumer group on a special internal topic. You could set up a dedicated "monitoring service" that constantly consumes from that internal offsets topic and triggers any notification/alerting mechanism as needed, so that your publishers and other consumers are notified programmatically. This other SO question has more details about that.
Depending on your use case, lag monitoring is also a really good way to know whether your consumers are falling behind or have crashed. There are multiple existing solutions for that, or, again, you could build your own to customize alerting/notification behavior, as in the sketch below.
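For illustration, here is a rough Java sketch of what such a lag check could look like, using the AdminClient to read a group's committed offsets and comparing them with the partition end offsets. The broker address, group id, and alert threshold are assumptions for the example:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagMonitorSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        String groupId = "my-consumer-group"; // hypothetical group to watch

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

            // Committed offsets for the group (stored on the internal offsets topic).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets on the brokers for the same partitions.
            Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());

            for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
                long lag = end.get(e.getKey()) - e.getValue().offset();
                if (lag > 1000) { // arbitrary threshold for this sketch
                    System.out.println("ALERT: " + groupId + " lagging " + lag
                            + " records on " + e.getKey());
                }
            }
        }
    }
}
```

Run something like this on a schedule and wire the alert branch to whatever notification channel you already use; a group whose lag keeps growing (or whose commits stop entirely) has likely crashed or stalled.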

Why does kafka consumer poll the broker?

I'm currently learning about Kafka architecture and I'm confused as to why the consumer polls the broker. Why wouldn't the consumer simply subscribe to the broker, supply some callback information, and wait for the broker to receive a record? Then, when the broker gets a relevant record, it could look up who needs to know about it and use the callback information to dispatch the messages. This would hugely reduce the number of network operations.
Kafka can be used as a messaging service, but that is not the only possible use case. You could also treat it as a remote file whose bytes (records) can be read on demand.
Also, if a push-based notification mechanism were implemented as you suggest, the broker would need to handle slow consumers. Kafka leaves all control to consumers, allowing them to consume at their own speed.
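To illustrate the "remote file" view: with the Java consumer you can assign a partition and seek to an arbitrary offset before polling, reading records on demand at whatever pace you choose. A minimal sketch, assuming a broker at localhost:9092 and a hypothetical topic my-topic:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadOnDemandSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("my-topic", 0); // hypothetical topic/partition

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L); // jump to an arbitrary offset, like seeking in a file

            // The consumer decides when and how much to read; the broker never pushes.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```

A push model would make this kind of random access and consumer-paced reading much harder to support.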

Kafka Producer: Disconnect after sending message vs keeping connection open

I've not been able to find an answer to this in the kafkajs docs or from skimming through the official Apache Kafka design docs, but in their producer examples, the producer disconnects after sending the messages. However, that could be because it's a trivial example, and not a long running process.
For long running applications, like web apps, I'm wondering if it is better to disconnect from the producer after sending messages, or if it is better to (presumably) keep the connection open throughout the life of the running application.
An obvious advantage to keeping the connection open is that the producer wouldn't have to reconnect when sending messages, and an obvious disadvantage is that it holds a TCP connection open. I don't know how significant either of those is.
My guess would be that it depends on the expected volume; if the application is going to be constantly sending messages, it'd be best to keep the connection open, while if it is not going to be sending messages frequently, it would be appropriate to disconnect after sending messages.
Is this an accurate assessment? I'm more wondering if there are nuances that I've missed or made incorrect assumptions.
It is recommended to keep the producer open for the lifetime of the application.
Only by keeping it open can you leverage properties like batch.size and linger.ms to improve your application's throughput.
Even for less busy applications, it's better to share a single producer instance across your application.
However, if you're looking to enable transactions, you might want to consider a pool of producer instances.
Although KafkaProducer is thread-safe, it does not support multiple concurrent transactions, so if you want to run different transactions concurrently, it's recommended to have separate producer instances.
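As a rough sketch of the non-transactional case: create one long-lived producer and share it across the application. The broker address and the batch.size/linger.ms values below are illustrative assumptions, not recommended settings:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SharedProducer {
    // One producer instance for the whole application; KafkaProducer is thread-safe.
    private static final KafkaProducer<String, String> PRODUCER = create();

    private static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("batch.size", "32768"); // batch up to 32 KB per partition (example value)
        props.put("linger.ms", "20");     // wait up to 20 ms to fill a batch (example value)
        return new KafkaProducer<>(props);
    }

    public static void send(String topic, String key, String value) {
        PRODUCER.send(new ProducerRecord<>(topic, key, value));
    }

    // Call once on application shutdown to flush buffered records and close the connection.
    public static void shutdown() {
        PRODUCER.close();
    }
}
```

Batching and lingering only pay off because the same producer accumulates records across many send() calls; a producer that is created and closed per message never gets the chance to batch.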

If you use Apache Kafka as a strict message broker, can you just set all retention entries to the minimum?

We want to use Apache Kafka as a live message broker only. A message is distributed and instantly utilized (fire and forget).
Could we theoretically keep no logs and just send and forget?
Which config entries should we change?
log.retention.hours and log.retention.bytes?
That's not how Kafka works. If you don't keep logs, your messages won't be available to consume at all. If you set retention to a very low value, you lose messages whenever a consumer is offline (crashed, down for maintenance, etc.).
Kafka tracks the offsets each consumer group has committed, so you don't need to delete messages to prevent them from being re-read.
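For example, with the Java consumer a group's position is just the offset it commits; restarting the same group resumes after the last committed offset even though the records are still retained on the broker. A minimal sketch (topic name and group id are hypothetical):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitOffsetsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "billing-service");          // hypothetical group id
        props.put("enable.auto.commit", "false");          // commit explicitly below
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value()); // stand-in for real processing
            }
            // The committed offset marks this group's position; on restart the group
            // resumes after it, even though the records themselves remain on the broker
            // until retention expires.
            consumer.commitSync();
        }
    }
}
```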
If you still don't like this approach, then just use a 'traditional' MQ that does have the semantics you are looking for :)
You might find this whitepaper useful.
Disclaimer: I work for Confluent

Behaviour of KafkaProducer when the connection to Kafka stream is broken

I was expecting that KafkaProducer would throw a timeout exception when the connection to the brokers is broken (e.g. losing internet access, brokers not available, ...), but from what I observed, KafkaProducer still performed the send normally without any problem. (I set acks to 0.)
I checked its documentation, and there is nothing about how the KafkaProducer behaves when the connection to the brokers is broken or restored.
Does anyone have experience with this? I'm using Kafka version 0.10, with asynchronous send and error handling in the callback.
First, I want to clarify that Kafka Streams is Apache Kafka's stream processing library and your question seems not to be about Kafka Streams. You only talk about producer and brokers (just want to clarify terminology to avoid future confusion).
About your question: The only way to check if a write to a broker was successful is by enabling acks. If you disable acks, the producer applies a "fire and forget" strategy and does not check if a write was successful and/or if any connection to the Kafka cluster is still established etc.
Because you do not enable acks, you can never get an error callback. This is independent of sync/async writing.
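As a rough Java sketch of the difference: with acks enabled, send() eventually reports broker-side failures (after the client's internal retries and timeouts) through the callback, which is exactly the path that stays silent when acks is 0. The broker address and topic here are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckedSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all"); // wait for broker acknowledgement before the send succeeds
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), // hypothetical topic
                    (metadata, exception) -> {
                        if (exception != null) {
                            // With acks enabled, a broken connection eventually surfaces here
                            // (e.g. as a timeout) instead of being silently dropped.
                            System.err.println("Send failed: " + exception);
                        } else {
                            System.out.println("Acked at offset " + metadata.offset());
                        }
                    });
        }
    }
}
```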