Behaviour of KafkaProducer when the connection to Kafka stream is broken - apache-kafka

I was expecting that KafkaProducer would throw a timeout exception when the connection to the brokers is broken (e.g. losing internet access, brokers not available...), but from what I observed, KafkaProducer still performed the send normally without any problem. (I set acks to 0.)
I checked its documentation and there is no section about how KafkaProducer behaves when the connection to the brokers is broken/restored.
Does anyone have experience with this? I'm using Kafka version 0.10, with asynchronous send and error handling in a callback.

First, I want to clarify that Kafka Streams is Apache Kafka's stream processing library, and your question does not seem to be about Kafka Streams; you only talk about the producer and brokers (just clarifying terminology to avoid future confusion).
About your question: the only way to check whether a write to a broker was successful is by enabling acks. If you disable acks, the producer applies a "fire and forget" strategy and does not check whether a write succeeded or whether any connection to the Kafka cluster is still established.
Because you do not enable acks, you can never get an error callback. This is independent of sync/async writing.
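For illustration only, here is a minimal sketch (broker address and topic name are placeholders, not from the question) of a producer with acks enabled and an error-reporting callback:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AckedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // acks=1 (or "all") makes the broker confirm each write, so broken connectivity
            // eventually surfaces as an exception (e.g. a TimeoutException) in the callback.
            // With acks=0 the producer does not wait for any confirmation, so a "successful"
            // callback only means the record was handed off locally.
            props.put(ProducerConfig.ACKS_CONFIG, "1");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("Send failed: " + exception);
                    } else {
                        System.out.println("Acked at offset " + metadata.offset());
                    }
                });
            }
        }
    }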

Related

Why does kafka consumer poll the broker?

Currently learning about Kafka architecture and I'm confused as to why the consumer polls the broker. Why wouldn't the consumer simply subscribe to the broker and supply some callback information and wait for the broker to get a record? Then when the broker gets a relevant record, look up who needs to know about it and look at the callback information to dispatch the messages? This would reduce the number of network operations hugely.
Kafka can be used as a messaging service, but it is not the only possible usecase. You could also treat it as a remote file that can have bytes (records) read on demand.
Also, if a notification mechanism were implemented in push-based message-broker fashion as you suggest, you'd need to handle slow consumers. Kafka leaves all control to consumers, allowing them to consume at their own speed.
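As a rough sketch of that pull model (topic, group id, and broker address are placeholders), the consumer fetches batches whenever it calls poll(), so a slow consumer simply polls less often rather than being flooded by pushed messages:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PollingConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");               // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                while (true) {
                    // The consumer decides when to ask for more data; nothing is pushed to it.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }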

How can I ensure that the messages sent are not lost when the kafka is not working?

I've started to use Kafka. And I have a question about it.
If Kafka is not running because of a network problem, a Kafka crash, etc., how can I deal with this problem? And what happens to messages that were already sent to Kafka?
If all brokers in the cluster are unavailable your producer will not get any acknowledgements (note that the actual send happens in a background thread, not the thread that calls send - that is an async call).
If you have acks=0 then you have lost the message, but with acks=1 or acks=all it depends on the retry configuration. By default the producer thread retries pretty much indefinitely, which means at some point the send buffer will fill up and the otherwise asynchronous send method will start failing synchronously; but if your client dies in the meantime, the messages still sitting in the buffer are lost, as that buffer exists only in memory.
If you are wondering about behaviour when some but not all brokers are down, I wrote about it here
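A hedged sketch of that failure behaviour with the Java producer (broker address and topic name are placeholders): with acks=all and retries enabled, a failed delivery either comes back through the send callback once retries give up, or, once the in-memory buffer has been full for max.block.ms, send() itself fails synchronously:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    public class DurableSendExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for in-sync replicas
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // keep retrying while brokers are down
            props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000L);       // how long send() may block on a full buffer

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                try {
                    producer.send(new ProducerRecord<>("my-topic", "payload"), (metadata, exception) -> {
                        if (exception != null) {
                            // Delivery ultimately failed; persist or re-queue the payload here
                            // if it must not be lost (the producer's buffer is memory only).
                            System.err.println("Delivery failed: " + exception);
                        }
                    });
                } catch (TimeoutException e) {
                    // Thrown synchronously when the buffer stays full for max.block.ms.
                    System.err.println("Buffer full, message rejected: " + e);
                }
            }
        }
    }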

How to skip old messages when connecting to a RabbitMQ producer

I've looked into the expiration and TTL policies for messages and queues, but I'm not sure if that's the best way to accomplish what I'm trying to do.
Ideally, when my consumer connects to my sender, I want to skip any old, unreceived messages and only receive messages sent after the connection is made. In Kafka, this was accomplished by configuring the consumer to seek to the end of the queue before consuming further messages. A direct RabbitMQ equivalent of this feature doesn't seem to exist, but I have to imagine there's a more efficient way to accomplish this than making the TTL or expiration on the messages very short.
How do I consume only messages received after connecting to a RabbitMQ producer?
What ended up working for us was configuring the producer during publishing rather than configuring the consumer.
channel.basic_publish([other params], properties=pika.BasicProperties(expiration='1000'))
This causes the messages to expire after one second, which was good enough for our needs.
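For anyone using the RabbitMQ Java client rather than pika, a rough equivalent (host and queue name are placeholders) might look like this; the expiration value is a string of milliseconds:

    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ExpiringPublishExample {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // placeholder broker host
            try (Connection connection = factory.newConnection();
                 Channel channel = connection.createChannel()) {
                channel.queueDeclare("demo-queue", false, false, false, null); // placeholder queue
                // Per-message TTL: unconsumed messages are dropped after one second.
                AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                        .expiration("1000")
                        .build();
                channel.basicPublish("", "demo-queue", props, "hello".getBytes("UTF-8"));
            }
        }
    }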

If you use Apache Kafka as a strict message broker, can you just set all retention entries to the minimum?

We want to use Apache Kafka as a live message broker only. A message is distributed and instantly utilized (fire and forget).
Could we theoretically keep no logs and just send and forget?
Which config entries should we change?
log.retention.hours and log.retention.bytes?
That's not how Kafka works. If you don't keep logs, your messages won't be available to consume at all, and if you set retention to a very low value, you lose messages whenever your consumer is offline (crashes, etc.).
Kafka tracks the offsets each consumer (group) has read up to, so you don't need to delete messages to prevent re-reading.
If you still don't like this approach, then just use a 'traditional' MQ that does have the semantics you are looking for :)
You might find this whitepaper useful.
Disclaimer: I work for Confluent
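For reference, the broker-level retention settings the question asks about look roughly like this in server.properties (values are illustrative only; as explained above, very short retention risks dropping messages that an offline consumer has not read yet):

    # delete log segments older than one minute
    # (log.retention.ms takes precedence over log.retention.hours when both are set)
    log.retention.ms=60000

    # or cap each partition's log at roughly 10 MB, whichever limit is hit first
    log.retention.bytes=10485760

    # retention removes whole segments, so keep segments small enough to actually expire
    log.segment.bytes=1048576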

Using Kafka Producer by different threads

I have a Kafka producer in my Java-based web application to push messages to Kafka. As per the documentation, I can see that the Kafka producer is thread safe. Does that mean I can have a single instance of KafkaProducer and use it from different threads (web requests)? In my case each request would open and close the producer. Will this create any issues, or is it better to instantiate a producer per request?
Yes, KafkaProducer is threadsafe.
Refer to Class KafkaProducer
A Kafka client that publishes records to the Kafka cluster.
The producer is thread safe and should generally be shared among all threads for best performance.
The producer manages a single background thread that does I/O as well as a TCP connection to each of the brokers it needs to communicate with. Failure to close the producer after use will leak these resources.
By far the best approach (which is typical of most stateful clients and connectors, e.g. SQL clients, the Elasticsearch client, etc.) is to instantiate a single instance at application start and share it across all threads, as sketched below. It should only be closed on application shutdown.
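A minimal sketch of that pattern (broker address is a placeholder): one producer created at startup, shared by all request threads, and closed only at shutdown.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // One producer per application: created once, shared across request threads,
    // closed only on shutdown.
    public final class SharedProducer {
        private static final Producer<String, String> PRODUCER = create();

        private static Producer<String, String> create() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            return new KafkaProducer<>(props);
        }

        // Safe to call concurrently from many web-request threads.
        public static void publish(String topic, String key, String value) {
            PRODUCER.send(new ProducerRecord<>(topic, key, value));
        }

        // Call once on application shutdown, e.g. from a ServletContextListener or shutdown hook.
        public static void shutdown() {
            PRODUCER.close();
        }

        private SharedProducer() { }
    }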