Using Kafka Producer by different threads - apache-kafka

I have kafka producer for my java based web application to push messages to Kafka. As per the documentation I could see kafka producer is thread safe. Does it mean that I can have single instance of Kafka producer and use it by different threads ( web requests ) each will open and close the producer in my case. Will this create any issues ? Or Is better to initiate Producers per request ?

Yes, KafkaProducer is threadsafe.
Refer to Class KafkaProducer
A Kafka client that publishes records to the Kafka cluster.
The producer is thread safe and should generally be shared among all
threads for best performance.
The producer manages a single background thread that does I/O as well
as a TCP connection to each of the brokers it needs to communicate
with. Failure to close the producer after use will leak these
resources.

By far the best approach (which is typical of most stateful clients connectors, eg SQL clients, elasticsearch client, etc) is to instantiate a single instance at application start and share it across all threads. It should only be closed on application shutdown.

Related

Why does kafka consumer poll the broker?

Currently learning about Kafka architecture and I'm confused as to why the consumer polls the broker. Why wouldn't the consumer simply subscribe to the broker and supply some callback information and wait for the broker to get a record? Then when the broker gets a relevant record, look up who needs to know about it and look at the callback information to dispatch the messages? This would reduce the number of network operations hugely.
Kafka can be used as a messaging service, but it is not the only possible usecase. You could also treat it as a remote file that can have bytes (records) read on demand.
Also, if notification mechanism were to implemented in message-broker fashion as you suggest, you'd need to handle slow consumers. Kafka leaves all control to consumers, allowing them to consume at their own speed.

Process messages pushed through Kafka

I haven't used Kafka before and wanted to know if messages are published through Kafka what are the possible ways to capture that info?
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Haven't used Kafka before and while reading up I did find that Kafka needs ZooKeeper running too.
I don't need to publish info just process data received from Kafka publisher.
Any pointers will help.
Kafka is a distributed streaming platform that allows you to process streams of records in near real-time.
Producers publish records/messages to Topics in the cluster.
Consumers subscribe to Topics and process those messages as they are available.
The Kafka docs are an excellent place to get up to speed on the core concepts: https://kafka.apache.org/intro
Is Kafka only way to receive that info via "Consumers" or can Rest APIs be also used here?
Kafka has its own TCP based protocol, not a native HTTP client (assuming that's what you actually mean by REST)
Consumers are the only way to get and subsequently process data, however plenty of external tooling exists to make it so you don't have to write really any code if you don't want to in order to work on that data

How to implement communication between consumer and producer with fast and slow workers?

I'm looking for a pattern and existing implementations for my situation:
I have synchronous soa which uses REST APIs, in fact I implement remote procedure call with REST. And I have got some slow workers which process requests for substantial time (around 30 seconds) and due to some license constraints for some requests I can process them only sequentially (see system setup).
What are recommended ways to implement communication for such case?
How can I mix synchronous and asynchronous communication in the situation when the consumer is behind firewall and I cannot send him notification about completed tasks easily and I might not be able to let consumer use my message broker if I have one?
Workers are implemented in Python using Flask and gunicorn. At the moment I'm using synchronous REST interfaces and allow for the delay as I only had fast workers. I looked at Kafka and RabbitMq and they would fit for backend side
communication, however how does producer communicate with the consumer?
If the consumer fires API request my producer can return code 202, then how shall producer notify consumer that result is available? Will consumer have to poll the producer for results?
Also if I use message brokers and my gateway acts on behalf of consumer, it should have a registry of requests (I already have GUID for every request now) and results, which approach would you recommend for implementing it?
Producer- agent which produces the message
Consumer - agent which can handle the message and implement the logic for processing the message

Behaviour of KafkaProducer when the connection to Kafka stream is broken

I was expecting that KafkaProducer will throw timeout exception when the connection to brokers is broken (e.g. losing internet access, brokers is not available...), but from what I observed, KafkaProducer still performed the sending normally without any problem. (I set Ack to no).
I checked its documentation and there is no part about how the KafkaProducer will behave when the connection to brokers is broken/ restored.
Does anyone have experiences with this? I'm using Kafka version 0.10, with asynchronous send and handle error in callback.
First, I want to clarify that Kafka Streams is Apache Kafka's stream processing library and your question seems not to be about Kafka Streams. You only talk about producer and brokers (just want to clarify terminology to avoid future confusion).
About your question: The only way to check if a write to a broker was successful is by enabling acks. If you disable acks, the producer applies a "fire and forget" strategy and does not check if a write was successful and/or if any connection to the Kafka cluster is still established etc.
Because you do not enable acks, you can never get an error callback. This in independent of sync/async writing.

What ways can a Consumer consume message in Kafka?

If there is a Kafka server "over there" somewhere across the network I would assume that there might be two ways that a Consumer could consume the messages:
By first of all 'subscribing' to the Topic and in effect telling the Kafka server where it is listening so that when a new message is Produced, Kafka proactively sends the message to the Consumer, across the network.
The Consumer has to poll the Kafka server asking for any new messages, using the offset of the messages it has currently taken.
Is this how Kafka works, and is the option configurable for which one to use?
I'm expanding my comment into an answer.
Reading through the consumer documentation, Kafka only supports option 2 as you've described it. It is the consumers responsibility to get messages from the Kafka server. In the 0.9.x.x Consumer this is accomplished by the poll() method. The Consumer polls the Kafka Server and returns messages if there are any. There are several reasons I believe they've chosen to avoid supporting option 1.
It limits the complexity needed in the Kafka Server. It's not the Server's responsibility to push messages to a consumer, it just holds the messages and waits till a consumer fetches them.
If the Kafka Server was pushing all messages to the consumers, it could overwhelm a consumer. Say a Producer was pushing messaging into a Kafka Server 10 msg/sec, but a certain Consumer could only process 2 msg/sec. If the Kafka Server attempted to push every message it received to that Consumer, the Consumer would quickly be overwhelmed by the number of messages it receives.
There's probably other reasons, but at the moment those were the two I thought about.