Kafka Producer: Disconnect after sending message vs keeping connection open


I've not been able to find an answer to this in the kafkajs docs or from skimming through the official Apache Kafka design docs, but in their producer examples, the producer disconnects after sending the messages. However, that could be because it's a trivial example, and not a long running process.
For long running applications, like web apps, I'm wondering if it is better to disconnect from the producer after sending messages, or if it is better to (presumably) keep the connection open throughout the life of the running application.
An obvious advantage to keeping the connection open is that it wouldn't need to reconnect when sending messages, and an obvious disadvantage would be that it holds a TCP connection open. I don't know how big of an advantage or disadvantage either is.
My guess would be that it depends on the expected volume; if the application is going to be constantly sending messages, it'd be best to keep the connection open, while if it is not going to be sending messages frequently, it would be appropriate to disconnect after sending messages.
Is this an accurate assessment? I'm mostly wondering whether there are nuances that I've missed or incorrect assumptions that I've made.

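It is recommended to keep the producer open for the lifetime of the application.
Only a long-lived producer can take advantage of properties like batch.size and linger.ms to improve the throughput of your application, because batching needs several records to accumulate in the same producer before they are sent.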
Even for less busy applications it's better to have a producer instance shared in your application.
However if you're looking to enable transactions, you might want to consider implementing a pool of producer instances.
Although KafkaProducer is thread-safe, it does not support multiple concurrent transactions, so if you want to run different transactions concurrently, it's recommended to have separate producer instances.
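To illustrate why batching only pays off when the producer outlives a single send, here is a minimal, self-contained Python model. It is a sketch, not the Kafka client API: `ToyBatchingProducer`, `batch_size` (counted in records here, whereas the real batch.size is in bytes), and `linger_s` are hypothetical names that only mimic the idea behind batch.size and linger.ms.

```python
import time


class ToyBatchingProducer:
    """Toy model of producer-side batching; NOT a real Kafka client."""

    def __init__(self, batch_size=3, linger_s=0.05, clock=time.monotonic):
        self.batch_size = batch_size  # records per batch (assumption: count, not bytes)
        self.linger_s = linger_s      # how long a partial batch may wait
        self.clock = clock
        self.buffer = []
        self.first_buffered_at = None
        self.requests_sent = []       # each entry models one network request

    def send(self, record):
        if not self.buffer:
            self.first_buffered_at = self.clock()
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def poll(self):
        """Flush a partial batch once it has lingered long enough."""
        if self.buffer and self.clock() - self.first_buffered_at >= self.linger_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.requests_sent.append(list(self.buffer))
            self.buffer.clear()

    def close(self):
        self.flush()


def send_with_shared_producer(records):
    """One long-lived producer: records accumulate into batches."""
    producer = ToyBatchingProducer()
    for record in records:
        producer.send(record)
    producer.close()
    return len(producer.requests_sent)


def send_with_producer_per_message(records):
    """Connect, send, disconnect every time: every batch has one record."""
    total = 0
    for record in records:
        producer = ToyBatchingProducer()
        producer.send(record)
        producer.close()  # closing flushes a batch of one
        total += len(producer.requests_sent)
    return total
```

In this model, seven records leave the shared producer in three requests, while the producer-per-message variant needs seven, and that is before counting the cost of reconnecting each time.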

Related

Redis / Kafka - Why stream consumers are blocked?

Are Kafka stream/redis stream good for reactive architecture?
I am asking this mainly because both redis and kafka seem to be blocking threads when consuming the messages.
Are there any reasons behind this? I was hoping that I could read messages with some callback - so execute when the message was delivered like pub/sub in a reactive manner. Not by blocking thread.

The Kafka client is relatively low level, which is good in that it gives you a lot of flexibility over when (and in which thread) you do the record processing. In the end, to receive a record, someone needs to block (the actual reading consists of sending fetch requests over and over). Whether the blocked thread is a "main business" thread or a dedicated I/O thread is the developer's choice.
You might take a look at higher-level offerings such as Spring Kafka, the Kafka Streams API, or Kafka Connect, which provide "fatter" inversion-of-control containers and effectively address the above concern.
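The idea of moving the blocking call into a dedicated thread and handing records to a callback can be sketched with the standard library alone. This is only an illustration: `queue.Queue` stands in for the broker, and `start_consumer_thread` and `on_message` are hypothetical names, not part of any Kafka or Redis client.

```python
import queue
import threading


def start_consumer_thread(source, on_message, stop_event, poll_timeout_s=0.05):
    """Block in a dedicated thread and hand each record to a callback."""

    def loop():
        while not stop_event.is_set():
            try:
                # The blocking "poll": only this side thread ever waits here.
                record = source.get(timeout=poll_timeout_s)
            except queue.Empty:
                continue  # nothing arrived; re-check the stop flag
            on_message(record)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

The business code only supplies `on_message`; something still blocks, but it is the side thread rather than the caller, which is roughly what the higher-level frameworks arrange for you.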

Forwarding messages coming from Kafka Topic

I am in the process of designing a system which acts as a message forwarder from one system to another. I have several options, but I would like to pick the one with the lowest resource consumption (CPU, RAM) and latency, so I would appreciate your recommendations and views on this.
We assume that messages will be streaming to our system from a topic in Kafka. We need to forward all the messages from the topic to another host. There can be different strategies for this purpose.
1. Collect a certain number of messages, let's say 100 (batch processing), and send them at once within a single HTTP message.
2. When a message is received, the system sends it to the target host as an HTTP POST request.
3. Open a WebSocket between our system and the target host and send messages over it.
4. Behave like a Kafka producer and send messages to a topic.
Each of them has advantages and disadvantages. My concern is that the system may not be able to handle a high volume of incoming messages. Are there options other than these four? Which do you think is best, and by what criteria?
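For the first strategy (collect a batch of messages and send them in one HTTP request), the batching step might look like the sketch below. It covers only the grouping and serialization; `batched` and `build_payloads` are hypothetical helper names, and JSON is an assumed wire format that the question does not specify.

```python
import json
from itertools import islice


def batched(messages, batch_size):
    """Yield lists of up to batch_size messages from any iterable."""
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def build_payloads(messages, batch_size=100):
    """Turn a message stream into JSON bodies, one per HTTP POST."""
    return [json.dumps(batch) for batch in batched(messages, batch_size)]
```

With 250 messages and a batch size of 100 this produces three request bodies. Batching by count alone means a partial batch waits until enough messages arrive, so a real forwarder would probably also flush after a timeout to bound latency.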

How important is your latency requirement?
HTTP is quite slow compared to a UDP-based messaging system, but maybe you don't need latency that finely tuned.
Batching your messages will increase latency, as you may know.
But it's confusing, because the title of this page is "rest - forwarding" =).
Does it have to be REST (so HTTP)? It seems you could just as well act as a Kafka producer, and in that case it's not REST.
The memory footprint of Kafka may be a bit high (Java lib), but not by much.
Are you working on an embedded system (and so trying to reduce the memory footprint)?
As for CPU, it depends on what we're comparing Kafka with, but I still think Kafka is quite well optimised for performance.
I think we lack information about this "another host"; could you give more details about its purpose?
Yannick

I think you are looking for Kafka Streaming in this scenario. From an efficiency point of view, though, maybe some Hadoop stack implementation (Flume) or Spark would consume fewer resources; I'm not sure, as it depends on the amount of data, network hops, disk used, and amount of memory.
If you have huge amounts of messages, those distributed solutions should be the right approach, not a custom REST client.

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. These all seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem seeing Kafka's architecture. The rate of messages probably won't be more than 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?

Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id to both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.
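The effect of such a collision can be shown with a deliberately simplified model of how partitions are divided among the members of a consumer group. This is a sketch only: it uses plain round-robin, whereas Kafka's real assignment strategies differ in detail, and `assign_partitions` and the group and consumer names are made up for the example.

```python
from collections import defaultdict


def assign_partitions(partitions, members):
    """Round-robin partitions over the members of each group.

    members is a list of (group_id, consumer_name) pairs. Every group
    sees all partitions; within a group, each partition has one owner.
    """
    groups = defaultdict(list)
    for group_id, consumer in members:
        groups[group_id].append(consumer)

    assignment = {}
    for group_id, consumers in groups.items():
        ordered = sorted(consumers)
        for consumer in ordered:
            assignment[(group_id, consumer)] = []
        for index, partition in enumerate(partitions):
            owner = ordered[index % len(ordered)]
            assignment[(group_id, owner)].append(partition)
    return assignment
```

In this model, a client with two nodes in group "acme" covers all four partitions between them. If a stranger joins with the same group.id, one partition moves to the stranger and the client no longer sees its messages; with a different group.id the stranger gets its own copy of everything and the client is unaffected.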

Is Kafka ready for production use?

I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?

It is definitely in use at several Big Data companies already, including LinkedIn, where it was created (and later open sourced), and Tumblr. Just Tumblr by itself handles many gigabytes of messages per day. I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list, there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.

There is one critical feature I think Kafka is missing before it is ready for production:
"Flushing messages to disk if the producer can't reach any Kafka broker"
The issue was filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature would make the complete Kafka event pipeline even more robust for use cases where the producer always has to be able to send events, for example when you track pageviews or like-button clicks and you don't want to miss any events, even if all Kafka brokers are unreachable.

I must agree with Dave. Kafka is a good tool, but it is missing some basic features. Some of them can be implemented manually, but then you have to ask what Kafka actually provides. Some of the missing things are:
(As Dave said) Flushing messages to disk when the producer fails to send them.
The ability for consumers to track which messages were handled (not just consumed) and which weren't, in case of a restart.
Monitoring: a way to get the current status of the entities in the system, like the current size of the queue in the producer or the read/write rate at the brokers (these can be done, but are not part of the tool).

I have used Kafka for quite some time. Using the native Java and Python clients would be preferred.
I had to struggle a lot to find a proper node.js client. I literally rewrote my whole code many times using different clients, as they had a lot of bugs.
I finally settled on franz-kafka for node.js.
Apart from that, maintaining the consumer offsets is a bit difficult. It is also missing some good features, like the exchanges that exist in AMQP-based Apache Qpid or RabbitMQ.
Still, it's distributed, it supports offline messages, and the performance is really impressive, so I too preferred it :)

Is there a performance difference between pooling connections or channels in rabbitmq?

I'm a newbie with RabbitMQ (and programming), so sorry in advance if this is obvious. I am creating a pool to share between threads that are working on a queue, but I'm not sure if I should use connections or channels in the pool.
I know I need channels to do the actual work, but is there a performance benefit to having one channel per connection (in terms of more throughput from the queue)? Or am I better off just using a single connection per application and pooling many channels?
Note: because I'm pooling the resources, the initial cost is not a factor, as I know connections are more expensive than channels. I'm more interested in throughput.

I found this on the RabbitMQ website. It is near the bottom of the page, so I have quoted the relevant part below.
The tl;dr version is that you should have 1 connection per application and 1 channel per thread.
Connections
AMQP connections are typically long-lived. AMQP is an application
level protocol that uses TCP for reliable delivery. AMQP connections
use authentication and can be protected using TLS (SSL). When an
application no longer needs to be connected to an AMQP broker, it
should gracefully close the AMQP connection instead of abruptly
closing the underlying TCP connection.
Channels
Some applications need multiple connections to an AMQP broker.
However, it is undesirable to keep many TCP connections open at the
same time because doing so consumes system resources and makes it more
difficult to configure firewalls. AMQP 0-9-1 connections are
multiplexed with channels that can be thought of as "lightweight
connections that share a single TCP connection".
For applications that use multiple threads/processes for processing,
it is very common to open a new channel per thread/process and not
share channels between them.
Communication on a particular channel is completely separate from
communication on another channel, therefore every AMQP method also
carries a channel number that clients use to figure out which channel
the method is for (and thus, which event handler needs to be invoked,
for example).
It is advised that there is 1 channel per thread, even though they are thread safe, so you could have multiple threads sending through one channel. In terms of your application I would suggest that you stick with 1 channel per thread though.
Additionally it is advised to only have 1 consumer per channel.
These are only guidelines so you will have to do some testing to see what works best for you.
This thread has some insights here and here.
Despite all these guidelines, this post suggests that having multiple connections will most likely not affect performance, though it does not say whether that refers to the client side or the server (RabbitMQ) side. The one caveat is that more connections will of course use more system resources.
If that is not a problem and you want more throughput, it may indeed be better to have multiple connections, as this post suggests they allow more throughput. The reason seems to be that even with multiple channels, only one message goes through the connection at a time. A large message will therefore block the whole connection, and many unimportant messages on one channel may block an important message on a different channel of the same connection.
Again, resources are an issue. If one connection already uses all the bandwidth, adding a second connection will not increase performance over having two channels on the one connection. Each connection also uses more memory, CPU, and file handles; that may well not be a concern, but it might become an issue when scaling.
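The "1 connection per application, 1 channel per thread" guideline can be sketched with thread-local storage. This is an illustration under assumptions, not real client code: `FakeConnection` and `FakeChannel` are hypothetical stand-ins for whatever your client library provides, and whether a single connection object may safely be shared across threads depends on that library, so check its documentation.

```python
import itertools
import threading


class FakeChannel:
    """Stand-in for a client library's channel (hypothetical)."""

    def __init__(self, number):
        self.number = number


class FakeConnection:
    """Stand-in for a client library's connection (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._lock = threading.Lock()

    def channel(self):
        with self._lock:  # hand out a unique channel number per call
            return FakeChannel(next(self._ids))


class ChannelPerThread:
    """Share one connection; lazily open one channel per thread."""

    def __init__(self, connection):
        self._connection = connection
        self._local = threading.local()

    def get(self):
        channel = getattr(self._local, "channel", None)
        if channel is None:
            channel = self._connection.channel()
            self._local.channel = channel  # visible only to this thread
        return channel
```

Each worker thread calls `get()` whenever it needs to publish or consume; repeated calls from the same thread return the same channel, and no channel is ever handed to a second thread.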

In addition to the accepted answer:
If you have a cluster of RabbitMQ nodes with either a load-balancer in front, or a short-lived DNS (making it possible to connect to a different rabbit node each time), then a single, long-lived connection would mean that one application node works exclusively with a single RabbitMQ node. This may lead to one RabbitMQ node being more heavily utilized than the others.
The other concern mentioned above is that publishing and consuming are blocking operations, which leads to messages queueing up. Having more connections ensures that (1) the processing time for each message doesn't block other messages and (2) big messages don't block other messages.
That's why it's worth considering having a small connection pool (having in mind the resource concerns raised above)
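A small, bounded pool of this kind can be built on `queue.Queue`. The sketch below is generic: `ConnectionPool` is a hypothetical name, and `factory` is an assumed callable that would open a real connection in your client library of choice.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool; callers borrow a connection and give it back."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())  # open all connections up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use; raises queue.Empty
        # if a timeout is given and nothing is returned in time.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return it, even on error
```

A caller would write `with pool.connection() as conn: ...`. Keeping the size small caps the memory, CPU, and file-handle cost, while still letting a slow or large message occupy one connection without stalling the others.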

The "one channel per thread" rule might be a safe assumption (I say "might" because I have not done any research myself and I have no reason to doubt the documentation :) ), but beware that there is a case where it breaks:
If you use RPC with RabbitMQ Direct reply-to, then you cannot reuse the same channel to consume for another RPC request. I asked for details about that in the Google user group, and the answer I got from Michael Klishin (who seems to be actively involved in RabbitMQ development) was that
Direct Reply to is not meant to be used with channel sharing either way.
I've emailed Pivotal asking them to update their documentation to explain how amq.rabbitmq.reply-to works under the hood, and I'm still waiting for an answer (or an update).
So if you want to stick to "one channel per thread", beware that this will not work well with Direct reply-to.
