I am a newbie to Kafka. I have created sample Kafka sync producer and consumer group programs using kafka_2.9.2-0.8.1.1.
My question is: do I need to add multithreading code to the producer (as the consumer group class has) to support a huge number of requests? I read that the producer's send method is thread safe.
Will the Kafka producer take care of multithreading internally, or does the developer have to code it explicitly?
Any help would be highly appreciated.
Thanks,
Cdhar
There are two types of producers available with Kafka: (1) SyncProducer and (2) AsyncProducer. If you set the producer.type configuration to async, it uses the AsyncProducer. By default it uses the synchronous producer class.
When running in async mode, it creates a separate AsyncProducer instance per broker. Each of these AsyncProducer instances maintains its own internal background thread, called ProducerSendThread, for sending the messages.
So there is one thread running per broker, and your parallelism is based on the number of brokers available in the cluster. Adding new brokers to the cluster should therefore give you the flexibility to increase the level of parallelism while producing data with Kafka. But remember that the decision to add a new broker to your cluster should take other parameters into consideration as well.
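As a minimal sketch of the setting described above, the properties for the old 0.8 producer could be assembled like this. The class and method names are made up for illustration, and the broker list is a placeholder; only the configuration keys come from the 0.8 producer configuration.

```java
import java.util.Properties;

// Hypothetical helper: collects the settings for the 0.8 producer in async mode.
class LegacyAsyncProducerConfigSketch {

    static Properties asyncProducerProperties(String brokerList) {
        Properties props = new Properties();
        // Brokers used to bootstrap topic metadata, e.g. "host1:9092,host2:9092" (placeholder).
        props.put("metadata.broker.list", brokerList);
        // Encoder used to serialize message payloads.
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Switch from the default "sync" to "async" so sends go through a background thread.
        props.put("producer.type", "async");
        return props;
    }
}
```

With the 0.8 client on the classpath, these properties would typically be wrapped in a ProducerConfig and handed to a single Producer instance that all request threads share, so no extra threading code is needed around send.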
I have a Kafka Streams DSL application with a requirement for exactly-once processing. For this I have added the configuration
streamConfig.put("processing.guarantee", "exactly_once");
I am using Kafka 2.7.
I have 2 queries:
1. What is the difference between exactly_once and exactly_once_beta?
2. How do I test this functionality to be sure my messages are processed only once?
Thanks!
exactly_once_beta is an improvement over exactly_once. While exactly_once uses a transactional producer for each stream task (a combination of sub-topology and input partition), exactly_once_beta uses a transactional producer for each stream thread of a Kafka Streams client.
Every producer comes with separate memory buffers, a separate thread, separate network connections which might limit scaling the number of input partitions (i.e. number of tasks). A high number of producers might also cause more load on the brokers. Hence, exactly_once_beta has better scaling characteristics. You can find more details in KIP-447.
Note that exactly_once will be deprecated and exactly_once_beta will be renamed to exactly_once_v2 in Apache Kafka 3.0. See KIP-732 for more details.
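A minimal configuration sketch, assuming a Kafka Streams version that accepts the value exactly_once_beta (2.6 up to the 3.0 rename). The class name, application id and bootstrap servers are placeholders; plain string keys are used so the snippet compiles without the Kafka jars, whereas real code would normally use the StreamsConfig constants.

```java
import java.util.Properties;

// Hypothetical helper: builds Streams properties with the thread-level EOS mode.
class EosConfigSketch {

    static Properties eosProperties(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // One transactional producer per stream thread instead of one per task.
        props.put("processing.guarantee", "exactly_once_beta");
        return props;
    }
}
```

Note that exactly_once_beta requires brokers on version 2.5 or newer.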
For tests you can get inspiration from the tests in the Apache Kafka repo:
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EosIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EOSUncleanShutdownIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/streams/streams_eos_test.py
Basically, you need to create a failover scenario and verify that messages are not produced multiple times to the output topics. Note that messages may be processed multiple times, but the results in the output topics must appear as if they were only processed once. You can find a pretty good talk about exactly-once semantics that also explains the failover scenarios here: https://www.confluent.io/kafka-summit-london18/dont-repeat-yourself-introducing-exactly-once-semantics-in-apache-kafka/
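The verification step can be sketched roughly as follows. This assumes every input message carries a unique id that is copied into its output record, and that the ids were collected from the output topic by a consumer running with isolation.level=read_committed, so records from aborted transactions are not counted. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical test helper: finds ids that show up more than once in the output.
class OutputDuplicateCheck {

    static Set<String> duplicatedIds(List<String> outputIds) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String id : outputIds) {
            // add() returns false when the id was already seen.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }
}
```

After killing and restarting an instance mid-run, a test would expect this method to return an empty set and the number of output ids to match the number of input messages.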
I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback, where I receive the data from RabbitMQ, I have a Kafka producer that immediately sends the received messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages or should I create the producer inside the callback method and close it after the message is sent, which means that I am creating a new producer for each message?
It might be a naive question but I am new to Kafka and so far I did not find a definitive answer on the internet.
EDIT : I am using a Java Kafka client.
Creating a Kafka producer is an expensive operation, so using a single Kafka producer as a singleton is good practice for both performance and resource utilization.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka based clients (confluent-dotnet, confluent-python etc.), I can link this related issue with this quote from the issue:
Yes, creating a singleton service like that is a good pattern. you definitely should not create a producer each time you want to produce a message - it is approximately 500,000 times less efficient.
The Kafka producer is stateful. It holds metadata (periodically synced from the brokers), a send buffer for messages, and so on. Creating a producer for each message is therefore impractical.
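A minimal sketch of the "create once, reuse in every callback" pattern. To keep it compilable without the Kafka jars, the holder is generic over any AutoCloseable; SharedProducer is a made-up name, and in a real application the factory would be something like a lambda that constructs a KafkaProducer from your properties.

```java
import java.util.function.Supplier;

// Hypothetical holder: creates the expensive producer once and hands out the same instance.
final class SharedProducer<P extends AutoCloseable> implements AutoCloseable {
    private final Supplier<P> factory;
    private P instance; // guarded by "this"

    SharedProducer(Supplier<P> factory) {
        this.factory = factory;
    }

    // Called from every delivery callback; the factory runs only on the first call.
    synchronized P get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    // Called once on application shutdown, not after each message.
    @Override
    public synchronized void close() throws Exception {
        if (instance != null) {
            instance.close();
            instance = null;
        }
    }
}
```

The holder would be created outside the RabbitMQ callback, the callback would only call get() and send, and close() would run once when the application stops. A plain final field holding the producer works just as well if lazy creation is not needed.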
In our application we have multiple topics, where some topics will be created with 16 partitions and some with 1 partition. Is there any spring.cloud.stream.kafka.bindings property/option available to achieve this?
Maybe this helps: num.stream.threads creating idle threads
If there is one KafkaStreams instance, this is not possible because Kafka Streams only has a global config. Hence, you would need multiple applications, i.e., multiple KafkaStreams instances that process different input topics, to configure each with a different number of threads. Following the answer from above, it seems that spring-cloud-stream can create multiple KafkaStreams clients to support what you want.
However, I am not sure why you would want or need this (but I am also not exactly sure how spring-cloud-stream translates your program). In the end, parallelization is based on tasks, so for an input topic with a single partition only one of your threads will get the corresponding task assigned. Thus, there is no overhead you need to worry about.
For more details check out: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
There are several partition properties available. For example,
spring.cloud.stream.bindings.func-out-0.producer.partitionKeyExpression=payload.id
spring.cloud.stream.bindings.func-out-0.producer.partition
You can get more information on both producer and consumer configuration properties here
I was reading about the elastic scaling features of Kafka Streams.
This means Kafka Streams can hand over a task to another instance, and the task state is recreated from the changelog. It is mentioned that instances coordinate with each other to achieve a rebalance.
But no detail is given on how exactly the rebalance works.
Is it the same as how a consumer group works, or is it a different mechanism, given that Kafka Streams instances are not exactly like consumers in a consumer group?
Visit this article for a more thorough explanation.
..."In a nutshell, running instances of your application will automatically become aware of new instances joining the group, and will split the work with them; and vice versa, if any running instances are leaving the group (e.g. because they were stopped or they failed), then the remaining instances will become aware of that, too, and will take over their work. More specifically, when you are launching instances of your Streams API based application, these instances will share the same Kafka consumer group id. The group.id is a setting of Kafka’s consumer configuration, and for a Streams API based application this consumer group id is derived from the application.id setting in the Kafka Streams configuration."...
Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of information that our Flink is not configured to handle. We don't want to configure our Flink pipeline and cluster to always support the maximum load we can have; we want to scale dynamically according to the load (budget reasons $$$).
Solutions we thought of
One way to do so is to add/remove nodes to the Flink cluster and change the parallelism of the Flink pipeline operators. This will require stopping the Flink job with a snapshot, reconfiguring the parallelism and restarting with new parallelism.
This would be great but we cannot allow ourselves the downtime it produces. We have to scale up/down without downtime.
If we would use regular Kafka consumers it would be as simple as adding a consumer (assuming we have enough Kafka partitions) and Kafka would redistribute the topic partitions between all the consumers.
The Flink Kafka consumer manages the partition assignment and the offset on its own which allows exactly-once semantics (we don't need it). The drawback is that a single Flink job always uses all the topic partitions.
We thought we could create another instance of Flink that would subscribe to the same topic with the same group and let Kafka distribute the partitions between them. But for that we would need the Kafka Flink consumer to let Kafka manage which partitions are assigned to which consumer.
What we are looking for
We couldn't find a library that contains such a consumer or a configuration in the existing consumer. We could write it on our own (not so difficult) but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology to be able to write Kafka messages you can't process quickly enough to a second overflow Kafka topic. Both input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing or maybe you would have a flag in the topology that you can externally set while the topology is running. That's a design detail you can work through that has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion while writing the rest of the messages that can't be handled to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the writing to the overflow topic happens very early in the topology processing, you could chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
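The diversion decision itself could look roughly like the sketch below. It is only an illustration of the two triggers mentioned above (a backlog delay threshold and an externally set flag); the class name, parameter names and the use of the message timestamp as the backlog measure are assumptions, not part of any Flink or Kafka API.

```java
// Hypothetical decision helper, evaluated per message early in the topology.
class OverflowDecision {

    // Returns true when the message should be written to the overflow topic
    // instead of being processed by this instance.
    static boolean shouldDivert(long messageTimestampMs, long nowMs,
                                long maxDelayMs, boolean forceOverflow) {
        // Operator override: divert regardless of the measured delay.
        if (forceOverflow) {
            return true;
        }
        // Backlog delay: how far behind "now" the message being read is.
        long delayMs = nowMs - messageTimestampMs;
        return delayMs > maxDelayMs;
    }
}
```

A message that is 5 seconds old with a 2 second threshold would be diverted, while one that is 1 second old would be processed locally. Because each instance only needs its input topic, overflow topic and threshold, the same job can be chained without restarting the first one.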