I have the situation where my producers currently do not use compression. The topic is configured with compression.type=lz4.
Say that I wanted to, in the future, switch this configuration to be on the producer side, so that compression.type=producer and my producers use e.g. lz4.
My questions are:
Are there any special considerations for this scenario?
What if I were to choose another compression algorithm on the producer side down the road, e.g. zstd? Does Kafka retain the required metadata for this to be possible, or would I need to reprocess my topics so that a single compression algorithm would be used during its lifetime?
Related
When configuring ksqlDB I can set the option ksql.streams.producer.compression.type, which enables compression for ksqlDB's internal producers. Thus when I create a ksqlDB stream, its output topic will be compressed with the selected compression type.
However, as far as I understand, compression performance is heavily impacted by how much batching the producer does. Therefore, I would like to configure the batch.size and linger.ms parameters for ksqlDB's producers. Does anyone know if and how these parameters can be set for ksqlDB?
Thanks to Matthias J Sax for answering my question on the Confluent Community Slack channel: https://app.slack.com/client/T47H7EWH0/threads?cdn_fallback=1
There is an info box in the documentation that explains it pretty well:
The underlying producer and consumer clients in ksqlDB's server can be
modified with any valid properties. Simply use the form
ksql.streams.producer.xxx, ksql.streams.consumer.xxx to pass the
property through. For example, ksql.streams.producer.compression.type
sets the compression type on the producer.
Source: https://docs.ksqldb.io/en/latest/reference/server-configuration/
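Applied to the batching question, the prefix rule quoted above suggests that batch.size and linger.ms can be passed the same way as compression.type. The following is only a sketch of how the property names are formed; the class name and the values 65536 and 50 are illustrative assumptions, not recommendations, and the printed lines are what one would place in the ksqlDB server properties file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KsqlProducerOverrides {
    // Prefix taken from the documentation quote above.
    static final String PRODUCER_PREFIX = "ksql.streams.producer.";

    // Turns plain Kafka producer settings into ksqlDB server property names.
    static Map<String, String> withPrefix(Map<String, String> producerProps) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : producerProps.entrySet()) {
            out.put(PRODUCER_PREFIX + e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> producerProps = new LinkedHashMap<>();
        producerProps.put("compression.type", "lz4");
        producerProps.put("batch.size", "65536"); // illustrative value (bytes)
        producerProps.put("linger.ms", "50");     // illustrative value (milliseconds)

        // Prints one "name=value" line per setting, in insertion order.
        withPrefix(producerProps).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```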
I have a Kafka Streams DSL application with a requirement for exactly-once processing. For this I have added the configuration
streamConfig.put("processing.guarantee", "exactly_once");
I am using Kafka 2.7.
I have 2 questions:
What's the difference between exactly_once and exactly_once_beta?
How do I test this functionality to be sure my messages are processed only once?
Thanks!
exactly_once_beta is an improvement over exactly_once. While exactly_once uses a transactional producer for each stream task (combination of sub-topology and input partition), exactly_once_beta uses a transactional producer for each stream thread of a Kafka Streams client.
Every producer comes with separate memory buffers, a separate thread, and separate network connections, which might limit scaling the number of input partitions (i.e. number of tasks). A high number of producers might also cause more load on the brokers. Hence, exactly_once_beta has better scaling characteristics. You can find more details in KIP-447.
Note that exactly_once will be deprecated and exactly_once_beta will be renamed to exactly_once_v2 in Apache Kafka 3.0. See KIP-732 for more details.
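As a minimal sketch, switching between the two modes is a one-line configuration change. The example below uses plain string keys so that it runs without the Kafka libraries on the classpath; in a real application the same Properties object would be handed to KafkaStreams. The class name, the application id and the bootstrap address are placeholders chosen for illustration.

```java
import java.util.Properties;

final class EosConfigSketch {
    // Builds a Streams configuration with the requested processing guarantee.
    static Properties streamsConfig(String guarantee) {
        Properties props = new Properties();
        props.put("application.id", "eos-demo-app");      // placeholder
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        // "exactly_once" = one transactional producer per task,
        // "exactly_once_beta" = one transactional producer per stream thread.
        props.put("processing.guarantee", guarantee);
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsConfig("exactly_once_beta");
        System.out.println(props.getProperty("processing.guarantee")); // prints "exactly_once_beta"
    }
}
```

Note that exactly_once_beta requires brokers on version 2.5 or newer.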
For tests you can get inspiration from the tests in the Apache Kafka repo:
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EosIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EOSUncleanShutdownIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/streams/streams_eos_test.py
Basically, you need to create a failover scenario and verify that messages are not produced multiple times to the output topics. Note that messages may be processed multiple times, but the results in the output topics must appear as if they were only processed once. You can find a pretty good talk about exactly-once semantics that also explains the failover scenarios here: https://www.confluent.io/kafka-summit-london18/dont-repeat-yourself-introducing-exactly-once-semantics-in-apache-kafka/
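The verification step of such a failover test could look roughly like the sketch below. It assumes the test has already read the output topic with a consumer using isolation.level=read_committed and collected one unique id per input record into a list; the class, the method and the ids are hypothetical and only illustrate the duplicate check itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExactlyOnceCheck {
    // Returns the record ids that occur more than once in the output topic.
    static List<String> duplicates(List<String> outputRecordIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : outputRecordIds) {
            counts.merge(id, 1, Integer::sum);
        }
        List<String> dups = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                dups.add(e.getKey());
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Hypothetical ids read back after killing and restarting an instance.
        List<String> afterFailover = Arrays.asList("order-1", "order-2", "order-2", "order-3");
        System.out.println(duplicates(afterFailover)); // prints "[order-2]"
    }
}
```

With exactly-once enabled the returned list should be empty; with at-least-once processing a failover at the wrong moment is expected to produce entries.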
My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver specific task data from multiple producers to a particular consumer working on the task (both are Docker containers run in k8s). The relation is many-to-many: any producer can create a data packet for any consumer. Each consumer is processing ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160b each per second (from different producers).
Current solution:
In our current solution, each producer has a cache of task: (IP:PORT) pairs for the consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer, well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ; suggest your best)?
Edit: Would 1000 streams, which equals 100 000 messages per second, be feasible? (Throughput for 1000 streams is 16 Mb/s.)
Edit 2: Fixed packet size to 160b (typo)
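The figures in the edits can be checked with a short calculation. This sketch assumes that "160b" means 160 bytes per message, which the question does not state explicitly; the class and method names are made up for the example.

```java
final class ThroughputEstimate {
    // Total payload per second, ignoring any protocol overhead.
    static long bytesPerSecond(long streams, long messagesPerStreamPerSecond, long bytesPerMessage) {
        return streams * messagesPerStreamPerSecond * bytesPerMessage;
    }

    public static void main(String[] args) {
        long messagesPerSecond = 1000L * 100L;
        long total = bytesPerSecond(1000, 100, 160); // assumes 160 bytes per message
        System.out.println(messagesPerSecond + " msg/s");  // prints "100000 msg/s"
        System.out.println(total + " B/s");                // prints "16000000 B/s"
        System.out.println(total / 1_000_000 + " MB/s");   // prints "16 MB/s"
    }
}
```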
Unless you need disk persistence, do not even look in the message broker direction. You are just adding one problem to another. Direct network code is a proper way to solve audio broadcast. Now, if your code is messy and you want a simplified programming model, a good alternative to raw sockets is the ZeroMQ library. It gives you all the message broker functionality you care about: a) discrete messaging instead of streams, b) client discoverability; without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second with 160kb messages is a lot of data, and it comes to 1.6 Gb/sec even without any messaging protocol on top of it. In general, Kafka shines at throughput of small messages, as it batches messages on many layers. That said, Kafka's sustained performance is usually constrained by disk speed, as Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large and you need to both write and read messages at the same time, so I don't see it happening without a large cluster installation: your problem is actual data throughput, not the number of messages.
Because you are data-limited, even other classic MQ software like ActiveMQ, IBM MQ etc. is actually able to cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and are not able to hit the message throughput of Kafka when handling small messages. But as long as you are using large non-persistent messages (and a proper broker configuration), you can expect decent performance in MB/sec from those too. Classic brokers will, with proper configuration, directly connect a producer's socket to a consumer's socket without hitting the disk. In contrast, Kafka will always persist to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" is just a full-circle turn to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over this connection. If that is all you need, ZeroMQ is made for this.
There is also a messaging protocol called MQTT, which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From a Kafka perspective, each stream in your problem can map to one topic, so there is one producer-consumer pair per topic.
Con: If you have lots of streams, you will end up with lots of topics, and IMO the solution can get messy here too as the number of topics grows.
An alternative approach
Alternatively, the best way is to map multiple streams to one topic where each stream is separated by a key (like you use IP:Port combination) and then have multiple consumers each subscribing to a specific set of partition(s) as determined by the key. Partitions are the point of scalability in Kafka.
Con: Though you can increase the no. of partitions, you cannot decrease them.
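To illustrate how a key pins a stream to a partition, here is a simplified sketch. It is not Kafka's actual default partitioner (which hashes the serialized key bytes with murmur2); String.hashCode() is used here only as a stand-in, and the class name, the keys and the partition count of 12 are made up for the example.

```java
final class KeyPartitionSketch {
    // Simplified stand-in for a partitioner: same key always yields the same partition.
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 12; // illustrative
        String[] streamKeys = {"10.0.0.5:9000", "10.0.0.6:9000", "10.0.0.7:9001"};
        for (String key : streamKeys) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Because the mapping depends on the number of partitions, adding partitions later can move existing keys to different partitions, which is worth keeping in mind next to the con above.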
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message is not lost once it has been received and processed; it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group do a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer will then get a set of keys, in other words a set of streams or, in terms of your current solution, a set of one or more IP:Port pairs.
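The way a group shares partitions can be pictured with the following sketch. It distributes partitions round-robin over the group members, which is a simplification and not the code of any of Kafka's built-in assignors; the class name and the consumer names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupAssignmentSketch {
    // Hands out partitions 0..numPartitions-1 to the consumers in turn.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("consumer-a", "consumer-b", "consumer-c");
        System.out.println(assign(group, 6));
        // prints "{consumer-a=[0, 3], consumer-b=[1, 4], consumer-c=[2, 5]}"
    }
}
```

Each partition has exactly one owner within the group, so all messages for a given key (stream) end up at a single consumer.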
If Kafka producer compression is set (e.g. to gzip), and the broker configuration is also set to the same codec, will the broker re-compress any messages from the producer, or recognise that it's the same codec and skip any broker-side re-compression?
I'm aware that the broker can be configured to keep the producer's codec via the 'producer' setting. However, in our scenario we may have producers (out of our control) that do not set any compression, so we'd like to configure the broker to have default compression enabled. For the producers that are in our control, we'd prefer to use producer compression to save on network bandwidth and also to reduce load on the broker.
Setting topic compression to producer is equivalent to setting it to the same value you use in your producers.
Thus, to achieve what you need, you just need to set the topic compression to the same algorithm you use in your producers. External producers that use the same compression algorithm will behave the same as your internal producers, and the rest will trigger a potential decompression/recompression.
This article sums it up nicely:
https://newbedev.com/if-i-set-compression-type-at-topic-level-and-producer-level-which-takes-precedence
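The rule described in this answer can be written down as a small decision function. This is a sketch of the rule as stated here, not broker source code, and it ignores other reasons a broker might rewrite a batch; the class and method names are made up, and "none" stands for an uncompressed producer.

```java
final class BrokerCompressionSketch {
    // true if the broker has to recompress a batch before writing it to the log.
    static boolean brokerRecompresses(String topicCompressionType, String producerCodec) {
        if (topicCompressionType.equals("producer")) {
            return false; // keep whatever codec the producer used
        }
        return !topicCompressionType.equals(producerCodec);
    }

    public static void main(String[] args) {
        System.out.println(brokerRecompresses("gzip", "gzip"));     // prints "false"
        System.out.println(brokerRecompresses("gzip", "none"));     // prints "true"
        System.out.println(brokerRecompresses("producer", "none")); // prints "false"
        System.out.println(brokerRecompresses("lz4", "zstd"));      // prints "true"
    }
}
```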
We have a use case where data loss is acceptable (think 30-50% loss acceptable). In an effort to reduce costs, we want to know if it is possible to configure Kafka with a replication factor of 1 such that consumers and producers can recover from broker failures by simply consuming from and producing to the available partitions.
If this is possible, what are the configurations that need to be set?
There are other broker technologies that inherently behave this way; however, we would like to avoid introducing another technology, as Kafka is already part of our ecosystem.
If you create a new topic via bin/kafka-topics.sh you need to specify parameter --replication-factor; just set it to 1 to disable replication.
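For existing topics, bin/kafka-topics.sh --alter cannot change the replication factor (it only changes the partition count); use bin/kafka-reassign-partitions.sh with a reassignment plan that lists a single replica per partition.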
For producers and consumers you might need to do some extra exception handling. For example, if you specify a dedicated partition when you write a record and the broker is not reachable, you might need to take care of this (maybe just skip this write, or whatever is appropriate). But there is no specific configuration you need to set for your clients.
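The "just skip this write" idea could be sketched as follows. PartitionWriter is a hypothetical interface standing in for whatever code actually sends a record; it is not part of the Kafka client API. With the real producer, failures typically surface through the send callback or the returned Future rather than as a direct exception.

```java
final class SkipOnFailureSketch {
    // Hypothetical stand-in for "write one record to one specific partition".
    interface PartitionWriter {
        void write(int partition, String value) throws Exception;
    }

    // Tries the write once; returns false (record dropped) if it fails.
    static boolean writeOrSkip(PartitionWriter writer, int partition, String value) {
        try {
            writer.write(partition, value);
            return true;
        } catch (Exception e) {
            // Data loss is acceptable in this use case, so the record is simply skipped.
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulates a cluster where the broker hosting partition 1 is down.
        PartitionWriter writer = (partition, value) -> {
            if (partition == 1) {
                throw new IllegalStateException("broker for partition 1 not reachable");
            }
        };
        for (int partition = 0; partition < 3; partition++) {
            boolean written = writeOrSkip(writer, partition, "payload");
            System.out.println("partition " + partition + ": " + (written ? "written" : "skipped"));
        }
    }
}
```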