How to use replica.high.watermark.checkpoint.interval.ms

I've been looking for a way to reduce duplicates, or eliminate them entirely, and what I found is an interesting property:
replica.high.watermark.checkpoint.interval.ms = 5000 (default)
The frequency with which the high watermark is saved out to disk
I was also going through a random link which says:
replica.high.watermark.checkpoint.interval.ms property can affect throughput. Also, we can mark the last point where we read information while reading from a partition. In this way, we have a checkpoint from which to move forward without having to reread prior data, if we have to go back and locate the missing data. So, we will never lose a message, if we set the checkpoint watermark for every event.
First, my question is: how do I use replica.high.watermark.checkpoint.interval.ms?
Second, is there any way to reduce duplicates using this property?

As far as I know, the high watermark indicates the last record that consumers can see, as it is the last record that has been fully replicated for that partition. This seems to indicate that it is used to prevent a consumer from consuming a record that is not yet fully replicated across all of its brokers, so that you don't consume something that could end up lost, leading to a bad state.
Changing the interval at which this is persisted does not seem like it would reduce message duplication. It could, however, have a slight performance impact (a smaller interval means more frequent disk writes).
For reducing duplication, I'd probably look at the Kafka exactly-once semantics introduced in 0.11.
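To make that concrete, here is a minimal sketch of an idempotent/transactional producer using the plain Java client; the bootstrap address, topic name and transactional.id are placeholder values I made up for illustration, not anything from the question:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ExactlyOnceProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Idempotence lets the broker deduplicate producer retries (implies acks=all).
        props.put("enable.idempotence", "true");
        // A transactional.id additionally enables atomic, all-or-nothing writes across records.
        props.put("transactional.id", "example-tx-1");       // hypothetical id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            producer.commitTransaction();
        }
    }
}

Note that downstream consumers only get the full benefit if they read with isolation.level=read_committed, and duplicates introduced by the consuming application itself (e.g. reprocessing after a rebalance) still have to be handled separately.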

Related

linger.ms and batch.size having little to no effect on batching at producers (transactional)

We are relatively new to Kafka and are struggling with getting any sort of decent throughput at our services.
Commit latency (replication=3 and acks=all) appears to be severely throttling our throughput, but that's a separate problem.
To try and compensate for this, we're looking to see if we can encourage more batching at the (async) producer. For our requirements, producers need to be transactional with "exactly once semantics" enabled.
However, even when setting linger.ms and batch.size to very high values (e.g. 10 seconds and 1M respectively), we are not seeing any difference.
From what we can tell, we only see any sort of batching on the very first request; subsequent requests seem to be sent out immediately, regardless of these two settings.
There appears to be a check for any flushes that are in progress in the RecordAccumulator, which seems to always return true after the first request is sent, and which we suspect may be the cause here.
Again, we are pretty fresh to Kafka, so our understanding of what these two configuration items do in the context of transactional producers may be incomplete.
Are we correct in expecting that batching would be improved with tuning these two settings when used with transactional producers? Is batching even the correct approach in dealing with our original latency problem here?
We are using Spring-Kafka (2.5.5) / Kafka Client (2.5.1)
Any help would be appreciated. Thanks.
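For anyone trying to reproduce this, a sketch of the kind of configuration described above, using Spring Boot's spring.kafka property names (the values mirror the ones mentioned in the question; the transaction-id prefix is a made-up placeholder, and this is an illustration rather than a recommendation):

spring.kafka.producer.transaction-id-prefix=tx-
spring.kafka.producer.acks=all
spring.kafka.producer.batch-size=1000000
# linger.ms has no dedicated Spring Boot property, so it typically goes through the pass-through map
spring.kafka.producer.properties.linger.ms=10000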

Avoiding small files from Kafka connect using HDFS connector sink in distributed mode

We have a topic with 3 partitions receiving messages at a rate of 1 msg per second, and I am using the HDFS connector to write the data to HDFS in Avro format (the default). It generates files only a few KB in size, so I tried altering the properties below in the HDFS connector configuration:
"flush.size":"5000",
"rotate.interval.ms":"7200000"
but the output is still small files, so I need clarity on the following things to solve this issue:
Is the flush.size property mandatory? In case we do not specify flush.size, how does the data get flushed?
If we set the flush size to 5000 and the rotate interval to 2 hours, it flushes the data every 2 hours for the first 3 intervals, but after that it flushes data at seemingly random times. Please find the file creation timings (19:14, 21:14, 23:15, 01:15, 06:59, 08:59, 12:40, 14:40); note the mismatched intervals. Is this because one of the mentioned properties overrides the other? That takes me to the third question.
What is the order of precedence for flushing if we specify all of the following properties: flush.size, rotate.interval.ms, and rotate.schedule.interval.ms?
Increasing the message rate and reducing the partition count does show an increase in the size of the data being flushed, but is that the only way to have control over the small files? How can we handle these properties if the rate of the incoming events varies and is not stable?
It would be a great help if you could share documentation regarding handling small files in Kafka Connect with the HDFS connector. Thank you.
If you are using a TimeBasedPartitioner and the messages will not consistently have increasing timestamps, then you will end up with the single writer task dumping files whenever it sees a message with an earlier timestamp within the rotate.interval.ms interval of reading any given record.
If you want consistent bihourly partition windows, then you should set rotate.interval.ms=-1 to disable it, and set rotate.schedule.interval.ms to some reasonable value that is within the partition duration window.
E.g. you have 7200 messages every 2 hours, and it's not clear how large each message is, but let's say 1MB. Then, you'd be holding ~7GB of data in a buffer, and you need to adjust your Connect heap sizes to hold that much data.
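A sketch of how the relevant sink properties might look for that setup (values are illustrative only; flush.size is deliberately set high so that the scheduled rotation, not the record count, is what closes files, and rotate.schedule.interval.ms also needs the connector's timezone property to be set):
"flush.size": "1000000",
"rotate.interval.ms": "-1",
"rotate.schedule.interval.ms": "7200000",
"timezone": "UTC"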
The order of precedence is
scheduled rotation, starting from the top of the hour
flush size, or "message-based" time rotation (rotate.interval.ms), whichever occurs first, or when a record is seen that is "before" the start of the current batch
And I believe flush size is mandatory for the storage connectors
Overall, systems such as Uber's Hudi or the older Kafka-HDFS tool Camus Sweeper are better equipped to handle small files. Connect sink tasks only care about consuming from Kafka and writing to downstream systems; the framework itself doesn't recognize that Hadoop prefers larger files.

Implications of keeping linger.ms at 0

We are using kafka 0.10.2.1. The documentation specifies that a buffer is available to send even if it isn't full-
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However if you want to reduce the number of requests you can set linger.ms to something greater than 0.
However, it also says that the producer will attempt to batch requests even if linger time is set to 0ms-
Note that records that arrive close together in time will generally batch together even with linger.ms=0 so under heavy load batching will occur regardless of the linger configuration; however setting this to something larger than 0 can lead to fewer, more efficient requests when not under maximal load at the cost of a small amount of latency.
Intuitively, it seems that any kind of batching would require some linger time, and the only way to achieve a linger time of 0 would be to make the broker call synchronised. Clearly, keeping the linger time at 0 doesn't appear to harm performance as much as blocking on the send call, but seems to have some impact on performance. Can someone clarify what the docs are saying above?
The docs are saying that even though you set linger time to 0, you might end up with a little bit of batching under load since records are getting added to be sent faster than the send thread can dispatch them. This setting is optimizing for minimal latency. If the measure of performance you really care about is throughput, you'd increase the linger time a bit to batch more and that's what the docs are getting at. Not so much to do with synchronous send in this case. More in depth info
With linger.ms=0 the record is sent as soon as possible, and with many requests this may impact performance. Forcing a little wait by increasing linger.ms under moderate/high load will make better use of each batch and increase throughput. This also depends on the record size: the bigger the records, the fewer fit in a batch (batch.size defaults to 16KB).
Basically it is a trade-off between the number of requests and throughput, and it really depends on your scenario. However, sending immediately does not take full advantage of batching and compression (if enabled), so I suggest running some benchmarks with different values of linger.ms such as 0/5/10/50/200.
In general I would suggest setting linger.ms > 0.
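As a rough illustration of where to start (arbitrary values, not recommendations; measure with your own payload sizes):

# wait up to 20 ms so requests carry fuller batches
linger.ms=20
# per-partition batch size in bytes (default 16384)
batch.size=65536
# compression works better on larger batches
compression.type=lz4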
References:
KIP-91
Nice tutorial from cloudurable
Official Docs v0.10.2
I am by no means a Kafka expert, but these things should be explained more simply, otherwise all the metrics you read are not going to be understood.
The first thing I want to point out is that a sender thread, which is not the thread you call producer::send on, sends batches of messages to the cluster. Now, if your current batch has a single message inside it, that does not break the rule: it still sends batches, it just happens that there is a single message in the current batch. There are metrics that allow you to see how full, on average, a batch is before it is sent.
If many of the batches the sender sends are more empty than full, that is not a good thing: the work it has to do to actually place a request is much more expensive than sending the message itself, and that is why batching exists to begin with.
In such cases, linger.ms might help, because it allows a batch to stay a little longer in the RecordAccumulator, and thus more batching will happen.

Apache Kafka persist all data

When using Kafka as an event store, how is it possible to configure the logs never to lose data (v0.10.0.0) ?
I have seen the (old?) log.retention.hours, and I have been considering playing with compaction keys, but is there simply an option for kafka never to delete messages ?
Or is the best option to put a ridiculously high value for the retention period ?
You don't have a better option than using a ridiculously high value for the retention period.
Fair warning: using an infinite retention will probably hurt you a bit.
For example, the default behaviour only allows a new subscriber to start from the start or the end of a topic, which will be at least annoying from an event sourcing perspective.
Also, Kafka, if used at scale (let's say tens of thousands of messages per second), benefits greatly from high-performance storage, the cost of which will be ridiculously high with an eternal retention policy.
FYI, Kafka provides tools (e.g. Kafka Connect) to easily persist data to cheaper data stores.
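For reference, the topic-level overrides that amount to "keep forever" look something like this (-1 disables the respective limit; double-check the names against the topic configuration docs for your broker version, since the question mentions 0.10.0.0):

retention.ms=-1
retention.bytes=-1
# or, for a changelog-style topic, keep only the latest record per key:
cleanup.policy=compact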
Update: It’s Okay To Store Data In Apache Kafka
Obviously this is possible, if you just set the retention to “forever” or enable log compaction on a topic, then data will be kept for all time. But I think the question people are really asking, is less whether this will work, and more whether it is something that is totally insane to do.
The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases, here’s a few:
For people concerned with data replaying and the disk cost of eternally retained messages, I just wanted to share a few things.
Data replaying:
You can seek your consumer to a given offset. It is even possible to query the offset for a given timestamp. Then, if your consumer doesn't need to know all the data from the beginning and a subset of the data is enough, you can use this.
I use kafka java libs, eg: kafka-clients. See:
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes(java.util.Map)
and
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,%20long)
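To make the replay idea concrete, a minimal sketch using exactly those two methods; the topic name, partition and the 24-hour window are placeholders I picked for illustration:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromTimestampSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "replay-example");             // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0);   // hypothetical topic/partition
            consumer.assign(Collections.singletonList(tp));

            // Ask the broker for the earliest offset whose timestamp is >= "24 hours ago"...
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(tp, System.currentTimeMillis() - 86_400_000L);
            OffsetAndTimestamp found = consumer.offsetsForTimes(query).get(tp);

            // ...and rewind to it; null means there is no record at or after that timestamp.
            if (found != null) {
                consumer.seek(tp, found.offset());
            }
            // poll() from here on returns only records from that point forward.
        }
    }
}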
Disk cost:
You can at least minimize disk space usage a lot by using something like Avro (https://avro.apache.org/docs/current/) and turning compaction on.
Maybe there is a way to use symbolic links to separate between file systems. But that is only an untried idea.

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in Cassandra as a queue? I don't think the strategy I use in MySQL works, i.e. given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on DataStax that describes the pitfalls and how to get around them; however, I'm not sure what the impact of having lots of tombstones lying around is. How long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis' blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue; use a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only running compactions you would still have dead-to-live ratios of tens of thousands or more. And even with a GC grace of zero, tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about how Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill up with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time, you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages by time of arrival (for example with a TIMEUUID clustering key) and can somehow keep track of the newest message that has been delivered, you can do a query to find all messages after that message. That would mean less trawling through tombstones for Cassandra, but it is no panacea.
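To make the sharding and time-bucketing mitigation concrete, a CQL sketch (the table and column names, the shard count, and the bucket format are all made up for illustration, not taken from the question):

CREATE TABLE queue_messages (
    bucket text,       -- coarse time bucket, e.g. '2013-07-01-14' for one hour (hypothetical format)
    shard int,         -- e.g. crc32(message) % 60, spreads each bucket across the cluster
    added timeuuid,    -- time of arrival, gives ordering inside a partition
    message text,
    PRIMARY KEY ((bucket, shard), added)
);

-- A consumer scans each shard of the current bucket, remembering the last timeuuid it saw:
SELECT added, message FROM queue_messages
WHERE bucket = '2013-07-01-14' AND shard = 17
  AND added > minTimeuuid('2013-07-01 14:00:00');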
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and its compare-and-swap features, there is no way to make that work correctly. To implement a lock you need to read the value of the column, check that it's not locked, then write that it should now be locked. Even with consistency level ALL, another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do this atomically, but at the cost of performance.
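For reference, that compare-and-swap surfaces in CQL as a lightweight transaction (the IF clause); a sketch, assuming a Cassandra table keyed by id that mirrors the schema in the question:

UPDATE message_queue SET sending = true
WHERE id = 42
IF sending = false;
-- Only one of the competing workers sees [applied] = true; the others must back off,
-- and every such conditional write pays an extra round of coordination (Paxos).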
There are a couple more answers here on Stack Overflow about Cassandra and queues; read them (start with this one: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds).
The grace period can be configured. By default it is 10 days:
gc_grace_seconds
(Default: 864000 [10 days]) Specifies the time to wait before garbage collecting tombstones (deletion markers). The default value allows a great deal of time for consistency to be achieved prior to deletion. In many deployments this interval can be reduced, and in a single-node cluster it can be safely set to zero. When using CLI, use gc_grace instead of gc_grace_seconds.
Taken from the documentation.
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker from processing one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.