Implication of setting log.retention.hours to a very high number - apache-kafka

I'm researching the possibility of using Kafka as the main storage for an event sourcing pattern. I'm having a hard time understanding if it is a good idea to store things in Kafka, longer term, or why not.
What would be the implications of simply setting log.retention.hours to a very large number, effectively turning Kafka into a permanent storage? As I've understood it - "Kafka's performance is effectively constant with respect to data size so retaining lots of data is not a problem."
That said, I also get the sense that this is not a common use case for Kafka, so there might be some limitation that I'm not understanding. I'm completely open to this being a bad idea, but I would like to understand why.

Related

How to Partition a Queue in a distributed system

This problem occurred to me a while ago; unfortunately, I could not find the answer I was looking for on the web. Here is the problem statement:
Consider a simple producer-consumer environment where we only have one
producer writing to a queue and one consumer reading from it. Now
since the objects written on the queue are quite large in size and our
available resources are not much on our current machine, we decided to
implement a distributed queue system where the data inside the queue
is partitioned among multiple nodes. It is important to us that the
total ordering is conserved while pushing and popping the data,
meaning that from the point of a user this distributed queue acts just
like a single unified queue.
Before giving a solution to this problem, we have to ask whether high availability or partition tolerance is more important to us. I believe both versions pose interesting challenges, and I thought such a question must surely have been raised before. However, after searching for existing solutions, I could not find a complete and well-thought-out answer from an algorithmic or scientific point of view. Most of what I found were engineering and high-level approaches leveraging tools like Kafka, RabbitMQ, Redis, etc.
So the problem remains, and I would be thankful if you could share your designs, algorithms and thoughts on this problem, or point me to a scientific journal or article that has already tackled it.
This can be one of the ways in which the above can be achieved. Here the partitioning is achieved in the round-robin fashion.
To achieve high availability, you can have partition replicas.
Pros:
Adding replicas makes the system highly available.
Multiple consumer groups can be implemented.
Cons:
The route table becomes a single point of failure; redundancy can be achieved by keeping it in DynamoDB and using consistent reads.
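The round-robin idea above can be sketched in a few lines. This is only a toy model under stated assumptions: each partition is a local deque standing in for a remote node, and the class name `RoundRobinQueue` and its two counters (which play the role of the route table) are hypothetical. Replication and failure handling are left out.

```python
from collections import deque


class RoundRobinQueue:
    """FIFO queue whose items are spread over several partitions.

    Hypothetical sketch: every partition is an in-process deque standing in
    for a remote node; the two counters act as the route table.
    """

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]
        self.pushed = 0  # total number of items pushed so far
        self.popped = 0  # total number of items popped so far

    def push(self, item):
        # The i-th pushed item always lands on partition i mod N.
        self.partitions[self.pushed % len(self.partitions)].append(item)
        self.pushed += 1

    def pop(self):
        if self.popped == self.pushed:
            raise IndexError("pop from empty queue")
        # The i-th pop reads partition i mod N, so it returns the i-th push.
        item = self.partitions[self.popped % len(self.partitions)].popleft()
        self.popped += 1
        return item


queue = RoundRobinQueue(num_partitions=3)
for event in ["a", "b", "c", "d", "e"]:
    queue.push(event)
drained = [queue.pop() for _ in range(5)]
```

Because pushes and pops walk the partitions in the same order, the caller sees a single FIFO queue even though no partition holds all the data. The two counters are exactly the state that would need the redundancy mentioned under "Cons".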

Size of the Kafka Streams In Memory Store

I am doing an aggregation on a Kafka topic stream and saving the result to an in-memory state store. I would like to know the exact size of the accumulated in-memory data. Is it possible to find this?
I looked through the JMX metrics in JConsole and Confluent Control Center, but nothing seemed relevant. Is there anything I can use to find this out?
You can get the number of stored key-value-pairs of an in-memory store, via KeyValueStore#approximateNumEntries() (for the default in-memory-store implementation, this number is actually accurate). If you can estimate the byte size per key-value pair, you can do the math.
However, estimating the byte size of an object is pretty hard to do in general in Java. The problem is that Java does not provide a simple way to get the actual size of an object. Also, objects can be nested, making it even harder. Finally, besides the actual data, there is always some metadata overhead per object, and this overhead depends on the JVM implementation.
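"Doing the math" could look like the sketch below. It assumes you already have the entry count (for example from approximateNumEntries()) and a small sample of serialized key/value pairs. The function name `estimate_store_bytes` and the `overhead_per_entry` parameter are made up for illustration, and any overhead value you pass is a guess, not something the JVM reports.

```python
def estimate_store_bytes(num_entries, sample_pairs, overhead_per_entry=0):
    """Estimate store size as entry count times average serialized pair size.

    sample_pairs is a list of (key_bytes, value_bytes) tuples.
    overhead_per_entry is an assumed per-entry object overhead in bytes.
    """
    if not sample_pairs:
        raise ValueError("need at least one sample pair")
    total = sum(len(key) + len(value) for key, value in sample_pairs)
    average = total / len(sample_pairs)
    return int(num_entries * (average + overhead_per_entry))


# Two sampled pairs: 6 + 10 bytes and 7 + 9 bytes, so 16 bytes on average.
sample = [(b"user-1", b"\x00" * 10), (b"user-22", b"\x00" * 9)]
estimate = estimate_store_bytes(num_entries=1_000_000, sample_pairs=sample)
```

With `overhead_per_entry=0` the result is a lower bound on the payload only; the real heap footprint will be larger for the reasons given above.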

kafka | How to use replica.high.watermark.checkpoint.interval.ms

I've been looking for a way to reduce duplicates or eliminate them entirely, and I found an interesting property:
replica.high.watermark.checkpoint.interval.ms = 5000 (default)
The frequency with which the high watermark is saved out to disk
I also came across a random link which says:
replica.high.watermark.checkpoint.interval.ms property can affect throughput. Also, we can mark the last point where we read information while reading from a partition. In this way, we have a checkpoint from which to move forward without having to reread prior data, if we have to go back and locate the missing data. So, we will never lose a message, if we set the checkpoint watermark for every event.
So my questions are: first, how do I use replica.high.watermark.checkpoint.interval.ms, and
second, is there any way to reduce duplicates using this property?
As far as I know, the high watermark indicates the last record that consumers can see, as it is the last record that has been fully replicated for that partition. This seems to indicate that it is used to prevent a consumer from consuming a record that is not yet fully replicated across all of its brokers, so that you don't consume something that could end up lost, leading to a bad state.
Changing the interval at which this would be updated does not seem like it would reduce duplication of messages. It would potentially have a slight performance impact (smaller interval = more disk writes) however.
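To make the high watermark concrete, here is a toy model. It assumes the high watermark is the lowest log end offset among the in-sync replicas and treats it as an exclusive upper bound on what consumers may read. The function names and the replica names are hypothetical; this is not broker code.

```python
def high_watermark(log_end_offsets, in_sync_replicas):
    """Lowest log end offset among the in-sync replicas."""
    return min(log_end_offsets[replica] for replica in in_sync_replicas)


def visible_records(leader_log, hw):
    """Records a consumer may read: everything below the high watermark."""
    return leader_log[:hw]


leader_log = ["r0", "r1", "r2", "r3", "r4"]
# Log end offset = offset of the next record each replica would write.
log_end_offsets = {"leader": 5, "follower-1": 4, "follower-2": 3}

hw = high_watermark(log_end_offsets, ["leader", "follower-1", "follower-2"])
visible = visible_records(leader_log, hw)
```

In this model r3 and r4 exist on the leader but stay hidden until the slowest in-sync follower catches up. How often that value is checkpointed to disk does not change which records are visible.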
For reducing duplication, I'd probably look at the Kafka exactly-once semantics introduced in 0.11.
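One ingredient of those semantics is that a retried send can be recognised and dropped. The sketch below is a toy model of that idea only, assuming every producer tags its records with an id and an increasing sequence number. `DedupingLog` is a hypothetical name and nothing here is Kafka's API.

```python
class DedupingLog:
    """Append-only log that ignores sequence numbers it has already seen."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        """Append value unless it is a replay; return True if appended."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry, drop it
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = DedupingLog()
results = [
    log.append("p1", 0, "created"),
    log.append("p1", 1, "updated"),
    log.append("p1", 1, "updated"),  # retry after a lost acknowledgement
    log.append("p1", 2, "deleted"),
]
```

The retried record is rejected, so the log ends up with three entries. The checkpoint interval plays no part in this.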

Apache Kafka persist all data

When using Kafka as an event store, how is it possible to configure the logs to never lose data (v0.10.0.0)?
I have seen the (old?) log.retention.hours, and I have been considering playing with compaction keys, but is there simply an option for Kafka to never delete messages?
Or is the best option to put a ridiculously high value for the retention period?
You don't have a better option than using a ridiculously high value for the retention period.
Fair warning: using an infinite retention will probably hurt you a bit.
For example, the default behaviour only allows a new subscriber to start from the beginning or the end of a topic, which will be at least annoying from an event sourcing perspective.
Also, Kafka, if used at scale (let's say tens of thousands of messages per second), benefits greatly from high-performance storage, the cost of which will be ridiculously high with an eternal retention policy.
FYI, Kafka provides tools (e.g. Kafka Connect) to easily persist data to cheap data stores.
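A back-of-the-envelope calculation shows how quickly the storage adds up. All the inputs below are assumptions chosen for illustration (10,000 messages per second, 1,000 bytes per message, replication factor 3, no compression); plug in your own numbers.

```python
def retained_bytes(msgs_per_sec, avg_msg_bytes, days, replication_factor):
    """Raw bytes kept on disk across all replicas after the given number of days."""
    seconds = days * 24 * 60 * 60
    return msgs_per_sec * avg_msg_bytes * seconds * replication_factor


# Assumed workload: 10,000 msg/s of 1,000 bytes each, kept for one year, 3 replicas.
one_year = retained_bytes(
    msgs_per_sec=10_000, avg_msg_bytes=1_000, days=365, replication_factor=3
)
one_year_tb = one_year / 1e12  # decimal terabytes
```

Under these assumptions a single year comes to roughly 946 TB across the cluster, and the figure keeps growing linearly for as long as nothing is deleted.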
Update: It’s Okay To Store Data In Apache Kafka
Obviously this is possible, if you just set the retention to “forever”
or enable log compaction on a topic, then data will be kept for all
time. But I think the question people are really asking, is less
whether this will work, and more whether it is something that is
totally insane to do.
The short answer is that it’s not insane, people do this all the time,
and Kafka was actually designed for this type of usage. But first, why
might you want to do this? There are actually a number of use cases,
here’s a few:
For people concerned about data replay and the disk cost of keeping messages forever, I just wanted to share a few things.
Data replaying:
You can seek your consumer to a given offset. It is even possible to query the offset for a given timestamp. So if your consumer doesn't need all the data from the beginning and a subset is enough, you can use this.
I use the Kafka Java libs, e.g. kafka-clients. See:
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes(java.util.Map)
and
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,%20long)
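The lookup-then-seek pattern can be illustrated without a broker. The sketch below models one partition as a list of records plus a parallel list of timestamps, assumes the timestamps never decrease, and finds the earliest offset whose timestamp is at or after the requested one. The function names are hypothetical stand-ins, not the kafka-clients API.

```python
from bisect import bisect_left


def offset_for_time(timestamps, target_ts):
    """Earliest offset whose timestamp is >= target_ts, or None if there is none."""
    index = bisect_left(timestamps, target_ts)
    return index if index < len(timestamps) else None


def replay_from(records, timestamps, target_ts):
    """Return the records a consumer would read after seeking to target_ts."""
    start = offset_for_time(timestamps, target_ts)
    return [] if start is None else records[start:]


# One partition: the record at offset i was written at timestamps[i].
timestamps = [100, 200, 300, 400]
records = ["e0", "e1", "e2", "e3"]

replayed = replay_from(records, timestamps, target_ts=250)
```

Here the consumer skips e0 and e1 and replays only the events from timestamp 250 onwards, which is the "subset of the data" case described above.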
Disk cost:
You can at least reduce disk space usage a lot by using something like Avro (https://avro.apache.org/docs/current/) with compaction turned on.
Maybe there is a way to use symbolic links to separate between file systems. But that is only an untried idea.

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partition Tolerant: the system can continue upholding A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, something is bothering me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's partition tolerance that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say, we don't care about partitions. You can have them running inside racks with server hardware and make partitioning extremely unlikely. The problem is, what if partitions occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than providing the same service and serving different data - choosing availability (AP)
stop providing the service, because it couldn't guarantee consistency anymore, since it doesn't know if the other server is down or in fact up and running and just the communication between these two broke off - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both availability and consistency once partitioning occurs: you can either go for availability and hope for the best, or play it safe and be unavailable but consistent.
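The choice can be written down as a tiny decision function. This sketch assumes a majority-quorum rule for deciding whether a node can still guarantee consistency, which is one common approach rather than the only one; `handle_request` and its parameters are hypothetical.

```python
def handle_request(mode, reachable_nodes, cluster_size):
    """Decide how a node responds during a (possible) partition.

    reachable_nodes counts the node itself plus every peer it can still reach.
    """
    if mode not in ("CP", "AP"):
        raise ValueError("mode must be 'CP' or 'AP'")
    has_majority = reachable_nodes > cluster_size // 2
    if has_majority:
        return "serve"  # this side can still guarantee consistency
    # Minority side of a partition: this is where the two designs differ.
    return "serve" if mode == "AP" else "reject"


# A 5-node cluster splits into a group of 3 and a group of 2.
cp_majority = handle_request("CP", reachable_nodes=3, cluster_size=5)
cp_minority = handle_request("CP", reachable_nodes=2, cluster_size=5)
ap_minority = handle_request("AP", reachable_nodes=2, cluster_size=5)
```

The CP nodes on the minority side refuse requests and stay consistent, while the AP nodes keep answering and risk serving data that differs from the other side.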
Here are 2 great posts, which should make it clear:
You Can’t Sacrifice Partition Tolerance shows the idea that every truly distributed system needs to deal with partitioning now and then, and hence CA systems will break at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and presents the CAP theorem as more flexible: developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So, to finally answer your question: if you take a CP system and replicate it more, you might run into the overhead of messages sent between the nodes to keep it consistent. Or, if a substantial part of the nodes fails or a network partition occurs without any side having a clear majority, it won't be able to continue operating, as it can no longer guarantee consistency. But yes, these lines are getting more blurred now, and I think the references I've provided will give you a much better understanding.