Kafka variable event payload size - apache-kafka

I am trying to figure out an optimal event size to produce into Kafka. I may have events ranging from 1KB to 20KB and wonder if this will be an issue.
It is possible that I could make some producer changes to make them all roughly a similar size, say 1KB-3KB. Would this be an advantage or will Kafka have no issue with the variable event size?
Is there an optimal event size for Kafka or does that depend on the configured Segment settings?
Thanks.

By default, Kafka supports messages of up to 1MB, and this limit can be raised, of course sacrificing network IO and latency as messages get larger.
That being said, I don't think it really matters if messages are consistently sized or not for the sizes of data that you are talking about.
If you really want to squeeze your payloads, you can look into different serialization frameworks and compression algorithms offered in the Kafka API.
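For illustration, a minimal producer sketch showing the relevant knobs; the broker address, topic name, and the 2MB cap below are assumptions, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class PayloadSizeProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Compress batches on the wire; gzip, snappy and lz4 are built in.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

            // Client-side cap on a single request; only needs raising if events exceed
            // the ~1MB default. The broker-side counterpart is message.max.bytes.
            props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 2 * 1024 * 1024); // 2MB, assumed value

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key-1", "a payload between 1KB and 20KB"));
            }
        }
    }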

Related

Kafka Streams Yearly time Window

There is a requirement in one of the applications we are working on: aggregation should happen in a windowed manner, and the window size may vary between monthly, quarterly, half-yearly, and yearly.
Kafka Streams' calendar-based time window supports this, and I would like to get more input on the performance front to know if it would best suit the need.
My concern is the memory consumed by the cache to hold the records until the window closes, since the number of records streamed on a daily basis within a window is really high.
Please suggest whether Kafka stream processing can be used in this case, and what resources are needed for memory management.
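For context, a bare-bones sketch of a windowed aggregation of this shape, using a fixed 30-day tumbling window as a stand-in; true calendar (month/quarter/year) boundaries would need a custom Windows implementation, and the topic and store names below are placeholders:

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.state.WindowStore;

    public class LongWindowAggregation {
        public static void build(StreamsBuilder builder) {
            builder.stream("daily-events", Consumed.with(Serdes.String(), Serdes.Long())) // assumed topic
                   .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                   // 30 days as a stand-in for "monthly"; calendar boundaries need a custom Windows class.
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(30)))
                   // Backed by a RocksDB window store by default, so state spills to disk
                   // rather than having to fit entirely in memory.
                   .reduce(Long::sum,
                           Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("window-totals"));
        }
    }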

Calculating Memory Footprint of GlobalKTables

I have a Kafka Streams application with GlobalKTables. I would like to compute the memory footprint of the same.
The data in the underlying Kafka topics is compressed using SNAPPY. I couldn't find information about the data stored in KTables. Are records uncompressed once loaded into KTables, or are they uncompressed on demand?
It would be very helpful to understand the best way to compute the memory footprint of the application.
The data would not be compressed.
GlobalKTables (and KTables) use RocksDB to actually hold the data. I guess RocksDB supports some compression, though, that you could enable.
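As a sketch of what enabling that could look like via the rocksdb.config.setter hook; the Snappy choice below is an assumption, and note it compresses the on-disk SST files rather than what RocksDB holds in its in-memory block cache:

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.CompressionType;
    import org.rocksdb.Options;

    // Applied to every RocksDB-backed store, including those behind GlobalKTables.
    public class CompressedRocksDBConfig implements RocksDBConfigSetter {
        @Override
        public void setConfig(String storeName, Options options, Map<String, Object> configs) {
            // Compress data files on disk; blocks are decompressed when read back.
            options.setCompressionType(CompressionType.SNAPPY_COMPRESSION);
        }

        @Override
        public void close(String storeName, Options options) {
            // Nothing allocated in setConfig, so nothing to release here.
        }
    }

    // Registered through the Streams configuration, e.g.:
    // props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CompressedRocksDBConfig.class);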

How to use Kafka Streams to Split Messages into Slow and Fast Tracks

I have a stream of messages to be processed by an app written in Kafka Streams; a small subset of those messages requires external DB lookups to be processed.
I believe this DB is too big to be streamed and too much to cache.
Is there a way to split the stream into Fast and Slow streams so the slow one doesn't interfere with the fast one?
I have thought of the following 3 options (a rough sketch of option 3 is included after the list), but I was hoping there might be something simpler or more efficient:
1) Let the messages be distributed evenly; since the volume of the ones that require reading from the DB is low, they wouldn't affect the overall throughput badly (latency is not a problem).
2) Use a special key for the slow ones so they get assigned to one partition (I own the producer), but then it is hard to scale the slow ones, there is no guarantee that they will not interfere with the fast ones, and it requires messing with the producer.
3) Write the slow ones to a separate topic altogether.
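For reference, a rough sketch of what option 3 could look like when done with stream branching inside the app; the topic names and the predicate are made-up placeholders:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Predicate;

    public class FastSlowSplit {
        public static void build(StreamsBuilder builder) {
            KStream<String, String> all = builder.stream("incoming-events"); // assumed topic

            // Hypothetical predicate deciding which records need the expensive DB lookup.
            Predicate<String, String> needsDbLookup = (key, value) -> value.contains("lookup");

            // branch() splits one stream by predicate order; each branch can go to its own
            // topic and be consumed by a separately scaled application.
            KStream<String, String>[] branches = all.branch(needsDbLookup, (key, value) -> true);

            branches[0].to("slow-events"); // small subset, enriched via the DB at its own pace
            branches[1].to("fast-events"); // bulk of the traffic, no external calls
        }
    }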

Size of the Kafka Streams In Memory Store

I am doing an aggregation on a Kafka topic stream and saving it to an in-memory state store. I would like to know the exact size of the accumulated in-memory data; is this possible to find?
I looked through the JMX metrics in JConsole and Confluent Control Center but nothing seemed relevant. Is there anything I can use to find this out, please?
You can get the number of stored key-value-pairs of an in-memory store, via KeyValueStore#approximateNumEntries() (for the default in-memory-store implementation, this number is actually accurate). If you can estimate the byte size per key-value pair, you can do the math.
However, estimating the byte size of an object is pretty hard to do in general in Java. The problem is that Java does not provide any way to retrieve the actual size of an object. Also, objects can be nested, making it even harder. Finally, besides the actual data, there is always some metadata overhead per object, and this overhead is JVM-implementation dependent.
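A minimal sketch of reading that count through interactive queries; the store name and the per-entry size estimate are assumptions:

    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class StoreSizeEstimate {
        public static long estimateBytes(KafkaStreams streams) {
            // "aggregation-store" is an assumed name; use the one passed to Materialized.as(...)
            ReadOnlyKeyValueStore<String, Long> store = streams.store(
                    StoreQueryParameters.fromNameAndType("aggregation-store",
                            QueryableStoreTypes.<String, Long>keyValueStore()));

            long entries = store.approximateNumEntries(); // exact for the default in-memory store

            // Rough guess at serialized key + value + per-entry overhead; adjust to your data.
            long assumedBytesPerEntry = 64;
            return entries * assumedBytesPerEntry;
        }
    }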

Apache Kafka persist all data

When using Kafka as an event store, how is it possible to configure the logs never to lose data (v0.10.0.0)?
I have seen the (old?) log.retention.hours, and I have been considering playing with compaction keys, but is there simply an option for Kafka never to delete messages?
Or is the best option to put a ridiculously high value for the retention period?
You don't have a better option than using a ridiculously high value for the retention period.
Fair warning: using infinite retention will probably hurt you a bit.
For example, default behaviour only allows a new subscriber to start from the beginning or the end of a topic, which will be at least annoying from an event sourcing perspective.
Also, Kafka, if used at scale (let's say tens of thousands of messages per second), benefits greatly from high-performance storage, the cost of which will be ridiculously high with an eternal retention policy.
FYI, Kafka provides tools (e.g. Kafka Connect) to easily persist data on cheap data stores.
Update: It’s Okay To Store Data In Apache Kafka
Obviously this is possible, if you just set the retention to “forever” or enable log compaction on a topic, then data will be kept for all time. But I think the question people are really asking, is less whether this will work, and more whether it is something that is totally insane to do.
The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases, here’s a few:
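For what it's worth, a sketch of those "keep forever" settings applied at topic creation time with the AdminClient (which appeared after the 0.10.0.0 release mentioned above); the topic name, partition count and replication factor are placeholders:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class ForeverTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

            Map<String, String> configs = new HashMap<>();
            configs.put("retention.ms", "-1");    // -1 disables time-based deletion
            configs.put("retention.bytes", "-1"); // -1 disables size-based deletion

            NewTopic topic = new NewTopic("event-store", 3, (short) 1).configs(configs); // assumed name/sizing

            try (AdminClient admin = AdminClient.create(props)) {
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }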
For people concerned with data replaying and the disk cost of eternal messages, I just wanted to share some things.
Data replaying:
You can seek your consumer to a given offset. It is even possible to query the offset for a given timestamp. Then, if your consumer doesn't need to know all data from the beginning and a subset of the data is enough, you can use this.
I use the Kafka Java libs, e.g. kafka-clients. See:
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes(java.util.Map)
and
https://kafka.apache.org/0101/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,%20long)
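A small sketch of that timestamp-based replay using the two calls linked above; the broker address, topic, partition and timestamp are placeholders:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayFromTimestamp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-group");            // assumed group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("events", 0); // assumed topic/partition
                consumer.assign(Collections.singleton(tp));

                // Ask the broker for the earliest offset whose timestamp is >= the given time.
                long oneHourAgo = System.currentTimeMillis() - 3_600_000L;
                Map<TopicPartition, OffsetAndTimestamp> offsets =
                        consumer.offsetsForTimes(Collections.singletonMap(tp, oneHourAgo));

                OffsetAndTimestamp start = offsets.get(tp);
                if (start != null) {
                    consumer.seek(tp, start.offset()); // subsequent poll() calls read from here
                }
            }
        }
    }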
Disk cost:
You can at least minimize disk space usage a lot by using something like Avro (https://avro.apache.org/docs/current/) and turning compaction on.
Maybe there is a way to use symbolic links to split data between file systems, but that is only an untried idea.