I have a Kafka Streams application that uses GlobalKTables, and I would like to compute its memory footprint.
The data in the underlying Kafka topics is compressed with Snappy. I couldn't find information about how data is stored in KTables. Are records decompressed once they are loaded into the KTables, or are they decompressed on demand?
It would be very helpful to understand the best way to compute the memory footprint of the application.
The data would not be compressed.
GlobalKTables (and KTables) use RocksDB to actually hold the data. As far as I know, RocksDB itself supports some compression options, though, which you could enable.
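If you want to trade some CPU for a smaller on-disk footprint, a RocksDBConfigSetter is the hook for that. Here is a minimal sketch, assuming a reasonably recent Kafka Streams version; the class name and the choice of Snappy are only illustrative:

```java
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

// Illustrative config setter that turns on RocksDB-level compression
// for all state stores of the Streams application.
public class CompressedRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        // Compress the SST files RocksDB writes to disk.
        options.setCompressionType(CompressionType.SNAPPY_COMPRESSION);
    }

    // Wire the setter into the Streams configuration.
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CompressedRocksDBConfig.class);
        return props;
    }
}
```

Keep in mind this only affects RocksDB's files on disk; as far as I know, blocks in RocksDB's block cache and the memtables are held uncompressed, so for a memory-footprint estimate you should count uncompressed record sizes.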
I would like to know about transporting data in compressed form from one Kafka cluster to an external Kafka cluster using MirrorMaker. I used compression.type=gzip rather than none. Any other suggestions to improve the data transfer between the Kafka clusters and to save disk space?
Compression types have tradeoffs between CPU usage and disk space.
Zstd is supposedly the best overall option, with Snappy being the best for low CPU usage, I think; lz4 and gzip fall somewhere in between.
In general, you need to do your own benchmarks on your own infrastructure.
https://blog.cloudflare.com/squeezing-the-firehose/
Serialization formats also matter
https://eng.uber.com/trip-data-squeeze-json-encoding-compression/
https://medium.com/@nitinpaliwal87/compression-and-serialization-techniques-benchmarking-fd1f34c1098b
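In case it helps, compression.type is ultimately just a producer setting (MirrorMaker picks up whatever its producer config specifies). A minimal sketch with a plain Java producer; the broker address, topic name, and batch/linger values are placeholders, not recommendations:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // Candidates: none, gzip, snappy, lz4, zstd (zstd needs brokers on 2.1+).
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");

        // Compression is applied per batch, so larger batches usually compress better.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("mirror-topic", "key", "value")); // topic name is made up
        }
    }
}
```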
I am trying to figure out an optimal event size to produce into Kafka. I may have events ranging from 1 KB to 20 KB and wonder if this will be an issue.
It is possible that I could make some producer changes so that the events are all roughly the same size, say 1 KB to 3 KB. Would this be an advantage, or would Kafka have no issue with the variable event sizes?
Is there an optimal event size for Kafka or does that depend on the configured Segment settings?
Thanks.
By default, Kafka accepts messages of up to about 1 MB, and this limit can be raised, at the cost of more network I/O and higher latency.
That being said, I don't think it really matters whether your messages are consistently sized or not for the sizes you are talking about.
If you really want to squeeze your payloads, you can look into different serialization frameworks and the compression options offered by the Kafka API.
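For reference, the roughly 1 MB default is enforced by a handful of settings that need to be raised together if you ever exceed it. A small sketch of the producer side; the 2 MB value is only an example:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class MessageSizeSettings {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Producer-side cap on a single request; must not exceed what the brokers allow.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 2 * 1024 * 1024); // 2 MB, illustrative

        // The matching broker, topic, and consumer settings would be message.max.bytes,
        // max.message.bytes, and max.partition.fetch.bytes respectively.
        return props;
    }
}
```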
I am doing an aggregation on a Kafka topic stream and saving the result to an in-memory state store. I would like to know the exact size of the accumulated in-memory data; is it possible to find this?
I looked through the JMX metrics in JConsole and Confluent Control Center, but nothing seemed relevant. Is there anything I can use to find this out, please?
You can get the number of stored key-value pairs of an in-memory store via KeyValueStore#approximateNumEntries() (for the default in-memory-store implementation, this number is actually accurate). If you can estimate the byte size per key-value pair, you can do the math.
However, estimating the byte size of an object is pretty hard to do in Java in general. The problem is that Java does not provide any way to get the actual size of an object. Also, objects can be nested, making it even harder. Finally, besides the actual data, there is always some metadata overhead per object, and this overhead is JVM-implementation dependent.
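As a rough illustration, you could expose the store through interactive queries and multiply the entry count by an estimated per-entry size. This assumes Kafka Streams 2.5+ (for StoreQueryParameters); the store name, key/value types, and the 100-byte figure are made up for the sketch:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreSizeEstimate {

    static long estimateStoreBytes(final KafkaStreams streams) {
        final ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "aggregation-store",                            // assumed store name
                        QueryableStoreTypes.<String, Long>keyValueStore()));

        final long entries = store.approximateNumEntries(); // exact for the default in-memory store
        final long avgBytesPerEntry = 100L;                  // rough guess: key + value + object overhead
        return entries * avgBytesPerEntry;
    }
}
```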
I am using kafka_2.10-0.10.0.1. I have two questions:
- I want to know how I can modify the default configuration of Kafka to process large amounts of data with good performance.
- Is it possible to configure Kafka to process the records in memory without storing them on disk?
Thank you.
Is it possible to configure Kafka to process the records in memory without storing them on disk?
No. Kafka is all about storing records reliably on disk, and then reading them back quickly off of disk. In fact, its documentation says:
As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.
You can read more about its design here: https://kafka.apache.org/documentation/#design. The implementation section is also quite interesting: https://kafka.apache.org/documentation/#implementation.
That said, Kafka is also all about processing large amounts of data with good performance. In 2014 it could handle 2 million writes per second on three cheap instances: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. More links about performance:
https://docs.confluent.io/current/kafka/deployment.html
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
We need a CEP engine that can run over large datasets, so I had a look at alternatives like Flink, Ignite, etc.
When I looked at Ignite, it seemed that Ignite's querying API is not able to run over that much data. The reason: the data cannot all be stored in the cache (insufficient memory; about 2 TB would be needed). I have looked at write-through and read-through, but the data payload (not the key) is not queryable with predicates (for example SQLPredicate).
My question is: am I missing something, or is it really like that?
Thanks.
Ignite is an in-memory system by design. Cache store (read-through/write-through) allows storing data on disk, but queries only work over in-memory data.
the data cannot all be stored in the cache (insufficient memory; about 2 TB would be needed)
Why not? Ignite is a distributed system, and it is possible to build a cluster with more than 2 TB of combined RAM.
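For example, with enough nodes the combined off-heap memory can hold the whole dataset. A minimal sketch of sizing a per-node data region, assuming Ignite 2.3+ (where DataRegionConfiguration exists); the region name, the 256 GB per node, and the implied eight-node cluster are my own assumptions:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LargeClusterNode {
    public static void main(String[] args) {
        // Hypothetical sizing: ~256 GB of off-heap memory per node,
        // so eight such nodes give roughly 2 TB of combined capacity.
        DataRegionConfiguration region = new DataRegionConfiguration();
        region.setName("large-region");
        region.setMaxSize(256L * 1024 * 1024 * 1024); // bytes

        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.setDataRegionConfigurations(region);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDataStorageConfiguration(storage);

        // Start this node; caches created on the cluster can then span all nodes' regions.
        Ignite ignite = Ignition.start(cfg);
        System.out.println("Cluster nodes: " + ignite.cluster().nodes().size());
    }
}
```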