Accessing Kafka Streams KTable's underlying RocksDB memory usage

I have a Kafka Streams app that currently takes 3 topics and aggregates them into a KTable. The app resides inside a Scala microservice on Marathon that has been allocated 512 MB of memory to work with. After implementing this, I've noticed that the Docker container running the microservice eventually runs out of memory, and I've been trying to debug the cause.
My current theory (after reading the sizing guide https://docs.confluent.io/current/streams/sizing.html) is that, over time, the growing number of records stored in the KTable, and by extension in the underlying RocksDB, is causing the OOM for the microservice. Is there any way to find out the memory used by the underlying default RocksDB implementation?

In case anyone runs into a similar issue: setting the environment variable MALLOC_ARENA_MAX=2 seems to have fixed it for me. For a more detailed explanation as to why, please refer to the sections "Why memory allocators make a difference?" and "Tuning glibc" here: https://github.com/prestodb/presto/issues/8993.
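For reference, the smallest way to set it, assuming a Dockerfile-built image (adapt this to however your Marathon app definition injects environment variables):

    # In the microservice's Dockerfile (illustrative)
    ENV MALLOC_ARENA_MAX=2

    # or in the Marathon app definition JSON:
    # "env": { "MALLOC_ARENA_MAX": "2" }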

Related

Flink reduce shuffling overhead and memory consumption

My Flink job frequently goes OOM on one task manager or another. I have enough memory and storage for the job (2 JobManagers / 16 TaskManagers, each with 15 cores and 63 GB RAM). Sometimes the job runs for 4 days before throwing an OOM, sometimes it goes OOM within 2 days, even though the traffic is consistent compared to previous days.
I have received a suggestion not to pass objects through the streaming pipeline and to use primitives instead, to reduce shuffling overhead and memory consumption.
The Flink job I work on is written in Java. Let's say the pipeline looks like the list below (a rough sketch in code follows it):
Kafka source
deserialize (converts bytes to a Java object; the object contains String, int, and long fields)
FirstKeyedWindow (the deserialized Java objects are received here)
reduce
SecondKeyedWindow (the reduced Java objects are received here)
reduce
Kafka sink (the Java objects are serialized into bytes and produced to Kafka)
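A rough sketch of that shape in Flink's DataStream API (topic names, window sizes, and plain strings standing in for my real POJO are all placeholders):

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    import java.util.Properties;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder brokers

            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
               // "deserialize": the real job turns bytes into a POJO with String/int/long fields
               .keyBy(value -> value)                                       // FirstKeyedWindow
               .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
               .reduce((a, b) -> a)                                         // first reduce: deduplication
               .keyBy(value -> value)                                       // SecondKeyedWindow
               .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
               .reduce((a, b) -> a)                                         // second reduce: aggregation in the real job
               .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

            env.execute("pipeline-sketch");
        }
    }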
My question is: what should I consider in order to reduce the overhead and memory consumption?
Will replacing String with a char array help reduce the overhead a bit? Or
should I only deal with bytes all through the pipeline?
If I serialize the objects between the KeyedWindows, will it help reduce the overhead? But if I have to read the bytes back, then I need to deserialize, use them as required, and serialize them again. Wouldn't that create even more serialization/deserialization overhead?
I appreciate your suggestions. Heads up: I am talking about 10 TB of data received per day.
Update 1:
The exception I see for OOM is as below:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'host/host:port'. This might indicate that the remote task manager was lost.
Answering David Anderson's comments below:
The Flink version used is 1.11. The state backend is RocksDB, file system based. The job is running out of heap memory. Each message from the Kafka source is up to 300 bytes in size.
The first reduce function does deduplication (removes duplicates within the same group); the second reduce function does aggregation (updates the count within the object).
Update 2:
After thorough exploration, I found that Flink falls back to its Kryo default serializer, which is inefficient. I understood that custom_serializers can help reduce overhead if we define one instead of using the Kryo default. I am now trying out google-protobuf to see if it performs any better.
And I am also looking to increase taskmanager.network.memory.fraction to suit my job's parallelism. I have yet to find the right calculation for setting that configuration.
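The wiring I am experimenting with follows the Flink documentation on third-party serializers: register chill-protobuf's ProtobufSerializer with Kryo for the generated message class (MyProtoMessage below is just a placeholder for a protoc-generated class):

    import com.twitter.chill.protobuf.ProtobufSerializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // MyProtoMessage is a placeholder for a protoc-generated class;
    // ProtobufSerializer comes from the com.twitter:chill-protobuf dependency.
    env.getConfig().registerTypeWithKryoSerializer(MyProtoMessage.class, ProtobufSerializer.class);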
I am answering my own question here, since what I tried has worked for me. I found extra metrics in Grafana that are tied to my Flink job. Two of those metrics are GC time and GC count, and I saw some serious spikes in those Garbage Collection metrics. The most likely reason is that I had a lot of new object creation going on in the job pipeline, and considering the TBs of data and 20 billion records per day I am dealing with, this object creation went haywire. I optimized the pipeline to reuse objects as much as I can, and that reduced the memory consumption.
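As an illustration of the kind of change I mean (the Event type and field names are made up), refilling one pre-allocated instance instead of allocating a new object per record:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    import java.nio.charset.StandardCharsets;

    public class ReusingDeserializer extends RichMapFunction<byte[], ReusingDeserializer.Event> {

        // Mutable POJO standing in for the job's real message type.
        public static class Event {
            public String id;
            public long count;
        }

        private transient Event reusable;

        @Override
        public void open(Configuration parameters) {
            reusable = new Event();
        }

        @Override
        public Event map(byte[] bytes) {
            // Refill the same instance on every call instead of "new Event()" per record.
            // Caveat: only safe as long as downstream operators don't hold on to
            // references of emitted records (be extra careful with enableObjectReuse()).
            reusable.id = new String(bytes, StandardCharsets.UTF_8); // the real job decodes the full payload
            reusable.count = 1L;
            return reusable;
        }
    }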
And I increased taskmanager.network.memory to the required value; it is set to 1 GB by default.
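For reference, these are the related flink-conf.yaml keys (the values below are only illustrative; newer Flink versions rename them to taskmanager.memory.network.*):

    # illustrative values only
    taskmanager.network.memory.fraction: 0.2
    taskmanager.network.memory.min: 256mb
    taskmanager.network.memory.max: 2gb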
In my question above, I talked about custom serializers to reduce network overhead. I tried implementing a protobuf serializer with Kryo, but the protobuf-generated classes are final; if I have to update the objects, I have to create new ones, which again causes spikes in the GC metrics. So I avoided using it. Maybe I can further adapt the protobuf-generated classes to suit my needs; I will consider that step if things stay inconsistent.

RocksDB in Kafka stream reporting no space when there is space available

I have a Streams application with a GlobalKTable backed by RocksDB that's failing. I was originally getting the error described in https://issues.apache.org/jira/browse/KAFKA-6327, so I upgraded RocksDB to v5.14.2, which now gives a more explicit error: org.rocksdb.RocksDBException: While open a file for appending: /kafka_streams/...snip.../000295.sst: No space left on device
The directory to which RocksDB spills to disk (a file mount on RHEL) seems to have ample space (Size: 5.4G, Used: 2.8G, Available: 2.6G, Use%: 52%). I'm assuming it's actually trying to allocate more than the remaining 2.6G, but that seems unlikely; there isn't that much data in the topic.
I found details on configuring RocksDB away from the defaults at https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter, but I don't see anything obvious that could potentially resolve the issue.
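In case it helps anyone reading along, this is the shape of such a RocksDBConfigSetter; the knobs below (smaller write buffers) are only illustrative, not a known fix for this particular error:

    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.Options;

    import java.util.Map;

    public class LeanRocksDBConfig implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            // Illustrative knobs: smaller in-memory write buffers also mean smaller
            // SST files get flushed to disk.
            options.setWriteBufferSize(8 * 1024 * 1024L);
            options.setMaxWriteBufferNumber(2);
        }
    }

    // registered via the streams configuration:
    // props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, LeanRocksDBConfig.class);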
I haven't found any bug reports related to an issue like this, and I'm at a loss for troubleshooting next steps.
Edited to add:
I just ran the streams application on my local development machine against the same Kafka environment having the problem above. While the state stores were being loaded, the state store directory drifted up to a high of 3.1G and then settled at around 2.1G. It never got close to the 5G available on our development server. I haven't gotten any closer to finding an answer.
I never found an answer to why the disk usage in the deployed environment behaved this way, but out of desperation I eventually got more space allocated; as the stream was processing, it consumed as much as 14 GB of space before settling down to around 3-4 GB. I assume the disk space error occurred because RocksDB was trying to allocate space, not because it had actually written that much.
I've added a 'rule of thumb' that I should allocate 4x the disk space I expect for streaming applications.

In-memory vs persistent state stores in Kafka Streams?

I've read the stateful stream processing overview and, if I understand correctly, one of the main reasons RocksDB is used as the default implementation of the key-value store is the fact that, unlike in-memory collections, it can handle data larger than the available memory, because it can flush to disk. Both types of stores can survive application restarts, because the data is backed up to a Kafka topic.
But are there other differences? For example, I've noticed that my persistent state store creates some .log files for each topic partition, but they're all empty.
In short, I'm wondering what are the performance benefits and possible risks of replacing persistent stores with in-memory ones.
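To make the question concrete, this is how I would swap one for the other at the DSL level (topic, store name, and serdes are placeholders):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.Stores;

    public class StoreChoice {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Persistent (RocksDB-backed) store, the default behaviour:
            builder.table("input-topic",
                    Materialized.<String, String>as(Stores.persistentKeyValueStore("my-store"))
                            .withKeySerde(Serdes.String())
                            .withValueSerde(Serdes.String()));

            // In-memory alternative: only the store supplier changes; the changelog
            // backup to Kafka stays the same unless logging is disabled explicitly.
            // Materialized.<String, String>as(Stores.inMemoryKeyValueStore("my-store"))
        }
    }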
I have a very limited understanding of the internals of Kafka Streams and the different use cases of state stores, esp. in-memory vs persistent, but what I've managed to learn so far is that a persistent state store is one that is stored on disk (hence the name persistent) for a StreamTask.
That does not say much, as the names themselves, in-memory vs persistent, may give the same understanding, but something I found quite refreshing to learn was that Kafka Streams tries to assign partitions to the same Kafka Streams instances that had those partitions assigned before a restart or a crash.
That said, an in-memory state store is simply recreated (replayed from the changelog topic) on every restart, which takes time before the Kafka Streams application is up and running, while a persistent state store is already materialized on disk, and the only thing a Kafka Streams instance has to do to re-create the state store is to load its files from disk (not replay the changelog topic, which takes longer).
I hope that helps and I'd be very glad to be corrected if I'm wrong (or partially correct).
I don't see any real reason to swap the current RocksDB store for in-memory ones. In fact, RocksDB is one of the fastest key-value stores:
Percona benchmarks (based on RocksDB)
Moreover, RocksDB already acts as an in-memory store to a large extent, with LRU-style caching involved:
RocksDB architecture
"The three basic constructs of RocksDB are memtable, sstfile and logfile. The memtable is an in-memory data structure - new writes are inserted into the memtable and are optionally written to the logfile."
But there is one more noticeable reason for choosing this implementation:
RocksDB source code
If you look at the source code, a lot of Java API is exposed over the C++ code, so it's much simpler to integrate this product into the existing Java-based Kafka ecosystem, with comprehensive control over the store through that exposed API.

Zookeeper for Data Storage?

I want an external config store for some of my services, and the data can be in formats like JSON, YML, or XML. The use case is that I can save my configs and change them dynamically, and these configs will be read very frequently. Is Zookeeper a good solution for this? Also, my configs total at most 500 MB.
The reason Zookeeper is under consideration is that it has synchronization properties and versioning (as I will be changing configs a lot), and it can notify the depending services of changes to the config. Kindly tell me whether Zookeeper can be used as a data store and would be best for this use case, and offer any other suggestions if possible.
Zookeeper may be used as a data store, but:
The size of a single node should not be larger than 1 MB.
Fetching a huge number of nodes from Zookeeper takes time, so you need to use caches. You can use the Curator PathChildrenCache recipe. If you have a tree structure in your zNodes you can use TreeCache, but be aware that TreeCache had memory leaks in various 2.x versions of Curator.
Zookeeper notifications are a nice feature, but if you have a pretty big cluster you may end up with too many watchers, which puts stress on your Zookeeper cluster.
Please find more information about Zookeeper failure reasons.
So, generally speaking, Zookeeper can be used as a data store if the data is organized as key/value pairs and a value doesn't exceed 1 MB. In order to get fast access to the data, you should use caches on your application side: see the Curator PathChildrenCache recipe.
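A minimal sketch of that caching approach (the connection string and zNode path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.PathChildrenCache;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    import java.nio.charset.StandardCharsets;

    public class ConfigWatcher {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Keeps a local, event-updated copy of all children under /configs,
            // so reads hit the cache instead of ZooKeeper on every access.
            PathChildrenCache cache = new PathChildrenCache(client, "/configs", true);
            cache.getListenable().addListener((c, event) ->
                    System.out.println(event.getType() + " " + event.getData()));
            cache.start(PathChildrenCache.StartMode.BUILD_INITIAL_CACHE);

            cache.getCurrentData().forEach(child ->
                    System.out.println(child.getPath() + " = "
                            + new String(child.getData(), StandardCharsets.UTF_8)));
        }
    }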
Alternatives are etcd and Consul.

Apache Spark Auto Scaling properties - Add Worker on the Fly

During the execution of a Spark program, let's say it is
reading 10 GB of data into memory, doing just a filter and a map, and then saving the result to another storage system.
Can I auto-scale the cluster based on the load, and for instance add more worker nodes to the program, if it eventually needs to handle 1 TB instead of 10 GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on direct usage of a particular resource.
You have to remember that, in general, Spark can handle data larger than memory just fine; memory problems are usually caused by user mistakes or vicious garbage collection cycles. Neither of these can easily be solved by "adding more resources".
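For illustration, the standard dynamic-allocation settings look like this (the values are only an example, and an external shuffle service must be available on the workers):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class AutoScalingExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("autoscaling-example")                 // name is illustrative
                    .set("spark.dynamicAllocation.enabled", "true")
                    .set("spark.dynamicAllocation.minExecutors", "2")
                    .set("spark.dynamicAllocation.maxExecutors", "50")
                    .set("spark.shuffle.service.enabled", "true");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic ...
            sc.stop();
        }
    }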
If you are using any of the cloud platforms for creating the cluster, you can use their auto-scaling functionality, which will scale the cluster horizontally (the number of nodes will change).
Agree with #user8889543 - you can read much more data than fits in your memory.
As for adding more resources on the fly: it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and then my cluster has more cores and memory.
If you only have one job/program in the cluster, then it is pretty simple. Just set
spark.cores.max
to a very high number and the job will always take all the cores of the cluster.
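For example (the app name is made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GrabAllCores {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("grab-all-cores")
                    // effectively "all cores currently in the standalone cluster",
                    // including workers attached later
                    .set("spark.cores.max", "10000");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job logic ...
            sc.stop();
        }
    }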
If you have several jobs in the cluster it becomes complicated, as mentioned in #user8889543's answer.