We are running Kafka Streams applications and frequently running into off-heap memory issues. Our applications are deployed as Kubernetes pods and they keep restarting.
I did some investigation and found that we can limit the off-heap memory by implementing RocksDBConfigSetter, as shown in the following example.
public static class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    // See #1 below
    private static org.rocksdb.Cache cache =
        new org.rocksdb.LRUCache(TOTAL_OFF_HEAP_MEMORY, -1, false, INDEX_FILTER_BLOCK_RATIO);
    private static org.rocksdb.WriteBufferManager writeBufferManager =
        new org.rocksdb.WriteBufferManager(TOTAL_MEMTABLE_MEMORY, cache);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();

        // These three options in combination will limit the memory used by RocksDB
        // to the size passed to the block cache (TOTAL_OFF_HEAP_MEMORY)
        tableConfig.setBlockCache(cache);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(writeBufferManager);

        // These options are recommended to be set when bounding the total memory
        // See #2 below
        tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true);
        tableConfig.setPinTopLevelIndexAndFilter(true);

        // See #3 below
        tableConfig.setBlockSize(BLOCK_SIZE);
        options.setMaxWriteBufferNumber(N_MEMTABLES);
        options.setWriteBufferSize(MEMTABLE_SIZE);
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Cache and WriteBufferManager should not be closed here, as the same objects
        // are shared by every store instance.
    }
}
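For completeness, here is a minimal sketch of how such a config setter is typically registered with the streams configuration (the application id, bootstrap servers, and the buildTopology() helper are placeholders for your own setup):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
// Tell Kafka Streams to apply the bounded-memory settings to every RocksDB store.
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedMemoryRocksDBConfig.class);

Topology topology = buildTopology(); // hypothetical helper returning your existing topology
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();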
In our application, we have input topics with 6 partitions and there are about 40 topics from which we are consuming data. Our application has just one topology which consumes from these topics and stores the data in state stores (for dedup, lookups, and some verification). So, as per my understanding, the Kafka Streams application will create the following number of RocksDB instances and will need the following maximum off-heap memory. Please correct me if I am wrong.
Total RocksDB instances (assuming that each task will create its own instance of RocksDB):
6 (partitions) * 40 (topics) = 240 RocksDB instances
Maximum off-heap memory consumed:
240 * (50 MB block cache + 3 * 16 MB memtables + filters (unknown))
≈ 240 * ~110 MB
≈ 26400 MB
≈ 25 GB
It seems to be a large number. I know that in practice we should never hit this maximum, but is the calculation itself correct?
Also, if we implement RocksDBConfigSetter and set the max off-heap memory to 4 GB, will the application complain (crash with OOM) if RocksDB asks for more memory (since it is expecting about 25 GB)?
Update:
I reduced the LRU cache to 1 GB and my streams app started throwing the "LRU cache full" exception:
2021-02-07 23:20:47,443 15448195 [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] ERROR o.a.k.s.p.internals.StreamThread - stream-thread [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] Encountered the following exception during processing and the thread is going to shut down:
org.apache.kafka.streams.errors.ProcessorStateException: stream-thread [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] task [29_4] Exception caught while trying to restore state from dp-Corrigo-InvTreeObject-Store-changelog-4
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.restore(ProcessorStateManager.java:425)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restoreChangelog(StoreChangelogReader.java:562)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:461)
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:744)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:625)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:512)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store InvTreeObject-Store
at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDBStore.java:647)
at org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter.lambda$adapt$0(StateRestoreCallbackAdapter.java:42)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.restore(ProcessorStateManager.java:422)
... 6 common frames omitted
Caused by: org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
at org.rocksdb.RocksDB.write0(Native Method)
at org.rocksdb.RocksDB.write(RocksDB.java:806)
at org.apache.kafka.streams.state.internals.RocksDBStore.write(RocksDBStore.java:439)
at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDBStore.java:645)
... 8 common frames omitted
I am not sure how many RocksDB instances you get. It depends on the structure of your program. You should check out TopologyDescription (via Topology#describe()). Sub-topologies are instantiated as tasks (based on the number of partitions), and each task will have its own RocksDB instance to maintain a shard of the overall state per store.
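For example, printing the topology description shows the sub-topologies, their source topics, and the attached state stores (the builder below stands in for however you construct your topology):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;

StreamsBuilder builder = new StreamsBuilder();
// ... define the streams that read from your ~40 input topics ...
Topology topology = builder.build();
// Each sub-topology listed here becomes one task (and one RocksDB shard per
// store) for every input partition it reads from.
System.out.println(topology.describe());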
I would recommend checking out the Kafka Summit talk "Performance Tuning RocksDB for Kafka Streams' State Store": https://videos.confluent.io/watch/Ud6dtEC3DMYEtmK3dMK5ci
Also, if we implement RocksDBConfigSetter and set the max off-heap memory to 4 GB, will the application complain (crash with OOM) if RocksDB asks for more memory (since it is expecting about 25 GB)?
It won't crash. RocksDB will spill to disk. Being able to spill to disk is the reason why we use a persistent state store (and not an in-memory state store) by default: it allows you to hold state that is larger than main memory. Since you use Kubernetes, you should attach corresponding volumes to your containers and size them accordingly (cf. https://docs.confluent.io/platform/current/streams/sizing.html). You might also want to watch the Kafka Summit talk "Deploying Kafka Streams Applications with Docker and Kubernetes": https://www.confluent.io/kafka-summit-sf18/deploying-kafka-streams-applications/
If state is larger than main memory, you might also want to monitor the RocksDB metrics if you run into performance issues, so you can tune the different "buffers" accordingly: https://docs.confluent.io/platform/current/streams/monitoring.html#rocksdb-metrics
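As a rough sketch of how those metrics can be read programmatically (the group-name filter below is an assumption to verify against your Kafka version; most RocksDB metrics are only recorded at the DEBUG level):

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

// props and topology are the same objects you already use to build the app.
props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG"); // enable RocksDB metrics
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
// Print every metric whose group looks state-store related; adjust the filter
// (e.g. "stream-state-metrics") for your version.
streams.metrics().forEach((metricName, metric) -> {
    if (metricName.group().contains("state")) {
        System.out.println(metricName.name() + " = " + metric.metricValue());
    }
});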
Related
Our use case -> Using remote partitioning: the job is divided into multiple partitions, and workers process these partitions using ActiveMQ.
The job is failing with a memory issue in the MessageChannelPartitionHandler handle method, where it holds a large number of StepExecutions in memory (we have around 20K StepExecutions/partitions in this case).
We overrode the message channel partition handler to submit controlled messages to ActiveMQ, and even when we try to poll replies from the DB it runs into database connection timeout issues; when we increased idle connections, this approach also failed to hold all those StepExecutions in memory.
In either case, our custom handler or MessageChannelPartitionHandler, we are facing similar issues, and these step executions need to be aggregated at the master. Do we have any alternative way of achieving this?
Can someone help us understand a better way of handling these long-running / huge data processing scenarios?
I use Kafka Streams to join two streams with a 3-day join window:
...
private final long retentionHours = Duration.ofDays(3).toHours();
...
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours))
        .grace(Duration.ofMillis(0));

var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde)
        .withStoreName("STORE-1")
        .withName("STORE-2");

stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
With the above implementation, I found that Kafka Streams created a state folder, /tmp/kafka-streams (which looks like RocksDB data), and it grows constantly.
Also, the state store in the Kafka cluster grows constantly.
So, I changed the streams join implementation to:
...
private final long retentionHours = Duration.ofDays(3).toHours();
...
var joinWindow = JoinWindows.of(Duration.ofHours(retentionHours))
        .grace(Duration.ofMillis(0));

var joinStores = StreamJoined.with(Serdes.String(), aggregatorSerde, aggregatorSerde)
        .withStoreName("STORE-1")
        .withName("STORE-2")
        .withThisStoreSupplier(createStoreSupplier("MEM-STORE-1"))
        .withOtherStoreSupplier(createStoreSupplier("MEM-STORE-2"));

stream1.join(stream2, streamJoiner(), joinWindow, joinStores);
...
private WindowBytesStoreSupplier createStoreSupplier(String storeName) {
    var window = Duration.ofHours(retentionHours * 2)
            .toMillis();
    return new InMemoryWindowBytesStoreSupplier(storeName, window, window, true);
}
Now, there is no /tmp/kafka-streams state folder.
Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all?
If yes, how does it work?
Also, I still see that the state store in the Kafka cluster grows constantly.
Does it mean that InMemoryWindowBytesStoreSupplier doesn't use disk at all? If yes, how does it work?
IIRC, InMemoryWindowBytesStore doesn't use disk at all.
Generally speaking, a logical state store is in fact partitioned into multiple state store 'instances' (think: each stream task has its own, local state store instance). For the InMemoryWindowBytesStore specifically, and by design, these store instances manage all their local data in memory.
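For reference, the same kind of in-memory window store can also be obtained through the public Stores factory rather than the internal supplier class used in the question; a minimal sketch (the store name is a placeholder, and the sizes mirror the question's choice of 2x the 3-day join window, which you should verify against your JoinWindows settings):

import java.time.Duration;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowBytesStoreSupplier;

// Retention must cover the window size plus the grace period.
Duration retention = Duration.ofDays(6);
Duration windowSize = Duration.ofDays(6);
WindowBytesStoreSupplier supplier =
        Stores.inMemoryWindowStore("MEM-STORE-1", retention, windowSize, true); // true = retain duplicates, required for joins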
Also, I still see that the state store in the Kafka cluster grows constantly.
However, the InMemoryWindowBytesStore is still fault-tolerant. This is often confusing for new Kafka Streams developers because, in most software, "in memory" always implies "data is lost if something happens". This is not the case with Kafka Streams, however. A state store is always 'backed up' durably to its Kafka changelog topic, regardless of whether you use the default state store (with RocksDB) or the in-memory state store. This explains why you see the in-memory state's (changelog) data in the Kafka cluster. The data should not grow forever, btw, as changelog topics are compacted to prevent exactly this scenario.
Note: What can happen, however, when using the in-memory store is that your application instances could run out of memory (OOM), and thus crash. While your state data will never be lost, as explained above, your application will not be running due to the OOM crash / it will run only partially (some app instances run OOM, others do not). This OOM problem doesn't apply to the default store (RocksDB), as it manages its data on disk, and uses memory (RAM) only for caching purposes. But, again, this question of app availability is orthogonal to data safety (your data is safe regardless of whether your app is crashing or not).
When a fetch request is made, the response size is limited by various Kafka parameters, and they are well documented. But my question is: what is the read I/O size at the core? A process must be opening the segment file and issuing a read() operation to get the data into memory. The question is, what is the size of this read()? Is it a fixed number, or is it equal to max.partition.fetch.bytes? If the latter, and the partition has sufficient data, one read I/O would get enough data to feed the consumer for that partition. I tried looking into the source code, but could not figure out this size.
The reason I am asking is that I am benchmarking the filesystem holding my Kafka logs, and for consumers I want to see at what read I/O size the filesystem behaves better, and whether Kafka fetches/polls show the same pattern.
There are two kinds of I/O relevant to your question: disk I/O and network I/O.
Disk I/O:
Kafka leverages the filesystem for storage and caching.
If you are looking for the core disk I/O operation size, it is the typical filesystem block size. On most modern operating systems the block size is tied to the page cache and is generally up to 4096 bytes (for example, $ getconf PAGESIZE shows the size on your server).
In summary: virtual memory pages map to filesystem blocks, which map to block device sectors.
Reference code: LogSegment.scala internally uses FileRecords.java, which makes the filesystem calls.
Network I/O: next comes the consumer fetch request.
Most of the time the consumer FetchRequest is served from the page cache (hot data) on the broker that leads the particular partition. Based on your consumer params (e.g. min/max bytes and maxWaitMs), the broker fills up the response from the page cache and transmits it over the wire.
Reference code: Fetcher.java builds the request, and ConsumerNetworkClient.send() sends it and waits for the response.
This combination of page cache and sendfile means that on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks whatsoever, as the brokers serve data entirely from cache. This is because Kafka uses zero-copy data transfer (broker → consumer).
So for most of the performance tuning (apart from the memory available for the page cache and disk I/O), you can play with consumer params such as the minimum wait time and the maximum fetch sizes.
Here are some points to consider for tuning consumption performance:
You can check the default ConsumerConfig here:
https://github.com/apache/kafka/blob/2.3/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
How the consumer fetches data from Kafka topics is defined in Fetcher.java:
final FetchSessionHandler.FetchRequestData data = entry.getValue();
final FetchRequest.Builder request = FetchRequest.Builder
        .forConsumer(this.maxWaitMs, this.minBytes, data.toSend())
        .isolationLevel(isolationLevel)
        .setMaxBytes(this.maxBytes)
        .metadata(data.metadata())
        .toForget(data.toForget())
        .rackId(clientRackId);
https://github.com/apache/kafka/blob/2.3/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L237
It uses the default value for each property unless it is overridden by a user-supplied value in the config.
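If you want to experiment with those knobs while benchmarking, a minimal consumer setup might look like the following (the bootstrap servers, group id, and the concrete byte/time values are placeholders to tune for your test):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fetch-size-benchmark");      // placeholder
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
// The fetch knobs discussed above: how long the broker may wait and how much
// data it may accumulate before answering, plus the per-partition and
// per-request caps on the response size.
props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024);               // 1 MB
props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);                     // 500 ms
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024); // 4 MB per partition
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 50 * 1024 * 1024);          // 50 MB per request
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);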
We are having an issue regarding the parallelism of tasks inside a single topology. We cannot manage to get a good, steady processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in terms of the number of messages, about 2000 per day; however, each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and the Kafka system are installed on a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm's concurrency capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka is large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), with a max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. The other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and a parallelism hint of 10, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter maxSpoutPending, expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
The topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));

    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);

    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));

    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());

    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));

    return topology.build();
}
and the config is created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks and/or by avoiding starvation while finishing processing a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
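A sketch of what that could look like, reusing the names from your buildTopologyOD() (the concrete parallelism numbers are only illustrative):

// Give the spout its own (low) parallelism, then repartition with shuffle()
// so the expensive ProcessTask section can run at a higher parallelism.
Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
        .parallelismHint(1)
        .shuffle()
        .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
        .parallelismHint(10);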
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.
We create a simple IQueue in Hazelcast:
HazelcastInstance h = Hazelcast.newHazelcastInstance(config);
BlockingQueue<String> queue = h.getQueue("my-distributed-queue");
Let's assume that queue.size() == 0.
Does the distributed queue "my-distributed-queue" use any memory resources?
Background:
I want to use Hazelcast to create a large number (>1k) of short-lived queues (for keeping time order within item groups). I'm wondering what happens when an IQueue object in Hazelcast is drained (size == 0). Will it leave any artifacts in memory that won't be cleaned up by GC?
I've analyzed the heap dumps in VisualVM and found that queue items are stored as IQueueItem objects. When the queue size is 0, there are no IQueueItem instances. But are there any other non-removable artifacts? Thanks for the help.
There is some fixed cost for each structure, even if it doesn't contain any data. The cost is rather low; you can see the structure backing each queue instance here: https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/queue/impl/QueueContainer.java
You can always destroy a queue once you don't need it; just call the destroy() method from the DistributedObject interface, which every structure provided by Hazelcast implements.
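A minimal sketch of that lifecycle (Hazelcast 3.x imports, matching the QueueContainer link above; instance setup as in the question):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IQueue;

HazelcastInstance h = Hazelcast.newHazelcastInstance(new Config());
IQueue<String> queue = h.getQueue("my-distributed-queue");
// ... produce and consume the short-lived group ...
// destroy() releases the backing QueueContainer on the cluster once the
// queue is no longer needed; the handle must not be used afterwards.
queue.destroy();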