Read IO size for Kafka fetch request - apache-kafka

When a fetch request is made, the response size is limited by various Kafka parameters, and they are well documented. My question is: what is the read I/O size at the core? Some process must open the segment file, issue a read() operation, and get the data into memory. What is the size of this read()? Is it a fixed number, or is it equal to max.partition.fetch.bytes? If the latter, then if the partition has sufficient data, one read I/O will get enough data to feed the consumer for that partition. I tried looking into the source code but could not figure out this size.
The reason I am asking is that I am benchmarking the file system that holds my Kafka logs, and for consumers I want to see at what read I/O size the file system behaves best, and whether Kafka fetches/polls show the same pattern.

There are two kinds of I/O involved in your question: disk I/O and network I/O.
Disk I/O:
Kafka leverages the filesystem for storage and caching.
If you are looking for the core disk I/O operation size, it is the typical block size. On most modern operating systems that unit is defined by the page cache, and a page is generally 4096 bytes (for example, running getconf PAGESIZE shows the size on your server).
In summary: Virtual Memory pages map to Filesystem blocks, which map to Block Device sectors.
Reference code: LogSegment.scala, which internally relies on file system calls.
Network I/O: next comes the consumer fetch request.
Most of the time, the consumer FetchRequest is served from the page cache (hot data) on the broker that leads the partition. Based on your consumer parameters (for example min/max bytes and maxWaitMs), the broker fills the response from the page cache and transmits it over the wire.
Reference code: (ConsumerNetworkClient) Client.send(), which then waits for the network response.
This combination of page cache and sendfile means that on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks whatsoever, as they will be serving data entirely from cache. This is because Kafka uses zero-copy data transfer (Broker → Consumer).
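On the JVM, sendfile-style transfers are exposed through FileChannel.transferTo. The following is a minimal sketch of that call, not Kafka's own code: the class name, the helper method, and the temp file standing in for a log segment are all hypothetical. The transfer is only truly zero-copy when the target is a socket channel; the in-memory target here just keeps the example runnable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of Kafka.
class ZeroCopySketch {
    /** Sends up to count bytes of file, starting at position, to target. Returns bytes sent. */
    static long send(Path file, long position, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done or EOF.
            while (sent < count) {
                long n = channel.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break;
                }
                sent += n;
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for a log segment.
        Path segment = Files.createTempFile("segment", ".log");
        Files.write(segment, "hello kafka".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long sent = send(segment, 0, Files.size(segment), Channels.newChannel(sink));
        System.out.println(sent + " bytes transferred");
        Files.delete(segment);
    }
}
```

With this style of transfer the application never issues a read() with a buffer size of its own choosing; how much is actually read from disk per I/O is left to the kernel.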
So, apart from the memory available for the page cache and disk I/O, most of the performance tuning you can do is on consumer parameters such as the wait time and the maximum buffer sizes.
Here are some of the points to consider for performance tuning on Consumption:

You can check the default ConsumerConfig here:
And how consumer fetches the data from kafka topics, is well defined in
final FetchSessionHandler.FetchRequestData data = entry.getValue();
final FetchRequest.Builder request = FetchRequest.Builder
.forConsumer(this.maxWaitMs, this.minBytes, data.toSend())
Each property has a default value, which is overridden by the user-supplied value from the config.
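As a rough illustration of overriding those defaults, here is a sketch that builds consumer properties with plain string keys so it runs without the Kafka client on the classpath. The keys are standard consumer configuration names; the class name, the bootstrap address, the group id, and the chosen fetch.min.bytes value are assumptions made up for the example.

```java
import java.util.Properties;

// Hypothetical illustration class; the values are examples, not recommendations.
class FetchTuningSketch {
    static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed address
        props.setProperty("group.id", "io-benchmark");            // assumed group id
        // Broker waits until at least this much data is available (default is 1 byte)...
        props.setProperty("fetch.min.bytes", "65536");
        // ...or until this much time has passed, whichever comes first.
        props.setProperty("fetch.max.wait.ms", "500");
        // Upper bound on data returned per partition in one fetch.
        props.setProperty("max.partition.fetch.bytes", "1048576");
        // Upper bound on data returned for the whole fetch response.
        props.setProperty("fetch.max.bytes", "52428800");
        return props;
    }

    public static void main(String[] args) {
        consumerProps().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting Properties object would be passed to the KafkaConsumer constructor together with the deserializer settings, which are omitted here.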


Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum while keeping recovery quick (avoiding message reprocessing, because it is expensive).
I realized that if there were some kind of failure, such as my microservice crashing, my messages would be reprocessed. So I thought of adding some kind of 'checkpoint' to my process by writing the state of the transformed message to a file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, right after writing to the file is successful.
But then, upon further thinking, I realized that if there were a failure on the file system, I might not find my files; for example, a cloud file service might still fail even if the marketed availability is >99%. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.
For example, in your consumer-worker reading loop:
MessageAndOffset = getMsg();
//do your things
saveOffsetInQueueToDB(offset);
saveOffsetInQueueToDB is responsible for adding the offset to a Queue/List, or whatever. This operation is only done once the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when shutdown is captured, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA-backed storage system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
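A minimal sketch of such a buffer follows. Everything in it is hypothetical (class name, method names, and the sink callback that stands in for the database, blob store, or logger); it also assumes a single consumer thread and in-order processing within a partition, so only the highest offset per partition is kept.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration class; the sink stands in for a DB, S3, or log writer.
class OffsetBufferSketch {
    private final Map<Integer, Long> pending = new HashMap<>();
    private final int flushThreshold;
    private final Consumer<Map<Integer, Long>> sink;
    private int recordsSinceFlush = 0;

    OffsetBufferSketch(int flushThreshold, Consumer<Map<Integer, Long>> sink) {
        if (flushThreshold <= 0) {
            throw new IllegalArgumentException("flushThreshold must be positive");
        }
        this.flushThreshold = flushThreshold;
        this.sink = sink;
    }

    /** Call only after the message has been fully processed. */
    void saveOffset(int partition, long offset) {
        // Keep the highest processed offset per partition.
        pending.merge(partition, offset, Math::max);
        recordsSinceFlush++;
        if (recordsSinceFlush >= flushThreshold) {
            flush();
        }
    }

    /** Writes the buffered offsets out in one call; also call this on shutdown. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new HashMap<>(pending)); // hand over a copy
        pending.clear();
        recordsSinceFlush = 0;
    }
}
```

The trade-off is that a crash between flushes loses up to flushThreshold offsets, so that many messages may be reprocessed after recovery.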
If there's a crash, you could read all messages from the beginning (easiest way is just changing the and setting from beginning) but discarding those whose offset is included in the database, avoiding the reprocess. For example by adding a condition in your loop (yep pseudocode again):
MessageAndOffset = getMsg();
if (offset.notIncluded(offsetListFromDB))
//do your things
You could implement a better-performing algorithm than a "not-included" check by just storing the last read offset for each partition in a HashMap and then checking whether the offset of each incoming message is greater than the one stored for its partition. For example, partition 0's last offset was 558 and partition 1's was 600:
//offsetMap = {[0,558],[1,600]}
MessageAndOffset = getMsg();
//get partition => 0
if (offset > offsetMap.get(partition))
//do your things
This way, you guarantee that only the non-processed messages from each partition will be processed.
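The per-partition check above could be sketched in Java roughly as follows. The class and method names are hypothetical, and the map passed to the constructor stands in for whatever was restored from the external store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class for the "last offset per partition" check.
class OffsetFilterSketch {
    private final Map<Integer, Long> lastProcessed = new HashMap<>();

    /** restored: partition -> last processed offset, as loaded from the external store. */
    OffsetFilterSketch(Map<Integer, Long> restored) {
        lastProcessed.putAll(restored);
    }

    /** True if the message has not been processed yet. */
    boolean shouldProcess(int partition, long offset) {
        Long last = lastProcessed.get(partition);
        // Unknown partitions have no stored offset, so everything is new.
        return last == null || offset > last;
    }

    /** Record a successfully processed message. */
    void markProcessed(int partition, long offset) {
        lastProcessed.merge(partition, offset, Math::max);
    }

    public static void main(String[] args) {
        Map<Integer, Long> restored = new HashMap<>();
        restored.put(0, 558L);
        restored.put(1, 600L);
        OffsetFilterSketch filter = new OffsetFilterSketch(restored);
        System.out.println(filter.shouldProcess(0, 558L)); // already processed
        System.out.println(filter.shouldProcess(0, 559L)); // new message
    }
}
```

This only works if messages within a partition are processed strictly in order; with out-of-order or parallel processing per partition, a single high-water mark would skip unprocessed messages.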
Regarding file system failures, that's why Kafka comes as a cluster: Fault tolerance in Kafka is done by copying the partition data to other brokers which are known as replicas.
So if you have 5 brokers and a replication factor of 5, for example, all 5 brokers (which I assume run on separate hosts) must fail at the same time for any data to be lost. Even 4 of the brokers could fail at the same time without losing any data.
In that setup, every broker stores the same data, the same partitions. If a filesystem error occurs on one of the brokers, the others will still hold all the information.

process pubsub messages in constant rate. Using streaming and serverless

The scenario:
I have thousands of requests I need to issue each day.
I know the number at the beginning of the day, and ideally I want to send all the data about the requests to Pub/Sub, one message per request.
I want to make the requests at a constant rate. For example, if I have 172800 requests, I want to process 2 each second.
The ideal solution would involve Pub/Sub push and Cloud Run.
Using pull with long running instances is also an option.
Any other options are also welcome.
I want to avoid running a loop that fetches records from a database with a limit.
This is how I am doing it today.
You can use batch and flow control settings for fine-tuning Pub/Sub performance which will help in processing messages at a constant rate.
A batch, within the context of Cloud Pub/Sub, refers to a group of one or more messages published to a topic by a publisher in a single publish request. Batching is done by default in the client library or explicitly by the user. The purpose for this feature is to allow for a higher throughput of messages while also providing a more efficient way for messages to travel through the various layers of the service(s). Adjusting the batch size (i.e. how many messages or bytes are sent in a publish request) can be used to achieve the desired level of throughput.
Features specific to batching on the publisher side include setElementCountThreshold(), setRequestByteThreshold(), and setDelayThreshold() as part of setBatchSettings() on a publisher client (the naming varies slightly in the different client libraries). These features can be used to finely tune the behavior of batching to find a better balance among cost, latency, and throughput.
Note: The maximum number of messages that can be published in a single batch is 1000 messages or 10 MB.
An example of these batching properties can be found in the Publish with batching settings documentation.
Flow Control
Flow control features on the subscriber side can help control the unhealthy behavior of tasks on the pipeline by allowing the subscriber to regulate the rate at which messages are ingested. These features provide the added functionality to adjust how sensitive the service is to sudden spikes or drops of published throughput.
Some features that are helpful for adjusting flow control and other settings on the subscriber are setMaxOutstandingElementCount(), setMaxOutstandingRequestBytes(), and setMaxAckExtensionPeriod().
Examples of these settings being used can be found in the Subscribe with flow control documentation.
For more information refer to this link.
If you have long-running instances as subscribers, then you will need to set the relevant FlowControl settings, for example .setMaxOutstandingElementCount(1000L)
Once you have set it to the desired number (for example 1000), this controls the maximum number of messages the subscriber receives before pausing the message stream, as explained in the code below from this documentation:
// The subscriber will pause the message stream and stop receiving more messages from the
// server if any one of the conditions is met.
FlowControlSettings flowControlSettings =
    FlowControlSettings.newBuilder()
        // 1,000 outstanding messages. Must be >0. It controls the maximum number of messages
        // the subscriber receives before pausing the message stream.
        .setMaxOutstandingElementCount(1000L)
        // 100 MiB. Must be >0. It controls the maximum size of messages the subscriber
        // receives before pausing the message stream.
        .setMaxOutstandingRequestBytes(100L * 1024L * 1024L)
        .build();
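Flow control bounds how many messages are outstanding at once rather than how many are handled per second, so it can help to work out the target pace explicitly. The sketch below does only that arithmetic for the numbers in the question; it does not use any Pub/Sub API, and the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical illustration class; pure arithmetic, no Pub/Sub calls.
class ConstantRateSketch {
    /** Spacing between requests so that requestsPerDay are spread evenly over 24 hours. */
    static Duration intervalFor(long requestsPerDay) {
        if (requestsPerDay <= 0) {
            throw new IllegalArgumentException("requestsPerDay must be positive");
        }
        return Duration.ofNanos(Duration.ofDays(1).toNanos() / requestsPerDay);
    }

    /** Offset from the start of the day at which request number index (0-based) is due. */
    static Duration scheduledOffset(long index, long requestsPerDay) {
        return intervalFor(requestsPerDay).multipliedBy(index);
    }

    public static void main(String[] args) {
        // 172,800 requests per day is one request every 500 ms, i.e. 2 per second.
        System.out.println(intervalFor(172_800L).toMillis() + " ms between requests");
    }
}
```

A long-running pull subscriber could use the interval to delay processing of each message, or a publisher could use the offsets to schedule publication times; either way the flow control limits above still cap how many messages are held at once.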

Kafka Streams limiting off-heap memory

We are running Kafka Streams applications and frequently running into off-heap memory issues. Our applications are deployed in Kubernetes pods, and they keep restarting.
I am doing some investigation and found that we can limit the off-heap memory by implementing RocksDBConfigSetter as shown in the following example.
public static class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {
    // See #1 below
    private static org.rocksdb.Cache cache = new org.rocksdb.LRUCache(TOTAL_OFF_HEAP_MEMORY, -1, false, INDEX_FILTER_BLOCK_RATIO);
    private static org.rocksdb.WriteBufferManager writeBufferManager = new org.rocksdb.WriteBufferManager(TOTAL_MEMTABLE_MEMORY, cache);
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // These three options in combination will limit the memory used by RocksDB to the size passed to the block cache (TOTAL_OFF_HEAP_MEMORY)
        tableConfig.setBlockCache(cache);
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(writeBufferManager);
        // These options are recommended to be set when bounding the total memory
        tableConfig.setCacheIndexAndFilterBlocksWithHighPriority(true); // See #2 below
        tableConfig.setPinTopLevelIndexAndFilter(true);
        tableConfig.setBlockSize(BLOCK_SIZE); // See #3 below
        options.setMaxWriteBufferNumber(N_MEMTABLES);
        options.setWriteBufferSize(MEMTABLE_SIZE);
        options.setTableFormatConfig(tableConfig);
    }
    @Override
    public void close(final String storeName, final Options options) {
        // Cache and WriteBufferManager should not be closed here, as the same objects are shared by every store instance.
    }
}
In our application, we have input topics with 6 partitions each, and there are about 40 topics from which we are consuming data. Our application has just 1 topology, which consumes from these topics and stores the data in state stores (for dedup, lookup, and some verification). So, as per my understanding, the Kafka Streams application will create the following RocksDB instances and will need the following maximum off-heap memory. Please correct me if I am wrong.
Total rocksdb instances (assuming that each task will create its own instance of rocksdb)
6(partitions) * 40(topics) -> 240 rocksdb instances
Maximum off heap memory consumed
240 * (50 MB (block cache) + 16 MB * 3 (memtables) + filters (unknown))
= 240 * ~110 MB
= 26400 MB
≈ 25 GB
It seems to be a large number. Is the calculation correct? I know practically we should not hit this max number but is the calculation correct ?
Also, If we implement RocksDBConfigSetter and set the max off heap memory to 4 GB. Will the application complain(crash OOM) if rocksdb asks for more memory (since it is expecting about 25 GB) ?
I reduced LRU to 1GB and my streams app started throwing the LRU full exception
2021-02-07 23:20:47,443 15448195 [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] ERROR o.a.k.s.p.internals.StreamThread - stream-thread [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] Encountered the following exception during processing and the thread is going to shut down:
org.apache.kafka.streams.errors.ProcessorStateException: stream-thread [dp-Corrigo-67c5563a-9e3c-4d79-bc1e-23175e2cba6c-StreamThread-2] task [29_4] Exception caught while trying to restore state from dp-Corrigo-InvTreeObject-Store-changelog-4
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.restore(
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restoreChangelog(
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store InvTreeObject-Store
at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(
at org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter.lambda$adapt$0(
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.restore(
... 6 common frames omitted
Caused by: org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
at org.rocksdb.RocksDB.write0(Native Method)
at org.rocksdb.RocksDB.write(
at org.apache.kafka.streams.state.internals.RocksDBStore.write(
at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(
... 8 common frames omitted
Not sure how many RocksDB instances you get. It depends on the structure of your program. You should check out TopologyDescription (via Topology#describe()). Sub-topologies are instantiated as tasks (based on the number of partitions), and each task will have its own RocksDB to maintain a shard of the overall state per store.
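Once the real instance count is known, the estimate from the question can be recomputed with a few lines. This is only a sketch of that arithmetic under the question's assumption that every instance allocates its own block cache and memtables (that is, without a shared cache); the class name is hypothetical, and the 12 MB filter figure is an assumption chosen to match the question's "~110 MB" per instance.

```java
// Hypothetical illustration class; reproduces the question's arithmetic only.
class RocksDbMemoryEstimateSketch {
    /** Worst-case off-heap estimate in MB when nothing is shared between instances. */
    static long unboundedEstimateMb(long instances, long blockCacheMb,
                                    long memtableMb, int memtablesPerInstance,
                                    long filtersMb) {
        long perInstanceMb = blockCacheMb + memtableMb * memtablesPerInstance + filtersMb;
        return instances * perInstanceMb;
    }

    public static void main(String[] args) {
        long instances = 6L * 40L; // the question's assumption: partitions * topics
        // 50 MB block cache, 3 memtables of 16 MB, ~12 MB for filters (assumed).
        long totalMb = unboundedEstimateMb(instances, 50, 16, 3, 12);
        System.out.println(totalMb + " MB");
    }
}
```

Replace the instance count with the number of tasks times the number of stores per task reported by the topology description to get a figure for the actual application.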
I would recommend to check out the Kafka Summit talk "Performance Tuning RocksDB for Kafka Streams' State Store":
Also, If we implement RocksDBConfigSetter and set the max off heap memory to 4 GB. Will the application complain(crash OOM) if rocksdb asks for more memory (since it is expecting about 25 GB) ?
It won't crash. RocksDB will spill to disk. Being able to spill to disk is the reason why we use a persistent state store (and not an in-memory state store) by default. It allows holding state that is larger than main memory. As you use Kubernetes, you should attach corresponding volumes to your containers and size them accordingly. You might also want to watch the Kafka Summit talk "Deploying Kafka Streams Applications with Docker and Kubernetes":
If state is larger than main memory, you might also want to monitor RocksDB metrics if you run into performance issues, to tune the different "buffers" accordingly:

what is distributed queue?

My understanding: a distributed destination is a single, logical (not physical) destination to a client, which internally contains a set of physical destinations (queues or topics).
It helps scalable applications in terms of high availability (HA) and load balancing (LB).
So when I do distributedQueue.put(someObject), the distributed queue will put the object on one of the physical queues and also maintain some metadata to record which object lies on which queue.
Now when I do distributedQueue.receive(), it will refer to the metadata, poll the data from the right queue, and serve it to the client.
Is that correct ?
That would be one way of implementing a distributed queue, yes.
However, in your implementation the metadata store will very quickly become a bottleneck/hot-spot.
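To make the hot spot concrete, here is a single-process sketch of the design described in the question. All names are hypothetical, and in a real system the physical queues and the metadata would live on separate machines; the point is only that every put and every receive has to go through the same shared metadata.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single-process model of the design described in the question.
class DistributedQueueSketch<T> {
    private final List<Queue<T>> physicalQueues = new ArrayList<>();
    // Metadata: for each pending object, in arrival order, the index of the
    // physical queue that holds it.
    private final Queue<Integer> placement = new ArrayDeque<>();
    private int next = 0;

    DistributedQueueSketch(int physicalQueueCount) {
        if (physicalQueueCount <= 0) {
            throw new IllegalArgumentException("need at least one physical queue");
        }
        for (int i = 0; i < physicalQueueCount; i++) {
            physicalQueues.add(new ArrayDeque<>());
        }
    }

    /** Places the item on one physical queue (round-robin) and records where it went. */
    synchronized void put(T item) {
        int target = next;
        next = (next + 1) % physicalQueues.size();
        physicalQueues.get(target).add(item);
        placement.add(target); // every producer touches the shared metadata
    }

    /** Looks up the metadata to find the right physical queue; null if empty. */
    synchronized T receive() {
        Integer source = placement.poll(); // every consumer touches it as well
        return source == null ? null : physicalQueues.get(source).poll();
    }
}
```

Adding more physical queues does not help throughput here, because all operations still serialize on the one lock that protects the metadata.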

use of Mongo/Redis on a full firehose stream

I have been reading up on how DataSift uses different technologies to consume the twitter firehose and since I need to follow the same concept, wanted to get some understanding on the differences between mongo/redis and its use in storage of realtime data. My understanding is this:
The stream volume is way too high to simply consume and place the data (tweets etc) in for example a rabbitmq bunch of queues. My concern is the issue of data loss. My current architecture involves connecting to an open stream and consume the data and push each post or message into a couple of queues in rabbitmq. The queues hold a copy of each message with one being the processing queue and one being the storage queue. I then consume each queue by doing processing on the processing queue which is time intensive but my workers keep up fine and write all the storage queue contents to files and that works fine.
If my volume were to increase 100x, I am told this current setup would not be able to handle it and that using the mongo/redis approach would be better. I am not sure how this would be implemented: would I consume the stream into mongo and then from there into queues, and why would this be a better approach?