Google Cloud Pub/Sub with Apache Beam: acknowledgement is slow - apache-beam

Apache Beam pulls packets from Google Cloud Pub/Sub at 30 packets per second but acknowledges only 20 per second. I need to know why this happens and how I can limit the packet input per second. We use a window of 20 seconds, and the Pub/Sub expiry is also 20 seconds.
We use ReadStringsFromPubSub to read and WindowInto to set the rate limit.
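A quick back-of-the-envelope check of the numbers above, as a sketch only. It assumes both rates stay constant, and the function name is made up for illustration: if pulls outpace acknowledgements by 10 per second, the unacknowledged backlog keeps growing.

```python
def unacked_backlog(pull_rate, ack_rate, seconds):
    """Messages pulled but not yet acknowledged after `seconds`.

    Assumes both rates are constant (an assumption, not a measurement).
    """
    return max(0, (pull_rate - ack_rate) * seconds)


# Rates from the question: 30 pulled/s, 20 acknowledged/s, 20 s window.
backlog = unacked_backlog(pull_rate=30, ack_rate=20, seconds=20)
print(backlog)  # 200
```

If the 20-second expiry refers to the acknowledgement deadline, any message in that backlog that is not acknowledged in time becomes eligible for redelivery. As far as I understand, WindowInto only groups elements into windows by timestamp; it does not slow down the source.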

Related

How to read specific number of messages per minute from apache kafka message queue

How can I read a specific number of messages per minute from an Apache Kafka message queue? For example, imagine that there are 100 messages in the queue; how can I read only 5 messages per minute? I don't know how to set "max.partition.fetch.bytes" because the byte size is not the same for every message.
Is there a way to dynamically set this to read 5 messages per minute?
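One way to approach this is to throttle on the client side rather than through byte-size settings. Below is a minimal sketch of a fixed-window limiter written as a plain Python generator. It is not Kafka API: `throttle`, `fake_clock` and `fake_sleep` are hypothetical names, and `messages` stands in for whatever iterator your consumer gives you. With a real consumer, keep in mind that long gaps between poll() calls can exceed max.poll.interval.ms and trigger a rebalance.

```python
import time


def throttle(messages, max_per_interval, interval_s,
             clock=time.monotonic, sleep=time.sleep):
    """Yield items from `messages`, at most `max_per_interval` per window."""
    window_start = clock()
    sent = 0
    for msg in messages:
        if sent >= max_per_interval:
            # Quota used up: wait out the rest of the current window.
            remaining = interval_s - (clock() - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            sent = 0
        yield msg
        sent += 1


# Demo with a fake clock so the example finishes instantly.
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    fake_now[0] += seconds


delivered = [(m, fake_clock())
             for m in throttle(range(12), 5, 60.0, fake_clock, fake_sleep)]
print(delivered)  # 5 items at t=0, 5 at t=60, 2 at t=120
```

Because the limit counts messages rather than bytes, varying message sizes do not matter here.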

Kafka Streams Topology which inserts records to DynamoDB Model - Consumption rate is low

This issue has been bothering me for a few days now, and I have come here for help figuring out the root cause. Let me describe the whole issue.
Problem Statement
I have a Kafka Streams topology which reads JSON strings from a Kafka topic. There is a processor which gets these messages and inserts them into DynamoDB using AmazonDynamoDBAsyncClient. That's pretty much all the topology does.
More Details
The source Kafka topic has 200 partitions. The Kafka Streams topology is currently configured with 50 stream threads (we previously tried 10, 20, 30, 100 and 200, with no luck).
Issue Being Faced
We visualize the lag on the Kafka topic along with the consumption rate (per minute) in a Grafana dashboard. After the Streams process is started, there is a steady consumption rate of 300K to 500K messages per minute for around 5 to 6 minutes. After that, the rate drops steeply and stays fixed at 63K per minute. It doesn't go up or down; it stays right at 63K messages per minute.
Parameters configured
Poll_ms - 10000 (10 secs)
Max_Poll_Records - 10000
Max.Partition.Fetch.bytes - 50 MB
commit_ms - 15000 (15 secs)
kafka.consumer.heartbeat.interval - 60 secs
session.timeout.ms - 180000 (3 minutes)
partition.assignment.strategy - org.apache.kafka.clients.consumer.RoundRobinAssignor
AmazonAsyncClient Connection Pool Size - 200 (to match the no. of topic partitions)
DynamoDB Model
We even looked at the metrics on the corresponding DynamoDB table and saw throttling for 10 or 15 seconds, after which autoscaling kicked in. We saw no capacity issues or errors in CloudWatch.
Please let me know if more details are needed or if the problem statement is unclear. Appreciate the help.
Threaddumps
We checked the thread dumps for any clues. We only see 200 consumer threads in the "WAITING" state for polling, and there were no threads in the BLOCKED state.
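To put the plateau in perspective, here is a small sketch that converts the reported per-minute rate into per-second figures. The only inputs are the numbers from the question (63K per minute, 200 partitions, 50 stream threads); the helper name is made up.

```python
def per_second(rate_per_minute):
    """Convert a per-minute rate into a per-second rate."""
    return rate_per_minute / 60


plateau = per_second(63_000)
print(plateau)        # 1050.0 messages/s overall
print(plateau / 200)  # 5.25 messages/s per partition
print(plateau / 50)   # 21.0 messages/s per stream thread
```

A flat ceiling like this can point at a fixed limit somewhere downstream of Kafka (for example write capacity or a client-side cap), but that is a hypothesis to check against the metrics, not a diagnosis.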

Better Understanding Min Fetch Bytes Within Kafka?

I'm looking at some config I'm tuning in Kafka for batching records to a file.
I see fetch.min.bytes, which is the minimum number of bytes returned from a single poll across N partitions of a topic. Here is the scenario I'm concerned about:
I set the minimum fetch size to 100 MB worth of record data. Let's say I have 250 MB worth of data. I do two polls and persist 200 MB. Now I have 50 MB sitting in the queue. I still want it to be processed, but I don't expect more data to come in. If the timeout is hit, will it just grab the remaining 50 MB?
Sorry, I should have looked at the docs a bit more closely. Seeing this is used in conjunction with the timeout.
fetch.max.wait.ms
By setting fetch.min.bytes, you tell Kafka to wait until it has enough
data to send before responding to the consumer. fetch.max.wait.ms lets
you control how long to wait. By default, Kafka will wait up to 500
ms. This results in up to 500 ms of extra latency in case there is not
enough data flowing to the Kafka topic to satisfy the minimum amount
of data to return. If you want to limit the potential latency (usually
due to SLAs controlling the maximum latency of the application), you
can set fetch.max.wait.ms to a lower value. If you set
fetch.max.wait.ms to 100 ms and fetch.min.bytes to 1 MB, Kafka will
receive a fetch request from the consumer and will respond with data
either when it has 1 MB of data to return or after 100 ms, whichever
happens first.
tl;dr: if the timeout expires before the minimum is reached, it just returns the remaining 50 MB.
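The "whichever happens first" rule can be written down as a tiny model. This is a simplified sketch of the behaviour described above, not broker code: `fetch_response_bytes` is a hypothetical function, and the 100 MB response cap is an assumption chosen so that the scenario from the question plays out as two polls of 100 MB each.

```python
MB = 1024 * 1024


def fetch_response_bytes(available, min_bytes, max_bytes,
                         waited_ms, max_wait_ms):
    """Return how many bytes would be sent, or None if the broker keeps waiting.

    Simplified: responds once enough data is available OR the wait time is up.
    """
    if available >= min_bytes or waited_ms >= max_wait_ms:
        return min(available, max_bytes)
    return None


# Scenario from the question: 250 MB in the topic, minimum fetch of 100 MB.
first = fetch_response_bytes(250 * MB, 100 * MB, 100 * MB, 0, 500)
second = fetch_response_bytes(150 * MB, 100 * MB, 100 * MB, 0, 500)
still_waiting = fetch_response_bytes(50 * MB, 100 * MB, 100 * MB, 100, 500)
after_timeout = fetch_response_bytes(50 * MB, 100 * MB, 100 * MB, 500, 500)

print(first // MB, second // MB)   # 100 100
print(still_waiting)               # None
print(after_timeout // MB)         # 50
```

So the last 50 MB is not stuck; it is only delayed by up to fetch.max.wait.ms.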

Poll messages from kafka topic every X minutes

I wanted to know whether, in my Kafka Streams application or with Spring-Kafka, there is a way to read messages from a topic at some time interval.
For example, read 1000 records every 5 minutes:
read 1000 records from a topic, wait 5 minutes, then consume 1000 messages again.
I have read the .poll() documentation, but it does not do what I actually want. It says:
The configuration poll.ms is the maximum "blocking time" within poll() if no data available.
Think of it as slow notification processing. Can I handle this with the consumer/producer API or with Kafka Streams?
Thanks !
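The "read a batch, wait, read again" pattern can be sketched independently of any Kafka client. In the sketch below `records` is a stand-in for the stream of messages and `read_in_batches` is a hypothetical helper, not part of Kafka Streams or Spring-Kafka. With the plain consumer API, one option (as far as I know) is to pause() the assigned partitions between batches and keep calling poll(), so the consumer stays in the group while idle.

```python
import itertools
import time


def read_in_batches(records, batch_size, pause_s, handle, sleep=time.sleep):
    """Hand `records` to `handle` in batches, pausing after each batch."""
    iterator = iter(records)
    batches = 0
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            break
        handle(batch)
        batches += 1
        sleep(pause_s)  # with a real consumer: pause partitions, keep polling
    return batches


# Demo: 2500 fake records, 1000 per batch, sleep replaced by a recorder.
batch_sizes = []
pauses = []
count = read_in_batches(range(2500), 1000, 300,
                        handle=lambda b: batch_sizes.append(len(b)),
                        sleep=pauses.append)
print(count, batch_sizes, pauses)  # 3 [1000, 1000, 500] [300, 300, 300]
```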

Testing Kafka producer throughput

We have a Kafka cluster consisting of 3 nodes, each with 32 GB of RAM and a 6-core 2.5 GHz CPU.
We wrote a Kafka producer that receives tweets from Twitter and sends them to Kafka in batches of 5000 tweets.
In the producer we use the producer.send(list<KeyedMessages>) method.
The average size of a tweet is 7 KB.
By printing the time in milliseconds before and after the send statement to measure the time taken to send 5000 messages, we found that it takes about 3.5 seconds.
Questions
Is the way we test Kafka performance correct?
Is using the send method that takes a list of keyed messages the correct way to send a batch of messages to Kafka? Is there any other way?
What are the important configurations that affect producer performance?
You're measuring only the producer side? That metric tells you only how much data you can store in a unit of time.
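For reference, the producer-side numbers from the question work out as follows. This is plain arithmetic on the figures given (5000 messages, 7 KB average, 3.5 seconds); the function name is made up.

```python
def producer_throughput(messages, avg_size_kb, elapsed_s):
    """Return (messages per second, KB per second) for one timed batch."""
    msgs_per_s = messages / elapsed_s
    kb_per_s = msgs_per_s * avg_size_kb
    return msgs_per_s, kb_per_s


msgs_per_s, kb_per_s = producer_throughput(5000, 7, 3.5)
print(round(msgs_per_s), round(kb_per_s))  # 1429 10000
```

That is roughly 1,400 messages or 10 MB per second as seen by the producer. If the producer is configured to send asynchronously, the timer may stop before the data has actually been delivered, so treat the figure with some caution.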
Maybe that's what you wanted to measure, but since the title of your question is "Kafka performance", I would think that you'd actually want to measure end-to-end latency, i.e. how long it takes for a message to go through Kafka.
You'd achieve that by measuring the difference in time between sending a message and receiving that message on the other side, by a consumer.
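A minimal sketch of that measurement idea: stamp each message with its send time and compute the difference on the consumer side. To keep it runnable without a cluster, an in-process queue.Queue stands in for Kafka here, which is purely an assumption for illustration. Against a real cluster you would put a wall-clock timestamp in the message and make sure the producer and consumer hosts have synchronized clocks.

```python
import queue
import threading
import time


def produce(q, n):
    """Send n messages, each carrying its own send timestamp."""
    for i in range(n):
        q.put({"id": i, "sent_at": time.monotonic()})
    q.put(None)  # sentinel: no more messages


def consume(q, latencies_ms):
    """Receive messages and record send-to-receive latency in milliseconds."""
    while True:
        msg = q.get()
        if msg is None:
            break
        latencies_ms.append((time.monotonic() - msg["sent_at"]) * 1000)


q = queue.Queue()
latencies_ms = []
consumer = threading.Thread(target=consume, args=(q, latencies_ms))
consumer.start()
produce(q, 100)
consumer.join()

print(f"messages: {len(latencies_ms)}, max latency: {max(latencies_ms):.3f} ms")
```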
If the cluster is configured correctly (the default configuration will do), you should see latency ranging from only a couple of ms (less than 10 ms) up to 50 ms (a few tens of milliseconds).
Kafka is able to do that because the messages read by the consumer don't even touch the disk, since they are still in RAM (page cache and socket buffer cache). Keep in mind that this only works while your consumers are able to keep up, i.e. you don't have a large consumer lag. If a consumer lags behind the producers, the messages will eventually be evicted from the cache (depending on the rate of messages, i.e. how long it takes for the cache to fill up with new messages) and will thus have to be read from disk. Even that is not the end of the world (an order of magnitude slower, in the range of low 100s of ms), because messages are written consecutively, one after another in a straight line, so reading them takes a single disk seek.
BTW, you'd want to give Kafka only a small portion of those 32 GB, e.g. 5 to 8 GB (even the G1 garbage collector slows down with bigger heaps), and leave everything else unassigned so the OS can use it for page and buffer cache.