Kafka set maximum records consumed in a minute

I'm creating a scraper. The producer sends data to Kafka's topic with the information about links to be scraped. The consumer is an AWS Lambda function that will be triggered when a message is received on that topic.
To avoid blocking, I want to put a cap on the maximum number of messages consumed in a given time window. For example, I want to consume only 5 messages per minute, while the producer keeps pushing messages to Kafka.
How can I achieve this?
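If the consumer were a plain Kafka client rather than a Lambda trigger, one way to sketch this (broker, group, and topic names below are placeholders) is to cap max.poll.records at 5 and pace the poll loop to roughly one poll per minute; the broker keeps buffering whatever the producer writes in the meantime:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ThrottledConsumer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "scraper");                 // placeholder
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 5); // at most 5 records per poll()
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("links-to-scrape")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println("scrape: " + r.value()));
                    if (!records.isEmpty()) {
                        Thread.sleep(60_000); // wait out the minute; the producer keeps writing
                    }
                }
            }
        }
    }

The default max.poll.interval.ms is 5 minutes, so sleeping 60 seconds between polls does not get the consumer evicted from the group. For the Lambda trigger itself, the event source mapping's batch size controls how many records each invocation receives, but as far as I know it offers no direct per-minute rate cap.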

Related

Difference between kafka batch and kafka request

I was not able to find a satisfactory answer anywhere, so sorry if this question looks trivial:
In Kafka, on the producer side, can a request contain multiple batches for different partitions?
I see the words batch and request used as synonyms in the documentation, and I was hoping to find some clarity on this.
If yes, how does this affect the ack policy?
Are acks on a per-batch or per-request basis?
A Kafka request (and response) is a message sent over the network between a Kafka client and a broker. The Kafka protocol uses many types of requests; you can find them all in the Kafka protocol documentation.
The Produce and Fetch requests are used to exchange records. They both contain Kafka batches: it's the RECORDS field in the protocol description. A Kafka batch groups several records together and saves some bytes by sharing the metadata of all its records. You can find the exact format of a batch in the documentation.
TLDR:
Requests/responses are the full messages exchanged between Kafka clients and brokers. Some requests contain Kafka batches that are groups of records.
I'm not sure whether you are asking about the producer or the consumer side, so here is some info that might answer your question.
On producer side:
By default, the Kafka producer accumulates records in a batch of up to 16KB (batch.size).
By default, the producer has up to 5 requests in flight, meaning that 5 batches can be sent to Kafka at the same time. Meanwhile, the producer starts accumulating data for the next batches.
The acks config controls how many brokers must acknowledge a write in order to consider each request successful.
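As a sketch of those producer defaults made explicit (the broker address and topic name are placeholders, not part of the answer above):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class DefaultsProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // default: 16KB per batch
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // default: 5 in-flight requests
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // how many brokers must acknowledge each request
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value")); // placeholder topic
            }
        }
    }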
On consumer side:
By default, the Kafka consumer regularly calls poll() to get a maximum of 500 records per poll (max.poll.records).
Also by default, the Kafka consumer auto-commits offsets every 5 seconds (auto.commit.interval.ms).
This means the consumer commits the offsets of all records returned by the poll() calls of the last 5 seconds.
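And the consumer-side counterparts, again as a sketch with placeholder broker/group/topic names:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DefaultsConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);       // default: max records per poll()
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);    // default: auto-commit offsets
            props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 5000); // default: commit every 5s
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // placeholder topic
                while (true) {
                    consumer.poll(Duration.ofMillis(500))
                            .forEach(r -> System.out.println(r.value()));
                }
            }
        }
    }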
Hope this helps!

Kafka to Kafka -> reading source kafka topic multiple times

I am new to Kafka and I have a configuration with a source Kafka topic whose messages have the default retention of 7 days. I have 3 brokers, with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic into my target Kafka topic, I receive them in the same order. Now my question: if I try to reprocess all the messages from my source Kafka into my target Kafka, I see that my target Kafka does not receive any messages. I know duplication should be avoided, but say I have a scenario where I have 100 messages in my source Kafka and I expect 200 messages in my target Kafka after running the job twice. Instead, I just get 100 messages in my first run and my second run returns nothing.
Can someone please explain why this is happening and what the mechanism behind it is?
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read and its offset committed, it won't be re-read by the same consumer group. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages; Kafka now moves the current offset to 100.
The current offset is a pointer to the last record that Kafka has sent to a consumer in the most recent poll and that has been committed. Because of the current offset, the consumer doesn't get the same record twice. Please go through the following URL for a complete explanation of offset management.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
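In other words, to actually reprocess the topic you have to get past the committed offset. A sketch of the two usual options (broker address, group id, and topic name are placeholders): consume with a fresh group.id plus auto.offset.reset=earliest, or rewind the existing consumer with seekToBeginning():

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReprocessFromStart {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "reprocess-run-2"); // fresh group: no committed offsets yet
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // a fresh group starts at offset 0
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("source-topic")); // placeholder topic
                // Alternative: keep the old group.id and rewind instead:
                // consumer.poll(Duration.ofMillis(100));           // let the group assign partitions first
                // consumer.seekToBeginning(consumer.assignment()); // reset the position to offset 0
                while (true) {
                    consumer.poll(Duration.ofMillis(500))
                            .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
                }
            }
        }
    }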

Are SQS and Kafka the same?

Are Kafka and SQS the same?
I see that both are message queueing systems and are event-based.
Do they serve the same purpose? If not, how are they different?
Apache Kafka and Amazon SQS are both used for message streaming, but they are not the same.
Apache Kafka follows a publish-subscribe model: a producer sends an event/message to a topic, and one or more consumers subscribe to that topic to get the event/message. A topic is split into partitions for parallel streaming. There is also the consumer group concept: when a message is read from a partition of a topic, its offset is committed to mark it as already read by that consumer group, which avoids inconsistent reads under concurrency. However, another consumer group can still read that message from the partition.
Amazon SQS, in contrast, is queue-based, and a queue can be created in any AWS region. You push messages to a queue, and consumers pull messages from it, with each message going to only one consumer; that's why SQS is pull-based. SQS queues come in two types: FIFO and Standard.
There is another AWS service, Amazon SNS, which is publish-subscribe based like Kafka, but SNS has no message retention policy. It is meant for instant delivery such as email, SMS, etc.: it can only push a message to subscribers that are available at the time, otherwise the message is lost. However, SQS combined with SNS can overcome this drawback. SNS with SQS is called the fanout pattern: a message published to an SNS topic is distributed to multiple SQS queues in parallel, and the SQS queues provide persistence because SQS has a retention policy and can keep a message for up to 14 days (default 4 days). SQS with SNS can achieve high-throughput parallel streaming and can replace Apache Kafka.
Yes, they are two messaging systems, but there is a big difference:
Kafka
Kafka is a very scalable system and fits high workloads where you want to send messages in batches (to get good message throughput).
A Kafka topic consists of a number of partitions that can be read completely in parallel by different consumers in one consumer group, which gives very good performance.
For example, if you need to build a highly loaded streaming system, Kafka is well suited for it.
SQS
SQS is an Amazon-managed service (so you do not have to maintain the infrastructure yourself).
SQS is better for eventing, where a message (event) needs to be caught by some client and is then automatically popped off the queue.
In my experience, SQS is not as fast as Kafka and it does not fit high workloads; it is much more suitable for eventing where the number of events per second is not that high.
For example, if you want to react to an S3 file upload (to start processing the file), SQS works very well.
SQS and Kafka are both messaging systems.
The primary differences are :
Ordering at scale.
Kafka - Produced messages are always consumed in order within a partition, irrespective of the number of items in the queue.
SQS - "A FIFO queue looks through the first 20k messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you can't consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog"
Limit on the Number of groups/topic/partitions
Kafka - Although the limit is quite high, the number of topics/partitions is usually in the order of thousands (which can increase depending on the cluster size).
SQS - "There is no quota to the number of message groups within a FIFO queue."
Deduplication - Kafka does not deduplicate data if the same data is produced multiple times (the idempotent producer only covers retries of the same produce request). SQS tries to dedup messages based on the dedup-id and the dedup-interval: "Assuming that the producer receives at least one acknowledgement before the deduplication interval expires, multiple retries neither affect the ordering of messages nor introduce duplicates."
Partition management. Kafka - Partitions are created and managed by the user. SQS - SQS controls the number of partitions and can increase or decrease it depending on the load and usage pattern.
Dead letter queue - Kafka does not have the concept of a DL queue (though one can be explicitly created and maintained by the user). SQS supports a DL queue natively.
Overall, to summarise the points above: SQS is meant for offloading background tasks to an async pipeline, while Kafka is much more scalable and should be used as a stream-processing pipeline.
SQS is a queue. You have a list of messages that would need to be processed by some other part of the application. A message would ideally be processed once by one processor and would be marked as processed and removed from the queue. The purpose of the queue is to coordinate and distribute the processing of messages among the different processors.
Kafka is more similar to Kinesis which is used mainly for data streaming. Messages are stored in topics for other components to read. Any component can listen to topics and/or read all messages at any time. The main purpose is to allow the efficient delivery of messages to any number of recipients and allow the continuous streaming of data between components in a dynamic and elastic way.
At a bird's-eye view, there is one main difference:
Kafka is used for the pub-sub model. If a producer sends a single message and a Kafka topic has 2 consumers (in different consumer groups), both consumers will receive the message.
SQS follows the competing-consumers pattern. If a producer sends a message and the SQS queue has 2 consumers, only one consumer will receive it. The other one won't get the message if the first consumer processed it successfully. The second consumer has a chance to receive the message only if its visibility times out, i.e. the first consumer is not able to process the message and delete it within the visibility timeout.
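A minimal sketch of that receive/process/delete cycle with the AWS SDK for Java v2 (the queue URL is a placeholder): until a consumer deletes the message, it stays in the queue and becomes visible to the competing consumer again once the visibility timeout expires:

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
    import software.amazon.awssdk.services.sqs.model.Message;
    import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

    public class CompetingConsumer {
        public static void main(String[] args) {
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder
            try (SqsClient sqs = SqsClient.create()) {
                ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .maxNumberOfMessages(1)
                        .visibilityTimeout(30) // seconds the message is hidden from other consumers
                        .build();
                for (Message m : sqs.receiveMessage(receive).messages()) {
                    System.out.println("processing: " + m.body());
                    // Deleting acknowledges success; if we crashed before this line,
                    // the message would reappear after the visibility timeout and
                    // the other consumer could pick it up.
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(m.receiptHandle())
                            .build());
                }
            }
        }
    }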

Apache NiFi - Asynchronously process messages after Kafka consumer

Currently we are using Apache NiFi to consume messages via a Kafka consumer. The output of the Kafka consumer is connected to a DB processor, which takes the messages from the queue (from the consumer) and runs the stored proc/processing on them. So the DB processor works on one message at a time from the queue; I can set the DB processor to run n threads in parallel, but each thread still works on one message at a time.
I am looking to do something like below:
1. The processor after the consumer will just take messages from the queue and wait for a "batch", i.e. a total of 1000 messages.
2. As soon as it has 1000 messages, OR 60 seconds have passed and the message count is < 1000, it pushes them to another processor, which can be the DB stored proc running the business logic on that group of messages.
3. Mainly, I want the above to be multithreaded, i.e. if we get 3000 messages, the first processor reads them in 3 batches and pushes them to the DB processor in parallel.
So I want to know: is there any processor that can do point 2 above, i.e. just read messages and push them onward based on batch/time rules?
If you can leverage NiFi's record processors, then ConsumeKafkaRecord with a batch size of 1000 followed by PutDatabaseRecord will give you behavior similar to what you are describing.
If you don't always have enough messages available in the Kafka topic at consume time, adding MergeContent or MergeRecord in between lets you wait for a certain amount of time or number of messages.

Ensure that all messages are consumed in Kafka

I have a consumer server for Apache Kafka, and it is consuming messages.
However, in the poll loop I have a time-consuming process (DB access).
So my questions are:
How to make sure that I read all the messages in a given timeframe on the consumer?
Do I need to increase the number of consumer clients?
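A common pattern for a time-consuming poll loop, sketched below under the assumption of the plain Java consumer (broker, topic, and the writeToDb helper are placeholders): hand the slow work to another thread and keep calling poll() with the partitions paused, so the broker doesn't evict the consumer for exceeding max.poll.interval.ms. As for the second question: adding consumer clients helps only up to the number of partitions in the topic.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SlowProcessingConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "db-writer");               // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            ExecutorService executor = Executors.newSingleThreadExecutor();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    if (records.isEmpty()) continue;

                    // Hand the slow DB work to another thread...
                    Future<?> work = executor.submit(() -> records.forEach(r -> writeToDb(r.value())));

                    // ...and keep calling poll() (paused, so it returns nothing)
                    // so the broker doesn't consider this consumer dead.
                    consumer.pause(consumer.assignment());
                    while (!work.isDone()) {
                        consumer.poll(Duration.ofMillis(200));
                    }
                    consumer.resume(consumer.assignment());
                }
            }
        }

        static void writeToDb(String value) {
            // placeholder for the time-consuming DB access
        }
    }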