Kafka consumer configuration to fetch at interval - apache-kafka

I am new to Kafka and trying to understand the various configuration properties I need to set for my requirement, described below.
I have a Kafka consumer which is expected to fetch 'n' records at a time, and only after they have been successfully processed should the next fetch happen.
Example: If my consumer fetches 10 records at a time and every record takes 5 seconds to process, then the next fetch request should be executed after 50 seconds, and so on.
Considering the above example, could anyone let me know what the values for the configs below should be?
Below is my current configuration. After processing 10 records, it doesn't wait for a minute as I configured; it keeps on fetching and polling.
fetchMaxBytes=50000 //approx size for 10 records
fetchWaitMaxMs=60000 //wait for a minute before a next fetch
requestTimeoutMs= //default value
heartbeatIntervalMs=1000 //acknowledgement to avoid rebalancing
maxPollIntervalMs=60000 //assuming the processing takes one minute
maxPollRecords=10 //we need 10 records to be polled at once
sessionTimeoutMs= //default value
I am using camel-kafka component to implement this.
It would be really great if someone could help me with this. Thanks Much.
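For reference, here is a hedged sketch (broker address, topic name, and group id are placeholders) of how the options listed above would typically be set on a camel-kafka endpoint URI; it only illustrates where the values from the question go, not a confirmed fix for the polling behaviour.

import org.apache.camel.builder.RouteBuilder;

public class TenRecordBatchRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("kafka:my-topic"
                + "?brokers=localhost:9092"
                + "&groupId=my-group"
                + "&maxPollRecords=10"          // poll at most 10 records at a time
                + "&fetchMaxBytes=50000"        // approx. size of 10 records
                + "&fetchWaitMaxMs=60000"       // broker-side wait before answering a fetch
                + "&maxPollIntervalMs=60000"    // processing budget between polls
                + "&heartbeatIntervalMs=1000")
            .process(exchange -> {
                // placeholder for the ~5 seconds of processing per record
            })
            .to("log:processed");
    }
}

The route class would be added to a CamelContext in the usual way; the remaining options from the list (requestTimeoutMs, sessionTimeoutMs) are left at their defaults, as in the question.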

Related

Batching Kafka Events using Faust

I have a Kafka topic we will call ingest that receives an entry every x seconds. I have a process that I want to run on this data but it has to be run on 100 events at a time. Thus, I want to batch the entries together and send them to a new topic called batched-ingest. The two topics will look like this...
ingest = [entry, entry, entry, ...]
batched-ingest = [[entry_0, entry_1, ..., entry_99]]
What is the correct way to do this using faust? The solution I have right now is this...
import faust

app = faust.App("explore", value_serializer="raw")
ingest = app.topic('ingest')
ingest_batch = app.topic('ingest-batch')

@app.agent(ingest, sink=[ingest_batch])
async def test(stream):
    async for values in stream.take(10, within=1000):
        yield values
I am not sure if this is the correct way to do this in Faust. If so, what should I set within to in order to make it always wait until len(values) = 100?
As mentioned in the Faust take documentation, if you omit within from take(100, within=10), the code will block forever if only 99 messages have arrived and the hundredth message never comes. To solve this, add a within timeout so that up to 100 values are processed within 10 seconds; if there is a 10-second period with no events received, the agent will still process what it has gathered.

Deduplication using Kafka-Streams

I want to do deduplication in my kafka-streams application, which uses a state store, and I am following this very good example:
https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
I have few questions about this example.
If I understand correctly, this example briefly does this:
Message comes into input topic
Look at the state store; if the record does not exist there, write it to the state store and forward (return) the record.
If it does exist, drop the record, so deduplication is applied.
But in the code example there is a time window size that you can set, as well as a retention time for the messages in the state store. You can also check whether a record is in the store by giving a time range from timeFrom to timeTo:
final long eventTime = context.timestamp();
final WindowStoreIterator<String> timeIterator = store.fetch(
    key,
    eventTime - leftDurationMs,
    eventTime + rightDurationMs
);
What is the actual purpose of timeFrom and timeTo? I am not sure why I am checking the next time interval, since that would mean checking future messages that have not arrived in my topic yet.
My second question: is this time interval related to, and supposed to hit, the previous time window?
If I am able to search a time interval by giving timeFrom and timeTo, why is the window size important?
If I give a window size of 12 hours, can I guarantee that messages are deduplicated for 12 hours?
I think of it like this:
The first message with key "A" arrives in the first minute after application start-up; after 11 hours, a message with key "A" arrives again. Can I catch this duplicate by giving a large enough time interval, such as eventTime - 12 hours?
Thanks for any ideas !
The time window size decides how long you want the deduplication to apply: forever, or only for, say, 5 minutes. Kafka has to store these records, so a large time window may consume a lot of resources on your server.
As for timeFrom and timeTo: a record (event) may arrive or be processed late in Kafka, so the event time of the record may be 1 minute ago rather than now. When such an "old" record is processed, the check also needs to take into account records that are less old, i.e. records that are relatively in the "future" compared to the "old" one.
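To make the role of the two bounds concrete, here is a hedged sketch that loosely follows the linked Confluent example (class name, fields, and type parameters are simplified for illustration):

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

// The store is scanned for the same key in [eventTime - leftDurationMs, eventTime + rightDurationMs].
// The look-ahead (rightDurationMs) matters because records can be processed out of order:
// a duplicate with a later event time may already sit in the store when this "older" record arrives.
class DeduplicationCheck {
    private final ProcessorContext context;
    private final WindowStore<String, String> store;
    private final long leftDurationMs;   // how far back in time to look
    private final long rightDurationMs;  // how far ahead in time to look

    DeduplicationCheck(ProcessorContext context, WindowStore<String, String> store,
                       long leftDurationMs, long rightDurationMs) {
        this.context = context;
        this.store = store;
        this.leftDurationMs = leftDurationMs;
        this.rightDurationMs = rightDurationMs;
    }

    boolean isDuplicate(final String key) {
        final long eventTime = context.timestamp();
        try (final WindowStoreIterator<String> iterator = store.fetch(
                key,
                eventTime - leftDurationMs,
                eventTime + rightDurationMs)) {
            // Any entry for this key within the range means the record was seen before.
            return iterator.hasNext();
        }
    }
}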

Does Kafka consumer fetch-min-size (fetch.min.bytes) wait for the mentioned size to get filled?

Suppose there are 107 records and each record is 1 KB. If the fetch size is 15 KB, then 105 KB would be consumed in 7 iterations. Now only 2 KB remain: will I get the remaining 2 records in the next iteration, or will the consumer wait for another 15 KB to accumulate? Assume there are no more records after these.
It waits until the time defined in fetch.max.wait.ms is reached, which is 500 ms by default. You can find the description of the two relevant configurations in the Kafka consumer documentation:
fetch.min.bytes
The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request. The default setting of 1 byte means that fetch requests are answered as soon as a single byte of data is available or the fetch request times out waiting for data to arrive. Setting this to something greater than 1 will cause the server to wait for larger amounts of data to accumulate which can improve server throughput a bit at the cost of some additional latency.
fetch.max.wait.ms
The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy the requirement given by fetch.min.bytes.
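As an illustration, here is a hedged consumer setup (broker address, group id, and topic name are placeholders) that reflects this behaviour: the broker answers as soon as fetch.min.bytes of data is available or fetch.max.wait.ms has elapsed, whichever comes first.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MinBytesExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 15 * 1024); // ask the broker for at least ~15 KB ...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);     // ... but wait at most 500 ms for it

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
            // With only 2 KB left on the broker, this poll still returns the last 2 records
            // once fetch.max.wait.ms (500 ms) has elapsed, instead of waiting for 15 KB.
            records.forEach(r -> System.out.println(r.value()));
        }
    }
}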

Window does not assess elements from Kafka Source

I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
    .join(env.addSource(createKafkaConsumer()))
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    .apply(new RichJoinFunction<A, B>() { ... });
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
KafkaTopic A has 1 record, KafkaTopic B has 5. My understanding would be that the JoinFunction is triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A comes in within the 2 hours, another 5 records would be created (2x5 records). However, what comes through to the sink is rather unpredictable; I could not see a pattern. Sometimes there is nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join with prior records.
My key question:
What actually happens here? Are the records only emitted after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would this make my TimeWindow obsolete? Do those two concepts exist in parallel, i.e. the join window is 2 hours, but the join function and output are triggered per element? What about duplicates in that case?
Also, does processing time mean the point in time when the record is consumed from the topic? So if I, for example, use setStartFromEarliest() on start, would all messages consumed within the next two hours be in that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); is set and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events which fall into the given timespan, in your case 2 hours. By default, the window will only output results once the 2 hours are over, because it needs to know that no other events will be coming for this window.
If you want to output early results (e.g. for every incoming record), then you could specify a custom Trigger which fires on every element. See the Trigger API docs for more information about this.
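As a hedged illustration (assuming the windowed join builder accepts a custom trigger via trigger(...)), a minimal trigger that fires on every element while still closing the window at the end of its processing-time span could look like the sketch below. Note that early firings do not purge the window contents, so later firings re-emit joins with records that were already output, which is where the duplicates mentioned in the question would come from.

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;

public class FireOnEveryElementTrigger<W extends Window> extends Trigger<Object, W> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, W window, TriggerContext ctx) throws Exception {
        // Keep the default behaviour of closing the window when its processing time is up.
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        // Fire for this element so join results are emitted without waiting for the full 2 hours.
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.FIRE_AND_PURGE; // final firing when the window ends
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE; // processing-time windows do not use event-time timers
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}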
Update
The window does not start with the first element; windows start at multiples of the window length. For example, if your window size is 2 hours, you can only have windows [0, 2), [2, 4), ..., but not [1, 3) or [3, 5).
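A small sketch using Flink's own alignment helper (assuming access to TimeWindow.getWindowStartWithOffset) makes this concrete:

import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class WindowAlignmentExample {
    public static void main(String[] args) {
        long twoHoursMs = 2 * 60 * 60 * 1000L;
        long timestamp = 5 * 60 * 60 * 1000L + 1; // an event arriving just after the 5-hour mark
        long windowStart = TimeWindow.getWindowStartWithOffset(timestamp, 0, twoHoursMs);
        // The containing window is [4h, 6h), not [5h, 7h): windows are aligned to
        // multiples of the window size, not to the first element.
        System.out.println("window starts at hour " + windowStart / (60 * 60 * 1000L));
    }
}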

Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored?

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~20 megabytes or 100k records, whichever comes first; but if the message rate is low, process at least once per minute (avoid small batches, but keep a minimum MySinkTask.put() rate of once per minute).
This is what I set for consumer settings in an attempt to accomplish it:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I need this fetch.min.bytes setting, or else MySinkTask.put() is called multiple times per second despite the other settings...?
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times and several minutes pass by, until fetch.min.bytes is reached, and then I get them all at once.
I fail to understand so far:
Why is fetch.max.wait.ms=60000 not pushed down from the consumer to the put() call of my connector? Shouldn't it take precedence over fetch.min.bytes?
What setting controls the roughly 2x-per-second calls to MySinkTask.put() when fetch.min.bytes=1 (the default)? I don't understand why it does that; even the verbose output of the Connect runtime settings doesn't show any interval below multiples of seconds.
I've double-checked the log output: the lines INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: printed by the Connect runtime show the expected values as passed via the consumer.-prefixed settings.
The "process at least every interval" part seems not possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to dynamically adjust the ConsumerConfig while the Task is running. :-(
The work-around for now is batching in the Task manually: set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary. This is not ideal, as it adds some overhead to the connector which I had hoped to avoid.
The logic of how Connect does roughly 2x-per-second batching from its consumer's poll to SinkTask.put() remains a mystery to me, but it's better than being called for every message.
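For completeness, here is a hedged sketch of that manual batching work-around (class name, thresholds, and writeBatch() are illustrative, not taken from the original connector):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {

    private static final int MAX_BUFFERED_RECORDS = 100_000;
    private static final long MAX_BUFFER_AGE_MS = 60_000L; // "process at least every minute"

    private final List<SinkRecord> buffer = new ArrayList<>();
    private long lastFlushMs = System.currentTimeMillis();

    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records); // Connect may call this with zero records several times per second
        boolean full = buffer.size() >= MAX_BUFFERED_RECORDS;
        boolean stale = System.currentTimeMillis() - lastFlushMs >= MAX_BUFFER_AGE_MS;
        if (full || stale) {
            writeBatch(buffer); // hypothetical method: write the batch to the target system
            buffer.clear();
            lastFlushMs = System.currentTimeMillis();
        }
    }

    private void writeBatch(List<SinkRecord> batch) {
        // target-system-specific write goes here
    }

    @Override
    public void start(Map<String, String> props) { }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.1";
    }
}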