Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored? - apache-kafka

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~20 megabytes or 100k records, whichever comes first; but if the message rate is low, process at least once per minute (avoid small batches, but guarantee a minimum MySinkTask.put() rate of once per minute).
These are the consumer settings I used in an attempt to accomplish that:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I need this fetch.min.bytes setting; otherwise MySinkTask.put() is called multiple times per second despite the other settings...?
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times while several minutes pass by, until fetch.min.bytes is reached, and then I get all the records at once.
What I fail to understand so far:
Why is fetch.max.wait.ms=60000 not propagating from the consumer down to my connector's put() call? Shouldn't it take precedence over fetch.min.bytes?
What setting controls the roughly-twice-per-second call to MySinkTask.put() when fetch.min.bytes=1 (the default)? I don't understand why it does that; even the verbose output of the Connect runtime settings doesn't show any interval below multiples of seconds.
I've double-checked the log output, and the lines INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: printed by the Connect runtime show the expected values that I pass via the consumer.-prefixed properties.

The "process at least every interval" part seems not possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to dynamically adjust the ConsumerConfig while the Task is running. :-(
Work-around for now is batching in the Task manually; set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary. This is not very ideal as it infers some overhead for the Connector which I hoped to avoid.
The logic how Connect does a ~ 2x per second batching from its consumer's poll to SinkTask.put() remains a mystery to me, but it's better than being called for every message.
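A minimal sketch of that manual-batching workaround, assuming illustrative thresholds and a hypothetical writeBatch() helper for the actual sink writes (none of these names or values come from the original setup):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MySinkTask extends SinkTask {
    private final List<SinkRecord> buffer = new ArrayList<>();
    private long lastFlushMs;

    private static final int MAX_RECORDS = 100_000;   // illustrative batch-size threshold
    private static final long MAX_AGE_MS = 60_000L;   // flush at least once per minute

    @Override
    public void start(Map<String, String> props) {
        lastFlushMs = System.currentTimeMillis();
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        buffer.addAll(records); // put() may be called frequently, even with 0 records
        long now = System.currentTimeMillis();
        if (buffer.size() >= MAX_RECORDS || now - lastFlushMs >= MAX_AGE_MS) {
            writeBatch(buffer);
            buffer.clear();
            lastFlushMs = now;
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Connect calls this before committing offsets; drain the buffer so committed
        // offsets never run ahead of data actually written to the sink system.
        if (!buffer.isEmpty()) {
            writeBatch(buffer);
            buffer.clear();
            lastFlushMs = System.currentTimeMillis();
        }
    }

    private void writeBatch(List<SinkRecord> batch) {
        // hypothetical: write the buffered records to the target system
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.1"; }
}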

Related

Is there a way in the S3 Kafka sink connector to ensure all records are consumed

I have a problem with the S3 Kafka connector, but I have also seen this with the JDBC connector.
I'm trying to see how I can ensure that my connectors are actually consuming all the data in a certain topic.
Because of the flush sizes, I expect there could be a certain delay (10-15 minutes) in the consumption of the messages, but I notice that I end up with big delays (days...) and my consumers always show some lag on the offsets.
I was reading/viewing this post/video about it, for example (mainly that comment): https://rmoff.net/2020/12/08/twelve-days-of-smt-day-1-insertfield-timestamp/
https://github.com/confluentinc/demo-scene/blob/master/kafka-connect-single-message-transforms/day1.adoc
"flush.size of 16 is stupidly low, but if it’s too high you have to wait for your files to show up in S3 and I get bored waiting."
It does mention there that if flush.size is bigger than the number of available records, the records can take some time to be consumed, but I never expected this to be more than a couple of minutes.
How can I ensure that all records are consumed? I would really like to avoid setting flush.size = 1.
Maybe this is just a misunderstanding on my part about the sink connectors, but I expect them to work like a normal consumer, i.e. to consume all the data, with these flush/batch sizes acting more as timeouts and performance tuning.
If anyone is interested, these are my connector configurations.
For S3 sink:
topics.regex: com.custom.obj_(.*)
storage.class: io.confluent.connect.s3.storage.S3Storage
s3.region: ${#S3_REGION#}
s3.bucket.name: ${#S3_BUCKET#}
topics.dir: ${#S3_OBJ_TOPICS_DIR#}
flush.size: 200
rotate.interval.ms: 20000
auto.register.schemas: false
s3.part.size: 5242880
parquet.codec: snappy
offset.flush.interval.ms: 20000
offset.flush.timeout.ms: 5000
aws.access.key.id: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:accesskey}
aws.secret.access.key: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:secretkey}
format.class: com.custom.connect.s3.format.parquet.ParquetFormat
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
partitioner.class: io.confluent.connect.storage.partitioner.DailyPartitioner
timestamp.extractor: Record
locale: ${#S3_LOCALE#}
timezone: ${#S3_TIMEZONE#}
store.url: ${#S3_STORAGE_URL#}
connect.meta.data: false
transforms: kafkaMetaData,formatTs
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.formatTs.format: yyyy-MM-dd HH:mm:ss:SSS
transforms.formatTs.field: message_ts
transforms.formatTs.target.type: string
transforms.formatTs.type: org.apache.kafka.connect.transforms.TimestampConverter$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
For JDBC sink:
topics.regex: com.custom.obj_(.*)
table.name.format: ${#PREFIX#}${topic}
batch.size: 200
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
connection.url: ${#DB_URL#}
connection.user: ${#DB_USER#}
connection.password: ${#DB_PASSWORD#}
auto.create: true
auto.evolve: true
db.timezone: ${#DB_TIMEZONE#}
quote.sql.identifiers: never
transforms: kafkaMetaData
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
I've read these two already and am still not sure:
Kafka JDBC Sink Connector, insert values in batches
https://github.com/confluentinc/kafka-connect-jdbc/issues/290
Also, I've seen examples of people using the settings below (which I don't think would help my use case), but I was wondering: are these values defined per connector?
I'm also a bit confused by the fact that in the documentation I always find the configuration without the consumer. prefix, while in the examples I always find it with consumer., so I guess this means these are generic properties that apply to both consumers and producers?
consumer.max.poll.interval.ms: 300000
consumer.max.poll.records: 200
Does anyone have some good feedback?
Regarding the Kafka S3 sink connector configuration provided in the question above:
There are configuration fields you can tweak to control the consumption/upload-to-S3 rate, and thus reduce the Kafka offset lag you are seeing. It's best practice to use variables for the fields below in your configuration.
From personal experience, the tweaks you can do are:
Tweak flush.size
flush.size: 800
which is (as you stated):
Maximum number of records: The connector’s flush.size configuration
property specifies the maximum number of records that should be
written to a single S3 object. There is no default for this setting.
I would prefer bigger files and use the timing tweaks below to control consumption. Make sure your records are not too big or too small to produce reasonably sized files, since the resulting file size is roughly flush.size * RECORD_SIZE.
Tweak rotate.interval.ms
rotate.interval.ms: (I would delete this field; see the rotate.schedule.interval.ms explanation below)
which is:
Maximum span of record time: The connector’s rotate.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records.
Add the field rotate.schedule.interval.ms:
rotate.schedule.interval.ms: 60000
which is:
Scheduled rotation: The connector’s rotate.schedule.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records. Unlike with rotate.interval.ms, with
scheduled rotation the timestamp for each file starts with the system
time that the first record is written to the file. As long as a record
is processed within the timespan specified by
rotate.schedule.interval.ms, the record will be written to the file.
As soon as a record is processed after the timespan for the current
file, the file is flushed, uploaded to S3, and the offset of the
records in the file are committed. A new file is created with a
timespan that starts with the current system time, and the record is
written to the file. The commit will be performed at the scheduled
time, regardless of the previous commit time or number of messages.
This configuration is useful when you have to commit your data based
on current server time, for example at the beginning of every hour.
The default value -1 means that this feature is disabled.
You are currently using the default of -1, which means this rotation is disabled. This tweak will make the most difference, as each task will flush to S3 and commit more frequently.
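Putting those tweaks together, the adjusted part of the S3 sink configuration could look like the sketch below (800 and 60000 are illustrative values, not numbers tuned for your workload; rotate.interval.ms is dropped as suggested above):
flush.size: 800
rotate.schedule.interval.ms: 60000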
Regarding the second part of the question:
You can gain observability by adding metrics to your Kafka and Connect deployments using, for example, Prometheus and Grafana. A configuration guide is linked below in the sources.
Sources:
Connect S3 sink
kafka-monitoring-via-prometheus
Connect S3 Sink config Docs

Will Kafka Producer always waits for the value specified by linger.ms, before sending a request?

As per LINGER_MS_DOC in the ProducerConfig Java class:
"The producer groups together any records that arrive in between
request transmissions into a single batched request. Normally this
occurs only under load when records arrive faster than they can be
sent out. However in some circumstances the client may want to reduce
the number of requests even under moderate load. This setting
accomplishes this by adding a small amount of artificial delay; that
is, rather than immediately sending out a record the producer will
wait for up to the given delay to allow other records to be sent so
that the sends can be batched together. This can be thought of as
analogous to Nagle's algorithm in TCP. This setting gives the upper
bound on the delay for batching: once we get "BATCH_SIZE_CONFIG" worth
of records for a partition it will be sent immediately regardless of
this setting, however if we have fewer than this many bytes
accumulated for this partition we will 'linger' for the specified time
waiting for more records to show up. This setting defaults to 0 (i.e.
no delay). Setting "LINGER_MS_CONFIG=5" for example, would have the
effect of reducing the number of requests sent but would add up to 5ms
of latency to records sent in the absence of load."
I searched for a suggested value for linger.ms but could not find a higher value recommended anywhere; in most places 5 ms is mentioned for linger.ms.
For testing, I have set "batch.size" to 16384 (16 KB)
and "linger.ms" to 60000 (60 seconds)
As per the doc, I thought that if I send a message of size > 16384 bytes, the producer would not wait and would send the message immediately, but I am not observing that behavior.
I am sending events of size > 16384 bytes, but the producer still waits for 60 seconds. Am I misunderstanding the purpose of "batch.size"? My understanding of "batch.size" and "linger.ms" is that whichever threshold is met first causes the batch to be sent.
In this case, if linger.ms acts as a minimum wait time and "batch.size" is not given preference, then I guess setting a high value for linger.ms is not right.
Here are the Kafka properties used in the yaml:
producer:
  properties:
    acks: all
    retries: 5
    batch:
      size: 16384
    linger:
      ms: 10
    max:
      request:
        size: 1046528
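For reference, the same settings expressed directly on a plain Java producer might look like the sketch below; the bootstrap server and the String serializers are assumptions added only to make it self-contained:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 5);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);   // 16 KB, as in the yaml
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);       // 10 ms in the yaml; 60000 in the test described above
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1046528);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());   // assumption
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); // assumption
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}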

Can I break anything if I set log.segment.delete.delay.ms to 0

I'm trying to put an absolute limit to the size of a topic so it won't fill up disks in case of unexpectedly high write speeds.
Mostly, this amounts to a calculation like segment = min(1GB, 0.1 * $max_space / $partitions), then setting retention.bytes = ($max_space / $partitions - $segment) and segment.bytes = $segment, and reducing log.retention.check.interval.ms. (You can even pull stunts like determining your maximum write rate (disk- or network-bound) and then subtracting a safety margin based on that and log.retention.check.interval.ms, and maybe some absolute offset for the 10MB head index file.) That has all been discussed in other questions; I'm not asking about any of that.
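To make that arithmetic concrete with illustrative numbers (not from the question): with $max_space = 100 GB and 10 partitions, segment = min(1 GB, 0.1 * 100 GB / 10) = 1 GB, so you would set segment.bytes = 1 GB and retention.bytes = 100 GB / 10 - 1 GB = 9 GB per partition.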
I'm wondering what log.segment.delete.delay.ms is for, and whether I can safely set it to 0 (or some very small value - 5?). The documentation just states "The amount of time to wait before deleting a file from the filesystem". I'd like to know why you would do that, i.e. what or who benefits from this amount of time being non-zero.
This configuration can be used to ensure that a consumer of your topic actually has time to consume a message before it gets deleted.
This configuration declares the delay that is applied when the LogCleaner is triggered by the time-based or size-based cleanup policies set through log.retention.bytes or log.retention.ms.
Especially when you have calculated the retention based on the size of the data, this configuration can help you set a minimum amount of time before a message gets deleted, allowing the consumer to actually consume the message during that time.

Consuming exact number of events from Kafka

Naturally, streaming apps are unbounded, but I have a new use case where I need to consume an exact number of messages or fewer (configurable, for example 100 messages) from a Kafka topic, and then the app should stop.
The motivation is very simple, the flow is rarely used and no need for real-time, so there is no reason to have a permanent streaming app.
It is enough to invoke the app once in a while.
Is there a way to implement it using FlinkKafkaConsumer?
Adding a counter that kills the app when it reaches the required number of messages is an option, but I would prefer something more elegant.
You could create a wrapper SourceFunction for the FlinkKafkaConsumer, which delegates to it and terminates when the target number of messages has been read. When all of the sources of a Flink streaming job are done, the workflow will automatically stop.
You can set max.poll.records to 100 and adjust fetch.max.bytes to roughly the size of 100 messages. E.g. if 1 message = 10 bytes, then fetch.max.bytes = 100 * 10 bytes.
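A sketch of how that suggestion could be expressed as consumer properties, taking the answer's 10-bytes-per-message figure purely as an illustration:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class BoundedPollProps {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);      // at most 100 records per poll
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 100 * 10);  // ~100 messages of ~10 bytes each (illustrative)
        return props;
    }
}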

Kafka producer future metadata in callback

In my application, when I send messages, I use the metadata in the callback to save the offset of the record for future use. However, sometimes metadata.offset() returns -1, which makes things hard later.
Why does this happen, and is there a way to get the offset without consuming the topic to find it?
Edit: I am currently on acks=0; when I switch to acks=1 I don't get these -1 offsets anymore, but my performance drops drastically: from 100k messages in 10 seconds to 1 minute.
acks=0 If set to zero then the producer will not wait for any
acknowledgment from the server at all. The record will be immediately
added to the socket buffer and considered sent. No guarantee can be
made that the server has received the record in this case, and the
retries configuration will not take effect (as the client won't
generally know of any failures). The offset given back for each
record will always be set to -1.
This is not exactly true, as out of 100k messages I got 95k with offsets, but I guess that's normal.
I will still need to find another solution to get the offset with acks=0.
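For context, a minimal sketch of reading the offset from the send callback with acks=1 (the topic name, bootstrap server, and String serializers are assumptions):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OffsetCallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.ACKS_CONFIG, "1"); // with acks=0, metadata.offset() is documented to be -1
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception == null) {
                    // With acks >= 1 the callback fires after the broker has assigned a real offset
                    System.out.println("partition=" + metadata.partition() + " offset=" + metadata.offset());
                }
            });
        }
    }
}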