Is there a way in the S3 Kafka sink connector to ensure all records are consumed - apache-kafka

I have a problem with the S3 Kafka connector, and I have also seen it with the JDBC connector.
I'm trying to work out how I can ensure that my connectors actually consume all the data in a given topic.
Because of the flush sizes I expect a certain delay (10-15 minutes) in the consumption of messages, but I notice that I end up with big delays (days...) and my consumers always show some lag on the offsets.
I was reading/watching this post and video about it, for example (mainly that comment): https://rmoff.net/2020/12/08/twelve-days-of-smt-day-1-insertfield-timestamp/
https://github.com/confluentinc/demo-scene/blob/master/kafka-connect-single-message-transforms/day1.adoc
"flush.size of 16 is stupidly low, but if it’s too high you have to wait for your files to show up in S3 and I get bored waiting."
It does mention there that if flush.size is bigger than the number of available records, the records can take a while to be consumed, but I never expected this to be more than a couple of minutes.
How can I ensure that all records are consumed? I would really like to avoid setting flush.size = 1.
Maybe this is just a misunderstanding on my part about sink connectors, but I expect them to work like a normal consumer, so I expect them to consume all the data, and these flush/batch sizes to matter more for timeouts and performance.
If anyone is interested, these are my connector configurations.
For S3 sink:
topics.regex: com.custom.obj_(.*)
storage.class: io.confluent.connect.s3.storage.S3Storage
s3.region: ${#S3_REGION#}
s3.bucket.name: ${#S3_BUCKET#}
topics.dir: ${#S3_OBJ_TOPICS_DIR#}
flush.size: 200
rotate.interval.ms: 20000
auto.register.schemas: false
s3.part.size: 5242880
parquet.codec: snappy
offset.flush.interval.ms: 20000
offset.flush.timeout.ms: 5000
aws.access.key.id: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:accesskey}
aws.secret.access.key: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:secretkey}
format.class: com.custom.connect.s3.format.parquet.ParquetFormat
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
partitioner.class: io.confluent.connect.storage.partitioner.DailyPartitioner
timestamp.extractor: Record
locale: ${#S3_LOCALE#}
timezone: ${#S3_TIMEZONE#}
store.url: ${#S3_STORAGE_URL#}
connect.meta.data: false
transforms: kafkaMetaData,formatTs
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.formatTs.format: yyyy-MM-dd HH:mm:ss:SSS
transforms.formatTs.field: message_ts
transforms.formatTs.target.type: string
transforms.formatTs.type: org.apache.kafka.connect.transforms.TimestampConverter$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
For JDBC sink:
topics.regex: com.custom.obj_(.*)
table.name.format: ${#PREFIX#}${topic}
batch.size: 200
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
connection.url: ${#DB_URL#}
connection.user: ${#DB_USER#}
connection.password: ${#DB_PASSWORD#}
auto.create: true
auto.evolve: true
db.timezone: ${#DB_TIMEZONE#}
quote.sql.identifiers: never
transforms: kafkaMetaData
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
I've read these two already and I'm still not sure:
Kafka JDBC Sink Connector, insert values in batches
https://github.com/confluentinc/kafka-connect-jdbc/issues/290
Also, I've seen examples of people using the settings below (which I don't think would help my use case), but I was wondering: are these values defined per connector?
I'm also a bit confused by the fact that in the documentation I always find these settings without the consumer. prefix, while the examples always use the consumer. prefix, so I guess this means they are generic properties that apply to both consumers and producers?
consumer.max.poll.interval.ms: 300000
consumer.max.poll.records: 200
Does anyone have some good feedback?

Regarding the Kafka S3 sink connector configuration you provided above:
There are configuration fields you can tweak to control the consumption/upload-to-S3 rate and thus reduce the Kafka offset lag you are seeing. It's best practice to use variables for the fields below in your configuration.
From personal experience, these are the tweaks you can make:
Tweak flush.size
flush.size: 800
which is (as you stated):
Maximum number of records: The connector’s flush.size configuration
property specifies the maximum number of records that should be
written to a single S3 object. There is no default for this setting.
I would prefer bigger files and would use the timing tweaks below to control consumption. Make sure your records are not too big or too small, so that flush.size * RECORD_SIZE still produces reasonably sized files.
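To make that concrete (the record size here is purely hypothetical): with an average serialized record of ~25 KB, flush.size: 800 caps each S3 object at roughly 800 * 25 KB ≈ 20 MB before Parquet/Snappy compression, while your current flush.size: 200 caps it at ~5 MB.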
Tweak rotate.interval.ms
rotate.interval.ms: (I would delete this field; see the rotate.schedule.interval.ms explanation below)
which is:
Maximum span of record time: The connector’s rotate.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records.
Add field rotate.schedule.interval.ms:
rotate.schedule.interval.ms: 60000
which is:
Scheduled rotation: The connector’s rotate.schedule.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records. Unlike with rotate.interval.ms, with
scheduled rotation the timestamp for each file starts with the system
time that the first record is written to the file. As long as a record
is processed within the timespan specified by
rotate.schedule.interval.ms, the record will be written to the file.
As soon as a record is processed after the timespan for the current
file, the file is flushed, uploaded to S3, and the offset of the
records in the file are committed. A new file is created with a
timespan that starts with the current system time, and the record is
written to the file. The commit will be performed at the scheduled
time, regardless of the previous commit time or number of messages.
This configuration is useful when you have to commit your data based
on current server time, for example at the beginning of every hour.
The default value -1 means that this feature is disabled.
You are currently using the default of -1, which means this rotation is disabled. This tweak will make the biggest difference, as each task will flush and commit more frequently.
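Putting those tweaks together, the adjusted part of the config would look roughly like this (the values are the illustrative ones from this answer, not tuned to your record sizes), with rotate.interval.ms removed entirely:
flush.size: 800
rotate.schedule.interval.ms: 60000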
Regarding the second part of the question:
You can gain observability by adding metrics to your Kafka cluster and Connect workers, for example with Prometheus and Grafana. A configuration guide is linked in the sources below.
Sources:
Connect S3 sink
kafka-monitoring-via-prometheus
Connect S3 Sink config Docs

Related

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness, but what will happen if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google Pub/Sub it already provides a watermark which will be used, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag and if so how many (surely not since forever as that would get distorted by the initial start of processing which might see giant lag)? I could not find anything on the topic.
I also searched outside the context of Google DataFlow for Apache Beam documentation but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending-timestamps watermark generator will result in perfect overall watermarks. Note that no TimestampAssigner is provided in the example; the timestamps of the Kafka records themselves will be used instead.
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
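For reference, a minimal sketch of how the five-minute fixed windows and allowed lateness described above are declared with the Beam Java SDK; the class name, the generic element type, and the 30-second lateness value are illustrative, not taken from the question:

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class LatenessExample {
    // Applies five-minute fixed windows that stay open for data arriving
    // up to 30s after the watermark passes the end of the window.
    static <T> PCollection<T> window(PCollection<T> events) {
        return events.apply(
            Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
                .withAllowedLateness(Duration.standardSeconds(30))
                .accumulatingFiredPanes());
    }
}

With the default allowed lateness of zero, anything behind the watermark for its window is simply dropped; withAllowedLateness only extends that cut-off, it does not change how the source computes the watermark itself.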

Can I break anything if I set log.segment.delete.delay.ms to 0

I'm trying to put an absolute limit to the size of a topic so it won't fill up disks in case of unexpectedly high write speeds.
Mostly, this amounts to some calculation like segment = min(1GB, 0.1 * $max_space / $partitions), then setting retention.bytes = ($max_space / $partitions - $segment) and segment.bytes = $segment, and reducing log.retention.check.interval.ms. (You can even pull stunts like determining your maximum write rate (disk or network bound) and then subtracting a safety margin based on that and log.retention.check.interval.ms, and maybe some absolute offset for the 10MB head index file.) That's all been discussed in other questions, I'm not asking about any of that.
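As a purely illustrative instance of that calculation (the numbers are made up): with $max_space = 100 GB and 10 partitions, segment = min(1 GB, 0.1 * 100 GB / 10) = 1 GB, so you would set segment.bytes = 1 GB and retention.bytes = 100 GB / 10 - 1 GB = 9 GB per partition.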
I'm wondering what log.segment.delete.delay.ms is for, and if I can safely set it to 0 (or some very small value - 5?). The documentation just states "The amount of time to wait before deleting a file from the filesystem". I'd like to know why you would do that, i.e. what or who benefits from this amount of time being non-0.
This configuration can be used to ensure that a consumer of your topic actually has time to consume a message before it gets deleted.
It declares the delay that is applied when the LogCleaner is triggered by the time-based or size-based cleanup policies set through log.retention.bytes or log.retention.ms.
Especially when you have calculated the retention based on the size of the data, this configuration can help you set a minimal time before a message gets deleted, so that the consumer can actually consume the message during that time.

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my Kafka Streams application from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed. If it has, the messages are skipped. In 2.0.1 these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expired windows?
Update
While exploring further, I can indeed see that it is related to the grace period in ms. In my custom TimestampExtractor (which uses the timestamp from the payload instead of the record timestamp), I can see that for the expired-window warnings the incoming timestamps are indeed more than 24 hours later than the event time from the payload.
I assume this is caused by consumer lags of over 24 hours.
The timestamp extractor's extract method has a partitionTime parameter which, according to the docs, is:
partitionTime - the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the create time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
In 2.0.1 these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not be processed, as the corresponding state has already been purged. If you did change the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expired windows?
Not sure what you mean by this or how you actually do this. The old and new logic are similar with regard to how long a window is stored (retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
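For illustration only (this is not the asker's topology), a sketch of how retention and grace period are raised together with the 2.5 API; the class name, store name, and the 7-day value are arbitrary:

import java.time.Duration;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

public class GraceExample {
    // Counts events per 30s window, keeping windows open long enough for very late data.
    static void buildAggregation(KStream<String, Long> stream) {
        Duration windowSize = Duration.ofSeconds(30); // matches the 30s windows in the warning above
        Duration retention = Duration.ofDays(7);      // arbitrary; should cover your worst-case consumer lag

        stream.groupByKey()
              .windowedBy(TimeWindows.of(windowSize)
                  // windows accept late records until stream time passes window end + grace
                  .grace(retention.minus(windowSize)))
              .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("counts-store")
                  // store retention must be at least window size + grace period
                  .withRetention(retention));
    }
}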
About "partition time": it is computed base on whatever TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payload.

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing guarantees like exactly-once processing and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, the properties that need to be configured for exactly-once and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the settings that need to be configured, at a minimum, to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches on how to avoid duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
1. Use a single writer per partition, and every time you get a network error check the last message in that partition to see if your last write succeeded.
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.
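As a loose illustration of that last alternative (storing the offset with the data and deduplicating on the topic/partition/offset combination), here is a sketch in Java assuming a PostgreSQL-style table with a unique key on (topic, kafka_partition, kafka_offset); the class, table, and column names and the jdbcUrl are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class OffsetDedupWriter {
    // Writes a polled batch; replayed records hit the unique key and are ignored,
    // so the topic/partition/offset stored with each row acts as the deduplication key.
    public void writeBatch(String jdbcUrl, ConsumerRecords<String, String> records) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO events (topic, kafka_partition, kafka_offset, payload) "
                    + "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING")) {
                for (ConsumerRecord<String, String> record : records) {
                    ps.setString(1, record.topic());
                    ps.setInt(2, record.partition());
                    ps.setLong(3, record.offset());
                    ps.setString(4, record.value());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit(); // data and the offsets embedded in it land in one transaction
        }
    }
}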

Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored?

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~20 MB or 100k records, whichever comes first; but if the message rate is low, process at least every minute (avoid small batches, but have MySinkTask.put() called at least once per minute).
This is what I set for consumer settings in an attempt to accomplish it:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I need this fetch.min.bytes setting, or else MySinkTask.put() is called multiple times per second despite the other settings...?
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times and several minutes pass by, until fetch.min.bytes is reached, and then I get them all at once.
I fail to understand so far:
Why is fetch.max.wait.ms=60000 not pushed down from the consumer to the put() call of my connector? Shouldn't it take precedence over fetch.min.bytes?
What setting controls the ~2x-per-second calls to MySinkTask.put() when fetch.min.bytes=1 (the default)? I don't understand why it does that; even the verbose output of the Connect runtime settings doesn't show any interval below multiples of seconds.
I've double-checked the log output, and the lines INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: printed by the Connect runtime show the expected values as I pass them with the consumer.-prefixed settings.
The "process at least every interval" part seems not possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to dynamically adjust the ConsumerConfig while the Task is running. :-(
The work-around for now is batching in the Task manually: set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary (sketched below). This is not ideal, as it adds some overhead to the connector which I had hoped to avoid.
How Connect arrives at its ~2x-per-second batching from the consumer's poll() to SinkTask.put() remains a mystery to me, but it's better than being called for every message.
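A minimal sketch of that manual-batching work-around (this is not the asker's actual connector; the thresholds, class name, and writeBatch() are hypothetical):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {
    private static final int MAX_BUFFERED = 100_000; // ~ the desired batch size
    private static final long MAX_AGE_MS = 60_000L;  // force a write at least once per minute

    private final List<SinkRecord> buffer = new ArrayList<>();
    private long firstRecordAtMs = -1L;

    @Override
    public void put(Collection<SinkRecord> records) {
        if (!records.isEmpty() && buffer.isEmpty()) {
            firstRecordAtMs = System.currentTimeMillis();
        }
        buffer.addAll(records);
        boolean bigEnough = buffer.size() >= MAX_BUFFERED;
        boolean oldEnough = firstRecordAtMs > 0
                && System.currentTimeMillis() - firstRecordAtMs >= MAX_AGE_MS;
        if (bigEnough || oldEnough) {
            drain();
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Connect commits offsets after flush(), so write anything still buffered
        // to avoid committing offsets for records that were never written.
        drain();
    }

    private void drain() {
        if (!buffer.isEmpty()) {
            writeBatch(buffer);   // hypothetical: push the batch to the target system
            buffer.clear();
            firstRecordAtMs = -1L;
        }
    }

    private void writeBatch(List<SinkRecord> batch) { /* target-specific write */ }

    @Override public void start(Map<String, String> props) { }
    @Override public void stop() { }
    @Override public String version() { return "0.0.1"; }
}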