I am using Kafka JDBC Connect in timestamp+incrementing mode to sync a table's rows to Kafka. Reference: https://docs.confluent.io/current/connect/connect-jdbc/docs/source_config_options.html#mode
The challenge is that the table gets synced from the beginning of time, since the start time by default is 1970. Is there any way to override the start time, i.e. sync only from a given date onwards?
You need to set timestamp.initial to the date you want to start from. It needs to be given as an epoch timestamp.
The SQL query in timestamp+incrementing mode is appended with:
WHERE "DBTime" < 'system high date i.e. 9999-12-12T23:59:59+00:00' AND (("DBTime" = 'timestamp.initial' AND "DBKey" > '-1') OR "DBTime" > 'timestamp.initial') ORDER BY "DBTime","DBKey" ASC
https://docs.confluent.io/kafka-connect-jdbc/current/source-connector/source_config_options.html#mode
If you want your connector to start from a given offset, I'd suggest overwriting the information stored in the connect-offsets topic.
Through the Kafka REST Proxy you can easily read the content of this topic:
http://localhost:8082/topics/connect-offsets
Looking through the code of kafka-connect-jdbc, in particular the methods relevant to the use case you've described:
io.confluent.connect.jdbc.source.TimestampIncrementingCriteria#extractValues
io.confluent.connect.jdbc.source.TimestampIncrementingOffset#getTimestampOffset
it seems that overriding the content of the connect-offsets topic is the only way available at the moment.
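For orientation, an offset entry for a JDBC source connector in timestamp+incrementing mode looks roughly like this (the connector and table names are made up, and the exact key/value layout can differ between connector versions, so read your own topic first to confirm it):
key: ["my-jdbc-source",{"table":"my_table"}]
value: {"timestamp":1609459200000,"timestamp_nanos":0,"incrementing":42}
Producing a record with the same key and an adjusted value to connect-offsets (with the connector stopped, then restarted) is one way to override the stored position.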
I run this query
select * from USER_EVENTS emit changes limit 1;
USER_EVENTS is a stream.
Before this I set auto.offset.reset to earliest.
This query runs slowly. I don't know why.
Then I ran SHOW QUERIES to check the consumer id of the above query and looked up its consumer group in Kafka.
I found out that the query needs to fetch all the messages in the topic, although I only need one row.
Is that true, and why does it need to fetch them all? I think fetching one is enough, because I added LIMIT 1 to the query.
The topic behind USER_EVENTS has ~1M messages.
I use ksqlDB Server 6.1.0 and the same version of the ksqlDB CLI.
This is what ksqlDB is supposed to do: consume the entire stream and materialize a table from it. Your query even says
emit changes
which means it will go through your messages one by one and update the table in near real time. LIMIT 1 only means that it will show a single message (and keep updating it) instead of showing a growing table, but it consumes the stream either way.
The alternative would be
emit final
which would only show the final result, but would still go through the entire stream.
At least to my knowledge, this is not possible with ksqldb.
If you just need to look at one message interactively, I recommend using a CLI tool like kcat or https://github.com/birdayz/kaf, which both have a config option to consume only a single message.
If you need it programmatically, I would probably write a consumer by hand and simply call poll() once instead of running the standard poll loop.
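A minimal sketch of that approach (the bootstrap server, group id, deserializers and the exact topic name are assumptions):
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SingleMessagePeek {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "single-message-peek");       // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");                 // at most one record per poll
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("USER_EVENTS"));                         // assumed name of the backing topic
            // single poll instead of the usual while(true) loop
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}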
If you want "hacky" quickfix, you could also try to set
SET 'auto.offset.reset'='earliest';
for your query in ksqldb. This will still go through the entire stream, but start with the newest available message. So it would ignore everything that is in the topic.
I have a problem with the S3 Kafka sink connector, but I have also seen it with the JDBC sink connector.
I'm trying to see how I can ensure that my connectors are actually consuming all the data in a certain topic.
I expect that, because of the flush sizes, there could be a certain delay (10/15 minutes) in the consumption of the messages, but I notice that I end up with big delays (days...) and my consumers always show some lag on their offsets.
I was reading/watching the post/video about this, for example (mainly that comment): https://rmoff.net/2020/12/08/twelve-days-of-smt-day-1-insertfield-timestamp/
https://github.com/confluentinc/demo-scene/blob/master/kafka-connect-single-message-transforms/day1.adoc
"flush.size of 16 is stupidly low, but if it’s too high you have to wait for your files to show up in S3 and I get bored waiting."
It does mention there that if flush.size is bigger than the number of available records, it can take a while for the records to be consumed, but I never expected this to be more than a couple of minutes.
How can I ensure that all records are consumed? I would really like to avoid having flush.size = 1.
Maybe this is just a misunderstanding on my part about sink connectors, but I do expect them to work like a normal consumer, so I expect them to consume all the data, with these flush/batch sizes being driven more by timeouts and performance considerations.
If anyone is interested, these are my connector configurations.
For S3 sink:
topics.regex: com.custom.obj_(.*)
storage.class: io.confluent.connect.s3.storage.S3Storage
s3.region: ${#S3_REGION#}
s3.bucket.name: ${#S3_BUCKET#}
topics.dir: ${#S3_OBJ_TOPICS_DIR#}
flush.size: 200
rotate.interval.ms: 20000
auto.register.schemas: false
s3.part.size: 5242880
parquet.codec: snappy
offset.flush.interval.ms: 20000
offset.flush.timeout.ms: 5000
aws.access.key.id: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:accesskey}
aws.secret.access.key: ${file:/opt/kafka/external-configuration/aws-credentials/aws-credentials.properties:secretkey}
format.class: com.custom.connect.s3.format.parquet.ParquetFormat
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
partitioner.class: io.confluent.connect.storage.partitioner.DailyPartitioner
timestamp.extractor: Record
locale: ${#S3_LOCALE#}
timezone: ${#S3_TIMEZONE#}
store.url: ${#S3_STORAGE_URL#}
connect.meta.data: false
transforms: kafkaMetaData,formatTs
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.formatTs.format: yyyy-MM-dd HH:mm:ss:SSS
transforms.formatTs.field: message_ts
transforms.formatTs.target.type: string
transforms.formatTs.type: org.apache.kafka.connect.transforms.TimestampConverter$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
For JDBC sink:
topics.regex: com.custom.obj_(.*)
table.name.format: ${#PREFIX#}${topic}
batch.size: 200
key.converter: org.apache.kafka.connect.storage.StringConverter
value.converter: com.custom.insight.connect.protobuf.ProtobufConverter
connection.url: ${#DB_URL#}
connection.user: ${#DB_USER#}
connection.password: ${#DB_PASSWORD#}
auto.create: true
auto.evolve: true
db.timezone: ${#DB_TIMEZONE#}
quote.sql.identifiers: never
transforms: kafkaMetaData
transforms.kafkaMetaData.offset.field: kafka_offset
transforms.kafkaMetaData.partition.field: kafka_partition
transforms.kafkaMetaData.timestamp.field: kafka_timestamp
transforms.kafkaMetaData.type: org.apache.kafka.connect.transforms.InsertField$Value
errors.tolerance: all
errors.deadletterqueue.topic.name: ${#DLQ_STORAGE_TOPIC#}
errors.deadletterqueue.context.headers.enable: true
I've read these two already and I'm still not sure:
Kafka JDBC Sink Connector, insert values in batches
https://github.com/confluentinc/kafka-connect-jdbc/issues/290
Also, for example, I've seen people using the settings below (which I don't think would help my use case), but I was wondering: are these values defined per connector?
I'm even a bit confused by the fact that in the documentation I always find the configuration without the consumer. prefix, while in examples I always find it with the prefix, so I guess this means it is a generic property that applies to both consumers and producers?
consumer.max.poll.interval.ms: 300000
consumer.max.poll.records: 200
Does anyone have some good feedback?
Regarding the Kafka S3 sink connector configuration provided in the question:
There are configuration fields you can tweak to control the rate of consumption and upload to S3, and thus reduce the Kafka offset lag you are seeing. It's best practice to use variables for these fields in your configuration.
From personal experience, the tweaks you can make are:
Tweak flush.size
flush.size: 800
which is (as you stated):
Maximum number of records: The connector’s flush.size configuration
property specifies the maximum number of records that should be
written to a single S3 object. There is no default for this setting.
I would prefer bigger files and use the timing tweaks below to control consumption. Make sure your records are neither too big nor too small, so that flush.size * RECORD_SIZE results in reasonably sized files.
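For example, assuming an average record size of roughly 1 KB, flush.size: 800 would produce S3 objects of roughly 800 KB each (before any Parquet/Snappy compression).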
Tweak rotate.interval.ms
rotate.interval.ms: (I would delete this field; see the rotate.schedule.interval.ms explanation below)
which is:
Maximum span of record time: The connector’s rotate.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records.
Add field rotate.schedule.interval.ms:
rotate.schedule.interval.ms: 60000
which is:
Scheduled rotation: The connector’s rotate.schedule.interval.ms
specifies the maximum timespan in milliseconds a file can remain open
and ready for additional records. Unlike with rotate.interval.ms, with
scheduled rotation the timestamp for each file starts with the system
time that the first record is written to the file. As long as a record
is processed within the timespan specified by
rotate.schedule.interval.ms, the record will be written to the file.
As soon as a record is processed after the timespan for the current
file, the file is flushed, uploaded to S3, and the offset of the
records in the file are committed. A new file is created with a
timespan that starts with the current system time, and the record is
written to the file. The commit will be performed at the scheduled
time, regardless of the previous commit time or number of messages.
This configuration is useful when you have to commit your data based
on current server time, for example at the beginning of every hour.
The default value -1 means that this feature is disabled.
You are currently using the default of -1, which means this scheduled rotation is disabled. This tweak will make the most difference, as each task will flush and upload to S3 more frequently.
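Putting the tweaks together, the relevant part of the sink configuration would look roughly like this (the values are examples and should be tuned to your record size and the latency you can tolerate), with rotate.interval.ms removed:
flush.size: 800
rotate.schedule.interval.ms: 60000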
Regarding the second part of the question:
You can gain observability by adding metrics to your Kafka and Connect deployments using, for example, Prometheus and Grafana. A configuration guide is linked in the sources below.
Sources:
Connect S3 sink
kafka-monitoring-via-prometheus
Connect S3 Sink config Docs
I've recently upgraded my Kafka Streams application from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed. If it has closed, the messages are skipped. Compared to 2.0.1, these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Update
While exploring further, I can indeed see that it is related to the grace period (in ms). In my custom TimestampExtractor (which uses a timestamp from the payload instead of the record timestamp), I can see that for the records producing the expired-window warnings, the event time extracted from the payload is indeed more than 24 hours behind the stream time.
I assume this is caused by consumer lag of over 24 hours.
The TimestampExtractor's extract method has a partitionTime parameter which, according to the docs, is:
partitionTime the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the create time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
Compared to 2.0.1, these messages were still processed.
That is a little bit surprising (even though I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not be processed, as the corresponding state has already been purged. If you changed the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
I'm not sure what you mean by this or how you actually do it. The old and new logic are similar with regard to how long a window is stored (the retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
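As a rough illustration (the window size and grace values are placeholders, and the topic name is taken from the warning above), increasing the grace period and the store retention together would look something like this:
import java.time.Duration;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

public class GraceAndRetentionExample {
    public static void buildTopology(StreamsBuilder builder) {
        Duration windowSize = Duration.ofSeconds(30);   // placeholder window size
        Duration grace = Duration.ofHours(48);          // placeholder: accept records up to 48h late
        Duration retention = windowSize.plus(grace);    // store retention must cover window size + grace

        KStream<String, String> input = builder.stream("MY_TOPIC");  // topic name from the warning above
        input.groupByKey()
             .windowedBy(TimeWindows.of(windowSize).grace(grace))
             .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("counts-store")
                     .withRetention(retention));
    }
}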
About "partition time": it is computed base on whatever TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payload.
I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a diamond-like topology that starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the Windowed key introduced in between by the windowing).
The timestamps for my messages are event-time; that is, they get picked from the message body by my custom-configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to appear when the number of messages becomes significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records, with a key that is shared among them but different from the key in step 1), get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special to see in those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I exchange the datasets in steps 1) and 3), I have exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() calls and aggregate() calls. The issue persists in both cases.
What I also notice is that, with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE as well as without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, the result is the same.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but it does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So, it seems at the moment that something like "old" windows get dropped and new records for those old windows get ignored by the stream.
There is a window retention setting that can be configured, but with the default value that I use I was expecting windows to retain their state and stay active for at least 12 hours (which significantly exceeds the duration of my unit test run).
I tried to amend the left reducer with the following window store config:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
Duration.ofHours(1), false)
)
still no difference in results.
The same issue persists even with only the single "left" flow, without the "right" flow and without the join(). It seems that the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams version is 2.4.0. I'm also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. When I load the first dataset, the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes Kafka Streams to drop those messages internally. This can be seen if you set the Kafka Streams logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
Stores.inMemoryWindowStore(
"rollup-left-reduce",
Duration.ofDays(5 * 365),
windowSize, false)
)
That's it, the output is as expected.
I am using a Windowed Join between two streams, let's say a 7 day window.
On the initial load, all records in the DB (via a Kafka Connect source connector) are loaded into the streams. It then seems that ALL records end up in the window state store for those first 7 days, since the producer/ingestion timestamps are all in current time, as opposed to a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the windows of the join?
Well, the question is: which records do you want to join to each other? And what timestamp does the source connector set as the record timestamp (this might also depend on the topic configuration, [log.]message.timestamp.type)?
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestamp extractor is the way to go.
If you want processing-time semantics, you may want to use the WallclockTimestampExtractor instead.
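For reference, a minimal sketch of such a custom extractor, assuming the record value is deserialized into a Map that carries an epoch-millis create_time field (both assumptions; adapt it to your actual value type):
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof Map) {
            Object createTime = ((Map<?, ?>) value).get("create_time");   // assumed field name
            if (createTime instanceof Number) {
                return ((Number) createTime).longValue();
            }
        }
        // fall back to the record timestamp (producer/broker time) if no usable field is found
        return record.timestamp();
    }
}
It can then be registered globally via the default.timestamp.extractor streams config, or per source via Consumed.with(new CreateTimeExtractor()).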