Kafka S3 connector "skips" records if flush.size is set > 1

I set up a Kafka S3 sink connector to consume messages from a Kafka topic and dump them into MinIO in Parquet format. Finally, I query from Dremio to verify the integrity of the data pipeline.
The Kafka topic has 12 partitions, and each partition contains a varying number of records.
What I've found is:
If I set flush.size=1, I get all records, one per Parquet file, in MinIO, and the Dremio query returns the correct number of records.
If I set flush.size > 1, I don't get the exact total number of records in MinIO or in the Dremio query; I always get fewer.
The larger flush.size is, the more records go missing, and if flush.size is set large enough, whole partitions go missing as well.
I understand that it's probably not actually skipping records: the connector is waiting for more new records to fill up the buffer before it flushes to S3. But this won't work for end-of-day data; would I have to wait 24 hours to get yesterday's data dumped to MinIO?
I am looking for a parameter that triggers a timeout and forces a flush to S3.
I tried rotate.interval.ms, but it only checks the timestamp span between the first and the latest record. It will not trigger a timeout and force a flush if no new record is written to Kafka.
Is there any parameter that triggers a timeout and forces a flush to S3?
It seems that all the rotate-interval parameters expect a new record to trigger evaluation of the flush condition, whether span-based or scheduled. That doesn't serve the purpose described above: we want to time out and force a flush without depending on a new record being processed.

rotate.schedule.interval.ms works.
I had made a typo in the sink properties: I put rotate.scheduled.interval.ms.
Once I corrected the typo, the connector asked me to specify a timezone, and then everything worked as expected. I got all 10071 records across all 12 partitions.
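
For reference, here is a minimal sketch of the relevant sink properties, assuming the Confluent S3 sink connector writing Parquet to MinIO; the topic, bucket, endpoint, and interval values are placeholders:

    connector.class=io.confluent.connect.s3.S3SinkConnector
    # Placeholder topic, bucket, and MinIO endpoint:
    topics=my-topic
    s3.bucket.name=my-bucket
    store.url=http://minio:9000
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
    # Example value; any flush.size > 1 shows the buffering behavior above.
    flush.size=1000
    # Wall-clock rotation: commits open files on a schedule instead of waiting
    # for flush.size records to accumulate.
    rotate.schedule.interval.ms=60000
    # Required once rotate.schedule.interval.ms is set.
    timezone=UTC

With scheduled rotation, a partially filled file is still committed when the interval elapses, which is what resolves the end-of-day scenario described above.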

Related

How can I set the micro-batch size in Spark Structured Streaming from a Kafka topic?

I have a Spark Structured Streaming app that reads from Kafka and writes to Elasticsearch and S3. I have enabled checkpointing to an S3 bucket as well (the app runs on AWS EMR). I saw in the S3 bucket that over time the commits become less frequent and there is an ever-growing delay in the data.
So I want Spark to always process batches with the same amount of data in each batch. I tried to set .option("maxOffsetsPerTrigger", 100), but the batch size didn't become smaller; there was still a huge amount of time between commits.
As I understood it, we just tell Spark how much data to consume from Kafka per poll, and Spark just polls multiple times and then writes, so there is no limitation on the batch size.
I also tried to use continuous mode, but the submit failed, I guess because the output sink / foreachBatch doesn't support it.
Any ideas are welcome, I will try everything ^^
It turned out each offset contained so much data that I had to limit the max offsets per trigger to 50, and I had to delete the old checkpoint folder. I read somewhere that Spark tries to finish the first batch with the offsets stored in the checkpoint, and only then honors maxOffsetsPerTrigger.
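
A minimal sketch of where that option goes, using Spark's Java API; the broker, topic, and checkpoint path are placeholders, and as noted above the cap only takes effect once the old checkpoint is gone:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class KafkaBatchCap {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("kafka-batch-cap").getOrCreate();

            Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092") // placeholder
                .option("subscribe", "events")                    // placeholder topic
                .option("maxOffsetsPerTrigger", 50)               // cap records per micro-batch
                .load();

            stream.writeStream()
                .format("console")
                .option("checkpointLocation", "s3://bucket/checkpoints/app") // fresh location
                .start()
                .awaitTermination();
        }
    }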

JDBC Source Connector in Incrementing Mode Skips Records (but then doesn't...)

I'm facing an issue with a JDBC source connector in (strictly) incrementing mode. In the dataset the connector is pulling from, there are consecutive ids starting from id=1.
On the initial deployment of the connector, it ingests only about a third to half of the expected records in the dataset. Note that I am sure that the "initial deployment of the connector" starts from scratch, with no previously committed offsets. The ids of the records that get skipped are seemingly random. I'm determining the "expected records" by running the source SQL directly against the database containing the dataset. I've encountered the skipping in datasets where ids go up to 10k, in datasets where ids go over 130k, and a few in between. What's weirder is that when I force re-ingestion of the same data/dataset, by writing to the connect offsets topic and re-deploying the connector, I do not encounter the skipping; all expected records make it to the topic.
How can that be?
Other notes:
- There is no ordering or CTE in my SQL.
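
For context, a minimal sketch of the kind of connector configuration in play, with hypothetical connection details and table/column names:

    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    # Placeholder connection details and table name:
    connection.url=jdbc:postgresql://db:5432/sourcedb
    connection.user=user
    connection.password=secret
    table.whitelist=my_table
    topic.prefix=jdbc-
    # Strictly incrementing mode: the connector remembers the highest id seen
    # and only queries for rows with a larger id on the next poll.
    mode=incrementing
    incrementing.column.name=id

The comment on mode=incrementing also hints at one commonly cited cause of this symptom: if rows become visible out of id order (e.g., long-running transactions committing late), the connector's high-water mark can advance past ids it never saw, whereas a later re-ingestion sees all rows already committed.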

How should the offset be configured for a new consumer group on an existing topic when source connectors can't be paused?

We have an existing topic where the data gets published by a JDBC source connector using timestamp+incrementing mode (https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/index.html#incremental-query-modes).
We have existing consumer groups that consume data from some existing topics. Now we are introducing a new consumer group (call it group K) which should consume data from the same existing topics and write it to a database. As a first step, we have an initial data-migration workflow that takes a dump of the source database and copies it to the destination database before the group starts consuming messages from the existing topics.
Now, when the consumer group starts, what offset should it start with?
One option is to use latest. But the problem is that the existing source connectors will keep publishing data to the existing topics while the initial data migration is being done for this new consumer group. In our case we have tens of tables to migrate, and there could be a gap where a table dump has already been taken but changes are still being made to the source database, so data keeps getting added to the topics. So there is a chance that we may miss some records.
We don't have the option to pause the source connectors, which would otherwise solve the problem for us.
If we use earliest, we will end up processing all the old data from the Kafka topics, which is not required since we have done an initial data migration.
We want to maintain only one source connector regardless of the number of consumer groups.
I was going through Kafka consumer APIs, in particular offsetsForTimes, which takes a timestamp and lets you seek to the matching offsets. I can note down the time before the initial data migration and call it once the consumer group has started and partitions are assigned. But I couldn't find any docs saying whether the timestamp is GMT-based or something else. Is it OK to use this API by passing the time as the number of milliseconds elapsed since the epoch?
If I understand this sentence correctly: "If we use offset latest we might lose some data as source connectors might have written some data to the topic during initial data migration", then the topic will end up containing initial-load data and CDC data mixed together, so there is no offset that clearly separates them. Therefore, you will not get far by setting any particular offset.
I see the following options:
- Have your consumer group K filter out the initial-load data and read from earliest
- Produce the initial-load data to a dedicated topic
- If possible, perform the initial load outside of business hours so that no CDC data is flowing (maybe over a weekend or on bank holidays)
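
On the offsetsForTimes idea from the question itself: the timestamp argument is interpreted as milliseconds since the Unix epoch (i.e., UTC-based), so passing a System.currentTimeMillis() value captured before the migration is valid. A minimal sketch, with placeholder broker, group, topic, and timestamp values:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class SeekToMigrationTime {
        public static void main(String[] args) {
            // Placeholder: epoch milliseconds noted down before the initial dump was taken.
            long migrationStartMs = 1_700_000_000_000L;

            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-k");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("existing-topic"), new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // A real app would only do this on first start, e.g. guarded by a flag.
                        Map<TopicPartition, Long> query = new HashMap<>();
                        for (TopicPartition tp : partitions) {
                            query.put(tp, migrationStartMs); // UTC epoch milliseconds
                        }
                        Map<TopicPartition, OffsetAndTimestamp> found = consumer.offsetsForTimes(query);
                        found.forEach((tp, oat) -> {
                            if (oat != null) { // null if no record at/after the timestamp
                                consumer.seek(tp, oat.offset());
                            }
                        });
                    }
                });
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> { /* process */ });
                }
            }
        }
    }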

Kafka data reads and offset management with sink

What happens when the consumer reads data from Kafka but fails to write it to the sink? Let's say I read the data from Kafka, apply some transformations, and finally store the result in a database. If everything works, the final result is stored in the database. But let's say that for some reason my database isn't available. What happens to the data I read from Kafka? When I restart my application, can I read the same data again, since I failed to store it in the sink? Or will Kafka mark this data as read and not allow me to read it again?
Can you also tell me what this property is used for: enable.auto.commit=true?
There's a part of the metadata in Kafka called consumer offsets. Each message in a partition has a unique offset, an integer value that increases monotonically.
So, in the scenario you've described:
If you commit the offset BEFORE writing to the database, then you will not be able to read those messages again.
But if you commit the offset AFTER writing to the database, then on a failure you will be able to re-read those messages.
enable.auto.commit=true, as the name suggests, automatically commits consumer offsets at a time interval defined by the auto.commit.interval.ms parameter, which by default is 5000 ms (5 seconds). So, as you can probably imagine, if these defaults are used, the offsets will be committed every 5 seconds regardless of whether the messages have landed in the destination or not.
So you would basically need to control this in your code and set enable.auto.commit to false if you'd like to ensure guaranteed delivery.
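
A minimal sketch of that commit-after-write pattern; the broker, group, topic, and writeToDatabase are placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommitAfterSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "db-writer");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // manual commits only
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        writeToDatabase(record.value()); // if this throws, nothing is committed
                    }
                    consumer.commitSync(); // commit only AFTER the sink write succeeded
                }
            }
        }

        // Placeholder for the actual database write.
        static void writeToDatabase(String value) { /* ... */ }
    }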
Hope this helps!

How Kafka Topic Offsets Work

I have a question about how topic offsets work in Kafka: are they stored in a B-tree-like structure?
Here is the specific reason I ask. Let's say I have a topic with 10 million records, which means 10 million offsets if no compaction occurred or it is turned off. Now if I use consumer.seek(5000000), does it work like a linked list, i.e., start at offset 0 and hop from there to the 5,000,000th offset? Or is there an index-like structure that tells it exactly where the 5,000,000th record is in the log?
Thanks for the answers!
Kafka records are stored sequentially in the logs. The exact format is well described in the documentation.
Kafka usually expects reads to be sequential, as consumers fetch records in order. However, when random access is required (via seek, or to restart from a specific position), Kafka uses index files to quickly find a record based on its offset.
A Kafka log is made of several segments. Each segment has an index and a timeindex file associated with it, which map offsets and timestamps to file positions. The frequency at which entries are added to the indexes can be configured using index.interval.bytes. Using these files, Kafka is able to seek immediately to a nearby position and avoid re-reading all messages.
You may have noticed, after an unclean shutdown, that Kafka spends a few minutes rebuilding indexes. These are the indexes used for file-position lookups being rebuilt.
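
To make it concrete: the random access from the question looks like this with the consumer API, and the broker resolves the jump through the index files described above rather than scanning from offset 0 (broker and topic names are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RandomAccessSeek {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "seek-demo");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("big-topic", 0); // placeholder topic
                consumer.assign(List.of(tp));
                consumer.seek(tp, 5_000_000L); // located via the segment index, not a scan
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf("offset=%d%n", r.offset()));
            }
        }
    }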