Debezium: read from a specified binlog position

We need Debezium to read from a specific position in the binlog. Is this possible, and if so, how can the position be specified?
Debezium's snapshot modes are: initial, when_needed, never.
We want to keep the snapshot mode set to never and start reading from a specified binlog position.

You can hack it with source offset manipulation; see https://debezium.io/documentation/faq/#how_to_change_the_offsets_of_the_source_database
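The FAQ's approach, roughly: stop the connector, publish a record with modified binlog coordinates to the Connect offsets topic (for example with kafkacat), then restart the connector. A sketch of such a record for the MySQL connector follows; the connector name, server name, and binlog file/position are all placeholders for your own setup:

```json
["inventory-connector",{"server":"dbserver1"}]
{"file":"mysql-bin.000003","pos":154}
```

The first line is the record key (connector name plus logical server name), the second is the value with the desired binlog file and position. The record must be produced to the topic configured in offset.storage.topic, and to the same partition as the existing offset record.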

Related

Debezium causes Postgres to run out of disk space on RDS

I have a small Postgres development database running on Amazon RDS, and I'm running K8s. As far as I can tell, there is barely any traffic.
I want to enable change capture, I've enabled rds.logical_replication, started a Debezium instance, and the topics appear in Kafka, and all seems fine.
After a few hours, the free disk space starts tanking:
It started to consume disk at a constant rate and ate up all of the available 20 GB within 24 hours. Stopping Debezium doesn't do anything. The way I got my disk space back was by running:
select pg_drop_replication_slot('services_debezium')
and:
vacuum full
Then, after a few minutes, as you can see in the graph, disk space is reclaimed.
Any tips? I would love to see what is actually filling up the space, but I don't think I can. Nothing seems to happen on the Debezium side (no ominous logs), and the Postgres logs don't show anything special either. Or is there some external event that triggers the start of this?
You need to periodically generate some movement in your database (perform an update on any record for example).
Debezium provides a feature called heartbeat to perform this type of operation.
Heartbeat can be configured in the connector as follows:
"heartbeat.interval.ms" : "300000",
"heartbeat.action.query": "update my_table SET date_column = now();"
You can find more information in the official documentation:
https://debezium.io/documentation/reference/connectors/postgresql.html#postgresql-wal-disk-space
The replication slot is the problem. It marks a position in the WAL, and PostgreSQL won't delete any WAL segments newer than that. Those files are in the pg_wal subdirectory of the data directory.
Dropping the replication slot and running CHECKPOINT will delete the files and free space.
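To confirm that a slot is what's retaining WAL, you can check how much WAL each slot is holding back (PostgreSQL 10+; slot names will differ in your setup):

```sql
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with a large retained_wal value is the one pinning your disk space.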
The cause of the problem must be a misconfiguration of Debezium: it is not consuming changes and moving the replication slot ahead. Fix that problem and you are good.
OK, I think I figured it out. There is another 'hidden' database on Amazon RDS that has changes, but changes that I didn't make and can't see, so Debezium can't pick them up either. If I change my monitored database, Debezium will see that change and in the process flush the buffer and reclaim the space. So the very lack of changes was the reason the disk filled up. I don't know if there is a pretty solution for this, but at least I can work with it.

How can Kafka reads be constant irrespective of the data size?

As per the documentation of Kafka
the data structure used in Kafka to store the messages is a simple log where all writes are actually just appends to the log.
What I don't understand is that many claim Kafka's performance is constant irrespective of the data size it handles.
How can random reads be constant time in a linear data structure?
If I have a single-partition topic with 1 billion messages in it, how can the time taken to retrieve the first message be the same as the time taken to retrieve the last message, if reads are always sequential?
In Kafka, the log for each partition is not a single file. It is actually split in segments of fixed size.
For each segment, Kafka knows the start and end offsets. So for a random read, it's easy to find the correct segment.
Then each segment has a couple of indexes (time- and offset-based). These are the files named *.index and *.timeindex. They enable jumping directly to a location near (or at) the desired read position.
So you can see that the total number of segments (and hence the total size of the log) does not really impact the read logic.
Note also that the size of segments, the size of indexes and the index interval are all configurable settings (even at the topic level).
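The two-step lookup described above can be sketched in Python (the segment layout and index contents here are hypothetical; real Kafka stores these in files on disk, not in dicts):

```python
from bisect import bisect_right

# Each segment is named by its base offset; within a segment, a sparse
# offset index maps some offsets to byte positions in the log file.
segments = [0, 400, 800]  # base offsets of the segment files
sparse_index = {          # per-segment sparse index: (offset, byte position)
    0:   [(0, 0), (150, 4096), (300, 8192)],
    400: [(400, 0), (550, 4096), (700, 8192)],
    800: [(800, 0), (950, 4096)],
}

def locate(target_offset):
    """Return (segment base offset, byte position to start scanning from)."""
    # Step 1: binary search over segment base offsets -> O(log #segments)
    seg = segments[bisect_right(segments, target_offset) - 1]
    # Step 2: binary search the segment's sparse index -> O(log index entries)
    entries = sparse_index[seg]
    offsets = [o for o, _ in entries]
    _, pos = entries[bisect_right(offsets, target_offset) - 1]
    return seg, pos  # a short sequential scan from `pos` finds the record

print(locate(975))  # (800, 4096): last segment, near its end
print(locate(10))   # (0, 0): first segment, near its start
```

The cost depends only on the number of segments and index entries (both searched logarithmically), not on the total number of messages, which is why reading the billionth message is about as cheap as reading the first.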

What happens if you delete Kafka snapshot files?

The question is simple. What will happen if you delete the Kafka snapshot files in the Kafka log dir. Will Kafka be able to start? Will it have to do a slow rebuild of something?
Bonus question: what exactly do the snapshot files contain?
Background for this question
I have a cluster that has been down for a couple of days due to simultaneous downtime on all brokers and a resulting corrupted broker. Now when it starts it is silent for hours (no new messages in the log file). By inspecting the JVM I have found that all of the (very limited) CPU usage is spent in the loadProducersFromLog method. The comments above it in the source suggest that this is an attempt to recover producer state from the snapshots. I do not care about this; I just want my broker back, so I am wondering if I can simply delete the snapshots to get Kafka started again.
If the snapshot files are deleted, then during startup, in log.loadSegmentFiles(), all messages in the partition will have to be read to recreate the snapshot, even if the log and index files are present. This will increase the time to load the partition.
For the contents of a snapshot file, please refer to writeSnapshot() in ProducerStateManager:
https://github.com/apache/kafka/blob/980b725bb09ee42469534bf50d01118ce650880a/core/src/main/scala/kafka/log/ProducerStateManager.scala
The log.dir parameter defines where topics (i.e., data) are stored (it supplements the log.dirs property).
A snapshot basically gives you a copy of your data at one point in time.
In a situation like yours, instead of waiting for a response, you could:
change the log.dirs path, restart everything and see how it goes;
back up the snapshots by saving them to a different location, delete them all from the original one, and see how it goes.
After that you're meant to be able to start up Kafka.
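The second option above can be sketched as follows (a self-contained demo that uses temporary directories as stand-ins; in practice log_dir would be the broker's log.dirs path):

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories; point log_dir at the real broker log dir in practice.
log_dir = Path(tempfile.mkdtemp())
backup_dir = Path(tempfile.mkdtemp())

# Fake partition directory with a producer-state snapshot file for the demo.
(log_dir / "mytopic-0").mkdir()
(log_dir / "mytopic-0" / "00000000000000000000.snapshot").write_bytes(b"")

for snap in list(log_dir.rglob("*.snapshot")):
    target = backup_dir / snap.relative_to(log_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(snap, target)   # copy first...
    snap.unlink()                # ...then delete the original

print(sorted(p.name for p in backup_dir.rglob("*.snapshot")))
# ['00000000000000000000.snapshot']
```

Keeping the partition directory layout in the backup makes it trivial to copy the files back if the broker misbehaves without them.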

Avoiding small files from Kafka Connect using the HDFS sink connector in distributed mode

We have a topic with messages arriving at a rate of 1 msg/sec across 3 partitions, and I am using the HDFS connector to write the data to HDFS in Avro format (the default). It generates files sized in KBs, so I tried altering the following properties in the HDFS connector config:
"flush.size":"5000",
"rotate.interval.ms":"7200000"
but the output is still small files. So I need clarity on the following points to solve this issue:
1. Is the flush.size property mandatory? If we do not specify flush.size, how does the data get flushed?
2. If we set the flush size to 5000 and the rotate interval to 2 hours, it flushes the data every 2 hours for the first 3 intervals, but after that it flushes at seemingly random times. Here are the file creation times (19:14, 21:14, 23:15, 01:15, 06:59, 08:59, 12:40, 14:40); note the mismatched intervals. Is this because the properties override one another? That leads to the third question.
3. What is the precedence for flushing if we specify all of the following properties: flush.size, rotate.interval.ms, rotate.schedule.interval.ms?
4. Increasing the message rate and reducing the number of partitions does increase the size of the flushed files. Is that the only way to control small files? How can we handle these properties if the input event rate varies and is not stable?
It would be a great help if you could share documentation on handling small files in Kafka Connect with the HDFS connector. Thank you.
If you are using a TimeBasedPartitioner, and the messages will not consistently have increasing timestamps, then you will end up with a single writer task dumping files whenever it sees a message with an earlier timestamp within the rotate.interval.ms window of reading any given record.
If you want consistent bi-hourly partition windows, then you should set rotate.interval.ms=-1 to disable it, and set rotate.schedule.interval.ms to some reasonable number that is within the partition duration window.
E.g. you have 7200 messages every 2 hours; it's not clear how large each message is, but let's say 1 MB. Then you'd be holding ~7 GB of data in a buffer, and you'd need to adjust your Connect heap sizes to hold that much data.
The order of precedence is:
scheduled rotation, starting from the top of the hour
flush size or "message-based" time rotation, whichever occurs first, or a record that is seen as "before" the start of the current batch
And I believe flush.size is mandatory for the storage connectors.
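Putting that together, the relevant part of the connector config might look like this (the values are illustrative, not recommendations; note that scheduled rotation in the HDFS connector also requires the timezone property to be set):

```json
{
  "flush.size": "10000",
  "rotate.interval.ms": "-1",
  "rotate.schedule.interval.ms": "3600000",
  "partition.duration.ms": "7200000",
  "timezone": "UTC"
}
```

Here flush.size is set high enough that the hourly scheduled rotation, rather than the message count, is what normally closes files.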
Overall, systems such as Uber's Hudi, or Camus Sweeper, the earlier Kafka-to-HDFS tool, are better equipped to handle small files. Connect sink tasks only care about consuming from Kafka and writing to downstream systems; the framework itself doesn't recognize that Hadoop prefers larger files.

Spark Streaming direct approach without a checkpoint location

When we use the Spark Streaming direct approach without specifying a checkpoint location, where will the offsets be stored, and how?
Is there really any difference between using a checkpoint location and not specifying one?
Is there going to be any data loss if I do not specify a checkpoint location?
If you don't checkpoint, you won't be able to recover if your driver crashes. In addition, Kafka offsets won't be checkpointed, since there is no checkpoint; you'll need to store them manually yourself.
Is there really any difference between using a checkpoint location and not specifying one?
That question doesn't quite make sense as asked: if you don't provide a checkpoint directory, there will be no checkpoint; if you do, there will be. To reach exactly-once semantics (if required), you'll need to store the offsets manually.
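With the Scala spark-streaming-kafka-0-10 direct stream, the simplest form of manual offset storage is committing back to Kafka itself after each batch succeeds (a sketch, assuming `stream` is a direct stream you've already created; for true exactly-once you would instead write the offsets to your own transactional store together with the results):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // capture the offset ranges for this batch before any shuffle
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // commit the offsets back to Kafka only once processing has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

This gives at-least-once delivery on restart, since a crash between processing and the commit replays the last batch.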