If one ClickHouse replica is down, how long is the operation log kept in ZooKeeper? - apache-zookeeper

As I understand it, with a ClickHouse ReplicatedMergeTree an insert operation writes a log entry to the "/log" node in ZooKeeper; the other replicas pull the log, execute the tasks, and sync the data.
My question is: when one replica is unavailable (the machine is down or the ClickHouse instance is down), that replica cannot pull the log and sync data. If the other replicas keep inserting data and pushing log entries to ZooKeeper, how long will the operation log be kept in ZK? Is there a validity period? ZooKeeper presumably will not keep these log entries forever, so is there an exact retention time?
And if the insert log in ZK has already been removed when the previously unavailable replica comes back, how does that replica sync its data with the other replicas?
I appreciate any answer or discussion, thank you.

SELECT *
FROM system.merge_tree_settings
WHERE name LIKE '%replicated_logs%'
FORMAT Vertical
Query id: 534466cf-1624-4ca0-b559-bc8c381ff547
Row 1:
──────
name: max_replicated_logs_to_keep
value: 1000
changed: 0
description: How many records may be in log, if there is inactive replica. Inactive replica becomes lost when when this number exceed.
type: UInt64
Row 2:
──────
name: min_replicated_logs_to_keep
value: 10
changed: 0
description: Keep about this number of last records in ZooKeeper log, even if they are obsolete. It doesn't affect work of tables: used only to diagnose ZooKeeper log before cleaning.
type: UInt64
max_replicated_logs_to_keep is now 1000.
Over time this default has changed; it has been 10000, 100, and 1000 :).
If the replication log gets "rotated" (a replica's delay exceeds 1000 entries), it's not a problem at all: the stale replica starts a special bootstrap procedure that does not use the log at all. Instead, it syncs its metadata and its list of parts with the other replicas; this procedure simply takes a bit longer than catching up via the log.
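To see how far a replica has fallen behind the shared log, and whether it has already been marked as lost, you can query system.replicas. This is only a rough sketch: it assumes a reasonably recent ClickHouse version in which system.replicas exposes log_pointer, log_max_index, absolute_delay and is_lost, and the database name is just a placeholder.

SELECT
    database,
    table,
    replica_name,
    log_pointer,      -- this replica's position in the shared "/log"
    log_max_index,    -- newest entry currently in the shared log
    absolute_delay,   -- replica lag in seconds
    is_lost           -- 1 once the replica has fallen off the log
FROM system.replicas
WHERE database = 'default'  -- placeholder database name
FORMAT Vertical

A lost replica clones its part set from a healthy replica on its own once it reconnects, so this query is mainly useful for monitoring how far behind a replica is and how often that bootstrap happens.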

Related

PostgreSQL Streaming Replication Reject Insert

I have PostgreSQL 14 and I set up streaming replication (remote_apply) across 3 nodes.
When the two standby nodes are down and I try to run an insert command, this shows up:
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
INSERT 0 1
I don't want to insert it locally. I want to reject the transaction and show an error instead.
Is it possible to do that?
No, there is no way to do that with synchronous replication.
I don't think you have thought through the implications of what you want. If it doesn't commit locally first, then what should happen if the master crashes after sending the transaction to the replica, but before getting back word that it was applied there? If it was committed on the replica but rejected on the master, how would they ever get back into sync?
I made a script that checks the number of standby nodes and then makes the primary node read-only if the standby nodes are down.
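The script itself isn't shown; a minimal sketch of the kind of check it might perform, assuming the standbys register as synchronous walsenders, could look like this (the read-only toggle is illustrative, not the poster's actual approach):

-- Count currently connected synchronous standbys.
SELECT count(*) AS sync_standbys
FROM pg_stat_replication
WHERE sync_state IN ('sync', 'quorum');

-- If the count is zero, flip the primary to read-only until the standbys return.
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();

Note that default_transaction_read_only only affects new transactions and can be overridden per session, so this is a soft guard rather than a hard reject.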

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every second, pointed the checkpoints at a MinIO instance, and verified that checkpoints are written for the job id. I deploy this job on OpenShift (Kubernetes underneath) and can scale it up/down as and when required.
Problem
When the job is redeployed (rolling) or goes down due to a bug/error, and there were unconsumed messages in ActiveMQ or messages unacknowledged by Flink (but already written to the database), then when the job recovers (or a new job is deployed) it processes already-processed messages, resulting in duplicate records inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or a cluster failure?
As the job id keeps changing on every deployment, how does the recovery happen?
Edit: As I cannot expect idempotency from the database, can I write a database-specific (upsert) query that updates if the given record is present and inserts if not, in order to avoid duplicates being saved into the database (exactly-once)?
JDBC currently only supports at-least-once, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with Flink 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint. That is the only reliable way to change the topology. Additionally, cancel with savepoint avoids any duplicates in the data, as it gracefully shuts down the job.
What happens if the job quits with an error or a cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does the recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
As I cannot expect idempotency from the database, is upsert the only way to achieve exactly-once processing?
Currently, I do not see a way around it. It should change with Flink 1.11.
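For the upsert route mentioned in the edit, a sketch of what that could look like in MySQL follows. The table and column names are made up for illustration; the key point is that the identifier the job keys on must be a PRIMARY KEY or UNIQUE index, so that replayed messages overwrite rather than duplicate rows.

-- Hypothetical sink table, keyed on the same identifier the Flink job uses.
CREATE TABLE IF NOT EXISTS events (
    event_id   VARCHAR(64) PRIMARY KEY,  -- the message identifier
    payload    TEXT,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- Replaying the same message rewrites the row instead of inserting a duplicate.
-- The ? placeholders stand in for JDBC prepared-statement parameters.
INSERT INTO events (event_id, payload)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE payload = VALUES(payload);

This gives effectively-once results in the database as long as each message maps deterministically to the same key.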

How to sync up a failed MongoDB node with a working replica set member?

I have MongoDB replication with 3 servers (Server1, Server2, Server3). For some reason Server1 goes down, Server2 takes over as primary, and Server3 acts as secondary.
Question: Server1 was down, and after 2-3 hours we brought it back up. How will the 3-hour gap between Server1's data and Server2's data be synced up?
The primary maintains an oplog detailing all of the writes that have been done to the data. The oplog is capped by size; the oldest entries are automatically removed to keep it below the configured size.
When a secondary node replicates from the primary, it reads the oplog and creates a local copy. If a secondary is offline for a period of time, when it comes back online, it will ask the primary for all oplog entries since the last one that it successfully copied.
If the primary still has the entry that the secondary most recently saw, the secondary will begin applying the events it missed.
If the primary no longer has that entry, the secondary will log a message that it is too stale to catch up, and manual intervention will be required. This would usually mean a manual resync.

Kafka broker taking too long to come up

Recently, one of our Kafka brokers (out of 5) was shut down uncleanly. Now that we are starting it up again, there are a lot of warning messages about corrupted index files, and the broker is still starting up even after 24 hours. There is over 400 GB of data on this broker.
Although the rest of the brokers are up and running, some of the partitions are showing -1 as their leader and the bad broker as the only ISR. I am not seeing other replicas being appointed as new leaders, maybe because the bad broker is the only one in sync for those partitions.
Broker Properties:
Replication Factor: 3
Min In Sync Replicas: 1
I am not sure how to handle this. Should I wait for the broker to fix everything itself? Is it normal for it to take so much time?
Is there anything else I can do? Please help.
After an unclean shutdown, a broker can take a while to restart as it has to do log recovery.
By default, Kafka only uses a single thread per log directory to perform this recovery, so if you have thousands of partitions it can take hours to complete.
To speed that up, it's recommended to bump num.recovery.threads.per.data.dir. You can set it to the number of CPU cores.
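For example, in server.properties (the value below is just a placeholder; match it to the broker's core count):

# Threads per data directory used for log recovery at startup and flushing at shutdown.
# The default is 1, which is why recovering thousands of partitions can take hours.
num.recovery.threads.per.data.dir=8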

Understanding Schema ID allocation in Confluent Schema Registry

I am trying to understand how globally unique IDs are generated for schemas in Schema Registry, but I fail to understand the following text present on this page.
Schema ID allocation always happen in the master node and they ensure that the Schema IDs are monotonically increasing.
If you are using Kafka master election, the Schema ID is always based off the last ID that was written to Kafka store. During a master re-election, batch allocation happens only after the new master has caught up with all the records in the store.
If you are using ZooKeeper master election, {schema.registry.zk.namespace}/schema_id_counter path stores the upper bound on the current ID batch, and new batch allocation is triggered by both master election and exhaustion of the current batch. This batch allocation helps guard against potential zombie-master scenarios, (for example, if the previous master had a GC pause that lasted longer than the ZooKeeper timeout, triggering master reelection).
Question:
When using ZooKeeper for master election, why does the current batch ID need to be stored in ZooKeeper, unlike with Kafka master election?
Can someone explain in detail how batch allocation works when using ZooKeeper election? Specifically, I don't understand the following:
new batch allocation is triggered by both master election and exhaustion of the current batch. This batch allocation helps guard against potential zombie-master scenarios, (for example, if the previous master had a GC pause that lasted longer than the ZooKeeper timeout, triggering master reelection).