Understanding Schema ID allocation in Confluent Schema Registry - apache-kafka

I am trying to understand how globally unique IDs are generated for schemas in Schema Registry, but I fail to understand the following text present on this page.
Schema ID allocation always happens in the master node, and it ensures
that the Schema IDs are monotonically increasing.
If you are using Kafka master election, the Schema ID is always based
off the last ID that was written to the Kafka store. During a master
re-election, batch allocation happens only after the new master has
caught up with all the records in the store.
If you are using ZooKeeper master election,
{schema.registry.zk.namespace}/schema_id_counter path stores the
upper bound on the current ID batch, and new batch allocation is
triggered by both master election and exhaustion of the current batch.
This batch allocation helps guard against potential zombie-master
scenarios (for example, if the previous master had a GC pause that
lasted longer than the ZooKeeper timeout, triggering master
re-election).
Question:
When using ZooKeeper for master election, why does the upper bound of the current ID batch need to be stored in ZooKeeper, unlike with Kafka master election?
Can someone explain in detail how batch allocation works when ZooKeeper election is used? Specifically, I don't understand the following:
new batch allocation is triggered by both master election and
exhaustion of the current batch. This batch allocation helps guard
against potential zombie-master scenarios (for example, if the
previous master had a GC pause that lasted longer than the ZooKeeper
timeout, triggering master re-election).
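For what it's worth, my rough mental model of the ZooKeeper-based batch allocation is something like the sketch below. This is not actual Schema Registry code; the ZkCounter interface, the class name, and the batch size are placeholders I made up.

```java
// Illustrative sketch only -- NOT the actual Schema Registry implementation.
// Idea: the master reserves a *batch* of IDs by atomically bumping an upper
// bound stored in ZooKeeper, then hands out IDs locally until the batch runs out.
public class BatchIdAllocatorSketch {

    private static final int BATCH_SIZE = 20; // hypothetical batch size

    private long nextId;          // next ID to hand out
    private long batchUpperBound; // exclusive upper bound of the reserved batch

    // Hypothetical helper standing in for the counter stored at
    // {schema.registry.zk.namespace}/schema_id_counter.
    private final ZkCounter zkCounter;

    public BatchIdAllocatorSketch(ZkCounter zkCounter) {
        this.zkCounter = zkCounter;
        allocateNewBatch(); // a new batch is allocated on master election...
    }

    public synchronized long nextSchemaId() {
        if (nextId >= batchUpperBound) {
            allocateNewBatch(); // ...and on exhaustion of the current batch
        }
        return nextId++;
    }

    private void allocateNewBatch() {
        // Because the upper bound lives in ZooKeeper, a newly elected master
        // always starts above any range a "zombie" old master (e.g. one stuck
        // in a long GC pause) could still be handing out, so IDs cannot collide.
        batchUpperBound = zkCounter.addAndGet(BATCH_SIZE);
        nextId = batchUpperBound - BATCH_SIZE;
    }

    // Minimal interface standing in for a ZooKeeper-backed atomic counter.
    public interface ZkCounter {
        long addAndGet(long delta);
    }
}
```

Is that roughly what the documentation is describing, and if so, why is the counter in ZooKeeper needed at all when Kafka master election just reads the last ID from the Kafka store?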

Related

Kafka scalability if consuming from replica node

In a cluster scenario with a replication factor > 1, why is it that we must always consume from the master/leader of a partition instead of being able to consume from a replica/follower node that holds a copy of that partition?
I understand that Kafka will always route the request to the master node (of that particular partition/topic), but doesn't this affect scalability (since all requests go to a single node)? Wouldn't it be better if we could read from any node containing the replica information and not necessarily the master?
Partition leader replicas, from which you can write/read data, are evenly distributed among available brokers. Anyway, you may also want to leverage the "fetch from closest replica" functionality, which is described in KIP-392, and available since Kafka 2.4.0.
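For completeness, a minimal consumer-side sketch of KIP-392, assuming Kafka 2.4.0+ clients and brokers that have broker.rack set and replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector configured; the broker address, topic, group id, and rack below are placeholders. Note that writes still always go to the leader; only fetches can be served from a follower.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FollowerFetchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // KIP-392: advertise the client's "rack" so the broker can route fetches
        // to the closest in-sync replica instead of always using the leader.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "rack-1"); // placeholder rack id

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}
```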

Kafka Connect MongoDB Source Connector failure scenario

I need to use Kafka Connect to monitor changes to a MongoDB cluster with one primary and 2 replicas.
I see there is the official MongoDB connector, and I want to understand what the connector's behaviour would be if the primary replica fails. Will it automatically read from whichever secondary replica becomes the new primary? I couldn't find information about this in the official docs.
I've seen this post about the tasks.max configuration, which I thought might be relevant to this scenario, but the answer implies that it always defaults to 1.
I've also looked at Debezium's implementation of the connector, which seems to support this scenario automatically:
The MongoDB connector is also quite tolerant of changes in membership
and leadership of the replica sets, of additions or removals of shards
within a sharded cluster, and network problems that might cause
communication failures. The connector always uses the replica set’s
primary node to stream changes, so when the replica set undergoes an
election and a different node becomes primary, the connector will
immediately stop streaming changes, connect to the new primary, and
start streaming changes using the new primary node.
Also, Debezium's version of the tasks.max configuration property states that:
The maximum number of tasks that should be created for this connector.
The MongoDB connector will attempt to use a separate task for each
replica set, [...] so that the work for each replica set can be
distributed by Kafka Connect.
The question is: can I get the same default behaviour with the official connector as is advertised for the Debezium one? Because of external reasons, I can't use the Debezium one for now.
In a PSS (primary-secondary-secondary) deployment:
If one node is not available, the other two nodes can elect a primary
If two nodes are not available, there can be no primary
The quote you referenced suggests the connector may be using the primary read preference, which means that as long as two nodes are up it will keep working, and if only one node is up it will not retrieve any data.
Therefore, bring down two of the three nodes and observe whether you are able to query.
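If you want to see what the primary read preference implies outside of Kafka Connect, here is a small sketch with the MongoDB Java driver; the hosts, replica-set name, database, and collection are placeholders. With readPreference=primary the read fails once no primary can be elected, whereas primaryPreferred would fall back to a surviving secondary.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ReadPreferenceCheck {
    public static void main(String[] args) {
        // Placeholder hosts and replica-set name -- adjust to your PSS deployment.
        // With readPreference=primary, server selection fails when no primary is
        // available (e.g. two of the three nodes are down); primaryPreferred would
        // instead fall back to a surviving secondary.
        String uri = "mongodb://host1:27017,host2:27017,host3:27017/"
                + "?replicaSet=rs0&readPreference=primary&serverSelectionTimeoutMS=5000";

        try (MongoClient client = MongoClients.create(uri)) {
            Document doc = client.getDatabase("test")        // placeholder database
                                 .getCollection("example")   // placeholder collection
                                 .find()
                                 .first();
            System.out.println("Read succeeded: " + doc);
        } catch (Exception e) {
            // With only one node left, server selection times out here.
            System.out.println("Read failed: " + e.getMessage());
        }
    }
}
```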

Standby tasks not writing updates to .checkpoint files

I have a Kafka Streams application that is configured to have 1 standby replica created for each task, and I have two instances of the application running. When the application starts, it writes .checkpoint files for each of the partitions it is responsible for, covering partitions owned by both active and standby tasks.
When sending a new Kafka event to be processed by the application, the instance containing the active task for the partition updates the offsets in its .checkpoint file. However, the .checkpoint file for the standby task on the second instance is never updated; it remains at the old offset.
I believe this is causing the OffsetOutOfRangeExceptions we see when we rebalance, which result in tasks being torn down and created from scratch.
Am I right in thinking that offsets should be written for partitions in both standby and active tasks?
Is this an indication that my standby tasks are not consuming, or could it be that they are simply not able to write the offsets?
Any ideas what could be causing this behaviour?
Streams version: 2.3.1
This issue has been fixed in Kafka 2.4.0, which resolves the following bug: issues.apache.org/jira/browse/KAFKA-8755
Note: The issue appears to only affect applications that are configured with OPTIMIZE="all"
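For reference, a minimal sketch of the configuration combination described above, assuming Streams 2.3.x (where optimization is set via StreamsConfig.TOPOLOGY_OPTIMIZATION); the application id and bootstrap servers are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StandbyConfigSketch {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // One standby replica per task, as described in the question.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        // KAFKA-8755 reportedly only surfaces when topology optimization is
        // enabled, i.e. topology.optimization = "all".
        props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);

        return props;
    }
}
```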

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every second, I point the checkpoints to a MinIO instance, and I have verified that the checkpoints are working for the job id. I deploy this job in OpenShift (Kubernetes underneath) and I can scale this job up/down as and when required.
Problem
When the job is redeployed (rolling) or goes down due to a bug/error, and there were unconsumed messages in ActiveMQ or messages unacknowledged by Flink (but already written to the database), then when the job recovers (or a new job is deployed) it processes the already-processed messages again, resulting in duplicate records being inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or a cluster failure?
As the job id keeps changing on every deployment, how does the recovery happen?
Edit: As I cannot expect idempotency from the database, can I write a database-specific (upsert) query that updates a record if it is present and inserts it if not, so that duplicates are not saved into the database (exactly-once)?
JDBC currently only supports at-least-once, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint. That is the only reliable way to change the topology. Additionally, cancel with savepoint avoids any duplicates in the data, as it gracefully shuts down the job.
What happens if the job quits with an error or a cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does the recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
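As an illustration, here is a minimal sketch of wiring that up in the job itself, assuming the Flink 1.10-era API, the S3 filesystem plugin on the classpath, and a placeholder MinIO/S3 path.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static StreamExecutionEnvironment createEnv() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every second, as described in the question.
        env.enableCheckpointing(1000);

        // Placeholder bucket/path -- point this at your MinIO/S3 endpoint so that a
        // newly deployed job (with a new job id) can still find old checkpoints.
        env.setStateBackend(new FsStateBackend("s3://flink-checkpoints/my-job"));

        // Keep checkpoints around when the job is cancelled, so they can be used
        // for manual recovery (flink run -s <checkpoint-path>) after a redeployment.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        return env;
    }
}
```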
As I cannot expect idempotency from the database, is upsert the only way to achieve Exactly-Once processing?
Currently, I do not see a way around it. It should change with 1.11.
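If you do go the upsert route in the meantime, a minimal MySQL-flavoured sketch keyed on your identifier would look like the following; the table and column names are placeholders. Replayed messages then overwrite the existing row instead of creating a duplicate.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class UpsertWriterSketch {
    // Assumes `records.id` is the PRIMARY KEY (or has a UNIQUE index), so a
    // replayed message updates the existing row instead of inserting a duplicate.
    private static final String UPSERT_SQL =
            "INSERT INTO records (id, payload) VALUES (?, ?) "
          + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";

    public static void write(Connection conn, String id, String payload) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, id);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}
```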

if Schema-Registry is down, does that mean Kafka will have downtime?

So there is a Kafka cluster, and we have a Schema Registry on top of it to validate the schemas for topics. If the Schema Registry is down for some maintenance reason, will Kafka have downtime for that duration and not accept any new incoming data requests?
Kafka consumers and producers cache the schemas they retrieve from the schema registry internally. The Schema Registry is only contacted when a record is sent/received for which no schema was previously seen.
So as long as you don't start any new consumers/producers or send records with schemas that have not been previously sent you should be fine.
Take this with a grain of salt though: I've looked through the code and run a quick test with the console consumer and producer, and I could still produce and consume after killing the Schema Registry, but there may be cases where it still fails.
Update:
It occurred to me today that I probably answered your question too literally, instead of trying to understand what you are trying to do :)
If you want to enable maintenance windows on your schema registry, it might be worthwhile looking into running two or more schema registries in parallel and configuring both of them in your producers and consumers.
One of them will be elected master and write requests for schemas will be forwarded to that instance. That way you can perform rolling restarts if you need maintenance windows.
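As a sketch, assuming the standard Confluent Avro serializer, listing both registry instances in schema.registry.url is enough for the clients to fall back to the other instance when one is unreachable; all hosts below are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerWithTwoRegistries {
    public static Properties producerConfig() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // Comma-separated list: the serializer tries the next registry instance
        // if the first one is unreachable, e.g. during a maintenance window.
        props.put("schema.registry.url",
                  "http://schema-registry-1:8081,http://schema-registry-2:8081"); // placeholders

        return props;
    }
}
```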
The KafkaAvroSerializer and deserializer maintain a schema ID cache.
So as long as no new producers or consumers come online, you would see no errors.
If the registry is down, Kafka itself will not have downtime, but you will start to see network exception errors in the clients whenever they do need to talk to the registry, since they use HTTP to communicate with it, whereas Kafka uses its own TCP protocol.