Kafka - broker partitions not in-sync after restart - apache-kafka

We use 3 node kafka clusters running 2.7.0 with quite high number of topics and partitions. Almost all the topics have only 1 partition and replication factor of 3 so that gives us roughly:
topics: 7325
partitions total in cluster (including replica): 22110
Brokers are relatively small with
6vcpu
16gb memory
500GB in /var/lib/kafka occupied by partitions data
As you can imagine because we have 3 brokers and replication factor 3 the data is very evenly spread across brokers. Each broker leads very similar (same) amount of partitions and the number of partitions per broker is equal. Under normal circumstances.
Before doing rolling restart yesterday everything was in-sync. We stopped the process and started it again after 1 minute. It took some 10minutes to get synchronized with Zookeeper and start listening on port.
After saing 'Kafka server started'. Nothing is happening. There is no CPU, memory or disk activity. The partition data is visible on data disk. There are no messages in log for more than 1 day now since process booted up.
We've tried restarting zookeeper cluster (one by one). We've tried restart of broker again. Now it's been 24 hours since last restart and still not change.
Broker itself is reporting it leads 0 partitions. Leadership for all the partitions moved to other brokers and they are reporting that everything located in this broker is not in sync.
I'm aware the number of partitions per broker is far exceeding the recommendation but I'm still confused by lack of any activity or log messages. Any ideas what should be checked further? It looks like something is stuck somewhere. I checked the kafka ACLs and there are no block messages related to broker username.
I tried another restart with DEBUG mode and it seems there is some problem with metadata. These two messages are constantly repeating:
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
With kcat it's also impossible to fetch metadata about topics (meaning if I specify this broker as bootstrap server).

Related

Kafka Brokers don't balance partitions after a broker goes down

We have a 3-broker 3-zookeeper cluster and we've taken down a broker. We have total of 180 partitions, each of the topics have 2 replicas. When a node is taken down, there are 75 under replicated partitions and it stays that way and it doesn't look like anything happens. When I start up the broker I took down, the partitions are quickly picked up by it and it works ok.
The machines are quite big (30gb ram, fast disks) and all the data is 10gb on each broker so I have no idea why it wouldn't move the partitions quickly from a node that is down to a node that is still up, seems like it's not aware that the node was taken down.
Any tips? How can I monitor the recovery process after a node is taken down?
Kafka version - 2.6.0
This is by design, no data is moved off the broker until you manually move partitions off of it first using kafka-reassign-partitions
Similarly, you'd need to do this if you're trying to fully remove a node from the cluster, which is effectively the same behavior of having it crash and never come back

How does the replicas in kafka works if one broker went done for 2 hours and came back?

we are having 3 zookeeper and 3 kafka broker nodes as cluster setup running in different systems in AWS,And we changes the below properties to ensure the high availabilty and prevent data loss.
server.properties
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
i am having the following question
Assume BROKER A,B,C
since we are enabling replication factor as 3 all the data will be available in all A,B,C brokers if A broker is down it wont affect the flow.
but when ever broker A went down but at the same time we are continously receiving data from connector and it is storing the broker B and C
so after 2 hours broker A came up
In that time the data came between the down time and up time of A is available in broker A or not?
is there any specific configuration we need to mention for that?
how does the replication between the broker happen when one broker came online from offline?
i didn't know whether it is a valid question, but please share your thoughts on this to understand this replication factor working
While A is recovering, it'll be out of the ISR list. If you've disabled unclean leader election, then A cannot become the leader broker of any partitions it holds (no client can write or read to it) and will replicate data from other replicas until its up to date, then join the ISR

Kafka broker taking too long to come up

Recently, one of our Kafka broker (out of 5) got shut down incorrectly. Now that we are starting it up again, there are a lot of warning messages about corrupted index files and the broker is still starting up even after 24 hours. There is over 400 GB of data in this broker.
Although the rest of the brokers are up and running but some of the partitions are showing -1 as their leader and the bad broker as the only ISR. I am not seeing other Replicas to be appointed as new leaders, maybe because the bad broker is the only one in sync for those partitions.
Broker Properties:
Replication Factor: 3
Min In Sync Replicas: 1
I am not sure how to handle this. Should I wait for the broker to fix everything itself? is it normal to take so much time?
Is there anything else I can do? Please help.
After an unclean shutdown, a broker can take a while to restart as it has to do log recovery.
By default, Kafka only uses a single thread per log directory to perform this recovery, so if you have thousands of partitions it can take hours to complete.
To speed that up, it's recommended to bump num.recovery.threads.per.data.dir. You can set it to the number of CPU cores.

Apache kafka production cluster setup problems

We have been trying to set up a production level Kafka cluster in AWS Linux machines and till now we have been unsuccessful.
Kafka version:
2.1.0
Machines:
5 r5.xlarge machines for 5 Kafka brokers.
3 t2.medium zookeeper nodes
1 t2.medium node for schema-registry and related tools. (a Single instance of each)
1 m5.xlarge machine for Debezium.
Default broker configuration :
num.partitions=15
min.insync.replicas=1
group.max.session.timeout.ms=2000000
log.cleanup.policy=compact
default.replication.factor=3
zookeeper.session.timeout.ms=30000
Our problem is mainly related to huge data.
We are trying to transfer our existing tables in kafka topics using debezium. Many of these tables are quite huge with over 50000000 rows.
Till now, we have tried many things but our cluster fails every time with one or more reasons.
ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/__consumer_offsets/partitions/0/state
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)..
Error 2:
] INFO [Partition xxx.public.driver_operation-14 broker=3] Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,551] INFO [Partition xxx.public.hub-14 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.hub-14 broker=3] Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.field_data_12_2018-7 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
Error 3:
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=888665879, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
Some more errors :
Frequent disconnections among broker which probably is the reason
behind nonstop shrinking and expanding ISRs with no auto recovery.
Schema registry gets timed out. I don't know how is schema registry even affected. I don't see too much load on that server. Am I missing something? Should I use a Load balancer for multiple instances of schema Registry as failover?. The topic __schemas has just 28 messages in it.
The exact error message is RestClientException: Register operation timed out. Error code: 50002
Sometimes the message transfer rate is over 100000 messages per second, sometimes it drops to 2000 messages/second? message size could cause this?
In order to solve some of the above problems, we increased the number of brokers and increased zookeeper.session.timeout.ms=30000 but I am not sure if it actually solved the our problem and if it did, how?.
I have a few questions:
Is our cluster good enough to handle this much data.
Is there anything obvious that we are missing?
How can I load test my setup before moving to the production level?
What could cause the session timeouts between brokers and the schema registry.
Best way to handle the schema registry problem.
Network Load on one of our Brokers.
Feel free to ask for any more information.
Please Use The latest official version of the Confluent for you cluster.
Actually you can make it better by increasing the number of partitions of your topics and also increasing the tasks.max(of course in your sink connectors) more than 1 in your connector to work more concurrently and faster.
Please increase the number of Kafka-Connect topics and use Kafka-Connect distributed mode to increase the High Availability of your Kafka-connect cluster. you can make it by setting the number of replication factor in the Kafka-Connectand Schema-Registry config for example:
config.storage.replication.factor=2
status.storage.replication.factor=2
offset.storage.replication.factor=2
Please set the topic compression to snappy for your large tables. it will increase the throughput of the topics and this helps the Debezium connector to work faster and also do not use JSON Converter it's recommended to use Avro Convertor!
Also please use load-balancer for your Schema Registry
For testing the cluster you can create a connector with only one table (I mean a large table!) with the database.whitelist and set snapshot.mode to initial
And About the schema-registry! Schema-registry user both Kafka and Zookeeper with setting these configs:
bootstrap.servers
kafkastore.connection.url
And this is the reason of your downtime of the shema-registry cluster

Kafka broker goes out of sync for random partitions

I have a setup of 4 Kafka brokers. Each partition in each topic in my setup has a replication factor of 2. All partitions are balanced - Leaders and followers are uniformly distributed
This setup has been running for over 6 months
While monitoring the setup via Kafka Manager I see that 8% of my partitions are under-replicated.
All these partitions were assigned to the same set of replicas. And every partition which was assigned to this set of replicas is displayed as under-replicated
Lets call this set of brokers as [1,2] - broker 1 and 2. The ISR for all these partitions is [1] right now.
Both brokers 1 and 2 are up and running. All other partitions have the ISR count as expected.
The script bin/kafka-topics.sh also shows 8% of partitions to be under replicated.
But the jolokia metric - UnderReplicatedPartitions - is 0
I need help to answer -
Is there an issue?
Why is there an inconsistency in the jolokia metric and kafka console?
How can I fix the issue ?
I can't say anything about "jolokia metric" but we experienced the same because we had a "slow" broker which was lagging behind with replicating the data.
"Slow" meaning that the replication requests somes breached the broker-wide configuration replica.lag.time.max.ms which defaults to 10 seconds and is described as:
"If a follower hasn't sent any fetch requests or hasn't consumed up to the leaders log end offset for at least this time, the leader will remove the follower from isr"
Slightly increasing this configuration solved the problem for us.