Kafka broker with inconsistent data - NotLeaderForPartitionError

We have a 13-node Kafka cluster; each broker has multiple disks and all topics have a replication factor of 3.
Broker 6 had a hardware issue and required a complete OS reload (Linux) and 2 disk replacements. I reinstalled Kafka on this node with the same broker id 6, but now I get an exception from all producers:
[Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'amyTopic', partition=7, error=6, offset=-1)
I am assuming that since I am using the same broker ID, something (ZooKeeper? the controller broker?) expects data on the disks that were replaced, or some other metadata that was wiped out during the OS reload.
What options do I have to add this node back to the cluster without much disturbance and without data loss? Should I use a new broker ID for this node and then reassign the partitions of every topic, as we do after adding a new node? We have a lot of data (a few hundred TB) in the cluster, and I am trying to avoid the huge data movement a reassignment would cause, since it could choke the entire cluster. Please suggest.
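In case it helps, while deciding what to do I am checking which partitions are currently under-replicated with something like the following (the bootstrap address is a placeholder; depending on the Kafka version the tool may need --zookeeper instead of --bootstrap-server):
kafka-topics.sh --bootstrap-server broker1:9092 --describe --under-replicated-partitions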

Related

Kafka - broker partitions not in-sync after restart

We use 3-node Kafka clusters running 2.7.0 with a quite high number of topics and partitions. Almost all topics have only 1 partition and a replication factor of 3, which gives us roughly:
topics: 7325
partitions total in cluster (including replica): 22110
Brokers are relatively small, with:
6 vCPU
16 GB memory
500 GB in /var/lib/kafka occupied by partition data
As you can imagine, with 3 brokers and replication factor 3 the data is spread very evenly across brokers. Under normal circumstances each broker leads roughly the same number of partitions, and the total number of partitions per broker is equal.
Before doing a rolling restart yesterday, everything was in sync. We stopped the process and started it again after 1 minute. It took some 10 minutes for the broker to synchronize with ZooKeeper and start listening on its port.
After logging 'Kafka server started', nothing happens. There is no CPU, memory or disk activity. The partition data is visible on the data disk. There have been no messages in the log for more than 1 day now since the process booted up.
We've tried restarting the ZooKeeper cluster (one node at a time). We've tried restarting the broker again. It's now been 24 hours since the last restart and still no change.
The broker itself reports that it leads 0 partitions. Leadership for all its partitions has moved to other brokers, and they report that all the replicas located on this broker are out of sync.
I'm aware the number of partitions per broker far exceeds the recommendation, but I'm still confused by the lack of any activity or log messages. Any ideas what should be checked further? It looks like something is stuck somewhere. I checked the Kafka ACLs and there are no deny entries related to the broker's username.
I tried another restart with DEBUG logging and it seems there is some problem with metadata. These two messages repeat constantly:
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
With kcat it's also impossible to fetch topic metadata if I specify this broker as the bootstrap server.
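For reference, the metadata query that fails looks like this (the broker address is a placeholder):
kcat -b broker-1:9092 -L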

How to simulate KEY_BUSY hot key error code 14 in Aerospike

How can I test/produce an Aerospike exception code 14?
I have a simple one-node Aerospike environment and a Java application on K8s.
There are 3 pods of the application, all are consuming messages from Kafka topic with 3 partitions, all in the same consumer group.
With the Kafka producer driver, we inject 200 messages at once, with no Kafka message key (so that Kafka round-robins across the 3 topic partitions).
All messages relate to the same Aerospike key, so the 3 application pods are supposed to update the same record in parallel, resulting in an Aerospike hot-key exception (KEY_BUSY, error code 14).
But that's not happening and all 200 messages are processed successfully.
The configuration parameter "transaction-pending-limit" is set to 1 in aerospike.conf.
Many thanks.
Try adding one more node to the Aerospike cluster. With a one-node Aerospike cluster you are not replicating to another node, so the transaction completes before you can encounter "key busy". Adding another node with replication factor 2 will cause the current transaction to wait in the queue for the replication ack, and then, I believe, you will be able to simulate the key-busy error with transaction-pending-limit set to 1. Let us know if that works for you.
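As a rough sketch (the namespace name and remaining settings are placeholders), the relevant parts of aerospike.conf would then look something like:
namespace test {
    replication-factor 2
    transaction-pending-limit 1
    # storage-engine and the rest of the namespace config unchanged
}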

Handle kafka broker full disk space

We have set up a ZooKeeper quorum (3 nodes) and 3 Kafka brokers. The producers were unable to send records to Kafka, which caused data loss. During investigation we could still SSH to the affected broker and observed that its disk was full. We deleted topic logs to clear some disk space and the broker functioned as expected again.
Given that we could still SSH to that broker (we can't see the logs right now), I assume ZooKeeper could still hear the broker's heartbeat and didn't consider it down? What is the best practice to handle such events?
The best practice is to avoid this from happening!
You need to monitor the disk usage of your brokers and have alerts in advance in case available disk space runs low.
You need to put retention limits on your topics to ensure data is deleted regularly.
You can also use Topic Policies (see create.topic.policy.class.name) to control how much retention time/size is allowed when creating/updating topics to ensure topics can't fill your disk.
The recovery steps you took are OK, but to keep your cluster availability high you really don't want to let the disks fill up in the first place.
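For example, the retention limits mentioned above can be set per topic with kafka-configs.sh (topic name, bootstrap address and values are placeholders; older versions use --zookeeper instead of --bootstrap-server):
kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name my-topic --add-config retention.ms=604800000,retention.bytes=53687091200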

Apache kafka production cluster setup problems

We have been trying to set up a production-level Kafka cluster on AWS Linux machines and so far we have been unsuccessful.
Kafka version:
2.1.0
Machines:
5 r5.xlarge machines for 5 Kafka brokers.
3 t2.medium zookeeper nodes
1 t2.medium node for schema-registry and related tools (a single instance of each)
1 m5.xlarge machine for Debezium.
Default broker configuration:
num.partitions=15
min.insync.replicas=1
group.max.session.timeout.ms=2000000
log.cleanup.policy=compact
default.replication.factor=3
zookeeper.session.timeout.ms=30000
Our problem is mainly related to the huge volume of data.
We are trying to transfer our existing tables into Kafka topics using Debezium. Many of these tables are quite huge, with over 50,000,000 rows.
So far we have tried many things, but our cluster fails every time for one or more of the following reasons.
ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/__consumer_offsets/partitions/0/state
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)..
Error 2:
] INFO [Partition xxx.public.driver_operation-14 broker=3] Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,551] INFO [Partition xxx.public.hub-14 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.hub-14 broker=3] Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.field_data_12_2018-7 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
Error 3:
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=888665879, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
Some more errors:
Frequent disconnections among brokers, which is probably the reason behind the nonstop shrinking and expanding of ISRs with no automatic recovery.
Schema Registry requests time out. I don't know how Schema Registry is even affected; I don't see too much load on that server. Am I missing something? Should I use a load balancer for multiple instances of Schema Registry as failover? The topic __schemas has just 28 messages in it.
The exact error message is RestClientException: Register operation timed out. Error code: 50002
Sometimes the message transfer rate is over 100,000 messages per second, and sometimes it drops to 2,000 messages per second. Could message size cause this?
To address some of the above problems, we increased the number of brokers and raised zookeeper.session.timeout.ms to 30000, but I am not sure whether it actually solved our problem, and if it did, how.
I have a few questions:
Is our cluster good enough to handle this much data?
Is there anything obvious that we are missing?
How can I load test my setup before moving to production?
What could cause the session timeouts between the brokers and the Schema Registry?
What is the best way to handle the Schema Registry problem?
How should we handle the network load on one of our brokers?
Feel free to ask for any more information.
Please use the latest official Confluent version for your cluster.
You can also improve things by increasing the number of partitions of your topics and by setting tasks.max (in your sink connectors, of course) to more than 1 so the connectors work more concurrently and faster.
Please run Kafka Connect in distributed mode and replicate the Kafka Connect internal topics to increase the high availability of your Kafka Connect cluster. You can do this by setting the replication factor in the Kafka Connect and Schema Registry configs, for example:
config.storage.replication.factor=2
status.storage.replication.factor=2
offset.storage.replication.factor=2
Please set the topic compression to snappy for your large tables; it will increase the throughput of the topics and help the Debezium connector work faster. Also, do not use the JSON converter; it's recommended to use the Avro converter instead.
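As a rough sketch (topic name, addresses and the registry URL are placeholders), the compression and converter settings could look like this:
kafka-configs.sh --zookeeper zk1:2181 --alter --entity-type topics --entity-name xxx.public.driver_operation --add-config compression.type=snappy
# in the Kafka Connect worker config:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081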
Also, please use a load balancer in front of your Schema Registry.
For testing the cluster, you can create a connector with only one table (I mean a large table!) using database.whitelist and set snapshot.mode to initial; a minimal sketch follows below.
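A minimal connector sketch, assuming a PostgreSQL source (which the xxx.public.* topic names above suggest); hostnames, credentials and table names are placeholders, and the whitelist property name depends on the connector (database.whitelist for MySQL, schema.whitelist/table.whitelist for PostgreSQL):
name=big-table-snapshot-test
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=db-host
database.port=5432
database.user=debezium
database.password=********
database.dbname=mydb
database.server.name=xxx
table.whitelist=public.driver_operation
snapshot.mode=initial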
About Schema Registry: Schema Registry uses both Kafka and ZooKeeper, configured via these settings:
bootstrap.servers
kafkastore.connection.url
And this is the reason for the downtime of your Schema Registry cluster.

Separate zookeeper install or not using kafka 10.2?

I would like to use the embedded ZooKeeper 3.4.9 that comes with Kafka 10.2, and not install ZooKeeper separately. Each Kafka broker will always have a 1:1 ZooKeeper on localhost.
So if I have 5 brokers on hosts A, B, C, D and E, each with a single Kafka and Zookeeper instance running on them, is it sufficient to just run the Zookeeper provided with Kafka?
What downsides or configuration limitations, if any, does the embedded 3.4.9 Zookeeper have compared to the standalone version?
These are a few reasons not to run ZooKeeper on the same box as Kafka brokers.
They scale differently
5 ZooKeeper nodes and 5 Kafka brokers works, but 6:6 or 11:11 does not. You don't need more than 5 ZooKeeper nodes even for a quite large Kafka cluster. Unlike Kafka, ZooKeeper replicates data to all nodes, so it gets slower as you add more nodes.
They compete for disk I/O
ZooKeeper is very sensitive to disk I/O latency. You need to keep it on a separate physical disk from the Kafka commit log, or you run the risk that heavy publishing to Kafka will slow ZooKeeper down and cause it to drop out of the ensemble, creating problems.
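A minimal sketch of that separation (mount points are placeholders), with the broker's server.properties pointing log.dirs at one disk and ZooKeeper's zoo.cfg pointing dataDir at another:
# server.properties (Kafka)
log.dirs=/mnt/kafka-data/logs
# zoo.cfg (ZooKeeper)
dataDir=/mnt/zookeeper-data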
They compete for page cache memory
Kafka uses the Linux page cache to reduce disk I/O. When other apps run on the same box as Kafka, they reduce or "pollute" the page cache with other data, taking cache away from Kafka.
Server failures take down more infrastructure
If the box reboots you lose both a zookeeper and a broker at the same time.
Even though ZooKeeper comes with each Kafka release, that does not mean they should run on the same server. In fact, for a production environment it is advised that they run on separate servers.
In the Kafka broker configuration you can specify the ZooKeeper address, and it can be local or remote. This is from broker config (config/server.properties):
# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=localhost:2181
You can replace localhost with any other accessible server name or IP address.
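For example, pointing at a remote ensemble with an optional chroot (hostnames and chroot are placeholders) looks like:
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/kafka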
We've been running a setup as you described, with 3 to 5 nodes, each running a Kafka broker and the ZooKeeper that comes with the Kafka distribution on the same node. No issues with that setup so far, but our data throughput isn't high.
If we were to scale above 5 nodes we'd separate them, so that we only scale kafka brokers but keep the zookeeper ensemble small. If zookeeper and kafka start competing for I/O too much, then we'd move their data directories to separate drives. If they start competing for CPU, then we'd move them to separate boxes.
All in all, it depends on your expected throughput and how easily you can upgrade your setup if it starts causing contention. You can start small and easy, with Kafka and ZooKeeper co-located, as long as you have the flexibility to add more nodes and introduce separation later on. If you think this will be hard to do later, it's better to run them separately from the start. We've been running them co-located for 18+ months and haven't encountered resource contention so far.