Apache kafka production cluster setup problems - apache-kafka

We have been trying to set up a production level Kafka cluster in AWS Linux machines and till now we have been unsuccessful.
Kafka version:
2.1.0
Machines:
5 r5.xlarge machines for 5 Kafka brokers.
3 t2.medium zookeeper nodes
1 t2.medium node for schema-registry and related tools. (a Single instance of each)
1 m5.xlarge machine for Debezium.
Default broker configuration :
num.partitions=15
min.insync.replicas=1
group.max.session.timeout.ms=2000000
log.cleanup.policy=compact
default.replication.factor=3
zookeeper.session.timeout.ms=30000
Our problem is mainly related to huge data.
We are trying to transfer our existing tables in kafka topics using debezium. Many of these tables are quite huge with over 50000000 rows.
Till now, we have tried many things but our cluster fails every time with one or more reasons.
ERROR Uncaught exception in scheduled task 'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /brokers/topics/__consumer_offsets/partitions/0/state
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)..
Error 2:
] INFO [Partition xxx.public.driver_operation-14 broker=3] Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,551] INFO [Partition xxx.public.hub-14 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.hub-14 broker=3] Cached zkVersion [3] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-12-12 14:07:26,556] INFO [Partition xxx.public.field_data_12_2018-7 broker=3] Shrinking ISR from 1,3 to 3 (kafka.cluster.Partition)
Error 3:
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=888665879, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 3 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
Some more errors :
Frequent disconnections among broker which probably is the reason
behind nonstop shrinking and expanding ISRs with no auto recovery.
Schema registry gets timed out. I don't know how is schema registry even affected. I don't see too much load on that server. Am I missing something? Should I use a Load balancer for multiple instances of schema Registry as failover?. The topic __schemas has just 28 messages in it.
The exact error message is RestClientException: Register operation timed out. Error code: 50002
Sometimes the message transfer rate is over 100000 messages per second, sometimes it drops to 2000 messages/second? message size could cause this?
In order to solve some of the above problems, we increased the number of brokers and increased zookeeper.session.timeout.ms=30000 but I am not sure if it actually solved the our problem and if it did, how?.
I have a few questions:
Is our cluster good enough to handle this much data.
Is there anything obvious that we are missing?
How can I load test my setup before moving to the production level?
What could cause the session timeouts between brokers and the schema registry.
Best way to handle the schema registry problem.
Network Load on one of our Brokers.
Feel free to ask for any more information.

Please Use The latest official version of the Confluent for you cluster.
Actually you can make it better by increasing the number of partitions of your topics and also increasing the tasks.max(of course in your sink connectors) more than 1 in your connector to work more concurrently and faster.
Please increase the number of Kafka-Connect topics and use Kafka-Connect distributed mode to increase the High Availability of your Kafka-connect cluster. you can make it by setting the number of replication factor in the Kafka-Connectand Schema-Registry config for example:
config.storage.replication.factor=2
status.storage.replication.factor=2
offset.storage.replication.factor=2
Please set the topic compression to snappy for your large tables. it will increase the throughput of the topics and this helps the Debezium connector to work faster and also do not use JSON Converter it's recommended to use Avro Convertor!
Also please use load-balancer for your Schema Registry
For testing the cluster you can create a connector with only one table (I mean a large table!) with the database.whitelist and set snapshot.mode to initial
And About the schema-registry! Schema-registry user both Kafka and Zookeeper with setting these configs:
bootstrap.servers
kafkastore.connection.url
And this is the reason of your downtime of the shema-registry cluster

Related

Kafka - broker partitions not in-sync after restart

We use 3 node kafka clusters running 2.7.0 with quite high number of topics and partitions. Almost all the topics have only 1 partition and replication factor of 3 so that gives us roughly:
topics: 7325
partitions total in cluster (including replica): 22110
Brokers are relatively small with
6vcpu
16gb memory
500GB in /var/lib/kafka occupied by partitions data
As you can imagine because we have 3 brokers and replication factor 3 the data is very evenly spread across brokers. Each broker leads very similar (same) amount of partitions and the number of partitions per broker is equal. Under normal circumstances.
Before doing rolling restart yesterday everything was in-sync. We stopped the process and started it again after 1 minute. It took some 10minutes to get synchronized with Zookeeper and start listening on port.
After saing 'Kafka server started'. Nothing is happening. There is no CPU, memory or disk activity. The partition data is visible on data disk. There are no messages in log for more than 1 day now since process booted up.
We've tried restarting zookeeper cluster (one by one). We've tried restart of broker again. Now it's been 24 hours since last restart and still not change.
Broker itself is reporting it leads 0 partitions. Leadership for all the partitions moved to other brokers and they are reporting that everything located in this broker is not in sync.
I'm aware the number of partitions per broker is far exceeding the recommendation but I'm still confused by lack of any activity or log messages. Any ideas what should be checked further? It looks like something is stuck somewhere. I checked the kafka ACLs and there are no block messages related to broker username.
I tried another restart with DEBUG mode and it seems there is some problem with metadata. These two messages are constantly repeating:
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
With kcat it's also impossible to fetch metadata about topics (meaning if I specify this broker as bootstrap server).

kafka broker with inconsistent data - NotLeaderForPartitionError

We have a 13 node Kafka cluster and each broker has multiple disks and all topics have replication factor 3. 
Broker 6 had a hardware issue and required a complete OS reload (Linux) and 2 disk replacements. Now I installed Kafka again on this node with the same broker id 6 but started to get an exception from all producers -
[Error 6] NotLeaderForPartitionError: ProduceResponsePayload(topic=u'amyTopic', partition=7, error=6, offset=-1)
I am assuming that since I am using the same broker ID, it (zookeeper? or controller broker?) is expecting data in the disk which got replaced or some other meta info that might get wiped out during OS reload.
What are the options I have to add this node back to the cluster without much disturbance to the cluster and without data loss? Should I use a new broker ID for this node and then repartition the data of every topic as we do after adding a new node? We have a lot of data (a few hundred TB) in the cluster and I am trying to avoid huge data movement caused by the repartition of data and it may choke the entire cluster. Please suggest.

How does the replicas in kafka works if one broker went done for 2 hours and came back?

we are having 3 zookeeper and 3 kafka broker nodes as cluster setup running in different systems in AWS,And we changes the below properties to ensure the high availabilty and prevent data loss.
server.properties
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
i am having the following question
Assume BROKER A,B,C
since we are enabling replication factor as 3 all the data will be available in all A,B,C brokers if A broker is down it wont affect the flow.
but when ever broker A went down but at the same time we are continously receiving data from connector and it is storing the broker B and C
so after 2 hours broker A came up
In that time the data came between the down time and up time of A is available in broker A or not?
is there any specific configuration we need to mention for that?
how does the replication between the broker happen when one broker came online from offline?
i didn't know whether it is a valid question, but please share your thoughts on this to understand this replication factor working
While A is recovering, it'll be out of the ISR list. If you've disabled unclean leader election, then A cannot become the leader broker of any partitions it holds (no client can write or read to it) and will replicate data from other replicas until its up to date, then join the ISR

Kafka is trying to send messages to a broker in "recovery mode"

I have the following setup
3 Kafka (v2.1.1) Brokers
5 Zookeeper instances
Kafka brokers have the following configuration:
auto.create.topics.enable: 'false'
default.replication.factor: 1
delete.topic.enable: 'false'
log.cleaner.threads: 1
log.message.format.version: '2.1'
log.retention.hours: 168
num.partitions: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: '2'
transaction.state.log.replication.factor: '3'
zookeeper.connection.timeout.ms: 10000
zookeeper.session.timeout.ms: 10000
min.insync.replicas: '2'
request.timeout.ms: 30000
Producer configuration (using Spring Kafka) is more or less as following:
...
acks: all
retries: Integer.MAX_VALUE
deployment.timeout.ms: 360000ms
enable.idempotence: true
...
This configuration I read as follows: There are three Kafka brokers, but once one of them dies, it is fine if only at least two replicate and persist the data before sending the ack back (= in sync replicas). In case of failure, Kafka producer will keep retrying for 6 minutes, but then gives up.
This is the scenario which causes me headache:
All Kafka and Zookeeper instances are up and alive
I start sending messages in chunks (500 pcs each)
In the middle of the processing, one of the Brokers dies (hard kill)
Immediately, I see logs like 2019-08-09 13:06:39.805 WARN 1 --- [b6b45bb5c-7dxh7] o.a.k.c.NetworkClient : [Producer clientId=bla-6b6b45bb5c-7dxh7, transactionalId=bla-6b6b45bb5c-7dxh70] 4 partitions have leader brokers without a matching listener, including [...] (question 1: I do not see any further messages coming in, does this really mean the whole cluster is now stuck and waiting for the dead Broker to come back???)
After the dead Broker starts to boot up again, it starts with recovery of the corrupted index. This operation takes more than 10 minutes as I have a lot of data on the Kafka cluster
Every 30s, the producer tries to send the message again (due to request.timeout.ms property set to 30s)
Since my deployment.timeout.ms is se to 6 minutes and the Broker needs 10 minutes to recover and does not persist the data until then, the producer gives up and stops retrying = I potentially lose the data
The questions are
Why the Kafka cluster waits until the dead Broker comes back?
When the producer realizes the Broker does not respond, why it does not try to connect another Broker?
The thread is completely stuck for 6 minutes and waiting until the dead Broker recovers, how can I tell the producer to rather try another Broker?
Am I missing something or is there any good practice to avoid such scenario?
You have a number of questions, I'll take a shot at providing our experience which will hopefully shed light on some of them.
In my product, IBM IDR Replication, we had to provide information for robustness to customers who's topics were being rebalanced, or whom had lost a broker in their clusters. The results of some of our testing was the simply setting the request timeout was not sufficient because in certain circumstances the request would decide not to wait the entire time, and rather perform another retry almost instantly. This burned through the configured number of retries Ie. there are circumstances where the timeout period is circumvented.
As such we instructed users to utilize a formula like the following...
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/robust.html
"To tune the values for your environment, adjust the Kafka producer properties retry.backoff.ms and retries according to the following formula:
retry.backoff.ms * retries > the anticipated maximum time for leader change metadata to propagate in the clusterCopy
For example, you might wish to configure retry.backoff.ms=300, retries=150 and max.in.flight.requests.per.connection=1."
So maybe try utilizing retries and retry.backoff.ms. Note that utilizing retries without idempotence can cause batches to be written out of order if you have more than one in flight... so choose accordingly based on your business logic.
It was our experience that the Kafka Producer writes to the broker which is the leader for the topic, and so you have to wait for the new leader to be elected. When it is, if the retry process is still ongoing, the producer transparently determines the new leader and writes data accordingly.

Kafka broker taking too long to come up

Recently, one of our Kafka broker (out of 5) got shut down incorrectly. Now that we are starting it up again, there are a lot of warning messages about corrupted index files and the broker is still starting up even after 24 hours. There is over 400 GB of data in this broker.
Although the rest of the brokers are up and running but some of the partitions are showing -1 as their leader and the bad broker as the only ISR. I am not seeing other Replicas to be appointed as new leaders, maybe because the bad broker is the only one in sync for those partitions.
Broker Properties:
Replication Factor: 3
Min In Sync Replicas: 1
I am not sure how to handle this. Should I wait for the broker to fix everything itself? is it normal to take so much time?
Is there anything else I can do? Please help.
After an unclean shutdown, a broker can take a while to restart as it has to do log recovery.
By default, Kafka only uses a single thread per log directory to perform this recovery, so if you have thousands of partitions it can take hours to complete.
To speed that up, it's recommended to bump num.recovery.threads.per.data.dir. You can set it to the number of CPU cores.