Separate Apache Kafka clusters unreachable at the same time - kafka_network_socketserver_networkprocessoravgidlepercent goes to zero

We have 4 Kafka clusters:
ENV1: 6 brokers and 3 zookeepers
ENV2: 6 brokers and 3 zookeepers
ENV3: 8 brokers (4+4 across 2 DCs) and 9 zookeepers (3+3+3 across 3 DCs)
ENV4: 16 brokers (8+8 across 2 DCs) and 9 zookeepers (3+3+3 across 3 DCs)
All of the Kafka brokers are on version 2.7.0, and all of the ZK nodes are on version 3.4.13. Every Kafka broker and ZK node is a VM. Each of the four environments runs in a separate subnet. Swap is turned off everywhere. All of the clusters are Kerberized and use a separate highly available AD for it, which contains 7 Kerberos servers.
VM parameters:
ENV1:
Kafka brokers:
16 GB RAM,
8 vCPU,
1120 GB Hard Disk,
RHEL 7.9
ZK nodes:
4 GB RAM,
2 vCPU,
105 GB Hard Disk,
RHEL 8.5
ENV2:
Kafka brokers:
16 GB RAM,
8 vCPU,
1120 GB Hard Disk,
RHEL 7.9
ZK nodes:
4 GB RAM,
2 vCPU,
105 GB Hard Disk,
RHEL 8.5
ENV3:
Kafka brokers:
24 GB RAM,
8 vCPU,
2120 GB Hard Disk,
RHEL 7.9
ZK nodes:
8 GB RAM,
2 vCPU,
200 GB Hard Disk,
RHEL 8.5
ENV4:
Kafka brokers:
24 GB RAM,
8 vCPU,
7145 GB Hard Disk,
RHEL 7.9
ZK nodes:
8 GB RAM,
2 vCPU,
200 GB Hard Disk,
RHEL 7.9
We have the following issue on every environment at the same time, 3-4 times a day for a few seconds: our kafka_network_socketserver_networkprocessoravgidlepercent metric drops to zero on every broker, and the cluster becomes unreachable; even the brokers cannot communicate with each other while this happens. Here is a picture of it from our Grafana dashboard:
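The same dip can also be watched directly on a broker over JMX while it happens; a minimal sketch using Kafka's bundled JmxTool (the install path and the JMX port 9999 are assumptions, adjust to your setup):

# Poll the network processor idle-percent MBean once per second on one broker
/opt/kafka/bin/kafka-run-class.sh kafka.tools.JmxTool \
  --object-name 'kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --reporting-interval 1000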
We can see the following ERRORs in the server log, but we suppose all of them are just consequences:
ERROR Error while processing notification change for path = /kafka-acl-changes (kafka.common.ZkNodeChangeNotificationListener)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:270)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:252)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:252)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1730)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1700)
at kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1695)
at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:83)
at kafka.common.ZkNodeChangeNotificationListener$ChangeNotification.process(ZkNodeChangeNotificationListener.scala:120)
at kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread.doWork(ZkNodeChangeNotificationListener.scala:146)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
ERROR [ReplicaManager broker=1] Error processing append operation on partition topic-4 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 461002 at offset 5761036 in partition topic-4: 1022 (incoming seq. number), 1014 (current end sequence number)
ERROR [KafkaApi-5] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
ERROR [KafkaApi-11] Error when handling request: clientId=broker-12-fetcher-0, correlationId=8972304, api=FETCH, version=12, body={replica_id=12,max_wait_ms=500,min_bytes=1,max_bytes=10485760,isolation_level=0,session_id=683174603,session_epoch=47,topics=[{topic=__consumer_offsets,partitions=[{partition=36,current_leader_epoch=294,fetch_offset=1330675527,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}},{partition=25,current_leader_epoch=288,fetch_offset=3931235489,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}}],_tagged_fields={}}],forgotten_topics_data=[],rack_id=,_tagged_fields={}} (kafka.server.KafkaApis)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Leader not local for partition topic-42 on broker 11
ERROR [GroupCoordinator 5]: Group consumer-group_id could not complete rebalance because no members rejoined (kafka.coordinator.group.GroupCoordinator)
ERROR [Log partition=topic-1, dir=/kafka/logs] Could not find offset index file corresponding to log file /kafka/logs/topic-1/00000000000001842743.log, recovering segment and rebuilding index files... (kafka.log.Log)
[ReplicaFetcher replicaId=16, leaderId=10, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=16, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1439560581, epoch=1138570), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 10 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:211)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:301)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:136)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:118)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
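The ZooKeeperClientExpiredException above means the brokers' ZooKeeper sessions are expiring during these windows. For reference, these are the broker-side settings that govern that session; a quick way to check them on a broker (the config path is an example, and the defaults noted are the Kafka 2.7 defaults as far as I know):

# Show the ZooKeeper connection/session settings configured on this broker
grep -E '^zookeeper\.(connect|session\.timeout\.ms|connection\.timeout\.ms)' \
  /opt/kafka/config/server.properties
# If unset, Kafka 2.7 uses zookeeper.session.timeout.ms=18000, and
# zookeeper.connection.timeout.ms falls back to the session timeout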
I will update the question if needed, e.g. with relevant Kafka configs.
We know that it might be an issue with our infrastructure, but we cannot see any problem on the network side, nor on the Kerberos side, so that's why I'm asking for help from you guys. Do you have any idea what may cause this issue? Every idea may be helpful, because we have run out of them.
Thanks in advance!

Related

Kafka unexpectedly shutting down. License topic could not be created

On Kafka startup multiple messages are logged to kafka/logs/kafkaServer.out and contain:
INFO [Admin Manager on Broker 0]: Error processing create topic request CreatableTopic(name='_confluent-license', numPartitions=1, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='min.insync.replicas', value='2')]) (kafka.server.AdminManager)
After approximately 15 minutes Kafka shuts down and the following is output to kafka/logs/kafkaServer.out:
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,951] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.TimeoutException: License topic could not be created
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,952] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
It appears Kafka shuts down because the replication factor is set to 3 for the topic _confluent-license? I'm not creating the topic _confluent-license; is it created as part of Kafka startup for the licensing check?
In an attempt to fix this, I've modified /v5.5.0/etc/kafka/server.properties so that the replication factor is 1 for the internal topics:
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability, such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
instead of 3 :
#offsets.topic.replication.factor=3
#transaction.state.log.replication.factor=3
But this does not fix the issue and the same logs are generated. The replication factor of __consumer_offsets is still 3. How can I reduce the replication factor of the topic _confluent-license from 3 to 1? Or could there be another issue that is causing Kafka to shut down?
You should change the property confluent.license.topic.replication.factor; by default it is 3.
[2020-12-08 04:04:15,951] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.TimeoutException: License topic could not be created
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,952] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
The above error is due to the license topic having a default replication factor of 3. It can be configured with confluent.license.topic.replication.factor set to 1 if you have only one broker. See the Confluent documentation for this property.
[2020-12-08 07:46:02,241] ERROR Error checking or creating metrics topic (io.confluent.metrics.reporter.ConfluentMetricsReporter)
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 2 larger than available brokers: 1.
The above error is due to the Confluent Metrics Reporter being enabled. The replication factor for the metrics topic defaults to 3 and can be configured with confluent.metrics.reporter.topic.replicas set to 1 if you have just one broker. See the Confluent documentation for this property.
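To make that concrete, a hedged sketch of what could be appended to the same server.properties that was edited above, for a single-broker setup (property names as per the Confluent docs; verify them against your platform version):

# Single-broker setup: lower the replication factor of the Confluent
# license and metrics topics, whose defaults exceed the one available broker
cat >> /v5.5.0/etc/kafka/server.properties <<'EOF'
confluent.license.topic.replication.factor=1
confluent.metrics.reporter.topic.replicas=1
EOF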
The property confluent.license.topic.replication.factor is not working as expected.
Instead, we can try either of these properties, confluent.license.replication.factor=1 or confluent.license.admin.replication.factor=1, if the number of brokers is 1. These properties are not in the documentation.

Kafka broker v2.1.0 gets into stuck state after LeaderEpochCache truncation

We are running a 3-broker Kafka cluster (v2.11-2.1.0) on r4.xlarge machines on AWS. We have CPU usage of up to 85% and memory usage nearing 99% (including I/O buffers). We have a 3-node ZK cluster and a load of around ~80-90k messages/sec.
broker java env:
java -Xmx7982m -Xms7770m -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Xloggc:/home/kafka/kafka/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M
Usually producers and consumers don't face any issues, but intermittently the log entries below are written to server.log:
2019-05-27 20:21:53,211] WARN [LeaderEpochCache pulse-flattening-errors-23] New epoch entry EpochEntry(epoch=2, startOffset=0) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=1, startOffset=0)). Cache now contains 1 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,175] WARN [LeaderEpochCache pulse-21-feb-error-message-16] New epoch entry EpochEntry(epoch=4, startOffset=1833) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1833)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,269] WARN [LeaderEpochCache pulse-21-feb-error-message-28] New epoch entry EpochEntry(epoch=4, startOffset=1525) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1525)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,339] WARN [LeaderEpochCache pulse-21-feb-error-message-4] New epoch entry EpochEntry(epoch=4, startOffset=1427) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1427)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,529] WARN [LeaderEpochCache pulse-21-feb-error-message-10] New epoch entry EpochEntry(epoch=4, startOffset=2430) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=2430)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-27 20:23:45,577] WARN [LeaderEpochCache pulse-21-feb-error-message-22] New epoch entry EpochEntry(epoch=4, startOffset=1802) caused truncation of conflicting entries ListBuffer(EpochEntry(epoch=3, startOffset=1802)). Cache now contains 2 entries. (kafka.server.epoch.LeaderEpochFileCache)
Observations:
The above warnings occur on the leader of a particular partition.
The broker, while still listening on port 9092, stops responding to any producer requests.
Producers time out and are not able to send messages to the cluster.
Other brokers fail replication, hence they also get stuck.
A new leader is not elected.
All producers fail.
I have checked the config for the topics and partitions; nothing seems to be out of the ordinary. We are mostly keeping the defaults of the Apache Kafka distribution v2.11-2.1.0.
Unclean leader election is also set to false for the cluster as well as the topics.
This has been happening intermittently and repeatedly.
I have checked issues.apache.org for related issues but could not find any relevant match.
This is making our cluster unstable and causes full downtime for producers. I don't have any clue which direction to start looking in.
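One thing worth capturing the next time a broker goes silent while still listening on 9092 is a thread dump of the broker JVM, to see whether the network and request-handler threads are blocked; a minimal sketch (the PID lookup assumes the standard kafka.Kafka main class):

# Grab a thread dump from the stuck broker; repeat a few times ~10 s apart
BROKER_PID=$(pgrep -f 'kafka\.Kafka ')
jstack -l "$BROKER_PID" > /tmp/kafka-threaddump-$(date +%s).txt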

Kafka brokers keep on restarting?

I have 4 Kafka brokers and 3 ZooKeepers deployed on Kubernetes. Out of the 4, only 2 Kafka brokers are working; the other 2 keep shutting down and restarting with the below error:
Exiting because log truncation is not allowed for partition byfn-sys-channel-0, current leader's latest offset 2 is less than replica's latest offset 21 (kafka.server.ReplicaFetcherThread)
Below is the Kafka config:
KAFKA_ZOOKEEPER_CONNECT zookeeper0:2181,zookeeper1:2181,zookeeper2:2181
KAFKA_UNCLEAN_LEADER_ELECTION_ENABLE false
KAFKA_REPLICA_FETCH_MAX_BYTES 103809024
KAFKA_MIN_INSYNC_REPLICAS 1
KAFKA_MESSAGE_MAX_BYTES 103809024
KAFKA_LOG_RETENTION_MS -1
KAFKA_LOG_DIRS /var/kafkas/kafka2
KAFKA_DEFAULT_REPLICATION_FACTOR 3
KAFKA_BROKER_ID 2
KAFKA_ADVERTISED_LISTENERS PLAINTEXT://kafka2:9092
Please let me know how I can fix this.
Kafka halting because log truncation is not allowed for topic error shutting down kafka nodes
The above link shows what happens when log truncation is not allowed for a topic.
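In essence, with unclean.leader.election.enable=false the restarting replica is not allowed to truncate its log below the current leader's latest offset, so it halts instead. If losing those extra records is acceptable, the linked answer's approach amounts to temporarily allowing unclean leader election for the affected topic; a hedged sketch (topic name taken from the error above, broker address from the config, and --bootstrap-server support for topic configs depends on your Kafka version):

# Temporarily allow unclean leader election for the affected topic so the
# replica may truncate and rejoin (trades durability for availability; revert afterwards)
kafka-configs.sh --bootstrap-server kafka2:9092 \
  --entity-type topics --entity-name byfn-sys-channel \
  --alter --add-config unclean.leader.election.enable=true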

Kafka kafka-producer-perf-test.sh NetworkException The server disconnected before a response was received

I am trying to perform a benchmark test in our Kafka env. I have played with a few configurations such as request.timeout.ms, max.block.ms and throughput, but I am not able to avoid the errors:
org.apache.kafka.common.errors.TimeoutException: The request timed out.
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
org.apache.kafka.common.errors.TimeoutException: Expiring 148 record(s) for benchmark-6-3r-2isr-none-0: 182806 ms has passed since last append
Produce Perf Test command:
nohup sh ~/kafka/kafka_2.11-1.0.0/bin/kafka-producer-perf-test.sh --topic benchmark-6p-3r-2isr-none --num-records 10000000 --record-size 100 --throughput 1000 --print-metrics --producer-props acks=all bootstrap.servers=node1:9092,node2:9092,node3:9092 request.timeout.ms=180000 max.block.ms=180000 buffer.memory=100000000 > ~/kafka/load_test/results/6p-3r-10M-100B-t-1-ackall-rto3m-block2m-bm100m-2 2>&1
Cluster: 3 nodes, topic: 6 partitions, RF=3 and minISR=2
I am monitoring the Kafka metrics using a TSDB and Grafana. I know that disk I/O performance is bad [disk await (1.5 secs), IO queue size and disk utilization metrics are high (60-75%)], but I don't see anything in the Kafka logs that relates the slow disk I/O to the above perf errors.
But I get the error even for 1000 messages/sec.
I need suggestions to understand the issue and fix the above errors.
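The disk numbers above can be cross-checked directly on the brokers while the perf test runs, to see whether the await spikes line up with the request expirations; for example (requires the sysstat package):

# Extended per-device I/O stats every 5 seconds on a broker; watch await
# and %util on the volume holding the Kafka log dirs during the test
iostat -x 5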
I have another very disturbing observation.
The errors go away if I start 2 instances of kafka-producer-perf-test.sh with the same configs on different hosts.
If I cancel one of the kafka-producer-perf-test.sh instances, then after some time the above errors start reappearing.

Ceph enters degraded state after Deis installation

I have successfully upgraded Deis to v1.0.1 on a 3-node cluster, with each node having 2 GB RAM, hosted by DigitalOcean.
I then nse'd into a deis-store-monitor service, ran ceph -s, and realized it has entered the active+undersized+degraded state and never gets back to the active+clean state.
Detailed messages follow:
root@deis-2:/# ceph -s
libust[276/276]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
cluster dfa09ba0-66f2-46bb-8d84-12795f281f7d
health HEALTH_WARN 1536 pgs degraded; 1536 pgs stuck unclean; 1536 pgs undersized; recovery 1314/3939 objects degraded (33.359%)
monmap e3: 3 mons at {deis-1=10.132.183.190:6789/0,deis-2=10.132.183.191:6789/0,deis-3=10.132.183.192:6789/0}, election epoch 28, quorum 0,1,2 deis-1,deis-2,deis-3
mdsmap e32: 1/1/1 up {0=deis-1=up:active}, 2 up:standby
osdmap e77: 3 osds: 2 up, 2 in
pgmap v109093: 1536 pgs, 12 pools, 897 MB data, 1313 objects
27342 MB used, 48256 MB / 77175 MB avail
1314/3939 objects degraded (33.359%)
1536 active+undersized+degraded
client io 817 B/s wr, 0 op/s
I am totally new to Ceph. I wonder:
Is it a big deal to fix this issue, or could I leave it in this state?
If it is recommended to fix it, would you point out how I should go about it?
I have read the Ceph troubleshooting section and POOL, PG AND CRUSH CONFIG REFERENCE, but still have no idea what I should do next.
Thanks a lot!
From this output: osdmap e77: 3 osds: 2 up, 2 in. It sounds like one of your deis-store-daemons isn't responding. deisctl restart store-daemon should recover your cluster, but I'd be curious about what happened to that daemon. I'd love to see journalctl --no-pager -u deis-store-daemon on all of your hosts. If you could add your logs to https://github.com/deis/deis/issues/2520 that'd help us figure out why the daemon isn't responding.
Also, 2GB nodes on DO will likely result in performance issues (and Ceph may be unhappy).
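A quick way to see which OSD is the one that is down before restarting anything, from the same store-monitor container (standard Ceph CLI):

# Identify the OSD that is down/out and get a fuller health summary
ceph osd tree
ceph health detail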