We have 4 Kafka clusters:
ENV1: 6 brokers and 3 ZooKeeper nodes
ENV2: 6 brokers and 3 ZooKeeper nodes
ENV3: 8 brokers (4 in each of 2 DCs) and 9 ZooKeeper nodes (3 in each of 3 DCs)
ENV4: 16 brokers (8 in each of 2 DCs) and 9 ZooKeeper nodes (3 in each of 3 DCs)
All of the Kafka brokers are on version 2.7.0, and all of the ZooKeeper nodes are on version 3.4.13. Every Kafka broker and ZK node is a VM. Each of the four environments runs in a separate subnet. Swap is turned off everywhere. All of the clusters are Kerberized against a separate, highly available AD, which contains 7 Kerberos servers.
VM parameters:
ENV1:
  Kafka brokers: 16 GB RAM, 8 vCPU, 1120 GB hard disk, RHEL 7.9
  ZK nodes: 4 GB RAM, 2 vCPU, 105 GB hard disk, RHEL 8.5
ENV2:
  Kafka brokers: 16 GB RAM, 8 vCPU, 1120 GB hard disk, RHEL 7.9
  ZK nodes: 4 GB RAM, 2 vCPU, 105 GB hard disk, RHEL 8.5
ENV3:
  Kafka brokers: 24 GB RAM, 8 vCPU, 2120 GB hard disk, RHEL 7.9
  ZK nodes: 8 GB RAM, 2 vCPU, 200 GB hard disk, RHEL 8.5
ENV4:
  Kafka brokers: 24 GB RAM, 8 vCPU, 7145 GB hard disk, RHEL 7.9
  ZK nodes: 8 GB RAM, 2 vCPU, 200 GB hard disk, RHEL 7.9
We have the following issue in every environment at the same time, 3-4 times a day, for a few seconds: our kafka_network_socketserver_networkprocessoravgidlepercent metric drops to zero on every broker, and the cluster becomes unreachable; even the brokers cannot communicate with each other while this happens. Here is a picture of it from our Grafana dashboard:
We can see the following ERRORs in the server log, but we suspect all of them are just consequences:
ERROR Error while processing notification change for path = /kafka-acl-changes (kafka.common.ZkNodeChangeNotificationListener)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:270)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:252)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:252)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1730)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1700)
at kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1695)
at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:83)
at kafka.common.ZkNodeChangeNotificationListener$ChangeNotification.process(ZkNodeChangeNotificationListener.scala:120)
at kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread.doWork(ZkNodeChangeNotificationListener.scala:146)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
ERROR [ReplicaManager broker=1] Error processing append operation on partition topic-4 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 461002 at offset 5761036 in partition topic-4: 1022 (incoming seq. number), 1014 (current end sequence number)
ERROR [KafkaApi-5] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
ERROR [KafkaApi-11] Error when handling request: clientId=broker-12-fetcher-0, correlationId=8972304, api=FETCH, version=12, body={replica_id=12,max_wait_ms=500,min_bytes=1,max_bytes=10485760,isolation_level=0,session_id=683174603,session_epoch=47,topics=[{topic=__consumer_offsets,partitions=[{partition=36,current_leader_epoch=294,fetch_offset=1330675527,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}},{partition=25,current_leader_epoch=288,fetch_offset=3931235489,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}}],_tagged_fields={}}],forgotten_topics_data=[],rack_id=,_tagged_fields={}} (kafka.server.KafkaApis)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Leader not local for partition topic-42 on broker 11
ERROR [GroupCoordinator 5]: Group consumer-group_id could not complete rebalance because no members rejoined (kafka.coordinator.group.GroupCoordinator)
ERROR [Log partition=topic-1, dir=/kafka/logs] Could not find offset index file corresponding to log file /kafka/logs/topic-1/00000000000001842743.log, recovering segment and rebuilding index files... (kafka.log.Log)
[ReplicaFetcher replicaId=16, leaderId=10, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=16, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1439560581, epoch=1138570), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 10 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:211)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:301)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:136)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:118)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
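The first trace is a ZooKeeper session expiry, which on an otherwise healthy network usually means a stalled JVM (or a stalled VM), so one cheap check is to look for long stop-the-world pauses around each incident. A sketch, assuming the JVMs log safepoint pauses (-XX:+PrintGCApplicationStoppedTime) and using a placeholder path; adjust both to your installation:

# long application-stopped events around an incident
grep "Total time for which application threads were stopped" \
    /opt/kafka/logs/kafkaServer-gc.log | tail -n 50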
I will update the question if needed, e.g. with relevant Kafka configs.
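For a start, here are the ZooKeeper-related broker settings most relevant to the traces above; the values shown are the Kafka 2.7 defaults rather than our tuned ones, so treat them as a sketch:

# server.properties (illustrative values; 18000 ms is the 2.7 default)
zookeeper.session.timeout.ms=18000
# falls back to the session timeout when unset
zookeeper.connection.timeout.ms=18000

Raising these can paper over short stalls, but it would not fix whatever is pausing the broker JVMs in the first place.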
We know that it might be an issue with our infrastructure, but we cannot see a problem with the network, nor on the Kerberos side, which is why I am asking for help here. Do you have any idea what may cause this issue? Every idea is welcome, because we have run out of our own.
Thanks in advance!
My VM has the configuration below, but when I install the bitnami/dokuwiki chart from the Bitnami Helm repository (or any other chart) and run the deployment, the pods end up Pending or in CrashLoopBackOff. Can someone help with this?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2095.077
BogoMIPS: 4190.15
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-3
Issue:
I tried applying a PV, but the pods are still not running. I want to get these pods running.
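A minimal triage sketch (pod names are placeholders; adjust the namespace to wherever the chart was installed):

kubectl get pods -o wide               # which pods are Pending / CrashLoopBackOff
kubectl describe pod <pod-name>        # the Events section at the bottom usually names the cause
kubectl get pvc                        # a Pending PVC usually means no default StorageClass
kubectl logs <pod-name> --previous     # logs from the last crashed container

With Bitnami charts, Pending pods are most often an unbound PersistentVolumeClaim or resource requests the node cannot satisfy; note that a manually created PV only binds if its capacity, accessModes, and storageClassName match the chart's PVC.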
I have a complex, PostgreSQL-backed Ruby on Rails application running on an Ubuntu virtual machine. I see that the Postgres processes have very high %CPU values when I run "top"; periodically the %CPU goes up to 94 and 95.
lscpu
gives the following output
Architecture: i686
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Stepping: 4
CPU MHz: 2100.000
BogoMIPS: 4200.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
Output of top -n1 and top -c (screenshots not reproduced here).
I want to know the reason for the high CPU utilization by Postgres.
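A starting point for narrowing this down is to ask Postgres itself what is running; a sketch, where the first query needs Postgres 9.2 or newer and the second needs the pg_stat_statements extension installed and preloaded:

-- what is running right now, longest-running first
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;

-- cumulative cost per statement (column is total_exec_time on Postgres 13+)
SELECT calls, total_time, rows, query
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;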
Any help is appreciated.
Thanks in Advance!!
I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very well versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s,
21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity: the 1.5 million row r/s, is that a lot? If so, what can be done to improve it? I am running PuppetDB 2.3.8, with 6.8 million resources, 2500 nodes, and Postgres 9.1. All of this runs on a single 24-core box with 64 GB of memory.
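For context: "row r/s" appears to come from the cluster-wide row-read counters, so 1.5 million rows/s together with an 86% buffer hit rate tends to point at large sequential scans. Assuming SQL access to the puppetdb database, a query like this (pg_stat_user_tables exists on 9.1) shows which tables those reads come from:

SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;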
We use KVM and libvirt on a 6-core (12 HT threads) machine for virtualization.
Problem: wrong CPU type in the virtual guest.
KVM, libvirt, and kernel versions used:
libvirt version: 0.9.8
QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
Ubuntu 12.04.1 LTS
kernel: 3.2.0-32-generic x86_64
/usr/share/libvirt/cpu_map.xml does not support CPU types more recent than Westmere.
Do I need this kind of CPU virtualisation at all? For several reasons we need maximum CPU performance in the guest, and I would be glad to have some cores of the server's i7-3930K CPU @ 3.20GHz available in my virtual machines.
Maybe we do too much virtualization...?
My guest's XML looks like the following; where can I set the -cpu host flag?
<domain type='kvm'>
<name>myVirtualServer</name>
<uuid>2344481d-f455-455e-9558</uuid>
<description>Test-Server</description>
<memory>4194304</memory>
<currentMemory>4194304</currentMemory>
<vcpu>2</vcpu>
<cpu match='exact'>
<model>Westmere</model>
<vendor>Intel</vendor>
</cpu>
<os>
<type arch='x86_64' machine='pc-1.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
$ lscpu of physical Server with 6 (12) cores with HT
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Stepping: 7
CPU MHz: 1200.000
BogoMIPS: 6400.05
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11
$ lscpu of virtual Server (wrong CPU type, wrong L2-Cache, wrong MHz)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 15
Stepping: 11
CPU MHz: 3200.012
BogoMIPS: 6400.02
Virtualisation: VT-x
Hypervisor vendor: KVM
Virtualisation type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0,1
In the guest's XML, add
<cpu mode='custom' match='exact'>
<model fallback='allow'>core2duo</model>
<feature policy='require' name='vmx'/>
</cpu>
as an example. Use virsh edit, then restart the guest.
EDIT. Ignore this. I've just re-read your question and you're already doing that.
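What you actually seem to want is the libvirt equivalent of qemu's -cpu host, which is expressed as a CPU mode rather than a model. A sketch, replacing your whole <cpu match='exact'> block:

<!-- the guest then sees the host CPU model directly -->
<cpu mode='host-passthrough'/>

Caveat: the mode attribute needs a newer libvirt than your 0.9.8 (host-model and host-passthrough arrived around 0.9.10/0.9.11), so an upgrade may be required; host-model is the more migration-friendly middle ground if passthrough is not available.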