Spring Batch Partition Threads Not Executing together - spring-batch

We have a Spring Batch job reading XML files and inserting them into a DB. Each slave partition processes 10,000 XML files and writes them to the DB.
Below is the Thread pool configuration
@Bean
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setMaxPoolSize(80);
    taskExecutor.setCorePoolSize(50);
    taskExecutor.setQueueCapacity(30);
    taskExecutor.setWaitForTasksToCompleteOnShutdown(true);
    taskExecutor.afterPropertiesSet();
    return taskExecutor;
}
We partition into 30 chunks with a commit interval of 100 per thread. All 30 threads insert a row into BATCH_STEP_EXECUTION saying STARTED. A few complete in seconds with the same number of records, while the others wait and take 3 to 7 minutes. The application runs with 6 GB of memory on 64-bit Linux with 96 CPUs.
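A side note on the executor semantics, since the queue capacity (30) matches the partition count: a ThreadPoolExecutor only grows past its core size once the queue is full, and with 30 partitions against a core size of 50 every partition gets a thread immediately, so the executor itself is unlikely to be the bottleneck. A plain java.util.concurrent sketch (not the Spring config above) demonstrating both behaviours:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolDemo {

    // Submits `tasks` blocking tasks to an executor with the given
    // core/max/queue settings and returns the resulting pool size.
    static int observedPoolSize(int core, int max, int queue, int tasks) throws Exception {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor ex = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queue));
        for (int i = 0; i < tasks; i++) {
            ex.submit(() -> {
                try { release.await(); } catch (InterruptedException ignored) {}
            });
        }
        int size = ex.getPoolSize();   // workers are created synchronously in execute()
        release.countDown();
        ex.shutdown();
        ex.awaitTermination(10, TimeUnit.SECONDS);
        return size;
    }

    public static void main(String[] args) throws Exception {
        // 2 core threads, queue of 2, max 4: submitting 6 blocking tasks fills
        // the queue, so 2 extra threads are created on top of the 2 core ones.
        System.out.println(observedPoolSize(2, 4, 2, 6));    // prints 4
        // 30 tasks against core=50: every task gets a thread, nothing queues.
        System.out.println(observedPoolSize(50, 80, 30, 30)); // prints 30
    }
}
```

Since all 30 partitions start at once, the 3-7 minute stragglers more likely point to contention on the database side (e.g. lock or I/O waits during the inserts) than to thread starvation in the executor.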
Server Configuration
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8160M CPU @ 2.10GHz
Stepping: 4

Related

Separate Apache Kafka clusters unreachable at the same time - kafka_network_socketserver_networkprocessoravgidlepercent goes to zero

We have 4 Kafka clusters:
ENV1: 6 brokers and 3 zookeepers
ENV2: 6 brokers and 3 zookeepers
ENV3: 8 brokers (on 2 DCs, 4-4 brokers) and 9 zookeepers (on 3 DCs, 3-3-3 nodes)
ENV4: 16 brokers (on 2 DCs, 8-8 brokers) and 9 zookeepers (on 3 DCs, 3-3-3 nodes)
All of the Kafka brokers are on version 2.7.0, and all of the ZK nodes are on version 3.4.13. Every Kafka broker and ZK node is a VM. Each of the four environments runs in a separate subnet. Swap is turned off everywhere. All the clusters are Kerberized against a shared, highly available AD, which contains 7 Kerberos servers.
VM parameters:
ENV1:
Kafka brokers:
16 GB RAM,
8 vCPU,
1120 GB Hard Disk,
RHEL 7.9
ZK nodes:
4 GB RAM,
2 vCPU,
105 GB Hard Disk,
RHEL 8.5
ENV2:
Kafka brokers:
16 GB RAM,
8 vCPU,
1120 GB Hard Disk,
RHEL 7.9
ZK nodes:
4 GB RAM,
2 vCPU,
105 GB Hard Disk,
RHEL 8.5
ENV3:
Kafka brokers:
24 GB RAM,
8 vCPU,
2120 GB Hard Disk,
RHEL 7.9
ZK nodes:
8 GB RAM,
2 vCPU,
200 GB Hard Disk,
RHEL 8.5
ENV4:
Kafka brokers:
24 GB RAM,
8 vCPU,
7145 GB Hard Disk,
RHEL 7.9
ZK nodes:
8 GB RAM,
2 vCPU,
200 GB Hard Disk,
RHEL 7.9
We have the following issue on every environment at the same time, 3-4 times a day, for a few seconds: the kafka_network_socketserver_networkprocessoravgidlepercent metric drops to zero on every broker and the cluster becomes unreachable; even the brokers cannot communicate with each other while this happens. Here is a picture of it from our Grafana dashboard:
We can see the following ERRORs in the server log, but we suspect all of them are just consequences:
ERROR Error while processing notification change for path = /kafka-acl-changes (kafka.common.ZkNodeChangeNotificationListener)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$3(ZooKeeperClient.scala:270)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:258)
at kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:252)
at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:252)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1730)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1700)
at kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1695)
at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:83)
at kafka.common.ZkNodeChangeNotificationListener$ChangeNotification.process(ZkNodeChangeNotificationListener.scala:120)
at kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread.doWork(ZkNodeChangeNotificationListener.scala:146)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
ERROR [ReplicaManager broker=1] Error processing append operation on partition topic-4 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.OutOfOrderSequenceException: Out of order sequence number for producerId 461002 at offset 5761036 in partition topic-4: 1022 (incoming seq. number), 1014 (current end sequence number)
ERROR [KafkaApi-5] Number of alive brokers '0' does not meet the required replication factor '3' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)
ERROR [KafkaApi-11] Error when handling request: clientId=broker-12-fetcher-0, correlationId=8972304, api=FETCH, version=12, body={replica_id=12,max_wait_ms=500,min_bytes=1,max_bytes=10485760,isolation_level=0,session_id=683174603,session_epoch=47,topics=[{topic=__consumer_offsets,partitions=[{partition=36,current_leader_epoch=294,fetch_offset=1330675527,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}},{partition=25,current_leader_epoch=288,fetch_offset=3931235489,last_fetched_epoch=-1,log_start_offset=0,partition_max_bytes=1048576,_tagged_fields={}}],_tagged_fields={}}],forgotten_topics_data=[],rack_id=,_tagged_fields={}} (kafka.server.KafkaApis)
org.apache.kafka.common.errors.NotLeaderOrFollowerException: Leader not local for partition topic-42 on broker 11
ERROR [GroupCoordinator 5]: Group consumer-group_id could not complete rebalance because no members rejoined (kafka.coordinator.group.GroupCoordinator)
ERROR [Log partition=topic-1, dir=/kafka/logs] Could not find offset index file corresponding to log file /kafka/logs/topic-1/00000000000001842743.log, recovering segment and rebuilding index files... (kafka.log.Log)
ERROR [ReplicaFetcher replicaId=16, leaderId=10, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=16, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1439560581, epoch=1138570), rackId=) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 10 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)
at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:211)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:301)
at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:136)
at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:135)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:118)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
I will update the question if needed, e.g. with relevant Kafka configs.
We know it might be an issue with our infrastructure, but we cannot see a problem with the network, nor on the Kerberos side, which is why I'm asking for help here. Do you have any idea what may cause this issue? Every idea may be helpful, because we have run out of them.
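Since the errors all point at ZooKeeper session expiry, one thing we plan to review is the broker-side ZooKeeper timeouts. Shown below with what we believe are the Kafka 2.7 defaults (the session timeout was raised from 6 s to 18 s by KIP-537 in Kafka 2.5; please correct us if these values are wrong):

```properties
# server.properties -- broker-side ZooKeeper timeouts (assumed 2.7 defaults)
zookeeper.session.timeout.ms=18000
zookeeper.connection.timeout.ms=18000
```

If GC pauses or network blips on the VMs approach these values, sessions expire cluster-wide, which would match the symptom of all brokers dropping at once.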
Thanks in advance!

bitnami helm chart fails to launch pods

My VM has the configuration below, but when I install bitnami/dokuwiki (or any other chart) from the Bitnami repository and run the deployment, the pods stay Pending or go into CrashLoopBackOff. Can someone help in this regard?
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2095.077
BogoMIPS: 4190.15
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 4 MiB
L3 cache: 143 MiB
NUMA node0 CPU(s): 0-3
Issue:
I tried applying a PV, but the pods are still not running. I want to get these pods running.
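In case the Pending pods are waiting on storage, a minimal hostPath PersistentVolume might look like the sketch below. The name, capacity, and storageClassName here are illustrative and must be adjusted to match what `kubectl describe pvc` reports for the pending claim:

```yaml
# Hypothetical hostPath PV for a single-node cluster; values must match
# the chart's PVC (check with: kubectl describe pvc)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dokuwiki-pv
spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data/dokuwiki
```

`kubectl describe pod <pod>` and `kubectl get events` will say whether the actual blocker is unbound storage or something else (image pulls, resource limits).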

High CPU usage of PostgreSQL

I have a PostgreSQL-backed, complex Ruby on Rails application running on an Ubuntu virtual machine. The Postgres processes show very high %CPU values when running top; periodically the %CPU goes up to 94 or 95.
lscpu
gives the following output:
Architecture: i686
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Stepping: 4
CPU MHz: 2100.000
BogoMIPS: 4200.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
top -n1
top -c
I want to know the reason for the high CPU utilization by Postgres.
Any help is appreciated.
Thanks in Advance!!
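Update: to see which statements are actually burning the CPU, I am planning to run a query like this against the standard pg_stat_activity view (a generic sketch, nothing application-specific):

```sql
-- Currently running statements, longest-running first.
-- On PostgreSQL 9.2+ the columns are pid/state/query;
-- older versions use procpid and current_query instead.
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```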

pg_top output analysis of puppetdb with postgres

I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s,
21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: is 1.5 million row r/s a lot? If so, what can be done to improve it? I am running puppetdb 2.3.8, with 6.8 million resources, 2500 nodes, and Postgres 9.1. All of this runs on a single 24-core box with 64 GB of memory.
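1.5 million rows read per second alongside an 86% buffer hit rate often points at large sequential scans. A query like the following (standard PostgreSQL statistics views, nothing puppetdb-specific; a sketch) would show which tables are being scanned sequentially and how many rows those scans read:

```sql
-- Tables ordered by rows read via sequential scans;
-- a huge seq_tup_read next to a low idx_scan suggests a missing index.
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```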

KVM and libvirt: wrong CPU type in virtual host

We use KVM and libvirt on a 6-core (12 HT cores) machine for virtualization.
Problem: wrong CPU type in virtual host.
KVM, libvirt, and kernel versions used:
libvirt version: 0.9.8
QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008 Fabrice Bellard
Ubuntu 12.04.1 LTS
kernel: 3.2.0-32-generic x86_64
/usr/share/libvirt/cpu_map.xml does not support CPU types more recent than Westmere.
Do I need this kind of CPU virtualization at all? For several reasons we need maximum CPU performance in the virtual hosts. I would be glad to have some cores of the server's i7-3930K CPU @ 3.20GHz available in my virtual machines.
Maybe we do too much virtualization...?
My virtual host's XML looks like the following; where can I set the `-cpu host` flag?
<domain type='kvm'>
<name>myVirtualServer</name>
<uuid>2344481d-f455-455e-9558</uuid>
<description>Test-Server</description>
<memory>4194304</memory>
<currentMemory>4194304</currentMemory>
<vcpu>2</vcpu>
<cpu match='exact'>
<model>Westmere</model>
<vendor>Intel</vendor>
</cpu>
<os>
<type arch='x86_64' machine='pc-1.0'>hvm</type>
<boot dev='hd'/>
</os>
<features>
<acpi/>
<apic/>
<pae/>
</features>
$ lscpu of physical Server with 6 (12) cores with HT
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Stepping: 7
CPU MHz: 1200.000
BogoMIPS: 6400.05
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11
$ lscpu of virtual Server (wrong CPU type, wrong L2-Cache, wrong MHz)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 15
Stepping: 11
CPU MHz: 3200.012
BogoMIPS: 6400.02
Virtualisation: VT-x
Hypervisor vendor: KVM
Virtualisation type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0,1
In the client's XML:
<cpu mode='custom' match='exact'>
<model fallback='allow'>core2duo</model>
<feature policy='require' name='vmx'/>
</cpu>
as an example. Edit with virsh edit, then restart the guest.
EDIT. Ignore this. I've just re-read your question and you're already doing that.
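Since a custom model is already in use, the remaining option for maximum performance is CPU host passthrough, which exposes the host CPU model directly to the guest (the equivalent of qemu's `-cpu host`). To my knowledge this mode requires libvirt 0.9.11 or newer, so it would need an upgrade from the 0.9.8 listed above; a sketch of the domain XML fragment:

```xml
<!-- Replaces the <cpu match='exact'> block; needs libvirt >= 0.9.11 -->
<cpu mode='host-passthrough'/>
```

The trade-off is that host-passthrough guests generally cannot be live-migrated to hosts with a different CPU, which may or may not matter in your setup.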