When I create an iSCSI target containing two LUNs (bdevs), the two LUNs are mapped to two disks on the initiator. When I use fio to read and write both disks, the iSCSI target uses a single thread (or core) to perform all of the I/O.
Operations:
./scripts/rpc.py bdev_malloc_create -b Malloc0 64 512
./scripts/rpc.py bdev_malloc_create -b Malloc1 64 512
./scripts/rpc.py --verbose DEBUG iscsi_create_portal_group 1 172.20.20.156:3261
./scripts/rpc.py --verbose DEBUG iscsi_create_initiator_group 2 ANY 172.20.20.156/24
./scripts/rpc.py --verbose DEBUG iscsi_create_target_node disk1 "Data Disk1" "Malloc0:0 Malloc1:1" 1:2 64 -d
iscsiadm -m discovery -t sendtargets -p 172.20.20.156:3261
iscsiadm -m node --targetname iqn.2016-06.io.spdk:disk1 --portal 172.20.20.156:3261 --login
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sdd -name="BS 512B read test" -iodepth=2
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sde -name="BS 512B read test" -iodepth=2
[screenshot: iSCSI target log output, with the added log line circled in red]
The log line circled in red above was added by me. When I read and write the two disks at the same time, the thread does not change.
Can't the read and write operations of these two disks be performed on two different threads?
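If it helps to experiment: as far as I understand SPDK, the iSCSI target schedules whole connections onto cores (one poll group per core), so two LUNs reached through a single session end up on the same thread. A rough sketch of one way to involve two cores, where the iscsi_tgt binary path and the 0x3 core mask are assumptions and the rpc.py syntax just mirrors the commands above:
# start the target with two cores so connections can be spread across them (path and mask are assumptions)
./build/bin/iscsi_tgt -m 0x3
# one target node per bdev, so each disk is reached over its own session/connection
./scripts/rpc.py iscsi_create_target_node disk1 "Data Disk1" "Malloc0:0" 1:2 64 -d
./scripts/rpc.py iscsi_create_target_node disk2 "Data Disk2" "Malloc1:0" 1:2 64 -d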
I'm going mad with a stupid issue I can't solve.
During the testing of my Yocto project I always used connmanctl in order to connect my board to the internet.
Now I am going to release the product, but before doing so I am working on an "internet connection manager".
I guess I can't use connmanctl anymore, since it is an interactive command (isn't it?), so I'm going to use wpa_supplicant directly.
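(Side note on the "isn't it?": connmanctl can also be run non-interactively by passing the subcommand on the command line, at least for networks whose credentials have already been provisioned; the long service identifier below is only a placeholder.)
connmanctl scan wifi
connmanctl services
# connect to an already-provisioned service; the identifier is a placeholder, not a real one
connmanctl connect wifi_0123456789ab_placeholder_managed_psk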
In my script I edit wpa_supplicant.conf as follows:
root@localhost:~# cat /etc/wpa_supplicant/wpa_supplicant.conf
ctrl_interface=/var/run/wpa_supplicant
ctrl_interface_group=0
update_config=1
bgscan=""
network={
ssid="Obi_Lan_Kenobi"
psk="TheForceIsStrongWithThisOne"
}
After that I try to start wpa_supplicant with this command:
wpa_supplicant -B -i mlan0 -c /etc/wpa_supplicant/wpa_supplicant.conf wext
As a result of this command I get:
Successfully initialized wpa_supplicant
But if I try to ping google.com (or any other website) I see that the network doesn't work. In particular I get this message: ping: sendto: Network is unreachable
Everything is working under connmanctl, but not under wpa_supplicant.
The strange thing is that, running the iw command, everything seems to be configured correctly:
root@localhost:~# iw dev mlan0 link
Connected to 56:0c:ff:37:1a:69 (on mlan0)
SSID: Obi_Lan_Kenobi
freq: 2412
RX: 32154 bytes (310 packets)
TX: 19436 bytes (128 packets)
signal: -38 dBm
rx bitrate: 1.0 MBit/s
tx bitrate: 72.2 MBit/s MCS 7 short GI
bss flags: short-preamble short-slot-time
dtim period: 2
beacon int: 100
I honestly can’t understand why.
Does anybody have a suggestion about that?
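For context, wpa_supplicant only handles the Wi-Fi association and authentication; it does not assign an IP address or install a default route, which is usually what "Network is unreachable" points to (connman did that part behind the scenes). A rough sketch of the full sequence, where the connman service name, the -D nl80211 driver choice and the BusyBox udhcpc client are assumptions about the image:
# stop connman so it does not manage mlan0 at the same time (assumes systemd and this service name)
systemctl stop connman
# associate with the AP; -D selects the driver backend explicitly
wpa_supplicant -B -i mlan0 -D nl80211 -c /etc/wpa_supplicant/wpa_supplicant.conf
# layer 3 still has to come from somewhere, e.g. a DHCP client
udhcpc -i mlan0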
I have encountered some mysterious network errors in our kubernetes cluster. Although I originally encountered these errors using ingress, there are even more errors when I bypass our load balancer, bypass kube-proxy and bypass nginx-ingress; the most errors occur when going directly to services and straight to the pod IPs. I believe this is because the load balancer and nginx have somewhat better error handling than the raw iptables routing.
To test the error I use apache benchmark from a VM on the same subnet, at any concurrency level, with no keep-alive, connecting to the pod IP and using a high enough request count to give me time to scale a deployment up or down during the run. The odd thing is that it doesn't matter at all which deployment I modify: it always causes the same set of errors, even when it's not related to the pod I am hitting. ANY addition or removal of pods triggers apache benchmark errors. Manual deletions, scaling up/down and auto-scaling all trigger errors. If there are no pod changes while the ab test is running, no errors get reported. Note that keep-alive does seem to greatly reduce, if not eliminate, the errors, but I only tested that a handful of times and never saw an error.
Other than some bizarre iptables conflict, I really don't see how deleting pod A can affect the network connections of pod B. Since the errors are brief and go away within seconds, it seems more like a brief network outage.
Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/
Errors when using HTTPS:
SSL handshake failed (5).
SSL read failed (5) - closing connection
Errors when using HTTP:
apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)
Example ab output (I pressed Ctrl-C after encountering the first errors):
$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C
Server Software: nginx
Server Hostname: 10.112.0.24
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 2575 bytes
Concurrency Level: 2
Time taken for tests: 21.670 seconds
Complete requests: 1824
Failed requests: 2
(Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred: 5142683 bytes
HTML transferred: 4694225 bytes
Requests per second: 84.17 [#/sec] (mean)
Time per request: 23.761 [ms] (mean)
Time per request: 11.881 [ms] (mean, across all concurrent requests)
Transfer rate: 231.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 15 9.8 12 82
Processing: 1 9 9.0 6 130
Waiting: 0 8 8.9 6 129
Total: 7 23 14.4 19 142
Percentage of the requests served within a certain time (ms)
50% 19
66% 24
75% 28
80% 30
90% 40
95% 54
98% 66
99% 79
100% 142 (longest request)
Current sysctl settings that may be relevant:
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050 65500
I didn't see any conntrack "full" errors, and as best I can tell there is no packet loss. We recently upgraded from 1.14 and didn't notice the issue then, but I can't say for certain it wasn't there. I believe we will be forced to migrate away from romana soon since it doesn't seem to be maintained anymore, and as we upgrade to kube 1.16.x we are encountering problems with it starting up.
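For reference, these are the kinds of checks I mean (assuming conntrack-tools and net-tools are installed on the node):
# kernel messages like "nf_conntrack: table full, dropping packet"
dmesg | grep -i conntrack
# conntrack statistics; non-zero insert_failed/drop counters would point at conntrack races
conntrack -S
# rough packet-loss indicator via TCP retransmission counters
netstat -s | grep -i retrans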
I have searched the internet all day today looking for similar problems and the closest one that resembles ours is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 but I have no idea how to implement the iptables masquerade --random-fully option given that we use romana, and I read (https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153) that --random-fully is the default for Linux kernel 5, which we are using. Any ideas?
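One thing that can be checked directly, regardless of the CNI, is whether the MASQUERADE rules on a node already carry that flag:
# dump the NAT POSTROUTING chain and look for --random-fully on MASQUERADE targets
iptables -t nat -S POSTROUTING | grep -i masquerade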
kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
====== Update Nov 5, 2019 ======
It has been suggested to test an alternate CNI. I chose calico since we used that in an older Debian-based kube cluster. I rebuilt a VM with our most basic CentOS 7 template (vSphere), so there is a little baggage coming from our customizations. I can't list everything we customized in our template, but the most notable change is the kernel 5 upgrade (yum --enablerepo=elrepo-kernel -y install kernel-ml).
After starting up the VM these are the minimal steps to install kubernetes and run the test:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install docker-ce-3:18.09.6-3.el7.x86_64
systemctl start docker
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0
systemctl enable --now kubelet
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  replicas: 1
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: nginx
        image: nginxdemos/hello
        ports:
        - containerPort: 80
EOF
# wait for control plane to become healthy
kubectl apply -f /tmp/test-deploy.yml
Now the setup is ready and this is the ab test:
$ docker run --rm jordi/ab -n 100 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed
The ab test gives up after this error. If I decrease the number of requests to avoid the timeout, this is what you see:
$ docker run --rm jordi/ab -n 10 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient).....done
Server Software: nginx/1.13.8
Server Hostname: 192.168.4.4
Server Port: 80
Document Path: /
Document Length: 7227 bytes
Concurrency Level: 1
Time taken for tests: 0.029 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 74140 bytes
HTML transferred: 72270 bytes
Requests per second: 342.18 [#/sec] (mean)
Time per request: 2.922 [ms] (mean)
Time per request: 2.922 [ms] (mean, across all concurrent requests)
Transfer rate: 2477.50 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 1 3
Processing: 1 2 1.2 1 4
Waiting: 0 1 1.3 0 4
Total: 1 3 1.4 3 5
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 4
80% 5
90% 5
95% 5
98% 5
99% 5
100% 5 (longest request)
This issue is technically different from the original one I reported, but it is a different CNI and there are still network issues. It does have the timeout error in common with the kube/romana cluster when I run the same test there, i.e. running the ab test on the same node as the pod. Both encounter the same timeout error, but with romana I could get a few thousand requests to finish before hitting the timeout, whereas calico hits the timeout error before reaching a dozen requests.
Other variants or notes:
- net.netfilter.nf_conntrack_tcp_be_liberal=0/1 doesn't seem to make a difference
- higher -n values sometimes work but it is largely random.
- running the 'ab' test at low -n values several times in a row can sometimes trigger the timeout
At this point I am pretty sure it is some issue with our CentOS installation, but I can't even guess what it could be. Are there any other limits, sysctls or other configs that could cause this?
====== Update Nov 6, 2019 ======
I noticed that we had an older kernel installed on it, so I upgraded my kube/calico test VM to the same newer kernel, 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timeout errors.
Now that the timeout error is gone, I was able to repeat the original test where I load test pod A and kill pod B on my vSphere VMs. In the romana environments the problem still exists, but only when the load test runs on a different host from the one where pod A is located; if I run the test on the same host, there are no errors at all. With calico instead of romana there are no load test errors on either host, so the problem is gone. There may still be some setting to tweak that would help romana, but I think this is "strike 3" for romana, so I will start transitioning a full environment to calico and do some acceptance testing there to ensure there are no hidden gotchas.
You mentioned that if there are no pod changes while the ab test is running, then no errors get reported. So that means the errors occur when you add or delete a pod.
This is normal behaviour: when a pod gets deleted, it takes time for the iptables rule changes to propagate. It may happen that the container has already been removed but the iptables rules have not been updated yet, so packets are still being forwarded to the nonexistent container, and this causes errors (it is a sort of race condition).
The first thing you can do is to always define a readiness probe, as it makes sure that traffic is not forwarded to the container until it is ready to handle requests.
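For example, a minimal HTTP readiness probe could look like the snippet below; the path, port and timings are placeholders to adapt to your application:
readinessProbe:
  httpGet:
    path: /        # endpoint that only answers once the app can serve traffic
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 5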
The second thing to do is to handle deletion of the container properly. This is a somewhat harder task because it can be handled at many levels, but the easiest thing you can do is to add a preStop hook to your container like this:
lifecycle:
  preStop:
    exec:
      command:
      - sh
      - -c
      - "sleep 5"
The preStop hook is executed at the moment of the pod deletion request. From that moment, k8s starts changing the iptables rules and should stop forwarding new traffic to the container that is about to be deleted. While sleeping, you give k8s some time to propagate the iptables changes in the cluster without interrupting already existing connections. After the preStop handler exits, the container receives the SIGTERM signal.
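One detail to keep in mind (general Kubernetes behaviour, not specific to this setup): the time spent in the preStop hook counts against the pod's terminationGracePeriodSeconds (30 seconds by default), so a longer sleep may require raising it in the pod spec:
spec:
  terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus normal shutdown time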
My suggestion would be to apply both of these mechanisms together and check if it helps.
You also mentioned that bypassing the ingress causes more errors. I would assume that this is because the ingress has a retry mechanism: if it is unable to open a connection to a container, it will try several times and hopefully get to a container that can handle the request.
Our Varnish Instance
/usr/sbin/varnishd -P /var/run/varnish.pid -a :6081 -f /etc/varnish/cm-varnish.vcl -T 127.0.0.1:6082 -t 1h -u varnish -g varnish -S /etc/varnish/secret -s malloc,24G -p shm_reclen 10000 -p http_req_hdr_len 10000 -p thread_pool_add_delay 2 -p thread_pools 8 -p thread_pool_min 500 -p thread_pool_max 4000 -p sess_workspace 1073741824
32 GB RAM, 16-core processor, and we allocate 24 GB of memory to varnish.
The average uptime of our varnish instance is about 3 hours, which is very low. Our cache TTL is 1 hour and the grace time is 2 hours. Every 5 minutes we refresh the cache contents [for objects with more than n hits] through a Java process. We track varnish hits by constantly polling the varnishncsa output.
I tried varnishadm panic.show:
Last panic at: Thu, 23 May 2013 09:14:42 GMT
Assert error in WSLR(), cache_shmlog.c line 220:
Condition(VSL_END(w->wlp, l) < w->wle) not true.
thread = (cache-worker)
ident = Linux,2.6.18-238.el5,x86_64,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x42dc76: /usr/sbin/varnishd [0x42dc76]
0x432d1f: /usr/sbin/varnishd(WSLR+0x27f) [0x432d1f]
0x42a667: /usr/sbin/varnishd [0x42a667]
0x42a89e: /usr/sbin/varnishd(http_DissectRequest+0xee) [0x42a89e]
0x4187d1: /usr/sbin/varnishd(CNT_Session+0x741) [0x4187d1]
0x42f706: /usr/sbin/varnishd [0x42f706]
0x3009c0673d: /lib64/libpthread.so.0 [0x3009c0673d]
0x30094d40cd: /lib64/libc.so.6(clone+0x6d) [0x30094d40cd]
Any input on what we are missing?
My best guess is that you have a very long cookie string (or other custom headers) so that it overflows http_req_hdr_len. I remember reading something about such a bug that was fixed but, AFAIK, not yet released in a stable version. I'm afraid I don't have better sources than my own memory at hand.
You also have a very high sess_workspace and a very high possible total number of threads. In most setups that does less for performance than it does to increase the risk of swapping.
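If you want to test that theory, a rough way is to check the effective limits and watch how large the incoming request headers actually are; the varnishadm flags below just match the options the daemon was started with, and RxHeader is the Varnish 3 log tag for client request headers:
# show the currently effective limits
varnishadm -S /etc/varnish/secret -T 127.0.0.1:6082 param.show http_req_hdr_len
varnishadm -S /etc/varnish/secret -T 127.0.0.1:6082 param.show shm_reclen
# print unusually long client request header lines
varnishlog -c -i RxHeader | awk 'length($0) > 4000'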
I am again rephrasing the issue that we are facing:
We are creating link aggregations [dlmp groups] with two interfaces named net0 & net5:
# dladm create-aggr -m dlmp -l net0 -l net5 -l net2 aggr1
Setting probe targets for aggr1:
# dladm set-linkprop -p probe-ip=+ aggr1
Setting failure detection time:
# dladm set-linkprop -p probe-fdt=15 aggr1
After this we create an IP interface on this aggregation as follows:
# ipadm create-ip aggr1
Then we assign an IP address to it:
# ipadm create-addr -T static -a x.x.x.x/y aggr1/addr
Then we check the status using dladm and ipadm; everything seems up and running.
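The status checks referred to above are along these lines (the link name follows the creation commands; output omitted):
# link-layer view of the aggregation and its ports
dladm show-aggr -x aggr1
# IP interface and address state
ipadm show-if aggr1
ipadm show-addr aggr1/addr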
Then we tested a scenario where we detached the cables from the above network interfaces, but what we got is as follows:
# dladm show-aggr -x
LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE
traf0 -- 100Mb unknown up 0:10:e0:5b:69:1 --
net0 100Mb unknown down 0:10:e0:5b:69:1 attached
net5 100Mb unknown down a0:36:9f:45:de:9d attached
The first issue is that we are getting the state of the link "traf0" as up in the above command output; secondly, in the output of "ipadm":
traf0 ip ok -- --
traf0/addr static ok -- 7.8.0.199/16
We are getting the status of traf0 as ok.
So here I have a query: is there any configuration with which we could get the right status of traf0 in both the dladm and ipadm output?
[One more thing to add here: when we don't assign any IP to this traf0 aggregation, then on detaching the cables we get the right output from the dladm command.]
Apart from this configuration, we are using these aggregations as VNICs in zones. There too we are getting the status of these links as up in the ipadm command output [after detaching the cables].
A small update:
We have set the "TRACK_INTERFACES_ONLY_WITH_GROUPS" parameter in /etc/default/mpathd to no and are now getting the state of "traf0" in the ipadm command as failed, but we still get traf0/addr as ok.
traf0 ip failed -- --
traf0/addr static ok -- 7.8.0.199/16