Varnish Restart Trace - restart

Our Varnish Instance
/usr/sbin/varnishd -P /var/run/varnish.pid -a :6081 -f /etc/varnish/cm-varnish.vcl -T 127.0.0.1:6082 -t 1h -u varnish -g varnish -S /etc/varnish/secret -s malloc,24G -p shm_reclen 10000 -p http_req_hdr_len 10000 -p thread_pool_add_delay 2 -p thread_pools 8 -p thread_pool_min 500 -p thread_pool_max 4000 -p sess_workspace 1073741824
32G Ram, 16 Core Processor and We allocate 24GB of memory for varnish
Average uptime of our varnish instance remains 3hrs which is very much low. Our Cache TTL is 1Hr and Grace time is 2Hrs. Every 5 min once we generally refresh the cache contents [with more than n hits] through a java process. We track hits of varnish by constanly polling varnishncsa output.
I tried varnishadm panic.show
Last panic at: Thu, 23 May 2013 09:14:42 GMT
Assert error in WSLR(), cache_shmlog.c line 220:
Condition(VSL_END(w->wlp, l) < w->wle) not true.
thread = (cache-worker)
ident = Linux,2.6.18-238.el5,x86_64,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x42dc76: /usr/sbin/varnishd [0x42dc76]
0x432d1f: /usr/sbin/varnishd(WSLR+0x27f) [0x432d1f]
0x42a667: /usr/sbin/varnishd [0x42a667]
0x42a89e: /usr/sbin/varnishd(http_DissectRequest+0xee) [0x42a89e]
0x4187d1: /usr/sbin/varnishd(CNT_Session+0x741) [0x4187d1]
0x42f706: /usr/sbin/varnishd [0x42f706]
0x3009c0673d: /lib64/libpthread.so.0 [0x3009c0673d]
0x30094d40cd: /lib64/libc.so.6(clone+0x6d) [0x30094d40cd]
Any inputs on what do we miss?

My best guess is that you have a very long cookie string (or other custom headers) so that it overflows the http_req_hdr_len. I remember reading something about such a bug that was fixed but afaik not released in a stable version. I'm afraid I don't have better sources than my own memory at hand.
You also have a very high sess_workspace and total number of threads possible. That does less for performance than it does in risking swapping in most setups.

Related

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get timeseries
t0, misses
...
tN, misses
where tN is a timestamp (second-resolution) and misses is a number of times the kernel made disk-IO for my PID to load missing page of the mmap()-ed memory region when process did access to that memory. Ok, maybe connection between disk-IO and memory-access is harder to track, lets assume my program can not do any disk-io with another (than assessing missing mmapped memory) reason. I THINK, I need to track something called node-load-misses in perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
Tried to use perf record for similar purpose: I dislike how much data perf records. As I recall the try was like (also I dont remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could give some facility to implement such thing in less overhead way.
Loading of mmaped pages which are not present in memory is not hardware event like perf's cache-misses or node-loads or node-load-misses. When your program assess not present memory address, GPFault/pagefault exception is generated by hardware and it is handled in software by Linux kernel codes. For first access to anonymous memory physical page will be allocated and mapped for this virtual address; for access of mmaped file disk I/O will be initiated. There are two kinds of page faults in linux: minor and major, and disk I/O is major page fault.
You should try to use trace-cmd or ftrace or perf trace. Support of fault tracing was planned for perf tool in 2012, and patches were proposed in https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with memory address of page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record which can record both mmap syscalls and page_fault_user into perf.data and perf script will print all events and they can be counted by some awk or python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
And for simple time-series stat you can use perf stat -I 1000 command with correct software events
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
# time counts unit events
1.000112251 413.59 msec cpu-clock # 0.414 CPUs utilized
1.000112251 5,361 page-faults # 0.013 M/sec
1.000112251 5,301 minor-faults # 0.013 M/sec
1.000112251 60 major-faults # 0.145 K/sec
2.000490561 16.32 msec cpu-clock # 0.016 CPUs utilized
2.000490561 1 page-faults # 0.005 K/sec
2.000490561 1 minor-faults # 0.005 K/sec
2.000490561 0 major-faults # 0.000 K/sec

Docker swarm network latency with mesh and DNSRR

I have a 3 node docker swarm.
One stack deployed is a database cluster with 3 replicas. (MariaDB Galera)
Another stack deployed is a web application with 2 replicas.
The web application looks like this:
version: '3'
networks:
web:
external: true
galera_network:
external: true
services:
application:
image: webapp:latest
networks:
- galera_network
- web
environment:
DB_HOST: galera_node
deploy:
replicas: 2
FWIW, the web network is what traefik is hooked up to.
The issue here is galera_node (used for the webapp's database host) resolves to a VIP that ends up leveraging swarm's mesh routing (as far as I can tell) and that adds extra latency when the mesh routing ends up going over the WAN instead of resolving to the galera_node container that is deployed on the same physical host.
Another option I've found is to use tasks.galera_node, but that seems to use DNSRR for the 3 galera cluster containers. So 33% of the time, things are good and fast... but the rest of the time, I have unnecessary latency added to the mix.
These two behaviors look to be documented as what we'd expect from the different endpoint_mode options. Reference: Docker endpoint_mode
To illustrate the latency, notice when pinging from within the webapp container:
Notice the IP addresses that are resolving for each ping along with the response time.
### hitting VIP that "masks" the fact that there is extra latency
### behind it depending on where the mesh routing sends the traffic.
root#294114cb14e6:/var/www/html# ping galera_node
PING galera_node (10.0.4.16): 56 data bytes
64 bytes from 10.0.4.16: icmp_seq=0 ttl=64 time=0.520 ms
64 bytes from 10.0.4.16: icmp_seq=1 ttl=64 time=0.201 ms
64 bytes from 10.0.4.16: icmp_seq=2 ttl=64 time=0.153 ms
--- galera_node ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.153/0.291/0.520/0.163 ms
### hitting DNSRR that resolves to worst latency server
root#294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.241): 56 data bytes
64 bytes from 10.0.4.241: icmp_seq=0 ttl=64 time=60.736 ms
64 bytes from 10.0.4.241: icmp_seq=1 ttl=64 time=60.573 ms
--- tasks.galera_node ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 60.573/60.654/60.736/0.082 ms
### hitting DNSRR that resolves to local galera_node container
root#294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.242): 56 data bytes
64 bytes from 10.0.4.242: icmp_seq=0 ttl=64 time=0.133 ms
64 bytes from 10.0.4.242: icmp_seq=1 ttl=64 time=0.117 ms
--- tasks.galera_node ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.117/0.125/0.133/0.000 ms
### hitting DNSRR that resolves to other "still too much" latency server
root#294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.152): 56 data bytes
64 bytes from 10.0.4.152: icmp_seq=0 ttl=64 time=28.218 ms
64 bytes from 10.0.4.152: icmp_seq=1 ttl=64 time=40.912 ms
64 bytes from 10.0.4.152: icmp_seq=2 ttl=64 time=26.293 ms
--- tasks.galera_node ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 26.293/31.808/40.912/6.486 ms
The only way I've been able to get decent performance that bypasses the latency is to hard code the IP address of the local container, but that is obviously not a long-term solution as containers should be treated as ephemeral things.
I totally get that I might need to rethink my geographic node locations due to this latency, and there might be some other performance tuning things I can do. It seems like there should be a way to enforce my desired behavior, though.
I essentially want to bypass DNSRR and the VIP/mesh routing behavior when a local container is available to service the given request.
So the question is:
How can I have each replica of my webapp only hit the local swarm host's galera container without hard coding that container's IP address?
If anyone else is fighting with this sort of issue, I wanted to post a solution (though I wouldn't necessarily call it an "answer" to the actual question) that is more of a workaround than something I'm actually happy with.
Inside of my webapp, I can use galera_node as my database host and it resolves to the mesh routing VIP that I mentioned above. This gives me functionality no matter what, so if my workaround gets tripped up I know that my connectivity is still in tact.
I whipped up a little bash script that I could call as a cron job and give me the results that I want. It could be used for other use cases that stem from this same issue.
It takes in three parameters:
$1 = database container name
$2 = database network name
$3 = webapp container name
The script looks for the container name, finds its IP address for the specified network, and then adds that container name and IP address to the webapp container's /etc/hosts file.
This works because the container name is also galera_node (in my case) so adding it to the hosts file just overrides the hostname that docker resolves to the VIP.
As mentioned, I don't love this, but it does seem to work for my purposes and it avoids me having to hardcode IP addresses and manually maintain them. I'm sure there are some scenarios that will require tweaks to the script, but it's a functional starting point.
My script: update_container_hosts.sh
#!/bin/bash
HOST_NAME=$1
HOST_NETWORK=$2
CONTAINER_NAME=$3
FMT="{{(index (index .NetworkSettings.Networks \"$HOST_NETWORK\") ).IPAddress}}"
CONTAINERS=`docker ps | grep $CONTAINER_NAME | cut -d" " -f1`
HOST_ID=`docker ps | grep $HOST_NAME | cut -d" " -f1`
HOST_IP=$(docker inspect $HOST_ID --format="$FMT")
echo --- containers ---
echo $CONTAINERS
echo ------------------
echo host: $HOST_NAME
echo network: $HOST_NETWORK
echo ip: $HOST_IP
echo ------------------
for c in $CONTAINERS;
do
if [ "$HOST_IP" != "" ]
then
docker cp $c:/etc/hosts /tmp/hosts.tmp
IP_COUNT=`cat /tmp/hosts.tmp | grep $HOST_IP | wc -l`
rm /tmp/hosts.tmp
if [ "$IP_COUNT" = "0" ]
then
docker exec $c /bin/sh -c "echo $HOST_IP $HOST_NAME >> /etc/hosts"
echo "$c: Added entry to container hosts file."
else
echo "$c: Entry already exists in container hosts file. Skipping."
fi
fi
done
I wrote a PoC for adjusting the loadbalancer to exlude containers on other hosts. It adjusts the config of the virtual IP itself, so there is no need to change anything in the container filesystem. It needs to be rerun, on every node in the cluster, whenever a container is stopped or started. It takes one argument, that is the exposed port, it will then figure out the virtual IP and the IPs of the containers. It needs nsenter and ipvsadm. I thought someone may find it useful.
#!/bin/bash
port="$1"
if [ -z "$port" ]; then
echo "Please specify port"
exit 1
fi
echo "Collecting data"
INGRESS_IP=$(iptables -t nat -S DOCKER-INGRESS |grep -- "--dport $port "|cut -d\ -f 12|cut -d: -f1)
if [ -z "$INGRESS_IP" ]; then
echo "Can't find ingress IP"
exit 1
fi
echo "INRESS_IP = $INGRESS_IP"
FWM_HEX=$( nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -S PREROUTING|grep -- "--dport $port "|cut -d\ -f12|cut -d/ -f 1|cut -dx -f2)
FWM=$((16#$FWM_HEX))
echo "Firewall mark = $FWM"
declare -A LOCAL_CONTAINER_IPS
LOCAL_CONTAINERS=$(docker ps -q)
for c in $LOCAL_CONTAINERS; do
i=$(docker inspect $c|jq '.[0]["NetworkSettings"]["Networks"]["ingress"]["IPAMConfig"]["IPv4Address"]'|cut -d\" -f 2)
LOCAL_CONTAINER_IPS[$i]=1
done
LB_IPS=$(nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -S|grep -- "-a -f $FWM -r"|cut -d\ -f5|cut -d: -f1)
declare -A EXISTING_CONTAINER_IPS
echo "Checking for IPs to remove"
for i in $LB_IPS; do
EXISTING_CONTAINER_IPS[$i]=1
if [ ! ${LOCAL_CONTAINER_IPS[$i]+_} ]; then
echo "$i is not a local container IP, deleting"
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -d -f $FWM -r $i:0
fi
done
echo "Checking for IPs to add"
for i in "${!LOCAL_CONTAINER_IPS[#]}"; do
if [ ! ${EXISTING_CONTAINER_IPS[$i]+_} ]; then
echo "$i is a local container IP but not in the load balancer, adding"
nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -a -f $FWM -r $i:0 -m -w 1
fi
done
echo "done"

Kubernetes 1.15.5 and romana 2.0.2 getting network errors when ANY pods added or removed

I have encountered some mysterious network errors in our kubernetes cluster. Although I originally encountered these errors using ingress, there are even more errors when I bypass our load balancer, bypass kube-proxy and bypass nginx-ingress. The most errors are present when going directly to services and straight to the pod IPs. I believe this is because the load balancer and nginx have some better error handling than the raw iptable routing.
To test the error I use apache benchmark from VM on same subnet, any concurrency level, no keep-alive, connect to the pod IP and use a high enough request number to give me time to either scale up or scale down a deployment. Odd thing is it doesn't matter at all which deployment I modify since it always causes the same sets of errors even when its not related to the pod I am modifying. ANY additions or removals of pods will trigger apache benchmark errors. Manual deletions, scaling up/down, auto-scaling all trigger errors. If there are no pod changes while the ab test is running then no errors get reported. Note keep-alive does seem to greatly reduce if not eliminate the errors, but I only tested that a handful of times and never saw an error.
Other than some bizarre iptable conflict, I really don't see how deleting pod A can effect network connections of pod B. Since the errors are brief and go away within seconds it seems more like a brief network outage.
Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/
Errors when using HTTPS:
SSL handshake failed (5).
SSL read failed (5) - closing connection
Errors when using HTTP:
apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)
Example ab output. I ctl-C after encountering first errors:
$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C
Server Software: nginx
Server Hostname: 10.112.0.24
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 2575 bytes
Concurrency Level: 2
Time taken for tests: 21.670 seconds
Complete requests: 1824
Failed requests: 2
(Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred: 5142683 bytes
HTML transferred: 4694225 bytes
Requests per second: 84.17 [#/sec] (mean)
Time per request: 23.761 [ms] (mean)
Time per request: 11.881 [ms] (mean, across all concurrent requests)
Transfer rate: 231.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 15 9.8 12 82
Processing: 1 9 9.0 6 130
Waiting: 0 8 8.9 6 129
Total: 7 23 14.4 19 142
Percentage of the requests served within a certain time (ms)
50% 19
66% 24
75% 28
80% 30
90% 40
95% 54
98% 66
99% 79
100% 142 (longest request)
Current sysctl settings that may be relevant:
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050 65500
I didn't see any conntrack "full" errors. Best I could tell there isn't packet loss. We recently upgraded from 1.14 and didn't notice the issue but I can't say for certain it wasn't there. I believe we will be forced to migrate away from romana soon since it doesn't seem to be maintained anymore and as we upgrade to kube 1.16.x we are encountering problems with it starting up.
I have searched the internet all day today looking for similar problems and the closest one that resembles our problem is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 but I have no idea how to implement the iptable masquerade --random-fully option given we use romana and I read (https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153) that random-fully is the default for linux kernel 5 which we are using. Any ideas?
kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
====== Update Nov 5, 2019 ======
It has been suggested to test an alternate CNI. I chose calico since we used that in an older Debian based kube cluster. I rebuilt a VM with our most basic Centos 7 template (vSphere) so there is a little baggage coming from our customizations. I can't list everything we customized in our template but the most notable change is the kernel 5 upgrade yum --enablerepo=elrepo-kernel -y install kernel-ml.
After starting up the VM these are the minimal steps to install kubernetes and run the test:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install docker-ce-3:18.09.6-3.el7.x86_64
systemctl start docker
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0
systemctl enable --now kubelet
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
name: test
spec:
selector:
matchLabels:
app: test
replicas: 1
template:
metadata:
labels:
app: test
spec:
containers:
- name: nginx
image: nginxdemos/hello
ports:
- containerPort: 80
EOF
# wait for control plane to become healthy
kubectl apply -f /tmp/test-deploy.yml
Now the setup is ready and this is the ab test:
$ docker run --rm jordi/ab -n 100 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed
The ab test gives up after this error. If I decrease the number of requests to see avoid the timeout this is what you would see:
$ docker run --rm jordi/ab -n 10 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient).....done
Server Software: nginx/1.13.8
Server Hostname: 192.168.4.4
Server Port: 80
Document Path: /
Document Length: 7227 bytes
Concurrency Level: 1
Time taken for tests: 0.029 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 74140 bytes
HTML transferred: 72270 bytes
Requests per second: 342.18 [#/sec] (mean)
Time per request: 2.922 [ms] (mean)
Time per request: 2.922 [ms] (mean, across all concurrent requests)
Transfer rate: 2477.50 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 1 3
Processing: 1 2 1.2 1 4
Waiting: 0 1 1.3 0 4
Total: 1 3 1.4 3 5
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 4
80% 5
90% 5
95% 5
98% 5
99% 5
100% 5 (longest request)
This issue is technically different than the original issue I reported but this is a different CNI and there are still network issues. It does have the timeout error in common when I run the same test in the kube/romana cluster: run the ab test on the same node as the pod. Both encountered the same timeout error but in romana I could get a few thousand requests to finish before hitting the timeout. Calico encounters the timeout error before reaching a dozen requests.
Other variants or notes:
- net.netfilter.nf_conntrack_tcp_be_liberal=0/1 doesn't seem to make a difference
- higher -n values sometimes work but it is largely random.
- running the 'ab' test at low -n values several times in a row can sometimes trigger the timeout
At this point I am pretty sure it is some issue with our centos installation but I can't even guess what it could be. Are there any other limits, sysctl or other configs that could cause this?
====== Update Nov 6, 2019 ======
I observer that we had an older kernel installed in so I upgraded my kube/calico test VM with the same newer kernel 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timout errors.
Now that the timeout error is gone I was able to repeat the original test where I load test pod A and kill pod B on my vSphere VMs. On the romana environments the problem still existed but only when the load test is on a different host than where the pod A is located. If I run the test on the same host, no errors at all. Using Calico instead of romana, there are no load test errors on either host so the problem was gone. There may still be some setting to tweak that can help romana but I think this is "strike 3" for romana so I will start transitioning a full environment to Calico and do some acceptance testing there to ensure there are no hidden gotchas.
You mentioned that if there are no pod changes while the ab test is running, then no errors get reported. So it means that errors occur when you add pod or delete one.
This is normal behaviour as when pod gets deleted; it takes time for iptable rules changes to propagate. It may happen that container got removed, but iptable rules haven't got changed yet end packets are being forwarded to the nonexisting container, and this causes errors (it is sort of like a race condition).
The first thing you can do is always to create readiness probe as it will make sure that traffic will not be forwarded to the container until it is ready to handle requests.
The second thing to do is to handle deleting the container properly. This is a bit harder task because it may be handled at many levels, but the easiest thing you can do is adding PreStop hook to your container like this:
lifecycle:
preStop:
exec:
command:
- sh
- -c
- "sleep 5"
PreStop hook gets executed at the moment of the pod deletion request. From this moment, k8s start changing iptable rules and it should stop forwarding new traffic to the container that's about to get deleted. While sleeping you give some time for k8s to propagate iptable changes in the cluster while not interrupting already existing connections. After PreStop handle exits, the container will receive SIGTERM signal.
My suggestion would be to apply both of these mechanisms together and check if it helps.
You also mentioned that bypassing ingress is causing more errors. I would assume that this is due to the fact that ingress has implemented retries mechanism. If it's unable to open a connection to a container, it will try several times, and hopefully will get to a container that can handle its request.

haproxy ulimit-n computation

I got a haproxy 1.8 vanilla alpine docker image running with maxconn = 2000
curl -s http://host:port/stats| grep maxsock
<b>maxsock = </b> 4017; <b>maxconn = </b> 2000; <b>maxpipes = </b> 0<br>
Sometimes I get the following Warning in my logs:
[WARNING] 0/0 (0) : [/usr/local/sbin/haproxy.main()] FD limit (4015) too low for maxconn=2000/maxsock=4016. Please raise 'ulimit-n' to 4016 or more to avoid any trouble.
I find it very odd since I read this in haproxy doc:
ulimit-n
Sets the maximum number of per-process file-descriptors to . By
default, it is automatically computed, so it is recommended not to use this
option.
Not sure if it's a bug on haproxy or something I am doing wrong.
What do you think of that?
edit: haproxy is running as root
It depends on the open file descriptor limit(hard and soft), you can check that by ulimit -Hn and ulimit -Sn.
It is automatically computed, but it depends on the user you run haproxy, if you run haproxy using root then even if the computed value is greater than hard limit, you can set that value without warning.
But if you run as a normal user, then the max value is hard limit, if the computed value is greater than that, you got the warning.

Kitura slow or low request per second?

I've download Kitura 0.20 and created a new project for a benchmark on a swift build -c release
import Kitura
let router = Router()
router.get("/") {
request, response, next in
response.send("Hello, World!")
next()
}
Kitura.addHTTPServer(onPort: 8090, with: router)
Kitura.run()
and the score appear to be low compare to Zewo and Vapor which could hit 400k+ request/s?
MacBook-Pro:hello2 yanli$ wrk -t1 -c100 -d30 --latency http://localhost:8090
Running 30s test # http://localhost:8090
1 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 415.36us 137.54us 3.09ms 91.11%
Req/Sec 5.80k 2.47k 7.19k 85.71%
Latency Distribution
50% 391.00us
75% 443.00us
90% 513.00us
99% 0.93ms
16229 requests in 30.01s, 1.67MB read
Socket errors: connect 0, read 342, write 55, timeout 0
Requests/sec: 540.84
Transfer/sec: 57.04KB
I suspect you are running out of ephemeral ports. Your issue is probably the same as this one: 'ab' program freezes after lots of requests, why?
Kitura currently does not support HTTP keepalive, and so every request requires a new connection. One symptom of this is that regardless of how many seconds you attempt to drive load, you'll see a similar number of completed requests (16229 in your example).
On OS X, there are 16,384 ephemeral ports available by default, and these will be rapidly exhausted unless you tune the network settings.
[1] http://danielmendel.github.io/blog/2013/04/07/benchmarkers-beware-the-ephemeral-port-limit/
[2] https://rolande.wordpress.com/2010/12/30/performance-tuning-the-network-stack-on-mac-osx-10-6/
My approach has been to reduce the Maximum Segment Lifetime tunable (which defaults to 15000, or 15 seconds) and increase the range of available ports temporarily while benchmarking, for example:
sudo sysctl -w net.inet.tcp.msl=1000
sudo sysctl -w net.inet.ip.portrange.first=32768
<run benchmark>
sudo sysctl -w net.inet.tcp.msl=15000
sudo sysctl -w net.inet.ip.portrange.first=49152