When I create an iSCSI target containing two LUNs (bdevs), these two LUNs are mapped to two disks. When I use fio to read and write the two disks, the iSCSI target uses only one thread (or core) to perform the operations.
Operations:
./scripts/rpc.py bdev_malloc_create -b Malloc0 64 512
./scripts/rpc.py bdev_malloc_create -b Malloc1 64 512
./scripts/rpc.py --verbose DEBUG iscsi_create_portal_group 1 172.20.20.156:3261
./scripts/rpc.py --verbose DEBUG iscsi_create_initiator_group 2 ANY 172.20.20.156/24
./scripts/rpc.py --verbose DEBUG iscsi_create_target_node disk1 "Data Disk1" "Malloc0:0 Malloc1:1" 1:2 64 -d
iscsiadm -m discovery -t sendtargets -p 172.20.20.156:3261
iscsiadm -m node --targetname iqn.2016-06.io.spdk:disk1 --portal 172.20.20.156:3261 --login
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sdd -name="BS 512B read test" -iodepth=2
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sde -name="BS 512B read test" -iodepth=2
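To read and write the two disks at the same time (the case described below), the two fio jobs can simply be run in parallel; a minimal sketch reusing the commands above:
# Launch both jobs in the background and wait for both to complete.
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sdd -name="sdd write" -iodepth=2 &
fio -ioengine=libaio -bs=512B -direct=1 -thread -numjobs=2 -size=64M -rw=write -filename=/dev/sde -name="sde write" -iodepth=2 &
wait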
[screenshot of the iSCSI target log output]
The log line circled in red above is one I added myself. When I read and write both disks at the same time, the thread does not change.
Can the read and write operations for these two disks be performed on two different threads? I would like them to run on two different threads.
I'm looking for a reliable (and hopefully simple) way to trace a directory in an LVM or other device-mapper mounted fs back to the physical disk it resides on. The goal is to get the model and serial number of the drive no matter where the script wakes up.
This is not a problem when the fs mount is on a physical partition, but it gets messy when layers of LVM and/or loopbacks are in between. The lsblk tree shows the dm relationships back to /dev/sda in the following example, but it wouldn't be very easy or desirable to parse:
# lsblk -po NAME,MODEL,SERIAL,MOUNTPOINT,MAJ:MIN
NAME                                 MODEL            SERIAL        MOUNTPOINT  MAJ:MIN
/dev/loop0                                                          /mnt/test   7:0
/dev/sda                             AT1000MX500SSD1  21035FEA05B8              8:0
├─/dev/sda1                                                         /boot       8:1
├─/dev/sda2                                                                     8:2
└─/dev/sda5                                                                     8:5
  └─/dev/mapper/sda5_crypt                                                      254:0
    ├─/dev/mapper/test5--vg-root                                    /           254:1
    └─/dev/mapper/test5--vg-swap_1                                  [SWAP]      254:2
I tried udevadm info, stat, and a few other variations, but they all dead-end at the device mapper without a way (that I can see) of connecting the dots to the backing disk and its model/serial number.
I got a good-enough solution by enumerating the base /dev/sd? devices, looping through each one and its partitions with lsblk -ln devpart, and looking for the mountpoint in column 7. In the following example, the desired / shows up in the mappings to the /dev/sda5 partition. The serial number (and a lot of other data) for the base device can then be returned with udevadm info /dev/sda (a scripted sketch of this loop follows the output):
sda5 8:5 0 931G 0 part
sda5_crypt 254:0 0 931G 0 crypt
test5--vg-root 254:1 0 651G 0 lvm /
test5--vg-swap_1 254:2 0 976M 0 lvm [SWAP]
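A minimal scripted sketch of that approach (it uses explicit lsblk columns instead of the default column 7, and the target mountpoint is hard-coded as an example):
#!/bin/bash
# Walk the base /dev/sd? devices and find which one backs a given mountpoint
# (lsblk on a whole disk also lists the partitions and dm/crypt/lvm devices
# stacked on it), then print that disk's model and serial via udevadm.
wanted_mount="/"   # mountpoint to locate; an assumption for the example

for disk in /dev/sd?; do
    # -l: list format, -n: no header
    if lsblk -ln -o NAME,MOUNTPOINT "$disk" |
        awk -v m="$wanted_mount" '$2 == m { found = 1 } END { exit !found }'; then
        echo "$wanted_mount is backed by $disk"
        udevadm info --query=property "$disk" | grep -E '^ID_(MODEL|SERIAL_SHORT)='
    fi
done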
I would like to get a time series:
t0, misses
...
tN, misses
where tN is a timestamp (second resolution) and misses is the number of times the kernel performed disk I/O for my PID to load a missing page of an mmap()-ed memory region when the process accessed that memory. OK, maybe the connection between disk I/O and memory access is harder to track, so let's assume my program cannot do any disk I/O for any reason other than accessing missing mmapped memory. I think I need to track something called node-load-misses in the perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
I tried to use perf record for a similar purpose: I dislike how much data perf records. As I recall, the attempt looked like this (I also don't remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could provide some facility to implement such a thing with less overhead.
Loading of mmapped pages which are not present in memory is not a hardware event like perf's cache-misses, node-loads, or node-load-misses. When your program accesses a memory address that is not present, a GPFault/page-fault exception is generated by the hardware and handled in software by the Linux kernel. On first access to anonymous memory, a physical page is allocated and mapped at that virtual address; on access to an mmapped file, disk I/O is initiated. There are two kinds of page faults in Linux, minor and major, and the one involving disk I/O is the major page fault.
You should try trace-cmd, ftrace, or perf trace. Support for fault tracing was planned for the perf tool back in 2012, and patches were proposed at https://lwn.net/Articles/602658/
There is a tracepoint for page faults from user-space code, and this command prints some events with the memory address of the page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With a recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record, which can record both mmap syscalls and page_fault_user events into perf.data; perf script will then print all events, and they can be counted by a small awk or python script.
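As a rough sketch, assuming a perf.data recorded with perf trace record as above and perf script's default one-line-per-event output (where the timestamp field ends in a colon), the events can be bucketed into per-second counts:
# Count recorded page-fault events per second from perf.data.
perf script -i perf.data |
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9]+\.[0-9]+:$/) { c[int($i)]++; break } }
       END { for (t in c) printf "%d, %d\n", t, c[t] }' | sort -n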
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
And for a simple time-series stat you can use the perf stat -I 1000 command with the appropriate software events:
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
#           time             counts unit events
     1.000112251             413.59 msec cpu-clock                 #    0.414 CPUs utilized
     1.000112251              5,361      page-faults               #    0.013 M/sec
     1.000112251              5,301      minor-faults              #    0.013 M/sec
     1.000112251                 60      major-faults              #    0.145 K/sec
     2.000490561              16.32 msec cpu-clock                 #    0.016 CPUs utilized
     2.000490561                  1      page-faults               #    0.005 K/sec
     2.000490561                  1      minor-faults              #    0.005 K/sec
     2.000490561                  0      major-faults              #    0.000 K/sec
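To reshape that interval output into the "tN, misses" pairs asked for above, a small sketch using perf stat's CSV mode (-x,) could look like this; field positions follow the usual CSV layout, so adjust if your perf version differs ($PID is the target process, as in the perf record command earlier):
# Emit "timestamp, count" pairs for major faults (the ones that needed disk I/O),
# one line per second; perf stat writes its counters to stderr, hence 2>&1.
perf stat -x, -e major-faults -I 1000 -p "$PID" 2>&1 |
  awk -F, '/major-faults/ { printf "%s, %s\n", $1, $2 }'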
I have a 3-node Docker swarm.
One deployed stack is a database cluster with 3 replicas (MariaDB Galera).
Another deployed stack is a web application with 2 replicas.
The web application looks like this:
version: '3'

networks:
  web:
    external: true
  galera_network:
    external: true

services:
  application:
    image: webapp:latest
    networks:
      - galera_network
      - web
    environment:
      DB_HOST: galera_node
    deploy:
      replicas: 2
FWIW, the web network is what traefik is hooked up to.
The issue here is that galera_node (used as the webapp's database host) resolves to a VIP that ends up leveraging swarm's mesh routing (as far as I can tell), and that adds extra latency whenever the mesh routing goes over the WAN instead of resolving to the galera_node container deployed on the same physical host.
Another option I've found is to use tasks.galera_node, but that seems to use DNSRR across the 3 Galera cluster containers. So 33% of the time things are good and fast... but the rest of the time I have unnecessary latency added to the mix.
These two behaviors look to be documented as what we'd expect from the different endpoint_mode options. Reference: Docker endpoint_mode
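For reference, here is a hedged sketch of how the two modes are chosen at the service level; the image and exact flags are placeholders, not my actual deployment commands:
# vip (the default): the service name resolves to a single virtual IP behind the
# routing mesh; dnsrr: DNS round-robin over the task IPs, which is what
# tasks.galera_node shows.
docker service create --name galera_node --network galera_network \
  --endpoint-mode dnsrr --replicas 3 mariadb:latest

# Check which mode an existing service is using:
docker service inspect --format '{{ .Spec.EndpointSpec.Mode }}' galera_node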
To illustrate the latency, notice when pinging from within the webapp container:
Notice the IP addresses that are resolving for each ping along with the response time.
### hitting VIP that "masks" the fact that there is extra latency
### behind it depending on where the mesh routing sends the traffic.
root@294114cb14e6:/var/www/html# ping galera_node
PING galera_node (10.0.4.16): 56 data bytes
64 bytes from 10.0.4.16: icmp_seq=0 ttl=64 time=0.520 ms
64 bytes from 10.0.4.16: icmp_seq=1 ttl=64 time=0.201 ms
64 bytes from 10.0.4.16: icmp_seq=2 ttl=64 time=0.153 ms
--- galera_node ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.153/0.291/0.520/0.163 ms
### hitting DNSRR that resolves to worst latency server
root@294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.241): 56 data bytes
64 bytes from 10.0.4.241: icmp_seq=0 ttl=64 time=60.736 ms
64 bytes from 10.0.4.241: icmp_seq=1 ttl=64 time=60.573 ms
--- tasks.galera_node ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 60.573/60.654/60.736/0.082 ms
### hitting DNSRR that resolves to local galera_node container
root@294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.242): 56 data bytes
64 bytes from 10.0.4.242: icmp_seq=0 ttl=64 time=0.133 ms
64 bytes from 10.0.4.242: icmp_seq=1 ttl=64 time=0.117 ms
--- tasks.galera_node ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.117/0.125/0.133/0.000 ms
### hitting DNSRR that resolves to other "still too much" latency server
root@294114cb14e6:/var/www/html# ping tasks.galera_node
PING tasks.galera_node (10.0.4.152): 56 data bytes
64 bytes from 10.0.4.152: icmp_seq=0 ttl=64 time=28.218 ms
64 bytes from 10.0.4.152: icmp_seq=1 ttl=64 time=40.912 ms
64 bytes from 10.0.4.152: icmp_seq=2 ttl=64 time=26.293 ms
--- tasks.galera_node ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 26.293/31.808/40.912/6.486 ms
The only way I've been able to get decent performance that bypasses the latency is to hard code the IP address of the local container, but that is obviously not a long-term solution as containers should be treated as ephemeral things.
I totally get that I might need to rethink my geographic node locations due to this latency, and there might be some other performance tuning things I can do. It seems like there should be a way to enforce my desired behavior, though.
I essentially want to bypass DNSRR and the VIP/mesh routing behavior when a local container is available to service the given request.
So the question is:
How can I have each replica of my webapp only hit the local swarm host's galera container without hard coding that container's IP address?
If anyone else is fighting with this sort of issue, I wanted to post a solution (though I wouldn't necessarily call it an "answer" to the actual question) that is more of a workaround than something I'm actually happy with.
Inside my webapp, I can use galera_node as my database host and it resolves to the mesh-routing VIP that I mentioned above. This gives me functionality no matter what, so if my workaround gets tripped up I know that my connectivity is still intact.
I whipped up a little bash script that I can call as a cron job to get the results I want. It could be used for other use cases that stem from this same issue.
It takes in three parameters:
$1 = database container name
$2 = database network name
$3 = webapp container name
The script looks for the container name, finds its IP address for the specified network, and then adds that container name and IP address to the webapp container's /etc/hosts file.
This works because the container name is also galera_node (in my case) so adding it to the hosts file just overrides the hostname that docker resolves to the VIP.
As mentioned, I don't love this, but it does seem to work for my purposes and it avoids me having to hardcode IP addresses and manually maintain them. I'm sure there are some scenarios that will require tweaks to the script, but it's a functional starting point.
My script: update_container_hosts.sh
#!/bin/bash
HOST_NAME=$1
HOST_NETWORK=$2
CONTAINER_NAME=$3

FMT="{{(index (index .NetworkSettings.Networks \"$HOST_NETWORK\") ).IPAddress}}"
CONTAINERS=`docker ps | grep $CONTAINER_NAME | cut -d" " -f1`
HOST_ID=`docker ps | grep $HOST_NAME | cut -d" " -f1`
HOST_IP=$(docker inspect $HOST_ID --format="$FMT")

echo --- containers ---
echo $CONTAINERS
echo ------------------
echo host: $HOST_NAME
echo network: $HOST_NETWORK
echo ip: $HOST_IP
echo ------------------

for c in $CONTAINERS;
do
    if [ "$HOST_IP" != "" ]
    then
        docker cp $c:/etc/hosts /tmp/hosts.tmp
        IP_COUNT=`cat /tmp/hosts.tmp | grep $HOST_IP | wc -l`
        rm /tmp/hosts.tmp
        if [ "$IP_COUNT" = "0" ]
        then
            docker exec $c /bin/sh -c "echo $HOST_IP $HOST_NAME >> /etc/hosts"
            echo "$c: Added entry to container hosts file."
        else
            echo "$c: Entry already exists in container hosts file. Skipping."
        fi
    fi
done
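An example invocation with the names used above (the third argument matches the service name from the compose file); I run it from cron on each node so new replicas pick up the hosts entry shortly after they start:
./update_container_hosts.sh galera_node galera_network application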
I wrote a PoC for adjusting the load balancer to exclude containers on other hosts. It adjusts the config of the virtual IP itself, so there is no need to change anything in the container filesystem. It needs to be rerun on every node in the cluster whenever a container is stopped or started. It takes one argument, the exposed port; it will then figure out the virtual IP and the IPs of the containers. It needs nsenter and ipvsadm. I thought someone might find it useful.
#!/bin/bash

port="$1"
if [ -z "$port" ]; then
    echo "Please specify port"
    exit 1
fi

echo "Collecting data"
INGRESS_IP=$(iptables -t nat -S DOCKER-INGRESS | grep -- "--dport $port " | cut -d' ' -f 12 | cut -d: -f1)
if [ -z "$INGRESS_IP" ]; then
    echo "Can't find ingress IP"
    exit 1
fi
echo "INGRESS_IP = $INGRESS_IP"

FWM_HEX=$(nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -S PREROUTING | grep -- "--dport $port " | cut -d' ' -f12 | cut -d/ -f 1 | cut -dx -f2)
FWM=$((16#$FWM_HEX))
echo "Firewall mark = $FWM"

declare -A LOCAL_CONTAINER_IPS
LOCAL_CONTAINERS=$(docker ps -q)
for c in $LOCAL_CONTAINERS; do
    i=$(docker inspect $c | jq '.[0]["NetworkSettings"]["Networks"]["ingress"]["IPAMConfig"]["IPv4Address"]' | cut -d\" -f 2)
    LOCAL_CONTAINER_IPS[$i]=1
done

LB_IPS=$(nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -S | grep -- "-a -f $FWM -r" | cut -d' ' -f5 | cut -d: -f1)

declare -A EXISTING_CONTAINER_IPS
echo "Checking for IPs to remove"
for i in $LB_IPS; do
    EXISTING_CONTAINER_IPS[$i]=1
    if [ ! ${LOCAL_CONTAINER_IPS[$i]+_} ]; then
        echo "$i is not a local container IP, deleting"
        nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -d -f $FWM -r $i:0
    fi
done

echo "Checking for IPs to add"
for i in "${!LOCAL_CONTAINER_IPS[@]}"; do
    if [ ! ${EXISTING_CONTAINER_IPS[$i]+_} ]; then
        echo "$i is a local container IP but not in the load balancer, adding"
        nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -a -f $FWM -r $i:0 -m -w 1
    fi
done

echo "done"
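An example run, assuming the script is saved as rebalance_ingress.sh (the file name is arbitrary) and the service is published on port 80; as noted above, it has to be repeated on every node whenever containers start or stop, e.g. from cron:
sudo ./rebalance_ingress.sh 80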
Our Varnish Instance
/usr/sbin/varnishd -P /var/run/varnish.pid -a :6081 -f /etc/varnish/cm-varnish.vcl -T 127.0.0.1:6082 -t 1h -u varnish -g varnish -S /etc/varnish/secret -s malloc,24G -p shm_reclen 10000 -p http_req_hdr_len 10000 -p thread_pool_add_delay 2 -p thread_pools 8 -p thread_pool_min 500 -p thread_pool_max 4000 -p sess_workspace 1073741824
32 GB RAM, 16-core processor, and we allocate 24 GB of memory to Varnish.
The average uptime of our Varnish instance is around 3 hours, which is very low. Our cache TTL is 1 hour and the grace time is 2 hours. Every 5 minutes we refresh the cache contents [those with more than n hits] through a Java process. We track Varnish hits by constantly polling the varnishncsa output.
I tried varnishadm panic.show:
Last panic at: Thu, 23 May 2013 09:14:42 GMT
Assert error in WSLR(), cache_shmlog.c line 220:
Condition(VSL_END(w->wlp, l) < w->wle) not true.
thread = (cache-worker)
ident = Linux,2.6.18-238.el5,x86_64,-smalloc,-smalloc,-hcritbit,epoll
Backtrace:
0x42dc76: /usr/sbin/varnishd [0x42dc76]
0x432d1f: /usr/sbin/varnishd(WSLR+0x27f) [0x432d1f]
0x42a667: /usr/sbin/varnishd [0x42a667]
0x42a89e: /usr/sbin/varnishd(http_DissectRequest+0xee) [0x42a89e]
0x4187d1: /usr/sbin/varnishd(CNT_Session+0x741) [0x4187d1]
0x42f706: /usr/sbin/varnishd [0x42f706]
0x3009c0673d: /lib64/libpthread.so.0 [0x3009c0673d]
0x30094d40cd: /lib64/libc.so.6(clone+0x6d) [0x30094d40cd]
Any input on what we might be missing?
My best guess is that you have a very long cookie string (or other custom headers) so that it overflows the http_req_hdr_len. I remember reading something about such a bug that was fixed but afaik not released in a stable version. I'm afraid I don't have better sources than my own memory at hand.
You also have a very high sess_workspace and a very high possible total number of threads. In most setups that does less for performance than it risks pushing the machine into swap.
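As a hedged illustration of that advice (the values are only examples, not tuned for this workload), both parameters can be adjusted at runtime through the management interface listed on the varnishd command line above:
# sess_workspace=1073741824 asks for roughly 1 GB per session; 256 KB is already generous.
varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret param.set sess_workspace 262144
# Lower the per-pool thread ceiling (8 pools are configured).
varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret param.set thread_pool_max 2000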
I am again rephrasing the issue that we are facing:
We are creating link aggregations [dlmp groups] with two interfaces named net0 & net5:
# dladm create-aggr -m dlmp -l net0 -l net5 -l net2 aggr1
Setting probe targets for aggr1:
# dladm set-linkprop -p probe-ip=+ aggr1
Setting failure detection time:
# dladm set-linkprop -p probe-fdt=15 aggr1
After this we add an IP interface on this aggregation as follows:
# ipadm create-ip aggr1
and assign an address to it:
# ipadm create-addr -T static -a x.x.x.x/y aggr1/addr
Then we check the status using dladm and ipadm; everything seems up and running.
Then we tested a scenario where we detached the cables from the above network interfaces, but what we got is as follows:
# dladm show-aggr -x
LINK PORT SPEED DUPLEX STATE ADDRESS PORTSTATE
traf0 -- 100Mb unknown up 0:10:e0:5b:69:1 --
net0 100Mb unknown down 0:10:e0:5b:69:1 attached
net5 100Mb unknown down a0:36:9f:45:de:9d attached
The first issue is that we get the state of the link "traf0" as up in the above command output; secondly, in the output of "ipadm":
traf0 ip ok -- --
traf0/addr static ok -- 7.8.0.199/16
We are getting the status of traf0 as ok.
So here I have a query: is there any configuration with which we could get the right status of traf0 in both the dladm and ipadm output?
[One more thing to add: when we don't assign any IP to this traf0 aggregation, then on detaching the cables we do get the right output from the dladm command.]
Apart from this configuration, we use these aggregations as VNICs in zones. There too, ipadm reports these links as up [after detaching the cables].
A small update:
We have set the "TRACK_INTERFACES_ONLY_WITH_GROUPS" parameter in /etc/default/mpathd to no, and we now get the state of "traf0" in the ipadm output as failed, but we still get traf0/addr as ok.
traf0 ip failed -- --
traf0/addr static ok -- 7.8.0.199/16
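For reference, a sketch of the change described in that update, assuming in.mpathd re-reads /etc/default/mpathd on SIGHUP (as documented for Solaris IPMP):
# In /etc/default/mpathd set:
#   TRACK_INTERFACES_ONLY_WITH_GROUPS=no
# then have the daemon re-read its configuration:
pkill -HUP in.mpathd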