Only one CPU core handling network interrupts - streaming

I developed a video streaming server in Go which receives multiple streams over plain TCP connections and broadcasts them over HTTP.
It runs on my Raspberry Pi cluster, which I manage with Docker Swarm. The server is horizontally distributed across all the Raspberry Pis. All instances receive all streams; the load is spread out when serving the many HTTP clients.
A single stream is pretty heavy, using about 16 Mbit/s of bandwidth.
It works pretty well: I get 32 fps almost without delay.
I decided to check how much CPU and memory my programs consume via htop. Running it as root I see my server using around 50% CPU, which I thought was fine since it is receiving 4 simultaneous streams at 32 fps. But then I ran htop as the pi user and noticed another process, ksoftirqd, consuming a further 50% CPU (I don't know why I can't see that process when running htop as root).
I did some research and learned that this is a special little kernel process that handles deferred system interrupts (or something similar). That kind of makes sense, since I'm receiving a heavy load via the eth0 interface.
I then looked into /proc/interrupts:
CPU0 CPU1 CPU2 CPU3
16: 0 0 0 0 bcm2836-timer 0 Edge arch_timer
17: 92241367 299499641 352012196 317363321 bcm2836-timer 1 Edge arch_timer
23: 288978 0 0 0 ARMCTRL-level 1 Edge 3f00b880.mailbox
24: 25146874 0 0 0 ARMCTRL-level 2 Edge VCHIQ doorbell
46: 0 0 0 0 ARMCTRL-level 48 Edge bcm2708_fb dma
48: 0 0 0 0 ARMCTRL-level 50 Edge DMA IRQ
50: 0 0 0 0 ARMCTRL-level 52 Edge DMA IRQ
51: 530982 0 0 0 ARMCTRL-level 53 Edge DMA IRQ
54: 13573 0 0 0 ARMCTRL-level 56 Edge DMA IRQ
55: 0 0 0 0 ARMCTRL-level 57 Edge DMA IRQ
56: 0 0 0 0 ARMCTRL-level 58 Edge DMA IRQ
59: 3354 0 0 0 ARMCTRL-level 61 Edge bcm2835-auxirq
62: 3696376760 0 0 0 ARMCTRL-level 64 Edge dwc_otg, dwc_otg_pcd, dwc_otg_hcd:usb1
79: 0 0 0 0 ARMCTRL-level 81 Edge 3f200000.gpio:bank0
80: 0 0 0 0 ARMCTRL-level 82 Edge 3f200000.gpio:bank1
83: 0 0 0 0 ARMCTRL-level 85 Edge 3f804000.i2c
84: 0 0 0 0 ARMCTRL-level 86 Edge 3f204000.spi
86: 460008 0 0 0 ARMCTRL-level 88 Edge mmc0
87: 5270 0 0 0 ARMCTRL-level 89 Edge uart-pl011
92: 4975621 0 0 0 ARMCTRL-level 94 Edge mmc1
220: 3352 0 0 0 bcm2835-auxirq 0 Edge serial
FIQ: usb_fiq
IPI0: 0 0 0 0 CPU wakeup interrupts
IPI1: 0 0 0 0 Timer broadcast interrupts
IPI2: 41355699 278691553 311249136 291769396 Rescheduling interrupts
IPI3: 7758 9176 8710 9334 Function call interrupts
IPI4: 0 0 0 0 CPU stop interrupts
IPI5: 1387927 1201820 2860486 1356720 IRQ work interrupts
IPI6: 0 0 0 0 completion interrupts
Err: 0
As you can see, interrupt 62 is by far the one raised most often, and it is only being handled by CPU0. I suspect this is the eth0 interrupt (how can I confirm this?), so I thought I could configure things so that the interrupt would be handled by all CPUs.
So I looked into /proc/irq/62/smp_affinity_list and its contents are 0-3, which seems correct, and the contents of the /proc/irq/62/smp_affinity file are f, which also seems correct.
So I don't understand why interrupt 62 is only being handled by CPU0.
I also don't understand why I can only see ksoftirqd in htop when I run it as pi instead of root (the process is owned by root).
How can I configure the Raspberry Pi to handle that interrupt on all CPUs?
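For reference, this is the kind of quick check I have in mind to confirm that IRQ 62 really is where the eth0 traffic lands: sample /proc/interrupts while a stream is running and see which counters grow. A rough Python sketch (not part of my server; the 5-second interval is an arbitrary choice):
#!/usr/bin/env python3
# Sample /proc/interrupts twice while traffic flows and print the IRQ lines
# whose per-CPU counters grew the most in between.
import time

def read_counts():
    counts = {}
    with open("/proc/interrupts") as f:
        ncpus = len(f.readline().split())        # header line: CPU0 CPU1 ...
        for line in f:
            parts = line.split()
            name = parts[0].rstrip(":")
            cols = parts[1:1 + ncpus]
            if cols and all(c.isdigit() for c in cols):
                counts[name] = [int(c) for c in cols]
    return counts

before = read_counts()
time.sleep(5)                                    # keep the stream running here
after = read_counts()

deltas = {irq: [a - b for a, b in zip(after[irq], before[irq])]
          for irq in after if irq in before}
for irq, per_cpu in sorted(deltas.items(), key=lambda kv: -sum(kv[1]))[:5]:
    print(irq, per_cpu)
If the counter for 62 is the one growing while the stream is active (and only in the CPU0 column), that would confirm the suspicion.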
* EDIT *
I was checking this on only one of the Raspberry Pis belonging to the cluster. Checking the other hosts I saw similar results (just slightly lower CPU usage). Strangely, on another Raspberry Pi there is no ksoftirqd process consuming that much CPU.

Related

pyspark conf and yarn top memory discrepancies

On an EMR cluster, yarn top (run from the main node) reports:
YARN top - 13:27:57, up 0d, 1:34, 1 active users, queue(s): root
NodeManager(s): 6 total, 6 active, 0 unhealthy, 2 decommissioned, 0 lost, 0 rebooted
Queue(s) Applications: 3 running, 8 submitted, 0 pending, 5 completed, 0 killed, 0 failed
Queue(s) Mem(GB): 18 available, 189 allocated, 1555 pending, 0 reserved
Queue(s) VCores: 44 available, 20 allocated, 132 pending, 0 reserved
Queue(s) Containers: 20 allocated, 132 pending, 0 reserved
APPLICATIONID USER TYPE QUEUE PRIOR #CONT #RCONT VCORES RVCORES MEM RMEM VCORESECS MEMSECS %PROGR TIME NAME
application_1663674823778_0002 hadoop spark default 0 10 0 10 0 99G 0G 18754 187254 10.00 00:00:33 PyS
application_1663674823778_0003 hadoop spark default 0 9 0 9 0 88G 0G 9446 84580 10.00 00:00:32 PyS
application_1663674823778_0008 hadoop spark default 0 1 0 1 0 0G 0G 382 334 10.00 00:00:06 PyS
Note that the PySpark apps application_1663674823778_0002 and application_1663674823778_0003 were provisioned from the main node command line by simply executing pyspark (with no explicit config editing).
However, application_1663674823778_0008 was provisioned with the following command: pyspark --conf spark.executor.memory=11g --conf spark.driver.memory=12g. Despite this (test) PySpark config customization, that app shows nothing but 0 in YARN for the memory (regular or reserved) values.
Why is this?
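For what it's worth, this is roughly how I check what configuration the session actually picked up from inside the interactive shell (a sketch assuming the active SparkSession is bound to the name spark, as it is in the pyspark shell):
# Print the memory-related settings the running session believes it has.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "memory" in key.lower():
        print(key, "=", value)
This at least shows whether the --conf values actually reached the session.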

How to send web socket connection to different process in tornado?

I am using Tornado to implement a server with web sockets. I have a multi-core CPU and I want to use the other cores as well, so I thought of using Python's multiprocessing module. I want to accept the connection in the main process and send the data from another process. My questions are:
Is it possible to share the socket information between processes?
Is it better to use pickling or is there any other method that I can use?
If I use pickling, will the duplicate file descriptors it creates count against the number of file descriptors the OS can handle, or is it the same file descriptor shared between the processes?
Explanation:
There will be a lot of incoming connections and a lot of messages from the client side, so I do not want the main event loop to get stuck sending data. That is why I am trying to use a different process to send the data to the connections.
Output of strace
I started strace against the process ID from which I am sending data to the web sockets. The output of strace looks like this:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
91.01 0.019570 0 441736 sendto
3.60 0.000774 0 29314 read
3.14 0.000675 0 30623 clock_gettime
1.15 0.000248 0 2909 write
0.96 0.000206 0 11855 epoll_wait
0.13 0.000029 0 1534 680 recvfrom
0.00 0.000000 0 17 open
0.00 0.000000 0 34 close
0.00 0.000000 0 17 stat
0.00 0.000000 0 17 fstat
0.00 0.000000 0 34 poll
0.00 0.000000 0 39 mmap
0.00 0.000000 0 26 munmap
0.00 0.000000 0 408 brk
0.00 0.000000 0 134 ioctl
0.00 0.000000 0 34 socket
0.00 0.000000 0 34 17 connect
0.00 0.000000 0 300 setsockopt
0.00 0.000000 0 17 getsockopt
0.00 0.000000 0 200 fcntl
0.00 0.000000 0 17 gettimeofday
0.00 0.000000 0 1185 epoll_ctl
0.00 0.000000 0 178 78 accept4
------ ----------- ----------- --------- --------- ----------------
100.00 0.021502 520662 775 total
Is there any reason that I am getting errors on recvfrom and connect?
No, Tornado does not support this. There are techniques like SCM_RIGHTS to transfer file descriptors to other processes, but this will give you a raw socket in the other process, not a Tornado websocket object (and there is no supported way to construct a websocket object for this socket).
The recommended approach with Tornado is to run one process per CPU and let them share traffic by either putting them behind a load balancer or using SO_REUSEPORT. Sending the data in Tornado is non-blocking; you must make sure that your own code is non-blocking too (using asynchronous interfaces or thread pools).
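A minimal sketch of that multi-process pattern, written against the classic Tornado API (the port number and the echo handler are placeholders, not taken from the question):
# Bind the listening socket once, fork one worker per CPU, and let the
# children share the inherited socket. Port 8888 and EchoHandler are
# placeholders.
import tornado.ioloop
import tornado.netutil
import tornado.process
import tornado.web
import tornado.websocket
from tornado.httpserver import HTTPServer

class EchoHandler(tornado.websocket.WebSocketHandler):
    def on_message(self, message):
        self.write_message(message)      # non-blocking write

def main():
    sockets = tornado.netutil.bind_sockets(8888)
    tornado.process.fork_processes(0)    # 0 = one child process per CPU
    app = tornado.web.Application([(r"/ws", EchoHandler)])
    server = HTTPServer(app)
    server.add_sockets(sockets)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()
Alternatively, independently started processes can each bind the same port with bind_sockets(8888, reuse_port=True) and rely on the kernel's SO_REUSEPORT balancing.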
I will answer the first question:
Is it possible to share the socket information between processes?
It may depend on the OS, but on Linux it is possible in at least two ways:
When the main process accepts a new TCP connection, it can fork a new child process to handle it. After the fork, the new child process has the same socket file descriptor as the main process.
Use a UNIX domain socket to pass the file descriptor of a socket from the main process to the other process. This requires use of the SCM_RIGHTS control message and ancillary data. Check this.
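A rough sketch of the second approach: pass an accepted connection's descriptor to a worker over a UNIX socket pair (requires Python 3.9+ for socket.send_fds / socket.recv_fds; the port is a placeholder):
# Parent accepts a TCP connection and ships its file descriptor to a forked
# worker via SCM_RIGHTS; the worker rebuilds a socket object around the fd.
import os
import socket

parent_end, child_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

pid = os.fork()
if pid == 0:                                     # worker process
    parent_end.close()
    msg, fds, _flags, _addr = socket.recv_fds(child_end, 1024, 1)
    conn = socket.socket(fileno=fds[0])          # same underlying socket
    conn.sendall(b"hello from the worker\n")
    conn.close()
    os._exit(0)

child_end.close()                                # parent process
listener = socket.create_server(("127.0.0.1", 9999))   # placeholder port
conn, _addr = listener.accept()
socket.send_fds(parent_end, [b"fd"], [conn.fileno()])
conn.close()                                     # worker keeps its duplicate
os.waitpid(pid, 0)
The descriptor is duplicated into the receiving process, so both copies refer to the same open socket; closing the parent's copy does not close the worker's.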

Mongod resident memory usage low

I'm trying to debug some performance issues with a MongoDB configuration, and I noticed that the resident memory usage is sitting very low (around 25% of the system memory) despite the fact that there are occasionally large numbers of faults occurring. I'm surprised to see the usage so low given that MongoDB is so memory dependent.
Here's a snapshot of top sorted by memory usage. It can be seen that no other process is using any significant memory:
top - 21:00:47 up 136 days, 2:45, 1 user, load average: 1.35, 1.51, 0.83
Tasks: 62 total, 1 running, 61 sleeping, 0 stopped, 0 zombie
Cpu(s): 13.7%us, 5.2%sy, 0.0%ni, 77.3%id, 0.3%wa, 0.0%hi, 1.0%si, 2.4%st
Mem: 1692600k total, 1676900k used, 15700k free, 12092k buffers
Swap: 917500k total, 54088k used, 863412k free, 1473148k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2461 mongodb 20 0 29.5g 564m 492m S 22.6 34.2 40947:09 mongod
20306 ubuntu 20 0 24864 7412 1712 S 0.0 0.4 0:00.76 bash
20157 root 20 0 73352 3576 2772 S 0.0 0.2 0:00.01 sshd
609 syslog 20 0 248m 3240 520 S 0.0 0.2 38:31.35 rsyslogd
20304 ubuntu 20 0 73352 1668 872 S 0.0 0.1 0:00.00 sshd
1 root 20 0 24312 1448 708 S 0.0 0.1 0:08.71 init
20442 ubuntu 20 0 17308 1232 944 R 0.0 0.1 0:00.54 top
I'd like to at least understand why the memory isn't being better utilized by the server, and ideally to learn how to optimize either the server config or queries to improve performance.
UPDATE:
It's fair that the memory usage looks high, which might lead to the conclusion that it's another process. There are no other processes using any significant memory on the server; the memory appears to be consumed by the cache, but I'm not clear why that would be the case:
$free -m
total used free shared buffers cached
Mem: 1652 1602 50 0 14 1415
-/+ buffers/cache: 172 1480
Swap: 895 53 842
UPDATE:
You can see that the database is still page faulting:
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 402 377 0 1167 446 0 24.2g 51.4g 3g 0 <redacted>:9.7% 0 0|0 1|0 217k 420k 457 mover PRI 03:58:43
10 295 323 0 961 592 0 24.2g 51.4g 3.01g 0 <redacted>:10.9% 0 14|0 1|1 228k 500k 485 mover PRI 03:58:44
10 240 220 0 698 342 0 24.2g 51.4g 3.02g 5 <redacted>:10.4% 0 0|0 0|0 164k 429k 478 mover PRI 03:58:45
25 449 359 0 981 479 0 24.2g 51.4g 3.02g 32 <redacted>:20.2% 0 0|0 0|0 237k 503k 479 mover PRI 03:58:46
18 469 337 0 958 466 0 24.2g 51.4g 3g 29 <redacted>:20.1% 0 0|0 0|0 223k 500k 490 mover PRI 03:58:47
9 306 238 1 759 325 0 24.2g 51.4g 2.99g 18 <redacted>:10.8% 0 6|0 1|0 154k 321k 495 mover PRI 03:58:48
6 301 236 1 765 325 0 24.2g 51.4g 2.99g 20 <redacted>:11.0% 0 0|0 0|0 156k 344k 501 mover PRI 03:58:49
11 397 318 0 995 395 0 24.2g 51.4g 2.98g 21 <redacted>:13.4% 0 0|0 0|0 198k 424k 507 mover PRI 03:58:50
10 544 428 0 1237 532 0 24.2g 51.4g 2.99g 13 <redacted>:15.4% 0 0|0 0|0 262k 571k 513 mover PRI 03:58:51
5 291 264 0 878 335 0 24.2g 51.4g 2.98g 11 <redacted>:9.8% 0 0|0 0|0 163k 330k 513 mover PRI 03:58:52
It appears this was being caused by a large amount of inactive memory on the server that wasn't being cleared for Mongo's use.
By looking at the result of:
cat /proc/meminfo
I could see a large amount of inactive memory. Running this command as a sudo user:
free && sync && echo 3 > /proc/sys/vm/drop_caches && echo "" && free
freed up the inactive memory, and over the next 24 hours I saw the resident memory of my Mongo instance increase to consume the rest of the memory available on the server.
Credit to the following blog post for its instructions:
http://tinylan.com/index.php/article/how-to-clear-inactive-memory-in-linux
MongoDB only uses as much memory as it needs, so if all of the data and indexes that live in MongoDB can fit inside what it's currently using, you won't be able to push that any higher.
If the data set is larger than memory, there are a couple of considerations:
Check MongoDB itself to see how much data it thinks it's using by running mongostat and looking at resident memory.
Was MongoDB re/started recently? If it's cold then the data won't be in memory until it gets paged in (leading to more page faults initially that gradually settle). Check out the touch command for more information on "warming MongoDB up" (see the sketch below).
Check your read-ahead settings. If your system read-ahead is too high then MongoDB can't use the memory on the system efficiently. For MongoDB a good number to start with is 32 (that's 16 KB of read-ahead, assuming you have 512-byte blocks).
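For the touch suggestion above, a minimal sketch of issuing it through PyMongo (the database and collection names are placeholders; the touch command applies to the 2.x-era servers discussed here):
# Ask mongod to page a collection's data and indexes into memory.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]                     # placeholder database name
result = db.command("touch", "records", data=True, index=True)
print(result)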
I had the same issue: Windows Server 2008 R2, 16 GB RAM, Mongo 2.4.3. Mongo was using only 2 GB of RAM and generating a lot of page faults. Queries were very slow; the disk was idle and memory was free. I found no solution other than upgrading to 2.6.5. It helped.

Is there a way to find out the total memory used by UDP sockets on a system

By monitoring /proc/net/sockstat or /proc/net/protocols, I am able to find out the total amount of memory used by TCP sockets in the system in realtime:
[gpadmin@sdw4 ~]$ cat /proc/net/sockstat
sockets: used 240
TCP: inuse 55 orphan 0 tw 0 alloc 69 mem 2171
UDP: inuse 22 mem 0
RAW: inuse 0
FRAG: inuse 0 memory 0
[gpadmin@sdw4 ~]$ cat /proc/net/sockstat
sockets: used 240
TCP: inuse 55 orphan 0 tw 0 alloc 69 mem 761
UDP: inuse 22 mem 0
RAW: inuse 0
FRAG: inuse 0 memory 0
The above metrics show me the memory used by TCP sockets, but the UDP socket metrics report mem as 0. Is there a way to find out this information? Do any /proc/net files capture it?
Thanks in advance.
Could it be that you simply have low traffic?
This is on my machine, receiving 400 UDP packets/sec on 3 ports (there is a 4th UDP stream but I don't use that).
# cat /proc/net/sockstat
sockets: used 32
TCP: inuse 6 orphan 0 tw 0 alloc 6 mem 1
UDP: inuse 4 mem 3
The same machine, serving those UDP streams to loads of clients on HTTP:
# cat /proc/net/sockstat
sockets: used 7232
TCP: inuse 7206 orphan 0 tw 0 alloc 7206 mem 405397
UDP: inuse 4 mem 30
The HTTP server is single-threaded, so I had to set the receive buffer for the UDP sockets quite high to avoid losing any packets. I've run the test for a while, but I've never seen UDP mem go above 50.
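If it helps when reading those numbers: the mem column in /proc/net/sockstat is a count of pages rather than bytes, so converting it looks roughly like this (page size taken from the running system):
# Poll /proc/net/sockstat and report UDP socket memory in bytes.
import os
import time

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

def udp_mem_bytes():
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("UDP:"):
                fields = line.split()
                pages = int(fields[fields.index("mem") + 1])
                return pages * PAGE_SIZE
    return 0

while True:
    print("UDP socket memory: %d bytes" % udp_mem_bytes())
    time.sleep(1)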
Sorry if I'm completely off base here but I think you need to turn on udp accounting via /proc/sys/net/ipv4/udp_mem first before it will collect memory statistics.

Interpret differences in prstat vs. 'prstat -m' on Solaris

I've been using prstat and prstat -m a lot lately to investigate performance issues, and I think I've basically understood the differences between sampling and microstate accounting in Solaris 10, so I don't expect both to always show exactly the same numbers.
Today I came across an occasion where the two showed such vastly different output that I have problems interpreting them and making sense of it. The machine is a heavily loaded 8-CPU Solaris 10 box, with several large WebSphere processes and an Oracle database. The system practically came to a halt today for about 15 minutes (load averages of >700). I had difficulty getting any prstat information, but was able to get some output from "prstat 1 1" and "prstat -m 1 1", issued shortly one after the other.
The top lines of the outputs:
prstat 1 1:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
8379 was 3208M 2773M cpu5 60 0 5:29:13 19% java/145
7123 was 3159M 2756M run 59 0 5:26:45 7.7% java/109
5855 app1 1132M 26M cpu2 60 0 0:01:01 7.7% java/18
16503 monitor 494M 286M run 59 19 1:01:08 7.1% java/106
7112 oracle 15G 15G run 59 0 0:00:10 4.5% oracle/1
7124 oracle 15G 15G cpu3 60 0 0:00:10 4.5% oracle/1
7087 app1 15G 15G run 58 0 0:00:09 4.0% oracle/1
7155 oracle 96M 6336K cpu1 60 0 0:00:07 3.6% oracle/1
...
Total: 495 processes, 4581 lwps, load averages: 74.79, 35.35, 23.8
prstat -m 1 1 (some seconds later):
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
7087 app1 0.1 56 0.0 0.2 0.4 0.0 13 30 96 2 33 0 oracle/1
7153 oracle 0.1 53 0.0 3.2 1.1 0.0 1.0 42 82 0 14 0 oracle/1
7124 oracle 0.1 47 0.0 0.2 0.2 0.0 0.0 52 77 2 16 0 oracle/1
7112 oracle 0.1 47 0.0 0.4 0.1 0.0 0.0 52 79 1 16 0 oracle/1
7259 oracle 0.1 45 9.4 0.0 0.3 0.0 0.1 45 71 2 32 0 oracle/1
7155 oracle 0.0 42 11 0.0 0.5 0.0 0.1 46 90 1 9 0 oracle/1
7261 oracle 0.0 37 9.5 0.0 0.3 0.0 0.0 53 61 1 17 0 oracle/1
7284 oracle 0.0 32 5.9 0.0 0.2 0.0 0.1 62 53 1 21 0 oracle/1
...
Total: 497 processes, 4576 lwps, load averages: 88.86, 39.93, 25.51
I have a very hard time interpreting the output. prstat seems to tell me that a fair amount of Java processing is going on, together with some Oracle work, just as I would expect in a normal situation. prstat -m shows a machine totally dominated by Oracle, consuming huge amounts of system time, with the overall CPU heavily overloaded (large numbers in LAT).
I'm inclined to believe the output of prstat -m, because that matches much more closely what the system felt like during this time: totally sluggish, almost no user request processing going on from WebSphere anymore, etc. But why does prstat show such heavily differing numbers?
Any explanation of this would be welcome!
CU, Joe
There's a known problem with prstat -m on Solaris in the way CPU usage figures are calculated: the value you see has been averaged over all threads (LWPs) in a process, and hence is far, far too low for heavily multithreaded processes such as your Java app servers, which can have hundreds of threads each (see your NLWP). Fewer than a dozen of them are probably CPU hogs, hence the CPU usage by Java looks "low". You'd need to call it with prstat -Lm to get the per-LWP (thread) breakdown to see that effect. Reference:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6780169
Without further performance monitoring data it's hard to give a non-speculative explanation of what you've seen there. I assume lock contention within Java. One particular workload that can cause this is heavily multithreaded memory-mapped I/O: the threads all pile up on the process address space lock. But it could of course be a purely Java user-side lock. Running plockstat on one of the Java processes, and/or simple DTrace profiling, would be helpful.