I have a network program written in C using TCP sockets. Sometimes the client program hangs forever expecting input from server. Specifically, the client hangs on select() call set on an fd intended to read characters sent by server.
I am using strace to know where the process got stuck. However, sometimes when I attach the hung client process to strace, it immediately resumes it's execution and properly exits. Not all hung processes exhibit this behavior, some processes stuck in the select() even if I attach them to strace. But most of the processes resume their execution when attached to strace.
I am curious what causing the processes resume when attached to strace. It might give me clues to know why client processes are getting hung.
Any ideas? what causes a hung process to resume it's execution when attached to strace?
Update:
Here's the output of strace on hung processes.
> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "\0", 8192) = 1
write(2, "", 0) = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "", 8192) = 0
read(3, "", 8192) = 0
close(5) = 0
kill(25652, SIGKILL) = 0
exit_group(0) = ?
Process 25645 detached
_
> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192) = -1 EIO (Input/output error)
shutdown(3, 1 /* send */) = 0
exit_group(0) = ?
_
> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3) = 0
dup(2) = 3
fcntl64(3, F_GETFL) = 0x1 (flags O_WRONLY)
close(3) = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17) = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1) = ?
Process 7517 detached
Not just select(), but the processes(of same program) are stuck in various system calls before I attach them to strace. They suddenly resume after attaching to strace. If I don't attach them to strace, they just hang there forever.
Update 2:
I learned that strace could start a process which was previously stopped (process in T sate). Now I am trying to understand why did these processes go to 'T' state, what's the cause. Here's the /proc//status information:
> cat /proc/12554/status
Name: someone
State: T (stopped)
SleepAVG: 88%
Tgid: 12554
Pid: 12554
PPid: 9754
TracerPid: 0
Uid: 5000 5000 5000 5000
Gid: 48986 48986 48986 48986
FDSize: 256
Groups: 9149 48986
VmPeak: 1992 kB
VmSize: 1964 kB
VmLck: 0 kB
VmHWM: 608 kB
VmRSS: 608 kB
VmData: 156 kB
VmStk: 20 kB
VmExe: 16 kB
VmLib: 1744 kB
VmPTE: 20 kB
Threads: 1
SigQ: 54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 00000000,00000000,00000000,0000000f
Mems_allowed: 00000000,00000001
strace uses ptrace. The ptrace man page has this:
Since attaching sends SIGSTOP and the tracer usually suppresses it,
this may cause a stray EINTR return from the currently executing system
call in the tracee, as described in the "Signal injection and
suppression" section.
Are you seeing select return EINTR?
Related
I am creating a macOS app with Swift that needs to download and run a shell script. I have been able to download the script with curl, but I can't run it. I am using a function from this answer to run other commands. When I use the App Sandbox in Xcode, I get the error /bin/bash: ./file.sh: Permission denied. When I try changing file permissions, I get the error chmod: Unable to change file mode on file.sh: Operation not permitted.
Here is my code:
func shell(_ command: String) -> String {
let task = Process()
task.launchPath = "/bin/bash"
task.arguments = ["-c", command]
let pipe = Pipe()
task.standardOutput = pipe
task.launch()
let data = pipe.fileHandleForReading.readDataToEndOfFile()
let output: String = NSString(data: data, encoding: String.Encoding.utf8.rawValue)! as String
return output
}
shell("curl https://www.example.com/file.sh -o file.sh")
shell("./file.sh")
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
1 16.0M 1 224k 0 0 231k 0 0:01:11 --:--:-- 0:01:11 231k
47 16.0M 47 7808k 0 0 4064k 0 0:00:04 0:00:01 0:00:03 4062k
100 16.0M 100 16.0M 0 0 6037k 0 0:00:02 0:00:02 --:--:-- 6037k
/bin/bash: ./file.sh: Permission denied
I have created a running windows process list to a text file using the command:
tasklist > c:\mytasklist.txt. The result something like this:
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
System Idle Process 0 Services 0 24 K
System 4 Services 0 12.408 K
smss.exe 320 Services 0 1.236 K
csrss.exe 424 Services 0 4.720 K
wininit.exe 516 Services 0 4.684 K
csrss.exe 524 Console 1 7.888 K
winlogon.exe 572 Console 1 7.764 K
services.exe 620 Services 0 9.532 K
and so on...
My question:
How I parsing the line on the text file, so I get output like the following format using Lua script:
PID - Process Name - Memory Usage
And the output automatic sorted by the biggest memory usage
local tasklist = io.popen"tasklist /fo csv /nh"
local list = {}
for line in tasklist:lines() do
local exe, pid, mem = line:match'^"(.-)","(%d+)",.-"([^"]+)"$'
table.insert(list, {pid = tonumber(pid), exe = exe, mem = tonumber((mem:gsub("%D", "")))})
end
tasklist:close()
table.sort(list, function(a, b) return a.mem > b.mem end)
for j = 1, math.min(10, #list) do
print(list[j].pid, list[j].mem, list[j].exe)
end
Output:
1036 549416 svchost.exe
4972 439524 firefox.exe
6540 214476 plugin-container.exe
7144 169268 OUTLOOK.EXE
532 75320 svchost.exe
1948 71644 avp.exe
3752 62704 svchost.exe
5268 61100 explorer.exe
596 56732 csrss.exe
5048 50248 CcmExec.exe
I need to know what tm->when means, but proc(5) doesn't mention anything helpful,
So, does it store the creation time of the socket? The number seems to be decreasing each time I view the file.
root#ubuntu-vm:~# cat /proc/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:0CEA 00000000:0000 0A 00000000:00000000 00:00000000 00000000 104 0 17410 1 dddb6d00 100 0 0 10 -1
1: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 7959 1 dddb4500 100 0 0 10 -1
2: B238A8C0:0016 0138A8C0:9C96 01 00000000:00000000 02:00061444 00000000 0 0 8243 4 daa3c000 20 4 27 10 16
3: B238A8C0:0CEA 0138A8C0:8753 01 00000000:00000000 02:0009C787 00000000 104 0 19467 2 daa3e300 20 4 18 10 -1
From Exploring the /proc/net/ Directory
The tr field indicates whether a timer is active for this socket. A value of zero indicates the timer is not active. The tm->when field indicates the time remaining (in jiffies) before timeout occurs.
I have been told to make prstat flash the background from white to black a few times when any value in the size category passes a threshold. Is there a way to edit the command and put this in here or will this never happen?
I'm not trying to be mean, but somebody who asked for this is not being reasonable or does not understand. I would guess the "asker" has no clue about prstat. Look at these two examples:
example% prstat -u root -n 5 -P 1,2 1 1
PID USERNAME SWAP RSS STATE PRI NICE TIME CPU PROCESS/LWP
306 root 3024K 1448K sleep 58 0 0:00.00 0.3% sendmail/1
102 root 1600K 592K sleep 59 0 0:00.00 0.1% in.rdisc/1
250 root 1000K 552K sleep 58 0 0:00.00 0.0% utmpd/1
288 root 1720K 1032K sleep 58 0 0:00.00 0.0% sac/1
1 root 744K 168K sleep 58 0 0:00.00 0.0% init/1
TOTAL: 25, load averages: 0.05, 0.08, 0.12
example% prstat -S rss -n 5 -vc -u root,john
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP
1 root 0.0 0.0 - - - - 100 - 0 0 0 0 init/1
102 root 0.0 0.0 - - - - 100 - 0 0 3 0 in.rdisc/1
250 root 0.0 0.0 - - - - 100 - 0 0 0 0 utmpd/1
1185 john 0.0 0.0 - - - - 100 - 0 0 0 0 csh/1
240 root 0.0 0.0 - - - - 100 - 0 0 0 0 powerd/4
TOTAL: 71, load averages: 0.02, 0.04, 0.08
So, what value do you look for? There are lots of things prstat displays, so you have to learn all of them then code for whatever each of the many possible outputs means.
To do this:
What you will have to do is to run prstat with arguments entered on the command line, in a child process, read and interpret everything it produces, then map it to output and flash the screen as appropriate. You can do this with coprocesses in ksh or zsh or by using fifos in bash. Consider running prtstat in -e mode regardless of the what the user enters so you have full screens to read and manipulate.
Flashing the screen can be done with escape sequences, like changing background color or whatever you want. Here is a starting point for Windows based terminals:
ANSI escape sequences
And for Vt100 (UNIX)
terminal escape codes
I am using ubuntu 12, nginx, uwsgi 1.9 with socket, django 1.5.
Config:
[uwsgi]
base_path = /home/someuser/web/
module = server.manage_uwsgi
uid = www-data
gid = www-data
virtualenv = /home/someuser
master = true
vacuum = true
harakiri = 20
harakiri-verbose = true
log-x-forwarded-for = true
profiler = true
no-orphans = true
max-requests = 10000
cpu-affinity = 1
workers = 4
reload-on-as = 512
listen = 3000
Client tests from Windows7:
C:\Users\user>C:\AppServ\Apache2.2\bin\ab.exe -c 255 -n 5000 http://www.someweb.com/about/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking www.someweb.com (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Finished 5000 requests
Server Software: nginx
Server Hostname: www.someweb.com
Server Port: 80
Document Path: /about/
Document Length: 1881 bytes
Concurrency Level: 255
Time taken for tests: 66.669814 seconds
Complete requests: 5000
Failed requests: 1
(Connect: 1, Length: 0, Exceptions: 0)
Write errors: 0
Total transferred: 10285000 bytes
HTML transferred: 9405000 bytes
Requests per second: 75.00 [#/sec] (mean)
Time per request: 3400.161 [ms] (mean)
Time per request: 13.334 [ms] (mean, across all concurrent requests)
Transfer rate: 150.64 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 8 207.8 1 9007
Processing: 10 3380 11480.5 440 54421
Waiting: 6 1060 3396.5 271 48424
Total: 11 3389 11498.5 441 54423
Percentage of the requests served within a certain time (ms)
50% 441
66% 466
75% 499
80% 519
90% 3415
95% 36440
98% 54407
99% 54413
100% 54423 (longest request)
I have set following options too:
echo 3000 > /proc/sys/net/core/netdev_max_backlog
echo 3000 > /proc/sys/net/core/somaxconn
So,
1) I make first 3000 requests super fast. I see progress in ab and in uwsgi requests logs -
[pid: 5056|app: 0|req: 518/4997] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5052|app: 0|req: 512/4998] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5054|app: 0|req: 353/4999] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
I dont have any broken pipes or worker respawns.
2) Next requests are running very slow or with some timeout. Looks like that some buffer becomes full and I am waiting before it becomes empty.
3) Some buffer becomes empty.
4) ~500 requests are processed super fast.
5) Some timeout.
6) see Nr. 4
7) see Nr. 5
8) see Nr. 4
9) see Nr. 5
....
....
Need your help
check with netstat and dmesg. You have probably exhausted ephemeral ports or filled the conntrack table.