The Apache children on my server (Ubuntu 12.04 upgraded from 11.10, Apache 2.2.22, Perl 5.14.2, mod_perl 2.0.5) are hanging.
I tried to catch the USR2 and ALRM signals, but without success: when I use sleep for testing it works as expected, but when the program hangs by itself no output is produced.
sub handler : method{
my $mask = POSIX::SigSet->new(&POSIX::SIGUSR2, &POSIX::SIGALRM);
my $oldaction_usr2 = POSIX::SigAction->new();
my $oldaction_alarm = POSIX::SigAction->new();
my $action = POSIX::SigAction->new(sub {
Carp::confess("hm caught SIGUSR2 or ALARM DEAD LOCK YOU can run but not hide!");
},$mask,&POSIX::SA_NODEFER);
POSIX::sigaction(&POSIX::SIGUSR2,$action, $oldaction_usr2);
POSIX::sigaction(&POSIX::SIGALRM,$action, $oldaction_alarm);
alarm(30); #max 30 seconds per request
So I used the Apache status page to get the PID of the hanging child (its CPU time is not increasing; only SS, the seconds since the beginning of the most recent request, goes up).
Then I attached gdb to that PID to get a backtrace:
(gdb) bt
#0 0x00007fc4610fb606 in myck_entersub (my_perl=0x7fc47f7f63e0, op=0x7fc484b40910) at lib/Params/Classify.xs:682
#1 0x00007fc477a67abd in Perl_convert () from /usr/lib/libperl.so.5.14
#2 0x00007fc477a6f769 in Perl_utilize () from /usr/lib/libperl.so.5.14
#3 0x00007fc477a9daef in Perl_yyparse () from /usr/lib/libperl.so.5.14
#4 0x00007fc477b1635d in ?? () from /usr/lib/libperl.so.5.14
The problem is that I have no idea how to fix this or what it means.
In the mod_perl 1 guide I found:
% gdb httpd <pid of spinning process>
(gdb) where
(gdb) source mod_perl-x.xx/.gdbinit
(gdb) curinfo
but I don't know where .gdbinit is located or which package I need to install.
Or do I need to build this file myself from source (maybe with Devel::DebugInit::GDB)?
The problem may be "Params::Classify," which is not thread-safe. See:
https://bugs.launchpad.net/ubuntu/+source/libmodule-runtime-perl/+bug/991650
mod_perl script going in to a tight loop during 'use' processing
http://www.perlmonks.org/?node_id=886909
The author of Params::Classify acknowledged the problem in November 2011 but has not released a fix.
We are seeing stuck sendmail processes when we attempt to send email from a Perl FCGI process. These processes take far too long, from hours to a day, even though they should just relay to a server configured in sendmail as the smart host. Most of the mail from the FCGI processes takes less than 5 seconds. The slow sendmail processes are easy to find on our servers with $ ps -ef | grep sendmail
Almost all of the email works normally from these web nodes. I'd guess thousands of mails go through with no problem. Sending test email from the command line goes smoothly. The sendmail command gets stuck rarely and we don't have a way to reproduce it.
Most of this stuck email seems to get through sooner or later, but it is sent hours later, sometimes over a day later.
All of the sendmail that we've seen stuck has been a command that was run by a Perl process, which is a child process of a FCGI process.
Looking at the logs of the smart host we see that most of this mail does get through sooner or later but we have found some that don't seem to have ever been sent.
This runs under FCGI for Catalyst; the work is then added to an IO::Async::Loop, which does some processing. Inside the IO::Async::Loop, Email::Sender::Transport::Sendmail is used; it does an open($fh, '|-', @args), pipes the mail header and body, and does a close($fh).
I've seen this http://perldoc.perl.org/perlipc.html#Avoiding-Pipe-Deadlocks but don't know how to apply it in this situation. The child sendmail has only STDIN open.
When we have one of these stuck sendmails the sendmail is waiting on STDIN:
[<ffffffff8119ce8b>] pipe_wait+0x5b/0x80
[<ffffffff8119d8ad>] pipe_read+0x34d/0x4d0
[<ffffffff8119204a>] do_sync_read+0xfa/0x140
[<ffffffff81192945>] vfs_read+0xb5/0x1a0
[<ffffffff81192c91>] sys_read+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
and the async perl process is waiting on the child to die:
#0 0x00007f8849e6065e in waitpid () from /lib64/libc.so.6
#1 0x000000000046dc2d in Perl_wait4pid ()
#2 0x000000000046de2d in Perl_my_pclose ()
#3 0x00000000004cec4e in Perl_io_close ()
#4 0x00000000004ceda8 in Perl_do_close ()
#5 0x00000000004c2629 in Perl_pp_close ()
#6 0x00000000004804de in Perl_runops_standard ()
#7 0x000000000042e7ad in perl_run ()
#8 0x000000000041bbc5 in main ()
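For readers unfamiliar with this picture: a reader on a pipe only sees end-of-file once every copy of the write end has been closed. The sketch below is a minimal illustration in Python of that rule, not a diagnosis of this system. The function names are made up for the example, and the "leaked" descriptor merely stands in for a copy of the write end held by some other process, which is an assumption.

```python
import os
import select


def reader_sees_eof(read_fd, timeout=0.2):
    """Return True if a read on read_fd would return EOF right now."""
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return False  # nothing to read and no EOF: a blocking read would hang
    return os.read(read_fd, 1) == b""


def demo_leaked_write_end():
    """Show that EOF arrives only after the last write-end copy is closed."""
    read_fd, write_fd = os.pipe()
    leaked = os.dup(write_fd)  # hypothetical copy held elsewhere
    os.close(write_fd)         # the writer closes its own handle
    before = reader_sees_eof(read_fd)
    os.close(leaked)           # the last copy goes away
    after = reader_sees_eof(read_fd)
    os.close(read_fd)
    return before, after
```

Calling demo_leaked_write_end() reports whether the reader saw EOF before and after the extra descriptor was closed. If something similar is happening here, the reader would keep waiting on STDIN while the writer waits in waitpid.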
An example of one that didn't get through:
Job #1653576 (that's just our internal job number) has a sendmail process that started on Aug 19 13:04.
Process on webnode2:
fcgi-user 13621 13466 0 13:04 ? 00:00:00 /usr/sbin/sendmail -i -f admin@ourServer.org -- proffunnyhat@mit.edu
I don't see the record I expect to see on our smart host for this in /var/log/maillog that would indicate that it was relayed to nexus and then to MIT.
I do see successful email for proffunnyhat@mit.edu on Aug 21 (from web2 /var/log/maillog):
Aug 21 00:00:02 node-008 sendmail[13621]: u7JH4tbr013621: to=proffunnyhat@mit.edu, ctladdr=admin@ourServer.org (10520/10520), delay=1+10:55:07, xdelay=00:00:01, mailer=relay, pri=32292, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u7L401Z1026237 Message accepted for delivery)
Aug 21 00:00:02 node-008 sendmail[26247]: u7L401Z1026237: to=<proffunnyhat@mit.edu>, delay=00:00:01, xdelay=00:00:00, mailer=relay, pri=122657, relay=mail.ourServer.org. [128.84.4.11], dsn=2.0.0, stat=Sent (u7L402jx001185 Message accepted for delivery)
and then on mail.ourServer.org:
bdc34@mail.ourServer.org: log$ sudo grep u7L402jx001185 maillog*
maillog-20160821:Aug 21 00:00:02 web2 sendmail[1185]: u7L402jx001185: from=<admin@ourServer.org>, size=2874, class=0, nrcpts=1, msgid=<201608191704.u7JH4tbr013621@mail.ourServer.org>, proto=ESMTP, daemon=MTA, relay=mail.ourServer.org [128.84.4.13]
maillog-20160821:Aug 21 00:00:03 mail.ourServer.org[1200]: u7L402jx001185: to=<proffunnyhat@mit.edu>, ctladdr=<e-admin@ourServer.org> (10519/10519), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=122874, relay=dmz-mailsec-scanner-8.mit.edu. [18.7.68.37], dsn=2.0.0, stat=Sent (OK 5E/2D-20045-34729B75)
An example of one that was stuck but seems to have been sent:
mail.ourServer.org:/var/log/sendmail:
Aug 19 02:19:51 mail.ourServer.org sendmail[20792]: u7J6JlP6020790: to=<jxjx@connect.ust.hk>, ctladdr=<admin@ourServer.org> (10519/10519), delay=00:00:04, xdelay=00:00:04, mailer=esmtp, pri=122504, relay=connect-ust-hk\
.mai...ction.outlook.com. [213.199.154.87], dsn=2.0.0, stat=Sent (<201608190619.u7J6Jlda000738@web2.ourServer.org> [InternalId=15526306777069,...1MB1197.apcprd01.prod.exchangelabs.com] 9137 bytes in 0.189, 47.082 KB/sec\
Queued mail for delivery)
Things we have tried
I've modified Email::Sender::Transport::Sendmail to send a '\x00' to the pipe, that didn't work.
I've replaced IO::Async::Loop::Poll with IO::Async::Loop::Select. That didn't change anything.
I've tried sending signals to the sendmail and its parent. That killed them but the mail was aborted.
Added our fcgi user to sendmail's trusted users file. Didn't change anything.
I wrote a wrapper script that reads from STDIN and writes to sendmail. If nothing comes in on STDIN for 5 seconds it exits. This feels really hacky to me but it does seem to work. Since mail is a critical part of our system I'd rather have a real solution.
ikegami's comment led us to the answer: doing a double fork. Looking at the signal handlers and file handles set up under FCGI made it clear that excessively clever things were happening. So I moved to cut all ties with the parent process using a double fork, as when starting a daemon. That worked.
# FCGI does some clever signal handling and file handle setup to do
# its work. This causes problems with forked processes that then
# fork and wait for other processes. So use exec to get a process
# that is not a copy of the parent. Double fork to avoid zombies.
# Check the submit script first because $! has a cryptic message if it doesn't exist
my $script = "$SAFEBINDIR/submit.pl";
unless( -e -r -x $script ){
    $submission->submit_log->error($submission->submission_id . ": $script doesn't exist or isn't readable and executable");
    $c->flash( message => "There was a problem");
    $c->res->redirect( $c->uri_for('/user') );
    $c->detach;
}
# Do the double fork + exec to have an independent process
my $pid;
unless( $pid = fork() ) { # this is the child
    unless( fork() ){ # this is the grandchild
        exec( $script, $submission->submission_id ) # should never return
            or $submission->submit_log->error($submission->submission_id
                . ": Error when trying to exec() $script '$!'");
        exit(0); # only reached if exec() failed
    }
    exit(0); # the child must exit at once so the grandchild is orphaned
}
waitpid($pid,0); # wait for child, grandchild will get ppid 1
}
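The same double-fork idea can be sketched outside Perl. The following is a minimal Python illustration under stated assumptions, not the code used in the application: spawn_detached is a hypothetical helper name, and the exit code 127 for a failed exec is only a convention borrowed from shells.

```python
import os


def spawn_detached(argv):
    """Run argv in a grandchild process and reap the intermediate child.

    Returns the wait status of the intermediate child, which exits right
    away so the grandchild is re-parented and never becomes our zombie.
    """
    pid = os.fork()
    if pid == 0:
        # Intermediate child: fork once more, then leave immediately.
        try:
            grandchild = os.fork()
        except OSError:
            os._exit(1)
        if grandchild == 0:
            # Grandchild: replace the process image; no return on success.
            try:
                os.execvp(argv[0], argv)
            except OSError:
                pass
            os._exit(127)  # only reached if exec failed
        os._exit(0)
    # Parent: wait for the short-lived child only, not for the real work.
    _, status = os.waitpid(pid, 0)
    return status
```

For example, spawn_detached(["sleep", "60"]) returns almost immediately while the sleep keeps running with no parent left to wait for it. os._exit is used in the forked branches so that neither copy runs the parent's cleanup handlers.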
Is there a way to track down who sends SIGUSR1?
This is for a MEX file built on Ubuntu, which hangs under certain conditions. So I did:
1. mex -g *.c (create the .mex file)
2. matlab -Dgdb
3. handle SIGUSR1 stop print
4. run -nojvm (run MATLAB without the GUI)
5. dbmex on
6. run my executable
Then it prints:
MEX FILE: /home/X/Desktop/Test/test.mexa64 entry point located at address 0xd11ea144
Add breakpoints at the debugger prompt and issue a "continue" to resume execution of MATLAB
If I do "continue", my executable runs and then hangs again (I think the same way as before).
I tried "bt" and "where", but I still have no clue where the SIGUSR1 comes from or why it hangs.
For "where", I get:
#0 0x00007ffff5962ca4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007ffff69d7015 in mcr_run_main(boost::function0<int> const&, bool, bool)() from /usr/local/MATLAB/R2013b/bin/glnxa64/libmwmcr.so
#2 0x0000000000405291 in ?? ()
#3 0x00007ffff55b0ea5 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x0000000000405489 in ?? ()
#5 0x00007fffffffded8 in ?? ()
#6 0x000000000000001c in ?? ()
#7 0x0000000000000002 in ?? ()
#8 0x00007fffffffe208 in ?? ()
#9 0x00007fffffffe234 in ?? ()
#10 0x000000000000000 in ?? ()
Can anyone point out the correct way to track down the SIGUSR1 signal (which I think causes my executable to hang)? Thanks a lot!
Updates:
Set breakpoints, as suggested, in the source files before and around the suspicious code, then continue to track down the bug.
LJ
When stopped at a signal, you can print $_siginfo, and then examine the fields of this object to find the PID of the process that sent the signal.
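In gdb on Linux the sender's PID is typically reachable as $_siginfo._sifields._kill.si_pid. The same siginfo data can be read programmatically. The sketch below is a minimal Python illustration, assuming Linux and Python 3.3 or later (signal.sigwaitinfo is not available on every platform). The function name is made up for the example, and the process signals itself only to keep the demo self-contained.

```python
import os
import signal
import threading


def sender_pid_of_self_signal(signum=signal.SIGUSR1):
    """Send signum to the current thread and return the sender PID from siginfo."""
    # Block the signal first so it stays pending instead of killing the process.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        signal.pthread_kill(threading.get_ident(), signum)
        info = signal.sigwaitinfo({signum})  # dequeue it together with its siginfo
        return info.si_pid
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Here the returned PID is the process's own, because it signalled itself. For a signal sent by another process with kill, si_pid holds that other process's PID, which can then be looked up with ps.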
I run gammu-smsd:
# gammu-smsd
Log filename is "/var/log/gammu-smsd.log"
Next I send an SMS via gammu-smsd-inject:
# echo sms bla bla bla | gammu-smsd-inject TEXT 123456789
gammu-smsd-inject[2050]: Warning: No PIN code in /etc/gammu-smsdrc file
gammu-smsd-inject[2050]: Created outbox message OUTC20151124_121117_00_796996999_sms0.smsbackup
Written message with ID /var/spool/gammu/outbox/OUTC20151124_121117_00_796996999_sms0.smsbackup
And then... 1 minute, 5 minutes, 15 minutes, and nothing. So I interrupt gammu-smsd with ^\ and start it again:
# gammu-smsd
Log filename is "/var/log/gammu-smsd.log"
And now I have in /var/log/gammu-smsd.log:
Tue 2015/11/24 12:17:07 gammu-smsd[2074]: Warning: No PIN code in /etc/gammu-smsdrc file
Tue 2015/11/24 12:17:07 gammu-smsd[2074]: Created POSIX RW shared memory at 0xb6fcc000
Tue 2015/11/24 12:17:07 gammu-smsd[2074]: Starting phone communication...
Tue 2015/11/24 12:17:17 gammu-smsd[2074]: Read 1 messages
Tue 2015/11/24 12:17:18 gammu-smsd[2074]: Message without SMSC, assuming you want to use the one from phone
Tue 2015/11/24 12:17:19 gammu-smsd[2074]: Transmitted OUTC20151124_121117_00_123456789_sms0.smsbackup (total: 1) to 123456789, message reference 0x1b
Tue 2015/11/24 12:17:25 gammu-smsd[2074]: Read 1 messages
My configuration /etc/gammu-smsdrc:
# Configuration file for Gammu SMS Daemon
# Gammu library configuration, see gammurc(5)
[gammu]
port = /dev/huawei
model = at
connection = at19200
synchronizetime = yes
# SMSD configuration, see gammu-smsdrc(5)
[smsd]
service = files
logfile = /var/log/gammu-smsd.log
#debuglevel = 255
commtimeout = 10
sendtimeout = 20
deliveryreport = log
transmitformat = auto
# Paths where messages are stored
inboxpath = /var/spool/gammu/inbox/
outboxpath = /var/spool/gammu/outbox/
sentsmspath = /var/spool/gammu/sent/
errorsmspath = /var/spool/gammu/error/
So what am I doing wrong?
--- EDIT ---
I have removed the gammu installed via apt-get, downloaded the newest gammu from the wammu.eu website, and compiled it following the instructions. So now:
# gammu version
[Gammu version 1.36.6]
...
And
# gammu-detect
; Configuration file generated by gammu-detect.
; Please check The Gammu Manual for more information.
[gammu]
device = /dev/ttyUSB0
name = Phone on USB serial port HUAWEI_MOBILE HUAWEI_MOBILE
connection = at
[gammu1]
device = /dev/ttyUSB1
name = Phone on USB serial port HUAWEI_MOBILE HUAWEI_MOBILE
connection = at
opening socket: No such device
Here /dev/huawei is a symlink created by ln -s /dev/ttyUSB0
Now I typed gammu identify to check my device, and after 1 hour I interrupted it because it was waiting for something; I don't know what.
Below is the backtrace from gdb:
# gdb --args gammu --identify
GNU gdb (Raspbian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gammu...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/local/bin/gammu --identify
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
^C
Program received signal SIGINT, Interrupt.
0xb6d674ec in select () at ../sysdeps/unix/syscall-template.S:81
81 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 0xb6d674ec in select () at ../sysdeps/unix/syscall-template.S:81
#1 0xb6f32968 in serial_read () from /usr/local/lib/libGammu.so.7
#2 0xb6e95c8c in GSM_ReadDevice () from /usr/local/lib/libGammu.so.7
#3 0xb6e95dcc in GSM_WaitForOnce () from /usr/local/lib/libGammu.so.7
#4 0xb6e95ef0 in GSM_WaitFor () from /usr/local/lib/libGammu.so.7
#5 0xb6edda2c in ATGEN_Initialise () from /usr/local/lib/libGammu.so.7
#6 0xb6e94f20 in GSM_TryGetModel () from /usr/local/lib/libGammu.so.7
#7 0xb6e95518 in GSM_InitConnection_Log () from /usr/local/lib/libGammu.so.7
#8 0x00000000 in ?? ()
(gdb)
According to this thread you need to use the /dev/serial/by-id/ path to the USB device.
e.g. port = /dev/serial/by-id/usb-Standard_USB_USB_2.0-if01
See http://comments.gmane.org/gmane.linux.drivers.gammu/9719
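To find which by-id name corresponds to which ttyUSB device, the symlinks can be resolved. This is a small Python sketch, assuming a Linux system where udev populates /dev/serial/by-id; stable_serial_paths is a hypothetical helper name, not part of gammu.

```python
import os


def stable_serial_paths(by_id_dir="/dev/serial/by-id"):
    """Map each by-id symlink to the device node it currently points at."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping  # no serial devices present, or not a udev system
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        mapping[link] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for link, target in stable_serial_paths().items():
        print(link, "->", target)
```

The left-hand path is the one to put after port = in /etc/gammu-smsdrc, since it should stay the same even if the ttyUSB numbering changes.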
I'm attempting to debug a manual dump file of a 64bit w3wp process with 64bit Windbg (Version 6.10). The dump was taken with taskmgr. I can't get anything from the !clrstack command. Here is what I'm getting:
!loadby sos clr
!runaway
User Mode Time
Thread Time
17:cf4 0 days 5:37:42.455
~17s
ntdll!ZwDelayExecution+0xa:
00000000`776208fa c3 ret
!clrstack
GetFrameContext failed: 1
What is GetFrameContext failed: 1?
Use !dumpstack command instead of !clrstack. It usually works.
Try getting the "native" call stack by doing "k" and see what that gets you. Sometimes, the stack isn't quite right and the !ClrStack extension is pretty sensitive.
When I debug a program, I get too many lines of useless output in GDB. This information may come from the iPhone framework; it is not logged by my code. It looks like this:
Node 48 TrialMT(102,102,101,101)
Node 58 TrialMT(102,102,101,101)
Node 69 TrialMT(102,102,101,101)
Node 72 TrialMT(102,102,101,101)
There is just too much of it, so I cannot find my own log output.
I want to know whether there is a way to export the GDB log to a file, so I can find my log output in the file later.
Thanks
In Xcode you can type GDB commands in the debugger console. There you can point the stdout and stderr file descriptors at your preferred log file like this:
(gdb) call (void)close(1)
(gdb) call (void)close(2)
(gdb) call (int)open("/tmp/out.log", 0x201, 0644)
$1 = 1
(gdb) call (int)dup(1)
$2 = 2
(gdb) continue
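The trick relies on open and dup each returning the lowest free descriptor, so after closing 1 and 2 they come back as 1 and 2. Note that the literal 0x201 is O_WRONLY|O_CREAT on Darwin; on Linux the same number means O_WRONLY|O_TRUNC, so named constants are safer. Below is a minimal Python sketch of the same sequence, run in a forked child so the caller's own output is untouched. Both function names are hypothetical, and it assumes descriptor 0 is open.

```python
import os


def redirect_stdio(path):
    """Mirror close(1); close(2); open(path); dup(1) from the gdb session."""
    os.close(1)
    os.close(2)
    out_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)  # lowest free fd
    err_fd = os.dup(out_fd)                                  # next lowest free fd
    return out_fd, err_fd


def run_redirected_in_child(path):
    """Fork, redirect the child's stdout and stderr to path, write one line to each."""
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            fds = redirect_stdio(path)
            os.write(1, b"to stdout\n")
            os.write(2, b"to stderr\n")
            code = 0 if fds == (1, 2) else 1
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Because dup shares the file offset with the original descriptor, output written to 1 and 2 lands in the file in order rather than overwriting each other.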