Deadlock with Perl, IO::Async::Loop and pipe to sendmail - perl

We are seeing stuck sendmail processes when we attempt to send email from a Perl FCGI process. These processes take far too long, hours to a day, when they should just be relaying to a server configured in sendmail as the smart host; most mail from the FCGI processes takes less than 5 seconds. The slow sendmail processes are easy to find on our servers with $ ps -ef | grep sendmail
Almost all of the email from these web nodes works normally; I'd guess thousands of mails go through with no problem, and sending test email from the command line goes smoothly. The sendmail command gets stuck only rarely and we don't have a way to reproduce it.
It seems that most of this stuck email gets through sooner or later, sometimes hours later, sometimes over a day later.
All of the sendmail processes that we've seen stuck were commands run by a Perl process that is a child of an FCGI process.
Looking at the logs of the smart host we see that most of this mail does get through sooner or later, but we have found some that seems never to have been sent.
This runs under FCGI for Catalyst; work is handed to an IO::Async::Loop which does some processing, and inside that loop Email::Sender::Transport::Sendmail is used, which does an open($fh, '|-', @args), pipes in the mail header+body, and then does a close($fh).
I've seen http://perldoc.perl.org/perlipc.html#Avoiding-Pipe-Deadlocks but don't know how to apply it in this situation; the child sendmail has only STDIN open.
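For reference, what the transport does boils down to the classic pipe-open pattern. Here is a minimal sketch (with placeholder $from, $to and $message variables, not the module's exact code):

# Roughly the pattern Email::Sender::Transport::Sendmail uses:
# open a one-way pipe to sendmail, print the message, close the pipe.
# close() is where Perl blocks in waitpid() until sendmail exits --
# exactly where the backtraces below show our process stuck.
open(my $fh, '|-', '/usr/sbin/sendmail', '-i', '-f', $from, '--', $to)
    or die "couldn't fork sendmail: $!";
print {$fh} $message;    # header + body
close($fh)               # flush, then wait for sendmail to exit
    or die "sendmail pipe failed: $! (wait status $?)";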
When we have one of these stuck sendmails, the sendmail is waiting on STDIN:
[<ffffffff8119ce8b>] pipe_wait+0x5b/0x80
[<ffffffff8119d8ad>] pipe_read+0x34d/0x4d0
[<ffffffff8119204a>] do_sync_read+0xfa/0x140
[<ffffffff81192945>] vfs_read+0xb5/0x1a0
[<ffffffff81192c91>] sys_read+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
and the async perl process is waiting on the child to die:
#0 0x00007f8849e6065e in waitpid () from /lib64/libc.so.6
#1 0x000000000046dc2d in Perl_wait4pid ()
#2 0x000000000046de2d in Perl_my_pclose ()
#3 0x00000000004cec4e in Perl_io_close ()
#4 0x00000000004ceda8 in Perl_do_close ()
#5 0x00000000004c2629 in Perl_pp_close ()
#6 0x00000000004804de in Perl_runops_standard ()
#7 0x000000000042e7ad in perl_run ()
#8 0x000000000041bbc5 in main ()
An example of one that didn't get through:
Job #1653576 (that's just our internal job number) has a sendmail process that started on Aug 19 13:04.
Process on webnode2:
fcgi-user 13621 13466 0 13:04 ? 00:00:00 /usr/sbin/sendmail -i -f admin@ourServer.org -- proffunnyhat@mit.edu
I don't see the record I'd expect on our smart host in /var/log/maillog indicating that this was relayed to nexus and then on to MIT.
I do see successful email for proffunnyhat@mit.edu on Aug 21 (from web2 /var/log/maillog):
Aug 21 00:00:02 node-008 sendmail[13621]: u7JH4tbr013621: to=proffunnyhat@mit.edu, ctladdr=admin@ourServer.org (10520/10520), delay=1+10:55:07, xdelay=00:00:01, mailer=relay, pri=32292, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u7L401Z1026237 Message accepted for delivery)
Aug 21 00:00:02 node-008 sendmail[26247]: u7L401Z1026237: to=<proffunnyhat@mit.edu>, delay=00:00:01, xdelay=00:00:00, mailer=relay, pri=122657, relay=mail.ourServer.org. [128.84.4.11], dsn=2.0.0, stat=Sent (u7L402jx001185 Message accepted for delivery)
and then on mail.ourServer.org:
bdc34@mail.ourServer.org:log$ sudo grep u7L402jx001185 maillog*
maillog-20160821:Aug 21 00:00:02 web2 sendmail[1185]: u7L402jx001185: from=<admin@ourServer.org>, size=2874, class=0, nrcpts=1, msgid=<201608191704.u7JH4tbr013621@mail.ourServer.org>, proto=ESMTP, daemon=MTA, relay=mail.ourServer.org [128.84.4.13]
maillog-20160821:Aug 21 00:00:03 mail.ourServer.org[1200]: u7L402jx001185: to=<proffunnyhat@mit.edu>, ctladdr=<e-admin@ourServer.org> (10519/10519), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=122874, relay=dmz-mailsec-scanner-8.mit.edu. [18.7.68.37], dsn=2.0.0, stat=Sent (OK 5E/2D-20045-34729B75)
An example of one that was stuck but seems to have been sent:
mail.ourServer.org:/var/log/sendmail:
Aug 19 02:19:51 mail.ourServer.org sendmail[20792]: u7J6JlP6020790: to=<jxjx@connect.ust.hk>, ctladdr=<admin@ourServer.org> (10519/10519), delay=00:00:04, xdelay=00:00:04, mailer=esmtp, pri=122504, relay=connect-ust-hk.mai...ction.outlook.com. [213.199.154.87], dsn=2.0.0, stat=Sent (<201608190619.u7J6Jlda000738@web2.ourServer.org> [InternalId=15526306777069,...1MB1197.apcprd01.prod.exchangelabs.com] 9137 bytes in 0.189, 47.082 KB/sec Queued mail for delivery)
Things we have tried
I've modified Email::Sender::Transport::Sendmail to send a '\x00' to the pipe, that didn't work.
I've replaced IO::Async::Loop::Poll with IO::Async::Loop::Select. That didn't change anything.
I've tried sending signals to the sendmail and its parent. That killed them but the mail was aborted.
Added our fcgi user to sendmail's trusted users file. Didn't change anything.
I wrote a wrapper script that reads from STDIN and writes to sendmail. If nothing comes in on STDIN for 5 seconds it exits. This feels really hacky to me, but it does seem to work; since mail is a critical part of our system I'd rather have a real solution.
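The wrapper amounts to something like this (a sketch of the approach, not the exact script we run):

#!/usr/bin/perl
# sendmail-wrapper.pl -- forward STDIN to sendmail, but give up if
# STDIN goes quiet for 5 seconds so we never block forever.
use strict;
use warnings;
use IO::Select;

open(my $out, '|-', '/usr/sbin/sendmail', @ARGV)
    or die "can't start sendmail: $!";
my $sel = IO::Select->new(\*STDIN);
while (1) {
    last unless $sel->can_read(5);        # nothing for 5s: assume writer is done
    my $n = sysread(STDIN, my $buf, 8192);
    last unless $n;                       # EOF (or read error)
    print {$out} $buf;
}
close($out) or warn "sendmail exited with wait status $?";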

ikegami's comment led us to the answer: do a double fork. Looking at the signal handlers and file handles set up under FCGI made it clear that excessively clever things were happening, so I cut all ties with the parent process using a double fork, as when starting a daemon. That worked.
# FCGI does some clever signal handling and file-handle setup to do
# its work. This causes problems for forked processes that then
# fork and wait for other processes. So use exec to get a process
# that is not a copy of the parent, and double fork to avoid zombies.
# Check -e of the submit script because $! gives a cryptic message if
# it doesn't exist.
my $script = "$SAFEBINDIR/submit.pl";
unless( -e -r -x $script ){
    $submission->submit_log->error($submission->submission_id . ": $script doesn't exist");
    $c->flash( message => "There was a problem");
    $c->res->redirect( $c->uri_for('/user') );
    $c->detach;
}

# Do the double fork + exec to get an independent process
my $pid;
unless( $pid = fork() ) {                 # this is the child
    unless( fork() ){                     # this is the grandchild
        exec( $script, $submission->submission_id )   # should never return
            or $submission->submit_log->error($submission->submission_id
                . ": Error when trying to exec() $script '$!'");
        exit(0);
    }
    exit(0);   # the child exits immediately so the waitpid() below
               # returns right away and the grandchild is reparented to init
}
waitpid($pid,0);   # reap the child; the grandchild (ppid 1) runs on independently
}

Related

wait synchronously for rsyslog flush to complete

I am running rsyslogd 8.24.0 with a local logfile.
I have a test which runs a program that does some syslog logging (with the entries from my test going to a separate file via an rsyslog.conf rule) and then exits back to a shell script that checks the log has the expected content. This usually works, but it sometimes fails as though the logging hadn't happened. I've added a flush (using the HUP signal) to the shell script before it does the check. I can see that the HUP has happened and that the correct entry is in the log, but the script's check still fails.
Is there a way for the shell script to wait until the flush has completed? I can add an arbitrary sleep but would prefer to have something more definite.
Here are the relevant bits of the shell script:
# Set syslog to send dump_hook's logging to a local logfile...
sudo echo "user.* `pwd`/dump_hook_log" >> /etc/rsyslog.conf
sudo systemctl restart rsyslog.service
echo "" > ./dump_hook_log
# run the test program which does syslog logging
kill -HUP `cat /var/run/syslogd.pid` # flush syslog
if [ $? -ne 0 ]
then
logFail "failed to HUP `cat /var/run/syslogd.pid`: $?"
fi
echo "sent HUP to `cat /var/run/syslogd.pid`"
grep <the string I want> ./dump_hook_log >/dev/null
The string in question is always in dump_hook_log by the time the test has reported failure and I've gone to look at it. I presume the flush simply hasn't completed by the time of the grep.
Here is an example:
In /var/log/messages
2019-01-30T12:13:27.216523+00:00 apx-ont-1 apx_dump_hook[28279]: Failed to open raw dump file "core" (Is a directory)
2019-01-30T12:13:27.216754+00:00 apx-ont-1 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="28185" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mod date of the log file (n.b. this is earlier than the entries it contains!):
-rw-rw-rw- 1 nealec appexenv1_group 2205 2019-01-30 12:13:27.215053296 +0000 testdir_OPT/dump_hook_log
Last line of the log file (only apx_dump_hook entries in here):
2019-01-30T12:13:27.216523+00:00 apx-ont-1 apx_dump_hook[28279]: Failed to open raw dump file "core" (Is a directory)
Script reporting error:
Wed 30 Jan 12:13:27 GMT 2019 PSE Test 0.2b FAILED: 'Failed to open raw dump file' not found in ./dump_hook_log
I think I understand this now. The HUP causes rsyslogd to close its open files, but it doesn't reopen a file until it next needs to log to it.
Consider the following:
I use inotify to wait for a file to close, like this:
case 9:
{
    // Wait for the file, specified in argv[2], to be closed.
    // (Needs <sys/inotify.h>, <unistd.h>, <cstdio>, <cstring>,
    // <cerrno>, <climits> and <cstdlib>.)
    int inotfd = inotify_init();
    if (inotfd < 0) {
        printf("inotify_init failed; errno %d: %s\n",
               errno, strerror(errno));
        exit(99);
    }
    int watch_desc = inotify_add_watch(inotfd, argv[2], IN_CLOSE);
    if (watch_desc < 0) {
        printf("can't watch %s; errno %d: %s\n",
               argv[2], errno, strerror(errno));
        exit(99);
    }
    size_t bufsiz = sizeof(struct inotify_event) + PATH_MAX + 1;
    struct inotify_event* event = static_cast<inotify_event*>(malloc(bufsiz));
    if (!event) {
        printf("Failed to malloc event buffer; errno %d: %s\n",
               errno, strerror(errno));
        exit(99);
    }
    /* wait for an event to occur with a blocking read */
    read(inotfd, event, bufsiz);
}
Then in my shell script I wait for that:
# Start a process that waits for the log file be closed
${bin}/test_dump_hook.exe 9 "./dump_hook_log" &
wait_pid=$!
# Signal syslogd to cause it to close/reopen its log files
kill -HUP `cat /var/run/syslogd.pid` # flush syslog
if [ $? -ne 0 ]
then
logFail "failed to HUP `cat /var/run/syslogd.pid`: $?"
fi
wait $wait_pid
I find this never returns. Sending a HUP to rsyslogd from another process doesn't break it out of the wait either, but a cat of the log file (which does open and close the file) does.
That's because the HUP in the shell script was issued before the other process started waiting. The file was therefore already closed when the watch began, and because nothing more is ever logged to that file it is never reopened, so no close event arrives on any subsequent HUP and the wait never ends.
Having understood this behaviour, how can I be sure that the log has been written before I check it? I've gone with this solution: put a known message into the log and wait until that appears; the entries I'm waiting for must have been written before it. Like this:
function flushSyslog
{
    logger -p user.info -t dump_hook_test "flushSyslog"
    # Signal syslogd to cause it to close its log file
    kill -HUP `cat /var/run/syslogd.pid` # flush syslog
    if [ $? -ne 0 ]
    then
        logFail "failed to HUP `cat /var/run/syslogd.pid`: $?"
    fi
    # wait up to 10 secs for the entry we've just logged to appear
    sleeps=0
    until
        grep "flushSyslog" ./dump_hook_log > /dev/null
    do
        sleeps=$((sleeps+1))
        if [ $sleeps -gt 100 ]
        then
            logFail "failed to flush syslog dump_hook_log"
        fi
        sleep 0.1
    done
}
This seems a bit heavyweight as a solution, but you can use the system's inotify API to wait for the log file to be closed (the result of the HUP signal). For example,
inotifywait -e close ./dump_hook_log
will hang until rsyslogd (or any process) closes the file, when you will get the message
./dump_hook_log CLOSE_WRITE,CLOSE
and the program will exit with return code 0. You can add a timeout with the -t option.
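For example (the 10-second figure is arbitrary):
inotifywait -t 10 -e close ./dump_hook_log
inotifywait exits with status 2 if the timeout expires before the close event arrives.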

perl module Class::HPLOO v0.23 install issue #2

Having the exact issue described at: perl module Class::HPLOO v0.23 install issue, I have attempted to correct the defined(@array) problem by editing it to just (@array) and rebuilding the module. However I continue to get the following:
$ make clean
make: *** No rule to make target `clean:'.  Stop.
$ perl Makefile.PL
$ make
Manifying 2 pod documents
$ make test
PERL_DL_NONLAZY=1 "/opt/local/bin/perl5.26" "-Iblib/lib" "-Iblib/arch" test.pl
1..42
# Running under perl version 5.026002 for darwin
# Current time local: Sun Aug 26 06:48:26 2018
# Current time GMT:   Sat Aug 25 22:48:26 2018
# Using Test.pm version 1.26
not ok 1
# Failed test 1 in test.pl at line 9
#  test.pl line 9 is:  ok(!$@) ;
Can't locate object method "new" via package "Foo" at test.pl line 11.
make: *** [test_dynamic] Error 2
There are three issues with Class::HPLOO (which, as I noted before, hasn't been updated since 2005) that make it fail with modern perls.
As discovered in the previous post, the obsolete construct defined(@array) is used once in `lib/Class/HPLOO.pm` and three times in `lib/Class/HPLOO/Base.pm`. This construct has been prohibited since v5.22.
The current directory (.) is no longer in @INC (as of v5.26). So the lines in test.pl like
require "test/classtest.pm"
either all need to be rewritten as
require "./test/classtest.pm"
or an easier fix is to put
use lib '.';
at the top of the script.
There is a regular expression in lib/Class/HPLOO.pm, line 1077, with an "unescaped left brace"
$sub =~ s/(\S)( {) (\S)/$1$2\n$FIRST_SUB_IDENT $3/gs ;
{ is a regex metacharacter, and since v5.22 it has been illegal to use it in a context where it is not indicating a quantifier. The fix, as the error message suggests, is to escape it.
$sub =~ s/(\S)( \{) (\S)/$1$2\n$FIRST_SUB_IDENT $3/gs ;
Make these three changes to the code you downloaded from CPAN and the module should build on modern Perls. If you're feeling helpful, you can submit a bug report (linking to this post, if you want) or even a patch with an email to bug-Class-HPLOO@rt.cpan.org
I came across this issue today and fixed it following the answer above. If anyone wants to save some time, I created a repo with the changes: https://github.com/swuecho/Class_HPLOO.git

perl Net::Frame::Layer::8021X module not found

Having trouble trying to get past this last dependency for an old Cisco dump.pl script.
https://supportforums.cisco.com/blog/154046
So far I've installed the following dependencies:
apt-get install libclass-gomor-perl libnet-libdnet-perl libnet-pcap-perl libbit-vector-perl libnetpacket-perl
and placed all of his custom perl modules in a Net/ directory
then finally I run
./dump.pl
My output:
*** Net::Frame::Layer::8021X module not found.
*** Either install it (if avail), or implement it.
*** You can also send the pcap file to perl@gomor.org.
Frame number: 11 (length: 60)
Frame NUMBER 11 SSL FOUND, preparing SSL payload and crafting TCP packet
*** Net::Frame::Layer::8021X module not found.
*** Either install it (if avail), or implement it.
*** You can also send the pcap file to perl@gomor.org.
Frame number: 12 (length: 64)
Frame NUMBER 12 SSL FOUND, preparing SSL payload and crafting TCP packet
*** Net::Frame::Layer::8021X module not found.
*** Either install it (if avail), or implement it.
*** You can also send the pcap file to perl@gomor.org.
Frame number: 13 (length: 61)
Frame NUMBER 13 SSL FOUND, preparing SSL payload and crafting TCP packet
After debugging using perl -d dump.pl, I saw that it actually requires both the Net::Frame::Layer::ETH and Net::Frame::Layer::8021X modules. The former I have; the latter is missing.
Net::Frame::Simple::unpack(Net/Frame/Simple.pm:98):
98: for (1..1000) {
DB<4> n
Net::Frame::Simple::unpack(Net/Frame/Simple.pm:99):
99: last unless $raw;
DB<4> n
Net::Frame::Simple::unpack(Net/Frame/Simple.pm:101):
101: $encapsulate =~ s/[^-:\w]//g; # Fix potential code injection
DB<4> n
Net::Frame::Simple::unpack(Net/Frame/Simple.pm:102):
102: my $layer = 'Net::Frame::Layer::'.$encapsulate;
DB<4> n
Net::Frame::Simple::unpack(Net/Frame/Simple.pm:103):
103: eval "require $layer";
DB<4> n
So I installed these dependencies:
apt-get install cpanminus
cpanm Net::Frame
cpanm Net::Frame::Simple
cpanm Net::Frame::Layer
cpanm Socket6
I'm still hitting the same issue.
With additional debugging it seems that the issue is with raw => $h->{raw}, because I ran print Net::Frame::Simple->new( firstLayer => $h->{firstLayer}, timestamp => $h->{timestamp} ) and print $h->{raw} in the debugger without triggering the issue.
main::(dump.pl:44): my $f = Net::Frame::Simple->new(
main::(dump.pl:45): raw => $h->{raw},
main::(dump.pl:46): firstLayer => $h->{firstLayer},
main::(dump.pl:47): timestamp => $h->{timestamp},
main::(dump.pl:48): );
It will write out an ssl.pcap file, but it will not contain the raw packet data, just the firstLayer (e.g. ETH) and the UNIX timestamp. I also made sure that I'm using the local modules inside the Net/ directory by replacing all of the use statements with require and replacing the :: with /.
It still complains when trying to require Net::Frame::Layer::8021X, which doesn't seem to exist anywhere on the Internet...
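As a quick sanity check that the module really can't be loaded the way Net::Frame::Simple loads layers (see the eval "require $layer" line in the debugger output above), something along these lines may help; the filename form of require searches @INC just like the bareword form:
perl -e 'eval { require "Net/Frame/Layer/8021X.pm" }; print $@ ? $@ : "loaded ok\n"'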

WHM/cPanel Copy Multiple Accounts/Packages From Another Server failing for each account?

Just bought a new server that runs WHM/cPanel, same as the old one. I'm trying to use the built-in tool to migrate multiple accounts/packages over. I'm able to connect to the other server, it lists out all the packages & accounts, I select all and start the process.
Then it goes through each package and account and fails to copy anything over. This is the error given for a sample account:
Command failed with exit status 255
...etc...
Copying Suspension Info (if needed)...Done
Copying SSL certificates, CSRs, and keys...Privilege de-escalation before loading datastore either failed or was omitted. at /usr/local/cpanel/Cpanel/SSLStorage.pm line 1159
Cpanel::SSLStorage::_load_datastore('Cpanel::SSLStorage::Installed=HASH(0x3c72300)', 'lock', 1) called at /usr/local/cpanel/Cpanel/SSLStorage.pm line 1244
Cpanel::SSLStorage::_load_datastore_rw('Cpanel::SSLStorage::Installed=HASH(0x3c72300)') called at /usr/local/cpanel/Cpanel/SSLStorage/Installed.pm line 634
Cpanel::SSLStorage::Installed::_rebuild_records('Cpanel::SSLStorage::Installed=HASH(0x3c72300)') called at /usr/local/cpanel/Cpanel/SSLStorage.pm line 308
Cpanel::SSLStorage::__ANON__() called at /usr/local/cpanel/Cpanel/SSLStorage.pm line 1330
Cpanel::SSLStorage::_execute_coderef('Cpanel::SSLStorage::Installed=HASH(0x3c72300)', 'CODE(0x49ee958)') called at /usr/local/cpanel/Cpanel/SSLStorage.pm line 310
Cpanel::SSLStorage::rebuild_records('Cpanel::SSLStorage::Installed=HASH(0x3c72300)') called at /usr/local/cpanel/scripts/pkgacct line 2888
Script::Pkgacct::__ANON__('Cpanel::SSLStorage::Installed=HASH(0x3c72300)') called at /usr/local/cpanel/scripts/pkgacct line 2913
Script::Pkgacct::backup_ssl_for_user('jshea89', '/home/webwizard/cpmove-jshea89') called at /usr/local/cpanel/scripts/pkgacct line 532
Script::Pkgacct::script('Script::Pkgacct', '--use_backups', '--skiphomedir', 'jshea89', '/home/webwizard', '--split', '--compressed', '--mysql', 5.5, ...) called at /usr/local/cpanel/scripts/pkgacct line 111
==sshcontroloutput==
sh-4.1# exit $RET
exit
sh-4.1$ exit $RET
exit
sshcommandfailed=255
A bit of a hack, but I went to /usr/local/cpanel/Cpanel/SSLStorage.pm line 1244 and commented out the Carp.
Accounts from my old dead server are now archiving :)
After some research, I was able to determine that this was caused by incorrect ownership on the /home/user/ssl directory and its subdirectories. Someone had set the owner and group to root:root, when in fact it should have been user:user.
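For example (substituting the real account name and home directory for user), restoring the expected ownership looks like:
chown -R user:user /home/user/ssl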
Hopefully this helps some of you solve the problem!

Apache children hanging / mod perl

The apache children on my server (ubuntu 12.04 upgraded from 11.10, apache 2.2.22, perl 5.14.2, mod_perl 2.0.5) are hanging.
I tried to catch the USR2 and ALRM signals, but without success (when using sleep for testing it works as expected, but when the program hangs by itself no output is given):
sub handler : method {
    my $mask = POSIX::SigSet->new(&POSIX::SIGUSR2, &POSIX::SIGALRM);
    my $oldaction_usr2  = POSIX::SigAction->new();
    my $oldaction_alarm = POSIX::SigAction->new();
    my $action = POSIX::SigAction->new(sub {
        Carp::confess("hm caught SIGUSR2 or ALARM DEAD LOCK YOU can run but not hide!");
    }, $mask, &POSIX::SA_NODEFER);
    POSIX::sigaction(&POSIX::SIGUSR2, $action, $oldaction_usr2);
    POSIX::sigaction(&POSIX::SIGALRM, $action, $oldaction_alarm);
    alarm(30); # max 30 seconds per request
So I used Apache's server-status to get the pid of a child that is hanging (its CPU time is not increasing; only SS, the seconds since the beginning of the most recent request, keeps growing).
Then I attached gdb to that pid to get a backtrace:
(gdb) bt
#0 0x00007fc4610fb606 in myck_entersub (my_perl=0x7fc47f7f63e0, op=0x7fc484b40910) at lib/Params/Classify.xs:682
#1 0x00007fc477a67abd in Perl_convert () from /usr/lib/libperl.so.5.14
#2 0x00007fc477a6f769 in Perl_utilize () from /usr/lib/libperl.so.5.14
#3 0x00007fc477a9daef in Perl_yyparse () from /usr/lib/libperl.so.5.14
#4 0x00007fc477b1635d in ?? () from /usr/lib/libperl.so.5.14
The problem is I have no idea what this means or how to fix it.
In the mod_perl 1 guide I found:
% gdb httpd <pid of spinning process>
(gdb) where
(gdb) source mod_perl-x.xx/.gdbinit
(gdb) curinfo
but I don't know where .gdbinit is located or which package I need to install, or whether I need to build this file myself from source (maybe with Devel::DebugInit::GDB)?
The problem may be "Params::Classify," which is not thread-safe. See:
https://bugs.launchpad.net/ubuntu/+source/libmodule-runtime-perl/+bug/991650
mod_perl script going in to a tight loop during 'use' processing
http://www.perlmonks.org/?node_id=886909
The author of Params::Classify acknowledged the problem in November 2011 but has not released a fix.
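Since the crash happens at compile time (myck_entersub runs while Perl_yyparse is compiling a use statement), one mitigation worth trying, and this is an assumption on my part rather than a confirmed fix, is to preload the affected modules once in the Apache parent via a startup.pl, so the child interpreters don't each compile them per request:

# startup.pl -- loaded once in the parent, e.g. with
#   PerlRequire /etc/apache2/startup.pl
# in httpd.conf. Preloading here is a guessed mitigation, NOT a
# confirmed fix for the Params::Classify thread-safety bug.
use strict;
use warnings;
use Params::Classify ();   # the suspect XS module
use Module::Runtime ();    # the packaged module the Ubuntu bug is filed
                           # against; it pulls in Params::Classify
1;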