Wait synchronously for rsyslog flush to complete

I am running rsyslogd 8.24.0 with a local logfile.
I have a test which runs a program that does some syslog logging (entries from my test go to a separate file via an rsyslog.conf setting) and then exits back to a shell script, which checks that the log has the expected content. This usually works but sometimes fails as though the logging hadn't happened. I've added a flush (using a HUP signal) to the shell script before it does the check. I can see that the HUP has happened and that the correct entry is in the log, but the script's check still fails.
Is there a way for the shell script to wait until the flush has completed? I can add an arbitrary sleep but would prefer to have something more definite.
Here are the relevant bits of the shell script:
# Set syslog to send dump_hook's logging to a local logfile...
echo "user.* $(pwd)/dump_hook_log" | sudo tee -a /etc/rsyslog.conf  # sudo must apply to the write, not just the echo
sudo systemctl restart rsyslog.service
echo "" > ./dump_hook_log
# run the test program which does syslog logging
kill -HUP "$(cat /var/run/syslogd.pid)" # flush syslog
rc=$?
if [ $rc -ne 0 ]
then
    logFail "failed to HUP $(cat /var/run/syslogd.pid): $rc"
fi
echo "sent HUP to $(cat /var/run/syslogd.pid)"
grep "<the string I want>" ./dump_hook_log >/dev/null
The string in question is always in the dump_hook_log by the time the test has reported failure and I've gone to look at it. I presume the flush simply hasn't completed by the time of the grep.
Here is an example:
In /var/log/messages
2019-01-30T12:13:27.216523+00:00 apx-ont-1 apx_dump_hook[28279]: Failed to open raw dump file "core" (Is a directory)
2019-01-30T12:13:27.216754+00:00 apx-ont-1 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="28185" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Mod date of the log file (n.b. this is earlier than the entries it contains!):
-rw-rw-rw- 1 nealec appexenv1_group 2205 2019-01-30 12:13:27.215053296 +0000 testdir_OPT/dump_hook_log
Last line of the log file (only apx_dump_hook entries in here):
2019-01-30T12:13:27.216523+00:00 apx-ont-1 apx_dump_hook[28279]: Failed to open raw dump file "core" (Is a directory)
Script reporting error:
Wed 30 Jan 12:13:27 GMT 2019 PSE Test 0.2b FAILED: 'Failed to open raw dump file' not found in ./dump_hook_log

I think I understand this now. The HUP causes rsyslogd to close its open files but it doesn’t reopen a file until it needs to log to it.
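One way to observe this from the shell (a sketch, assuming a Linux /proc filesystem and the pid file used above; listing a root-owned process's fds needs root):
syslog_pid=$(cat /var/run/syslogd.pid)
sudo ls -l /proc/$syslog_pid/fd | grep dump_hook_log    # listed: rsyslogd still has it open
sudo kill -HUP $syslog_pid
sudo ls -l /proc/$syslog_pid/fd | grep dump_hook_log    # no output: closed by the HUP, not reopened until the next write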
Consider the following:
I use inotify to wait for a file to close, like this:
case 9:
{
    // Wait for the file, specified in argv[2], to be closed.
    // Needs <sys/inotify.h>, <limits.h>, <errno.h>, <string.h>, <stdio.h>, <stdlib.h>, <unistd.h>.
    int inotfd = inotify_init();
    if (inotfd < 0) {
        printf("inotify_init failed; errno %d: %s\n",
               errno, strerror(errno));
        exit(99);
    }
    int watch_desc = inotify_add_watch(inotfd, argv[2], IN_CLOSE);
    if (watch_desc < 0) {
        printf("can't watch %s; errno %d: %s\n",
               argv[2], errno, strerror(errno));
        exit(99);
    }
    size_t bufsiz = sizeof(struct inotify_event) + PATH_MAX + 1;
    struct inotify_event* event = static_cast<inotify_event*>(malloc(bufsiz));
    if (!event) {
        printf("failed to malloc event buffer; errno %d: %s\n",
               errno, strerror(errno));
        exit(99);
    }
    /* wait for a close event to occur with a blocking read */
    if (read(inotfd, event, bufsiz) < 0) {
        printf("read of inotify event failed; errno %d: %s\n",
               errno, strerror(errno));
        exit(99);
    }
}
Then in my shell script I wait for that:
# Start a process that waits for the log file to be closed
${bin}/test_dump_hook.exe 9 "./dump_hook_log" &
wait_pid=$!
# Signal syslogd to make it close/reopen its log files
kill -HUP "$(cat /var/run/syslogd.pid)" # flush syslog
rc=$?
if [ $rc -ne 0 ]
then
    logFail "failed to HUP $(cat /var/run/syslogd.pid): $rc"
fi
wait $wait_pid
I find this never returns. Sending a HUP to rsyslogd from another process doesn't break it out of the wait either, but a cat of the log file (which does open and close it) does.
That's because the HUP earlier in the shell script had already been delivered before the waiting process started. The file was therefore already closed at the start of the wait, and because there is no more logging to that file it is never reopened, so it has nothing to close when any subsequent HUP is received and the event that would end the wait never occurs.
Having understood this behaviour, how can I be sure that the log has been written before I check it? I've gone with this solution: put a known message into the log and wait until that appears; I know the entries I'm waiting for must have been written before it. Like this:-
function flushSyslog
{
    logger -p user.info -t dump_hook_test "flushSyslog"
    # Signal syslogd to make it close its log file
    kill -HUP "$(cat /var/run/syslogd.pid)" # flush syslog
    rc=$?
    if [ $rc -ne 0 ]
    then
        logFail "failed to HUP $(cat /var/run/syslogd.pid): $rc"
    fi
    # wait up to 10 secs for the entry we've just logged to appear
    sleeps=0
    until
        grep "flushSyslog" ./dump_hook_log > /dev/null
    do
        sleeps=$((sleeps+1))
        if [ $sleeps -gt 100 ]
        then
            logFail "failed to flush syslog dump_hook_log"
        fi
        sleep 0.1
    done
}
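The check in the test script then just follows a call to this function, e.g. (using the string from the failure above):
flushSyslog
grep "Failed to open raw dump file" ./dump_hook_log >/dev/null \
    || logFail "'Failed to open raw dump file' not found in ./dump_hook_log"
Since the "flushSyslog" marker was logged after the test program's entries and rsyslog preserves ordering within a file, seeing the marker guarantees the earlier entries are already in the file.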

That seems a bit heavyweight as a solution. Instead, you can use the system's inotify API to wait for the log file to be closed (the result of the HUP signal). For example,
inotifywait -e close ./dump_hook_log
will hang until rsyslogd (or any process) closes the file, when you will get the message
./dump_hook_log CLOSE_WRITE,CLOSE
and the program will exit with return code 0. You can add a timeout.
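Putting it together, a sketch of the whole check (the 10-second figure is arbitrary; inotifywait exits 0 on the event and 2 on timeout, and starting the watch before the HUP narrows the race the questioner describes):
inotifywait -t 10 -e close ./dump_hook_log &    # start the watch first
inot_pid=$!
kill -HUP "$(cat /var/run/syslogd.pid)"         # ask rsyslogd to close the file
if wait $inot_pid
then
    grep "<the string I want>" ./dump_hook_log >/dev/null
else
    logFail "timed out waiting for rsyslogd to close dump_hook_log"
fi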

Related

Getting emacs realgud:pry to work

Need some help getting realgud to play nice with pry-remote.
This is a sample ruby file (x.rb):
require 'pry-remote'

class Foo
  def initialize(x, y)
    binding.remote_pry
    puts "hello"
  end
end

Foo.new(1, 3)
In a shell buffer:
bash: ruby x.rb
[pry-remote] Waiting for client on druby://127.0.0.1:9876
# check that I have a listening process
bash: ss -nlt
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 127.0.0.1:9876 *:*
In emacs:
M-x realgud:pry
pry-remote ; enter this at the prompt
This is what I see in emacs:
File Edit Options Buffers Tools Inf-Ruby Debugger Help
32: end
33:
34: return_value
35: end
=>[0G[1] pry(#<PryNav::Tracer>)>
-=--:**--F1 *pry +No filename+ shell* Bot L53 (Comint:r
stop unless RUBY_VERSION == '1.9.2'
return_value = nil
command = catch(:breakout_nav) do # Coordinates w$
return_value = yield
=> {} # Nothing thrown == no navigational command
end
-=--:%%--F1 tracer.rb 24% L21 (Ruby ShortKeys) ---
My question is: why am I seeing the breakpoint in tracer.rb? How do I get the breakpoint to be in my source file?
Hitting 'n' twice in the source window causes the shell buffer to echo the following, but there is no change in the source window itself.
=>[0G[1] pry(#<PryNav::Tracer>)> next 1
next 1
Also, the 'u' and 'd' keystrokes yield
Command down is not implemented for this debugger
Command up is not implemented for this debugger

Deadlock with Perl, IO:Async::Loop and pipe to sendmail

We are seeing stuck sendmail processes when we attempt to send email from a Perl FCGI process. These processes are taking too long, hours to a day, when they should just relay to a server configured in sendmail as the smart host. Most of the mail from the FCGI processes takes less than 5 seconds. The slow sendmail processes are easy to find on our servers with $ ps -ef | grep sendmail
Almost all of the email works normally from these web nodes. I'd guess thousands of mails go through with no problem. Sending test email from the command line goes smoothly. The sendmail command gets stuck rarely and we don't have a way to reproduce it.
It seems that most of this stuck email gets through eventually, hours later, sometimes over a day later.
All of the sendmail that we've seen stuck has been a command that was run by a Perl process, which is a child process of a FCGI process.
Looking at the logs of the smart host we see that most of this mail does get through sooner or later but we have found some that don't seem to have ever been sent.
This is running under FCGI for Catalyst; the work is then added to an IO::Async::Loop which does some processing, and inside that loop Email::Sender::Transport::Sendmail is used, which does an open($fh, '|-', @args), pipes the mail header+body, and does a close($fh).
I've seen this http://perldoc.perl.org/perlipc.html#Avoiding-Pipe-Deadlocks but don't know how to apply it in this situation. The child sendmail has only STDIN open.
When we have one of these stuck sendmails the sendmail is waiting on STDIN:
[<ffffffff8119ce8b>] pipe_wait+0x5b/0x80
[<ffffffff8119d8ad>] pipe_read+0x34d/0x4d0
[<ffffffff8119204a>] do_sync_read+0xfa/0x140
[<ffffffff81192945>] vfs_read+0xb5/0x1a0
[<ffffffff81192c91>] sys_read+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
and the async perl process is waiting on the child to die:
#0 0x00007f8849e6065e in waitpid () from /lib64/libc.so.6
#1 0x000000000046dc2d in Perl_wait4pid ()
#2 0x000000000046de2d in Perl_my_pclose ()
#3 0x00000000004cec4e in Perl_io_close ()
#4 0x00000000004ceda8 in Perl_do_close ()
#5 0x00000000004c2629 in Perl_pp_close ()
#6 0x00000000004804de in Perl_runops_standard ()
#7 0x000000000042e7ad in perl_run ()
#8 0x000000000041bbc5 in main ()
An example of one that didn't get through:
Job #1653576 (that's just our internal job number) has a sendmail process that started on Aug 19 13:04.
Process on webnode2:
fcgi-user 13621 13466 0 13:04 ? 00:00:00 /usr/sbin/sendmail -i -f admin@ourServer.org -- proffunnyhat@mit.edu
I don't see the record I expect to see on our smart host for this in /var/log/maillog that would indicate that it was relayed to nexus and then to MIT.
I do see successful email for proffunnyhat@mit.edu on Aug 21 (from web2 /var/log/maillog):
Aug 21 00:00:02 node-008 sendmail[13621]: u7JH4tbr013621: to=proffunnyhat@mit.edu, ctladdr=admin@ourServer.org (10520/10520), delay=1+10:55:07, xdelay=00:00:01, mailer=relay, pri=32292, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u7L401Z1026237 Message accepted for delivery)
Aug 21 00:00:02 node-008 sendmail[26247]: u7L401Z1026237: to=<proffunnyhat@mit.edu>, delay=00:00:01, xdelay=00:00:00, mailer=relay, pri=122657, relay=mail.ourServer.org. [128.84.4.11], dsn=2.0.0, stat=Sent (u7L402jx001185 Message accepted for delivery)
and then on mail.ourServer.org:
bdc34@mail.ourServer.org: log$ sudo grep u7L402jx001185 maillog*
maillog-20160821:Aug 21 00:00:02 web2 sendmail[1185]: u7L402jx001185: from=<admin@ourServer.org>, size=2874, class=0, nrcpts=1, msgid=<201608191704.u7JH4tbr013621@mail.ourServer.org>, proto=ESMTP, daemon=MTA, relay=mail.ourServer.org [128.84.4.13]
maillog-20160821:Aug 21 00:00:03 mail.ourServer.org[1200]: u7L402jx001185: to=<proffunnyhat@mit.edu>, ctladdr=<e-admin@ourServer.org> (10519/10519), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=122874, relay=dmz-mailsec-scanner-8.mit.edu. [18.7.68.37], dsn=2.0.0, stat=Sent (OK 5E/2D-20045-34729B75)
An example of one that was stuck but seems to have been sent:
mail.ourServer.org:/var/log/sendmail:
Aug 19 02:19:51 mail.ourServer.org sendmail[20792]: u7J6JlP6020790: to=<jxjx@connect.ust.hk>, ctladdr=<admin@ourServer.org> (10519/10519), delay=00:00:04, xdelay=00:00:04, mailer=esmtp, pri=122504, relay=connect-ust-hk.mai...ction.outlook.com. [213.199.154.87], dsn=2.0.0, stat=Sent (<201608190619.u7J6Jlda000738@web2.ourServer.org> [InternalId=15526306777069,...1MB1197.apcprd01.prod.exchangelabs.com] 9137 bytes in 0.189, 47.082 KB/sec Queued mail for delivery)
Things we have tried
I've modified Email::Sender::Transport::Sendmail to send a '\x00' to the pipe; that didn't work.
I've replaced IO::Async::Loop::Poll with IO::Async::Loop::Select. That didn't change anything.
I've tried sending signals to the sendmail and its parent. That killed them but the mail was aborted.
Added our fcgi user to sendmail's trusted users file. Didn't change anything.
I wrote a wrapper script that read from STDIN and writes to sendmail. If nothing comes in on STDIN for 5 seconds it exits. This feels really hacky to me but it does seem to work. Since mail is a critical part of our system I'd rather have a real solution.
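That wrapper can be a few lines of shell around read -t (a sketch of the idea, not our exact script; the sendmail path and the 5-second figure are assumptions):
#!/bin/bash
# Forward stdin to sendmail, but stop if stdin goes quiet for 5 seconds,
# so the pipe can never hang forever on a parent that will not write.
{
    while IFS= read -r -t 5 line
    do
        printf '%s\n' "$line"
    done
} | /usr/sbin/sendmail "$@"
When the loop gives up, the pipe closes and sendmail sees EOF on STDIN and delivers whatever it received (a message truncated mid-stream would still be delivered, which is one reason this feels hacky).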
ikegami's comment led us to the answer of doing a double fork. Looking at the signal handlers and file handles set up under FCGI made it clear that excessively clever things were happening, so I moved to cut all ties with the parent process using a double fork, as when starting a daemon. That worked.
# FCGI does some clever signal handling and file handles to do
# its work. This causes problems with forked processes that then
# fork and wait for other processes. So use exec to get a process
# that is not a copy of the parent. Double fork to avoid zombies.
# Check -e of submit script because $! has a cryptic message if it doesn't exist.
my $script = "$SAFEBINDIR/submit.pl";
unless ( -e -r -x $script ) {
    $submission->submit_log->error($submission->submission_id . ": $script doesn't exist");
    $c->flash( message => "There was a problem" );
    $c->res->redirect( $c->uri_for('/user') );
    $c->detach;
}

# Do the double fork + exec to have an independent process
my $pid;
unless ( $pid = fork() ) {        # this is the child
    unless ( fork() ) {           # this is the grandchild
        exec( $script, $submission->submission_id )    # should never return
            or $submission->submit_log->error($submission->submission_id
                . ": Error when trying to exec() $script '$!'");
        exit(0);
    }
    exit(0);    # the child must exit at once so the grandchild is reparented to init
}
waitpid($pid, 0);    # wait for the child; the grandchild will get ppid 1

Creating linked_dirs in Capistrano 3 fails

I am attempting to set up Capistrano with a SilverStripe build and am running into some trouble setting up the shared directories.
I set the linked_dirs in deploy.rb with the following:
set :linked_dirs, %w{assets vendor}
Since adding this line I get the following error:
[617afa7f] Command: /usr/bin/env mkdir -p /var/www/website/releases/20160215083713 /var/www/website/releases/20160215083713
INFO [617afa7f] Finished in 0.250 seconds with exit status 0 (successful).
DEBUG [88c3de20] Running /usr/bin/env [ -L /var/www/website/releases/20160215083713/assets ] as capistrano@128.199.231.152
DEBUG [88c3de20] Command: [ -L /var/www/website/releases/20160215083713/assets ]
DEBUG [88c3de20] Finished in 0.258 seconds with exit status 1 (failed).
DEBUG [3d61c1c4] Running /usr/bin/env [ -d /var/www/website/releases/20160215083713/assets ] as capistrano@128.199.231.152
DEBUG [3d61c1c4] Command: [ -d /var/www/website/releases/20160215083713/assets ]
DEBUG [3d61c1c4] Finished in 0.254 seconds with exit status 1 (failed).
INFO [3016a8cd] Running /usr/bin/env ln -s /var/www/website/shared/assets /var/www/website/releases/20160215083713/assets as capistrano@128.199.231.152
I am a mega noob when it comes to Capistrano and a semi noob when it comes to server configuration and permissions, so any pointers would be appreciated.
It probably hasn't actually failed. One thing to know about Capistrano is that (successful) and (failed) simply report the command's exit status: (successful) if it was 0 and (failed) if it was non-0.
If we look at the command in question, it says that /usr/bin/env [ -L /var/www/website/releases/20160215083713/assets ] failed. That command says "return 0 if /var/www/website/releases/20160215083713/assets exists and is a symlink (-L)". It fails, but that just means it returns non-0, so the link needs to be created. Note that the next command also "fails": the -d test asserts that the path is a directory. And the last line in your output is Capistrano actually creating the link in question.
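You can reproduce what Capistrano is doing by hand, since the bracket command is just the ordinary test(1):
[ -L /var/www/website/releases/20160215083713/assets ]; echo $?    # 1 ("failed"): not a symlink yet
ln -s /var/www/website/shared/assets /var/www/website/releases/20160215083713/assets
[ -L /var/www/website/releases/20160215083713/assets ]; echo $?    # 0 ("successful"): now it is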
You can see the test in the Capistrano codebase here: https://github.com/capistrano/capistrano/blob/master/lib/capistrano/tasks/deploy.rake#L128
You can clean up and simplify the output with https://github.com/mattbrictson/airbrussh. This is developed by one of the primary Capistrano devs.
As a side note, all the green text in your terminal is similarly stdout and the red text is stderr; that can also be confusing.

Apache init.d script

I have the following script to start, stop, and restart apache2 on my Debian 7 system:
#!/bin/sh
### BEGIN INIT INFO
# Provides:          apache2
# Required-Start:    $all
# Required-Stop:     $all
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: apache2
# Description:       Start apache2
### END INIT INFO

case "$1" in
    start)
        echo "Starting Apache ..."
        # Change the location to your specific location
        /usr/local/apache2/bin/apachectl start
        ;;
    stop)
        echo "Stopping Apache ..."
        # Change the location to your specific location
        /usr/local/apache2/bin/apachectl stop
        ;;
    graceful)
        echo "Restarting Apache gracefully..."
        # Change the location to your specific location
        /usr/local/apache2/bin/apachectl graceful
        ;;
    restart)
        echo "Restarting Apache ..."
        # Change the location to your specific location
        /usr/local/apache2/bin/apachectl restart
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|graceful}"
        exit 64
        ;;
esac

exit 0
When I add the script to update-rc.d I see the following warnings:
root#pomelo:/etc/init.d# update-rc.d apache2 defaults
update-rc.d: using dependency based boot sequencing
insserv: Script jexec is broken: incomplete LSB comment.
insserv: missing `Required-Stop:' entry: please add even if empty.
insserv: missing `Default-Stop:' entry: please add even if empty.
insserv: Default-Stop undefined, assuming empty stop runlevel(s) for script `jexec'
But I already added Required-Stop and Default-Stop to the script.
Does anybody know how to solve this problem?
The issue is not in your apache2 init script; it is in 'jexec', as the output says: 'Script jexec is broken'.
That one is missing the Required-Stop and Default-Stop entries.
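A minimal fix is to give /etc/init.d/jexec the entries insserv is asking for, even if they are empty, along these lines (a sketch; keep whatever Provides/Default-Start values the script already has):
### BEGIN INIT INFO
# Provides:          jexec
# Required-Start:
# Required-Stop:
# Default-Start:
# Default-Stop:
### END INIT INFO
Then rerun update-rc.d apache2 defaults and the warnings should be gone.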
I had the same issue on my SLES boxen. Don't worry though: even if it shows you these errors, everything still runs fine!
HTH

supervisorctl ERROR (abnormal termination)

When I run sudo supervisorctl start stage I get ERROR (abnormal termination). Will you please take a look?
Here is my file /etc/supervisord.conf. Am I missing something? Thanks.
[unix_http_server]
file=/tmp/supervisor.sock ; (the path to the socket file)
[supervisord]
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10 ; (num of main logfile rotation backups;default 10)
loglevel=info ; (log level;default info; others: debug,warn,trace)
pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=false ; (start in foreground if true;default false)
minfds=1024 ; (min. avail startup file descriptors;default 1024)
minprocs=200 ; (min. avail process descriptors;default 200)
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL for a unix socket
[program:stage]
command=/home/me/envs/project/bin/python /home/me/webapps/project/manage.py run_gunicorn -b 127.0.0.1:8002 --log-file=/tmp/stage_gunicorn.log
directory=/home/me/webapps/project/
user=www-data
autostart=true
autorestart=true
stdout_logfile=/tmp/stage_supervisord.log
redirect_stderr=true
I met the same problem. As Martijn Pieters says, it doesn't mean that something is wrong with supervisorctl itself; it just tells you that the program didn't start. You can find the error details in the log.
It indicated an error, so find it using the command below:
supervisorctl tail <APP_NAME>
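For example, with the [program:stage] section above (stderr is redirected into stdout, so one log holds both):
supervisorctl tail stage       # show the end of stage's log
supervisorctl tail -f stage    # follow it while you retry the start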
This error occurs because the underlying stage application is not running properly. To find the cause, go to your console and run the command that you are passing. In your case that is:
/home/me/envs/project/bin/python /home/me/webapps/project/manage.py run_gunicorn -b 127.0.0.1:8002 --log-file=/tmp/stage_gunicorn.log
Running it by hand will show you the error that needs to be fixed.
It means that your app is wrong. Go and check the [program:stage] section; the path or something else in it is not correct.
Just set the log level to trace, restart supervisord, and watch what happens in the supervisor log:
[supervisord]
loglevel=trace
sudo systemctl restart supervisord.service
tail -f /path/to/supervisord.log
When the problem has been resolved, set the loglevel back to info.