delayed_job monitored by God - duplicate processes after restart - capistrano

I'm monitoring delayed_job using God. This is my God config file:
QUEUE = "slow"
WORKERS = 14
WORKERS.times do |num|
God.watch do |w|
w.name = "dj.#{num}"
w.group = "tanda"
w.uid = 'deployer'
w.gid = 'deployer'
w.start = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job start --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
w.restart = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job restart --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
w.stop = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job stop -i #{num}"
w.start_grace = 30.seconds
w.restart_grace = 30.seconds
w.stop_grace = 30.seconds
w.pid_file = "#{RAILS_ROOT}/tmp/pids/delayed_job.#{num}.pid"
w.log = "#{RAILS_ROOT}/log/dj.#{num}.log"
w.err_log = "#{RAILS_ROOT}/log/dj.#{num}.errors.log"
w.behavior(:clean_pid_file)
w.interval = 30.seconds
w.dir = File.expand_path('.')
w.env = {
"RACK_ENV" => RAILS_ENV,
"RAILS_ENV" => RAILS_ENV,
"CURRENT_DIRECTORY" => RAILS_ROOT
}
w.start_if do |start|
start.condition(:process_running) do |c|
c.interval = 5.seconds
c.running = false
end
end
w.lifecycle do |on|
on.condition(:flapping) do |c|
c.to_state = [:start, :restart]
c.times = 10
c.within = 3.minutes
c.transition = :unmonitored
c.retry_in = 10.minutes
end
end
end
end
I'm then restarting these processes using Capistrano 2 on each deploy:
run("cd #{current_path} && rvmsudo god restart tanda")
When I start God, my ps output looks like this.
ps -e -www -o pid,rss,command | grep delayed
31960 220804 delayed_job.0
31966 220152 delayed_job.8
31973 226012 delayed_job.9
31979 215176 delayed_job.1
31984 210260 delayed_job.13
31994 240424 delayed_job.3
31997 225248 delayed_job.11
32003 196364 delayed_job.5
32009 236192 delayed_job.6
32015 214540 delayed_job.12
32022 247096 delayed_job.4
32029 206352 delayed_job.2
32047 232748 delayed_job.7
32061 228128 delayed_job.10
If I immediately do a Capistrano restart, without doing a deploy or anything else, then after a minute it looks like this.
ps -e -www -o pid,rss,command | grep delayed
9884 198076 delayed_job.10
9895 195372 delayed_job.0
9919 196856 delayed_job.6
9948 196772 delayed_job.5
9964 196568 delayed_job.9
9973 194092 delayed_job.12
9982 195648 delayed_job.13
9997 196392 delayed_job.2
10005 195356 delayed_job.4
10016 197268 delayed_job.3
10032 198820 delayed_job.8
10054 194316 delayed_job.7
10078 196780 delayed_job.11
10127 202420 delayed_job.1
10133 197468 delayed_job.1
10145 194040 delayed_job.1
10158 195760 delayed_job.1
10173 195844 delayed_job.1
And after another restart:
ps -e -www -o pid,rss,command | grep delayed
9884 221780 delayed_job.10
9973 225100 delayed_job.12
9982 224708 delayed_job.13
10078 235076 delayed_job.11
21467 187056 delayed_job.0
21483 187844 delayed_job.7
21497 189648 delayed_job.10
21509 187316 delayed_job.2
21518 188180 delayed_job.11
21527 187968 delayed_job.3
21542 187852 delayed_job.12
21546 186900 delayed_job.13
21556 188628 delayed_job.5
21565 187816 delayed_job.9
21574 185216 delayed_job.4
21585 188088 delayed_job.1
21599 188556 delayed_job.1
21602 188400 delayed_job.1
21615 193484 delayed_job.1
21628 193288 delayed_job.8
21632 188228 delayed_job.1
21643 187804 delayed_job.6
As you can see, these duplicate processes sometimes have new pids (e.g. all of them from the first dump to the second) but sometimes don't (e.g. DJ 10 from the second dump to the third).
I don't really know where to start debugging this. God isn't giving any errors when restarting and the DJ logs just show the usual output when launching a process. And the same thing isn't happening on a smaller server that is only meant to have 4 workers running (but is otherwise identical).
Has anyone seen this before?

I think this must be an issue in the daemons gem that delayed_job uses for working in the background, because adding this at the top of my God file seems to have fixed things:
ids = ('a'..'z').to_a
WORKERS.times do |num|
  num = ids[num]
It seems like there was an issue where the processes named delayed_job.1 and delayed_job.11 (etc.) would clash, which caused lots of problems. I haven't really isolated it much further, but switching to a different naming convention (delayed_job.a in this case) has fixed things for me for now.
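Purely to illustrate the clash I suspect (I have not confirmed this is what the daemons gem actually does): if pid files are ever looked up with a prefix-style pattern, the pattern for worker 1 also matches workers 10 through 13, so stopping or restarting dj.1 could touch the wrong pid files. A minimal Python sketch of that kind of prefix match, with a made-up pid directory listing:
import fnmatch

# Hypothetical pid files for the 14 workers, as written into tmp/pids
pid_files = ["delayed_job.%d.pid" % n for n in range(14)]

# A prefix-style lookup for worker 1 (an assumption about the matching,
# not the daemons gem's actual code)
print(fnmatch.filter(pid_files, "delayed_job.1*.pid"))
# ['delayed_job.1.pid', 'delayed_job.10.pid', 'delayed_job.11.pid',
#  'delayed_job.12.pid', 'delayed_job.13.pid']
With the letter suffixes above (delayed_job.a through delayed_job.n) no worker name is a prefix of another, which would explain why the rename made the duplicates go away.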
Will leave this open in case someone has a better solution/a reason for why this worked.

How to break a MATLAB system call after a specified number of seconds

I am using Windows MATLAB to run SSH commands, but every once in a while the SSH command hangs indefinitely, and then so does my MATLAB script (sometimes when running overnight). How can I have a command time out after a certain amount of waiting? For example, suppose I don't want to wait more than 3 seconds for an SSH command to finish before breaking it and moving on:
% placeholder for a command that sometimes hangs
[status,result] = system('ssh some-user@0.1.2.3 sleep 10')
% placeholder for something I still want to run if the above command hangs
[status,result] = system('ssh some-user@0.1.2.3 ls')
I'd like to make a function sys_with_timeout that can be used like this:
timeoutDuration = 3;
[status,result] = sys_with_timeout('ssh some-user@0.1.2.3 sleep 10', timeoutDuration)
[status,result] = sys_with_timeout('ssh some-user@0.1.2.3 ls', timeoutDuration)
I've tried the timeout function from FEX but it doesn't seem to work for a system/SSH command.
I don't think the system command is very flexible, and I don't see a way to do what you want using it.
But we can use the Java functionality built into MATLAB for this, according to this answer by André Caron.
This is how you'd wait for the process to finish with a timeout:
runtime = java.lang.Runtime.getRuntime();
process = runtime.exec('sleep 20');
process.waitFor(10, java.util.concurrent.TimeUnit.SECONDS);
if process.isAlive()
    disp("Done waiting for this...")
    process.destroyForcibly();
end
process.exitValue()
In the example, we run the sleep 20 shell command, then waitFor() waits until the program finishes, but for a maximum of 10 seconds. We then check whether the process is still running and kill it if it is. exitValue() returns the status, if you need it.
Running sleep 5 I see:
ans =
0
Running sleep 20 I see:
Done waiting for this...
ans =
137
Building on @Cris Luengo's answer, here is the sys_with_timeout() function. I didn't use the process.waitFor() function because I'd rather wait in a while loop and display output as it comes in. The while loop breaks once the command finishes or times out, whichever comes first.
function [status,cmdout] = sys_with_timeout(command,timeoutSeconds,streamOutput,errorOnTimeout)
    arguments
        command char
        timeoutSeconds {mustBeNonnegative} = Inf
        streamOutput logical = true % display output as it comes in
        errorOnTimeout logical = false % if false, display warning only
    end
    % launch command as java process (does not wait for output)
    process = java.lang.Runtime.getRuntime().exec(command);
    % start the timeout timer!
    timeoutTimer = tic();
    % Output reader (from https://www.mathworks.com/matlabcentral/answers/257278)
    outputReader = java.io.BufferedReader(java.io.InputStreamReader(process.getInputStream()));
    errorReader = java.io.BufferedReader(java.io.InputStreamReader(process.getErrorStream()));
    % initialize output char array
    cmdout = '';
    while true
        % If any lines are ready to read, append them to output
        % and display them if streamOutput is true
        if outputReader.ready() || errorReader.ready()
            if outputReader.ready()
                nextLine = char(outputReader.readLine());
            elseif errorReader.ready()
                nextLine = char(errorReader.readLine());
            end
            cmdout = [cmdout,nextLine,newline()];
            if streamOutput == true
                disp(nextLine);
            end
            continue
        else
            % if there are no lines ready in the reader, and the
            % process is no longer running, then we are done
            if ~process.isAlive()
                break
            end
            % Check for timeout. If timeout is reached, destroy
            % the process and break
            if toc(timeoutTimer) > timeoutSeconds
                timeoutMessage = ['sys_with_timeout(''',command, ''',', num2str(timeoutSeconds), ')',...
                    ' failed after timeout of ',num2str(toc(timeoutTimer)),' seconds'];
                if errorOnTimeout == true
                    error(timeoutMessage)
                else
                    warning(timeoutMessage)
                end
                process.destroyForcibly();
                break
            end
        end
    end
    if ~isempty(cmdout)
        cmdout(end) = []; % remove trailing newline of command output
    end
    status = process.exitValue(); % return
end
Replacing ssh some-user@0.1.2.3 with wsl (requires WSL, of course) for simplicity, here is an example of a call that times out:
>> [status,cmdout] = sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',1)
start!
Warning: sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',1) failed after timeout of 1.0002 seconds
> In sys_with_timeout (line 41)
status =
1
cmdout =
'start!'
and here is an example of a call that doesn't time out:
>> [status,cmdout] = sys_with_timeout('wsl echo start! && sleep 2 && echo finished!',3)
start!
finished!
status =
0
cmdout =
'start!
finished!'
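(Not a MATLAB answer, just a point of comparison: the run-with-a-timeout-and-kill-on-expiry pattern that both answers build by hand is available out of the box in Python's subprocess module. The sketch below assumes an ssh client on the PATH and reuses the placeholder host from the question.)
import subprocess

try:
    # timeout=3 plays the role of waitFor(...) plus destroyForcibly():
    # if the command is still running after 3 seconds, the child is
    # killed and TimeoutExpired is raised.
    result = subprocess.run(
        ["ssh", "some-user@0.1.2.3", "ls"],
        capture_output=True, text=True, timeout=3,
    )
    print(result.returncode)
    print(result.stdout)
except subprocess.TimeoutExpired:
    print("command timed out")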

Celery: How to schedule worker processes/children restart

I'm trying to figure out how to set up my Celery workers to restart after living for a day.
Indeed, I configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need to restart the long-living children.
Do you know how to do this?
This is my current Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla@blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)
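As far as I know there is no worker_max_living_time_before_restart setting in Celery, so treat the following only as a sketch of one possible workaround, building directly on the app object above: Celery has a built-in pool_restart remote-control command that asks workers to replace their pool child processes, and you can trigger it on whatever schedule suits you. It requires worker_pool_restarts to be enabled on the workers.
# Added to the app setup above so workers accept the command:
app.conf.worker_pool_restarts = True

# Then, on whatever schedule you like (a daily cron job importing this same
# `app`, or a beat-scheduled task), broadcast a pool restart to all workers:
app.control.pool_restart()  # each worker replaces its pool child processes
Whether recycled children are enough for your case depends on why they need restarting, so take this as a starting point rather than a drop-in answer.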

Celery: Routing tasks issue - only one worker consume all tasks from all queues

I have some tasks with manually configured routes and 3 workers, each configured to consume tasks from a specific queue. But only one worker is consuming all of the tasks and I have no idea how to fix this issue.
My celeryconfig.py
class CeleryConfig:
    enable_utc = True
    timezone = 'UTC'
    imports = ('events.tasks')
    broker_url = Config.BROKER_URL
    broker_transport_options = {'visibility_timeout': 10800}  # 3H
    worker_hijack_root_logger = False
    task_protocol = 2
    task_ignore_result = True
    task_publish_retry_policy = {'max_retries': 3, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2}
    task_time_limit = 30  # sec
    task_soft_time_limit = 15  # sec
    task_default_queue = 'low'
    task_default_exchange = 'low'
    task_default_routing_key = 'low'
    task_queues = (
        Queue('daily', Exchange('daily'), routing_key='daily'),
        Queue('high', Exchange('high'), routing_key='high'),
        Queue('normal', Exchange('normal'), routing_key='normal'),
        Queue('low', Exchange('low'), routing_key='low'),
        Queue('service', Exchange('service'), routing_key='service'),
        Queue('award', Exchange('award'), routing_key='award'),
    )
    task_route = {
        # -- SCHEDULE QUEUE --
        base_path.format(task='refresh_rank'): {'queue': 'daily'},
        # -- HIGH QUEUE --
        base_path.format(task='execute_order'): {'queue': 'high'},
        # -- NORMAL QUEUE --
        base_path.format(task='calculate_cost'): {'queue': 'normal'},
        # -- SERVICE QUEUE --
        base_path.format(task='send_pin'): {'queue': 'service'},
        # -- LOW QUEUE
        base_path.format(task='invite_to_tournament'): {'queue': 'low'},
        # -- AWARD QUEUE
        base_path.format(task='get_lesson_award'): {'queue': 'award'},
        # -- TEST TASK
    }
    worker_concurrency = multiprocessing.cpu_count() * 2 + 1
    worker_prefetch_multiplier = 1
    worker_max_tasks_per_child = 1
    worker_max_memory_per_child = 90000  # 90MB
    beat_max_loop_interval = 60 * 5  # 5 min
I run the workers in Docker; here is part of my stack.yml:
version: "3.7"
services:
worker_high:
command: celery worker -l debug -A runcelery.celery -Q high -n worker.high#%h
worker_normal:
command: celery worker -l debug -A runcelery.celery -Q normal,award,service,low -n worker.normal#%h
worker_schedule:
command: celery worker -l debug -A runcelery.celery -Q daily -n worker.schedule#%h
beat:
command: celery beat -l debug -A runcelery.celery
flower:
command: flower -l debug -A runcelery.celery --port=5555
broker:
image: redis:5.0-alpine
I thought my config and run commands were correct, but the Docker logs and Flower show that only worker.normal consumes all the tasks.
Update
Here is part of task.py:
def refresh_rank_in_tournaments():
    logger.debug(f'Start task refresh_rank_in_tournaments')
    return AnalyticBackgroundManager.refresh_tournaments_rank()
base_path is a shortcut for the full task path:
base_path = 'events.tasks.{task}'
execute_order task code:
@celery.task(bind=True, default_retry_delay=5)
def execute_order(self, private_id, **kwargs):
    try:
        return OrderBackgroundManager.execute_order(private_id, **kwargs)
    except IEXException as exc:
        raise self.retry(exc=exc)
This task is called in a view as tasks.execute_order.delay(id).
Your worker.normal is subscribed to the normal, award, service and low queues. Furthermore, the low queue is the default one, so every task that does not have an explicitly set queue will be executed on worker.normal.
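If it helps to verify the wiring, you can bypass the route table and pin the queue explicitly at call time. This uses the standard apply_async API and the execute_order task from the question, purely as a sanity check rather than a fix:
from events import tasks  # matching the imports in the question's config

# Explicitly send the task to the 'high' queue, ignoring the route table;
# if worker.high picks this up, the queue/worker wiring itself is fine.
tasks.execute_order.apply_async(args=[private_id], queue='high')

# The plain .delay() call from the view goes through the router; with no
# matching route entry it falls back to the default 'low' queue, which is
# consumed by worker.normal.
tasks.execute_order.delay(private_id)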

Audit daemon does not take rules from audit.rules

I am unable to add rules to the audit daemon using /etc/audit/audit.rules.
Every time I add rules using auditctl, they get removed on reboot or when the audit daemon restarts. I have attached /etc/audit/audit.rules and /etc/audit/auditd.conf.
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root@iWave-G22M:~# auditctl
When I restart the audit daemon (i.e. /etc/init.d/auditd restart) and try to list the rules, I get the message "No rules".
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
The audit_2.8.4.bb recipe was used to install the auditd daemon via Yocto.
Path of audit_2.8.4.bb: http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
Audit rules added via /etc/audit/audit.rules and the auditctl command are not permanent. To make them permanent across reboots, you have to add them to the /etc/audit/rules.d/audit.rules file.
After adding the rule, restart the auditd service and run auditctl -l; it will list all the rules, and they will also be reflected in /etc/audit/audit.rules.

LSF serial jobs on HPC performance worse than local sequential executions

I'm learning how to use HPC on our lab's clusters, which use LSF. I tried a simple set of serial jobs, each of which counts the frequency of the words in a text file. I wrote a Python script for counting word frequencies named count_word_freq.py and a jobs script named myjob.job as follows:
python count_word_freq.py --i ~/books/1.txt --o ~/freqs/freq1.txt
python count_word_freq.py --i ~/books/2.txt --o ~/freqs/freq2.txt
python count_word_freq.py --i ~/books/3.txt --o ~/freqs/freq3.txt
python count_word_freq.py --i ~/books/4.txt --o ~/freqs/freq4.txt
python count_word_freq.py --i ~/books/5.txt --o ~/freqs/freq5.txt
and an LSF script to submit the serial jobs to the self-scheduler:
#!/bin/bash
#BSUB -J test01
#BSUB -P acc_pandeg01a
#BSUB -q alloc
#BSUB -W 20
#BSUB -n 20
#BSUB -m manda
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
module load python
module load py_packages
module load selfsched
# And run the program; output will be on stdout
mpirun selfsched < myjobs001.jobs
The Python code is as follows:
def readBookAsFreqDict(infile):
    dic = {}
    with open(infile, "r") as file:
        for line in file:
            contents = line.split(" ")
            for cont in contents:
                if str.isalpha(cont):
                    if cont not in dic.keys():
                        dic[cont] = 1
                    else:
                        dic[cont] = dic[cont] + 1
    return dic

import sys
import argparse
import time as T

if __name__ == "__main__":
    start = T.time()
    parser = argparse.ArgumentParser()
    parser.add_argument('--i', type=str, help='input file')
    parser.add_argument('--o', type=str, help='output file')
    args = parser.parse_args()
    dic = readBookAsFreqDict(args.i)
    outfile = open(args.o, "w")
    for key, freq in dic.iteritems():
        outfile.write(key + ":" + str(freq) + "\n")
    end = T.time()
    print(end - start)
The 5 input texts are almost the same size, around 3.5 MB each. My question is that the CPU time for running this set of serial jobs is 980 s, which is worse than running them sequentially.
To my understanding, the self-scheduler can automatically assign the 5 jobs to empty nodes and thus save time compared to running them sequentially. Is this because the execution time of each job is too short compared to the time needed to find an empty node? Are there any other approaches that could be used to make it faster?
Thank you!
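One way to quantify the "execution time too short per job" suspicion is to take the scheduler out of the picture entirely: run all five counts in a single Python process and compare the wall time against the 980 s figure. A rough sketch reusing readBookAsFreqDict from count_word_freq.py (the paths are placeholders for the ~/books and ~/freqs files above):
import time
from count_word_freq import readBookAsFreqDict  # the function shown above

start = time.time()
for n in range(1, 6):
    infile = "books/%d.txt" % n        # placeholder paths
    outfile = "freqs/freq%d.txt" % n
    dic = readBookAsFreqDict(infile)
    with open(outfile, "w") as out:
        for key, freq in dic.iteritems():   # use .items() on Python 3
            out.write(key + ":" + str(freq) + "\n")
print("total sequential time: %.1f s" % (time.time() - start))
If that number comes out far below 980 s, the overhead is in scheduling and per-job start-up rather than in the counting itself, which would point towards batching more work into each submitted job.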