LSF serial jobs on HPC perform worse than local sequential execution

I'm learning how to use HPC on our lab's cluster, which uses LSF. I tried a simple set of serial jobs, each of which counts the frequency of the words in a text file. I wrote a Python script for counting the word frequencies and a job script named myjob.job, as follows:
python --in ~/books/1.txt --out ~/freqs/freq1.txt
python --in ~/books/2.txt --out ~/freqs/freq2.txt
python --in ~/books/3.txt --out ~/freqs/freq3.txt
python --in ~/books/4.txt --out ~/freqs/freq4.txt
python --in ~/books/5.txt --out ~/freqs/freq5.txt
and an LSF script to submit the serial jobs to the self-scheduler:
#BSUB -J test01
#BSUB -P acc_pandeg01a
#BSUB -q alloc
#BSUB -W 20
#BSUB -n 20
#BSUB -m manda
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
module load python
module load py_packages
module load selfsched
# And run the program; output will be on stdout
mpirun selfsched <
The Python code is as follows:
import sys
import argparse
import time as T


def readBookAsFreqDict(infile):
    """Count how often each purely alphabetic token occurs in a file."""
    dic = {}
    with open(infile, "r") as file:
        for line in file:
            contents = line.split(" ")
            for cont in contents:
                if str.isalpha(cont):
                    if cont not in dic:
                        dic[cont] = 1
                    else:
                        dic[cont] = dic[cont] + 1
    return dic


if __name__ == "__main__":
    start = T.time()
    parser = argparse.ArgumentParser()
    # Flag names match the job script (--in / --out); "in" is a Python
    # keyword, so the values are stored under explicit dest names.
    parser.add_argument('--in', dest='infile', type=str, help='input file')
    parser.add_argument('--out', dest='outfile', type=str, help='output file')
    args = parser.parse_args()
    dic = readBookAsFreqDict(args.infile)
    # items() works on both Python 2 and 3; the with block closes the file
    with open(args.outfile, "w") as outfile:
        for key, freq in dic.items():
            outfile.write(key + ":" + str(freq) + "\n")
    end = T.time()
    print(end - start)
The 5 input texts are all about the same size, around 3.5 MB. My problem is that the CPU time for running these serial jobs is 980 s, which is worse than running them sequentially.
To my understanding, the self-scheduler automatically assigns the 5 jobs to idle nodes, which should save time compared with running them sequentially. Is it slow because the execution time of each job is too short compared with the time needed to find an idle node? Are there other approaches I could use to make it faster?
Thank you!
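As a point of comparison for the "other approaches" question, here is a minimal sketch of the same counting done on a single node with the standard library, using one worker process per file. The function names (count_words, count_file, count_files_in_parallel) are hypothetical and not part of the original script, and unlike the original it splits on any whitespace rather than only on single spaces, so the counts can differ slightly.

```python
from collections import Counter
from multiprocessing import Pool


def count_words(text):
    """Count purely alphabetic, whitespace-separated tokens in a string."""
    return Counter(tok for tok in text.split() if tok.isalpha())


def count_file(path):
    """Read one file and return its word counts."""
    with open(path, "r") as handle:
        return count_words(handle.read())


def count_files_in_parallel(paths, processes=5):
    """Count each file in its own worker process on a single node."""
    with Pool(processes=processes) as pool:
        return pool.map(count_file, paths)


if __name__ == "__main__":
    print(count_words("the cat saw the dog"))
```

For inputs of a few megabytes each, the per-file work is small, so whether this beats a scheduler-based run depends mostly on start-up and dispatch overhead rather than on the counting itself.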


Generate many files with wildcard, then merge into one

I have two rules in my Snakefile: one generates several sets of files using wildcards, the other merges everything into a single file. This is how I wrote it:
chr = range(1,23)

rule generate:
    input:
        og_files = config["tmp"] + '/chr{chr}.bgen',
    output:
        out = multiext(config["tmp"] + '/plink/chr{{chr}}',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
            --bgen {input.og_files} \
            --make-bed \
            --oxford-single-chr \
            --out {config[tmp]}/plink/chr{chr}
        """

rule merge:
    input:
        plink_chr = expand(config["tmp"] + '/plink/chr{chr}.{ext}',
                           chr = chr,
                           ext = ['bed', 'bim', 'fam'])
    output:
        out = multiext(config["tmp"] + '/all',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
            --pmerge-list-dir {config[tmp]}/plink \
            --make-bed \
            --out {config[tmp]}/all
        """
Unfortunately, this does not allow me to track the file coming from the first rule to the 2nd rule:
$ snakemake -s myfile.smk -c1 -np
Building DAG of jobs...
MissingInputException in line 17 of myfile.smk:
Missing input files for rule merge:
[list of all the files made by expand()]
What can I use to be able to generate the 22 sets of files with the wildcard chr in generate, but be able to track them in the input of merge? Thank you in advance for your help
In rule generate I think you don't want to escape the {chr} wildcard, otherwise it doesn't get replaced. I.e.:
out = multiext(config["tmp"] + '/plink/chr{{chr}}',
'.bed', '.bim', '.fam')
should be:
out = multiext(config["tmp"] + '/plink/chr{chr}',
'.bed', '.bim', '.fam')
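To see why the doubled braces matter, here is a small plain-Python sketch of brace escaping. It assumes that multiext() only appends the extensions to the prefix and does not run the prefix through format(); the literal "tmp" stands in for config["tmp"].

```python
# Plain string concatenation, as in the multiext() prefix above: braces are
# passed through untouched, so doubled braces stay doubled.
prefix = "tmp" + "/plink/chr{{chr}}"
print(prefix)

# str.format() is where doubling is needed: "{{" and "}}" collapse to single
# braces, so a literal "{chr}" placeholder survives the call.
template = "tmp/plink/chr{{chr}}.{ext}".format(ext="bed")
print(template)

# A later format() call can then fill in the wildcard value.
filled = template.format(chr=1)
print(filled)
```

Under that assumption, the output of generate contains the literal text chr{{chr}}, which never matches the chr1 ... chr22 files that merge asks for.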

Celery: Routing tasks issue - only one worker consume all tasks from all queues

I have some tasks with manually configured routes and 3 workers, each configured to consume tasks from specific queues. But only one worker is consuming all of the tasks, and I have no idea how to fix this.
import multiprocessing

from kombu import Exchange, Queue


class CeleryConfig:
    enable_utc = True
    timezone = 'UTC'
    imports = ('events.tasks')
    broker_url = Config.BROKER_URL
    broker_transport_options = {'visibility_timeout': 10800}  # 3H
    worker_hijack_root_logger = False
    task_protocol = 2
    task_ignore_result = True
    task_publish_retry_policy = {'max_retries': 3, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2}
    task_time_limit = 30  # sec
    task_soft_time_limit = 15  # sec
    task_default_queue = 'low'
    task_default_exchange = 'low'
    task_default_routing_key = 'low'
    task_queues = (
        Queue('daily', Exchange('daily'), routing_key='daily'),
        Queue('high', Exchange('high'), routing_key='high'),
        Queue('normal', Exchange('normal'), routing_key='normal'),
        Queue('low', Exchange('low'), routing_key='low'),
        Queue('service', Exchange('service'), routing_key='service'),
        Queue('award', Exchange('award'), routing_key='award'),
    )
    task_route = {
        base_path.format(task='refresh_rank'): {'queue': 'daily'},
        # -- HIGH QUEUE --
        base_path.format(task='execute_order'): {'queue': 'high'},
        base_path.format(task='calculate_cost'): {'queue': 'normal'},
        base_path.format(task='send_pin'): {'queue': 'service'},
        base_path.format(task='invite_to_tournament'): {'queue': 'low'},
        base_path.format(task='get_lesson_award'): {'queue': 'award'},
    }
    worker_concurrency = multiprocessing.cpu_count() * 2 + 1
    worker_prefetch_multiplier = 1
    worker_max_tasks_per_child = 1
    worker_max_memory_per_child = 90000  # 90MB
    beat_max_loop_interval = 60 * 5  # 5 min
I run the workers in Docker; here is part of my stack.yml:
version: "3.7"
command: celery worker -l debug -A runcelery.celery -Q high -n worker.high@%h
command: celery worker -l debug -A runcelery.celery -Q normal,award,service,low -n worker.normal@%h
command: celery worker -l debug -A runcelery.celery -Q daily -n worker.schedule@%h
command: celery beat -l debug -A runcelery.celery
command: flower -l debug -A runcelery.celery --port=5555
image: redis:5.0-alpine
I thought that my config and run commands were correct, but the Docker logs and Flower show that worker.normal alone consumes all of the tasks.
Here is part of
def refresh_rank_in_tournaments():
    logger.debug(f'Start task refresh_rank_in_tournaments')
    return AnalyticBackgroundManager.refresh_tournaments_rank()
base_path is shortcut for full task path:
base_path = 'events.tasks.{task}'
execute_order task code:
@celery.task(bind=True, default_retry_delay=5)
def execute_order(self, private_id, **kwargs):
    try:
        return OrderBackgroundManager.execute_order(private_id, **kwargs)
    except IEXException as exc:
        # retry after default_retry_delay seconds
        raise self.retry(exc=exc)
This task is called in a view as tasks.execute_order.delay(id)
Your worker.normal is subscribed to the normal, award, service and low queues. Furthermore, the low queue is the default one, so every task that does not have an explicitly set queue will be executed on worker.normal.
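The fallback described above can be sketched in plain Python without Celery. resolve_queue and workers_for are hypothetical helpers that only imitate the lookup; they are not Celery's API. It is also worth checking the setting name: Celery's documented routing setting is task_routes, while the config in the question spells it task_route, so it is possible the routes are never loaded at all and every task falls through to the default queue.

```python
def resolve_queue(task_name, routes, default_queue):
    """Return the queue for a task, falling back to the default queue."""
    route = routes.get(task_name)
    if route is None:
        return default_queue
    return route["queue"]


# Which queues each worker was started with (-Q), as in the stack file.
WORKER_QUEUES = {
    "worker.high": {"high"},
    "worker.normal": {"normal", "award", "service", "low"},
    "worker.schedule": {"daily"},
}


def workers_for(queue):
    """List the workers that consume from the given queue."""
    return sorted(name for name, queues in WORKER_QUEUES.items() if queue in queues)


routes = {"events.tasks.execute_order": {"queue": "high"}}

# A routed task goes to its own queue and worker.
print(workers_for(resolve_queue("events.tasks.execute_order", routes, "low")))
# With no matching route (for example, if the routes are never loaded),
# the task lands on the default queue and therefore on worker.normal.
print(workers_for(resolve_queue("events.tasks.execute_order", {}, "low")))
```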

AWS Glue job failing with OOM exception when changing column names

I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result of this in Athena later so I want to change the names of the columns from having '.' to '_' and lower case them. When I do this transformation, I change the DynamicFrame into a spark dataframe and have been doing it this way. I've also seen a problem in another SO question where it turned out there is a reported problem with AWS Glue rename field transform so I've stayed away from that.
I've tried a couple of things, including adding a load limit size of 50 MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, and using Spark SQL to do the renaming, and nothing has worked. I'm fairly certain that it's this transformation that is failing, because I've put some print statements in and the print that comes right after the transformation never shows up. I used a UDF at one point but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far: I've not seen it get past retrieving the column names. Here's the code I've been using. I've been changing the actual name transformation as described above, but the rest of it has stayed pretty much the same.
I've seen some people try the spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead settings. I've used those and set them to the maximum AWS Glue allows, but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# ========================================================
def lower_and_pythonize(s=None):
    if s is not None:
        return s.replace('.', '_').lower()
    return None
# pyspark implementation of renaming
# exprs = [
# regexp_replace(lower(trim(col(c))),'\.' , '_').alias(c) if t == "string" else col(c)
# for (c, t) in data_frame.dtypes
# ]
# ========================================================
# ========================================================
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ["<root-directory-i'm using>"] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation:
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
'groupFiles': 'inPartition',
'groupSize': group_size,
'recurse': True,
'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
num_paritions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
    good_key = lower_and_pythonize(s=key)
    data_frame =
    # lowercase all the names and remove '.'
    print('Removing . and _ from names for {} frame...'.format(key))
    df_fix_names_time = time.time()
    print('Repartitioning data frame...')
    print('Changing names...')
    for old_name in data_frame.schema.names:
        data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
    print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))
    # if a conflict of types appears, make it 2 columns
    print('Fixing any type conficts for {} frame...'.format(key))
    df_resolve_time = time.time()
    resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
    print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))
    # check if key exists in s3. if not make one
    out_connect = copy.deepcopy(output_connection_opts)
    out_connect['path'] = out_connect['path'] + '/' + str(good_key)
    try:
        s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
            # object doesn't exist
            s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
    print('Writing {} frame to S3...'.format(key))
    df_writing_time = time.time()
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
    out_connect = None
    datasink4 = None
    print('Done. Time to complete: {}'.format(time.time() - df_writing_time))
print('Done. Time to complete: {}s'.format(time.time() - writting_time))
Here is the error I'm getting
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '' '' '' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file --arg '' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/ 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file --arg --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
Log Contents:
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error, and I've tried a fair number of the suggested fixes with no success. Any help you can provide would be much appreciated. Let me know if you need any more information. Thanks.
Prabhakar's comment reminded me that I have tried the memory worker type in AWS Glue and it still failed. As stated above, I have also tried raising the memoryOverhead from 5 to 12, but to no avail. Neither of these made the job complete successfully.
I put in the following code for column name change instead of the above code for easier debugging
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
    print('Name number {}. name being changed: {}'.format(name_counter, old_name))
    data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    name_counter += 1
And I got the following output
Removing . and _ from names for root frame...
Repartitioning data frame...
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
Update 2
Glue recently added more verbose logs and I found this
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something that trivial would take up that much memory?
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error
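For the renaming step itself, here is a minimal sketch that builds the full list of new names up front so they can be applied in a single call instead of one withColumnRenamed() per column. build_renamed_columns is a hypothetical helper, and the numeric suffix for colliding names is an assumption of this sketch rather than something the job above does. It does not explain or fix the memory error.

```python
def lower_and_pythonize(name):
    """Replace dots with underscores and lowercase the name."""
    return name.replace(".", "_").lower()


def build_renamed_columns(columns):
    """Clean every column name, adding a numeric suffix on collisions."""
    seen = {}
    renamed = []
    for column in columns:
        clean = lower_and_pythonize(column)
        count = seen.get(clean, 0)
        seen[clean] = count + 1
        renamed.append(clean if count == 0 else "{}_{}".format(clean, count))
    return renamed


new_names = build_renamed_columns(["Order.Id", "order_id", "Total"])
print(new_names)
# With a Spark DataFrame, the whole list could then be applied in one call
# (note the unpacking, since toDF takes the names as separate arguments):
# data_frame = data_frame.toDF(*new_names)
```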

How to combine outputs of multiple tasks performed using Matlab SGE?

I have the following batch file launching some m-files (main.m and f.m which are scripts) 4 times (4 tasks).
#$ -S /bin/bash
#$ -l h_vmem=2G
#$ -l tmem=2G
#$ -cwd
#$ -j y
#Run 4 tasks where each task has a different $SGE_TASK_ID ranging from 1 to 4
#$ -t 1-4
#$ -N example
#Output the Task ID
echo "Task ID is $SGE_TASK_ID"
cat main.m f.m | matlab -nodisplay -nodesktop -nojvm -nosplash
At the end I obtain 4 outputs that are example.o[...].1,example.o[...].2, example.o[...].3, example.o[...].4. Each of them looks like
Task ID is ...
< M A T L A B (R) >
>> >> >> >> >> >> >> >> >> >> >>
output =
4.0234 -3.4763
How can I combine these 4 outputs in a matrix 4x2 and save it?
You should save the relevant output from within f.m using MATLAB's save or something similar.
If you use the -r flag to call main and f from the command line you can add a variable which will contain the task ID and you can then access that from within f.m
matlab -nodisplay -nodesktop -nojvm -nosplash -r "main; ID = $SGE_TASK_ID; f; exit"
Then within f.m
% You theoretically generate some numeric result
result = rand(1, 2);
filename = sprintf('Result.%d.mat', ID);
save(filename, 'result')
With task IDs running from 1 to 4, this will save Result.1.mat, Result.2.mat, etc.
Alternatively, you could modify f.m so that it loads the data from the file, appends to it, and re-saves it every time:
result = rand(1,2);
filename = 'Results.mat';
% If this is the first task, then create a new file, otherwise append to the old
if ID == 1
    data = result;
else
    tmp = load(filename, '-mat');
    data = tmp.data;
    data(ID,:) = result;
end
save(filename, 'data')
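If the jobs have already run and only the text outputs are available, the rows can also be recovered afterwards. The sketch below is in Python rather than MATLAB and assumes each output file has the layout shown in the question: a line reading "output =" followed by one line of numbers. parse_output_row and combine_rows are hypothetical names. Collecting per-task results after the fact also sidesteps any concern about several array tasks rewriting one shared file at the same time.

```python
def parse_output_row(log_text):
    """Return the numbers on the first non-empty line after 'output ='."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "output =":
            for candidate in lines[index + 1:]:
                if candidate.strip():
                    return [float(value) for value in candidate.split()]
    raise ValueError("no 'output =' block found")


def combine_rows(log_texts):
    """Stack one parsed row per task output into a list of rows."""
    return [parse_output_row(text) for text in log_texts]


sample = (
    "Task ID is 1\n"
    "< M A T L A B (R) >\n"
    ">> >> >>\n"
    "output =\n"
    "    4.0234   -3.4763\n"
)
print(combine_rows([sample, sample]))
```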

delayed_job monitored by God - duplicate processes after restart

I'm monitoring delayed_job using God. This is my God config file.
QUEUE = "slow"
WORKERS.times do |num|
God.watch do |w|
w.name = "dj.#{num}"
w.group = "tanda"
w.uid = 'deployer'
w.gid = 'deployer'
w.start = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job start --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
w.restart = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job restart --queue=#{QUEUE} --pid-dir=#{RAILS_ROOT}/tmp/pids -i #{num}"
w.stop = "cd #{RAILS_ROOT}; RAILS_ENV=#{RAILS_ENV} bundle exec script/delayed_job stop -i #{num}"
w.start_grace = 30.seconds
w.restart_grace = 30.seconds
w.stop_grace = 30.seconds
w.pid_file = "#{RAILS_ROOT}/tmp/pids/delayed_job.#{num}.pid"
w.log = "#{RAILS_ROOT}/log/dj.#{num}.log"
w.err_log = "#{RAILS_ROOT}/log/dj.#{num}.errors.log"
w.interval = 30.seconds
w.dir = File.expand_path('.')
w.env = {
w.start_if do |start|
start.condition(:process_running) do |c|
c.interval = 5.seconds
c.running = false
w.lifecycle do |on|
on.condition(:flapping) do |c|
c.to_state = [:start, :restart]
c.times = 10
c.within = 3.minutes
c.transition = :unmonitored
c.retry_in = 10.minutes
I'm then restarting these processes using Capistrano 2 on each deploy:
run("cd #{current_path} && rvmsudo god restart tanda")
When I start God, my ps output looks like this.
ps -e -www -o pid,rss,command | grep delayed
31960 220804 delayed_job.0
31966 220152 delayed_job.8
31973 226012 delayed_job.9
31979 215176 delayed_job.1
31984 210260 delayed_job.13
31994 240424 delayed_job.3
31997 225248 delayed_job.11
32003 196364 delayed_job.5
32009 236192 delayed_job.6
32015 214540 delayed_job.12
32022 247096 delayed_job.4
32029 206352 delayed_job.2
32047 232748 delayed_job.7
32061 228128 delayed_job.10
If I immediately do a Capistrano restart, without doing a deploy or anything else, then after a minute it looks like this.
ps -e -www -o pid,rss,command | grep delayed
9884 198076 delayed_job.10
9895 195372 delayed_job.0
9919 196856 delayed_job.6
9948 196772 delayed_job.5
9964 196568 delayed_job.9
9973 194092 delayed_job.12
9982 195648 delayed_job.13
9997 196392 delayed_job.2
10005 195356 delayed_job.4
10016 197268 delayed_job.3
10032 198820 delayed_job.8
10054 194316 delayed_job.7
10078 196780 delayed_job.11
10127 202420 delayed_job.1
10133 197468 delayed_job.1
10145 194040 delayed_job.1
10158 195760 delayed_job.1
10173 195844 delayed_job.1
And after another restart:
ps -e -www -o pid,rss,command | grep delayed
9884 221780 delayed_job.10
9973 225100 delayed_job.12
9982 224708 delayed_job.13
10078 235076 delayed_job.11
21467 187056 delayed_job.0
21483 187844 delayed_job.7
21497 189648 delayed_job.10
21509 187316 delayed_job.2
21518 188180 delayed_job.11
21527 187968 delayed_job.3
21542 187852 delayed_job.12
21546 186900 delayed_job.13
21556 188628 delayed_job.5
21565 187816 delayed_job.9
21574 185216 delayed_job.4
21585 188088 delayed_job.1
21599 188556 delayed_job.1
21602 188400 delayed_job.1
21615 193484 delayed_job.1
21628 193288 delayed_job.8
21632 188228 delayed_job.1
21643 187804 delayed_job.6
As you can see these duplicate processes sometimes have new pids (eg. all from the first dump to the second) but sometimes don't (eg. DJ 10 from the 2nd to the 3rd).
I don't really know where to start debugging this. God isn't giving any errors when restarting and the DJ logs just show the usual output when launching a process. And the same thing isn't happening on a smaller server that is only meant to have 4 workers running (but is otherwise identical).
Has anyone seen this before?
I think this must be an issue in the daemons gem that delayed_job uses for working in the background, because adding this at the top of my God file seems to have fixed things:
ids = ('a'..'z').to_a
workers.times do |num|
num = ids[num]
It seems like there was an issue where the processes named delayed_job.1 and delayed_job.11 (etc) would clash which would cause lots of problems. I haven't really isolated it down too far, but changing it to a different naming convention (delayed_job.a in this case) has fixed things for me now.
Will leave this open in case someone has a better solution/a reason for why this worked.
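One way such a name clash could arise is prefix matching on pid file names. This is only a guess at the mechanism: the sketch below does not reproduce the daemons gem's actual lookup, and the glob pattern is an assumption. It does line up with the second ps listing, where delayed_job.1 appears five times and five of the fourteen numeric names start with that prefix.

```python
from fnmatch import fnmatch

# Pid file names for 14 workers under each naming scheme.
numeric = ["delayed_job.{}.pid".format(i) for i in range(14)]
lettered = ["delayed_job.{}.pid".format(c) for c in "abcdefghijklmn"]


def prefix_matches(worker, pid_files):
    """Return the pid files matched by a '<worker>*.pid' glob."""
    pattern = worker + "*.pid"
    return [name for name in pid_files if fnmatch(name, pattern)]


# Numeric ids: worker 1 also picks up workers 10 to 13.
print(prefix_matches("delayed_job.1", numeric))
# Single-letter ids: no id is a prefix of another, so only one file matches.
print(prefix_matches("delayed_job.a", lettered))
```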