Airflow tasks time out after one hour even though the timeout setting is larger than 1 hour - celery

Currently I'm using Airflow with the Celery executor and Redis to run DAGs, and I have set execution_timeout to 12 hours in an S3 key sensor, but it fails after one hour on each retry.
I have tried updating visibility_timeout = 64800 in airflow.cfg, but the issue still exists:
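For context on why the failures come at exactly one hour: Celery's Redis transport uses a visibility_timeout that defaults to 3600 seconds, after which an unacknowledged task message can be re-delivered. In Airflow 1.10 it is typically raised in airflow.cfg roughly like this (the section name may vary by version):

```ini
[celery_broker_transport_options]
# Should exceed the longest expected task/sensor runtime
# (here 18 hours = 64800 seconds). The Redis transport's
# 3600-second default can cause a long-running task's message
# to be re-delivered after one hour.
visibility_timeout = 64800
```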
file_sensor = CorrectedS3KeySensor(
    task_id='listen_for_file_drop', dag=dag,
    aws_conn_id='aws_default',
    poke_interval=15,
    timeout=64800,  # 18 hours
    bucket_name=EnvironmentConfigs.S3_SFTP_BUCKET_NAME,
    bucket_key=dag_config[ConfigurationConstants.FILE_S3_PATTERN],
    wildcard_match=True,
    execution_timeout=timedelta(hours=12)
)
My understanding is that execution_timeout should let the task last 12 hours on each of the four total runs (retries = 3). But instead each retry fails after about an hour, so in total the task only lasts a little over 4 hours:
[2019-08-06 13:00:08,597] {{base_task_runner.py:101}} INFO - Job 9: Subtask listen_for_file_drop [2019-08-06 13:00:08,595] {{timeout.py:41}} ERROR - Process timed out
[2019-08-06 13:00:08,612] {{models.py:1788}} ERROR - Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1652, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/sensors/base_sensor_operator.py", line 97, in execute
    while not self.poke(context):
  File "/usr/local/airflow/dags/ProcessingStage/sensors/sensors.py", line 91, in poke
    time.sleep(30)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 42, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
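For scale, execution_timeout is enforced per try, so the expected and observed totals work out as follows (plain datetime arithmetic; retries = 3 is taken from the question):

```python
from datetime import timedelta

execution_timeout = timedelta(hours=12)  # per-try limit set on the sensor
tries = 1 + 3                            # first run plus retries = 3

expected_total = tries * execution_timeout   # what the settings promise
observed_total = tries * timedelta(hours=1)  # each try actually dies after ~1 hour

print(expected_total)  # 2 days, 0:00:00
print(observed_total)  # 4:00:00
```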

I figured it out a few days later.
Since I'm using AWS to deploy Airflow with the Celery executor, a few improperly configured CloudWatch alarms kept scaling the workers and the webserver/scheduler up and down :(
After those alarms were fixed, it works well now!!

Related

Celery: How to schedule worker processes/children restart

I'm trying to figure out how to set up my Celery workers to restart after living for a day.
Indeed, I have configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need to restart the long-living children anyway.
Do you know how to do this?
This is my actual Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla#blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)

How to use `client.start_ipython_workers()` in dask-distributed?

I am trying to get workers to output some information from their IPython kernels and to execute various commands in the IPython session. I tried the examples in the documentation: the ipyparallel example works, but the second example (with IPython magics) does not. I cannot get the workers to execute any commands. For example, I am stuck on the following issue:
from dask.distributed import Client
client = Client()
info = client.start_ipython_workers()
list_workers = info.keys()
%remote info[list_workers[0]]
The last line returns an error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-9118451af441> in <module>
----> 1 get_ipython().run_line_magic('remote', "info['tcp://127.0.0.1:50497'] worker.active")
~/miniconda/envs/dask/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
2334 kwargs['local_ns'] = self.get_local_scope(stack_depth)
2335 with self.builtin_trap:
-> 2336 result = fn(*args, **kwargs)
2337 return result
2338
~/miniconda/envs/dask/lib/python3.7/site-packages/distributed/_ipython_utils.py in remote_magic(line, cell)
115 info_name = split_line[0]
116 if info_name not in ip.user_ns:
--> 117 raise NameError(info_name)
118 connection_info = dict(ip.user_ns[info_name])
119
NameError: info['tcp://127.0.0.1:50497']
I would appreciate any examples of how to get information from the IPython kernels running on the workers.
Posting here just to keep track of it: I raised an issue for this on GitHub: https://github.com/dask/distributed/issues/4522
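If the %remote magic keeps failing, a simpler way to pull information out of workers is client.run, a standard dask.distributed API that executes a plain function on every worker and returns a {worker_address: result} mapping. A minimal sketch (using an in-process cluster so the example is self-contained):

```python
import os
from dask.distributed import Client

# In-process cluster (threads, no separate processes) for the demo.
client = Client(processes=False)

# Run a function on every worker; the result dict is keyed by worker address.
pids = client.run(os.getpid)
print(pids)

client.close()
```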

WildFly Schedule overlapping

I use a scheduler in WildFly 9, with these EJB annotations:
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.ejb.Schedule;
I get loads of these warnings:
2020-01-21 12:35:59,000 WARN [org.jboss.as.ejb3] (EJB default - 6) WFLYEJB0043: A previous execution of timer [id=3e4ec2d2-cea9-43c2-8e80-e4e66593dc31 timedObjectId=FiloJobScheduler.FiloJobScheduler.FiskaldatenScheduler auto-timer?:true persistent?:false timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl#71518cd4 initialExpiration=null intervalDuration(in milli sec)=0 nextExpiration=Tue Jan 21 12:35:59 GMT+02:00 2020 timerState=IN_TIMEOUT info=null] is still in progress, skipping this overlapping scheduled execution at: Tue Jan 21 12:35:59 GMT+02:00 2020.
But when I measure the elapsed times, they are always < 1 minute.
The Scheduling is:
@Schedule(second = "*", minute = "*/5", hour = "*", persistent = false)
Has anyone an idea what is going on?
A little logging would help you. This runs every second because that's what you're telling it to do with the second = "*" attribute. If you want it to run only every 5 minutes of every hour, change the schedule to:
@Schedule(minute = "*/5", hour = "*", persistent = false)
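To make the difference concrete, here is the firing count over one hour for each expression (plain arithmetic over the calendar attributes, not WildFly API; in EJB calendar expressions an omitted second attribute defaults to "0"):

```python
# minute = "*/5" matches minutes 0, 5, 10, ..., 55 -> 12 minutes per hour.
matching_minutes = [m for m in range(60) if m % 5 == 0]

# second = "*" fires on all 60 seconds of each matching minute;
# omitting `second` defaults it to "0", i.e. once per matching minute.
fires_with_second_star = len(matching_minutes) * 60
fires_with_default_second = len(matching_minutes) * 1

print(fires_with_second_star, fires_with_default_second)  # 720 12
```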

Getting description of a PBS job queue

Is there any command that would allow me to query a running or queued PBS job for its attributes, such as RAM, number of processors, GPUs, etc.?
Use the qstat command:
qstat -f job_id
Expanding on the answer posted by dimm:
If a job is registered in a queue, you can query its attributes with the qstat command. If the job has already finished, you can only grep the relevant information from the log files. There is a handy tracejob command to do the grepping for you.
In PBS Pro and Torque, each job registered with a queue has two sets of attributes:
Resource_List holds the resources requested for a running or queued job.
resources_used holds the actual resource usage of a running job.
For example, in PBS Pro you could get the following attributes for Resource_List:
Resource_List.mem = 2000mb
Resource_List.mpiprocs = 8
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.qlist = queue1
Resource_List.select = 1:ncpus=8:mpiprocs=8
Resource_List.walltime = 02:00:00
 
And the following values for resources_used:
resources_used.cpupercent = 800
resources_used.cput = 00:03:31
resources_used.mem = 529992kb
resources_used.ncpus = 8
resources_used.vmem = 3075580kb
resources_used.walltime = 00:00:28
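When scripting around qstat, the flat key = value lines shown above are easy to load into a dict. A minimal sketch (sample lines copied from the listing above; note that real qstat -f output wraps long values onto indented continuation lines, which this sketch does not handle):

```python
# Sample of `qstat -f <job_id>` style output, taken from the listing above.
sample = """\
Resource_List.mem = 2000mb
Resource_List.ncpus = 8
Resource_List.walltime = 02:00:00
resources_used.mem = 529992kb
resources_used.ncpus = 8
"""

def parse_qstat(text):
    """Turn flat 'key = value' lines into a dict of strings."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs

print(parse_qstat(sample)["Resource_List.walltime"])  # 02:00:00
```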
For finished jobs, tracejob can fetch only some of the requested resources:
ncpus=8:mem=2048000kb
and the final values for resources_used:
resources_used.cpupercent=799
resources_used.cput=00:54:29
resources_used.mem=725520kb
resources_used.ncpus=8
resources_used.vmem=3211660kb
resources_used.walltime=00:06:53

GAE Socket API error - ApplicationError: 4 Unknown error

We have an App Engine cron job that checks the liveness of a number of DNS servers using dnspython. It had been working without issue until [12/Nov/2014:13:28:12 -0800] (about 12 hours ago), when it started failing 100% of the time with the following:
DNS Lookup failed: 'ApplicationError: 4 Unknown error.'. Traceback (most recent call last):
  File ".../handlers/tasks.py", line 150, in _checkDNSServer
    answers = resolver.query(domain, 'A', source='')
  File "lib/dns/resolver.py", line 830, in query
    source_port=source_port)
  File "lib/dns/query.py", line 213, in udp
    s.bind(source)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/socket.py", line 222, in meth
    return getattr(self._sock,name)(*args)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 660, in bind
    self._CreateSocket(bind_address=address)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 611, in _CreateSocket
    raise _SystemExceptionFromAppError(e)
ApplicationError: ApplicationError: 4 Unknown error.
The code in question is fairly simple ...
def _checkDNSServer(self, ip):
    """Return True if the server is up and responds within 1 second,
    False if the server is down or responded slowly.
    """
    domain = 'www.testdomain.com'
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    starttime = datetime.now()
    try:
        answers = resolver.query(domain, 'A', source='')
        duration = datetime.now() - starttime
        logging.debug("DNS Lookup Time %s" % duration)
        # Max delay of 1 second
        if duration > timedelta(seconds=1):
            return False
        return True
    except Exception as e:
        tb = traceback.format_exc()
        logging.error("DNS Lookup failed: '%s'. %s", e, tb)
        return False
- the code continues to work on the local development server
- billing is enabled
- quota is sufficient
- no changes to the App Engine release version (1.9.16) before/after the error appeared
- the target servers are live and responding OK
Suggestions?
Seems it was a transient issue and is now resolved.