Airflow tasks time out after one hour even though the timeout setting is larger than 1 hour - celery

Currently I'm using Airflow with the Celery executor and Redis to run DAGs, and I have set execution_timeout to 12 hours in an S3 key sensor, but it fails after one hour on each retry.
I have tried updating visibility_timeout = 64800 in airflow.cfg, but the issue still exists:
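For context on why the failures come at exactly one hour: Celery's Redis transport uses a visibility_timeout that defaults to 3600 seconds, after which an unacknowledged task message can be re-delivered. In Airflow 1.10 it is typically raised in airflow.cfg roughly like this (the section name may vary by version):

```ini
[celery_broker_transport_options]
# Should exceed the longest expected task/sensor runtime
# (here 18 hours = 64800 seconds). The Redis transport's
# 3600-second default can cause a long-running task's message
# to be re-delivered after one hour.
visibility_timeout = 64800
```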
file_sensor = CorrectedS3KeySensor(
    task_id='listen_for_file_drop', dag=dag,
    aws_conn_id='aws_default',
    poke_interval=15,
    timeout=64800,  # 18 hours
    bucket_name=EnvironmentConfigs.S3_SFTP_BUCKET_NAME,
    bucket_key=dag_config[ConfigurationConstants.FILE_S3_PATTERN],
    wildcard_match=True,
    execution_timeout=timedelta(hours=12)
)
My understanding is that execution_timeout should let the task last 12 hours on each of the four total runs (retries = 3). But instead each retry fails after about an hour, so in total the task only lasts a little over 4 hours:
[2019-08-06 13:00:08,597] {{base_task_runner.py:101}} INFO - Job 9: Subtask listen_for_file_drop [2019-08-06 13:00:08,595] {{timeout.py:41}} ERROR - Process timed out
[2019-08-06 13:00:08,612] {{models.py:1788}} ERROR - Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1652, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/sensors/base_sensor_operator.py", line 97, in execute
    while not self.poke(context):
  File "/usr/local/airflow/dags/ProcessingStage/sensors/sensors.py", line 91, in poke
    time.sleep(30)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 42, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
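For scale, execution_timeout is enforced per try, so the expected and observed totals work out as follows (plain datetime arithmetic; retries = 3 is taken from the question):

```python
from datetime import timedelta

execution_timeout = timedelta(hours=12)  # per-try limit set on the sensor
tries = 1 + 3                            # first run plus retries = 3

expected_total = tries * execution_timeout   # what the settings promise
observed_total = tries * timedelta(hours=1)  # each try actually dies after ~1 hour

print(expected_total)  # 2 days, 0:00:00
print(observed_total)  # 4:00:00
```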

I figured it out a few days later.
Since I'm using AWS to deploy Airflow with the Celery executor, a few improperly configured CloudWatch alarms kept scaling the workers and the webserver/scheduler up and down :(
After those alarms were fixed, it works well now!!

Related

Celery: How to schedule worker processes/children restart

I'm trying to figure out how to set up my Celery workers to restart after living for a day.
Indeed, I have configured my worker children/processes to restart after executing a task.
But in some cases there are no tasks to execute for 3-4 days, so I need to restart the long-living children anyway.
Do you know how to do this?
This is my actual Celery app setup:
app = Celery(
    "celery",
    broker=f"amqp://bla#blabla/blablabla",
    backend="rpc://",
)
app.conf.task_serializer = "pickle"
app.conf.result_serializer = "pickle"
app.conf.accept_content = ["pickle", "application/json"]
app.conf.broker_connection_max_retries = 5
app.conf.broker_pool_limit = 1
app.conf.worker_max_tasks_per_child = 1  # Ensure 1 task is executed before restarting child
app.conf.worker_max_living_time_before_restart = 60 * 60 * 24  # The conf I want
Thank you :)

How to use `client.start_ipython_workers()` in dask-distributed?

I am trying to get workers to output some information from their IPython kernels and to execute various commands in the IPython session. I tried the examples in the documentation: the ipyparallel example works, but the second example (with IPython magics) does not. I cannot get the workers to execute any commands. For example, I am stuck on the following issue:
from dask.distributed import Client
client = Client()
info = client.start_ipython_workers()
list_workers = info.keys()
%remote info[list_workers[0]]
The last line returns an error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-9118451af441> in <module>
----> 1 get_ipython().run_line_magic('remote', "info['tcp://127.0.0.1:50497'] worker.active")
~/miniconda/envs/dask/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
2334 kwargs['local_ns'] = self.get_local_scope(stack_depth)
2335 with self.builtin_trap:
-> 2336 result = fn(*args, **kwargs)
2337 return result
2338
~/miniconda/envs/dask/lib/python3.7/site-packages/distributed/_ipython_utils.py in remote_magic(line, cell)
115 info_name = split_line[0]
116 if info_name not in ip.user_ns:
--> 117 raise NameError(info_name)
118 connection_info = dict(ip.user_ns[info_name])
119
NameError: info['tcp://127.0.0.1:50497']
I would appreciate any examples of how to get information from the IPython kernels running on the workers.
Posting here just to keep track of it: I raised an issue for this on GitHub: https://github.com/dask/distributed/issues/4522
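If the %remote magic keeps failing, a simpler way to pull information out of workers is client.run, a standard dask.distributed API that executes a plain function on every worker and returns a {worker_address: result} mapping. A minimal sketch (using an in-process cluster so the example is self-contained):

```python
import os
from dask.distributed import Client

# In-process cluster (threads, no separate processes) for the demo.
client = Client(processes=False)

# Run a function on every worker; the result dict is keyed by worker address.
pids = client.run(os.getpid)
print(pids)

client.close()
```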

WildFly Schedule overlapping

I use a scheduler in WildFly 9, with these EJB annotations:
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.ejb.Schedule;
I get loads of these warnings:
2020-01-21 12:35:59,000 WARN [org.jboss.as.ejb3] (EJB default - 6) WFLYEJB0043: A previous execution of timer [id=3e4ec2d2-cea9-43c2-8e80-e4e66593dc31 timedObjectId=FiloJobScheduler.FiloJobScheduler.FiskaldatenScheduler auto-timer?:true persistent?:false timerService=org.jboss.as.ejb3.timerservice.TimerServiceImpl#71518cd4 initialExpiration=null intervalDuration(in milli sec)=0 nextExpiration=Tue Jan 21 12:35:59 GMT+02:00 2020 timerState=IN_TIMEOUT info=null] is still in progress, skipping this overlapping scheduled execution at: Tue Jan 21 12:35:59 GMT+02:00 2020.
But when I measure the elapsed times, they are always < 1 minute.
The Scheduling is:
@Schedule(second = "*", minute = "*/5", hour = "*", persistent = false)
Has anyone an idea what is going on?
A little logging would help you. This runs every second because that's what you're telling it to do with the second = "*" attribute. If you want it to run only every 5 minutes of every hour, change the schedule to:
@Schedule(minute = "*/5", hour = "*", persistent = false)
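To make the difference concrete, here is the firing count over one hour for each expression (plain arithmetic over the calendar attributes, not WildFly API; in EJB calendar expressions an omitted second attribute defaults to "0"):

```python
# minute = "*/5" matches minutes 0, 5, 10, ..., 55 -> 12 minutes per hour.
matching_minutes = [m for m in range(60) if m % 5 == 0]

# second = "*" fires on all 60 seconds of each matching minute;
# omitting `second` defaults it to "0", i.e. once per matching minute.
fires_with_second_star = len(matching_minutes) * 60
fires_with_default_second = len(matching_minutes) * 1

print(fires_with_second_star, fires_with_default_second)  # 720 12
```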

Getting description of a PBS job queue

Is there any command that would allow me to query a running or queued PBS job for its attributes, such as RAM, number of processors, GPUs, etc.?
Use the qstat command:
qstat -f job_id
Expanding on the answer posted by dimm:
If a job is registered in a queue, you can query its attributes with the qstat command. If the job has already finished, you can only grep the relevant information from the log files. There is a handy tracejob command to do the grepping for you.
In PBS Pro and Torque, each job registered with a queue has two sets of attributes:
Resource_List holds the resources requested for a running or queued job.
resources_used holds the actual resource usage of a running job.
For example, in PBS Pro you could get the following attributes for Resource_List:
Resource_List.mem = 2000mb
Resource_List.mpiprocs = 8
Resource_List.ncpus = 8
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.qlist = queue1
Resource_List.select = 1:ncpus=8:mpiprocs=8
Resource_List.walltime = 02:00:00
 
And the following values for resources_used:
resources_used.cpupercent = 800
resources_used.cput = 00:03:31
resources_used.mem = 529992kb
resources_used.ncpus = 8
resources_used.vmem = 3075580kb
resources_used.walltime = 00:00:28
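When scripting around qstat, the flat key = value lines shown above are easy to load into a dict. A minimal sketch (sample lines copied from the listing above; note that real qstat -f output wraps long values onto indented continuation lines, which this sketch does not handle):

```python
# Sample of `qstat -f <job_id>` style output, taken from the listing above.
sample = """\
Resource_List.mem = 2000mb
Resource_List.ncpus = 8
Resource_List.walltime = 02:00:00
resources_used.mem = 529992kb
resources_used.ncpus = 8
"""

def parse_qstat(text):
    """Turn flat 'key = value' lines into a dict of strings."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs

print(parse_qstat(sample)["Resource_List.walltime"])  # 02:00:00
```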
For finished jobs, tracejob can fetch only some of the requested resources:
ncpus=8:mem=2048000kb
and the final values for resources_used:
resources_used.cpupercent=799
resources_used.cput=00:54:29
resources_used.mem=725520kb
resources_used.ncpus=8
resources_used.vmem=3211660kb
resources_used.walltime=00:06:53

GAE Socket API error - ApplicationError: 4 Unknown error

We have an App Engine cron job that checks the liveness of a number of DNS servers using dnspython. It had been working without issue until [12/Nov/2014:13:28:12 -0800] (about 12 hours ago), when it started failing 100% of the time with the following:
DNS Lookup failed: 'ApplicationError: 4 Unknown error.'. Traceback (most recent call last):
  File ".../handlers/tasks.py", line 150, in _checkDNSServer
    answers = resolver.query(domain, 'A', source='')
  File "lib/dns/resolver.py", line 830, in query
    source_port=source_port)
  File "lib/dns/query.py", line 213, in udp
    s.bind(source)
  File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/socket.py", line 222, in meth
    return getattr(self._sock,name)(*args)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 660, in bind
    self._CreateSocket(bind_address=address)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/remote_socket/_remote_socket.py", line 611, in _CreateSocket
    raise _SystemExceptionFromAppError(e)
ApplicationError: ApplicationError: 4 Unknown error.
The code in question is fairly simple ...
def _checkDNSServer(self, ip):
    """Return True if the server is up and responds within 1 second,
    False if the server is down or responded slowly.
    """
    domain = 'www.testdomain.com'
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    starttime = datetime.now()
    try:
        answers = resolver.query(domain, 'A', source='')
        duration = datetime.now() - starttime
        logging.debug("DNS Lookup Time %s" % duration)
        # Max delay of 1 second
        if duration > timedelta(seconds=1):
            return False
        return True
    except Exception as e:
        tb = traceback.format_exc()
        logging.error("DNS Lookup failed: '%s'. %s", e, tb)
        return False
- the code continues to work on the local development server
- billing is enabled
- quota is sufficient
- no changes to the App Engine release version (1.9.16) before/after the error appeared
- the target servers are live and responding OK
Suggestions?
Seems it was a transient issue and is now resolved.