I am using Rundeck 1.6 and need to configure a timeout option for Rundeck jobs so that when a job exceeds the timeout value, it automatically stops without proceeding to the next step.
Thanks
As of version 2.7, Rundeck supports a job timeout:
Timeout
You can set a maximum runtime for a job. If the runtime exceeds this value, the job will be halted (as if a user had killed it). (Note: the timeout only affects the job if it is invoked directly, not if it is used as a Job Reference.)
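For Rundeck 2.7 and later, here is a minimal sketch of how the timeout could be expressed in a YAML job definition; the job name, step, and the '30m' value are illustrative placeholders, not taken from your setup:
- name: nightly-backup
  description: Example job with a maximum runtime
  # Halt the job if it runs longer than 30 minutes
  timeout: '30m'
  sequence:
    keepgoing: false
    strategy: node-first
    commands:
      - exec: /opt/scripts/backup.sh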
Additionally, on the server where Rundeck is installed, alter the file /etc/ssh/ssh_config to:
Host *
ServerAliveInterval 180
ServerAliveCountMax 2
When I run (in debug mode) a Spark notebook in Azure Synapse Analytics, it doesn't seem to shut down as expected.
In the last cell I call: mssparkutils.notebook.exit("exiting notebook")
But then when I fire off another notebook (again in debug mode, same pool), I get this error:
AVAILABLE_COMPUTE_CAPACITY_EXCEEDED: Livy session has failed. Session state: Error. Error code: AVAILABLE_COMPUTE_CAPACITY_EXCEEDED. Your job requested 12 vcores. However, the pool only has 0 vcores available out of quota of 12 vcores. Try ending the running job(s) in the pool, reducing the numbers of vcores requested, increasing the pool maximum size or using another pool. Source: User.
So I go to Monitor => Apache Spark applications and I see the first notebook I ran still in a "Running" status, and I can manually stop it.
How do I automatically stop the Notebook / Apache Spark application? I thought that was the notebook.exit() call but apparently not...
In debug mode, the cluster's vcores are allocated to the notebook for the entire duration of the debug session (that is, until one hour of inactivity passes or until you manually terminate it).
Thus, you have two options:
Work on one notebook at a time, closing the debug session before starting another
OR
Configure the session to reduce the number of executors so that the Spark pool can provision multiple debug sessions at the same time (you might need to increase the size of the pool); see the sketch after this list
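As an illustration, the session can be scaled down with a %%configure cell at the top of the notebook, before any Spark code runs; the values below are placeholders rather than recommendations:
%%configure -f
{
    "driverMemory": "4g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 2,
    "numExecutors": 1
}
With fewer cores reserved per session, the pool's vcore quota can cover more than one debug session at a time.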
I just switched from ForkPool to gevent with concurrency (5) as the pool method for Celery workers running in Kubernetes pods. After the switch I've been getting a non-recoverable error in the worker:
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
The broker logs give basically the same message:
2021-11-01 22:26:17.251 [warning] <0.18574.1> Consumer None4 on channel 1 has timed out waiting for delivery acknowledgement. Timeout used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
I have CELERY_ACK_LATE set, but I was not familiar with the need to set a timeout for the acknowledgement period, and this never happened before when using processes. Tasks can be fairly long (60-120 seconds sometimes), but I can't find a specific setting to allow for that.
I've read a post in another forum from a user who set the timeout in the broker configuration to a huge number (like 24 hours) and was still having the same problem, so that makes me think there may be something else related to the issue.
Any ideas or suggestions on how to make the worker more resilient?
For future reference, it seems that newer RabbitMQ versions (3.8+) introduced a tight default for consumer_timeout (15 minutes, I think).
The solution I found (which has also been added to the Celery docs not long ago) was to just set a large value for consumer_timeout in RabbitMQ.
In this question, someone mentions setting consumer_timeout to false so that a large number isn't needed, but apparently there are some specifics about the configuration format for that to work.
I'm running RabbitMQ in Kubernetes and just did something like:
rabbitmq.conf: |
  consumer_timeout = 31622400000
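For context, a rough sketch of how that fragment can sit in a full ConfigMap; the metadata names here are made up and depend on how your RabbitMQ is deployed (e.g. via an operator or a plain StatefulSet):
apiVersion: v1
kind: ConfigMap
metadata:
  # hypothetical names; match them to your deployment
  name: rabbitmq-config
  namespace: messaging
data:
  rabbitmq.conf: |
    consumer_timeout = 31622400000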
The accepted answer is correct. However, if you have an existing RabbitMQ server running and do not want to restart it, you can dynamically set the configuration value by running the following command on the RabbitMQ server:
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
This will set the new timeout to 10 hours (36000000 ms). For it to take effect, though, you need to restart your workers; existing worker connections will continue to use the old timeout.
You can check the current configured timeout value as well:
rabbitmqctl eval 'application:get_env(rabbit, consumer_timeout).'
If you are running RabbitMQ via a Docker image, here's how to set the value: add -e RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit consumer_timeout 36000000" to your docker run command, or set the environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS to "-rabbit consumer_timeout 36000000".
Hope this helps!
Whenever we restart a failed task, it will ALWAYS pick up the config it had at the time of the failure and run with that... and THEN it picks up the new config and runs that as well.
We have connect jobs that we pause, update config, and then resume. This works fine, unless the task has failed.
If we restart a failed task, even if the connector has an updated config, the task will launch with the old config, run to completion or failure, and then a new task will be launched with the new config.
This can cause various data and other issues if you really don't want that old task to run with that config.
Any ideas how to restart a connector that has a failed task with a new config, and NOT have the old config get invoked?
(running Kafka v2.5, btw)
I don't know if it would make sense for the task to pick up the latest config.
For instance, let's assume that your connector fires up 10 distinct tasks and 1 of them fails. It wouldn't make sense to have the remaining 9 tasks of the connector running with the older config while the failed task runs the newest config once it is restarted.
I would say that in cases where you want to use a new/different configuration when a task fails, it might make more sense to restart the connector rather than the individual task(s):
POST /connectors/connector-name/restart HTTP/1.1
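For example, with curl against the Connect REST API (the host, port, and connector-name are placeholders for your own setup):
curl -X POST http://localhost:8083/connectors/connector-name/restart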
I was having this problem and managed to "fix" this by a bit of randomness.
I increased the number of tasks in the connector and then reduced it again, and it seemed to pick up the new configuration.
It was really random.
I do know that the restart did not work for me.
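If you want to script that tasks.max bump-and-revert rather than doing it in the UI, the Connect REST API lets you replace the connector's full configuration; a rough sketch, where the host, port, connector name, and file are placeholders and connector-config.json must contain the complete config (not just tasks.max):
curl -X PUT -H "Content-Type: application/json" \
  --data @connector-config.json \
  http://localhost:8083/connectors/connector-name/config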
For the EC2 launch type I'm able to check the agent configuration in the /etc/ecs/ecs.config file on the EC2 container instance. But is it possible to find the same info for an ECS Fargate task? For example, I'd like to know what the timeout between SIGTERM and SIGKILL (ECS_CONTAINER_STOP_TIMEOUT) is. Should it be possible to retrieve such info from the Amazon ECS Task Metadata Endpoint?
In Fargate, the timeout between SIGTERM and SIGKILL is the same as the default setting of 30 seconds.
For newer Fargate platform versions, you can use the stopTimeout container definition parameter. Note the maximum value of 120 seconds:
For tasks that use the Fargate launch type, the task or service requires platform version 1.3.0 or later (Linux) or 1.0.0 or later (for Windows). The max stop timeout value is 120 seconds. However, if the parameter isn't specified, the default value of 30 seconds is used.
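A minimal sketch of where stopTimeout lives in a task definition's container definitions; the container name and image are placeholders:
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-repo/app:latest",
      "essential": true,
      "stopTimeout": 120
    }
  ]
}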
I have Quartz 1.8.5 running in a clustered environment (2 nodes, persistence, clustered, JobStoreCMT).
Now I schedule several jobs to run every day at a specific hour.
I set requests recovery to true for each of these jobs (jobDetail.setRequestsRecovery(true)).
I see that the flag is set to 1 in the QRTZ_JOB_DETAILS table.
What I want is that when a node fails (the JBoss server is restarted, for example), the other live node restarts the failed job. But this doesn't happen.
What am I doing wrong / not doing?
Thanks.
Have you tried to update to the latest Quartz? There is a version 2.1.6 out already.
Otherwise, what you're doing seems to be right.
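For comparison, these are the cluster-related quartz.properties settings that failed-node recovery typically relies on; the values and data source names are illustrative, not a drop-in config:
# each node must auto-generate a unique instance id
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreCMT
org.quartz.jobStore.isClustered = true
# how often (ms) each node checks in; missed check-ins mark a node as failed
org.quartz.jobStore.clusterCheckinInterval = 20000
# hypothetical data source names defined elsewhere in the same properties file
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.nonManagedTXDataSource = myNonTxDS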