DecisionTaskTimedOut before the specified timeout - cadence-workflow

I have a case where the decision times out after 5 seconds even though the timeout is set to 10:
17 2019-06-13T17:46:59Z DecisionTaskScheduled {TaskList:{Name:maxim-C02XD0AAJGH6:db09fd84-98bf-4546-a0d8-fb51e30c2b41}, StartToCloseTimeoutSeconds:10, Attempt:0}
18 2019-06-13T17:47:04Z DecisionTaskTimedOut {ScheduledEventId:17, StartedEventId:0, TimeoutType:SCHEDULE_TO_START}
This is using a Cadence service running in local Docker, and I can reproduce it reliably.

The 5s timeout is due to Cadence's Sticky Execution feature. Sticky Execution is enabled by default on the Cadence worker; it lets the workflow state be cached on the worker after it responds with decisions, so the Cadence server can dispatch new decision tasks directly to the same worker, which reuses the cached state and produces new decisions without replaying the entire execution history.
The decision SCHEDULE_TO_START timeout exists so that a decision can be sent to another worker when the original worker restarts and there is no longer a poller on the sticky task list for that workflow execution. When it fires, the Cadence server clears the stickiness for that execution and dispatches the decision to the original task list, where any other worker can pick it up.
// Optional: Sticky schedule to start timeout.
// default: 5s
// The resolution is seconds. See details about StickyExecution on the comments for DisableStickyExecution.
StickyScheduleToStartTimeout time.Duration
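
As a minimal sketch (assuming the Cadence Go client, go.uber.org/cadence/worker), you can either lengthen this timeout or disable sticky execution entirely via worker.Options; the values below are illustrative:

package main

import (
    "time"

    "go.uber.org/cadence/worker"
)

// newWorkerOptions builds the options passed to worker.New along with the
// service client, domain and task list (omitted here as placeholders).
func newWorkerOptions() worker.Options {
    return worker.Options{
        // Give a restarted worker more time before the decision falls back
        // to the original task list (default 5s, second resolution).
        StickyScheduleToStartTimeout: 10 * time.Second,
        // Or always replay from history instead of using the cache:
        // DisableStickyExecution: true,
    }
}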

Related

Airflow tasks failing with SIGTERM when worker pods are scaling down

I am running an Airflow cluster on EKS on AWS. I have set up some scaling config for the workers: if CPU/memory usage goes above 70%, Airflow spins up a new worker pod. However, I am facing an issue when these worker pods scale down. When worker pods start scaling down, two things happen:
If no task is running on a worker pod, it terminates within about 40 seconds.
If a task is running on a worker pod, the pod terminates in about 8 minutes, and after one more minute I find the task marked as failed in the UI.
I have set the two properties below in the Helm chart for worker pod termination.
celery:
  ## if celery worker Pods are gracefully terminated
  ## - consider defining a `workers.podDisruptionBudget` to prevent there not being
  ##   enough available workers during graceful termination waiting periods
  ##
  ## graceful termination process:
  ##  1. prevent worker accepting new tasks
  ##  2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
  ##  3. send SIGTERM to worker
  ##  4. wait AT MOST `workers.terminationPeriod` for kill to finish
  ##  5. send SIGKILL to worker
  ##
  gracefullTermination: true

  ## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
  ##
  gracefullTerminationPeriod: 180

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
## - [WARNING] tasks that are still running during SIGKILL will be orphaned, this is important
##   to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 120
From these values the worker pod should shut down after at most 5 minutes (180 + 120 seconds), irrespective of whether a task is running or not, so I am not sure why I see a total of about 8 minutes for worker pod termination. My main issue: is there any way to set up the config so that a worker pod only terminates once the tasks running on it have finished execution? Since tasks in my DAGs can run anywhere from a few minutes to a few hours, I don't want to put a large value for gracefullTerminationPeriod. I would appreciate any solution around this.
Some more info: generally the long-running task is a PythonOperator which runs either a Presto SQL query or a Databricks job via PrestoHook or DatabricksOperator respectively, and I don't want these to receive SIGTERM before they complete their execution when the worker pod scales down.
This is not possible due to limitations on the Kubernetes side. More details are available here. However, using a large value for gracefullTerminationPeriod works; although this is not what I originally intended, it works better than I expected. When a large gracefullTerminationPeriod is set, workers don't just wait out the full period before terminating: once a worker pod is marked for termination, it terminates as soon as the number of tasks running on it reaches zero.
Until Kubernetes accepts the proposed changes and a new community Helm chart is released, I think this is the best solution that doesn't incur the cost of keeping workers up.

How to make cadence workers stop accepting new tasks

I want to achieve a use case where, during graceful scale-down, Cadence workers do not accept any new jobs. I am using Cadence on Kubernetes, so I plan to set
terminationGracePeriodSeconds to a known maximum timeout by which I know all in-progress tasks will have finished on that particular pod. New tasks would then be allocated only to active workers.
My use case is that my activity has a large StartToClose timeout, and during a deployment the activity task gets picked up by a worker that is being shut down; it then cannot complete until the timeout expires and the task is retried.
This is the background/use case:
the activity will still wait for the StartToClose timeout and then retry; in scenarios where we have a large StartToClose timeout, how do we retry immediately?
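
A minimal sketch of that shutdown plan, assuming the Cadence Go client (go.uber.org/cadence/worker); the worker construction (service client, domain, task list) is omitted as a placeholder:

package main

import (
    "os"
    "os/signal"
    "syscall"

    "go.uber.org/cadence/worker"
)

// runWorker starts polling, then stops accepting new tasks once Kubernetes
// sends SIGTERM at the start of terminationGracePeriodSeconds.
func runWorker(w worker.Worker) {
    if err := w.Start(); err != nil {
        panic(err)
    }

    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGTERM, os.Interrupt)
    <-sig

    // Stop() stops polling the task lists, so no new decision or activity
    // tasks are dispatched to this pod while it drains within the grace period.
    w.Stop()
}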
It's recommended to set a correct timeout/retry policy for the activity to mitigate this issue.
For a large StartToClose timeout, the activity should set a heartbeat timeout and send heartbeats to the server.
Without a heartbeat timeout, the issue in the question can happen even if this "graceful shutdown" is provided, because the activity execution or the worker host can still crash.
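
A minimal sketch of that recommendation, again assuming the Cadence Go client; the activity, workflow, and timeout values here are illustrative:

package main

import (
    "context"
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/activity"
    "go.uber.org/cadence/workflow"
)

// LongActivity heartbeats while it works so the server can detect a dead
// worker within HeartbeatTimeout instead of waiting for StartToClose.
func LongActivity(ctx context.Context, input string) error {
    for i := 0; i < 100; i++ {
        // ... one unit of work ...
        activity.RecordHeartbeat(ctx, i) // progress details are available on retry
        select {
        case <-ctx.Done(): // activity cancelled or timed out
            return ctx.Err()
        case <-time.After(time.Second):
        }
    }
    return nil
}

// SampleWorkflow pairs the large StartToCloseTimeout with a short
// HeartbeatTimeout and a retry policy.
func SampleWorkflow(ctx workflow.Context, input string) error {
    ao := workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute,
        StartToCloseTimeout:    4 * time.Hour, // deliberately large
        HeartbeatTimeout:       30 * time.Second,
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: 5 * time.Hour,
            MaximumAttempts:    3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    return workflow.ExecuteActivity(ctx, LongActivity, input).Get(ctx, nil)
}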

For an activity called in a loop, does the retry policy for the activity apply to each run?

For a given workflow with activity A with max retries set to 3, if I have the following piece of code:
for (String type : types) {
    activityA.process(type);
}
and types in this case is ["type1", "type2", "type3"].
So if activityA processes type1 successfully, then starts processing type2 and fails for some reason:
1. Will the retry policy for activityA apply each time a type is run, or will it be 3 retries across all activity types?
2. If the workflow fails when executing type2, will the workflow restart from the beginning and process type1 again, or will it start from type2?
For 1: the retry policy applies independently to each activity invocation, so each type gets three retries.
For 2: workflow failure is a terminal state for the workflow execution. It will not retry automatically unless you specify a retry policy when starting the workflow, and when a workflow does retry, it starts from the very beginning.
See also https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
Or maybe what you asked about is a worker failure rather than a workflow failure? Cadence is very fault tolerant to worker failure: the workflow automatically resumes from where it left off before the previous worker died.
See also https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism
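
For illustration, a hedged Go-client analogue of the Java loop above (names are illustrative): each ExecuteActivity call is a separate invocation, so the retry policy applies per type, and already-completed invocations are replayed from history rather than re-executed:

package main

import (
    "context"
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/workflow"
)

func ProcessTypesWorkflow(ctx workflow.Context, types []string) error {
    ao := workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute,
        StartToCloseTimeout:    10 * time.Minute,
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: time.Hour,
            MaximumAttempts:    3, // attempts per invocation, i.e. per type
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    for _, t := range types {
        // If "type2" exhausts its attempts, the error surfaces here; a worker
        // failure does not re-run "type1" because its result is already
        // recorded in the workflow history.
        if err := workflow.ExecuteActivity(ctx, ProcessActivity, t).Get(ctx, nil); err != nil {
            return err
        }
    }
    return nil
}

// ProcessActivity stands in for activityA.process(type).
func ProcessActivity(ctx context.Context, t string) error {
    return nil
}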

Delay in Kubernetes Job status update when running many jobs in parallel

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace: as soon as a Job's status transitions to succeeded: 1, we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.
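
For reference, a minimal client-go sketch of the watch-and-delete loop described above (assuming it runs in-cluster; names are illustrative). The observed delay sits between the Pod finishing and the Job controller setting .status.succeeded:

package main

import (
    "context"
    "fmt"

    batchv1 "k8s.io/api/batch/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func watchJobs(ctx context.Context, namespace string) error {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }

    w, err := clientset.BatchV1().Jobs(namespace).Watch(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    defer w.Stop()

    policy := metav1.DeletePropagationBackground
    for event := range w.ResultChan() {
        job, ok := event.Object.(*batchv1.Job)
        if !ok {
            continue
        }
        // .status.succeeded is set by the Job controller after the Pod
        // completes; this is where the delay described above shows up.
        if job.Status.Succeeded >= 1 {
            fmt.Printf("job %s complete, deleting\n", job.Name)
            _ = clientset.BatchV1().Jobs(namespace).Delete(ctx, job.Name,
                metav1.DeleteOptions{PropagationPolicy: &policy})
        }
    }
    return nil
}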

Warmup services on upgrade in Service Fabric

We are wondering if there is a built-in way to warm up services as part of the service upgrades in Service Fabric, similar to the various ways you could warm up e.g. IIS based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery etc.) before being considered as started and available for other services to contact. This warmup should be part of the upgrade domain processing so the upgrade process should wait for the warmup to be completed and the service reported as OK/Ready.
How are others handling such scenarios, controlling the process for signalling to the service fabric that the specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit health events yourself. For instance, you could report health 'Unknown' while warming up and adjust your health policy (HealthCheckWaitDurationSec) to check for this.
Reporting health can help, but you can't report Unknown; you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact the upgrade. To clear the Error, your service can report health state Ok with RemoveWhenExpired=true and a low TTL (read more on how to report).
You must increase HealthCheckRetryTimeout based on the maximum warm-up time. Otherwise, if a health check is performed while the service is still evaluated as Error, the upgrade will fail (and roll back or pause, per your policy).
So, the order of events is:
1. Your service reports Error ("warming up in progress").
2. The upgrade waits for the fixed HealthCheckWaitDurationSec (you can set this to the minimum warm-up time).
3. The upgrade performs health checks: if the service hasn't warmed up yet, its health state is Error, so the upgrade retries until either HealthCheckRetryTimeout is reached or your service is no longer in Error (warm-up completed and your service cleared the Error).