How to make cadence workers stop accepting new tasks - cadence-workflow

I want to achieve a use-case where, during graceful scale-down, I want to ensure that cadence workers do not accept any new jobs. I am using cadence on k8, so I plan to give a
terminationGracePeriodSeconds to a known maximum timeout by which I know that all the in-progress tasks will be finished on that particular pod. Hence new tasks will be allocated only to active workers.
My use case is that my activity has large startToClose timeout and during deployment, the activity task will be picked up and cannot complete until the timeout and retry.

This is the background/use case:
the activity will still wait for start to close timeout and then retry, in scenarios where we have large start to close timeout, how do retry immediatly
It's recommended to set correct timeout/retry policy for activity to mitigate this issue.
For large startToClose timeout, activity should set heartbeat timeout and send heartbeat to server.
Without heartbeat timeout, the issue in the question could happen even this "graceful shutdown" is provided. Because there could be some crash in activity execution, or worker host crashes.

Related

How do we ensure scale in protection for kubernetes pods?

While scaling-in, HPA shouldn't terminate a pod that has a job running on it.
This is taken care of by AWS autoscaling groups in the form of scale-in protection for instances. Is there something similar in kubernetes?
You use terminationGracePeriodSeconds to make your worker process wait until it is done. It will get a SIGTERM, then has that many seconds to finish (default 9 but you can make it anything, some of my workers have it set to 12 hours), then SIGKILL if it hasn't exited. So stop accepting new work units on SIGTERM, set the threshold to be the length of your longest work unit, and no worries :)

Out of box distributed job queue solution

Are there any existing out of the box job queue framework? basic idea is
someone to enqueue a job with job status New
(multiple) workers get a job and work on it, mark the job as Taken. One job can only be running on at most one worker
something will monitor the worker status, if the running jobs exceed predefined timeout, will be re-queued with status New, could be worker health issue
Once a worker completes a task, it marks the task as Completed in the queue.
something keeps cleaning up completed tasks. Or at step #4 when worker completes a task, the worker simply dequeues the task.
From my investigation, things like Kafka (pub/sub) or MQ (push/pull & pub/sub) or cache (Redis, Memcached) are mostly sufficient for this work. However, they all require some sort of development around its core functionality to become a fully functional job queue.
Also looked into relational DB, the ones supports "SELECT * FOR UPDATE SKIP LOCKED" syntax is also a good candidate, this again requires a daemon between the DB and worker, which means extra effort.
Also looked into the cloud solutions, Azure Queue storage, etc. similar assessment.
So my question is, is there any out of the box solution for job queue, that are tailored and dedicated for one thing, job queuing, without much effort to set up?
Thanks
Take a look at Python Celery. https://docs.celeryproject.org/en/stable/getting-started/introduction.html
The default mode uses RabbitMQ as the message broker, but other options are available. Results can be stored in a DB if needed.

What does shutdown look like for a kubernetes cron job pod when it's being terminated by "replace" concurrency policy?

I couldn't find anything in the official kubernetes docs about this. What's the actual low-level process for replacing a long-running cron job? I'd like to understand this so my application can handle it properly.
Is it a clean SIGHUP/SIGTERM signal that gets sent to the app that's running?
Is there a waiting period after that signal gets sent, so the app has time to cleanup/shutdown before it potentially gets killed? If so, what's that timeout in seconds? Or does it wait forever?
For reference, here's the Replace policy explanation in the docs:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/
Concurrency Policy
Replace: If it is time for a new job run and the previous job run hasn’t finished yet, the cron job replaces the currently running job run with a new job run
A CronJob has just another Pod underneath.
When a Cronjob with a concurrency policy of "Replace" is still active, the Job will be deleted, which also deletes the Pod.
When a Pod is deleted the Linux container/s will be sent a SIGTERM and then a SIGKILL after a grace period, defaulted to 30 seconds. The terminationGracePeriodSeconds property in a PodSpec can be set to override that default.
Due to the flag added to the DeleteJob call, it sounds like this delete is only deleting values from the kube key/value store. Which would mean the new Job/Pod could be created while the current Job/Pod is still terminating. You could confirm with a Job that doesn't respect a SIGTERM and has a terminationGracePeriodSeconds set to a few times your clusters scheduling speed.

DecisionTaskTimedOut before the specified timeout

I have a case when decision times out after 5 seconds when timeout is set to 10:
17 2019-06-13T17:46:59Z DecisionTaskScheduled {TaskList:{Name:maxim-C02XD0AAJGH6:db09fd84-98bf-4546-a0d8-fb51e30c2b41},
StartToCloseTimeoutSeconds:10, Attempt:0}
18 2019-06-13T17:47:04Z DecisionTaskTimedOut {ScheduledEventId:17,
StartedEventId:0,
TimeoutType:SCHEDULE_TO_START}
10:49 AM
It is using Cadence service running in a local docker and I can reproduce it reliably.
The 5s timeout is due to Cadence Sticky Execution feature. Sticky Execution is enabled by default on Cadence Worker which allows the workflow state to be cached on the worker after responding back with decisions. This allows Cadence server to directly dispatch new decision tasks to the same worker which allows to reuse the cached state and produce new decisions without replaying the entire execution history.
Decision SCHEDULE_TO_START timeout is put in place to allow decision to be sent to another worker when worker restarts and there is no poller on the sticky tasklist for a workflow execution. This causes the stickyness to be cleared by Cadence server for that execution and decision dispatched to original tasklist so it can be picked up by any other worker.
// Optional: Sticky schedule to start timeout.
// default: 5s
// The resolution is seconds. See details about StickyExecution on the comments for DisableStickyExecution.
StickyScheduleToStartTimeout time.Duration

What happens when Eureka instance skips a heartbeat against a Eureka server with self preservation turned off?

Consider this set-up:
Eureka server with self preservation mode disabled i.e. enableSelfPreservation: false
2 Eureka instances each for 2 services (say service#1 and service#2). Total 4 instances.
And one of the instances (say srv#1inst#1, an instance of service#1) sent a heartbeat, but it did not reach the Eureka server.
AFAIK, following actions take place in sequence on Server side:
ServerStep1: Server observes that a particular instance has missed a heartbeat.
ServerStep2: Server marks the instance for eviction.
ServerStep3: Server's eviction scheduler (which runs periodically) evicts the instance from registry.
Now on instance (srv#1inst#1) side:
InstanceStep1: It skips a heartbeat.
InstanceStep2: It realizes heartbeat did not reach Eureka Server. It retries with exponential back-off.
AFAIK, the eviction and registration do not happen immediately. Eureka server runs separate scheduler for both tasks periodically.
I have some questions related to this process:
Are the sequences correct? If not, what did I miss?
Is the assumption about eviction and registration scheduler correct?
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How that will affect the response to instance of service#2's request for fresh registry? How will it affect the eviction scheduler?
This question was answered by qiangdavidliu in one of the issues of eureka's GitHub repository.
I'm adding his explanations here for sake of completeness.
Before I answer the questions specifically, here's some high level information regarding heartbeats and evictions (based on default configs):
instances are only evicted if they miss 3 consecutive heartbeats
(most) heartbeats do not retry, they are best effort every 30s. The only time a heartbeat will retry is that if there is a threadlevel error on the heartbeating thread (i.e. Timeout or RejectedExecution), but this should be very rare.
Let me try to answer your questions:
Are the sequences correct? If not, what did I miss?
A: The sequences are correct, with the above clarifications.
Is the assumption about eviction and registration scheduler correct?
A: The eviction is handled by an internal scheduler. The registration is processed by the handler thread for the registration request.
An instance of service#2 requests fresh registry copy from server right after ServerStep2.
Will srv#1inst#1 be in the fresh registry copy, because it has not been evicted yet?
If yes, will srv#1inst#1 be marked UP or DOWN?
A: There are a few things here:
until the instance is actually evicted, it will be part of the result
eviction does not involve changing the instance's status, it merely removes the instance from the registry
the server holds 30s caches of the state of the world, and it is this cache that's returned. So the exact result as seem by the client, in an eviction scenario, still depends on when it falls within the cache's update cycle.
The retry request from InstanceStep2 of srv#1inst#1 reaches server right after ServerStep2.
Will there be an immediate change in registry?
How that will affect the response to instance of service#2's request for fresh registry? How will it affect the eviction scheduler?
A: again a few things:
When the actual eviction happen, we check each evictee's time to see if it is eligible to be evicted. If an instance is able to renew its heartbeats before this event, then it is no longer a target for eviction.
The 3 events in question (evaluation of eviction eligibility at eviction time, updating the heartbeat status of an instance, generation of the result to be returned to the read operations) all happen asynchronously and their result will depend on the evaluation of the above described criteria at execution time.