How to determine the grace period for terminating a pod in Kubernetes?

We'd like to scale in some of the running instances on which many Kubernetes pods are running. So we are going to stop the pods gracefully, using a grace period as described in the official documentation (termination-of-pods). I have read many blog posts and the official documentation; they all explain how to gracefully terminate a pod with a grace period, but they do not say how to determine how long the grace period should be.
Let's say, for example, that a container in a pod serves thousands of requests during one time period and needs more than 30s to complete them all. In that case it would be a bad idea to set the grace period to 30s, because some of the requests would be lost. However, when the user load is low, the same container in the same pod serves only dozens of requests during another time period and completes them all in 5s; in that case a 30s grace period would be too long.
That's my concern. My questions are as follows.
1. Is there any best practice for determining how long the grace period should be?
2. Is there any way to check whether a container has finished processing its in-flight requests and only then gracefully terminate the pod?
3. Can I extend the initial grace period after sending the termination command to a pod?
Thanks in advance.

The best way to determine the ideal grace period is through observability: put your service under a realistic production load and measure. This is highly project-specific!
If the process with PID 1 exits before the grace period elapses, your container is marked as Terminated right away, so it's worth setting a value slightly higher than what you would expect under normal circumstances.
You might also be interested in letting your containers write arbitrary information when they terminate. Kubernetes has a feature called termination messages that you might want to look into.
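For illustration, here is a minimal pod sketch that puts both pieces together; the name, image and the 60-second value are placeholders rather than recommendations, and the actual number should come from your own measurements:
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo                             # placeholder name
spec:
  terminationGracePeriodSeconds: 60               # set slightly above the worst-case shutdown time you measured
  containers:
  - name: app
    image: example.com/app:latest                 # placeholder image
    terminationMessagePath: /dev/termination-log  # where the container can write its termination message
    terminationMessagePolicy: FallbackToLogsOnError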


Implementing a downtime on a resource that waits for the resource to finish its task before delaying it

I'm trying to represent a machine that works for x amount of time before warning the operator that the oil tank needs to be refilled. Keep in mind that the machine doesn't stop as soon as it sends the warning message. Instead, the operator waits until the machine finishes any activity it has already started, and once it's done, he stops the machine and fills the tank.
In order to represent this process I'm using a Station block from the Material Handling library, that seizes a resource from a resource pool block, to which a downtime block is applied.
Is there a way to make the downtime block wait until the machine stops before performing the maintenance?
I also want to associate a resource pool representing the operator with the downtime block, so that the operator is busy during the downtime, since he's the one responsible for filling the tank. Can I do that?
Thank you in advance!
Is there a way to make the downtime block wait until the machine stops before performing the maintenance?
Yes, explore how priorities work: give your machine task a higher priority than the downtime task and ensure that the downtime block does not preempt other tasks.
I also want to associate a resource pool representing the operator with the downtime block, so that the operator is busy during the downtime, since he's the one responsible for filling the tank. Can I do that?
Yes, set the task type to "go to flowchart" and use a custom flowchart to seize from a resource pool (again, check the help on how to set this up in detail).
PS: Please only ask one question per post. See https://stackoverflow.com/help/how-to-ask and, for AnyLogic, https://www.benjamin-schumann.com/blog/2021/4/1/how-to-win-at-anylogic-on-stackoverflow

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues in the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains, and the single BE server is registered as a worker for all of them.
What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
Meanwhile, nothing is visible in the backend logs. However, during this time, if I check the pollers in the Cadence web client, I can see that the task list is there, but it is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, since nothing can make progress on the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to act as a decision handler? Is it managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow fails (although I think that would generate something in the backend log from the Cadence client), or that at that point the worker has already been removed from the list and that causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know how a worker or task list can lose its ability to act as a decision handler
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So it will also happen if, for some reason, the worker stops polling for decision tasks.
As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types. (This is a design flaw; we are working on separating them.)
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and the worker accepted it but didn't respond back within the timeout.
It's very likely that there is a bug in your query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug, since crashing user code shouldn't crash the worker). A query task then looped over all instances of your worker pool and finally crashed all of your decision workers.
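As a rough illustration of that point (the interface and method names below are hypothetical, not from the question), a query handler should be a cheap, side-effect-free read of workflow state so that it cannot take the decision worker down:
import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.WorkflowMethod;

public interface OrderWorkflow {                 // hypothetical workflow interface
  @WorkflowMethod
  void processOrder(String orderId);

  // Query handlers run on the decision (workflow) worker. Keep them simple and
  // side-effect free; an exception or a long blocking call here can surface as the
  // "deadline-exceeded" query timeouts shown in the logs above.
  @QueryMethod
  String getStatus();
}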

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the activity is scheduled by a workflow and it's picked up by an activity worker to start executing it. In other words, it configures the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout duration for an activity from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration basically specifies the maximum duration the Cadence server would wait for a heartbeat before assuming the activity worker has failed.
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward once you know what it does. Essentially, you should make it long enough that the activity can complete under normal circumstances, so account for everything that can affect the time taken by an activity worker, such as the latency of your downstream dependencies (i.e. services, networking, etc.). On the other hand, you should aim to keep this value as small as is feasible to make your end-to-end system more responsive. If you can't make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider using a HeartbeatTimeout config and implement heartbeating in your activity.
ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here, so it's worth taking a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Down-stream latency spikes that would increase the time it takes to complete each activity task, which then reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can create its own backlog and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries add more activity tasks to the queue and make it even harder to recover from the backlog. On the other hand, there are many use cases where business requirements really do limit the total time the system can take to process an activity. Therefore, it's usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.
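As a hedged sketch with the Cadence Java client (MyWorkflow, MyActivities and the concrete durations are illustrative only; pick values from your own measurements as described above):
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Workflow;
import java.time.Duration;

public class MyWorkflowImpl implements MyWorkflow {          // MyWorkflow is a hypothetical interface

  private final MyActivities activities = Workflow.newActivityStub(
      MyActivities.class,                                    // hypothetical activity interface
      new ActivityOptions.Builder()
          .setScheduleToStartTimeout(Duration.ofMinutes(10)) // headroom for a queue backlog
          .setStartToCloseTimeout(Duration.ofMinutes(2))     // worst-case duration of a single attempt
          .setHeartbeatTimeout(Duration.ofSeconds(30))       // only needed for long-running, heartbeating activities
          .build());

  @Override
  public void run(String input) {
    activities.doWork(input);                                // hypothetical activity method
  }
}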

Relation between preStop hook and terminationGracePeriodSeconds

Basically, what I am trying to do is play around with the pod lifecycle and check whether we can do some cleanup/backup, such as copying logs, before the pod terminates.
What I need :
Copy logs/heap dumps from the container to a hostPath volume or S3 before the pod terminates
What I tried:
I used a preStop hook with a bash command to echo a message (just to see if it works!). I combined terminationGracePeriodSeconds with a delay in preStop and toggled them to see how the process behaves. For example, keep terminationGracePeriodSeconds at 30s (the default) and set the preStop command to sleep for 50s; the message should not be generated, since the container will have been terminated by then. This works as expected.
My questions:
What kind of processes are allowed (recommended) in a preStop hook? Copying logs/heap dumps of 15 GB or more will take a lot of time, and that time would then have to be reflected in terminationGracePeriodSeconds.
What happens when preStop takes more time than the configured grace period (e.g. if the logs are huge, say 10 GB)?
What happens if I do not have any hooks but still set terminationGracePeriodSeconds? Will the container remain up until that grace period ends?
I found this issue, which closely relates to what I'm doing, but could not follow it through: https://github.com/kubernetes/kubernetes/issues/24695
All input appreciated!
What kind of processes are allowed (recommended) in a preStop hook? Copying logs/heap dumps of 15 GB or more will take a lot of time, and that time would then have to be reflected in terminationGracePeriodSeconds.
Anything goes here; it's more a matter of opinion and of how long you would like your pods to linger around. Another option is to let your pods terminate and have them store their data somewhere that persists past the pod lifecycle (e.g. AWS S3 or EBS), then use something like a Job to clean up the data, etc.
What happens when preStop takes more time than the configured grace period (e.g. if the logs are huge, say 10 GB)?
Your preStop hook will not complete, which may mean incomplete data or data corruption.
What happens if I do not have any hooks but still set terminationGracePeriodSeconds? Will the container remain up until that grace period ends?
This would be the sequence:
A SIGTERM signal is sent to the main process in each container, and a “grace period” countdown starts.
If a container doesn't terminate within the grace period, a SIGKILL signal is sent and the container is forcibly killed.
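For reference, a minimal sketch of how the two settings fit together; the image, paths and numbers are placeholders, and the grace period simply has to be large enough to cover the preStop copy plus the normal shutdown:
apiVersion: v1
kind: Pod
metadata:
  name: prestop-demo                   # placeholder name
spec:
  terminationGracePeriodSeconds: 120   # must cover the preStop hook plus the normal shutdown
  containers:
  - name: app
    image: example.com/app:latest      # placeholder image
    lifecycle:
      preStop:
        exec:
          # placeholder cleanup: copy logs to a mounted hostPath before shutdown
          command: ["/bin/sh", "-c", "cp -r /var/log/app /backup/ || true"]
    volumeMounts:
    - name: backup
      mountPath: /backup
  volumes:
  - name: backup
    hostPath:
      path: /var/backup/app            # placeholder host directory
      type: DirectoryOrCreate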

Why should the --update-period value for kubernetes rolling-update be long (default: 1m0s)?

I was wondering what would be the potential problem if I reduce the --update-period (whose default value is 1m0s) to about 5s (or even 1s)? I've watched a few video clips, and it seems the presenters implied that it's a bad idea to have a short period but did not explain why.
The reason I want to make it shorter is that we sometimes prefer a fast and slightly risky transition rather than a safe and steady one. As far as I know, what rolling-update does is:
while the goal has not been achieved {
    scale-up the new version
    sleep as specified by --update-period
    scale-down the old one
    check deadline
}
From the above flow, I don't see any problem with not sleeping for a long time. Deadline checking is based on the timeout configuration, so it seems the only outcome of changing the --update-period would be iterating the loop more frequently.
One thing I have not fully understood is how scaling down is performed, but I assume that it still does graceful termination, such as sending SIGTERM and waiting for 30s until finally sending SIGKILL to the processes in the pod.
FYI, I'm using the Google Container Engine.
It does not need to be long; it is just a precaution in case a pod transitions to a Running state but crashes a couple of seconds later. If your update period is short, you will keep deploying pods that may eventually turn out to be unstable, without giving the whole process enough time to notice.
If you're willing to take the risk, it's totally fine to have a short update period.
Also, if you want truly fast and reliable deployments, you should check out the Deployment API. The rolling-update logic happens server-side, which increases both reliability and speed.
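For completeness, a minimal Deployment sketch using the current apps/v1 API (name, image and numbers are placeholders); here minReadySeconds plays a role similar to the --update-period pause, since a new pod only counts as available after staying Ready for that long:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                            # placeholder name
spec:
  replicas: 3
  minReadySeconds: 10                  # a new pod must stay Ready this long before it counts as available
  selector:
    matchLabels:
      app: app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                      # at most one extra pod above the desired count during the rollout
      maxUnavailable: 0                # never drop below the desired count
  template:
    metadata:
      labels:
        app: app
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: app
        image: example.com/app:v2      # placeholder image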