Why should an update-period value for kubernetes rolling-update be long (default: 1m0s)? - kubernetes

I was wondering what would be the potential problem if I reduce the --update-period (whose default value is 1m0s) to about 5s (or even 1s)? I've watched a few video clips, and it seems the presenters implied that it's a bad idea to have a short period but did not explain why.
The reason I want to make it shorter is that we sometimes prefer a fast and slightly risky transition over a safe and steady one. As far as I know, what rolling-update does is roughly:
while the goal has not been achieved {
    scale up the new version
    sleep as specified by --update-period
    scale down the old one
    check the deadline
}
From the above flow, I don't see any problem with not sleeping for a long time. Deadline checking is based on the timeout configuration, so it seems the only consequence of lowering --update-period would be iterating the loop more frequently.
One thing I have not fully understood is how scaling down is performed, but I assume that it still does graceful termination, such as sending SIGTERM and waiting for 30s until finally sending SIGKILL to the processes in the pod.
FYI, I'm using the Google Container Engine.

It does not have to be long; the default is just a precaution in case a pod transitions to the Running state but crashes a couple of seconds later. If your update period is short, you may keep deploying pods that eventually turn out to be unstable, without giving the whole process enough time to notice.
If you're willing to take the risk it's totally fine to have a short update period.
Also, if you want truly fast and reliable deployments, you should check out the Deployment API. Its rolling-update logic happens server side, which increases both reliability and speed.
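For a sense of what that looks like, here is a rough sketch of a Deployment expressed with the Go types from k8s.io/api (most people would write this as YAML; the names, image and maxSurge/maxUnavailable values are assumptions for illustration, not a recommended configuration):
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildDeployment sketches a Deployment whose rollout is driven server side:
// with maxSurge=1 and maxUnavailable=0 the controller brings up one new pod
// at a time and only removes an old pod once its replacement is Ready, so no
// fixed client-side sleep period is needed.
func buildDeployment() *appsv1.Deployment {
	replicas := int32(3)
	maxSurge := intstr.FromInt(1)
	maxUnavailable := intstr.FromInt(0)
	labels := map[string]string{"app": "my-app"} // hypothetical labels

	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Strategy: appsv1.DeploymentStrategy{
				Type: appsv1.RollingUpdateDeploymentStrategyType,
				RollingUpdate: &appsv1.RollingUpdateDeployment{
					MaxSurge:       &maxSurge,
					MaxUnavailable: &maxUnavailable,
				},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{Name: "my-app", Image: "gcr.io/my-project/my-app:v2"}, // placeholder image
					},
				},
			},
		},
	}
}

func main() {
	fmt.Println(buildDeployment().Name)
}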

How to distribute tasks between servers where each task must be done by only one server?

Goal: There are X backend servers and Y tasks. Each task must be done by only one server; the same task being run by two different servers should not happen.
There are tasks which include continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as the server stays alive.
Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?
Well, the way you define your problem makes it hard to reason about. What you are actually looking for is called a "distributed lock".
Let's start with a simpler problem: assume you have only two concurrent servers S1, S2 and a single task T. The safety property you stated remains as is: at no point in time may both S1 and S2 process task T. How could that be achieved? The following strategies come to mind:
Implement an algorithm that deterministically maps a task to a responsible server. For example, it could be as stupid as if task.name.contains('foo') then server1.process(task) else server2.process(task). That works and indeed might fit some real-world requirements out there, yet such an approach is a dead end: a) you have to know how many servers you will have upfront, statically, and - most dangerously - b) you cannot tolerate either server being down: if, say, S1 is taken offline, there is nothing you can do with T right now except wait for S1 to come back online. These drawbacks could be softened and optimized, yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.
Implement an algorithm that allows S1 and S2 to agree upon who is responsible for T. Basically, you want both S1 and S2 to come to a consensus about the value of an (assumed, not necessarily needed) T.is_processed_by = "S1" or T.is_processed_by = "S2" property. Then your requirement translates to "at any point in time is_processed_by is seen by both servers in the same way". Hence "consensus": an agreement (between the servers) about the is_processed_by value. Having that eliminates all the "too static" issues of the previous strategy: you are no longer bound to 2 servers, you could have n, n > 1 servers (provided that your distributed consensus works for a chosen n). However, it is not prepared for accidents like an unexpected power outage: it could be that S1 won the competition, is_processed_by became equal to "S1", S2 agreed with that and... S1 went down and did nothing useful...
...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to keep making progress whenever possible. To achieve that property - among many other things I am not mentioning - you have to make sure that a spontaneous server death is detected and, once it happens, no task T stays stuck for indefinitely long. How do you achieve that? That's another story; a typical practical solution is to copy the good old TCP way of doing essentially the same thing: meet the keepalive approach.
OK, let's conclude what we have by now:
Take any implementation of "distributed locking", which is equivalent to "distributed consensus". It could be ZooKeeper used correctly, PostgreSQL running a serializable transaction, or anything alike (see the sketch after this list).
For each unprocessed or stuck task T in your system, make all the free servers S race for that lock. Only one of them is guaranteed to win and all the rest will surely lose.
Frequently enough, push TCP-keepalive-style notifications for each in-flight task or, at least, for each live server. Missing, let's say, three notifications in a row should be treated as the server's death, and all of its tasks should be re-marked as "stuck" and (eventually) reprocessed as in the previous step.
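To make the first step concrete, here is a minimal sketch of such a lock race built on PostgreSQL session-level advisory locks (one of the options mentioned above). The driver, connection string and task ID are assumptions for illustration; the useful property is that the lock is tied to the database session, so a crashed server releases its tasks automatically.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres driver; any database/sql driver works
)

// tryClaimTask races for a session-level advisory lock keyed by taskID.
// At most one session can hold a given lock at a time, and the lock is
// released automatically when the holding session dies, which covers the
// liveness part without any explicit cleanup.
func tryClaimTask(ctx context.Context, conn *sql.Conn, taskID int64) (bool, error) {
	var won bool
	err := conn.QueryRowContext(ctx, "SELECT pg_try_advisory_lock($1)", taskID).Scan(&won)
	return won, err
}

func main() {
	ctx := context.Background()
	db, err := sql.Open("postgres", "postgres://localhost/tasks?sslmode=disable") // assumed DSN
	if err != nil {
		log.Fatal(err)
	}
	// Use a dedicated connection: advisory locks are bound to the session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	won, err := tryClaimTask(ctx, conn, 42)
	if err != nil {
		log.Fatal(err)
	}
	if won {
		log.Println("this server won the race and processes task 42")
		// ... do the work, then SELECT pg_advisory_unlock(42) or simply close the session
	} else {
		log.Println("another server already owns task 42")
	}
}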
And that's it.
P.S. Safety & liveness properties are something you'd definitely want to be aware of when it comes to distributed computing.
Try rabbitmq worker queues
https://www.rabbitmq.com/tutorials/tutorial-two-python.html
It has an acknowledgement feature, so if a task fails or the server crashes it will automatically replay your task. Based on your specific use case you can set up retries, etc.
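The linked tutorial is in Python; here is a hedged equivalent of the acknowledgement part with the Go client (github.com/rabbitmq/amqp091-go). The broker URL, queue name and processing logic are placeholders; the point is manual acks, which make the broker redeliver a task to another worker when the current one dies before acknowledging.
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// process is a placeholder for the real task work.
func process(body []byte) error {
	log.Printf("working on %s", body)
	return nil
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // assumed broker URL
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Durable queue so tasks survive a broker restart.
	q, err := ch.QueueDeclare("task_queue", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// autoAck=false: a message stays unacknowledged until we confirm it, so if
	// this worker dies mid-task the broker redelivers it to another worker.
	msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		if err := process(d.Body); err != nil {
			_ = d.Nack(false, true) // requeue so another server can pick it up
			continue
		}
		_ = d.Ack(false)
	}
}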
"Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?"
You are getting into a known problem in distributed systems: how does a system make decisions when it is partitioned? Let me elaborate on this.
A simple statement like "the server dies" requires quite a deep dive into what it actually means. Did the server lose power? Is the network between your control plane and the server down (while the task keeps running)? Or maybe the task completed successfully, but the failure happened just before the task server was about to report it? Wanting to be 100% correct in deciding the current state of the system is the same as saying the system has to be 100% consistent.
This is where the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) comes into play. Since your system may be partitioned at any time (a worker server may get disconnected or die, which is the same state) and you want to be 100% correct/consistent, the system won't be 100% available.
To reiterate the previous paragraph: if the system suspects a task server is down, the system as a whole has to come to a stop until it can determine what happened to that particular task server.
The trade-off between consistency and availability is at the core of distributed systems. Since you want to be 100% correct, you won't have 100% availability.
While availability won't be 100%, you can still make the system as available as possible. Several approaches may help with that.
The simplest one is to alert a human when the system suspects a task server is down. The human gets a notification (24/7), wakes up, logs in and checks manually what is going on. Whether this approach works for your case depends on how much availability you need, but it is completely legitimate and widely used in the industry (those engineers carrying pagers).
A more complicated approach is to let the system fail over to another task server automatically, if that is possible. A few options are available here, depending on the type of task.
The first type of task is re-runnable, but must exist as a single instance. In this case, the system uses the "STONITH" (shoot the other node in the head) technique to make sure the previous node is dead for good. For example, in a cloud the system would actually kill the whole container of the task server and then start a new container as a failover.
The second type of task is not re-runnable. For example, a task that transfers money from account A to account B is not (automatically) re-runnable: the system does not know whether the task failed before or after the money was moved. Hence, the failover needs to do extra work to determine the outcome, which may also be impossible if the network is not working correctly. In such cases the system usually halts until it can make a 100% correct decision.
None of these options will give 100% availability, but they can do as well as the nature of distributed systems allows.

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the activity is scheduled by a workflow and the time it is picked up by an activity worker to start executing it. In other words, it bounds the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports its completion to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout duration for an activity, from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration specifies the maximum duration the Cadence server will wait for a heartbeat before assuming the activity worker has failed.
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward once you know what it does. Essentially, you should make it long enough that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker, such as the latency of your downstream dependencies (i.e. services, networking, etc.). On the other hand, you should aim to keep this value as small as feasible to make your end-to-end system more responsive. If you can't make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider setting a HeartbeatTimeout and implementing heartbeating in your activity.
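If you do go the heartbeating route, the activity itself has to report heartbeats more often than the configured HeartbeatTimeout. A minimal sketch with the Cadence Go client (the activity and its per-item work are made-up placeholders):
package sample

import (
	"context"

	"go.uber.org/cadence/activity"
)

// handleItem is a placeholder for the real per-item work.
func handleItem(item string) error { return nil }

// LongActivity is a hypothetical long-running activity that heartbeats so the
// Cadence server can detect a dead worker via HeartbeatTimeout long before
// StartToCloseTimeout would expire.
func LongActivity(ctx context.Context, items []string) error {
	for i, item := range items {
		if err := handleItem(item); err != nil {
			return err
		}
		// Report progress; this must happen more often than the configured HeartbeatTimeout.
		activity.RecordHeartbeat(ctx, i)
	}
	return nil
}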
ScheduleToStartTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it's worth taking a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Downstream latency spikes that increase the time it takes to complete each activity task, which in turn reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can build up its own backlog and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add even more activity tasks to the queue and make it harder to recover from the backlog, or make it worse. On the other hand, there are many use cases where business requirements genuinely limit the total time the system can take to process an activity. Therefore, it's usually not a bad idea to aim for a high ScheduleToStartTimeout value as long as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.
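Putting this together, here is a hedged sketch of where these options are set with the Cadence Go client; the concrete durations and names are placeholder assumptions, not recommendations:
package sample

import (
	"context"
	"time"

	"go.uber.org/cadence/workflow"
)

// MyActivity is a placeholder activity used by the workflow below.
func MyActivity(ctx context.Context, input string) (string, error) {
	return "done: " + input, nil
}

// MyWorkflow shows where the timeouts discussed above are configured.
func MyWorkflow(ctx workflow.Context) error {
	ao := workflow.ActivityOptions{
		ScheduleToStartTimeout: 5 * time.Minute,  // how long the task may sit in the queue
		StartToCloseTimeout:    time.Minute,      // how long a single attempt may run
		HeartbeatTimeout:       20 * time.Second, // only matters if the activity heartbeats
	}
	ctx = workflow.WithActivityOptions(ctx, ao)

	var result string
	return workflow.ExecuteActivity(ctx, MyActivity, "some-input").Get(ctx, &result)
}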

When does a term begin and end in Raft?

I've been reading the paper "In Search of an Understandable Consensus Algorithm". I'm confused with how "term" works.
I have two thoughts.
A term begins with an election and ends with the next election. The next election may happen because the current leader crashes. As long as the current leader works perfectly, the term could last for a very long time.
A term's end is determined when it begins. For example, after a server wins the election, the term begins and is scheduled to end in 30 minutes. After 30 minutes, the leader stops sending heartbeats in order to cause another election.
So which one is correct? I feel like the first thought makes more sense and it provides better performance.
Either option would work, but your first option is preferable. If you stop sending heartbeats, you likely have to wait for quite some time (a few seconds, perhaps) before the new leader is elected. You could in theory avoid this wait and trigger an election immediately, but elections are always slightly disruptive, so one normally designs systems to avoid them as much as possible.
The only time an election is really needed is if something has gone wrong: for instance a communication breakdown or some nodes have failed. In practice clusters may run for a very long time (weeks? years?) without a failure, so they do not need more frequent elections.
Also note that terms do not really have a well-defined (global) beginning and end, because of the asynchronous nature of communication and the difficulty of pinning down a notion of time in a distributed system. A node may believe a term is still ongoing even though the other nodes all believe it either hasn't started or has finished.
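To illustrate the first interpretation, here is a toy sketch (nothing close to a complete Raft implementation; all names are made up): the term number only advances when a node's election timeout fires because heartbeats stopped arriving, so a healthy leader's term can last indefinitely.
package raftsketch

import "time"

type node struct {
	currentTerm     int
	lastHeartbeat   time.Time
	electionTimeout time.Duration
}

// onHeartbeat is called when an AppendEntries heartbeat arrives from the leader.
func (n *node) onHeartbeat(leaderTerm int) {
	if leaderTerm >= n.currentTerm {
		n.currentTerm = leaderTerm // adopt the leader's term; the term simply continues
		n.lastHeartbeat = time.Now()
	}
}

// tick runs periodically on a follower. A term has no pre-planned end: it ends
// only when some node gives up on the current leader and starts an election.
func (n *node) tick() {
	if time.Since(n.lastHeartbeat) > n.electionTimeout {
		n.currentTerm++ // a new term begins with this election
		n.lastHeartbeat = time.Now()
		// ... become candidate and send RequestVote RPCs to the other nodes
	}
}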

How to determine the graceful period of kubernetes while terminating a pod?

We'd like to scale in some of the running instances on which many kubernetes pods are running. So we are going to gracefully stop the pods using the grace period, as described in the official document termination-of-pods. I have read many blog posts and the official documentation, and they all tell how to gracefully terminate a pod with a grace period, but they do not say how to decide how long the grace period should be.
Let's say, for example, a container in a pod serves thousands of requests during one time period and needs more than 30s to complete them all. In that case it would be a bad idea to set the grace period to 30s, because some of the requests would be lost. However, when user load is low, the same container in the same pod serves only dozens of requests during another time period and completes them all in 5s; in that case a 30s grace period would be too long.
That's my consideration. So, my question is as follows.
1. Is there any best practice for determining a good grace period?
2. Is there any approach to check whether the in-flight requests in a container have completed and only then gracefully terminate the pod?
3. Can I extend the initial graceful period after sending the termination command to a pod?
Thanks in advance.
The best way to determine the ideal grace period is through observation: put your service under a realistic production load and measure. This is highly project specific!
If the process with PID 1 exits before the grace period elapses, your container is marked as Terminated right away, so it's worth setting a value slightly higher than what you would expect under normal circumstances.
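A common way to take advantage of that is to have the application drain in-flight requests on SIGTERM and then exit on its own, well before the grace period runs out. A minimal Go sketch (the port and drain timeout are assumptions):
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"} // assumed port

	// Kubernetes sends SIGTERM first and SIGKILL only after the grace period.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	<-stop
	// Stop accepting new connections and wait for in-flight requests to finish.
	// Once Shutdown returns the process exits, so the container is marked
	// Terminated immediately instead of waiting out the full grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second) // assumed drain budget
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}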
You might be interested in letting your containers write arbitrary information when they terminate. Kubernetes has a feature called Termination messages you might want to look into.

Sensu Scheduler Oddness

I run < 24 checks on my systems. Servers are not regularly heavily loaded. Load averages keep well under 1 during normal operation.
I have noticed a recurring issue where the check-cpu check would start alerting on high load averages on systems where there was no organic cause for high load. Further investigation showed the high load report was actually due to the check-cpu script running in parallel with other checks. Outside of the checks executing, CPU load was fine.
I upgraded from sensu 0.20 to 0.23 and continued to observe the same issue.
We found that a re-start of the sensu-server and sensu-client services would resolve the problem for a period of time (approximately 24 hours) and then it would return.
At this point we theorized that there must be some sort of time delay in the dispatch / execution of the checks on the host which eventually causes this overlap to occur.
All checks are set to run at an interval of 30 or 60 seconds.
I decided to set the interval of the check-cpu check to 83 seconds, and the issue has not occurred since, presumably because check-cpu no longer coincides with any other check and therefore does not see the high CPU load during that brief moment.
Is this some sort of inherent scheduling issue with sensu? Is it supposed to know how to dispatch checks with adequate spacing, or is this something that should be controlled by the interval parameter?
Thanks!
I have noticed that the checks drift in execution time, i.e. they do not run exactly every 30 seconds but every 30.001s or something like that. I guess the drift might be different for different checks, so eventually the checks sync up and all run at the same time, causing the problem. Running more checks at regular intervals (30s, 60s, etc.) will make this happen more often. If you want this changed, you have to report it to Sensu directly. I think they might fix it eventually, since they probably want the system to be scalable.
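A quick way to see why the 83-second interval helps, under the simplifying assumption that checks repeat on fixed periods from a common start: two periodic checks coincide again every lcm of their intervals, so 30s and 60s checks line up every minute, while an 83s check only lines up with a 60s check every 83 minutes.
package main

import "fmt"

// gcd and lcm of two check intervals, in seconds.
func gcd(a, b int) int {
	for b != 0 {
		a, b = b, a%b
	}
	return a
}

func lcm(a, b int) int { return a / gcd(a, b) * b }

func main() {
	fmt.Println(lcm(30, 60)) // 60: the 30s and 60s checks overlap every minute
	fmt.Println(lcm(60, 83)) // 4980: with an 83s interval, overlap happens only every 83 minutes
}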