I have a use case where I need to collect the downtime of each deployment (i.e., whenever all the replicas (pods) are down at the same point in time).
My goal is to maintain the total downtime for each deployment since it was created.
I tried getting it from the deployment status, but the problem is that I need to make frequent calls to get the deployment and check for any downtime.
Also, the deployment status only stores the latest change, so I will end up missing changes that occurred between calls if there was more than one change (i.e., more than one period of downtime). I will also end up making frequent calls for multiple deployments, which will consume more compute resources.
Is there any reliable method to collect the downtime data of a deployment?
Thanks in advance.
A monitoring tool like Prometheus would be a better solution to handle this.
As an example, below is a graph from one of our deployments for the last 2 days.
If you look at the blue line for unavailable replicas, we had one replica unavailable from about 17:00 to 10:30 (ideally, the unavailable count should be zero).
This seems pretty close to what you are looking for.
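For example, if kube-state-metrics is scraped by your Prometheus (an assumption), the replica-count metrics it exports can be turned into a rough "total downtime" figure. A minimal sketch against the Prometheus HTTP API, where the server URL is a placeholder and the 24h window is arbitrary:

# Hypothetical sketch: approximate downtime per Deployment from Prometheus.
# Assumes kube-state-metrics is scraped, so that
# kube_deployment_status_replicas_available exists; PROM_URL is a placeholder.
import requests

PROM_URL = "http://prometheus.example.com:9090"

# Fraction of the last 24h during which a Deployment had zero available
# replicas, converted to seconds of downtime (1-minute subquery resolution).
QUERY = (
    'avg_over_time((kube_deployment_status_replicas_available == bool 0)[24h:1m]) '
    '* 24 * 3600'
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    deployment = sample["metric"].get("deployment", "<unknown>")
    downtime_seconds = float(sample["value"][1])
    print(f"{deployment}: ~{downtime_seconds:.0f}s of downtime in the last 24h")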
What is the simplest way to find out the availability of a K8s service over a period of time, let's say 24h? Should I target a pod, or find a way to calculate service reachability?
I'd recommend not approaching it from a binary (is it up or down) angle but from a "how long does it take to serve requests" perspective. In other words, phrase your availability in terms of SLOs. You can get very nice, automatically generated SLO-based alert rules from PromTools. One concrete example rule from there, showing the PromQL part:
1 - (
  sum(rate(http_request_duration_seconds_bucket{job="prometheus",le="0.10000000000000001",code!~"5.."}[30m]))
    /
  sum(rate(http_request_duration_seconds_count{job="prometheus"}[30m]))
)
The above captures the ratio of responses the service served without a server error (non-5xx, that is, assumed good responses) in less than 100ms to overall responses over the last 30 min, with http_request_duration_seconds being the histogram that captures the latency distribution of your service's requests.
I have been trying to get Druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it:
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is that once ingested data is marked unused, a kill task will fire and delete the segments that are older than 45 minutes while retaining 45 minutes' worth of data. period and durationToRetain are the config vars that are confusing me; I'm not quite sure how to leverage them. Any help would be appreciated.
The caveat for druid.coordinator.kill.on=true is that segments are only deleted from whitelisted datasources, and the whitelist is empty by default.
To populate the whitelist with all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from S3 (COS). This was tested on Druid version 0.18.1.
Now, while the above configuration properties can be set when you build your image, killAllDataSources needs to be set through an API. It can also be set via the Druid UI.
When you click the option, a modal appears that has Kill All Data Sources. Click on True and you should see a kill task (Ingestion ---> Tasks below) firing in the specified interval. It would be really nice to have this as part of runtime.properties or some sort of common configuration file that we could set the value in when building the Druid image.
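If you prefer scripting it over clicking through the UI, the flag can be flipped through the coordinator's dynamic-configuration endpoint. A minimal sketch, where the coordinator address is a placeholder and the rest of the dynamic config is read first so only the one flag changes:

# Hypothetical sketch: enable killAllDataSources via the coordinator's
# dynamic-config API. COORDINATOR is a placeholder address.
import requests

COORDINATOR = "http://druid-coordinator.example.com:8081"

# Fetch the current dynamic config so only the one flag is changed.
config = requests.get(f"{COORDINATOR}/druid/coordinator/v1/config").json()
config["killAllDataSources"] = True

resp = requests.post(f"{COORDINATOR}/druid/coordinator/v1/config", json=config)
resp.raise_for_status()
print("killAllDataSources enabled")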
Use crontab; it works quite well for us.
If you want control over segment removal from outside Druid, then you must use a scheduled task which runs at your desired interval and registers kill tasks in Druid. This increases your control over your segments, since once they are gone, you cannot recover them. You can use this script to help you:
https://github.com/mostafatalebi/druid-kill-task
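If you'd rather not depend on an external script, the scheduled task essentially just needs to POST a kill task to the Overlord. A minimal sketch, where the Overlord address, datasource name and interval are placeholders:

# Hypothetical sketch of what a cron-triggered kill-task submission could
# look like. OVERLORD, the datasource and the interval are placeholders.
import requests

OVERLORD = "http://druid-overlord.example.com:8090"

kill_task = {
    "type": "kill",
    "dataSource": "my_datasource",         # placeholder datasource
    "interval": "2020-01-01/2020-06-01",   # unused segments in this interval are deleted
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=kill_task)
resp.raise_for_status()
print("submitted kill task:", resp.json().get("task"))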
I have a queue which is constantly filled with work items by one of my k8s pods.
I want to use a k8s Job to process each work item, since I want each item to be handled by a new pod and I want to use multiple containers, as suggested here. But Jobs don't seem to support an infinite number of completions.
When I use spec.completions: null, I get BackoffLimitExceeded.
Any idea how to implement a Job without the need to specify the number of work items?
Is there an alternative to Jobs for implementing a background worker in k8s?
Thank you
My suggestion is to use Kubernetes resources the way they have been designed: Job resources are one-off tasks that can be triggered many times, but they differ from the background jobs you're eager to implement.
If your application is popping jobs from a queue/backend, it's better to put it in a Deployment with a for loop (YMMV according to programming language) and, if needed, scale it down with another component so you don't allocate unused resources.
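As a rough illustration (not a definitive implementation), such a worker loop could look like the sketch below; the Redis host, queue name and processing logic are placeholders for whatever your producer pod actually uses:

# Hypothetical worker loop meant to run inside a Deployment rather than a Job.
# The Redis host, queue name and process() body are placeholders.
import time
import redis

queue = redis.Redis(host="redis", port=6379)

def process(item: bytes) -> None:
    # Replace with the real work; this is only an illustration.
    print(f"processing {item!r}")

while True:
    item = queue.lpop("work-items")
    if item is None:
        time.sleep(1)  # queue empty: back off briefly instead of busy-looping
        continue
    process(item)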
Another solution could be to specialize each Job, using a UUID as the Job name and labelling them in order to group them; anyway, the first suggestion of using Kubernetes the Kubernetes way is the strongly recommended one.
For some integration tests we would like to have a way of ensuring that only one test at a time has access to certain resources (e.g. 3 DeploymentConfigurations).
For that to work we have the following workflow:
Before test is started - wait until all DCs are "undeployed".
When test is started - set DC replicas to 1.
When test is stopped - set DC replicas to 0.
This works to some degree, but obviously has the problem that if a test terminates unexpectedly, the DCs might still be in flight.
Now, one way to "solve" this would be to introduce a CR with a Controller, which handles the lifetime of the lock (the CR).
Is there any more elegant and straightforward way of allowing unique access to Kubernetes resources?
EDIT:
Sadly we are stuck with Kubernetes 1.9 for now.
Look at the 'kubectl wait' command to set different conditions between the steps of the test flow and, depending on the result, proceed to the next test step.
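For example, a test harness could shell out to kubectl between steps. A minimal sketch, assuming a kubectl new enough to ship the wait subcommand (it is not available in 1.9) and placeholder resource names and labels:

# Hypothetical sketch: gate integration-test steps on cluster state.
# Requires a kubectl with the "wait" subcommand; names/labels are placeholders.
import subprocess

def kubectl(*args: str) -> None:
    subprocess.run(["kubectl", *args], check=True)

# Before the test: wait until the previous test's pods are gone.
kubectl("wait", "--for=delete", "pod", "-l", "app=my-app", "--timeout=120s")

# Start the test: scale up and wait until the rollout is available.
kubectl("scale", "--replicas=1", "deployment/my-app")
kubectl("wait", "--for=condition=Available", "deployment/my-app", "--timeout=120s")

# After the test: scale back down.
kubectl("scale", "--replicas=0", "deployment/my-app")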
I was wondering what the potential problem would be if I reduced the --update-period (whose default value is 1m0s) to about 5s (or even 1s). I've watched a few video clips, and it seems the presenters implied that it's a bad idea to have a short period, but they did not explain why.
The reason why I want to make it shorter is that we sometimes prefer a fast and slightly risky transition rather than a safe and steady one. As far as I know, what rolling-update does is:
while the goal has not been achieved {
    scale-up the new version
    sleep as specified by --update-period
    scale-down the old one
    check deadline
}
From the above flow, I don't see any problem with not sleeping for long. Deadline checking is based on the timeout configuration, so it seems the only outcome of shortening --update-period would be iterating the loop more frequently.
One thing I have not fully understood is how scaling down is performed, but I assume that it still does graceful termination, such as sending SIGTERM and waiting up to 30s before finally sending SIGKILL to the processes in the pod.
FYI, I'm using the Google Container Engine.
It should not need to be long; the period is just a precaution in case a pod transitions to a Running state but crashes a couple of seconds later. If your update period is too short, you may keep deploying pods that eventually turn out to be unstable, without giving the whole process enough time to notice.
If you're willing to take the risk, it's totally fine to have a short update period.
Also, if you want truly fast and reliable deployments, you should check out the Deployment API. The rolling-update logic happens server-side, which increases reliability and speed.
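For illustration, a minimal sketch using the official Kubernetes Python client to patch a Deployment with a more aggressive rolling-update strategy; the name, namespace and strategy values below are placeholders, not recommendations:

# Hypothetical sketch: tune a Deployment's rolling-update behaviour server-side.
# Name, namespace and the strategy values are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "minReadySeconds": 5,  # roughly analogous to a short --update-period
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": "50%", "maxUnavailable": 0},
        },
    }
}

apps.patch_namespaced_deployment(name="my-app", namespace="default", body=patch)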