Is it possible to limit a Kubernetes Job to a maximum number of pods if they always fail, and how? - kubernetes

As a QA engineer in our company I am a daily user of Kubernetes, and we use Kubernetes Jobs to create performance test pods. One advantage of a Job, according to the docs, is
to create one Job object in order to reliably run one Pod to completion
But in our tests this feature keeps creating pods indefinitely if the previous ones fail, which occupies resources on our team's shared cluster, and deleting such pods takes a lot of time.
Currently the job manifest is like this:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "name": "upgradeperf",
    "namespace": "ntg6-grpc26-tts"
  },
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "upgradeperfjob",
            "image": "mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1",
            "command": [
              "python",
              "/jmeterwork/jmeter.py",
              "-gu",
              "git#gitlab-pri-eastus2.dev.mycompany.net:mobility-ncs-tools/tts-cdqa-tool.git",
              "-gb",
              "upgradeperf",
              "-t",
              "JMeter/testcases/ttssvc/JMeterTestPlan_ttssvc_cmpsize.jmx",
              "-JtestDataFile",
              "JMeter/testcases/ttssvc/testData/avaml_opus.csv",
              "-JthreadNum",
              "3",
              "-JthreadLoopCount",
              "1500",
              "-JresultsFile",
              "results_upgradeperf_cavaml_opus_t3_l1500.csv",
              "-Jhost",
              "mtl-blade32-03.mycompany.com",
              "-Jport",
              "28416"
            ]
          }
        ],
        "restartPolicy": "Never",
        "imagePullSecrets": [
          {
            "name": "docker-registry-secret"
          }
        ]
      }
    }
  }
}
In some cases, such as a misconfigured IP or port, 'reliably run one Pod to completion' is impossible, and recreating pods is a waste of time and resources.
So is it possible to limit a Kubernetes Job to creating a maximum number (say 3) of pods if they always fail, and if so, how?

Depending on your Kubernetes version, you can resolve this problem with one of these methods:
Set the option restartPolicy: OnFailure. The failed container is then restarted in the same Pod, so you will not get lots of failed Pods; instead you will get one Pod with lots of restarts.
From Kubernetes 1.8 on, there is a parameter backoffLimit that controls how often a failed Job is retried. It defines the number of retries before the Job is treated as failed; the default is 6. For this parameter to work as described here, set restartPolicy: Never.
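As a rough illustration of the second method, the following Python sketch builds a Job manifest with backoffLimit set to 3 and restartPolicy set to Never. The helper name bounded_job and the shortened command are assumptions made for this example, not part of the original manifest. Because backoffLimit counts retries, the number of Pods actually created may differ by one from the limit, depending on how the first attempt is counted.

```python
import json


def bounded_job(name, image, command, max_retries=3):
    """Return a batch/v1 Job manifest that stops retrying after max_retries failures.

    bounded_job is a hypothetical helper written for this example.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of retries before the Job is marked as failed (default: 6).
            "backoffLimit": max_retries,
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                    # With Never, every retry is a new Pod, so backoffLimit
                    # also bounds how many failed Pods pile up.
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = bounded_job(
    "upgradeperf",
    "mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1",
    ["python", "/jmeterwork/jmeter.py"],  # command shortened for the example
    max_retries=3,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl in the usual way.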

You probably didn't set restartPolicy: Never in your pod spec. Add that, and I would expect the behavior to match your expectations better.

Related

How to pass a flag to klog for structured logging

As part of Kubernetes 1.19, structured logging has been implemented.
I've read that Kubernetes' logging engine is klog and that structured logs follow this format:
<klog header> "<message>" <key1>="<value1>" <key2>="<value2>" ...
Cool! But even better, you can apparently pass a --logging-format=json flag to klog so that logs are generated in JSON directly!
{
  "ts": 1580306777.04728,
  "v": 4,
  "msg": "Pod status updated",
  "pod": {
    "name": "nginx-1",
    "namespace": "default"
  },
  "status": "ready"
}
Unfortunately, I haven't been able to find out how and where I should specify that --logging-format=json flag.
Is it a kubectl command? I'm using Azure's AKS.
--logging-format=json is a flag which needs to be set on all Kubernetes system components (Kubelet, API Server, Controller Manager and Scheduler). You can check all flags here.
Unfortunately, you can't do that right now with AKS, because the control plane is managed by Microsoft.

Create one liner (Imperative way) command in kubernetes

I am creating a pod with a one-liner (imperative) command in Kubernetes:
kubectl run test --image=ubuntu:latest --limits="cpu=200m,memory=512Mi" --requests="cpu=200m,memory=512Mi" --privileged=false
I also need to set a securityContext in the one-liner. Is that possible? Basically, I need to run the container with securityContext/runAsUser, not as the root account.
Yes, the declarative way works, but I'm looking for an imperative way.
Posting this answer as a community wiki to highlight the fact that the solution was posted in the comments (a link to another answer):
Hi, check this answer: stackoverflow.com/a/37621761/5747959 you can solve this with --overrides – CLNRMN 2 days ago
Feel free to edit/expand.
Citing $ kubectl run --help:
--overrides='': An inline JSON override for the generated object. If this is non-empty, it is used to override the generated object. Requires that the object supply a valid apiVersion field.
Following on from that, here is an --overrides example that includes additional fields and is more specific to this particular question (securityContext-wise):
kubectl run -it ubuntu --rm --overrides='
{
  "apiVersion": "v1",
  "spec": {
    "securityContext": {
      "runAsNonRoot": true,
      "runAsUser": 1000,
      "runAsGroup": 1000,
      "fsGroup": 1000
    },
    "containers": [
      {
        "name": "ubuntu",
        "image": "ubuntu",
        "stdin": true,
        "stdinOnce": true,
        "tty": true,
        "securityContext": {
          "allowPrivilegeEscalation": false
        }
      }
    ]
  }
}
' --image=ubuntu --restart=Never -- bash
With the above override, you use a securityContext to constrain your workload.
Side notes!
The example above is specific to running a Pod that you will exec into (bash)
The --overrides will override the other specified parameters outside of it (for example: image)
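Writing the inline JSON by hand is error-prone, so here is a minimal Python sketch that generates the --overrides value and prints a shell-quoted kubectl command. The helper name run_as_user_overrides is hypothetical, and the pod name test and image ubuntu:latest are taken from the question. The script only prints the command; it does not call kubectl.

```python
import json
import shlex


def run_as_user_overrides(container, image, uid=1000):
    """Build the --overrides object for running a Pod as a non-root user.

    run_as_user_overrides is a hypothetical helper written for this example.
    """
    return {
        "apiVersion": "v1",
        "spec": {
            # Pod-level security context: refuse root and pick the UID.
            "securityContext": {"runAsNonRoot": True, "runAsUser": uid},
            "containers": [
                {
                    "name": container,
                    "image": image,
                    "securityContext": {"allowPrivilegeEscalation": False},
                }
            ],
        },
    }


overrides = run_as_user_overrides("test", "ubuntu:latest")
argv = [
    "kubectl", "run", "test",
    "--image=ubuntu:latest",
    "--restart=Never",
    "--overrides=" + json.dumps(overrides),
]
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Generating the JSON with json.dumps avoids quoting and trailing-comma mistakes that are easy to make inside a single-quoted shell string.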
Additional resources:
Kubernetes.io: Docs: Tasks: Configure pod container: Security context
Kubernetes.io: Docs: Concepts: Security: Pod security standards

Triggering alerts on Prometheus dashboard

Is it possible to trigger some alerts on the Prometheus dashboard by manually stopping the respective services on the Kubernetes cluster, in order to verify that I'm receiving alerts for issues on the Prometheus dashboard?
I would recommend using tools such as chaos toolkit to do this declaratively and automatically instead of doing it manually. This is called chaos engineering more generally.
{
  "title": "Do we remain available in face of pod going down?",
  "description": "We expect Kubernetes to handle the situation gracefully when a pod goes down",
  "tags": ["kubernetes"],
  "steady-state-hypothesis": {
    "title": "Verifying service remains healthy",
    "probes": [
      {
        "name": "all-our-microservices-should-be-healthy",
        "type": "probe",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaosk8s.probes",
          "func": "microservice_available_and_healthy",
          "arguments": {
            "name": "myapp"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-db-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=my-app",
          "name_pattern": "my-app-[0-9]$",
          "rand": true
        }
      },
      "pauses": {
        "after": 5
      }
    }
  ]
}
You can use Gremlin to achieve this goal too. First, install the Gremlin agent on your Kubernetes cluster using the Helm chart: https://github.com/gremlin/helm/
Next, shut down the specific services using the Kubernetes features within Gremlin. You can control the blast radius by selecting 1 pod/1 service or many pods/services. This is a tutorial that I wrote on this topic: https://www.gremlin.com/community/tutorials/how-to-install-and-use-gremlin-with-kubernetes/.
Validating monitoring and alerting is a great use case for Chaos Engineering. As you said, you can trigger alerts on the Prometheus dashboard by stopping the respective services on the Kubernetes cluster, which lets you verify that alerts fire for issues on your Prometheus dashboard. This tutorial explains how to use Gremlin webhooks with Grafana and Prometheus: https://www.gremlin.com/community/tutorials/visualize-chaos-experiments-in-grafana-with-gremlin-webhooks/
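Whichever tool stops the service, the verification step comes down to checking which alerts are firing afterwards. The Python sketch below filters firing alerts out of a payload shaped like a response from the Prometheus alerts HTTP endpoint (/api/v1/alerts). The sample payload and the alert names MyAppDown and HighLatency are invented for illustration, and fetching the payload from a real server is deliberately left out.

```python
import json


def firing_alert_names(payload):
    """Return the sorted, de-duplicated alertname labels of firing alerts."""
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        {
            alert.get("labels", {}).get("alertname", "")
            for alert in alerts
            if alert.get("state") == "firing"
        }
    )


# Hypothetical sample; a real payload would come from the Prometheus server.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "MyAppDown", "severity": "critical"},
       "state": "firing"},
      {"labels": {"alertname": "HighLatency", "severity": "warning"},
       "state": "pending"}
    ]
  }
}
""")

print(firing_alert_names(sample))
```

Alerts in the pending state are excluded on purpose: they have matched their rule but have not yet been active long enough to fire.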

Automatically schedule future deployment in Octopus

Update: I found that executing scripts on the Octopus server is now available in version 3.3. I haven't updated my Octopus yet, but I assume it will work as designed. I'm still wondering whether there is a better way to do this without octo.exe.
The task I'm trying to accomplish is, after each successful production deployment, to automatically schedule a DR deployment to happen in the next 24 hours.
My desired approach is to have Octopus do it.
I added a new Octopus step at the end of the deployment that only runs upon success of the previous step. In the newly created step I attempted to use octo deploy-release --deployAt.
My challenge is that a script step requires me to pick a target role, which means it will be executed on a Tentacle. Also, the presence of Octo.exe is required.
I tried to create my own Octopus step template, but a deployment target role is still required in my customized step.
{
  "Id": "ActionTemplates-2",
  "Name": "Octopus - Schedule Deployment",
  "Description": "Schedule a future octopus deployment",
  "ActionType": "Octopus.Script",
  "Version": 3,
  "Properties": {
    "Octopus.Action.Script.Syntax": "PowerShell",
    "Octopus.Action.Script.ScriptBody": "--hide--"
  },
  "SensitiveProperties": {},
  "Parameters": [
    {
      "Name": "OctoPath",
      "Label": "Path for Octo.exe",
      "HelpText": "Location for octo.exe",
      "DefaultValue": null,
      "DisplaySettings": {
        "Octopus.ControlType": "SingleLineText"
      }
    },
    {
      "Name": "projName",
      "Label": "Project Name",
      "HelpText": "The name of the project should be deployed",
      "DefaultValue": null,
      "DisplaySettings": {
        "Octopus.ControlType": "SingleLineText"
      }
    },
    {
      "Name": "days",
      "Label": "Days",
      "HelpText": "The days in future this deployment would happen",
      "DefaultValue": null,
      "DisplaySettings": {
        "Octopus.ControlType": "SingleLineText"
      }
    },
    {
      "Name": "hours",
      "Label": "Hours",
      "HelpText": "The hours in future this deployment would happen",
      "DefaultValue": null,
      "DisplaySettings": {
        "Octopus.ControlType": "SingleLineText"
      }
    },
    {
      "Name": "env",
      "Label": "Environment to deploy",
      "HelpText": "The environment next deployment should happen",
      "DefaultValue": null,
      "DisplaySettings": {
        "Octopus.ControlType": "SingleLineText"
      }
    }
  ],
  "$Meta": {
    "ExportedAt": "2016-04-20T13:58:54.263Z",
    "OctopusVersion": "3.2.0",
    "Type": "ActionTemplate"
  }
}
Is there a way to alter the template to get rid of the role selection and have the Octopus server execute it directly, as it does for the Azure script step?
Is there any other way to have the Octopus server schedule the deployment automatically without external help? I guess this goes back to the first problem: I may still need Octopus to run something on the server side.
Note: We kick off production deployments manually, so I don't have another tool waiting for the response of the deployment. I think it is possible to have a process regularly look up the last deployment, do some analysis, and then schedule a new deployment accordingly, but this is not as clean as having Octopus do it directly. Injecting octo.exe onto a random production machine is not desired at all.
You could create a new WebAPI project in C#, pull in the Octopus.Deploy NuGet package, and write code that accepts HTTP requests and deals with the scheduling logic.
Host that project on the same server as the Octopus server itself. It should be a 20-30 minute job to set the website up in IIS.
In your deployment process, add a step that sends an HTTP request, and you are done. You could go one step further and have the site/service listen for every successful deployment and make decisions based on that, so that other projects don't have to add extra steps to their Octopus deployment process.
As you said, polling is also a viable option.
Alternatively, if you're on Octopus Deploy 3.0, it already exposes a REST API. I am not sure whether it is powerful enough to let you create a scheduled deployment, but you could explore that: https://github.com/OctopusDeploy/OctopusDeploy-Api/wiki/Releases
I agree that having octo.exe floating around on production servers is a bad idea. It might get out of sync, and your production server shouldn't have to deal with this.
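Wherever the scheduling logic ends up running, the core of it is small: take the time of the successful deployment, add the configured days and hours, and pass the result as the --deployAt value. The Python sketch below shows only that calculation. The helper name schedule_args, the project name MyProject, the environment name DR, the timestamp format, and every flag other than --deployAt are assumptions for illustration and should be checked against the octo.exe help output; server address and API key arguments are omitted.

```python
from datetime import datetime, timedelta


def schedule_args(project, environment, now, days=0, hours=24):
    """Build an octo deploy-release argument list scheduled days/hours after now.

    schedule_args is a hypothetical helper written for this example.
    """
    deploy_at = now + timedelta(days=days, hours=hours)
    return [
        "octo", "deploy-release",
        "--project=" + project,        # assumed flag name
        "--deployto=" + environment,   # assumed flag name
        "--version=latest",            # assumed flag name
        # --deployAt is the flag mentioned in the question; the timestamp
        # format used here is an assumption.
        "--deployAt=" + deploy_at.strftime("%Y-%m-%d %H:%M"),
    ]


# Fixed "now" so the example is reproducible; real code would use the
# completion time of the production deployment.
args = schedule_args("MyProject", "DR", datetime(2016, 4, 20, 14, 0))
print(" ".join(args))
```

The days and hours arguments mirror the days and hours parameters of the step template in the question.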

pod is not showing in ready state

I am trying to set up the PHP Phabricator example from Kubernetes, but after creating the replication controller, the pod never reaches the ready state. It stays in the state below:
NAME                           READY   STATUS             RESTARTS   AGE
phabricator-controller-z0nk3   0/1     CrashLoopBackOff   5          2m
Below is the controller manifest:
{
  "kind": "ReplicationController",
  "apiVersion": "v1",
  "metadata": {
    "name": "phabricator-controller",
    "labels": {
      "name": "phabricator"
    }
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "name": "phabricator"
    },
    "template": {
      "metadata": {
        "labels": {
          "name": "phabricator"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "phabricator",
            "image": "fgrzadkowski/example-php-phabricator",
            "ports": [
              {
                "name": "http-server",
                "containerPort": 80
              }
            ]
          }
        ]
      }
    }
  }
}
Can someone please suggest how to fix this?
This Pod is crash-looping. You can tell from the CrashLoopBackOff status and because the number of restarts is greater than zero.
kubectl describe pods <pod-name>
Should give further details to help debug. As will
kubectl logs <pod-name>
Tracking issues with kubectl describe pods <pod-name> and kubectl logs <pod-name> is indeed the default approach. Unfortunately, in my case it WASN'T helpful (at first). All the logs looked fine, or at least gave no error or clue that something was going wrong.
The readiness and liveness probes, however, showed that the app was not passing them...
So where was the devil hiding? In my case, increasing the values of "initialDelaySeconds" and/or "timeoutSeconds" for the readiness and liveness probes did the trick.
My first assumption was that the app did not have enough time to reach the Ready status. However, the app was still not ready and was in fact failing. BUT extending those values made each deployment attempt last longer, so I was able to see more logs. And what did I get? "Database connection failed attempt due to the timeout". So there was no connection to the database, and the app was in fact dead. The tricky part is that the timeouts do not appear quickly and you need to wait a bit longer; the default values of "initialDelaySeconds" and/or "timeoutSeconds" did not give me enough time to see the database connectivity timeout.
Once a firewall rule was set to allow the app to talk to the database, the issue was gone!
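To make the probe change concrete, here is a minimal Python sketch that adds readiness and liveness probes with longer initialDelaySeconds and timeoutSeconds to a container spec. The helper name with_slower_probes, the probe path /, and the values 60 and 10 are hypothetical choices for this example; the container is the one from the question, which defines no probes of its own.

```python
import copy
import json


def with_slower_probes(container, path="/", port=80, initial_delay=60, timeout=10):
    """Return a copy of a container spec with more patient HTTP probes.

    with_slower_probes is a hypothetical helper written for this example.
    """
    probe = {
        "httpGet": {"path": path, "port": port},
        # Wait this long after container start before the first probe.
        "initialDelaySeconds": initial_delay,
        # Allow each probe this long to answer before counting it as failed.
        "timeoutSeconds": timeout,
    }
    patched = copy.deepcopy(container)  # leave the caller's spec untouched
    patched["readinessProbe"] = copy.deepcopy(probe)
    patched["livenessProbe"] = copy.deepcopy(probe)
    return patched


container = {
    "name": "phabricator",
    "image": "fgrzadkowski/example-php-phabricator",
    "ports": [{"name": "http-server", "containerPort": 80}],
}
patched = with_slower_probes(container)
print(json.dumps(patched, indent=2))
```

As described above, longer probe timings do not fix a broken app; they only keep the container alive long enough for the real error to show up in the logs.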