How to get 1 initContainer when parallelism is set > 1 - kubernetes

I have a job in Kubernetes as defined below (with some omissions for brevity). Parallelism and N completions are set. I want 1 init container to delay the start of the parallel containers. When I run it as-is, it appears I get one init container per completion.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 10
  completions: 10
  backoffLimit: 0
  template:
    spec:
      # I want 1 of these to run first
      initContainers:
      - name: init-container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["init_script.sh"]
      # I want 10 of these to run in parallel once init_script.sh exits
      containers:
      - name: container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["run_job.sh"]
      restartPolicy: Never

Kubernetes Jobs are very basic and don't provide any advanced scheduling mechanism with dependencies between workloads, such as the ability to start a job after another job has finished. My advice is to use a more advanced scheduling tool on top of Kubernetes. There are a few open-source options you can use, for example:
Volcano - For example, you can use the TaskCompleted event.
Airflow - You can schedule pods directly from the tool and build your DAG (Directed Acyclic Graph).
Etc.
Two other alternatives:
You can even create your own custom scheduler.
You can build your own operator that kicks off your regular jobs based on status.
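If none of those tools is an option, one way to approximate "one init step, then N parallel workers" is to move the init step into its own Job and create the parallel Job only after the first one has completed, for example by running `kubectl wait --for=condition=complete job/my-init-job` between the two `kubectl apply` calls. The following is only a sketch of that idea, not something the tools above require. The name `my-init-job` is made up for illustration; images and scripts are taken from the question.

```yaml
# Step 1: the one-off init work as its own Job (hypothetical name).
apiVersion: batch/v1
kind: Job
metadata:
  name: my-init-job
spec:
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: init-container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["init_script.sh"]
      restartPolicy: Never
---
# Step 2: keep this in a separate file and create it only after
# my-init-job has completed. No initContainers are needed here.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  parallelism: 10
  completions: 10
  backoffLimit: 0
  template:
    spec:
      containers:
      - name: container
        image: my/container:latest
        imagePullPolicy: Always
        command: ["run_job.sh"]
      restartPolicy: Never
```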
✌️

Related

Horizontal Pod Autoscaling (HPA) with an initContainer that requires a Job

I have a specific scenario where I'd like to have a deployment controlled by horizontal pod autoscaling. To handle database migrations in pods when pushing a new deployment, I followed this excellent tutorial by Andrew Lock here.
In short, you must define an initContainer that waits for a Kubernetes Job to complete a process (like running db migrations) before the new pods can run.
This works well. However, I'm not sure how to handle HPA after the initial deployment: if the system detects the need to add another Pod, the initContainer defined in my Deployment requires the Job to be deployed and run, but since Jobs are one-off processes, the new Pod cannot initialize and run properly (a ttlSecondsAfterFinished attribute removes the Job anyway).
How can I define an initContainer to run when I deploy my app so I can push my database migrations in a Job, but also allow HPA to control dynamically adding a Pod without needing an initContainer?
Here's what my deployment looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: graphql-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: graphql-pod
  template:
    metadata:
      labels:
        app: graphql-pod
    spec:
      initContainers:
      - name: wait-for-graphql-migration-job
        image: groundnuty/k8s-wait-for:v1.4 # This is an image that waits for a process to complete
        args:
        - job
        - graphql-migration-job # this job is defined next
      containers:
      - name: graphql-container
        image: image(graphql):tag(graphql)
The following Job is also deployed
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
spec:
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: graphql-migration-container
        image: image(graphql):tag(graphql)
        command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
So basically what happens is:
I deploy these two resources to my node
Job is initialized
initContainer on Pod waits for Job to complete using an image called groundnuty/k8s-wait-for:v1.4
Job completes
initContainer completes
Pod initializes
(after 30 TTL seconds) Job is removed from node
(LOTS OF TRAFFIC)
HPA realizes a need for another pod
initContainer for NEW pod is started, but can't run because Job doesn't exist
...crashLoopBackOff
Would love any insight on the proper way to handle this scenario!
There is, unfortunately, no simple Kubernetes feature to resolve your issue.
I recommend extending your deployment tooling/scripts to separate the migration job and your deployment. During the deploy process, you first execute the migration job and then deploy your deployment. Without the job attached, the HPA can nicely scale your pods.
There is a multitude of ways to achieve this:
Have a script (bash, etc.) first execute the job, wait for it to finish, and then update your deployment
Leverage more complex deployment tooling like Helm, which allows you to add a 'pre-install hook' to your job so that it is executed when you deploy your application
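As a sketch of the Helm option, the migration Job from the question could carry hook annotations so that Helm runs it to completion before installing the rest of the chart. The file name and the extra `pre-upgrade` hook are assumptions added here so that migrations also run on later releases. The image value is the placeholder from the question. The Deployment would then drop its `wait-for-graphql-migration-job` initContainer, so pods added by the HPA no longer depend on the Job.

```yaml
# templates/migration-job.yaml inside a Helm chart (hypothetical file name)
apiVersion: batch/v1
kind: Job
metadata:
  name: graphql-migration-job
  annotations:
    # Run before the chart's regular resources are installed or upgraded.
    "helm.sh/hook": pre-install,pre-upgrade
    # Remove the Job once it has succeeded, so the next release can recreate it.
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      containers:
      - name: graphql-migration-container
        image: image(graphql):tag(graphql)  # placeholder, as in the question
        command: ["npm", "run", "migrate:reset"]
      restartPolicy: Never
```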

How to run a container only once that completes after ~10 minutes of execution via a deployment in Kubernetes

I have just started with Kubernetes.
I need to run a Deployment in Kubernetes with a container that completes execution after ~10-15 minutes.
When I tried, "restartPolicy: Never" is not accepted for Deployments.
My reason for opting for a Deployment is to use replicas.
Please advise how I can run multiple replicas of a container that completes execution instead of running forever.
You can run a Job as below where the container runs a sleep command for 15m. After 15 minutes the container will exit and pod will be terminated.
apiVersion: batch/v1
kind: Job
metadata:
  name: job
spec:
  template:
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep 15m
        image: bash:5.1.0
      restartPolicy: Never
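To get the "replicas" behaviour the question asks about, the same Job can set `parallelism` and `completions`. This is a minimal sketch; the count of 3, the Job name, and the container name are arbitrary choices for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-replicas   # hypothetical name
spec:
  parallelism: 3   # run up to 3 pods at the same time
  completions: 3   # the Job is done after 3 pods have succeeded
  template:
    spec:
      containers:
      - name: worker
        command:
        - sh
        - -c
        - sleep 15m
        image: bash:5.1.0
      restartPolicy: Never
```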

How to specify two pods in the same job in a kubernetes yaml file?

I'm trying to do something simple, just create two pods within a job. I'm looking at the documentation here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#single-job-starts-controller-pod
While the documentation discusses parallelization it doesn't give much in the way of examples. The only example with one pod is given as:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
To create two pods I attempted effectively this:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi1
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
    spec:
      containers:
      - name: pi2
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
But that didn't get me two pods, instead it appears that only the second container ran in one pod.
It's not clear to me how I get my job to launch multiple pods. My pods don't need to run on the same machine, but they each need a unique environment variable to be set so the pods know what part of the work to do. The work is divided in an embarrassingly parallel way and with a fixed number of pods (2 in this example).
You cannot have two different pod templates in the same Job. From the docs here:
Parallel Jobs with a work queue:
do not specify .spec.completions, default to .spec.parallelism.
the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are created.
once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
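For the fixed-size case in the question (2 pods, each needing a unique value), an alternative to the work-queue pattern quoted above is an Indexed Job. This is a different Job mode, offered here as a sketch: with `completionMode: Indexed`, Kubernetes gives each pod a completion index (0 and 1 here) and exposes it in the `JOB_COMPLETION_INDEX` environment variable. The Job name, image, and echo command are illustrative only.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-indexed   # hypothetical name
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed
  backoffLimit: 4
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        # JOB_COMPLETION_INDEX is set by Kubernetes for Indexed Jobs;
        # the real workload would use it to pick its share of the work.
        command: ["sh", "-c", "echo processing part $JOB_COMPLETION_INDEX"]
      restartPolicy: Never
```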

Run containers which intentionally exit periodically

How can I have Kubernetes automatically restart a container which purposefully exits in order to get new data from environment variables?
I have a container running on a Kubernetes cluster which operates in the following fashion:
Container starts, polls for work
If it receives a task, it does some work
It polls for work again, until ...
.. the container has been running for over a certain period of time, after which it exits instead of polling for more work.
It needs to be continually restarted, as it uses environment variables which are populated by Kubernetes secrets which are periodically refreshed by another process.
I've tried a Deployment, but it doesn't seem like the right fit as I get CrashLoopBackOff status, which means the worker is scheduled less and less often.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fonky-worker
  labels:
    app: my-fonky-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-fonky-worker
  template:
    metadata:
      labels:
        app: my-fonky-worker
    spec:
      containers:
      - name: my-fonky-worker-container
        image: my-fonky-worker:latest
        env:
        - name: NOTSOSECRETSTUFF
          value: cats_are_great
        - name: SECRETSTUFF
          valueFrom:
            secretKeyRef:
              name: secret-name
              key: secret-key
I've also tried a CronJob, but that seems a bit hacky as it could mean that the container is left in the stopped state for several seconds.
As @Josh said, you need to exit with exit code 0, otherwise it will be treated as a failed container! Here is the reference. According to the first example there: "Pod is running and has one Container. Container exits with success." If your restartPolicy is set to Always (which is the default, by the way), the container will be restarted. The Pod status still shows Running, but if you look at the pod's logs you can see the container restarting.
It needs to be continually restarted, as it uses environment variables which are populated by Kubernetes secrets which are periodically refreshed by another process.
I would take a different approach to this. I would mount the ConfigMap as a volume, as explained here; this automatically refreshes the mounted ConfigMap data (Ref). Note: the refresh delay is the "kubelet sync period (1 minute by default) + ttl of ConfigMaps cache (1 minute by default) in kubelet", so take that into account when managing the refresh rate of ConfigMap data in the Pod.
What I see as a solution for this would be to run your container as a CronJob, but don't use startingDeadlineSeconds as your container killer.
It runs on its schedule.
In your container you can have it poll for work N times.
After N times it exits 0.
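A CronJob along those lines might look like the sketch below. The five-minute schedule and `concurrencyPolicy: Forbid` are assumptions, not values from the answer. Each run starts a fresh pod, so the secret-backed environment variable is read again at every start.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-fonky-worker
spec:
  schedule: "*/5 * * * *"     # assumed interval
  concurrencyPolicy: Forbid   # skip a run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: my-fonky-worker-container
            image: my-fonky-worker:latest
            env:
            - name: SECRETSTUFF
              valueFrom:
                secretKeyRef:
                  name: secret-name
                  key: secret-key
          # The worker polls N times and then exits 0, ending this run.
          restartPolicy: Never
```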
If I understood correctly in your example there are 2 problems:
Restarting container
Updating secret values
In order to keep your secrets up to date, you should consider using secrets as described in Amit Kumar Gupta's comment and mount them as a volume instead of as environment variables; here is an example.
As for the second problem, restarting the container, it depends on the exit code, as described by garlicFrancium.
From another point of view, you can use an init container that waits for new tasks and a main container that processes these tasks according to your requirements, or create a job scheduler.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 1
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      restartPolicy: Always
Please note:
When a secret being already consumed in a volume is updated, projected keys are eventually updated as well. Kubelet is checking whether the mounted secret is fresh on every periodic sync. However, it is using its local cache for getting the current value of the Secret.
The type of the cache is configurable using the ConfigMapAndSecretChangeDetectionStrategy field in the KubeletConfiguration struct. It can be either propagated via watch (default), ttl-based, or simply redirecting all requests directly to kube-apiserver. As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as kubelet sync period + cache propagation delay, where cache propagation delay depends on the chosen cache type (it equals watch propagation delay, ttl of cache, or zero correspondingly).
A container using a Secret as a subPath volume mount will not receive Secret updates.
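For illustration, the worker Deployment from the question could mount the Secret as a volume roughly as follows. The mount path `/etc/secrets` and the volume name are hypothetical. The key `secret-key` would then appear as the file `/etc/secrets/secret-key`, and the application has to re-read that file to pick up refreshed values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fonky-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-fonky-worker
  template:
    metadata:
      labels:
        app: my-fonky-worker
    spec:
      containers:
      - name: my-fonky-worker-container
        image: my-fonky-worker:latest
        volumeMounts:
        - name: secret-volume        # hypothetical volume name
          mountPath: /etc/secrets    # hypothetical path; no subPath, so updates propagate
          readOnly: true
      volumes:
      - name: secret-volume
        secret:
          secretName: secret-name
```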
Please refer also to:
Fine Parallel Processing Using a Work Queue

Is there a way to downscale pods only when message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics using the prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring rabbitmq, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.
The scale-up and scale-down is working based on the number of messages in the queue.
The problem:
When a pod finishes processing and acks its message, the number of messages in the queue drops, and that can trigger the Autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, then, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that pod has done would be lost.
Is there a way to overcome this, or another way this could be achieved?
Here is the Autoscaler configuration:
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # you created above
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: rabbitmq-cluster
      metricName: rabbitmq_queue_messages_ready
      targetValue: 5
You could consider an approach using a preStop hook.
As per the documentation (Container States; Define postStart and preStop handlers):
Before a container enters into Terminated, preStop hook (if any) is executed.
So you can use in your deployment:
lifecycle:
  preStop:
    exec:
      command: ["your script"]
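In context, the hook sits under the container definition. The sketch below also raises `terminationGracePeriodSeconds`, because the preStop hook has to finish within the pod's termination grace period; once that period expires the container is killed regardless. The image, script path, labels, and the four-hour value are all assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      # Must cover the longest expected task; assumed here to be 4 hours.
      terminationGracePeriodSeconds: 14400
      containers:
      - name: worker
        image: sample-app:latest   # hypothetical image
        lifecycle:
          preStop:
            exec:
              # Hypothetical script that blocks until the current message is acked.
              command: ["/bin/sh", "-c", "/app/wait-until-idle.sh"]
```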
Update:
I would like to provide more information after some research:
There is an interesting project:
KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition.
KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.
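As a rough sketch of how KEDA could fit this case, its ScaledJob resource starts a Job per unit of queued work instead of scaling a Deployment, so a running worker is not picked for scale-down. The field names below follow the KEDA 2.x ScaledJob and RabbitMQ scaler documentation as I understand it and should be checked against the installed version. The queue name, image, Secret, and environment variable are hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: rabbitmq-consumer       # hypothetical name
  namespace: monitoring
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: worker
          image: sample-app:latest   # hypothetical image
          env:
          - name: RABBITMQ_URL       # hypothetical variable holding the AMQP connection string
            valueFrom:
              secretKeyRef:
                name: rabbitmq-credentials   # hypothetical Secret
                key: url
        restartPolicy: Never
  pollingInterval: 30    # seconds between queue checks
  maxReplicaCount: 10    # upper bound on concurrently running Jobs
  triggers:
  - type: rabbitmq
    metadata:
      queueName: tasks           # hypothetical queue
      mode: QueueLength
      value: "1"                 # aim for one Job per queued message
      hostFromEnv: RABBITMQ_URL
```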
For the main question "Kubernetes could terminate a pod that is still doing the processing of its own message".
As per documentation:
"Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"
A Deployment is backed by a ReplicaSet. As per this controller code, there is a function "getPodsToDelete". In combination with "filteredPods" it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."
So as proof of concept:
You can create a deployment with an init container. The init container should check whether there is a message in the queue and exit when at least one message appears. This will allow the main container to start, then take and process that message. In this case we will have two kinds of pods: those which process a message and consume CPU, and those which are in the starting state, idle and waiting for the next message. The starting pods will then be deleted first when the HPA decides to decrease the number of replicas in the deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1)&& echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
Hope this helps.
The Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:
Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
If you don't want to introduce another component such as Celery, you can spawn a Kubernetes job for every incoming message by yourself. Kubernetes will make sure that the job runs to completion at least once - reschedule the pod if it dies, etc. In this case you will need to write a script that reads RabbitMQ messages and creates jobs for them by yourself.
In both cases, make sure you also have Cluster Autoscaler enabled so that new nodes get automatically provisioned if your current nodes are not sufficient to handle the load.
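For the job-per-message approach, the dispatcher script could create Jobs from a template like the one below. This is only a sketch: the image, the `MESSAGE_ID` variable, and the limits are hypothetical, and the script that reads RabbitMQ and fills in the message reference is not shown.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  # generateName gives every Job a unique suffix; it works with
  # `kubectl create -f`, not with `kubectl apply`.
  generateName: process-message-
spec:
  backoffLimit: 3                 # assumed retry limit
  ttlSecondsAfterFinished: 3600   # clean up finished Jobs after an hour
  template:
    spec:
      containers:
      - name: worker
        image: sample-app:latest  # hypothetical image
        env:
        - name: MESSAGE_ID        # hypothetical: filled in by the dispatcher script
          value: "REPLACE_ME"
      restartPolicy: OnFailure
```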