What is the relationship between scheduling policies and scheduling configuration in Kubernetes?

The k8s scheduling implementation comes in two forms: Scheduling Policies and Scheduling Profiles.
What is the relationship between the two? They seem to overlap but have some differences. For example, there is a NodeUnschedulable plugin in the profiles but not in the policies, while CheckNodePIDPressure is in the policies but not in the profiles.
In addition, the scheduling configuration applies defaults at startup, but these are not spelled out in the scheduling policy. How can I find out what the default startup policy is?
I really appreciate any help with this.

The difference is simple: Kubernetes doesn't support Scheduling Policies from v1.19 onward. Instead, Kubernetes v1.19 supports configuring multiple Scheduling Profiles with a single scheduler. We use this to define a bin-packing scheduling policy in all our v1.19 clusters by default.
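For reference, here is a minimal sketch of what such a profile could look like in a v1.19 scheduler configuration (the profile name and plugin choices are illustrative, not necessarily what your cluster runs): it disables the default least-allocated scoring and enables most-allocated scoring, which is what produces the bin-packing behaviour.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler      # keeps the default spreading behaviour
- schedulerName: bin-packing-scheduler  # selected via schedulerName in the Pod spec
  plugins:
    score:
      disabled:
      - name: NodeResourcesLeastAllocated
      enabled:
      - name: NodeResourcesMostAllocated
This file is passed to kube-scheduler via its --config flag.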
To use that scheduling policy, all that is required is to specify the scheduler name bin-packing-scheduler in the Pod spec. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      schedulerName: bin-packing-scheduler
      containers:
      - name: nginx
        image: nginx:1.17.8
        resources:
          requests:
            cpu: 200m
The pods of this deployment will be scheduled onto the nodes that already have the highest resource utilisation, which optimises for autoscaling and ensures efficient pod placement when mixing large and small pods in the same cluster.
If a scheduler name is not specified then the default spreading algorithm will be used to distribute pods across all nodes.

Related

AKS with Keda: pods are removed during execution

I tried Keda with AKS and I really appreciate how pods are automatically instantiated based on the Azure DevOps job queue for releases and builds.
However, I noticed something strange: AKS/Keda often removes a pod while it is still processing, which makes the workflow fail.
The message reads:
We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Expected behavior: a pod must complete its tasks before Keda/AKS removes it.
I share with you my Keda YAML file:
# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
  name: aks-linux # This will be the name of the deployment
spec:
  selector: # Define the wrapping strategy
    matchLabels: # Match all pods with the defined labels
      app: aks-linux # Labels follow the `name: value` template
  replicas: 1
  template: # This is the template of the pod inside the deployment
    metadata: # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers: # Here we define all containers
      - image: <My image here>
        name: aks-linux
        env:
        - name: "AZP_URL"
          value: "<myURL>"
        - name: "AZP_TOKEN"
          value: "<MyToken>"
        - name: "AZP_POOL"
          value: "<MyPool>"
        resources:
          requests: # Minimum amount of resources requested
            cpu: 2
            memory: 4096Mi
          limits: # Maximum amount of resources requested
            cpu: 4
            memory: 8192Mi
I used the latest versions of AKS and Keda. Any idea?
Check the official Keda docs:
When running your agents as a deployment you have no control on which pod gets killed when scaling down.
So, to solve it you need to use ScaledJob:
If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.
See the docs there for how to implement it.
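For the scenario in the question, a ScaledJob might look roughly like this (a sketch based on the KEDA azure-pipelines scaler docs; the pool/URL/token values are the same placeholders as in the deployment above):
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux
spec:
  triggers:
  - type: azure-pipelines
    metadata:
      poolName: "<MyPool>"
      organizationURLFromEnv: "AZP_URL"       # read from the container env below
      personalAccessTokenFromEnv: "AZP_TOKEN"
  jobTargetRef:
    template:
      spec:
        nodeSelector:
          agentpool: linux
        restartPolicy: Never                  # each agent runs one job, then exits
        containers:
        - name: aks-linux
          image: <My image here>
          env:
          - name: "AZP_URL"
            value: "<myURL>"
          - name: "AZP_TOKEN"
            value: "<MyToken>"
          - name: "AZP_POOL"
            value: "<MyPool>"
This way each pipeline job gets its own pod, and scaling down no longer kills a busy agent.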

Kubernetes scheduling: control node preference and which pods are removed first

Context
I have a Kubernetes cluster with 2 pools:
the first one, bare-metal, is composed of a finite number of bare metal instances. Every node in this pool has a label node-type: bare-metal
the second one, vms, is composed of virtual machines. This is used if the cluster needs to autoscale, since provisioning VMs is faster than provisioning a new bare metal instance. Every node in this pool has a label node-type: vms
For performance reasons, I would like all pods to run on a bare-metal node when possible. If that is not possible, and only if it is not possible (not enough memory, too many existing pods), it is OK for me to run pods on a vms node. I have 2 questions (tightly related) about this.
About preferredDuringSchedulingIgnoredDuringExecution behavior
To schedule pods preferentially on bare-metal, I've used preferredDuringSchedulingIgnoredDuringExecution. Here is an example with a simple nginx deployment with 90 replicas (the bare-metal pool could hold all of them):
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 90
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            cpu: 50m
            memory: 50M
          limits:
            cpu: 50m
            memory: 50M
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-type
                operator: In
                values:
                - bare-metal
This works fine, even if I don't really understand why, out of the 90 pods, 3 are scheduled on a node of the vms pool (even though scheduling them on the bare-metal pool is still possible). Well, I understand that this is preferred rather than required, but I'm wondering if this "preference" can be tuned.
Question: Can I configure my deployment to say "preferred only if there is no other solution"?
Scaling down
When I scale down my Deployment, I would like the pods scheduled on the vms pool to be deleted first, because my pods should run on a bare-metal pool node by priority. However, if I run:
kubectl scale deployment nginx --replicas 3
I can observe that pods on bare-metal pool nodes are deleted first, leaving only 3 pods on a vms pool node.
Question: Is there a way to configure my deployment to tell Kubernetes to remove the pods on the vms pool first?
Edit: It seems that the PodDeletionCost proposal could solve this, but it is not implemented for now.
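For reference, the proposal attaches a deletion cost to each pod via an annotation. Once implemented, marking the vms-pool pods as cheaper to delete would look something like this (the annotation name comes from the proposal; the value is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: nginx-on-vms
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"  # lower cost = deleted first
Pods with a lower deletion cost would be preferred for removal when the ReplicaSet scales down.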

Is it possible for pods running on Kubernetes to share the same PVC?

I've currently set up a PVC with the name minio-pvc and created a deployment based on the stable/minio chart with the following values:
mode: standalone
replicas: 1
persistence:
  enabled: true
  existingClaim: minio-pvc
What happens if I increase the number of replicas? Do I run the risk of corrupting data if more than one pod tries to write to the PVC at the same time?
Don't use a Deployment for stateful containers. Instead, use a StatefulSet.
StatefulSets are specifically designed for running stateful containers like databases. They are used to persist the state of the container.
Note that each pod binds a separate persistent volume via its own PVC. There is no possibility of multiple pod instances writing to the same PV. Hope I answered your question.
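A minimal sketch of what that looks like for minio (names and sizes here are illustrative): the volumeClaimTemplates section makes the StatefulSet create one PVC per replica (data-minio-0, data-minio-1, ...), so no two pods share a volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: minio
spec:
  serviceName: minio
  replicas: 2
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
      - name: minio
        image: minio/minio
        args: ["server", "/data"]
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:           # one PVC is stamped out per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi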
If you are sticking to Deployments instead of StatefulSets, it won't be feasible for multiple replicas to write to the same PVC, since there is no guarantee that the different replicas are scheduled on the same node; you might end up with a pending pod that fails to attach the volume. The solution is to choose a specific node and have all your replicas run on that node.
Run the following and assign a label to one of your nodes:
kubectl label nodes <node-name> <label-key>=<label-value>
Say we choose label-key to be labelKey and label-value to be node1. Then you can go ahead and add the following to your YAML file and have the pods scheduled on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        labelKey: node1
      containers:
      ...

Why should I specify service before deployment in a single Kubernetes configuration file?

I'm trying to understand why the Kubernetes docs recommend specifying a service before a deployment in one configuration file:
The resources will be created in the order they appear in the file. Therefore, it’s best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
Does it mean spreading pods between Kubernetes cluster nodes?
I tested with the following configuration, where a deployment is located before a service, and pods are distributed between nodes without any issues.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: incorrect-order
  namespace: test
spec:
  selector:
    matchLabels:
      app: incorrect-order
  replicas: 2
  template:
    metadata:
      labels:
        app: incorrect-order
    spec:
      containers:
      - name: incorrect-order
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: incorrect-order
  namespace: test
  labels:
    app: incorrect-order
spec:
  type: NodePort
  ports:
  - port: 80
  selector:
    app: incorrect-order
Another explanation is that some environment variables containing the service URL will not be set for the pods in this case. However, it also works OK when the configuration is inside one file, like the example above.
Could you please explain why it is better to specify the service before the deployment in the case of one configuration file? Or maybe it is an outdated recommendation.
If you use DNS as service discovery, the order of creation doesn't matter.
In the case of environment vars (the second way K8s offers service discovery), the order matters, because once those vars are passed to the starting pod, they cannot be modified later if the service definition changes.
So if your service is deployed before you start your pod, the service env vars are injected into the linked pod.
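A quick hypothetical way to see this: if a Service named nginx exists before the pod below starts, the kubelet injects variables such as NGINX_SERVICE_HOST and NGINX_SERVICE_PORT into it; if the service is created afterwards, the grep finds nothing.
apiVersion: v1
kind: Pod
metadata:
  name: env-check
spec:
  restartPolicy: Never
  containers:
  - name: env-check
    image: busybox
    command: ["sh", "-c", "printenv | grep NGINX_SERVICE"]  # prints the injected service vars, if any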
If you create a Pod/Deployment resource with labels, this resource will be exposed through a service once the latter is created (with the proper selector to indicate which resources to expose).
You are correct in that it affects the spread among the worker nodes.
Deployments without a Service will simply be scheduled onto the nodes with the least cpu/memory allocation. For instance, a brand new and empty node will get all new pods from a new deployment.
With a Deployment that also has a Service, the Scheduler tries to spread the pods between nodes, disregarding the cpu/memory load (within limits), to help the Service survive better.
It puzzles me that a Deployment on its own doesn't cause an optimal spread, but it doesn't, not yet at least.
This is the answer from the official documentation:
The resources will be created in the order they appear in the file.
Therefore, it's best to specify the service first, since that will
ensure the scheduler can spread the pods associated with the service
as they are created by the controller(s), such as Deployment.
Kubernetes Documentation/Concepts/Cluster/Administration/Managing Resources

How to configure a Kubernetes Multi-Pod Deployment

I would like to deploy an application cluster by managing my deployment via the k8s Deployment object. The documentation has me extremely confused. My basic layout has the following components that scale independently:
API server
UI server
Redis cache
Timer/Scheduled task server
Technically, all 4 above belong in separate pods that are scaled independently.
My questions are:
Do I need to create pod.yml files and then somehow reference them in the deployment.yml file, or can a deployment file also embed pod definitions?
The K8s documentation seems to imply that the spec portion of a Deployment is equivalent to defining one pod. Is that correct? What if I want to declaratively describe multi-pod deployments? Do I need multiple deployment.yml files?
Pagid's answer has most of the basics. You should create 4 Deployments for your scenario. Each Deployment will create a ReplicaSet that creates and supervises the collection of PODs for the Deployment.
Each Deployment will most likely also require a Service in front of it for access. I usually create a single yaml file that has a Deployment and the corresponding Service in it. Here is an example for an nginx.yaml that I use:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
  name: nginx
  labels:
    app: nginx
spec:
  type: NodePort
  ports:
  - port: 80
    name: nginx
    targetPort: 80
    nodePort: 32756
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginxdeployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginxcontainer
        image: nginx:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 80
Here is some additional information for clarification:
A POD is not a scalable unit. A Deployment that schedules PODs is.
A Deployment is meant to represent a single group of PODs fulfilling a single purpose together.
You can have many Deployments work together in the virtual network of the cluster.
For accessing a Deployment that may consist of many PODs running on different nodes you have to create a Service.
Deployments are meant to contain stateless services. If you need to store a state you need to create StatefulSet instead (e.g. for a database service).
You can use the Kubernetes API reference for the Deployment: you'll find that the spec->template field is of type PodTemplateSpec, and, together with the related comment (Template describes the pods that will be created.), it answers your questions. A longer description can of course be found in the Deployment user guide.
To answer your questions...
1) The Pods are managed by the Deployment and defining them separately doesn't make sense as they are created on demand by the Deployment. Keep in mind that there might be more replicas of the same pod type.
2) For each of the applications in your list, you'd have to define one Deployment - which also makes sense when it comes to different replica counts and application rollouts.
3) You haven't asked that, but it's related - along with separate Deployments, each of your applications will also need a dedicated Service so the others can access it.
Additional information:
API server: use a Deployment
UI server: use a Deployment
Redis cache: use a StatefulSet
Timer/Scheduled task server: maybe use a StatefulSet (if your service holds some state)