Run a pod without scheduler in Kubernetes - kubernetes

I searched the documentation but I am unable to find out if I can run a pod in Kubernetes without Scheduler. If anyone can help with any pointers it would be helpful
Update:
I can attach a label to node and let pod stick to that label but that would involve going through the scheduler. Is there any method without daemonset and does not use scheduler.

The scheduler just sets the spec.nodeName field on the pod. You can set that to a node name yourself if you know which node you want to run your pod, though you are then responsible for ensuring the node has sufficient resources to run the pod (enough memory, free host ports, etc… all things the scheduler is normally responsible for checking before it assigns a pod to a node)

You want static pods
Static pods are managed directly by kubelet daemon on a specific node, without API server observing it. It does not have associated any replication controller, kubelet daemon itself watches it and restarts it when it crashes.

You can simply add a nodeName attribute to the pod definition which by they normally field out by the scheduler therefor it's not a mandatory field.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- image: nginx
name: nginx
nodeName: node01
if the pod has been created and in pending state you have to recreate it with the new field, edit is not permitted with the nodeName att.

All the answers given here would require a scheduler to run.
I think what you want to do is create the manifest file of the pod and put it in the default manifest directory of the node in question.
Default directory is /etc/kubernetes/manifests/
The pod will automatically be created and if you wish to delete it, just delete the manifest file.

You can simply add a nodeName attribute to the pod definition
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
nodeName: controlplane
containers:
- image: nginx
name: nginx
Now important point - check the node listed by using below command, and then assign to one of them:
kubectl get nodes

Related

Get error: Edit cancelled, no valid changes were saved when i want edit pod with kubectl edit

When i want to edit my deployment pod with kubectl edit deployment [name] command i got this error. whats the problem?!
i found this: You cant edit the pod. You can edit only deployment. If you want to change anything in pod, you need to take a pod yaml output and then update your changes and recreate the pod.
how can i do that?
You need to update the object manifest that you first used to deploy that object. If you're using a Pod object such as:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
edit this YAML and redeploy. The reason as to why you can't edit the deployed pod (i.e. the pod listed from kubectl get po) is because pods are ephemeral, they an be killed and restarted for any reason. If you could edit a deployed pod, if for any reason, the pod terminates, you're changes are gone. That's why you edit through the YAML/Object manifest which is your source of truth (desired state).

Moving Pod to another node automatically

Is it possible for a pod/deployment/statefulset to be moved to another node or be recreated on another node automatically if the first node fails? The pod in question is set to 1 replica. So is it possible to configure some sort of failover for kubernetes pods? I've tried out pod affinity settings but nothing is moved automatically it has been around 10 minutes.
the yaml for the said pod is like below:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: ceph-rbd-sc-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 2Gi
storageClassName: ceph-rbd-sc
---
apiVersion: v1
kind: Pod
metadata:
name: ceph-rbd-pod-pvc-sc
labels:
app: ceph-rbd-pod-pvc-sc
spec:
containers:
- name: ceph-rbd-pod-pvc-sc
image: busybox
command: ["sleep", "infinity"]
volumeMounts:
- mountPath: /mnt/ceph_rbd
name: volume
nodeSelector:
etiket: worker
volumes:
- name: volume
persistentVolumeClaim:
claimName: ceph-rbd-sc-pvc
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
name: ceph-rbd-pod-pvc-sc
topologyKey: "kubernetes.io/hostname"
Edit:
I managed to get it to work. But now i have another problem, the newly created pod in the other node is stuck in "container creating" and the old pod is stuck in "terminating". I also get Multi-Attach error for volume stating that the pv is still in use by the old pod. The situation is the same for any deployment/statefulset with a pv attached, the problem is resolved only when the failed node comes back online. Is there a solution for this?
Answer from coderanger remains valid regarding Pods. Answering to your last edit:
Your issue is with CSI.
When your Pod uses a PersistentVolume whose accessModes is RWO.
And when the Node hosting your Pod gets unreachable, prompting Kubernetes scheduler to Terminate the current Pod and create a new one on another Node
Your PersistentVolume can not be attached to the new Node.
The reason for this is that CSI introduced some kind of "lease", marking a volume as bound.
With previous CSI spec & implementations, this lock was not visible, in terms of Kubernetes API. If your ceph-csi deployment is recent enough, you should find a corresponding "VolumeAttachment" object that could be deleted, to fix your issue:
# kubectl get volumeattachments -n ci
NAME ATTACHER PV NODE ATTACHED AGE
csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794 rbd.csi.ceph.com pvc-902c3925-11e2-4f7f-aac0-59b1edc5acf4 melpomene.xxx.com true 14d
csi-24847171efa99218448afac58918b6e0bb7b111d4d4497166ff2c4e37f18f047 rbd.csi.ceph.com pvc-b37722f7-0176-412f-b6dc-08900e4b210d clio.xxx.com true 90d
....
kubectl delete -n ci volumeattachment csi-xxxyyyzzz
Those VolumeAttachments are created by your CSI provisioner, before the device mapper attaches a volume.
They would be deleted only once the corresponding PV would have been released from a given Node, according to its device mapper - that needs to be running, kubelet up/Node marked as Ready according to the the API. Until then, other Nodes can't map it. There's no timeout, should a Node get unreachable due to network issues or an abrupt shutdown/force off/reset: its RWO PV are stuck.
See: https://github.com/ceph/ceph-csi/issues/740
One workaround for this would be not to use CSI, and rather stick with legacy StorageClasses, in your case installing rbd on your nodes.
Though last I checked -- k8s 1.19.x -- I couldn't manage to get it working, I can't recall what was wrong, ... CSI tends to be "the way" to do it, nowadays. Despite not being suitable for production use, sadly, unless running in an IAAS with auto-scale groups deleting Nodes from the Kubernetes API (eventually evicting the corresponding VolumeAttachments), or using some kind of MachineHealthChecks like OpenShift 4 implements.
A bare Pod is a single immutable object. It doesn't have any of these nice things. Related: never ever use bare Pods for anything. If you try this with a Deployment you should see it spawn a new one to get back to the requested number of replicas. If the new Pod is Unschedulable you should see events emitted explaining why. For example if only node 1 matches the nodeSelector you specified, or if another Pod is already running on the other node which triggers the anti-affinity.

Kubectl get deployments, no resources

I've just started learning kubernetes, in every tutorial the writer generally uses "kubectl .... deploymenst" to control the newly created deploys. Now, with those commands (ex kubectl get deploymets) i always get the response No resources found in default namespace., and i have to use "pods" instead of "deployments" to make things work (which works fine).
Now my question is, what is causing this to happen, and what is the difference between using a deployment or a pod? ? i've set the docker driver in the first minikube, it has something to do with this?
First let's brush up some terminologies.
Pod - It's the basic building block for Kubernetes. It groups one or more containers (such as Docker containers), with shared storage/network, and a specification for how to run the containers.
Deployment - It is a controller which wraps Pod/s and manages its life cycle, which is to say actual state to desired state. There is one more layer in between Deployment and Pod which is ReplicaSet : A ReplicaSet’s purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
Below is the visualization:
Source: I drew it!
In you case what might have happened :
Either you have created a Pod not a Deployment. Therefore, when you do kubectl get deployment you don't see any resources. Note when you create Deployments it in turn creates a ReplicaSet for you and also creates the defined pods.
Or may be you created your deployment in a different namespace, if that's the case, then type this command to find your deployments in that namespace kubectl get deploy NAME_OF_DEPLOYMENT -n NAME_OF_NAMESPACE
More information to clarify your concepts:
Source
Below the section inside spec.template is the section which is supposedly your POD manifest if you were to create it manually and not take the deployment route. Now like I said earlier in simple terms Deployments are a wrapper to your PODs, therefore anything which you see outside the path spec.template is the configuration which you will need to defined on how you want to manage (scaling,affinity, e.t.c) your POD
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
app: nginx
spec:
replicas: 3
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
Deployment is a controller providing higher level abstraction on top of pods and ReplicaSets. A Deployment provides declarative updates for Pods and ReplicaSets. Deployments internally creates ReplicaSets within which pods are created.
Use cases of deployment is documented here
One reason for No resources found in default namespace could be that you created the deployment in a specific namespace and not in default namespace.
You can see deployments in a specific namespace or in all namespaces via
kubectl get deploy -n namespacename
kubectl get deploy -A

Make Kubernetes wait for Pod termination before removing from Service endpoints

According to Termination of Pods, step 7 occurs simultaneously with 3. Is there any way I can prevent this from happening and have 7 occur only after the Pod's graceful termination (or expiration of the grace period)?
The reason why I need this is that my Pod's termination routine requires my-service-X.my-namespace.svc.cluster.local to resolve to the Pod's IP during the whole process, but the corresponding Endpoint gets removed as soon as I run kubectl delete on the Pod / Deployment.
Note: In case it helps making this clear, I'm running a bunch of clustered VerneMQ (Erlang) nodes which, on termination, dump their contents to other nodes on the cluster — hence the need for the nodenames to resolve correctly during the whole termination process. Only then should the corresponding Endpoints be removed.
Unfortunately kubernetes was designed to remove the Pod from the endpoints at the same time as the prestop hook is started (see link in question to kubernetes docs):
At the same time as the kubelet is starting graceful shutdown, the
control plane removes that shutting-down Pod from Endpoints
This google kubernetes docs says it even more clearly:
Pod is set to the “Terminating” State and removed from the
endpoints list of all Services
There also was also a feature request for that. which was not recognized.
Solution for helm users
But if you are using helm, you can use hooks (e.g. pre-delete,pre-upgrade,pre-rollback). Unfortunately this helm hook is an extra pod which can not access all pod resources.
This is an example for a hook:
apiVersion: batch/v1
kind: Job
metadata:
name: graceful-shutdown-hook
annotations:
"helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
labels:
app.kubernetes.io/name: graceful-shutdown-hook
spec:
template:
spec:
containers:
- name: graceful-shutdown
image: busybox:1.28.2
command: ['sh', '-cx', '/bin/sleep 15']
restartPolicy: Never
backoffLimit: 0
Maybe you should consider using headless service instead of using ClusterIP one. That way your apps will discover using the actual endpoint IPs and the removal from endpoint list will not break the availability during shutdown, but will remove from discovery (or from ie. ingress controller backends in nginx contrib)

How are minions chosen for given pod creation request

How does kubernetes choose the minion among many available for a given pod creation command? Is it something that can be controlled/tweaked ?
If replicated pods are submitted for deployment, is kubernetes intelligent enough to place them in different minions if they expose the same container/host port pair? Or does it always place different replicas in different minions ?
What about corner cases like what if two different pods (not necessarily replicas) that expose same host/container port pair are submitted? Will they carefully be placed on different minions ?
If a pod requires specific compute/memory requirements, can it be placed in a minion/host that has sufficient resources left to meet those requirement?
In summary, is there detailed documentation on kubernetes pod placement strategy?
Pods are scheduled to ports using the algorithm in generic_scheduler.go
There are rules that prevent host-port conflicts, and also to make sure that there are sufficient memory and cpu requirements. predicates.go
One way to choose minion for pod creation is using nodeSelector. Inside the yaml file of pod specify the label of minion for which you want to choose the minion.
apiVersion: v1
kind: Pod
metadata:
name: nginx1
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
nodeSelector:
key: value