Tekton on EKS: how to work with zones when using volumeClaim?

Update 2022-03-22: I was able to isolate the problem to the Cluster Autoscaler and too few pod "slots" left on a node. Still no solution. For a detailed analysis see https://github.com/tektoncd/pipeline/issues/4699
I have an EKS cluster running with the aws-ebs controller. Now I want to use Tekton on this cluster. Tekton has an affinity assistant which should schedule pods on the same node if they share a workspace (aka volumeClaim). Sadly, this does not seem to work for me, as I randomly get an error from my node stating "didn't match pod affinity rules" and "didn't find available persistent volume to bind" even though a volume exists. After debugging, I found that the persistentVolumes created are, from time to time, in a different zone and on another host than the pod which is spawned.
Does somebody know how to still use "automatic" aws-ebs provisioning with Tekton on EKS, or something similar that makes this work? My fallback solution would be to try S3 as storage, but I assume this may not be the best solution, as I have many small files from a git repository. Just provisioning a volume and then running pods only on that one node is not the solution I would opt for, though even that is better than nothing :)
Any help would be appreciated! If more information is needed, please add a comment and I will try to follow up.
-Thanks a lot!

You get this event:
0/6 nodes are available: 2 Too many pods
Your node is essentially full. When you use Tekton Pipelines with the Affinity Assistant, all pods in the run will be scheduled to the same node.
If you want to run Tekton Pipelines with space for just a few pods per node, then you should disable the Affinity Assistant in the config map.
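As a minimal sketch of that change, assuming Tekton Pipelines is installed in its default tekton-pipelines namespace and that the installed release still reads the disable-affinity-assistant key of the feature-flags ConfigMap (newer releases may use a different flag, so check the docs for your version):

```yaml
# Assumption: default install namespace and a Tekton release that
# still honours the "disable-affinity-assistant" feature flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  # "true" stops Tekton from creating the Affinity Assistant, so pods
  # of one PipelineRun are no longer forced onto a single node.
  disable-affinity-assistant: "true"
```

The ConfigMap carries other flags as well, so it is safer to merge this one key into the existing object than to replace the whole thing. With the assistant disabled, pods that share a ReadWriteOnce EBS volume are still constrained by where that volume can be attached.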

Related

Why would the Kubernetes scheduler always place my Pod replicas on the same node in AKS?

We have an AKS test cluster with four Windows worker nodes and a Deployment with a replica count of two. The corresponding Pod spec does not specify any resource requests and limits (thus, the resulting Pods are in the BestEffort QoS class).
In order to conduct a performance test, we scaled all other Deployments on those worker nodes to 0 replicas and deleted all remaining Pods on the nodes. Only the system Pods created by AKS itself through DaemonSets (in the kube-system namespace) remained. We then created the Deployment mentioned above.
We had assumed that the default Kubernetes scheduler would place the two replicas on different nodes by default, or at least choose nodes randomly. However, the scheduler always chose the same node to place both replicas on, no matter how often we deleted the Pods or scaled the Deployment to 0 and back again to 2. Only after we tainted that node as NoSchedule, did the scheduler choose another node.
I know I could configure anti-affinities or topology spread constraints to get a better spreading of my Pods. But in the Cloud Native DevOps with Kubernetes book, I read that the scheduler actually does a very good job by default and one should only use those features if absolutely necessary. (Instead maybe using the descheduler if the scheduler is forced to make bad decisions.)
So, I would like to understand why the behavior we observed would happen. From the docs, I've learned that the scheduler first filters the nodes for fitting ones. In this case, all of them should fit, as all are configured identically. It then scores the nodes, choosing randomly if all have the same score. Why would one node always win that scoring?
Follow-up question: Is there some way I could reconstruct the scheduler's decision logic in AKS? I can see kube-scheduler logs in Container Insights, but they don't contain any information regarding scheduling, just some operational messages.
I believe that the scheduler is aware of which Nodes already have the container images pulled down, and will give them preference to avoid the image pull (and thus get a faster start time).
Short of digging up the source code as proof, I would guess one could create a separate Pod (for this purpose, I literally mean kind: Pod), force it onto one of the other Nodes via nodeName:, then, after the Pod has been scheduled and has attempted to start, delete the Pod and scale up your Deployment.
I would then expect the new Deployment-managed Pod to arrive on that other Node, because it by definition has fewer resources in use but also has the required container image.
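The experiment described above could be sketched with a throwaway Pod like the following. The Pod name, node name and image are hypothetical placeholders; the image would need to be the same one the Deployment uses.

```yaml
# Throwaway Pod that bypasses the scheduler via nodeName so that the
# image gets pulled onto a specific node. All names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: image-prepull            # hypothetical name
spec:
  nodeName: other-worker-node    # hypothetical: a node the scheduler never picked
  restartPolicy: Never
  containers:
  - name: prepull
    image: registry.example.com/my-app:1.0   # hypothetical: the Deployment's image
```

Once the image has been pulled on that node, the Pod can be deleted and the Deployment scaled up again to see whether the placement changes.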
Following mdaniel's reply, which I've marked as the accepted answer, we've done some more analysis and have found the list of scheduling plugins and the scheduling framework docs. Reading the code, we can see that the ImageLocality plugin assigns a very high score because the Windows container images are really large. As we don't have resource requests, the NodeResourcesFit plugin will not compensate for this.
We did not find a plugin that would strive to not put Pod replicas onto the same node (unless configured via anti-affinities or a PodTopologySpreadConstraint). This surprised me, as that would seem to be a good default.
Some experimentation shows that the situation indeed changes once we, for example, start adding (even minimal) resource requests.
In the future, we'll therefore assign resource requests (which is good practice anyway) and, if this isn't enough, follow up with PodTopologySpreadConstraints.
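A minimal sketch of what that could look like, combining small resource requests with a soft topology spread constraint. The Deployment name, labels, image and request values are illustrative assumptions, not taken from the original setup.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: perf-test-app                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: perf-test-app
  template:
    metadata:
      labels:
        app: perf-test-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                            # replica counts per node may differ by at most 1
        topologyKey: kubernetes.io/hostname   # spread across individual nodes
        whenUnsatisfiable: ScheduleAnyway     # soft preference, never blocks scheduling
        labelSelector:
          matchLabels:
            app: perf-test-app
      containers:
      - name: app
        image: registry.example.com/my-app:1.0   # hypothetical image
        resources:
          requests:                   # illustrative values; even small requests feed into node scoring
            cpu: 100m
            memory: 128Mi
```

ScheduleAnyway keeps the constraint a scoring hint; DoNotSchedule would turn it into a hard requirement.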

Is it possible to schedule a pod to run for say 24 hours and then remove deployment/statefulset? or need to use jobs?

We have a bunch of pods running in a dev environment. The pods are auto-provisioned by an application on every business action. The problem is that they are accumulating across various namespaces and eating available resources in EKS.
Is there a way, without Jenkins/k8s jobs, to simply put some parameter on the pod manifest to tell it to self-destruct in, say, 24 hours?
Add this to your pod.spec:
activeDeadlineSeconds: 86400
After the deadline, your Pod will be stopped for good with the status DeadlineExceeded.
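For context, a sketch of where the field sits in a complete manifest, assuming a standalone Pod. The Pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-worker        # hypothetical name
spec:
  # 24 hours = 24 * 60 * 60 = 86400 seconds, counted from Pod start.
  activeDeadlineSeconds: 86400
  restartPolicy: Never
  containers:
  - name: worker
    image: registry.example.com/worker:1.0   # hypothetical image
```

The Pod object itself remains in the API after the deadline (in a failed state), so it would still need to be deleted if the goal is to clean up the namespace listing as well.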
If I understood your situation properly, you would like to scale your cluster down in order to save resources.
Kubernetes is able to autoscale your application in a cluster. That is, Kubernetes can start additional pods when the load increases and terminate excess pods when the load decreases.
It is possible to downscale the application to zero pods, but in this case there will be a delay serving the first request while the pod is starting.
This functionality relies on performance metrics. In practice, this means that autoscaling doesn't happen instantly, because it takes some time for the performance metrics to reach the configured threshold.
This Kubernetes feature, called HPA (Horizontal Pod Autoscaler), is described in this document.
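As a rough illustration of what an HPA object looks like, here is a sketch that assumes a Deployment named my-app (hypothetical), a cluster that serves the autoscaling/v2 API, and a metrics source such as metrics-server being installed. The thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                  # hypothetical name
spec:
  scaleTargetRef:               # the workload whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                # hypothetical Deployment
  minReplicas: 1                # illustrative bounds
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # scale out when average CPU use exceeds 70% of requests
```

Utilization is computed relative to the containers' CPU requests, so the target Pods need resource requests for this metric to work.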
If you are running your cluster on GCP or GKE, you can go further and automatically start additional nodes for your cluster when you need more computing capacity, and shut down nodes when they are no longer running application pods.
More information about this functionality can be found by following the link.
Last but not least, you can use a tool like Ansible to manage all your Kubernetes assets (it can create and manage deployments via playbooks).
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

Expandable single node K8s cluster

I am searching for a solution that enables me to set up a single-node K8s cluster and add nodes to it later if needed.
I am aware of solutions such as minikube and microk8s, but they are not expandable. I am trying k3s at the moment precisely because it offers this feature, but I have some problems with storage and other things that I am still working on.
Now my questions:
What other solution for this exists?
What are the disadvantages if I untaint the master node and run everything there (for a long period and not just for test)?
You can use kubeadm to set up a single-node "cluster". Then you can use the join command to add more nodes.
You can expand a k3s cluster via k3sup join. Here is a guide.
Key Kubernetes services such as kube-apiserver and kube-scheduler should be available and running smoothly at all times on master nodes. Therefore, it is essential to have dedicated resources for the master nodes, and to avoid having other non-critical workloads interfere with the functioning of the master services.
What are the disadvantages if I untaint the master node and run everything there (for a long period and not just for test)?
Failure of the worker will of course bring down your applications. When you recover it or spin up another one, K8s will recover your apps for you.
Failure of the master will not directly affect your running workloads, only the cluster's ability to manage itself and its self-healing capabilities (which will affect uptime at some point).
I am searching for a solution that enables me to set up a single node K8s cluster and if I needed I add nodes to it later.
To the best of my knowledge, there is no such thing as a production-ready single-node k8s cluster.
For something small and simple you can check Rancher.
What other solution for this exists?
kubeadm allows you to install everything on a single node. Install kubeadm on the node, "kubeadm init", install a pod network, then remove the master taint.
Another solution you may be interested in is Kubespray.
Some "honorable mentions" are:
Charmed Kubernetes by Canonical allows you to do everything on one node; however, it needs quite a big node, so it may not fit your case (but is still worth mentioning).
If you don't really require all the k8s power (with only one small node), then Nomad could be an alternative.
Let me know if that helps.

GKE and NodeLocal DNSCache

We have a deployment of Kubernetes in Google Cloud Platform. Recently we hit one of the well-known issues with kube-dns that occurs under a high volume of requests: https://github.com/kubernetes/kubernetes/issues/56903 (it's more related to SNAT/DNAT and conntrack, but the end result is that kube-dns goes out of service).
After a few days of digging into that topic, we found that k8s already has a solution, which is currently in alpha (https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/).
The solution is to run a caching CoreDNS as a daemonset on each k8s node. So far so good.
The problem is that after you create the daemonset, you have to tell kubelet to use it with the --cluster-dns option, and we can't find any way to do that in a GKE environment. Google bootstraps the cluster with the "configure-sh" script in the instance metadata. There is an option to edit the instance template and "hardcode" the required values, but that is not viable: if you upgrade the cluster or use horizontal autoscaling, all of the modified values will be lost.
The last idea was to use a custom startup script that pulls the configuration and updates the metadata server, but this is too complicated a task.
As of 2019/12/10, GKE supports this through the gcloud CLI in beta:
Kubernetes Engine
Promoted NodeLocalDNS Addon to beta. Use --addons=NodeLocalDNS with gcloud beta container clusters create. This addon can be enabled or disabled on existing clusters using --update-addons=NodeLocalDNS=ENABLED or --update-addons=NodeLocalDNS=DISABLED with gcloud container clusters update.
See https://cloud.google.com/sdk/docs/release-notes#27300_2019-12-10
You can spin up another kube-dns deployment, e.g. in a different node pool, and thus have two nameservers in the pod's resolv.conf.
This would mitigate the evictions and other failures and generally allow you to completely control your kube-dns service in the whole cluster.
In addition to what was mentioned in this answer - With beta support on GKE, the nodelocal caches now listen on the kube-dns service IP, so there is no need for a kubelet flag change.

How do I create a policy to run a container on every node, except the master unless there is only one node?

In the Kubernetes Book, it says that it's poor form to run pods on the master node.
Following this advice, I'd like to create a policy that runs a pod on all nodes except the master if there is more than one node. However, to simplify testing and work in single-node environments, I'd also like to run my pod on the master node if there is just a single node in the entire system.
I've been looking around, and can't figure out how to express this policy. I see that DaemonSets have affinities and anti-affinities. I considered labeling the master node and adding an anti-affinity for that label. However, I didn't see how to require that at least a single pod would always come up (to ensure that things worked for single-node environment). Please let me know if I'm misunderstanding something. Thanks!
How about something like this:
During node provisioning, assign a particular label to each node that should run the job. In a single node cluster, this would be the master. In a multi-node environment, it would be every node except the master(s).
Create a daemonset that tolerates the master node taint, so that its pods may be scheduled on any node, including the master:
tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule
As described in that doc you linked, use .spec.template.spec.nodeSelector to select only nodes with your special label. (node selector docs).
How you assign the special label to nodes is probably a fairly manual process heavily dependent on how you are actually deploying your clusters, but that is the general plan I would follow.
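Put together, the plan above might look like the following sketch. The DaemonSet name, the image and the example.com/run-agent label are hypothetical; the label stands in for whatever "special label" gets assigned during node provisioning.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                      # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      # Only nodes carrying the provisioning-time label run the pod.
      nodeSelector:
        example.com/run-agent: "true"   # hypothetical label
      # Tolerate the master taint so a labelled master is eligible too.
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: registry.example.com/agent:1.0   # hypothetical image
```

In a single-node cluster the label goes on the master; in a multi-node cluster it goes on every node except the master(s). Newer clusters may taint control-plane nodes with node-role.kubernetes.io/control-plane instead, in which case the toleration key would need to match that.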
EDIT: Or I believe it may be simplest to just remove the master node taint from your single-node cluster. I believe most simple distributions like minikube will come this way by default.