how to scale daemon set about kubernetes using kubectl - kubernetes

Now I only have terminal to access kubernetes cluster now, check the ingress controller like this:
$ k get daemonset --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system traefik-ingress-controller 0 0 0 0 0 IngressProxy=true 60d
logging fluentd-es 0 0 0 0 0 beta.kubernetes.io/fluentd-ds-ready=true 28d
I am now using kubectl(v1.15.2) to scale daemon set like this:
kubectl scale --replicas=1 DaemonSet/traefik-ingress-controller -n kube-system
but it shows:
Error from server (NotFound): the server could not find the requested resource
what should I do to start the traefik in terminal using command line? This is my daemon set describe output:
~/Library/Mobile Documents/com~apple~CloudDocs/Document/k8s/work/traefik-deployment-yaml/k8s-backup ⌚ 17:49:58
$ k describe daemonset traefik-ingress-controller -n kube-system
Name: traefik-ingress-controller
Selector: app=traefik
Node-Selector: IngressProxy=true
Labels: app=traefik
Annotations: deprecated.daemonset.template.generation: 18
kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"annotations":{},"labels":{"app":"traefik"},"name":"traefik-ingress-controller","na...
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=traefik
Service Account: traefik-ingress-controller
Containers:
traefik-ingress-lb:
Image: traefik:v2.1.6
Ports: 80/TCP, 443/TCP, 8080/TCP
Host Ports: 80/TCP, 443/TCP, 0/TCP
Args:
--configfile=/config/traefik.yaml
--logLevel=INFO
--metrics=true
--metrics.prometheus=true
--entryPoints.metrics.address=:8080
--metrics.prometheus.entryPoint=metrics
--metrics.prometheus.addServicesLabels=true
--metrics.prometheus.addEntryPointsLabels=true
--metrics.prometheus.buckets=0.100000, 0.300000, 1.200000, 5.000000
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 1
memory: 1Gi
Environment: <none>
Mounts:
/config from config (rw)
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: traefik-config
Optional: false
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedDaemonPod 3h32m daemonset-controller Found failed daemon pod kube-system/traefik-ingress-controller-wdpsq on node azshara-k8s03, will try to kill it
Normal SuccessfulDelete 3h32m daemonset-controller Deleted pod: traefik-ingress-controller-wdpsq
Normal SuccessfulCreate 3h32m daemonset-controller Created pod: traefik-ingress-controller-qmttl
Warning FailedDaemonPod 3h32m daemonset-controller Found failed daemon pod kube-system/traefik-ingress-controller-qmttl on node azshara-k8s03, will try to kill it
Normal SuccessfulDelete 3h32m daemonset-controller Deleted pod: traefik-ingress-controller-qmttl
Normal SuccessfulCreate 3h32m daemonset-controller Created pod: traefik-ingress-controller-nlxwc

You don not need to scale a deamon set on K8s.
A Daemon Set ensures that all eligible nodes run a copy of a Pod..
As nodes are added to the cluster, Pods are added to them. So you need to add new node to cluster and deamon set will be scheduled there unless you have a very unique taint to disallow given deamon set.

Related

Metrics server is currently unable to handle the request

I am new to kubernetes and was trying to apply horizontal pod autoscaling to my existing application. and after following other stackoverflow details - got to know that I need to install metric-server - and I was able to - but some how it's not working and unable to handle request.
Further I followed few more things but unable to resolve the issue - I will really appreciate any help here.
Please let me know for any further details you need for helping me :) Thanks in advance.
Steps followed:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
kubectl get deploy,svc -n kube-system | egrep metrics-server
deployment.apps/metrics-server 1/1 1 1 2m6s
service/metrics-server ClusterIP 10.32.0.32 <none> 443/TCP 2m6s
kubectl get pods -n kube-system | grep metrics-server
metrics-server-64cf6869bd-6gx88 1/1 Running 0 2m39s
vi ana_hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
name: ana-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: common-services-auth
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 160
k apply -f ana_hpa.yaml
horizontalpodautoscaler.autoscaling/ana-hpa created
k get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
ana-hpa StatefulSet/common-services-auth <unknown>/160%, <unknown>/80% 1 10 0 4s
k describe hpa ana-hpa
Name: ana-hpa
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 12 Apr 2022 17:01:25 +0530
Reference: StatefulSet/common-services-auth
Metrics: ( current / target )
resource memory on pods (as a percentage of request): <unknown> / 160%
resource cpu on pods (as a percentage of request): <unknown> / 80%
Min replicas: 1
Max replicas: 10
StatefulSet pods: 3 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetResourceMetric 38s (x8 over 2m23s) horizontal-pod-autoscaler failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedComputeMetricsReplicas 38s (x8 over 2m23s) horizontal-pod-autoscaler invalid metrics (2 invalid out of 2), first error is: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
Warning FailedGetResourceMetric 23s (x9 over 2m23s) horizontal-pod-autoscaler failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl edit deployments.apps -n kube-system metrics-server
Add hostNetwork: true
deployment.apps/metrics-server edited
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
k describe pod metrics-server-5dc6dbdb8-42hw9 -n kube-system
Name: metrics-server-5dc6dbdb8-42hw9
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: pusntyn196.apac.avaya.com/10.133.85.196
Start Time: Tue, 12 Apr 2022 17:08:25 +0530
Labels: k8s-app=metrics-server
pod-template-hash=5dc6dbdb8
Annotations: <none>
Status: Running
IP: 10.133.85.196
IPs:
IP: 10.133.85.196
Controlled By: ReplicaSet/metrics-server-5dc6dbdb8
Containers:
metrics-server:
Container ID: containerd://024afb1998dce4c0bd5f4e58f996068ea37982bd501b54fda2ef8d5c1098b4f4
Image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
Image ID: k8s.gcr.io/metrics-server/metrics-server#sha256:5ddc6458eb95f5c70bd13fdab90cbd7d6ad1066e5b528ad1dcb28b76c5fb2f00
Port: 4443/TCP
Host Port: 4443/TCP
Args:
--cert-dir=/tmp
--secure-port=4443
--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
--kubelet-use-node-status-port
--metric-resolution=15s
State: Running
Started: Tue, 12 Apr 2022 17:08:26 +0530
Ready: True
Restart Count: 0
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g6p4g (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-g6p4g:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 2s
node.kubernetes.io/unreachable:NoExecute op=Exists for 2s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m31s default-scheduler Successfully assigned kube-system/metrics-server-5dc6dbdb8-42hw9 to pusntyn196.apac.avaya.com
Normal Pulled 2m32s kubelet Container image "k8s.gcr.io/metrics-server/metrics-server:v0.6.1" already present on machine
Normal Created 2m31s kubelet Created container metrics-server
Normal Started 2m31s kubelet Started container metrics-server
kubectl get --raw /apis/metrics.k8s.io/v1beta1
Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
E0412 11:43:54.684784 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:44:27.001010 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
k logs -f metrics-server-5dc6dbdb8-42hw9 -n kube-system
I0412 11:38:26.447305 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0412 11:38:26.899459 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0412 11:38:26.899477 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0412 11:38:26.899518 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0412 11:38:26.899545 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0412 11:38:26.899546 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0412 11:38:26.899567 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.900480 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0412 11:38:26.900811 1 secure_serving.go:266] Serving securely on [::]:4443
I0412 11:38:26.900854 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0412 11:38:26.900965 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0412 11:38:26.999960 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:38:26.999989 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0412 11:38:26.999970 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E0412 11:38:27.000087 1 configmap_cafile_content.go:242] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:38:27.000118 1 configmap_cafile_content.go:242] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Edit metrics server deployment yaml
Add - --kubelet-insecure-tls
k apply -f metric-server-deployment.yaml
serviceaccount/metrics-server unchanged
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader unchanged
clusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged
service/metrics-server unchanged
deployment.apps/metrics-server configured
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged
kubectl get pods -n kube-system | grep metrics-server
metrics-server-5dc6dbdb8-42hw9 1/1 Running 0 10m
kubectl top pods
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
Also tried by adding below to metrics server deployment
command:
- /metrics-server
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
This can easily be resolved by editing the deployment yaml files and adding the hostNetwork: true after the dnsPolicy: ClusterFirst
kubectl edit deployments.apps -n kube-system metrics-server
insert:
hostNetwork: true
I hope this help somebody for bare metal cluster:
$ helm --repo https://kubernetes-sigs.github.io/metrics-server/ --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system --set args='{--kubelet-insecure-tls}' upgrade --install metrics-server metrics-server
$ helm --kubeconfig=$HOME/.kube/loc-cluster.config -n kube-system uninstall metrics-server
Update: I deployed the metrics-server using the same command. Perhaps you can start fresh by removing existing resources and running:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
=======================================================================
It appears the --kubelet-insecure-tls flag was not configured correctly for the pod template in the deployment. The following should fix this:
Edit the existing deployment in the cluster with kubectl edit deployment/metrics-server -nkube-system.
Add the flag to the spec.containers[].args list, so that the deployment looks like this:
...
spec:
containers:
- args:
- --cert-dir=/tmp
- --secure-port=4443
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls <=======ADD IT HERE.
image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
...
Simply save your changes and let the deployment rollout the updated pods. You can use watch -n1 kubectl get deployment/kube-metrics -nkube-system and wait for UP-TO-DATE column to show 1.
Like this:
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 1/1 1 1 16m
Verify with kubectl top nodes. It will show something like
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
docker-desktop 222m 5% 1600Mi 41%
I've just verified this to work on a local setup. Let me know if this helps :)
Please configuration aggregation layer correctly and carefully, you can use this link for help : https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
name: <name of the registration object>
spec:
group: <API group name this extension apiserver hosts>
version: <API version this extension apiserver hosts>
groupPriorityMinimum: <priority this APIService for this group, see API documentation>
versionPriority: <prioritizes ordering of this version within a group, see API documentation>
service:
namespace: <namespace of the extension apiserver service>
name: <name of the extension apiserver service>
caBundle: <pem encoded ca cert that signs the server cert used by the webhook>
It would be helpful to provide kubectl version return value.
For me on EKS with helmfile I had to write in the values.yaml using the metrics-server chart :
containerPort: 10250
The value was enforced by default to 4443 for an unknown reason when I first deployed the chart.
See doc:
https://github.com/kubernetes-sigs/metrics-server/blob/master/charts/metrics-server/values.yaml#L62
https://aws.amazon.com/premiumsupport/knowledge-center/eks-metrics-server/#:~:text=confirm%20that%20your%20security%20groups
Then kubectl top nodes and kubectl describe apiservice v1beta1.metrics.k8s.io were working.
First of all, execute the following command:
kubectl get apiservices
And checkout the availablity (status) of kube-system/metrics-server service.
In case the availability is True:
Add hostNetwork: true to the spec of your metrics-server deployment by executing the following command:
kubectl edit deployment -n kube-system metrics-server
It should look like the following:
...
spec:
hostNetwork: true
...
Setting hostNetwork to true means that Pod will have access to
the host where it's running.
In case the availability is False (MissingEndpoints):
Download metrics-server:
wget https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.0/components.yaml
Remove (legacy) metrics server:
kubectl delete -f components.yaml
Edit downloaded file and add - --kubelet-insecure-tls to args list:
...
labels:
k8s-app: metrics-server
spec:
containers:
- args:
- --cert-dir=/tmp
- --secure-port=443
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
- --kubelet-use-node-status-port
- --metric-resolution=15s
- --kubelet-insecure-tls # add this line
...
Create service once again:
kubectl apply -f components.yaml

Argo sample workflows stuck in the pending state

I follow the Argo Workflow's Getting Started documentation. Everything goes smooth until I run the first sample workflow as described in 4. Run Sample Workflows. The workflow just gets stuck in the pending state:
vagrant#master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name: hello-world-z4lbs
Namespace: default
ServiceAccount: default
Status: Pending
Created: Thu May 14 12:36:45 +0000 (now)
vagrant#master:~$ argo list
NAME STATUS AGE DURATION PRIORITY
hello-world-z4lbs Pending 27m 0s 0
Here it was mentioned that taints on the muster node may be the problem, so I untainted the master node:
vagrant#master:~$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/master untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found
Then I deleted the pending workflow and resubmitted it, but it got stuck in the pending state again.
The details of the newly submitted workflow that is also stuck:
vagrant#master:~$ kubectl describe workflow hello-world-8kvmb
Name: hello-world-8kvmb
Namespace: default
Labels: <none>
Annotations: <none>
API Version: argoproj.io/v1alpha1
Kind: Workflow
Metadata:
Creation Timestamp: 2020-05-14T13:57:44Z
Generate Name: hello-world-
Generation: 1
Managed Fields:
API Version: argoproj.io/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:generateName:
f:spec:
.:
f:arguments:
f:entrypoint:
f:templates:
f:status:
.:
f:finishedAt:
f:startedAt:
Manager: argo
Operation: Update
Time: 2020-05-14T13:57:44Z
Resource Version: 16780
Self Link: /apis/argoproj.io/v1alpha1/namespaces/default/workflows/hello-world-8kvmb
UID: aa82d005-b7ac-411f-9d0b-93f34876b673
Spec:
Arguments:
Entrypoint: whalesay
Templates:
Arguments:
Container:
Args:
hello world
Command:
cowsay
Image: docker/whalesay:latest
Name:
Resources:
Inputs:
Metadata:
Name: whalesay
Outputs:
Status:
Finished At: <nil>
Started At: <nil>
Events: <none>
While trying to get the workflow-controller logs I get the follwoing error:
vagrant#master:~$ kubectl logs -n argo -l app=workflow-controller
Error from server (BadRequest): container "workflow-controller" in pod "workflow-controller-6c4787844c-lbksm" is waiting to start: ContainerCreating
The details for the corresponding workflow-controller pod:
vagrant#master:~$ kubectl -n argo describe pods/workflow-controller-6c4787844c-lbksm
Name: workflow-controller-6c4787844c-lbksm
Namespace: argo
Priority: 0
Node: node-1/192.168.50.11
Start Time: Thu, 14 May 2020 12:08:29 +0000
Labels: app=workflow-controller
pod-template-hash=6c4787844c
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/workflow-controller-6c4787844c
Containers:
workflow-controller:
Container ID:
Image: argoproj/workflow-controller:v2.8.0
Image ID:
Port: <none>
Host Port: <none>
Command:
workflow-controller
Args:
--configmap
workflow-controller-configmap
--executor-image
argoproj/argoexec:v2.8.0
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from argo-token-pz4fd (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
argo-token-pz4fd:
Type: Secret (a volume populated by a Secret)
SecretName: argo-token-pz4fd
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 7m17s (x4739 over 112m) kubelet, node-1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 2m18s (x4950 over 112m) kubelet, node-1 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1bd1fd11dfe677c749b4a1260c29c2f8cff0d55de113d154a822e68b41f9438e" network for pod "workflow-controller-6c4787844c-lbksm": networkPlugin cni failed to set up pod "workflow-controller-6c4787844c-lbksm_argo" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
I run Argo 2.8:
vagrant#master:~$ argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
I have checked the cluster status and it looks OK:
vagrant#master:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 95m v1.18.2
node-1 Ready <none> 92m v1.18.2
node-2 Ready <none> 92m v1.18.2
As to the K8s cluster installation, I created it using Vagrant as described here, the only differences being:
libvirt as provdier
newer version of Ubuntu: generic/ubuntu1804
newer version of Calico: v3.14
Any idea why the workflows get stuck in the pending state and how to fix it?
Workflows start in the Pending state and then are moved through their steps by the workflow-controller pod (which is installed in the cluster as part of Argo).
The workflow-controller pod is stuck in ContainerCreating. kubectl describe po {workflow-controller pod} reveals a Calico-related network error.
As mentioned in the comments, it looks like a common Calico error. Once you clear that up, your hello-world workflow should execute just fine.
Note from OP: Further debugging confirms the Calico problem (Calico nodes are not in the running state):
vagrant#master:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argo argo-server-84946785b-94bfs 0/1 ContainerCreating 0 3h59m
argo workflow-controller-6c4787844c-lbksm 0/1 ContainerCreating 0 3h59m
kube-system calico-kube-controllers-74d45555dd-zhkp6 0/1 CrashLoopBackOff 56 3h59m
kube-system calico-node-2n9kt 0/1 CrashLoopBackOff 72 3h59m
kube-system calico-node-b8sb8 0/1 Running 70 3h56m
kube-system calico-node-pslzs 0/1 CrashLoopBackOff 67 3h56m
kube-system coredns-66bff467f8-rmxsp 0/1 ContainerCreating 0 3h59m
kube-system coredns-66bff467f8-z4lbq 0/1 ContainerCreating 0 3h59m
kube-system etcd-master 1/1 Running 2 3h59m
kube-system kube-apiserver-master 1/1 Running 2 3h59m
kube-system kube-controller-manager-master 1/1 Running 2 3h59m
kube-system kube-proxy-k59ks 1/1 Running 2 3h59m
kube-system kube-proxy-mn96x 1/1 Running 1 3h56m
kube-system kube-proxy-vxj8b 1/1 Running 1 3h56m
kube-system kube-scheduler-master 1/1 Running 2 3h59m
For the calico CrashLoopBackOff, kubeadm use the default interface eth0 to bootstrap the cluster.
But the eth0 interface is used by Vagrant (for ssh).
You could configure the kubelet to use a private IP address (for instance) and not eth0.
You'll have to do that for each node then vagrant reload.
sudo vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
#Add the Environment line in 10-kubeadm.conf and replace your_node_ip
Environment="KUBELET_EXTRA_ARGS=--node-ip=your_node_ip"
Hope it helps

FailedScheduling: 0/3 nodes are available: 3 Insufficient pods

I'm trying to deploy my NodeJS application to EKS and run 3 pods with exactly the same container.
Here's the error message:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
cm-deployment-7c86bb474c-5txqq 0/1 Pending 0 18s
cm-deployment-7c86bb474c-cd7qs 0/1 ImagePullBackOff 0 18s
cm-deployment-7c86bb474c-qxglx 0/1 ImagePullBackOff 0 18s
public-api-server-79b7f46bf9-wgpk6 0/1 ImagePullBackOff 0 2m30s
$ kubectl describe pod cm-deployment-7c86bb474c-5txqq
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 23s (x4 over 2m55s) default-scheduler 0/3 nodes are available: 3 Insufficient pods.
So it says that 0/3 nodes are available However, if I run
kubectl get nodes --watch
$ kubectl get nodes --watch
NAME STATUS ROLES AGE VERSION
ip-192-168-163-73.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
ip-192-168-172-235.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
ip-192-168-184-236.ap-northeast-2.compute.internal Ready <none> 6d7h v1.14.6-eks-5047ed
3 pods are running.
here are my configurations:
aws-auth-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: aws-auth
namespace: kube-system
data:
mapRoles: |
- rolearn: [MY custom role ARN]
username: system:node:{{EC2PrivateDNSName}}
groups:
- system:bootstrappers
- system:nodes
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: cm-deployment
spec:
replicas: 3
selector:
matchLabels:
app: cm-literal
template:
metadata:
name: cm-literal-pod
labels:
app: cm-literal
spec:
containers:
- name: cm
image: docker.io/cjsjyh/public_test:1
imagePullPolicy: Always
ports:
- containerPort: 80
#imagePullSecrets:
# - name: regcred
env:
[my environment variables]
I applied both .yaml files
How can I solve this?
Thank you
My guess, without running the manifests you've got is that the image tag 1 on your image doesn't exist, so you're getting ImagePullBackOff which usually means that the container runtime can't find the image to pull .
Looking at the Docker Hub page there's no 1 tag there, just latest.
So, either removing the tag or replace 1 with latest may resolve your issue.
I experienced this issue with aws instance types with low resources

Cluster-autoscaler not triggering scale-up on Daemonset deployment

I deployed the Datadog agent using the Datadog Helm chart which deploys a Daemonset in Kubernetes. However when checking the state of the Daemonset I saw it was not creating all pods:
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
datadog-agent-datadog 5 2 2 2 2 <none> 1h
When describing the Daemonset to figure out what was going wrong I saw it did not have enough resources:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x6 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 42s (x5 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 42s (x7 over 42s) daemonset-controller failed to place pod on "<ip>": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Normal SuccessfulCreate 42s daemonset-controller Created pod: datadog-agent-7b2kp
However, I have the Cluster-autoscaler installed in the cluster and configured properly (It does trigger on regular Pod deployments that do not have enough resources to schedule), but it does not seem to trigger on the Daemonset:
I0424 14:14:48.545689 1 static_autoscaler.go:273] No schedulable pods
I0424 14:14:48.545700 1 static_autoscaler.go:280] No unschedulable pods
The AutoScalingGroup has enough nodes left:
Did I miss something in the configuration of the Cluster-autoscaler? What can I do to make sure it triggers on Daemonset resources as well?
Edit:
Describe of the Daemonset
Name: datadog-agent
Selector: app=datadog-agent
Node-Selector: <none>
Labels: app=datadog-agent
chart=datadog-1.27.2
heritage=Tiller
release=datadog-agent
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=datadog-agent
Annotations: checksum/autoconf-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/checksd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
checksum/confd-config: 38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
Service Account: datadog-agent
Containers:
datadog:
Image: datadog/agent:6.10.1
Port: 8125/UDP
Host Port: 0/UDP
Limits:
cpu: 200m
memory: 256Mi
Requests:
cpu: 200m
memory: 256Mi
Liveness: http-get http://:5555/health delay=15s timeout=5s period=15s #success=1 #failure=6
Environment:
DD_API_KEY: <set to the key 'api-key' in secret 'datadog-secret'> Optional: false
DD_LOG_LEVEL: INFO
KUBERNETES: yes
DD_KUBERNETES_KUBELET_HOST: (v1:status.hostIP)
DD_HEALTH_PORT: 5555
Mounts:
/host/proc from procdir (ro)
/host/sys/fs/cgroup from cgroups (ro)
/var/run/docker.sock from runtimesocket (ro)
/var/run/s6 from s6-run (rw)
Volumes:
runtimesocket:
Type: HostPath (bare host directory volume)
Path: /var/run/docker.sock
HostPathType:
procdir:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
cgroups:
Type: HostPath (bare host directory volume)
Path: /sys/fs/cgroup
HostPathType:
s6-run:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedPlacement 33m (x6 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-144.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Normal SuccessfulCreate 33m daemonset-controller Created pod: datadog-agent-7b2kp
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-1-124.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1810, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-2-174.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
Warning FailedPlacement 16m (x25 over 33m) daemonset-controller failed to place pod on "ip-10-0-3-250.eu-west-1.compute.internal": Node didn't have enough resource: cpu, requested: 200, used: 1860, capacity: 2000
You can add priorityClassName to point to a high priority PriorityClass to your DaemonSet. Kubernetes will then remove other pods in order to run the DaemonSet's pods. If that results in unschedulable pods, cluster-autoscaler should add a node to schedule them on.
See the docs (Most examples based on that) (For some pre-1.14 versions, the apiVersion is likely a beta (1.11-1.13) or alpha version (1.8 - 1.10) instead)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "High priority class for essential pods"
Apply it to your workload
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: datadog-agent
spec:
template:
metadata:
labels:
app: datadog-agent
name: datadog-agent
spec:
priorityClassName: high-priority
serviceAccountName: datadog-agent
containers:
- image: datadog/agent:latest
############ Rest of template goes here
You should understand how cluster autoscaler works. It is responsible only for adding or removing nodes. It is not responsible for creating or destroying pods. So in your case cluster autoscaler is not doing anything because it's useless. Even if you add one more node - there will be still a requirement to run DaemonSet pods on nodes where is not enough CPU. That's why it is not adding nodes.
What you should do is to manually remove some pods from occupied nodes. Then it will be able to schedule DaemonSet pods.
Alternatively you can reduce CPU requests of Datadog to, for example, 100m or 50m. This should be enough to start those pods.

Kubernetes pod stuck in waiting state

Trying to start this pod
apiVersion: v1
kind: Pod
metadata:
name: tinyproxy
spec:
containers:
- name: master
image: asdrepo.isus.emc.com:8091/francisbesset/tinyproxy
env:
- name: MASTER
value: "true"
ports:
- containerPort: 6379
resources:
limits:
cpu: "0.1"
volumeMounts:
- mountPath: /tinyproxy-data
name: data
volumes:
- name: data
emptyDir: {}
This gets stuck in pending state. I looked in the troubleshooting guide, but this pod does not seem to have any events
$ kubectl describe pods tinyproxy
Name: tinyproxy
Namespace: default
Node: /
Labels: name=tinyproxy
Status: Pending
IP:
Controllers: <none>
Containers:
master:
Image: asdrepo.isus.emc.com:8091/francisbesset/tinyproxy
Port: 6379/TCP
QoS Tier:
cpu: Guaranteed
memory: BestEffort
Limits:
cpu: 100m
Requests:
cpu: 100m
Environment Variables:
MASTER: true
Volumes:
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
No events.
Also
$ kubectl get events
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
13m 13m 1 10.0.0.5 Node Normal Starting {kubelet 10.0.0.5} Starting kubelet.
13m 13m 2 10.0.0.5 Node Warning MissingClusterDNS {kubelet 10.0.0.5} kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. pod: "kube-proxy-10.0.0.5_kube-system(9fa6e0ea64b9f19ad6996367402408eb)". Falling back to DNSDefault policy.
13m 13m 1 10.0.0.5 Node Normal NodeHasSufficientDisk {kubelet 10.0.0.5} Node 10.0.0.5 status is now: NodeHasSufficientDisk
13m 13m 1 10.0.0.5 Node Normal Starting {kubelet 10.0.0.5} Starting kubelet.
13m 13m 1 10.0.0.5 Node Normal NodeHasSufficientDisk {kubelet 10.0.0.5} Node 10.0.0.5 status is now: NodeHasSufficientDisk
13m 13m 1 k8-dvawxybzux-0-a7m3diiryehx-kube-minion-itahxn4icom6 Node Normal Starting {kube-proxy k8-dvawxybzux-0-a7m3diiryehx-kube-minion-itahxn4icom6} Starting kube-proxy.
The proxy does seem to be running and is not restarting
bash-4.3# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d6dd779b301f gcr.io/google_containers/hyperkube:v1.2.0 "/hyperkube proxy --m" 15 minutes ago Up 15 minutes k8s_kube-proxy.d87e83d4_kube-proxy-10.0.0.5_kube-system_9fa6e0ea64b9f19ad6996367402408eb_caae92ac
8191770f15d9 gcr.io/google_containers/pause:2.0 "/pause" 15 minutes ago Up 15 minutes k8s_POD.6059dfa2_kube-proxy-10.0.0.5_kube-system_9fa6e0ea64b9f19ad6996367402408eb_e4da5a30
How do I debug this?
Looks like the scheduler service did not start (this is in an openstack VM). All services were supposed to be configured and started automatically. This worked after I started the service manually.