Kubernetes: How are etcd component services health checked?

I have a k8s cluster in AWS that looks partially up, but won't actually do deployments. When looking at the health of components, etcd is shown as unhealthy. This looks like it's an issue with the etcd endpoints getting queried as http versus https:
kubectl --kubeconfig=Lab_42/kubeconfig.yaml get componentstatuses --namespace=default
NAME                 STATUS      MESSAGE                                                                                                  ERROR
controller-manager   Healthy     ok
scheduler            Healthy     ok
etcd-2               Unhealthy   Get http://ip-10-42-2-50.ec2.internal:2379/health: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
etcd-1               Unhealthy   Get http://ip-10-42-2-41.ec2.internal:2379/health: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
etcd-0               Unhealthy   Get http://ip-10-42-2-40.ec2.internal:2379/health: malformed HTTP response "\x15\x03\x01\x00\x02\x02"
I'm not using the --ca-config option, but putting the config values directly in the apiserver run. My apiserver config:
command:
- /hyperkube
- apiserver
- --advertise-address=10.42.2.50
- --admission_control=NamespaceLifecycle,NamespaceAutoProvision,LimitRanger,SecurityContextDeny,ServiceAccount,ResourceQuota
- --allow-privileged=true
- --authorization-mode=AlwaysAllow
- --bind-address=0.0.0.0
- --client-ca-file=/etc/ssl/kubernetes/k8s-ca.pem
- --etcd-cafile=/etc/ssl/etcd/etcd-ca.pem
- --etcd-certfile=/etc/ssl/etcd/etcd-client.pem
- --etcd-keyfile=/etc/ssl/etcd/etcd-client-key.pem
- --etcd-servers=https://127.0.0.1:2379
- --kubelet-certificate-authority=/etc/ssl/kubernetes/k8s-ca.pem
- --kubelet-client-certificate=/etc/ssl/kubernetes/k8s-apiserver-client.pem
- --kubelet-client-key=/etc/ssl/kubernetes/k8s-apiserver-client-key.pem
- --kubelet-https=true
- --logtostderr=true
- --runtime-config=extensions/v1beta1/deployments=true,extensions/v1beta1/daemonsets=true,api/all
- --secure-port=443
- --service-account-lookup=false
- --service-cluster-ip-range=10.3.0.0/24
- --tls-cert-file=/etc/ssl/kubernetes/k8s-apiserver.pem
- --tls-private-key-file=/etc/ssl/kubernetes/k8s-apiserver-key.pem
The actual problem is that simple deployments don't do anything, and I'm not sure whether etcd being unhealthy is causing the problem, as we have many other certificates in the mix.
kubectl --kubeconfig=Lab_42/kubeconfig.yaml get deployments --namespace=default
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         0         0            0           2h
I can actually query etcd directly if I use the local https endpoint:
/usr/bin/etcdctl --ca-file /etc/ssl/etcd/etcd-ca.pem --cert-file /etc/ssl/etcd/etcd-client.pem --key-file /etc/ssl/etcd/etcd-client-key.pem \
  --endpoints 'https://127.0.0.1:2379' \
  get /registry/minions/ip-10-42-2-50.ec2.internal | jq "."
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "ip-10-42-2-50.ec2.internal",
    "selfLink": "/api/v1/nodes/ip-10-42-2-50.ec2.internal",
...SNIP
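For reference, the /health endpoint that the component status check is hitting can also be queried directly over TLS with curl (a sketch reusing the same client certs as above; the output format may vary by etcd version):
curl --cacert /etc/ssl/etcd/etcd-ca.pem \
     --cert /etc/ssl/etcd/etcd-client.pem \
     --key /etc/ssl/etcd/etcd-client-key.pem \
     https://127.0.0.1:2379/health
# expected output, roughly: {"health": "true"}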

So it turns out that the component statuses were a red herring. The real problem was that my controller-manager configuration was wrong: the master was set to http://master_ip:8080 instead of http://127.0.0.1:8080. The insecure apiserver port is not exposed on external interfaces, so the controller manager could not connect.
Switching to either the insecure loopback port or the secure :443 endpoint solved my problem.
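For illustration, the change amounted to something like this in the controller-manager flags (a sketch, not my exact manifest; the kubeconfig path is hypothetical):
# broken: the apiserver's insecure port 8080 only listens on loopback
- --master=http://10.42.2.50:8080
# working: the insecure loopback port
- --master=http://127.0.0.1:8080
# or: the secure port, with a kubeconfig that carries the client certs
- --master=https://10.42.2.50:443
- --kubeconfig=/etc/kubernetes/controller-manager.kubeconfig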
When using the CoreOS hyperkube image and kubelet-wrapper, you lose out on the automatically linked container logs in /var/log/containers. To find those, you can do something like:
ls -latr /var/lib/docker/containers/*/*-json.log
I was actually able to see the errors causing my problem this way.

I think your kube-apiserver's config is missing the option --etcd-servers=xxx

Kubernetes Ingress Controller: Failed calling webhook, dial tcp connect: connection refused

I have set up a Kubernetes cluster (a master and a worker) on two CentOS 7 machines. They have the following IPs:
Master: 192.168.1.40
Worker: 192.168.1.41
They are accessible by SSH and I am not using a VPN. For both boxes, I have sudo access.
For the work I am doing, I had to add an Nginx Ingress Controller, which I did by doing:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.43.0/deploy/static/provider/baremetal/deploy.yaml
This YAML file seems fine to me and is the one commonly used when adding an nginx ingress controller to a Kubernetes cluster.
I don't see any errors when I do the above command.
However, when I try to install a helm configuration, such as:
helm install dai eggplant/dai --version 0.6.5 -f dai.yaml --namespace dai
I am getting an error with my Nginx Ingress Controller:
W0119 11:58:00.550727 60628 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/extensions/v1beta1/ingresses?timeout=30s": dial tcp 10.108.86.48:443: connect: connection refused
I think this is because of some kind of DNS error. I don't know where the IP 10.108.86.48:443 is coming from or how to find out.
I have also enabled a bunch of ports with firewall-cmd.
[root@manager-node ~]# sudo firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens33
sources:
services: dhcpv6-client ssh
ports: 6443/tcp 2379-2380/tcp 10250/tcp 10251/tcp 10252/tcp 10255/tcp 443/tcp 30154/tcp 31165/tcp
protocols:
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
However, my nginx ingress pod doesn't seem to start either:
NAME READY STATUS RESTARTS AGE
ingress-nginx-controller-7bc44b4bb-rwmh2 0/1 ContainerCreating 0 19h
It remains as ContainerCreating for hours.
The issue is that as part of that kubectl apply -f you are also applying a ValidatingWebhookConfiguration (check the applied manifest file).
See Using Admission Controllers | Kubernetes for more info.
The error you are seeing is because your Deployment is not starting up, and thus the ValidatingWebhook service configured as part of it isn't starting up either, so the validating admission controller in Kubernetes fails every request. The webhook is wired up through these args on the controller Deployment:
- --validating-webhook=:8443
- --validating-webhook-certificate=/usr/local/certificates/cert
- --validating-webhook-key=/usr/local/certificates/key
Your pod is most likely not starting for another reason. More information is required to further debug.
I would recommend removing the ValidatingWebhookConfiguration from the applied manifest.
You can also remove it manually with
kubectl delete validatingwebhookconfiguration ingress-nginx-admission
(ValidatingWebhookConfigurations aren't namespaced)
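If you would rather keep the webhook and debug the ContainerCreating pod first, a few generic commands help (namespace and pod name taken from the output above; the Service lookup also shows where the 10.108.86.48 ClusterIP comes from):
kubectl -n ingress-nginx get svc ingress-nginx-controller-admission
kubectl -n ingress-nginx describe pod ingress-nginx-controller-7bc44b4bb-rwmh2
kubectl -n ingress-nginx get events --sort-by=.metadata.creationTimestamp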

metric-server : TLS handshake error from 20.99.219.64:57467: EOF

I have deployed metrics-server using:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.6/components.yaml
The metrics-server pod is running, but in the logs I am getting errors:
I0601 20:00:01.321004 1 log.go:172] http: TLS handshake error from 20.99.219.64:34903: EOF
I0601 20:00:01.321160 1 log.go:172] http: TLS handshake error from 20.99.219.64:22575: EOF
I0601 20:00:01.332318 1 log.go:172] http: TLS handshake error from 20.99.219.64:14603: EOF
I0601 20:00:01.333174 1 log.go:172] http: TLS handshake error from 20.99.219.64:22517: EOF
I0601 20:00:01.351649 1 log.go:172] http: TLS handshake error from 20.99.219.64:3598: EOF
The IP 20.99.219.64 is not present in the cluster. I have checked using:
kubectl get all --all-namespaces -o wide | grep "20.99.219.64"
Nothing comes up in the output.
I am using Calico and initialized the cluster with --pod-network-cidr=20.96.0.0/12.
Also, kubectl top nodes is not working; I get this error:
node@kubemaster:~/Desktop/dashboard$ kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
During deployment of metrics-server, remember to add the following lines in the args section:
- args:
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
Also add the following lines at the pod spec level (spec.template.spec), outside of the containers list:
hostNetwork: true
restartPolicy: Always
Remember to apply changes.
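Putting it together, the relevant part of the metrics-server Deployment in components.yaml ends up looking roughly like this (a sketch; other fields such as ports, volumes and the existing args are left as shipped in the manifest):
spec:
  template:
    spec:
      hostNetwork: true
      restartPolicy: Always
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.6
        args:
        # keep whatever args the manifest already ships, and add:
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname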
Metrics-server attempts to authorize itself using token authentication. Please ensure that you're running your kubelets with webhook token authentication turned on.
Speaking about TLS directly: the TLS handshake messages are large packets, and if the MTU configured for Calico is wrong they get dropped, so change it according to the Calico project's MTU guidance.
Execute command:
$ kubectl edit configmap calico-config -n kube-system and change the MTU value from 1500 to 1430.
Take a look: metrics-server-mtu.
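In practice that edit looks roughly like this (a sketch; veth_mtu is the key name used by typical Calico manifests, and the calico-node pods need a restart to pick the change up):
kubectl -n kube-system edit configmap calico-config
#   veth_mtu: "1430"        # was "1500"
kubectl -n kube-system rollout restart daemonset calico-node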
I also ran into this problem and couldn't get metrics-server to work in my k8s cluster (kubectl version 1.25.4). I followed the instructions above and solved the issue!
I downloaded the components.yaml file and only added - --kubelet-insecure-tls to the args of the Deployment. Then the metrics-server worked!

Kubernetes RBAC - user has access to get pods but it says 'Unauthorized'

I have configured Keycloak for Kubernetes RBAC.
The user has access to get pods:
vagrant@haproxy:~/.kube$ kubectl auth can-i get pods --user=oidc
Warning: the server doesn't have a resource type 'pods'
yes
vagrant@haproxy:~/.kube$ kubectl get pods --user=oidc
error: You must be logged in to the server (Unauthorized)
My kubeconfig file for the user looks like the following:
users:
- name: oidc
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://test.example.com/auth/realms/kubernetes
      - --oidc-client-id=kubernetes
      - --oidc-client-secret=e479f74d-d9fd-415b-b1db-fd7946d3ad90
      - --username=test
      - --grant-type=authcode-keyboard
      command: kubectl
Is there any way to get this to work?
The issue was with the IP address of the cluster in the kubeconfig. You might have to use the DNS name instead of the IP address.
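For illustration, the cluster entry in the kubeconfig would then point at a DNS name covered by the API server certificate rather than a bare IP (the hostname here is hypothetical):
clusters:
- name: kubernetes
  cluster:
    certificate-authority-data: <base64-encoded CA>
    server: https://k8s-api.example.com:6443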

forbidden returned when mounting the default tokens in HA kubernetes cluster

I have a problem with mounting the default tokens in Kubernetes; it no longer works for me. I wanted to ask directly before creating an issue on GitHub. My setup consists of basically an HA bare-metal cluster with manually deployed etcd (which includes the CA certs and keys). The deployments run and the nodes register; I just cannot deploy pods, always getting the error:
MountVolume.SetUp failed for volume "default-token-ddj5s" : secrets "default-token-ddj5s" is forbidden: User "system:node:tweak-node-1" cannot get secrets in the namespace "default": no path found to object
where tweak-node-1 is one of my node names and hostnames. I have found some similar issues:
- https://github.com/kubernetes/kubernetes/issues/18239
- https://github.com/kubernetes/kubernetes/issues/25828
but none came close to fixing my issue, as the problem was not the same. I only use the default namespace when trying to run pods, and I tried setting both RBAC and ABAC; both gave the same result. This is the template I use for deployment, showing the version and etcd config:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
api:
  advertiseAddress: IP1
  bindPort: 6443
authorizationMode: ABAC
kubernetesVersion: 1.8.5
etcd:
  endpoints:
  - https://IP1:2379
  - https://IP2:2379
  - https://IP3:2379
  caFile: /opt/cfg/etcd/pki/etcd-ca.crt
  certFile: /opt/cfg/etcd/pki/etcd.crt
  keyFile: /opt/cfg/etcd/pki/etcd.key
  dataDir: /var/lib/etcd
  etcdVersion: v3.2.9
networking:
  podSubnet: 10.244.0.0/16
apiServerCertSANs:
- IP1
- IP2
- IP3
- DNS-NAME1
- DNS-NAME2
- DNS-NAME3
Your node must use credentials that match its Node API object name, as described in https://kubernetes.io/docs/admin/authorization/node/#overview
In order to be authorized by the Node authorizer, kubelets must use a credential that identifies them as being in the system:nodes group, with a username of system:node:<nodeName>. This group and user name format match the identity created for each kubelet as part of kubelet TLS bootstrapping.
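One quick way to check which identity a kubelet is presenting is to inspect its client certificate (the path below is an assumption; it varies by setup and version):
openssl x509 -in /var/lib/kubelet/pki/kubelet-client.crt -noout -subject
# expected something like: subject= O=system:nodes, CN=system:node:tweak-node-1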
Update:
So the specific solution: the problem was that I was using version 1.8.x and copying the certs and keys manually, so each kubelet didn't have its own system:node binding or specific key as specified in https://kubernetes.io/docs/admin/authorization/node/#overview:
RBAC Node Permissions In 1.8, the binding will not be created at all.
When using RBAC, the system:node cluster role will continue to be
created, for compatibility with deployment methods that bind other
users or groups to that role.
I fixed it using either of two ways:
1 - Using kubeadm join instead of copying the /etc/kubernetes files from master1
2 - After deployment, patching the clusterrolebinding for system:node:
kubectl patch clusterrolebinding system:node -p '{"apiVersion": "rbac.authorization.k8s.io/v1beta1", "kind": "ClusterRoleBinding", "metadata": {"name": "system:node"}, "subjects": [{"kind": "Group", "name": "system:nodes"}]}'

Istio Ingress resulting in "no healthy upstream"

I am deploying an outward-facing service, exposed behind a NodePort and then an Istio ingress. The deployment uses manual sidecar injection. Once the deployment, NodePort, and ingress are running, I can make a request to the Istio ingress.
For some unknown reason, the request does not route through to my deployment and instead displays the text "no healthy upstream". Why is this, and what is causing it?
I can see in the http response that the status code is 503 (Service Unavailable) and the server is "envoy". The deployment is functioning as I can map a port forward to it and everything works as expected.
Just in case you, like me, get curious... even though in my scenario the cause of the error was clear...
Error cause: I had two versions of the same service (v1 and v2), and an Istio VirtualService configured with a weighted traffic route destination: 95% goes to v1 and 5% goes to v2. As I didn't have v1 deployed (yet), of course the error "503 - no healthy upstream" showed up for 95% of the requests.
OK, even so, I knew the problem and how to fix it (just deploy v1), but I was wondering: how can I get more information about this error? How could I do a deeper analysis to find out what was happening?
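For context, the weighted route looked roughly like this (a sketch; service names match the istioctl output below):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: teachstore-course
  namespace: development
spec:
  hosts:
  - teachstore-course.development.svc.cluster.local
  http:
  - route:
    - destination:
        host: teachstore-course.development.svc.cluster.local
        subset: v1
      weight: 95
    - destination:
        host: teachstore-course.development.svc.cluster.local
        subset: v2
      weight: 5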
This is a way of investigating it using istioctl, Istio's configuration command-line utility:
# 1) Check the proxies status -->
$ istioctl proxy-status
# Result -->
NAME CDS LDS EDS RDS PILOT VERSION
...
teachstore-course-v1-74f965bd84-8lmnf.development SYNCED SYNCED SYNCED SYNCED istiod-86798869b8-bqw7c 1.5.0
...
...
# 2) Get the outbound cluster names from the JSON result for the proxy (the service with the problem) -->
$ istioctl proxy-config cluster teachstore-course-v1-74f965bd84-8lmnf.development --fqdn teachstore-student.development.svc.cluster.local -o json
# 2b) If you have jq installed locally (extracting only what we need) -->
$ istioctl proxy-config cluster teachstore-course-v1-74f965bd84-8lmnf.development --fqdn teachstore-course.development.svc.cluster.local -o json | jq -r .[].name
# Result -->
outbound|80||teachstore-course.development.svc.cluster.local
inbound|80|9180-tcp|teachstore-course.development.svc.cluster.local
outbound|80|v1|teachstore-course.development.svc.cluster.local
outbound|80|v2|teachstore-course.development.svc.cluster.local
# 3) Check the endpoints of "outbound|80|v2|teachstore-course..." using v1 proxy -->
$ istioctl proxy-config endpoints teachstore-course-v1-74f965bd84-8lmnf.development --cluster "outbound|80|v2|teachstore-course.development.svc.cluster.local"
# Result (the v2, 5% of the traffic route is ok, there are healthy targets) -->
ENDPOINT STATUS OUTLIER CHECK CLUSTER
172.17.0.28:9180 HEALTHY OK outbound|80|v2|teachstore-course.development.svc.cluster.local
172.17.0.29:9180 HEALTHY OK outbound|80|v2|teachstore-course.development.svc.cluster.local
# 4) However, for the v1 version "outbound|80|v1|teachstore-course..." -->
$ istioctl proxy-config endpoints teachstore-course-v1-74f965bd84-8lmnf.development --cluster "outbound|80|v1|teachstore-course.development.svc.cluster.local"
ENDPOINT STATUS OUTLIER CHECK CLUSTER
# Nothing! Empty, no Pods; that explains the "no healthy upstream" 95% of the time.
Although this is a somewhat general error resulting from a routing issue within an improper Istio setup, I will provide a general solution/piece of advice to anyone coming across the same issue.
In my case the issue was due to incorrect route rule configuration, the Kubernetes native services were functioning however the Istio routing rules were incorrectly configured so Istio could not route from the ingress into the service.
I faced the issue when my pod was in the ContainerCreating state, which resulted in the 503 error. Also, as @pegaldon explained, it can occur due to incorrect route configuration or when no gateways have been created by the user.
Delete the destinationrules.networking.istio.io and recreate the virtualservice.networking.istio.io:
[root@10-20-10-110 ~]# curl http://dprovider.example.com:31400/dw/provider/beat
no healthy upstream[root@10-20-10-110 ~]#
[root@10-20-10-110 ~]# curl http://10.210.11.221:10100/dw/provider/beat
"该服务节点 10.210.11.221 心跳正常!"[root@10-20-10-110 ~]#
[root@10-20-10-110 ~]#
[root@10-20-10-110 ~]# cat /home/example_service_yaml/vs/dw-provider-service.yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: dw-provider-service
  namespace: example
spec:
  hosts:
  - "dprovider.example.com"
  gateways:
  - example-gateway
  http:
  - route:
    - destination:
        host: dw-provider-service
        port:
          number: 10100
        subset: "v1-0-0"
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: dw-provider-service
  namespace: example
spec:
  host: dw-provider-service
  subsets:
  - name: "v1-0-0"
    labels:
      version: 1.0.0
[root@10-20-10-110 ~]# vi /home/example_service_yaml/vs/dw-provider-service.yaml
[root@10-20-10-110 ~]# kubectl -n example get vs -o wide | grep dw
dw-collection-service [example-gateway] [dw.collection.example.com] 72d
dw-platform-service [example-gateway] [dplatform.example.com] 81d
dw-provider-service [example-gateway] [dprovider.example.com] 21m
dw-sync-service [example-gateway] [dw-sync-service dsync.example.com] 34d
[root@10-20-10-110 ~]# kubectl -n example delete vs dw-provider-service
virtualservice.networking.istio.io "dw-provider-service" deleted
[root@10-20-10-110 ~]# kubectl -n example delete d dw-provider-service
daemonsets.apps deniers.config.istio.io deployments.extensions dogstatsds.config.istio.io
daemonsets.extensions deployments.apps destinationrules.networking.istio.io
[root@10-20-10-110 ~]# kubectl -n example delete destinationrules.networking.istio.io dw-provider-service
destinationrule.networking.istio.io "dw-provider-service" deleted
[root@10-20-10-110 ~]# kubectl apply -f /home/example_service_yaml/vs/dw-provider-service.yaml
virtualservice.networking.istio.io/dw-provider-service created
[root@10-20-10-110 ~]# curl http://dprovider.example.com:31400/dw/provider/beat
"该服务节点 10.210.11.221 心跳正常!"[root@10-20-10-110 ~]#
[root@10-20-10-110 ~]#
From my experience, the "no healthy upstream" error can have different causes. Usually, Istio has received ingress traffic that should be forwarded (the client request, or Istio downstream), but the destination is unavailable (the Istio upstream / Kubernetes service). This results in an HTTP 503 "no healthy upstream" error.
1.) Broken VirtualService definitions
If you have a destination in your VirtualService context where the traffic should be routed, ensure this destination exists (i.e. the hostname is correct and the service is reachable from this namespace).
2.) ImagePullBackOff / Terminating / Service is not available
Ensure your destination is available in general. Sometimes no pod is available, so no upstream will be available either.
3.) ServiceEntry - the same destination in two entries, but with different DNS rules
Check your namespace for ServiceEntry objects with:
kubectl -n <namespace> get serviceentry
If the result has more than one entry (multiple lines in one ServiceEntry object), check whether the same destination address (e.g. foo.com) appears in several lines.
If it does, ensure that the "DNS" column does not show different resolution settings (e.g. one line uses DNS, the other NONE). If so, this is an indicator that you are trying to apply different DNS settings to the same destination address (a sketch of such a collision is shown below, after this list).
A solution is:
a) to unify the DNS setting, setting all lines to NONE or DNS, but not to mix it up.
b) Ensure the destination (foo.com) is available in one line, and a collision of different DNS rules does not appear.
a) involves restarting the istio-ingressgateway pods (data plane) to make it work.
b) involves no restart of the Istio data plane or control plane.
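As a sketch of the collision described in 3.) above (hostnames hypothetical), two ServiceEntry objects declaring the same host with different resolution modes would look like this:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: foo-com-dns
spec:
  hosts:
  - foo.com
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: foo-com-none
spec:
  hosts:
  - foo.com
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: NONE     # collides with the DNS entry above for the same host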
Basically: it helps to check the status between the control plane (istiod) and the data plane (istio-ingressgateway) with
istioctl proxy-status
The output of istioctl proxy-status should show "SYNCED" in all columns; this confirms that the control plane and data plane are in sync. If not, you can restart the istio-ingressgateway deployment or the istiod deployment to force "fresh" processes.
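A sketch of that restart (deployment names assume a default istioctl installation):
kubectl -n istio-system rollout restart deployment istio-ingressgateway
kubectl -n istio-system rollout restart deployment istiod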
Further, it helped to run
istioctl analyze -A
to ensure that targets are checked in the VirtualService context and do exist. If a virtual service definition exists with routing definitions whose destination is unavailable, istioctl analyze -A can detect these unavailable destinations.
Furthermore, reading the logfiles of the istiod container helps. The istiod error messages often indicate the context of the error in the routing (which namespace and service or istio setting). You can use the default way with
kubectl -n istio-system logs <nameOfIstioDPod>
References:
https://istio.io/latest/docs/reference/config/networking/service-entry/
https://istio.io/latest/docs/reference/config/networking/virtual-service/
https://istio.io/latest/docs/ops/diagnostic-tools/proxy-cmd/