How to debug a GKE internal network issue?

UPDATE 1:
Some more logs from api-servers:
https://gist.github.com/nvcnvn/47df8798e798637386f6e0777d869d4f
This question is more about debugging methods for the current GKE setup, but solutions are also welcome.
We're using GKE version 1.22.3-gke.1500 with the following configuration:
We recently started facing an issue where commands like kubectl logs and kubectl exec don't work, and deleting a namespace takes forever.
Checking some services inside the cluster, it seems that some network operations just randomly fail. For example, metrics-server keeps crashing with these error logs:
message: "pkg/mod/k8s.io/client-go#v0.19.10/tools/cache/reflector.go:156: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.97.0.1:443/api/v1/nodes?resourceVersion=387681528": net/http: TLS handshake timeout"
HTTP request timeout also:
unable to fully scrape metrics: unable to fully scrape metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: unable to fetch metrics from node gke-staging-n2d-standard-8-78c35b3a-6h16: Get "http://10.148.15.217:10255/stats/summary?only_cpu_and_memory=true": context deadline exceeded
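To double-check the symptom from inside the cluster, the same endpoints can be probed from a throwaway pod (a rough sketch; the IPs come from the errors above, and the debug image and pod name are just placeholders):
kubectl run net-debug --rm -it --restart=Never --image=curlimages/curl --command -- sh
# then, from inside that shell:
curl -vk https://10.97.0.1:443/version
# a healthy API server answers within a second or two (401/403 is fine); a hang here reproduces the TLS handshake timeout
curl -v --max-time 10 "http://10.148.15.217:10255/stats/summary?only_cpu_and_memory=true"
# the kubelet read-only port on the affected node; a hang here matches the "context deadline exceeded" error from metrics-server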
I also tried restarting (by kubectl delete) most of the pods in this list:
kubectl get pod
NAME READY STATUS RESTARTS AGE
event-exporter-gke-5479fd58c8-snq26 2/2 Running 0 4d7h
fluentbit-gke-gbs2g 2/2 Running 0 4d7h
fluentbit-gke-knz2p 2/2 Running 0 85m
fluentbit-gke-ljw8h 2/2 Running 0 30h
gke-metadata-server-dtnvh 1/1 Running 0 4d7h
gke-metadata-server-f2bqw 1/1 Running 0 30h
gke-metadata-server-kzcv6 1/1 Running 0 85m
gke-metrics-agent-4g56c 1/1 Running 12 (3h6m ago) 4d7h
gke-metrics-agent-hnrll 1/1 Running 13 (13h ago) 30h
gke-metrics-agent-xdbrw 1/1 Running 0 85m
konnectivity-agent-87bc84bb7-g9nd6 1/1 Running 0 2m59s
konnectivity-agent-87bc84bb7-rkhhh 1/1 Running 0 3m51s
konnectivity-agent-87bc84bb7-x7pk4 1/1 Running 0 3m50s
konnectivity-agent-autoscaler-698b6d8768-297mh 1/1 Running 0 83m
kube-dns-77d9986bd5-2m8g4 4/4 Running 0 3h24m
kube-dns-77d9986bd5-z4j62 4/4 Running 0 3h24m
kube-dns-autoscaler-f4d55555-dmvpq 1/1 Running 0 83m
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-8299 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-fp5u 1/1 Running 0 11s
kube-proxy-gke-staging-n2d-standard-8-78c35b3a-rkdp 1/1 Running 0 11s
l7-default-backend-7db896cb4-mvptg 1/1 Running 0 83m
metrics-server-v0.4.4-fd9886cc5-tcscj 2/2 Running 82 33h
netd-5vpmc 1/1 Running 0 30h
netd-bhq64 1/1 Running 0 85m
netd-n6jmc 1/1 Running 0 4d7h
Some logs from the metrics-server:
https://gist.github.com/nvcnvn/b77eb02705385889961aca33f0f841c7

If you cannot use kubectl to get info from your cluster, try accessing it through its RESTful API directly:
http://blog.madhukaraphatak.com/understanding-k8s-api-part-2/
Try to delete the metrics-server pods or get their logs using podman or curl.
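For example, from a pod that still has a service account token mounted, something like this should work (a sketch: the API server address comes from the error above, the pod name from the listing, the container name is a guess, and RBAC must allow these calls):
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
APISERVER=https://10.97.0.1:443
# fetch logs of the crashing metrics-server pod
curl --cacert $CACERT -H "Authorization: Bearer $TOKEN" "$APISERVER/api/v1/namespaces/kube-system/pods/metrics-server-v0.4.4-fd9886cc5-tcscj/log?container=metrics-server"
# delete the pod so its Deployment recreates it
curl -X DELETE --cacert $CACERT -H "Authorization: Bearer $TOKEN" "$APISERVER/api/v1/namespaces/kube-system/pods/metrics-server-v0.4.4-fd9886cc5-tcscj"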

Related

Sumo Logic kubernetes integration requires that no Prometheus exists

I am currently working on integrating Sumo Logic in an AWS EKS cluster. After going through Sumo Logic's documentation on their integration with k8s, I have arrived at the following section: Installation Steps. This section of the documentation is a fork in the road where you must figure out how you want to continue with the installation:
side by side with your existing Prometheus Operator
and update your existing Prometheus Operator
with your standalone Prometheus (not using Prometheus Operator)
with no pre-existing Prometheus installation
With that said, I am trying to figure out which scenario I am in, as I am unsure.
Let me explain: before working on this Sumo Logic integration, I completed the New Relic integration, which makes me wonder whether it uses Prometheus in any way that could interfere with the Sumo Logic integration.
So in order to figure that out I started by executing:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
aws-alb-ingress-controller-1600289507-7c7dc6f57d-sgpd8 1/1 Running 1 7d19h
f5-admin-ui-5cbcc464df-lh8nl 1/1 Running 0 7d19h
f5-ambassador-5b5db5ff88-k5clw 1/1 Running 0 7d19h
f5-api-gateway-7bdfc9cb-q57lt 1/1 Running 0 7d19h
f5-argo-ui-7b98dd67-2cwrz 1/1 Running 0 7d19h
f5-auth-ui-58794664d9-rbccn 1/1 Running 0 7d19h
f5-classic-rest-service-0 1/1 Running 0 7d19h
f5-connector-plugin-service-box-7f8b48b88-8jxxq 1/1 Running 0 7d19h
f5-connector-plugin-service-ldap-5d79fd4b8b-8kpcj 1/1 Running 0 7d19h
f5-connector-plugin-service-sharepoint-77b5bdbf9b-vqx4t 1/1 Running 0 7d19h
f5-devops-ui-859c97fb97-ftdxh 1/1 Running 0 7d19h
f5-fusion-admin-64fb9df99f-svznw 1/1 Running 0 7d19h
f5-fusion-indexing-6bbc7d4bcd-jh7cf 1/1 Running 0 7d19h
f5-fusion-log-forwarder-78686cb8-shd6p 1/1 Running 0 7d19h
f5-insights-6d9795f57-62qbg 1/1 Running 0 7d19h
f5-job-launcher-9b659d984-n7h65 1/1 Running 3 7d19h
f5-job-rest-server-55586d8db-xrzcn 1/1 Running 2 7d19h
f5-ml-model-service-6c5bfd5b68-wwdkq 2/2 Running 0 7d19h
f5-pm-ui-cc64c9498-gdmvp 1/1 Running 0 7d19h
f5-pulsar-bookkeeper-0 1/1 Running 0 7d19h
f5-pulsar-bookkeeper-1 1/1 Running 0 7d19h
f5-pulsar-bookkeeper-2 1/1 Running 0 7d19h
f5-pulsar-broker-0 1/1 Running 0 7d19h
f5-pulsar-broker-1 1/1 Running 0 7d19h
f5-query-pipeline-84749b6b65-9hzcx 1/1 Running 0 7d19h
f5-rest-service-7855fdb676-6s6n8 1/1 Running 0 7d19h
f5-rpc-service-676bfbf7f-nmbgp 1/1 Running 0 7d19h
f5-rules-ui-6677475b8b-vbhcj 1/1 Running 0 7d19h
f5-solr-0 1/1 Running 0 20h
f5-templating-b6b964cdb-l4vjq 1/1 Running 0 7d19h
f5-webapps-798b4d6864-b92wt 1/1 Running 0 7d19h
f5-workflow-controller-7447466c89-pzpqk 1/1 Running 0 7d19h
f5-zookeeper-0 1/1 Running 0 7d19h
f5-zookeeper-1 1/1 Running 0 7d19h
f5-zookeeper-2 1/1 Running 0 7d19h
nri-bundle-kube-state-metrics-cdc9ffd85-2s688 1/1 Running 0 2d21h
nri-bundle-newrelic-infrastructure-fj9g9 1/1 Running 0 2d21h
nri-bundle-newrelic-infrastructure-jgckv 1/1 Running 0 2d21h
nri-bundle-newrelic-infrastructure-pv27n 1/1 Running 0 2d21h
nri-bundle-newrelic-logging-694hl 1/1 Running 0 2d21h
nri-bundle-newrelic-logging-7w8cj 1/1 Running 0 2d21h
nri-bundle-newrelic-logging-8gjw8 1/1 Running 0 2d21h
nri-bundle-nri-kube-events-865664658d-ztq89 2/2 Running 0 2d21h
nri-bundle-nri-metadata-injection-557855f78d-rzjxd 1/1 Running 0 2d21h
nri-bundle-nri-metadata-injection-job-cxmqg 0/1 Completed 0 2d21h
nri-bundle-nri-prometheus-ccd7b7fbd-2npvn 1/1 Running 0 2d21h
seldon-controller-manager-5b5f89545-6vxgf 1/1 Running 1 7d19h
As you can see, New Relic is running nri-bundle-nri-prometheus-ccd7b7fbd-2npvn, which seems to correspond to the New Relic Prometheus OpenMetrics integration for Kubernetes or Docker. Browsing through New Relic's documentation I found:
We currently offer two integration options:
Prometheus remote write integration. Use this if you currently have Prometheus servers and want easy access to your combined metrics from New Relic.
Prometheus OpenMetrics integration for Kubernetes or Docker. Use this if you're looking for an alternative or replacement to a Prometheus server and want to store all your metrics directly in New Relic.
So, from what I can gather, I am not running a Prometheus server or operator, and I can continue with the Sumo Logic integration setup by following the section dedicated to installation with no pre-existing Prometheus installation? This is what I am trying to clarify; I am wondering if someone can help, as I am new to Kubernetes and Prometheus.
I think you will most likely have to go with the installation option below:
with your standalone Prometheus (not using Prometheus Operator)
Can you check and paste the output of kubectl get prometheus? If you see any running Prometheus, you can run kubectl describe prometheus $prometheus_resource_name and check the labels to verify whether it was deployed by the operator or is a standalone Prometheus.
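A quick way to check could be (a sketch; resource and label names vary between installs):
kubectl get crd | grep -i monitoring.coreos.com
# no output means the Prometheus Operator CRDs are not installed at all
kubectl get prometheus --all-namespaces
# lists Prometheus custom resources managed by an operator, if any
kubectl get deploy,statefulset --all-namespaces | grep -i prometheus
# catches operator deployments as well as standalone prometheus-server workloads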
In case it is deployed by the Prometheus Operator, you can use either of these approaches:
side by side with your existing Prometheus Operator
update your existing Prometheus Operator
In the end I followed the Sumo Logic integration instructions dedicated to a setup with no pre-existing Prometheus installation and everything worked just fine.

Delete all the pods created by applying Helm 2.13.1

I'm new to Helm. I'm trying to deploy a simple server on the master node. When I do helm install and check the details with the command kubectl get po,svc, I see a lot of pods created other than the pods I intend to deploy. So my precise questions are:
Why so many pods got created?
How do I delete all those pods?
Below is the output of the command kubectl get po,svc:
NAME READY STATUS RESTARTS AGE
pod/altered-quoll-stx-sdo-chart-6446644994-57n7k 1/1 Running 0 25m
pod/austere-garfish-stx-sdo-chart-5b65d8ccb7-jjxfh 1/1 Running 0 25m
pod/bald-hyena-stx-sdo-chart-9b666c998-zcfwr 1/1 Running 0 25m
pod/cantankerous-pronghorn-stx-sdo-chart-65f5699cdc-5fkf9 1/1 Running 0 25m
pod/crusty-unicorn-stx-sdo-chart-7bdcc67546-6d295 1/1 Running 0 25m
pod/exiled-puffin-stx-sdo-chart-679b78ccc5-n68fg 1/1 Running 0 25m
pod/fantastic-waterbuffalo-stx-sdo-chart-7ddd7b54df-p78h7 1/1 Running 0 25m
pod/gangly-quail-stx-sdo-chart-75b9dd49b-rbsgq 1/1 Running 0 25m
pod/giddy-pig-stx-sdo-chart-5d86844569-5v8nn 1/1 Running 0 25m
pod/hazy-indri-stx-sdo-chart-65d4c96f46-zmvm2 1/1 Running 0 25m
pod/interested-macaw-stx-sdo-chart-6bb7874bbd-k9nnf 1/1 Running 0 25m
pod/jaundiced-orangutan-stx-sdo-chart-5699d9b44b-6fpk9 1/1 Running 0 25m
pod/kindred-nightingale-stx-sdo-chart-5cf95c4d97-zpqln 1/1 Running 0 25m
pod/kissing-snail-stx-sdo-chart-854d848649-54m9w 1/1 Running 0 25m
pod/lazy-tiger-stx-sdo-chart-568fbb8d65-gr6w7 1/1 Running 0 25m
pod/nonexistent-octopus-stx-sdo-chart-5f8f6c7ff8-9l7sm 1/1 Running 0 25m
pod/odd-boxer-stx-sdo-chart-6f5b9679cc-5stk7 1/1 Running 1 15h
pod/orderly-chicken-stx-sdo-chart-7889b64856-rmq7j 1/1 Running 0 25m
pod/redis-697fb49877-x5hr6 1/1 Running 0 25m
pod/rv.deploy-6bbffc7975-tf5z4 1/2 CrashLoopBackOff 93 30h
pod/sartorial-eagle-stx-sdo-chart-767d786685-ct7mf 1/1 Running 0 25m
pod/sullen-gnat-stx-sdo-chart-579fdb7df7-4z67w 1/1 Running 0 25m
pod/undercooked-cow-stx-sdo-chart-67875cc5c6-mwvb7 1/1 Running 0 25m
pod/wise-quoll-stx-sdo-chart-5db8c766c9-mhq8v 1/1 Running 0 21m
You can run the command helm ls to see all the deployed helm releases in your cluster.
To remove the release (and every resource it created, including the pods), run: helm delete RELEASE_NAME --purge.
If you want to delete all the pods in your namespace without removing your Helm release (I DON'T think this is what you're looking for), you can run: kubectl delete pods --all.
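Putting that together, a minimal Helm v2 cleanup session might look like this (RELEASE_NAME is whatever helm ls shows for your install):
helm ls
# note the NAME of the release you installed
helm delete RELEASE_NAME --purge
# removes the release and every resource it created, including the pods
kubectl get po,svc
# confirm the extra pods are gone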
On a side note, if you're new to Helm, consider starting with Helm v3, since it has many improvements, and especially because the migration from v2 to v3 can become cumbersome; if you can avoid it, you should.

Prometheus operator alertmanager-main-0 pending and restarting

What happened?
kubernetes version: 1.12
prometheus operator: release-0.1
I followed the README:
$ kubectl create -f manifests/
# It can take a few seconds for the above 'create manifests' command to fully create the following resources, so verify the resources are ready before proceeding.
$ until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; do date; sleep 1; echo ""; done
$ until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
$ kubectl apply -f manifests/ # This command sometimes may need to be done twice (to workaround a race condition).
Then I ran the command, and the output looked like this:
[root@VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 2/2 Running 0 66s
alertmanager-main-1 1/2 Running 0 47s
grafana-54f84fdf45-kt2j9 1/1 Running 0 72s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 57s
node-exporter-7mpjw 2/2 Running 0 72s
node-exporter-crfgv 2/2 Running 0 72s
node-exporter-l7s9g 2/2 Running 0 72s
node-exporter-lqpns 2/2 Running 0 72s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 72s
prometheus-k8s-0 3/3 Running 1 59s
prometheus-k8s-1 3/3 Running 1 59s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 72s
[root@VM_8_3_centos /data/hansenwu/kube-prometheus/manifests]# kubectl get pod -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-main-0 0/2 Pending 0 0s
grafana-54f84fdf45-kt2j9 1/1 Running 0 75s
kube-state-metrics-65b8dbf498-h7d8g 4/4 Running 0 60s
node-exporter-7mpjw 2/2 Running 0 75s
node-exporter-crfgv 2/2 Running 0 75s
node-exporter-l7s9g 2/2 Running 0 75s
node-exporter-lqpns 2/2 Running 0 75s
prometheus-adapter-5b6f856dbc-ndfwl 1/1 Running 0 75s
prometheus-k8s-0 3/3 Running 1 62s
prometheus-k8s-1 3/3 Running 1 62s
prometheus-operator-5c64c8969-lqvkb 1/1 Running 0 75s
I don't know why the pod alertmanager-main-0 goes Pending and then restarts.
Looking at the events, I see:
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
72s Warning FailedCreate StatefulSet create Pod alertmanager-main-0 in StatefulSet alertmanager-main failed error: The POST operation against Pod could not be completed at this time, please try again.
Most likely the alertmanager does not get enough time to start correctly.
Have a look at this answer: https://github.com/coreos/prometheus-operator/issues/965#issuecomment-460223268
You can set the paused field to true, and then modify the StatefulSet to check whether extending the liveness/readiness probes solves your issue.
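A rough sketch of that approach, assuming the kube-prometheus defaults (an Alertmanager resource named main in the monitoring namespace):
kubectl -n monitoring patch alertmanager main --type merge -p '{"spec":{"paused":true}}'
# paused stops the operator from reconciling the alertmanager-main StatefulSet
kubectl -n monitoring edit statefulset alertmanager-main
# now the liveness/readiness probe timings can be relaxed directly for testing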

Kubernetes dashboard connect: no route to host

I am running Kubernetes on bare metal and use the Kubernetes dashboard to manage the cluster. This works fine at first, but after 5-30 minutes, when I try to access the dashboard at:
http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/
I get the following error:
Error: 'dial tcp 10.35.0.19:8443: connect: no route to host'
Trying to reach: 'https://10.35.0.19:8443/'
All pods in kube-system are up and running if I check them with kubectl get pods -n kube-system:
NAME READY STATUS RESTARTS AGE
coredns-86c58d9df4-87pfc 1/1 Running 0 1m
coredns-86c58d9df4-tflg5 1/1 Running 0 1m
etcd-controller01 1/1 Running 5 1m
etcd-controller02 1/1 Running 6 1m
heapster-798ffb9b4-744q4 1/1 Running 0 1m
kube-apiserver-controller01 1/1 Running 1 1m
kube-apiserver-controller02 1/1 Running 3 1m
kube-controller-manager-controller01 1/1 Running 5 1m
kube-controller-manager-controller02 1/1 Running 2 1m
kube-proxy-8qqnq 1/1 Running 0 1m
kube-proxy-9vgck 1/1 Running 0 1m
kube-proxy-dht69 1/1 Running 0 1m
kube-proxy-f7bx8 1/1 Running 0 1m
kube-proxy-jnxtq 1/1 Running 0 1m
kube-proxy-l5h7m 1/1 Running 0 1m
kube-proxy-p9gt5 1/1 Running 0 1m
kube-proxy-zv4sr 1/1 Running 0 1m
kube-scheduler-controller01 1/1 Running 3 1m
kube-scheduler-controller02 1/1 Running 4 1m
kubernetes-dashboard-57df4db6b-px8xc 1/1 Running 0 1m
metrics-server-55d46868d4-s9j5v 1/1 Running 0 1m
monitoring-grafana-564f579fd4-fm6lm 1/1 Running 0 1m
monitoring-influxdb-8b7d57f5c-llgz9 1/1 Running 0 1m
weave-net-2b2dm 2/2 Running 1 1m
weave-net-988rf 2/2 Running 0 1m
weave-net-hcm5n 2/2 Running 0 1m
weave-net-kb2gk 2/2 Running 0 1m
weave-net-ksvbf 2/2 Running 0 1m
weave-net-q9zlw 2/2 Running 0 1m
weave-net-t9f6m 2/2 Running 0 1m
weave-net-vdspp 2/2 Running 0 1m
When I restart all pods in this namespace with kubectl delete pods --all -n kube-system, the dashboard sometimes works again for 5-30 minutes, and at other times it randomly starts working again on its own. I have tried restarting each pod in this namespace individually to track down which pod is causing the issue, but restarting the pods one by one does not bring the dashboard back up. Only deleting them all at once works.
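One way to narrow this down might be to check which pod currently owns the unreachable IP and then look at the overlay network pod on that node (a sketch; the IP comes from the error above, and the weave container name may differ):
kubectl get pods -n kube-system -o wide | grep 10.35.0.19
kubectl logs -n kube-system weave-net-2b2dm -c weave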
Does anybody have an idea why this happens and how I can fix this?
Thank you in advance!

(MISSING)TILLER: dial tcp 127.0.0.1:8080: connect: connection refused

I created a single-node Kubernetes cluster using minikube and installed Helm on it, but I am getting an issue while executing the helm ls and helm install commands. This is the issue I am facing:
"Get http://localhost:8080/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%!D(MISSING)TILLER: dial tcp 127.0.0.1:8080: connect: connection refused".
These are the pods running in the kube-system namespace:
ubuntu@openshift:~$ kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
default-http-backend-vqbh4 1/1 Running 1 6h
etcd-minikube 1/1 Running 0 6h
kube-addon-manager-minikube 1/1 Running 4 1d
kube-apiserver-minikube 1/1 Running 0 6h
kube-controller-manager-minikube 1/1 Running 0 6h
kube-dns-86f4d74b45-xxznk 3/3 Running 15 1d
kube-proxy-j28zs 1/1 Running 0 6h
kube-scheduler-minikube 1/1 Running 3 1d
kubernetes-dashboard-5498ccf677-89hrf 1/1 Running 8 1d
nginx-ingress-controller-tjljg 1/1 Running 3 6h
registry-wzwnq 1/1 Running 1 7h
storage-provisioner 1/1 Running 8 1d
tiller-deploy-75d848bb9-tmm9b 1/1 Running 0 4h
If you have any idea, please help me. Thanks.