EKS: kubectl exec does not respect streamingConnectionIdleTimeout - kubernetes

Using EKS with Kubernetes 1.21, managed nodegroups in a private subnet. I'm trying to set the cluster up so that kubectl exec times out after inactivity regardless of the workload being execed into, and without any client configuration.
I'm aware of https://github.com/containerd/containerd/issues/5563, except we're on 1.21 with Docker runtime, not containerd yet.
I set streamingConnectionIdleTimeout: 3600s on the kubelet in the launch template:
# write to a temp file first; redirecting jq's output straight back into its
# input file would truncate the file before jq reads it
jq '.streamingConnectionIdleTimeout = "3600s"' /etc/kubernetes/kubelet/kubelet-config.json > /tmp/kubelet-config.json \
  && mv /tmp/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json
/etc/eks/bootstrap.sh {{CLUSTER_NAME}}
And I confirmed the setting took effect with curl -sSL "http://localhost:8001/api/v1/nodes/<node-name>/proxy/configz".
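For reference, a minimal sketch of that verification, assuming localhost:8001 is exposed by a kubectl proxy running on the client:
kubectl proxy &
# configz returns {"kubeletconfig": {...}}; filter out just the timeout
curl -sSL "http://localhost:8001/api/v1/nodes/<node-name>/proxy/configz" \
  | jq '.kubeletconfig.streamingConnectionIdleTimeout'
# expected output: "3600s"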
However, kubectl exec still does not time out.
I confirmed /proc/sys/net/ipv4/tcp_keepalive_time = 7200 on both the client and the node, so we should be hitting the streaming connection idle timeout before Linux starts sending keepalive probes.
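For reference, that value can be checked the same way on both ends:
# 7200 seconds = 2 hours before the kernel sends its first keepalive probe
cat /proc/sys/net/ipv4/tcp_keepalive_time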
Reading through "How kubectl exec Works", it seems possible that the EKS managed control plane is keeping the connection alive. There are people online who have the opposite problem (their connection times out regardless of streamingConnectionIdleTimeout), and they solve it by adjusting the idle timeout on the load balancer in front of their Kubernetes API server. However, there are no knobs (that I know of) to tweak in that regard on the EKS managed control plane.
I would appreciate any input on this topic.

Related

k8s, RabbitMQ, and Peer Discovery

We are trying to run an instance of the RabbitMQ chart with Helm from the helm/charts/stable/rabbit project. I had it running perfectly, but then I had to restart k8s for some maintenance. Now we are completely unable to launch the RabbitMQ chart in any way, shape, or form. I am not even trying to run the chart with any variables, i.e. just the default values.
Here is all I am doing:
helm install stable/rabbitmq
I have confirmed I can run the default chart on my local k8s, which I'm running with Docker for Desktop. When we run the rabbit chart on our shared k8s the exact same way as on desktop (and the same way we did before the restart), the following error is thrown:
Failed to get nodes from k8s - 503
I have also posted an issue on the Helm charts repo (helm/charts#10811).
We suspect DNS but are unable to confirm anything yet. What is very frustrating is that after the restart every single other chart we installed came back up perfectly, except Rabbit, which now will not start at all.
Does anyone know what I could do to get Rabbit's peer discovery to work? Has anyone seen an issue like this after restarting k8s?
So I actually got Rabbit to run. It turns out the k8s peer discovery could not connect over the default port 443; I had to use the external port 6443, because kubernetes.default.svc.cluster.local resolved to the public endpoint and could not reach the internal one. So yes, our config is messed up too.
It took me a while to realize that the variable below was not being overridden when I overrode it with helm install . -f server-values.yaml.
rabbitmq:
  configuration: |-
    ## Clustering
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    cluster_formation.k8s.port = 6443
    cluster_formation.node_cleanup.interval = 10
    cluster_formation.node_cleanup.only_log_warning = true
    cluster_partition_handling = autoheal
    # queue master locator
    queue_master_locator=min-masters
    # enable guest user
    loopback_users.guest = false
I had to add cluster_formation.k8s.port = 6443 to the chart's main values.yaml file instead of my own override file. Once the port was changed in values.yaml itself, Rabbit started right up.
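If you hit the same thing, one way to check whether an override actually makes it into the rendered config is a dry run; a sketch, reusing the install command from above:
helm install . -f server-values.yaml --dry-run --debug \
  | grep cluster_formation.k8s.port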
I'm wondering what the reason is for using the rabbit_peer_discovery_k8s plugin, if values.yaml defaults to 1 replica (your manifest file does not override this setting)?
I tried to reproduce your issue with the override values you provided (dev-server.yaml), as per the details in your GitHub issue #10811, but failed. Here are my observations:
If I install the RabbitMQ chart with your custom values, my rabbitmq-dev-default-0 pod gets stuck in a CrashLoopBackOff state.
It's quite hard for me to troubleshoot further, as the Bitnami rabbitmq image containers used by this Helm chart ship with a non-root account.
On the other hand, if the rabbitmq chart is installed on my Kubernetes cluster (v1.13.2) in its simplest form:
helm install stable/rabbitmq
then I observe a similar issue. The rabbitmq server survives a simulated VM restart of all cluster nodes (including the master), but I cannot connect to it from outside:
Post VM restart, I'm getting the following error from my Python mqclient:
socket.gaierror: [Errno -2] Name or service not known
Few remarks here:
Yes, I did the port-forward(s) as per the instructions shown by the helm status <release-name> command.
The readiness probe works fine:
curl -sS -f --user user:<my_pwd> 127.0.0.1:15672/api/healthchecks/node
{"status":"ok"}
rabbitmqctl to rabbitmq-server connectivity from inside the container works fine too:
kubectl exec rabbitmq-dev-default-0 -- rabbitmqctl list_queues
warning: the VM is running with native name encoding of latin1 which may cause Elixir to malfunction as it expects utf8. Please ensure your locale is set to UTF-8 (which can be verified by running "locale" in your shell)
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
name messages
hello 11
From the moment I used kubectl port-forward to the pod instead of the service, connectivity to the rabbitmq server was restored:
kubectl port-forward --namespace default pod/rabbitmq-dev-default-0 5672:5672
$ python send.py
[x] Sent 'Hello World!'
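For comparison, the two forms side by side (the service name here is a guess based on the pod name; take the real one from helm status):
kubectl port-forward --namespace default svc/rabbitmq-dev 5672:5672             # failed for me after the restart
kubectl port-forward --namespace default pod/rabbitmq-dev-default-0 5672:5672   # works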

Liveness probe test in a Google Cloud clustered Kubernetes environment

I want to test liveness probes in a Google Cloud clustered Kubernetes environment. How can I bring a pod or container down to test liveness probes?
The problem is that replica sets will automatically bring the pods back up if I delete any of them.
On Kubernetes, pods are mortal, and the number of live pods at any given time is guaranteed by the replicasets (which are wrapped by the deployments). So, to take your pods down, you can scale down your deployment to the number you need, or even to zero, like this:
kubectl scale deployment your-deployment-name --replicas=0
However, if you are trying to verify that the Kubernetes Service resource does not send packets to a non-live or non-ready pod, here's what you can do: create another pod with the same labels as your real application pods, so that the label selector in the Service matches this new pod as well. Configure that pod with a liveness/readiness probe that always fails, so it is never considered live/ready. Then hit your Service with requests and verify that it never reaches the new pod you created (see the sketch below).
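A minimal sketch of such a decoy pod, assuming the real pods carry the label app=my-app (both the label and the pod name are illustrative):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: probe-decoy
  labels:
    app: my-app               # same label your Service selector matches
spec:
  containers:
  - name: web
    image: nginx
    readinessProbe:
      httpGet:
        path: /no-such-path   # nothing listens on this port, so the probe
        port: 9999            # always fails and the pod never becomes Ready
      periodSeconds: 5
EOF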
The question is (quote) "...How can I bring a pod or container down to test liveness probes?". The type of probe isn't specified, but I'll assume it is HTTP GET or TCP Socket.
Assuming you have proper access to the node/host on which the pod is running:
Start a single pod.
Verify that the liveness probe checks out - that's it, it is working.
Find out on which node the pod is running. This, for example, will return the IP address:
kubectl -n <namespace> get pod <pod-name> -o jsonpath={.status.hostIP}
Log onto the node.
Find the PID of the application process. For example, list all processes (ps aux) and look for the specific process or grep by (part of the) name: ps aux | grep -i <name>. Take the number in the second column. For example, the PID in this ps aux partial output is 13314:
nobody 13314 0.0 0.6 145856 38644 ? Ssl 13:24 0:00 /bin/prometheus --storage....
While on the node, suspend (pause/stop) the process by executing kill -STOP <PID>. For example, for the PID from above:
kill -STOP 13314
At this point:
If there is no liveness probe defined, the pod should remain in Running status and not be restarted, even though it won't respond to connection attempts. To resume the stopped process, execute kill -CONT <PID>.
A properly configured HTTP GET or TCP Socket liveness probe, however, should fail because a connection with the application can't be established.
Notice that this method may also work for exec.command probes, depending on what those commands do.
Note also that most applications run as PID 1 in a (Docker) container. As the Docker docs explain: "A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so." That is probably the reason why this approach won't work from inside the container.
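For completeness, a minimal sketch of the kind of HTTP GET liveness probe this test assumes (the image, path, and port are illustrative); once the process is suspended with kill -STOP, the probe should fail failureThreshold times in a row and the kubelet should restart the container:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: my-app:latest      # hypothetical image serving HTTP on 8080
    livenessProbe:
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      periodSeconds: 10
      failureThreshold: 3     # ~30s of failed probes before a restart
EOF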

Gcloud Kubernetes and Redis memory store, intermittent issues, host not found

From time to time (once a week or so) we get into a weird state where our Kubernetes cluster is not able to connect to the Memorystore Redis service.
K8s master version: 1.10.7
gcloud beta redis instances list --region europe-west1
INSTANCE_NAME REGION TIER SIZE_GB HOST PORT NETWORK RESERVED_IP STATUS CREATE_TIME
chefclub-redis europe-west1 STANDARD_HA 1 10.0.10.4 6379 default 10.0.10.0/29 READY 2018-05-29T14:12:46
I'm getting a No route to host error:
kubectl run -i --tty busybox --image=busybox -- sh
If you don't see a command prompt, try pressing enter.
/ # telnet 10.0.10.4 6379
telnet: can't connect to remote host (10.0.10.4): No route to host
It has happened a few times in the past. I just did an upgrade of my nodes to 1.10.7 and everything went back in place; I could connect again.
I wonder what other steps I could take next time it happens?
Make sure you have followed the instructions on how to connect to a Redis instance from a cluster, as well as the troubleshooting doc. Note that if your cluster configuration has IP aliases enabled, the steps for connecting to the redis server may vary.
You can search through Stackdriver Logging for your Kubernetes pods and check the complete error messages during the affected timeframe. This will help you check for known issues on GitHub or in other Stack Overflow threads. An advanced Stackdriver logging filter to view pod logs:
resource.type="container" resource.labels.cluster_name="cluster_name"
resource.labels.namespace_id="k8s_namespace"
labels."container.googleapis.com/k8s_pod_name"="k8s_pod_name"
If you do not find any known issues and suspect that the issue could be on Google's end, you can create an issue using the Public Issue Tracker.

Changing run parameter for cockroachDB in kubernetes GKE

I have a running GKE cluster with CockroachDB active. It's been running for quite a while and I don't want to reinitialize it from scratch; it uses the (almost) standard CockroachDB-supplied yaml file to start. I need to change a switch in the exec line to modify the logging level, since currently it's set to the below, which logs all informational messages as well as errors:
exec /cockroach/cockroach start --logtostderr --insecure --advertise-host $(hostname -f) --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
How do I do this without completely stopping the DB?
Kubernetes allows you to update StatefulSets in a rolling manner, such that only one pod is brought down at a time.
The simplest way to make changes is to run kubectl edit statefulset cockroachdb. This will open up a text editor in which you can make the desired change to the command, then save and exit. After that, Kubernetes should handle replacing the pods one-by-one with new pods that use the new command.
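If you prefer to script the change instead of editing interactively, kubectl patch can make the same update. A sketch, assuming the container command array has the standard manifest's shape ["/bin/bash", "-ecx", "<start command>"], so the start command is element 2 (here dropping --logtostderr as an example; adjust the path and command to match your manifest):
kubectl patch statefulset cockroachdb --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/command/2",
   "value": "exec /cockroach/cockroach start --insecure --advertise-host $(hostname -f) --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%"}
]'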
For more information:
https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html#step-10-upgrade-the-cluster
https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/#updating-statefulsets
https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#in-place-updates-of-resources

kubernetes pods spawn across all servers but kubectl only shows 1 running and 1 pending

I have a new Kubernetes setup and I created a replication controller with 2 replicas. However, when I run kubectl get pods, one pod is running and the other is "pending". Yet when I go to my 7 test nodes and run docker ps, I see that all of them are running.
What I think is happening is that I had to change the default insecure port from 8080 to 7080 (the docker app actually runs on 8080); however, I don't know how to tell if I am right, or where else to look.
In the same vein, is there any way to set up config for kubectl where I can specify the port? Doing kubectl --server="" every time is a bit annoying (yes, I know I can alias this).
If you changed the API port, did you also update the nodes to point them at the new port?
For the kubectl --server=... question, you can use kubectl config set-cluster to set cluster info in your ~/.kube/config file and avoid having to pass --server every time (a short example follows the links below). See the following docs for details:
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-cluster.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-context.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_use-context.html
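A minimal sketch of that setup (the cluster/context names and server address are illustrative):
kubectl config set-cluster dev --server=http://your-master:7080
kubectl config set-context dev --cluster=dev
kubectl config use-context dev
# plain kubectl now targets the 7080 endpoint without --server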