Is there a way of obtaining activity logs from Service objects in Kubernetes?

I have the following situation (this is my motivation to ask the question, not the question itself):
I've got a web application that accepts uploads from users.
The users access the application through an Ingress,
then a Service,
then a Deployment with two Pods.
The application runs in each Pod.
Sometimes the upload fails:
I can see in the logs from the Pod that the upload went all right.
I can even see the data uploaded by the user.
There is nothing but normal logs in the Pod.
But the Ingress reports an HTTP 500 error.
And the user sees an HTTP 500 error: connection reset by peer.
If the Pod seems all right but the Ingress complains, then I should check the middleman, the Service. Then I realized that there is no easy way to obtain logs from the Service.
So this is my question:
How can I read logs from the Service object? I mean activity logs, not the Deployment events.
Do they exist?

The only resources in K8s that produce logs are Pods! Pods lead to the creation of containers, which in turn lead to the creation of Linux processes on the K8s nodes. Those processes write logs that are "reaped" by the container runtime and made available to K8s, e.g. when you run kubectl logs.
Consequently, only K8s resources that are backed by Pods produce logs, e.g. Deployments, DaemonSets, StatefulSets and Jobs.
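For example, a minimal sketch (the name my-web-app and the app= label are placeholders for whatever your upload application uses):

    # Logs from one Pod of the Deployment (kubectl picks a Pod for you)
    kubectl logs deployment/my-web-app --all-containers=true

    # Or select all of the Deployment's Pods by label
    kubectl logs -l app=my-web-app --tail=100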
Services are merely logical resources that configure how network traffic is routed to Pods. So, in a way, they have underlying Pods, but they do not produce any additional log output. The only tangible outcome of a Service resource is a set of iptables rules on the K8s nodes that define how traffic is routed from the Service IP to the IPs of the underlying Pods.
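So, to see what a Service is actually doing, inspect its selector and the Endpoints it resolved to rather than looking for logs. A small sketch, assuming a Service named my-service (a placeholder):

    # Show the Service's selector, type and ports
    kubectl describe service my-service

    # Show the Pod IPs the Service currently routes to
    kubectl get endpoints my-service -o wide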
To resolve Ingress-related problems, you might get further insight from the logs of your ingress controller, which is typically deployed as a Deployment and therefore backed by Pods.
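For example, with the NGINX ingress controller (namespace and resource names vary by installation, so treat these as placeholders):

    # Find the ingress controller Pods
    kubectl get pods -n ingress-nginx

    # Show their recent log lines and look for the 500 / connection-reset entries
    kubectl logs -n ingress-nginx deployment/ingress-nginx-controller --tail=200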

Related

Web-Server running in an EKS cluster with spot-instances

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the NodePort service exposes a static port across all nodes in the cluster, there are many targets in said target group, which are registered and de-registered whenever a node is added to or removed from the cluster.
What exactly happens if a request from the client is routed to the service on a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community #gil-shelef!
Based on the AWS documentation, additional handlers should be used to add both resilience and cost savings.
Let's start with understanding how this works:
There is a specific node termination handler DaemonSet which adds a pod to each spot instance and listens for spot instance interruption notifications. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the removed pods on other instances.
The workflow (taken from the AWS documentation on Spot Instance Interruption Handling, which also has an example) can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods (manual equivalents of these steps are sketched below).
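The termination handler performs these steps automatically, but the manual equivalents look roughly like this (the node name is a placeholder):

    # Stop new Pods from being scheduled on the node
    kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal

    # Evict the running Pods so they get rescheduled elsewhere
    kubectl drain ip-10-0-1-23.eu-west-1.compute.internal \
      --ignore-daemonsets --delete-emptydir-data --grace-period=120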
Once the pods are removed from the endpoints, kube-proxy will trigger an update to iptables, which takes a little bit of time. To make this smoother for end users, you should consider adding a preStop pause of about 5-10 seconds (a sketch follows below). You can find more information about how this happens and how you can mitigate it in my answer here.
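A minimal sketch of such a preStop pause (container name and image are placeholders; the sleep just gives kube-proxy time to update iptables before the container receives SIGTERM):

    apiVersion: v1
    kind: Pod
    metadata:
      name: web
    spec:
      containers:
      - name: web
        image: registry.example.com/web:1.0   # placeholder image
        lifecycle:
          preStop:
            exec:
              # pause so in-flight requests can drain first
              command: ["sh", "-c", "sleep 10"]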
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS KB article on how to troubleshoot 504 errors in EKS.

Kubernetes traffic on deployments

Lately, we have found that many Kubernetes pods are running without any ingress/egress traffic. APM monitoring revealed the actual traffic flow in each pod.
Now I would like to terminate the pods that don't have any traffic over a period of time, so that I can reduce the number of worker nodes.
I need your help with the query below.
Is there a way we can find ingress/egress traffic at the deployment level? Currently it is shown at the pod level, but if I generate the report it includes pods that have already been terminated. It is difficult for me to get a historical report per pod, because whenever the pods get scaled, each one is created with a new name.
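One workaround for the name churn (a sketch; the Deployment name my-app and the app label are placeholders): the pods always carry the Deployment's selector labels even when they are recreated with new names, so reports can be grouped by label instead of by pod name.

    # See which labels the Deployment stamps onto its Pods
    kubectl get deployment my-app -o jsonpath='{.spec.selector.matchLabels}'

    # List the current Pods by that label, regardless of their generated names
    kubectl get pods -l app=my-app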

What exactly Kubernetes Services are and how they are different from Deployments

After reading through Kubernetes documents like this, deployment, service and this, I still do not have a clear idea what the purpose of a Service is.
It seems that the Service is used for two purposes:
expose the deployment to the outside world (e.g. using a LoadBalancer),
expose one deployment to another deployment (e.g. using ClusterIP services).
Is this the case? And what about the Ingress?
------ update ------
Connect a Front End to a Back End Using a Service is a good example of the service working with the deployment.
Service
A deployment consists of one or more pods and replicas of pods. Let's say we have 3 replicas of a pod running in a deployment. Now let's assume there is no service. How do other pods in the cluster access these pods? Through the IP addresses of these pods. What happens if one of the pods goes down? Kubernetes brings up another pod. Now the IP address list of these pods changes, and all the other pods need to keep track of it. The same is the case when autoscaling is enabled: the number of pods increases or decreases based on demand. To avoid this problem, services come into play. Thus a service basically manages the list of pod IPs for a deployment.
And yes, the service also covers the uses that you posted in the question.
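A minimal sketch of such a Service (names, labels and ports are placeholders): the selector is what keeps the list of Pod IPs up to date for you.

    apiVersion: v1
    kind: Service
    metadata:
      name: backend
    spec:
      type: ClusterIP          # only reachable from inside the cluster
      selector:
        app: backend           # matches the Pods of the backend Deployment
      ports:
      - port: 80               # port the Service exposes
        targetPort: 8080       # port the container listens on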
Ingress
Ingress is something that is used to provide a single point of entry for the various services in your cluster. Let's take a simple scenario: in your cluster there are two services, one for the web app and another for a documentation service. If you are using services alone and not an ingress, you need to maintain two load balancers, which might also cost more. To avoid this, an ingress, once defined, sits on top of the services and routes to them based on the rules and paths defined in the ingress.
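A minimal sketch of such an Ingress (host, paths and service names are placeholders), routing both services through a single entry point:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example
    spec:
      rules:
      - host: example.com
        http:
          paths:
          - path: /            # the web app
            pathType: Prefix
            backend:
              service:
                name: webapp
                port:
                  number: 80
          - path: /docs        # the documentation service
            pathType: Prefix
            backend:
              service:
                name: docs
                port:
                  number: 80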

Kubernetes pod/containers running but not listed with 'kubectl get pods'?

I have an issue that, at face value, appears to indicate that I have two deployments running in parallel within my kube cluster, but 'kubectl get pods' only shows one deployment.
My deployment is composed of a pod with two containers. One of the containers runs a golang application that creates an http API endpoint, and the other runs Telegraf to read metrics from the API endpoint and push them to InfluxDB. When writing the data to Influx I tag the data with the name of the pod as the source host. I use Grafana to plot the metrics and I can clearly see incoming streaming data coming from two hosts (e.g. I can set a "WHERE host=" query clause to either "application-pod-name-231620957-7n32f" or "application-pod-name-1931165991-x154c").
Based on the above, I'm fairly certain that two deployments of the pod are running, each with the two containers (one providing application metrics and the other with telegraf sending metrics to InfluxDB).
However, kube seems to think that one of the deployments doesn't exist. As mentioned, "kubectl get pods" doesn't display the 2nd pod name in any way, shape, or form; only one of them is shown.
Has anyone seen this? Any ideas on further troubleshooting? I've attempted to use the pod name (that I have within telegraf) to query more information using kubectl but always get the response that the pod doesn't exist... but it must exist! It's sending live data!
We had been experiencing issues with a node within the cluster. Specifically, the node was experiencing GC failures, and communication into the cluster from that node was broken. Because of these failures, someone on our team performed a 'kubectl delete' on the node from within the cluster. The node continued running, but the kubelet on it remained in a broken state, so the node couldn't automatically re-register itself with the cluster. This node happened to be running the 2nd pod, and the pods on that node continued running without issue. In our case the node was running on AWS, in which case the way to get out of this situation is to reboot the node either from the AWS console or via the AWS API.
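If you suspect the same situation, a couple of checks worth running (a sketch; nothing here is specific to the question): a node that was deleted from the API but is still powered on will simply be missing from the node list, while its pods keep running and reporting data.

    # Nodes currently registered with the API server
    kubectl get nodes -o wide

    # Pods the API server knows about, with the node each one is bound to
    kubectl get pods -o wide --all-namespaces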

liveness probes for manually created Endpoints

Is this a thing?
I have some legacy services which will never run in Kubernetes that I currently make available to my cluster by defining a service and manually uploading an endpoints object.
However, the service is horizontally sharded and we often need to restart one of the endpoints. My google-fu might be weak, but I can't figure out whether Kubernetes is clever enough to prevent the Service from repeatedly trying the dead endpoint.
The ideal behavior is that the proxy should detect the outage, mark the endpoint as failed, and at some point when the endpoint comes back re-admit it into the full list of working endpoints.
BTW, I understand that at present, liveness probes are HTTP only. This would need to be a TCP probe because it's a replicated database service that doesn't grok HTTP.
I think the design is for the thing managing the endpoint addresses to add/remove them based on liveness. For services backed by pods, the pod IPs are added to endpoints based on the pod's readiness check. If a pod's liveness check fails, it is deleted and its IP removed from the endpoint.
If you are manually managing endpoint addresses, the burden is currently on you (or your external health checker) to maintain the addresses/notReadyAddresses in the endpoint.
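A minimal sketch of such a manually managed pair (names, IPs and ports are placeholders): your external health checker would move an address between addresses and notReadyAddresses as shards go down and come back.

    apiVersion: v1
    kind: Service
    metadata:
      name: legacy-db
    spec:
      ports:
      - port: 5432            # no selector, so no Endpoints are managed for you
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: legacy-db         # must match the Service name
    subsets:
    - addresses:
      - ip: 10.0.0.11         # healthy shard, receives traffic
      notReadyAddresses:
      - ip: 10.0.0.12         # shard currently down / restarting
      ports:
      - port: 5432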