GKE Internal Load Balancer does not distribute load between gRPC servers - kubernetes

I have an API that recently started receiving more traffic, about 1.5x. That also led to a doubling of the latency:
This surprised me since I had set up autoscaling of both nodes and pods, as well as GKE internal load balancing.
My external API passes requests on to an internal server which uses a lot of CPU. Looking at my VM instances, it seems like all of the traffic was sent to one of my two VM instances (a.k.a. Kubernetes nodes):
With load balancing I would have expected the CPU usage to be more evenly divided between the nodes.
Looking at my deployment there is one pod on the first node:
And two pods on the second node:
My service config:
$ kubectl describe service model-service
Name: model-service
Namespace: default
Labels: app=model-server
Annotations: networking.gke.io/load-balancer-type: Internal
Selector: app=model-server
Type: LoadBalancer
IP Families: <none>
IP: 10.3.249.180
IPs: 10.3.249.180
LoadBalancer Ingress: 10.128.0.18
Port: rest-api 8501/TCP
TargetPort: 8501/TCP
NodePort: rest-api 30406/TCP
Endpoints: 10.0.0.145:8501,10.0.0.152:8501,10.0.1.135:8501
Port: grpc-api 8500/TCP
TargetPort: 8500/TCP
NodePort: grpc-api 31336/TCP
Endpoints: 10.0.0.145:8500,10.0.0.152:8500,10.0.1.135:8500
Session Affinity: None
External Traffic Policy: Cluster
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal UpdatedLoadBalancer 6m30s (x2 over 28m) service-controller Updated load balancer with new hosts
The fact that Kubernetes started a new pod seems like a clue that Kubernetes autoscaling is working. But the pods on the second VM do not receive any traffic. How can I make GKE balance the load more evenly?
Update Nov 2:
Goli's answer leads me to think that it has something to do with the setup of the model service. The service exposes both a REST API and a gRPC API, but the gRPC API is the one that receives traffic.
There is a corresponding forwarding rule for my service:
$ gcloud compute forwarding-rules list --filter="loadBalancingScheme=INTERNAL"
NAME REGION IP_ADDRESS IP_PROTOCOL TARGET
aab8065908ed4474fb1212c7bd01d1c1 us-central1 10.128.0.18 TCP us-central1/backendServices/aab8065908ed4474fb1212c7bd01d1c1
Which points to a backend service:
$ gcloud compute backend-services describe aab8065908ed4474fb1212c7bd01d1c1
backends:
- balancingMode: CONNECTION
group: https://www.googleapis.com/compute/v1/projects/questions-279902/zones/us-central1-a/instanceGroups/k8s-ig--42ce3e0a56e1558c
connectionDraining:
drainingTimeoutSec: 0
creationTimestamp: '2021-02-21T20:45:33.505-08:00'
description: '{"kubernetes.io/service-name":"default/model-service"}'
fingerprint: lA2-fz1kYug=
healthChecks:
- https://www.googleapis.com/compute/v1/projects/questions-279902/global/healthChecks/k8s-42ce3e0a56e1558c-node
id: '2651722917806508034'
kind: compute#backendService
loadBalancingScheme: INTERNAL
name: aab8065908ed4474fb1212c7bd01d1c1
protocol: TCP
region: https://www.googleapis.com/compute/v1/projects/questions-279902/regions/us-central1
selfLink: https://www.googleapis.com/compute/v1/projects/questions-279902/regions/us-central1/backendServices/aab8065908ed4474fb1212c7bd01d1c1
sessionAffinity: NONE
timeoutSec: 30
Which has a health check:
$ gcloud compute health-checks describe k8s-42ce3e0a56e1558c-node
checkIntervalSec: 8
creationTimestamp: '2021-02-21T20:45:18.913-08:00'
description: ''
healthyThreshold: 1
httpHealthCheck:
host: ''
port: 10256
proxyHeader: NONE
requestPath: /healthz
id: '7949377052344223793'
kind: compute#healthCheck
logConfig:
enable: true
name: k8s-42ce3e0a56e1558c-node
selfLink: https://www.googleapis.com/compute/v1/projects/questions-279902/global/healthChecks/k8s-42ce3e0a56e1558c-node
timeoutSec: 1
type: HTTP
unhealthyThreshold: 3
List of my pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
api-server-deployment-6747f9c484-6srjb 2/2 Running 3 3d22h
label-server-deployment-6f8494cb6f-79g9w 2/2 Running 4 38d
model-server-deployment-55c947cf5f-nvcpw 0/1 Evicted 0 22d
model-server-deployment-55c947cf5f-q8tl7 0/1 Evicted 0 18d
model-server-deployment-766946bc4f-8q298 1/1 Running 0 4d5h
model-server-deployment-766946bc4f-hvwc9 0/1 Evicted 0 6d15h
model-server-deployment-766946bc4f-k4ktk 1/1 Running 0 7h3m
model-server-deployment-766946bc4f-kk7hs 1/1 Running 0 9h
model-server-deployment-766946bc4f-tw2wn 0/1 Evicted 0 7d15h
model-server-deployment-7f579d459d-52j5f 0/1 Evicted 0 35d
model-server-deployment-7f579d459d-bpk77 0/1 Evicted 0 29d
model-server-deployment-7f579d459d-cs8rg 0/1 Evicted 0 37d
How do I A) confirm that this health check is in fact showing 2/3 backends as unhealthy? And B) configure the health check to send traffic to all of my backends?
Update Nov 5:
After finding that several pods had been evicted in the past because of too little RAM, I migrated the pods to a new node pool. The old node pool VMs had 4 CPUs and 4 GB memory; the new ones have 2 CPUs and 8 GB memory. That seems to have resolved the eviction/memory issues, but the load balancer still only sends traffic to one pod at a time.
Pod 1 on node 1:
Pod 2 on node 2:
It seems like the load balancer is not splitting the traffic at all but just randomly picking one of the gRPC model servers and sending 100% of the traffic there. Is there some configuration that I missed which causes this behavior? Is this related to my using gRPC?

Turns out the answer is that you cannot load-balance gRPC requests using a GKE load balancer.
A GKE load balancer (like Kubernetes' default load balancer) picks a new backend every time a new TCP connection is formed. For regular HTTP/1.1 traffic, connections are short-lived enough that requests end up spread over the backends and the load balancer works fine. For gRPC (which is based on HTTP/2), a TCP connection is set up only once and all requests are multiplexed over that same connection, so connection-level balancing sends everything to a single backend.
More details in this blog post.
To enable gRPC load balancing I had to:
Install Linkerd
curl -fsL https://run.linkerd.io/install | sh
linkerd install | kubectl apply -f -
Inject the Linkerd proxy in both the receiving and sending pods:
kubectl apply -f api_server_deployment.yaml
kubectl apply -f model_server_deployment.yaml
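(For reference, the manifests applied above presumably carry the linkerd.io/inject annotation; a common way to get the proxy injected, assuming the manifest file names above, is to pipe each manifest through linkerd inject before applying it.)
# Sketch, assuming the manifest file names above: linkerd inject marks the pod
# templates for proxy injection, then kubectl applies the result.
linkerd inject api_server_deployment.yaml | kubectl apply -f -
linkerd inject model_server_deployment.yaml | kubectl apply -f -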
After realizing that Linkerd would not work together with the GKE load balancer, I exposed the receiving deployment as a ClusterIP service instead.
kubectl expose deployment/model-server-deployment
Pointed the gRPC client to the ClusterIP service IP address I just created, and redeployed the client.
kubectl apply -f api_server_deployment.yaml
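(A quick way to look up the ClusterIP to point the client at; kubectl expose names the new service after the deployment by default.)
# Print the ClusterIP of the service created by "kubectl expose" above
kubectl get service model-server-deployment -o jsonpath='{.spec.clusterIP}'
The service can also be addressed by its DNS name, model-server-deployment.default.svc.cluster.local, which keeps working if the service is ever recreated with a new ClusterIP.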

Google Cloud provides health checks to determine whether backends respond to traffic. Health checks connect to backends on a configurable, periodic basis. Each connection attempt is called a probe. Google Cloud records the success or failure of each probe.
Based on a configurable number of sequential successful or failed probes, an overall health state is computed for each backend. Backends that respond successfully for the configured number of times are considered healthy.
Backends that fail to respond successfully for a separately configurable number of times are unhealthy.
The overall health state of each backend determines eligibility to receive new requests or connections. So one possible reason an instance is not receiving requests is that it is unhealthy. Refer to this documentation for creating health checks.
You can configure the criteria that define a successful probe. This is discussed in detail in the section How health checks work.
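To answer part A of the question, the load balancer's own view of backend health can be queried directly, for example using the backend service name and region from the question:
# Show per-instance health as seen by the internal load balancer
gcloud compute backend-services get-health aab8065908ed4474fb1212c7bd01d1c1 \
    --region=us-central1
This should list each instance in the backend instance group with a healthState of HEALTHY or UNHEALTHY.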
Edit1:
A Pod is evicted from its node due to a lack of resources, or when the node fails. If a node fails, Pods on the node are automatically scheduled for deletion.
To find the exact reason a pod was evicted, run
kubectl describe pod <pod name> and note which node the pod ran on. Then run kubectl describe node <node-name>, which shows what kind of resource cap the node is hitting under the Conditions: section.
From my experience this happens when the host node runs out of disk space.
Also, after starting the pod you should run kubectl logs <pod-name> -f and check the logs for more detailed information.
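Putting this together for one of the evicted pods listed in the question (the node name is a placeholder to be taken from the describe output):
# Why was the pod evicted, and on which node was it running?
kubectl describe pod model-server-deployment-55c947cf5f-nvcpw
# What resource pressure is that node reporting under Conditions?
kubectl describe node <node-name>
# Follow the logs of a running replacement pod
kubectl logs model-server-deployment-766946bc4f-8q298 -f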
Refer to this documentation for more information on eviction.

Related

Which Redis sentinel server should I connect to while connecting to Redis?

I am new to Redis and Kubernetes and have a Kubernetes cluster set up with 3 Redis and 3 Sentinel pods:
kubernetes % kubectl -n redis get pods
NAME READY STATUS RESTARTS AGE
redis-0 1/1 Running 0 7d16h
redis-1 1/1 Running 0 7d15h
redis-2 1/1 Running 0 8d
sentinel-0 1/1 Running 0 9d
sentinel-1 1/1 Running 0 9d
sentinel-2 1/1 Running 0 9d
I have successfully connected the Sentinels and the Redis master to each other and was able to test basic HA operations with redis-cli by exec'ing into the pods. Now I want to connect my external Java application to this Redis cluster.
To the best of my understanding, we are supposed to connect to Sentinel, and Sentinel will point us to the Redis master pod where write operations can be executed.
So I have a few doubts regarding this: Can I connect to any of the Sentinel servers and execute my operations, or should I always connect to the master? And if we are connected to a Sentinel and that Sentinel goes down, what should be the plan of action or best practice in this regard? I have read a few blogs but can't seem to reach a clear understanding.
What should I do to connect my Spring Boot application to this cluster? I read a bit on this too, and it seems I can't connect to a minikube cluster directly from my IntelliJ/local machine and instead need to create an image and deploy it in the same namespace. Is there any workaround for this?
Below is the YAML file for my Redis service:
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    name: redis
  ports:
    - port: 6379
      targetPort: 6379
Thanks
Yes, you should be able to connect to any Sentinel. From https://redis.io/topics/sentinel-clients:
The client should iterate the list of Sentinel addresses. For every
address it should try to connect to the Sentinel, using a short
timeout (in the order of a few hundreds of milliseconds). On errors or
timeouts the next Sentinel address should be tried.
If all the Sentinel addresses were tried without success, an error
should be returned to the client.
The first Sentinel replying to the client request should be put at the
start of the list, so that at the next reconnection, we'll try first
the Sentinel that was reachable in the previous connection attempt,
minimizing latency.
If you are using a Redis client library that supports Sentinel, you can just pass all the Sentinel addresses to the client, and the client library takes care of the connection logic for you, as recommended by the Redis documentation above.
Since you are on Kubernetes, you can make it simpler. Instead of deploying the Sentinels as a StatefulSet with 3 replicas like you have done, deploy them as a Deployment with 3 replicas and create a Service object for the Redis Sentinels. Pass this service address as the Sentinel address to your Redis client library. This way you don't need to work with multiple Sentinel addresses, and Kubernetes automatically removes Sentinels that are down or unreachable from the service endpoints, so your clients don't need to discover which Sentinels are online.
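A minimal sketch of such a Service, assuming the Sentinel pods carry an app: sentinel label and listen on the default Sentinel port 26379 (adjust both to your setup):
# Sketch only: a Service in front of the Sentinel pods in the redis namespace.
kubectl apply -n redis -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: sentinel
spec:
  selector:
    app: sentinel       # assumed label on the Sentinel pods
  ports:
    - port: 26379       # default Sentinel port; adjust if yours differs
      targetPort: 26379
EOF
Your client then uses sentinel.redis.svc.cluster.local:26379 as its single Sentinel address.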
You could also use a Redis operator like https://github.com/spotahome/redis-operator, which takes care of the deployment lifecycle of Sentinel-based Redis HA clusters.

Can a deployment be completed even when readiness probe is failing

I have an application running in Kubernetes as a StatefulSet that starts 2 pods. It has a liveness probe and a readiness probe configured.
The liveness probe calls a simple /health endpoint that responds when the server is done loading.
The readiness probe waits for a start-up job to complete. The job can take several minutes in some cases, and only when it finishes is the application's API ready to start accepting requests.
Even when the API is not available, my app also runs side jobs that don't depend on it, and I expect them to run while the startup is happening too.
Is it possible to force the Kubernetes deployment to complete and deploy both pods, even when the readiness probe is still not passing?
From the docs I get that the only effect of a readiness probe not passing is that the current pod won't be included as available in the loadbalancer service (which is actually the only effect that I want).
If the readiness probe fails, the endpoints controller removes the
Pod's IP address from the endpoints of all Services that match the
Pod.
However, I am also seeing that the deployment never finishes, since pod 1's readiness probe is not passing and pod 2 is never created.
kubectl rollout restart statefulset/pod
kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-0 1/2 Running 0 28m
If a readiness probe failure always prevents the deployment from completing, is there another way to selectively expose only ready pods in the load balancer, without marking them as unready during the deployment?
Thanks in advance!
StatefulSet deployment
Is it possible to force kubernetes deployment to complete and deploy 2
pods, even when the readiness probe is still not passing?
Assuming a StatefulSet is meant here rather than a Deployment object, the answer is no, it's not possible by design; the second point below is the most important:
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
When the nginx example above is created, three Pods will be deployed
in the order web-0, web-1, web-2. web-1 will not be deployed before
web-0 is Running and Ready, and web-2 will not be deployed until web-1
is Running and Ready
StatefulSets - Deployment and scaling guarantees
Readiness probe, endpoints and potential workaround
If a readiness probe failure always prevents the deployment, is there another way to
selectively expose only ready pods in the load balancer, while not marking them as
unready during the deployment?
This is by design; pods are added to service endpoints once they are in the ready state.
A potential workaround can be used. At least in a simple example it does work, but you should try it and evaluate whether this approach suits your case; it is fine to use for the initial deployment.
The StatefulSet can be started without the readiness probe included; this way the StatefulSet will start the pods one by one as soon as the previous one is running and ready. The liveness probe may need initialDelaySeconds set up so Kubernetes won't restart the pod thinking it's unhealthy. Once the StatefulSet is fully up and ready, you can add the readiness probe back to the StatefulSet.
When the readiness probe is added, Kubernetes will restart all pods again, starting from the last one, and your application will need to start again.
The idea is that all pods start and are able to serve requests at roughly the same time, whereas with the readiness probe applied from the beginning, only one pod starts in the first 5 minutes (for instance), the next pod takes 5 more minutes, and so on.
Example
A simple example to see what's going on, based on an nginx webserver and a sleep 30 command that delays nginx startup, so that with the readiness probe in place Kubernetes considers the pod not ready for the first 30 seconds.
Apply the headless service.
Comment out the readiness probe in the StatefulSet and apply the manifest.
Observe that each pod is created right after the previous pod is running and ready.
Uncomment the readiness probe and apply the manifest again.
Kubernetes will recreate all pods starting from the last one, this time waiting for the readiness probe to pass before flagging a pod as running and ready.
It is very convenient to watch the progress with this command:
watch -n1 kubectl get pods -o wide
nginx-headless-svc.yaml:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
nginx-statefulset.yaml:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          name: web
        command: ["/bin/bash", "-c"]
        args: ["sleep 30 ; echo sleep completed ; nginx -g \"daemon off;\""]
        readinessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 1
          periodSeconds: 5
Update
Thanks to #jesantana for this much easier solution.
If all pods have to be scheduled at once and it's not necessary to wait for pod readiness, .spec.podManagementPolicy can be set to Parallel. See Pod Management Policies.
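A sketch of what that looks like for the nginx example above; note that podManagementPolicy cannot be changed on an existing StatefulSet, so the object has to be recreated (the headless service can stay):
# Recreate the example StatefulSet with parallel pod management.
kubectl delete statefulset nginx
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx
spec:
  podManagementPolicy: Parallel   # start/terminate pods in parallel instead of one by one
  serviceName: "nginx"
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx            # the readiness probe and command from the manifest above can be kept
        ports:
        - containerPort: 80
          name: web
EOF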
Useful links:
Kubernetes statefulsets
Kubernetes liveness, readiness and startup probes

Is there any way to know which pod the service is load-balanced in Kubernetes?

I manage 3 Pods through a Deployment and connect to them through a Service's NodePort.
I wonder which pod the service load-balances to whenever I connect from outside.
It's hard to check with the pods' logs; can I find out through events or a kubectl command?
I am not sure if this is exactly what you're looking for, but you can use Istio to generate detailed telemetry for all service communications.
You may be particularly interested in Distributed tracing:
Istio generates distributed trace spans for each service, providing operators with a detailed understanding of call flows and service dependencies within a mesh.
By using distributed tracing, you are able to monitor every request as it flows through the mesh.
More information about Distributed Tracing with Istio can be found in the FAQ on Distributed Tracing documentation.
Istio supports multiple tracing backends (e.g. Jaeger).
Jaeger is a distributed tracing system similar to OpenZipkin and as we can find in the jaegertracing documentation:
It is used for monitoring and troubleshooting microservices-based distributed systems, including:
Distributed context propagation
Distributed transaction monitoring
Root cause analysis
Service dependency analysis
Performance / latency optimization
Of course, you don't need to install Istio to use Jaeger, but you'll have to instrument your application so that trace data from different parts of the stack are sent to Jaeger.
I'll show you how you can use Jaeger to monitor a sample request.
Suppose I have an app-1 Deployment with three Pods exposed using the NodePort service.
$ kubectl get pod,deploy,svc
NAME READY STATUS RESTARTS AGE IP
app-1-7ddf4f77c6-g682z 2/2 Running 0 25m 10.60.1.11
app-1-7ddf4f77c6-smlcr 2/2 Running 0 25m 10.60.0.7
app-1-7ddf4f77c6-zn7kh 2/2 Running 0 25m 10.60.2.5
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/app-1 3/3 3 3 21m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/app-1 NodePort 10.64.0.88 <none> 80:30881/TCP 25m
Additionally, I deployed jaeger (with istio):
$ kubectl get deploy -n istio-system | grep jaeger
jaeger 1/1 1 1 67m
To check if Jaeger is working as expected, I will try to connect to this app-1 application from outside the cluster (using the NodePort service):
$ curl <PUBLIC_IP>:30881
app-1
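To browse the collected traces, the Jaeger UI can be opened locally; one way, assuming istioctl is available on the workstation, is:
# Port-forward to the Jaeger UI in the istio-system namespace and open it in a local browser
istioctl dashboard jaeger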
Let's find this trace with Jaeger:
As you can see, we can easily find out which Pod has received our request.

Why does k8s wordpress and mysql example use headless service for wordpress-mysql?

The example is described here - https://kubernetes.io/docs/tutorials/stateful-application/mysql-wordpress-persistent-volume/
The Service object for the wordpress-mysql is:
apiVersion: v1
kind: Service
metadata:
  name: wordpress-mysql
  labels:
    app: wordpress
spec:
  ports:
    - port: 3306
  selector:
    app: wordpress
    tier: mysql
  clusterIP: None
Headless services are documented here - https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
The Service definition defines selectors, so I suppose the following passage applies:
For headless Services that define selectors, the endpoints controller
creates Endpoints records in the API, and modifies the DNS
configuration to return records (addresses) that point directly to the
Pods backing the Service
I have followed the example on a 3 node managed k8s cluster in Azure:
C:\work\k8s\mysql-wp-demo> kubectl.exe get ep
NAME ENDPOINTS AGE
kubernetes 52.186.94.71:443 47h
wordpress 10.244.0.10:80 5h33m
wordpress-mysql 10.244.3.28:3306 5h33m
C:\work\k8s\mysql-wp-demo> kubectl.exe get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
wordpress-584f8d8666-rlbf5 1/1 Running 0 5h33m 10.244.0.10 aks-nodepool1-30294001-vmss000001 <none> <none>
wordpress-mysql-55c74969cd-4l8d4 1/1 Running 0 5h33m 10.244.3.28 aks-nodepool1-30294001-vmss000003 <none> <none>
C:\work\k8s\mysql-wp-demo>
As far as I understand there is no difference from the endpoints perspective.
Can someone explain to me - what is the point of headless services in general and in this example in particular?
A regular service has a virtual Service IP that exists as iptables or ipvs rules on each node. A new connection to this service IP is then routed with DNAT to one of the Pod endpoints, to support a form of load balancing across multiple pods.
A headless service (that isn't an ExternalName) will create DNS A records for any endpoints with matching labels or name. Connections will go directly to a single pod/endpoint without traversing the service rules.
A service with a type of ExternalName is just a DNS CNAME record in kubernetes DNS. These are headless by definition as they are names for an IP external to the cluster.
The linked mysql deployment/service example is leading into StatefulSets. This Deployment is basically a single-pod StatefulSet. When you do move to a StatefulSet with multiple pods, you will mostly want to address individual members of the StatefulSet by a specific name (see mdaniel's comment).
Another reason to set clusterIP: None is to lessen the load on iptables processing, which slows down as the number of services (i.e. iptables rules) increases. Applications that don't need multiple pods don't need the Service IP. Setting up a cluster to use IPVS alleviates the slowdown somewhat.
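A quick way to see the difference from inside the cluster (a sketch; the dnsutils image is just one convenient image that ships nslookup):
# The headless service answers with the pod IP(s) directly; a regular service
# answers with its single virtual ClusterIP.
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- \
  nslookup wordpress-mysql    # headless: returns 10.244.3.28 (the pod IP)
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- \
  nslookup wordpress          # regular service: returns the ClusterIP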

NodePort doesn't work in OpenShift CodeReady Container

I installed the latest OpenShift CodeReady Containers on a CentOS VM and then ran a TCP server app, written in Java, on OpenShift. The TCP server is listening on port 7777.
I ran the app and exposed it as a service with a NodePort; everything seems to run well. The pod port is 7777, and the node port is 31777.
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
tcpserver-57c9b44748-k9dxg 1/1 Running 0 113m 10.128.0.229 crc-2n9vw-master-0 <none> <none>
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
tcpserver-ingres NodePort 172.30.149.98 <none> 7777:31777/TCP 18m
Then I get the node IP; the command shows 192.168.130.11, and I can ping this IP from my VM successfully.
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
crc-2n9vw-master-0 Ready master,worker 26d v1.14.6+6ac6aa4b0 192.168.130.11 <none> Red Hat Enterprise Linux CoreOS 42.81.20191119.1 (Ootpa) 4.18.0-147.0.3.el8_1.x86_64 cri-o://1.14.11-0.24.dev.rhaos4.2.gitc41de67.el8
Now I run a client app located on my VM. Because I can ping the OpenShift node IP, I expected the client app to connect successfully. Instead the connection times out; my client fails to connect to the server running on OpenShift.
Please advise how to troubleshoot the issue, or share any ideas.
I understood your problem. As per what you described, I can see your node port is 31777.
The best way to debug this problem is to go step by step.
Step 1:
Check if you are able to access your app server using your pod IP and port, i.e. curl 10.128.0.229:7777/endpoint, from one of the nodes within your cluster. This helps you check whether the pod is working or not, even though kubectl describe pod already gives you most of that information.
Step 2:
After that, on the node where the pod is deployed, i.e. 192.168.130.11, try to access your app server using curl localhost:31777/endpoint. If this works, the NodePort is accessible, i.e. your service is working fine without any issues.
Step 3:
After that, try to connect to your node using curl 192.168.130.11:31777/endpoint from the VM running your client. Just to let you know, 192.168.x.x is a private (RFC 1918) address range, so I am assuming your client is within the same network and able to talk to 192.168.130.11:31777; otherwise, make sure port 31777 of 192.168.130.11 is open to the VM IP that runs the client.
This is a short process for debugging issues with a service and pod. The best option, though, is to use an ingress and an ingress controller, which let you talk to your app server with a URL instead of an IP address and port number. However, even with an ingress and ingress controller, the best way to check that all the parts are working as expected is to follow these steps.
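Putting the three steps together with the addresses from this thread (the /endpoint path is only a placeholder; a raw TCP server can be probed with something like nc instead of curl):
# 1) From inside the cluster: pod IP and container port
curl 10.128.0.229:7777/endpoint
# 2) On the node running the pod: the node port on localhost
curl localhost:31777/endpoint
# 3) From the client VM: node IP and node port
curl 192.168.130.11:31777/endpoint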
Please feel free to let me know if there are any issues.
Thanks for the prompt answer.
Regarding Step 1:
I don't know where I could run "curl 10.128.0.229:7777/endpoint" inside the cluster, but I checked the status of the pod by going inside it; port 7777 is listening as expected.
$ oc rsh tcpserver-57c9b44748-k9dxg
sh-4.2$ netstat -nap | grep 7777
tcp6 0 0 127.0.0.1:7777 :::* LISTEN 1/java
Regarding Step 2,
Running curl localhost:31777/endpoint on the node where the pod is deployed failed:
$ curl localhost:31777/endpoint
curl: (7) Failed to connect to localhost port 31777: Connection refused
That means it seems that port 31777 is not opened by OpenShift.
Do you have any ideas on how to check why 31777 is not opened by OpenShift?
More information about the service definition:
apiVersion: v1
kind: Service
metadata:
  name: tcpserver-ingress
  labels:
    app: tcpserver
spec:
  selector:
    app: tcpserver
  type: NodePort
  ports:
    - protocol: TCP
      port: 7777
      targetPort: 7777
      nodePort: 31777
Service status:
$ oc describe svc tcpserver-ingress
Name: tcpserver-ingress
Namespace: myproject
Labels: app=tcpserver
Annotations: <none>
Selector: app=tcpserver
Type: NodePort
IP: 172.30.149.98
Port: <unset> 7777/TCP
TargetPort: 7777/TCP
NodePort: <unset> 31777/TCP
Endpoints: 10.128.0.229:7777
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>