How to scale WebSocket connections with Azure Application Gateway and AKS - kubernetes

We want to dynamically scale our AKS cluster based on the number of WebSocket connections.
We use Application Gateway V2 along with the Application Gateway Ingress Controller on AKS as the ingress.
I configured a HorizontalPodAutoscaler to scale the deployment based on consumed memory.
When I deploy the sample app to AKS I can connect to the WebSocket endpoints and communicate.
However, when any scale operation happens (pods added or removed) I see connection losses on all the clients.
How can I keep the existing connections when pods are added?
How can I gracefully drain connections when pods are removed so existing clients are not affected?
I tried activating cookie-based affinity on the Application Gateway, but this had no effect on the issue.
Below is the deployment I use for testing. It is based on this sample and modified a bit so that it allows specifying the number of connections and regularly sends ping messages to the server.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wssample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wssample
  template:
    metadata:
      labels:
        app: wssample
    spec:
      containers:
        - name: wssamplecontainer
          image: marxx/websocketssample:10
          resources:
            requests:
              memory: "100Mi"
              cpu: "50m"
            limits:
              memory: "150Mi"
              cpu: "100m"
          ports:
            - containerPort: 80
              name: wssample
---
apiVersion: v1
kind: Service
metadata:
  name: wssample-service
spec:
  ports:
    - port: 80
  selector:
    app: wssample
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "60"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wssample-service
                port:
                  number: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: websocket-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wssample
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50
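For the pod-removal side, a minimal pod-level draining sketch (the sleep duration is an arbitrary assumption, and it only covers the pod side, not the Application Gateway behaviour described further down): a preStop hook plus a terminationGracePeriodSeconds longer than the sleep, so existing WebSocket clients get time to disconnect before the container is stopped.

# Sketch: fields to add to the wssample Deployment's pod template above (values are assumptions).
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90   # must exceed the preStop sleep below
      containers:
        - name: wssamplecontainer
          lifecycle:
            preStop:
              exec:
                # keep the pod alive so connected clients can disconnect or reconnect elsewhere
                command: ["sh", "-c", "sleep 60"]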
Update:
I am running on a 2-node cluster with the cluster autoscaler activated to scale up to 4 nodes.
There is still plenty of memory available on the nodes.
At first I thought it was an issue with browsers and JavaScript, but I got the same results when I connected to the endpoint via a .NET Core-based console application (the WebSockets went to state 'Aborted' after the scale operation).
Update 2:
I found a pattern. The problem also occurs without the HPA and can be reproduced using the following steps:
1. Scale the deployment to 3 replicas.
2. Connect 20 clients.
3. Manually scale the deployment to 6 replicas with the kubectl scale command (the existing connections are still fine and the clients communicate with the backend).
4. Connect another 20 clients.
5. After a few seconds, all the existing connections are reset.
Update 3:
The AKS cluster is using kubenet networking.
The same issue occurs with Azure CNI networking, though.

I made a very unpleasant discovery. The outcome of this GitHub issue basically says that the behavior is by design and that Application Gateway resets all WebSocket connections whenever backend pool rules change (which happens during scale operations).
It's possible to vote for a feature request to keep those connections open in those situations.

Related

GKE sticky connections make autoscaling ineffective because of limited pod ports (API to database)

I have an API to which I send requests, and this API connects to MongoDB through MongoClient in PyMongo. Here is a scheme of my system that I deployed in GKE:
The major part of the calculations needed for each request is done in MongoDB, so I want the MongoDB pods to be autoscaled based on CPU usage. Thus I have an HPA for the MongoDB deployment, with minReplicas: 1.
When I send many requests to the Nginx Ingress, I see that my only MongoDB pod has 100% CPU usage, so the HPA creates a second pod. But this second pod isn't used.
After looking at the logs of my first MongoDB pod, I see that all the requests have this:
"remote":"${The_endpoint_of_my_API_Pod}:${PORT}", and the PORT only takes 12 different values (I counted them; they started repeating, so I guessed there aren't others).
So my guess is that the second pod isn't used because of sticky connections, as suggested in this answer https://stackoverflow.com/a/73028316/19501779 to one of my previous questions, where there is more detail on my MongoDB deployment.
I have 2 questions:
Is the second pod not used in fact because of sticky connections between my API Pod and my first MongoDB Pod?
If this is the case, how can I overcome this issue to make the autoscaling effective?
Thanks, and if you need more info please ask me.
EDIT
Here is my MongoDB configuration:
Its Dockerfile, from which I create my MongoDB image using the data from the VM where my original MongoDB is. A single deployment of this image works in k8s.
FROM mongo:latest
EXPOSE 27017
COPY /mdb/ /data/db
The deployment.yml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mongodb
  labels:
    app: mongodb
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: $My_MongoDB_image
          ports:
            - containerPort: 27017
          resources:
            requests:
              memory: "1000Mi"
              cpu: "1000m"
      imagePullSecrets: # for pulling from my Docker Hub
        - name: regcred
and the service.yml and hpa.yml:
apiVersion: v1
kind: Service
metadata:
  name: mongodb-service
  labels:
    app: mongodb
spec:
  selector:
    app: mongodb
  ports:
    - protocol: TCP
      port: 27017
      targetPort: 27017
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mongodb-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mongodb
  minReplicas: 1
  maxReplicas: 70
  targetCPUUtilizationPercentage: 85
And I access this service from my API Pod with PyMongo:
def get_db(database: str):
    client = MongoClient(host="$Cluster_IP_of_{mongodb-service}",
                         port=27017,
                         username="...",
                         password="...",
                         authSource="admin")
    return client.get_database(database)
And moreover, when a second MongoDB Pod is created thanks to autoscaling, its endpoint appears in my mongodb-service:
the HPA created a second Pod
the new Pod's endpoint appears in the mongodb-service
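One direction that is sometimes suggested for long-lived, pooled connections like these (a sketch only, not a verified fix for this setup, and it assumes MongoDB is deployed in a form the driver can discover, e.g. a replica set behind a StatefulSet): expose MongoDB through a headless Service, so DNS returns the individual pod IPs instead of a single load-balanced ClusterIP.

apiVersion: v1
kind: Service
metadata:
  name: mongodb-headless   # hypothetical name
  labels:
    app: mongodb
spec:
  clusterIP: None          # headless: DNS returns the pod IPs directly
  selector:
    app: mongodb
  ports:
    - protocol: TCP
      port: 27017
      targetPort: 27017

Whether the pods actually share load still depends on the driver: MongoClient keeps long-lived pooled connections to whatever address it resolved, so the usual approach is to run MongoDB as a replica set and give the driver all member addresses rather than a single service IP.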

Active/passive routing for a headless service with multiple backends

I'm setting up active/passive routing for an application in kube that is outside of the typical K8s use case. I'm trying to find configuration related to routing or load balancing in headless services with multiple backends. So far I have managed to route traffic to my backends, but I need to ensure that the traffic is routed correctly. The application requires TCP connections, and the primary/secondary instances have differing configuration (requiring different Deployment objects). If fail-over occurs, routing is expected to return to the primary once it is restored.
The routing consistently behaves as desired, but no documentation or configuration indicates that it should. I've found documentation stating that it should be round-robin or random because of the order of the DNS entries. The crux of the question is: can I depend on this behavior? Since this is undocumented and not explicitly configured, I'm concerned that it will change in future versions or deployments.
I'm using Rancher with the canal networking layer.
I've read through both the Calico and Flannel docs.
Neither Endpoints/EndpointSlices nor DNS entries indicate any order for routing.
Currently the setup has two Deployments that are selected by a headless service. The deployed pods have a hostname of input-primary in deployment 1 and input-secondary in deployment 2. I can access either of them by DNS as input-primary.myservice or input-secondary.myservice.
The ingress controller's tcp-services ConfigMap has an entry for my service (the full ConfigMap form is sketched after the entry):
25252: default/myservice:9999
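For context, the full ConfigMap form of that entry usually looks roughly like this (a sketch; the ConfigMap name and namespace are those of a standard ingress-nginx install and may differ here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services          # assumption: default ingress-nginx ConfigMap name
  namespace: ingress-nginx    # assumption: controller namespace
data:
  "25252": "default/myservice:9999"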
An abridged version of the k8s config:
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  clusterIP: None
  ports:
    - name: input
      port: 9999
  selector:
    app: myapp
  type: ClusterIP
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    app: myapp
  name: input-primary
spec:
  hostname: input-primary
  containers:
    - ports:
        - containerPort: 9999
          name: input
          protocol: TCP
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    app: myapp
  name: input-secondary
spec:
  hostname: input-secondary
  containers:
    - ports:
        - containerPort: 9999
          name: input
          protocol: TCP
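One way to make the active/passive choice explicit instead of relying on undocumented DNS ordering (a sketch, not part of the setup above; the role label and Service name are hypothetical) is to add a role label to each Deployment's pod template and point a second, non-headless Service only at the primary, repointing the selector or relabeling the pods on fail-over:

apiVersion: v1
kind: Service
metadata:
  name: myservice-active   # hypothetical Service that only ever targets the active instance
spec:
  selector:
    app: myapp
    role: primary          # flip this selector (or the pod labels) on fail-over
  ports:
    - name: input
      port: 9999
      targetPort: 9999

The tcp-services entry would then point at default/myservice-active:9999, so the routing decision lives in the Service selector rather than in whatever order DNS happens to return the headless endpoints.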

Pod deletion causes errors when using NEG

For this example I am running the "echoheaders" Nginx in a deployment with 2 replicas. When I delete one pod, I sometimes get slow responses and errors for ~40 seconds.
We are running our API-gateway in Kubernetes, and need to be able to allow the Kubernetes scheduler to handle the pods as it sees fit.
We recently wanted to introduce session affinity, and for that, we wanted to migrate to the new and shiny NEGs (Network Endpoint Groups):
https://cloud.google.com/load-balancing/docs/negs/
When using NEG we experience problems during failover. Without NEG we're fine.
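For reference, session affinity with container-native load balancing is typically configured through a BackendConfig attached to the Service; a sketch (the name and TTL are assumptions, and older GKE versions use apiVersion cloud.google.com/v1beta1):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: echoheaders-affinity        # hypothetical name
spec:
  sessionAffinity:
    affinityType: "GENERATED_COOKIE"  # cookie-based affinity on the Google load balancer
    affinityCookieTtlSec: 300

It is referenced from the Service with the annotation cloud.google.com/backend-config: '{"default": "echoheaders-affinity"}'.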
deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: echoheaders
  labels:
    app: echoheaders
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echoheaders
  template:
    metadata:
      labels:
        app: echoheaders
    spec:
      containers:
        - image: brndnmtthws/nginx-echo-headers
          imagePullPolicy: Always
          name: echoheaders
          readinessProbe:
            httpGet:
              path: /
              port: 8080
          lifecycle:
            preStop:
              exec:
                # Hack: wait for kube-proxy to remove endpoint before exiting, and
                # gracefully shut down
                command: ["bash", "-c", "sleep 10; nginx -s quit; sleep 40"]
      restartPolicy: Always
      terminationGracePeriodSeconds: 60
service.yaml
apiVersion: v1
kind: Service
metadata:
  name: echoheaders
  labels:
    app: echoheaders
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
  selector:
    app: echoheaders
ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.global-static-ip-name: echoheaders-staging
  name: echoheaders-staging
spec:
  backend:
    serviceName: echoheaders
    servicePort: 80
When deleting a pod I get errors, as shown in this image of $ httping -G -K 35.190.69.21 (https://i.imgur.com/u14MvHN.png).
This is new behaviour when using NEG. Disabling NEG gives the old behaviour with working failover.
Any way to use Google LB, ingress, NEG and Kubernetes without errors during pod deletion?
In GCP load balancers, a GET request will only be served a 502 after two subsequent backends fail to meet the response timeout, or after an impactful error occurs, which seems more plausible here.
What is possibly happening may be an interim period wherein a Pod was due to be terminated and had received its SIGTERM, but was still considered healthy by the load balancer and was sent a request. Since this period was so brief, it wasn't able to complete the request and closed the connection.
A graceful service stop [1] on the machine means that, after receiving a SIGTERM, your service continues to serve in-flight requests but refuses new connections. This may solve your issue, but keep in mind that there is no guarantee of zero downtime.
[1] https://landing.google.com/sre/sre-book/chapters/load-balancing-datacenter/#robust_approach_lame_duck
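On the load-balancer side, GKE also exposes connection draining for NEG backends through a BackendConfig, which is directly related to errors during pod deletion; a sketch (not part of the answer above; the name and timeout are assumptions, and older GKE versions use apiVersion cloud.google.com/v1beta1):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: echoheaders-draining    # hypothetical name
spec:
  connectionDraining:
    drainingTimeoutSec: 60      # let in-flight requests finish before the endpoint is removed

It is attached to the Service with the annotation cloud.google.com/backend-config: '{"default": "echoheaders-draining"}'.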

Grafana is not working on kubernetes cluster while using k8s Service

I am trying to set up a very simple monitoring cluster for my k8s cluster. I have successfully created a Prometheus pod and it is running fine.
When I try to create the Grafana pod the same way, it is not accessible through the NodePort.
My Grafana deploy file is:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana-deployment
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana-server
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:5.1.0
          ports:
            - containerPort: 3000
And the Service file is:
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana-server
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
Note: when I create a simple Docker container on the same host using the same image, it works fine.
I have come to know that my server provider had not enabled these ports (like Grafana's 3000 and Kibana's 5601). I never thought of this since I have been using these servers for quite a long time and never faced such a blocker. They implemented these rules recently.
Well, after some port approvals, I tried the same config again and it worked like a charm.
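As a side note for this kind of firewall approval, a small sketch (the port value is an arbitrary choice from the default 30000-32767 NodePort range): pinning an explicit nodePort keeps the externally reachable port stable, so the provider only has to open one known port.

apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana-server
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 32000   # assumption: any free port in the cluster's NodePort range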

ISTIO: enable circuit breaking on egress

I am unable to get the circuit breaking configuration to work on my ELB through the egress config.
ELB
The ELB has a success rate of 25% (75% of responses are 500 errors and 25% return status 200).
The ELB has 4 instances; only 1 returns a successful response, and the other instances are configured to return a 500 error for testing purposes.
Setup
k8s: v1.7.4
istio: 0.5.0
env: k8s on aws
Egress rule
apiVersion: config.istio.io/v1alpha2
kind: EgressRule
metadata:
  name: elb-egress-rule
spec:
  destination:
    service: xxxx.us-east-1.elb.amazonaws.com
  ports:
    - port: 80
      protocol: http
Destination Policy
apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: elb-circuit-breaker
spec:
  destination:
    service: xxxx.us-east-1.elb.amazonaws.com
  loadBalancing:
    name: RANDOM
  circuitBreaker:
    simpleCb:
      maxConnections: 100
      httpMaxPendingRequests: 100
      sleepWindow: 3m
      httpDetectionInterval: 1s
      httpMaxEjectionPercent: 100
      httpConsecutiveErrors: 3
      httpMaxRequestsPerConnection: 10
Route rules: not set
Testing
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
spec:
  ports:
    - port: 80
      name: http
  selector:
    app: sleep
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
        - name: sleep
          image: tutum/curl
          command: ["/bin/sleep", "infinity"]
          imagePullPolicy: IfNotPresent
export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath={.items..metadata.name})
kubectl exec -it $SOURCE_POD -c sleep bash
Sending requests in parallel from the pod
#!/bin/sh
set -m # Enable Job Control

for i in `seq 100`; do # start 100 jobs in parallel
  curl xxxx.us-east-1.elb.amazonaws.com &
done
Response
Currently, Istio considers an Egress Rule to designate a single host. This single host will not be ejected due to the load balancer's panic threshold of Envoy (the sidecar proxy implementation of Istio). The default panic threshold of Envoy is 50%. This means that at least two hosts are required for one host to be ejected, so the single host of an Egress Rule will not be ejected.
This practically means that httpConsecutiveErrors does not affect the external services. This lack of functionality should be partially resolved with External Services of Istio, which will replace the Egress Rules.
See the documentation of Istio External Services backed by multiple endpoints: https://github.com/istio/api/blob/master/routing/v1alpha2/external_service.proto#L113
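For reference (not part of the answer above), in later Istio releases the same idea is expressed with a ServiceEntry plus a DestinationRule with outlierDetection; a rough sketch under the newer networking API, with illustrative values rather than values tuned for this ELB:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: elb-external
spec:
  hosts:
    - xxxx.us-east-1.elb.amazonaws.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 80
      name: http
      protocol: HTTP
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: elb-circuit-breaker
spec:
  host: xxxx.us-east-1.elb.amazonaws.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 3     # eject an endpoint after 3 consecutive 5xx responses
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100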