Graceful scaledown of stateful apps in Kubernetes - kubernetes

I have a stateful application deployed in a Kubernetes cluster. The challenge now is how to scale down the cluster gracefully, so that each pod terminating during the scale-down completes its pending tasks and then shuts down gracefully. The scenario is similar to what is explained below, but in my case the terminating pods will have a few in-flight tasks to be processed.
https://medium.com/@marko.luksa/graceful-scaledown-of-stateful-apps-in-kubernetes-2205fc556ba9
Is there official feature support for this in the Kubernetes API?
Kubernetes version: v1.11.0
Host OS: linux/amd64
CRI version: Docker 1.13.1
UPDATE:
Possible solution: while performing a StatefulSet scale-down, the preStop hook of the terminating pod(s) sends a message to a queue with the metadata of the respective task(s) still to be completed. Afterwards, a Kubernetes Job is used to complete those tasks. A rough sketch of this idea is shown below. Please comment on whether this is a recommended approach from a Kubernetes perspective.
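For illustration only, a minimal sketch of that idea; the drain script, queue name, and finisher image are hypothetical placeholders, not an official Kubernetes feature:
# Fragment that goes under the worker container in the StatefulSet pod template.
# On termination, a script baked into the image (hypothetical) publishes the
# metadata of any unfinished tasks to a queue.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/app/publish-pending-tasks.sh --queue pending-tasks"]
---
# A Job (hypothetical image) that drains the queue and finishes the leftover tasks.
apiVersion: batch/v1
kind: Job
metadata:
  name: finish-pending-tasks
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: finisher
        image: example/task-finisher:latest   # placeholder image
        args: ["--queue", "pending-tasks"]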
Thanks In Advance!
Regards,
Balu

Your pod will be scaled down only after the in-progress job is completed. You can additionally configure a lifecycle preStop hook in the deployment manifest, which lets your application stop gracefully. This is one of the best practices to follow. Please refer to this for a detailed explanation and syntax.
Updated Answer
This is the YAML I deployed locally; I then generated load to raise the CPU utilization and trigger the HPA.
Deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  namespace: default
  name: whoami
  labels:
    app: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: containous/whoami
        resources:
          requests:
            cpu: 30m
          limits:
            cpu: 40m
        ports:
        - name: web
          containerPort: 80
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - echo "Starting Sleep"; date; sleep 600; echo "Pod will be terminated now"
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: whoami
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: whoami
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 40
#  - type: Resource
#    resource:
#      name: memory
#      targetAverageUtilization: 10
---
apiVersion: v1
kind: Service
metadata:
  name: whoami-service
spec:
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    name: http
  selector:
    app: whoami
Once the pod is deployed, execute the command below to start a load-generator pod, then run the loop inside its shell to generate the load.
kubectl run -i --tty load-generator --image=busybox /bin/sh
while true; do wget -q -O- http://whoami-service.default.svc.cluster.local; done
Once the extra replicas had been created, I stopped the load; the pods being scaled down were then terminated after 600 seconds (the preStop sleep). This scenario worked for me. I believe the same applies to a StatefulSet as well; a rough sketch follows below. Hope this helps.
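As a rough sketch, assuming a placeholder image and a hypothetical drain script, the same preStop pattern on a StatefulSet could look like the following. Note that terminationGracePeriodSeconds must be at least as long as the preStop work, otherwise the kubelet force-kills the pod after the default 30 seconds:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker
spec:
  serviceName: worker
  replicas: 3
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 600   # must cover the time the preStop hook needs
      containers:
      - name: worker
        image: example/worker:latest        # placeholder image
        lifecycle:
          preStop:
            exec:
              # hypothetical script that finishes or hands off in-flight tasks
              command: ["/bin/sh", "-c", "/app/drain-tasks.sh"]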

Related

Unable to fetch metrics from custom metrics API: the server is currently unable to handle the request

I'm using a HPA based on a custom metric on GKE.
The HPA is not working and it's showing me this error log:
unable to fetch metrics from custom metrics API: the server is currently unable to handle the request
When I run kubectl get apiservices | grep custom I get
v1beta1.custom.metrics.k8s.io services/prometheus-adapter False (FailedDiscoveryCheck) 135d
This is the HPA spec config:
spec:
  scaleTargetRef:
    kind: Deployment
    name: api-name
    apiVersion: apps/v1
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: api-name
        apiVersion: v1
      metricName: messages_ready_per_consumer
      targetValue: '1'
and this is the service's spec config:
spec:
  ports:
  - name: worker-metrics
    protocol: TCP
    port: 8080
    targetPort: worker-metrics
  selector:
    app.kubernetes.io/instance: api
    app.kubernetes.io/name: api-name
  clusterIP: 10.8.7.9
  clusterIPs:
  - 10.8.7.9
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
What should I do to make it work?
First of all, confirm that the metrics-server Pod is running in your kube-system namespace. If it is not there, you can deploy it with the following manifest:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      # mount in tmp so we can safely use from-scratch images and/or read-only containers
      - name: tmp-dir
        emptyDir: {}
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.1
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        imagePullPolicy: Always
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
If so, take a look at the logs and look for any lines from the Stackdriver adapter. This issue is commonly caused by a problem with the custom-metrics-stackdriver-adapter, which usually crashes in the metrics-server namespace. To solve that, use the resource from this URL, and for the deployment, use this image:
gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1
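As a quick way to locate and inspect that adapter (the namespace varies between setups; the pod name below is a placeholder):
# Locate the Stackdriver adapter pod, whatever namespace it runs in
kubectl get pods --all-namespaces | grep -i stackdriver
# Then inspect its logs and restart history
kubectl -n <namespace> logs <adapter-pod-name>
kubectl -n <namespace> describe pod <adapter-pod-name>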
Another common root cause of this is an OOM issue. In this case, adding more memory solves the problem. To assign more memory, you can specify the new memory amount in the configuration file, as the following example shows:
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
In the above example, the container has a memory request of 100 MiB and a memory limit of 200 MiB. In the manifest, the "--vm-bytes", "150M" argument tells the container to attempt to allocate 150 MiB of memory. You can visit the official Kubernetes documentation for more references about the memory settings.
You can use the following threads for more reference: GKE - HPA using custom metrics - unable to fetch metrics, Stackdriver-metadata-agent-cluster-level gets OOMKilled, and Custom-metrics-stackdriver-adapter pod keeps crashing.
What do you get for kubectl get pod -l "app.kubernetes.io/instance=api,app.kubernetes.io/name=api-name"?
There should be a pod to which the service refers.
If there is a pod, check its logs with kubectl logs <pod-name>. You can add -f to the kubectl logs command to follow the logs.
Adding this block in my EKS nodes security group rules solved the issue for me:
node_security_group_additional_rules = {
  ...
  ingress_cluster_metricserver = {
    description                   = "Cluster to node 4443 (Metrics Server)"
    protocol                      = "tcp"
    from_port                     = 4443
    to_port                       = 4443
    type                          = "ingress"
    source_cluster_security_group = true
  }
  ...
}

Non-uniform distribution of requests between internal services in minikube

I have two different microservices deployed as backends in minikube; call them deployment A and deployment B. Each of these deployments runs a different number of pod replicas.
Deployment B is exposed as service B. Pods of deployment A call pods of deployment B via service B, which is of type ClusterIP.
The pods of deployment B run a scrapyd application with scraping spiders deployed in them. Each celery worker (a pod of deployment A) takes a task from a Redis queue and calls the scrapyd service to schedule spiders on it.
Everything works fine, but after I scale the application (deployments A and B separately), I observe with kubectl top pods that the resource consumption is not uniform: some of the pods of deployment B are not used at all. The pattern I observe is that the pods of deployment B that only come up after all the pods of deployment A are already running are never utilized.
Is this normal behavior? I suspect the connections between pods of deployment A and B are persistent, and I am confused as to why request handling by the pods of deployment B is not uniformly distributed. Sorry for the naive question; I am new to this field.
The manifest for deployment A is:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
  labels:
    deployment: celery-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: celery-worker
  template:
    metadata:
      labels:
        pod: celery-worker
    spec:
      containers:
      - name: celery-worker
        image: celery:latest
        imagePullPolicy: Never
        command: ['celery', '-A', 'mysite', 'worker', '-E', '-l', 'info']
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
      terminationGracePeriodSeconds: 200
and that of deployment B is
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scrapyd
  labels:
    app: scrapyd
spec:
  replicas: 1
  selector:
    matchLabels:
      pod: scrapyd
  template:
    metadata:
      labels:
        pod: scrapyd
    spec:
      containers:
      - name: scrapyd
        image: scrapyd:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 6800
        resources:
          limits:
            cpu: 800m
          requests:
            cpu: 800m
      terminationGracePeriodSeconds: 100
---
kind: Service
apiVersion: v1
metadata:
  name: scrapyd
spec:
  selector:
    pod: scrapyd
  ports:
  - protocol: TCP
    port: 6800
    targetPort: 6800
Output of kubectl top pods:
So the solution to the above problem that I figured out is as follows:
In the current setup, install Linkerd using this link: Linkerd Installation in Kubernetes.
After that, inject the Linkerd proxy into the celery deployment as follows:
cat celery/deployment.yaml | linkerd inject - | kubectl apply -f -
This ensures that requests from celery are first passed to the proxy and then load-balanced directly to scrapyd at L7. In this case, kube-proxy is bypassed and its default L4 load balancing no longer applies.
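If the Linkerd proxy injector is enabled (Linkerd 2.x), an equivalent approach, as a sketch, is to annotate the pod template so the proxy is injected automatically instead of piping the manifest through linkerd inject:
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled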

Kubernetes Pod showing status "Completed" without any jobs

I'm hosting an Angular website that connects to a C# backend inside a Kubernetes cluster. When I use a certain function on the website that I can't describe in more detail, the pod shows status "Completed", then goes into "CrashLoopBackOff", and then restarts. The problem is, there are no Jobs set up for this pod (in fact, I didn't even know Jobs are a thing until an hour ago). So my main question is: how can a pod go into the "Completed" status without running any Jobs?
My .yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-demo
  namespace: my-namespace
  labels:
    app: my-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-demo
  template:
    metadata:
      name: my-demo-pod
      labels:
        app: my-demo
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": windows
      containers:
      - name: my-demo-container
        image: myregistry.azurecr.io/my.demo:latest
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 1
            memory: 800M
          requests:
            cpu: .1
            memory: 300M
        ports:
        - containerPort: 80
      imagePullSecrets:
      - name: my-registry-secret
---
apiVersion: v1
kind: Service
metadata:
  name: my-demo-service
  namespace: my-namespace
spec:
  ports:
  - protocol: TCP
    port: 80
    name: my-demo-port
  selector:
    app: my-demo
The Completed status indicates that the application launched by the container's CMD or ENTRYPOINT exited with a non-error (i.e. 0) status code. This Completed->CrashLoopBackOff->Running cycle usually indicates that the process started when the container launches does not stay running in the foreground and exits, which Kubernetes sees as the process 'completing', hence the status.
Check that the ENTRYPOINT in your Dockerfile, or the command in your pod template, calls the right process (with the appropriate flags) so that it keeps running in the foreground. You can also check the logs of the previous container instance (i.e. using kubectl logs --previous) to see what output the application gave, for example as sketched below.
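A sketch of those checks, using the namespace from the manifest above (the pod name is a placeholder):
# List the pods, then read the output of the previous, exited container
kubectl get pods -n my-namespace
kubectl logs --previous <pod-name> -n my-namespace
If the entrypoint turns out to be a wrapper script that backgrounds the real server, run the server in the foreground instead.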

ISTIO: enable circuit breaking on egress

I am unable to get my circuit-breaking configuration to work for my ELB through an egress config.
ELB
The ELB has a success rate of 25% (75% of requests return a 500 error and 25% return status 200). The ELB has 4 instances; only 1 returns a successful response, and the other instances are configured to return a 500 error for testing purposes.
Setup
k8s: v1.7.4
istio: 0.5.0
env: k8s on aws
Egress rule
apiVersion: config.istio.io/v1alpha2
kind: EgressRule
metadata:
  name: elb-egress-rule
spec:
  destination:
    service: xxxx.us-east-1.elb.amazonaws.com
  ports:
  - port: 80
    protocol: http
Destination Policy
apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: elb-circuit-breaker
spec:
  destination:
    service: xxxx.us-east-1.elb.amazonaws.com
  loadBalancing:
    name: RANDOM
  circuitBreaker:
    simpleCb:
      maxConnections: 100
      httpMaxPendingRequests: 100
      sleepWindow: 3m
      httpDetectionInterval: 1s
      httpMaxEjectionPercent: 100
      httpConsecutiveErrors: 3
      httpMaxRequestsPerConnection: 10
Route rules: not set
Testing
apiVersion: v1
kind: Service
metadata:
  name: sleep
  labels:
    app: sleep
spec:
  ports:
  - port: 80
    name: http
  selector:
    app: sleep
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sleep
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: sleep
    spec:
      containers:
      - name: sleep
        image: tutum/curl
        command: ["/bin/sleep","infinity"]
        imagePullPolicy: IfNotPresent
export SOURCE_POD=$(kubectl get pod -l app=sleep -o jsonpath={.items..metadata.name})
kubectl exec -it $SOURCE_POD -c sleep bash
Sending requests in parallel from the pod
#!/bin/sh
set -m # Enable Job Control
for i in `seq 100`; do # start 100 jobs in parallel
curl xxxx.us-east-1.elb.amazonaws.com &
done
Response
Currently, Istio considers an Egress Rule to designate a single host. This single host will not be ejected, due to the panic threshold of Envoy's load balancer (Envoy is the sidecar proxy used by Istio). The default panic threshold of Envoy is 50%, meaning at least two hosts are required for one host to be ejected, so the single host of an Egress Rule will never be ejected.
This practically means that httpConsecutiveErrors does not affect external services. This lack of functionality should be partially resolved by Istio External Services, which will replace Egress Rules.
See the documentation of Istio External Services backed by multiple endpoints: https://github.com/istio/api/blob/master/routing/v1alpha2/external_service.proto#L113
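For orientation only, a sketch of what the replacement looks like in later Istio releases (a networking.istio.io/v1alpha3 ServiceEntry plus a DestinationRule with outlier detection); field names may differ in your Istio version, so treat this as an assumption to verify against your docs:
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: elb-external
spec:
  hosts:
  - xxxx.us-east-1.elb.amazonaws.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 80
    name: http
    protocol: HTTP
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: elb-circuit-breaker
spec:
  host: xxxx.us-east-1.elb.amazonaws.com
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 3      # renamed in newer API versions
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100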

Kubernetes statefulset ends in completed state

I'm running a k8s cluster on Google GKE where I have StatefulSets running Redis and Elasticsearch.
Every now and then the pods end up in a Completed state, so they aren't running anymore and the services that depend on them fail.
These pods will also never restart by themselves; a simple kubectl delete pod x resolves the problem, but I want my pods to heal by themselves.
I'm running the latest version available, 1.6.4, and I have no clue why they aren't picked up and restarted like any other regular pod. Maybe I'm missing something obvious.
edit: I've also noticed the pod gets a termination signal and shuts down properly, so I'm wondering where that is coming from. I'm not shutting it down manually, and I experience the same with Elasticsearch.
This is my statefulset resource declaration:
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:3.2-alpine
        ports:
        - name: redis-server
          containerPort: 6379
        volumeMounts:
        - name: redis-storage
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-storage
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
Check the version of Docker you run, and whether the Docker daemon was restarted during that time.
If the Docker daemon was restarted, all the containers would be terminated (unless you use the new "live restore" feature in 1.12). In some Docker versions, Docker may incorrectly report "exit code 0" for all containers terminated in this situation. See https://github.com/docker/docker/issues/31262 for more details.
source: https://stackoverflow.com/a/43051371/5331893
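A quick way to check this on the affected node, assuming SSH access and a systemd-managed Docker daemon:
# Docker version on the node
docker version --format '{{.Server.Version}}'
# When the Docker daemon last (re)started
systemctl show docker --property=ActiveEnterTimestamp
# Any recent daemon restarts or shutdowns
journalctl -u docker --since "1 day ago" | grep -iE 'start|restart|shutdown'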
I am using the same configuration as you, but removing the annotation in the volumeClaimTemplates since I am trying this on minikube:
$ cat sc.yaml
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: "redis"
  replicas: 1
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:3.2-alpine
        ports:
        - name: redis-server
          containerPort: 6379
        volumeMounts:
        - name: redis-storage
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: redis-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
Now, trying to simulate the case where Redis fails, I exec into the pod and kill the redis-server process:
$ k exec -it redis-0 sh
/data # kill 1
/data # $
Immediately after the process dies, I can see that the STATUS has changed to Completed:
$ k get pods
NAME READY STATUS RESTARTS AGE
redis-0 0/1 Completed 1 38s
It took some time for Redis to come back up and running:
$ k get pods
NAME READY STATUS RESTARTS AGE
redis-0 1/1 Running 2 52s
But soon after that, I could see the pod starting again. Can you see the events triggered when this happened? For example, was there a problem when re-attaching the volume to the pod?
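A few commands that should surface those events, assuming the pod is named redis-0 as above:
# Events for the pod itself, including volume attach/detach and restart reasons
kubectl describe pod redis-0
# All recent cluster events in chronological order
kubectl get events --sort-by=.metadata.creationTimestamp
# Output of the container instance that exited
kubectl logs redis-0 --previous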