Way to configure notifications/alerts for a kubernetes pod which is reaching 90% memory and which is not exposed to internet(backend microservice)

Way to configure notifications/alerts for a kubernetes pod which is reaching 90% memory and which is not exposed to internet(backend microservice) - kubernetes

I am currently working on a solution for alerts/notifications where we have microservices deployed on kubernetes in a way of frontend and back end services. There has been multiple occasions where backend services are not able to restart or reach a 90% allocated pod limit, if they encounter memory exhaust. To identify such pods we want an alert mechanism to lookin when they fail or saturation level. We have prometheus and grafana as monitoring services but are not able to configure alerts, as i have quite a limited knowledge in these, however any suggestions and references provided where i can have detailed way on achieving this will be helpful. Please do let me know
I did try it out on the internet for such ,but almost all are pointing to node level ,cluster level monitoring only. :(
enter image description here
The Query used to check the memory usage is :
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",namespace=~"^$namespace$",pod_name=~"^$deployment-[a-z0-9]+-[a-z0-9]+"}) by (pod_name)

I saw this recently on google. It might be helpful to you. https://groups.google.com/u/1/g/prometheus-users/c/1n_z3cmDEXE?pli=1

Related

MongoDB Atlas dedicated cluster: Should I be concerned about 'Restarts in last hour' alerts?

We’re using a standard 3-node Atlas replicaset in a dedicated cluster (M10, Mongo 6.0.3, AWS) and have configured an alert if the ‘Restarts in last hour is’ rule exceeds 0 for any node.
https://www.mongodb.com/docs/atlas/reference/alert-conditions/#mongodb-alert-Restarts-in-Last-Hour-is
We’re seeing this alert fire every now and then and we’re wondering what this means for a node in a dedicated cluster and whether this is something to be concerned about, since I don’t think we have any control over it. Should we should disable this rule or increase the restart threshold?
Thanks in advance for any advice.
(Note I've asked this over at the Mongo community support site also, but haven't received any traction yet so asking here too)

I got an excellent response on my question at the Mongo community support site:
A node restarting is not necessarily a cause for concern. However, you should investigate the cause of the restart itself to better determine if this is an issue or not. You should take a look at your Project Activity Feed to see if you can determine why the nodes are restarting. I understand you have noted this is an M10 cluster so you should have access to the MongoDB logs, you also can check those to try determine the cause of the node restart. If you do not have access to the logs, you can consider working with Atlas in-app chat support to diagnose the issue.
It’s always good to keep the alerts active, as they can indicate a potential problem as soon as they occur. You can consider increasing the restart threshold to reduce alert noise after concluding whether the restarts are expected or not.
In my case, having checked the activity feed I was able to match up all the alerts we were seeing to Mongo version auto-updates on the nodes. We still wanted to keep that so we've increased our alert threshold to fire on >1 restart per hour rather than >0 restart, assuming that auto-updates won't be applied multiple times in the same hour.

Monitoring memory leaks in Kubernetes cluster using prometheus and alertmanager

i am looking for a way to get alerts when a pod may have memory leak. we're using prometheus and alertmanager for monitoring and i wanted to ask if anyone created such rule or if you guys have any idea how we could achieve this.
Basically i am trying to use something like the following query to track the pod memory usage (sum by(pod, namespace) (container_memory_working_set_bytes)) however i am not sure how to add the time periods and how to track the constant increase of memory overtime.
Thanks.

Track time taken to scale up in Kubernetes using HPA and CA

I am trying to track and monitor, how much time does a pod take to come online/healthy/Running.
I am using EKS. And I have got HPA and cluster-autoscaler installed on my cluster.
Let's say I have a deployment with HorizontalPodAutoscaler scaling policy with 70% targetAverageUtilization.
So whenever the average utilization of deployment will go beyond 70%, HPA will trigger to create new POD. Now, based on different factors, like if nodes are available or not, and if not is already available, then the image needs to be downloaded or is it present on cache, the scaling can take from few seconds to few minutes to come up.
I want to track this time/duration, every time the POD is scheduled, how much time does it take to come to Running state. Any suggestions?
Or any direction where I should be looking at.
I found this Cluster Autoscaler Visibility Logs, but this is only available in GCE.
I am looking for any solution, can be out-of-the-box integration, or raising events and storing them in some time-series DB or scraping data from Prometheus. But I couldn't find any solution for this till now.
Thanks in advance.

There is nothing out of the box for this, you will need to build something yourself.

Kubernetes dynamic pod provisioning

I have an app I'm building on Kubernetes which needs to dynamically add and remove worker pods (which can't be known at initial deployment time). These pods are not interchangeable (so increasing the replica count wouldn't make sense). My question is: what is the right way to do this?
One possible solution would be to call the Kubernetes API to dynamically start and stop these worker pods as needed. However, I've heard that this might be a bad way to go since, if those dynamically-created pods are not in a replica set or deployment, then if they die, nothing is around to restart them (I have not yet verified for certain if this is true or not).
Alternatively, I could use the Kubernetes API to dynamically spin up a higher-level abstraction (like a replica set or deployment). Is this a better solution? Or is there some other more preferable alternative?

If I understand you correctly you need ConfigMaps.
From the official documentation:
The ConfigMap API resource stores configuration data as key-value
pairs. The data can be consumed in pods or provide the configurations
for system components such as controllers. ConfigMap is similar to
Secrets, but provides a means of working with strings that don’t
contain sensitive information. Users and system components alike can
store configuration data in ConfigMap.
Here you can find some examples of how to setup it.
Please try it and let me know if that helped.

Auto scale kubernetes pods based on downstream api results

I have seen HPA can be scaled based on CPU usage. That is super cool.
However, the scenario I have is: the stateful app (container in pod) is one to one mapping based on the downstream API results. For example, the downstream api results return maximum and expected capacity like {response: 10}. I would like to see replicaSet or statefulSet or other kubernetes controller can obtain this value and auto scale the pods to 10. Unfortunately, the pod replicas is hardcoded in the yaml file.
If I am doing it manually, I think I can do it via running start a scheduler. The job of the scheduler is to watch the api and run the kubectl scale command based on the downstream api results. This can be error prone and there is another system I need to maintain. I guess this logic should belong to a kubernetes controller ?
May I ask has someone done this stuff before and what is the way to configure it ?
Thanks in advance.

Unfortunately, it is not possible to use an HPA in that mode, but your conception about how to scale is right.
HPA is designed to analyze metrics and decide how many pods need to be spawned based on those metrics. It is using scaling rules and can only spawn pods one by one based on the result of its decision.
Moreover, it using standard Kubernetes API for scale pods.
Because a logic of HPA is already in your application, you can use the same API to scale your pods. Btw, kubectl scale is using the same way to interact with a cluster.
So, you can use i.e. Cronjob, with a small application which will call API of your application every 5 minutes and call kubectl scale with proper name of deployment to scale your app.
But, please keep in mind, you need to somehow control the frequency of up- and downscaling of pods, it will make your application more stable. That’s why I think that scaling not more often than once per 5 minutes is OK, but trying to do it every minute generally is not the best idea.
And of course, you can create a daemon and run it using Deployment, but I think Cronjob solution is more easy and faster to implement.