Kubernetes dashboard: How to access more than 15 minutes of CPU and memory usage in the Web UI dashboard

The web UI dashboard (accessed by running 'kubectl proxy') is very useful and gives a great high-level overview of the cluster. However, the CPU and memory usage graphs seem to be hardcoded to display only the last 15 minutes. I am not able to find any setting that allows me to increase this, nor could I find any documentation on how to do it. Our team is exploring setting up Grafana/InfluxDB and other services to get more detailed metrics, but it would be nice if there were an option to increase the time range in the web UI dashboard.

Related

Prometheus servicemonitor interval is being ignored

My metrics are being scraped every 30 seconds, even though I've specified a 10s interval when defining my ServiceMonitor.
I've created a ServiceMonitor for my exporter that seems to be working well. I can see my exporter as a target, and view metrics on the /graph endpoint. However, on the "Targets" page, the "Last Scrape" column shows the interval as 30s (I kept refreshing the page to see how high the number of seconds would go; it topped out at 30). Sure enough, zooming in on the graph shows that metrics are coming in every 30 seconds.
I've set my servicemonitor's interval to 10s, which should override any other intervals. Why is it being ignored?
endpoints:
- port: http-metrics
  scheme: http
  interval: 10s
First: double-check that you've changed the ServiceMonitor you actually meant to change and that you are looking at scrapes from that ServiceMonitor.
Go to the web UI of your Prometheus and select Status -> Configuration.
Now try to find the part of the config that the prometheus-operator created based on your ServiceMonitor. Searching by the ServiceMonitor name usually works: there should be a section whose job_name contains your ServiceMonitor name.
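For illustration only, such a section might look roughly like the sketch below (the job naming scheme and the exact fields vary between prometheus-operator versions, and the namespace/name here are placeholders):
scrape_configs:
- job_name: serviceMonitor/monitoring/my-exporter/0   # contains the ServiceMonitor name
  scrape_interval: 30s                                 # the value to check - it should say 10s
  metrics_path: /metrics
  kubernetes_sd_configs:
  - role: endpoints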
Now look at the scrape_interval value in this section. If it is "30s" (or anything else other than the expected "10s") and you are sure you're looking at the correct section, then one of the following happened:
Your ServiceMonitor does not actually contain "10s" - maybe it was not applied correctly? Verify the live object in your cluster (see the sketch after this list).
The prometheus-operator did not update the Prometheus configuration - maybe it is not working, is crashing, or has silently stopped? It is quite safe to simply restart the prometheus-operator pod, so that is worth trying.
Prometheus did not load the new config correctly. The prometheus-operator updates a Secret, and when it changes, a sidecar in the Prometheus pod triggers a reload in Prometheus. Maybe that didn't work? Look again in the web UI under Status -> Runtime & Build Information for "Configuration reload". Is it successful? Does the "Last successful configuration reload" time roughly match your change to the ServiceMonitor? If it was not successful, then perhaps some other change made the final Prometheus config invalid and Prometheus is unable to load it.
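For the first point, this is a minimal sketch of what the live ServiceMonitor should contain; the metadata, labels and port name are placeholders, and only the placement of interval matters here:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-exporter            # placeholder
  namespace: monitoring        # must be a namespace your Prometheus is configured to watch
spec:
  selector:
    matchLabels:
      app: my-exporter         # placeholder - must match your Service's labels
  endpoints:
  - port: http-metrics
    scheme: http
    interval: 10s              # per-endpoint interval; overrides the global scrape_interval
If the live object (kubectl get servicemonitor my-exporter -o yaml) does not show the interval, the change was never applied.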
If I understand correctly, as per Configure Grafana Query Interval with Prometheus, this works as expected:
This works as intended. This is because Grafana chooses a step size, and until that step is full, it won't display new data, or rather Prometheus won't return a new data point. So this all works as expected. You would need to align the step size to something smaller in order to see more data more quickly. The Prometheus interface dynamically sets the step, which is why you get a different result there.
It's a per-panel configuration in Grafana.
Grafana itself doesn't recommend using intervals of less than 30 seconds; you can find a reasonable explanation in the "Dashboard refresh rate issues" part of Grafana's "Troubleshoot dashboards" documentation page:
Dashboard refresh rate issues: By default, Grafana queries your data source every 30 seconds. Setting a low refresh rate on your dashboards puts unnecessary stress on the backend. In many cases, querying this frequently makes no sense, because the data isn’t being sent to the system such that changes would be seen.
We recommend the following:
- Do not enable auto-refreshing on dashboards, panels, or variables unless you need it. Users can refresh their browser manually, or you can set the refresh rate for a time period that makes sense (every ten minutes, every hour, and so on).
- If it is required, then set the refresh rate to once a minute. Again, users can always refresh the dashboard manually.
- If your dashboard has a longer time period (such as a week), then you really don’t need automated refreshing.
For more information, you can also visit the still-open "Specify desired step in Prometheus in dashboard panels" GitHub issue.
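To make that concrete, the two knobs discussed above live in the dashboard JSON model as the dashboard-level refresh field and the per-target interval ("Min step" in the panel's query options). A rough sketch, shown in YAML form for readability, with placeholder names and query:
refresh: 1m                      # dashboard-level auto-refresh, per the recommendation above
panels:
- title: My exporter metrics     # placeholder panel
  targets:
  - expr: up{job="my-exporter"}  # placeholder query
    interval: 10s                # "Min step": the smallest query step Grafana will use for this target
    refId: A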

Way to configure notifications/alerts for a Kubernetes pod which is reaching 90% memory and which is not exposed to the internet (backend microservice)

I am currently working on a solution for alerts/notifications where we have microservices deployed on Kubernetes as frontend and backend services. There have been multiple occasions where backend services failed to restart or reached 90% of the allocated pod limit after running out of memory. To identify such pods, we want an alert mechanism that looks at when they fail or reach saturation. We have Prometheus and Grafana as monitoring services but have not been able to configure alerts, as I have quite limited knowledge in these areas; any suggestions and references describing how to achieve this in detail would be helpful. Please do let me know.
I did search the internet for this, but almost everything points to node-level or cluster-level monitoring only. :(
The query used to check the memory usage is:
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",namespace=~"^$namespace$",pod_name=~"^$deployment-[a-z0-9]+-[a-z0-9]+"}) by (pod_name)
I saw this recently on Google; it might be helpful to you: https://groups.google.com/u/1/g/prometheus-users/c/1n_z3cmDEXE?pli=1
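Beyond that thread, if the Prometheus in your cluster is managed by the prometheus-operator, one hedged starting point is a PrometheusRule along these lines. The names, namespace and the 90% threshold are placeholders, and the labels assume kube-state-metrics v2 plus a recent cAdvisor (where the label is pod rather than the older pod_name used in your query):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-memory-alerts          # placeholder
  namespace: monitoring            # placeholder - wherever your Prometheus picks up rules
spec:
  groups:
  - name: pod-memory
    rules:
    - alert: PodMemoryNearLimit
      # working-set memory of each pod's containers vs. the pod's memory limits
      expr: |
        sum by (namespace, pod) (container_memory_working_set_bytes{container!="", container!="POD"})
          /
        sum by (namespace, pod) (kube_pod_container_resource_limits{resource="memory"})
          > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.pod }} is above 90% of its memory limit"
From there, Alertmanager (or Grafana alerting on the same expression) handles the actual notification routing, for example to email or Slack.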

Track time taken to scale up in Kubernetes using HPA and CA

I am trying to track and monitor how much time a pod takes to come online/healthy/Running.
I am using EKS, and I have HPA and cluster-autoscaler installed on my cluster.
Let's say I have a deployment with a HorizontalPodAutoscaler scaling policy with 70% targetAverageUtilization.
So whenever the average utilization of the deployment goes beyond 70%, HPA will trigger the creation of a new pod. Now, depending on different factors, such as whether a node is available or not and whether the image needs to be downloaded or is already present in the cache, scaling can take from a few seconds to a few minutes.
I want to track this duration: every time a pod is scheduled, how much time does it take to reach the Running state? Any suggestions?
Or any direction I should be looking in.
I found this Cluster Autoscaler Visibility Logs, but this is only available in GCE.
I am looking for any solution: it could be an out-of-the-box integration, or raising events and storing them in some time-series DB, or scraping data from Prometheus. But I couldn't find any solution for this so far.
Thanks in advance.
There is nothing out of the box for this; you will need to build something yourself.
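One hedged way to build it, assuming kube-state-metrics is installed and recent enough to expose kube_pod_status_ready_time alongside kube_pod_created, is a recording rule that captures the seconds between pod creation (which is when the HPA scales up) and the pod reporting Ready; the rule name below is a placeholder:
groups:
- name: pod-startup
  rules:
  - record: namespace_pod:startup_duration:seconds
    # seconds from pod object creation (HPA scale-up) until the pod reported Ready;
    # this window includes node provisioning by the cluster-autoscaler, scheduling,
    # image pull and container start
    expr: kube_pod_status_ready_time - kube_pod_created
If your kube-state-metrics does not expose kube_pod_status_ready_time, then kube_pod_start_time - kube_pod_created is a rougher substitute (it stops at container start rather than readiness). Either expression can be filtered with a pod name regex to limit it to the deployment behind the HPA, and the same rule can be wrapped in a PrometheusRule object if you run the prometheus-operator.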

Build Stackdriver Dashboard that contains a filtered list of log entries

We are evaluating Stackdriver as an alternative to our ELK stack, and I'm missing a few features that I have in Kibana (1).
Most importantly, I can't find a way to show the actual logs in a Stackdriver dashboard; I can only show graphs based on the logs. Switching between two tabs all the time (2 and 3) and adapting the filters in both of them seems very inconvenient for log/error analysis.
Is there a way that I can have a dashboard that also shows logs (based on the filters in the dashboard search)?
There is currently no way to show raw logs in the Metrics Dashboard, unfortunately.
You can file a feature request to add this functionality to Stackdriver.

Stackdriver custom metric aggregate alerts

I'm using Kubernetes on Google Compute Engine and Stackdriver. The Kubernetes metrics show up in Stackdriver as custom metrics. I successfully set up a dashboard with charts that show a few custom metrics such as "node cpu reservation". I can even set up an aggregate mean of all node cpu reservations to see if my total Kubernetes cluster CPU reservation is getting too high. See screenshot.
My problem is, I can't seem to set up an alert on the mean of a custom metric. I can set up an alert on each node, but that isn't what I want. I can also set up "Group Aggregate Threshold Condition", but custom metrics don't seem to work for that. Notice how "Custom Metric" is not in the dropdown.
Is there a way to set an alert for an aggregate of a custom metric? If not, is there some way I can alert when my Kubernetes cluster is getting too high on CPU reservation?
Alerting on an aggregation of custom metrics is currently not available in Stackdriver. We are considering various solutions to the problem you're facing.
Note that sometimes it's possible to alert directly on symptoms of the problem rather than monitoring the underlying resources. For example, if you're concerned about CPU because X happens and users notice, and X is bad, you could consider alerting on the symptoms of X instead of alerting on CPU.