Don't show data from a redeployed pod in Grafana using PromQL

I have a PromQL query that looks at max latency per quantile and displays the data in Grafana, but it shows data from a pod that was redeployed and no longer exists. The pod's data is well within the 15-day retention period.
Here's the query: max(latency{quantile="..."})
The max latency comes from the time the old pod was throttling; shortly after that it was redeployed and went back to normal. Now I want to look only at the max latency of what is currently live.
Everything I have found so far about staleness says stale series should be filtered out behind the scenes, but that doesn't seem to happen in my current setup and I cannot figure out what I should change.
When I manually add the specific instance ID to the query it works well, but the ID changes every time the pod is redeployed: max(latency{quantile="...", exported_instance="ID"})
Here is a long list of similar questions I found; some are unanswered, and some are asking something different. The ideas that are somewhat relevant but don't solve the problem in a sustainable way are:
Suggestions from the links below that were not helpful
Change the staleness period: won't work because it affects the whole system.
Restart Prometheus: won't work because it can't be done every time a pod is redeployed.
List a separate graph per machine: won't work with a max query.
Links to similar questions
How do I deal with old collected metrics in Prometheus?
Switch prom->elk: log based monitoring
Get data from prometheus only from last scrape iteration
Staleness is a relevant concept; for Singlestat it shows how to use only the current value
Grafana dashboard showing deleted information from prometheus
Default retention is 15 days, hide machines with a checkbox
How can I delete old Jobs from Prometheus?
Manual query/restart
grafana variable still catch old metrics info
Update prometheus targets
Clear old data in Grafana
Delete with prometheus settings
https://community.grafana.com/t/prometheus-push-gateway/18835
Not answered
https://www.robustperception.io/staleness-and-promql
Explains how the new staleness handling works, but without examples
The end goal
Display the max latency across all sources that are currently live, dropping data from sources that no longer exist.

You can use the auto-generated metric named up to isolate the metrics you need from the rest. You can easily determine from the up metric which metric sources are offline.
up{job="<job-name>", instance="<instance-id>"}:
1 if the instance is healthy, i.e. reachable,
or 0 if the scrape failed.
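A minimal sketch of how to combine this with the original query, assuming the latency series and the up series share an instance label (the question's exported_instance label suggests the join label may need adjusting, e.g. via label_replace):

max(latency{quantile="..."} and on(instance) (up == 1))

The and on(instance) part keeps only latency series whose matching target currently reports up == 1, so series from pods that are no longer scraped drop out of the max.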

Related

Is there an HPA configuration that could autoscale based on previous CPU usage?

We currently have a GKE environment with several HPAs for different deployments. All of them work just fine out of the box, but sometimes our users still experience some delay during peak hours.
Usually this delay is the time it takes the new instances to start and become ready.
What I'd like is a way to have an HPA that could predict usage and scale eagerly before it is needed.
The simplest implementation I could think of is an HPA that takes the average usage from previous days and scales up or down in advance (say 10 minutes earlier) based on the historic usage for the current time frame.
Is there anything like that in vanilla k8s or GKE? I was unable to find anything like that in GCP's docs.
If you want to scale your applications based on events/custom metrics, you can use KEDA (Kubernetes-based Event Driven Autoscaler), which supports scaling based on GCP Stackdriver, Datadog or Prometheus metrics (and many other scalers).
What you need to do is create some queries that get the CPU usage at the moment CURRENT_TIMESTAMP - 23H50M (or the aggregated value for the last week), then define some thresholds to scale your application up/down.
If you have trouble doing this with your monitoring tool, you can create a custom metrics API that queries the monitoring API and aggregates the values (with the time shift) before serving them to the metrics-api scaler.
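If the metrics live in Prometheus, the time-shifted query could look roughly like this sketch (the metric name, namespace and pod label values here are assumptions, not something from the original setup):

# average per-pod CPU usage at roughly this time yesterday,
# shifted by 23h50m so the autoscaler reacts ~10 minutes ahead of the historic peak
avg(
  rate(container_cpu_usage_seconds_total{namespace="prod", pod=~"myapp-.*"}[10m] offset 23h50m)
)

KEDA's Prometheus scaler would then compare the query result against the configured threshold to decide when to scale.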

Are time-related OpenTelemetry metrics an anti-pattern?

When setting up metrics and telemetry for my API, is it an anti-pattern to track something like "request-latency" as a metric, possibly in addition to tracking it as a span?
For example, say my API makes a request to another API in order to generate a response. If I want to track latency information such as:
My API's response latency
The latency for the request from my API to the upstream API
DB request latency
Etc.
Those seem like good candidates for spans, but I think it would also be helpful to have them as metrics.
Is it a bad practice to duplicate the OTEL data capture (as both a metric and a span)?
I can likely extract that information and avoid duplication, but it might be simpler to log it as a metric as well.
Thanks in advance for your help.
I would say traces and metrics each have their own use cases. Traces usually have a short retention period (AWS X-Ray: 30 days), and you can generate metrics from traces only for a short time window (AWS X-Ray: 24 hours). If you need a longer time range, those queries will be expensive (and slow). So metrics stored in a time-series DB are the better fit for longer-term stats.
BTW: there is also the experimental Span Metrics Processor, which you can use to generate Prometheus metrics from spans directly in the OTEL Collector, with no additional app instrumentation/code.
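As a hedged illustration of what consuming those generated metrics could look like, assuming the processor is configured to emit a latency histogram named latency_bucket with a service_name label (exact metric and label names vary by collector version and configuration):

# p95 request latency per service over the last 5 minutes,
# computed from the spanmetrics-generated histogram buckets
histogram_quantile(
  0.95,
  sum by (le, service_name) (rate(latency_bucket[5m]))
)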

kubernetes / prometheus custom metric for horizontal autoscaling

I'm wondering about the approach to take for our server setup. We have pods that are short lived. They start with a minimum of 3 pods, and each server waits on a single request that it handles; then the pod is destroyed. I'm not sure of the mechanism by which the pod is destroyed, but my question is not about that part anyway.
There is an "active session count" metric that I am envisioning. Each of these pod resources could make a rest call to some "metrics" pod that we would create for our cluster. The metrics pod would expose a sessionStarted and sessionEnded endpoint - which would increment/decrement the kubernetes activeSessions metric. That metric would be what is used for horizontal autoscaling of the number of pods needed.
Since a pod being "up" counts as zero active sessions, a custom event would increment the metric server's session count with a REST call on session start and decrement it again on session end (the pod being up does not indicate whether or not it has an active session).
Is it correct to think that I need this metric server (and write it myself)? Or is there something that Prometheus exposes where this type of metric is supported already - rest clients and all (for various languages), that could modify this metric?
Looking for guidance and confirmation that I'm on the right track. Thanks!
It's impossible to give only one way to solve this, and your question is somewhat opinion-based. However, there is a useful similar question on Stack Overflow; please check the comments there, which may give you some tips. If nothing works, you should probably write the script yourself. There is no exact solution on Kubernetes's side.
Please also take Apache Flink into consideration. It has a Reactive Mode that works in combination with Kubernetes:
Reactive Mode allows to run Flink in a mode, where the Application Cluster is always adjusting the job parallelism to the available resources. In combination with Kubernetes, the replica count of the TaskManager deployment determines the available resources. Increasing the replica count will scale up the job, reducing it will trigger a scale down. This can also be done automatically by using a Horizontal Pod Autoscaler.

split Kubernetes cluster costs between namespaces

We are running a multi-tenant Kubernetes cluster on EKS (in AWS) and I need to come up with an appropriate way of charging all the teams that use the cluster. We have the costs of the EC2 worker nodes, but I don't know how to split these costs up given the metrics from Prometheus. To make it trickier, I also need to give the cost per team (or pod/namespace) for the past week and the past month.
Each team uses a different namespace but this will change soon so that each pod will have a label with the team name.
From looking around I can see that I'll need to use the container_spec_cpu_shares and container_memory_working_set_bytes metrics, but how can these two metrics be combined and used so that we get a percentage of the worker node cost?
Also, I don't know promql well enough to know how to get the stats for the past week and the past month for the range vector metrics.
If anyone can share a solution if they've done this already, or maybe even point me in the right direction, I would appreciate it.
Thanks
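For what it's worth, a rough PromQL sketch of the per-namespace share idea, using actual CPU usage rather than container_spec_cpu_shares as the cost driver; the container_cpu_usage_seconds_total metric and its namespace/container labels are assumptions about a typical cAdvisor setup, and memory could be handled the same way with container_memory_working_set_bytes:

# each namespace's share of total CPU usage over the past week
# (swap [1w] for [30d] to get the past month)
sum by (namespace) (
  rate(container_cpu_usage_seconds_total{container!=""}[1w])
)
/
scalar(sum(rate(container_cpu_usage_seconds_total{container!=""}[1w])))

Multiplying that fraction by the weekly EC2 bill would give a per-namespace cost estimate.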

Rebalance data after adding nodes

I'm using Cassandra 2.0.4 (with vnodes) and 2 days ago I added 2 nodes (.210 and .195). I expected Cassandra to redistribute the existing data automatically, but today I still find this nodetool status
Issuing a nodetool repair on any of the nodes doesn't do anything either (the repair finishes within seconds). The logs state that the repair is being executed as expected, but after preparing the repair plan it pretty much instantly finishes executing said plan.
Was I wrong to assume the existing data would be redistributed at all, or is something wrong? And if that isn't the case; how do I manually 'rebalance' the data?
Worth noting: I seem to have lost some data after adding these new nodes. Issuing a select on certain keys only returns data from the last couple of days rather than weeks, which makes me think the data is saved on .92 while Cassandra queries for it on one of the new servers. But that's really just an uneducated guess; I may have simply broken something during all of my trial & error tests, meaning the data is actually gone (even though I never issue deletes).
Can anyone enlighten me?
There is currently no manual rebalance option for vnode-enabled clusters.
But your cluster doesn't look unbalanced based on the nodetool status output you show. I'm curious as to why node .88 has only 64 tokens compared to the others but that isn't a problem per se. When a cluster is smaller there will be a slight variance in the balance of data across the nodes.
As for the data issues, you can try running nodetool repair -pr on each node around the ring, followed by nodetool cleanup, and see if that helps.