Dashboard visualizing CPU usage of Kafka container is chopped up

I want to monitor the CPU usage of the Kafka container, but the graph is chopped up into different pieces. There seem to be gaps in the graph, and after each gap a different colored line follows. The time range is the last 30 days. For the exporter we use danielqsj/kafka-exporter:v1.4.2
The promql query used to create this graph is:
rate(container_cpu_usage_seconds_total{container="cp-kafka-broker"}[1m])
Can I merge these lines into one continuous line? If so, with what PromQL expression or dashboard configuration?

This happens when at least one of the labels attached to the metric changes. The rate function keeps all the original labels from the underlying time series. In Prometheus, each time series is uniquely identified by the metric name (container_cpu_usage_seconds_total) and any labels (key-value pairs) attached to it (container, for instance). This is why Grafana uses different colors: they are different time series.
If you want to get a single series in Grafana you can aggregate using the sum operator:
sum(rate(container_cpu_usage_seconds_total{container="cp-kafka-broker"}[1m]))
which by default will not keep any of the original labels.
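If you still want separate lines for labels you care about (for example one line per pod), you can aggregate by only those labels instead; a rough sketch, assuming the metric carries a pod label in your setup:
sum by (pod) (rate(container_cpu_usage_seconds_total{container="cp-kafka-broker"}[1m]))
This merges series that differ only in the labels you left out (such as a changing container ID), while keeping one line per pod.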

Related

Don't see all points in Grafana on lower scales

At a lower zoom level I am obviously seeing several outliers, the largest of which is 18211.
If I zoom in, I start to see additional outliers.
Is it possible to configure Grafana to show all points all the time, or to aggregate them differently?
Backend is Graphite.
No, this is not possible due to space limitations: the panel can only display a limited number of points, so the data is consolidated to fit.
For example:
Suppose the graph has 60 slots to fill with numbers.
If the time period is one hour, each of these slots displays the metrics stored for one minute.
But if you shrink the interval to one minute, each of these slots displays the metrics stored for one second.
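If the worry is that consolidation hides the outliers, one option on the Graphite side is to change how points are merged, for example with consolidateBy; a sketch, where the metric path is just a placeholder for your series:
consolidateBy(your.metric.path, 'max')
With 'max' each rendered point shows the largest raw value in its interval, so spikes survive zooming out, even though you still do not see every individual point.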

Divide two Prometheus metrics that don't have the same dimension set?

I'm trying to get the average of some Prometheus metric (kafka_commit_latency) per Kubernetes pod. My approach was to get the sum of kafka_commit_latency and divide it by the number of Kubernetes pods for my application, so here are the variables I derived and the overall expression:
Sum of desired metric (kafka commit latencies) across my application: sum(kafka_consumer_commit_latency_avg_seconds{application="my_app"})
No. of Kubernetes pods for my application:
sum(node_namespace_pod:kube_pod_info:{pod=~".*my_app.*"})
Overall expression:
sum(kafka_consumer_commit_latency_avg_seconds{application="my_app"})/sum(node_namespace_pod:kube_pod_info:{pod=~".*my_app.*"})
but the main issue here is that the two vectors don't have anything in common in their label sets, so how can this division be made?
For binary operators, you can use vector matching modifiers. In your case, it would be on() with an empty label list, since you want to disregard all labels.
sum(kafka...) / ON() sum(node_...)
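Applied to the expressions from the question, the full query would be along these lines:
sum(kafka_consumer_commit_latency_avg_seconds{application="my_app"}) / on() sum(node_namespace_pod:kube_pod_info:{pod=~".*my_app.*"})
The empty on() tells Prometheus to match the two single-element result vectors without requiring any shared labels.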

Metadata in Cloudwatch metrics (alternative to dimensions?)

I'm logging custom metric data into AWS CloudWatch and trying to graph it. I assumed that dimensions in CloudWatch were metadata for enriching my data, but it seems that once you add dimensions you can no longer query across different combinations of dimensions. So for one thing, I don't really see the point of dimensions, as any unique combination is basically just a new metric. But more importantly, is there a way to log one set of data with different labels or dimensions and then slice and dice that data (e.g., in Grafana)?
To make it more concrete, I am logging cache load times in my application. I have one metric called "cache-miss", with several dimensions, for example:
the cached collection
the customer associated with the cached data
I want several different graphs:
Total cache misses (i.e., ignore dimensions, just see a count over time)
Total cache misses per collection (aggregate by first dimension)
Total cache misses per customer (aggregate by second dimension)
Is there some way to achieve this with Cloudwatch metrics and/or Grafana (or alternate tool)?
As you have mentioned, from https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html:
CloudWatch treats each unique combination of dimensions as a separate metric, even if the metrics have the same metric name. You can only retrieve statistics using combinations of dimensions that you specifically published. When you retrieve statistics, specify the same values for the namespace, metric name, and dimension parameters that were used when the metrics were created.
So if you have pushed the cache-miss metric with 2 dimensions, you can query it only with those 2 dimensions; you really can't just see a count over time.
Possible workarounds:
CloudWatch metric math - see the example in "CloudWatch does not aggregate across dimensions for your custom metrics", and the sketch after this list
in theory also Grafana 7+ transformation feature https://grafana.com/blog/2020/06/11/new-in-grafana-7.0-data-transformations-for-all-visualizations-that-support-queries/
Or you can switch from CloudWatch to a TSDB better suited to your use case.
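As a rough sketch of the metric math route: a SEARCH expression can pull in every cache-miss series regardless of its dimension values, and SUM then adds them together. The namespace MyApp and the dimension names Collection and Customer below are assumptions; substitute whatever you actually publish under:
e1: SEARCH('{MyApp,Collection,Customer} MetricName="cache-miss"', 'Sum', 300)
e2: SUM(e1)
Graphing only e2 gives the total count over time; for the per-collection and per-customer views you would still need separate expressions or a tool that can group the returned series.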

Grafana aggregation issue when changing time range (%CPU and more)

I have a % CPU usage Grafana graph.
The problem is that the source data is collected by collectd as Jiffies.
I am using the following formula:
collectd|<ServerName>|cpu-*|cpu-idle|value|nonNegativeDerivative()|asPercent(-6000)|offset(100)
The problem is that when I increase the time range (to 30 days, for example), Grafana aggregates the data, and since these are cumulative numbers (and not percentages or something it could simply average), the data in the graph becomes invalid.
Any idea how to create a better formula?
Have you looked at the aggregation plugin (read type) to compute averages?
https://collectd.org/wiki/index.php/Plugin:Aggregation/Config
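A minimal configuration sketch for that plugin, assuming you want the per-core CPU values averaged into one set of values per host (the option values follow the wiki example and may need adjusting for your setup):
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy "Host"
    GroupBy "TypeInstance"
    CalculateAverage true
  </Aggregation>
</Plugin>
This emits an extra aggregated series (under a path along the lines of aggregation-cpu-average) that you can graph instead of the raw per-core series.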
It is very strange that you have to use the nonNegativeDerivative function for a CPU metric; nonNegativeDerivative should only be used for ever-increasing counters, not a gauge-like metric such as CPU.

Graphite show top 10 metrics filtered by time

I am new to Graphite and can't understand how to do this:
I have a large number of time-metrics (celery metrics) in format stats.timers.*.median
I want to show:
Top N metrics with average value above X
Display them on one graph with the names of metrics
Right now I have averageAbove(stats.timers.*.median,50), but it displays the series without names and renders strangely and at a bad scale. Help, please! :)
You will need to chain a few functions together in order to get the desired result.
limit(sortByMaxima(averageAbove(stats.timers.*.median, X)), N)
Start with averageAbove as the base.
The next thing you want to do is get all the metrics in order, "top-to-bottom", by using sortByMaxima.
Then you can limit the results that are rendered with the limit function.
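To also get readable metric names in the legend, you can wrap the whole chain in aliasByNode; a sketch using the values from the question, and assuming the task name is the third segment of the stats.timers.*.median path (node index 2):
aliasByNode(limit(sortByMaxima(averageAbove(stats.timers.*.median, 50)), 10), 2)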
You might not be rendering the legend if you have too many metrics for the size of the graph. You can do 3 things:
Make the graph larger
Reduce the number of metrics using limit
Force the legend to be displayed via hideLegend