I am trying to get the total and free disk space on my Kubernetes VM so I can display the percentage of used space on it. I tried various metrics that included "filesystem" in the name, but none of them displayed the correct total disk size. Which one should be used for this?
Here is a list of the metrics I tried:
node_filesystem_size_bytes
node_filesystem_avail_bytes
node:node_filesystem_usage:
node:node_filesystem_avail:
node_filesystem_files
node_filesystem_files_free
node_filesystem_free_bytes
node_filesystem_readonly
According to my Grafana dashboard, the following query works nicely for alerting on available space:
100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})
The formula gives the percentage of used space on the selected disk (100 minus the percentage still available). Make sure you include the mountpoint and fstype labels in the metrics.
FS utilization can be calculated as
100 - (100 * ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} ) / (node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) ))
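If you want the percentage of free space rather than utilization, the same two metrics should work without the subtraction:
(node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}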
According to kubernetes-mixin you can use node:node_filesystem_usage, but if this metric doesn't show the correct values, please ensure that the device label is set correctly.
I highly recommend using the Prometheus Graph tab or the Grafana Explore tab to check all available metrics.
Trying to understand the difference between the two: Aggregator vs Aligner. The docs were not helpful for me.
What I'm trying to achieve is to get the bytes of logs generated within a week for each namespace and container combination. For example, I want to see that container C in namespace N generated 10 GB of logs during the last 7 days.
This is how far I got:
Resource type = Kubernetes Container
Metric = Log bytes
Group by = namespace_name and container_name
Aggregator = sum(?) mean(?)
Minimum alignment period = 1(?) 7(?) days
Aligner = sum(?) mean(?)
I was confused by this until I realized that a single metric, such as kubernetes.io/container/cpu/core_usage_time, is available on multiple different resources in my cluster.
So when you search for that metric, you'll get a whole lot of different resources that emit that metric. Aggregation is adding up all the data from those different resources WITH THAT SAME METRIC.
This all combines into one "time series" for that metric, an aggregation of all the individual time series from each of those different resources.
Now, alignment is the process of using that time series and putting all the data points through a function (over a period of time, known as the alignment period) which results in one single data point (per alignment period).
So aggregation combines the same metric across multiple resources, while alignment combines multiple data points in the same time series into one data point (per alignment period, which is why that field is required when using alignment).
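Applied to the original question, and assuming the console accepts a 7-day alignment period, a sketch of how the settings could map to that goal (not a definitive answer, just the concepts applied):
Resource type = Kubernetes Container
Metric = Log bytes
Group by = namespace_name, container_name
Aligner = sum (collapses each series to one total per alignment period)
Alignment period = 7 days
Aggregator = sum (adds up the aligned series that share the same namespace_name and container_name)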
I'm using the latest Prometheus (2.21.0) and the latest node-exporter.
I'm trying to run the query below and getting "no datapoints found"; however, both metrics kube_pod_container_resource_limits_memory_bytes and node_memory_MemTotal_bytes work independently and return data:
(sum(kube_pod_container_resource_limits_memory_bytes) / :node_memory_MemTotal_bytes:sum)*100
So, two questions:
I have never seen syntax like :node_memory_MemTotal_bytes:sum before - is it a valid Prometheus query?
What is wrong with the query if the syntax is correct?
This is a convention widely used in the Prometheus world. It means this metric is not scraped directly from some target(s), but is instead the result of a recording rule. This naming convention is described here.
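For illustration (the rule group name and expression here are assumptions, not taken from your setup), a recording rule producing such a metric could look like this:
groups:
  - name: node_memory
    rules:
      - record: :node_memory_MemTotal_bytes:sum
        expr: sum(node_memory_MemTotal_bytes)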
If the queries on both the left and right side return data individually but performing arithmetic on them leaves you with no data, it probably means the labels on them are not exactly the same. Execute them separately and compare the labels on your results. Assuming that :node_memory_MemTotal_bytes:sum does return data, you will probably have to wrap it in sum() as well, to remove any remaining labels.
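For example, a sketch of the adjusted query (assuming summing away the remaining labels is what you want) would be:
(sum(kube_pod_container_resource_limits_memory_bytes) / sum(:node_memory_MemTotal_bytes:sum)) * 100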
My atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, i.e. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  - max_over_time(help_command{job="Application name",Username=~".+"}[1w] offset 1w)
  > 0
)
which works, but only for one command; I have about 50 other commands that need to be added to that count.
I also tried:
"{__name__=~".+_command",job="app name"}[1w] offset 1w"
but this is obviously very expensive (it times out in the browser), and it does not combine well with max_over_time, which doesn't support it.
Any help? Am I using the metric in the wrong way? Is there a better way to query? My only option at the moment is to repeat the count (in the working format above) for each command.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
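With a single metric like that (command_usage is just an illustrative name here), the per-command usage could be selected without a regex on __name__, for example:
sum by(command)(command_usage{job="Application Name"})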
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__)(
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and the counts will drop to zero when it does, then you would need to use increase instead of subtraction, and you would run into the exact same performance issues as with max_over_time.
count by(__name__)(
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)
I am trying to monitor Latency on an ElasticBeanstalk environment using Grafana.
I get some things to work, and some things do not provide any information.
I am using "CloudWatch" data source.
There is ELB and ApplicationELB.
The ApplicationELB does not offer Latency metric. In fact, every metric I select here will result with "no data".
When I configure monitoring on AWS, I get the following graph:
I am able to query for Latency in a region using Grafana, and I do get some correlation:
As you can see, around 13:50 some requests timed out. But it is also obvious that Grafana is showing additional information from other environments, which I would like to ignore.
My query currently looks like this:
I know it is too broad, but I do not know how to refine it.
I tried using "InstanceName" as a dimension, but it is not clear to me which ELB I should look for. It seems to me like ApplicationELB should be what I am looking for, but that one does not offer Latency and does not provide any data either way.
Using AvailabilityZone does not help, and that's the only other option for dimension (other than InstanceName).
I need a way to refine the query so I see the same result in AWS and Grafana.
A clarification about ApplicationELB and ELB would be great also!
Application ELB vs ELB: they are just different types of load balancers provided by AWS https://aws.amazon.com/elasticloadbalancing/ - I'm not sure which one is used by ElasticBeanstalk.
You need to add a dimension to filter your metrics. Some metrics may need multiple dimensions for correct filtering. The available dimensions are listed in the docs. For example, LoadBalancerName is a correct dimension for the AWS/ELB namespace: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-cloudwatch-metrics.html
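As a rough sketch, the Grafana CloudWatch query for classic ELB latency could then look like this (the load balancer name is a placeholder - use the one created by your Elastic Beanstalk environment):
Namespace = AWS/ELB
Metric Name = Latency
Statistics = Average
Dimensions: LoadBalancerName = <your-load-balancer-name>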
I recommend using the existing published AWS dashboards (https://github.com/monitoringartist/grafana-aws-cloudwatch-dashboards - I'm the author) and then just customizing them for your needs.
I have Grafana with Bosun connected as an OpenTSDB source. The problem is that Grafana interprets the data differently than Bosun. To be precise, when I set the same query in Bosun and in Grafana, the resulting graphs differ. When I turn on gauge downsample, the graphs are the same. So I guess there is some implicit gauging in Grafana. I would be grateful for a hint on how to disable it.
Bosun:
Grafana:
The os.net.bytes metric includes metadata to indicate that it is a rate. When you use the default "auto" in Bosun's graph page it will convert the raw counter data into a rate calculation. Grafana's OpenTSDB data source does not have an auto mode, so things always default to a gauge unless you check the Rate box at the bottom of the metric.
In your example you should just need to check the Rate box to get the graphs to match. You can also use the Counter option and provide a max or reset value if you need to deal with counter overflows.
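For reference, checking Rate (with the Counter option) corresponds to an OpenTSDB query string roughly like the following - the aggregator and tag here are just illustrative:
sum:rate{counter}:os.net.bytes{host=*}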
You can also use the Bosun data source if you want to use a Bosun query instead of accessing OpenTSDB directly. In this example we combine two queries to generate a Singlestat panel (displays last value and a line graph in the background)
The __ny-nexus01/02 part comes from using tsdbrelay to denormalize the metric and address high tag cardinality issues.