PostgreSQL: how are statistics collected in histogram_bounds?

The PostgreSQL documentation explains the histogram_bounds field in pg_stats as follows:
The histogram divides the range into equal frequency buckets, so all we have to do is locate the bucket that our value is in and count part of it and all of the ones before.
But I still cannot understand how the algorithms based on this field work. I would like a more detailed description of how ANALYZE populates this field.
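To make the quoted sentence concrete, here is a minimal Python sketch of an equal-frequency histogram and the "locate the bucket, count part of it and all of the ones before" estimate. It is only an illustration under simplifying assumptions, not PostgreSQL's actual code: the real ANALYZE works on a random sample of rows and leaves out the values already listed in most_common_vals before building the histogram. The function names equal_frequency_bounds and estimate_fraction_below are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_bounds(sample, num_buckets):
    """Return num_buckets + 1 boundary values so that each bucket
    holds roughly the same number of sampled values."""
    ordered = sorted(sample)
    last = len(ordered) - 1
    # Pick evenly spaced positions in the sorted sample.
    return [ordered[(i * last) // num_buckets] for i in range(num_buckets + 1)]


def estimate_fraction_below(bounds, value):
    """Estimate the fraction of rows whose column value is below `value`."""
    num_buckets = len(bounds) - 1
    if value <= bounds[0]:
        return 0.0
    if value >= bounds[-1]:
        return 1.0
    # Index of the bucket that contains `value`.
    b = bisect_right(bounds, value) - 1
    lo, hi = bounds[b], bounds[b + 1]
    # Count part of the current bucket (linear interpolation inside it)...
    part = (value - lo) / (hi - lo) if hi > lo else 0.0
    # ...plus all of the buckets before it.
    return (b + part) / num_buckets


bounds = equal_frequency_bounds(list(range(1, 101)), 4)  # [1, 25, 50, 75, 100]
fraction = estimate_fraction_below(bounds, 37.5)         # 0.375
```

Because every bucket is assumed to hold the same share of rows, the estimate only needs the bucket index and the position of the value inside that bucket.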

Related

Why is the sum of series fractional and less than the actual values in Graphite?

I am creating a dashboard using metrics in Graphite. I have tried consolidateBy to get all the metrics. The metric values look correct and are in the range of thousands.
The Graphite query for this is:
consolidateBy(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count, 'sum')
I want to get the total number of requests, which would be the sum of all these series.
So I tried the sum function, but it returns a sum of around 1-12, which is less than the actual values.
The query is:
sum(consolidateBy(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count, 'sum'))
This query also gives the same result
sum(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count)
My questions are:
Why does the sum of the series give point values that are lower than the actual values?
I just want to calculate the total number of requests. If there is an easier solution, can you please specify it?

How to plot daily increment data from a sparse data set with interpolation in Grafana?

How can I plot time-grouped increment data in a bar graph in Grafana, but with a sparse data source that needs interpolation BEFORE calculating the increment?
My data source is an InfluxDB with a sparse time series of accumulated values (think: gas meter readings). The data points are usually a few days apart.
My goal is to create a bar graph with value increase per day. For the missing values, linear interpolation will do just fine.
I've come up with
SELECT spread("value") FROM "gas" WHERE $timeFilter GROUP BY time(1d) fill(linear)
but this won't work as the fill(linear) command is executed AFTER the spread(value) command. If I use time periods much greater than my granularity of input data (e.g. time(14d)), it shows proper bars, but once I use smaller periods, the bars collapse to 0.
How can I apply the interpolation BEFORE the difference operation?
The situation you describe happens because fill() only fills in data for GROUP BY time() periods that contain no points at all. If you get spread=0, the period probably contains exactly one value, so fill() is not applied.
What I can suggest is a subquery with a shorter grouping period that interpolates your original signal first. Here is an example:
SELECT spread("interpolated_value") FROM (
    SELECT first("value") AS "interpolated_value" FROM "gas"
    WHERE $timeFilter
    GROUP BY time(10s) fill(linear)
)
GROUP BY time(1d) fill(none)
The subquery prepares a value for each 10s period (I recommend setting this period as high as you can accept). If a 10s period contains values, it picks the first one; if the period contains no value, it interpolates.
The main query then uses this interpolated set of values to calculate the spread.
All of the above only describes how to get interpolated data within shorter periods. I strongly recommend thinking about how usable this data is. A spread calculated from linearly interpolated data may be of questionable reliability.
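To illustrate "interpolate first, then take the difference" outside of InfluxDB, here is a small Python sketch. It assumes timestamps are plain day numbers, readings are sorted by time with distinct timestamps, and the meter value never decreases; the function names interpolate and daily_increments are made up for this example. Note that it subtracts the interpolated values at the day boundaries rather than computing spread() over many small steps, so it only approximates what the query above returns.

```python
def interpolate(readings, t):
    """Linearly interpolate a cumulative reading at time t.

    readings: list of (time, value) pairs sorted by time, times distinct.
    """
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the range of the readings")


def daily_increments(readings, first_day, last_day):
    """Increase per day for each day from first_day up to last_day."""
    # Interpolate the meter value at every day boundary first...
    boundaries = [interpolate(readings, d) for d in range(first_day, last_day + 1)]
    # ...and only then take the differences between neighbouring boundaries.
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


# Sparse meter readings taken on day 0, day 4 and day 6 (hypothetical data).
readings = [(0, 100.0), (4, 108.0), (6, 114.0)]
increments = daily_increments(readings, 0, 6)
```

The increments always add up to the difference between the last and first reading, but how they are split across days comes entirely from the linear assumption, which is the reliability concern mentioned above.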

Want to SUM all values for a specific date within column NOT sum all values in that column

I want to create a graph which shows the total capacity for each week relative to remaining availability across a series of specific dates. Just now when I attempt this in Power Bi it calculates this correctly for one of the values (remaining availability) but generates a value much higher than expected by manual calculation for the total capacity - instead showing the total for the entire column rather than for each specific date.
Why is Power Bi doing this and how can I solve it?
So far, I have tried generating the graph like this:
(https://i.stack.imgur.com/GV3vk.png)
As you can see, the capacity values are incredibly high; they should be 25 days.
The total availability values are correct (ranging from 0 to 5.5 days).
When I create matrices to see the breakdown of each sum, the values are correct. Only when the two are combined in one graph does one of them change to the total for the whole column.
If anyone could help me with this issue that would be great! Thanks!

Filter prometheus results by metric value, not by label value

Because Prometheus topk returns more results than expected, and because https://github.com/prometheus/prometheus/issues/586 requires client-side processing that has not yet been made available via https://github.com/grafana/grafana/issues/7664, I'm trying to pursue a different near-term work-around to my similar problem.
In my particular case most of the metric values that I want to graph will be zero most of the time. Only when they are above zero are they interesting.
I can find ways to write prometheus queries to filter data points based on the value of a label, but I haven't yet been able to find a way to tell prometheus to return time series data points only if the value of the metric meets a certain condition. In my case, I want to filter for a value greater than zero.
Can I add a condition to a prometheus query that filters data points based on the metric value? If so, where can I find an example of the syntax to do that?
If you're confused by brian's answer: The result of filtering with a comparison operator is not a boolean, but the filtered series. E.g.
min(flink_rocksdb_actual_delayed_write_rate > 0)
Will show the minimum value above 0.
In case you actually want a boolean (or rather 0 or 1), use something like
sum(flink_rocksdb_actual_delayed_write_rate > bool 0)
which will give you the greater-than-zero count.
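If the difference between the two forms is still unclear, here is a rough Python model of a single evaluation instant. It is only a sketch: a dict stands in for an instant vector, and the helper names filter_gt and bool_gt are invented for illustration, not part of any Prometheus client.

```python
def filter_gt(series, threshold):
    """Mimic `metric > threshold`: drop series that fail the comparison,
    keep the original value of the ones that pass."""
    return {labels: v for labels, v in series.items() if v > threshold}


def bool_gt(series, threshold):
    """Mimic `metric > bool threshold`: keep every series,
    but replace its value with 1 (passes) or 0 (fails)."""
    return {labels: (1 if v > threshold else 0) for labels, v in series.items()}


# One value per series at a single instant (hypothetical data).
series = {"host=a": 0.0, "host=b": 3.5, "host=c": 1.2}

min_above_zero = min(filter_gt(series, 0).values())   # like min(metric > 0)
count_above_zero = sum(bool_gt(series, 0).values())   # like sum(metric > bool 0)
```

With this data the filtered form yields the smallest non-zero value, while the bool form yields how many series are above zero.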
Filtering is done with the comparison operators, for example x > 0.
This can be solved with subqueries:
count_over_time((metric > 0)[5m:10s])
The query above would return the number of metric data points greater than 0 over the last 5 minutes.
This query may return inaccurate results, depending on how the second argument in square brackets (the step of the inner query) relates to the real interval between raw samples (the scrape_interval):
If the step exceeds the scrape_interval, then some samples may be skipped during the calculation. In this case the query returns a lower result than expected.
If the step is smaller than the scrape_interval, then some samples may be counted multiple times. In this case the query returns a higher result than expected.
So it is recommended to set the step to the scrape_interval in order to get accurate results.
P.S. The issues mentioned above are solved in VictoriaMetrics, a Prometheus-like monitoring system I work on. It provides the count_gt_over_time() function, which fits this case well. For example, the following MetricsQL query returns the exact number of raw samples with values greater than 0 over the last 5 minutes:
count_gt_over_time(metric[5m], 0)
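The effect of the subquery step on count_over_time((metric > 0)[5m:10s]) can be reproduced with a simplified Python model. This is a sketch under assumptions, not how Prometheus is implemented: each evaluation point simply takes the most recent raw sample at or before it, and staleness handling and step alignment are ignored. The function name count_gt_subquery is hypothetical.

```python
def count_gt_subquery(samples, threshold, window, step, end):
    """Simplified model of count_over_time((metric > threshold)[window:step]).

    samples: list of (timestamp, value) pairs sorted by timestamp.
    The inner expression is evaluated every `step` seconds; each evaluation
    uses the most recent raw sample at or before the evaluation time.
    """
    count = 0
    t = end - window + step
    while t <= end:
        latest = None
        for ts, value in samples:
            if ts <= t:
                latest = value
        if latest is not None and latest > threshold:
            count += 1
        t += step
    return count


# Raw samples scraped every 10 seconds, all greater than 0 (hypothetical data).
# 30 of them fall inside the last 300 seconds.
samples = [(ts, 1.0) for ts in range(0, 301, 10)]

matching = count_gt_subquery(samples, 0, window=300, step=10, end=300)
too_low = count_gt_subquery(samples, 0, window=300, step=30, end=300)
too_high = count_gt_subquery(samples, 0, window=300, step=5, end=300)
```

In this model only the step that equals the scrape interval reproduces the real number of raw samples; the larger step skips samples and the smaller step counts each sample twice.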

How to calculate the average value in a Prometheus query from Grafana

I was trying to create a Prometheus graph in Grafana, but I can't find the function that calculates the average value.
For example, when I create a graph for read_latency, the result contains many tags. If there are 3 machines, there will be 3 separate tags, one each for machine1, machine2 and machine3. Here is the graph (image: Prometheus).
I want to combine these three, so that there is only one tag, machines, whose value is the average of those three.
It seems that the Prometheus query language doesn't have something like average(), so I am not sure how to do this.
I used to work with InfluxDB, where the graph can look like this (image: influxDB).
I think you are looking for the avg() aggregation operator; see the documentation.
Use the built-in $__interval variable, where node and name are custom labels (depending on your metrics):
sum(avg_over_time(some_metric[$__interval])) by (node, name)
or a fixed value such as 1m or 1h:
sum(avg_over_time(some_metric[1m])) by (node, name)
You can filter using Grafana variables:
sum(avg_over_time(some_metric{cluster=~"$cluster"}[1m])) by (node, name)
Short answer: use avg() function to return the average value across multiple time series. For example, avg(metric) returns the average value for time series with metric name.
Long answer: Prometheus provides two functions for calculating the average:
avg_over_time calculates the average over the raw samples stored in the database within the lookbehind window specified in square brackets. The average is calculated independently for each matching time series. For example, avg_over_time(metric[1h]) calculates the average of the raw samples over the last hour for each time series with the metric name.
avg calculates the average over multiple time series. The average is calculated independently per each point on the graph.
If you need to calculate the average over raw samples across all the time series, which match the given selector, per each time bucket, e.g.:
SELECT
time_bucket('5 minutes', timestamp) AS t,
avg(value)
FROM table
GROUP BY t
Then the following PromQL query must be used:
sum(sum_over_time(metric[$__interval])) / sum(count_over_time(metric[$__interval]))
Do not use avg(avg_over_time(metric[$__interval])), since it returns average of averages, which isn't equal to real average. See this explanation for details.
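A small Python illustration of why the average of averages differs from the real average when series have different numbers of raw samples in the window. The data and the function names avg_of_averages and true_average are hypothetical; the two functions mirror avg(avg_over_time(...)) and sum(sum_over_time(...)) / sum(count_over_time(...)) for a single time bucket.

```python
def avg_of_averages(series):
    """Mirror avg(avg_over_time(metric[window])) for one time bucket."""
    per_series = [sum(values) / len(values) for values in series.values()]
    return sum(per_series) / len(per_series)


def true_average(series):
    """Mirror sum(sum_over_time(...)) / sum(count_over_time(...))."""
    total = sum(sum(values) for values in series.values())
    count = sum(len(values) for values in series.values())
    return total / count


# Raw samples per series inside one time bucket (hypothetical data).
series = {
    "machine1": [10.0, 20.0],              # 2 raw samples
    "machine2": [30.0, 30.0, 30.0, 30.0],  # 4 raw samples
}

skewed = avg_of_averages(series)  # (15 + 30) / 2
exact = true_average(series)      # (30 + 120) / 6
```

The average of averages gives every series the same weight regardless of how many samples it contributed, which is why the two results only agree when all series have the same sample count.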