Google Cloud: Metrics Explorer: "Aggregator" vs "Aligner" - What's the difference?

Trying to understand the difference between the two: Aggregator vs Aligner.
The docs were not helpful for me.
What I'm trying to achieve is to get the bytes of logs generated within a week for each namespace and container combination. For example, I want to see that container C in namespace N generated 10 GB of logs during the last 7 days.
This is how far I got:
Resource type = Kubernetes Container
Metric = Log bytes
Group by = namespace_name and container_name
Aggregator = sum(?) mean(?)
Minimum alignment period = 1(?) 7(?) days
Aligner = sum(?) mean(?)

I was confused with this until I realized that a single metric, such as kubernetes.io/container/cpu/core_usage_time is available in multiple different resources in my cluster.
So when you search for that metric, you'll get a whole lot of different resources that emit that metric. Aggregation is combining (summing, averaging, etc.) all the data from those different resources WITH THAT SAME METRIC.
This all combines into one "time series" for that metric, an aggregation of all the individual time series from each of those different resources.
Now, alignment is the process of using that time series and putting all the data points through a function (over a period of time, known as the alignment period) which results in one single data point (per alignment period).
So aggregation combines the same metric across multiple resources, while alignment combines multiple data points in the same time series into one data point (per alignment period, which is why that field is required when using alignment).
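(Strictly speaking, alignment happens first: each individual time series is aligned, and the aggregator then combines the aligned series.) Applied to the original question: set Aligner = sum with a 7-day alignment period, so each container's series collapses into one weekly total, and set Aggregator = sum grouped by namespace_name and container_name. As a rough sketch in Cloud Monitoring's MQL, assuming the "Log bytes" metric is logging.googleapis.com/byte_count (check the exact metric name in your project):

fetch k8s_container
| metric 'logging.googleapis.com/byte_count'
| align delta(7d)
| group_by [resource.namespace_name, resource.container_name], sum(val())

Here align delta(7d) plays the aligner role (one data point per series per 7-day period) and group_by ... sum(val()) plays the aggregator role (one output series per namespace/container pair).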

Related

Graphite: keepLastValue for a given time period instead of number of points

I'm using Graphite + Grafana to monitor (by sampling) queue lengths in a test system at work. The data that gets uploaded to Graphite is grouped into different series/metrics by properties of the payloads in the queue. These properties can be somewhat arbitrary, at least to the point where they are not all known at the time when the data collection script is run.
For example, a property could be the project that the payload belongs to and this could be uploaded as a separate series/metric so that we can monitor the queues broken down by the different projects.
This has the consequence that Graphite sends a lot of null values for certain metrics if the queues in the test system did not contain any payloads with properties that would group them into that specific series/metric.
For example, if a certain project did not have any payloads in the queue at the time when the data collection was run.
In Grafana this is not so nice as the line graphs don't show up as connected lines and gauges will show either null or the last non-null value.
For line graphs I can just choose to connect null values in Grafana, but for gauges that's not possible.
I know about the keepLastValue function in Graphite. It includes a limit on how long to keep the value, which I actually like, since I only want to keep the last value until the next time data collection runs. Data collection is run periodically at known intervals.
The problem with keepLastValue is that it expects a number of points as this limit. I would rather give it a time period instead. In Grafana the relationship between time and data points is very dynamic, so it's not easy to hard-code a good limit for keepLastValue.
Thus, my question is: Is there a way to tell Graphite to keep the last value for a given time instead of a given number of points?
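As far as I know there is no time-based variant, but since keepLastValue's limit is in data points, and points map to time through the storage schema's resolution, you can convert a time period into a point count whenever the retention step is known. A sketch with a hypothetical series name, assuming a 60-second storage resolution and a 10-minute collection interval:

keepLastValue(queues.myProject.length, 10)

Here 10 one-minute points cover one 10-minute collection interval; in general the limit would be ceil(collection_interval / storage_step) for your schema.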

How to make sense of the micrometer metrics using SpringBoot 2, InfluxDB and Grafana?

I'm trying to configure a SpringBoot application to export metrics to InfluxDB to visualise them using a Grafana dashboard. I'm using this dashboard as an example which uses Prometheus as a backend.
For some metrics I have no problem figuring out how to create graphs for them but for some others I don't know how to create the graphs or even if it's possible at all. So I enumerate the things I'm not really sure about in the following points:
Is there any documentation where the value units are described? The application I'm using as an example doesn't have any load on it, so sometimes I don't know whether the value is a bit, a byte, a second, a millisecond, a count, etc.
Some measurements contain the tag 'metric_type = histogram' with fields 'count', 'sum', 'mean' and 'upper'. Again, here I don't know what the value units are, what 'upper' means, or how I'm supposed to plot them. Examples of this are 'http_server_requests' or 'jvm_gc_pause'.
From what I see in the Grafana dashboard example, it seems I should use these measurements of type histogram to create both a graph with counts and graphs with duration. For example I see I should be able to create a graph with the number of requests and another one with their duration. Or for the garbage collector, I should be able to provide a graph for the number of minor and major GCs and another for their duration.
As an example of the measurements that get inserted into InfluxDB:
time=1625579637946000000, count=1, exception=None, mean=0.892144, method=GET, metric_type=histogram, outcome=SUCCESS, status=200, sum=0.892144, upper=0.892144, uri=/actuator/health
or
time=1625581132316000000, action=end of minor GC, cause=Allocation Failure, count=1, mean=2, metric_type=histogram, sum=2, upper=2
I agree the documentation for micrometer is not great. I've had to dig through the code to find answers...
Regarding your questions about jvm_gc_pause, it is a Timer and the implementation is AbstractTimer which is a class that wraps a Histogram among other components. This particular metric is registered by the JvmGcMetrics class. The various measurements that are published to InfluxDB are determined by the InfluxMeterRegistry.writeTimer(Timer timer) method:
sum: timer.totalTime(getBaseTimeUnit()) // The total time of recorded events
count: timer.count() // The number of times stop has been called on the timer
mean: timer.mean(getBaseTimeUnit()) // totalTime()/count()
upper: timer.max(getBaseTimeUnit()) // The max time of a single event
The base time unit is milliseconds.
Similarly, http_server_requests appears to be a Timer as well.
I believe you are correct that the sensible thing is to chart these on two separate Grafana panels: one panel for GC pause time (in milliseconds) using sum (or mean or upper), and one panel for GC event counts using count.
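To make the mapping concrete, here is a minimal sketch using Micrometer's Timer API directly. The meter name, tag, and durations are purely illustrative; note that Micrometer's dotted meter name is what shows up as the underscored measurement name in InfluxDB.

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class TimerDemo {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // http_server_requests and jvm_gc_pause are both Timers like this one
        Timer timer = Timer.builder("http.server.requests")
                .tag("uri", "/actuator/health")
                .register(registry);

        timer.record(Duration.ofMillis(5)); // one request that took 5 ms
        timer.record(Duration.ofMillis(3)); // another that took 3 ms

        // The same values InfluxMeterRegistry.writeTimer() would publish:
        System.out.println("count: " + timer.count());                          // 2
        System.out.println("sum:   " + timer.totalTime(TimeUnit.MILLISECONDS)); // 8.0
        System.out.println("mean:  " + timer.mean(TimeUnit.MILLISECONDS));      // 4.0
        System.out.println("upper: " + timer.max(TimeUnit.MILLISECONDS));       // 5.0
    }
}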

Prometheus query to get memory limit commitment for the entire cluster

I'm using the latest Prometheus 2.21.0 and the latest node-exporter.
Trying to run the query below, I get "no datapoints found", even though both metrics, kube_pod_container_resource_limits_memory_bytes and node_memory_MemTotal_bytes, work independently and return data:
(sum(kube_pod_container_resource_limits_memory_bytes) / :node_memory_MemTotal_bytes:sum)*100
So, two questions:
I never saw such syntax before (:node_memory_MemTotal_bytes:sum) - is it a valid Prometheus query?
What is wrong with the query if the syntax is correct?
This is a convention widely used in Prometheus land. It means the metric is not scraped directly from some target(s), but is instead the result of a recording rule; the naming convention is described in the Prometheus documentation on recording rules.
If queries on both the left and right side return data individually, but performing arithmetic on them leaves you with no data, it usually means the labels on the two sides don't match exactly. Execute them separately and compare the labels on the results. Assuming that :node_memory_MemTotal_bytes:sum does return data, you'll probably have to wrap it in sum() as well, to strip any remaining labels.
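For example, the fixed query might look like this (a sketch; it assumes the :node_memory_MemTotal_bytes:sum recording rule actually exists in your setup):

# sum() without a "by" clause drops all labels, so both sides become
# single label-free series that can be divided
(
  sum(kube_pod_container_resource_limits_memory_bytes)
  /
  sum(:node_memory_MemTotal_bytes:sum)
) * 100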

Creating Datadog alerts for when the percentage difference between two custom metrics goes over a specified percentage threshold

My current situation is that I have two different data feeds (Feed A & Feed B) and I have created custom metrics for both feeds:
A metric of order counts from Feed A
A metric of order counts from Feed B
The next step is to create alert monitoring for the agreed-upon threshold of difference between the two metrics. Say we have agreed that it is acceptable for order counts from Feed A to be within ~5% of order counts from Feed B. How can I go about creating that threshold and comparison between the two metrics that I have already developed in Datadog?
I would like to send alerts to myself when the % difference between the two data feeds is > 5% for a daily validation.
You might be able to get this if you...
Start creating a metric type monitor
To the far right of the metric definition, select "advanced"
Select "Add Query"
Input your metrics
In the field called "Express these queries as:", input (a-b)/b or some such
Trigger when the metric is above or equal to the threshold in total during the last 24 hours
Set Alert threshold >= 0.05
If you start having trouble as you set it up, you may want to reach out to support@datadoghq.com for assistance.
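Put together, the resulting monitor query would look something like this sketch (the metric names orders.feed_a and orders.feed_b are made up; substitute your own):

avg(last_1d):(sum:orders.feed_a{*} - sum:orders.feed_b{*}) / sum:orders.feed_b{*} >= 0.05

Note that (a-b)/b as written only fires when Feed A runs more than 5% above Feed B; if the difference can go in either direction, wrap the numerator in Datadog's abs() function.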

Can you calculate active users using time series

My Atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, i.e. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  -
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w] offset 1w)
  > 0
)
which works, but only for one command; I have about 50 other commands that would need to be added to that count.
I also tried:
{__name__=~".+_command",job="app name"}[1w] offset 1w
but this is obviously very expensive (it times out in the browser) and has issues with integrating max_over_time, which doesn't support it.
Any help? Am I using the metric in the wrong way? Is there a better way to query? My only option at the moment is to repeat the working count query above for each command.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
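With that labeled design, a hypothetical active-user query would need no __name__ matching at all, along the lines of this sketch (command_usage is the made-up metric name from the point above):

count(
  count by(Username) (increase(command_usage{job="Application Name"}[1w]) > 0)
)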
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__) (
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application Name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and the counts will drop to zero when it does, then you'd need to use increase instead of subtraction, and you'd run into the exact same performance issues as with max_over_time.
count by(__name__) (
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)