Creating Datadog alerts for when the percentage difference between two custom metrics goes over a specified percentage threshold - triggers

My current situation is that I have two different data feeds (Feed A & Feed B) and I have created custom metrics for both feeds:
Metric of Order counts from Feed A
Metric of order counts from Feed B
The next step is to create alert monitoring for the agreed-upon threshold of difference between the two metrics. Say we have agreed that it is acceptable for order counts from Feed A to be within ~5% of order counts from Feed B. How can I go about creating that threshold and comparison between the two metrics that I have already developed in Datadog?
I would like to send alerts to myself when the % difference between the two data feeds is > 5% for a daily validation.

You might be able to get this if you...
Start creating a metric type monitor
To the far right of the metric definition, select "advanced"
Select "Add Query"
Input your metrics
In the field called "Express these queries as:", input (a-b)/b or some such
Trigger when the metric is above or equal to the threshold in total during the last 24 hours
Set Alert threshold >= 0.05
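The arithmetic behind the expression and threshold above can be checked outside Datadog. Here is a minimal Python sketch, assuming the daily order totals for each feed are already available as plain numbers. The function names percent_difference and should_alert are hypothetical and not part of any Datadog API. The sketch takes the absolute value so that a deviation in either direction counts, which is an assumption about what "within ~5%" means:

```python
def percent_difference(feed_a: float, feed_b: float) -> float:
    """Relative difference of feed_a from feed_b, i.e. abs(a - b) / b."""
    if feed_b == 0:
        raise ValueError("feed_b is zero; the relative difference is undefined")
    return abs(feed_a - feed_b) / feed_b


def should_alert(feed_a: float, feed_b: float, threshold: float = 0.05) -> bool:
    """True when the difference meets or exceeds the threshold (>= 0.05 here)."""
    return percent_difference(feed_a, feed_b) >= threshold


print(should_alert(1040, 1000))  # feeds are 4% apart
print(should_alert(940, 1000))   # feeds are 6% apart
```

Without the absolute value, (a-b)/b only trips the alert when Feed A is higher than Feed B, so it is worth deciding which behaviour the daily validation needs.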
If you start having trouble as you start setting it up, you may want to reach out to support@datadoghq.com to get their assistance.

Related

Graphite: keepLastValue for a given time period instead of number of points

I'm using Graphite + Grafana to monitor (by sampling) queue lengths in a test system at work. The data that gets uploaded to Graphite is grouped into different series/metrics by properties of the payloads in the queue. These properties can be somewhat arbitrary, at least to the point where they are not all known at the time when the data collection script is run.
For example, a property could be the project that the payload belongs to and this could be uploaded as a separate series/metric so that we can monitor the queues broken down by the different projects.
This has the consequence that Graphite sends a lot of null values for certain metrics if the queues in the test system did not contain any payloads with properties that would group it into that specific series/metric.
For example, if a certain project did not have any payloads in the queue at the time when the data collection was run.
In Grafana this is not so nice, as the line graphs don't show up as connected lines and gauges will show either null or the last non-null value.
For line graphs I can just choose to connect null values in Grafana, but for gauges that's not possible.
I know about the keepLastValue function in Graphite, and it includes a limit for how long to keep the value, which I like very much because I would like to keep the last value only until the next time data collection is run. Data collection is run periodically at known intervals.
The problem with keepLastValue is that it expects a number of points as this limit. I would rather give it a time period instead. In Grafana the relationship between time and data points is very dynamic, so it's not easy to hard-code a good limit for keepLastValue.
Thus, my question is: Is there a way to tell Graphite to keep the last value for a given time instead of a given number of points?
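To make the desired behaviour concrete, here is a minimal Python sketch of a time-limited forward fill applied client-side to (timestamp, value) pairs. It is not a Graphite function. The names keep_last_value_for and max_age_seconds are hypothetical and used only for illustration, and the point layout is an assumption:

```python
from typing import List, Optional, Tuple

Point = Tuple[int, Optional[float]]  # (unix timestamp in seconds, value or None)


def keep_last_value_for(points: List[Point], max_age_seconds: int) -> List[Point]:
    """Fill None values with the last known value, but only while that value
    is at most max_age_seconds old."""
    filled: List[Point] = []
    last_ts: Optional[int] = None
    last_value: Optional[float] = None
    for ts, value in points:
        if value is not None:
            # Remember the most recent real sample.
            last_ts, last_value = ts, value
        elif last_ts is not None and ts - last_ts <= max_age_seconds:
            # Gap is still within the allowed age: carry the value forward.
            value = last_value
        filled.append((ts, value))
    return filled


series = [(0, 3.0), (60, None), (120, None), (180, None), (240, 5.0)]
print(keep_last_value_for(series, max_age_seconds=120))
```

Here the limit depends only on elapsed time, so it behaves the same regardless of how many points fall inside the window.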

Google Cloud: Metrics Explorer: "Aggregator" vs "Aligner" - What's the difference?

Trying to understand the difference between the two: Aggregator vs Aligner.
The docs were not helpful for me.
What I'm trying to achieve is to get the bytes of logs generated within a week for each namespace and container combination. For example, I want to see that container C in namespace N generated 10 GB of logs during the last 7 days.
This is how far I got:
Resource type = Kubernetes Container
Metric = Log bytes
Group by = namespace_name and container_name
Aggregator = sum(?) mean(?)
Minimum alignment period = 1(?) 7(?) days
Aligner = sum(?) mean(?)
I was confused with this until I realized that a single metric, such as kubernetes.io/container/cpu/core_usage_time is available in multiple different resources in my cluster.
So when you search for that metric, you'll get a whole lot of different resources that emit that metric. Aggregation is adding up all the data from those different resources WITH THAT SAME METRIC.
This all combines into one "time series" for that metric, an aggregation of all the individual time series from each of those different resources.
Now, alignment is the process of taking a single time series and putting its data points through a function (over a period of time, known as the alignment period), which results in one single data point per alignment period. In Cloud Monitoring, alignment is applied to each individual time series before those series are aggregated.
So aggregation combines the same metric across multiple resources, while alignment combines multiple data points in the same time series into one data point (per alignment period, which is why that field is required when using alignment).
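The distinction can be illustrated with a small Python sketch on made-up data. It assumes two containers that each report log bytes as (timestamp, value) points, a sum aligner, and a sum aggregator. The names align_sum and aggregate_sum are hypothetical and are not part of the Cloud Monitoring API:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = List[Tuple[int, int]]  # (timestamp in seconds, log bytes)


def align_sum(series: Series, period: int) -> Dict[int, int]:
    """Alignment: collapse the points of ONE series into one value per period."""
    buckets: Dict[int, int] = defaultdict(int)
    for ts, value in series:
        buckets[ts - ts % period] += value  # key is the start of the period
    return dict(buckets)


def aggregate_sum(aligned: List[Dict[int, int]]) -> Dict[int, int]:
    """Aggregation: combine SEVERAL aligned series into one, period by period."""
    total: Dict[int, int] = defaultdict(int)
    for series in aligned:
        for bucket, value in series.items():
            total[bucket] += value
    return dict(total)


# Made-up samples for two containers that emit the same metric.
container_1 = [(0, 10), (30, 20), (60, 5)]
container_2 = [(0, 1), (45, 2), (90, 3)]

aligned = [align_sum(s, period=60) for s in (container_1, container_2)]
print(aligned)                 # one value per 60 s period, per container
print(aggregate_sum(aligned))  # one value per 60 s period, across containers
```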

How to make sense of the micrometer metrics using SpringBoot 2, InfluxDB and Grafana?

I'm trying to configure a SpringBoot application to export metrics to InfluxDB to visualise them using a Grafana dashboard. I'm using this dashboard as an example which uses Prometheus as a backend.
For some metrics I have no problem figuring out how to create graphs for them but for some others I don't know how to create the graphs or even if it's possible at all. So I enumerate the things I'm not really sure about in the following points:
Is there any documentation where a value unit is described? The application I'm using as an example doesn't have any load on it so sometimes I don't know whether the value is a bit, a byte, a second, a millisecond, a count, etc.
Some measurements contain the tag 'metric_type = histogram' with fields 'count', 'sum', 'mean' and 'upper'. Again, here I don't know what the value units are, what upper means or how I'm supposed to plot them. Examples of this are 'http_server_requests' or 'jvm_gc_pause'.
From what I see in the Grafana dashboard example, it seems I should use these measurements of type histogram to create both a graph with counts and graphs with duration. For example I see I should be able to create a graph with the number of requests and another one with their duration. Or for the garbage collector, I should be able to provide a graph for the number of minor and major GCs and another for their duration.
As an example of measures I get inserted into InfluxDB:
time count exception mean method metric_type outcome status sum upper uri
1625579637946000000 1 None 0.892144 GET histogram SUCCESS 200 0.892144 0.892144 /actuator/health
or
time action cause count mean metric_type sum upper
1625581132316000000 end of minor GC Allocation Failure 1 2 histogram 2 2
I agree the documentation for micrometer is not great. I've had to dig through the code to find answers...
Regarding your questions about jvm_gc_pause, it is a Timer and the implementation is AbstractTimer which is a class that wraps a Histogram among other components. This particular metric is registered by the JvmGcMetrics class. The various measurements that are published to InfluxDB are determined by the InfluxMeterRegistry.writeTimer(Timer timer) method:
sum: timer.totalTime(getBaseTimeUnit()) // The total time of recorded events
count: timer.count() // The number of times stop has been called on the timer
mean: timer.mean(getBaseTimeUnit()) // totalTime()/count()
upper: timer.max(getBaseTimeUnit()) // The max time of a single event
The base time unit is milliseconds.
Similarly, http_server_requests appears to be a Timer as well.
I believe you are correct that the sensible thing is to chart on two separate Grafana panels: one panel for GC pause time (in milliseconds, given the base time unit) using sum (or mean or upper), and one panel for GC events using count.
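As a sketch of how the two panels relate to the fields above, the following Python snippet combines several rows of a Timer measurement. The rows are made-up values in the shape of the jvm_gc_pause sample, and summarize is a hypothetical helper, not part of Micrometer or InfluxDB. It assumes sum and upper are in milliseconds, per the base time unit:

```python
from typing import Dict, List


def summarize(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine Timer rows: counts and sums add up, upper takes the max,
    and the overall mean is total sum / total count (not a mean of means)."""
    total_count = sum(row["count"] for row in rows)
    total_ms = sum(row["sum"] for row in rows)
    return {
        "events": total_count,  # value for the "number of GC events" panel
        "total_ms": total_ms,   # value for the "GC pause time" panel
        "mean_ms": total_ms / total_count if total_count else 0.0,
        "max_ms": max((row["upper"] for row in rows), default=0.0),
    }


rows = [
    {"count": 1, "sum": 2.0, "upper": 2.0},
    {"count": 3, "sum": 10.0, "upper": 6.0},
]
print(summarize(rows))
```

Averaging the per-row mean field directly would weight both rows equally, which is why the sketch recomputes the mean from sum and count.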

Query Graphite Metrics for specific data points

I want to query my graphite server to retrieve certain metrics.
I am able to query all data points within a given time period, but my requirement is to query the data points for a specific time of day on previous days.
How can I do this?
The Graphite Render API supports a number of arguments in order to make your query more specific. Specifically, the from / until arguments will be useful to you, you can read about them here: https://graphite.readthedocs.io/en/latest/render_api.html#from-until
edit: I should add that if you're using Grafana for visualising your data, you can click+drag on the graph to select specific time ranges or use the timepicker in the top-right corner to choose Custom and set your range there.
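As a minimal sketch, the following Python builds Render API URLs that cover the same clock-time window on each of the previous few days. The host graphite.example.com, the target my.metric, and the helper same_window_urls are placeholders for illustration. The absolute time format HH:MM_YYYYMMDD is one of the formats the from / until arguments accept:

```python
from datetime import date, timedelta
from typing import List
from urllib.parse import urlencode


def same_window_urls(base_url: str, target: str, start: str, end: str,
                     today: date, days_back: int) -> List[str]:
    """One render URL per previous day, each limited to start..end (HH:MM)."""
    urls = []
    for offset in range(1, days_back + 1):
        day = (today - timedelta(days=offset)).strftime("%Y%m%d")
        query = urlencode({
            "target": target,
            "from": f"{start}_{day}",   # e.g. 09:00_20210705
            "until": f"{end}_{day}",
            "format": "json",
        })
        urls.append(f"{base_url}/render?{query}")
    return urls


for url in same_window_urls("http://graphite.example.com", "my.metric",
                            "09:00", "10:00", date(2021, 7, 6), days_back=2):
    print(url)
```

Each URL can then be fetched separately, giving one result set per day for the chosen window.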

Tableau: how can I filter dimensions/metrics on the dashboard based on the user's previous selection

I just started working with Tableau and I fail to find a way to filter dimensions/metrics on the dashboard based on the user's previous selection.
We use a MongoDB NoSQL database to store various events sent from our system.
Each event consists of key-value pairs (translated to metrics and dimensions); each event has a unique Id (EventType) and a list of parameters.
The number of parameters per EventType is constant but varies between event types.
When we connect the Events catalog to Tableau (using the MongoDB BI connector) we receive a flat table with all possible keys, while only the ones that apply to the specific event have a value.
Since we have a lot of event types and a large number of possible keys across them, this causes problems when using the dashboard.
The user sees a flat list of all possible dimensions and metrics with no correlation between them.
The user cannot know which metrics apply to which EventType.
How can I guide Tableau to present/highlight only the relevant dimensions / metrics, based on the EventType selected by the user?
You click on the down arrow in the top right of the filter and then select Only Show Relevant Values.