How do I get a Sensu Go check to show up as a status panel on Grafana? - grafana

As described in the Sensu documentation, I've written a custom check script that returns 0 for OK, 1 for Warning, 2 for Critical, and prints out the description of the status. It shows up as expected on Sensu's built-in web interface, but I'm not sure how to make it show up in Grafana. I have some canned metrics that work through InfluxDB, but this is just a status check, not a metric.
I gather that I need some sort of handler on the Sensu side and/or some sort of datasource on the Grafana side that talks to the Sensu API, but the one for Sensu Core (1.x) doesn't seem to work with the newer Sensu Go (5.x). So, do I:
Rewrite the check to do graphite_plaintext output and use the
influxdb handler?
Write a custom Grafana datasource and/or Sensu handler?
Revert to Sensu Core?
Sensu Go seems to have been re-oriented around metrics, so it's not clear from the docs how to deal with simple checks anymore.

It's probably terribly inefficient, but for now I've simply chosen option 1, to rewrite the check to use influxdb handler.
All I had to do for this was to print my output with the form:
metric_path value timestamp\n
Where metric_path is something like computer_name.topic.status, value is just an integer status, and then timestamp is the current unix time as an integer. That last bit took an embarrassingly long time to figure out...nothing was showing up in InfluxDB database (and thus, Grafana) because the sensu-influxdb-handler errors out if the timestamp is not an integer.
Then, on the Grafana side, I installed the Status Panel plugin developed by Vonage here:
https://grafana.com/plugins/vonage-status-panel
Once the data was finally showing up in InfluxDB, I could select it from Grafana. I set the thresholds for warning and critical to 1 and 2 respectively, and it now works like I wanted. Still, if anyone has a more efficient way to handle this, I'd like to know about it, because I'm going to want to track the status of a large number of things, and I want to do it the right way.

Related

Delete Time series in prometheus without API

I am monitoring a instance and changed its target IP. now when I graph it in grafana, there is 2 lines(with different color) showing with the tail of first line the head of the second line.
My goal is to remove the first line and just show the updated, second line.
My attempt is to adjust the time frame in grafana which works but it will affect all the instances that are not changed.
My second attempt is to remove the time-series in prometheus but the API was not enabled and restarting would cause a hiccup in the prometheus system (which is not good in monitoring).
It also said here that time-series can only be deleted via API but this is 2018. I was wondering if it is now possible to remove time-series without API.
No, the only way to remove time series is using the API
Yes, restarting would cause a hiccup, but let's be practical: the downtime is really very small.

Grafana - Alert based on a minimum of requests

I hope you can help me.
What I have is some metric data I am collecting. I want to send out an alert, when I reach a specific error-rate on these metrics.
To make clear, my data looks something like this:
Timestamp
value (the runtime of a query)
state (error, success)
api-endpoint called
I have a grafana-Board doing some calculations, drops out something linke this:
error-rate
api-endpoint
number of calls made to the api endpoint
Fine for now - as I can read out on my grafana, I am able to send some error-messages/warnings, if the error-rate is too high. Works like a charm. But now comes the point:
If the first two (e.g.) calls to a specific api fail, I will instantly receive an alarm send by my grafana. I do not wan't that!
Is it possible - and if: how? - to alert me ONLY if this specific request was executed at least 5 times? It is no problem if this is a generic alert like "hey, something is wrong!" - but I need to figure out if the request triggering the alarm with 50-100% error-Rate was at least executed a specific amount of time before alarming.
It has to be done based on tags/fields, I do not want to add a single query for all of my 35+ APIs (number growing).
Any Idea anybody?
Using Grafana 8.0
Using InfluxDb 1.8 (with Flux enabled)
Just to make clear, if everyone ever want's to do the same: FluxQL is king - you can use filter functionality in there and only base data on datasets where value count is greater than X.
Yes: FluxQL is damn hot. I love it since a few month.

How do we change the "precision:ms" setting in the Grafana Query Inspector?

I have an InfluxDB database with only x11 data points in it. These data are not displaying correctly (or at least as I would expect) in Grafana when the time between them is shorter than 1ms.
If I insert data points 1 ms apart, then everything works as expected and I see all x11 points at the correct times, as shown below.:
However, if I delete these points and upload new ones but this time one point per 100 μs, then although the data displays correctly in InfluxDB, in Grafana I see only two points in my graph:
It seems like the data is being rounded/binned to the nearest millisecond, an that this is related to the “precision=ms” setting in the query here:
but I cannot find any way to change this setting. What is the correct way to fix this?
You can't configure Grafana to support different time precision for the InfluxDB. It is hardcoded in the source code: https://github.com/grafana/grafana/blob/36fd746c5df1438f27aa33fc74b24be77debc7ff/public/app/plugins/datasource/influxdb/datasource.ts#L364 (It may need to be fixed in multiple places of the source, not only in this one.)
So the correct way to fix it is to code it, which is of course not in the scope of this question.

Can you calculate active users using time series

My atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period. i.e 1h, 1d, 7d and 30d in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
this is an issue because I dont clear the metrics so its always a count since inception.
I then tried this:
count(max_over_time(help_command{job=“Application
Name”,Username=~“.+“}[1w]) -
max_over_time(help_command{job=“Application name”,Username=~“.+“}[1w]
offset 1w) > 0)
which works but only for one command I have about 50 other commands that need to be added to that count.
I tried the:
"{__name__=~".+_command",job="app name"}[1w] offset 1w"
but this is obviously very expensive (timeout in browser) and has issues with integrating max_over_time which doesn't support it.
Any help, am I using the metric in the wrong way. Is there a better way to query... my only option at the moment is the count (format working above for each command)
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as label (i.e. command_usage{commmand="help"}, command_usage{commmand="foo"})
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__)(
(
{__name__=~".+_command",job=“Application Name”}
-
{__name__=~".+_command",job=“Application name”} offset 1w
) > 0
)
This only works though because you say that whatever exports the counts never resets them. If this is simply because that exporter never restarted and when it will the counts will drop to zero, then you'd need to use increase instead of minus and you'd run into the exact same performance issues as with max_over_time.
count by(__name__)(
increase({__name__=~".+_command",job=“Application Name”}[1w]) > 0
)

JMeter to record results on hourly basis

I have a JMeter project with multiple GET and POST requests and assertions for these. I use Aggregate results and View results tree listeners, but none of these can store results on hourly basis. I tried JMeterPlugins-Standard and JMeterPlugins-Extras packages and jp#gc - Graphs Generator listener, but all of them use aggregated data instead of hourly data. So I would like to get number of successful and failed requests/assertions per hour, maybe a bar chart would be most suitable for this purpose.
I'm going to suggest a non-conventional design-level solution: name your samplers dynamically with hour (or date and hour), so that each hour the name will change, and thus they will appear in different category, i.e.:
The code for such name is:
${__time(dd:hh,)} the rest of sampler name
Such sampler will appear in the following way in Aggregate Report (here I simulated it with minutes/seconds, but same will happen with days/hours, just on larger scale):
Pros and cons of such approach:
Simple, you can aggregate anything by hour, minute, or any other time slice while test is running, and not by analysis after execution.
Not listener-dependant, can be used with pretty much any listener or visualizer
If you want to also have overall stats, it will require to sum up every sub-category. So it alters data, but in the way that it can still can be added back to original relatively easy.
Calculating __time before every sampler will not be unnoticed completely from performance perspective, but I don't think it will add visible overhead to a script.
You could get the same data by properly aggregating JTL or CSV (whichever you use) after execution, so it doesn't provide you with anything that is not possible to achieve using standard methods
Script needs altering to make this happen. if you have 100s of samplers, it's going to take a while. And if you want to change back...
You might want to use Filter Results Tool which has --start-offset and --end-offset parameters, you can "cut" your results file into "interesting" pieces and plot them according to your requirements.
You can install Filter Results Tool using JMeter Plugins Manager
Also be aware that according to JMeter Best Practices you should
Use as few Listeners as possible; if using the -l flag as above they can all be deleted or disabled.
Don't use "View Results Tree" or "View Results in Table" listeners during the load test, use them only during scripting phase to debug your scripts.
You can get whatever information you need from the .jtl results file, you can specify test results location via -l command-line argument
To get summarized results per hour add to your test plan Generate Summary Results:
Generates a summary of the test run so far to the log file and/or standard output
Update interval in jmeter.properties to your needs ,1 hour, 3600 seconds:
summariser.interval=3600
You will get summary per hour of your requests.
You can try with Jmeter backend Listener. It has integration with graphite and Influxdb. After storing the results in these time series database you can display the result in Grafana dashboard. Grafana has its own filtering of showing the results in hourly, monthly, daily basis and so on.