Delete time series in Prometheus without API - CentOS

I am monitoring an instance and changed its target IP. Now when I graph it in Grafana, there are two lines (in different colors), with the tail of the first line at the head of the second line.
My goal is to remove the first line and show only the updated, second line.
My first attempt was to adjust the time frame in Grafana, which works, but it affects all the instances that were not changed.
My second attempt was to remove the time series in Prometheus, but the API was not enabled, and restarting would cause a hiccup in the Prometheus system (which is not good for monitoring).
It is also said here that time series can only be deleted via the API, but that was in 2018. I was wondering if it is now possible to remove time series without the API.

No, the only way to remove time series is still via the API.
Yes, restarting would cause a hiccup, but let's be practical: the downtime is really very small.
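For reference, the deletion itself looks roughly like this once the admin API has been enabled with --web.enable-admin-api (a minimal sketch; the host, port and instance label value are placeholders, not taken from your setup):

# Delete every series scraped from the old target (adjust the matcher to your old label value)
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={instance="OLD_IP:9100"}'
# Optionally reclaim the disk space straight away
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'

Enabling that flag is what requires the restart, so the one-time hiccup mentioned above is unavoidable.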


How to display empty graph in Grafana even if there is no data

I'm using Grafana v9.1.8.
I created a panel based on data from InfluxDB.
The data is only sent when the application is working, so sometimes there is no data.
In that case the dashboard just shows 'No Data' in the middle of the panel, without any graph.
I'm trying to keep the graph (axes) shown even if there's no data, but I cannot find a solution.
As far as I know, there is no such feature on Grafana at the moment, but I found this solution:
https://community.grafana.com/t/what-to-show-when-the-panel-is-without-data/66524/9
Make a fake union: check if you have any data and, if you don't, create some dummy time data without any other fields. As they say in that answer, this may not be scalable, since you need to add extra lines for each query, but it can serve as a workaround.

Problem displaying GeoServer layer when resource is updated with the REST API

I am having a weird issue when using the GeoServer API to update a NetCDF resource of a coverage store layer. The resource is a NetCDF containing one 3D (lon, lat, time) variable. However, the time dimension is only of length 1. My code runs within a Docker container and uses curl in a .sh file to run the API commands.
I must stress that the problem occurs only once in a while, maybe 10% of the time, maybe less.
When the problem occurs, the update of the store seems to fail and the layer cannot be displayed. When looking at the GetCapabilities, one of the weird things is that the time dimension is not the right date but is instead equal to 1970-01-01T00:00:00.000Z, which is the reference date used in the NetCDF for the time dimension. Also, no problems are detected in the logs.
I do know that the problem is not with the file, and probably not with the upload of the file. Indeed, when the problem occurs, I can successfully create a store and a layer with the same resource and the same parameters as the layer that is not working.
I have tried multiple things via the API to solve this issue:
Resetting the resource cache (see the curl sketch after this list). It sometimes works, but not always.
Deleting the layer and the store and recreating them every time I need to update the resource.
Deleting the resource, layer and store and recreating everything when the resource is updated.
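For reference, the cache reset is roughly the following call (a minimal sketch, not my exact script; the host, credentials and the choice to reset the whole catalog rather than a single store are assumptions):

# Reset GeoServer's in-memory store/raster/schema caches via the REST API
curl -u admin:geoserver -X POST "http://localhost:8080/geoserver/rest/reset"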
Nothing seems to get rid of the problem permanently. Has anyone experienced the same kind of behavior? It is not the first time I have used GeoServer's API in a data harvester, but it is the first time I have had this problem!
EDIT
I also tried to make the NetCDF file as simple as I could, by removing the time dimension.
So now the NetCDF file only has 4 variables: lon, lat, the gridded variable, and a variable called crs that has dimension 0 and is therefore empty (I left it there for now since it comes from the outside source file).
But then again, the same kind of issue occurs, and again only once in a while. However, when it occurs, there seems to be something wrong caught in GeoServer's log:
2022-06-08 16:01:28,267 WARN [operation.projection] - Possible use of "Popular Visualisation Pseudo Mercator" projection outside its valid area.
Longitude 2147483287°00.0'W is out of range (±180°).
But again, when this happens, I can usually clear the resource cache and the layer becomes visible again.
So I still don't know what is happening. Could it be the empty crs variable that sometimes creates problems?
Thanks a lot for your help!

How do we change the "precision:ms" setting in the Grafana Query Inspector?

I have an InfluxDB database with only 11 data points in it. These data are not displaying correctly (or at least as I would expect) in Grafana when the time between them is shorter than 1 ms.
If I insert data points 1 ms apart, then everything works as expected and I see all 11 points at the correct times, as shown below:
However, if I delete these points and upload new ones, but this time one point per 100 μs, then although the data displays correctly in InfluxDB, in Grafana I see only two points in my graph:
It seems like the data is being rounded/binned to the nearest millisecond, and that this is related to the "precision=ms" setting in the query here:
but I cannot find any way to change this setting. What is the correct way to fix this?
You can't configure Grafana to use a different time precision for InfluxDB. It is hardcoded in the source code: https://github.com/grafana/grafana/blob/36fd746c5df1438f27aa33fc74b24be77debc7ff/public/app/plugins/datasource/influxdb/datasource.ts#L364 (It may need to be fixed in multiple places of the source, not only in this one.)
So the correct way to fix it is to code it, which is of course not in the scope of this question.
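If you just want to double-check that the rounding happens on the Grafana side, you can query InfluxDB directly and ask for microsecond epochs (a hedged sketch assuming InfluxDB 1.x; the database and measurement names are placeholders):

# epoch=u returns timestamps as integer microseconds, so sub-millisecond spacing is visible
curl -G 'http://localhost:8086/query' \
  --data-urlencode 'db=mydb' \
  --data-urlencode 'epoch=u' \
  --data-urlencode 'q=SELECT * FROM "my_measurement" ORDER BY time DESC LIMIT 11'

If the 11 points come back with distinct microsecond timestamps, the data itself is intact and the 1 ms binning is purely Grafana's query precision.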

Can you calculate active users using time series

My Atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, i.e. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  - max_over_time(help_command{job="Application Name",Username=~".+"}[1w] offset 1w)
  > 0
)
which works, but only for one command. I have about 50 other commands that need to be added to that count.
I also tried:
{__name__=~".+_command",job="app name"}[1w] offset 1w
but this is obviously very expensive (it times out in the browser) and has issues when combined with max_over_time, which doesn't support it.
Any help? Am I using the metrics in the wrong way? Is there a better way to query? My only option at the moment is to repeat the working count format above for each command.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
To get back to your question though, you don't need the max_over_time, you can simply write your query as:
count by(__name__)(
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application Name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and the counts will drop to zero when it does, then you'd need to use increase instead of the subtraction, and you'd run into the exact same performance issues as with max_over_time.
count by(__name__)(
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)

How do I get a Sensu Go check to show up as a status panel on Grafana?

As described in the Sensu documentation, I've written a custom check script that returns 0 for OK, 1 for Warning, 2 for Critical, and prints out the description of the status. It shows up as expected on Sensu's built-in web interface, but I'm not sure how to make it show up in Grafana. I have some canned metrics that work through InfluxDB, but this is just a status check, not a metric.
I gather that I need some sort of handler on the Sensu side and/or some sort of datasource on the Grafana side that talks to the Sensu API, but the one for Sensu Core (1.x) doesn't seem to work with the newer Sensu Go (5.x). So, do I:
Rewrite the check to do graphite_plaintext output and use the influxdb handler?
Write a custom Grafana datasource and/or Sensu handler?
Revert to Sensu Core?
Sensu Go seems to have been re-oriented around metrics, so it's not clear from the docs how to deal with simple checks anymore.
It's probably terribly inefficient, but for now I've simply chosen option 1: rewrite the check to use the influxdb handler.
All I had to do for this was to print my output with the form:
metric_path value timestamp\n
Where metric_path is something like computer_name.topic.status, value is just an integer status, and timestamp is the current Unix time as an integer. That last bit took an embarrassingly long time to figure out... nothing was showing up in the InfluxDB database (and thus Grafana) because the sensu-influxdb-handler errors out if the timestamp is not an integer.
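To make that concrete, the rewritten check boils down to something like this (a rough sketch; the metric path and the hard-coded status are placeholders, not my exact script):

#!/bin/sh
# Hypothetical Sensu Go check: emit one graphite-plaintext line and exit with the status code.
# 0 = OK, 1 = Warning, 2 = Critical
STATUS=0
echo "$(hostname -s).mytopic.status ${STATUS} $(date +%s)"   # metric_path value timestamp (all integers)
exit "${STATUS}"

The exit code still drives Sensu's own OK/Warning/Critical state, while the printed line is what the influxdb handler forwards.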
Then, on the Grafana side, I installed the Status Panel plugin developed by Vonage here:
https://grafana.com/plugins/vonage-status-panel
Once the data was finally showing up in InfluxDB, I could select it from Grafana. I set the thresholds for warning and critical to 1 and 2 respectively, and it now works like I wanted. Still, if anyone has a more efficient way to handle this, I'd like to know about it, because I'm going to want to track the status of a large number of things, and I want to do it the right way.