I'm using Prometheus to monitor an application that runs as a cron job, so I'm using Pushgateway to make my desired metrics available to Prometheus. One of the metrics reports how long a certain task takes to finish, and I'm using a Summary for it. My issue is that I see the same value reported for every quantile! My understanding was that the reported time should differ between quantiles.
I'm using the following to observe() the time and push my metrics to Pushgateway:
Summary.labels(myLabel).observe(Date.now() - startedAt)
gateway.pushAdd { jobName: 'test' }, (err, resp, body) ->
console.log "Error!!" if err
and here is a screenshot showing that every quantile reports the same final time!
I'd appreciate any comments on this!
If you only have one observation, then all of a Summary's quantiles will be the same: every quantile of a single-sample distribution is that sample's value. I'm not sure what you're expecting here instead; a gauge would be the more usual way to report a job's duration.
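For example, with prom-client you could keep the same push flow but set a gauge to the measured duration. A minimal sketch, assuming the callback-style Pushgateway API used in the question (prom-client before v14); the metric name, label value, and gateway address are made up for illustration:

client = require 'prom-client'

# Gauge holding the duration of the last run, in seconds (Prometheus convention)
taskDuration = new client.Gauge {
  name: 'task_duration_seconds',
  help: 'How long the last run of the task took, in seconds',
  labelNames: ['task']
}

myLabel = 'nightly_cleanup'   # example label value
startedAt = Date.now()
# ... run the task ...
taskDuration.labels(myLabel).set((Date.now() - startedAt) / 1000)

gateway = new client.Pushgateway 'http://127.0.0.1:9091'
gateway.pushAdd { jobName: 'test' }, (err, resp, body) ->
  console.log 'push failed:', err if err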
I read the official Gatling blog and understood that we cannot get the 95th/99th percentiles: "All aggregations will result in computing averages on percentiles and will inherently be broken". One thing I don't understand is why Gatling doesn't give us the raw response-time series, which we could then query however we like to compute any percentile we want. Without correct percentiles, such an integration is worthless.
Is there any way to get close to the timings shown in the Gatling reports from InfluxDB, if not match them exactly?
Gatling doesn't really send InfluxDB the response time of each individual operation. If it sent complete per-request information, it would significantly hurt the performance of both Gatling and InfluxDB; you would not have enough RAM for a test with a load intensity of even 100 transactions per second. Instead, Gatling sends response times aggregated over the time interval you specify in gatling.conf.
Because of this, Grafana cannot show us the actual value of the 90th percentile. There are many utilities for proper aggregation of Gatling logs, such as this one: https://github.com/codefiler/gatling-analyser
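For reference, the aggregation interval mentioned above is the write period of the Graphite writer that Gatling uses to feed InfluxDB. A sketch of the relevant gatling.conf section (key names based on the Gatling 3.x defaults; double-check against the gatling-defaults.conf shipped with your version):

gatling {
  data {
    writers = [console, file, graphite]
    graphite {
      host = "localhost"   # InfluxDB's Graphite listener
      port = 2003
      writePeriod = 1      # seconds of data aggregated into each datapoint sent
    }
  }
}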
We have a lot of jobs (for example batch jobs) that are executed each day. Therefore we'd like to have an overview of all jobs:
→ track start time and end time (→ complete runtime).
All of this information should be available in a visualisation.
Is InfluxDB with Grafana a good solution for this, or do you recommend another tool?
I think InfluxDB and Grafana are really a good starting point to collect data from your services.
You'll also need to integrate some type of metrics library and an exporter in your code.
On Java you could use Micrometer (https://micrometer.io/) and Prometheus.
Here you can find more information about them: https://micrometer.io/docs/registry/prometheus
After integrating metrics into your code, you simply need to configure InfluxDB to scrape your metrics endpoint and configure Grafana to use InfluxDB as a data source.
We have a custom metric that gets exported only upon some error condition in the app.
An alert rule uses that custom metric and is registered with Prometheus' rule manager.
Why does Prometheus not raise an error when this metric name is queried, even though the metric is not yet available in Prometheus?
It seems correct that the absence of a signal is not treated as an error.
However, it can cause problems with dashboards and alerting.
See this presentation by one of Prometheus' creators: Best Practices & Beastly Pitfalls
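On the alerting and dashboard side, a common pattern is to handle the "metric does not exist yet" case explicitly in PromQL (the metric name below is just a placeholder):

# fires while the series has never been exposed / is currently absent
absent(my_custom_error_metric)

# for dashboards: fall back to 0 when the series is missing
sum(my_custom_error_metric) or vector(0)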
I would like Prometheus to scrape metrics every hour and display these hourly scrape events in a table in a Grafana dashboard. I have the global scrape interval set to 1h in the prometheus.yml file. From the prometheus visualizer, it seems like Prometheus scrapes around the 43 minute mark of every hour. However, it also seems like this data is only valid for about 3 minutes: Prometheus graph
My situation, then, is this: In a Grafana table, I set the min step of a query on this metric to 1h, but this causes the table to say that there are no data points. However, if I set the min step to 5 minutes, it displays the hourly scrape events with a timestamp on the 45 minute mark. My guess as to why this happens is that Prometheus starts on the dot of some hour and steps either forward or backward by the min step.
This does achieve what I would like to do, but it also has potential for incorrect behavior if Prometheus ever does something like what can be seen at the beginning of the earlier graph. I also know that I can add a time shift, but it seems like it is always relative to the current time rather than an absolute time.
Is it possible to increase the amount of time that the scrape data is valid in Prometheus without having to scrape again every 3 minutes? Or maybe tell Prometheus to scrape at the 00 minute mark of every hour? Or if not, then can I add a relative time shift to the table so that it goes from the 45 minute mark instead of the 00 minute mark?
On a side note, in the above Prometheus graph, the irregular data was scraped after Prometheus was started. I had started Prometheus around 18:30 on the 22nd, but Prometheus didn't scrape until 23:30, and then it scraped at different intervals until it stabilized around 2:43 on the 23rd. Does anybody know why?
Your data disappear because of the staleness strategy implemented in Prometheus. Once a sample has been ingested, the metric is considered stale after 5 minutes. I didn't find any configuration to change that value.
Scraping every hour is not really the philosophy of Prometheus. If you really need to scrape with such a low frequency, it could be a better idea to schedule a job that sends the data to a Pushgateway, or that writes a .prom file read by the node exporter's textfile collector (if that makes sense). You can then scrape this endpoint every 1-2 minutes.
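For the .prom file variant, the scheduled job would write something like the following into the directory configured via the node exporter's --collector.textfile.directory flag (metric name and value are placeholders):

# HELP batch_job_duration_seconds Runtime of the last batch job
# TYPE batch_job_duration_seconds gauge
batch_job_duration_seconds 127.3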
You could also roll your own exporter that memorizes the last scrape and collects anew only if the data's age exceeds one hour. (That's the solution I would prefer.)
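A rough sketch of such a caching exporter in Node.js with prom-client; everything here is illustrative: collectExpensiveMetrics stands in for whatever slow collection you only want to run hourly, and prom-client v13+ is assumed for the promise-based register.metrics():

http = require 'http'
client = require 'prom-client'

ONE_HOUR_MS = 60 * 60 * 1000
lastCollected = 0

# hypothetical gauge, refreshed at most once per hour
batchDuration = new client.Gauge {
  name: 'batch_job_duration_seconds',
  help: 'Runtime of the last batch job, in seconds'
}

# stand-in for the expensive collection step
collectExpensiveMetrics = ->
  batchDuration.set 127.3

server = http.createServer (req, res) ->
  # refresh the cached values only if they are older than one hour
  if Date.now() - lastCollected > ONE_HOUR_MS
    collectExpensiveMetrics()
    lastCollected = Date.now()
  client.register.metrics().then (body) ->
    res.writeHead 200, 'Content-Type': client.register.contentType
    res.end body

# Prometheus can then scrape this endpoint every 1-2 minutes
server.listen 9099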
Now, as a quick solution, you can request the data over the last hour and average over it. That way, the last (old) scrape is taken into account:
avg_over_time(old_metric[1h])
It should work, though it may show some transient incorrect values if there is some jitter in the scheduling of the scrapes.
Regarding the issues you had about late scraping, I suspect the scraping failed at those dates. Prometheus retries only at the next schedule (1h in your case).
If the metric is scraped at intervals exceeding 5 minutes, then Prometheus returns gaps to Grafana because of the staleness mechanism. These gaps can be filled with the last raw sample value by wrapping the queried time series in the last_over_time function. Just specify a lookbehind window in square brackets that equals or exceeds the interval between samples. For example, the following query fills gaps for the my_gauge time series with a one-hour interval between samples:
last_over_time(my_gauge[1h])
See these docs for the time duration format that can be used in square brackets.
I am trying to use Prometheus to report on executed Hystrix commands for MongoDB. Everything is working fine, except that Prometheus is not able to understand the line below and shows the target state as down.
# HELP rideshare-engine_hystrix_command_latency_total_percentile_995 DEPRECATED: Rolling percentiles of execution times for the end-to-end execution of HystrixCommand.execute() or HystrixCommand.queue() until a response is returned (or ready to return in case of queue(). The purpose of this compared with the latency_execute* percentiles is to measure the cost of thread queuing/scheduling/execution, semaphores, circuit breaker logic and other aspects of overhead (including metrics capture itself).
complete stack
Not sure what I am doing wrong here.
Config:
Hyphens are not valid in Prometheus metric names; use underscores instead.
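Prometheus metric names must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*, so the exporter needs to expose the metric under a name like:

rideshare_engine_hystrix_command_latency_total_percentile_995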