PromQL Requests per minute - Grafana

I'm trying to graph the total POST requests per minute, but there's a "ramp up" pattern that leads me to believe that I'm not getting the actual total of requests per minute, but rather a cumulative value.
Here is my query:
sum_over_time(django_http_responses_total_by_status_view_method_total{job="django-prod-app", method="POST", view="twitch_webhooks"}[1m])
Here are the "ramp up" patterns over 7days (drop offs indicating a reboot):
What leads me to believe my understanding of sum_over_time() is incorrect is that the existing webhooks should always exist. At the time of the most recent reboot we had 72k webhook subscriptions, so it doesn't make sense for the value to climb over time; it would make more sense to see a large spike at the start, catching webhooks that were not captured during downtime.
Is this query correct for what I'm trying to achieve?
I am using django-prometheus for exporting.

You want increase rather than sum_over_time, as this is a counter.

If the django_http_responses_total_by_status_view_method_total metric is a counter, then the increase() function must be used to return the number of requests during the last minute:
increase(django_http_responses_total_by_status_view_method_total[1m])
Note that the increase() function in Prometheus can return fractional results even if the django_http_responses_total_by_status_view_method_total metric contains only integer values. This is due to implementation details - see this comment and this article for details.
If the django_http_responses_total_by_status_view_method_total metric is a gauge, which shows the number of requests since the previous sample, then the sum_over_time() function must be used to return the number of requests over the last minute:
sum_over_time(django_http_responses_total_by_status_view_method_total[1m])

Related

Implementing API throttling with RDB

I would like to implement this API throttling:
A user can only execute the operation once per minute (once executed, subsequent requests will be rejected for 1 minute)
The expected total number of requests from all users is around 2 per second.
I am using PostgreSQL 14.5.
I guess I will need a table for exclusive processing. What kind of SQL/algorithm should I use?
You could store the latest accepted timestamp in a column. Every time a request is processed, the code could check if the interval between the current timestamp and the last accepted timestamp is less than a minute and reject if so.
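For example, a minimal sketch of that idea in PostgreSQL, assuming a hypothetical operation_throttle table with one row per user, where a single atomic UPDATE both checks and refreshes the timestamp:
-- hypothetical table: one row per user
CREATE TABLE operation_throttle (
    user_id       bigint PRIMARY KEY,
    last_accepted timestamptz NOT NULL
);

-- first request from a user: create the row (backdated so it is accepted)
INSERT INTO operation_throttle (user_id, last_accepted)
VALUES ($1, now() - interval '1 minute')
ON CONFLICT (user_id) DO NOTHING;

-- accept only if the previous acceptance is at least one minute old;
-- the UPDATE is atomic, so two concurrent requests cannot both succeed
UPDATE operation_throttle
SET last_accepted = now()
WHERE user_id = $1
  AND last_accepted <= now() - interval '1 minute'
RETURNING user_id;
If the UPDATE returns no row, the request arrived within the one-minute window and should be rejected. At roughly 2 requests per second in total, this stays well within what a single PostgreSQL table can handle.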

Why is subscribersGained - subscribersLost different than my current sub count?

I don't see an option in documentation for just "subscribers" but I can take subscribersGained and subtract subscribersLost. However, the number calculated is a bit lower than the actual result. Is there a reason for this or is there a way to actually get raw sub count?
As of writing (June 2021), there is no metric or dimension in the YouTube Analytics API that returns the total number of subscribers, although, as you have already noted, it is possible to collect the number of people that have subscribed and unsubscribed.
A workaround to get the total number of subscribers would be to collect the subscribersGained and subscribersLost metrics from when the channel was created, and subtract one from the other to get the total, e.g.:
Total number of subscribers = subscribersGained - subscribersLost
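A rough sketch of that workaround with the Python client for the YouTube Analytics API (v2 reports.query); creds is assumed to be an already-authorized OAuth2 credential, and the start date just needs to be on or before the channel's creation:
from googleapiclient.discovery import build

# creds: an OAuth2 credential with a yt-analytics scope (assumed to exist already)
analytics = build('youtubeAnalytics', 'v2', credentials=creds)

response = analytics.reports().query(
    ids='channel==MINE',
    startDate='2005-04-23',   # any date at or before the channel's creation
    endDate='2021-06-01',
    metrics='subscribersGained,subscribersLost',
).execute()

gained, lost = response['rows'][0]
total_subscribers = gained - lost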

How to correctly scrape and query metrics in Prometheus every hour

I would like Prometheus to scrape metrics every hour and display these hourly scrape events in a table in a Grafana dashboard. I have the global scrape interval set to 1h in the prometheus.yml file. From the Prometheus visualizer, it seems like Prometheus scrapes around the 43-minute mark of every hour. However, it also seems like this data is only valid for about 3 minutes: Prometheus graph
My situation, then, is this: In a Grafana table, I set the min step of a query on this metric to 1h, but this causes the table to say that there are no data points. However, if I set the min step to 5 minutes, it displays the hourly scrape events with a timestamp on the 45 minute mark. My guess as to why this happens is that Prometheus starts on the dot of some hour and steps either forward or backward by the min step.
This does achieve what I would like to do, but it also has potential for incorrect behavior if Prometheus ever does something like what can be seen at the beginning of the earlier graph. I also know that I can add a time shift, but it seems like it is always relative to the current time rather than an absolute time.
Is it possible to increase the amount of time that the scrape data is valid in Prometheus without having to scrape again every 3 minutes? Or maybe tell Prometheus to scrape at the 00 minute mark of every hour? Or if not, then can I add a relative time shift to the table so that it goes from the 45 minute mark instead of the 00 minute mark?
On a side note, in the above Prometheus graph, the irregular data was scraped after Prometheus was started. I had started Prometheus around 18:30 on the 22nd, but Prometheus didn't scrape until 23:30, and then it scraped at different intervals until it stabilized around 2:43 on the 23rd. Does anybody know why?
Your data disappear because of the staleness strategy implemented in Prometheus. Once a sample has been ingested, the metric is considered stale after 5 minutes. I didn't find any configuration to change that value.
Scraping every hour is not really the philosophy of Prometheus. If you really need to scrape at such a low frequency, it could be a better idea to schedule a job that sends the data to a Pushgateway, or to write a .prom file fed to a node exporter (if that makes sense). You can then scrape this endpoint every 1-2 minutes.
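For example, with the Pushgateway approach, an hourly cron job could push the value using the Python client; the gateway address, job name, and metric are placeholders:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def compute_value():
    # placeholder for the expensive hourly collection
    return 42.0

registry = CollectorRegistry()
g = Gauge('hourly_job_value', 'Value computed by the hourly job', registry=registry)
g.set(compute_value())
# replace with the address of your Pushgateway instance
push_to_gateway('pushgateway.example.org:9091', job='hourly_job', registry=registry)
Prometheus can then scrape the Pushgateway every 1-2 minutes and always sees the last pushed value, so the 5-minute staleness window no longer produces gaps.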
You could also roll your own exporter that memorizes the last scrape and scrapes anew only if the data age exceeds one hour. (That's the solution I would prefer.)
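A minimal sketch of such a caching exporter with the Python client, where a custom collector refreshes the underlying data lazily, only when a scrape arrives and the cached copy is older than one hour (metric name and fetch function are illustrative):
import time
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

def expensive_fetch():
    # placeholder for the real, slow collection
    return 42.0

class CachingCollector:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.cached = None
        self.fetched_at = 0.0

    def collect(self):
        # refresh only when the cached value is older than the TTL
        if self.cached is None or time.time() - self.fetched_at >= self.ttl:
            self.cached = expensive_fetch()
            self.fetched_at = time.time()
        yield GaugeMetricFamily('my_hourly_metric', 'Value refreshed at most once per hour', value=self.cached)

if __name__ == '__main__':
    REGISTRY.register(CachingCollector())
    start_http_server(8000)
    while True:
        time.sleep(60)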
Now, as a quick solution, you can request the data over the last hour and average over it. That way, the last (old) scrape is taken into account:
avg_over_time(old_metric[1h])
It should work, though it may show some transient incorrect values if there is jitter in the scheduling of the scrape.
Regarding the issues you had about late scraping, I suspect the scraping failed at those dates. Prometheus retries only at the next schedule (1h in your case).
If the metric is scraped at intervals exceeding 5 minutes, then Prometheus returns gaps to Grafana because of the staleness mechanism. These gaps can be filled with the last raw sample value by wrapping the queried time series in the last_over_time function. Just specify a lookbehind window in square brackets that equals or exceeds the interval between samples. For example, the following query would fill gaps for the my_gauge time series with a one-hour interval between samples:
last_over_time(my_gauge[1h])
See these docs for time durations format, which can be used in square brackets.

Time since a value was zero

I have an application that consumes work to do from an AWS topic. Work is added several times a day and my application quickly consumes it and the queue length goes back to 0. I am able to produce a metric for the length of the queue.
I would like a metric for the time since the length of queue was last zero. Any ideas how to get started?
Assuming a queue_size gauge that records the size of the queue, you can define a recording rule like this:
# Timestamp of the most recent `queue_size` == 0 sample; else propagate the previous value
- record: last_empty_queue_timestamp
  expr: timestamp(queue_size == 0) or last_empty_queue_timestamp
Then you can compute the time since the queue was last empty simply as:
timestamp(queue_size) - last_empty_queue_timestamp
Note however that because this is a gauge (and because of the limitations of sampling), you may end up with weird results. E.g. if one work item is added every minute, your sampling interval is one minute, and you happen to sample right after the work items have been added, your queue may never (or very rarely) appear empty from Prometheus's point of view. If that turns out to be an issue (or simply a concern) you may be better off having your application export a metric that is the last timestamp when something was added to an empty queue (basically what the recording rule attempts to compute).
Similar to Alin's answer; upon revisiting this problem I found this from the Prometheus documentation:
https://prometheus.io/docs/practices/instrumentation/#timestamps,-not-time-since
If you want to track the amount of time since something happened, export the Unix timestamp at which it happened - not the time since it happened.
With the timestamp exported, you can use the expression time() - my_timestamp_metric to calculate the time since the event, removing the need for update logic and protecting you against the update logic getting stuck.
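A sketch of that approach with the Python client, assuming the application can hook the point where an item is pushed onto an empty queue (metric and function names are illustrative):
from prometheus_client import Gauge

last_added_to_empty_queue = Gauge(
    'queue_last_added_to_empty_timestamp_seconds',
    'Unix timestamp when an item was last added to an empty queue',
)

def on_item_added(queue_length_before_add):
    # record the moment the queue stops being empty
    if queue_length_before_add == 0:
        last_added_to_empty_queue.set_to_current_time()
The age of that event is then simply time() - queue_last_added_to_empty_timestamp_seconds in PromQL.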

rate limit policy on queries to Azure Insights REST API for Events (Audit Logs)

I have some questions regarding Azure Insights REST Api for Events.
When I make an HTTP request to the Insights API for events, I receive the header "x-ms-ratelimit-remaining-subscription-reads" with value "14999".
But the next query 1 second later returns the same value of remaining reads.
I see there is some throttling policy there, but I would like to understand how it works and what is the correct way to deal with that.
In particular,
1) how many reads am I able to do per second?
2) if I exhaust the remaining-reads allowance, how much time should I wait before it is back to the maximum?
3) is it decremented on every query attempt, regardless of the $top parameter set and how many results have been returned?
Thank you!
This article seems to have the responses you need.
To answer the questions based on it:
1) There is no limit to the number of requests per second, but you have 15k requests/hour/subscription/region/instance of ARM region. Worst case scenario you will get throttled after 15k requests, but you'd have to be extremely unlucky for that.
2) If you exceed the limit, you are told how much you have to wait, and you can integrate that logic by looking at the Retry-After header. Happily, it's a matter of seconds.
3) I believe the $top parameter doesn't affect the query, since no matter how many results are brought back, a paging request is still just one request.
As for the fact that you get 14999 requests remaining multiple times, as they say in their documentation it is expected, since an ARM region has multiple instances and each instance has a 15k requests/subscription/hour limit. If you hit simultaneously and get the same number remaining, it just means that you were lucky enough to hit different instances within the same ARM region.
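If you do hit the limit, a small retry wrapper that honours Retry-After is usually enough; a sketch with the requests library, where the URL and authorization header are placeholders:
import time
import requests

def get_with_retry(url, headers, max_attempts=5):
    # GET that sleeps for the advertised Retry-After delay on HTTP 429
    resp = None
    for _ in range(max_attempts):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            break
        time.sleep(int(resp.headers.get('Retry-After', '1')))
    return resp

# usage (placeholders):
# resp = get_with_retry('https://management.azure.com/subscriptions/<id>/...',
#                       {'Authorization': 'Bearer <token>'})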
1) how many reads am I able to do per second?
Based on the rate limits published here - https://azure.microsoft.com/en-in/documentation/articles/azure-subscription-service-limits/#subscription-limits, you can perform 15000 reads / hour (not sure it would translate to 4 reads / second).
2) if I exhaust the remaining-reads allowance, how much time should I wait before it is back to the maximum?
Given the rates are defined per hour, my guess would be to wait till next hour if you exhaust 15000 read request limit.
3) is it decremented on every query attempt, regardless of the $top parameter set and how many results have been returned?
This is based on the number of API calls and not the amount of data returned. So I would say defining the $top parameter should not have any impact on this.
When I make an HTTP request to the Insights API for events, I receive the header "x-ms-ratelimit-remaining-subscription-reads" with value "14999". But the next query 1 second later returns the same value of remaining reads.
I would assume there's some caching in play here. Is it the same request you're repeating or a different request altogether?