Dropwizard Metrics: how to count an action without carrying the counter value forward

I have a very simple case where I want to see how many times a user clicks ButtonA in my app. I'm using a Dropwizard Metrics counter to achieve this and the Coursera reporter to report the metrics to Datadog every minute.
registry.counter("buttonA").inc();
But this counter doesn't behave the way I thought it would. For example, if buttonA has been clicked 4 times, the counter keeps the value 4 until the app restarts, which is not very useful.
Is there another metric I'm not aware of that keeps a count and resets to 0 at each report? That way I could easily sum all the counts on the Datadog dashboard and get exact numbers, and even an app restart would not affect the metric.

I don't think there is anything that does this for you automatically. You have to reset the counter yourself at each reporting interval. Something like this should work:
// Read the current count, report it, then subtract what was reported
long count = registry.counter("buttonA").getCount();
dataDogReporter.report("buttonA", count);
registry.counter("buttonA").dec(count);

Related

Prometheus Alerting expression for metrics that increases once a day

I have a custom counter metric that should increase once a day, during a specific time period, and I have to create an alerting rule that fires if this metric doesn't increase.
Infrastructure: Kubernetes
So for my case, I used the next alerting expression:
increase(my_custom_metric[1d]) == 0
But there seem to be some issues with using it. Suppose the pod restarted and the counter reset; after a couple of hours the counter increased by one, but the value of the expression above still remained 0. What expression should be used to account for such situations?
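For context, a minimal sketch of how that expression sits in a Prometheus alerting rule file (the group and alert names are illustrative):

groups:
  - name: daily-counter
    rules:
      - alert: MyCustomMetricNotIncreasing
        expr: increase(my_custom_metric[1d]) == 0
        for: 1h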

AnyLogic: mean rate of lost customers

I have a model where the queue's capacity limit is 2. I also set up a SelectOutput block before the queue, since customers who can't enter the queue are lost. Now I want to find out the mean rate at which customers are lost. How can I do that?
Many thanks!
Create a variable leftLastMin of type int.
In the "on exit (false)" code of the SelectOutput, write leftLastMin ++
Now, create a recurring event that triggers every minute. Here, you can easily do something with the number left last minute (traceln, store into dataset or statistics object...).
Also, reset your counter leftLastMin=0 so it is ready for the next minute.
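A minimal sketch of the two bits of (Java) code this amounts to in AnyLogic, assuming a dataset named lostDataset has been added to the model (the dataset is optional and purely illustrative):

// In the SelectOutput's "on exit (false)" action:
leftLastMin++;

// In the recurring event's action (timeout trigger, recurrence: 1 minute):
traceln("Customers lost in the last minute: " + leftLastMin);
lostDataset.add(time(), leftLastMin); // optional: store for plotting
leftLastMin = 0; // reset so the counter is ready for the next minute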

Is it possible to accelerate time in Grafana?

Here is what I actually want to do:
I created dashboards in Grafana to monitor alert status.
I created fake data in my system to simulate my alert situations on these boards. The time range of this data covers now to now + 12h. Analyzing alert status on real data takes a long time, so I cannot be very flexible with my alert rules: I have to wait until the end of this period to see the alert status in the system. (I have many states like this, actually.) Grafana creates pending, alerting, and ok states according to the records in my database. Is there a method to quickly verify my tests without waiting all this time?
The main problem is that this is fairly expensive to do in a data-source-agnostic way. The way it worked in Bosun is that you would select a time range, and then an interval or a number of queries to run.
Setting both From and To enables testing multiple iterations of the selected alert over time. The number of iterations depends on the two linked fields Intervals and Step Duration; changing one changes the other. Intervals is the number of evenly spaced runs over the duration from From to To, and Step Duration is how many minutes should elapse between runs. Doing a test over time populates the Timeline tab, which draws a clickable graphic of severity states for each item in the set.
It would then run all those queries with a pool limiting simultaneous queries. For an interval of, say, 5 minutes, it would run adjacent 5-minute queries.
So this would speed up the alert authoring and testing workflow significantly. But it would best be implemented as a job system, because with more expensive queries, or a range/interval combination that amounts to a fair number of runs, it may take a minute or so - and having to wait on an open network connection is less than ideal.
I found I generally used it in two modes:
To tweak a specific alert that had fired at some time
To get a general overview of how often the alert rule would trigger on historical data
For the general overview, a larger time range is generally wanted, which means more queries if the interval is kept the same. And with a feature like FOR (pending), you would have to use the same interval the alert would actually run at.
So: possible, with some limitations, and some care needs to be taken to do it right. But extremely useful in my experience.
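A data-source-agnostic sketch of that loop in Python, where evaluate_alert is a placeholder for running the alert's query as of a given time:

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta

def evaluate_alert(at):
    # Placeholder: run the alert's query against the data source as if
    # "now" were `at`, and return the resulting severity.
    ...

start = datetime(2024, 1, 1)   # "From"
end = datetime(2024, 1, 2)     # "To"
step = timedelta(minutes=5)    # "Step Duration": the alert's normal run interval

# One evaluation per interval, evenly spaced from start to end.
times = []
t = start
while t < end:
    times.append(t)
    t += step

# A small worker pool limits simultaneous queries against the backend.
with ThreadPoolExecutor(max_workers=4) as pool:
    severities = list(pool.map(evaluate_alert, times))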

How to interpret LocustIO's output / simulate short user visits

I like Locust, but I'm having a problem interpreting the results.
E.g. my use case is that I have a petition site. I expect 10,000 people to sign the petition over a 12-hour period.
I've written a locust file that simulates user behaviour:
Some users load but don't sign petition
Some users load and submit invalid data
Some users (hopefully) successfully submit.
In real life the user now goes away (because the petition is an API not a main website).
Locust shows me things like:
with 50 concurrent users the median time is 11s
with 100 concurrent users the median time is 20s
But as one "Locust" just repeats the tasks over and over, it's not really like one user. If I set it up with a swarm of 1 user, then that still represents many real world users, over a period of time; e.g. in 1 minute it might do the task 5 times: that would be 5 users.
Is there a way I can interpret the data ("this means we can handle N people/hour"), or some way I can see how many "tasks" get run per second or minute? (I.e. Locust gives me requests per second but not tasks per second.)
Tasks don't really exist at the logging level in Locust.
If you want, you could log your own fake samples and use those as your task counter. This has the unfortunate side effect of inflating your request rate, but it should not impact things like average response times.
Like this:
from locust.events import request_success
...
@task(1)
def mytask(self):
    # do your normal requests, then fire one fake sample per completed task
    request_success.fire(request_type="task", name="completed", response_time=None, response_length=0)
Here's a hacky way that got me somewhere. I'm not happy with it and would love to hear some other answers.
Create class variables on my HttpLocust (WebsiteUser) class:
WebsiteUser.successfulTasks = 0
Then on the UserBehaviour taskset:
import time

@task(1)
def theTaskThatIsConsideredSuccessful(self):
    WebsiteUser.successfulTasks += 1
    # ...do the work...

# This runs once, regardless of how many 'locusts'/users hatch
def setup(self):
    WebsiteUser.start_time = time.time()
    WebsiteUser.successfulTasks = 0

# This runs for every user when the test is stopped.
# I could not find another method that did this (tried various combos).
# It doesn't matter much; you just get N copies of the result!
def on_stop(self):
    took = time.time() - WebsiteUser.start_time
    total = WebsiteUser.successfulTasks
    avg = took / total
    hr = 60 * 60 / avg
    print("{} successful\nAverage: {}s/success\n{} successful signatures per hour".format(total, avg, hr))
And then set a zero wait_time (see the sketch after the drawbacks list below), run till it settles (or failures emerge), and stop the test with the stop button in the web UI.
Output is like
188 successful
0.2738157498075607s/success
13147.527132862522 successful signatures per hour
I think this therefore gives me the maximum conceivable throughput the server can cope with (determined by changing the number of users hatched until failures emerge, or until the average response time becomes unbearable).
Obviously real users would have pauses, but that makes it harder to test the maximums.
Drawbacks
Can't use distributed Locust instances
Messy; also can't 'reset' - have to quit the process and restart for another test.
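For reference, the zero wait_time mentioned above would look roughly like this on the HttpLocust class, assuming Locust 0.13+ where wait_time replaced min_wait/max_wait (class names match the snippets above):

from locust import HttpLocust, constant

class WebsiteUser(HttpLocust):
    task_set = UserBehaviour
    wait_time = constant(0)  # no pause between tasks, to probe maximum throughput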

Time since a value was zero

I have an application that consumes work from an AWS topic. Work is added several times a day, my application quickly consumes it, and the queue length goes back to 0. I am able to produce a metric for the length of the queue.
I would like a metric for the time since the queue length was last zero. Any ideas how to get started?
Assuming a queue_size gauge that records the size of the queue, you can define a recording rule like this:
# Timestamp of the most recent `queue_size == 0` sample; otherwise propagate the previous value
- record: last_empty_queue_timestamp
  expr: timestamp(queue_size == 0) or last_empty_queue_timestamp
Then you can compute the time since the last time the queue was empty as simply as:
timestamp(queue_size) - last_empty_queue_timestamp
Note however that because this is a gauge (and because of the limitations of sampling), you may end up with weird results. E.g. if one work item is added every minute, your sampling interval is one minute, and you sample right after the work items have been added, your queue may never (or very rarely) appear empty from Prometheus's point of view. If that turns out to be an issue (or simply a concern), you may be better off having your application export a metric that is the last timestamp when something was added to an empty queue (basically what the recording rule above attempts to compute).
Similar to Alin's answer; upon revisiting this problem I found this in the Prometheus documentation:
https://prometheus.io/docs/practices/instrumentation/#timestamps,-not-time-since
If you want to track the amount of time since something happened, export the Unix timestamp at which it happened - not the time since it happened.
With the timestamp exported, you can use the expression time() - my_timestamp_metric to calculate the time since the event, removing the need for update logic and protecting you against the update logic getting stuck.
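A minimal sketch of that pattern with the Python prometheus_client library (the metric name and the hook function are illustrative):

from prometheus_client import Gauge

last_empty = Gauge(
    "queue_last_empty_timestamp_seconds",
    "Unix timestamp at which the queue length was last zero",
)

def on_queue_empty():
    # Call this whenever the application observes an empty queue.
    last_empty.set_to_current_time()

The dashboard expression is then simply time() - queue_last_empty_timestamp_seconds, with no reset logic that can get stuck.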