Is there a CloudWatch equivalent for Prometheus Counter?

I need to be able to atomically increment (or decrement) a metric value in CloudWatch (and also be able to reset it to zero). Prometheus provides a Counter type that allows one to do this; is there an equivalent in CloudWatch? All I'm able to find is a way to add a new sample value to a metric, not a way to increment or decrement it.

CloudWatch is essentially a TSDB: it stores point-in-time values, and you can't mutate a metric value once it is ingested (see Publishing Metrics). Storing a cumulative counter in CloudWatch also isn't very useful, because there is no rate(...) function in CloudWatch like in Prometheus. The best you can do is publish the deltas and query them with the Sum statistic over a period. Here is an example, assuming metrics are ingested at 1-minute granularity:
Time | Counter | rate(5m) | CW metric (delta) | CW Sum with period 5m
-----|---------|----------|-------------------|----------------------
1m   | 0       | 0        | 0                 | 0
2m   | 10      | 10       | 10                | 10
3m   | 20      | 20       | 10                | 20
4m   | 40      | 40       | 20                | 40
5m   | 50      | 50       | 10                | 50
6m   | 60      | 60       | 10                | 60
7m   | 100     | 90       | 40                | 90
Note that metrics can be ingested at finer granularity, but it comes at a cost. Also, the statistics (Sum, Average, Maximum, Minimum, etc.) can be retrieved only at 1-minute granularity. There is an option to retrieve the raw data alongside a statistic, but I'm not sure what the use of doing so would be.
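As a rough sketch of the delta approach (the namespace, metric name and boto3 client here are my own placeholders, not from the question), each publish sends only the increment since the last report; the Sum statistic over a period then plays roughly the role of Prometheus's increase():

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_increment(delta):
    # Publish only the delta since the last report, never the running total.
    # "MyApp" and "RequestsProcessed" are hypothetical names.
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{
            "MetricName": "RequestsProcessed",
            "Value": delta,
            "Unit": "Count",
        }],
    )

publish_increment(10)   # e.g. 10 new requests since the previous publish

With deltas ingested like this, the "CW Sum with period 5m" column in the table above is what the Sum statistic would return.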

Related

Filter out everything except values of monotonic growth

Say I have a Micrometer timer my.timer and a Grafana metric max(my_timer_seconds_max) by (pod).
I need a metric that shows only values that have been growing for at least 5 minutes and filters out everything else.

Query to find different combinations of CPU and Memory in the cluster

I was wondering if it's possible to write a query to show the number of nodes in the cluster with a given cpu and memory configuration.
I have a metric kube_node_status_allocatable available with different tags.
Metric 1 (cpu count on each node):
kube_node_status_allocatable{instance="ip1",node="host1",resource="cpu",unit="core"} 21
kube_node_status_allocatable{instance="ip2",node="host2",resource="cpu",unit="core"} 21
kube_node_status_allocatable{instance="ip3",node="host3",resource="cpu",unit="core"} 61
kube_node_status_allocatable{instance="ip4",node="host4",resource="cpu",unit="core"} 61
kube_node_status_allocatable{instance="ip5",node="host5",resource="cpu",unit="core"} 61
Metric 2 (memory on each node):
kube_node_status_allocatable{instance="ip1",node="host1",resource="memory",unit="gb"} 64
kube_node_status_allocatable{instance="ip2",node="host2",resource="memory",unit="gb"} 64
kube_node_status_allocatable{instance="ip3",node="host3",resource="memory",unit="gb"} 128
kube_node_status_allocatable{instance="ip4",node="host4",resource="memory",unit="gb"} 128
kube_node_status_allocatable{instance="ip5",node="host5",resource="memory",unit="gb"} 128
I want to output a metric that looks something like this:
{cpu=21, memory=64} 2
{cpu=61, memory=128} 3
So far I have been able to get the number of nodes with a given configuration for one resource at a time,
i.e., the number of nodes with each CPU configuration:
count_values("node", kube_node_status_allocatable{resource="cpu"})
Above outputs:
{node=21} 2
{node=61} 3
This roughly maps to the configuration (cpu == 21 or 61) and the number of nodes with that configuration (2 or 3).
I can get a similar result for memory, but I am not sure how to join these two.
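One possible approach (an untested sketch, not a verified answer) is to turn each value into a label per node with count_values, join the two resulting series on the node label, and then count by the combined labels:

count by (cpu, memory) (
    count_values by (node) ("cpu",    kube_node_status_allocatable{resource="cpu"})
  * on (node) group_left (memory)
    count_values by (node) ("memory", kube_node_status_allocatable{resource="memory"})
)

The inner count_values calls each produce one series per node (e.g. {node="host1", cpu="21"} 1), the multiplication joined on node copies the memory label over via group_left, and the outer count should then yield something like {cpu="21", memory="64"} 2 and {cpu="61", memory="128"} 3.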

Scaling Kafka streams application using kubernetes horizontal pod scaling

We have a Kafka Streams application with 3 pods. Scaling the application is a heavy operation for us (because of large state), so I would like to add a pod only if it is absolutely necessary, for example if utilization stays above a threshold for, let's say, 10 minutes.
Again, I don't need to scale my application up or down for a sudden burst (a few seconds) of messages.
I'm looking for a configuration something like:
window: 15 mins
average cpu: 1000 milli
That is, I would like to scale the application if the average CPU over a 15-minute window is greater than 1000 milli.
You can take a look at HPA scaling policies.
There is stabilizationWindowSeconds:
StabilizationWindowSeconds is the number of seconds for which past
recommendations should be considered while scaling up or scaling down.
StabilizationWindowSeconds must be greater than or equal to zero and
less than or equal to 3600 (one hour). If not set, use the default
values: - For scale up: 0 (i.e. no stabilization is done)
And the target CPU average utilization can be set in the metric target object, under averageUtilization.
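A hedged sketch of how those fields fit together in an autoscaling/v2 HorizontalPodAutoscaler (the resource names and numbers are placeholders, not taken from the question):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-streams-app            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment                 # or StatefulSet, depending on your setup
    name: kafka-streams-app
  minReplicas: 3
  maxReplicas: 6
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 900    # bursts shorter than ~15 min shouldn't add pods
    scaleDown:
      stabilizationWindowSeconds: 900    # likewise, don't remove pods on short dips
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80         # or type: AverageValue with averageValue: 1000m

As far as I understand, the stabilization window smooths the recommendation over that period (for scale-up the controller effectively acts on the lowest recommendation seen in the window), which is what keeps a few seconds of burst from triggering a new pod.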

Prometheus / Grafana count a downtime of service

I have a service metric that returns either some positive value, or 0 in case of failure.
I want to count how many seconds my service was failing during some time period.
E.g. the expression:
service_metric_name == 0
gives me a dashed line in Grafana.
Is there any way to count how many seconds my service was down for the last 2 hours?
I assume the service is either 0 for being down or 1 for being up.
In this case you can calculate an average over a time range. If the result is 0.9, your service has been up for 90% of the time. If you calculated the average over an hour, this would have been 6 minutes downtime out of 60 minutes.
avg_over_time(service_metric_name[1h])
This will be a moving average; that is, while your service is down the value will slowly decrease, and when it is up again it will slowly increase.
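To turn that into downtime seconds, multiply the down fraction by the window length. For the last 2 hours (still assuming a 0-or-1 metric), a sketch would be:

(1 - avg_over_time(service_metric_name[2h])) * 2 * 3600

i.e. the fraction of samples at 0 over the window, times 7200 seconds.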

Count values by groups

I currently send events to Graphite like
server.my_value with a count that varies between 1 and 1000.
I'd like to plot a graph that shows how many of these my_value events I received, in groups of 100 (so 1, 4, 7 and 89 would all be counted toward the 1-100 group, etc.).
So, to illustrate, let's say I sent these values in a given time window:
1
3
45
13
299
455
74
924
The groups would be:
1-100: 5
100-200: 0
200-300: 1
300-400: 0
400-500: 1
500-600: 0
600-700: 0
700-800: 0
800-900: 0
900-1000: 1
So I would have 10 lines, each representing a group.
This could be done with a histogram, but that won't show me the changes over time.
Is it possible to aggregate values by ranges?
If you avoid using downsampled data, by querying only small time intervals and not crossing the aggregation boundary defined in your Graphite storage schema, then you can just use the histogram feature in Grafana's Graph panel, or the Heatmap panel, to get a histogram over time. (For example, if you have the storage schema 15s:7d,1m:21d,15m:5y and you stay within the 7-day boundary.) Here is an example from the Grafana demo site.
I don't think there is a good way to aggregate values by ranges in a Graphite query if you want to aggregate over larger time intervals. However, it seems to be possible to do it by aggregating the data in statsd:
http://dieter.plaetinck.be/post/histogram-statsd-graphing-over-time-with-graphite/
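For the statsd route (the approach from the linked post), a sketch of the histogram option in statsd's config.js, assuming the values are sent to statsd as timers and that the timer name contains my_value (both assumptions on my part):

{
  histogram: [
    // bins are upper bounds, so this yields the ten 100-wide groups from the question
    { metric: 'my_value', bins: [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000] }
  ]
}

statsd then emits one Graphite series per bin, which you can plot as the 10 lines over time.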