How to exclude spikes from SumoLogic alert? - error-logging

We have a SumoLogic alert that fires if more than 10 errors are logged in 60 min.
I would prefer something like this:
if there is a spike and all the errors happen within e.g. 1 minute (consider the issue auto-resolved), do not generate an alert.
How can I write such a SumoLogic query?
Variations of the requirement:
Logs have a clientIp field; if all errors are reported for the same client, do not generate an alert (the problem is with a particular client, not with the application).
If more than 10 errors are logged in 60 min, send an alert, unless the errors are of type A; but if there are more than 100 errors of type A, send the alert (errors of type A are acceptable unless there are too many of them).
If more than 10 errors are logged in 60 min, send an alert only if the last error happened less than 30 min ago (otherwise consider it auto-fixed).

I am not entirely sure how your data is shaped, but...
if there is a spike and all the errors happen in e.g. 1 minute ( consider as issue has been auto resolved ) do not generate alert.
This you can solve by aggregating:
| timeslice 1m
| count by _timeslice
| where _count > 1
or similar.
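To make the intent of the per-minute aggregation concrete, here is a rough Python sketch (not Sumo Logic syntax). It buckets error timestamps by minute, in the spirit of `timeslice 1m | count by _timeslice`, and suppresses the alert when every error lands in the same bucket. The function name, the thresholds, and the sample timestamps are assumptions made up for this illustration.

```python
from collections import Counter

def spike_aware_alert(error_times_sec, min_errors=10, min_active_minutes=2):
    """Return True only if errors exceed the threshold AND span several minutes.

    error_times_sec: error timestamps (seconds) inside the 60-minute window.
    """
    if len(error_times_sec) <= min_errors:
        return False  # too few errors to alert at all
    # Count errors per one-minute bucket.
    per_minute = Counter(int(t // 60) for t in error_times_sec)
    # A spike confined to a single minute produces a single bucket: no alert.
    return len(per_minute) >= min_active_minutes

# Hypothetical sample data.
spike = [120 + i for i in range(20)]   # 20 errors, all inside minute 2
spread = [i * 180 for i in range(12)]  # 12 errors, one every 3 minutes

print(spike_aware_alert(spike), spike_aware_alert(spread))
```

In the query itself, the same idea would mean counting how many distinct timeslices contain errors and alerting only when that number is above one.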
if all errors are reported for the same client, do not generate alert
It sounds like:
| count by _timeslice, clientIp
would do the job.
if more than 10 errors logged in 60 min, send an alert, unless the errors are of type A, but if there are more than 100 errors of type A,
Rough sketch of the query clause would be:
| if(something, 1, 0) as is_of_type_A
| count by is_of_type_A, ...
| where (is_of_type_A = 1 and _count > 100)
OR (is_of_type_A = 0 and _count > 10)
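The two-threshold rule can be checked in isolation with a small Python sketch (again, not Sumo Logic syntax). It mirrors the `is_of_type_A` flag and the `where` clause above; the function name and the sample inputs are hypothetical.

```python
from collections import Counter

def type_aware_alert(error_types, type_a_limit=100, other_limit=10):
    """Alert if type-A errors exceed 100 or all other errors exceed 10."""
    # Same split as `if(something, 1, 0) as is_of_type_A | count by is_of_type_A`.
    counts = Counter("A" if t == "A" else "other" for t in error_types)
    return counts["A"] > type_a_limit or counts["other"] > other_limit

print(type_aware_alert(["A"] * 50 + ["B"] * 5))  # tolerated noise
print(type_aware_alert(["A"] * 101))             # too many type-A errors
```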
Disclaimer: I am currently employed by Sumo Logic.

Related

Gatling load testing and running scenarios

I am looking to create three scenarios:
The first scenario will run a bunch of GET requests for 30s
The second and third scenarios will run in parallel and wait until the first is finished.
I want the requests from the first scenario to be excluded from the report.
I have the basic outline of what I want to achieve, but I am not seeing the expected results:
val myFeeder = csv("somefile.csv")
val scenario1 = scenario("Get stuff")
  .feed(myFeeder)
  .during(30 seconds) {
    exec(
      http("getStuff(${csv_colName})").get("/someEndpoint/${csv_colName}")
    )
  }
val scenario2 = ...
val scenario3 = ...
setUp(
  scenario1.inject(
    constantUsersPerSec(20) during (30 seconds)
  ).protocols(firstProtocaol),
  scenario2.inject(
    nothingFor(30 seconds), //wait 30s
    ...
  ).protocols(secondProt),
  scenario3.inject(
    nothingFor(30 seconds), //wait 30s
    ...
  ).protocols(thirdProt)
)
I am seeing the first scenario run throughout the entire test. It doesn't stop after 30s.
For the first scenario I would like to cycle through the CSV file and perform a request for each line, perhaps 5-10 requests per second. How do I achieve that?
I would also like it to stop after 30s and then run the other two in parallel, hence the nothingFor in the last two scenarios above.
Also, is it possible to exclude the first scenario's requests from the report, and if so, how?
You are likely not getting the expected results due to the combination of settings between your injection profile and your "Get Stuff" scenario.
constantUsersPerSec(20) during (30 seconds)
will start 20 users on scenario "Get Stuff" every second for 30 seconds. So even during the 30th second, 20 users will START "Get Stuff". The injection profile only controls when a user starts, not how long they stay active. Once a user is executing the "Get Stuff" scenario, they make the 'get' request repeatedly over the course of 30 seconds because of the .during loop.
So at the very least, you will have users executing "Get Stuff" for 60 seconds, well into the execution of your other scenarios. Depending on the execution time of your getStuff call, it may be even longer.
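The arithmetic behind that 60-second figure can be written out as a short Python sketch. It is a back-of-the-envelope model, not Gatling code: the function names are made up, and it assumes the loop deadline is only checked when an iteration starts, so the final request may still overrun by its own duration.

```python
def users_started(rate_per_s, inject_duration_s):
    """Total users launched by a constant-rate injection."""
    return rate_per_s * inject_duration_s

def last_activity_second(inject_duration_s, loop_duration_s, request_time_s=0.0):
    """Latest second at which an injected user may still be sending requests."""
    # The last users start when the injection ends, then loop for the full
    # loop duration; their final request may take request_time_s on top.
    return inject_duration_s + loop_duration_s + request_time_s

print(users_started(20, 30))               # users launched in total
print(last_activity_second(30, 30))        # earliest possible end of "Get Stuff"
print(last_activity_second(30, 30, 2.5))   # with a hypothetical 2.5 s request
```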
To avoid this, you could work out exactly how long you want the "Get Stuff" scenario to run, set that in the injection profile and have no looping in the scenario. Alternatively, you could just set your 'nothingFor' values to be >60s.
To exclude the Get Stuff calls from reports, you can add silencing to the protocol definition (assuming it's not shared with your other requests). More details at https://gatling.io/docs/3.2/http/http_protocol/#silencing

Gatling: Understanding rampUsersPerSec(minTPS) to maxTPS during seconds

I am reviewing some Scala code for Gatling that injects transactions over a period of 20 seconds.
/* TPS = Transactions Per Second */
val minTps = Integer.parseInt(System.getProperty("minTps", "1"))
val maxTps = Integer.parseInt(System.getProperty("maxTps", "5"))
var rampUsersDurationInMinutes = Integer.parseInt(System.getProperty("rampUsersDurationInMinutes", "20"))
setUp(
  scn.inject(
    rampUsersPerSec(minTps) to maxTps during (rampUsersDurationInMinutes seconds)
  ).protocols(tcilProtocol)
)
The same question was asked in "What does rampUsersPerSec function really do?" but never answered. I think that ideally the graph should look like this.
Could you please confirm whether I have understood rampUsersPerSec correctly?
block (ramp) 1 = 4 users +1
block (ramp) 2 = 12 users +2
block (ramp) 3 = 24 users +3
block (ramp) 4 = 40 users +4
block (ramp) 5 = 60 users +5
The results show that the requests count is indeed 60. Is my calculation correct?
---- Global Information --------------------------------------------------------
> request count                                        60 (OK=38     KO=22    )
> min response time                                  2569 (OK=2569   KO=60080 )
> max response time                                 61980 (OK=61980  KO=61770 )
> mean response time                                42888 (OK=32411  KO=60985 )
> std deviation                                     20365 (OK=18850  KO=505   )
> response time 50th percentile                     51666 (OK=32143  KO=61026 )
> response time 75th percentile                     60903 (OK=48508  KO=61371 )
> response time 95th percentile                     61775 (OK=61886  KO=61725 )
> response time 99th percentile                     61974 (OK=61976  KO=61762 )
> mean requests/sec                                 0.741 (OK=0.469  KO=0.272 )
---- Response Time Distribution ------------------------------------------------
rampUsersPerSec is an open workload model injection where you specify the rate at which users start the scenario. The Gatling documentation says that this injection profile:
Injects users from starting rate to target rate, defined in users per second, during a given duration. Users will be injected at regular intervals
So while I'm not sure that the example you provide is precisely correct, in that Gatling may not be using a second as the 'regular interval' (it might be a smoother model), you are more or less correct. You specify a starting rate and a final rate, and Gatling works out all the intermediate injection rates for your duration.
Note that this says nothing about the number of concurrent users your simulation will generate. That is a function of the arrival rate (which you control) and the execution time (which you do not).
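As a sanity check on the total of 60, here is a small Python sketch that treats the ramp as an idealized straight line from the starting rate to the target rate. That linearity is an assumption of the sketch; how Gatling actually spaces individual users is not modelled, and the function names are made up.

```python
def total_users_injected(start_rate, end_rate, duration_s):
    """Area under a linear ramp (a trapezoid): average rate times duration."""
    return (start_rate + end_rate) / 2 * duration_s

def rate_at(t, start_rate, end_rate, duration_s):
    """Instantaneous injection rate (users/sec) at time t on the linear ramp."""
    return start_rate + (end_rate - start_rate) * t / duration_s

# Values from the question: minTps=1, maxTps=5, 20 seconds.
print(total_users_injected(1, 5, 20))

# The question's stepwise view: five 4-second blocks at 1..5 users/sec.
print(sum(rate * 4 for rate in range(1, 6)))
```

Both views give the same total, which agrees with the request count of 60 in the report, assuming each user sends exactly one request.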

promql example with related fields but different labels

I'm using Prometheus and Grafana, and I'm trying to track a web server app.
I want to graph the average duration in ms of a particular query. I think I can get there from the data below, but I'm struggling.
My two sets of values:
rate(http_server_request_duration_seconds_sum[5m])
Element Value
{instance="dbserver:5000",job="control-tower",method="get",path="/api/control/v1/node/config.json"} 0.0010491088980113385
{instance="dbserver:5000",job="control-tower",method="get",path="/api/schedule/v1/programs/:id.json"} 0
{instance="dbserver:5000",job="control-tower",method="get",path="/api/schedule/v1/users.json"} 0
{instance="dbserver:5000",job="control-tower",method="get",path="/metrics"} 0.00009133616130826839
{instance="dbserver:5000",job="control-tower",method="post",path="/api/caption/v1/messages.json"} 0
{instance="dbserver:5000",job="control-tower",method="post",path="/api/caption/v1/sessions.json"} 0
{instance="dbserver:5000",job="control-tower",method="post",path="/api/schedule/v1/programs.json"} 0
{instance="dbserver:5000",job="control-tower",method="put",path="/api/caption/v1/sessions/captioners.json"} 0
{instance="dbserver:5000",job="control-tower",method="put",path="/api/control/v1/agents/:id.json"}
rate(http_server_requests_total[5m])
Element Value
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="get",path="/api/control/v1/node/config.json"} 0.03511075688258612
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="get",path="/api/schedule/v1/programs/:id.json"} 0
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="get",path="/api/schedule/v1/users.json"} 0
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="get",path="/metrics"} 0.06671043807691363
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="post",path="/api/caption/v1/sessions.json"} 0
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="post",path="/api/schedule/v1/programs.json"} 0
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="put",path="/api/caption/v1/sessions/captioners.json"} 0
{code="200",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="put",path="/api/control/v1/agents/:id.json"} 0
{code="422",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="post",path="/api/schedule/v1/programs.json"} 0
{code="502",host="dbserver:5000",instance="dbserver:5000",job="control-tower",method="post",path="/api/caption/v1/messages.json"}
They have different labels. For this, I only care where path="/api/caption/v1/messages.json".
I think I need to use a combination of rate, sum, and "on" or "ignore", but I haven't been able to get on or ignore to work at all.
I can get the numerator (in seconds) with:
rate( http_server_request_duration_seconds_sum { path="/api/caption/v1/messages.json" }[5m])
And that returns:
{instance="dbserver:5000", job="control-tower", method="post", path="/api/caption/v1/messages.json"}
But the denominator can have different return codes, so I have to sum those, and I need to do some ignore or on or something, but I haven't found an example that helps me out, and I'm really new at this.
Anyone?
Okay, I continued to play. Because I only have one path I worry about, I figured out I could sum the rates. I think this works:
sum( rate( http_server_request_duration_seconds_sum {path="/api/caption/v1/messages.json"}[2h])) / sum( rate( http_server_requests_total{ path="/api/caption/v1/messages.json"}[2h]))
I changed the range from 5m to 2h because my sample data had fallen outside the 5-minute window, and I was getting zeros.
I THINK what this is doing is summing the rates, which gets rid of all the labels. And I THINK what it's also doing is using 2 hours of data. I think the rate value is how quickly the value changed over that 2 hour period.
I would love comments.
This solution won't work if I want one chart to include other paths, and I'm still not sure what to do about that, so this solves my current problem but still doesn't help me figure out how to do something similar with ignore or on.
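One way to picture a multi-path version is to keep only the path label on both sides before dividing, which in PromQL would correspond to grouping with sum by (path) rather than a plain sum. The Python sketch below emulates that grouping on plain dictionaries; it is an illustration of the label arithmetic, not PromQL, and the helper names and sample values are invented.

```python
from collections import defaultdict

def sum_by(samples, label):
    """Sum sample values grouped by one label, dropping all other labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels[label]] += value
    return dict(totals)

def ratio_by(numerator, denominator, label):
    """Divide two sample sets after reducing both to the same single label."""
    num = sum_by(numerator, label)
    den = sum_by(denominator, label)
    # Skip groups whose denominator is missing or zero.
    return {key: num[key] / den[key] for key in num if den.get(key)}

# Invented sample values: the duration series has no "code" label,
# while the request-count series has one entry per status code.
duration_rate = [({"path": "/a", "method": "post"}, 0.5)]
request_rate = [
    ({"path": "/a", "code": "200"}, 2.0),
    ({"path": "/a", "code": "502"}, 3.0),
]

print(ratio_by(duration_rate, request_rate, "path"))
```

Once both sides carry only the path label, the label sets match and the division lines up per path.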

Possible bug in Pd patch

I have made a very simple patch: when a bang is triggered, it is meant to output a number between 0 and 2 that differs from the previous one. In other words, no number is repeated twice in a row.
In the way that I set it up, it is meant to work in theory. Even my programming mentor said that it should work, in theory, and he's generally a very smart man. He's informally known as being the boffin of the academy.
A few more details:
This happens in both purr data and pure data, with the exact same setup.
No external libraries are used. Just plain Vanilla objects.
Since there doesn't seem to be a way to attach the actual file itself, I will instead post an image of the code:
The problem is with depth-first processing (as used by Pd) and the related stack-unrolling, as this might lead to setting the 2nd input of [select] to an old value (which you didn't expect).
Example
Note: select:in0 means the left-most inlet of [select], and so on.
Imagine the [select] is initialized to 0 and the [random 3] object outputs the sequence 2 0 0 2 0 2 ... (hint: [seed 96().
The expected output would be 2 0 2 0 2 ..., however the actual output is 2 0 2 2 2 ...
Now this is what happens if you repeatedly send [bang( to the random generator:
1. random generates 2
   - 2 is sent to sel:in0, which compares it to 0 (no match)
   - and sends it out of sel:out1 (the reject outlet), displaying the number 2
   - after that the number is sent to sel:in1, setting its internal state to 2.
2. random generates 0
   - 0 is sent to sel:in0, which compares it to 2 (no match)
   - and sends it out of sel:out1, displaying the number 0
   - after that the number is sent to sel:in1, setting its internal state to 0.
3. random generates 0
   - 0 is sent to sel:in0, which compares it to 0 (match!)
   - and sends a bang through sel:out0 (the match outlet)
   - triggering a new call to random, which now generates 2
   - 2 is sent to sel:in0, which compares it to 0 (no match)
   - and sends it out of sel:out1, displaying the number 2
   - after that the number is sent to sel:in1, setting its internal state to 2.
   - after that the number 0 (still pending in trigger:out0) is sent to sel:in1, setting its internal state to 0!!!
4. random generates 0
   - 0 is sent to sel:in0, which compares it to 0 (match!)
   - and sends a bang through sel:out0
   - triggering a new call to random, which now generates 2
   - 2 is sent to sel:in0, which compares it to 0 (no match)
   - and sends it out of sel:out1, displaying the number 2
   - after that the number is sent to sel:in1, setting its internal state to 2.
   - after that the number 0 (still pending in trigger:out0) is sent to sel:in1, setting its internal state to 0!!!
As you can see, at the end of #3 the internal state of [select] is 0, even though the last number generated by [random] was 2 (because the left-most outlet of [trigger] only sends the 0 after the nested call has sent the 2, due to stack-unrolling).
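The walkthrough can be replayed with a small Python model of the depth-first evaluation. This is a sketch, not Pd: it assumes the patch is wired as described above (a trigger whose right outlet feeds sel:in0 first and whose left outlet feeds sel:in1 afterwards), and the function name is made up. The recursion plays the role of the message stack.

```python
def run_original_patch(random_values, bangs):
    """Replay the original patch depth-first and return the displayed numbers."""
    numbers = iter(random_values)  # stands in for [random 3]
    sel_state = 0                  # right-inlet value of [select], initially 0
    displayed = []

    def bang():
        nonlocal sel_state
        value = next(numbers)
        # Trigger's right outlet fires first: value goes to sel:in0.
        if value == sel_state:
            bang()                   # match outlet re-bangs [random]
        else:
            displayed.append(value)  # reject outlet displays the value
        # Trigger's left outlet fires only after the branch above has fully
        # returned, so a stale value can overwrite the newer state.
        sel_state = value

    for _ in range(bangs):
        bang()
    return displayed

print(run_original_patch([2, 0, 0, 2, 0, 2], bangs=4))
```

With the sequence from the example, four bangs consume all six random numbers and display 2 0 2 2, matching the steps above.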
Solution
The solution is simple: make sure that the state of [select] contains the last displayed value, rather than the last one generated on the stack. Avoid feedback when modifying the internal state.
E.g. (using local send/receive for nicer ASCII-art):
[r $0-again]
|
[bang(
|
[random 3]
|
|      [r $0-last]
|      |
[select]
|      |
|      [t f f]
|      |     |
|      |     [s $0-last]
|      |
|      [print]
|
[s $0-again]
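The same kind of Python model shows why this wiring behaves as intended. It is a sketch under the assumption that the reject outlet of [select] feeds [t f f], whose right outlet stores the value via $0-last before the left outlet prints it, and that the match outlet re-bangs [random] via $0-again. The function name is made up.

```python
def run_fixed_patch(random_values, bangs):
    """Replay the fixed patch and return the displayed numbers."""
    numbers = iter(random_values)  # stands in for [random 3]
    last_shown = 0                 # right-inlet value of [select], initially 0
    displayed = []

    def bang():
        nonlocal last_shown
        value = next(numbers)
        if value == last_shown:
            bang()                   # match outlet -> [s $0-again]
        else:
            last_shown = value       # [t f f] right outlet -> [s $0-last]
            displayed.append(value)  # [t f f] left outlet -> [print]

    for _ in range(bangs):
        bang()
    return displayed

print(run_fixed_patch([2, 0, 0, 2, 0, 2], bangs=5))
```

Because the state is only written on the reject path, nothing is left pending on the stack when a match triggers another random number, and the example sequence yields 2 0 2 0 2.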

Force gatling.io test to fail connections after a certain duration

Is there a way to force a gatling test to consider connections that have been active longer than a certain duration to have failed?
For instance, I have a test that will create 400 users/second for 60 seconds. However, the test "hangs" indefinitely.
================================================================================
2016-04-13 08:08:25 200s elapsed
---- Full Chain test -----------------------------------------------------------
[##############################################################------------] 84%
waiting: 0 / active: 3728 / done:20362
---- Requests ------------------------------------------------------------------
> Global (OK=20362 KO=0 )
================================================================================
As you can see, the 60 second test, +/- a few seconds for the final requests to complete, has gone on for 200 seconds (this is before killing it). The "active" number has remained at 3728 since the 65 second mark.
This duration goes against all the timeouts I can see in the Gatling docs, and setting my own timeouts doesn't appear to do anything. Here's my reference.conf:
gatling {
  http {
    ahc {
      requestTimeout=7000
      maxRetry=0
      sslSessionTimeout=7000
    }
  }
  data {
    noActivityTimeout=5
  }
}
Has anyone figured out a way to get around this issue?
Please upgrade to Gatling 2.2.0 that we've just released. There's a good chance it fixes your bug.