Dropwizard Metrics are always aggregated
I'm trying to monitor some metrics via Dropwizard Metrics. I want to get the number of files downloaded in a specific time period, and I want to aggregate that metric myself. For example, let's say that between 10:00 and 10:15, 60 files are downloaded. I want the metric to be 60 during this period, and after 10:15 it must return zero. However, after 10:15 the metric still returns 60. Is there a way to avoid its automatic aggregation?
This is possible; I assume you are just using the wrong metric type. You need a metric that uses time windows, configured correctly. Look at this simple example I wrote up:
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.Histogram;
import com.codahale.metrics.SlidingTimeWindowReservoir;

public class MetricTest {
    public static final long SLEEP_TIME = 51;

    public static void main(String[] args) throws InterruptedException {
        // Reservoir that only keeps values recorded within the last 50 ms
        SlidingTimeWindowReservoir reservoir = new SlidingTimeWindowReservoir(50, TimeUnit.MILLISECONDS);
        Histogram h = new Histogram(reservoir);

        for (int i = 0; i < 6; i++) {
            h.update(1);
            // Snapshot size = number of values still inside the time window
            System.out.println(h.getSnapshot().size());
            Thread.sleep(SLEEP_TIME);
        }
    }
}
I am using a histogram to count the number of values I submitted; the reservoir's 50 ms window controls when values expire, and the SLEEP_TIME constant controls whether each value has expired before the next snapshot.
If I run this with a sleep time of 0, I get:
1
2
3
4
5
6
If I run it with a sleep time of 51, the previous value has expired by the time each snapshot is created, and I get:
1
1
1
1
1
1
For a count that resets after each period, you can write your own reservoir implementation that simply does bucketing and deletes old buckets, for example:
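Here is a minimal sketch of that bucketing idea (hypothetical code, not part of Dropwizard; the class name is made up). For your 15-minute periods you would construct it as new BucketedCounter(60_000, 15 * 60_000) and call mark() once per downloaded file:

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Counts events in fixed-size time buckets and drops buckets that have
// fallen out of the window, so queries after the window report zero.
public class BucketedCounter {
    private final long bucketMillis;
    private final long windowMillis;
    private final ConcurrentSkipListMap<Long, LongAdder> buckets = new ConcurrentSkipListMap<>();

    public BucketedCounter(long bucketMillis, long windowMillis) {
        this.bucketMillis = bucketMillis;
        this.windowMillis = windowMillis;
    }

    public void mark() {
        long bucket = System.currentTimeMillis() / bucketMillis;
        buckets.computeIfAbsent(bucket, b -> new LongAdder()).increment();
    }

    public long count() {
        long oldest = (System.currentTimeMillis() - windowMillis) / bucketMillis;
        buckets.headMap(oldest).clear(); // delete expired buckets
        long sum = 0;
        for (LongAdder a : buckets.values()) sum += a.sum();
        return sum;
    }
}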
I hope that clarifies how metrics work.
Artur
Related
How do I calculate the number of ticks per measure from a MIDI file
I am trying to calculate the number of ticks per measure (bar) from a MIDI file, but I am a bit stuck. I have a MIDI file from which I can extract the following information (provided in meta messages):

#0: Time signature: 4/4, Metronome pulse: 24 MIDI clock ticks per click, Number of 32nd notes per beat: 8

There are two tempo messages, which I'm not sure are relevant:

#0: Microseconds per quarternote: 400000, Beats per minute: 150.0
#1800: Microseconds per quarternote: 441176, Beats per minute: 136.0001450668214

From trial and error, looking at the Note On messages, and looking at the MIDI file in Garageband, I can 'guess' that the number of ticks per measure is 2100, with a quarter note of 525 ticks. My question is: can I arrive at the 2100 number using the tempo information provided above, and if so, how? Or have I not parsed enough information from the MIDI file, and is there some other control message that I need to look at?
Use the following Java 11 code to extract the ticks per measure. This assumes 4 quarter notes per bar.

public MidiFile(String filename) throws Exception {
    var file = new File(filename);
    var sequence = MidiSystem.getSequence(file);
    System.out.println("Tick length: " + sequence.getTickLength());
    System.out.println("Division Type: " + sequence.getDivisionType());
    System.out.println("Resolution (PPQ if division = " + javax.sound.midi.Sequence.PPQ + "): " + sequence.getResolution());
    System.out.println("Ticks per measure: " + (4 * sequence.getResolution()));
}
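Note that the tempo messages are not needed: tempo maps ticks to wall-clock time, while the ticks-per-measure count depends only on the resolution (PPQ) and the time signature. If you don't want to assume 4/4, the time signature can be read from the time-signature meta event (type 0x58). A hedged sketch (the class name and structure are just for illustration):

import java.io.File;
import javax.sound.midi.*;

public class TicksPerMeasure {
    public static void main(String[] args) throws Exception {
        Sequence sequence = MidiSystem.getSequence(new File(args[0]));
        int ppq = sequence.getResolution(); // ticks per quarter note, e.g. 525
        int numerator = 4, denominator = 4; // fall back to 4/4 if no meta event is found
        for (Track track : sequence.getTracks()) {
            for (int i = 0; i < track.size(); i++) {
                MidiMessage msg = track.get(i).getMessage();
                if (msg instanceof MetaMessage && ((MetaMessage) msg).getType() == 0x58) {
                    byte[] data = ((MetaMessage) msg).getData();
                    numerator = data[0] & 0xFF;
                    denominator = 1 << (data[1] & 0xFF); // stored as a power of two
                }
            }
        }
        // a measure holds <numerator> beats, each worth 4/<denominator> quarter notes
        long ticksPerMeasure = (long) ppq * 4 * numerator / denominator;
        System.out.println("Ticks per measure: " + ticksPerMeasure);
    }
}

For the file in the question (PPQ 525, time signature 4/4) this yields 525 * 4 * 4 / 4 = 2100.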
haproxy stats: qtime,ctime,rtime,ttime?
Running a web app behind HAProxy 1.6.3-1ubuntu0.1, I'm getting haproxy stats qtime,ctime,rtime,ttime values of 0,0,0,2704. From the docs (https://www.haproxy.org/download/1.6/doc/management.txt):

58. qtime [..BS]: the average queue time in ms over the 1024 last requests
59. ctime [..BS]: the average connect time in ms over the 1024 last requests
60. rtime [..BS]: the average response time in ms over the 1024 last requests (0 for TCP)
61. ttime [..BS]: the average total session time in ms over the 1024 last requests

I'm expecting response times in the 0-10ms range. A ttime of 2704 milliseconds seems unrealistically high. Is it possible the units are off and this is 2704 microseconds rather than 2704 milliseconds?

Secondly, it seems suspicious that ttime isn't even close to qtime+ctime+rtime. Is total response time not the sum of the time to queue, connect, and respond? What is the other time, that is included in the total but not in queue/connect/response? Why can my response times be <1ms, but my total response times be ~2704ms?

Here are my full csv stats:

$ curl "http://localhost:9000/haproxy_stats;csv"
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
http-in,FRONTEND,,,4707,18646,50000,5284057,209236612829,42137321877,0,0,997514,,,,,OPEN,,,,,,,,,1,2,0,,,,0,4,0,2068,,,,0,578425742,0,997712,22764,1858,,1561,3922,579448076,,,0,0,0,0,,,,,,,,
servers,server1,0,0,0,4337,20000,578546476,209231794363,41950395095,,0,,22861,1754,95914,0,no check,1,1,0,,,,,,1,3,1,,578450562,,2,1561,,6773,,,,0,578425742,0,198,0,0,0,,,,29,1751,,,,,0,,,0,0,0,2704,
servers,BACKEND,0,0,0,5919,5000,578450562,209231794363,41950395095,0,0,,22861,1754,95914,0,UP,1,1,0,,0,320458,0,,1,3,0,,578450562,,1,1561,,3922,,,,0,578425742,0,198,22764,1858,,,,,29,1751,0,0,0,0,0,,,0,0,0,2704,
stats,FRONTEND,,,2,5,2000,5588,639269,8045341,0,0,29,,,,,OPEN,,,,,,,,,1,4,0,,,,0,1,0,5,,,,0,5374,0,29,196,0,,1,5,5600,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,1,200,196,639269,8045341,0,0,,196,0,0,0,UP,0,0,0,,0,320458,0,,1,4,0,,0,,1,0,,5,,,,0,0,0,0,196,0,,,,,0,0,0,0,0,0,0,,,0,0,0,0,
In HAProxy 2 and later you now get two values, n / n, which are the max within a sliding window and the average for that window. The max value remains the max across all sample windows until a higher value is found. On 1.8 you only get the average.

(Example stats-page screenshots comparing HAProxy 2 and 1.8 omitted. Note those proxies are used very differently and with dramatically different loads.)

So it looks like the average response times, at least since the last reboot, are 66ms and 275ms.

The average is computed as:

(d_time + srv_samples_window - 1) / srv_samples_window

where d_time is the running sum of data time and srv_samples_window is the number of samples, i.e. the cumulative number of HTTP requests, capped at TIME_STATS_SAMPLES (see below). This might not be a perfect analysis, so if anyone has improvements they'd be appreciated. The following is meant to show how I came to the answer above, so you can use it to gather more insight into the other counters you asked about. Most of this information was gathered from reading stats.c. The counters you asked about are defined here:

unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
unsigned int qtime_max, ctime_max, dtime_max, ttime_max; /* maximum of conn_time, queue_time, data_time, total_time observed */

The stats page values are built from this code:

if (strcmp(field_str(stats, ST_F_MODE), "http") == 0)
    chunk_appendf(out, "<tr><th>- Responses time:</th><td>%s / %s</td><td>ms</td></tr>",
                  U2H(stats[ST_F_RT_MAX].u.u32), U2H(stats[ST_F_RTIME].u.u32));
chunk_appendf(out, "<tr><th>- Total time:</th><td>%s / %s</td><td>ms</td></tr>",
              U2H(stats[ST_F_TT_MAX].u.u32), U2H(stats[ST_F_TTIME].u.u32));

You asked about all the counters, but I'll focus on one. As can be seen in the snippet above for "Responses time:", ST_F_RT_MAX and ST_F_RTIME are the values displayed on the stats page as n (rtime_max) / n (rtime) respectively. These are defined as follows:

[ST_F_RT_MAX] = { .name = "rtime_max", .desc = "Maximum observed time spent waiting for a server response, in milliseconds (backend/server)" },
[ST_F_RTIME] = { .name = "rtime", .desc = "Time spent waiting for a server response, in milliseconds, averaged over the 1024 last requests (backend/server)" },

These set a "metric" value (among other things) in a case statement further down in the code:

case ST_F_RT_MAX:
    metric = mkf_u32(FN_MAX, sv->counters.dtime_max);
    break;
case ST_F_RTIME:
    metric = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
    break;

These metric values give us a good look at what the stats page is telling us. The first value in "Responses time: 0 / 0", ST_F_RT_MAX, is the maximum time spent waiting. The second value, ST_F_RTIME, is the average time taken per connection. These are the max and average taken within a window of time, i.e. however long it takes for you to get 1024 connections. For example, "Responses time: 10000 / 20" means:

max time spent waiting (max value ever reached, including HTTP keepalive time) over the last 1024 connections: 10 seconds
average time over the last 1024 connections: 20 ms

So for all intents and purposes:

rtime_max = dtime_max
rtime = swrate_avg(d_time, srv_samples_window)

Which begs the question: what are dtime_max, d_time and srv_samples_window? These are the data-time counters; I couldn't actually figure out how these time values are being set, but at face value it's "some time" for the last 1024 connections. As pointed out here, keepalive times are included in the max totals, which is why the numbers are high. Now that we know ST_F_RT_MAX is a max value and ST_F_RTIME is an average: an average of what?
/* compute time values for later use */
if (selected_field == NULL || *selected_field == ST_F_QTIME ||
    *selected_field == ST_F_CTIME || *selected_field == ST_F_RTIME ||
    *selected_field == ST_F_TTIME) {
    srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
    if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
        srv_samples_window = srv_samples_counter;
}

TIME_STATS_SAMPLES and the default window are defined as:

#define TIME_STATS_SAMPLES 512
unsigned int srv_samples_window = TIME_STATS_SAMPLES;

In HTTP mode, srv_samples_counter is sv->counters.p.http.cum_req. http.cum_req is exposed as ST_F_REQ_TOT:

[ST_F_REQ_TOT] = { .name = "req_tot", .desc = "Total number of HTTP requests processed by this object since the worker process started" },

For example, if the value of http.cum_req is 10, then srv_samples_counter will be 10. The sample window appears to be the number of successful requests, within a given sample window, for a given backend server. d_time (data time) is passed as "sum" and is computed as some non-negative value, or it's counted as an error. I thought I found the code for how d_time is accumulated, but I wasn't sure, so I haven't included it.

/* Returns the average sample value for the sum <sum> over a sliding window of
 * <n> samples. Better if <n> is a power of two. It must be the same <n> as the
 * one used above in all additions. */
static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
{
    return (sum + n - 1) / n;
}
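To make the rounding concrete: swrate_avg is a ceiling-style division. With hypothetical numbers, a full window of n = TIME_STATS_SAMPLES = 512 and a running sum d_time = 10240 gives (10240 + 511) / 512 = 20 ms, while d_time = 1 gives (1 + 511) / 512 = 1 ms; any non-zero sum reports an average of at least 1 ms.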
Marathon backoff - is it really exponential?
I'm trying to figure out Marathon's exponential backoff configuration. Here's the documentation:

The backoffSeconds and backoffFactor values are multiplied until they reach the maxLaunchDelaySeconds value. After they reach that value, Marathon waits maxLaunchDelaySeconds before repeating this cycle exponentially. For example, if backoffSeconds: 3, backoffFactor: 2, and maxLaunchDelaySeconds: 3600, there will be ten attempts to launch a failed task, each three seconds apart. After these ten attempts, Marathon will wait 3600 seconds before repeating this cycle.

The way I think of exponential backoff is that the wait periods should be:

3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24

and so on, so every time the app crashes, Marathon will wait a longer period of time before retrying. However, given the description above, Marathon's logic for waiting looks something like this:

int retryCount = 0;
while (backoffSeconds * (backoffFactor ^ retryCount) < maxLaunchDelaySeconds) {
    wait(backoffSeconds);
    retryCount++;
}
wait(maxLaunchDelaySeconds);

This matches the explanation in the documentation, since 3*2^x < 3600 for values of x less than or equal to 10. However, I really don't see how it can be called an exponential backoff, since the wait time is constant. Is there a way to make Marathon wait progressively longer with every restart of the app? Am I misunderstanding the doc? Any help would be appreciated!
As far as I understand the code in RateLimiter.scala, it is like you described, but the delay is then capped at the maxLaunchDelay waiting period. Let's say maxLaunchDelay is one hour (3600s):

3*2^0 = 3
3*2^1 = 6
3*2^2 = 12
3*2^3 = 24
3*2^4 = 48
3*2^5 = 96
3*2^6 = 192
3*2^7 = 384
3*2^8 = 768
3*2^9 = 1536
3*2^10 = 3072
3*2^11 = 3600 (capped; would be 6144)
3*2^12 = 3600 (capped; would be 12288)
3*2^13 = 3600 (capped; would be 24576)

which gives a typical 2^n curve. You would get a steeper increase with other backoff factors, for example a backoff factor of 10 or 20. (Graphs of these curves omitted.)

Additionally I saw a re-work of this topic; a code review is currently open here: https://phabricator.mesosphere.com/D1007

What do you think?

Thanks, Johannes
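For comparison, here is a minimal sketch (not Marathon's actual code) of the capped exponential delay described above, i.e. delay(n) = min(backoffSeconds * backoffFactor^n, maxLaunchDelaySeconds):

public class BackoffDemo {
    public static void main(String[] args) {
        double backoffSeconds = 3, backoffFactor = 2, maxLaunchDelaySeconds = 3600;
        for (int retry = 0; retry <= 13; retry++) {
            // Capped exponential delay: grows by backoffFactor until it hits the ceiling
            double delay = Math.min(backoffSeconds * Math.pow(backoffFactor, retry),
                                    maxLaunchDelaySeconds);
            System.out.printf("retry %2d: wait %.0fs%n", retry, delay);
        }
    }
}

Running it reproduces the table above: 3, 6, 12, ..., 3072, then a constant 3600 once the cap is reached.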
What is the purpose of Flux::sampleTimeout method in the project-reactor API?
The Java docs say the following:

Emit the last value from this Flux only if there were no new values emitted during the time window provided by a publisher for that particular last value.

However, I found the above description confusing. I read in the gitter chat that it's similar to debounce in RxJava. Can someone please illustrate it with an example? I could not find this anywhere after doing a thorough search.
sampleTimeout lets you associate a companion Flux X' to each incoming value x in the source. If X' completes before the next value is emitted by the source, then value x is emitted. If not, x is dropped. The same processing is applied to subsequent values. Think of it as splitting the original sequence into windows delimited by the start and completion of each companion flux. If two windows overlap, the value that triggered the first one is dropped.

On the other side, you have sample(Duration), which only deals with a single companion Flux. It splits the sequence into contiguous windows at a regular time period and drops all but the last element emitted during a particular window.

(edit): about your use case

If I understand correctly, it looks like you have processing of varying length that you want to schedule periodically, but you also don't want to consider values for which processing takes more than one period?

If so, it sounds like you want to 1) isolate your processing in its own thread using publishOn and 2) simply use sample(Duration) for the second part of the requirement (the delay allocated to a task is not changing). Something like this:

List<Long> passed =
        //regular scheduling:
        Flux.interval(Duration.ofMillis(200))
            //this is only to show that processing is indeed started regularly
            .elapsed()
            //this is to isolate the blocking processing
            .publishOn(Schedulers.elastic())
            //blocking processing itself
            .map(tuple -> {
                long l = tuple.getT2();
                int sleep = l % 2 == 0 || l % 5 == 0 ? 100 : 210;
                System.out.println(tuple.getT1() + "ms later - " + tuple.getT2() + ": sleeping for " + sleep + "ms");
                try {
                    Thread.sleep(sleep);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                return l;
            })
            //this is where we say "drop if too long"
            .sample(Duration.ofMillis(200))
            //the rest is to make it finite and print the processed values that passed
            .take(10)
            .collectList()
            .block();

System.out.println(passed);

Which outputs:

205ms later - 0: sleeping for 100ms
201ms later - 1: sleeping for 210ms
200ms later - 2: sleeping for 100ms
199ms later - 3: sleeping for 210ms
201ms later - 4: sleeping for 100ms
200ms later - 5: sleeping for 100ms
201ms later - 6: sleeping for 100ms
196ms later - 7: sleeping for 210ms
204ms later - 8: sleeping for 100ms
198ms later - 9: sleeping for 210ms
201ms later - 10: sleeping for 100ms
196ms later - 11: sleeping for 210ms
200ms later - 12: sleeping for 100ms
202ms later - 13: sleeping for 210ms
202ms later - 14: sleeping for 100ms
200ms later - 15: sleeping for 100ms
[0, 2, 4, 5, 6, 8, 10, 12, 14, 15]

So the blocking processing is triggered approximately every 200ms, and only values that were processed within 200ms are kept.
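For completeness, here is a minimal sampleTimeout sketch (a hypothetical example showing the debounce-like behavior): each value starts a 100 ms companion timer, and the value is only emitted if no newer value arrives before its timer completes:

import java.time.Duration;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class SampleTimeoutDemo {
    public static void main(String[] args) {
        Flux.concat(
                Flux.just(1, 2, 3), // a burst: 1 and 2 are superseded before their timers fire
                Mono.delay(Duration.ofMillis(200)).thenReturn(4)) // a quiet gap before 4
            // each value gets a 100 ms companion; overlapping windows drop the older value
            .sampleTimeout(v -> Mono.delay(Duration.ofMillis(100)))
            .doOnNext(v -> System.out.println("emitted: " + v))
            .blockLast();
    }
}

This should print only 3 (the last value of the burst, since 200 ms of silence follow it) and 4 (the final value).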
Simpy: How can I represent failures in a train subway simulation?
New python user here and first post on this great website. I haven't been able to find an answer to my question, so hopefully it is unique.

Using simpy I am trying to create a train subway/metro simulation with failures and repairs periodically built into the system. These failures happen to the train but also to signals on sections of track and on platforms. I have read and applied the official Machine Shop example (which you can see resemblance of in the attached code) and have thus managed to model random failures and repairs to the train by interrupting its 'journey time'.

However I have not figured out how to model failures of signals on the routes which the trains follow. I am currently just specifying a time for a trip from A to B, which does get interrupted, but only due to train failure.

Is it possible to define each trip as its own process, i.e. a separate process for sections A_to_B and B_to_C, and separate platforms as pA, pB and pC, each one with a single resource (to allow only one train on it at a time), and to incorporate random failures and repairs for these section and platform processes? I would also perhaps need to have several sections between two platforms, any of which could experience a failure.

Any help would be greatly appreciated. Here's my code so far:

import random
import simpy
import numpy

RANDOM_SEED = 1234
T_MEAN_A = 240.0                 # mean journey time
T_MEAN_EXPO_A = 1 / T_MEAN_A     # for exponential distribution
T_MEAN_B = 240.0                 # mean journey time
T_MEAN_EXPO_B = 1 / T_MEAN_B     # for exponential distribution
DWELL_TIME = 30.0                # amount of time train sits at platform for passengers
DWELL_TIME_EXPO = 1 / DWELL_TIME
MTTF = 3600.0                    # mean time to failure (seconds)
TTF_MEAN = 1 / MTTF              # for exponential distribution
REPAIR_TIME = 240.0
REPAIR_TIME_EXPO = 1 / REPAIR_TIME
NUM_TRAINS = 1
SIM_TIME_DAYS = 100
SIM_TIME = 3600 * 18 * SIM_TIME_DAYS
SIM_TIME_HOURS = SIM_TIME / 3600

# Defining the times for processes
def A_B():  # returns processing time for journey A to B
    return random.expovariate(T_MEAN_EXPO_A) + random.expovariate(DWELL_TIME_EXPO)

def B_C():  # returns processing time for journey B to C
    return random.expovariate(T_MEAN_EXPO_B) + random.expovariate(DWELL_TIME_EXPO)

def time_to_failure():  # returns time until next failure
    return random.expovariate(TTF_MEAN)

# Defining the train
class Train(object):
    def __init__(self, env, name, repair):
        self.env = env
        self.name = name
        self.trips_complete = 0
        self.broken = False
        # Start "travelling" and "break_train" processes for the train
        self.process = env.process(self.running(repair))
        env.process(self.break_train())

    def running(self, repair):
        while True:
            # start trip A_B
            done_in = A_B()
            while done_in:
                try:
                    # going on the trip
                    start = self.env.now
                    yield self.env.timeout(done_in)
                    done_in = 0  # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    done_in -= self.env.now - start  # How much time left?
                    with repair.request(priority = 1) as req:
                        yield req
                        yield self.env.timeout(random.expovariate(REPAIR_TIME_EXPO))
                    self.broken = False
            # Trip is finished
            self.trips_complete += 1
            # start trip B_C
            done_in = B_C()
            while done_in:
                try:
                    # going on the trip
                    start = self.env.now
                    yield self.env.timeout(done_in)
                    done_in = 0  # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    done_in -= self.env.now - start  # How much time left?
                    with repair.request(priority = 1) as req:
                        yield req
                        yield self.env.timeout(random.expovariate(REPAIR_TIME_EXPO))
                    self.broken = False
            # Trip is finished
            self.trips_complete += 1

    # Defining the failure
    def break_train(self):
        while True:
            yield self.env.timeout(time_to_failure())
            if not self.broken:
                # Only break the train if it is currently working
                self.process.interrupt()

# Setup and start the simulation
print('Train trip simulator')
random.seed(RANDOM_SEED)  # Helps with reproduction

# Create an environment and start setup process
env = simpy.Environment()
repair = simpy.PreemptiveResource(env, capacity = 1)
trains = [Train(env, 'Train %d' % i, repair) for i in range(NUM_TRAINS)]

# Execute
env.run(until = SIM_TIME)

# Analysis
trips = []
print('Train trips after %s hours of simulation' % SIM_TIME_HOURS)
for train in trains:
    print('%s completed %d trips.' % (train.name, train.trips_complete))
    trips.append(train.trips_complete)
mean_trips = numpy.mean(trips)
std_trips = numpy.std(trips)
print "mean trips: %d" % mean_trips
print "standard deviation trips: %d" % std_trips
It looks like you are using Python 2, which is a bit unfortunate, because Python 3.3 and above give you some more flexibility with Python generators. But your problem should be solvable in Python 2 nonetheless.

You can use sub-processes within a process:

def sub(env):
    print('I am a sub process')
    yield env.timeout(1)
    # return 23  # Only works in py3.3 and above
    env.exit(23)  # Workaround for older python versions

def main(env):
    print('I am the main process')
    retval = yield env.process(sub(env))
    print('Sub returned', retval)

As you can see, you can use Process instances returned by Environment.process() like normal events. You can even use return values in your sub processes.

If you use Python 3.3 or newer, you don't have to explicitly start a new sub-process but can use sub() as a sub-routine instead and just forward the events it yields:

def sub(env):
    print('I am a sub routine')
    yield env.timeout(1)
    return 23

def main(env):
    print('I am the main process')
    retval = yield from sub(env)
    print('Sub returned', retval)

You may also be able to model signals as resources that may be used either by a failure process or by a train. If the failure process requests the signal first, the train has to wait in front of the signal until the failure process releases the signal resource. If the train is already passing the signal (and thus holds the resource), the signal cannot break. I don't think that's a problem, because the train can't stop anyway. If it should be a problem, just use a PreemptiveResource.

I hope this helps. Please feel welcome to join our mailing list for more discussions.