How to reduce throughput in Talend?

I am trying to reduce the throughput to 3 rows/s. I tried searching for this online but didn't find much. Can anybody help?
My current job looks like this: [screenshot of the Talend job omitted]
Or is it possible to limit the rate of requests in the tHTTPRequest component itself?

For this you will want to use the tSleep component, which introduces a wait time per row.
The wait time is specified in seconds, but you might be able to use a floating-point value (e.g. 0.3333 for 3 rows/s). Otherwise you'll be limited to at most 1 row/s.
If you can't use a floating-point value in the tSleep configuration and you absolutely need 3 rows per second, then you could use a tJavaRow component that passes everything from the input to the output but also runs this snippet of Java code:
Thread.sleep(333);
This will sleep the running thread for 333 milliseconds on each row of data passing through the component, giving you roughly 3 rows per second (minus the actual processing time, which in this case should be minimal).
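For completeness, here is a minimal sketch of what the tJavaRow body could look like, assuming a single input column named payload (the column name is illustrative and must match your actual schema). Note that Thread.sleep throws a checked InterruptedException, so it needs a try/catch inside the component:
// Pass the row through unchanged (repeat for each column in your schema).
output_row.payload = input_row.payload;
// Throttle to roughly 3 rows per second.
try {
    Thread.sleep(333);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // restore the interrupt flag
}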

Related

How to make sense of the micrometer metrics using SpringBoot 2, InfluxDB and Grafana?

I'm trying to configure a SpringBoot application to export metrics to InfluxDB to visualise them using a Grafana dashboard. I'm using this dashboard as an example which uses Prometheus as a backend.
For some metrics I have no problem figuring out how to create graphs, but for others I don't know how to create them, or even whether it's possible at all. So I list the things I'm not really sure about in the following points:
Is there any documentation where the unit of each value is described? The application I'm using as an example doesn't have any load on it, so sometimes I can't tell whether a value is in bits, bytes, seconds, milliseconds, or is a plain count.
Some measurements contain the tag metric_type = histogram with the fields count, sum, mean and upper. Again, here I don't know what the value units are, what upper means, or how I'm supposed to plot them. Examples of this are http_server_requests and jvm_gc_pause.
From what I see in the Grafana dashboard example, it seems I should use these histogram-type measurements to create both a graph of counts and a graph of durations. For example, I should be able to create one graph with the number of requests and another with their duration; or, for the garbage collector, one graph with the number of minor and major GCs and another with their duration.
As an example of measures I get inserted into InfluxDB:
time=1625579637946000000 count=1 exception=None mean=0.892144 method=GET metric_type=histogram outcome=SUCCESS status=200 sum=0.892144 upper=0.892144 uri=/actuator/health
or
time=1625581132316000000 action="end of minor GC" cause="Allocation Failure" count=1 mean=2 metric_type=histogram sum=2 upper=2
I agree the documentation for Micrometer is not great; I've had to dig through the code to find answers.
Regarding your questions about jvm_gc_pause: it is a Timer, and the implementation is AbstractTimer, a class that wraps a Histogram among other components. This particular metric is registered by the JvmGcMetrics class. The measurements that are published to InfluxDB are determined by the InfluxMeterRegistry.writeTimer(Timer timer) method:
sum: timer.totalTime(getBaseTimeUnit()) // The total time of recorded events
count: timer.count() // The number of times stop has been called on the timer
mean: timer.mean(getBaseTimeUnit()) // totalTime()/count()
upper: timer.max(getBaseTimeUnit()) // The max time of a single event
The base time unit is milliseconds.
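To illustrate how those four fields relate, here is a minimal sketch using Micrometer's Timer API against a SimpleMeterRegistry (the recorded durations are made up):
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.concurrent.TimeUnit;

public class TimerDemo {
    public static void main(String[] args) {
        Timer timer = new SimpleMeterRegistry().timer("http.server.requests");

        // Record three hypothetical request durations.
        timer.record(100, TimeUnit.MILLISECONDS);
        timer.record(200, TimeUnit.MILLISECONDS);
        timer.record(600, TimeUnit.MILLISECONDS);

        // These are the values InfluxMeterRegistry.writeTimer publishes:
        System.out.println("count = " + timer.count());                          // 3
        System.out.println("sum   = " + timer.totalTime(TimeUnit.MILLISECONDS)); // 900.0
        System.out.println("mean  = " + timer.mean(TimeUnit.MILLISECONDS));      // 300.0
        System.out.println("upper = " + timer.max(TimeUnit.MILLISECONDS));       // 600.0
    }
}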
Similarly, http_server_requests appears to be a Timer as well.
I believe you are correct that the sensible thing is to chart on two separate Grafana panels: one panel for GC pause seconds using sum (or mean or upper), and one panel for GC events using count.

Is it possible use composite triggers in conjunction with micro-batching with dataflow?

We have an unbounded PCollection<TableRow> source that we are inserting into BigQuery.
An easy "by the book" way to fire windows every 500,000 messages or every five minutes would be:
source.apply("GlobalWindow", Window.<TableRow>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterFirst.of(
AfterPane.elementCountAtLeast(500000),
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
).withAllowedLateness(Duration.standardMinutes(1440)).discardingFiredPanes())
You would think that applying the following to the fired window/pane would allow you to write contents of the fired pane to BigQuery:
.apply("BatchWriteToBigQuery", BigQueryIO.writeTableRows()
.to(destination)
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withNumFileShards(NUM_FILE_SHARDS)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But this yields the following error when the pipeline runs: An exception occurred while executing the Java class. When writing an unbounded PCollection via FILE_LOADS, triggering frequency must be specified.
A relatively easy fix would be to add .withTriggeringFrequency(Duration.standardMinutes(5)) to the above, but that would essentially render the idea of inserting either every five minutes or every N messages void, and you might as well get rid of the windowing in that case anyway.
Is there a way to actually accomplish this?
FILE_LOADS requires a triggering frequency.
If you want more real-time results, you can use STREAMING_INSERTS instead.
Reference https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html#FILE_LOADS
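For example, the write from the question could be switched to streaming inserts like this (a sketch reusing the question's source and destination; STREAMING_INSERTS does not require a triggering frequency, though BigQuery streaming quotas and pricing then apply, and withNumFileShards is dropped because it only applies to file loads):
source.apply("StreamWriteToBigQuery", BigQueryIO.writeTableRows()
    .to(destination)
    .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));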

Can you calculate active users using time series

My Atomist client exposes metrics on the commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period (i.e. 1h, 1d, 7d and 30d) in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
this is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  - max_over_time(help_command{job="Application Name",Username=~".+"}[1w] offset 1w) > 0)
which works, but only for one command; I have about 50 other commands that need to be added to that count.
I also tried:
{__name__=~".+_command",job="app name"}[1w] offset 1w
but this is obviously very expensive (it times out in the browser), and it has issues with max_over_time, which doesn't support this kind of selector.
Any help would be appreciated. Am I using the metric in the wrong way? Is there a better way to query this? My only option at the moment is to repeat the working count format above for each of the commands.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using labels with arbitrarily large sets of values (as your usernames are). As you can see (based on your experience with the query timing out), they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
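As a sketch of that third point, assuming the metrics come from a Java exporter using the Prometheus simpleclient (the metric and label names here are illustrative, not from the original post):
import io.prometheus.client.Counter;

public class CommandMetrics {
    // One counter for all commands, with the command name as a label.
    static final Counter COMMAND_USAGE = Counter.build()
            .name("command_usage_total")
            .help("Number of times each command was run.")
            .labelNames("command")
            .register();

    public static void main(String[] args) {
        COMMAND_USAGE.labels("help").inc(); // replaces help_command
        COMMAND_USAGE.labels("foo").inc();  // replaces foo_command
    }
}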
To get back to your question though, you don't need the max_over_time; you can simply write your query as:
count by(__name__) (
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application Name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter never restarted, and the counts will drop to zero when it does, then you'd need to use increase instead of the subtraction, and you'd run into the exact same performance issues as with max_over_time:
count by(__name__) (
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)

how to set 'warm-up period' in AnyLogic

I want to set a warm-up period in AnyLogic Personal Learning Edition. I searched for a warm-up period setting in AnyLogic, but I couldn't find anything about it.
Is there a warm-up period in AnyLogic, or something like it?
There is no default warm-up setting, as it would not make sense given the vast flexibility of the tool and of user needs.
It is easy, however, to set it up yourself. As usual, there are many different options; here is one:
create a variable v_WarmupDuration on Main and set it to however many time units you need
for any data object you want to record only after the warm-up period, ensure it only captures data if time() > v_WarmupDuration
Events can have a custom initial time, which you can set to v_WarmupDuration
Functions that log data should only do so if time() > v_WarmupDuration, and so on (see the sketch below)
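For example, the body of a logging function on Main could look like this (a minimal sketch in AnyLogic's Java; the function signature logCycleTime(double cycleTime) and the DataSet name ds_cycleTime are illustrative):
// Only record once the warm-up period has passed.
if (time() > v_WarmupDuration) {
    ds_cycleTime.add(time(), cycleTime); // time() is the current model time
}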
Alternatively, log all your data as normal but add time stamps to it. Then you can filter out anything that was recorded before the warm-up period when you analyze the results.
Creating a warm-up variable works fine for metrics you create yourself.
But if you want to use built-in functionality such as the histograms created from timeMeasureStart and timeMeasureEnd blocks in AnyLogic, you will also need an extra selection step. For example, assuming you set v_WarmupDuration to 60 minutes, place a SelectOutput block before the timeMeasureEnd whose false branch goes straight to a sink block (or to the next element after the timeMeasureEnd), with the condition:
time(MINUTE) > v_WarmupDuration
That way, rows that arrive during the warm-up period will not accumulate into the dataset of the timeMeasureEnd.
If you want to set this as a parameter of an experiment, then:
Add a variable to the experiment page, off screen, e.g. v_warmupMins
Add a control such as a slider on the experiment page and link it to the variable v_warmupMins
Add a parameter to hold the warm-up time on the Main canvas, e.g. p_warmupMins
In the experiment properties, set the parameter p_warmupMins = v_warmupMins
To programmatically add this time onto the stop time, add this to the experiment's "Before simulation runs" action:
getEngine().setStopTime( getEngine().getStopTime(MINUTE) + v_warmupMins );
Now when I run the experiment with the slider set to 60 minutes, it adds 60 minutes onto the stop time and runs the experiment without accumulating metrics until that time has passed.
Hope that helps.

MongoDB: Alert when count of records is high

I am new to MongoDB and have a collection with the fields below:
boxname
time_create
box_data
Basically, what we are logging here is which box is sending what data and at what time.
Now my requirement is to create an alert in the system if a box sends more requests than a threshold value, indicating that something is probably wrong or suspicious with that box. I can get the count of records for one box over a time period, say 10 minutes, but this query is different: every 10 minutes, count the records for all boxes and check whether any of them exceeds the threshold.
Do I need to poll regularly every 10 minutes? Does a program need to run indefinitely to count and alert? What would be the best mechanism to implement such a requirement?
To solve this problem, a query scheduler is needed.
That means every x minutes we execute the query and check whether any box matches the threshold criteria.
The scheduler implementation depends mostly on your current solution architecture (since I work in C#, HangFire would be my choice for implementing the scheduler).
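The recurring check itself could be a single aggregation. Here is a sketch using the MongoDB Java driver (assuming time_create is stored as a date; the database, collection, and threshold names are illustrative) that a scheduler would run every 10 minutes:
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;
import java.util.Date;

public class BoxAlertCheck {
    public static void main(String[] args) {
        final int THRESHOLD = 1000; // illustrative threshold
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs =
                client.getDatabase("mydb").getCollection("box_logs");

            Date tenMinutesAgo = new Date(System.currentTimeMillis() - 10 * 60 * 1000);

            // Group the last 10 minutes of records by box and keep only
            // the boxes whose request count exceeds the threshold.
            AggregateIterable<Document> offenders = logs.aggregate(Arrays.asList(
                new Document("$match",
                    new Document("time_create", new Document("$gte", tenMinutesAgo))),
                new Document("$group",
                    new Document("_id", "$boxname")
                        .append("requests", new Document("$sum", 1))),
                new Document("$match",
                    new Document("requests", new Document("$gt", THRESHOLD)))));

            for (Document box : offenders) {
                System.out.println("ALERT: box " + box.getString("_id")
                    + " sent " + box.getInteger("requests") + " requests in the last 10 minutes");
            }
        }
    }
}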