Apache Beam Unbounded data processing and compute aggregations like count, durations per key per window - apache-beam

I am new to Beam pipelines and I have a requirement to compute aggregated stats (counts and durations per window, e.g. every 30 minutes) from events received through a Kafka topic (an unbounded source).
Events
{"id":"xxxxx", "state": "start", "timestamp": 1625718600000, "device": "device-1", ...}
{"id":"xxxxx", "state": "end", "timestamp": 1625721300000, "device": "device-1",. ..}
{"id":"yyyyy", "state": "start", "timestamp": 1625718600000, "device": "device-2", ...}
{"id":"yyyyy", "state": "end", "timestamp": 1625719500000, "device": "device-2", ...}
Event "xxxxx" started 10:00 and ended 10:45
Event "yyyyy" started 10:00 and ended 10:15
Expected Stats from pipeline
Device     Interval      Count   Duration
device-1   10:00-10:30   1       30 min
device-2   10:00-10:30   1       15 min
device-1   10:30-11:00   0       15 min
I have played with fixed windows, triggers, GroupByKey, CombineFn, etc. and have succeeded in computing the aggregated counts (incrementing the count when the event state is "start"), but I am clueless about how to compute the duration when an event overlaps window boundaries, even with stateful processing.
Note: I used the event identifier as the key when grouping the events.
Please advise me on this.

So it sounds like what you need to do is bring the start and end events together at the same worker so that you can compute the difference, right?
I can think of a few ways to do this.
Use GroupByKey - set the event ID as the key, perform a GroupByKey to group events by that key, and then compute and output the difference.
Stateful DoFn - when you receive an event, store it in state keyed by the ID; before storing, check whether the matching event is already in state and, if so, compute the difference (a sketch follows below).
One thing to note is that it's possible for the start and end events to fall into two different windows. Neither of the above solutions will work in that case, since different windows are computed independently. I think you'll have to adapt the pipeline to account for such (rare?) occurrences.
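As a rough illustration of the Stateful DoFn option, here is a minimal sketch in the Beam Java SDK that pairs start/end events by ID and emits the duration in milliseconds, keyed by device. The Event POJO, its field names, and the assumption that exactly one start and one end arrive per ID are simplifications of your payload, not a definitive implementation:

import java.io.Serializable;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical POJO decoded from the Kafka JSON payload.
class Event implements Serializable {
  String id;
  String state;     // "start" or "end"
  long timestamp;   // epoch millis
  String device;
}

// Input is keyed by event ID; output is (device, duration in millis).
class PairStartEndFn extends DoFn<KV<String, Event>, KV<String, Long>> {

  // In a real pipeline you may need StateSpecs.value(coder) with an explicit coder.
  @StateId("buffered")
  private final StateSpec<ValueState<Event>> bufferedSpec = StateSpecs.value();

  @ProcessElement
  public void processElement(
      @Element KV<String, Event> element,
      @StateId("buffered") ValueState<Event> buffered,
      OutputReceiver<KV<String, Long>> out) {
    Event current = element.getValue();
    Event previous = buffered.read();
    if (previous == null) {
      // First event seen for this ID (start or end); buffer it and wait.
      buffered.write(current);
      return;
    }
    // Both events have arrived: emit the elapsed time keyed by device.
    long durationMillis = Math.abs(current.timestamp - previous.timestamp);
    out.output(KV.of(current.device, durationMillis));
    buffered.clear();
  }
}

From there you could window the (device, duration) pairs into your 30-minute windows and apply Count/Sum; splitting a duration across window boundaries (your device-1 10:30-11:00 row) would still need extra logic, as noted above.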

Related

Grafana dashboard panel with custom time range

I have a Grafana dashboard with a time range for "today so far"
"time": {
"from": "now/d",
"to": "now"
}
But I have a service that runs only between 4am and 5am every day, for which I would like to plot some metrics in a panel on this dashboard. I'm trying to set the time range for that panel to 4am-5am of the current day so that it doesn't show a full 24-hour x-axis for data that only exists between 4am and 5am. If I were able to specify the time range for the panel using From: and To:, it would look something like this:
"time": {
"from": "now/d+4h",
"to": "now/d-19h"
}
But of course, you can't specify a time range like that for individual panels. You have to make use of the Relative Time and Time Shift values in Query Options for the panel, but, for the life of me, I can't figure out how to represent the time range I want in relative time using those two values. It would seem that Grafana's documentation in this regard is sorely lacking; they only show a few very basic and specific examples, none of which helps me with this query.
Would anyone be able to set me straight here please?
For what it's worth, my data source for this Grafana dashboard is Prometheus but I don't think that is germane to this question.

When to store data in different collections in MongoDB?

I want to store events that occurred during a week in MongoDB. Each week can have 300-400 events.
One week's events are independent of another week's, and I fetch or process only one week at a time (never joining two or more weeks).
All event objects have the same properties but different values.
Is it better to create a separate collection for each week, or to keep everything in the same collection?
Based on the given information, I would choose to store everything in one collection, because 300-400 events per week is a really small number of documents.
The document schema depends on the project's needs/details, but at a minimum I would add separate year and week fields for filtering purposes.
Document example:
{
  "_id": ObjectId("5c530d202029a5144454f9c2"),
  "year": 2019,
  "week": 10,
  "event": {
    "date": ISODate("2019-03-07T11:00:01.022Z"),
    "name": "some event name",
    "message": "some message"
  }
}
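As a sketch of what those two fields buy you (assuming the MongoDB sync Java driver and placeholder connection string, database, and collection names), fetching one week's events becomes a single equality filter:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;

public class WeeklyEvents {
  public static void main(String[] args) {
    // Connection string, database and collection names are placeholders.
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> events =
          client.getDatabase("mydb").getCollection("events");

      // One week's worth of events (300-400 documents) in a single query.
      for (Document doc : events.find(and(eq("year", 2019), eq("week", 10)))) {
        System.out.println(doc.toJson());
      }
    }
  }
}

A compound index on the year and week fields keeps that filter cheap even as the collection grows.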

Count and Time window in Esper EPL

I have the following use case, which I'm trying to write in EPL without success. I'm generating analytics events of different types at different intervals (1 min, 5 min, 10 min, ...). For one special kind of analytics, I need to collect 4 specific analytics events (from which I will compute another analytics event) of different types, returned every interval (1 min, 5 min, 10 min, ...). The condition is that at every whole interval, e.g. every whole minute 00:01:00, 00:02:00, I want either all 4 events returned, or nothing if they don't all arrive within some slack period after the boundary (e.g. 2 s).
case 1: events A, B, C, D arrive at times 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:00.800 - right after the fourth event arrives in Esper, the aggregated event containing all 4 events is returned
case 2: the slack period is 2 seconds and events A, B, C, D arrive at 00:01:00.500, 00:01:00.600, 00:01:00.700, 00:01:02.200 - nothing is returned, as the last event falls outside the slack period
You could create a trigger event every minute like this:
insert into TriggerEvent select * from pattern[timer:schedule(date:'1970-01-01T00:00:00.0Z', period: 1 minute, repetitions: -1)]
The trigger that arrives every minute can kick off a pattern or context. A pattern would seem to be good enough. Here is something like that:
select * from pattern [every TriggerEvent -> (a=A -> b=B -> c=C -> d=D) where timer:within(2 seconds)]

Is a Self Join in Kafka Streams possible?

We are looking at Kafka Streams as a way to solve comparisons in flight. Specifically, we have data arriving on a Kafka topic at roughly 15,000 transactions per second, and we would like to do comparison operations on the records as they roll by. The records are very wide (1900 columns or thereabouts), but the comparison operations occur on very few columns (~10-20). Our comparison window is about a minute.
The scenario would be something like this:
Message 1 arrives with values of foo, bar, foobar, barfoo, 12, 34 at time 00s
Message 2 arrives with values of foo, bat, barbat, batbar, 12, 57 at time 05s
Message 3 arrives with values of foo, bay, barbat, baybat, 14, 19 at time 10s
Message 4 arrives with values of foo, bar, foobar, barfoo, 12, 50 at time 15s
Message 5 arrives with values of bar, bat, barbat, batbar, 14, 18 at time 40s
Message 6 arrives with values of foo, bar, foobar, barfoo, 12, 36 at time 59s
We would like to be able to read the stream, identify that messages 1, 4, and 6 all match our comparison criteria, and then discard messages 1 and 6 while keeping message 4.
I found a comment from Guozhang Wang in Nov 2016 suggesting implementing this through the Processor API. Is this still the current best approach?
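To make the question concrete, here is a rough sketch (not an authoritative answer) of the kind of Processor API approach that comment points at: re-key the stream by the comparison columns, hold candidates in a key-value state store, and use a punctuator to flush one survivor per key when the one-minute comparison window closes. The store name, the String types, and the last-record-wins rule are placeholder assumptions; the real keep/discard decision between messages 1, 4 and 6 would replace that:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Assumes the stream has already been re-keyed by the comparison columns and
// that a key-value store named "candidates" is attached to this processor.
public class CompareWindowProcessor implements Processor<String, String, String, String> {

  private ProcessorContext<String, String> context;
  private KeyValueStore<String, String> candidates;

  @Override
  public void init(ProcessorContext<String, String> context) {
    this.context = context;
    this.candidates = context.getStateStore("candidates");

    // Flush one survivor per comparison key each minute, then reset the store.
    context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, ts -> {
      List<String> seen = new ArrayList<>();
      try (KeyValueIterator<String, String> iter = candidates.all()) {
        while (iter.hasNext()) {
          KeyValue<String, String> kv = iter.next();
          context.forward(new Record<>(kv.key, kv.value, ts));
          seen.add(kv.key);
        }
      }
      seen.forEach(candidates::delete);  // start the next window empty
    });
  }

  @Override
  public void process(Record<String, String> record) {
    // Placeholder keep/discard rule: last record for a key wins. Replace this
    // with whatever comparison actually decides which message to keep.
    candidates.put(record.key(), record.value());
  }
}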

Monthly Schedule in Azure data factory pipeline

How can we schedule an Azure Data Factory pipeline to run only on a particular day of the month (e.g. the 9th of every month)?
This can be achieved with the "schedule trigger" capability in the V2 ADF service: https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers#examples-recurrence-schedules
If you are using Data Factory version 1, you can achieve this by setting the availability with frequency Month, interval 1, and setting the offset to the day of the month on which you want the pipeline to run.
For example, if you want it to run on the 9th of each month as you said, you will have something like this:
"availability": {
"frequency": "Month",
"interval": 1,
"offset": "9.00:00:00",
"style": "StartOfInterval"
}
If you are using Data Factory version 2, this can be achieved with triggers as Mark Kromer said. You have to set {"monthDays": [9]} in the trigger's schedule.
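For reference, a minimal sketch of such a V2 schedule trigger recurrence (trigger name, pipeline name, start time, and time zone are placeholders) could look like this:

{
  "name": "RunOn9thOfMonth",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Month",
        "interval": 1,
        "startTime": "2021-01-01T00:00:00Z",
        "timeZone": "UTC",
        "schedule": {
          "monthDays": [ 9 ]
        }
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "MyPipeline"
        }
      }
    ]
  }
}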
Cheers.