Siddhi delayed query - complex-event-processing

I am struggling to understand this query:
from heartbeats#window.time(1 hour) insert expired events into delayedStream;
from every e = heartbeats -> e2 = heartbeats[deviceId == e.deviceId]
or expired = delayedStream[deviceId == e.deviceId]
within 1 hour 10 minutes
select e.deviceId, e2.deviceId as id2, expired.deviceId as id3
insert into tmpStream;
The first query delays all Events by 1 hour.
The second query finds all events that occurred 1 hour ago and for which no newer event has arrived.
This works, but I don't understand this part:
from every e = heartbeats -> e2 = heartbeats[deviceId == e.deviceId] or expired = delayedStream[deviceId == e.deviceId]
The second part of the query (or expired = ...) checks whether the event with the given deviceId is on the delayedStream. What is the purpose of the first part, and how do the two parts combine so that this query finds devices that have sent no data for more than 1 hour?

I don't think the above query will be accurate if you want to check whether a sensor did not send a reading in the last hour. I tweaked the window to 1 minute and sent 2 events:
[2019-07-19 16:48:23,774] heartbeats : Event{timestamp=1563535103772, data=[1], isExpired=false}
[2019-07-19 16:48:24,696] tmpStream : Event{timestamp=1563535104694, data=[1, 1, null], isExpired=false}
[2019-07-19 16:48:24,697] heartbeats : Event{timestamp=1563535104694, data=[1], isExpired=false}
[2019-07-19 16:49:23,774] tmpStream : Event{timestamp=1563535163772, data=[1, null, 1], isExpired=false}
Let's say events arrive at 10:00 and 10:15; the outputs on tmpStream will be at 10:15 (first part) and 11:00 (due to the delayed stream). The second match is incorrect, since per the use case it should fire at 11:15.
However, if you want to improve the query, you can use Siddhi's feature for detecting non-occurring events for your use case, https://siddhi.io/en/v5.0/docs/query-guide/#detecting-non-occurring-events; it will be simpler.
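For reference, a non-occurrence pattern for this use case could look roughly like the following (an untested sketch based on the linked guide; missingHeartbeatStream is just an illustrative output stream name):
from every e1 = heartbeats -> not heartbeats[deviceId == e1.deviceId] for 1 hour
select e1.deviceId
insert into missingHeartbeatStream;
This fires once for every heartbeat that is not followed by another heartbeat from the same device within 1 hour.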


kafka-streams sliding agg window discard out-of-order record belongs to window without grace

I get the following error, and I actually don't understand it.
o.a.k.s.k.i.KStreamSlidingWindowAggregate - Skipping record for expired window. topic=[...] partition=[0] offset=[16880] timestamp=[1662556875000] window=[1662542475000,1662556875000] expiration=[1662556942000] streamTime=[1662556942000]
streamTime=[1662556942000]
timestamp=[1662556875000]
streamTime-timestamp = 67s
window size is 4 hours
grace period is 0
Why was the record skipped, so that I didn't get an output message? It belongs to the window. Yes, the record is out-of-order.
Update:
After reading more about kafka-streams, I understand that for each message it creates two windows:
(message time - window), and this window includes the message.
(message time + window), and this window excludes the message.
Window 1 is expired. Window 2 isn't. That's why I didn't see an output message.
But logically it's wrong: the message belongs to the window, yet I don't get an output message.
Example
sliding window time diff = 10, grace = 0
stream time = 0
send message (time = 10, key = 2) -> output for key = 2; stream time = 10
send message (time = 4, key = 1) -> no output message
send message (time = 5, key = 1) -> no output message
(the last message belongs to the window [stream-time - window-time, stream-time])
------ restart stream -------
stream time = 0
send message (time = 10, key = 2) -> output for key = 2; stream time = 10
send message (time = 4, key = 2) -> 2 output messages
In Kafka Streams, sliding windows are event-based: a new window is created each time a record enters or drops out of a window. A window is defined by a record timestamp and a fixed duration.
Each record creates a window [record.timestamp - duration, record.timestamp] and
Each dropped record creates a window [record.timestamp + 1ms, record.timestamp + 1ms + duration].
(Be aware that other stream processing frameworks use a totally different definition of 'sliding windows')
The record is not included in the window when
stream-time > window-end + grace-period
(https://kafka.apache.org/27/javadoc/org/apache/kafka/streams/kstream/SlidingWindows.html)
For your initial example, the grace period is zero and your window ends (at the record timestamp) before the stream-time; thus the record is not included in the window.
For the second example, I am not sure. My guess is that the records with key=1 are expired because the stream-time (10) has exceeded their record timestamps (4, 5) and the grace period is 0. For the records with key=2, one window is created for the record with timestamp=10, and an update of the same window is emitted, because that record satisfies the condition above. However, no additional windows are created for the out-of-order record, because the grace period is zero.
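For illustration, a minimal sketch of where the grace period enters the picture (assuming String keys and values and a hypothetical topic name "input"; this is not the topology from the question): with a non-zero grace period, an out-of-order record that is only 67 s behind stream-time would still be accepted and update its windows.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.SlidingWindows;

public class SlidingWindowSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input")   // hypothetical topic name
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // 4-hour time difference as in the question, plus a grace period so
               // records arriving up to 5 minutes out of order still update their windows
               .windowedBy(SlidingWindows.withTimeDifferenceAndGrace(
                       Duration.ofHours(4), Duration.ofMinutes(5)))
               .count();
    }
}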

apache beam (python SDK): Late (or early) events discarded and triggers. How to know how many discarded and why?

I have a streaming pipeline connected to a Pub/Sub subscription (with around 2 million elements every hour). I need to collect them into groups and then extract some information.
def expand(self, pcoll):
    return (
        pcoll |
        beam.WindowInto(FixedWindows(10),
                        trigger=AfterAny(AfterCount(2000), AfterProcessingTime(30)),
                        allowed_lateness=10,
                        trigger=AfterAny(
                            AfterCount(8000),
                            AfterProcessingTime(30),
                            AfterWatermark(
                                early=AfterProcessingTime(60),
                                late=AfterProcessingTime(60)
                            )
                        ),
                        allowed_lateness=60 * 60 * 24,
                        accumulation_mode=AccumulationMode.DISCARDING)
        | "Group by Key" >> beam.GroupByKey()
    )
I try my best to NOT miss any data. But I found out that I have around 4% missing data.
As you can see in the code I trigger anytime I hit 8k elements or every 30 seconds.
I allow a lateness of 1 day, and it should trigger for both early and late events.
Still missing those 4% though. So, is there a way to know if the pipeline is discarding some data? How many elements? For which reason?
Thank you so much in advance
First, I see you have two triggers in the sample code; I assume this is a typo, though.
It looks like you are dropping elements because you are not using Repeatedly, so all elements after the first trigger firing get lost. There's official documentation on this from Beam.
Allow me to post an example:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows, TimestampedValue

test_stream = (TestStream()
    .add_elements([
        TimestampedValue('in_time_1', 0),
        TimestampedValue('in_time_2', 0)])
    .advance_watermark_to(9)
    .advance_processing_time(9)
    .add_elements([TimestampedValue('late_but_in_window', 8)])
    .advance_watermark_to(10)
    .advance_processing_time(10)
    .add_elements([TimestampedValue('in_time_window2', 12)])
    .advance_watermark_to(20)  # past window time
    .advance_processing_time(20)
    .add_elements([TimestampedValue('late_window_closed', 9),
                   TimestampedValue('in_time_window2_2', 12)])
    .advance_watermark_to_infinity())

class RecordFn(beam.DoFn):
    def process(
        self,
        element=beam.DoFn.ElementParam,
        timestamp=beam.DoFn.TimestampParam):
        yield ("key", (element, timestamp))

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with TestPipeline(options=options) as p:
    records = (p | test_stream
                 | beam.ParDo(RecordFn())
                 | beam.WindowInto(FixedWindows(10),
                                   allowed_lateness=0,
                                   # trigger=trigger.AfterCount(1),
                                   trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                                   accumulation_mode=trigger.AccumulationMode.DISCARDING)
                 | beam.GroupByKey()
                 | beam.Map(print)
               )
If we use the trigger trigger.Repeatedly(trigger.AfterCount(1)), all elements are fired as they come, with no dropped elements (except late_window_closed, which is expected since it arrived late):
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))]) # these two are together since they arrived together
('key', [('late_but_in_window', Timestamp(8))])
('key', [('in_time_window2', Timestamp(12))])
('key', [('in_time_window2_2', Timestamp(12))])
If we use trigger.AfterCount(1) (no Repeatedly), we only get the first elements that arrive in the pipeline:
('key', [('in_time_1', Timestamp(0)), ('in_time_2', Timestamp(0))])
('key', [('in_time_window2', Timestamp(12))])
Note that both in_time_1 and in_time_2 appear in the first fired pane because they arrived at the same time (0); had one of them arrived later, it would have been dropped.

Kafka Streams - GroupBy - Late Event - persistentWindowStore - WindowBy with Grace Period and Suppress

My goal is to count success and fail messages from source to destination per second and sum the results on a daily basis.
I had two options to do that:
Stream events, then group them by time#source#destination:
KeyValueBytesStoreSupplier streamStore = Stores.persistentKeyValueStore("store-name");
sourceStream.selectKey((k, v) -> v.getDataTime() + KEY_SEPERATOR + SRC + KEY_SEPERATOR + DEST ).groupByKey().aggregate(
DO SOME Aggregation,
Materialized.<String, AggregationObject>as(streamStore)
.withKeySerde(Serdes.String())
.withValueSerde(AggregationObjectSerdes));
After trying the approach above, we noticed that the state store keeps growing because the number of unique keys keeps increasing and, if I am correct, because the state topics are only "compact" they never expire.
NumberOfUniqueKeys = 86,400 seconds in a day × SOURCE × DESTINATION
Then we thought that if we did not put a time field in the key, we could reduce the state store size. We tried a windowing operation as the second approach.
Use a windowing operation with persistentWindowStore, CustomTimeStampExtractor, windowedBy, and suppress:
WindowBytesStoreSupplier streamStore = Stores.persistentWindowStore("store-name", Duration.ofHours(6), Duration.ofSeconds(1), false);
sourceStream.selectKey((k, v) -> SRC + KEY_SEPERATOR + DEST)
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(1)).grace(Duration.ofSeconds(5)))
    .aggregate(
        {
            DO SOME Aggregation
        },
        Materialized.<String, AggregationObject>as(streamStore)
            .withKeySerde(Serdes.String())
            .withValueSerde(AggregationObjectSerdes))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
After trying the second approach, we reduced the state store size, but now we had a problem with late-arriving events. We added a 5-second grace period together with the suppress operation, but even so, the grace period and suppress did not guarantee that all late-arriving events were handled. Another side effect of suppress is latency, because it emits the aggregation result only after the window's grace period has passed.
By the way, using the windowing operation caused a WARNING message like
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
I checked the reason in the source code and found this here:
https://github.com/a0x8o/kafka/blob/master/streams/src/main/java/org/apache/kafka/streams/state/internals/WindowKeySchema.java
/**
 * Safely construct a time window of the given size,
 * taking care of bounding endMs to Long.MAX_VALUE if necessary
 */
static TimeWindow timeWindowForSize(final long startMs,
                                    final long windowSize) {
    long endMs = startMs + windowSize;
    if (endMs < 0) {
        LOG.warn("Warning: window end time was truncated to Long.MAX");
        endMs = Long.MAX_VALUE;
    }
    return new TimeWindow(startMs, endMs);
}
But it actually does not make any sense to me how endMs can be lower than 0...
Questions:
What if we go with approach 1: how can we reduce the state store size? In approach 1 it was guaranteed that all events would be processed and no events would be missed because of latency.
What if we go with approach 2: how should I tune my logic to catch late-arriving data and reduce latency?
Why do I get the warning message in approach 2, although all time fields in my model are positive?
What other options can you suggest besides these two approaches?
I need some expert help :)
BR,
According to the Kafka mailing list, regarding the warning message
"WARN 1 --- [-StreamThread-2] o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX"
the reply I got was:
You can get the message "o.a.k.s.state.internals.WindowKeySchema : Warning: window end time was truncated to Long.MAX" when your TimeWindowedDeserializer is created without a windowSize. There are two constructors for a TimeWindowedDeserializer, are you using the one with windowSize?
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L46-L55
It calls WindowKeySchema with a windowSize of Long.MAX_VALUE:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindowedDeserializer.java#L84-L90
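For what it is worth, the constructor the reply refers to can be used roughly like this (a minimal sketch, assuming String inner keys and the 1-second windows from approach 2):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.TimeWindowedDeserializer;

// Deserializer for the windowed keys produced by the 1-second windows in approach 2.
// Passing the window size (in ms) avoids the Long.MAX_VALUE fallback that triggers the warning.
TimeWindowedDeserializer<String> windowedKeyDeserializer =
        new TimeWindowedDeserializer<>(Serdes.String().deserializer(), 1000L);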

Aggregation (summation) of number of events from different Kafka topics

My application has three topics that receive some events belonging to users:
Event Type A -> Topic A
Event Type B -> Topic B
Event Type C -> Topic C
This would be an example of the flow of messages:
Message(user 1 - event A - 2020-01-03)
Message(user 2 - event A - 2020-01-03)
Message(user 1 - event C - 2020-01-20)
Message(user 1 - event B - 2020-01-22)
I want to be able to generate reports with the total number of events per user per month, aggregating all the events from the three topics, something like:
User 1 - 2020-01 -> 3 total events
User 2 - 2020-01 -> 1 total events
Having three KStreams (one per topic), how can I perform this addition per month to have the summation of all the events from three different topics? Can you show the code for this?
Because you are only interested in counting, the simplest way would be to keep the user-id as the key and some dummy value for each KStream, merge all three streams, and do a windowed count afterwards (note that calendar-based windows are not supported out of the box; you could use a 31-day window as an approximation or build your own customized windows):
// just map to a dummy empty string (note that `null` would not work)
KStream<UserId, String> streamA = builder.stream("topic-A").mapValues(v -> "");
KStream<UserId, String> streamB = builder.stream("topic-B").mapValues(v -> "");
KStream<UserId, String> streamC = builder.stream("topic-C").mapValues(v -> "");
streamA.merge(streamB).merge(streamC).groupByKey().windowedBy(...).count();
You might also be interested in the suppress() operator.
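Filling in the elided windowing step, a concrete version could look like the sketch below (an illustration only: it assumes String-serialized user ids, the topic names from the question, and the 31-day approximation mentioned above):
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class MonthlyEventCounts {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> consumed = Consumed.with(Serdes.String(), Serdes.String());

        // map every event to a dummy value; only the key (user id) matters for counting
        KStream<String, String> streamA = builder.stream("topic-A", consumed).mapValues(v -> "");
        KStream<String, String> streamB = builder.stream("topic-B", consumed).mapValues(v -> "");
        KStream<String, String> streamC = builder.stream("topic-C", consumed).mapValues(v -> "");

        // 31-day tumbling window as a rough approximation of a calendar month
        KTable<Windowed<String>, Long> eventsPerUserPerMonth =
                streamA.merge(streamB).merge(streamC)
                       .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                       .windowedBy(TimeWindows.of(Duration.ofDays(31)))
                       .count();
    }
}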

pseudocode about registers and clients

I have a project that requires simulating a market with 3 registers. Every second a number of clients arrive at the registers, and we assume that each client spends 4 seconds at the register before leaving. Now let's suppose that we get as input all the customers and their arrival times, e.g. 0001122334455, which means that 3 customers enter at second 0, 2 at second 1, etc. What I need to find is the total time needed to serve all the customers, no matter how many there are, and also the average waiting time at the store.
Can someone come up with a pseudocode for this problem?
while (flag) {
    while (i < A.length - 1) {
        if (fifo[tail].isEmpty()) fifo[tail].put(A[i] + 4);
        else {
            temp = fifo[tail].peek();
            fifo[tail].put(A[i] - temp + 4);
            i++;
        }
        if (tail == a - 1) {
            tail = 0;
        } else tail++;
        if (i > 3) {
            for (int q = 0; q < a; q++) {
                temp = fifo[q].peek();
                if (temp == i) {
                    fifo[q].get();
                }
            }
        }
    }
}
where A is the array that contains all the customers' arrival times as numbers, as required by the input, and fifo is the array of registers with get, put, and peek (get the tail without removing it) methods. I have no clue, though, how to find the total time and the average waiting time.
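For comparison, here is one way the simulation could be sketched in Java (not the poster's pseudocode; it assumes 3 registers, a fixed 4-second service time, and that the arrival seconds have already been parsed from the input string):
public class MarketSimulation {
    public static void main(String[] args) {
        // arrival seconds parsed from the example input "0001122334455"
        int[] arrivals = {0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5};
        int registers = 3;
        int serviceTime = 4;

        long[] freeAt = new long[registers]; // time at which each register becomes free
        long totalWait = 0;
        long lastFinish = 0;

        for (int arrival : arrivals) {
            // pick the register that frees up first
            int best = 0;
            for (int r = 1; r < registers; r++) {
                if (freeAt[r] < freeAt[best]) best = r;
            }
            long start = Math.max(arrival, freeAt[best]); // wait if that register is still busy
            long finish = start + serviceTime;
            freeAt[best] = finish;
            totalWait += start - arrival;
            lastFinish = Math.max(lastFinish, finish);
        }

        System.out.println("Total time to serve everyone: " + lastFinish + " s");
        System.out.println("Average waiting time: " + ((double) totalWait / arrivals.length) + " s");
    }
}
Here the total time is measured from second 0 (the first arrival) until the last customer leaves, and a customer's waiting time is the gap between arriving and starting service.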