How to express event de-duplication logic in Siddhi stream processing - complex-event-processing

Hi: I need the following de-duplication logic implemented in Siddhi stream processing. Assume I have an InputStream, and I want to produce the OutputStream as follows:
(1) When an event is the first one in the InputStream (since the event processing engine started), insert it into the OutputStream.
(2) If an event with the same signature (for example, the same event name) arrives within a 2-minute window, we consider it identical and should NOT insert it into the OutputStream. Otherwise, we should insert it.
I tried to use an event pattern to do the filtering. However, I cannot find a way to express the "negation logic" in Siddhi, that is, if (not (e1 --> e2 with the same signature within a 2-minute window)). Is there a clever way to perform such event de-duplication? Note that de-duplication is a very common operation in event processing.
If I were to implement this in Java, it would be relatively straightforward: I would create a hash table. When the first event arrives, I register it in the hash table and set the acceptable time of the registered event to 2 minutes later. When the next event arrives, I look it up in the hash table and compare the retrieved event's acceptable time with the current event's time; if the current event time is smaller than the acceptable time, I do not treat it as an output event. Rather than the Java implementation, I would prefer a declarative solution expressed as a Siddhi query, if that is possible.
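The hash-table approach described here can be sketched in plain Java roughly as follows; the class and method names (EventDeduplicator, shouldEmit) are illustrative, not from any library:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hash-table de-duplication described above.
// An event's "signature" is its name; times are epoch milliseconds.
public class EventDeduplicator {
    private static final long SUPPRESSION_WINDOW_MS = 2 * 60 * 1000; // 2-minute window

    // signature -> time after which the next event with this signature is accepted
    private final Map<String, Long> acceptableTime = new HashMap<>();

    /** Returns true if the event should be forwarded to the output stream. */
    public boolean shouldEmit(String signature, long eventTimeMs) {
        Long acceptAfter = acceptableTime.get(signature);
        if (acceptAfter != null && eventTimeMs < acceptAfter) {
            return false; // duplicate within the suppression window
        }
        // first occurrence, or the window has elapsed: register and emit
        acceptableTime.put(signature, eventTimeMs + SUPPRESSION_WINDOW_MS);
        return true;
    }
}
```

This captures the "acceptable time" bookkeeping in a few lines; the declarative Siddhi version in the answer below expresses the same idea with an in-memory table.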

You can use an in-memory table to achieve that; please find the sample below. It is quite similar to your Java approach.
define stream InputStream (event_id string, data string);
define stream OutputStream (event_id string, data string);
define table ProcessedEvents (event_id string);
from InputStream[not(ProcessedEvents.event_id == event_id in ProcessedEvents)]
insert into OutputStream;

from OutputStream
select event_id
insert into ProcessedEvents;

from OutputStream#window.time(2 min)
select event_id
insert expired events into PurgeStream;

from PurgeStream
delete ProcessedEvents
on ProcessedEvents.event_id == event_id;

Related

Confused about intervalJoin

I'm trying to come up with a solution which involves applying some logic after the join operation to pick one event from streamB among multiple EventBs. It would be like a reduce function but it only returns 1 element instead of doing it incrementally. So the end result would be a single (EventA, EventB) pair instead of a cross product of 1 EventA and multiple EventB.
streamA
.keyBy((a: EventA) => a.common_key)
.intervalJoin(
streamB
.keyBy((b: EventB) => b.common_key)
)
.between(Time.days(-30), Time.days(0))
.process(new MyJoinFunction)
The data would be ingested like (assuming they have the same key):
EventB ts: 1616686386000
EventB ts: 1616686387000
EventB ts: 1616686388000
EventB ts: 1616686389000
EventA ts: 1616686390000
Each EventA key is guaranteed to arrive only once.
Assume a join operation like the one above generated 1 EventA joined with 4 EventBs, successfully collected in MyJoinFunction. Now what I want to do is access these values at once and apply some logic to match the EventA to exactly one EventB.
For example, for the above dataset I need (EventA 1616686390000, EventB 1616686387000).
MyJoinFunction will be invoked for each (EventA, EventB) pair but I'd like an operation after this, that lets me access an iterator so I can look through all EventB events for each EventA.
I am aware that I can apply another windowing operation after the join to group all the pairs, but I want this to happen immediately after the join succeeds. So if possible, I'd like to avoid adding another window since my window is already large (30 days).
Is Flink the right choice for this use case, or am I on the wrong track entirely?
This could be implemented as a KeyedCoProcessFunction. You would key both streams by their common key, connect them, and process both streams together.
You can use ListState to store the events from B (for a given key), and ValueState for A (again, for a given key). You can use an event time timer to know when the time has come to look through the B events in the ListState, and produce your result. Don't forget to clear the state once you are finished with it.
If you're not familiar with this part of the Flink API, the tutorial on Process Functions should be helpful.
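The state handling such a KeyedCoProcessFunction performs (buffer B events per key, then pick exactly one when the A event arrives or the timer fires) can be sketched in plain Java; the class name and the selection rule below are illustrative assumptions, not Flink API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java analogue of the Flink state layout: a per-key list of B events
// (ListState) and a per-key matching step triggered by A (ValueState + timer).
public class OneToOneMatcher {
    private final Map<String, List<Long>> bEvents = new HashMap<>(); // per-key "ListState"
    private final BiFunction<Long, List<Long>, Long> selector;       // picks one B for an A

    public OneToOneMatcher(BiFunction<Long, List<Long>, Long> selector) {
        this.selector = selector;
    }

    public void onEventB(String key, long ts) {
        bEvents.computeIfAbsent(key, k -> new ArrayList<>()).add(ts);
    }

    /** Called when A arrives (in Flink: when the event-time timer fires). */
    public long onEventA(String key, long aTs) {
        List<Long> bs = bEvents.remove(key); // clear the state once finished
        return selector.apply(aTs, bs);      // assumes at least one B was buffered
    }
}
```

The selector stands in for whatever "pick one EventB" logic the question needs; here a usage example could pass, say, the latest B timestamp.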

Kafka Streams as table Patch log not full Post

Desired functionality: For a given key, key123, numerous services are running in parallel and reporting their results to a single location, once all results are gathered for key123 they are passed to a new downstream consumer.
Original idea: Using AWS DynamoDB to hold all results for a given entry. Every time a result is ready a micro-service does a PATCH operation to the database on key123. An output stream checks each UPDATE to see if the entry is complete, if so, it is forwarded downstream.
New idea: Use Kafka Streams and KSQL to reach the same goal. All services write their output to the results topic; the topic forms a changelog KStream that we query with KSQL for completed entries. Something like:
CREATE STREAM completed_results FROM results_stream SELECT * WHERE (all results != NULL).
The part I'm not sure how to do is the PATCH operation on the stream: how do I make the output stream show the accumulation of all messages for key123 instead of just the most recent one?
KSQL users, does this even make sense? Am I close to a solution that someone has done before?
If you can produce all your events to the same topic, with the key set, then you can collect all of the events for a specific key using an aggregation in ksqlDB such as:
CREATE STREAM source (
KEY INT KEY, -- example key to group by
EVENT STRING -- example event to collect
) WITH (
kafka_topic='source', -- or whatever your source topic is called.
value_format='json' -- or whatever value format you need.
);
CREATE TABLE agg AS
SELECT
key,
COLLECT_LIST(event) as events
FROM source
GROUP BY key;
This will create a changelog topic called AGG by default. As new events are received for a specific key on the source topic, ksqlDB will produce messages to the AGG topic, with the key set to key and the value containing the list of all the events seen for that key.
You can then import this changelog as a stream:
CREATE STREAM agg_stream (
KEY INT KEY,
EVENTS ARRAY<STRING>
) WITH (
kafka_topic='AGG',
value_format='json'
);
And you can then apply some criteria to filter the stream to only include your final results:
CREATE STREAM completed_results AS
SELECT
*
FROM agg_stream
WHERE ARRAY_LENGTH(EVENTS) = 5; -- example 'complete' criteria.
You may even want to use a user-defined function to define your complete criteria:
CREATE STREAM completed_results AS
SELECT
*
FROM agg_stream
WHERE IS_COMPLETE(EVENTS);
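A hypothetical IS_COMPLETE implementation could be as simple as the plain-Java predicate below. In ksqlDB it would be wrapped in a class annotated with @UdfDescription/@Udf and deployed as a UDF jar; both the class name and the completeness criterion (exactly 5 non-null results) are assumptions for illustration:

```java
import java.util.List;

// Illustrative completeness predicate for the aggregated events list.
public class IsComplete {
    private static final int EXPECTED_RESULTS = 5; // assumed number of reporting services

    public static boolean isComplete(List<String> events) {
        return events != null
                && events.size() == EXPECTED_RESULTS
                && events.stream().noneMatch(e -> e == null); // no missing results
    }
}
```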

How to do pattern matching using match_recognize in Esper for unknown properties of events in event stream?

I am new to Esper, and I am trying to filter event properties from event streams that carry many events at high velocity.
I am using Kafka to send CSV rows one by one from a producer to a consumer, and at the consumer I convert those rows to HashMaps and create events at run-time in Esper.
For example, I have the events listed below, arriving every second.
WeatherEvent Stream:
E1 = {hum=51.0, precipi=1, precipm=1, tempi=57.9, icon=clear, pickup_datetime=2016-09-26 02:51:00, tempm=14.4, thunder=0, windchilli=, wgusti=, pressurei=30.18, windchillm=}
E2 = {hum=51.5, precipi=1, precipm=1, tempi=58.9, icon=clear, pickup_datetime=2016-09-26 02:55:00, tempm=14.5, thunder=0, windchilli=, wgusti=, pressurei=31.18, windchillm=}
E3 = {hum=52, precipi=1, precipm=1, tempi=59.9, icon=clear, pickup_datetime=2016-09-26 02:59:00, tempm=14.6, thunder=0, windchilli=, wgusti=, pressurei=32.18, windchillm=}
Where E1, E2...EN are multiple events in WeatherEvent
In the above events, I just want to filter out properties like hum, tempi, tempm and pressurei, because they change as time proceeds (during 4 secs), and I don't care about the properties that do not change at all or change very slowly.
Using the EPL query below, I am able to filter out properties like tempm and hum:
#Name('Out') select * from weatherEvent.win:time(10 sec)
match_recognize (
partition by pickup_datetime?
measures A.tempm? as a_temp, B.tempm? as b_temp
pattern (A B)
define
B as Math.abs(B.tempm? - A.tempm?) > 0
)
The problem is that I can only do it when I specify tempm or hum in the query for pattern matching.
But since the data comes from CSV and has high-dimensional features, I don't know the event properties beforehand.
I want Esper to automatically detect (at run-time) the features/properties that are changing and filter them out, without me specifying the event properties.
Any ideas how to do it? Is that even possible with Esper? If not, can I do it with other CEP engines like Siddhi or Oracle CEP?
You may add a "?" to the event property name to get the value of a property that is not known at the time the event type is defined. This is called a dynamic property; see the documentation. The type returned is Object, so you need to downcast.
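Since dynamic properties come back as Object, an update listener on Map-based events could also detect the changing numeric properties itself. The helper below is an illustrative sketch, not Esper API: it compares two consecutive events and returns the names of properties whose numeric values changed.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Given two consecutive Map-based events, collect the property names whose
// numeric values differ. Non-numeric or missing values are ignored, which
// mirrors the downcast-from-Object handling dynamic properties require.
public class ChangedProperties {
    public static Set<String> changed(Map<String, Object> prev, Map<String, Object> curr) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Object> e : curr.entrySet()) {
            Object before = prev.get(e.getKey());
            Object after = e.getValue();
            if (before instanceof Number && after instanceof Number
                    && ((Number) before).doubleValue() != ((Number) after).doubleValue()) {
                result.add(e.getKey()); // numeric value changed between events
            }
        }
        return result;
    }
}
```

Running this over each (A, B) pair matched by the EPL would surface the fast-changing features (hum, tempi, ...) without naming them up front.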

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Given a topic (t1), I can use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a Windowed key, instead of the table above I am mapping the stream to a table with simple key:
t1_Stream.groupByKey()
.windowedBy(TimeWindows.of( n*1000)).count()
.toStream().map((k,v)->new KeyValue<>(k.key(), Math.toIntExact(v))).to(frequency_topic);
KTable<Integer,Integer> t1_frequency_table = builder.table(frequency_topic);
If I now look up this table when a new key arrives in my stream, how do I know whether the lookup table will be updated first or the join will occur first (the latter would cause the stale frequency to be joined to the record rather than the current one)? Would it be better to create a stream instead of a table and then do a windowed join?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this ? Thanks.
For the particular program you shared, the stream-side record will be processed first. The reason is that you pipe the data through a topic...
When the record is processed, it will update the aggregation result, which emits an update record that is written to the through-topic. Directly afterwards, the record will be processed by the join operator. Only afterwards will a new poll() call eventually read the aggregation result from the through-topic and update the table side of the join.
Using the DSL, it does not seem possible to achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join with the semantics you need.
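The core logic such a Transformer would keep in its state store can be sketched in plain Java (class and method names here are illustrative, not Kafka Streams API): per key, a deque of recent timestamps yields the frequency over the last n seconds at the moment each record is processed, so the enriched record always sees the up-to-date count.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Per-key sliding-window frequency: record a key occurrence and return how
// many occurrences fall within the last windowMs milliseconds.
public class KeyFrequency {
    private final long windowMs;
    private final Map<Integer, Deque<Long>> timestamps = new HashMap<>();

    public KeyFrequency(long windowMs) { this.windowMs = windowMs; }

    /** Record one occurrence of key at ts; return its frequency in the window. */
    public int recordAndCount(int key, long ts) {
        Deque<Long> dq = timestamps.computeIfAbsent(key, k -> new ArrayDeque<>());
        dq.addLast(ts);
        while (dq.peekFirst() < ts - windowMs) { // evict entries older than the window
            dq.removeFirst();
        }
        return dq.size();
    }
}
```

In a real Transformer the deque would live in a key-value state store rather than a HashMap, but the update-then-count order is what guarantees the frequency is current when the (k, (v, f)) record is emitted.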

Siddhi CEP: Aggregate the String values of grouped events in a batch time window

I'm using Siddhi to reduce the number of events existing in a system. To do so, I declared a batch time window that groups all the events by their target_ip.
from Events#window.timeBatch(30 sec)
select id as meta_ID, Target_IP4 as target_ip
group by Target_IP4
insert into temp;
The result I would like is a single event for each target_ip, with the meta_ID value being the concatenation of the ids of the distinct events that form it.
The problem is that the previous query generates as many events as there are distinct meta_ID values. For example, I'm getting
"id_10", "target_1"
"id_11", "target_1"
And I would like to have
"id_10,id_11", "target_1"
I'm aware that some aggregation method is missing in my query. I saw a lot of aggregation functions in Siddhi, including the siddhi-execution-string extension which has the method str:concat, but I don't know how to use it to aggregate the meta_ID values. Any ideas?
You could write an execution plan as shown below, to achieve your requirement:
define stream inputStream (id string, target string);
-- Query 1
from inputStream#window.timeBatch(30 sec)
select *
insert into temp;
-- Query 2
from temp#custom:aggregator(id, target)
select *
insert into reducedStream;
Here, the custom:aggregator is the custom stream processor extension that you will have to implement. You can follow [1] when implementing it.
Let me explain a bit about how things work:
Query 1 generates a batch of events every 30 seconds. In other words, we use Query 1 for creating a batch of events.
So, at the end of every 30-second interval, the batch of events will be fed into the custom:aggregator stream processor. When input is received, the stream processor's process() method will be invoked.
#Override
protected void process(ComplexEventChunk<StreamEvent> streamEventChunk, Processor nextProcessor, StreamEventCloner streamEventCloner, ComplexEventPopulater complexEventPopulater) {
//implement the aggregation & grouping logic here
}
The batch of events is available in the streamEventChunk. When implementing the process() method, you can iterate over the streamEventChunk and create one output event per target. You will need to implement this logic in the process() method.
[1] https://docs.wso2.com/display/CEP420/Writing+a+Custom+Stream+Processor+Extension
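The grouping and concatenation that process() would apply to each 30-second batch can be sketched in plain Java (the class name and the {id, target} pair representation are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// For a batch of {id, target} events, emit one entry per target whose value
// is the comma-separated concatenation of the ids, preserving arrival order.
public class BatchAggregator {
    public static Map<String, String> aggregate(List<String[]> events) {
        Map<String, String> byTarget = new LinkedHashMap<>();
        for (String[] e : events) {
            String id = e[0], target = e[1];
            byTarget.merge(target, id, (acc, next) -> acc + "," + next);
        }
        return byTarget;
    }
}
```

For the sample batch in the question ("id_10"/"target_1" and "id_11"/"target_1"), this yields the single desired row "id_10,id_11" for "target_1"; inside the extension the same loop would run over the streamEventChunk instead of a List.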