How to do pattern matching using match_recognize in Esper for unknown properties of events in event stream? - apache-kafka

I am new to Esper and I am trying to filter the events properties from event streams having multiple events coming with high velocity.
I am using Kafka to send row by row of CSV from producer to consumer and at consumer I am converting those rows to HashMap and creating Events at run-time in Esper.
For example I have events listed below which are coming every 1 second
WeatherEvent Stream:
E1 = {hum=51.0, precipi=1, precipm=1, tempi=57.9, icon=clear, pickup_datetime=2016-09-26 02:51:00, tempm=14.4, thunder=0, windchilli=, wgusti=, pressurei=30.18, windchillm=}
E2 = {hum=51.5, precipi=1, precipm=1, tempi=58.9, icon=clear, pickup_datetime=2016-09-26 02:55:00, tempm=14.5, thunder=0, windchilli=, wgusti=, pressurei=31.18, windchillm=}
E3 = {hum=52, precipi=1, precipm=1, tempi=59.9, icon=clear, pickup_datetime=2016-09-26 02:59:00, tempm=14.6, thunder=0, windchilli=, wgusti=, pressurei=32.18, windchillm=}#
Where E1, E2...EN are multiple events in WeatherEvent
In the above events I just want to filter out properties like hum, tempi, tempm and presssurei because they are changing as the time proceeds ( during 4 secs) and dont want to care about the properties which are not changing at all or are changing really slowly.
Using below EPL query I am able to filter out the properties like temp, hum etc
#Name('Out') select * from weatherEvent.win:time(10 sec)
match_recognize (
partition by pickup_datetime?
measures A.tempm? as a_temp, B.tempm? as b_temp
pattern (A B)
define
B as Math.abs(B.tempm? - A.tempm?) > 0
)
The problem is I can only do it when I specify tempm or hum in the query for pattern matching.
But as the data is coming from CSV and it has high-dimensions or features so I dont know what are the properties of Events before-hand.
I want Esper to automatically detects features/properties (during run-time) which are changing and filter it out, without me specifying properties of events.
Any Ideas how to do it? Is that even possible with Esper? If not, can I do it with other CEP engines like Siddhi, OracleCEP?

You may add a "?" to the event property name to get the value of those properties that are not known at the time the event type is defined. This is called dynamic property see documentation . The type returned is Object so you need to downcast.

Related

Apache beam: SQL aggregation outputs no results for Unbounded/Bounded join

I am working on an apache beam pipeline to run a SQL aggregation function.Reference: https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlDslJoinTest.java#L159.
The example here works fine.However, when I replace the source with an actual unbounded source and do an aggregation, I see no results.
Steps in my pipeline:
Read bounded data from a source and convert to collection of rows.
Read unbounded json data from a websocket source.
Assign timestamp to the every source stream via a DoFn.
Convert the unbounded json to unbounded row collection
Apply a window on the row collection
Apply a SQL statement.
Output the result of the sql.
A normal SQL statement executes and outputs the results. However, when I use a group by in the SQL, there is no output.
SELECT
o1.detectedCount,
o1.sensor se,
o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
The results are continous and like shown below.
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":1,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
2019-07-19 20:43:11 INFO ConsoleSink:27 - {
"detectedCount":0,
"se":"3a002f000647363432323230",
"sa":"3a002f000647363432323230"
}
The results don't show up at all when I change the sql to
SELECT
COUNT(o1.detectedCount) o2.sensor sa
FROM SENSOR o1
LEFT JOIN AREA o2
on o1.sensor = o2.sensor
GROUP BY o2.sensor
Is there anything I am doing wrong in this implementation.Any pointers would be really helpful.
Some suggestions come up when reading your code:
Extend the window, to allow lateness, and to emit early arrived data.
.apply("windowing", Window.<Row>into(FixedWindows.of(Duration.standardSeconds(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(1)))
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(2))))
.withAllowedLateness(Duration.standardMinutes(10))
.discardingFiredPanes());
Try to remove the join and check if without it you have output to the window,
Try to add more time to the window. because sometimes it is too short to shuffle the data between the workers. and the joined streams aren't emitted at the same time.
outputWithTimestamp will output the rows in a different timestamp, and then they can be dropped when you don't allow lateness.
Read the docs for outputWithTimestamp, this API is a bit risky.
If the input {#link PCollection} elements have timestamps, the output
timestamp for each element must not be before the input element's
timestamp minus the value of {#link getAllowedTimestampSkew()}. If an
output timestamp is before this time, the transform will throw an
{#link IllegalArgumentException} when executed. Use {#link
withAllowedTimestampSkew(Duration)} to update the allowed skew.
CAUTION: Use of {#link #withAllowedTimestampSkew(Duration)} permits
elements to be emitted behind the watermark. These elements are
considered late, and if behind the {#link
Window#withAllowedLateness(Duration) allowed lateness} of a downstream
{#link PCollection} may be silently dropped.
SELECT
COUNT(o1.detectedCount) as number
,o2.sensor
,sa
FROM SENSOR o1
LEFT OUTER JOIN AREA o2
on o1.sensor = o2.sensor
GROUP BY sa,o1.sensor,o2.sensor

Is it possible to Group By a field extracted from JSON input in a Siddhi Query?

I currently have an stream, with 1 string attribute that contains a Json event.
This stream receives different events, which I want to apply Json path expressions so I can use those attributes on filters and functions.
JsonPath extractors work like a charm on filters and selectors, unfortunately, I am not being able to use them for the 'Group By' part.
I am actually doing it in an embedded Siddhi App with siddhi-execution-json extension added manually, but for the discussion, so everybody can
easily check and test it, I will paste an example app that works on WSO2 Stream Processor.
The objective looks like the following App:
#App:name("Group_by_json_attribute")
define stream JsonStream(json string);
#sink(type='log')
define stream LogStream(myField string, count long);
#info(name='query1')
from JsonStream#window.time(10 sec)
select json:getString(json, '$.myField') as myField, count() as count
group by myField having count > 1
insert into LogStream;
and it can accept the following events:
{"myField": "my_value"}
However, this query will raise the error:
Cannot find attribute type as 'myField' does not exist in 'JsonStream'; define stream JsonStream(json string)
I have also tried to use directly the Json extractor at 'Group by':
group by json:getString(json, '$.myField') as myField having count > 1
However the error now is:
mismatched input ':' expecting {',', ORDER, LIMIT, OFFSET, HAVING, INSERT, DELETE, UPDATE, RETURN, OUTPUT}
which seems to not be expecting to use an extension here
I am just wondering, if it is possible to group by attributes not directly defined in the input stream. In this case is a field extracted from a JSON object, but it could be any other function that generates another attribute.
I am also using versions from maven central repository
Siddhi: io.siddhi:siddhi-core:5.0.1
siddhi-execution-json: io.siddhi.extension.execution.json:siddhi-execution-json:2.0.1
(Edit) Clarification
The objective is, to use attributes not directly defined in the Stream, to be used on the Group By.
The reason why is, I currently have an embedded app which defines the whole set of input streams coming from external sources formatted as JSON, and there are also a set of output streams to inform external components when a query matches.
This app allows users to create custom queries on this set of predefined Streams, but they are not able to create Streams by their own.
Many thanks!
It seems we are expecting the group by fields from the query input stream, in this case, JsonStream. Use another query before this for extraction and the aggregation and filtering in the following query,
#App:name("Group_by_json_attribute")
define stream JsonStream(json string);
#sink(type='log')
define stream LogStream(myField string, count long);
#info(name='extract_stream')
from JsonStream
select json:getString(json, '$.myField') as myField
insert into ExtractedStream;
#info(name='query1')
from ExtractedStream#window.time(10 sec)
select myField, count() as count
group by myField
having count > 1
insert into LogStream;

Flink Table/SQL API: modify rowtime attribute after session window aggregation

I want to use Session window aggregation and then run Tumble window aggregation on top of the produced result in Table API/Flink SQL.
Is it possible to modify rowtime attribute after first session aggregation to have it equal a .rowtime of the last observed event in a session?
I'm trying to do something like this:
table
.window(Session withGap 2.minutes on 'rowtime as 'w)
.groupBy('w, 'userId)
.select(
'userId,
('w.end.cast(Types.LONG) - 'w.start.cast(Types.LONG)) as 'sessionDuration,
('w.rowtime - 2.minutes) as 'rowtime
)
.window(Tumble over 5.minutes on 'rowtime as 'w)
.groupBy('w)
.select(
'w.start,
'w.end,
'sessionDuration.avg as 'avgSession,
'sessionDuration.count as 'numberOfSession
)
The key part is:
('w.rowtime - 2.minutes) as 'rowtime
So I want to re-assign to a record the .rowtime of the latest event in the session, without the session gap (2.minutes in this example).
This works fine in BatchTable, however it doesn't work in StreamTable:
Exception in thread "main" org.apache.flink.table.api.ValidationException: TumblingGroupWindow('w, 'rowtime, 300000.millis) is invalid: Tumbling window expects a time attribute for grouping in a stream environment.
Yeah, I know, it feels like I wan't to invent a time machine and change the order of time. But is it actually possible to somehow achieve described behaviour?
No, unfortunately, you cannot do that with SQL or the Table API in the current version (1.6.0). As soon as you modify a time attribute (rowtime or proctime), it becomes a regular TIMESTAMP attribute and loses its special time characteristics.
For rowtime attributes the reason is that we cannot guarantee that the timestamp is still aligned with the watermarks. In principle, we could delay the watermarks by the subtracted time interval, but this is not supported yet.

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
....and a lot of more fields. As you can imagine the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just with the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the csv (csvInputAdapter.Start() returns) I read all events, which are stored in the statements NewEvents-Stream.
Using 10 Entries in the CSV-File everything works fine. Using 1 Million lines it takes way to long. I tried without EPL-Statement (so just the CSV import) - it took about 5sec. With the first statement (not the complex pattern statement) I always stop after 20 minutes - so I am not sure how long it would take.
Then I changed my EPL of the first statement: I introduce a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, your EPL design is probably a little inefficient. You would want to understand how patterns work, by using filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use CSV file for large stuff. Make sure a listener is attached.

Kafka Streams - adding message frequency in enriched stream

From a stream (k,v), I want to calculate a stream (k, (v,f)) where f is the frequency of the occurrences of a given key in the last n seconds.
Give a topic (t1) if I use a windowed table to calculate the frequency:
KTable<Windowed<Integer>,Long> t1_velocity_table = t1_stream.groupByKey().windowedBy(TimeWindows.of(n*1000)).count();
This will give a windowed table with the frequency of each key.
Assuming I won’t be able to join with a Windowed key, instead of the table above I am mapping the stream to a table with simple key:
t1_Stream.groupByKey()
.windowedBy(TimeWindows.of( n*1000)).count()
.toStream().map((k,v)->new KeyValue<>(k.key(), Math.toIntExact(v))).to(frequency_topic);
KTable<Integer,Integer> t1_frequency_table = builder.table(frequency_topic);
If I now lookup in this table when a new key arrives in my stream, how do I know if this lookup table will be updated first or the join will occur first (which will cause the stale frequency to be added in the record rather that the current updated one). Will it be better to create a stream instead of table and then do a windowed join ?
I want to lookup the table with something like this:
KStream<Integer,Tuple<Integer,Integer>> t1_enriched = t1_Stream.join(t1_frequency_table, (l,r) -> new Tuple<>(l, r));
So instead of having just a stream of (k,v) I have a stream of (k,(v,f)) where f is the frequency of key k in the last n seconds.
Any thoughts on what would be the right way to achieve this ? Thanks.
For the particular program you shared, the stream side record will be processed first. The reason is, that you pipe the data through a topic...
When the record is processed, it will update the aggregation result that will emit an update record that is written to the through-topic. Directly afterwards, the record will be processed by the join operator. Only afterwards a new poll() call will eventually read the aggregation result from the through-topic and update the table side of the join.
Using the DSL, it seems not to be possible for achieve what you want. However, you can write a custom Transformer that re-implements the stream-table join that provides the semantics you need.