Siddhi query with conditions within multiple occurrences - complex-event-processing

We can write a Siddhi query that matches a number of occurrences of events under some condition. For example, for 3 events with customerId 'xyz' and source 'log', we can use:
from every (e1 = CargoStream[e1.customerId == 'xyz' AND e1.source == 'log']<3>)
But what we need is to add conditions between these 3 events, something like: all three events should have the same source, whatever that value is.
from every (e1 = CargoStream[e1.customerId == 'xyz' AND all 3 events have the same source, whatever the value]<3>)
We tried a query that accesses the indexed events of the occurrences, but it does not seem to trigger events reliably:
from every (e1 = CargoStream[e1.customerId == 'xyz' AND (e1[0].source == e1[1].source AND e1[1].source == e1[2].source)]<3>)
Is this even possible with a Siddhi query? If yes, then how?

To enforce the same attribute value across the matched events, you can use partitions:
https://siddhi.io/en/v5.1/docs/query-guide/#partition
Also, look into this issue: https://github.com/siddhi-io/siddhi/issues/1425
The query would look like this:
define stream AuthenticationStream (ip string, type string);
@purge(enable='true', interval='15 sec', idle.period='2 min')
partition with (ip of AuthenticationStream)
begin
from every (e1=AuthenticationStream[type == 'FAILURE' ]<1:> ->
e2=AuthenticationStream[type == 'SUCCESS' ]) within 1 min
select e1[0].ip as ip, e1[3].ip as ip4
having not(ip4 is null)
insert into BreakIn
end;
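Adapted to the CargoStream case in the question (an untested sketch; the stream definition is assumed from the question): partitioning by source gives each distinct source value its own pattern-matching instance, so three matches within one partition necessarily share the same source.
define stream CargoStream (customerId string, source string);
partition with (source of CargoStream)
begin
from every (e1 = CargoStream[customerId == 'xyz']<3>)
select e1[0].customerId as customerId, e1[0].source as source
insert into SameSourceCargo
end;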

Related

Aggregation on updating order data in Druid

I have data streaming from Kafka into Druid. It's de-normalized eCommerce order event data, where the status and a few other fields are updated with every event.
I need to run aggregate queries based on timestamp using only the most recent entry per order.
For example, if the data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now if I want to query all orders stuck in "Initiated" status for more than 2 days, then for the above data it should only return orderId "abc".
But if I query something like:
SELECT orderId, qty, paymentId FROM "order" WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
this query returns both orders "123" and "abc", but "123" received a later event, so its earlier events should not be included in the result.
Is there any good and optimized way to handle this kind of scenario in Apache Druid?
One way I was thinking of is to keep a separate lookup table of orderId and latest status, and to join that lookup with the aggregation query above on orderId and status.
EDIT 1:
This query works, but it joins against the whole table, which can cause resource-limit exceptions for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But it gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more easily. If you map the status to a numeric representation, it becomes easier to work with.
First use an inline lookup to translate the status. See this page on how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now say we have, for example, these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
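For reference, a simple map lookup with those values might be defined like this (a sketch; how you register it depends on your cluster setup, see the linked docs):
{
  "type": "map",
  "map": {
    "Initiated": "1",
    "Shipped": "2"
  }
}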
Since we now have a numeric representation, you can simply do a GROUP BY query and take the max status number. A query like this should be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING clause makes sure that you only see orders whose latest status is Initiated.
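A possible alternative (an untested sketch, assuming a Druid version that ships the string LATEST aggregator, roughly 0.18 and later): LATEST picks the value from the row with the highest __time per group, so the lookup can be skipped entirely.
SELECT orderId
FROM inline_data
GROUP BY orderId
HAVING LATEST(status, 128) = 'Initiated'
AND MAX("__time") < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
Here 128 is an assumed upper bound on the status string's byte size.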
I hope this solves your problem.

Kafka kstream comparing two values from two different topics

I'm currently trying to send two different formats of the same event to two different topics; let's say format A to topic A and format B to topic B.
Format B is only sent approximately 15% of the time, since it isn't supported for older stuff, and whenever B is sent there is always an A equivalent of the same event.
What I want to do is listen to both at the same time and, if B exists, discard A.
What I've tried so far is listening to both (I'm using KStreams) and doing a stream-stream join:
streamA.leftJoin(streamB, (A_VALUE, B_VALUE) -> {
    if (B_VALUE != null && A_VALUE != null) {
        return B_VALUE; // a matching B arrived within the window, prefer it
    } else if (A_VALUE != null && B_VALUE == null) {
        return A_VALUE; // no matching B, keep the A event
    }
    return null;
},
JoinWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(15)),
Joined.with(
    Serdes.String(),   // key serde
    Serdes.String(),   // left (A) value serde
    Serdes.String()    // right (B) value serde
));
Running tests at a load between 50-200 events/s, I've seen results like this:
the number of B_VALUEs sent is always correct,
but the number of A_VALUEs is larger than expected.
I think it sometimes emits both A and B.
I've also tried using a Guava cache as a "hashmap with TTL", storing all the B events and then comparing that way. There I find that the total amount is always correct, but there are fewer B events than expected, meaning it sometimes does not find a match.
If there is a better way of doing this without using databases, I'm open to it!
Note: correlated events always have the same key, e.g. <432, A_VALUE>, <432, B_VALUE>.
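For context, a "hashmap with TTL" built on a Guava cache might look roughly like this (a hypothetical reconstruction, not the asker's actual code; the key handling, TTL, and forward() call are assumptions):
Cache<String, String> seenB = CacheBuilder.newBuilder()
    .expireAfterWrite(15, TimeUnit.MINUTES) // TTL roughly matching the join grace period
    .build();
// on every B event: remember it, then forward it
seenB.put(key, bValue);
// on every A event: forward only if no B with the same key was seen
if (seenB.getIfPresent(key) == null) {
    forward(aValue); // hypothetical downstream call
}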

PostgreSQL - How to get a string instead of NULL when there is no value in a left join?

I have 2 tables holding lists of workers and events.
When I try to list ALL the workers together with each worker's current unfinished event, I either get an event name or NULL. Instead of NULL I would like to get an empty string, because when I send the result back to HTML, "NULL" shows up on the screen.
I tried:
select workers.id, workers.name , events.name
from workers left join events on events.workerid = workers.id
where events.isfinished = false;
The result is:
1 Dave NULL
2 Charlie Event 2
3 Steve Event 3
My Tables
Workers
Id Name
-----------------
1 Dave
2 Charlie
3 Steve
Events
Id    Name       workerId    isFinished
------------------------------------------------------
1     Event 1    1           true
2     Event 2    2           false
3     Event 3    3           false
What should my SQL be in order to get an empty string (or a different value) instead of NULL?
Try adding COALESCE(), something like this:
SELECT workers.id, workers.name AS worker, COALESCE(events.name, '') AS event
FROM workers
LEFT JOIN events ON events.workerid = workers.id
AND events.isfinished = false -- filtering in the ON clause keeps workers that have no open event
Use COALESCE:
SELECT
    w.id,
    w.name AS workers_name,
    COALESCE(e.name, '') AS events_name -- or replace with some other string value
FROM workers w
LEFT JOIN events e
    ON e.workerid = w.id
    AND e.isfinished = false; -- keep this condition in the ON clause, not WHERE, so the join stays a true LEFT JOIN
Note that your current query selects two columns that are both called name. I aliased them to different names to keep the result set readable.
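With the sample tables above, the aliased query should produce something like this (Dave has no open event, so he gets the empty string):
id | workers_name | events_name
1  | Dave         |
2  | Charlie      | Event 2
3  | Steve        | Event 3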

PostgreSQL and comparing to an empty field

It seems that in PostgreSQL, empty_field != 1 (or some other value) does not evaluate to TRUE. If this is so, can somebody tell me how to compare against empty fields?
I have the following query, which translates to "select all posts in the user's group for which the user hasn't voted yet":
SELECT p.id, p.body, p.author_id, p.created_at
FROM posts p
LEFT OUTER JOIN votes v ON v.post_id = p.id
WHERE p.group_id = 1
AND v.user_id != 1
and it outputs nothing, even though the votes table is empty. Maybe there is something wrong with my query and not with the logic above?
Edit: it seems that changing v.user_id != 1 to v.user_id IS DISTINCT FROM 1 did the job.
From the PostgreSQL docs:
For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, when both inputs are null it will return false, and when just one input is null it will return true.
If you want to return rows where v.user_id is NULL then you need to handle that specially. One way you can fix it is to write:
AND COALESCE(v.user_id, 0) != 1
Another option is:
AND (v.user_id != 1 OR v.user_id IS NULL)
Edit: spacemonkey is correct that in PostgreSQL you should use IS DISTINCT FROM here.
NULL is an unknown value, so it can never equal anything. Look into using the COALESCE function.
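A quick illustration of the behavior (standard PostgreSQL; when the votes table is empty, the LEFT OUTER JOIN fills v.user_id with NULL, and the != test then filters those rows out):
SELECT NULL <> 1;               -- yields NULL, which WHERE treats as not true
SELECT NULL IS DISTINCT FROM 1; -- yields TRUE
SELECT COALESCE(NULL, 0) <> 1;  -- yields TRUE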

PostgreSQL and pl/pgsql SYNTAX to update fields based on SELECT and FUNCTION (while loop, DISTINCT COUNT)

I have a large database on which I want to run some logic to populate new fields.
The primary key of the table harvard_assignees is id.
The logic goes like this:
Select all of the records based on id.
For each record (WHILE): if state is NOT NULL and country is NULL, set country_out = 'US'; else set country_out = country.
I see step 1 as a PostgreSQL query and step 2 as a function. I'm just trying to figure out the easiest way to implement it natively, with the exact syntax.
====
The second task is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign keys (a bivariate key of pat_type, patent).
Count the records that contain that value (e.g., n=3 records have fkey "D","388585").
Update those n records to set percent = 1/n (e.g., UPDATE the 3 records, set percent = 1/3).
For the first one:
UPDATE harvard_assignees
SET country_out = (CASE
    WHEN state IS NOT NULL AND country IS NULL THEN 'US'
    ELSE country
END);
At first it had the condition "id = ..." but I removed that, because I believe you actually want to update all records.
And for the second one:
UPDATE example_table
SET percent = (
    SELECT 1.0 / cnt -- 1.0 forces numeric division; 1/cnt would be integer division and yield 0 for cnt > 1
    FROM (SELECT count(*) AS cnt FROM example_table AS x
          WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp
    WHERE cnt > 0
);
That one will be kind of slow, though.
I'm thinking of a solution based on window functions; you may want to explore those too.
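For reference, a window-function variant might look like this (an untested sketch; it pairs each row with its group count by joining on the system column ctid):
UPDATE example_table
SET percent = sub.pct
FROM (
    SELECT ctid,
           1.0 / COUNT(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS pct
    FROM example_table
) AS sub
WHERE example_table.ctid = sub.ctid;
This computes each group's count once per scan instead of running a correlated subquery per row.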