We can write a Siddhi query that matches a number of occurrences of an event with some condition.
For example, to match 3 events with customerId 'xyz' and source 'log', we can use:
from every (e1 = CargoStream[e1.customerId == 'xyz' AND e1.source == 'log']<3>)
But what we need is to add conditions between these 3 events.
Something like: all three events should have the same source, whatever its value is.
from every (e1 = CargoStream[e1.customerId == 'xyz' AND all 3 events have the same source, whatever the value]<3>)
We tried a query that accesses the indexed events of the occurrences, but it does not seem to trigger events reliably:
from every (e1 = CargoStream[e1.customerId == 'xyz' AND (e1[0].source == e1[1].source AND e1[1].source == e1[2].source)]<3>)
Is this even possible with a Siddhi query? If yes, then how?
To apply the same condition across the events in your question, you can use partitions:
https://siddhi.io/en/v5.1/docs/query-guide/#partition
Also look into this issue: https://github.com/siddhi-io/siddhi/issues/1425
The query would look like this:
define stream AuthenticationStream (ip string, type string);

@purge(enable='true', interval='15 sec', idle.period='2 min')
partition with (ip of AuthenticationStream)
begin
    from every (e1=AuthenticationStream[type == 'FAILURE']<1:> ->
                e2=AuthenticationStream[type == 'SUCCESS']) within 1 min
    select e1[0].ip as ip, e1[3].ip as ip4
    having not(ip4 is null)
    insert into BreakIn
end;
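An untested sketch of how the same idea could map back onto the CargoStream example: because the stream is partitioned by source, every event matched inside a partition carries the same source value, whatever it is, so the three occurrences automatically agree on it (the MatchedCargo output stream is a placeholder name):

@purge(enable='true', interval='15 sec', idle.period='2 min')
partition with (source of CargoStream)
begin
    from every (e1 = CargoStream[e1.customerId == 'xyz']<3>)
    select e1[0].customerId as customerId, e1[0].source as source
    insert into MatchedCargo
end;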
I have data streaming from Kafka into Druid. It's de-normalized eCommerce order event data, where the status and a few other fields get updated with every event.
I need to run an aggregate query, based on timestamp, against only the most recent entry per order.
For example, if the data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now, if I want to query for all orders stuck in the "Initiated" status for more than 2 days, then for the above data it should only return orderId "abc".
But if I query something like
SELECT orderId, qty, paymentId FROM "order" WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
this query returns both orders "123" and "abc", but "123" has a later event (received within the last 2 days), so its earlier events should not be included in the result.
Is there any good, optimized way to handle this kind of scenario in Apache Druid?
One way I was thinking of is to use a separate lookup table to store each orderId with its latest status, and to join that lookup with the aggregation query above on orderId and status.
EDIT 1:
This query works, but it joins against the whole table, which can hit resource-limit exceptions for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
I tried:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But it gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more easily. If you replace the status with a numeric representation, it becomes simpler to work with.
First, use an inline lookup to replace the status. See this page for how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now suppose we have these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
Since we now have a numeric representation, you can simply do a GROUP BY query and take the max status number. A query like this should be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING clause makes sure that you only see orders whose latest status is Initiated.
I hope this solves your problem.
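If your Druid version supports it, the LATEST aggregator may be another option worth testing (also untested): it takes the status value with the latest __time per order, with no lookup needed. The 128 is an assumed maximum byte size for the aggregated string:

SELECT
  orderId,
  LATEST(status, 128) AS last_status
FROM inline_data
WHERE "__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY orderId
HAVING last_status = 'Initiated'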
I'm currently trying to send two different formats of the same event on two different topics. Let's say format A goes to topic A and format B to topic B.
Format B is only sent roughly 15% of the time, since it isn't supported for older stuff. Whenever a B is sent, there will also be an A equivalent of the same event.
What I want to do is listen to both at the same time, and if a B exists, discard the corresponding A.
What I've tried so far is to consume both (I'm using Kafka Streams) and do a stream-stream join:
streamA.leftJoin(streamB, (A_VALUE, B_VALUE) -> {
        // Both sides arrived within the window: keep the B format.
        if (B_VALUE != null && A_VALUE != null) {
            return B_VALUE;
        } else if (A_VALUE != null && B_VALUE == null) {
            return A_VALUE;
        }
        return null;
    },
    JoinWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(15)),
    Joined.with(
        Serdes.String(), // key serde
        Serdes.String(), // left (A) value serde
        Serdes.String()  // right (B) value serde
    ));
Running tests at a load of 50-200 events/s, I've seen results like this:
the number of B_VALUEs emitted is always correct,
but the number of A_VALUEs is larger than expected.
I think it sometimes emits both the A and the B for the same event.
I've also tried using a Guava cache as a "hashmap with TTL", storing all the B events and comparing against them that way (see the sketch below). There I find that the total count is always correct, but there are fewer B events than expected, meaning it sometimes fails to find a match.
If there is a better way of doing this without using databases, I'm open to it!
Note: correlated events always have the same key, e.g. <432, A_VALUE>, <432, B_VALUE>.
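For reference, a minimal sketch of the Guava-cache variant described above; it assumes string keys and values, a single Streams instance (the cache is process-local), and a hypothetical output topic name. The race where an A arrives before its B would explain the missed matches:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// Remember recently seen B events by key; entries expire after a TTL
// roughly covering the join window plus grace (20 min here).
Cache<String, String> recentB = CacheBuilder.newBuilder()
        .expireAfterWrite(20, TimeUnit.MINUTES)
        .build();

// Record every B event as it arrives.
streamB.peek((key, bValue) -> recentB.put(key, bValue));

// Drop an A whenever a B with the same key was already seen.
// Caveat: if the A arrives before its B, the A passes through anyway.
streamA.filterNot((key, aValue) -> recentB.getIfPresent(key) != null)
       .to("output-a"); // hypothetical output topic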
I have 2 tables holding lists of workers and events.
When I try to fetch the list of ALL the workers together with the current event that is still not finished, I get either an event name or NULL. Instead of NULL I would like to get an empty string, because when I send the result back to HTML, I get "NULL" on the screen.
I tried:
select workers.id, workers.name , events.name
from workers left join events on events.workerid = workers.id
where events.isfinished = false;
The result is:
1 Dave NULL
2 Charlie Event 2
3 Steve Event 3
My Tables
Workers
Id Name
-----------------
1 Dave
2 Charlie
3 Steve
Events
Id    Description    workerId    isFinished
------------------------------------------------------
1 Event 1 1 true
2 Event 2 2 false
3 Event 3 3 false
What should my SQL be in order to get an empty string (or a different value) instead of NULL?
Try adding COALESCE(), something like this:
SELECT workers.id, workers.name AS worker, COALESCE(events.name, '') AS event
FROM workers
LEFT JOIN events ON events.workerid = workers.id
WHERE events.isfinished = false
Use COALESCE:
SELECT
w.id,
w.name AS workers_name,
COALESCE(e.name, '') AS events_name -- or replace with some other string value
FROM workers w
LEFT JOIN events e
ON e.workerid = w.id
WHERE
e.isfinished = false;
Note that your current query is selecting two columns both of which are called name. I aliased them to different things, to keep the result set readable.
It seems that in PostgreSQL, empty_field != 1 (or any other value) behaves as FALSE, because comparing NULL with anything yields NULL. If this is true, can somebody tell me how to compare against empty (NULL) fields?
I have the following query, which translates to "select all posts in the user's group that the user hasn't voted on yet":
SELECT p.id, p.body, p.author_id, p.created_at
FROM posts p
LEFT OUTER JOIN votes v ON v.post_id = p.id
WHERE p.group_id = 1
AND v.user_id != 1
and it outputs nothing, even though the votes table is empty. Maybe there is something wrong with my query rather than with the logic above?
Edit: it seems that changing v.user_id != 1 to v.user_id IS DISTINCT FROM 1 did the job.
From the PostgreSQL docs:
For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, when both inputs are null it will return false, and when just one input is null it will return true.
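For reference, the corrected query from the edit would read:

SELECT p.id, p.body, p.author_id, p.created_at
FROM posts p
LEFT OUTER JOIN votes v ON v.post_id = p.id
WHERE p.group_id = 1
AND v.user_id IS DISTINCT FROM 1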
If you want to return rows where v.user_id is NULL then you need to handle that specially. One way you can fix it is to write:
AND COALESCE(v.user_id, 0) != 1
Another option is:
AND (v.user_id != 1 OR v.user_id IS NULL)
Edit: spacemonkey is correct that in PostgreSQL you should use IS DISTINCT FROM here.
NULL is an unknown value, so it can never equal anything. Look into using the COALESCE function.
I have a large database in which I want to apply some logic to populate new fields.
The primary key is id for the table harvard_assignees.
The logic goes like this:
1. Select all of the records based on id.
2. For each record (WHILE): if (state IS NOT NULL AND country IS NULL), update country_out = 'US'; ELSE update country_out = country.
I see step 1 as a PostgreSQL query and step 2 as a function. I'm just trying to figure out the easiest way to implement this natively, with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
1. Find all DISTINCT foreign keys (a bivariate key of pat_type, patent).
2. Count the records that contain that value (e.g., n=3 records have the fkey "D", "388585").
3. Update those n records to set percent = 1/n (e.g., UPDATE the 3 records, SET percent = 1/3).
For the first one:
UPDATE harvard_assignees
SET country_out = (CASE
    WHEN state IS NOT NULL AND country IS NULL THEN 'US'
    ELSE country
END);
At first it had the condition "id = ...", but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE example_table
SET percent = (
    SELECT 1.0 / cnt  -- 1.0 rather than 1: integer division would always give 0
    FROM (
        SELECT count(*) AS cnt
        FROM example_table AS x
        WHERE x.fn_key_1 = example_table.fn_key_1
          AND x.fn_key_2 = example_table.fn_key_2
    ) AS tmp
    WHERE cnt > 0
);
That one will be kind of slow, though.
I'm thinking of a solution based on window functions; you may want to explore those too (see the sketch below).
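A minimal, untested sketch of that window-function idea. PostgreSQL doesn't allow window functions directly in an UPDATE's SET clause, so the counts are computed in a subquery and joined back via the system column ctid:

UPDATE example_table AS t
SET percent = 1.0 / w.cnt
FROM (
    SELECT ctid,
           count(*) OVER (PARTITION BY fn_key_1, fn_key_2) AS cnt
    FROM example_table
) AS w
WHERE t.ctid = w.ctid;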