Druid SQL: filter on result of expression - druid

I have HTTP access log data in a Druid data source, and I want to see access patterns based on certain identifiers in the URL path. I wrote this query, and it works fine:
select regexp_extract(path, '/id/+([0-9]+)', 1) as "id",
sum("count") as "request_count"
from "access-logs"
where __time >= timestamp '2022-01-01'
group by 1
The only problem is that not all requests match that pattern, so I get one row in the result with an empty "id". I tried adding an extra condition in the where clause:
select regexp_extract(path, '/id/+([0-9]+)', 1) as "id",
sum("count") as "request_count"
from "access-logs"
where __time >= timestamp '2022-01-01' and "id" != ''
group by 1
But when I do that, I get this error message:
Error: Plan validation failed: org.apache.calcite.runtime.CalciteContextException:
From line 4, column 46 to line 4, column 49: Column 'id' not found in any table
So it doesn't let me reference the result of the expression in the where clause. I could of course just copy the entire regexp_extract expression, but is there a cleaner way of doing this?

Since "id" is an alias defined in the SELECT list of a grouped query, the WHERE clause cannot see it; you need a HAVING clause to filter on it.
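For example, a sketch of the HAVING form (depending on the Druid version, HAVING may not accept the column alias directly, so the expression is repeated):
select regexp_extract(path, '/id/+([0-9]+)', 1) as "id",
       sum("count") as "request_count"
from "access-logs"
where __time >= timestamp '2022-01-01'
group by 1
having regexp_extract(path, '/id/+([0-9]+)', 1) != ''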

Create rows from part of column names

Source data
I am working on an ELT project to load data from CSV files into PostgreSQL where I will transform it. The CSV files have many columns that are consistent across files, but also contain activity columns that are inconsistent with names like Date (05/19/2020), Type (05/19/2020), etc.
In the loading script I am merging all of the columns with dates in the column name into one jsonb column so I don't have to constantly add new columns to the raw data table.
The resulting jsonb column in the raw data table looks like this:
id       | activity
12345678 | {"Date (05/19/2020)": null, "Type (05/19/2020)": null, "Date (06/03/2020)": "06/01/2020", "Type (06/03/2020)": "E"}
98765432 | {"Date (05/19/2020)": "05/18/2020", "Type (05/19/2020)": "B", "Date (10/23/2020)": "10/26/2020", "Type (10/23/2020)": "T"}
JSON to columns
Using the amazing create_jsonb_flat_view function from this post I can convert the jsonb to columns like this:
id       | Date (05/19/2020) | Type (05/19/2020) | Date (06/03/2020) | Type (06/03/2020) | Date (10/23/2020) | Type (10/23/2020)
12345678 | null              | null              | 06/01/2020        | E                 | null              | null
98765432 | 05/18/2020        | B                 | null              | null              | 10/26/2020        | T
Need to move part of column name to row
Now, this is where I'm stuck. I need to remove the portion of the column name that is the Activity Date (e.g. (05/19/2020)) and create a row for each id and ActivityDate with additional columns for Date and Type like this:
id       | ActivityDate | Date       | Type
12345678 | 05/19/2020   | null       | null
12345678 | 06/03/2020   | 06/01/2020 | E
98765432 | 05/19/2020   | 05/18/2020 | B
98765432 | 10/23/2020   | 10/26/2020 | T
I followed your link to the create_jsonb_flat_view article yesterday and then forgot this question. While I thank you for pointing me there, I think that mentioning it worked against you.
A more conventional approach using regexp_replace() works here. I left the date values as strings, but you can convert them with to_date() if needed:
with parse as (
    select id, e.k, e.v,
           regexp_replace(e.k, '\s+\([0-9/]{10}\)', '') as k_no_date,
           regexp_replace(e.k, '^.+([0-9/]{10}).+', '\1') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;
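If you need real date columns rather than strings, here is a sketch of the same outer query converted with to_date(), assuming the MM/DD/YYYY format shown in the data:
select id,
       to_date(k_date_only, 'MM/DD/YYYY') as activity_date,
       to_date(min(v) filter (where k_no_date = 'Date'), 'MM/DD/YYYY') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;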
@Mike-Organek's answer works beautifully!
However, I was curious whether the regexp_replace() calls might be slowing the query down, and it seemed I could get the same results using a simpler function.
Since Mike gave me a great example to start with, I modified it to split on the space between Date and (05/19/2020).
For 20,000 rows, the average runtime on my local machine went from 7 seconds to 0.9 seconds.
Here is the resulting query:
with parse as (
    select id, e.k, e.v,
           split_part(e.k, ' ', 1) as k_no_date,
           trim(split_part(e.k, ' ', 2), '()') as k_date_only
    from rawinput
    cross join lateral jsonb_each_text(activity) as e(k, v)
)
select id,
       k_date_only as activity_date,
       min(v) filter (where k_no_date = 'Date') as date,
       min(v) filter (where k_no_date = 'Type') as type
from parse
group by id, k_date_only;

How to add pipeline().parameters to Lookup in Azure Data Factory?

I have a Lookup activity in Azure Data Factory.
I have a parameter "offset" with an initial value of 5.
I want to use the parameter value as an integer in the Lookup query, but it fails. Please advise.
Original Static Lookup Query:
SELECT *
FROM sales.[Customers]
ORDER BY CustomerId OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY
Parameterized Lookup Query:
SELECT *
FROM sales.[Customers]
ORDER BY CustomerId @concat('OFFSET ', pipeline().parameters.offset,' ROWS FETCH NEXT 10 ROWS ONLY')
Error of ADF for parameterized Lookup:
A database operation failed with the following error: 'Incorrect syntax near
'@concat'.',Source=,''Type=System.Data.SqlClient.SqlException,Message=Incorrect syntax near
'@concat'.,Source=.Net SqlClient Data
Provider,SqlErrorNumber=102,Class=15,ErrorCode=-2146232060,State=1,Errors=
[{Class=15,Number=102,State=1,Message=Incorrect syntax near '@concat'.,},],'
Put the entire SQL statement in an expression (using the Expression Builder):
@concat('SELECT * FROM sales.[Customers] ORDER BY CustomerId OFFSET ', pipeline().parameters.offset, ' ROWS FETCH NEXT 10 ROWS ONLY')
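Note that if offset is defined as an Int parameter, concat() expects string arguments, so you may need to convert it first, for example:
@concat('SELECT * FROM sales.[Customers] ORDER BY CustomerId OFFSET ', string(pipeline().parameters.offset), ' ROWS FETCH NEXT 10 ROWS ONLY')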
You can also reference the parameter directly with string interpolation:
SELECT *
FROM sales.[Customers]
ORDER BY CustomerId @{pipeline().parameters.offset} ROWS FETCH NEXT 10 ROWS ONLY

Aggregation on updating order data in Druid

I have data streaming from Kafka into Druid. It's de-normalized eCommerce order event data where the status and a few other fields get updated with every event.
I need to run aggregate queries based on timestamp, using only the most recent entry per order.
For example: If data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now if I want to query all orders stuck in "Initiated" status for more than 2 days, then for the above data it should only show orderId "abc".
But if I query something like
SELECT orderId, qty, paymentId FROM "order" WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
This query will return both orders "123" and "abc", but "123" received another event after those 2 days, so its earlier events should not be included in the result.
Is there a good, optimized way to handle this kind of scenario in Apache Druid?
One way I was thinking of handling this is a separate lookup table that stores orderId and the latest status, and a join between this lookup and the above aggregation query on orderId and status.
EDIT 1:
This query works, but it joins on the whole table, which can cause resource limit exceptions for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But it gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more simply. If you map the status to a numeric representation, it becomes easier to work with.
First use an inline lookup to replace the status. See this page for how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now, we have for example these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
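As a sketch, a static map lookup of that shape can be defined with a JSON spec like the following (the exact registration payload and endpoint depend on your Druid version; see the linked docs):
{
  "type": "map",
  "map": { "Initiated": "1", "Shipped": "2" }
}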
Since we now have a numeric representation, you can simply do a group by query and see the max status number. A query like this would be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING part makes sure that you only see orders which are Initiated.
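If you hit a planner error like the one in EDIT 2, one possible cause is that Druid's MAX aggregator accepts only numeric inputs, while LOOKUP returns a string. A variant (equally untested) casts the lookup result and repeats the aggregate in HAVING instead of referencing the alias:
SELECT
  inline_data.orderId,
  MAX(CAST(LOOKUP(status, 'status_as_number') AS BIGINT)) AS last_status
FROM inline_data
WHERE inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING MAX(CAST(LOOKUP(status, 'status_as_number') AS BIGINT)) = 1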
I hope this solves your problem.

If I can't use aggregate in a where clause, how to get results

OK, I have a query where I need to omit the result if the first value of an array_agg is 'natural', so I thought I could do this:
select
visitor_id,
array_agg(code
order by session_start) codes_array
from mark_conversion_sessions
where conv_visit_num2 < 2
and max_conv = 1
and (array_agg(code
order by session_start))[1] != 'natural'
group by visitor_id
But when I run this I get the error:
ERROR: aggregate functions are not allowed in WHERE
LINE 31: and (array_agg(code
So is there a way I can reference that array_agg in the where clause?
Thank you
The having clause acts like a where clause on grouped data. Move the criteria that use aggregates into the having clause, e.g.:
select
visitor_id,
array_agg(code order by session_start) codes_array
from mark_conversion_sessions
where
conv_visit_num2 < 2
and max_conv = 1
group by visitor_id
having
(array_agg(code order by session_start))[1] != 'natural'
docs:
https://www.postgresql.org/docs/9.6/static/tutorial-agg.html
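An equivalent alternative is to do the aggregation in a subquery and filter in an outer where clause; a sketch against the same table:
select visitor_id, codes_array
from (
  select visitor_id,
         array_agg(code order by session_start) as codes_array
  from mark_conversion_sessions
  where conv_visit_num2 < 2
    and max_conv = 1
  group by visitor_id
) t
where codes_array[1] != 'natural';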

Have to find a count of rows from my postgresql query

Hi, I am trying to get a count of rows from the query below:
select count(substring(wsresult_question FROM '[0-9]+') as pumporder) AS totals,
job_id,
job_siteid,
job_completed
from webserviceresults w, jobs s
where job_siteid = '1401'
and job_id = wsresult_jobid
and job_completed is not null
and wsresult_question LIKE 'job.job_site_data.site_meters.pump.%'
and wsresult_category = 'Job'
group by pumporder,job_id,job_siteid,job_completed order by job_completed desc
I tried this and got an error like:
There was an SQL error:
ERROR: syntax error at or near "as" LINE 1: ... count(substring(wsresult_question FROM '[0-9]+') as pumpord... ^
In the expression substring(wsresult_question FROM '[0-9]+') as pumporder I am just trying to extract the number from some concatenated strings. The concatenated strings look like:
1.job.job_site_data.site_meters.pump.0.meter_calibration_record.meter_adjustedtofast
2.job.job_site_data.site_meters.pump.0.meter_calibration_record.meter_adjustedtoslow
3.job.job_site_data.site_meters.pump.1.meter_calibration_record.meter_adjustedtofast
So substring(wsresult_question FROM '[0-9]+') as pumporder returns numbers like 0 and 1. I now need the total count of rows, so kindly help me with this.
Please let me know if you have any questions.
Thanks in advance!
Your error means you cannot put an alias inside the function call; an alias can only be attached to the output column as a whole. If you remove as pumporder from inside count(substring(wsresult_question FROM '[0-9]+') as pumporder), the error will go away.
Your approach, though, is very doubtful. If you want to count the number of rows matching substring(wsresult_question FROM '[0-9]+'), you would do better with this instead:
select count(1) AS totals,
       -- pumporder must appear in the select list so the group by alias resolves
       substring(wsresult_question FROM '[0-9]+') as pumporder,
       job_id,
       job_siteid,
       job_completed
from webserviceresults w, jobs s
where job_siteid = '1401'
  and job_id = wsresult_jobid
  and job_completed is not null
  and wsresult_question ~ '^(job.job_site_data.site_meters.pump.)[0-9]'
  and wsresult_category = 'Job'
group by pumporder, job_id, job_siteid, job_completed
order by job_completed desc
And lastly, the string job.job_site_data.site_meters.pump.0 looks like a JSON path, so it would be more appropriate to use a JSON array length function rather than counting rows.
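For illustration, if the original documents were stored in a jsonb column (here a hypothetical doc column in a hypothetical jobs_json table), something like this would count the pumps directly:
select jsonb_array_length(doc -> 'job' -> 'job_site_data' -> 'site_meters' -> 'pump') as pump_count
from jobs_json;  -- hypothetical table holding the source documents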