Getting a BIT value as to whether a non-Aggregated Value is in a constant list - tsql

Given the following tables:
Logs
-Id
-SuperLogId
-ScanMode
SuperLogs
-Id
-Name
How do I write the following query so that it returns 1 for each SuperLog that has a Log with a ScanMode of "Navigational", "Locatable" or "Automatic", and 0 otherwise?
SELECT Name,
       CAST(CASE WHEN Logs.ScanMode IN ('Navigational', 'Locatable', 'Automatic')
                 THEN 1
                 ELSE 0
            END AS BIT) AS HasLogsWithGpsData
FROM SuperLogs
INNER JOIN Logs ON Logs.SuperLogId = SuperLogs.Id
GROUP BY SuperLogs.Id
The above query just gives me this error instead of working:
Column 'Logs.ScanMode' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I want to check whether any of the ScanModes are in that list, not some accumulated version of them, so I don't want to pre-aggregate these values.

The Logs and SuperLogs tables have a one-to-many relationship: for each record in SuperLogs there can be many records in Logs. This is why your query doesn't make sense and can't work as is.
The way I understand the question is this: if any Log that belongs to the current SuperLog has a ScanMode that is one of the values in the list, you want to get 1; if none of them does, you want to get 0.
If that is correct, a simple solution would be conditional aggregation:
SELECT Name,
       CAST(MAX(CASE WHEN Logs.ScanMode IN ('Navigational', 'Locatable', 'Automatic')
                     THEN 1
                     ELSE 0
                END) AS BIT) AS HasLogsWithGpsData
FROM SuperLogs
INNER JOIN Logs ON Logs.SuperLogId = SuperLogs.Id
GROUP BY SuperLogs.Id, Name -- Name must also be in the GROUP BY, or it would raise the same class of error
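Since you mentioned wanting to avoid aggregation: an equivalent formulation (a sketch against the same schema, not part of the original answer) checks each SuperLog with EXISTS instead:
-- One row per SuperLog; 1 if any of its Logs has a matching ScanMode
SELECT SuperLogs.Name,
       CAST(CASE WHEN EXISTS
                 (
                     SELECT 1
                     FROM Logs
                     WHERE Logs.SuperLogId = SuperLogs.Id
                       AND Logs.ScanMode IN ('Navigational', 'Locatable', 'Automatic')
                 )
                 THEN 1 ELSE 0 END AS BIT) AS HasLogsWithGpsData
FROM SuperLogs
One behavioral difference: this also returns SuperLogs that have no Logs at all (with 0), which the INNER JOIN version excludes.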

Related

PostgreSQL: json object where keys are unique array elements and values are the count of times they appear in the array

I have an array of strings, some of which may be repeated. I am trying to build a query which returns a single json object where the keys are the distinct values in the array, and the values are the count of times each value appears in the array.
I have built the following query:
WITH items (item) AS (SELECT UNNEST(ARRAY['a','b','c','a','a','a','c']))
SELECT json_object_agg(distinct_values, counts) AS item_counts
FROM (
    SELECT sub2.distinct_values,
           COUNT(items.item) AS counts
    FROM (
        SELECT DISTINCT items.item AS distinct_values
        FROM items
    ) sub2
    JOIN items ON items.item = sub2.distinct_values
    GROUP BY sub2.distinct_values, items.item
) sub1
This provides the result I'm looking for: { "a" : 4, "b" : 1, "c" : 2 }
However, it feels like there's probably a better / more elegant / less verbose way of achieving the same thing, so I wondered if anyone could point me in the right direction.
For context, I would like to use this as part of a bigger more complex query, but I didn't want to complicate the question with irrelevant details. The array of strings is what one column of the query currently returns, and I would like to convert it into this JSON blob. If it's easier and quicker to do it in code then I can, but I wanted to see if there was an easy way to do it in postgres first.
I think a CTE and json_object_agg() give you a bit of a shortcut to get there:
WITH counter AS (
    SELECT item, COUNT(*) AS item_count
    FROM UNNEST(ARRAY['a','b','c','a','a','a','c']) AS item
    GROUP BY 1
    ORDER BY 1
)
SELECT json_object_agg(item, item_count) FROM counter
Output:
{"a":4,"b":1,"c":2}

Aggregation on updating order data in Druid

I have data streaming through Kafka into Druid. It's de-normalized eCommerce order event data, where the status and a few other fields get updated with every event.
I need to run aggregate queries based on timestamp, using only the most recent entry per order.
For example: If data sample is:
{"orderId":"123","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:02:33Z"}
{"orderId":"abc","status":"Initiated","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-05T01:03:33Z"}
{"orderId":"123","status":"Shipped","items":"item","qty":1,"paymentId":null,"shipmentId":null,timestamp:"2021-03-07T02:03:33Z"}
Now if I want to query all orders stuck in "Initiated" status for more than 2 days, then for the above data it should only show orderId "abc".
But if I query something like
SELECT orderId, qty, paymentId FROM order WHERE status = 'Initiated' AND "timestamp" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
this query will return both orders "123" and "abc", even though "123" received another event after those 2 days, so its earlier events should not be included in the result.
Is there any good and optimized way to handle this kind of scenario in Apache Druid?
One way I was thinking of is to use a separate lookup table to store the orderId and latest status, and to join that lookup with the above aggregation query on orderId and status.
EDIT 1:
This query works, but it joins against the whole table, which can hit a resource limit exception for big datasets:
WITH maxOrderTime (orderId, "__time") AS
(
SELECT orderId, max("__time") FROM inline_data
GROUP BY orderId
)
SELECT inline_data.orderId FROM inline_data
JOIN maxOrderTime
ON inline_data.orderId = maxOrderTime.orderId
AND inline_data."__time" = maxOrderTime."__time"
WHERE inline_data.status='Initiated' and inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
EDIT 2:
Tried with:
SELECT
inline_data.orderID,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderID
HAVING last_status = 1
But gives this error:
Error: Unknown exception
Error while applying rule DruidQueryRule(AGGREGATE), args
[rel#1853:LogicalAggregate.NONE.,
rel#1863:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"inline_data"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/2021-03-14T09:57:05.000Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"lookup("status",'status_as_number')","outputType":"STRING"}],"resultFormat":"compactedList","batchSize":20480,"order":"none","filter":null,"columns":["orderID","v0"],"legacy":false,"context":{"sqlOuterLimit":100,"sqlQueryId":"fbc167be-48fc-4863-b3a8-b8a7c45fb60f"},"descending":false,"granularity":{"type":"all"}},signature={orderID:LONG,
v0:STRING})]
java.lang.RuntimeException
I think this can be done more easily. If you map each status to a numeric representation, it becomes much simpler to work with.
First use a lookup to translate the status. See this page for how to define a lookup: https://druid.apache.org/docs/0.20.1/querying/lookups.html
Now suppose we have, for example, these values in a lookup named status_as_number:
Initiated = 1
Shipped = 2
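For reference, such a static mapping can be expressed as a simple map lookup; a sketch of the extractor JSON (field names per the linked docs; the exact registration payload depends on your setup, and note that map lookup values are strings):
{
  "type": "map",
  "map": {
    "Initiated": "1",
    "Shipped": "2"
  }
}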
Since we now have a numeric representation, you can simply do a GROUP BY query and take the max status number per order. A query like this should be sufficient:
SELECT
inline_data.orderId,
MAX(LOOKUP(status, 'status_as_number')) as last_status
FROM inline_data
WHERE
inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING last_status = 1
Note: this query is not tested. The HAVING clause makes sure that you only see orders whose highest status number is 1, i.e. orders that never progressed past Initiated.
I hope this solves your problem.
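A follow-up on the error shown in EDIT 2: LOOKUP returns a string, and Druid's MAX aggregator only accepts numeric input, which matches the planning failure above. Casting the lookup result may get around it; an untested sketch:
SELECT
    inline_data.orderId,
    -- CAST the string that LOOKUP returns so MAX can aggregate it numerically
    MAX(CAST(LOOKUP(status, 'status_as_number') AS BIGINT)) AS last_status
FROM inline_data
WHERE inline_data."__time" < TIMESTAMPADD(DAY, -2, CURRENT_TIMESTAMP)
GROUP BY inline_data.orderId
HAVING MAX(CAST(LOOKUP(status, 'status_as_number') AS BIGINT)) = 1
Two caveats: the WHERE clause also hides any newer events inside the 2-day window, so an order updated recently could still be reported as stuck (the join in EDIT 1 does not have that problem), and on recent Druid versions the LATEST(expr, maxBytesPerString) aggregator may express "latest status per order" more directly, without a lookup.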

EXISTS in filter returning too many values

I need to write a query that uses EXISTS, rather than IN, so that it will run fast. The filter is being fed so many parameter values that EXISTS seems like the only option. The difference is between a 20+ minute query and a 5 second query.
This is the query I have:
SELECT DISTINCT d.GROUP_NAME
FROM [EMPLOYEE] e JOIN [DATA_FACT] d ON (e.KEY = d.KEY)
WHERE d.DATE BETWEEN @Start AND @End
AND EXISTS
(
    SELECT '1234567' -- @ID
)
AND e.Location IN (@Location)
ORDER BY d.GROUP_NAME ASC
The problem is that it is returning too many records. Based on the values I'm passing to filter on, I should get 1 row back but instead I am getting 28.
If I remove the EXISTS and add the following then I get the 1 record I need:
AND e.ID IN ('1234567')
Is there a way to fix the query to work with EXISTS so that I get the correct results?
The EXISTS in your query isn't correlated to anything: a subquery that returns a constant row is always true, so the condition filters nothing, which is why you get 28 rows back. This is essentially what you want if you are going to use EXISTS to filter your DATA_FACT table by parameters in your EMPLOYEE table. Not sure how much it's going to improve your performance, though, when you throw a massive number of employee IDs at it.
SELECT DISTINCT
    d.GROUP_NAME
FROM [DATA_FACT] AS d
WHERE d.DATE BETWEEN @Start AND @End
AND EXISTS
(
    SELECT 1
    FROM EMPLOYEE AS e
    WHERE d.[KEY] = e.[KEY]
    AND e.[Location] IN (@Location)
    AND e.ID IN ('1234567')
)
ORDER BY d.GROUP_NAME ASC
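If the real pain point is the sheer number of ID values being fed into the filter, a common approach is to stage them in an indexed temp table and let EXISTS probe it. A sketch (assuming the IDs arrive as strings and the hypothetical #FilterIds table is loaded once per batch):
-- Hypothetical staging table for the incoming ID list
CREATE TABLE #FilterIds (ID varchar(20) PRIMARY KEY);
INSERT INTO #FilterIds (ID) VALUES ('1234567'); -- bulk-load the full list here

SELECT DISTINCT d.GROUP_NAME
FROM [DATA_FACT] AS d
WHERE d.DATE BETWEEN @Start AND @End
AND EXISTS
(
    SELECT 1
    FROM EMPLOYEE AS e
    JOIN #FilterIds AS f ON f.ID = e.ID
    WHERE d.[KEY] = e.[KEY]
    AND e.[Location] IN (@Location)
)
ORDER BY d.GROUP_NAME ASC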

Postgres Update Using Select Passing In Parent Variable

I need to update a few thousand rows in my Postgres table using the result of an array_agg and a spatial lookup.
The query needs to take the geometry of the parent table and return an array of the matching row IDs in the other table. It may return no IDs, or potentially 2-3 IDs.
I've tried to use UPDATE ... FROM, but I can't seem to pass the parent table's geom column into the subquery's SELECT. I can't see any way of doing a JOIN between the two tables.
Here is what I currently have:
UPDATE lrc_wales_data.records
SET lrc_array = subquery.lrc_array
FROM (
    SELECT array_agg(wales_lrcs.gid) AS lrc_array
    FROM layers.wales_lrcs
    WHERE st_dwithin(records.geom_poly, wales_lrcs.geom, 0)
) AS subquery
WHERE records.lrc = 'nrw';
The error I get is:
ERROR: invalid reference to FROM-clause entry for table "records"
LINE 7: WHERE st_dwithin(records.geom_poly, wales_lrcs.geom, 0)
Is this even possible?
Many thanks,
Steve
Realised there was no need to use UPDATE ... FROM. I could just use a subquery directly in the SET:
UPDATE lrc_wales_data.records
SET lrc_array = (
    SELECT array_agg(wales_lrcs.gid)
    FROM layers.wales_lrcs
    WHERE st_dwithin(records.geom_poly, wales_lrcs.geom, 0)
)
WHERE records.lrc = 'nrw';
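One caveat: array_agg over zero matching rows yields NULL, so records with no nearby wales_lrcs rows end up with a NULL lrc_array. If an empty array is preferable, wrapping the subquery in COALESCE works; a sketch assuming gid is an integer:
UPDATE lrc_wales_data.records
SET lrc_array = COALESCE(
    (
        SELECT array_agg(wales_lrcs.gid)
        FROM layers.wales_lrcs
        WHERE st_dwithin(records.geom_poly, wales_lrcs.geom, 0)
    ),
    '{}'  -- empty integer array instead of NULL
)
WHERE records.lrc = 'nrw';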

sp_executesql vs user defined scalar function

In one table I am storing comparison conditions, each with an ID and an abbreviation such as =, >, >=, <=, <> or <.
In a second table I am storing records that each hold two values and a reference to one of those conditions,
and what I need is to compare the two values using the referenced condition and store the result (let's say '0' for false and '1' for true) in an additional column.
I am going to do this in a stored procedure, and basically I am going to compare anywhere from several to hundreds of records.
One possible solution is to use sp_executesql for each row, building dynamic statements; the other is to create my own scalar function and call it for each row using CROSS APPLY.
Could anyone tell me which is the more efficient way?
Note: I know that the best way to answer this is to implement both solutions and test, but I am hoping there might be an existing answer based on things like caching and SQL Server's internal optimizations, which would save me a lot of time because this is only part of a bigger problem.
I don't see the need for sp_executesql in this case. You can obtain the result for all records at once in a single statement:
select Result = case
when ct.Abbreviation='=' and t.ValueOne=t.ValueTwo then 1
when ct.Abbreviation='>' and t.ValueOne>t.ValueTwo then 1
when ct.Abbreviation='>=' and t.ValueOne>=t.ValueTwo then 1
when ct.Abbreviation='<=' and t.ValueOne<=t.ValueTwo then 1
when ct.Abbreviation='<>' and t.ValueOne<>t.ValueTwo then 1
when ct.Abbreviation='<' and t.ValueOne<t.ValueTwo then 1
else 0 end
from YourTable t
join ConditionType ct on ct.ID = t.ConditionTypeID
and update the additional column with something like:
;with cte as (
select t.AdditionalColumn, Result = case
when ct.Abbreviation='=' and t.ValueOne=t.ValueTwo then 1
when ct.Abbreviation='>' and t.ValueOne>t.ValueTwo then 1
when ct.Abbreviation='>=' and t.ValueOne>=t.ValueTwo then 1
when ct.Abbreviation='<=' and t.ValueOne<=t.ValueTwo then 1
when ct.Abbreviation='<>' and t.ValueOne<>t.ValueTwo then 1
when ct.Abbreviation='<' and t.ValueOne<t.ValueTwo then 1
else 0 end
from YourTable t
join ConditionType ct on ct.ID = t.ConditionTypeID
)
update cte
set AdditionalColumn = Result
If the above logic is supposed to be applied in many places, not just over one table, then yes, you may think about a function. Though I would rather use an inline table-valued function (not a scalar one), because user-defined scalar functions impose call-and-return overhead on every row (the more rows processed, the more time is wasted), whereas an inline table-valued function's body is expanded into the calling query's plan.
create function ftComparison
(
    @v1 float,
    @v2 float,
    @cType int
)
returns table
as return
select
    Result = case
        when ct.Abbreviation='=' and @v1=@v2 then 1
        when ct.Abbreviation='>' and @v1>@v2 then 1
        when ct.Abbreviation='>=' and @v1>=@v2 then 1
        when ct.Abbreviation='<=' and @v1<=@v2 then 1
        when ct.Abbreviation='<>' and @v1<>@v2 then 1
        when ct.Abbreviation='<' and @v1<@v2 then 1
        else 0
    end
from ConditionType ct
where ct.ID = @cType
which can be applied then as:
select f.Result
from YourTable t
cross apply ftComparison(ValueOne, ValueTwo, t.ConditionTypeID) f
or
select f.Result
from YourAnotherTable t
cross apply ftComparison(SomeValueColumn, SomeOtherValueColumn, @someConditionType) f