T-SQL Query Performance Optimization - tsql

This query shows how many stolen and recovered vehicles there were based on county and date range. I have a query that works on small data sets, but when I run it on the actual data (several million records) it takes way too long to run. I was wondering if there is another way I could write this query to be more efficient. I think my issue is when I join the agency table with 'or' to compare the agency primary keys with the Thefts table. Any input would be appreciated.
Thefts Table:
TheftAgencyPK  TheftDate   RecoveryAgencyPK  RecoveryDate
1              2019-05-01  1                 2019-05-02
1              2019-05-02  2                 2019-05-04
1              2019-05-03  1                 2019-05-05
2              2019-05-05  1                 2019-05-09
1              2019-01-01  2                 2019-05-01
County Table:
PK  Name
1   Sacramento
2   Aptos
Agency Table:
PK  Name
1   SacPD
2   AptosPD
Select
sub.CountyName
,Sum(Case When sub.TheftDate Between '2019-01-01' and '2019-05-31' and sub.Agency = sub.TheftAgency Then 1 Else 0 End) As Thefts
,Sum(Case When sub.RecoveryDate Between '2019-01-01' and '2019-05-31' and sub.Agency = sub.RecoveryAgency then 1 else 0 end) as Recoveries
From
(Select
Theft.TheftDate as TheftDate, Theft.TheftAgencyPK as TheftAgency,
Theft.RecoveryDate as RecoveryDate,
Theft.RecoveryAgencyPK as RecoveryAgency, Agency.pk as Agency,
County.PK as CountyPK, County.name as CountyName
From
Thefts Theft
Left Join
Agency Agency on Agency.pk = Theft.TheftAgencyPK or Agency.pk = Theft.RecoveryAgencyPK
Inner Join
County County on County.PK = Agency.pk
Where
TheftDate between '2019-01-01' and '2019-05-31'
or RecoveryDate between '2019-01-01' and '2019-05-01') Sub
Group By
sub.CountyName
Output:
CountyName: Thefts: Recoveries:
----------------------------------------
Aptos 1 2
Sacramento 4 3

You can remove the ORs by changing the structure of your SQL. However, I am not sure what your result really means. You sum by county over both the recovery agency and the theft agency, but then you compare against the "Agency" column, which is taken primarily from the theft agency.
I would suggest something like this, which first counts thefts per agency, then recoveries per agency, and then sums both by county, using the county the theft was in and the county the recovery was in. This ensures that the result is per county.
SELECT
    County.Name
    ,SUM(a.Thefts) AS Thefts
    ,SUM(a.Recoveries) AS Recoveries
FROM (SELECT
          Theft.TheftAgencyPK AS Agency
          ,COUNT(*) AS Thefts
          ,0 AS Recoveries
      FROM Thefts Theft
      WHERE Theft.TheftDate BETWEEN '2019-01-01' AND '2019-05-31'
      GROUP BY Theft.TheftAgencyPK
      UNION ALL
      SELECT
          Theft.RecoveryAgencyPK AS Agency
          ,0 AS Thefts
          ,COUNT(*) AS Recoveries
      FROM Thefts Theft
      WHERE Theft.RecoveryDate BETWEEN '2019-01-01' AND '2019-05-31' -- same range as the theft filter
      GROUP BY Theft.RecoveryAgencyPK) a
LEFT JOIN Agency Agency
    ON Agency.pk = a.Agency
INNER JOIN County County
    ON County.PK = Agency.pk
GROUP BY County.Name -- needed because County.Name is selected next to the aggregates
Performance-wise, the heavy lifting is done in the UNION ALL. If you have indexes on the Thefts table on theft date and recovery date, this might give you acceptable performance.
It would most likely lead to two passes over the Thefts table, which might be OK, depending on size and indexing strategy.
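As a quick check of the UNION ALL approach, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database and runs the same query shape. SQLite only stands in for SQL Server here, the dates are passed as parameters for convenience, and the County-to-Agency join on matching primary keys is carried over from the question as an assumption about the schema.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; sample rows come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE County (PK INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Agency (PK INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Thefts (TheftAgencyPK INTEGER, TheftDate TEXT,
                     RecoveryAgencyPK INTEGER, RecoveryDate TEXT);
INSERT INTO County VALUES (1, 'Sacramento'), (2, 'Aptos');
INSERT INTO Agency VALUES (1, 'SacPD'), (2, 'AptosPD');
INSERT INTO Thefts VALUES
  (1, '2019-05-01', 1, '2019-05-02'),
  (1, '2019-05-02', 2, '2019-05-04'),
  (1, '2019-05-03', 1, '2019-05-05'),
  (2, '2019-05-05', 1, '2019-05-09'),
  (1, '2019-01-01', 2, '2019-05-01');
""")

UNION_ALL_QUERY = """
SELECT County.Name, SUM(a.Thefts) AS Thefts, SUM(a.Recoveries) AS Recoveries
FROM (
    -- thefts counted per theft agency
    SELECT TheftAgencyPK AS Agency, COUNT(*) AS Thefts, 0 AS Recoveries
    FROM Thefts
    WHERE TheftDate BETWEEN :start AND :end
    GROUP BY TheftAgencyPK
    UNION ALL
    -- recoveries counted per recovery agency
    SELECT RecoveryAgencyPK AS Agency, 0 AS Thefts, COUNT(*) AS Recoveries
    FROM Thefts
    WHERE RecoveryDate BETWEEN :start AND :end
    GROUP BY RecoveryAgencyPK
) a
JOIN Agency ON Agency.PK = a.Agency
JOIN County ON County.PK = Agency.PK  -- assumption: same join as in the question
GROUP BY County.Name
ORDER BY County.Name
"""

rows = conn.execute(UNION_ALL_QUERY, {"start": "2019-01-01", "end": "2019-05-31"}).fetchall()
for name, thefts, recoveries in rows:
    print(name, thefts, recoveries)
```

On this sample the counts match the output table in the question. Whether the two branches are served by index seeks on a real table is something only the actual execution plan can tell you.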
If you want just one pass, you could replace the UNION ALL with something like this, which might push the optimizer to do a single pass over the Thefts table:
SELECT
IIF(t = 'theft', Theft.TheftAgencyPK, Theft.RecoveryAgencyPK) AS Agency
,SUM(IIF(t = 'theft', 1, 0)) AS Thefts
,SUM(IIF(t = 'theft', 0, 1)) AS Recoveries
FROM Thefts Theft
INNER JOIN (SELECT
'theft' t UNION ALL SELECT
'recovery' t) t
ON (t = 'theft'
AND Theft.TheftDate BETWEEN '2019-01-01' AND '2019-05-31')
OR (t = 'recovery'
AND Theft.RecoveryDate BETWEEN '2019-01-01' AND '2019-05-31')
GROUP BY IIF(t = 'theft', Theft.TheftAgencyPK, Theft.RecoveryAgencyPK)
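The row-doubling idea can be tried on the sample data with the sketch below. It again uses in-memory SQLite as a stand-in for SQL Server, writes the IIF calls as portable CASE expressions, and names the two-row helper table k (a hypothetical alias). It only checks that the counts per agency come out right; it says nothing about how many passes SQL Server would actually make.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Thefts (TheftAgencyPK INTEGER, TheftDate TEXT,
                     RecoveryAgencyPK INTEGER, RecoveryDate TEXT);
INSERT INTO Thefts VALUES
  (1, '2019-05-01', 1, '2019-05-02'),
  (1, '2019-05-02', 2, '2019-05-04'),
  (1, '2019-05-03', 1, '2019-05-05'),
  (2, '2019-05-05', 1, '2019-05-09'),
  (1, '2019-01-01', 2, '2019-05-01');
""")

# Each Thefts row is paired with up to two marker rows: one if it counts
# as a theft in the range, one if it counts as a recovery in the range.
SINGLE_SOURCE_QUERY = """
SELECT CASE WHEN k.t = 'theft' THEN th.TheftAgencyPK ELSE th.RecoveryAgencyPK END AS Agency,
       SUM(CASE WHEN k.t = 'theft' THEN 1 ELSE 0 END) AS Thefts,
       SUM(CASE WHEN k.t = 'theft' THEN 0 ELSE 1 END) AS Recoveries
FROM Thefts th
JOIN (SELECT 'theft' AS t UNION ALL SELECT 'recovery') k
  ON (k.t = 'theft' AND th.TheftDate BETWEEN :start AND :end)
  OR (k.t = 'recovery' AND th.RecoveryDate BETWEEN :start AND :end)
GROUP BY CASE WHEN k.t = 'theft' THEN th.TheftAgencyPK ELSE th.RecoveryAgencyPK END
ORDER BY Agency
"""

per_agency = conn.execute(
    SINGLE_SOURCE_QUERY, {"start": "2019-01-01", "end": "2019-05-31"}
).fetchall()
print(per_agency)
```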

Related

How to repeat some data points in query results?

I am trying to get the max date by account from 3 different tables and view those dates side by side. I created a separate query for each table, merged the results with UNION ALL, and then wrapped all that in a PIVOT.
The first 2 sections in the link/pic below show what I have been able to accomplish and the 3rd section is what I would like to do.
[Image: Query results by step]
How can I get the results from 2 of the tables to repeat? Is that possible?
--define var_ent_type = 'ACOM'
--define var_ent_id = '52766'
--define var_dict_id = 113
SELECT
*
FROM
(
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY' as "TableName",
PS.DICTIONARY_ID,
to_char(MAX(PS.END_EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN PERFORMDBO.PERF_SUMMARY PS ON (PS.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND PS.DICTIONARY_ID >= 100
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY',
PS.DICTIONARY_ID
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(H.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN HOLDINGDBO.POSITION H ON (H.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION',
1
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(C.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN CASHDBO.CASH_ACTIVITY C ON (C.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY',
1
--ORDER BY
-- 2,3, 4
)
PIVOT
(
MAX("MaxDate")
FOR "TableName"
IN ('CASH_ACTIVITY', 'PERF_SUMMARY','POSITION')
)
Everything is possible. You only need a window function to make the value repeat across the rows that have no data.
--Assuming current query is QC
With QC as (
...
)
select code, account, grouping,
--cash,
first_value(cash) over (partition by code, account order by grouping asc rows unbounded preceding) as cash_repeat,
perf,
--pos,
first_value(pos) over (partition by code, account order by grouping asc rows unbounded preceding) as pos_repeat
from QC
;
See first_value() help here: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC
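To see the repeat effect in isolation, here is a small sketch that runs the same first_value() pattern on made-up rows in an in-memory SQLite database (standing in for Oracle, and assuming a SQLite build with window function support, 3.25 or later). The table qc, the column dict_id, and all dates are hypothetical; they only mimic the shape of the pivoted result, where the row with the lowest grouping value carries the cash and position dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the pivoted query result (QC in the answer).
CREATE TABLE qc (code TEXT, account TEXT, dict_id INTEGER,
                 cash TEXT, perf TEXT, pos TEXT);
INSERT INTO qc VALUES
  ('ACOM', '52766',   0, '2021-03-31', NULL,         '2021-03-30'),
  ('ACOM', '52766', 113, NULL,         '2021-02-28', NULL),
  ('ACOM', '52766', 114, NULL,         '2021-01-31', NULL);
""")

# first_value() takes the value from the first row of each partition
# (lowest dict_id) and shows it on every row of that partition.
REPEAT_QUERY = """
SELECT code, account, dict_id,
       first_value(cash) OVER (PARTITION BY code, account ORDER BY dict_id
                               ROWS UNBOUNDED PRECEDING) AS cash_repeat,
       perf,
       first_value(pos)  OVER (PARTITION BY code, account ORDER BY dict_id
                               ROWS UNBOUNDED PRECEDING) AS pos_repeat
FROM qc
ORDER BY dict_id
"""

repeated = conn.execute(REPEAT_QUERY).fetchall()
for row in repeated:
    print(row)
```

This relies on the populated row sorting first within each account. If that is not guaranteed in the real data, the ordering (or the choice of window function) would need to change.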

Postgres Query Optimization without adding an extra index

I was planning to optimize this query in a different way, but before that: can we make any small change to this query to reduce the run time without adding an index?
Postgres version: 13.5
Query:
SELECT
orders.id as order_id,
orders.*, u1.name as user_name,
u2.name as driver_name,
u3.name as payment_by_name, referrals.name as ref_name,
array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
array_to_string(array_agg(orders_payments.tips), ',') as tips,
array_to_string(array_agg(locations.name), ',') as location_name,
(select
SUM(order_items.tax) as tax from order_items
where order_items.order_id = orders.id and order_items.deleted = 'f'
) as tax,
(select
SUM(orders_surcharges.surcharge_tax) as surcharge_tax from orders_surcharges
where orders_surcharges.order_id = orders.id
)
FROM "orders"
LEFT JOIN
users as u1 ON u1.id = orders.user_id
LEFT JOIN
users as u2 ON u2.id = orders.driver_id
LEFT JOIN
users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN
referrals ON referrals.id = orders.referral_id
INNER JOIN
locations ON locations.id = orders.location_id
LEFT JOIN
orders_payments ON orders_payments.order_id = orders.id
WHERE
(orders.company_id = '626')
AND
(orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00')
AND
orders.order_status_id NOT IN (10, 5, 50)
GROUP BY
orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
created_at ASC LIMIT 300 OFFSET 0
Current Index:
"orders_pkey" PRIMARY KEY, btree (id)
"idx_orders_company_and_location" btree (company_id, location_id)
"idx_orders_created_at" btree (created_at)
"idx_orders_customer_id" btree (customer_id)
"idx_orders_location_id" btree (location_id)
"idx_orders_order_status_id" btree (order_status_id)
Execution Plan
Seems this takes more time on the parallel heap scan.
You're looking for 300 orders and trying to get some additional information about these records. I would first fetch just these 300 records, instead of fetching all the data and then limiting it to 300. Something like this:
WITH orders_300 AS (
    SELECT orders.id -- only the key is needed here; the other columns are fetched below
    FROM orders
    INNER JOIN locations ON locations.id = orders.location_id
    WHERE orders.company_id = '626'
      AND orders.created_at BETWEEN '2021-04-23 20:00:00' AND '2021-07-24 20:00:00'
      AND orders.order_status_id NOT IN (10, 5, 50)
    ORDER BY orders.created_at ASC
    LIMIT 300 -- LIMIT
    OFFSET 0
)
SELECT
    orders.id as order_id,
    orders.*, -- just get the columns that you really need, never use * in production
    u1.name as user_name,
    u2.name as driver_name,
    u3.name as payment_by_name, referrals.name as ref_name,
    array_to_string(array_agg(orders_payments.payment_type_name), ',') as payment_type_name,
    array_to_string(array_agg(orders_payments.amount), ',') as payment_type_amount,
    array_to_string(array_agg(orders_payments.reference_code), ',') as reference_code,
    array_to_string(array_agg(orders_payments.tips), ',') as tips,
    array_to_string(array_agg(locations.name), ',') as location_name,
    (SELECT SUM(order_items.tax) as tax
     FROM order_items
     WHERE order_items.order_id = orders.id
       AND order_items.deleted = 'f'
    ) as tax,
    (SELECT SUM(orders_surcharges.surcharge_tax) as surcharge_tax
     FROM orders_surcharges
     WHERE orders_surcharges.order_id = orders.id
    )
FROM orders_300
INNER JOIN orders ON orders.id = orders_300.id -- back to the base table, so GROUP BY orders.id still covers orders.*
LEFT JOIN users as u1 ON u1.id = orders.user_id
LEFT JOIN users as u2 ON u2.id = orders.driver_id
LEFT JOIN users as u3 ON u3.id = orders.payment_received_by
LEFT JOIN referrals ON referrals.id = orders.referral_id
INNER JOIN locations ON locations.id = orders.location_id -- still needed for locations.name
LEFT JOIN orders_payments ON orders_payments.order_id = orders.id
GROUP BY
    orders.id, u1.name, u2.name, u3.name, referrals.name
ORDER BY
    orders.created_at;
This will at least have a huge impact on the slowest part of your query: all those index scans on orders_payments. Every single scan is fast, but the query is doing 165000 of them. Limit this to just 300 and it will be much faster.
Another issue is that none of your indexes covers the entire WHERE condition on the table "orders". But if you can't create a new index, you're out of luck on that point.

Postgresql count by past weeks

select id, wk0_count
from teams
left join
(select team_id, count(team_id) as wk0_count
from (
select created_at, team_id, trunc(EXTRACT(EPOCH FROM age(CURRENT_TIMESTAMP,created_at)) / 604800) as wk_offset
from loan_files
where loan_type <> 2
order by created_at DESC) as t1
where wk_offset = 0
group by team_id) as t_wk0
on teams.id = t_wk0.team_id
I've created the query above that shows me how many loans each team did in a given week. Week 0 is the past seven days.
Ideally I want a table that shows how many loans each team did in the last 8 weeks, grouped by week. The output would look like:
Any ideas on the best way to do this?
select
t.id,
count(week = 0 or null) as wk0,
count(week = 1 or null) as wk1,
count(week = 2 or null) as wk2,
count(week = 3 or null) as wk3
from
teams t
left join
loan_files lf on lf.team_id = t.id and loan_type <> 2
cross join lateral
(select (current_date - created_at::date) / 7 as week) w
group by 1
In 9.4+ versions use the aggregate FILTER syntax:
count(*) filter (where week = 0) as wk0,
LATERAL is available from 9.3. In earlier versions, move the week expression directly into the count condition.
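The conditional-count pattern can be tried out with the sketch below. It uses in-memory SQLite instead of PostgreSQL, so the week number is computed with julianday() rather than date subtraction, and "today" is passed in as a fixed parameter to keep the result reproducible. The sample rows and the reference date are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teams (id INTEGER PRIMARY KEY);
CREATE TABLE loan_files (team_id INTEGER, loan_type INTEGER, created_at TEXT);
INSERT INTO teams VALUES (1), (2), (3);
INSERT INTO loan_files VALUES
  (1, 1, '2024-03-10'),  -- 0 days old  -> week 0
  (1, 1, '2024-03-05'),  -- 5 days old  -> week 0
  (1, 1, '2024-03-01'),  -- 9 days old  -> week 1
  (1, 2, '2024-03-09'),  -- loan_type 2 is excluded
  (2, 3, '2024-02-20');  -- 19 days old -> week 2
""")

# COUNT ignores NULL, and a CASE without ELSE yields NULL, so each column
# counts only the loans that fall into its week. Teams without loans keep
# a row (LEFT JOIN) and get zeros.
WEEKLY_QUERY = """
SELECT t.id,
       COUNT(CASE WHEN w.week = 0 THEN 1 END) AS wk0,
       COUNT(CASE WHEN w.week = 1 THEN 1 END) AS wk1,
       COUNT(CASE WHEN w.week = 2 THEN 1 END) AS wk2
FROM teams t
LEFT JOIN (
    SELECT team_id,
           CAST(julianday(:today) - julianday(created_at) AS INTEGER) / 7 AS week
    FROM loan_files
    WHERE loan_type <> 2
) w ON w.team_id = t.id
GROUP BY t.id
ORDER BY t.id
"""

weekly = conn.execute(WEEKLY_QUERY, {"today": "2024-03-10"}).fetchall()
for row in weekly:
    print(row)
```

Extending this to eight weeks is a matter of adding wk3 through wk7 columns in the same way.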
How about the following query?
SELECT team_id AS id, count(team_id) AS wk0_count
FROM teams LEFT JOIN loan_files ON teams.id = team_id
WHERE loan_type <> 2
AND trunc(EXTRACT(epoch FROM age(CURRENT_TIMESTAMP, created_at)) / 604800) = 0
GROUP BY team_id
Notable changes are:
ORDER BY clause in subquery was pointless;
created_at in innermost subquery was never used;
wk_offset test is moved on the WHERE clause and not done in two distinct steps;
outermost subquery was not needed.

postgresql complex query joining same table

I would like to get those customers from a table 'transactions' which haven't created any transactions in the last 6 Months.
Table:
'transactions'
id, email, state, paid_at
To visualise:
|------------------------ time period with all transactions --------------------|
|-- period before month transactions > 0) ---|---- curr month transactions = 0 -|
I guess this is doable with a join showing only those that didn't have any transactions on the right side.
Example:
Month = November
The conditions for the left side should be:
COUNT(l.id) > 0
l.paid_at < '2013-05-01 00:00:00'
Conditions for the right side:
COUNT(r.id) = 0
r.paid_at BETWEEN '2013-05-01 00:00:00' AND '2013-11-30 23:59:59'
Is join the right approach?
Answer
SELECT
    C.email
FROM
    transactions C
WHERE
    (
        C.email NOT IN (
            SELECT DISTINCT
                email
            FROM
                transactions
            WHERE
                paid_at >= '2013-05-01 00:00:00'
                AND paid_at <= '2013-11-30 23:59:59'
        )
        AND
        C.email IN (
            SELECT DISTINCT
                email
            FROM
                transactions
            WHERE
                paid_at <= '2013-05-01 00:00:00'
        )
    )
    AND C.paid_at <= '2013-11-30 23:59:59'
There are a couple of ways you could do this. Use a subquery to get distinct customer ids for transactions in the last 6 months, and then select customers where their id isn't in the subquery.
select c.id, c.name
from customer c
where c.id not in (select distinct customer_id from transaction where dt between <start> and <end>);
Or, use a left join from customer to transaction, and filter the results to have transaction id null. A left join includes all rows from the left-hand table, even when there are no matching rows in the right-hand table. Explanation of left joins here: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
select c.id, c.name
from customer c
left join transaction t on c.id = t.customer_id
and t.dt between <start> and <end>
where t.id is null;
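The left join approach is likely to be faster, although that depends on the database and its optimizer, so it is worth comparing the execution plans of both.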
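The two variants can be compared side by side with the sketch below, which runs them on a few made-up rows in an in-memory SQLite database (standing in for PostgreSQL). The transaction table is named txn here because TRANSACTION is a keyword in SQLite; that name and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE txn (id INTEGER PRIMARY KEY, customer_id INTEGER, dt TEXT);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO txn VALUES
  (1, 1, '2013-03-10'),
  (2, 1, '2013-07-01'),  -- Ann has a transaction inside the window
  (3, 2, '2013-02-01');  -- Bob only has one before the window
""")
window = {"start": "2013-05-01", "end": "2013-11-30"}

# Variant 1: exclude customers whose id shows up in the window.
NOT_IN_QUERY = """
SELECT c.id FROM customer c
WHERE c.id NOT IN (SELECT customer_id FROM txn WHERE dt BETWEEN :start AND :end)
ORDER BY c.id
"""

# Variant 2: anti-join. Keep customers for whom the join found no match.
LEFT_JOIN_QUERY = """
SELECT c.id FROM customer c
LEFT JOIN txn t ON c.id = t.customer_id AND t.dt BETWEEN :start AND :end
WHERE t.id IS NULL
ORDER BY c.id
"""

not_in_ids = [r[0] for r in conn.execute(NOT_IN_QUERY, window)]
left_join_ids = [r[0] for r in conn.execute(LEFT_JOIN_QUERY, window)]
print(not_in_ids, left_join_ids)
```

Both variants also return Cy, who has no transactions at all. To keep only customers who were active before the window, as the question asks, an extra condition on earlier transactions would be needed. Also note that NOT IN returns no rows at all if the subquery yields a NULL customer_id, which the left join variant does not suffer from.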

Merging tables in t-sql

I have a table holding periods and prices, something like this
itemid periodid periodstart periodend price
1 1 2011/01/01 2011/05/01 50.00
1 2 2011/05/02 2011/08/01 80.00
1 3 2011/08/02 2011/12/31 50.00
Now I have a second table that can hold single dates or periods
itemid periodid periodstart periodend price
1 8 2011/07/01 2011/07/17 70.00
Now, how can I do a query that would return the following result?
itemid periodid periodstart periodend price
1 1 2011/01/01 2011/05/01 50.00
1 2 2011/05/02 2011/06/30 80.00 ****
1 8 2011/07/01 2011/07/17 70.00 ***
1 2 2011/07/18 2011/08/01 80.00 ****
1 3 2011/08/02 2011/12/31 50.00
EDIT -- Highlight the fact that the merge is modifying the dates around it
How about something like
select
t1.itemid,t1.periodid,t1.periodstart, coalesce(dateadd(d,-1,t2.periodstart),t1.periodend) as periodend, t1.price
from t1
left outer join t2 on t1.periodstart < t2.periodstart and t1.periodend>t2.periodstart and t1.itemid=t2.itemid
union
select
t2.itemid,t2.periodid,t2.periodstart, t2.periodend, t2.price
from t1
inner join t2 on t1.periodstart < t2.periodstart and t1.periodend>t2.periodstart and t1.itemid=t2.itemid
union
select
t1.itemid,t1.periodid,dateAdd(d,1,t2.periodend), t1.periodend, t1.price
from t1
inner join t2 on t1.periodstart < t2.periodend and t1.periodend>t2.periodend and t1.itemid=t2.itemid
order by periodstart
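As a sanity check, the sketch below runs the same three-part UNION on the sample rows from the question, using in-memory SQLite instead of SQL Server. The dates are stored as ISO strings and dateadd(d, ...) is replaced by SQLite's date(..., '-1 day') and date(..., '+1 day'), which is an adaptation for this sketch and not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (itemid INTEGER, periodid INTEGER, periodstart TEXT, periodend TEXT, price REAL);
CREATE TABLE t2 (itemid INTEGER, periodid INTEGER, periodstart TEXT, periodend TEXT, price REAL);
INSERT INTO t1 VALUES
  (1, 1, '2011-01-01', '2011-05-01', 50.0),
  (1, 2, '2011-05-02', '2011-08-01', 80.0),
  (1, 3, '2011-08-02', '2011-12-31', 50.0);
INSERT INTO t2 VALUES (1, 8, '2011-07-01', '2011-07-17', 70.0);
""")

MERGE_QUERY = """
-- part 1: every t1 period, cut short if a t2 period starts inside it
SELECT t1.itemid, t1.periodid, t1.periodstart AS periodstart,
       COALESCE(date(t2.periodstart, '-1 day'), t1.periodend) AS periodend, t1.price
FROM t1
LEFT JOIN t2 ON t1.periodstart < t2.periodstart AND t1.periodend > t2.periodstart
            AND t1.itemid = t2.itemid
UNION
-- part 2: the overriding t2 period itself
SELECT t2.itemid, t2.periodid, t2.periodstart, t2.periodend, t2.price
FROM t1
JOIN t2 ON t1.periodstart < t2.periodstart AND t1.periodend > t2.periodstart
       AND t1.itemid = t2.itemid
UNION
-- part 3: the remainder of the t1 period after the t2 period ends
SELECT t1.itemid, t1.periodid, date(t2.periodend, '+1 day'), t1.periodend, t1.price
FROM t1
JOIN t2 ON t1.periodstart < t2.periodend AND t1.periodend > t2.periodend
       AND t1.itemid = t2.itemid
ORDER BY 3
"""

merged = conn.execute(MERGE_QUERY).fetchall()
for row in merged:
    print(row)
```

On this sample the five rows match the result requested in the question. The sketch only covers a t2 period that lies strictly inside one t1 period; overrides that span several base periods or touch a period boundary are not handled by this query shape.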
Use a Union?
Select itemid, periodid,periodstart, periodend,price FROM table1
UNION
SELECT itemid, periodid,periodstart, periodend,price FROM table2
Are you trying to do some sort of join, though? The result set doesn't match the two tables you supplied.
Are you accounting for entries that line up, or are you just trying to combine the rows?
If the latter, you could just do a UNION:
Select itemid, periodid, periodstart, periodend, price
From Table1
Union
Select itemid, periodid, periodstart, periodend, price
From Table2