Poor performance of UNION query Redshift - amazon-redshift

I have a Redshift UNION query that performs very poorly. Query goes like this:
WITH a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type)
SELECT 'a1' AS data_set, 'revenue' AS amount_type, a1.revenue AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost1' AS amount_type, a1.cost1 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost2' AS amount_type, a1.cost2 AS amount FROM a1 UNION
SELECT 'a1' AS data_set, 'cost3' AS amount_type, a1.cost3 AS amount FROM a1 UNION
SELECT 'a2' AS data_set, 'revenue' AS amount_type, a2.revenue AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost1' AS amount_type, a2.cost1 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost2' AS amount_type, a2.cost2 AS amount FROM a2 UNION
SELECT 'a2' AS data_set, 'cost3' AS amount_type, a2.cost3 AS amount FROM a2 UNION
SELECT 'b1' AS data_set, b1.amount_type, b1.amount FROM b1
The goal of the UNION part is to transform a1 and a2 to have the same result set schema as b1 and eventually have one combined data set.
When run on their own, the a1 and a2 subqueries each take around 60 secs and return 6000 rows, while b1 runs for 5 secs and returns 500 rows. These run times are acceptable to me; however, the "combined" query above runs for a whopping 20 mins.
I think the fetching part is what takes too much time for this query. I have tried UNION ALL, but performance did not improve much. It would be great if I could somehow transform a1 and a2 to the b1 schema without using UNION, but I haven't been able to do so.
Any help will be greatly appreciated. Thank you

You basically want to unpivot the a1 and a2 tables.
I would do it like this:
WITH
seq (idx) AS (
select 'revenue' UNION ALL
select 'cost1' UNION ALL
select 'cost2' UNION ALL
select 'cost3'
),
a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type)
SELECT
'a1' AS data_set,
seq.idx AS amount_type,
CASE seq.idx
WHEN 'revenue' THEN a1.revenue
WHEN 'cost1' THEN a1.cost1
WHEN 'cost2' THEN a1.cost2
WHEN 'cost3' THEN a1.cost3
END AS amount
FROM a1 CROSS JOIN seq
UNION ALL
SELECT
'a2' AS data_set,
seq.idx AS amount_type,
CASE seq.idx
WHEN 'revenue' THEN a2.revenue
WHEN 'cost1' THEN a2.cost1
WHEN 'cost2' THEN a2.cost2
WHEN 'cost3' THEN a2.cost3
END AS amount
FROM a2 CROSS JOIN seq
UNION ALL
SELECT
'b1' AS data_set,
b1.amount_type,
b1.amount
FROM b1
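
The unpivot step can be tried on its own. This is a minimal sketch, assuming two made-up months of data in place of the real a1 aggregate; the sample values are hypothetical, and revenue_month is kept in the output so each row stays identifiable:

```sql
-- Hypothetical sample rows standing in for the aggregated a1 CTE.
WITH a1 AS (
    SELECT '2020-01' AS revenue_month, 100 AS revenue, 10 AS cost1, 20 AS cost2, 30 AS cost3
    UNION ALL
    SELECT '2020-02', 200, 15, 25, 35
),
-- One row per column that should become a row.
seq (amount_type) AS (
    SELECT 'revenue' UNION ALL
    SELECT 'cost1'   UNION ALL
    SELECT 'cost2'   UNION ALL
    SELECT 'cost3'
)
SELECT
    'a1' AS data_set,
    a1.revenue_month,
    seq.amount_type,
    -- Pick the column that matches the current seq row.
    CASE seq.amount_type
        WHEN 'revenue' THEN a1.revenue
        WHEN 'cost1'   THEN a1.cost1
        WHEN 'cost2'   THEN a1.cost2
        WHEN 'cost3'   THEN a1.cost3
    END AS amount
FROM a1
CROSS JOIN seq
ORDER BY a1.revenue_month, seq.amount_type;
```

Each a1 row is repeated once per seq row, so the two sample months come back as eight rows, and a1 is read only once instead of once per UNION branch.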

Thanks @botchniaque for all your help on this. Your CROSS JOIN suggestion solved this. There's something about that query pattern though that Redshift fails to read. The final query that worked for me is something like this:
WITH a1 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders1
GROUP BY revenue_month),
a2 AS (SELECT
revenue_month,
SUM(revenue) AS revenue,
SUM(cost1) AS cost1,
SUM(cost2) AS cost2,
SUM(cost3) AS cost3
FROM orders2
GROUP BY revenue_month),
b1 AS (SELECT
revenue_month,
SUM(CASE WHEN amount_type = 'revenue' THEN amount ELSE 0 END) AS revenue,
SUM(CASE WHEN amount_type = 'cost1' THEN amount ELSE 0 END) AS cost1,
SUM(CASE WHEN amount_type = 'cost2' THEN amount ELSE 0 END) AS cost2,
SUM(CASE WHEN amount_type = 'cost3' THEN amount ELSE 0 END) AS cost3
FROM (SELECT
revenue_month,
amount_type,
SUM(amount) AS amount
FROM monthly
GROUP BY revenue_month,amount_type) AS b0
GROUP BY revenue_month)
SELECT
ab.data_set,
ab.revenue_month,
seq.amount_type,
CASE seq.amount_type
WHEN 'revenue' THEN ab.revenue
WHEN 'cost1' THEN ab.cost1
WHEN 'cost2' THEN ab.cost2
WHEN 'cost3' THEN ab.cost3
END AS amount
FROM
(SELECT 'a1' AS data_set, a1.revenue_month, a1.revenue, a1.cost1, a1.cost2, a1.cost3 FROM a1 UNION ALL
SELECT 'a2' AS data_set, a2.revenue_month, a2.revenue, a2.cost1, a2.cost2, a2.cost3 FROM a2 UNION ALL
SELECT 'b1' AS data_set, b1.revenue_month, b1.revenue, b1.cost1, b1.cost2, b1.cost3 FROM b1) AS ab
CROSS JOIN (SELECT 'revenue' AS amount_type UNION ALL
SELECT 'cost1' AS amount_type UNION ALL
SELECT 'cost2' AS amount_type UNION ALL
SELECT 'cost3' AS amount_type) AS seq
Basically, it first pivots b1 to have the same schema as a1 and a2, then combines all three datasets with UNION ALL into ab, and finally unpivots the combined dataset with a CROSS JOIN.
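
One more difference between UNION and UNION ALL matters for this kind of result. The query in the question selected only data_set, amount_type and amount, and a plain UNION removes duplicate rows, so two months with the same amount would collapse into one row. A small sketch with made-up values (the CTE names with_union and with_union_all are hypothetical):

```sql
-- Hypothetical data: two months that happen to have identical amounts.
WITH a1 AS (
    SELECT '2020-01' AS revenue_month, 100 AS revenue, 10 AS cost1
    UNION ALL
    SELECT '2020-02', 100, 10
),
with_union AS (
    SELECT 'a1' AS data_set, 'revenue' AS amount_type, a1.revenue AS amount FROM a1
    UNION
    SELECT 'a1', 'cost1', a1.cost1 FROM a1
),
with_union_all AS (
    SELECT 'a1' AS data_set, 'revenue' AS amount_type, a1.revenue AS amount FROM a1
    UNION ALL
    SELECT 'a1', 'cost1', a1.cost1 FROM a1
)
SELECT
    (SELECT COUNT(*) FROM with_union)     AS union_rows,      -- 2: duplicates removed
    (SELECT COUNT(*) FROM with_union_all) AS union_all_rows;  -- 4: every row kept
```

Carrying revenue_month through, as the final query does, keeps the rows distinct, and UNION ALL skips the duplicate-removal step.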

Related

Optimise With Query in PostgreSQL

I have a working PostgreSQL query, but it is taking a considerable amount of time to execute. I need help optimising it.
I have:
Removed inner queries as much as possible.
Removed the unnecessary data from the query.
Created a with query which gets the required data from the beginning
I need help to optimise this query
with data as (
select
e.id,
e.name,
t.barcode,
tt.variant,
t.cost_cents::decimal / 100 as ticket_cost,
t.fee_cents::decimal / 100 as booking_fee
from
tickets t
inner join events e on t.event_id = e.id
inner join ticket_types tt on t.ticket_type_id = tt.id
where
t.status = 2
and e.source in ('source1', 'source2')
)
select
d.name,
count(distinct d.barcode) as issued,
(select count(distinct d2.barcode) from data d2 where d2.id = d.id and d2.variant is null) as sold,
sum(d.ticket_cost) as ticket_revenue,
sum(d.booking_fee) as booking_fees
from
data d
group by
id,
name
It is best to find the slow parts with EXPLAIN, which shows the estimated cost of each step in the query plan.
You can speed up joins by creating proper indexes.
Also, remove the correlated subquery
(select count(distinct d2.barcode) from data d2 where d2.id = d.id and d2.variant is null)
from the SELECT clause. Joining data to itself on id would repeat every row once per matching d2 row and inflate the sums, so count conditionally within the same GROUP BY instead, something like this:
select
d.name,
count(distinct d.barcode) as issued,
count(distinct case when d.variant is null then d.barcode end) as sold,
sum(d.ticket_cost) as ticket_revenue,
sum(d.booking_fee) as booking_fees
from
data d
group by
d.id,
d.name
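
The conditional count can be checked in isolation. This is a sketch with three made-up tickets standing in for the data CTE (the values and the 'comp' variant are hypothetical); it also notes what a self-join on the same rows would do to the sums:

```sql
-- Hypothetical rows: two tickets with no variant, one with a variant.
WITH data AS (
    SELECT 1 AS id, 'Event A' AS name, 'B1' AS barcode,
           CAST(NULL AS text) AS variant, 10.00 AS ticket_cost
    UNION ALL SELECT 1, 'Event A', 'B2', NULL,   10.00
    UNION ALL SELECT 1, 'Event A', 'B3', 'comp',  0.00
)
SELECT
    d.name,
    COUNT(DISTINCT d.barcode) AS issued,                                   -- 3
    -- CASE yields NULL for rows with a variant, and COUNT ignores NULLs.
    COUNT(DISTINCT CASE WHEN d.variant IS NULL THEN d.barcode END) AS sold, -- 2
    SUM(d.ticket_cost) AS ticket_revenue                                   -- 20.00
FROM data d
GROUP BY d.id, d.name;
-- For comparison: LEFT JOIN data d2 ON d2.id = d.id AND d2.variant IS NULL
-- would match each of the 3 rows to 2 rows, giving 6 joined rows and a
-- ticket_revenue of 40.00 instead of 20.00.
```

The counts use DISTINCT, so they survive the row multiplication of a self-join, but the SUM columns do not, which is why a single pass with a conditional count is the safer shape here.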

Dividing sums with different WHERE conditions

I need help in my query. I am trying to divide a SUM of a column with different WHERE conditions for example
SELECT
TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY') AS YR,
SUM(T1.PRICE) AS COLUMN_1
FROM TABLE_ONE T1
INNER JOIN SUB_STATUSES status ON status.SUB_ID = T1.ID
WHERE status.R_SUB_STATUS_CODE = 'COMPLETED'
AND T1.TYPE = 'COMPANY' OR T1.TYPE = 'SMALL_BUSINESS'
GROUP BY TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY')
ORDER BY TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY') DESC
DIVIDE BY
SELECT
TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY') AS YR,
SUM(T1.PRICE) AS COLUMN_1
from TABLE_ONE T1
INNER JOIN SUB_STATUSES status ON status.SUB_ID = T1.ID
WHERE status.R_SUB_STATUS_CODE = 'COMPLETED'
AND T1.TYPE = 'LOT' OR T1.TYPE = 'LAND'
GROUP BY TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY')
ORDER BY TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY') DESC
The first Column returns this :
2017 1094
2016 89
2015 95
2014 101
2013 113
2012 173
2011 191
2010 165
Use a CASE expression inside each SUM instead of a WHERE clause. The outer query guards against division by zero.
SELECT yr, case when column_2 <> 0 then column_1/column_2 else 0 end divcol
FROM (
SELECT
TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY') AS YR,
SUM(case when t1.type in ('COMPANY', 'SMALL_BUSINESS') THEN T1.PRICE ELSE 0 END) as COLUMN_1,
SUM(case when t1.type in ('LOT', 'LAND') THEN T1.PRICE ELSE 0 end) AS COLUMN_2
FROM TABLE_ONE T1
INNER JOIN SUB_STATUSES status ON status.SUB_ID = T1.ID
WHERE status.R_SUB_STATUS_CODE = 'COMPLETED'
GROUP BY TO_CHAR(PS.REPORT_PERIOD_END_DATE, 'YYYY')
)
ORDER BY yr DESC
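
The division guard can also be written with NULLIF. A minimal sketch, assuming an Oracle database (suggested by TO_CHAR and the unaliased subquery, but not stated in the question); the column_2 values are made up for illustration:

```sql
-- NULLIF(column_2, 0) returns NULL when column_2 is 0, so the division
-- produces NULL for that year instead of raising a divide-by-zero error.
SELECT yr, column_1 / NULLIF(column_2, 0) AS divcol
FROM (
    SELECT '2017' AS yr, 1094 AS column_1, 547 AS column_2 FROM dual
    UNION ALL
    SELECT '2016', 89, 0 FROM dual   -- hypothetical zero denominator
)
ORDER BY yr DESC;
```

Unlike the CASE version, this returns NULL rather than 0 for years with no LOT or LAND revenue, which may or may not be what the report needs.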

Select the last record for the date and time columns

I need to select the last record in the academic table, which has separate columns for date and time. When I run the query I get this error: Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
USE PCUnitTest
SELECT C.ACCOUNTNO, C.CONTACT, C.LASTNAME, C.KEY4, A.PEOPLE_ID, A.APP_STATUS, A.APP_DECISION, A.REVISION_DATE, A.REVISION_TIME
FROM ACADEMIC AS A INNER JOIN
GM.dbo.CONTACT1 AS C ON A.PEOPLE_ID = C.KEY4
WHERE A.REVISION_DATE = (SELECT TOP (1) REVISION_DATE, REVISION_TIME, PEOPLE_CODE, PEOPLE_ID, PEOPLE_CODE_ID, ACADEMIC_YEAR, ACADEMIC_TERM, ACADEMIC_SESSION, PROGRAM, DEGREE, CURRICULUM
FROM PCUnitTest.dbo.ACADEMIC
ORDER BY REVISION_DATE DESC, REVISION_TIME DESC)
You can join to the subquery you are currently using in the WHERE clause, selecting only the date and time columns:
USE PowerCampusUnitTest
SELECT C.ACCOUNTNO, C.CONTACT, C.LASTNAME, C.KEY4, A.PEOPLE_ID, A.APP_STATUS, A.APP_DECISION, A.REVISION_DATE, A.REVISION_TIME
FROM ACADEMIC AS A
INNER JOIN GoldMineUnitTest.dbo.CONTACT1 AS C
ON A.PEOPLE_ID = C.KEY4
INNER JOIN (
SELECT TOP 1 A2.REVISION_DATE,A2.REVISION_TIME FROM PowerCampusUnitTest.dbo.ACADEMIC A2
ORDER BY REVISION_DATE DESC, REVISION_TIME DESC
)AS A2
ON A.REVISION_DATE = A2.REVISION_DATE AND A.REVISION_TIME = A2.REVISION_TIME
Use ROW_NUMBER()
USE PCUnitTest
SELECT
R.ACCOUNTNO, R.CONTACT, R.LASTNAME, R.KEY4, R.PEOPLE_ID, R.APP_STATUS, R.APP_DECISION, R.REVISION_DATE, R.REVISION_TIME
FROM
(
SELECT C.ACCOUNTNO, C.CONTACT, C.LASTNAME, C.KEY4, A.PEOPLE_ID, A.APP_STATUS, A.APP_DECISION, A.REVISION_DATE, A.REVISION_TIME
,ROW_NUMBER() OVER (ORDER BY A.REVISION_DATE DESC, A.REVISION_TIME DESC) RN
FROM ACADEMIC AS A INNER JOIN
GMUnitTest.dbo.CONTACT1 AS C ON A.PEOPLE_ID = C.KEY4
) R
WHERE RN=1
If you want to get the latest row for each PEOPLE_ID, then add PARTITION BY
SELECT
R.ACCOUNTNO, R.CONTACT, R.LASTNAME, R.KEY4, R.PEOPLE_ID, R.APP_STATUS, R.APP_DECISION, R.REVISION_DATE, R.REVISION_TIME
FROM
(
SELECT C.ACCOUNTNO, C.CONTACT, C.LASTNAME, C.KEY4, A.PEOPLE_ID, A.APP_STATUS, A.APP_DECISION, A.REVISION_DATE, A.REVISION_TIME
,ROW_NUMBER() OVER (PARTITION BY A.PEOPLE_ID ORDER BY A.REVISION_DATE DESC, A.REVISION_TIME DESC) RN
FROM ACADEMIC AS A INNER JOIN
GMUnitTest.dbo.CONTACT1 AS C ON A.PEOPLE_ID = C.KEY4
) R
WHERE RN=1
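
A self-contained way to try the PARTITION BY variant, assuming SQL Server and three made-up ACADEMIC rows (the IDs, statuses, dates and times are hypothetical):

```sql
SELECT R.PEOPLE_ID, R.APP_STATUS, R.REVISION_DATE, R.REVISION_TIME
FROM (
    SELECT A.PEOPLE_ID, A.APP_STATUS, A.REVISION_DATE, A.REVISION_TIME,
           -- Number each person's rows from newest to oldest.
           ROW_NUMBER() OVER (PARTITION BY A.PEOPLE_ID
                              ORDER BY A.REVISION_DATE DESC, A.REVISION_TIME DESC) AS RN
    FROM (VALUES
        ('P001', 'OPEN',   '2017-01-10', '09:00:00'),
        ('P001', 'CLOSED', '2017-01-10', '15:30:00'),
        ('P002', 'OPEN',   '2016-12-01', '08:15:00')
    ) AS A (PEOPLE_ID, APP_STATUS, REVISION_DATE, REVISION_TIME)
) AS R
WHERE R.RN = 1;   -- keeps P001/CLOSED and P002/OPEN
```

Ordering by both columns matters: P001 has two rows on the same date, and only REVISION_TIME decides which one is the latest.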

SQL insert into using CTE

I am facing a performance issue with an "INSERT INTO" statement in SQL. I am using a CTE to select data from multiple tables and insert it into another table. It was working just fine until yesterday. The SELECT takes less than a minute to retrieve the data, whereas the INSERT INTO takes forever. Can someone please help me understand what I am doing wrong? Any help is highly appreciated. Thanks.
Here is my code:
I am using this query in an SP. I am trying to load 220K records to 1.5M records table.
;with CTE_A
AS
(
SELECT A1, A2,...
FROM dbo.A with (nolock)
WHERE A1 = <some condition>
GROUP BY a.A1,a.A2 , a.A3
), CTE_C as
(
SELECT C1, C2,....
FROM dbo.B with (nolock)
WHERE a.C1 = <some condition>
GROUP BY a.c1,a.C2 , a.C3
)
INSERT INTO [dbo].MainTable
SELECT
A1, A2, A3 , C1, C2, C3
FROM
CTE_A ta with (nolock)
LEFT OUTER JOIN
CTE_C tc with (nolock) ON ta.a1 = tc.a1 and ta.b1 = tc.b1 and ta.c1 = tc.c1
LEFT OUTER JOIN
othertable bs with (nolock) ON usd_bs.c = s.c
AND (A1 BETWEEN bs.a1 AND bs.a1)
AND bs.c1 = 1
Try this method (temp tables instead of CTEs); performance should be much better for your task:
IF OBJECT_ID('Tempdb..#CTE_A') IS NOT NULL
DROP TABLE #CTE_A
IF OBJECT_ID('Tempdb..#CTE_C') IS NOT NULL
DROP TABLE #CTE_C
-------------------------------------------------------------
SELECT A1 ,
A2 ,...
INTO #CTE_A --data set into temp table
FROM dbo.A WITH ( NOLOCK )
WHERE A1 = <some condition>
GROUP BY a.A1 ,
a.A2 ,
a.A3
-------------------------------------------------------------
SELECT C1 ,
C2 ,....
INTO #CTE_C --data set into temp table
FROM dbo.B WITH ( NOLOCK )
WHERE a.C1 = <some condition>
GROUP BY a.c1 ,
a.C2 ,
a.C3
INSERT INTO [dbo].MainTable
SELECT A1 ,
A2 ,
A3 ,
C1 ,
C2 ,
C3
FROM #CTE_A AS ta
LEFT JOIN #CTE_C AS tc ON ta.a1 = tc.a1
AND ta.b1 = tc.b1
AND ta.c1 = tc.c1
LEFT JOIN othertable AS bs ON usd_bs.c = s.c
AND ( A1 BETWEEN bs.a1 AND bs.a1 )
AND bs.c1 = 1

TSQL: Convert old where clause to Join syntax

I have this query that I want to put into the 'modern Join syntax'
SELECT
t.acct_order,
c1.acct_desc,
c1.position,
c1.end_pos,
Budget = sum(((g.data_set-1)/2)*(g.debits - g.credits)),
Actual = sum(((g.data_set-3)/-2)*(g.debits - g.credits))
FROM
glm_chart c1,
glm_chart c2,
glh_deptsum g,
#TEMPTABLE t
WHERE
g.acct_uno=c2.acct_uno
and c1.acct_code = t.acct_code
and c2.position between c1.position and c1.end_pos
and g.debits-g.credits<>0
and c1.book=1
and g.data_set in (1, 3)
and (g.period between 201201 and 201212)
GROUP by t.acct_order,
c1.acct_desc,
c1.position,
c1.end_pos
ORDER by t.acct_order
This is what I've got so far, but as you can see, I can't work out how to join table glh_deptsum (g) to c1 or to t:
SELECT
t.acct_order,
c1.acct_desc,
c1.position,
c1.end_pos,
sum(((g.data_set-1)/2)*(g.debits - g.credits)) AS Budget,
sum(((g.data_set-3)/-2)*(g.debits - g.credits)) AS Actual
FROM #TEMPTABLE T
INNER JOIN glm_chart c1 ON c1.acct_code = t.acct_code
INNER JOIN glh_deptsum g <---- HELP WHAT GOES HERE ---------
INNER JOIN glm_chart c2 ON c2.position between c1.position and c1.end_pos AND g.acct_uno=c2.acct_uno
WHERE
g.debits - g.credits <> 0
AND c1.book=1
AND g.data_set in (1, 3)
AND (g.period between 201201 and 201212)
GROUP BY t.acct_order,
c1.acct_desc, c1.position,c1.end_pos
ORDER BY t.acct_order
Can anyone let me know what I'm doing wrong please?
It looks like it should be this:
SELECT t.acct_order,
c1.acct_desc,
c1.position,
c1.end_pos,
Budget = sum(((g.data_set-1)/2)*(g.debits - g.credits)),
Actual = sum(((g.data_set-3)/-2)*(g.debits - g.credits))
FROM #TEMPTABLE t
INNER JOIN glm_chart c1
ON t.acct_code = c1.acct_code
INNER JOIN glm_chart c2
ON c2.position between c1.position and c1.end_pos
INNER JOIN glh_deptsum g
ON c2.acct_uno = g.acct_uno
WHERE g.debits-g.credits<>0
and c1.book=1
and g.data_set in (1, 3)
and (g.period between 201201 and 201212)
GROUP by t.acct_order,
c1.acct_desc,
c1.position,
c1.end_pos
ORDER by t.acct_order
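
The reason g could not be attached to c1 or t is that the original WHERE clause only relates it to c2 (g.acct_uno = c2.acct_uno), so c2 has to be joined first. A reduced sketch of that join order, using made-up rows and hypothetical stand-in CTEs named t, c and g with a single amount column:

```sql
WITH t (acct_code, acct_order) AS (
    SELECT 'A', 1
),
c (acct_code, acct_uno, position, end_pos) AS (
    SELECT 'A', 10, 1, 3 UNION ALL   -- parent account spanning positions 1..3
    SELECT 'B', 11, 2, 2 UNION ALL   -- child inside the parent's range
    SELECT 'C', 12, 5, 5             -- outside the range
),
g (acct_uno, amount) AS (
    SELECT 11, 50 UNION ALL
    SELECT 12, 70
)
SELECT t.acct_order, c1.acct_code, SUM(g.amount) AS total
FROM t
INNER JOIN c AS c1 ON c1.acct_code = t.acct_code
INNER JOIN c AS c2 ON c2.position BETWEEN c1.position AND c1.end_pos
INNER JOIN g       ON g.acct_uno = c2.acct_uno   -- g hangs off c2, not c1 or t
GROUP BY t.acct_order, c1.acct_code;
-- Returns one row: acct_order 1, acct_code 'A', total 50
```

Only the account at position 2 falls inside the parent's range and has a row in g, so the amount of 70 for the out-of-range account is excluded.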