LEFT OUTER JOIN Increasing SUM. How to Prevent This? - tsql

Simplified version of query below, but fundamental gist of it:
WITH ClientSpend AS
(
SELECT
c.ClientName
, CONVERT(INT, (ROUND(SUM(CASE WHEN e.Type = 1 THEN e.Dollars ELSE 0 END), 0))) AS 1_Dollars
, CONVERT(INT, (ROUND(SUM(CASE WHEN e.Type = 2 THEN e.Dollars ELSE 0 END), 0))) AS 2_Dollars
-- There's a bunch more of these for different 'Types'
FROM Expense e WITH(NOLOCK)
INNER JOIN Client c WITH(NOLOCK)
ON c.ClientID = e.ClientID
GROUP BY c.ClientName
)
SELECT
ClientName
, 1_Dollars
, 2_Dollars
FROM ClientSpend
GROUP BY ClientName
Type 2 has its own Expense table which breaks out into more granular detail that I need for a final CASE/SUM line in the CTE SELECT.
I tried testing the above query with a LEFT JOIN to this [ExpenseType2] table ON as many indexes as I can, and I noticed that the SUM on the 2_Dollars is higher when doing this. I'm assuming it's making multiple records even though I'm not selecting anything from the [ExpenseType2] table.
How do I prevent this?
Thanks,

Why not using a sub query? Or a simple select statement? Or you may need to assert more information with sample data.
SELECT first_name, 1_Dollars, 2_Dollars FROM
(
SELECT
c.ClientName
, CONVERT(INT, (ROUND(SUM(CASE WHEN e.Type = 1 THEN e.Dollars ELSE 0 END), 0))) AS 1_Dollars
, CONVERT(INT, (ROUND(SUM(CASE WHEN e.Type = 2 THEN e.Dollars ELSE 0 END), 0))) AS 2_Dollars
-- There's a bunch more of these for different 'Types'
FROM Expense e WITH(NOLOCK)
INNER JOIN Client c WITH(NOLOCK)
ON c.ClientID = e.ClientID
GROUP BY c.ClientName
) a

Related

How to repeat some data points in query results?

I am trying to get the max date by account from 3 different tables and view those dates side by side. I created a separate query for each table, merged the results with UNION ALL, and then wrapped all that in a PIVOT.
The first 2 sections in the link/pic below show what I have been able to accomplish and the 3rd section is what I would like to do.
Query results by step
How can I get the results from 2 of the tables to repeat? Is that possible?
--define var_ent_type = 'ACOM'
--define var_ent_id = '52766'
--define var_dict_id = 113
SELECT
*
FROM
(
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY' as "TableName",
PS.DICTIONARY_ID,
to_char(MAX(PS.END_EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN PERFORMDBO.PERF_SUMMARY PS ON (PS.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND PS.DICTIONARY_ID >= 100
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'PERF_SUMMARY',
PS.DICTIONARY_ID
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(H.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN HOLDINGDBO.POSITION H ON (H.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'POSITION',
1
union all
SELECT
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY' as "TableName",
0 as DICTIONARY_ID,
to_char(MAX(C.EFFECTIVE_DATE), 'YYYY-MM-DD') as "MaxDate"
FROM
RULESDBO.ENTITY E
INNER JOIN CASHDBO.CASH_ACTIVITY C ON (C.ENTITY_ID = E.ENTITY_ID)
WHERE
1=1
-- AND E.ENTITY_TYPE = '&var_ent_type'
-- AND E.ENTITY_ID = '&var_ent_id'
AND (E.ACTIVE_STATUS <> 'N' )--and E.TERMINATION_DATE is null )
GROUP BY
E.ENTITY_TYPE,
E.ENTITY_ID,
'CASH_ACTIVITY',
1
--ORDER BY
-- 2,3, 4
)
PIVOT
(
MAX("MaxDate")
FOR "TableName"
IN ('CASH_ACTIVITY', 'PERF_SUMMARY','POSITION')
)
Everything is possible. You only need a window function to make the value repeat across rows w/o data.
--Assuming current query is QC
With QC as (
...
)
select code, account, grouping,
--cash,
first_value(cash) over (partition by code, account order by grouping asc rows unbounded preceding) as cash_repeat,
perf,
--pos,
first_value(pos) over (partition by code, account order by grouping asc rows unbounded preceding) as pos_repeat
from QC
;
See first_value() help here: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/FIRST_VALUE.html#GUID-D454EC3F-370C-4C64-9B11-33FCB10D95EC

View query without sub selecting T-SQL

so I'm trying to build a view query but I keep failing using only joins so I ended up with this deformation.. Any tips on how I can write this query so I don't have to use 6 subselects?
The FeeSum and PaymentSum can be null, so ideally I do not want those in my result set and I also wouldn't like results where the FeeSum and the PaymentSum are equal.
Quick note: client is the table where the clients informations are stored (name, adress, etc..)
customer has a fk on client and is kind of a shell table for the client that store more information for the client,
payment is a list of all payments a customer did,
order is a list of all orders a customer did.
The goal is to get a list where we can track which customer has open fees to pay, based on the orders. It's a legacy project so don't ask why people can order before paying :)
SELECT
cu.Id as [CustomerId]
, CASE
WHEN cl.IsPerson = 1
THEN cl.[AdditionalName] + ' ' + cl.[Name]
ELSE cl.AdditionalName
END as [Name]
, cl.CustomerNumber
, (SELECT SUM(o.Fee) FROM [publication].[Order] o WHERE o.[State] = 2 AND o.CustomerId = cu.Id) as [FeeSum]
, (SELECT SUM(p.Amount) FROM [publication].[Payment] p WHERE p.CustomerId = cu.Id) as [PaymentSum]
, (SELECT MAX(o.OrderDate) FROM [publication].[Order] o WHERE o.[State] = 2 AND o.CustomerId = cu.Id) as [LastOrderDate]
, (SELECT MAX(p.PaymentDate) FROM [publication].[Payment] p WHERE p.CustomerId = cu.Id) as [LastPaymentDate]
, (SELECT MAX(f.Created) FROM [client].[File] f WHERE f.TemplateName = 'Reminder' AND f.ClientId = cl.Id) as [LastReminderDate]
, (SELECT MAX(f.Created) FROM [client].[File] f WHERE f.TemplateName = 'Warning' AND f.ClientId = cl.Id) as [LastWarningDate]
FROM
[publication].[Customer] cu
JOIN
[client].[Client] cl
ON cl.Id = cu.ClientId
WHERE
cu.[Type] = 0
Thanks in advance and I hope I didn't do anything wrong.
Kind regards
You could rewrite the correlated subqueries to instead use joins:
SELECT
cu.Id AS [CustomerId],
CASE WHEN cl.IsPerson = 1
THEN cl.[AdditionalName] + ' ' + cl.[Name]
ELSE cl.AdditionalName END AS [Name],
cl.CustomerNumber,
o.FeeSum,
p.PaymentSum,
o.LastOrderDate,
p.LastPaymentDate,
f.LastReminderDate,
f.LastWarningDate
FROM [publication].[Customer] cu
INNER JOIN [client].[Client] cl
ON cl.Id = cu.ClientId
INNER JOIN
(
SELECT CustomerId, SUM(Fee) AS [FeeSum], MAX(OrderDate) AS [LastOrderDate]
FROM [publication].[Order]
WHERE o.[State] = 2
GROUP BY CustomerId
) o
ON o.CustomerId = cu.Id
INNER JOIN
(
SELECT CustomerId, SUM(Amount) AS [PaymentSum], MAX(PaymentDate) AS [LastPaymentDate]
FROM [publication].[Payment]
WHERE o.[State] = 2
GROUP BY CustomerId
) p
ON p.CustomerId = cu.Id
INNER JOIN
(
SELECT ClientId,
MAX(CASE WHEN TemplateName = 'Reminder' THEN Created END) AS [LastReminderDate],
MAX(CASE WHEN TemplateName = 'Warning' THEN Created END) AS [LastWarningDate]
FROM [client].[File]
GROUP BY ClientId
) f
ON f.ClientId = cl.Id
WHERE
cu.[Type] = 0;

SUM(CASE WHEN ...) returns a greater number than COUNT(DISTINCT..)

I have written a query in two models, but I can't figure out why the second query returns a greater number than the first one; while the number that the first one, COUNT(DISTINCT...) returns is correct:
WITH types(id) AS (VALUES('{1, 4, 5, 3}'::INTEGER[])),
date_gen64 AS
(
SELECT CAST (generate_series(date '10/1/2017', date '11/15/2017', interval
'1 day') AS date) as days ORDER BY days)
SELECT cl.class_date AS c_date,
count(DISTINCT (CASE WHEN co.id = 1 THEN p.id END)),
count(DISTINCT (CASE WHEN co.id = 2 THEN p.id END))
FROM person p
JOIN envelope e ON e.personID = p.id
JOIN "class" cl on cl.id = p.classID
JOIN course co ON co.id = cl.course_id AND co.id = 1
JOIN types ON cr.type_id = ANY (types.id)
RIGHT JOIN date_gen64 dg ON dg.days = cl.class_date
GROUP BY cl.class_date
ORDER BY cl.class_date
The above query returns 26 but following query returns 27!
The reason why I rewrote it with SUM is that the first query
was too slow. But my question is that why the second one counts more?
WITH types(id) AS (VALUES('{1, 4, 5, 3}'::INTEGER[]))
SELECT tmpcl.days,
SUM(CASE WHEN tmp80.course_id = 1 THEN 1
ELSE 0 END),
SUM(CASE WHEN tmp80.course_id = 2 THEN 1
ELSE 0 END)
FROM (
SELECT CAST (generate_series(date '10/1/2017', date '11/15/2017',
interval '1 day') AS date) as days ORDER BY days) tmpcl
LEFT JOIN (
SELECT DISTINCT p.id AS "person_id",
cl.class_date AS c_date,
co.id AS "course_id"
FROM person p
JOIN envelope e ON e.personID = p.id
JOIN "class" cl on cl.id = p.classID
JOIN course co ON co.id = cl.course_id
JOIN types ON cr.type_id = ANY (types.id)
WHERE co.id IN ( 1 , 2 )
) tmp80 ON tmpcl.days = tmp80.class_date
GROUP BY tmpcl.days
ORDER BY tmpcl.days
You can theoretically have multiple people enrolled in the same class on the same day. Indeed that would seem to be the main point of having classes. So each time there are multiple people assigned to the same class on the same day you can have a higher count than you would in your first query. Does that make sense?
You don't appear to be using p.id in that inner query so simply remove it and your counts should match.
WITH types(id) AS (VALUES('{1, 4, 5, 3}'::INTEGER[]))
SELECT tmpcl.days,
SUM(CASE WHEN tmp80.course_id = 1 THEN 1
ELSE 0 END),
SUM(CASE WHEN tmp80.course_id = 2 THEN 1
ELSE 0 END)
FROM (
SELECT CAST (generate_series(date '10/1/2017', date '11/15/2017',
interval '1 day') AS date) as days ORDER BY days) tmpcl
LEFT JOIN (
SELECT DISTINCT cl.class_date AS c_date,
co.id AS "course_id"
FROM person p
JOIN envelope e ON e.personID = p.id
JOIN "class" cl on cl.id = p.classID
JOIN course co ON co.id = cl.course_id
JOIN types ON cr.type_id = ANY (types.id)
WHERE co.id IN ( 1 , 2 )
) tmp80 ON tmpcl.days = tmp80.class_date
GROUP BY tmpcl.days
ORDER BY tmpcl.days

Avoiding Order By in T-SQL

Below sample query is a part of my main query. I found SORT operator in below query is consuming 30% of the cost.
To avoid SORT, there is need of creation of Indexes. Is there any other way to optimize this code.
SELECT TOP 1 CONVERT( DATE, T_Date) AS T_Date
FROM TableA
WHERE ID = r.ID
AND Status = 3
AND TableA_ID >ISNULL((
SELECT TOP 1 TableA_ID
FROM TableA
WHERE ID = r.ID
AND Status <> 3
ORDER BY T_Date DESC
), 0)
ORDER BY T_Date ASC
Looks like you can use not exists rather than the sorts. I think you'll probably get a better performance boost by use a CTE or derived table instead of the a scalar subquery.
select *
from r ... left outer join
(
select ID, min(t_date) as min_date from TableA t1
where status = 3 and not exists (
select 1 from TableA t2
where t2.ID = t1.ID
and t2.status <> 3 and t2.t_date > t1.t_date
)
group by ID
) as md on md.ID = r.ID ...
or
select *
from r ... left outer join
(
select t1.ID, min(t1.t_date) as min_date
from TableA t1 left outer join TableA t2
on t2.ID = t1.ID and t2.status <> 3
where t1.status = 3 and t1.t_date < t2.t_date
group by t1.ID
having count(t2.ID) = 0
) as md on md.ID = r.ID ...
It also appears that you're relying on an identity column but it's not clear what those values mean. I'm basically ignoring it and using the date column instead.
Try this:
SELECT TOP 1 CONVERT( DATE, T_Date) AS T_Date
FROM TableA a1
LEFT JOIN (
SELECT ID, MAX(TableA_ID) AS MaxAID
FROM TableA
WHERE Status <> 3
GROUP BY ID
) a2 ON a2.ID = a1.ID AND a1.TableA_ID > coalesce(a2.MAXAID,0)
WHERE a1.ID = r.ID AND a1.Status = 3
ORDER BY T_Date ASC
The use of TOP 1 in combination with the unexplained r alias concern me. There's almost certainly a MUCH better way to get this data into your results that doesn't involve doing this in a sub query (unless this is for an APPLY operation).

Removing duplicate date periods

I am working on script to get data from a database with millions of rows and have a problem with gaps in periods. We have decided that gaps less than 10 days should not be considered gaps at all. Thus, these gaps should be deleted (See example below. The bold dates form the “real” periods of interest)
ID InDate OutDate
1 2008-10-10 2009-02-05
1 2009-02-08 2009-05-13
1 2011-01-01 2011-05-20
2 2007-03-17 2008-10-19
2 2009-05-30 2010-10-12
2 2010-10-14 2010-12-31
Thus, several problems arises. The first problem is to identify which Outdates and Indates are that close to each other for the period to be transformed into a single one. The next problem is to move the Outdate from the higher row number to the lower row number (that is up the table). The last problem is to identify and get rid of the rows which are now duplicates.
I have tried to solve the question down below. The first two problems are solved in table #t4a. The strategy in table #t4aa is to get rid of the duplicates by marking the duplicate rows in question in a new (dummy) variable and get rid of all such values (1:s) in a later stage. However, it does not work! All rows are marked with a 0, even those which should be marked with an 1. Any suggestions?
--This temp table measures gaps and creates a new variable OutDate2 which in the cases of a to small gap (less than 11 days) write the next Outdate on the row instead of the original value.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,
CASE WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11 AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0 THEN nxt.OutDate ELSE cur.OutDate END AS OutDate2
INTO #t4a
FROM C cur
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)
--This temp table creates a dummy which identifies the OVERLAP of rows in order for these to be eliminated in a later temporary table. It is this table that does not work.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) rownum FROM #t4a)
SELECT cur.Id, cur.InDate, nxt.OutDate2,
CASE WHEN cur.OutDate2 < nxt.InDate THEN 1.0 ELSE 0.0
END AS Overlap
INTO #t4aa
FROM C cur
LEFT OUTER JOIN C nxt on (cur.rownum=nxt.rownum+1 AND cur.Id=nxt.Id)
This is kind of conceptual but might give you some ideas
WITH C AS
(SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
select Cgood.*
from c
join C as Cgood
on Cgood.ID = C1.ID
and Cgood.Rownum = C.Rownum + 1
and DATEDIFF(day, C.OutDate, nxt.InDate)>=11
group by Cgood.*
union
select Cgood.*
from c
join C as Cgood
on Cgood.ID = C1.ID
and Cgood.Rownum = 1
and C.Rownum = 2
and DATEDIFF(day, C.OutDate, nxt.InDate)>=11
group by Cgood.*
union
select cMerge.ID, c.Indate, cMerge.OutDate
from c
join C as cMerge
on cMerge.ID = C1.ID
and cMerge.Rownum = C.Rownum + 1
and DATEDIFF(day, C.OutDate, cMerge.InDate) < 11
group by cMerge.ID, c.Indate, cMerge.OutDate
union
select cMerge.ID, c.Indate, cMerge.OutDate
from c
join C as cMerge
on cMerge.ID = C1.ID
and cMerge.Rownum = 1
and C.Rownum = 2
and DATEDIFF(day, C.OutDate, cMerge.InDate) < 11
group b
I solved my own question yesterday. I got rid of the last temp table and incorporated creating the dummy variable in the first temp table. The core of the solution was to join backwards as well as forward.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,
CASE
WHEN DATEDIFF(day, prv.OutDate, cur.InDate)<11
AND DATEDIFF(day, prv.OutDate, cur.InDate)>0
THEN 1.0
ELSE 0.0
END AS Overlap,
CASE
WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11
AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0
THEN nxt.OutDate
ELSE cur.OutDate
END AS OutDate2
INTO #t4a
FROM C cur
LEFT OUTER JOIN C prv ON (prv.rownum=cur.rownum-1 AND prv.Id=cur.Id)
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)