Removing duplicate date periods

Removing duplicate date periods - tsql

I am working on script to get data from a database with millions of rows and have a problem with gaps in periods. We have decided that gaps less than 10 days should not be considered gaps at all. Thus, these gaps should be deleted (See example below. The bold dates form the “real” periods of interest)
ID InDate OutDate
1 2008-10-10 2009-02-05
1 2009-02-08 2009-05-13
1 2011-01-01 2011-05-20
2 2007-03-17 2008-10-19
2 2009-05-30 2010-10-12
2 2010-10-14 2010-12-31
Thus, several problems arises. The first problem is to identify which Outdates and Indates are that close to each other for the period to be transformed into a single one. The next problem is to move the Outdate from the higher row number to the lower row number (that is up the table). The last problem is to identify and get rid of the rows which are now duplicates.
I have tried to solve the question down below. The first two problems are solved in table #t4a. The strategy in table #t4aa is to get rid of the duplicates by marking the duplicate rows in question in a new (dummy) variable and get rid of all such values (1:s) in a later stage. However, it does not work! All rows are marked with a 0, even those which should be marked with an 1. Any suggestions?
--This temp table measures gaps and creates a new variable OutDate2 which in the cases of a to small gap (less than 11 days) write the next Outdate on the row instead of the original value.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,
CASE WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11 AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0 THEN nxt.OutDate ELSE cur.OutDate END AS OutDate2
INTO #t4a
FROM C cur
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)
--This temp table creates a dummy which identifies the OVERLAP of rows in order for these to be eliminated in a later temporary table. It is this table that does not work.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) rownum FROM #t4a)
SELECT cur.Id, cur.InDate, nxt.OutDate2,
CASE WHEN cur.OutDate2 < nxt.InDate THEN 1.0 ELSE 0.0
END AS Overlap
INTO #t4aa
FROM C cur
LEFT OUTER JOIN C nxt on (cur.rownum=nxt.rownum+1 AND cur.Id=nxt.Id)

This is kind of conceptual but might give you some ideas
WITH C AS
(SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
select Cgood.*
from c
join C as Cgood
on Cgood.ID = C1.ID
and Cgood.Rownum = C.Rownum + 1
and DATEDIFF(day, C.OutDate, nxt.InDate)>=11
group by Cgood.*
union
select Cgood.*
from c
join C as Cgood
on Cgood.ID = C1.ID
and Cgood.Rownum = 1
and C.Rownum = 2
and DATEDIFF(day, C.OutDate, nxt.InDate)>=11
group by Cgood.*
union
select cMerge.ID, c.Indate, cMerge.OutDate
from c
join C as cMerge
on cMerge.ID = C1.ID
and cMerge.Rownum = C.Rownum + 1
and DATEDIFF(day, C.OutDate, cMerge.InDate) < 11
group by cMerge.ID, c.Indate, cMerge.OutDate
union
select cMerge.ID, c.Indate, cMerge.OutDate
from c
join C as cMerge
on cMerge.ID = C1.ID
and cMerge.Rownum = 1
and C.Rownum = 2
and DATEDIFF(day, C.OutDate, cMerge.InDate) < 11
group b

I solved my own question yesterday. I got rid of the last temp table and incorporated creating the dummy variable in the first temp table. The core of the solution was to join backwards as well as forward.
WITH C AS (SELECT Id, InDate, OutDate, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY InDate) Rownum FROM #t4 t4)
SELECT cur.Rownum, cur.Id, cur.InDate CurInDate, cur.OutDate, nxt.InDate NxtInDate, DATEDIFF(day, cur.OutDate, nxt.InDate) Number_of_days,
CASE
WHEN DATEDIFF(day, prv.OutDate, cur.InDate)<11
AND DATEDIFF(day, prv.OutDate, cur.InDate)>0
THEN 1.0
ELSE 0.0
END AS Overlap,
CASE
WHEN DATEDIFF(day, cur.OutDate, nxt.InDate)<11
AND DATEDIFF(day, cur.OutDate, nxt.InDate)>0
THEN nxt.OutDate
ELSE cur.OutDate
END AS OutDate2
INTO #t4a
FROM C cur
LEFT OUTER JOIN C prv ON (prv.rownum=cur.rownum-1 AND prv.Id=cur.Id)
LEFT OUTER JOIN C nxt ON (nxt.rownum=cur.rownum+1 AND nxt.Id=cur.Id)

Related

Avoiding Order By in T-SQL

Below sample query is a part of my main query. I found SORT operator in below query is consuming 30% of the cost.
To avoid SORT, there is need of creation of Indexes. Is there any other way to optimize this code.
SELECT TOP 1 CONVERT( DATE, T_Date) AS T_Date
FROM TableA
WHERE ID = r.ID
AND Status = 3
AND TableA_ID >ISNULL((
SELECT TOP 1 TableA_ID
FROM TableA
WHERE ID = r.ID
AND Status <> 3
ORDER BY T_Date DESC
), 0)
ORDER BY T_Date ASC

Looks like you can use not exists rather than the sorts. I think you'll probably get a better performance boost by use a CTE or derived table instead of the a scalar subquery.
select *
from r ... left outer join
(
select ID, min(t_date) as min_date from TableA t1
where status = 3 and not exists (
select 1 from TableA t2
where t2.ID = t1.ID
and t2.status <> 3 and t2.t_date > t1.t_date
)
group by ID
) as md on md.ID = r.ID ...
or
select *
from r ... left outer join
(
select t1.ID, min(t1.t_date) as min_date
from TableA t1 left outer join TableA t2
on t2.ID = t1.ID and t2.status <> 3
where t1.status = 3 and t1.t_date < t2.t_date
group by t1.ID
having count(t2.ID) = 0
) as md on md.ID = r.ID ...
It also appears that you're relying on an identity column but it's not clear what those values mean. I'm basically ignoring it and using the date column instead.

Try this:
SELECT TOP 1 CONVERT( DATE, T_Date) AS T_Date
FROM TableA a1
LEFT JOIN (
SELECT ID, MAX(TableA_ID) AS MaxAID
FROM TableA
WHERE Status <> 3
GROUP BY ID
) a2 ON a2.ID = a1.ID AND a1.TableA_ID > coalesce(a2.MAXAID,0)
WHERE a1.ID = r.ID AND a1.Status = 3
ORDER BY T_Date ASC
The use of TOP 1 in combination with the unexplained r alias concern me. There's almost certainly a MUCH better way to get this data into your results that doesn't involve doing this in a sub query (unless this is for an APPLY operation).

Postgresql count by past weeks

select id, wk0_count
from teams
left join
(select team_id, count(team_id) as wk0_count
from (
select created_at, team_id, trunc(EXTRACT(EPOCH FROM age(CURRENT_TIMESTAMP,created_at)) / 604800) as wk_offset
from loan_files
where loan_type <> 2
order by created_at DESC) as t1
where wk_offset = 0
group by team_id) as t_wk0
on teams.id = t_wk0.team_id
I've created the query above that shows me how many loans each team did in a given week. Week 0 is the past seven days.
Ideally I want a table that shows how many loans each team did in the last 8 weeks, grouped by week. The output would look like:
Any ideas on the best way to do this?

select
t.id,
count(week = 0 or null) as wk0,
count(week = 1 or null) as wk1,
count(week = 2 or null) as wk2,
count(week = 3 or null) as wk3
from
teams t
left join
loan_files lf on lf.team_id = t.id and loan_type <> 2
cross join lateral
(select (current_date - created_at::date) / 7 as week) w
group by 1
In 9.4+ versions use the aggregate filter syntax:
count(*) filter (where week = 0) as wk0,
lateral is from 9.3. In a previous version move the week expression to the filter condition.

How about the following query?
SELECT team_id AS id, count(team_id) AS wk0_count
FROM teams LEFT JOIN loan_files ON teams.id = team_id
WHERE loan_type <> 2
AND trunc(EXTRACT(epoch FROM age(CURRENT_TIMESTAMP, created_at)) / 604800) = 0
GROUP BY team_id
Notable changes are:
ORDER BY clause in subquery was pointless;
created_at in innermost subquery was never used;
wk_offset test is moved on the WHERE clause and not done in two distinct steps;
outermost subquery was not needed.

TSQL - COUNT number of rows in a different state than current row

It's kind of hard to explain, but from this example it should be clear.
Table TABLE:
Name State Time
--------------------
A 1 1/4/2012
B 0 1/3/2012
C 0 1/2/2012
D 1 1/1/2012
Would like to
select * from TABLE where state=1 order by Time desc
plus an additional column 'Skipped' containing the number of rows after one where state=1 in state 0, in other words the output should look like this:
Name State Time Skipped
A 1 1/4/2012 2 -- 2 rows after A where State != 1
D 1 1/1/2012 0 -- 0 rows after D where State != 1
0 should also be reported in case of 2 consecutive rows are in state = 1, i.e. there is nothing between these rows in a state other than 1.
It seems like CTE are must here, but can't figure out how to count rows where state != 1.
Any help will be appreciated.
(MS Sql Server 2008)

I've used a CTE to establish RowNo, so that you're not dependent on consecutive dates:
WITH CTE_Rows as
(
select name,state,time,
rowno = ROW_NUMBER() over (order by [time])
from MyTable
)
select name,state,time,
gap = isnull(r.rowno - x.rowno - 1,0)
from
CTE_Rows r
outer apply (
select top 1 rowno
from CTE_Rows sub
where sub.rowno < r.rowno and sub.state = 1
order by sub.rowno desc) x
where r.state = 1
If you just want to do it by date, then its simpler - just need an outer apply:
select name,state,r.time,
gap = convert(int,isnull(r.time - x.time - 1,0))
from
MyTable r
outer apply (
select top 1 time
from MyTable sub
where sub.time < r.time and sub.state = 1
order by sub.time desc) x
where r.state = 1
FYI the test data is used was created as follows:
create table MyTable
(Name char(1), [state] tinyint, [Time] datetime)
insert MyTable
values
('E',1,'2012-01-05'),
('A',1,'2012-01-04'),
('B',0,'2012-01-03'),
('C',0,'2012-01-02'),
('D',1,'2012-01-01')

Okay, here you go (it gets a little messy):
SELECT U.CurrentTime,
(SELECT COUNT(*)
FROM StateTable AS T3
WHERE T3.State=0
AND T3.Time BETWEEN U.LastTime AND U.CurrentTime) AS Skipped
FROM (SELECT T1.Time AS CurrentTime,
(SELECT TOP 1 T2.Time
FROM StateTable AS T2
WHERE T2.Time < T1.Time AND T2.State=1
ORDER BY T2.Time DESC) AS LastTime
FROM StateTable AS T1 WHERE T1.State = 1) AS U

TSQL Compare 2 select's result and return result with most recent date

Wonder if someone could give me a quick hand. I have 2 select queries (as shown below) and I want to compare the results of both and only return the result that has the most recent date.
So say I have the following 2 results from the queries:-
--------- ---------- ----------------------- --------------- ------ --
COMPANY A EMPLOYEE A 2007-10-16 17:10:21.000 E-mail 6D29D6D5 SYSTEM 1
COMPANY A EMPLOYEE A 2007-10-15 17:10:21.000 E-mail 6D29D6D5 SYSTEM 1
I only want to return the result with the latest date (so the first one). I thought about putting the results into a temporary table and then querying that but just wondering if there's a simpler, more efficient way?
SELECT * FROM (
SELECT fc.accountidname, fc.owneridname, fap.actualend, fap.activitytypecodename, fap.createdby, fap.createdbyname,
ROW_NUMBER() OVER (PARTITION BY fc.accountidname ORDER BY fap.actualend DESC) AS RN
FROM FilteredContact fc
INNER JOIN FilteredActivityPointer fap ON fc.parentcustomerid = fap.regardingobjectid
WHERE fc.statecodename = 'Active'
AND fap.ownerid = '0F995BDC'
AND fap.createdon < getdate()
) tmp WHERE RN = 1
SELECT * FROM (
SELECT fa.name, fa.owneridname, fa.new_technicalaccountmanageridname, fa.new_customerid, fa.new_riskstatusname,
fa.new_numberofopencases, fa.new_numberofurgentopencases, fap.actualend, fap.activitytypecodename, fap.createdby, fap.createdbyname,
ROW_NUMBER() OVER (PARTITION BY fa.name ORDER BY fap.actualend DESC) AS RN
FROM FilteredAccount fa
INNER JOIN FilteredActivityPointer fap ON fa.accountid = fap.regardingobjectid
WHERE fa.statecodename = 'Active'
AND fap.ownerid = '0F995BDC'
AND fap.createdon < getdate()
) tmp2 WHERE RN = 1

if the tables have the same structure (column count and column types to match), then you could just union the results of the two queries, then order by the date desc and then select the top 1.
select top 1 * from
(
-- your first query
union all
-- your second query.
) T
order by YourDateColumn1 desc

You should GROUP BY and use MAX(createdon)

Dealing with periods and dates without using cursors

I would like to solve this issue avoiding to use cursors (FETCH).
Here comes the problem...
1st Table/quantity
------------------
periodid periodstart periodend quantity
1 2010/10/01 2010/10/15 5
2st Table/sold items
-----------------------
periodid periodstart periodend solditems
14343 2010/10/05 2010/10/06 2
Now I would like to get the following view or just query result
Table Table/stock
-----------------------
periodstart periodend itemsinstock
2010/10/01 2010/10/04 5
2010/10/05 2010/10/06 3
2010/10/07 2010/10/15 5
It seems impossible to solve this problem without using cursors, or without using single dates instead of periods.
I would appreciate any help.
Thanks

DECLARE #t1 TABLE (periodid INT,periodstart DATE,periodend DATE,quantity INT)
DECLARE #t2 TABLE (periodid INT,periodstart DATE,periodend DATE,solditems INT)
INSERT INTO #t1 VALUES(1,'2010-10-01T00:00:00.000','2010-10-15T00:00:00.000',5)
INSERT INTO #t2 VALUES(14343,'2010-10-05T00:00:00.000','2010-10-06T00:00:00.000',2)
DECLARE #D1 DATE
SELECT #D1 = MIN(P) FROM (SELECT MIN(periodstart) P FROM #t1
UNION ALL
SELECT MIN(periodstart) FROM #t2) D
DECLARE #D2 DATE
SELECT #D2 = MAX(P) FROM (SELECT MAX(periodend) P FROM #t1
UNION ALL
SELECT MAX(periodend) FROM #t2) D
;WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS i FROM L4),
Dates AS(SELECT DATEADD(DAY,i-1,#D1) AS D FROM Nums where i <= 1+DATEDIFF(DAY,#D1,#D2)) ,
Stock As (
SELECT D ,t1.quantity - ISNULL(t2.solditems,0) AS itemsinstock
FROM Dates
LEFT OUTER JOIN #t1 t1 ON t1.periodend >= D and t1.periodstart <= D
LEFT OUTER JOIN #t2 t2 ON t2.periodend >= D and t2.periodstart <= D ),
NStock As (
select D,itemsinstock, ROW_NUMBER() over (order by D) - ROW_NUMBER() over (partition by itemsinstock order by D) AS G
from Stock)
SELECT MIN(D) AS periodstart, MAX(D) AS periodend, itemsinstock
FROM NStock
GROUP BY G, itemsinstock
ORDER BY periodstart

Hopefully a little easier to read than Martin's. I used different tables and sample data, hopefully extrapolating the right info:
CREATE TABLE [dbo].[Quantity](
[PeriodStart] [date] NOT NULL,
[PeriodEnd] [date] NOT NULL,
[Quantity] [int] NOT NULL
) ON [PRIMARY]
CREATE TABLE [dbo].[SoldItems](
[PeriodStart] [date] NOT NULL,
[PeriodEnd] [date] NOT NULL,
[SoldItems] [int] NOT NULL
) ON [PRIMARY]
INSERT INTO Quantity (PeriodStart,PeriodEnd,Quantity)
SELECT '20100101','20100115',5
INSERT INTO SoldItems (PeriodStart,PeriodEnd,SoldItems)
SELECT '20100105','20100107',2 union all
SELECT '20100106','20100108',1
The actual query is now:
;WITH Dates as (
select PeriodStart as DateVal from SoldItems union select PeriodEnd from SoldItems union select PeriodStart from Quantity union select PeriodEnd from Quantity
), Periods as (
select d1.DateVal as StartDate, d2.DateVal as EndDate
from Dates d1 inner join Dates d2 on d1.DateVal < d2.DateVal left join Dates d3 on d1.DateVal < d3.DateVal and d3.DateVal < d2.DateVal where d3.DateVal is null
), QuantitiesSold as (
select StartDate,EndDate,COALESCE(SUM(si.SoldItems),0) as Quantity
from Periods p left join SoldItems si on p.StartDate < si.PeriodEnd and si.PeriodStart < p.EndDate
group by StartDate,EndDate
)
select StartDate,EndDate,q.Quantity - qs.Quantity
from QuantitiesSold qs inner join Quantity q on qs.StartDate < q.PeriodEnd and q.PeriodStart < qs.EndDate
And the result is:
StartDate EndDate (No column name)
2010-01-01 2010-01-05 5
2010-01-05 2010-01-06 3
2010-01-06 2010-01-07 2
2010-01-07 2010-01-08 4
2010-01-08 2010-01-15 5
Explanation: I'm using three Common Table Expressions. The first (Dates) is gathering all of the dates that we're talking about, from the two tables involved. The second (Periods) selects consecutive values from the Dates CTE. And the third (QuantitiesSold) then finds items in the SoldItems table that overlap these periods, and adds their totals together. All that remains in the outer select is to subtract these quantities from the total quantity stored in the Quantity Table

John, what you could do is a WHILE loop. Declare and initialise 2 variables before your loop, one being the start date and the other being end date. Your loop would then look like this:
WHILE(#StartEnd <= #EndDate)
BEGIN
--processing goes here
SET #StartEnd = #StartEnd + 1
END
You would need to store your period definitions in another table, so you could retrieve those and output rows when required to a temporary table.
Let me know if you need any more detailed examples, or if I've got the wrong end of the stick!

Damien,
I am trying to fully understand your solution and test it on a large scale of data, but I receive following errors for your code.
Msg 102, Level 15, State 1, Line 20
Incorrect syntax near 'Dates'.
Msg 102, Level 15, State 1, Line 22
Incorrect syntax near ','.
Msg 102, Level 15, State 1, Line 25
Incorrect syntax near ','.

Damien,
Based on your solution I also wanted to get a neat display for StockItems without overlapping dates. How about this solution?
CREATE TABLE [dbo].[SoldItems](
[PeriodStart] [datetime] NOT NULL,
[PeriodEnd] [datetime] NOT NULL,
[SoldItems] [int] NOT NULL
) ON [PRIMARY]
INSERT INTO SoldItems (PeriodStart,PeriodEnd,SoldItems)
SELECT '20100105','20100106',2 union all
SELECT '20100105','20100108',3 union all
SELECT '20100115','20100116',1 union all
SELECT '20100101','20100120',10
;WITH Dates as (
select PeriodStart as DateVal from SoldItems
union
select PeriodEnd from SoldItems
union
select PeriodStart from Quantity
union
select PeriodEnd from Quantity
), Periods as (
select d1.DateVal as StartDate, d2.DateVal as EndDate
from Dates d1
inner join Dates d2 on d1.DateVal < d2.DateVal
left join Dates d3 on d1.DateVal < d3.DateVal and
d3.DateVal < d2.DateVal where d3.DateVal is null
), QuantitiesSold as (
select StartDate,EndDate,SUM(si.SoldItems) as Quantity
from Periods p left join SoldItems si on p.StartDate < si.PeriodEnd and si.PeriodStart < p.EndDate
group by StartDate,EndDate
)
select StartDate,EndDate, qs.Quantity
from QuantitiesSold qs
where qs.quantity is not null

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Removing duplicate date periods - tsql

Related

Avoiding Order By in T-SQL

Postgresql count by past weeks

TSQL - COUNT number of rows in a different state than current row

TSQL Compare 2 select's result and return result with most recent date

Dealing with periods and dates without using cursors

Categories

Resources