conditionally merge and aggregate adjacent rows in T-SQL

I have a 100K-row table in which each row represents sales during a particular time period. Usually the periods are at least a few hours long, but occasionally we get a period that's only a few minutes long. These tiny periods mess up downstream reporting, so I'd like to merge them with the preceding period. Any period that's 30 minutes or less should get merged with the previous period, with sales data summed across the merged periods. There may be zero, one, or many consecutive short periods between long periods. There are no time gaps in the data: the start of one period is always the same as the end of the previous one.
What's a good set-based way (no cursors!) to perform this merging?
Existing data (simplified) looks like this:
UnitsSold Start End
---------------------------------------------------
10 06-12-2013 08:03 06-12-2013 12:07
12 06-12-2013 12:07 06-12-2013 16:05
1 06-12-2013 16:05 06-12-2013 16:09
1 06-12-2013 16:09 06-12-2013 16:13
7 06-12-2013 16:13 06-12-2013 20:10
Desired output would look like this:
UnitsSold Start End
---------------------------------------------------
10 06-12-2013 08:03 06-12-2013 12:07
14 06-12-2013 12:07 06-12-2013 16:13
7 06-12-2013 16:13 06-12-2013 20:10
Unfortunately we're still on SQL Server 2008 R2, so we can't leverage the cool new window functions in SQL Server 2012, which might make this problem easier to solve efficiently.
There's a good discussion of a similar problem in "Merge adjacent rows in SQL?". I particularly like the PIVOT/UNPIVOT solution, but I'm stumped on how to adapt it to my problem.

My idea is to:
1. build a list containing only the long periods
2. find the start of the next long period with "outer apply"
3. sum the units with a subquery
Something like this:
declare @t table (UnitsSold int, start datetime, finish datetime)
insert into @t values (10, '20130612 08:03', '20130612 12:07')
insert into @t values (12, '20130612 12:07', '20130612 16:05')
insert into @t values (1, '20130612 16:05', '20130612 16:09')
insert into @t values (1, '20130612 16:09', '20130612 16:13')
insert into @t values (7, '20130612 16:13', '20130612 20:10')
select
    -- total units from this long period up to the start of the next long period
    (select SUM(UnitsSold) from @t t3 where t3.start >= t1.start and t3.finish <= ISNULL(oa.start, t1.finish)) as UnitsSold,
    t1.start,
    ISNULL(oa.start, t1.finish) as finish
from @t t1
outer apply (
    -- start of the next long period after t1
    select top(1) start
    from @t t2
    where datediff(minute, t2.start, t2.finish) > 30
      and t2.start >= t1.finish
    order by t2.start
) oa
-- only long periods produce an output row
where datediff(minute, t1.start, t1.finish) > 30

Using a recursive CTE:
DECLARE @t TABLE (UnitsSold INT, Start DATETIME, Finish DATETIME)
INSERT INTO @t VALUES
(10, '06-12-2013 08:03', '06-12-2013 12:07'),
(12, '06-12-2013 12:07', '06-12-2013 16:05'),
(1, '06-12-2013 16:05', '06-12-2013 16:09'),
(1, '06-12-2013 16:09', '06-12-2013 16:13'),
(7, '06-12-2013 16:13', '06-12-2013 20:10')
;WITH rec AS (
    -- Returns periods > 30 minutes
    SELECT u.UnitsSold, u.Start, u.Finish
    FROM @t u WHERE DATEDIFF(MINUTE, u.Start, u.Finish) > 30
    UNION ALL
    -- Adds on adjoining periods <= 30 minutes
    SELECT
        u.UnitsSold + r.UnitsSold,
        r.Start,
        u.Finish
    FROM rec r
    INNER JOIN @t u ON r.Finish = u.Start
        AND DATEDIFF(MINUTE, u.Start, u.Finish) <= 30)
-- Since the CTE also returns incomplete periods we need
-- to filter out the relevant periods, in this case the
-- last/max values for each start value.
SELECT
    MAX(r.UnitsSold) AS UnitsSold,
    r.Start AS Start,
    MAX(r.Finish) AS Finish
FROM rec r
GROUP BY r.Start

Using a CTE and a cumulative sum (note that IIF and SUM(...) OVER (ORDER BY ...) require SQL Server 2012 or later, so this version does not run on 2008 R2):
DECLARE @t TABLE (UnitsSold INT, Start DATETIME, Finish DATETIME)
INSERT INTO @t VALUES
(10, '06-12-2013 08:03', '06-12-2013 12:07'),
(12, '06-12-2013 12:07', '06-12-2013 16:05'),
(1, '06-12-2013 16:05', '06-12-2013 16:09'),
(1, '06-12-2013 16:09', '06-12-2013 16:13'),
(7, '06-12-2013 16:13', '06-12-2013 20:10')
;WITH groups AS (
    SELECT UnitsSold, Start, Finish,
        -- Cumulative sum, IIF returns 1 for each row that
        -- should generate a new row in the final result.
        SUM(IIF(DATEDIFF(MINUTE, Start, Finish) <= 30, 0, 1)) OVER (ORDER BY Start) csum
    FROM @t)
SELECT
    SUM(UnitsSold) UnitsSold,
    MIN(Start) Start,
    MAX(Finish) Finish
FROM groups
GROUP BY csum
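
To check the expected result outside the database, the same grouping idea can be sketched in Python. This is only an illustration, not part of either answer: the function name `merge_short_periods`, the helper `ts`, and the tuple layout are assumptions made for the example. It compares exact durations rather than counting minute boundaries as DATEDIFF does, which gives the same result for the minute-resolution sample data.

```python
from datetime import datetime, timedelta


def merge_short_periods(rows, threshold=timedelta(minutes=30)):
    """Merge each period lasting `threshold` or less into the preceding period.

    `rows` is a list of (units_sold, start, finish) tuples sorted by start,
    with no gaps between consecutive periods (as stated in the question).
    """
    merged = []
    for units, start, finish in rows:
        is_short = (finish - start) <= threshold
        if is_short and merged:
            # Extend the previous output row and add the units to it.
            prev_units, prev_start, _ = merged[-1]
            merged[-1] = (prev_units + units, prev_start, finish)
        else:
            # A long period (or a short one with nothing before it) starts a new row.
            merged.append((units, start, finish))
    return merged


def ts(text):
    """Hypothetical helper: parse 'YYYY-MM-DD HH:MM' into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


sample = [
    (10, ts("2013-06-12 08:03"), ts("2013-06-12 12:07")),
    (12, ts("2013-06-12 12:07"), ts("2013-06-12 16:05")),
    (1, ts("2013-06-12 16:05"), ts("2013-06-12 16:09")),
    (1, ts("2013-06-12 16:09"), ts("2013-06-12 16:13")),
    (7, ts("2013-06-12 16:13"), ts("2013-06-12 20:10")),
]

merged_sample = merge_short_periods(sample)
for units, start, finish in merged_sample:
    print(units, start, finish)
```

On the sample data this yields the three rows from the desired output in the question (10, 14 and 7 units).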

Related

Get dates with in consecutive 50 days of given date

Activity table:
create table #activity(id int, begin_date datetime, end_date datetime)
insert into #activity values(1, '1/1/2017', '1/31/2017')
insert into #activity values(1, '9/1/2017', '9/15/2017')
insert into #activity values(1, '4/1/2017', '4/15/2017')
insert into #activity values(1, '2/5/2017', '2/15/2017')
insert into #activity values(1, '8/1/2017', '8/31/2017')
Insert into #activity values(2, '11/1/2016', '11/15/2016')
Given an input date of 12/1/2016 and an id, I would like to get all activities within 50 days after 12/1/2016. The query should return activities with begin dates 1/1/2017, 2/5/2017 (because this is within 50 days of 1/31/2017), and 4/1/2017.
8/1/2017 and 9/1/2017 for id 1 shouldn't be selected, because 8/1 is not within 50 days of 4/15, so the 50-day cycle was broken.
TIA

The OP says:
would like to get all activities within 50 days after 12/1/2016
One possible query to achieve that result would be:
-- get all activities with a begin_date within 50 days of input_date
select *
from #activity as a
where @input_date <= a.begin_date and a.begin_date < dateadd(day, 50, @input_date)
However, the OP then says:
Query should return activities with begin dates 1/1/2017, 2/5/2017 (because this is within 50 days of 1/31/2017), and 4/1/17. 8/1/2017 and 9/1/2017 of id 1 shouldn't be selected 8/1 is not with in 50 days of 4/15 and 50 day cycle was broken.
This would indicate that you want to find all sequential activities starting after 12/1/2016 where the gap between sequential activities is less than 50 days.
One possible approach for this is to use the lag function. An example of how to use the lag function on this problem is:
select
    a.*
    , lag(a.end_date, 1, @input_date) over (order by a.end_date) as previous_end
    , datediff(day, lag(a.end_date, 1, @input_date) over (order by a.end_date), a.begin_date) as previous_end_to_this_begin
from #activity as a
where @input_date <= a.begin_date
order by a.begin_date
Building on that, the following returns every activity that begins before the first gap of more than 50 days:
-- get all activities in a row where the gap between activities is less than 50
-- note: if no gap exceeds 50 days, the subquery returns NULL and no rows are returned
select * from #activity as a where @input_date <= a.begin_date and a.begin_date < (
    select
        min(a.begin_date) as first_begin_to_not_include
    from
    (
        select
            a.begin_date
            , datediff(day, lag(a.end_date, 1, @input_date) over (order by a.end_date), a.begin_date) as previous_end_to_this_begin
        from #activity as a
        where @input_date <= a.begin_date
    ) as a
    where a.previous_end_to_this_begin > 50
)
order by a.begin_date
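
The chaining rule can also be checked with a small Python sketch. It is an illustration under stated assumptions, not the answer's method: the function name `chained_activities` is made up for the example, the activities passed in belong to a single id, and they do not overlap. Unlike the query above, it returns the whole chain when no gap exceeds the limit.

```python
from datetime import date


def chained_activities(activities, input_date, max_gap_days=50):
    """Return activities beginning on or after `input_date`, stopping at the
    first activity that begins more than `max_gap_days` after the previous end.

    `activities` is a list of (begin_date, end_date) tuples for one id.
    """
    candidates = sorted(
        (a for a in activities if a[0] >= input_date),
        key=lambda a: a[1],  # same ordering as lag(...) over (order by end_date)
    )
    chain = []
    previous_end = input_date  # mirrors the default value passed to lag()
    for begin, end in candidates:
        if (begin - previous_end).days > max_gap_days:
            break  # the 50-day cycle is broken here
        chain.append((begin, end))
        previous_end = end
    return chain


id_1_activities = [
    (date(2017, 1, 1), date(2017, 1, 31)),
    (date(2017, 9, 1), date(2017, 9, 15)),
    (date(2017, 4, 1), date(2017, 4, 15)),
    (date(2017, 2, 5), date(2017, 2, 15)),
    (date(2017, 8, 1), date(2017, 8, 31)),
]

chain = chained_activities(id_1_activities, date(2016, 12, 1))
for begin, end in chain:
    print(begin, end)
```

For id 1 this keeps the activities beginning 1/1/2017, 2/5/2017 and 4/1/2017, and stops at 8/1/2017 because it starts 108 days after 4/15/2017.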

PostgreSQL: how do I group rows by 'nearby' timestamps

Considering the following simplified situation:
create table trans
(
id integer not null
, tm timestamp without time zone not null
, val integer not null
, cus_id integer not null
);
insert into trans
(id, tm, val, cus_id)
values
(1, '2017-12-12 16:42:00', 2, 500) --
,(2, '2017-12-12 16:42:02', 4, 501) -- <--+---------+
,(3, '2017-12-12 16:42:05', 7, 502) -- |dt=54s |
,(4, '2017-12-12 16:42:56', 3, 501) -- <--+ |dt=59s
,(5, '2017-12-12 16:43:00', 2, 503) -- |
,(6, '2017-12-12 16:43:01', 5, 501) -- <------------+
,(7, '2017-12-12 16:43:15', 6, 502) --
,(8, '2017-12-12 16:44:50', 4, 501) --
;
I want to group rows by cus_id, but also where the interval between time stamps of consecutive rows for the same cus_id is less than 1 minute.
In the example above this applies to rows with id's 2, 4 and 6. These rows have the same cus_id (501) and have intervals below 1 minute. The interval id{2,4} is 54s and for id{2,6} it is 59s. The interval id{4,6} is also below 1 minute, but it is overridden by the larger interval id{2,6}.
I need a query that gives me the output:
cus_id | tm | val
--------+---------------------+-----
501 | 2017-12-12 16:42:02 | 12
(1 row)
The tm value would be the tm of the first row, i.e. with the lowest tm. The val would be the sum(val) of the grouped rows.
In the example 3 rows are grouped, but that could also be 2, 4, 5, ...
For simplicity, I only let the rows for cus_id 501 have nearby time stamps, but in my real table, there would be a lot more of them. It contains 20M+ rows.
Is this possible?

Naive (suboptimal) solution using a CTE
(a faster approach would avoid the CTE, replacing it with a joined subquery, or perhaps use a window function):
-- Step one: find the start of a cluster
-- (the start is everything after a 60 second silence)
WITH starters AS (
SELECT * FROM trans tr
WHERE NOT EXISTS (
SELECT * FROM trans nx
WHERE nx.cus_id = tr.cus_id
AND nx.tm < tr.tm
AND nx.tm >= tr.tm -'60sec'::interval
)
)
-- SELECT * FROM starters ; \q
-- Step two: join everything within 60sec to the starter
-- and aggregate the clusters
SELECT st.cus_id
, st.id AS id
, MAX(tr.id) AS max_id
, MIN(tr.tm) AS first_tm
, MAX(tr.tm) AS last_tm
, SUM(tr.val) AS val
FROM trans tr
JOIN starters st ON st.cus_id = tr.cus_id
AND st.tm <= tr.tm AND st.tm > tr.tm -'60sec'::interval
GROUP BY 1,2
ORDER BY 1,2
;
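
The two steps of that query can be mirrored in Python to see which rows end up in which cluster. This is a rough sketch for illustration only: the function name `cluster_by_starters` and the output tuple layout are assumptions, and the quadratic scan per customer is chosen for clarity, not for a 20M-row table.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cluster_by_starters(rows, window=timedelta(seconds=60)):
    """Mirror the two steps of the query above.

    `rows` is a list of (id, tm, val, cus_id) tuples.
    Returns a list of (cus_id, first_tm, total_val, row_count) tuples.
    """
    by_customer = defaultdict(list)
    for _row_id, tm, val, cus_id in rows:
        by_customer[cus_id].append((tm, val))

    clusters = []
    zero = timedelta(0)
    for cus_id in sorted(by_customer):
        items = sorted(by_customer[cus_id])
        for tm, _val in items:
            # Step one: a starter has no earlier row within the window.
            if any(zero < tm - other_tm <= window for other_tm, _ in items):
                continue
            # Step two: sum every row less than `window` after the starter.
            members = [v for other_tm, v in items if zero <= other_tm - tm < window]
            clusters.append((cus_id, tm, sum(members), len(members)))
    return clusters


trans = [
    (1, datetime(2017, 12, 12, 16, 42, 0), 2, 500),
    (2, datetime(2017, 12, 12, 16, 42, 2), 4, 501),
    (3, datetime(2017, 12, 12, 16, 42, 5), 7, 502),
    (4, datetime(2017, 12, 12, 16, 42, 56), 3, 501),
    (5, datetime(2017, 12, 12, 16, 43, 0), 2, 503),
    (6, datetime(2017, 12, 12, 16, 43, 1), 5, 501),
    (7, datetime(2017, 12, 12, 16, 43, 15), 6, 502),
    (8, datetime(2017, 12, 12, 16, 44, 50), 4, 501),
]

clusters = cluster_by_starters(trans)
multi_row_clusters = [c for c in clusters if c[3] > 1]
print(multi_row_clusters)
```

Like the query, the sketch also produces single-row clusters; filtering on the row count leaves only the cus_id 501 cluster starting at 16:42:02 with a total of 12.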

Records based on time difference

I have a very strange request. I'm trying to create an SQL statement to do this. I know I can create a cursor, but I'm trying to see if it can be done in SQL.
Here is my source data.
1 - 1:00 PM
2 - 1:02 PM
3 - 1:03 PM
4 - 1:05 PM
5 - 1:06 PM
6 - 1:09 PM
7 - 1:10 PM
8 - 1:12 PM
9 - 1:13 PM
10 - 1:15 PM
I'm trying to create a function that if I pass an interval it will return the resulting data set.
For example I pass in 5 minutes, then the records I would want back are records 1, 4, 7, & 10.
Is there a way to do this in SQL? Note: if record 4 (1:05 PM) wasn't in the data set, I would expect to see 1, 5, & 8. I would see 5 because it is the next record at least 5 minutes after record 1, and 8 because it is the next record at least 5 minutes after record 5.

Here is a create script that you should have provided:
declare @Table1 TABLE
([id] int, [time] time)
;
INSERT INTO @Table1
([id], [time])
VALUES
(1, '1:00 PM'),
(2, '1:02 PM'),
(3, '1:03 PM'),
(4, '1:05 PM'),
(5, '1:06 PM'),
(6, '1:09 PM'),
(7, '1:10 PM'),
(8, '1:12 PM'),
(9, '1:13 PM'),
(10, '1:15 PM')
;
I would do this with this query:
declare @interval int
set @interval = 5
;with next_times as(
    -- for each row, the earliest time at least @interval minutes later
    select id, [time], (select min([time]) from @Table1 t2 where t2.[time] >= dateadd(minute, @interval, t1.[time])) as next_time
    from @Table1 t1
),
t as(
    -- start from record 1, then keep jumping to each row's next_time
    select id, [time], next_time
    from next_times t1 where id=1
    union all
    select t3.id, t3.[time], t3.next_time
    from t inner join next_times t3
    on t.next_time = t3.[time]
)
select id, [time] from t order by 1
-- results:
id time
----------- ----------------
1 13:00:00.0000000
4 13:05:00.0000000
7 13:10:00.0000000
10 13:15:00.0000000
(4 row(s) affected)
It works even for the situations with a missing interval:
-- delete the 1:05 PM record
delete from @Table1 where id = 4;
;with next_times as(
    select id, [time], (select min([time]) from @Table1 t2 where t2.[time] >= dateadd(minute, @interval, t1.[time])) as next_time
    from @Table1 t1
),
t as(
    select id, [time], next_time
    from next_times t1 where id=1
    union all
    select t3.id, t3.[time], t3.next_time
    from t inner join next_times t3
    on t.next_time = t3.[time]
)
select id, [time] from t order by 1;
-- results:
id time
----------- ----------------
1 13:00:00.0000000
5 13:06:00.0000000
8 13:12:00.0000000
(3 row(s) affected)
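
The selection rule itself ("keep the first record, then the next record at least the interval later, and so on") is easy to verify with a short Python sketch. The function name `sample_by_interval` and the arbitrary calendar date used to build the timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def sample_by_interval(records, interval):
    """Keep the first record, then each next record that is at least
    `interval` after the most recently kept one.

    `records` is a list of (id, time) tuples sorted by time.
    """
    kept = []
    next_allowed = None
    for record_id, when in records:
        if next_allowed is None or when >= next_allowed:
            kept.append(record_id)
            next_allowed = when + interval
    return kept


# Minutes past 1:00 PM for records 1..10, taken from the question.
minutes = [0, 2, 3, 5, 6, 9, 10, 12, 13, 15]
records = [
    (record_id, datetime(2000, 1, 1, 13, minute))  # the date part is arbitrary
    for record_id, minute in enumerate(minutes, start=1)
]

all_records = sample_by_interval(records, timedelta(minutes=5))
without_record_4 = sample_by_interval(
    [r for r in records if r[0] != 4], timedelta(minutes=5)
)
print(all_records)
print(without_record_4)
```

This reproduces both result sets shown above: records 1, 4, 7 and 10 with the full data, and 1, 5 and 8 once record 4 is removed.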

Last Working Day is showing null while on weekend

Here is my code, but it's returning null even though today is Friday. I would like to get the last working day.
-- Insert statements for procedure here
--Below is the param you would pass
DECLARE @dateToEvaluate date=GETDATE();
--Routine
DECLARE @startDate date=CAST('1/1/'+CAST(YEAR(@dateToEvaluate) AS char(4)) AS date); -- let's get the first of the year
WITH
tally(n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1 FROM sys.all_columns),
dates AS (
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS dt_id,
        DATEADD(DAY,n,@startDate) AS dt,
        DATENAME(WEEKDAY,DATEADD(DAY,n,@startDate)) AS dt_name
    FROM tally
    WHERE n<366 --arbitrary
        AND DATEPART(WEEKDAY,DATEADD(DAY,n,@startDate)) NOT IN (6)
        AND DATEADD(DAY,n,@startDate) NOT IN (SELECT CAST(HolidayDate AS date) FROM Holiday)),
curr_id(id) AS (SELECT dt_id FROM dates WHERE dt=@dateToEvaluate)
SELECT d.dt
FROM dates AS d
CROSS JOIN
curr_id c
WHERE d.dt_id+1=c.id

The code below will take any date and "walk backward" to find the previous week day (M-F) which is not in the #holidays table.
declare @currentdate datetime = '2015-03-22'
declare @holidays table (holiday datetime)
insert @holidays values ('2015-03-20')
;with cte as (
    select
        @currentdate k
    union all
    select
        dateadd(day, -1, k)
    from cte
    where
        k = @currentdate
        or ((datepart(dw, k) + @@DATEFIRST - 1 - 1) % 7) + 1 > 5 --determine day of week independent of culture
        or k in (select holiday from @holidays)
)
select min(k) from cte
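
The same "walk backward" rule can be written as a short Python sketch for checking expected dates. The function name `previous_working_day` is an assumption made for the example, and weekends are taken to be Saturday and Sunday.

```python
from datetime import date, timedelta


def previous_working_day(current, holidays=frozenset()):
    """Return the latest day before `current` that is a weekday (Mon-Fri)
    and is not listed in `holidays`."""
    day = current - timedelta(days=1)
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    while day.weekday() >= 5 or day in holidays:
        day -= timedelta(days=1)
    return day


holidays = {date(2015, 3, 20)}  # the single holiday used in the T-SQL example
result = previous_working_day(date(2015, 3, 22), holidays)
print(result)
```

Starting from Sunday 2015-03-22, the walk skips Saturday the 21st and the holiday on Friday the 20th, and stops at Thursday 2015-03-19.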

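The dates table doesn't have any Friday dates in it, because with the default DATEFIRST setting of 7, DATEPART(WEEKDAY, ...) returns 6 for Friday. Change the NOT IN (6) to NOT IN (1, 7). This will remove Saturdays and Sundays from the dates table instead.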

Calculation of total duration of events for a day

We have a table called Events, with columns Id(int), EventDate(DateTime), EventStart(datetime) and EventEnd(datetime).
All events start and end on a single day (i.e. no events end the next day); however, events on a given date may overlap one another (one event may even cover another entirely).
Any number of events may occur on a given date.
For a single day, I would like to calculate, in T-SQL, the total duration during which at least one event was running. I can select the events on a given date, and have even written a function returning true if two events overlap and false if not.
I am stuck, however, on how to take the records in pairs and run them through my function, adding the durations appropriately until I run out of events.
Can you help?
Chris

Try this:
--test table
declare @t table(fromt datetime, tot datetime)
--test data
insert @t values('2011-01-01 10:00', '2011-01-01 11:00')
insert @t values('2011-01-01 10:00', '2011-01-01 10:05')
insert @t values('2011-01-01 10:30', '2011-01-01 11:30')
insert @t values('2011-01-01 12:00', '2011-01-01 12:30')
insert @t values('2011-01-02 12:00', '2011-01-02 12:30')
--query
;with f as
(
    select distinct fromt from @t t
    where not exists(select 1 from @t where t.fromt > fromt and t.fromt < tot)
), t as
(
    select distinct tot from @t t
    where not exists(select 1 from @t where t.tot >= fromt and t.tot < tot)
), s as
(
    select datediff(day, 0, fromt) d, datediff(second, fromt, (select min(tot)
    from t where f.fromt < tot and datediff(day, f.fromt, tot) = 0)) sec
    from f
)
select dateadd(day, 0, d) day, sum(sec)/60 [minutes]
from s
group by d
order by d
Result:
day minutes
----------------------- -------
2011-01-01 00:00:00.000 120
2011-01-02 00:00:00.000 30
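
The result can be cross-checked with a Python sketch that merges overlapping intervals and sums what remains. This uses a different technique from the query above (a sort-and-merge pass rather than finding uncovered start and end points), and the function name `covered_minutes_per_day` is an assumption made for the example. It relies on the stated rule that no event crosses midnight.

```python
from datetime import datetime


def covered_minutes_per_day(events):
    """Return {date: minutes during which at least one event was running}.

    `events` is a list of (start, end) datetimes; each event starts and
    ends on the same day.
    """
    merged = []
    for start, end in sorted(events):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the current merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    totals = {}
    for start, end in merged:
        minutes = (end - start).total_seconds() / 60
        totals[start.date()] = totals.get(start.date(), 0) + minutes
    return totals


events = [
    (datetime(2011, 1, 1, 10, 0), datetime(2011, 1, 1, 11, 0)),
    (datetime(2011, 1, 1, 10, 0), datetime(2011, 1, 1, 10, 5)),
    (datetime(2011, 1, 1, 10, 30), datetime(2011, 1, 1, 11, 30)),
    (datetime(2011, 1, 1, 12, 0), datetime(2011, 1, 1, 12, 30)),
    (datetime(2011, 1, 2, 12, 0), datetime(2011, 1, 2, 12, 30)),
]

totals = covered_minutes_per_day(events)
for day, minutes in sorted(totals.items()):
    print(day, minutes)
```

For the test data this gives 120 minutes on 2011-01-01 and 30 minutes on 2011-01-02, matching the query result.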
