TSQL: Grouping continuous timeslots together

I have to group continuous timeslots together:
Example:
DECLARE @TEST as Table (ID int, tFrom datetime, tUntil datetime)
insert into @TEST Values (1,'2019-1-1 12:00', '2019-1-1 13:00')
insert into @TEST Values (1,'2019-1-1 13:00', '2019-1-1 14:00')
insert into @TEST Values (1,'2019-1-1 14:00', '2019-1-1 16:00')
insert into @TEST Values (1,'2019-1-1 18:00', '2019-1-1 19:00')
insert into @TEST Values (1,'2019-1-1 19:00', '2019-1-1 20:00')
insert into @TEST Values (1,'2019-1-1 20:00', '2019-1-1 21:00')
insert into @TEST Values (1,'2019-1-1 22:00', '2019-1-1 23:00')
insert into @TEST Values (2,'2019-1-1 12:00', '2019-1-1 13:00')
insert into @TEST Values (2,'2019-1-1 13:00', '2019-1-1 14:00')
insert into @TEST Values (2,'2019-1-1 14:00', '2019-1-1 16:00')
insert into @TEST Values (2,'2019-1-1 18:00', '2019-1-1 19:00')
insert into @TEST Values (2,'2019-1-1 19:00', '2019-1-1 20:00')
insert into @TEST Values (2,'2019-1-1 20:00', '2019-1-1 21:00')
insert into @TEST Values (2,'2019-1-1 22:00', '2019-1-1 23:00')
Expected result:
1; 2019-1-1 12:00; 2019-1-1 16:00
1; 2019-1-1 18:00; 2019-1-1 21:00
1; 2019-1-1 22:00; 2019-1-1 23:00
2; 2019-1-1 12:00; 2019-1-1 16:00
2; 2019-1-1 18:00; 2019-1-1 21:00
2; 2019-1-1 22:00; 2019-1-1 23:00

This is a classic gaps-and-islands problem.
The key here is how to identify the groups.
If the difference between tFrom and tUntil is always exactly one hour, you can ignore the tUntil and work only based on the differences between tFrom of different records.
Use a common table expression to identify the groups, and then select min(tFrom) and max(tUntil) from it, grouped by id and group.
To compute the group key, take the difference in hours between tFrom and some fixed date, and subtract that value from ROW_NUMBER() ordered by tFrom (and partitioned by id in this case).
That way, hourly-consecutive values of tFrom all end up with the same group key:
WITH CTE AS
(
    SELECT ID,
           tFrom,
           tUntil,
           ROW_NUMBER() OVER(PARTITION BY id ORDER BY tFrom) -
               DATEDIFF(HOUR, '2019-01-01', tFrom) As grp
    FROM @Test
)
SELECT ID,
       MIN(tFrom) As tFrom,
       MAX(tUntil) As tUntil
FROM CTE
GROUP BY ID, grp
ORDER BY Id, tFrom
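To see what the CTE produces, here are the intermediate values for ID = 1, computed by hand from the sample data above (so treat them as illustrative):

tFrom           rownum  hours-diff  grp
2019-1-1 12:00       1          12  -11
2019-1-1 13:00       2          13  -11
2019-1-1 14:00       3          14  -11
2019-1-1 18:00       4          18  -14
2019-1-1 19:00       5          19  -14
2019-1-1 20:00       6          20  -14
2019-1-1 22:00       7          22  -15

Each contiguous hourly run keeps a constant grp value, so grouping by ID and grp collapses every island into a single row.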
If the difference between tFrom and tUntil is not fixed, then identifying the groups becomes more cumbersome.
I came up with a solution involving three common table expressions: the first gets the difference between the current row's tUntil and the next row's tFrom, the second calculates a group divider based on the previous row's difference, and the third calculates the group id as a running sum of the dividers:
WITH CTE1 AS
(
    SELECT ID,
           tFrom,
           tUntil,
           DATEDIFF(HOUR, tUntil, LEAD(tFrom) OVER(PARTITION BY id ORDER BY tFrom)) As DiffNext
    FROM @Test
), CTE2 AS
(
    SELECT ID,
           tFrom,
           tUntil,
           ISNULL(SIGN(LAG(DiffNext) OVER(PARTITION BY id ORDER BY tFrom)), 1) AS GroupDivider
    FROM CTE1
), CTE3 AS
(
    SELECT ID,
           tFrom,
           tUntil,
           SUM(GroupDivider) OVER(PARTITION BY id ORDER BY tFrom) As GroupId
    FROM CTE2
)
SELECT ID,
       MIN(tFrom) As tFrom,
       MAX(tUntil) As tUntil
FROM CTE3
GROUP BY ID, GroupId
ORDER BY ID, tFrom
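Again for ID = 1, the intermediate columns work out as follows (hand-computed from the sample data):

tFrom - tUntil   DiffNext  GroupDivider  GroupId
12:00 - 13:00           0             1        1
13:00 - 14:00           0             0        1
14:00 - 16:00           2             0        1
18:00 - 19:00           0             1        2
19:00 - 20:00           0             0        2
20:00 - 21:00           1             0        2
22:00 - 23:00        NULL             1        3

A row opens a new group (divider 1) exactly when the previous row had a positive gap to its successor, and the running SUM turns those markers into group ids.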

Good day,
To get a flexible solution that also covers overlaps in the time ranges, there are several approaches we can use. The "Gaps & Islands" approach is not the best one from a performance point of view, but it works, and there are worse options as well (like using a loop/cursor). Since "Gaps & Islands" is the phrase that was mentioned in the comments and in the solution discussed there, I will first show this solution in short.
The solution using the "Gaps & Islands" approach is based on two steps (one query using CTEs). First, you split the ranges into points in time. Next, using a "Numbers" table, or better in this case a "Times" table, you can get the final result set by finding the gaps between the points, which is a classic "Gaps & Islands" problem.
I HIGHLY recommend following the post I published on this topic from start to end! There are limitations and disadvantages to this approach which you must understand. Moreover, the post presents the way of thinking, showing how we solve problems like this step by step.
In the post I start with the simplest case, ranges of integers, for example 2-4, 6-8, 8-10, 13-14, which should be grouped into 2-4, 6-10, 13-14.
Next I explain an issue related to the resolution of the space between the ranges, and I present a solution for ranges of decimal numbers which covers that issue.
Finally, using the solution presented in detail for integers, I present a solution for "Grouping continuous time-slots together", which was the original question in the forum.
Note! The solution presented here is probably not the one I would recommend for production. In my next post I publish a totally different approach, using a personal trick, which can improve performance dramatically.
In short, for the sake of the discussion I will create a Times table (you can use a Numbers table directly if you really want). Notice that I create the Times table from a Numbers table.
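If you don't have a Numbers table yet, here is one minimal way to generate one (a sketch; it assumes a single integer column N starting at 0, sized so that N*10 minutes from the 2010-01-01 base date reaches past the dates in the question):

DROP TABLE IF EXISTS Numbers
GO
-- ~1,000,000 rows * 10 minutes covers roughly 19 years from the base date
SELECT TOP (1000000) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
INTO Numbers
FROM sys.all_objects a CROSS JOIN sys.all_objects b
GO
CREATE UNIQUE CLUSTERED INDEX IX_N ON Numbers(N)
GO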
DROP TABLE IF EXISTS Times
GO
SELECT DT = DATEADD(MINUTE, N*10, '2010-01-01')
INTO Times
FROM Numbers
GO
CREATE CLUSTERED INDEX IX_DT ON Times(DT)
GO
SELECT TOP 1000 DT from Times
GO
Using this table, we can solve the issue:
;WITH MyCTE01 AS (
    SELECT DISTINCT ID, DT
    FROM TEST t
    INNER JOIN Times dt ON dt.DT BETWEEN t.tFrom AND t.tUntil
)
, MyCTE02 AS (
    SELECT ID, DT,
        MyGroup = DATEDIFF(MINUTE,
            DATEADD(MINUTE, 10 * ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ID, DT), 0),
            DT)
    FROM MyCTE01
    --ORDER BY ID, DT
)
SELECT ID, MIN(DT) tFrom, MAX(DT) tUntil
FROM MyCTE02
GROUP BY ID, MyGroup
ORDER BY ID, tFrom
GO
Note! I highly recommend checking the second post (Part 2) before choosing the solution that fits your production needs.
I hope this covers the discussion and that it was helpful.

Related

How to update duplicate rows in a table in postgresql

I have created synthetic data for a typical call center.
Below is a screenshot of the table I have created (Table 1; screenshot not reproduced here).
Problem statement: Since this is completely random data, I noticed that there are some customers who are assigned to the same agents whenever they call again.
So using this query I was able to test such a case and count the number of times agents are repeated for each customer:
select agentid, customerid, count(customerid) from aa_dev.calls group by agentid, customerid having count(customerid) > 1 ;
Table 2: I have a separate agents table called aa_dev.agents in which the agent ids are stored.
Now I want to replace the agentid in such cases, so that if an agentid is repeated 6 times for a single customer, then 5 of those times the agentid should be updated with some other agentid from the table, but the call times shouldn't overlap. That means the agent we are replacing with should not be busy at the time the call is going on.
I have assigned row numbers to each repeated one:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY agentid, customerid ORDER BY random()) rn,
COUNT(*) OVER (PARTITION BY agentid, customerid) cnt
FROM aa_dev.calls
)
SELECT agentid, customerid, rn
FROM cte
WHERE cnt > 1;
This way I could visualize the repetition clearly.
So I don't want to update row 1, but the rest.
Is there any way I can achieve this? Can I use the row number and write a query that updates rn = 2 onwards, row by row, with each row getting a unique agent?
If you don't want duplicates in your artificial data, it's probably better to not generate them.
But if you already have a table with duplicates and want to work on the duplicates, either updating them or deleting, here is the easy way:
You need a unique ID for each updated row. If you don't have one, add it temporarily. Then you can use this pattern to update all duplicates except the first one:
To add an artificial id column to a preexisting table, use:
ALTER TABLE calls ADD id serial;
In my case I generated a test table with 100 random rows:
CREATE TEMP TABLE calls (id serial, agentid int, customerid int);
INSERT INTO calls (agentid, customerid)
SELECT (random()*10)::int, (random()*10)::int
FROM generate_series(1, 100) n;
Define what constitutes a duplicate and find duplicates in data:
SELECT agentid, customerid, count(*), array_agg(id) id
FROM calls
GROUP BY 1,2 HAVING count(*)>1
ORDER BY 1,2;
Update all the duplicate rows except the first one:
UPDATE calls SET agentid = whatever_needed
FROM (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
Alternatively, remove all duplicates except the first one:
DELETE FROM calls
USING (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
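The question's extra requirement, that the replacement agent must not be busy during the call, needs the call's time span. As a hedged sketch only: assuming hypothetical call_start/call_end timestamp columns on aa_dev.calls and an agentid column on aa_dev.agents (the real column names were only in the screenshots), the same duplicate-detection subquery can be combined with a NOT EXISTS overlap check:

-- Hypothetical sketch: call_start/call_end and agents.agentid are assumed names.
-- Caveat: rows are updated independently, so two overlapping duplicates could
-- still receive the same replacement agent; re-run the duplicate check afterwards.
UPDATE aa_dev.calls c
SET agentid = (
    SELECT a.agentid
    FROM aa_dev.agents a
    WHERE a.agentid <> c.agentid
      AND NOT EXISTS (
          SELECT 1 FROM aa_dev.calls c2
          WHERE c2.agentid = a.agentid
            AND c2.call_start < c.call_end
            AND c2.call_end > c.call_start
      )
    LIMIT 1
)
FROM (
    SELECT array_agg(id) id, min(id) idmin FROM aa_dev.calls
    GROUP BY agentid, customerid HAVING count(*) > 1
) AS dup
WHERE c.id = ANY(dup.id) AND c.id <> dup.idmin;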

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on that card in the last 90 days.
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query into sql window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match; how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica you can apply the ANSI-standard OLAP window function.
You'll need to nest two queries, though: the window function only returns sensible results if all the rows that need to be evaluated are in the result set, while you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id | the_max
--------+--------
      1 |      30
As far as I know, PostgreSQL window functions don't support a bounded RANGE preceding, so RANGE BETWEEN '90 days' PRECEDING won't work. They do support bounded ROWS preceding, such as ROWS BETWEEN 90 PRECEDING, but then you would need to assemble a time-series query similar to the following for the window function to operate on time-based rows:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
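For what it's worth, here is a hedged sketch of how that scaffold could be finished, assuming at most one transaction per card per day (otherwise aggregate amount per day first) so that 90 preceding rows span 90 days:

-- MAX ignores the NULL amounts on gap days; the final amount IS NOT NULL
-- filter keeps only cards actually used on the target day.
WITH series AS (
    SELECT c.card_id, t.amount, g.d AS d_series
    FROM generate_series(
        '2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
    ) g(d)
    CROSS JOIN (SELECT DISTINCT card_id FROM test) c
    LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
)
SELECT card_id, the_max
FROM (
    SELECT card_id, d_series, amount,
           MAX(amount) OVER (
               PARTITION BY card_id
               ORDER BY d_series
               ROWS BETWEEN 90 PRECEDING AND CURRENT ROW
           ) AS the_max
    FROM series
) x
WHERE d_series = '2017-07-06'::timestamp
  AND amount IS NOT NULL;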
For what you need (based on your question description), I would stick to using group by.
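That said, newer PostgreSQL versions (11 and later) do support a bounded RANGE frame with an interval offset, so on a modern server a direct translation of the Vertica query should work; a sketch:

-- PostgreSQL 11+ only: RANGE with an interval offset
WITH olap_output AS (
    SELECT card_id,
           tran_dt,
           MAX(amount) OVER (
               PARTITION BY card_id
               ORDER BY tran_dt
               RANGE BETWEEN INTERVAL '90 days' PRECEDING AND CURRENT ROW
           ) AS the_max
    FROM test
)
SELECT card_id, the_max
FROM olap_output
WHERE tran_dt = DATE '2017-07-06';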

multiple extract() with WHERE clause possible?

So far I have come up with the below:
WHERE (extract(month FROM orderdate)) =
      (SELECT min(extract(month FROM orderdate))
       FROM orders)
However, that will consequently return zero to many rows, and in my case, many, because many orders exist within that same earliest (minimum) month, i.e. 4th February, 9th February, 15th Feb, ...
I know that a WHERE clause can contain multiple columns, so why wouldn't the below work?
WHERE (extract(day FROM orderdate)), (extract(month FROM orderdate)) =
(SELECT min(extract(day from orderdate)), min(extract(month FROM orderdate))
FROM orders)
I simply get: SQL Error: ORA-00920: invalid relational operator
Any help would be great, thank you!
Sample data:
02-Feb-2012
14-Feb-2012
22-Dec-2012
09-Feb-2013
18-Jul-2013
01-Jan-2014
Output:
02-Feb-2012
14-Feb-2012
Desired output:
02-Feb-2012
I recreated your table and found out you just messed up the brackets a bit. The following works for me:
where (extract(day from OrderDate), extract(month from OrderDate))
    = (select min(extract(day from OrderDate)),
              min(extract(month from OrderDate))
       from orders)
Use something like this:
with cte1 as (
select
extract(month from OrderDate) date_month,
extract(day from OrderDate) date_day,
OrderNo
from tablename
), cte2 as (
select min(date_month) min_date_month, min(date_day) min_date_day
from cte1
)
select cte1.*
from cte1
where (date_month, date_day) = (select min_date_month, min_date_day from cte2)
A common table expression enables you to restructure your data and then use it for your select. The first CTE (cte1) selects the month and the day for each of your table rows. cte2 then selects min(month) and min(day). The final select combines both CTEs to return all rows from cte1 that have the desired month and day.
There is probably a shorter solution; however, I like common table expressions because they are almost always easier to understand than the "optimal, shortest" query.
If that is really what you want, as bizarre as it seems, then as a different approach you could forget the extracts and the subquery against the table to get the minimums, and use an analytic approach instead:
select orderdate
from (
select o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
from orders o
)
where rn = 1;
ORDERDATE
---------
01-JAN-14
The row_number() effectively adds a pseudo-column to every row in your original table, based on the month and day in the order date. The rn values are unique, so there will be one row marked as 1, which will be from the earliest day in the earliest month. If you have multiple orders with the same day/month, say 01-Jan-2013 and 01-Jan-2014, then you'll still only get exactly one with rn = 1, but which is picked is indeterminate. You'd need to add further order by conditions to make it deterministic, but I have no idea what you might want.
That is done in the inner query; the outer query then filters so that only the records marked with rn = 1 is returned; so you get exactly one row back from the overall query.
This also avoids the situation where the earliest day number is not in the earliest month number - say if you only had 01-Jan-2014 and 02-Feb-2014; comparing the day and month separately would look for 01-Feb-2014, which doesn't exist.
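For instance, if ties should simply go to the earliest full date, one possible (assumed) tie-break is to extend the ordering:

row_number() over (order by to_char(orderdate, 'MMDD'), orderdate) as rn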
SQL Fiddle (with Thomas Tschernich's answer thrown in too, giving the same result for this data).
To join the result against your invoice table, you don't need to join to the orders table again - especially not with a cross join, which is skewing your results. You can do the join (at least) two ways:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
) o, invoices i
WHERE i.invno = o.invno
AND rn = 1;
Or:
SELECT
o.orderno,
to_char(o.orderdate, 'DD-MM-YYYY'),
i.invno
FROM
(
SELECT orderno, orderdate, invno
FROM
(
SELECT o.*,
row_number() over (order by to_char(orderdate, 'MMDD')) as rn
FROM orders o
)
WHERE rn = 1
) o, invoices i
WHERE i.invno = o.invno;
The first looks like it does more work but the execution plans are the same.
SQL Fiddle with your pastebin-supplied query that gets two rows back, and these two that get one.

Difference between dates in different rows

Hi,
my problem is that I need the average time between a chargebegin and chargeend row (timestampserver), grouped by stationname, connectornumber, and day.
The main problem is that I cannot simply use a MAX or MIN function, because I have the same stationname/connectornumber combination several times in the table.
So in fact I have to select the first chargebegin and find the next chargeend (the one with the same stationname/connectornumber combination and the min(id) > chargebegin.id) to get the difference.
I tried a lot, but in fact I have no idea how to do this.
Database is PostgreSQL 9.2.
Testdata:
create table datatable (
id int,
connectornumber int,
message varchar,
metercount int,
stationname varchar,
stationuser varchar,
timestampmessage varchar,
timestampserver timestamp,
authsource varchar
);
insert into datatable values (181,1,'chargebegin',4000,'100','FCSC','2012-10-10 16:39:10','2012-10-10 16:39:15.26');
insert into datatable values (182,1,'chargeend',4000,'100','FCSC','2012-10-10 16:39:17','2012-10-10 16:39:28.379');
insert into datatable values (184,1,'chargebegin',4000,'100','FCSC','2012-10-11 11:06:31','2012-10-11 11:06:44.981');
insert into datatable values (185,1,'chargeend',4000,'100','FCSC','2012-10-11 11:16:09','2012-10-11 11:16:10.669');
insert into datatable values (191,1,'chargebegin',4000,'100','MSISDN_100','2012-10-11 13:38:19','2012-10-11 13:38:26.583');
insert into datatable values (192,1,'chargeend',4000,'100','MSISDN_100','2012-10-11 13:38:53','2012-10-11 13:38:55.631');
insert into datatable values (219,1,'chargebegin',4000,'100','MSISDN_','2012-10-12 11:38:03','2012-10-12 11:38:29.029');
insert into datatable values (220,1,'chargeend',4000,'100','MSISDN_','2012-10-12 11:40:14','2012-10-12 11:40:18.635');
This might have some syntax errors as I can't test it right now, but you should get an idea, how to solve it.
with
chargebegin as (
select
stationname,
connectornumber,
timestampserver,
row_number() over(partition by stationname, connectornumber order by timestampserver) as rn
from
datatable
where
message = 'chargebegin'
),
chargeend as (
select
stationname,
connectornumber,
timestampserver,
row_number() over(partition by stationname, connectornumber order by timestampserver) as rn
from
datatable
where
message = 'chargeend'
)
select
stationname,
connectornumber,
avg(b.timestampserver - a.timestampserver) as avg_diff
from
chargebegin a
join chargeend b using (stationname, connectornumber, rn)
group by
stationname,
connectornumber
This assumes that there is always an end event for each begin event and that these events cannot overlap (meaning that for a given stationname and connectornumber there can be only one connection at any time). Therefore you can use row_number() to get matching begin/end events and then do whatever calculation is needed.
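The question also asks for grouping by day. A minimal variation (assuming a charge should count toward the day its chargebegin falls on) keeps the same two CTEs and adds the truncated begin timestamp to the grouping:

-- Same chargebegin/chargeend CTEs as above, now grouped per day
select
    stationname,
    connectornumber,
    date_trunc('day', a.timestampserver) as day,
    avg(b.timestampserver - a.timestampserver) as avg_diff
from
    chargebegin a
    join chargeend b using (stationname, connectornumber, rn)
group by
    stationname,
    connectornumber,
    date_trunc('day', a.timestampserver)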

tsql math across multiple dates in a table

I have a #variabletable simply defined as EOMDate(datetime), DandA(float), Coupon(float), EarnedIncome(float)
04/30/2008, 20187.5,17812.5,NULL
05/31/2008, 24640.63, 22265.63, NULL
06/30/2008, 2375, 26718.75,NULL
What I am trying to do is, after the table is populated, go back and calculate the EarnedIncome field to populate it.
The formula is DandA for the current month minus DandA for the previous month, plus Coupon.
Where I am having trouble is how to do the update. So for 6/30 the value should be 4453.12 = (2375 - 24640.63) + 26718.75.
I'll gladly take a clubbing over the head to get this resolved. Thanks. Also, running under MS SQL 2005, so any CTE/ROW_NUMBER() OVER type solution can be used if possible.
You would need to use a subquery like this:
UPDATE v1
SET EarnedIncome = v1.DandA
    - (SELECT v2.DandA
       FROM #variabletable v2
       WHERE dbo.GetMonthOnly(v2.EOMDate) = dbo.GetMonthOnly(DATEADD(mm, -1, v1.EOMDate)))
    + v1.Coupon
FROM #variabletable v1
And I was making use of this helper function
DROP FUNCTION GetMonthOnly
GO
CREATE FUNCTION GetMonthOnly
(
    @InputDate DATETIME
)
RETURNS DATETIME
AS
BEGIN
    RETURN CAST(CAST(YEAR(@InputDate) AS VARCHAR(4)) + '/' +
                CAST(MONTH(@InputDate) AS VARCHAR(2)) + '/01' AS DATETIME)
END
GO
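As an aside, the usual way to truncate a date to the first of its month avoids the string round-trip entirely; a small sketch:

-- Month-truncation via date arithmetic (equivalent result, no VARCHAR casts)
SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)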
There's definitely quite a few ways to do this. You'll find pros and cons depending on how large your data set is, and other factors.
Here's my recommendation...
Declare @table as table
(
    EOMDate DateTime,
    DandA float,
    Coupon float,
    EarnedIncome float
)
Insert into @table Values('04/30/2008', 20187.5, 17812.5, NULL)
Insert into @table Values('05/31/2008', 24640.63, 22265.63, NULL)
Insert into @table Values('06/30/2008', 2375, 26718.75, NULL)
--If we know that EOMDate will only contain one entry per month, and there's *always* one entry a month...
Update T1 Set
    EarnedIncome = DandA -
        (Select top 1 DandA
         from @table t2
         where t2.EOMDate < T1.EOMDate
         order by EOMDate Desc) + Coupon
From @table T1

Select * from @table

--If there's a chance that there could be more per month, or we only want the values from the previous month (do nothing if it doesn't exist)
Update T1 Set
    EarnedIncome = DandA - (
        Select top 1 DandA
        From @table T2
        Where DateDiff(month, T1.EOMDate, T2.EOMDate) = -1
        Order by EOMDate Desc) + Coupon
From @table T1

Select * from @table
--Leave the null, it's good for the data (since technically you cannot calculate it without a prior month).
I like the second method best because it will only calculate if there exists a record for the preceding month.
(add the following to the above script to see the difference)
--Add one for August
Insert into @table Values('08/30/2008', 2242, 22138.62, NULL)

Update T1 Set
    EarnedIncome = DandA - (
        Select top 1 DandA
        From @table T2
        Where DateDiff(month, T1.EOMDate, T2.EOMDate) = -1
        Order by EOMDate Desc
    ) + Coupon
From @table T1

--August is Null because there's no July
Select * from @table
It's all a matter of exactly what do you want.
Use the record directly proceding the current record (regardless of date), or ONLY use the record that is a month before the current record.
Sorry about the format... Stackoverflow.com's answer editor and I do not play nice together.
:D
You can use a subquery to perform the calculation; the only problem is what to do with the first month, because there is no previous DandA value. Here I've set it to 0 using IsNull. The query looks like:
Update MyTable
Set EarnedIncome = DandA + Coupon - IsNull((Select Top 1 DandA
                                            From MyTable MyTable2
                                            Where MyTable.EOMDate > MyTable2.EOMDate
                                            Order by MyTable2.EOMDate desc), 0)
This also assumes that you only have one record per month in the table, and that there aren't any gaps between months.
Another alternative is to calculate the running total when you are inserting your data, and have a constraint guarantee that your running total is correct:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx
There may be a way to do this in a single statement, but in cases like this, I'd be inclined to set up a cursor to walk through each row, computing the new EarnedIncome field for that row, update the row, and then move to the next row.
Ex:
DECLARE @EOMDateVal DATETIME
DECLARE @EarnedIncomeVal FLOAT

DECLARE updCursor CURSOR FOR
    SELECT EOMDate FROM #variabletable
OPEN updCursor

FETCH NEXT FROM updCursor INTO @EOMDateVal
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Compute @EarnedIncomeVal for this row here.
    -- This also gives you a chance to catch data integrity problems
    -- that would cause you to fail the whole batch if you compute
    -- everything in a subquery.
    UPDATE #variabletable SET EarnedIncome = @EarnedIncomeVal
    WHERE EOMDate = @EOMDateVal

    FETCH NEXT FROM updCursor INTO @EOMDateVal
END
CLOSE updCursor
DEALLOCATE updCursor