SQL Time Series Completion Script - tsql

Version: SQL Server 2014
Objective: Create a complete, gapless time series from the existing date-range records.
Initial Data Setup:
IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL
    DROP TABLE #DataSet;

CREATE TABLE #DataSet (
    RowID INT
    ,StartDt DATETIME
    ,EndDt DATETIME
    ,Col1 FLOAT);

INSERT INTO #DataSet (
    RowID
    ,StartDt
    ,EndDt
    ,Col1)
VALUES
    (1234,'1/1/2016','12/31/2999',100)
    ,(1234,'7/23/2016','7/27/2016',90)
    ,(1234,'7/26/2016','7/31/2016',80)
    ,(1234,'10/1/2016','12/31/2999',75);
Desired Results:
RowID, StartDt, EndDt, Col1
1234, '01/01/2016', '07/22/2016', 100
1234, '07/23/2016', '07/26/2016', 90
1234, '07/26/2016', '07/31/2016', 80
1234, '08/01/2016', '09/30/2016', 100
1234, '10/01/2016', '12/31/2999', 75
Not an easy task, I will admit. If anyone has a suggestion on how to tackle this using SQL alone (Microsoft SQL Server 2014 T-SQL), I would greatly appreciate it. Please keep in mind that it is SQL and we want to avoid cursors at all costs because of their performance on large data sets.
Thanks in advance.
Also, as an FYI, I was able to achieve half of this by using the LEAD window function to set the end date of the current record to the StartDt - 1 of the next. The other half, filling gaps back in from previous records, still eludes me.
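For reference, the LEAD half I mentioned looks roughly like this (just a sketch against the #DataSet above, not my exact script):
-- Set each record's EndDt to the day before the next record's StartDt;
-- the last record per RowID keeps its own EndDt.
SELECT RowID, StartDt,
       COALESCE(DATEADD(day, -1,
                LEAD(StartDt) OVER (PARTITION BY RowID ORDER BY StartDt, EndDt)),
                EndDt) AS EndDt,
       Col1
FROM #DataSet;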
Updated for the 9/31 to 9/30 date.

The following query does essentially what you are asking; you can tweak it to fit your requirements. Note that when I checked the results, your desired results originally contained 09/31/2016, which is not a valid date.
WITH RankedData AS
(
    SELECT RowID, StartDt, EndDt, Col1,
           DATEADD(day, -1, StartDt) AS PrevEndDt,
           RANK() OVER (ORDER BY StartDt, EndDt, RowID) AS rank_no
    FROM #DataSet
),
HasGapsData AS
(
    SELECT a.RowID, a.StartDt,
           CASE WHEN b.PrevEndDt <= a.EndDt THEN b.PrevEndDt ELSE a.EndDt END AS EndDt,
           a.Col1, a.rank_no
    FROM RankedData a
    LEFT JOIN RankedData b ON a.rank_no = b.rank_no - 1
)
SELECT RowID, StartDt, EndDt, Col1
FROM HasGapsData
UNION ALL
SELECT a.RowID,
       DATEADD(day, 1, a.EndDt) AS StartDt,
       DATEADD(day, -1, b.StartDt) AS EndDt,
       a.Col1
FROM HasGapsData a
INNER JOIN HasGapsData b ON a.rank_no = b.rank_no - 1
WHERE DATEDIFF(day, a.EndDt, b.StartDt) > 1
ORDER BY StartDt, EndDt;

Related

How to collapse overlapping date periods with acceptable gaps using T-SQL?

We want to group our members' enrollments into "continuous enrollments," allowing for a gap of up to 45 days. I know how to use LEAD to determine if an enrollment should be grouped with the next, but I don't know how to group them. Would it be more appropriate to add 45 to the term date and subtract 45 from the effective date, then check for overlapping date periods? My goal is to have a SQL view that returns the results similar to the final query below. Thank you for your help.
SELECT '101' AS MemID, '2021-01-01' AS EffDate, '2021-01-31' AS TermDate INTO #T1 UNION
SELECT '101', '2021-02-01', '2021-02-28' UNION
SELECT '101', '2021-03-01', '2021-03-31' UNION
SELECT '101', '2021-06-01', '2021-06-30' UNION
SELECT '999', '2021-01-01', '2021-01-15' UNION
SELECT '999', '2021-09-01', '2021-09-28' UNION
SELECT '999', '2021-10-01', '2021-10-31'
SELECT *
, LEAD(EffDate) OVER (PARTITION BY MemID ORDER BY EffDate) AS LeadEffDate
, DATEDIFF(DAY, TermDate, (LEAD(EffDate) OVER (PARTITION BY MemID ORDER BY EffDate))) AS DaysToNextEnrollment
, CASE WHEN (DATEDIFF(DAY, TermDate, (LEAD(EffDate) OVER (PARTITION BY MemID ORDER BY EffDate)))) <= 45 THEN 1 ELSE 0 END AS CombineWithNextRecord
FROM #T1
-- result objective
SELECT 101 AS MemID, '2021-01-01' AS EffDate, '2021-03-31' AS TermDate UNION
SELECT 101, '2021-06-01', '2021-06-30' UNION
SELECT 999, '2021-01-01', '2021-01-15' UNION
SELECT 999, '2021-09-01', '2021-10-31'
I think you are really close. Your question is very similar to TSQL - creating from-to date table while ignoring in-between steps with conditions, with a difference in the logic for what counts as the same group.
My basic approach is to use the LAG() function to get the previous row's MemID and TermDate, combine that with your 45-day rule to define a group, and finally take the first and last values of each group.
Here is my answer to that question, modified to your situation.
SELECT
a4.MemID
, CONVERT (DATE, a4.First_EffDate) AS [EffDate]
, CONVERT (DATE, a4.TermDate) AS [TermDate]
FROM (
SELECT
a3.MemID
, a3.EffDate
, a3.TermDate
, a3.MemID_group
, FIRST_VALUE (a3.EffDate) OVER (PARTITION BY a3.MemID_group ORDER BY a3.EffDate) AS [First_EffDate]
, ROW_NUMBER () OVER (PARTITION BY a3.MemID_group ORDER BY a3.EffDate DESC) AS [Row_number]
FROM (
SELECT
a2.MemID
, a2.EffDate
, a2.TermDate
, a2.Previous_MemID
, a2.Previous_TermDate
, a2.New_group
, SUM (a2.New_group) OVER (ORDER BY a2.MemID, a2.EffDate) AS [MemID_group]
FROM (
SELECT
a1.MemID
, a1.EffDate
, a1.TermDate
, a1.Previous_MemID
, a1.Previous_TermDate
---------------------------------------------------------------------------------
-- new group if the MemID is different from the previous row OR
-- if the MemID is the same as the previous row AND it has been more than 45 days
-- between the TermDate of the previous row and the EffDate of the current row
,
IIF((a1.MemID <> a1.Previous_MemID)
OR (
a1.MemID = a1.Previous_MemID
AND DATEDIFF (DAY, a1.Previous_TermDate, a1.EffDate) > 45
)
, 1
, 0) AS [New_group]
---------------------------------------------------------------------------------
FROM (
SELECT
MemID
, EffDate
, TermDate
, LAG (MemID) OVER (ORDER BY MemID) AS [Previous_MemID]
, LAG (TermDate) OVER (PARTITION BY MemID ORDER BY EffDate) AS [Previous_TermDate]
FROM #T1
) a1
) a2
) a3
) a4
WHERE a4.[Row_number] = 1;
Here is the dbfiddle.
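If you prefer fewer nesting levels, the same gaps-and-islands idea can be written a bit more compactly (a sketch using the same column names; I have not tested it against every edge case):
WITH flagged AS (
    SELECT MemID, EffDate, TermDate,
           -- start a new group when there is no previous row for the member
           -- or the gap from the previous TermDate exceeds 45 days
           IIF(LAG(TermDate) OVER (PARTITION BY MemID ORDER BY EffDate) IS NULL
               OR DATEDIFF(DAY, LAG(TermDate) OVER (PARTITION BY MemID ORDER BY EffDate), EffDate) > 45,
               1, 0) AS New_group
    FROM #T1
), grouped AS (
    SELECT MemID, EffDate, TermDate,
           SUM(New_group) OVER (PARTITION BY MemID ORDER BY EffDate
                                ROWS UNBOUNDED PRECEDING) AS MemID_group
    FROM flagged
)
SELECT MemID,
       MIN(EffDate) AS EffDate,
       MAX(TermDate) AS TermDate
FROM grouped
GROUP BY MemID, MemID_group;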

Using min/max values from a CTE in a later query, instead of using a subquery in Postgres

I've got a remedial question about pulling results out of a CTE in a later part of the query. For the example code, below are the relevant, stripped down tables:
CREATE TABLE print_job (
created_dts timestamp not null default now(),
status text not null
);
CREATE TABLE calendar_day (
date_actual date not null
);
In the current setup, there are gaps in the dates in the print_job data, and we would like to have a gapless result. For example, there are 87 days from the first to last date in the table, and only 77 days in there have data. We've already got a calendar_day dimension table to join with to get the 87 rows for the 87-day range. It's easy enough to figure out the min and max dates in the data with a subquery or in a CTE, but I don't know how to use those values from a CTE. I've got a full query below, but here are the relevant fragments with comments:
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- This CTE does not work because it doesn't know what date_range is.
complete_date_series_using_cte AS (
select actual_date
from calendar_day
where actual_date >= date_range.start_date
and actual_date <= date_range.end_date
),
-- Subqueries are fine, because the FROM is specified in the subquery condition directly.
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
I run into this regularly, and finally figured I'd ask. I've hunted around already for an answer, but I'm not clear how to summarize it well. And while there's nothing wrong with the subqueries in this case, I've got other situations where a CTE is nicer/more readable.
If it helps, I've listed the complete query below.
-- Get some counts and give them names.
WITH
daily_status AS (
select created_dts::date as created_date,
count(*) AS daily_total,
count(*) FILTER (where status = 'Error') AS status_error,
count(*) FILTER (where status = 'Processing') AS status_processing,
count(*) FILTER (where status = 'Aborted') AS status_aborted,
count(*) FILTER (where status = 'Done') AS status_done
from print_job
group by created_dts::date
),
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- There are gaps in the data, and we want a row for dates with no results.
-- Could use generate_series on a timestamp & convert that to dates. But,
-- in our case, we've already got dimension tables for days. All that's needed
-- here is the actual date.
-- This CTE does not work because it doesn't know what date_range is.
-- complete_date_series_using_cte AS (
-- select actual_date
--
-- from calendar_day
--
-- where actual_date >= date_range.start_date
-- and actual_date <= date_range.end_date
-- ),
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
-- The final query joins the complete date series with whatever data is in the print_job table daily summaries.
select date_actual,
coalesce(daily_total,0) AS total,
coalesce(status_error,0) AS errors,
coalesce(status_processing,0) AS processing,
coalesce(status_aborted,0) AS aborted,
coalesce(status_done,0) AS done
from complete_date_series_using_subquery
left join daily_status
on daily_status.created_date =
complete_date_series_using_subquery.date_actual
order by date_actual
I said it was a remedial question... I remembered where I'd seen this done before:
https://tapoueh.org/manual-post/2014/02/postgresql-histogram/
In my example, I need to list the CTE in the table list. That's obvious in retrospect; I realize I don't automatically think to do that because I habitually avoid CROSS JOIN. The fragment below shows the slight change needed:
WITH
date_range AS (
select min(created_dts)::date as start_date,
max(created_dts)::date as end_date
from print_job
),
complete_date_series AS (
select date_actual
from calendar_day, date_range
where date_actual >= date_range.start_date
and date_actual <= date_range.end_date
),
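With that change in place, and keeping the daily_status CTE from the full query above, the final SELECT simply references the rewritten CTE instead of the subquery version:
select date_actual,
       coalesce(daily_total, 0) as total,
       coalesce(status_error, 0) as errors,
       coalesce(status_processing, 0) as processing,
       coalesce(status_aborted, 0) as aborted,
       coalesce(status_done, 0) as done
from complete_date_series
left join daily_status
       on daily_status.created_date = complete_date_series.date_actual
order by date_actual;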

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on each card in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
The results are correct:
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match. How can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica you can apply the ANSI-standard OLAP window function.
You'll need to nest two queries, though: the window function only returns sensible results if all the rows it needs to evaluate are in the result set, but you only want the row from '2017-07-06' to be displayed.
So you have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id | the_max
--------+--------
      1 |      30
As far as I know, PostgreSQL (9.x) window functions don't support a bounded RANGE frame with an interval, so RANGE BETWEEN '90 days' PRECEDING won't work. They do support bounded ROWS frames such as ROWS BETWEEN 90 PRECEDING, but then you would need to assemble a time-series query similar to the following so that the window function operates on time-based rows (one row per card per day):
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using GROUP BY.
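That said, if you did want to push the ROWS-based frame all the way through in PostgreSQL, a rough sketch built on the series above could look like this (it assumes at most one transaction per card per day, and the exact frame width may need an off-by-one adjustment):
WITH series AS (
    SELECT c.card_id, t.amount, g.d AS d_series
    FROM generate_series(
        '2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
    ) g(d)
    CROSS JOIN ( SELECT DISTINCT card_id FROM test ) c
    LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
),
olap_output AS (
    SELECT card_id, d_series,
           MAX(amount) OVER (
               PARTITION BY card_id
               ORDER BY d_series
               ROWS BETWEEN 90 PRECEDING AND CURRENT ROW
           ) AS the_max
    FROM series
)
SELECT card_id, the_max
FROM olap_output
WHERE d_series = '2017-07-06'::timestamp
  -- only cards used on the last day, as in the original join query
  AND card_id IN (SELECT card_id FROM test WHERE tran_dt >= '2017-07-06')
ORDER BY card_id;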

TSQL - Control a number sequence

I'm new to T-SQL.
I have a table with a field called ODOMETER for a vehicle. I need to get the number of km driven in a period of time, from the 1st of the month to the end of the month.
SELECT MAX(Odometer) - MIN(Odometer) as TotalKm FROM Table
This will work in an ideal test scenario, but the odometer can be reset to 0 at any time.
Can someone help me solve this? Thank you.
I'm working with MS SQL 2012.
EXAMPLE of records:
Date Odometer value
datetime var, 37210
datetime var, 37340
datetime var, 0
datetime var, 220
Try something like this using LAG. There are other ways, but this should be easy.
EDIT: Changed the sample data to include records outside of the desired month range and simplified the readings for easy hand calculation. Also showing a second option, as suggested by the OP.
DECLARE #tbl TABLE (stamp DATETIME, Reading INT)
INSERT INTO #tbl VALUES
('02/28/2014',0)
,('03/01/2014',10)
,('03/10/2014',20)
,('03/22/2014',0)
,('03/30/2014',10)
,('03/31/2014',20)
,('04/01/2014',30)
--Original solution with WHERE on the "outer" SELECT.
--This gives a result of 40 because it includes the change of 10 between 2/28 and 3/01.
;WITH cte AS (
SELECT Reading
,LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) LastReading
,Reading - LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) ChangeSinceLastReading
,CONVERT(date, stamp) stamp
FROM #tbl
)
SELECT SUM(CASE WHEN Reading = 0 THEN 0 ELSE ChangeSinceLastReading END)
FROM cte
WHERE stamp BETWEEN '03/01/2014' AND '03/31/2014'
--Second option with WHERE on the "inner" SELECT (within the CTE).
--This gives a result of 30 because the change of 10 between 2/28 and 3/01 is excluded by the filtered LAG.
;WITH cte AS (
SELECT Reading
,LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) LastReading
,Reading - LAG(Reading,1,Reading) OVER (ORDER BY stamp ASC) ChangeSinceLastReading
,CONVERT(date, stamp) stamp
FROM #tbl
WHERE stamp BETWEEN '03/01/2014' AND '03/31/2014'
)
SELECT SUM(CASE WHEN Reading = 0 THEN 0 ELSE ChangeSinceLastReading END)
FROM cte
I think Karl's solution using LAG is better than mine, but anyway:
;WITH [Rows] AS
(
    SELECT o1.stamp, o1.Reading AS CurrentValue,
           (SELECT TOP 1 o2.Reading
            FROM #tbl o2
            WHERE o1.stamp < o2.stamp
            ORDER BY o2.stamp) AS NextValue -- the next reading in time
    FROM #tbl o1
)
SELECT SUM(CASE WHEN [NextValue] IS NULL OR [NextValue] < [CurrentValue] THEN 0 ELSE [NextValue] - [CurrentValue] END)
FROM [Rows]

Best way to get rid of unwanted sql subselects?

I have a table called Registrations with the following fields:
Id
DateStarted (not null)
DateCompleted (nullable)
I have a bar chart which shows the number of registrations started and completed by date.
My query looks like:
;
WITH Initial(DateStarted, StartCount)
as (
select Datestarted, COUNT(*)
FROM Registrations
GROUP BY DateStarted
)
select I.DateStarted, I.StartCount, COUNT(DISTINCT R.RegistrationId) as CompleteCount
from Initial I
inner join Registrations R
ON (I.DateStarted = R.DateCompleted)
GROUP BY I.DateStarted, I.StartCount
which returns a table that looks like:
DateStarted StartCount CompleteCount
2009-08-01 1033 903
2009-08-02 540 498
The query just has one of those code-smell problems. What is a better way of doing this?
EDIT: So why won't the query below work? You could throw COALESCE() around the counts in the last SELECT statement if you wanted the counts to be zero instead of null. It will also include dates that have completed (or ended, in the example below) registrations even though that date has no started registrations.
I am assuming the following table structure (roughly).
create table temp
(
id int,
start_date datetime,
end_date datetime
)
insert into temp values (1, '8/1/2009', '8/1/2009')
insert into temp values (2, '8/1/2009', '8/2/2009')
insert into temp values (3, '8/1/2009', null)
insert into temp values (4, '8/2/2009', '8/2/2009')
insert into temp values (5, '8/2/2009', '8/3/2009')
insert into temp values (6, '8/2/2009', '8/4/2009')
insert into temp values (7, '8/4/2009', null)
Then you could do the following to get what you want.
with start_helper as
(
select start_date, count(*) as count from temp group by start_date
),
end_helper as
(
select end_date, count(*) as count from temp group by end_date
)
select coalesce(a.start_date, b.end_date) as date, a.count as start_count, b.count as end_count
from start_helper a full outer join end_helper b on a.start_date = b.end_date
where coalesce(a.start_date, b.end_date) is not null
I would think the FULL OUTER JOIN is necessary: a record completed today may have started yesterday, and if no new record started today you would otherwise lose that day from your results.
Off-hand, I think this does it:
SELECT
DateStarted
, COUNT(*) as StartCount
, SUM(CASE
WHEN DateCompleted = DateStarted THEN 1
ELSE 0 END
) as CompleteCount
FROM Registration
GROUP BY DateStarted
OK, apparently I had the requirements wrong before. Given that the CompleteCounts are independent of the StartDate, then this is how I would do it:
;WITH StartDays AS
(
SELECT DateStarted
, Count(*) AS StartCount
FROM Registration
GROUP BY DateStarted
)
, CompleteDays AS
(
SELECT DateCompleted
, Count(*) AS CompleteCount
FROM Registration
GROUP BY DateCompleted
)
SELECT
DateStarted
, COALESCE(StartCount, 0) AS StartCount
, COALESCE(CompleteCount, 0) AS CompleteCount
FROM StartDays
FULL OUTER JOIN CompleteDays ON DateStarted = DateCompleted
Which actually is pretty close to what you had.
I don't see a problem. I see a common table expression being used.
You didn't provide DDL for the tables, so I'm not going to try to reproduce this. However, I think you may be able to directly substitute the SELECT for the use of Initial.
I believe the following is identical in function to what you have:
select DS.DateStarted
, count(distinct DS.RegistrationId) as StartCount
, count(distinct DC.RegistrationId) as CompleteCount
from Registrations DS
inner join Registrations DC on DS.DateStarted = DC.DateCompleted
group by Ds.DateStarted
I'm a bit confused by the name of the column DateStarted in the results. It looks to just be a date where both some things started and some things ended. And the counts are the number or registrations started and completed that day.
The inner join is throwing away any date where there is either 0 starts or 0 completes. To get all:
select coalesce(DS.DateStarted, DC.DateCompleted) as "Date"
, count(distinct DS.RegistrationId) as StartCount
, count(distinct DC.RegistrationId) as CompleteCount
from Registrations DS
full outer join Registrations DC on DS.DateStarted = DC.DateCompleted
group by Ds.DateStarted, DC.DateCompleted
If you wanted to include dates that are neither DateStarted nor DateCompleted, with counts of 0 and 0, then you will need a source of dates, and I think it would be clearer to use two correlated subqueries in the SELECT clause instead of joins and COUNT DISTINCT:
select DateSource."Date"
, (select count(*)
from Registrations
where DateStarted = DateSource."Date") as StartCount
, (select count (*)
from Registrations
where DateCompleted = DateSource."Date") as CompleteCount
from DateSource -- implementation of date source left as exercise
where DateSource.Date between @LowDate and @HighDate
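One way to build such a DateSource in T-SQL is a small recursive CTE (an illustrative sketch only; a permanent calendar table works just as well). The @LowDate and @HighDate variables here stand in for whatever range parameters you use:
DECLARE @LowDate date = '2009-08-01', @HighDate date = '2009-08-31';

WITH DateSource AS
(
    SELECT @LowDate AS [Date]
    UNION ALL
    SELECT DATEADD(day, 1, [Date])
    FROM DateSource
    WHERE [Date] < @HighDate
)
SELECT [Date]
FROM DateSource
OPTION (MAXRECURSION 366); -- raise this if the range spans more than a year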