Optimize recursive query using exclusion list - tsql

I'm trying to optimize a recursive query for speed. The full query runs for 15 minutes.
The part I'm trying to optimize takes ~3.5min to execute, and the same logic is used twice in the query.
Description:
Table Ret contains over 300K rows with 30 columns (Daily snapshot)
Table Ret_Wh is the warehouse for Ret with over 5 million rows (snapshot history, 90 days)
datadate - the day the info was recorded (like 10-01-2012)
statusA - a status like (Red, Blue) that an account can have.
statusB - a different status like (Large, Small) that an account can have.
Statuses can change day to day.
old - an integer age on the account. Age can be increased/decreased if there is a payment on the account. Otherwise it increases by 1 each day.
account - the account number, and primary key of a row.
In Ret the account is unique.
In Ret_Wh account is unique per datadate.
money - dollars in the account
Both Ret and Ret_Wh have the columns listed above
Query Goal: Select all accounts from Ret_Wh that had an age in a certain range at ANY time during the month, and had a specific status while in that range.
Then select from those results, matching accounts in Ret, with a specific age "today", no matter their status.
My Goal: Do this in a way that doesn't take 3.5 minutes
Pseudo_Code:
declare @sdt date = '2012-10-01' -- or the beginning of any month
declare @dt date = getdate()
create table #temp (account char(20))
create table #result (account char(20), money money)
while @sdt < @dt
BEGIN
insert into #temp
select
A.account
from Ret_Wh as A
where a.datadate = @sdt
and a.statusA = 'Red'
and a.statusB = 'Large'
and a.old between 61 and 80
set @sdt = dateadd(day, 1, @sdt)
END
------
select distinct
b.account
,b.money
into #result
from #temp as A
join (Select account, money from Ret where old = 81) as B
on A.account=B.account
I want to create a distinct list of accounts in Ret_Wh (call it #shrinking_list). Then, in the while loop, I join Ret_Wh to #shrinking_list. At the end of each pass, I delete the selected account from #shrinking_list. The loop then iterates with a smaller list joined to Ret_Wh, thereby speeding up the query as @sdt increases by 1 day. However, I don't know how to pass the exact account number that was selected to an external variable in the loop, so that I can delete it from #shrinking_list.
Any ideas on that, or how to speed this up in general?

Why are you stepping through the dates from @sdt to @dt one at a time in a loop? A single set-based query can cover the whole date range:
select distinct b.account, b.money
from Ret as B
join Ret_Wh as A
on A.account = B.account
and a.datadate >= @sdt
and a.datadate < @dt
and a.statusA = 'Red'
and a.statusB = 'Large'
and a.old between 61 and 80
where b.old = 81
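To try the set-based form without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration, the columns are trimmed to the ones the query touches, and dates are stored as ISO strings (an assumption that makes plain string comparison behave like date comparison).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ret    (account TEXT PRIMARY KEY, old INTEGER, money REAL);
    CREATE TABLE Ret_Wh (account TEXT, datadate TEXT, statusA TEXT,
                         statusB TEXT, old INTEGER,
                         PRIMARY KEY (account, datadate));
    -- Hypothetical sample rows, invented for illustration
    INSERT INTO Ret VALUES ('A1', 81, 100.0), ('A2', 81, 250.0), ('A3', 40, 75.0);
    INSERT INTO Ret_Wh VALUES
        ('A1', '2012-10-03', 'Red',  'Large', 65),
        ('A1', '2012-10-04', 'Red',  'Large', 66),  -- second hit, same account
        ('A2', '2012-10-05', 'Blue', 'Large', 70),  -- wrong statusA
        ('A3', '2012-10-06', 'Red',  'Large', 62);  -- qualifies, but old <> 81 today
""")

def matching_accounts(conn, start_date, end_date):
    """Scan the whole date range in one query instead of one query per day."""
    sql = """
        SELECT DISTINCT b.account, b.money
        FROM Ret AS b
        JOIN Ret_Wh AS a
          ON a.account = b.account
         AND a.datadate >= ? AND a.datadate < ?
         AND a.statusA = 'Red' AND a.statusB = 'Large'
         AND a.old BETWEEN 61 AND 80
        WHERE b.old = 81
        ORDER BY b.account
    """
    return conn.execute(sql, (start_date, end_date)).fetchall()

rows = matching_accounts(conn, "2012-10-01", "2012-11-01")
print(rows)
```

DISTINCT collapses the two qualifying history rows for A1 into one result row, which is what the #temp table plus SELECT DISTINCT did in the loop version.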

Related

postgreSQL select interval and fill blanks

I'm working on a system to manage the problems in different projects.
I have the following tables:
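Projects
id | Description   | Country
------------------------------
1  | 3D experience | Brazil
2  | Lorem Epsum   | Chile
Problems
id | idProject | Description
------------------------------
1  | 1         | Not loading
2  | 1         | Breaking down
Problems_status
id | idProblem | Status | Start_date | End_date
-------------------------------------------------
1  | 1         | Red    | 2020-10-17 | 2020-10-25
2  | 1         | Yellow | 2020-10-25 | 2020-11-20
3  | 1         | Red    | 2020-11-20 |
4  | 2         | Red    | 2020-11-01 | 2020-11-25
5  | 2         | Yellow | 2020-11-25 | 2020-12-22
6  | 2         | Red    | 2020-12-22 | 2020-12-23
7  | 2         | Green  | 2020-12-23 |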
In the example above, problem 1 is still Red and problem 2 is Green (the current status has no end date).
I need to create a chart, shown when the user selects a specific project, of the status of its problems over the weeks (starting with the week of the first registered problem). The chart of project 1 should look like this:
I'm trying to write a query in PostgreSQL that returns a table like this, so that I can populate the chart:
Week  | Green | Yellow | Red
-----------------------------
42/20 | 0     | 0      | 1
43/20 | 0     | 0      | 1
44/20 | 0     | 1      | 0
...   | ...   | ...    | ...
04/21 | 1     | 0      | 1
I've been trying multiple ways but just can't figure out how to do that. Could someone help me, please?
Below is a db-fiddle to help:
CREATE TABLE projects (
id serial NOT NULL,
description character varying(50) NOT NULL,
country character varying(50) NOT NULL,
CONSTRAINT projects_pkey PRIMARY KEY (id)
);
CREATE TABLE problems (
id serial NOT NULL,
id_project integer NOT NULL,
description character varying(50) NOT NULL,
CONSTRAINT problems_pkey PRIMARY KEY (id),
CONSTRAINT problems_id_project_fkey FOREIGN KEY (id_project)
REFERENCES projects (id) MATCH SIMPLE
);
CREATE TABLE problems_status (
id serial NOT NULL,
id_problem integer NOT NULL,
status character varying(50) NOT NULL,
start_date date NOT NULL,
end_date date,
CONSTRAINT problems_status_pkey PRIMARY KEY (id),
CONSTRAINT problems_status_id_problem_fkey FOREIGN KEY (id_problem)
REFERENCES problems (id) MATCH SIMPLE
);
INSERT INTO projects (description, country) VALUES ('3D experience','Brazil');
INSERT INTO projects (description, country) VALUES ('Lorem Epsum','Chile');
INSERT INTO problems (id_project ,description) VALUES (1,'Not loading');
INSERT INTO problems (id_project ,description) VALUES (1,'Breaking down');
INSERT INTO problems_status (id_problem, status, start_date, end_date) VALUES
(1, 'Red', '2020-10-17', '2020-10-25'),(1, 'Yellow', '2020-10-25', '2020-11-20'),
(1, 'Red', '2020-11-20', NULL),(2, 'Red', '2020-11-01', '2020-11-25'),
(2, 'Yellow', '2020-11-25', '2020-12-22'),(2, 'Red', '2020-12-22', '2020-12-23'),
(2, 'Green', '2020-12-23', NULL);
If I understood correctly, your goal is to produce a weekly tally by problem status for a particular project over a specific time period (the earliest date in the database through the current date). Further, if a problem status spans several weeks, it should be included in each week's tally. That involves two time ranges, the weeks of the report period and the status start/end dates, and checking them for overlap. Call A any one week in the report period and B the start/end range of a status. Allowing that A must end within the reporting period but B need not, the overlap scenarios are:
A starts, B starts, A ends, B ends. B overlaps end of A.
A starts, B starts, B ends, A ends. B totally contained within A.
B starts, A starts, B ends, A ends. B overlaps start of A.
B starts, A starts, A ends, B ends. A totally enclosed within B.
Fortunately, Postgres provides functionality that handles all of the above, so the query does not have to validate each case individually: the daterange type and the overlap operator (&&). The difficult work then becomes defining each week within A. Then apply the overlap operator to the daterange of each week in A against the daterange for B (start_date, end_date), and do conditional aggregation over each overlap detected. See full example here.
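For readers who want to see why one test covers all four scenarios, here is a minimal Python sketch of the overlap check on half-open ranges. The function name and the sample dates are assumptions made for illustration; it mirrors the idea behind the && operator rather than its actual implementation.

```python
from datetime import date

def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open ranges [a_start, a_end) and [b_start, b_end) overlap.

    b_end=None stands for an open-ended status (no end date yet).
    """
    b_ends_after_a_starts = b_end is None or a_start < b_end
    return b_start < a_end and b_ends_after_a_starts

# Week A: Monday 2020-10-19 up to (not including) Monday 2020-10-26
A = (date(2020, 10, 19), date(2020, 10, 26))

scenarios = {
    "B overlaps end of A":   (date(2020, 10, 22), date(2020, 11, 2)),
    "B contained within A":  (date(2020, 10, 20), date(2020, 10, 23)),
    "B overlaps start of A": (date(2020, 10, 15), date(2020, 10, 21)),
    "A enclosed within B":   (date(2020, 10, 1), date(2020, 11, 30)),
    "B ends as A starts":    (date(2020, 10, 12), date(2020, 10, 19)),
    "B open-ended":          (date(2020, 10, 20), None),
}

results = {name: overlaps(*A, *b) for name, b in scenarios.items()}
for name, hit in results.items():
    print(f"{name}: {hit}")
```

Only the "B ends as A starts" case is False, because the upper bound of a half-open range is excluded.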
with problem_list( problem_id ) as
-- identify the specific problem_ids desired
(select ps.id
from projects p
join problems ps on(ps.id_project = p.id)
where p.id = &selected_project
) --select * from problem_list;
, report_period(srange, erange) as
-- generate the first day of week (Mon) for the
-- oldest start date through day of week of Current_Date
(select min(first_of_week(ps.start_date))
, first_of_week(current_date)
from problems_status ps
join problem_list pl
on (pl.problem_id = ps.id_problem)
) --select * from report_period;
, weekly_calendar(wk,yr, week_dates) as
-- expand the start, end date ranges to week dates (Mon-Sun)
-- and identify the week number with year
(select extract( week from mon)::integer wk
, extract( isoyear from mon)::integer yr
, daterange(mon, mon+6, '[]'::text) wk_dates
from (select generate_series(srange,erange, interval '7 days')::date mon
from report_period
) d
) -- select * from weekly_calendar;
, status_by_week(yr,wk,status) as
-- determine where problem start_date, end_date overlaps each calendar week
-- then where multiple statuses exist for any week keep only the latest
( select yr,wk,status
from (select wc.yr,wc.wk,ps.status
-- , ps.start_date, wc.week_dates,id_problem
, row_number() over (partition by ps.id_problem,yr,wk order by yr, wk, start_date desc) rn
from problems_status ps
join problem_list pl on (pl.problem_id = ps.id_problem)
join weekly_calendar wc on (wc.week_dates && daterange(ps.start_date,ps.end_date)) -- actual overlap test
) ac
where rn=1
) -- select * from status_by_week order by wk;
select 'Project ' || p.id || ': ' || p.description Project
, to_char(wk,'fm09') || '/' || substr(to_char(yr,'fm0000'),3) "WK"
, "Red", "Yellow", "Green"
from projects p
cross join (select sbw.yr,sbw.wk
, count(*) filter (where sbw.status = 'Red') "Red"
, count(*) filter (where sbw.status = 'Yellow') "Yellow"
, count(*) filter (where sbw.status = 'Green') "Green"
from status_by_week sbw
group by sbw.yr, sbw.wk
) sr
where p.id = &selected_project
order by yr,wk;
The CTEs and main query operate as follows:
problem_list: Identifies the problems (id_problem) related to the specified project.
report_period: Identifies the full reporting period, start to end.
weekly_calendar: Generates the beginning date (Mon) and ending date (Sun) for each week within the reporting period (A above). Along the way it also gathers the week of the year and the ISO year.
status_by_week: This is the real workhorse, performing two tasks. First it matches each problem against each week in the calendar and builds a row for each overlap detected. Then it enforces the "one status" rule: where a problem has several statuses in a week, only the one with the latest start_date is kept.
Finally, the main select aggregates the statuses into the appropriate buckets and adds the syntactic sugar of the project name.
Note the function first_of_week(). This is a user-defined function, available in the example and below. I created it some time ago and have found it useful. You are free to use it, but you do so without any claim of suitability or guarantee.
create or replace
function first_of_week(date_in date)
returns date
language sql
immutable strict
/*
* Given a date return the first day of the week according to ISO-8601
*
* ISO-8601 Standard (in short)
* 1 All weeks begin on Monday.
* 2 All Weeks have exactly 7 days.
* 3 First week of any year is the Monday on or before 4-Jan.
* This implies that the last few days on Dec may be in the
* first week of the following year and that the first few
* days of Jan may be in week 52 (or 53) of the prior year.
* (Not at the same time obviously.)
*
*/
as $$
with wk_adj(l_days) as (values (array[0,1,2,3,4,5,6]))
select date_in - l_days[ extract (isodow from date_in)::integer ]
from wk_adj;
$$;
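The same Monday-of-the-week calculation can be sketched in Python for checking dates by hand. This is an illustrative equivalent, not part of the answer's fiddle; it relies on date.isoweekday(), which returns 1 for Monday through 7 for Sunday, just like isodow.

```python
from datetime import date, timedelta

def first_of_week(date_in):
    """Return the Monday of the ISO-8601 week containing date_in."""
    return date_in - timedelta(days=date_in.isoweekday() - 1)

# The oldest start_date in the sample data falls in ISO week 42 of 2020
oldest = date(2020, 10, 17)
monday = first_of_week(oldest)
iso_year, iso_week = tuple(monday.isocalendar())[:2]
print(monday, iso_year, iso_week)

# Early January can belong to the last ISO week of the prior year
jan3 = date(2021, 1, 3)
print(first_of_week(jan3), tuple(jan3.isocalendar())[:2])
```

The first call yields 2020-10-12, the Monday that opens week 42, which matches the first row of the expected output table.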
In the example I have implemented the query as a SQL function, as it seems db<>fiddle has issues with bound variables and substitution variables. Besides, that gave me the ability to parameterize it (I hate hard-coded values). For the example I added additional data for extra testing, mostly data that will not be selected, and an additional status, Pink, to see what happens when the query encounters something other than those 3 status values. This is easy to remove: just get rid of OTHER.
Your observation that "the daterange is covering mon-mon, instead of mon-sun" is incorrect, although it would appear that way to someone not used to looking at ranges. Let's take week 43. If you queried the date range it would show [2020-10-19,2020-10-26), and yes, both of those dates are Mondays. However, the bracketing characters have meaning. The leading character [ says the date is included and the trailing character ) says the date is not included. A containment test such as:
somedate <@ daterange('2020-10-19', '2020-10-26', '[)')
is the same as
somedate >= '2020-10-19' and somedate < '2020-10-26'
This is why, when you changed the increment from "mon+6" to "mon+5", you fixed week 43 but introduced errors into other weeks.
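A small Python sketch can make the bracket notation concrete. The helper names are hypothetical; the point is only that an inclusive Monday-to-Sunday week and a half-open Monday-to-next-Monday range describe the same seven days.

```python
from datetime import date, timedelta

def canonical_week(monday):
    """Inclusive [mon, mon+6] written in half-open form [mon, mon+7)."""
    return monday, monday + timedelta(days=7)

def contains(rng, d):
    """Membership test for a half-open range: lower <= d < upper."""
    lower, upper = rng
    return lower <= d < upper

wk43 = canonical_week(date(2020, 10, 19))
print(wk43)                                # lower and upper are both Mondays
print(contains(wk43, date(2020, 10, 25)))  # Sunday of week 43
print(contains(wk43, date(2020, 10, 26)))  # Monday of week 44
```

The Sunday is inside the range and the following Monday is not, even though the upper bound printed is that Monday.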
You can fill in blanks using COALESCE to select the first non-null value in the list.
SELECT COALESCE(<some_value_that_could_be_null>, <some_value_that_will_not_be_null>);
If you want to force the bounds of your time range into a result set you can UNION your result set with a specific date.
SELECT ... -- your data query here
UNION ALL
SELECT end_ts -- WHERE end_ts is a timestamptz type
In order to UNION, the unioned query must return the same number of fields with the same types. You can fill in everything other than the timestamp with NULL cast to whichever type matches.
More concrete example:
WITH data AS -- get raw data
(
SELECT p.id
, ps.status
, ps.start_date
, COALESCE(ps.end_date, CURRENT_DATE, '01-01-2025'::DATE) -- you can fill in NULL values with COALESCE
, pj.country
, pj.description
, MAX(start_date) OVER (PARTITION BY p.id) AS latest_update
FROM problems p
JOIN projects pj ON (pj.id = p.id_project)
JOIN problems_status ps ON (p.id = ps.id_problem)
UNION ALL -- force bounds in the following
SELECT NULL::INTEGER -- could be null or a defaulted value
, NULL::TEXT -- could be null or a defaulted value
, start_date -- either as an input param to a function or a hard-coded date
, end_date -- either as an input param to a function or a hard-coded date
, NULL::TEXT
, NULL::TEXT
, NULL::DATE
) -- aggregate in the following
SELECT <week> -- you'll have to figure out how you're getting weeks out of the DATE data
, COUNT(*) FILTER (WHERE status = 'Red')
, COUNT(*) FILTER (WHERE status = 'Yellow')
, COUNT(*) FILTER (WHERE status = 'Green')
FROM data
WHERE start_date = latest_update
GROUP BY <week>
;
Some of the features used in this query are very powerful and you should look them up if they're new to you and you are going to be doing a bunch of reporting queries. Mainly coalesce, common table expressions (CTE), window functions, and aggregate expressions.
Aggregate Expressions
WITH Queries (CTEs)
COALESCE
Window Functions
I wrote a dbfiddle for you to take a look at here after you updated your requirements.

Looping SQL query - PostgreSQL

I'm trying to get a query to loop through a set of pre-defined integers:
I've made the query very simple for this question.. This is pseudo code as well obviously!
my_id = 0
WHILE my_id < 10
SELECT * from table where id = :my_id
my_id += 1
END
I know that for this query I could just do something like where id < 10.. But the actual query I'm performing is about 60 lines long, with quite a few window statements all referring to the variable in question.
It works, and gets me the results I want when I have the variable set to a single figure.. I just need to be able to re-run the query 10 times with different variables hopefully ending up with one single set of results.
So far I have this:
CREATE OR REPLACE FUNCTION stay_prices ( a_product_id int ) RETURNS TABLE (
pid int,
pp_price int
) AS $$
DECLARE
nights int;
nights_arr INT[] := ARRAY[1,2,3,4];
j int;
BEGIN
j := 1;
FOREACH nights IN ARRAY nights_arr LOOP
-- query here..
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql;
But I'm getting this back:
ERROR: query has no destination for result data
HINT: If you want to discard the results of a SELECT, use PERFORM instead.
So do I need to get my query to SELECT ... INTO the returning table somehow? Or is there something else I can do?
EDIT: this is an example of the actual query I'm running:
\x auto
\set nights 7
WITH x AS (
SELECT
product_id, night,
LAG(night, (:nights - 1)) OVER (
PARTITION BY product_id
ORDER BY night
) AS night_start,
SUM(price_pp_gbp) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS pp_price,
MIN(spaces_available) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_spaces_available,
MIN(period_date_from) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_period_date_from,
MAX(period_date_to) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS max_period_date_to
FROM products_nightlypriceperiod pnpp
WHERE
spaces_available >= 1
AND min_group_size <= 1
AND night >= '2016-01-01'::date
AND night <= '2017-01-01'::date
)
SELECT
product_id as pid,
CASE WHEN x.pp_price > 0 THEN x.pp_price::int ELSE null END as pp_price,
night_start as from_date,
night as to_date,
(night-night_start)+1 as duration,
min_spaces_available as spaces
FROM x
WHERE
night_start = night - (:nights - 1)
AND min_period_date_from = night_start
AND max_period_date_to = night;
That will get me all the :nights-night periods available for all my products in 2016, along with the price for each period and the maximum number of spaces I could fill in that period.
I'd like to be able to run this query to get all the periods available between 2 and 30 days for all my products.
This is likely to produce a table with millions of rows. The plan is to re-create this table periodically to enable a very quick look up of what's available for a particular date. The products_nightlypriceperiod represents a night of availability of a product - e.g. Product X has 3 spaces left for Jan 1st 2016, and costs £100 for the night.
Why use a loop? You can do something like this (using your first query):
with params as (
select generate_series(1, 10) as id
)
select t.*
from params cross join
table t
where t.id = params.id;
You can modify params to have the values you really want. Then just use cross join and let the database "do the looping."
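Here is a minimal runnable sketch of the same "join against a set of parameters" idea, using Python's built-in sqlite3 module. SQLite does not ship generate_series by default, so a recursive CTE stands in for it here; the items table and its rows are invented for illustration, and the range 0 to 9 follows the pseudo-code loop in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table standing in for the real one
    CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO items VALUES (0, 'zero'), (3, 'three'), (9, 'nine'), (12, 'twelve');
""")

sql = """
    WITH RECURSIVE params(id) AS (   -- stand-in for generate_series(0, 9)
        SELECT 0
        UNION ALL
        SELECT id + 1 FROM params WHERE id < 9
    )
    SELECT t.id, t.label
    FROM params
    JOIN items AS t ON t.id = params.id
    ORDER BY t.id
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

One statement returns a single combined result set for every parameter value, which is what the loop was trying to assemble row by row.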

How to create a function that loops through another function in PostgreSQL?

I'm using PostgreSQL 9.3.9 and I have a procedure called list_all_upsells that takes in the beginning of a month and the end of a month. (see sqlfiddle.com/#!15/abd02 for sample data) For example, the below code would list the count of upselled accounts for the month of October:
select COUNT(up.*) as "Total Upsell Accounts in October" from
list_all_upsells('2015-10-01 00:00:00'::timestamp, '2015-10-31 23:59:59'::timestamp) as up
where up.user_id not in
(select distinct user_id from paid_users_no_more
where concat(extract(month from payment_stop_date),'-',extract(year from payment_stop_date))<>
concat(extract(month from payment_start_date),'-',extract(year from payment_start_date)));
The list_all_upsells procedure looks like this:
DECLARE
payor_email_2 text;
BEGIN
FOR payor_email_2 in select distinct payor_email from paid_users LOOP
return query
execute
'select paid_users.* from paid_users,
(
select payment_start_date as first_time from paid_users
where payor_email = $3
order by payment_start_date limit 1
) as dummy
where payor_email = $3
and payment_start_date > first_time
and payment_start_date between $1 and $2
and first_time < $1'
using a, b, payor_email_2;
END LOOP;
return;
END
I want to be able to run this for all months that we have records and query the data together in one table like this:
Month | Total Upselled Accounts
---------------------------------
08/2014 | 23
09/2014 | 35
ETC...
10/2015 | 56
I have a query to grab the first of each month and last of each month for the months we have been in business:
select distinct date_trunc('month', payment_start_date)::date as startmonth
from paid_users ORDER BY startmonth;
Last of month:
SELECT distinct (date_trunc('MONTH', payment_start_date) +
INTERVAL '1 MONTH - 1 day')::date as endmonth from paid_users
ORDER BY endmonth;
Now how would I create a function to loop through list_all_upsells and grab the count for each of these months? I.e. the first query for startmonth gives me 2014-03-01, 2014-04-01, ... to 2015-10-01, whereas the second query for endmonth gives me 2014-03-31, 2014-04-30, ... to 2015-10-31. I want to run list_all_upsells on each of these months so that I can get an aggregate count each month of how many upselled accounts we have.
My paid_users table looks like this:
CREATE TABLE paid_users
(
user_id integer,
user_email character varying(255),
payor_id integer,
payor_email character varying(255),
payment_start_date timestamp without time zone DEFAULT now()
)
paid_users_no_more:
CREATE TABLE paid_users_no_more
(
user_id integer,
payment_stop_date timestamp without time zone DEFAULT now()
)
You have a couple of issues with your function, so let's start there. The short of it is that (1) you need only a single parameter to indicate the month, using beginning and ending of the month is setting yourself up for problems; (2) you do not need a dynamic query because you are not changing identifiers (table or column names); (3) you do not need a loop; and (4) your logic is wrong. I could also mention that PostgreSQL uses functions and that they all start with a line like CREATE FUNCTION list_all_upsells(...) but that would be just too picky.
To start with the logic: Apparently a user identified by his email address takes out a subscription from a certain payment_start_date until a certain payment_stop_date and can do this multiple times. You are looking for those users who took out their first subscription before the month in question, and who started a new subscription in the month in question but not a first subscription. In that case the filter payment_start_date > first_time is useless because you already filter for a first subscription being prior to the month in question (first_time < $1) and a new subscription (payment_start_date BETWEEN $1 AND $2).
Points (1), (2) and (3) really only become obvious when rewriting the query inside the function:
CREATE FUNCTION list_all_upsells(timestamp) RETURNS SETOF paid_users AS $$
SELECT paid_users.*
FROM paid_users
JOIN ( -- This JOIN keeps only those rows where the payor_email has a prior subscription
SELECT DISTINCT payor_email,
first_value(payment_start_date) OVER (PARTITION BY payor_email ORDER BY payment_start_date) AS dummy
FROM paid_users
WHERE payment_start_date < date_trunc('month', $1)
) dummy USING (payor_email)
-- This filter keeps only those rows with new subscriptions in the month
WHERE date_trunc('month', payment_start_date) = date_trunc('month', $1)
$$ LANGUAGE sql STRICT;
Since the body of the function has reduced to a single SQL statement, the function is now a sql language function, which is more efficient than plpgsql. You now supply only a single parameter, which can be any moment in the month you want the data for, so list_all_upsells(LOCALTIMESTAMP) will give you the results for the current month. In terms of the query you posted it would be:
SELECT count(up.*) AS "Total Upsell Accounts in October"
FROM list_all_upsells(LOCALTIMESTAMP) up
WHERE up.user_id NOT IN
(SELECT DISTINCT user_id FROM paid_users_no_more
WHERE date_trunc('month', payment_stop_date) <>
date_trunc('month', up.payment_start_date)
);
This, incidentally, really begs the question why you have the table paid_users_no_more. Why not simply add a column payment_stop_date to table paid_users? Where that column is NULL the user is still subscribed. But the whole query is rather odd, because list_all_upsells() returns new subscriptions during the month, so why bother with cancelled subscriptions at some other time?
Now on to your real question:
SELECT months.m "Month", coalesce(count(up.*), 0) "Total Upselled Accounts"
FROM generate_series('2014-08-01'::timestamp,
date_trunc('month', LOCALTIMESTAMP),
'1 month') AS months(m)
LEFT JOIN list_all_upsells(months.m) AS up ON date_trunc('month', payment_start_date) = m
GROUP BY 1
ORDER BY 1;
Generate a series of months from some starting month until the current month, then count the new subscriptions for each month, possibly 0.
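The reason for starting from the generated month series, rather than from the data, is that months with no upsells still appear with a count of 0. A small Python sketch of that pattern follows; the function names and event dates are made up for illustration.

```python
from collections import Counter
from datetime import date

def month_starts(first, last):
    """Yield the first day of every month from first through last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def counts_per_month(event_dates, first, last):
    """Count events per month, keeping months that have no events at all."""
    hits = Counter(d.replace(day=1) for d in event_dates)
    return [(m, hits.get(m, 0)) for m in month_starts(first, last)]

# Hypothetical upsell dates
events = [date(2014, 8, 5), date(2014, 8, 20), date(2014, 10, 1)]
report = counts_per_month(events, date(2014, 8, 1), date(2014, 11, 15))
for month, total in report:
    print(month.strftime("%m/%Y"), "|", total)
```

September and November have no events in the sample, yet both show up in the report with 0, which is the role the LEFT JOIN plays in the SQL version.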
SQLFiddle

TSQL get the Prev and Next ID on a list

Let's say I have a table Sales
SaleID int
UserID int
Field1 varchar(10)
Created Datetime
and right now I have loaded and am viewing the record with SaleID = 23.
What's the right way to find out, using a stored procedure, the PREVIOUS and NEXT SaleID values relative to the current SaleID = 23 that belong to me (UserID = 1)?
I could do a
SELECT TOP 1 *
FROM Sales
WHERE SaleID > 23 AND UserID = 1
and the same for SaleID < 23 but that's 2 SQL calls.
Is there a better way?
I'm using the SQL Server 2012.
You can get the previous/next SaleID (or any other field) by using the LAG() and LEAD() functions introduced in SQL Server 2012.
For example:
SELECT *,
LAG(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Next
FROM Sales S
SqlFiddle
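To make the effect of PARTITION BY UserID ORDER BY SaleID concrete, here is a rough Python sketch of what LAG and LEAD compute for each row. The function name and sample rows are invented for illustration; it models the result, not how SQL Server evaluates window functions.

```python
from collections import defaultdict

def prev_next_by_user(sales):
    """Map each SaleID to (previous SaleID, next SaleID) within the same UserID.

    sales is an iterable of (sale_id, user_id) pairs; None marks a missing neighbour,
    just as LAG/LEAD return NULL at the edges of a partition.
    """
    by_user = defaultdict(list)
    for sale_id, user_id in sales:
        by_user[user_id].append(sale_id)

    result = {}
    for ids in by_user.values():
        ids.sort()  # ORDER BY SaleID inside each partition
        for i, sale_id in enumerate(ids):
            prev_id = ids[i - 1] if i > 0 else None
            next_id = ids[i + 1] if i + 1 < len(ids) else None
            result[sale_id] = (prev_id, next_id)
    return result

# Hypothetical rows: user 1 owns sales 20, 23, 27; user 2 owns 21 and 24
sales = [(20, 1), (21, 2), (23, 1), (24, 2), (27, 1)]
neighbours = prev_next_by_user(sales)
print(neighbours[23])
```

For SaleID 23 the neighbours are 20 and 27, skipping 21 and 24 because they belong to another user.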
If you omit the PARTITION BY clause from the LAG() and LEAD() functions in thepirat000's answer, you get the previous and next records by SaleID across all users.
Here is the SQL query
SELECT *,
LAG(SaleID) OVER (ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (ORDER BY SaleID) Next
FROM Sales S
The PARTITION BY clause makes these functions work within groups based on UserID, as in thepirat000's code.
If you want the next and previous records for a single row only, or at least for a small set of items, the following query can also help in terms of performance (as an answer to Eager to Learn's comment):
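select
-- closest lower SaleID; ORDER BY makes TOP 1 deterministic
(select top 1 t.SaleID from Sales t where t.SaleID < tab1.SaleID order by t.SaleID desc) as prev_id,
tab1.SaleID as current_id,
-- closest higher SaleID
(select top 1 t.SaleID from Sales t where t.SaleID > tab1.SaleID order by t.SaleID asc) as next_id
from Sales tab1 where tab1.SaleID = 2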

T-SQL Need assistance with complex join

I am really out of ideas of how to solve this issue and need some assistance - not only solution but idea of how to approach will be welcomed.
I have the following table:
TABLE Data
(
RecordID
,DateAdd
,Status
)
with sample data like this:
11 2012-10-01 OK
11 2012-10-04 NO
11 2012-11-05 NO
22 2012-10-01 OK
33 2012-11-01 NO
33 2012-11-15 OK
And this table with the following example data:
TABLE Periods
(
PeriodID
,PeriodName
,DateStart
,DateEnd
)
1 October 2012-10-01 2012-10-31
2 November 2012-11-01 2012-11-30
What I need to do, is to populate a new table:
TABLE DataPerPeriods
(
PeriodID,
RecordID,
Status
)
That will store all possible combinations of PeriodID and RecordID, with the latest status for the period if available. If no status is available for the given period, then the status from previous periods. If there is no previous status at all, then NULL for status.
For example with the following data I need something like this:
1 11 NO //We have status "OK" and "NO", but "NO" is latest for the period
1 22 OK
1 33 NULL //Because there are no records for this and previous periods
2 11 NO //We get the previous status as there are no records in this period
2 22 OK //There are no records for this period, but a record for the last period is available
2 33 OK //We have status "OK" and "NO", but "OK" is latest for the period
EDIT: I have already populated the period IDs and the record IDs in the last table; I need more help with the status update.
There might be a better way to do this. But this is the most straight-forward path I know to get what you're looking for, unconventional as it appears. For larger datasets you may have to change your approach:
SELECT p.PeriodID, td.RecordID, statusData.[Status] FROM Periods p
CROSS JOIN (SELECT DISTINCT RecordID FROM Data) td
OUTER APPLY (SELECT TOP 1 [Status], [DateAdd]
FROM Data
WHERE [DateAdd] <= p.DateEnd
AND [RecordID] = td.RecordID
ORDER BY [DateAdd] DESC) statusData
ORDER BY p.PeriodID, td.RecordID
The CROSS JOIN is what gives you every possible combination of Periods and DISTINCT RecordIDs.
The OUTER APPLY selects the latest Status on or before the end of each Period, or NULL when there is none.
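As a cross-check of that logic against the sample data in the question, here is a minimal Python sketch of the same two steps: pair every period with every distinct record, then pick the latest status dated on or before the period end. The function name is an assumption; None stands in for NULL.

```python
from datetime import date
from itertools import product

# Sample rows from the question: (RecordID, DateAdd, Status)
data = [
    (11, date(2012, 10, 1), "OK"),
    (11, date(2012, 10, 4), "NO"),
    (11, date(2012, 11, 5), "NO"),
    (22, date(2012, 10, 1), "OK"),
    (33, date(2012, 11, 1), "NO"),
    (33, date(2012, 11, 15), "OK"),
]
# (PeriodID, DateStart, DateEnd)
periods = [
    (1, date(2012, 10, 1), date(2012, 10, 31)),
    (2, date(2012, 11, 1), date(2012, 11, 30)),
]

def status_per_period(data, periods):
    record_ids = sorted({rid for rid, _, _ in data})       # SELECT DISTINCT RecordID
    rows = []
    for (pid, _start, end), rid in product(periods, record_ids):  # CROSS JOIN
        candidates = [(added, status) for r, added, status in data
                      if r == rid and added <= end]
        # TOP 1 ... ORDER BY DateAdd DESC, or NULL when nothing qualifies
        latest = max(candidates, key=lambda c: c[0])[1] if candidates else None
        rows.append((pid, rid, latest))
    return rows

result = status_per_period(data, periods)
for row in result:
    print(row)
```

Record 33 has no rows on or before the end of period 1, so it comes back with None there, and with "OK" in period 2 because 2012-11-15 is its latest entry.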
Check out my answer on another question to know how to grab the first or last status : Aggregate SQL Function to grab only the first from each group
OK, here's an idea. Nobody likes cursors, including me, but sometimes for things like this they do come in handy.
The idea is that this cursor loops through each of the Data records, pulling out the ID as an identifier. Inside the loop it finds a single data record and gets the count of a join that meets your criteria.
If @Count = 0, the condition is not met and you should not insert a record for that period.
If @Count > 0, the condition is met, so insert a record for the period.
If these conditions need to be updated frequently, you can add your query to a job and run it every minute or hour ... what have you.
Hope this helps.
DECLARE @ID int
DECLARE merge_cursor CURSOR FAST_FORWARD FOR
select recordID
from data
OPEN merge_cursor
FETCH NEXT FROM merge_cursor INTO @ID
WHILE @@FETCH_STATUS = 0
BEGIN
--get join if record is found in the periods
declare @Count int
select @Count= count(*)
from data a inner join periods b
on a.[dateadd] between b.datestart and b.dateend
where a.recordID = @ID
if @Count>0
--insert into DataPerPeriods(PeriodID, RecordID, Status)
select b.periodid, a.recordid, a.status
from data a inner join periods b on a.[dateadd] between b.datestart and b.dateend --between beginning of month and end of month
where a.recordid = @ID
else
--insert into DataPerPeriods(PeriodID, RecordID, Status)
select b.periodid, a.recordid, a.status
from data a inner join periods b on a.[dateadd] < b.dateend
where a.recordID = @ID --fix this area
FETCH NEXT FROM merge_cursor INTO @ID
END
CLOSE merge_cursor
DEALLOCATE merge_cursor