Determine time overlap using hash in SAS

I have an SQL script in SAS that detects time overlaps. Because the table is very big, I first created a smaller copy of it and defined some new variables to make lookups into the table easier.
data data.orders_small(drop=VALID_TO VALID_FROM DATE_OPEN DATE_CLOSE);
set data.orders(keep=ID_ACCOUNT ID_CLIENT VALID_TO VALID_FROM DATE_OPEN DATE_CLOSE);
length VALID_FROM_D_S VALID_FROM_DM1_S VALID_TO_D_S VALID_TO_DP1_S 8;
format VALID_FROM_D_S VALID_FROM_DM1_S VALID_TO_D_S VALID_TO_DP1_S date9.;
VALID_FROM_D_S=intnx('day',VALID_FROM,+0,'S');
VALID_FROM_DM1_S=intnx('day',VALID_FROM,-1,'S');
VALID_TO_D_S=intnx('day',VALID_TO,+0,'S');
VALID_TO_DP1_S=intnx('day',VALID_TO,+1,'S');
run;
/* Checks existence of previous interval with valid_to = to the valid_from of the currently checked interval
where currently checked interval is not the first one */
proc sql noprint;
create table orders_test_1 as
select id_client, id_account, valid_from, valid_to
from data.orders a
where a.VALID_FROM GT a.DATE_OPEN
and not exists( select 1
from data.orders_small b
where a.id_account EQ b.id_account
and a.id_client EQ b.id_client
and a.VALID_FROM EQ b.VALID_TO_DP1_S );
quit;
/* Checks existence of following interval with valid_from = to the valid_to of the currently checked interval
where currently checked interval is not the last one */
proc sql noprint;
create table orders_test_2 as
select id_client, id_account, valid_from, valid_to
from data.orders a
where VALID_TO NE '31DEC3000'd
and VALID_TO LT DATE_CLOSE
and not exists ( select 1
from data.orders_small b
where a.id_account EQ b.id_account
and a.id_client EQ b.id_client
and a.VALID_TO EQ b.VALID_FROM_DM1_S);
quit;
/* Existence of overlaps */
proc sql;
create table orders_test_3 as
select id_client, id_account
from data.orders a
where exists ( select 1
from data.orders_small b
where a.id_account EQ b.id_account
and a.id_client EQ b.id_client
and ((a.VALID_FROM between b.VALID_FROM_D_S and b.VALID_TO_D_S)
or (a.VALID_TO between b.VALID_FROM_D_S and b.VALID_TO_D_S))
and not (a.VALID_FROM EQ b.VALID_FROM_D_S and a.VALID_TO EQ b.VALID_TO_D_S) );
quit;
Sample of the data:
data data.orders;
Length ID_ACCOUNT 4. ID_CLIENT 4. VALID_FROM 8. VALID_TO 8. DATE_OPEN 8. DATE_CLOSE 8.
;
informat VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE date9.;
format VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE date9.;
input ID_ACCOUNT ID_CLIENT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE;
datalines;
001 001 01MAR1993 31DEC3000 01MAR1993 31DEC3000
002 002 01MAR1997 15MAY2001 01MAR1997 31DEC3000
002 002 16MAY2001 25JUN2011 01MAR1997 31DEC3000
002 002 24JUN2001 16JUL2012 01MAR1997 31DEC3000
002 002 16MAY2001 16JUL2011 01MAR1997 31DEC3000
;
run;
Running this code takes a lot of time, so I thought of using a hash object in SAS to make it faster, but I couldn't make it work. Any ideas how this code could be converted to use a hash?
Thank you in advance!

Can you just lag the VALID_TO variable and compare it with the VALID_FROM variable in a data step rather than using a hash? If the lagged VALID_TO is greater than the current row's VALID_FROM, then you have an overlap.
DATA ORDERS;
LENGTH ID_ACCOUNT 4. ID_CLIENT 4. VALID_FROM 8. VALID_TO 8.;
INFORMAT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE DATE9.;
FORMAT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE DATE9.;
INPUT ID_ACCOUNT ID_CLIENT VALID_FROM VALID_TO DATE_OPEN DATE_CLOSE;
DATALINES;
001 001 01MAR1993 31DEC2000 01MAR1993 31DEC2000
002 002 01MAR1997 15MAY2001 01MAR1997 31DEC2000
002 002 16MAY2001 25JUN2011 01MAR1997 31DEC2000
002 002 24JUN2001 16JUL2012 01MAR1997 31DEC2000
002 002 16MAY2001 16JUL2011 01MAR1997 31DEC2000
;
RUN;
PROC EXPAND DATA = ORDERS OUT = ORDERS_LAG METHOD=NONE;
BY ID_ACCOUNT ID_CLIENT;
CONVERT VALID_TO = VT_LAG / TRANSFORMOUT=(LAG 1);
RUN;
DATA WANT; SET ORDERS_LAG;
IF VT_LAG > VALID_FROM THEN OVERLAP = 1;
ELSE OVERLAP = 0;
RUN;

Since you are not licensed for PROC EXPAND, try the following (the data must be sorted by ID_ACCOUNT and ID_CLIENT):
DATA WANT; SET ORDERS;
BY ID_ACCOUNT ID_CLIENT;
LAG_VT = LAG(VALID_TO);
/* Reset the lag at the start of each account/client group */
IF FIRST.ID_ACCOUNT OR FIRST.ID_CLIENT THEN LAG_VT = .;
/* Reuse LAG_VT: a second LAG() call would keep its own separate queue */
IF LAG_VT NE . AND LAG_VT > VALID_FROM THEN OVERLAP = 1;
ELSE OVERLAP = 0;
RUN;
It performs the same lag, but you have to account for following rows that start a new account and/or client. PROC EXPAND just does this a little more cleanly and adds a count variable for each unique group.
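For readers who want to check the lag logic outside SAS, here is a minimal Python sketch of the same idea. It is only an illustration under assumptions: the function name flag_overlaps and the tuple layout are made up for this example, and instead of a plain one-row lag it keeps the latest VALID_TO seen so far in the group (a running maximum), so that intervals nested inside an earlier one are flagged too.

```python
from datetime import date
from itertools import groupby

def flag_overlaps(rows):
    """rows: iterable of (id_account, id_client, valid_from, valid_to).
    Returns a list of (row, overlap) pairs, with overlap being 0 or 1."""
    ordered = sorted(rows)  # by account, client, valid_from, valid_to
    flagged = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        latest_end = None  # plays the role of the lagged VALID_TO
        for row in group:
            valid_from, valid_to = row[2], row[3]
            # Same comparison as the data step: lagged end > current start
            overlap = int(latest_end is not None and latest_end > valid_from)
            flagged.append((row, overlap))
            if latest_end is None or valid_to > latest_end:
                latest_end = valid_to
    return flagged

# Sample rows from the question (DATE_OPEN / DATE_CLOSE left out)
orders = [
    (1, 1, date(1993, 3, 1), date(3000, 12, 31)),
    (2, 2, date(1997, 3, 1), date(2001, 5, 15)),
    (2, 2, date(2001, 5, 16), date(2011, 6, 25)),
    (2, 2, date(2001, 6, 24), date(2012, 7, 16)),
    (2, 2, date(2001, 5, 16), date(2011, 7, 16)),
]
flags = [overlap for _, overlap in flag_overlaps(orders)]
```

Because the rows are sorted first, the flags come back in sorted order, not in input order.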

Related

Count users with more than X amount of transactions within Y days by date

Scenario: I am trying to count the more active users for a time series analysis.
Need: With PostgreSQL (Redshift), count the customers that have more than X unique transactions within Y days of a given date, grouped by date.
How do I achieve this?
Table: orders
date        user_id  product_id  transaction_id
2022-01-01  001      003         001
2022-01-02  002      001         002
2022-03-01  003      001         003
2022-03-01  003      002         003
...         ...      ...         ...
Outcome:
date        active_customers
2022-01-01  10
2022-01-02  12
2022-01-03  9
2022-01-04  13
You may be able to use the window functions LEAD() and LAG() here but this solution may also work for you.
WITH data AS
(
SELECT o.date
, o.user_id
, COUNT(o.transaction_id) tcount
FROM orders o
WHERE o.date BETWEEN o.date - '30 DAYS'::INTERVAL AND o.date -- Y days from given date
GROUP BY o.date, o.user_id
), user_transaction_count AS
(
SELECT d.date
, COUNT(d.user_id) FILTER (WHERE d.tcount > 1) -- X number of transactions
OVER (PARTITION BY d.user_id) user_count
FROM data d
)
SELECT u.date
, SUM(u.user_count) active_customers
FROM user_transaction_count u
GROUP BY u.date
ORDER BY u.date
;
Here is a DBFiddle that demos a couple options.
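To make the requirement itself concrete (more than X distinct transactions in the Y days up to and including each date), here is a small Python sketch. It is not a translation of the query above, just a brute-force illustration; the function name active_customers, the tuple layout, and the extra sample row with transaction "004" are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, timedelta

def active_customers(orders, min_transactions, window_days):
    """orders: list of (order_date, user_id, transaction_id).
    For every date present in the data, count the users with more than
    min_transactions distinct transactions in the window
    [order_date - window_days, order_date]."""
    per_user = defaultdict(set)  # user_id -> {(order_date, transaction_id)}
    for order_date, user_id, transaction_id in orders:
        per_user[user_id].add((order_date, transaction_id))

    result = {}
    for day in sorted({o[0] for o in orders}):
        start = day - timedelta(days=window_days)
        count = 0
        for seen in per_user.values():
            in_window = {t for d, t in seen if start <= d <= day}
            if len(in_window) > min_transactions:
                count += 1
        result[day] = count
    return result

sample = [
    (date(2022, 1, 1), "001", "001"),
    (date(2022, 1, 2), "002", "002"),
    (date(2022, 1, 20), "001", "004"),  # hypothetical extra row
    (date(2022, 3, 1), "003", "003"),
    (date(2022, 3, 1), "003", "003"),   # same transaction, second product
]
counts = active_customers(sample, min_transactions=1, window_days=30)
```

Here only 2022-01-20 has an active customer: user 001 has two distinct transactions inside the 30-day window, while the two rows of user 003 belong to the same transaction and count once.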

PostgreSQL: GROUP BY and ORDER BY, whole dataset as a result

In a Postgres database I have a table with the following columns:
ID (Primary Key)
Code
Date
I'm trying to extract data ordered by Date and grouped by Code so that the most recent date will determine what code rows should be grouped first and so forth (if it makes sense). An example:
007 2022-01-04
007 2022-01-01
007 2021-12-19
002 2022-01-03
002 2021-12-02
002 2021-11-15
035 2022-01-01
035 2021-11-30
035 2021-05-03
001 2021-12-31
022 2021-12-07
076 2021-11-19
I thought I could achieve this with the following query:
SELECT * FROM Table
GROUP BY Table.Code
ORDER BY Table.Date DESC
but this gives me
ERROR: column "Table.ID" must appear in the GROUP BY clause or be used in an aggregate function
And if I add the column ID to the GROUP BY the result I get is just a list ordered by Date with all the Codes mixed.
Is there any way to achieve what I want?
Edit 3
More elegant solution using max over partition by.
SELECT
"Code",
"Date"
FROM
"Table"
ORDER BY
max("Date") over (partition by "Code") DESC,
"Table"."Date" DESC
;
Output:
Code  Date
007   2022-01-04T00:00:00Z
007   2022-01-01T00:00:00Z
007   2021-12-19T00:00:00Z
002   2022-01-03T00:00:00Z
002   2021-12-02T00:00:00Z
002   2021-11-15T00:00:00Z
035   2022-01-01T00:00:00Z
035   2021-11-30T00:00:00Z
035   2021-05-03T00:00:00Z
001   2021-12-31T00:00:00Z
022   2021-12-07T00:00:00Z
076   2021-11-19T00:00:00Z
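The ordering rule behind the window-function query (sort by each code's most recent date, then by the row's own date) can also be checked outside the database. The following Python sketch is only an illustration: the function name order_by_latest_group is made up, and using the code itself as a tie-breaker between groups that share the same latest date is an addition that the SQL above does not have.

```python
from datetime import date

def order_by_latest_group(rows):
    """rows: list of (code, day). Groups with the most recent date come
    first; inside a group, rows are ordered by date descending."""
    latest = {}  # code -> most recent day, like max("Date") OVER (PARTITION BY "Code")
    for code, day in rows:
        if code not in latest or day > latest[code]:
            latest[code] = day
    # The code in the middle of the key keeps groups with equal maxima together
    return sorted(rows, key=lambda r: (latest[r[0]], r[0], r[1]), reverse=True)

rows = [
    ("002", date(2021, 12, 2)),
    ("007", date(2022, 1, 1)),
    ("035", date(2022, 1, 1)),
    ("007", date(2022, 1, 4)),
    ("002", date(2022, 1, 3)),
    ("001", date(2021, 12, 31)),
]
ordered_codes = [code for code, _ in order_by_latest_group(rows)]
```

With these rows, the 007 group comes first because it holds the latest date overall, followed by 002, 035 and 001.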
Edit 2:
I join the subquery "b" with the entire table. "b" is used only for sorting, and it is essentially the query you tried.
With "b" as
( select
"Code",
max("Date") as "Date"
from
"Table"
group by
"Code"
)
SELECT
"Table"."Code",
"Table"."Date"
FROM
"Table" left join "b" on "Table"."Code" = "b"."Code"
ORDER BY
"b"."Date" desc,
"Table"."Date" DESC;
Output:
Code  Date
007   2022-01-04T00:00:00Z
007   2022-01-01T00:00:00Z
007   2021-12-19T00:00:00Z
002   2022-01-03T00:00:00Z
002   2021-12-02T00:00:00Z
002   2021-11-15T00:00:00Z
035   2022-01-01T00:00:00Z
035   2021-11-30T00:00:00Z
035   2021-05-03T00:00:00Z
001   2021-12-31T00:00:00Z
022   2021-12-07T00:00:00Z
076   2021-11-19T00:00:00Z
Edit1
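With GROUP BY, every column in the select list must either appear in the GROUP BY clause or be used in an aggregate function; that is what the error is telling you.
The example below shows a way to avoid the error on your data.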
Table with ID:
CREATE TABLE "Table" (
"ID" serial not null primary key,
"Code" varchar,
"Date" timestamp
);
INSERT INTO "Table"
("Code", "Date")
VALUES
('007', '2022-01-04 00:00:00'),
('007', '2022-01-01 00:00:00'),
('007', '2021-12-19 00:00:00'),
('002', '2022-01-03 00:00:00'),
('002', '2021-12-02 00:00:00'),
('002', '2021-11-15 00:00:00'),
('035', '2022-01-01 00:00:00'),
('035', '2021-11-30 00:00:00'),
('035', '2021-05-03 00:00:00'),
('001', '2021-12-31 00:00:00'),
('022', '2021-12-07 00:00:00'),
('076', '2021-11-19 00:00:00')
;
Select:
SELECT * FROM "Table" ORDER BY "Code", "Date" DESC;
Output:
ID  Code  Date
10  001   2021-12-31T00:00:00Z
4   002   2022-01-03T00:00:00Z
5   002   2021-12-02T00:00:00Z
6   002   2021-11-15T00:00:00Z
1   007   2022-01-04T00:00:00Z
2   007   2022-01-01T00:00:00Z
3   007   2021-12-19T00:00:00Z
11  022   2021-12-07T00:00:00Z
7   035   2022-01-01T00:00:00Z
8   035   2021-11-30T00:00:00Z
9   035   2021-05-03T00:00:00Z
12  076   2021-11-19T00:00:00Z
Original Answer
First, select the column you want to group by (Code) and the column you want to apply an aggregate function to (Date).
Second, list the grouping column in the GROUP BY clause.
In the ORDER BY clause, use the same expressions as in the select list.
https://www.postgresqltutorial.com/postgresql-group-by/
Tables:
CREATE TABLE "Table"
("Code" int, "Date" timestamp)
;
INSERT INTO "Table"
("Code", "Date")
VALUES
(007, '2022-01-04 00:00:00'),
(007, '2022-01-01 00:00:00'),
(007, '2021-12-19 00:00:00'),
(002, '2022-01-03 00:00:00'),
(002, '2021-12-02 00:00:00'),
(002, '2021-11-15 00:00:00'),
(035, '2022-01-01 00:00:00'),
(035, '2021-11-30 00:00:00'),
(035, '2021-05-03 00:00:00'),
(001, '2021-12-31 00:00:00'),
(022, '2021-12-07 00:00:00'),
(076, '2021-11-19 00:00:00')
;
Select
SELECT
"Table"."Code",
max("Table"."Date")
FROM
"Table"
GROUP BY
"Table"."Code"
ORDER BY
max("Table"."Date") DESC
Output:
Code  max
7     2022-01-04T00:00:00Z
2     2022-01-03T00:00:00Z
35    2022-01-01T00:00:00Z
1     2021-12-31T00:00:00Z
22    2021-12-07T00:00:00Z
76    2021-11-19T00:00:00Z

Finding the total duration of multiple overlapping "start date" and "end date" entries, before a cut-off date?

I have a list of subjects with multiple overlapping entries in the following format:
ID startdate stopdate cutoffdate
1 101 07MAR2014 07MAR2014 14MAR2014
2 105 30MAR2017 03APR2017 07APR2017
3 105 03APR2017 09APR2017 07APR2017
I have previously used SAS to count the total duration for each subject. I used the code described in the SAS documentation here, and adapted in another SO question here. The output using this method would be 1 day for subject 101 and 11 days for subject 105.
Now I have a cut-off date in the far right column. I want my code to disregard days beyond this; i.e. the output would then become 1 day for subject 101 and 9 days for subject 105.
How do I calculate the duration of these overlapping date entries for each subject, but disregard any dates which fall beyond the cut-off date?
Code from prior answer:
data want;
set have;
by id;
retain episode;
start_date = input(start_date, yymmdd10.);
end_date = input(stopdate, yymmdd10.);
prev_stop_date = lag(stopDate);
if first.id then do;
episode = 0;
call missing(prev_stop_date);
end;
if not (start_date <=prev_stop_date <= end_date) then episode+1;
*could add in logic to calculate dates and durations as well depending....;
run;
I see a solution with some extra logic for calculating the duration correctly when dates overlap:
data test;
input ID startdate : date9. stopdate : date9. cutoffdate : date9.;
format startdate stopdate cutoffdate date9.;
datalines;
101 07MAR2014 07MAR2014 14MAR2014
105 30MAR2017 03APR2017 07APR2017
105 03APR2017 09APR2017 07APR2017
;
run;
proc sort data=test;
by ID startdate;
data want (keep=ID datediff);
set test;
by ID startdate;
retain startd stopd datediff;
if first.ID then do;
startd = startdate;
stopd = stopdate;
if stopdate LT cutoffdate then datediff=stopdate - startdate + 1;
else datediff=cutoffdate - startdate + 1;
end;
else do;
if startdate LE stopd and startdate GE startd then
startd = stopd;
if stopdate GE startd and stopdate LE cutoffdate then
stopd = stopdate;
else if stopdate GE startdate and stopdate GT cutoffdate then
stopd = cutoffdate;
datediff = datediff + stopd - startd;
end;
if last.ID then output;
run;
This code, of course, could be optimized. Please check my logic!
This code produces:
ID datediff
----------------
101 1
105 9
Generally what I'd do is create a new stopdate variable which was defined as
stopdate_cut = min(stopdate,cutoffdate);
Then your original code will work, just with this new variable in place of stopdate. Make sure to also test startdate: presumably you want to drop the entire row if startdate is later than cutoffdate (a "where startdate le cutoffdate" is probably the easiest way).
Just to be clear, the original code didn't calculate durations, so I'll add that in here:
data final;
set want;
by id episode;
if first.episode then duration=1;
duration+(stopdate-startdate);
if last.episode then output;
run;
That gives 1 and 11. You might need slightly more code depending on your data.
To add the cutoff, simply add these two lines (the where here isn't doing anything in the example data, but it could be needed.)
data final;
set want;
by id episode;
where startdate le cutoffdate;
stopdate_cut = min(stopdate,cutoffdate);
if first.episode then duration=1;
duration+(stopdate_cut-startdate);
if last.episode then output;
run;
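The clipping idea, min(stopdate, cutoffdate), can be tried out in a few lines of Python. This is only a sketch under assumptions: the function name covered_days is made up, both interval ends are treated as inclusive, and overlapping intervals are merged before the days are counted, which is a common way to avoid double counting and not the exact episode logic of the SAS steps above.

```python
from datetime import date

def covered_days(intervals, cutoff):
    """intervals: list of (start, stop) dates, both inclusive.
    Returns the number of distinct days covered on or before cutoff."""
    clipped = sorted(
        (start, min(stop, cutoff))   # same idea as stopdate_cut
        for start, stop in intervals
        if start <= cutoff           # same idea as: where startdate le cutoffdate
    )
    total = 0
    current_start = current_stop = None
    for start, stop in clipped:
        if current_stop is not None and start <= current_stop:
            # Overlaps the running block: extend it
            current_stop = max(current_stop, stop)
        else:
            if current_stop is not None:
                total += (current_stop - current_start).days + 1
            current_start, current_stop = start, stop
    if current_stop is not None:
        total += (current_stop - current_start).days + 1
    return total

subject_105 = [
    (date(2017, 3, 30), date(2017, 4, 3)),
    (date(2017, 4, 3), date(2017, 4, 9)),
]
days_105 = covered_days(subject_105, cutoff=date(2017, 4, 7))
days_101 = covered_days([(date(2014, 3, 7), date(2014, 3, 7))],
                        cutoff=date(2014, 3, 14))
```

For the sample data this gives 9 days for subject 105 and 1 day for subject 101, matching the numbers asked for in the question.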
When dealing with possibly overlapping date ranges within a group there is also a possibility you have some gaps.
Because dates are integers in a limited domain you can use a temporary array that is key-indexed with the date and paint values across the range. At the end of the group the number of values in the array is the number of days that fell within a range.
Example:
* generate some data;
data have;
call streaminit(2020);
length id start_date end_date limit_date step 8;
do id = 1 to 20;
end_date = '01jan2015'd;
limit_date = '30sep2016'd;
do _n_ = 1 to rand('integer', 1, 7);
step = rand('integer', -10,60); * data generator diagnostic;
range = rand('integer', 30); * data generator diagnostic;
start_date = end_date + step;
end_date = start_date + range;
output;
end;
end;
format start_date end_date limit_date date9.;
run;
* d1 to d2 should cover all expected dates;
%let d1 = %sysevalf("01JAN2000"D);
%let d2 = %sysevalf("31DEC2100"D);
* evaluate the date range coverage for each id;
data want;
array dates[&d1:&d2] _temporary_; * the canvas onto which values are painted;
do until (last.id);
set have;
by id;
do _n_ = start_date to end_date;
if _n_ > limit_date then leave;
if dates(_n_) = 1 then overlap_days+1;
dates(_n_) + 1; * paint on that canvas;
end;
ranges_count + 1;
end;
* compute total range and portions;
do _n_ = lbound(dates) to hbound(dates);
if missing(dates(_n_)) then continue;
days + 1;
if missing(all_start_date) then all_start_date =_n_;
all_end_date = _n_;
if _n_ > all_start_date and dif(_n_) > 1 then gaps + 1;
end;
all_days = all_end_date - all_start_date + 1;
coverage = days / all_days;
OUTPUT;
call missing(of _all_, of dates(*));
format all_start_date all_end_date date9. coverage percent6.2;
keep id days all_: ranges_count overlap_days gaps cover:;
run;
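The "painting" technique is easy to prototype in Python with a dictionary keyed by day instead of a temporary array. This sketch is an illustration only: the function name paint_coverage is made up, it returns just three of the measures, and it counts a gap as any jump of more than one day between consecutive painted days.

```python
from datetime import date, timedelta

def paint_coverage(intervals, limit):
    """intervals: list of (start, end) dates, inclusive; days after limit
    are ignored. Returns (days, overlap_days, gaps)."""
    painted = {}  # day -> number of intervals covering it (the canvas)
    for start, end in intervals:
        day = start
        while day <= min(end, limit):
            painted[day] = painted.get(day, 0) + 1  # paint on the canvas
            day += timedelta(days=1)
    days = len(painted)
    overlap_days = sum(1 for n in painted.values() if n > 1)
    ordered = sorted(painted)
    gaps = sum(1 for a, b in zip(ordered, ordered[1:]) if (b - a).days > 1)
    return days, overlap_days, gaps

ranges = [
    (date(2018, 5, 12), date(2018, 5, 18)),
    (date(2018, 5, 15), date(2018, 5, 20)),
    (date(2018, 5, 25), date(2018, 5, 28)),
]
summary = paint_coverage(ranges, limit=date(2018, 7, 1))
```

For these three ranges the result is 13 covered days, 4 days covered more than once, and 1 gap.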
Here more possible conditions are set:
data have;
input ID startdate : date9. stopdate : date9. cutoffdate : date9.;
format startdate stopdate cutoffdate date9.;
datalines;
101 07MAR2014 07MAR2014 14MAR2014
105 30MAR2017 03APR2017 07APR2017
105 03APR2017 09APR2017 07APR2017
106 12MAY2018 18MAY2018 01JUL2018
106 15MAY2018 20MAY2018 01JUL2018
106 25MAY2018 28MAY2018 01JUL2018
107 01JAN2005 09JAN2005 01FEB2005
107 05JAN2005 20JAN2005 01FEB2005
107 16JAN2005 18JAN2005 01FEB2005
107 26JAN2005 31JAN2005 01FEB2005
;
run;
First, the cut-off date has to be respected, so min(stopdate,cutoffdate) is used. Second, a period that lies completely within the previous record has to be skipped. Third, if startdate is not later than the previous stopdate, counting has to start on the day after the previous stopdate; that is the '_stop+1' in the ifn function.
data want;
set have ;
by id startdate notsorted;
retain total;
_start=lag(startdate);_stop=lag(stopdate);
if first.id then total=min(stopdate,cutoffdate)-startdate+1;
else do;
if _start<=startdate and stopdate<=_stop then return;
total=total+min(stopdate,cutoffdate)-ifn(_stop<startdate,startdate,_stop+1)+1;
end;
if last.id then output;
drop _:;
run;
The SAS System
Obs ID startdate stopdate cutoffdate total
1 101 07MAR2014 07MAR2014 14MAR2014 1
2 105 03APR2017 09APR2017 07APR2017 9
3 106 25MAY2018 28MAY2018 01JUL2018 13
4 107 26JAN2005 31JAN2005 01FEB2005 26

SAS: Separate date_from & date_to into separate lines

I've got an example like this:
data date_table;
stop;
length id $32.;
length name $32.;
length date_from date_to 8.;
format date_from date_to datetime19.;
run;
proc sql;
insert into date_table
values ('1', 'Mark', '13Jun2019 08:39:00'dt, '13Jun2019 11:39:00'dt)
values ('2', 'Bart', '13Jun2019 13:39:00'dt, '13Jun2019 17:39:00'dt);
quit;
I need some smart join (maybe with separate hour mapping table) to achieve something like this:
What I have tried so far is using a mapping table and a join like this:
proc sql;
create table testing as
select t1.id,
t1.name,
t1.date_from,
t1.date_to
from DATE_TABLE t1 inner join
WORK.CAL_TIME t2 on t1.date_from >= t2.Time and
t1.date_to <= t2.Time;
quit;
But of course the result is an empty table because the dates don't match in the join. I could truncate date_from and date_to to full hours, but such a join still doesn't work.
Any help is appreciated.
Looks like you are comparing apples (DATETIME) with oranges (TIME). The orders of magnitude of those numbers are totally different.
data _null_;
dt = '13Jun2019 08:39:00'dt ;
tm = '08:00't ;
put (dt tm) (=comma20.);
run;
dt=1,876,034,340 tm=28,800
You probably just want to compare the time of day part of your datetime values to your time values. Also round your start times down and your end times up to the hour.
data date_table;
length id name $32 date_from date_to 8;
format date_from date_to datetime19.;
input id name (date:) (:datetime.);
cards;
1 Mark 13Jun2019:08:39:00 13Jun2019:11:39:00
2 Bart 13Jun2019:13:39:00 13Jun2019:17:39:00
;
data cal_time;
do time='08:00't to '21:00't by '01:00't ;
output;
end;
format time time5.;
run;
proc sql;
create table testing as
select t1.id
, t1.name
, max(t1.date_from,dhms(datepart(t1.date_from),0,0,t2.time))
as datetime_from format=datetime19.
, min(t1.date_to,dhms(datepart(t1.date_to),0,0,t2.time+'01:00't))
as datetime_to format=datetime19.
, t2.time
from DATE_TABLE t1
inner join WORK.CAL_TIME t2
on t2.time between intnx('hour',timepart(t1.date_from),0,'b')
and intnx('hour',timepart(t1.date_to),0,'e')
;
quit;
Result
Obs id name datetime_from datetime_to time
1 1 Mark 13JUN2019:08:39:00 13JUN2019:09:00:00 8:00
2 1 Mark 13JUN2019:09:00:00 13JUN2019:10:00:00 9:00
3 1 Mark 13JUN2019:10:00:00 13JUN2019:11:00:00 10:00
4 1 Mark 13JUN2019:11:00:00 13JUN2019:11:39:00 11:00
5 2 Bart 13JUN2019:13:39:00 13JUN2019:14:00:00 13:00
6 2 Bart 13JUN2019:14:00:00 13JUN2019:15:00:00 14:00
7 2 Bart 13JUN2019:15:00:00 13JUN2019:16:00:00 15:00
8 2 Bart 13JUN2019:16:00:00 13JUN2019:17:00:00 16:00
9 2 Bart 13JUN2019:17:00:00 13JUN2019:17:39:00 17:00
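The same hourly split can be reproduced without a mapping table, which helps to sanity-check the join result. The Python sketch below is only an illustration; the function name split_by_hour is an assumption, and it cuts the interval at every full clock hour instead of joining against a list of times.

```python
from datetime import datetime, timedelta

def split_by_hour(date_from, date_to):
    """Split [date_from, date_to] into pieces that each stay within one clock hour."""
    pieces = []
    start = date_from
    while start < date_to:
        # Round the current start down to the hour, then step to the next hour
        hour_start = start.replace(minute=0, second=0, microsecond=0)
        end = min(hour_start + timedelta(hours=1), date_to)
        pieces.append((start, end))
        start = end
    return pieces

mark = split_by_hour(datetime(2019, 6, 13, 8, 39), datetime(2019, 6, 13, 11, 39))
bart = split_by_hour(datetime(2019, 6, 13, 13, 39), datetime(2019, 6, 13, 17, 39))
```

For Mark this yields four pieces (08:39 to 09:00, 09:00 to 10:00, 10:00 to 11:00, 11:00 to 11:39) and for Bart five, the same rows as in the result above.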

Select a row when data is missing

I have a question.
I use this straightforward query to retrieve data on a daily basis. Part of the data is an ID.
For example, I have IDs 001, 002, 003 and 004. Every ID has some columns with data.
Every day I generate a report based on that data.
A typical day looks like this:
ID Date Value
001 2013-07-02 900
002 2013-07-02 800
003 2013-07-02 750
004 2013-07-02 950
Select *
FROM
myTable
WHERE datum > now() - INTERVAL '2 days' and ht not in (select ht from blocked_ht)
order by ht, id;
Sometimes the import for one ID fails, so my data looks like this:
ID Date Value
001 2013-07-02 900
003 2013-07-02 750
004 2013-07-02 950
It's vital to know that an ID is missing and to have that visualized in my report (made in JasperReports).
So I inserted a row for each ID without a date and with value 0, and edited the query:
SELECT *
FROM
"lptv_import" lptv_import
WHERE datum > now() - INTERVAL '2 days' and ht not in (select ht from negeren_ht) OR datum IS NULL
order by ht, id;
Now the data looks like this:
001 2013-07-02 900
002 800
003 2013-07-02 750
004 2013-07-02 950
How can I select the row without the date from the table only WHEN the row for ID 002 WITH a date is missing?
Hmm, this looks more complicated than I thought...
select
ids.id, coalesce(lptv.datum::text, 'NULL') as "date", lptv."value"
from
(
select distinct id
from lptv
) ids
left join
lptv on lptv.id = ids.id
-- filters live in the ON clause so that ids without a recent row are kept
and lptv.datum > now() - INTERVAL '2 days'
and not exists (select 1 from negeren_ht where negeren_ht.ht = lptv.ht)
order by ids.id
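The underlying idea (start from the full list of IDs and fill in a placeholder wherever no dated row exists) can be sketched in Python as well. This is only an illustration; the function name fill_missing_ids and the (id, date, value) tuple layout are assumptions, and the placeholder uses value 0 as in the question.

```python
def fill_missing_ids(expected_ids, rows):
    """rows: list of (id, date, value) for the reporting period.
    Returns the rows sorted by id, with an (id, None, 0) placeholder
    for every expected id that has no row."""
    present = {row[0] for row in rows}
    placeholders = [(i, None, 0) for i in expected_ids if i not in present]
    return sorted(rows + placeholders, key=lambda r: r[0])

imported = [
    ("001", "2013-07-02", 900),
    ("003", "2013-07-02", 750),
    ("004", "2013-07-02", 950),
]
report = fill_missing_ids(["001", "002", "003", "004"], imported)
```

Here ID 002 shows up in the report with no date and value 0, while the other rows are passed through unchanged.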