I am trying to group data by column in SAS, the way a pivot table does in Excel. I'm trying to produce the following desired output:
Problem 11/17/2019-11/23/2019
INC 25
SA 15
VV 10
I have tried PROC SQL but am not sure how to group column-wise by date ranges like the one above. Let me know if you need additional information; I have also attached an image.
Some options for creating output that displays frequency counts of categorical data in tabular form and pivoting data itself:
Proc TABULATE
Proc REPORT
Proc FREQ
Proc TRANSPOSE - do you really want data (weeks) as metadata (column names)?
SQL (arduous)
Suppose your data has columns visitId, date, and SS:
data visits;
call streaminit(1234);
do date = '01jan2019'd to '31dec2019'd;
do _n_ = 1 to 5 + rand('uniform', 11); /* between 5 and 15 ss codes a day */
visitId + 1;
length ss $5;
ss = scan ("CS,FALL,ELBOW,ANKLE,LS,PS,SA,VV",ceil(rand('uniform',8)));
output;
end;
end;
format date yymmdd10.;
run;
Compute a new variable containing the week of the visit. This variable is used as a bucket for aggregate grouping.
data have;
set visits;
weekof = intnx('week', date, 0); * compute bucket value for aggregation over weeks;
attrib weekof format=mmddyy10. label='Week of';
run;
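To get a column label like the 11/17/2019-11/23/2019 header in the question, the week bucket can be turned into a start-end string. This is a minimal sketch, assuming weeks run Sunday through Saturday (the default for the 'week' interval); the variable names week_start, week_end and week_label are made up for illustration.

```sas
data _null_;
  date = '20nov2019'd;                          /* a Wednesday */
  week_start = intnx('week', date, 0);          /* Sunday that starts the week */
  week_end   = intnx('week', date, 0, 'end');   /* Saturday that ends the week */
  /* hypothetical label variable: start and end joined by a dash */
  week_label = catx('-', put(week_start, mmddyy10.), put(week_end, mmddyy10.));
  put week_label=;                              /* week_label=11/17/2019-11/23/2019 */
run;
```

A character label like this sorts alphabetically rather than chronologically, so it is probably better used for display only while the numeric weekof stays the grouping variable.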
Use a procedure to generate output:
proc tabulate data=have;
title 'Tabulate - weeks are columns';
where year(weekof) = 2019 and month(weekof) = 11; * restrict to a single month;
class ss weekof;
table ss='', weekof * n=''; * column dimension is weekof (one column per weekof value);
run;
proc tabulate data=have;
title 'Tabulate - weeks are rows';
where year(weekof) = 2019 and qtr(weekof) = 4;
class ss weekof;
table weekof, ss=''*n='' / nocellmerge; * row dimension is weekof (one row per weekof value);
table weekof='', ss=''*n='' / box='Week of'; * row dimension is weekof (one row per weekof value);
run;
proc report data=have split='A0'x;
title 'Report - weeks are columns';
where year(weekof) = 2019 and month(weekof) = 11; * restrict to a single month;
column ss weekof;
define weekof / across;
define ss / group;
run;
proc freq data=have;
title 'Freq - weeks are columns';
where year(weekof) = 2019 and month(weekof) = 11; * restrict to a single month;
table ss * weekof / norow nocol nocum nopercent;
run;
TRANSPOSE
Compute counts over SS and week, transpose that
proc sql;
create table have_counts as
select ss, weekof, count(*) as freq
from have
group by ss, weekof
order by ss, weekof
;
quit;
proc transpose data=have_counts out=have_across_week(drop=_name_);
where year(weekof) = 2019 and month(weekof) = 11; * restrict to a single month;
by ss;
id weekof;
var freq;
run;
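The counting step does not have to be SQL. As a sketch of one alternative, PROC FREQ can write the same ss by weekof counts to a data set through its OUT= option, and that data set can then be transposed. The output names have_counts_freq and have_across_week_freq are made up here so they do not overwrite the tables above.

```sas
proc freq data=have noprint;
  where year(weekof) = 2019 and month(weekof) = 11;  /* same single-month filter */
  /* OUT= holds one row per ss*weekof combination, with the count in COUNT */
  tables ss * weekof / out=have_counts_freq(keep=ss weekof count);
run;

proc transpose data=have_counts_freq out=have_across_week_freq(drop=_name_);
  by ss;        /* the OUT= data set is already ordered by ss, weekof */
  id weekof;    /* formatted week values become the column names */
  var count;
run;
```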
SQL
Writing SQL code for pivoting is tedious and error prone. It is also not automatically extensible when new dates come into the data. Having lots of similar statements (those SUMs) is known as wallpaper code, and who likes hanging wallpaper?
proc sql;
create table ss_freq_across_weeks as
select
ss
, sum ( intnx('week', date, 0) = '03-NOV-2019'D ) as week1 label = 'Week of 11/03/2019'
, sum ( intnx('week', date, 0) = '10-NOV-2019'D ) as week2 label = 'Week of 11/10/2019'
, sum ( intnx('week', date, 0) = '17-NOV-2019'D ) as week3 label = 'Week of 11/17/2019'
/*...*/
from have
group by ss
;
quit;
I have a list of subjects with multiple overlapping entries in the following format:
Obs ID startdate stopdate cutoffdate
1 101 07MAR2014 07MAR2014 14MAR2014
2 105 30MAR2017 03APR2017 07APR2017
3 105 03APR2017 09APR2017 07APR2017
I have previously used SAS to count the total duration for each subject. I used the code described in the SAS documentation here, and adapted in another SO question here. The output using this method would be 1 day for subject 101 and 11 days for subject 105.
Now I have a cut-off date in the far right column. I want my code to disregard days beyond this; i.e. the output would then become 1 day for subject 101 and 9 days for subject 105.
How do I calculate the duration of these overlapping date entries for each subject, but disregard any dates which fall beyond the cut-off date?
Code from prior answer:
data want;
set have;
by id;
retain episode;
start_date = input(startdate, yymmdd10.); * convert the character startdate to a SAS date;
end_date = input(stopdate, yymmdd10.);
prev_stop_date = lag(end_date); * stop date of the previous record;
if first.id then do;
episode = 0;
call missing(prev_stop_date);
end;
if not (start_date <=prev_stop_date <= end_date) then episode+1;
*could add in logic to calculate dates and durations as well depending....;
run;
Here is a solution with some logic for correctly calculating the duration when dates overlap:
data test;
input ID startdate : date9. stopdate : date9. cutoffdate : date9.;
format startdate stopdate cutoffdate date9.;
datalines;
101 07MAR2014 07MAR2014 14MAR2014
105 30MAR2017 03APR2017 07APR2017
105 03APR2017 09APR2017 07APR2017
;
run;
proc sort data=test;
by ID startdate;
data want (keep=ID datediff);
set test;
by ID startdate;
retain startd stopd datediff;
if first.ID then do;
startd = startdate;
stopd = stopdate;
if stopdate LT cutoffdate then datediff=stopdate - startdate + 1;
else datediff=cutoffdate - startdate + 1;
end;
else do;
if startdate LE stopd and startdate GE startd then
startd = stopd;
if stopdate GE startd and stopdate LE cutoffdate then
stopd = stopdate;
else if stopdate GE startdate and stopdate GT cutoffdate then
stopd = cutoffdate;
datediff = datediff + stopd - startd;
end;
if last.ID then output;
run;
This code, of course, could be optimized. Please check my logic!
Such code produces:
ID datediff
----------------
101 1
105 9
Generally what I'd do is create a new stopdate variable which was defined as
stopdate_cut = min(stopdate,cutoffdate);
Then your original code will work (just with this new variable). Make sure to also test startdate: presumably you want to drop the entire row if startdate is after cutoffdate (a where startdate le cutoffdate statement might be the easiest way).
Just to be clear, the original code didn't calculate durations, so I'll add that in here:
data final;
set want;
by id episode;
if first.episode then duration=1;
duration+(stopdate-startdate);
if last.episode then output;
run;
That gives 1 and 11. You might need slightly more code depending on your data.
To add the cutoff, simply add these two lines (the where here isn't doing anything in the example data, but it could be needed.)
data final;
set want;
by id episode;
where startdate le cutoffdate;
stopdate_cut = min(stopdate,cutoffdate);
if first.episode then duration=1;
duration+(stopdate_cut-startdate);
if last.episode then output;
run;
When dealing with possibly overlapping date ranges within a group there is also a possibility you have some gaps.
Because dates are integers in a limited domain you can use a temporary array that is key-indexed with the date and paint values across the range. At the end of the group the number of values in the array is the number of days that fell within a range.
Example:
* generate some data;
data have;
call streaminit(2020);
length id start_date end_date limit_date step 8;
do id = 1 to 20;
end_date = '01jan2015'd;
limit_date = '30sep2016'd;
do _n_ = 1 to rand('integer', 1, 7);
step = rand('integer', -10,60); * data generator diagnostic;
range = rand('integer', 30); * data generator diagnostic;
start_date = end_date + step;
end_date = start_date + range;
output;
end;
end;
format start_date end_date limit_date date9.;
run;
* d1 to d2 should cover all expected dates;
%let d1 = %sysevalf("01JAN2000"D);
%let d2 = %sysevalf("31DEC2100"D);
* evaluate the date range coverage for each id;
data want;
array dates[&d1:&d2] _temporary_; * the canvas onto which values are painted;
do until (last.id);
set have;
by id;
do _n_ = start_date to end_date;
if _n_ > limit_date then leave;
if dates(_n_) = 1 then overlap_days+1;
dates(_n_) + 1; * paint on that canvas;
end;
ranges_count + 1;
end;
* compute total range and portions;
do _n_ = lbound(dates) to hbound(dates);
if missing(dates(_n_)) then continue;
days + 1;
if missing(all_start_date) then all_start_date =_n_;
all_end_date = _n_;
if _n_ > all_start_date and dif(_n_) > 1 then gaps + 1;
end;
all_days = all_end_date - all_start_date + 1;
coverage = days / all_days;
OUTPUT;
call missing(of _all_, of dates(*));
format all_start_date all_end_date date9. coverage percent6.2;
keep id days all_: ranges_count overlap_days gaps cover:;
run;
Here is test data that covers more of the possible conditions:
data have;
input ID startdate : date9. stopdate : date9. cutoffdate : date9.;
format startdate stopdate cutoffdate date9.;
datalines;
101 07MAR2014 07MAR2014 14MAR2014
105 30MAR2017 03APR2017 07APR2017
105 03APR2017 09APR2017 07APR2017
106 12MAY2018 18MAY2018 01JUL2018
106 15MAY2018 20MAY2018 01JUL2018
106 25MAY2018 28MAY2018 01JUL2018
107 01JAN2005 09JAN2005 01FEB2005
107 05JAN2005 20JAN2005 01FEB2005
107 16JAN2005 18JAN2005 01FEB2005
107 26JAN2005 31JAN2005 01FEB2005
;
run;
First, only days up to cutoffdate count, so min(stopdate,cutoffdate) is used. Second, a record whose period lies completely within the previous record adds nothing, so it is skipped. Third, if startdate is not after the previous stopdate, counting has to begin on the day after the previous stopdate; that is the '_stop+1' in the ifn function.
data want;
set have ;
by id startdate notsorted;
retain total;
_start=lag(startdate);_stop=lag(stopdate);
if first.id then total=min(stopdate,cutoffdate)-startdate+1;
else do;
if _start<=startdate and stopdate<=_stop then return;
total=total+min(stopdate,cutoffdate)-ifn(_stop<startdate,startdate,_stop+1)+1;
end;
if last.id then output;
drop _:;
run;
The SAS System
Obs ID startdate stopdate cutoffdate total
1 101 07MAR2014 07MAR2014 14MAR2014 1
2 105 03APR2017 09APR2017 07APR2017 9
3 106 25MAY2018 28MAY2018 01JUL2018 13
4 107 26JAN2005 31JAN2005 01FEB2005 26
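These totals can be cross-checked by brute force with the date-indexed temporary array idea: mark every day from startdate to min(stopdate, cutoffdate) and count the marked days per ID. This is only a sketch under assumptions: the data set name check, the array name painted and the loop variable _d are made up, and the 2000 to 2020 window is assumed to cover every date in have.

```sas
data check;
  /* one slot per calendar day; the bounds are an assumed window */
  array painted[%sysevalf('01JAN2000'd):%sysevalf('31DEC2020'd)] _temporary_;
  do until (last.id);
    set have;
    by id;
    /* mark each covered day, ignoring days after the cut-off */
    do _d = startdate to min(stopdate, cutoffdate);
      painted[_d] = 1;
    end;
  end;
  total = n(of painted[*]);        /* distinct marked days for this ID */
  call missing(of painted[*]);     /* clear the array before the next ID */
  keep id total;
run;
```

For the data above this should give the same totals as the listing (1, 9, 13 and 26), because a day covered by several records is only marked once.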
Original data:
subject medgrp stdt endt
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
Re-structured data:
subject med_pattern stdt_new endt_new
1 A*B 7/1/2014 7/31/2014
1 A*B*C 8/1/2014 8/15/2014
1 A*C 8/16/2014 8/30/2014
1 C 8/31/2014 9/1/2014
2 A 4/15/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 8/1/2014
I was able to transform the original data into the re-structured data by outputting one record per date from stdt to endt for all records, keeping one date for each subject/medgrp, re-forming the date periods and creating the variable med_pattern.
However, this method takes a long time to run, especially for big data (more than 3 million records).
Any suggestions to make this more efficient would be greatly appreciated!
For each subject you can use a date-keyed multi-data hash to track the medgrp activity for each date in the date range defined by stdt and endt. An iteration over the hash then lets you compute the medgrp crossings value.
data have;
input subject medgrp $ stdt: mmddyy10. endt: mmddyy10.;
format stdt endt mmddyy10.;
datalines;
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 A 7/15/2014 7/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
;
data crossings_by_date / view=crossings_by_date;
if 0 then set have; * prep PDV;
if _n_ = 1 then do; * declare the hashes and iterators only once;
declare hash dg(multidata:'yes', ordered:'a'); %* 1st hash for subject dates;
dg.defineKey('date');
dg.defineData('date', 'medgrp');
dg.defineDone();
call missing (date); format date adate cdate mmddyy10.;
declare hash crossing(ordered:'a'); %* 2nd hash for deduping a list of medgrps ;
crossing.defineKey('medgrp');
crossing.defineData('medgrp');
crossing.defineDone();
declare hiter dgi('dg');
declare hiter xi('crossing');
end;
dg.clear();
do _n_ = 1 by 1 until (last.subject); * process subjects one by one;
set have;
by subject;
do date = stdt to endt; * load multidata hash with medgrp over date range;
dg.add();
end;
end;
* examine each date in which subject had activity;
adate = .;
cdate = -1e9;
do _i_ = 1 by 1 while (dgi.next() = 0);
if date eq adate
then continue; * hiter over multi-data will return each node;
else adate = date; * track activity date;
* load hash to dedupe tracking of medgrp on date;
crossing.clear();
do _i_ = 1 by 1 while (dg.do_over() = 0);
crossing.replace();
end;
* compute crossing representation on date, A*B*... by traversing 2nd hash;
xi.first(); length cross $20;
cross = medgrp;
do while(0 = xi.next());
cross = catx('*',cross,medgrp);
end;
if date - cdate > 1 then cluster + 1; %* track cluster based on date continuities;
cdate = date;
output; * <------------ view OUTPUT;
end;
keep subject date cross cluster;
run;
* 2nd data step processes view (1st data step);
* determine when date continuity ends or medgrp changes;
data want;
length subject 8 medgrps $20;
format stdt endt mmddyy10.;
do _n_ = 1 by 1 until (last.medgrps);
set crossings_by_date (rename=cross=medgrps);
by cluster medgrps notsorted;
if stdt = . then
stdt = date;
end;
endt = date;
keep subject medgrps stdt endt;
run;
I have to calculate the difference between the first date (at time T0) and each later date. I also have a factor variable, B, with two categories: one and two.
For example, here is some data:
A B TIME
10/11/2016 one T0
17/11/2016 two T0
05/01/2017 one T1
28/02/2017 two T1
06/07/2017 one T2
05/09/2017 two T2
I would like to calculate the difference between T0 and the dates for B="one" and B="two" in order to obtain :
DIFF
0
0
56
103
238
292
Calculating the diff as follows :
56 = T1-T0 for "one" = 05/01/2017 - 10/11/2016
103 = T1-T0 for "two" = 28/02/2017 - 17/11/2016
238 = T2-T0 for "one" = 06/07/2017 - 10/11/2016
292 = T2-T0 for "two" = 05/09/2017 - 17/11/2016
Could you help me do it in SAS?
Thanks a lot.
One way is to pull out the TIME='T0' records and merge them back with the other records.
First let's convert your table into a dataset.
data have ;
input b $ Time $ date :yymmdd.;
format date yymmdd10.;
cards;
one T0 2016-11-10
two T0 2016-11-17
one T1 2017-01-05
two T1 2017-02-28
one T2 2017-07-06
two T2 2017-09-05
;
Now let's re-order it so that we can merge by the grouping variable, B.
proc sort ;
by b time ;
run;
Here is a way to merge the data with itself.
data want ;
merge have(where=(time ne 'T0'))
have(keep=time b date rename=(time=time0 date=date0) where=(time0='T0'))
;
by b ;
diff = date - date0;
drop time0;
run;
Results:
Obs b Time date date0 diff
1 one T1 2017-01-05 2016-11-10 56
2 one T2 2017-07-06 2016-11-10 238
3 two T1 2017-02-28 2016-11-17 103
4 two T2 2017-09-05 2016-11-17 292
There are of course several ways to do this. Below are two alternatives. The first selects the first A for each B and merges this with the original data in a SQL-step. The second uses a DATA-step and by groups. The first A within each B is saved as firsttime, and retained so it can be used to calculate the difference.
data test;
input A ddmmyy10. @12 B $3.; /* @12 is a column pointer; #12 would jump to input line 12 */
format A ddmmyy10.;
datalines;
10/11/2016 one
17/11/2016 two
05/01/2017 one
28/02/2017 two
06/07/2017 one
05/09/2017 two
;
/* Alt 1*/
proc sql;
create table test2 as
select t1.*, t1.A-t2.A as time
from test as t1 left join (select B, min(A) as A from test group by 1) as t2
on t1.B=t2.B
order by A;
quit;
/* Alt 2*/
proc sort data=test;
by B A;
run;
data test3;
set test;
by B;
retain firsttime;
if first.B then firsttime=A;
time=A-firsttime;
drop firsttime;
run;
I have two data sets, Set1 and Set2.
Set1 has a single column, Curr_Dt:
Set1
Curr_Dt
23/04/1998
01/01/2017
01/12/2018
10/10/2010
Set2 data set has 3 columns St_Dt, End_Dt, Ind
St_Dt End_Dt Ind
01/11/2018 31/12/2018 N
01/01/1998 31/05/1998 N
30/11/2016 02/02/2017 N
I want to update the Ind column of Set2 to Y if a Curr_Dt from Set1 falls between St_Dt and End_Dt of Set2.
I don't see a key to merge by, so I am assuming that the first row of Set1 goes with the first row of Set2, the second with the second, and so on down each data set.
You can do this with a simple Data Step.
data want;
merge set1 set2;
if st_dt <= curr_dt <= end_dt then
ind = 'Y';
run;
This also assumes the dates are stored as dates and not strings.
Create sets
data Set1;
length Curr_Dt $10;
input Curr_Dt;
cards;
23/04/1998
01/01/2017
01/12/2018
10/10/2010
;
run;
data Set2;
length St_Dt $10 End_Dt $10 Ind $1;
input St_Dt$ End_Dt$ Ind$;
cards;
01/11/2018 31/12/2018 N
01/01/1998 31/05/1998 N
30/11/2016 02/02/2017 N
30/11/2005 02/02/2005 N
;
run;
Convert the date strings to SAS date values
data Set1;
set Set1;
Curr = input(Curr_Dt, ddmmyy10.);
run;
data Set2;
set Set2;
St = input(St_Dt, ddmmyy10.);
End = input(End_Dt, ddmmyy10.);
run;
Set Y flag if any Curr_Dt from Set1 falls between St_Dt and End_Dt
proc sql;
create table Set2 as
select distinct St_Dt, End_Dt,
case when Set1.Curr>Set2.St and Set1.Curr<Set2.End
then 'Y' else 'N' end as Ind
from Set2
left join Set1 on Set1.Curr>Set2.St and Set1.Curr<Set2.End;
quit;
You get
01/01/1998 31/05/1998 Y
30/11/2016 02/02/2017 Y
01/11/2018 31/12/2018 Y
30/11/2005 02/02/2005 N
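If the goal is to update Ind in place rather than rebuild the table, a correlated EXISTS in an UPDATE is one option. This is a sketch, not the answer's method: it assumes it is run instead of the CREATE TABLE step above (that step does not keep the numeric St and End columns), and it treats the boundary dates as inside the range, which is an assumption about what "between" should mean.

```sas
proc sql;
  /* flag a Set2 row when at least one Set1 date lies in [St, End] */
  update Set2
    set Ind = 'Y'
    where exists (
      select *
      from Set1
      where Set1.Curr between Set2.St and Set2.End  /* inclusive bounds (assumption) */
    );
quit;
```

Rows without a matching date keep their existing Ind value, so this relies on Ind starting out as N.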
I am trying to calculate the holidays in a given year in order to load a DimDate table. SQL has an option for calculating the week-of-year number, but I couldn't find a function that calculates the week number within a month.
How can I find the week number within a given month, so that I can calculate the holidays in a year and fill the DimDate table?
Logic
CurrentDateWeek - CurrentDateMonthBeginWeek
The difference between the week number of a given date and the week number of the first day of that date's month, plus one, gives the week of the month.
Simple One Line SQL Statement
Declare @CurrentDate as datetime
set @CurrentDate = '5/31/2014'
Select (Datepart(week,@CurrentDate) - (Datepart(week,cast(month(@CurrentDate) as varchar) + '/1/' + cast(Year(@CurrentDate) as varchar))) + 1) as WeekOfMonth
Stored Procedure
Declare @CurrentDate as datetime
Declare @BeginWeek as int
Declare @EndWeek as Int
set @CurrentDate = '5/31/2014'
SET @EndWeek = DATEPART(week,eomonth(@CurrentDate))
SET @BeginWeek = Datepart(week,cast(month(@CurrentDate) as varchar) + '/1/' + cast(Year(@CurrentDate) as varchar))
Select Datepart(week,@CurrentDate) - @BeginWeek + 1
Select @BeginWeek as beginweek
select @EndWeek as endweek
Select Datepart(week,@CurrentDate)
Using the above logic, we can fill the DimDate table with holiday dates.