How to merge the given two SAS datasets

I have two SAS datasets of the following form.
Dataset D1 is as follows (a period marks a missing Date):
ID Date Amount
x1 10/12/2015 100
x2 . 200
x2 . 150
x3 10/10/2014 90
x4 . 60
Dataset D2 is of the form
ID Date
x2 10/12/2016
x4 1/1/2016
Dataset D1 can have duplicate values of ID. Dataset D2 has only unique values of ID. Further, D2 contains only those IDs from D1 that have missing values of the variable Date (x2 and x4 have Date missing in D1). I want to merge D1 with D2 such that the output is as follows
ID Date Amount
x1 10/12/2015 100
x2 10/12/2016 200
x2 10/12/2016 150
x3 10/10/2014 90
x4 1/1/2016 60
Is this doable without using proc sql in SAS? Can I use merge?
I tried the following, but to no avail (and it should not work either, because D1 has duplicate IDs)
data x;
merge D1 (in=in1) D2(in=in2);
by ID;
if in1;
run;

A DATA step merge of two datasets works fine as long as at least one of the datasets has unique IDs. The problem with your merge is that the DATE variables from each dataset will collide:
options msglevel=i;
data x;
merge D1 D2;
by ID;
put (ID Date Amount)(=);
run;
INFO: The variable Date on data set WORK.D1 will be overwritten by data set WORK.D2.
ID=x1 Date=10/12/2015 Amount=100
ID=x2 Date=10/12/2016 Amount=200
ID=x2 Date=. Amount=150
ID=x3 Date=10/10/2014 Amount=90
ID=x4 Date=01/01/2016 Amount=60
The MSGLEVEL=i option generates the INFO: line in the log which alerts you to the collision. In this case you almost get the results you want, despite the collision. The problem is the third record, where DATE is missing. This is a side-effect of having a collision in a one-to-many merge.
I would suggest you avoid the collision by renaming the DATE variable in each dataset. You can then compute a new DATE variable by using the COALESCE() function, which returns the first value that is not missing:
data want;
merge d1 (keep=ID Date Amount rename=(Date=Date1))
d2 (keep=ID Date rename=(Date=Date2))
;
by ID;
Date=coalesce(Date1,Date2);
put (ID Date1 Date2 Date Amount)(=);
format Date mmddyy10.;
run;
ID=x1 Date1=10/12/2015 Date2=. Date=10/12/2015 Amount=100
ID=x2 Date1=. Date2=10/12/2016 Date=10/12/2016 Amount=200
ID=x2 Date1=. Date2=10/12/2016 Date=10/12/2016 Amount=150
ID=x3 Date1=10/10/2014 Date2=. Date=10/10/2014 Amount=90
ID=x4 Date1=. Date2=01/01/2016 Date=01/01/2016 Amount=60
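The step above leaves the helper variables Date1 and Date2 in the output dataset. A minimal sketch of the same merge that keeps only ID, Amount and Date is shown here; it assumes D1 and D2 are already sorted by ID and that Date is a numeric SAS date in both.

```sas
/* Sketch only: same COALESCE approach, helper variables dropped.     */
/* Assumes d1 and d2 are sorted by ID and Date is a numeric SAS date. */
data want;
  merge d1 (rename=(Date=Date1))
        d2 (rename=(Date=Date2));
  by ID;
  Date = coalesce(Date1, Date2);  /* first non-missing of the two     */
  format Date mmddyy10.;
  drop Date1 Date2;               /* keep only ID, Amount and Date    */
run;
```

Because Date2 comes from the dataset with unique IDs, its value is carried across every row of the same ID, so both x2 rows receive the date from D2.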

Is this doable without using proc sql in SAS. Can I use merge?
Yes, most SQL steps can also be written as a DATA step, though the code may end up longer or shorter.
Here is a potential solution:
data DateN DateY;
set D1;
if date=. then output DateN;
else output DateY;
run;
data merged;
merge DateN(keep=ID Amount) D2;
by id;
run;
data x;
set merged DateY;
run;
proc sort data=x;
by ID;
run;
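This assumes that D1 and D2 are sorted by ID. Because Date is dropped from DateN before the merge, the value from D2 is carried across every row of a BY group, so duplicate IDs among the rows with missing dates are filled in as well.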

Related

Copy a dataset B for each variable in dataset A

I have the two following datasets:
Dataset A:
ID
A
B
C
Dataset B:
Age
35
49
53
And I want to copy paste B to each ID of A:
ID Age
A 35
A 49
A 53
B 35
B 49
...
For the moment I do this with a %do loop, but is there a more elegant way to do it? With a single PROC SQL or DATA step, for example?
Thanks in advance
You can use SQL to perform a Cartesian product to get all combinations.
For example:
/* setup id data */
data have1;
input id $char1.;
datalines;
A
B
C
;
/* setup age data */
data have2;
input age 8.;
datalines;
35
49
53
;
/* perform Cartesian product */
proc sql noprint;
create table
want
as
select
*
from
have1
,have2
;
quit;
There is no need for macro code. You can use the POINT= option on the SET statement to do this in a data step.
data want;
set a;
do p=1 to nobs;
set b point=p nobs=nobs;
output;
end;
run;

Outputting conditionally from merge

I want to update a history file in SAS. I have new observations, which may overlap with existing data lines.
What I need is a file that takes lines from the new dataset (new_data) where they exist and, where they do not, from the old dataset (old_data). What I've come up with is a clunky merge operation that depends on the order of the datasets: it works only if New_data is listed after Old_data.
data new_data;
input key value;
datalines;
1 10
1 11
2 20
2 21
;
run;
data old_data;
input key value;
datalines;
2 50
2 51
3 30
3 31
;
run;
So I'd like to have the following:
key value
1 10
1 11
2 20
2 21
3 30
3 31
However the following does not work. It produces the output below it.
data updated_history;
merge New_data(in=a) old_data(in=b) ;
by key;
if a or (b and not a );
run;
....
2 50
2 51
...
But for some reason this does:
data updated_history;
merge old_data(in=b) New_data(in=a);
by key;
if a or (b and not a );
run;
Question: Is there an intelligent way to control which dataset the values are selected from? Something like: if a then value_from_dataset a;
The order in which you list the datasets in the MERGE statement is the order in which the data is read. So when the order is old, new, the values from old are read first and then the values from new overwrite them. This is why your second version works and the first does not.
Since you have multiple observations per key value, you probably do NOT want to use MERGE to combine these files. You could do it using SET, reading the data twice with two DOW loops. In that case the order of the datasets in the SET statement doesn't matter, since the records are interleaved instead of being joined. The first loop determines which of the two input datasets have any observations for the current KEY value; the second loop re-reads the same BY group and skips the old records whenever new ones exist.
data want ;
anyold=0;
anynew=0;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if inold then anyold=1;
if innew then anynew=1;
end;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if not (anyold and anynew and inold) then output;
end;
drop anyold anynew;
run;
This type of combination is probably easier to code using SQL.
proc sql ;
create table want as
select key,value from new_data
union
select key,value from old_data
where key in (select key from old_data except select key from new_data)
order by 1
;
quit;

Merging SAS datasets by different column names across several columns

I have 2 datasets that I want to merge by territory number. The first dataset has territory information, including the territory number. The second dataset also has territory numbers, but they are spread across 4 different columns: drug_terr1, drug_terr2, drug_terr3, and drug_terr4. I need to merge on all 4 columns because each holds different territory numbers, and I want all of them included in the merge with the dataset that has the territory information. I tried a rename, but that didn't work because it only changed the first column. Is there a way to combine all this data and rename it by territory number so I can do the merge?
Ultimately I would like it to look like the code below, but I need to get the 4 columns from 'terrfile' to become 1 column called territory_nbr so I can merge.
%let output = E:\Horizon\Adhoc\AH\;
%let terrs =\\uslsasas1\E$\Horizon\IMS Processing\Weekly Data\20161230\Demo\;
libname terrs "&terrs.";
%let curr_process_wk = '12-30-2016';
%let curr_quarter =_q1;
**0 Grab pskw;
data pskw_data;
set PSKW.PSKWMaster ;
where week in ('12-16-2016','12-23-2016','12-30-2016','01-06-2017') and CopayType ="FBD" and FNRX=1 and pme_id in (46,42,55,38) and product in ('DUEXIS','VIMOVO','PENNSAID')
and
(COBPrimaryRejectCode1 in ('75','76') or COBPrimaryRejectCode2 in ('75', '76') or COBPrimaryRejectCode3 in ('75' , '76'));
run;
proc sort data=pskw_data;
by imsid;
run;
** 01 Grab tbl HCP;
proc sort data=ims.tblhcp (where = (week = &curr_process_wk.) keep = week imsid first_name last_name address1 address2 city state zip spec)
out = IMS_demo (drop = week);
by IMSID;
run;
** 02 Grab tbl terrs_by_imsid;
data terrfile;
set terrs.wd2_terrs_by_imsid&curr_quarter.;
run;
proc sort data = terrfile;
by imsid;
run;
** 03 Grab tbl roster;
data roster (keep = territorycode repname territoryname teamname);
set ims.tblRoster;
repname = trim(left(FirstName))||" "||trim(left(LastName));
run;
**04 link ;
data combine_dbs;
merge pskw_data (in=in1)
ims.tblhcp (in=in2);
by imsid;
if in1;
run;
data territories; ***can't merge because territory code is not in terrfile, just 4 columns as I mentioned above***;
merge terrfile (in=in1)
roster (in=in2);
by territorycode;
if in2;
run;
You need to join the fact table with the lookup table four times. Let's say your territory identifier is called ID in your lookup table and you want to take the field IMS_ID from it. Let's also assume the four fields in your fact table are named ID1-ID4.
proc sql ;
create table want as
select a.*
, b.ims_id as ims_id1
, c.ims_id as ims_id2
, d.ims_id as ims_id3
, e.ims_id as ims_id4
from FACT a
left join LU b on a.id1=b.id
left join LU c on a.id2=c.id
left join LU d on a.id3=d.id
left join LU e on a.id4=e.id
;
quit;
In your example it looks like TERRFILE is your FACT table and ROSTER is your LU table. Your ID variable looks like it is named TERRITORYCODE, at least in your lookup file, and from your description the four variables in TERRFILE are drug_terr1 through drug_terr4.
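If you would rather stay in a DATA step, as the question asks, one option is to stack the four columns into a single territorycode column first and then do an ordinary one-key merge. The sketch below is only an illustration under stated assumptions: the variable names drug_terr1-drug_terr4 come from the question, terr_long is a hypothetical dataset name, the four columns are assumed to share the same type as territorycode in ROSTER, and ROSTER is assumed to have one row per territorycode.

```sas
/* Sketch: turn drug_terr1-drug_terr4 into one territorycode column. */
/* terr_long is a hypothetical name; types are assumed to match.     */
data terr_long;
  set terrfile;
  array terr{4} drug_terr1-drug_terr4;
  do i = 1 to 4;
    if not missing(terr{i}) then do;
      territorycode = terr{i};   /* one output row per territory     */
      output;
    end;
  end;
  drop i drug_terr1-drug_terr4;
run;

proc sort data=terr_long; by territorycode; run;
proc sort data=roster;    by territorycode; run;

/* Many terr_long rows per territory, one roster row (assumed) */
data territories;
  merge terr_long (in=in1)
        roster    (in=in2);
  by territorycode;
  if in1;                        /* keep every stacked terrfile row  */
run;
```

Each IMSID then appears once per non-missing territory, with the roster fields attached where a matching territorycode exists.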

MONYY7. and DATE9. operations

I'm working on a very big dataset (more than 100 variables and 11 million observations). In this dataset, I have a variable named DTDSI (simulation date) in DATE9. format (for example: 01APR2015, 02MAR2015...). I have a macro program to analyse this dataset by comparing the observations in 2 different months:
%macro analysis (data_input , m , m_1);
.....
%mend;
The 2 macro variables m and m_1 are the months that I want to compare. Their format is MONYY7. (APR2015, MAR2015...). Keep in mind that I cannot modify my data_input (it's my company's data). At the beginning of my macro program, I want to create a new dataset with only the observations of the &m and &m_1 months. I can easily do that by creating a new variable from DTDSI (real_month, for example) in the MONYY7. format. Then I just select the observations where real_month equals &m or real_month equals &m_1:
Data new;
Set &data_input;
mois_real = put(DTDSI, monyy7.);
RUN;
PROC SQL;
CREATE TABLE NEW AS
SELECT *
FROM NEW
WHERE mois_real in ("&m" , "&m_1");
....
The problem is that my first DATA step duplicates data_input, which is bad because it takes 30 minutes. So can you tell me how I can make my selection (DTDSI = m or DTDSI = m_1) right in my first step?
You can use expressions in your WHERE/IF condition, so apply the formula from step 1 directly in the condition of step 2, or vice versa.
Data new;
set &data_input;
/* the format name needs its trailing period; MONYY7. yields e.g. APR2015 */
WHERE put(DTDSI, monyy7.) in ("&m" , "&m_1");
run;
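As an alternative to comparing formatted strings, the two macro values can be converted to dates and compared numerically. This is a sketch, not a tested replacement: work.sim and the three demo rows are hypothetical stand-ins for the real input, and whether it runs faster on 11 million rows would need to be measured.

```sas
/* Hypothetical stand-ins for the macro parameters */
%let data_input = work.sim;
%let m   = APR2015;
%let m_1 = MAR2015;

/* Tiny demo table standing in for the real dataset */
data work.sim;
  format DTDSI date9.;
  DTDSI = '01APR2015'd; output;
  DTDSI = '02MAR2015'd; output;
  DTDSI = '15FEB2015'd; output;
run;

data new;
  set &data_input;
  /* intnx(..., 'b') moves each date to the first day of its month; */
  /* inputn with MONYY7. turns APR2015 into the date 01APR2015      */
  where intnx('month', DTDSI, 0, 'b') in
        (%sysfunc(inputn(&m,monyy7.)),
         %sysfunc(inputn(&m_1,monyy7.)));
run;
```

With the demo rows, only the April and March observations are kept. The comparison is numeric, so it does not depend on the case of the text in &m and &m_1.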

References to SAS date macros not working due to differing data types

Hoping for some help on this one.
Currently, the query uses the below to create references for m1-m6 and d1-d6.
%let m1=1114;
%let d1 ='30NOV2014'd;
%let m2=1214;
%let d2='31DEC2014'd;
%let m3=0115;
%let d3='31JAN2015'd;
%let m4=0215;
%let d4='28FEB2015'd;
%let m5=0315;
%let d5='31MAR2015'd;
%let m6=0415;
%let d6='30APR2015'd;
Based on the rest of the code, the m1-m6 dates must be formatted as mmyy. I have tried to swap the above out with this:
data _datemacro_;
m1 = put(intnx('day','01NOV2014'd,0),mmyyn4.);
call symput('m1',"'"||put(m1,9.)) ;
d1 = put(intnx('day','30NOV2014'd,0),date9.);
call symput('d1',"'"||put(d1,9.)) ;
m2 = put(intnx('day',&d1,+1),mmyyn4.);
call symput('m2',"'"||put(m2,9.)) ;
d2 = put(intnx('month',&d1,+1,'e'),date9.);
call symput('d2',"'"||put(d2,9.)||"'"||"d") ;
...etc through m6 and d6
run;
Below is the rest of the code that yields a garden variety of errors, including
ERROR 22-322: Syntax error, expecting one of the following: a name, a
quoted string, (, /, ;, DATA, LAST, NULL.
ERROR 200-322: The symbol is not recognized and will be ignored.
proc sql;
create table perf as
select a. field, a. field, a. field, a. reportingdate,
a. field, a. field,
e. field,
f. field
from table a, table2 e, table3 f
where a. reportingdate between &d1. and &d6.
and (a. field=1 or a. field=1)
and a. field = e. field and a. field = f. field;
quit;
/*Creates performance file by month*/
%macro month (mon,date);
data m&mon. (rename=(field=active&mon. field=co&mon. field=es&mon. field=sr&mon.));
set perf;
where datepart(reportingdate)=&date.;
run;
proc sort data=m&mon.; by field descending co&mon.; run;
proc sort data=m&mon. nodupkey out=m2&mon.; by field; run;
%mend month;
%month (&m1.,&d1.);
%month (&m2.,&d2.);
%month (&m3.,&d3.);
%month (&m4.,&d4.);
%month (&m5.,&d5.);
%month (&m6.,&d6.);
I am able to get it to run accurately until the last 6 lines, where it comes up with 78 errors just from running those 6 lines.
Any suggestions on how to write the macro to keep the correct data type while accurately defining the month start and end dates? When I try and change the start date and end date of each month to the same format, something within the rest of the code causes an error stating that it cannot work with two variables of different formats, even when they are clearly the same format as defined in the code.
Please let me know if there is anything I can clarify, as this was a little harder to explain than I intended.
Thank you for your help.
So you're basically trying to build a macro that will run a report month-by-month. I think using a macro is a good idea, but your structure could benefit from a re-org.
The first thing to fix is the hardcoded dates. Hardcoding is bad 99% of the time. Why not use a loop instead?
Initialise the start and end dates at the top of your program. In future they're easy to find and change if they're at the top, and you won't need to search through your code trying to figure out what else needs to change:
* PICK ANY DATES IN THE MONTHS YOU WANT TO START AND END. IE. DOESNT MATTER IF YOU CHOOSE THE FIRST OR THE 20TH. IT WILL RUN FOR THAT MONTH;
%let rpt_start = %sysfunc(mdy(11,1,2014));
%let rpt_end = %sysfunc(mdy( 4,1,2015));
Go and get the data between the start and end dates:
proc sql;
create table perf as
select a. field, a. field, a. field, a. reportingdate,
a. field, a. field,
e. field,
f. field
from table a, table2 e, table3 f
where a. reportingdate between &rpt_start and &rpt_end
and (a. field=1 or a. field=1)
and a. field = e. field and a. field = f. field;
quit;
Now loop over each month between the start and end dates, creating the desired datasets as we go.
%macro create_monthly_datasets;
%local tmp_dt tmp_end rpt_dt mmyy;
%let tmp_end = %sysfunc(intnx(month,&rpt_end,0,end)); *CALC END-DATE DESIRED;
%let tmp_dt = %sysfunc(intnx(month,&rpt_start,0,beginning)); *INITIALISE LOOP;
%do %while (&tmp_dt le &tmp_end);
* CALC ACTUAL DATE WANTED AND STORE IT IN RPT_DT;
* CALC THE MMYY VAL YOU NEED;
* PRINT OUT BOTH VALUES TO MAKE SURE THEYRE CORRECT;
%let rpt_dt = %sysfunc(intnx(month,&tmp_dt,0,end));
%let mmyy = %sysfunc(month(&rpt_dt),z2.)%substr(%sysfunc(year(&rpt_dt)),3,2);
%put %sysfunc(sum(&rpt_dt),date9.) &mmyy;
* DO THE WORK;
data m&mmyy (rename=(field=active&mmyy field=co&mmyy field=es&mmyy field=sr&mmyy));
set perf;
where datepart(reportingdate)=&rpt_dt;
run;
proc sort data=m&mmyy; by field descending co&mmyy; run;
proc sort data=m&mmyy nodupkey out=m2&mmyy; by field; run;
%let tmp_dt = %sysfunc(intnx(month,&tmp_dt,1,beginning)); * ITERATE LOOP;
%end;
%mend;
%create_monthly_datasets;
Try the following for creating the macro variables:
data _null_;
START_MTH = '01nov2014'd;
do i = 1 to 6;
T_DATE = intnx('month',START_MTH,i)-1; /*Shift forwards i months then back 1 day*/
call symput(cats('m',i),put(T_DATE,mmyyn4.));
call symput(cats('d',i),cats("'",put(T_DATE,date9.),"'d"));
end;
run;
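Macro variables always hold text, so the simplest way to confirm that the step above produced what the rest of the program expects is to write the values to the log. A small check, assuming the DATA _NULL_ step above has just run:

```sas
/* Print the generated values to the log */
%put m1=&m1 d1=&d1;
%put m6=&m6 d6=&d6;
/* Log should show: m1=1114 d1='30NOV2014'd */
/*                  m6=0415 d6='30APR2015'd */
```

Since d1-d6 resolve to date literals such as '30NOV2014'd, they can be dropped straight into a comparison like `where datepart(reportingdate)=&date.;` without further conversion.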