Merge multiple incomplete records in SAS - merge

I have a dataset in which each id has multiple incomplete records. It would make more sense to have a final dataset as shown. The idea is to fill each blank with the non-missing value for that variable, whether it comes from the 1st line or the 2nd line, as long as the lines belong to the same id.

The easiest way to do this is a self-update. This uses the core property of the UPDATE statement: only non-missing values in the transaction dataset replace existing values, so the rows for each id collapse into a single row. The first mention, with obs=0, is there simply to give an empty base to update from; the data is really read in from the second mention on that statement.
data have;
id = 1;
input x y z;
datalines;
1 . .
. 1 .
. . 1
;;;;
run;
data want;
update have(obs=0) have;
by id;
run;
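The example above has only one id. As a minimal sketch of the same self-update across several ids, assuming a hypothetical dataset have2 that is already in id order (UPDATE requires BY-sorted input):

```sas
/* Hypothetical sample data: two ids, each spread over partial rows */
data have2;
  input id x y z;
  datalines;
1 1 . .
1 . 2 .
2 . . 3
2 4 . .
2 . 5 .
;
run;

/* Self-update: the empty base (obs=0) is updated by every row of have2. */
/* Missing values never overwrite, so each id collapses to one row.      */
data want2;
  update have2(obs=0) have2;
  by id;
run;

proc print data=want2;
run;
```

This should leave one row per id: id 1 with x=1, y=2 and z missing, and id 2 with x=4, y=5 and z=3. If the same variable is non-missing on more than one row of an id, the last non-missing value wins.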

proc sql;
create table need as
Select ID, max(v1) as v1,
max(v2) as v2,
max(v3) as v3,
max(v4) as v4
from have
group by ID;
quit;

Related

Is there a SAS function similar to an Xlookup? [duplicate]

I am working on a project that involves two separate CSV files. The first data set, "Trips", has seven columns: trip_id, bike_id, duration, from_station_id, to_station_id, capacity and usertype. usertype is the only character variable; the rest are numeric. The second CSV file has station_id and station_name. The objective is to merge the files so that the station name from the second CSV file is filled in for the "from" and "to" stations in the first, based on station id. I know that this would be extremely easy in Excel with an XLOOKUP, but I am wondering what the correct way to approach this in SAS is.
I am using the SAS university edition (the free online one) if that makes any difference. Our code so far is as follows:
data DivvyTrips;
infile '/home/u59304398/sasuser.v94/DivvyTrips.csv' dsd;
input trip_id
bikeid
tripduration
from_station_id
to_station_id
capacity
usertype $;
title "Trips";
run;
data DivvyStations;
infile '/home/u59304398/sasuser.v94/Divvy_Stations.csv' dsd;
input station_id
station_name $;
title "Stations";
run;
All this does is import the data. I do not think a merge with a sort will work because we need both from and to station names.
SAS uses formats to control how values are displayed as text. It uses informats to control how text is converted to values.
Since your station ID is numeric you can use a FORMAT to display the station names for the station id numbers.
You can create a CNTLIN dataset for PROC FORMAT to build a format from your station list dataset. To define a numeric format you just need to have the FMTNAME, START and LABEL variables in your CNTLIN dataset.
data format;
fmtname='STATION';
set divvystations;
rename station_id=start station_name=label;
run;
proc format cntlin=format;
run;
Now you can use the format with your station variables. For most purposes you will not even need to modify your dataset, just tell SAS to use the format with your variable.
Let's create some example data:
data DivvyTrips;
infile cards dsd;
input trip_id
bikeid
tripduration
from_station_id
to_station_id
capacity
usertype :$20.
;
cards;
1,1,10,1,2,2,AAA
2,1,20,2,3,1,BBB
;
data DivvyStations;
infile cards dsd ;
input station_id
station_name :$20.
;
cards;
1,Stop 1
2,Station 2
3,Airport
;
Now create the STATION format.
data format;
fmtname='STATION';
set divvystations;
rename station_id=start station_name=label;
run;
proc format cntlin=format;
run;
Now let's print the trip data and display the stations using the new STATION format.
proc print data=divvytrips;
format from_station_id to_station_id station. ;
run;
Result:
Obs  trip_id  bikeid  tripduration  from_station_id  to_station_id  capacity  usertype
1    1        1       10            Stop 1           Station 2      2         AAA
2    2        1       20            Station 2        Airport        1         BBB
If you do want to create a new character variable you use the PUT() function.
data want;
set DivvyTrips;
from_station = put(from_station_id,station.);
to_station = put(to_station_id,station.);
run;
In SAS when you "look up" values you join the two "arrays" or in this case tables together.
The simplest way to do this is using a proc sql step:
proc sql;
create table DivvyTrips_withnames as
select
a.*
,b.station_name as from_station_name
,c.station_name as to_station_name
from DivvyTrips a
left join DivvyStations b
on a.from_station_id = b.station_id
left join DivvyStations c
on a.to_station_id = c.station_id
;
quit;
We end up having to do 2 joins onto your original table as we are doing 2 different "lookups", from_station_id and to_station_id.

How to merge datasets by date category? SAS

I am new to stack overflow. I am also a beginner in SAS. I have two datasets: one with a list of ID's and medications by date and one with ID's and dates by admission number. I am trying to get a list of medications by ID, organized by admission number in SAS.
I've tried merging by ID number and creating an "admission number" variable by using:
if admission_date-admission_date_1=0 then admission_number="Admission 1"
but all values are missing when I do that.
Here's what I have:
Here's what I want:
Thank you for your help!
It doesn't seem like that second data set is useful at all. What you're doing there is creating an enumeration variable, which can be accomplished using BY-group processing.
proc sort data=have; by id admission_date; run;
data want;
set have;
by id admission_date;
if first.id then admission_number=0;
if first.admission_date then admission_number + 1;
run;
More details are available on the methods here if needed.
https://stats.idre.ucla.edu/sas/faq/how-can-i-create-an-enumeration-variable-by-groups/
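A minimal sketch of the enumeration on made-up data, since the original tables were not posted. The dataset name meds, the medication variable and all values are hypothetical:

```sas
/* Hypothetical medication records: one row per id, date and drug */
data meds;
  input id admission_date :yymmdd10. medication $;
  format admission_date yymmdd10.;
  datalines;
1 2020-01-05 aspirin
1 2020-01-05 heparin
1 2020-03-10 aspirin
2 2020-02-01 insulin
;
run;

proc sort data=meds;
  by id admission_date;
run;

data numbered;
  set meds;
  by id admission_date;
  /* Restart the counter for each id */
  if first.id then admission_number = 0;
  /* Sum statement: retained across rows, bumped on each new date */
  if first.admission_date then admission_number + 1;
run;
```

With this input, admission_number should come out as 1, 1, 2 for id 1 and 1 for id 2, so both medications given on the same date share an admission number.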

Merging sas data sets without a key variable

I am attempting to merge two data sets without a single key variable. The data looks like this in both data sets:
study_id.....round....other variables different between the two sets
A000019....R012....etc
A000019....R013
A000047....R013
A000047....R014
A000047....R015
A000267....R014
This is my code...
DATA RAKAI.complete;
length study_id $ 8;
MERGE hivgps2 rccsdata;
BY study_id round;
RUN;
I've tried to merge by study_id and round, which are the only two variables shared across the data sets. But it just stacks the two sets, creating double the correct number of IDs. The combination of "study_id" and "round" provides a unique identifier, but no one variable does. Does it just make the most sense to code a new unique id by combining the two variables that are shared by both data sets?
Many Thanks
I realized I can post the code that I meant to deal with potential unwanted spaces here.
DATA hivgps2;
SET hivgps2;
study_id = compress(study_id);
round= compress(round);
RUN;
DATA rccsdata;
SET rccsdata;
study_id = compress(study_id);
round=compress(round);
RUN;
Your code is the correct format for merging by multiple variables. Records from both datasets are included, so if none of the keys match then the result will be the same as if you used SET instead of MERGE.
Are you sure that there is any overlap between the two sets of data? Check that your variables are the same length. If they are character then make sure the values are consistent in their use of upper and lower case letters. Make sure that the values do not have leading spaces or other non-printing characters. Also make sure you haven't attached a format to one of the datasets so that the values you see printed are not what is actually in the data.
In your clean up data steps you should force the length of the variables to be consistent. Also you can compress more than just spaces from the values. I like to eliminate anything that is not a normal 7-bit ASCII code. That will get rid of tabs, non-breaking spaces, nulls and other strange things. In normal 7-Bit ASCII the printable characters are between ! ('21'x or 33 decimal) and ~ ('7E'x or 126 decimal).
data hivgps2_clean ;
length study_id $10 round $5 ;
set hivgps2;
format study_id round ;
study_id=upcase(compress(study_id,compress(study_id,collate(33,126))));
round=upcase(compress(round,compress(round,collate(33,126))));
run;
proc sort; by study_id round; run;
data rccsdata_clean;
length study_id $10 round $5 ;
set rccsdata;
format study_id round ;
study_id=upcase(compress(study_id,compress(study_id,collate(33,126))));
round=upcase(compress(round,compress(round,collate(33,126))));
run;
proc sort; by study_id round; run;
data want;
merge hivgps2_clean(in=in1) rccsdata_clean(in=in2);
by study_id round;
run;
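The in= flags in that merge are declared but never used. One way to put them to work, sketched here as a diagnostic rather than as part of the original answer, is to count how many key combinations match. The counter names n_both, n_only1 and n_only2 are made up for illustration:

```sas
/* Count how the two key lists overlap. Assumes both *_clean datasets */
/* exist and are sorted by study_id round, as in the steps above.     */
data _null_;
  merge hivgps2_clean(in=in1 keep=study_id round)
        rccsdata_clean(in=in2 keep=study_id round)
        end=last;
  by study_id round;
  n_both  + (in1 and in2);      /* key found in both datasets   */
  n_only1 + (in1 and not in2);  /* key only in hivgps2_clean    */
  n_only2 + (in2 and not in1);  /* key only in rccsdata_clean   */
  if last then put n_both= n_only1= n_only2=;
run;
```

If n_both is 0 in the log, the keys still differ in some way (length, case, hidden characters) and the merge will keep behaving like a SET.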
You can try that, or you can just use a proc sql join:
proc sql;
create table rakai.complete as select
a.*, b.*
from hivgps2 as a
full join rccsdata as b
on a.study_id = b.study_id and a.round = b.round;
quit;

SAS Regression by class variable

I wish to perform multiple regressions conditionally based on the value of a categorical variable. So, for a simple example, consider the sashelp.class data. I need to perform a regression for males and another for females. Since my dataset has many more divisions and is much larger, I start by feeding the different types into macro variables:
proc sql;
select count(distinct Sex) into :numsex
from sashelp.class;
%let numsex=&numsex;
select distinct Sex into :sex1 - :sex&numsex
from sashelp.class;
quit;
Then I am trying to perform a regression on each one by looping them through. I know the commented out code works, but am unsure why my macro function is not working.
/**/
/*data dataF;*/
/* set sashelp.class;*/
/* where Sex='F';*/
/*run;*/
/**/
/*proc reg data=dataF outest=out1;*/
/* model Height=Weight;*/
/*run;*/
%macro regress;
%do i = 1 %to &numsex;
data data&&sex&i;
set sashelp.class;
where Sex='&&sex&i';
run;
proc reg data=data&&sex&i outest=out&i;
model Height=Weight;
run;
%end;
%mend;
%regress;
Also, if there is a better way to do this, then I'm all ears. The current way is a pain since I will have to combine all of my output sets of the estimates to get one dataset. Also, I get a bunch of intermediate datasets that I don't want or need.
Thanks.
Usually the BY group is the best way to do this, not sure if this is exactly what you're looking for:
proc sort data=sashelp.class out=class;
by sex;
run;
proc reg data=class outest=out1;
by sex;
model Height=Weight;
run;
Your macro failed because single quotes prevent macro variable resolution (i.e., '&sex' does not resolve to 'F'; you have to use "&sex" to get "F").
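If you do want to keep the macro loop, a minimal sketch of the repaired version follows. It assumes &numsex and &sex1, &sex2, ... were already created by the PROC SQL step in the question. Moving the WHERE statement into PROC REG is a suggestion to avoid the intermediate datasets, not part of the original code:

```sas
%macro regress;
  %do i = 1 %to &numsex;
    proc reg data=sashelp.class outest=out&i;
      /* Double quotes let &&sex&i resolve to F, M, ... */
      where Sex = "&&sex&i";
      model Height = Weight;
    run;
    quit;
  %end;
%mend regress;

%regress
```

This still produces one outest dataset per group, so the BY-group version above remains the simpler route to a single table of estimates.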

Automating table/object name scan and search in SAS

OK I'll start with the problem:
I have product tables being created every week which are named in the format:
products_20130701
products_20130708
.
.
.
I'm trying to automate some campaign analysis so that I don't have to manually change the table name in the code every week to use whichever product table is the first one after the maximum end date of my campaign.
e.g
%put &max_enddate.;
/*20130603*/
my product tables in June are:
products_20130602
*products_20130609*
products_20130616
products_20130623
In this instance I would like to use the second table in the list, ignoring over 12 months' worth of product tables and just selecting the table whose date is just after my max_enddate macro variable.
I've been Googling all day and I'm stumped so ANY advice would be much appreciated.
Thanks!
A SQL solution:
data product_20130603;
run;
data product_20130503;
run;
data product_20130703;
run;
%let campdate=20130601;
proc sql;
select min(memname) into :datasetname from dictionary.tables
where libname='WORK' and upcase(scan(memname,1,'_'))='PRODUCT' and
input(scan(memname,2,'_'),YYMMDD8.) ge input("&campdate.",YYMMDD8.);
quit;
Now you have &datasetname that you can use in the set statement, so
    data my_analysis;
    set &datasetname;
    /* whatever you are doing */
    run;
Modify 'WORK' to the appropriate libname, and if there are any other restrictions add those as well. You might get some warnings about invalid dates if you have product_somethingnotadate, but that shouldn't matter.
The way this works - the dictionary.tables is a list of all tables in all libnames you have accessed (same as sashelp.vtable, but only available in PROC SQL). First this selects all rows that have a name with a date greater than or equal to your campaign end date; then it takes the min(memname) from that. Memname is of course a string, but in strings that are identical except for a number, you can still use min and get the expected result.
This is probably not suitable for your application, however I find it very useful for the datasets I have as they absolutely must exist for each Sunday and I evaluate the existence of the dataset at the beginning of my code. If they don't exist then it sends an email to our IT guys that tells them that the file is missing and needs to be re-created\restored.
%LET DSN = PRODUCTS_%SYSFUNC(PUTN(%SYSFUNC(INTNX(WEEK.2,%SYSFUNC(INPUTN(&MAX_ENDDATE.,YYMMDD8.)),0,END)),YYMMDDN8.));
The other suggestions above will only give you results for datasets that exist, so if the one you should have been using has been deleted they will grab the next one and run the job regardless.
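A minimal sketch of the existence check described above, assuming the tables live in the WORK library and using a hypothetical macro name check_dsn. The email step is left out and replaced by a log message:

```sas
%let max_enddate = 20130603;  /* example value from the question */

/* Sunday that ends the Monday-based week containing &max_enddate */
%let dsn = PRODUCTS_%sysfunc(putn(%sysfunc(intnx(week.2,%sysfunc(inputn(&max_enddate.,yymmdd8.)),0,end)),yymmddn8.));

%macro check_dsn;
  /* EXIST returns 1 when the dataset is present, 0 otherwise */
  %if %sysfunc(exist(work.&dsn)) %then %put NOTE: work.&dsn exists.;
  %else %put ERROR: work.&dsn is missing and needs to be restored.;
%mend check_dsn;

%check_dsn
```

For the example value, &dsn should resolve to PRODUCTS_20130609, the table the question wants to select.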
First, get all possible tables:
data PRODUCT_TABLES;
set SASHELP.VTABLE (keep=libname memname);
*get what you need, here i keep it simple;
where lowcase(substr(memname,1,9))='products_';
run;
Next, sort it by date, easily done due to the format of your dataset names.
proc sort data=PRODUCT_TABLES;
by memname;
run;
Finally, you just need to get out the first record where the date is large enough.
data _NULL_;
set PRODUCT_TABLES;
*compare to your macro variable, note that i keep it as simple as possible and let SAS implicitly convert to numeric;
if substr(memname,10,8)>=symgetn("max_enddate") then do;
*set your match into a macro variable, i have put together the libname and memname here;
call symput("selectedTable",cats(libname,'.',memname));
stop; *do not continue, otherwise you will output simply the latest dataset;
end;
run;
Now you can just put the macro variable when you want to use the appropriate dataset, e.g.:
data SOME_TABLE;
set &selectedTable.;
/*DO SOME STUFF*/
run;