I am attempting to merge two data sets without a single key variable. The data looks like this in both data sets:
study_id.....round....other variables different between the two sets
A000019....R012....etc
A000019....R013
A000047....R013
A000047....R014
A000047....R015
A000267....R014
This is my code...
DATA RAKAI.complete;
length study_id $ 8;
MERGE hivgps2 rccsdata;
BY study_id round;
RUN;
I've tried to merge by study_id and round, which are the only two variables shared across the data sets. But it just stacks the two sets, creating double the correct number of IDs. The combination of "study_id" and "round" provides a unique identifier, but no one variable does. Does it just make the most sense to code a new unique id by combining the two variables that are shared by both data sets?
Many Thanks
I realized I can post the code that I meant to deal with potential unwanted spaces here.
DATA hivgps2;
SET hivgps2;
study_id = compress(study_id);
round= compress(round);
RUN;
DATA rccsdata;
SET rccsdata;
study_id = compress(study_id);
round=compress(round);
RUN;
Your code is the correct format for merging by multiple variables. Records from both datasets are included, so if none of the keys match then the result will be the same as if you used SET instead of MERGE.
Are you sure that there is any overlap between the two sets of data? Check that your variables are the same length. If they are character then make sure the values are consistent in their use of upper and lower case letters. Make sure that the values do not have leading spaces or other non-printing characters. Also make sure you haven't attached a format to one of the datasets so that the values you see printed are not what is actually in the data.
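The checks above can be sketched on a small made-up dataset. Here keys_demo and its values are hypothetical, with a tab character deliberately hidden in the second row. PROC CONTENTS shows the type, length and any attached format, and the $HEX format shows the bytes actually stored, so a stray tab ('09'x) or other non-printing character becomes visible.

```sas
/* Hypothetical demo data: the second key has a hidden trailing tab */
data keys_demo;
  length study_id $10;
  study_id = 'A000019';         output;
  study_id = 'A000019' || '09'x; output;
run;

/* Type, length and any attached format of the key variable */
proc contents data=keys_demo;
run;

/* Print the stored bytes as hex: two hex digits per character,   */
/* so a $10 variable needs a width of 20 to be shown completely.  */
proc print data=keys_demo;
  var study_id;
  format study_id $hex20.;
run;
```

Running the same two steps against both real datasets and comparing the output for a key that ought to match is one way to find out why the BY values differ.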
In your clean up data steps you should force the length of the variables to be consistent. Also you can compress more than just spaces from the values. I like to eliminate anything that is not a normal 7-bit ASCII code. That will get rid of tabs, non-breaking spaces, nulls and other strange things. In normal 7-Bit ASCII the printable characters are between ! ('21'x or 33 decimal) and ~ ('7E'x or 126 decimal).
data hivgps2_clean ;
length study_id $10 round $5 ;
set hivgps2;
format study_id round ;
study_id=upcase(compress(study_id,compress(study_id,collate(33,126))));
round=upcase(compress(round,compress(round,collate(33,126))));
run;
proc sort; by study_id round; run;
data rccsdata_clean;
length study_id $10 round $5 ;
set rccsdata;
format study_id round ;
study_id=upcase(compress(study_id,compress(study_id,collate(33,126))));
round=upcase(compress(round,compress(round,collate(33,126))));
run;
proc sort; by study_id round; run;
data want;
merge hivgps2_clean(in=in1) rccsdata_clean(in=in2);
by study_id round;
run;
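The IN= variables in the merge above are created but not used. A minimal sketch of one way to use them follows, on made-up data; gps_demo, rccs_demo, gps_var and rccs_var are hypothetical names, not the real datasets. It routes each key combination to a different output dataset, which shows directly how much overlap there is.

```sas
/* Hypothetical demo data, already sorted by study_id and round */
data gps_demo;
  input study_id :$10. round :$5. gps_var;
  datalines;
A000019 R012 1
A000019 R013 2
;

data rccs_demo;
  input study_id :$10. round :$5. rccs_var;
  datalines;
A000019 R013 10
A000047 R013 20
;

/* in1/in2 are 1 when the current BY group has a row from that input */
data both gps_only rccs_only;
  merge gps_demo(in=in1) rccs_demo(in=in2);
  by study_id round;
  if in1 and in2 then output both;      /* key found in both inputs  */
  else if in1    then output gps_only;  /* key only in first input   */
  else                output rccs_only; /* key only in second input  */
run;
```

If the dataset of matches comes out empty on the real data, the keys do not agree byte for byte, and the cleanup steps are the place to look.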
You can try that, or you can just use a proc sql join:
proc sql;
create table rakai.complete as select
a.*, b.*
from hivgps2 as a
full join rccsdata as b
on a.study_id = b.study_id and a.round = b.round;
quit;
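One caveat with the full join, sketched here on made-up data: as far as I know, when both a.* and b.* carry study_id and round, the output table keeps the copies from a, so rows that exist only in b end up with missing keys. A common workaround is to rebuild the keys with COALESCE and list the other columns explicitly. The names gps_demo, rccs_demo, gps_var and rccs_var are hypothetical.

```sas
/* Hypothetical demo data */
data gps_demo;
  input study_id :$8. round :$4. gps_var;
  datalines;
A000019 R012 1
A000019 R013 2
;

data rccs_demo;
  input study_id :$8. round :$4. rccs_var;
  datalines;
A000019 R013 10
A000047 R013 20
;

proc sql;
  create table joined_demo as
  select coalesce(a.study_id, b.study_id) as study_id, /* key from either side */
         coalesce(a.round,    b.round)    as round,
         a.gps_var,
         b.rccs_var
  from gps_demo as a
  full join rccs_demo as b
    on a.study_id = b.study_id and a.round = b.round;
quit;
```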
I have some SAS code from my editor here. I am learning to use SAS (this is my first time using it), so I'm not sure how much code is relevant.
proc import
datafile="C:\Users\barnedsm\Desktop\SAS\ToothGrowth.csv"
dbms=csv
out=tooth;
proc print data=tooth (obs=5);
run;
6. create two SAS data sets ToothGrowth_OJ and ToothGrowth_VC for the animals with the
delivery method orange juice and ascorbic acid, respectively. (5 points)
data ToothGrowth_OJ;
set tooth;
where (supp="OJ");
proc print data=ToothGrowth_OJ (obs=5);
run;
data ToothGrowth_VC;
set tooth;
where (supp="VC");
proc print data=ToothGrowth_VC (obs=5);
run;
7. save the two SAS data sets in a permanent folder on your computer. (5 points)
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data mylibr.ToothGrowth_OJ_permanent;
set ToothGrowth_OJ;
run;
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data mylibr.ToothGrowth_VC_permanent;
set ToothGrowth_VC;
run;
For the final question on my assignment, I am wanting to re-combine the last two datasets I made (ToothGrowth_OJ and ToothGrowth_VC) into one dataset (ToothGrowth_combined). How would I do this? My thoughts would be to use a subset function like I used to separate the two. The code I have in mind is below.
data ToothGrowth_combined;
set ToothGrowth_OJ(where=(supp="OJ"));
keep supp Len;
run;
This would tell SAS to keep the values from the ToothGrowth_OJ dataset that have OJ in the "supp" column (which is all of them) and to keep the variable Len. Assuming that I have written this code correctly, I want to add in the values from my ToothGrowth_VC dataset in a similar way, but the output is an empty dataset when I run the same code with "ToothGrowth_OJ" replaced by "ToothGrowth_VC". Is there a way to use the subset code to take these two separate datasets and combine them into one, or is there an easier way?
Your starting code does the following:
Uses PROC IMPORT to guess how to read the text file into a dataset.
Creates a subset of the data with only some of the observations.
Creates a second subset of the data.
To recombine the two subsets use the SET statement and list all of the input datasets you want. To limit the number of variables written to the output dataset use a KEEP statement.
data ToothGrowth_combined;
set ToothGrowth_OJ ToothGrowth_VC ;
keep supp Len;
run;
I am not sure why you added the WHERE= dataset option in your code attempt since by the way they were created they each only have observations with a single value of SUPP.
If you want to combine the permanent datasets instead (for example if you started a new SAS session with an empty WORK library) then use those dataset names instead in the SET. Just make sure the libref that points to them is defined in this SAS session.
libname mylibr "C:\Users\barnedsm\Desktop\SAS";
data ToothGrowth_combined;
set mylibr.ToothGrowth_OJ_permanent mylibr.ToothGrowth_VC_permanent;
keep supp Len;
run;
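To confirm that a concatenation like the ones above picked up rows from both inputs, a small check can help. This is only a sketch on made-up data: oj_demo, vc_demo and their values are hypothetical. The INDSNAME= option on SET records which input dataset each row came from, and PROC FREQ then counts rows per source and per value of supp.

```sas
/* Hypothetical stand-ins for the two subsets */
data oj_demo;
  supp = 'OJ';
  len = 15.2; output;
  len = 21.5; output;
run;

data vc_demo;
  supp = 'VC';
  len = 4.2; output;
run;

data combined_demo;
  length source $41;
  set oj_demo vc_demo indsname=src; /* src is temporary and is not written out */
  source = src;                     /* copy it to a variable that is kept      */
run;

/* One row per combination of source dataset and supp value */
proc freq data=combined_demo;
  tables source*supp / list missing;
run;
```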
I am working on a project that involves two separate CSV files. The first data set, "Trips", has seven columns: trip_id, bike_id, duration, from_station_id, to_station_id, capacity and usertype. usertype is the only character variable; the rest are numeric. The second CSV file has station_id and station_name. The objective is to merge the files so that the station name from the second CSV file is added for the "from" and "to" stations in the first, based on station id. I know that this would be extremely easy in Excel with an XLOOKUP, but I am wondering about the correct way to approach this in SAS.
I am using the SAS university edition (the free online one) if that makes any difference. Our code so far is as follows:
data DivvyTrips;
infile '/home/u59304398/sasuser.v94/DivvyTrips.csv' dsd;
input trip_id
bikeid
tripduration
from_station_id
to_station_id
capacity
usertype $;
title "Trips";
run;
data DivvyStations;
infile '/home/u59304398/sasuser.v94/Divvy_Stations.csv' dsd;
input station_id
station_name $;
title "Stations";
run;
All this does is import the data. I do not think a merge with a sort will work because we need both from and to station names.
SAS uses formats to control how values are displayed as text. It uses informats to control how text is converted to values.
Since your station ID is numeric you can use a FORMAT to display the station names for the station id numbers.
You can create a CNTLIN dataset for PROC FORMAT to build a format from your station list dataset. To define a numeric format you just need to have the FMTNAME, START and LABEL variables in your CNTLIN dataset.
data format;
fmtname='STATION';
set divvystations;
rename station_id=start station_name=label;
run;
proc format cntlin=format;
run;
Now you can use the format with your station variables. For most purposes you will not even need to modify your dataset, just tell SAS to use the format with your variable.
Let's create some example data:
data DivvyTrips;
infile cards dsd;
input trip_id
bikeid
tripduration
from_station_id
to_station_id
capacity
usertype :$20.
;
cards;
1,1,10,1,2,2,AAA
2,1,20,2,3,1,BBB
;
data DivvyStations;
infile cards dsd ;
input station_id
station_name :$20.
;
cards;
1,Stop 1
2,Station 2
3,Airport
;
Now create the STATION format.
data format;
fmtname='STATION';
set divvystations;
rename station_id=start station_name=label;
run;
proc format cntlin=format;
run;
Now let's print the trip data and display the stations using the new STATION format.
proc print data=divvytrips;
format from_station_id to_station_id station. ;
run;
Result:
Obs  trip_id  bikeid  tripduration  from_station_id  to_station_id  capacity  usertype
  1        1       1            10  Stop 1           Station 2             2  AAA
  2        2       1            20  Station 2        Airport               1  BBB
If you do want to create a new character variable you use the PUT() function.
data want;
set DivvyTrips;
from_station = put(from_station_id,station.);
to_station = put(to_station_id,station.);
run;
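The CNTLIN approach above leaves open what happens to a station id that is not in the station list; by default a numeric format just prints the number. A common idiom, sketched here on made-up data, is to append one extra row with HLO='O' ("other") to the CNTLIN dataset. The names stations_demo, format_demo, STATIONX and the label text are assumptions for illustration only.

```sas
/* Hypothetical station list */
data stations_demo;
  infile cards dsd;
  input station_id station_name :$20.;
cards;
1,Stop 1
2,Station 2
;

data format_demo;
  length label $40;                     /* room for the fallback label */
  retain fmtname 'STATIONX' type 'N';   /* numeric format              */
  set stations_demo(rename=(station_id=start station_name=label)) end=last;
  output;                               /* one range per known station */
  if last then do;
    hlo   = 'O';                        /* O marks the OTHER range     */
    label = 'Unknown station';
    output;
  end;
run;

proc format cntlin=format_demo;
run;

/* Write one known and one unknown id to the log through the format */
data _null_;
  known   = put(2,  stationx.);
  unknown = put(99, stationx.);
  put known= unknown=;
run;
```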
In SAS when you "look up" values you join the two "arrays" or in this case tables together.
The simplest way to do this is using a proc sql step:
proc sql;
create table DivvyTrips_withnames as
select
a.*
,b.station_name as from_station_name
,c.station_name as to_station_name
from DivvyTrips a
left join DivvyStations b
on a.from_station_id = b.station_id
left join DivvyStations c
on a.to_station_id = c.station_id
;
quit;
We end up having to do 2 joins onto your original table as we are doing 2 different "lookups", from_station_id and to_station_id.
I am trying to read in some data in date format and the solution is eluding me. Here are four of my tries using the simplest self-contained examples I could devise.
*EDIT - my example was too simplistic. I have spaces in my variables, so I do need to specify positions (the original answer said to ignore positions entirely). The solution below works, but the date variable is not a date.
data clinical;
input
name $ 1-13
visit_date $ 14-23
group $ 25
;
datalines;
John Turner 03/12/1998 D
Mary Jones 04/15/2008 P
Joe Sims 11/30/2009 J
;
run;
No need to specify the column positions. List input already treats blanks as delimiters. A simple way to specify an informat with list input is to put a : between the variable name and the informat.
data clinical;
input ID$ visit_date:mmddyy10. group$;
format visit_date mmddyy10.; * Make the date look human-readable;
datalines;
01 03/12/1998 D
02 04/15/2008 P
03 11/30/2009 J
;
run;
Output:
ID visit_date group
01 03/12/1998 D
02 04/15/2008 P
03 11/30/2009 J
A friend of mine suggested this, but it seems odd to have to switch syntax markedly depending on whether the variable is a date or not.
data clinical;
input
name $ 1-12
@13 visit_date MMDDYY10.
group $ 25 ;
datalines;
John Turner 03/12/1998 D
Mary Jones 04/15/2008 P
Joe Sims 11/30/2009 J
;
run;
SAS provides a lot of different ways to input data, just depending on what you want to do.
Column input, which is what you start with, is appropriate when this is true:
To read with column input, data values must have these attributes:
appear in the same columns in all the input data records
consist of standard numeric form or character form
Your data does not meet this in the visit_date column. So, you need to use something else.
Formatted input is appropriate to use when you want these features:
With formatted input, an informat follows a variable name and defines how SAS reads the values of this variable. An informat gives the data type and the field width of an input value. Informats also read data that is stored in nonstandard form, such as packed decimal, or numbers that contain special characters such as commas.
Your visit_date column matches this requirement, as you have a specific informat (mmddyy10.) you would like to use to read in the data into date format.
List input would also work, especially in modified list format, in some cases, though in your example of course it wouldn't due to the spaces in the name. Here's when you might want to use it:
List input requires that you specify the variable names in the INPUT statement in the same order that the fields appear in the input data records. SAS scans the data line to locate the next value but ignores additional intervening blanks. List input does not require that the data is located in specific columns. However, you must separate each value from the next by at least one blank unless the delimiter between values is changed. By default, the delimiter for data values is one blank space or the end of the input record. List input does not skip over any data values to read subsequent values, but it can ignore all values after a given point in the data record. However, pointer controls enable you to change the order that the data values are read.
(For completeness, there is also Named input, though that's more rare to see, and not helpful here.)
You can mix Column and Formatted inputs, but you don't want to mix List input as it doesn't have the same concept of pointer control exactly so it can be easy to end up with something you don't want. In general, you should use the input type that's appropriate to your data - use Column input if your data is all text/regular numerics, use formatted input if you have particular formats for your data.
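A minimal sketch of mixing column input and formatted input for data shaped like the question's. It assumes the name occupies columns 1-12, the date starts in column 14 and the group code sits in column 25. These positions are assumptions chosen for this example, and the data lines are padded to match them.

```sas
data clinical_demo;
  input name $ 1-12                  /* column input: plain character field  */
        @14 visit_date mmddyy10.     /* formatted input: informat reads date */
        @25 group $1.;               /* formatted input at a fixed column    */
  format visit_date mmddyy10.;       /* display the SAS date value as a date */
  datalines;
John Turner  03/12/1998 D
Mary Jones   04/15/2008 P
Joe Sims     11/30/2009 J
;
run;

proc print data=clinical_demo;
run;
```

Here visit_date is stored as a numeric SAS date, so date functions and date formats work on it, while name keeps its embedded blank because it is read by position rather than by delimiter.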
I have a dataset in which each id has multiple incomplete records, and it would make more sense to have a final dataset with the records for each id collapsed together. Basically, the idea is to have non-missing data fill the blanks, whether the value comes from the 1st line or the 2nd line, as long as it is for the same id.
The easiest way to do this is the self-update. This uses the core property of the update statement, that only non-missing values can replace other values, in a fun way that allows the rows to be simplified like this. The first obs=0 is there simply to give an empty base to update from - the dataset is really being read in from the second mention on that statement.
data have;
id = 1;
input x y z;
datalines;
1 . .
. 1 .
. . 1
;;;;
run;
data want;
update have(obs=0) have;
by id;
run;
proc sql;
create table need as
Select ID, max(v1) as v1,
max(v2) as v2,
max(v3) as v3,
max(v4) as v4
from have
group by ID;
quit;
I have a data set which contains 'factor' values and corresponding 'response' values:
data inTable;
input fact $ val $;
datalines;
a 1
a 2
a 3
b 4
b 5
b 6
c 7
d 8
e 9
e 10
f 11
;
run;
I want to aggregate response options by factor, i.e. I need to get
I know perfectly well how to implement this in a data step running a loop through values and applying CATX (posted here). But can I do the same with PROC SQL, using a combination of GROUP BY and some character analog of SUM() or CATX()?
Thanks for help,
Dmitry
The data step is the appropriate tool to use in SAS if you want to apply any sort of logic that carries lots of values forward from previous rows.
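A minimal sketch of that data step approach, using BY-group processing with RETAIN and CATX. The dataset inTable_demo is a shortened, hypothetical copy of the question's data, and the output names outTable_demo and vals, the comma delimiter and the $200 length are assumptions.

```sas
/* Shortened hypothetical copy of the input */
data inTable_demo;
  input fact $ val $;
  datalines;
a 1
a 2
a 3
b 4
c 7
;
run;

/* BY-group processing needs the data sorted by the grouping variable */
proc sort data=inTable_demo;
  by fact;
run;

data outTable_demo;
  length vals $200;              /* assumed to be long enough for any group */
  retain vals;                   /* carry the running string across rows    */
  set inTable_demo;
  by fact;
  if first.fact then vals = '';  /* start a new list for each factor        */
  vals = catx(',', vals, val);   /* append, skipping blank pieces           */
  if last.fact then output;      /* write one row per factor                */
  keep fact vals;
run;
```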
Any SQL solution would be extremely unwieldy - you would need to join the input table to itself n times, where n is the maximum number of distinct values for any of your factors, and you would also need to define a sequential key preserving the row order to use for the join.
A list of aggregation functions you can use in proc sql is available here:
http://support.sas.com/kb/25/279.html
Although a few of these do work with character variables, there is no aggregation function for string concatenation.