How do you merge lines in a single dataset with some duplicate values? - merge

I am analyzing a medical record dataset in which patients were screened for STIs at four different time points. The data manager created one line per patient per STI for each time period. I want to collapse the dataset so there is one line per patient at each time point, with all of the diagnosed STIs listed.
I created new variables to capture each STI that would be listed under the Dx variable, but I can't figure out how to combine rows within the same dataset so there is only one line per patient at each time point.
data dx;
set dx;
if dx='ANOGENITAL WARTS (CONDYLOMATA ACUMINATA)' then MRWarts=1;
if dx='CHLAMYDIA' then MRCHLAMYDIA=1;
if dx='DYSPLASIA (ANAL, CERVICAL, OR VAGINAL)' then MRDYSPLASIA=1;
if dx='GONORRHEA' then MRGONORRHEA=1;
if dx='HEPATITIS B (HBV)' then MRHEPB=1;
if dx='HUMAN PAPILLOMAVIRUSES (HPV)-ANY MANIFESTATION' then MRHPV=1;
if dx='PEDICULOSIS PUBIS' then MRPUBIS=1;
if dx='SYPHILIS' then MRSYPHILIS=1;
if dx='TRICHOMONAS VAGINALIS' then MRTRICHOMONAS=1;
run;
Image of data structure I am looking for

Taking the sample dataset that you provided in the image, you can use a simple PROC TRANSPOSE to get the desired outcome.
data have;
input Pt_ID interval_round DX :$10.;
datalines;
4 1 HIV
4 1 Warts
3 1 HIV
5 2 Chlamydia
;
run;
proc sort data=have; by Pt_ID; run;
proc transpose data=have out=want(drop=_NAME_);
by Pt_ID;
id Dx;
var interval_round;
run;
proc print data=want; run;
Now this code will create all of the Dx variables, but not interval_round. Say, for example, a patient was screened for HIV in round 1 and warts in round 2. Technically that patient should still have only one row, so how would you represent interval_round then?
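For what it's worth, here is the same collapse sketched in Python/pandas (purely illustrative, with the sample values from the image): build one indicator column per diagnosis, then take the max within each patient and time point, so duplicate rows fold into one.

```python
import pandas as pd

# Hypothetical long-format data: one row per patient per STI per time point
long = pd.DataFrame({
    "Pt_ID":          [4, 4, 3, 5],
    "interval_round": [1, 1, 1, 2],
    "DX":             ["HIV", "Warts", "HIV", "Chlamydia"],
})

# One 0/1 indicator column per diagnosis
flags = pd.get_dummies(long["DX"]).astype(int)

# Collapse to one row per (patient, time point): the max of each
# indicator is 1 if that STI appeared on any of the original rows
wide = (pd.concat([long[["Pt_ID", "interval_round"]], flags], axis=1)
          .groupby(["Pt_ID", "interval_round"], as_index=False)
          .max())
```

Note that grouping on both Pt_ID and interval_round sidesteps the interval_round question above: the time point stays its own column instead of being consumed by the transpose.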


I tried creating lag variable for a momentum plan but not sure how to proceed

This is how I sorted the data and created the lag variables:
tsset permno date, monthly
sort permno date
by permno: gen lagret1=ret[_n-1]
by permno: gen lagret2=ret[_n-2]
by permno: gen lagret3=ret[_n-3]
by permno: gen lagret4=ret[_n-4]
by permno: gen lagret5=ret[_n-5]
I don't know how to do the rest.
*Step 1: Upload the data and create key variables
*Upload the dataset that contains CRSP information and create key variables.
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/CRSPforMOM.dta", clear
*Keep only common stock
keep if shrcd == 10 | shrcd == 11
*Create monthindex variable
gen monthindex = year(date)*12+month(date)
*Create past 5 months of returns using the lag function
*In order to use the built-in lag function I need to tell Stata the
*structure of the data
tsset permno date, monthly
sort permno date
by permno: gen lagret1=ret[_n-1]
by permno: gen lagret2=ret[_n-2]
by permno: gen lagret3=ret[_n-3]
by permno: gen lagret4=ret[_n-4]
by permno: gen lagret5=ret[_n-5]
*Create a variable that captures cumulative returns of stock i,
*from month -5 through the current month
*Compounding requires multiplying consecutive returns
gen cumret6 = (1+ret)*(1+lagret1)*(1+lagret2)*(1+lagret3)*(1+lagret4)*(1+lagret5)
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", replace
*Step 2: Create and apply filters
*Before allocating stocks to portfolios, we should create and apply filters
*Select only NYSE stocks and find the 10th percentile of NYSE size in each month
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", clear
*Keep only NYSE stocks
keep if exchcd == 1
*Keep if market cap is larger than 0
keep if mktcap >0
*Drop missing observations where marketcap is missing
drop if missing(mktcap)
*Since we create portfolios monthly, we need breakpoints monthly
sort date
by date: egen p10=pctile(mktcap), p(10)
*We only need date variable (for merging) and p10 variable (as a filter),
*so we drop everything else
keep date p10
*Drop duplicates so that p10 repeats once for every month in the sample
duplicates drop
*save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOMNYSEBreakpoints.dta", replace
*Merge the breakpoints into the dataset created in step 1,
*so that we can remove small firms
*Break points are date specific so merge on date
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", clear
sort date
merge m:1 date using "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOMNYSEBreakpoints.dta"
*merge==3 indicates that an observation is present in both
*master and using datasets, that is the only data that is properly merged
*and the only data that should be kept
keep if _merge==3
*We need to drop _merge variable to be able to merge data again
drop _merge
*Apply filters, i.e. remove small firms and firms priced below $5
drop if missing(mktcap)
drop if mktcap<=p10
*use absolute value because CRSP denotes BID-ASK midpoint with negative sign
drop if abs(prc)<5
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM2.dta", replace
*Step 3: Allocate stocks in 10 portfolios and hold for 6 months
*Use new file
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM2.dta", clear
sort date
*We will create variable prret6, which will tell us which portfolio a stock
*belongs to based on cumret 6
*We will use the egen function xtile(), which puts a prespecified
*percentage of firms into each portfolio
*nq() tells Stata how many portfolios we want
*(egen xtile() comes from the user-written egenmore package: ssc install egenmore)
by date: egen prret6 = xtile(cumret6), nq(10) // takes ~20 min to run
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM3.dta", replace
*Use the portfolios
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM3.dta", clear
drop if missing(prret6)
*Expand data, i.e. create 6 copies of the data
expand 6
sort permno date
*Create variable n, which tracks what copy of the data it is;
*n will go from 1 to 6
*_n is the observation number (here, within each permno-date group)
by permno date: gen n=_n
*Use n variable to increment monthindex by 1
replace monthindex = monthindex+n
sort permno monthindex
*Drop return from the master dataset because we want the one from the
*using dataset
drop ret
merge m:1 permno monthindex using "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta"
keep if _merge==3
drop _merge
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM4.dta", replace
*Step 4: Analysis
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM4.dta", clear
sort monthindex prret6 date
*Average returns based on each portfolio in each calendar month and by
*formation month
collapse (mean) ret, by (monthindex prret6 date)
*Summarize again to get average portfolio returns by calendar month (monthindex)
collapse (mean) ret, by (monthindex prret6)
*Transpose the data
reshape wide ret, i(monthindex) j(prret6) // i(rows) j(columns)
*Generate year and month variable for clarity
gen year= round(monthindex/12)
gen month=(monthindex-year*12)+6
*create momentum return variable and check for significance
gen momret=ret10-ret1
ttest ret10=ret1
*testing momentum returns from year 2000 onward
keep if monthindex>=24000
ttest ret10=ret1
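As a cross-check of the logic in Steps 1 and 3 (compounding six months of returns, then ranking stocks into portfolios each month), here is a hypothetical Python/pandas sketch with made-up data; it illustrates the idea, not the full script:

```python
import pandas as pd

# Hypothetical panel: one row per stock (permno) per month
df = pd.DataFrame({
    "permno":     [1]*8 + [2]*8,
    "monthindex": list(range(8)) * 2,
    "ret":        [0.01, 0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02,
                   0.05, -0.03, 0.04, 0.01, 0.02, -0.01, 0.03, 0.00],
})

# cumret6: compound the current month and the previous five,
# within each stock (mirrors the (1+ret)*(1+lagret1)*... product)
df["cumret6"] = (df.groupby("permno")["ret"]
                   .transform(lambda r: (1 + r).rolling(6).apply(lambda w: w.prod())))

# Each month, rank stocks on cumret6; with real data this would be
# deciles (nq(10)), here only 2 groups since there are 2 stocks
month7 = df[df["monthindex"] == 7].copy()
month7["portfolio"] = pd.qcut(month7["cumret6"], 2, labels=False)
```

The rolling product replaces the five explicit lag variables, and the per-month quantile cut plays the role of by date: egen prret6 = xtile(...).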

SAS Placeholder value

I am looking to have a flexible importing structure into my SAS code. The import table from excel looks like this:
data have;
input Fixed_or_Floating $ asset_or_liability $ Base_rate_new;
datalines;
FIX A 10
FIX L Average Maturity
FLT A 20
FLT L Average Maturity
;
run;
The original dataset I'm working with looks like this:
data have2;
input ID Fixed_or_Floating $ asset_or_liability $ Base_rate;
datalines;
1 FIX A 10
2 FIX L 20
3 FIX A 30
4 FLT A 40
5 FLT L 30
6 FLT A 20
7 FIX L 10
;
run;
The placeholder "Average Maturity" exists in the Excel file only when the new interest rate is determined by the average maturity of the bond. I have a separate function for this, which allows me to look up and then left join the new base rate based on the closest interest rate. For example, if the bond matures in 10 years, I'll use a 10-year interest rate.
So my question is, how can I perform a simple merge, using similar code to this:
proc sort data = have;
by fixed_or_floating asset_or_liability;
run;
proc sort data = have2;
by fixed_or_floating asset_or_liability;
run;
data have3 (drop = base_rate);
merge have2 (in = a)
have (in = b);
by fixed_or_floating asset_or_liability;
run;
The problem at the moment is that my placeholder value doesn't read in, and I need it to stay a word, because that is how the Excel lookup table works. I then use an if statement such as:
if base_rate_new = "Average Maturity" then do;
(Insert existing Function Here)
end;
So, just the importing of the Excel file with a placeholder value, please and thank you.
TIA.
I'm not 100% sure whether this behaviour corresponds to how your data appears once you import it from Excel, but if I run your code to create have I get:
NOTE: Invalid data for Base_rate_new in line 145 7-13.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
145 FIX L Average Maturity
Fixed_or_Floating=FIX asset_or_liability=L Base_rate_new=. _ERROR_=1 _N_=2
NOTE: Invalid data for Base_rate_new in line 147 7-13.
147 FLT L Average Maturity
Fixed_or_Floating=FLT asset_or_liability=L Base_rate_new=. _ERROR_=1 _N_=4
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.HAVE has 4 observations and 3 variables.
Basically it's saying that when SAS tried to read the character strings as numeric it couldn't, so it set them to missing values. If we print the table we can see the missing values:
proc print data=have;
run;
Result:
Fixed_or_ asset_or_ Base_
Floating liability rate_new
FIX A 10
FIX L .
FLT A 20
FLT L .
Assuming this truly is what your data looks like, we can use the COALESCE function to achieve your goal.
data have3 (drop = base_rate);
    merge have2 (in = a)
          have (in = b);
    by fixed_or_floating asset_or_liability;
    base_rate_new = coalesce(base_rate_new, base_rate);
run;
The result of doing this gives us this table:
Fixed_or_ asset_or_ Base_
ID Floating liability rate_new
1 FIX A 10
3 FIX A 10
2 FIX L 20
7 FIX L 20
4 FLT A 20
6 FLT A 20
5 FLT L 30
The COALESCE function returns the first non-missing value among the arguments you pass to it. So when base_rate_new already has a value, that value is kept; when it is missing, the base_rate field is used instead.
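For readers more familiar with Python, the merge-plus-coalesce step can be sketched in pandas (a hypothetical illustration with the same sample values; fillna plays the role of COALESCE):

```python
import pandas as pd

# Hypothetical lookup table; Base_rate_new is missing where the
# Excel placeholder "Average Maturity" failed to read as a number
have = pd.DataFrame({
    "Fixed_or_Floating":  ["FIX", "FIX", "FLT", "FLT"],
    "asset_or_liability": ["A", "L", "A", "L"],
    "Base_rate_new":      [10.0, None, 20.0, None],
})
have2 = pd.DataFrame({
    "ID":                 [1, 2, 3, 4, 5, 6, 7],
    "Fixed_or_Floating":  ["FIX", "FIX", "FIX", "FLT", "FLT", "FLT", "FIX"],
    "asset_or_liability": ["A", "L", "A", "A", "L", "A", "L"],
    "Base_rate":          [10, 20, 30, 40, 30, 20, 10],
})

merged = have2.merge(have, on=["Fixed_or_Floating", "asset_or_liability"],
                     how="left")
# COALESCE: keep Base_rate_new where present, else fall back to Base_rate
merged["Base_rate_new"] = merged["Base_rate_new"].fillna(merged["Base_rate"])
merged = merged.drop(columns="Base_rate")
```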

Reshaping and merging simulations in Stata

I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
    forvalues price = 0(5)500 {
        gen new_r`risk'_p`price' = `price' * (`risk'/200) * beta - alpha
        gen probnew_r`risk'_p`price' = 0
        replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
        sum probnew_r`risk'_p`price', mean
        gen mnew_r`risk'_p`price' = r(mean)
        drop new_r`risk'_p`price' probnew_r`risk'_p`price'
    }
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
    clear
    use simresults.dta
    reshape long mnew_r`risk'_p, i(simulationid) j(price)
    keep simulationid price mnew_r`risk'_p
    rename mnew_r`risk'_p risk`risk'
    save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
merge m:m price using risk`risk'.dta, nogen
save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here is, in the first instance, for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations (about 10 million, which is not usually a problem) and to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. Your dataset starts with 1000 observations; you then enlarge it to about 10 million and get your results. The tricky part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set things up in blocks; getting the mean without repeated calls to summarize; and using a true-or-false expression directly.
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.
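For comparison, the whole grid of probabilities (the share of simulations in which new > 0, for every risk and price combination) can be computed in one vectorized pass with no reshape at all. A hypothetical Python/NumPy sketch, with made-up simulated alpha and beta draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 1000
beta = rng.normal(0.02, 0.003, n_sims)   # hypothetical simulated draws
alpha = rng.normal(20.0, 2.0, n_sims)

risk = np.arange(0, 101)                 # 0..100 in steps of 1
price = np.arange(0, 501, 5)             # 0..500 in steps of 5

# Broadcast to shape (n_sims, n_risk, n_price):
# new = price * (risk/200) * beta - alpha
new = (price[None, None, :] * (risk[None, :, None] / 200)
       * beta[:, None, None]) - alpha[:, None, None]

# Probability across simulations that new > 0, per (risk, price) cell
prob = (new > 0).mean(axis=0)            # shape (101, 101)
```

The broadcasting replaces both the nested forvalues loops and the 101 reshapes: the result is already a 101 x 101 table of probabilities.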

Increment date variable

I have a SAS dataset to which I must add a date variable, starting with a certain date (e.g. July 10, 2014). For each observation, the date must increase by one day. I cannot figure out how to increment the date; whenever I try, I get the same date for all observations.
Welcome to Stack Overflow! Let's assume your dataset looks as so:
Have
Obs Var1
1 Mazda
2 Ford
3 BMW
Want
Obs Date Var1
1 01JAN2015 Mazda
2 02JAN2015 Ford
3 03JAN2015 BMW
You can use a Sum Statement with a SAS Date Literal to accomplish this goal.
data want;
format Date date9.; /* Makes Date the first variable; looks prettier */
set have;
if(_N_ = 1) then Date = '31DEC2014'd; /* Set initial value */
Date+1; /* Increment the SAS date value by 1 day for each observation */
run;
If you have not used the automatic variable _N_ before, it's an iteration counter that increments each time SAS goes from the top of the data step to the bottom.
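For comparison, the same start-date-plus-row-counter idea in a hypothetical Python sketch:

```python
from datetime import date, timedelta

have = ["Mazda", "Ford", "BMW"]   # hypothetical Var1 values

# One day per row: the row index plays the role of _N_
start = date(2015, 1, 1)
want = [(start + timedelta(days=i), v) for i, v in enumerate(have)]
```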
The likely reason that you are seeing the same date for each day is because you are not retaining the value you want to increment. Consider the below example program:
data WontWork;
set have;
Add_Me = 1;
/* Do loop just simulates dataset iterations */
do i = 1 to 10;
Add_Me = Add_Me + 1;
output;
end;
drop i;
run;
Explanation
Whenever SAS starts a new iteration of the data step, the Program Data Vector (PDV) resets to missing all variables that are not automatic and not read in with SET or MERGE. To fix this, you must either use a RETAIN statement and then increment the variable, or use a sum statement, which does the job of both retaining and summing. The RETAIN and sum statements both tell SAS to remember the last value of a variable so that it does not get reset to missing when the data step iterates. One unique property of the RETAIN statement is that you can set an initial value; by default, RETAIN initializes the variable to missing. The sum statement always initializes the variable to 0.
data works;
retain Add_Me 0;
/* Do loop just simulates dataset iterations */
do i = 1 to 10;
Add_Me = sum(Add_Me, 1);
output;
end;
drop i;
run;
OR
data works2;
/* Do loop just simulates dataset iterations */
do i = 1 to 10;
Add_Me+1;
output;
end;
drop i;
run;
Note that the sum statement does both of these steps and also handles missing values (a missing increment is treated as zero). Think of it as a shortcut.
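The retain-and-increment pattern is just a running sum; here is a hypothetical Python sketch of the works examples above:

```python
from itertools import accumulate

# Each "data step iteration" adds 1; retaining means carrying the
# running total across iterations instead of resetting it each time
increments = [1] * 10
add_me = list(accumulate(increments))   # running totals 1..10
```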
I hope this resolved your problem, and again welcome to Stack Overflow!

Why does merge automatically set the latest dataset I created?

I'm having trouble using merge, and I realized why: besides the tables I want to merge, SAS seems to automatically add the latest table I created. The following code illustrates the issue:
DATA table1; /* to be merged dataset no 1*/
input X rep Y Z;
cards;
1 1 0 2
5 1 2 6
5 2 5 2
;
run;
proc sort; by x rep; run;
data table3; /* to be merged dataset no 2 */
input X;
cards;
1
5
5
10
10
15
;
run;
proc sort; by x; run;
data table3; /* rep stands for 'replicate' and makes sure there is no uniqueness issue */
set table3; by x;
retain rep;
if first.x then rep=0;
rep=rep+1; /*rep+1; */
run;
data table2; /*some other table having nothing to do with the merge*/
input Y W;
cards;
1 0
1 0
2 0
3 0
3 0
8 0
;
run;
data merge1;
merge table3 table1;
by x rep;
set nobs=n;
run;
When it is submitted, the log shows that the latest table created (table2) is somehow used to create merge1. Actually, table2 columns are added to what merge1 should be.
Trying to understand this, I found that this doesn't happen if I get rid of the set nobs=n; line in the definition of merge1.
I couldn't find why on the internet but I found several documents warning about how merge can be tricky (but for other reasons)...
Therefore, my questions are:
Why does this happen, and how do I fix it? (I need nobs in my calculations.) I could avoid the issue by doing the merge and the subsequent processing in separate data steps, but I would like to understand the whole thing and how to deal with it properly.
Is merge the best way to add values to only one column of a dataset? (Here, table1's column X is updated from table3, but Y and Z are not yet.) (This question is secondary if the first one is answered.)
The set nobs=n statement is implicitly reading in table2 from &SYSLAST.
It's like doing
data table2 ;
    /* some stuff */
run ;
data want ;
    set ; /* implicitly uses &SYSLAST - table2 in this case - as the input dataset */
run ;
I'm unsure what you intend to achieve with set nobs=n, but the merge data step without set nobs=n will return Y and Z values based on the join criteria.
EDIT:
data merge1;
    merge table3 table1 end=eof ;
    by x rep;
    if eof then call symputx('NOBS', _n_) ;
run;
data merge1 ;
    set merge1 ;
    NOBS = &NOBS ;
run ;
Output of merge1
X rep Y Z NOBS
1 1 0 2 6
5 1 2 6 6
5 2 5 2 6
10 1 6
10 2 6
15 1 6
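For illustration, the two-step workaround (capture the row count at end of file, then attach it as a constant column on a second pass) in a hypothetical Python/pandas sketch:

```python
import pandas as pd

# Hypothetical merged result, one row per x/rep combination
merge1 = pd.DataFrame({
    "x":   [1, 5, 5, 10, 10, 15],
    "rep": [1, 1, 2, 1, 2, 1],
})

# Equivalent of capturing _n_ at end=eof and broadcasting it as NOBS
merge1["NOBS"] = len(merge1)
```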