I'm having trouble using merge, and I realized why: besides the tables I want to merge, SAS seems to automatically add the latest table I created. The following code illustrates the issue:
DATA table1; /* to be merged dataset no 1*/
input X rep Y Z;
cards;
1 1 0 2
5 1 2 6
5 2 5 2
;
run;
proc sort; by x rep; run;
data table3; /* to be merged dataset no 2 */
input X;
cards;
1
5
5
10
10
15
;
run;
proc sort; by x; run;
data table3; /* rep stands for 'replicate' and makes sure there is no uniqueness issue */
set table3; by x;
retain rep;
if first.x then rep=0;
rep=rep+1; /*rep+1; */
run;
data table2; /*some other table having nothing to do with the merge*/
input Y W;
cards;
1 0
1 0
2 0
3 0
3 0
8 0
;
run;
data merge1;
merge table3 table1;
by x rep;
set nobs=n;
run;
When it is submitted, the log shows that the most recently created table (table2) is somehow used to build merge1: table2's columns are added to what merge1 should contain.
Trying to understand this, I found that it doesn't happen if I get rid of the set nobs=n; line in the definition of merge1.
I couldn't find an explanation on the internet, but I did find several documents warning that merge can be tricky (for other reasons)...
Therefore, my questions are:
Why does this happen, and how do I fix it? (I need nobs in my calculations.) I could sidestep the issue by doing the merge and the subsequent treatment in separate data steps, but I would like to understand what is going on and how to deal with it properly.
Is merge the best way to add values to only one column of a dataset? (Here, table1's column X is updated by table3, but Y and Z are not yet.) (This question is secondary if the first one is answered.)
The set nobs=n; statement has no dataset name, so it implicitly reads table2 via &SYSLAST (the most recently created dataset).
It's like doing
data table2 ;
/* some stuff */
run ;
data want ;
set ; /* implicitly uses &SYSLAST - table2 in this case - as the input dataset */
run ;
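You can see for yourself which dataset would be picked up by checking the automatic macro variable before the step (a quick check, assuming SAS 9.3+ for the &= syntax):
%put &=SYSLAST;   /* writes e.g. SYSLAST=WORK.TABLE2 to the log */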
I'm unsure what you intend to achieve with set nobs=n, but the merge data step without set nobs=n will return Y and Z values based on the join criteria.
EDIT:
data merge1;
merge table3 table1 end=eof ;
by x rep;
if eof then call symputx('NOBS',_n_) ;
run;
data merge1 ;
set merge1 ;
NOBS = &NOBS ;
run ;
Output of merge1
X   rep  Y  Z  NOBS
1   1    0  2  6
5   1    2  6  6
5   2    5  2  6
10  1    .  .  6
10  2    .  .  6
15  1    .  .  6
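If all you need is an observation count inside the same step, a one-step sketch is also possible (my addition, not part of the answer above): the root problem is that set with no dataset name reads _LAST_, so naming a dataset explicitly avoids pulling in table2. Note that nobs= here is the count of table3, filled in at compile time, which may or may not be the count you actually want.
data merge1;
    if 0 then set table3 nobs=n;  /* nobs= is assigned at compile time; if 0 stops any rows being read */
    merge table3 table1;
    by x rep;
    nobs_table3 = n;              /* copy to an ordinary variable if you need it on the output rows */
run;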
I have a table
t:([]a:`a`b`c;b:1 2 3;c:`x`y`z)
I would like to iterate and process each row.
The thing is that the processing logic for each row may produce an arbitrary number of rows; after the full iteration the result may look like this, e.g.
results:([]a:`a1`b1`b2`b3`c1`c2;x:1 2 2 2 3 3)
I have the following idea so far, but it doesn't seem to work:
uj { // some processing function } each t
But how does one return an arbitrary number of rows and append the results into a new table?
Assuming you are using something from the table entries to determine your arbitrary number of rows, you can use a dictionary that maps to a number (or to a function) to apply these values.
In this example, I use the c column of the original table to indicate the number of rows to return (and the number from 1 to count to).
As each entry of the table is a dictionary, I can index using the column names to get the values and build a new table.
I also use raze to join each of the results together, as they will each have the same schema.
raze {[x]
d:`x`y`z!1 3 2;
([]a:((),`$string[x[`a]],/:string 1+til d[x[`c]]);x:((),d[x[`c]])#x[`b])
} each t
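For reference, running this against the t above reproduces the results table from the question (output reconstructed by hand here, so treat it as illustrative):
q)raze {[x]d:`x`y`z!1 3 2;([]a:((),`$string[x[`a]],/:string 1+til d[x[`c]]);x:((),d[x[`c]])#x[`b])} each t
a  x
----
a1 1
b1 2
b2 2
b3 2
c1 3
c2 3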
Not sure if this is what you want, but you can try something like this:
ungroup select a:`${y,/:x}[string b]'[string a],b from t
Or you can use accumulators if you need the result of the previous row's calculations, like this:
{y[`b]+:last[x]`b;x,y}/[t;t]
If your processing function is outputting tables that conform, just raze should suffice:
raze {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Otherwise, use (uj/):
(uj/) {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
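The difference matters once the per-row results stop conforming. A small constructed illustration (not from the question's data): raze would signal a 'mismatch on these two tables, while (uj/) unions the columns and fills the gaps with nulls.
q)(uj/)(([]a:1 2);([]a:enlist 3;b:enlist`x))
a b
---
1
2
3 x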
Your best answer will depend very much on how you want to use the results computed from each row of t. It might suit you to normalise t; it might not. The key point here:
A table cell can be any q data structure.
The minimum you can do in this regard is to store the result of your processing function in a new column.
Below, an arbitrary binary function f returns its result as a dictionary.
q)f:{n:1+rand 3;(`$string[x],/:"123" til n)!n#y}
q)f [`a;2]
a1| 2
a2| 2
q)update d:a f'b from t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y (,`b1)!,2
c 3 z `c1`c2!3 3
But its result could be any q data structure.
You were considering a unary processing function:
q)pf:{#[x;`d;:;] f . x`a`b}
q)pf each t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y `b1`b2!2 2
c 3 z `c1`c2`c3!3 3 3
You might find other suggestions at KX Community.
If I understand your question correctly, you need something like this:
(uj/){}each t
Check this bit:
(uj/)enlist[t],{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
This part:
x:update x:i from
 / functional form of a select that takes random rows/columns
 ?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];
 / some form of if-else plus an update to generate column a (not bulletproof)
 {(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]
Basically, the above gives something like:
q){x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7;1 1 1 1 1 1 1 1;`x`x`x`x`x`x`x`x;0 1 2 3 ..
+`a`x!(`a0`a1`a2`a3`a4`a5;0 1 2 3 4 5)
+`a`b`c`x!(`a0`a1`a2;1 1 1;`x`x`x;0 1 2)
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7`a8`a9`a10`a11;1 1 1 1 1 1 1 1 1 1 1 1;`x`..
Or, taking the first one:
q)first{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
a b x
--------
a0 1 0
a1 1 1
a2 1 2
a3 1 3
a4 1 4
a5 1 5
a6 1 6
a7 1 7
a8 1 8
a9 1 9
a10 1 10
You can do
(uj/)enlist[t],{ // some function }each t
to get what you want. Drop the enlist[t] if you don't want the table you start with in your result.
Hope this helps.
I am analyzing a medical record dataset where the patients were screened for STIs at 4 different time points. The data manager created one line per patient per STI for each time period. I want to merge the dataset so there is one line per patient at each time point, with all of the diagnosed STIs listed.
I created new variables to capture each STI that would be listed under the Dx variable, but I can't figure out how to merge data within the same dataset so there is only one row per patient at each time point.
data dx;
set dx;
if dx='ANOGENITAL WARTS (CONDYLOMATA ACUMINATA)' then MRWarts=1;
if dx='CHLAMYDIA' then MRCHLAMYDIA=1;
if dx='DYSPLASIA (ANAL, CERVICAL, OR VAGINAL)' then MRDYSPLASIA=1;
if dx='GONORRHEA' then MRGONORRHEA=1;
if dx='HEPATITIS B (HBV)' then MRHEPB=1;
if dx='HUMAN PAPILLOMAVIRUSES (HPV)-ANY MANIFESTATION' then MRHPV=1;
if dx='PEDICULOSIS PUBIS' then MRPUBIS=1;
if dx='SYPHILIS' then MRSYPHILIS=1;
if dx='TRICHOMONAS VAGINALIS' then MRTRICHOMONAS=1;
run;
[Image of the data structure I am looking for]
Taking the sample dataset that you provided in the image, you can use a simple PROC TRANSPOSE for the desired outcome.
data have;
input Pt_ID interval_round DX $10.;
datalines;
4 1 HIV
4 1 Warts
3 1 HIV
5 2 Chlamydia
;
run;
proc sort data=have; by Pt_Id; run;
proc transpose data=have out=want(drop=_NAME_);
by Pt_Id;
id Dx;
var interval_round;
run;
proc print data=want; run;
Now, this code will create all the variables except interval_round. Say, for example, a patient was screened for HIV in round 1 and for Warts in round 2; technically that should be only one row, so how would you represent interval_round then?
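If the goal really is one row per patient per time point with all the 0/1 STI flags set, a DATA-step by-group collapse is another option. A rough sketch, assuming Pt_ID, interval_round, and dx are the actual variable names (the flag list here is abbreviated and partly hypothetical):
proc sort data=dx; by Pt_ID interval_round; run;

data dx_wide;
    set dx;
    by Pt_ID interval_round;
    retain MRWarts MRCHLAMYDIA MRGONORRHEA MRHIV;               /* carry flags across rows within the group */
    if first.interval_round then
        call missing(MRWarts, MRCHLAMYDIA, MRGONORRHEA, MRHIV); /* reset at the start of each patient/time point */
    if dx='ANOGENITAL WARTS (CONDYLOMATA ACUMINATA)' then MRWarts=1;
    if dx='CHLAMYDIA' then MRCHLAMYDIA=1;
    if dx='GONORRHEA' then MRGONORRHEA=1;
    if upcase(dx)='HIV' then MRHIV=1;
    if last.interval_round then output;                         /* one row per patient per time point */
    drop dx;
run;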
I am looking to have a flexible import structure in my SAS code. The import table from Excel looks like this:
data have;
input Fixed_or_Floating $ asset_or_liability $ Base_rate_new;
datalines;
FIX A 10
FIX L Average Maturity
FLT A 20
FLT L Average Maturity
;
run;
The original dataset I'm working with looks like this:
data have2;
input ID Fixed_or_Floating $ asset_or_liability $ Base_rate;
datalines;
1 FIX A 10
2 FIX L 20
3 FIX A 30
4 FLT A 40
5 FLT L 30
6 FLT A 20
7 FIX L 10
;
run;
The placeholder "Average Maturity" exists in the Excel file only when the new interest rate is determined by the average maturity of the bond. I have a separate function for this, which lets me look up and then left join the new base rate based on the closest interest rate. For example, if the bond matures in 10 years, I'll use a 10-year interest rate.
So my question is, how can I perform a simple merge, using similar code to this:
proc sort data = have;
by fixed_or_floating asset_or_liability;
run;
proc sort data = have2;
by fixed_or_floating asset_or_liability;
run;
data have3 (drop = base_rate);
merge have2 (in = a)
have (in = b);
by fixed_or_floating asset_or_liability;
run;
The problem at the moment is that my placeholder value doesn't read in, and I need it to stay as a word because that is how the Excel lookup table works - I then use an if statement such as
if base_rate_new = "Average Maturity" then do;
(Insert existing Function Here)
end;
So: just the import of the Excel file with the text placeholder intact, please and thank you.
TIA.
I'm not 100% sure whether this behaviour corresponds with how your data appears once you import it from Excel, but if I run your code to create have I get:
NOTE: Invalid data for Base_rate_new in line 145 7-13.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
145 FIX L Average Maturity
Fixed_or_Floating=FIX asset_or_liability=L Base_rate_new=. _ERROR_=1 _N_=2
NOTE: Invalid data for Base_rate_new in line 147 7-13.
147 FLT L Average Maturity
Fixed_or_Floating=FLT asset_or_liability=L Base_rate_new=. _ERROR_=1 _N_=4
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.HAVE has 4 observations and 3 variables.
Basically it's saying that when SAS tried to read the character strings into a numeric variable it couldn't, so it left them as missing values. If we print the table we can see the missing values:
proc print data=have;
run;
Result:
Fixed_or_ asset_or_ Base_
Floating liability rate_new
FIX A 10
FIX L .
FLT A 20
FLT L .
Assuming this truly is what your data looks like, we can use the coalesce function to achieve your goal.
data have3 (drop = base_rate);
merge have2 (in = a)
have (in = b);
by fixed_or_floating asset_or_liability;
base_rate_new = coalesce(base_rate_new,base_rate);
run;
The result of doing this gives us this table:
Fixed_or_ asset_or_ Base_
ID Floating liability rate_new
1 FIX A 10
3 FIX A 10
2 FIX L 20
7 FIX L 20
4 FLT A 20
6 FLT A 20
5 FLT L 30
The coalesce function returns the first non-missing value among the arguments you pass to it. So when base_rate_new already has a value it uses that, and when it doesn't it uses the base_rate field instead.
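If you really do need the literal text "Average Maturity" to survive the import, as the question asks, a minimal sketch is to read Base_rate_new as character instead. The delimited datalines below are only a stand-in for whatever your Excel import actually produces, and both tables are assumed to be sorted by the two key variables as in the question:
data have;
    infile datalines dsd dlm=',' truncover;
    input Fixed_or_Floating $ asset_or_liability $ Base_rate_new :$20.;
    datalines;
FIX,A,10
FIX,L,Average Maturity
FLT,A,20
FLT,L,Average Maturity
;
run;

data have3 (drop = base_rate);
    merge have2 (in = a) have (in = b);
    by fixed_or_floating asset_or_liability;
    if base_rate_new = 'Average Maturity' then do;
        /* (insert your existing average-maturity lookup here) */
    end;
    else if missing(base_rate_new) then
        base_rate_new = strip(put(base_rate, best.)); /* keep the column character throughout */
run;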
I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?
See here for common errors. Your 'loop error is because the :: inside the strings you pass to value creates views, not globals. Inside a function, value evaluates as if it were outside the function, so you don't need the :: at all.
That said there's lots of room for improvement, here's a few pointers.
You don't need the value at all in your case. For example, the first line can be reduced to the following (I'm assuming mdDates is some kind of function you're just dropping in to work out the date from an integer, and DateInd some kind of global):
{[nday]
temp0:update date:mdDates[nday;DateInd] from orders;
....
} each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very noddily) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4
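A tidier way to rename columns (not used in this answer, though the adverb-over answer below leans on it) is xcol, which takes a list of new names for the leading columns:
q)`a1`b xcol t
a1 b
----
1  3
2  4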
It seems like you want p0 to p5 columns with close prices for date+0 through date+5. Using the over adverb to iterate over the day offsets (the example below runs offsets 0 through 4; extend the list to include 5 for T+5):
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date sym qty p0 p1 p2 p3 p4
---------------------------------------------------------------
2018.01.01 A 0 10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G 3 13.1917 8.515314 19.018 19.18736 6.64622
2018.01.03 A 2 6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A 7 18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G 0 19.18736 6.64622 11.16619 2.437314 4.698096
I have a dataset, which consists of 1000 simulations. The output of each simulation is saved as a row of data. There are variables alpha, beta and simulationid.
Here's a sample dataset:
simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
I want to estimate a new value - let's call it new - which depends on alpha and beta as well as different levels of two other variables which we'll call risk and price. Values of risk range from 0 to 100, price from 0 to 500 in steps of 5.
What I want to achieve is a dataset that consists of values representing the probability that (across the simulations) new is greater than 0 for combinations of risk and price.
I can achieve this using the code below. However, the reshape process takes more hours than I'd like. And it seems to me to be something that could be completed a lot quicker.
So, my question is either:
i) is there an efficient way to generate multiple datasets from a single row of data without multiple reshape, or
ii) am I going about this in totally the wrong way?
set maxvar 15000
/* Input sample data */
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
forvalues risk = 0(1)100 {
forvalues price = 0(5)500 {
gen new_r`risk'_p`price' = `price' * (`risk'/200)* beta - alpha
gen probnew_r`risk'_p`price' = 0
replace probnew_r`risk'_p`price' = 1 if new_r`risk'_p`price' > 0
sum probnew_r`risk'_p`price', mean
gen mnew_r`risk'_p`price' = r(mean)
drop new_r`risk'_p`price' probnew_r`risk'_p`price'
}
}
drop if simulationid > 1
save simresults.dta, replace
forvalues risk = 0(1)100 {
clear
use simresults.dta
reshape long mnew_r`risk'_p, i(simulationid) j(price)
keep simulation price mnew_r`risk'_p
rename mnew_r`risk'_p risk`risk'
save risk`risk'.dta, replace
}
clear
use risk0.dta
forvalues risk = 1(1)100 {
merge m:m price using risk`risk'.dta, nogen
save merged.dta, replace
}
Here's a start on your problem.
So far as I can see, you don't need more than one dataset.
The various reshapes and merges just rearrange what was first generated and that can be done within one dataset.
The code here in the first instance is for just one pair of values of alpha and beta. To simulate 1000 such pairs, you would need 1000 times more observations, i.e. about 10 million, which is not usually a problem, and you would need to loop over the alphas and betas. But the loop can be tacit. We'll get to that.
This code has been run and is legal. It's limited to one alpha, beta pair.
clear
input simulationid beta alpha
1 0.025840106 20.59671241
2 0.019850549 18.72183088
3 0.022440886 21.02298228
4 0.018124857 20.38965861
5 0.024134726 22.08678021
6 0.023619479 20.67689981
7 0.016907209 17.69609466
8 0.020036455 24.6443037
9 0.017203175 24.32682682
10 0.020273349 19.1513272
end
local N = 101 * 101
set obs `N'
egen risk = seq(), block(101)
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
gen result = (price * (risk/200)* beta[1] - alpha[1]) > 0
bysort price risk: gen mean = sum(result)
by price risk: replace mean = mean[_N]/_N
Assuming now that you first read in 1000 values, here is a sketch of how to get the whole thing. This code has not been tested. That is, your dataset starts with 1000 observations; you then enlarge it to 10 million or so, and get your results. The tricksy part is using an expression for the subscript to ensure that each block of results is for a distinct alpha, beta pair. That's not compulsory; you could do it in a loop, but then you would need to generate outside the loop and replace within it.
local N = 101 * 101 * 1000
set obs `N'
egen risk = seq(), block(101) to(101)  // to() makes the sequence restart at 1 for each simulation block
replace risk = risk - 1
egen price = seq(), from(0) to(100)
replace price = 5 * price
egen sim = seq(), block(10201)
gen result = (price * (risk/200)* beta[ceil(_n/10201)] - alpha[ceil(_n/10201)]) > 0
bysort sim price risk: gen mean = sum(result)
by sim price risk: replace mean = mean[_N]/_N
Other devices used: egen to set up in blocks; getting the mean without repeated calls to summarize; using a true-or-false expression directly.
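If the final product you want is just the 101 x 101 grid of probabilities across the 1000 simulations, one further step (a sketch of my own, not part of the answer above) is to collapse the big dataset rather than reshape it:
* probability that new > 0 for each (risk, price) pair = mean of the 0/1 result across simulations
collapse (mean) prob_new_gt0 = result, by(risk price)
save probgrid.dta, replace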
NB: I haven't tried to understand what you are doing, but it seems to me that the price-risk-simulation conditions define single values, so calculating a mean looks redundant. But perhaps that is in the code because you wish to add further detail to the code once you have it working.
NB2: This seems a purely deterministic calculation. Not sure that you need this code at all.