Bootstrapping Stepwise Regression in Stata - simulation

I'm trying to bootstrap a stepwise regression in Stata and extract the bootstrapped coefficients. I have two separate ado files. sw_pbs is the command the user uses, which calls the helper command sw_pbs_simulator.
program define sw_pbs, rclass
syntax varlist, [reps(integer 100)]
simulate _b, reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
local depvar : word 1 of `varlist'
local indepvar : list varlist - depvar
reg `depvar' `indepvar'
local rmse = e(rmse)
matrix b_matrix = e(b)'
gen col_of_ones = 1
mkmat `indepvar' col_of_ones, mat(x_matrix)
gen errs = rnormal(0, `rmse')
mkmat errs, mat(e_matrix)
matrix y = x_matrix * b_matrix + e_matrix
svmat y
sw reg y `indepvar', pr(0.10) pe(0.05)
drop col_of_ones errs y
end
The output is a data set of the bootstrapped coefficients. My problem is that the output seems to be dependent on the result of the first stepwise regression simulation. For example if I had the independent variables var1 var2 var3 var4 and the first stepwise simulation includes only var1 and var2 in the model, then only var1 and var2 will appear in subsequent models. If the first simulation includes var1 var2 and var3 then only var1 var2 and var3 will appear in subsequent models, assuming that they are significant (if not their coefficients will appear as dots).
For example, the incorrect output is featured below. The variables lweight, age, lbph, svi, gleason and pgg45 never appear if they do not appear in the first simulation.
_b_lweight _b_age _b_lbph _b_svi _b_lcp _b_gleason _b_pgg45 _b_lpsa
.4064831 .5390302
.2298697 .5591789
.2829061 .6279869
.5384691 .6027049
.3157105 .5523808
I want coefficients that are not included in the model to always appear as dots in the data set and I want subsequent simulations to not be seemingly dependent on the first simulation.

By using _b as a short-cut, the first iteration defined which coefficients were to be stored by simulate in all subsequent iterations. That is fine for most simulation programs, as those would use a fixed set of coefficients, but not what you want to use in combination with sw. So I adapted the program to explicitly list the coefficients (possibly missing when not selected) that are to be stored.
I also changed your programs such that they will run faster by avoiding mkmat and svmat and replacing those computations with predict and generate. I also changed it to make it fit more with conventions in the Stata community that a command will only replace a dataset in memory after a user explicitly asks for it by specifying the clear option. Finally I made sure that names of variables and scalars created in the program do not conflict with names already present in memory by using tempvar and tempname. These will also be automatically deleted when the program ends.
clear all
program define sw_pbs, rclass
syntax varlist, clear [reps(integer 100)]
gettoken depvar indepvar : varlist
foreach var of local indepvar {
local res "`res' `var'=r(`var')"
}
simulate `res', reps(`reps') : sw_pbs_simulator `varlist'
end
program define sw_pbs_simulator, rclass
syntax varlist
tempname rmse b
tempvar yhat y
gettoken depvar indepvar : varlist
reg `depvar' `indepvar'
scalar `rmse' = e(rmse)
predict double `yhat' if e(sample)
gen double `y' = `yhat' + rnormal(0, `rmse')
sw reg `y' `indepvar', pr(0.10) pe(0.05)
// start returning coefficients
matrix `b' = e(b)
local in : colnames `b'
local out : list indepvar - in
foreach var of local in {
return scalar `var' = _b[`var']
}
foreach var of local out {
return scalar `var' = .
}
end


Is there a way to declare variable local in nested functions?

Is there a way to declare a variable local in a nested function?
By declare I mean name a variable as usual but enforce that its scope starts in-place. Imagine creating a new nested function in the middle of a large program. There are natural variable names you wish to use and you would not want to worry whether you have to check existing variable names every single time you create a new variable.
To describe the desired effect, I'll use two examples. One minimal. One shows the problem a little better visually.
Short example
function fn1
var = 1
function fn2
local var = 'a';
function fn3
end
end
end
Within fn2 and fn3, var refers to the new variable with starting value 'a' while outside fn2, var with starting value 1 is still available as usual.
Long example
function fn1
var = 1;
var2 = 2;
function fn2
var2 = 'I can access var2 from fn1. Happy.'
local var = 'a'; % remove local to run this snippet
fn3;
function fn3
var2 = 'I can access var2 from fn1. Happy.'
var = 'fn2 cannot safely use variable name var because it may have been used in fn1. But var is the natural name to use in fn2. Sad.';
var = 1;
var2 = 2;
end
end
function fn4
var2 = 'I can also access var2 from fn1. Also happy.'
var = 'If only local scoping works, I would still be able to access var. Would be happy.';
end
fn4;
fn2;
var,
var2,
end
%% desired output, but not real
>> fn1;
var =
1
var2 =
2
Is there a way to accomplish the above?
Currently, I do my best to give variables that are not local in nature special, non-generic names, and to name variables that are obviously local in nature temp#, where # is an integer. (I suppose clearing variables after their last known use can help sometimes, but I'd rather not have to do that; local variables are too numerous.) That works for small programs. But with larger programs, I find it hard to avoid inadvertently overwriting a variable that has already been named at a higher scoping level. It also adds a level of complexity to the thought process, which is not exactly good for efficiency, because when creating a program, not all variables are either obviously local or obviously not local. A flexible scoping mechanism would be very helpful.
I write my programs in Sublime Text, so I am not familiar with the MATLAB editor. Does the editor have visual guards or warning prompts against errors arising from inflexible scoping? A warning that requires visually scanning through the whole program is barely useful, but at least it would be something.
No, there is no way in MATLAB to declare a nested function variable to be local to that function if the variable also exists in the external scope (i.e., the function containing the nested function).
The behavior of Nested Functions is fully described in MATLAB documentation and what you are asking is not possible or at least not documented.
It is specifically stated that the supported behavior is
This means that both a nested function and a function that contains it can modify the same variable without passing that variable as an argument.
and no remedy to prevent this behavior is mentioned in the documentation.
Variables that are defined as input arguments are local to the function. So you can define var as an input argument of fn2*:
function fn2 (var)
...
end
However, if you would like to define fn2 without changing its signature, you need to add an extra level of nesting:
function fn1
var = 1;
var2 = 2;
function fn2
fn2_impl([]);
function fn2_impl (var)
var2 = 'I can access var2 from fn1. Happy.'
var = 'a'; % local to fn2_impl because var is an input argument
fn3;
function fn3
var2 = 'I can access var2 from fn1. Happy.'
var = 'fn2 cannot safely use variable name var because it may have been used in fn1. But var is the natural name to use in fn2. Sad.';
var = 1;
var2 = 2;
end
end
end
function fn4
var2 = 'I can also access var2 from fn1. Also happy.'
var = 'If only local scoping works, I would still be able to access var. Would be happy.';
end
fn4;
fn2;
var,
var2,
end
Here fn2_impl is the actual implementation of fn2, and it inherits all variables that are inherited by fn2. var is local to fn2_impl because it is an input argument.
However, I recommend local functions as suggested by @CrisLuengo. If variables need to be shared, an OO style of programming is more readable and maintainable than implicit sharing through nested functions.
* Thanks to @CrisLuengo, who pointed out that it is possible to omit input arguments when calling MATLAB functions.
Nested functions have a very specific use case. They are not intended to avoid having to pass data into a function as input and output arguments, which is to me what you are attempting. Your example can be written using local functions:
function fn1
var = 1;
var2 = 2;
[var,var2] = fn4(var,var2);
var2 = fn2(var2);
var,
var2,
end
function var2 = fn2(var2)
var2 = 'I can access var2 from fn1. Happy.'
var = 'a'; % remove local to run this snippet
[var,var2] = fn3(var,var2);
end
function [var,var2] = fn3(var,var2)
var2 = 'I can access var2 from fn1. Happy.'
var = 'fn2 cannot safely use variable name var because it may have been used in fn1. But var is the natural name to use in fn2. Sad.';
var = 1;
var2 = 2;
end
function [var,var2] = fn4(var,var2)
var2 = 'I can also access var2 from fn1. Also happy.'
var = 'If only local scoping works, I would still be able to access var. Would be happy.';
end
The advantage is that the size of fn1 is much reduced, it fits within one screen and can much more easily be read and debugged. It is obvious which functions modify which variables. And you can name variables whatever you want because no variable scope extends outside any function.
As far as I know, nested functions can only be gainfully used to capture scope in a lambda (a function handle in MATLAB speak): you can write a function that creates a lambda (a handle to a nested function) that captures a local variable, and then return that lambda to your caller for use. That is a powerful feature, though useful only situationally. Outside that, I have not found a good use for nested functions. It's just something you should try to avoid, IMO.
Here's an example of the lambda with captured data (not actually tested, it's just to give an idea; also it's a rather silly application, since MATLAB has better ways of interpolating, just bear with me). Create2DInterpolator takes scattered x, y and z sample values. It uses meshgrid and griddata to generate a regular 2D grid representing those samples. It then returns a handle to a function that interpolates in that 2D grid to find the z value for a given x and y. This handle can be used outside the Create2DInterpolator function, and contains the 2D grid representation that we created. Basically, interpolator is an instance of a functor class that contains data. You could implement the same thing by writing a custom class, but that would require a lot more code, a lot more effort, and an additional M-file. More information can be had in the documentation.
interpolator = Create2DInterpolator(x,y,z);
newZ = interpolator(newX,newY);
function interpolator = Create2DInterpolator(x,y,z)
[xData,yData] = meshgrid(min(x):max(x),min(y):max(y));
zData = griddata(x,y,z,xData,yData);
interpolator = @InterolatorFunc;
function z = InterolatorFunc(x,y)
z = interp2(xData,yData,zData,x,y);
end
end
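Since the interpolator above is untested, a smaller sketch of the same capture mechanism may be easier to check by hand. The names demo_closure and makeCounter are hypothetical, and the code is assumed to be saved as a single function file (demo_closure.m), because nested functions cannot be defined in a script.

```matlab
function demo_closure
    % Each call to makeCounter returns a handle with its own captured state
    next = makeCounter(10);
    a = next();    % count goes from 10 to 11
    b = next();    % count goes from 11 to 12
    fprintf('%d %d\n', a, b);
end

function h = makeCounter(start)
    count = start;          % local variable captured by the nested function
    h = @increment;         % the handle keeps this workspace alive
    function value = increment()
        count = count + 1;  % modifies the captured variable
        value = count;
    end
end
```

The variable count is never visible to the caller; it can only be reached through the returned handle, which is the situation where a nested function does something a local function cannot.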

Linking input file with variable stored in it to several function files

My code is spread over multiple function files. The inputs to these functions are stored in one script file called inputfile.m, in which I assign constant values to them. These values act as inputs to several function files, such as degree_eq.m.
How can I write the code so that, on every execution, the function files take the required inputs from inputfile.m?
Let's say you have two functions, one with your inputs (inputfile) and one where you do stuff (do_stuff).
function [a,b,c] = inputfile()
%define your constants
a=10;
b=100;
c=8.3;
function z = do_stuff()
[a, b, c] = inputfile() %takes the inputs from inputfile.m
z = a*c - b;
You can exploit the fact that a MATLAB script has no workspace of its own, so the variables it defines remain available to whoever ran it. Let's say you have 6 constants a, b, c, d, e, f defined in the input file. You can then write a top-level script called top.m, which would look something like
inputfile
degree_eq1(a,b,c)
degree_eq2(c,d,e)
A third approach (combining Nirvedh Meshram and qbzenker answers) is to call an input script inside your MATLAB functions.
The advantage is that you do not have to specify which parameters are needed from or specified in your input script, but this is a disadvantage too, because the needed inputs are not made explicit. So, it is much more error prone. I only recommend this approach for a large number of input variables.
inputfile.m:
a = 5;
b = 8;
c = 10;
degree_eq.m:
function d = degree_eq()
inputfile;
d = a + b + c;
end
As an alternative, you can specify which input file to use:
degree_eq.m:
function d = degree_eq(inputFilename)
eval(inputFilename);
d = a + b + c;
end
and call it as follows:
degree_eq('inputfile');
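A variation on the first answer, in case the list of constants grows: return them in one struct, so each function still receives its inputs explicitly. This is only a sketch; params_demo, inputparams, degree_eq_sketch and the field names are assumptions rather than part of the original code, and the file is assumed to be saved as params_demo.m.

```matlab
function params_demo
    p = inputparams();          % read all constants once
    d = degree_eq_sketch(p);    % pass them on explicitly
    fprintf('d = %g\n', d);
end

function p = inputparams()
    % Hypothetical constants, mirroring inputfile.m above
    p.a = 5;
    p.b = 8;
    p.c = 10;
end

function d = degree_eq_sketch(p)
    % Uses only the fields it needs
    d = p.a + p.b + p.c;
end
```

Compared with running a script inside the function, this keeps the dependency visible in the function signature while still avoiding a long argument list.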

Matlab load mat into variable

When loading data from a .mat file directly into a variable, it stores a struct instead of the variable itself.
Example:
myData.mat contains var1, var2, var3
if I do:
load myData.mat
it will create the variables var1, var2 and var3 in my workspace. OK.
If I assign what load returns to a variable, it stores a struct. This is normal since I'm loading several variables.
foo = load('myData.mat')
foo =
struct with fields:
var1
var2
var3
However, suppose that I'm only interested in var1 and I want to store it directly in a variable foo.
load has an option for loading only specific variables from a .mat file; however, it still stores a struct.
foo = load('myData.mat', 'var1')
foo =
struct with fields:
var1
I want var1 to be directly assigned to foo.
Of course I can do:
foo = load('myData.mat', 'var1')
foo = foo.var1;
But there should be a way of doing this automatically in one line, right?
If the MAT-file contains one variable, use
x = importdata(mat_file_name)
load does not behave this way because otherwise it would behave inconsistently depending on the number of variables you requested, which would be extremely confusing.
To illustrate this, imagine that you wrote a general program that wanted to load all variables from a .mat file, make some modification to them, and then save them again. You want this program to work with any file so some files may have one variable and some may have multiple variables stored in them.
If load used the behavior you've specified, then you'd have to add in all sorts of logic to check how many variables were stored in a file before loading and modifying it.
Here is what this program would look like with the current behavior of load
function modifymyfile(filename)
data = load(filename);
fields = fieldnames(data);
for k = 1:numel(fields)
data.(fields{k}) = modify(data.(fields{k}));
end
save(filename, '-struct', 'data')
end
If the behavior was the way that you think you want
function modifymyfile(filename)
% Use a matfile to determine the number of variables
vars = whos(matfile(filename));
% If there is only one variable
if numel(vars) == 1
% Assign that variable (have to use eval)
tmp = load(filename, vars(1).name);
tmp = modify(tmp);
% Now to save it again, you have to use eval to reassign
eval([vars(1).name, '= tmp;']);
% Now resave
save(filename, vars(1).name);
else
data = load(filename);
fields = fieldnames(data);
for k = 1:numel(fields)
data.(fields{k}) = modify(data.(fields{k}));
end
save(filename, '-struct', 'data');
end
end
I'll leave it to the reader to decide which of these is more legible and robust.
The best way to do what you're trying to do is exactly what you've shown in your question. Simply reassign the value after loading
data = load('myfile.mat', 'var1');
data = data.var1;
Update
Even if load only skipped the struct when a single variable was explicitly specified, you'd still end up with inconsistent behavior, which would cause trouble if my program accepted the list of variables to change as a cell array
variables = {'var1', 'var2'}
data = load(filename, variables{:}); % Would yield a struct
variables = {'var1'};
data = load(filename, variables{:}); % Would not yield a struct
@Suever is right, but in case you wish for a one-line workaround, this will do it:
foo = getfield(load('myData.mat'), 'var1');
It looks ugly but does what you want:
foo = subsref(matfile('myData.mat'),struct('type','.','subs','var1'))
Using matfile allows partial loading of variables into memory, i.e. it only loads what is necessary. The function subsref does the job of the indexing operator "." in this case.
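To compare the two-step form with the getfield one-liner, the following sketch writes a small temporary MAT-file and reads var1 back both ways. The data and the temporary file name are made up for illustration.

```matlab
% Create a small MAT-file to experiment with (temporary, hypothetical data)
var1 = magic(3);
var2 = 'text';
fname = [tempname '.mat'];
save(fname, 'var1', 'var2');

% Two-step form from the question
s = load(fname, 'var1');
a = s.var1;

% One-line form using getfield
b = getfield(load(fname, 'var1'), 'var1');

% Both forms should return the stored matrix
disp(isequal(a, b, var1))

delete(fname);   % clean up the temporary file
```

Passing 'var1' to load in the one-liner as well avoids reading var2 from disk only to discard it.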

Creating a function with variable number of inputs?

I am trying to define the following function in MATLAB:
file = @(var1,var2,var3,var4) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4)'];
However, I want the function to expand as I add more parameters; if I wanted to add the variable vark, I want the function to be:
file = @(var1,var2,var3,var4,vark) ['var1=' num2str(var1) 'var2=' num2str(var2) 'var3=' num2str(var3) 'var4=' num2str(var4) 'vark=' num2str(vark)'];
Is there a systematic way to do this?
Use fprintf with varargin for this:
f = @(varargin) fprintf('var%i= %i\n', [(1:numel(varargin));[varargin{:}]])
f(5,6,7,88)
var1= 5
var2= 6
var3= 7
var4= 88
The format I've used is: 'var%i= %i\n'. This means it will first write var then %i says it should input an integer. Thereafter it should write = followed by a new number: %i and a newline \n.
It will choose the integer in odd positions for var%i and integers in the even positions for the actual number. Since the linear index in MATLAB goes column for column we place the vector [1 2 3 4 5 ...] on top, and the content of the variable in the second row.
By the way: If you actually want it on the format you specified in the question, skip the \n:
f = @(varargin) fprintf('var%i= %i', [(1:numel(varargin));[varargin{:}]])
f(6,12,3,15,5553)
var1= 6var2= 12var3= 3var4= 15var5= 5553
Also, you can change the second %i to another conversion, such as %f (fixed-point) or %g (compact notation), if the values are not integers.
If you want to use actual variable names var1, var2, var3, ... in your input then I can only say one thing: Don't! It's a horrible idea. Use cells, structs, or anything else than numbered variable names.
Just to be crystal clear: Don't use the output from this in MATLAB in combination with eval! eval is evil. MathWorks actually warns you about this in the official documentation!
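Note that fprintf prints to the Command Window and returns a byte count, so if the goal is a string (for example, to build a file name), sprintf with the same interleaving trick may be closer to what the question asks for. A minimal sketch, assuming every input is a numeric scalar:

```matlab
% Build the string instead of printing it; %g also handles non-integers.
% Row 1 holds the indices, row 2 the values; sprintf reads column by column.
file = @(varargin) sprintf('var%d=%g', [1:numel(varargin); varargin{:}]);

name3 = file(5, 6, 7);        % 'var1=5var2=6var3=7'
name4 = file(5, 6, 7, 8.5);   % 'var1=5var2=6var3=7var4=8.5'
disp(name3)
disp(name4)
```

The handle accepts any number of arguments without being redefined, which is the "systematic" behaviour asked for.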
How about calling the function once per parameter? I wrote this considering the specific form of the character string returned by your function, where i is the index of the i-th variable to be entered. The array var can hold the list of your numeric parameters.
file=@(var,i)[strcat('var',num2str(i),'=') num2str(var) ];
var=[2,3,4,5];
str='';
for i=1:length(var);
str=strcat(str,file(var(i),i));
end
If you want a function to accept a flexible number of input arguments, you need varargin.
In case you want the final string to be composed of the names of your variables as in your workspace, I found no way, since you need varargin and then it looks impossible. But if you are fine with having var1, var2 in your string, you can define this function and then use it:
function str = strgen(varargin)
str = '';
for ii = 1:numel(varargin);
str = sprintf('%s var%d = %s', str, ii, num2str(varargin{ii}));
end
str = str(2:end); % to remove the initial blank space
It is also compatible with strings. Testing it:
% A = pi;
% B = 'Hello!';
strgen(A, B)
ans =
var1 = 3.1416 var2 = Hello!

Different behaviour of function in a for-loop or when unrolling of the loop is performed

I'm seeing odd behaviour from my functions, and since I'm not very used to MATLAB coding, I guess it is due to something really simple that I don't get.
I can't understand how this could print something different
fx(Punti(1,:),Punti(2,:))
fx(Punti(2,:),Punti(3,:))
fx(Punti(3,:),Punti(4,:))
fx(Punti(4,:),Punti(5,:))
from this
for i_unic=1:4
fx(Punti(i_unic,:),Punti(i_unic+1,:))
end
Consider fx as a generic function.
Is it possible that fx uses some variables that for some reason are erased at the end of each iteration?
EDIT
-->"Punti" is just matrix containing the points a SCARA robot should follow
-->fx is the function "Retta" and it's the following
function retta(PuntoA,PuntoB,Asse_A,q_ini,rot,contaerro,varargin)
global SCARA40
global inizio XX YY ZZ
global seg_Nsteps
npassi = seg_Nsteps;
ipuntofin = inizio + npassi;
for ipunto = inizio : ipuntofin
P4 = PuntoA + (ipunto-inizio)*(PuntoB-PuntoA)/npassi;
q = kineinversa(Asse_A,P4,q_ini,rot);
Mec = SCARA40.fkine(q);
Pec = Mec(:,4);
if (dot((P4-Pec),(P4-Pec),3)>0.0001)
fprintf(1,'\n P4 Desid. = [%9.1f %9.1f %9.1f %9.1f ] \n',P4);
fprintf(1,'\n P4 Attuato = [%9.1f %9.1f %9.1f %9.1f ] \n',Pec);
contaerro = contaerro + 1;
else
q_ini = q;
end
SCARA40.plot(q);
XX(ipunto) = Pec(1);
YY(ipunto) = Pec(2);
ZZ(ipunto) = Pec(3);
if(nargin>6)
color = varargin{1};
else
color = 'r';
end
plot3(XX,YY,ZZ,color,'LineWidth',1 );
drawnow;
hold on
end
end
the test function with the results
Punti = [ 10,10,0,1 ;10,-10,0,1 ;-10,-10,0,1 ; -10,10,0,1 ] ;
%inizio=1
%retta(Punti(1,:)',Punti(2,:)',Asse_A,q_ini,rot,contaerro)
%inizio=21
%retta(Punti(2,:)',Punti(3,:)',Asse_A,q_ini,rot,contaerro)
%inizio=41
%retta(Punti(3,:)',Punti(4,:)',Asse_A,q_ini,rot,contaerro)
%inizio=61
inizio=1
for i=1:length(Punti)-1
retta(Punti(i,:)',Punti(i+1,:)',Asse_A,q_ini,rot,contaerro)
inizio=inizio+20;
end
the two images have been generated restarting Matlab
Addressing the question in the most general sense (since there is no sample given for the function fx or the function/variable Punti) then the reason you are getting different results is likely that the state of your variables/workspace is different when you test one case versus the other. How could this happen? Here are some obvious ways...
Your functions (or possibly other functions they call) are making use of the random number generator, and the starting state of the RNG is different when you test the loop versus unrolled loop case.
Your functions are sharing global variables that aren't reset to some default value at the start of each test case. You mention in a comment that the functions use global variables, so this is likely your problem.
Your functions aren't really functions, but scripts. Scripts all share a common workspace (the base workspace), whereas a function (and specifically each call to a function) will have its own unique workspace. If fx is actually a script, each call may change any or all of the variables in the base workspace. Furthermore, any other scripts, or anything you type into the command line, can change things as well. The contents of the base workspace may therefore be different when you test the loop versus unrolled loop case.
If I were to hazard a guess, I'd say that if you were to exit and restart MATLAB before each test case (i.e. reset everything to the same default starting state) you would probably get the same exact result for the loop versus unrolled loop case.
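The global-variable cause is easy to reproduce in isolation. The sketch below uses a global named inizio like the question's code, but advance_segment and global_state_demo are hypothetical stand-ins, not the actual retta function; the code is assumed to be saved as global_state_demo.m.

```matlab
function global_state_demo
    global inizio
    inizio = 1;
    first  = advance_segment();
    second = advance_segment();   % identical call, different state
    fprintf('%d %d\n', first, second);

    inizio = 1;                   % reset the shared state explicitly
    third = advance_segment();
    fprintf('%d\n', third);
end

function start = advance_segment()
    global inizio
    start  = inizio;              % the result depends on the global
    inizio = inizio + 20;         % the side effect persists after the call
end
```

The first two calls are textually identical yet return 1 and 21; only after the global is reset does the call return 1 again. If either test case starts from leftover values of such globals, the loop and the unrolled version will not match.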