Logistic regression Macro - macros

%macro intercept(i1= ,i2= );
%let n = %sysfunc(countw(&i1));
%do i = 1 %to &n;
%let val_i1 = %scan(&i1,&i,'');
%let val_i2 = %scan(&i2,&i,'');
data scores;
set repeat_score2;
/* Segment 1 probability score */
p1 = 0;
z1 = &val_i1 +
a * 0.03 +
r * -0.0047841 +
p * -0.000916081 ;
p1 = 1/(1+2.71828**-z1);
/* Segment 2 probability score */
p2 = 0;
z2 = &val_i2 +
r * 0.09 +
m * 0.012786245 +
c * -0.00179618 ;
p2 = 1/(1+2.71828**-z2);
logit_score = 0;
if max(p1,p2) = p1 then logit_score = 1;
else if max(p1,p2) = p2 then logit_score = 2;
run;
proc freq data = scores;
table logit_score * clu_ /nocol norow nopercent;
run;
%end;
%mend;
%intercept (i1=-0.456491042, i2=-3.207379842, i3=-1.380627318 , i4=0.035684096, i5=-0.855283373);
%intercept (i1=-0.456491042 0, i2=-3.207379842 -3.207379842, i3=-1.380627318 -1.380627318, i4=0.035684096 0.035684096,
i5=-0.855283373 -0.855283373);
I have the above macro, which takes the intercepts for the two models above, calculates the probability scores, and then assigns each record to a segment based on those scores.
The first problem is that when I execute the macro with one value per argument, the macro variable 'n' resolves to 2 and the loop executes twice. The first iteration gives the right results, but the second is wrong.
In the second call (two values per argument), n resolves to 3 and %scan returns both values together (e.g. i1 for the first iteration is -0.45 and 0). If I remove the space, it takes '.' as the delimiter and resolves to 0, 45, 0, one for each iteration. I don't get any results in this case.
How do I get this to work the right way?
Thanks!!!

By default, %SCAN and COUNTW treat blanks and punctuation symbols (such as the decimal point) as delimiters. Since your arguments include decimal points, you need to state explicitly that the delimiter is a blank for both COUNTW and %SCAN. You have done this for %SCAN, but not for COUNTW.
So the 2nd line of the code should be:
%let n = %sysfunc(countw(&i1,' '));
I'm not sure whether it's a typo or just a formatting issue, but the third argument of your %SCAN calls looks like two quotes together '', not quote-blank-quote ' ' as it should be. Note that in macro code the quote characters themselves also count as delimiters, so %str( ) is the usual way to pass a lone blank.
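To see why n came out as 2 and then 3, here is a small Python sketch that imitates the way COUNTW splits a string. It is only an illustration, not SAS: count_words is a hypothetical helper, and the default delimiter list it uses is my assumption of what SAS applies on ASCII systems.

```python
# Rough imitation of SAS COUNTW: count the "words" left after splitting on delimiters.
# DEFAULT_DELIMS is an assumed copy of the ASCII default delimiter list.
DEFAULT_DELIMS = " !$%&()*+,-./;<^|"

def count_words(text, delims=DEFAULT_DELIMS):
    words, current = [], ""
    for ch in text:
        if ch in delims:
            # a delimiter closes the current word, if there is one
            if current:
                words.append(current)
            current = ""
        else:
            current += ch
    if current:
        words.append(current)
    return len(words)

print(count_words("-0.456491042"))         # default delimiters
print(count_words("-0.456491042", " "))    # blank only
print(count_words("-0.456491042 0"))       # default delimiters
print(count_words("-0.456491042 0", " "))  # blank only
```

With the default list, the minus sign and the decimal point both split the value, which matches the counts of 2 and 3 reported in the question; with a blank as the only delimiter the counts are 1 and 2.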

Related

how to record the moving direction in SAS?

I have a maze file which is like this.
1111111
1001111
1101101
1101001
1100011
1111111
I also have a format $direction indicating the direction:
start end label
D D down
L L left
R R right
U U up
Then, I have a dataset indicating the start and end point.
Row Column
start 2 2
end 3 6
How can I record the moving direction from the start to the end like this?
direction row column
2 2
right 2 3
down 3 3
down 4 3
down 5 3
I have tried using an array:
array m(i,j)
if m(i,j) = 0 then
row=i;
column=j;
output;
However, the output is simply not in the correct moving order.
Thanks if you can help.
Here's one way of doing this. Writing a more generalised maze-solving algorithm using SAS data step logic is left as an exercise for the reader, but this should work for labyrinths.
/* Define the format */
proc format;
value $direction
'D' = 'down'
'L' = 'left'
'R' = 'right'
'U' = 'up'
;
run;
data want;
/*Read in the maze and start/end points in (y,x) orientation*/
array maze(6,7) (
1,1,1,1,1,1,1,
1,0,0,1,1,1,1,
1,1,0,1,1,0,1,
1,1,0,1,0,0,1,
1,1,0,0,0,1,1,
1,1,1,1,1,1,1
);
array endpoints (2,2) (
2,2
3,6
);
/*Load the start point and output a row*/
x = endpoints(1,2);
y = endpoints(1,1);
output;
/*
Navigate through the maze.
Assume for the sake of simplicity that it is really more of a labyrinth,
i.e. there is only ever one valid direction in which to move,
other than the direction you just came from,
and that the end point is reachable
*/
do _n_ = 1 by 1 until(x = endpoints(2,2) and y = endpoints(2,1));
if maze(y-1,x) = 0 and direction ne 'D' then do;
direction = 'U';
y + -1;
end;
else if maze(y+1,x) = 0 and direction ne 'U' then do;
direction = 'D';
y + 1;
end;
else if maze(y,x-1) = 0 and direction ne 'R' then do;
direction = 'L';
x + -1;
end;
else if maze(y,x+1) = 0 and direction ne 'L' then do;
direction = 'R';
x + 1;
end;
output;
if _n_ > 15 then stop; /*Set a step limit in case something goes wrong*/
end;
format direction $direction.;
drop maze: endpoints:;
run;
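For readers who want to check the walking logic outside SAS, here is the same idea as a minimal Python sketch. It makes the same labyrinth assumption (only one way forward at each step) and uses the maze and endpoints from the question; the names MAZE, MOVES and walk are mine, not part of the SAS code.

```python
# Walk a labyrinth: at each cell there is exactly one open neighbour other than
# the one we just came from. Coordinates are 1-based (row, column) as in the question.
MAZE = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
# move name -> (row step, column step, name of the opposite move)
MOVES = {
    "up": (-1, 0, "down"),
    "down": (1, 0, "up"),
    "left": (0, -1, "right"),
    "right": (0, 1, "left"),
}

def walk(maze, start, end, max_steps=100):
    row, col = start
    path = [("", row, col)]      # the start point has no direction
    last = None
    for _ in range(max_steps):   # step limit in case something goes wrong
        if (row, col) == end:
            return path
        for name, (dr, dc, opposite) in MOVES.items():
            # never turn straight back, and only step onto open (0) cells
            if last != opposite and maze[row + dr - 1][col + dc - 1] == 0:
                row, col, last = row + dr, col + dc, name
                path.append((name, row, col))
                break
        else:
            raise ValueError("dead end: no valid move")
    if (row, col) == end:
        return path
    raise ValueError("step limit reached before the end point")

for step in walk(MAZE, (2, 2), (3, 6)):
    print(step)
```

The first five rows printed correspond to the rows listed in the question (start, right, down, down, down); the walk then continues until it reaches row 3, column 6.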

improve performance of a double for loop in matlab

I'm doing some analysis where I'm analysing hundreds of data files, which are being analysed iteratively. Here is an example of the sort of data that I have:
start_time = datenum('1990-01-01');
end_time = datenum('2009-12-31');
time = start_time:end_time;
datx = rand(length(time),1);
daty = datx-2;
where I have a time variable and two data variables.
After loading the data I then need to pass the data through a function. However, I need to do this by including firstly the data from year 1 only, then from years 1 to 2; 1 to 3, 1 to 4 and so on until I pass the data through the function for the entire series. This can be performed with a loop with the following:
% split into different years
datev = datevec(time);
iyear = datev(:,1);
unique_year = unique(iyear);
for k = 1:length(unique_year);
idx = find(iyear >= unique_year(1) & iyear <= unique_year(k));
% select data for year
d_time = time(idx);
d_datx = datx(idx);
d_daty = daty(idx);
% now select individual years from this subset
datev2 = datevec(d_time);
iyear2 = datev2(:,1);
unique_year2 = unique(iyear2);
for k2 = 1:length(unique_year2);
idx2 = find(iyear2 == unique_year2(k2));
% select data for year
d_time2 = d_time(idx2);
d_datx2 = d_datx(idx2);
d_daty2 = d_daty(idx2);
% pass through some function
mae_out = some_function(d_datx2, d_daty2);
mae(k2) = mae_out;
end
mean_mae(k) = mean(mae);
end
function mae = some_function(datx, daty)
mae = mean(abs(datx - daty));
end
Note here that I'm using a very simple function as an example, and the actual function is more complex.
Having two loops like this takes a long time to run on my actual data. Is there a better/faster way that I can perform the above, possibly without loops?
If you record the previous result, you do not need the inner loop. You are currently computing a total of 20*21/2 = 210 inner iterations, but you only need to compute 20. The key here is that mean(a(1:k)) == (mean(a(1:k-1))*(k-1) + a(k)) / k (by the definition of the mean). Another optimization is to use logical indexing instead of find. It takes up a bit more space, but is typically faster.
% split into different years
datev = datevec(time);
iyear = datev(:,1);
unique_year = unique(iyear);
for k = 1:length(unique_year);
idx = (iyear == unique_year(k));
% select data for year
d_time = time(idx);
d_datx = datx(idx);
d_daty = daty(idx);
mae_out = some_function(d_datx, d_daty);
if k == 1
mean_mae(k) = mae_out;
else
mean_mae(k) = (mean_mae(k-1) * (k-1) + mae_out) / k;
end
end
function mae = some_function(datx, daty)
mae = mean(abs(datx - daty));
end
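As you can see, this cuts the number of calls to some_function from 210 to 20, so you should see a speedup of roughly an order of magnitude, growing with the number of years.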
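The running-mean identity used above can be checked with a few lines of Python. This is only a sketch of the update rule, not a translation of the MATLAB code; cumulative_means is a hypothetical name and the per-year values are made up.

```python
def cumulative_means(values):
    # running[k-1] is the mean of the first k values, updated in O(1) per step
    # using mean_k = (mean_{k-1} * (k - 1) + value_k) / k.
    running = []
    for k, v in enumerate(values, start=1):
        if k == 1:
            running.append(v)
        else:
            running.append((running[-1] * (k - 1) + v) / k)
    return running

per_year_mae = [1.0, 3.0, 5.0]          # made-up yearly results
print(cumulative_means(per_year_mae))   # [1.0, 2.0, 3.0]
```

Each entry equals the plain mean of all yearly values up to that point, which is exactly what the inner loop in the question recomputes from scratch.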

How to disable outputting variables inside Octave function

I wrote my own function for Octave, but unfortunately aside of the final result value, the variable "result" is written to console on every change, which is an unwanted behavior.
>> a1 = [160 60]
a1 =
160 60
>> entr = my_entropy({a1}, false)
result = 0.84535
entr = 0.84535
Should be
>> a1 = [160 60]
a1 =
160 60
>> entr = my_entropy({a1}, false)
entr = 0.84535
I don't understand how ~ is meant to be used here, and it didn't work, at least when I tried it.
Code is as follows:
# The main difference between MATLAB bundled entropy function
# and this custom function is that they use a transformation to uint8
# and the bundled entropy() function is used mostly for signal processing
# while I simply use a straightforward solution useful e.g. for learning trees
function f = my_entropy(data, weighted)
# function accepts only cell arrays;
# weighted tells whether to return one weighted average entropy
# or return a vector of entropies per bucket
# moreover, I find vectors as the only representation of "buckets"
# in other words, vector = bucket (leaf of decision tree)
if nargin < 2
weighted = true;
end;
rows = @(x) size(x,1);
cols = @(x) size(x,2);
if weighted
result = 0;
else
result = [];
end;
for r = 1:rows(data)
for c = 1:cols(data) # in most cases this will be 1:1
omega = sum(data{r,c});
epsilon = 0;
for b = 1:cols(data{r,c})
epsilon = epsilon + ( (data{r,c}(b) / omega) * (log2(data{r,c}(b) / omega)) );
end;
if (-epsilon == 0) entropy = 0; else entropy = -epsilon; end;
if weighted
result = result + entropy
else
result = [result entropy]
end;
end;
end;
f = result;
end;
# test cases
cell1 = { [16];[16];[2 2 2 2 2 2 2 2];[12];[16] }
cell2 = { [16],[12];[16],[2];[2 2 2 2 2 2 2 2],[8 8];[12],[8 8];[16],[8 8] }
cell3 = { [16],[3 3];[16],[2];[2 2 2 2 2 2 2 2],[2 2];[12],[2];[16],[2] }
# end
In your code, you should end lines 39 and 41 (the two assignments to result inside the loop) with a semicolon ;.
Lines finishing in semicolon aren't shown in stdout.
Add ; after result = result + entropy and result = [result entropy] in your code, or in general after any assignment that you don't want printed on screen.
If for some reason you can't modify the function, you can use evalc to prevent unwanted output (at least in Matlab). Note that the output in this case is obtained in char form:
T = evalc(expression) is the same as eval(expression) except that anything that would normally be written to the command window, except for error messages, is captured and returned in the character array T (lines in T are separated by \n characters).
As with any eval variant, this approach should be avoided if possible:
entr = evalc('my_entropy({a1}, false)');

For loop for storing value in many different matrix

I have written code that stores data in matrices, but I want to shorten it so that it works in a loop.
The number of matrices to create is known. If it were 3, the code would be:
for i = 1:31
if idx(i) == 1
C1 = [C1; Output2(i,:)];
end
if idx(i) == 2
C2 = [C2; Output2(i,:)];
end
if idx(i) == 3
C3 = [C3; Output2(i,:)];
end
end
If I understand correctly, you want to extract rows from Output2 into new variables based on idx values? If so, you can do as follows:
Output2 = rand(5, 10); % example
idx = [1,1,2,2,3];
% get rows from Output which numbers correspond to those in idx with given value
C1 = Output2(find(idx==1),:);
C2 = Output2(find(idx==2),:);
C3 = Output2(find(idx==3),:);
Similar to Marcin, I have another solution. Here I predefine my_C as a cell array. Output2 and idx are randomly generated, and instead of find I just use logical addressing. You have to convert the data to type cell with {}.
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*2));
my_C = cell(1,3);
my_C(1,1) = {Output2(idx==1,:)};
my_C(1,2) = {Output2(idx==2,:)};
my_C(1,3) = {Output2(idx==3,:)};
If you want to get your data back just use e.g. my_C{1,1} for the first group.
If you have not 3 but n resulting matrices you can use:
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*(n-1)));
my_C = cell(1,n);
for k=1:n
my_C(1,k) = {Output2(idx==k,:)};
end
Where n is a positive integer number
I would recommend a slightly different approach. Besides making the rest of the code more maintainable, it may also slightly speed up execution, because MATLAB uses a JIT compiler and code passed to eval has to be recompiled every time. Try this:
nMatrices = 3;
for k = 1:nMatrices
C{k} = Output2(idx==k,:);
end
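The same "one container indexed by group number" idea, sketched in Python for comparison. This is an illustration under my own naming (group_rows, rows and labels are hypothetical), not MATLAB code: a list of lists plays the role of the cell array.

```python
def group_rows(rows, labels, n_groups):
    # groups[k-1] collects, in their original order, the rows whose label is k
    # (labels are 1-based, like idx in the question).
    groups = [[] for _ in range(n_groups)]
    for row, label in zip(rows, labels):
        groups[label - 1].append(row)
    return groups

rows = [[10, 11], [20, 21], [30, 31], [40, 41], [50, 51]]
labels = [1, 1, 2, 2, 3]
groups = group_rows(rows, labels, 3)
print(groups[0])  # [[10, 11], [20, 21]]
print(groups[2])  # [[50, 51]]
```

Because the group number is just an index, changing the number of groups means changing one argument rather than adding more variables.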
As patrik said in the comments, naming variables like this is poor practice. You would be better off using cell arrays M{1}=C1, or if all the Ci are the same size, even just a 3D array M, for example, where M(:,:,1)=C1.
If you really want to use C1, C2, ... as your variable names, I think you will have to use eval, as arielnmz mentioned. One way to do this in MATLAB (assuming C1, C2, ... have already been initialised, e.g. to []) is
for i=1:length(idx)
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
Edited to add test code:
idx=[2 1 3 2 2 3];
Output2=rand(6,4);
C1a=[];
C2a=[];
C3a=[];
for i = 1:length(idx)
if idx(i) == 1
C1a = [C1a; Output2(i,:)];
end
if idx(i) == 2
C2a = [C2a; Output2(i,:)];
end
if idx(i) == 3
C3a = [C3a; Output2(i,:)];
end
end
C1=[];
C2=[];
C3=[];
for i=1:length(idx)
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
all(C1a(:)==C1(:))
all(C2a(:)==C2(:))
all(C3a(:)==C3(:))

MATLAB: subtracting each element in a large vector from each element in another large vector in the fastest way possible

Here is the code I have; it's not simple subtraction. We want to subtract each value in one vector from each value in the other vector, keeping only the differences within the bounds tmin and tmax. time_a and time_b are very long vectors of times (in ps). binsize is just for grouping times in a similar range for plotting. The most naive approach would be to loop through each element and subtract it from each element in the other vector, but this would take forever: we are talking about vectors of hundreds of megabytes up to gigabytes.
function [c, dt, dtEdges] = coincidence4(time_a,time_b,tmin,tmax,binsize)
% round tmin, tmax to an integer multiple of binsize:
if mod(tmin,binsize)~=0
tmin=tmin-mod(tmin,binsize)+binsize;
end
if mod(tmax,binsize)~=0
tmax=tmax-mod(tmax,binsize);
end
dt = tmin:binsize:tmax;
dtEdges = [dt(1)-binsize/2,dt+binsize/2];
% dtEdges = linspace((tmin-binsize/2),(tmax+binsize/2),length(dt));
c = zeros(1,length(dt));
Na = length(time_a);
Nb = length(time_b);
tic1=tic;
% tic2=tic1;
% bbMax=Nb;
bbMin=1;
for aa = 1:Na
ta = time_a(aa);
bb = bbMin;
% tic
while (bb<=Nb)
tb = time_b(bb);
d = tb - ta;
if d < tmin
bbMin = bb;
bb = bb+1;
elseif d > tmax
bb = Nb+1;
else
% tic
% [dum, dum2] = histc(d,dtEdges);
index = floor((d-dtEdges(1))/(dtEdges(end)-dtEdges(1))*(length(dtEdges)-1)+1);
% toc
% dt(dum2)
c(index)=c(index)+1;
bb = bb+1;
end
end
% if mod(aa, 200) == 0
% toc(tic2)
% tic2=tic;
% end
end
% c=c(1:end-1);
toc(tic1)
end
Well, not a final answer, but a few clues to simplify and accelerate your code:
First, use cached values. For example, in your line:
index = floor((d-dtEdges(1))/(dtEdges(end)-dtEdges(1))*(length(dtEdges)-1)+1);
your loop repeats the same computations on every iteration. You can calculate the constant factor before starting the loop, cache it, then reuse the stored result (note that / and * are evaluated left to right, so the factor is (length(dtEdges)-1) divided by the span of the edges):
cached_dt_constant = (length(dtEdges)-1) / (dtEdges(end)-dtEdges(1)) ;
Then in your loop simply use:
index = floor( (d-dtEdges(1)) * cached_dt_constant +1 ) ;
With so many loop iterations, you'll save valuable time this way.
Second, I am not entirely sure what the computations are trying to achieve, but you can save time again by using the indexing power of MATLAB. By replacing the lower part of your code like this, I get an execution time 2 to 3 times faster (and the same results, obviously).
Na = length(time_a);
Nb = length(time_b);
tic1=tic;
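% Bins per unit time; equals 1/binsize because the edges are evenly spaced
cached_dt_constant = (length(dtEdges)-1) / (dtEdges(end)-dtEdges(1)) ;
for aa = 1:Na
ta = time_a(aa);
d = time_b - ta ;
iok = (d>=tmin) & (d<=tmax) ;
index = floor( (d(iok)-dtEdges(1)) .* cached_dt_constant +1 ) ;
% accumarray counts repeated bin indices; c(index) = c(index)+1 would add only 1 per bin
c = c + accumarray(index(:), 1, [length(c) 1]).' ;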
end
toc(tic1)
end
Now there is only one loop to go through: the inner loop has been removed and replaced by a vectorized calculation. With a bit more head-scratching there might be a way to drop the outer loop too and use only vectorized computations, although this would require enough memory to handle quite big arrays in one go.
If the precision of each value is not critical (I see you round and floor values often), try converting your initial vectors to 'single' type instead of the default MATLAB 'double'. That would almost double the size of array your memory will be able to handle in one go. Note, however, that single holds only about 7 significant digits, so check that your picosecond timestamps survive the conversion.
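As a cross-check on the binning logic, here is a small Python sketch of the coincidence count. It is not the vectorized MATLAB approach above: it swaps in a binary search over time_b, which keeps the benefit of the sorted input that the original while loop relied on. It assumes time_b is sorted ascending, that tmin and tmax are already multiples of binsize, and that times are integers (e.g. whole picoseconds); coincidence_counts is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

def coincidence_counts(time_a, time_b, tmin, tmax, binsize):
    # Bins are centred on tmin, tmin + binsize, ..., tmax.
    n_bins = int(round((tmax - tmin) / binsize)) + 1
    counts = [0] * n_bins
    first_edge = tmin - binsize / 2
    for ta in time_a:
        lo = bisect_left(time_b, ta + tmin)    # first tb with tb - ta >= tmin
        hi = bisect_right(time_b, ta + tmax)   # one past the last tb with tb - ta <= tmax
        for tb in time_b[lo:hi]:
            index = int((tb - ta - first_edge) // binsize)
            counts[index] += 1                 # every hit is counted, even in a repeated bin
    return counts

print(coincidence_counts([0, 10], [1, 2, 11, 12], 0, 2, 1))  # [0, 2, 2]
```

Only the elements of time_b inside the window are ever touched, so the work per element of time_a depends on the window size rather than on the full length of time_b.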