Matlab: Recursion to get decision tree - matlab

I am trying to implement decision tree with recursion: So far I have written the following:
From a give data set, find the best split and return the branches, to give more details lets say I have data with features as columns of matrix and last column indicate the class of the data 1, -1.
Based on 1. I have a best feature to split along with the branches under that split, lets say based on Information gain I get feature 9 is the best split and unique values in feature 9 {1,3,5} are the branches of 9
I have figured how to get the data related to ach branch, then I need to iterate over each branch's data to get the next set of split. I am having trouble figuring this recursion.
Here is the code that I have so far, the recursion that I am doing right now doesn't look right: How can I fix this?
function [indeces_of_node, best_split] = split_node(X_train, Y_train)
%cell to save split information
feature_to_split_cell = cell(size(X_train,2)-1,4);
%iterate over features
for feature_idx=1:(size(X_train,2) - 1)
%get current feature
curr_X_feature = X_train(:,feature_idx);
%identify the unique values
unique_values_in_feature = unique(curr_X_feature);
H = get_entropy(Y_train); %This is actually H(X) in slides
%temp entropy holder
%Storage for feature element's class
element_class = zeros(size(unique_values_in_feature,1),2);
%conditional probability H(X|y)
H_cond = zeros(size(unique_values_in_feature,1),1);
for aUnique=1:size(unique_values_in_feature,1)
match = curr_X_feature(:,1)==unique_values_in_feature(aUnique);
mat = Y_train(match);
majority_class = mode(mat);
element_class(aUnique,1) = unique_values_in_feature(aUnique);
element_class(aUnique,2) = majority_class;
H_cond(aUnique,1) = (length(mat)/size((curr_X_feature),1)) * get_entropy(mat);
end
%Getting the information gain
IG = H - sum(H_cond);
%Storing the IG of features
feature_to_split_cell{feature_idx, 1} = feature_idx;
feature_to_split_cell{feature_idx, 2} = max(IG);
feature_to_split_cell{feature_idx, 3} = unique_values_in_feature;
feature_to_split_cell{feature_idx, 4} = element_class;
end
%set feature to split zero for every fold
feature_to_split = 0;
%getting the max IG of the fold
max_IG_of_fold = max([feature_to_split_cell{:,2:2}]);
%vector to store values in the best feature
values_of_best_feature = zeros(size(15,1));
%Iterating over cell to get get the index and the values under best
%splited feature.
for i=1:length(feature_to_split_cell)
if (max_IG_of_fold == feature_to_split_cell{i,2});
feature_to_split = i;
values_of_best_feature = feature_to_split_cell{i,4};
end
end
display(feature_to_split)
display(values_of_best_feature(:,1)')
curr_X_feature = X_train(:,feature_to_split);
best_split = feature_to_split
indeces_of_node = unique(curr_X_feature)
%testing
for k = 1 : length(values_of_best_feature)
% Condition to stop the recursion, if clases are pure then we are
% done splitting, if both classes have save number of attributes
% then we are done splitting.
if (sum(values_of_best_feature(:,2) == -1) ~= sum(values_of_best_feature(:,2) == 1))
if((sum(values_of_best_feature(:,2) == -1) ~= 0) || (sum(values_of_best_feature(:,2) == 1) ~= 0))
mat1 = X_train(X_train(:,5)== values_of_best_feature(k),:);
[indeces_of_node, best_split] = split_node(mat1, Y_train);
end
end
end
end
Here is the out of my code: and looks like some in my recursion I am only going depth of one branch and after that I never go back to rest of the branches
feature_to_split =
5
ans =
1 2 3 4 5 6 7 8 9
feature_to_split =
9
ans =
3 5 7 8 11
feature_to_split =
21
feature_to_split =
21
feature_to_split =
21
feature_to_split =
21
if you are interest in running this code: git

After multiple rounds of debug, I figured the answers, I hope someone will benefit from this:
for k = 1 : length(values_of_best_feature)
% Condition to stop the recursion, if clases are pure then we are
% done splitting, if both classes have save number of attributes
% then we are done splitting.
if((sum(values_of_best_feature(:,2) == -1) ~= 0) || (sum(values_of_best_feature(:,2) == 1) ~= 0))
X_train(:,feature_to_split) = [];
mat1 = X_train(X_train(:,5)== values_of_best_feature(k),:);
%if(level >= curr_level)
split_node(mat1, Y_train, 1, 2, level-1);
%end
end
end
return;

Related

improve performance of a double for loop in matlab

I'm doing some analysis where I'm analysing hundreds of data files, which are being analysed iteratively. Here is an examples of the sort of data that I have:
start_time = datenum('1990-01-01');
end_time = datenum('2009-12-31');
time = start_time:end_time;
datx = rand(length(time),1);
daty = datx-2;
where I have a time variable and two data variables.
After loading the data I then need to pass the data through a function. However, I need to do this by including firstly the data from year 1 only, then from years 1 to 2; 1 to 3, 1 to 4 and so on until I pass the data through the function for the entire series. This can be performed with a loop with the following:
% split into different years
datev = datevec(time);
iyear = datev(:,1);
unique_year = unique(iyear);
for k = 1:length(unique_year);
idx = find(iyear >= unique_year(1) & iyear <= unique_year(k));
% select data for year
d_time = time(idx);
d_datx = datx(idx);
d_daty = daty(idx);
% now select individual years from this subset
datev2 = datevec(d_time);
iyear2 = datev2(:,1);
unique_year2 = unique(iyear2);
for k2 = 1:length(unique_year2);
idx2 = find(iyear2 == unique_year2(k2));
% select data for year
d_time2 = d_time(idx2);
d_datx2 = d_datx(idx2);
d_daty2 = d_daty(idx2);
% pass through some function
mae_out = some_function(d_datx2, d_daty2);
mae(k2) = mae_out;
end
mean_mae(k) = mean(mae);
end
function mae = some_function(datx, daty)
mae = mean(abs(datx - daty));
end
Note here that I'm using a very simple function as an example, and the actual function is more complex.
Having two loops like this takes a long time to run on my actual data. Is there a better/faster way that I can perform the above, possibly without loops?
If you record the previous result, you do not need the inner loop. You are currently computing a total of (20+21)/2 = 210 iterations, but you only need to compute 20. The key here is that mean(a(1:k)) == (mean(a(1:k-1))*(k-1) + a(k)) / k (by the definition of mean). Another optimization is to use logical indexing instead of find. It takes up a bit more space, but is much faster.
% split into different years
datev = datevec(time);
iyear = datev(:,1);
unique_year = unique(iyear);
for k = 1:length(unique_year);
idx = (iyear == unique_year(k));
% select data for year
d_time = time(idx);
d_datx = datx(idx);
d_daty = daty(idx);
mae_out = some_function(d_datx, d_daty);
if k == 1
mean_mae(k) = mean_out;
else
mean_mae(k) = (mean_mae(k-1) * (k-1) + mean(mean_out)) / k;
end
end
function mae = some_function(datx, daty)
mae = mean(abs(datx - daty));
end
As you can see, this should give you approximately 20x or more speedup.

How to disable outputting variables inside Octave function

I wrote my own function for Octave, but unfortunately aside of the final result value, the variable "result" is written to console on every change, which is an unwanted behavior.
>> a1 = [160 60]
a1 =
160 60
>> entr = my_entropy({a1}, false)
result = 0.84535
entr = 0.84535
Should be
>> a1 = [160 60]
a1 =
160 60
>> entr = my_entropy({a1}, false)
entr = 0.84535
I don't get the idea of ~ and it don't work, at least when I tried.
Code is as follows:
# The main difference between MATLAB bundled entropy function
# and this custom function is that they use a transformation to uint8
# and the bundled entropy() function is used mostly for signal processing
# while I simply use a straightforward solution usefull e.g. for learning trees
function f = my_entropy(data, weighted)
# function accepts only cell arrays;
# weighted tells whether return one weighed average entropy
# or return a vector of entropies per bucket
# moreover, I find vectors as the only representation of "buckets"
# in other words, vector = bucket (leaf of decision tree)
if nargin < 2
weighted = true;
end;
rows = #(x) size(x,1);
cols = #(x) size(x,2);
if weighted
result = 0;
else
result = [];
end;
for r = 1:rows(data)
for c = 1:cols(data) # in most cases this will be 1:1
omega = sum(data{r,c});
epsilon = 0;
for b = 1:cols(data{r,c})
epsilon = epsilon + ( (data{r,c}(b) / omega) * (log2(data{r,c}(b) / omega)) );
end;
if (-epsilon == 0) entropy = 0; else entropy = -epsilon; end;
if weighted
result = result + entropy
else
result = [result entropy]
end;
end;
end;
f = result;
end;
# test cases
cell1 = { [16];[16];[2 2 2 2 2 2 2 2];[12];[16] }
cell2 = { [16],[12];[16],[2];[2 2 2 2 2 2 2 2],[8 8];[12],[8 8];[16],[8 8] }
cell3 = { [16],[3 3];[16],[2];[2 2 2 2 2 2 2 2],[2 2];[12],[2];[16],[2] }
# end
In your code, you should end lines 39 and 41 with semicolon ;.
Lines finishing in semicolon aren't shown in stdout.
Add ; after result = result + entropy and result = [result entropy] in your code, or in general after any assignment that you don't want printed on screen.
If for some reason you can't modify the function, you can use evalc to prevent unwanted output (at least in Matlab). Note that the output in this case is obtained in char form:
T = evalc(expression) is the same as eval(expression) except that anything that would normally be written to the command window, except for error messages, is captured and returned in the character array T (lines in T are separated by \n characters).
As with any eval variant, this approach should be avoided if possible:
entr = evalc('my_entropy({a1}, false)');

For loop for storing value in many different matrix

I have written a code that stores data in a matrix, but I want to shorten it so it iterates over itself.
The number of matrices created is the known variable. If it was 3, the code would be:
for i = 1:31
if idx(i) == 1
C1 = [C1; Output2(i,:)];
end
if idx(i) == 2
C2 = [C2; Output2(i,:)];
end
if idx(i) == 3
C3 = [C3; Output2(i,:)];
end
end
If I understand correctly, you want to extract rows from Output2 into new variables based on idx values? If so, you can do as follows:
Output2 = rand(5, 10); % example
idx = [1,1,2,2,3];
% get rows from Output which numbers correspond to those in idx with given value
C1 = Output2(find(idx==1),:);
C2 = Output2(find(idx==2),:);
C3 = Output2(find(idx==3),:);
Similar to Marcin i have another solution. Here i predefine my_C as a cell array. Output2 and idx are random generated and instead of find i just use logical adressing. You have to convert the data to type cell {}
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*2));
my_C = cell(1,3);
my_C(1,1) = {Output2(idx==1,:)};
my_C(1,2) = {Output2(idx==2,:)};
my_C(1,3) = {Output2(idx==3,:)};
If you want to get your data back just use e.g. my_C{1,1} for the first group.
If you have not 3 but n resulting matrices you can use:
Output2 = round(rand(31,15)*10);
idx = uint8(round(1+rand(1,31)*(n-1)));
my_C = cell(1,n);
for k=1:n
my_C(1,k) = {Output2(idx==k,:)};
end
Where n is a positive integer number
I would recommend a slighty different approach. Except for making the rest of the code more maintainable it may also slightly speed up the execution. This due to that matlab uses a JIT compiler and eval must be recompiled every time. Try this:
nMatrices = 3
for k = 1:nMatrices
C{k} = Output2(idx==k,:);
end
As patrik said in the comments, naming variables like this is poor practice. You would be better off using cell arrays M{1}=C1, or if all the Ci are the same size, even just a 3D array M, for example, where M(:,:,1)=C1.
If you really want to use C1, C2, ... as you variable names, I think you will have to use eval, as arielnmz mentioned. One way to do this in matlab is
for i=1:3
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
Edited to add test code:
idx=[2 1 3 2 2 3];
Output2=rand(6,4);
C1a=[];
C2a=[];
C3a=[];
for i = 1:length(idx)
if idx(i) == 1
C1a = [C1a; Output2(i,:)];
end
if idx(i) == 2
C2a = [C2a; Output2(i,:)];
end
if idx(i) == 3
C3a = [C3a; Output2(i,:)];
end
end
C1=[];
C2=[];
C3=[];
for i=1:length(idx)
eval(['C' num2str(idx(i)) '=[C' num2str(idx(i)) ';Output2(' num2str(i) ',:)];'])
end
all(C1a(:)==C1(:))
all(C2a(:)==C2(:))
all(C3a(:)==C3(:))

find first and last value for unique julian date

i have a data set similar to the following:
bthd = sort(floor(1+(10-1).*rand(10,1)));
bthd2 = sort(floor(1+(10-1).*rand(10,1)));
bthd3 = sort(floor(1+(10-1).*rand(10,1)));
Depth = [bthd;bthd2;bthd3];
Jday = [repmat(733774,10,1);repmat(733775,10,1);repmat(733776,10,1)];
temp = 10+(30-10).*rand(30,1);
Data = [Jday,Depth,temp];
where I have a matrix similar to 'Data' with Julian Date in the first column, depth in the second, and then temperature in the third column. I would like to find what are the first and last values are for each unique Jday. This can be obtained by:
Data = [Jday,Depth,temp];
[~,~,b] = unique(Data(:,1),'rows');
for j = 1:length(unique(b));
top_temp(j) = temp(find(b == j,1,'first'));
bottom_temp(j) = temp(find(b == j,1,'last'));
end
However, my data set is extremely large and using this loop results in long running time. Can anyone suggest a vectorized solution to do this?
use diff:
% for example
Jday = [1 1 1 2 2 3 3 3 5 5 6 7 7 7];
last = find( [diff(Jday) 1] );
first = [1 last(1:end-1)+1];
top_temp = temp(first) ;
bottom_temp = temp(last);
Note that this solution assumes Jday is sorted. If this is not the case, you may sort Jday prior to the suggested procedure.
You should be able to accomplish this using the occurrence option of the unique function:
[~, topidx, ~] = unique(Data(:, 1), 'first', 'legacy');
[~, bottomidx, ~] = unique(Data(:, 1), 'last', 'legacy');
top_temp = temp(topidx);
bottom_temp = temp(bottomidx);
The legacy option is needed if you're using MATLAB R2013a. You should be able to remove it if you're running R2012b or earlier.

How to improve execution time of the following Matlab code

Please help me to improve the following Matlab code to improve execution time.
Actually I want to make a random matrix (size [8,12,10]), and on every row, only have integer values between 1 and 12. I want the random matrix to have the sum of elements which has value (1,2,3,4) per column to equal 2.
The following code will make things more clear, but it is very slow.
Can anyone give me a suggestion??
clc
clear all
jum_kel=8
jum_bag=12
uk_pop=10
for ii=1:uk_pop;
for a=1:jum_kel
krom(a,:,ii)=randperm(jum_bag); %batasan tidak boleh satu kelompok melakukan lebih dari satu aktivitas dalam satu waktu
end
end
for ii=1:uk_pop;
gab1(:,:,ii) = sum(krom(:,:,ii)==1)
gab2(:,:,ii) = sum(krom(:,:,ii)==2)
gab3(:,:,ii) = sum(krom(:,:,ii)==3)
gab4(:,:,ii) = sum(krom(:,:,ii)==4)
end
for jj=1:uk_pop;
gabh1(:,:,jj)=numel(find(gab1(:,:,jj)~=2& gab1(:,:,jj)~=0))
gabh2(:,:,jj)=numel(find(gab2(:,:,jj)~=2& gab2(:,:,jj)~=0))
gabh3(:,:,jj)=numel(find(gab3(:,:,jj)~=2& gab3(:,:,jj)~=0))
gabh4(:,:,jj)=numel(find(gab4(:,:,jj)~=2& gab4(:,:,jj)~=0))
end
for ii=1:uk_pop;
tot(:,:,ii)=gabh1(:,:,ii)+gabh2(:,:,ii)+gabh3(:,:,ii)+gabh4(:,:,ii)
end
for ii=1:uk_pop;
while tot(:,:,ii)~=0;
for a=1:jum_kel
krom(a,:,ii)=randperm(jum_bag); %batasan tidak boleh satu kelompok melakukan lebih dari satu aktivitas dalam satu waktu
end
gabb1 = sum(krom(:,:,ii)==1)
gabb2 = sum(krom(:,:,ii)==2)
gabb3 = sum(krom(:,:,ii)==3)
gabb4 = sum(krom(:,:,ii)==4)
gabbh1=numel(find(gabb1~=2& gabb1~=0));
gabbh2=numel(find(gabb2~=2& gabb2~=0));
gabbh3=numel(find(gabb3~=2& gabb3~=0));
gabbh4=numel(find(gabb4~=2& gabb4~=0));
tot(:,:,ii)=gabbh1+gabbh2+gabbh3+gabbh4;
end
end
Some general suggestions:
Name variables in English. Give a short explanation if it is not immediately clear,
what they are indented for. What is jum_bag for example? For me uk_pop is music style.
Write comments in English, even if you develop source code only for yourself.
If you ever have to share your code with a foreigner, you will spend a lot of time
explaining or re-translating. I would like to know for example, what
%batasan tidak boleh means. Probably, you describe here that this is only a quick
hack but that someone should really check this again, before going into production.
Specific to your code:
Its really easy to confuse gab1 with gabh1 or gabb1.
For me, krom is too similar to the built-in function kron. In fact, I first
thought that you are computing lots of tensor products.
gab1 .. gab4 are probably best combined into an array or into a cell, e.g. you
could use
gab = cell(1, 4);
for ii = ...
gab{1}(:,:,ii) = sum(krom(:,:,ii)==1);
gab{2}(:,:,ii) = sum(krom(:,:,ii)==2);
gab{3}(:,:,ii) = sum(krom(:,:,ii)==3);
gab{4}(:,:,ii) = sum(krom(:,:,ii)==4);
end
The advantage is that you can re-write the comparsisons with another loop.
It also helps when computing gabh1, gabb1 and tot later on.
If you further introduce a variable like highestNumberToCompare, you only have to
make one change, when you certainly find out that its important to check, if the
elements are equal to 5 and 6, too.
Add a semicolon at the end of every command. Having too much output is annoying and
also slow.
The numel(find(gabb1 ~= 2 & gabb1 ~= 0)) is better expressed as
sum(gabb1(:) ~= 2 & gabb1(:) ~= 0). A find is not needed because you do not care
about the indices but only about the number of indices, which is equal to the number
of true's.
And of course: This code
for ii=1:uk_pop
gab1(:,:,ii) = sum(krom(:,:,ii)==1)
end
is really, really slow. In every iteration, you increase the size of the gab1
array, which means that you have to i) allocate more memory, ii) copy the old matrix
and iii) write the new row. This is much faster, if you set the size of the
gab1 array in front of the loop:
gab1 = zeros(... final size ...);
for ii=1:uk_pop
gab1(:,:,ii) = sum(krom(:,:,ii)==1)
end
Probably, you should also re-think the size and shape of gab1. I don't think, you
need a 3D array here, because sum() already reduces one dimension (if krom is
3D the output of sum() is at most 2D).
Probably, you can skip the loop at all and use a simple sum(krom==1, 3) instead.
However, in every case you should be really aware of the size and shape of your
results.
Edit inspired by Rody Oldenhuis:
As Rody pointed out, the 'problem' with your code is that its highly unlikely (though
not impossible) that you create a matrix which fulfills your constraints by assigning
the numbers randomly. The code below creates a matrix temp with the following characteristics:
The numbers 1 .. maxNumber appear either twice per column or not at all.
All rows are a random permutation of the numbers 1 .. B, where B is equal to
the length of a row (i.e. the number of columns).
Finally, the temp matrix is used to fill a 3D array called result. I hope, you can adapt it to your needs.
clear all;
A = 8; B = 12; C = 10;
% The numbers [1 .. maxNumber] have to appear exactly twice in a
% column or not at all.
maxNumber = 4;
result = zeros(A, B, C);
for ii = 1 : C
temp = zeros(A, B);
for number = 1 : maxNumber
forbiddenRows = zeros(1, A);
forbiddenColumns = zeros(1, A/2);
for count = 1 : A/2
illegalIndices = true;
while illegalIndices
illegalIndices = false;
% Draw a column which has not been used for this number.
randomColumn = randi(B);
while any(ismember(forbiddenColumns, randomColumn))
randomColumn = randi(B);
end
% Draw two rows which have not been used for this number.
randomRows = randi(A, 1, 2);
while randomRows(1) == randomRows(2) ...
|| any(ismember(forbiddenRows, randomRows))
randomRows = randi(A, 1, 2);
end
% Make sure not to overwrite previous non-zeros.
if any(temp(randomRows, randomColumn))
illegalIndices = true;
continue;
end
end
% Mark the rows and column as forbidden for this number.
forbiddenColumns(count) = randomColumn;
forbiddenRows((count - 1) * 2 + (1:2)) = randomRows;
temp(randomRows, randomColumn) = number;
end
end
% Now every row contains the numbers [1 .. maxNumber] by
% construction. Fill the zeros with a permutation of the
% interval [maxNumber + 1 .. B].
for count = 1 : A
mask = temp(count, :) == 0;
temp(count, mask) = maxNumber + randperm(B - maxNumber);
end
% Store this page.
result(:,:,ii) = temp;
end
OK, the code below will improve the timing significantly. It's not perfect yet, it can all be optimized a lot further.
But, before I do so: I think what you want is fundamentally impossible.
So you want
all rows contain the numbers 1 through 12, in a random permutation
any value between 1 and 4 must be present either twice or not at all in any column
I have a hunch this is impossible (that's why your code never completes), but let me think about this a bit more.
Anyway, my 5-minute-and-obvious-improvements-only-version:
clc
clear all
jum_kel = 8;
jum_bag = 12;
uk_pop = 10;
A = jum_kel; % renamed to make language independent
B = jum_bag; % and a lot shorter for readability
C = uk_pop;
krom = zeros(A, B, C);
for ii = 1:C;
for a = 1:A
krom(a,:,ii) = randperm(B);
end
end
gab1 = sum(krom == 1);
gab2 = sum(krom == 2);
gab3 = sum(krom == 3);
gab4 = sum(krom == 4);
gabh1 = sum( gab1 ~= 2 & gab1 ~= 0 );
gabh2 = sum( gab2 ~= 2 & gab2 ~= 0 );
gabh3 = sum( gab3 ~= 2 & gab3 ~= 0 );
gabh4 = sum( gab4 ~= 2 & gab4 ~= 0 );
tot = gabh1+gabh2+gabh3+gabh4;
for ii = 1:C
ii
while tot(:,:,ii) ~= 0
for a = 1:A
krom(a,:,ii) = randperm(B);
end
gabb1 = sum(krom(:,:,ii) == 1);
gabb2 = sum(krom(:,:,ii) == 2);
gabb3 = sum(krom(:,:,ii) == 3);
gabb4 = sum(krom(:,:,ii) == 4);
gabbh1 = sum(gabb1 ~= 2 & gabb1 ~= 0)
gabbh2 = sum(gabb2 ~= 2 & gabb2 ~= 0);
gabbh3 = sum(gabb3 ~= 2 & gabb3 ~= 0);
gabbh4 = sum(gabb4 ~= 2 & gabb4 ~= 0);
tot(:,:,ii) = gabbh1+gabbh2+gabbh3+gabbh4;
end
end