CART algorithm of matlab 'fitctree' takes account on the attributes order why ? - matlab

here is an example mentionning that fitctree of matlab takes into account the features order ! why ?
load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y)
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y)
view(Mdl1,'mode','graph');
Not the same model, thus not the same classification accuracy despite dealing with the same features ?

In your example, Xcontains 34 predictors. The predictors contain no names and fitctreejust refers to them by their column numbers x1, x2, ..., x34. If you flip the table, the column number changes and therefore their name. So x1 -> x34. x2 -> x33, etc..
In for most nodes this does not matter because CART always divides a node by the predictor that maximises the impurity gain between the two child nodes. But sometimes there are multiple predictors which result in the same impurity gain. Then it just picks the one with the lowest column number. And since the column number changed by reordering the predictors, you end up with a different predictor at that node.
E.g. let's look at the marked split:
Original order (mdl):
Flipped order (mdl1):
Up to this point always the same predictor and values have been chosen. Names changed due to order, e.g. x5 in the old data = x30 in the new model. But x3 and x6 are actually different predictors. x6 in the flipped order is x29 in the original order.
A scatter plot between those predictors shows how this could happen:
Where blue and cyan lines mark the splits performed by mdl and mdl1 respectively at that node. As we can see, both splits yield child nodes with the same number of elements per label! Therefore CART can chose any of the two predictors, it will cause the same impurity gain.
In that case it seems to just pick the one with the lower column number. In the non-flipped table x3 is chosen instead of x29 because 3 < 29. But if you flip the tables, x3 becomes x32 and x29 becomes x6. Since 6 < 32 you now end up with x6, the original x29.
Ultimately this does not matter - the decision tree of the flipped table is not better or worse. It only happens in the lower nodes where the tree starts to overfit. So you really don't have to care about it.
Appendix:
Code for scatter plot generation:
load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y);
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y);
view(Mdl1,'mode','graph');
idx = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5);
remainder = X(idx,:);
labels = cell2mat(Y(idx,:));
gscatter(remainder(:,3), remainder(:,(35-6)), labels,'rgb','osd');
limits = [-1.5 1.5];
xlim(limits)
ylim(limits)
xlabel('predictor 3')
ylabel('predictor 29')
hold on
plot([0.73 0.73], limits, '-b')
plot(limits, [0.693 0.693], '-c')
legend({'b' 'g'})

Related

Error in vector lengths in matlab but i checked them with length() and theyre equal

l1=length(A)
l2=length(L)
%Residual
V=A*X-L
S= (V'*P*V)/(6-2);
%adjusted values
Z=V+L
%plot of observed valued of y
plot(X,L)
%plot of adjusted value of Y
plot(X,Z);
%Covariance Matrices for all Given Quantities can be Obtained by:
Cov_X= S*inv(N)
Cov_La=S*(A*N*A')
Cov_V= S*(inv(P)-A*inv(N)*A')
the output is
HA01
l1 =
6
l2 =
6
V =
-16.7888
-31.4848
110.0764
-3.8431
-51.1036
-6.8562
Z =
1.0e+03 *
0.9685
1.0886
1.2161
1.2905
1.3390
1.4506
Error using plot
Vectors must be the same length.
Error in HA01 (line 24)
plot(X,L)
i expected i accidently created different sized vectors etc but length are same yet it refuses to plot. please help i need to submit this in couple of hours.
length will tell you how many elements its has, but not its shape.
If you do size(X) and size(L) you will likely see that one is, say, 1x6 and the other one 6x1. Transpose one of them so they are the same size, not the same length.

Convolution using 'valid' in Matlab's conv() function

Here is an example of convolution given:
I have two questions here:
Why is the vector 𝑥 padded with two 0s on each side? As, the length of kernel ℎ is 3. If 𝑥 is padded with one 0 on each side, the middle element of convolution output would be within the range of the length of 𝑥, why not one 0 on each side?
Explain the following output to me:
>> x = [1, 2, 1, 3];
>> h = [2, 0, 1];
>> y = conv(x, h, 'valid')
y =
3 8
>>
What is valid doing here in the context of the previously shown mathematics on vectors 𝑥 and ℎ?
I can't speak as to the amount of zero padding that is proper .... That being said, any zero padding is making up data that is not there. This isn't necessarily wrong, but you should be aware that the values computing this information may be biased. Sometimes you care about this, sometimes you don't. Introducing 1 zero (in this case) would leave the middle kernel value always in the data, but why should that be a stopping criteria? Importantly, adding on 2 zeros still leaves one multiplication of values that are actually present in the data and the kernel (the x[0]*h[0] and x[3]*h[2] - using 0 based indexing). Adding on a 3rd zero (or more) would just yield zeros in the output since 3 is the length of the kernel. In other words zero padding will always yield an output that is partially based on the actual data (but not completely) for any zero padding from n=1 to n = length(h)-1 (in this case either 1 or 2).
Even though zero padding with length 2 or 1 still has multiplications based on real data, some values are summed over "fake" data (those multiplied with a padded zero). In this case Matlab gives you 3 options for how you want the data returned. First, you can get the full convolution, which includes values that are biased because they include adding in 0 values that aren't really in the data. Alternatively you can get same, which means the length of the output is the length of the data y = [4 3 8 1]. This corresponds to 1 zero but note that for longer kernels you could technically get other lengths between full and same, Matlab just doesn't return those for you.
Finally, and probably most important to understand out of all this, you have the valid option. In your example only 2 samples of the output are computed from summations that occur only from multiplications over real data (i.e. from multiplying samples of the kernel with samples from x and not from zeros). More specifically:
y[2] = h[2]*x[0] + h[1]*x[1] + h[2]*x[2] = 3 //0 based indexing like example
y[3] = h[2]*x[1] + h[1]*x[2] + h[2]*x[3] = 8
Note none of the other y values are computed with only h and x, they all involve a padded zero which is not necessarily indicative of the real data. For example:
y[4] = h[2]*x[2] + h[1]*x[3] + h[2]*0 <= padded zero

Plot symbols depending on vector values

I have a dataset of points represented by a 2D vector (X).
Each point belongs to a categorical data (Y) represented by an integer value(from 1 to 4).
I want to plot each point with a different symbol depending on its class.
Toy example:
X = randi(100,10,2); % 10 points ranging 1:100 in 2D space
Y = randi(4,10,1); % class of the points (1 to 4)
I create a vector of symbols for each class:
S = {'bx' 'rx' 'b.' 'r.'};
Then I try:
plot(X(:,1), X(:,2), S(Y))
Error using plot
Invalid first data argument
How can I assign to each point of X a different symbol based on the value of Y?
Of curse I can use a loop for each class and plot the different classes one by one. But is there a method to directly plot each class with a different symbol?
No need for a loop, use gscatter:
X = randi(100,10,2); % 10 points ranging 1:100 in 2D space
Y = randi(4,10,1); % class of the points (1 to 4)
color = 'brbr';
symbol = 'xx..';
gscatter(X(:,1),X(:,2),Y,color,symbol)
and you will get:
If X has many rows, but there are only a few S types, then I suggest you check out the second approach first. It's optimized for speed instead of readability. It's about twice as fast if the vector has 10 elements, and more than 200 times as fast if the vector has 1000 elements.
First approach (easy to read):
Regardless of approach, I think you need a loop for this:
hold on
arrayfun(#(n) plot(X(n,1), X(n,2), S{Y(n)}), 1:size(X,1))
Or, to write the loop in the "conventional way":
hold on
for n = 1:size(X,1)
plot(X(n,1), X(n,2), S{Y(n)})
end
Second approach (gives same plot as above):
If your dataset is large, you can sort [Y_sorted, sort_idx] = sort(Y), then use sort_idx to index X, like this: X_sorted = X(sort_idx);. After this, you split X_sorted into 4 groups, one for each of the individual Y-values, using histc and mat2cell. Then you loop over the four groups and plot each one individually.
This way you only need to loop through four values, regardless of the number of elements in your data. This should be a lot faster if the number of elements is high.
[Y_sorted, Y_index] = sort(Y);
X_sorted = X(Y_index, :);
X_cell = mat2cell(X_sorted, histc(Y,1:numel(S)));
hold on
for ii = 1:numel(X_cell)
plot(X_cell{ii}(:,1),X_cell{ii}(:,2),S{ii})
end
Benchmarking:
I did a very simple benchmarking of the two approaches using timeit. The result shows that the second approach is a lot faster:
For 10 elements:
First approach: 0.0086
Second approach: 0.0037
For 1000 elements:
First approach = 0.8409
Second approach = 0.0039

calculate conservative interpolation of two vectors in matlab

G'day
Firstly, apologies for poor wording - I'm at a bit of a loss of how to describe this problem. I'm trying to calculate the conservative interpolation between two different vertical coordinate systems.
I have a vector of ocean transport values Ts, that describe the amount of transport at different depth values S. These depths are unevenly spaced (and size(S) is equal to size(Ts)+1 as the values in S are the depths at the top and bottom over which the transport value applies). I want to interpolate(/project?) this onto a vector of regularly spaced depths Z, where each new transport value Tz is formed from the values of Ts but weighted by the amount of overlap.
I've drawn a picture of what I mean (sorry for the bad quality webcam picture) I want to go from Ts1,Ts2.Ts3...TsN (bottom lines) to Tz1,Tz2,...TzN (top lines). The locations in the x direction for these are s0,s1,s2,...sN and z0,z1,z2,...zN. An example of the 'weighted overlap' would be:
Tz1 = a/(s1-s0) Ts1 + b/(s2-s1) Ts2 + c/(s3-s2) Ts3
where a, b and c are shown in the image as the length of overlap.
Some more details:
Example of z and s follow:
z = 0:5:720;
s = [222.69;...
223.74
225.67
228.53
232.39
237.35
243.56
251.17
260.41
271.5
284.73
300.42
318.9
340.54
365.69
394.69
427.78
465.11
506.62
551.98
600.54
651.2];
Note that I'm free to define z, but not s. Typically, z will be bigger than s (i.e. the smallest value in z will be smaller than in s, while the largest value in z will be larger than in s).
Help or tips greatly appreciated. Cheers,
Dave
I don't think there is an easy solution, as stated in the comments. I'll give it a go though :
One hypothesis first : We assume z0>s0 in order for your problem to be defined.
The idea (for your example) would be to get to the array below :
1 (s1-z0) s1-s0 Ts1
1 (s2-s1) s2-s1 Ts2
1 (z1-s2) s3-s2 Ts3
2 (s3-z1) s3-s2 Ts3
2 (z2-s3) s4-s3 Ts4
3 (z3-z2) s4-s3 Ts4
......
Then we would be able to compute, for each row : column1*column3/column2 and then use accumarray to sum the results with respect to the indexes in the first column.
Now the hardest part is to get this array :
Suppose you have :
A Nx1 vectors Ts
2 (N+1)x1 vectors s and z, with z(1)>s(1).
Vectsz=sort([s(2:end);z]); % Sorted vector of s and z values
In your case this vector should look like :
z0
s1
s2
z1
s3
z2
z3
...
The first column will serve as a subscript to apply accumarray, so we'll want it to increase each time there is a z value in our vector Vectsz
First=interp1(z,1:length(z),Vectsz,'previous');
Second=[diff(Vectsz);0]; % Padded with a 0 to keep the right size
Temp=diff(s);
Third=interp1(s(1:end-1),Temp,Vectsz,'previous');
This will just repeat the diff value everytime you have a z value in your vector Vectsz.
The last column is built exactly like the third one
Fourth=interp1(s(1:end-1),Ts,Vectsz,'previous');
Now that the array is built, a call to accumarray is enough to get the final result :
Res=accumarray(First,Second.*Fourth./Third);
EDIT : There is actually no need for the use of interp1 with the previous option :
Vectsz=sort([s(2:end);z]);
First=cumsum(ismember(Vectsz,z));
Second=[diff(Vectsz);0];
idx=cumsum(ismember(Vectsz,s(2:end)))+1;
Diffs=[diff(s);0];
Third=Diffs(idx);
Fourth=Ts(idx);
Res=accumarray(First,Second.*Fourth./Third);

How to master the random generator in MATLAB for application to bootstrap artificial neural network models

I am working with hydrological time series data and I am attempting to construct Bootstrap Artificial Neural Network models. In order to provide an uncertainty assessment using confidence intervals, one must make sure when resampling/Bootstrapping the original time series data set, that every value in the original time series is held back at least twice within all bootstrap samples in order to calculate the variance and confidence intervals at that point in time.
To give some background:
I am using a hydrological time series that contains Standard Precipitation Index values at monthly time steps, this time series spans 429 (rows) x 1 (column), let's call this time series vector X. All elements/values of X are normalized and standardized between 0 and 1.
Time series X is then trained against some Target values (same length and conditions as X) in a Neural Network to produce new estimates of the Target values, we'll call this output vector, O (same length and conditions as X).
I am now to take X and resample it ii =1:1:200 times (i.e. Bootstrap size = 200) for length(429) with replacement. Let's call the matrix where all the bootstrap samples are placed, M. I use B = randsample(X, length(X), true) and fill M using a for loop such that M(:,ii) = B. Note: I also make sure to include rng('shuffle') after my randsample statement to keep the RNG moving to new states in hopes that it will provide more random results.
Now I am to test how "well" my data was resampled for use in creating confidence intervals.
My procedure is as follow:
Generate a for loop to create M using above procedure
Create a new variable Xc, this will hold all values of X that were not resampled in bootstrap sample ii for ii = 1:1:200
For j=1:1:length(X) fill 'Xc' using the Xc(j,ii) = setdiff(X, M(:,ii)), if element j exists in M(:,ii) fill Xc(j,ii) with NaN.
Xc is now a matrix the same size and dimensions as M. Count the amount of NaN values in each row of Xc and place in vector CI.
If any row in CI is > [Bootstrap sample size, for this case (200) - 1], then no confidence interval can be created at this point.
When I run this I find that the values chosen from my set X are almost always repeated, i.e. the same values of X are used to generate all the samples in M. It's roughly the same ~200 data points in my original time series that are always chosen to create the new bootstrap samples.
How can I effectively alter my program or use any specific functions that will allow me to avoid the negative solution in (5)?
Here is an example of my code, but please keep in mind the variables used in the script may differ from my text in here.
Thank you for the help and please see the code below.
for ii = 1:1:Blen % for loop to create 'how many bootstraps we desire'
B = randsample(Xtrain, wtrain, true); % bootstrap resamples of data series 'X' for 'how many elements' with replacement
rng('shuffle');
M(:,ii) = B; % creates a matrix of all bootstrap resamples with respect to the amount created by the for loop
[C,IA] = setdiff(Xtrain,B); % creates a vector containing all elements of 'Xtrain' that were not included in bootstrap sample 'ii' and the location of each element
[IAc] = setdiff(k,IA); % creates a vector containing locations of elements of 'Xtrain' used in bootstrap sample 'ii' --> ***IA + IAc = wtrain***
for j = 1:1:wtrain % for loop that counts each row of vector
if ismember(j,IA)== 1 % if the count variable is equal to a value of 'IA'
XC(j,ii) = Xtrain(j,1); % place variable in matrix for sample 'ii' in position 'j' if statement above is true
else
XC(j,ii) = NaN; % hold position with a NaN value to state that this value has been used in bootstrap sample 'ii'
end
dum1(:,ii) = wtrain - sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAs' limited by 'isnan' --> used to calculate amt of elements in IA
dum2(:,ii) = sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAsc' limited by 'isnan'
IAs = transpose(dum1) ; % variable counting amount of elements not resampled in 'M' at set 'i', ***i.e. counts 'IA' for each resample set 'i'
IAsc = transpose(dum2) ; % variable counting amount of elements resampled in 'M' at set 'i', ***i.e. counts 'IAc' for each resample set 'i'
chk = isnan(XC); % returns 1 in position of NaN and 0 in position of actual value
chks = sum(chk,2); % counts how many NaNs are in each row for length of time training set
chks_cnt = sum(chks(:)<(Blen-1)); % counts how many values of the original time series that can be provided a confidence interval, should = wtrain to provide complete CIs
end
end
This doesn't appear to be a problem with randsample, but rather a problem in your other code somewhere. randsample does the right thing. For example:
x = (1:10)';
nSamples = 10;
for iter = 1:100;
data(:,iter) = randsample(x,nSamples ,true);
end;
hist(data(:)) %this is approximately uniform
randsample samples quite randomly...