Why does this trivially learnable example break AdaBoost? - matlab

I'm testing out a boosted tree model that I built using Matlab's fitensemble method.
X = rand(100, 10);
Y = X(:, end)>.5;
boosted_tree = fitensemble(X, Y, 'AdaBoostM1', 100,'Tree');
predicted_Y = predict(boosted_tree, X);
I just wanted to run it on a few simple examples, so I threw in an easy case, one feature is >.5 for positive examples and < .5 for negative examples. I get the warning
Warning: AdaBoostM1 exits because classification error = 0
Which leads me to think, great, it figured out the relevant feature and all the training examples were correctly classified.
But if I look at the accuracy
sum(predicted_Y==Y)/length(Y)
The result is 0.5 because the classifier simply assigned the positive class to all examples!
Why does Matlab think that classification error = 0 when it is clearly not 0? I believe this example should be easily learnable. Is there a way to prevent this error and get the correct result using this method?
Edit: The code above should reproduce the warning.

This is not a bug, it's just that AdaBoost is not designed to work in cases where the first weak learner gets perfect classification. More details:
1) The warning you get is referring to the error of the first weak learning, which is indeed zero. You can see this by following the stack trace that comes along with the warning into the function Ensemble.m (in Matlab R2013b, at line 194). If you place a breakpoint there and run your example, then run the command H.predict(X) you will see that this learning has perfect prediction.
2) So why doesn't your ensemble have perfect prediction? If you look more at Ensemble.m, you'll see that this perfect learner never gets added to the ensemble. This is also reflected in that boosted_tree.NTrained is zero.
3) So why doesn't this perfect learner get added to the ensemble? If you find a description of the AdaBoost.M1 algorithm, you'll see that in each round, training examples are weighted by the error of the previous weak learner. But if that weak learner had no error, then the weights will be zero and therefore all subsequent learners will have nothing to do.
4) If you come across this situation in the real world, what do you do? Don't bother with AdaBoost! The problem is easy enough that a single one of your weak learners can solve it:
X = rand(100, 10);
Y = X(:, end)>.5;
tree = fit(ClassificationTree.template, X, Y);
predicted_Y = predict(tree, X);
accuracy = sum(predicted_Y == Y) / length(Y)

Related

data driven curve - MATLAB

I have several sets of data that I want to fit but not all of them look the same (some look like a Gaussian with one peak, some like two Gaussians with 2 peaks or Lorentzians). I wanted to try this method
http://www.mathworks.com/matlabcentral/fileexchange/31562-data-driven-fitting-with-matlab/content/fitit.m
but the program given is not complete so I can not use it (there is no line that defines 'train' and 'test'). I am writing it so that it suits and works for my data (based on the code that it is given and the demo). I was able to find the best fit but I am also trying to use the bootstrap technique in order to find the confidence intervals. My data is xdata and ydata and they are sorted and the duplicates have been removed before I use them in my program.
cpart=cvpartition(size(xdata,1),'k',10);
tr_x=xdata(training(cpart,1));
tr_y=ydata(training(cpart,1));
tst_x=xdata(test(cpart,1));
tst_y=ydata(test(cpart,1));
all_span=linspace(0.01,0.99,99);
s=zeros(length(all_span);
for k=1:length(all_span)
f = #(tr_x,tr_y,tst_x,tst_y) norm(tst_y mylowess (tr_x, tr_y, tst_x, all_span (k)))^2
s(k) = sum(crossval(f,datax,datay,'partition',cpart));
end
[~,mj]=min(s);
n_span=all_span(mj);%n_span is the optimal span
function ys=mylowess(x1,y1,xs,span)
ys1 = smooth(x1,y1,span,'loess');
ys = interp1(x1,ys1,xs,'linear',NaN);
if any(isnan(ys))
ys(xs<x1(1)) = ys1(1);
ys(xs>x1(end)) = ys1(end);
end
So up to this point I understand the program and I have managed to find the optimal span. I want to find the confidence intervals but so far I was not able to make it work.
When I type:
NB=length(xdata);
f=#(xdata,ydata) mylowess(xdata,ydata,xdata,n_span);
yboot2 = bootstrp(NB,f,xdata,ydata)';
I get the following error
Error using griddedInterpolant
The grid vectors are not strictly monotonic increasing.
Error in interp1 (line 186)
F = griddedInterpolant(X,V,method);
Error in mylowess (line 26)
ysmooth=interp1(xdata,ysmooth1,xinput,'linear',NaN);
As I said before there are no duplicates in xdata and I have already sorted xdata before I used them in the program. Can anyone see the mistake I am making? Or is there an easier way to get the confidence intervals?
Thank you for your help.

Issue with Matlab solve function?

The following command
syms x real;
f = #(x) log(x^2)*exp(-1/(x^2));
fp(x) = diff(f(x),x);
fpp(x) = diff(fp(x),x);
and
solve(fpp(x)>0,x,'Real',true)
return the result
solve([0.0 < (8.0*exp(-1.0/x^2))/x^4 - (2.0*exp(-1.0/x^2))/x^2 -
(6.0*log(x^2)*exp(-1.0/x^2))/x^4 + (4.0*log(x^2)*exp(-1.0/x^2))/x^6],
[x == RD_NINF..RD_INF])
which is not what I expect.
The first question: Is it possible to force Matlab's solve to return the set of all solutions?
(This is related to this question.) Moreover, when I try to solve the equation
solve(fpp(x)==0,x,'Real',true)
which returns
ans =
-1.5056100417680902125994180096313
I am not satisfied since all solutions are not returned (they are approximately -1.5056, 1.5056, -0.5663 and 0.5663 obtained from WolframAlpha).
I know that vpasolve with some initial guess can handle this. But, I have no idea how I can generally find initial guessed values to obtain all solutions, which is my second question.
Other solutions or suggestions for solving these problems are welcomed.
As I indicated in my comment above, sym/solve is primarily meant to solve for analytic solutions of equations. When this fails, it tries to find a numeric solution. Some equations can have an infinite number of numeric solutions (e.g., periodic equations), and thus, as per the documentation: "The numeric solver does not try to find all numeric solutions for [the] equation. Instead, it returns only the first solution that it finds."
However, one can access the features of MuPAD from within Matlab. MuPAD's numeric::solve function has several additional capabilities. In particular is the 'AllRealRoots' option. In your case:
syms x real;
f = #(x)log(x^2)*exp(-1/(x^2));
fp(x) = diff(f(x),x);
fpp(x) = diff(fp(x),x);
s = feval(symengine,'numeric::solve',fpp(x)==0,x,'AllRealRoots')
which returns
s =
[ -1.5056102995536617698689500437312, -0.56633904710786569620564475006904, 0.56633904710786569620564475006904, 1.5056102995536617698689500437312]
as well as a warning message.
My answer to this question provides other way that various MuPAD solvers can be used, particularly if you can isolate and bracket your roots.
The above is not going to directly help with your inequalities other than telling you where the function changes sign. For those you could try:
s = feval(symengine,'solve',fpp(x)>0,x,'Real')
which returns
s =
(Dom::Interval(0, Inf) union Dom::Interval(-Inf, 0)) intersect solve(0 < 2*log(x^2) - 3*x^2*log(x^2) + 4*x^2 - x^4, x, Real)
Try plotting this function along with fpp.
While this is not a bug per se, The MathWorks still might be interested in this difference in behavior and poor performance of sym/solve (and the underlying symobj::solvefull) relative to MuPAD's solve. File a bug report if you like. For the life of me I don't understand why they can't better unify these parts of Matlab. The separation makes not sense from the perspective of a user.

How to build an ARMAX model in Matlab

I'm trying to build an ARMAX model which predicts reservoir water elevation as a function of previous elevations and an upstream inflow. My data is on a timestep of roughly 0.041 days, but it does vary slightly, and I have 3643 time series points. I've tried using the basic armax Matlab command, but am getting this error:
Error using armax (line 90)
Operands to the || and && operators must be convertible to
logical scalar values.
The code I'm trying is:
data = iddata(y,x,[],'SamplingInstants',JDAYs)
m1 = armax(data, [30 30 30 1])
where y is a vector of elevations that starts like y=[135.780
135.800
135.810
135.820
135.820
135.830]', x is a vector of flowrates that starts like x=[238.865
238.411
238.033
237.223
237.223
233.828]', and JDAYs is a vector of timestamps that starts like JDAYs=[122.604
122.651
122.688
122.729
122.771
122.813]'.
I'm new to this model type and the system identification toolbox, so I'm having issues figuring out what's causing that error. The Matlab examples aren't very helpful...
I hope this is not getting to you a bit late.
Checking your code i see that you are using a parameter called SamplingInstants. I'm not sure ARMAX functions works with it. Actually i'm sure. I have tried several times, and no, it doesn't. And it don't seems to be a well documented option for ARMAX -or for other methods- too.
The ARX, ARMAX, and other models are based on linear discrete systems from the Z-Transform formalism, that is, one can ussualy assume that your system has been sampled under a regular sampling rate. Although of course, this is not a law, this is the standard framework when dealing with linear -and also non-linear- systems. And also most industrial control & acquisition systems work under a regular rate sampling. Yet.
Try to get inside the ARMAX standard setting, like this:
y=[135.780 135.800 135.810 135.820 135.820 135.830 .....]';
x=[238.865 238.411 238.033 237.223 237.223 233.828 .....]';
%JDAYs=[122.604 122.651 122.688 122.729 122.771 122.813 .....]';
JDAYs=122.601+[0:length(y)-1]*4.18';
data = iddata(y,x,[],'SamplingInstants',JDAYs);
m1 = armax(data, [30 30 30 1])
And this will always work. Please just ensure that x and y are long enough to enable the proper estimation of all the free coefficients, greater than mean(4*orders), for ARMAX to work -in this case, greater than 121-, and desirable greater than 10*mean(4*orders), for ARMAX algorithm to properly solve your problem, and enough time-variant for prevent reaching onto ill-conditioned solutions.
Good Luck ;)...

Simulating spatial PDEs in Modelica - Accessing variable values at specific times

This question is somewhat related to a previous question of mine, where I didn't quite get the right solution. Link: Earlier SO-thread
I am solving PDEs which are time variant with one spatial dimension (e.g. the heat equation - see link below). I'm using the numerical method of lines, i.e. discretizing the spatial derivatives yielding a system of ODEs which are readily solved in Modelica (using the Dymola tool). My problems arise when I simulate the system, or when I plot the results, to be precise. The equations themselves appear to be solved correctly, but I want to express the spatial changes in all the discretized state variables at specific points in time rather than the individual time-varying behavior of each discrete state.
The strategy leading up to my problems is illustrated in this Youtube tutorial, which by the way is not made by me. As you can see at the very end of the tutorial, the time-varying behavior of the temperature is plotted for all the discrete points in the rod, individually. What I would like is a plot showing the temperature through the rod at a specific time, that is the temperature as a function of the spatial coordinate. My strategy to achieve this, which I'm struggling with, is: Given a state vector of N entries:
Real[N] T "Temperature";
..I would use the plotArray Dymola function as shown below.
plotArray( {i for i in 1:N}, {T[i] for i in 1:N} )
Intuitively, this would yield a plot showing the temperature as a function of the spatial coordiate, or the number in the line of discrete units, to be precise. Although this command yields a result, all T-values appear to be 0 in the plot, which is definitely not the case. My question is: How can I successfully obtain and plot the temperatures at all the discrete points at a given time? Thanks in advance for your help.
The code for the problem is as indicated below.
model conduction
parameter Real rho = 1;
parameter Real Cp = 1;
parameter Real L = 1;
parameter Real k = 1;
parameter Real Tlo = 0;
parameter Real Thi = 100;
parameter Real Tinit = 30;
parameter Integer N = 10 "Number of discrete segments";
Real T[N-1] "Temperatures";
Real deltaX = L/N;
initial equation
for i in 1:N-1 loop
T[i] = Tinit;
end for;
equation
rho*Cp*der(T[1]) = k*( T[2] - 2*T[1] + Thi) /deltaX^2;
rho*Cp*der(T[N-1]) = k*( Tlo - 2*T[N-1] + T[N-2]) /deltaX^2;
for i in 2:N-2 loop
rho*Cp*der(T[i]) = k*( T[i+1] - 2*T[i] + T[i-1]) /deltaX^2;
end for
annotation (uses(Modelica(version="3.2")));
end conduction;
Additional edit: The simulations show clearly that for example T[3], that is the temperature of discrete segment no. 3, starts out from 30 and ends up at 70 degrees. When I write T[3] in my command window, however, I get T3 = 0.0 in return. Why is that? This is at the heart of the problem, because the plotArray function would be working if I managed to extract the actual variable values at specific times and not just 0.0.
Suggested solution: This is a rather tedious solution to achieve what I want, and I hope someone knows a better solution. When I run the simulation in Dymola, the software generates a .mat-file containing the values of the variables throughout the time of the simulation. I am able to load this file into MATLAB and manually extract the variables of my choice for plotting. For the problem above, I wrote the following command:
plot( [1:9]' , data_2(2:2:18 , 10)' )
This command will plot the temperatures (as the temperatures are stored together with their derivates in the data_2 array in the .mat-file) against the respetive number of the discrete segment/element. I was really hoping to do this inside Dymola, that is avoid using MATLAB for this. For this specific problem, the amount of variables was low on account of the simplicity of this problem, but I can easily image a .mat-file which is signifanctly harder to navigate through manually like I just did.
Although you do not mention it explicitly I assume that you enter your plotArray command in Dymola's command window. That won't work directly, since the variables you see there do not include your simulation results: If I simulate your model, and then enter T[:] in Dymola's command window, then the printed result is
T[:]
= {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0}
I'm not a Dymola expert, and the only solution I've found (to actively store and load the desired simulation results) is quite cumbersome:
simulateModel("conduction", resultFile="conduction.mat")
n = readTrajectorySize("conduction.mat")
X = readTrajectory("conduction.mat", {"Time"}, n)
Y = readTrajectory("conduction.mat", {"T[1]", "T[2]", "T[3]"}, n)
plotArrays(X[1, :], transpose(Y))

Matlab Weka Interface AdaBoost Issues: Out of Bounds Exception

I'm doing some cross-validation using a Matlab Weka Interface that I got from file exchange. My loop structure seems to work fine for Weka's Logistic classifier. However, when I try to do the exact same thing for AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24 classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225 [pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem for me, since in most situations the training set is greater than the test set in size.
Is there something different about how I should format my inputs when using Adaboost rather than Logistic? Any information you can give regarding this problem would be so helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect it to work only when M < N - which is what you describe - and even then, it wouldn't give a meaningful result.