Matlab fitrensemble capped to 99 observations

This is probably a very simple fix, but I've been trying to run a random forest algorithm on some data using the built-in fitrensemble function.
I pass in a 2D matrix X with the training data and the output vector as Y.
Both have 3499 rows; X has 6 columns and Y has 1.
Yet I get the following output from my fitrensemble call:
ResponseName: 'Y'
CategoricalPredictors: []
ResponseTransform: 'none'
NumObservations: 99
NumTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
FitInfo: []
FitInfoDescription: 'None'
Regularization: []
FResample: 1
Replace: 1
UseObsForLearner: [99×100 logical]
NumObservations is always capped at 99... why is that? Note that when I reduce the training data to fewer than 99 rows, the value of NumObservations comes down to match it. I tried setting it as an argument, but that didn't work either.
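A hedged diagnostic sketch (an assumption, not from the original post): if observations are being dropped before training, for example rows of X or Y containing NaN or Inf values, the following counts should account for the gap between size(X,1) and NumObservations:
sum(any(isnan(X),2) | isnan(Y))   % rows with missing values
sum(any(isinf(X),2) | isinf(Y))   % rows with infinite values
mdl = fitrensemble(X, Y, 'Method', 'LSBoost');
mdl.NumObservations               % compare against size(X,1)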

Related

How can I detect the minimum and maximum values every 50 rows

I'm trying to detect peak values in MATLAB using the findpeaks function. The problem is that my data consists of 4200 rows and I just want to detect the minimum and maximum point in every 50 rows. Afterwards, I'll use this code on real-time accelerometer data.
This is my code:
[peaks,peaklocations] = findpeaks( filteredX, 'minpeakdistance', 50 );
plot( x, filteredX, x( peaklocations ), peaks, 'or' )
So you want to first reshape your vector into 50-sample segments and then compute the peaks for each segment.
A = randn(4200,1);
B = reshape(A, 50, numel(A)/50);   % 50x84 matrix: one 50-sample segment per column
nSeg = size(B,2);
pks = zeros(50, nSeg);             % preallocate; pad with zeros (or NaN) for stability
pklocations = zeros(50, nSeg);
for i = 1:nSeg
    % peaks of segment i; you can alter the parameters of the findpeaks function
    [p, l] = findpeaks(B(:,i));
    pks(1:numel(p), i) = p;
    pklocations(1:numel(l), i) = l;
end
This generates two matrices, pks and pklocations, one column per segment. The downside, of course, is that you do not know in advance how many peaks each segment will yield, and every column of the matrix must have the same length, so I padded with zeros; you can pad with NaN if you prefer.
EDIT: since the OP is looking for only one maximum and one minimum in each 50-sample segment, this is easily handled by MATLAB's min and max functions.
A = randn(4200,1);
B = reshape(A, 50, numel(A)/50);   % 50x84 matrix: one 50-sample segment per column
[pks, pklocations] = max(B);
[trghs, trghlocations] = min(B);
Alternatively, I suppose you could take a max(pks) over the findpeaks output, but that just complicates things.
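One small follow-up sketch (not in the original answer): the locations returned by max and min are row indices within each 50-sample column of B, so mapping them back to indices into the original vector A just means adding each segment's offset:
segOffset = (0:size(B,2)-1) * 50;         % start offset of each segment
globalMaxIdx = pklocations + segOffset;   % 1x84 indices into A
globalMinIdx = trghlocations + segOffset;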

Generate Random Matrix in Matlab

Is there any way in Matlab to generate a 5000 x 1000 matrix of random numbers with something like
MM = betarnd(A,B,1,1000);
where A and B are vectors (1 x 5000)? I get the following error message:
??? Error using ==> betarnd at 29
Size information is inconsistent.
I want to avoid a loop like the following one:
for ii = 1 : 1000
MM(:,ii) = betarnd(A,B);
end
Thanks!
You can repeat A and B (vectors of size 1x5000) to obtain matrices of size 1000x5000 in which all rows are equal, and use those matrices as inputs to betarnd. That way you get a result of size 1000x5000 in which column k contains 1000 random values with parameters A(k) and B(k).
The reason is that, according to the documentation (emphasis mine):
R = betarnd(A,B) returns an array of random numbers chosen from the
beta distribution with parameters A and B. The size of R is the common size of A and B if both are arrays.
So, use
MM = betarnd(repmat(A(:).',1000,1), repmat(B(:).',1000,1));
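As a quick sanity check of the shape (A and B below are just assumed example parameters; the mean of a Beta(a,b) draw is a/(a+b)):
A = rand(1,5000) + 0.5;
B = rand(1,5000) + 0.5;
MM = betarnd(repmat(A(:).',1000,1), repmat(B(:).',1000,1));
size(MM)        % 1000 x 5000
mean(MM(:,1))   % should be close to A(1)/(A(1)+B(1))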

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is computing the standard deviation with one value removed, finding the value whose removal gives the largest reduction in the standard deviation, and repeating that over the elements of the array, which is terribly inefficient:
for j = 1:7
    for i = 1:7
        G = double(H);
        G(i) = NaN;            % leave one value out
        T(i) = nanstd(G);      % standard deviation without that value
    end
    best = find(T == min(T));  % removal giving the largest reduction in std
    H(best) = NaN;             % discard that value and repeat
end
x = find(H == max(H));
Any thoughts?
This approach bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H);        % <-- set # of bins here
[v, bins] = hist(H, nbins);
[vm, im] = max(v);        % find max in histogram
bl = bins(2) - bins(1);   % bin width
bm = bins(im);            % center of the most populated bin
ifb = find(abs(H - bm) < bl/2)  % elements within that bin
median(H(ifb))                  % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
For certain distributions it may help to replace bl in the find expression with a value you judge better in advance.
I should also note that there are clustering methods (e.g. kmeans) that may work better, though perhaps less efficiently. For instance, this is the output of [H' kmeans(H',4)]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);
[mv, iv] = max(histc(km, 1:nbin));  % iv is the label of the most populated cluster
H(km == iv)                         % members of that cluster
median(H(km == iv))
Notice, however, that kmeans uses random starts, so it does not necessarily return the same clustering every time it is run; you might need to aggregate over a few runs.
I timed the two methods and found that kmeans takes roughly 10x longer. However, it is more robust, since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).
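A hedged way to reduce that run-to-run variability (not in the original answer) is the 'Replicates' option of kmeans, which reruns the clustering from several random starts and keeps the best result:
nbin = 4;
km = kmeans(H', nbin, 'Replicates', 10);  % keep the best of 10 random starts
[~, iv] = max(histc(km, 1:nbin));         % label of the most populated cluster
median(H(km == iv))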

filling a matrix with random integers from a range according to a rule

I'm using the matrix as an initial population for multiobjective optimization using NSGA-II in Matlab. My chromosome vector C is 1x192, each gene must be an integer in the range 0 <= gene <= 40, and the sum of each successive group of 6 genes must be less than or equal to 40. That is:
all(sum(reshape(C,6,[])) <= 40)
I've used the following code, but it outputs either an all-zero population matrix (the population matrix is the vertical concatenation of 500 chromosomes) or a matrix that does not satisfy the rule:
X = zeros(500,192);
i = 1;
while i < 501
    r = randi(40,6,32);              % one candidate chromosome as a 6x32 block
    if nnz((sum(r)./40) > 1) == 0    % i.e., all 32 column sums <= 40
        X(i,:) = reshape(r,1,[]);
        i = i + 1;
    end
    clear r;
end
It also takes forever to exit the while loop.
What am I doing wrong here? Is there another way of doing this?
I've also tried this:
i=1;
while i<17500
r=randi([1,40],6,1);
s=sum(r);
if s<=40
X(:,i)=r;
i=i+1;
else
clear r;
end
end
X=unique(X','rows')';
A=X(:,randperm(size(X,2)));
A=X(randperm(size(X,1)),:);
The above tries to create random columns that are then reshaped into the population matrix. But the numbers repeat; i.e., among the 17500 columns (16448 after removing duplicates) the numbers 37 and 40 never occur. Is there any way I can improve the spread of the generated random numbers?
@0x90
I have a vector, called 'chromosome', of size 1x192, and each successive group of 6 members (called a phenotype) must sum to 40 or less. To make it clearer: each gene must be an integer in the range 0 to 40 inclusive, and the sum within each phenotype must be <= 40. I need 500 chromosomes like this.
I hope it makes sense now.
You should use randi([min,max],n,m). randint is going to be deprecated.
>> r = randi([1,4],3,2)
r =
3 3
2 2
4 4
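Beyond the randi fix, here is a hedged sketch (not from the original answer) of one way to build a valid population: rejection-sample each 6-gene group independently, so only the offending group is redrawn rather than the whole chromosome. Drawing genes from 0 to 40 rather than 1 to 40 also lets large values such as 37 and 40 appear, because the other five genes in the group can be zero:
nChrom = 500;
nGroups = 192/6;
X = zeros(nChrom, 192);
for c = 1:nChrom
    for g = 1:nGroups
        r = randi([0,40],1,6);        % one 6-gene group, genes in 0..40
        while sum(r) > 40
            r = randi([0,40],1,6);    % redraw only this group
        end
        X(c, (g-1)*6+1 : g*6) = r;
    end
end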

How to ignore NaNs in MATLAB?

I'm looking for a way to ignore specific entries in matrices for subsequent linear regression in MATLAB.
I have two matrices: y =
9.3335 7.8105 5.8969 3.5928
23.1580 19.6043 15.3085 8.2010
40.1067 35.2643 28.9378 16.6753
56.4697 51.8224 44.5587 29.3674
70.7238 66.5842 58.8909 42.7623
83.0253 78.4561 71.1924 53.8532
and x =
300 300 300 300
400 400 400 400
500 500 500 500
600 600 600 600
700 700 700 700
800 800 800 800
I need to do linear regression on the points where y is between 20 and 80, so I need a way to fully automate the process. I tried setting the outlying y values (and their corresponding x values) to NaN, but during linear regression MATLAB included the NaNs in the calculations, so I got NaN outputs. Can anyone suggest a good way to ignore those entries, or to ignore NaNs in calculations entirely? (Note: the columns of y will often have different combinations of valid values, so I can't simply eliminate whole rows.)
If the NaNs occur in the same locations in both the X and Y matrices, you can use a function call like your_function(X(~isnan(X)), Y(~isnan(X))). If the NaNs don't occur in the same locations, you first have to find the jointly valid indices with something like X(~(isnan(X) | isnan(Y))).
Since you perform your regression on each column separately, you can simply form an index into rows with the valid y-values:
nCols = size(x,2);
results = zeros(2,nCols);
validY = y>20 & y<80;   % a logical array the size of y marking valid entries
nValid = sum(validY,1);
for c = 1:nCols
    % results is [slope; intercept] in each column
    results(:,c) = [x(validY(:,c),c), ones(nValid(c),1)] \ y(validY(:,c),c);
end