I'm looking for a way to ignore specific entries in matrices for subsequent linear regression in MATLAB
I have two matrices: y =
9.3335 7.8105 5.8969 3.5928
23.1580 19.6043 15.3085 8.2010
40.1067 35.2643 28.9378 16.6753
56.4697 51.8224 44.5587 29.3674
70.7238 66.5842 58.8909 42.7623
83.0253 78.4561 71.1924 53.8532
and x =
300 300 300 300
400 400 400 400
500 500 500 500
600 600 600 600
700 700 700 700
800 800 800 800
I need to do linear regression on the points where y is between 20 and 80, so I need a way to fully automate the process. I tried making the outlying y values [and their corresponding x values] NaNs, but during linear regression MATLAB included the NaNs in the calculations, so I got NaN outputs. Can anyone suggest a good way to ignore those entries, or to ignore NaNs completely in the calculations? (NOTE: the columns in y will often have different combinations of valid values, so I can't eliminate whole rows.)
If the NaNs occur in the same locations in both the X and Y matrices, you can use a function call like `your_function( X(~isnan(X)), Y(~isnan(X)) )`. If the NaNs don't occur in the same locations, first find the indices that are valid in both matrices, e.g. `valid = ~isnan(X) & ~isnan(Y)`, and then call `your_function( X(valid), Y(valid) )`.
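For example, a minimal per-column sketch along those lines, assuming X and Y are equally sized matrices and using polyfit for a first-order fit (all names here are illustrative, not the asker's actual variables):
valid = ~isnan(X) & ~isnan(Y);               % entries valid in both matrices
coeffs = zeros(2, size(Y,2));                % [slope; intercept] per column
for c = 1:size(Y,2)
    v = valid(:,c);                          % valid rows for this column
    coeffs(:,c) = polyfit(X(v,c), Y(v,c), 1).';
end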
Since you perform your regression on each column separately, you can simply form an index into rows with the valid y-values:
nCols = size(x,2);
results = zeros(2,nCols);
validY = y>20 & y<80;    %# logical array the size of y, true for valid entries
nValid = sum(validY,1);  %# number of valid entries per column
for c = 1:nCols
    %# results is [slope; intercept] for each column
    results(:,c) = [x(validY(:,c),c), ones(nValid(c),1)] \ y(validY(:,c),c);
end
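For completeness, a small usage sketch (the query point is illustrative) showing how to evaluate one of the fitted lines afterwards:
c  = 2;                               % pick a column
xq = 550;                             % query point (illustrative)
yq = results(1,c)*xq + results(2,c);  % slope*x + intercept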
In MATLAB I have a structure AVG of size 1 x 6 which has one field, averageNEST, that is a cell array also of size 1 x 6.
Each averageNEST contains matrices of varying sizes (in one dimension), so for example
AVG(1).averageNEST{1,1} is of size 281 x 3 x 19 and
AVG(1).averageNEST{1,2} is of size 231 x 3 x 19
The 2nd and 3rd dimensions of the matrices are always 3 and 19, it is only the first dimension that can change.
I want to average over all the matrices contained within AVG(1).averageNEST and obtain one matrix of size X x 3 x 19 where X is the size of the smallest matrix in AVG(1).averageNEST.
Then I want to do this for all 6 averageNEST in AVG - so have a separate averaged matrix for AVG(1), AVG(2) ... AVG(6).
I have tried multiple things including trying to concatenate matrices using the following code:
for i=1:6
    min_epoch = epoch + 1;
    for ii=1:19
        averageNEST(:,:,ii) = [AVG(i).averageNEST(1:min_epoch,:,ii)];
    end
end
and then average this, but it doesn't work, and now I'm really confused about what I'm doing!
Can anyone help?
I am not sure if I understand what you want to do. If you want to keep only the elements up to the size of the smallest matrix and then average those matrices, you can do the following:
averageNEST = cell(size(AVG));
for iAVG = 1:numel(AVG)
    nests = AVG(iAVG).averageNEST;
    minsize = min(cellfun(@(x) size(x,1), nests));                      % smallest first dimension
    reducednests = cellfun(@(y) y(1:minsize, :, :), nests, 'UniformOutput', false);
    averageNEST{iAVG} = sum(cat(4, reducednests{:}), 4) / numel(nests); % element-wise mean
end
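As a quick illustrative check (dummy data, sizes chosen to mimic the question), the first averaged matrix should come out as 231 x 3 x 19:
% Build a dummy AVG with two entries (purely illustrative sizes)
AVG = struct('averageNEST', {{rand(281,3,19), rand(231,3,19)}, ...
                             {rand(300,3,19), rand(250,3,19)}});
% ... run the loop above, then:
size(averageNEST{1})   % expected: [231 3 19] (smallest first dimension of AVG(1))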
I have the following data. The first row and the first column are the two parameters for which the rest of the elements have been generated. I am hoping to convert this matrix into a 50 by 50 matrix, interpolating the data between the rows and columns.
I have tried interpolating the second column in the following manner,
x=[100 300 500 700];
y=[-20 -184 -315.2 -412];
z = linspace(x(1),x(4),50);
yi=interp1(x,y,z,'cubic');
But my problem is that I am not able to figure out how to interpolate along the rows and the columns simultaneously and get the entire matrix.
Any help/suggestion would be most welcome.
The data is given below;
         30       60       90
100     -20      -45      -80.5
300    -184     -215     -225.4
500    -315.2   -254     -339
700    -412     -419     -488
Your data is a function of two variables (f(x,y)) so you'll need to use interp2 rather than interp1.
% Populate the data that you already have
rows = [100, 300, 500 700];
cols = [30, 60, 90];
data = [-20 -45 -80.5
-184 -215 -225.4
-315.2 -254 -339
-412 -419 -488];
% Interpolate onto a 50-by-50 grid, as requested
[newcols, newrows] = meshgrid(linspace(cols(1), cols(end), 50), ...
                              linspace(rows(1), rows(end), 50));
% Perform the (bi)cubic interpolation
newdata = interp2(cols, rows, data, newcols, newrows, 'cubic');
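As a quick sanity check (purely illustrative, reusing the variables above), you can confirm the grid size and visualize the surface:
size(newdata)                       % 50-by-50 after the interpolation above
surf(newcols, newrows, newdata)     % columns vary along x, rows along y
xlabel('column parameter'); ylabel('row parameter');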
I'm trying to detect peak values in MATLAB. I'm trying to use the findpeaks function. The problem is that my data consists of 4200 rows and I just want to detect the minimum and maximum point in every 50 rows. Afterwards I'll use this code for real-time accelerometer data.
This is my code:
[peaks,peaklocations] = findpeaks( filteredX, 'minpeakdistance', 50 );
plot( x, filteredX, x( peaklocations ), peaks, 'or' )
So you want to first reshape your vector into 50-sample segments (one per column) and then compute the peaks of each segment.
A = randn(4200,1);
B = reshape(A,[50,size(A,1)/50]);        %// B is a 50-by-84 matrix, one 50-sample segment per column
pks = zeros(50,size(A,1)/50);            %// pre-define and set to zero/NaN for stability
pklocations = zeros(50,size(A,1)/50);    %// pre-define and set to zero/NaN for stability
for i = 1:size(A,1)/50
    [p, locs] = findpeaks(B(:,i));       %// peaks of segment i; alter the findpeaks parameters as needed
    pks(1:numel(p), i) = p;
    pklocations(1:numel(locs), i) = locs;
end
This generates two matrices, pks and pklocations, with one column per segment. The downside, of course, is that you don't know in advance how many peaks each segment will have, and every column of the matrix must have the same length, so I padded with zeros; you can pad with NaN instead if you prefer.
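If you prefer NaN padding (so padded entries can't be mistaken for real zero-valued peaks), a minimal variant of the preallocation above would be:
pks = NaN(50, size(A,1)/50);             % NaN padding instead of zeros
pklocations = NaN(50, size(A,1)/50);
% after the loop, the real peaks of segment i are pks(~isnan(pks(:,i)), i)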
EDIT: since the OP is looking for only one maximum and one minimum in each 50 samples, this can be done directly with MATLAB's max/min functions.
A = randn(4200,1);
B = reshape(A,[50,size(A,1)/50]);   %// B is a 50-by-84 matrix, one segment per column
[pks,pklocations] = max(B);
[trghs,trghlocations] = min(B);
Alternatively, you could take max(pks) from the first approach, but that just makes things more complicated.
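Note that the locations returned by max/min above are relative to each 50-sample column. A small sketch (reusing the variables from the EDIT, names otherwise illustrative) to convert them back to indices into the original 4200-sample vector:
offsets = (0:size(B,2)-1) * 50;        % start offset of each 50-sample segment
pkIdx   = pklocations + offsets;       % global indices of the per-segment maxima
trghIdx = trghlocations + offsets;     % global indices of the per-segment minima
plot(A); hold on
plot(pkIdx, pks, 'or'); plot(trghIdx, trghs, 'og'); hold off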
I would like to generate a rectangular matrix A, with entries in the closed interval [0,1], which satisfies the following properties:
(1) size(A) = (200,2000)
(2) rank(A) = 50
(3) nnz(A) = 100000
Ideally, the non-zero elements in A would decay exponentially, or at least polynomially (I want significantly more small values than large ones).
Obviously (I think...), normalizing to [0,1] in the end is not the major issue here.
Things I tried that didn't work:
First generating a random matrix with A=abs(randn(200,2000)) and thresholding
th = prctile(A(:),(1-(100000/(200*2000)))*100);
A = A.*(A>th);
Now that property (3) is satisfied, I lowered the rank
[U,S,V] = svd(A);
for i=51:200 S(i,i)=0; end
A = U*S/V;
But this matrix has almost full cardinality (I lost property (3)).
First generating a matrix with the specified rank with A=rand(200,50)*rand(50,2000). Now that condition (2) is satisfied, I thresholded like before. Only now I lost property (2), as the matrix has almost full rank.
So... Is there a way to make sure both properties (2) and (3) are satisfied simultaneously?
P.S. I would like the non-zero entries in the matrix to be distributed in some random/non-structural manner (just making 50 non-zero columns or rows is not my aim...).
This satisfies all conditions, with very high probability:
A = zeros(200,2000);
A(:,1:500) = repmat(rand(200,50),1,10);
You could then shuffle the nonzero columns if desired:
A = A(:,randperm(size(A,2)));
The matrix has a vertical structure: in 500 columns all elements are nonzero, whereas in the remaining 1500 columns all elements are zero. (Not sure if that's acceptable for your purpose).
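As a quick sanity check (reusing A from above), the three properties can be verified directly:
size(A)    % [200 2000]          -> property (1)
rank(A)    % 50                  -> property (2)
nnz(A)     % 100000 = 200*500    -> property (3)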
Trivial approach:
>> A= rand(200,50);
>> B= zeros(200,1950);
>> A = [A B];
>> A = A(:,randperm(size(A,2)));
>> rank(A)
ans =
50
>> nnz(A)
ans =
10000
Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one value removed, finding the value whose removal gives the largest reduction in the standard deviation, and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
    G = double(H);
    for i = 1:7
        G(i) = NaN;
        T(i) = nanstd(G);
    end
    best = find(T==min(T));
    H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This possibility bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb = find(abs(H-bm) < bl/2)   % elements within the most populated bin
median(H(ifb))                 % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is so critical, you could set the number of bins to 3 (instead of length(H)) and it still would work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (e.g. kmeans) that may work better, though perhaps less efficiently. For instance, this is the output of [H' kmeans(H',4)]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);
[mv, iv] = max(histc(km, 1:nbin));  % iv is the label of the most populated cluster
H(km == iv)
median(H(km == iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
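One hedged alternative to averaging by hand, if your version of kmeans supports it, is the 'Replicates' option, which reruns the clustering several times and keeps the best result:
nbin = 4;
km = kmeans(H', nbin, 'Replicates', 10);   % best of 10 random initializations
[~, iv] = max(histc(km, 1:nbin));          % most populated cluster
median(H(km == iv))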
I timed the two methods and found that kmeans takes roughly 10× longer. However, it is more robust, since the cluster sizes adapt to your data and do not need to be set beforehand (only the number of clusters does).