Is there a way to calculate a moving mean such that the values at the beginning and at the end of the array are averaged with the ones at the opposite end?
For example, instead of this result:
A=[2 1 2 4 6 1 1];
movmean(A,2)
ans = 2.0 1.5 1.5 3.0 5.0 3.5 1.0
I want to obtain the vector [1.5 1.5 1.5 3.0 5.0 3.5 1.0], where the initial array element 2 is averaged with the ending element 1.
Generalizing to an arbitrary window size N, this is how you can add circular behavior to movmean in the way you want:
movmean(A([(end-floor(N./2)+1):end 1:end 1:(ceil(N./2)-1)]), N, 'Endpoints', 'discard')
For the given A and N = 2, you get:
ans =
1.5000 1.5000 1.5000 3.0000 5.0000 3.5000 1.0000
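The same indexing also handles odd windows; for example, with N = 3 it pads one element on each side, giving centered three-point circular means (the values below are worked out by hand from A):
N = 3;
B = A([(end-floor(N./2)+1):end 1:end 1:(ceil(N./2)-1)]); % [1 2 1 2 4 6 1 1 2]
movmean(B, N, 'Endpoints', 'discard')
% ans = 1.3333 1.6667 2.3333 4.0000 3.6667 2.6667 1.3333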
For an arbitrary window size n, you can use circular convolution with an averaging mask defined as [1/n ... 1/n] (with n entries; in your example n = 2):
result = cconv(A, repmat(1/n, 1, n), numel(A));
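For the sample data above this reproduces the desired vector (cconv wraps at both ends, and for n = 2 each output is the mean of an element and its circular predecessor):
n = 2;
result = cconv(A, repmat(1/n, 1, n), numel(A))
% result = 1.5 1.5 1.5 3.0 5.0 3.5 1.0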
Convolution offers some nice ways of doing this. You may need to tweak the input slightly, though, if you only want to partially average the ends (i.e. in your example the first element is averaged with the last, but the last is not averaged with the first).
conv([A(end),A],[0.5 0.5],'valid')
ans =
1.5000 1.5000 1.5000 3.0000 5.0000 3.5000 1.0000
The generalized case here, for a moving average of size N, is (note that this uses a trailing window, wrapping circularly: each output is the mean of the current element and the N-1 preceding ones; for N = 2 this coincides with movmean's convention):
conv(A([end-N+2:end, 1:end]), repmat(1/N,1,N), 'valid')
I am trying to compute a moving average on multiple columns of a matrix. After reading some answers on Stack Overflow, namely this one, the filter function seemed to be the way to go. However, it does not ignore NaN elements, and I would like a moving average that ignores them, in the spirit of the function nanmean. Below is a sample code:
X = rand(100,100); %generate sample matrix
X(sort(randi([1 100],1,10)),sort(randi([1 100],1,10))) = NaN; %put some random NaNs
windowlength = 7;
MeanMA = filter(ones(1, windowlength) / windowlength, 1, X);
Use colfilt with nanmean:
>> A = [1 2 3 4 5; 2 nan nan nan 6; 3 nan nan nan 7; 4 nan nan nan 8; 5 6 7 8 9]
A =
1 2 3 4 5
2 NaN NaN NaN 6
3 NaN NaN NaN 7
4 NaN NaN NaN 8
5 6 7 8 9
>> colfilt(A, [3,3], 'sliding', @nanmean)
ans =
0.6250 1.1429 1.5000 2.5714 1.8750
1.1429 2.2000 3.0000 5.0000 3.1429
1.5000 3.0000 NaN 7.0000 3.5000
2.5714 5.0000 7.0000 7.8000 4.5714
1.8750 3.1429 3.5000 4.5714 3.1250
(if you only care about 'full' blocks, select inner rows / columns appropriately)
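For example, with the 3-by-3 window used here, trimming one row and column from each border keeps only the fully covered blocks:
B = colfilt(A, [3,3], 'sliding', @nanmean);
Binner = B(2:end-1, 2:end-1); % only blocks that lie entirely inside A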
Alternatively, you can use nlfilter, but then you need to be explicit (via an anonymous function handle) about what you do with each block; in particular, for nanmean to produce a scalar output from the whole block, you need to convert each block to a column vector inside the anonymous function:
>> nlfilter(A, [3,3], @(x) nanmean(x(:)))
ans =
0.6250 1.1429 1.5000 2.5714 1.8750
1.1429 2.2000 3.0000 5.0000 3.1429
1.5000 3.0000 NaN 7.0000 3.5000
2.5714 5.0000 7.0000 7.8000 4.5714
1.8750 3.1429 3.5000 4.5714 3.1250
However, for the record, MATLAB claims colfilt will generally be faster, so nlfilter is better reserved for situations where it doesn't make sense to convert each block to a column for processing.
Also see MATLAB's manual chapter on sliding neighborhood operations in general.
If you have R2016a or beyond, you can use the movmean function with the 'omitnan' option.
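A minimal sketch with the variables from the question (note that movmean uses a centered window by default, whereas the filter call above yields a trailing average):
MeanMA = movmean(X, windowlength, 1, 'omitnan'); % columnwise moving mean, NaNs ignored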
Try
MeanMA = filter(ones(1, windowlength) / windowlength, 1, X(~isnan(X)));
This will extract the non-NaN values from X.
The question is: do you still have a valid filter processing? If X is filled iteratively, one element per timestep, then the "NaN elimination" will produce a shorter vector whose values are no longer aligned with the original time vector.
EDIT
To still have a valid mean calculation, the filter parameters must be updated according to the number of non-NaN values.
values = X(~isnan(X));
templength = length(values);
MeanMA = filter(ones(1, templength) / templength, 1, values);
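That, however, gives a single global window. If you want a true NaN-ignoring moving average with filter, a standard trick (a sketch, reusing windowlength from the question) is to zero out the NaNs, take windowed sums, and divide by the per-window count of valid samples:
Xz = X;
Xz(isnan(X)) = 0;                              % zeros do not contribute to the sums
kernel = ones(windowlength, 1);
sums = filter(kernel, 1, Xz);                  % windowed sums down each column
counts = filter(kernel, 1, double(~isnan(X))); % number of valid samples per window
MeanMA = sums ./ counts;                       % average over the non-NaN entries only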
I have protein-protein interaction data for Homo sapiens. The size of the matrix is 4850628x3. The first two columns are proteins and the third is their confidence score. The problem is that half the rows are duplicate pairs:
if protein A interacts with B, C, and D, it is listed as
A B 0.8
A C 0.5
A D 0.6
B A 0.8
C A 0.5
D A 0.6
Notice that the confidence score for A interacting with B is the same as for B interacting with A, namely 0.8.
So in my 4850628x3 matrix, half the rows are duplicate pairs. If I simply take the unique rows I might lose some data, since (A,B) and (B,A) count as different rows.
What I want is the 2425314x3 matrix without duplicate pairs. How can I do this efficiently?
Suppose that in your matrix each protein is stored with a unique id (e.g. A=1, B=2, C=3, ...); your example matrix will then be:
M =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
2.0000 1.0000 0.8000
3.0000 1.0000 0.5000
4.0000 1.0000 0.6000
You must first sort the first two columns row-wise, so the protein pairs always appear in the same order:
M2 = sort(M(:,1:2),2)
M2 =
1 2
1 3
1 4
1 2
1 3
1 4
Then use unique with the 'rows' option and keep the indices of the unique pairs:
[~, idx] = unique(M2, 'rows')
idx =
1
2
3
Finally, filter your initial matrix to keep only the unique pairs.
R = M(idx,:)
R =
1.0000 2.0000 0.8000
1.0000 3.0000 0.5000
1.0000 4.0000 0.6000
Et voilà!
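Putting the three steps together:
[~, idx] = unique(sort(M(:,1:2), 2), 'rows');
R = M(idx, :);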
I have a matrix that looks something like this:
a=[1 1 2 2 3 3 4 4;
1.5 1.5 2.5 2.5 3.5 3.5 4.5 4.5]
What I would like to do is reshape this, i.e. take the 2x2 matrices that sit next to one another and put them underneath each other.
So get:
b=[1 1;
1.5 1.5;
2 2;
2.5 2.5;
3 3;
3.5 3.5;
4 4;
4.5 4.5]
but I can't seem to manipulate the reshape function to do this for me.
Edit: the single-line version might be a bit complicated, so I've also added one based on a for loop.
Two reshapes and a permute should do it: we first split the 2x2 matrices and store them along the third dimension, then stack them. In order to stack them we first need to permute the dimensions (similar to a transpose).
>> reshape(permute(reshape(a,2,2,4),[1 3 2]),8,2)
ans =
1.0000 1.0000
1.5000 1.5000
2.0000 2.0000
2.5000 2.5000
3.0000 3.0000
3.5000 3.5000
4.0000 4.0000
4.5000 4.5000
The for-loop-based version is a bit more straightforward: we preallocate an array of the correct size and then insert each of the 2x2 matrices separately:
b=zeros(8,2);
for i = 1:4
b((2*i-1):(2*i),:) = a(:,(2*i-1):(2*i));
end
So I have quite a few (over 60000) data points
f(x_k) = k, where k = 0, 1, 2, ..., N.
The function is monotonically increasing and visually looks pretty smooth. I would love to find a fit F(x) such that k <= F(x_k) < k+1 for every x_k.
How should I approach this problem?
Data example
x 0 1 3 5 8 10 14 16 20 23 27 29 35 37 41
f(x) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
(This looks a bit like a lookup table. Maybe an image processing application of some sort? I built some tools in a past life where unrounding was needed.)
Is this a one time problem, or will you be doing it often, so you have a need for speed?
I'd throw it into SLM. Since I don't have the data, I cannot test it out or give you any results myself, but there is certainly no problem getting an assured fit of the quality you wish as long as you use a sufficient number of knots. You would need additional knots on the right-hand side, as the curve appears to approach a vertical asymptote, thus a singularity. Splines in general tend not to like singularities, as they are still polynomials at heart.
Better yet, swap the x and y axes to do the fit, thus fitting x = f(y). The left end point is not an asymptote, so there is no longer a singularity. Now all you need do is constrain the result to be monotonic increasing, and concave down (thus everywhere a negative second derivative.) You will require far fewer knots for the inverse fit, but use enough knots that the fit is of adequate quality for your goals.
To use the inverse fit, simply interpolate in the reverse direction, something that SLMEVAL is capable of doing. I'll see how it does on the little bit of test data you have provided (with just the default number of knots):
x = [0 1 3 5 8 10 14 16 20 23 27 29 35 37 41];
y = [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14];
slm = slmengine(y,x,'plot','on','increasing','on');
So the fit seems reasonable, but I note that your data seems a bit bumpy. It may indeed be difficult to get a solution that is smooth, yet fits entirely within your requirements.
Let's see how well it did:
[x;y;slmeval(x,slm,-1)]'
ans =
0 0 0.0190
1.0000 1.0000 0.9656
3.0000 2.0000 2.0522
5.0000 3.0000 2.9239
8.0000 4.0000 4.1096
10.0000 5.0000 4.8419
14.0000 6.0000 6.1963
16.0000 7.0000 6.8331
20.0000 8.0000 8.0638
23.0000 9.0000 8.9699
27.0000 10.0000 10.1459
29.0000 11.0000 10.7088
35.0000 12.0000 12.2942
37.0000 13.0000 12.8285
41.0000 14.0000 NaN
It misses the last point completely, refusing to extrapolate. But the remainder are not far off. They do fail your requirement though, as it is not true that
k <= F(x_k) < k+1
Of course, I did not build the spline with such a requirement in the specs. Were I to try to solve this problem in general, I might write code that would estimate the values on the curve directly, with no spline intermediary. Then I could easily enforce your constraints, finding the smoothest set of points that satisfies your error bar requirements and monotonicity, that also lies as close to the original data as is possible. Of course, that would involve a large system solve, with 60k unknowns. I don't know how lsqlin would handle that large of a problem, but there are other solvers that might be able to do so if time was an issue.
Again, with your test data as a small scale example:
x = [0 1 3 5 8 10 14 16 20 23 27 29 35 37 41]';
n = numel(x);
k = (0:(n-1))';
% The "unrounding" bound constraints
LB = k;
UB = k+1;
% The best fit possible
Afit = speye(n,n);
% And as smooth as possible
ind = 1:(n-2);
% could do this with diff of course
dx1 = x(ind+1) - x(ind);
dx2 = x(ind+2) - x(ind + 1);
% central second finite difference, for unequal spacing
den = dx1.*dx2.*(dx1 + dx2)/2;
Areg = spdiags([dx2./den,-(dx1 + dx2)./den,dx1./den],[0 1 2],n-2,n);
rhs = [k;zeros(n-2,1)];
% monotonicity constraints...
Amono = spdiags(repmat([1 -1],n-1,1),[0 1],n-1,n);
bmono = zeros(n-1,1);
% choose a value for r, that allows you to control the smoothness
% larger values of r will make the curve smoother, but the bounds
% will always be enforced. I played with it, and r = 5 seemed a
% reasonable compromise here.
r = 5;
yhat = lsqlin([Afit;r*Areg],rhs,Amono,bmono,[],[],LB,UB);
lsqlin is a bit unhappy, since it does not handle sparse problems of this form at this time, so it throws a warning that it is converting the problem to a full one.
Warning: Large-scale algorithm can handle bound constraints only;
using medium-scale algorithm instead.
> In lsqlin at 270
Warning: This problem formulation not yet available for sparse matrices.
Converting to full to solve.
> In lsqlin at 320
Optimization terminated.
Of course, this conversion will be TOTALLY unacceptable for a problem with 60k unknowns. DO NOT TRY IT ON 60k data points! Your computer will go into a deep freeze.
How did it do though?
disp([x,k,yhat,k+1])
0 0 0.4356 1.0000
1.0000 1.0000 1.0000 2.0000
3.0000 2.0000 2.0504 3.0000
5.0000 3.0000 3.0000 4.0000
8.0000 4.0000 4.2026 5.0000
10.0000 5.0000 5.0000 6.0000
14.0000 6.0000 6.2739 7.0000
16.0000 7.0000 7.0000 8.0000
20.0000 8.0000 8.0916 9.0000
23.0000 9.0000 9.0000 10.0000
27.0000 10.0000 10.2497 11.0000
29.0000 11.0000 11.0000 12.0000
35.0000 12.0000 12.2994 13.0000
37.0000 13.0000 13.0000 14.0000
41.0000 14.0000 14.0594 15.0000
It worked nicely, although it would be a hog of obscene proportions for large problems as you have. Perhaps there is another optimizer (maybe in TOMLAB or some other package) that can handle a large scale sparse linear problem, subject to linear and bound constraints. You also might wish to force the first point through zero, but that is trivial to do.
A final option, if say 1000 points is doable, is to recreate the curve in batches of 1010 at a time using the above scheme; lsqlin should be able to handle problems of that size with no trouble. Leave some overlap at the ends (5 points in each overlap region should be sufficient), then average the results in the overlap regions, as sketched below.
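For illustration, a minimal sketch of that batching scheme; unroundBatch is a hypothetical helper standing in for the lsqlin solve above applied to one chunk of the data:
batchsize = 1010; % points per lsqlin solve
overlap = 5;      % points shared between consecutive batches
n = numel(x);
yhat = nan(n, 1);
start = 1;
while true
    stop = min(start + batchsize - 1, n);
    chunk = unroundBatch(x(start:stop), k(start:stop)); % hypothetical helper
    if start == 1
        yhat(start:stop) = chunk;
    else
        ov = start:(start + overlap - 1);             % overlap with previous batch
        yhat(ov) = (yhat(ov) + chunk(1:overlap)) / 2; % average the two estimates
        yhat((start + overlap):stop) = chunk((overlap + 1):end);
    end
    if stop == n, break; end
    start = stop - overlap + 1;
end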
I have a large matrix with two columns. First is an index, second is data. Some indices are repeated. How can I retain only the first instance of rows with repeated indices?
For Example:
x =
1 5.5
1 4.5
2 4
3 2.5
3 3
4 1.5
to end up with:
ans =
1 5.5
2 4
3 2.5
4 1.5
I've tried various variations and iterations of
[Uy, iy, yu] = unique(x(:,1));
[q, t] = meshgrid(1:size(x, 2), yu);
totals = accumarray([t(:), q(:)], x(:));
but nothing so far has given me the output I need.
Use the 'first' flag of the unique function; its second output then supplies the row indices you want, which you can use to filter your matrix.
[~, ind] = unique(x(:,1), 'first');
ans = x(ind, :)
ans =
1.0000 5.5000
2.0000 4.0000
3.0000 2.5000
4.0000 1.5000
EDIT
Or, as Jonas points out (especially for old MATLAB releases, where unique returns the last occurrence rather than the first):
[~, ind] = unique(flipud(x(:,1)));
ind = size(x, 1) + 1 - ind; % map the indices from the flipped vector back to x
ans = x(sort(ind), :)