I'm using matlab to mimic some c code that I've written so that I can replicate the result and plot it in matlab.
In c I am using an open source sliding median filter linked here.
I am currently using medfilt1 to mimic it in matlab, but there are two problems.
It is SLOW! it would be ideal to have a similar filter in matlab, but I don't want to take the time to write one. If one does not exist, I'd like to at least be able to accomplish #2.
medfilt1 does not wrap around.
x = 1:5;
medfilt1(x,3) % = 1 2 3 4 4
It calculates the first values like
median([0 1 2]) % = 1
This is because medfilt1 uses zeros to fill in when you get near the edges. I would like a way to change this so that the first value is calculated like
median([5 1 2]) % = 2
medfiltcirc(x,3) % = 2 2 3 4 4
UPDATE: For fun, I wrote a sliding filter in matlab, and it turned out to be about 4x slower than using medfilt1 with padarray. Using ordfilt2(padarray(x,[N/2 0],'circular'),N/2,ones(N,1)) proved to be even faster than medfilt1. I believe the only way to improve speed further would be to write a mex file.
another option is to use padarray before operating with the median filter:
x = medfilt1(padarray(1:5,[0 1],'circular'));
and take x(2:end-1) as the answer...
to improve medfilt1 consider using ordfilt2 for example:
x = ordfilt2(1:5,2,[1 1 1]);
this should buy you up to a factor of two, read more about it here, and do pay attention to the variable class used for A...
You can do the wrapping around yourself. Let the data be defined as
x = 1:5; %// data values
S = 3; %// block size
Then:
s = floor(S/2); %// how many elements need to be wrapped around
n = numel(x); %// length of x
result = medfilt1([x(n-s+1:n) x x(1:s)], S);
result = result(s+1:end-s);
As I'm sure that you're aware, circular statistics can be treacherous – from the Wikipedia article on directional statistics:
When data is concentrated, the median and mode may be defined by analogy to the linear case, but for more dispersed or multi-modal data, these concepts are not useful.
Have you looked at the Circular statistics Toolbox on the MathWorks File Exchange? It includes a circ_median function. I don't know how fast this function is in comparison to what you're using, but it might be a good starting point for optimization. A paper was published in the Journal of Statistical Software on the toolbox (alt link to PDF). It might be a good idea to at least check if this function matches the output of other methods.
Related
Suppose I have a weight matrix W nxm where m is the number of variables and the n is the number of instances. Also I have data matrix X of the same size. I try to find the closest weight vector to each instance in X. However both matrices are so dimensional therefore plain methods are not sufficient enough. I have tried some GPU trick at MATLAB but it does not work well since it was sequential approach that was calculating the closest weight for each instance sequentially. I am now looking for efficient one shot code. That takes all the W and X and find the winner with some MATLAB tricks with possibly some GPU addition. Is there any one that can suggest any code snippet in the MATLAB?
This is the thing that I wrote for sequential
x_in_d = gpuArray(x_in); % take input instance to device
W_d = gpuArray(W); % take weight matrix to device
Dx = W_d - x_in_d(ones(size(W_d,1),1),logical(ones(1,length(x_in_d))));
[d_min,winner] = min(sum((Dx.^2)'));
d_min = gather(d_min); %gather results
winner = gather(winner);
What do you mean by so dimensional? It's just an m x n matrix right?
It would be really helpful if you could provide some sample data, based off your description (which isn't the clearest), here is what I think your data looks like.
weights=
[1 4 2
5 3 1]
data=
[2 5 1
1 2 2]
And you want to figure out which row of weights is closest to the row of data? Which in this case would be the first row of weights for both rows of data.
Please edit your question to clarify what your asking for and consider using some examples.
EDIT:
I like Rody's Dup. Comment, if I am correct, check out: Link Here
x=[1;2;3]
x =
1
2
3
y=[4;5;6]
y =
4
5
6
x\y
ans =
2.2857
How did Matlab find that result ? (I searched many forums but I did not understand what they told.I would like to know the algorithm which gave this result.)
From MATLAB documentation of \:
If A is an M-by-N matrix with M < or > N and B is a column vector with M components, or a matrix with several such columns, then X = A\B is the solution in the least squares sense to the under- or overdetermined system of equations A*X = B.
Here your system is not under/over-determined. Since both have 3 rows. So you can visualize your equation as:
xM=y
M=inv(x)*y
Now, since your matrix is not square, it will calculate the pseudo-inverse using SVD. Therefore,
M=pinv(x)*y;
You will get value of M as 2.2857.
Another explanation can be: It will give you the solution of xM=y in the sense of least squares. You can verify this as follows:
M=lsqr(x,y)
This will give you the value of M = 2.2857.
You can always do help \ in MATLAB command window to get more information.
You are encouraged to check more details about the least squares and pseudo-inverse.
This documentation should explain it
http://www.mathworks.com/help/matlab/ref/mrdivide.html
Here is a link to the algorithm
http://www.maths.lth.se/na/courses/NUM115/NUM115-11/backslash.html
You can see the source inside matlab much more easily though. (I don't have it locally so I can't check but the source of a lot of matlab functions is available inside matlab)
I dont quite get the vectorizing way of thinking of matlab, mostly due to the simple examples provided in the documentation, and i hope someone can help me understand it a little better.
So, what i'm trying to accomplish is to take a sample of NxN from a matrix of ncols x nrows x ielements and compute the average for each ielement and store the maximum of the averages. Using for loops, the code would look like this:
for x = 1+margin : nrows-margin
for y = 1+margin : ncols-margin
for i=1:ielem
% take a NxN sample
sample = input_matrix(y-margin:y+margin,x-margin:x+margin,i)
% compute the average of all elements
result(i) = mean2(sample);
end %for i
% store the max of the computed averages
output_matrix(y,x)=max(result);
end %for y
end %for x
can anyone do a good vectorization of this example of a situation ? T
First of all, vectorization is not as important as it once was, due to enhancements in compiling the code before it is ran, but it's still a very common practice and can lead to some enhancements. Older Matlab version executed one line at a time, which would leave a for loop much slower than a vectorized version of the same code.
The part of your matrix that could be vectorized is the inner more for loop. I'll show a simple example of what you are trying to do, I'll let you take the example and put it into your code.
input=randn(5,5,3);
max(mean(mean(input,1),2))
Basically, the inner two mean take the mean of the input array, and the outer max will find the maximum value over the range. If you want, you can break it out step by step, and see what it does. The mean(input,1) will take the mean over the first dimension, mean(input,2) over the second, etc. After the first two means are done, all that is left is a vector, which the max function will easily work. It should be noted that the size of the vector pre-max is [1 1 3], the dimensions are preserved when doing this operation.
I've written this code to perform the 1-d convolution of a 2-d matrix valued function (k is my time index, kend is on the order of 10e3). Is there a faster or cleaner way to do this, perhaps using built in functions?
for k=1:kend
C(:,:,k)=zeros(3);
for l=0:k-1
C(:,:,k)=C(:,:,k)+A(:,:,k-l)*B(:,:,l+1);
end
end
NEW SOLUTION:
This is a newer solution built on the older solution, which solved the previously given formula. The code in the question is actually a modification of that formula, in which the overlap between the two matrices in the third dimension is repeatedly shifted (it's akin to a convolution along the third dimension of the data). The previous solution I gave only computed the result for the last iteration of the code in the question (i.e. k = kend). So, here's a full solution that should be much more efficient than the code in the question for kend on the order of 1000:
kend = size(A,3); %# Get the value for kend
C = zeros(3,3,kend); %# Preallocate the output
Anew = reshape(flipdim(A,3),3,[]); %# Reshape A into a 3-by-3*kend matrix
Bnew = reshape(permute(B,[1 3 2]),[],3); %# Reshape B into a 3*kend-by-3 matrix
for k = 1:kend
C(:,:,k) = Anew(:,3*(kend-k)+1:end)*Bnew(1:3*k,:); %# Index Anew and Bnew so
end %# they overlap in steps
%# of three
Even when using just kend = 100, this solution came out to be about 30 times faster for me than the one in the question and about 4 times faster than a pure for-loop-based solution (which would involve 5 loops!). Note that the discussion below of floating-point accuracy still applies, so it is normal and expected that you will see slight differences between the solutions on the order of the relative floating-point accuracy.
OLD SOLUTION:
Based on this formula you linked to in a comment:
it appears that you actually want to do something different than the code you provided in the question. Assuming A and B are 3-by-3-by-k matrices, the result C should be a 3-by-3 matrix and the formula from your link written out as a set of nested for loops would look like this:
%# Solution #1: for loops
k = size(A,3);
C = zeros(3);
for i = 1:3
for j = 1:3
for r = 1:3
for l = 0:k-1
C(i,j) = C(i,j) + A(i,r,k-l)*B(r,j,l+1);
end
end
end
end
Now, it is possible to perform this operation without any for loops by reshaping and reorganizing A and B appropriately:
%# Solution #2: matrix multiply
Anew = reshape(flipdim(A,3),3,[]); %# Create a 3-by-3*k matrix
Bnew = reshape(permute(B,[1 3 2]),[],3); %# Create a 3*k-by-3 matrix
C = Anew*Bnew; %# Perform a single matrix multiply
You could even rework the code you have in your question to create a solution with a single loop that performs a matrix multiply of your 3-by-3 submatrices:
%# Solution #3: mixed (loop and matrix multiplication)
k = size(A,3);
C = zeros(3);
for l = 0:k-1
C = C + A(:,:,k-l)*B(:,:,l+1);
end
So now the question: Which one of these approaches is faster/cleaner?
Well, "cleaner" is very subjective, and I honestly couldn't tell you which of the above pieces of code makes it any easier to understand what the operation is doing. All the loops and variables in the first solution make it a little hard to track what's going on, but it clearly mirrors the formula. The second solution breaks it all down into a simple matrix operation, but it's difficult to see how it relates to the original formula. The third solution seems like a middle-ground between the two.
So, let's make speed the tie-breaker. If I time the above solutions for a number of values of k, I get these results (in seconds needed to perform 10,000 iterations of the given solution, MATLAB R2010b):
k | loop | matrix multiply | mixed
-----+--------+-----------------+--------
5 | 0.0915 | 0.3242 | 0.1657
10 | 0.1094 | 0.3093 | 0.2981
20 | 0.1674 | 0.3301 | 0.5838
50 | 0.3181 | 0.3737 | 1.3585
100 | 0.5800 | 0.4131 | 2.7311 * The matrix multiply is now fastest
200 | 1.2859 | 0.5538 | 5.9280
Well, it turns out that for smaller values of k (around 50 or less) the for-loop solution actually wins out, showing once again that for loops are not as "evil" as they used to be considered in older versions of MATLAB. Under certain circumstances, they can be more efficient than a clever vectorization. However, when the value of k is larger than around 100, the vectorized matrix-multiply solution starts to win out, scaling much more nicely with increasing k than the for-loop solution does. The mixed for-loop/matrix-multiply solution scales atrociously for reasons that I'm not exactly sure of.
So, if you expect k to be large, I'd go with the vectorized matrix-multiply solution. One thing to keep in mind is that the results you get from each solution (the matrix C) will differ ever so slightly (on the level of the floating-point precision) since the order of additions and multiplications performed for each solution are different, thus leading to a difference in accumulation of rounding errors. In short, the difference between the results for these solutions should be negligible, but you should be aware of it.
Have you looked into Matlab's conv method?
I can't compare it against your provided code, because what you provided gives me a problem with trying to access the zeroth element of A. (When k=1, k-1=0.)
Have you considered using FFTs to convolve? A convolution operation is simply a point-wise multiplication in the frequency domain. You'll have to take some precaution with finite sequences, as you'll end up with circular convolution if you're not careful (but this is trivial to take care of).
Here's a simple example for a 1D case.
>> a=rand(4,1);
>> b=rand(3,1);
>> c=conv(a,b)
c =
0.1167
0.3133
0.4024
0.5023
0.6454
0.3511
The same using FFTs
>> A=fft(a,6);
>> B=fft(b,6);
>> C=real(ifft(A.*B))
C =
0.1167
0.3133
0.4024
0.5023
0.6454
0.3511
A convolution of an M point vector and an N point vector results in an M+N-1 point vector. So, I've padded each of the vectors a and b with zeros before taking the FFT (this is automatically taken care of when I take the 4+3-1=6 point FFT of it).
EDIT
Although the equation that you showed is similar to a circular convolution, it's not exactly it. So you can ditch the FFT approach, and the built-in conv* functions. To answer your question, here's the same operation done without explicit loops:
dim1=3;dim2=dim1;
dim3=10;
a=rand(dim1,dim2,dim3);
b=rand(dim1,dim2,dim3);
mIndx=cellfun(#(x)(1:x),num2cell(1:dim3),'UniformOutput',false);
fun=#(x)sum(reshape(cell2mat(cellfun(#(y,z)a(:,:,y)*b(:,:,z),num2cell(x),num2cell(fliplr(x)),'UniformOutput',false)),[dim1,dim2,max(x)]),3);
c=reshape(cell2mat(cellfun(#(x)fun(x),mIndx,'UniformOutput',false)),[dim1,dim2,dim3]);
mIndx here is a cell, where the ith cell contains a vector 1:i. This is your l index (as others have noted, please don't use l as a variable name).
The next line is an anonymous function that does the convolution operation, making use of the fact that the k index is just the l index flipped around. The operations are carried out on individual cells, and then assembled.
The last line actually performs the operations on the matrices.
The answer is the same as that obtained with the loops. However, you'll find that the looped solution is actually an order of magnitude faster (I averaged 0.007s for my code and 0.0006s for the loop). This is because the loop is pretty straightforward, whereas with this sort of nested construction, there's plenty of function call overheads and repeated reshaping that slow it down.
MATLAB's loops have come a long way since the early days when loops were dreaded. Certainly, vectorized operations are blazing fast; but not everything can be vectorized, and sometimes, loops are more efficient than such convoluted anonymous functions. I could probably shave off a few more tenths here and there by optimizing my construction (or maybe taking a different approach), but I'm not going to do that.
Remember that good code should be readable, as well as efficient and minor optimization at the cost of readability serves no one. Although I wrote the code above, I certainly will not be able to decipher what it does if I revisited it a month later. Your looped code was clear, readable and fast and I would suggest that you stick with it.
First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptible for a non-uniform distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine if it was a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that a mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). to get a 1-alpha-confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example say we wanted a 95% CI for the mean a sample of data X, then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % sample standard deviation of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing if a data sample is normally distribution can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is normal. If the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
You might consider, also, using bootstrapping, with the bootci function.
You may use the method proposed in [1]:
MEDIAN +/- 1.7(1.25R / 1.35SQN)
Where R = Interquartile Range,
SQN = Square Root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prcntile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used Matlab but from my understanding of statistics, if your distribution cannot be assumed to be normal distribution, then you have to take it as Student t distribution and calculate confidence Interval and accuracy.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm