I have a matrix of data X where rows are time stamps and columns are measurements. I can easily find the lowest sum path through the matrix by:
[r c]=size(X)
for w=1:r
Y(w)=min(X(w,:))
end
result = sum(Y)
this is useful, but what would be really useful is if there were a function that could tell me different paths for a specified frequency. For example if i group 2 rows together this halves the frequency...... If there was a function that could find me different paths with varying frequencies for a specified tolerance then rank them this would be perfect!
A lot to ask but there must be a statistical or mathematical tool that does this......
Not sure if I entirely understand the question, but if I read what you want this should do the trick for a fixed frequency:
frequency = 2;
r = size(X,1);
Y = zeros(r,1);
for w=1:frequency:r
Y(w)=min(min(X(w:w+frequency-1,:)))
end
result = sum(Y)
You can loop over frequencies to find the best path length for each frequency.
Note that finding the optimal path with varying frequencies (so for example first 2 then 3 then 2 again) would be a completely different problem. I think this is much more complex and that you may want to look into linear programming.
Related
I have a set of vectors containing some arbitrary shape like a triangle pulse with a single maxima.
I need to downsample these vectors by an integer factor.
The position of the maxima relative to the length of the vector should stay the same.
Below code shows, that when I do this, there is a bias=-0.0085 introduced by the downsampling step which should be zero on average.
The bias doesn't seem to change much depending on the number of vectors (tried between 200 and 800 vectors)
.
I also tried different resampling functions like downsample and decimate leading to the same results.
datapoints = zeros(1000,800);
for ii = 1:size(datapoints,2)
datapoints(ii:ii+18,ii) = [1:10,9:-1:1];
end
%downsample each column of the data
datapoints_downsampled = datapoints(1:10:end,:);
[~,maxinds_downsampled] = max(datapoints_downsampled);
[~,maxinds] = max(datapoints);
%bias needs to be zero
bias = mean(maxinds/size(datapoints,1)-maxinds_downsampled/size(datapoints_downsampled,1))
This graph shows, that there is a systematic bias that does not depend on the number of vectors
How to remove this bias? Is there a way to determine its magnitude given only one vector?
Where does it come from?
There are two main issues with the code:
Dividing the index by the length of the vector leads to a small bias: if the max is at the first element, then 1/1000 is not the same as 1/100, even though the subsampling preserved the element that contained the maximum. This needs to be corrected for by subtracting 1 before the division, and adding 1/1000 after the division.
Subsampling by a factor of 10 leads to a bias as well: since we're determining the integer location only, in 1/10 cases we preserve the location, in 4/10 cases we move the location in one direction, and in 5/10 cases we move the location in the other direction. The solution is to use an odd subsampling factor, or to determine the location of the maximum with sub-sample precision (this requires proper low-pass filtering before subsampling).
The code below is a modification of the code in the OP, it does a scatter plot of the error vs the location, as well as OP's bias plot. The first plot helps identify issue #2 above. I have made the subsampling factor and the offset for subsampling variables, I recommend that you play with these values to understand what is happening. I have also made the location of the maximum random to avoid a sampling bias. Note I also use N/factor instead of size(datapoints_downsampled,1). The size of the downsampled vector is the wrong value to use if N/factor is not integer.
N = 1000;
datapoints = zeros(N,800);
for ii = 1:size(datapoints,2)
datapoints(randi(N-20)+(1:19),ii) = [1:10,9:-1:1];
end
factor = 11;
offset = round(factor/2);
datapoints_downsampled = datapoints(offset:factor:end,:);
[~,maxinds_downsampled] = max(datapoints_downsampled,[],1);
[~,maxinds] = max(datapoints,[],1);
maxpos_downsampled = (maxinds_downsampled-1)/(N/factor) + offset/N;
maxpos = (maxinds)/N;
subplot(121), scatter(maxpos,maxpos_downsampled-maxpos)
bias = cumsum(maxpos_downsampled-maxpos)./(1:size(datapoints,2));
subplot(122), plot(bias)
I'd like to calculate the Euclidean distance between a vector G and each row of an array C, while dividing each row by a value in a vector GSD. What I've done seems very inefficient. What's my biggest overhead?
Could I speed it up?
m=1E7;
G=1E5*rand(1,8);
C=1E5*[zeros(m,1),rand(m,8)];
GSD=10*rand(1,8);
%I've taken the log10 of the values because G and C are very large in magnitude.
%Don't know if it's worth it.
for i=1:m
dG(i,1)=norm((log10(G)-log10(C(i,2:end)))/log10(GSD));
end
Using the examples from below, they don't all give the same answer. In fact none of them give the same answer (see following figure using:
dG = pdist2(log10(G),log10(C(:,2:end)),'mahalanobis',diag(log10(GSD))); %(1)
dG = sqrt(sum((log10(G)-log10(C(:,2:end))./log10(GSD)).^2,2));
tmp=bsxfun(#rdivide,bsxfun(#minus,log10(G),log10(C(:,2:end))),log10(GSD)); %(4)
dG = sqrt(sum(tmp.^2,2));
You can use pdist2(x,y) to calculate the pairwise distance between all elements in x and y, thus your example would be something like
dG = pdist2(log10(G),log10(C(:,2:end)),'mahalanobis',diag(log10(GSD)).^2);
where the name-pair 'mahalanobis',diag(log10(GSD)).^2 puts log10(GSD) as weights on the Eucledean, which is the known as the Mahalanobis distance.
Note that the Mahalanobis distance is originally intented for normalising data, thus it is the "covariance" which have to be put as the fourth input, which MATLAB then finds the Cholesky decomposition of (element-wise squareroot when diagonal, as here).
Implicit expansion
In newer MATLAB editions, one can also just just the implcit expansion as the first entry is only 1 vector.
dG = sqrt(sum(((log10(G)-log10(C(:,2:9)))./log10(GSD)).^2,2));
which is probably a tad faster, I do, however, prefer the pdist2 solution as I find it clearer.
The floating point should handle the large magnitude of the input data, up to a certain point with float data and with any reasonable value with double data
realmax('single')
ans =
3.4028e+38
realmax('double')
ans =
1.7977e+308
With 1e7 values in the +/- 1e5 range, you may expect the square of the Euclidian distance to be in the +/- 1e17 range (5+5+7), which both formats will handle with ease.
In any case, you should vectorize the code to remove the loop (which Matlab has a history of handling very inefficiently, especially in older versions)
With new versions (2016b and later), simply use:
tmp=(log10(G)-log10(C(:,2:end)))./log10(GSD);
dG = sqrt(sum(tmp.^2,2)); %row-by-row norm
Note that you have to use ./ which is a element-wise division, not / which is matrix right division.
The following code will work everywhere
tmp=bsxfun(#rdivide,bsxfun(#minus,log10(G),log10(C(:,2:end))),log10(GSD));
dG = sqrt(sum(tmp.^2,2)); %row-by-row norm
I however believe that the use of log10 is a mathematical error. The result dG will not be the Euclidian norm. You should stick with the root mean square of the weighted difference:
dG = sqrt(sum(bsxfun(#rdivide,bsxfun(#minus,G,C(:,2:end)),GSD).^2,2)); % all versions
dG = sqrt(sum((G-C(:,2:end)./GSD).^2,2)); %R2016b and later
I have two variables in a .mat file here:
https://www.yousendit.com/download/UW13UGhVQXA4NVVQWWNUQw
testz is a vector of cumulative distance (in meters, monotonically and regularly increasing)
testSDT is a vector of integrated (cumulative) sound wave travel time (in milliseconds) generated using the distance vector and a vector of velocities
(there is an intermediate step of creating interval travel times)
Since velocity is a continuously variable function the resulting interval travelt times and also the integrated travel times are non integers and variable in magnitude
What I want is to resample the distance vector at regular time intervals (e.g. 1 ms, 2 ms, ..., n ms)
What makes it difficult is that the maximum travel time, 994.6659, is less than the number of samples in the 2 vectors, therefore it is not straightforward to use interp1.
i.e.:
X=testSDT -> 1680 samples
Y=testz -> 1680 samples
XI=[1:1:994] -> 994 samples
This is the code I've come up with. It is a working code and it is not too bad I think.
%% Initial chores
M=fix(max(testSDT));
L=(1:1:M);
%% Create indices
% this loops finds the samples in the integrated travel time vector
% that are closest to integer milliseconds and their sample number
for i=1:M
[cl(i) ind(i)] = min(abs(testSDT-L(i)));
nearest(i) = testSDT(ind(i));
end
%% Remove duplicates
% this is necessary to remove duplicates in the index vector (happens in this test).
% For example: 2.5 ms would be the closest to both 2 ms and 2 ms
[clsst,ia,ic] = unique(nearest);
idx=(ind(ia));
%% Interpolation
% this uses the index vectors to resample the depth vectors at
% integer times
newz=interp1(clsst,testz(idx),[1:1:length(idx)],'cubic')';
As far as I can see there is one issue with this code:
I rely on the vector idx as my XI for interpolation. Vector idx is 1 sample shorter than vector ind (one duplicate was removed).
Therefore my new times will stop one millisecond short. This is a very small issue, and duplicate are unlikely but I am wondering if anybody can think of a workaround, or of a different way to approach the problem altogether.
Thank you
If I understand you correctly, you want to extrapolate to that extra point.
you can do this is many ways, one is to add that extra point to the interp1 line.
If you have some function you expect to follow your data you can use it by fitting it to the data and then obtaining that extra point or with a tool like fnxtr.
But I have a problem understanding what you want because of the way you used the line. The third argument you use, [1:1:length(idx)], is just the series [1 2 3 ...], usually when interpolating, one uses some vector x_i of points of interest, though I doubt your points of interest happen to be the series of integers 1:length(idx), what you want is just [1:length(idx) xi], where xi is that extra point x-axis value.
EDIT:
Instead of the loop just produce matrix forms out of L and testSDT, then matrix operation is somewhat faster in doing the min(abs(...:
MM=ones(numel(testSDT),1)*L;
TT=testSDT*ones(1,numel(L));
[cl ind]=(min(abs(TT-MM)));
nearest=testSDT(ind);
First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptible for a non-uniform distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine if it was a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions
You are right that a mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). to get a 1-alpha-confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example say we wanted a 95% CI for the mean a sample of data X, then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % sample standard deviation of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing if a data sample is normally distribution can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is normal. If the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
You might consider, also, using bootstrapping, with the bootci function.
You may use the method proposed in [1]:
MEDIAN +/- 1.7(1.25R / 1.35SQN)
Where R = Interquartile Range,
SQN = Square Root of N
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prcntile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used Matlab but from my understanding of statistics, if your distribution cannot be assumed to be normal distribution, then you have to take it as Student t distribution and calculate confidence Interval and accuracy.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
I have 2 800x1 arrays in Matlab which contain my amplitude vs. frequency data, one array contains the magnitude, the other contains the corresponding values for frequency. I want to find the frequency at which the amplitude has reduced to half of its maximum value.
What would be the best way to do this? I suppose my two main concerns are: if the 'half amplitude' value lies between two data points, how can I find it? (e.g. if the value I'm looking for is 5, how can I "find it in my data" if it lies between two data points such as 4 and 6?)
and if I find the 'half amplitude' value, how do I then find the corresponding value for frequency?
Thanks in advance for your help!
You can find the index near your point of interest by doing
idx = magnitudes >= (max(magnitude)/2);
And then you can see all the corresponding frequencies, including the peak, by doing
disp(frequencies(idx))
You can add more conditions to the idx calculation if you want to see less extraneous stuff.
However, your concern about finding the exact frequency is harder to answer. It will depend heavily on the nature of the signal and also on the lineshape of your window function. In general, you might be better off trying to characterize your peak with a few points and then doing a curvefit of some kind. Are you trying to calculate Q of a resonant filter, by any chance?
If it's ok, you can do simple linear interpolation. Find segments where the drop occurs and calculate intermediate values. That will be no good, if you expect noise in the signal.
idx = find(magnitudes(2:end) <= (max(magnitudes)/2) & ...
magnitudes(1:end-1) >= (max(magnitudes)/2));
mag1 = magnitudes(idx); % magnitudes of points before drop
mag2 = magnitudes(idx+1); % magnitudes of points after drop below max/2
fr1 = frequencies(idx); % frequencies just before drop
fr2 = frequencies(idx+1); % frequencies after drop below max/2
magx = max(magnitudes)/2; % max/2
frx = (magx-mag2).*(fr1-fr2)./(mag1-mag2) + fr2; % estimated frequencies
You can also use INTERP1 function.