I have a list of coordinates, coord, which looks like this when plotted:
I want to remove the long string of points that goes completely from 0 to 1 from the data set, shown on this plot starting at (0, 11) and ending at (1, 11) and the other one that begins at (0, 24) and ends at (1, 28).
So far, I have tried using kmeans to group the data by height using this code:
jet = colormap('jet');
amount = 20;
step = floor(numel(jet(:,1))/amount);
idxOIarr = cell(numel(terp));
scale = 100;
for ii = 1:numel(terp)
figure;
hold on;
expandDat = [stretched{ii}(:,1), scale.*log(terp{ii}(:,2))];
[idx, cent] = kmeans(expandDat(:,1:2), amount, 'Distance', 'cityblock');
idxOIarr{ii} = idx;
for jj = 1:amount
scatter(stretched{ii}(idx == jj,1), FREQ(terp{ii}(idx == jj,2)), 10, jet(step*jj,:), 'filled');
end
end
resulting in this image: Although it does separate the higher rows quite well, it breaks the line in the middle in two and groups the line that begins at (0,20) with some data points below it.
Is there any other way to group and remove these points?
The most efficient way to solve this involves building a graph where each point is a vertex. You join points that you consider "connected" or "close" with an edge. The graph will then have connected components, and you need to look for the connected components that span the whole range from 0 to 1.
Build the graph. Finding neighbors is most efficient using an R-tree. Here are some suggestions. You can also use a k-d tree, for example. However, this is not strictly necessary, it just can get really slow without a proper spatial indexing structure, because you'll have to compare distances between each pair of points.
Given a Nx2 matrix coord, you can find the square distances between each pair:
D = sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2,3);
(note again that this is expensive if N is large, and in that case using an R-tree will speed things up significantly). D(i,j) is the squared distance between the points with indices i and j, i.e. coord(i,:) and coord(j,:).
Next, build the graph, G, nodes i and j are connected if G(i,j)==1. G is a symmetric matrix:
G = D <= max_distance^2; % D holds squared distances, so square the threshold
Find connected components. A connected component is just a set of nodes that you can reach from each other by following edges. You don't really need to find all connected components, you just need to find the set of points that have x=0, and starting from each, recursively visit all elements in its connected component to see if you can reach a point that has x=1.
This next code is not tested, but hopefully it gives a starting point:
start_indices = find(coord(:,1)==0); % Is exact equality appropriate here?
end_indices = find(coord(:,1)==1);
to_remove = [];
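for ii=start_indices.'
% For each point with x=0, see if we can reach any of the points at x=1.
% Reset the visited mask for every search: a search that returns early
% leaves its component only partially marked.
visited = false(size(coord,1), 1);
[res, visited] = can_reach(ii, end_indices, G, visited);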
if res
% For this point we can, remove it!
to_remove(end+1) = ii;
end
end
% Recursive function to visit all nodes in a connected component
function [res, visited] = can_reach(start, end_indices, G, visited)
res = false; % default, so that res is defined when no end point is reachable
visited(start) = true;
if any(start==end_indices)
% We've reached an end point, stop recursing and return true.
res = true;
return;
end
next = find(G(start,:)); % find neighbors
next(visited(next)) = []; % remove visited neighbors
for ii=next
[res, visited] = can_reach(ii, end_indices, G, visited);
if res
% Yes, we can visit an end point, stop iterating now.
return
end
end
end
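The recursive search may run into MATLAB's recursion limit on long chains of points, and the loop above flags only the starting point rather than the whole line. As a minimal sketch of an alternative, the script below labels every connected component with a stack-based flood fill and then drops each component that contains both an x=0 and an x=1 point. The toy coord, the value of max_distance, and the names labels, spans and cleaned are assumptions made for this illustration, and the broadcasting in the distance computation needs R2016b or later.

```matlab
% Toy data (assumption): one chain spanning x = 0..1 at y = 1, plus two stray points.
coord = [(0:0.25:1).', ones(5,1); 0.1 3; 0.2 3];
max_distance = 0.3;

% Squared pairwise distances and adjacency matrix, as described in the text.
D = sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2, 3);
G = D <= max_distance^2;

% Label connected components without recursion.
n = size(coord,1);
labels = zeros(n,1);              % 0 means "not visited yet"
current = 0;
for s = 1:n
    if labels(s) ~= 0
        continue;                 % already assigned to a component
    end
    current = current + 1;
    labels(s) = current;
    stack = s;
    while ~isempty(stack)
        node = stack(end);
        stack(end) = [];
        nb = find(G(node,:) & (labels == 0).');   % unlabelled neighbours
        labels(nb) = current;
        stack = [stack, nb];
    end
end

% A component is removed when it touches both x = 0 and x = 1.
at_start = unique(labels(coord(:,1) == 0));
at_end   = unique(labels(coord(:,1) == 1));
spans    = intersect(at_start, at_end);
cleaned  = coord(~ismember(labels, spans), :);
```

With real data the exact comparisons against 0 and 1 would probably need a tolerance, as the comment in the answer's code already hints.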
Apologies for the long post but this takes a bit to explain. I'm trying to make a script that finds the longest linear portion of a plot. Sample data is in a csv file here, it is stress and strain data for calculating the shear modulus of 3D printed samples. The code I have so far is the following:
x_data = [];
y_data = [];
x_data = Data(:,1);
y_data = Data(:,2);
plot(x_data,y_data);
grid on;
answer1 = questdlg('Would you like to load last attempt''s numbers?');
switch answer1
case 'Yes'
[sim_slopes,reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
case 'No'
disp('Take a look at the plot, find a range estimate, and press any button to continue');
pause;
prompt = {'Eliminate values ABOVE this x-value:','Eliminate values BELOW this x-value:','Size of divisions on x-axis:','Factor for similarity of slopes:'};
dlg_title = 'Point elimination';
num_lines = 1;
defaultans = {'0','0','0','0.1'};
if isempty(answer2) < 1
defaultans = {answer2{1},answer2{2},answer2{3},answer2{4}};
end
answer2 = inputdlg(prompt,dlg_title,num_lines,defaultans);
uv_of_x_range = str2num(answer2{1});
lv_of_x_range = str2num(answer2{2});
x_div_size = str2num(answer2{3});
K = str2num(answer2{4});
close all;
iB = find(x_data > str2num(answer2{1}),1,'first');
iS = find(x_data > str2num(answer2{2}),1,'first');
new_x_data = x_data(iS:iB);
new_y_data = y_data(iS:iB);
[sim_slopes, reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
end
[longest_section0, Midx]= max(sim_slopes(:,4)-sim_slopes(:,3));
longest_section=1+longest_section0;
long_sec_x_data_start = x_div_size*(sim_slopes(Midx,3)-1)+lv_of_x_range;
long_sec_x_data_end = x_div_size*(sim_slopes(Midx,4)-1)+lv_of_x_range;
long_sec_x_data_start_idx=find(new_x_data >= long_sec_x_data_start,1,'first');
long_sec_x_data_end_idx=find(new_x_data >= long_sec_x_data_end,1,'first');
long_sec_x_data = new_x_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
long_sec_y_data = new_y_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
[b_long_sec, longes_section_reg_data] = robustfit(long_sec_x_data,long_sec_y_data);
plot(long_sec_x_data,b_long_sec(1)+b_long_sec(2)*long_sec_x_data,'LineWidth',3,'LineStyle',':','Color','k');
function [sim_slopes,reg_data] = regr_and_longest_part(x_points,y_points,x_div,lv,K)
reg_data = cell(1,3);
scatter(x_points,y_points,'.');
grid on;
hold on;
uv = lv+x_div;
ii=0;
while lv <= x_points(end)
if uv > x_points(end)
uv = x_points(end);
end
ii=ii+1;
indices = find(x_points>lv & x_points<uv);
temp_x_points = x_points((indices));
temp_y_points = y_points((indices));
if length(temp_x_points) <= 2
break;
end
[b,stats] = robustfit(temp_x_points,temp_y_points);
reg_data{ii,1} = b(1);
reg_data{ii,2} = b(2);
reg_data{ii,3} = length(indices);
plot(temp_x_points,b(1)+b(2)*temp_x_points,'LineWidth',2);
lv = lv+x_div;
uv = lv+x_div;
end
sim_slopes = NaN(length(reg_data),4);
sim_slopes(1,:) = [reg_data{1,1},0,1,1];
idx=1;
for ii=2:length(reg_data)
coff =sim_slopes(idx,1);
if abs(reg_data{ii,1}-coff) <= K*coff
C=zeros(ii-sim_slopes(idx,3)+1,1);
for kk=sim_slopes(idx,3):ii
C(kk)=reg_data{kk,1};
end
sim_slopes(idx,1)=mean(C);
sim_slopes(idx,2)=std(C);
sim_slopes(idx,4)=ii;
else
idx = idx + 1;
sim_slopes(idx,1)=reg_data{ii,1};
sim_slopes(idx,2)=0;
sim_slopes(idx,3)=ii;
sim_slopes(idx,4)=ii;
end
end
end
Apologies for the code not being well optimized, I'm still relatively new to MATLAB. I did not use derivatives because my data is relatively noisy and derivation might have made it worse.
I've managed to get the code to find the longest straight part of the plot by splitting the data up into sections of width x_div_size and then performing a robustfit on each section, the results of which are written into reg_data. The code then runs through reg_data and finds which lines have the most similar slopes, as determined by the K factor, by calculating the average of the slopes in a section of the plot, and makes a note of it in sim_slopes. It then finds the longest interval with max(sim_slopes(:,4)-sim_slopes(:,3)) and performs a regression on it to give the final answer.
The problem is that it will only consider the first straight portion that it comes across. When the data is plotted, it has a few parts where it seems straightest:
As an example, when I run the script with answer2 = {'0.2','0','0.0038','0.3'} I get the following, where the black line is the straightest part found by the code:
I have the following questions:
It's clear that from about x = 0.04 to x = 0.2 there is a long straight part and I'm not sure why the script is not finding it. Playing around with different values the script always seems to pick the first longest straight part, ignoring subsequent ones.
MATLAB complains that Warning: Iteration limit reached. because there are more than 50 regressions to perform. Is there a way to bypass this limit on robustfit?
When generating sim_slopes there might be section of the plot whose slope is too different from the average of the previous slopes so it gets marked as the end of a long section. But that section sometimes is sandwiched between several other sections on either side which instead have similar slopes. How would it be possible to tell the script to ignore one wayward section and to continue as if it falls within the tolerance allowed by the K value?
Take a look at the Douglas-Peucker algorithm. If you think of your (x,y) values as the vertices of an (open) polygon, this algorithm will simplify it for you, such that the largest distance from the simplified polygon to the original is smaller than some threshold you can choose. The simplified polygon will be the set of straight lines. Find the two consecutive vertices that are furthest apart, and you're done.
MATLAB has an implementation in the Mapping Toolbox called reducem. You might also find an implementation on the File Exchange (but be careful, there is also really bad code on there). Or, you can roll your own, it's quite a simple algorithm.
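For anyone rolling their own, here is a minimal sketch of Douglas-Peucker written with an explicit stack instead of recursion, followed by the search for the longest straight piece. The toy curve, the tolerance tol and all variable names are assumptions for illustration. It also assumes that no two span end points coincide, since the chord length would then be zero.

```matlab
% Toy curve (assumption): flat, then a long straight ramp, then flat again.
x = (0:10).';
y = [0 0 0 1 2 3 4 5 5 5 5].';
tol = 0.1;                       % maximum allowed deviation (assumption)

keep = false(numel(x),1);
keep([1 end]) = true;            % the end points are always kept
stack = [1, numel(x)];           % each row is a span [first last] still to examine
while ~isempty(stack)
    i1 = stack(end,1);
    i2 = stack(end,2);
    stack(end,:) = [];
    if i2 - i1 < 2
        continue;                % no interior points in this span
    end
    % Perpendicular distance of the interior points to the chord i1 -> i2.
    dx = x(i2) - x(i1);
    dy = y(i2) - y(i1);
    idx = (i1+1:i2-1).';
    d = abs(dx*(y(idx) - y(i1)) - dy*(x(idx) - x(i1))) / hypot(dx, dy);
    [dmax, k] = max(d);
    if dmax > tol
        % Keep the farthest point and examine both halves.
        keep(idx(k)) = true;
        stack = [stack; i1, idx(k); idx(k), i2];
    end
end

v = find(keep);                              % vertices of the simplified polyline
seg_len = hypot(diff(x(v)), diff(y(v)));     % length of each straight piece
[~, s] = max(seg_len);
longest = [v(s), v(s+1)];                    % indices bounding the longest piece
```

On noisy data the result depends strongly on tol, so it is worth trying a few values against a plot.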
You can also try using the ischange function to detect changes in the intercept and slope of the data, and then extract the longest portion from that.
Using the sample data you provided, here is what I see from a basic attempt:
>> T = readtable('Data.csv');
>> T = rmmissing(T); % Remove rows with NaN
>> T = groupsummary(T,'Var1','mean'); % Average duplicate timestamps
>> [tf,slopes,intercepts] = ischange(T.mean_Var2, 'linear', 'SamplePoints', T.Var1); % find changes
>> plot(T.Var1, T.mean_Var2, T.Var1, slopes.*T.Var1 + intercepts)
which generates the plot
You should be able to extract the longest segment based on the indices given by find(tf).
You can also tune the parameters of ischange to get fewer or more segments. Adding the name-value pair 'MaxNumChanges' with a value of 4 or 5 produces more linear segments with a tighter fit to the curve, for example, which effectively removes the kink in the plot that you see.
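Extracting the longest segment from tf could look roughly like the sketch below. It uses toy stand-ins for T.Var1 and tf so that it runs on its own, and it assumes that a true entry in tf marks the first sample of a new segment; check that against the ischange documentation for your release. The variable names are made up for this illustration.

```matlab
% Toy stand-ins (assumptions) for T.Var1 and the tf output of ischange.
x  = (0:9).';
tf = logical([0 0 0 1 0 0 0 0 1 0]).';

cp = find(tf);
cp(cp == 1) = [];                            % a change at the first sample starts no new segment

starts = [1; cp];                            % first index of each segment
stops  = [cp - 1; numel(tf)];                % last index of each segment
[~, k] = max(x(stops) - x(starts));          % widest segment along x
longest_idx = starts(k):stops(k);            % indices of the longest segment
```

The slope and intercept of that segment can then be read from slopes(longest_idx(1)) and intercepts(longest_idx(1)).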
I'm trying to estimate the (unknown) original datapoints that went into calculating a (known) moving average. However, I do know some of the original datapoints, and I'm not sure how to use that information.
I am using the method given in the answers here: https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average, but in MATLAB (my code below). This method works quite well for large numbers of data points (>1000), but less well with fewer data points, as you'd expect.
window = 3;
datapoints = 150;
data = 3*rand(1,datapoints)+50;
moving_averages = [];
for i = window:size(data,2)
moving_averages(i) = mean(data(i+1-window:i));
end
length = size(moving_averages,2)+(window-1);
a = (tril(ones(length,length),window-1) - tril(ones(length,length),-1))/window;
a = a(1:length-(window-1),:);
ai = pinv(a);
daily = mtimes(ai,moving_averages');
x = 1:size(data,2);
figure(1)
hold on
plot(x,data,'Color','b');
plot(x(window:end),moving_averages(window:end),'Linewidth',2,'Color','r');
plot(x,daily(window:end),'Color','g');
hold off
axis([0 size(x,2) min(daily(window:end))-1 max(daily(window:end))+1])
legend('original data','moving average','back-calculated')
Now, say I know a smattering of the original data points. I'm having trouble figuring how might I use that information to more accurately calculate the rest. Thank you for any assistance.
You should be able to calculate the original data exactly if at any point you know all but one sample of a single window, i.e. n-1 samples in a window of length n. In your case, if you know A, B and (A+B+C)/3, you can solve for C. Then, given (B+C+D)/3 (the next moving average), you can solve exactly for D. Rinse and repeat. This logic works going backwards too.
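A minimal sketch of that forward recursion, assuming the first win-1 samples happen to be the known ones (the toy data and the variable names are assumptions for illustration):

```matlab
win  = 3;
data = [5 7 6 9 4 8 3].';                       % toy data (assumption)
m    = filter(ones(win,1)/win, 1, data);        % trailing moving average
m    = m(win:end);                              % keep only the full windows

rec = nan(size(data));
rec(1:win-1) = data(1:win-1);                   % the samples assumed to be known
for k = win:numel(data)
    % Window sum minus the win-1 samples already recovered gives the new sample.
    rec(k) = win*m(k-win+1) - sum(rec(k-win+1:k-1));
end
```

Rounding errors accumulate along the recursion, so for long series it may be worth restarting from every stretch of known samples.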
Here is an example with the same idea:
% the actual vector of values
a = cumsum(rand(150,1) - 0.5);
% compute moving average
win = 3; % sliding window length
idx = hankel(1:win, win:numel(a));
m = mean(a(idx));
% coefficient matrix: m(i) = sum(a(i:i+win-1))/win
A = repmat([ones(1,win) zeros(1,numel(a)-win)], numel(a)-win+1, 1);
for i=2:size(A,1)
A(i,:) = circshift(A(i-1,:), [0 1]);
end
A = A / win;
% solve linear system
%x = A \ m(:);
x = pinv(A) * m(:);
% plot and compare
subplot(211), plot(1:numel(a),a, 1:numel(m),m)
legend({'original','moving average'})
title(sprintf('length = %d, window = %d',numel(a),win))
subplot(212), plot(1:numel(a),a, 1:numel(a),x)
legend({'original','reconstructed'})
title(sprintf('error = %f',norm(x(:)-a(:))))
You can see the reconstruction error is very small, even using the data sizes in your example (150 samples with a 3-samples moving average).
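To make use of samples that are already known, one option is to move their columns to the right-hand side and solve only for the remaining unknowns. The moving averages alone leave win-1 degrees of freedom undetermined, so the reduced system has a unique solution once the known indices cover at least win-1 different positions modulo win. The sketch below is self-contained; the known indices and the variable names are assumptions for illustration.

```matlab
win = 3;
a   = cumsum(rand(150,1) - 0.5);                % "true" data, built as in the example above
N   = numel(a);

% Coefficient matrix: row i averages a(i:i+win-1).
A = zeros(N-win+1, N);
for i = 1:N-win+1
    A(i, i:i+win-1) = 1/win;
end
m = A * a;                                      % the known moving averages

known   = [10 50 51];                           % indices of the known samples (assumption)
unknown = setdiff(1:N, known);

% Move the known samples to the right-hand side and solve for the rest.
rhs = m - A(:,known) * a(known);
x = zeros(N,1);
x(known)   = a(known);
x(unknown) = A(:,unknown) \ rhs;

err = norm(x - a);                              % reconstruction error
```

If the averages are noisy the system is no longer consistent, and the backslash solve then returns a least-squares fit instead of an exact reconstruction.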
I'm trying to code a loop in Matlab that iteratively solves for an optimal vector s of zeros and ones. This is my code
N = 150;
s = ones(N,1);
for i = 1:N
if s(i) == 0
i = i + 1;
else
i = i;
end
select = s;
HI = (item_c' * (weights.*s)) * (1/(weights'*s));
s(i) = 0;
CI = (item_c' * (weights.*s)) * (1/(weights'*s));
standarderror_afterex = sqrt(var(CI - CM));
standarderror_priorex = sqrt(var(HI - CM));
ratio = (standarderror_afterex - standarderror_priorex)/(abs(mean(weights.*s) - weights'*select));
ratios(i) = ratio;
s(i) = 1;
end
[M,I] = min(ratios);
s(I) = 0;
This code sets the element to zero in s, which has the lowest ratio. But I need this procedure to start all over again, using the new s with one zero, to find the ratios and exclude the element in s that has the lowest ratio. I need that over and over until no ratios are negative.
Do I need another loop, or do I miss something?
I hope that my question is clear enough, just tell me if you need me to explain more.
Thank you in advance, for helping out a newbie programmer.
Edit
I think that I need to add some form of while loop as well. But I can't see how to structure this. This is the flow that I want
With all items included (s(i) = 1 for all i), calculate HI, CI and the standard errors and list the ratios, exclude item i (s(I) = 0) which corresponds to the lowest negative ratio.
With the new s, including all ones but one zero, calculate HI, CI and the standard errors and list the ratios, exclude item i, which corresponds to the lowest negative ratio.
With the new s, now including all ones but two zeros, repeat the process.
Do this until there is no negative element in ratios to exclude.
Hope that it got more clear now.
Ok. I want to go through a few things before I list my code. These are just how I would try to do it. Not necessarily the best way, or fastest way even (though I'd think it'd be pretty quick). I tried to keep the structure as you had in your code, so you could follow it nicely (even though I'd probably meld all the calculations down into a single function or line).
Some features that I'm using in my code:
bsxfun: Learn this! It is amazing how it works and can speed up code, and makes some things easier.
v = rand(n,1);
A = rand(n,4);
% The two lines below compute the same value:
W = bsxfun(@(x,y)x.*y,v,A);
W_= repmat(v,1,4).*A;
bsxfun dot multiplies the v vector with each column of A.
Both W and W_ are matrices the same size as A, but the first will be much faster (usually).
Precalculating dropouts: I made select a matrix, where before it was a vector. This allows me to then form a variable included using logical constructs. The ~(eye(N)) produces an identity matrix and negates it. By logically "and"ing it with select, the i-th column becomes select with its i-th element dropped out.
You were explicitly calculating weights'*s as the denominator in each for-loop. By using the above matrix to calculate this, we can now do a sum(W), where the W is essentially weights.*s in each column.
Take advantage of column-wise operations: the var() and the sqrt() functions are both coded to work along the columns of a matrix, outputting the action for a matrix in the form of a row vector.
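As a side note on the bsxfun item: since R2016b MATLAB expands singleton dimensions implicitly for element-wise operators, so the same product can be written without bsxfun. A small check, with n chosen arbitrarily for the illustration:

```matlab
n = 5;                             % arbitrary size (assumption)
v = rand(n,1);
A = rand(n,4);

W_bsx = bsxfun(@times, v, A);      % explicit singleton expansion
W_imp = v .* A;                    % implicit expansion, R2016b and later
same  = isequal(W_bsx, W_imp);     % both scale each column of A by v
```

On older releases only the bsxfun form works.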
Ok. the full thing. Any questions let me know:
% Start with everything selected:
select = true(N);
stop = false; % Stopping flag:
while (~stop)
% Each column leaves a variable out...
included = ~eye(N) & select;
% This calculates the weights with leave-one-out:
W = bsxfun(@(x,y)x.*y,weights,included);
% You can comment out the line below, if you'd like...
W_= repmat(weights,1,N).*included; % This is the same as previous line.
% This calculates the weights before dropping the variables:
V = bsxfun(@(x,y)x.*y,weights,select);
% There's different syntax, depending on whether item_c is a
% vector or a matrix...
if(isvector(item_c))
HI = (item_c' * V)./(sum(V));
CI = (item_c' * W)./(sum(W));
else
% For example: item_c is a matrix...
% We have to use bsxfun() again
HI = bsxfun(@rdivide, (item_c' * V),sum(V));
CI = bsxfun(@rdivide, (item_c' * W),sum(W));
end
standarderror_afterex = sqrt(var(bsxfun(@minus,CI,CM)));
standarderror_priorex = sqrt(var(bsxfun(@minus,HI,CM)));
% or:
%
% standarderror_afterex = sqrt(var(CI - repmat(CM,1,size(CI,2))));
% standarderror_priorex = sqrt(var(HI - repmat(CM,1,size(HI,2))));
ratios = (standarderror_afterex - standarderror_priorex)./(abs(mean(W) - sum(V)));
% Identify the negative ratios:
negratios = ratios < 0;
if ~any(negratios)
% Drop out of the while-loop:
stop = true;
else
% Find the most negative ratio:
neginds = find(negratios);
[mn, mnind] = min(ratios(negratios));
% Drop out the most negative one...
select(neginds(mnind),:) = false;
end
end % end while(~stop)
% Your output:
s = select(:,1);
If for some reason it doesn't work, please let me know.
The workspace is given as:
limits=[-1 4; -1 4; -1 4];
And in this workspace, there is a spherical obstacle which is defined as:
obstacle.origin_x=1.6;
obstacle.origin_y=0.8;
obstacle.origin_z=0.2;
obstacle.radius_obs=0.2;
save('obstacle.mat', 'obstacle');
I would like to create random points within limits. I created random points using the code below:
function a=rndmpnt(lim, numofpoint)
x=lim(1,1)+(lim(1,2)-lim(1,1))*rand(1,numofpoint);
y=lim(2,1)+(lim(2,2)-lim(2,1))*rand(1,numofpoint);
z=lim(3,1)+(lim(3,2)-lim(3,1))*rand(1,numofpoint);
a=[x y z];
Now I would like to eliminate the points in the area of limits-obstacle. how can I do that?
You want to reject the points within the obstacle. Naturally, after rejection you will probably end up with fewer points than numofpoint. So the process will need to be repeated until enough points are generated. A while loop is appropriate here.
Rejection is done by finding ix (indices of acceptable points) and appending only those points to matrix a. The loop repeats until there are enough of those, and returns exactly the number requested.
function a = rndmpnt(lim, numofpoint)
S = load('obstacle.mat'); % the obstacle struct saved in the question
obstacle = S.obstacle;
a = zeros(3,0); % begin with empty matrix
while size(a,2) < numofpoint % not enough points yet
x=lim(1,1)+(lim(1,2)-lim(1,1))*rand(1,numofpoint);
y=lim(2,1)+(lim(2,2)-lim(2,1))*rand(1,numofpoint);
z=lim(3,1)+(lim(3,2)-lim(3,1))*rand(1,numofpoint);
% ix is true for points outside the spherical obstacle
ix = (x - obstacle.origin_x).^2 + (y - obstacle.origin_y).^2 + (z - obstacle.origin_z).^2 > obstacle.radius_obs^2;
a = [a, [x(ix); y(ix); z(ix)]];
end
a = a(:, 1:numofpoint); % return exactly the number requested
end
You may want to add a safeguard against infinite loop (some limit on the number of cycles) in case the user passes in the values such that there are no acceptable points.
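One way to add such a safeguard is to cap the number of batches, as in the sketch below. It is written as a plain script so that it runs on its own; the obstacle values are copied from the question, while max_iter, the error identifier and the other variable names are hypothetical. The broadcasting in the sampling line needs R2016b or later.

```matlab
lim = [-1 4; -1 4; -1 4];
c   = [1.6; 0.8; 0.2];      % obstacle centre, values from the question
r   = 0.2;                  % obstacle radius
numofpoint = 100;
max_iter   = 1000;          % hypothetical cap on the number of batches

a = zeros(3,0);
iter = 0;
while size(a,2) < numofpoint
    iter = iter + 1;
    if iter > max_iter
        error('rndmpnt:noFreeSpace', 'Gave up after %d batches.', max_iter);
    end
    % Draw a batch of candidates, one column per point.
    p = lim(:,1) + (lim(:,2) - lim(:,1)) .* rand(3, numofpoint);
    ix = sum((p - c).^2, 1) > r^2;       % true for points outside the sphere
    a = [a, p(:,ix)];
end
a = a(:, 1:numofpoint);                  % return exactly the number requested
```

Raising an error is only one choice; returning the points found so far together with a warning would work as well.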