Remove long row of data points in Matlab

I have a list of coordinates, coord, which looks like this when plotted:
I want to remove the long string of points that goes completely from 0 to 1 from the data set, shown on this plot starting at (0, 11) and ending at (1, 11) and the other one that begins at (0, 24) and ends at (1, 28).
So far, I have tried using kmeans to group the data by height using this code:
jet = colormap('jet');
amount = 20;
step = floor(numel(jet(:,1))/amount);
idxOIarr = cell(numel(terp));
scale = 100;
for ii = 1:numel(terp)
figure;
hold on;
expandDat = [stretched{ii}(:,1), scale.*log(terp{ii}(:,2))];
[idx, cent] = kmeans(expandDat(:,1:2), amount, 'Distance', 'cityblock');
idxOIarr{ii} = idx;
for jj = 1:amount
scatter(stretched{ii}(idx == jj,1), FREQ(terp{ii}(idx == jj,2)), 10, jet(step*jj,:), 'filled');
end
end
This results in the following image:
Although it does separate the higher rows quite well, it breaks the line in the middle in two and groups the line that begins at (0,20) with some data points below it.
Is there any other way to group and remove these points?

The most efficient way to solve this involves building a graph where each point is a vertex. You join points that you consider "connected" or "close" with an edge. The graph will then have connected components, and you need to look for the connected components that span the whole range from 0 to 1.
Build the graph. Finding neighbors is most efficient using an R-tree. Here are some suggestions. You can also use a k-d tree, for example. A spatial index is not strictly necessary, but without one this can get really slow, because you have to compare the distances between each pair of points.
Given an Nx2 matrix coord, you can find the squared distances between each pair:
D = sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2,3);
(note again that this is expensive if N is large, and in that case using an R-tree will speed things up significantly). D(i,j) is the squared distance between the points with indices i and j (i.e. coord(i,:) and coord(j,:)).
Next, build the graph, G. Nodes i and j are connected if G(i,j)==1. G is a symmetric matrix. Because D holds squared distances, compare it against the squared threshold:
G = D <= max_distance^2;
Find connected components. A connected component is just a set of nodes that you can reach from each other by following edges. You don't really need to find all connected components, you just need to find the set of points that have x=0, and starting from each, recursively visit all elements in its connected component to see if you can reach a point that has x=1.
This next code is not tested, but hopefully it gives a starting point:
start_indices = find(coord(:,1)==0); % Is exact equality appropriate here?
end_indices = find(coord(:,1)==1);
to_remove = [];
for ii=start_indices.'
    % For each point with x=0, see if we can reach any of the points at x=1
    visited = false(size(coord,1), 1); % fresh search for every start point
    res = can_reach(ii, end_indices, G, visited);
    if res
        % For this point we can, remove it!
        to_remove(end+1) = ii;
    end
end
% Recursive function to visit all nodes in a connected component
function [res, visited] = can_reach(start, end_indices, G, visited)
    res = false; % default: no end point reached from here
    visited(start) = true;
    if any(start==end_indices)
        % We've reached an end point, stop recursing and return true.
        res = true;
        return;
    end
    next = find(G(start,:)); % find neighbors
    for ii=next
        if visited(ii)
            continue; % already visited through another branch
        end
        [res, visited] = can_reach(ii, end_indices, G, visited);
        if res
            % Yes, we can reach an end point, stop recursing now.
            return
        end
    end
end
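A non-recursive variant of the same idea is sketched below. It labels every connected component with an explicit stack (which may also avoid hitting MATLAB's recursion limit on long chains of points) and then flags all points of any component that touches both x = 0 and x = 1, not only the starting points. The example coordinates, max_distance and tol are made-up values for illustration, and comparing against a tolerance instead of exact equality is an assumption, not something the question specifies.

```matlab
% Made-up example: one row of points spanning x = 0..1 plus three other points
coord = [linspace(0,1,21).', 0.5*ones(21,1); ...
         0.10, 0.10; 0.15, 0.12; 0.80, 0.90];
max_distance = 0.06;   % assumed neighbour threshold
tol = 1e-9;            % assumed tolerance for "x equals 0" and "x equals 1"

N = size(coord, 1);
D = sum((reshape(coord,[],1,2) - reshape(coord,1,[],2)).^2, 3); % squared distances
G = D <= max_distance^2;                                        % adjacency matrix

label = zeros(N, 1);   % component number per point, 0 = not labelled yet
n_comp = 0;
for s = 1:N
    if label(s) ~= 0, continue; end
    n_comp = n_comp + 1;
    label(s) = n_comp;
    stack = s;                                  % explicit stack instead of recursion
    while ~isempty(stack)
        cur = stack(end);
        stack(end) = [];
        nb = find(G(cur,:) & (label == 0).');   % unlabelled neighbours of cur
        label(nb) = n_comp;
        stack = [stack, nb];
    end
end

% A component is removed if it contains a point at x ~ 0 and a point at x ~ 1
at_start = accumarray(label, double(coord(:,1) <= tol), [n_comp 1]) > 0;
at_end   = accumarray(label, double(coord(:,1) >= 1 - tol), [n_comp 1]) > 0;
to_remove = at_start(label) & at_end(label);
cleaned = coord(~to_remove, :);
```

In this toy data only the 21 points of the spanning row are flagged, and the three remaining points are kept.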

Related

Which Bins are occupied in a 3D histogram in MatLab

I have 3D data, from which I need to calculate properties.
To reduce computation, I want to discretize the space, calculate the properties per bin instead of per data point, and then reassign the property calculated for each bin back to its data points.
Furthermore, I only want to calculate the bins that have points in them.
Since there is no 3D binning function in MATLAB, I use histcounts over each dimension and then search for the unique bins that have been assigned to the data points.
a5pre=compositions(:,1);
a7pre=compositions(:,2);
a8pre=compositions(:,3);
%% BINNING
a5pre_edges=[0,linspace(0.005,0.995,19),1];
a5pre_val=(a5pre_edges(1:end-1) + a5pre_edges(2:end))/2;
a5pre_val(1)=0;
a5pre_val(end)=1;
a7pre_edges=[0,linspace(0.005,0.995,49),1];
a7pre_val=(a7pre_edges(1:end-1) + a7pre_edges(2:end))/2;
a7pre_val(1)=0;
a7pre_val(end)=1;
a8pre_edges=a7pre_edges;
a8pre_val=a7pre_val;
[~,~,bin1]=histcounts(a5pre,a5pre_edges);
[~,~,bin2]=histcounts(a7pre,a7pre_edges);
[~,~,bin3]=histcounts(a8pre,a8pre_edges);
bins=[bin1,bin2,bin3];
[A,~,C]=unique(bins,'rows','stable');
a5pre=a5pre_val(A(:,1));
a7pre=a7pre_val(A(:,2));
a8pre=a8pre_val(A(:,3));
It seems that the unique function is pretty time-consuming, so I was wondering whether there is a faster way to do it, given that the rows can only contain integers, or whether there is a totally different approach.
Best regards
function [comps,C]=compo_binner(x,y,z,e1,e2,e3,v1,v2,v3)
C=NaN(length(x),1);
comps=NaN(length(x),3);
id=1;
for i=1:numel(x)
B_temp(1,1)=v1(sum(x(i)>e1));
B_temp(1,2)=v2(sum(y(i)>e2));
B_temp(1,3)=v3(sum(z(i)>e3));
C_id=sum(ismember(comps,B_temp),2)==3;
if sum(C_id)>0
C(i)=find(C_id);
else
comps(id,:)=B_temp;
id=id+1;
C_id=sum(ismember(comps,B_temp),2)==3;
C(i)=find(C_id>0);
end
end
comps(any(isnan(comps), 2), :) = [];
end
But it's way slower than the histcounts/unique version. I can't avoid the find function, and that's a function you want to avoid in a loop when speed matters...
If I understand correctly you want to compute a 3D histogram. If there's no built-in tool to compute one, it is simple to write one:
function [H, lindices] = histogram3d(data, n)
% histogram3d 3D histogram
% H = histogram3d(data, n) computes a 3D histogram from (x,y,z) values
% in the Nx3 array `data`. `n` is the number of bins between 0 and 1.
% It is assumed all values in `data` are between 0 and 1.
assert(size(data,2) == 3, 'data must be Nx3');
H = zeros(n, n, n);
indices = floor(data * n) + 1;
indices(indices > n) = n;
lindices = sub2ind(size(H), indices(:,1), indices(:,2), indices(:,3));
for ii = 1:size(data,1)
H(lindices(ii)) = H(lindices(ii)) + 1;
end
end
Now, given your compositions array, and binning each dimension into 20 bins, we get:
[H, indices] = histogram3d(compositions, 20);
idx = find(H);
[x,y,z] = ind2sub(size(H), idx);
reduced_compositions = ([x,y,z] - 0.5) / 20;
The bin centers for H are at ((1:20)-0.5)/20.
On my machine this runs in a fraction of a second for 5 million input points.
Now, for each compositions(ii,:), you have a number indices(ii), which matches another number idx(jj), corresponding to reduced_compositions(jj,:). One easy way to make the assignment of results is as follows:
H(H > 0) = 1:numel(idx);
indices = H(indices);
Now for each compositions(ii,:), your closest match in the reduced set is reduced_compositions(indices(ii),:).
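The whole round trip (bin, keep only occupied bins, compute a property per bin, hand it back to every point) can be sketched without an explicit loop using accumarray. The six sample points, the bin count n = 4 and the "property" (here just the sum of the bin-centre coordinates) are placeholders chosen for illustration only.

```matlab
% Made-up data: 6 points in the unit cube, 4 bins per dimension
compositions = [0.10 0.10 0.10; 0.12 0.11 0.13; 0.90 0.90 0.90; ...
                0.60 0.20 0.40; 0.61 0.22 0.41; 1.00 1.00 1.00];
n = 4;

sub = min(floor(compositions * n) + 1, n);            % per-dimension bin index, 1..n
lin = sub2ind([n n n], sub(:,1), sub(:,2), sub(:,3)); % linear bin index per point
counts = accumarray(lin, 1, [n^3 1]);                 % loop-free 3D histogram

occupied = find(counts);                 % linear indices of non-empty bins
lookup = zeros(n^3, 1);
lookup(occupied) = 1:numel(occupied);    % bin index -> row in the reduced set
bin_of_point = lookup(lin);              % reduced-set row for every data point

[bx, by, bz] = ind2sub([n n n], occupied);
reduced = ([bx, by, bz] - 0.5) / n;      % centres of the occupied bins

prop_reduced = sum(reduced, 2);          % placeholder for the expensive property
prop_points = prop_reduced(bin_of_point);% property reassigned to each data point
```

Here the expensive calculation would run on the three rows of reduced instead of on all six points.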

Finding the longest linear section of non-linear plot in MATLAB

Apologies for the long post but this takes a bit to explain. I'm trying to make a script that finds the longest linear portion of a plot. Sample data is in a csv file here, it is stress and strain data for calculating the shear modulus of 3D printed samples. The code I have so far is the following:
x_data = [];
y_data = [];
x_data = Data(:,1);
y_data = Data(:,2);
plot(x_data,y_data);
grid on;
answer1 = questdlg('Would you like to load last attempt''s numbers?');
switch answer1
case 'Yes'
[sim_slopes,reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
case 'No'
disp('Take a look at the plot, find a range estimate, and press any button to continue');
pause;
prompt = {'Eliminate values ABOVE this x-value:','Eliminate values BELOW this x-value:','Size of divisions on x-axis:','Factor for similarity of slopes:'};
dlg_title = 'Point elimination';
num_lines = 1;
defaultans = {'0','0','0','0.1'};
if isempty(answer2) < 1
defaultans = {answer2{1},answer2{2},answer2{3},answer2{4}};
end
answer2 = inputdlg(prompt,dlg_title,num_lines,defaultans);
uv_of_x_range = str2num(answer2{1});
lv_of_x_range = str2num(answer2{2});
x_div_size = str2num(answer2{3});
K = str2num(answer2{4});
close all;
iB = find(x_data > str2num(answer2{1}),1,'first');
iS = find(x_data > str2num(answer2{2}),1,'first');
new_x_data = x_data(iS:iB);
new_y_data = y_data(iS:iB);
[sim_slopes, reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
end
[longest_section0, Midx]= max(sim_slopes(:,4)-sim_slopes(:,3));
longest_section=1+longest_section0;
long_sec_x_data_start = x_div_size*(sim_slopes(Midx,3)-1)+lv_of_x_range;
long_sec_x_data_end = x_div_size*(sim_slopes(Midx,4)-1)+lv_of_x_range;
long_sec_x_data_start_idx=find(new_x_data >= long_sec_x_data_start,1,'first');
long_sec_x_data_end_idx=find(new_x_data >= long_sec_x_data_end,1,'first');
long_sec_x_data = new_x_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
long_sec_y_data = new_y_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
[b_long_sec, longes_section_reg_data] = robustfit(long_sec_x_data,long_sec_y_data);
plot(long_sec_x_data,b_long_sec(1)+b_long_sec(2)*long_sec_x_data,'LineWidth',3,'LineStyle',':','Color','k');
function [sim_slopes,reg_data] = regr_and_longest_part(x_points,y_points,x_div,lv,K)
reg_data = cell(1,3);
scatter(x_points,y_points,'.');
grid on;
hold on;
uv = lv+x_div;
ii=0;
while lv <= x_points(end)
if uv > x_points(end)
uv = x_points(end);
end
ii=ii+1;
indices = find(x_points>lv & x_points<uv);
temp_x_points = x_points((indices));
temp_y_points = y_points((indices));
if length(temp_x_points) <= 2
break;
end
[b,stats] = robustfit(temp_x_points,temp_y_points);
reg_data{ii,1} = b(1);
reg_data{ii,2} = b(2);
reg_data{ii,3} = length(indices);
plot(temp_x_points,b(1)+b(2)*temp_x_points,'LineWidth',2);
lv = lv+x_div;
uv = lv+x_div;
end
sim_slopes = NaN(length(reg_data),4);
sim_slopes(1,:) = [reg_data{1,1},0,1,1];
idx=1;
for ii=2:length(reg_data)
coff =sim_slopes(idx,1);
if abs(reg_data{ii,1}-coff) <= K*coff
C=zeros(ii-sim_slopes(idx,3)+1,1);
for kk=sim_slopes(idx,3):ii
C(kk)=reg_data{kk,1};
end
sim_slopes(idx,1)=mean(C);
sim_slopes(idx,2)=std(C);
sim_slopes(idx,4)=ii;
else
idx = idx + 1;
sim_slopes(idx,1)=reg_data{ii,1};
sim_slopes(idx,2)=0;
sim_slopes(idx,3)=ii;
sim_slopes(idx,4)=ii;
end
end
end
Apologies for the code not being well optimized, I'm still relatively new to MATLAB. I did not use derivatives because my data is relatively noisy and derivation might have made it worse.
I've managed to get the code to find the longest straight part of the plot by splitting the data up into sections of width x_div_size and then performing a robustfit on each section, the results of which are written into reg_data. The code then runs through reg_data and finds which lines have the most similar slopes, determined by the K factor, by calculating the average of the slopes in a section of the plot, and makes a note of it in sim_slopes. It then finds the longest interval with max(sim_slopes(:,4)-sim_slopes(:,3)) and performs a regression on it to give the final answer.
The problem is that it will only consider the first straight portion that it comes across. When the data is plotted, it has a few parts where it seems straightest:
As an example, when I run the script with answer2 = {'0.2','0','0.0038','0.3'} I get the following, where the black line is the straightest part found by the code:
I have the following questions:
1) It's clear that from about x = 0.04 to x = 0.2 there is a long straight part and I'm not sure why the script is not finding it. Playing around with different values the script always seems to pick the first longest straight part, ignoring subsequent ones.
2) MATLAB complains that Warning: Iteration limit reached. because there are more than 50 regressions to perform. Is there a way to bypass this limit on robustfit?
3) When generating sim_slopes there might be a section of the plot whose slope is too different from the average of the previous slopes, so it gets marked as the end of a long section. But that section is sometimes sandwiched between several other sections on either side which instead have similar slopes. How would it be possible to tell the script to ignore one wayward section and to continue as if it falls within the tolerance allowed by the K value?
Take a look at the Douglas-Peucker algorithm. If you think of your (x,y) values as the vertices of an (open) polygon, this algorithm will simplify it for you, such that the largest distance from the simplified polygon to the original is smaller than some threshold you can choose. The simplified polygon will be a set of straight line segments. Find the two consecutive vertices that are furthest apart, and you're done.
MATLAB has an implementation in the Mapping Toolbox called reducem. You might also find an implementation on the File Exchange (but be careful, there is also really bad code on there). Or, you can roll your own, it's quite a simple algorithm.
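For the roll-your-own route, a minimal sketch of Douglas-Peucker followed by the longest-segment search could look like this. It uses an explicit stack of index ranges instead of recursion and measures the perpendicular distance to the line through the two end points. The sample polyline and the tolerance epsilon are made up for illustration; real stress-strain data would need a tolerance chosen with the noise level in mind.

```matlab
% Made-up polyline: flat, then a long straight rise, then flat again
x = [0 1 2 3 4 5 6 7 8].';
y = [0 0.02 0 1 2 3 4 4.01 4].';
epsilon = 0.1;                 % assumed distance tolerance

keep = false(numel(x), 1);
keep([1 end]) = true;          % the end points are always kept
stack = [1, numel(x)];         % index ranges [a b] that still need checking
while ~isempty(stack)
    a = stack(end,1);
    b = stack(end,2);
    stack(end,:) = [];
    if b - a < 2, continue; end          % no interior points left
    idx = (a+1:b-1).';
    dx = x(b) - x(a);
    dy = y(b) - y(a);
    % Perpendicular distance of the interior points to the line through a and b
    % (assumes points a and b do not coincide)
    d = abs(dy*(x(idx) - x(a)) - dx*(y(idx) - y(a))) / hypot(dx, dy);
    [dmax, k] = max(d);
    if dmax > epsilon
        m = idx(k);                      % farthest point becomes a vertex
        keep(m) = true;
        stack = [stack; a, m; m, b];     % examine both halves later
    end
end

vx = x(keep);
vy = y(keep);
seg_len = hypot(diff(vx), diff(vy));     % length of each simplified segment
[~, longest] = max(seg_len);
% The longest straight section runs from vx(longest) to vx(longest+1)
```

For this toy input the simplified polygon keeps the vertices at x = 0, 2, 6 and 8, and the longest segment is the rise between x = 2 and x = 6.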
You can also try using the ischange function to detect changes in the intercept and slope of the data, and then extract the longest portion from that.
Using the sample data you provided, here is what I see from a basic attempt:
>> T = readtable('Data.csv');
>> T = rmmissing(T); % Remove rows with NaN
>> T = groupsummary(T,'Var1','mean'); % Average duplicate timestamps
>> [tf,slopes,intercepts] = ischange(T.mean_Var2, 'linear', 'SamplePoints', T.Var1); % find changes
>> plot(T.Var1, T.mean_Var2, T.Var1, slopes.*T.Var1 + intercepts)
which generates the plot
You should be able to extract the longest segment based on the indices given by find(tf).
You can also tune the parameters of ischange to get fewer or more segments. Adding the name-value pair 'MaxNumChanges' with a value of 4 or 5 produces more linear segments with a tighter fit to the curve, for example, which effectively removes the kink in the plot that you see.
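Extracting the longest segment from the change-point flags might look like the sketch below. The vectors xs and tf are made-up stand-ins for the sample points and the first output of ischange, and the code assumes that each true entry marks the first sample of a new segment.

```matlab
% Made-up stand-ins for the sample points and the ischange output
xs = linspace(0, 0.2, 11).';
tf = logical([0 0 1 0 0 0 0 0 1 0 0].');

starts = unique([1; find(tf)]);            % first index of each segment
stops = [starts(2:end) - 1; numel(tf)];    % last index of each segment
[~, k] = max(xs(stops) - xs(starts));      % segment with the largest extent in x
longest_idx = starts(k):stops(k);          % indices of the longest segment
```

With these flags the segments are 1:2, 3:8 and 9:11, so longest_idx is 3:8. A straight line can then be fitted to just those samples.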

Random sample of points from a rectangular box with a spherical obstacle

The workspace is given as:
limits=[-1 4; -1 4; -1 4];
And in this workspace, there is a spherical obstacle which is defined as:
obstacle.origin_x=1.6;
obstacle.origin_y=0.8;
obstacle.origin_z=0.2;
obstacle.radius_obs=0.2;
save('obstacle.mat', 'obstacle');
I would like to create random point in the area of lim. I created random points using the code below:
function a=rndmpnt(lim, numofpoints)
x=lim(1,1)+(lim(1,2)-lim(1,1))*rand(1,numofpoint);
y=lim(2,1)+(lim(2,2)-lim(2,1))*rand(1,numofpoint);
z=lim(3,1)+(lim(3,2)-lim(3,1))*rand(1,numofpoint);
a=[x y z];
Now I would like to eliminate the points in the area of limits-obstacle. how can I do that?
You want to reject the points within the obstacle. Naturally, after rejection you will probably end up with fewer points than numofpoints, so the process needs to be repeated until enough points are generated. A while loop is appropriate here.
Rejection is done by finding ix (a logical mask of acceptable points) and appending only those points to matrix a. The loop repeats until there are enough of those, and the function returns exactly the number requested. Note that obstacle is not visible inside the function, so it is passed in as a third argument here.
function a = rndmpnt(lim, numofpoints, obstacle)
    a = zeros(3,0); % begin with empty matrix
    while size(a,2) < numofpoints % not enough points yet
        x = lim(1,1)+(lim(1,2)-lim(1,1))*rand(1,numofpoints);
        y = lim(2,1)+(lim(2,2)-lim(2,1))*rand(1,numofpoints);
        z = lim(3,1)+(lim(3,2)-lim(3,1))*rand(1,numofpoints);
        % true for points strictly outside the sphere
        ix = (x - obstacle.origin_x).^2 + (y - obstacle.origin_y).^2 + (z - obstacle.origin_z).^2 > obstacle.radius_obs^2;
        a = [a, [x(ix); y(ix); z(ix)]];
    end
    a = a(:, 1:numofpoints); % trim the surplus
end
You may want to add a safeguard against infinite loop (some limit on the number of cycles) in case the user passes in the values such that there are no acceptable points.
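One way such a safeguard could look is sketched below as a plain script. The limits and obstacle values are taken from the question; max_rounds is an assumed cap, and the names centre, radius and rounds are introduced here for illustration.

```matlab
limits = [-1 4; -1 4; -1 4];
centre = [1.6; 0.8; 0.2];      % obstacle origin from the question
radius = 0.2;                  % obstacle radius from the question
numofpoints = 1000;
max_rounds = 100;              % assumed cap on the number of sampling rounds

a = zeros(3, 0);
rounds = 0;
while size(a, 2) < numofpoints && rounds < max_rounds
    rounds = rounds + 1;
    % Uniform samples in the box, one column per point
    p = limits(:,1) + (limits(:,2) - limits(:,1)) .* rand(3, numofpoints);
    ok = sum((p - centre).^2, 1) > radius^2;   % keep points outside the sphere
    a = [a, p(:, ok)];
end
if size(a, 2) < numofpoints
    error('Could not generate %d points in %d rounds.', numofpoints, max_rounds);
end
a = a(:, 1:numofpoints);       % trim the surplus
```

If the obstacle covered the whole box, the loop would stop after max_rounds and raise an error instead of spinning forever.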

Matlab Code to distribute points on plot

I have edited some code that I found online that helps me draw points distributed on a graph subject to a minimum distance between them.
This is the code that I have so far:
x(1)=rand(1)*1000; %Random coordinates of the first point
y(1)=rand(1)*1000;
minAllowableDistance = 30; %IF THIS IS TOO BIG, THE LOOP DOES NOT END
numberOfPoints = 300; % Number of points equivalent to the number of sites
keeperX = x(1); % Initialize first point
keeperY = y(1);
counter = 2;
for k = 2 : numberOfPoints %Dropping another point, and checking if it can be positioned
done=0;
trial_counter=1;
while (done~=1)
x(k)=rand(1)*1000;
y(k)=rand(1)*1000;
thisX = x(k); % Get a trial point.
thisY = y(k);
% See how far is is away from existing keeper points.
distances = sqrt((thisX-keeperX).^2 + (thisY - keeperY).^2);
minDistance = min(distances);
if minDistance >= minAllowableDistance
keeperX(k) = thisX;
keeperY(k) = thisY;
done=1;
trial_counter=trial_counter+1;
counter = counter + 1;
end
if (trial_counter>2)
done=1;
end
end
end
end
So this code is working fine, but sometimes MATLAB freezes if there are more than 600 points. The problem is that the area is full and no more points can be added, so MATLAB keeps retrying over and over. So I need a way, when trial_counter is larger than 2, for the point to find a space that is empty and settle there.
The trial_counter is meant to drop a point if it doesn't fit by the third attempt.
Thank you
Since trial_counter=trial_counter+1; is only called inside if minDistance >= minAllowableDistance, you will easily enter an infinite loop if minDistance < minAllowableDistance (e.g. if your existing points are quite closely packed).
How you do this depends on what your limitations are, but if you're looking at integer points in a set range, one possibility is to keep the points as a binary image, and use bwdist to work out the distance transform, then pick an acceptable point. So each iteration would be (where BW is your stored "image"/2D binary matrix where 1 is the selected points):
D = bwdist(BW);
maybe_points = find(D>minAllowableDistance); % list of possible locations
n = randi(length(maybe_points)); % pick one location
BW(maybe_points(n))=1; % add it to your matrix
(then add some checking such that if you can't find any allowable points the loop quits)
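If you would rather keep the random-trial loop from the question than switch to bwdist, a minimal sketch that cannot hang is shown below. It counts every attempt, not only the successful ones, and stops placing points once one of them fails maxTrials times in a row. maxTrials and nKept are names and values assumed here for illustration.

```matlab
minAllowableDistance = 30;
numberOfPoints = 300;
maxTrials = 1000;                  % assumed cap on attempts per point

keeperX = zeros(1, numberOfPoints);
keeperY = zeros(1, numberOfPoints);
nKept = 0;
for k = 1:numberOfPoints
    for trial = 1:maxTrials        % every attempt counts, successful or not
        thisX = rand * 1000;
        thisY = rand * 1000;
        % Distances from the trial point to all points kept so far
        d = hypot(thisX - keeperX(1:nKept), thisY - keeperY(1:nKept));
        if isempty(d) || min(d) >= minAllowableDistance
            nKept = nKept + 1;
            keeperX(nKept) = thisX;
            keeperY(nKept) = thisY;
            break;                 % point placed, move on to the next one
        end
    end
    if nKept < k
        break;                     % no room found: the area is probably full
    end
end
keeperX = keeperX(1:nKept);        % drop the unused slots
keeperY = keeperY(1:nKept);
```

After the loop, nKept tells you how many points actually fitted, which may be fewer than numberOfPoints when the minimum distance is large.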

push relabel algorithm

I have written a MATLAB version of the push-relabel FIFO code (exactly like the one on Wikipedia) and tried it. The discharge function is exactly like Wikipedia's.
It works perfectly for small graphs (e.g. number of nodes = 7). However, when I increase my graph size (i.e. number of nodes/vertices > 3500 or more) the "relabel" function, which is called in the "discharge" function, runs very slowly. My graphs are huge (i.e. > 3000 nodes) so I need to optimize my code.
I tried to optimize the code according to Wikipedia's suggestions for global relabeling/gap relabeling:
1) Make neighbour lists for each node, and let the index seen[u] be an iterator over this, instead of the range .
2) Use a gap heuristic.
I'm stuck at the first one; I don't understand exactly what I have to do, since it seems there are details left out. (I made neighbor lists such that for vertex u, any nodes v(1..n) connected to u are in the neighbor list already; I'm just not sure how to iterate with the seen[u] index.)
[r,c] = find(C);
uc = unique(c);
s = struct;
for i=1:length(uc)
ind = find(c == uc(i));
s(uc(i)).n = [r(ind)];
end
AND the discharge function uses the 's' neighborhood struct list:
while (excess(u) > 0) %test if excess of current node is >0, if so...
if (seen(u) <= length(s(u).n)) %check next neighbor
v = s(u).n(seen(u));
resC = C(u,v) - F(u,v);
if ((resC > 0) && (height(u) == height(v) + 1)) %and if not all neighbours have been tried since last relabel
[C, F, excess] = push(C, F, excess, u, v); %push into untried neighbour
else
seen(u) = seen(u) + 1;
height = relabel(C, F, height, u, N);
end
else
height = relabel(C, F, height, u, N);
seen(u) = 1; %relabel start of queue
end
end
Can someone direct, show or help me please?
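One possible reading of the neighbour-list suggestion is sketched below on a made-up 4-node network. It rests on two assumptions that differ from the code above and should be checked against your own implementation: the neighbour list of u holds nodes joined to u by an edge in either direction (because residual capacity can exist on reverse edges), and relabel is only called once seen(u) has run past the end of that list, while a failed push merely advances seen(u). The names nbrs, queue and send are introduced here for illustration, and F is kept antisymmetric so that C(u,v) - F(u,v) is the residual capacity.

```matlab
% Made-up network: node 1 is the source, node 4 is the sink
C = [0 3 2 0; 0 0 1 2; 0 0 0 3; 0 0 0 0];
N = size(C, 1);
src = 1;
snk = N;
F = zeros(N);
height = zeros(N, 1);
excess = zeros(N, 1);
seen = ones(N, 1);

% Neighbour lists: edges in either direction
A = (C > 0) | (C > 0).';
nbrs = cell(N, 1);
for u = 1:N
    nbrs{u} = find(A(u, :));
end

% Initial preflow: saturate all edges leaving the source
height(src) = N;
for v = nbrs{src}
    F(src, v) = C(src, v);
    F(v, src) = -C(src, v);
    excess(v) = excess(v) + C(src, v);
end
queue = find(excess > 0).';        % FIFO queue of active nodes
queue(queue == snk) = [];

while ~isempty(queue)
    u = queue(1);
    queue(1) = [];
    while excess(u) > 0            % discharge u
        if seen(u) <= numel(nbrs{u})
            v = nbrs{u}(seen(u));
            resC = C(u, v) - F(u, v);
            if resC > 0 && height(u) == height(v) + 1
                send = min(excess(u), resC);           % push
                F(u, v) = F(u, v) + send;
                F(v, u) = F(v, u) - send;
                excess(u) = excess(u) - send;
                if excess(v) == 0 && v ~= src && v ~= snk
                    queue(end+1) = v;                  % v becomes active
                end
                excess(v) = excess(v) + send;
            else
                seen(u) = seen(u) + 1;                 % try the next neighbour, no relabel
            end
        else
            % Relabel only after every neighbour has been tried
            res = C(u, nbrs{u}) - F(u, nbrs{u});
            height(u) = min(height(nbrs{u}(res > 0))) + 1;
            seen(u) = 1;
        end
    end
end
maxflow = excess(snk);
```

Because relabel only scans nbrs{u} instead of all N nodes and runs far less often, it should take much less time on large sparse graphs, although that has not been measured here.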