Efficient Matlab way of reducing 2D segments?

I need to write a function to simplify a set of segments. In particular:
Given a set of 2D segments (x and y coordinates):
Keep only one copy of overlapping (or almost overlapping) segments.
For each kept part, keep a count of how many overlaps were present on it.
So the input would be a big set of segments (with many overlaps),
and the output would be non-overlapping segments with a count for each one of them.
An example input is shown in this matlab plot
As you can see, some lines look thicker because there are multiple overlaps.
As a result I would need just the skeleton with no overlaps, but I would need the information of the overlaps as a number for each output segment.
I am not an expert working with segments and geometrical problems.
What would be the most efficient way of solving such a problem? I am using Matlab, but a code example in any high-level language would help. Thanks!
As requested here is also a sample dataset, you can get it from this google drive link
The input is a csv with 276 cable segments
The output is a csv with 58 cable segments (the reduced segments) + an extra column containing the number of overlapping cables for each segment kept.
The input could contain many more segments. The idea is that the reduction should eliminate cables that are parallel and overlap each other, within a certain tolerance.
For example, if 2 cables are parallel but far apart, both should be kept.
I don't care about the exact tolerances, and it's fine if the output is different. I just need an idea of how to solve this problem efficiently, because the code should run fast even with a very large number of input segments.

Matlab is probably not the best-suited language for geometrical manipulation. PostgreSQL/PostGIS would be a better tool, but if you don't have a choice, here is one solution to get the skeleton of a line:
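% Random points
P = [0 0;
     1 1;
     2 1;
     1.01 1.02;
     0.01 0.01];
% Positive buffer followed by a negative buffer
pol = polybuffer(polybuffer(P,'lines',0.25,'JointType','miter'),-0.249,'JointType','miter');
% Plot the input polyline (plotting calls assumed; they are not shown in the original)
plot(P(:,1),P(:,2),'r.-')
hold on
axis equal
% Drop the duplicates with uniquetol (with a tolerance of 0.005) to get the centerline
utol = uniquetol(pol.Vertices,0.005,'ByRows',true);
% Overlay the reduced vertices (uniquetol returns them sorted, so plot them as points)
plot(utol(:,1),utol(:,2),'bo')
hold off;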
And the center line:
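The buffering approach above gives a skeleton but not the overlap counts the question asks for. As a rough sketch of the counting step only, and not the buffering method from the answer, near-identical segments can be grouped with uniquetol and counted with accumarray. The data in S, the [x1 y1 x2 y2] row layout, and the tolerance are assumptions made for illustration. This only merges segments that coincide end to end within the tolerance; partially overlapping segments would need to be split first, and the exact comparison used to order the endpoints could misorder nearly vertical segments.

```matlab
% Hypothetical segments, one per row: [x1 y1 x2 y2]
S = [0     0     1 1;
     1.004 1.003 0 0.002;   % same cable as row 1, reversed and slightly off
     1     1     2 1;
     5     5     6 6];
tol = 0.01;   % assumed absolute tolerance, in coordinate units

% Put the endpoints of every segment in a canonical order so that
% direction does not matter (smaller x first, then smaller y).
swap = S(:,1) > S(:,3) | (S(:,1) == S(:,3) & S(:,2) > S(:,4));
S(swap,:) = S(swap,[3 4 1 2]);

% Group rows whose four coordinates all agree within tol.
% 'DataScale',1 makes tol an absolute rather than a relative tolerance.
[reduced, ~, group] = uniquetol(S, tol, 'ByRows', true, 'DataScale', 1);

% Count how many input segments fell into each group.
counts = accumarray(group, 1);
```

Here reduced holds one representative per group and counts holds the number of input segments merged into it.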


Removing outlier with multiple consecutive values similar to a step

I am processing ocean wave data, where I have a time series of the Peak Wave Period (Tp (s)). Typical values of Tp range from 2 s to 15 s for this location. However, it may reach values above 15 s during extreme events such as a storm. Hence, removing data based on a threshold value is not suitable.
As you can see in the figure below, there are multiple values that are outliers. The high values occurred for a small duration and then dropped down. An extreme event would last for hours.
I have tried the functions filloutliers and medfilt1, but they are not successful in removing the outliers, which I presume is because multiple consecutive outlier data points exist.
Is there a built-in Matlab function to handle such a situation?
Otherwise, if I need to write my own function to filter such signals, could you provide some guidance?
Attaching a small data sample here as well: Download Data
Dataset plot (Only the segment in the provided data above)
Zoomed in plot at one of the outliers.
If we know that the values need to be in the range (2, 15), we can clip the values > 15 to 15.
Another way is to take a high percentile of the observations (say the 95th) and clip the values above it.
The filloutliers and medfilt1 methods are not removing values like 18 because they are not treating them as outliers: 18 is not very far from the typical range of (2, 15).
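A minimal sketch of the two clipping options described above, using made-up Tp values. The variable names are hypothetical, and prctile may require the Statistics and Machine Learning Toolbox depending on the release.

```matlab
% Hypothetical peak wave periods in seconds
Tp = [5 6 7 18 19 6 5 8 25 7];

% Option 1: clip everything above a fixed upper limit
upperLimit = 15;
TpClipped = min(Tp, upperLimit);   % [5 6 7 15 15 6 5 8 15 7]

% Option 2: clip everything above a high percentile of the observations
p95 = prctile(Tp, 95);
TpClippedPct = min(Tp, p95);
```

Note that clipping caps the spikes rather than removing them, so genuine storm values above the limit would be capped as well.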

how to approximate time-series data

I'm not sure if this is the right term but I think I want to s̶m̶o̶o̶t̶h̶ ̶a̶n̶d̶/̶o̶r̶ approximate a data set. I have 30 data points as it is presented in the chart below (the red line with dots)
I want to approximate the dataset so it can be described with fewer data points. The black line represents what I want to achieve.
I want to be able to define an approximation level which will control how much the result data set will differ from the original one.
The approximated data set should contain a set of data points which I can connect together using straight lines.
What is the right algorithm or a math function to solve this problem? I don't expect an implementation here, but rather some suggestions where to start.
I wrote my implementation of the approximation algorithm. It works in most of the cases, but there are certain situations in which it returns non-optimal data.
The example below shows three dotted lines. Thin red line is the original dataset, a thick red-black dotted line is generated by my algorithm, the green line is what I'd like to achieve.
// Body of a function that receives `array` and `tolerance`
var previousValue;
return array.map(function (dataPoint, index, fullArray) {
  var approximation = dataPoint;
  if (index > 0) {
    // Reuse the last kept value while the change stays within the tolerance
    if (Math.abs(previousValue - dataPoint) < tolerance) {
      approximation = previousValue;
    } else {
      previousValue = dataPoint;
    }
  } else {
    previousValue = dataPoint;
  }
  return approximation;
});
There are two options here:
1. The "glitch" shown in the data is significant, meaning that you cannot smooth it away.
2. All the data shown can be approximated and the "glitch" is insignificant.
In case (1), you may consider approximating with templates (e.g. wavelets) or using basic differential analysis to detect and keep the "glitch" (e.g. meshes).
In case (2), you may fit an MA or ARIMA model, where the "glitch" can be analyzed further through the roots.
Okay, point of clarification, are you looking to smooth the data or approximate it? If you are going to smooth the data, by definition, it will get rid of the little bumps and dips in the data series. On the other hand if the goal is to accurately portray all those dips and bumps, then you do NOT want smoothing. I'm going to talk about smoothing, you tell me if you want the other.
Okay, the best way I know to smooth data is to use an alpha value. The equation is $T_{n+1} = (1-\alpha)\,T_n + \alpha\,\mathrm{Data}_{n+1}$. This means that α sets the portion of the next function point that is affected by the current data point, and 1-α sets the portion that is affected by your series history.
Example graph with alpha = .5
Take a look at this data. Here α = .5, so the function conforms to the data, but not a lot. The one below is the same, but with α = .25, so the data is followed even less and the function is a lot smoother. There is also a third option where α decreases over time. Initially it can be very high, so you quickly follow the data, but then as α decreases the trend becomes smoother and stays smooth. Finally, you can set a hard limit on the minimum α. This ensures that you will always have some minimum responsiveness to the data.
Example graph with alpha = .25
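The smoothing recurrence above can be sketched in a few lines of MATLAB. The data values are made up, and seeding the series with the first observation is an assumption, since the answer does not say how the first point is chosen.

```matlab
% Hypothetical data series
data  = [4 8 6 10 20 8];
alpha = 0.5;                 % weight of the current data point

T = zeros(size(data));
T(1) = data(1);              % assumption: seed with the first observation
for n = 1:numel(data)-1
    % T_{n+1} = (1 - alpha) * T_n + alpha * Data_{n+1}
    T(n+1) = (1 - alpha)*T(n) + alpha*data(n+1);
end
% T is [4 6 6 8 14 11]
```

Lowering alpha (for example to 0.25) makes T react less to the spike at 20 and gives a smoother curve, as described for the second graph.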

Normalized histogram in MATLAB incorrect?

I have the following set of data:
4.659]; % Sample set
I fitted a Pareto distribution to this using the maximum likelihood method and I obtain the following graph:
Where the following bit of code is what draws the histogram:
[N,edges,bin] = histcounts(X,'BinMethod','auto');
Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!
I would agree with @tashuhka's comment that you need to think about how you're binning your data.
Imagine the extreme case where you lump everything together into one bin, and then try to fit that single point to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and now the fit still sucks, but at least one bar is (probably) a little bigger than the other, etc., etc. At the other extreme, every data point has its own bar and the bar graph is nothing but a random forest of bars with only one count.
There are a number of different strategies for choosing an "optimal" bin size that minimizes the number of bins but maximizes the representation of the underlying PDF.
Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.
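As a hedged illustration of the binning point, the sketch below bins the same sample with two of the bin rules histcounts supports and puts the bar heights on a density scale with 'Normalization','pdf', which is the scale a fitted PDF would be compared against. The sample X is generated here purely for illustration and is not the asker's data.

```matlab
rng(0);                          % reproducible hypothetical sample
X = 1 ./ rand(30, 1).^(1/3);     % 30 heavy-tailed values, illustration only

% 'pdf' scales the bar heights so that the total bar area equals 1
[pdfAuto, edgesAuto] = histcounts(X, 'BinMethod', 'auto', 'Normalization', 'pdf');
[pdfFd,   edgesFd]   = histcounts(X, 'BinMethod', 'fd',   'Normalization', 'pdf');

% Check: height times width, summed over the bins, is 1 for either rule
areaAuto = sum(pdfAuto .* diff(edgesAuto));
areaFd   = sum(pdfFd   .* diff(edgesFd));
```

The two rules will generally produce different bin counts for the same 30 points, which is one way to see how strongly the picture depends on the binning.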

MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river and weather flow data for a watershed at different source locations.
So the matrix is as so,
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21.;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. The code above should work (the first case). However, the valid flag returns a matrix whose size is valid in terms of the boundaries of the convolution. Take a look at the first point of the post that you linked to for more details.
Specifically, each row of the output will be 20 entries shorter than the input because of the valid flag. With a 21-point kernel, there are only N - 21 + 1 positions where the convolution kernel is completely contained inside a row of N entries, so the first valid result (no pun intended) is the average of entries 1 through 21. If you'd like to see the entries at the boundaries, you'll need the 'same' flag to keep the output the same size as the input, or the 'full' flag (which is the default), which gives you the output starting from the most extreme outer edges. Bear in mind that in both cases the moving average near the edges is computed with a bunch of implicit zeroes, so those boundary entries (the first and last 10 with 'same', the first and last 20 with 'full') wouldn't be what you expect anyway.
However, if I'm interpreting what you are asking correctly, then the valid flag is what you want, but bear in mind that each row will have 20 fewer entries to account for the edge cases. All in all, your code should work, but be careful about how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
Good luck!
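A small check of the points above, using a reduced random array as a hypothetical stand-in for the 150 x 3650 x 4 matrix. It also shows one likely reason the second kernel in the question gives a different result: the 3 x 21 kernel ker2 shrinks the site dimension as well under 'valid', so the two outputs differ in size even though the averages agree for the rows that remain.

```matlab
rng(0);
mat = randn(5, 60, 2);            % small stand-in: 5 sites, 60 days, 2 variables
ker = ones(1, 21) / 21;           % 21-day moving-average kernel

avgmat = convn(mat, ker, 'valid');
sz = size(avgmat);                % [5 40 2], since 60 - 21 + 1 = 40

% The first valid column is the plain mean of days 1..21
ref  = mean(mat(:, 1:21, :), 2);
d1   = avgmat(:, 1, :) - ref;
err1 = max(abs(d1(:)));           % difference at rounding level

% Padding the kernel with zero rows makes it 3 x 21, so 'valid'
% also drops the first and last site
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat, ker2, 'valid');
sz2  = size(avgmat2);             % [3 40 2]
d2   = avgmat2 - avgmat(2:end-1, :, :);
err2 = max(abs(d2(:)));           % difference at rounding level
```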

Efficient way of using ssim() function in Matlab for comparing image structures (or any other alternative)

I've been given the task of putting a number of randomly shuffled video frames back into the right order. I've managed to do this by using each frame as a reference once and finding the two closest frames in terms of structure for that reference frame, on the presumption that these two closest frames would be the ones before and after that frame in the video. After finding the two closest frames for each video frame, I then compute a possible path.
My problem, however, is performance, particularly when scoring. It's unfortunately very inefficient: the run time for 72 frames (320x240) is approx. 80 seconds on the scoring alone. I'm not too familiar with Matlab (or any similar language), but this is what I am doing for scoring right now:
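for i = 1:n_images
    current_image = Images{1,i};
    % reset the scores so a frame never matches itself or a stale value
    scores = -inf(1, n_images);
    % obtain score pairs image similarity
    for j = 1:n_images
        if i ~= j
            scores(1,j) = ssim(Images{1,j}, current_image);
        end
    end
    % keep the indexes of the two most similar frames
    [svalues, index] = sort(scores,'descend');
    Closest(1,i) = index(1,1);
    Closest(2,i) = index(1,2);
end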
%Closest consists of a 2 x n_images matrix, where for each frame index, there are two
%column values, which are the indexes of the closest frames.
Could anyone give me some pointers for optimizations, or else suggestions for a better way of scoring?
Edit: Images are normalized and converted to grayscale
Edit #2: I've tried using threads by adding parfor in the scoring loop, which improved performance by around 50%. However, the problem is that I need to create an executable, and I'm not sure I'd achieve the same performance.
Never mind: here I'm going through every image pair twice (with the parameters switched), which is not needed. Scoring each pair only once cuts the number of ssim() calls from n(n-1) to n(n-1)/2, which should roughly halve the scoring time.
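A sketch of scoring each pair once and mirroring the result, assuming ssim returns the same score when its two arguments are swapped (the SSIM formula is symmetric in the two images). The frames are generated here as hypothetical stand-ins, and ssim requires the Image Processing Toolbox.

```matlab
rng(0);
n_images = 4;
Images = cell(1, n_images);
for k = 1:n_images
    % hypothetical grayscale frames whose brightness drifts from frame to frame
    Images{1,k} = uint8(128 + 20*randn(24, 32) + 5*k);
end

% Full score matrix; -Inf on the diagonal so a frame never matches itself
scores = -inf(n_images);
for i = 1:n_images-1
    for j = i+1:n_images
        scores(i,j) = ssim(Images{1,i}, Images{1,j});
        scores(j,i) = scores(i,j);      % reuse instead of recomputing
    end
end

% For every frame, the indexes of the two most similar frames
[~, index] = sort(scores, 2, 'descend');
Closest = index(:, 1:2).';              % 2 x n_images, as in the question
```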
If you want efficiency over accuracy (which, in my case, I do), computing the score from the correlation of histograms is one possible way.
It took me 55 seconds to process 72 frames with ssim(), while only 1.2 seconds with difference of histograms.
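One possible way to set up the histogram scoring is sketched below; it is not necessarily the exact code behind the timings above. Each frame's histogram is computed once, and corrcoef then gives all pairwise correlations in a single call. The frames are hypothetical stand-ins, and 8-bit grayscale images are assumed.

```matlab
rng(0);
n_images = 6;
Images = cell(1, n_images);
for k = 1:n_images
    % hypothetical 8-bit frames whose brightness drifts from frame to frame
    Images{1,k} = uint8(128 + 20*randn(240, 320) + 5*k);
end

% One 256-bin intensity histogram per frame, stored as a column
H = zeros(256, n_images);
for k = 1:n_images
    H(:,k) = histcounts(double(Images{1,k}(:)), 0:256).';
end

% Pairwise correlation between histograms (columns are the variables)
C = corrcoef(H);
C(1:n_images+1:end) = -Inf;        % exclude each frame's match with itself

% For every frame, the indexes of the two most correlated frames
[~, index] = sort(C, 2, 'descend');
Closest = index(:, 1:2).';         % 2 x n_images
```

Histograms discard spatial structure, so two frames with similar brightness distributions but different content would score as close; that is the accuracy traded for speed.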