MATLAB: Using CONVN for moving average on Matrix

I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river and weather flow data for a watershed at different source locations.
So the matrix is laid out as follows:
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) / 21;
mat = randn(150,365*10,4);
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which I thought should also work, setting ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.

Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. Your first snippet should work. However, the 'valid' flag returns a matrix whose size is restricted to where the kernel fits entirely inside the input. Take a look at the first point of the post that you linked to for more details.
Specifically, for a length-21 kernel the output has 365*10 - 21 + 1 columns, so 20 entries per row are trimmed: the first valid output already averages columns 1 through 21, and only from that point on is the kernel completely contained inside a row (no pun intended). If you'd like to see entries at the boundaries, use the 'same' flag to keep the same size matrix as the input, or the 'full' flag (the default), which extends the output to the most extreme overlaps; but bear in mind that those boundary averages are computed against implicit zero padding, so the first entries wouldn't be what you expect anyway.
As for ker2: padding the kernel with rows of zeros makes it 3-by-21, so 'valid' now also trims the first dimension, giving 148 rows instead of 150. The interior values match your first result (the zero rows contribute nothing), but the sizes and row alignment don't, which is why the two results don't line up.
However, if I'm interpreting your question correctly, the 'valid' flag is what you want; just remember those 20 columns are missing to accommodate the edge cases. All in all, your code should work, but be careful in how you interpret the results.
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
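For reference, here is a NumPy sketch of the same row-wise moving average, with synthetic data standing in for your matrix (the shapes mirror your example: 150 sites, 3650 days, 4 measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
mat = rng.standard_normal((150, 3650, 4))   # sites x days x measurements
ker = np.ones(21) / 21                      # 21-day averaging kernel

# 'valid'-mode 1-D convolution along the date axis (axis=1),
# the NumPy analogue of convn(mat, ker, 'valid') with a 1x21 kernel.
avg = np.apply_along_axis(lambda r: np.convolve(r, ker, mode='valid'), 1, mat)

print(avg.shape)   # (150, 3630, 4): 20 columns trimmed at the edges
# The first valid output is exactly the mean of the first 21 days.
print(np.allclose(avg[0, 0, 0], mat[0, :21, 0].mean()))   # True
```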
Good luck!

Related

scipy.interpolate.griddata slow due to unnecessary data

I have a map with a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinearly interpolated map values. Those are randomly placed in an inner center area of the map of around 400*400 size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I would only need the three nearest grid positions around each coordinate to get well-defined interpolated values. So I would require only around 3000 data points of the map to perform the interpolation; the 360k data points are highly unnecessary for this task.
Naively throwing the complete map in results in a long execution time of about half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce the execution time to nearly 20%.
I am now wondering if I overlooked something in my assumption that I need only the three nearest neighbours for my task. And if not, whether there is a fast way to filter those 3000 out of the 360k. I assume looping 3000 times over the 360k lines would take longer than just throwing in the inner map.
Edit: I also compared the results between the full 600*600 grid and the reduced data points. I am actually surprised and concerned by the observation that the interpolation results partly differ significantly.
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I already have a regular grid.
I tried to sort out my findings on the differences in interpolated values and found that griddata shows unexpected behavior.
Check out the issue I created for details:
https://github.com/scipy/scipy/issues/17378
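A minimal sketch of the RegularGridInterpolator approach, with a synthetic grid and random query points standing in for the real map (the grid axes, values, and point count here are assumptions, not the asker's data):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Synthetic stand-in for the 600x600 regular map
x = np.linspace(0.0, 1.0, 600)
y = np.linspace(0.0, 1.0, 600)
xx, yy = np.meshgrid(x, y, indexing='ij')
values = np.sin(3 * xx) * np.cos(2 * yy)

# Exploits the regular grid structure: no Delaunay triangulation,
# which is where griddata spends most of its time.
interp = RegularGridInterpolator((x, y), values, method='linear')

# ~1000 query points in the inner area of the map
pts = np.random.default_rng(1).uniform(0.2, 0.8, size=(1000, 2))
z = interp(pts)        # vectorized; no need to subsample the grid
print(z.shape)         # (1000,)
```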

Efficient Matlab way of reducing 2D segments?

I need to write a function to simplify a set of segments. In particular:
Given the set of 2D segments (x and y coordinates)
Keep only one replica in case of overlapping (or almost overlapping) segments
For each kept part keep count of how many overlaps were present on it.
So the input would be a big set of segments (with many overlaps)
And the output would be non overlapping segments with counts for each one of them.
An example input is shown in this matlab plot
As you can see, some lines look thicker because there are multiple overlaps.
As a result I would need just the skeleton with no overlaps, but I would need the overlap information as a number for each output segment.
I am not an expert in working with segments and geometrical problems.
What would be the most efficient way of solving such a problem? I am using Matlab, but a code example in any high-level language would help. Thanks!
EDIT
As requested here is also a sample dataset, you can get it from this google drive link
https://drive.google.com/open?id=1r2hkG7gI0qhzfP-Mmn8HzIil1o47Z2nA
The input is a csv with 276 cable segments
The output is a csv with 58 cable segments (the reduced segments) + an extra column containing the number of overlapping cables for each segment kept.
The input could contain many more segments. The idea is that the reduction should eliminate cables that are parallel and overlapping with each other, with a certain tolerance.
For example if 2 cables are parallel but far away they should be kept both.
I don't care about the tolerances; it's fine if the output is different. I just need an idea of how to solve this problem efficiently, because the code should run fast even with many, many segments as input.
Matlab is probably not the best-suited language for geometrical manipulation. PostgreSQL/PostGIS would be a better tool, but if you don't have a choice, here is one way to get the skeleton of a line:
% Random points
P = [0 0;
1 1;
2 1;
1.01 1.02;
0.01 0.01];
% positive buffer followed by a negative buffer
pol = polybuffer(polybuffer(P,'lines',0.25,'JointType','miter'),-0.249,'JointType','miter');
plot(P(:,1),P(:,2),'r.','MarkerSize',10)
hold on
plot(pol)
axis equal
% Drop near-duplicate vertices with uniquetol (tolerance 0.005) to get the centerline
utol = uniquetol(pol.Vertices,0.005,'ByRows',true);
hold off;
plot(utol(:,1),utol(:,2),'b')
Result:
And the center line:
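The buffer trick above gives the skeleton but not the overlap counts. One way to sketch the counting part (in Python, with a hypothetical input format of one (x1, y1, x2, y2) row per segment, and a snap-to-grid tolerance) is to canonicalize each segment and tally duplicates:

```python
import numpy as np
from collections import Counter

# Hypothetical input: three segments, the second nearly duplicating the first
segs = np.array([
    [0.00, 0.00, 1.00, 1.00],
    [0.01, 0.00, 1.00, 1.01],   # near-duplicate within tolerance
    [2.00, 0.00, 3.00, 0.00],
])

tol = 0.05
keys = []
for x1, y1, x2, y2 in segs:
    # Snap endpoints to a tol-sized grid and sort the endpoint pair,
    # so A->B, B->A, and near-duplicates all collapse to the same key.
    a = (round(x1 / tol), round(y1 / tol))
    b = (round(x2 / tol), round(y2 / tol))
    keys.append(tuple(sorted([a, b])))

counts = Counter(keys)
print(len(counts))            # 2 unique segments kept
print(max(counts.values()))   # 2 overlaps recorded on the duplicated one
```

This only merges whole near-identical segments; partially overlapping cables would still need splitting first (e.g. via the buffer/skeleton step), but the counting idea is the same.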

How come the Dice coefficient comes out bigger than 1

I want to evaluate my automatic image segmentation results. I use Dice coefficients by a function written in Matlab. Following is the link for the code.
mathworklink
I am comparing the segmented patch with the manually cropped patch, and interestingly the Dice coefficient comes out bigger than one. I ran the code many times, for instance taking the absolute value of the patches (to get rid of negative pixels), but could not find the reason. How can it be that when the sums of the individual sets are, say, 3 and 5, their union comes out as, say, 45? The maximum of the union should be 8.
Can somebody guide me to more precise sources for implementing Dice coefficients?
function [Dice] = evaldem(man, auto)
for i = 1:size(auto,1)
    for j = 1:size(auto,2)
        % Resize both patches to a common size before comparing
        auto(i,j).autosegmentedpatch = imresize(auto(i,j).autosegmentedpatch,[224 224]);
        man(i,j).mansegmentedpatch = imresize(man(i,j).mansegmentedpatch,[224 224]);
        Dice(i,j) = sevaluate(man(i,j).mansegmentedpatch, auto(i,j).autosegmentedpatch);
    end
end
Since I have many automatically segmented patches and manually segmented patches, I stored them in structures [man and auto], whose size is [i,j]. I definitely have to imresize so that the patches are of equal size! Then I call the FEX submission file. These patches do contain some negative pixels. Note that I take the absolute value of the patches when computing 'common' and 'union' for Dice. All in all, I still get Dice values bigger than one.
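For reference: the Dice coefficient is 2|A∩B| / (|A| + |B|) and is bounded by 1 whenever A and B are binary masks. Values above 1 almost always mean the inputs weren't binarized first, so the sums add raw (possibly negative) intensities instead of counting pixels; taking absolute values doesn't fix that. A minimal NumPy sketch of a correct implementation:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient of two masks; always in [0, 1] for binary inputs."""
    a = np.asarray(a, dtype=bool)   # binarize first: raw intensities
    b = np.asarray(b, dtype=bool)   # (incl. negatives) break the bound
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0                  # both empty: define as perfect match
    return 2.0 * np.logical_and(a, b).sum() / denom

# 4 pixels vs 6 pixels, 4 in common: 2*4 / (4+6) = 0.8
x = np.zeros((4, 4), dtype=bool); x[1:3, 1:3] = True
y = np.zeros((4, 4), dtype=bool); y[1:3, 1:4] = True
print(dice(x, y))   # 0.8
```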

Normalized histogram in MATLAB incorrect?

I have the following set of data:
X=[4.692
6.328
4.677
6.836
5.032
5.269
5.732
5.083
4.772
4.659
4.564
5.627
4.959
4.631
6.407
4.747
4.920
4.771
5.308
5.200
5.242
4.738
4.758
4.725
4.808
4.618
4.638
7.829
7.702
4.659]; % Sample set
I fitted a Pareto distribution to this using the maximum likelihood method and obtained the following graph:
Where the following bit of code is what draws the histogram:
[N,edges,bin] = histcounts(X,'BinMethod','auto');
bin_middles=mean([edges(1:end-1);edges(2:end)]);
f_X_sample=N/trapz(bin_middles,N);
bar(bin_middles,f_X_sample,1);
Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!
I would agree with @tashuhka's comment that you need to think about how you're binning your data.
Imagine the extreme case where you lump everything together into one bin and then try to fit that single point to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and the fit still suffers, but at least one bar is (probably) a little bigger than the other, and so on. At the other extreme, every data point has its own bar, and the bar graph is nothing but a scattering of bars each holding a single count.
There are a number of different strategies for choosing an "optimal" bin size that minimizes the number of bins but maximizes the representation of the underlying PDF.
Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.
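As a sanity check on the normalization itself: dividing the counts so that the histogram integrates to 1 (which is what the trapz line approximates) is exactly what a density-normalized histogram does. A small NumPy sketch, with synthetic Pareto-like data standing in for X:

```python
import numpy as np

# Synthetic stand-in for the question's 30-point sample
rng = np.random.default_rng(2)
X = rng.pareto(3.0, size=30) + 4.5

# density=True scales counts by (n * bin_width), so the
# histogram integrates to exactly 1 over the binned range.
counts, edges = np.histogram(X, bins='auto', density=True)
widths = np.diff(edges)
print(np.isclose((counts * widths).sum(), 1.0))   # True
```

If the bar heights and the fitted PDF still disagree after this, the mismatch is down to bin choice and sample size rather than the normalization.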

Merge sensor data for clustering/neural net usage

I have several datasets, i.e. matrices that have two columns: one with a MATLAB date number and a second with a double value. Here is an example excerpt from one of them:
>> S20_EavesN0x2DEAir(1:20,:)
ans =
1.0e+05 *
7.345016409722222 0.000189375000000
7.345016618055555 0.000181875000000
7.345016833333333 0.000177500000000
7.345017041666667 0.000172500000000
7.345017256944445 0.000168750000000
7.345017465277778 0.000166875000000
7.345017680555555 0.000164375000000
7.345017888888889 0.000162500000000
7.345018104166667 0.000161250000000
7.345018312500001 0.000160625000000
7.345018527777778 0.000158750000000
7.345018736111110 0.000160000000000
7.345018951388888 0.000159375000000
7.345019159722222 0.000159375000000
7.345019375000000 0.000160625000000
7.345019583333333 0.000161875000000
7.345019798611111 0.000162500000000
7.345020006944444 0.000161875000000
7.345020222222222 0.000160625000000
7.345020430555556 0.000160000000000
Now that I have these different sensor values, I need to bring them together into one matrix so that I can perform clustering, neural nets and so on. The only problem is that the sensor data was recorded with slightly different timings or timestamps, and there is nothing I can do about that from a data-collection point of view.
My first thought was interpolation, to make one sensor data set fit another, but that seems like a messy approach, and I was thinking maybe I am missing something: a toolbox or function that would let me do this more quickly without fiddling around. To complicate things further, the number of sensors grew over time, so I am also looking at different start dates.
Does someone have a good idea of how to go about this? Thanks
I think your first thought about interpolation was the correct one, at least if you plan to use NNs. Another option would be to use approaches designed to deal with missing data, for example Dempster–Shafer theory: http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory
It's hard to give an answer for the clustering part, because I have no idea what you're looking for in the data.
For the neural network, beside interpolating there are at least two other methods that come to mind:
training separate networks for each matrix
feeding them all to the same network, with flags specifying which matrix the data comes from, i.e. something like: input (timestamp, flag_m1, flag_m2, ..., flag_mN) => target (value), where the flag_m* columns are mutually exclusive boolean values, i.e. flag_mK is 1 iff the line comes from matrix K, 0 otherwise.
These are the only things I can safely say with the amount of information you provided.
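For the interpolation route, here is a minimal NumPy sketch of resampling two sensor series onto a shared time base (the timestamps and values are synthetic stand-ins for the datenum-style columns shown in the question):

```python
import numpy as np

# Two sensors sampled at slightly different timestamps
t1 = np.array([734501.640, 734501.662, 734501.683, 734501.704])
v1 = np.array([0.01894, 0.01819, 0.01775, 0.01725])
t2 = t1 + 0.005                      # second sensor, offset sampling times
v2 = np.array([0.01600, 0.01590, 0.01580, 0.01610])

# Build a common time base over the overlapping interval, then
# linearly interpolate each series onto it and stack the columns
# into one feature matrix for clustering / NN input.
t = np.linspace(max(t1[0], t2[0]), min(t1[-1], t2[-1]), 10)
merged = np.column_stack([t, np.interp(t, t1, v1), np.interp(t, t2, v2)])
print(merged.shape)   # (10, 3)
```

Restricting the common base to the overlapping interval sidesteps the different start dates; sensors that start later simply contribute fewer overlapping rows.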