I'm not sure if this is the right term, but I think I want to approximate (rather than smooth) a data set. I have 30 data points, as presented in the chart below (the red line with dots).
I want to approximate the dataset so it can be described with fewer data points. The black line represents what I want to achieve.
I want to be able to define an approximation level which will control how much the resulting data set differs from the original one.
The approximated data set should contain a set of data points which I can connect together using straight lines.
What is the right algorithm or a math function to solve this problem? I don't expect an implementation here, but rather some suggestions where to start.
I wrote my implementation of the approximation algorithm. It works in most of the cases, but there are certain situations in which it returns non-optimal data.
The example below shows three dotted lines. The thin red line is the original dataset, the thick red-black dotted line is generated by my algorithm, and the green line is what I'd like to achieve.
function approximate(array, tolerance) {
    var previousValue;
    return array.map(function (dataPoint, index) {
        var approximation = dataPoint;
        if (index > 0) {
            // Reuse the previous value if the new point is within the tolerance
            if (Math.abs(previousValue - dataPoint) < tolerance) {
                approximation = previousValue;
            } else {
                previousValue = dataPoint;
            }
        } else {
            previousValue = dataPoint;
        }
        return approximation;
    });
}
There are two options here:
1. The "glitch" shown in the data is significant, meaning you cannot smooth it away.
2. All the data shown can be approximated and the "glitch" is insignificant.
In case (1), you may consider approximating by templates (e.g. wavelets) or using basic differential analysis to detect and keep the "glitch" (e.g. meshes).
In case (2), you may use a moving average (MA) or an ARIMA model to fit the data; the "glitch" can then be analyzed further through the model's roots.
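As a rough illustration of case (2), a plain moving average in MATLAB might look like the sketch below. The series and the window length are made up for demonstration; you would tune the window to your own data.
% Simple moving-average smoothing as an illustration of case (2).
% The series and window length are made up for demonstration only.
data = linspace(0, 5, 30) + 0.5*randn(1, 30);    % synthetic 30-point series
window = 5;                                      % placeholder window length
smoothed = conv(data, ones(1, window)/window, 'same');
plot(1:30, data, 'r.-', 1:30, smoothed, 'k-');
legend('original', 'moving average');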
Okay, point of clarification, are you looking to smooth the data or approximate it? If you are going to smooth the data, by definition, it will get rid of the little bumps and dips in the data series. On the other hand if the goal is to accurately portray all those dips and bumps, then you do NOT want smoothing. I'm going to talk about smoothing, you tell me if you want the other.
Okay, the best way I know to smooth data is to use an alpha value. The equation is T_{n+1} = (1 - α)·T_n + α·Data_{n+1}. In other words, α sets how much of the next smoothed point comes from the current data point, and (1 - α) how much comes from the series history.
Example graph with alpha = .5
Take a look at this data. Here α = .5, so the function conforms to the data, but not closely. The one below is the same, but with α = .25, so the data is followed even less and the function is a lot smoother. There is also a third option where α decreases over time: initially it can be very high, so you quickly follow the data, but as α decreases the trend becomes smoother and stays smooth over time. Finally, you can set a hard limit on the minimum α; this will ensure that you always have some minimum responsiveness to the data.
Example graph with alpha = .25
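For reference, here is a minimal MATLAB sketch of that recursion; the series is made up, so substitute your own data and α.
% Exponential smoothing: T(n+1) = (1 - alpha)*T(n) + alpha*Data(n+1)
data  = [4 5 7 6 9 8 12 11 10 13];   % made-up series for illustration
alpha = 0.5;                         % smoothing factor, 0 < alpha <= 1
T = zeros(size(data));
T(1) = data(1);                      % seed with the first observation
for n = 1:numel(data)-1
    T(n+1) = (1 - alpha)*T(n) + alpha*data(n+1);
end
plot(1:numel(data), data, 'r.-', 1:numel(data), T, 'k-');
legend('data', 'smoothed');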
I have a map on a 600*600 equidistant x,y grid with associated scalar values.
I have around 1000 x,y coordinates at which I would like to get the bilinearly interpolated map values. These are randomly placed in an inner central area of the map of around 400*400 size.
I decided to go with the griddata function with method linear. My understanding is that with linear interpolation I would only need the three nearest grid positions around each coordinate to get well-defined interpolated values, so I would require around 3000 data points of the map to perform the interpolation. The 360k data points are highly unnecessary for this task.
Naively throwing in the complete map results in a long execution time of about half a minute. Since it's easy to narrow the map down to the area of interest, I could reduce the execution time to nearly 20% of that.
I am now wondering if I overlooked something in my assumption that I only need the three nearest neighbours for my task. And if not, whether there is a fast way to filter those 3000 points out of the 360k. I assume looping 3000 times over the 360k lines will take longer than just throwing in the inner map.
Edit: I also had a look at the comparison of the results with the full 600*600 grid and with the reduced data points. I am actually surprised and concerned by the observation that the interpolation results partly differ significantly.
So I found out that RegularGridInterpolator is the way to go for me. It's fast, and I already have a regular grid.
I tried to sort out my findings about the differences in interpolation values and found that griddata shows unexpected behavior for me.
Check out the issue I created for details.
https://github.com/scipy/scipy/issues/17378
I need to write a function to simplify a set of segments. In particular:
Given the set of 2D segments (x and y coordinates)
Keep only one replica in case of overlapping (or almost overlapping) segments
For each kept part keep count of how many overlaps were present on it.
So the input would be a big set of segments (with many overlaps)
And the output would be non overlapping segments with counts for each one of them.
An example input is shown in this MATLAB plot.
As you can see, some lines look thicker because there are multiple overlaps.
As a result I would need just the skeleton with no overlaps, but I would need the information of the overlaps as a number for each output segment.
I am not an expert in working with segments and geometric problems.
What would be the most efficient way of solving such a problem? I am using MATLAB, but a code example in any high-level language would help. Thanks!
EDIT
As requested, here is also a sample dataset; you can get it from this Google Drive link:
https://drive.google.com/open?id=1r2hkG7gI0qhzfP-Mmn8HzIil1o47Z2nA
The input is a csv with 276 cable segments
The output is a csv with 58 cable segments (the reduced segments) + an extra column containing the number of overlapping cables for each segment kept.
The input could contain many more segments. The idea is that the reduction should eliminate cables that are parallel and overlapping with each other, with a certain tolerance.
For example, if two cables are parallel but far away from each other, both should be kept.
I don't care about the exact tolerances; it's fine if the output is somewhat different. I just need an idea of how to solve this problem efficiently, because the code should run fast even with very many segments as input.
MATLAB is probably not the best-suited language for geometric manipulation; PostgreSQL/PostGIS would be a better tool. But if you don't have a choice, here is one solution to get the skeleton of a line:
% Random points
P = [0 0;
1 1;
2 1;
1.01 1.02;
0.01 0.01];
% positive buffer followed by a negative buffer
pol = polybuffer(polybuffer(P,'lines',0.25,'JointType','miter'),-0.249,'JointType','miter');
plot(P(:,1),P(:,2),'r.','MarkerSize',10)
hold on
plot(pol)
axis equal
% Drop the duplicate with uniquetol (with a tolerance of 0.005) to get the centerline
utol = uniquetol(pol.Vertices,0.005,'ByRows',true)
hold off;
plot(utol(:,1),utol(:,2),'b')
Result:
And the center line:
I have the following set of data:
X = [4.692 6.328 4.677 6.836 5.032 5.269 5.732 5.083 4.772 4.659 ...
     4.564 5.627 4.959 4.631 6.407 4.747 4.920 4.771 5.308 5.200 ...
     5.242 4.738 4.758 4.725 4.808 4.618 4.638 7.829 7.702 4.659]'; % Sample set (30x1 column vector)
I fitted a Pareto distribution to this using the maximum likelihood method and obtained the following graph:
The following bit of code draws the histogram:
[N,edges,bin] = histcounts(X,'BinMethod','auto');    % automatic binning
bin_middles = mean([edges(1:end-1); edges(2:end)]);  % bin centers
f_X_sample = N/trapz(bin_middles,N);                 % normalize to unit area
bar(bin_middles,f_X_sample,1);
Am I doing this right? I checked 100 times and the Pareto distribution is indeed optimal, but it seems awfully different from the histogram. Is there an error that may be causing this? Thank you!
I would agree with @tashuhka's comment that you need to think about how you're binning your data.
Imagine the extreme case where you lump everything together into one bin, and then try to fit that single point to a distribution. Your PDF would look nothing like your single square bar. Split into two bins, and now the fit still sucks, but at least one bar is (probably) a little bigger than the other, etc., etc. At the other extreme, every data point has its own bar and the bar graph is nothing but a random forest of bars with only one count.
There are a number of different strategies for choosing an "optimal" bin size that minimizes the number of bins but maximizes the representation of the underlying PDF.
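For example, one quick sanity check, sketched below and reusing X and the unit-area normalization from your snippet, is to rebuild the histogram with a few different binning rules and see how much the shape moves around:
% Compare a few binning rules on the same 30-point sample X
methods = {'auto', 'fd', 'sturges', 'sqrt'};
for i = 1:numel(methods)
    subplot(2, 2, i);
    [N, edges] = histcounts(X, 'BinMethod', methods{i});
    centers = mean([edges(1:end-1); edges(2:end)]);
    bar(centers, N / trapz(centers, N), 1);   % same unit-area normalization as above
    title(methods{i});
end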
Finally, note that you only have 30 points here, so your other problem may be that you just haven't collected enough data to really nail down the underlying PDF.
I'm looking for a bit of guidance on using CONVN to calculate moving averages in one dimension on a 3d matrix. I'm getting a little caught up on the flipping of the kernel under the hood and am hoping someone might be able to clarify the behaviour for me.
A similar post that still has me a bit confused is here:
CONVN example about flipping
The Problem:
I have daily river flow and weather data for a watershed at different source locations.
So the matrix is as so,
dim 1 (the rows) represent each site
dim 2 (the columns) represent the date
dim 3 (the pages) represent the different type of measurement (river height, flow, rainfall, etc.)
The goal is to try and use CONVN to take a 21 day moving average at each site, for each observation point for each variable.
As I understand it, I should just be able to use a kernel such as:
ker = ones(1,21) ./ 21;     % 21-day averaging kernel along dim 2 (the dates)
mat = randn(150,365*10,4);  % sites x days x measurement types (example data)
avgmat = convn(mat,ker,'valid');
I tried playing around and created another kernel which should also work (I think) and set ker2 as:
ker2 = [zeros(1,21); ker; zeros(1,21)];
avgmat2 = convn(mat,ker2,'valid');
The question:
The results don't quite match and I'm wondering if I have the dimensions incorrect here for the kernel. Any guidance is greatly appreciated.
Judging from the context of your question, you have a 3D matrix and you want to find the moving average of each row independently over all 3D slices. Your first snippet should work. However, the 'valid' flag returns a matrix whose size is valid with respect to the boundaries of the convolution; take a look at the first point of the post you linked to for more details.
Specifically, each row of the output will be 20 entries shorter than the input (the kernel length minus one), because it's only once the convolution kernel is completely contained inside a row of the matrix, i.e. from the 21st entry onwards, that you get valid results (no pun intended). If you'd like to see the entries at the boundaries, you'll need the 'same' flag to keep the same size as the input, or the 'full' flag (the default), which gives you the output starting from the most extreme outer edges; but bear in mind that those edge values are averaged against zero padding, so they wouldn't be what you expect anyway.
As for why your two kernels don't quite match: ker2 is 3x21, so with 'valid' the convolution also trims the row dimension (150 rows become 148) and shifts the row alignment by one, even though the averaged values themselves are the same. If I'm interpreting your question correctly, the 1x21 kernel with the 'valid' flag is what you want, but bear in mind that 20 entries per row are dropped at the edges. All in all, your code should work; just be careful how you interpret the results.
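A quick standalone check, with small made-up sizes rather than your actual data, makes both effects visible:
% Size check: 'valid' drops length(ker)-1 columns, 'same' keeps the input
% size (with zero-padded edges), and a padded 3-row kernel also trims the
% row dimension under 'valid'.
mat  = randn(3, 20);
ker  = ones(1, 5) / 5;
size(convn(mat, ker, 'valid'))   % 3 x 16
size(convn(mat, ker, 'same'))    % 3 x 20
ker2 = [zeros(1, 5); ker; zeros(1, 5)];
size(convn(mat, ker2, 'valid'))  % 1 x 16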
BTW, you have a symmetric kernel, and so flipping should have no effect on the convolution output. What you have specified is a standard moving averaging kernel, and so convolution should work in finding the moving average as you expect.
Good luck!
My data looks like this:
I am trying to find the following info about my data:
Rate of rise on the "transient" portion
time to steady state and steady state average
I think that stepinfo is my best bet to do this, but it seems to want to take the final value as the steady-state value, which isn't giving me the best result. And I cannot find the average of the steady-state portion until I know when it begins... Is there a way to set some bounds on the steady-state search? In the picture I linked, steady state could be defined as, say, data staying within +/- 0.25 for 50 data points.
What you can do is:
1. Decide what the slope of the curve should be at the transition between transient and steady state.
2. Smooth your signal.
3. Find the difference between each pair of adjacent points on the graph.
4. Find the first place where the difference between points is lower than the value you selected in step 1.
To do this, keep in mind:
The difference at the beginning is on average zero, so you have to skip those values.
One way to do this is simply: x(x < 0.1 * max(x)) = []; That way you remove the entire start of the curve. You won't need it for this part anyway. Remember to keep a backup of x.
A simple way to smooth the signal is: smooth_x = arrayfun(@(t) mean(x(t:t+k)), 1:numel(x)-k). You need to find an appropriate value for k.
Even a smoothed curve will have "bumps", so you might want to compare points that are not adjacent, for instance check the difference x(k+10) - x(k). If the average incline between those two points is lower than the value you selected in step 1, you're happy. A combination of find and diff should do the trick here.
Once you have smoothed it, you can use diff to find the average inclination for both the transient and the stable part.
There is no way for me to tell where the curve goes from transient to stable. That's a decision you need to make. It could for instance be less than 0.2 l/min per 10 seconds.
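Putting those pieces together, a rough sketch on a synthetic signal might look like this; the window length, gap, and slope threshold are guesses you would tune to your own data.
% Rough sketch of the recipe above, on a synthetic transient-plus-noise signal.
% In practice x is your measurement; k, gap and the 0.05 threshold are guesses to tune.
t = 0:0.1:60;
x = 5*(1 - exp(-t/5)) + 0.05*randn(size(t));           % synthetic transient + noise
k = 10;                                                % smoothing window length
smooth_x = arrayfun(@(i) mean(x(i:i+k)), 1:numel(x)-k);
smooth_x(smooth_x < 0.1*max(smooth_x)) = [];           % drop the flat start, as suggested above
gap = 10;                                              % compare points 10 samples apart
incline = smooth_x(1+gap:end) - smooth_x(1:end-gap);   % rise over each gap
ss_start = find(incline < 0.05, 1);                    % first "flat enough" point
ss_avg = mean(smooth_x(ss_start:end))                  % steady-state average
rise_rate = (smooth_x(ss_start) - smooth_x(1)) / (ss_start - 1)   % average rise per sample on the transient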