Removing outlier with multiple consecutive values similar to a step - matlab

I am processing an ocean wave data, where I have a timeseries of the Peak Wave Period (Tp (s)). The typical values for Tp ranges from 2s-15s for this location. However, it may reach higher values above 15s during extreme events such as a storm. Hence, removing data based on a threshold value is not suitable.
As you can see in the figure below, there are multiple values that are outliers. The high values occurred for a small duration and then dropped down. An extreme event would last for hours.
I have tried the functions filloutlier and medfilt1, but they are not successful in removing the outlier, which I presume is because multiple consecutive outlier data points exists.
Is there a built-in Matlab function exist to handle such situation?
Else, if I need to write my own function to filter such signals, could you provide some guidance.
Attaching a small data sample here as well: Download Data
Dataset plot (Only the segment in the provided data above)
Zoomed in plot at one of the outliers.

If we know that we need the values to be in the range of (2,15), we can clip the values > 15 to 15.
Another way is to use the value of a high percentile (say 95) of the observations and clip values about it.
filloutlier, medfilt1 methods are not removing values like 18 because they are not treating them as outliers. 18 is not very far away from the typical range of (2, 15).

Related

Boxplot is broken, only showing one line

so my data centres around different treatments and how they impact the day of germination. image of dodgy boxplot data
A while ago whilst making violin plots in R to show the distribution of when germination occurs according to treatment, I attempted to add a boxplot as a descriptive statistic and was met with only one line.
I contacted many people who simply had no idea what the issue was, I used this same data in another violin plot as part of a bigger data collection with more treatments including this one.
I moved on from this and found it odd, now when I have come to perform stats tests in SPSS, I have the same problem as imaged below. When I try a Mann Whitney U test I am told "cannot compute" due to not having solely two variables, when I try a Kruskal Wallis test I am met with the dodgy boxplot below and I am told pairwise comparisons cannot be done due to less than 3 test fields (i.e. 2).
I am at an absolute loss, I have tried rewriting the data out, copying data labels with 'stratified' 'strat' 's' etc and I have no idea where the problem could lie, if anyone could give me any guidance this would be really appreciated!
Thank you
The dependent variable in question appears to have only values 1, 2, and 3 in the Stratified group. If there is at least one case with a value of 1, at least one case with a value of 3, but most values at 2, then a box plot like you're seeing would be expected. In SPSS, run the EXAMINE procedure (Analyze>Descriptive Statistics>Explore in the menus), specifying the same dependent variable and grouping variable, and asking for percentiles. The box plots should match what you're getting, and in the percentiles table you should see that Tukey's hinges show the same value of 2 for the 25th, 50th, and 75th percentiles.
Tukey's hinges are the basis for the box and the line in box plots. The line is at the median or 50th percentile, and the upper and lower box edges are at the 25th and 75h percentiles, respectively. When all three coincide, you get just a line instead of a box.
There are two types of outlying values identified in box plots in SPSS. Points greater than 1.5 box lengths below or above the box edges are outliers, marked with circles, and points greater than 3 box lengths below or above the box edges are extremes, marked with asterisks. Since the box length here is 0, anything at other values is automatically an extreme.
Pairwise comparisons following a Kruskal-Wallis test are available only when there are at least three groups, since with only two groups the overall or omnibus test has already compared the two groups. I'm not sure what the issue was when trying to run a Mann-Whitney test.

Efficient Matlab way of reducing 2D segments?

I need to write a function to simplify a set of segments. In particular:
Given the set of 2D segments (x and y coordinates)
Keep only one replica in case of overlapping (or almost overlapping) segments
For each kept part keep count of how many overlaps were present on it.
So the input would be a big set of segments (with many overlaps)
And the output would be non overlapping segments with counts for each one of them.
An example input is shown in this matlab plot
As you can see some lines look thicker cause there are multiple overlaps.
As a result I would need just the skeleton with no overlaps, but I would need the information of the overlaps as a number for each output segment.
I am not an expert working with segments and geometrical problems.
What would be the most efficient way of solving such problem? I am using Matlab, but a code example in any high level language would help. Thanks!
EDIT
As requested here is also a sample dataset, you can get it from this google drive link
https://drive.google.com/open?id=1r2hkG7gI0qhzfP-Mmn8HzIil1o47Z2nA
The input is a csv with 276 cable segments
The output is a csv with 58 cable segments (the reduced segments) + an extra column containing the number of overlapping cables for each segment kept.
The input could contain many more segments. The idea is that the reduction should eliminate cables that are parallel and overlapping with each other, with a certain tolerance.
For example if 2 cables are parallel but far away they should be kept both.
I don't care about the tolerances it's fine if the output is different, I just need an idea on how to solve this problem efficiently cause the code should run fast even with many many segments as input.
Matlab is probably not the most suited language for geometrical manipulation. PostgreSQL/PostGIS would be a better tool, but if you don't have choice here is one solution to get the skeleton of a line:
% Random points
P = [0 0;
1 1;
2 1;
1.01 1.02;
0.01 0.01];
% positive buffer followed by a negative buffer
pol = polybuffer(polybuffer(P,'lines',0.25,'JointType','miter'),-0.249,'JointType','miter');
plot(P(:,1),P(:,2),'r.','MarkerSize',10)
hold on
plot(pol)
axis equal
% Drop the duplicate with uniquetol (with a tolerance of 0.005) to get the centerline
utol = uniquetol(pol.Vertices,0.005,'ByRows',true)
hold off;
plot(utol(:,1),utol(:,2),'b')
Result:
And the center line:

y axis limit - appears to alter the data that is analysed

I have a bunch of data where the hours taken to process an item ranges from 3-3000 hours. most of the data is <1000 hours
I am creating a boxplot of that data. I have an large number of outliers within the data that I don't need to display, but I do need to analyse.
I have tried to use both 'scale_y_continuous(limits=c(0,1000))' and 'ylim(0,1000)' that appears to change the data that is used to create the boxplot, I altered the limits to '20' to test this theory and I get a complete plot, which can only be because the method i'm using to limit the axis also limits the range of data analysed.
I'd like to limit the y axis but not limit the range of data that is used in the analysis, what function do I use to accomplish that?
many thanks
it appears that it's 'coord_cartesian(ylim = c(nnn,nnn))+' that I needed to use.

How to delete the same peak value in peak analysis and to find the duration of each event (which contain peak value)?

I am a newbie in matlab programming. Actually I have asked this question in mathwork website, but still I did not get the answer, so maybe I can get it here.
I am trying to do peak analysis to find the peak flow of storm water flow. Here is my code :
%% Peak flow analysis
% define data which are used for analysis
Date=finalCSVnew{:,1};
Flow=finalCSVnew{:,7};
figure(2);
[pks,locs]=findpeaks(Flow,Date,'MinPeakProminence',1,'MinPeakDistance',1);
findpeaks(Flow,Date,'MinPeakProminence',1,'MinPeakDistance',1);
text(locs+.02,pks,num2str((1:numel(pks))'));
xlabel('Date and Time');
ylabel('Flow [m3/h]');
title('Find All Peak Flows');
datacursormode on
I managed to plot the peak flow, and find the details about pks and locs. Here, each event should contain one peak flow. So in my case (based on attached picture) I should have 16 events. However, there is duplicate value in event 1 and event 2 which I want to delete one of them, but I am confused about how to do it. Also, I try to find the tutorial for calculating the duration of each event in the website, but I found nothing. I want to know about how to calculate the duration (probably in minutes) based on the peak flow data I got and to delete the peak value in the plot and in pks data which contain duplicates. Is it possible to do that? Could you please help me? Thank you very much for your help.peak flow events
For duplicate values, you can use the unique function to find values which are the same and remove them.
C = unique(pks) % find any unique values and output values without repetitions
https://au.mathworks.com/help/matlab/ref/unique.html
Provide more details about the duration you want to measure. Do you want to measure the duration of just the peak flow? Or of the entire curve leading up to the peak?

MATLAB findpeaks using minpeakdistance

Suppose that I have a data set that contains a cyclical event and I am identifying a threshold (peaks) to separate each event (to eventually find the coefficient of variation).
I have multiple trials of this data - the speed of these events is sometimes significantly faster than others. This data is also a bit noisy, so some 'false local maximas' are sometimes picked up if I don't set the 'minpeakdistance' constraint within the 'findpeaks' function.
I am trying to find a way to ensure that regardless of speed, I am finding 'true local maximas'. I have been visually inspecting each trial to ensure that I have identified only true peaks - if I have also identified false peaks, I have been adjusted the mpd value for that specific trial - but this is literally going to take days.
Any suggestions?
Example:
For most trials of my collection, the following line of code only identifies true maximas:
mpd = 'minpeakdistance';
eval(['[t' num2str(a) '.Mspine.pks(:,1),t' num2str(a) '.Mspine.locs] = findpeaks(t' num2str(a) '.Mspine.xyz(:,1), mpd,25);']);
But, for trial 11, they are moving much faster, so the mpd has to be adjusted to 9; however, if I apply an mpd value of 9 to all of the trials, it will pick up false local maximas.
theI would go over to the frequency domain to find this "cyclical event". Specifically, if you know the rate at which data is sampled/generated, using a FFT will indicate the relative strengths of all periodic events in your data. Have a look at: http://www.mathworks.se/help/matlab/ref/fft.html