I have discrete empirical data that forms a histogram with gaps; that is, no observations were made of certain values, although in reality those values may well occur.
Here is a figure of the scatter plot.
So my question is: should I interpolate between x-axis values to make bins for the histogram? If so, what would you suggest as best practice?
Don't do it.
With that many sample points, the probability of getting empty bins from a smooth distribution is quite low. There's some underlying reason they're empty, which you may want to investigate. I can think of two possibilities:
Your data actually is discrete (perhaps someone rounded off to 1 significant figure during data collection, or quantization error was significant in an ADC) and then unit conversion caused irregular gaps. Even conversion from .12 and .13 to 12, 13 as shown could cause this issue, if .12 is actually represented as .1199999999998 inside the computer. But this would tend to double up in a neighboring bin, and the gaps would tend to be regularly spaced, so I doubt this is the cause. (For example, if 128 trials of a Bernoulli coin-flip experiment were done for each data point, and someone recorded the percentage of heads in each series to the nearest 1%, you could multiply by 1.28/% to try to recover the actual number of heads, but there'd be 28 empty bins; a quick sketch verifying this count appears below.)
Your distribution has real lobes. Because the frequency is significantly reduced following each empty bin, I favor this explanation.
But these are just starting suggestions for your own investigation.
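As a quick check of that empty-bin count, here is a minimal MATLAB sketch (my own illustration of the rounding-and-recovery scenario, not taken from your data):

% Record heads-out-of-128 to the nearest 1%, then multiply by 1.28
% to try to recover the raw head counts.
heads = 0:128;                       % every possible head count
pct = round(heads / 128 * 100);      % what gets written down (nearest 1%)
recovered = round(pct * 1.28);       % attempted recovery of raw counts
emptyBins = 129 - numel(unique(recovered))   % 28 of the 129 bins stay empty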
I am plotting two years' worth of data to see the change in the distribution of values, but in one of the histograms (one of the years), the first column dominates because there are many zeros in the data set. Would you recommend creating a bar strictly for zeros? How would I create this in MATLAB? Or how can I better manipulate the histogram to reflect the actual data set and make it clear that zeros account for the sharp initial rise?
This is more of a statistical question. If you have good reason to ignore the zeros, for example because your data acquisition system produced them due to a malfunction, you can simply get rid of them with
hist(data(data~=0))
But you may not need to look at histograms at all; you can use the variance or even the standard deviation to see how much your data shifted.
Furthermore, to compare data populations, boxplots are much better and easier to handle:
doc boxplot
If, on the other hand, your zeros are genuine to your data, you have to keep them! Even then the boxplot function might help you, because the zeros might show up as outliers (little red crosses) or the box will simply start at the zero line.
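If you do want a dedicated bar for the zeros, here is a minimal sketch of one way to do it (the bin count of 20 and the red color are arbitrary choices to adapt to your data):

% Histogram of the nonzero values plus a separate bar for the zero count.
nz = data(data ~= 0);                % nonzero values only
[counts, centers] = hist(nz, 20);    % 20 bins for the nonzero part
bar(centers, counts); hold on
bar(0, sum(data == 0), 'r')          % dedicated red bar for the zeros
legend('nonzero values', 'zeros')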
I am generating some data whose plots are as shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So, for instance, if you expect to see the pattern above (a fast drop, a linear section going up or down, and a fast rise) you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
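A minimal sketch of the derivative idea (the variable y and the threshold factor of 5 are assumptions you would tune on your own data):

% Flag the transition points where the derivative is unusually large.
dy = diff(y);                        % first difference approximates the derivative
thresh = 5 * median(abs(dy));        % robust scale estimate for "large"
idx = find(abs(dy) > thresh);        % samples where the curve jumps
startPoint = idx(1);                 % first fast drop
endPoint = idx(end) + 1;             % last fast rise (+1 for the diff offset)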
If your pattern is not so easy to define but you are expecting a linear trend, you might fit the data to an appropriate class of curve using fit and then detect outliers as those points whose error from the fit exceeds a given threshold.
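A sketch of that fit-and-threshold approach, assuming a linear trend and a 3-sigma cutoff (both assumptions to adjust for your system):

% Fit a line to the data and flag points that sit far from the fit.
x = (1:numel(y))';
p = polyfit(x, y(:), 1);             % linear fit; use fit() for other curve classes
residuals = y(:) - polyval(p, x);
outliers = abs(residuals) > 3 * std(residuals);
yClean = y(~outliers);               % keep only the well-behaved points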
In either case you still have to choose thresholds; the mean, variance, and higher-order moments can help here, but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).
I'm trying to understand how MINPEAKDISTANCE works. I went back to the documentation, but it wasn't very clear how this parameter works.
Can you kindly clarify it a bit?
Minimum peak separation: Specify the minimum peak distance, or minimum separation between peaks, as a positive integer. You can use the 'MINPEAKDISTANCE' option to specify that the algorithm ignore small peaks that occur in the neighborhood of a larger peak. When you specify a value for 'MINPEAKDISTANCE', the algorithm initially identifies all the peaks in the input data and sorts those peaks in descending order. Beginning with the largest peak, the algorithm ignores all identified peaks not separated by more than the value of 'MINPEAKDISTANCE'. Default: 1
So if you consider your peak heights as values in the "y" direction, then the separation this is talking about is in the "x" direction. For example, look at this image (from the MATLAB docs; if you have the Signal Processing Toolbox you can get the data too: load noisyecg.mat):
Let's say you just want to identify those 4 big distinct peaks, but not the hundreds of little peaks caused by noise. Setting MINPEAKDISTANCE is a feasible way to accomplish this because the noisy peaks occur at a much higher frequency; i.e., they are closer to each other in the "x" direction, or have a smaller distance separating them, than the big peaks do. So choosing a large enough MINPEAKDISTANCE (say 100 or 350, for example, depending on which peaks you're interested in) would help you avoid detecting these undesired noise peaks.
Try findpeaks on this data with different MINPEAKDISTANCE values and see what you get!
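For example, a minimal sketch using the noisy ECG data (assuming the variable name noisyECG_withTrend from that example file):

% Compare findpeaks with and without a minimum peak distance.
load noisyecg.mat                    % provides noisyECG_withTrend
signal = detrend(noisyECG_withTrend);          % remove the slow drift first
[pksAll, locsAll] = findpeaks(signal);         % every local maximum, noise included
[pksBig, locsBig] = findpeaks(signal, 'MINPEAKDISTANCE', 350);  % only well-separated peaks
plot(signal); hold on
plot(locsBig, pksBig, 'rv')          % mark the surviving peaks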
If you've got noisy data, you may find that instead of one solid peak, you get lots of small ones (see the following image).
The important data here is when the signal is high and when it is low; you don't care about small variations in value, and you only want to use one of those peaks rather than looking at all the smaller local ones around it. If you know the frequency of your signal (i.e. how often the peaks should occur), you can tell the function to ensure that the peaks are separated by a certain amount.
In the above example, the peak comes every 15 milliseconds and lasts for 5 milliseconds, so you might set your MINPEAKDISTANCE parameter to 15 or so (note that the distance is specified as a positive integer number of samples, so convert from time using your sampling rate).
When I run some cepstral coefficient data generated from .wav files in ELKI with the k-means algorithm (k = 32 and max iter = 100), it gives negative values for the following pair-counting measures: Jaccard = -3.3627, Recall = -3.3627, Rand = -3.3627, and F1-Measure = 2.8465. I searched for the range of these measures and it should be [0, 1]. I ran this data with several other algorithms and have the same problem. Can anyone please interpret this?
The values should be in the range [0, 1], but only if:
only if you have complete labels (missing labels can be skipped, but I'm not sure if our implementation handles this case yet)
the clustering must be a complete, non-overlapping, crisp partitioning
Furthermore, when clusters degenerate (depending on your data and seeding, this may happen with k-means) there could be empty clusters, and these again may yield undesired results with a literal implementation of these measures.
How did you label your data?
We try our best to also handle corner cases right; but we can only diagnose and fix what we have observed and can reproduce.
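For reference, the standard pair-counting definition (a sketch of the usual formula, not necessarily ELKI's exact implementation) shows why negative values cannot come from well-formed counts. With $a$ the number of pairs grouped together in both the clustering and the labels, $b$ the pairs together only in the clustering, and $c$ the pairs together only in the labels,

$$\mathrm{Jaccard} = \frac{a}{a + b + c} \in [0, 1],$$

so a negative result points to incomplete labels or a degenerate partitioning rather than a property of the measure itself.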
I am trying to resample/recreate already-recorded data for plotting purposes. I thought this was the best place to ask the question (besides dsp.se).
The data is sampled at a high frequency and contains too many data points, so it is not suitable for plotting in the time domain (not enough memory). I want to downsample it with minimal loss. The sampling interval of the resulting data doesn't need to be uniform (again, it is for plotting purposes, not analysis), although the input data is equally sampled.
When we use the regular resample command from MATLAB/Octave, it can distort steep pieces of the curve.
What is the best approach here?
For reference, here are two pictures (found on tex.se). The first image is the regular resample; the second is better-resampled data that behaves well around the peaks.
You should try this set of files from the File Exchange. It computes an optimal lookup table based on either a maximum number of points or a given error. You can choose natural, linear, or spline interpolation methods. Spline will give the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
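If you would rather roll your own, one common plot-only approach is min/max decimation: keep the minimum and maximum sample of each block so sharp peaks survive the downsampling. A minimal sketch (the function name minmaxDecimate and the blockSize parameter are my own, not part of the File Exchange tool):

% Min/max decimation for plotting only (not for analysis).
% Assumes y is uniformly sampled; blockSize controls the reduction factor.
function [xr, yr] = minmaxDecimate(x, y, blockSize)
    n = floor(numel(y) / blockSize) * blockSize;   % drop the ragged tail
    yb = reshape(y(1:n), blockSize, []);           % one column per block
    xb = reshape(x(1:n), blockSize, []);
    [ymin, imin] = min(yb);                        % extremes within each block
    [ymax, imax] = max(yb);
    cols = 1:size(yb, 2);
    xmin = xb(sub2ind(size(xb), imin, cols));
    xmax = xb(sub2ind(size(xb), imax, cols));
    xr = [xmin; xmax]; yr = [ymin; ymax];          % pair min and max per block
    [xr, order] = sort(xr(:));                     % restore time order
    yr = yr(:); yr = yr(order);
end

Plot the result with plot(xr, yr); each block contributes only two points, but every peak's extreme value is preserved.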