plotting histogram in matlab with highly unequal distribution - matlab

So I am plotting two years worth of data to see the change in distribution of values, but with one of the histograms, one of the years is heavily dependent upon the first column, this is because there are many zeros in the data set. Would you recommend creating a bar strictly for zeros? How would I create this index in matlabd? Or how can I better manipulate the histogram to reflect the actual data set and make it clear that zeros are accounting for the sharp initial rise?
Thanks.

This is more of a statistical question. I f you have good reason to ignore the zeros, for example one of your data aquisition system produced them because of malfunction. you can get simply get rid of them by
hist(data(data~=0))
but you would not need to look at the histograms anyways you can use the variance or even standard deviation to see how much your data shifted.
Furthermore to compare data populations boxplots are much better and easier to handle.
doc boxplot
If on the other hand your zeros are genuine to your data you have to keep them! I am sorry but also here the boxplot function might help you because the zeros might be outliers (shown as little red crosses) or the box is just starting at the zero line.

Related

Fitting a gaussian to data with Matlab

I want to produce a figure like the following one (found in a paper)
I think it is done using histfit
However, histfit doesen't really work with my data. The bars exceed the curve. My data is not really normally distributed but I want all the bins to be inside the curve except some outliers. Is there any way to fit a gaussian and plot it like in the above figure?
Edit
This is what histfit(data)has given
I want to fit a gaussian to it and keep some values as ouliers. I need to only use a normal distribution as it is going to be used in a Kalman filter based on the assumption that the data is normally distributed. The fact that is not really normally distributed will certainly affect the performance of the filter but I have to feed it first with the parameters of a normal distribution , i.e mean and std.
I'm not sure you understand how a fit works, if your data is kinda gaussian the function will plot the fitted curve based on the values, some bars will be above some below, it all depends on how the least squares are minimized over the entire curve. you can't force the fit to look different, this is the result of the fitting process. If your data is not normally distributed then the goodness of the fit is poor. without having more info or data, this is the best I can answer :)

What is the official Matlab way to plot the values of histcounts into a histogram with any normalization option?

Assume that I have an array of counts (ideally returned by histcounts). Is there an official Matlab way to plot such a histogram with all the standard normalization options available?
It seems that the best suggestion I have is to get the counts from histcounts and then plot them with bar. Something like:
edges = linspace(0,bound,nbins);
hist_c = histcounts(X,nbins);
bar(edges(1:nbins-1),hist_c);
unfortunately as far as I know it seems that using bar is really not recommended according to this link. Probably because as its obvious from the code, it seems that it moves a lot of implementation details into user code (like produces edges array manually when only needing nbins or having to know if to use 1:nbins-1 vs 2:nbins).
Furthermore, which I believe is the worst, is that it leave the user to have to implement the normalization options on its own. One may point out that histcounts can do the normalization options for you, however, it can only do them given the data matrix X. If one had an extremely large matrix X, then one would be in trouble because producing the histogram counts of X could be done on the fly (as done in this question) but the other normalization options could not be easily be done on the fly. One practice the user could try to implement each normalization option as described by the equations in the documentation but it seems extremely inefficient to have users implement this by hand. Is there a way to get access to the code that actually performs this normalization?
In reality what my question is going for is, is there an official matlab way to produce histogram having only the histogram counts? In particular hiding all the implementation details of producing the counts, normalization, binning, edges, etc?
The ideal code in my mind should look like this to the user:
histogram_counts = get_hist_count(X)
plot_histogram(histogram_counts,'Normalization',normalization)
and produces the desired histogram plot.
Related question:
https://www.mathworks.com/matlabcentral/answers/332178-how-does-one-plot-a-histogram-from-the-histogram-counts
https://www.mathworks.com/matlabcentral/answers/275278-what-is-the-recommended-practice-for-plotting-the-outputs-of-histcounts
https://www.mathworks.com/matlabcentral/answers/91944-how-can-i-combine-the-options-histc-and-stack-in-a-bar-plot-in-matlab-7-4-r2007a#answer_101295

Matlab: Eliminating freak values in dataset

I am searching for a method to eliminate freak values out of given dataset. For example:
All these peaks should be eliminated. I've tried different filters like medfilt, but the peaks are still there. I've also tried a lowpass filter, but it didn't work either. I am a beginner in filtering signals, so I probably did it wrong.
You can download data sets for the x array here and y array here.
I could also imagine a loop to compare the values next to each other, but I am sure there has to be a built-in function?
Here is the result using medfilt1(input,15):
The peaks are vanishing, but the then I get these ugly steps, which I don't want.
just use median filter! medfilt1(data,3) will do if this is a 1 pixel "cosmic" spike. If the peaks remain, increase the window size to 5 or more...
EDIT:
so this is how op's data looks like:
So we see that the data is not exactly uniform or ordered, and there are a lot of data points in the spikes unlike what one first understand from the question (guys please plot your data correctly!) The question is now, is the data in the spikes or on in the baseline?

Automatically truncating a curve to discard outliers in matlab

I am generation some data whose plots are as shown below
In all the plots i get some outliers at the beginning and at the end. Currently i am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Interpolating a histogram matlab

I have discrete empirical data which forms a histogram with gaps. I.e. no observations were made of certain values. However in reality those values may well occur.
This is a fig of the scatter graph.
So my question is, SHOULD I interpolate between xaxis values to make bins for the histogram ? If so what would you suggest to be best practice?
Regards,
Don't do it.
With that many sample points, the probability (p-value) of getting empty bins if the distribution is smooth is quite low. There's some underlying reason they're empty, which you may want to investigate. I can think of two possibilities:
Your data actually is discrete (perhaps someone rounded off to 1 signficant figure during data collection, or quantization error was significantly in an ADC) and then unit conversion caused irregular gaps. Even conversion from .12 and .13 to 12,13 as shown could cause this issue, if .12 is actually represented as .11111111198 inside the computer. But this would tend to double-up in a neighboring bin and the gaps would tend to be regularly spaced, so I doubt this is the cause. (For example, if 128 trials of a Bernoulli coin-flip experiment were done for each data point, and someone recorded the percentage of heads in each series to the nearest 1%, you could multiply by 1.28/% to try to recover the actual number of heads, but there'd be 28 empty bins)
Your distribution has real lobes. Because the frequency is significantly reduced following each empty bin, I favor this explanation.
But these are just starting suggestions for your own investigation.