Resampling data with minimal loss of information in time-domain - matlab

I am trying to resample/recreate already recorded data for plotting purposes. I thought this is best place to ask the question (besides dsp.se).
The data is sampled at high frequency, contains to much data points and not suitable for plotting in time domain (not enough memory). i want to sample it with minimal loss. The sampling interval of the resulting data doesn't need to be same (well it is again for plotting purposes, not analysis) although input data in equally sampled.
When we use the regular resample command from matlab/octave, it can distort stiff pieces of the curve.
What is the best approach here?
For reference I put two pictures found in tex.se)
First image is regular resample
Second image is a better resampled data that can well behave around peaks.

You should try this set of files from the File Exchange. It computes optimal lookup table based on either the maximum set of points or a given error. You can choose from natural, linear, or spline for the interpolation methods. Spline will have the smallest table size but is slower than linear. I don't use natural unless I have a really good reason.
Sincerely,
Jason

Related

Fitting a gaussian to data with Matlab

I want to produce a figure like the following one (found in a paper)
I think it is done using histfit
However, histfit doesen't really work with my data. The bars exceed the curve. My data is not really normally distributed but I want all the bins to be inside the curve except some outliers. Is there any way to fit a gaussian and plot it like in the above figure?
Edit
This is what histfit(data)has given
I want to fit a gaussian to it and keep some values as ouliers. I need to only use a normal distribution as it is going to be used in a Kalman filter based on the assumption that the data is normally distributed. The fact that is not really normally distributed will certainly affect the performance of the filter but I have to feed it first with the parameters of a normal distribution , i.e mean and std.
I'm not sure you understand how a fit works, if your data is kinda gaussian the function will plot the fitted curve based on the values, some bars will be above some below, it all depends on how the least squares are minimized over the entire curve. you can't force the fit to look different, this is the result of the fitting process. If your data is not normally distributed then the goodness of the fit is poor. without having more info or data, this is the best I can answer :)

Automatically truncating a curve to discard outliers in matlab

I am generation some data whose plots are as shown below
In all the plots i get some outliers at the beginning and at the end. Currently i am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

how to make a smooth plot in matlab

I have about 100 data points which mostly satisfying a certain function (but some points are off). I would like to plot all those points in a smooth curve but the problem is the points are not uniformly distributed. So is that anyway to get the smooth curve? I am thinking to interpolate some points in between, but the only way that comes up to my mind is to linearly insert some artificial points between two data points. But that will show a pretty weird shape (like some sharp corner). So any better idea? Thanks.
If you know more or less what the actual curve should be, you can try to fit that curve to your points (e.g. using polyfit). Depending on how many points are off and how far, you can get by with least squares regression (which is fairly easy to get working). If you have too many outliers (or they are much too large/small), you can also try robust regression (e.g. least absolute deviation fitting) using the robustfit function.
If you can manually determine the outliers, you can also fit a curve through the other points to get better results or even use interpolation methods (e.g. interp1 in MATLAB) on those points to get a smoother curve.
If you know which function describes your data, robust fitting (using, e.g. ROBUSTFIT, or the new convenient functions LINEARMODEL and NONLINEARMODEL with the robust option) is a good way to go if there are outliers in your data.
If you don't know the function that describes your data, but want a smooth trendline that is little affected by outliers, SMOOTHN from the File Exchange does an excellent job in my experience.
Have you looked at the use of smoothing splines? Like interpolating splines, but with the knot points and coefficients chosen to minimise a least-squares error function. There is an excellent implementation available from Matlab central which I have used successfully.

Histogram computational efficiency

I am trying to plot a 2 GB matrix using MATLAB hist on a computer with 4 GB RAM. The operation is taking hours. Are there ways to increase the performance of the computation, by pre-sorting the data, pre-determining bin sizes, breaking the data into smaller groups, deleting the raw data as the data is added to bins, etc?
Also, after the data is plotted, I need to adjust the binning to ensure the curve is smooth. This requires starting over and re-binning the raw data. I assume the strategy involving the least computation would be to first bin the data using very small bins and then manipulate the bin size of the output, rather than re-binning the raw data. What is the best way to adjust bin sizes post-binning (assuming the bin sizes can only grow and not shrink)?
I don't like answers to StackOverflow Questions of the form "well even though you asked how to do X, you don't really want to do X, you really want to do Y, so here's a solution to Y"
But that's what i am going to do here. I think such an answer is justified in this rare instance becuase the answer below is in accord with sound practices in statistical analysis and because it avoids the current problem in front of you which is crunching 4 GB of datda.
If you want to represent the distribution of a population using a non-parametric density estimator, and you wwish to avoid poor computational performance, a kernel density estimator (KDE) will do the job far better than a histogram.
To begin with, there's a clear preference for KDEs versus histograms among the majority of academic and practicing statisticians. Among the numerous texts on this topic, ne that i think is particularly good is An introduction to kernel density estimation )
Reasons why KDE is preferred to histogram
the shape of a histogram is strongly influenced by the choice of
total number of bins; yet there is no authoritative technique for
calculating or even estimating a suitable value. (Any doubts about this, just plot a histogram from some data, then watch the entire shape of the histogram change as you adjust the number of bins.)
the shape of the histogram is strongly influenced by the choice of
location of the bin edges.
a histogram gives a density estimate that is not smooth.
KDE eliminates completely histogram properties 2 and 3. Although KDE doesn't produce a density estimate with discrete bins, an analogous parameter, "bandwidth" must still be supplied.
To calculate and plot a KDE, you need to pass in two parameter values along with your data:
kernel function: the most common options (all available in the MATLAB kde function) are: uniform, triangular, biweight, triweight, Epanechnikov, and normal. Among these, gaussian (normal) is probably most often used.
bandwith: the choice of value for bandwith will almost certainly have a huge effect on the quality of your KDE. Therefore, sophisticated computation platforms like MATLAB, R, etc. include utility functions (e.g., rusk function or MISE) to estimate bandwith given oother parameters.
KDE in MATLAB
kde.m is the function in MATLAB that implementes KDE:
[h, fhat, xgrid] = kde(x, 401);
Notice that bandwith and kernel are not supplied when calling kde.m. For bandwitdh: kde.m wraps a function for bandwidth selection; and for the kernel function, gaussian is used.
But will using KDE in place of a histogram solve or substantially eliminate the very slow performance given your 2 GB dataset?
It certainly should.
In your Question, you stated that the lagging performance occurred during plotting. A KDE does not require mapping of thousands (missions?) of data points a symbol, color, and specific location on a canvas--instead it plots a single smooth line. And because the entire data set doesn't need to be rendered one point at a time on the canvas, they don't need to be stored (in memory!) while the plot is created and rendered.

How do I compress data points in time-series using MATLAB?

I'm looking for some suggestions on how to compress time-series data in MATLAB.
I have some data sets of pupil size, which were gathered during 1 sec with 25,000 points for each trial(I'm still not sure whether it is proper to call the data 'timeseries'). What I'd like to do from now is to compare them with another data, and I need to compress the number of points into about 10,000 or less, minimizing loss of the information. Are there any ways to do it?
I've tried to search how to do this, but all that I could find out was the way to smooth the data or to compress digital images, which were already done or not useful to me.
• The data sets simply consist of pupil diameter, changing as time goes. For each trial, 25,000 points of data were gathered during 1 sec, that means 1 point denotes the pupil diameter measured for 0.04msec. What I want to do is just to adjust this data into 0.1 msec/point; however, I'm not sure whether I can apply techniques like FFT in this case because it is the first time that I handle this kind of data. I appreciate your advices again.
A standard data compression technique with time series data is to take the fast Fourier transform and use the smallest frequency amplitudes to represent your data (calculate the power spectrum). You can compare data using these frequency amplitudes, though for the to lose the least amount of information you would want to use the frequencies with the largest amplitudes -- but then it becomes tricky to compare the data... Here is the standard Matlab tutorial on FFT. Some other possibilities include:
-ARMA models
-Wavelets
Check out this paper on the "SAX" method, a modern approach for time-series compression -- it also discusses classic time-series compression techniques.