Rolling calculation of line of best fit - one sample at a time

In an application I'm working on, I receive (x,y) samples at a fixed rate (100 Hz), one at a time, and need to detect a 4-second sequence (400 samples) with a constant gradient (within a certain tolerance), and store this gradient for later use. Each sample pair is 8 bytes long, so I would need 3200 bytes of memory if I used the standard moving-window least squares regression algorithm for the purpose.

My question is: is there a formula for a continuous (recursive) calculation of the gradient of the line of best fit, one incoming sample at a time, without the need to keep an array of the last 400 samples? Something in the vein of an exponential moving average, where at any point in time only the latest averaged value needs to be known in order to update it with the new incoming sample.
Would appreciate any pointers to existing solutions.
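Not a pointer to an existing library, but a sketch of the exponential-weighting idea mentioned above: an exact 400-sample sliding window has to remember the old samples in order to subtract them back out of the sums, whereas an exponentially weighted fit needs only five running sums. A minimal Matlab sketch, with all names illustrative:

function slope = ewSlope(y, lambda)
% Exponentially weighted recursive estimate of the gradient of the line
% of best fit. Processes one sample at a time and keeps only five
% running sums (O(1) memory). lambda is a forgetting factor;
% lambda = 1 - 1/400 gives an effective window of roughly 400 samples.
slope = 0;
S1 = 0; Sx = 0; Sy = 0; Sxx = 0; Sxy = 0;   % exponentially weighted sums
for n = 1:numel(y)                          % simulates the sample stream
    S1  = lambda*S1  + 1;
    Sx  = lambda*Sx  + n;
    Sy  = lambda*Sy  + y(n);
    Sxx = lambda*Sxx + n^2;
    Sxy = lambda*Sxy + n*y(n);
    denom = S1*Sxx - Sx^2;
    if denom > 0
        slope = (S1*Sxy - Sx*Sy) / denom;   % weighted least-squares slope
    end
end
end

For example, ewSlope(2*(1:400)' + randn(400,1), 1 - 1/400) returns a value close to 2. One caveat: n grows without bound, so for very long runs you would periodically re-center the x values to keep the sums well conditioned.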

Related

Automatically truncating a curve to discard outliers in matlab

I am generating some data whose plots are as shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
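For illustration, a minimal sketch of the derivative approach, assuming y holds the curve as a vector; the smoothing span and the 3-sigma threshold are assumptions to tune on real data:

d = diff(movmean(y, 5));                 % slope of a lightly smoothed curve
keep = abs(d - median(d)) < 3*std(d);    % samples with "ordinary" slope
first = find(keep, 1, 'first');          % start of the linear section
last = find(keep, 1, 'last');            % end of the linear section
yTrimmed = y(first:last+1);              % +1 because diff is one shorter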
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
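A sketch of that second idea, with polyfit standing in for fit (so it runs without the Curve Fitting Toolbox) and a straight line as the assumed class of curve; the 3-sigma threshold is again an assumption:

x = (1:numel(y))';
p = polyfit(x, y(:), 1);                 % fit a straight line
resid = y(:) - polyval(p, x);            % error of each point from the fit
inliers = abs(resid) < 3*std(resid);     % threshold the residuals
yClean = y(inliers);                     % outliers discarded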
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Noisy signal correlation

I have two (or more) time series that I would like to correlate with one another, to look for common changes, e.g. both rising or both falling.
The problem is that the time series are all fairly noisy, with relatively high standard deviations, meaning it is difficult to see common features. The signals are sampled at a fairly low frequency (one point every 30 s) but cover reasonable time periods (2 hours +). It is often the case that the two signals are not the same length, for example 1 x 1 hour and 1 x 1.5 hours.
Can anyone suggest some good correlation techniques, ideally using built-in or bespoke matlab routines? I've tried autocorrelation just to compare lags within a single signal, but all I got back was a triangular shape with the max at 0 lag (I assume this means there is no obvious correlation except with itself?). Cross-correlation isn't much better.
Any thoughts would be greatly appreciated.
Start with a cross-covariance (xcov) instead of the cross-correlation. xcov removes the DC component (subtracts off the mean) of each data set and then does the cross-correlation. When you cross-correlate two square waves, you get a triangle wave. If you have small signals riding on a large offset, you get a triangle wave with small variations in it.
If you think there is a delay between the two signals, then I would use xcorr to calculate the delay. Since xcorr does an FFT of the signal, you should remove the means before calling xcorr; you may also want to consider adding a window (e.g. hanning) to reduce leakage if the data is not self-windowing.
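A sketch of that recipe, assuming a and b are the two series as vectors, sampled one point every 30 s as in the question (xcorr and hanning are in the Signal Processing Toolbox):

a = a(:) - mean(a);                  % column vectors, means removed
b = b(:) - mean(b);
a = a .* hanning(numel(a));          % optional window to reduce leakage
b = b .* hanning(numel(b));
[c, lags] = xcorr(a, b);
[~, idx] = max(c);
delaySamples = lags(idx);            % see the xcorr docs for the sign convention
delaySeconds = 30 * delaySamples;    % one point every 30 s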
If there is no delay between the signals or you have found and removed the delay, you could just average the two (or more) signals. The random noise should tend to average to zero and the common features will approach the true value.

How do I compress data points in time-series using MATLAB?

I'm looking for some suggestions on how to compress time-series data in MATLAB.
I have some data sets of pupil size, which were gathered during 1 sec with 25,000 points for each trial (I'm still not sure whether it is proper to call the data a 'time series'). What I'd like to do now is to compare them with other data, and I need to compress the number of points to about 10,000 or fewer, minimizing loss of information. Are there any ways to do it?
I've tried to search for how to do this, but all I could find was how to smooth the data or how to compress digital images, neither of which is useful to me.
• The data sets simply consist of pupil diameter changing over time. For each trial, 25,000 points of data were gathered during 1 sec, which means 1 point denotes the pupil diameter measured over 0.04 msec. What I want to do is just to adjust this data to 0.1 msec/point; however, I'm not sure whether I can apply techniques like the FFT in this case, because it is the first time I have handled this kind of data. I appreciate your advice again.
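For that specific rate change, 0.04 msec/point to 0.1 msec/point is a factor of 2/5, so one option is resample from the Signal Processing Toolbox, which low-pass filters before decimating to limit aliasing; a one-line sketch, assuming y is one trial as a vector:

yDown = resample(y, 2, 5);    % 25,000 points -> 10,000 points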
A standard data compression technique with time-series data is to take the fast Fourier transform and use the amplitudes of the lowest frequencies to represent your data (calculate the power spectrum). You can compare data sets using these frequency amplitudes, though to lose the least amount of information you would want to use the frequencies with the largest amplitudes - but then it becomes tricky to compare the data, because different sets may keep different frequencies (a sketch of that variant follows below the list). Here is the standard Matlab tutorial on the FFT. Some other possibilities include:
- ARMA models
- Wavelets
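A sketch of the largest-amplitude variant mentioned above; the value of k is an arbitrary illustration:

X = fft(y);
k = 1000;                             % number of coefficients to keep
[~, order] = sort(abs(X), 'descend');
Xc = zeros(size(X));
Xc(order(1:k)) = X(order(1:k));       % compressed representation
yApprox = real(ifft(Xc));             % reconstruction, for comparison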
Check out this paper on the "SAX" method, a modern approach for time-series compression -- it also discusses classic time-series compression techniques.

How to find the mean/average of a sound in Nyquist

I'm trying to write a simple measurement plug-in for Audacity and it's about as much fun as pounding rocks against my skull. All I want to do is take a chunk of audio and find the average of all the samples (the chunk's DC offset) so that I can present it as a number to the user, and so that I can subtract the DC offset from the samples for further processing. I know and understand the math I want to do, but I don't understand how to do it in Lisp/XLisp/Nyquist/whatever.
Background information in this thread
As far as I know, there is no function to do this. For some reason, the snd-avg function does not actually compute the average of the sound, as you might expect. It takes the absolute value first, and then computes the average. Even though there's a separate snd-abs function that could do it. >:(
So I guess I have to write my own? This means converting a sound into an array and then computing the average of that?
(snd-fetch-array sound len step)
Reads sequential arrays of samples from sound, returning either an array of FLONUMs or NIL when the sound terminates.
(snd-samples sound limit)
Converts the samples into a lisp array.
Nyquist Functions
And there's not even an average function, so I'll have to do a sum myself, too? But the math functions only work on lists? So I need to convert the array into a list?
And this will also use up a huge amount of memory for longer waveforms (18 bytes per sample), so it would be best to process it in chunks and do a cumulative average. But I don't even know how to do the unoptimized version.
No, (hp s 0.1) won't work, since:
I want to remove DC only, and keep arbitrarily low frequencies. 0.01 Hz should pass through unchanged, DC should be removed.
The high-pass filter is causal and the first samples of the waveform remain unchanged, no matter what knee frequency you use, making it useless for measuring peak samples, etc.
NEVERMIND
snd-maxsamp is computing the absolute value, not snd-avg. snd-avg works just fine. Here's how to squeeze a number ("FLONUM") out of it:
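; block size and step are both len, so snd-avg averages the whole
; selection into one sample, and snd-fetch pulls it out as a FLONUM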
(snd-fetch (snd-avg s (round len) (round len) OP-AVERAGE))
This produces a negative number for negative samples and a positive number for positive samples, as it should.
Should this question be deleted or left as an example to others?

Buddhabrot Fractal

I am trying to implement the Buddhabrot fractal. I can't understand one thing: all implementations I inspected pick random points on the image to calculate the path of the escaping particle. Why do they do this? Why not go over all pixels?
What purpose do the random points serve? More points make better pictures so I think going over all pixels makes the best picture - am I wrong here?
From my test data:
Working on a 400x400 picture, so 160,000 pixels to iterate over if I visit them all.
Using random sampling, the picture only starts to take shape after 1 million points. Good results show up around 1 billion random points, which takes hours to compute.
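For reference, a minimal Matlab sketch (not from the thread) of the random-sampling approach under discussion; every parameter here is illustrative:

W = 400; H = 400;                     % image size from the question
maxIter = 500; nSamples = 1e6;
img = zeros(H, W);
for s = 1:nSamples
    c = complex(4*rand - 2, 4*rand - 2);  % random point in [-2,2]^2
    z = 0; k = 0;
    traj = complex(zeros(maxIter, 1));    % orbit storage
    while k < maxIter && abs(z) <= 2
        z = z*z + c;
        k = k + 1;
        traj(k) = z;
    end
    if abs(z) > 2                     % plot only orbits that escape
        for j = 1:k
            px = round((real(traj(j)) + 2)/4*(W-1)) + 1;
            py = round((imag(traj(j)) + 2)/4*(H-1)) + 1;
            if px >= 1 && px <= W && py >= 1 && py <= H
                img(py, px) = img(py, px) + 1;   % accumulate hit counts
            end
        end
    end
end
imagesc(img); colormap(hot); axis image off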
Random sampling is better than grid sampling for two main reasons. First, grid sampling will introduce grid-like artifacts in the resulting image. Second, grid sampling may not give you enough samples for a converged image. If, after completing a grid pass, you wanted more samples, you would need to make another pass with a slightly offset grid (so as not to resample the same points) or switch to a finer grid, which may end up doing more work than is needed. Random sampling gives very smooth results, and you can stop the process as soon as the image has converged or you are satisfied with the results.
I'm the inventor of the technique so you can trust me on this. :-)
The same holds for flame fractals: Buddhabrot is about finding the "attractors", so even if you start with a random point, it is assumed to converge quite quickly to these attracting curves. You typically avoid painting the first 10 or so points of the iteration anyway, so the starting point is not really relevant. BUT, to avoid doing the same computation twice, random sampling is much better. As mentioned, it eliminates the risk of artefacts.
But the most important feature of random sampling is that it has all levels of precision (in theory, at least). This is VERY important for fractals: they have details on all levels of precision, and hence require input from all levels as well.
While I am not 100% sure of the exact reason, I would assume it has more to do with efficiency. If you are going to iterate through every single point multiple times, it's going to waste a lot of processing cycles to get a picture which may not look a whole lot better. By doing random sampling you can reduce the amount of work that needs to be done - and, given a large enough sample size, still get a result that is difficult to distinguish from iterating over all the pixels (from a visual point of view).
This is possibly some kind of Monte-Carlo method so yes, going over all pixels would produce the perfect result but would be horribly time consuming.
Why don't you just try it out and see what happens?
Random sampling is used to get as close as possible to the exact solution, which in cases like this cannot be calculated exactly due to the statistical nature of the problem.
You can 'go over all pixels', but since every pixel is in fact some square region with dimensions dx * dy, you would only use num_x_pixels * num_y_pixels points for your calculation and get very grainy results.
Another way would be to use a very large resolution and scale down the render after the calculation. This would give some kind of 'systematic' render where every pixel of the final render is divided into an equal number of subpixels.
I realize this is an old post, but wanted to add my thoughts based on a current project.
The problem with tying your samples to pixels, like others said:
- Couples your sample grid to your view plane, making it difficult to do projections, zooms, etc.
- Not enough fidelity. Random sampling is more efficient, as everyone else said, so you need even more samples if you want to sample using a uniform grid.
- You're much more likely to see grid artifacts at equivalent sample counts, whereas random sampling tends to just look grainy at low counts.
However, I'm working on a GPU-accelerated version of buddhabrot, and ran into a couple issues specific to GPU code with random sampling:
- I've had issues with the overhead/quality of PRNGs in GPU code where I need to generate thousands of samples in parallel.
- Random sampling produces highly scattered traces, and the GPU really, really wants threads in a given block/warp to move together as much as possible. The performance difference for clustered vs scattered traces was extreme in my testing.
- Hardware support in modern GPUs for efficient atomicAdd means writes to global memory don't bottleneck GPU buddhabrot nearly as much now.
- Grid sampling makes it very easy to do a two-pass render, skipping blocks of sample points based on a low-res pass to find points that don't escape or don't ever touch a pixel in the view plane.
Long story short, while the GPU technically has to do more work this way for equivalent quality, it's actually faster in practice AFAICT, and GPUs are so fast that re-rendering is often a matter of seconds/minutes instead of hours (or even milliseconds at lower resolution / quality levels, realtime buddhabrot is very cool)