Moving average for time series with unequal intervals - MATLAB

I have a dataset of prices for a ticker on the stock exchange: time - price. But the intervals between data points are not equal; they range from 1 to 2 minutes.
What is the best practice for calculating a moving average in such a case?
How can I do it in MATLAB?
I tend to think that the weight of each point should depend on the time interval that has elapsed since the previous point. Is there a function in MATLAB to calculate a moving average with custom weights for the points?

Here is an example of the "naive" approach I mentioned in the comments above:
% some data (unequally spaced in time, but monotonically non-decreasing)
t = sort(rand(50,1));
x = cumsum(rand(size(t))-0.5);
% linear interpolation on equally-spaced intervals
tt = linspace(min(t), max(t), numel(t));
xx = interp1(t, x, tt, 'linear');
% plot two data vectors
plot(t, x, 'b.-', tt, xx, 'r.:')
legend({'original', 'equally-spaced'})

My answer is quite similar to lakesh's. But I will think about your problem in terms of interpolation.
First of all, a moving average, or a time average of a function, is the integral of the function over a time period, divided by the length of that period.
In your case, the integral can be seen as a sum, since in general the function value is the same within each minute. However, your data has unequal time intervals. This can be seen as missing points of the function. Let me explain: for each minute x, you should have a price f(x). But for some times, say x=5, f(x) is undefined.
One way to get rid of such gaps in a function is interpolation: assign some value to the missing points according to some rule of calculation. The simplest algorithm is "keeping the previous value", which is essentially lakesh's idea.
But the benefit of thinking this way lies in the ability to make your data more accurate. It may not apply to a stock market case, but it should hold in general for quantities such as temperature or wind speed, which change smoothly over time (rather than staying constant for 2 minutes and suddenly changing in one second). You can use different interpolation techniques to polish the data. "Polishing" in this sense is OK because you have to use the concept of an "average" in any case. A good interpolation should bring the data closer to a model that has been proven to work for the real problem.
CODE - I set the max interval to 5 minutes to show the large difference between the methods. It depends on your observation and experience to decide which (or any other) method is the best to "predict the past".
% reproduce your scenario
N = 20;
max_interval = 5;
time = randi(max_interval,N,1);
time(1) = 1; % first minute
price = randi(10,N,1);
figure(1)
plot(cumsum(time), price, 'ko-', 'LineWidth', 2);
hold on
% "keeping-previous-value" interpolation
interp1 = zeros(sum(time),1)-1;
interp1(cumsum(time)) = price;
while ismember(-1, interp1)
interp1(interp1==-1) = interp1(find(interp1==-1)-1);
end
plot(interp1, 'bx--')
% "midpoint" interpolation
interp2 = zeros(sum(time),1)-1;
interp2(cumsum(time)) = price;
for ii = 1:length(interp2)
if interp2(ii) == -1
t1 = interp2(ii-1);
t2 = interp2( find(interp2(ii:end)>-1, 1, 'first') +ii-1);
interp2(ii) = (t1+t2)/2;
end
end
plot(interp2, 'rd--')
% "modified-midpoint" interpolation
interp3 = zeros(sum(time),1)-1;
interp3(cumsum(time)) = price;
for ii = 1:length(interp3)
if interp3(ii) == -1
t1 = interp3(ii-1);
t2 = interp3( find(interp3(ii:end)>-1, 1, 'first') +ii-1);
alpha = 1 / find(interp3(ii:end)>-1, 1, 'first');
interp3(ii) = (1-alpha)*t1 + alpha*t2;
end
end
plot(interp3, 'm^--')
hold off
legend('original data', 'interp 1', 'interp 2', 'interp 3')
fprintf(['"keeping-previous-value" (weighted sum) \n', ...
' result: %2.4f \n'], mean(interp1));
fprintf(['"midpoint" (linear interpolation) \n', ...
' result: %2.4f \n'], mean(interp2));
fprintf(['"modified-midpoint" (linear interpolation) \n', ...
' result: %2.4f \n'], mean(interp3));
Note: undefined points should be represented by NaN, but -1 seems easier to play with.

This is my suggestion.
Since your data has unequal intervals, convert it to equal intervals, keeping the price constant between the original data points.
Then you can use tsmovavg to calculate the moving average of the resulting price series.
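A minimal sketch of this idea, assuming a release where interp1 accepts the 'previous' method and movmean is available (R2016a or later). movmean is swapped in here for tsmovavg, and the sample times, prices, 1-minute grid, and 3-sample window are all made up for illustration:

```matlab
% Hypothetical, unequally spaced samples (minutes, price)
t = [0 1 3 4 6];
p = [10 12 11 15 14];

% Resample onto a regular 1-minute grid, holding the last known price
tt = t(1):1:t(end);
pp = interp1(t, p, tt, 'previous');   % [10 12 12 11 15 15 14]

% Centered 3-sample moving average; the window shrinks at the edges
ma = movmean(pp, 3);
```

Because each price is repeated for as long as it was in effect, points followed by a longer gap automatically get more weight in the average.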

If you are willing to discretize the time value of your data points, the solution should be very straightforward. No matter what kind of window you choose, as long as it's Lipschitz, it can be computed or approximated in amortized O(1) time for each data point or time step using approaches like summed area table.
Else, use a rectangular running window of fixed width that only 'snaps' to data points. Specifically, update the summation of values of all data points within the window only when a data point is joining/leaving the window.
However, if you want to use custom weights for your data points, the method described above no longer works. You can, of course, approximate your spatial kernel with multiple box functions. Otherwise, you might want to look into general bilateral filtering algorithms, as the problem can be formulated as bilateral filtering with a constant range kernel. See the paper Adaptive Manifolds for Real-Time High-Dimensional Filtering for a recently developed algorithm that's relatively easy to implement on this topic. The author's website also provides code in MATLAB.
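The rectangular window that only "snaps" to data points could be sketched roughly like this. It is an illustration under assumptions, not the answerer's code: the data, the window width W, and the choice of a trailing window (t(i)-W, t(i)] are all hypothetical.

```matlab
% Hypothetical, unequally spaced samples
t = [0 1 3 4 6 7];
x = [10 12 11 15 14 13];
W = 3;                       % window width in time units (assumed)

avg = zeros(size(x));
runningSum = 0;
left = 1;                    % index of the oldest point still in the window
for i = 1:numel(t)
    runningSum = runningSum + x(i);        % point i joins the window
    while t(left) <= t(i) - W              % points older than W leave it
        runningSum = runningSum - x(left);
        left = left + 1;
    end
    avg(i) = runningSum / (i - left + 1);  % mean of the points in the window
end
```

Each point is added once and removed at most once, so the cost per data point is amortized O(1).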

Related

How to bin values and plot

I have a dataset with two columns: the first column is duration (a length of time, e.g. 5 min) and the second column is firing rate. Is it possible to bin the firing rates according to the corresponding duration (e.g. 5, 10, 15 min) and then plot bars with firing rate on the y axis and time on the x axis?
I'm sure this can be accomplished without the for loop. Solution below uses the discretize function to accomplish the grouping. Other approaches possible.
% MATLAB R2017a
% Sample data
D = 20*rand(25,1);
FR = 550*rand(25,1);
D_bins = (0:5:20)'; % bin edges
ind = discretize(D,D_bins); % bin index (1..4) for each duration
nBins = length(D_bins)-1; % 5 edges define 4 bins
FR_mean = zeros(nBins,1);
for k = 1:nBins
FR_mean(k) = mean(FR(ind==k));
end
bar(D_bins(1:end-1)+2.5,FR_mean) % bar plot at the bin centers
% Cosmetics
xlabel('Duration (min)')
ylabel('Mean Firing Rate (unit)')
I'm positive there's a more efficient way to get the means for each group, possibly using arrayfun or some other nifty functions, but will hold off until OP provides more details.
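One loop-free possibility is accumarray, sketched here on made-up data. The values in D and FR are hypothetical, and NaN is chosen as the fill value for bins that receive no samples:

```matlab
% Hypothetical durations (min) and firing rates
D  = [2 7 12 17 3 8 13 18]';
FR = [10 20 30 40 30 40 50 60]';

edges = (0:5:20)';                   % 5 edges define 4 bins
ind   = discretize(D, edges);        % bin index for each sample
nBins = numel(edges) - 1;

% Mean firing rate per bin; empty bins would be marked with NaN
FR_mean = accumarray(ind, FR, [nBins 1], @mean, NaN);

% Bin centers, suitable as x positions for bar(centers, FR_mean)
centers = edges(1:end-1) + diff(edges)/2;
```

This assumes every duration falls inside the edges; discretize returns NaN for values outside them, which accumarray would reject.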

Cross correlate data that contains "spikes"

When using xcorr in MATLAB to cross correlate 2 related data sets, everything works as expected - I see a correlation peak and the lag reported is correct. However, when I use xcorr to cross correlate unrelated data sets where both data sets contain 1 cluster of "spikes", I see a correlation peak and the lag reported is the distance between the 2 spikes.
In this image:
x is a random data series. y is also a random data series. Both x and y have 30 random peaks inserted into the series in sequence. In theory, there should be no correlation between the 2 data sets since they are both very different. However, it can be seen from the 3rd plot that there is a very strong correlation between the 2 data sets. The code used to generate this figure is at the bottom of this post.
I've tried to filter the spikes using a few different mechanisms (rolling rms power ... etc) before performing the xcorr. This has worked in some cases but not all. I feel like I need a different approach to the problem, maybe an alternative to xcorr. I do understand why x and y cross correlate using xcorr. Is there another cross correlation tool that I can use? Note x and y will never be exactly the same, they will only ever be approximately the same but in normal operation, it's not the spikes that should make them correlate.
Any suggestions on how to tell if x and y correlate while also ignoring the "spikes"?
Here is my example code:
x = rand(1, 3000);
x = x - 0.5;
y = rand(1, 3000);
y = y - 0.5;
% insert the impulses into the data
impulse_width = 30;
impulse_max_height = 6;
x_impulse_start = 460;
y_impulse_start = 120;
rand_insert_x = rand(1, impulse_width);
rand_insert_x = (rand_insert_x - 0.5) * 2 * impulse_max_height;
rand_insert_y = rand(1, impulse_width);
rand_insert_y = (rand_insert_y - 0.5) * 2 * impulse_max_height;
x(1,x_impulse_start:x_impulse_start + impulse_width - 1) = rand_insert_x;
y(1,y_impulse_start:y_impulse_start + impulse_width - 1) = rand_insert_y;
subplot(3, 1, 1);
plot(x);
ylim([-impulse_max_height impulse_max_height]);
title('random data series: x');
subplot(3, 1, 2);
plot(y);
ylim([-impulse_max_height impulse_max_height]);
title('random data series: y');
[c, l] = xcorr(x, y);
subplot(3, 1, 3);
plot(l, c);
title('correlation using xcorr');
The way to solve this is to use normalized cross-correlation.
In normalized cross-correlation the correlation is 1 when the signals are exactly the same, and less when they are not. You can see it as a "percentage of similarity".
To do that in MATLAB, you just need to add 'coeff' as an argument in your code.
So, if I change your code to [c, l] = xcorr(x, y,'coeff'); the plot I get is the following:
(note I changed the sample size to 600 to make it more readable)
The cross-correlation gets to 0.3 there, so not much. However, if we change your code lines to
x(1,x_impulse_start:x_impulse_start + impulse_width - 1) = rand_insert_x;
y(1,y_impulse_start:y_impulse_start + impulse_width - 1) = rand_insert_x;
and insert the same random pattern in both signals, then we get:
Now, the cross-correlation gets to a high value, almost 1, but not one, because the big random pattern there is the same, but the rest of the signal is not.
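To make the normalization concrete, here is a small sketch (it needs the Signal Processing Toolbox for xcorr; the two short vectors are made up). As far as I know, 'coeff' scales the raw result so that both autocorrelations equal 1 at zero lag, which amounts to dividing by the square root of the product of the signal energies:

```matlab
% Hypothetical short signals
x = [1 2 3 4];
y = [4 3 2 1];

[cRaw, lags] = xcorr(x, y);                   % unnormalized cross-correlation
cNorm  = cRaw / sqrt(sum(x.^2) * sum(y.^2));  % manual normalization
cCoeff = xcorr(x, y, 'coeff');                % built-in normalization

maxDiff = max(abs(cNorm - cCoeff));           % expected to be ~0
```

With this scaling every value lies between -1 and 1, so a peak can be judged on its own rather than relative to the signal amplitudes.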
Cross-correlation is closely related to convolution: it is the convolution of one signal with a time-reversed copy of the other. Imagine that during the cross-correlation, the two signals are at lags like I have shown here (x-axis labels should be completely ignored):
The positive (+) spike in series x (~ sample 490) is multiplied by the negative (-) spike in series y (~ sample 121), resulting in a large negative value in the xcorr, which we actually see in the bottom plot (~ sample 315). This large negative value will be added to something close to 0, since the rest of the signals are indeed low-power noise. I am afraid that no matter what xcorr function you use, you should get the same result. In fact, if there is another function that claims to be a cross-correlator but doesn't give the same result as xcorr(), then that function should not be called a cross-correlator. I hope this helps.
My understanding of the question is "How do I remove these spikes from my data?"
The answer is find something characteristic about those spikes, and then test each time window for that characteristic. If that test passes, then you have detected a spike, and you should remove that data.
For example, you might say "A spike is any time point that has an absolute value greater than some threshold." You determine the threshold using your data, say 0.2. Then you do something like
spikeless_data = data .* (abs(data)<0.2);
which copies data when abs(data)<0.2 and sets it to 0 when not.
You could also notice that a characteristic of spikes is that their derivative is very large, which might be more robust than a simple threshold. This would correspond to spikeless_data = data .* ([abs(diff(data)), 0] < some_threshold);
You will have to play around to find something that works for your data.

Computing a moving average

I need to compute a moving average over a data series, within a for loop. I have to get the moving average over N=9 days. The array I'm computing in is 4 series of 365 values (M), which itself are mean values of another set of data. I want to plot the mean values of my data with the moving average in one plot.
I googled a bit about moving averages and the "conv" command and found something which I tried implementing in my code:
hold on
for ii=1:4;
M=mean(C{ii},2)
wts = [1/24;repmat(1/12,11,1);1/24];
Ms=conv(M,wts,'valid')
plot(M)
plot(Ms,'r')
end
hold off
So basically, I compute my mean and plot it with a (wrong) moving average. I picked the "wts" value right off the mathworks site, so that is incorrect. (source: http://www.mathworks.nl/help/econ/moving-average-trend-estimation.html) My problem though, is that I do not understand what this "wts" is. Could anyone explain? If it has something to do with the weights of the values: that is invalid in this case. All values are weighted the same.
And if I am doing this entirely wrong, could I get some help with it?
My sincerest thanks.
There are two more alternatives:
1) filter
From the doc:
You can use filter to find a running average without using a for loop.
This example finds the running average of a 16-element vector, using a
window size of 5.
data = [1:0.2:4]'; %'
windowSize = 5;
filter(ones(1,windowSize)/windowSize,1,data)
2) smooth as part of the Curve Fitting Toolbox (which is available in most cases)
From the doc:
yy = smooth(y) smooths the data in the column vector y using a moving
average filter. Results are returned in the column vector yy. The
default span for the moving average is 5.
%// Create noisy data with outliers:
x = 15*rand(150,1);
y = sin(x) + 0.5*(rand(size(x))-0.5);
y(ceil(length(x)*rand(2,1))) = 3;
%// Smooth the data using the loess and rloess methods with a span of 10%:
yy1 = smooth(x,y,0.1,'loess');
yy2 = smooth(x,y,0.1,'rloess');
In 2016 MATLAB added the movmean function that calculates a moving average:
N = 9;
M_moving_average = movmean(M,N)
Using conv is an excellent way to implement a moving average. In the code you are using, wts is how much you are weighting each value (as you guessed). The sum of that vector should always be equal to one. If you wish to weight each value evenly and do a size N moving filter, then you would want to do
N = 7;
wts = ones(N,1)/N;
sum(wts) % result = 1
Using the 'valid' argument in conv will result in having fewer values in Ms than you have in M. Use 'same' if you don't mind the effects of zero padding. If you have the signal processing toolbox you can use cconv if you want to try a circular moving average. Something like
N = 7;
wts = ones(N,1)/N;
cconv(x,wts,length(x)); % third argument is the output length, so use length(x) to keep the series length
should work.
You should read the conv and cconv documentation for more information if you haven't already.
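A small sketch of the difference between 'valid' and 'same', using a made-up ramp signal so the numbers are easy to check by hand (the signal and window size are just illustrative):

```matlab
M   = (1:20)';              % hypothetical data series
N   = 5;                    % window size
wts = ones(N,1)/N;          % equal weights that sum to 1

MsValid = conv(M, wts, 'valid');  % only full windows: 20-5+1 = 16 values
MsSame  = conv(M, wts, 'same');   % 20 values, edges see implicit zeros

% First values: MsValid(1) = mean(1:5) = 3,
% while MsSame(1) = (1+2+3)/5 = 1.2 because two zeros are padded in
```

The 'same' output lines up with M for plotting, but its first and last (N-1)/2 values are pulled toward zero.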
I would use this:
% does moving average on signal x, window size is w
function y = movingAverage(x, w)
k = ones(1, w) / w; % equal weights summing to 1
y = conv(x, k, 'same');
end
ripped straight from here.
To comment on your current implementation: wts is the weighting vector. The one from the MathWorks page is a 13-point average in which the first and last points are weighted half as much as the rest.

matrix of exponentially declining values according to a given vector

I have a vector of solar radiation measurements for a water body, I would like to calculate the radiation that reaches certain depths in the water column. This can be calculated from Beer's law, which I have applied for the second depth of my measurements:
rad = 1+(30-1).*rand(365,1);
depth = 1:10;
kz = 0.4;
rad(:,2) = rad(:,1).*exp(-kz.*depth(2));
How would I apply this to all of the depths specified in the vector 'depth'? i.e. how would I generate a matrix which has 365 rows and 10 columns where each column refers to the radiation that reaches that particular depth.
Since the decay of radiation due to scattering and absorption is a simple %-loss per depth, you can calculate the result very easily from the initial radiation:
initialRad = 1+(30-1).*rand(365,1);
depth = 0:10; %# start with zero so that the first column is your initial radiation
kz = 0.4;
rad = bsxfun(@times, initialRad, exp(-kz*depth) );
Note that as @Rasman points out, you can use vector multiplication instead of bsxfun, since multiplying an m-by-1 array with a 1-by-n array results in an m-by-n array. The bsxfun solution can be more robust, since it also works when the arrays have additional dimensions (e.g. m-by-1-by-k and 1-by-n-by-k if you do multiple tests), or if the vectors are transposed (e.g. 1-by-m and n-by-1). The solution below is a nice demonstration of good linear algebra skills, though you may want to add a note why you don't use dot multiplication with the two vectors initialRad and the exp-statement.
rad = initialRad * exp(-kz * depth);
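For comparison, here is a sketch of the two forms above next to a third option that was not available when these answers were written: from R2016b on, element-wise operators expand singleton dimensions implicitly, so plain .* gives the same matrix. The two radiation values are made up so the result can be checked by hand:

```matlab
initialRad = [10; 20];      % hypothetical surface radiation (2-by-1)
depth = 0:2;                % 1-by-3, depth 0 keeps the initial values
kz = 0.4;

radBsx   = bsxfun(@times, initialRad, exp(-kz*depth)); % explicit expansion
radOuter = initialRad * exp(-kz*depth);                % outer product
radImpl  = initialRad .* exp(-kz*depth);               % implicit expansion (R2016b+)
```

All three produce a 2-by-3 matrix whose first column equals initialRad.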
You should use loops,
here you can read a tutorial about them, and how to use them,
http://www.mathworks.com/help/distcomp/for.html
Basically, what you need is a for loop that uses i as its loop variable, which should run for
i = 1 .. 9
and your main assignment would become
rad(:,i+1) = rad(:,1).*exp(-kz.*depth(i+1));
To be more precise:
for i = 1:9
rad(:,i+1) = rad(:,1).*exp(-kz.*depth(i+1));
end
I do not know the subject, but this loop will sweep your matrix column by column: it starts by assigning column 2 from column 1 and goes on until column 10.

Find only relevant points in MATLAB

I have a MATLAB function that finds characteristic points in a sample. Unfortunately it only works about 90% of the time. But when I know at which places in the sample I am supposed to look, I can increase this to almost 100%. So I would like to know if there is a function in MATLAB that would allow me to find the range where most of my results are, so I can then recalculate my characteristic points. I have a vector which stores all the results, and the right results should lie inside a range of 3% between -24.000 and 24.000, whereas wrong results are always lower than the correct range. Unfortunately my background in statistics is very rusty, so I am not sure what this would be called.
Can somebody give me a hint what I would be looking for? Is there a function built into MATLAB that would give me the smallest possible range where e.g. 90% of the results lie?
EDIT: I am sorry if I didn't make my question clear. Everything in my vector can only range between -24.000 and 24.000. About 90% of my results will be in a range which spans approximately 1.44 ([24-(-24)]*3% = 1.44). These are very likely to be the correct results. The remaining 10% are outside of that range and always lower (which is why I am not sure taking the mean value is a good idea). These 10% are false and result from blips in my input data. To find the remaining 10% I want to repeat my calculations, but now I only want to check the small range.
So, my goal is to identify where my correct range lies, delete the values I have found outside of that range, and then recalculate my values, not on a range between -24.000 and 24.000, but rather on the small range where I already found 90% of my values.
The relevant points you're looking for are the percentiles:
% generate sample data
data = [randn(900,1) ; randn(50,1)*3 + 5 ; randn(50,1)*3 - 5];
subplot(121), hist(data)
subplot(122), boxplot(data)
% find 5th, 95th percentiles (range that contains 90% of the data)
limits = prctile(data, [5 95])
% find data in that range
reducedData = data(limits(1) < data & data < limits(2));
Other approaches exist to detect outliers, such as the IQR outlier test and the three standard deviation rule, among many others:
%% three standard deviation rule
z = 3;
bounds = z * std(data)
reducedData = data( abs(data-mean(data)) < bounds );
and
%% IQR outlier test
Q = prctile(data, [25 75]);
IQ = Q(2)-Q(1);
%a = 1.5; % mild outlier
a = 3.0; % extreme outlier
bounds = [Q(1)-a*IQ , Q(2)+a*IQ]
reducedData = data(bounds(1) < data & data < bounds(2));
BTW if you want to get the z value (|X|<z) that corresponds to 90% area under the curve, use:
area = 0.9; % two-tailed probability
z = norminv(1-(1-area)/2)
Maybe you should try the mean (in MATLAB: mean) and the standard deviation (in MATLAB: std).
What is the statistical distribution of your data?
See also this wiki page, section "Interpretation and application".
Chebyshev's inequality is very useful here, since it holds for almost every distribution.
In most cases this should work:
meanVal = mean(data)
stDev = std(data)
and at least 75% of your values will lie in the range:
<meanVal - 2*stDev, meanVal + 2*stDev>
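As a quick sanity check of the 75% figure, here is a sketch on a deliberately skewed, made-up sample with one low outlier (the data are hypothetical and only meant to show the bound holding for a non-normal sample):

```matlab
% Hypothetical sample: nine good results and one low outlier
data = [ones(9,1)*10; 0];

meanVal = mean(data);                 % 9
stDev   = std(data);                  % sqrt(10), about 3.16
k = 2;                                % number of standard deviations

inRange = abs(data - meanVal) < k*stDev;
fracInRange = mean(inRange);          % fraction of points inside the range
```

Here 9 of the 10 points fall inside the range, comfortably above the 1 - 1/k^2 = 0.75 that Chebyshev guarantees. Note that the outlier drags the mean down, which is the concern raised in the question about using the mean.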
It seems like maybe you want to find the number x in [-24,24] that maximizes the number of sample points in [x,x+1.44]. Probably the fastest way to do this involves sorting the sample points, which is ultimately O(n log n) time. A cheesy approximation would be as follows:
brkpoints = linspace(-24,24-1.44,n_brkpoints); %choose n_brkpoints big, but < # of sample points?
n_count = histc(data,[brkpoints,inf]); %count # data points between breakpoints;
accbins = round(1.44 / (brkpoints(2) - brkpoints(1))); %# of bins to accumulate (must be an integer);
cscount = cumsum(n_count); %half of the boxcar sum computation;
boxsum = cscount - [zeros(accbins,1);cscount(1:end-accbins)]; %2nd half;
[dum,maxi] = max(boxsum); %which window (ending at bin maxi) has the maximal # counts?
lorange = brkpoints(max(maxi-accbins+1,1)); %the lower range (first bin of that window);
hirange = lorange + 1.44
This solution does fudge some of the corner cases around the bottom and top bins, etc.
Note that if you're going to go the Chebyshev inequality route, Petunin's inequality is probably applicable, and will give a slight boost.
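The sort-based approach mentioned above could look roughly like this. It is a sketch rather than the answerer's code: it assumes the best window may start at a sample point, and the data values are made up.

```matlab
% Hypothetical results: a dense cluster plus a few low and high stragglers
data  = [-20 -3 0.1 0.5 0.9 1.2 1.4 5]';
width = 1.44;                        % 3% of the full [-24, 24] span

s = sort(data);
best = 0; bestLo = s(1);
hi = 1;                              % last index still inside the window
for lo = 1:numel(s)
    hi = max(hi, lo);
    while hi < numel(s) && s(hi+1) <= s(lo) + width
        hi = hi + 1;                 % extend the window to the right
    end
    if hi - lo + 1 > best            % more points than any window so far
        best   = hi - lo + 1;
        bestLo = s(lo);
    end
end
lorange = bestLo;
hirange = bestLo + width;
```

After the sort, both indices only move forward, so the scan itself is linear in the number of samples.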