I want to calculate the z-score of the current point in cross-sectional time-series data, based on the standard deviation and simple moving average over the last 10 days. I can't use MATLAB's zscore function, as it looks forward to calculate the z-score. Currently my solution is
for i=11:length(equity.(1))
z(i) = (x(i)-mean(x(i-10:i)))/std(x(i-10:i));
end
but the issue is that I want to do this for the entire dataset at once. Is there a way to handle the entire matrix at once and calculate the z-score for a given look-back period (10 days in my case)?
Whether this is in fact more efficient or not I don't know, but here is one way (im2col requires the Image Processing Toolbox):
data = 1:40; %dummy data
% presuming "ten days" means day of interest + 9 days back
n = 10;
data2 = im2col(data,[1,n],'sliding');
%mean/std for each column:
dmean = mean(data2);
dstd = std(data2);
z = (data(n:end)-dmean)./dstd;
You might also try this from the file exchange.
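A toolbox-free alternative sketch, assuming a release with movmean/movstd (R2016a or later) and the same convention of the day of interest plus 9 days back; for a matrix these functions work column by column:
m = movmean(x, [9 0]);    % trailing 10-day moving average
s = movstd(x, [9 0]);     % trailing 10-day moving standard deviation
z = (x - m) ./ s;
z(1:9) = NaN;             % the first 9 windows are incomplete (use z(1:9,:) for a matrix)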
I generate a time series from a normal distribution and then I try to plot the autocorrelation by using the following code snippet:
ts1 = normrnd(0,0.25,1,100);
autocorrelation_ts1 = xcorr(ts1);
I was expecting the autocorrelation to show 1 at x = 0 and almost 0 for the rest of the values; instead I get a value of 6 at axis position 100.
I think the question applies both to Matlab and Octave but I am not sure.
The first thing is that your second line of code is wrong. I think you meant to put
autocorrelation_ts1 = xcorr(ts1);
Other than this, I think your solution is correct. The reason the max value is at 100 and not 0 is because a temporal shift of 0 in the autocorrelation actually happens on the 100th iteration of the correlation function. In other words, the numbers on the X axis don't correspond to time.
To get time on the X axis change your code to
[autocorrelation_ts1, shifts] = xcorr(ts1);
Then
plot(shifts, autocorrelation_ts1)
With regard to the max value, the MATLAB documentation for xcorr indicates that 1 is not the maximum output value of the function when it is called without a normalization argument. If you want to normalize so that all values are 1 or less, use
[autocorrelation_ts1, shifts] = xcorr(ts1, 'normalized');
Just as a complementary reference to Scott's answer, here is the complete code snippet, including stem-chart scaling to show up to 20 shifts/lags.
[auto_ts1, lags] = xcorr(ts1);
ts_begin = ceil(size(lags,2)/2);
ts_end = ts_begin + 20;
stem(lags(ts_begin:ts_end),auto_ts1(ts_begin:ts_end)/max(auto_ts1), 'linewidth', 4.0, 'filled')
I've extracted certain data from an Excel file.
It involves two columns: one for certain periods and another for the corresponding
daily prices. The following is my code (t1 and t2 are user inputs).
row_1 = find(period==t1)
row_2 = find(period==t2)
f_0 = period(row_1:row_2, 1)
f_1 = price(row_1:row_2 , 1)
y_1 = plot(handles.axes2, f_0, f_1)
f_0: period (x-axis), f_1: price (y-axis)
My goal is to express the trend of price fluctuations by using sounds.
So the way I came up with this is as follows.
Step 1: Find the maximum and minimum values of the price corresponding to the given period.
Step 2: Divide the distance between these two points into eight sections.
Step 3: Allocate eight musical scales (C D E F G A B C) to each of the eight sections and play them.
So far I have managed to find the min/max values for the given period.
But for the next stage, I can't come up with any ideas.
Please help me with any advice.
If I understand you correctly, you want to allocate eight musical scales to the divided range, and code like the following may help.
%% let's play some music~
clc; clear;
%% Set the Sampling frequency & time period
fs=44100;
t=0:1/fs:0.5;
%% eight musical scales
Cscale{1}=sin(2*pi*262*t); %c-do
Cscale{2}=sin(2*pi*294*t); %c-re
Cscale{3}=sin(2*pi*330*t); %c-mi
Cscale{4}=sin(2*pi*349*t); %c-fa
Cscale{5}=sin(2*pi*392*t); %c-so
Cscale{6}=sin(2*pi*440*t); %c-la
Cscale{7}=sin(2*pi*494*t); %c-ti
Cscale{8}=sin(2*pi*523*t); %c-do-high
%you can call "sound(Cscale{i},fs)" to play each scale
%% Divide the distances between these two points
% the highest point must be special treated
Min_p=0;
Max_p=8;
Sample_p=[0 1 2 3 4 5 6 7 8];
for i=1:length(Sample_p)
S_p=Sample_p(i);
if (S_p == Max_p)
sound(Cscale{end},fs);
else
%Find the correct music scale and play it
sound(Cscale{1+floor(8*(Sample_p(i)-Min_p)/(Max_p-Min_p))},fs);
end
pause(0.5)
end
Here is what I looked at (you may need Google Translate because it is written in Chinese):
http://blog.csdn.net/weaponsun/article/details/46695255
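To tie this back to the question, here is a rough sketch of the same mapping applied to the extracted prices; it assumes f_1 holds the price vector from the question, with fs and Cscale defined as above:
p_min = min(f_1);
p_max = max(f_1);
for k = 1:length(f_1)
    if f_1(k) == p_max
        sound(Cscale{end}, fs);   % the maximum price maps to the top note
    else
        sound(Cscale{1 + floor(8*(f_1(k)-p_min)/(p_max-p_min))}, fs);
    end
    pause(0.5)
end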
I have daily time series data and I want to calculate 5-day averages of that data while also retrieving the corresponding start date for each of the 5-day averages. For example:
x = [732099 732100 732101 732102 732103 732104 732105 732106 732107 732108];
y= [1 5 3 4 6 2 3 5 6 8];
Where x and y are actually size 92x1.
Firstly, how do I compute the 5-day mean when the length of this time series is not divisible by 5? Ultimately, I want to compute the 'jumping mean', where the average is computed over consecutive non-overlapping blocks rather than continuously (e.g., June 1-5, June 6-10, and so on).
I've tried doing the following:
Pentad_avg = mean(reshape(y(1:90),5,[]))'; %manually adjusted to be divisible by 5
Pentad_dt = x(1:5:90); %select every 5th day for time
However, Pentad_dt gives me dates 01-Jun-2004 and 06-Jun-2004 as output. And, that brings me to my second point.
I am looking to find 5-day averages for x and y that correspond to 5-day averages of another time series. This second time series has 5-day averaged data starting from 15-Jun-2004 until 29-Aug-2004 (instead of starting at 01-Jun-2004). Ultimately, how do I align the dates and 5-day averages between these two time series?
Synchronization between two time series can be accomplished using the timeseries object. Placing your data into an object allows MATLAB to process it intelligently. The most useful thing it adds for your use case is the synchronize method.
You'll want to make sure to properly set the time vector on each of the timeseries objects.
An example of what this might look like is as follows:
ts1 = timeseries(y,datestr(x));
ts2 = timeseries(OtherData,OtherTimes);
[ts1 ts2] = synchronize(ts1,ts2,'Uniform','Interval',5);
This should return each timeseries aligned to the same times. You could also specify a specific time vector to align a timeseries to by using the resample method.
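For the first part of the question (non-overlapping 5-day means when the length is not a multiple of 5), here is a minimal sketch; it assumes x holds datenums, y the daily values, and that the other series starts on 15-Jun-2004:
t_start  = datenum('15-Jun-2004');
i_start  = find(x == t_start, 1);                % first day shared with the other series
nPentads = floor((numel(x) - i_start + 1) / 5);  % keep only complete 5-day blocks
idx      = i_start : i_start + 5*nPentads - 1;
Pentad_avg = mean(reshape(y(idx), 5, []))';      % non-overlapping ("jumping") 5-day means
Pentad_dt  = x(idx(1:5:end));                    % start date of each 5-day block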
I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g., the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e., it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series with NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things could get (though including it could lead to overestimate the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: e.g., if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance/panel-data point: selecting a sample at some time point t and then following it into the future is perfectly fine, but selecting a sample based on what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems above, though, if whether an observation is missing is not random and orthogonal to what you care about.
The input is a T-by-n matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i=T_subsample + 1:T
% One pass: add new period in the window and subtract period no longer in the window
nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
j = j + 1;
end
indicator = nancount==0; % true if the (starting period, series) pair has no NaNs
% number of NaN-free series of length T_subsample for each starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
Here's an idea,
Assuming you can rearrange the series, calculate the distance between them (you choose the metric, but if you are just comparing NaN vs. non-NaN patterns, Hamming distance is fine).
Now hierarchically cluster the series and rearrange them using either a dendrogram
or a clustergram: http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
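A possible sketch of this idea, assuming the Statistics and Machine Learning Toolbox is available and X is the 360-by-15000 data matrix (note that pdist over ~15,000 series is memory-hungry, so prune first):
mask = isnan(X);
keep = sum(~mask, 1) >= 72;               % prune series with fewer than 72 non-NaN periods
mask = mask(:, keep);
D = pdist(double(mask)', 'hamming');      % pairwise Hamming distance between NaN patterns
Z = linkage(D, 'average');                % hierarchical clustering of the series
order = optimalleaforder(Z, D);           % column order placing similar series together
imagesc(~mask(:, order))                  % visualize the rearranged non-NaN pattern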
First, I have only a little insight into financial mathematics. As I understand it, you want to find the longest continuous chain of non-NaN values for each time series. The time series should be sorted by the length of this chain, and each time series not containing a chain above a threshold should be discarded. This can be done using
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all elements that only have block sizes smaller then some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot more simply if the order of periods within a time series doesn't matter, i.e., if data([1 2 3],id) and data([3 1 2],id) are considered identical.
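A minimal sketch of that simpler variant (reusing data and some_number from above): just sort the columns by their total number of non-NaN entries and drop the short ones.
nonnan_count = sum(~isnan(data), 1);
[count_sorted, idx] = sort(nonnan_count, 'descend');
data_sorted = data(:, idx);
data_sort_chop = data_sorted(:, count_sorted >= some_number);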
What I do not know is whether you want to discard all periods within a time series that don't belong to the longest chain, extract all such chains as individual time series, and so on.
Feel free to drop a comment if you need something more specific.
Just wondering if anyone has any ideas about an issue I'm having.
I have a fair amount of data that needs to be displayed on one graph. Two theoretical lines that are bold and solid are displayed on top, then 10 experimental data sets that converge to these lines are graphed, each using a different marker (e.g. +, o, a square, etc.). These graphs are on a log scale that goes up to 1e6. The first few decades of the graph (< 1e3) look fine, but as all the datasets converge (> 1e3) it's really difficult to see which data is which.
There are over 1000 data points per decade, which I can prune linearly to an extent, but if I do this too much the lower end of the graph will suffer in resolution.
What I'd like to do is prune logarithmically, strongest at the high end, working back to 0. My question is: how can I get a logarithmically scaled index vector rather than a linear one?
My initial assumption was that as my data is linear I could just use a linear index to prune, which led to something like this (but for all decades):
%grab indices per decade
ind12 = find(y >= 1e1 & y <= 1e2);
indlow = find(y < 1e2);
indhigh = find(y > 1e4);
ind23 = find(y > 1e2 & y <= 1e3);
ind34 = find(y > 1e3 & y <= 1e4);
%We want ind12 indexes in this decade, find spacing
tot23 = round(length(ind23)/length(ind12));
tot34 = round(length(ind34)/length(ind12));
%grab ones to keep
ind23keep = ind23(1):tot23:ind23(end);
ind34keep = ind34(1):tot34:ind34(end);
indnew = [indlow' ind23keep ind34keep indhigh'];
loglog(x(indnew), y(indnew));
But this obviously causes the pruning to behave in a jumpy fashion. Each decade has the number of points I'd like, but as the distribution is linear, the points tend to be clumped at the high end of each decade on the log scale.
Any ideas on how I can do this?
I think the easiest way to do this would be to use the LOGSPACE function to generate a set of indices into your data. For example, to create a set of 100 points logarithmically spaced from 1 to N (the number of points in your data), you can try the following:
indnew = round(logspace(0,log10(N),100)); %# Create the log-spaced index
indnew = unique(indnew); %# Remove duplicate indices
loglog(x(indnew),y(indnew)); %# Plot the indexed data
Creating a logarithmically-spaced index like this will result in fewer values being chosen from the end of the vector relative to the start, thus pruning values more severely towards the end of the vector and improving the appearance of the log plot. It would therefore be most effective with vectors that are sorted in ascending order.
The way I understand the problem is that your x-values are linearly spaced, so if you plot them logarithmically, there are far more data points in the 'higher' decades, and the markers lie extremely close to one another. For example, if x goes from 1 to 1000, there are 10 points in the first decade, 90 in the second, and 900 in the third. You want to have, say, 3 points per decade instead.
I see two ways to solve the problem. The easier one is to use differently colored lines instead of different markers. Thus, you don't sacrifice any data points, and you can still distinguish everything.
The second solution is to create an unevenly spaced index. Here's how you can do that.
%# create some data
x = 1:1000;
y = 2.^x;
%# plot the graph and see the dots 'coalesce' very quickly
figure,loglog(x,y,'.')
%# for the example, I use a step size of 0.7, which is approximately `log(2)`
xx = 0.7:0.7:log(x(end)); %# this is where I want the data to be plotted
%# find the indices where we want to plot by finding the closest `log(x)'-values
%# run unique to avoid multiples of the same index
indnew = unique(interp1(log(x),1:length(x),xx,'nearest'));
%# plot with fewer points
figure,loglog(x(indnew),y(indnew),'.')