calculate monthly mean, 90th and 99th percentile of time series - matlab

I'm reading this article on wind speed trends, and the methods state that the authors tried to determine whether there is a trend in the time series of monthly mean, 90th, and 99th percentile values of wind speed over the period shown. How would one achieve this? Furthermore, what is meant by the 90th and 99th percentile? My example:
v = datenum(1981, 1, 1):datenum(2010, 11, 31); % time vector
d = rand(1,length(v)); % data vector
% calculate mean, 90th and 99th percentile values
dateV = datevec(v); % date vector
[~,~,b] = unique(dateV(:,1:2),'rows');
monthly_v = accumarray(b,v,[],@mean);
monthly_d = accumarray(b,d,[],@mean);
I can calculate the monthly mean by the method shown above, but am not sure how to calculate the 90th and 99th percentiles (plus I'm not even sure what they are). Can anyone provide some information on this?

Use the prctile function. What you are seeking is a threshold such that the proportion of input data exceeding that threshold is 100% minus the percentile. For example, for the 90th percentile you are looking for the value that 10% of your data exceeds; for the 99th percentile, the value that 1% of your data exceeds. You can simply call prctile by:
Y = prctile(X, P);
X is your data stored in vector form, and P is a vector or single number that lists the percentiles you desire. The output would be those thresholds that we just talked about, stored in Y.
In your case, v and d are the data you want the per-month percentiles of, so you would modify your accumarray calls like so:
monthly_v_90 = accumarray(b,v,[],@(x) prctile(x, 90));
monthly_v_99 = accumarray(b,v,[],@(x) prctile(x, 99));
monthly_d_90 = accumarray(b,d,[],@(x) prctile(x, 90));
monthly_d_99 = accumarray(b,d,[],@(x) prctile(x, 99));
The above code calculates, for each unique month, the 90th and 99th percentiles of v and d respectively: monthly_v_90 and monthly_v_99 give you the 90th and 99th percentiles of v for each month of each year, while monthly_d_90 and monthly_d_99 give you the same for d.
Your time vector spans January 1981 to December 2010. Because that is 30 years, and there are 12 months in a year, each of the above outputs (as well as your mean calculations) should be a 360-element vector.
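For reference, here is a self-contained sketch (mine, not part of the original answer) that puts the pieces together on dummy daily data; the (:) calls just force column vectors for accumarray:
v = datenum(1981,1,1):datenum(2010,12,31);     % daily time vector
d = rand(1, length(v));                        % dummy daily data
dateV = datevec(v);
[yearMonth, ~, b] = unique(dateV(:,1:2), 'rows'); % one group index per year-month
monthly_mean = accumarray(b, d(:), [], @mean);
monthly_90   = accumarray(b, d(:), [], @(x) prctile(x, 90));
monthly_99   = accumarray(b, d(:), [], @(x) prctile(x, 99));
% yearMonth(k,:) is the [year month] that row k of each output belongs to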

Related

Separate y-values depending if the x-value is increasing or decreasing

I am trying to analyze my data using a mixture of Python and MATLAB, but I am stuck and could not find any discussion that solves my problem.
My data consists of temperature and current measurements which are recorded at the same time but with two different devices. Afterwards these measurements are matched together using the time stamp of each measurement to get the "raw data plot". However, the current values at the same temperature differ depending on whether the sample was heated up or cooled down.
Now, I would like to separate the heating and cooling values of the current measurements, and calculate the mean and standard deviation for all currents at one temperature for cooling and heating, respectively.
What I do so far is first look for all the values that belong to the same temperature, regardless of whether it is a cooling or heating cycle. That results in quite large standard deviation values.
The two figures show a simple example of what my data looks like.
The first figure plots the temperature values against the number of data points and marks all values that belong to this temperature:
The second figure shows the current data with the marked values that correspond to the temperature.
The temperature is always kept constant for 180 s and then increased or decreased by 10 °C. During the 180 s several current measurements take place, which results in several data points per temperature per cycle. The cycle repeats itself several times (not shown here). To simplify the example, I just used simple numbers instead of real temperature and current values. The repetition of the same number indicates several measurements at one temperature. In reality the current values are not completely stable but fluctuate around a certain value (which I also ignored here).
The code which does that looks like this:
Sample data:
Test_T = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,4,4,4,4,4,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1] ;
Test_I = [5,5,5,5,5,6,6,6,6,6,7,7,7,7,7,8,8,8,8,8,9,9,9,9,9,7,7,7,7,7,6,6,6,6,6,5,5,5,5,5,4,4,4,4,4] ;
Code:
Test_T_sel = Test_T;
Test_I_sel = Test_I;
ll = 0;                                  % lower temperature limit
ul = 2;                                  % upper temperature limit
ID = Test_T_sel < ul & Test_T_sel > ll;  % select all samples at T = 1
Test_x_avg = mean(Test_T_sel(ID));       % mean temperature of the selected samples
Test_y_avg = mean(Test_I_sel(ID));       % mean current of the selected samples
figure('Position', [100 100 700 500]);
plot(Test_T_sel);
hold on;
plot(find(ID), Test_T_sel(ID), '*r');
ylabel('Temperature [°C]')
figure('Position', [900 100 700 500]);
plot(Test_I_sel);
hold on;
plot(find(ID), Test_I_sel(ID), '.r');
ylabel('Current [µA]')
Test_T contains the temperature values, which step from 1 up to 5 and back down to 1 in plateaus of five samples each, while Test_I contains the corresponding current values. As you can see, for temperature = 1 °C the current value is either 5 or 4, depending on the cycle. Now I would like to get a vector that only contains the values of T and the corresponding current values where T increases, and a second vector with the current values where T decreases.
I thought of using an if-else command, but I actually do not know how to implement it. Maybe something like this could work:
if T2 == T1 and T2-T1 <= 0.2 "take corresponding I values" (this is true when the temperature is stable and only varies by 0.2°C)
if T2-T1 > 0.2 "ignore I values until" T2 == T1 again and T2-T1 <= 0.2 (this would either be a stronger variation at one temperature or indicate a temperature change and waits until T is constant again)
But now I still need to distinguish whether the temperature is generally increasing or decreasing after the 5 measurements.
if T2 > T1, T is increasing (Test_T_heat) and the corresponding I values should be written to a vector Test_I_heat
if T2 < T1, T is decreasing (Test_T_cool) and the corresponding I values should be written to a vector Test_I_cool
For the example given above this should look like this at the end:
Test_T_heat: [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5];
Test_I_heat: [5,5,5,5,5,6,6,6,6,6,7,7,7,7,7,8,8,8,8,8,9,9,9,9,9];
Test_T_cool: [4,4,4,4,4,3,3,3,3,3,2,2,2,2,2,1,1,1,1,1] ;
Test_I_cool: [7,7,7,7,7,6,6,6,6,6,5,5,5,5,5,4,4,4,4,4] ;
How does the code have to be changed so that I get such vectors?
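For what it's worth, here is a minimal sketch (mine, not from the original post) of one way to do the split on the sample vectors above. It assumes the temperature plateaus are exact (no noise) and labels each sample by the sign of the temperature step that led into its plateau, so the peak plateau counts as heating and everything after it as cooling:
dT = [0, diff(Test_T)];                 % temperature step into each sample (0 on a plateau)
trend = zeros(size(Test_T));            % +1 while heating, -1 while cooling
current = sign(dT(find(dT ~= 0, 1)));   % the first plateau inherits the sign of the first step
for k = 1:numel(Test_T)
    if dT(k) ~= 0
        current = sign(dT(k));
    end
    trend(k) = current;
end
Test_T_heat = Test_T(trend > 0);  Test_I_heat = Test_I(trend > 0);
Test_T_cool = Test_T(trend < 0);  Test_I_cool = Test_I(trend < 0);
This reproduces the four vectors listed above for the sample data.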

How do I take an n-day average of data in Matlab to match another time series?

I have daily time series data and I want to calculate 5-day averages of that data while also retrieving the corresponding start date for each of the 5-day averages. For example:
x = [732099 732100 732101 732102 732103 732104 732105 732106 732107 732108];
y= [1 5 3 4 6 2 3 5 6 8];
Where x and y are actually size 92x1.
Firstly, how do I compute the 5-day mean when the length of this time series is not divisible by 5? Ultimately, I want to compute the 'jumping mean', where the averages are taken over non-overlapping blocks (e.g., June 1-5, June 6-10, and so on) rather than as a running mean.
I've tried doing the following:
Pentad_avg = mean(reshape(y(1:90),5,[]))'; %manually adjusted to be divisible by 5
Pentad_dt = x(1:5:90); %select every 5th day for time
However, Pentad_dt gives me dates 01-Jun-2004, 06-Jun-2004, and so on as output. And that brings me to my second point.
I am looking to find 5-day averages for x and y that correspond to 5-day averages of another time series. This second time series has 5-day averaged data starting from 15-Jun-2004 until 29-Aug-2004 (instead of starting at 01-Jun-2004). Ultimately, how do I align the dates and 5-day averages between these two time series?
Synchronization between two time series can be accomplished using the timeseries object. Placing your data into an object allows MATLAB to process it intelligently. The most useful thing it adds for your usage is the synchronize method.
You'll want to make sure to properly set the time vector on each of the timeseries objects.
An example of what this might look like is as follows:
ts1 = timeseries(y,datestr(x));
ts2 = timeseries(OtherData,OtherTimes);
[ts1 ts2] = synchronize(ts1,ts2,'Uniform','Interval',5);
This should return each timeseries aligned to the same times. You could also align a timeseries to a specific time vector using the resample method.
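If the data really are daily values with no gaps, a minimal sketch (mine, not from the answer) of the 'jumping' 5-day mean aligned to the other series' start date (15-Jun-2004 here) could also look like this:
start_dt = datenum(2004,6,15);                 % start date of the other series
i0 = find(x == start_dt, 1);                   % first daily sample to include
nBlocks = floor((numel(x) - i0 + 1) / 5);      % number of complete 5-day blocks
idx = i0 : i0 + 5*nBlocks - 1;                 % trim the tail so the length is divisible by 5
Pentad_avg = mean(reshape(y(idx), 5, []), 1)'; % one mean per 5-day block
Pentad_dt  = x(idx(1:5:end));                  % start date of each block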

How to identify an optimal subsample from a data set with missing values in MATLAB

I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series containing NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things can get (though including it could lead to overestimating the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: e.g. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK, but selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems above, though, if whether an observation is missing is not random and orthogonal to what you care about.
Inputs are a T by n sized matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize counts for the first window
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i = T_subsample + 1:T
    % One pass: add the new period entering the window and subtract the period leaving it
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end
indicator = nancount == 0; % true where a series has no NaNs in the window starting at that period
% number of NaN-free series of length T_subsample for each starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
Here's an idea:
Assuming you can rearrange the series, calculate the distance between them (you decide the metric, but if you only look at NaN vs. non-NaN, Hamming distance is OK).
Now hierarchically cluster the series and rearrange them, using either a dendrogram
or http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
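A minimal sketch of that idea (mine, assuming the Statistics and Machine Learning Toolbox and the 360-by-15,000 matrix data; note that pdist over ~15,000 series is memory-hungry):
pattern   = double(isnan(data))';             % one row per series: its 0/1 NaN pattern
keep      = sum(pattern == 0, 2) >= 72;       % prune series with fewer than 72 non-NaN periods
kept_cols = find(keep);
D    = pdist(pattern(keep,:), 'hamming');     % pairwise Hamming distance between NaN patterns
tree = linkage(D, 'average');                 % hierarchical clustering of the series
perm = optimalleaforder(tree, D);             % leaf order that groups similar NaN patterns
data_reordered = data(:, kept_cols(perm));    % columns rearranged by NaN-pattern similarity
dendrogram(tree, 0);                          % optional visual check of the grouping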
First, I have only a little insight into financial mathematics. My understanding is that you want to find the longest continuous chain of non-NaN values in each time series, sort the time series by the length of this chain, and discard every time series whose chain is not above a threshold. This can be done using:
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all series whose largest non-NaN block is smaller than some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done much more simply if the order of periods within a time series doesn't matter, i.e. if data([1 2 3],id) and data([3 1 2],id) are treated as identical; see the sketch below.
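A minimal sketch of that simpler variant (mine): count the non-NaN periods per series, then sort and chop by that count instead of by the longest consecutive block.
nonnan_count = sum(~isnan(data), 1);                 % non-NaN periods per column
[count_sorted, idx] = sort(nonnan_count, 'descend'); % most complete series first
data_sorted = data(:, idx);
data_sort_chop = data_sorted(:, count_sorted >= some_number);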
What I do not know is whether you want to discard all periods within a time series that don't belong to its biggest chunk, extract all those chains as individual time series, ...
Feel free to drop a comment if it has to be more specific.

Finding rolling z-score in Matlab

I want to calculate the z-score of the current point in cross-sectional time-series data based on the standard deviation and simple moving average over the last 10 days. I can't use the zscore function in MATLAB because it uses the whole sample (it looks forward) to calculate the z-score. Currently my solution is:
for i = 11:length(equity.(1))
    z(i) = (x(i) - mean(x(i-10:i))) / std(x(i-10:i));
end
but the issue is that I want to do this for the entire dataset at once. Is there a way to handle the entire matrix at once and calculate the z-score for a given look-back period (10 days in my case)?
Whether this is in fact more efficient or not I don't know, but here is one way (im2col requires the Image Processing Toolbox):
data = 1:40; %dummy data
% presuming "ten days" means day of interest + 9 days back
n = 10;
data2 = im2col(data,[1,n],'sliding');
%mean/std for each column:
dmean = mean(data2);
dstd = std(data2);
z = (data(n:end)-dmean)./dstd;
You might also try this from the file exchange.
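As another option (mine, not from the answer), on MATLAB R2016a or newer, movmean and movstd with a trailing window avoid both the loop and the toolbox dependency:
x = (1:40)';                 % dummy data
n = 10;                      % look-back window length
m = movmean(x, [n-1 0]);     % mean of the current sample and the n-1 samples before it
s = movstd(x,  [n-1 0]);     % matching trailing standard deviation
z = (x - m) ./ s;            % rolling z-score
z(1:n-1) = NaN;              % discard windows that are not yet full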

MATLAB: Averaging time-series data without loops?

I have measured a handful of variables in 30 minute intervals. Time stamps are available in datevec or datenum format. I want to calculate ...
a) ... daily averages and
b) ... average values at time x, e.g. temperature at 11:30, temperature at 12:00, etc. averaged over my whole dataset.
While this is, more or less, easily done with loops, I wonder whether there is an easier / more convenient way to work with time series, since this is quite a basic task after all.
/edit 1: As per request: click me for sample data
Considering that datevec() output is stored in tvec and data in x, group with unique(...,'rows') and accumulate with accumarray():
% Group by day
[unDates, ~, subs] = unique(tvec(:,1:3),'rows');
% Accumulate by day
[unDates accumarray(subs, x, [], @mean)]
% Similarly, by time of day (hour and minute)
[unHours, ~, subs] = unique(tvec(:,4:5),'rows');
[unHours accumarray(subs, x, [], @mean)]
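For a reproducible check, here is a minimal self-contained example (mine, not from the answer) with dummy 30-minute data; tvec is built directly as datevec-style [Y M D H MI S] rows:
[d, h, mi] = ndgrid(1:7, 0:23, [0 30]);              % 7 days of 30-minute slots
tvec = [repmat([2014 1], numel(d), 1), d(:), h(:), mi(:), zeros(numel(d), 1)];
x = rand(numel(d), 1);                               % dummy measurements
[unDates, ~, subs] = unique(tvec(:,1:3), 'rows');
daily_means = [unDates accumarray(subs, x, [], @mean)];        % one row per day
[unTimes, ~, subs] = unique(tvec(:,4:5), 'rows');
time_of_day_means = [unTimes accumarray(subs, x, [], @mean)];  % one row per time of day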