Error in data source: correct iteratively the vector without for loop? - matlab

Hello everyone I have a new small problem:
The data I am using have a weird trade time that goes from 17.00 of one day to 16.15 of the day after.
That means that, e.g., for the day 09-27-2013 The source I am using registers the transactions occurred as follows:
09/27/2013,17:19:42,3225.00,1 #%first obs of the vector
09/27/2013,18:37:59,3225.00,1 #%second obs of the vector
09/27/2013,09:21:34,3210.50,1 #%fifth obs of the vector
Now the first and second observations are incorrect for my purposes: they belong to the 9/27 trading day, but they were executed on 9/26. Since I am working on some MATLAB functions that rely on non-decreasing times, I need to solve this issue. The date format I am using is actually the MATLAB datenum format, so I am trying to solve the problem by simply subtracting one from the incorrect observations:
%#Call time the time vector, I can identify the 'incorrect' observations
It is easy to tell that this will only fix the last incorrect observation of a series. In the previous example this would only correct the second element, so I would have to run the code several times (I thought about a while loop) until idx is empty. This is not a big issue when working with small series, but I have up to 20 million observations and probably hundreds of thousands of consecutively incorrect ones.
Is there a way to fix this in a vectorized way?
while idx
However, given that the computation would not be so complex, I thought that a for loop could solve the issue efficiently. My idea was the following:
for i=N:-1:1
if diff(time(i,:)<0)
Sadly, it does not seem to work.
Here is an example of data I am actually using.
735507.708030093 %# I made this up to give you an example of two consecutively wrong observations
735507.708564815 %# This is an incorrect observation
Thanks everyone in advance

Sensible version -
for count = 1:numel(time)
    dtime = diff([0 ;time]);
    ind1 = find(dtime<0,1,'last')-1;
    time(ind1) = time(ind1)-1;
end
Faster-but-crazier version -
dtime = diff([0 ;time]);
for count = 1:numel(time)
    ind1 = find(dtime<0,1,'last')-1;
    time(ind1) = time(ind1)-1;
    dtime(ind1+1) = 0;
    dtime(ind1) = dtime(ind1)-1;
end
Even crazier version -
dtime = diff([0 ;time]);
ind1 = numel(dtime);
for count = 1:numel(time)
    ind1 = find(dtime(1:ind1)<0,1,'last')-1;
    time(ind1) = time(ind1)-1;
    dtime(ind1) = dtime(ind1)-1;
end
Some average computation runtimes for these versions with various datasizes -
Datasize 1: 3432 elements
Version 1 - 0.069 sec
Version 2 - 0.042 sec
Version 3 - 0.034 sec
Datasize 2: 20 Million elements
Version 1 - 37029 sec
Version 2 - 23303 sec
Version 3 - 20040 sec

So apparently I had 3 other different problems in the data source that I think could have stalled the routine Divakar proposed. Anyway, I thought it was too slow, so I started thinking about another solution and came up with a super quick vectorized one.
Given that the observations I wanted to modify fall in a known interval of time, the function just looks for every observation falling in that interval and modifies it as I want (-1 in my case).
function [ datetime ] = correct_date( datetime,starttime, endtime)
%#datetime is my vector of dates and times in matlab numerical format
%#starttime is the starting hour of the interval expressed in datestr format. e.g. '17:00:00'
%#endtime is the ending hour of the interval expressed in datestr format. e.g. '23:59:59'
if (nargin < 1) || (nargin > 3),
    error('Requires 1 to 3 input arguments.')
end
%# default values (assumed here to be the example interval from the comments above)
if nargin == 1,
    starttime = '17:00:00';
    endtime = '23:59:59';
elseif nargin == 2,
    endtime = '23:59:59';
end
tvec=[datenum(starttime) datenum(endtime)];
tvec=tvec-floor(tvec); %#As I am working on multiple days I need to isolate only HH:MM:SS for my interval limits
temp=datetime-floor(datetime); %#same motivation as in the previous line
idx=find(temp>=tvec(1)&temp<=tvec(2)); %#find the indices that fall inside the interval
datetime(idx)=datetime(idx)-1; %#modify them as I want
clear tvec temp idx
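As a quick check of the idea, here is a minimal sketch applied to the three sample rows from the question. It assumes the misdated trades are exactly those stamped at or after 17:00; the variable names (t, tod, late, is_sorted) are illustrative and not part of the function above.

```matlab
% Sample rows from the question, built with datenum(Y,M,D,H,MI,S)
t = [datenum(2013, 9, 27, 17, 19, 42); ...
     datenum(2013, 9, 27, 18, 37, 59); ...
     datenum(2013, 9, 27,  9, 21, 34)];

% Time of day as a fraction of a day (0 = midnight, 0.5 = noon)
tod = t - floor(t);

% Assumption: every stamp at or after 17:00 belongs to the previous calendar day
late = tod >= 17/24;
t(late) = t(late) - 1;

% The corrected vector should now be non-decreasing
is_sorted = all(diff(t) >= 0);
```

Because the comparison is done on the whole vector at once, runs of consecutive misdated observations are corrected in a single pass.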


Checking if a given date hour is within a predefined interval with datenum()

I have a table with dates (and other things), which I have extracted from a CSV file. In order to do some processing of my data (including plotting) I decided to convert all my date-strings to date-numbers (below for simplicity reasons I will exclude all the rest of the data and concentrate on the dates only so don't mind the step from dates to timetable and the fact that it can be omitted):
dates = [7.330249777777778e+05;7.330249291666667e+05;7.330246729166667e+05;7.330245256944444e+05;7.330246763888889e+05;7.330245284722222e+05;7.330245326388889e+05;7.330246625000000e+05];
timetable = table(dates);
I'm facing the following issue - based on the time during the day I want to tell the user if a date is in the morning (24-hours scale: 5-12h), noon (12-13h), afternoon (13-18h), evening (18-21h), night (21-5h) based on the date I have stored in my table. In case I had a date-vector (with elements: year,month,day,hour,minute,second) it would be pretty straight forward:
for date = 1:size(timetable)
switch timetable(date).hour
case {5,12}
case {12,13}
case {13,18}
case {18,21}
With 7.330246729166667 and the rest this is not that obvious at least to me. Any idea how to avoid converting to some other date-format just for this step and at the same time avoid some complex formula for extracting the required data (not necessarily hour only but I'm interested in the rest too)?
One unit in MATLAB serial dates is equivalent to 1 day, i.e. 24 hours. Knowing this, you can bin the fractional part of the dates within the intraday buckets you defined (note that your switch will only work for values exactly equal to the case lists):
bins = {'morning', 'noon', 'afternoon', 'evening', 'night'};
edges = [5,12,13,18,21,25]./24; % As fraction of a day
% Take fractional part
time = mod(dates,1);
% Bin with lb <= x < ub, where e.g. lb = 5/24 and ub = 12/24
[counts,~,pos] = histcounts(time, edges);
% Make sure unbinned x in [0,5) are assigned 'night'
pos(pos==0) = 5;
bins(pos)
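Because one unit is one day, the remaining clock fields follow from the fractional part with plain arithmetic. The sketch below is only an illustration on a made-up timestamp; it rounds to whole seconds first, which assumes the data carry no sub-second resolution. The names d, secs, hh, mm and ss are hypothetical.

```matlab
d = datenum(2006, 12, 1, 18, 37, 59);   % made-up example timestamp

% Seconds since midnight, rounded to absorb floating-point error
secs = round(mod(d, 1) * 86400);

hh = floor(secs / 3600);                % hour of the day
mm = floor(mod(secs, 3600) / 60);       % minute within the hour
ss = mod(secs, 60);                     % second within the minute
```

The same expressions work element-wise when d is a vector of serial dates.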

average wind direction using histc matlab

Hello, this question might be easy, but I am struggling to get average wind directions for one year. I need hourly averages to compare with concentration measurements. My wind measurements are taken every minute, in degrees. My idea was to use the histc function in MATLAB to get the most common wind direction within the hour. This works for one hour, but how do I create a loop that gives me hourly values for a year?
Here is the code:
% wdd = wind directions in degrees (vector; for a year of minute data, 525600 elements)
binranges = [0:10:360];
[bincounts,ind] = histc(wdd(1:60),binranges);
[num idx] = max(bincounts(:));
kind regards matthias
How about this one -
binranges = [0:10:360]
[bincounts,ind] = histc(reshape(wdd,60,[]),binranges)
[nums idxs] = max(bincounts)
What I would do is:
wdd_phour=reshape(wdd,60,525600/60); % get a matrix of size 60(min) X hours per year
mean_phour=mean(wdd_phour,1); % compute the average of each 60 mins for every hour in a year
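To see how the two suggestions behave, here is a small sketch on made-up data for three hours. The sample values and the names per_hour and modal_edge are assumptions for illustration; the reshape, histc and mean calls follow the answers above.

```matlab
% Made-up minute data for three hours (180 samples)
hour1 = 45  * ones(60, 1);                      % steady 45 degrees
hour2 = 185 * ones(60, 1);                      % steady 185 degrees
hour3 = [355 * ones(40, 1); 5 * ones(20, 1)];   % mostly 355, some 5
wdd = [hour1; hour2; hour3];

binranges = 0:10:360;
per_hour = reshape(wdd, 60, []);          % one column per hour
bincounts = histc(per_hour, binranges);   % counts per bin, column by column
[~, idxs] = max(bincounts, [], 1);        % most populated bin in each hour
modal_edge = binranges(idxs);             % lower edge of that bin, in degrees

mean_phour = mean(per_hour, 1);           % plain arithmetic mean per hour
```

For the third hour the most common bin starts at 350 degrees, while the plain mean comes out near 238 degrees, because an arithmetic mean does not account for the wrap-around at 360.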

A moving average with different functions and varying time-frames

I have a matrix time-series data for 8 variables with about 2500 points (~10 years of mon-fri) and would like to calculate the mean, variance, skewness and kurtosis on a 'moving average' basis.
Let's say frames = [100 252 504 756]. I would like to calculate the four functions above over each of the (time-)frames, on a daily basis, so the return for day 300 in the case of the 100-day frame would be [mean variance skewness kurtosis] from the period day 201 to day 300 (100 days in total), and so on.
I know this means I would get an array output, and that the first frame-length days would be NaNs, but I can't figure out the required indexing to get this done...
This is an interesting question because I think the optimal solution is different for the mean than it is for the other sample statistics.
I've provided a simulation example below that you can work through.
First, choose some arbitrary parameters and simulate some data:
%#Set some arbitrary parameters
T = 100; N = 5;
WindowLength = 10;
%#Simulate some data
X = randn(T, N);
For the mean, use filter to obtain a moving average:
MeanMA = filter(ones(1, WindowLength) / WindowLength, 1, X);
MeanMA(1:WindowLength-1, :) = nan;
I had originally thought to solve this problem using conv as follows:
MeanMA = nan(T, N);
for n = 1:N
    MeanMA(WindowLength:T, n) = conv(X(:, n), ones(WindowLength, 1), 'valid');
end
MeanMA = (1/WindowLength) * MeanMA;
But as @PhilGoddard pointed out in the comments, the filter approach avoids the need for the loop.
Also note that I've chosen to make the dates in the output matrix correspond to the dates in X so in later work you can use the same subscripts for both. Thus, the first WindowLength-1 observations in MeanMA will be nan.
For the variance, I can't see how to use either filter or conv or even a running sum to make things more efficient, so instead I perform the calculation manually at each iteration:
VarianceMA = nan(T, N);
for t = WindowLength:T
    VarianceMA(t, :) = var(X(t-WindowLength+1:t, :));
end
We could speed things up slightly by exploiting the fact that we have already calculated the mean moving average. Simply replace the within loop line in the above with:
VarianceMA(t, :) = (1/(WindowLength-1)) * sum((bsxfun(@minus, X(t-WindowLength+1:t, :), MeanMA(t, :))).^2);
However, I doubt this will make much difference.
If anyone else can see a clever way to use filter or conv to get the moving window variance I'd be very interested to see it.
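One possibility, offered here as a sketch rather than as part of the original answer, is to run filter twice: once over X and once over X.^2. It relies on the identity sum((x - mean(x)).^2) = sum(x.^2) - sum(x)^2 / W for a window of W points. The names W, S1, S2, VarFilt and VarLoop are hypothetical, and the loop is included only to compare the two results.

```matlab
T = 100; N = 5; W = 10;        % W plays the role of WindowLength
X = randn(T, N);

k  = ones(W, 1);
S1 = filter(k, 1, X);          % running sum of x over the last W rows
S2 = filter(k, 1, X.^2);       % running sum of x.^2 over the last W rows

% Sample variance from the two running sums
VarFilt = (S2 - S1.^2 / W) / (W - 1);
VarFilt(1:W-1, :) = nan;       % incomplete windows, as for MeanMA

% Reference: the explicit loop from the answer
VarLoop = nan(T, N);
for t = W:T
    VarLoop(t, :) = var(X(t-W+1:t, :));
end
max_err = max(max(abs(VarFilt(W:T, :) - VarLoop(W:T, :))));
```

The subtraction of two sums can lose precision when the mean is large relative to the spread, so it may be worth demeaning the series first. Newer MATLAB releases also provide movmean and movvar.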
I leave the case of skewness and kurtosis to the OP, since they are essentially just the same as the variance example, but with the appropriate function.
A final point: if you were converting the above into a general function, you could pass in an anonymous function as one of the arguments, then you would have a moving average routine that works for arbitrary choice of transformations.
Final, final point: For a sequence of window lengths, simply loop over the entire code block for each window length.
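A minimal sketch of these last two points, assuming the statistic is passed as a function handle that maps a W-by-N block to a 1-by-N row. The names frames, stat and out, and the short window lengths, are hypothetical and chosen only to keep the example small.

```matlab
X = randn(300, 3);                 % simulated data, as in the example above
frames = [20 50 100];              % hypothetical window lengths
stat = @(block) var(block);        % any handle mapping a W-by-N block to 1-by-N

[T, N] = size(X);
out = nan(T, N, numel(frames));    % out(t, :, f): statistic at row t for frames(f)
for f = 1:numel(frames)
    W = frames(f);
    for t = W:T
        out(t, :, f) = stat(X(t-W+1:t, :));
    end
end
```

Swapping stat for @mean, or for @skewness or @kurtosis where the Statistics Toolbox is available, gives the other moving statistics without touching the loop.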
I have managed to produce a solution, which only uses basic functions within MATLAB and can also be expanded to include other functions, (for finance: e.g. a moving Sharpe Ratio, or a moving Sortino Ratio). The code below shows this and contains hopefully sufficient commentary.
I am using a time series of Hedge Fund data, with ca. 10 years worth of daily returns (which were checked to be stationary - not shown in the code). Unfortunately I haven't got the corresponding dates in the example so the x-axis in the plots would be 'no. of days'.
% start by importing the data you need - here it is a selection out of an
% excel spreadsheet
returnsHF = xlsread('HFRXIndices_Final.xlsx','EquityHedgeMarketNeutral','D1:D2742');
% two years to be used for the moving average. (250 business days in one year)
window = 500;
% create zero-matrices to fill with the MA values at each point in time
% (one entry per window position, i.e. length(returnsHF)-window+1 entries).
mean_avg = zeros(length(returnsHF)-window+1,1);
st_dev = zeros(length(returnsHF)-window+1,1);
skew = zeros(length(returnsHF)-window+1,1);
kurt = zeros(length(returnsHF)-window+1,1);
% Now work through the time-series with each of the functions (one can add
% any other functions required), assigning the values to the zero-matrices
for count = window:length(returnsHF)
    % This is the most tricky part of the script, the indexing in this section
    % The TwoYearReturn is what is shifted along one period at a time with the
    % for-loop.
    TwoYearReturn = returnsHF(count-window+1:count);
    mean_avg(count-window+1) = mean(TwoYearReturn);
    st_dev(count-window+1) = std(TwoYearReturn);
    skew(count-window+1) = skewness(TwoYearReturn);
    kurt(count-window+1) = kurtosis(TwoYearReturn);
end
% Plot the MAs
subplot(4,1,1), plot(mean_avg)
title('2yr mean')
subplot(4,1,2), plot(st_dev)
title('2yr stdv')
subplot(4,1,3), plot(skew)
title('2yr skewness')
subplot(4,1,4), plot(kurt)
title('2yr kurtosis')

MATLAB: n-minute/hour/day averages of a time-series

This is a follow-up to an earlier question of mine posted here. Based on Oleg Komarov's answer I wrote a little tool to get daily, hourly, etc. averages or sums of my data that uses accumarray() and datevec()'s output structure. Feel free to have a look at it here (it's probably not written very well, but it works for me).
What I would like to do now is add the functionality to calculate n-minute, n-hour, n-day, etc. statistics instead of 1-minute, 1-hour, 1-day, etc. like my function does. I have a rough idea that simply loops over my time-vector t (which would be pretty much what I would have done already if I hadn't learnt about the beautiful accumarray()), but that means I have to do a lot of error-checking for data gaps, uneven sampling times, etc.
I wonder if there is a more elegant/efficient approach that lets me re-use/extend my old function posted above, i.e. something that still makes use of accumarray() and datevec(), since this makes working with gaps very easy.
You can download some sample data taken from my last question here. These were sampled at 30 min intervals, so a possible example of what I want to do would be to calculate 6 hour averages without relying on the assumption that they are free of gaps and/or always sampled at exactly 30 min.
This is what I have come up with so far, which works reasonably well, apart from a small but easily fixed problem with the time stamps (e.g. 0:30 is representative for the interval from 0:30 to 0:45 -- my old function suffers from the same problem, though):
[ ... see my answer below ...]
Thanks to woodchips for inspiration.
The linked method of using accumarray seems overkill and too complex to me if you start with evenly spaced measurements without any gaps. I have the following function in my private toolbox for calculating an N-point average of vectors:
function y = blockaver(x, n)
% y = blockaver(x, n)
% input points are averaged over n points
% always returns column vector
if n == 1
    y = x(:);
else
    nblocks = floor(length(x) / n);
    y = mean(reshape(x(1:n * nblocks), n, nblocks), 1).';
end
Works pretty well for quick and dirty decimating by a factor N, but note that it does not apply proper anti-alias filtering. Use decimate if that is important.
I guess I figured it out using parts of @Bas Swinckels' answer and @woodchips' code linked above. Not exactly what I would call good code, but working and reasonably fast.
function [ t_acc, x_acc, subs ] = ts_aggregation( t, x, n, target_fmt, fct_handle )
% t is time in datenum format (i.e. days)
% x is whatever variable you want to aggregate
% n is the number of minutes, hours, days
% target_fmt is 'minute', 'hour' or 'day'
% fct_handle can be an arbitrary function (e.g. @sum)
t = t(:);
x = x(:);
switch target_fmt
    case 'day'
        t_factor = 1;
    case 'hour'
        t_factor = 1 / 24;
    case 'minute'
        t_factor = 1 / ( 24 * 60 );
end
t_acc = ( t(1) : n * t_factor : t(end) )';
subs = ones(length(t), 1);
for i = 2:length(t_acc)
    subs(t > t_acc(i-1) & t <= t_acc(i)) = i;
end
x_acc = accumarray( subs, x, [], fct_handle );
/edit: Updated to a much shorter function that does use loops, but appears to be faster than my previous solution.
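For comparison, here is a loop-free sketch of the same idea that computes the bin index arithmetically and lets accumarray handle gaps. It is not the function above: it bins by left-closed intervals starting at t(1), it assumes timestamps are meaningful to the nearest second, and the sample data and the names bin_secs and secs are made up for illustration.

```matlab
% Made-up data: 30-minute samples over two days, with five samples removed
t = 735000 + (0:95)' / 48;     % datenum, i.e. days
x = (0:95)';
t(10:14) = [];
x(10:14) = [];

n = 6;                          % bin width, in units of the target format
t_factor = 1 / 24;              % one hour expressed in days
bin_secs = round(n * t_factor * 86400);

% Whole seconds since the first sample, rounded to absorb floating-point error
secs = round((t - t(1)) * 86400);

% Bin i covers [t(1) + (i-1)*width, t(1) + i*width)
subs = floor(secs / bin_secs) + 1;

x_acc = accumarray(subs, x, [], @mean, NaN);         % NaN marks bins with no samples
t_acc = t(1) + (0:numel(x_acc)-1)' * n * t_factor;   % left edge of each bin
```

Since every sample is assigned to a bin from its own timestamp, missing or unevenly spaced samples only change how many values each bin averages.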

matlab brute force indexing

Hi I am working with the brute force method to examine possible combinations of "panels" and "turbines"
My code is
for number_panels = 0:5
    for number_turbines = 0:10
        for n = 1:24 % number of hours per day
            deficit(n) = Demand(n) - (PV_supply(n)*number_panels) -...
                (WT_supply(n)*number_turbines);% hourly power deficit
            if deficit(n)<0
                deficit(n) = 0;
            end
        end
    end
end
The problem I have above is that I haven't yet figured the correct indexing of this code.
What I am trying to do is find the "deficit" for the "number_panels" , "number_turbines" and "n". As it stands I can only find the "deficit" for the last for loop.
How can I code so that I can have the option to access the nth row (or sets of "n" i.e 1-24) and also for the "number_panels" "number_turbines" option?
thanks - in order to find the sum of each deficit(n) value and thus have the respective total deficit of the 24 hour period I have done the following which seems to me to be able to do what I am asking but I am getting incorrect answers:
daily_deficit(number_panels + 1, number_turbines + 1) =...
sum(deficit(number_panels + 1, number_turbines + 1,n))
function calcDeficit.m:
function deficit = calcDeficit (Demand, PV_supply, WT_supply)
% initialize the size (good practice)
deficit = zeros(6,11,24);
for number_panels = 0:5
    for number_turbines = 0:10
        for n = 1:24 % number of hours per day
            deficit(number_panels+1,number_turbines+1,n) = Demand(n) - (PV_supply(n)*number_panels) -...
                (WT_supply(n)*number_turbines);% hourly power deficit
            if deficit(number_panels+1,number_turbines+1,n)<0
                deficit(number_panels+1,number_turbines+1,n) = 0;
            end
        end
    end
end
example call:
Demand=randn(24,1); PV_supply=randn(24,1); WT_supply=randn(24,1); deficit = calcDeficit(Demand,PV_supply,WT_supply)
You access Demand by
Your problem is that you're storing the deficit result as a function only of the value n, the number of hours per day. In your inner loop around n, you keep replacing the values each time through your outer loops, so at the end of the run, you only have the value for n = 1:24 at number_panels = 5 and number_turbines = 10.
Try this:
deficit(number_panels+1, number_turbines+1, n) = ...
That way at the end, you can check any combination given the three indices. I've added a value of 1 to number_panels and number_turbines because MATLAB uses 1-based indexes. To get your results for a specific number of panels or turbines, make sure to add 1 when checking.
Specifically, for 3 panels and 4 turbines at hour 5 in the day:
disp(deficit(3+1, 4+1, 5))
EDIT: Added 1 to the values of number_panels and number_turbines to avoid 0-indexing.
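To tie the three indices together with the daily total asked about in the question, here is a compact sketch. The input vectors are made-up placeholders, and using ndgrid and max in place of the two outer loops and the if test is a choice made here, not something taken from the answers.

```matlab
% Placeholder inputs (assumed): 24 hourly values each
Demand    = (1:24)';
PV_supply = ones(24, 1);
WT_supply = 2 * ones(24, 1);

% P(i,j) = i-1 panels, W(i,j) = j-1 turbines
[P, W] = ndgrid(0:5, 0:10);

deficit = zeros(6, 11, 24);
for n = 1:24   % number of hours per day
    % hourly power deficit for every combination, floored at zero
    deficit(:, :, n) = max(Demand(n) - PV_supply(n) * P - WT_supply(n) * W, 0);
end

% Total deficit over the 24 hours for each combination
daily_deficit = sum(deficit, 3);
```

Summing along the third dimension gives one total per (panels, turbines) pair, so daily_deficit(3+1, 4+1) is the 24-hour total for 3 panels and 4 turbines.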