I am processing 1Hz timestamps (variable 'timestamp_1hz') from a logger which doesn't log at exactly the same time every second (the difference varies from 0.984 to 1.094 seconds, but is sometimes 0.5 seconds or several seconds if the logger burps). The 1Hz dataset is used to build a 10-minute averaged dataset, and each 10-minute interval must have 600 records. Because the logger doesn't log at exactly the same time every second, the timestamp slowly drifts through the 1-second mark. Issues come up when the timestamp crosses the 0 mark, as well as the 0.5 mark.
I have tried various ways to pre-process the timestamps. The timestamps with around 1 second between them should be considered valid. A few examples include:
% simple
% this screws up around half second and full second values
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
rawsecondstest = rawseconds;
rawsecondstest(:,1) = floor(rawseconds(:,1))+ rawseconds(1,1);
% more complicated
% this screws up if there is missing data, and the issue then compounds because the k+1 timestamp depends on the k timestamp
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
A = diff(rawseconds);
numcheck = rawseconds(1,1);
integ = floor(numcheck);
fract = numcheck-integ;
if fract>0.5
    rawseconds(1,1) = rawseconds(1,1)-0.5;
end
rawsecondstest = rawseconds; % initialize so the loop below has a starting value
for k=2:length(rawseconds)
    rawsecondstest(k,1) = rawsecondstest(k-1,1)+round(A(k-1,1));
end
I would like to pre-process the timestamps and then compare them to a contiguous 1Hz timestamp using 'intersect' in order to find the missing, repeated, etc. data, like this:
% pull out the time stamp (round to 1hz and convert to serial number)
timestamp_1hz=round((datenum(raw_1hz_new(:,[1:6])))*86400)/86400;
% calculate new start time and end time to find contig time
starttime=min(timestamp_1hz);
endtime=max(timestamp_1hz);
% determine the contig time
contigtime=round([floor(mean([starttime endtime])):1/86400:ceil(mean([starttime endtime]))-1/86400]'*86400)/86400;
% find indices where the logger time stamp matches real time and
% return the matching indices into each (Ia and Ib)
clear Ia Ib Ic Id
[~,Ia,Ib]=intersect(timestamp_1hz,contigtime);
% find indices where there is a value in real time that is not in
% logger time
[~,Ic] = setdiff(contigtime,timestamp_1hz);
% finds the indices that are unique
[~,Id] = unique(timestamp_1hz);
You can download 10 days of the raw_1hz_new timestamps here. Any help or tips would be much appreciated!
The problem you have is that you can't simply match these stamps up to a list of times, because you could be expecting a set of datapoints at seconds = 1000, 1001, 1002, but if there was an earlier blip you could have entirely legitimate data at 1000.5, 1001.5, 1002.5 instead.
If all you want is a list of valid times and their locations in your series, why not just something like this (times in seconds):
A = diff(times); % difference between times
n = find(abs(A-1)<0.1); % change 0.1 to whatever your tolerance is
times2 = times(n+1);
times2 should then be a list of all your timestamps where the previous timestamp was approximately 1 second earlier - it works on a small set of fake data I constructed, but I didn't try it on yours. (For future reference: it would be more helpful to provide a small subset of your data, e.g. just a few minutes' worth, that you know contains a blip.)
I would then take the list of valid timestamps and split it up into 10 minute sections for averaging, counting how many valid timestamps were obtained in each section. If it's working, you should end up with no more than 600 - but not much less if the blips are occasional.
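Building on that, a minimal sketch of the 10-minute binning step (assuming times and times2 are in seconds, and that a hypothetical vector data holds the matching 1 Hz samples):
data2 = data(n + 1);                                    % samples whose timestamps are valid
bin = floor((times2 - times2(1)) / 600) + 1;            % 10-minute bin index for each sample
bin_count = accumarray(bin(:), 1);                      % valid samples per bin (at most 600)
bin_mean  = accumarray(bin(:), data2(:)) ./ bin_count;  % 10-minute averages (NaN for empty bins)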
I'm trying to plot some data in MATLAB against minutes instead of seconds, as formatted time, i.e. min:sec.
I have real-time data streaming in, and with every sample received its time in seconds is also sent; I then plot the samples against time. Since my session is around 15 minutes long, I can't keep plotting against seconds, so I wanted to plot against time in min:sec. I tried dividing the received time by 60, but this gives me minutes with 100 subdivisions instead of 60 (the minutes increment after 0.9999 instead of 0.59). How do I convert it so that I can plot against time in minutes?
Here is what I mean by 0.99 fractions of a minute instead of 0.59: a normal minute has 60 divisions, not 100.
EDIT:
I tried m7913d's suggestions and here is what I got.
First I plot the signal against time in seconds without changing the ticks (a normal plot(t,v)).
Then I added datetick('x', 'mm:ss'); to the plot (xtickformat is not supported in MATLAB R2015b).
Here is a screenshot of the results
The time in seconds went up to 80 seconds; translated into minutes, that should give 1 minute and 20 seconds as the maximum x-axis limit, but that is not the case. I tried to construct a t vector (e.g. t=0:seconds(3):minutes(3)) but I couldn't link it to my seconds vector, which will be constantly updated as new samples are received from the serial port.
Thanks
You can use xtickformat to specify the desired format of your x labels as follows:
% generate a random signal (in seconds)
t = 0:5:15*60;
y = rand(size(t));
plot(seconds(t),y) % plot your signal, making it explicit that t is expressed in seconds
xtickformat('mm:ss') % specify the desired format of the x labels
Note that I used the seconds function, which returns a duration array, to indicate to MATLAB that t is expressed in seconds.
The output of the above script is (the right image is a zoomed version of the left image):
Pre R2016b
One can use datetime instead of xtickformat as follows:
datetimes = datetime(0,0,0,0,0,t); % convert seconds to datetime
plot(datetimes,y)
datetick('x', 'MM:SS'); % set the x tick format (note that you should now use capital M and S in the format string)
xlim([min(datetimes) max(datetimes)])
I have large data files with 10Hz resolution which have been split up into half hour files. Each half hour file should contain 18000 rows. However, this is not usually the case and there are gaps in the time stamps due to datalogging errors.
In order to be able to process this data further, I need uniform data files with exactly 18000 rows each. I have written a script in MATLAB which solves this problem by using a generated full timestamp for each half hour period. Where there are gaps in the original timestamp, I fill the rows with NaNs (except for the timestamp column).
Using a simple example, here is the code:
Fs=1:10; % uniform timestamp
Fs=Fs'; % transpose to a column vector
Ts=[1 3 7 2 1; 3 4 3 3 2;6 5 2 1 3]; %Ts is the original data with the timestamp in the first column
Corrected=zeros(length(Fs),6); % corrected is the data after applying uniform timestamp
for ii=1:length(Fs)
    for jj=1:length(Ts(:,1))
        if Fs(ii)==Ts(jj,1)
            Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
            break
        else
            Corrected(ii,:)=[Fs(ii), NaN(1,5)]; % keep the timestamp, NaN-fill the data columns
            continue
        end
    end
end
The code works well enough but when I apply it to the 10Hz data, it is extremely slow. Any ideas on how I can improve this code?
Note that in the actual code I compare date strings:
if sum(Fs_str(ii,:)==Ts_str(jj,:))==23
Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
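For comparison, a vectorized sketch of the same fill (assuming numeric timestamps that match exactly, as in the toy example above; date strings would first need converting, e.g. with datenum):
Corrected = [Fs, NaN(length(Fs), size(Ts,2))];  % timestamp column plus NaN-filled data columns
[found, loc] = ismember(Fs, Ts(:,1));           % which uniform stamps appear in the original data
Corrected(found, 2:end) = Ts(loc(found), :);    % copy the matching original rows into place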
I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data contained in my matrix, while ensuring that the subsample covers a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series with NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things could get (though including it could lead to overestimating the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: eg. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK. But selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems above, though, if the pattern of missing observations is not random and orthogonal to what you care about.
The input is a T-by-n matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize with the NaN count for the first window
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i = T_subsample+1:T
    % One pass: add the new period entering the window and subtract the period leaving it
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end
indicator = nancount==0; % indicator of whether starting_period, series
% has no NaNs
% number of nonan series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
Here's an idea.
Assuming you can rearrange the series, calculate the pairwise distance between their NaN patterns (you decide the metric, but if you're only comparing NaN vs. non-NaN, Hamming is fine).
Now hierarchically cluster the series and rearrange them using either a dendrogram or a clustergram: http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
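A sketch of that idea (assuming the Statistics and Machine Learning Toolbox is available; data is the 360-by-15,000 matrix and 72 the minimum number of non-NaN periods, both taken from the question; note that pdist over ~15,000 series builds a large distance vector, so you may want to start with a subset):
keep = sum(~isnan(data), 1) >= 72;       % prune series with too few observations
pattern = double(isnan(data(:, keep)))'; % one row per series: its NaN pattern
D = pdist(pattern, 'hamming');           % pairwise Hamming distance between NaN patterns
Z = linkage(D, 'average');               % hierarchical clustering of the series
[~, ~, perm] = dendrogram(Z, 0);         % leaf order from the dendrogram
clustered = data(:, keep);
clustered = clustered(:, perm);          % series with similar gap structure end up adjacent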
First, I have only a little insight into financial mathematics. My understanding is that you want to find the longest continuous chain of non-NaN values in each time series; the time series should be sorted by the length of this chain, and any time series not containing a chain above a threshold should be discarded. This can be done using:
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all series whose largest block is smaller than some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot more simply if the order of periods within a time series doesn't matter, i.e. if data([1 2 3],id) and data([3 1 2],id) are considered identical (see the sketch below).
What I do not know is whether you want to discard all periods within a time series that don't belong to the longest chain, extract all those chains as individual time series, ...
Feel free to drop a comment if this needs to be more specific.
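For that simpler case, a sketch reusing the variable names above (sorting and chopping by the plain non-NaN count instead of the longest chain):
nonnan_count = sum(~isnan(data), 1);                    % non-NaN periods per series
[count_sorted, idx2] = sort(nonnan_count, 'descend');   % most complete series first
data_sorted2 = data(:, idx2);
data_sort_chop2 = data_sorted2(:, count_sorted >= some_number);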
I am writing a for loop to average 10 years of hourly measurements made on the hour. The dates of the measurements are recorded as MATLAB datenums.
I am trying to iterate through the hours using 0.0417, as that is the datenum value for 1 AM on 00/00/0000, but it adds a couple of seconds of error with each iteration.
Can anyone recommend a better way for me to iterate by hour?
date = a(:,1);
load = a(:,7);
%loop for each hour of the year
for i=0:0.0417:366
    % set condition
    % condition removes year from current date
    c = date(:)-datenum(year(date(:)),0,0)==i;
    % evaluate condition on load vector and find mean
    X(i,2)=mean(load(c==1));
end
An hour has a duration of 1/24 day, not 0.0417. Use 1/24 and the precision is sufficiently high for a year.
For an even higher precision, use something like datenum(y,1,1,1:24*365,0,0) to generate all timestamps.
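A small sketch combining those points (the variable names date and load are taken from the question; leap days are ignored, and a one-second tolerance stands in for exact matching):
dv = datevec(date);                              % split the datenums into components
day_in_year = date - datenum(dv(:,1), 1, 1);     % days since Jan 1 of each sample's year
offsets = (0:24*365-1)' / 24;                    % exact hourly offsets within a year
tol = 1/86400;                                   % one-second matching tolerance
X = nan(numel(offsets), 1);
for k = 1:numel(offsets)
    c = abs(day_in_year - offsets(k)) < tol;     % same hour-of-year, across all years
    X(k) = mean(load(c));                        % NaN if no measurement falls on this hour
end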
To avoid error drift entirely, specify the index using integers, and divide the result down inside the loop:
for hour_index=1:365*24
    hour_datenum = (hour_index - 1) / 24;  % exact hour offset in days, no accumulated drift
end
I have a data file which contains time data. The list is quite long, 100,000+ points. There is data every 0.1 seconds, and the time stamps look like this:
'2010-10-10 12:34:56'
'2010-10-10 12:34:56.1'
'2010-10-10 12:34:56.2'
'2010-10-10 12:34:56.3'
etc.
Not every 0.1 second interval is necessarily present. I need to check whether a 0.1 second interval is missing, then insert this missing time into the date vector. Comparing strings seems unnecessarily complicated. I tried comparing seconds since midnight:
date_nums=datevec(time_stamps);
secs_since_midnight=date_nums(:,4)*3600+date_nums(:,5)*60+date_nums(:,6);
comparison_secs=linspace(0,86400,864000);
res=(ismember(comparison_secs,secs_since_midnight)~=1);
However, this approach doesn't work due to rounding errors: the seconds since midnight and the linspace of comparison seconds never quite match up (because of the tenth-of-a-second resolution?). The intent is to later do an FFT on the data associated with the time stamps, so I want the data to be as uniform as possible (the data associated with the missing intervals will be interpolated). I've considered blocking it into smaller chunks of time and just checking the small chunks one at a time, but I don't know if that's the best way to go about it. Thanks!
Multiply your numbers-of-seconds by 10 and round to the nearest integer before comparing against your range.
There may be more efficient ways to do this than ismember. (I don't know offhand how clever the implementation of ismember is, but if it's The Simplest Thing That Could Possibly Work then you'll be taking O(N^2) time that way.) For instance, you could use the timestamps that are actually present (as integer numbers of 0.1-second intervals) as indices into an array.
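A sketch of both suggestions (reusing secs_since_midnight from the question and assuming the file covers one day at 0.1 s resolution):
tenths = round(secs_since_midnight * 10);   % integer count of 0.1 s intervals since midnight
present = false(864000, 1);                 % one slot per 0.1 s interval in a day
present(tenths + 1) = true;                 % +1 because MATLAB indexing starts at 1
missing_tenths = find(~present) - 1;        % intervals with no record
missing_secs = missing_tenths / 10;         % the missing times, in seconds since midnight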
Since you're concerned with missing data records and not other timing issues such as a drifting time channel, you could check for missing records by converting the time values to seconds, doing a DIFF and finding those first differences that are greater than some tolerance. This would tell you the indices where the missing records should go. It's then up to you to do something about this. Remember, if you're going to use this list of indices to fill the gaps, process the list in descending index order since inserting the records will cause the index list to be unsynchronized with the data.
>> time_stamps = now:.1/86400:now+1; % Generate test data.
>> time_stamps(randi(length(time_stamps), 10, 1)) = []; % Remove 10 random records.
>> t = datenum(time_stamps); % Convert to date numbers.
>> t = 86400 * t; % Convert to seconds.
>> index = find(diff(t) > 1.999 * 0.1)' + 1 % Find missing records.
index =
30855
147905
338883
566331
566557
586423
642062
654682
733641
806963
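If you then want to insert the missing records themselves, a sketch of the descending-order fill mentioned above (assuming each gap is a single missing 0.1 s record; the data rows would need matching NaN rows inserted at the same positions):
t = t(:);                                            % work with a column vector
for k = numel(index):-1:1                            % descending, so earlier indices stay valid
    i = index(k);
    t = [t(1:i-1); (t(i-1) + t(i)) / 2; t(i:end)];   % insert the midpoint timestamp
end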