How can I speed up a MATLAB nested for loop (timestamp correction)? - matlab

I have large data files with 10Hz resolution which have been split up into half hour files. Each half hour file should contain 18000 rows. However, this is not usually the case and there are gaps in the time stamps due to datalogging errors.
In order to process this data further, I need uniform data files with exactly 18000 rows each. I have written a script in MATLAB which solves this problem by generating a full timestamp for each half hour period. Where there are gaps in the original timestamp, I fill the rows with NaNs (except in the timestamp column).
Using a simple example, here is the code:
Fs=1:10; % uniform timestamp
Fs=Fs';
Ts=[1 3 7 2 1; 3 4 3 3 2;6 5 2 1 3]; % Ts is the original data with the timestamp in the first column
Corrected=zeros(length(Fs),6); % Corrected is the data after applying the uniform timestamp
for ii=1:length(Fs)
    for jj=1:length(Ts(:,1))
        if Fs(ii)==Ts(jj,1)
            Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
            break
        else
            Corrected(ii,:)=[Fs(ii), NaN(1,5)];
        end
    end
end
The code works well enough but when I apply it to the 10Hz data, it is extremely slow. Any ideas on how I can improve this code?
Note that in the actual code I compare date strings:
if sum(Fs_str(ii,:)==Ts_str(jj,:))==23
Corrected(ii,:)=[Fs(ii) Ts(jj,:)];
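A minimal sketch of one way to drop the inner loop, using the toy example above. It assumes the timestamps can be compared as numbers; date strings would first have to be converted (for example with datenum) and rounded to the sample resolution. The two-output form of ismember reports, for each uniform timestamp, whether it occurs in the data and in which row.

```matlab
Fs = (1:10)';                               % uniform timestamp
Ts = [1 3 7 2 1; 3 4 3 3 2; 6 5 2 1 3];     % data, timestamp in column 1

% found(k) is true when Fs(k) occurs in Ts(:,1); loc(k) is the matching row
[found, loc] = ismember(Fs, Ts(:,1));

% start with NaN everywhere except the uniform timestamp column
Corrected = [Fs, NaN(length(Fs), size(Ts, 2))];

% copy over the rows that exist in the original data
Corrected(found, 2:end) = Ts(loc(found), :);
```

For this example the result has the same layout as the loop version: the uniform timestamp in column 1, followed by the original row or NaNs.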

Related

More efficient way to search a large matrix in MATLAB?

I have code that does what I want, but it is too slow: I have a very large MAT-file containing a 33-gigabyte matrix that I need to search for particular values and extract them.
The file that I'm searching has the following structure:
reporter sector partner year ave
USA all CAN 2007 0.060026126
USA all CAN 2011 0.0637898418
...
This goes on for millions of rows. I want to extract the last (5th) column value for particular reporter and partner values (sector and year are fixed). In actuality there are more fixed values that I have taken out for the sake of simplicity but this might slow down my code even more. The country_codes and partner values need to vary and are looped for that reason.
The crucial part of my code is the following:
for i = 1:length(country_codes)
    for g = 1:length(partner)
        matrix(i,g) = big_file(...
            ismember(GTAP_data(:,1), country_codes(i)) & ... % reporter
            ismember(GTAP_data(:,2), 'all') & ... % sector
            ismember(GTAP_data(:,3), partner(g)) & ... % partner
            ismember([GTAP_data{:,4}]', 2011) ... % year
            ,5); % ave column
    end
end
In other words, the code goes through the million rows and finds just the right value by applying ismember with logical & on everything.
Is there a faster way to do this than using ismember? Can someone assist?
So what I see is you build a big table out of the data in different files.
It seems your values are text-based. That takes up more memory: "USA" alone takes six bytes as character data (MATLAB stores each character in two bytes), plus overhead when it sits in a cell array. If you have no more than 255 countries to consider, you could store each one as a single byte in uint8 format.
If you can store all columns as a value between 0 and 255 you can make a uint8 matrix that can be indexed very fast.
As an example:
%demo
GTAP_regions={'USA','NL','USA','USA','NL','GB','NL','USA','Korea Republic of','GB','NL','USA','Korea Republic of'};
S=whos('GTAP_regions');
S.bytes
GTAP_regions requires 1580 bytes. Now we convert it.
GTAP_regions_list=GTAP_regions(1);
GTAP_regions_uint=uint8(1);
for ct = 2:length(GTAP_regions)
    I=ismember(GTAP_regions_list,GTAP_regions(ct));
    if ~any(I)
        % new region: add it to the list and store its new index
        GTAP_regions_list(end+1)=GTAP_regions(ct);
        GTAP_regions_uint(end+1)=uint8(numel(GTAP_regions_list));
    else
        GTAP_regions_uint(end+1)=uint8(find(I));
    end
end
S=whos('GTAP_regions_list');
S.bytes
S=whos('GTAP_regions_uint');
S.bytes
GTAP_regions_uint is what we use for indexing; it holds one code per entry, so it is now only 13 bytes and will be very fast to analyse.
GTAP_regions_list, which tells us which index value belongs to which country, is only 496 bytes.
You can also do this for sector, partner and year, depending on the range of years. If it is no more than 255 different years it will work. Otherwise you could store it as uint16 and have 65535 possible options.
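The same integer coding can also be sketched without a loop, using the third output of unique. The names region_list, codes, usa_code and is_usa are illustrative only. Note that unique returns the names sorted, so the code numbers differ from those produced by the loop above.

```matlab
GTAP_regions = {'USA','NL','USA','USA','NL','GB','NL','USA', ...
                'Korea Republic of','GB','NL','USA','Korea Republic of'};

% region_list: sorted unique names; codes(k) is the position of
% GTAP_regions{k} within region_list
[region_list, ~, codes] = unique(GTAP_regions);
codes = uint8(codes(:));          % fits as long as there are at most 255 names

% look up the code for one country once, then compare integers only
usa_code = find(strcmp(region_list, 'USA'));
is_usa = (codes == usa_code);     % logical mask over all rows
```

Masks built this way for reporter, sector, partner and year can be combined with & in the same way as the ismember calls in the question, but each comparison works on one byte per row instead of on strings.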

Plotting multiple datasets in MATLAB

I have voltage and current signals from multiple days. The time vector is in seconds of the day (SOD), and the voltage and current vectors are in volts and amps respectively. However, the vector data from each day has a different length. For example, Monday's data might be 1x100000 for both time and voltage/current, and Tuesday's might be 1x50000 for both time and voltage/current. I was asked to plot the different days of data on the same figure for comparison purposes. I have tried using the plot(x1,y1,x2,y2) method but that obviously didn't work due to different vector lengths. I tried interpolating to the larger data set, but then realized that I will get all NaNs on the result since there is no overlap in time. I ran out of ideas and am desperately in need of help.
EDIT:
I guess I forgot to mention that somehow I would like to overlay them one on top of the other in the same figure and not using a subplot.
It sounds like you want a data vector of length n to span, I'm guessing, 24 hours = 86400 seconds, for any n (e.g. n=100000 or n=50000). Assuming the original data is uniformly sampled, this should do the trick:
x1=linspace(0,86400,length(x1));
x2=linspace(0,86400,length(x2));
plot(x1,y1,'r-',x2,y2,'b-');
If it is not uniformly sampled, we can still make it work:
t1=linspace(0,86400,length(x1));
t2=linspace(0,86400,length(x2));
newy1 = spline(x1,y1,t1);
newy2 = spline(x2,y2,t2);
plot(t1,newy1,'r-',t2,newy2,'b-');

Timestamp Processing Brain Teaser

I am processing 1Hz timestamps (variable 'timestamp_1hz') from a logger which doesn't log exactly at the same time every second (the difference varies from 0.984 to 1.094, but sometimes 0.5 or several seconds if the logger burps). The 1Hz dataset is used to build a 10 minute averaged dataset, and each 10 minute interval must have 600 records. Because the logger doesn't log exactly at the same time every second, the timestamp slowly drifts through the 1 second mark. Issues come up when the timestamps cross the 0 mark, as well as the 0.5 mark.
I have tried various ways to pre-process the timestamps. The timestamps with around 1 second between them should be considered valid. A few examples include:
% simple
% this screws up around half second and full second values
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
rawsecondstest = rawseconds;
rawsecondstest(:,1) = floor(rawseconds(:,1))+ rawseconds(1,1);
% more complicated
% this screws up if there is missing data, then the issue compounds because k+1 timestamp is dependent on k timestamp
rawseconds = raw_1hz_new(:,6)+(raw_1hz_new(:,7)./1000);
A = diff(rawseconds);
numcheck = rawseconds(1,1);
integ = floor(numcheck);
fract = numcheck-integ;
if fract>0.5
    rawseconds(1,1) = rawseconds(1,1)-0.5;
end
for k=2:length(rawseconds)
    rawsecondstest(k,1) = rawsecondstest(k-1,1)+round(A(k-1,1));
end
I would like to pre-process the timestamps then compare it to a contiguous 1Hz timestamp using 'intersect' in order to find the missing, repeating, etc data such as this:
% pull out the time stamp (round to 1hz and convert to serial number)
timestamp_1hz=round((datenum(raw_1hz_new(:,[1:6])))*86400)/86400;
% calculate new start time and end time to find contig time
starttime=min(timestamp_1hz);
endtime=max(timestamp_1hz);
% determine the contig time
contigtime=round([floor(mean([starttime endtime])):1/86400:ceil(mean([starttime endtime]))-1/86400]'*86400)/86400;
% find indices where logger time stamp matches real time and puts
% the indices of a and b
clear Ia Ib Ic Id
[~,Ia,Ib]=intersect(timestamp_1hz,contigtime);
% find indices where there is a value in real time that is not in
% logger time
[~,Ic] = setdiff(contigtime,timestamp_1hz);
% finds the indices that are unique
[~,Id] = unique(timestamp_1hz);
You can download 10 days of the raw_1hz_new timestamps here. Any help or tips would be much appreciated!
The problem you have is that you can't simply match these stamps up to a list of times, because you could be expecting a set of datapoints at seconds = 1000, 1001, 1002, but if there was an earlier blip you could have entirely legitimate data at 1000.5, 1001.5, 1002.5 instead.
If all you want is a list of valid times/their location in your series, why not just something like (times in seconds):
A = diff(times); % difference between times
n = find(abs(A-1)<0.1) % change 0.1 to whatever your tolerance is
times2 = times(n+1);
times2 should then be a list of all your timestamps where the previous timestamp was approximately 1 second ago - works on a small set of fake data I constructed, didn't try it on yours. (For future reference: it would be more help to provide a small subset of your data, e.g. just a few minutes worth, that you know contains a blip).
I would then take the list of valid timestamps and split it up into 10 minute sections for averaging, counting how many valid timestamps were obtained in each section. If it's working, you should end up with no more than 600 - but not much less if the blips are occasional.
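The splitting and counting step could be sketched as follows. The sample data and the names t0, section, counts and complete are hypothetical, and the sketch assumes the times are in seconds and that the first 10 minute section starts at t0.

```matlab
% hypothetical sample: valid timestamps in seconds, with one 5 s gap
times2 = [0:1:299, 305:1:1199]' + 0.02;
t0 = 0;                            % assumed start of the first section

% section number of each timestamp (1 for the first 600 s, 2 for the next, ...)
section = floor((times2 - t0) / 600) + 1;

% number of valid timestamps that landed in each section
counts = accumarray(section, 1);

% sections that contain the full 600 records
complete = (counts == 600);
```

Here the first section ends up with 595 records because of the gap, and the second one is complete.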

Matlab: Tempo-Alignment according to Timestamps

Maybe this is simple, but I'm new to MATLAB and not good with timestamp issues in general. Sorry!
I have two different cameras each contains timestamps of frames. I read them to two arrays TimestampsCam1 and TimestampsCam2:
TimestampsCam1 contains 1500 records and the timestamps are in Microseconds as follows:
1 20931160389
2 20931180407
3 20931200603
4 20931220273
5 20931240360 ...
and TimestampsCam2 contains 1000 records and the timestamps are in Milliseconds as follows:
1 28275280
2 28315443
3 28355607
4 28395771
5 28435935 ...
The first camera starts capturing first and ends a bit later than the second camera. So what I need to do is to know exactly where a frame from first camera is captured at the same time (or nearly the same time) by the other camera. In other words, I want to align the two arrays(cameras) in time according to the timestamps. I want to get at the end two arrays of same size where each record is tempo-aligned to the corresponding record in the other array.
Many thanks to all!
Sam
Make sure they are in the same unit of measurement, e.g. microseconds
Create an index which contains all values, except duplicates, suppose this one is 2400 records long
Create two NaN vectors of length 2400, one per camera, and put a value (for example the frame number) at each position where the index matches one of that camera's timestamps
Now you have two aligned vectors with NaNs to pad them where required.
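The steps above can be sketched like this. The short timestamp vectors are made up for illustration and are assumed to be in the same unit already; the names allTimes, frames1 and frames2 are hypothetical.

```matlab
% hypothetical short examples, both in the same unit
cam1 = [100; 120; 140; 160; 180];   % camera 1 timestamps
cam2 = [120; 160; 200];             % camera 2 timestamps

% index of all timestamps with duplicates removed (union also sorts them)
allTimes = union(cam1, cam2);

% NaN vectors, filled with frame numbers where a timestamp matches
frames1 = NaN(size(allTimes));
frames2 = NaN(size(allTimes));
[~, pos1] = ismember(cam1, allTimes);
[~, pos2] = ismember(cam2, allTimes);
frames1(pos1) = (1:numel(cam1))';
frames2(pos2) = (1:numel(cam2))';
```

This only pairs frames whose timestamps are exactly equal. For "nearly the same time", the timestamps would first have to be rounded to a common resolution, which is a separate choice not covered here.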

Comparing dates and filling in gap times in matlab

I have a data file which contains time data. The list is quite long, 100,000+ points. There is data every 0.1 seconds, and the time stamps are so:
'2010-10-10 12:34:56'
'2010-10-10 12:34:56.1'
'2010-10-10 12:34:56.2'
'2010-10-10 12:34:56.3'
etc.
Not every 0.1 second interval is necessarily present. I need to check whether a 0.1 second interval is missing, then insert this missing time into the date vector. Comparing strings seems unnecessarily complicated. I tried comparing seconds since midnight:
date_nums=datevec(time_stamps);
secs_since_midnight=date_nums(:,4)*3600+date_nums(:,5)*60+date_nums(:,6);
comparison_secs=linspace(0,86400,864000);
res=(ismember(comparison_secs,secs_since_midnight)~=1);
However this approach doesn't work due to rounding errors. Both the seconds since midnight and the linspace of the seconds to compare it to never quite equal up (due to the tenth of a second resolution?). The intent is to later do an fft on the data associated with the time stamps, so I want as much uniform data as possible (the data associated with the missing intervals will be interpolated). I've considered blocking it into smaller chunks of time and just checking the small chunks one at a time, but I don't know if that's the best way to go about it. Thanks!
Multiply your numbers-of-seconds by 10 and round to the nearest integer before comparing against your range.
There may be more efficient ways to do this than ismember. (I don't know offhand how clever the implementation of ismember is, but if it's The Simplest Thing That Could Possibly Work then you'll be taking O(N^2) time that way.) For instance, you could use the timestamps that are actually present (as integer numbers of 0.1-second intervals) as indices into an array.
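A minimal sketch of both suggestions, rounding to integer tenths of a second and using those integers as indices into an array. The sample data and the names ticks, n_ticks, present and missing_secs are hypothetical.

```matlab
% hypothetical sample: seconds since midnight at 0.1 s resolution,
% with the records at 0.3 s and 0.7 s missing
secs_since_midnight = [0 0.1 0.2 0.4 0.5 0.6 0.8 0.9]';

% integer number of 0.1 s ticks since midnight
ticks = round(secs_since_midnight * 10);

% mark the ticks that are present; +1 because indices start at 1
n_ticks = 10;                       % would be 864000 for a whole day
present = false(n_ticks, 1);
present(ticks + 1) = true;

% missing times, converted back to seconds
missing_secs = (find(~present) - 1) / 10;
```

Because the comparison happens on integers, the floating-point mismatch described in the question does not arise, and the lookup is a single indexing operation rather than a search.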
Since you're concerned with missing data records and not other timing issues such as a drifting time channel, you could check for missing records by converting the time values to seconds, doing a DIFF and finding those first differences that are greater than some tolerance. This would tell you the indices where the missing records should go. It's then up to you to do something about this. Remember, if you're going to use this list of indices to fill the gaps, process the list in descending index order since inserting the records will cause the index list to be unsynchronized with the data.
>> time_stamps = now:.1/86400:now+1; % Generate test data.
>> time_stamps(randi(length(time_stamps), 10, 1)) = []; % Remove 10 random records.
>> t = datenum(time_stamps); % Convert to date numbers.
>> t = 86400 * t; % Convert to seconds.
>> index = find(diff(t) > 1.999 * 0.1)' + 1 % Find missing records.
index =
30855
147905
338883
566331
566557
586423
642062
654682
733641
806963
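The advice about filling gaps in descending index order can be sketched as follows. The sample data is made up, and the sketch assumes each gap is exactly one missing record; longer gaps would need several inserted rows per index.

```matlab
% hypothetical sample: 0.1 s data with two single records missing
t    = [0 0.1 0.3 0.4 0.5 0.7]';     % seconds; 0.2 and 0.6 are missing
data = [10 11 13 14 15 17]';

% indices of the records that directly follow a gap
index = find(diff(t) > 1.999 * 0.1) + 1;

% insert from the last gap to the first so the earlier indices stay valid
for k = sort(index, 'descend')'
    t    = [t(1:k-1);    t(k) - 0.1; t(k:end)];
    data = [data(1:k-1); NaN;        data(k:end)];
end
```

After the loop, t is a uniform 0.1 s series and data holds NaN at the inserted positions, ready to be interpolated.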