I'm trying to convert a character vector (200,000 rows) into Matlab serial numbers.
The format is '01/07/2015 00:00:59'.
This takes an incredibly long time, and online I can only find tips for solving this in Matlab. Any ideas how I can improve this?
You can use the datenum(datevector) form of input for datenum.
It is much faster than string parsing. I use this trick whenever I have to import long date/time data (which is nearly every day).
It consists of passing an m-by-6 (or m-by-3) matrix whose columns hold the values [yyyy mm dd HH MM SS]. The matrix should be of type double.
In other words, instead of letting Matlab/Octave do the parsing, you read all the numbers in the string with your favourite tool (textscan, fscanf, sscanf, ...), then pass those numbers to datenum instead of the string.
In the example below I generated a long char array (86401x19) of date strings as sample data:
>> strDate(1:5,:)
ans =
31/07/2015 15:10:13
31/07/2015 15:10:14
31/07/2015 15:10:15
31/07/2015 15:10:16
31/07/2015 15:10:17
To convert that to datenum faster than by the conventional way, I use:
strDate = [strDate repmat(' ',size(strDate,1),1)] ; %// add a whitespace at the end of each line
M = textscan( strDate.' , '%f/%f/%f %f:%f:%f' ) ; %'// read each value independently
M = cell2mat(M) ; %// convert to matrix
M = M(:,[3 2 1 4 5 6]) ; %// reorder columns
dt = datenum(M ) ; %// convert to serial date
This speeds things up in Matlab, and I am pretty sure it should improve things in Octave too. To quantify that, at least in Matlab, here's a quick benchmark:
function test_datenum
d0 = now ;
d = (d0:1/3600/24:d0+1).' ; %// 1 day worth of date (one per second)
strDate = datestr(d,'dd/mm/yyyy HH:MM:SS') ; %'// generate the string array
fprintf('Time with automatic date parsing: %f\n' , timeit(@() datenum_auto(strDate)) )
fprintf('Time with customized date parsing: %f\n', timeit(@() datenum_preparsed(strDate)) )
function dt = datenum_auto(strDate)
dt = datenum(strDate,'dd/mm/yyyy HH:MM:SS') ; %// let Matlab/Octave do the parsing
function dt = datenum_preparsed(strDate)
strDate = [strDate repmat(' ',size(strDate,1),1)] ; %// add a whitespace at the end of each line
M = textscan( strDate.' , '%f/%f/%f %f:%f:%f' ) ; %'// read each value independently
M = cell2mat(M) ; %// convert to matrix
M = M(:,[3 2 1 4 5 6]) ; %// reorder columns
dt = datenum(M ) ; %// convert to serial date
On my machine, it yields:
>> test_datenum
Time with automatic date parsing: 0.614698
Time with customized date parsing: 0.073633
Of course you could also compact the code in a couple of lines:
M = cell2mat(textscan([strDate repmat(' ',size(strDate,1),1)].','%f/%f/%f %f:%f:%f')) ;
dt = datenum( M(:,[3 2 1 4 5 6]) ) ;
But I tested it and the improvement is so marginal that it is not really worth the loss of readability.
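The answer lists sscanf among the possible parsers but only shows textscan. As a rough sketch of the same idea with sscanf (the two sample strings and the variable names padded and vals are made up for illustration, and this variant was not part of the benchmark above):

```matlab
% Two sample date strings in the dd/mm/yyyy HH:MM:SS layout
strDate = ['31/07/2015 15:10:13'; '01/07/2015 00:00:59'];

% Append a blank to each row, then flatten row by row into one long char vector
padded = [strDate repmat(' ', size(strDate,1), 1)];
flat   = reshape(padded.', 1, []);

% Read six integers per date; sscanf reapplies the format until the text is exhausted
vals = sscanf(flat, '%d/%d/%d %d:%d:%d', [6 Inf]).';   % one row per date: [dd mm yyyy HH MM SS]

% Reorder to [yyyy mm dd HH MM SS] and convert to serial date numbers
dt = datenum(vals(:, [3 2 1 4 5 6]));
```

Whether this beats textscan on a given machine would have to be timed; the point is only that any parser producing the numeric columns works.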
I am facing an issue with counting number of occurrences by date, suppose I have an excel file where the data is as follows:
1/1/2001 23
1/1/2001 29
1/1/2001 24
3/1/2001 22
3/1/2001 23
My desired output is:
1/1/2001 3
2/1/2001 0
3/1/2001 2
Though 2/1/2001 doesn't appear in the input, I want it included in the output with a count of 0. This is my current code:
[Value, Time] = xlsread('F:\1km\fire\2001- 02\2001_02.xlsx','Sheet1','A2:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
% a=(datestr(tm));
T1=datetime('9/23/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1));
% idx=idx';
out = tm_all(idx);
The ismember function does not seem to work, because the length of tm_all is 274 and the size of data is 158x2
I suggest using datetime instead of datenum to convert your date strings into a serial representation. This makes the whole computation (and more besides) much easier:
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','dd/MM/yyyy');
Once you have obtained your datetime vector, the calculation can be achieved as follows:
% Create a sequence of datetimes from the first date to the last date...
T = (min(tm):max(tm)).';
% Find the position of every observed date within the sequence...
[~,idx] = ismember(tm,T);
% Count the occurrences of every date in the sequence...
C = accumarray(idx,1);
% Put unique dates and occurrences together into a single variable...
res = table(T,C)
Here is the output:
res =
T C
___________ _
01-Jan-2001 3
02-Jan-2001 0
03-Jan-2001 2
For more information about the functions used within the computation:
accumarray function
ismember function
On a side note, I couldn't tell whether your dates are in dd/MM/yyyy or in MM/dd/yyyy format. With the latter, my approach cannot produce that output; instead you would need to extract the month of each date and group your data by month (and by year as well, if your dates span beyond 2001):
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','MM/dd/yyyy');
M = month(tm);
M_seq = (min(M):max(M)).';
[~,idx] = ismember(M,M_seq);
C = accumarray(idx,1);
res = table(datetime(2001,M_seq,1),C)
res =
Var1 C
___________ _
01-Jan-2001 3
01-Feb-2001 0
01-Mar-2001 2
I'll first give the code and then explain step by step.
code:
[Value, Time] = xlsread('stack','A1:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
a=(datestr(tm));
T1=datetime('1/1/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1)); % get indices (0 where a date is absent from the file)
[occurence,dates]= hist(data(:,1),unique(data(:,1))); % count occurrences of dates from file
t = [0;data(:,1)]; % prepend 0 to dates (needed later because MATLAB indexing starts at 1)
[~,idx] = ismember(t(idx+1),dates); % get indices into dates
q = [0 occurence]; % prepend 0 to occurence (needed later because MATLAB indexing starts at 1)
occ = q(idx+1); % make vector with occurrences
out = [tm_all' occ']; % output
The idx output of ismember is a 1xlength(tm_all) vector that at position i contains the lowest index at which tm_all(i) is found in data(:,1). Take for example A = [1 2 3 4] and B = [1 1 2 4]; then for [~,idx] = ismember(A,B) the result will be
idx = [1 3 0 4]
because A(1) = 1 and the first 1 in B is found at position 1. If a number in A doesn't occur in B, the corresponding entry is 0.
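The small example above can be run directly. This is just the A and B from the explanation, with the first output (here called tf, a name chosen for illustration) kept as well:

```matlab
A = [1 2 3 4];
B = [1 1 2 4];

% tf(i) is true when A(i) occurs in B; idx(i) is the lowest matching index in B, or 0
[tf, idx] = ismember(A, B);

disp(tf)    % membership flags
disp(idx)   % positions in B
```

The zero entries are what the rest of the answer has to work around, since 0 is not a valid MATLAB index.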
[occurence,dates]= hist(data(:,1),unique(data(:,1))); gives the number of occurrences for each date.
t = [0;data(:,1)]; adds a zero at the beginning, so t looks like:
0
'date 1'
'date 2'
'date 3'
'date 4'
...
Why this is done, will be explained next.
t(idx+1) is a vector of size 1xlength(tm_all), and is more or less a copy of tm_all except that a date which doesn't occur in the file is replaced by zero. How does this work? t(i) gives you the value of t at position i, so t([1 5 4 2 9]) is a vector with the values of t at positions 1, 5, 4, 2 and 9. Remember that idx is the vector that contains the indices of the dates in data(:,1), with 0 for dates that are absent. Because Matlab indexing starts at 1, index 0 is not valid, so idx+1 is needed. The dates in data(:,1) must then be shifted by one position as well. That's done by adding the zero at the beginning.
[~,idx] = ismember(t(idx+1),dates); is the same as before, but idx now contains the indices into dates.
q = [0 occurence]; again adds a zero at the beginning, and occ = q(idx+1); is the row of occurrences of the dates.
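To see the zero-padding trick in isolation, here is a tiny sketch using the counts from the question (three rows on 1/1/2001, none on 2/1/2001, two on 3/1/2001). The values of occurence and idx are typed in by hand here rather than computed from a file:

```matlab
occurence = [3 2];      % counts for the two dates that appear in the file
idx       = [1 0 2];    % for each calendar day: position in "dates", 0 if the day is absent

q   = [0 occurence];    % slot 1 now holds the count for "absent"
occ = q(idx + 1);       % shift by one so that idx == 0 picks that slot

disp(occ)
```

Shifting by one turns the "not found" marker into a valid index that points at a zero count.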
I am writing a mesh smoother in Matlab for a particular CFD code. The requirements for the mesh data file are very strict, and so the number format must be very specific. The definition for one quadrilateral element is as follows (Where the 2nd row is the x coordinates and 3rd row is y coordinates).
ELEMENT 14 [ 1Q] GROUP 0
5.000000 10.00000 10.00000 5.000000
2.000000 2.000000 4.500000 4.500000
As you can see, each number takes up exactly 8 characters. Once the whole mesh has been passed through my smoother, I need to write the numbers back to a file. The closest I've gotten to this number format is with the following operator:
%#7.7g
I don't think I can use %f, as this specifies the number of digits after the decimal point, which in my case varies (I have coordinates that go from less than one to over 100). The only issue with this operator is that for numbers less than one it still retains 7 significant figures, so the number ends up being 9 characters long; for example:
0.9313373
Does anyone know the correct operator, or have a workaround? Much appreciated.
Single format spec:
If you can live with only 4 digit precision (decimal part after the .) out of your 8 characters, and if the mesh reading program can handle padding 0 (i.e. if it can read values like 007.2365 properly), then the simplest and quickest option is to use only the format specifier.
For example, let's define an input a with different order of magnitude:
a=[ 1 234 7 ;
12 2 0.123456789 ;...
5 2.36 0.0024 ] ;
Your highest number is 234, so that leaves 4 digits for the fractional part. If you accept that precision for all the numbers in your Matrix, then you're all set with a simple instruction:
fmt='%08.4f %08.4f %08.4f\n'; %// hard coded format specifier
sprintf(fmt,a.') %'// we transpose a to keep the same shape in the output since sprintf is column major.
ans =
001.0000 234.0000 007.0000
012.0000 002.0000 000.1235
005.0000 002.3600 000.0024
If you don't know in advance what will be the maximum order of magnitude of your data you can set the format specifier programmatically:
nDigitTotal = 8 ;
nmax = floor( log10(max(abs(a(:)))) ) + 1 ; %// number of digits before the decimal point of the largest number in "a"
f = sprintf('%%0%d.%df',nDigitTotal,nDigitTotal-nmax-1) ; %// format spec for 1 number
fmt = [f '\t' f '\t' f '\n'] ; %// format spec for a line of 3 numbers
s = sprintf(fmt,a.')
This gives the same result as above (with tabs as separators). Add a check to make sure there are no extreme values in your matrix that would eat up your precision.
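The check mentioned above is not spelled out in the answer. One possible sketch, where minDecimals is a hypothetical requirement invented for this example, is to compute how many decimals the largest value leaves and stop if that is too few:

```matlab
a = [ 1 234 7 ; 12 2 0.123456789 ; 5 2.36 0.0024 ];

nDigitTotal = 8;   % field width required by the file format
minDecimals = 4;   % hypothetical: fewest decimals you are willing to accept

% Digits before the decimal point of the largest magnitude (at least one)
nInt = max(floor(log10(max(abs(a(:))))) + 1, 1);

% Decimals left once the integer digits and the decimal point are accounted for
nDec = nDigitTotal - nInt - 1;

if nDec < minDecimals
    error('Largest value leaves only %d decimals; at least %d are required.', nDec, minDecimals);
end

f = sprintf('%%0%d.%df', nDigitTotal, nDec);   % format spec for one number
```

This assumes non-negative, finite data with at least one nonzero entry; a minus sign would need one more character.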
Individual format spec:
Lastly, if that precision and/or the leading zeros do not work for you, you can resort to a more elaborate solution. I quite like the idea from excaza of setting a mask to specify the precision for each number. I'll produce my own version, very slightly different, which accounts for numbers of any magnitude and allows array output. However, if you end up using this solution, give credit to excaza since he was the inspiration for this evolution:
a = a.' ; %'// transpose up front so sprintf (column major) prints row by row
nDigitTotal = 8; %// Total number of characters
mask = nDigitTotal(ones(size(a))) ; %// Create mask
nOrder = floor( log10(a) ) ; %// find order of magnitude of each element in the matrix
mask = mask - nOrder.*(nOrder>0) -2 ; %// decimals per number: total width minus integer digits minus the decimal point
temp = [mask(:)';a(:)']; %// Stack the vectors and weave them
f = '%8.*f' ; %// basic format spec
fmt = [f '\t' f '\t' f '\n'] ; %// build your line
sprintf(fmt,temp) %// go for it
will give you:
ans =
1.000000 234.0000 7.000000
12.00000 2.000000 0.123457
5.000000 2.360000 0.002400
note: replace the tabulation ('\t') with normal whitespace (' ') in the format specifier separator depending on what your meshing software is expecting.
This is the only workaround I could think of:
A = [1.12341234 .12341234 20.12341234 5 10];
% Create a precision mask
maxwidth = 8; % Max width, characters
mask = maxwidth(ones(size(A))) - 1; % Initialize mask, account for decimal
mask(A < 1) = maxwidth - 2;
% Stack the vectors and weave them
temp = [mask(:)';A(:)'];
temp = temp(:);
test = sprintf('%#7.*g ', temp);
Which returns:
test =
1.123412 0.123412 20.12341 5.000000 10.00000
It's an annoying extra step, but we can utilize sprintf's ability to take an asterisk in order to refer to an argument in the input list. Because of how my sprintf call and test case are set up, I wove the mask and data together so sprintf sees the precision specifier and data alternating. The temp(:) call isn't necessary; if you pass the original temp matrix to sprintf it will do the same thing, since it reads the data column-major. I added it so the behavior is more explicit.
How to formulate the sprintf call for your actual printing routine will depend on what you're doing, but this should at least help you on your way.
Edit1: To expand, what the above is doing is equivalent to:
a = sprintf('%#7.*g ', temp(1), temp(2));
b = sprintf('%#7.*g ', temp(3), temp(4));
c = sprintf('%#7.*g ', temp(5), temp(6));
d = sprintf('%#7.*g ', temp(7), temp(8));
e = sprintf('%#7.*g ', temp(9), temp(10));
test = [a b c d e];
Edit2: Updated to account for integer values
Edit3: Note that this currently will only work for positive numbers
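A quick way to convince yourself that the mask does its job is to format each value separately and measure the resulting width. This is only a sanity-check sketch around the same test case; tokens and widths are names made up here:

```matlab
A = [1.12341234 .12341234 20.12341234 5 10];

maxwidth = 8;                              % required field width
mask = maxwidth(ones(size(A))) - 1;        % significant digits for values >= 1
mask(A < 1) = maxwidth - 2;                % one fewer when the number starts with "0."

% Format every value on its own, passing the precision through the asterisk
tokens = arrayfun(@(p, v) sprintf('%#.*g', p, v), mask, A, 'UniformOutput', false);
widths = cellfun(@numel, tokens);

disp(tokens)
disp(widths)
```

For this test case every field comes out 8 characters wide. Values below 0.1 gain extra leading zeros after the decimal point and would come out longer, so the mask would need a further adjustment for those.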
I have a matrix in MATLAB of 50572x4 doubles. The last column has datenum format dates, increasing values from 7.3025e+05 to 7.3139e+05. The question is:
How can I split this matrix into sub-matrices, each that cover intervals of 30 days?
If I'm not being clear enough… the difference between the first element in the 4th column and the last element in the 4th column is 7.3139e5 − 7.3025e5 = 1.1376e3, or 1137.6. I would like to partition this into 30 day segments, and get a bunch of matrices that have a range of 30 for the 4th columns. I'm not quite sure how to go about doing this...I'm quite new to MATLAB, but the dataset I'm working with has only this representation, necessitating such an action.
Note that a unit interval between datenum timestamps represents 1 day, so your data in fact covers a time period of 1137.6 days. The straightforward approach is to compare each timestamp with the interval edges in order to determine which 30-day interval it belongs to:
t = A(:, end) - min(A(:, end)); %// Normalize timestamps to start from 0
idx = sum(bsxfun(@lt, t, 30:30:max(t))); %// Last row index of each interval
rows = diff([0, idx, numel(t)]); %// Number of rows in each interval
Here A is your data matrix, whose last column is assumed to contain the (sorted) timestamps, and rows stores the number of rows in each 30-day interval. Finally, you can employ cell arrays to split the original data matrix:
C = mat2cell(A, rows, size(A, 2)); %// Split matrix into intervals
C = C(~cellfun('isempty', C)); %// Remove empty matrices
Hope it helps!
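As a small worked example of the steps above, here is the same code run on a made-up six-row matrix (the values in A are invented; the last column holds sorted datenum-like timestamps):

```matlab
% Hypothetical data: first column is a payload, last column is the timestamp in days
A = [ 1 730250 ;
      2 730260 ;
      3 730279 ;
      4 730280 ;
      5 730295 ;
      6 730345 ];

t    = A(:, end) - min(A(:, end));            % days since the first sample: 0 10 29 30 45 95
idx  = sum(bsxfun(@lt, t, 30:30:max(t)));     % rows lying before each 30-day edge
rows = diff([0, idx, numel(t)]);              % number of rows in each interval

C = mat2cell(A, rows, size(A, 2));            % split into one cell per interval
C = C(~cellfun('isempty', C));                % drop intervals that contain no rows

nRowsPerCell = cellfun(@(c) size(c, 1), C).';
disp(nRowsPerCell)
```

The interval from day 60 to day 90 contains no samples here, which is why the empty cells are removed at the end.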
Well, all you need is to find the edge times and the matrix indexes in between them. If your numbers are in datenum format, one unit is the same as one day, which means that we can step in increments of 30 units until we get as close as we can to the end, as follows:
startTime = originalMatrix(1,4);
endTime = originalMatrix(end,4);
edgeTimes = startTime+30:30:endTime; % upper edge of each complete 30-day cycle
% And then loop through the edges, collecting the samples that complete each cycle:
nEdges = numel(edgeTimes);
totalMeasures = size(originalMatrix,1);
subMatrixes = cell(1,nEdges);
prevEdgeIdx = 0;
for curEdgeIdx = 1:nEdges
nearIdx=getNearestIdx(originalMatrix(:,4),edgeTimes(curEdgeIdx));
if originalMatrix(nearIdx,4)>edgeTimes(curEdgeIdx)
nearIdx = nearIdx-1; % step back so the row does not lie beyond the edge
end
if nearIdx>0 && nearIdx<=totalMeasures
subMatrixes{curEdgeIdx} = originalMatrix(prevEdgeIdx+1:nearIdx,:);
prevEdgeIdx=nearIdx; % the next cycle starts after this row
else
error('For some reason the edge was not inbound.');
end
end
% Now we check for the remaining days after the last edge, which do not complete a 30 day cycle:
if prevEdgeIdx<totalMeasures
subMatrixes{end+1} = originalMatrix(prevEdgeIdx+1:end,:);
end
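The function getNearestIdx was discussed here. It estimates the index of the sample nearest to the given point by linear interpolation between the first and last values, without checking all possible points; this assumes the samples are roughly evenly spaced in time.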
function vIdx = getNearestIdx(values,point)
if isempty(values) || ~numel(values)
vIdx = [];
return
end
vIdx = 1+round((point-values(1))*(numel(values)-1)...
/(values(end)-values(1)));
if vIdx < 1, vIdx = []; end
if vIdx > numel(values), vIdx = []; end
end
Note: This is pseudocode and may contain errors. Please adapt it to your problem.
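If the samples are not evenly spaced, the interpolated index can be off by more than one row. A different technique that does not depend on the spacing is to assign every row a 30-day bin number directly. The following is only a sketch on invented data; M, bin and sub are names chosen for this example:

```matlab
% Hypothetical 6x4 matrix; the 4th column holds sorted datenum-like timestamps
M = [ (1:6).' zeros(6,2) [730250; 730251.5; 730279.9; 730280; 730300; 730345] ];

% Bin number of each row: 1 for days [0,30), 2 for [30,60), and so on
bin = floor((M(:,4) - M(1,4)) / 30) + 1;

% Collect the rows of every bin that actually contains data
bins = unique(bin);
sub  = cell(numel(bins), 1);
for k = 1:numel(bins)
    sub{k} = M(bin == bins(k), :);
end

disp(cellfun(@(c) size(c, 1), sub).')
```

Note that here a sample lying exactly on an edge starts the next interval, whereas the loop above keeps it in the interval that ends at that edge.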
so I have a matrix Data in this format:
Data = [Date Time Price]
Now what I want to do is plot the Price against the Time, but my data is very large and has lines where there are multiple Prices for the same Date/Time, e.g. 1st, 2nd lines
29 733575.459548611 40.0500000000000
29 733575.459548611 40.0600000000000
29 733575.459548612 40.1200000000000
29 733575.45954862 40.0500000000000
I want to take an average of the prices with the same Date/Time and get rid of any extra lines. My goal is to do linear interpolation on the values, which is why I must have exactly one Price value for each Time.
How can I do this? I did this (this reduces the matrix so that it only takes the first line for the lines with repeated date/times) but I don't know how to take the average
function [ C ] = test( DN )
[Qrows, cols] = size(DN);
C = DN(1,:);
for i = 1:(Qrows-1)
if DN(i,2) == DN(i+1,2)
%n = 1;
%while DN(i,2) == DN(i+n,2) && i+n<Qrows
% n = n + 1;
%end
% somehow take average;
else
C = [C;DN(i+1,:)];
end
end
[C,ia,ic] = unique(A,'rows') also returns index vectors ia and ic
such that C = A(ia,:) and A = C(ic,:)
If you pass as input A only the columns you do not want to average over (here: date and time), ic holds one value for every row of the input, and rows that you want to combine share the same value.
Getting from there to the means you want is probably more intuitive for MATLAB beginners with a for loop: using logical indexing, e.g. DN(ic==n,3), you get a vector of all the values you want to average (where n is the index of the date-time row they belong to). You need to do this for every distinct date-time combination.
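A minimal sketch of that loop, using the four sample rows from the question (the variable names uniqueRows and avgPrice are made up for this example):

```matlab
% Sample rows from the question: [Date Time Price]
DN = [ 29 733575.459548611 40.05 ;
       29 733575.459548611 40.06 ;
       29 733575.459548612 40.12 ;
       29 733575.45954862  40.05 ];

% One group number per input row, based on the date and time columns only
[uniqueRows, ~, ic] = unique(DN(:, 1:2), 'rows');

% Average the prices of each group with logical indexing
avgPrice = zeros(size(uniqueRows, 1), 1);
for n = 1:size(uniqueRows, 1)
    avgPrice(n) = mean(DN(ic == n, 3));
end

result = [uniqueRows avgPrice];
```

The first two rows share the same time stamp, so they collapse into a single row holding the mean of 40.05 and 40.06.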
A more vector-oriented way would be to use accumarray, which leads to a solution of your problem in two lines:
[DateAndTime,~,idx] = unique(DN(:,1:2),'rows');
Price = accumarray(idx,DN(:,3),[],@mean);
I'm not quite sure how you want the result to look, but [DateAndTime Price] gives you the three-column format of the input again.
Note that if your input contains something like:
1 0.1 23
1 0.2 47
1 0.1 42
1 0.1 23
then applying unique(...,'rows') to the whole input before the above lines will give a different result for 1 0.1 than using the above directly. The direct version calculates the mean of 23, 23 and 42, whereas with unique applied first one 23 is eliminated as a duplicate, so the differing row with 42 carries a greater weight in the average.
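The difference is easy to check numerically. A short sketch with the four rows from the note (direct and deduped are names invented for this example):

```matlab
DN = [ 1 0.1 23 ;
       1 0.2 47 ;
       1 0.1 42 ;
       1 0.1 23 ];

% Averaging directly: the duplicated 23 is counted twice
[~, ~, idx] = unique(DN(:, 1:2), 'rows');
direct = accumarray(idx, DN(:, 3), [], @mean);

% Removing fully identical rows first: the duplicated 23 is counted once
DNu = unique(DN, 'rows');
[~, ~, idxU] = unique(DNu(:, 1:2), 'rows');
deduped = accumarray(idxU, DNu(:, 3), [], @mean);

disp([direct deduped])
```

For the 1 0.1 group the direct mean is (23+23+42)/3, about 29.33, while the de-duplicated mean is (23+42)/2 = 32.5. Which one is right depends on whether repeated identical rows are genuine observations.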
Try the following:
[Qrows, cols] = size(DN);
% C is your result matrix
C = DN;
% this will give you the indexes where DN(i,2)==DN(i+1,2)
i = find(diff(DN(:,2))==0);
% replace C(i,:) with the average
C(i,:) = (DN(i,:)+DN(i+1,:))/2;
% delete the C(i+1,:) rows
C(i+1,:) = [];
Hope this works.
This should work if the repeated time values come in pairs (the average is calculated between rows i and i+1). If a time value repeats 3 or more times, these steps no longer give the correct mean and will need to be reworked.
Something like this would work, but I did not run the code so I can't promise there are no bugs.
newX = unique(DN(:,2));
newY = zeros(1,length(newX));
for ix = 1:length(newX)
% All rows that share the current unique time value
allOcurrences = find(DN(:,2)==newX(ix));
% If there are duplicates, take their mean
if numel(allOcurrences)>1
newY(ix) = mean(DN(allOcurrences,3));
else
% If not, use the only Y value
newY(ix) = DN(allOcurrences,3);
end
end
I am not very comfortable with using accumarray function in Matlab, though I have begun to appreciate its powers! I was wondering if I could input 2 cols in the VAL field of accumarray function. Please see -
sz = 3 ; % num_rows for each ID
mat1 = [1 20 ; 1 40 ; 1 50 ; 2 10 ; 2 100 ; 2 110] ; % Col1 is ID, Col2 is Value
idx = [30 1000 ; 30 1200 ; 30 1500 ; 30 1000 ; 30 1200 ; 30 1500 ] ;
% col1: index ID, col2: value
mat1 is ID returns while idx is index returns. For simplicity, idx returns are repeated to match mat1. All IDs in mat1 have same rows. Even idx has the same rows.
[~,~,n] = unique(mat1(:,1), 'rows', 'last') ;
fncovariance = @(x,y) (x.*y)/sz ;
accumarray(n, [x(:,2) y(:,2)], [], fncovariance) % --> FAILS as VAL is not-vector!
You can see that I'm trying to calculate covariance (cov(x,y,1)) but cannot use Matlab's function directly as mat1 has IDs and I need covariance for each ID w.r.t Index.
Ansmat:
1 2444.4
2 7888.9
The short answer is no. In the accumarray() help, the key part is:
"VAL must be a numeric, logical, or character vector with the same length
as the number of rows in SUBS. VAL may also be a scalar whose value is
repeated for all the rows of SUBS."
This means you can't even fake it out by using cells.
However, if you put the IDs in their own index variable, and then reshape your data so that data corresponding to different IDs are in different columns, this problem can be efficiently handled by bsxfun(). For reference, I also included a matrix math method, a simple for loop method using cov(), and a cellfun() method using a custom fncovariance() function (note I modified it from yours above).
fncovariance = @(x,y) mean(x.*y) - mean(x)*mean(y);
IDs = unique(mat1(:,1));
ret = reshape(mat1(:,2), sz, length(IDs));
idx = idx(1:sz, 2);
% bsxfun method
mean(bsxfun(@times, ret, idx)) - bsxfun(@times, mean(ret), mean(idx))
% matrix math
idx' * ret / length(idx) - mean(ret)*mean(idx)
% for loop method
id_cov = zeros(1, length(IDs));
for i=1:length(IDs)
tmp = cov(ret(:,i), idx, 1);
id_cov(i) = tmp(2,1);
end
id_cov
% cellfun method
ret_cell = num2cell(ret, 1);
idx_cell = num2cell(repmat(idx, 1, length(IDs)), 1);
cellfun(fncovariance, ret_cell, idx_cell)
If you simulate some more data and time these different methods, the bsxfun() way is the fastest:
sz = 10;
n_ids = 100;
IDs = 1:n_ids;
ret = randi(1000, sz, n_ids);
idx = randi(1000, sz, 1);
Elapsed time is 0.001292 seconds.
Elapsed time is 0.001523 seconds.
Elapsed time is 0.009625 seconds.
Elapsed time is 0.011454 seconds.
A final option you might be interested in is the grpstats() function in the statistics toolbox, which lets you sweep out arbitrary statistics based on a grouping variable.
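Although VAL itself has to be a vector, a commonly used workaround is to pass row numbers as VAL and let the function look up as many columns as it needs. This is a different technique from the reshape-based methods above, sketched here on the data from the question (rowNumbers, fncov and covById are names invented for this example):

```matlab
mat1 = [1 20 ; 1 40 ; 1 50 ; 2 10 ; 2 100 ; 2 110];                         % col1: ID, col2: value
idx  = [30 1000 ; 30 1200 ; 30 1500 ; 30 1000 ; 30 1200 ; 30 1500];         % col1: index ID, col2: value

% Group number for every row
[ids, ~, n] = unique(mat1(:, 1));

% Population covariance, i.e. normalised by N as in cov(x, y, 1)
fncov = @(x, y) mean(x .* y) - mean(x) * mean(y);

% VAL is just the row numbers; the function indexes both matrices with them
rowNumbers = (1:size(mat1, 1)).';
covById = accumarray(n, rowNumbers, [], @(r) fncov(mat1(r, 2), idx(r, 2)));

result = [ids covById];
disp(result)
```

This reproduces the values listed under Ansmat (about 2444.4 and 7888.9) and does not require every ID to have the same number of rows, though it has not been timed against the methods above.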