Consider a cell array of dates date = { 1000000 x 1 } such that it has dates in different formats.
date = 27-01-2009
28-Mar-2003
.
.
.
21-02-2003 06:35:20
21-02-2003 06:35:20.42
.
.
and so on
How do I get a 1000000x3 matrix A = [ year month day ] from date?
Approach 1
date = {
'27-01-2009'
'28-Mar-2003'
'21-02-2003 06:35:20'
'21-02-2003 06:35:20.42'}
date_double_arr = datevec(date,'dd-mm-yyyy')
out = date_double_arr(:,1:3) %// desired output
Output -
out =
2009 1 27
2003 3 28
2003 2 21
2003 2 21
Approach 2
In case of inconsistencies between the date-month-year part and the time part, one might want to separate out the former and use it to get the final Nx3 array, like so -
t1 = cellfun(@(x) strsplit(x,' '), date,'uni',0)
t2 = cellfun(@(x) x(1), t1)
t3 = datevec(t2,'dd-mm-yyyy')
out = t3(:,1:3) %// desired output
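The idea of dropping the time part before parsing can also be sketched with strtok, which accepts a cell array of character vectors. This is only an illustration: the three sample strings are made up, they use numeric months only, and the names dateOnly and dv are not from the answer above.

```matlab
% Hypothetical sample: numeric day-month-year dates, some with a time part
date = {'27-01-2009'; '21-02-2003 06:35:20'; '21-02-2003 06:35:20.42'};

% For a cell array, strtok returns the text before the first space of each element
dateOnly = strtok(date);

% Parse only the date part and keep the year, month and day columns
dv  = datevec(dateOnly, 'dd-mm-yyyy');
out = dv(:, 1:3);
```

Entries with a month name such as '28-Mar-2003' are deliberately left out of this sketch; they may need a separate format string.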
Objective:
I would like to get, for each period and group of a timetable, the result of a given function of var1 and var2 [i.e. the ratio of (the sum of var1 over the group) by (the sum of var2 over the group)]
using unstack and a function handle:
Data
A = [1 2 3 4 2 4 6 8]';
B = [4 2 14 7 8 4 28 14]';
C=["group1";"group1";"group2";"group2";"group1";"group1";"group2";"group2"];
Year = [2010 2010 2010 2010 2020 2020 2020 2020]';
Year = datetime(string(Year), 'Format', 'yyyy');
t=table(Year,C,A,B,'VariableNames',{'Year' 'group' 'var1' 'var2'});
t=table2timetable(t,'RowTimes','Year');
Desired Output [EDIT]
A table with three columns: year, Ratio_group1, Ratio_group2. Where for instance:
Ratio_group1 for 2010 = (1+2) / (4+2) =0.5.
Function
f = @(x,y) sum(x)./sum(y); %or f = @(x) sum(x(1,:))./sum(x(2,:));
[Ratio,is] = unstack(t,{'var1','var2'},"group",'AggregationFunction',f);
Errors that I get:
%Not enough input arguments.
%Or: Index in position 1 exceeds array bounds (must not exceed 1)
Another failed test inspired from https://www.mathworks.com/help/matlab/ref/double.groupsummary.html
(See Method Function Handle with Multiple Inputs)
[Ratio,is] = unstack(t,{["var1"],["var2"]},"group",'AggregationFunction',f);
%Error: A table variable subscript must be a numeric array containing real positive integers, a logical array (...)
This can be done using findgroups and splitapply (or equivalently accumarray):
result = splitapply(@(x,y) sum(x)/sum(y), t.var1, t.var2, findgroups(t.Year, t.group));
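The splitapply call returns one ratio per (year, group) pair in a single column. If the wide layout from the question is wanted (one ratio column per group), the result can be pivoted afterwards. The following is a minimal sketch under some assumptions: it uses plain numeric years instead of the timetable, and the names yr, long and wide are illustrative only.

```matlab
% Example data from the question, with numeric years for simplicity
var1  = [1 2 3 4 2 4 6 8]';
var2  = [4 2 14 7 8 4 28 14]';
group = ["group1";"group1";"group2";"group2";"group1";"group1";"group2";"group2"];
yr    = [2010 2010 2010 2010 2020 2020 2020 2020]';

% One group number per (year, group) pair, plus the key values of each pair
[g, gYear, gGroup] = findgroups(yr, group);
ratio = splitapply(@(x, y) sum(x) / sum(y), var1, var2, g);

% Long format: one row per (year, group) pair
long = table(gYear, gGroup, ratio, 'VariableNames', {'Year', 'group', 'Ratio'});

% Wide format: one row per year, one column per group
wide = unstack(long, 'Ratio', 'group');
```

Here unstack needs no aggregation function because every (year, group) pair already has exactly one value.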
I am facing an issue with counting number of occurrences by date, suppose I have an excel file where the data is as follows:
1/1/2001 23
1/1/2001 29
1/1/2001 24
3/1/2001 22
3/1/2001 23
My desired output is:
1/1/2001 3
2/1/2001 0
3/1/2001 2
Though 2/1/2001 doesn't appear in the input, I want it included in the output with a count of 0. This is my current code:
[Value, Time] = xlsread('F:\1km\fire\2001- 02\2001_02.xlsx','Sheet1','A2:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
% a=(datestr(tm));
T1=datetime('9/23/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1));
% idx=idx';
out = tm_all(idx);
The ismember function does not seem to work, because the length of tm_all is 274 and the size of data is 158x2
I suggest using datetime instead of datenum to convert your date strings. Among other benefits, it makes the whole computation much easier:
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','dd/MM/yyyy');
Once you have obtained your datetime vector, the calculation can be achieved as follows:
% Create a sequence of datetimes from the first date to the last date...
T = (min(tm):max(tm)).';
% Map every occurrence to its position in the sequence...
[~,idx] = ismember(tm,T);
% Count the occurrences at every position of the sequence...
C = accumarray(idx,1);
% Put the date sequence and the counts together into a single variable...
res = table(T,C)
Here is the output:
res =
T C
___________ _
01-Jan-2001 3
02-Jan-2001 0
03-Jan-2001 2
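The question builds its date sequence from fixed endpoints (T1:T2) rather than from the data. A minimal sketch of the same counting idea for that case, assuming a made-up range that is wider than the data; passing an explicit output size to accumarray keeps one count per day of the range.

```matlab
% Sample dates from the question, read as day/month/year
tm = datetime({'1/1/2001';'1/1/2001';'1/1/2001';'3/1/2001';'3/1/2001'}, ...
              'InputFormat', 'dd/MM/yyyy');

% Hypothetical fixed range, wider than the data
T = (datetime(2000,12,30):datetime(2001,1,5)).';

% Position of every observation in the range; tf is false for dates outside it
[tf, idx] = ismember(tm, T);

% Count per day of the range; the size argument pads trailing days with zeros
C = accumarray(idx(tf), 1, [numel(T) 1]);
res = table(T, C);
```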
For more information about the functions used within the computation:
accumarray function
ismember function
On a side note, I couldn't tell whether your dates are in dd/MM/yyyy or MM/dd/yyyy format. With the latter, my approach cannot produce that output. You would instead need to extract the month of each date and group the data by month (and also by year, if your dates span more than 2001):
tm = datetime({
'1/1/2001';
'1/1/2001';
'1/1/2001';
'3/1/2001';
'3/1/2001'
},'InputFormat','MM/dd/yyyy');
M = month(tm);
M_seq = (min(M):max(M)).';
[~,idx] = ismember(M,M_seq);
C = accumarray(idx,1);
res = table(datetime(2001,M_seq,1),C)
res =
Var1 C
___________ _
01-Jan-2001 3
01-Feb-2001 0
01-Mar-2001 2
I'll first give the code and then explain step by step.
code:
[Value, Time] = xlsread('stack','A1:D159','',@convertSpreadsheetExcelDates);
tm=datenum(Time);
val=Value(:,4);
data=[tm val];
a=(datestr(tm));
T1=datetime('1/1/2001');
T2=datetime('6/23/2002');
T = T1:T2;
tm_all=datenum(T);
[~, idx] = ismember(tm_all,data(:,1)); % get indices
[occurence,dates]= hist(data(:,1),unique(data(:,1))); % count occurrences of dates from file
t = [0;data(:,1)]; % prepend 0 to dates (needed later because MATLAB indexing starts at 1)
[~,idx] = ismember(t(idx+1),dates); % get indices
q = [0 occurence]; % prepend 0 to occurence (needed later because MATLAB indexing starts at 1)
occ = q(idx+1); % make vector with occurrences
out = [tm_all' occ']; % output
idx of ismember is a 1xlength(tm_all) vector that at position i contains the lowest index at which tm_all(i) is found in data(:,1). Take for example A = [1 2 3 4] and B = [1 1 2 4]; then for [~,idx] = ismember(A,B) the result will be
idx = [1 3 0 4]
because A(1) = 1 and the first 1 in B is found at position 1. If a number in A doesn't occur in B, the corresponding entry is 0.
[occurence,dates]= hist(data(:,1),unique(data(:,1))); gives the number of occurrences of each date.
t = [0;data(:,1)]; adds a zero at the beginning so t looks like:
0
'date 1'
'date 2'
'date 3'
'date 4'
...
Why this is done will be explained next.
t(idx+1) is a vector with length(tm_all) elements. It is essentially a copy of tm_all, except that a date that doesn't occur in the file is replaced by zero. How does this work? t(i) gives you the value of t at position i, so t([1 5 4 2 9]) is a vector with the values of t at positions 1, 5, 4, 2 and 9. Remember that idx is the vector that contains the indices of the dates in data(:,1), with 0 for dates that are missing. Because MATLAB indexing starts at 1, idx+1 is needed. The dates in data(:,1) must then be shifted by one position as well; that's done by adding the zero at the beginning.
[~,idx] = ismember(t(idx+1),dates); is the same as before, but idx now contains the indices into dates.
q = [0 occurence]; again adds a zero at the beginning, and occ = q(idx+1); is the row of occurrences of the dates.
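The zero-padding trick can be seen in isolation on a tiny made-up example. The numbers below are illustrative and not taken from the spreadsheet.

```matlab
% Counts for the two dates that are present in the (hypothetical) file
occurence = [3 2];

% For three consecutive calendar days: index of the day in the list of
% present dates, with 0 meaning "this day does not occur in the file"
idx = [1 0 2];

% Prepending a zero lets index 0 (shifted to 1) pick up a count of 0
q   = [0 occurence];
occ = q(idx + 1);
```

The result occ is [3 0 2]: the missing day receives a zero count without any explicit branching.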
An Excel file contains 5 columns: the first column contains the year (1987 to 2080), the second the month, the third the day, and the fourth and fifth contain values. I would like to get the sums of columns four and five for each year in column one. For example, I would like the sums of columns four and five for 1987, then 1988, then 1989, and so on.
Example of data file is attached
I have tried the following code considering that each year contains 365 days.
n=1;
for i=1:365:size(data,1)
Total(n,:) = sum(data(i:i+365-1,:));
n=n+1;
end
But the problem is that not all years contain 365 days. Some of them (e.g. 1988, 1992) contain 366 days because they are leap years. In those cases, the sums are incorrect.
Looking for your help to get the sum values of columns 4 and 5 according to the year in column 1.
It would be greatly appreciated.
UPDATE: much faster solution at the end!
It can be done as follows with one line for each column:
% some example data
years = ceil(1987:0.3:2080)';
months = randi(12,numel(years),1);
days = randi(30,numel(years),1);
values = randi(42,numel(years),2);
% data similar to yours;
data = [ years months days values ];
The readable, long way would be:
% years
y = data(:,1)
% unique years
uy = unique(y);
% for column 4
C4 = arrayfun(@(x) sum( data(y == x, 4) ), uy )
% for column 5
C5 = arrayfun(@(x) sum( data(y == x, 5) ), uy )
or just short in one line per column:
C4 = arrayfun(@(x) sum( data( (data(:,1) == x), 4) ), unique(data(:,1)) )
returning a 94x1 double array with all sums for all 94 unique years of the example data.
If you want to arrange it somehow you could do it as follows:
summary = [uy, C4, C5]
returning something like:
summary =
%// year, sum of column 4, sum of column 5
1987 3 3
1988 40 40
1989 56 56
1990 96 96
1991 54 54
1992 15 15
1993 73 73
1994 42 42
1995 66 66
1996 56 56
...
You could also do all columns at once. Even for just 2 columns it should be about 50% faster.
cols = 4:5;
C = cell2mat( arrayfun(@(x) sum( data(y == x, cols),1 ), uy,'uni',0 ) )
The problem with that solution is that you have a matrix of about 30000x5, and for every unique year it indexes into the whole matrix to "search" for the rows of the year being summed. But there is actually a built-in function that does exactly that:
A simpler and much faster solution you can achieve using accumarray:
[~,~, i_uy] = unique(data(:,1));
C4 = accumarray(i_uy,data(:,4));
C5 = accumarray(i_uy,data(:,5));
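To see what the accumarray version produces, here is a minimal sketch on a made-up five-row data matrix with unequal numbers of rows per year. It also takes the unique years from the first output of unique so the summary can be assembled directly.

```matlab
% Hypothetical data: [year month day value4 value5], 2 rows for 1987, 3 for 1988
data = [1987 1 1 2 10;
        1987 1 2 3 20;
        1988 1 1 4 30;
        1988 1 2 5 40;
        1988 1 3 6 50];

% uy holds the unique years, i_uy maps every row to its year's position in uy
[uy, ~, i_uy] = unique(data(:,1));

% Sum each value column within every year
C4 = accumarray(i_uy, data(:,4));
C5 = accumarray(i_uy, data(:,5));

summary = [uy, C4, C5];
```

Because the grouping is driven by the year column itself, years with 365 or 366 rows are handled the same way.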
Say that I have a dataset:
Jday = datenum('2009-01-01 00:00','yyyy-mm-dd HH:MM'):1/24:...
datenum('2009-01-05 23:00','yyyy-mm-dd HH:MM');
DateV = datevec(Jday);
DateV(4,:) = [];
DateV(15,:) = [];
DateV(95,:) = [];
Dat = rand(size(DateV,1),1); % one measurement per remaining row of DateV
How is it possible to remove all of the days that have fewer than 24 measurements? For example, the first day has only 22 measurements, so I would need to remove that entire day. How could I repeat this for the whole array?
A quick solution is to group by year, month and day with unique(), then count the observations per day with accumarray(), and finally exclude the days with fewer than 24 observations using two steps of logical indexing:
% Count observations per day
[unDate,~,subs] = unique(DateV(:,1:3),'rows');
counts = [unDate accumarray(subs,1)]
counts =
2009 1 1 22
2009 1 2 24
2009 1 3 24
2009 1 4 24
2009 1 5 23
Then, apply the criterion to the counts and retrieve a logical index:
% index only those that meet criteria
idxC = counts(:,end) == 24
idxC =
0
1
1
1
0
% keep those which meet criteria (optional, for visual inspection)
counts(idxC,:)
ans =
2009 1 2 24
2009 1 3 24
2009 1 4 24
Finally, find the members of Dat that fall into the selected counts with a second round of logical indexing through ismember():
idxDat = ismember(subs,find(idxC))
Dat(idxDat,:)
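The same steps can be run end to end on synthetic data. The sketch below builds the date vectors directly instead of going through datenum, and it expands the per-day counts back to rows with nPerDay(subs), which is a one-step variant of the ismember lookup. The names nPerDay, keep, DateV_full and Dat_full are illustrative.

```matlab
% Synthetic date vectors: 5 days x 24 hourly rows, [year month day hour min sec]
[h, d] = ndgrid(0:23, 1:5);
DateV  = [repmat([2009 1], numel(d), 1), d(:), h(:), zeros(numel(d), 2)];
DateV([4 16 97], :) = [];          % drop two rows of day 1 and one row of day 5
Dat    = rand(size(DateV, 1), 1);  % one measurement per remaining row

% Count rows per calendar day
[~, ~, subs] = unique(DateV(:, 1:3), 'rows');
nPerDay      = accumarray(subs, 1);

% Expand the per-day counts back to rows and keep only complete days
keep       = nPerDay(subs) == 24;
DateV_full = DateV(keep, :);
Dat_full   = Dat(keep);
```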
Rather long answer, but I think it should be useful. I would do this using containers.Map. Possibly there is a faster way, but maybe for now this one will be good.
Jday = datenum('2009-01-01 00:00','yyyy-mm-dd HH:MM'):1/24:...
datenum('2009-01-05 23:00','yyyy-mm-dd HH:MM');
DateV = datevec(Jday);
DateV(4,:) = [];
DateV(15,:) = [];
DateV(95,:) = [];
% create a map
dateMap = containers.Map();
% count measurements in each date (i.e. first three columns of DateV)
for rowi = 1:1:size(DateV,1)
    dateRow = DateV(rowi, :);
    dateStr = num2str(dateRow(1:3));
    if ~isKey(dateMap, dateStr)
        % initialize Map for a given date with 1 measurement (i.e. our
        % counter of measurements)
        dateMap(dateStr) = 1;
        continue;
    end
    % increment measurement counter for given date
    dateMap(dateStr) = dateMap(dateStr) + 1;
end
% get the dates
dateStrSet = keys(dateMap);
for keyi = 1:numel(dateStrSet)
    dateStrCell = dateStrSet(keyi);
    dateStr = dateStrCell{1};
    % get number of measurements in a given date
    numOfmeasurements = dateMap(dateStr);
    % if fewer than 24 do something about it, e.g. save the date
    % for later removal from DateV
    if numOfmeasurements < 24
        fprintf(1, 'This date has less than 24 measurement: %s\n', dateStr);
    end
end
The result is:
This date has less than 24 measurement: 2009 1 1
This date has less than 24 measurement: 2009 1 5
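The loop above only reports the incomplete days. One possible way to go from the map to the actual removal is to look up the count for every row and filter on it. This is a sketch on a tiny made-up DateV with a hypothetical threshold of 3 rows per day (it would be 24 for hourly data); rowKeys, rowCounts, required and DateV_full are illustrative names.

```matlab
% Tiny synthetic example: day 1 has 2 rows, day 2 has 3 rows
DateV = [2009 1 1 0 0 0;
         2009 1 1 1 0 0;
         2009 1 2 0 0 0;
         2009 1 2 1 0 0;
         2009 1 2 2 0 0];
required = 3;   % hypothetical threshold for a complete day

% One char key per row, built from year, month and day with num2str
rowKeys = cellfun(@num2str, num2cell(DateV(:, 1:3), 2), 'UniformOutput', false);

% Count rows per key
dateMap = containers.Map('KeyType', 'char', 'ValueType', 'double');
for k = 1:numel(rowKeys)
    if isKey(dateMap, rowKeys{k})
        dateMap(rowKeys{k}) = dateMap(rowKeys{k}) + 1;
    else
        dateMap(rowKeys{k}) = 1;
    end
end

% Look up the count of every row's day and keep rows of complete days only
rowCounts  = cellfun(@(key) dateMap(key), rowKeys);
DateV_full = DateV(rowCounts >= required, :);
```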
I have observed daily data that I need to compare to generated Monthly data so I need to get a mean of each month over the thirty year period.
My observed data set is currently in 365x31 with rows being each day (no leap years!) and the extra column being the month number (1-12).
The problem I am having is that I can only seem to get a script to compute the mean over all years, i.e. I cannot figure out how to get the script to do it for each column separately. An example of the data is below:
1 12 14
1 -15 10
2 13 3
2 2 37
...all the way to 12 for 365 rows
SO: to recap, I need to get the mean of [12; -15; 13; 2] then [14; 10; 3; 37] and so on.
I have been trying to loop through the output of unique(), which gets the right number of rows to average but produces incorrect means. Now I need it to handle each month (28-31 rows) and each column individually. The result should be a 12x30 matrix. I feel like I am missing something SIMPLE. Code:
u = unique(m); %get unique values of m (months) ie) 1-12
for i=1:length(u)
month(i) = mean(obatm(u(i), (2:31))); % the average for each month of each year
end
Appreciate any ideas! Thanks!
You can simply filter the rows for each month and then apply mean, like so:
month = zeros(12, size(obatm, 2) - 1);
for k = 1:12
    month(k, :) = mean(obatm(obatm(:, 1) == k, 2:end), 1); % skip the month column
end
EDIT:
If you want something fancy, you can also do this:
cols = size(obatm, 2) - 1;
subs = bsxfun(@plus, obatm(:, 1), (0:12:12 * (cols - 1)));
vals = obatm(:, 2:end);
month = reshape(accumarray(subs(:), vals(:), [12 * cols, 1], @mean), 12, cols)
Look, Ma, no loops!
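To check that the loop and the accumarray version agree, both can be run on the four sample rows from the question. This is a minimal sketch with a few assumptions: only two months are present, so nMonths replaces the hard-coded 12, and the offset is added with implicit expansion (R2016b or later) instead of bsxfun. The names loopMeans and accMeans are illustrative.

```matlab
% Sample rows from the question: [month, year-1 value, year-2 value]
obatm   = [1 12 14; 1 -15 10; 2 13 3; 2 2 37];
nMonths = max(obatm(:, 1));
cols    = size(obatm, 2) - 1;

% Loop version: mean of every data column within each month
loopMeans = zeros(nMonths, cols);
for k = 1:nMonths
    loopMeans(k, :) = mean(obatm(obatm(:, 1) == k, 2:end), 1);
end

% accumarray version: give every (month, column) pair its own subscript
subs = obatm(:, 1) + (0:nMonths:nMonths * (cols - 1));
vals = obatm(:, 2:end);
accMeans = reshape(accumarray(subs(:), vals(:), [nMonths * cols, 1], @mean), nMonths, cols);
```

Both versions give [-1.5 12; 7.5 20] for this input, which matches the means asked for in the question's recap.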