ind2sub from finding value in a 3D array - matlab

I have a dataset time_local that is size 144x91x8845 (lon x lat x hour). I want to find the indices at which a particular hour occurs.
data stores the data
time_local stores the hours at which the data occur. Due to time zone differences, not all the hours in each 144x91 face are the same.
% One year
first = datenum([num2str(years(y)),'-01-01 00:00:00']);
last = datenum([num2str(years(y)),'-12-31 23:00:00']);
dt=1/24;
subset = first:dt:last;
% Find where hour one occurs (want all hours, but starting with 1 hour)
ind = find(time_local == subset(1)); % Hour 1
% Want to save out a new data matrix with the data from hour 1
[a,b,~] = size(time_local);
ind = find(time_local == subset(1)); % Day 1
[x1,y1,z1] = ind2sub(size(time_local),ind);
ind1 = [x1,y1,z1];
data1 = NaN(a,b,length(subset)); % Preallocate new array
data1(:,:,1) = data(ind1(1,:)); % Pull out data where
ind gives me the linear indices, but I want to know the subscript indices so I can save data1 out where each 144x91 face is one hour. Right now, the ind2sub does not seem to be finding the right indices because the time_local that comes out from the indices is not correct.
Edit: I tried the following, which doesn't quite work because the time_local1 and data saved out isn't indexed correctly, but it's close. There must be a more efficient way though.
time_local1 = NaN(a,b,length(subset));
data = NaN(a,b,length(subset));
for a1 = 1:length(subset)
if isempty(time_local == subset(a1)) == 0
ind = find(time_local == subset(a1)); % Hour 1
[x1,y1,z1] = ind2sub(size(time_local),ind);
for a2 = 1:length(x1)
time_local1(x1(a2),y1(a2),a1) = time_local(x1(a2),y1(a2),z1(a2));
data(x1(a2),y1(a2),a1) = data1(x1(a2),y1(a2),z1(a2));
end
end
end

This probably isn't the most efficient one - using loops - but it works
[a,b,~] = size(data1); % Size of data file
first = datenum([num2str(years(y)),'-01-01 00:00:00']); % Hour one
last = datenum([num2str(years(y)),'-12-31 23:00:00']);
dt=1/24;
subset = first:dt:last;
%% Find where hour 1 occurs along the 3rd dimension for every gridcell (matching vector and vector instead of vector and matrix)
% Pre-allocate
time_local1 = NaN(a,b,length(subset));
data = NaN(a,b,length(subset));
count = 1:24:length(data1); % Set increases for beginning of each day
for a1 = 1:a % lon
disp(a1)
for b1 = 1:b % lat
ind(a1,b1) = find(time_local(a1,b1,:) == first); % Location of hour one in the 3rd dimension
ind1 = ind(a1,b1):24:length(data1); % Hours in each day starting from hour one
for c1 = 1:length(ind1)-2
%%% Ran once - Checking if the dates I am averaging across
%%% actually have no missing days
% if squeeze(time_local(a1,b1,ind1(c1):ind1(c1)+23)) ~= subset(count(c1):count(c1)+23)'
% disp([a1,b1])
% end
time_local1(a1,b1,count(c1):count(c1)+23) = time_local(a1,b1,ind1(c1):ind1(c1)+23); % Hourly data
data(a1,b1,c1) = nanmean(data1(a1,b1,ind1(c1):ind1(c1)+23)); % Average over one day
time(a1,b1,c1) = time_local1(a1,b1,count(c1)); % One entry per day
end
end
end
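The per-cell search above can also be sketched without the nested loops. This is only a minimal illustration on toy data, assuming every grid cell reaches the start time somewhere along the third dimension and that implicit expansion (R2016b or later) is available. The names shiftHours, nKeep, aligned and time_aligned are hypothetical, and a tolerance is used instead of == because datenum values are floating point.

```matlab
% Toy data: 2 x 3 grid, 72 hourly steps, each cell shifted by whole hours
a = 2; b = 3; n = 72; dt = 1/24;
first = datenum(2010,1,1);                   % assumed start time
shiftHours = [0 1 2; 3 4 5];                 % assumed per-cell offsets (hours)
hoursIdx = reshape(0:n-1, 1, 1, n);          % hour counter along dimension 3
time_local = first + (hoursIdx - shiftHours) * dt;
data1 = rand(a, b, n);

% First index along dim 3 where each cell equals 'first' (within a tolerance)
tol = dt/100;
[~, k0] = max(double(abs(time_local - first) < tol), [], 3);

% Copy nKeep hours starting at k0 for every cell, using linear indices
nKeep = 24;
[ii, jj, kk] = ndgrid(1:a, 1:b, 0:nKeep-1);
src = sub2ind([a b n], ii, jj, kk + k0);     % k0 expands along dimension 3
aligned = data1(src);                        % data on a common time axis
time_aligned = time_local(src);              % every face now holds one hour
```

After this, time_aligned(:,:,k) contains the same hour in every cell, so each 144x91 face of the real data would correspond to a single hour.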

Related

Fast way to get mean values of rows accordingly to subscripts

I have a data, which may be simulated in the following way:
N = 10^6;%10^8;
K = 10^4;%10^6;
subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;
In other words, it is a matrix where the last column holds the subscripts.
Now I want to calculate nanmean() for each subscript. Also I want to save number of rows for each subscript. I have a 'dummy' code for this:
uniqueSubs = unique(M(:,6));
avM = nan(numel(uniqueSubs),6);
for iSub = 1:numel(uniqueSubs)
tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1)];
end
The problem is that it is too slow. I want it to work for N = 10^8 and K = 10^6 (see the commented-out values in the definition of these variables).
How can I find the mean of the data in a faster way?
This sounds like a perfect job for findgroups and splitapply.
% Find groups in the final column
G = findgroups(M(:,6));
% function to apply per group
fcn = @(group) [mean(group, 1, 'omitnan'), size(group, 1)];
% Use splitapply to apply fcn to each group in M(:,1:5)
result = splitapply(fcn, M(:, 1:5), G);
% Check
assert(isequaln(result, avM));
M = sortrows(M,6); % sort the data per subscript
IDX = diff(M(:,6)); % find where the subscript changes
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)]; % add start and end of data
for iSub= 2:numel(tmp)
% Calculate the mean over just a single subscript, store in iSub-1
avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
end
This is some 60 times faster than your original code on my computer. The speed-up mainly comes from presorting the data and then finding all locations where the subscript changes. That way you do not have to traverse the full array each time to find the correct subscripts, but rather you only check what's necessary each iteration. You thus calculate the mean over ~100 rows, instead of first having to check in 1,000,000 rows whether each row is needed that iteration or not.
Thus: in the original you check, for each of the numel(uniqueSubs) subscripts (10,000 in this case), whether each of the N (1,000,000 here) rows belongs to that category, which results in 10^10 checks. The proposed code sorts the rows (sorting is N log N, thus about 6,000,000 operations here) and then loops once over the full array without additional checks.
For completion, here is the original code, along with my version, and it shows the two are the same:
N = 10^6;%10^8;
K = 10^4;%10^6;
subs = randi([1 K],N,1);
M = [randn(N,5) subs];
M(M<-1.2) = nan;
uniqueSubs = unique(M(:,6));
%% zlon's original code
avM = nan(numel(uniqueSubs),7); % add the subscript for comparison later
tic
uniqueSubs = unique(M(:,6));
for iSub = 1:numel(uniqueSubs)
tmpM = M(M(:,6)==uniqueSubs(iSub),1:5);
avM(iSub,:) = [nanmean(tmpM,1) size(tmpM,1) uniqueSubs(iSub)];
end
toc
%%%%% End of zlon's code
avM = sortrows(avM,7); % Sort for comparison
%% Start of Adriaan's code
avM2 = nan(numel(uniqueSubs),6);
tic
M = sortrows(M,6);
IDX = diff(M(:,6));
tmp = find(IDX);
tmp = [0 ;tmp;size(M,1)];
for iSub = 2:numel(tmp)
avM2(iSub-1,:) = [nanmean(M(tmp(iSub-1)+1:tmp(iSub),1:5),1) tmp(iSub)-tmp(iSub-1)];
end
toc %tic/toc should not be used for accurate timing, this is just for order of magnitude
%%%% End of Adriaan's code
all(avM(:,1:6) == avM2) % Do the comparison
% End of script
% Output
Elapsed time is 58.561347 seconds.
Elapsed time is 0.843124 seconds. % ~70 times faster
ans =
1×6 logical array
1 1 1 1 1 1 % i.e. the matrices are equal to one another
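As a further variant of the grouping idea above, the per-subscript mean can be sketched with accumarray by summing the non-NaN values and dividing by the non-NaN counts. This is an untimed illustration on a smaller problem, not a benchmark; the names idx, valid, nG and avM3 are introduced here and do not appear in the answers.

```matlab
rng(0);                                  % reproducible toy data
N = 1e4; K = 50;
subs = randi([1 K], N, 1);
M = [randn(N,5) subs];
M(M < -1.2) = nan;

[g, ~, idx] = unique(M(:,6));            % idx maps each row to its group
nG = numel(g);
vals = M(:,1:5);
valid = ~isnan(vals);
vals(~valid) = 0;                        % NaNs contribute nothing to the sums

avM3 = nan(nG, 6);
for c = 1:5
    s = accumarray(idx, vals(:,c), [nG 1]);            % sum of valid values
    n = accumarray(idx, double(valid(:,c)), [nG 1]);   % count of valid values
    avM3(:,c) = s ./ n;                  % 0/0 gives NaN for all-NaN groups
end
avM3(:,6) = accumarray(idx, 1, [nG 1]);  % number of rows per subscript
```

The loop runs over the five data columns only, so its cost does not grow with the number of subscripts.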

Vectorization instead of nested for loops in matlab

I am having trouble vectorizing this for loop in matlab which is really slow.
tvec and data are N×6 and N×4 arrays respectively, and they are the inputs to the function.
% preallocate
sVec = size(tvec)
tvec_ab = zeros(sVec(1),6);
data_ab = zeros(sVec(1),4);
inc = 0;
for i = 1:12
for j = 1:31
inc = inc +1;
[I,~] = find(tvec(:,2)==i & tvec(:,3)==j,1,'first'); % i = month, j = day, consistent with the sum below
if(I > 0)
tvec_ab(inc,:) = tvec(I,:);
data_ab(inc,:) = sum(data( (tvec(:,3) == j) & (tvec(:,2)==i) ,:));
end
end
end
% set output values
tvec_a = tvec_ab(1:inc,:);
data_a = data_ab(1:inc,:);
Every row in tvec represents the timestamp where the data was taken in the same row in the data matrix. Below you can see how a row would look like:
tvec:
[year, month, day, hour, minute, second]
data:
[dataA, dataB, dataC, dataD]
In the main program we can choose to "aggregate" after month, day or hour.
The code above is an example of how the aggregation for the option 'DAY' could happen.
the first time stamp of the day is the time stamp we want our output tvec_a to have in the row for that day.
The data output for that day (row in this case) would then be the sum of all the data for that day. Example:
data:
[data1ADay1, data1BDay1, data1CDay1, data1DDay1;
data2ADay1, data2BDay1, data2CDay1, data2DDay1]
aggregated data:
[data1ADay1 + data2ADay1, data1BDay1 + data2BDay1, data1CDay1+ data2CDay1,
data1DDay1+data2DDay1]
A vectorized version (not fully tested)
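[x, y] = meshgrid(1:12,1:31);
XY = [x(:) y(:)]; % all (month, day) pairs, in the same order as the loops
[I,loc] = ismember(XY,tvec(:,2:3),'rows'); % loc: matching row of tvec
tvec_ab(I,:) = tvec(loc(I),:);
[~,grp] = ismember(tvec(:,2:3),XY,'rows'); % row of XY for each sample
acm = zeros(size(XY,1),size(data,2));
for k = 1:size(data,2)
acm(:,k) = accumarray(grp,data(:,k),[size(XY,1) 1]); % accumarray needs vector values
end
data_ab(I,:) = acm(I,:);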
I actually found a way to do it myself:
%J holds the indices of the first occurrence of each unique day (e.g. if there
%are multiple records from January 1, the first time stamp from January 1 will
%be the time stamp for our output)
[~,J,K] = unique(tvec(:,2:3),'rows');
%preallocate
tvec_ab = zeros(length(J),6);
data_ab = zeros(length(J),4);
tvec_ab = tvec(J,:);
%sum all data from the same days together column wise.
for i = 1:4
data_ab(:,i) = accumarray(K,data(:,i));
end
%set output
data_a = data_ab;
tvec_a = tvec_ab;
Thanks for your responses though
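The same unique/accumarray pattern can be extended to the 'MONTH', 'DAY' and 'HOUR' options mentioned in the question by changing which columns of tvec form the grouping key. The sketch below uses made-up data; the variable names aggMode and keyCols are hypothetical, and it relies on unique returning the first occurrence of each row (the default in current MATLAB releases). Like the code above, it ignores the year column.

```matlab
% Toy input: [year month day hour minute second] and four data columns
tvec = [2013 1 1  0 0 0;
        2013 1 1 12 0 0;
        2013 1 2  6 0 0;
        2013 2 1  0 0 0];
data = [ 1  2  3  4;
        10 20 30 40;
         5  5  5  5;
         7  8  9 10];

aggMode = 'DAY';                         % hypothetical option name
switch aggMode
    case 'MONTH', keyCols = 2;
    case 'DAY',   keyCols = 2:3;
    case 'HOUR',  keyCols = 2:4;
end

[~, J, K] = unique(tvec(:,keyCols), 'rows');   % J: first row of each group
nG = numel(J);
data_a = zeros(nG, size(data,2));
for c = 1:size(data,2)
    data_a(:,c) = accumarray(K, data(:,c), [nG 1]);  % column-wise sums
end
tvec_a = tvec(J,:);                      % first time stamp of each group
```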

Text file and Matlab

I have a problem making some changes to a *.txt file.
The data have this format:
11,2003,1,1,9,38,40.38,1
11,2003,1,1,9,47,2.5,1
11,2003,1,1,10,34,43.88,1
11,2003,1,1,10,38,14.5,1
11,2003,1,1,12,47,13.2,1
Where the columns are station number,year, month, day, hour, minute, seconds and precipitation(1 = 0.1 mm)
Times with precipitation = 0 are not included in the list, so hours without rainfall are absent. For these cases I want to add one entry for the first minute of each hour without rainfall in the new file, to show that measurements were made. Like this:
50810,200301010938,0.1
50810,200301010947,0.1
50810,200301011034,0.1
50810,200301011038,0.1
50810,200301011100,0.0 <---- This is what I need to get in the New file
50810,200301011247,0.1
(New station number, date/time, precipitation)
For now I've come up with this:
clear all
data = load('jan-31des_2003.txt'); %opens file with data
fid=fopen('50810_2003','w'); %opens empty file to write
[nrow, ncol] = size(data); %size of data
fprintf(fid,'%5s %12s %5s \r\n','Snr','Dato - Tid','RR_01') %Header
for row = 1:nrow
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row,5); % hour
M = data(row,6); % minute
p = data(row,8); % precipitation
p = p*0.1
end
fclose(fid);
You can use an if command to check if the next hour you are looking at is more than a single hour ahead from the last one. If that's the case you can create a new entry at this point:
if data(row,5) > (data(row-1,5)+1)
y = data(row,2); %year
m = data(row,3); % month
d = data(row,4); % date
h = data(row-1,5)+1; % hour after the previous entry (the skipped one)
M = 00; % minute
p = 0; % precipitation
end
After this part you will need to check again whether there is another 'skipped' hour, and so on. You should also replace your for-loop with a while-loop that runs until you reach the last entry in your dataset.
Try to implement this idea and come back to us with your code in case it didn't work out.
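One way to sketch the gap filling without tracking skipped hours by hand is to convert every record to an hour index and ask setdiff which hours are missing. This is only an illustration on the five sample rows from the question; newStation is taken from the desired output, the names hourIdx, missing, rows and outLines are hypothetical, and writing the lines with fprintf is left out.

```matlab
data = [11,2003,1,1, 9,38,40.38,1;
        11,2003,1,1, 9,47, 2.5 ,1;
        11,2003,1,1,10,34,43.88,1;
        11,2003,1,1,10,38,14.5 ,1;
        11,2003,1,1,12,47,13.2 ,1];
newStation = 50810;

% Hours elapsed since the hour of the first record, one value per record
tHour = datenum(data(:,2), data(:,3), data(:,4), data(:,5), 0, 0);
hourIdx = round((tHour - tHour(1)) * 24);
missing = setdiff(0:hourIdx(end), hourIdx);        % hours with no record

% Time stamps of the records and of the first minute of each missing hour
tRec  = datenum(data(:,2), data(:,3), data(:,4), data(:,5), data(:,6), 0);
tMiss = datenum(data(1,2), data(1,3), data(1,4), data(1,5) + missing(:), 0, 0);

% Columns: [time, precipitation in mm], sorted by time
rows = sortrows([tRec, data(:,8)*0.1; tMiss, zeros(numel(missing),1)], 1);

outLines = cell(size(rows,1), 1);
for k = 1:size(rows,1)
    outLines{k} = sprintf('%d,%s,%.1f', newStation, ...
        datestr(rows(k,1), 'yyyymmddHHMM'), rows(k,2));
end
```

For the sample rows this adds a single entry for 11:00 between the 10:38 and 12:47 records.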

MATLAB: subtracting each element in a large vector from each element in another large vector in the fastest way possible

Here is the code I have; it's not simple subtraction. We want to subtract each value in one vector from each value in the other vector, keeping only differences within the bounds tmin and tmax. time_a and time_b are very long vectors of times (in ps). binsize is just for grouping times in a similar range for plotting. The slowest approach would be to loop through each element and subtract each element of the other vector, but this would take forever: we are talking about vectors of hundreds of megabytes up to gigabytes.
function [c, dt, dtEdges] = coincidence4(time_a,time_b,tmin,tmax,binsize)
% round tmin, tmax to an integer multiple of binsize:
if mod(tmin,binsize)~=0
tmin=tmin-mod(tmin,binsize)+binsize;
end
if mod(tmax,binsize)~=0
tmax=tmax-mod(tmax,binsize);
end
dt = tmin:binsize:tmax;
dtEdges = [dt(1)-binsize/2,dt+binsize/2];
% dtEdges = linspace((tmin-binsize/2),(tmax+binsize/2),length(dt));
c = zeros(1,length(dt));
Na = length(time_a);
Nb = length(time_b);
tic1=tic;
% tic2=tic1;
% bbMax=Nb;
bbMin=1;
for aa = 1:Na
ta = time_a(aa);
bb = bbMin;
% tic
while (bb<=Nb)
tb = time_b(bb);
d = tb - ta;
if d < tmin
bbMin = bb;
bb = bb+1;
elseif d > tmax
bb = Nb+1;
else
% tic
% [dum, dum2] = histc(d,dtEdges);
index = floor((d-dtEdges(1))/(dtEdges(end)-dtEdges(1))*(length(dtEdges)-1)+1);
% toc
% dt(dum2)
c(index)=c(index)+1;
bb = bb+1;
end
end
% if mod(aa, 200) == 0
% toc(tic2)
% tic2=tic;
% end
end
% c=c(1:end-1);
toc(tic1)
end
Well, not a final answer but a few clue to simplify and accelerate your system:
First, use cached values. For example, in your line:
index = floor((d-dtEdges(1))/(dtEdges(end)-dtEdges(1))*(length(dtEdges)-1)+1);
Your loop repeats the same computations on every iteration. You can calculate the value before starting the loop, cache it, and then reuse the stored result. Because / and * are evaluated left to right, the constant to cache is the span divided by the number of bins (that is, the bin width):
cached_dt_constant = (dtEdges(end)-dtEdges(1)) / (length(dtEdges)-1) ;
Then in your loop simply use:
index = floor( (d-dtEdges(1)) / cached_dt_constant +1 ) ;
With this many loop iterations you'll save valuable time this way.
Second, I am not entirely sure what the computations are trying to achieve, but you can save time again by using the indexing power of MATLAB. By replacing the lower part of your code like this, I get an execution time 2 to 3 times faster (and the same results, obviously).
Na = length(time_a);
Nb = length(time_b);
tic1=tic;
dtEdge_span = (dtEdges(end)-dtEdges(1)) ;
cached_dt_constant = dtEdge_span / (length(dtEdges)-1) ; % bin width
for aa = 1:Na
ta = time_a(aa);
d = time_b - ta ;
iok = (d>=tmin) & (d<=tmax) ;
index = floor( (d(iok)-dtEdges(1)) ./ cached_dt_constant +1 ) ;
c = c + accumarray(index(:), 1, [numel(c) 1]).' ; % counts repeated bins correctly
end
toc(tic1)
end
Now there is only one loop to go through: the inner loop has been removed and replaced by a vectorized calculation. With a bit more head-scratching there might be a way to drop the outer loop too and use only vectorized computations, although this requires enough memory to handle quite big arrays in one go.
If the precision of each value is not critical (I see you round and floor values often), try converting your initial vectors to 'single' instead of MATLAB's default 'double'. That would almost double the size of the arrays your memory can handle in one go.
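For vectors small enough that an Na-by-Nb matrix fits in memory, the fully vectorized idea mentioned above might look like the following sketch. It assumes implicit expansion (R2016b or later) and uses histcounts with the same dtEdges as the original function; the input values are made up. For the very large vectors in the question this would have to be applied to chunks of time_a rather than all at once.

```matlab
time_a = [0 10 20 30];                   % toy time stamps
time_b = [1 9 12 31 50];
tmin = -5; tmax = 5; binsize = 1;

dt = tmin:binsize:tmax;                  % bin centres
dtEdges = [dt(1)-binsize/2, dt+binsize/2];

d = time_b(:).' - time_a(:);             % Na x Nb matrix of all differences
d = d(d >= tmin & d <= tmax);            % keep only differences in range
c = histcounts(d, dtEdges);              % one count per bin centre in dt
```

Here the retained differences are -1, 1, 1 and 2, so c has a 1 at dt = -1, a 2 at dt = 1 and a 1 at dt = 2.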

MATLAB: Dividing a year-length varying-resolution time vector into months

I have a time series in the following format:
time data value
733408.33 x1
733409.21 x2
733409.56 x3
etc..
The data runs from approximately 01-Jan-2008 to 31-Dec-2010.
I want to separate the data into columns of monthly length.
For example the first column (January 2008) will comprise of the corresponding data values:
(first 01-Jan-2008 data value):(data value immediately preceding the first 01-Feb-2008 value)
Then the second column (February 2008):
(first 01-Feb-2008 data value):(data value immediately preceding the first 01-Mar-2008 value)
et cetera...
Some ideas I've been thinking of but don't know how to put together:
Convert all serial time numbers (e.g. 733408.33) to character strings with datestr
Use strmatch('01-January-2008',DatesInChars) to find the indices of the rows corresponding to 01-January-2008
Tricky part (?): TransformedData(:,i) = OriginalData(start:end) ? end = strmatch(1) - 1 and start = 1. Then change start at the end of the loop to strmatch(1) and then run step 2 again to find the next "starting index" and change end to the "new" strmatch(1)-1 ?
Having it speed optimized would be nice; I am going to apply it on data sampled ~2 million times.
Thanks!
I would use histc with a list of month boundaries as the second parameter (the example below uses the first day of each month). Note: use the two-output form of histc, which also returns the bin index of each sample.
The edge list can easily be created with datenum or datevec.
This way you avoid operations on strings, so it should be fast.
EDIT:
Example with result in a simple data structure (including some code from @Rody):
% Generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = tspan(:) + randn(size(tspan(:))); % add some noise so it's non-uniform
data = randn(size(tspan));
% Generate list of edge
edge = [];
for y = 2008:2010
for m = 1:12
edge = [edge datenum(y, m, 1)];
end
end
% Histogram
[number, bin] = histc(tspan, edge);
% Setup of result
result = {};
for n = 1:length(edge)
result{n} = [tspan(bin == n), data(bin == n)];
end
% Test
% 04-Aug-2008 17:25:20
datestr(result{8}(4,1))
tspan(data == result{8}(4,2))
datestr(tspan(data == result{8}(4,2)))
Assuming you have sorted, non-equally-spaced date numbers, the way to go here is to put the relevant data in a cell array, so that each entry corresponds to the next month, and can hold a different amount of elements.
Here's how to do that quite efficiently:
% generate some test times/data
tstart = datenum('01-Jan-2008');
tend = datenum('31-Dec-2010');
tspan = tstart : tend;
tspan = tspan(:) + randn(size(tspan(:))); % add some noise so it's non-uniform
data = randn(size(tspan));
% find month numbers
[~,M] = datevec(tspan);
% find indices where the month changes
inds = [find(diff([0; M])); numel(M)+1]; % month starts, plus one past the end
% extract data in columns
sz = numel(inds)-1; % number of months
cols = cell(sz,1);
for ii = 1:sz
cols{ii} = data( inds(ii) : inds(ii+1)-1 );
end
Note that it can be difficult to determine which entry in cols belongs to which month, year, so here's how to do it in a more human-readable way:
% change this line:
[y,M] = datevec(tspan);
% and change these lines:
cols = cell(sz,3);
for ii = 1:sz
cols{ii,1} = data( inds(ii) : inds(ii+1)-1 );
% also store the year and month
cols{ii,2} = y(inds(ii));
cols{ii,3} = M(inds(ii));
end
I'll assume you have timeVals, an Nx1 double vector holding the time value of each datum, and that data is also an Nx1 array. I also assume data and timeVals are sorted by time: that is, the samples are ordered according to the time they were taken.
How about:
subs = @(x,i) x(:,i);
months = subs( datevec(timeVals), 2 ); % extract the month of year as a number from the time
r = find( months ~= [months(2:end); months(end)+1] ); % last index of each month
monthOfCell = months( r );
r( 2:end ) = r( 2:end ) - r( 1:end-1 ); % number of samples in each month
dataByMonth = mat2cell( data, r ); % data is Nx1, so split along rows
timeByMonth = mat2cell( timeVals, r );
After running this code, you have a cell array dataByMonth each cell contains all data relevant to a specific month. The corresponding cell of timeByMonth holds the sampling times of the data of the respective month. Finally, monthOfCell tells you what is the month's number (1-12) of each cell.
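A compact alternative to the approaches above is to build a running month number from datevec and let accumarray collect the samples of each month into a cell. This is a sketch on evenly spaced toy data, assuming timeVals is sorted so that the order of the samples inside each cell is preserved; the name key is hypothetical. Months without any sample would produce empty cells.

```matlab
% Toy data: two samples per day from 01-Jan-2008 to 31-Mar-2008
timeVals = (datenum(2008,1,1) : 0.5 : datenum(2008,3,31)).';
data = (1:numel(timeVals)).';

[y, m] = datevec(timeVals);              % year and month of every sample
key = (y - y(1)) * 12 + m;               % running month number
key = key - key(1) + 1;                  % first month gets index 1

dataByMonth = accumarray(key, data, [], @(v) {v});      % one cell per month
timeByMonth = accumarray(key, timeVals, [], @(v) {v});
```

Because the key includes the year, January 2008 and January 2009 would end up in different cells, which the month-only comparison does not distinguish.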