Bucketing Algorithm - matlab

I've got some code that works, but is a bit of a bottleneck, and I'm stuck trying to figure out how to speed it up. It's in a loop, and I can't figure how to vectorize it.
I've got a 2D array, vals, that represents timeseries data. Rows are dates, columns are different series. I'm trying to bucket the data by months to perform various operations on it (sum, mean, etc). Here is my current code:
allDts; %Dates/times for vals. Size is [size(vals, 1), 1]
vals;
[Y M] = datevec(allDts);
fomDates = unique(datenum(Y, M, 1)); %first of the month dates
[Y M] = datevec(fomDates);
nextFomDates = datenum(Y, M, DateUtil.monthLength(Y, M)+1);
newVals = nan(length(fomDates), size(vals, 2)); %preallocate for speed
for k = 1:length(fomDates);
This next line is the bottleneck because I call it so many times.(looping)
idx = (allDts >= fomDates(k)) & (allDts < nextFomDates(k));
bucketed = vals(idx, :);
newVals(k, :) = nansum(bucketed);
end %for
Any Ideas? Thanks in advance.

That's a difficult problem to vectorize. I can suggest a way to do it using CELLFUN, but I can't guarantee that it will be faster for your problem (you would have to time it yourself on the specific data sets you are using). As discussed in this other SO question, vectorizing doesn't always work faster than for loops. It can be very problem-specific which is the best option. With that disclaimer, I'll suggest two solutions for you to try: a CELLFUN version and a modification of your for-loop version that may run faster.
CELLFUN SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart); % Get unique start dates
valCell = mat2cell(vals(sortIndex,:),diff([0 uniqueIndex]));
newVals = cellfun(#nansum,valCell,'UniformOutput',false);
The call to MAT2CELL groups the rows of vals that have the same start date together into cells of a cell array valCell. The variable newVals will be a cell array of length numel(uniqueStarts), where each cell will contain the result of performing nansum on the corresponding cell of valCell.
FOR-LOOP SOLUTION:
[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1); % Start date of each month
[monthStart,sortIndex] = sort(monthStart); % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart); % Get unique start dates
vals = vals(sortIndex,:); % Sort the values according to start date
nMonths = numel(uniqueStarts);
uniqueIndex = [0 uniqueIndex];
newVals = nan(nMonths,size(vals,2)); % Preallocate
for iMonth = 1:nMonths,
index = (uniqueIndex(iMonth)+1):uniqueIndex(iMonth+1);
newVals(iMonth,:) = nansum(vals(index,:));
end

If all you need to do is form the sum or mean on rows of a matrix, where the rows are summed depending upon another variable (date) then use my consolidator function. It is designed to do exactly this operation, reducing data based on the values of an indicator series. (Actually, consolidator can also work on n-d data, and with a tolerance, but all you need to do is pass it the month and year information.)
Find consolidator on the file exchange on Matlab Central

Related

MATLAB Table Datetime loop indexing

I am trying to take my excel spreadsheet and import it into MATLAB (already accomplished that), and then using for-loop indexing to create arrays of the data for a give day containing.
So ideally I would like to know how I could iterate through a years worth of data, and create variables with the date corresponding to the table elements on that day. As I said, I have multiple years worth of data, which is why I'd like a solution which would "automate" my process.
Welcome to stackoverflow! Note that it is easier if you provide some code -- at least to create some sample data.
Anyway, you can loop over dates easily and there shouldn't be a problem with efficiency if you scale it to many entries:
Tm = [ datetime('now') + duration(1,0,0)%add 1 hour
datetime('now')
datetime('yesterday')
datetime('tomorrow')];
% convert to date
Dt = yyyymmdd(Tm);
% you may want to sort it
% [val,idx] = sort(Dt);
% get unique dates
Dt_uq = unique(Dt);
% create a cell of storage
DataAtDate = cell(length(Dt_uq),1);
% loop over unique dates
for i = 1:length(Dt_uq)
Dt1 = Dt_uq(i);
% get all of the same type
lg = Dt == Dt1;
% index matrix/cell to do something
disp(Dt(lg,:))
% do something with the data or matrix or table... e.g. store it in a cell
DataAtDate{i} = Dt(lg,:);
end

insert value in a matrix in a for loop

I wrote this matlab code in order to concatenate the results of the integration of all the columns of a matrix extracted form a multi matrix array.
"datimf" is a matrix composed by 100 matrices, each of 224*640, vertically concatenated.
In the first loop i select every single matrix.
In the second loop i integrate every single column of the selected matrix
obtaining a row of 640 elements.
The third loop must concatenate vertically all the lines previously calculated.
Anyway i got always a problem with the third loop. Where is the error?
singleframe = zeros(224,640);
int_frame_all = zeros(1,640);
conc = zeros(100,640);
for i=0:224:(22400-224)
for j = 1:640
for k = 1:100
singleframe(:,:) = datimf([i+1:(i+223)+1],:);
int_frame_all(:,j) = trapz(singleframe(:,j));
conc(:,k) = vertcat(int_frame_all);
end
end
end
An alternate way to do this without using any explicit loops (edited in response to rayryeng's comment below. It's also worth noting that using cellfun may not be more efficient than explicitly looping.):
nmats = 100;
nrows = 224;
ncols = 640;
datimf = rand(nmats*nrows, ncols);
% convert to an nmats x 1 cell array containing each matrix
cellOfMats = mat2cell(datimf, ones(1, nmats)*nrows, ncols);
% Apply trapz to the contents of each cell
cellOfIntegrals = cellfun(#trapz, cellOfMats, 'UniformOutput', false);
% concatenate the results
conc = cat(1, cellOfIntegrals{:});
Taking inspiration from user2305193's answer, here's an even better "loop-free" solution, based on reshaping the matrix and applying trapz along the appropriate dimension:
datReshaped = reshape(datimf, nrows, nmats, ncols);
solution = squeeze(trapz(datReshaped, 1));
% verify solutions are equivalent:
all(solution(:) == conc(:)) % ans = true
I think I understand what you want. The third loop is unnecessary as both the inner and outer loops are 100 elements long. Also the way you have it you are assigning singleframe lots more times than necessary since it does not depend on the inner loops j or k. You were also trying to add int_frame_all to conc before int_frame_all was finished being populated.
On top of that the j loop isn't required either since trapz can operate on the entire matrix at once anyway.
I think this is closer to what you intended:
datimf = rand(224*100,640);
singleframe = zeros(224,640);
int_frame_all = zeros(1,640);
conc = zeros(100,640);
for i=1:100
idx = (i-1)*224+1;
singleframe(:,:) = datimf(idx:idx+223,:);
% for j = 1:640
% int_frame_all(:,j) = trapz(singleframe(:,j));
% end
% The loop is uncessary as trapz can operate on the entire matrix at once.
int_frame_all = trapz(singleframe,1);
%I think this is what you really want...
conc(i,:) = int_frame_all;
end
It looks like you're processing frames in a video.
The most efficent approach in my experience would be to reshape datimf to be 3-dimensional. This can easily be achieved with the reshape command.
something along the line of vid=reshape(datimf,224,640,[]); should get you far in this regard, where the 3rd dimension is time. vid(:,:,1) then would display the first frame of the video.

save vectors of different sizes in matrix

I would like to divide a vector in many vectors and put all of them in a matrix. I got this error "Subscripted assignment dimension mismatch."
STEP = zeros(50,1);
STEPS = zeros(50,length(locate));
for i = 1:(length(locate)-1)
STEP = filtered(locate(i):locate(i+1));
STEPS(:,i) = STEP;
end
I take the value of "filtered" from (1:50) at the first time for example and I would like to stock it in the first row of a matrix, then for iterations 2, I take value of "filtered from(50:70) for example and I stock it in row 2 in the matrix, and this until the end of the loop..
If someone has an idea, I don't get it! Thank you!
As mentioned in the comments, to make it work you can edit the loopy code at the end with -
STEPS(1:numel(STEP),i) = STEP;
Also, output array STEPS doesn't seem to use the last column. So, the initialization could use one less column, like so -
STEPS = zeros(50,length(locate)-1);
All is good with the loopy code, but in the long run with a high level language like MATLAB, you might want to look for faster codes and one way to achieve that would be vectorized codes. So, let me suggest a vectorized solution using bsxfun's masking capability to process such ragged-arrays. The implementation to cover generic elements in locate would look something like this -
% Get differentiation, which represent the interval lengths for each col
diffs = diff(locate)+1;
% Initialize output array
out = zeros(max(diffs),length(locate)-1);
% Get elements from filtered array for setting into o/p array
vals = filtered(sort([locate(1):locate(end) locate(2:end-1)]));
% Use bsxfun to create a mask that are to be set in o/p array and set thereafter
out(bsxfun(#ge,diffs,(1:max(diffs)).')) = vals;
Sample run for verification -
>> % Inputs
locate = [6,50,70,82];
filtered = randi(9,1,120);
% Get extent of output array for number of rows
N = max(diff(locate))+1;
>> % Original code with corrections
STEP = zeros(N,1);
STEPS = zeros(N,length(locate)-1);
for i = 1:(length(locate)-1)
STEP = filtered(locate(i):locate(i+1));
STEPS(1:numel(STEP),i) = STEP;
end
>> % Proposed code
diffs = diff(locate)+1;
out = zeros(max(diffs),length(locate)-1);
vals = filtered(sort([locate(1):locate(end) locate(2:end-1)]));
out(bsxfun(#ge,diffs,(1:max(diffs)).')) = vals;
>> max_error = max(abs(out(:)-STEPS(:)))
max_error =
0

Foreach loop problems in MATLAB

I have the following piece of code:
for query = queryFiles
queryImage = imread(strcat('Queries/', query));
queryImage = im2single(rgb2gray(queryImage));
[qf,qd] = vl_covdet(queryImage, opts{:}) ;
for databaseEntry = databaseFiles
entryImage = imread(databaseEntry.name);
entryImage = im2single(rgb2gray(entryImage));
[df,dd] = vl_covdet(entryImage, opts{:}) ;
[matches, H] = matchFeatures(qf,qf,df,dd) ;
result = [result; query, databaseEntry, length(matches)];
end
end
It is my understanding that it should work as a Java/C++ for(query:queryFiles), however the query appears to be a copy of the queryFiles. How do I iterate through this vector normally?
I managed to sort the problem out. It was mainly to my MATLAB ignorance. I wasn't aware of cell arrays and that's the reason I had this problem. That and the required transposition.
From your code it appears that queryFiles is a numeric vector. Maybe it's a column vector? In that case you should convert it into a row:
for query = queryFiles.'
This is because the for loop in Matlab picks a column at each iteration. If your vector is a single column, it picks the whole vector in just one iteration.
In MATLAB, the for construct expects a row vector as input:
for ii = 1:5
will work (loops 5 times with ii = 1, 2, ...)
x = 1:5;
for ii = x
works the same way
However, when you have something other than a row vector, you would simply get a copy (or a column of data at a time).
To help you better, you need to tell us what the data type of queryFiles is. I am guessing it might be a cell array of strings since you are concatenating with a file path (look at fullfile function for the "right" way to do this). If so, then a "safe" approach is:
for ii = 1:numel(queryFiles)
query = queryFiles{ii}; % or queryFiles(ii)
It is often helpful to know what loop number you are in, and in this case ii provides that count for you. This approach is robust even when you don't know ahead of time what the shape of queryFiles is.
Here is how you can loop over all elements in queryFiles, this works for scalars, row vectors, column vectors and even high dimensional matrices:
for query = queryFiles(:)'
% Do stuff
end
Is queryFiles a cell array? The safest way to do this is to use an index:
for i = 1:numel(queryFiles)
query = queryFiles{i};
...
end

Matlab - Access index of max value in for loop and use it to remove values from array

I would like to recursively find the maximum value in a series of matrices (column 8, to be specific), then use the index of that maximum value to set all values in the array with index up to the max index to NaN (for columns 14:16). It is straight forward to find the max value and index, but using a for loop to do it for multiple arrays I am stumped.
Here is how I can do it without a for loop:
[C,Max] = max(wy2000(:,8));
wy2000(1:Max,14:16) = NaN;
[C,Max] = max(wy2001(:,8));
wy2001(1:Max,14:16) = NaN;
[C,Max] = max(wy2002(:,8));
wy2002(1:Max,14:16) = NaN;
and so on and so forth...
Here are two ways I have tried using a for loop:
startyear = 2000;
endyear = 2009;
for n=startyear:endyear
currentYear = sprintf('wy%d',n);
[C,Max] = max(currentYear(:,8));
currentYear(1:Max,14:16) = NaN;
end
Here is another way I tried, using the eval function
for n=2000:2009;
currentYear = ['wy' int2str(n)];
var2 = ['maxswe' int2str(n)];
eval([var2 ' = max(currentYear(:,8))']);
end
In both cases, the problem seems to be that MATLAB doesn't recognize the 'currentYear' variable to be the array that corresponds to the wyXXXX that I already have created in my workspace.
Based on Peters answer, here is some more info about my data. I am starting with a matrix of data called all_data which holds 16 columns of data, spanning the time period 1982 - 2012. I am only interested in the period 2000 - 2009, and I am also interested in analyzing each year individually (2000, 2001,...,2009).
To get the data into individual years, I use the following code:
for n=2000:2009;
s = datenum(n-1,10,1);
e = datenum(n,9,30);
startcell = find(TIME(:,7)==s);
endcell = find(TIME(:,7)==e);
var1 = ['wy' int2str(n)];
eval([var1 '= all_data3(startcell:endcell,:)']);
eval(['save ', var1]);
end
For clarification, it is the period 10/1/YEAR1 to 9/30/YEAR2 that I am interested in, and TIME is a matrix holding the dates and times of my data.
So at the end of the above for-loop, I have a new matrix for each water-year (wy). I then want to find the date of maximum snow-accumulation (column 8) and exclude all data prior to that date from my analysis. this is where the original question comes from.
Peter's solution works, but I was hoping to find a more simple solution to find the max date and set the values prior to that date to NaN, without having to declare a bunch of variables (or entries in a cell array).
If I could write a loop that would create the cell array that Peter suggested based on a start and end year, that would make the code transferable to other datasets, but when i try to do this I run into the issue that the index for the cell-array is 1:length(years), but the wy arrays are named according to the actual year, so there is an inconsistency when using the eval function.
Matt
You've discovered the problem with eval and dynamically named variables. They're messy. I'd recommend recoding this as a cell array, with the cell array index being the index for the year:
years = 2000:2009;
wy{1} = wy2000;
wy{2} = wy2001;
% etc...
% Then,
for n=1:length(years)
[C, maxval] = max(wy{n}(:,8));
% etc.
end
You really only need the actual year when you input the data and when you display it. Now, if you're starting from a huge pile of arrays already named this way, that's the time to use eval: to convert them into this form that's easier to use. Just form the eval strings so they read, for example, 'wy{1} = wy2000;'