Out-of-memory algorithms for addressing large arrays - matlab

I am trying to deal with a very large dataset. I have k = ~4200 matrices (varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. can find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. These are ~100 X ~100 matrices, on average. Using single precision floats, it works out to 400 GB overall.
I need to 1) generate the cell array or pieces of it without trying to place the whole thing in memory and 2) access its elements (and their elements) in like fashion. My attempts have been inefficient due to reliance on MATLAB's eval() as well as save and clear occurring in loops.
for i=1:k
[~,m] = size(data{i});
cur_var = ['H' int2str(i)];
%# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
eval([cur_var ' = cell(1,k-i);']);
for j=i+1:k
[~,n] = size(data{j});
eval([cur_var '{i,j} = zeros(m,n,''single'');']);
eval([cur_var '{i,j} = compare(data{i},data{j});']);
end
save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
clear(cur_var);
end
The other thing I have done is to perform the split when mod((i+j-1)/2,max(factor(k(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad because a linear index could be used.
Does anyone know/see a better way?

Here's a version that combines going fast with using minimal memory.
I use fwrite/fread so that you still can use parfor (and this time, I made sure it works :) )
%# assume data is loaded an k is known
%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));
%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');
%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)
%# make the comparison - do double b/c there shouldn't be a memory issue
currentComparison = compare(data{iIdx(ct)},data{jIdx{ct});
%# create save-name as H_i_j, e.g. H_104_23
saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));
%# save. Since 'save' is not allowed, use fwrite to write the data to disk
fid = fopen(saveName,'w');
%# for simplicity: save data as vector, add two elements to the beginning
%# to store the size of the array
fwrite(fid,[size(currentComparison)';currentComparison(:)]); % ' #SO formatting
%# close file
fclose(fid)
end
%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
tmp = fread(fid);
fclose(fid);
%# reshape into 2D array.
data = reshape(tmp(3:end),tmp(1),tmp(2));

You can get rid of the eval and clear calls by assigning the filename separately.
for i=1:k
[~,m] = size(data{i});
file_name = ['H' int2str(i)];
cur_var = cell(1, k-i);
for j=i+1:k
[~,n] = size(data{j});
cur_var{i,j} = zeros(m, n, 'single');
cur_var{i,j} = compare(data{i}, data{j});
end
save(file_name, cur_var);
end
If you need the saved variables to take different names, use the -struct option to save.
str.(file_name);
save(file_name, '-struct', str);

Related

Create a matrix combining many variables by using their names and a for loop

Suppose I have n .mat files and each are named as follows: a1, a2, ..., an
And within each of these mat files there is a variable called: var (nxn matrix)
I would like to create a matrix: A = [a1.var a2.var, ..., an.var] without writing it all out because there are many .mat files
A for-loop comes to mind, something like this:
A = []
for i = 1:n
[B] = ['a',num2str(i),'.mat',var];
A = [A B]
end
but this doesn't seem to work or even for the most simple case where I have variables that aren't stored as a(i) but rather 'a1', 'a2' etc.
Thank you very much!
load and concatenate 'var' from each of 'a(#).mat':
n = 10;
for i = n:-1:1 % 1
file_i = sprintf('a%d.mat', i); % 2
t = load(file_i, 'var');
varsCell{i} = t.var; % 3
end
A = [varsCell{:}]; % concatenate each 'var' in one step.
Here are some comment on the above code. All the memory-related stuff isn't very important here, but it's good to keep in mind during larger projects.
1)
In MATLAB, it is rarely a good idea or necessary to grow variables during a for loop. Each time an element is added, MATLAB must find and allocate a new block of RAM. This can really slow things down, especially for long loops or large variables. When possible, pre-allocate your variables (A = zeros(n,n*n)). Alternatively, it sometimes works to count backwards in the loop. MATLAB pre-allocates the whole array, since you're effectively telling it the final size.
2)
Equivalent to file_i = ['a',num2str(i),'.mat'] in this case, sprintf can be clearer and more powerful.
3)
Store each 'var' in a cell array. This is a balance between allocating all the needed memory and the complication of indexing into the correct places of a preallocated array. Internally, the cell array is a list of pointers to the location of each loaded 'var' matrix.
to create a test set...
generate 'n' matrices of n*n random doubles
save each as 'a(#).mat' in current directory
for i = 1:n
var = rand(n);
save(sprintf('a%d.mat',i), 'var');
end
Code
%%// The final result, A would have size nX(nXn)
A = zeros(n,n*n); %%// Pre-allocation for better performance
for k =1:n
load(strcat('a',num2str(k),'.mat'))
A(1:n,(k-1)*n+1:(k-1)*n+n) = var;
end

Linear indexing of 3D matrix: best way to expand the first 2 dims to a vector (slow sprintf on scalars)

I have a 3D vector s1(nmax,mmax,ntimeSTEPS). I want to take at each time step j (i.e. each value of the third dimension) all the elements of the first two dimensions and obtain a vector to give to sprintf. However, sprintf is PAINFULLY SLOW if inside a cycle! I checked the manual and it looks like there is no way to do that directly with linear indexing. Or am I missing something? I can only think of using reshape, but something like s1(:,j) would be the top, but that's not how MATLAB works. I did:
nmax = 800;
mmax =400;
nmax_x_mmax = nmax*mmax;
ntimeSTEPS = 1;
charINPUT = cell(nmax_x_mmax,1);
s1 = ones(nmax,mmax,ntimeSTEPS)*1234;
tic
for j=1:ntimeSTEPS
%... other stuff
input=reshape(s1(:,:,j),nmax_x_mmax,1);
for kk=1:length(input)
charINPUT{kk} = sprintf('%6.3f',input(kk));
end
%... other stuff (collecting movie frames etc)
end
toc
This on a single time steps takes 5.09 SECONDS on my i7 2.2 GHz! I am trying to do an animation and this is crazily slow. If I increase the size of the array its basically stuck.
Any suggestion for doing this with linear indexes?
Using sprintf
sprintf can take an array. Output with newlines and use regexp to parse out the digits and put them in a cell array of strings.
charINPUT = regexp(sprintf('%6.3f\n',s1(:)),'(?<=\s*)(\S*)(?=\n)','match')
Without sprintf
You don't have to use sprintf in a loop to build your cell array of strings. Since num2str takes a format specifier, you can just do this for the whole thing:
charINPUT = cellstr(num2str(s1(:),'%6.3f'))
You can either skip the loop over ntimeSTEPS entirely, or if you are performing other operations you are not showing that require the loop you can handle indexing as follows.
For direct indexing of s1 with no temporary variable, you can compute the linear indexes yourself via (1:nmax*nmax) + (j-1)*nmax*nmax.
for j=1:ntimeSTEPS,
stepInds = (1:nmax*nmax) + (j-1)*nmax*nmax;
charINPUT = cellstr(num2str(s1(stepInds),'%6.3f'))
end
Try this
for idx = 1:numel(s1)
charINPUT{idx} = sprintf('%6.3f',s1(idx));
end

MATLAB: vectors of different length

I want to create a MATLAB function to import data from files in another directory and fit them to a given model, but because the data need to be filtered (there's "thrash" data in different places in the files, eg. measurements of nothing before the analyzed motion starts).
So the vectors that contain the data used to fit end up having different lengths and so I can't return them in a matrix (eg. x in my function below). How can I solve this?
I have a lot of datafiles so I don't want to use a "manual" method. My function is below. All and suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
datafiles = dir(pattern);
path = fileparts(pattern);
p = NaN(length(datafiles));
y_func = [];
for i = 1:length(datafiles)
exist(strcat(path, '/', datafiles(i).name));
filename = datafiles(i).name;
data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
filedata = data.data/1e3;
xdata = filedata(:,xcol);
ydata = filedata(:,ycol);
filter = filedata(:,xcol) > xfilter(i);
x(i,:) = xdata(filter);
y(i,:) = ydata(filter);
y_c(i,:) = calib(y(i,:), p_calib);
error = #(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
p(i,:) = fminsearch(error, p_0);
y_func = [y_func; func(x(i,:), p(i,:))];
end
end
sample data: http://hastebin.com/mokocixeda.md
There are two strategies I can think of:
I would return the data in a vector of cells instead, where the individual cells store vectors of different lengths. You can access data the same way as arrays, but use curly braces: Say c{1}=[1 2 3], c{2}=[1 2 10 8 5] c{3} = [ ].
You can also filter the trash data upon reading a line, if that makes your vectors have the same length.
If memory is not an major issue, try filling up the vectors with distinct values, such as NaN or Inf - anything, that is not found in your measurements based on their physical context. You might need to identify the longest data-set before you allocate memory for your matrices (*). This way, you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea ... allocate memory based on the size of the largest file first. Fill it up with e.g. NaN's
matrix = zeros(length(datafiles), longest_file_line_number) .* NaN;
Then run your function. Determine the length of the longest consecutive set of data.
new_max = length(xdata(filter));
if new_max > old_max
old_max = new_max;
end
matrix(i, length(xdata(filter))) = xdata(filter);
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);

Matlab: how to implement a dynamic vector

I am refering to an example like this
I have a function to analize the elements of a vector, 'input'. If these elements have a special property I store their values in a vector, 'output'.
The problem is that at the begging I don´t know the number of elements it will need to store in 'output'so I don´t know its size.
I have a loop, inside I go around the vector, 'input' through an index. When I consider special some element of this vector capture the values of 'input' and It be stored in a vector 'ouput' through a sentence like this:
For i=1:N %Where N denotes the number of elements of 'input'
...
output(j) = input(i);
...
end
The problem is that I get an Error if I don´t previously "declare" 'output'. I don´t like to "declare" 'output' before reach the loop as output = input, because it store values from input in which I am not interested and I should think some way to remove all values I stored it that don´t are relevant to me.
Does anyone illuminate me about this issue?
Thank you.
How complicated is the logic in the for loop?
If it's simple, something like this would work:
output = input ( logic==true )
Alternatively, if the logic is complicated and you're dealing with big vectors, I would preallocate a vector that stores whether to save an element or not. Here is some example code:
N = length(input); %Where N denotes the number of elements of 'input'
saveInput = zeros(1,N); % create a vector of 0s
for i=1:N
...
if (input meets criteria)
saveInput(i) = 1;
end
end
output = input( saveInput==1 ); %only save elements worth saving
The trivial solution is:
% if input(i) meets your conditions
output = [output; input(i)]
Though I don't know if this has good performance or not
If N is not too big so that it would cause you memory problems, you can pre-assign output to a vector of the same size as input, and remove all useless elements at the end of the loop.
output = NaN(N,1);
for i=1:N
...
output(i) = input(i);
...
end
output(isnan(output)) = [];
There are two alternatives
If output would be too big if it was assigned the size of N, or if you didn't know the upper limit of the size of output, you can do the following
lengthOutput = 100;
output = NaN(lengthOutput,1);
counter = 1;
for i=1:N
...
output(counter) = input(i);
counter = counter + 1;
if counter > lengthOutput
%# append output if necessary by doubling its size
output = [output;NaN(lengthOutput,1)];
lengthOutput = length(output);
end
end
%# remove unused entries
output(counter:end) = [];
Finally, if N is small, it is perfectly fine to call
output = [];
for i=1:N
...
output = [output;input(i)];
...
end
Note that performance degrades dramatically if N becomes large (say >1000).

How can I create/process variables in a loop in MATLAB?

I need to calculate the mean, standard deviation, and other values for a number of variables and I was wondering how to use a loop to my advantage. I have 5 electrodes of data. So to calculate the mean of each I do this:
mean_ch1 = mean(ch1);
mean_ch2 = mean(ch2);
mean_ch3 = mean(ch3);
mean_ch4 = mean(ch4);
mean_ch5 = mean(ch5);
What I want is to be able to condense that code into a line or so. The code I tried does not work:
for i = 1:5
mean_ch(i) = mean(ch(i));
end
I know this code is wrong but it conveys the idea of what I'm trying to accomplish. I want to end up with 5 separate variables that are named by the loop or a cell array with all 5 variables within it allowing for easy recall. I know there must be a way to write this code I'm just not sure how to accomplish it.
You have a few options for how you can do this:
You can put all your channel data into one large matrix first, then compute the mean of the rows or columns using the function MEAN. For example, if each chX variable is an N-by-1 array, you can do the following:
chArray = [ch1 ch2 ch3 ch4 ch5]; %# Make an N-by-5 matrix
meanArray = mean(chArray); %# Take the mean of each column
You can put all your channel data into a cell array first, then compute the mean of each cell using the function CELLFUN:
meanArray = cellfun(#mean,{ch1,ch2,ch3,ch4,ch5});
This would work even if each chX array is a different length from one another.
You can use EVAL to generate the separate variables for each channel mean:
for iChannel = 1:5
varName = ['ch' int2str(iChannel)]; %# Create the name string
eval(['mean_' varName ' = mean(' varName ');']);
end
If it's always exactly 5 channels, you can do
ch = {ch1, ch2, ch3, ch4, ch5}
for j = 1:5
mean_ch(j) = mean(ch{j});
end
A more complicated way would be
for j = 1:nchannels
mean_ch(j) = eval(['mean(ch' num2str(j) ')']);
end
Apart from gnovice's answer. You could use structures and dynamic field names to accomplish your task. First I assume that your channel data variables are all in the format ch* and are the only variables in your MATLAB workspace. The you could do something like the following
%# Move the channel data into a structure with fields ch1, ch2, ....
%# This could be done by saving and reloading the workspace
save('channelData.mat','ch*');
chanData = load('channelData.mat');
%# Next you can then loop through the structure calculating the mean for each channel
flds = fieldnames(chanData); %# get the fieldnames stored in the structure
for i=1:length(flds)
mean_ch(i) = mean(chanData.(flds{i});
end