Process big data in MATLAB

I have to read and then process a huge amount of data (a matrix of ~40,000,000 x 19).
The first step is to read the data:
Array = load('vort.dat');
The file 'vort.dat' contains ~40,000,000 (imax*jmax*kmax) lines of 19 columns each; the first line is:
3.53080034E-03 0.00000000 1.25000002E-02 63.0216064 -3.03968048 -358.802948 -744.902588 -2.51340670E-10 2.11566061E-04 18.6898212 72.3569489 0.727692425 0.754972637 0.661218643 1.50408816 1.87408039E-03 5.69900125E-03 0.00000000 0.00000000
Then I loop over the Array and store the various values into separate arrays for the post-processing:
imax=511;
jmax=160;
kmax=399;
i = 1; j = 1; k = 1; % initialize the index variables
for q = 1:size(Array,1)
    Rp(k,j,i) = Array(q,1);
    yp(k,j,i) = (0.5 - Rp(k,j,i))*360;
    ...
    % update index variables (k varies fastest, then i, then j)
    k = k+1;
    if (k > kmax)
        k = 1;
        i = i+1;
        if (i > imax)
            i = 1;
            j = j+1;
            if (j > jmax)
                j = 1;
            end
        end
    end
end
Then the post-processing starts!
The problem is that MATLAB crashes without a warning during the data processing or during the plotting of figures!
I already set the stack size to unlimited (ulimit -s unlimited).
My second idea was to work with memmapfile; it seems to run, but the plots from the post-processing show that it does not read the right data!
%%% Array = load('vort.dat');
m=memmapfile('data.dat','Format',{'double',[imax*jmax*kmax 19], 'x'},'repeat', 1);
Array=m.data.x;

If you're running out of memory during pre-processing, you may want to clear MATLAB's memory before loading massive amounts of data:
clear('all');
Array = load('vort.dat');
%'Here continues the pre-processing'
If you're running out of memory during post-processing, you may want to clear massive variables once they're not used anymore. For example, since Array is not used anymore after pre-processing, begin your post-processing with:
clear('Array');
or, simpler:
Array = 0;
Given the size of your matrix, this should free enough memory to allow you to carry on with post-processing and reporting.
So, as an example, the script would look like:
%//Preparing
clear('all'); %//Start with fresh memory
dbstop('if', 'error'); %//Trap uncaught exceptions
%//Loading
A = load('vort.dat');
%//Pre-processing, vectorized operations
I = 511;
J = 160;
K = 399;
Rp = permute(reshape(A(:,1),K,I,J), [1 3 2]);
yp = (0.5 - Rp)*360;
%//...
%//Post-processing
clear('A'); %//and other vars not needed anymore
%//...
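As a quick sanity check that the reshape/permute really reproduces the loop's ordering (k fastest, then i, then j), you can compare both on toy sizes. This is just an illustrative standalone sketch with made-up dimensions (Ii, Jj, Kk), so it doesn't clobber the real I, J, K:
Ii = 3; Jj = 4; Kk = 5;
a1 = (1:Ii*Jj*Kk)'; %//Stand-in for A(:,1)
Rp1 = permute(reshape(a1, Kk, Ii, Jj), [1 3 2]); %//Kk-by-Jj-by-Ii
Rp2 = zeros(Kk, Jj, Ii);
q = 1;
for j = 1:Jj %//j increments slowest, as in the question's loop
    for i = 1:Ii
        for k = 1:Kk %//k increments fastest
            Rp2(k,j,i) = a1(q);
            q = q + 1;
        end
    end
end
isequal(Rp1, Rp2) %//returns logical 1 if the two orderings agree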

Related

Declaring a vector in MATLAB whose size we don't know

Suppose we are running an infinite for loop in MATLAB, and we want to store the iterative values in a vector. How can we declare the vector without knowing the size of it?
z=??
for i=1:inf
    z(i,1)=i;
    if(condition) %%condition is met then break out of the loop
        break;
    end;
end;
Please note first that this is bad practice, and you should preallocate where possible.
That being said, using the end keyword is the best option for extending arrays by a single element:
z = [];
for ii = 1:x
    z(end+1, 1) = ii; % Index to the (end+1)th position, extending the array
end
You can also concatenate results from previous iterations; this tends to be slower, since you have the assignment variable on both sides of the equals operator:
z = [];
for ii = 1:x
    z = [z; ii];
end
Sadar commented that directly indexing out of bounds (as other answers suggest) is deprecated by MathWorks; I'm not sure of a source for this.
If your condition computation is separate from the output computation, you could get the required size first:
k = 0;
while ~condition
    condition = true; % evaluate the condition here
    k = k + 1;
end
z = zeros( k, 1 ); % now we can pre-allocate
for ii = 1:k
    z(ii) = ii; % assign values
end
Depending on your use case, you might not know the actual number of iterations (and therefore vector elements), but you might know the maximum possible number of iterations. As said before, resizing a vector in each loop iteration can be a real performance bottleneck, so you might consider something like this:
maxNumIterations = 12345;
myVector = zeros(maxNumIterations, 1);
for n = 1:maxNumIterations
    myVector(n) = someFunctionReturningTheDesiredValue(n);
    if (condition)
        vecLength = n;
        break;
    end
end
% Resize the vector to the length that has actually been filled
myVector = myVector(1:vecLength);
By the way, I'd advise you NOT to get used to using i as an index in MATLAB programs, as this masks the imaginary unit i. I ran into some nasty bugs in complex calculations inside loops by doing so, so I would advise just taking n or any other letter of your choice as your go-to loop index variable name, even if you are not dealing with complex values in your functions ;)
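For illustration, a tiny sketch of the shadowing (variable names are arbitrary):
for i = 1:3; end % after the loop, i is just the number 3
z1 = 1 + 2*i; % 7, using the leftover loop index, not a complex number
z2 = 1 + 2i; % the literal form 2i is unaffected: 1.0000 + 2.0000i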
You can just declare an empty matrix with
z = [];
This will create a 0x0 matrix which will resize when you write data to it.
In your case it will grow to an i-by-1 vector.
Keep in mind that this is much slower than initializing your vector beforehand with the zeros(dim,dim) function.
So if there is any way to figure out the max value of i, you should initialize it with z = zeros(i,1).
You can initialize z to be an empty array; it'll expand automatically during looping, something like:
z = [];
for i = 1:Inf
    z(i) = i;
    if (condition)
        break;
    end
end
However, this looks nasty (and throws a warning: Warning: FOR loop index is too large. Truncating to 9223372036854775807). I would use a while loop on the condition itself and increment manually:
z = [];
i = 0;
while ~condition
    i = i + 1;
    z(i) = i;
end
And/or, if your example is really what you need at the end, replace the re-creation of the array with something like:
while ~condition
    i = i + 1;
end
z = 1:i;
As mentioned several times in this thread, resizing an array is very processing-intensive and can take a lot of time.
If processing time is not an issue:
Then something like @Wolfie mentioned would be good enough. In each iteration the array length will be increased, and that is that:
z = [];
for ii = 1:x
    %z = [z; ii];
    z(end+1) = ii; % Best way
end
If processing time is an issue:
If the processing time is a large factor and you want it to run as smoothly as possible, then you need to preallocate. If you have a rough idea of the maximum number of iterations that will run, then you can use @PluginPenguin's suggestion. But there could still be a chance of hitting that preset limit, which will break (or severely slow down) the program.
My suggestion:
If your loop runs indefinitely until you stop it, you could do occasional resizing: essentially extending the size as you go, but only doing it once in a while, for example every 100 loops:
z = zeros(100,1);
for i = 1:inf
    z(i,1) = i;
    fprintf("%d,\t%d\n", i, length(z)); % See it working
    if i+1 >= length(z) % The array has run out of space
        %z = [z; zeros(100,1)]; % Extend this array (note the semi-colon)
        z((length(z)+100),1) = 0; % Seems twice as fast as the commented method
    end
    if (condition) %% condition is met, then break out of the loop
        break;
    end
end
This means that the loop can run forever and the array will grow with it, but only once in a while, so the processing-time hit will be minimal.
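If you want to measure the difference yourself (timings vary by machine and MATLAB version), a rough sketch along these lines works; N is just an arbitrary test length:
N = 1e5;
tic
z = [];
for ii = 1:N
    z(end+1,1) = ii; % grow by one element per iteration
end
toc
tic
z = zeros(100,1);
n = 0;
for ii = 1:N
    n = n + 1;
    if n > numel(z)
        z(2*numel(z),1) = 0; % double the allocation when full
    end
    z(n) = ii;
end
z = z(1:n);
toc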
Edit:
As @Cris kindly mentioned, MATLAB already does what I proposed internally. This makes two of my comments completely wrong. So the best option is to follow what @Wolfie and @Cris said, with:
z(end+1) = i;
Hope this helps!

Create a matrix combining many variables by using their names and a for loop

Suppose I have n .mat files, each named as follows: a1, a2, ..., an.
Within each of these .mat files there is a variable called var (an n-by-n matrix).
I would like to create a matrix A = [a1.var, a2.var, ..., an.var] without writing it all out, because there are many .mat files.
A for-loop comes to mind, something like this:
A = []
for i = 1:n
    [B] = ['a',num2str(i),'.mat',var];
    A = [A B]
end
but this doesn't seem to work, even for the simplest case where the variables aren't stored as a(i) but rather as 'a1', 'a2', etc.
Thank you very much!
Load and concatenate 'var' from each of 'a(#).mat':
n = 10;
for i = n:-1:1 % 1
file_i = sprintf('a%d.mat', i); % 2
t = load(file_i, 'var');
varsCell{i} = t.var; % 3
end
A = [varsCell{:}]; % concatenate each 'var' in one step.
Here are some comments on the above code. All the memory-related stuff isn't very important here, but it's good to keep in mind during larger projects.
1)
In MATLAB, it is rarely a good idea or necessary to grow variables during a for loop. Each time an element is added, MATLAB must find and allocate a new block of RAM. This can really slow things down, especially for long loops or large variables. When possible, pre-allocate your variables (A = zeros(n,n*n)). Alternatively, it sometimes works to count backwards in the loop: MATLAB then pre-allocates the whole array on the first iteration, since you're effectively telling it the final size (see the sketch after these notes).
2)
Equivalent to file_i = ['a',num2str(i),'.mat'] in this case, sprintf can be clearer and more powerful.
3)
Store each 'var' in a cell array. This is a balance between allocating all the needed memory and the complication of indexing into the correct places of a preallocated array. Internally, the cell array is a list of pointers to the location of each loaded 'var' matrix.
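As a tiny illustration of note 1, counting backwards allocates the full array on the first pass (toy example):
clear c
for i = 5:-1:1
    c(i) = i^2; % i = 5 runs first, so c is allocated as 1x5 immediately
end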
To create a test set, generate n matrices of n-by-n random doubles and save each as 'a(#).mat' in the current directory:
for i = 1:n
    var = rand(n);
    save(sprintf('a%d.mat',i), 'var');
end
Code
%%// The final result, A, would have size n-by-(n*n)
A = zeros(n,n*n); %%// Pre-allocation for better performance
for k = 1:n
    load(strcat('a',num2str(k),'.mat'))
    A(1:n,(k-1)*n+1:(k-1)*n+n) = var;
end

How to read binary file in one block rather than using a loop in matlab

I have this file which is a series of x, y, z coordinates of over 34 million particles and I am reading them in as follows:
parfor i = 1:Ntot
    x0(i,1) = fread(fid, 1, 'real*8')';
    y0(i,1) = fread(fid, 1, 'real*8')';
    z0(i,1) = fread(fid, 1, 'real*8')';
end
Is there a way to read this in without a loop? It would greatly speed up the read-in; I just want three vectors with x, y, z. Thanks, and other suggestions are welcome.
I do not have a machine with MATLAB, and I don't have your file to test either, but I think coordinates = fread(fid, [3, Ntot], 'real*8') should work fine.
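To split the result into the three vectors, a minimal untested sketch (assuming the file stores x, y, z interleaved per particle, as the loop implies):
coordinates = fread(fid, [3, Ntot], 'real*8'); % 3-by-Ntot; fread fills column by column
x0 = coordinates(1,:).'; % back to Ntot-by-1 column vectors
y0 = coordinates(2,:).';
z0 = coordinates(3,:).';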
Maybe fread is the function you are looking for.
You're right. Reading data in larger batches is usually a key part of speeding up file reads. Another part is pre-allocating the destination variables, for example with a call to zeros.
I would do something like this:
%Pre-allocate
x0 = zeros(Ntot,1);
y0 = zeros(Ntot,1);
z0 = zeros(Ntot,1);
%Define a desired batch size. Make this as large as you can, given available memory.
batchSize = 10000;
%Use while to step through the file
indexCurrent = 1; %indexCurrent is the next element which will be read
while indexCurrent <= Ntot
    %At the end of the file, we may need to read less than batchSize
    currentBatch = min(batchSize, Ntot-indexCurrent+1);
    %Load a batch of data
    tmpLoaded = fread(fid, currentBatch*3, 'real*8')';
    %Deal the fread data into the desired three variables
    x0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(1:3:end);
    y0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(2:3:end);
    z0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(3:3:end);
    %Update index variable
    indexCurrent = indexCurrent + batchSize;
end
Of course, make sure you test this, as I have not. I'm always suspicious of off-by-one errors in this sort of work.
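A quick way to build that confidence is a tiny round-trip test with known values (hypothetical file name, and a batchSize deliberately smaller than Ntot so the batching logic is exercised):
Ntot = 7;
fid = fopen('test_xyz.bin','w');
fwrite(fid, 1:3*Ntot, 'real*8'); % x1 y1 z1 x2 y2 z2 ...
fclose(fid);
fid = fopen('test_xyz.bin','r');
% run the batched reader above with batchSize = 3, then check:
% x0 should equal (1:3:19)', y0 should equal (2:3:20)', z0 should equal (3:3:21)'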

Placing data in structures in MATLAB?

I have a loop which iterates 97 times, and there are two arrays:
frequency[1024]
strength[1024]
These arrays change values after each iteration of the loop, so before their values change I need to put them into a structure. For instance, the structure would be something like:
s(1).frame=1 %this will show the iteration no.
s(1).str=strength
s(1).freq=frequency
Now I need 97 such structures, say s(1) to s(97), in an array.
My question is: how can I create an array of structures within my loop? Please help me.
I like to iterate backward in cases like this, as this forces a full memory allocation the first time the loop is executed. Then the code would look something like this:
%Reset the structure
s = struct;
for ix = 97:-1:1
    %Do stuff
    %Store the data
    s(ix).frame = ix;
    s(ix).str = strength;
    s(ix).freq = frequency;
end
If one frame depends on the one before it, or you don't know how many total frames there will be, you can scan forwards. 97 frames is not a lot of data, so you probably don't need to worry too much about optimizing the pre-allocation portion of the problem.
%Reset the structure
s = struct;
for ix = 1:97
    %Do stuff
    %Store the data
    s(ix).frame = ix;
    s(ix).str = strength;
    s(ix).freq = frequency;
end
Or, if you really need the performance of a pre-allocated array of structures, but you don't know how large it will be at the onset, you can do something like this:
%Reset the structure
s = struct;
for ix = 1:97
    %Do stuff
    %Extend if needed
    if length(s) < ix
        s(ix*2).frame = nan; %Double the allocation every time you reach the end
    end
    %Store the data
    s(ix).frame = ix;
    s(ix).str = strength;
    s(ix).freq = frequency;
end
%Clip the extra allocation
s = s(1:ix);

Out-of-memory algorithms for addressing large arrays

I am trying to deal with a very large dataset. I have k = ~4200 matrices (of varying sizes) which must be compared combinatorially, skipping non-unique and self comparisons. Each of the k(k-1)/2 comparisons produces a matrix, which must be indexed against its parents (i.e. one can find out where it came from). The convenient way to do this is to (triangularly) fill a k-by-k cell array with the result of each comparison. These are ~100 x ~100 matrices, on average. Using single-precision floats, it works out to 400 GB overall.
I need to 1) generate the cell array or pieces of it without trying to place the whole thing in memory and 2) access its elements (and their elements) in like fashion. My attempts have been inefficient due to reliance on MATLAB's eval() as well as save and clear occurring in loops.
for i=1:k
    [~,m] = size(data{i});
    cur_var = ['H' int2str(i)];
    %# if i == 1; save('FileName'); end; %# If using a single MAT file and need to create it.
    eval([cur_var ' = cell(1,k-i);']);
    for j=i+1:k
        [~,n] = size(data{j});
        eval([cur_var '{i,j} = zeros(m,n,''single'');']);
        eval([cur_var '{i,j} = compare(data{i},data{j});']);
    end
    save(cur_var,cur_var); %# Add '-append' when using a single MAT file.
    clear(cur_var);
end
The other thing I have done is to perform the split when mod((i+j-1)/2, max(factor(k*(k-1)/2))) == 0. This divides the result into the largest number of same-size pieces, which seems logical. The indexing is a little more complicated, but not too bad, because a linear index can be used.
Does anyone know/see a better way?
Here's a version that combines going fast with using minimal memory.
I use fwrite/fread so that you can still use parfor (and this time, I made sure it works :) )
%# assume data is loaded and k is known
%# find the index pairs for comparisons. This could be done more elegantly, I guess.
%# I'm constructing a lower triangular array, i.e. an array that has ones wherever
%# we want to compare i (row) and j (col). Then I use find to get i and j
[iIdx,jIdx] = find(tril(ones(k,k),-1));
%# create a directory to store the comparisons
mkdir('H_matrix_elements')
savePath = fullfile(pwd,'H_matrix_elements');
%# loop through all comparisons in parallel. This way there may be a bit more overhead from
%# the individual function calls. However, parfor is most efficient if there are
%# a lot of relatively similarly fast iterations.
parfor ct = 1:length(iIdx)
    %# make the comparison - use double b/c there shouldn't be a memory issue
    currentComparison = compare(data{iIdx(ct)},data{jIdx(ct)});
    %# create save-name as H_i_j, e.g. H_104_23
    saveName = fullfile(savePath,sprintf('H_%i_%i',iIdx(ct),jIdx(ct)));
    %# save. Since 'save' is not allowed inside parfor, use fwrite to write the data to disk
    fid = fopen(saveName,'w');
    %# for simplicity: save the data as a vector, prepending two elements
    %# that store the size of the array; write as 'double' (fwrite defaults to uint8)
    fwrite(fid,[size(currentComparison)';currentComparison(:)],'double');
    %# close file
    fclose(fid);
end
%# to read e.g. comparison H_104_23
fid = fopen(fullfile(savePath,'H_104_23'),'r');
tmp = fread(fid,'double'); %# match the 'double' precision used when writing
fclose(fid);
%# reshape into a 2D array
data = reshape(tmp(3:end),tmp(1),tmp(2));
You can get rid of the eval and clear calls by assigning the filename separately.
for i=1:k
    [~,m] = size(data{i});
    file_name = ['H' int2str(i)];
    cur_var = cell(1, k-i);
    for j=i+1:k
        [~,n] = size(data{j});
        cur_var{i,j} = zeros(m, n, 'single');
        cur_var{i,j} = compare(data{i}, data{j});
    end
    save(file_name, 'cur_var'); % save takes the variable's name as a string
end
If you need the saved variables to take different names, use the -struct option to save:
str.(file_name) = cur_var;
save(file_name, '-struct', 'str');
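Each MAT-file then contains a variable named after the file itself (e.g. H3 instead of cur_var), so reading one back might look like this (hypothetical file name):
tmp = load('H3'); %# tmp.H3 holds the stored cell array
H3 = tmp.H3;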