Why does my MATLAB program use so much memory?

I am writing a MATLAB program which reads about 500 files. Each file has 20,000 lines, with one number on each line. The program builds a 20,000 × 500 matrix from these numbers.
The numbers are stored as doubles, so 8 bytes per number. I would therefore expect this to take 20,000 * 500 * 8 = 8E7 bytes, i.e. about 80 MB. And yet this program exhausts my 16 GB of memory; as it runs, I can see the memory use climbing steadily, GB by GB. I am using MATLAB R2015b on Ubuntu 14.04.
What's happening? Many thanks for your attention.
Here is the full code:
clear all;
% number of RNA entries in each file
filesize = 20532;
myFolder = '~/_STATS/data3/RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3';
filePattern = fullfile(myFolder, '*genes.normalized_results');
theFiles = dir(filePattern);
rnaCounts = NaN(filesize, length(theFiles));
for k = 1 : length(theFiles)
    mrnaFilename = strtrim(theFiles(k).name);
    fprintf(1, 'Now reading mrnaFile %d %s \n', k, mrnaFilename);
    % read rna file
    fullFileName = fullfile(myFolder, mrnaFilename);
    rnafid = fopen(fullFileName);
    if rnafid < 0
        fprintf('====ERROR OPENING RNA FILE =====================');
    end
    rnaline = fgets(rnafid); % read the header line
    lc = 1;                  % line counter
    while ischar(rnaline) && feof(rnafid) ~= 1
        rnaline = fgets(rnafid);
        rnaSplit = strsplit(rnaline);
        % write to the matrix
        rnaCounts(lc, k) = str2num(rnaSplit{2});
        lc = lc + 1;
    end
    fclose(rnafid);
end

As verified by the OP, the str2num function in the Linux version of MATLAB R2015b has a memory leak. str2num is rarely the right tool anyway: it is designed to parse strings representing entire matrices ('1 2; 3 4') rather than the typical case of a single number ('1.234'), and it does so by calling eval internally. Use str2double for simple number parsing; it is faster even when str2num isn't broken.
It is likely that switching to a different MATLAB version would also work around the problem, because in my experience these kinds of memory bugs rarely persist from one release to the next.
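In the posted loop the fix is a one-line change; a minimal sketch, using the variable names from the question:
% Parse the second whitespace-separated field as a single double.
% Unlike str2num, str2double does not go through eval and avoids
% the leak described above.
rnaSplit = strsplit(rnaline);
rnaCounts(lc, k) = str2double(rnaSplit{2});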

Often, high-level I/O functions such as dlmread or textscan are useful for reading such text formats: use dlmread if you have only numeric data, and textscan for more complex formats.
The sample data you provided is:
A2LD1|87769 135.5735
As you only need the number in the second column and discard the identifier in the first column, all you are left with is numeric data, so you can use dlmread:
data = dlmread(fullFileName, '\t', 1, 1);
The '\t' specifies that the delimiter (column separator) is a tab. The two 1s specify a row offset and a column offset, i.e. ignore the first row (the header) and the first column (the id) of the file.
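Put together, the whole fgets/strsplit/str2num loop from the question could collapse to one dlmread call per file. A sketch under the same assumptions (tab-delimited files, one header row, counts in the second column; variable names taken from the question):
% Sketch: one dlmread call per file replaces the line-by-line loop.
for k = 1:length(theFiles)
    fullFileName = fullfile(myFolder, strtrim(theFiles(k).name));
    % Skip the header row (row offset 1) and the id column (column offset 1).
    data = dlmread(fullFileName, '\t', 1, 1);
    rnaCounts(1:size(data,1), k) = data(:,1);  % normalized counts
end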

Related

Optimizing reading the data in Matlab

I have a large data file formatted as a single column of n rows. Each row is either a real number or the string 'No Data'. I have imported this text as an n×1 cell array named Data. Now I want to filter the data and create an n×1 numeric array with NaN in place of 'No Data'. I have managed to do it with a simple loop (see below); the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
    if Data{i}(1) ~= 'N'
        z(i) = str2double(Data{i});
    else
        z(i) = NaN;
    end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break, since you are working with a single column... this prevents "No data" from producing two columns because of its internal whitespace.
Format: you want numeric values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
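As noted above, the parsed numbers come back as data.Var1; a short usage sketch (the NaN count line is just an illustrative check):
z = data.Var1;               % plain double vector, NaN where 'No data' was
numMissing = nnz(isnan(z));  % how many rows were 'No data'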
If you run the script below you can find out which approach is faster: it creates an 11 MB text file and reads it with the various approaches.
filename = 'data.txt';

%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val < 0.01
        fwrite(fid, sprintf('%s\n','No Data'));
    else
        fwrite(fid, sprintf('%f\n', val*1000));
    end
end
fclose(fid);

%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc

%% Camilo Rada
tic
[txtMat, nLines] = txt2mat(filename);
NoData = txtMat(:,1) == 'N';
z = zeros(nLines,1);
z(NoData) = nan;
toc

%% Gelliant
tic
fid = fopen(filename,'rt');
z = textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z = z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow: even without converting the resulting string matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some internal sanity checks and does not convert the result into a table.
Depending on how big your files are and how often you read such files, you might want to go beyond readtable, which can be quite slow.
EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed for reading RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.
The most efficient way I've found is to read the whole file into a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with a fixed width, you can convert them from char to numbers far more efficiently than with str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines] = txt2mat(filename)
% txt2mat Read the content of a text file into a char matrix.
% Reads all the content of a text file into a matrix as wide as the
% longest line in the file. Shorter lines are padded with blank spaces.
% New lines are not included in the output.
% New lines are identified by newline \n characters.

% Read the whole file into a string
fid = fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Find newline positions
newLines = fileData == sprintf('\n');
linesEndPos = find(newLines) - 1;
% Calculate number of lines
nLines = length(linesEndPos);
% Calculate the width (number of characters) of each line
linesWidth = diff([-1; linesEndPos]) - 1;
% Number of characters per row, including newlines
charsPerRow = max(linesWidth) + 1;
% Initialize output with blank spaces
txtMat = char(zeros(charsPerRow, nLines, 'uint8') + ' ');
% Compute a logical index mapping all characters of the input string
% to their final positions
charIdx = false(charsPerRow, nLines);
% Indices of all newlines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd) = true;
charIdx = cumsum(charIdx) == 0;
% Fill output matrix
txtMat(charIdx) = fileData(~newLines);
% Crop the last row, corresponding to newline characters, and transpose
txtMat = txtMat(1:end-1, :)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData = txtMat(:,1) == 'N';
And if your number fields have a fixed width, you can convert them to numbers far more efficiently than with str2num, with something like:
values = (txtMat(:,1:10) - '0') * [1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2];
Here I've assumed the numbers have 7 integer digits and two decimal places, but you can easily adapt this to your case.
To finish, set the NaN values with:
values(NoData) = NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, it is WAY faster. And if you don't have fixed-width numbers you can still use this approach by adding a couple of lines to count the digits and find the decimal point before doing the conversion, though that will slow things down a little. Even so, I think it will still be faster.
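To make the digit-weighting trick concrete, here is a small hypothetical check (not from the original answer) for the assumed 10-character layout:
% Subtracting '0' turns each character into its digit value; the weight
% vector places each digit at its decimal position. Position 8 holds the
% '.' character ('.' - '0' is -2), so its weight is 0 to cancel it out.
rows = ['1234567.89'; '0000001.50'];   % two fixed-width numbers
w = [1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2];
values = (rows - '0') * w              % -> [1234567.89; 1.50]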

Matlab interp1 gives last row as NaN

I have a problem similar to here. However, it doesn't seem that there is a resolution.
My problem is as follows: I need to import some files, for example 5. There are 20 columns in each file, but the number of lines varies. Column 1 is time in crank-angle degrees (CAD), and the rest are data.
So my code first imports all of the files, finds the file with the largest number of rows, then creates a multidimensional array with that many rows. The timing is in engine cycles, so I then remove lines from each imported file that go beyond a whole engine cycle; this way I always have data for X whole engine cycles. Then I interpolate the data onto the pre-allocated array to get one giant multidimensional array for the 5 data files.
However, this always results in the last row of every column of every page being filled with NaNs. Please have a look at the code below; I can't see where I'm going wrong. Oh, and by the way, as I have been screwed over before, this is NOT homework.
maxlines = 0;
maxcycle = 999;
for i = 1:1
    filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
    file = filelines(filename); % Import file clean
    lines = size(file,1); % Find number of lines of file
    if lines > maxlines
        maxlines = lines; % If number of lines in this file is the most, save it
    end
    lastCAD = file(end,1); % Add simstart to shift the start of the cycle to 0 CAD
    lastcycle = fix((lastCAD-simstart)./cycle); % Find number of whole engine cycles
    if lastcycle < maxcycle
        maxcycle = lastcycle; % Find lowest number of whole engine cycles amongst all designs
    end
    cols = size(file,2); % Find number of columns in files
end
lastcycleCAD = maxcycle.*cycle+simstart; % Define last CAD of whole cycle that can be used for analysis
% Import files
thermo = zeros(maxlines,cols,designs); % Initialize array to proper size
xq = linspace(simstart,lastcycleCAD,maxlines); % Define the CAD degrees
for i = 1:designs
    filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
    file = importthermo(filename, 6, inf); % Import the file clean
    [~,lastcycleindex] = min(abs(file(:,1)-lastcycleCAD)); % Find index of end of last whole cycle
    file = file(1:lastcycleindex,:); % Remove all CAD after that
    thermo(:,1,i) = xq;
    for j = 2:17
        thermo(:,j,i) = interp1(file(:,1),file(:,j),xq);
    end
    sprintf('file from folder %s imported OK',folder{i})
end
thermo(end,:,:) = []; % Remove NaN row
Thank you very much for your help!
Are you sampling outside the range of your input data? By default, interp1 returns NaN for query points that fall outside the range of the sample points, which would explain the NaN row. If so, you need to tell interp1 that you want extrapolation:
interp1(file(:,1), file(:,j), xq, 'linear', 'extrap');
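A minimal demonstration with made-up data:
x = 0:4;                                % sample points
y = x.^2;                               % sample values
xq = [2.5 4.5];                         % 4.5 lies outside the range of x
interp1(x, y, xq)                       % -> [6.5 NaN]
interp1(x, y, xq, 'linear', 'extrap')   % -> [6.5 19.5]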

How to sparsely read a large file in Matlab?

I ran a simulation which wrote a huge file to disk. The file is a big matrix v. I can't read it all in, but I really only need a portion of the matrix, say every 100th row and column, i.e. v(1:100:end, 1:100:end). I'd like to do something like
vtag = dlmread('v', 1:100:end, 1:100:end);
Of course, that doesn't work. I know I should have written only the subsampled matrix in the first place:
dlmwrite('vtag', v(1:100:end, 1:100:end));
But I did not, and running everything again would take two more days.
Thanks
Amir
Thankfully the dlmread function supports specifying a range to read as the third input. So if you want to read all N columns of the first 100 rows, you can specify that with the following command:
startRow = 1;
startColumn = 1;
endRow = 100;
endColumn = N;
rng = [startRow, startColumn, endRow, endColumn] - 1;
vtag = dlmread(filename, ',', rng);
EDIT Based on your clarification
Since you don't want 1:100 rows but rather 1:100:end rows, the following approach should work better for you.
You can use textscan to read chunks of data at a time. You can read a "good" row and then read in the next "chunk" of data to ignore (discarding it in the process), and continue until you reach the end of the file.
The code below is a slight modification of that idea, except it utilizes the HeaderLines input to textscan, which instructs the function how many lines to ignore before reading in the data. The first time through the loop no lines are skipped; every subsequent time through, rows2skip lines are skipped. This lets us "jump" through the file very rapidly without any additional file operations.
startRow = 1;
rows2skip = 99;
columns = 3000;

fid = fopen(filename, 'rb');

% For now, we'll just assume you're reading in floating-point numbers
format = repmat('%f ', [1 columns]);

count = 1;
lines2discard = startRow - 1;

while ~feof(fid)
    % Use "HeaderLines" to skip data before reading in data we care about
    row = textscan(fid, format, 1, 'Delimiter', ',', 'HeaderLines', lines2discard);
    data{count} = [row{:}];

    % After the first time through, set "HeaderLines" (i.e. lines to ignore)
    % to the number we want to skip between reads (much faster than alternatives!)
    lines2discard = rows2skip;

    count = count + 1;
end

fclose(fid);
data = cat(1, data{:});
You may need to adjust your format specifier for your own type of input.
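One thing the loop above does not cover: the question also asks for every 100th column, not just every 100th row. Since each kept row is read in full, the cheapest fix is to subsample the columns after the final cat (a one-line sketch using the data variable from above):
vtag = data(:, 1:100:end);   % keep every 100th column of the kept rows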

Performance issues by processing huge textfiles

I am facing the problem of extracting data from a text file which contains both numbers and characters. The data I want (the numbers) are separated by rows of characters describing the following dataset. The text file is rather large (>2,000,000 lines).
I try to put every dataset (the rows of numbers between two character rows) into a matrix, named according to the description (frequency) in the text line above each dataset. I have working code, but I face performance problems: one file currently takes about 15 minutes. I need the numbers in matrices to process them further.
Snippet out of Textfile:
21603 2135 21339 21604
103791 94 1 1 1 4
21339 1702 21600 21604
-1
-1
2414
1
Velocity (magnitude) Response at Structural FE Nodes
1
Frequency = 10.00 Hz
Result = Engineering Units
Component = Vmag
Location =
Form & Units = RMS Magnitude in m/s
1 5 1 11 2 1
1 0 1 1 1 0 0 0
1 2161
0.00000e+000 1.00000e+001 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
20008
1.23285e-004
20428
1.21613e-004
Here is my code:
file = 'large_file.txt';
fid = fopen(file,'r');
k = 1;
filerows = 2164986;    % nr of rows in textfile
A = zeros(filerows,6); % preallocate matrix where textfile should be saved in
for count = 1:8        % get rid of first 8 lines
    fgets(fid);
end
name = 0;
start = 1;
while ~feof(fid)
    a = fgets(fid);
    b = str2double(strread(a,'%s')); % turn the row just read into a vector
    if isnan(b(1)) == 1              % check whether there are characters in the row
        if strfind(a,'Frequency')    % check if 'Frequency' is in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);', name, start, k);
            eval(Matrixname);
            name = b(3);
            for count = 1:10         % get rid of next 10 lines
                fgets(fid);
            end
            start = k + 1;
        end
    else % if there are just numbers in the row, insert it into the matrix
        A(k,1:length(b)) = b; % populate matrix A with the row entries
        k = k + 1;
    end
    k/filerows % show progress
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:end,:);', name, start);
eval(Matrixname);
Reading text files line by line is very time consuming, especially in Matlab. When I have to read in text files, I usually read in the entire file at once. You may be limited by memory, so read it in the largest size chunks your machine can handle. Once it's all in memory, use some kind of logical indexing to find the parts of the data you're interested in. Again, in Matlab, for and while loops are very slow. For the data set you have there, I would do the following:
fid = fopen(file);
data = fread(fid, [1 maxBytes], 'char=>char'); % maxBytes: as much of the file as memory allows
blockIndices = strfind(data, 'Velocity');      % calculate offsets based on data format

% Another approach, much faster than for loops
lineData = regexp(data, sprintf('\n'), 'split'); % now each line is in a cell
processedData = cellfun(@processData, lineData, 'Uniform', false);

function y = processData(x)
    % do something with x
end
Once I had the block indices I could calculate offsets to the parts of the data I want. I don't think that two million lines is that much data, and most computers these days have multiple gigabytes of memory, and it doesn't look like each line is more than a couple hundred characters, so the file is probably less than half a GB. Good luck.
Using the MATLAB profiler will help you see which lines of code take the most time, so that you can figure out what to optimize.
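A minimal way to invoke it (the script name is a placeholder):
profile on          % start collecting timing statistics
run('myScript.m')   % placeholder for the code under test
profile viewer      % open the report; look at the self-time column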
As the original poster determined, the line causing trouble in this case was
k/filerows % show progress
Printing to the screen many times is very time consuming. If you want to show progress without slowing the code way down, you could do
if mod(k, round(filerows/100)) == 0
    fprintf('%d rows processed\n', k);
end
That code will display an update about 100 times in total, or roughly every 3.5 seconds in that particular case.
If you want to get really fancy, check out waitbar, but it is usually overkill.
Finally I got the sscanf solution to work. I used that function to replace str2double to gain some speed, as suggested in Why is str2double so slow in matlab as compared to a mex-function?.
Sadly it didn't help as much as hoped, but it did help a bit:
Starting point: ca. 850 s.
After removing the progress display: ca. 450 s.
After replacing str2double with sscanf: ca. 330 s.
The code now is:
file='test.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate Matrix where textfile should be saved in
for count=1:8 % get rid of first 8 lines
fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
a=fgets(fid);
b=strread(a,'%s');
b=sscanf(sprintf('%s#', b{:}), '%g#')';
if isempty(b) % check whether there had been characters in the row
if strfind(a,'Frequency') % check whether 'Frequency' was in the row
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);
b=str2double(strread(a,'%s'));
name=b(3);
for count=1:8 % get rid of next 8 lines
fgets(fid);
end
start=k+1;
end
else % if there were just numbers in the row, insert it into the matrix
A(k,1:length(b))=b; % populate matrix A with the row entries
k = k+1;
end
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);

How to read binary file in one block rather than using a loop in matlab

I have this file which is a series of x, y, z coordinates of over 34 million particles and I am reading them in as follows:
parfor i = 1:Ntot
    x0(i,1) = fread(fid, 1, 'real*8')';
    y0(i,1) = fread(fid, 1, 'real*8')';
    z0(i,1) = fread(fid, 1, 'real*8')';
end
Is there a way to read this in without a loop? It would greatly speed up the read-in; I just want three vectors with x, y, z. Thanks. Other suggestions welcome.
I do not have a machine with Matlab and I don't have your file to test with either, but I think coordinates = fread(fid, [3, Ntot], 'real*8') should work fine.
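Expanding that one-liner into the three vectors the question asks for (a sketch, assuming the file contains nothing but interleaved x, y, z doubles):
% fread fills the [3, Ntot] matrix column by column, so each column
% holds one particle's [x; y; z] triple.
coordinates = fread(fid, [3, Ntot], 'real*8');
x0 = coordinates(1, :)';   % Ntot-by-1 column vectors
y0 = coordinates(2, :)';
z0 = coordinates(3, :)';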
Maybe fread is the function you are looking for.
You're right: reading data in larger batches is usually a key part of speeding up file reads. Another part is pre-allocating the destination variables, for example with a zeros call.
I would do something like this:
%Pre-allocate
x0 = zeros(Ntot,1);
y0 = zeros(Ntot,1);
z0 = zeros(Ntot,1);
%Define a desired batch size. make this as large as you can, given available memory.
batchSize = 10000;
%Use while to step through file
indexCurrent = 1; %indexCurrent is the next element which will be read
while indexCurrent <= Ntot
%At the end of the file, we may need to read less than batchSize
currentBatch = min(batchSize, Ntot-indexCurrent+1);
%Load a batch of data
tmpLoaded = fread(fid, currentBatch*3, 'read*8')';
%Deal the fread data into the desired three variables
x0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(1:3:end);
y0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(2:3:end);
z0(indexCurrent + (0:(currentBatch-1))) = tmpLoaded(3:3:end);
%Update index variable
indexCurrent = indexCurrent + batchSize;
end
Of course, make sure you test, as I have not. I'm always suspicious of off-by-one errors in this sort of work.
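In that spirit, here is a small self-check of the layout assumption (hypothetical file name; it writes a few known triples and reads them back in one shot):
% Write 5 known particles (columns are [x; y; z]), then verify that a
% single [3, N] fread recovers exactly what was written.
test = reshape(1:15, 3, 5);
fid = fopen('xyz_test.bin', 'w');
fwrite(fid, test, 'real*8');
fclose(fid);

fid = fopen('xyz_test.bin', 'r');
ref = fread(fid, [3, 5], 'real*8');
fclose(fid);
assert(isequal(ref, test));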