Matlab: fastest method of reading parts/sequences of a large binary file

I want to read parts of a large (ca. 11 GB) binary file. The currently working solution is to load the entire file (raw_data) with fread(), then crop out the pieces of interest (data).
Question: Is there a faster method of reading small parts of a file (1-2% of the total, partially sequential reads), given something like a binary mask (i.e. a logical index of the specific bytes of interest) in MATLAB? Specifics below.
Notes for my specific case:
data of interest (2.6e+7 bytes, or ca. 24 MB) is roughly 2% of raw_data (1.2e+10 bytes, or ca. 11 GB)
each 600,000 bytes contain ca. 6,500 byte reads, which can be collapsed into roughly 1,200 read-skip cycles (such as 'read 10 bytes, skip 5,000 bytes')
the read instructions for the total file can be broken down into ca. 20,000 similar (but not exactly identical) read-skip blocks (i.e. ca. 20,000 x 1,200 read-skip cycles)
The file is read from a GPFS (parallel file system)
Ample RAM, the newest MATLAB version, and all toolboxes are available for the task
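For concreteness, collapsing such a logical byte mask into read-skip pairs could be done roughly as follows (a sketch; mask is assumed to be the logical index mentioned above):
% mask: logical index over all bytes of the file, true = byte of interest
d = diff([false; mask(:); false]);
runStarts = find(d == 1);            % first byte of each contiguous read
runEnds   = find(d == -1) - 1;       % last byte of each contiguous read
f_reads   = runEnds - runStarts + 1; % bytes to read in each cycle
f_skips   = [runStarts(2:end) - runEnds(1:end-1) - 1; 0]; % bytes to skip after each read
initial_offset = runStarts(1) - 1;   % for the initial fseek(fi, ..., 'bof')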
My initial idea, an fread-fseek cycle, proved extraordinarily slow compared to reading the whole file (see pseudocode below). Profiling revealed that fread() was the slowest part (it is called over a million times, which is probably obvious to the experts here).
Alternatives I considered: memmapfile() [ref] offers no feasible way to read many small parts, as far as I could find (a sketch of such a gather follows the pseudocode below). The MappedTensor library might be the next thing I'd look into. Related, but they didn't help; linked here for reference: 1, 2.
% open file
fi = fopen('data.bin');
% example read-skip data
f_reads = [20 10 6 20 40];       % read this number of bytes
f_skips = [900 6000 40 300 600]; % skip this many bytes after each read
nbr_read_skip_cycles = numel(f_reads);
data = []; % result accumulates here (not preallocated; see the answer below)
fseek(fi, 90000, 'bof'); % skip initial bytes until the first read
% read the file
for ind = 1:nbr_read_skip_cycles
    tmp_data = fread(fi, f_reads(ind));
    data = [data; tmp_data]; % append newly read bytes to the data variable
    fseek(fi, f_skips(ind), 'cof'); % skip to the next read position
end
fclose(fi);
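For reference, the memmapfile() gather mentioned above would look roughly like this (a sketch, not benchmarked; idx would have to be precomputed from the read-skip table):
m = memmapfile('data.bin', 'Format', 'uint8'); % map the whole file as bytes
idx = [90001:90020, 90921:90930]; % precomputed 1-based byte positions (example values only)
data = m.Data(idx); % gather all bytes of interest in one indexing operation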
FYI: To get an overview and for transparency, I've compiled some plots (omitted here) of the first ca. 6,500 read locations of my actual data which, after collapsing into fread-fseek pairs, can be summarized in 1,200 fread-fseek pairs.

I would do two things to speed up your code:
preallocate the data array.
write a C MEX-file to call fread and fseek.
This is a quick test I did to compare using fread and fseek from MATLAB or C:
%% Create large binary file
data = 1:10000000; % 80 MB
fi = fopen('data.bin', 'wb');
fwrite(fi, data, 'double');
fclose(fi);
n_read = 1;
n_skip = 99;
%% Read using MATLAB
tic
fi = fopen('data.bin', 'rb');
fseek(fi, 0, 'eof');
sz = ftell(fi);
sz = floor(sz / (n_read + n_skip));
data = zeros(1, sz);
fseek(fi, 0, 'bof');
for ind = 1:sz
    data(ind) = fread(fi, n_read, 'int8');
    fseek(fi, n_skip, 'cof');
end
toc
%% Read using C MEX-file
mex fread_test_mex.c
tic
data = fread_test_mex('data.bin', n_read, n_skip);
toc
And this is fread_test_mex.c:
#include <stdio.h>
#include <mex.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    // No testing of inputs...
    // inputs = 'data.bin', 1, 99
    char* fname = mxArrayToString(prhs[0]);
    int n_read = mxGetScalar(prhs[1]);
    int n_skip = mxGetScalar(prhs[2]);
    FILE* fi = fopen(fname, "rb");
    fseek(fi, 0L, SEEK_END);
    int sz = ftell(fi);
    sz /= n_read + n_skip;
    plhs[0] = mxCreateNumericMatrix(1, sz, mxDOUBLE_CLASS, mxREAL);
    double* data = mxGetPr(plhs[0]);
    fseek(fi, 0L, SEEK_SET);
    char buffer[1]; // holds one byte; assumes n_read == 1 as in the test above
    for(int ind = 0; ind < sz; ++ind) {
        fread(buffer, 1, n_read, fi);
        data[ind] = buffer[0];
        fseek(fi, n_skip, SEEK_CUR);
    }
    fclose(fi);
}
I see this:
Elapsed time is 6.785304 seconds.
Building with 'Xcode with Clang'.
MEX completed successfully.
Elapsed time is 1.376540 seconds.
That is, reading the data is 5x as fast with a C MEX-file. And that time includes loading the MEX-file into memory. A second run is a bit faster (1.14 s) because the MEX-file is already loaded.
In the MATLAB code, if I initialize data = []; and then extend the matrix every time I read like OP does:
tmp = fread(fi, n_read, 'int8');
data = [data, tmp];
then the execution time for that loop was 159 s, with 92.0% of the time spent in the data = [data, tmp] line. Preallocating really is important!
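Applied to the OP's variable-length reads, preallocation might look like this (a sketch reusing the f_reads/f_skips vectors from the question):
fi = fopen('data.bin');
fseek(fi, 90000, 'bof');       % initial offset as in the question
data = zeros(sum(f_reads), 1); % preallocate the full output once
pos = 1;
for ind = 1:numel(f_reads)
    data(pos : pos + f_reads(ind) - 1) = fread(fi, f_reads(ind));
    pos = pos + f_reads(ind);
    fseek(fi, f_skips(ind), 'cof');
end
fclose(fi);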

Related

Loading chunks of large binary files in Matlab, quickly

I have some pretty massive data files (256 channels, on the order of 75-100 million samples = ~40-50 GB or so per file) in int16 format. It is written in flat binary format, so the structure is something like: CH1S1,CH2S1,CH3S1 ... CH256S1,CH1S2,CH2S2,...
I need to read in each channel separately, filter and offset correct it, then save. My current bottleneck is loading each channel, which takes about 7-8 minutes... scale that up 256 times, and I'm looking at nearly 30 hours just to load the data! I am trying to intelligently use fread, to skip bytes as I read each channel; I have the following code in a loop over all 256 channels to do this:
offset = i - 1;
fseek(fid,offset*2,'bof');
dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);
Reading around, this is typically the fastest way to load parts of a large binary file; or is the file simply too large to do this any faster?
I'm not loading that much data: the test file I'm working with is 37 GB, and for one of the 256 channels I'm only loading 149 MB for the entire trace. Maybe the 'skip' functionality of fread is suboptimal?
System details: MATLAB 2017a, Windows 7, 64bit, 32GB RAM
@CrisLuengo's idea was much faster: essentially, chunk the data, load each chunk, and then split it out to separate channel files to save RAM.
Here is some code for just the loading part which is fast, less than 1 minute:
% fake raw data
disp('building... ');
nChan = 256;
nSampsTotal = 10e6;
tic; DATA = rand(nChan,nSampsTotal); toc;
fid = fopen('rawData.dat','w');
disp('writing flat binary file... ');
tic; fwrite(fid,DATA(:),'int16'); toc;
fclose(fid);
% compute the number of samples and chunks
chunkSize = 1e6;
nChunksTotal = ceil(nSampsTotal/chunkSize);
%% load by chunks
t1 = tic;
fid = fopen('rawData.dat','r');
dat = zeros(nChan,chunkSize,'int16');
chunkCnt = 1;
while 1
    tic
    if chunkCnt <= nChunksTotal
        % load the data
        fprintf('Chunk %02d/%02d: loading... ',chunkCnt,nChunksTotal);
        dat = fread(fid,[nChan,chunkSize],'*int16');
    else
        break;
    end
    toc;
    chunkCnt = chunkCnt + 1;
end
t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
% Total time: 55.07 secs.
fclose(fid);
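The splitting step (writing each chunk's rows out to per-channel files so that a single channel can later be loaded alone) is not shown above; a minimal sketch, with hypothetical output file names:
% open one output file per channel (the chan_###.dat names are hypothetical)
outFids = zeros(nChan,1);
for c = 1:nChan
    outFids(c) = fopen(sprintf('chan_%03d.dat',c),'w');
end
% inside the chunk loop, right after dat = fread(...):
for c = 1:nChan
    fwrite(outFids(c), dat(c,:), 'int16'); % append this chunk's samples for channel c
end
% after the loop:
for c = 1:nChan
    fclose(outFids(c));
end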
On the other hand, loading by channel by skipping through the file takes about 20x longer, a little over 20 minutes:
%% load by channels (slow)
t1 = tic;
fid = fopen('rawData.dat','r');
dat = zeros(1,nSampsTotal);
for i = 1:nChan
    tic;
    fprintf('Channel %03d/%03d: loading... ',i,nChan);
    offset = i-1;
    fseek(fid,offset*2,'bof');
    dat = fread(fid,[1,nSampsTotal],'*int16',(nChan-1)*2);
    toc;
end
t = toc(t1); fprintf('Total time: %4.2f secs.\n\n\n',t);
% Total time: 1133.48 secs.
fclose(fid);
I'd also like to thank OCDER on the Matlab forums for their help: link

Efficient import of semi structured text

I have multiple text files saved from a Tekscan pressure-mapping system. I am trying to find the most efficient method for importing the multiple comma-delimited matrices into one 3-D matrix of type uint8. I have developed a solution which makes repeated calls to the MATLAB function dlmread. Unfortunately, it takes roughly 1.5 min to import the data. I have included the code below.
This code makes calls to two other functions I wrote, metaextract and framecount which I have not included as they aren't truly relevant to answering the question at hand.
Here are two links to samples of the files I am using.
The first is a shorter file with 90 samples
The second is a longer file with 3458 samples
Any help would be appreciated
function pressureData = tekscanimport
% Import TekScan data from .asf file to a 3-D matrix of type uint8.
[id,path] = uigetfile('*.asf'); % user input for .asf file
if path == 0 % uigetfile returns zero on cancel
    error('You must select a file to continue')
end
path = strcat(path,id); % concatenate path and id to full path
% function calls
pressureData.metaData = metaextract(path);
nLines = linecount(path); % find number of lines in file
nFrames = framecount(path,nLines); % find number of frames
rowStart = 25; % default starting row to read from TekScan .asf file
rowEnd = rowStart + 41; % frames are 42 rows long
colStart = 0; % default starting col to read from TekScan .asf file
colEnd = 47; % frames are 48 columns wide
pressureData.frames = zeros([42,48,nFrames],'uint8'); % preallocate for speed
f = waitbar(0,'1','Name','loading Data...',...
    'CreateCancelBtn','setappdata(gcbf,''canceling'',1)');
setappdata(f,'canceling',0);
for i = 1:nFrames % loop through file, skipping frame metadata
    if getappdata(f,'canceling')
        break
    end
    waitbar(i/nFrames,f,sprintf('Loaded %.2f%%', i/nFrames*100));
    % make repeated calls to dlmread
    pressureData.frames(:,:,i) = dlmread(path,',',[rowStart,colStart,rowEnd,colEnd]);
    rowStart = rowStart + 44;
    rowEnd = rowStart + 41;
end
delete(f)
end
I gave it a try. This code opens your big file in 3.6 seconds on my PC. The trick is to use sscanf instead of the str2double and str2num functions.
clear all;tic
fid = fopen('tekscanlarge.txt','rt');
% read the header, stop at 'Frame'
header='';
l = fgetl(fid);
while length(l)>5 && ~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5, l(end+1:5)=' '; end
end
% all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
% allocate space
res = zeros([48,42,3458],'uint8');
% get all line endings
LE = [0,regexp(dat','\n')];
i=1;
for ct = 2:length(LE)-1 % go line by line
    L = dat(LE(ct-1)+1:LE(ct)-1);
    if isempty(L), continue; end
    if all(L(1:5)==['Frame']')
        fr = sscanf(L(7:end),'%u'); % frame number
        i=1;
        continue;
    end
    % sscanf can only handle a row-char with space separation
    res(:,i,fr) = uint8(sscanf(strrep(L',',',' '),'%u'));
    i=i+1;
end
toc
Does anyone know of a faster way to convert than sscanf? It spends the majority of the time in this function (2.17 seconds). For a dataset of 13.1 MB I find that very slow compared to the speed of memory.
Found a way to do it in 0.2 seconds that might be useful for others as well.
This mex-file scans through a list of char values for numbers and reports them back. Save it as mexscan.c and run mex mexscan.c.
#include "mex.h"

/* The computational routine */
void calc(unsigned char *in, unsigned char *out, long Sout, long Sin)
{
    long ct = 0;
    int newnumber = 0;
    for (int i = 0; i < Sin; i += 2) { /* step 2: MATLAB char is 2 bytes (UTF-16) */
        if (in[i] >= 48 && in[i] <= 57) { /* it is a digit */
            out[ct] = out[ct]*10 + in[i] - 48;
            newnumber = 1;
        } else { /* it is not a digit */
            if (newnumber == 1) {
                ct++;
                if (ct >= Sout) { return; } /* output full; stop */
            }
            newnumber = 0;
        }
    }
}

/* The gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    unsigned char *in;  /* input vector */
    long Sout;          /* requested size of output vector */
    long Sin;           /* size of input vector, in bytes */
    unsigned char *out; /* output vector */
    /* check for proper number of arguments */
    if (nrhs != 2) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:nrhs", "Two inputs required.");
    }
    if (nlhs != 1) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:nlhs", "One output required.");
    }
    /* make sure the first input argument is type char */
    if (!mxIsClass(prhs[0], "char")) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:notChar", "First input must be type char.");
    }
    /* make sure the second input argument is type uint32 */
    if (!mxIsClass(prhs[1], "uint32")) {
        mexErrMsgIdAndTxt("MyToolbox:mexscan:notUint32", "Second input must be type uint32.");
    }
    /* get dimensions of the input matrix (2 bytes per char element) */
    Sin = mxGetM(prhs[0]) * 2;
    /* create a pointer to the data in the input matrix */
    in = (unsigned char *) mxGetPr(prhs[0]);
    Sout = mxGetScalar(prhs[1]);
    /* create the output matrix */
    plhs[0] = mxCreateNumericMatrix(1, Sout, mxUINT8_CLASS, mxREAL);
    /* get a pointer to the data in the output matrix */
    out = (unsigned char *) mxGetPr(plhs[0]);
    /* call the computational routine */
    calc(in, out, Sout, Sin);
}
Now this script runs in 0.2 seconds and returns the same result as the previous script.
clear all;tic
fid = fopen('tekscanlarge.txt','rt');
% read the header, stop at 'Frame'
header='';
l = fgetl(fid);
while length(l)>5 && ~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5, l(end+1:5)=' '; end
end
% all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
S=[48,42,3458];
d = mexscan(dat,uint32(prod(S)+3458));
d(1:prod(S(1:2))+1:end)=[]; % remove frame numbers
d = reshape(d,S);
toc

Why does my matlab program use so much memory?

I am writing a matlab program, which reads about 500 files. Each file has 20,000 lines, with 1 number on each line. The program tries to build a matrix of 20,000 * 500 with these numbers.
The numbers are stored as Double, so 8 bytes per number. So I would expect this to take 20,000 * 500 * 8 bytes, which is approximately 1E8, i.e. 100MB. And yet this program exhausts my 16GB memory. As the program runs, I see the memory use steadily going up, GB by GB. I am using Matlab R2015b on Ubuntu 14.04.
What's happening? Many thanks for your attention.
Here is the full code
clear all;
% number of rna bits in the file
filesize = 20532;
maxFiles = 480;
rnaCounts = NaN(filesize,maxFiles);
myFolder = '~/_STATS/data3/RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3';
filePattern = fullfile(myFolder, '*genes.normalized_results');
theFiles = dir(filePattern);
rnaCounts = NaN(filesize,length(theFiles)); % reallocated to the actual file count
for k = 1 : length(theFiles)
    mrnaFilename = strtrim(theFiles(k).name);
    fprintf(1, 'Now reading mrnaFile %d %s \n', k, mrnaFilename);
    % read rna file
    fullFileName = fullfile(myFolder, mrnaFilename);
    rnafid = fopen(fullFileName);
    if rnafid < 0
        fprintf('====ERROR OPENING RNA FILE =====================');
    end
    rnaline = fgets(rnafid); % read the first (header) line
    lc = 1; % line counter
    while ischar(rnaline) && feof(rnafid) ~= 1
        rnaline = fgets(rnafid);
        rnaSplit = strsplit(rnaline);
        % write to the matrix
        rnaCounts(lc,k) = str2num(rnaSplit{2});
        lc = lc + 1;
    end
    fclose(rnafid);
end
As verified by the OP, the str2num function in the Linux version of Matlab 2015b has a memory leak. This function is not very useful anyway as it is designed to parse strings representing entire matrices (1 2; 3 4) rather than the typical use case of parsing a single number (1.234). Use str2double when doing simple number parsing; it is faster even when str2num isn't broken.
It is likely that using a different version of Matlab would also work around the problem, because in my experience, these kinds of memory bugs don't usually persist from one version to the next.
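To illustrate the difference (a minimal example):
x = str2num('[1 2; 3 4]');  % parses whole matrices (uses eval internally)
y = str2double('135.5735'); % parses a single scalar; faster, and returns NaN on failure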
Often, high-level I/O functions such as dlmread or textscan are useful to read such text formats. Use dlmread if you have only numeric data, and textscan for more complex formats.
The sample data you provided is:
A2LD1|87769 135.5735
As you only need the number in the second column and discard the identifier in the first column, all you have is numeric data, and you can use dlmread.
data = dlmread(fullFileName, '\t', 1, 1);
The \t is to specify that the delimiter (column separator) is a Tab. The two 1s are to specify a row offset and a column offset, i.e. ignore the first row (the header) and the first column (id) of the file.
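If the format were messier (mixed text and numbers), textscan could do the same job; a sketch, assuming the same layout of an identifier column followed by a tab-separated number:
fid = fopen(fullFileName);
C = textscan(fid, '%*s %f', 'Delimiter', '\t', 'HeaderLines', 1); % %*s discards the id column
fclose(fid);
data = C{1}; % the numeric column as a double vector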

Finding the row indices of a logical vector without using find()

My program handles a huge amount of data, and the find function is to blame for much of the execution time. At some point I get a logical vector, and I want to extract the row indices of the true elements in the vector. How can I do that without using the find function?
Here's a demo:
temp = rand(10000000, 1);
temp1 = temp > 0.5;
temp2 = find(temp1);
But it is too slow when there is much more data. Any suggestions?
Thank you
find seems to be a highly optimized function. What I did was create a MEX version restricted to this particular problem. Running time was cut in half. :)
Here is the code:
#include <math.h>
#include <matrix.h>
#include <mex.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxLogical *in;
    double *out;
    int i, nInput, nTrues;
    // Get the number of elements of the input.
    nInput = mxGetNumberOfElements(prhs[0]);
    // Get a pointer to the logical input array.
    in = mxGetLogicals(prhs[0]);
    // Allocate memory for the output. As we don't know the number of
    // matches, we allocate an array the same size as the input. We will
    // probably reallocate it later.
    out = mxMalloc(sizeof(double) * nInput);
    // Count the number of 'trues' and store their positions.
    for (nTrues = 0, i = 0; i < nInput; )
        if (in[i++])
            out[nTrues++] = i; // i was already incremented: 1-based index
    // Reallocate the array, if necessary.
    if (nTrues < nInput)
        out = mxRealloc(out, sizeof(double) * nTrues);
    // Assign the indexes to the output array.
    plhs[0] = mxCreateDoubleMatrix(0, 0, mxREAL);
    mxSetPr(plhs[0], out);
    mxSetM(plhs[0], nTrues);
    mxSetN(plhs[0], 1);
}
Just save it to a file called, for example, find2.c and compile with mex find2.c.
Assuming:
temp = rand(10000000, 1);
temp1 = temp > 0.5;
Running times:
tic
temp2 = find(temp1);
toc
Elapsed time is 0.082875 seconds.
tic
temp2 = find2(temp1);
toc
Elapsed time is 0.044330 seconds.
IMPORTANT NOTE: this function has no error handling. It's assumed the input is always a logical array and the output is a double array. Caution is required.
You could try to split your calculations into small pieces. This will not reduce the amount of computation you have to do, but it might still be faster since the data fits into fast cache memory instead of slow main memory (in the worst case you might even be swapping to disk). Something like this:
temp = rand(10000000, 1);
n = 100000; % chunk size
for i = 1:floor(length(temp) / n)
    chunk = temp(((i-1) * n + 1):(i*n));
    temp1 = chunk > 0.5;
    temp2 = find(temp1);
    do_stuff(temp2)
end
You can create an array of regular indices and then apply logical indexing. I didn't check whether it is faster than find, though.
Example:
Index = 1:numel(temp);
Found = Index(temp1);
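A quick way to check would be (a sketch using the demo data from the question):
temp = rand(10000000, 1);
temp1 = temp > 0.5;
tic; temp2 = find(temp1); toc                         % built-in find
tic; Index = 1:numel(temp); Found = Index(temp1); toc % logical indexing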

Fastest Matlab file reading?

My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I use instead?
Catch: I have to operate on each line one at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.
fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    % do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);
Problem statement
This is a common struggle, and there is nothing like a test to answer. Here are my assumptions:
A well formatted ASCII file, containing two columns of numbers. No headers, no inconsistent lines etc.
The method must scale to reading files that are too large to be contained in memory, (although my patience is limited, so my test file is only 500,000 lines).
The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time, cannot be vectorized.
Discussion
With that in mind, the answers and comments seem to be encouraging efficiency in three areas:
reading the file in larger batches
performing the string to number conversion more efficiently (either via batching, or using better functions)
making the actual processing more efficient (which I have ruled out via rule 3, above).
Results
I put together a quick script to test the ingestion speed (and consistency of result) of several variations on these themes. The results are:
Initial code. 68.23 sec. 582582 check
Using sscanf, once per line. 27.20 sec. 582582 check
Using fscanf in large batches. 8.93 sec. 582582 check
Using textscan in large batches. 8.79 sec. 582582 check
Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
Using java single item token scanner. 81.19 sec. 582582 check
Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)
Summary
More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
About another 2/3 of the remaining time (27 -> 8 sec) can be saved by using larger batches for both file reading and string-to-number conversion.
If we are willing to violate rule number three in the original post, another 7/8 of the time can be cut by switching to fully numeric processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)
Finally, in direct contradiction to a previous edit of mine within this response, no savings are available from the cached Java single-line readers. In fact that solution is 2-3 times slower than the comparable single-line result using native readers (63 vs. 27 seconds).
Sample code for all of the solutions described above are included below.
Sample code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader = java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);
    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations. %3.2f sec. %d check \n', t, CHECK);
(original answer)
To expand on the point made by Ben ... your bottleneck will always be file I/O if you are reading these files line by line.
I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to get a round number of lines, and then run your string parsing (e.g. sscanf).
Then if you want you can process the resulting large matrix one row at a time, before repeating the process until you read the end of the file.
It's a little bit tedious, but not that hard. I typically see 90% plus improvement in speed over single line readers.
(terrible idea using Java batched line readers removed in shame)
I have had good results (speedwise) using memmapfile(). This minimises the amount of memory data copying, and makes use of the kernel's IO buffering. You need enough free address space (though not actual free memory) to map the entire file, and enough free memory to hold the output variable (obviously!)
The example code below reads a text file into a two-column matrix data of int32 type.
fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')
You may need to fiddle with the parameters to textscan() to get exactly what you want - see the online docs.
Even if you can't fit the whole file in memory, you should read a large batch using the matrix read functions.
Maybe you can even use vector operations for some of the data processing, which would speed things along further.
I have found that MATLAB reads CSV files significantly faster than text files, so if it's possible to convert your text file to CSV using some other software, it may significantly speed up MATLAB's operations.
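If the conversion is an option, reading the result back is then a one-liner (csvread fits the era of this thread; newer releases would use readmatrix):
data = csvread('file.csv'); % whole numeric matrix in one call; file name is hypothetical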