Reading large text files with insufficient RAM in MATLAB

I want to read a large text file of about 2 GB and perform a series of operations on that data. The following approach
tic
fid=fopen(strcat(Name,'.dat'));
C=textscan(fid, '%d%d%f%f%f%d');
fclose(fid);
%Extract cell values
y=C{1}(1:Subsampling:end)/Subsampling;
x=C{2}(1:Subsampling:end)/Subsampling;
%...
Reflectanse=C{6}(1:Subsampling:end);
Overlap=round(Overlap/Subsampling);
fails immediately after reading C (C=textscan(fid, '%d%d%f%f%f%d');), with a strange peak in my memory usage.
What would be the best way to import a file of this size? Is there a way to use textscan() to read individual parts of a text file, or are there any other functions better suited for this task?
Edit: Every column read by textscan contains an information field for 3D points:
width height X Y Z Grayscale
345 453 3.422 53.435 0.234 200
346 453 3.545 52.345 0.239 200
... % and so on for ~40 million points

If you can process each row individually then the following code will allow you to do that. I've included start_line and end_line if you want to specify a block of data.
headerSpec = '%s %s %s %s %s %s';
dataSpec = '%f %f %f %f %f %f';
fid=fopen('data.dat');
% Read Header
names = textscan(fid, headerSpec, 1, 'delimiter', '\t');
k = 0;
% specify a start and end line for getting a block of data
start_line = 2;
end_line = 3;
while ~feof(fid)
    k = k+1;
    % read the row first so the file position always advances,
    % even for rows before start_line
    C = textscan(fid, dataSpec, 1, 'delimiter', '\t');
    if k < start_line
        continue;
    end
    if k > end_line
        break;
    end
    row = [C{1:6}]; % convert data from cell to vector
    % do what you want with the row
end
fclose(fid);
There is the possibility of reading in the entire file, but this will depend on the amount of memory you have available and on any restrictions MATLAB has in place; you can check both by typing memory at the command window.
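textscan can also read the file in bounded-size blocks by passing a repeat count as an extra argument, which keeps peak memory roughly constant regardless of file size. A sketch using the question's format (Name and Subsampling are assumed to be defined as in the question):

```matlab
blockSize = 1e6; % rows per block; tune to your available RAM
fid = fopen(strcat(Name,'.dat'));
while ~feof(fid)
    C = textscan(fid, '%d%d%f%f%f%d', blockSize);
    % process this block, e.g. subsample each column as in the question
    y = C{1}(1:Subsampling:end)/Subsampling;
    x = C{2}(1:Subsampling:end)/Subsampling;
    % ...
end
fclose(fid);
```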

Related

Read compressed files without unzipping them in Matlab

I have (very large) comma separated files compressed in bz2 format. If I uncompress them and read them with
fileID = fopen('file.dat');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
everything is fine. But I would like to read them without uncompressing them, something like
fileID = fopen('file.bz2');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
which, unfortunately, returns an empty X. Any suggestions? Do I have to uncompress them via the system(' ... ') command?
You could try to use the form of textscan that takes a string instead of a file stream. Using the MATLAB Java integration, you can chain Java streams to decompress on the fly and read single lines, which can then be parsed:
% Build a stream chain that reads, decompresses and decodes the file into lines
fileStr = javaObject('java.io.FileInputStream', 'file.dat.gz');
inflatedStr = javaObject('java.util.zip.GZIPInputStream', fileStr);
charStr = javaObject('java.io.InputStreamReader', inflatedStr);
lines = javaObject('java.io.BufferedReader', charStr);
% If you know the size in advance you can preallocate the arrays instead
% of just declaring the types, as done here so the appends below succeed
X = { int32([]), int64([]), {}, [], int32([]), [], [], int32([]) };
curL = lines.readLine();
while ischar(curL) % on EOF, readLine returns null, which becomes [] (type double)
% Parse a single line from the file
curX = textscan(curL,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
% Append new line results
for iCol=1:length(X)
X{iCol}(end+1) = curX{iCol};
end
curL = lines.readLine();
end
lines.close(); % Don't forget this or the file will remain open!
I'm not exactly vouching for the performance of this method, with all the array appending going on, but at least that is how you can read a GZ file on the fly in Matlab/Octave. Also:
If you have a Java stream class that decompresses another format (try e.g. Apache Commons Compress), you can read it the same way. You could read bzip2 or xz files.
There are also classes to access archives, like zip files in the base Java distribution, or tar/RAR/7z and more in Apache Commons Compress. These classes usually have some way of finding files stored within the archive, allowing you to open an input stream to them within the archive and read in the same way as above.
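For the bz2 files in the question, the same chain should work if the Apache Commons Compress jar has been added to the Java class path (e.g. with javaaddpath); only the decompressing stream changes. An untested sketch:

```matlab
% Assumes commons-compress-x.y.jar is on the Java class path (javaaddpath)
fileStr = javaObject('java.io.FileInputStream', 'file.bz2');
bz2Str  = javaObject('org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream', fileStr);
charStr = javaObject('java.io.InputStreamReader', bz2Str);
lines   = javaObject('java.io.BufferedReader', charStr);
curL = lines.readLine();
while ischar(curL)
    curX = textscan(curL, '%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
    % append curX to your output arrays as in the GZIP example above
    curL = lines.readLine();
end
lines.close();
```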
On a unix system I would use named pipes and do something like this:
system('mkfifo pipename');
system(['bzcat file.bz2 > pipename &']);
fileID = fopen('pipename', 'r');
X = textscan(fileID,'%d %d64 %s %f %d %f %f %d', 'delimiter', ',');
fclose(fileID);
system('rm pipename');

read a txt file to matrix and cellarray Matlab

I have a txt file with the entries below, and I would like to know how to get the numerical values from the second column through the last column into a matrix, and the first column into a cell array.
I've tried importdata and fscanf, and I don't understand what's going on.
CP6 7,2 -2,7 6,6
P5 -5,8 -5,9 5,8
P6 5,8 -5,9 5,8
AF7 -5,0 7,2 3,6
AF8 5,0 7,2 3,6
FT7 -7,6 2,8 3,6
This should give you what you want based on the text sample you supplied.
fileID = fopen('x.txt'); %open file x.txt
m=textscan(fileID,'%s %d,%d %d,%d %d,%d');
fclose(fileID); %close file
col1 = m{1,1}; %get first column into cell array col1
colRest = cell2mat(m(1,2:7)); %convert rest of columns into matrix colRest
Look up textscan for more info on reading specially formatted data.
This function should do the trick. It reads your file and scans it according to your pattern, then puts the first column in a cell array and the others in a matrix.
function [ C1,A ] = scan_your_txt_file( filename )
fid = fopen(filename,'rt');
C = textscan(fid, '%s %d,%d %d,%d %d,%d');
fclose(fid);
C1 = C{1};
A = cell2mat(C(2:size(C,2)));
end
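Usage, assuming the sample above is saved as x.txt:

```matlab
[labels, A] = scan_your_txt_file('x.txt');
% labels is a 6x1 cell array of the channel names ('CP6', 'P5', ...);
% A is a 6x6 int32 matrix holding the integer parts on either side of each comma
```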
Have you tried xlsread? It returns a numeric array and two non-numeric arrays.
[N,T,R]=xlsread('yourfilename.txt')
but your data is not comma delimited. It also looks like you are using a comma to represent a decimal point. Does this array have 7 columns or 4? Because I'm in the US, I'm going to assume you have paired coordinates, and that the comma is one kind of delimiter while the space is a second one.
So here is something kludgy, but it works.
%housekeeping
clc
%get name of raw file
d=dir('*22202740*.txt')
%translate from comma-as-decimal to period-as-decimal
fid = fopen(d(1).name,'r') %source
fid2= fopen('myout.txt','w+') %sink
while 1
tline = fgetl(fid); %read
if ~ischar(tline), break, end %end loop
fprintf(fid2,'%s\r\n',strrep(tline,',','.')) %write updated line to output
end
fclose(fid)
fclose(fid2)
%open, gulp, parse/store, close
fid3 = fopen('myout.txt','r');
C=textscan(fid3,'%s %f %f %f ');
fclose(fid3);
%measure waist size and height
[n,m]=size(C);
n=length(C{1});
%put in slightly more friendly form
temp=zeros(n,m);
for i=2:m
t0=C{i};
temp(:,i)=t0;
end
%write to excel
xlswrite('myout_22202740.xlsx',temp(:,2:end),['b1:' char(96+m) num2str(n)]);
xlswrite('myout_22202740.xlsx',C{1},['a1:a' num2str(n)])
%read from excel
[N,T,R]=xlsread('myout_22202740.xlsx')
If you want those commas to be decimal points, then that is a different question.
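If you do want them treated as decimal points, newer MATLAB releases can do the conversion during import: detectImportOptions and setvaropts expose a DecimalSeparator property (available since roughly R2016b; the option names here are from memory, so verify against your release's documentation):

```matlab
% Sketch: let readtable handle comma-as-decimal directly
opts = detectImportOptions('x.txt', 'FileType', 'text');
opts = setvaropts(opts, opts.VariableNames(2:end), 'DecimalSeparator', ',');
T = readtable('x.txt', opts); % first column as text, the rest as doubles
```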

Reading CSV with mixed type data

I need to read the following csv file in MATLAB:
2009-04-29 01:01:42.000;16271.1;16271.1
2009-04-29 02:01:42.000;2.5;16273.6
2009-04-29 03:01:42.000;2.599609;16276.2
2009-04-29 04:01:42.000;2.5;16278.7
...
I'd like to have three columns:
timestamp;value1;value2
I tried the approaches described here:
Reading date and time from CSV file in MATLAB
modified as:
filename = 'prova.csv';
fid = fopen(filename, 'rt');
a = textscan(fid, '%s %f %f', ...
'Delimiter',';', 'CollectOutput',1);
fclose(fid);
But it returns a 1x2 cell whose first element is a{1}='ÿþ2'; the others are empty.
I had also tried to adapt to my case the answers to these questions:
importing data with time in MATLAB
Read data files with specific format in matlab and convert date to matal serial time
but I didn't succeed.
How can I import that csv file?
EDIT: After the answer of @macduff, I tried copy-pasting the data reported above into a new file and using:
a = textscan(fid, '%s %f %f','Delimiter',';');
and it works.
Unfortunately that didn't solve the problem, because I have to process csv files that are generated automatically, and something about those files seems to be the cause of the strange MATLAB behavior.
What about trying:
a = textscan(fid, '%s %f %f','Delimiter',';');
For me I get:
a =
{4x1 cell} [4x1 double] [4x1 double]
So each element of a corresponds to a column in your csv file. Is this what you need?
Thanks!
Seems you're going about it the right way. The example you provide poses no problems here, I get the output you desire. What's in the 1x2 cell?
If I were you I'd try again with a smaller subset of the file, say 10 lines, and see if the output changes. If yes, then try 100 lines, etc., until you find where the 4x1 cell + 4x2 array breaks down into the 1x2 cell. It might be that there's an empty line or a single empty field or whatever, which forces textscan to collect data in an additional level of cells.
Note that 'CollectOutput',1 will collect the last two columns into a single array, so you'll end up with 1 cell array of 4x1 containing strings, and 1 array of 4x2 containing doubles. Is that indeed what you want? Otherwise, see @macduff's post.
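One more thing worth checking: the 'ÿþ' prefix in a{1} is the UTF-16LE byte-order mark, which suggests the automatically generated files are UTF-16 encoded rather than plain ASCII, so textscan is seeing raw bytes instead of characters. A possible workaround (hedged, untested) is to decode the bytes first and then use the string form of textscan:

```matlab
fid = fopen('prova.csv', 'r');
raw = fread(fid, '*uint8')';                  % read the raw bytes
fclose(fid);
txt = native2unicode(raw(3:end), 'UTF-16LE'); % skip the 2-byte BOM (0xFF 0xFE)
a = textscan(txt, '%s %f %f', 'Delimiter', ';', 'CollectOutput', 1);
```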
I've had to parse large files like this, and I found I didn't like textscan for this job. I just use a basic while loop to parse the file, and I use datevec to extract the timestamp components into a 6-element time vector.
%% Optional: initialize for speed if you have large files
n = 1000; %% <# of rows in file - if known>
timestamp = zeros(n,6);
value1 = zeros(n,1);
value2 = zeros(n,1);
fid = fopen(fname, 'rt');
if fid < 0
    error('Error opening file %s\n', fname); % exit point
end
cntr = 0;
while true
    tline = fgetl(fid); %% get one line
    if ~ischar(tline), break; end % break out of loop at end of file
    cntr = cntr + 1;
    splitLine = strsplit(tline, ';'); %% split the line on ; delimiters
    timestamp(cntr,:) = datevec(splitLine{1}, 'yyyy-mm-dd HH:MM:SS.FFF'); %% using datevec to parse time gives you a standard timestamp vector
    value1(cntr) = str2double(splitLine{2}); %% strsplit returns strings, so convert
    value2(cntr) = str2double(splitLine{3});
end
fclose(fid);
%% Trim unused preallocated rows, then concatenate if you like
timestamp = timestamp(1:cntr,:);
value1 = value1(1:cntr);
value2 = value2(1:cntr);
result = [timestamp value1 value2];

Fastest Matlab file reading?

My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I be using instead?
Catch: I have to operate on each line one at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.
fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
%do stuff with nums
tline = fgetl(fid);
end
fclose(fid);
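For a minimal drop-in change, replacing str2num with sscanf and an explicit format already helps a great deal, since str2num calls eval internally:

```matlab
fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline, '%d %d'); % explicit format avoids str2num's eval overhead
    % do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);
```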
Problem statement
This is a common struggle, and there is nothing like a test to answer it. Here are my assumptions:
A well formatted ASCII file, containing two columns of numbers. No headers, no inconsistent lines etc.
The method must scale to reading files that are too large to be contained in memory (although my patience is limited, so my test file is only 500,000 lines).
The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time, cannot be vectorized.
Discussion
With that in mind, the answers and comments seem to be encouraging efficiency in three areas:
reading the file in larger batches
performing the string to number conversion more efficiently (either via batching, or using better functions)
making the actual processing more efficient (which I have ruled out via rule 3, above).
Results
I put together a quick script to test out the ingestion speed (and consistency of result) of 6 variations on these themes. The results are:
Initial code. 68.23 sec. 582582 check
Using sscanf, once per line. 27.20 sec. 582582 check
Using fscanf in large batches. 8.93 sec. 582582 check
Using textscan in large batches. 8.79 sec. 582582 check
Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
Using java single item token scanner. 81.19 sec. 582582 check
Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)
Summary
More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
About another 2/3 of the remaining time (27 -> 8 sec) can be reduced by using larger batches for both file reading and string to number conversions.
If we are willing to violate rule number three in the original post, another 7/8 of the time can be reduced by switching to fully numeric processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)
Finally, in direct contradiction to a previous edit of mine within this response, no savings are available by switching to the available buffered Java single-line readers. In fact that solution is 2-3 times slower than the comparable single-line result using native readers (63 vs. 27 seconds).
Sample code for all of the solutions described above are included below.
Sample code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
d = randi(1e6, 1e5,2);
fprintf(fid, '%d, %d \n',d);
end
fclose(fid);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = str2num(tline);
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
for ix = 1:size(scannedData{1},1)
nums = [scannedData{1}(ix) scannedData{2}(ix)];
CHECK = round((CHECK + mean(nums) ) /2);
end
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
for ix = 1:size(scannedData,1)
nums = scannedData(ix,:);
CHECK = round((CHECK + mean(nums) ) /2);
end
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader = java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
nums = sscanf(tline,'%d, %d');
CHECK = round((CHECK + mean(nums) ) /2);
tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
nums = [scanner.nextInt() scanner.nextInt()];
CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner. %3.2f sec. %d check \n', t, CHECK);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char'); %This can be slightly optimized
end
data = [dataBatch dataIncrement];
while ~isempty(data)
scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
CHECK = round((CHECK + mean(scannedData(:)) ) /2);
dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
end
data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations. %3.2f sec. %d check \n', t, CHECK);
(original answer)
To expand on the point made by Ben ... your bottleneck will always be file I/O if you are reading these files line by line.
I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to get a round number of lines, and then run your string parsing (e.g. sscanf).
Then if you want you can process the resulting large matrix one row at a time, before repeating the process until you read the end of the file.
It's a little bit tedious, but not that hard. I typically see 90% plus improvement in speed over single line readers.
(terrible idea using Java batched line readers removed in shame)
I have had good results (speedwise) using memmapfile(). This minimises the amount of in-memory data copying and makes use of the kernel's I/O buffering. You need enough free address space (though not necessarily free memory) to map the entire file, and enough free memory to hold the output variable (obviously!)
The example code below reads a text file into a two-column matrix data of int32 type.
fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')
You may need to fiddle with the parameters to textscan() to get exactly what you want - see the online docs.
Even if you can't fit the whole file in memory, you should read a large batch using the matrix read functions.
Maybe you can even use vector operations for some of the data processing, which would speed things along further.
I have found that MATLAB reads csv files significantly faster than text files, so if it's possible to convert your text file to csv using some other software, it may significantly speed up Matlab's operations.

Matlab: how handle abnormal data files

I am trying to import a large number of files into Matlab for processing. A typical file would look like this:
mass intensity
350.85777 238
350.89252 3094
350.98688 2762
351.87899 468
352.17712 569
352.28449 426
Some text and numbers here, describing the experimental setup, eg
Scan 3763 # 81.95, contains 1000 points:
The numbers in the two columns are separated by 8 spaces. However, sometimes the experiment will go wrong and the machine will produce a datafile like this one:
mass intensity
Some text and numbers here, describing the experimental setup, eg
Scan 3763 # 81.95, contains 1000 points:
I found that using space-separated files with a single header row, ie
importdata(path_to_file,' ', 1);
works best for the normal files. However, it totally fails on all the abnormal files. What would the easiest way to fix this be? Should I stick with importdata (already tried all possible settings, it just doesn't work) or should I try writing my own parser? Ideally, I would like to get those values in a Nx2 matrix for normal files and [0 0] for abnormal files.
Thanks.
I don't think you need to create your own parser, nor is this all that abnormal. Using textscan is your best option here.
fid = fopen('input.txt', 'rt');
data = textscan(fid, '%f %u', 'Headerlines', 1);
fclose(fid);
mass = data{1};
intensity = data{2};
Yields:
mass =
350.8578
350.8925
350.9869
351.8790
352.1771
352.2845
intensity =
238
3094
2762
468
569
426
For your 1st file and:
mass =
Empty matrix: 0-by-1
intensity =
Empty matrix: 0-by-1
For your empty one.
By default, textscan treats whitespace as a delimiter, and it only reads what you tell it to for as long as it can; thus it ignores the final lines in your file. You can also run a second textscan after this one if you want to pick up those additional fields:
fid = fopen('input.txt', 'rt');
data = textscan(fid, '%f %u', 'Headerlines', 1);
mass = data{1};
intensity = data{2};
data = textscan(fid, '%*s %u %*c %f %*c %*s %u %*s', 'Headerlines', 1);
scan = data{1};
level = data{2};
points = data{3};
fclose(fid);
Along with your mass and intensity data gives:
scan =
3763
level =
81.9500
points =
1000
What do you mean by 'totally fails on abnormal files'?
You can check whether importdata finds any data using e.g.
>> imported = importdata(path_to_file,' ', 1);
>> isfield(imported, 'data')
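Putting that together, a small wrapper (a sketch; the function name is made up, and it returns [0 0] for abnormal files as the question requested) could look like:

```matlab
function M = read_experiment_file(path_to_file)
% Returns an Nx2 matrix for normal files and [0 0] for abnormal ones.
imported = importdata(path_to_file, ' ', 1);
if isfield(imported, 'data') && ~isempty(imported.data)
    M = imported.data;  % numeric block found: mass/intensity columns
else
    M = [0 0];          % abnormal file: importdata found no numeric data
end
end
```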