Textscan on file with large number of lines - matlab

I'm trying to analyze a very large file using textscan in MATLAB. The file in question is about 12 GB in size and contains about 250 million lines with seven (floating-point) numbers in each, delimited by whitespace. Because this obviously would not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time; according to the documentation this should allow for processing "arbitrarily large delimited text file[s]". However, this only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).
To debug, I attempted to go to several positions in the file using the fseek function, for example like this:
fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');
I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I replace fileInfo.bytes/10 with fileInfo.bytes/2 (i.e. moving it to about 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.
I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!
EDIT
The relevant part of my code used to look like this:
while ~feof(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
    %// do some stuff
end
First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:
i = 0;
while ~feof(fid)
    frewind(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' ', 'HeaderLines', i*nLines);
    %// do some stuff
    i = i + 1;
end
This seems to read in the data without producing errors; it is, however, incredibly slow.
I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail, for some reason I don't yet understand.

Using fseek to move the pointer in a file is only a good idea when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length, but on a text file it is more dangerous and confusing than anything (unless you are absolutely sure that each line is the same size and each element on the line is at the exact same place/column, but that doesn't happen often).
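For illustration, a common workaround when you really must fseek into a text file is to discard the (most likely partial) first line with fgetl before scanning, so textscan starts on a line boundary. A minimal sketch, assuming a hypothetical data.txt of whitespace-delimited numeric lines:

```matlab
% Sketch: jump into a text file, then resync to a line boundary.
fid = fopen('data.txt', 'r');
fseek(fid, 1000, 'bof');   % jump 1000 bytes in -- probably lands mid-line
fgetl(fid);                % discard the partial line to resync
M = textscan(fid, '%f %f %f %f %f %f %f', 'Delimiter', ' ');
fclose(fid);
```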
There are several ways to read a text file block by block:
1) Use the HeaderLines option
To simply skip a block of lines on a text file, you can use the HeaderLines parameter of textscan, so for example:
readFormat = '%f %f %f %f %f %f %f' ; %// read format specifier
nLines = 10000 ; %// number of lines to read per block
fileInfo = dir(fileName);
%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines, 'Delimiter', ' '); %// read the first 10000 lines
fclose(fid);
%// Now do something with your "M" data
Then when you want to read the second block:
%// later read the SECOND block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines, 'Delimiter', ' ', 'HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid);
And if you have many blocks, for the Nth block, just adapt:
%// and then for the Nth BLOCK block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines, 'Delimiter', ' ', 'HeaderLines', (N-1)*nLines);
fclose(fid);
If necessary (if you have many blocks), just code this last version in a loop.
Note that this is good if you close your file after each block reading (so the file pointer will start at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want files that stay open too long, or to lose the fid if you crash).
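Coding that last version in a loop could look like the sketch below (the block count nBlocks is an assumption; fileName, readFormat and nLines are as defined above):

```matlab
% Sketch: read the file block by block, reopening it each time.
nBlocks = 25 ;   % assumption: however many blocks your file holds
for N = 1:nBlocks
    fid = fopen(fileName);
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ', 'HeaderLines', (N-1)*nLines);
    fclose(fid);
    %// do something with block "M" here
end
```

Keep in mind that each iteration re-skips all the previous lines, so this gets slower as N grows (which matches the slowdown described in the question's edit).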
2) Read by block (without closing the file)
If the processing of the block is quick and safe enough that you're sure it won't bomb out, you can afford not to close the file. In this case, the textscan file pointer will stay where you stopped, so you could also:
read a block (do not close the file): M = textscan(fid, readFormat, nLines)
Process it then save your result (and release memory)
read the next block with the same call: M = textscan(fid, readFormat, nLines)
In this case you wouldn't need the headerlines parameter because textscan will resume reading exactly where it stopped.
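The sequential reading described above could be sketched as a single loop over one open file handle (fileName, readFormat and nLines as defined earlier):

```matlab
% Sketch: sequential block reading; textscan resumes where it stopped.
fid = fopen(fileName);
while ~feof(fid)
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    %// process and save the results from "M" here, then release the memory
end
fclose(fid);
```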
3) use ftell and fseek
Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.
ftell returns the current position in an open file, so use it to record where you stopped reading, then use fseek the next time to go straight to that position. Something like:
%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
lastPosition = ftell(fid);
fclose(fid);
%// do some stuff
%// then read another block:
fid = fopen(fileName);
fseek(fid, lastPosition, 'bof');
M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
lastPosition = ftell(fid);
fclose(fid);
%// and so on ...
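Put in a loop, the ftell/fseek pattern could be sketched as follows (the empty-block check used as a stopping condition is an assumption):

```matlab
% Sketch: block reading with ftell/fseek, closing the file between blocks.
lastPosition = 0 ;
while true
    fid = fopen(fileName);
    fseek(fid, lastPosition, 'bof');
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    lastPosition = ftell(fid);
    fclose(fid);
    if isempty(M{1}), break; end   %// nothing left to read
    %// process block "M" here
end
```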

Related

strange behavior of the fscanf function

I am trying to read the information contained in a small configuration file with Matlab's fscanf function. The content of the file is;
YAcex: 1.000000
YOx: 1.000000
KAce: 1.000000
The matlab code used to parse the file is;
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n')
fscanf(fh, 'YOx: %f\n')
fscanf(fh, 'KAce: %f\n')
fclose(fh);
When this script is invoked, only the "YAcex" line is read correctly; fscanf returns [] for the two other lines. If the YOx and KAce lines are switched (KAce before YOx), all lines are read correctly by fscanf.
Can someone explain this behavior?
supplementary information: The linefeeds in the input file are simple linefeed (\n character, without \r character).
Your problem is that you only want to read one value per call to fscanf, but by default it tries to read as many values as possible. Note this excerpt from the documentation:
The fscanf function reapplies the format throughout the entire file and positions the file pointer at the end-of-file marker. If fscanf cannot match formatSpec to the data, it reads only the portion that matches and stops processing.
This means the first call correctly reads the first line of the file, but then tries to read the next line as well, finding no exact match to its format specifier. It finds a partial match for the next line, where the first Y of YOx: matches the beginning of YAcex: in the format specifier. This partial match places the file pointer directly after the Y in YOx:, causing the next call to fscanf to fail since it is starting at the Ox: .... We can illustrate this with ftell:
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n');
ftell(fh)
ans =
18 % The "O" is the 18th character in the file
When you switch the YOx: and KAce: lines, a partial match of the next line doesn't happen any more, so the file pointer ends up at the beginning of the next line every time and all the reads are successful.
So, how can you get around this? One option is to always specify the size argument so fscanf doesn't reapply the format specifier unnecessarily:
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n', 1);
fscanf(fh, 'YOx: %f\n', 1);
fscanf(fh, 'KAce: %f\n', 1);
fclose(fh);
Another option is to do this all in one line:
fh = fopen('parameters', 'r');
values = fscanf(fh, 'YAcex: %f\n YOx: %f\n KAce: %f\n');
fclose(fh);
And values will be a 3-by-1 array containing the 3 values from the file.
As you already realized, \r or \r\n could cause this kind of behavior. The likely reason is something similar: for example, there are some invisible characters, like a space, somewhere. You can debug this by reading everything in as uint8 and looking at the location where the problem occurs:
u8 = fread(fh, inf, '*uint8')';
One brute-force way to avoid this kind of issue is to read everything in as char and search for each keyword:
fh = fopen('parameters');
ch = fread(fh, inf, '*char')'; % read all as char
fclose(fh);
YAcex = regexp(ch, '(?<=YAcex:\s?)[\d\.]+', 'match', 'once'); % parse YAcex
You can parse others accordingly. The advantage of this is that it is less sensitive to a space somewhere, and the order of parameters does not matter.
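Parsing all three parameters "accordingly" could be sketched with a small loop over the field names (the names list and the struct output are assumptions; ch is the char buffer read above):

```matlab
% Sketch: extract every "name: value" pair from the char buffer.
names = {'YAcex', 'YOx', 'KAce'};
vals = struct();
for k = 1:numel(names)
    tok = regexp(ch, [names{k} ':\s*([-\d\.]+)'], 'tokens', 'once');
    vals.(names{k}) = str2double(tok{1});   % e.g. vals.YOx
end
```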

skip lines in txt file using textscan in matlab

I have a huge .txt file, parts of which I want to parse (using textscan). Say I have 10000 lines of data and a section that starts at line 300; the section also has a header of, say, 10 lines. How can I skip the first 300 lines (not using the HeaderLines option of textscan, of course, as then I wouldn't be able to get my actual 10-line header)? Or is there a way in which I can jump to line 300 and start textscan from there, as if line 301 were the first line?
So, assuming your data is generated by the following (since you have not mentioned how it's formatted in your question):
fid = fopen('datafile.txt', 'w');
for i = 1:300
    fprintf(fid, 'ignore%d\n', i);
end
for i = 301:310
    fprintf(fid, 'header%d\n', i);
end
for i = 311:10000
    fprintf(fid, '%d\n', i);
end
fclose(fid);
You can read the data by first using fgetl to advance to the 300th line, then using textscan twice to get the header info and the data. Basically, the thing to remember is that textscan works starting from the place where fid is pointing. So, if it's pointing to the 301st line, it'll start scanning from there. So, here's the code to read the file above, starting from line 301:
fid = fopen('datafile.txt', 'r');
for i = 1:300
    fgetl(fid);
end
scannedHeader = textscan(fid, '%s', 10);
scannedData = textscan(fid, '%d');
fclose(fid);
NB: if the data is always the same format, you can use ftell to know where to skip to exactly then use fseek to go to that offset.
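The ftell/fseek shortcut mentioned in the note could be sketched like this, assuming the file layout does not change between runs:

```matlab
% Sketch: record the byte offset of line 301 once...
fid = fopen('datafile.txt', 'r');
for i = 1:300
    fgetl(fid);
end
headerOffset = ftell(fid);   % byte offset where line 301 starts
fclose(fid);

% ...then, on a later read, jump straight there.
fid = fopen('datafile.txt', 'r');
fseek(fid, headerOffset, 'bof');
scannedHeader = textscan(fid, '%s', 10);
fclose(fid);
```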

Decimating large data files from disk in MATLAB?

I have very large data files (typically 30Gb to 60Gb) in .txt format. I want to find a way to to automatically decimate the files without importing them to memory first.
My .txt files consist of two columns of data, this is an example file:
https://www.dropbox.com/s/87s7qug8aaipj31/RTL5_57.txt
What I have done so far is to import the data into a variable "C" and then downsample the data. The problem with this method is that the variable "C" often fills the memory capacity of MATLAB before the program has a chance to decimate:
function [] = textscan_EPS(N, D, fileEPS)
%fileEPS: .txt address
%N: number of lines to read
%D: decimation factor
fid = fopen(fileEPS);
format = '%f\t%f';
C = textscan(fid, format, N, 'CollectOutput', true); % this variable exceeds memory capacity
d = downsample(C{1}, D);
plot(d);
fclose(fid);
end
How can I modify this line:
C = textscan(fid, format, N, 'CollectOutput', true);
so that it effectively decimates the data at this point, by importing every other line, or every 3rd line, etc., of the .txt file from disk into the variable "C" in memory?
Any help would be much appreciated.
Cheers,
Jim
PS
An alternative method that I have been playing around with uses fread, but it encounters the same problem:
function [d] = fread_EPS(N, D, fileEPS)
%N: number of lines to read
%D: decimation factor
%fileEPS: location of .txt file
%read in the data as characters
fid = fopen(fileEPS);
c = fread(fid, N*19, '*char'); % each line of the .txt has 19 characters
%parse the characters into floating-point numbers
f = sscanf(c, '%f');
%reshape the data into a two-column format
format long
d = decimate(flipud(rot90(reshape(f, 2, []))), D); %reshape into 2-column format, rotate 90, flip vertically, apply decimation factor
I believe that textscan is the way to go, however you may need to take an intermediate step. Here is what I would do assuming you can easily read N lines at a time:
1) Read in N lines with textscan(fileID, formatSpec, N)
2) Sample from these lines, store the result (file or variable) and drop the rest
3) As long as there are lines left, continue with step 1
4) Optional, depending on your storage method: combine all stored results into one big sample
It should be possible to just read 1 line each time, and decide whether you want to keep/discard it. Though this should consume minimal memory I would try to do a few thousand each time to get reasonable performance.
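The steps above could be sketched as a small function (the function name decimate_blocks is hypothetical, and simple array growth is used for brevity; N is the block size and D the decimation factor, as in the question):

```matlab
% Sketch: read N lines at a time, keep every D-th line, discard the rest.
function d = decimate_blocks(fileEPS, N, D)
    fid = fopen(fileEPS);
    d = [];
    while ~feof(fid)
        C = textscan(fid, '%f\t%f', N, 'CollectOutput', true);
        block = C{1};
        d = [d; block(1:D:end, :)];   %// keep only every D-th line of the block
    end
    fclose(fid);
end
```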
I ended up writing the code below based on Dennis Jaheruddin's advice. It appears to work well for large .txt files (10GB to 50Gb). The code is also inspired by another post:
Memory map file in MATLAB?
Nlines = 1e3; % set number of lines to sample per cycle
sample_rate = (1/1.5e6); % data sample rate
DECE = 1; % decimation factor
start = 40; % start of plot time
finish = 50; % end of plot time
TIME = (0:sample_rate:sample_rate*((Nlines)-1));
format = '%f\t%f';
fid = fopen('C:\Users\James Archer\Desktop/RTL5_57.txt');
while ~feof(fid)
    C = textscan(fid, format, Nlines, 'CollectOutput', true);
    d = C{1}; % immediately clear C; at this point you need the memory!
    clearvars C;
    TIME = ((TIME(end)+sample_rate):sample_rate:(sample_rate*(size(d,1)))+(TIME(end))); % shift TIME along
    if ((TIME(end)) > start) && ((TIME(end)) < finish)
        plot(TIME(1:DECE:end), d(1:DECE:end,:)) % plot and decimate
    end
    hold on;
    clearvars d;
end
fclose(fid);
Older versions of MATLAB do not process this code well; the following message appears:
Caught std::exception Exception message is: bad allocation
But MATLAB 2013 works just fine.

Textscan generates a vector twice the expected size

I want to load a csv file in a matrix using matlab.
I used the following code:
formatSpec = ['%*f', repmat('%f',1,20)];
fid = fopen(filename);
X = textscan(fid, formatSpec, 'Delimiter', ',', 'CollectOutput', 1);
fclose(fid);
X = X{1};
The csv file has 1000 rows and 21 columns.
However, the matrix X generated has 2000 rows and 20 columns.
I tried using different delimiters like '\t' or '\n', but it doesn't change.
When I displayed X, I noticed that it displayed the correct csv file but with extra rows of zeros every 2 rows.
I also tried adding the 'HeaderLines' parameters:
`X = textscan(fid, formatSpec1, 'Delimiter', '\n', 'CollectOutput', 1, 'HeaderLines', 1);`
but this time, the result is an empty matrix.
Am I missing something?
EDIT: #horchler
I could read with no problem the 'test.csv' file.
There is no extra comma at the end of each row. I generated my csv file with a python script: I read the rows of another csv file, modified these (selecting some of them and doing arithmetic operations on them) and wrote the new rows on another csv file. In order to do this, I converted each element of the first csv file into floats...
New Edit:
Reading the textscan documentation more carefully, I think the problem is that my input file is neither a text file nor a string, but a file containing floats.
EDIT: three lines from the file
0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,2
1,-0.3834323,-1.92452324171,-1.2453254094,0.43455627857,-0.24571121,0.4340657,1,1,0,0,0,0.3517396202,1,0,0,0.3558122164,0.2936975319,0.4105696144,0,1,0
-0.78676,-1.09767,0.765554578,0.76579043,0.76,1,0,0,323124.235998,1,0,0,0,1,0,0,1,0,0,0,2
How about using regex?
X = [];
fid = fopen(filename);
while 1
    fl = fgetl(fid);
    if ~ischar(fl), break, end
    r = regexp(fl, '([-]*\d+[.]*\d*)', 'match');
    r = r(1:21); % because your 2nd line somehow has 22 elements;
                 % all lines must have the same number of elements or an error
                 % will be thrown: "CAT arguments dimensions are not consistent."
    X = [X; r];
end
fclose(fid);
Using csvread to read a csv file seems a good option. However, I also tend to read csv files with textscan as files are sometimes badly written. Having more options to read them is therefore necessary.
I face a reading problem like yours when I think the file is written a certain way but it is actually written another way. To debug it I use fgetl and print, for each line read, both the output of fgetl and its double version (see the example below). Examining the double version, you may find which character causes a problem.
In your case, I would first look at multiple occurrences of delimiters (',' and '\t') and, in textscan, I would activate the option 'MultipleDelimsAsOne' (while turning off 'CollectOutput').
fid = fopen(filename);
tline = fgetl(fid);
while ischar(tline)
    disp(tline);
    double(tline)
    pause;
    tline = fgetl(fid);
end
fclose(fid);
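The 'MultipleDelimsAsOne' suggestion above could be sketched like this (formatSpec and filename as in the question; note that merging consecutive delimiters would also merge genuinely empty CSV fields, so this is only appropriate if the file has no empty fields):

```matlab
% Sketch: treat runs of commas/tabs as a single delimiter.
fid = fopen(filename);
X = textscan(fid, formatSpec, 'Delimiter', {',', '\t'}, ...
             'MultipleDelimsAsOne', true);
fclose(fid);
```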

Open text files in matlab and save them from matlab

I have a big text file containing data that needs to be extracted and inserted into a new text file. I possibly need to store this data in a cell array or matrix?
But for now, the question is that I am trying to test a smaller dataset, to check if the code below works.
I have a code in which it opens a text file, scans through it and replicates the data and saves it in another text file called, "output.txt".
Problem: it doesn't seem to save the file properly. It just shows an empty array in the text file, like this: "[]". The original text file just contains a string of characters.
%opens the text file and checks it line by line.
fid1 = fopen('sample.txt');
tline = fgetl(fid1);
while ischar(tline)
    disp(tline);
    tline = fgetl(fid1);
end
fclose(fid1);
% save the sample.txt file to a new text file
fid = fopen('output.txt', 'w');
fprintf(fid, '%s %s\n', fid1);
fclose(fid);
% view the contents of the file
type exp.txt
Where do i go from here ?
It's not good practice to read an input file by loading all of its contents into memory at once. This way, the file size you're able to read is limited by the amount of memory on the machine (or by the amount of memory the OS is willing to allocate to a single process).
Instead, use fopen and its related functions to read the file line-by-line or char-by-char.
For example,
fid1 = fopen('sample.txt', 'r');
fid = fopen('output.txt', 'w');
tline = fgetl(fid1);
while ischar(tline)
    fprintf(fid, '%s\n', tline);
    tline = fgetl(fid1);
end
fclose(fid1);
fclose(fid);
type output.txt
Of course, if you know in advance that the input file is never going to be large, you can read it all at once using textread or some equivalent function.
Try using textread: it reads data from a text file and stores it as a matrix or a cell array. At the end of the day, I assume you would want the data stored in a variable so you can manipulate it as required. Once you are done manipulating, open a file using fopen and use fprintf to write the data in the format you want.