strange behavior of the fscanf function - matlab

I am trying to read the information contained in a small configuration file with Matlab's fscanf function. The content of the file is;
YAcex: 1.000000
YOx: 1.000000
KAce: 1.000000
The matlab code used to parse the file is;
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n')
fscanf(fh, 'YOx: %f\n')
fscanf(fh, 'KAce: %f\n')
fclose(fh);
When this script is invoked, only the "YAcex" line is read correctly; fscanf returns [] for the two other lines. If the YOx and KAce lines are switched (KAce before YOx), all lines are read correctly by fscanf.
Can someone explain this behavior?
supplementary information: The linefeeds in the input file are simple linefeed (\n character, without \r character).

Your problem is that you only want to read one value per call to fscanf, but by default it tries to read as many values as possible. Note this excerpt from the documentation:
The fscanf function reapplies the format throughout the entire file and positions the file pointer at the end-of-file marker. If fscanf cannot match formatSpec to the data, it reads only the portion that matches and stops processing.
This means the first call correctly reads the first line of the file, but then tries to read the next line as well, finding no exact match to its format specifier. It finds a partial match for the next line, where the first Y of YOx: matches the beginning of YAcex: in the format specifier. This partial match places the file pointer directly after the Y in YOx:, causing the next call to fscanf to fail since it is starting at the Ox: .... We can illustrate this with ftell:
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n');
ftell(fh)
ans =
18 % The "O" is the 18th character in the file
When you switch the YOx: and KAce: lines, a partial match of the next line doesn't happen any more, so the file pointer ends up at the beginning of the next line every time and all the reads are successful.
So, how can you get around this? One option is to always specify the size argument so fscanf doesn't reapply the format specifier unnecessarily:
fh = fopen('parameters', 'r');
fscanf(fh, 'YAcex: %f\n', 1);
fscanf(fh, 'YOx: %f\n', 1);
fscanf(fh, 'KAce: %f\n', 1);
fclose(fh);
Another option is to do this all in one line:
fh = fopen('parameters', 'r');
values = fscanf(fh, 'YAcex: %f\n YOx: %f\n KAce: %f\n');
fclose(fh);
And values will be a 3-by-1 array containing the 3 values from the file.

As you already realized, \r or \r\n could cause this kind of behavior. The likely reason is similar to that, for example, there is some invisible characters like space somewhere. You can debug this by reading all as uint8, and take a look at the location where problem occurs:
u8 = fread(fh, inf, '*uint8')';
One stupid way to avoid this kind of issue is to read all as char, and search each keyword:
fh = fopen('parameters');
ch = fread(fh, inf, '*char')'; % read all as char
fclose(fh);
YAcex = regexp(ch, '(?<=YAcex:\s?)[\d\.]+', 'match', 'once'); % parse YAcex
You can parse others accordingly. The advantage of this is that it is less sensitive to a space somewhere, and the order of parameters does not matter.

Related

how to read binary format byte by byte in MATLAB

I have been struggling with this bug. When using MATLAB to read a binary file that contains three columns of numbers in float formats.
I am reading one number at a time using this line.
pt(j) = fread(fid,1,'float','a');
I have found that sometimes (rarely) MATLAB instead of reading four bytes for a float, it uses 5 bytes. And it misses up the rest of the reading. I am not sure if the file is corrupted or MATLAB has a bug there. When I printed the file as a txt and read it in txt everything works well.
As a work around here is what I did:
cur = ftell(fid);
if (cur - prev)~= 4
pt(j) = 0; % I m throwing this reading away for the sake of saving the rest of the data. This is not ideal
cur = prev +4;
fseek(fid, cur,'bof');
end
prev = cur;
I tried different combinations of different formats float32 float64 etc... nothing works MATLAB always read 5 bytes instead of 4 at this particular location.
EDIT:
To solve it based on Chris's answer. I was using this command to open the file.
fid = fopen(fname,'rt');
I replaced it with this
fid = fopen(fname,'r');
Sometimes, rarely, skipping a byte. It sounds to me like you are on Windows, and have opened the file in text mode. See the permissions parameter to the fopen function.
When opening a file in text mode on Windows, the sequence \r\n (13,10) is replaced with \n (10). This happens before fread gets to it.
So, when opening the file, don't do:
fid = fopen('name', 'rt');
The t here indicates "text". Instead, do:
fid = fopen('name', 'r');
To make this explicit, you can add b to the permissions. This is not documented, but is supposed to mean "binary", and makes the call similar to what you'd do in C or in the POSIX fopen():
fid = fopen('name', 'rb');

Read hex from file matlab

What I have
A txt file like:
D091B
E7E1F
20823
...
What I need
To read them and store them like char, just as they are in the file: N (don't knot how many) lines, with its 5 characters (5 columns) at each one.
What have I tried
fichero = fopen('PS.txt','r');
sizeDatos = [[] 5]; % Several Options, read below
resultados=fscanf(fichero, '%s', sizeDatos); % Here too
fclose(fichero);
I've tried with the snippet above, to read my txt file. However, I didn't manage to get it. Most I've obtained is, using:
sizeDatos = [1 Inf];
So I got all my hex characters into an array, with no spaces.
As you can see, I've tried several optios changing fscanf size parameter, as well as trying to say into the format chain that it should recognize new lines by using \n for example. None of them have worked for me.
Any idea about how can I get it? I've readed fscanf page from documentation, but it didn't inspire me to make anything different.
One possible solution is using textscan and convert it to a cell array.
fileId = fopen('PS.txt');
C = textscan(fileId, '%s');
Now to show the content of cell you can use
celldisp(C)
Or you can convert it to other types.
Don't forget to close your file after using it.

Textscan on file with large number of lines

I'm trying to analyze a very large file using textscan in MATLAB. The file in question is about 12 GB in size and contains about 250 million lines with seven (floating) numbers in each (delimited by a whitespace); because this obviously would not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation (i.e. loading and analyzing a smaller block of the file at a time. According to the documentation this should allow for processing "arbitrarily large delimited text file[s]"). This only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).
To debug, I attempted to go to several positions in the file using the fseek function, for example like this:
fileInfo = dir(fileName);
fid = fileopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');
I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I substitute fileInfo.bytes/10 by fileInfo.bytes/2 (i.e. moving it to about 50% of the file) everything breaks down and textscan only returns an empty 1x7 cell.
I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!
EDIT
The relevant part of my code used to look like this:
while ~feof(fid)
data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
%// do some stuff
end
First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:
i = 0;
while ~feof(fid)
frewind(fid)
data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
%// do some stuff
i = i + 1;
end
This seems to read in the data without producing errors; it is, however, incredibly slow.
I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail because of some reason I don't understand yet.
fseek a pointer in a file is only good when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length. But on a text file it is more dangerous and confusing than anything (unless you are absolutely sure that each line is the same size and each element on the line is at the same exact place/column, but that doesn't happen often).
There are several ways to read a text file block by block:
1) Use the HeaderLines option
To simply skip a block of lines on a text file, you can use the HeaderLines parameter of textscan, so for example:
readFormat = '%f %f %f %f %f %f %f' ; %// read format specifier
nLines = 10000 ; %// number of line to read per block
fileInfo = dir(fileName);
%// read FIRST block
fid = fileopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' '); %// read the first 10000 lines
fclose(fid)
%// Now do something with your "M" data
Then when you want to read the second block:
%// later read the SECOND block:
fid = fileopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid)
And if you have many blocks, for the Nth block, just adapt:
%// and then for the Nth BLOCK block:
fid = fileopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', (N-1)*nLines);
fclose(fid)
If necessary (if you have many blocks), just code this last version in a loop.
Note that this is good if you close your file after each block reading (so the file pointer will start at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want to have files which remain open too long or loose the fid if you crash).
2) Read by block (without closing the file)
If the processing of the block is quick and safe enough so you're sure it won't bomb out, you could afford to not close the file. In this case, the textscan file pointer will stay where you stopped, so you could also :
read a block (do not close the file): M = textscan(fid, readFormat, nLines)
Process it then save your result (and release memory)
read the next block with the same call: M = textscan(fid, readFormat, nLines)
In this case you wouldn't need the headerlines parameter because textscan will resume reading exactly where it stopped.
3) use ftell and fseek
Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.
ftell will return the current position in an open file, so use that to know at which position you stop reading last, then use fseek the next time to go straight at this position. Something like:
%// read FIRST block
fid = fileopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// do some stuff
%// then read another block:
fid = fileopen(fileName);
fseek( fid , 'bof' , lastPosition ) ;
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// and so on ...

Is there a way to only read part of a line of an input file?

I have a routine that opens a lookup table file to see if a certain entry already exists before writing to the file. Each line contains about 2,500 columns of data. I need to check the first 2 columns of each line to make sure the entry doesn't exist.
I don't want to have to read in 2,500 columns for every line just to check 2 entries. I was attempting to use the fscanf function, but it gives me an Invalid size error when I attempt to only read 2 columns. Is there a way to only read part of each line of an input file?
if(exist(strcat(fileDirectory,fileName),'file'))
fileID = fopen(strcat(fileDirectory,fileName),'r');
if(fileID == -1)
disp('ERROR: Could not open file.\n')
end
% Read file to see if line already exists
dataCheck = fscanf(fileID, '%f %f', [inf 2]);
for i=1:length(dataCheck(:,1))
if(dataCheck(i,1) == sawAnglesDeg(sawCount))
if(dataCheck(i,2) == sarjAnglesDeg(floor((sawCount-1)/4)+1))
% This line has already been written in lookup table
lineExists = true;
disp('Duplicate lookup table line found. Skipping...\n')
break;
end
end
end
fclose(fileID);
end
Well, not really.
You should be able in a loop to do an fscanf of the first two doubles, followed by a fgetl to read the rest of the line, i.e. on the form:
while there_are_more_lines
dataCheck = fscanf(fileID, '%f', 2);
fgetl(fileID); % Read remainder of line, discarding it
% Do check here for each line
end
Since it is a text file, you can not really skip reading characters from the file. For binary files you can do an fseek, which can jump round in the file based on a byte-count - it can be used if you know exactly where the next line starts (in byte-count). But for a text file you do not know that, since each line will vary in length. If you save the data in a binary file instead, it would be possible to do something like that.
What I would probably do: Create two files, the first one containing the two "check-values", that could be read in quickly, and the other one containing the 2500 columns of data, with or without the two "check-values". They should be updated synchronously; when adding one line to the first file, also one line is added to the second file.
And I would definitely make a checkData matrix variable and keep that in memory as long as possible; when adding a new line to the file, also update the checkData matrix, so you should only need to read the file once initially and use the checkData matrix for the rest of the life of your program.
With textscan you can skip fields, parts of fields, or even "rest of line", so I would do this (based on MATLAB help example slightly modified):
fileID = fopen('data.dat');
data = textscan(fileID,'%f %f %*[^\n]');
fclose(fileID);
Then check data (should be the two columns you want) to see if any of those rows matches the requirements.
As #Jesper Grooss wrote, there is no solution to skip the remaining of a line without reading it. In a single text file context, a fastest solution would probably consist of
reading the entire file with textscan (one line of text into one cell element of a matrix)
appending the new line to the matrix even if it is a duplicate entry
uniquing the cell matrix with unique(cellmatrix, 'rows')
appending the new line to the text file if it corresponds to a new entry
The uniquing step replaces the putatively costly for loop.

Confused with .tsv files in MATLAB (converting to a Matrix?)

I have a .tsv file that I wish to open in MATLAB, however I am having several problems with this.
I have tried the following
fid = fopen('data.tsv');
C = textscan(fid, ['%s' repmat('%f',1,8)], 'HeaderLines', 1);
fclose(fid);
and got some weird values that had nothing to do with my file. I also tried:
data = dlmread('data.tsv', '\t');
and got this
Error using dlmread (line 139)
Mismatch between file and format string.
Trouble reading number from file (row 1u, field 1u) ==> Participant Assessment
Experiment Block Trial
Answer Reaction Timestamp Free Response\n
Is there some way I can get it to ignore the header, or am I doing it totally wrong?
With dlmread you can specify where to start reading in the file. This is one of the few times that MATLAB indexing begins at 0 - [0,0] is the first row, first column. Therefore, to ignore the first row (containing your header):
data = dlmread('data.tsv','\t', 1, 0);
This will only work if all the values (other than the header lines you skip) are numeric.
Your example with textscan also looks fine to me (provided that the format supplied is correct and there is indeed only one header line). C will be a cell array; to obtain the data from each column use C{n} where n is the column number.
Rather than skipping the header line, it's sometimes useful to just read it in to a separate value:
fid = fopen('data.tsv');
C_header = textscan(fid, '%s',9);
C = textscan(fid, ['%s' repmat('%f',1,8)]);
fclose(fid);