Change default NaN representation of fprintf() in Matlab

I am trying to export data from Matlab in a format that another application will understand... For that I need to change the NaN, Inf and -Inf strings (which Matlab prints by default for such values) to //m, //inf+ and //inf-.
In general I DO KNOW how to accomplish this. I am asking how (and whether it is possible) to exploit one particular thing in Matlab. The actual question is located in the last paragraph.
There are two approaches that I have attempted (code below).
1. Use sprintf() on the data and strrep() the output. This is done line by line in order to save memory. This solution takes almost 10 times longer than a plain fprintf(). The advantage is its low memory overhead.
2. Same as option 1, but the translation is done on the whole data at once. This solution is much faster, but vulnerable to out-of-memory errors. My problem with this approach is that I do not want to duplicate the data unnecessarily.
Code:
rows = 50000;
cols = 40;
data = rand(rows, cols); % generate random matrix
data([1 3 8]) = NaN; % insert some NaN values
data([5 6 14]) = Inf; % insert some Inf values
data([4 2 12]) = -Inf; % insert some -Inf values
fid = fopen('data.txt', 'w'); %output file
%% 0) Write data using default fprintf
format = repmat('%g ', 1, cols);
tic
fprintf(fid, [format '\n'], data');
toc
%% 1) Using strrep, writing line by line
fprintf(fid, '\n');
tic
for i = 1:rows
    fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data(i, :)), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
end
toc
%% 2) Using strrep, writing all at once
fprintf(fid, '\n');
format = [format '\n'];
tic
fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data'), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
toc
Output:
Elapsed time is 1.651089 seconds. % Regular fprintf()
Elapsed time is 11.529552 seconds. % Option 1
Elapsed time is 2.305582 seconds. % Option 2
Now to the question...
I am not satisfied with the memory overhead and time lost using my solutions in comparison with simple fprintf().
My rationale is that the 'NaN', 'Inf' and '-Inf' strings are simply data stored in some variable inside the *printf() or *2str() implementation. Is there any way to change their value at runtime?
For example, in C# I would change System.Globalization.CultureInfo.NumberFormat.NaNSymbol, etc., as explained here.

In the limited case mentioned in the comments, where some (unknown, changing per data set) columns may be entirely NaN (or Inf, etc.) but there are no unwanted NaN values otherwise, another possibility is to check the first row of data, assemble a format string that writes the //m strings directly, and tell fprintf to ignore the columns that contain NaN or other unwanted values.
y = ~isnan(data(1,:)); % find all non-NaN
format = sprintf('%d ',y); % print a 1/0 string
format = strrep(format,'1','%g');
format = strrep(format,'0','//m');
fid = fopen('data.txt', 'w');
fprintf(fid, [format '\n'], data(:,y)'); %pass only the non-NaN data
fclose(fid);
By my check, with two columns of NaN this fprintf is pretty much as fast as your "regular" fprintf and quicker than the loop - not taking into account the initialisation step of producing format. It would be fiddlier to set it up to automatically produce the format string if you also have to take +/- Inf into account, but it is certainly possible (see the sketch after the example below). There is probably a cleaner way of producing format as well.
How it works:
You can pass in a subset of your data, and you can also insert any literal text you like into a format string. So if every row has the same desired "text" in the same spot (in this case, NaN columns and our desired replacement for "NaN"), we can put the text we want in that spot and simply not pass those parts of the data to fprintf in the first place. A simpler example for trying out on the command line:
x = magic(5);
x(:,3)=NaN
sprintf('%d %d ihatethrees %d %d \n',x(:,[1,2,4,5])')
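For the Inf-aware version mentioned above, here is a minimal sketch (untested against the timings; it assumes, as in the comments, that a column containing NaN or Inf in its first row contains it throughout, and it uses strjoin, available since R2013a):
row1 = data(1,:);
keep = isfinite(row1);                  % columns whose values we actually print
tok  = repmat({'%g'}, 1, numel(row1));  % one conversion token per column
tok(isnan(row1))  = {'//m'};            % literal text instead of a conversion
tok(row1 == Inf)  = {'//inf+'};
tok(row1 == -Inf) = {'//inf-'};
fmt = [strjoin(tok, ' ') '\n'];         % e.g. '%g //m %g //inf+ ... \n'
fid = fopen('data.txt', 'w');
fprintf(fid, fmt, data(:, keep)');      % pass only the finite columns
fclose(fid);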

Related

Optimizing reading the data in Matlab

I have a large data file with text formatted as a single column of n rows. Each row is either a real number or the string No data. I have imported this text as an n-by-1 cell array named Data. Now I want to filter the data and create an n-by-1 numeric array with NaN values instead of No data. I have managed to do it using a simple loop (see below); the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
    if Data{i}(1) ~= 'N'          % numeric row
        z(i) = str2double(Data{i});
    else                          % 'No data' row
        z(i) = NaN;
    end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc.):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break, since you are working with a single column... this prevents No data from producing two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.
filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val < 0.01
        fwrite(fid, sprintf('%s\n','No Data'));
    else
        fwrite(fid, sprintf('%f\n', val*1000));
    end
end
fclose(fid);
%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc
%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc
%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow: even without converting the resulting char matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.
Depending on how big your files are and how often you read them, you might want to go beyond readtable, which can be quite slow.
EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.
The most efficient way I've found is to read the whole file as a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with fixed width, you can convert them from char to numbers much more efficiently than with str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
% Read all the content of a text file to a matrix as wide as the longest
% line on the file. Shorter lines are padded with blank spaces. New lines
% are not included in the output.
% New lines are identified by new line \n characters.
% Reading the whole file in a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Finding new lines positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;
% Calculating number of lines
nLines=length(linesEndPos);
% Calculating the width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;
% Number of characters per row including new lines
charsPerRow=max(linesWidth)+1;
% Initializing output var with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');
% Computing a logical index to all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);
% Indexes of all new lines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;
% Filling output matrix
txtMat(charIdx)=fileData(~newLines);
% Cropping the last row corresponding to the new line characters and transposing
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have fixed width, you can convert them to numbers far more efficiently than with str2num, with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
Here I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt this to your case.
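As a toy check of the trick (a made-up, fixed-width ddd.dd format, not the template above):
rows = ['123.45'; '007.80'; '950.01'];              % char matrix, fixed width
vals = (rows - '0') * [1e2; 1e1; 1; 0; 1e-1; 1e-2]; % the '.' column gets weight 0
% vals is [123.45; 7.80; 950.01]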
And to finish you need to set the NaN values with:
values(NoData)=NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fixed-width numbers you can still do it this way by adding a couple of lines to count the number of digits and find the place of the decimal point before doing the conversion (a rough sketch below). That will slow things down a little bit, but I think it will still be faster.
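A rough sketch of that variable-width variant (it assumes one number per row of txtMat, right-padded with blanks, and implicit expansion, so R2016b+; the No data rows produce garbage here but are overwritten with NaN as above anyway):
w      = size(txtMat, 2);
digits = txtMat - '0';
digits(txtMat == ' ' | txtMat == '.' | txtMat == '-') = 0; % non-digits contribute 0
[~, dotCol] = max(txtMat == '.', [], 2);  % decimal point column of each row
expo = dotCol - (1:w);                    % 10^expo weight for each character
expo(expo > 0) = expo(expo > 0) - 1;      % digits left of the dot
values = sum(digits .* 10.^expo, 2);
neg = txtMat(:,1) == '-';                 % crude sign handling
values(neg) = -values(neg);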

Plot in MATLAB using data from a txt file

I need to read data from a file and plot a graph with its data. The problem is:
(1) I can't change the format of data in the file
(2) The format contains information and characters that I don't know how to deal with.
Here is a part of the data file, it's in a txt format:
Estation;Date;Time;Temp1;Temp2;Pressure;
83743;01/01/2016;0000;31.9;25.3;1005.1;
83743;01/01/2016;1200;31.3;26.7;1005.7;
83743;01/01/2016;1800;33.1;25.4;1004.3;
83743;02/01/2016;0000;26.1;24.2;1008.6;
What I'm trying to do is to plot the Date and Time against Temp1 and Temp2, not worrying about Pressure. The first column can be neglected as well. How can I extract the Date, Time and Temps into a matrix so I can plot them? All I did so far was this:
fileID = fopen('teste.txt','r');
[A] = fscanf(fileID, ['%d' ';']);
fclose(fileID);
disp(A);
Which just reads the first value, 83743.
To build on m7913d's answer:
fileID = fopen('MyFile.txt','r');
A = fscanf(fileID, ['%s' ';']); % read the header line
B = fscanf(fileID, '%d;%d/%d/%d;%d;%f;%f;%f;', [8,inf]); % read all the data into B (the date is parsed into three columns)
fclose(fileID);
B = B.'; % transpose B
% C is just for verification, can be omitted
C = datetime([B(:,4:-1:2) B(:,5)/100 zeros(numel(B(:,1)),2)]); % [y m d h mi s] date vectors
D = datenum(C); % Get the date in a MATLAB usable format
Titles = strsplit(A,';'); % Get column names
figure;
hold on % hold the figure for multiple plots
plot(D,B(:,6),'r')
plot(D,B(:,7),'b')
datetick('x') % Set a date tick as axis
legend(Titles{4},Titles{5}); % uses titles for legend
Note the parsing of the date into C: first the date, given by you in dd/MM/yyyy order, which I flip to the conventional yyyy-MM-dd order; then your hour, which needs to be divided by 100; then a 0 for both minutes and seconds. You might need to rip those apart when you don't have exactly hourly data (see the sketch below). Finally, transform to a regular datenum, which MATLAB can use for processing.
Which results in: [figure: Temp1 (red) and Temp2 (blue) plotted against the date axis]
You might want to play around with the datetick format, as it's got lots of options which might appeal to you.
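If your times are not whole hours, a sketch of splitting the packed HHMM value instead (same B layout as above; hrs/mins are names I made up):
hhmm = B(:,5);                 % e.g. 1230 for 12:30
hrs  = floor(hhmm/100);
mins = mod(hhmm, 100);
C = datetime([B(:,4:-1:2) hrs mins zeros(size(hhmm))]);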
fileID = fopen('input.txt','r');
[A] = fscanf(fileID, ['%s' ';']); % read the header line
[B] = fscanf(fileID, '%d;%d/%d/%d;%d;%f;%f;%f;', [8,inf]); % read all the data into B (the date is parsed into three columns)
fclose(fileID);
disp(B');
Note that %d reads an integer (not a double) and %f reads a floating point number.
See fscanf for more details.

str2num and importing data for large matrix

I have a large matrix in an xlsx file which contains chars, for example:
1,26:45:32.350,6,7,8,9,9,0,0,0
1,26:45:32.409,5,7,8,9,9,0,75,89
I want to make the 2nd column (the one which contains 26:45:32.350) a time vector, and all the rest a double matrix.
I tried the following code on about 50,000 rows and it worked.
[FileName PathName] = uigetfile('*.xlsx','XLSX Files');
fid = fopen(FileName);
T=char(importdata(FileName));
Time=T(:,5:16);
Data=str2double(T);
However, when I tested it on the whole matrix (about 500,000 rows), I received Data=[] instead of a matrix.
Is there anything else I can do so that Data will be a double matrix even for the large input?
The excel file contains 1 column and around 500,000 rows, so the whole line 1,26:45:32.350,6,7,8,9,9,0,0,0 is inside one cell.
Also, I wrote another code, which works but takes a lot of time to run.
[FileName PathName] = uigetfile('*.xlsx','XLSX Files');
fid = fopen(FileName);
T=importdata(FileName);
h = waitbar(0,'Converting Data to cell array, please wait...');
for i=1:length(T)
    delimiter_index=[0 find(T{i,1}(:)==char(44))']; % char(44) is ','
    for j=1:length(delimiter_index)-1
        Data{i,j}=T{i,1}(delimiter_index(j)+1:delimiter_index(j+1)-1);
    end
    waitbar(i/length(T));
end
close(h)
h = waitbar(0,'Separating Data to time and data, please wait...');
for i=1:length(T)
    Full_Time(i,:)=Data{i,2};
    Data{i,2}=Data{i,1};
    Data{i,1}=Full_Time(i,:);
    waitbar(i/length(T));
end
close(h)
Data(:,1)=[];
h = waitbar(0,'Changing data cell to mat, please wait...');
for i=1:size(Data,1)
    for j=1:size(Data,2)
        Matrix(i,j)=str2num(Data{i,j});
    end
    waitbar(i/size(Data,1));
end
close(h)
Running this code for about 20,000 rows shows the following (slowest to fastest):
waitbar
allchild
str2num
importdata
So basically I can remove this waitbar, but allchild (not sure what it is) and str2num take most of the time.
Is there anything I can do to make it run faster?
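One direction that might be worth trying (a sketch, assuming the sheet can be exported to a plain text file with one 1,26:45:32.350,... line per row): textscan can split on the commas in a single pass, keeping the time as text and everything else as numbers.
fid = fopen('data.csv', 'rt'); % hypothetical plain-text export of the sheet
C = textscan(fid, '%f %s %f %f %f %f %f %f %f %f', 'Delimiter', ',');
fclose(fid);
Time = C{2};          % cell array of 'HH:MM:SS.FFF' strings
Data = [C{[1 3:10]}]; % numeric matrix of the remaining nine columns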

Performance issues by processing huge textfiles

I am facing the problem of extracting data from a text file which has both numbers and characters in it. The data I want (the numbers) are separated by rows of characters describing the following dataset. The text file is rather large (>2,000,000 lines).
I try to put every dataset (the rows of numbers between two character rows) into a matrix. The matrix should be named according to the description (frequency) in the text line above each dataset. I have working code, but I face performance problems: one file currently takes about 15 minutes. Maybe someone can help me to speed it up. I need the numbers in matrices to process them further.
Snippet out of Textfile:
21603 2135 21339 21604
103791 94 1 1 1 4
21339 1702 21600 21604
-1
-1
2414
1
Velocity (magnitude) Response at Structural FE Nodes
1
Frequency = 10.00 Hz
Result = Engineering Units
Component = Vmag
Location =
Form & Units = RMS Magnitude in m/s
1 5 1 11 2 1
1 0 1 1 1 0 0 0
1 2161
0.00000e+000 1.00000e+001 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
20008
1.23285e-004
20428
1.21613e-004
Here is my code:
file='large_file.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate matrix where the textfile is saved
for count=1:8 % get rid of first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=str2double(strread(a,'%s')); % turn the read row into a vector
    if isnan(b(1)) % check whether there are characters in the row
        if strfind(a,'Frequency') % check if 'Frequency' is in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            name=b(3);
            for count=1:10 % get rid of next 10 lines
                fgets(fid);
            end
            start=k+1;
        end
    else % if there are just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b; % populate matrix A with the row entries
        k = k+1;
    end
    k/filerows % show progress
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:end,:);',name,start);
eval(Matrixname);
Reading text files line by line is very time consuming, especially in Matlab. When I have to read in text files, I usually read the entire file at once. You may be limited by memory, so read it in the largest chunks your machine can handle. Once it's all in memory, use some kind of logical indexing to find the parts of the data you're interested in. Again, in Matlab, for and while loops are very slow. For the data set you have there, I would do the following:
fid = fopen(file);
data = fread(fid,[1 maxBytes],'char=>char'); % maxBytes: largest chunk that fits in memory
blockIndices = strfind(data,'Velocity'); % calculate offsets based on data format
% Another approach much faster than for loops
lineData = regexp(data,sprintf('\n'),'split'); % now each line is in a cell
processedData = cellfun(@processData,lineData,'Uniform',false);

function y = processData(x)
    % do something with x
end
Once I had the block indices I could calculate offsets to the parts of the data I want. I don't think that two million lines is that much data, and most computers these days have multiple gigabytes of memory, and it doesn't look like each line is more than a couple hundred characters, so the file is probably less than half a GB. Good luck.
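As an illustration of the offset idea, a sketch against the snippet above (it assumes each block header contains a 'Frequency = ...' line as shown):
freqIdx = strfind(data, 'Frequency =');                    % start of each header
freqs = zeros(size(freqIdx));
for n = 1:numel(freqIdx)
    chunk = data(freqIdx(n)+11 : min(end, freqIdx(n)+30)); % text after the '='
    freqs(n) = sscanf(chunk, '%f', 1);                     % e.g. 10.00
end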
Using the matlab profiler will help you see which lines of code are taking the most amount of time so that you can figure out what to optimize.
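For reference, a minimal profiling session looks something like this (parse_file is a placeholder for your script or function):
profile on
parse_file();    % run the code you want to analyze
profile viewer   % stops profiling and opens the per-line time report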
As the original poster determined, the line causing trouble in this case was
k/filerows % show progress
Printing to the screen many times is very time consuming. If you want to show progress without slowing the code way down, you could do
if mod(k, round(filerows/100)) == 0
    fprintf('%d rows processed\n', k);
end
That code will cause an update to be displayed 100 times, or every 3.5 seconds in that particular case.
If you want to get really fancy, check out waitbar, but it is usually overkill.
Finally I got the sscanf solution to work. I used that function to replace str2double to gain some speed, as suggested in Why is str2double so slow in matlab as compared to a mex-function?.
Sadly it didn't do too much, but at least it helped a bit.
So, the starting point was ca. 850s.
Profiler after removing the progress display: ca. 450s.
Profiler after replacing str2double with sscanf: ca. 330s.
The code now is:
file='test.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate matrix where the textfile is saved
for count=1:8 % get rid of first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=strread(a,'%s');
    b=sscanf(sprintf('%s#', b{:}), '%g#')';
    if isempty(b) % check whether there had been characters in the row
        if strfind(a,'Frequency') % check whether 'Frequency' was in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            b=str2double(strread(a,'%s'));
            name=b(3);
            for count=1:8 % get rid of next 8 lines
                fgets(fid);
            end
            start=k+1;
        end
    else % if there were just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b; % populate matrix A with the row entries
        k = k+1;
    end
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);

Efficiently write two arrays to a file in Matlab

As per the title, I wish to write two column vectors to a file (format: 3 digit hex, tab, 4 digit hex). I think I can do it with the following:
for i = 1:size(imgA,1)
    fprintf(fid, '%03X %04X \n', imgA(i,1), imgB(i,1));
end
For a large vector this takes a long time; I'm sure there is probably a better way of doing this?
I thought about restructuring the two arrays into one (interleaving every second entry) and writing it out in one go - but I can't seem to get that to work
Thanks!
This seemed to work for my needs, and I think it is faster:
fprintf(fid, '%03X %04X \n', [imgA(:,1), imgB(:,1)]');
(The transpose of the horizontal concatenation is what interleaves the pairs; fprintf(fid, '%03X %04X \n', [imgA(:,1); imgB(:,1)]) would instead write all of imgA before any of imgB.)
I would try dlmwrite:
delim = repmat(' ', length(imgA(:,1)), 1);
output = [imgA(:,1), delim, imgB(:,1)];
dlmwrite('test.txt', output, '');
Hex formats are already strings, so it's simple, but you need a delimiter matrix to get some space between your vectors.
Example:
tic
A = dec2hex( randi(10000,10000,1) );
B = dec2hex( randi(10000,10000,1) );
delim = repmat(' ',length(A),1);
output = [A, delim, B];
dlmwrite('test.txt',output,'');
toc
quite fast I guess:
Elapsed time is 0.860588 seconds.
For 100000 elements:
Elapsed time is 8.652231 seconds.
so the time obviously scales linearly with the number of elements. I don't know if it is ultimately faster than fprintf.
If you didn't have the hex format but plain decimal numbers, this approach would definitely be faster:
A = randi(10000,100000,1) ;
B = randi(10000,100000,1) ;
C = [A, B];
save('test.txt','C','-tabs','-ascii');
assuming A and B are column vectors; otherwise transpose them.
Elapsed time is 0.155126 seconds.