Efficiently write two arrays to a file in Matlab - matlab

As per the title, I wish to write two column vectors to a file (format: 3 digit hex, tab, 4 digit hex). I think I can do it with the following:
for i=1:size(imgA,1)
    fprintf(fid, ['%03X %04X \n'], imgA(i,1), imgB(i,1));
end
For a large vector this takes a long time; I'm sure there is a better way of doing this?
I thought about restructuring the two arrays into one (interleaving every second entry) and writing it out in one go, but I can't seem to get that to work.
Thanks!

This seemed to work for my needs:
fprintf(fid, ['%03X %04X \n'], [imgA(:,1), imgB(:,1)]');
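The reason the one-liner works is that fprintf (like sprintf) walks through its numeric argument in column-major order and recycles the format string until the data runs out. A minimal sketch of that behaviour, using sprintf so the result can be inspected; the vectors a and b are made-up stand-ins for imgA(:,1) and imgB(:,1), and a real tab is used as the separator:

```matlab
% Hypothetical sample data standing in for imgA(:,1) and imgB(:,1)
a = [1; 255; 4095];
b = [16; 4096; 65535];

% Build a 2-by-N matrix so that column-major traversal yields a1 b1 a2 b2 ...
pairs = [a, b].';

% The format is reused for every column of 'pairs'
txt = sprintf('%03X\t%04X\n', pairs);
disp(txt)
```

Swapping sprintf for fprintf(fid, ...) should write the same text straight to the file.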

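fprintf(fid, ['%03X %04X \n'], [imgA(:,1)'; imgB(:,1)']);
I think this is faster. Note that the two vectors have to be stacked as rows (a 2-by-N matrix); concatenating the columns end to end with [imgA(:,1);imgB(:,1)] would print consecutive imgA values on each line instead of imgA/imgB pairs.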

I would try dlmwrite:
delim = repmat(' ',length(imgA(:,1)),1);
output = [ imgA(:,1) , delim , imgB(:,1) ]
dlmwrite('test.txt',output,'')
This assumes the values have already been converted to hex strings (e.g. with dec2hex), so they are char arrays; you then only need a delimiter matrix to get some space between your vectors.
Example:
tic
A = dec2hex( randi(10000,10000,1) );
B = dec2hex( randi(10000,10000,1) );
delim = repmat(' ',length(A),1);
output = [A, delim, B];
dlmwrite('test.txt',output,'');
toc
quite fast I guess:
Elapsed time is 0.860588 seconds.
For 100000 elements:
Elapsed time is 8.652231 seconds.
so the time scales roughly linearly with the number of elements. I don't know whether it ends up faster than fprintf.
If you didn't need the hex format but plain decimal numbers, this approach would definitely be faster:
A = randi(10000,100000,1) ;
B = randi(10000,100000,1) ;
C = [A, B];
save('test.txt','C','-tabs','-ascii');
assuming A and B are column vectors; otherwise transpose them.
Elapsed time is 0.155126 seconds.

Related

Optimizing reading the data in Matlab

I have a large data file with text formatted as a single column with n rows. Each row is either a real number or the string No Data. I have imported this text as an nx1 cell array named Data. Now I want to filter the data and create an nx1 array from it, with NaN values in place of No Data. I have managed to do it using a simple loop (see below); the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
    if Data{i}(1)~='N'
        z(i) = str2double(Data{i});
    else
        z(i) = NaN;
    end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break since you are working with a single column. This prevents No data from producing two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
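If the text has already been imported into a cell array, as in the question, the loop can probably be dropped without re-reading the file: str2double accepts a cell array of char vectors and returns NaN for any entry it cannot convert. A minimal sketch, with a small made-up Data cell standing in for the real one:

```matlab
% Hypothetical sample standing in for the n-by-1 cell array from the question
Data = {'9.343410'; 'No Data'; '-135.210'; '0.550001'};

% One vectorized call; entries that are not numbers come back as NaN
z = str2double(Data);
disp(z)
```

This is not the readtable approach described above, just an alternative for data that is already in memory.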
If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.
filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val<0.01
        fwrite(fid,sprintf('%s\n','No Data'));
    else
        fwrite(fid,sprintf('%f\n',val*1000));
    end
end
fclose(fid)
%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc
%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc
%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow: even without converting the resulting string matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.
Depending on how big your files are and how often you read such files, you might want to go beyond readtable, which can be quite slow.
EDIT: After tests, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.
The most efficient way I've found is to read the whole file as a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with fixed width, you can transform them from char into numbers much more efficiently than with str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
% Read all the content of a text file to a matrix as wide as the longest
% line on the file. Shorter lines are padded with blank spaces. New lines
% are not included in the output.
% New lines are identified by new line \n characters.
% Reading the whole file in a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Finding new lines positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;
% Calculating number of lines
nLines=length(linesEndPos);
% Calculating the width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;
% Number of characters per row including new lines
charsPerRow=max(linesWidth)+1;
% Initializing output var with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');
% Computing a logical index to all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);
% Indexes of all new lines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;
% Filling output matrix
txtMat(charIdx)=fileData(~newLines);
% Cropping the last row corresponding to new line characters and transposing
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have fixed width, you can transform them to numbers far more efficiently than with str2num, with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
Where I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt it for your case.
And to finish you need to set the NaN values with:
values(NoData)=NaN;
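To make the weights trick concrete, here is a small self-contained sketch on a made-up 3-row char matrix. It assumes every numeric field is exactly 10 characters wide and zero-padded on the left (a space-padded field would not work as is, since ' ' - '0' is not zero):

```matlab
% Hypothetical fixed-width input: 7 integer digits, a decimal point, 2 decimals
txtMat = ['0001234.50'; 'No data   '; '0000007.25'];

% Rows starting with 'N' are the "No data" lines
NoData = txtMat(:,1) == 'N';

% One weight per character position; the decimal point gets weight 0
weights = [1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2];

% Subtracting '0' turns digit characters into 0..9, then a matrix product
% sums digit*weight along each row
values = (txtMat(:,1:10) - '0') * weights;

% The "No data" rows contain garbage at this point, so overwrite them
values(NoData) = NaN;
disp(values)
```

The whole conversion is a single matrix product, which is where the speed-up over per-line string parsing would come from.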
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fixed-width numbers, you can still do it this way by adding a couple of lines to count the number of digits and find the position of the decimal point before doing the conversion. That will slow things down a little, but I think it will still be faster.

Change default NaN representation of fprintf() in Matlab

I am trying to export data from Matlab in a format that would be understood by another application. For that I need to change the NaN, Inf and -Inf strings (that Matlab prints by default for such values) to //m, //inf+ and //inf-.
In general I DO KNOW how to accomplish this. I am asking how (and whether it is possible) to exploit one particular thing in Matlab. The actual question is located in the last paragraph.
There are two approaches that I have attempted (code below).
1. Use sprintf() on the data and strrep() the output. This is done line by line in order to save memory. This solution takes almost 10 times longer than a simple fprintf(). The advantage is that it has low memory overhead.
2. Same as option 1, but the translation is done on the whole data at once. This solution is much faster, but vulnerable to an out of memory exception. My problem with this approach is that I do not want to unnecessarily duplicate the data.
Code:
rows = 50000
cols = 40
data = rand(rows, cols); % generate random matrix
data([1 3 8]) = NaN; % insert some NaN values
data([5 6 14]) = Inf; % insert some Inf values
data([4 2 12]) = -Inf; % insert some -Inf values
fid = fopen('data.txt', 'w'); %output file
%% 0) Write data using default fprintf
format = repmat('%g ', 1, cols);
tic
fprintf(fid, [format '\n'], data');
toc
%% 1) Using strrep, writing line by line
fprintf(fid, '\n');
tic
for i = 1:rows
fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data(i, :)), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
end
toc
%% 2) Using strrep, writing all at once
fprintf(fid, '\n');
format = [format '\n'];
tic
fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data'), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
toc
Output:
Elapsed time is 1.651089 seconds. % Regular fprintf()
Elapsed time is 11.529552 seconds. % Option 1
Elapsed time is 2.305582 seconds. % Option 2
Now to the question...
I am not satisfied with the memory overhead and time lost using my solutions in comparison with simple fprintf().
My rationale is that the 'NaN', 'Inf' and '-Inf' strings are simple data saved in some variable inside the *printf() or *2str() implementation. Is there any way to change their value at runtime?
For example, in C# I would change System.Globalization.CultureInfo.NumberFormat.NaNSymbol, etc., as explained here.
In the limited case mentioned in the comments, where a number of columns (unknown, changing per data set) may be entirely NaN (or Inf, etc.) but there are no unwanted NaN values otherwise, another possibility is to check the first row of data, assemble a format string that writes the //m strings directly, and use that while passing fprintf only the columns that do not contain NaN or other unwanted values.
y = ~isnan(data(1,:)); % find all non-NaN
format = sprintf('%d ',y); % print a 1/0 string
format = strrep(format,'1','%g');
format = strrep(format,'0','//m');
fid = fopen('data.txt', 'w');
fprintf(fid, [format '\n'], data(:,y)'); %pass only the non-NaN data
fclose(fid);
By my check with two columns of NaN this fprintf is pretty much the same as your "regular" fprintf and quicker than the loop - not taking into account the initialisation step of producing format. It would be fiddlier to set it up to automatically produce the format string if you also have to take +/- Inf into account, but certainly possible. There is probably a cleaner way of producing format as well.
How it works:
You can pass in a subset of your data, and you can also insert any text you like into a format string, so if every row has the same desired "text" in the same spot (in this case NaN columns and our desired replacement for "NaN"), we can put the text we want in that spot and then just not pass those parts of the data to fprintf in the first place. A simpler example for trying out on the command line:
x = magic(5);
x(:,3) = NaN;
sprintf('%d %d ihatethrees %d %d \n', x(:,[1,2,4,5])')
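As for producing the format string automatically when Inf and -Inf columns are involved, one possible sketch is below. It relies on the same assumption as above (a column is either entirely NaN/Inf/-Inf or entirely ordinary numbers, so the first row is representative), and the names tokens, keep and fmt are made up for illustration:

```matlab
% Hypothetical data: column 2 is all NaN, column 4 is all Inf
data = [1 NaN 3 Inf; 4 NaN 6 Inf];
first = data(1,:);

% Start with a numeric placeholder for every column, then overwrite the
% placeholders of the special columns with the literal replacement text
tokens = repmat({'%g'}, 1, numel(first));
tokens(isnan(first))  = {'//m'};
tokens(first == Inf)  = {'//inf+'};
tokens(first == -Inf) = {'//inf-'};
fmt = [strjoin(tokens, ' ') '\n'];

% Only the finite columns are passed on, matching the remaining %g slots
keep = isfinite(first);
txt = sprintf(fmt, data(:, keep).');
disp(txt)
```

With a file identifier, fprintf(fid, fmt, data(:, keep).') should write the same text without building the intermediate string.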

Storing each iteration of a loop in Matlab

I have a 2d matrix (A=80,42), I am trying to split it into (80,1) 42 times and save it with a different name. i.e.
M_n1, M_n2, M_n3, … etc (representing the number of column)
I tried
for i= 1:42
M_n(i)=A(:,i)
end
it didn't work
How can I do that without overwrite the result and save each iteration in a file (.txt) ?
You can use eval
for ii = 1:size(A,2)
eval( sprintf( 'M_n%d = A(:,%d);', ii, ii ) );
% now you have M_n? var for you to process
end
However, the use of eval is not recommended; you might be better off using a cell array
M_n = mat2cell( A, [size(A,1)], ones( 1, size(A,2) ) );
Now you have M_n a cell array with 42 cells one for each column of A.
You can access the ii-th column by M_n{ii}
Generally, if you are considering doing this kind of thing: don't.
It does not scale up well, and having them in one array is usually far more convenient.
As long as the results have the same shape, you can use a standard array; if not, you can put each result in a cell array, e.g.:
results = cell(nTests,1);
results{1} = runTest(inputs{1});
or even
results = cellfun(@runTest,inputs,'UniformOutput',false); % where inputs is a cell array
And so on.
If you do want to write the numbers to a file at each iteration, you could do it without the names with csvwrite or the like (since you're only talking about 80 numbers a time).
Another option is using matfile, which lets you write directly to a variable in a .mat file. Consult help matfile for the specifics.
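For the "one file per column" part of the question, a minimal sketch that avoids numbered variables altogether is shown here. It uses num2cell to split the columns and plain fprintf to write each one; the matrix A, the M_n%d.txt naming and the use of tempdir as output folder are assumptions for illustration only:

```matlab
% Hypothetical small matrix standing in for the 80-by-42 A
A = magic(4);

% 1-by-N cell array, one cell per column of A
cols = num2cell(A, 1);

% Write each column to its own text file, one value per line
outDir = tempdir;
for k = 1:numel(cols)
    fname = fullfile(outDir, sprintf('M_n%d.txt', k));
    fid = fopen(fname, 'w');
    fprintf(fid, '%g\n', cols{k});
    fclose(fid);
end
```

The column index lives in the file name and in the cell index, so nothing needs eval.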

Linear indexing of 3D matrix: best way to expand the first 2 dims to a vector (slow sprintf on scalars)

I have a 3D vector s1(nmax,mmax,ntimeSTEPS). I want to take at each time step j (i.e. each value of the third dimension) all the elements of the first two dimensions and obtain a vector to give to sprintf. However, sprintf is PAINFULLY SLOW if inside a cycle! I checked the manual and it looks like there is no way to do that directly with linear indexing. Or am I missing something? I can only think of using reshape, but something like s1(:,j) would be the top, but that's not how MATLAB works. I did:
nmax = 800;
mmax =400;
nmax_x_mmax = nmax*mmax;
ntimeSTEPS = 1;
charINPUT = cell(nmax_x_mmax,1);
s1 = ones(nmax,mmax,ntimeSTEPS)*1234;
tic
for j=1:ntimeSTEPS
%... other stuff
input=reshape(s1(:,:,j),nmax_x_mmax,1);
for kk=1:length(input)
charINPUT{kk} = sprintf('%6.3f',input(kk));
end
%... other stuff (collecting movie frames etc)
end
toc
On a single time step this takes 5.09 SECONDS on my i7 2.2 GHz! I am trying to do an animation and this is crazily slow. If I increase the size of the array it's basically stuck.
Any suggestion for doing this with linear indexes?
Using sprintf
sprintf can take an array. Output with newlines and use regexp to parse out the digits and put them in a cell array of strings.
charINPUT = regexp(sprintf('%6.3f\n',s1(:)),'(?<=\s*)(\S*)(?=\n)','match')
Without sprintf
You don't have to use sprintf in a loop to build your cell array of strings. Since num2str takes a format specifier, you can just do this for the whole thing:
charINPUT = cellstr(num2str(s1(:),'%6.3f'))
You can either skip the loop over ntimeSTEPS entirely, or if you are performing other operations you are not showing that require the loop you can handle indexing as follows.
For direct indexing of s1 with no temporary variable, you can compute the linear indexes yourself via (1:nmax*mmax) + (j-1)*nmax*mmax.
for j=1:ntimeSTEPS,
stepInds = (1:nmax*mmax) + (j-1)*nmax*mmax;
charINPUT = cellstr(num2str(s1(stepInds),'%6.3f'))
end
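A quick way to convince yourself that the linear indexes pick out exactly one time step (each step holds nmax*mmax elements) is to compare them against the reshape version on a small array. The sizes below are made up; the sketch also shows that reshaping s1 once into a 2D matrix gives the s1(:,j)-style access the question was after:

```matlab
% Hypothetical small sizes; values are 1,2,3,... so positions are easy to check
nmax = 3; mmax = 2; ntimeSTEPS = 4;
s1 = reshape(1:nmax*mmax*ntimeSTEPS, nmax, mmax, ntimeSTEPS);

j = 3;
% Linear indexes of all elements belonging to time step j
stepInds = (1:nmax*mmax) + (j-1)*nmax*mmax;
slab = s1(stepInds);

% Alternative: one column per time step, then ordinary column indexing
flat = reshape(s1, nmax*mmax, ntimeSTEPS);
col = flat(:, j);
```

Both slab(:) and col should match reshape(s1(:,:,j), [], 1).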
Try this
for idx = 1:numel(s1)
charINPUT{idx} = sprintf('%6.3f',s1(idx));
end

Convert .mat to .csv octave/matlab

I'm trying to write an octave program that will convert a .mat file to a .csv file. The .mat file has a matrix X and a column vector y. X is populated with 0s and 1s and y is populated with labels from 1 to 10. I want to take y and put it in front of X and write it as a .csv file.
Here is a code snippet of my first approach:
load(filename, "X", "y");
z = [y X];
basename = split{1};
csvname = strcat(basename, ".csv");
csvwrite(csvname, z);
The resulting file contains lots of really small decimal numbers, e.g. 8.560596795891285e-06,1.940359477121703e-06, etc...
My second approach was to loop through and manually write the values out to the .csv file:
load(filename, "X", "y");
z = [y X];
basename = split{1};
csvname = strcat(basename, ".csv");
csvfile = fopen(csvname, "w");
numrows = size(z, 1);
numcols = size(z, 2);
for i = 1:numrows
  for j = 1:numcols
    fprintf(csvfile, "%d", z(i, j));
    if j == numcols
      fprintf(csvfile, "\n");
    else
      fprintf(csvfile, ",");
    end
  end
end
fclose(csvfile);
That gave me a correct result, but took a really long time.
Can someone tell me either how to use csvwrite in a way that will write the correct values, or how to more efficiently manually create the .csv file.
Thanks!
The problem is that if y is of type char, your X vector gets converted to char, too. Since your labels are nothing else but numbers, you can simply convert them to numbers and save the data using csvwrite:
csvwrite('data.txt', [str2num(y) X]);
Edit Also, in the loop you save the numbers using integer conversion %d, while csvwrite writes doubles if your data is of type double. If the zeros are not exactly zeros, csvwrite will write them with scientific notation, while your loop will round them. Hence the different behavior.
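If you do want to keep writing the file by hand with %d, the double loop can probably be replaced by a single call, since fprintf recycles the format over all elements of its argument in column-major order. A minimal sketch with a made-up matrix z, using sprintf so the result can be checked (single-quoted strings work in both Octave and MATLAB):

```matlab
% Hypothetical small matrix standing in for z = [y X]
z = [1 0 1; 10 1 0];

% One '%d,' per column except the last, which ends the line
fmt = [repmat('%d,', 1, size(z,2)-1) '%d\n'];

% Transpose so that column-major traversal walks along the rows of z
txt = sprintf(fmt, z.');
disp(txt)
```

Replacing sprintf with fprintf(csvfile, fmt, z.') should write the whole matrix in one go. If the values are not exact integers, wrapping z in round() first would mimic what the loop appears to produce.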
Just a heads up: your code isn't optimized for Matlab / Octave. Switch the for i and for j lines around.
Octave stores arrays in column-major order, so it's not cache efficient to do what you're doing. Making that change will probably speed up the overall loop to an acceptable time.