Optimizing reading the data in Matlab

Optimizing reading the data in Matlab - matlab

I have a large data file with a text formatted as a single column with n rows. Each row is either a real number or a string with a value of: No Data. I have imported this text as a nx1 cell named Data. Now I want to filter out the data and to create a nx1 array out of it with NaN values instead of No data. I have managed to do it using a simple cycle (see below), the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
if Data{i}(1)~='N'
z(i) = str2double(Data{i});
else
z(i) = NaN;
end
end
Is there a way to optimize it?

Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break since you are working with a single column... this prevents No data to produce two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.

If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.
filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
val = rand(1);
if val<0.01
fwrite(fid,sprintf('%s\n','No Data'));
else
fwrite(fid,sprintf('%f\n',val*1000));
end
end
fclose(fid)
%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc
%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc
%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow, even without converting resulting string matrix to numbers it is outperformed by readtable and textscan. textscan is slightly faster than readtable. Probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.

Depending of how big are your files and how often you read such files, you might want to go beyond readtable, that could be quite slow.
EDIT: After tests, with a file this simple the method below provide no advantages. The method was developed to read RINEX files, that are large and complex in the sense that the are aphanumeric with different numbers of columns and different delimiters in different rows.
The most efficient way I've found, is to read the whole file as a char matrix, then you can easily find you "No data" lines. And if your real numbers are formatted with fix width you can transform them from char into numbers in a way much more efficient than str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
% Read all the content of a text file to a matrix as wide as the longest
% line on the file. Shorter lines are padded with blank spaces. New lines
% are not included in the output.
% New lines are identified by new line \n characters.
% Reading the whole file in a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Finding new lines positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;
% Calculating number of lines
nLines=length(linesEndPos);
% Calculating the width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;
% Number of characters per row including new lines
charsPerRow=max(linesWidth)+1;
% Initializing output var with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');
% Computing a logical index to all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);
% Indexes of all new lines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;
% Filling output matrix
txtMat(charIdx)=fileData(~newLines);
% Cropping the last row coresponding to new lines characters and transposing
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have fix width, you can transform them to numbers way more efficiently than str2num with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
Where I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt it for your case.
And to finish you need to set the NaN values with:
values(NoData)=NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fix width numbers you can still do it this way by adding a couple lines to count the number of digits and find the place of the decimal point before doing the conversion, but that will slow down things a little bit. However, I think it will still be faster.

Related

Variable Width Columns in .txt Files

I have a function that takes data and imports that data into a text file. The issue that I am having is with formatting. I want to be able to set the width of the columns based on the widest array of characters in that column. So, in the code below I have labels and then data. My idea would be to take the length of each individually and find the largest value. Say the second column labels has 15 chars and that is longer than any data array, then I want to set the width of that column to 15 + 3 (white spaces) making it 18. If column 3 had a max of 8 chars for a member of data, then I would like to set the width to 11. I have found plenty of literature on fixed width, and I found that I could do '-*s', *width, colLabels; but I am having difficulty figuring out how to implement that.
Below is my code and it doesn't fail but it takes forever and then won't open because there is not enough memory. I have really tried to work through this to no avail.
Thanks in advance and if there is any other information I can provide, then let me know.
for col = 1:length(this.colLabels) % iterate through columns
colLen = length(this.colLabels{col}); % find the longest string in labels
v = max(this.data(:,col)); % find the longest double in data
n = num2str(v, '%.4f'); % precision of 4 after decimal place
dataLen = length(n);
% find max width for column and add white space
if colLen > dataLen
colWidth = colLen + 3;
else
colWidth = dataLen + 3;
end
% print it
fprintf(fid, '%-*s', this.colWidth, this.colLabels{col}); % write col position i
fprintf(fid, '\n');
fprintf(fid, '%-*s', this.colWidth, this.colUnits{col});% write unit position i
fprintf(fid, '\n');
fprintf(fid, '%-*s', this.colWidth, this.data(:,col)); % write all rows of data in column i
end

There are few places where you are making mistakes:
The size of number is not necessarily related to its size when printed. Consider 1.1234 and 1000, one of these is a larger string and the other is a larger number. This may or may not matter for your data ...
Two, it is best to use the correct format strings when printing to string. %s is for strings, not numbers.
Perhaps most importantly, text appears on multiple lines because of the newline character which ends one line and starts another. This means you essentially have to write one row at a time, not one column at a time.
I tend to prefer creating the text in memory then writing to a file. The following isn't the cleanest implementation but it works.
this.colLabels = {'test' 'cheese' 'variable' 'really long string'};
this.colUnits = {'ml' 'cm' 'C' 'kg'};
n_columns = length(this.colLabels);
%Fake data
this.data = reshape(1:n_columns*5,5,n_columns);
this.data(1) = 1.2345678;
this.data(5) = 1000; %larger number but smaller string
%Format as desired ...
string_data = arrayfun(#(x) sprintf('%g',x),this.data,'un',0);
string_data = [this.colLabels; this.colUnits; string_data];
%Add on newlines ...
%In newer versions you can use newline instead of char(10)
string_data(:,end+1) = {char(10)};
string_lengths = cellfun('length',string_data);
max_col_widths = max(string_lengths,[],1);
%In newer versions you can use singleton expansion, but beware
n_spaces_add = bsxfun(#minus,max_col_widths,string_lengths);
%left justify filling with spaces
final_strings = cellfun(#(x,y) [x blanks(y)],string_data,num2cell(n_spaces_add),'un',0);
%Optional delimiter between columns
%Don't add delimiter for last column or for newline column
final_strings(:,1:end-2) = cellfun(#(x) [x ', '],final_strings(:,1:end-2),'un',0);
%Let's skip last newline
final_strings{end,end} = '';
%transpose for next line so that (:) goes by row first, not column
%Normally (:) linearizes by column first
final_strings = final_strings';
%concatenate all cells together
entire_string = [final_strings{:}];
%Write this to disk fprintf(fid,'%s',entire_string);

The data in the text file is stored one line after the other, so you cannot write column by column. You need first to determine the width of the columns and write the label/unit header, then write all the data. All we need to have is a proper format string for fprintf: fixed width format and fprintf is extremely useful for exporting column delimited data.
The first part of the code is ok in order to determine the width of the columns (assuming the data only has positive samples). You only need to store it in an array.
nCol=length(this.colLabels);
colWidth = zeros(1,nCol);
for col = 1:nCol
colLen = length(this.colLabels{col}); % find the longest string in labels
v = max(this.data(:,col)); % find the longest double in data
n = num2str(v, '%.4f'); % precision of 4 after decimal place
dataLen = length(n);
% find max width for column and add white space
colWidth(col)=max(colLen,dataLen);
end
Now, we need to build format string for the labels and data, to use with sprintf. The format string will look like '%6s %8s %10s\n' for the header and '%6.4f %8.4f %10.4f\n' for the data.
fmtHeader=sprintf('%%%ds ',colWidth);
fmtData=sprintf('%%%d.4f ',colWidth);
%Trim the triple space at the end and add the newline
fmtHeader=[fmtHeader(1:end-3) '\n'];
fmtData =[fmtData(1:end-3) '\n'];
We use the fact that, when sprintf is given an array as input, it will iterate through all the values to produce a long string. We can use the same trick to write the data, but singe we write line by line and Matlab stores data in column major order, a transpose is necessary.
fid=fopen('myFile.txt');
fprintf(fid,fmtHeader,this.colLabels{:});
fprintf(fid,fmtHeader,this.colUnits{:});
fprintf(fid,fmtData,transpose(this.data));
fclose(fid);
For the headers, the cell can be converted to a comma separated list with {:}. This is the same as writing fprintf(fid,fmtHeader,this.colLabels{1},this.colLabels{2},...)
Using the same test data from #Jimbo 's answer and fid=1; to output the fprintf to the screen the code gives:
test cheese variable really long string
ml cm C kg
1.2346 6.0000 11.0000 16.0000
2.0000 7.0000 12.0000 17.0000
3.0000 8.0000 13.0000 18.0000
4.0000 9.0000 14.0000 19.0000
1000.0000 10.0000 15.0000 20.0000
Finally, the most compact version of the code is:
fid=1; %print to screen for test purpose
colWidth =max( cellfun(#length,this.colLabels(:)') , max(1+floor(log10(max(this.data,[],1))) , 1) + 5); %log10 to count digits, +5 for the dot and decimal digits ; works for data >=0 only
fprintf(fid,[sprintf('%%%ds ',colWidth(1:end-1)) sprintf('%%%ds\n',colWidth(end))],this.colLabels{:},this.colUnits{:}); %print header
fprintf(fid,[sprintf('%%%d.4f ',colWidth(1:end-1)) sprintf('%%%d.4f\n',colWidth(end))],this.data'); %print data

Read a text file and separate the data into different columns and different tables in MATLAB

I have a huge text file that needs to be read and processed in MATLAB. This file at some points contain text to indicate that a new data series has started.
I have searched here but cant find any simple solution.
So what I want to do is to read the data in the file, put the data in a table in three different columns and when it finds text it should create a new table. It should repeat this process until the entire document is scanned.
This is how the document looks like:
time V(A,B) I(R1)
Step Information: X=1 (Run: 1/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+00
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011
Step Information: X=1 (Run: 2/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+000
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011

A rather crude approach is to read the file line by line and check if the line consists of three numbers. If it does, then append this to a temporary matrix. When you finally get to a line that doesn't contain three numbers, append this matrix as an element in a cell array, clear the temporary matrix and continue.
Something like this would work, assuming that the file is stored in 'file.txt':
%// Open the file
f = fopen('file.txt', 'r');
%// Initialize empty cell array
data = {};
%// Initialize temporary matrix
temp = [];
%// Loop over the file...
while true
%// Get a line from the file
line = fgetl(f);
%// If we reach the end of the file, get out
if line == -1
%// Last check before we break
%// Check if the temporary matrix isn't empty and add
if ~isempty(temp)
data = [data; temp];
end
break;
end
%// Else, check to see if this line contains three numbers
numbers = textscan(line, '%f %f %f');
%// If this line doesn't consist of three numbers...
if all(cellfun(#isempty, numbers))
%// If the temporary matrix is empty, skip
if isempty(temp)
continue;
end
%// Concatenate to cell array
data = [data; temp];
%// Reset temporary matrix
temp = [];
%// If this does, then create a row vector and concatenate
else
temp = [temp; numbers{:}];
end
end
%// Close the file
fclose(f);
The code is pretty self-explanatory but let's go into it to be sure you know what's going on. First open up the file with fopen to get a "pointer" to the file, then initialize our cell array that will contain our matrices as well as the temporary matrix used when reading in matrices in between header information. After we simply loop over each line of the file and we can grab a line with fgetl using the file pointer we created. We then check to see if we have reached the end of the file and if we have, let's check to see if the temporary matrix has any numerical data in it. If it does, add this into our cell array then finally get out of the loop. We use fclose to close up the file and clean things up.
Now the heart of the operation is what follows after this check. We use textscan and search for three numbers separated by spaces. That's done with the '%f %f %f' format specifier. This should give you a cell array of three elements if you are successful with numbers. If this is correct, then convert this cell array of elements into a row of numbers and concatenate this into the temporary matrix. Doing temp = [temp; numbers{:}]; facilitates this concatenation. Simply put I piece together each number and concatenate them horizontally to create a single row of numbers. I then take this row and concatenate this as another row in the temporary matrix.
Should we finally get to a line where it's all text, this will give you all three elements in the cell array found by textscan to be empty. That's the purpose of the all and cellfun call. We search each element in the cell and see if it's empty. If every element is empty, this is a line that is text. If this situation arises, simply take the temporary matrix and add this as a new entry into your cell array. You'd then reset the temporary matrix and start the logic over again.
However, we also have to take into account that there may be multiple lines that consist of text. That's what the additional if statement is for inside the first if block using all. If we have an additional line of text that precedes a previous line of text, the temporary matrix of values should still be empty and so you should check to see if that is empty before you try and concatenate the temporary matrix. If it's empty, don't bother and just continue.
After running this code, I get the following for my data matrix:
>> format long g
>> celldisp(data)
data{1} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
data{2} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
To access a particular "table", do data{ii} where ii is the table you want to access that was read in from top to bottom in your text file.

The most versatile way is to read line by line using textscan. If you want to speed this process up, you can have a dummy read first:
ie. You loop through all the lines without storing the data and decide which lines are the text lines and which are numbers, recording a quick number of lines for each.
You then have enough information about the data to run through quickly the arrays. This will speed up the time it takes to store the data in your new arrays massively.
Your second loop is the one that actually reads the data into the array/s. You should now know which lines to skip. You can also pre-allocate the arrays within the data cell if you wish to.
fid = fopen('file.txt','r');
data = {};
nlines = [];
% now start the loop
k=0; % counter for data sets
while ~feof(fid)
line = fgetl(fid);
% check if is data or text
if all(ismember(line,' 0123456789+.')) % is it data
nlines(k) = nlines(k)+1;
else %is it text
k=k+1;
nlines(k) = 0;
end
end
frewind(fid); % go back to start of file
% You could preallocate the data array here if you wished
% now get the data
for aa = 1 : length(nlines)
if nlines(aa)==0;
continue
end
textscan(fid,'%s\r\n',1); % skip textline
data{aa} = textscan(fid,'%f%f%f\r\n',nlines(k));
end

Performance issues by processing huge textfiles

I am facing the problem to extract data out of a textfile which has both numbers and characters in it. The data I want, (the numbers) are separated by rows with characters, describing the following dataset. The textfile is rather large (>2.000.000 lines).
I try to put every dataset (the number of rows between two rows with characters) into a matrix. The matrix should be named according to the description (frequency) in the textline above each dataset. I have a working code, but I face performance problems. Maybe someone can help me to speed it up. One file takes currently about 15 minutes. I need the numbers in matrices to process them further.
Snippet out of Textfile:
21603 2135 21339 21604
103791 94 1 1 1 4
21339 1702 21600 21604
-1
-1
2414
1
Velocity (magnitude) Response at Structural FE Nodes
1
Frequency = 10.00 Hz
Result = Engineering Units
Component = Vmag
Location =
Form & Units = RMS Magnitude in m/s
1 5 1 11 2 1
1 0 1 1 1 0 0 0
1 2161
0.00000e+000 1.00000e+001 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
20008
1.23285e-004
20428
1.21613e-004
Here is my code:
file='large_file.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate Matrix where textfile should be saved in
for count=1:8 % get rid of first 8 lines
fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
a=fgets(fid);
b=str2double(strread(a,'%s')); % turn read row in a vector
if isnan(b(1))==1 % check whether there are characters in the row
if strfind(a,'Frequency') % check if 'Frequency' is in the row
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);
name=b(3);
for count=1:10 % get rid of next 10 lines
fgets(fid);
end
start=k+1;
end
else % if there are just numbers in the row, insert it into the matrix
A(k,1:length(b))=b; % populate matrix A with the row entries
k = k+1;
end
k/filerows % show progress
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:end,:);',name,start);
eval(Matrixname);

Reading text files line by line is very time consuming, especially in Matlab. When I have to read in text files, I usually read in the entire file at once. You may be limited by memory, so read it in the largest size chunks your machine can handle. Once it's all in memory, use some kind of logical indexing to find the parts of the data you're interested in. Again, in Matlab, for and while loops are very slow. For the data set you have there, I would do the following:
fid = fopen(file);
data = fread(fid,[1 maxBytes],'char=>char');
blockIndices = strfind(data,'Velocity'); % Calculate offsets based on data format
% Another approach much faster than for loops
lineData = regexp(data,sprintf('\n'),'split'); % No each line is in a cell
processedData = cellfun(#processData,lineData,'Uniform',false);
function y = processData(x)
% do something with x
end
Once I had the block indices I could calculate offsets to the parts of the data I want. I don't think that two million lines is that much data, and most computers these days have multiple gigabytes of memory, and it doesn't look like each line is more than a couple hundred characters, so the file is probably less than half a GB. Good luck.

Using the matlab profiler will help you see which lines of code are taking the most amount of time so that you can figure out what to optimize.
As the original poster determined, the line causing trouble in this case was
k/filerows % show progress
Printing to the screen many times is very time consuming. If you want to show progress without slowing the code way down, you could do
if mod(k,filerows/100) == 0
disp('k rows processed');
end
That code will cause an update to be displayed 100 times, or every 3.5 seconds in that particular case.
If you want to get really fancy, check out waitbar, but it is usually overkill.

Finally I got the sscanf-solution to work. I used that function to replace the str2double function to gain some speed as suggested in Why is str2double so slow in matlab as compared to a mex-function?.
Sadly it didn't do too much, but at least it helped a bit.
So, start was ca. 850s
Profiler after removing progress-status: ca. 450s
Profiler after replacing str2double by sscanf: ca.330s
The code now is:
file='test.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986; % nr of rows in textfile
A=zeros(filerows,6); % preallocate Matrix where textfile should be saved in
for count=1:8 % get rid of first 8 lines
fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
a=fgets(fid);
b=strread(a,'%s');
b=sscanf(sprintf('%s#', b{:}), '%g#')';
if isempty(b) % check whether there had been characters in the row
if strfind(a,'Frequency') % check whether 'Frequency' was in the row
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);
b=str2double(strread(a,'%s'));
name=b(3);
for count=1:8 % get rid of next 8 lines
fgets(fid);
end
start=k+1;
end
else % if there were just numbers in the row, insert it into the matrix
A(k,1:length(b))=b; % populate matrix A with the row entries
k = k+1;
end
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);

Change default NaN representation of fprintf() in Matlab

I am trying to export data from Matlab in format that would be understood by another application... For that I need to change the NaN, Inf and -Inf strings (that Matlab prints by default for such values) to //m, //inf+ and //Inf-.
In general I DO KNOW how to accomplish this. I am asking how (and whether it is possible) to exploit one particular thing in Matlab. The actual question is located in the last paragraph.
There are two approaches that I have attempted (code bellow).
Use sprintf() on data and strrep() the output. This is done in line-by-line fashion in order to save memory. This solution takes almost 10 times more time than simple fprintf(). The advantage is that it has low memory overhead.
Same as option 1., but the translation is done on the whole data at once. This solution is way faster, but vulnerable to out of memory exception. My problem with this approach is that I do not want to unnecessarily duplicate the data.
Code:
rows = 50000
cols = 40
data = rand(rows, cols); % generate random matrix
data([1 3 8]) = NaN; % insert some NaN values
data([5 6 14]) = Inf; % insert some Inf values
data([4 2 12]) = -Inf; % insert some -Inf values
fid = fopen('data.txt', 'w'); %output file
%% 0) Write data using default fprintf
format = repmat('%g ', 1, cols);
tic
fprintf(fid, [format '\n'], data');
toc
%% 1) Using strrep, writing line by line
fprintf(fid, '\n');
tic
for i = 1:rows
fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data(i, :)), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
end
toc
%% 2) Using strrep, writing all at once
fprintf(fid, '\n');
format = [format '\n'];
tic
fprintf(fid, '%s\n', strrep(strrep(strrep(sprintf(format, data'), 'NaN', '//m'), '-Inf', '//inf-'), 'Inf', '//inf+'));
toc
Output:
Elapsed time is 1.651089 seconds. % Regular fprintf()
Elapsed time is 11.529552 seconds. % Option 1
Elapsed time is 2.305582 seconds. % Option 2
Now to the question...
I am not satisfied with the memory overhead and time lost using my solutions in comparison with simple fprintf().
My rationale is that the 'NaN', 'Inf' and '-Inf' strings are simple data saved in some variable inside the *printf() or *2str() implementation. Is there any way to change their value at runtime?
For example in C# I would change the System.Globalization.CultureInfo.NumberFormat.NaNSymbol, etc. as explaind here.

In the limited case mentioned in comments that a number of (unknown, changing per data set) columns may be entirely NaN (or Inf, etc), but that there are not unwanted NaN values otherwise, another possibility is to check the first row of data, assemble a format string which writes the \\m strings directly, and use that while telling fprintf to ignore the columns that contain NaN or other unwanted values.
y = ~isnan(data(1,:)); % find all non-NaN
format = sprintf('%d ',y); % print a 1/0 string
format = strrep(format,'1','%g');
format = strrep(format,'0','//m');
fid = fopen('data.txt', 'w');
fprintf(fid, [format '\n'], data(:,y)'); %pass only the non-NaN data
fclose(fid);
By my check with two columns of NaN this fprintf is pretty much the same as your "regular" fprintf and quicker than the loop - not taking into account the initialisation step of producing format. It would be fiddlier to set it up to automatically produce the format string if you also have to take +/- Inf into account, but certainly possible. There is probably a cleaner way of producing format as well.
How it works:
You can pass in a subset of your data, and you can also insert any text you like into a format string, so if every row has the same desired "text" in the same spot (in this case NaN columns and our desired replacement for "NaN"), we can put the text we want in that spot and then just not pass those parts of the data to fprintf in the first place. A simpler example for trying out on the command line:
x = magic(5);
x(:,3)=NaN
sprintf('%d %d ihatethrees %d %d \n',x(:,[1,2,4,5])');

Converting a comma separated filed to a matlab matrix

I have a comma separated file in the format:
Col1Name,Col1Val1,Col1Val2,Col1Val3,...Col1ValN,Col2Name,Col2Val1,...Col2ValN,...,ColMName,ColMVal1,...,ColMValN
My question is, how can I convert this file into something Matlab can treat as a matrix, and how would I go about using this matrix in a file? I supposed I could some scripting language to format the file into matlab matrix format and copy it, but the file is rather large (~7mb).
Thanks!
Sorry for the edit:
The file format is:
Col1Name;Col2Name;Col3Name;...;ColNName
Col1Val1;Col2Val2;Col3Val3;...;ColNVal1
...
Col1ValM;Col2ValM;Col3ValM;...;VolNValM
Here is some actual data:
Press;Temp.;CondF;Cond20;O2%;O2ppm;pH;NO3;Chl(a);PhycoEr;PhycoCy;PAR;DATE;TIME;excel.date;date.time
0.96;20.011;432.1;431.9;125.1;11.34;8.999;134;9.2;2.53;1.85;16.302;08.06.2011;12:01:52;40702;40702.0.5
1;20.011;433;432.8;125;11.34;9;133.7;8.19;3.32;2.02;17.06;08.06.2011;12:01:54;40702;40702.0.5
1.1;20.012;432.7;432.4;125.1;11.34;9;133.8;8.35;2.13;2.2;19.007;08.06.2011;12:01:55;40702;40702.0.5
1.2;20.012;432.8;432.5;125.2;11.35;9.001;133.8;8.45;2.95;1.95;21.054;08.06.2011;12:01:56;40702;40702.0.5
1.3;20.012;432.7;432.4;125.4;11.37;9.002;133.7;8.62;3.17;1.87;22.934;08.06.2011;12:01:57;40702;40702.0.5
1.4;20.007;432.1;431.9;125.2;11.35;9.003;133.7;9.48;4.17;1.6;24.828;08.06.2011;12:01:58;40702;40702.0.5
1.5;19.997;432.3;432.2;124.9;11.33;9.003;133.8;8.5;3.84;1.79;27.327;08.06.2011;12:01:59;40702;40702.0.5
1.6;20;432.8;432.6;124.5;11.29;9.003;133.6;8.57;3.22;1.86;30.259;08.06.2011;12:02:00;40702;40702.0.5
1.7;19.99;431.9;431.9;124.4;11.28;9.002;133.6;8.79;3.7;1.81;35.152;08.06.2011;12:02:02;40702;40702.0.5
1.8;19.994;432.1;432.1;124.4;11.28;9.002;133.6;8.58;3.41;1.84;39.098;08.06.2011;12:02:03;40702;40702.0.5
1.9;19.993;433;432.9;124.6;11.3;9.002;133.6;8.59;3.45;5.53;45.488;08.06.2011;12:02:04;40702;40702.0.5
2;19.994;433;432.9;124.8;11.32;9.002;133.5;8.6;2.76;1.99;50.646;08.06.2011;12:02:05;40702;40702.0.5

If you don't know number of rows and columns up front, you can't use previous solution. Use this instead.
7 Mb is not large, it is small. This is the 21st century.
To read in to a matlab matrix:
text = fileread('file.name'); % a string with the entire file contents in it. 7 Mb is no big deal.
NAMES = {}; % we'll record column names here
VALUES = []; % this will be the matrix of values
while text(end) = ','
text(end)=[]; % elimnate any trailing commas
end
commas = find(text==','); % Index all the commas
commas = [0;commas(:);length(commas)+1] % put fake commas before and after text to simplify loop
col = 0; % which column are we in
I = 1;
while I<length(commas)
txt = text(commas(I)+1:commas(I+1)-1);
I = I+1;
num = str2double(txt);
if isnan(num) % this means it must be a column name
NAMES{end+1,1} = txt;
col = col+1; % can you believe Matlab doesn't support col++ ???
row = 1; % back to the top at each new column
continue % we have dealt with this txt, its not a num so ... next
end
% if we made it here we have a number
VALUES(row,col) = num;
end
Then you can save your matlab matrix VALUES and also the header names if you want them in matlab format NAMES into matlab format file
save('mymatrix.mat','VALUES','NAMES'); % saves matrix and column names to .mat file
You get stuff back in to matlab when you want it from the file by:
load mymatrix.mat; % loads VALUES and NAMES from .mat file
Some limitations:
You can't use commas in your column header names.
You cannot "name" a column something like "898.2" or anything which can be read as a double number, it will be read in as a number.
If your columns have different lengths, the shorter ones will be padded with zeros to the length of the longest column.
That's all I can think of.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Optimizing reading the data in Matlab - matlab

Related

Variable Width Columns in .txt Files

Read a text file and separate the data into different columns and different tables in MATLAB

Performance issues by processing huge textfiles

Change default NaN representation of fprintf() in Matlab

Converting a comma separated filed to a matlab matrix

Categories

Resources