str2num and importing data for large matrix - matlab

I have a large matrix in xlsx file which contains chars as following for example:
1,26:45:32.350,6,7,8,9,9,0,0,0
1,26:45:32.409,5,7,8,9,9,0,75,89
I want to make the 2nd column (the one which contains 26:45:32:350)
as a time vector and all the rest as a double matrix.
I tried the next code on like 50000 rows and it worked.
[FileName PathName] = uigetfile('*.xlsx','XLSX Files');
fid = fopen(FileName);
T=char(importdata(FileName));
Time=T(:,5:16);
Data=str2double(T);
However, when I tested it on the whole matrix (about 500,000 roww), I recieved Data=[] instead of matrix.
Is there any other thing I can do so 'Data' will be double matrix even for large matrix?
The excel file contains 1 column and around 500,000 rows, so the whole line 1,26:45:32:350,6,7,8,9,9,0,0,0 is inside 1 cell.
Also, I wrote another code,which works but take alot of time to run.
[FileName PathName] = uigetfile('*.xlsx','XLSX Files');
fid = fopen(FileName);
T=importdata(FileName);
h = waitbar(0,'Converting Data to cell array, please wait...');
for i=1:length(T)
delimiter_index=[0 find(T{i,1}(:)==char(44))'];
for j=1:length(delimiter_index)-1
Data{i,j}=T{i,1}(delimiter_index(j)+1:delimiter_index(j+1)-1);
end
waitbar(i/length(T));
end
close(h)
h = waitbar(0,'Seperating Data to time and data, please wait...');
for i=1:length(T)
Full_Time(i,:)=Data{i,2};
Data{i,2}=Data{i,1};
Data{i,1}=Full_Time(i,:);
waitbar(i/length(T));
end
close(h)
Data(:,1)=[];
h = waitbar(0,'Changing data cell to mat, please wait...');
for i=1:size(Data,1)
for j=1:size(Data,2)
Matrix(i,j)=str2num(Data{i,j});
end
waitbar(i/size(Data,1));
end
close(h)
Running this code for like 20000 rows shows that:(slowest to fastest)
waitbar
allchild
str2num
importdata
So basically I can remove this waitbar, but allchild (not sure what it is) and str2num take most of the time.
Is there anything I can do to make it run faster?

Related

Optimizing reading the data in Matlab

I have a large data file with a text formatted as a single column with n rows. Each row is either a real number or a string with a value of: No Data. I have imported this text as a nx1 cell named Data. Now I want to filter out the data and to create a nx1 array out of it with NaN values instead of No data. I have managed to do it using a simple cycle (see below), the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
if Data{i}(1)~='N'
z(i) = str2double(Data{i});
else
z(i) = NaN;
end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break since you are working with a single column... this prevents No data to produce two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.
filename = 'data.txt';
%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
val = rand(1);
if val<0.01
fwrite(fid,sprintf('%s\n','No Data'));
else
fwrite(fid,sprintf('%f\n',val*1000));
end
end
fclose(fid)
%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc
%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc
%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds.
Elapsed time is 0.304987 seconds.
Elapsed time is 0.206315 seconds.
txt2mat is slow, even without converting resulting string matrix to numbers it is outperformed by readtable and textscan. textscan is slightly faster than readtable. Probably because it skips some of the internal sanity checks and does not convert the resulting data to a table.
Depending of how big are your files and how often you read such files, you might want to go beyond readtable, that could be quite slow.
EDIT: After tests, with a file this simple the method below provide no advantages. The method was developed to read RINEX files, that are large and complex in the sense that the are aphanumeric with different numbers of columns and different delimiters in different rows.
The most efficient way I've found, is to read the whole file as a char matrix, then you can easily find you "No data" lines. And if your real numbers are formatted with fix width you can transform them from char into numbers in a way much more efficient than str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file to a char matrix
% Read all the content of a text file to a matrix as wide as the longest
% line on the file. Shorter lines are padded with blank spaces. New lines
% are not included in the output.
% New lines are identified by new line \n characters.
% Reading the whole file in a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);
% Finding new lines positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;
% Calculating number of lines
nLines=length(linesEndPos);
% Calculating the width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;
% Number of characters per row including new lines
charsPerRow=max(linesWidth)+1;
% Initializing output var with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');
% Computing a logical index to all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);
% Indexes of all new lines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;
% Filling output matrix
txtMat(charIdx)=fileData(~newLines);
% Cropping the last row coresponding to new lines characters and transposing
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have fix width, you can transform them to numbers way more efficiently than str2num with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
Where I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt it for your case.
And to finish you need to set the NaN values with:
values(NoData)=NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, this is WAY faster. And if you don't have fix width numbers you can still do it this way by adding a couple lines to count the number of digits and find the place of the decimal point before doing the conversion, but that will slow down things a little bit. However, I think it will still be faster.

Read a text file and separate the data into different columns and different tables in MATLAB

I have a huge text file that needs to be read and processed in MATLAB. This file at some points contain text to indicate that a new data series has started.
I have searched here but cant find any simple solution.
So what I want to do is to read the data in the file, put the data in a table in three different columns and when it finds text it should create a new table. It should repeat this process until the entire document is scanned.
This is how the document looks like:
time V(A,B) I(R1)
Step Information: X=1 (Run: 1/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+00
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011
Step Information: X=1 (Run: 2/11)
0.000000000000000e+000 -2.680148e-016 0.000000e+000
9.843925313007988e-012 -4.753470e-006 2.216314e-011
1.000052605772457e-011 -4.835427e-006 2.552497e-011
1.031372754715773e-011 -4.999340e-006 -3.042096e-012
1.094013052602406e-011 -5.327165e-006 -1.206968e-011
A rather crude approach is to read the file line by line and check if the line consists of three numbers. If it does, then append this to a temporary matrix. When you finally get to a line that doesn't contain three numbers, append this matrix as an element in a cell array, clear the temporary matrix and continue.
Something like this would work, assuming that the file is stored in 'file.txt':
%// Open the file
f = fopen('file.txt', 'r');
%// Initialize empty cell array
data = {};
%// Initialize temporary matrix
temp = [];
%// Loop over the file...
while true
%// Get a line from the file
line = fgetl(f);
%// If we reach the end of the file, get out
if line == -1
%// Last check before we break
%// Check if the temporary matrix isn't empty and add
if ~isempty(temp)
data = [data; temp];
end
break;
end
%// Else, check to see if this line contains three numbers
numbers = textscan(line, '%f %f %f');
%// If this line doesn't consist of three numbers...
if all(cellfun(#isempty, numbers))
%// If the temporary matrix is empty, skip
if isempty(temp)
continue;
end
%// Concatenate to cell array
data = [data; temp];
%// Reset temporary matrix
temp = [];
%// If this does, then create a row vector and concatenate
else
temp = [temp; numbers{:}];
end
end
%// Close the file
fclose(f);
The code is pretty self-explanatory but let's go into it to be sure you know what's going on. First open up the file with fopen to get a "pointer" to the file, then initialize our cell array that will contain our matrices as well as the temporary matrix used when reading in matrices in between header information. After we simply loop over each line of the file and we can grab a line with fgetl using the file pointer we created. We then check to see if we have reached the end of the file and if we have, let's check to see if the temporary matrix has any numerical data in it. If it does, add this into our cell array then finally get out of the loop. We use fclose to close up the file and clean things up.
Now the heart of the operation is what follows after this check. We use textscan and search for three numbers separated by spaces. That's done with the '%f %f %f' format specifier. This should give you a cell array of three elements if you are successful with numbers. If this is correct, then convert this cell array of elements into a row of numbers and concatenate this into the temporary matrix. Doing temp = [temp; numbers{:}]; facilitates this concatenation. Simply put I piece together each number and concatenate them horizontally to create a single row of numbers. I then take this row and concatenate this as another row in the temporary matrix.
Should we finally get to a line where it's all text, this will give you all three elements in the cell array found by textscan to be empty. That's the purpose of the all and cellfun call. We search each element in the cell and see if it's empty. If every element is empty, this is a line that is text. If this situation arises, simply take the temporary matrix and add this as a new entry into your cell array. You'd then reset the temporary matrix and start the logic over again.
However, we also have to take into account that there may be multiple lines that consist of text. That's what the additional if statement is for inside the first if block using all. If we have an additional line of text that precedes a previous line of text, the temporary matrix of values should still be empty and so you should check to see if that is empty before you try and concatenate the temporary matrix. If it's empty, don't bother and just continue.
After running this code, I get the following for my data matrix:
>> format long g
>> celldisp(data)
data{1} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
data{2} =
0 -2.680148e-16 0
9.84392531300799e-12 -4.75347e-06 2.216314e-11
1.00005260577246e-11 -4.835427e-06 2.552497e-11
1.03137275471577e-11 -4.99934e-06 -3.042096e-12
1.09401305260241e-11 -5.327165e-06 -1.206968e-11
To access a particular "table", do data{ii} where ii is the table you want to access that was read in from top to bottom in your text file.
The most versatile way is to read line by line using textscan. If you want to speed this process up, you can have a dummy read first:
ie. You loop through all the lines without storing the data and decide which lines are the text lines and which are numbers, recording a quick number of lines for each.
You then have enough information about the data to run through quickly the arrays. This will speed up the time it takes to store the data in your new arrays massively.
Your second loop is the one that actually reads the data into the array/s. You should now know which lines to skip. You can also pre-allocate the arrays within the data cell if you wish to.
fid = fopen('file.txt','r');
data = {};
nlines = [];
% now start the loop
k=0; % counter for data sets
while ~feof(fid)
line = fgetl(fid);
% check if is data or text
if all(ismember(line,' 0123456789+.')) % is it data
nlines(k) = nlines(k)+1;
else %is it text
k=k+1;
nlines(k) = 0;
end
end
frewind(fid); % go back to start of file
% You could preallocate the data array here if you wished
% now get the data
for aa = 1 : length(nlines)
if nlines(aa)==0;
continue
end
textscan(fid,'%s\r\n',1); % skip textline
data{aa} = textscan(fid,'%f%f%f\r\n',nlines(k));
end

Importing matrix with different row length

I have large data files and I would like to import 12 columns of data for further use. However the row length will be different in each instance. I would import the selected columns only but below the data I need are some blank rows followed by extra numbers which aren't necessary, so I'm wondering how to import just the data I need? I don't mind specifying and end row but this would be different for each case and I'm not sure if I'm missing anything else obvious! To help I've attached a print-screen of an example of the data I'm working with:
To summarise I only require the "blue" data above the purple boxes, each file I will use will have the same layout except there may be more/less rows of data.
EDIT
I have updated the code to give you a better understanding of the process:
% An empty array:
importedarray = [];
% Open the data file to read from it:
fid = fopen( 'dummydata.txt', 'r' );
% Check that fid is not -1 (error openning the file).
% read each line of the data in a cell array:
cac = textscan( fid, '%s', 'Delimiter', '\n' );
% size(cac{1},1) must equals the # of rows in your data file.
totalRows = size(cac{1},1);
fprintf('Imported %d rows of data!\n',totalRows)
% Close the file as we don't need it anymore:
fclose( fid );
% for total rows in data
for k=1:totalRows
fprintf('Parsing data on row %d of %d...\n',k,totalRows);
currentRow = cac{1}{k,1};
fprintf('Row contains:\n%s\n',currentRow);
% finish (break from loop) when encounter an empty row:
if isempty(currentRow)
fprintf('Empty row encountered (#%d). Exiting the loop...\n',k);
break;
end
eachRowElement = strsplit(currentRow, ' ');
fprintf('Splitting row to %d elements...\n',length(eachRowElement));
fprintf('Converting row to floats...');
eachRowElement2num = cellfun(#str2num,eachRowElement,'UniformOutput',false);
fprintf('Done!\n');
fprintf('Converting cell to matrix...');
importedarray(k,:) = cell2mat(eachRowElement2num);
fprintf('Done!\n');
end
clearvars cac k fid totalRows currentRow eachRowElement eachRowElement2num;
Given your example image (that all the columns of each row are filled with floats and on an empty row you stop) this should do the job giving info along the way. If not you will be able to tell what is the issue by looking the line the code stopped. I include code to eliminate the unnecessary variables after importing. This must be done manually or you can create a function to perform the task (functions' work space is different the the temporary variables are deleted on function return, see: http://www.mathworks.com/help/matlab/ref/function.html). Hope this helps.
PS. In your example you keep 12 columns skipping the first two. The above code will import the whole row. You can choose what columns to keep later by using matrix indexing, like:
importedarray = importedarray(:,3:14);
if these columns don't change you can incorporate this into your function.

Reading dat file into matrix in matlab with different different amounts of data each row

I'm trying to write a matlab function that will load a data into a matrix. The problem is the data has one more value each row. So I can't use load unfortunately, so I'm trying to use fgetl.
The data looks like:
143234
454323 354654
543223 343223 325465
etc
What I did is create a matrix of zeros, the dimensions being the height and the longest string of data. To put the data into the matrix I used fgetl to read each row and then used textscan to split the data up at whitespace. Then I used str2num (I think this is where the error is) to convert the string to a number.
First heres my code:
%READTRI Opens the triangle dat file and turns it into a matrix
fid = fopen('triangledata.dat');
%create matrix of zeros for the data to be retrieved
trimat = zeros(15,15);
%Check to see if the file loaded properly
if fid == -1
disp('File load error')
else
%if it does, continue
%read each line of data into a
while feof(fid) == 0
%run through line by line
aline = fgetl(fid);
%split aline into parts by whitespace
splitstr = textscan(aline,'%s','delimiter',' ');
%This determines what row of the matrix the for loop writes to
rowCount = 1;
%run through cell array to get the numbers out and write them to
%the matrix
for i = 1:length(splitstr)
%convert to number
num = str2num(splitstr{i});
%write num to matrix
trimat(rowCount, i) = num;
end
%iterate rowCount
rowCount = rowCount + 1;
end
%close the file
closeresult = fclose(fid);
%check for errors
if closeresult == 0
disp('File close successful')
else
disp('File close not successful')
end
end
end
The error I'm getting is:
Error using str2num (line 33)
Requires string or character array input.
Error in readTri (line 32)
num = str2num(splitstr{i});
What bothers me is that when I try, in the interactive console the same thing that goes on in the loop i.e. import aline, split it into a cell array using textscan, then use num2str to convert it to a integer. Everything works. So either the way I'm using num2str is wrong or the for loop is doing something funky.
I was just hoping for ideas, there is a LOT of data so adding zeros to make load work is not possible.
thanks for reading!
You can use dlmread instead of load or fgetl
It automatically returns a matrix with zeros whenever the line is not as long as the longest.
Just do
matrix = dlmread('triangledata.dat');
Why not using textscan?
fid = fopen('test.txt','r');
C = textscan(fid, '%f%f%f');
fclose(fid);
res = cell2mat(C)
The result is
res =
143234 NaN NaN
454323 354654 NaN
543223 343223 325465
where missing values are NaN.

Reading CSV with mixed type data

I need to read the following csv file in MATLAB:
2009-04-29 01:01:42.000;16271.1;16271.1
2009-04-29 02:01:42.000;2.5;16273.6
2009-04-29 03:01:42.000;2.599609;16276.2
2009-04-29 04:01:42.000;2.5;16278.7
...
I'd like to have three columns:
timestamp;value1;value2
I tried the approaches described here:
Reading date and time from CSV file in MATLAB
modified as:
filename = 'prova.csv';
fid = fopen(filename, 'rt');
a = textscan(fid, '%s %f %f', ...
'Delimiter',';', 'CollectOutput',1);
fclose(fid);
But it returs a 1x2 cell, whose first element is a{1}='ÿþ2', the other are empty.
I had also tried to adapt to my case the answers to these questions:
importing data with time in MATLAB
Read data files with specific format in matlab and convert date to matal serial time
but I didn't succeed.
How can I import that csv file?
EDIT After the answer of #macduff i try to copy-paste in a new file the data reported above and use:
a = textscan(fid, '%s %f %f','Delimiter',';');
and it works.
Unfortunately that didn't solve the problem because I have to process csv files generated automatically, which seems to be the cause of the strange MATLAB behavior.
What about trying:
a = textscan(fid, '%s %f %f','Delimiter',';');
For me I get:
a =
{4x1 cell} [4x1 double] [4x1 double]
So each element of a corresponds to a column in your csv file. Is this what you need?
Thanks!
Seems you're going about it the right way. The example you provide poses no problems here, I get the output you desire. What's in the 1x2 cell?
If I were you I'd try again with a smaller subset of the file, say 10 lines, and see if the output changes. If yes, then try 100 lines, etc., until you find where the 4x1 cell + 4x2 array breaks down into the 1x2 cell. It might be that there's an empty line or a single empty field or whatever, which forces textscan to collect data in an additional level of cells.
Note that 'CollectOutput',1 will collect the last two columns into a single array, so you'll end up with 1 cell array of 4x1 containing strings, and 1 array of 4x2 containing doubles. Is that indeed what you want? Otherwise, see #macduff's post.
I've had to parse large files like this, and I found I didn't like textscan for this job. I just use a basic while loop to parse the file, and I use datevec to extract the timestamp components into a 6-element time vector.
%% Optional: initialize for speed if you have large files
n = 1000 %% <# of rows in file - if known>
timestamp = zeros(n,6);
value1 = zeros(n,1);
value2 = zeros(n,1);
fid = fopen(fname, 'rt');
if fid < 0
error('Error opening file %s\n', fname); % exit point
end
cntr = 0
while true
tline = fgetl(fid); %% get one line
if ~ischar(tline), break; end; % break out of loop at end of file
cntr = cntr + 1;
splitLine = strsplit(tline, ';'); %% split the line on ; delimiters
timestamp(cntr,:) = datevec(splitLine{1}, 'yyyy-mm-dd HH:MM:SS.FFF'); %% using datevec to parse time gives you a standard timestamp vector
value1(cntr) = splitLine{2};
value2(cntr) = splitLine{3};
end
%% Concatenate at the end if you like
result = [timestamp value1 value2];