Performance issues when processing huge text files - MATLAB

I am trying to extract data from a text file that contains both numbers and characters. The data I want (the numbers) is separated by rows of characters that describe the following dataset. The text file is rather large (>2,000,000 lines).
I am trying to put every dataset (the rows between two character rows) into a matrix. The matrix should be named according to the description (frequency) in the text line above each dataset. I have working code, but I am facing performance problems: one file currently takes about 15 minutes. I need the numbers in matrices to process them further. Maybe someone can help me speed it up.
Snippet from the text file:
21603 2135 21339 21604
103791 94 1 1 1 4
21339 1702 21600 21604
-1
-1
2414
1
Velocity (magnitude) Response at Structural FE Nodes
1
Frequency = 10.00 Hz
Result = Engineering Units
Component = Vmag
Location =
Form & Units = RMS Magnitude in m/s
1 5 1 11 2 1
1 0 1 1 1 0 0 0
1 2161
0.00000e+000 1.00000e+001 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000 0.00000e+000
20008
1.23285e-004
20428
1.21613e-004
Here is my code:
file='large_file.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986;                       % nr of rows in textfile
A=zeros(filerows,6);                    % preallocate matrix where the textfile is saved in
for count=1:8                           % get rid of the first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=str2double(strread(a,'%s'));      % turn the read row into a vector
    if isnan(b(1))                      % check whether there are characters in the row
        if strfind(a,'Frequency')       % check if 'Frequency' is in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            name=b(3);
            for count=1:10              % get rid of the next 10 lines
                fgets(fid);
            end
            start=k+1;
        end
    else                                % if there are just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b;             % populate matrix A with the row entries
        k = k+1;
    end
    k/filerows                          % show progress
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:end,:);',name,start);
eval(Matrixname);

Reading text files line by line is very time consuming, especially in MATLAB. When I have to read text files, I usually read the entire file at once. You may be limited by memory, so read it in the largest chunks your machine can handle. Once it is all in memory, use some kind of logical indexing to find the parts of the data you are interested in. Again, for and while loops are very slow in MATLAB. For the data set you have there, I would do the following:
fid = fopen(file);
data = fread(fid,[1 maxBytes],'char=>char');   % maxBytes: number of bytes to read in this chunk
blockIndices = strfind(data,'Velocity');       % calculate offsets based on the data format

% Another approach, much faster than for loops
lineData = regexp(data,sprintf('\n'),'split'); % now each line is in a cell
processedData = cellfun(@processData,lineData,'Uniform',false);

function y = processData(x)
    % do something with x
end
Once I had the block indices, I could calculate offsets to the parts of the data I wanted. Two million lines is not that much data: most computers these days have multiple gigabytes of memory, and since each line does not look longer than a couple hundred characters, the file is probably less than half a GB. Good luck.
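For illustration, here is a minimal, hypothetical sketch of that idea applied to a file like the snippet above. The value of headerLines (how many description lines follow each 'Frequency = ... Hz' header) and the numeric-line filter are assumptions about the layout, and the numbers of each block are simply collected into one vector rather than reshaped into rows:

raw       = fileread('large_file.txt');              % whole file as one char vector
fileLines = regexp(raw, '\r?\n', 'split');           % each line in its own cell
hdrIdx    = find(~cellfun('isempty', strfind(fileLines, 'Frequency')));
headerLines = 10;                                    % assumed number of description lines per block
blocks = cell(numel(hdrIdx), 1);
freqs  = zeros(numel(hdrIdx), 1);
for n = 1:numel(hdrIdx)
    freqs(n) = sscanf(fileLines{hdrIdx(n)}, 'Frequency = %f Hz');
    first = hdrIdx(n) + headerLines + 1;             % first candidate data line of this block
    if n < numel(hdrIdx)
        last = hdrIdx(n+1) - 1;                      % stop before the next header
    else
        last = numel(fileLines);
    end
    seg   = fileLines(first:last);
    isNum = ~cellfun('isempty', regexp(seg, '^\s*[-+.0-9]', 'once'));   % keep numeric rows only
    blocks{n} = sscanf(strjoin(seg(isNum), ' '), '%f');                 % all numbers of the block
end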

Using the MATLAB profiler will help you see which lines of code take the most time, so that you can figure out what to optimize.
As the original poster determined, the line causing trouble in this case was
k/filerows % show progress
Printing to the screen many times is very time consuming. If you want to show progress without slowing the code way down, you could do
if mod(k, round(filerows/100)) == 0
    fprintf('%d rows processed\n', k);
end
That code displays an update roughly 100 times over the run, or about every 3.5 seconds in that particular case.
If you want to get really fancy, check out waitbar, but it is usually overkill.
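For reference, a minimal waitbar sketch; the refresh interval of 10000 rows is an arbitrary choice, since updating the bar on every iteration would reintroduce the slowdown:

filerows = 2164986;                    % as in the question
h = waitbar(0, 'Processing file...');
for k = 1:filerows
    % ... per-row work would go here ...
    if mod(k, 10000) == 0              % refresh the bar only occasionally
        waitbar(k/filerows, h);
    end
end
close(h);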

Finally I got the sscanf solution to work. I used that function to replace str2double to gain some speed, as suggested in Why is str2double so slow in matlab as compared to a mex-function?.
Sadly it didn't do too much, but at least it helped a bit.
So, the starting point was ca. 850 s.
Profiler timing after removing the progress output: ca. 450 s
Profiler timing after replacing str2double with sscanf: ca. 330 s
The code now is:
file='test.txt';
fid=fopen(file,'r');
k=1;
filerows=2164986;                       % nr of rows in textfile
A=zeros(filerows,6);                    % preallocate matrix where the textfile is saved in
for count=1:8                           % get rid of the first 8 lines
    fgets(fid);
end
name=0;
start=1;
while ~feof(fid)
    a=fgets(fid);
    b=strread(a,'%s');
    b=sscanf(sprintf('%s#', b{:}), '%g#')';
    if isempty(b)                       % check whether there had been characters in the row
        if strfind(a,'Frequency')       % check whether 'Frequency' was in the row
            Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
            eval(Matrixname);
            b=str2double(strread(a,'%s'));
            name=b(3);
            for count=1:8               % get rid of the next 8 lines
                fgets(fid);
            end
            start=k+1;
        end
    else                                % if there were just numbers in the row, insert it into the matrix
        A(k,1:length(b))=b;             % populate matrix A with the row entries
        k = k+1;
    end
end
fclose(fid);
Matrixname = sprintf('Frequency%i=A(%i:%i,:);',name,start,k);
eval(Matrixname);

Related

Optimizing reading the data in Matlab

I have a large data file formatted as a single text column with n rows. Each row is either a real number or the string No Data. I have imported this text as an n-by-1 cell array named Data. Now I want to filter the data and create an n-by-1 numeric array from it, with NaN values instead of No Data. I have managed to do it with a simple loop (see below); the problem is that it is quite slow.
z = zeros(n,1);
for i = 1:n
    if Data{i}(1)~='N'
        z(i) = str2double(Data{i});
    else
        z(i) = NaN;
    end
end
Is there a way to optimize it?
Actually, the whole parsing can be performed with a one-liner using a properly parametrized readtable function call (no iterations, no sanitization, no conversion, etc...):
data = readtable('data.txt','Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No data');
Here is the content of the text file I used as a template for my test:
9.343410
11.54300
6.733000
-135.210
No data
34.23000
0.550001
No data
1.535000
-0.00012
7.244000
9.999999
34.00000
No data
And here is the output (which can be retrieved in the form of a vector of doubles using data.Var1):
ans =
9.34341
11.543
6.733
-135.21
NaN
34.23
0.550001
NaN
1.535
-0.00012
7.244
9.999999
34
NaN
Delimiter: specified as a line break, since you are working with a single column... this prevents No data from producing two columns because of the whitespace.
Format: you want numerical values.
TreatAsEmpty: this tells the function to treat a specific string as empty, and empty doubles are set to NaN by default.
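A short usage sketch, assuming the call shown above and readtable's default variable name Var1 (the file name refers to the example data, not to a real dataset):

data = readtable('data.txt','Delimiter','\n','Format','%f', ...
                 'ReadVariableNames',false,'TreatAsEmpty','No data');
z = data.Var1;               % n-by-1 double vector, NaN where the file said "No data"
nMissing = nnz(isnan(z))     % count the missing entries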
If you run this you can find out which approach is faster. It creates an 11MB text file and reads it with the various approaches.
filename = 'data.txt';

%% generate data
fid = fopen(filename,'wt');
N = 1E6;
for ct = 1:N
    val = rand(1);
    if val<0.01
        fwrite(fid,sprintf('%s\n','No Data'));
    else
        fwrite(fid,sprintf('%f\n',val*1000));
    end
end
fclose(fid);

%% Tommaso Belluzzo
tic
data = readtable(filename,'Delimiter','\n','Format','%f','ReadVariableNames',false,'TreatAsEmpty','No Data');
toc

%% Camilo Rada
tic
[txtMat, nLines]=txt2mat(filename);
NoData=txtMat(:,1)=='N';
z = zeros(nLines,1);
z(NoData)=nan;
toc

%% Gelliant
tic
fid = fopen(filename,'rt');
z= textscan(fid, '%f', 'Delimiter','\n', 'whitespace',' ', 'TreatAsEmpty','No Data', 'EndOfLine','\n','TextType','char');
z=z{1};
fclose(fid);
toc
result:
Elapsed time is 0.273248 seconds. (readtable)
Elapsed time is 0.304987 seconds. (txt2mat)
Elapsed time is 0.206315 seconds. (textscan)
txt2mat is slow: even without converting the resulting string matrix to numbers, it is outperformed by readtable and textscan. textscan is slightly faster than readtable, probably because it skips some of the internal sanity checks and does not convert the resulting data into a table.
Depending on how big your files are and how often you read such files, you might want to go beyond readtable, which can be quite slow.
EDIT: After testing, with a file this simple the method below provides no advantage. The method was developed to read RINEX files, which are large and complex in the sense that they are alphanumeric, with different numbers of columns and different delimiters in different rows.
The most efficient way I've found is to read the whole file as a char matrix; then you can easily find your "No data" lines. And if your real numbers are formatted with a fixed width, you can convert them from char to numbers much more efficiently than with str2double or similar functions.
The function I wrote to read a text file into a char matrix is:
function [txtMat, nLines]=txt2mat(filename)
% txt2mat Read the content of a text file into a char matrix
%   Reads the entire content of a text file into a matrix as wide as the
%   longest line in the file. Shorter lines are padded with blank spaces.
%   New lines are not included in the output; they are identified by the
%   newline character \n.

% Read the whole file into a string
fid=fopen(filename,'r');
fileData = char(fread(fid));
fclose(fid);

% Find the newline positions
newLines= fileData==sprintf('\n');
linesEndPos=find(newLines)-1;

% Number of lines
nLines=length(linesEndPos);

% Width (number of characters) of each line
linesWidth=diff([-1; linesEndPos])-1;

% Number of characters per row, including newlines
charsPerRow=max(linesWidth)+1;

% Initialize the output with blank spaces
txtMat=char(zeros(charsPerRow,nLines,'uint8')+' ');

% Compute a logical index mapping all characters of the input string to
% their final positions
charIdx=false(charsPerRow,nLines);

% Indices of all newlines
linearInd = sub2ind(size(txtMat), (linesWidth+1)', 1:nLines);
charIdx(linearInd)=true;
charIdx=cumsum(charIdx)==0;

% Fill the output matrix
txtMat(charIdx)=fileData(~newLines);

% Crop the last row corresponding to the newline characters and transpose
txtMat=txtMat(1:end-1,:)';
end
Then, once you have all your data in a matrix (let's assume it is named txtMat), you can do:
NoData=txtMat(:,1)=='N';
And if your number fields have a fixed width, you can convert them to numbers far more efficiently than with str2num, with something like
values=((txtMat(:,1:10)-'0')*[1e6; 1e5; 1e4; 1e3; 1e2; 10; 1; 0; 1e-1; 1e-2]);
Here I've assumed the numbers have 7 digits and two decimal places, but you can easily adapt it to your case.
And to finish you need to set the NaN values with:
values(NoData)=NaN;
This is more cumbersome than readtable or similar functions, but if you are looking to optimize the reading, it is WAY faster. And if you don't have fixed-width numbers, you can still use this approach by adding a couple of lines to count the number of digits and find the position of the decimal point before doing the conversion; that will slow things down a little, but I think it will still be faster.
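As a toy illustration of the fixed-width trick, assuming 4 integer digits, a decimal point and 2 decimals, i.e. 7 characters per field (which happens to be the width of the string 'No data'):

txtMat = ['1234.56'; '0001.00'; 'No data'];    % char matrix, one 7-character row per line
NoData = txtMat(:,1) == 'N';
w = [1e3; 1e2; 10; 1; 0; 1e-1; 1e-2];          % weight of each character column ('.' gets 0)
values = (txtMat - '0') * w;                   % char arithmetic instead of str2double
values(NoData) = NaN                           % values is [1234.56; 1; NaN]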

Matlab interp1 gives last row as NaN

I have a problem similar to the one here; however, it doesn't seem that there was a resolution.
My problem is as follows: I need to import some files, for example 5. There are 20 columns in each file, but the number of lines varies. Column 1 is time in crank-angle degrees, and the rest are data.
My code first imports all of the files, finds the file with the largest number of rows, then creates a multidimensional array with that many rows. The timing is in engine cycles, so I then remove lines from the imported file that go beyond a whole engine cycle. This way, I always have data for X whole engine cycles. Then I interpolate the data onto the preallocated array to get a giant multidimensional array for the 5 data files.
However, this always seems to result in the last row of every column of every page being filled with NaNs. Please have a look at the code below; I can't see where I'm going wrong. Oh, and by the way, as I have been screwed over before, this is NOT homework.
maxlines = 0;
maxcycle = 999;
for i = 1:1
    filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
    file = filelines(filename);                   % import file clean
    lines = size(file,1);                         % find number of lines in the file
    if lines > maxlines
        maxlines = lines;                         % if this file has the most lines, remember it
    end
    lastCAD = file(end,1);                        % add simstart to shift the start of the cycle to 0 CAD
    lastcycle = fix((lastCAD-simstart)./cycle);   % number of whole engine cycles
    if lastcycle < maxcycle
        maxcycle = lastcycle;                     % lowest number of whole engine cycles among all designs
    end
    cols = size(file,2);                          % number of columns in the files
end
lastcycleCAD = maxcycle.*cycle+simstart;          % last CAD of a whole cycle that can be used for analysis

% Import files
thermo = zeros(maxlines,cols,designs);            % initialize array to the proper size
xq = linspace(simstart,lastcycleCAD,maxlines);    % define the CAD degrees
for i = 1:designs
    filename = sprintf('C:\\Directory\\%s\\file.out',folder{i});
    file = importthermo(filename, 6, inf);        % import the file clean
    [~,lastcycleindex] = min(abs(file(:,1)-lastcycleCAD));   % index of the end of the last whole cycle
    file = file(1:lastcycleindex,:);              % remove all CAD after that
    thermo(:,1,i) = xq;
    for j = 2:17
        thermo(:,j,i) = interp1(file(:,1),file(:,j),xq);
    end
    sprintf('file from folder %s imported OK',folder{i})
end
thermo(end,:,:) = [];                             % remove NaN row
Thank you very much for your help!
Are you sampling outside the range of your input data? If so, you need to tell interp1 that you want extrapolation:
interp1(file(:,1),file(:,j),xq,'linear','extrap');
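A minimal illustration of the behaviour: interp1 returns NaN for query points outside the range of the sample points unless extrapolation is requested.

x  = 0:10;                   % sample points
y  = x.^2;                   % sample values
xq = linspace(0, 10.5, 5);   % the last query point lies beyond x(end)

interp1(x, y, xq)                       % last element is NaN
interp1(x, y, xq, 'linear', 'extrap')   % last element is extrapolated instead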

MATLAB data file that exceeds available memory

There is a matrix of 500000000 x 5.
A sample of this data looks like this:
1 01 06:0 48407
1 01 06:1 48407
.
.
.
865850 31 23:5 1586884493
Each column means [area_number date hour:minute amount_of_data]
I want to load it entirely, and then build another 865850 x 4464 matrix from the 5th-column values. In this new matrix, each row corresponds to an area_number, and each column holds the amount_of_data ordered by time.
This is what I wrote.
clear all; close all;
fileID=fopen('data2.txt','r');
Data=fscanf(fileID, '%d %d %d:%d %d',[5 100000]);
Data = Data';
Zeros = zeros(4000, 4464);
DataA = Data(:,1);   % indexes
DataB = Data(:,2);   % dates
DataC = Data(:,3);   % hours
DataD = Data(:,4);   % minutes
DataE = Data(:,5);   % data
for m=1:40000
    r = DataA(m);
    c = (DataB(m)-1)*24*6 + DataC(m)*6 + DataD(m);
    Zeros(r,c) = DataE(m);
end
I can't finish it because the matrix is too big to load at once; it exceeds MATLAB's memory limit.
Please help me...
Thank you~!
To solve your problem, the matfile command is probably the best choice. It allows you to write data directly to a MAT-file on disk but access it like a variable.
If I understood your data correctly, all lines with the same index are next to each other, and the data for a single index is small enough to fit in memory. Then you can proceed as follows (a rough sketch is given after the list):
1. Read all data with index 1.
2. Process it as you did above, creating one row of your intended matrix.
3. Write this row to your matfile.
4. Proceed with the next index until you reach the end.
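Here is a rough sketch of those steps, assuming that rows sharing an area_number are contiguous in data2.txt. The output file name, the chunk size and the shift to one-based column indices are my own choices; the column formula follows the question's code:

fin = fopen('data2.txt', 'r');
m   = matfile('result.mat', 'Writable', true);     % hypothetical output file
m.M = zeros(1, 4464);                              % variable stored on disk, grows row by row
while ~feof(fin)
    C = textscan(fin, '%d %d %d:%d %d', 1e6);      % read up to 1e6 lines per pass
    [area, dd, hh, mm, val] = C{:};
    for a = unique(area)'                          % build one output row per area_number
        sel = (area == a);
        col = (double(dd(sel)) - 1)*24*6 + double(hh(sel))*6 + double(mm(sel)) + 1;
        row = zeros(1, 4464);
        if size(m, 'M', 1) >= a                    % merge if this area straddles two chunks
            row = m.M(double(a), 1:4464);
        end
        row(col) = double(val(sel));
        m.M(double(a), 1:4464) = row;              % write the finished row straight to the MAT-file
    end
end
fclose(fin);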

Beginner: referencing a cell containing a zero - MATLAB

So far I have got this code:
clear all;   % clears all variables from your workspace
close all;   % closes all figure windows
clc;         % clears command window

%%=============================================
%% define number of trials and subjects
%%=============================================
njump=81;
nsbj=6;

%%==============================================
%% defining size of cell that will be created
%%==============================================
data=cell(njump,1);

%%==============================================
%% defining gravity and frame rate per second (fps)
%%==============================================
fps=0.001;   % frames per second
g=-9.81;     % acceleration

%%==============================================
%% read in excel data in CSV format
%%===============================================
for i=1:njump
    x=sprintf('Trial%02d.csv',i);   % jump data file name
    data{i}=xlsread(x,'A1:E7000');

    %%===============================================
    %% defining total no. of frames and time of flight
    % tnf = total number of frames equal to zero
    % n = nnz(X) returns the number of nonzero elements in matrix X.
    %%===============================================
    % myMax(i) = nanmax(data{i}(:,5));
    % vals = find(data{i}(:,5) > (myMax(i) - 10));
    % pointsInAir = numel(vals,i);
    tnf(i,1) = size(data{i,1},1) - nnz(data{i,1}(:,5));   % number of zeros
    tof(i)=tnf(i)*fps;            % time of flight
    jh(i,1)=(-g*(tof(i).^2)/8);   % jump height

    %%=================================================
    %% to find average power, first use the "find" function to find the first
    %% zero in Fz and keep that cell as a reference,
    %% then use nanmean for the average power (av_pwr)
    %% and nanmin for the peak power (peak_pwr)
    %%=================================================
    n = 1;   % the following call to find retrieves only the first n element(s) found
    ref= find(data{i,1}(:,5)==0, n, 'first');
    % ref=find(data(:,5)==0);
    peak_pwr(i,1) = nanmin(data{i,1}(1:ref,5));   % peak power in column E, up to the reference cell
    av_pwr(i,1)=nanmean(data{i,1}(1:ref,5));      % average power in column E, up to the reference cell

    %%==================================================
    %% Plot the results: time vs jump height, time vs average power and
    %% time vs peak power
However, the hard part is finding the first zero in column E (the 5th column) to use as a reference cell. I want to use this reference cell so that my average and peak power calculations use only the numbers before that zero.
In this case ref is empty, so you cannot access its first element.
If you think that ref should not be empty, you need to go back further to see where things go wrong. Otherwise, you can use something like:
if any(ref)
    % do something
else
    % return the default value / do an alternative action
end
It could help to have an example of what's in data. I have created one which might be similar to yours:
data{1,1}=magic(6)-10
Now in this matrix, column 5 actually does have a zero element, so ref = find(data{1,1}(:,5)==0); ref(1) works and retrieves the first index of a zero element. However, if it didn't, you would be trying to access the first element of an empty matrix.
Try instead using the second (and perhaps third) arguments of find to achieve this :
n = 1;
% the following call to find retrieves only the n first element(s) found.
ref= find(data{1,1}(:,5)==0, n, 'first');
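A runnable version of the example, using the magic-square test matrix from above rather than the real jump data:

data = cell(1,1);
data{1,1} = magic(6) - 10;                     % column 5 contains exactly one zero (in row 4)
n   = 1;
ref = find(data{1,1}(:,5) == 0, n, 'first');   % ref is 4 here
if any(ref)
    firstBlock = data{1,1}(1:ref,5)            % the rows up to and including the first zero
else
    disp('no zero found in column 5');         % fall back, as suggested in the answer above
end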
The rest of your code seems like it should work, although from the looks of it I have a feeling your loop (I take it you are using a for loop over i) could perhaps be vectorized.
Hope this helps :)
Tepp

Output loop result in MATLAB

Hi, I have this code and I don't know how to store the output result for every pixel. I think the output assignment is not well defined.
EDIT:
I'm going to try to explain the code:
% I have an image
imagen1=imread('recor.tif');
imagen2= double(imagen1);
band1= imagen2(:,:,1);

% I preallocate the result (the image size is 64*89*6)
yvan2= zeros(61,89,1);

% For every pixel of the image, I want to get one result (each one is different).
for i = 1 : nfiles
    for j = 1 : nrows
        for i = 1:numel(band1)
            % I'm doing this because I have to multiply the result of this
            % interpolation with a2ldb1y = ldcm_1(:,1). That vector has a length
            % of 2151x1 and I need to multiply the result of the interpolation
            % into positions (101:267) of the vector; this is why I do the
            % interpolation from 101 to 267 (and also because I don't have
            % those values).
            interplan= interp1(van1,man2,t2,'spline');
            ma(96) = banda1a(i);   % I said 96 because I want to do an interpolation
        end
        van1= [101 96 266]';
        mos1= ma(134);
        van2= [0 pos1 0];
        t= 101:267;
        t2= t';
        xi= 101:1:267;
        xi2=xi';
        interplan= interp1(van1,van2,t2,'spline');

        % After this, I 'prepare' the vector.
        out=zeros(2151,1);
        out(101:267) = interplan;

        % And then I do all these operations (they are important for the result).
        a2ldb1y= ldcm_1(:,1);
        a2ldsum_pesos1= sum(a2ldb1y);
        a2l7dout1_a= a2ldb1y.*out;
        a2l7dout1_b= a2l7dout1_a./a2ldsum_pesos1;
        a2l7dout1_c= sum(a2l7dout1_b);

        % And I want the result a2l7dout1_c for every pixel (the results are
        % different because every pixel has a different value...)
        yvan2(:,:,1)= [a2l7dout1_c];
    end
end
Thanks in advance,
I'm shooting in the dark here, but I think you're looking for:
yvan2(i, j, 1)= a2l7dout1_c;
instead of:
yvan2(:,:,1)= [a2l7dout1_c];
and thus your output should be stored in the variable yvan2 after the loops are done.
P.S
Some issues in your code:
Why do you have two loops using the same iteration variable i? Your calculations are probably incorrect, since i is modified by two for loops.
Why do you even need the innermost loop? Each iteration overwrites the value of ma(134) set by the previous iteration. You can just replace the entire loop with:
ma(134) = banda1a(numel(band1))
You shouldn't use the names i and j for loop variables: they already denote the imaginary unit (that is, sqrt(-1)), so shadowing them can lead to subtle bugs. Prefer other loop variable names instead, even ii and jj.
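A small illustration of the shadowing risk (the values are made up):

i  = 3;          % a leftover loop counter shadowing the imaginary unit
z1 = 2 + 4i      % the literal 4i is still parsed as an imaginary number: 2 + 4i
z2 = 2 + 4*i     % but this uses the variable i, so the result is 14, not 2 + 4i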