I have multiple text files that are about 2GB in size (approximately 70 million lines). I also have a quad-core machine and access to the Parallel Computing toolbox.
Typically you might open a file and read lines like so:
f = fopen('file.txt');
l = fgets(f);
while ~ isempty(l)
% do something with l
l = fgets(f);
end
I wanted to distribute the "do something with l" across my 4 cores, but that of course requires the use of a parfor loop. That would require that I "slurp" the 2GB file (to borrow a Perl term) into MATLAB a priori, instead of processing on the fly. I don't actually need l, just the result of the processing.
Is there a way to read lines out of a text file with parallel computing?
EDIT: It's worth mentioning that I can find the exact number of lines ahead of time (!wc -l mygiantfile.txt).
EDIT2: The structure of the file is as follows:
15 1180 62444 e0e0 049c f3ec 104
So 3 decimal numbers, 3 hex numbers, and 1 decimal number. Repeat this for 70 million lines.
As requested, here is an example of memory-mapped files using the memmapfile class.
Since you didn't provide the exact format of the data file, I will create my own. The data I am creating is a table of N rows, each consisting of 4 columns:
first is a double scalar value
second is a single value
third is a fixed-length string representing a uint32 in HEX notation (e.g: D091BB44)
fourth column is a uint8 value
The code to generate the random data, and write it to binary file structured as described above:
% random data
N = 10;
data = [...
num2cell(rand(N,1)), ...
num2cell(rand(N,1,'single')), ...
cellstr(dec2hex(randi(intmax('uint32'), [N,1]),8)), ...
num2cell(randi([0 255], [N,1], 'uint8')) ...
];
% write to binary file
fid = fopen('file.bin', 'wb');
for i=1:N
fwrite(fid, data{i,1}, 'double');
fwrite(fid, data{i,2}, 'single');
fwrite(fid, data{i,3}, 'char');
fwrite(fid, data{i,4}, 'uint8');
end
fclose(fid);
Here is the resulting file viewed in a HEX editor:
We can confirm the first record (note that my system uses little-endian byte ordering):
>> num2hex(data{1,1})
ans =
3fd4d780d56f2ca6
>> num2hex(data{1,2})
ans =
3ddd473e
>> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
ans =
'46' '35' '36' '32' '37' '35' '32' '46'
>> dec2hex(data{1,4})
ans =
C0
Next we open the file using memory-mapping:
m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
'Format',{
'double', [1 1], 'd';
'single', [1 1], 's';
'uint8' , [1 8], 'h'; % since it doesn't directly support char
'uint8' , [1 1], 'i'});
Now we can access the records as an ordinary structure array:
>> rec = m.Data; % 10x1 struct array
>> rec(1) % same as: data(1,:)
ans =
d: 0.3257
s: 0.1080
h: [70 53 54 50 55 53 50 70]
i: 192
>> rec(4).d % same as: data{4,1}
ans =
0.5799
>> char(rec(10).h) % same as: data{10,3}
ans =
2B2F493F
The benefit for large data files is that you can restrict the mapping "viewing window" to a small subset of the records, and move this view along the file:
% read the records two at-a-time
numRec = 10; % total number of records
lenRec = 8*1 + 4*1 + 1*8 + 1*1; % length of each record in bytes
numRecPerView = 2; % how many records in a viewing window
m.Repeat = numRecPerView;
for i=1:(numRec/numRecPerView)
% move the window along the file
m.Offset = (i-1) * numRecPerView*lenRec;
% read the two records in this window:
%for j=1:numRecPerView, m.Data(j), end
m.Data(1)
m.Data(2)
end
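The windowing idea can also be combined with parfor, since each worker can open its own map over a different slice of the file. The following is a minimal sketch under stated assumptions: the file name demo_records.bin, the simple record values and the chunk count are hypothetical and chosen only so the example is self-contained. Without the Parallel Computing toolbox, parfor simply runs as an ordinary loop.

```matlab
% Hypothetical demo file with the same 21-byte record layout as above.
fname = 'demo_records.bin';
N = 10;
fid = fopen(fname, 'w');
for r = 1:N
    fwrite(fid, r, 'double');             % 8 bytes
    fwrite(fid, r/2, 'single');           % 4 bytes
    fwrite(fid, dec2hex(r, 8), 'uint8');  % 8 bytes of ASCII hex digits
    fwrite(fid, r, 'uint8');              % 1 byte
end
fclose(fid);

fmt = {'double',[1 1],'d'; 'single',[1 1],'s'; 'uint8',[1 8],'h'; 'uint8',[1 1],'i'};
lenRec = 8 + 4 + 8 + 1;          % record length in bytes
numChunks = 5;                   % assumed to divide N evenly
recPerChunk = N / numChunks;

partialSum = zeros(1, numChunks);
parfor k = 1:numChunks
    % Each iteration maps only its own window of the file.
    mk = memmapfile(fname, 'Offset', (k-1)*recPerChunk*lenRec, ...
        'Repeat', recPerChunk, 'Format', fmt);
    recs = mk.Data;
    partialSum(k) = sum([recs.d]);   % stand-in for the real processing
end
total = sum(partialSum);
```

Only the per-chunk result leaves the loop, so the records themselves never have to be held in memory all at once.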
Some of MATLAB's built-in functions support multithreading - the list is here. For those, there is no need for the Parallel Computing toolbox.
If the "do something with l" step can benefit from the toolbox, just call the function that implements it before reading the next line.
You may alternatively want to read the whole file using
fid = fopen('textfile.txt');
C = textscan(fid,'%s','delimiter','\n');
fclose(fid);
and then compute the cells in C in parallel.
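As a rough sketch of that approach, assuming lines shaped like the sample in the question (three decimal, three hex and one decimal value), each line can be parsed with sscanf inside a parfor loop. The file name demo_lines.txt and its two lines are made up for illustration.

```matlab
% Hypothetical two-line demo file in the format shown in the question.
fname = 'demo_lines.txt';
fid = fopen(fname, 'w');
fprintf(fid, '15 1180 62444 e0e0 049c f3ec 104\n');
fprintf(fid, '16 1181 62445 e0e1 049d f3ed 105\n');
fclose(fid);

% Read every line into a cell array of character vectors.
fid = fopen(fname);
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
txt = C{1};

% Parse each line independently; %x reads the hexadecimal fields.
result = zeros(numel(txt), 7);
parfor k = 1:numel(txt)
    result(k,:) = sscanf(txt{k}, '%d %d %d %x %x %x %d').';
end
```

Note that this still loads the whole file into memory first, which is the trade-off the question was trying to avoid.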
If the reading time is a key issue, you may also want to access parts of the data file within a parfor loop. Here is an example from Edric M Ellis.
%Some data
x = rand(1000, 10);
fh = fopen( 'tmp.bin', 'wb' );
fwrite( fh, x, 'double' );
fclose( fh );
% Read the data
y = zeros(1000, 10);
parfor ii = 1:10
fh = fopen( 'tmp.bin', 'rb' );
% Get to the correct spot in the file:
offset_bytes = (ii-1) * 1000 * 8; % 8 bytes/double
fseek( fh, offset_bytes, 'bof' );
% read a column
y(:,ii) = fread( fh, 1000, 'double' );
fclose( fh );
end
% Check
assert( isequal( x, y ) );
In short, I'm having a headache in multiple languages to read a txt file (linked below). My most familiar language is MATLAB so for that reason I'm using that in this example. I've found a way to read this file in ~ 5 minutes, but given I'll have tons and tons of data from my instrument shortly as it measures all day every 30 seconds this just isn't feasible.
I'm looking for a way to quickly read these irregular text files so that going forward I can knock these out with less of a time burden.
You can find my exact data at this link:
http://lb3.pandonia.net/BostonMA/Pandora107s1/L0/Pandora107s1_BostonMA_20190814_L0.txt.bz2
I've been using the readtable function in MATLAB and have achieved the final product I want, but I'm looking to increase the speed.
Below is my code!
clearvars -except pan day1; % Clearing all variables except for the day and instrument variables.
close all;
clc;
pan_mat = [107 139 155 153]; % Matrix of pandora numbers for file-choosing reasons.
pan = pan_mat(pan); % pandora number I'm choosing
pan = num2str(pan); % Turning Pandora number into a string.
%pan = '107'
pandora = strcat('C:\Users\tadams15\Desktop\Folders\Counts\Pandora_Dta\',pan)
% string that designates file location
%date = '90919'
month = '09'; % Month
day2 = strcat('0',num2str(day1)) % Creating a day name for the figure I ultimately produce
cd(pandora)
d2 = strcat('2019',num2str(month),num2str(day2)); % The final date variable for the figure I produce
%file_pan = 'Pandora107s1_BostonMA_20190909_L0';
file_pan = strcat('Pandora',pan,'s1_BostonMA_',d2,'_L0'); % File name string
%Try reading it in line by line?
% Load in as a string and then convert the lines you want as numbers into
% number.
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); % Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%%
A= array2table(A); % Converting Structure matrix back to table
row_num = 0;
pan_mat_2 = zeros(2359,4126);
datetime_mat = zeros(2359,2);
blank = 0;
%% Converting data to proper matrices
[length width] = size(A);
% The matrix below is going through "A" and writing from it to a new
% matrix, "pan_mat_2" which is my final product as well as singling out the
% rows that contain non-number variables I'd like to keep and adding them
% later.
tic
%flag1
for i = 1:length; % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
datetime_mat(row_num,1) = str2double(blank{1,1}(1));
datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
for j = 1:4126;
pan_mat_2(row_num,j) = str2double(b(j));
end
end
end
toc
%flag2
In short, I'm already getting the result I want but the part of the code where I'm writing to a new array "flag 1" to "flag 2" is taking roughly 222 seconds while the entire code only takes about 248 seconds. I'd like to find a better way to create the data there than to write it to a new array and take a whole bunch of time.
Any suggestions?
Note:
There are quite a few improvements you can make for speed, but there are also corrections to make. You preallocate your final output variable with hard-coded values:
pan_mat_2 = zeros(2359,4126);
But later you populate it in a loop which runs for i = 1:length.
length is the total number of lines read from the file. In your example file there are only 784 lines. So even if all your lines were valid (OK to be parsed), you would only ever fill the first 784 rows of the 2359 rows you allocated in pan_mat_2. In practice, this file has only 400 valid data lines, so your pan_mat_2 could definitely be smaller.
I know you couldn't have known that only 400 lines would be parsed before parsing them, but you knew from the beginning that you had only 784 lines to parse (you had that information in the variable length). So in cases like this, preallocate to 784 rows and discard the empty rows afterwards.
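The preallocate-then-trim pattern just described can be sketched on toy data as follows. The variable names and the "valid line" test are placeholders, not taken from the original code.

```matlab
nLines = 6;                 % known upper bound on the number of rows
out = zeros(nLines, 1);     % preallocate to the upper bound
k = 0;                      % number of rows actually filled
for i = 1:nLines
    if mod(i, 2) == 0       % stand-in for "this line is valid"
        k = k + 1;
        out(k) = i;
    end
end
out = out(1:k);             % discard the unused rows afterwards
```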
Fortunately, the solution I propose does not need to preallocate a larger array and then discard rows. The matrices end up the right size from the start.
The code:
%%
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt' ;
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); %Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.
%% Remove lines which won't be parsed
% Count the number of elements in each line
nelem = cell2mat( cellfun( @size , A ,'UniformOutput',0) ) ;
nelem(:,1) = [] ;
% find which lines does not have enough elements to be parsed
idxLine2Remove = ~(nelem > 4120 & nelem < 4140) ;
% remove them from the data set
A(idxLine2Remove) = [] ;
%% Remove nesting in cell array
nLinesToParse = size(A,1) ;
A = reshape( [A{:}] , [], nLinesToParse ).' ;
% now you have a cell array of size [400x4126] cells
%% Now separate the columns with different data type
% Column 1 => [String] identifier
% Column 2 => Timestamp
% Column 3 to 4125 => Numeric values
% Column 4126 => empty cell created during the 'split' operation above
% because of a trailing space character.
LineIDs = A(:,1) ;
TimeStamps = A(:,2) ;
Data = A(:,3:end-1) ; % fetch to "end-1" to discard last empty column
%% now extract the values
% You could do that directly:
% pan_mat = str2double(Data) ;
% but this takes a long time. A computationally much faster way (even if it
% uses more complex code) would be:
dat = strjoin(Data) ; % create a single long string made of all the strings in all the cells
nums = textscan( dat , '%f' , Inf ) ; % call textscan on it (way faster than str2double() )
pan_mat = reshape( cell2mat( nums ) , nLinesToParse ,[] ) ; % reshape to original dimensions
%% timestamps
% convert to character array
strTimeStamps = char(TimeStamps) ;
% convert to matlab own datetime numbering. This will be a lot faster if
% you have operations to do on the time stamps later
ts = datenum(strTimeStamps,'yyyymmddTHHMMSSZ') ;
%% If you really want them the way you had it in your example
strTimeStamps(:,9) = ' ' ; % replace 'T' with ' '
strTimeStamps(:,end) = ' ' ; % replace 'Z' characters with ' '
%then same again, merge into a long string, parse then reshape accordingly
strdate = reshape(strTimeStamps.',1,[]) ;
tmp = textscan( strdate , '%d' , Inf ) ;
datetime_mat = reshape( double(cell2mat(tmp)),2,[]).' ;
The performance:
As you can see, on my machine your original code takes ~102 seconds to execute, with 80% of that (81 s) spent calling the function str2double() 3,302,400 times!
My solution, run on the same input file, takes ~5.5 seconds, with half of that time spent calling strjoin() 3 times.
When you read the code above, try to understand how I limited repeated function calls in lengthy loops by keeping everything as vectorised as possible.
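The core trick (one textscan call on a joined string instead of one str2double call per cell) can be seen in isolation on a tiny cell array. This is only an illustrative sketch; the 2x2 cell below is made-up data.

```matlab
Data = {'1.5', '2'; '3', '4.25'};     % small cell array of numeric strings

% Join all cells into one space-separated string (column-major order).
joined = strjoin(Data(:).');

% Parse the whole string in a single call, then restore the shape.
nums = textscan(joined, '%f');
M = reshape(nums{1}, size(Data));

% Same result as the element-wise conversion.
assert(isequal(M, str2double(Data)));
```

Because both the join and the reshape use column-major order, every value lands back in its original position.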
Using the profiler, you can see that you call str2double 3,302,400 times in a run, which takes about 80% of the total time on my PC. That is suboptimal: each call translates only one value, and as far as your code goes you don't need the values as strings again. I added this under your original code:
row_num = 0;
pan_mat_2_b = cell(2359,4126);
datetime_mat_b = cell(2359,2);%not zeros
blank = 0;
tic
%flag1
for i = 1:length % Make second number the length of the table, A
blank = 0;
b = table2array(A{i,1});
[rows, columns] = size(b);
if columns > 4120 && columns < 4140
row_num = row_num + 1;
blank = regexp(b(2), 'T', 'split');
blank2 = regexp(blank{1,1}(2), 'Z', 'split');
%datetime_mat(row_num,1) = str2double(blank{1,1}(1));
%datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
datetime_mat_b(row_num,1) = blank{1,1}(1);
datetime_mat_b(row_num,2) = blank2{1,1}(1);
pan_mat_2_b(row_num,:) = b;
% for j = 1:4126
% pan_mat_2(row_num,j) = str2double(b(j));
% end
end
end
datetime_mat_b = datetime_mat_b(~all(cellfun('isempty',datetime_mat_b),2),:);
pan_mat_2_b=pan_mat_2_b(~all(cellfun('isempty',pan_mat_2_b),2),:);
datetime_mat_b=str2double(string(datetime_mat_b));
pan_mat_2_b=str2double(pan_mat_2_b);
toc
Still not great, but better. If you want to speed this up further, I recommend you take a closer look at the readtable part: you can save quite some time if you read the values in as doubles right from the beginning.
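One possible way to read numbers as doubles directly, sketched here on a single made-up line, is to give textscan a format with %s for the leading text fields and %f for the numeric ones. The sample line and the count of four numeric fields are assumptions for illustration; the real file has far more columns and irregular lines that would need to be filtered first.

```matlab
sampleLine = 'L0 20190814T000015Z 12 3.5 700 42';   % hypothetical data line
nNum = 4;                                           % assumed number of numeric fields

% Two text fields followed by nNum floating-point fields.
fmt = ['%s %s' repmat(' %f', 1, nNum)];
parts = textscan(sampleLine, fmt);

id     = parts{1}{1};        % first text field
stamp  = parts{2}{1};        % timestamp, still as text
values = [parts{3:end}];     % numeric fields, already double
```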
I made a 865850 by 4464 matrix.
Then I need to save it to a .txt file.
For that, I use fprintf, but I met a hard obstacle....
There are 4464 columns. How can I designate their formatspec?
They are all integers.
Now I know just one way...
fprintf(fid, '%10d %10d.....%10d', Zeros); (4464times..)
Is this the only way to save them?
Thank you~!!
clear all; close all;
loop = 1;
Zeros = zeros(15000, 4464);
fileID = fopen('data2.txt','r');
while loop < 4200
Data = fscanf(fileID, '%d %d %d:%d %d\n', [5, 100000]);
Data = Data';
DataA = Data(:,1);
DataB = Data(:,2);
DataC = Data(:,3);
DataD = Data(:,4);
DataE = Data(:,5);
for m=1:100000
r = DataA(m);
c = ((DataB(m)-1)*24*6 + DataC(m)*6 + DataD(m))+1;
Zeros(r,c) = DataE(m);
end
for n=1:4464
Zeros1{n}=Zeros(:, n);
fileID2 = fopen('result.txt','a');
fprintf(fileID2, '%10d %10d\n ', Zeros1{1}, Zeros1{2});
end
fclose(fileID2);
loop = loop + 1;
end
Don't use printf with the whole row. Use the CSV export, or iterate over each element of each row and print it individually.
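If a fixed-width text file is still required, a common idiom (not part of the suggestion above) is to build the format string with repmat rather than typing it out 4464 times. A minimal sketch on a small matrix, with a hypothetical output file name:

```matlab
M = magic(4);                         % small stand-in for the large matrix
nCols = size(M, 2);

% One '%10d ' per column, ending the row with a newline.
fmt = [repmat('%10d ', 1, nCols - 1), '%10d\n'];

fid = fopen('demo_matrix.txt', 'w');
% fprintf consumes data column by column, so transpose to write row by row.
fprintf(fid, fmt, M.');
fclose(fid);
```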
I would like to add that for data of this size, textual storage is a bad idea. No one will ever open this in a text editor and think "Oh, this is practical". Everyone will have a bad time carrying around hundreds of megabytes of unnecessary file size. Simply use save to store the data in a MAT-file if you plan to open it in MATLAB, or use a binary format, for example by calling fwrite on the data with a sensible binary representation of your numbers.
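A binary round trip with fwrite and fread might look like the sketch below. The two-value size header and the int32 representation are assumptions made for this example, not a standard format.

```matlab
M = int32(magic(4));                  % small stand-in for the integer matrix

fid = fopen('demo_matrix.bin', 'w');
fwrite(fid, size(M), 'uint32');       % assumed header: number of rows, columns
fwrite(fid, M, 'int32');              % data, written in column-major order
fclose(fid);

fid = fopen('demo_matrix.bin', 'r');
sz = fread(fid, [1 2], 'uint32');     % read the header back
back = fread(fid, sz, 'int32=>int32');% read the data with the same shape and type
fclose(fid);
```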
You could also just use the built-in MATLAB ASCII save format (instead of printf):
>> foo = magic( 4 )
foo =
16 2 3 13
5 11 10 8
9 7 6 12
4 14 15 1
>> save( 'foo.txt', '-ascii', 'foo' )
I've used networkx to generate a random geometric graph on 50 nodes, and create a .dat file with some attributes of this network.
I need to access these as MATLAB variables. I read the file in as a data string using:
fid = fopen('mydata.dat','r')
data = textscan(fid, '%s')
fclose(fid)
The structure of the data file is as follows
conn = val
Adj = val ..... val
.............
val ......val
pos =
[0.7910629988376467, 0.5523474928588686]
...
[0.6799716933198028, 0.6981655240935597]
i.e. conn is a number, Adj is (supposed to be) a 50 by 50 matrix and pos is a 50 by 2 matrix.
I can read conn, and Adj as MATLAB variables fine, but I'm having trouble reading pos. The first instance starts at data{1}{2508}, and is
data{1}{2508}
>>> [0.7832623541518583,
How do I shoehorn this into a 50 by 2 (or 2 by 50) matrix?
To read the Adj I use
P = 50 %number of nodes
index = 5
for i=1:P
for j = 1:P
Adj(i,j) = str2double(data{1}(index + P*(i-1) +j))
end
end
I thought something similar would work for pos, but with j = 1:2 and index = 2508 but I'm getting NaNs as the lines (fields?) of my .dat file aren't just values, they're of the form [val, or ,val]
You can first delete all characters you don't want to have.
data = regexprep(data{1},'[\[\],]','');
After that, your loop should succeed. However, you can speed up your code by using array functions.
Find the occurrence of pos:
ind = find(strcmp(data,'pos')); % Should be 2506 in your case
After that, once you know that your array is 50x2, use:
pos = str2double(reshape(data(ind+2:end),2,50)')
Note, the +2 is for pos and =.
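The same steps can be tried on a handful of made-up tokens shaped like the ones textscan returns for the pos section. The token values here are invented, and the number of rows is computed rather than fixed at 50.

```matlab
% Hypothetical tokens, as textscan(fid,'%s') would split them on whitespace.
tokens = {'pos'; '='; '[0.79,'; '0.55]'; '[0.68,'; '0.70]'};

% Strip brackets and commas so only the digits remain.
clean = regexprep(tokens, '[\[\],]', '');

% Locate the 'pos' marker; the values start two tokens later (after '=').
ind = find(strcmp(clean, 'pos'));
n = (numel(clean) - (ind + 1)) / 2;   % number of coordinate pairs

% Two values per row: reshape to 2-by-n, convert, then transpose.
pos = str2double(reshape(clean(ind+2:end), 2, n)).';
```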
I have a csv file which contains 2d arrays of 4 columns but a varying number of rows. Eg:
2, 354, 23, 101
3, 1023, 43, 454
1, 5463, 45, 7657
4, 543, 543, 654
3, 56, 7654, 344
...
I need to be able to import the data such that I can run operations on each block of data, however csvread, dlmread and textscan all ignore the blank lines.
I can't seem to find a solution anywhere, how can this be done?
PS:
It may be worth noting that the files of the format above are actually the concatenation of many files containing only one block of data (I don't want to have to read from thousands of files every time) therefore the blank line between blocks can be changed to any other delimiter / marker. This is just done with a python script.
EDIT: My Solution - based upon / inspired by petrichor below
I replaced csvread with textscan, which is faster. Then I realised that if I replaced the blank lines with lines of NaN instead (by modifying my Python script), I could remove the need for a second textscan, which was the slow point. My code is:
filename = 'data.csv';
fid = fopen(filename);
allData = cell2mat(textscan(fid,'%f %f %f %f','delimiter',','));
fclose(fid);
nanLines = find(isnan(allData(:,1)))';
iEnd = (nanLines - (1:length(nanLines)));
iStart = [1 (nanLines(1:end-1) - (0:length(nanLines)-2))];
nRows = iEnd - iStart + 1;
allData(nanLines,:)=[];
data = mat2cell(allData, nRows);
This evaluates in 0.28 s (for a file of 103000 lines). I've accepted petrichor's solution as it indeed best solves my initial problem.
filename = 'data.txt';
%# Read all the data
allData = csvread(filename);
%# Compute the empty line indices
fid = fopen(filename);
lines = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
blankLines = find(cellfun('isempty', lines{1}))';
%# Find the indices to separate data into cells from the whole matrix
iEnd = [blankLines - (1:length(blankLines)) size(allData,1)];
iStart = [1 (blankLines - (0:length(blankLines)-1))];
nRows = iEnd - iStart + 1;
%# Put the data into cells
data = mat2cell(allData, nRows)
That gives the following for your data:
data =
[3x4 double]
[2x4 double]
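A variation on the same idea, sketched here on an in-memory string rather than a file, is to give every row a block number by counting the blank lines that precede it and then group rows with accumarray. The sample text is made up and this is only one of several ways to do the grouping.

```matlab
% Made-up sample: two blocks separated by one blank line.
txt = sprintf('2, 354, 23, 101\n3, 1023, 43, 454\n\n4, 543, 543, 654\n');

rows = regexp(txt, '\n', 'split').';
isBlank = cellfun('isempty', strtrim(rows));

% Block number of each row = number of blank lines seen so far, plus one.
blockId = cumsum(isBlank) + 1;
blockId(isBlank) = [];

% Parse the non-blank rows; '%f,' reads a number and an optional comma.
vals = cellfun(@(s) sscanf(s, '%f,').', rows(~isBlank), 'UniformOutput', false);
allData = vertcat(vals{:});

% Collect the rows of each block into one cell (sort keeps the row order).
data = accumarray(blockId, (1:numel(blockId)).', [], @(ix) {allData(sort(ix), :)});
```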
I'm trying to load the following ascii file into MATLAB using load()
% some comment
1 0xc661
2 0xd661
3 0xe661
(This is actually a simplified file. The actual file I'm trying to load contains an undefined number of columns and an undefined number of comment lines at the beginning, which is why the load function was attractive)
For some strange reason, I obtain the following:
K>> data = load('testMixed.txt')
data =
1 50785
2 58977
3 58977
I've observed that the problem occurs anytime there's a "d" in the hexadecimal number.
Direct hex2dec conversion works properly:
K>> hex2dec('d661')
ans =
54881
importdata seems to have the same conversion issue, and so does the ImportWizard:
K>> importdata('testMixed.txt')
ans =
1 50785
2 58977
3 58977
Is that a bug, am I using the load function in some prohibited way, or is there something obvious I'm overlooking?
Are there workarounds around the problem, save from reimplementing the file parsing on my own?
Edited my input file to better reflect my actual file format. I had a bit oversimplified in my original question.
"GOLF" ANSWER:
This starts with the answer from mtrw and shortens it further:
fid = fopen('testMixed.txt','rt');
data = textscan(fid,'%s','Delimiter','\n','MultipleDelimsAsOne','1',...
'CommentStyle','%');
fclose(fid);
data = strcat(data{1},{' '});
data = sscanf([data{:}],'%i',[sum(isspace(data{1})) inf]).';
PREVIOUS ANSWER:
My first thought was to use TEXTSCAN, since it has an option that allows you to ignore certain lines as comments when they start with a given character (like %). However, TEXTSCAN doesn't appear to handle numbers in hexadecimal format well. Here's another option:
fid = fopen('testMixed.txt','r'); % Open file
% First, read all the comment lines (lines that start with '%'):
comments = {};
position = 0;
nextLine = fgetl(fid); % Read the first line
while strcmp(nextLine(1),'%')
comments = [comments; {nextLine}]; % Collect the comments
position = ftell(fid); % Get the file pointer position
nextLine = fgetl(fid); % Read the next line
end
fseek(fid,position,-1); % Rewind to beginning of last line read
% Read numerical data:
nCol = sum(isspace(nextLine))+1; % Get the number of columns
data = fscanf(fid,'%i',[nCol inf]).'; % Note '%i' works for all integer formats
fclose(fid); % Close file
This will work for an arbitrary number of comments at the beginning of the file. The computation to get the number of columns was inspired by Jacob's answer.
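The key point is that the %i conversion picks the base from the prefix, so 0x values are read as hexadecimal while plain digits are read as decimal. A small sketch on an in-memory string (the column count is assumed to be known here):

```matlab
% Same rows as the sample file, held in a string instead of a file.
txt = sprintf('1 0xc661\n2 0xd661\n3 0xe661\n');
nCol = 2;

% '%i' reads decimal by default and hexadecimal when the value starts with 0x.
data = sscanf(txt, '%i', [nCol inf]).';
```

One caveat: with %i, a value that starts with a plain 0 is interpreted as octal, so zero-padded decimal columns need %d instead.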
New:
This is the best I could come up with. It should work for any number of comment lines and columns. You'll have to do the rest yourself if there are strings, etc.
% Define the characters representing the start of the commented line
% and the delimiter
COMMENT_START = '%';
DELIMITER = ' ';
% Open the file
fid = fopen('testMixed.txt');
% Read each line till we reach the data
l = COMMENT_START;
while(l(1)==COMMENT_START)
l = fgetl(fid);
end
% Compute the number of columns
cols = sum(l==DELIMITER)+1;
% Split the first line
split_l = regexp(l,' ','split');
% Read all the data
A = textscan(fid,'%s');
% Compute the number of rows
rows = numel(A{:})/cols;
% Close the file
fclose(fid);
% Assemble all the data into a matrix of cell strings
DATA = [split_l ; reshape(A{:},[cols rows])']; %' adding this to make it pretty in SO
% Recognize each column and process accordingly
% by analyzing each element in the first row
numeric_data = zeros(size(DATA));
for i=1:cols
str = DATA(1,i);
% If there is no '0x' present
if isempty(strfind(str{1},'0x'))
% This is a number
numeric_data(:,i) = str2num(char(DATA(:,i)));
else
% This is a hexadecimal number
col = char(DATA(:,i));
numeric_data(:,i) = hex2dec(col(:,3:end));
end
end
% Display the data
format short g;
disp(numeric_data)
This works for data like this:
% Comment 1
% Comment 2
1.2 0xc661 10 0xa661
2 0xd661 20 0xb661
3 0xe661 30 0xc661
Output:
1.2 50785 10 42593
2 54881 20 46689
3 58977 30 50785
OLD:
Yeah, I don't think LOAD is the way to go. You could try:
a = char(importdata('testHexa.txt'));
a = hex2dec(a(:,3:end));
This is based on both gnovice's and Jacob's answers, and is a "best of breed"
For files like:
% this is my comment
% this is my other comment
1 0xc661 123
2 0xd661 456
% surprise comment
3 0xe661 789
4 0xb661 1234567
(where the number of columns within the file MUST be the same, but not known ahead of time, and all comments denoted by a '%' character), the following code is fast and easy to read:
f = fopen('hexdata.txt', 'rt');
A = textscan(f, '%s', 'Delimiter', '\n', 'MultipleDelimsAsOne', '1', 'CollectOutput', '1', 'CommentStyle', '%');
fclose(f);
A = A{1};
data = sscanf(A{1}, '%i')';
data = repmat(data, length(A), 1);
for ctr = 2:length(A)
data(ctr,:) = sscanf(A{ctr}, '%i')';
end