matlab: speed up for loop with string analysis

matlab: speed up for loop with string analysis - matlab

I have a very lare csv file containing three columns. Now I want to load these columns as fast as possible into a matlab matrix.
Currently what I do is this
fid = fopen(inputfile, 'rt');
g = textscan(fid,'%s','delimiter','\r\n');
tdata = g{1};
fclose(fid);
results = zeros([numel(tdata)-4], 3);
tic
display('start reading data...');
for r = 4:numel(tdata)
if ~mod(r, 100)
display(['data row: ' num2str(r) ' / ' num2str(numel(tdata))]);
end
entries = strsplit(tdata{r}, ',');
results(r-3,1) = str2double(strrep(entries{1},',', '.'));
results(r-3,2) = str2double(strrep(entries{2},',', '.'));
results(r-3,3) = str2double(strrep(entries{3},',', '.'));
end
This however takes ~30 seconds for 200 000 lines. This means 150 µs per line. This is really slow. The code is not accepted by parfor.
Now I would like to know what causes the bottleneck in the for loop and how I can speed it up.
Here the measured times:
str2double 578253 calls 29.631s
strsplit 192750 calls 13.388s
EDIT:
The content has this structure in the file
0.000000, -0.00271, 5394147
0.000667, -0.00271, 5394148
0.001333, -0.00271, 5394149
0.002000, -0.00271, 5394150

I think a lot can be improved by calling textscan differently.
You do this:
g = textscan(fid,'%s','delimiter','\r\n');
But then call tdata = g{1};
If textscan is called correctly it should already split all your data, and give it back as numbers.
Try this:
g=textscan(fid,'%f,%f,%f,'delimiter','\r\n')
It should give you back three cell arrays with in the columns your values. To convert to a matrix you can use:
g=cell2mat(g)
I imported 200k lines in 0.12 seconds.
It seems your code has some other workarounds. You start at r=4, it seems you have 3 lines that you don't want to read. so after fopen you can call 3 times
[~] =fgetl(fid)
to get to the interesting part of your file.
You also first split the line with ',' as seperator. But the replace all ',' by '.'. That will not do anything, all ',' are already gone since they were used as seperators.

If you used csvread you wouldn't need to use str2double or strsplit, which you say are the slow lines... it's likely much quicker for a csv.
You would be able to replace all the above code by:
results = csvread(inputfile);

Related

Read first line of CSV file that contains text in cells

I have a deco.csv file and I only want to extract B1 to K1 (20 columns of the first rows), i.e. Deco_0001 to Deco_0020.
I first make a pre-allocation:
names = string(20,1);
and what I want is when calling S(1), it gives Deco_0001; when calling S(20), it gives Deco_0020.
I have read through textscan but I do not know how to specify the range is first row and running from column 2 to column 21 of the csv file.
Also, I want save the names individually but what I have tried just save the first line in only one cell:
fid=fopen('deco.csv');
C=textscan(fid, '%s',1);
fclose(fid);
Thanks!

It's not very elegant, but this should work for you:
fid=fopen('deco.csv');
C=textscan(fid, '%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s',1,'Delimiter',',');
fclose(fid);
S = cell(20,1);
for ii = 1:20
S{ii} = C{ii+1};
end

joining arrays in Matlab and writing to file using dlmwrite( ) adds extra space

I am generating 2500 values in Matlab in format (time,heart_rate, resp_rate) by using below code
numberOfSeconds = 2500;
time = 1:numberOfSeconds;
newTime = transpose(time);
number0 = size(newTime, 1)
% generating heart rates
heart_rate = 50 +(70-50) * rand (numberOfSeconds,1);
intHeartRate = int64(heart_rate);
number1 = size(intHeartRate, 1)
% hist(heart_rate)
% generating resp rates
resp_rate = 50 +(70-50) * rand (numberOfSeconds,1);
intRespRate = int64(resp_rate);
number2 = size(intRespRate, 1)
% hist(heart_rate)
% joining time and sensor data
joinedStream = strcat(num2str(newTime),{','},num2str(intHeartRate),{','},num2str(intRespRate))
dlmwrite('/Users/amar/Desktop/geenrated/rate.txt', joinedStream,'delimiter','');
The data shown in the console is alright, but when I save this data to a .txt file, it contains extra spaces in beginning. Hence I am not able to parse the .txt file to generate input stream. Please help

Replace the last two lines of your code with the following. No need to use strcat if you want a CSV output file.
dlmwrite('/Users/amar/Desktop/geenrated/rate.txt', [newTime intHeartRate intRespRate]);

𝑇ℎ𝑒 𝑠𝑜𝑙𝑢𝑡𝑖𝑜𝑛 𝑠𝑢𝑔𝑔𝑒𝑠𝑡𝑒𝑑 𝑏𝑦 𝑃𝐾𝑜 𝑖𝑠 𝑡ℎ𝑒 𝑠𝑖𝑚𝑝𝑙𝑒𝑠𝑡 𝑓𝑜𝑟 𝑦𝑜𝑢𝑟 𝑐𝑎𝑠𝑒. 𝑇ℎ𝑖𝑠 𝑎𝑛𝑠𝑤𝑒𝑟 𝑒𝑥𝑝𝑙𝑎𝑖𝑛𝑠 𝑤ℎ𝑦 𝑦𝑜𝑢 𝑔𝑒𝑡 𝑡ℎ𝑒 𝑢𝑛𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑 𝑜𝑢𝑡𝑝𝑢𝑡.
The data written in the file is exactly what is shown in the console.
>> joinedStream(1) %The exact output will differ since 'rand' is used
ans =
cell
' 1,60,63'
num2str basically converts a matrix into a character array. Hence number of characters in its each row must be same. So for each column of the original matrix, the row with the maximum number of characters is set as a standard for all the rows with less characters and the deficiency is filled by spaces. Columns are separated by 2 spaces. Take a look at the following smaller example to understand:
>> num2str([44, 42314; 4, 1212421])
ans =
2×11 char array
'44 42314'
' 4 1212421'

Read textfile with a mix of floats, integers and strings in the same column

Loading a well formatted and delimited text file in Matlab is relatively simple, but I struggle with a text file that I have to read in. Sadly I can not change the structure of the source file, so I have to deal with what I have.
The basic file structure is:
123 180 (two integers, white space delimited)
1.5674e-8
.
.
(floating point numbers in column 1, column 2 empty)
.
.
100 4501 (another two integers)
5.3456e-4 (followed by even more floating point numbers)
.
.
.
.
45 String (A integer in column 1, string in column 2)
.
.
.
A simple
[data1,data2]=textread('filename.txt','%f %s', ...
'emptyvalue', NaN)
Does not work.
How can I properly filter the input data? All examples I found online and in the Matlab help so far deal with well structured data, so I am a bit lost at where to start.
As I have to read a whole bunch of those files >100 I rather not iterate trough every single line in every file. I hope there is a much faster approach.
EDIT:
I made a sample file available here: test.txt (google drive)

I've looked at the text file you supplied and tried to draw a few general conclusions -
When there are two integers on a line, the second integer corresponds to the number of rows following.
You always have (two integers (A, B) followed by "B" floats), repeated twice.
After that you have some free-form text (or at least, I couldn't deduce anything useful about the format after that).
This is a messy format so I doubt there are going to be any nice solutions. Some useful general principles are:
Use fgetl when you need to read a single line (it reads up to the next newline character)
Use textscan when it's possible to read multiple lines at once - it is much faster than reading a line at a time. It has many options for how to parse, which it is worth getting to know (I recommend typing doc textscan and reading the entire thing).
If in doubt, just read the lines in as strings and then analyse them in MATLAB.
With that in mine, here is a simple parser for your files. It will probably need some modifications as you are able to infer more about the structure of the files, but it is reasonably fast on the ~700 line test file you gave.
I've just given the variables dummy names like "a", "b", "floats" etc. You should change them to something more specific to your needs.
function output = readTestFile(filename)
fid = fopen(filename, 'r');
% Read the first line
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
a = nums{1}(1);
b = nums{1}(2);
% Read 'b' of the next lines:
contents = textscan(fid, '%f', b);
floats1 = contents{1};
% Read the next line:
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
c = nums{1}(1);
d = nums{1}(2);
% Read 'd' of the next lines:
contents = textscan(fid, '%f', d);
floats2 = contents{1};
% Read the rest:
rest = textscan(fid, '%s', 'Delimiter', '\n');
output.a = a;
output.b = b;
output.c = c;
output.d = d;
output.floats1 = floats1;
output.floats2 = floats2;
output.rest = rest{1};
end

You can read in the file line by line using the lower-level functions, then parse each line manually.
You open the file handle like in C
fid = fopen(filename);
Then you can read a line using fgetl
line = fgetl(fid);
String tokenize it on spaces is probably the best first pass, storing each piece in a cell array (because a matrix doesn't support ragged arrays)
colnum = 1;
while ~isempty(rem)
[token, rem] = strtok(rem, ' ');
entries{linenum, colnum} = token;
colnum = colnum + 1;
end
then you can wrap all of that inside another while loop to iterate over the lines
linenum = 1;
while ~feof(fid)
% getl, strtok, index bookkeeping as above
end
It's up to you whether it's best to parse the file as you read it or read it into a cell array first and then go over it afterwards.
Your cell entries are all going to be strings (char arrays), so you will need to use str2num to convert them to numbers. It does a good job of working out the format so that might be all you need.

How to read digits from file to matrix, no delimeter

I have a data stored in below format, no delimeter and digit domain is {0,1}. With using octave, taking the digits and storing them in martix is reaised a problem for me. I have not managed below scnerio. So, How can I take those digits and store them on matrix as told at below?
Data in File, 32 x 32 digits
00000000000000000000000000000000
00000000001111110000000000000000
...
00000010000000100001000000000000
how to store data
matrix[1, 1:32] = 00000000000000000000000000000000
matrix[2, 1:32] = 00000000001111110000000000000000
. . .
matrix[32, 1:32] = 00000010000000100001000000000000
OR
matrix[1, 1:32] = 00000000000000000000000000000000
matrix[1, 33:64] = 00000000001111110000000000000000
. . .
matrix[1, 993:1024] = 00000010000000100001000000000000

One possible solution is to read the data as a string first:
octave> textread('foo.dat', '%s', 'headerlines', 2)
ans =
{
[1,1] = 00000000000000000000000000000000
[2,1] = 00000000001111110000000000000000
...
}
If these are binary representations of decimals, you may find bin2dec() useful.

This would do the trick (though I don't know how well that third input to fread and arrayfun work with Octave, tested this on Matlab):
fid = fopen('a.txt','rt');
str = fread(fid,inf,'char=>char');
st = fclose(fid);
qrn = str==10|str==13;
str(qrn) = [];
yourMat = reshape(arrayfun(#str2num,str),find(qrn,1)-1,[]).'

Assuming you don't have header lines, you can read the text in as a cell arrray of strings like so:
C = textread('names.txt', '%s');
Then, in general for all numbers from 0 to 9, you can transform this into a matrix like so:
M = vertcat(S{:})-'0';
If performance is an issue you can look into other ways to import the strings, but this should get the job done.

I have never used Matlab, but asuming it reads files the same way Octave does, and if using an external tool is OK, you could try replacing the characters to add a delimiter using a text editor. You could change every "0" to "0," and every "1" to "1," and then simply load the file.
(This would add a delimiter at the end of every line. In case that creates a problem, you could try replacing your text by pairs instead "00"->"0,0" "10" -> "1,0" and so on)
In case the file is too big for a normal editor, you might even try replacing the characters with sed:
sed -i 's/charactertoreplace/newcharacter/g' yourfile.txt

skip reading headers in MATLAB

I had a similar question. but what i am trying now is to read files in .txt format into MATLAB. My problem is with the headers. Many times due to errors the system rewrites the headers in the middle of file and then MATLAB cannot read the file. IS there a way to skip it? I know i can skip reading some characters if i know what the character is.
here is the code i am using.
[c,pathc]=uigetfile({'*.txt'},'Select the data','V:\data');
file=[pathc c];
data= dlmread(file, ',', 1,4);
this way i let the user pick the file. My files are huge typically [ 86400 125 ]
so naturally it has 125 header fields or more depends on files.
Thanks
Because the files are so big i cannot copy , but its in format like
day time col1 col2 col3 col4 ...............................
2/3/2010 0:10 3.4 4.5 5.6 4.4 ...............................
..................................................................
..................................................................
and so on

With DLMREAD you can read only numeric data. It will not read date and time, as your first two columns contain. If other data are all numeric you can tell DLMREAD to skip first row and 2 columns on the right:
data = dlmread(file, ' ', 1,2);
To import also day and time you can use IMPORTDATA instead of DLMREAD:
A = importdata(file, ' ', 1);
dt = datenum(A.textdata(2:end,1),'mm/dd/yyyy');
tm = datenum(A.textdata(2:end,2),'HH:MM');
data = A.data;
The date and time will be converted to serial numbers. You can convert them back with DATESTR function.

It turns out that you can still use textscan. Except that you read everything as string. Then, you attempt to convert to double. 'str2double' returns NaN for strings, and since headers are all strings, you can identify header rows as rows with all NaNs.
For example:
%# find and open file
[c,pathc]=uigetfile({'*.txt'},'Select the data','V:\data');
file=[pathc c];
fid = fopen(file);
%# read all text
strData = textscan(fid,'%s%s%s%s%s%s','Delimiter',',');
%# close the file again
fclose(fid);
%# catenate, b/c textscan returns a column of cells for each column in the data
strData = cat(2,strData{:});
%# convert cols 3:6 to double
doubleData = str2double(strData(:,3:end));
%# find header rows. headerRows is a logical array
headerRowsL = all(isnan(doubleData),2);
%# since I guess you know what the headers are, you can just remove the header rows
dateAndTimeCell = strData(~headerRowsL,1:2);
dataArray = doubleData(~headerRowsL,:);
%# and you're ready to start working with your data

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

matlab: speed up for loop with string analysis - matlab

If you used csvread you wouldn't need to use str2double or strsplit, which you say are the slow lines... it's likely much quicker for a csv. You would be able to replace all the above code by: results = csvread(inputfile);

Related

Read first line of CSV file that contains text in cells

joining arrays in Matlab and writing to file using dlmwrite( ) adds extra space

Read textfile with a mix of floats, integers and strings in the same column

How to read digits from file to matrix, no delimeter

skip reading headers in MATLAB

Categories

Resources