Reading large amount of data stored in lines from csv - matlab

I need to read in a lot of data (~10^6 data points) from a *.csv-file.
the data is stored in lines
I can't know how many data points per line and how many lines are there before I read it in
the amount of data points per line can be different for each line
So the *.csv-file could look like this:
x Header
x1,x2
y Header
y1,y2,y3, ...
z Header
z1,z2
...
Right now I read in every line as string and split it at every comma. This is what my code looks like:
index = 1;
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
while ~isempty(headerLine{1})
dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
'BufSize',2^31 - 1);
rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
index = index + 1;
end
It's working, but it's pretty slow. Most of the time is used while splitting the string with textscan. (~95%).
I preallocated rawData with sample data, but it brought next to nothing for the speed.
Is there a better way than mine to read in something like this?
If not, is there a faster way to split this string?

First suggestion: to read a single line as a string when looping over a file, just use fgetl (returns a nice single string so no faffing with cell arrays).
Also, you might consider (if possible), reading everything in a single go rather than making repeating reads from file:
output = textscan(fid, '%*s%s','Delimiter','\n'); % skips headers with *
If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
For converting the string, there are the options of str2num or strsplit+str2double but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since this doesn't accept the delimiter as a separate input put it in the format string (the last value doesn't end with ,, true, but sscanf can handle that).
for n = 1:length(output);
data{n} = sscanf(output{n},'%f,');
end
Tests with a limited patch of test data suggests sscanf is a bit quicker (but might depend on machine/version/data sizes).

Related

MATLAB fwrite\fread issue: two variables are being concatenated

I am reading in a binary EDF file and I have to split it into multiple smaller EDF files at specific points and then adjust some of the values inside. Overall it works quite well but when I read in the file it combines 2 character arrays with each other. Obviously everything afterwords gets corrupted as well. I am at a dead end and have no idea what I'm doing wrong.
The part of the code (writing) that has to contain the problem:
byt=fread(fid,8,'*char');
fwrite(tfid,byt,'*char');
fwrite(tfid,fread(fid,44));
%new number of records
s = records;
fwrite(tfid,s,'*char');
fseek(fid,8,0);
%test
fwrite(tfid,fread(fid,8,'*char'),'*char');
When I use the reader it combines the records (fwrite(tfid,s,'*char'))
with the value of the next variable. All variables before this are displayed correctly. The relevant code of the reader:
hdr.bytes = str2double(fread(fid,8,'*char')');
reserved = fread(fid,44);%#ok
hdr.records = str2double(fread(fid,8,'*char')');
if hdr.records == -1
beep
disp('There appears to be a problem with this file; it returns an out-of-spec value of -1 for ''numberOfRecords.''')
disp('Attempting to read the file with ''edfReadUntilDone'' instead....');
[hdr, record] = edfreadUntilDone(fname, varargin);
return
end
hdr.duration = str2double(fread(fid,8,'*char')');
The likely problem is that your character array s does not have 8 characters in it, but you expect there to be 8 when you read it from the file. Whatever the number of characters in the array is, that's how many values fwrite will write out to the file. Anything less than 8 characters and you'll end up reading part of the next piece of data when you read from the file.
One fix would be to pad s with blanks before writing it:
s = [blanks(8-numel(records)) records];
In addition, the syntax '*char' is only valid when using fread: the * indicates that the output class should be 'char' as well. It's unnecessary when using fwrite.

looping and ordering extracted data in MatLab

I have hundreds of data files, named file001~file400 for example, and should pick a numeric value from each. I know how to pick each number, but, since there are too many files, I need to loop the command and order all extracted numbers with regard to the number of corresponding file. I appreciate any help.
The problem is to loop over all files. This can be done in the following way:
for i = 1:400
filename = sprintf('file%03d',i);
// do the number picking, etc. using the filename.
end
EDIT: Per request, for filenames FT00100 to FT05320, we we make two small changes, one in the loop range and one in the first parameter to sprintf:
for i = 100:5320
filename = sprintf('FT%05d',i);
// do the number picking, etc. using the filename.
end

How to avoid the repeated paragraghs of long txt files being ignored for importdata in matlab

I am trying to import all double from a txt file, which has this form
#25x1 string
#9999x2 double
.
.
.
#(repeat ten times)
However, when I am trying to use import Wizard, only the first
25x1 string
9999x2 double.
was successfully loaded, the other 9 were simply ignored
How may I import all the data? (Does importdata has a maximum length or something?)
Thanks
It's nothing to do with maximum length, importdata is just not set up for the sort of data file you describe. From the help file:
For ASCII files and spreadsheets, importdata expects
to find numeric data in a rectangular form (that is, like a matrix).
Text headers can appear above or to the left of the numeric data,
as follows:
Column headers or file description text at the top of the file, above
the numeric data. Row headers to the left of the numeric data.
So what is happening is that the first section of your file, which does match the format importdata expects, is being read, and the rest ignored. Instead of importdata, you'll need to use textscan, in particular, this style:
C = textscan(fileID,formatSpec,N)
fileID is returned from fopen. formatspec tells textscan what to expect, and N how many times to repeat it. As long as fileID remains open, repeated calls to textscan continue to read the file from wherever the last read action stopped - rather than going back to the start of the file. So we can do this:
fileID = fopen('myfile.txt');
repeats = 10;
for n = 1:repeats
% read one string, 25 times
C{n,1} = textscan(fileID,'%s',25);
% read two floats, 9999 times
C{n,2} = textscan(fileID,'%f %f',9999);
end
You can then extract your numerical data out of the cell array (if you need it in one block you may want to try using 'CollectOutput',1 as an option).

How to get the number of columns of a csv file?

I have a huge csv file that I want to load with matlab. However, I'm only interested in specific columns that I know the name.
As a first step, I would like to just check how many columns the csv file has. How can I do that with matlab?
As Jonesy and erelender suggest, I would think this will do it:
fid=fopen(filename);
tline = fgetl(fid);
fclose(fid);
length(find(tline==','))+1
Since you don't seem to know what kind of carriage return character (or character encoding?) is being used then I would suggest progressively sampling your file until you encounter a recognizable CR character. One way to do this is to loop over something like
A = fscanf(fileID, ['%' num2str(N) 'c'], sizeA);
where N is the number of characters to read. At each iteration test A for presence of carriage return characters, stop if one is encountered. Once you know where the carriage return is just repeat with the right N and perform the length(find...) operation, or alternately accumulate the number of commas at each iteration. You may want to check that your file is being read along rows (is it always?), check a few samples to make sure it is.
1-) Read the first line of file
2-) Count the number of commas, or seperator characters if it is not comma
3-) Add 1 to the count and the result is the number of columns in the file.
If the csv has only numeric value you can use:
M=csvread('file_name.csv');
[row,col]=size(M);

Data separation by particular rows in Matlab

I am relatively new to using Matlab and I don't have much knowledge about programming either. For a project I am working on currently I need to process a lot of data which is logged using the following format.
$GPRMC,202124.985,V,,,,,,,091112,,,N*44
2038,4674,4667,5593,3379
2087,5133,5111,6084,3372
2138,5134,5114,6080,3376
2188,5133,5114,6084,3377
2238,5130,5113,6084,3410
2287,5134,5113,6080,3416
2337,5133,5110,6080,3417
2387,5133,5110,6084,3416
2438,5130,5113,6081,3396
2487,5132,5110,6080,3410
$GPRMC,202125.985,V,,,,,,,091112,,,N*45
2985,5130,5113,6085,3988
3035,5130,5118,6084,4541
3085,5138,5113,6082,5186
3135,5130,5114,6081,6001
3185,5134,5110,6084,6311
3234,5134,5113,6084,6319
3284,5131,5114,6084,6316
3339,5131,5110,6084,6260
3389,5130,5114,6080,6178
3438,5134,5110,6085,6077
$GPRMC,202126.985,V,,,,,,,091112,,,N*46
3942,5131,5114,6085,5916
3992,5130,5110,6084,5917
4042,5133,5110,6084,5950
4091,5131,5114,6080,5996
4142,5134,5114,6085,6062
4192,5134,5114,6084,6129
4242,5134,5110,6080,6150
4291,5130,5110,6079,6186
4341,5130,5110,6089,6246
4391,5130,5118,6083,6266
It continues like this until the end of the file. What I want to do is to be able to separate the data such that, all the '$GPRMC' strings (rows) are listed together as text (not separated) in one file or array while all the other rows (numerical) listed together in one file array (comma separated is desirable). Is it even possible? If it is than can you please give me some pointers?
Not quite sure what you mean by separated or not separated. If you copy the text you posted into some file like testf.dat, a simple script like this using fopen, fprintf, and fgets might be what you're looking for:
infile = fopen('testf.dat');
outf1 = fopen('GPRMC.dat','w');
outf2 = fopen('nums.dat','w');
tline = fgets(infile);
while ischar(tline)
if tline(1:6) == '$GPRMC'
fprintf(outf1,tline);
else
fprintf(outf2,tline);
end
tline = fgets(infile);
end
fclose(infile);
fclose(outf1);
fclose(outf2);