How to get the number of columns of a csv file? - matlab

I have a huge csv file that I want to load with matlab. However, I'm only interested in specific columns whose names I know.
As a first step, I would like to just check how many columns the csv file has. How can I do that with matlab?

As Jonesy and erelender suggest, I would think this will do it:
fid=fopen(filename);
tline = fgetl(fid);
fclose(fid);
length(find(tline==','))+1
Since you don't seem to know what kind of carriage return character (or character encoding?) is being used, I would suggest progressively sampling your file until you encounter a recognizable CR character. One way to do this is to loop over something like
A = fscanf(fileID, ['%' num2str(N) 'c'], sizeA);
where N is the number of characters to read. At each iteration, test A for the presence of carriage return characters and stop if one is encountered. Once you know where the carriage return is, just repeat with the right N and perform the length(find...) operation, or alternatively accumulate the number of commas at each iteration. You may want to check that your file is being read along rows (is it always?); check a few samples to make sure it is.
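A minimal sketch of the "accumulate the commas chunk by chunk" variant, assuming plain ASCII text and that either \r or \n ends a line. It uses fread rather than fscanf, and the sample file, the chunk size N = 3 and the variable names are made up for illustration:

```matlab
% Hypothetical sample file with CR-only line endings.
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, 'a,b,c,d\r1,2,3,4\r');
fclose(fid);

N = 3;          % chunk size in characters (tiny here, for illustration)
nCommas = 0;
fid = fopen(filename, 'r');
while true
    chunk = fread(fid, N, '*char')';    % up to N characters as a row vector
    if isempty(chunk), break; end       % reached end of file
    eolPos = find(chunk == char(13) | chunk == char(10), 1);
    if ~isempty(eolPos)
        % Count only the commas before the first line break, then stop.
        nCommas = nCommas + sum(chunk(1:eolPos-1) == ',');
        break;
    end
    nCommas = nCommas + sum(chunk == ',');
end
fclose(fid);
delete(filename);
nCols = nCommas + 1;
```

In practice N would be much larger (a few thousand characters), so that most files need only one or two reads.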

1-) Read the first line of the file
2-) Count the number of commas, or separator characters if the separator is not a comma
3-) Add 1 to the count; the result is the number of columns in the file.
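The steps above can be sketched as follows. This is only an illustration: the sample file and the column name looked up are hypothetical, and it assumes no field contains the separator inside quotes. Splitting the header with strsplit also gives the position of a column you know by name:

```matlab
% Hypothetical sample file so that the sketch is self-contained.
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, 'Nomenclature,ReporterISO3,ProductCode\n');
fprintf(fid, 'S3,ABW,0\n');
fclose(fid);

delim = ',';                   % change this if the separator is not a comma
fid = fopen(filename, 'r');
headerLine = fgetl(fid);       % first line, without the line terminator
fclose(fid);
delete(filename);

% 'CollapseDelimiters', false keeps empty fields (e.g. 'a,,b' is 3 columns).
names = strsplit(headerLine, delim, 'CollapseDelimiters', false);
nCols = numel(names);                        % number of columns
col = find(strcmp(names, 'ProductCode'));    % index of a column known by name
```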

If the csv has only numeric values you can use:
M=csvread('file_name.csv');
[row,col]=size(M);

Related

reading data from csv files with `textscan` in MATLAB

[Edited:] I have a file data2007a.csv and I copied and pasted (using TextEdit in MacBook) the first consecutive few lines to a new file datatest1.csv for testing:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,0,Aruba,ANT,Netherlands Antilles,2007,Export,6,448.91
S3,ABW,0,Aruba,ATG,Antigua and Barbuda,2007,Export,6,0.312
S3,ABW,0,Aruba,CHN,China,2007,Export,6,24.715
S3,ABW,0,Aruba,COL,Colombia,2007,Export,6,95.885
S3,ABW,0,Aruba,DOM,Dominican Republic,2007,Export,6,11.432
I wanted to use textscan to read it into MATLAB with only columns 2,3,5 (starting from the second row) and I wrote the following code
clc,clear all
fid = fopen('datatest1.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
But I ended up with only the second row of columns 2,3 and 5:
I then kept the first row of data2007a.csv, selected several other rows, and saved them as datatest2.csv:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,1,Aruba,USA,United States,2007,Export,6,1.392
S3,ABW,1,Aruba,VEN,Venezuela,2007,Export,6,5633.157
S3,ABW,2,Aruba,ANT,Netherlands Antilles,2007,Export,6,310.734
S3,ABW,2,Aruba,USA,United States,2007,Export,6,342.42
S3,ABW,2,Aruba,VEN,Venezuela,2007,Export,6,63.722
S3,AGO,0,Angola,DEU,Germany,2007,Export,6,105.334
S3,AGO,0,Angola,ESP,Spain,2007,Export,6,8533.125
And I wrote:
clc,clear all
fid = fopen('datatest2.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
data{1}
It gives exactly what I wanted:
When I use the same code on my original data file data2007a.csv, it behaves as in the first case.
What is going wrong and how can I fix it?
[Added:] If one replicates my experiments [1], one can find that both cases work and the problem does not exist! I really don't know what is going on.
[1] By "replicate" I mean copy-and-paste the data given above and save it as two new files, say, datatest4a.csv and datatest4b.csv. I used visdiff('datatest1.csv', 'datatest4a.csv') to compare two files and it returned:
Given how you fixed it, I think this is an end-of-line character issue. This sometimes comes up when moving text files between Windows and Unix based systems, as they use different conventions.
When you add %*[^\n] to the end of a textscan format, as you have here, it means to skip everything to the end of the line. But if it expects a specific end-of-line character and can't find one, it will skip everything to the end of the file. This would explain why you get one row correctly read and then nothing else.
If you don't specify what the end of line character is, Matlab appears to default to... something... in this not very clear specification in the help:
The default end-of-line sequence is \n, \r, or \r\n, depending on the contents of your file.
One way to try and cure this without having to create a new file would be to add 'EndOfLine', '\r\n' to your textscan call. From the help:
If you specify '\r\n', then textscan treats any of \r, \n, and the
combination of the two (\r\n) as end-of-line characters.
This will hopefully handle most standard(ish) EOL conventions. It is likely that copy-pasting and saving with a different bit of software than was originally used to create the file changed the end of line characters such that Matlab was able to recognise them.
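To check which convention a file actually uses before picking an 'EndOfLine' value, one option is to look at the raw characters. A rough sketch, assuming the first 4096 characters are representative (the sample file and variable names are made up for illustration):

```matlab
% Hypothetical sample file with Windows-style CRLF line endings.
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, 'a,b\r\n1,2\r\n');
fclose(fid);

fid = fopen(filename, 'r');
raw = fread(fid, 4096, '*char')';   % first chunk of the file as a row of characters
fclose(fid);
delete(filename);

nCRLF = numel(strfind(raw, sprintf('\r\n')));   % \r\n pairs (Windows)
nCR = sum(raw == char(13)) - nCRLF;             % bare \r (old Mac style)
nLF = sum(raw == char(10)) - nCRLF;             % bare \n (Unix)
```

Whichever count is non-zero tells you what the file contains; a mix of several usually means the file was edited with different tools.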

MATLAB export multiple .csv files at one time

I have a matrix where I need to export each column into a separate .csv file.
I know the number of columns and I can achieve my desired result if I specifically select one column to export. I would do this by:
dlmwrite('1.csv',data(:,1), 'precision', 9)
Therefore if I want column 2 I would change the variable to data(:,2) and save this as 2.csv.
So I want a loop that will do all this automatically. I have tried
for i=1:Number_of_Columns
    dlmwrite('(i).csv',csv_data(:,(i)), 'precision', 9)
end
which clearly won't work but I am unsure how to do it.
Any help or advice would be much appreciated
Your problem is the filename. If you put i between quotes it is treated as a literal character instead of a variable. (In your case the filename will always be "(i).csv")
You can concatenate strings using [ ], and since i is an integer you have to convert it to a string using num2str()
Try:
for i=1:Number_of_Columns
    dlmwrite([num2str(i) '.csv'], csv_data(:,i), 'precision', 9)
end
P.S.: Since you are storing each column (not each row) in a file, I'm not sure if you want a file where each element is on a separate line, or if you want the column to be stored as a row and separated by commas.
If you want the latter, transpose your column:
dlmwrite([num2str(i) '.csv'], csv_data(:,i).', 'precision', 9)
Note that the transpose operator is .' and not ', which is the complex conjugate transpose (this is a common misuse, since the results are the same as long as you only use real numbers)
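A tiny illustration of the difference, with made-up values:

```matlab
z = [1+2i; 3-1i];    % hypothetical complex column vector
rowT = z.';          % plain transpose:      [1+2i, 3-1i]
rowH = z';           % conjugate transpose:  [1-2i, 3+1i]

r = [1; 2; 3];       % for real data both operators agree
sameForReal = isequal(r.', r');
```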

Reading large amount of data stored in lines from csv

I need to read in a lot of data (~10^6 data points) from a *.csv-file.
The data is stored in lines.
I can't know how many data points there are per line, or how many lines there are, before I read the file in.
The number of data points per line can be different for each line.
So the *.csv-file could look like this:
x Header
x1,x2
y Header
y1,y2,y3, ...
z Header
z1,z2
...
Right now I read in every line as string and split it at every comma. This is what my code looks like:
index = 1;
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
while ~isempty(headerLine{1})
    dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
        'BufSize',2^31 - 1);
    rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
    headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
    index = index + 1;
end
It's working, but it's pretty slow. Most of the time (~95%) is spent splitting the string with textscan.
I preallocated rawData with sample data, but it made next to no difference to the speed.
Is there a better way than mine to read in something like this?
If not, is there a faster way to split this string?
First suggestion: to read a single line as a string when looping over a file, just use fgetl (returns a nice single string so no faffing with cell arrays).
Also, you might consider (if possible) reading everything in a single go rather than making repeated reads from the file:
output = textscan(fid, '%*s%s','Delimiter','\n'); % skips headers with *
If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
For converting the string, there are the options of str2num or strsplit+str2double, but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since this doesn't accept the delimiter as a separate input, put it in the format string (true, the last value doesn't end with a comma, but sscanf can handle that).
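output = output{1}; % textscan returns a 1x1 cell holding the cell array of lines
data = cell(size(output));
for n = 1:numel(output)
    data{n} = sscanf(output{n},'%f,');
end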
Tests with a limited batch of test data suggest sscanf is a bit quicker (but this might depend on machine/version/data sizes).
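For the line-by-line route, a minimal sketch combining fgetl and sscanf could look like this. The sample file is made up, and it assumes header and data lines strictly alternate as in the question:

```matlab
% Hypothetical sample file: header and data lines alternate.
filename = [tempname '.csv'];
fid = fopen(filename, 'w');
fprintf(fid, 'x Header\n1,2\ny Header\n3,4,5\n');
fclose(fid);

data = {};
fid = fopen(filename, 'r');
headerLine = fgetl(fid);            % fgetl returns -1 (numeric) at end of file
while ischar(headerLine)
    dataLine = fgetl(fid);
    if ~ischar(dataLine), break; end            % header without a data line
    data{end+1} = sscanf(dataLine, '%f,');      %#ok<AGROW> column of values
    headerLine = fgetl(fid);
end
fclose(fid);
delete(filename);
```

Each cell of data then holds one line's values as a column vector, so lines of different lengths are not a problem.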

How to import column of numbers from mixed string numerical text file

Variations of this question have already been asked several times, for example here. However, I can't seem to get this to work for my data.
I have a text file with 3 columns. First and third columns are floating point numbers. Middle column is strings. I'm only interested in getting the first column really.
Here's what I tried:
filename=fopen('heartbeatn1nn.txt');
A = textscan(filename,'%f','HeaderLines',0);
fclose(filename);
When I do this A comes out as just a single number--the first element in the column. How do I get the whole column? I've also tried this with the '.tsv' file extension, same result.
Also tried:
filename=fopen('heartbeatn1nn.txt');
formatSpec='%f';sizeA=[1 Inf];
A = fscanf(filename,formatSpec,sizeA);
fclose(filename);
with same result.
Could the file size be a problem? Not sure how many rows but probably quite a few since file size is 1.7M.
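With the format '%f' alone, textscan stops at the first field it cannot convert to a number, which here is the string in the second column; that is why you only get the first value. Assuming the columns in your text file are separated by single whitespace characters, your format specification should look like this (note that the variable you call filename actually holds the file identifier returned by fopen):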
A = textscan(filename,'%f %s %f');
A now contains the complete file content. To obtain the first column:
first_col = A{:,1};
Alternatively, on a freshly opened (or rewound, see frewind) file, you can tell textscan to skip the unneeded fields with the * option:
first_col = cell2mat( textscan(filename, '%f %*s %*f') );
This returns only the first column.
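A self-contained sketch of the skip-fields variant, using a made-up three-column file (the values and the name fid are illustrative, not taken from your data):

```matlab
% Hypothetical sample file: number, string, number on each line.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '0.81 N 0.5\n0.79 N 0.6\n0.80 V 0.7\n');
fclose(fid);

fid = fopen(filename, 'r');
C = textscan(fid, '%f %*s %*f');   % keep column 1, skip columns 2 and 3
fclose(fid);
delete(filename);

first_col = cell2mat(C);           % numeric column vector
```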

How to avoid the repeated paragraphs of long txt files being ignored for importdata in matlab

I am trying to import all the doubles from a txt file, which has this form
#25x1 string
#9999x2 double
.
.
.
#(repeat ten times)
However, when I try to use the Import Wizard, only the first
25x1 string
9999x2 double
block was loaded successfully; the other 9 were simply ignored.
How can I import all the data? (Does importdata have a maximum length or something?)
Thanks
It's nothing to do with maximum length, importdata is just not set up for the sort of data file you describe. From the help file:
For ASCII files and spreadsheets, importdata expects
to find numeric data in a rectangular form (that is, like a matrix).
Text headers can appear above or to the left of the numeric data,
as follows:
Column headers or file description text at the top of the file, above
the numeric data. Row headers to the left of the numeric data.
So what is happening is that the first section of your file, which does match the format importdata expects, is being read, and the rest ignored. Instead of importdata, you'll need to use textscan, in particular, this style:
C = textscan(fileID,formatSpec,N)
fileID is returned from fopen. formatSpec tells textscan what to expect, and N how many times to repeat it. As long as fileID remains open, repeated calls to textscan continue to read the file from wherever the last read action stopped, rather than going back to the start of the file. So we can do this:
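fileID = fopen('myfile.txt');
repeats = 10;
C = cell(repeats,2); % preallocate the output
for n = 1:repeats
    % read one string, 25 times
    C{n,1} = textscan(fileID,'%s',25);
    % read two floats, 9999 times
    C{n,2} = textscan(fileID,'%f %f',9999);
end
fclose(fileID); % close the file once all blocks are read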
You can then extract your numerical data out of the cell array (if you need it in one block you may want to try using 'CollectOutput',1 as an option).
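A scaled-down sketch of the same idea, including the extraction step. It assumes each text line is a single whitespace-free token; the file contents, block sizes and names below are made up for illustration:

```matlab
% Hypothetical miniature of the layout: 2 blocks, each with
% 2 text lines followed by a 3x2 numeric table.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
for b = 1:2
    fprintf(fid, 'labelA\nlabelB\n');
    fprintf(fid, '%d %d\n', [1 2; 3 4; 5 6]' + 10*(b-1));
end
fclose(fid);

repeats = 2;
C = cell(repeats, 2);
fid = fopen(filename, 'r');
for n = 1:repeats
    C{n,1} = textscan(fid, '%s', 2);                          % 2 strings
    C{n,2} = textscan(fid, '%f %f', 3, 'CollectOutput', 1);   % 3 rows of 2 numbers
end
fclose(fid);
delete(filename);

% With 'CollectOutput', each C{n,2} is a 1x1 cell holding one 3x2 matrix.
numbers = cellfun(@(c) c{1}, C(:,2), 'UniformOutput', false);
allNumbers = vertcat(numbers{:});   % all numeric blocks stacked vertically
```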