Reading .csv file into MATLAB - matlab

As a part of my assignment, I have to read a .csv file. The file contains a mixture of text, numeric data and missing data under the columns:
Number, Title, Description (>100 words, variable length), Location, Time, Term, Company, Category, Source.
There are more than 0.5 million rows.
Please suggest a command to read this file into MATLAB.
I have already tried the following:
uiopen('filename.csv',1)
It gives error: Use textscan to read more complex formats. Then I tried:
data =textscan('filename.csv','%f %s %s %s %s %s %s %s %s %f','HeaderLines', 1, 'Delimiter', ',');
This command runs to completion, but it only gives an array (1X10) of cells (which are empty). Hence, I am not getting what I want.
I also tried textread command but it gives error.

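textscan is what you want to use, but according to the MATLAB documentation page for textscan the first argument is supposed to be a file ID returned by fopen. Right now you are passing in the file name as a string, so textscan scans the text 'filename.csv' itself instead of the contents of the file.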
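A minimal sketch of the fopen/textscan/fclose pattern, assuming a small three-column file. The temporary file, its column names, and its contents are made up here so the snippet runs on its own; the real file would need one format specifier per column.

```matlab
% Hypothetical sample file so the sketch is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Number,Title,Location\n');
fprintf(fid, '1,Engineer,London\n');
fprintf(fid, '2,Analyst,Leeds\n');
fclose(fid);

% Open the file first, then pass the file ID (not the name) to textscan.
fid = fopen(fname, 'r');
if fid == -1
    error('Could not open %s', fname);
end
data = textscan(fid, '%f %s %s', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);
delete(fname);

numbers = data{1};   % numeric column vector
titles  = data{2};   % cell array of character vectors
```

Each cell of data holds one column, so data{1} is numeric and data{2} and data{3} are cell arrays of text.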

You may want to try using readtable:
t = readtable('filename.csv');
This will create a table in Matlab, which can contain the strings and numeric data.
Alternatively, you can use the Import Tool (accessible from the Import Data button on the Matlab UI), or you can use uiimport to open it:
uiimport('filename.csv')
This will present a graphical presentation of your data, and can generate code for you to do the import as well.
The difficulty you may face with textscan is that you need to get the formats (%f, %s, etc...) correct, and any variations in this may cause failures. For example, if you have strings in a numeric field because of some missing/bad data, it may fail. If you choose to use textscan and don't get the results you're expecting, you may want to try with all '%s' format specifications.
textscan(f,'%s %s %s %s %s %s %s %s %s %s','HeaderLines', 1, 'Delimiter', ',');
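If every column is read as '%s', the numeric columns can be converted afterwards. The sketch below assumes a made-up two-column file with one bad entry ('N/A') in the numeric column; str2double turns anything it cannot parse into NaN rather than failing.

```matlab
% Hypothetical sample with a non-numeric entry in the numeric column.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Number,Title\n1,Engineer\nN/A,Analyst\n3,Clerk\n');
fclose(fid);

% Read everything as text so bad entries cannot break the import.
fid = fopen(fname, 'r');
raw = textscan(fid, '%s %s', 'HeaderLines', 1, 'Delimiter', ',');
fclose(fid);
delete(fname);

numbers = str2double(raw{1});  % entries that are not numbers become NaN
titles  = raw{2};
```

The NaN entries can then be located with isnan(numbers) and handled as missing data.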

Related

reading data from csv files with `textscan` in MATLAB

[Edited:] I have a file data2007a.csv and I copied and pasted (using TextEdit in MacBook) the first consecutive few lines to a new file datatest1.csv for testing:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,0,Aruba,ANT,Netherlands Antilles,2007,Export,6,448.91
S3,ABW,0,Aruba,ATG,Antigua and Barbuda,2007,Export,6,0.312
S3,ABW,0,Aruba,CHN,China,2007,Export,6,24.715
S3,ABW,0,Aruba,COL,Colombia,2007,Export,6,95.885
S3,ABW,0,Aruba,DOM,Dominican Republic,2007,Export,6,11.432
I wanted to use textscan to read it into MATLAB with only columns 2,3,5 (starting from the second row) and I wrote the following code
clc,clear all
fid = fopen('datatest1.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
But I ended up with only the second row of columns 2,3 and 5:
I then kept the first row of data2007a.csv, selected several other rows, and saved them as datatest2.csv:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,1,Aruba,USA,United States,2007,Export,6,1.392
S3,ABW,1,Aruba,VEN,Venezuela,2007,Export,6,5633.157
S3,ABW,2,Aruba,ANT,Netherlands Antilles,2007,Export,6,310.734
S3,ABW,2,Aruba,USA,United States,2007,Export,6,342.42
S3,ABW,2,Aruba,VEN,Venezuela,2007,Export,6,63.722
S3,AGO,0,Angola,DEU,Germany,2007,Export,6,105.334
S3,AGO,0,Angola,ESP,Spain,2007,Export,6,8533.125
And I wrote:
clc,clear all
fid = fopen('datatest2.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
data{1}
It gives exactly what I wanted:
When I use the same code for my original data file data2007a.csv, it goes as in the first case.
What is going wrong and how can I fix it?
[Added:] If one replicates my experiments [1], one can find that both cases work and the problem does not exist! I really don't know what is going on.
[1] By "replicate" I mean copying and pasting the data given above and saving it as two new files, say, datatest4a.csv and datatest4b.csv. I used visdiff('datatest1.csv', 'datatest4a.csv') to compare the two files and it returned:
Given how you fixed it, I think this is an end-of-line character issue. This sometimes comes up when moving text files between Windows and Unix based systems, as they use different conventions.
When you add %*[^\n] to the end of a textscan format, as you have here, it means to skip everything to the end of the line. But if it expects a specific end-of-line character and can't find one, it will skip everything to the end of the file. This would explain why you get one row correctly read and then nothing else.
If you don't specify what the end of line character is, Matlab appears to default to... something... in this not very clear specification in the help:
The default end-of-line sequence is \n, \r, or \r\n, depending on the contents of your file.
One way to try to cure this without having to create a new file would be to add 'EndOfLine', '\r\n' to your textscan call. From the documentation:
If you specify '\r\n', then textscan treats any of \r, \n, and the
combination of the two (\r\n) as end-of-line characters.
This will hopefully handle most standard(ish) EOL conventions. It is likely that copy-pasting and saving with a different bit of software than was originally used to create the file changed the end of line characters such that Matlab was able to recognise them.
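To check which end-of-line convention a file actually uses before choosing an 'EndOfLine' value, one option is to count the raw bytes. The sketch below writes a made-up file with CR-only line endings so that it is self-contained; for a real file, only the fread part is needed.

```matlab
% Hypothetical sample written with CR-only line endings.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'a,b\r1,2\r3,4\r');
fclose(fid);

% Read the raw bytes and count the two end-of-line characters.
fid = fopen(fname, 'r');
bytes = fread(fid, Inf, 'uint8=>uint8');
fclose(fid);
delete(fname);

nCR = sum(bytes == 13);  % carriage return, \r
nLF = sum(bytes == 10);  % line feed, \n
fprintf('CR: %d, LF: %d\n', nCR, nLF);
```

Equal, non-zero counts suggest \r\n endings, only LF suggests Unix-style endings, and only CR suggests the older Mac convention.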

Matlab textscan formatspec delimiter error

When reading a large csv file Matlab doesn't recognize ||,|| as a proper delimiter as input argument for textscan. The data is as follows (simplified):
||X||,||Y||,||Z|| (header)
||1||,||2||,||4||
||4||,||4||,||3||
etc.
I use data = textscan(fileID,formatSpec,'Delimiter',','); to read in the data with some format spec '%f %f %f'.
My rubber band solution has been to use 010 editor to replace all '||' with '', making it a proper csv file for matlab, but due to the size of the document (6M lines with approx 35 fields) and the frequency of new documents this is hardly a great solution.
Does anyone know a proper way to import such a file?
You should be able to include it in the format specifier:
data = textscan(fid, '||%f||,||%f||,||%f||', 'headerlines', 1)
and then just leave out the delimiter.
Edit (Following on from comments)
If you are trying to read in strings, the trick is to get it to read in strings without the | character. This is done using %[^|], like this:
data = textscan(fid, '|| %[^|] ||,|| %[^|] ||,|| %[^|] ||', 'headerlines', 1)
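A different approach from the format-string one above, shown only as a sketch: read the whole file into memory, strip the '||' markers with strrep, and let textscan parse the cleaned text. It assumes the file fits in memory, which may not hold for very large files, and the sample file below is made up from the simplified data in the question.

```matlab
% Hypothetical sample mirroring the simplified data in the question.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '||X||,||Y||,||Z||\n||1||,||2||,||4||\n||4||,||4||,||3||\n');
fclose(fid);

txt = fileread(fname);         % whole file as one character vector
delete(fname);
txt = strrep(txt, '||', '');   % drop the field markers

% Skip the header line by cutting at the first line feed.
firstLF = find(txt == char(10), 1);
body = txt(firstLF + 1:end);

% textscan also accepts text directly in place of a file ID.
C = textscan(body, '%f %f %f', 'Delimiter', ',', 'CollectOutput', true);
M = C{1};                      % numeric matrix, one row per data line
```

With 'CollectOutput' set, the three numeric columns come back as a single matrix instead of three separate cells.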

How to avoid the repeated paragraphs of long txt files being ignored for importdata in matlab

I am trying to import all double from a txt file, which has this form
#25x1 string
#9999x2 double
.
.
.
#(repeat ten times)
However, when I try to use the Import Wizard, only the first
25x1 string
9999x2 double.
was successfully loaded; the other 9 were simply ignored.
How can I import all the data? (Does importdata have a maximum length or something?)
Thanks
It's nothing to do with maximum length; importdata is just not set up for the sort of data file you describe. From the help file:
For ASCII files and spreadsheets, importdata expects
to find numeric data in a rectangular form (that is, like a matrix).
Text headers can appear above or to the left of the numeric data,
as follows:
Column headers or file description text at the top of the file, above
the numeric data. Row headers to the left of the numeric data.
So what is happening is that the first section of your file, which does match the format importdata expects, is being read, and the rest ignored. Instead of importdata, you'll need to use textscan, in particular, this style:
C = textscan(fileID,formatSpec,N)
fileID is returned from fopen. formatSpec tells textscan what to expect, and N how many times to repeat it. As long as fileID remains open, repeated calls to textscan continue to read the file from wherever the last read action stopped, rather than going back to the start of the file. So we can do this:
fileID = fopen('myfile.txt');
repeats = 10;
C = cell(repeats, 2);  % preallocate: one row per repeated block
for n = 1:repeats
% read one string, 25 times
C{n,1} = textscan(fileID,'%s',25);
% read two floats, 9999 times
C{n,2} = textscan(fileID,'%f %f',9999);
end
fclose(fileID);
You can then extract your numerical data out of the cell array (if you need it in one block you may want to try using 'CollectOutput',1 as an option).
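A scaled-down sketch of the same loop with 'CollectOutput' enabled, assuming a made-up file with two repeats of three text lines followed by two rows of two numbers. The file layout and the sizes are placeholders for the 25 and 9999 in the question.

```matlab
% Hypothetical small file: 2 repeats of (3 text lines + a 2x2 numeric block).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
for k = 1:2
    fprintf(fid, 'alpha\nbeta\ngamma\n');
    vals = [1 2; 3 4] + 10*k;
    fprintf(fid, '%d %d\n', vals.');   % transpose so rows print in order
end
fclose(fid);

fid = fopen(fname, 'r');
blocks = cell(2, 2);
for k = 1:2
    % textscan returns a 1x1 cell here, so assign with () rather than {}.
    blocks(k,1) = textscan(fid, '%s', 3);
    blocks(k,2) = textscan(fid, '%f %f', 2, 'CollectOutput', true);
end
fclose(fid);
delete(fname);

firstNumeric  = blocks{1,2};   % 2x2 double from the first repeat
secondNumeric = blocks{2,2};   % 2x2 double from the second repeat
```

Assigning with parentheses stores the contents of the cell returned by textscan directly, which avoids one level of nesting compared with C{n,2} = textscan(...).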

Reading CSV files with MATLAB?

I am trying to read in a .csv file with MATLAB. Here is my code:
csvread('out2.csv')
This is what out2.csv looks like:
03/09/2013 23:55:12,129.32,129.33
03/09/2013 23:55:52,129.32,129.33
03/09/2013 23:56:02,129.32,129.33
On windows I am able to read this exact same file with the xlsread function with no problems. I am currently on a linux machine. When I first used xlsread to read the file I was told "File is not in recognized format" so I switched to using csvread. However, using csvread, I get the following error message:
Error using dlmread (line 139)
Mismatch between file and format string.
Trouble reading number from file (row 1u, field 2u) ==> /09/2013
23:55:12,129.32,129.33\n
Error in csvread (line 48)
m=dlmread(filename, ',', r, c)
I think the '/' in the date is causing the issue. On Windows, the 1st column is interpreted as a string. On Linux it seems to be interpreted as a number, so it tries to read the number and fails at the forward slash. This is what I think is going on at least. Any help would be really appreciated.
csvread can only read doubles, so it's choking on the date field. Use textscan.
fid = fopen('out2.csv');
out = textscan(fid,'%s%f%f','delimiter',',');
fclose(fid);
date = datevec(out{1});
col1 = out{2};
col2 = out{3};
Update (8/31/2017)
Since this was written back in 2013, MATLAB's textscan function has been updated to directly read dates and times. Now the code would look like this:
fid = fopen('out2.csv');
out = textscan(fid, '%{MM/dd/uu HH:mm:ss}D%f%f', 'delimiter', ',');
fclose(fid);
[date, col1, col2] = deal(out{:});
An alternative, as mentioned by @Victor Hugo below (and currently my personal go-to for this type of situation), would be to use readtable, which will accept the same formatting string as textscan but assemble the results directly into a table object:
dataTable = readtable('out2.csv', 'Format', '%{MM/dd/uu HH:mm:ss}D%f%f')
dataTable.Properties.VariableNames = {'date', 'col1', 'col2'};
dataTable =
3×3 table
date col1 col2
___________________ ______ ______
03/09/2013 23:55:12 129.32 129.33
03/09/2013 23:55:52 129.32 129.33
03/09/2013 23:56:02 129.32 129.33
Unfortunately, the documentation for csvread clearly states:
M = csvread(filename) reads a comma-separated value formatted file, filename. The file can only contain numeric values.
Since / is neither a comma, nor a numeric value, it produces an error.
You can use readtable instead, as it handles files that mix text and numeric columns.
https://www.mathworks.com/help/matlab/ref/readtable.html
Yes, xlsread requires Microsoft Excel to be installed unless it is run in 'basic' mode, and 'basic' mode only reads .xls, .xlsx and .xlsm files.
Another alternative is one of the user-written functions posted on MATLAB's File Exchange, which work on Linux and are more flexible at reading unformatted content.
One example:
https://www.mathworks.com/matlabcentral/fileexchange/75219-csv2cellfast-import-csv-files-on-machines-without-excel

Matlab : Read a file name in string format from a .csv file

I have a .csv file which contains, let's say, 50 rows.
At the beginning of each row I have a file name in the following format 001_02_03.bmp followed by values separated by commas. Something like this :
001_02_03.bmp,20,30,45,10,40,20
Can someone tell me how can I read the first column from the data?
I know how to obtain the data from the second column onward. I am using the csvread function like this X = csvread('filename.csv', 0, 1);. If I try to read the first column in the same manner it outputs an error, saying the csvread does not support string format.
Use textscan, ie:
fid1 = fopen(csvFileName);
X = textscan(fid1, '%s%f%f%f%f%f%f', 'Delimiter', ',');
fclose(fid1);
FirstCol = X{1, 1};
A little more detail? csvread only works with purely numeric data, so you can't use it to get in data with a .bmp, or underscores for that matter. Thus we use textscan. The funny looking format string input to textscan is just saying that the columns are, in order, of type string %s, then the next 6 columns are of type double %f%f%f%f%f%f (or you might choose to alter this to reflect an integer datatype - I personally rarely bother unless the quantity of data is huge or floating point precision is a problem).
Note, if you just wanted to get the first column and ignore the rest, you can replace the format string with %s %*[^\n]. A final point: if your csv file has a header line, you can skip it using the HeaderLines optional input to textscan.
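A short sketch of the first-column-only variant, assuming a made-up two-row file in the layout from the question. The %*[^\n] part skips the remainder of each line.

```matlab
% Hypothetical sample in the layout described in the question.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '001_02_03.bmp,20,30,45,10,40,20\n');
fprintf(fid, '002_02_03.bmp,21,31,46,11,41,21\n');
fclose(fid);

% Keep the first field as text and discard everything else on the line.
fid = fopen(fname, 'r');
X = textscan(fid, '%s %*[^\n]', 'Delimiter', ',');
fclose(fid);
delete(fname);

names = X{1};   % cell array with one file name per row
```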