Table imported into Matlab from CSV, variable name prefixed by "x___" - matlab

I have a bunch of automatically-generated CSV files with headers, which I'd like to import into Matlab 2016a as a table. I used code such as
T = readtable('d:\test.csv', 'readvariablenames', true);
However, even though the name of the CSV's first column is "runNr", the first column in the Matlab table gets named "x___runNr"
This clearly has something to do with the CSV files being in a slightly different format from the one Matlab expects. For instance, it could be that my CSVs have a Byte Order Mark at the beginning. Still, I am not sure what to do to fix this, since I cannot change the format of the CSVs. Readtable, on the other hand, gives me the output format I am most comfortable with.
Upon calling readtable, the following warning is issued:
"Warning: Variable names were modified to make them valid MATLAB identifiers. "
However, some of my CSVs (perhaps produced by a different version of the software that outputs them) are still read OK, and for those CSVs the same warning is displayed, thus the warning alone is not indicative of the problem.

I think I found the source of the problem:
Like you have suspected, the encoding of your CSV file is "UTF-8-BOM" (I saw it using Notepad++).
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF
MATLAB R2019a knows to ignore these first 3 bytes, but R2016a is "confused" by them and adds the x___ prefix to runNr.
A workaround is to create a temporary file without the first 3 bytes:
f = fopen('test.csv', 'r');
A = fread(f, '*uint8'); % Read the whole file as raw bytes.
fclose(f);
% Check for the UTF-8 BOM (EF BB BF) at the start of the file.
if numel(A) >= 3 && all(A(1:3) == hex2dec(['EF'; 'BB'; 'BF']))
    f = fopen('tmp.csv', 'w');
    fwrite(f, A(4:end)); % Skip the first 3 bytes (the BOM).
    fclose(f);
    T = readtable('tmp.csv', 'readvariablenames', true);
else
    T = readtable('test.csv', 'readvariablenames', true);
end
There might be more efficient solutions (like simply removing the x___).
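The simpler route mentioned above, stripping the prefix after the import, might look like the sketch below. The table is built by hand as a stand-in for what readtable returns in R2016a, and the second column name (value) is made up for illustration.

```matlab
% Stand-in for the table that readtable produces from a BOM-prefixed CSV.
% The column name 'value' is hypothetical.
T = table([1; 2], [0.5; 0.7], 'VariableNames', {'x___runNr', 'value'});

% Remove a leading 'x___' from any variable name; other names are untouched.
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '^x___', '');

disp(T.Properties.VariableNames)
```

This avoids the temporary file, but it assumes that no legitimate column name starts with x___.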

Related

MATLAB fwrite\fread issue: two variables are being concatenated

I am reading in a binary EDF file and I have to split it into multiple smaller EDF files at specific points and then adjust some of the values inside. Overall it works quite well, but when I read in the file it combines 2 character arrays with each other. Obviously everything afterwards gets corrupted as well. I am at a dead end and have no idea what I'm doing wrong.
The part of the code (writing) that has to contain the problem:
byt=fread(fid,8,'*char');
fwrite(tfid,byt,'*char');
fwrite(tfid,fread(fid,44));
%new number of records
s = records;
fwrite(tfid,s,'*char');
fseek(fid,8,0);
%test
fwrite(tfid,fread(fid,8,'*char'),'*char');
When I use the reader it combines the records (fwrite(tfid,s,'*char'))
with the value of the next variable. All variables before this are displayed correctly. The relevant code of the reader:
hdr.bytes = str2double(fread(fid,8,'*char')');
reserved = fread(fid,44);%#ok
hdr.records = str2double(fread(fid,8,'*char')');
if hdr.records == -1
beep
disp('There appears to be a problem with this file; it returns an out-of-spec value of -1 for ''numberOfRecords.''')
disp('Attempting to read the file with ''edfReadUntilDone'' instead....');
[hdr, record] = edfreadUntilDone(fname, varargin);
return
end
hdr.duration = str2double(fread(fid,8,'*char')');
The likely problem is that your character array s does not have 8 characters in it, but you expect there to be 8 when you read it from the file. Whatever the number of characters in the array is, that's how many values fwrite will write out to the file. Anything less than 8 characters and you'll end up reading part of the next piece of data when you read from the file.
One fix would be to pad s with blanks before writing it:
s = [blanks(8-numel(records)) records];
In addition, the syntax '*char' is only valid when using fread: the * indicates that the output class should be 'char' as well. It's unnecessary when using fwrite.
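To see the effect of a fixed-width field, here is a small round-trip sketch. It is not the asker's EDF code: the record count '42', the dummy next field 'NEXTFLD ' and the temporary file are assumptions for illustration, and sprintf('%-8s', ...) pads on the right, whereas the blanks approach above pads on the left. Both read back the same way with str2double.

```matlab
records = '42';                        % hypothetical record count as a char array
field = sprintf('%-8s', records);      % pad with spaces to exactly 8 characters

fname = tempname;                      % temporary file for the round trip
fid = fopen(fname, 'w');
fwrite(fid, field);                    % no '*char' needed when writing
fwrite(fid, 'NEXTFLD ');               % dummy 8-character field that follows
fclose(fid);

fid = fopen(fname, 'r');
n = str2double(fread(fid, 8, '*char')');   % reads the 8-character field only
nextField = fread(fid, 8, '*char')';       % next field starts where expected
fclose(fid);
delete(fname);
```

Because the first field occupies exactly 8 bytes, the second read no longer picks up part of it.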

reading data from csv files with `textscan` in MATLAB

[Edited:] I have a file data2007a.csv and I copied and pasted (using TextEdit in MacBook) the first consecutive few lines to a new file datatest1.csv for testing:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,0,Aruba,ANT,Netherlands Antilles,2007,Export,6,448.91
S3,ABW,0,Aruba,ATG,Antigua and Barbuda,2007,Export,6,0.312
S3,ABW,0,Aruba,CHN,China,2007,Export,6,24.715
S3,ABW,0,Aruba,COL,Colombia,2007,Export,6,95.885
S3,ABW,0,Aruba,DOM,Dominican Republic,2007,Export,6,11.432
I wanted to use textscan to read it into MATLAB with only columns 2,3,5 (starting from the second row) and I wrote the following code
clc,clear all
fid = fopen('datatest1.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
But I ended up with only the second row of columns 2,3 and 5:
I then kept the first row of data2007a.csv, selected several other rows, and saved them as datatest2.csv:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,1,Aruba,USA,United States,2007,Export,6,1.392
S3,ABW,1,Aruba,VEN,Venezuela,2007,Export,6,5633.157
S3,ABW,2,Aruba,ANT,Netherlands Antilles,2007,Export,6,310.734
S3,ABW,2,Aruba,USA,United States,2007,Export,6,342.42
S3,ABW,2,Aruba,VEN,Venezuela,2007,Export,6,63.722
S3,AGO,0,Angola,DEU,Germany,2007,Export,6,105.334
S3,AGO,0,Angola,ESP,Spain,2007,Export,6,8533.125
And I wrote:
clc,clear all
fid = fopen('datatest2.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
data{1}
It gives exactly what I wanted:
When I use the same code for my original data file data2007a.csv, it goes as in the first case.
What is going wrong and how can I fix it?
[Added:] If one replicates my experiments [1], one can find that both cases work and the problem does not exist! I really don't know what is going on.
[1] By "replicate" I mean copy-and-paste the data given above and save it as two new files, say, datatest4a.csv and datatest4b.csv. I used visdiff('datatest1.csv', 'datatest4a.csv') to compare the two files and it returned:
Given how you fixed it, I think this is an end-of-line character issue. This sometimes comes up when moving text files between Windows and Unix based systems, as they use different conventions.
When you add %*[^\n] to the end of a textscan format, as you have here, it means to skip everything up to the end of the line. But if textscan expects a specific end-of-line character and can't find one, it will skip everything up to the end of the file. This would explain why you get one row read correctly and then nothing else.
If you don't specify what the end of line character is, Matlab appears to default to... something... in this not very clear specification in the help:
The default end-of-line sequence is \n, \r, or \r\n, depending on the contents of your file.
One way to try and cure this without having to create a new file would be to add this 'EndOfLine', '\r\n' to your textscan call:
If you specify '\r\n', then textscan treats any of \r, \n, and the
combination of the two (\r\n) as end-of-line characters.
This will hopefully handle most standard(ish) EOL conventions. It is likely that copy-pasting and saving with a different bit of software than was originally used to create the file changed the end of line characters such that Matlab was able to recognise them.
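As a sketch, the call from the question with the extra option looks like this. The temporary file, its Windows-style \r\n line endings and the shortened six-column rows are assumptions made so that the example runs on its own; with your real file only the 'EndOfLine' pair is new.

```matlab
% Write a small test file with \r\n line endings (shortened copy of the data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName\r\n');
fprintf(fid, 'S3,ABW,0,Aruba,ANT,Netherlands Antilles\r\n');
fprintf(fid, 'S3,ABW,0,Aruba,CHN,China\r\n');
fclose(fid);

% Same format as in the question, plus an explicit end-of-line setting.
fid = fopen(fname);
data = textscan(fid, '%*s %s %d %*s %s %*[^\n]', ...
    'Delimiter', ',', ...
    'HeaderLines', 1, ...
    'EndOfLine', '\r\n');
fclose(fid);
delete(fname);
```

If this still returns a single row on the original file, it is worth inspecting the raw bytes (for example with fread) to see which line terminator the file really uses.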

How to avoid the repeated paragraphs of long txt files being ignored for importdata in matlab

I am trying to import all double from a txt file, which has this form
#25x1 string
#9999x2 double
.
.
.
#(repeat ten times)
However, when I am trying to use import Wizard, only the first
25x1 string
9999x2 double.
was successfully loaded, the other 9 were simply ignored
How may I import all the data? (Does importdata have a maximum length or something?)
Thanks
It's nothing to do with maximum length, importdata is just not set up for the sort of data file you describe. From the help file:
For ASCII files and spreadsheets, importdata expects
to find numeric data in a rectangular form (that is, like a matrix).
Text headers can appear above or to the left of the numeric data,
as follows:
Column headers or file description text at the top of the file, above
the numeric data. Row headers to the left of the numeric data.
So what is happening is that the first section of your file, which does match the format importdata expects, is being read, and the rest ignored. Instead of importdata, you'll need to use textscan, in particular, this style:
C = textscan(fileID,formatSpec,N)
fileID is returned from fopen. formatspec tells textscan what to expect, and N how many times to repeat it. As long as fileID remains open, repeated calls to textscan continue to read the file from wherever the last read action stopped - rather than going back to the start of the file. So we can do this:
fileID = fopen('myfile.txt');
repeats = 10;
C = cell(repeats, 2); % one row per repeated block
for n = 1:repeats
    % read one string, 25 times
    C{n,1} = textscan(fileID,'%s',25);
    % read two floats, 9999 times
    C{n,2} = textscan(fileID,'%f %f',9999);
end
fclose(fileID);
You can then extract your numerical data out of the cell array (if you need it in one block you may want to try using 'CollectOutput',1 as an option).
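A scaled-down sketch of that extraction step follows. The file contents are invented (2 header strings and 3 numeric rows per block, repeated twice) so that the example is self-contained; for the real file the counts would be 25, 9999 and 10.

```matlab
% Build a small file with the same layout: header strings, then two columns.
fname = tempname;
fid = fopen(fname, 'w');
for k = 1:2
    fprintf(fid, 'alpha\nbeta\n');
    fprintf(fid, '%f %f\n', [1 2 3; 4 5 6] + 10*k);
end
fclose(fid);

fid = fopen(fname);
repeats = 2;
blocks = cell(repeats, 1);
for n = 1:repeats
    hdr = textscan(fid, '%s', 2);                          % header strings, not kept
    nums = textscan(fid, '%f %f', 3, 'CollectOutput', 1);  % both columns in one matrix
    blocks{n} = nums{1};
end
fclose(fid);
delete(fname);

allData = vertcat(blocks{:});   % stack the numeric blocks into one matrix
```

With 'CollectOutput' set to 1, each numeric read comes back as a single N-by-2 matrix, so the blocks can be stacked with vertcat.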

Reading CSV files with MATLAB?

I am trying to read in a .csv file with MATLAB. Here is my code:
csvread('out2.csv')
This is what out2.csv looks like:
03/09/2013 23:55:12,129.32,129.33
03/09/2013 23:55:52,129.32,129.33
03/09/2013 23:56:02,129.32,129.33
On windows I am able to read this exact same file with the xlsread function with no problems. I am currently on a linux machine. When I first used xlsread to read the file I was told "File is not in recognized format" so I switched to using csvread. However, using csvread, I get the following error message:
Error using dlmread (line 139)
Mismatch between file and format string.
Trouble reading number from file (row 1u, field 2u) ==> /09/2013
23:55:12,129.32,129.33\n
Error in csvread (line 48)
m=dlmread(filename, ',', r, c)
I think the '/' in the date is causing the issue. On Windows, the 1st column is interpreted as a string. On Linux it seems to be interpreted as a number, so it tries to read the number and fails at the forward slash. This is what I think is going on at least. Any help would be really appreciated.
csvread can only read doubles, so it's choking on the date field. Use textscan.
fid = fopen('out2.csv');
out = textscan(fid,'%s%f%f','delimiter',',');
fclose(fid);
date = datevec(out{1});
col1 = out{2};
col2 = out{3};
Update (8/31/2017)
Since this was written back in 2013, MATLAB's textscan function has been updated to directly read dates and times. Now the code would look like this:
fid = fopen('out2.csv');
out = textscan(fid, '%{MM/dd/uu HH:mm:ss}D%f%f', 'delimiter', ',');
fclose(fid);
[date, col1, col2] = deal(out{:});
An alternative, as mentioned by @Victor Hugo below (and currently my personal go-to for this type of situation), would be to use readtable, which will accept the same formatting string as textscan but assemble the results directly into a table object:
dataTable = readtable('out2.csv', 'Format', '%{MM/dd/uu HH:mm:ss}D%f%f')
dataTable.Properties.VariableNames = {'date', 'col1', 'col2'};
dataTable =
3×3 table
date col1 col2
___________________ ______ ______
03/09/2013 23:55:12 129.32 129.33
03/09/2013 23:55:52 129.32 129.33
03/09/2013 23:56:02 129.32 129.33
Unfortunately, the documentation for csvread clearly states:
M = csvread(filename) reads a comma-separated value formatted file, filename. The file can only contain numeric values.
Since / is neither a comma, nor a numeric value, it produces an error.
You can use readtable instead, as it handles files that mix text and numeric columns.
https://www.mathworks.com/help/matlab/ref/readtable.html
Yeah xlsread requires Microsoft Excel to be installed, unless it is run in 'basic' mode and 'basic' mode only reads .xls .xlsx and .xlsm files.
Another alternative is one of the user-written functions posted on MATLAB's File Exchange, which work on Linux and are more flexible at reading non-formatted content.
One example:
https://www.mathworks.com/matlabcentral/fileexchange/75219-csv2cellfast-import-csv-files-on-machines-without-excel

Matlab : Read a file name in string format from a .csv file

I have a .csv file which contains, let's say, 50 rows.
At the beginning of each row I have a file name in the following format 001_02_03.bmp followed by values separated by commas. Something like this :
001_02_03.bmp,20,30,45,10,40,20
Can someone tell me how can I read the first column from the data?
I know how to obtain the data from the second column onward. I am using the csvread function like this X = csvread('filename.csv', 0, 1);. If I try to read the first column in the same manner it outputs an error, saying the csvread does not support string format.
Use textscan, ie:
fid1 = fopen(csvFileName);
X = textscan(fid1, '%s%f%f%f%f%f%f', 'Delimiter', ',');
fclose(fid1);
FirstCol = X{1, 1};
A little more detail? csvread only works with purely numeric data, so you can't use it to get in data with a .bmp, or underscores for that matter. Thus we use textscan. The funny looking format string input to textscan is just saying that the columns are, in order, of type string %s, then the next 6 columns are of type double %f%f%f%f%f%f (or you might choose to alter this to reflect an integer datatype - I personally rarely bother unless the quantity of data is huge or floating point precision is a problem).
Note, if you just wanted to get the first column and ignore the rest, you can replace the format string with %s %*[^\n]. A final point: if your csv file has a header line, you can skip it using the HeaderLines optional input to textscan.
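Putting the last two points together, a minimal sketch could look like this. The temporary file, its header line and the second file name are made up so that the example is runnable as-is.

```matlab
% Create a small CSV with a header line (contents are illustrative only).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'file,a,b,c\n');
fprintf(fid, '001_02_03.bmp,20,30,45\n');
fprintf(fid, '001_02_04.bmp,21,31,46\n');
fclose(fid);

% Read only the first column: %s takes the file name, %*[^\n] skips the rest
% of each line, and HeaderLines skips the header.
fid = fopen(fname);
X = textscan(fid, '%s %*[^\n]', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
delete(fname);

firstCol = X{1};   % cell array of file names
```

The result is a cell array of character vectors, one per data row.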