I have a CSV file with possibly missing data, and the data is both chars and numbers. What is the best way to deal with this?
Here is an example:
file.csv
name,age,gender
aaa,20,m
bbb,25,
ccc,,m
ddd,40,f
readMyCSV.m
fid = fopen('file.csv','rt');
C = textscan(fid, '%s%f%s', 'Delimiter',',', 'HeaderLines',1, 'EmptyValue',NaN);
fclose(fid);
[name,age,gender] = deal(C{:});
The data read:
>> [name num2cell(age) gender]
ans =
'aaa' [ 20] 'm'
'bbb' [ 25] ''
'ccc' [NaN] 'm'
'ddd' [ 40] 'f'
What @Amro has suggested is the most common way to read a CSV file with missing values.
In your case, since the data contains both characters and numbers, you should provide the proper format for each column.
So your function should look something like this:
C = textscan(fid, '%d32 %c %d8 %d8 %d32 %f32 %f %s ','HeaderLines', 1, 'Delimiter', ',');
For more data formats, look here:
http://www.mathworks.com/help/techdoc/ref/textscan.html
I have a data file that includes comma-delimited data that I am trying to read into Octave. Most of the data is fine, but some includes numbers between double quotes that use commas between the quotes. Here's a sample section of data:
.123,4.2,"4,123",700,12pie
.34,4.23,602,701,23dj
.4345,4.6,"3,623,234",700,134nfg
.951,68.5,45,699,4lkj
I've been using textscan to read the data (since there's a mix of numbers and strings), specifying comma delimiters, and that works most of the time, but occasionally the file contains these bigger integers in quotes scattered through that column. I was able to get around one of these quoted numbers earlier in the data file because I knew where it would be, but it wasn't pretty:
sclose = textscan(fid, '%n %n', 1, 'delimiter', ',');
junk = fgetl(fid, 1);
junk = textscan(fid, '%s', 1, 'delimiter', '"');
junk = fgetl(fid, 1);
sopen = textscan(fid, '%n %s', 1, 'delimiter', ',');
I don't care about the data in that column, but because it changes size and sometimes contains quoted values with extra commas that I want to ignore, I'm struggling with how to read/skip it. Any suggestions on how to handle it?
Here's my current (ugly) approach that reads the column as a string, then uses strfind to check for a " within the string. If it's present then it reads another comma-delimited string and repeats the check until the closing " is found and then resumes reading the data.
fid = fopen('sample.txt', 'r');
for k=1:4
expdata1(k, :) = textscan(fid, '%n %n %s', 1, 'delimiter', ','); #read first 3 data pts
qcheck = char(expdata1(k,3));
idx = strfind(qcheck, '"'); #look for "
dloc = ftell(fid);
for l=1:4
if isempty(idx) #if no " present, continue reading data
break
endif
dloc = ftell(fid); #save location so can return to next data point
expdata1(k, 3) = textscan(fid, '%s', 1, 'delimiter', ','); #if " present, read next comma segment and check for "
qcheck = char(expdata1(k,3));
idx = strfind(qcheck, '"');
endfor
fseek(fid, dloc);
expdata2(k, :) = textscan(fid, '%n %s', 1, 'delimiter', ',');
endfor
fclose(fid);
There's gotta be a better way...
I see this has a MATLAB tag on it; are you using MATLAB's textscan or Octave's?
If in MATLAB, I would suggest using either readmatrix or readtable.
Also note that the format specifier for a quoted string is %q. This should be applicable to both languages, even for textscan.
Putting your sample data in data.csv, the following is possible:
>> readtable("data.csv", 'Format','%f%f%q%d%s')
ans =
4×5 table
Var1 Var2 Var3 Var4 Var5
______ ____ _____________ ____ __________
0.123 4.2 {'4,123' } 700 {'12pie' }
0.34 4.23 {'602' } 701 {'23dj' }
0.4345 4.6 {'3,623,234'} 700 {'134nfg'}
0.951 68.5 {'45' } 699 {'4lkj' }
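If you prefer to stay with textscan, a minimal sketch (assuming the sample above is saved as data.csv) could look like this:
fid = fopen('data.csv', 'r');
% %q reads a possibly double-quoted field and strips the quotes, so the embedded
% commas in the third column are not treated as delimiters
C = textscan(fid, '%f %f %q %d %s', 'Delimiter', ',');
fclose(fid);
bigints = C{3};   % e.g. {'4,123'; '602'; '3,623,234'; '45'}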
I've been working with MATLAB for a few days and I'm having difficulty importing a CSV file into a matrix.
My problem is that my CSV file contains almost only strings and just some integer values, so csvread() doesn't work; csvread() only handles numeric values.
How can I store my strings in some kind of a 2-dimensional array to have free access to each element?
Here's a sample CSV for my needs:
04;abc;def;ghj;klm;;;;;
;;;;;Test;text;0xFF;;
;;;;;asdfhsdf;dsafdsag;0x0F0F;;
The main thing are the empty cells and the texts within the cells.
As you see, the structure may vary.
For the case when you know how many columns of data there will be in your CSV file, one simple call to textscan like Amro suggests will be your best solution.
However, if you don't know a priori how many columns are in your file, you can use a more general approach like I did in the following function. I first used the function fgetl to read each line of the file into a cell array. Then I used the function textscan to parse each line into separate strings using a predefined field delimiter and treating the integer fields as strings for now (they can be converted to numeric values later). Here is the resulting code, placed in a function read_mixed_csv:
function lineArray = read_mixed_csv(fileName, delimiter)
fid = fopen(fileName, 'r'); % Open the file
lineArray = cell(100, 1); % Preallocate a cell array (ideally slightly
% larger than is needed)
lineIndex = 1; % Index of cell to place the next line in
nextLine = fgetl(fid); % Read the first line from the file
while ~isequal(nextLine, -1) % Loop while not at the end of the file
lineArray{lineIndex} = nextLine; % Add the line to the cell array
lineIndex = lineIndex+1; % Increment the line index
nextLine = fgetl(fid); % Read the next line from the file
end
fclose(fid); % Close the file
lineArray = lineArray(1:lineIndex-1); % Remove empty cells, if needed
for iLine = 1:lineIndex-1 % Loop over lines
lineData = textscan(lineArray{iLine}, '%s', ... % Read strings
'Delimiter', delimiter);
lineData = lineData{1}; % Remove cell encapsulation
if strcmp(lineArray{iLine}(end), delimiter) % Account for when the line
lineData{end+1} = ''; % ends with a delimiter
end
lineArray(iLine, 1:numel(lineData)) = lineData; % Overwrite line data
end
end
Running this function on the sample file content from the question gives this result:
>> data = read_mixed_csv('myfile.csv', ';')
data =
Columns 1 through 7
'04' 'abc' 'def' 'ghj' 'klm' '' ''
'' '' '' '' '' 'Test' 'text'
'' '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
The result is a 3-by-10 cell array with one field per cell where missing fields are represented by the empty string ''. Now you can access each cell or a combination of cells to format them as you like. For example, if you wanted to change the fields in the first column from strings to integer values, you could use the function str2double as follows:
>> data(:, 1) = cellfun(@(s) {str2double(s)}, data(:, 1))
data =
Columns 1 through 7
[ 4] 'abc' 'def' 'ghj' 'klm' '' ''
[NaN] '' '' '' '' 'Test' 'text'
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag'
Columns 8 through 10
'' '' ''
'0xFF' '' ''
'0x0F0F' '' ''
Note that the empty fields result in NaN values.
Given the sample you posted, this simple code should do the job:
fid = fopen('file.csv','r');
C = textscan(fid, repmat('%s',1,10), 'delimiter',';', 'CollectOutput',true);
C = C{1};
fclose(fid);
Then you could format the columns according to their type. For example if the first column is all integers, we can format it as such:
C(:,1) = num2cell( str2double(C(:,1)) )
Similarly, if you wish to convert the 8th column from hex to decimals, you can use HEX2DEC:
C(:,8) = cellfun(@hex2dec, strrep(C(:,8),'0x',''), 'UniformOutput',false);
The resulting cell array looks as follows:
C =
[ 4] 'abc' 'def' 'ghj' 'klm' '' '' [] '' ''
[NaN] '' '' '' '' 'Test' 'text' [ 255] '' ''
[NaN] '' '' '' '' 'asdfhsdf' 'dsafdsag' [3855] '' ''
In R2013b or later you can use a table:
>> table = readtable('myfile.txt','Delimiter',';','ReadVariableNames',false)
table =
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
____ _____ _____ _____ _____ __________ __________ ________ ____ _____
4 'abc' 'def' 'ghj' 'klm' '' '' '' NaN NaN
NaN '' '' '' '' 'Test' 'text' '0xFF' NaN NaN
NaN '' '' '' '' 'asdfhsdf' 'dsafdsag' '0x0F0F' NaN NaN
Here is more info.
Use xlsread, it works just as well on .csv files as it does on .xls files. Specify that you want three outputs:
[num char raw] = xlsread('your_filename.csv')
and it will give you an array containing only the numeric data (num), an array containing only the character data (char) and an array that contains all data types in the same format as the .csv layout (raw).
Have you tried to use the "CSVIMPORT" function found in the file exchange? I haven't tried it myself, but it claims to handle all combinations of text and numbers.
http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport
Depending on the format of your file, importdata might work.
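For example, a minimal sketch, assuming the sample above is saved as myfile.csv (note that what importdata returns, a plain matrix, a struct with data/textdata fields, or a cell array of lines, depends on the file contents):
s = importdata('myfile.csv', ';');   % the second argument is the delimiter
% For mixed files importdata usually returns a struct:
%   s.data     - the numeric columns it could parse
%   s.textdata - the text fields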
You can store Strings in a cell array. Type "doc cell" for more information.
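For instance, a tiny sketch of holding the sample fields in a 2-D cell array with free access to each element:
C = {'04', 'abc', 'def'; '', 'Test', '0xFF'};   % strings and empty cells side by side
C{2,3}   % -> '0xFF'
C(1,:)   % -> first row as a 1-by-3 cell array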
I recommend looking at the dataset array.
The dataset array is a data type that ships with Statistics Toolbox.
It is specifically designed to store heterogeneous data in a single container.
The Statistics Toolbox demo page contains a couple of videos that show some of the dataset array features. The first is titled "An Introduction to Dataset Arrays". The second is titled "An Introduction to Joins".
http://www.mathworks.com/products/statistics/demos.html
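A minimal sketch, assuming Statistics Toolbox is installed and that the sample above is saved as myfile.csv (the 'File' parameter of the dataset constructor reads delimited text directly):
ds = dataset('File', 'myfile.csv', 'Delimiter', ';', 'ReadVarNames', false);
ds(1, :)   % dataset arrays support table-like indexing by observation and variable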
If your input file has a fixed number of columns separated by commas and you know which columns contain the strings, it might be best to use the function textscan().
Note that you can specify a format where you read up to a maximum number of characters in the string or until a delimiter (comma) is found.
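As a hedged illustration of both variants (hypothetical file name and format, not tied to a particular dataset):
fid = fopen('myfile.csv', 'r');
% '%5s' reads at most 5 characters into the first field;
% a plain '%s' together with 'Delimiter' reads until the next comma.
C = textscan(fid, '%5s %s %f', 'Delimiter', ',');
fclose(fid);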
% Assuming that the dataset is ";"-delimited and each line ends with ";"
fid = fopen('sampledata.csv');
tline = fgetl(fid);                     % read the first line
u = sprintf('%c', tline);
id = strfind(u, ';');                   % positions of the delimiters
n = length(id);                         % number of fields per line
data = cell(1, n);
for I = 1:n                             % split the first line into fields
    if I == 1
        data{1,I} = u(1:id(I)-1);
    else
        data{1,I} = u(id(I-1)+1:id(I)-1);
    end
end
ct = 1;
while ischar(tline)                     % loop until fgetl returns -1 at end of file
    ct = ct+1;
    tline = fgetl(fid);
    u = sprintf('%c', tline);
    id = strfind(u, ';');               % strfind replaces the deprecated findstr
    if ~isempty(id)
        for I = 1:n
            if I == 1
                data{ct,I} = u(1:id(I)-1);
            else
                data{ct,I} = u(id(I-1)+1:id(I)-1);
            end
        end
    end
end
fclose(fid);
I have written this:
dlmwrite(fName,IND,'-append',... %// Print the matrix
'delimiter','\n', 'newline','pc');
The output is this:
23
46
56
67
How should I modify the dlmwrite function to have an output like this:
23, 46, 56, 67;
Why are you using '\n' as a delimiter? You should be using ',' instead (which is default, by the way, so you don't have to modify the 'delimiter' attribute at all in this case).
If you want to use a modified delimiter and a semi-colon to terminate each line, it's a bit of a problem for dlmwrite, so use the more powerful fprintf instead:
fid = fopen(fName, 'a');
fprintf(fid, [repmat('%d, ', 1, size(IND, 2) - 1), '%d;\r\n'], IND.');
fclose(fid);
EDIT:
Your question is a bit unclear about the desired output, so here are two more options for you:
If you want to write your data as one long line, instead of size(IND, 2) pass numel(IND):
fprintf(fid, [repmat('%d, ', 1, numel(IND) - 1), '%d;\r\n'], IND.');
or use the following three-liner instead:
X = IND.';
fprintf(fid, '%d, ', X(1:end - 1));
fprintf(fid, '%d;\r\n', X(end));
If you want to serialize your matrix column-wise, don't transpose IND:
fprintf(fid, [repmat('%d, ', 1, size(IND, 2) - 1), '%d;\r\n'], IND);
Try using the arguments 'delimiter', ', ', 'newline', ';\r\n'
That works in Octave; it is not clear from the MATLAB documentation that the MATLAB version accepts values for 'newline' other than 'pc' and 'unix'.
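In Octave the resulting call would look roughly like this (a sketch of the suggestion above; as noted, MATLAB may not accept a custom 'newline' value):
dlmwrite(fName, IND, '-append', 'delimiter', ', ', 'newline', ';\r\n');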
I have an Excel file with data I would like to use.
Given two input values from columns B and C, I would like to get the corresponding name from column A.
Example: from these two values
var1 = 12.90050072
var2 = 55.95981118
I would get "ALIOTH".
Here is the data:
A B C
ALGOL 3.13614789 40.95564610
ALIOTH 12.90050072 55.95981118
ALKAID 13.79233003 49.31324779
I can load the CSV file, but cannot search through the data.
function [name] = getNameObject(ad,dec)
fileID = fopen('bdd.csv');
C = textscan(fileID, '%s %f %f','Delimiter',';');
fclose(fileID);
Please suggest some functions and sample code to do this.
As you will need to compare floating point values, direct numeric comparisons don't work a lot of the time. Here I will make use of string comparisons to achieve what you need:
clear;
fid = fopen('test.csv');
C = textscan(fid, '%s %s %s', 'Delimiter', ';');
fclose(fid);
val1 = input('Enter the first input: ', 's');
val2 = input('Enter the second input: ', 's');
if(find(ismember(C{2},val1)) == find(ismember(C{3},val2)))
output = C{1}{find(ismember(C{2},val1))}
else
disp('No match found!');
end
Now the result would be something like:
>> test
Enter the first input: 1.03
Enter the second input: 4.12
No match found!
>> test
Enter the first input: 12.90050072
Enter the second input: 55.95981118
output =
ALIOTH
Here I'm assuming, as per what I could deduce from your code, that the delimiter was a semi-colon. As such, my input data was:
A;B;C
ALGOL;3.13614789;40.95564610
ALIOTH;12.90050072;55.95981118
ALKAID;13.79233003;49.31324779
I use importdata to deal with CSV files.
aa.csv:
A, B, C
ALGOL, 3.13614789, 40.95564610
ALIOTH, 12.90050072, 55.95981118
ALKAID, 13.79233003, 49.31324779
importdata('aa.csv').data:
3.1361 40.9556
12.9005 55.9598
13.7923 49.3132
importdata('aa.csv').textdata:
'A' ' B' ' C'
'ALGOL' '' ''
'ALIOTH' '' ''
'ALKAID' '' ''
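If it helps, a hedged sketch of the actual lookup using those two fields (the tolerance and the +1 header offset are assumptions about this particular file):
s = importdata('aa.csv');
% compare with a tolerance because the inputs are floating-point values
row = find(abs(s.data(:,1) - var1) < 1e-8 & abs(s.data(:,2) - var2) < 1e-8, 1);
name = s.textdata{row + 1, 1}   % +1 skips the header row in textdata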
I have a large tab-delimited file (10000 rows, 15000 columns) and would like to import it into MATLAB.
I've tried to import it using the textscan function in the following way:
function [C_text, C_data] = ReadDataFile(filename, header, attributesCount, ...
    delimiter, attributeFormats, attributeFormatCount)
AttributeTypes = SetAttributeTypeMatrix(attributeFormats, attributeFormatCount);
fid = fopen(filename);
if(header == 1)
%read column headers
C_text = textscan(fid, '%s', attributesCount, 'delimiter', delimiter);
C_data = textscan(fid, AttributeTypes{1, 1}, 'headerlines', 1);
else
C_text = '';
C_data = textscan(fid, AttributeTypes{1, 1});
end
fclose(fid);
AttributeTypes{1, 1} is a string which describes the variable types for each column (in this case there are 14740 float and 260 string type variables, so the value of AttributeTypes{1, 1} is '%f%f......%f%s%s...%s', where %f is repeated 14740 times and %s 260 times).
When I try to execute
>> [header, data] = ReadDataFile('data/orange_large_train.data.chunk1', 1, 15000, '\t', types, size);
header array seems to be correct (column names have been read correctly).
data is a 1 x 15000 array (only the first row has been imported instead of all 10000) and I don't know what is causing this behavior.
I guess the problem is caused in this line:
C_data = textscan(fid, AttributeTypes{1, 1});
but I don't know what could be wrong, because there is a similar example described in the help reference.
I would be very thankful if anyone could suggest a fix for the issue: how to read all 10000 rows.
I believe all your data are there. If you look inside data, every cell there should contain a whole column (10000x1). You can extract the i-th cell as an array with data{i}.
You will probably want to separate the double and string data. I don't know what attributeFormats is; you can probably use that array, but you can also use AttributeTypes{1, 1}.
isdouble = strfind(AttributeTypes{1, 1}(2:2:end),'f');
data_double = cell2mat(data(isdouble));
To combine string data into one cell array of strings you can do:
isstring = strfind(AttributeTypes{1, 1}(2:2:end),'s');
data_string = horzcat(data{isstring});