textscan introduces additional zeros in output array - matlab

I have a .txt file like this:
ord,cat,1
ord,cat,1
ord,cat,3
ord,cat,1
ord,cat,4
I know the number of entries for each row (comma separated) but not the number of rows.
I need to import the number at the end of each row (the third field) into an array.
I wrote this:
fid=fopen(filename)
A=textscan(fid,'%s%s%d','Delimiter',',')
But I get this:
A = {17x1 cell} [16x1 int32]
where the number of cells is clearly wrong.
When I try to read
A{3}
i get
ans =
0
0
0
0
0
1
0
1
0
3
0
1
0
4
I'm really only interested in the integer array, but it may also be useful to show you:
A{1}
ans =
'{\rtf1\ansi\ansicpg1252\cocoartf1187\cocoasubrtf400'
'{\fonttbl\f0\fswiss\fcharset0 Helvetica;}'
'{\colortbl;\red255\green255\blue255;}'
[1x75 char]
[1x102 char]
'\f0\fs24 \cf0 ord'
'\'
'ord'
'\'
'ord'
'\'
'ord'
'\'
'ord'
'}'
A{2}
ans =
''
''
''
''
''
'cat'
''
'cat'
''
'cat'
''
'cat'
''
'cat'
OK, I think there was a formatting problem of some kind in the input file (the \rtf header visible in A{1} suggests it had been saved as RTF rather than plain text).
I deleted it, created a new .txt file, and the code above works fine.

You're not giving the right format command to textscan.
A=textscan(fid,'%s%d','Delimiter',',')
'%s%d' here means "read one string, then one integer". So it will sit there reading string-integer-string-integer (or trying to), and the zeros appear wherever the %d conversion lands on a field that is empty or not numeric.
Since you have three entries per line, try instead:
A=textscan(fid,'%s%s%d','Delimiter',',')
Your numbers should be in A{3}.
If you don't need the first two columns, you can also skip over those fields:
A=textscan(fid,'%*s%*s%d','Delimiter',',')
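Putting those pieces together for the sample file at the top, a minimal sketch (the file name data.txt is a placeholder):
fid = fopen('data.txt');
A = textscan(fid, '%*s%*s%d', 'Delimiter', ',');   % skip the two string fields, keep the integer
fclose(fid);
nums = A{1}                                        % with the skipped fields, the integers land in the first (and only) cell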


Matlab - Read An Unknown CSV [duplicate]

I've been working with MATLAB for a few days and I'm having difficulty importing a CSV file into a matrix.
My problem is that my CSV file contains almost only strings and only some integer values, so csvread() doesn't work: csvread() can only handle numeric values.
How can I store my strings in some kind of a 2-dimensional array to have free access to each element?
Here's a sample CSV for my needs:
04;abc;def;ghj;klm;;;;;
;;;;;Test;text;0xFF;;
;;;;;asdfhsdf;dsafdsag;0x0F0F;;
The main thing are the empty cells and the texts within the cells.
As you see, the structure may vary.
For the case when you know how many columns of data there will be in your CSV file, one simple call to textscan like Amro suggests will be your best solution.
However, if you don't know a priori how many columns are in your file, you can use a more general approach like I did in the following function. I first used the function fgetl to read each line of the file into a cell array. Then I used the function textscan to parse each line into separate strings using a predefined field delimiter and treating the integer fields as strings for now (they can be converted to numeric values later). Here is the resulting code, placed in a function read_mixed_csv:
function lineArray = read_mixed_csv(fileName, delimiter)
  fid = fopen(fileName, 'r');          % Open the file
  lineArray = cell(100, 1);            % Preallocate a cell array (ideally slightly
                                       %   larger than is needed)
  lineIndex = 1;                       % Index of cell to place the next line in
  nextLine = fgetl(fid);               % Read the first line from the file
  while ~isequal(nextLine, -1)         % Loop while not at the end of the file
    lineArray{lineIndex} = nextLine;   % Add the line to the cell array
    lineIndex = lineIndex+1;           % Increment the line index
    nextLine = fgetl(fid);             % Read the next line from the file
  end
  fclose(fid);                         % Close the file
  lineArray = lineArray(1:lineIndex-1);              % Remove empty cells, if needed
  for iLine = 1:lineIndex-1                          % Loop over lines
    lineData = textscan(lineArray{iLine}, '%s', ...  % Read strings
                        'Delimiter', delimiter);
    lineData = lineData{1};                          % Remove cell encapsulation
    if strcmp(lineArray{iLine}(end), delimiter)      % Account for when the line
      lineData{end+1} = '';                          %   ends with a delimiter
    end
    lineArray(iLine, 1:numel(lineData)) = lineData;  % Overwrite line data
  end
end
Running this function on the sample file content from the question gives this result:
>> data = read_mixed_csv('myfile.csv', ';')
data =
  Columns 1 through 7
    '04'    'abc'    'def'    'ghj'    'klm'    ''            ''
    ''      ''       ''       ''       ''       'Test'        'text'
    ''      ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'
  Columns 8 through 10
    ''          ''    ''
    '0xFF'      ''    ''
    '0x0F0F'    ''    ''
The result is a 3-by-10 cell array with one field per cell where missing fields are represented by the empty string ''. Now you can access each cell or a combination of cells to format them as you like. For example, if you wanted to change the fields in the first column from strings to integer values, you could use the function str2double as follows:
>> data(:, 1) = cellfun(@(s) {str2double(s)}, data(:, 1))
data =
  Columns 1 through 7
    [  4]    'abc'    'def'    'ghj'    'klm'    ''            ''
    [NaN]    ''       ''       ''       ''       'Test'        'text'
    [NaN]    ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'
  Columns 8 through 10
    ''          ''    ''
    '0xFF'      ''    ''
    '0x0F0F'    ''    ''
Note that the empty fields result in NaN values.
Given the sample you posted, this simple code should do the job:
fid = fopen('file.csv','r');
C = textscan(fid, repmat('%s',1,10), 'delimiter',';', 'CollectOutput',true);
C = C{1};
fclose(fid);
Then you could format the columns according to their type. For example if the first column is all integers, we can format it as such:
C(:,1) = num2cell( str2double(C(:,1)) )
Similarly, if you wish to convert the 8th column from hex to decimals, you can use HEX2DEC:
C(:,8) = cellfun(@hex2dec, strrep(C(:,8),'0x',''), 'UniformOutput',false);
The resulting cell array looks as follows:
C =
    [  4]    'abc'    'def'    'ghj'    'klm'    ''            ''            []        ''    ''
    [NaN]    ''       ''       ''       ''       'Test'        'text'        [ 255]    ''    ''
    [NaN]    ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'    [3855]    ''    ''
In R2013b or later you can use a table:
>> table = readtable('myfile.txt','Delimiter',';','ReadVariableNames',false)
table =
    Var1    Var2     Var3     Var4     Var5     Var6          Var7          Var8        Var9    Var10
    ____    _____    _____    _____    _____    __________    __________    ________    ____    _____
       4    'abc'    'def'    'ghj'    'klm'    ''            ''            ''          NaN     NaN
     NaN    ''       ''       ''       ''       'Test'        'text'        '0xFF'      NaN     NaN
     NaN    ''       ''       ''       ''       'asdfhsdf'    'dsafdsag'    '0x0F0F'    NaN     NaN
Here is more info.
Use xlsread; it works just as well on .csv files as it does on .xls files. Specify that you want three outputs:
[num char raw] = xlsread('your_filename.csv')
and it will give you an array containing only the numeric data (num), an array containing only the character data (char) and an array that contains all data types in the same format as the .csv layout (raw).
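A quick sketch of inspecting the three outputs (the exact sizes depend on your file; the variable names match the line above):
[num, char, raw] = xlsread('your_filename.csv');
whos num char raw   % num: numeric matrix, char: cell array of text, raw: cell array mirroring the full sheet layout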
Have you tried to use the "CSVIMPORT" function found in the file exchange? I haven't tried it myself, but it claims to handle all combinations of text and numbers.
http://www.mathworks.com/matlabcentral/fileexchange/23573-csvimport
Depending on the format of your file, importdata might work.
You can store Strings in a cell array. Type "doc cell" for more information.
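For instance, a tiny sketch using values from the sample data above:
c = {'04', 'abc', ''; '', 'Test', '0xFF'};   % 2-by-3 cell array of strings, including empty fields
c{2,2}                                       % ans = 'Test' -- free access to each element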
I recommend looking at the dataset array.
The dataset array is a data type that ships with Statistics Toolbox.
It is specifically designed to store heterogeneous data in a single container.
The Statistics Toolbox demo page contains a couple of videos that show some of the dataset array features. The first is titled "An Introduction to Dataset Arrays". The second is titled "An Introduction to Joins".
http://www.mathworks.com/products/statistics/demos.html
If your input file has a fixed number of columns separated by commas and you know which columns contain the strings, it might be best to use the function
textscan()
Note that you can specify a format where you read up to a maximum number of characters in the string or until a delimiter (comma) is found.
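As an illustration of that note, a hedged sketch assuming three comma-separated columns (the field width of 10 and the file name are made up for the example):
fid = fopen('data.csv');
C = textscan(fid, '%10s%10s%d', 'Delimiter', ',');   % each string field: at most 10 characters, or up to the next comma
fclose(fid);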
% Assuming that the dataset is ";"-delimited and each line ends with ";"
fid = fopen('sampledata.csv');
tline = fgetl(fid);
u = sprintf('%c', tline);  c = length(u);
id = findstr(u, ';');      n = length(id);
data = cell(1, n);
for I = 1:n
    if I == 1
        data{1,I} = u(1:id(I)-1);
    else
        data{1,I} = u(id(I-1)+1:id(I)-1);
    end
end
ct = 1;
while ischar(tline)
    ct = ct+1;
    tline = fgetl(fid);
    u = sprintf('%c', tline);
    id = findstr(u, ';');
    if ~isempty(id)
        for I = 1:n
            if I == 1
                data{ct,I} = u(1:id(I)-1);
            else
                data{ct,I} = u(id(I-1)+1:id(I)-1);
            end
        end
    end
end
fclose(fid);

Convert the contents of columns containing numeric text to numbers

I have a csv file that consists of text and numbers, but some columns are corrupted, as seen in the image below ("<<"K.O). When I open the csv file via MATLAB (without importing), it converts them to numbers and treats undefined values such as "<<"K.O as NaN, which is what I want. But when I read the file via the script I wrote:
opts = detectImportOptions(filedir);
table = readtable(filedir,opts);
it reads them as char arrays. Since I have many different csv files (the columns differ between them), I want to do this automatically rather than using textscan (since textscan needs a format specification and the format is different for each csv file). Is there any way to convert the contents of columns containing numeric text to numbers automatically?
As far as I can understand from your comments, this is what you are actually looking for:
for i = 1:numel(files)
    file = fullfile(folder, files(i).name);
    opts = detectImportOptions(file);
    idx = strcmp(opts.VariableNames, 'Grade');
    if any(idx)
        opts.VariableTypes(idx) = {'double'};
    end
    tabs(i) = readtable(file, opts);
end
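If your release also has the import-options helper setvartype (it ships with the same import-options machinery as detectImportOptions), the same override can be written a bit more directly:
opts = setvartype(opts, 'Grade', 'double');   % same effect as assigning opts.VariableTypes for that column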
Assuming you have your data stored in a table, you can attempt to convert each column of character arrays to numeric values using str2double. Any values that don't convert to a numeric value (empty entries, words, non-numeric strings, etc.) will be converted to NaN.
Since you want to do the conversions automatically, we'll have to make one key assumption: any column that converts to all NaN values should remain unchanged. In such a case, the data was likely either all non-convertable character arrays, or already numeric. Given that assumption, this generic conversion could be applied to any table T:
for varName = T.Properties.VariableNames
    numData = str2double(T.(varName{1}));
    if ~all(isnan(numData))
        T.(varName{1}) = numData;
    end
end
As a test, the following sample data:
T = table((1:5).', {'Y'; 'N'; 'Y'; 'Y'; 'N'}, {'pi'; ''; '1.4e5'; '1'; 'A'});
T =
    Var1    Var2    Var3
    ____    ____    _______
    1       'Y'     'pi'
    2       'N'     ''
    3       'Y'     '1.4e5'
    4       'Y'     '1'
    5       'N'     'A'
Will be converted to the following by the above code:
T =
    Var1    Var2     Var3
    ____    ____    ______
    1       'Y'        NaN
    2       'N'        NaN
    3       'Y'     140000
    4       'Y'          1
    5       'N'        NaN

Strange behaviour in size(strfind(n,',')) for n = 44

For some reason in
size(strfind(n,','))
the number 44 is special and produces a "comma found" result:
value={55}
numCommas = size(strfind(value{1},','),2)
ans= 0 ...(GOOD)
value={44}
numCommas = size(strfind(value{1},','),2)
ans= 1 ...(BAD) - Why is it doing this?
value={'44,44,44'}
numCommas = size(strfind(value{1},','),2)
ans= 2 ...(GREAT)
I need to find the number of commas in a cell element, where the element can either be an integer or a string.
To elaborate on my comment. The ASCII code for a comma, (,), is 44. Effectively what you are doing in your code is
size(strfind(44,','),2)
or
size(strfind(char(44),','),2)
where 44 is not a string but a numeric value, which is then converted to a character and results in a comma (,), as we can see when we use char:
>> char(44)
ans =
,
You can fix your code by changing
value={44}
to
value={'44'}
so then you will be performing strfind on a string instead of a numeric value.
>> size(strfind('44', ','), 2)
ans =
0
which provides the correct answer.
Alternatively you could use num2str
>> size(strfind(num2str(value{1}), ','), 2)
ans =
0
You can avoid this by simply doing value{1} = '44'. Or if that's not an alternative, use num2str like this:
value={44};
numCommas = size(strfind(num2str(value{1}),','),2)
numCommas =
0
This will also work for string inputs:
value={'44,44,44'};
numCommas = size(strfind(num2str(value{1}),','),2)
numCommas =
2
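The same num2str trick also scales to a whole cell array whose elements may be numbers or strings (countCommas is just an illustrative helper name, not anything built in):
countCommas = @(v) numel(strfind(num2str(v), ','));
cellfun(countCommas, {55, 44, '44,44,44'})   % returns 0  0  2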
Why do you get "wrong" results?
It's because 44 is the ASCII code for comma ,.
You can check this quite simply by casting the value to char.
char(44)
ans =
,
You are checking for commas in a string. Since the input to strfind is a number, it is automatically cast to char. In the last example, you are inserting a "real" string, so it finds the two commas in there.
Try this one:
value={'44'}
numCommas = size(strfind(value{1},','),2)
instead of:
value={44}
numCommas = size(strfind(value{1},','),2)
It should work, since it's a char now.

Replacing letters with numbers in a MATLAB array

I am trying to write a function to mark the results of a test. The answers given by participants are stored in an nx1 cell array. However, these are stored as letters. I am looking for a way to convert them (a-d) into numbers (1-4), i.e. a=1, b=2, so they can be compared with the correct answers using logical operations.
What I have so far is:
[num,txt,raw]=xlsread('FolkPhysicsMERGE.xlsx', 'X3:X142');
FolkPhysParAns=txt;
I seem to be able to find how to convert from numbers into letters but not the other way around. I feel like there should be a relatively easy way to do this, any ideas?
If you have a cell array of letters:
>> data = {'a','b','c','A'};
you only need to:
Convert to lower-case with lower, to treat both cases equally;
Convert to a character array with cell2mat;
Subtract (the ASCII code of) 'a' and add 1.
Code:
>> result = cell2mat(lower(data))-'a'+1
result =
1 2 3 1
More generally, if the possible answers are not consecutive letters, or even not single letters, use ismember:
>> possibleValues = {'s', 'm', 'l', 'xl', 'xxl'};
>> data = {'s', 'm', 'xl', 'l', 'm', 'l', 'aaa'};
>> [~, result] = ismember(data, possibleValues)
result =
1 2 4 3 2 3 0
Thought I might as well write an answer...
You can use strrep to replace 'a' with '1' (note that it is a string), do the same for all the letters you need, and then convert the resulting strings ('1' to '26', etc.) to the numeric values 1 to 26.
Let's say:
t = {'a','b','c'}               % cell array of strings
t = strrep(t,'a','1')           % replace all 'a' with '1'
t = strrep(t,'b','2')           % replace all 'b' with '2'
t = strrep(t,'c','3')           % replace all 'c' with '3'
% Or in 1 line:
t = strrep(t,{'a','b','c'},{'1','2','3'})
>> t =
    '1'    '2'    '3'
output = cellfun(@str2num,t,'un',0)   % keeps the cell structure
>> output =
    [1]    [2]    [3]
alternatively:
output = str2num(cell2mat(t'))        % uses the matrix structure instead; NOTE the transpose ', it is crucial here
>> output =
     1
     2
     3

How to detect the two tabs in a string in matlab

I am reading data from a tab-separated file:
str1 = '1 3'
str2 = '4 5 6'
In str1 the second field is empty. I am reading the file line by line in MATLAB and then, using strsplit, I extract the values from each line and later build arrays; each column in the text corresponds to one array.
strsplit(str1, '\t')
yields ==> '1 3'
strsplit(str2, '\t')
yields ==> '4 5 6'
Somehow I lose the information that the second field in the first string is empty. How can I keep this information?
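The information is lost because strsplit collapses consecutive delimiters by default. A minimal sketch, assuming a MATLAB version whose strsplit supports the 'CollapseDelimiters' option:
str1 = sprintf('1\t\t3');                            % '1<tab><tab>3'
strsplit(str1, '\t')                                 % default: {'1','3'} -- the empty field disappears
strsplit(str1, '\t', 'CollapseDelimiters', false)    % {'1','','3'} -- the empty field is preserved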
Try using a regular expression:
str1 = '1 3'
numel(regexp(str1, '\t')) % look for the number of elements of the regular expression that looks for tabs '\t'
will return 2
For your problem you could do the following:
tmp = regexp(str1, '(\d*)\t(\d*)\t(\d*)', 'tokens')
tmp{1}
ans =
    '1'    ''    '3'
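If you then want numbers rather than strings, str2double maps the empty token to NaN (a small follow-up to the snippet above):
vals = str2double(tmp{1})   % vals = [1 NaN 3]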
Matlab has built-in support to read tab-separated files:
A = importdata('file.txt', '\t')
If your file looks like this:
1\t2\t3
4\t\t5
importdata yields:
A =
     1     2     3
     4   NaN     5