cell2table removes values from first column if string is a single character - matlab

Suppose you have the following data:
A = [1,2,3;4,5,6];
headers = {'force', 'mass', 'acceleration'};
units = {'N','Kg','m/s^2'};
Let's say I want to convert it to a table, where headers will be the 'VariableNames':
table_of_data = cell2table([units; num2cell(A)]);
table_of_data.Properties.VariableNames = headers
table_of_data =
    force    mass    acceleration
    _____    ____    ____________
    N        'Kg'    'm/s^2'
             [2]     [3]
             [5]     [6]
Note that the values from the first column of A (1 and 4) are missing. This is because MATLAB treats the single character 'N' differently from 'Kg' and 'm/s^2'. If I append a space, making it 'N ', I get:
table_of_data =
    force    mass    acceleration
    _____    ____    ____________
    'N '     'Kg'    'm/s^2'
    [1]      [2]     [3]
    [4]      [5]     [6]
How can I get a proper table, with all elements displayed without inserting a space 'N '?
It's no problem to use a single character in units if I add more rows to the cell array, such as [headers; units; num2cell(A)], so the following works:
table_of_data = cell2table([headers; units; num2cell(A)]);
table_of_data(1,:) = [];
table_of_data.Properties.VariableNames = headers
table_of_data =
    force    mass    acceleration
    _____    ____    ____________
    'N'      'Kg'    'm/s^2'
    [1]      [2]     [3]
    [4]      [5]     [6]
How can I solve this without turning to cumbersome workarounds?

This likely has to do with table's internal representation of the data. It appears that cell2table tries to vertically concatenate the data in each column; if the concatenation succeeds, the column is stored as an array, otherwise it is stored as a cell array.
In the case of the single character 'N' and the numbers 1 and 4, the concatenation succeeds without error, but it converts all of them to chars, so 1 and 4 become the non-printable characters char(1) and char(4):
vertcat('N', 1, 4)
However, once you add the space, the concatenation fails because 'N ' is two characters wide:
vertcat('N ', 1, 4)
So the column stays a cell array and is displayed as one.
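The two concatenations can be compared directly. This is only a sketch of the behavior described above; the variable names col1 and failed are made up for the illustration, and it does not show what cell2table does internally.

```matlab
% Single character: concatenation succeeds and every element becomes a char
col1 = vertcat('N', 1, 4);
disp(class(col1))       % char
disp(double(col1).')    % character codes: 78 1 4

% Two characters: 'N ' is 1x2 while 1 and 4 are 1x1, so vertcat throws
try
    vertcat('N ', 1, 4);
    failed = false;
catch
    failed = true;
end
disp(failed)
```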
You have a few options:
Use table.Properties.VariableUnits to store the units rather than trying to incorporate the units into your table.
table_of_data.Properties.VariableUnits = units;
Display the units in the column headers
headers = {'force_N', 'mass_kg', 'acceleration_m_s2'};
Create a double-nested cell array to store all of the units, which explicitly causes it to be stored as a cell array internally.
table_of_data = cell2table([num2cell(units); num2cell(A)])
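As a rough sketch of the first option, the numeric data can stay numeric while the units live in the table metadata. Building the table with array2table is an assumption made here for brevity (the question uses cell2table), and T is a made-up variable name.

```matlab
A = [1,2,3;4,5,6];
headers = {'force', 'mass', 'acceleration'};
units = {'N','Kg','m/s^2'};

% Keep the columns numeric and attach the units as metadata
T = array2table(A, 'VariableNames', headers);
T.Properties.VariableUnits = units;

% summary lists each variable together with its units
summary(T)
```

Because the columns remain doubles, expressions such as T.force .* 2 keep working without any cell unpacking.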

Related

Convert the contents of columns containing numeric text to numbers

I have a CSV file whose columns contain text or numbers, but some columns contain corrupted entries (for example "<<"K.O). When I open the CSV file interactively in MATLAB (without running an import script), it converts those columns to numbers and turns undefined values such as "<<"K.O into NaN, which is what I want. But when I read the file with the script I wrote:
opts = detectImportOptions(filedir);
table = readtable(filedir,opts);
It reads them as char arrays. Since I have many different CSV files with different columns, I want to do this automatically rather than using textscan (which requires a format specification, and the format is different for each CSV file). Is there any way to convert the contents of columns containing numeric text to numbers automatically?
As far as I can understand from your comments, this is what you are actually looking for:
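tabs = cell(numel(files),1); % one table per file, since the columns may differ
for i = 1:numel(files)
    file = fullfile(folder,files(i).name);
    opts = detectImportOptions(file);
    % force the 'Grade' column, if present, to be read as numeric
    idx = strcmp(opts.VariableNames,'Grade');
    if (any(idx))
        opts.VariableTypes(idx) = {'double'};
    end
    tabs{i} = readtable(file,opts);
end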
Assuming you have your data stored in a table, you can attempt to convert each column of character arrays to numeric values using str2double. Any values that don't convert to a numeric value (empty entries, words, non-numeric strings, etc.) will be converted to NaN.
Since you want to do the conversions automatically, we'll have to make one key assumption: any column that converts to all NaN values should remain unchanged. In such a case, the data was likely either all non-convertible character arrays, or already numeric. Given that assumption, this generic conversion could be applied to any table T:
for varName = T.Properties.VariableNames
numData = str2double(T.(varName{1}));
if ~all(isnan(numData))
T.(varName{1}) = numData;
end
end
As a test, the following sample data:
T = table((1:5).', {'Y'; 'N'; 'Y'; 'Y'; 'N'}, {'pi'; ''; '1.4e5'; '1'; 'A'});
T =
Var1 Var2 Var3
____ ____ _______
1 'Y' 'pi'
2 'N' ''
3 'Y' '1.4e5'
4 'Y' '1'
5 'N' 'A'
Will be converted to the following by the above code:
T =
Var1 Var2 Var3
____ ____ ______
1 'Y' NaN
2 'N' NaN
3 'Y' 140000
4 'Y' 1
5 'N' NaN
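The key assumption rests on how str2double treats each kind of input, which can be checked in isolation. This is a minimal sketch; the variable names are made up for the illustration.

```matlab
% Text that looks numeric converts; everything else becomes NaN
fromText = str2double({'pi'; ''; '1.4e5'; '1'; 'A'});

% Non-text input (an already numeric column) yields NaN as well,
% which is why all-NaN results are left untouched by the loop
fromNumeric = str2double((1:5).');

disp(fromText.')
disp(all(isnan(fromNumeric(:))))
```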

MATLAB find column number of text cell array

I have a cell data type matrix containing a header and a large number of rows.
sample data:
set press dp
32.7045 17.805965 123.75047
32.690094 17.80584 123.74992
32.6232 17.815094 123.790115
I am trying to find the index of a specific column using the strcmp command to search through all the data.
dpCol = strcmp([data{:}], 'dp')
This always returns
dpCol =
0
Am I using the data cell type wrong or something? Thank you!
Try using cell notation to yield just the 1st row, EG:
data(1,:) = {'set','press','dp'}
instead of unpacking* the entire cell array since strcmp can operate on cell arrays.
>>> data = {'set' 'press' 'dp'
32.7045 17.805965 123.75047
32.690094 17.80584 123.74992
32.6232 17.815094 123.790115}
data =
'set' 'press' 'dp'
[32.7045] [17.8060] [123.7505]
[32.6901] [17.8058] [123.7499]
[32.6232] [17.8151] [123.7901]
>>> col_idx = strcmp(data(1,:),'dp')
col_idx=
0 0 1
Then return the dp using the logical indices and cell2mat...
>>> dp = cell2mat(data(2:end,col_idx))
dp =
123.7505
123.7499
123.7901
or unpack* and concatenate the comma separated list
>>> dp = [data{2:end,col_idx}]
dp =
123.7505 123.7499 123.7901
As an alternative try cell2struct.
>>> datastruct = cell2struct(data(2:end,:),data(1,:),2)
datastruct =
3x1 struct array with fields:
set
press
dp
Then dp is ...
>>> dp = [datastruct.dp]
dp =
123.7505 123.7499 123.7901
* Using the colon operator inside curly braces unpacks a cell array into a comma-separated list. Wrapping that list in square brackets horizontally concatenates it, and because the first item in the list is a character array, the result is a single character array along the lines of 'set press dp{{{' with non-printable characters mixed in. The garbage characters between and after 'set', 'press' and 'dp' are caused by reading the doubles as char, e.g. char(32.7045) is the ASCII equivalent of whitespace. The cell array is always unpacked in column-major order, i.e. column by column.
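The difference between the two calls can be sketched on a two-row version of the sample data. The variable names flat, wholeMatch, headerMatch and dpCol are made up for this illustration.

```matlab
data = {'set', 'press', 'dp'; 32.7045, 17.805965, 123.75047};

% Unpacking everything and concatenating gives one long char row
flat = [data{:}];
isText = ischar(flat);             % the doubles were converted to characters
wholeMatch = strcmp(flat, 'dp');   % a single comparison of the whole string

% Comparing only the header row gives one logical value per column
headerMatch = strcmp(data(1,:), 'dp');
dpCol = find(headerMatch);
```

Here wholeMatch is a single false, which is the 0 seen in the question, while dpCol is the column number that was actually wanted.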

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated Requirement : Thanks to all of your answers. However I am facing a problem ! Let me illustrate this (Please note that the cell values are extracted text from HTML documents and hence may contain non ASCII values) -
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why is bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Notice that in '!–photo' there exists a non-ASCII character â which essentially means a. Can a step be incorporated such that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
Also the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and 'gypsy_wagon' to be 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done ?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documents into raw text
% It will remove all commas empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowercase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains any digits then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the english dictionary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if isempty(ind)==0
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found a code snippet that shows how to check for non-ASCII characters. Incorporating it in MATLAB as
% remove the non-ascii characters
if any(x >= 128)
    T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be met? Most of them concern the characters '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
The pattern \w* matches any run of alphabetic, numeric, or underscore characters. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why is bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the input cell array. You can combine these however you want after.
Notice that in '!–photo' there exists a non-ASCII character â which essentially means a. Can a step be incorporated such that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. It will also exclude the non-ASCII characters, which the \w pattern appears to match.
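The effect of swapping the pattern can be sketched on a few tokens taken from the question. The input cell array and the variable names withUnderscore and lettersDigits are made up for this illustration.

```matlab
A = {'it?', '__________', 'a_rinny_boo...', 'gypsy_wagon...'};

% \w also matches the underscore, so underscore runs survive as "words"
withUnderscore = regexp(A, '\w*', 'match');
withUnderscore(cellfun('isempty', withUnderscore)) = [];
withUnderscore = [withUnderscore{:}];

% An explicit character class drops underscores and splits on them
lettersDigits = regexp(A, '[a-zA-Z0-9]*', 'match');
lettersDigits(cellfun('isempty', lettersDigits)) = [];   % '__________' has no match
lettersDigits = [lettersDigits{:}];
```

With the explicit class, 'a_rinny_boo' comes out as the three separate words 'a', 'rinny' and 'boo'; the single letter would then be dropped by the later single-character filter.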
I think @excaza's solution would be the go-to approach, but here's an alternative one using isstrprop with its 'alpha' option to look for alphabetic characters -
A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'

How do I compute the number of times a character appears in a character array in MATLAB?

I want to determine the number of times a character appears in a character array, excluding the time it appears at the last position.
How would I do this?
In MATLAB, all variables are arrays, and a string is stored as an array of type char (a character array). That means you can apply ordinary array operations to it. To count the occurrences of a character, excluding the last position, in a character array named yourStringVar, you can do this:
YourSubString = yourStringVar(1:end-1);
% YourSubString now holds the original string without its last character, which you wanted to ignore
numberOfOccurrence = length(find(YourSubString=='Character you want to count'))
It has been pointed out by Ray that length(find()) is not a good approach, for various reasons. Alternatively you could do:
numberOfOccurrence = nnz(YourSubString == 'Character you want to count')
numberOfOccurrence will give you your desired result.
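As a concrete sketch of the nnz version, the placeholder can be replaced with a variable. The sample string and the name target are assumptions chosen for this illustration.

```matlab
yourStringVar = 'ABCDABABCDEFFGACEG';
target = 'G';   % the character to count

% Drop the last position, then count the matches in what remains
YourSubString = yourStringVar(1:end-1);
numberOfOccurrence = nnz(YourSubString == target);
disp(numberOfOccurrence)
```

'G' occurs twice in the sample string, once at the final position, so the count excluding the last position is 1.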
What you can do is map each character into a unique integer ID, then determine the count of each character through histcounts. Use unique to complete the first step. The first output of unique will give you a list of all possible unique characters in your string. If you want to exclude the last time each character occurs in the string, just subtract 1 from the total count. Assuming S is your character array:
%// Get all unique characters and assign them to a unique ID
[unq,~,id] = unique(S);
%// Count up how many times we see each character and subtract by 1
counts = histcounts(id, 'BinMethod', 'integers') - 1; %// one bin per ID, regardless of how many unique characters there are
%// Show table of occurrences with characters
T = table(cellstr(unq(:)), counts.', 'VariableNames', {'Character', 'Counts'});
The last piece of code displays everything in a nice table. We ensure that the unique characters are placed as individual cells in a cell array.
Example:
>> S = 'ABCDABABCDEFFGACEG';
Running the above code, we get:
>> T
T =
Character Counts
_________ ______
'A' 3
'B' 2
'C' 2
'D' 1
'E' 1
'F' 1
'G' 1

How to load a cell array that has both strings and numbers?

I have a cell array that has both strings and numbers. I want to load all the elements of the cell array. To do that I used the following method:
load(filename);
This command loads only the strings and excludes the columns that have numbers. Basically, since my file does not have a .mat extension, it is treated as an ASCII file and only the text is loaded.
I tried importdata(filename). But that gives me struct of 1*1. I need the elements to be imported into another cell array of same dimension.
Is there a way to load all the values?
load is used to import .mat-files with workspace variables. Since your data is not an actual .mat-file, you need to use a different method.
Let's assume you have the file filename with tab-delimited data:
str1 1
str2 2
str3 3
str4 4
To get a cell-array where the first column is a string (using %s) and the second a double (using %f), you can use textscan. Check out the result, maybe it's already what you're searching for.
filename = 'data';
F = fopen(filename, 'r');
data = textscan(F, '%s %f', 'Delimiter', '\t');
fclose(F); % release the file handle once the data is read
If not, you can create a cell-array CA where the first column is a string (using cellstr) and the second one is a double (using num2cell).
CA = cell(size(data{1},1),2);
CA(:,1) = cellstr(data{1});
CA(:,2) = num2cell(data{2});
Result:
CA =
'str1' [1]
'str2' [2]
'str3' [3]
'str4' [4]
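The whole round trip can be sketched end to end. The temporary file and its contents are assumptions created only for this illustration; in practice filename would point at your own tab-delimited file.

```matlab
% Write a small tab-delimited sample file (hypothetical stand-in for the real data)
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, 'str%d\t%d\n', [1:4; 1:4]);
fclose(fid);

% Read it back: text column as %s, numeric column as %f
fid = fopen(filename, 'r');
data = textscan(fid, '%s %f', 'Delimiter', '\t');
fclose(fid);
delete(filename);

% Combine both columns into a single 4x2 cell array
CA = [data{1}, num2cell(data{2})];
disp(CA)
```

Since textscan already returns the %s column as a cell array of character vectors, it can be concatenated with num2cell(data{2}) directly.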