Replace every string containing '# ' - matlab

I have a cell array containing strings like dataT1 below. How do I replace all strings containing a '#' with the letter 'O' and nothing else (no numbers in the string)?
dataT1 = {
[275.7770] [169.6630] [89.5380] [48.2740] [24.2400] [12.7510]
[284.3560] [160.4500] [87.3740] [47.4500] [23.9530] [12.4590]
'# 12.304' [129.7730] [66.2630] [34.1540] [15.1730] [ 9.6840]
[267.5270] [152.3700] '# 17.504' [45.2510] [23.3770] [13.0670]
[206.9110] [115.3030] [56.4770] [29.9350] [14.8680] '# 6.504' }

You can use a simple loop.
If you want to replace only the # character with O, use this:
for c = 1:numel(dataT1)
    % only touch character arrays, since the cell also holds numeric values
    if ischar(dataT1{c})
        % replace hashes with 'O'
        dataT1{c} = strrep(dataT1{c}, '#', 'O');
    end
end
If you want to replace every string containing # with just the string 'O', use this:
for c = 1:numel(dataT1)
    % only touch character arrays that contain a # somewhere
    if ischar(dataT1{c}) && ~isempty(strfind(dataT1{c}, '#'))
        % replace the whole string with 'O'
        dataT1{c} = 'O';
    end
end
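If you would rather avoid the loop, the second variant can probably be written with cellfun and logical indexing. This is only a sketch on a shortened, made-up stand-in for dataT1; hasHash is a name chosen here for illustration.

```matlab
% Shortened stand-in for dataT1 (values picked for illustration)
dataT1 = {275.7770, '# 12.304'; '# 17.504', 12.4590};

% true wherever the cell holds text that contains a '#'
hasHash = cellfun(@(x) ischar(x) && any(x == '#'), dataT1);

% overwrite every flagged cell with the single letter 'O'
dataT1(hasHash) = {'O'};
```

The ischar check comes first so that numeric cells are never compared against '#'.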

Related

How to convert 1d array of chars from CSV to 2d cell array in Matlab

I have a function I cannot change in Matlab that converts CSV file to a 1d array of chars.
This is an example of what it looks like:
array_o_chars = ['1' 'd' ',' ' ' 'a' 'r' 'r' 'a' 'y' '\n' 'o' 'f' ',' ' ' 'c' 'h' 'a' 'r' 's'];
I need to convert it to a 2d "Cell" array for the rest of my code, how do I do it?
The answer is surprisingly simple:
First, you want to break the char array into the "rows" of the CSV file. Depending on your line ending, you will choose one of these as a delimiter: \n, \r, or a combination of the two. For this example, we are using \n.
rows = split(array_o_chars, '\n');
Now that you have your rows, you need to create your columns. This is done in the same manner as your rows, except that the delimiter is ', ' (a comma followed by a space).
cell_array = split(rows, ', ');
Now you have the 2d cell array you desire.
All together now:
% Test 1d array
array_o_chars = ['1' 'd' ',' ' ' 'a' 'r' 'r' 'a' 'y' '\n' 'o' 'f' ',' ' ' 'c' 'h' 'a' 'r' 's'];
% Conversion to cell array
rows = split(array_o_chars, '\n');
cell_array = split(rows, ', ');
% Show result in Command Window
display(cell_array);
Output from Matlab:
cell_array =
2×2 cell array
{'1d'} {'array'}
{'of'} {'chars'}
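One caveat worth checking against your own data: in MATLAB, '\n' in single quotes is the two characters \ and n, not a line feed, so the example above splits on that literal pair. If the function you cannot change returns real line feeds, something along these lines may be closer to what you need. It is a sketch that assumes R2016b or later (for split and newline), and the test string is built here with sprintf purely for illustration.

```matlab
literal = '\n';            % two characters: a backslash and the letter n
actual  = sprintf('\n');   % one character: a line feed, same as char(10)

% Test data with a real line feed between the two CSV rows
array_o_chars = sprintf('1d, array\nof, chars');

rows = split(array_o_chars, newline);   % 2x1 cell array, one row per line
cell_array = split(rows, ', ');         % 2x2 cell array of fields
```

For Windows line endings you would presumably split on [char(13) newline] instead.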

Effective way to convert/create matrix from mixed cell/string

Sometimes there might be more than one string located somewhere else, so I need a way to find every one of them in the cell array. I have a cell array like the one below and I need a fast and effective way to 1) remove the empty columns, 2) convert the cells containing a string with "#" to the number after the "#" (6.504), and finally 3) create or convert the whole cell array to a data matrix like "data" below. Is there a smart way to do all this? Any suggestions are highly appreciated.
array ={
[47.4500] '' [23.9530] '' [12.4590]
[34.1540] '' [15.1730] '' [ 9.6840]
[45.2510] '' [23.3770] '' [13.0670]
[29.9350] '' [14.8680] '' '# 6.504'}
data =[
47.4500 23.9530 12.4590
34.1540 15.1730 9.6840
45.2510 23.3770 13.0670
29.9350 14.8680 6.5040]
Columns with mixed types are tricky to handle, but if the format always follows the regex pattern # \d+(?:\.\d+) you can proceed as follows:
C = {
47.4500 '' 23.9530 '' 12.4590
34.1540 '' 15.1730 '' 9.6840
45.2510 '' 23.3770 '' 13.0670
29.9350 '' 14.8680 '' '# 6.504'
};
% Get rid of empty columns...
C(:,all(cellfun(@ischar,C))) = [];
% Convert numeric strings into numeric values...
C = cellfun(@(x)convert(x),C,'UniformOutput',false);
% Convert the cell matrix into a numeric matrix...
C = cell2mat(C);
Where the convert function is defined as follows:
function x = convert(x)
if (~ischar(x))
return;
end
x = str2double(strrep(x,'# ',''));
end
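If you would rather not keep a separate convert function file, the same three steps can be sketched inline with logical indexing. This assumes, as above, that every text cell is either empty or of the form '# <number>'; the names isText and data and the shortened C are chosen here for illustration.

```matlab
% Shortened stand-in for the cell array in the question
C = {47.4500, '', 23.9530; 29.9350, '', '# 6.504'};

% 1) drop columns in which every cell is empty
C(:, all(cellfun(@isempty, C), 1)) = [];

% 2) convert the remaining '# <number>' strings to doubles
isText = cellfun(@ischar, C);
C(isText) = cellfun(@(s) str2double(strrep(s, '# ', '')), ...
                    C(isText), 'UniformOutput', false);

% 3) all cells are numeric scalars now, so cell2mat can build the matrix
data = cell2mat(C);
```

Any text that does not parse as a number comes back from str2double as NaN, which makes malformed cells easy to spot afterwards.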

How to convert a string array into a character array in Matlab?

Suppose that we have a string array in Matlab like below:
a='This is a book'
How can we convert the above string array into a character array by a function in Matlab like below?
b={'T' 'h' 'i' 's' ' ' 'i' 's' ' ' 'a' ' ' 'b' 'o' 'o' 'k'}
Your a is not a string array; it's a character array (which also used to be called a string, but starting from R2016b that term has a different meaning). Your b is not a character array, it's a cell array that contains characters.
Anyway, to convert from a to b, use num2cell:
a = 'This is a book';
b = num2cell(a);
If you really do want to convert a string (introduced in R2016b) to a char array, this is how you do it:
s = "My String"; % Create a string with ""
c = char(s); % This is how you convert string to char.
isstring(c)
ans =
logical
0
ischar(c)
ans =
logical
1
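The two answers can be combined if the input really is a string rather than a char array. A minimal sketch, assuming a release that accepts double-quoted string literals as in the snippet above:

```matlab
s = "This is a book";   % string scalar
c = char(s);            % 1x14 char array
b = num2cell(c);        % 1x14 cell array, one character per cell
```

Here b{1} is 'T' and b{5} is the space between the first two words.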

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated Requirement : Thanks to all of your answers. However I am facing a problem ! Let me illustrate this (Please note that the cell values are extracted text from HTML documents and hence may contain non ASCII values) -
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why are bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Notice that in '!–photo' there exists a non-ASCII character â which essentially means a. Can a step be incorporated such that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
Also the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and 'gypsy_wagon' to be 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documents into raw text
% It will remove all commas, empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = {}; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowercase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains any digit then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the english dictionary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if isempty(ind)==0
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code in here which tells me how to check for non-ascii characters. Incorporating this code snippet in Matlab as
% remove cells that contain any non-ascii character
if any(x >= 128)
    T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be completed ? Most of them concerns the character '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
Which matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why are bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the input cell array. You can combine these however you want after.
Notice that in '!–photo' there exists a non-ASCII character â which essentially means a. Can a step be incorporated such that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. It will also exclude the non-ASCII characters, which the \w* pattern seems to match.
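To make the difference concrete, here is a small sketch on a few of the problem strings from the question. It goes one step further than the pattern above and keeps letters only, which is an assumption on my part since the question does not want numbers either; the sample cell array is assembled here for illustration.

```matlab
% Samples taken from the question's problem cases
A = {'!/bin/bash', 'a_rinny_boo', '__________', '173.', 'gypsy_wagon'};

% Runs of ASCII letters only: digits, '_' and punctuation all act as separators
B = regexp(A, '[a-zA-Z]+', 'match');   % one nested cell of matches per input cell
B = [B{:}];                            % flatten into a single row of words
```

Cells with no letters at all (such as '__________' and '173.') contribute nothing to the flattened result, so no separate clean-up of empty cells is needed in this variant.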
I think @excaza's solution would be the go-to approach, but here's an alternative one with isstrprop using its optional input argument 'alpha' to look for letters -
A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'

delimiting by a char but not deleting it

I have a text file that looks like this:
(a (bee (cold down)))
if I load it using
c=textscan(fid,'%s');
I get this:
'(a'
'(bee'
'(cold'
'down)))'
What I would like to get is:
'('
'a'
'('
'bee'
'('
'cold'
'down'
')'
')'
')'
I know I can delimit with '(' and ')' by specifying 'Delimiter' in textscan, but then I will lose this character, which I want to keep.
Thank you in Advance.
The %s specifier indicates that you want strings, but what you want is individual chars. Use %c instead.
c=textscan(fid,'%c');
Update: if you want to keep your words intact, then you'll want to load your text using the %s specifier. After the text is loaded you can either solve this problem with regular expressions (not my forte) or write your own parser that parses each word individually and saves the parentheses and words to a new cell array.
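For what it's worth, the regular-expression route might look something like the sketch below. The pattern is my own guess at what is wanted: a token is either a single parenthesis or a run of characters that are neither parentheses nor whitespace. The line is hard-coded here rather than read from the file.

```matlab
str = '(a (bee (cold down)))';

% '[()]'      matches one parenthesis as its own token
% '[^()\s]+'  matches a word: anything up to the next parenthesis or space
tokens = regexp(str, '[()]|[^()\s]+', 'match')';   % transpose to a column
```

For this input, tokens is a 10x1 cell array with the parentheses kept as separate entries.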
AFAIK, there is no canned routine capable of preserving arbitrary delimiters.
You'd have to do it yourself:
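str = '(a (bee (cold down)))';
bo = str == '(';
bc = str == ')';
sp = str == ' ';
output = cell(numel(str),1);   % upper bound on the number of tokens
j = 0;
inWord = false;                % true while characters are being appended to a word
for ii = 1:numel(str)
    if bo(ii) || bc(ii)
        % a parenthesis is always a token of its own
        j = j + 1;
        output{j} = str(ii);
        inWord = false;
    elseif sp(ii)
        % a space only ends the current word
        inWord = false;
    else
        if ~inWord
            % first character of a new word: move to a fresh cell
            j = j + 1;
            inWord = true;
        end
        output{j} = [output{j} str(ii)];
    end
end
output = output(1:j);          % drop the unused cells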
Which can probably be improved -- the growing character array will prevent the loop from being JIT'ed. The array bc | bo | sp holds all the information to vectorize this thing, I just don't see how at this hour...
Nevertheless, it should give you a place to start.
Matlab has a strtok function similar to C. Its format is:
token = strtok(str)
token = strtok(str, delimiter)
[token, remain] = strtok('str', ...)
There is also a string replace function, strrep:
modifiedStr = strrep(origStr, oldSubstr, newSubstr)
What I would do is modify the original string with strrep to add in delimiters, then use strtok. Since you already scanned the string into c:
str = strjoin(c{1}.', ' ');       % textscan returns the words in c{1}; rebuild one line
str = strrep(str, '(', '( ');     % Add a space after each open paren
str = strrep(str, ')', ' ) ');    % Add a space before and after each close paren
tokens = {};
[tok, remain] = strtok(str, ' ');
while ~isempty(tok)
    tokens{end+1, 1} = tok;       % store the token, then continue with the remainder
    [tok, remain] = strtok(remain, ' ');
end
This gives you a linear cell array holding each of the tokens you requested.
strtok reference: http://www.mathworks.com/help/techdoc/ref/strtok.html
strrep reference: http://www.mathworks.com/help/techdoc/ref/strrep.html