Replacing Text with Error Message if Text not Found in Matlab - matlab

I have a function that replaces a piece of text, and it became relevant that this function needs to issue an error in case it fails to do so.
One way to do so would be:
text_var = 'The whole big text';
if ~contains(text_var,'The part that will be replaced')
throw(MException('MF:error','The part to be replaced is not in the text!'))
else
text_var = strrep(text_var,'The part that will be replaced','The replacement');
end
However, this does not seem efficient. I can assume that the text, if it appears, does so only once, but I'd like to make a single call to a function operating on text_var. Is there no text replacement function in Matlab that returns an error if the replacement failed?

You could do the replacement and then check whether the new string's length is unchanged (this assumes the original and replacement strings have different lengths):
text_var = 'the whole big text';
n = numel( text_var );
text_var = strrep( text_var, 'replace me', 'with this' );
if numel( text_var ) == n
error( 'No replacements made' );
end
If you can't make that assumption, you could use strfind to get the indices of the string. The result is empty if the string is not found (so raise an error); otherwise you can use it to remove the string manually. This is especially easy since you state it will appear at most once.
text_var = 'the whole big text';
removeStr = 'replace this';
k = strfind( text_var, removeStr );
if isempty( k )
error( 'No replacements made' );
end
text_var( k:k+numel(removeStr)-1 ) = []; % Remove string
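Because you're only matching once, you might find that regexp is quicker than strfind, as you can use the 'once' argument of regexp to make it stop at the first match. Note that regexp treats removeStr as a regular expression, so any metacharacters in it need to be escaped.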
k = regexp( text_var, removeStr, 'once' ); % instead of using strfind
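If you want to replace the string rather than just remove it, the strfind idea extends naturally. The following is only a minimal sketch: the variable names oldStr and newStr and the error identifier 'MF:notFound' are made up for illustration, and it assumes plain char row vectors.

```matlab
% Sketch: replace the first occurrence of oldStr, or raise an error if it is absent.
% oldStr, newStr and the identifier 'MF:notFound' are illustrative names only.
text_var = 'The whole big text';
oldStr   = 'big';
newStr   = 'small';

k = strfind(text_var, oldStr);   % start indices of all matches, empty if none
if isempty(k)
    error('MF:notFound', 'The part to be replaced is not in the text!');
end
k = k(1);                        % the text is assumed to appear at most once

% Equivalent lookup with regexp; the literal text is escaped first because
% regexp interprets its pattern as a regular expression.
kr = regexp(text_var, regexptranslate('escape', oldStr), 'once');

% Splice the replacement in between the parts before and after the match.
text_var = [text_var(1:k-1), newStr, text_var(k+numel(oldStr):end)];
```

This scans the text once for the match and does not depend on the two strings having different lengths.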

Related

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated requirement: Thanks for all of your answers. However, I am facing a problem. Let me illustrate it (please note that the cell values are text extracted from HTML documents and hence may contain non-ASCII values):
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why are bin/bash and take-a-long being separated into several cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Notice that in '!–photo' there is a non-ASCII character, which comes out as â and essentially means a. Can a step be incorporated so that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
Also, the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and 'gypsy_wagon' to be 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documents into raw text
% It will remove all commas empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowercase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains any digits then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the english dictionary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if isempty(ind)==0
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code in here which tells me how to check for non-ascii characters. Incorporating this code snippet in Matlab as
% remove the non-ascii characters
if all(x < 128)
else
T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be completed ? Most of them concerns the character '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
The \w pattern matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why are bin/bash and take-a-long being separated into several cells? It's not a problem for me, but still, why? Can this be avoided, so that all words coming from a single cell are combined into one?
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the input cell array. You can combine these however you want after.
Notice that in '!–photo' there is a non-ASCII character, which comes out as â and essentially means a. Can a step be incorporated so that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. This will also exclude the non-ASCII characters, which it seems like the \w* flag matches.
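To make the points about underscores, hyphens and per-cell grouping concrete, here is a rough sketch. It assumes you only want ASCII letters (so digits are dropped as well), and the variable names tokens, perCell and words are just illustrative.

```matlab
% Sketch: split on anything that is not an ASCII letter, so '_' and '-'
% act as separators, and optionally keep the words of each input cell together.
A = {'!/bin/bash' '![endif]' '!take-a-long' '2. a_rinny_boo...' '__________'};

tokens = regexp(A, '[a-zA-Z]+', 'match');      % one nested cell of matches per input cell
tokens = tokens(~cellfun('isempty', tokens));  % drop inputs that contained no letters

% Variant 1: keep the words from each input cell together in one string
perCell = cellfun(@(c) strjoin(c, ' '), tokens, 'UniformOutput', false);

% Variant 2: flatten into one list of words and drop single letters
words = [tokens{:}];
words = words(cellfun('length', words) > 1);
```

With this pattern, '__________' yields no match at all and 'a_rinny_boo' is split at the underscores, which seems to be what the last two questions ask for.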
I think @excaza's solution would be the go-to approach, but here's an alternative one with isstrprop, using its optional input argument 'alpha' to look for alphabetic characters -
A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check a paragraph for spelling errors, but I am having trouble getting rid of the punctuation. There may be another cause as well; it returns the error below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W finds all non-alphanumeric characters (including spaces), which get replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that the quote character ' gets removed (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or, instead of \W, you can list all the non-alphanumeric characters to be removed in square brackets.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
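Putting the pieces together, the tokenizing and thresholding steps might look roughly like this. This is only a sketch: the sample text is invented, and knownWords is a stand-in dictionary used here in place of the ActiveX-based checkSpelling so that the example runs without Word.

```matlab
% Sketch: split text into words (keeping apostrophes), count unknown words,
% and turn the count into the +1 / -1 feature described in the question.
CharData = sprintf('I''m here, don''t go!\nSecond line: ok?');   % invented sample text

% Replace everything except letters, digits, apostrophes and underscores by a newline
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];             % remove empty cells

% Stand-in for the spell checker: a word counts as an error if it is not listed here
knownWords = {'i''m', 'here', 'don''t', 'go', 'second', 'line'};
isError    = ~ismember(lower(word), knownWords);
nErrors    = sum(isError);

% More than 20 errors gives 1, otherwise -1
if nErrors > 20
    feature = 1;
else
    feature = -1;
end
```

In the real code, isError would come from a single spell-checking pass over the whole cell array, so the Word server is only started once.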

Function that creates list of personalia

I have the function
fid = fopen(filename,'w');
if exist('fid')
check = true;
else
check = false;
end
for i=1:length(persons)
fprintf(fid, '%s\n',serialize_person(persons(i)));
end
end
Where serialize_person is
function [output] = serialize_person(person)
fprintf ( '<%s>#' , person.name ) ;
serialize_date ( person.date_of_birth ) ;
fprintf ( '#<%i>\n' , person.phone ) ;
end
It takes in a personalia and writes out 'name.day.month.year.phonenumber'.
Firstly, I need to make this come out as a single string of text in 'output' for it to (I assume) work in the first function. How would I go about this?
Secondly, the first function takes in a filename and a cell of persons. I want it to write a text file with the name 'filename' with one personalia per line.
Yesterday I had it working up to the for loop, but somehow I can't get beyond the first line today without hitting an error message.
Could you give me some advice here? I don't know what's wrong.
To write output to a character array rather than the console, use sprintf. Also, to join the strings with a '.' between strings, try strjoin with a delimiter set:
function [output] = serialize_person(person)
delim = '.';
% strjoin expects the pieces as a single cell array of strings
output = strjoin({sprintf ( '<%s>#' , person.name ), ...
serialize_date ( person.date_of_birth ), ...
sprintf ( '#<%i>\n' , person.phone )}, delim);
end
Modify serialize_date similarly.
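For the file-writing part, something along these lines might work. It is a sketch under assumptions: the field names day, month and year inside date_of_birth are guesses (the question does not show serialize_date), the sample persons are invented, and a temporary file name is used instead of your filename. Note that fopen returns -1 on failure, so exist('fid') is not a useful check.

```matlab
% Sketch: serialize each person to one 'name.day.month.year.phone' line and write it to a file.
% The date field names (day, month, year) and the sample data are assumptions.
serialize_person = @(p) sprintf('%s.%d.%d.%d.%d', p.name, ...
    p.date_of_birth.day, p.date_of_birth.month, p.date_of_birth.year, p.phone);

persons = { ...
    struct('name', 'Ada', 'date_of_birth', struct('day', 10, 'month', 12, 'year', 1815), 'phone', 5550100), ...
    struct('name', 'Bob', 'date_of_birth', struct('day', 1,  'month', 2,  'year', 1990), 'phone', 5550101)};

filename = tempname;                 % temporary file for the demonstration
fid = fopen(filename, 'w');
if fid == -1                         % fopen signals failure with -1
    error('Could not open %s for writing', filename);
end
for i = 1:numel(persons)
    fprintf(fid, '%s\n', serialize_person(persons{i}));   % persons is a cell, hence {}
end
fclose(fid);

written = fileread(filename);        % read back to check the result
delete(filename);
```

Since persons is a cell array, it is indexed with curly braces; persons(i) would hand a 1x1 cell to the serializer rather than the struct.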

print cells type in Matlab

I have a cell array like:
>>text
'Sentence1'
'Sentence2'
'Sentence3'
Whenever I use
sprintf(fid,'%s\n',text)
I get an error saying:
'Function is not defined for 'cell' inputs.'
But if I put :
sprintf(fid,'%s\n',char(text))
It works, but in the file all the sentences appear mixed together, making no sense.
Can you recommend what to do?
Whenever I display text I get:
>>text
'Title '
'Author'
'comments '
{3x1} cell
That is why I can not use text{:}.
If you issue
sprintf('%s\n', text)
you are saying "print a string with a newline. The string is this cell array". That's not correct; a cell-array is not a string.
If you issue
sprintf('%s\n', char(text))
you are saying "print a string with a newline. The string is this cell array, which I convert to character array.". The thing is, that conversion results in a single character array, and sprintf will re-use the %s\n format only for multiple inputs. Moreover, it writes that single character array by column, meaning, all characters in the first column, concatenated horizontally with all characters from the second column, concatenated with all characters from the third column, etc.
Therefore, the appropriate call is something with multiple inputs, using fprintf since you are writing to a file identifier:
fprintf(fid, '%s\n', text{:})
because the cell-expansion text{:} creates a comma-separated list from all entries in the cell-array, which is exactly what sprintf/fprintf expects.
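A tiny experiment may make the difference visible. This is just an illustrative sketch with a made-up two-element cell array, using sprintf so that the results can be inspected as strings.

```matlab
% Sketch: char() builds a matrix that is read column by column,
% while text{:} passes every string as a separate argument.
text = {'ab'; 'cd'};

mixed = sprintf('%s\n', char(text));   % one 2x2 char matrix, consumed column-wise
lines = sprintf('%s\n', text{:});      % format is reused once per string
```

Here mixed contains the characters in the order a, c, b, d followed by a single newline, whereas lines contains 'ab' and 'cd' on separate lines.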
EDIT: As you indicate, you have non-char entries in text. You have to check for that. If you want to pass only the strings into fprintf, use
fprintf(fid, '%s\n', text{ cellfun('isclass', text, 'char') })
If that {3x1 cell} is again a set of strings and you want to write all strings recursively, then just use something like this:
function textWriter
text = {...
'Title'
'Author'
'comments'
{...
'Title2'
'Author2'
'comments2'
{...
'Title3'
'Author3'
'comments3'
}
}
}
text = cell2str(text);
fid = 1; % assumption: write to standard output; use a handle from fopen to write to a file
fprintf(fid, '%s\n', text{:});
end
function out = cell2str(c)
out = {};
for ii = c(:)'
if iscell(ii{1})
S = cell2str(ii{1});
out(end+1:end+numel(S)) = S;
else
out{end+1} = ii{1};
end
end
end

Adding a substring to each line in a string in MATLAB

Say I have a string in a variable in MATLAB like the following:
this is the first line
this is the second line
this is the third line
I would like to add a fixed string at the beginning of each line. For example:
add_substring(input_string, 'add_this. ')
would output:
add_this. this is the first line
add_this. this is the second line
add_this. this is the third line
I know I can do this by looping through the input string, but I am looking for a more compact (hopefully vectorized) way to do this, perhaps using one of MATLAB's built-ins such as arrayfun or accumarray.
The strcat function is what you're looking for. It does vectorized concatenation of strings.
strs = {
'this is the first line'
'this is the second line'
'this is the third line'
}
strcat({'add_this. '}, strs)
With strcat, you need to put 'add_this. ' in a cell ({}) to protect it from having its trailing whitespace stripped, which is strcat's normal behavior for char inputs.
Assuming your strings are stored in a cell array then cellfun will do what you want, e.g.
s = {'this is the first line', 'this is the second line', 'this is the third line'};
prefix = 'add_this. ';
% Use [prefix, str] rather than strcat(prefix, str): strcat strips trailing whitespace from char inputs
res = cellfun(@(str) [prefix, str], s, 'UniformOutput', false);
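If the text really lives in a single char variable with newline characters between the lines, as the question suggests, one option is to split, prefix and rejoin. This is a sketch that assumes the lines are separated by '\n' only (no '\r\n'); the variable names are illustrative.

```matlab
% Sketch: add a prefix to every line of a newline-separated char vector.
input_string = sprintf('this is the first line\nthis is the second line\nthis is the third line');
prefix = 'add_this. ';
nl = sprintf('\n');                                  % the newline character

% Keep empty lines by not collapsing consecutive delimiters
lines = strsplit(input_string, nl, 'CollapseDelimiters', false);
lines = strcat({prefix}, lines);                     % prefix in a cell keeps its trailing space
output_string = strjoin(lines, nl);
```

The result is again a single char vector, so it can be passed on wherever input_string was used.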