strfind split keyword within paragraph - matlab

After the output of keywords in URL, how do I check whether the keywords exist in the content of the page like the content below, if yes then return 1, else return 0. There is strfind at there, but I do not have idea why it cannot work
str = 'http://en.wikipedia.org/wiki/hostname'
Paragraph = 'hostname From wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename .....'
SplitStrings = regexp(str,'[/.]','split')
for it = SplitStrings
c( it{1} ) = strfind(Paragraph, it{1} )
end
SplitStrings = {};
feature11=(cellfun(#(n) isempty(n), strfind(Paragraph, SplitStrings{1})))
I can do with the below code 4 checking whether 'https' exist or not. But, how to modify the 'SplitString' into 'B6'?
str = 'https://en.wikipedia.org/wiki/hostname'
A6 = regexp(str,'\w*://','match','once')
B6 = {'https'};
feature6=(cellfun(#(n) isempty(n), strfind(A6, B6{1})))

It is absolutely not clear to me what you want to do here...
I suspect it is this:
str = 'http://en.wikipedia.org/wiki/hostname';
haystack = 'hostname From wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename .....';
needles = regexp(str,'[:/.]*','split') %// note the different search string
%// What I think you want to do
~cellfun('isempty', regexpi(haystack, needles, 'once'))
Results:
needles =
'http' 'en' 'wikipedia' 'org' 'wiki' 'hostname'
ans =
0 1 1 0 1 1
but if this is not the case, please edit your question and include your desired outputs for some example inputs.
EDIT
OK, so if I understand you corretly now, you want whole words and not partial matches. You must tell this to regexp, in the following way:
%// NOTE: these metacharacters indicate that match is to occur
%// at beginning AND end of word (so whole words only)
needles = strcat('\<', regexpi(str,'[:/.]*','split'), '\>')
%// Search for these words in the paragraph
~cellfun('isempty', regexpi(haystack, needles, 'once'))

You can try this
f=#(str) isempty(strfind(Paragraph,str))
cellfun(f,SplitStrings)
This should get whole words. The key is parsing the variable Paragraph to get them
SplitParagraph=regexp(Paragraph,'[ ,:.()]','split');
I=ismember(SplitStrings,SplitParagraph);
SplitStrings(I)

Related

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated Requirement : Thanks to all of your answers. However I am facing a problem ! Let me illustrate this (Please note that the cell values are extracted text from HTML documents and hence may contain non ASCII values) -
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why is bin/bash and take-a-long being separated into three cells ? Its not a problem for me but still why? Can this be avoided. I mean all words coming from a single cell being combined into one.
Notice that in '!–photo' there exists an non-ascii character â which esentially means a. Can a step be incorporated such that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
Also the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns a word as 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and ''gypsy_wagon to be 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done ?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documnets into raw text
% It will remove all commas empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowecase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains only numbers then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the english dicitonary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if isempty(ind)==0
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code in here which tells me how to check for non-ascii characters. Incorporating this code snippet in Matlab as
% remove the non-ascii characters
if all(x < 128)
else
T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be completed ? Most of them concerns the character '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
Which matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why is bin/bash and take-a-long being separated into three cells ? Its not a problem for me but still why? Can this be avoided. I mean all words coming from a single cell being combined into one.
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the input cell array. You can combine these however you want after.
Notice that in '!–photo' there exists an non-ascii character â which esentially means a. Can a step be incorporated such that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so ?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. This will also exclude the non-ASCII characters, which it seems like the \w* flag matches.
I think #excaza's solution would be the go-to approach, but here's an alternative one with isstrprop using its optional input argument 'alpha' to look for alphabets -
A(cellfun(#(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(#(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(#(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check spelling error of certain paragraph, I face the problem to get rid of the punctuation. It may have problem to the other reason, it return me the error as below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!##$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(#checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(#isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Substitute your regexprep call to
word_punctuation=regexprep(CharData,'\W','\n');
Here \W finds all non-alphanumeric characters (inclulding spaces) that get substituted with the newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(#isempty,word)) = [];
Everything worked for me. However I have to say that you checkSpelling function is very slow. At every call it has to create an ActiveX server object, add new document, and delete the object after check is done. Consider rewriting the function to accept cell array of strings.
UPDATE
The only problem I see is removing the quote ' character (I'm, don't, etc). You can temporary substitute them with underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can use list of all non-alphanumeric characters to be remove in square brackets instead of \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');

import data and then split into array form

I want to split the string into part eg: en, wikipedia, org, wiki, hostname... then check whether those split string appear in the A11. However, I am facing problem when I used importdata.
str = 'http://en.wikipedia.org/wiki/hostname';
split_URL = regexp(str,'[:/.]*','split') %// note the different search string
A11 = 'hostname From wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename .....';
feature11 = (~cellfun('isempty', regexpi(A11, split_URL, 'once')))
I replaced str = 'http://en.wikipedia.org/wiki/hostname'; with data=importdata(urlq), whereas urlq contain http://en.wikipedia.org/wiki/hostname which I saved in the note pad.
How do I use the importdata in this case?
I can split the str = 'http://en.wikipedia.org/wiki/hostname'; in to array form with 1x6cells but I cannot do so when using importdata. Any ideas?

Extracting strings from cells in MATLAB [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
Using regexp to find a word
I'm working on an assignment for my CS course.
We're given a plain text file, which, in my case, contains a series of tweets.
What I need to do is create a script that will detect hashtags, and then save each hashtag into an cell array.
So far I know how to write a function that detects the '#' symbol...
strfind(textRead{i},'#');
where in a for loop where i=1:30 (that is, the number of cells of text). However, past that, I'm at a loss as to how I should write a script that will detect the '#' and return the text between that and the next ' ' (space) character.
Try this:
str = '#someHashtag other tweet text ignore #random';
regexp(str, '#[A-z]*', 'match')
I think you'll be able to find the rest out yourself :)
Here is basic skeleton. But make sure to use correct regexp to extract the values ;-)
Yes with the above Dorin's regexp and match you get one value at a time. You may add a token as per this example from mathworks.
Sample:
str = ['if <code>A </code> == x<sup>2 </sup>, ' ... '<em>disp(x) </em>']
str = if <code>A </code> == x<sup>2 </sup>, <em>disp(x) </em>
expr = '<(\w+).*?>.*?</\1>';
[tok mat] = regexp(str, expr, 'tokens', 'match');
tok{:}
ans = 'code'
ans = 'sup'
ans = 'em'
in above code you don't really need to loop and can process entire text bulk as one string , hopefully not hitting any string limit......
But if you want to loop, or if you need to loop, you use the following sample with Rody's regexp and match only.
fid = fopen('data.txt');
dataText = fgetl(fid);
while ~feof(fid)
ldata = textscan(dataText,'*%d#*');
X = (ldata, '#[A-z]*', 'match')
Cellarray = X{1}
end
Disp(X)
fclose(fid);

Regexp to find a matching condition in a string

Hi need help in using regexp for condition matching.
ex.my file has the following content
{hello.program='function'`;
bye.program='script'; }
I am trying to use regexp to match the string that has .program='function' in them:
pattern = '[.program]+\=(function)'
also tried pattern='[^\n]*(.hello=function)[^\n]*';
pattern_match = regexp(myfilename,pattern , 'match')
but this returns me pattern_match={} while i expect the result to be hello.program='function'`;
If 'function' comes with string-markers, you need to include these in the match. Also, you need to escape the dot (otherwise, it's considered "any character"). [.program]+ looks for one or several letters contained in the square brackets - but you can just look for program instead. Also, you don't need to escape the =-sign (which is probably what messed up the match).
cst = {'hello.program=''function''';'bye.program=''script'''; };
pat = 'hello\.program=''function''';
out = regexp(cst,pat,'match');
out{1}{1} %# first string from list, first match
hello.program='function'
EDIT
In response to the comment
my file contains
m2 = S.Parameter;
m2.Value = matlabstandard;
m2.Volatility = 'Tunable';
m2.Configurability = 'None';
m2.ReferenceInterfaceFile ='';
m2.DataType = 'auto';
my objective is to find all the lines that match, .DataType='auto'
Here's how you find the matching lines with regexp
%# read the file with textscan into a variable txt
fid = fopen('myFile.m');
txt = textscan(fid,'%s');
fclose(fid);
txt = txt{1};
%# find the match. Allow spaces and equal signs between DataType and 'auto'
match = regexp(txt,'\.DataType[ =]+''auto''','match')
%# match is not empty only if a match was found. Identify the non-empty match
matchedLine = find(~cellfun(#isempty,match));
Try this as it matches .program='function' exactly:
(\.)program='function'
I think this did not work:
'[.program]+\=(function)'
because of how the []'s work. Here is a link explaining why I say that: http://www.regular-expressions.info/charclass.html