Split word and check spelling error within article - matlab

I want to check for spelling errors within an article. I have 100 articles to check; for each one, I want to return 1 if it contains at least one spelling error and 0 otherwise. I have to split the article into words first and then check each word. I have done all of this below, but the problem is that I cannot check the spelling of the split words. However, checking a single word works:
deliberate_mistake = 'tabel';
suggestion = checkSpelling(deliberate_mistake)
output:
suggestion =
'table'
checkSpelling.m file
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestions';
end
end
%Quit Word to release the server
h.Quit
f20.m file
for i = 1:1
data2=fopen(strcat('DATA\',int2str(i),''),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word =regexp(CharData,' ','split')
[sizeData b] = size(word);
suggestion = checkSpelling(word)

Your input is a cell array, try to give your function a single string input. Works for me.
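
A minimal sketch of that advice, assuming checkSpelling.m from the question is on the path and Microsoft Word is installed. The sample sentence and the names hasError and articleFlag are made up for illustration. The idea is to call the function once per word, then reduce the results to a single 0/1 flag.

```matlab
% Hypothetical sample text; in f20.m this would come from fread.
CharData = 'This tabel has one mistake';

% Split on spaces; the result is a cell array of strings.
word = regexp(CharData, ' ', 'split');

% Call checkSpelling on one string at a time, not on the whole cell array.
suggestion = cell(size(word));
for k = 1:numel(word)
    suggestion{k} = checkSpelling(word{k});
end

% checkSpelling returns [] for a correctly spelled word, so any
% non-empty entry (suggestions or 'no suggestions') marks a flagged word.
hasError = ~cellfun(@isempty, suggestion);
articleFlag = double(any(hasError));   % 1 if at least one error, else 0
```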

Related

Loop over a cell array if file exists in folder and print to new document

I want to loop over a cell array of file names and check whether each file
exists in the working folder. If a file exists, create and open a new file
called testcase.txt and print a statement to it. If the file doesn't exist in
the working folder, do nothing and move on to the next iteration.
Here's what an example looks like when searching for one file:
fid=fopen('testcase.txt','w');
if exists('abc.txt','file')
abc.txt = 1;
fprintf('test case successful');
else
abc.txt = 0;
end
fclose(fid);
Here's what the cell array version, expanded over multiple cases, looks like.
I can't seem to get it to run properly. Can someone help me get this
loop to work?
extension = {'abc.txt' 'def.txt' 'ght.txt'};
convertedfile = [abc.txt def.txt ght.txt];
fid=fopen('testcase.txt','w');
for i = extension
if exist(['''' convertedfile ''''],'file')
i = 1;
fprintf('test case successful');
else
i = 0;
end
end
fclose(fid);
Are you looking for something like this?
extension = {'abc.txt' 'def.txt' 'ght.txt'};
% convertedfile = [abc.txt def.txt ght.txt];
fid=fopen('testcase.txt','w');
for i = 1:length(extension)
if exist(extension{i},'file')
fprintf(fid,'test case successful\n');
end
end
fclose(fid);
Also, you could print the text file for which it was successful using:
fprintf(fid,'%s: test case successful\n',extension{i});

Why is zero not considered by the %d identifier

I have 400 files in my directory, with file names H1001, H1002, H1003, and so on. I want to read those files in MATLAB.
When I use this code, it gives me an error.
d=dir('C:\Users\Desktop\New\*.txt')
num_files=length(d)
data=cell(1,num_files);
for k = 1:400
myfilename = sprintf('H1%3d.txt',k);
mydata{k} = importdata(myfilename);
end
It shows
myfilename = H1  1.txt, which is the wrong file name; it should be H1001.txt. So I get an error on the next line.
It does not print the 00. It gives blank spaces instead.
Can anybody tell me the answer?
If you want a zero-filled, right-adjusted value, the correct format specifier is not %3d (which pads with spaces) but %03d:
for k = 1:400
myfilename = sprintf('H1%03d.txt',k);
mydata{k} = importdata(myfilename);
end
The difference can easily be seen in
> printf('H1%3d.txt\n', 7);
H1  7.txt
> printf('H1%03d.txt\n', 7);
H1007.txt

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
split the text into words at newlines and spaces, then store them in an array
check whether the text file has errors using the checkSpelling.m function
sum up the total number of errors in the article
"no suggestion" is assumed to be no error, so return -1
if the sum of errors > 20, return 1
if the sum of errors <= 20, return -1
I would like to check the spelling of a certain paragraph, but I am having trouble getting rid of the punctuation. There may be another cause as well; it returns the error below:
My data2 file is:
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches every non-word character (anything other than letters, digits, and underscore, so spaces are included), and each match is replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
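
One way to follow that suggestion, as a rough sketch only: start Word once, check every word in a loop, and quit once at the end. The function name checkSpellingBatch is hypothetical. It uses only the COM calls that already appear in the question's checkSpelling.m, so it has the same requirements (Windows with Microsoft Word installed).

```matlab
function isCorrect = checkSpellingBatch(words)
% words     : cell array of strings, one word per cell
% isCorrect : logical array, true where Word accepts the spelling

h = actxserver('word.application');   % create the server once
h.Document.Add;                       % same setup call as in checkSpelling.m

isCorrect = false(size(words));
for k = 1:numel(words)
    isCorrect(k) = h.CheckSpelling(words{k});
end

h.Quit;                               % release the server once
end
```

With this, the error count for an article would be sum(~checkSpellingBatch(word)), with no suggestions collected.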
UPDATE
The only problem I see is that this removes the quote ' character (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's treated as a word character) or any sequence of unused characters. Or you can list all the non-alphanumeric characters to be removed in square brackets instead of using \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
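
A short sketch of the whole clean-and-split step with this pattern, using a made-up sample string in place of the file contents:

```matlab
% Hypothetical sample text containing apostrophes, punctuation and a newline.
CharData = sprintf('I''m here, don''t go!\nNew line.');

% Replace everything except letters, digits, apostrophes and underscores
% with a newline, so each word ends up on its own line.
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');

% Split on newlines, then drop the empty cells left by runs of separators.
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];

disp(word)   % the words I'm, here, don't, go, New, line
```

The apostrophes inside I'm and don't survive because ' is listed inside the negated character class.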

strfind split keyword within paragraph

After extracting the keywords from a URL, how do I check whether each keyword exists in the content of the page (like the content below), returning 1 if it does and 0 otherwise? I tried strfind, but I have no idea why it does not work.
str = 'http://en.wikipedia.org/wiki/hostname'
Paragraph = 'hostname From wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename .....'
SplitStrings = regexp(str,'[/.]','split')
for it = SplitStrings
c( it{1} ) = strfind(Paragraph, it{1} )
end
SplitStrings = {};
feature11=(cellfun(@(n) isempty(n), strfind(Paragraph, SplitStrings{1})))
I can use the code below for checking whether 'https' exists or not. But how do I modify it so that 'SplitStrings' takes the place of 'B6'?
str = 'https://en.wikipedia.org/wiki/hostname'
A6 = regexp(str,'\w*://','match','once')
B6 = {'https'};
feature6=(cellfun(@(n) isempty(n), strfind(A6, B6{1})))
It is absolutely not clear to me what you want to do here...
I suspect it is this:
str = 'http://en.wikipedia.org/wiki/hostname';
haystack = 'hostname From wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename .....';
needles = regexp(str,'[:/.]*','split') %// note the different search string
%// What I think you want to do
~cellfun('isempty', regexpi(haystack, needles, 'once'))
Results:
needles =
'http' 'en' 'wikipedia' 'org' 'wiki' 'hostname'
ans =
0 1 1 0 1 1
but if this is not the case, please edit your question and include your desired outputs for some example inputs.
EDIT
OK, so if I understand you correctly now, you want whole words and not partial matches. You must tell this to regexp, in the following way:
%// NOTE: these metacharacters indicate that match is to occur
%// at beginning AND end of word (so whole words only)
needles = strcat('\<', regexpi(str,'[:/.]*','split'), '\>')
%// Search for these words in the paragraph
~cellfun('isempty', regexpi(haystack, needles, 'once'))
You can try this
f=@(str) isempty(strfind(Paragraph,str))
cellfun(f,SplitStrings)
This should get whole words. The key is parsing the variable Paragraph to get them
SplitParagraph=regexp(Paragraph,'[ ,:.()]','split');
I=ismember(SplitStrings,SplitParagraph);
SplitStrings(I)

Separating an array based on whether it contains a phrase or not

I am really just a noob at Matlab, so please don't get upset if I use wrong syntax. I am currently writing a small program in which I put all .xlsx file names from a certain directory into an array. I now want to separate the files into two different arrays based on their name. This is what I tried:
files = dir('My_directory\*.xlsx')
file_number = 1;
file_amount = length(files);
while file_number <= file_amount;
file_name = files(file_number).name;
filescs = [];
filescwf = [];
if strcmp(file_name,'*cs.xlsx') == 1;
filescs = [filescs,file_name];
else
filescwf = [filescwf,file_name];
end
file_number = file_number + 1
end
The idea here is that strcmp(file_name,'*cs.xlsx') checks if file_name contains 'cs' at the end. If it does, it is put into filescs, if it doesn't it is put into filescwf. However, this does not seem to work...
Any thoughts?
strcmp(file_name,'*cs.xlsx') checks whether file_name is identical to *cs.xlsx. If there is no file by that name (hint: few file systems allow '*' as part of a file name), it will always be false. (btw: there is no need for the '==1' comparison or the semicolon on the respective line)
You can use array indexing here to extract the relevant part of the filename you want to compare. file_name(1:5), will give you the first 5 characters, file_name(end-5:end) will give you the last 6, for example.
strcmp doesn't work with wildcards such as *cs.xlsx. See this question for an alternative approach.
You can use regexp to check the final letters of each of your files, then cellfun to apply regexp to all your filenames.
Here, getIndex will have 1's for all the files ending with cs.xlsx. The (?=$) part makes sure that cs.xlsx is at the end.
files = dir('*.xlsx')
filenames = {files.name}'; %get filenames
getIndex = ~cellfun(@isempty,regexp(filenames, 'cs\.xlsx(?=$)'));
list1 = filenames(getIndex);
list2 = filenames(~getIndex);
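
On MATLAB R2016b or later, endsWith offers another way to build this kind of index without a regular expression. A small sketch, using made-up file names in place of the dir output:

```matlab
% Hypothetical file names; in practice use filenames = {files.name}'.
filenames = {'run1_cs.xlsx'; 'run1_cwf.xlsx'; 'run2_cs.xlsx'};

% Logical index: true where the name ends with 'cs.xlsx'.
isCs = endsWith(filenames, 'cs.xlsx');

filescs  = filenames(isCs);    % names ending in cs.xlsx
filescwf = filenames(~isCs);   % all other names
```

Keeping the names in cell arrays also avoids the problem in the question's loop, where [filescs,file_name] concatenates the names into one long character string.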