MATLAB: Count punctuation marks in table columns - matlab

I'm trying to find the number of sentences in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
As the full stops show, there is one sentence in column 1 and two sentences in column 3.
In the end I want a table that contains only the punctuation marks that end a sentence ("." or "?" or "!"), with placeholders such as "" to keep the columns the same length, so that I can count the sentence-ending punctuation marks in each column.
This is my code (not yet working):
EqualCoumns = [2:2:max(width(Table_RandomInfoMiddle))];
for t=EqualCoumns
MiddleOnlySentenceIndicators = Table_RandomInfoMiddle((Table_RandomInfoMiddle{:, t}=='punctuation'),:);
%Remove all but "!.?", which are the only sentence enders
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ',', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ';', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == ':', :) = [];
MiddleOnlySentenceIndicators(MiddleOnlySentenceIndicators{:, t} == '-', :) = [];
MiddleSentence_Nr(t) = height(MiddleOnlySentenceIndicators);
end
Right now this gives almost the right results, but there is a small mistake somewhere.
In the answer I would like to request only one thing: that I can access the results in the same table-like form. It should look something like this (edited):
Any help will be appreciated.
Thank you!

If we use the table from my previous answer, t, we can use the following solution:
punctuation_table = table();
for col=1:size(t,2)
column_name = sprintf('Punctuation count for column %d',col);
punctuation_table.(column_name) = nnz(ismember(t(:,col).Variables,{'?',',','.','!'}));
end
which will create a table like this:
punctuation_table =
1×4 table
Punctuation count for column 1 Punctuation count for column 2 Punctuation count for column 3 Punctuation count for column 4
______________________________ ______________________________ ______________________________ ______________________________
2 0 2 0
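The loop above also counts commas. If only sentence enders should be counted, as the question asks, the same idea can be sketched as follows. The table T and its variable names are hypothetical stand-ins, since the real table is only available as a download; each column is assumed to hold one token per row as a cell array of character vectors.

```matlab
% Hypothetical stand-in for the real table: one token per row in each column.
T = table({'Hi'; '.'; 'Bye'; '!'}, {'a'; 'b'; 'c'; 'd'}, {'Why'; '?'; 'So'; ','}, ...
          'VariableNames', {'Text1', 'Tag1', 'Text2'});

enders = {'.', '?', '!'};          % only these end a sentence
counts = zeros(1, width(T));
for col = 1:width(T)
    % T{:, col} extracts the column as a cell array of character vectors
    counts(col) = nnz(ismember(T{:, col}, enders));
end

% Put the counts back into a one-row table with the original column names.
result = array2table(counts, 'VariableNames', T.Properties.VariableNames);
```

Here the comma in the third column is ignored, so the counts per column are 2, 0 and 1.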

Related

LibreOffice Calc Range Max and delete macro

I have a sheet in LibreOffice Calc which has an ID column with incremental values from 1 to N.
I need to create a macro in VBA (linked to a button I will create later) that selects the last ID (which is also the max ID) and deletes the entire row for that ID.
I tried this so far:
Sub suppression
dim maxId as Integer
my_range = ThisComponent.Sheets(0).getCellRangebyName("B19:B1048576")
maxId = Application.WorksheetFunction.Max(range("Dépôts!B19:B1048576"))
MsgBox maxId
End Sub
Thanks a lot for your help.
In LibreOffice BASIC you first need to get the data array of the cell range. This is an array of arrays, each representing a row of the cell range. It is indexed from zero irrespective of the location of the cell range within the sheet. Because your cell range is one column wide, each member array has only one member, which is at index zero.
As Jim K says, 'Application.WorksheetFunction' is from VBA. It is possible to use worksheet functions in LibreOffice BASIC, but these act on ordinary arrays rather than cell arrays, and the MAX function takes a one-dimensional array so it would be necessary to first reshape the data array using a loop. Furthermore, if you want to delete the row corresponding to the maximum value you are then faced with the problem of finding the index of that row using only the value itself.
It is much simpler to find the index by looping over the data array as shown in the snippet below.
Also, rather than traversing over a million rows, it would save computational effort to obtain the last used row of the spreadsheet via the BASIC function 'GetLastUsedRow(oSheet as Object)', which is supplied with LibreOffice. This is located in the 'Tools' library in 'LibreOffice Macros & Dialogs'. To use it you have to put the statement: 'Globalscope.BasicLibraries.LoadLibrary("Tools")' somewhere before you call the function.
To delete the identified row, get the XTableRows interface of the spreadsheet and call its removeByIndex() function.
The following snippet assumes that the header row of your table is in row 18 of the sheet, as suggested by your example code; that is row 17 when numbered from zero.
Sub suppression()
' Specify the position of the index range
''''''''''''''''''''''''''''''''''''
Dim nIndexColumn As Long '
nIndexColumn = 1 '
'
Dim nHeaderRow As Long '
nHeaderRow = 17 '
'
''''''''''''''''''''''''''''''''''''
Dim oSheet as Object
oSheet = ThisComponent.getSheets().getByIndex(0)
' Instead of .getCellRangebyName("B19:B1048576") use:
Globalscope.BasicLibraries.LoadLibrary("Tools")
Dim nLastUsedRow As Long
nLastUsedRow = GetLastUsedRow(oSheet)
Dim oCellRange As Object
' Left Top Right Bottom
oCellRange = oSheet.getCellRangeByPosition(nIndexColumn, nHeaderRow, nIndexColumn, nLastUsedRow)
' getDataArray() returns an array of arrays, each representing a row.
' It is indexed from zero, irrespective of where oCellRange is located
' in the sheet
Dim data() as Variant
data = oCellRange.getDataArray()
Dim max as Double
max = data(1)(0)
' First ID number is in row 1 (row 0 contains the header).
Dim rowOfMaxInArray As Long
rowOfMaxInArray = 1
Dim i As Long, x As Double
For i = 2 To UBound(data)
x = data(i)(0)
If x > max Then
max = x
rowOfMaxInArray = i
End If
Next i
' if nHeaderRow = 0, i.e. the first row in the sheet, you could save a
' couple of lines by leaving the next statement out
Dim rowOfMaxInSheet As long
rowOfMaxInSheet = rowOfMaxInArray + nHeaderRow
oSheet.getRows().removeByIndex(rowOfMaxInSheet, 1)
End Sub

Extract only words from a cell array in matlab

I have a set of documents containing pre-processed texts from html pages. They are already given to me. I want to extract only the words from it. I do not want any numbers or common words or any single letters to be extracted. The first problem I am facing is this.
Suppose I have a cell array :
{'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'}
I want to make the cell array having only the words - like this.
{'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'}
And then convert this to this cell array
{'thanks' 'dogsbreath' 'endif' 'if'}
Is there any way to do this ?
Updated Requirement : Thanks to all of your answers. However I am facing a problem ! Let me illustrate this (Please note that the cell values are extracted text from HTML documents and hence may contain non ASCII values) -
{'!/bin/bash' '![endif]' '!take-a-long' '!–photo'}
This gives me the answer
{'bin' 'bash' 'endif' 'take' 'a' 'long' 'â' 'photo' }
My Questions:
Why are bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, i.e. can all words coming from a single cell be combined into one?
Notice that in '!–photo' there is a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
Also, the text "2. areoplane 3. cactus 4. a_rinny_boo... 5. trumpet 6. window 7. curtain ... 173. gypsy_wagon..." returns the words 'areoplane' 'cactus' 'a_rinny_boo' 'trumpet' 'window' 'curtain' 'gypsy_wagon'. I want the words 'a_rinny_boo' and 'gypsy_wagon' to become 'a' 'rinny' 'boo' 'gypsy' 'wagon'. Can this be done?
Update 1 Following all the suggestions I have written down a function which does most of the things except the above two newly asked questions.
function [Text_Data] = raw_txt_gn(filename)
% This function will convert the text documents into raw text
% It will remove all commas empty cells and other special characters
% It will also convert all the words of the text documents into lowercase
T = textread(filename, '%s');
% find all the important indices
ind1=find(ismember(T,':WebpageTitle:'));
T1 = T(ind1+1:end,1);
% Remove things which are not basically words
not_words = {'##','-',':ImageSurroundingText:',':WebpageDescription:',':WebpageKeywords:',' '};
T2 = []; count = 1;
for j=1:length(T1)
x = T1{j};
ind=find(ismember(not_words,x), 1);
if isempty(ind)
B = regexp(x, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
% convert the string into lowercase
% so that while generating the features the case sensitivity is
% handled well
x = lower(B);
T2{count,1} = x;
count = count+1;
end
end
T2 = T2(~cellfun('isempty',T2));
% Getting the common words in the english language
% found from Wikipedia
not_words2 = {'the','be','to','of','and','a','in','that','have','i'};
not_words2 = [not_words2, 'it' 'for' 'not' 'on' 'with' 'he' 'as' 'you' 'do' 'at'];
not_words2 = [not_words2, 'this' 'but' 'his' 'by' 'from' 'they' 'we' 'say' 'her' 'she'];
not_words2 = [not_words2, 'or' 'an' 'will' 'my' 'one' 'all' 'would' 'there' 'their' 'what'];
not_words2 = [not_words2, 'so' 'up' 'out' 'if' 'about' 'who' 'get' 'which' 'go' 'me'];
not_words2 = [not_words2, 'when' 'make' 'can' 'like' 'time' 'no' 'just' 'him' 'know' 'take'];
not_words2 = [not_words2, 'people' 'into' 'year' 'your' 'good' 'some' 'could' 'them' 'see' 'other'];
not_words2 = [not_words2, 'than' 'then' 'now' 'look' 'only' 'come' 'its' 'over' 'think' 'also'];
not_words2 = [not_words2, 'back' 'after' 'use' 'two' 'how' 'our' 'work' 'first' 'well' 'way'];
not_words2 = [not_words2, 'even' 'new' 'want' 'because' 'any' 'these' 'give' 'day' 'most' 'us'];
for j=1:length(T2)
x = T2{j};
% if a particular cell contains any digits then make it empty
if sum(isstrprop(x, 'digit'))~=0
T2{j} = [];
end
% also remove single character cells
if length(x)==1
T2{j} = [];
end
% also remove the most common words from the dictionary
% the common words are taken from the English dictionary (source
% wikipedia)
ind=find(ismember(not_words2,x), 1);
if ~isempty(ind)
T2{j} = [];
end
end
Text_Data = T2(~cellfun('isempty',T2));
Update 2
I found this code in here which tells me how to check for non-ascii characters. Incorporating this code snippet in Matlab as
% remove the non-ascii characters
if ~all(x < 128)
T2{j} = [];
end
and then removing the empty cells it seems my second requirement is fulfilled though the text containing a part of non-ascii characters completely disappears.
Can my final requirements be completed ? Most of them concerns the character '_' and '-'.
A regexp approach to go directly to the final step:
A = {'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'};
B = regexp(A, '\w*', 'match');
B(cellfun('isempty', B)) = []; % Clean out empty cells
B = [B{:}]; % Flatten cell array
Which matches any alphabetic, numeric, or underscore character. For the sample case we get a 1x4 cell array:
B =
'thanks' 'dogsbreath' 'endif' 'if'
Edit:
Why are bin/bash and take-a-long being separated into three cells? It's not a problem for me, but still, why? Can this be avoided, i.e. can all words coming from a single cell be combined into one?
Because I'm flattening the cell arrays to remove nested cells. If you remove B = [B{:}]; each cell will have a nested cell inside containing all of the matches for the corresponding input cell. You can combine these however you want afterwards.
Notice that in '!–photo' there is a non-ASCII character â, which essentially means a. Can a step be incorporated so that this transformation is automatic?
Yes, you'll have to make it based on the character codes.
I noticed that the text "it? __________ About the Author:" gives me "__________" as a word. Why is this so?
As I said, the regex matches alphabetic, numeric, or underscore characters. You can change your filter to exclude _, which will also address the fourth bullet point: B = regexp(A, '[a-zA-Z0-9]*', 'match'); This will match a-z, A-Z, and 0-9 only. This will also exclude the non-ASCII characters, which the \w class appears to match.
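The points above can be sketched together as follows. This is a minimal illustration, not the only way to do it: the sample cells are hypothetical, the class is narrowed to letters only (so digits are dropped too, which the question asks for), and char(8211) is an en dash standing in for the non-ASCII character.

```matlab
% Hypothetical sample cells; char(8211) is an en dash standing in for the
% non-ASCII character mentioned in the question.
A = {'!/bin/bash', '![endif]', 'a_rinny_boo...', '__________', ['!' char(8211) 'photo']};

% Letters only: underscores, digits, punctuation and non-ASCII characters
% all act as separators. Each input cell yields its own nested cell.
nested = regexp(A, '[a-zA-Z]+', 'match');
nested(cellfun('isempty', nested)) = [];   % drop cells that contain no words

% Option 1: keep the words of each input cell together, space-separated.
joined = cellfun(@(c) strjoin(c, ' '), nested, 'UniformOutput', false);

% Option 2: flatten everything into one list of individual words.
flat = [nested{:}];
```

With these inputs, joined is {'bin bash', 'endif', 'a rinny boo', 'photo'} and flat is {'bin', 'bash', 'endif', 'a', 'rinny', 'boo', 'photo'}. Single letters such as 'a' would still need to be removed by the later filtering step in the question's function.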
I think @excaza's solution would be the go-to approach, but here's an alternative one with isstrprop using its optional input argument 'alpha' to look for alphabetic characters -
A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
Sample run -
>> A
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
>> A(cellfun(@(x) any(isstrprop(x, 'alpha')), A))
ans =
'!!!!thanks' '!!dogsbreath' '!--[endif]--' '!--[if'
To get to the final destination, you can tweak this approach a bit, like so -
B = cellfun(@(x) x(isstrprop(x, 'alpha')), A,'Uni',0);
out = B(~cellfun('isempty',B))
Sample run -
A =
'!' '!!' '!!!!)' '!!!!thanks' '!!dogsbreath' '!)' '!--[endif]--' '!--[if'
out =
'thanks' 'dogsbreath' 'endif' 'if'

How do I compute the number of times a character appears in a character array in MATLAB? [duplicate]

I want to determine the number of times a character appears in a character array, excluding the time it appears at the last position.
How would I do this?
In MATLAB, all variables are arrays, and a string written in single quotes is stored as an array of type char, so a character array and a string are the same thing here. That means you can use ordinary array indexing and comparison on it. To count the occurrences of a character, ignoring the last position, in a character array named yourStringVar, you can do this:
% Take the substring without the last character, since that position is to be ignored
YourSubString = yourStringVar(1:end-1);
% Replace 'c' with the single character you want to count
numberOfOccurrence = length(find(YourSubString == 'c'))
It has been pointed out by Ray that length(find()) is not a good approach due to various reasons. Alternatively you could do:
numberOfOccurrence = nnz(YourSubString == 'c') % 'c' is the single character you want to count
numberOfOccurrence will give you your desired result.
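As a concrete run of the approach above, here is a small sketch with a hypothetical input string and target character:

```matlab
s = 'banana';            % hypothetical input character array
target = 'a';            % the single character to count

body = s(1:end-1);       % 'banan': everything except the last position
n = nnz(body == target); % element-wise comparison, then count the matches

total = nnz(s == target); % for comparison: count including the last position
```

Here total is 3, while n is 2 because the final 'a' is excluded.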
What you can do is map each character into a unique integer ID, then determine the count of each character through histcounts. Use unique to complete the first step. The first output of unique will give you a list of all possible unique characters in your string. If you want to exclude the last time each character occurs in the string, just subtract 1 from the total count. Assuming S is your character array:
%// Get all unique characters and assign them to a unique ID
[unq,~,id] = unique(S);
%// Count how many times each ID occurs (one bin per integer ID) and subtract 1
counts = histcounts(id, 'BinMethod', 'integers') - 1;
%// Show table of occurrences with characters
T = table(cellstr(unq(:)), counts.', 'VariableNames', {'Character', 'Counts'});
The last piece of code displays everything in a nice table. We ensure that the unique characters are placed as individual cells in a cell array.
Example:
>> S = 'ABCDABABCDEFFGACEG';
Running the above code, we get:
>> T
T =
Character Counts
_________ ______
'A' 3
'B' 2
'C' 2
'D' 1
'E' 1
'F' 1
'G' 1

Delete first two characters of an alphanumeric sequence in a column

I'm trying to scan a column for string length. If an alphanumeric string is 12 characters long, its first two characters should be deleted. Those two characters will always be the digits 10. See below.
H063088955
F243066424
10G403085387
F253066457
E473057375
G503087343
10H303098124
G093075912
G433084322
10G403085388
Select the cells you wish to process and run:
Sub qwerty()
For Each r In Selection
v = r.Text
If Len(v) = 12 Then
r.Value = Mid(v, 3)
End If
Next r
End Sub

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check a paragraph for spelling errors, but I am having trouble removing the punctuation. There may be other problems as well; it returns the error below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(@isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches every non-word character, i.e. anything other than letters, digits, and underscore (including spaces), and each match is replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. At every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that the quote character ' is removed (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or, instead of \W, you can list all the non-alphanumeric characters to be removed in square brackets.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
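To see what this last pattern does end to end, here is a small sketch on a hypothetical two-line text (the real input comes from the question's text file). It keeps apostrophes inside words, turns every other separator into a newline, then splits and drops the empty cells.

```matlab
% Hypothetical sample text with apostrophes, punctuation and a line break.
CharData = sprintf('I''m here.\nDon''t stop, please!');

% Replace everything except letters, digits, apostrophe and underscore
% with a newline, then split on newlines and drop the empty pieces.
word_punctuation = regexprep(CharData, '[^A-Za-z0-9''_]', '\n');
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];

% Equivalent shortcut: match the runs of allowed characters directly.
word_direct = regexp(CharData, '[A-Za-z0-9''_]+', 'match');
```

Both word and word_direct come out as {'I''m', 'here', 'Don''t', 'stop', 'please'}, so the contractions survive as single words.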