Getting unnecessary stop words from text in matlab? - matlab

stopwords_cellstring={'a','s', 'about', 'above', 'above', 'across', 'after', ...
'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', ...
'already', 'also','although','always','am','among', 'amongst', 'amoungst', ...
'amount', 'an', 'and', 'another', 'any','anyhow','anyone','anything','anyway', ...
'anywhere', 'are', 'around', 'as', 'at', 'back','be','became', 'because','become',...
'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below',...
'beside', 'besides', 'between', 'beyond', 'bill','all','When', 'both', 'bottom','but', 'by',...
'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de',...
'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight',...
'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', ...
'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',...
'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found',...
'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt',...
'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', ...
'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if',...
'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last',...
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile',...
'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',...
'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine',...
'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off',...
'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise',...
'our', 'ours', 'ourselves', 'out', 'over', 'own','part', 'per', 'perhaps', 'please',...
'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious',...
'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so',...
'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', ...
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',...
'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', ...
'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those',...
'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',...
'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up',...
'upon', 'us', 'very', 'via', 'was', 'we','all', 'well', 'were','uses','way','went', 'what', 'whatever', 'when',...
'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein',...
'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',...
'whole','When', 'whom', 'whose','saw', 'why', 'will', 'with', 'within', 'without', 'would', 'yet',...
'you', 'your', 'yours', 'yourself', 'yourselves', 'the'};
Above are the stop words that I am using to process a text file and remove them from it, but some words are still not eliminated. They are listed below:
[3] 'The'
[6] 'â'
[5] 's'
Can anyone help me eliminate these words? The code I'm using is as follows:
str1 = F
split1 = regexp(str1,'\s','Split');
out_str1 = strjoin(split1(~ismember(split1,stopwords_cellstring)),' ')
C = regexp(out_str1, '<s>|\w*|</s>', 'match');

Concentrating only on the word filtering: if you want the match to be case insensitive, ~ismember alone will not work, because ismember compares strings case-sensitively (so 'The' does not match 'the'). Here is a solution that filters stop words without being case sensitive:
str1 = F
split1 = regexp(str1,'\s','Split');
%%%%
logic = arrayfun(@(word)any(strcmpi(word, stopwords_cellstring)), split1);
out_str1 = strjoin(split1(~logic), ' ');
%%%%
C = regexp(out_str1, '<s>|\w*|</s>', 'match');
How it works:
I have replaced the ismember test with arrayfun. For each word in split1, arrayfun checks whether this word matches any of the words in stopwords_cellstring using strcmpi (case insensitive).
NB1: As for s, it is not part of your stop list.
NB2: As for â, I don't know of a quick way to convert it to a (i.e. to strip the accent). An easy solution would be to add it to your list. If it can also occur inside longer words such as âmer, I would instead advise you to prefilter str1 with str1 = strrep(str1, 'â', 'a').
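As a possible alternative to the arrayfun approach, the comparison can also be made case insensitive by lower-casing both sides before calling ismember. The following is only a minimal sketch, not the code from the answer above; the sample sentence and the shortened stopwords list are made up for illustration, and strjoin is assumed to be available (built in or from the File Exchange).

```matlab
% Hypothetical sample input; in the question the text comes from the variable F.
str1 = 'The cat sat on the mat';
% Shortened stand-in for stopwords_cellstring from the question.
stopwords = {'the', 'on'};

% Split on runs of whitespace so repeated spaces do not create empty tokens.
words = regexp(str1, '\s+', 'split');

% Lower-case both sides so that 'The' and 'the' compare as equal.
isStop = ismember(lower(words), lower(stopwords));

% Keep the non-stop words in their original spelling and rejoin them.
out_str1 = strjoin(words(~isStop), ' ');
disp(out_str1)   % displays: cat sat mat
```

Only the comparison is done in lower case, so the words that are kept retain their original capitalisation.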

Related

Material Ui import gives error "Module build failed: Error: ENOENT: no such file or directory"

I am building a Next.js app. Everything went well until I imported material-ui into the project.
With material-ui, it repeatedly gives this error:
./node_modules/@emotion/styled/dist/styled.browser.esm.js
Module build failed: Error: ENOENT: no such file or directory, open
'D:\ReactProjects\ace\node_modules\@emotion\styled\dist\styled.browser.esm.js'
I deleted material-ui and styled and re-installed them, but to no effect. Does anyone know the reason?
Can you try creating a file named styled.browser.esm.js, pasting the following into it, and placing it in D:\ReactProjects\ace\node_modules\@emotion\styled\dist\:
import '@babel/runtime/helpers/extends';
import 'react';
import '@emotion/is-prop-valid';
import createStyled from '../base/dist/emotion-styled-base.browser.esm.js';
import '@emotion/react';
import '@emotion/utils';
import '@emotion/serialize';
var tags = ['a', 'abbr', 'address', 'area', 'article', 'aside', 'audio', 'b', 'base', 'bdi', 'bdo', 'big', 'blockquote', 'body', 'br', 'button', 'canvas', 'caption', 'cite', 'code', 'col', 'colgroup', 'data', 'datalist', 'dd', 'del', 'details', 'dfn', 'dialog', 'div', 'dl', 'dt', 'em', 'embed', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'header', 'hgroup', 'hr', 'html', 'i', 'iframe', 'img', 'input', 'ins', 'kbd', 'keygen', 'label', 'legend', 'li', 'link', 'main', 'map', 'mark', 'marquee', 'menu', 'menuitem', 'meta', 'meter', 'nav', 'noscript', 'object', 'ol', 'optgroup', 'option', 'output', 'p', 'param', 'picture', 'pre', 'progress', 'q', 'rp', 'rt', 'ruby', 's', 'samp', 'script', 'section', 'select', 'small', 'source', 'span', 'strong', 'style', 'sub', 'summary', 'sup', 'table', 'tbody', 'td', 'textarea', 'tfoot', 'th', 'thead', 'time', 'title', 'tr', 'track', 'u', 'ul', 'var', 'video', 'wbr', // SVG
'circle', 'clipPath', 'defs', 'ellipse', 'foreignObject', 'g', 'image', 'line', 'linearGradient', 'mask', 'path', 'pattern', 'polygon', 'polyline', 'radialGradient', 'rect', 'stop', 'svg', 'text', 'tspan'];
var newStyled = createStyled.bind();
tags.forEach(function (tagName) {
// $FlowFixMe: we can ignore this because its exposed type is defined by the CreateStyled type
newStyled[tagName] = newStyled(tagName);
});
export default newStyled;
I know this is super weird, but it seems like npm is not putting all the files into the folders.
I resolved an issue like this by taking the file manually from a repo and placing it in my folder.

I cannot create a table in Kafka KSQL

This is my SQL; I don't know what's wrong with it:
CREATE TABLE memo_src (
'id' VARCHAR PRIMARY KEY,
'title' VARCHAR,
'cover_image' VARCHAR
) WITH (
KAFKA_TOPIC = 'memo.memorandum.memorandum_info',
VALUE_FORMAT='AVRO'
);
It then gives me the following error:
extraneous input ''id'' expecting {'EMIT', 'CHANGES', 'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'ZONE', 'PARTITION', 'STRUCT', 'EXPLAIN', 'ANALYZE', 'TYPE', 'TYPES', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'PARTITIONS', 'FUNCTIONS', 'FUNCTION', 'ARRAY', 'MAP', 'SET', 'RESET', 'SESSION', 'KEY', 'SINK', 'SOURCE', 'IF', IDENTIFIER, DIGIT_IDENTIFIER, QUOTED_IDENTIFIER, BACKQUOTED_IDENTIFIER}
Statement: CREATE TABLE memo_src (
'id' VARCHAR PRIMARY KEY,
'title' VARCHAR,
'cover_image' VARCHAR
) WITH (
KAFKA_TOPIC = 'memo.memorandum.memorandum_info',
VALUE_FORMAT='AVRO'
);
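Your id, title and cover_image column names are not supposed to be in single quotation marks. In KSQL, single quotes delimit string literals, which is why the parser states that it did not expect the token 'id' where an identifier belongs. Write the column names unquoted.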

Cannot create stream with KSQL

I am trying to create a stream using ksql.
ksql> CREATE STREAM fakeData22 (Id VARCHAR, category VARCHAR, timeStamp VARCHAR, deviceID INTEGER, properties MAP<VARCHAR, VARCHAR>) WITH (KAFKA_TOPIC='fake-data-19', VALUE_FORMAT='JSON');
I get the following output. Am I missing something?
line 1:94: extraneous input 'properties' expecting {'ADD', 'APPROXIMATE', 'AT', 'CONFIDENCE', 'NO', 'SUBSTRING', 'POSITION', 'TINYINT', 'SMALLINT', 'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'ZONE', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'VIEW', 'REPLACE', 'GRANT', 'REVOKE', 'PRIVILEGES', 'PUBLIC', 'OPTION', 'EXPLAIN', 'ANALYZE', 'FORMAT', 'TYPE', 'TEXT', 'GRAPHVIZ', 'LOGICAL', 'DISTRIBUTED', 'TRY', 'SHOW', 'TABLES', 'SCHEMAS', 'CATALOGS', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'TO', 'SYSTEM', 'BERNOULLI', 'POISSONIZED', 'TABLESAMPLE', 'RESCALED', 'ARRAY', 'MAP', 'SET', 'RESET', 'SESSION', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'WORK', 'ISOLATION', 'LEVEL', 'SERIALIZABLE', 'REPEATABLE', 'COMMITTED', 'UNCOMMITTED', 'READ', 'WRITE', 'ONLY', 'CALL', 'NFD', 'NFC', 'NFKD', 'NFKC', 'IF', 'NULLIF', 'COALESCE', IDENTIFIER, DIGIT_IDENTIFIER, QUOTED_IDENTIFIER, BACKQUOTED_IDENTIFIER}
Caused by: line 1:94: extraneous input 'properties' expecting {'ADD', 'APPROXIMATE', 'AT', 'CONFIDENCE', 'NO', 'SUBSTRING', 'POSITION', 'TINYINT', 'SMALLINT', 'INTEGER', 'DATE', 'TIME', 'TIMESTAMP', 'INTERVAL', 'YEAR', 'MONTH', 'DAY', 'HOUR', 'MINUTE', 'SECOND', 'ZONE', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'VIEW', 'REPLACE', 'GRANT', 'REVOKE', 'PRIVILEGES', 'PUBLIC', 'OPTION', 'EXPLAIN', 'ANALYZE', 'FORMAT', 'TYPE', 'TEXT', 'GRAPHVIZ', 'LOGICAL', 'DISTRIBUTED', 'TRY', 'SHOW', 'TABLES', 'SCHEMAS', 'CATALOGS', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'TO', 'SYSTEM', 'BERNOULLI', 'POISSONIZED', 'TABLESAMPLE', 'RESCALED', 'ARRAY', 'MAP', 'SET', 'RESET', 'SESSION', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'WORK', 'ISOLATION', 'LEVEL', 'SERIALIZABLE', 'REPEATABLE', 'COMMITTED', 'UNCOMMITTED', 'READ', 'WRITE', 'ONLY', 'CALL', 'NFD', 'NFC', 'NFKD', 'NFKC', 'IF', 'NULLIF', 'COALESCE', IDENTIFIER, DIGIT_IDENTIFIER, QUOTED_IDENTIFIER, BACKQUOTED_IDENTIFIER}
Because KSQL doesn't support backtick escaping.
The workaround that actually works is to declare the column in uppercase with backtick escaping:
CREATE STREAM fakeData22 \
(Id VARCHAR, category VARCHAR, timeStamp VARCHAR, deviceID INTEGER, \
`PROPERTIES` MAP<VARCHAR, VARCHAR>) \
WITH (KAFKA_TOPIC='fake-data-19', \
VALUE_FORMAT='JSON');
and query like below.
ksql> select `PROPERTIES` from fakeData22;
Related KSQL GitHub issue: Support escaping identifier names (#677).
I think properties must be a reserved word in KSQL. I've added this to an issue on the KSQL project for us to track, but in the meantime please try enclosing properties in backticks:
CREATE STREAM fakeData22 \
(Id VARCHAR, category VARCHAR, timeStamp VARCHAR, deviceID INTEGER, \
`properties` MAP<VARCHAR, VARCHAR>) \
WITH (KAFKA_TOPIC='fake-data-19', \
VALUE_FORMAT='JSON');

Removing stop words from single string

My string is string = 'Alligator in water', where in is a stop word. How can I remove it so that I get stop_remove = 'Alligator water' as output? I have tried ismember, but it returns an integer value for the matching word; I want to get the remaining words as output.
in is just an example, I'd like to remove all possible stop words.
A slightly more elegant way than Luis Mendo's solution is to use regexprep, which does exactly what you want:
>> result = regexprep( 'Alligator in water', 'in\s*', '' ); % replace with an empty string
result =
Alligator water
If you have several stop words you can simply add them to the pattern (in this example I consider 'in' and 'near' as stop words):
>> result = regexprep( 'Alligator in water near land', {'in\s*','near\s*'}, '' )
result =
Alligator water land
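One caveat with a bare pattern such as 'in\s*' is that it also matches inside longer words (for example the "in" in "swimming"). A possible refinement, sketched here under the assumption that stop words should only be removed as whole words, is to anchor the pattern with MATLAB's word-boundary operators \< and \>. The sample sentence and the variable names are illustrative only.

```matlab
% Hypothetical sample sentence containing "in" inside a longer word.
str = 'Alligator swimming in water near land';
stopwords = {'in', 'near'};

% Build one alternation pattern: \<(in|near)\>\s*
% \< and \> match the start and end of a word, so "swimming" is left intact.
pattern = ['\<(' strjoin(stopwords, '|') ')\>\s*'];

% Replace every whole-word match (plus trailing whitespace) with nothing.
result = regexprep(str, pattern, '');
disp(result)   % displays: Alligator swimming water land
```

If a stop word can be the last word of the string, a trailing space remains; wrapping the result in strtrim removes it. Passing 'ignorecase' as an extra option to regexprep makes the match case insensitive.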
Use this for removing all stop-words.
Code
% Source of stopwords- http://norm.al/2009/04/14/list-of-english-stop-words/
stopwords_cellstring={'a', 'about', 'above', 'above', 'across', 'after', ...
'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', ...
'already', 'also','although','always','am','among', 'amongst', 'amoungst', ...
'amount', 'an', 'and', 'another', 'any','anyhow','anyone','anything','anyway', ...
'anywhere', 'are', 'around', 'as', 'at', 'back','be','became', 'because','become',...
'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below',...
'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom','but', 'by',...
'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de',...
'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight',...
'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', ...
'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty',...
'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found',...
'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt',...
'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', ...
'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if',...
'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last',...
'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile',...
'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must',...
'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine',...
'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off',...
'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise',...
'our', 'ours', 'ourselves', 'out', 'over', 'own','part', 'per', 'perhaps', 'please',...
'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious',...
'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so',...
'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', ...
'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them',...
'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', ...
'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those',...
'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too',...
'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up',...
'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when',...
'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein',...
'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever',...
'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet',...
'you', 'your', 'yours', 'yourself', 'yourselves', 'the'};
str1 = 'Alligator in water of the pool'
split1 = regexp(str1,'\s','Split');
out_str1 = strjoin(split1(~ismember(split1,stopwords_cellstring)),' ')
Output
str1 =
Alligator in water of the pool
out_str1 =
Alligator water pool
NOTE: This code uses strjoin from the MathWorks File Exchange (MATLAB R2013a and later ship strjoin as a built-in function).
Use regexp:
string = 'Alligator in water'; %// data string
result = regexp(string, 'in\s*', 'split'); %// split according to stop word
result = [result{:}]; %// join remaining pieces

Creating multiple elements with the same set of options in Zend Framework

Hi, I have to create several multiselect elements in Zend with the same options, i.e.
$B1 = new Zend_Form_Element_Multiselect('Rating');
$B1->setLabel('B1Rating')
->setMultiOptions(
array(
'NULL' => "Select",
'1' => '1', '2' => '2', '3' => '3', '4' => '4', '5' => '5'))
->setRequired(TRUE)
->addValidator('NotEmpty', true);
$B1->setValue(array('NULL'));
$B1->size = 5;
$this->addElement($B1);
Now I have to create 5 elements of the same type but with different labels, and I don't want to copy the whole code 5 times. Is there any way to do this without copy-pasting the code?
About three different ways spring to mind. Here's the simplest one:
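$B2 = clone $B1;
// Give the clone its own name ('Rating2' is an assumed name) so it does not replace $B1 when added to the form
$B2->setName('Rating2');
$B2->setLabel('B2Rating');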
Another approach:
$options = array(
'required' => true,
'validators' => array('NotEmpty'),
'value' => null,
'size' => 5,
'multiOptions' => array(
'NULL' => "Select",
'1' => '1', '2' => '2', '3' => '3', '4' => '4', '5' => '5'),
);
$B1 = new Zend_Form_Element_Multiselect('Rating', $options);
$B1->setLabel('B1Rating');
$this->addElement($B1);
$B2 = new Zend_Form_Element_Multiselect('Rating2', $options);
$B2->setLabel('B2Rating');
$this->addElement($B2);
And so on...
Because there's never a limit on the number of ways you can accomplish a certain goal, here's another solution:
$ratingLabels = array('Rating 1', 'Rating 2', 'Rating 3');
foreach($ratingLabels as $index => $ratingLabel) {
$this->addElement('multiselect', 'rating' . (++$index), array(
'required' => true,
'label' => $ratingLabel,
'value' => 'NULL',
'size' => 5,
'multiOptions' => array(
'NULL' => 'Select',
'1' => '1', '2' => '2', '3' => '3', '4' => '4', '5' => '5'
),
));
}