Perl part-of-speech tagging: need tag set for Lingua::EN::Tagger - perl

So, I want to use Lingua::EN::Tagger, but I can't seem to find the domain of tags (aka the parts of speech: ie MD, NN, etc.
I found something similar to what I want here: http://engtagger.rubyforge.org/
but this is for ruby, not perl, and I'm not sure if the tags are going to be the exact same set.
Thanks in advance.

The tag set is available in the readme from Lingua0EN-Tagger-0.23 (http://cpansearch.perl.org/src/ACOBURN/Lingua-EN-Tagger-0.23/README)
Lingua::EN::Tagger
This module uses part-of-speech statistics from the Penn Treebank
to assign POS tags to English text. The tagger applies a bigram (two-word)
Hidden Markov Model to guess the appropriate POS tag for a word. That means
that the tagger will try to assign a POS tag based on the known POS tags
for a given word and the POS tag assigned to its predecessor.
The tagger tends to assume unknown words are nouns, but this behavior is
configurable.
The POS tagger can also be used to find maximal noun phrases in tagged text.
You can also use this module to extract all nouns and/or noun phrases.
TAG SET
----------------------------------------------------------------
The set of POS tags used here is a modified version of the
Penn Treebank tagset. Tags with non-letter characters have been
redefined to work better in our data structures. Also, the
``Determiner'' tag (DET) has been changed from `DT', in order to
avoid confusion with the HTML tag, <DT>.
-----------------------------------------------------------------
CC Conjunction, coordinating and, or
CD Adjective, cardinal number 3, fifteen
DET Determiner this, each, some
EX Pronoun, existential there there
FW Foreign words
IN Preposition / Conjunction for, of, although, that
JJ Adjective happy, bad
JJR Adjective, comparative happier, worse
JJS Adjective, superlative happiest, worst
LS Symbol, list item A, A.
MD Verb, modal can, could, 'll
NN Noun aircraft, data
NNP Noun, proper London, Michael
NNPS Noun, proper, plural Australians, Methodists
NNS Noun, plural women, books
PDT Determiner, prequalifier quite, all, half
POS Possessive 's, '
PRP Determiner, possessive second mine, yours
PRPS Determiner, possessive their, your
RB Adverb often, not, very, here
RBR Adverb, comparative faster
RBS Adverb, superlative fastest
RP Adverb, particle up, off, out
SYM Symbol *
TO Preposition to
UH Interjection oh, yes, mmm
VB Verb, infinitive take, live
VBD Verb, past tense took, lived
VBG Verb, gerund taking, living
VBN Verb, past/passive participle taken, lived
VBP Verb, base present form take, live
VBZ Verb, present 3SG -s form takes, lives
WDT Determiner, question which, whatever
WP Pronoun, question who, whoever
WPS Determiner, possessive & question whose
WRB Adverb, question when, how, however
PP Punctuation, sentence ender ., !, ?
PPC Punctuation, comma ,
PPD Punctuation, dollar sign $
PPL Punctuation, quotation mark left ``
PPR Punctuation, quotation mark right ''
PPS Punctuation, colon, semicolon, elipsis :, ..., -
LRB Punctuation, left bracket (, {, [
RRB Punctuation, right bracket ), }, ]

Related

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

Name of the notation that goes like /<command> [arg0|arg1]

I know there is a notation or convention that for example describes the usage of a command (in a shell for example).
/<command> [arg0|arg1]
means the following is a right way of expressing/using the command: /TheNameOfTheCommand arg0 or /TheNameOfTheCommand arg1.
It is a bit like RegEx or a formal language. Minecraft uses this notation too to describe the syntax of their command. And I once heard about it by a professor in a programming lecture. That's the reason I think it must be a real convention.
Do you know the name of this convention or does it exist at all?
It's a convention, but not a standard. Or maybe it would be more accurate to say that it is a convention adapted from a collection of standards which differ in details, except that the standards derive from the conventional use, and it's more common to find other variants than strict application of the standards.
The use of angle brackets to delimit grammatical variables goes back to Peter Backus' notation to describe the original Algol (1959); the use of brackets to donate optionality and vertical bars to list alternatives was used in the definition of Pascal (1974) and promoted by Niklaus Wirth in a note published in 1977 (“What Can We Do About the
Unnecessary Diversity of Notation for
Syntactic Definitions”).
In the Pascal report, it was called "Extended Backus Naur Form", and it is one of a number of similar notations which go by that name. I think that's unfortunate because it doesn't acknowledge Wirth's contribution but if you called it WBNF people would probably think you were talking about a radio station.
(Wirth didn't always use angle brackets. In the published version of the Pascal report, grammatical variables were printed on italics, but in the widely-distributed typed manuscript, angle brackets were used. Similarly, literal tokens were sometimes typeset in boldface and sometimes typed surrounded by quotation marks.)

Unicode Character COMBINING LATIN SMALL LETTER C

What is the likelihood that I'll run into COMBINING LATIN SMALL LETTER C (U+0368) in "real life" (besides clever Scottish folk)?
I'm asking since it's in both the Unicode Block Combining Diacritical Marks and the Category Mark, Nonspacing [Mn].
As a result, it seems to gets treated the same as characters such as COMBINING GRAVE ACCENT (U+0300) by Utilities such as the ICU Transliterator (using either the suggested "NFD; [:Nonspacing Mark:] Remove; NFC" or a straight "Latin-ASCII" transliteration).
The likelihood is very close to zero, but not exactly zero. You cannot prevent anyone from using a Unicode character as he likes. There is no specific information about U+0368 in the Unicode Standard, but it has definitely been defined as a combining character that will cause a symbol (c) to be displayed above the preceding character. I would expect to find it mostly in digitized forms of medieval manuscripts, or something like that.
Using it after a space character, as in the “clever” page mentioned, is not the intended use, but not invalid either. Unicode lets you use any combining mark after any character, whether it makes sense or not.
It has no canonical or compatibility decomposition, so there is no clear-cut way to deal with in a context where you cannot, or do not want to, retain the character.
The likelihood is utterly indeterminate except to say that if you expect it not to occur, then it will occur.

Why does Github Flavored Markup only add newlines for lines that start with [\w\<]?

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
'<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
^ = anchor start of string.
[\w\<] matches a word character or the start of word boundary. \< is not a literal, but rather a GNU word boundary. See here (do a ctrl+f for \<).
[^\n]* matches any length of non-newline characters
\n matches a new line.
+ is, I believe, a possessive quantifier.
I believe, but am not 100% sure, that this simply is used to set x to a line of text. Then, the heavy work is done with the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
This says "if x satisfies the regex \n{2} (that is, has two line breaks), leave x as is. Otherwise, strip x and append a newline character.

Theory: "Lexical Encoding"

I am using the term "Lexical Encoding" for my lack of a better one.
A Word is arguably the fundamental unit of communication as opposed to a Letter. Unicode tries to assign a numeric value to each Letter of all known Alphabets. What is a Letter to one language, is a Glyph to another. Unicode 5.1 assigns more than 100,000 values to these Glyphs currently. Out of the approximately 180,000 Words being used in Modern English, it is said that with a vocabulary of about 2,000 Words, you should be able to converse in general terms. A "Lexical Encoding" would encode each Word not each Letter, and encapsulate them within a Sentence.
// An simplified example of a "Lexical Encoding"
String sentence = "How are you today?";
int[] sentence = { 93, 22, 14, 330, QUERY };
In this example each Token in the String was encoded as an Integer. The Encoding Scheme here simply assigned an int value based on generalised statistical ranking of word usage, and assigned a constant to the question mark.
Ultimately, a Word has both a Spelling & Meaning though. Any "Lexical Encoding" would preserve the meaning and intent of the Sentence as a whole, and not be language specific. An English sentence would be encoded into "...language-neutral atomic elements of meaning ..." which could then be reconstituted into any language with a structured Syntactic Form and Grammatical Structure.
What are other examples of "Lexical Encoding" techniques?
If you were interested in where the word-usage statistics come from :
http://www.wordcount.org
This question impinges on linguistics more than programming, but for languages which are highly synthetic (having words which are comprised of multiple combined morphemes), it can be a highly complex problem to try to "number" all possible words, as opposed to languages like English which are at least somewhat isolating, or languages like Chinese which are highly analytic.
That is, words may not be easily broken down and counted based on their constituent glyphs in some languages.
This Wikipedia article on Isolating languages may be helpful in explaining the problem.
Their are several major problems with this idea. In most languages, the meaning of a word, and the word associated with a meaning change very swiftly.
No sooner would you have a number assigned to a word, before the meaning of the word would change. For instance, the word "gay" used to only mean "happy" or "merry", but it is now used mostly to mean homosexual. Another example is the morpheme "thank you" which originally came from German "danke" which is just one word. Yet another example is "Good bye" which is a shortening of "God bless you".
Another problem is that even if one takes a snapshot of a word at any point of time, the meaning and usage of the word would be under contention, even within the same province. When dictionaries are being written, it is not uncommon for the academics responsible to argue over a single word.
In short, you wouldn't be able to do it with an existing language. You would have to consider inventing a language of your own, for the purpose, or using a fairly static language that has already been invented, such as Interlingua or Esperanto. However, even these would not be perfect for the purpose of defining static morphemes in an ever-standard lexicon.
Even in Chinese, where there is rough mapping of character to meaning, it still would not work. Many characters change their meanings depending on both context, and which characters either precede or postfix them.
The problem is at its worst when you try and translate between languages. There may be one word in English, that can be used in various cases, but cannot be directly used in another language. An example of this is "free". In Spanish, either "libre" meaning "free" as in speech, or "gratis" meaning "free" as in beer can be used (and using the wrong word in place of "free" would look very funny).
There are other words which are even more difficult to place a meaning on, such as the word beautiful in Korean; when calling a girl beautiful, there would be several candidates for substitution; but when calling food beautiful, unless you mean the food is good looking, there are several other candidates which are completely different.
What it comes down to, is although we only use about 200k words in English, our vocabularies are actually larger in some aspects because we assign many different meanings to the same word. The same problems apply to Esperanto and Interlingua, and every other language meaningful for conversation. Human speech is not a well-defined, well oiled-machine. So, although you could create such a lexicon where each "word" had it's own unique meaning, it would be very difficult, and nigh on impossible for machines using current techniques to translate from any human language into your special standardised lexicon.
This is why machine translation still sucks, and will for a long time to come. If you can do better (and I hope you can) then you should probably consider doing it with some sort of scholarship and/or university/government funding, working towards a PHD; or simply make a heap of money, whatever keeps your ship steaming.
It's easy enough to invent one for yourself. Turn each word into a canonical bytestream (say, lower-case decomposed UCS32), then hash it down to an integer. 32 bits would probably be enough, but if not then 64 bits certainly would.
Before you ding for giving you a snarky answer, consider that the purpose of Unicode is simply to assign each glyph a unique identifier. Not to rank or sort or group them, but just to map each one onto a unique identifier that everyone agrees on.
How would the system handle pluralization of nouns or conjugation of verbs? Would these each have their own "Unicode" value?
As a translations scheme, this is probably not going to work without a lot more work. You'd like to think that you can assign a number to each word, then mechanically translate that to another language. In reality, languages have the problem of multiple words that are spelled the same "the wind blew her hair back" versus "wind your watch".
For transmitting text, where you'd presumably have an alphabet per language, it would work fine, although I wonder what you'd gain there as opposed to using a variable-length dictionary, like ZIP uses.
This is an interesting question, but I suspect you are asking it for the wrong reasons. Are you thinking of this 'lexical' Unicode' as something that would allow you to break down sentences into language-neutral atomic elements of meaning and then be able to reconstitute them in some other concrete language? As a means to achieve a universal translator, perhaps?
Even if you can encode and store, say, an English sentence using a 'lexical unicode', you can not expect to read it and magically render it in, say, Chinese keeping the meaning intact.
Your analogy to Unicode, however, is very useful.
Bear in mind that Unicode, whilst a 'universal' code, does not embody the pronunciation, meaning or usage of the character in question. Each code point refers to a specific glyph in a specific language (or rather the script used by a group of languages). It is elemental at the visual representation level of a glyph (within the bounds of style, formatting and fonts). The Unicode code point for the Latin letter 'A' is just that. It is the Latin letter 'A'. It cannot automagically be rendered as, say, the Arabic letter Alif (ﺍ) or the Indic (Devnagari) letter 'A' (अ).
Keeping to the Unicode analogy, your Lexical Unicode would have code points for each word (word form) in each language. Unicode has ranges of code points for a specific script. Your lexical Unicode would have to a range of codes for each language. Different words in different languages, even if they have the same meaning (synonyms), would have to have different code points. The same word having different meanings, or different pronunciations (homonyms), would have to have different code points.
In Unicode, for some languages (but not all) where the same character has a different shape depending on it's position in the word - e.g. in Hebrew and Arabic, the shape of a glyph changes at the end of the word - then it has a different code point. Likewise in your Lexical Unicode, if a word has a different form depending on its position in the sentence, it may warrant its own code point.
Perhaps the easiest way to come up with code points for the English Language would be to base your system on, say, a particular edition of the Oxford English Dictionary and assign a unique code to each word sequentially. You will have to use a different code for each different meaning of the same word, and you will have to use a different code for different forms - e.g. if the same word can be used as a noun and as a verb, then you will need two codes
Then you will have to do the same for each other language you want to include - using the most authoritative dictionary for that language.
Chances are that this excercise is all more effort than it is worth. If you decide to include all the world's living languages, plus some historic dead ones and some fictional ones - as Unicode does - you will end up with a code space that is so large that your code would have to be extremely wide to accommodate it. You will not gain anything in terms of compression - it is likely that a sentence represented as a String in the original language would take up less space than the same sentence represented as code.
P.S. for those who are saying this is an impossible task because word meanings change, I do not see that as a problem. To use the Unicode analogy, the usage of letters has changed (admittedly not as rapidly as the meaning of words), but it is not of any concern to Unicode that 'th' used to be pronounced like 'y' in the Middle ages. Unicode has a code point for 't', 'h' and 'y' and they each serve their purpose.
P.P.S. Actually, it is of some concern to Unicode that 'oe' is also 'œ' or that 'ss' can be written 'ß' in German
This is an interesting little exercise, but I would urge you to consider it nothing more than an introduction to the concept of the difference in natural language between types and tokens.
A type is a single instance of a word which represents all instances. A token is a single count for each instance of the word. Let me explain this with the following example:
"John went to the bread store. He bought the bread."
Here are some frequency counts for this example, with the counts meaning the number of tokens:
John: 1
went: 1
to: 1
the: 2
store: 1
he: 1
bought: 1
bread: 2
Note that "the" is counted twice--there are two tokens of "the". However, note that while there are ten words, there are only eight of these word-to-frequency pairs. Words being broken down to types and paired with their token count.
Types and tokens are useful in statistical NLP. "Lexical encoding" on the other hand, I would watch out for. This is a segue into much more old-fashioned approaches to NLP, with preprogramming and rationalism abound. I don't even know about any statistical MT that actually assigns a specific "address" to a word. There are too many relationships between words, for one thing, to build any kind of well thought out numerical ontology, and if we're just throwing numbers at words to categorize them, we should be thinking about things like memory management and allocation for speed.
I would suggest checking out NLTK, the Natural Language Toolkit, written in Python, for a more extensive introduction to NLP and its practical uses.
Actually you only need about 600 words for a half decent vocabulary.