UIMA Ruta: Optional Quantifier - uima

I wanna match some terms only if the term behind this term is relevant for me. So I've created a minimal example:
This is my Test Data:
small Large
Large
small
And I wanna mark the terms small Large and Large but not "small".
So I thought, something like this should work:
DECLARE Test;
(SW*? CW) {-> CREATE(Test)};
But RUTA only matches "small Large".
For Testing I've replaced "SW" with "W" and it will do what I wan't.

Unfortunately, optional quantifiers at the beginning of a rule are not optional if the rule starts to match with the first rule element. This means that you either need two rules or you need to change the order of the rule element matching.
Changing the order of the rule element match lead to different rule matches since not all incremental sequences of SWs are considered before the CW. However, this is something one would normally prefer anyways. The rule would look like:
(SW* #CW) {-> CREATE(Test)};
The two rules approach would look something like:
(SW+? CW) {-> CREATE(Test)};
CW {-> CREATE(Test)};
I recommend avoiding the usage of a reluctant quantifier if it is not really required because of additional computation which are not necessary. Rather use the PARTOF condition even if it looks not as nice.
DISCLAIMER: Iam a developer of UIMA Ruta

Related

UIMA Ruta. Retrieve phrases separated by WS (spaces, breaks, etc.)

I'm going to retrieve phrases separated by spaces, breaks and other punctuation symbols.
I've spent a lot of time trying to find out the best way to do that.
Option 1. The easiest way.
DECLARE T1, T2;
"cool rules" -> T1;
"cool rule" -> T2;
Input: "123cool rules".
Result: T1 and T2 are triggered;
Option 2. Using WORDLIST and WORDTABLE.
Let wordlist 1.txt contains 2 rows:
cool rules
cool
code for extraction is the following
WORDLIST WList = '1.txt';
DECLARE W1;
Document{-> MARKFAST(W1, WList, true, 2)};
Input: "cool rules".
Result: only first row is extracted. I guess that in this case intersected rules are not triggered.
Option 3. Mark combination of two tokens
DECLARE T1;
("cool" "rule") {-> T1};
Input: "cool rules cool rule 1cool rule"
Result: 2 annotations: cool rule + 1cool rule. Loss of extraction speed in 10 times.
Option 4. REGEXP matching
Maybe it is possible to match such pattern "cool\\srule", but I have no idea how to define the type expression. SW*{REGEXP("cool\\srule")->T1} does not provide results.
As you see, I'm trying to solve a very simple task, but did not succeed yet. The option 3 is a really good way to do that, but extraction process becomes slower in 10 times.
If you want to identify specific phrases, you should use a dictionary lookup, not directly rules.
Therefore, I'd recommend the MARKFAST option 2. However, there are two problems: (a) only longest matches are supported and (b) you either need to change the segmentation (tokenization) or do some postprocessing.
(a) This cannot be solved. If this is really required, a different dictionary annotator should be used. See e.g., the UIMA mailing lists.
(b) The MARKFAST works on RutaBasic annotations which are automatically created for each smallest part. Because of the default seeder, the token "1cool" consists of two RutaBasics, one for the NUM, one for the SW. If you do not want to change the preprocessing, you can simply apply a rule that fixed that like
RETAINTYPE(WS);
ANY{-PARTOF(WS)} t:#T1{-> UNMARK(t)};
btw, option 4 won't work because the REGEXP condition checks on the covered text of the matched annotation SW which only represents one token. If you do something like (SW+){REGEXP("cool\\srule")->T1}, then the rule wont match if there is another SW afterwards.
DISCLAIMER: I am a developer Of UIMA Ruta

Using match with user defined types in PL Racket

The following PL code does not work under #lang pl:
Edited code according to Alexis Kings answer
(define-type BINTREE
[Leaf Number]
[Node BINTREE BINTREE])
(: retrieve-leaf : BINTREE -> Number)
(define (retrieve-leaf btree)
(match btree
[(Leaf number) number])
What i'd like to achieve is as follows:
Receive a BINTREE as input
Check whether the tree is simply a leaf
Return the leaf numerical value
This might be a basic question but how would I go about solving this?
EDIT: The above seems to work if cases is used instead of match.
Why is that?
As you've discovered, match and cases are two similar but separate
things. The first is used for general Racket values, and the second is
used for things that you defined with define-type. Unfortunately,
they don't mix well in either direction, so if you have a defined type
then you need to use cases.
As for the reason for that, it's kind of complicated... One thing is
that the pl language was made well before match was powerful enough
to deal with arbitrary values conveniently. It does now, but it cannot
be easily tweaked to do what cases does: the idea behind define-type
is to make programming simple by making it mandatory to use just
cases for such values --- there are no field accessors, no predicates
for the variants (just for the whole type), and certainly no mutation.
Still, it is possible to do anything you need with just cases. If you
read around, the core idea is to mimic disjoint union types in HM
languages like ML and Haskell, and with only cases pattern matching
available, many functions are easy to start since there's a single way
to deal with them.
match and Typed Racket got closer to being able to do these things,
but it's still not really powerful enough to do all of that --- which is
why cases will stay separate from match in the near future.
As a side note, this is in contrast to what I want --- I know that this
is often a point of confusion, so I'd love to have just match used
throughout. Maybe I'll break at some point and hack things so that
cases is also called match, and the contents of the branches would
be used to guess if you really need the real match or the cases
version. But that would really be a crude hack.
I think you're on the right track, but your match syntax isn't correct. It should look like this:
(: retrieve-leaf : BINTREE -> Number)
(define (retrieve-leaf btree)
(match btree
[(Leaf number) number]))
The match pattern clauses must be inside the match form. Additionally, number is just a binding, not a procedure, so it doesn't need to be in parens.

Sphinx with metaphone and wildcard search

we are an anatomy platform and use sphinx for our search. We want to make our search more fuzzier and started to use metaphone to correct spelling mistakes. It finds for example phalanges even though the search word is falanges.
That's good but we want more. We want that the user could type in falange or even falang and we still find phalanges. Any ideas how to accomplish this?
If you are interested you can checkout our sphinx config file here.
Thanks!
Well you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work. (to match phalanges)
The problem is the 'stripped' letters may change the 'sound' of the word (because change the pronunciation)
eg falange becomes FLNJ, but falang acully becomes FLNK - so they no longer 'substrings' of one another. (ie phalanges becomes FLNJS, which FLNK* wont match)
... to be honest I dont know a good solution. You could perhaps get better results, if was to apply stemming, BEFORE metaphone. (so the endings that change the pronouncation of the words are removed.
Alas Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions, implement stemming outside of sphinx (or maybe with regexp_filter. Not sure if say a porter stemmer can be implemnented purely with regular expressions)
or modify sphinx, so that ALL morphology processors apply. (rather than just the first one that changes the word)

When a keyword means different things in different contexts, is that an example of context sensitivity?

According to this answer => in Scala is a keyword which has two different meanings: 1 to denote a function type: Double => Double and 2 to create a lambda expression: (x: Double): Double => 2*x.
How does this relate to formal grammars, i.e. does this make Scala context sensitive?
I know that most languages are not context free, but I'm not sure whether the situation I'm describing has anything to do with that.
Edit:
Seems like I don't understand context sensitive grammars well enough. I know how the production rules are supposed to look, and what they mean ("this production applies only if A is surrounded by these symbols"), but I'm just not sure how they relate to actual (programming) languages.
I think my confusion stems from reading something like "Chomsky introduced this term because a word's meaning can depend on its context", and I connected => with the term "word" in the quote, and those two uses of it being two separate contexts.
It be great if an answer would address my confusion.
It's been a while since I've handled formal language theory, but I'll bite.
"Context-free" means that the production rules required in the corresponding grammar do not have a "context". It does not mean that a specific symbol cannot appear in different rules.
Addressing the edit: in other words (and more informally), deciding whether a language is context-free or context-sensitive boils down not to looking at the "meaning" of a specific "word" or "words". Instead, it amounts to looking at the set of all legal expressions in that language, and seeing whether you can "encode" them only by taking into account the positional relationships the component "words" have with one another. This is essentially what the Pumping Lemma checks.
For example:
S → Type"="Body
Type → "Double"
Type → "Double""=>""Double"
Body → Lambda
Body → NormalBody
NormalBody → "x"
Lambda -> "x""=>"NormalBody
Where S is of course the start symbol, uppercased IDs are nonterminals, and quoted strings are terminals. Obviously, this can generate a string like:
Double=>Double=x=>x
but the grammar is still context-free.
So just this, as in the observation that the nonterminal "=>" can appear in two "places" of the program, does not make Scala context-sensitive.
However, it does not mean that:
the entire Scala language is context-free,
it is context-sensitive - it can be even more complex,
if you would like to encode the semantics of Scala into a grammar, you would end up with either a context-free or a context-sensitive one.
The last thing is especially relevant since you've mentioned "meaning" in the (nomen omen) context of formal languages.

Perl operators are "discovered" and not designed?

Just reading this page: https://github.com/book/perlsecret/blob/master/lib/perlsecret.pod , and was really surprised with the statements like:
Discovered by Philippe Bruhat, 2012.
Discovered by Abigail, 2010. (Alternate nickname: "grappling hook")
Discovered by Rafaël Garcia-Suarez, 2009.
Discovered by Philippe Bruhat, 2007.
and so on...
The above operators are DISCOVERED, so they are not intentional by perl-design?
Thats mean than here is possibility than perl sill have some random character sequences what in right order doing something useful like the ()x!! "operator"?
Is here any other language what have discovered operatos?
From the page you linked:
They are like operators in the sense that these Perl programmers see
them often enough to recognize them without thinking about their
smaller parts, and eventually add them to their toolbox. And they are
like secrets in the sense that they have to be discovered by their
future user (or be transmitted by a fellow programmer), because they
are not explicitly documented.
That is, they are not really their own operators, but they are made up of smaller operators compounded to do something combinedly.
For example, the 'venus' operator (0+ or +0) numifies the object on its left or right. That's what adding zero in any form does, "secret" operator or not.
Perl has a bunch of operators that do special things, as well as characters that do special things when interpreted in a specific context. Rather than these being actual "operators" (i.e., not explicitly recognized by the Perl parser), think of them as combinations of certain functions/operations. For example ()X!!, which is known as the "Enterprise" operator, consists of () which is a list, followed by x, which is a repetition operator, followed by !! (the "bang bang" operator), which performs a boolean conversion. This is one of the reasons that Perl is so expressive.