UIMA RUTA Matching-mmHg

I tried to match "mmHg" with a regular expression in a UIMA Ruta script, but none of my rules match. I tried the following rules:
W{REGEXP("mmHg")->MARK(ME_UNITSPACING)};
ANY{REGEXP("mmHg")->MARK(ME_UNITSPACING)};
ANY+?{REGEXP("mmHg")->MARK(ME_UNITSPACING)};

DECLARE LOWERCAMELCASE,ME_UNITSPACING;
Document{-> RETAINTYPE(SPACE)};
SW CW{->MARK(LOWERCAMELCASE,1,2)};
Document{-> RETAINTYPE};
LOWERCAMELCASE{REGEXP("mmHg")->MARK(ME_UNITSPACING)};
Sample Input:
mmHg
small City
fishBowl


Can't convert string to lower case in Ruta

I have a functioning RUTA script. All I want to do is convert a string variable to lowercase with ASSIGN(s1, toLowerCase(s2)), where both s1 and s2 are strings. The script works with ASSIGN(s1, s2) but fails once I add toLowerCase, and the error I get is not very helpful:
2021-08-28 11:27:39 ERROR AnnotateFileHandler:67 - org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "org.apache.uima.ruta.engine.RutaEngine" failed. (Descriptor: )
I found an answer posted by Peter.
I had to change the way I was configuring my Ruta engine to import the string functions, like this:
createEngineDescription(RutaEngine.class,
    RutaEngine.PARAM_MAIN_SCRIPT, "system8.annotator.system8",
    RutaEngine.PARAM_ADDITIONAL_EXTENSIONS,
    new String[]{
        BooleanOperationsExtension.class.getName(),
        StringOperationsExtension.class.getName()})
Thank goodness for Peter Kluegl

RUTA: Multiline Annotation

Full disclosure: New to RUTA.
I use a regex to find an entity that spans multiple lines, but I now need the line break removed from the annotation.
My Ruta rule looks like this: "(?i)\\b[A-Z]{2}[[0-9]{1,}[\n]{0,}[0-9]{1,}]{1,}" -> EntitType;
My results end up like this:
S01234
25475
How can I get it to be S0123425475?
Here is an example for storing a modified string in a feature:
DECLARE EntitType (String normalized);
e:EntitType{-> e.normalized = replaceAll(e.ct, "\n", "")};
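To see what the replacement in the rule above does to the covered text, here is a minimal plain-Java sketch. It assumes that Ruta's replaceAll behaves like Java's String.replaceAll (an assumption, not confirmed in this thread); the class and method names are hypothetical. The second method is a variant for text with Windows-style line endings, which the "\n" pattern alone would not fully remove.

```java
public class LineBreakNormalizer {

    // Mirrors replaceAll(e.ct, "\n", ""): removes line-feed characters only.
    static String stripLineFeeds(String coveredText) {
        return coveredText.replaceAll("\n", "");
    }

    // Variant that also removes "\r\n" and "\r"; \R matches any line break sequence (Java 8+).
    static String stripAllLineBreaks(String coveredText) {
        return coveredText.replaceAll("\\R", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLineFeeds("S01234\n25475"));       // S0123425475
        System.out.println(stripAllLineBreaks("S01234\r\n25475")); // S0123425475
    }
}
```

If the input documents may contain "\r\n", the same idea would apply to the pattern passed in the Ruta rule, provided the regex dialect is the Java one.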
DISCLAIMER: I am a developer of UIMA Ruta

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g. a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotation of the same type before it in the document). Less formally: remove the annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some others. None of them works (maybe I could refer to the begin feature of the annotation, but this would not solve the formal issue).
So the question is: how can I tell RUTA to remove an annotation that has no annotation of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule is not valid syntax: the min/max quantifier requires square brackets, as in [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

How to retrieve compound words from string list- UIMA RUTA

Sample Script:
DECLARE Name,TEST;
"Peter"->Name;
"der Groot"->Name;
"Robert"->Name;
"de Leew"->Name;
"O'Sullivan"->Name;
STRING s;
STRINGLIST slist;
Name{-> MATCHEDTEXT(s), ADD(slist,s),LOG(s)};
ANY+ {INLIST(slist)->MARK(TEST)};
Received Output:
Peter
Robert
Expected Output:
Peter
der Groot
Robert
de Leew
O'Sullivan
Sample Input:
Peter
der Groot
Robert
de Leew
O'Sullivan
I tried to mark the string list values with an annotation type, but the received output differs from the expected output.
The condition at the rule element ANY+ is validated for every single ANY, so a multi-token name already fails at its first token and only single tokens can match.
Should the last rule annotate only positions directly after Name annotations?
If not, then you can do something like:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
If yes, the situation gets more complicated because you do not have candidates with the correct span. You cannot solve this with a combination of ANY and INLIST: you need either a correct span or fragments in the list. I'd rather recommend an additional fixing rule:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
ANY{-ENDSWITH(Name)} #TEST{-> UNMARK(TEST)};
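The difference between checking each token against the list and looking up whole list entries in the text can be illustrated outside Ruta. The following is a plain-Java sketch, not Ruta's implementation: the tokenizer is a hypothetical stand-in (runs of letters, or single punctuation marks), and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlistSketch {

    // Hypothetical tokenizer: a run of letters or one punctuation character per token.
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{Punct}");

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Checks every single token against the list, like a condition validated per ANY.
    static List<String> singleTokenMatches(String text, List<String> names) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens(text)) {
            if (names.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    // Looks up each complete list entry in the text, so multi-token entries can match.
    static List<String> phraseMatches(String text, List<String> names) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (text.contains(name)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Peter\nder Groot\nRobert\nde Leew\nO'Sullivan";
        List<String> names = java.util.Arrays.asList(
                "Peter", "der Groot", "Robert", "de Leew", "O'Sullivan");
        System.out.println(singleTokenMatches(text, names)); // [Peter, Robert]
        System.out.println(phraseMatches(text, names));
    }
}
```

With the sample input from the question, the per-token check finds only "Peter" and "Robert", because "der", "Groot", "O" and so on are never list entries on their own, while the whole-entry lookup finds all five names.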
DISCLAIMER: I am a developer of UIMA Ruta

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms from a text file and annotate all instances of, say, Token that have a certain feature value (lemma in the example) among the ones in the list. I know String equality is possible, but I was not able to find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks the covered text of the matched annotation, so you cannot use it here. The CONTAINS condition accepts an additional argument, but not word lists. You also cannot apply the word list with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides it.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
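The intended effect of these rules, checking a feature value rather than the covered text against the list, can be sketched in plain Java. This is only an illustration of the logic, not Ruta code or its API: the Token class is a hypothetical stand-in for an annotation with a lemma feature, and the lemma values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LemmaLookupSketch {

    // Hypothetical stand-in for a Token annotation with covered text and a lemma feature.
    static final class Token {
        final String text;
        final String lemma;

        Token(String text, String lemma) {
            this.text = text;
            this.lemma = lemma;
        }
    }

    // Returns the covered text of every token whose lemma (not its covered text) is listed.
    static List<String> tokensWithListedLemma(List<Token> tokens, Set<String> lemmas) {
        List<String> marked = new ArrayList<>();
        for (Token token : tokens) {
            if (lemmas.contains(token.lemma)) {
                marked.add(token.text);
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Set<String> lemmas = new HashSet<>(Arrays.asList("cat", "dog"));
        List<Token> tokens = Arrays.asList(
                new Token("cats", "cat"),
                new Token("ran", "run"),
                new Token("dogs", "dog"));
        System.out.println(tokensWithListedLemma(tokens, lemmas)); // [cats, dogs]
    }
}
```

Note that "cats" and "dogs" are selected although neither string is in the list, which is exactly what a check on the covered text alone cannot do.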
DISCLAIMER: I am a developer of Apache UIMA Ruta.