RUTA: Multiline Annotation

Full disclosure: I am new to RUTA.
I have a rule that uses a regex to find an entity that spans multiple lines, but I now need the line break removed from the annotation.
My RUTA rule looks like this:
"(?i)\\b[A-Z]{2}[[0-9]{1,}[\n]{0,}[0-9]{1,}]{1,}" -> EntitType;
My results end up like this:
S01234
25475
How can I get it to be S0123425475?

Here is an example for storing a modified string in a feature:
DECLARE EntitType (String normalized);
e:EntitType{-> e.normalized = replaceAll(e.ct, "\n", "")};
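To check what the normalized feature should end up containing, the same replacement can be tried outside Ruta in plain Java. This is only a sketch: the class and method names are hypothetical and not part of UIMA Ruta, and handling "\r\n" as well as "\n" is an assumption about how the input text encodes its line breaks.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of UIMA Ruta: it mirrors the effect of
// replaceAll(e.ct, "\n", "") on the covered text of an annotation.
class LineBreakNormalizer {
    // Assumption: the text may use "\n", "\r\n", or a lone "\r" as line break.
    private static final Pattern LINE_BREAK = Pattern.compile("\\r?\\n|\\r");

    // Returns the covered text with every line break removed.
    static String normalize(String coveredText) {
        return LINE_BREAK.matcher(coveredText).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("S01234\n25475"));   // prints S0123425475
        System.out.println(normalize("S01234\r\n25475")); // prints S0123425475
    }
}
```

If the documents come from Windows tools, the Ruta rule may need the same treatment for "\r", since replacing only "\n" would leave the carriage return in the feature value.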
DISCLAIMER: I am a developer of UIMA Ruta

Related

Can't convert string to lower case in Ruta

I have a functioning RUTA script. All I want to do is convert a string variable to lowercase doing this ASSIGN(s1, toLowerCase(s2)) where both s1 and s2 are strings. My script works when I do this ASSIGN(s1,s2) but causes an error when I add toLowerCase to my script. The error I get is not very helpful.
2021-08-28 11:27:39 ERROR AnnotateFileHandler:67 - org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "org.apache.uima.ruta.engine.RutaEngine" failed. (Descriptor: )
I found an answer posted by Peter here.
I had to change the way I was configuring my Ruta engine to import the string functions, like this:
createEngineDescription(RutaEngine.class,
    RutaEngine.PARAM_MAIN_SCRIPT, "system8.annotator.system8",
    // register the extensions that provide the string functions such as toLowerCase
    RutaEngine.PARAM_ADDITIONAL_EXTENSIONS,
    new String[] {
        BooleanOperationsExtension.class.getName(),
        StringOperationsExtension.class.getName()
    })
Thank goodness for Peter Kluegl

Can we set tolerance level on regex annotator in Ruta?

I am annotating Borrower Name
"Borrower Name" -> BorrowerNameKeyword ( "label" = "Borrower Name");
But I get this text from OCR, so at times Borrower Name comes out as B0rr0wer Nane. Is it possible to set a tolerance limit so that such text still gets annotated as BorrowerNameKeyword?
Is there any other approach that could help here?
I could think of dictionary correction, but that won't help, as it could also "correct" words that are already right.
You could achieve that with regular expressions in UIMA Ruta. For your particular example, the following rule should work:
"B.rr.wer\\sNa.e" -> BorrowerName;
Likewise, you can create more variants of regular expressions to cover the OCR errors.
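Before adding such a rule to a script, the pattern can be checked against a few sample strings in plain Java. This is a sketch under the assumption that the regex behaves the same way in a Ruta rule as in java.util.regex; the class name and the sample strings other than the two from the question are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical test harness for the pattern used in the rule above.
class OcrTolerantMatch {
    // '.' accepts any single character at the positions OCR tends to get wrong.
    private static final Pattern BORROWER_NAME = Pattern.compile("B.rr.wer\\sNa.e");

    // True if the text contains something that looks like "Borrower Name".
    static boolean matches(String text) {
        return BORROWER_NAME.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {"Borrower Name", "B0rr0wer Nane", "Barrier Name"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + matches(sample));
        }
    }
}
```

Note that '.' is deliberately loose. If too many false positives show up, a narrower character class such as [o0] at the affected positions would be a tighter alternative.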

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g. a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotations of the same type before it in the document). Less formally: remove an annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some others. Obviously, none of them works (maybe I could refer to the begin feature of the annotation, but that would not solve the formal issue).
So the question is: how can I tell RUTA to remove an annotation that has no annotations of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule is not valid syntax: the min/max quantifier requires square brackets, like [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

Setting Features-Uima Ruta

I am working through the "learning by example" section of the UIMA Ruta documentation. I tried to define and assign an employment relation by storing the given annotations as feature values, but I got error messages. The concept is not clear to me; can someone explain it in detail?
DECLARE Annotation EmplRelation
(Employee employeeRef, Employer employerRef);
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation,"employeeRef" = Employee, "employerRef" = Employer)};
e1:Employer # EmploymentIndicator # e2:Employee) {-> EmplRelation, EmplRelation.employeeRef=e2, EmplRelation.employerRef=e1};
I can only guess what the error messages were: the script in the question is not complete. The "learning by example" section does not always contain complete scripts; its examples build upon previous ones. A complete, running script for this example could look like this (for an input text like "Peter works for Frank."):
DECLARE Employee, Employer, EmploymentIndicator, Sentence;
DECLARE EmplRelation (Employee employeeRef, Employer employerRef);
// create some dummy annotations to work on
"Peter" -> Employee;
"Frank" -> Employer;
"works for" -> EmploymentIndicator;
(# PERIOD){-> Sentence};
// the actual rules
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation,"employeeRef" = Employee, "employerRef" = Employer)};
(e1:Employee # EmploymentIndicator # e2:Employer) {-> EmplRelation, EmplRelation.employeeRef=e1, EmplRelation.employerRef=e2};
Please mind that I modified the last rule so that it works on the minimal example.
DISCLAIMER: I am a developer of UIMA Ruta

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms from a text file and annotate all instances of, say, Token that have a certain feature value (lemma in the example) among the ones in the list. I know String equality is possible, but I was not able to find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments; it only checks the covered text of the matched annotation, so you cannot use it here. The CONTAINS condition accepts an additional argument, but not word lists. You also cannot apply the wordlist with MARKFAST, since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides it.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.