UIMA RUTA: How to check if String variable is in StringList? - uima

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms by a text file and annotate all instances of, say, Token, who have a certain feature (Lemma in the example) value among the ones in the list. I know String equality is possible, but list membership I was not able to find in the documentation or figure out myself.
Thanks!

UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks on the covered text of the matched annotation. So you cannot use that. The CONTAINS condition accepts an additional argument, but not word lists. You can also not apply the wordlist with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or adding an external condition that provides the functionality.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.

Related

Mark Diagnosis with a certain ICD-10 Code as main diagnosis

I would like to mark all diagnoses whose conceptId starts with "G35." as main diagnosis. How can I implement this?
d:Diagnosis{d.conceptId.startswith("G35.") -> MainDiagnosis};
d:Diagnosis{d.conceptId[0:3] == "G35." -> MainDiagnosis};
All the best
Philipp
You could use REGEXP condition to match a pattern on the value of a feature (i.e. Diagnosis.conceptId).
A solution in your case would be something like this:
d:Diagnosis{REGEXP(d.conceptId, "^G35.*") -> MainDiagnosis};
For more information on REGEXP Condition, feel free to consult the documentation
Another option would be to use StringFunctions; similar to what you tried to do in the first rule.
d:Diagnosis{startsWith(d.conceptId, "G35") -> MainDiagnosis};
However this requires you to activate an optional extension org.apache.uima.ruta.string.bool.BooleanOperationsExtension in your Ruta Analysis Engine by setting its parameter PARAM_ADDITIONAL_EXTENSIONS

Can we set tolerance level on regex annotator in Ruta?

I am annotating Borrower Name
"Borrower Name" -> BorrowerNameKeyword ( "label" = "Borrower Name");
But I get this text post OCR analysis. At times I might get Borrower Name as B0rr0wer Nane. Is this possible to set tolerance limit so that this text gets annotated as BorrowerNameKeyword?
Is their any other approach which could help here?
I could think of dictionary correction but that wont help as it could auto correct right words.
You could achieve that with regular expressions in UIMA Ruta. For you particular example the following rule should work:
"B.rr.wer\\sNa.e" -> BorrowerName;
Likewise, you can create more variants of regular expressions to cover the OCR errors.

RUTA: Multiline Annotation

Full disclosure: New to RUTA.
I have a multi line using regex to find the entity. But I need it now to have the break removed in the annotation.
My RUTA looks like "(?i)\\b[A-Z]{2}[[0-9]{1,}[\n]{0,}[0-9]{1,}]{1,}" -> EntitType;
My results end up like
S01234
25475
How can I get it be S0123425475?
Here is an example for storing a modified string in a feature:
DECLARE EntitType (String normalized);
e:EntitType{-> e.normalized = replaceAll(e.ct, "\n", "")};
DISCLAIMER: I am a developer of UIMA Ruta

UIMA RUTA annotation at the beginning of sequence

I have sequence of annotations that are instances of the same type (e.g. sequence of CW annotations). I need to remove the first of them (more formally: remove annotation that has no annotations of the same type before in document). Less formally: to remove an annotation at the beginning of document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some other ones. Obviously, no one works (may be, I can refer to begin feature of annotation, but this will not solve formal issue).
At last, the question is - how can I say RUTA to remove annotation that has no annotations of the same type before in document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule has no valid syntax. The min/max quantifier requires different brackets like [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

UIMA Ruta: Copy the feature value from a contained annotation to a containing annotation

Note: This seems heavily related to Setting feature value to the count of containing annotation in UIMA Ruta. But I cannot quite apply the answer to my situation.
I am analyzing plain text documents where the following structure is assumed:
Document (one, of course)
Section (many)
Heading (one per section)
I am being asked to identify sections by checking whether their headings satisfy conditions. A useful and obvious sort of condition would be: does the heading match a given regular expression? A less-useful but perhaps more achievable condition would be: does the heading contain a given text?
I could and have already achieved this by taking a list of tuples of regular expressions and section titles, and at design time, for each member of the list, as such:
BLOCK(forEach) SECTION{} {
...
HEADING{REGEXP(".*table.*contents.*", true) ->
SETFEATURE("value", "Table of Contents")};
...
}
SECTION{ -> SETFEATURE("value", "Table of Contents")}
<- { HEADING.headingValue == "Table of Contents"; };
This approach is fairly straightforward but has a few big drawbacks:
It heavily violates the DRY principle
Even when writing the rule for just one section to identify, the rule author must copy the section title twice (it should only need to be specified once)
It makes the script needlessly long and unwieldy
It puts a big burden on the rule author, who in an ideal case, would only need to know Regex - not Ruta
So I wanted to refactor to achieve the following goals:
A text file is used to store the regular expressions and corresponding titles, and the rule iterates over these pairs
Features, rather than types, are used to differentiate different sections/headings (i.e. like above, using SECTION.value=="Table of Contents" and not TableOfContentsSection)
After looking over the UIMA Ruta reference to see which options were available to achieve these goals, I settled on the following:
Use a WORDTABLE to store tuples of section title, words to find / regex if possible, lookup type - so for instance, Table of Contents,contents,sectiontitles
Use MARKTABLE to mark an intermediate annotation type LookupMatch whose hint feature contains the section title and whose lookup feature contains the type of lookup we are talking about
For each HEADING, see if a LookupMatch.lookup == "sectiontitle" is inside, and if it is, copy the LookupMatch.hint to the heading's value field.
For each SECTION, see if a HEADING with a value is inside; if so, copy the value to the SECTION.value field.
It was not quite a surprise to find that implementing steps 3 and 4 was not so easy. That's where I am at and why I am asking for help.
// STEP 1
WORDTABLE Structure_Heading_WordTable =
'/uima/resource/structure/Structure_Heading_WordTable.csv';
// STEP 2
Document.docType == "Contract"{
-> MARKTABLE(LookupMatch, // annotation
2, // lookup column #
Structure_Heading_WordTable, // word table to lookup
true, // case-insensitivity
0, // length before case-insensitivity
"", // characters to ignore
0, // matches to ignore
"hint" = 1, "lookup" = 3 // features
)
};
// STEPS 3 AND 4 ... ???
BLOCK(ForEach) LookupMatch.lookup == "sectiontitle"{} {
???
}
HEADING{ -> SETFEATURE("value", ???)} <- {
???
};
Here is my first real stab at it:
HEADING{ -> SETFEATURE("value", lookupMatchHint)} <- {
LookupMatch.lookup == "HeadingWords"{ -> GETFEATURE("hint", lookupMatchHint)};
};
TL; DR
How can I conditionally copy a feature value from one annotation to another? GETFEATURE kind of assumes that you only get 1...