How to retrieve compound words from string list - UIMA RUTA

Sample Script:
DECLARE Name,TEST;
"Peter"->Name;
"der Groot"->Name;
"Robert"->Name;
"de Leew"->Name;
"O'Sullivan"->Name;
STRING s;
STRINGLIST slist;
Name{-> MATCHEDTEXT(s), ADD(slist,s),LOG(s)};
ANY+ {INLIST(slist)->MARK(TEST)};
Received Output:
Peter
Robert
Expected Output:
Peter
der Groot
Robert
de Leew
O'Sullivan
Sample Input:
Peter
der Groot
Robert
de Leew
O'Sullivan
I've tried to mark the values of the string list with an annotation type, but the received output is different from the expected output.

The condition at the rule element ANY+ is validated for every single ANY, so it fails as soon as one token is not in the list, and it can only ever match single tokens.
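To make the difference concrete, here is a rough Python model of it, not Ruta code and not how Ruta is implemented. The regex tokenizer is an assumption that only approximates Ruta's basic tokens, and `tokenwise_matches` and `span_matches` are hypothetical helper names.

```python
import re

names = ["Peter", "der Groot", "Robert", "de Leew", "O'Sullivan"]

def tokenize(text):
    # Assumed stand-in for basic tokens: runs of word characters,
    # or single non-space symbols such as the apostrophe.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenwise_matches(text, entries):
    # Models a list check that is evaluated on each single token.
    return [tok for tok in tokenize(text) if tok in entries]

def span_matches(text, entries):
    # Models a lookup of whole list entries in the text.
    return [entry for entry in entries if entry in text]

text = "\n".join(names)
print(tokenwise_matches(text, names))  # ['Peter', 'Robert']
print(span_matches(text, names))       # all five names
```

Under these assumptions the token-wise check finds only the single-token names, which is the received output from the question, while the span-based lookup finds all five entries.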
Should the last rule annotate only positions directly after Name annotations?
If not, then you can do something like:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
If yes, the situation gets more complicated because you do not have candidates with the correct span. You cannot solve this with a combination of ANY and INLIST. You either need a correct span or fragments in the list. I'd rather recommend an additional fixing rule:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
ANY{-ENDSWITH(Name)} #TEST{-> UNMARK(TEST)};
DISCLAIMER: I am a developer of UIMA Ruta

Related

Can we set tolerance level on regex annotator in Ruta?

I am annotating Borrower Name
"Borrower Name" -> BorrowerNameKeyword ( "label" = "Borrower Name");
But I get this text from OCR analysis, so at times I might get Borrower Name as B0rr0wer Nane. Is it possible to set a tolerance limit so that this text still gets annotated as BorrowerNameKeyword?
Is there any other approach that could help here?
I could think of dictionary correction, but that won't help as it could auto-correct words that are already right.
You could achieve that with regular expressions in UIMA Ruta. For your particular example the following rule should work:
"B.rr.wer\\sNa.e" -> BorrowerName;
Likewise, you can create more variants of regular expressions to cover the OCR errors.
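As a quick sanity check outside Ruta, a pattern like this can be tried in Python first. This is only a sketch: the `tight` variant and its character classes are assumptions about which letters the OCR confuses, and the sample strings are made up apart from the one in the question.

```python
import re

# Same pattern as in the rule. A Ruta string literal doubles the
# backslash; a Python raw string does not need that.
loose = re.compile(r"B.rr.wer\sNa.e")

# Hypothetical stricter variant: only accept the confusions we expect
# (o/0 and m/n) instead of any character.
tight = re.compile(r"B[o0]rr[o0]wer\sNa[mn]e")

samples = ["Borrower Name", "B0rr0wer Nane", "Barrawer Nave"]
for s in samples:
    print(s, bool(loose.fullmatch(s)), bool(tight.fullmatch(s)))
```

The `.` wildcard also accepts unrelated text such as "Barrawer Nave", so explicit character classes may be the safer choice if false positives matter.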

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g. a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotations of the same type before it in the document). Less formally: remove an annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some other ones. Obviously, none of them works (maybe I can refer to the begin feature of the annotation, but this will not solve the formal issue).
In the end, the question is: how can I tell RUTA to remove an annotation that has no annotations of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule has no valid syntax. The min/max quantifier requires different brackets like [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

Recognize undefined Entities in Watson Conversation

I wanted to know if it is possible to catch different entities in Watson Conversation without defining their values.
For example, I am working on a mobile app for room booking in my company and I can't define all the room names, so I want my bot to recognize the name just from the pattern used, for example
"Book #room for tomorrow"
and whatever I put in place of #room it takes it as a room name.
thank you
It's now available; check out https://console.bluemix.net/docs/services/conversation/entities.html#pattern-entities
A pattern must be entered as a regular expression in the field.
For instance internationalPhone: ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$, e.g., +44 1962 815000
EDIT: The solution below still works, but right now the pattern entities as discussed by Dudi are a more systematic solution. Leaving this here for legacy reasons.
Right now the regexp support inside Watson Conversation Service is probably the best bet.
For your particular example, you can use the following expression inside the dialog node condition:
input.text.matches('^[bB]ook[^\w]+(\w+).+ (tomorrow|today)$')
and inside that node you can add the following regexp to node context to extract the second word (or the word after "Book") to a variable:
"room" : "<? input.text.extract('^[bB]ook[^\\w]+(\\w+).+ (tomorrow|today)$',1) ?>"
(note that in context unlike in conditions you need to actually escape \ with another \)
This will match inputs such as "book bathroom for today" or "book r101 for tomorrow".
A good place where you can try your regexp expressions is https://regex101.com/
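The same pattern can also be checked locally. The following Python sketch only exercises the regular expression, it does not call Watson, and `extract_room` is a hypothetical helper that mirrors taking capture group 1.

```python
import re

# The pattern from the dialog node condition, as a Python raw string.
pattern = re.compile(r"^[bB]ook[^\w]+(\w+).+ (tomorrow|today)$")

def extract_room(utterance):
    # Return capture group 1 (the word after "book"), or None if no match.
    m = pattern.match(utterance)
    return m.group(1) if m else None

print(extract_room("book bathroom for today"))    # bathroom
print(extract_room("Book r101 for tomorrow"))     # r101
print(extract_room("reserve r101 for tomorrow"))  # None
```

One thing to watch: `.+` requires at least one character between the room name and the day, so in this sketch "book r101 tomorrow" still matches but backtracking shortens the captured group to "r10".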

FirstToken is not found for some reference-UIMA RUTA

FirstToken is not found for some references (those that contain a space at the end).
Script:
DECLARE FirstToken, LastToken;
BLOCK(InRef) Reference{}{
ANY{POSITION(Reference,1) -> MARK(FirstToken)};
Document{-> MARKLAST(LastToken)};
}
Input Files:
1. Ferreira, F.R., Prado, S.D., Carvalho, M.C, and Kraemer, F.B. (2015). Biopower and biopolitics in the field of food and nutrition. Revista de Nutrição, 28(1), 109-119. Available at http://dx.doi.org/10.1590/1415-52732015000100010.
2. Ali, S. (2007). Feminism and postcolonialism: Knowledge/politics. Ethnic and Racial Studies, 30(2), 191–212.
3. Forbes, D.A., King, K.M., Kushner, K.E., Letourneau, N.L., Myrick, A.F., and Profetto-McGrath, J. (1999). Warrantable evidence in nursing science. Journal of Advanced Nursing, 29(2), 373–379.
Annotations that start or end with something invisible are also not visible. This definition may sound unintuitive but is required for sequential matching.
This happens most often if some annotation starts or ends with a space. It is recommended to remove/trim these spaces from the annotations, e.g., with:
RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
Reference{-> TRIM(WS)};
RETAINTYPE;
You can also work on annotations that end with a space if you make spaces visible:
RETAINTYPE(SPACE);
Besides that, you can use the MARKFIRST action, analogous to the MARKLAST action, instead of the POSITION condition, which is extremely slow.
DISCLAIMER: I am a developer of UIMA Ruta

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms from a text file and annotate all instances of, say, Token that have a certain feature value (lemma in the example) among the ones in the list. I know String equality is possible, but I was not able to find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks on the covered text of the matched annotation. So you cannot use that. The CONTAINS condition accepts an additional argument, but not word lists. You can also not apply the wordlist with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides the functionality.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.