UIMA RUTA Combination of Annotations

I am new to UIMA RUTA and, after reading the UIMA RUTA Guide, I have the following question. I want to write a set of rules that search for two annotations (FIRST, SECOND) in my document with specific feature values (FIRST: "hello"; SECOND: "world", "pres") and, if both are found, create a new annotation (THIRD) with the value "end".
However, the script below is not working and I am wondering why.
WORDTABLE Firsts= 'FIRST.csv';
WORDTABLE Seconds= 'SECOND.csv';
WORDTABLE Thirds= 'THIRD.csv';
DECLARE Annotation FIRST(STRING value);
DECLARE Annotation SECOND(STRING value, STRING pos);
DECLARE Annotation THIRD(STRING value);
Document{->MARKTABLE(FIRST, 1, Firsts, "value"=2)};
Document{->MARKTABLE(SECOND, 1, Seconds, "value"=2, "pos"=3)};
Document{AND(CONTAINS(FIRST{FEATURE("value","hello")}),CONTAINS(SECOND{FEATURE("value","world","pos","pres")})){->CREATE(THIRD{FEATURE("value","end")})}};
Could you please help me out? Thanks.

The last rule is not valid.
You could write something like:
(f:FIRST{f.value=="hello"} # s:SECOND{s.value=="world",s.pos=="pres"}){-> CREATE(THIRD, "value" = "end")};
or
Document{-> CREATE(THIRD, "value" = "end")}<-{f:FIRST{f.value=="hello"} # s:SECOND{s.value=="world",s.pos=="pres"};};
or something with a conjunct rule.
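A minimal sketch of the conjunct-rule variant, assuming Ruta's '%' operator (which only applies the actions if all combined rule parts match within the same window; where you attach the CREATE determines the span of THIRD):
f:FIRST{f.value=="hello"} % s:SECOND{s.value=="world", s.pos=="pres"} % Document{-> CREATE(THIRD, "value" = "end")};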
DISCLAIMER: I am a developer of UIMA Ruta

Related

How to select and set a covering/covered annotation as a feature in RUTA

I have a Ruta rule that looks something like this, where dep is a Dependency type imported from DKPro:
dep{}->{MyItem{->
    SETFEATURE("Display", "displayedValue"),
    SETFEATURE("Lemma", dep.Dependent.lemma.value),
    SETFEATURE("Parent", dep.Governor)};};
The first two actions work. The problem is with the third action, SETFEATURE("Parent", dep.Governor): dep.Governor returns a Token, but my feature requires another type that happens to share the same location as the Governor. In other words, I want my own type, not dep.Governor, that has already annotated that governing word.
I am unsure how to recover an annotation (my annotation) that occupies the same span as dep.Governor. Ideally I would like to recover it as a variable so that I can reuse it for other features, to do something like this:
a:MyItem [that overlaps dep.Governor]
dep{}->{MyItem{->SETFEATURE("Parent", a)};};
Here is a more precise example:
d:dep.Dependency{->
    MyItem,
    MyItem.Display = "Ignore",
    MyItem.Lemma = d.Dependent.lemma.value,
    MyItem.LanguageParent = d,
};
The line MyItem.LanguageParent = d produces this Ruta error:
Trying to access value of feature "type.MyItem:LanguageParent" as "type.MyItem", but range of feature is "de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency"
I am sure there is a cleaner way than this, but for now, I am converting the type using a block function and saving it into an annotation variable.
BLOCK(ConvertTokenToMyItem) Token{IS(MyItem)} {
    varMyItem:MyItem;
}
Then I use it:
d:dep.Dependency{->
    MyItem,
    MyItem.Display = "Ignore",
    MyItem.Lemma = d.Dependent.lemma.value,
    MyItem.LanguageParent = varMyItem,
};

Using Ruta, annotate a line containing an annotation and extract required data

Annotate a line containing specific annotations in order to extract text: annotate the lines for Borrower and Co-Borrower and get their respective SSNs.
Borrower Name: Alice Johnson SSN: 123-456-7890
Co-Borrower Name: Bob Symonds SSN: 987-654-3210
Code:
PACKAGE uima.ruta.test;
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
EXEC(PlainTextAnnotator, {Line});
DECLARE Borrower, Name;
DECLARE BorrowerName(String value, String label);
CW{REGEXP("\\bBorrower") -> Borrower} CW{REGEXP("Name") -> Name};
Borrower Name COLON n:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=n.ct)};
DECLARE SSN;
DECLARE BorrowerSSN(String label, String value);
W{REGEXP("SSN") -> SSN};
SSN COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="Borrower SSN", "value"=n.ct)};
DECLARE Co;
CW{REGEXP("Co") -> Co};
DECLARE CoBorrowerName(String label, String value);
Co Borrower Name COLON n:CW[1,3]{-> CREATE(CoBorrowerName, "label"="Co-Borrower Name", "value"=n.ct)};
DECLARE BorrowerLine;
Line{CONTAINS(Borrower),CONTAINS(Name)->MARK(BorrowerLine)};
Please suggest how to annotate a line containing an annotation and get the specific label and value for the required annotation.
To spare yourself from handling the separate strings, you could gather all the indicators into a wordlist (i.e., a text file containing one indicator per line) and place it in your project's resources folder (see this for more details). Then you could just mark all the indicators with the desired indicator type:
WORDLIST IndicatorList = 'IndicatorList.txt';
DECLARE Indicator;
Document{-> MARKFAST(Indicator, IndicatorList)};
This would output Indicator helper annotations like "Borrower Name".
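For illustration, IndicatorList.txt could contain entries like the following (these lines are only an assumption derived from the example above; one indicator per line):
Borrower Name
Co-Borrower Name
SSN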
Once you have that, you could now iterate over the lines and find the target annotations.
DECLARE Invisible;
SPECIAL{-PARTOF(Invisible), REGEXP("[-]") -> Invisible};
BLOCK(line) Line{CONTAINS(Indicator)}{
    // Ex. pattern: Borrower Name: Alice Johnson SSN: 123-456-7890
    Indicator COLON c:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=c.ct)} Indicator;
    FILTERTYPE(Invisible);
    Indicator COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="BorrowerSSN", "value"=n.ct)};
    REMOVEFILTERTYPE(Invisible);
}
Hope this helps.
Addition to Viorel's answer:
The PlainTextAnnotator creates annotations of the type Line, and these annotations cover the complete line, which means that leading or trailing whitespace is also included. As a consequence, the resulting annotations are not visible to the following rules. In order to avoid this problem, you could, for example, trim the whitespace in these annotations:
EXEC(PlainTextAnnotator, {Line});
ADDRETAINTYPE(WS);
Line{-> TRIM(WS)};
REMOVERETAINTYPE(WS);

Whitespace tokenizer not working when using simple query string

I first implemented query search using SimpleQueryString, as shown below.
Entity Definition
@Entity
@Indexed
@AnalyzerDef(name = "whitespace",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    })
public class AdAccount implements SearchableEntity, Serializable {

    @Id
    @DocumentId
    @Column(name = "ID")
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    @Field(store = Store.YES, analyzer = @Analyzer(definition = "whitespace"))
    @Column(name = "NAME")
    private String name;

    // other properties and getters/setters
}
I use the whitespace tokenizer factory here because the default standard analyzer ignores special characters, which is not ideal in my use case. The document I referred to is https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-WhiteSpaceTokenizer, which describes it as a "Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens."
SimpleQueryString Method
protected Query inputFilterBuilder() {
    SimpleQueryStringMatchingContext simpleQueryStringMatchingContext =
            queryBuilder.simpleQueryString().onField("name");
    return simpleQueryStringMatchingContext
            .withAndAsDefaultOperator()
            .matching(searchRequest.getQuery() + "*")
            .createQuery();
}
searchRequest.getQuery() returns the search query string; I then append the prefix operator at the end so that prefix queries are supported.
However, this does not work as expected with the following example.
Say I have an entity whose name is "AT&T Account"; searching with "AT&" does not return this entity.
I then made the following changes to use a whitespace analyzer directly. This time, searching with "AT&" works as expected, but the search is now case-sensitive, i.e., searching with "at&" returns nothing.
@Field
@Analyzer(impl = WhitespaceAnalyzer.class)
@Column(name = "NAME")
private String name;
My questions are:
Why doesn't it work when I use the white space factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?
How to make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?
Why doesn't it work when I use the white space factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?
Wildcard and prefix queries (the one you're using when you add a * suffix to your query string) do not apply analysis, ever. This means your lowercase filter is not applied to your search query, even though it has been applied to your indexed text, so the query will never match: AT&* does not match the indexed at&t.
Using the @Analyzer annotation only worked because you removed the lowercasing at index time. With this analyzer, you ended up with AT&T (uppercase) in the index, and AT&* does match the indexed AT&T. It's just by chance, though: if you index At&t, you will end up with At&t in the index and hit the same problem.
How to make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?
As I mentioned above, the @Analyzer annotation is not the solution; you actually made your search worse.
There is no built-in solution to make wildcard and prefix queries apply analysis, mainly because analysis could remove pattern characters such as ? or *, and that would not end well.
You could restore your initial analyzer and lowercase the query yourself, but that will only get you so far: ASCII folding and other analysis features won't work.
The solution I generally recommend is to use an edge-ngrams filter. The idea is to index every prefix of every word, so "AT&T Account" would get indexed as the terms a, at, at&, at&t, a, ac, acc, acco, accou, accoun, account and a search for "at&" would return the correct results even without a wildcard.
See this answer for a more extensive explanation.
If you use the Elasticsearch integration, you will have to rely on a hack to make the "query-only" analyzer work properly. See here.

Want to Remove Markups from the Annotation - UIMA RUTA

If I use the P tag (from the HTML Annotator) as PASSAGE, I want to ignore the markups in the annotation.
SCRIPT:
//-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};
DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};
DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("<")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP(">")->MARK(RANGLEBRACKET)};
DECLARE LBracket,RBracket;
(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};
DECLARE PASSAGE,TESTPASSAGE;
"<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;
RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
PASSAGE{-> TRIM(WS)};
RETAINTYPE;
PASSAGE{->MARK(TESTPASSAGE)};
DECLARE TagContent,PassageFirstToken,InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)};
BLOCK(foreach) PASSAGE{} {
    Document{-> MARKFIRST(PassageFirstToken)};
}
TagContent{CONTAINS(PassageFirstToken), -PARTOF(InitialTag) -> MARK(InitialTag)};
BLOCK(foreach) PASSAGE{} {
    InitialTag ANY+{-> SHIFT(PASSAGE, 2, 2)};
}
Sample Input:
<p class="Normal"><a name="para1"><h1><b>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </b></a></p>
<p class="Normal"><a name="para2"><aus>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>
<p class="Normal"><a name="para3">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>
<p class="Normal"><a name="para4">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </a></p>
<p class="Normal"><a name="para5">On the Insert tab, the <span>galleries</span> include items that are designed to coordinate with the overall look of your document.</a></p>
I get 5 PASSAGE annotations but only 2 TESTPASSAGE annotations. Why is TESTPASSAGE reduced? And InitialTag is not tagged.
I have attached the output annotation image.
When reproducing the given example, I get 5 PASSAGE annotations and 3 TESTPASSAGE annotations (the last three PASSAGE annotations). The other two PASSAGE annotations are not annotated with TESTPASSAGE because they start with a MARKUP annotation, which is not visible by default and makes the complete annotation invisible. In order to avoid this problem, you can make MARKUP visible or trim markups from PASSAGE annotations (is this actually the main question?). Just extend your rules for the TRIM action:
RETAINTYPE(WS, MARKUP);
PASSAGE{-> TRIM(WS, MARKUP)};
RETAINTYPE;
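The other option mentioned above, making MARKUP visible instead of trimming it, could look roughly like this (a sketch, limited to the rule that needs to see the markup):
RETAINTYPE(MARKUP);
PASSAGE{-> MARK(TESTPASSAGE)};
RETAINTYPE;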
There are no InitialTag annotations because there are no TagContent annotations because there are no LBracket annotations in the example.
Btw, you could rewrite some rules:
PASSAGE{->MARKFIRST(PassageFirstToken)};
(LBracket # RBracket){-PARTOF(TagContent)-> TagContent};
DISCLAIMER: I am a developer of UIMA Ruta

A simple Ruta annotator

I just started with Ruta and I would like to write a rule that works like this:
it will try to match a word, e.g. XYZ, and when it hits it, it will assign the text that comes before it to the annotator CompanyDetails.
For example:
This is a paragraph that contains the phrase we are interested in, which follows the sentence. LL, Inc. a Delaware limited liability company (XYZ).
After running the script, the annotator CompanyDetails will contain the string:
LL, Inc. a Delaware limited liability company
I assume that you mean an annotation of the type 'CompanyDetails' when you talk about the annotator 'CompanyDetails'.
There are many (really many) different ways to solve this task. Here's one example that applies some helper rules:
DECLARE Annotation CompanyDetails (STRING context);
DECLARE Sentence, XYZ;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ; // should be done in a dictionary
// the actual rule
STRING s;
Sentence{-> MATCHEDTEXT(s)}->{XYZ{-> CREATE(CompanyDetails, "context" = s)};};
This example stores the string of the complete sentence in the feature. The rule matches on all sentences and stores the covered text in the variable 's'. Then the content of the sentence is investigated: an inlined rule tries to match on XYZ, creates an annotation of the type CompanyDetails, and assigns the value of the variable to the feature named context. I would rather store an annotation instead of a string, since you could still get the string with getCoveredText(). If you just need the tokens before XYZ in the sentence, then you could do something like this (with an annotation instead of a string this time):
DECLARE Annotation CompanyDetails (Annotation context);
DECLARE Sentence, XYZ, Context;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ;
// the actual rule
Sentence->{ #{-> Context} SPECIAL? @XYZ{-> GATHER(CompanyDetails, "context" = 1)};};
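As the comment 'should be done in a dictionary' in the first snippet suggests, the XYZ marker would normally come from a wordlist rather than a hard-coded string. A minimal sketch, assuming a hypothetical file 'CompanyMarkers.txt' with one marker per line:
WORDLIST CompanyMarkerList = 'CompanyMarkers.txt';
Document{-> MARKFAST(XYZ, CompanyMarkerList)};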