using ruta, annotate a line containing annotation and extract required data - uima

I want to annotate each line that contains specific annotations so that I can extract text from it. Concretely, I want to annotate the Borrower line and the Co-Borrower line and get their respective SSNs.
Borrower Name: Alice Johnson SSN: 123-456-7890
Co-Borrower Name: Bob Symonds SSN: 987-654-3210
Code:
PACKAGE uima.ruta.test;
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
EXEC(PlainTextAnnotator, {Line});
DECLARE Borrower, Name;
DECLARE BorrowerName(String value, String label);
CW{REGEXP("\\bBorrower") -> Borrower} CW{REGEXP("Name") -> Name};
Borrower Name COLON n:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=n.ct)};
DECLARE SSN;
DECLARE BorrowerSSN(String label, String value);
W{REGEXP("SSN") -> SSN};
SSN COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="Borrower SSN", "value"=n.ct)};
DECLARE Co;
CW{REGEXP("Co") -> Co};
DECLARE CoBorrowerName(String label, String value);
Co Borrower Name COLON n:CW[1,3]{-> CREATE(CoBorrowerName, "label"="Co-Borrower Name", "value"=n.ct)};
DECLARE BorrowerLine;
Line{CONTAINS(Borrower),CONTAINS(Name)->MARK(BorrowerLine)};
Please suggest how to annotate a line that contains a given annotation and how to get the label and value for the required annotation.

To spare yourself from handling the separate strings, you could gather all the indicators to a wordlist (i.e., a text file containing one indicator per line) and place it in your project resources folder (see this for more details). Then you could just mark all the indicators with the desired indicator type:
WORDLIST IndicatorList ='IndicatorList.txt';
DECLARE Indicator;
Document{->MARKFAST(Indicator, IndicatorList )};
This would output Indicator helper annotations like "Borrower Name".
Once you have that, you could now iterate over the lines and find the target annotations.
DECLARE Invisible;
SPECIAL{-PARTOF(Invisible), REGEXP("[-]")-> Invisible};
BLOCK(line) Line{CONTAINS(Indicator)}{
    //Ex. pattern: Borrower Name: Alice Johnson SSN: 123-456-7890
    Indicator COLON c:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=c.ct)} Indicator;
    FILTERTYPE(Invisible);
    Indicator COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="BorrowerSSN", "value"=n.ct)};
    REMOVEFILTERTYPE(Invisible);
}
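To check what the rules should produce for the sample lines, it can help to compute the expected label/value pairs outside Ruta. The following is only a rough sketch in plain Python, not Ruta and not part of the answer above; the names LINE_PATTERN and extract_borrowers are made up for illustration, and the pattern assumes the exact layout and the 3-3-4 digit SSN format of the two sample lines.

```python
import re

# Hypothetical pattern for lines shaped like the two sample lines:
# "<role> Name: <1-3 capitalized words> SSN: ddd-ddd-dddd"
LINE_PATTERN = re.compile(
    r"^(?P<role>Co-Borrower|Borrower) Name:\s*"
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2})\s+"
    r"SSN:\s*(?P<ssn>\d{3}-\d{3}-\d{4})\s*$"
)


def extract_borrowers(text):
    """Return one dict per matching line with role, name and SSN."""
    records = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            records.append(
                {
                    "role": match["role"],
                    "name": match["name"],
                    "ssn": match["ssn"],
                }
            )
    return records


sample = (
    "Borrower Name: Alice Johnson SSN: 123-456-7890\n"
    "Co-Borrower Name: Bob Symonds SSN: 987-654-3210\n"
)
records = extract_borrowers(sample)
```

Comparing these records with the BorrowerName and BorrowerSSN annotations created by the script is one way to spot lines that the rules missed.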
Hope this helps.

Addition to Viorel's answer:
The PlainTextAnnotator creates annotations of the type Line, and these annotations cover the complete line, which means that leading and trailing whitespace is included. Because whitespace is not visible by default, a Line annotation that begins or ends with whitespace is itself not visible to the following rules. To avoid this problem, you could, for example, trim the whitespace from these annotations:
EXEC(PlainTextAnnotator, {Line});
ADDRETAINTYPE(WS);
Line{-> TRIM(WS)};
REMOVERETAINTYPE(WS);
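Conceptually, trimming only moves the begin and end offsets of the annotation inward until they no longer point at whitespace. The snippet below is a minimal Python sketch of that idea, not the Ruta implementation; trim_span is a made-up helper name.

```python
def trim_span(text, begin, end):
    """Shrink the half-open span [begin, end) until it neither starts
    nor ends with a whitespace character."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end


line = "  Borrower Name: Alice Johnson \n"
begin, end = trim_span(line, 0, len(line))
trimmed = line[begin:end]
```

After trimming, the span starts at the first visible token and ends at the last one, which is what makes the annotation matchable again.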

Want to remove markup from an annotation - UIMA Ruta

I use the P tag (from the HtmlAnnotator) as PASSAGE, and I want to exclude the markup from the annotation.
SCRIPT:
//-------------------------------------------------------------------
// SPECIAL SQUARE HYPHEN PARENTHESIS
//-------------------------------------------------------------------
DECLARE LParen, RParen;
SPECIAL{REGEXP("[(]") -> MARK(LParen)};
SPECIAL{REGEXP("[)]") -> MARK(RParen)};
DECLARE LSQParen, RSQParen;
SPECIAL{REGEXP("[\\[]") -> MARK(LSQParen)};
SPECIAL{REGEXP("[\\]]") -> MARK(RSQParen)};
DECLARE LANGLEBRACKET,RANGLEBRACKET;
SPECIAL{REGEXP("<")->MARK(LANGLEBRACKET)};
AMP{REGEXP("<")->MARK(LANGLEBRACKET)};
SPECIAL{REGEXP(">")->MARK(RANGLEBRACKET)};
AMP{REGEXP(">")->MARK(RANGLEBRACKET)};
DECLARE LBracket,RBracket;
(LParen|LSQParen|LANGLEBRACKET){->MARK(LBracket)};
(RParen|RSQParen|RANGLEBRACKET){->MARK(RBracket)};
DECLARE PASSAGE,TESTPASSAGE;
"<a name=\"para(.+?)\">(.*?)</a>"->2=PASSAGE;
RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
PASSAGE{-> TRIM(WS)};
RETAINTYPE;
PASSAGE{->MARK(TESTPASSAGE)};
DECLARE TagContent,PassageFirstToken,InitialTag;
LBracket ANY+? RBracket{-PARTOF(TagContent)->MARK(TagContent,1,3)};
BLOCK(foreach)PASSAGE{}
{
Document{->MARKFIRST(PassageFirstToken)};
}
TagContent{CONTAINS(PassageFirstToken),-PARTOF(InitialTag)->MARK(InitialTag)};
BLOCK(foreach)PASSAGE{}
{
InitialTag ANY+{->SHIFT(PASSAGE,2,2)};
}
Sample Input:
<p class="Normal"><a name="para1"><h1><b>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </b></a></p>
<p class="Normal"><a name="para2"><aus>On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>
<p class="Normal"><a name="para3">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document.</a></p>
<p class="Normal"><a name="para4">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your document. </a></p>
<p class="Normal"><a name="para5">On the Insert tab, the <span>galleries</span> include items that are designed to coordinate with the overall look of your document.</a></p>
I get 5 PASSAGE annotations but only 2 TESTPASSAGE annotations. Why are there fewer TESTPASSAGE annotations? Also, InitialTag is not annotated at all.
I have attached the output annotation image
When reproducing the given example, I get 5 PASSAGE annotations and 3 TESTPASSAGE annotations (the last three PASSAGE annotations). The other two PASSAGE annotations are not annotated with TESTPASSAGE because they start with a MARKUP annotation, which is not visible by default and therefore makes the complete annotation invisible. To avoid this problem, you can make MARKUP visible or trim markup from the PASSAGE annotations (is this actually the main question?). Just extend your rules for the TRIM action:
RETAINTYPE(WS, MARKUP);
PASSAGE{-> TRIM(WS, MARKUP)};
RETAINTYPE;
There are no InitialTag annotations because there are no TagContent annotations because there are no LBracket annotations in the example.
Btw, you could rewrite some rules:
PASSAGE{->MARKFIRST(PassageFirstToken)};
(LBracket # RBracket){-PARTOF(TagContent)-> TagContent};
DISCLAIMER: I am a developer of UIMA Ruta

Xstream ignore whitespace characters

I load data from XML into Java classes using the XStream library. The text in several tags is very long and spans more than one line. Because of this formatting, the corresponding Java fields contain additional characters such as \n and \t. Is there a way to load the data from the XML file without these characters?
The XML element spans two lines: the opening tag is on the first line, followed by the very long text, and the closing tag is on the second line.
You can use regex or the string split method.
String string = "004-034556";
String[] parts = string.split("-");
String part1 = parts[0]; // 004
String part2 = parts[1]; // 034556
Just split your string. In your case it would be
String wantedText = parts[0];
Another solution would be to put your values into a string array, loop over the array, and match and remove any characters you don't want.
You can see how to match and remove here.
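If the goal is only to get rid of the line breaks and tabs inside the loaded text, collapsing every run of whitespace into a single space may be enough. Here is a small sketch of that idea in Python; collapse_whitespace is a made-up name. In Java, the same thing could be done on the field after loading with value.replaceAll("\\s+", " ").trim().

```python
import re


def collapse_whitespace(value):
    """Replace each run of whitespace (spaces, tabs, newlines) with one
    space and strip leading and trailing whitespace."""
    return re.sub(r"\s+", " ", value).strip()


raw = "\n\t\tA very long text\n\t\tthat spans more than one line\n\t"
clean = collapse_whitespace(raw)
```

Note that this also collapses intentional double spaces, so it only fits fields where the exact whitespace does not matter.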

a simple Ruta annotator

I just started with Ruta and I would like to write a rule that will work like this:
it will try to match a word e.g. XYZ and when it hits it, it will then assign the text that comes before to the Annotator CompanyDetails.
For example :
This is a paragraph that contains the phrase we are interested in, which follows the sentence. LL, Inc. a Delaware limited liability company (XYZ).
After running the script the annotator CompanyDetails will contain the string:
LL, Inc. a Delaware limited liability company
I assume that you mean annotation of the type 'CompanyDetails' when you talk about annotator 'CompanyDetails'.
There are many (really many) different ways to solve this task. Here's one example that applies some helper rules:
DECLARE Annotation CompanyDetails (STRING context);
DECLARE Sentence, XYZ;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ; // should be done in a dictionary
// the actual rule
STRING s;
Sentence{-> MATCHEDTEXT(s)}->{XYZ{-> CREATE(CompanyDetails, "context" = s)};};
This example stores the string of the complete sentence in the feature. The rule matches on all sentences and stores the covered text in the variable 's'. Then, the content of the sentence is investigated: an inlined rule tries to match on XYZ, creates an annotation of the type CompanyDetails, and assigns the value of the variable to the feature named context. I would rather store an annotation instead of a string, since you could still get the string with getCoveredText(). If you just need the tokens before XYZ in the sentence, then you could do something like this (with an annotation instead of a string this time):
DECLARE Annotation CompanyDetails (Annotation context);
DECLARE Sentence, XYZ, Context;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ;
// the actual rule
Sentence->{ #{-> Context} SPECIAL? @XYZ{-> GATHER(CompanyDetails, "context" = 1)};};

I don't understand what a YAML tag is

I get it on some level, but I have yet to see an example that didn't bring up more questions than answers.
http://rhnh.net/2011/01/31/yaml-tutorial
# Set.new([1,2]).to_yaml
--- !ruby/object:Set
hash:
  1: true
  2: true
I get that we're declaring a Set tag. I don't get what the subsequent hash mapping has to do with it. Are we declaring a schema? Can someone show me an example with multiple tag declarations?
I've read through the spec: http://yaml.org/spec/1.2/spec.html#id2761292
%TAG ! tag:clarkevans.com,2002:
Is this declaring a schema? Is there something else a parser has to do in order to successfully parse the file? A schema file of some type?
http://www.yaml.org/refcard.html
Tag property: # Usually unspecified.
none : Unspecified tag (automatically resolved by application).
'!' : Non-specific tag (by default, "!!map"/"!!seq"/"!!str").
'!foo' : Primary (by convention, means a local "!foo" tag).
'!!foo' : Secondary (by convention, means "tag:yaml.org,2002:foo").
'!h!foo': Requires "%TAG !h! <prefix>" (and then means "<prefix>foo").
'!<foo>': Verbatim tag (always means "foo").
Why is it relevant to have a primary and secondary tag, and why does a secondary tag refer to a URI? What problem is being solved by having these?
I seem to see a lot of "what they are", and no "why are they there", or "what are they used for".
I don't know a lot about YAML but I'll give it a shot:
Tags are used to denote types. A tag is declared using ! as you have seen from the "refcard" there. The %TAG directive is kind of like declaring a shortcut to a tag.
I'll demonstrate with PyYAML. PyYAML can parse the secondary tag !!python/object: as an actual Python object. The double exclamation mark is itself a substitution: it is short for the prefix tag:yaml.org,2002:, so the whole expression resolves to tag:yaml.org,2002:python/object:. That is a little unwieldy to type every time we want to create an object, so we give it an alias using the %TAG directive:
%TAG !py! tag:yaml.org,2002:python/object: # declares the tag alias
---
- !py!__main__.MyClass # creates an instance of MyClass
- !!python/object:__main__.MyClass # equivalent with no alias
- !<tag:yaml.org,2002:python/object:__main__.MyClass> # equivalent using a verbatim tag
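To make the substitution rules from the refcard concrete, here is a rough Python sketch of how a tag shorthand could be expanded into a full tag. It is not how PyYAML implements it; resolve_tag and its handles argument are made-up names, and only the cases listed in the refcard are covered.

```python
DEFAULT_HANDLES = {
    "!": "!",                    # primary handle: local tags such as !foo
    "!!": "tag:yaml.org,2002:",  # secondary handle: standard YAML tags
}


def resolve_tag(shorthand, handles=None):
    """Expand a tag shorthand using the default handles plus any
    handles declared with %TAG (passed in as a dict)."""
    table = dict(DEFAULT_HANDLES)
    table.update(handles or {})
    if shorthand.startswith("!<") and shorthand.endswith(">"):
        return shorthand[2:-1]  # verbatim tag: taken exactly as written
    if not shorthand.startswith("!"):
        raise ValueError("not a tag shorthand: %r" % shorthand)
    second = shorthand.find("!", 1)
    if second == -1:
        return table["!"] + shorthand[1:]  # primary handle
    handle, suffix = shorthand[: second + 1], shorthand[second + 1:]
    return table[handle] + suffix  # secondary or named handle


declared = {"!py!": "tag:yaml.org,2002:python/object:"}
via_alias = resolve_tag("!py!__main__.MyClass", declared)
via_secondary = resolve_tag("!!python/object:__main__.MyClass")
via_verbatim = resolve_tag("!<tag:yaml.org,2002:python/object:__main__.MyClass>")
```

All three spellings expand to the same full tag, which is why the three list items above are equivalent.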
Nodes are parsed by their default type if you have no tag annotations. The following are equivalent:
- 1: Alex
- !!int "1": !!str "Alex"
Here is a complete Python program using PyYAML demonstrating tag usage:
import yaml

class Entity:
    def __init__(self, idNum, components):
        self.id = idNum
        self.components = components
    def __repr__(self):
        return "%s(id=%r, components=%r)" % (
            self.__class__.__name__, self.id, self.components)

class Component:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return "%s(name=%r)" % (
            self.__class__.__name__, self.name)

text = """
%TAG !py! tag:yaml.org,2002:python/object:__main__.
---
- !py!Component &transform
  name: Transform
- !!python/object:__main__.Component &render
  name: Render
- !<tag:yaml.org,2002:python/object:__main__.Entity>
  id: 123
  components: [*transform, *render]
- !<tag:yaml.org,2002:int> "3"
"""

# python/object tags build arbitrary Python objects, so they need the
# unsafe loader; only use it on input you trust.
result = yaml.load(text, Loader=yaml.UnsafeLoader)
print(result)
More information is available in the spec

How to extract a constant substring from an existing constant string in Eclipse?

Say I've got an existing constant string:
private final static String LOREM_IPSUM = "Lorem Ipsum";
Is there a way in Eclipse to quickly extract a substring of this as another constant such that I could end up with something like:
private final static String LOREM = "Lorem";
private final static String IPSUM = "Ipsum";
private final static String LOREM_IPSUM = LOREM + " " + IPSUM;
In this particular case, two refactorings (one for LOREM and one for IPSUM) would suffice.
There's a Quick Assist that you can use to pull out one bit of a quoted string. Select the text you want to pull out and hit Ctrl+1 (that's the digit for "one"). You'll see a quick assist for "Pick out selected part of String". Choose it; Eclipse will break up your string for you.
If you don't already use it, get familiar with the "Select Enclosing Element" key combination (Shift+Alt+Up). If you put the cursor in the middle of a string and hit that combination, the whole string will be selected. Select it again, and the expression containing it will be selected. You will do this dozens of times each day.
Try this sequence of steps:
Put your cursor after the m in Lorem, and press enter.
Move the cursor right by one character.
Press enter.
Select "Lorem" and perform Refactor -> Extract Constant...
Select "Ipsum" and perform Refactor -> Extract Constant...