This might be a trivial question, but I'm new to Ruta so bear with me please.
My testdata consists of numbers in the following format:
0.1mm 0,11mm 1.1mm 1,1mm 1mm
I use the following rule to annotate the first four examples:
((NUM(COMMA|PERIOD)NUM) W{REGEXP("mm")}) {-> nummm};
Document{->MARK(nummm)};
Now I want to annotate "1mm", for example, too, but I'm kind of stuck right now, because I have no idea how to do this. I tried negating Conditions, like AFTER (as in "if NUM mm not after comma or period"), but it either didn't work or the syntax was wrong. Any help would be appreciated!
EDIT: I should add that I want to annotate a standalone "1mm", but not the "1mm" part that follows a comma or period. As of right now I basically annotate everything twice.
There are really a lot of ways to specify this in UIMA Ruta.
Here's the first thing that came to my mind:
(NUM{-PARTOF(nummm)} (PM{PARTOF({COMMA,PERIOD})} NUM)? W{REGEXP("mm")}){-> nummm};
This is probably not the "best" rule but should do what you want. There are three main changes:
I made the middle part of the rule optional so that it also matches on a single NUM.
I added the negated PARTOF condition at the first rule element, so the matching fails if the starting point is already covered by a nummm annotation. The - is a shortcut for the NOT condition.
I replaced the expensive disjunctive composed rule element with a simple one just because it is not really necessary here.
This rule works because the actions of a rule match are already executed before the next rule match is considered.
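For readers more familiar with regular expressions, the "optional middle part" idea can be sketched in plain Java. This is only an analogy and not Ruta: the class name MillimetreFinder is made up for illustration, and the regex engine's non-overlapping find() plays roughly the role of the -PARTOF(nummm) guard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java analogy (not Ruta); the class name is hypothetical.
class MillimetreFinder {
    // \d+          ~ the leading NUM
    // (?:[.,]\d+)? ~ the optional (COMMA or PERIOD) NUM part
    // mm           ~ the unit
    private static final Pattern MM = Pattern.compile("\\d+(?:[.,]\\d+)?mm");

    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = MM.matcher(text);
        // find() resumes after the previous match, so the "1mm" tail of
        // "0.1mm" is not reported a second time.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

Calling MillimetreFinder.findAll on the test data from the question should yield each of the five values exactly once.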
DISCLAIMER: I am a developer of UIMA Ruta.
How can we implement pattern matching in Spring Batch? I am using org.springframework.batch.item.file.mapping.PatternMatchingCompositeLineMapper.
I got to know that I can only use ? or * here to create my pattern.
My requirement is like below:
I have a fixed-length record file, and in each record the two characters at the 35th and 36th positions give the record type.
For example, in the record below "05" is the record type at the 35th and 36th positions, and the total length of the record is 400.
0000001131444444444444445589868444050MarketsABNAKKAAAAKKKA05568551456...........
I tried to write a regular expression, but it does not work. I learned that only two special characters can be used: * and ?.
In that case I can only write like this
??????????????????????????????????05?????????????..................
but that does not seem to be a good solution.
Please suggest how I can write this. Thanks a lot in advance for your help.
The PatternMatchingCompositeLineMapper uses an instance of org.springframework.batch.support.PatternMatcher to do the matching. It's important to note that PatternMatcher does not use true regular expressions. It uses something closer to ant patterns (the code is actually lifted from AntPathMatcher in Spring Core).
That being said, you have two options:
Use a pattern like you are referring to (since there is no short hand way to specify the number of ? that should be checked like there is in regular expressions).
Create your own composite LineMapper implementation that uses regular expressions to do the mapping.
For the record, if you choose option 2, contributing it back would be appreciated!
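Both options can be sketched in plain Java. The class below is a hypothetical helper and not part of Spring Batch: antStylePattern generates the long ?-pattern from a 1-based position instead of typing it by hand, and register/map show the idea behind a regex-keyed composite mapper. Wiring either into an actual LineMapper is left out, and the names are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spring Batch.
class RecordTypeRouter<T> {
    private final Map<Pattern, Function<String, T>> routes = new LinkedHashMap<>();

    // Option 1: generate the ant-style key, e.g. 34 x '?' + "05" + '*'.
    // position is the 1-based column where the record type starts.
    static String antStylePattern(int position, String recordType) {
        return "?".repeat(position - 1) + recordType + "*";
    }

    // Option 2: register a real regular expression with the mapping to apply.
    RecordTypeRouter<T> register(String regex, Function<String, T> mapper) {
        routes.put(Pattern.compile(regex), mapper);
        return this;
    }

    // Applies the first registered mapper whose regex matches the whole line.
    T map(String line) {
        for (Map.Entry<Pattern, Function<String, T>> route : routes.entrySet()) {
            if (route.getKey().matcher(line).matches()) {
                return route.getValue().apply(line);
            }
        }
        throw new IllegalStateException("No pattern matches line: " + line);
    }
}
```

With a regex, the record type at positions 35 and 36 can be written as .{34}05.* rather than as a run of 34 question marks. The string returned by antStylePattern(35, "05") is meant to be used as a key in the pattern maps passed to PatternMatchingCompositeLineMapper.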
I would like to annotate the first token of a text and use that annotation in following rules. I have tried different patterns:
Token.begin == 0 (doesn't work, although there definitely is a token that begins at 0)
Token{STARTSWITH(DocumentMetaData)}; (also doesn't work)
The only pattern that works is:
Document{->MARKFIRST(First)};
But if I try to use that annotation e.g. in the following way:
First{->MARK(FirstAgain)};
it doesn't work again. This makes absolutely no sense to me. There seems to be a really weird behaviour with annotations that start at 0.
This trivial task can be a bit tricky indeed, mainly because of the visibility settings. I do not know why your rules in the question do not work without having a look at the text that should be processed.
As of UIMA Ruta 2.7.0, I prefer a rule like:
# Token{->First};
Here some additional thoughts about the rules in the question:
Token.begin == 0;
Normally, there is no token that begins at 0, since the document starts with some whitespace or line breaks. If there actually is a token that starts at offset 0 and the rule does not match, then something invisible is covering the begin or the end of the token. This of course depends on the filtering settings, but if you did not change them, it could be a BOM.
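One way to check the assumption that the first token really starts at offset 0 is to look at what the document text begins with before it reaches Ruta. A minimal plain-Java sketch follows; the class and method names are made up for illustration and are not part of UIMA or Ruta.

```java
// Hypothetical diagnostic helper, not part of UIMA or Ruta.
class LeadingInvisibles {
    // Returns the offset of the first character that is neither a BOM
    // (U+FEFF) nor whitespace, or -1 if the text has no such character.
    static int firstVisibleOffset(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '\uFEFF' && !Character.isWhitespace(c)) {
                return i;
            }
        }
        return -1;
    }
}
```

If this returns anything other than 0 for the document text, the first visible token cannot begin at offset 0.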
Token{STARTSWITH(DocumentMetaData)};
Here, either the problem above applies, or the begin offset is not identical. If the DocumentMetaData covers the complete document, then I would bet on the leading whitespaces. Another reason could be that the internal indexing is broken, e.g., the tokens or the DocumentMetaData are created by an external analysis engine which was called with EXEC and no reindexing was configured in the action. This situation could also occur with unfortunate optimizations using the config params.
Document{->MARKFIRST(First)};
First{->MARK(FirstAgain)};
MARKFIRST creates an annotation using the offset of the first RutaBasic in the matched context IIRC. If the document starts with something invisible, e.g., a line break, then the second rule cannot match.
As general advice for situations like this, when some obviously simple rules do not work as expected, I recommend adding some additional rules and using the debugging config with the explanation view. A rule like Token; can directly highlight whether the visibility settings are problematic for the given tokens.
DISCLAIMER: I am a developer of UIMA Ruta
I have a situation where I would like to combine a set of annotations into one new annotation, using the found annotations as features of the new one. One of the annotations would serve as the boundary for the new annotation. All of the other associated annotations would be within a set number of tokens of the main one. The linked article here is similar to my problem, but I can't rely on positions of annotations; they could be before or after the main annotation AND in any order. This article alludes to a way to handle this scenario: "You can also specify rules that do not care [about position], but they will probably return too many matches." Could someone explain how this would work?
UIMA Ruta Creating annotation with features separated by some text
So you have defined your boundaries (as annotations) and want to annotate what lies in between, as features for the new annotation (irrespective of the order). Correct me if I got this wrong.
If that's the case, an illustration/solution example would look like:
(StartBoundary # EndBoundary){-> CREATE(NewAnnotation, "featureA" = A, "featureB" = B)};
assuming the annotations A and B already exist in the isolated span.
The problem occurs when you have more than one annotation of type A and/or B. In this case, featureA and/or featureB will be set to the first occurrence of A and/or B.
Here is an example:
assertThat(commentById.getId()).isNotNull();
assertThat(commentById.getContent()).isNotBlank();
assertThat(commentById.getAuthor()).isNotNull();
assertThat(commentById.getAuthor().getUsername()).isNotBlank();
assertThat(commentById.getAuthor().getAvatar()).isNotBlank();
assertThat(commentById.getAuthor().getId()).isNotNull();
Is there any way to chain this into a single assertThat statement?
Sorry for the unclear question. I mean, are there fluent method calls to chain multiple assertThat statements together? Here is an example I can think of:
assertThat(commentById)
.isNotNull()
.and(Comment::getID).isNotNull()
.and(Comment::getContent).isNotBlank()
.and(Comment::getAuthor).is(author->{
author.isNotNull()
.and(User::getID).isNotNull()
.and(User::getAvatar).isNotBlank()
.and(User::getUsername).isNotBlank()
});
You can use the satisfies method:
assertThat(commentById.getId()).isNotNull();
assertThat(commentById.getContent()).isNotBlank();
assertThat(commentById.getAuthor()).isNotNull().satisfies(author -> {
assertThat(author.getUsername()).isNotBlank();
assertThat(author.getAvatar()).isNotBlank();
assertThat(author.getId()).isNotNull();
});
This helps to eliminate repeated code when testing nested structures.
If you want the commentById object itself to be tested in a single statement, the same approach can in theory be applied to it (assertThat(commentById).satisfies(c -> {assertThat(c.getId()).isNotNull(); ...})). I mention this only to answer your question literally; I don't see any real benefit in such an expression.
This is not possible at the moment. What is possible is to use extracting, but that means navigating from the current actual to the extracted one, without being able to go back to the original actual.
I'm using the Eclipse parser to work with expressions and statements in Java code.
I have a function:
public boolean visit(PostfixExpression node)
which deals with postfix expressions, such as i++;
The problem is that I want to distinguish between a for statement postfix and other postfixes.
I thought maybe I could get the node's parent and somehow check whether it's a for. Something like node.getParent()... but node.getParent() doesn't return an expression.
Any ideas how to recognize if the PostfixExpression belongs to a for loop?
Thanks
edit:
By "for statement postfix" I mean the postfix in the for loop's first line. Such as:
for(i=0;i<10;i++)
So I want to distinguish this i++ from other i++'s.
Can't you just call ASTNode.getParent() to see what kind of statement the expression is contained in?
I solved this by creating a for_updaters List (using node.updaters()) and updating it every time I visit a for loop (this could also be nested loops). Also, whenever I come across a PostfixExpression (including for updaters), I add it to another List, and then delete from this List all similar occurrences that appear in the for_updaters List. This way I'm only left with non-for-updater postfixes. This also works for me because every time I visit a for loop I clear both Lists, so no worries about duplicate variable names.
Note: node.updaters() returns the exact full expression: [i++]. But I only need i. It's easy to extract it by converting the updater to a String and then using substring().
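The two string-level steps described above can be sketched in plain Java as follows. This works on the textual form of the expressions only and does not touch the JDT API; the class and method names are made up for illustration. As an aside, JDT's PostfixExpression also offers getOperand(), which may make the String round-trip unnecessary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper operating on the textual form of the expressions.
class PostfixStrings {
    // "i++" -> "i", "count--" -> "count"
    static String operandOf(String postfix) {
        String s = postfix.trim();
        if (s.endsWith("++") || s.endsWith("--")) {
            return s.substring(0, s.length() - 2).trim();
        }
        throw new IllegalArgumentException("Not a postfix expression: " + postfix);
    }

    // Returns the postfixes that do not appear among the for-loop updaters.
    // removeAll drops every element equal to one in forUpdaters.
    static List<String> withoutForUpdaters(List<String> postfixes, List<String> forUpdaters) {
        List<String> rest = new ArrayList<>(postfixes);
        rest.removeAll(forUpdaters);
        return rest;
    }
}
```

Because the comparison is by text, a postfix in the loop body that reads the same as an updater is removed as well, which matches the "all similar occurrences" behaviour described above.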