How can we annotate a Unicode character in UIMA Ruta?

How can we annotate a Unicode character in UIMA Ruta?
For example, I want to mark this text (Paris: Éditions Robert Laffont), so I used the following rule:
DECLARE CITY;
CW COLON CW+{->MARK(CITY,1,3)};
But the annotation covered only the text up to Paris: Ã. Is there any way to solve this problem? Thanks in advance.

It's all about the definition of the lexer, which creates the token class annotations of Ruta (W, CW, SPECIAL, ...).
The rule CW COLON CW+{->MARK(CITY,1,1)}; creates an annotation of the type CITY for the text span Paris regardless of the unicode character.
The last rule element CW+ matches on Ã, since this is annotated as a CW, but stops there, since ‰ is not a CW but a SPECIAL.
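To see where the two characters might come from: the pair Ã‰ is what the UTF-8 bytes of É look like when they are decoded as windows-1252. That the asker's input was read with the wrong encoding is only an assumption here, not something stated in the thread. The sketch below (class and method names are made up for illustration) reproduces the pair in plain Java and checks the character classes that make the lexer stop between them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: the text was stored as UTF-8 but decoded as windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "\u00C9" is É; unicode escapes keep the source file encoding-independent.
        String broken = misdecode("\u00C9ditions");
        // First character is U+00C3 (Ã), an uppercase letter.
        System.out.println(Character.isUpperCase(broken.charAt(0)));
        // Second character is U+2030 (‰), which is not a letter at all.
        System.out.println(Character.isLetter(broken.charAt(1)));
    }
}
```

If this is indeed the cause, reading the document as UTF-8 before it reaches Ruta would remove the problem at its source, independently of the rule changes suggested here.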
There are different ways to avoid this problem. My advice would be that you should rely on a different type of annotation for your rules. The job of the lexer annotations of ruta is to create minimal annotations. They do not define tokens in general.
You could maybe use something like this (or use an actual tokenizer for better performance):
DECLARE CITY;
DECLARE Token;
RETAINTYPE(SPACE);
(W (SPECIAL? W)*){-> Token};
RETAINTYPE;
Token COLON Token+{->MARK(CITY,1,1)};
DISCLAIMER: I am a developer of UIMA Ruta

Related

What characters are allowed in the name of a rule in Drools?

I haven't been able to find in the Drools documentation which characters (beyond letters of the alphabet) are allowed or disallowed in a rule name. Does anyone know or have a reference?
The only relevant section of the Drools documentation I've found so far does not specify:
Each rule must have a unique name within the rule package. If you use the same rule name more than once in any DRL file in the package, the rules fail to compile. Always enclose rule names with double quotation marks (rule "rule name") to prevent possible compilation errors, especially if you use spaces in rule names.
I think I have discovered, anecdotally, that some "grouping" characters do not work in rule names (rules named with them seem not to be found or included), or at least not in extension rules (the extended rule seems to work with grouping chars, but not its extension; example below). The grouping chars include parentheses "()", square brackets "[]", and curly braces "{}". Less-than and greater-than signs "<>" do work, so for now I'm replacing the former with the latter.
Or are there escape chars for the problematic grouping chars?
Example:
rule "(grouping chars, and commas, work here)"
when
// conditions LHS
then
end
// removing parentheses, or replacing with < >,
// from below line works
rule "(grouping chars DON'T work here)"
extends "(grouping chars, and commas, work here)"
when
then
// consequences RHS
end
I haven't yet determined either way for other characters (for example, other punctuation), except that commas "," work. It would be nice to know ahead of time which characters are allowed.
Theoretically every identifier inside a string should work, but you might have empirically found some combination that is breaking the grammar somehow.
Thanks for the investigation. I've filed a Jira issue; please take a look at it.

Change the '#' character in FreeMarker templates

To use if statements in FreeMarker templates, the following syntax is used:
[#if ${numberCoupons} <= 1]
[#assign couponsText = 'coupon']
[/#if]
Is there a way to replace the '#' character with something else? I am trying to integrate FreeMarker with Drools (a Java-based rule engine), where '#' marks the start of a comment, so the formatting breaks.
There isn't anything for that out of the box (it uses a JavaCC-generated parser, which is static). But you can write a TemplateLoader that just delegates to another TemplateLoader, but replaces the Reader with a FilterReader that replaces [% and [/% and [%-- and --%] with [#, etc. Then you can use % instead of # in the FreeMarker tags. (It's somewhat confusing though, as error messages will still use #, etc.)
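A minimal sketch of the text rewriting such a loader would perform, using only the JDK so it runs on its own. The class and method names are hypothetical, and for simplicity it reads the whole template into memory instead of filtering the stream as a FilterReader would. In a real setup, the getReader method of the delegating TemplateLoader would pass the delegate's Reader through wrap before returning it.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PercentTagRewriter {
    // Map the assumed '%' tag syntax back to FreeMarker's square-bracket syntax.
    static String toFreeMarkerSyntax(String template) {
        return template
                .replace("--%]", "--]")  // comment end:  --%] becomes --]
                .replace("[/%", "[/#")   // closing tag:  [/%if] becomes [/#if]
                .replace("[%", "[#");    // opening tag and comment start
    }

    // Reads the delegate's Reader fully, closes it, and returns a Reader
    // over the rewritten template text.
    static Reader wrap(Reader original) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader in = original) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return new StringReader(toFreeMarkerSyntax(text.toString()));
    }
}
```

Note that this blindly rewrites every occurrence, so a literal "[%" in the template body would also be turned into a tag start.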
As @ddekany wrote, you can write code that transforms the template so it does not need the pound sign. But note that the replacement syntax can clash with HTML or XML (and similar) tags, at least from an editor's perspective.

How can I change the separator option in WORDTABLE? - UIMA Ruta

From the Ruta documentation: a WORDTABLE is simply a comma-separated file (.csv), which actually uses semicolons for separation of the entries. I need to change the separator, because some of my text contains semicolons and therefore ends up split across separate columns.
When I changed the separator option, I received an error message.
How can I solve this issue?
Example:
I really like beef, with mushroom sauce; pasta, with Alfredo sauce; and salad, with French dressing.;0
Think before you speak.;1
We had students from Lima, Peru; Santiago, Chile; and Caracas, Venezuela.;2
Up to the current UIMA Ruta version (2.6.1), changing the separator for the csv tables is not supported. Unfortunately, quoting is also not supported.
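To make the effect on the example rows concrete, here is a small illustration of what any reader that splits on semicolons without quoting will see. This is not Ruta's actual parsing code, and the class name is made up; it only counts the fields per line.

```java
class SemicolonSplitDemo {
    // Number of fields a plain split on ';' produces (no quoting, no escaping).
    static int fieldCount(String line) {
        return line.split(";", -1).length;
    }

    public static void main(String[] args) {
        // Intended layout is two columns: text;id
        System.out.println(fieldCount("Think before you speak.;1"));
        System.out.println(fieldCount(
            "We had students from Lima, Peru; Santiago, Chile; and Caracas, Venezuela.;2"));
    }
}
```

The first row yields the intended two fields, while the second one falls apart into four, which matches the behaviour described in the question.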
DISCLAIMER: I am a developer of UIMA Ruta

Uima Ruta CW Annotation

I'm using UIMA Ruta version 2.5.0. In this version, symbols like Γ and Δ are annotated as CW. Why is this happening?
Input
Γ
Δ
The CW annotation, like the other TokenSeed annotations, is created by a JFlex lexer. The rule for CW is [:uppercase:][:lowercase:]*, where [:uppercase:] is defined by the Unicode property \p{Uppercase}. Both of your example symbols are Greek uppercase letters.
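The same behaviour can be checked outside of Ruta. The sketch below approximates the CW rule with Java's regex engine, which exposes the Unicode binary properties as \p{IsUppercase} and \p{IsLowercase}. It is only an approximation of the JFlex definition, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CwPatternDemo {
    // Approximation of [:uppercase:][:lowercase:]* with Java's Unicode binary properties.
    static final Pattern CW = Pattern.compile("\\p{IsUppercase}\\p{IsLowercase}*");

    static boolean looksLikeCw(String token) {
        return CW.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCw("\u0393")); // Γ, GREEK CAPITAL LETTER GAMMA
        System.out.println(looksLikeCw("\u0394")); // Δ, GREEK CAPITAL LETTER DELTA
        System.out.println(looksLikeCw("paris"));  // starts with a lowercase letter
    }
}
```

Both Greek capitals match the pattern just like an ordinary capitalized word such as "Paris" does, which is why they end up as CW.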
DISCLAIMER: I am a developer of UIMA Ruta

UIMA Ruta - Abbreviations

Can I segment the letters of a word using Uima Ruta?
Ex.
1.(WHO)
2.(APIAs)
Script:
DECLARE NEW;
BLOCK (foreach)CAP{}
{
W{REGEXP(".")->MARK(NEW)};
}
Yes, this is achieved with simple regex rules in UIMA Ruta:
DECLARE Char;
CAP->{"."->Char;};
You cannot use normal rules for this because you need to match on something smaller than RutaBasic. The only option is to use regexp rules, which operate directly on the text instead of on annotations. You should of course be very careful, since this can lead to a very large number of annotations.
Some explanation for the somewhat compact rule: CAP->{"."->Char;};
CAP // the only rule element of the rule: match on each CAP annotation
->{// indicates that inlined rules follow that are applied in the context of the matched annotation.
"." // a regular expression matching on each character
-> Char // the "action" of the regex rule: create an annotation of the type Char for each match of the regex
;}; // end of regex rule, end of inlined rules, end of actual rule
Summarizing, the rule iterates over all CAP annotations, applies a regular expression on each iterated covered text and creates annotations for the matches.
You can of course also use a BLOCK instead of an inlined rule.
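For readers who want to see the effect in terms of offsets, the iteration described above can be mimicked in plain Java. This is only an analogy, not how Ruta is implemented: the method name is hypothetical, and the covered text and begin offset of a CAP annotation are passed in by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSpans {
    // Applies the regex "." to the covered text of one annotation and returns
    // one {begin, end} pair per match, shifted to document offsets.
    static List<int[]> charSpans(String coveredText, int annotationBegin) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(".").matcher(coveredText);
        while (m.find()) {
            spans.add(new int[] { annotationBegin + m.start(), annotationBegin + m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        // "WHO" starts at offset 1 in the example text "(WHO)".
        for (int[] span : charSpans("WHO", 1)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

Each character of the CAP token yields its own span, which also shows why the number of annotations grows quickly on longer texts.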
DISCLAIMER: I am a developer of UIMA Ruta