UIMA RUTA wordlist matching issue

I am trying to match some multi-word tokens using UIMA RUTA 2.6.0. Some of the phrases partially overlap, e.g. the same wordlist file has the following entries: "includes the", "include the", "in this", "in the".
My input file contains this piece of text: "1. "Agents or employees" includes the directors...". Obviously, there is an "includes the" match, but if the other three entries above are present in the wordlist, no match is found. Moreover, the order of those entries in the wordlist does not affect the result: matching always fails.
This issue is not limited to a single file. So, the question: how can I fix it? Maybe with some setting of the RUTA annotator?

Whitespace in the wordlist can lead to missed matches. If the whitespace is not important, set the configuration parameter 'dictRemoveWS' to true.
DISCLAIMER: I am a developer of UIMA Ruta

Related

What characters are allowed in the name of a rule in Drools?

I haven't been able to find in Drools documentation, which characters (beyond alphabet letters) are allowed/disallowed in a rule name in Drools - does anyone know or have a reference?
The only relevant section of Drools doc I've found so far does not specify:
Each rule must have a unique name within the rule package. If you use the same rule name more than once in any DRL file in the package, the rules fail to compile. Always enclose rule names with double quotation marks (rule "rule name") to prevent possible compilation errors, especially if you use spaces in rule names.
I think I have discovered, anecdotally, that some "grouping" characters do not work in rule names (it seems rules named with them can't be found or aren't included), or at least not in extension rules (the extended rule seems to work with grouping chars, but not its extension; example below). The grouping chars include parentheses "()", square brackets "[]", and curly braces "{}". Less than and greater than "<>" do work, so for now I'm replacing the former with the latter.
Or are there escape chars for the problematic grouping chars?
Example:
rule "(grouping chars, and commas, work here)"
when
// conditions LHS
then
end
// removing parentheses, or replacing with < >,
// from below line works
rule "(grouping chars DON'T work here)"
extends "(grouping chars, and commas, work here)"
when
then
// consequences RHS
end
I haven't yet tested all other characters either way (for example, other punctuation), except that I have found commas "," work. It would be nice to know ahead of time which characters are allowed.
Theoretically every identifier inside a string should work, but you might have empirically found some combination that is breaking the grammar somehow.
Thanks for the investigation. I've filed a Jira issue; please take a look at it.

How can I change the separator option in WORDTABLE? - UIMA Ruta

According to the Ruta documentation, a WORDTABLE is simply a comma-separated file (.csv), which actually uses semicolons to separate the entries. I need to change the separator, because some of my text contains semicolons and therefore ends up split across separate columns.
When I changed the separator, I received an error message.
How can I solve this issue?
Example:
I really like beef, with mushroom sauce; pasta, with Alfredo sauce; and salad, with French dressing.;0
Think before you speak.;1
We had students from Lima, Peru; Santiago, Chile; and Caracas, Venezuela.;2
Up to the current UIMA Ruta version (2.6.1), changing the separator for the csv tables is not supported. Unfortunately, quoting is also not supported.
DISCLAIMER: I am a developer of UIMA Ruta
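Since neither a custom separator nor quoting is available, it can help to at least find the rows that will be split into too many columns. The following is a minimal Python sketch of such a check, not part of UIMA Ruta; the function name rows_with_extra_separators and the assumption of two columns per row (text and id, as in the example above) are purely illustrative.

```python
def rows_with_extra_separators(lines, expected_columns=2, separator=";"):
    # Hypothetical helper: report (line number, column count) for every row
    # whose number of separator-delimited columns differs from the expected one.
    problems = []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split(separator)
        if len(columns) != expected_columns:
            problems.append((number, len(columns)))
    return problems


rows = [
    "I really like beef, with mushroom sauce; pasta, with Alfredo sauce; "
    "and salad, with French dressing.;0",
    "Think before you speak.;1",
    "We had students from Lima, Peru; Santiago, Chile; and Caracas, Venezuela.;2",
]
print(rows_with_extra_separators(rows))
```

Rows 1 and 3 of the example are reported because their text contains semicolons; how to rewrite those entries is a separate decision that depends on the data.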

UIMA Ruta - Abbreviations

Can I segment the letters of a word using Uima Ruta?
Ex.
1.(WHO)
2.(APIAs)
Script:
DECLARE NEW;
BLOCK (foreach)CAP{}
{
W{REGEXP(".")->MARK(NEW)};
}
Yes, this is achieved with simple regex rules in UIMA Ruta:
DECLARE Char;
CAP->{"."->Char;};
You cannot use normal rules for this because you need to match on something smaller than RutaBasic. The only option is to use regexp rules, which operate directly on the text instead of on annotations. You should of course be very careful, since this can lead to a very large number of annotations.
Some explanation for the somewhat compact rule: CAP->{"."->Char;};
CAP // the only rule element of the rule: match on each CAP annotation
->{ // indicates that inlined rules follow that are applied in the context of the matched annotation.
"." // a regular expression matching on each character
-> Char // the "action" of the regex rule: create an annotation of the type Char for each match of the regex
;}; // end of regex rule, end of inlined rules, end of actual rule
Summarizing, the rule iterates over all CAP annotations, applies a regular expression on each iterated covered text and creates annotations for the matches.
You can of course also use a BLOCK instead of an inlined rule.
DISCLAIMER: I am a developer of UIMA Ruta
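For readers who want to see the same idea outside of Ruta, here is a rough Python sketch of what the rule does: iterate over each uppercase token, apply the regex "." to its covered text, and record one span per character. It is only an illustration, not how Ruta is implemented; the function name char_spans_in_caps is made up, and approximating CAP as a run of uppercase ASCII letters is an assumption.

```python
import re

# Assumption: a CAP annotation is approximated by a run of uppercase ASCII letters.
CAP_PATTERN = re.compile(r"[A-Z]+")


def char_spans_in_caps(text):
    # Hypothetical helper: return (begin, end, covered_text) for every
    # character inside every uppercase run of the given text.
    spans = []
    for cap in CAP_PATTERN.finditer(text):
        # Apply the regex "." to the covered text of the current match only.
        for match in re.finditer(r".", cap.group()):
            begin = cap.start() + match.start()
            spans.append((begin, begin + 1, match.group()))
    return spans


print(char_spans_in_caps("1.(WHO)"))
```

Each tuple corresponds to one Char annotation in the Ruta rule, with offsets relative to the whole document text.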

UIMA Ruta: Can't ignore periods using MarkTable

If I have a dictionary containing various acronyms and designations, I would ideally like to be able to avoid having entries for each "U.S.A.", "USA", and "usa". I have no trouble ignoring case, but the ignore chars argument does not seem to work across the board. After the appropriate import and declare statements, I get something like the following:
Document{->MARKTABLE(Acroynm,1,AcronymDict,true,0,".,-",10,"expandedForm"=2)};
It successfully ignores a single run of 1-10 hyphens. It does not ignore 10 hyphens spaced throughout the word. (It will ignore a-bc and a--bc but not a-b-c.) This is actually fine for hyphens, but I cannot, with the above statement, get it to ignore periods at all. (It ignores neither a.bc nor a.b.c.) Further, if I can get it to ignore periods, is there any way I can ignore the periods in A.B.C. and not just the one in A.BC?
Any further description of the limitations of this argument would be useful. Thanks.
Relevant Ruta Documentation: https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.actions.marktable

How to search in resource files in Eclipse? (escaped chars)

How do you search in resource files (*.properties) in Eclipse for string containing non-ASCII characters?
EDIT: Currently I use * in place of those special chars, but I'd prefer Eclipse to handle this for me: so it would either search for '\u00E1' in raw files when I enter 'á', or it might translate the files first and then just search for 'á'.
My apologies for not being specific enough when asking.
In Eclipse, you can use Search -> File Search. In the Search dialog, check the Regular expression option. Then enter this pattern in the Containing text: field to find non-ASCII characters:
[\u007f-\uffff]
(the square brackets are part of the pattern). Enter the File name patterns
*.properties
and then select which resources to search (selected resource, workspace, working set, etc.) and click OK.
See also the Pattern javadoc for how to express such regular expressions.
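If the goal is to search the raw files for the escaped form of a particular string (as in the question's 'á' versus '\u00E1' example), the search text can be generated up front. Below is a small Python sketch of that conversion; the function name to_properties_escape is hypothetical, and it assumes the files use uppercase hex digits, so a case-insensitive search is safer if they might use lowercase.

```python
def to_properties_escape(text):
    # Hypothetical helper: rewrite every non-ASCII character as one or more
    # backslash-u escapes with four hex digits, leaving ASCII untouched.
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Encode as UTF-16 code units so characters outside the Basic
        # Multilingual Plane become a surrogate pair, as in Java escapes.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            code_unit = int.from_bytes(units[i:i + 2], "big")
            out.append("\\u%04X" % code_unit)
    return "".join(out)


print(to_properties_escape("más"))
```

The resulting text can be pasted into a plain (non-regex) file search. In a regular expression search the backslash would need to be escaped, because the regex engine would otherwise read the escape as the character itself.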
Personally, I search them from the command line using grep, but you can search them in Eclipse by using a question mark in a non-regular-expression search, which should match any character. You can also use a period in a regular expression search.
The search dialog allows you to search for strings in resource files in the workspace.
Go to Search -> File. Enable the Regular Expressions checkbox; this also enables content assist to help you build the regular expression you need. In the file name patterns, give *.properties and then, Go :)