How to declare # except line break for later usage? - uima

I use the wildcard # to skip text between rule elements.
However, I always mark per line, so I am able to use #{-CONTAINS(BREAK)}.
For example, RuleElementA #{-CONTAINS(BREAK)} RuleElementB must be on a single line.
How can I declare/save #{-CONTAINS(BREAK)} so that I can later use just a shortcut like
RuleElementA sc RuleElementB ?

You should first annotate your building blocks (i.e., Lines) and then create your target annotations based on them (the so-called bottom-up matching strategy in UIMA Ruta).
For example, you can annotate all the lines in the input document with the following naive approach:
DECLARE Line;
ADDRETAINTYPE(BREAK);
BREAK #{-> MARKONCE(Line)} @BREAK;
REMOVERETAINTYPE(BREAK);
This would allow you to remain on the line level while creating the target annotations. You could then iterate over all the Lines in the document in order to ensure the correctness of your span:
BLOCK (forEach) Line{CONTAINS(W)}{
RuleElementA # RuleElementB;
}
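A minimal sketch of how this plays out with concrete types, assuming Line annotations already exist. CW and NUM are Ruta's built-in capitalized-word and number types, used here only as stand-ins for RuleElementA and RuleElementB, and Pair is a hypothetical type introduced for illustration:

```ruta
// Pair is a hypothetical target type; CW and NUM stand in for RuleElementA/RuleElementB.
DECLARE Pair;
BLOCK(forEach) Line{CONTAINS(W)} {
    // Inside the block, matching is restricted to the current Line,
    // so the wildcard # cannot run past the end of the line.
    (CW # NUM){-> MARK(Pair)};
}
```

Because the block window already limits matching to a single Line, the #{-CONTAINS(BREAK)} condition should no longer be needed in each rule.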
Alternatively, you could use the PlainTextAnnotator, which is part of the UIMA Ruta installation package by default. This approach gives you more reliable line detection:
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
EXEC(PlainTextAnnotator, {Line, EmptyLine});
DECLARE FreeLine, LineFree;
ADDRETAINTYPE(WS);
EmptyLine Line{-> FreeLine};
Line{-> LineFree} BREAK[1,2] @EmptyLine;
Line{-> TRIM(WS)};
FreeLine{-> TRIM(WS)};
LineFree{-> TRIM(WS)};
REMOVERETAINTYPE(WS);

Related

What if there exists no matched rule in a Lex program because of REJECT?

I'm currently reading the documentation on Lex written by Lesk and Schmidt, and I am confused by the REJECT action.
Consider the two rules
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
Input:
ab
Only the first rule matches. Here is what the documentation says:
The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed.
However, there is no second choice here. Will an error occur?
There are really very few use cases for REJECT; I don't think I've ever seen an instance of it in use other than in examples.
Anyway, unless you specify %option nodefault (or the -s command-line flag), flex will add a default fallback action to your ruleset, equivalent to
.|\n ECHO;
In your case, that pattern will match after the REJECT.
However, it is possible to override the default action; for example, you could add the rule:
.|\n REJECT;
In that case, flex really will not have an alternative after the two REJECTs, and it will print an error message on stderr ("flex scanner jammed") and then call exit.

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g., a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotation of the same type before it in the document). Less formally: remove an annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some others. None of them works (maybe I could refer to the begin feature of the annotation, but that would not solve the formal issue).
So the question is: how can I tell Ruta to remove an annotation that has no annotation of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:@CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule is not syntactically valid: the min/max quantifier requires square brackets, as in [0,0]. However, even with the correct brackets it would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

FirstToken is not found for some reference-UIMA RUTA

FirstToken is not found for some references (those that contain a space at the end).
Script:
DECLARE FirstToken, LastToken;
BLOCK(InRef) Reference{}{
ANY{POSITION(Reference,1) -> MARK(FirstToken)};
Document{-> MARKLAST(LastToken)};
}
Input Files:
1. Ferreira, F.R., Prado, S.D., Carvalho, M.C, and Kraemer, F.B. (2015). Biopower and biopolitics in the field of food and nutrition. Revista de Nutrição, 28(1), 109-119. Available at http://dx.doi.org/10.1590/1415-52732015000100010.
2. Ali, S. (2007). Feminism and postcolonialism: Knowledge/politics. Ethnic and Racial Studies, 30(2), 191–212.
3. Forbes, D.A., King, K.M., Kushner, K.E., Letourneau, N.L., Myrick, A.F., and Profetto-McGrath, J. (1999). Warrantable evidence in nursing science. Journal of Advanced Nursing, 29(2), 373–379.
Annotations that start or end with something invisible are also not visible. This definition may sound unintuitive but is required for sequential matching.
This happens most often if an annotation starts or ends with a space. It is recommended to remove/trim these spaces from the annotations, e.g., with:
RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
Reference{-> TRIM(WS)};
RETAINTYPE;
You can also work on annotations that end with a space if you make spaces visible:
RETAINTYPE(SPACE);
Besides that, you can use the MARKFIRST action, analogous to the MARKLAST action, instead of the POSITION condition, which is extremely slow.
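A minimal sketch combining these suggestions, assuming the Reference type from the question is already declared and annotated elsewhere. It trims whitespace first and then applies MARKFIRST and MARKLAST directly to each Reference, instead of using POSITION inside a BLOCK:

```ruta
DECLARE FirstToken, LastToken;

// Assumption: Reference is declared and annotated by an earlier script or engine.
// Make whitespace visible and trim it, so that each Reference
// starts and ends on a visible token.
RETAINTYPE(WS);
Reference{-> TRIM(WS)};
RETAINTYPE;

// Mark the first and last token of each Reference without the POSITION condition.
Reference{-> MARKFIRST(FirstToken), MARKLAST(LastToken)};
```

With the trailing space trimmed, a Reference that previously ended in whitespace should become visible to the rule again.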
DISCLAIMER: I am a developer of UIMA Ruta

Macro name expanded from another macro in makefile

I have a makefile with the following format. First I define what my outputs are;
EXEFILES = myexe1.exe myexe2.exe
Then I define what the dependencies are for those outputs;
myexe1.exe : myobj1.obj
myexe2.exe : myobj2.obj
Then I have some macros that define extra dependencies for linking;
DEP_myexe1 = lib1.lib lib2.lib
DEP_myexe2 = lib3.lib lib4.lib
Then I have the target for transforming .obj to .exe;
$(EXEFILES):
	$(LINK) -OUT:"Exe\$@" -ADDOBJ:"Obj\$<" -IMPLIB:$($($(DEP_$*)):%=Lib\\%)
What I want to happen is (example for myexe1.exe)
DEP_$* -> DEP_myexe1
$(DEP_myexe1) -> lib1.lib lib2.lib
$(lib1.lib lib2.lib:%=Lib\\%) -> Lib\lib1.lib Lib\lib2.lib
Unfortunately this is not working. When I run make --just-print, the -IMPLIB: arguments are empty. However, if I run $(warning DEP_$*) I get
DEP_myexe1
And when I run $(warning $(DEP_myexe1)) I get
lib1.lib lib2.lib
So for some reason, make does not like the combination of $(DEP_$*). Perhaps it cannot resolve macro names dynamically like this. What can I do to get this to work? Is there an alternative?
Where does $(warning DEP_$*) give you DEP_myexe1 as output exactly? Because given your makefile above it shouldn't.
$* is the stem of the target pattern that matched. In your case, because you have explicit target names, you have no pattern match and so no stem, and so $* is always empty.
Additionally, you are attempting a few too many expansions. You expand $* to get myexe1 directly (assuming for the moment that the variable works the way you intended). You then prefix that with DEP_ and use $(DEP_$*) to get lib1.lib lib2.lib. You then expand that result with $($(DEP_$*)) and then expand that (empty) result again (to do your substitution) with $($($(DEP_$*)):%=Lib\\%).
You want to either use $(@:.exe=) instead of $* in your rule body or use %.exe as your target and then use $* to get myexe1/myexe2.
You then want to drop two levels of expansion from $($($(DEP_$*)):%=Lib\\%) and use $(DEP_$*:%=Lib\\%) instead.
So (assuming you use the pattern rule) you end up with:
%.exe:
	$(LINK) -OUT:"Exe\$@" -ADDOBJ:"Obj\$<" -IMPLIB:$(DEP_$*:%=Lib\\%)
I managed to get it working without needing to resolve macros in the way described above. I modified the linking dependencies like this;
myexe1.exe : myobj1.obj lib1.lib lib2.lib
myexe2.exe : myobj2.obj lib3.lib lib4.lib
Then I need to filter these files by extension in the target recipe;
$(EXEFILES):
	$(LINK) -OUT:"$(EXE_PATH)\$@" -ADDOBJ:$(patsubst %, Obj\\%, $(filter %.obj, $^)) -IMPLIB:$(patsubst %, Lib\\%, $(filter %.lib, $^))
The $(patsubst ...) function is used to prepend the path where the relevant files are located.
In the case of myexe1.exe, the link command expands to;
slink -OUT:"Exe\myexe1.exe" -ADDOBJ: Obj\myobj1.obj -IMPLIB: Lib\lib1.lib Lib\lib2.lib
Out of interest's sake, I would still like to know if it is possible to resolve macro names like in the question.

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms in a text file and annotate all instances of, say, Token that have a certain feature value (lemma in the example) among the ones in the list. I know String equality is possible, but I was not able to find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks on the covered text of the matched annotation. So you cannot use that. The CONTAINS condition accepts an additional argument, but not word lists. You can also not apply the wordlist with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides it.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.