FirstToken is not found for some references - UIMA RUTA

FirstToken is not found for some references (those that contain a space at the end).
Script:
DECLARE FirstToken, LastToken;
BLOCK(InRef) Reference{}{
ANY{POSITION(Reference,1) -> MARK(FirstToken)};
Document{-> MARKLAST(LastToken)};
}
Input Files:
1. Ferreira, F.R., Prado, S.D., Carvalho, M.C, and Kraemer, F.B. (2015). Biopower and biopolitics in the field of food and nutrition. Revista de Nutrição, 28(1), 109-119. Available at http://dx.doi.org/10.1590/1415-52732015000100010.
2. Ali, S. (2007). Feminism and postcolonialism: Knowledge/politics. Ethnic and Racial Studies, 30(2), 191–212.
3. Forbes, D.A., King, K.M., Kushner, K.E., Letourneau, N.L., Myrick, A.F., and Profetto-McGrath, J. (1999). Warrantable evidence in nursing science. Journal of Advanced Nursing, 29(2), 373–379.

Annotations that start or end with something invisible are also not visible. This definition may sound unintuitive but is required for sequential matching.
This happens most often if an annotation starts or ends with a space. It is recommended to trim these spaces from the annotations, e.g., with:
RETAINTYPE(WS); // or RETAINTYPE(SPACE, BREAK,...);
Reference{-> TRIM(WS)};
RETAINTYPE;
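To make the effect of trimming concrete, here is a minimal Python sketch of what happens to an annotation's begin/end offsets when leading and trailing whitespace is removed. This is only an illustration, not Ruta code and not how Ruta implements TRIM; the function name trim_span and the sample span are assumptions made up for this example.

```python
from typing import Tuple


def trim_span(text: str, begin: int, end: int) -> Tuple[int, int]:
    """Shrink the half-open span [begin, end) until it neither
    starts nor ends with a whitespace character."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end


# Hypothetical Reference span that ends with a space before the line break.
text = "2. Ali, S. (2007). Feminism and postcolonialism. \n3. Forbes, D.A."
begin, end = 0, text.index("\n")
print(repr(text[begin:end]))   # span still ends with a space

begin, end = trim_span(text, begin, end)
print(repr(text[begin:end]))   # span now ends at the final period
```

After trimming, the span ends on a visible character, which is the state the Reference annotations need to be in for the POSITION, MARKFIRST and MARKLAST rules to see them.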
You can also work on annotations that end with a space if you make spaces visible:
RETAINTYPE(SPACE);
Besides that, you can use the MARKFIRST action, analogous to the MARKLAST action, instead of the POSITION condition, which is extremely slow.
DISCLAIMER: I am a developer of UIMA Ruta

Related

Mozilla Deep Speech STT suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't be able to detect real-world sentences, because the scorer is what matches the output from the acoustic model to the words and word combinations present in its textual language model. Bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise, update to 0.9.0, which has it as well. Maybe your environment has changed in some way. I would start in a new directory and virtual environment.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.
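If you prefer to keep calling the command-line client through subprocess, as in the question, a minimal sketch could look like the following. It assumes the deepspeech executable is on your PATH and that your installed version accepts --scorer, as the 0.8.1 call in the question does. The helper names and the 0.8.2 scorer file name are assumptions for illustration only.

```python
import subprocess
from typing import List, Optional


def build_deepspeech_command(model: str, audio: str,
                             scorer: Optional[str] = None) -> List[str]:
    """Build the argument list for the deepspeech CLI.
    The --scorer option is only added when a scorer path is given."""
    cmd = ["deepspeech", "--model", model]
    if scorer is not None:
        cmd += ["--scorer", scorer]
    cmd += ["--audio", audio]
    return cmd


def transcribe(model: str, audio: str, scorer: Optional[str] = None) -> str:
    """Run the CLI and return the transcription without the trailing newline."""
    output = subprocess.check_output(build_deepspeech_command(model, audio, scorer))
    return output.decode("utf-8").rstrip("\n")


# File names below are assumed; adjust them to the files you downloaded.
print(build_deepspeech_command(
    "deepspeech-0.8.2-models.pbmm",
    "audio/2830-3980-0043.wav",
    scorer="deepspeech-0.8.2-models.scorer",
))
```

Passing the arguments as a list avoids shell=True and makes it easy to see at a glance whether the scorer is part of the call.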

Can we set tolerance level on regex annotator in Ruta?

I am annotating Borrower Name
"Borrower Name" -> BorrowerNameKeyword ( "label" = "Borrower Name");
But I get this text from OCR. At times I might get Borrower Name as B0rr0wer Nane. Is it possible to set a tolerance limit so that this text still gets annotated as BorrowerNameKeyword?
Is there any other approach that could help here?
I could think of dictionary correction, but that won't help, as it could also auto-correct words that are already right.
You could achieve that with regular expressions in UIMA Ruta. For your particular example, the following rule should work:
"B.rr.wer\\sNa.e" -> BorrowerName;
Likewise, you can create more variants of regular expressions to cover the OCR errors.
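Before putting such a pattern into a rule, it can help to check it against a few OCR variants outside Ruta. The Python sketch below does that with the standard re module, where the pattern needs only a single backslash in a raw string. The stricter variant with character classes and the sample strings are assumptions added for illustration; they are not part of the original rule.

```python
import re

# Same pattern as in the rule above: "." accepts any single character.
loose = re.compile(r"B.rr.wer\sNa.e")

# Assumed stricter variant: only allow the OCR confusions seen so far
# (o/0 and m/n), so unrelated strings are less likely to match.
strict = re.compile(r"B[o0]rr[o0]wer\sNa[mn]e")

samples = ["Borrower Name", "B0rr0wer Nane", "Barrawer Nave", "Lender Name"]
for sample in samples:
    print(sample,
          "| loose:", bool(loose.fullmatch(sample)),
          "| strict:", bool(strict.fullmatch(sample)))
```

The loose pattern also accepts strings such as "Barrawer Nave", so listing the expected confusions in character classes is one way to trade recall for precision.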

How to declare # except line break for later usage?

I use the wildcard # to skip text between rule elements.
However, I always mark per line, so I use #{-CONTAINS(BREAK)},
for example RuleElementA #{-CONTAINS(BREAK)} RuleElementB, which requires both elements to be on a single line.
How can I declare or save #{-CONTAINS(BREAK)} so that I can later use a shortcut like
RuleElementA sc RuleElementB ?
You should first annotate your building blocks (i.e., lines) and then create your target annotations based on them (the so-called bottom-up matching strategy in UIMA Ruta).
To do so, you can annotate all the lines in the input document with a naive approach:
DECLARE Line;
ADDRETAINTYPE(BREAK);
BREAK #{-> MARKONCE(Line)} #BREAK;
REMOVERETAINTYPE(BREAK);
This would allow you to remain on the line level while creating the target annotations. You could then iterate over all the Lines in the document in order to ensure the correctness of your span:
BLOCK (forEach) Line{CONTAINS(W)}{
RuleElementA # RuleElementB;
}
Alternatively, you could use the PlainTextAnnotator, which is part of the UIMA Ruta installation package by default. This approach gives you more reliable line detection:
ENGINE utils.PlainTextAnnotator;
TYPESYSTEM utils.PlainTextTypeSystem;
EXEC(PlainTextAnnotator, {Line, EmptyLine});
DECLARE FreeLine, LineFree;
ADDRETAINTYPE(WS);
EmptyLine Line{-> FreeLine};
Line{-> LineFree} BREAK[1,2] #EmptyLine;
Line{-> TRIM(WS)};
FreeLine{-> TRIM(WS)};
LineFree{-> TRIM(WS)};
REMOVERETAINTYPE(WS);

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g., a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotations of the same type before it in the document). Less formally: remove an annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some other ones. Obviously, none of them works (maybe I can refer to the begin feature of the annotation, but this will not solve the formal issue).
So the question is: how can I tell Ruta to remove an annotation that has no annotations of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule is not syntactically valid: the min/max quantifier requires square brackets, like [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

How to retrieve compound words from string list- UIMA RUTA

Sample Script:
DECLARE Name,TEST;
"Peter"->Name;
"der Groot"->Name;
"Robert"->Name;
"de Leew"->Name;
"O'Sullivan"->Name;
STRING s;
STRINGLIST slist;
Name{-> MATCHEDTEXT(s), ADD(slist,s),LOG(s)};
ANY+ {INLIST(slist)->MARK(TEST)};
Received Output:
Peter
Robert
Expected Output:
Peter
der Groot
Robert
de Leew
O'Sullivan
Sample Input:
Peter
der Groot
Robert
de Leew
O'Sullivan
I've tried to mark the values of the string list with an annotation type, but the received output is different from the expected output.
The condition at the rule element ANY+ validates every single ANY, thus fails with the first one and also matches only single tokens.
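The difference between checking each token against the list and looking up whole list entries can be illustrated outside Ruta with a short Python sketch. This is only an analogy under simplifying assumptions: the regex tokenizer is a rough stand-in for ANY, and the function names are made up for this example.

```python
import re

names = ["Peter", "der Groot", "Robert", "de Leew", "O'Sullivan"]
text = "\n".join(names)


def per_token_hits(text, entries):
    """Check every single token against the list, one at a time.
    Multi-token entries can never match this way."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # words and single punctuation marks
    return [token for token in tokens if token in entries]


def whole_entry_hits(text, entries):
    """Search the text for each complete entry, in document order."""
    hits = []
    for entry in entries:
        for match in re.finditer(re.escape(entry), text):
            hits.append((match.start(), match.group()))
    return [matched for _, matched in sorted(hits)]


print(per_token_hits(text, names))
print(whole_entry_hits(text, names))
```

The first function only finds "Peter" and "Robert", which mirrors the received output in the question, while the second finds all five entries because it compares complete spans rather than individual tokens.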
Should the last rule annotate only positions directly after Name annotations?
If not, then you can do something like:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
If yes, the situation gets more complicated because you do not have candidates with the correct span. You cannot solve this with a combination of ANY and INLIST. You need either a correct span or fragments in the list. I'd rather recommend an additional fixing rule:
Name{-> MATCHEDTEXT(s), ADD(slist,s)};
MARKFAST(TEST, slist);
ANY{-ENDSWITH(Name)} #TEST{-> UNMARK(TEST)};
DISCLAIMER: I am a developer of UIMA Ruta