UIMA wordlist missing entries

I am using UIMA Ruta 2.7.0:
DECLARE Substance;
WORDLIST EnzymeSearchList = 'enzyme.txt';
Document{-> MARKFAST(Substance, EnzymeSearchList, true)}; // true ignores case
enzyme.txt contains about 16,000 entries (one per line).
If I use a file containing few entries, for example 5, my further rules work without any problem. Once I provide the full list of thousands of entries, my results are incomplete.
Could the issue be caused by reaching a WORDLIST limit? Or maybe by the heap size? Nothing fails during program execution.
I have found a thread specifically stating:
There is no maximum size for the wordlists in UIMA Ruta. ... My largest wordlist consisted of about 500k entries

I assume that by incomplete you mean that several (obvious) entities have not been found/annotated in the document?
This is most likely caused by whitespace in the enzyme.txt file. Can you verify this, e.g., by removing all whitespace in this file and retesting the script?
If the problem is caused by whitespace, there are several options to solve or avoid it. For example, you can set the config param 'dictRemoveWS' to true to automatically remove the whitespace when the dictionary is loaded.
Is upgrading to UIMA Ruta 2.8.1 (which should also fix this problem) an option?
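To verify the whitespace suspicion outside of Ruta, a small script can list the affected entries and produce a whitespace-free copy of the list for a retest. This is only a sketch in plain Python: the function names are made up for illustration, and it does not use any UIMA Ruta API.

```python
import re


def find_entries_with_whitespace(lines):
    """Return (line_number, entry) pairs for entries that contain any whitespace.

    A stray carriage return from Windows line endings also counts as whitespace.
    """
    hits = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\n")
        if entry and re.search(r"\s", entry):
            hits.append((number, entry))
    return hits


def strip_all_whitespace(lines):
    """Return the entries with every whitespace character removed; blank lines are dropped."""
    return [re.sub(r"\s+", "", line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample entries; for the real list, read the lines of enzyme.txt instead.
    sample = ["alcohol dehydrogenase\n", "lipase\n", " amylase\n"]
    for number, entry in find_entries_with_whitespace(sample):
        print(number, repr(entry))
    print(strip_all_whitespace(sample))
```

If the entries reported by the first function are exactly the ones missing from the results, whitespace is the likely cause.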

Talend: configuration dimension time error in tOracleOutput

I still have this problem
Exception in component tOracleOutput_1
java.sql.SQLSyntaxErrorException: ORA-00904: : invalid identifier
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
There is some corrupt code in your job. First check whether any code is generated for this job at all. If not, try removing or disabling each component in turn, run the job, and see whether the error persists.
I have had this as well. What usually helps is restarting Talend or restarting the computer.
If that doesn't help, there is something wrong with the job. Then I check every schema, every connection, every tMap, every item in the job if there is an error which Talend doesn't show to me.
To check if the code generation system works, you can always click on the Code tab and see if something comes up.
EDIT
The error is ORA-00904. This suggests that a column name is invalid, as described here: https://dba.stackexchange.com/questions/129641/ora-00904-error-while-querying-the-oracle-database-table
To avoid ORA-00904, column names must
begin with a letter.
consist only of alphanumeric and the special characters ($_#); other characters need double quotation marks around them.
be less than or equal to thirty characters.
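The three rules above can be checked mechanically before the job runs. The following is a minimal sketch in plain Python; the function name and the messages are invented for illustration, and the check for an empty name is an added assumption, prompted by the empty identifier in the message "ORA-00904: : invalid identifier".

```python
import re


def column_name_problems(name, max_length=30):
    """Return a list of reasons why `name` breaks the unquoted-identifier rules above."""
    if not name:
        # Assumption: an empty column name is treated as the most likely culprit.
        return ["empty name"]
    problems = []
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not begin with a letter")
    if re.search(r"[^A-Za-z0-9_$#]", name):
        problems.append("contains characters that need double quotes")
    if len(name) > max_length:
        problems.append("longer than %d characters" % max_length)
    return problems


if __name__ == "__main__":
    # Hypothetical schema columns; replace with the column names of the output schema.
    for column in ["ORDER_DATE", "1ST_COL", "TIME DIM", ""]:
        print(repr(column), column_name_problems(column))
```

Running it over every column of the tOracleOutput schema narrows the search to the names that return a non-empty list.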

UIMA RUTA annotation at the beginning of sequence

I have a sequence of annotations that are instances of the same type (e.g. a sequence of CW annotations). I need to remove the first of them (more formally: remove the annotation that has no annotations of the same type before it in the document). Less formally: remove an annotation at the beginning of the document. Example document: "Software StageTools"
So, I tried many variants:
"Software"{-AFTER(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{BEFORE(CW) -> UNMARK(CW)} CW+; //does not work
"Software"{-STARTSWITH(Document) -> UNMARK(CW)} CW+; //does not work
CW{0, 0} "Software"{-> UNMARK(CW)} CW+; //getting parsing error
...and some other ones. None of them works (maybe I could refer to the begin feature of the annotation, but that would not solve the formal issue).
So the question is: how can I tell Ruta to remove an annotation that has no annotations of the same type before it in the document?
There are many ways to do this. Here are two examples:
# cw:CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove the first CW "Software" in the document if there is another CW following.
ANY{-PARTOF(CW)} cw:#CW.ct=="Software"{-> UNMARK(cw)} CW;
Remove any CW "Software" if there is a CW following and there is no CW preceding. If the document can start with the pattern, you need a second rule.
Your second rule actually works for me. The last rule has no valid syntax. The min/max quantifier requires different brackets like [0,0]. However, this would not have the effect you want.
DISCLAIMER: I am a developer of UIMA Ruta

Fastparse parse error column numbers missing

I just updated from fastparse 0.3.7 to 0.4.1. There is no longer a column number value in the extras of a Parsed.Failure. I grepped through the source and it seems the functionality has been removed, though it is still in the documentation. Is there some other way to get column info now?
It's just changed a bit. You need to grab the index and the parser that failed, and call StringReprOps.prettyIndex.
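For readers who only need the numbers, the conversion from a failure index to a line and column is simple to reproduce. The sketch below is plain Python and only illustrates the arithmetic; it is not fastparse code, and the function name is made up. It assumes 1-based line and column numbers and "\n" line breaks.

```python
def line_and_column(text, index):
    """Convert a 0-based character index into a 1-based (line, column) pair."""
    if not 0 <= index <= len(text):
        raise ValueError("index out of range")
    # Every newline before the index starts a new line.
    line = text.count("\n", 0, index) + 1
    # rfind returns -1 when there is no earlier newline, which yields column index + 1.
    last_newline = text.rfind("\n", 0, index)
    column = index - last_newline
    return line, column


if __name__ == "__main__":
    source = "ab\ncd"
    print(line_and_column(source, 3))  # position of 'c'
```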

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms in a text file and annotate all instances of, say, Token that have a certain feature value (lemma in the example) among the ones in the list. I know String equality is possible, but I was not able to find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks on the covered text of the matched annotation. So you cannot use that. The CONTAINS condition accepts an additional argument, but not word lists. You can also not apply the wordlist with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides it.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have encountered some problems with it.
The Whoosh query parser returns NullQuery for words like "C#" and "C++" that contain metacharacters, and also for some other short words. These words appear in the title and body of some documents, so I am not using the KEYWORD type for them. I guess the problem is in the analysis or query-parsing phase of indexing or searching, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:
'\w+[#+.\w]*'
With this pattern the fields are tokenized correctly, and searching works as well.
But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just the '*'. I also found that this is not related to my analyzer and is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
Now I am using the following code:
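The effect of the pattern '\w+[#+.\w]*' can be tried out with the standard re module before touching the index. This sketch only demonstrates the regular expression; it does not involve Whoosh, and the helper name is made up.

```python
import re

# The pattern from the post: a run of word characters, then any mix of '#', '+', '.' or word characters.
WORD_PATTERN = re.compile(r"\w+[#+.\w]*")


def tokenize(text):
    """Return every substring of `text` that matches the pattern, in order."""
    return WORD_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("I use C# and C++ with .NET"))
    print(tokenize("I like C#."))
```

Two side effects are worth knowing: a leading dot is not part of a token, so ".NET" becomes "NET", and a sentence-final period is kept, so "C#." stays one token.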
from whoosh import analysis
from whoosh.util import rcompile
# for matching words like '.NET', 'C++' and 'C#'
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I keep the default minsize
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
This solves my first problem, yes. But the main problem is in searching. I don't want to let users search with the Every query or *. But when I parse queries like C++*, I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
import whoosh.analysis
import whoosh.fields

schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)
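Why minsize matters for a term like "C#" can be shown with a rough model in plain Python. This is not Whoosh code: it assumes the default token pattern is '\w+(\.?\w+)*' (my recollection of Whoosh's default, so treat it as an assumption), it ignores stop words, and the function name is made up.

```python
import re

# Assumed default token pattern; '#' is not part of it, so "C#" yields the token "C".
ASSUMED_DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")


def model_tokens(text, minsize=2):
    """Lowercase the matched tokens and drop those shorter than `minsize`."""
    tokens = [match.group(0).lower() for match in ASSUMED_DEFAULT_PATTERN.finditer(text)]
    return [token for token in tokens if len(token) >= minsize]


if __name__ == "__main__":
    print(model_tokens("C# programming"))
    print(model_tokens("C# programming", minsize=1))
```

In this model the single-character token survives only with minsize=1, which matches the behavior described in the answer. Keeping the '#' itself still requires a custom pattern.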