Create annotation using regexp on document text - ruta

I am trying to annotate a JSON document using regular expressions. I can create a simple annotation that marks a "JsonBlock" using the rule below, but I cannot then use that "JsonBlock" annotation in a loop.
My document looks something like this:
{ "Key": { "JsonBlock": { [
{"id":"123","value":"This is some multi-line long text..." },
{"id":"456","value":"This is some multi-line long text..." } ] } } }
Here is a simple regex-based rule that creates the annotation:
("([{\\s\"]*id.*?\\})")-> JsonBlock;
But why can't I iterate over the JsonBlock annotations using the following? I must be missing something!
BLOCK(myBlock)JsonBlock{}{
}
Beyond that, I have another annotation, "JsonBlockId", that represents the id. I have tried to use PARTOF to check whether a JsonBlockId is part of a JsonBlock, and that rule does not seem to fire either.
Any pointers would be appreciated.
Thanks!

The BLOCK does not match because the JsonBlock annotations are not visible.
Please note that an annotation is invisible if its begin offset or end offset is covered by any invisible annotation. In your example, this is BREAK and/or SPACE.
You can fix the problem by either changing your regex not to include the whitespaces, or you can make whitespace visible, or you can change the offsets of the annotations not to include the whitespaces. Here are the latter two options:
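DECLARE JsonBlock;
"([{\\s\"]*id.*?\\})"-> JsonBlock;
// Option 1: make whitespace visible while the block is applied
RETAINTYPE(WS);
BLOCK(first) JsonBlock{}{
}
RETAINTYPE;
// Option 2: trim whitespace from the annotation offsets, then apply the block
RETAINTYPE(WS);
JsonBlock{-> TRIM(WS)};
RETAINTYPE;
BLOCK(first) JsonBlock{}{
}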
Your example rule was not valid. I removed the brackets.
DISCLAIMER: I am a developer of UIMA Ruta.
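The visibility rule and the TRIM fix can be illustrated outside Ruta. The following is only a rough Python sketch, not Ruta itself: it applies the same regex to the sample document, shows that each match begins on a line break (the counterpart of a BREAK annotation), and then shrinks the offsets past the whitespace, which is roughly what TRIM(WS) does. The helper name `trim_whitespace` is made up for this illustration.

```python
import re

# Sample document from the question, with explicit line breaks.
TEXT = (
    '{ "Key": { "JsonBlock": { [\n'
    '{"id":"123","value":"This is some multi-line long text..." },\n'
    '{"id":"456","value":"This is some multi-line long text..." } ] } } }'
)

# Python counterpart of the Ruta regex (without the outer brackets).
JSON_BLOCK = re.compile(r'[{\s"]*id.*?\}')


def trim_whitespace(text, begin, end):
    """Hypothetical helper: move both offsets inward past any whitespace."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end


# Raw spans start on whitespace, so Ruta would treat them as invisible.
spans = [m.span() for m in JSON_BLOCK.finditer(TEXT)]
# Trimmed spans start on the opening brace instead.
trimmed = [trim_whitespace(TEXT, b, e) for b, e in spans]

for (b, e), (tb, te) in zip(spans, trimmed):
    print(repr(TEXT[b]), "->", repr(TEXT[tb]))
```

Each raw span starts with the newline left over from the previous line, which is why the original annotations were filtered out by the default visibility settings.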

Related

Can we set tolerance level on regex annotator in Ruta?

I am annotating Borrower Name
"Borrower Name" -> BorrowerNameKeyword ( "label" = "Borrower Name");
But I get this text from OCR, so at times "Borrower Name" may come through as "B0rr0wer Nane". Is it possible to set a tolerance limit so that such text still gets annotated as BorrowerNameKeyword?
Is there any other approach that could help here?
I thought of dictionary-based correction, but that won't help because it could also "correct" words that were already right.
You could achieve that with regular expressions in UIMA Ruta. For your particular example, the following rule should work:
"B.rr.wer\\sNa.e" -> BorrowerName;
Likewise, you can create more variants of regular expressions to cover the OCR errors.
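To see what the wildcard pattern accepts, here is a small Python sketch that uses the same expression with the standard `re` module (Ruta is not involved, and the sample strings are made up). The second, narrower pattern is an assumed variant that only allows specific OCR confusions instead of any character.

```python
import re

# Same expression as in the Ruta rule: '.' tolerates any single character.
LOOSE = re.compile(r"B.rr.wer\sNa.e")

# Assumed narrower variant: only o/0 and m/n confusions are tolerated.
STRICT = re.compile(r"B[o0]rr[o0]wer\s+Na[mn]e")

samples = ["Borrower Name", "B0rr0wer Nane", "Bxrrxwer Naze", "Lender Name"]

loose_hits = [s for s in samples if LOOSE.search(s)]
strict_hits = [s for s in samples if STRICT.search(s)]

print(loose_hits)
print(strict_hits)
```

The loose pattern also accepts "Bxrrxwer Naze", so listing the expected confusions explicitly may reduce false positives.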

UIMA RUTA: How to check if String variable is in StringList?

I am looking for something like this:
WORDLIST lemmas = 'lemmas.txt';
DECLARE Test;
BLOCK(AnnotateTests) Token{} {
STRING lemma;
Token{->GETFEATURE("lemma", lemma)};
INLIST(lemma, lemmas) -> MARK(Action); // <- How to do this?
}
I know this is broken code, but I would like to know how I can supply a list of terms in a text file and annotate all instances of, say, Token whose feature value (lemma in the example) is among the terms in the list. I know string equality is possible, but I could not find list membership in the documentation or figure it out myself.
Thanks!
UIMA Ruta 2.1.0: Unfortunately, the INLIST condition does not accept additional arguments, but only checks on the covered text of the matched annotation. So you cannot use that. The CONTAINS condition accepts an additional argument, but not word lists. You can also not apply the wordlist with MARKFAST since the dictionary check is token-based.
The best solution for this problem is to ask the developers to add the functionality, or to add an external condition that provides it.
In UIMA Ruta 2.1.0, you could use StringListExpressions instead of word lists:
STRINGLIST LemmaSL = {"cat", "dog"}; // the content of the wordlist
Token{CONTAINS(LemmaSL, Token.lemma) -> MARK(Action)};
In UIMA Ruta 2.2.0, the INLIST condition is able to process an additional argument that replaces the covered text of the matched annotation, which should solve your problem:
WORDLIST LemmaList = 'lemmas.txt';
Token{INLIST(LemmaList, Token.lemma) -> MARK(Action)};
DISCLAIMER: I am a developer of Apache UIMA Ruta.
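For readers unfamiliar with Ruta, the intended semantics (test a feature value, not the covered text, against a list) can be sketched in plain Python. This only illustrates the logic; the token dictionaries and the function name `mark_by_lemma` are hypothetical and not part of any UIMA API.

```python
def mark_by_lemma(tokens, lemma_list):
    """Return the tokens whose 'lemma' feature appears in lemma_list."""
    allowed = set(lemma_list)  # set lookup keeps the membership test cheap
    return [tok for tok in tokens if tok["lemma"] in allowed]


# Hypothetical tokens: covered text plus a lemma feature.
tokens = [
    {"text": "cats", "lemma": "cat"},
    {"text": "ran", "lemma": "run"},
    {"text": "dogs", "lemma": "dog"},
]

# The list would normally be read from lemmas.txt; it is inlined here.
actions = mark_by_lemma(tokens, ["cat", "dog"])
print([tok["text"] for tok in actions])
```

Note that "cats" is selected even though the covered text itself is not in the list, which is exactly the difference between checking the lemma and checking the covered text.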

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have encountered some problems with it.
The Whoosh query parser returns NullQuery for words like "C#" and "C++" that contain meta-characters, and also for some other short words. These words are used in the title and body of some documents, so I am not using the KEYWORD type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I don't want to change my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:
'\w+[#+.\w]*'
With this, the fields are tokenized successfully and searching works as well.
But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just the '*'. I also found that this is not related to my analyzer and is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
Now I am using the following code:
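from whoosh import analysis, fields
# In some Whoosh releases rcompile is located in whoosh.util.text instead.
from whoosh.util import rcompile
# Matches words like '.NET', 'C++' and 'C#' (raw string keeps the backslashes literal)
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I keep the default minsize
analyzer = analysis.StandardAnalyzer(expression=word_pattern)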
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
This solves my first problem, yes. But the main problem is in searching. I don't want to let users search with the Every query or *, yet when I parse queries like C++* I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
import whoosh.analysis
import whoosh.fields
schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)
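The effect of `minsize` can be shown without installing Whoosh. The sketch below is a simplified stand-in, not Whoosh's code: it tokenizes with the asker's word pattern, lowercases, and drops tokens shorter than `minsize`. Stop-word removal, which StandardAnalyzer also performs by default, is deliberately left out, and the function name `tokenize` is made up.

```python
import re

# The asker's token pattern, written as a raw string.
WORD_PATTERN = re.compile(r"(\.|[\w]+)(\.?\w+|#|\+\+)*")


def tokenize(text, minsize=2):
    """Simplified stand-in: regex tokenizer, lowercasing, length filter."""
    tokens = (m.group(0).lower() for m in WORD_PATTERN.finditer(text))
    return [t for t in tokens if len(t) >= minsize]


sentence = "C is older than C++ and C#"

default_tokens = tokenize(sentence)          # one-character "c" is dropped
short_tokens = tokenize(sentence, minsize=1)  # "c" survives

print(default_tokens)
print(short_tokens)
```

With the default of 2, "c++" and "c#" are kept but a bare "c" disappears, so single-character terms only become searchable once `minsize=1` is used at both index and query time.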

SSIS Formatting input from flat file

I'm an SSIS newbie. I want to format the input from my flat file before saving the entries in a database table. Initially I created a flat file as follows:
"1","Superman","Metropolis"
"2","Batman","Gotham"
"3","Spiderman","New York"
"4","James Bond","London"
"5","Green Lantern","Oa"
Stripping the quotes from this file was simple, as shown here: http://www.mssqltips.com/sqlservertip/1316/strip-double-quotes-from-an-import-file-in-integration-services-ssis/
But now I have created a similar new package and given it an input file like this:
"6", "TMNT", "Sewers NY"
"7", "Iron Man", "New York"
Note that here I've put a space after the delimiting comma. Now when I follow the above method, the first (number) field is stripped of its double quotes, but the rest of the entries retain their quotes. Any idea how to work around this? One suggestion on a similar Stack Overflow question mentioned using a "Transformation script". Since I'm a newbie, can anyone shed some light on this method?
Yes, you can use a Script Component transformation. Select all columns and change them to ReadWrite. The code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
Row.ID = Row.ID.Replace("\"", string.Empty);
Row.Movie = Row.Movie.Replace("\"", string.Empty);
Row.City = Row.City.Replace("\"", string.Empty);
}
If you also want to trim the spaces, assign the trimmed result back to the column, for example:
Row.ID = Row.ID.Replace("\"", string.Empty).Trim();
You would also need to take care if you want to preserve the values that are " ". Please post if the suggestion was helpful or if you have any questions.
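The same clean-up logic, sketched in Python for quick experimentation outside SSIS. The function name `clean_field` and the sample row are assumptions for illustration; the row mimics what arrives when the fields are split on the comma alone.

```python
def clean_field(raw):
    """Remove every double quote, then trim surrounding whitespace."""
    return raw.replace('"', "").strip()


# Hypothetical row as it looks after splitting on ',' only.
row = ['"6"', ' "TMNT"', ' "Sewers NY"']
cleaned = [clean_field(value) for value in row]
print(cleaned)

# Caveat from the answer: a value that is only a space collapses to ''.
blank = clean_field('" "')
```

If values consisting of a single space must be preserved, the trimming step would need a special case.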
In the 'General' tab of the flat file connection manager you can set a text qualifier of ". Then those quotes will be ignored.
That way you don't need to write an error-prone script when there is a simple solution.
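Why the space after the comma matters can be reproduced with Python's `csv` module, which behaves analogously here (this is an analogy, not a statement about SSIS internals): a quote character is only treated as a qualifier when it is the first character of a field. The `skipinitialspace` option is Python's way around that.

```python
import csv
import io

# The asker's second input file, with a space after each comma.
DATA = '"6", "TMNT", "Sewers NY"\n"7", "Iron Man", "New York"\n'

# Default parsing: only the first field starts with the quote character,
# so only that field has its quotes removed.
default_rows = list(csv.reader(io.StringIO(DATA)))

# Skipping the space after the delimiter lets every quote act as a qualifier.
fixed_rows = list(csv.reader(io.StringIO(DATA), skipinitialspace=True))

print(default_rows[0])
print(fixed_rows[0])
```

This matches the symptom in the question: the first column is unquoted correctly while the remaining columns keep their quotes and the leading space.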

Where's the syntax error in this (f)lex snippet?

I'm having a great time doing a lexer using flex. Problem is, my code editor doesn't color the syntax of the file, and it seems my rule has an error in it. Since I'm not too sure about how to use single quotes and double quotes inside intervals, I thought I'd share that snippet with you:
[^\\\'\n]+
{
wchar_t* string;
utf8_decode(yytext, &string);
yyextra->append(string);
free(string);
}
Flex tells me there's an 'unrecognized rule' on the utf8_decode line. If I remove the whole rule, things look fine again.
Can anyone tell what I'm doing wrong here?
The action must begin on the same line as the pattern. So use
[^\\\'\n]+ {
wchar_t* string;
utf8_decode(yytext, &string);
yyextra->append(string);
free(string);
}