A simple Ruta annotator - UIMA

I just started with Ruta and I would like to write a rule that will work like this:
it will try to match a word, e.g. XYZ, and when it hits it, it will assign the text that comes before it to the Annotator CompanyDetails.
For example:
This is a paragraph that contains the phrase we are interested in, which follows the sentence. LL, Inc. a Delaware limited liability company (XYZ).
After running the script, the annotator CompanyDetails will contain the string:
LL, Inc. a Delaware limited liability company

I assume that you mean an annotation of the type 'CompanyDetails' when you talk about an annotator 'CompanyDetails'.
There are many (really many) different ways to solve this task. Here's one example that applies some helper rules:
DECLARE Annotation CompanyDetails (STRING context);
DECLARE Sentence, XYZ;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ; // should be done in a dictionary
// the actual rule
STRING s;
Sentence{-> MATCHEDTEXT(s)}->{XYZ{-> CREATE(CompanyDetails, "context" = s)};};
This example stores the string of the complete sentence in the feature. The rule matches on all sentences and stores the covered text in the variable 's'. Then, the content of the sentence is investigated: an inlined rule tries to match on XYZ, creates an annotation of the type CompanyDetails, and assigns the value of the variable to the feature named context. I would rather store an annotation instead of a string, since you could still get the string with getCoveredText(). If you just need the tokens before XYZ in the sentence, then you could do something like this (with an annotation instead of a string this time):
DECLARE Annotation CompanyDetails (Annotation context);
DECLARE Sentence, XYZ, Context;
// just to get a running example with simple sentences
PERIOD #{-> Sentence} PERIOD;
#{-> Sentence} PERIOD;
"XYZ" -> XYZ;
// the actual rule
Sentence->{ #{-> Context} SPECIAL? @XYZ{-> GATHER(CompanyDetails, "context" = 1)};};
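If you go with the annotation-valued feature, the covered text stays within reach from Java. Here is a minimal sketch using uimaFIT's JCasUtil, assuming JCas wrapper classes have been generated for the CompanyDetails type above (the getter name getContext() follows the usual JCas naming convention, but treat it as an assumption):
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class PrintCompanyDetails {
    // prints the covered text of the annotation stored in the 'context' feature
    public static void print(JCas jcas) {
        for (CompanyDetails cd : JCasUtil.select(jcas, CompanyDetails.class)) {
            System.out.println(cd.getContext().getCoveredText());
        }
    }
}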

Using Ruta, annotate a line containing annotations and extract required data

Annotate a line containing specific annotations in order to extract text: annotate the line for Borrower and Co-Borrower and get their respective SSNs.
Borrower Name: Alice Johnson SSN: 123-456-7890
Co-Borrower Name: Bob Symonds SSN: 987-654-3210
Code:
PACKAGE uima.ruta.test;
TYPESYSTEM utils.PlainTextTypeSystem;
ENGINE utils.PlainTextAnnotator;
EXEC(PlainTextAnnotator, {Line});
DECLARE Borrower, Name;
DECLARE BorrowerName(STRING value, STRING label);
CW{REGEXP("\\bBorrower") -> Borrower} CW{REGEXP("Name") -> Name};
Borrower Name COLON n:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=n.ct)};
DECLARE SSN;
DECLARE BorrowerSSN(STRING label, STRING value);
W{REGEXP("SSN") -> SSN};
SSN COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="Borrower SSN", "value"=n.ct)};
DECLARE Co;
CW{REGEXP("Co") -> Co};
DECLARE CoBorrowerName(STRING label, STRING value);
Co Borrower Name COLON n:CW[1,3]{-> CREATE(CoBorrowerName, "label"="Co-Borrower Name", "value"=n.ct)};
DECLARE BorrowerLine;
Line{CONTAINS(Borrower),CONTAINS(Name)->MARK(BorrowerLine)};
Please suggest how to annotate a line containing annotations and get the specific label value for the required annotation.
To spare yourself from handling the separate strings, you could gather all the indicators into a wordlist (i.e., a text file containing one indicator per line) and place it in your project's resources folder (see this for more details). Then you could just mark all the indicators with the desired indicator type:
WORDLIST IndicatorList = 'IndicatorList.txt';
DECLARE Indicator;
Document{-> MARKFAST(Indicator, IndicatorList)};
This would output Indicator helper annotations like "Borrower Name".
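For illustration, the wordlist is just a plain text file with one indicator per line; an IndicatorList.txt matching the sample data (the file name and entries are assumptions) could look like:
Borrower Name
Co-Borrower Name
SSN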
Once you have that, you could now iterate over the lines and find the target annotations.
DECLARE Invisible;
SPECIAL{-PARTOF(Invisible), REGEXP("[-]")-> Invisible};
BLOCK(line) Line{CONTAINS(Indicator)}{
    // Ex. pattern: Borrower Name: Alice Johnson SSN: 123-456-7890
    Indicator COLON c:CW[1,3]{-> CREATE(BorrowerName, "label"="Borrower Name", "value"=c.ct)} Indicator;
    FILTERTYPE(Invisible);
    Indicator COLON n:NUM[3,3]{-> CREATE(BorrowerSSN, "label"="BorrowerSSN", "value"=n.ct)};
    REMOVEFILTERTYPE(Invisible);
}
Hope this helps.
Addition to Viorel's answer:
The PlainTextAnnotator creates annotations of the type Line, and these annotations cover the complete line, which means that leading or trailing whitespace is also included. As a consequence, the resulting annotations are not visible to the following rules. In order to avoid this problem, you could, for example, trim the whitespace in these annotations:
EXEC(PlainTextAnnotator, {Line});
ADDRETAINTYPE(WS);
Line{-> TRIM(WS)};
REMOVERETAINTYPE(WS);

Whitespace tokenizer not working when using simple query string

I first implemented query search using SimpleQueryString, as shown below.
Entity Definition
@Entity
@Indexed
@AnalyzerDef(name = "whitespace",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    })
public class AdAccount implements SearchableEntity, Serializable {

    @Id
    @DocumentId
    @Column(name = "ID")
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    @Field(store = Store.YES, analyzer = @Analyzer(definition = "whitespace"))
    @Column(name = "NAME")
    private String name;

    // other properties and getters/setters
}
I use the whitespace tokenizer factory here because the default standard analyzer ignores special characters, which is not ideal in my use case. The document I referred to is https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-WhiteSpaceTokenizer, which describes it as a "Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens."
SimpleQueryString Method
protected Query inputFilterBuilder() {
    SimpleQueryStringMatchingContext simpleQueryStringMatchingContext =
            queryBuilder.simpleQueryString().onField("name");
    return simpleQueryStringMatchingContext
            .withAndAsDefaultOperator()
            .matching(searchRequest.getQuery() + "*")
            .createQuery();
}
searchRequest.getQuery() returns the search query string; I then append the prefix operator at the end so that it supports prefix queries.
However, this does not work as expected with the following example.
Say I have an entity whose name is "AT&T Account"; when searching with "AT&", it does not return this entity.
I then made the following changes to directly use a whitespace analyzer. This time, searching with "AT&" works as expected, but the search is case-sensitive now, i.e., searching with "at&" returns nothing.
@Field
@Analyzer(impl = WhitespaceAnalyzer.class)
@Column(name = "NAME")
private String name;
My questions are:
Why doesn't it work when I use the whitespace factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?
How do I make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?
Why doesn't it work when I use the whitespace factory in my first attempt? I assume using the factory versus using the actual analyzer implementation is different?
Wildcard and prefix queries (the ones you're using when you add a * suffix to your query string) do not apply analysis, ever. This means your lowercase filter is not applied to your search query, even though it has been applied to your indexed text, so the two will never match: AT&* does not match the indexed at&t.
Using the @Analyzer annotation only worked because you removed the lowercasing at index time. With this analyzer, you ended up with AT&T (uppercase) in the index, and AT&* does match the indexed AT&T. It's just by chance, though: if you index At&t, you will end up with At&t in the index and you'll have the same problem again.
How do I make my search case-insensitive if I use the @Analyzer annotation as in my second attempt?
As I mentioned above, the @Analyzer annotation is not the solution; it actually made your search worse.
There is no built-in solution to make wildcard and prefix queries apply analysis, mainly because analysis could remove pattern characters such as ? or *, and that would not end well.
You could restore your initial analyzer and lowercase the query yourself, but that will only get you so far: ASCII folding and other analysis features won't work.
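A minimal sketch of that workaround, reusing the method from the question (the Locale choice is an assumption, and ASCII folding is still bypassed):
protected Query inputFilterBuilder() {
    // lowercase manually, since the prefix query bypasses analysis entirely
    String lowercased = searchRequest.getQuery().toLowerCase(java.util.Locale.ROOT);
    return queryBuilder.simpleQueryString()
            .onField("name")
            .withAndAsDefaultOperator()
            .matching(lowercased + "*")
            .createQuery();
}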
The solution I generally recommend is to use an edge-ngrams filter. The idea is to index every prefix of every word, so "AT&T Account" would get indexed as the terms a, at, at&, at&t, a, ac, acc, acco, accou, accoun, account and a search for "at&" would return the correct results even without a wildcard.
See this answer for a more extensive explanation.
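As a rough sketch, an index-time analyzer definition along those lines could look like the following in Hibernate Search 5, using Lucene's EdgeNGramFilterFactory (the analyzer name and gram sizes are illustrative assumptions):
@AnalyzerDef(name = "autocomplete",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        // index every 1..20 character prefix of each token: a, at, at&, at&t, ...
        @TokenFilterDef(factory = EdgeNGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "1"),
            @Parameter(name = "maxGramSize", value = "20")
        })
    })
The query side should then use an analyzer without the edge-ngram filter, so that "at&" is looked up as a single term instead of being expanded again.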
If you use the Elasticsearch integration, you will have to rely on a hack to make the "query-only" analyzer work properly. See here.

What is the most effective way in SystemVerilog to know how many words a string has?

I have strings in the following structure:
cmd, addr, data, data, data, data, ..., \n
For example:
"write,A0001000,00000000, \n"
I have to know how many words the string has.
I know that I can go over the string and count the commas, but is there a more effective way to do it?
UVM provides a facility to do regexp matching using the DPI, in case you're already using that. Have a look at the functions in uvm_svcmd_dpi.svh
Verilab also provides svlib, a package containing string matching functions.
A simpler option would be to change the commas (,) to spaces; then you can use $sscanf (or $fscanf to skip the intermediate string and read directly from a file), assuming each command has a maximum number of words.
int code; // receives the number of items read by $sscanf
string str = "write,A0001000,00000000, \n";
string word[5];
// change the commas to spaces so $sscanf can split on whitespace
foreach (str[i]) if (str.getc(i) == ",") str.putc(i, " ");
code = $sscanf(str, "%s %s %s %s %s", word[0], word[1], word[2], word[3], word[4]);
You can use %h if you know a word is in hex and translate it directly to a numeric value instead of a string.
The first step is to define extremely clearly what a word actually is, i.e., what constitutes the start of a word and what constitutes its end; once you understand this, it should become obvious how to parse the string correctly.
In Java, StringTokenizer is the best way to find the count of words in a string.
String sampleString = "cmd addr data data data data....";
StringTokenizer st = new StringTokenizer(sampleString);
int count = st.countTokens();
Hope this will help you :)
In Java you can use the following code to count the words in a string:
public class WordCounts {
    public static void main(String[] args) {
        String text = "cmd, addr, data, data, data, data";
        String trimmed = text.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        System.out.println(words);
    }
}

How to check SQL/language syntax in a String at compile time (Scala)

I am writing a translator which converts a DSL to multiple programming languages (similar to Apache Thrift).
For example,
// an example DSL
LOG_TYPE: COMMERCE
COMMON_FIELD : session_id
KEY: buy
FIELD: item_id, transaction_id
KEY: add_to_cart
FIELD: item_id
// will be converted to Java
class Commerce {
    private String session_id;
    private String key;
    private String item_id;
    private String transaction_id;
    // auto-created setter, getter, helper methods
    ...
}
It also should be translated into Objective-C and JavaScript.
To implement it, I have to replace strings:
// 1. create or load code fragments
String variableDeclarationInJava = "private String {$field};";
String variableDeclarationInJavascript = "...";
String variableDeclarationInObjC = "...";
// 2. replace it
variableDeclarationInJava.replace(pattern, fieldName);
...
Replacing code fragments in Strings is not type-safe and is frustrating, since it does not give any information even when there are errors.
So, my question is: is it possible to parse a String at compile time, like the Scala sqltyped library does?
If it is possible, I would like to know how I can achieve it.
Thanks.
As far as I understand, it could be. Please take a look at string interpolation. You implement a custom interpolator (like it was done for quasiquotes or in Slick).
A nice example of the thing you may want to do is here.

How to pass parameters to a Progress program using database field dynamic-based rules?

I have in my database a set of records that concentrates information about my .W's, e.g. window name, parent directory, file name, and procedure type (for internal treatment purposes), used to build my main menu. With this data I'm developing a new start procedure for the ERP that I maintain, and I'm using the opportunity to rewrite some really outdated functions and programs and to implement new functionality. Until now I hadn't had any problems, but then I started to develop the .P procedure which checks the database register of a program called from the menu of this new start procedure - to see whether it needs to receive fixed parameters to be run, and their data types - and I found a problem that I can't figure out a solution to.
In this table, I have stored in one of the fields the parameters needed by the program, each with its corresponding data type. The problem is how to pass different data types to procedures based only on the stored data. I tried to pre-convert the data using a CASE clause and an include to check the parameter field for correct parameter sending, but the include doesn't work as I expected.
My database field is stored as this:
Description | DATATYPE | Content
I've declared some variables and properly converted the stored data into vars of the correct datatypes.
DEF VAR c-param-exec AS CHAR NO-UNDO EXTENT 9 INIT ?.
DEF VAR i-param-exec AS INT NO-UNDO EXTENT 9 INIT ?.
DEF VAR de-param-exec AS DEC NO-UNDO EXTENT 9 INIT ?.
DEF VAR da-param-exec AS DATE NO-UNDO EXTENT 9 INIT ?.
DEF VAR l-param-exec AS LOG NO-UNDO EXTENT 9 INIT ?.
DEF VAR i-count AS INT NO-UNDO.
blk-count:
DO i-count = 1 TO 9:
    IF TRIM(programa.parametro[i-count]) = '' THEN
        LEAVE blk-count.
    CASE ENTRY(2,programa.parametro[i-count],CHR(1)):
        WHEN 'CHARACTER' THEN
            c-param-exec[i-count] = ENTRY(3,programa.parametro[i-count],CHR(1)).
        WHEN 'INTEGER' THEN
            i-param-exec[i-count] = INT(ENTRY(3,programa.parametro[i-count],CHR(1))).
        WHEN 'DECIMAL' THEN
            de-param-exec[i-count] = DEC(ENTRY(3,programa.parametro[i-count],CHR(1))).
        WHEN 'DATE' THEN
            da-param-exec[i-count] = DATE(ENTRY(3,programa.parametro[i-count],CHR(1))).
        WHEN 'LOGICAL' THEN
            l-param-exec[i-count] = (ENTRY(3,programa.parametro[i-count],CHR(1)) = 'yes').
        OTHERWISE
            c-param-exec[i-count] = ENTRY(3,programa.parametro[i-count],CHR(1)).
    END CASE.
END.
Then I tried to run the program using an include to pass the parameters (in this example, the program has 3 INPUT parameters).
RUN VALUE(c-prog-exec) ({util\abrePrograma.i 1},
{util\abrePrograma.i 2},
{util\abrePrograma.i 3}).
Here is my abrePrograma.i
/* abrePrograma.i */
(IF ENTRY(2,programa.parametro[{1}],CHR(1)) = 'CHARACTER' THEN c-param-exec[{1}] ELSE
IF ENTRY(2,programa.parametro[{1}],CHR(1)) = 'INTEGER' THEN i-param-exec[{1}] ELSE
IF ENTRY(2,programa.parametro[{1}],CHR(1)) = 'DECIMAL' THEN de-param-exec[{1}] ELSE
IF ENTRY(2,programa.parametro[{1}],CHR(1)) = 'DATE' THEN da-param-exec[{1}] ELSE
IF ENTRY(2,programa.parametro[{1}],CHR(1)) = 'LOGICAL' THEN l-param-exec[{1}] ELSE
c-param-exec[{1}])
If I suppress the 2nd, 3rd, 4th and 5th IFs from the include, or use only one data type in all the IFs (e.g. only CHAR, only DATE, etc.), the program works properly and executes like a charm. But I need to call some old programs which expect different datatypes in their INPUT parameters, and when using the programs as described, OpenEdge doesn't compile the caller, triggering error number 223.
Erro (Press HELP to view stack trace)
** Tipos de dados imcompativeis em expressao ou atribuicao. (223)
** Nao entendi a linha 86. (196)
(In English: "Incompatible data types in expression or assignment. (223)" and "Did not understand line 86. (196)".)
Can anyone help me with this ?
Thanks in advance.
Looks as if you're trying to use variable parameter definitions.
Have a look at the "create call" statement in the ABL reference.
http://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dvref/call-object-handle.html#wwconnect_header
Sample from the documentation
DEFINE VARIABLE hCall AS HANDLE NO-UNDO.
CREATE CALL hCall.
/* Invoke hello.p non-persistently */
hCall:CALL-NAME = "hello.p".
/* Sets CALL-TYPE to the default */
hCall:CALL-TYPE = PROCEDURE-CALL-TYPE.
hCall:NUM-PARAMETERS = 1.
hCall:SET-PARAMETER(1, "CHARACTER", "INPUT", "HELLO WORLD").
hCall:INVOKE.
/* Clean up */
DELETE OBJECT hCall.
The best way to get to the bottom of these kinds of preprocessor-related issues is to do a compile with a preprocess listing, followed by a syntax check on the preprocessed file. Once you know where the error is in the resulting preprocessed file, you have to find out which include / define produced the code that won't compile.
In the procedure editor:
compile source.w preprocess source.pp.
Open source.pp in the procedure editor and do a syntax check.
Look at the original source to find the include or preprocessor construct that resulted in the code that does not compile.
Okay, I am getting a little bit lost (often happens to me with lots of preprocessors), but am I missing that, on the way into and out of the database fields, you are storing values as characters? So when storing a parameter in the database you have to convert it to CHAR, and on the way out of the database you have to convert it back to its correct data type. Not doing it one way or the other would cause a type mismatch.
Also, just thinking out loud (without thinking it all the way through): I wonder whether using OOABL (Object-Oriented ABL), if your release has it available, wouldn't make this easier. You could define method signatures for the different datatypes; then, depending on which type of input or output parameter you call it with, it will use the correct signature and the correct conversion method.
Something like:
METHOD PUBLIC VOID storeParam(INPUT cParam AS CHAR):
    dbfield = cParam.
    RETURN.
END METHOD.

METHOD PUBLIC VOID storeParam(INPUT iParam AS INT):
    dbfield = STRING(iParam).
    RETURN.
END METHOD.

METHOD PUBLIC VOID storeParam(INPUT dParam AS DATE):
    dbfield = STRING(dParam).
    RETURN.
END METHOD.
Just a thought.