UIMA Ruta. Retrieve phrases separated by WS (spaces, breaks, etc.)

I want to retrieve phrases separated by spaces, line breaks, and other punctuation symbols.
I've spent a lot of time trying to find the best way to do that.
Option 1. The easiest way.
DECLARE T1, T2;
"cool rules" -> T1;
"cool rule" -> T2;
Input: "123cool rules".
Result: both T1 and T2 are triggered.
Option 2. Using WORDLIST and WORDTABLE.
Let the wordlist 1.txt contain two rows:
cool rules
cool
The code for extraction is the following:
WORDLIST WList = '1.txt';
DECLARE W1;
Document{-> MARKFAST(W1, WList, true, 2)};
Input: "cool rules".
Result: only the first row is extracted. I guess that in this case overlapping entries are not matched.
Option 3. Mark a combination of two tokens.
DECLARE T1;
("cool" "rule") {-> T1};
Input: "cool rules cool rule 1cool rule"
Result: 2 annotations: cool rule + 1cool rule. Loss of extraction speed in 10 times.
Option 4. REGEXP matching
Maybe it is possible to match a pattern like "cool\\srule", but I have no idea how to define the type expression. SW*{REGEXP("cool\\srule")->T1} does not produce results.
As you can see, I'm trying to solve a very simple task, but I have not succeeded yet. Option 3 is a really good way to do it, but the extraction process becomes about 10 times slower.

If you want to identify specific phrases, you should use a dictionary lookup, not rules directly.
Therefore, I'd recommend the MARKFAST option 2. However, there are two problems: (a) only longest matches are supported and (b) you either need to change the segmentation (tokenization) or do some postprocessing.
(a) This cannot be solved. If this is really required, a different dictionary annotator should be used. See e.g., the UIMA mailing lists.
(b) MARKFAST works on RutaBasic annotations, which are automatically created for each smallest part. Because of the default seeder, the token "1cool" consists of two RutaBasics: one for the NUM, one for the SW. If you do not want to change the preprocessing, you can simply apply a rule that fixes that, like:
RETAINTYPE(WS);
ANY{-PARTOF(WS)} t:@T1{-> UNMARK(t)};
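Putting option 2 and this fix together, a minimal sketch of the whole pipeline (reusing the wordlist and the W1 type from the question; the @ just marks the anchor rule element):
WORDLIST WList = '1.txt';
DECLARE W1;
Document{-> MARKFAST(W1, WList, true, 2)};
RETAINTYPE(WS);
ANY{-PARTOF(WS)} t:@W1{-> UNMARK(t)};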
By the way, option 4 won't work because the REGEXP condition checks the covered text of the matched annotation SW, which only represents one token. If you do something like (SW+){REGEXP("cool\\srule")->T1}, then the rule won't match if there is another SW afterwards.
DISCLAIMER: I am a developer of UIMA Ruta.

Turning off hotstring options

I was going through the AHK documentation on hotstrings, and found the Options section particularly interesting.
The one question I have is about turning off options, for example:
C: Case sensitive: When you type an abbreviation, it must exactly match the case defined in the script. Use C0 to turn case sensitivity back off.
There are no examples of how you would do this, so that is my first question:
How would you do it?
That is, what are the steps and some code examples I could use to accomplish it?
As for the second question, one that is not as important:
Why would you do it?
To answer your first question, there are two main ways of creating a Case-sensitive hotstring.
Method 1 - Single-use option modifiers:
Stick the options between the first and second colons of the hotstring. Ex:
;Case Sensitive
:C:ROFL:: Rolling on floor laughing
:C:ICYMI:: In case you missed it
:C:TL;DR:: Too long, didn’t read
;Non case-sensitive
::atot::A Tale of Two Cities
::ctbc::Cry, the Beloved Country
Method 2 - The #Hotstring directive
If you have a longer list of hotstrings that you want certain properties to apply to, you can use this directive. The directive applies to all hotstrings following it, which is why options like #Hotstring C0 exist, to turn off a previously declared #Hotstring C. Ex:
#Hotstring C
::ROFL:: Rolling on floor laughing
::ICYMI:: In case you missed it
::TL;DR:: Too long, didn’t read
#Hotstring C0
::atot::A Tale of Two Cities
::ctbc::Cry, the Beloved Country
(Both of these code blocks are functionally identical)
To answer your second question: if you meant why you would need the C0 option, please see Method 2 above. If you meant why you would use the case options in general, that is a matter of personal preference.
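One more wrinkle worth knowing (a small sketch, not from the question): options placed between the colons of an individual hotstring override a #Hotstring directive, so C0 can also re-enable case-insensitive matching for a single entry inside a case-sensitive block:
#Hotstring C
::ROFL:: Rolling on floor laughing
;The per-hotstring C0 below overrides the directive for this one entry
:C0:btw::by the way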

Tags in vowpal wabbit

I am doing binary classification using vowpal-wabbit. A particular record (set of features) has 10 zeroes and 5 ones, so I am creating two lines in vowpal format:
-1 10 `50 |f f1
1 5 `50 |f f1
Since the prediction (probability) for both these records would be the same, I want to keep the same tag, so that I can dedupe the predictions ({tag, prediction}) later and join with my original raw data.
Is it possible to keep the same tag for more than one record in vowpal-wabbit?
First, the syntax above isn't correct.
To be identified as such, tags should either:
touch the | separator (no space between them), OR
start with a simple quote ('), not a backquote (`), by convention
(or both).
Otherwise you get:
warning: `50 is not a good float, replacing with 0
warning: `50 is not a good float, replacing with 0
This hints that vw interprets these "tags" as the prediction base (the optional third float of the label) instead.
For details, see the Input format section in the official documentation.
Once the example is fixed to the correct syntax:
-1 10 '50|f f1
1 5 '50|f f1
this runs fine, and we can answer the question:
Is it possible to keep the same tag for more than one record in vowpal-wabbit?
Yes, you can. The tag is merely a simple way to connect input and output (when predictions are involved), there's no check for uniqueness anywhere. If you duplicate tags on input, you'll simply get the same duplicate tags on prediction output as well.
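For example (a hedged sketch; the file names and prediction values are only illustrative), when you write predictions with -p, each output line carries the example's tag next to its prediction:
vw -d data.vw -p predictions.txt
# predictions.txt then holds one "prediction tag" pair per line, e.g.:
# 0.42 50
# 0.42 50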
More notes:
Even if two examples are identical, you may get different predictions if the model has changed somewhat between them. Remember vw is an online learner, so the model can continuously change with each example unless you add the -t (test-only, don't learn) option.
Features whose value is zero are ignored, so you can drop them. The standard way in vw to say "this is positive" and "this is negative" is to use the values {+1, -1}. This is true for both labels and input features.
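To illustrate that last point (the feature names are hypothetical): the two input lines below are equivalent, because f2:0 contributes nothing and a bare feature name defaults to the value 1:
1 '50 |f f1:1 f2:0
1 '50 |f f1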

UIMA Ruta: Optional Quantifier

I want to match some terms only if the term following them is relevant for me. So I've created a minimal example.
This is my Test Data:
small Large
Large
small
And I want to mark the terms "small Large" and "Large", but not "small".
So I thought, something like this should work:
DECLARE Test;
(SW*? CW) {-> CREATE(Test)};
But Ruta only matches "small Large".
For testing, I've replaced "SW" with "W" and it does what I want.
Unfortunately, optional quantifiers at the beginning of a rule are not optional if the rule starts to match with the first rule element. This means that you either need two rules or you need to change the order of the rule element matching.
Changing the order of the rule element matching leads to different rule matches, since not all incremental sequences of SWs are considered before the CW. However, this is something one would normally prefer anyway. The rule would look like:
(SW* @CW) {-> CREATE(Test)};
The two rules approach would look something like:
(SW+? CW) {-> CREATE(Test)};
CW {-> CREATE(Test)};
I recommend avoiding the usage of a reluctant quantifier if it is not really required, because of the additional computation, which is not necessary. Rather, use the PARTOF condition even if it does not look as nice.
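A minimal sketch of that PARTOF variant (my reading of the recommendation, not code quoted from elsewhere): the second rule only fires for CWs not already covered by a Test annotation, so no reluctant quantifier is needed:
(SW+ CW) {-> CREATE(Test)};
CW{-PARTOF(Test) -> CREATE(Test)};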
DISCLAIMER: I am a developer of UIMA Ruta.

Postgresql full text search part of words

Is PostgreSQL capable of doing a full text search based on 'half' a word?
For example, I'm trying to search for "tree", but I tell Postgres to search for "tr".
I can't find such a solution that is capable of doing this.
Currently I'm using
select * from test, to_tsquery('tree') as q where vectors @@ q;
But I'd like to do something like this:
select * from test, to_tsquery('tr%') as q where vectors @@ q;
You can use tsearch prefix matching; see http://www.postgresql.org/docs/9.0/interactive/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
postgres=# select to_tsvector('tree') @@ to_tsquery('tr:*');
?column?
----------
t
(1 row)
It will only work for prefix search though, not if you want partial match at any position in the word.
Sounds like you simply want wildcard matching.
One option, as previously mentioned, is trigrams. My (very) limited experience with it was that it was too slow on massive tables for my liking (in some cases slower than a LIKE). As I said, my experience with trigrams is limited, so I might have just been using it wrong.
A second option you could use is the wildspeed module: http://www.sai.msu.su/~megera/wiki/wildspeed
(you'll have to build & install this though).
The second option will work for suffix/middle matching as well, which may or may not be more than you're looking for.
There are a couple of caveats (like size of the index), so read through that page thoroughly.
select * from test, to_tsquery('tree') as q
where vectors @@ q OR xxxx LIKE ('%tree%')
':*' is to specify prefix matching.
It can be done with trigrams but it's not part of tsearch2.
You can view the manual here: http://www.postgresql.org/docs/9.0/interactive/pgtrgm.html
Basically, what the pg_trgm module does is split a word into all its trigrams (three-character parts) so it can search for those separate parts.
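A small sketch of what that looks like in practice (the table test and column txt are assumptions; CREATE EXTENSION and LIKE support for trigram indexes require PostgreSQL 9.1 or later, while older versions install the module via the contrib scripts):
CREATE EXTENSION pg_trgm;
CREATE INDEX test_txt_trgm_idx ON test USING gin (txt gin_trgm_ops);
-- the index can then serve wildcard matches at any position in the word:
SELECT * FROM test WHERE txt LIKE '%tree%';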

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional VCS 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separately from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (the paragraph starting with "My favorite mode"). Does anyone know what that is? (It seems to be based on Tichy; see bottom.)
Here's more info on the two missing features:
no concept of a change on a line (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b', but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of edits is the same (just an expanded version of markup_instraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way, but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check the Levenshtein distance between all added and all removed lines) and with likely false positives (some lines, like whitespace-only ones, aren't relevant to my problem).
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84, which appears to be "The String-to-String Correction Problem with Block Moves" (which I haven't read yet), according to Walter Tichy's paper a year later on RCS.
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the language it is in, discovering the code structures, and computing least-Levenshtein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipe, because the SmartDifferencer is built on top of a language-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust language grammars.
I think the amount of editing a line can undergo while remaining a descendant of some previously written line is very subjective and depends on context, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common thing, so your desired algorithm won't really give truthful or useful information about a line quite often.
I would like to suggest an algorithm for this, though (one which makes tons of hopefully obvious assumptions, by the way), so here goes:
Convert both texts to lists of lines
Copy the lists and strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
    Do a Levenshtein distance from the old to new lists,
        keeping all intermediate data
    Find all lines in the new text that were matched with old lines
    Mark the line in both new/old original lists as having been matched
    Delete the line from the new text (the copy)
    Optional: If some matched lines are in a contiguous sequence
        in either original text, assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts
    which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
    If any match before-and-after attributes with new text groups
        //i.e. if they are inside the same area, basically
        For each such pair of groups
            Concatenate all the lines in both groups (separately and in order)
            Include a character to represent where the line breaks are
            Repeat
                Do a Levenshtein distance on these concatenations
                If there are any significantly similar subsequences found
                    //I can't really define this, but basically a high
                    //proportion of matches throughout all lines involved
                    //on both sides
                    For each matched subsequence
                        Find suitable newline spots to delimit the subsequence
                        Mark these lines matched in the original text
                        //Warning: splitting+merging of lines is possible;
                        //no 1-to-1 correspondence of lines here!
                        Delete the subsequence from the new text group concat
                        Delete it also from the new text working list of lines
            Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat the last step
    //Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespace-stripped lines in the old text
Concatenate the lines in the new text also (should only be unmatched ones left)
    //Newline character added in both cases
Repeat
    Do a Levenshtein distance on these concatenations
    Match similar subsequences in the same way as earlier on
    //No need to worry about deleting from the list of new lines any more
    //Similarity criteria should be a fair bit stricter here to avoid
    //spurious matchings. Already matched lines in old text might have
    //even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm: find obvious matches first, along with verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar but both modified and moved (anything beyond that is probably just coincidentally similar).
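To make the first matching phase concrete, here is a rough Python sketch (the names and the 0.8 threshold are my own choices; difflib's SequenceMatcher, mentioned in the question, stands in for the Levenshtein step):
import difflib

def match_lines(old_lines, new_lines, threshold=0.8):
    """Greedily pair each new line with its most similar unmatched old line."""
    # Steps 1-3: work on copies with all internal whitespace stripped
    old = ["".join(line.split()) for line in old_lines]
    new = ["".join(line.split()) for line in new_lines]
    matched = {}        # index in new -> index in old
    used_old = set()
    for j, n in enumerate(new):
        if not n:       # skip blank lines
            continue
        best_ratio, best_i = 0.0, None
        for i, o in enumerate(old):
            if i in used_old or not o:
                continue
            ratio = difflib.SequenceMatcher(None, o, n).ratio()
            if ratio > best_ratio:
                best_ratio, best_i = ratio, i
        if best_i is not None and best_ratio >= threshold:
            matched[j] = best_i   # this new line descends from that old line
            used_old.add(best_i)
    return matched  # unmatched new lines are candidates for "newly written"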
Well, if you try implementing this, tell me how it works out, what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abysmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one.