How to generalize special entities - ruta

We use Apache UIMA Ruta for processing our documents. The input documents contains all kind of patterns that we try to recognize and translate to a hierarchy of annotations.
One of the things we will do with the result is to decorate the input text with links. For that it's import that we know the original position information of the found annotations.
Some of the annotations are based on value lists. We use MarkTable to resolve them.
The problem we have is that input document can contain different kind of special entities. For example, the document can contain also words that contain entities like & 𝌆. These can also exist in words / sentences that will be looked up into valuelists.
We are searching for an option to generalize (convert) all that kind of options to a normal "plain text" format, so that we don't have to add all kind of options, with special entities to the valuelists.
Doing a pre-processing of the document and replace them all (for example with the HTMLConverter engine) is AFAIK not a good option, because that will also change the position info. & should match on &, but still seen as size 5.
I tried to use the replace action, that will add an extra "replacement" attribute to the annotation. When I add an interceptor (aspect) to the getCoveredText of the annotation class, and return replacement instead of real text if available, the matching will succeed. But this give problems if the replacement text contains spacers (the end position is still equal with the original text / first RutBasic).
Any suggestions how we can solve this?

I solved this issue by building a pre- and post processor for the content.
In the pre-processor I replace text fragments with other text. For example the & and & will be replaced by a normal &. While preprocessing I store each replacement details in an replacement object, that will be added to an ordered list. A replacement object contains the original text and the difference in length (& is 4 characters longer than a single &).
After annotating with RUTA(and other annotators) I correct all the found annotation values (text) to the original value and I fix the position information (begin and end) of the annotations, so that they match with the original content. I use the list with replacement details for this process.

Related

Is it possible to ignore paragraph marks when using getTextRanges() in word add in?

I am currently developing a word addin using the office js library. I need to get all sentences in the word document as individual ranges. For this I used getTextRanges() on the body of the document with "." as the delimiter. However, it also separates on paragraph mark which is not ideal for my use case. All I want is for the document to be divvied up into ranges where the only delimiter is "." - regardless of whether the ranges will then expand across paragraphs.
Is there a way to ignore paragraph marks with getTextRanges(), or is there another method entirely that I seem to have overlooked?
Thanks.
I have been unable to resolve it.

Multiple regex in one command

Disclaimer: I have no engineering background whatsoever - please don't hold it against me ;)
What I'm trying to do:
Scan a bunch of text strings and find the ones that
are more than one word
contain title case (at least one capitalized word after the first one)
but exclude specific proper nouns that don't get checked for title case
and disregard any parameters in curly brackets
Example: Today, a Man walked his dogs named {FIDO} and {Fifi} down the Street.
Expectation: Flag the string for title capitalization because of Man and Street, not because of Today, {FIDO} or {Fifi}
Example: Don't post that video on TikTok.
Expectation: No flag because TikTok is a proper noun
I have bits and pieces, none of them error-free from what https://www.regextester.com/ keeps telling me so I'm really hoping for help from this community.
What I've tried (in piece meal but not all together):
(?=([A-Z][a-z]+\s+[A-Z][a-z]+))
^(?!(WordA|WordB)$)
^((?!{*}))
I think your problem is not really solvable solely with regex...
My recommendation would be splitting the input via [\s\W]+ (e.g. with python's re.split, if you really need strings with more than one word, you can check the length of the result), filtering each resulting word if the first character is uppercase (e.g with python's string.isupper) and finally filtering against a dictionary.
[\s\W]+ matches all whitespace and non-word characters, yielding words...
The reasoning behind this different approach: compiling all "proper nouns" in a regex is kinda impossible, using "isupper" also works with non-latin letters (e.g. when your strings are unicode, [A-Z] won't be sufficient to detect uppercase). Filtering utilizing a dictionary is a way more forward approach and much easier to maintain (I would recommend using set or other data type suited for fast lookups.
Maybe if you can define your use case more clearer we can work out a pure regex solution...

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (not ending with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems as though even if I set () as separator characters to index that only effects the direct attriubtes that are searchable, not the tag. is this correct? can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it is currently failing is because of the syntax of tagFilters. When you pass a string, it tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex unicode value - e.g. '”' must be replaced with '”' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
Well with XSLT 2.0 you have tagged your post with you can use a character map, see http://www.w3.org/TR/xslt20/#character-maps.

How can I identify an OpenXml Paragraph as one I programmatically inserted?

I am programmatically adding an OpenXML paragraph to a Word Document and I need to be able to identify that paragraph as mine later on. Any ideas on how to do this? I have tried inserting an XML comment and extended attributes but when you save the document in word it removes all unknown xml. It doesn't matter if it is an attribute in the paragraph or the run, or an element before the paragraph, just some way I can identify it later on. Also, I do not want this identifier visible in the word document.
Examples of what I could use:
<paragraph id="myParagraph"></paragraph>
<otherelement>myparagraph</otherelement>
<paragraph></paragraph>
Any help would be AWESOME because my head it hurting from the brick wall I have been running into.
Thanks!
Give the paragraph a w:rsidR attribute and assign a unique value to it; if there is no value present when word saves the document it will randomly assign it's own 8-digit hexadecimal value anyway. (The value is not limited to 8 digits or hexadecimal characters. Word will not modify existing RSIDs.)
That being said -- make sure to keep RSID values unique and do NOT modify existing RSID attributes -- they are the unique ID for that paragraph, and if the document splits into multiple versions and a user tries to merge them back together those RSIDs are used to determine what paragraphs have changed.
(Also note that runs have RSIDs as well.)
If the user modifies the paragraph, the RSID of that paragraph may change.
The alternate option is to use Custom XML: http://msdn.microsoft.com/en-us/library/bb608618.aspx
Use stylename in paragraph properties.
or try this one
http://msdn.microsoft.com/en-us/library/office/hh674468.aspx
Hope this helps.