Does anyone know how I can search for all words in a text that are italicized? And, to extend that, search for specific words that are (or are not) italicized?
For example, given "I am certain that I am not mistaken", I'd like to extract "certain", or extract every "am" that is not italicized.
Assuming that the formatting information is present in the CAS, e.g., by applying the HtmlAnnotator (in combination with HtmlConverter) provided by Ruta, the rules could look like (as indicated in a comment of the question):
I{-> MyType};
SW.ct=="am"{-PARTOF(I) -> MyType};
You may need to import the HtmlTypeSystem of Ruta.
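For readers who want to see what the two rules are meant to select, here is a minimal sketch in plain Python rather than Ruta. It is not how Ruta or the HtmlAnnotator works internally. It assumes the formatting is available as `<i>` or `<em>` tags, and the markup in `sample` is an invented guess at which words of the question's sentence were italicized.

```python
from html.parser import HTMLParser
import re


class ItalicWordCollector(HTMLParser):
    """Collects (word, is_italic) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth of <i>/<em> elements
        self.words = []  # list of (word, is_italic) tuples

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "em"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("i", "em") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Every word in this text chunk inherits the current italic state.
        for word in re.findall(r"\w+", data):
            self.words.append((word, self.depth > 0))


def tagged_words(html):
    collector = ItalicWordCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector.words


# Assumed markup: the original formatting of the sentence is not known.
sample = "I am <i>certain</i> that I <i>am</i> not mistaken"
words = tagged_words(sample)

# Roughly analogous to the first rule: everything that is italic.
italic_words = [w for w, italic in words if italic]
# Roughly analogous to the second rule: "am" tokens that are not italic.
plain_am = [w for w, italic in words if w == "am" and not italic]
```

With the assumed markup, `italic_words` holds "certain" and the italic "am", while `plain_am` holds only the first, non-italic "am".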
DISCLAIMER: I am a developer of UIMA Ruta
The starspace documentation is unclear on the parameter 'fileFormat' which takes the value 'labelDoc' or 'fastText'.
I would like to understand intuitively what material difference setting this parameter would have.
Currently, my best guess is that if you set fileFormat to 'fastText' then all tokens in the training file that do not have the prefix '__label__' will be broken down into character-level n-grams as in fastText.
Alternatively, if you set fileFormat to 'labelDoc' then starspace will assume that all tokens are actually labels, and you do not need to prepend '__label__' to the tokens, because they will be recognized as labels anyway.
Is my thinking correct?
How StarSpace uses the labels depends heavily on the trainMode you are using. The labelDoc format is useful when you go for a trainMode that relies only on labels (trainMode 1 through 4). For these, using the fastText format with the __label__ prefix may amount to the same thing, but some trainModes (i.e. trainMode 1 or 3) benefit from the labelDoc format because it lets a whole sentence serve as a single label element.
So, to clarify: if you are performing a text classification task (as explained in this example), labelDoc wouldn't have any input recognized. On the other hand, as you stated, the fastText format will break down all non-labeled text as input and learn to predict the __label__ tags.
An example for the labelDoc format would be developing a content-based recommender system (as explained in this example), where every tab-separated sentence is used on the LHS or RHS during training. But if you take a collaborative approach (the content of the articles, or wherever your sentences come from, is not taken into account), the model can be trained with either the fastText format (specifying the __label__ prefix) or the labelDoc format, as labels are picked randomly for the LHS or RHS during training. (This second example is explained here.)
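To make the difference between the two layouts concrete, here is a small sketch that builds one line in each format. It follows the layouts described above (words plus __label__-prefixed labels for fastText, tab-separated sentences for labelDoc). The helper names and the sample words are invented for illustration and are not part of StarSpace.

```python
def to_fasttext_line(words, labels):
    """fastText layout: input words followed by labels carrying the __label__ prefix."""
    return " ".join(list(words) + ["__label__" + label for label in labels])


def to_labeldoc_line(sentences):
    """labelDoc layout: one document per line, its sentences separated by tabs."""
    return "\t".join(" ".join(sentence) for sentence in sentences)


# Classification-style example: plain words are input, prefixed tokens are labels.
fasttext_line = to_fasttext_line(["great", "pasta", "place"], ["restaurant"])

# Content-based example: each tab-separated sentence can act as one label element.
labeldoc_line = to_labeldoc_line([["the", "cat", "sat"], ["it", "purred"]])

print(fasttext_line)
print(repr(labeldoc_line))
```

Each generated line would be one row of the training file passed to StarSpace with the matching fileFormat value.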
I'm trying to understand what the query format is when I press (Cmd + P) or (Cmd + T) and then type something.
Let's say I type ABC. It seems to me that VSCode searches using the regex A.*B.*C.*. Is that correct? It also appears that * is allowed in the query, but I got confusing results, for example here
Can someone please point me to the documentation about the query format?
It is called "fuzzy" matching or searching. I couldn't find any formal documentation other than something like implementing fuzzy matching. For your odd test case of vs*b it looks like they are trying to implement fuzzy matching with out-of-order symbols like some other editors have.
See also More fuzzy matching: VSCode documentation
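As a rough illustration of the idea, here is a minimal sketch of in-order subsequence matching, which is the behaviour the question's A.*B.*C guess describes. It is not VSCode's actual algorithm, which also scores and ranks candidates. The function name and the file names are made up for the example.

```python
def is_subsequence_match(query, candidate):
    """True if the characters of query appear in candidate in order (case-insensitive)."""
    remaining = iter(candidate.lower())
    # "ch in remaining" advances the iterator until ch is found, so each
    # query character must occur after the previous matched character.
    return all(ch in remaining for ch in query.lower())


files = ["AppBarController.ts", "cab.txt", "README.md"]
matches = [name for name in files if is_subsequence_match("abc", name)]
print(matches)
```

Only "AppBarController.ts" contains a, b and c in that order, so it is the only match. "cab.txt" has all three letters but in the wrong order.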
The file picker does not use regular expressions, but a fuzzy search algorithm. I think this feature is somehow connected to IntelliSense, but I am not aware of any detailed technical documentation. However, it was introduced in December 2015 (VSCode 0.10.6) and became a default setting in January 2016 (VSCode 0.10.9).
On GitHub you can find an issue collecting bug reports / feature requests regarding the fuzzy searching. If you want to dig much deeper into this topic, you might find a good starting point there.
As a side note, the User Settings (File > Preferences > Settings) also seem to use the same kind of fuzzy search.
I want to match some terms only if the term that follows them is relevant for me. So I've created a minimal example:
This is my Test Data:
small Large
Large
small
And I want to mark the terms "small Large" and "Large" but not "small".
So I thought, something like this should work:
DECLARE Test;
(SW*? CW) {-> CREATE(Test)};
But RUTA only matches "small Large".
For testing I've replaced "SW" with "W", and then it does what I want.
Unfortunately, optional quantifiers at the beginning of a rule are not optional if the rule starts to match with the first rule element. This means that you either need two rules or you need to change the order of the rule element matching.
Changing the order of the rule element matching leads to different rule matches, since not all incremental sequences of SWs before the CW are considered. However, this is something one would normally prefer anyway. The rule would look like:
(SW* @CW) {-> CREATE(Test)};
The two rules approach would look something like:
(SW+? CW) {-> CREATE(Test)};
CW {-> CREATE(Test)};
I recommend avoiding reluctant quantifiers unless they are really required, because they cause additional computation that is not necessary. Rather, use the PARTOF condition, even if it does not look as nice.
DISCLAIMER: I am a developer of UIMA Ruta
I'm in the middle of a conversion project from unstructured FrameMaker to DITA-compliant, structured FrameMaker. The customer wants xrefs to be underlined in the output. Seems straightforward enough, but I've been all over the documentation and all over the internet and can't find what I need. The EDD file shows that we should be using the "link.external" style, which makes perfect sense, but for the life of me I can't figure out where link.external is defined. I've found one piece of documentation in all my searching that sort of comes close to what I need, but the process for styling an xref, according to this document, is long and laborious. I just can't believe that applying a simple style to an element is so hard. Where would I look for the definition of the "link.external" style (or any other style, for that matter)? What obvious point am I missing?
You apply the style in the Cross-Reference panel using building blocks in the cross-reference format(s).
For example:
Section 2.3.4, Volcanoes.
would be styled using the x-ref format below:
Section <$paranumonly>, <Emphasis><$paratext>.
Therefore, to underline all of the x-refs, create an underline character format such as Underline, and use it in a building block within every x-ref format that you have.
<Underline>“<$paratext>” on page\ <$pagenum>
The change only applies to the x-ref, not to the following text.
My NetBeans dictionary is kind of... illiterate? It's flagging words like "website" and the "doesn" part of doesn't. I right-clicked expecting to see your standard Add to dictionary... option but found none. I browsed the menus and also found nothing.
How do I educate my NetBeans spellchecker?
It looks like the spell checker is a relatively recent addition. There are basic instructions on how to change the dictionary here.
Adding an unknown word to the dictionary requires pressing Alt + Enter while the cursor is on the 'misspelled' word. This might take care of the most glaring omissions.
If it highlights just 'doesn', then it probably isn't aware of English-style contractions (i.e., it doesn't know that words can span across an apostrophe). Until that is fixed, I would recommend just adding 'doesn' as a separate word using the above method.