Is it possible to ignore paragraph marks when using getTextRanges() in word add in? - ms-word

I am currently developing a word addin using the office js library. I need to get all sentences in the word document as individual ranges. For this I used getTextRanges() on the body of the document with "." as the delimiter. However, it also separates on paragraph mark which is not ideal for my use case. All I want is for the document to be divvied up into ranges where the only delimiter is "." - regardless of whether the ranges will then expand across paragraphs.
Is there a way to ignore paragraph marks with getTextRanges(), or is there another method entirely that I seem to have overlooked?
Thanks.
I have been unable to resolve it.

Related

VSCode multiline search of two words?

I saw a SO post that says you can search using regex or an actual literal text on it to search multiline texts. But what if you want to (quickly) search two or three of words within a specified lines of text content?
For example, what if you want to search for multiline text area that contains "ruby" and "regex" (assuming you want to know where you took a note on your txt (or markdown or rich text format) file. you may want to search for "how to use regex in ruby" or "the ruby regex tutorial", right? )
Now you can use a simple (but redundant) regex like ruby(.*\n)+regex|regex(.*\n)+ruby. But to me it doesn't look beautiful. For three or more words, this kind of regex workaround increases its redundancy exponentially also, not good.
So is there a smarter way to do this? Thanks.

How to generalize special entities

We use Apache UIMA Ruta for processing our documents. The input documents contains all kind of patterns that we try to recognize and translate to a hierarchy of annotations.
One of the things we will do with the result is to decorate the input text with links. For that it's import that we know the original position information of the found annotations.
Some of the annotations are based on value lists. We use MarkTable to resolve them.
The problem we have is that input document can contain different kind of special entities. For example, the document can contain also words that contain entities like & 𝌆. These can also exist in words / sentences that will be looked up into valuelists.
We are searching for an option to generalize (convert) all that kind of options to a normal "plain text" format, so that we don't have to add all kind of options, with special entities to the valuelists.
Doing a pre-processing of the document and replace them all (for example with the HTMLConverter engine) is AFAIK not a good option, because that will also change the position info. & should match on &, but still seen as size 5.
I tried to use the replace action, that will add an extra "replacement" attribute to the annotation. When I add an interceptor (aspect) to the getCoveredText of the annotation class, and return replacement instead of real text if available, the matching will succeed. But this give problems if the replacement text contains spacers (the end position is still equal with the original text / first RutBasic).
Any suggestions how we can solve this?
I solved this issue by building a pre- and post processor for the content.
In the pre-processor I replace text fragments with other text. For example the & and & will be replaced by a normal &. While preprocessing I store each replacement details in an replacement object, that will be added to an ordered list. A replacement object contains the original text and the difference in length (& is 4 characters longer than a single &).
After annotating with RUTA(and other annotators) I correct all the found annotation values (text) to the original value and I fix the position information (begin and end) of the annotations, so that they match with the original content. I use the list with replacement details for this process.

Removing some paragraph marks in a word document

I copy text from PDF files into word 2010 documents using Abbyy conversion software. I find the result will contain many line breaks which are incorrect. Is there any way I can remove any such marks if they are not preceded by either "." or "?" or "!"
I write macros in excel but have no experience of word coding
You could do a search and replace depending if you can find some sort of rules wich you can apply. Mayeby a little screenshot?

How can I use the DocX library to change the font globally, remove superfluous spaces, and remove or add extra line breaks?

I want to, using the DocX library [https://docx.codeplex.com/], convert a .docx document to use a different font. Does anybody know how to do that? The samples projects are very spare, and the documentation is nonexistent.
I find, too, that often there are extraneous spaces in documents, and I want to iterate over all these until there are never two contiguous spaces. I can do this in a loop, I guess, replacing " " (2 spaces) with " " (1 space) until " " (2 spaces) is no longer found.
However, I also want to remove superfluous line breaks that sometimes occur when copying-and-pasting text into a document. I can do it "manually" (in Libre Office, not sure how it's done in MS Word), as I got an answer to this question:
(select "Regular Expressions" and then replace "$" (without the quotes) with a space)
...but how programmatically, with DocX?
Additionally, in some cases I want to ADD line breaks/"paragraph returns" where there are legitimate line breaks between the end of one paragraph and the start of another, but no extra line to separate them visually. According to this:
...I can add a paragraph/line break to a legitimate line break by searching for "$" and replacing that with "\n\n"
This does work, too (manually, in Libre Office); but again...how to do this with the DocX library?
It appears that not all of this is possible with the current version of the DocX library you are using. If it is not exposed in documentation, the functions might as well not exist, and you should not be using undocumented features.
There is a much more mature library available, however, called the "Open XML SDK", that can do everything you need.
The correct way to change a font, regardless of whether you are doing it with the document editor, or you are writing a program to manipulate these files, is to change the appropriate text's style attribute, or changing the definition of style in use.
You should never, ever, ever, ever directly change the font of any text. Personally, I think that the 'font type' and 'font size' menus should be removed entirely from word/libreoffice/etc, and only be accessible inside a 'change style properties' dialog; the only reason to directly apply a font is if you are actually providing an example of particular typeface under discussion!
See How to: Replace the styles parts in a word processing document (Open XML SDK) from the MSDN documentation for a description of the way that works.
To search and replace text, the applicable MSDN page is How to: Search and replace text in a document part (Open XML SDK). For specifically replacing multiple spaces with a single space, there are numerous results on Google that should all work to at least some degree.

How can I identify an OpenXml Paragraph as one I programmatically inserted?

I am programmatically adding an OpenXML paragraph to a Word Document and I need to be able to identify that paragraph as mine later on. Any ideas on how to do this? I have tried inserting an XML comment and extended attributes but when you save the document in word it removes all unknown xml. It doesn't matter if it is an attribute in the paragraph or the run, or an element before the paragraph, just some way I can identify it later on. Also, I do not want this identifier visible in the word document.
Examples of what I could use:
<paragraph id="myParagraph"></paragraph>
<otherelement>myparagraph</otherelement>
<paragraph></paragraph>
Any help would be AWESOME because my head it hurting from the brick wall I have been running into.
Thanks!
Give the paragraph a w:rsidR attribute and assign a unique value to it; if there is no value present when word saves the document it will randomly assign it's own 8-digit hexadecimal value anyway. (The value is not limited to 8 digits or hexadecimal characters. Word will not modify existing RSIDs.)
That being said -- make sure to keep RSID values unique and do NOT modify existing RSID attributes -- they are the unique ID for that paragraph, and if the document splits into multiple versions and a user tries to merge them back together those RSIDs are used to determine what paragraphs have changed.
(Also note that runs have RSIDs as well.)
If the user modifies the paragraph, the RSID of that paragraph may change.
The alternate option is to use Custom XML: http://msdn.microsoft.com/en-us/library/bb608618.aspx
Use stylename in paragraph properties.
or try this one
http://msdn.microsoft.com/en-us/library/office/hh674468.aspx
Hope this helps.