How can I identify an OpenXml Paragraph as one I programmatically inserted? - openxml

I am programmatically adding an OpenXML paragraph to a Word document and I need to be able to identify that paragraph as mine later on. Any ideas on how to do this? I have tried inserting an XML comment and extended attributes, but when you save the document in Word it removes all unknown XML. It doesn't matter if it is an attribute in the paragraph or the run, or an element before the paragraph, just some way I can identify it later on. Also, I do not want this identifier visible in the Word document.
Examples of what I could use:
<paragraph id="myParagraph"></paragraph>
<otherelement>myparagraph</otherelement>
<paragraph></paragraph>
Any help would be AWESOME because my head is hurting from the brick wall I have been running into.
Thanks!

Give the paragraph a w:rsidR attribute and assign a unique value to it; if there is no value present when Word saves the document, it will randomly assign its own 8-digit hexadecimal value anyway. (The value is not limited to 8 digits or hexadecimal characters. Word will not modify existing RSIDs.)
That being said -- make sure to keep RSID values unique and do NOT modify existing RSID attributes -- they are the unique ID for that paragraph, and if the document splits into multiple versions and a user tries to merge them back together, those RSIDs are used to determine which paragraphs have changed.
(Also note that runs have RSIDs as well.)
If the user modifies the paragraph, the RSID of that paragraph may change.
The alternate option is to use Custom XML: http://msdn.microsoft.com/en-us/library/bb608618.aspx
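As a rough sketch of the RSID idea, the following Python reads paragraphs back by their w:rsidR value. The marker value 00ABC123, the helper names, and the sample XML are all hypothetical; in a real .docx the XML would come from word/document.xml inside the package (for example via the zipfile module).

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical marker value chosen by the inserting program.
MY_RSID = "00ABC123"


def find_marked_paragraphs(document_xml, rsid=MY_RSID):
    """Return all w:p elements whose w:rsidR attribute equals rsid."""
    root = ET.fromstring(document_xml)
    return [
        p for p in root.iter(f"{{{W_NS}}}p")
        if p.get(f"{{{W_NS}}}rsidR") == rsid
    ]


def paragraph_text(paragraph):
    """Concatenate the text of all w:t elements inside a paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))


# Hand-written sample standing in for word/document.xml (an assumption).
SAMPLE = f"""
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p w:rsidR="00112233"><w:r><w:t>Typed by the user</w:t></w:r></w:p>
    <w:p w:rsidR="{MY_RSID}"><w:r><w:t>Inserted by my tool</w:t></w:r></w:p>
  </w:body>
</w:document>
"""

marked = find_marked_paragraphs(SAMPLE)
```

Because the RSID of a paragraph may change when the user edits it, a lookup like this is best treated as a hint rather than a guaranteed identifier.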

Use stylename in paragraph properties.
or try this one
http://msdn.microsoft.com/en-us/library/office/hh674468.aspx
Hope this helps.

Related

Is it possible to ignore paragraph marks when using getTextRanges() in word add in?

I am currently developing a Word add-in using the Office JS library. I need to get all sentences in the Word document as individual ranges. For this I used getTextRanges() on the body of the document with "." as the delimiter. However, it also splits on paragraph marks, which is not ideal for my use case. All I want is for the document to be divided into ranges where the only delimiter is ".", regardless of whether the ranges then extend across paragraphs.
Is there a way to ignore paragraph marks with getTextRanges(), or is there another method entirely that I seem to have overlooked?
Thanks.
I have been unable to resolve it.

Deleting the last paragraph in a table cell in MS Word

In an empty table cell with the cursor inside, I am inserting multiple paragraphs of text (each with different styles) using insertFileFromBase64.
When no newline is present at the bottom of the Base64 DOCX file, the last paragraph will not receive the style assigned to it within the Base64 DOCX file.
However, when a newline is present within the Base64 DOCX file, I cannot get rid of it. Selecting the last paragraph within the cell and performing a delete does not return an error, but the newline remains present.
I guess this is related to the special status of the "cell marker" within Word, but I cannot find a way around this problem.
Does anyone know a solution?
Found workaround myself: when you FIRST set the style of the current paragraph to the style of the very last paragraph that is contained in the Base64 DOCX file, then the problem can be avoided. (Of course, this supposes that you know upfront the style of that very last paragraph -- which is not necessarily always the case.)

How to generalize special entities

We use Apache UIMA Ruta for processing our documents. The input documents contain all kinds of patterns that we try to recognize and translate to a hierarchy of annotations.
One of the things we will do with the result is to decorate the input text with links. For that it's important that we know the original position information of the found annotations.
Some of the annotations are based on value lists. We use MarkTable to resolve them.
The problem we have is that an input document can contain different kinds of special entities. For example, a document can contain words with entities such as &amp; or a numeric character reference for 𝌆. These can also occur in words or sentences that will be looked up in the value lists.
We are searching for an option to generalize (convert) all these variants to a normal "plain text" format, so that we don't have to add every variant with special entities to the value lists.
Pre-processing the document and replacing them all (for example with the HTMLConverter engine) is AFAIK not a good option, because that will also change the position info. &amp; should match &, but should still be seen as size 5.
I tried to use the replace action, which adds an extra "replacement" attribute to the annotation. When I add an interceptor (aspect) to the getCoveredText of the annotation class and return the replacement instead of the real text if available, the matching succeeds. But this gives problems if the replacement text contains spaces (the end position is still equal to that of the original text / first RutaBasic).
Any suggestions how we can solve this?
I solved this issue by building a pre- and post processor for the content.
In the pre-processor I replace text fragments with other text. For example, &amp; and &AMP; are replaced by a plain &. While pre-processing, I store the details of each replacement in a replacement object, which is added to an ordered list. A replacement object contains the original text and the difference in length (&amp; is 4 characters longer than a single &).
After annotating with Ruta (and other annotators), I restore the values (text) of all found annotations to the original text and fix their position information (begin and end) so that they match the original content. I use the list of replacement details for this process.
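The pre- and post-processing idea can be sketched roughly as follows. This is not the actual implementation from the answer: the entity map, the tuple layout of a replacement, and the function names are assumptions made for illustration, and the annotators themselves are left out.

```python
import re

# Illustrative entity map (an assumption); a real one would cover more entities.
ENTITY_MAP = {"&amp;": "&", "&AMP;": "&"}
ENTITY_RE = re.compile("|".join(re.escape(e) for e in ENTITY_MAP))


def preprocess(text):
    """Replace entities with plain text.

    Returns (clean_text, replacements). Each replacement is a tuple
    (position in clean text, length of plain text, length difference).
    """
    parts, replacements = [], []
    last = 0        # end of the previous match in the original text
    clean_len = 0   # length of the clean text built so far
    for match in ENTITY_RE.finditer(text):
        parts.append(text[last:match.start()])
        clean_len += match.start() - last
        plain = ENTITY_MAP[match.group()]
        replacements.append((clean_len, len(plain), len(match.group()) - len(plain)))
        parts.append(plain)
        clean_len += len(plain)
        last = match.end()
    parts.append(text[last:])
    return "".join(parts), replacements


def to_original(begin, end, replacements):
    """Map [begin, end) offsets in the clean text back to the original text."""
    orig_begin, orig_end = begin, end
    for pos, plain_len, delta in replacements:
        if pos + plain_len <= begin:   # replacement lies entirely before begin
            orig_begin += delta
        if pos < end:                  # replacement starts before end
            orig_end += delta
    return orig_begin, orig_end


original = "Johnson &amp; Johnson Ltd. R&amp;D"
clean, replacements = preprocess(original)

# Pretend an annotator matched this value-list entry in the clean text.
begin = clean.index("Johnson & Johnson")
end = begin + len("Johnson & Johnson")
orig_begin, orig_end = to_original(begin, end, replacements)
```

Slicing the original text with the mapped offsets yields the covered text including the entity, so links can be placed at the right positions.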

Docvariable with empty string value

In Word, I'm using DocVariables to manage pluralization.
A VBA macro changes the value of several DocVariables to pluralize / singularize them.
But sometimes I want to use a DocVariable only to enable/disable an 's' suffix.
Problem: I cannot set it to an empty string, because that deletes the DocVariable.
The field then displays an error in Word.
So I'm looking for a way to achieve this. It could be:
A way to keep a DocVariable in existence with an empty string or equivalent value
A field formula that does the job if the variable doesn't exist
Any other workaround would be appreciated.
Thank you
A Document Variable (used in DocVariable field codes) cannot exist if it has no content.
A possibility would be to also store the space in this DocVariable so that it displays s[space] or just [space].
Otherwise you may need to write this information to a Bookmark (possibly using a Set field) and display the content using a Ref field.

Manipulate status of links in Word document with OpenXML SDK

I have a Word document with some links to cells in Excel files. In Word, I can open a context menu that leads to a window listing all the links in the document. There, I can see and manipulate properties of the links.
Among others, there is the section "Update method for selected link" (the wording may differ; I translated it from the German version), with two radio buttons "automatic" / "manual" and a checkbox "locked".
I want to modify these properties (especially the locked checkbox) with OpenXML, but I could not find where in the model this information is stored. I printed the OuterXML for a link with locked checked and for a link with locked unchecked, but did not find any differences in the parameter field (\a \f 5 \h \* MERGEFORMAT - for both!)
Does anyone know how I can modify this with the OpenXML SDK?
Thanks in advance,
Frank
Word has different ways to represent the LINK in Office Open XML depending partly on the format of the link (e.g. whether you Paste Link to an object or to plain text).
For example, if you paste a link to a "Microsoft Excel Worksheet Object", although Word displays a LINK field in the document, the XML does not actually record the field code using either the simple or more complex encoding for field codes. It actually encodes the object in a <w:object> element that records information about a "shape", with the shape type in <v:shapetype>, the shape itself in <v:shape>, and information about the OLE link in <o:OLEObject>
In that case, Automatic link updating is recorded using
<o:OLEObject UpdateMode='Always'> for automatic links
and
<o:OLEObject UpdateMode='OnCall'> for manual links.
Whether or not the link is Locked is recorded in
<o:OLEObject><o:LockedField></o:LockedField></o:OLEObject>
(either as "false" or "" AFAICS).
Word reconstructs the LINK field code from the w:object information when it displays the document.
However, if you paste the link as text, the XML Word records will contain a complex field code construction, starting with a <w:fldChar w:fldCharType='begin' /> element.
In that case, the fact that the link is locked is indicated by a value of '1' in the w:fldLock attribute, and probably by the absence of that attribute if it is not locked, e.g.
<w:fldChar w:fldCharType='begin' w:fldLock='1' />
In either case, an automatic link is indicated by the presence of the \a switch in the field code (reconstructed in the case of the first example). If there is no \a switch, it's not an automatic link.
That may not cover all the possible cases but should give you some clues about where to look in the XML.
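To show where those values sit, here is a small read-only sketch in Python rather than the OpenXML SDK. The sample XML is a hand-written fragment, not output from Word, and the function name is hypothetical. The LockedField text is returned as-is and not interpreted, and treating "true" like "1" for w:fldLock is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "o": "urn:schemas-microsoft-com:office:office",
}
W = "{%s}" % NS["w"]


def describe_links(document_xml):
    """Collect link settings from both encodings described above.

    Returns (ole_links, field_locks): ole_links holds the raw UpdateMode and
    LockedField values of each o:OLEObject; field_locks holds one boolean per
    'begin' w:fldChar, True when w:fldLock is set.
    """
    root = ET.fromstring(document_xml)

    ole_links = []
    for ole in root.iterfind(".//o:OLEObject", NS):
        locked = ole.find("o:LockedField", NS)
        ole_links.append({
            "update_mode": ole.get("UpdateMode"),
            # Raw text only; which value means "locked" is not decided here.
            "locked_field": None if locked is None else (locked.text or ""),
        })

    field_locks = []
    for fld in root.iterfind(".//w:fldChar", NS):
        if fld.get(W + "fldCharType") == "begin":
            # Assumption: "true" is accepted as an alternative spelling of "1".
            field_locks.append(fld.get(W + "fldLock") in ("1", "true"))

    return ole_links, field_locks


# Hand-written fragment for illustration only.
SAMPLE = """
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:o="urn:schemas-microsoft-com:office:office">
  <w:body>
    <w:p><w:r><w:object>
      <o:OLEObject Type="Link" UpdateMode="OnCall">
        <o:LockedField>false</o:LockedField>
      </o:OLEObject>
    </w:object></w:r></w:p>
    <w:p>
      <w:r><w:fldChar w:fldCharType="begin" w:fldLock="1"/></w:r>
      <w:r><w:fldChar w:fldCharType="end"/></w:r>
    </w:p>
  </w:body>
</w:document>
"""

ole_links, field_locks = describe_links(SAMPLE)
```

Changing a setting would mean writing the same attributes or elements back, for example setting w:fldLock on the 'begin' w:fldChar.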