Match paragraph in OpenXML SDK to interop paragraph in Word document - ms-word

Word interop is extremely slow when I parse the text of a document with 100+ pages, so I rewrote my code to use the OpenXML SDK, which is much faster. My problem is that once I have found the information in the OpenXML document, I then have to locate it in the Word document and scroll the main window to it. To do that, I have to somehow match an OpenXML paragraph to an interop paragraph. I thought that interop paragraphs matched OpenXML paragraphs one-to-one, but I was wrong: interop usually has more paragraphs than OpenXML. Is there any trick or extra information that could help me match them? For example, I have figured out that interop usually has one extra empty paragraph after every row in a table. I could account for that, but I am afraid there are more cases than the one I have found.
UPDATE
Below are screenshots of a simple add-in I created to demonstrate the difference between interop and OpenXML paragraphs in a Word document with simple content like this:
The add-in retrieves the list of interop paragraphs and the list of OpenXML paragraphs and shows them side by side:
Here is the code I used:
var document = Globals.ThisAddIn.Application.ActiveDocument;
if (document == null)
    return;
var interopParagraphs = document
    .StoryRanges
    .Cast<Range>()
    .SingleOrDefault(r => r.StoryType == WdStoryType.wdMainTextStory)
    .Paragraphs
    .Cast<Paragraph>()
    .Select(p => p.Range.Text);
var openXmlDocument = WordprocessingDocument.FromFlatOpcString(document.Content.WordOpenXML);
if (openXmlDocument == null)
    return;
var openXmlParagraphs = openXmlDocument
    .MainDocumentPart
    .Document
    .Body
    .Descendants<DocumentFormat.OpenXml.Wordprocessing.Paragraph>()
    .Select(p => p.InnerText);
var compareDialog = new CompareForm(interopParagraphs, openXmlParagraphs);
compareDialog.ShowDialog();

Turning my comment into an answer.
For the case of table rows, you can check to see whether you are looking at an end-of-row paragraph using Range.IsEndOfRowMark.
This property returns True if the specified range is collapsed and is located at the end-of-row mark in a table, and False if not.
You can also use Range.Information[WdInformation.wdAtEndOfRowMarker].
Returns True if the specified selection or range is at the end-of-row mark in a table
Despite the slight difference in the documentation, the range must be collapsed for this property as well. AFAIK, they are equivalent.
I also noticed that this doesn't work if you access a paragraph directly, e.g. Document.Paragraphs[4]. You have to iterate through the paragraphs for it to work. This does not seem to be documented.
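As a rough illustration of how the end-of-row flag could be used for matching, here is a minimal sketch in plain JavaScript so it can run outside Word. The input shape ({ text, isEndOfRowMark }) and the helper name mapInteropToOpenXml are hypothetical; in the add-in, the flag would come from Range.IsEndOfRowMark while iterating the interop paragraphs. It assumes end-of-row marks are the only extra interop paragraphs, which may not hold for every document.

```javascript
// Hypothetical helper: maps each interop paragraph index to an OpenXML
// paragraph index, or null when the interop paragraph is an end-of-row mark.
// Assumption: end-of-row marks are the ONLY paragraphs interop adds.
function mapInteropToOpenXml(interopParagraphs) {
  const mapping = [];
  let openXmlIndex = 0;
  for (const paragraph of interopParagraphs) {
    if (paragraph.isEndOfRowMark) {
      // No OpenXML counterpart for this paragraph.
      mapping.push(null);
    } else {
      mapping.push(openXmlIndex);
      openXmlIndex += 1;
    }
  }
  return mapping;
}

// Illustrative data only: one body paragraph, a two-cell table row, one more paragraph.
const interop = [
  { text: "Intro", isEndOfRowMark: false },
  { text: "Cell A", isEndOfRowMark: false },
  { text: "Cell B", isEndOfRowMark: false },
  { text: "", isEndOfRowMark: true },
  { text: "Outro", isEndOfRowMark: false },
];
console.log(mapInteropToOpenXml(interop)); // [0, 1, 2, null, 3]
```

If the two lists still drift apart after this, comparing the texts at the mapped positions should show where another kind of extra paragraph appears.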

Related

Get Font Properties From Word Document With OpenXML (.NET)

How can I get font properties from word document with OpenXML?
var para = wordDocument.MainDocumentPart.RootElement.Descendants<Paragraph>().ToList();
With the code above, I can only get the paragraphs themselves.
The forum posts I found only show how to insert a font.
Please help me.
Although I don't really know what "font properties" means in this context, my answer is: it depends.
Styles (templates defining paragraph or run format, etc.) are set in MainDocumentPart.StyleDefinitionsPart.
Formatting properties are defined in RunProperties or ParagraphProperties (applied styles can also be found here).
So if you want to retrieve certain formatting properties, you will have to look inside the OpenXML package.

Reaching the referenced text from a Word interop field object

I am using Word interop to build a Word plugin. In this plugin I have a case where I want to examine all Field objects in the document, and when a field is a cross-reference to another place in the same document I need to be able to capture the text in the paragraph that the field is referring to.
I was able to get the name of the field object but there were no bookmarks defined in the Document although in Word I could click on the field to get to the other location.
Example field
Example field as code
referenced text I need to get
No Bookmark objects are defined
I tried to simulate the user clicking on the field by invoking DoClick() on it and then I accessed V_V_Scalar_Document_Generic.Application.Selection.Range.Text, but it gave nothing. I also tried the GoTo approach below but still didn't reach the referenced text.
System.Collections.Generic.List<string> L_V_List_String_Fields = new System.Collections.Generic.List<string>();
foreach (Field L_V_Scalar_Field_Item in V_V_Scalar_Document_Generic.Range.Fields)
{
    try
    {
        if (L_V_Scalar_Field_Item.Type == WdFieldType.wdFieldRef)
        {
            // L_V_Scalar_Field_Item.Data --> gives COM exception
            // L_V_Scalar_Field_Item.Code.ID --> blanks
            // L_V_Scalar_Field_Item.DoClick() 'will not help because fields are not always hyperlinks
            // L_V_Scalar_Field_Item.Result.Text --> gives the text of the field itself
            // all variations I tried for the target parameter in the line below (last param) are not working
            // V_V_Scalar_Document_Generic.[GoTo](Microsoft.Office.Interop.Word.WdGoToItem.wdGoToField, System.Type.Missing, System.Type.Missing, "_Ref28680085")
            // Dim L_V_Scalar_String_Source as string = V_V_Scalar_Document_Generic.Application.Selection.Range.Text
            L_V_List_String_Fields.Add($"CodeText:{L_V_Scalar_Field_Item.Code.Text} |FieldType:{L_V_Scalar_Field_Item.Type} |FieldKind:{L_V_Scalar_Field_Item.Kind} |SourceText:{"source text ??"}");
        }
    }
    catch (Exception L_V_Scalar_Exception_Generic)
    {
    }
}
The bookmarks are not listed because Word has a convention that bookmarks with names starting with an underscore ("_") are "hidden". In the Insert->Links->Bookmark dialog box, you can see them if you check the "Hidden Bookmarks" box, but in the Find and Replace box, you have to enter the name manually.
Even when Bookmarks are hidden, you can reference them. So for example you should be able to do something like this (this is VBA syntax):
Dim TargetText As String
TargetText = ActiveDocument.Bookmarks("_Ref28680085").Range.Text
to get the text "covered" by the bookmark. In theory, you could use GoTo with wdGoToBookmark instead of wdGoToField, except that I think it will only have a chance of working with the Selection object, not a Range object.
Depending on what type of cross-reference the user inserts, Word "covers" different parts of the referenced material. So you may need to construct the Range you really need, e.g. using the Bookmark's Range.Start to tell you which paragraph the reference is pointing at.
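To look up the hidden bookmark, you first need its name, which appears in the field code text (Field.Code.Text in the question's loop). Here is a minimal sketch of that extraction in plain JavaScript so it runs outside Word. The helper name extractRefBookmarkName is hypothetical, and it assumes the code text has the usual form " REF _Ref28680085 \h " with an explicit REF keyword and an unquoted bookmark name.

```javascript
// Hypothetical helper: pulls the bookmark name out of a REF field's code text.
// Assumption: the code text starts with the REF keyword followed by an
// unquoted bookmark name, e.g. " REF _Ref28680085 \h ".
function extractRefBookmarkName(codeText) {
  const match = /^\s*REF\s+(\S+)/i.exec(codeText);
  return match ? match[1] : null;
}

console.log(extractRefBookmarkName(" REF _Ref28680085 \\h ")); // _Ref28680085
console.log(extractRefBookmarkName(" PAGE "));                 // null
```

The returned name can then be passed to the Bookmarks collection as in the VBA lines above.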

Does the Office.js API support multiple range selection?

I need to select multiple ranges simultaneously via the Office.js API, as you can in the MS Word UI by holding down the CTRL key and highlighting multiple non-contiguous paragraphs, like the screenshot below:
This attempt doesn't work. Rather than highlighting the first two instances of the word "the" in the document together, it highlights the first and then highlights the second instead:
Word.run(function (context) {
    // Set up the search options.
    var options = Word.SearchOptions.newObject(context);
    options.matchCase = false;
    options.ignoreSpace = true;
    options.ignorePunct = true;
    options.matchWildcards = true;
    var searchText = "the";
    var searchResults = context.document.body.search(searchText, options);
    context.load(searchResults);
    return context.sync().then(function () {
        searchResults.items[0].select();
        searchResults.items[1].select();
    });
});
No, none of the APIs support multiple selections. Even the ability for the user to do so, using Ctrl+select is relatively new. The capability was never carried over to the APIs.
The closest the APIs can do is to highlight (or otherwise format) the Range objects of interest. There is such functionality in Word's dialog box which is also available to the COM APIs, but I don't find an equivalent for the JS APIs...
To confirm what Cindy mentioned: non-contiguous selections are not supported in Office.js for Word (we DO support them for Excel, though), and they are also not supported manually on other platforms (i.e. Word Online).
It might be possible.
I ran across an odd result when using bindings and Office.context.document.goToByIdAsync(). Using this function you can navigate to any binding without having to call Word.run(), which is nice. There is an option called SelectionMode, which by default does not select the binding, but can be set to select the contents of the binding. Weirdly, selecting the content in this way does not deselect the current selection! Which is not the result I wanted, fwiw; to me it is a nuisance that requires me to "deselect" any current selection before I use goToByIdAsync. But it's possible you could use this to select multiple ranges by wrapping them in content controls and then creating bindings on them, then calling goToByIdAsync (with SelectionMode set to Select) on each binding. I have not tested this.
Edit
Actually, the previous selection is deselected, but it remains highlighted as though it is still selected. This appears to be a display bug.

Merging documents using OpenXml and section breaks causes empty paragraphs

I am stitching a couple of documents together with a requirement that each document should retain its header and footer information in the final document. Using AltChunk instead of raw OpenXml or DocumentBuilder saves a lot of effort with regard to styles, formatting, references, parts, etc.
Unfortunately, after a couple of days I can't seem to get a 100% working version due to a small and frustrating issue, and I need some insight.
My code is loosely based on this article
I modify each sub document, prior to appending it (as an AltChunk) to a working document, by moving the last section properties into the last paragraph (in order to retain header and footer references), but Word seems to be adding a blank paragraph to each of these documents as it renders them in the final document. I end up with:
document 1 with correct header and footer
section properties/break
blank paragraph
document 2 with correct header and footer
section properties/break
blank paragraph
etc.
I can't remove the blank paragraphs afterwards, as I ideally don't want to use WAS to render the document first.
It seems as if you cannot have a next-page section break without a following paragraph?
After further investigation, it seems there is no way around this in my usage scenario. I would need to place the last section properties in the body element, but because of the way I process nested AltChunks, that would not work.
I have changed my approach completely and gone back to a more detailed append procedure using OpenXml Power Tools and some LINQ to XML.
I'm using DocumentBuilder and it works perfectly for me!
var sources = new List<OpenXmlPowerTools.Source>();
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(@tempReportPart1)));
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(@tempReportPart2)));
var outputPath = @"C:\Users\xpto\Documents\TestFolder\myNewDocument.docx";
DocumentBuilder.BuildDocument(sources, outputPath);
I had a similar empty-paragraph issue while importing HTML files.
My solution is as follows:
After inserting the HTML AltChunk, I add a GUID placeholder. After processing the file, I open it again, locate the GUID and check whether there is an empty paragraph before it; if so, I remove both the empty paragraph and the GUID. It seems to work perfectly in my solution.
Hope it helps.
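The clean-up step described above can be sketched on a simplified model in which the document is just an array of paragraph texts. This is plain JavaScript for illustration only; the helper name removePlaceholderAndBlankBefore and the array model are assumptions, and a real implementation would walk the paragraph elements of the reopened document instead.

```javascript
// Hypothetical helper: drops every placeholder paragraph and, if the paragraph
// immediately before it is empty, drops that empty paragraph as well.
// Assumption: the document is modelled as an array of paragraph texts.
function removePlaceholderAndBlankBefore(paragraphs, placeholder) {
  const result = [];
  for (const text of paragraphs) {
    if (text === placeholder) {
      const last = result[result.length - 1];
      if (result.length > 0 && last.trim() === "") {
        result.pop(); // remove the empty paragraph before the placeholder
      }
      continue; // never keep the placeholder itself
    }
    result.push(text);
  }
  return result;
}

// Illustrative placeholder value; any unique marker string works.
const marker = "PLACEHOLDER-GUID";
console.log(removePlaceholderAndBlankBefore(["Imported HTML", "", marker, "Next part"], marker));
// ["Imported HTML", "Next part"]
```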

Office 2013 JavaScript API for Word - Content Control questions

Is it possible to insert a content control into a Word document, then get some sort of handle or context to the content control, and then insert HTML into it?
Essentially, the scenario that I am trying to create with the Office JavaScript API is to, upon the user's request, insert a rich text content control, and then populate it with HTML.
I am able to insert the content control from the JavaScript API using the approach suggested at http://social.msdn.microsoft.com/Forums/en-US/appsforoffice/thread/8c4809c7-743c-4388-aef0-bc6a6855c882. It requires a coercionType of ooxml. However, the content that I wish to populate with the ooxml is HTML based. So when I try to insert a content control with the following ooxml:
...Boiler ooxml to create content control...
<w:r><w:t><h1>Test header</h1><h2>Test subheader</h2><p>Test paragraph text</p></w:t></w:r>
The insert attempt fails. I'm assuming that's because you can't mix ooxml and html when inserting this into the document with a coercionType of ooxml.
Since this ooxml approach is the only way you can insert a content control, how can I then set the content control with HTML text? I have looked over the Document object help content at http://msdn.microsoft.com/en-us/library/fp142295.aspx, but I'm unsure how I can do this still, or if it's feasible.
Thanks
Though I have not tried this with JS, it should be possible nonetheless.
Try adding an altChunk element; it can contain other Open XML or HTML. I have used it a few times with success.
A few links on the issue:
http://blogs.msdn.com/b/brian_jones/archive/2008/12/08/the-easy-way-to-assemble-multiple-word-documents.aspx
http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx
You should, however, try to use "strict" XML; otherwise the above might not be possible.
I just found this example (sorry, it's in German, but there should be an English version somewhere as well), in which coercionType is used like this:
Office.context.document.setSelectedDataAsync(
    booksToRead,
    { coercionType: Office.CoercionType.Html },
    function (result) {
        // Access the results, if necessary.
    });
This might do the trick as well.