How can I perform automated tests against MS Word documents using PowerShell?

We regularly need to perform a handful of relatively simple tests against a bunch of MS Word documents. As these checks are currently done manually, I am striving for a way to automate this. For example:
- Check if every page actually has a page number and verify that it is correct.
- Verify that a version identifier in the page header is identical across all pages.
- Check if the document has a table of contents.
- Check if the document has a table of figures.
- Check if every figure has a caption.
- et cetera.
Is this reasonably feasible using PowerShell in conjunction with a Word API?

PowerShell can access Word via its object model/interop (on Windows, at any rate) and, as I understand it, can also work with the Office Open XML (OOXML) API, so really you should be able to write any checks you want on the document content. What is slightly less obvious is how you verify that the document content will result in a particular "printed appearance". I'll start with some comments on the details.
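As a baseline, here is a minimal sketch of driving Word from PowerShell via COM (assuming Word is installed; the file path is a placeholder, and the individual checks sketched further down would go inside the try block):

```powershell
# Minimal sketch: open a document read-only via the Word object model.
$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open("C:\Docs\Report.docx", $false, $true)  # placeholder path; $true = read-only
try {
    # ... run the individual checks against $doc here ...
    Write-Host "Sections: $($doc.Sections.Count)"
} finally {
    $doc.Close(0)   # 0 = wdDoNotSaveChanges
    $word.Quit()
    [Runtime.InteropServices.Marshal]::ReleaseComObject($word) | Out-Null
}
```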
Just bear in mind that in the following notes I'm just pointing out a few things that you might have to deal with. If you're examining documents produced by an organisation where people are already broadly speaking following the same standards, it may be easier.
Of the five examples you give, without checking the details I couldn't say exactly how you would do them, and there could be difficulties with all of them. But, for example:
Check if every page actually has a page number and verify that it is correct.
Difficult using either OOXML or the object model, because what you would really be checking is that the header for a particular section has a visible { PAGE } field code. Because that field code might be nested inside other fields (say, an { IF } field that conditionally suppresses it), it's not so easy to be sure that a page number would actually appear.
Which is what I mean by checking the document's "printed appearance" - if, for example, you can use the object model to print to PDF and have some mechanism that lets PS inspect the PDF's content, that might be a better approach.
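That said, a sketch of the simple object-model check for the common case (primary headers only; different first-page/even-page headers and nested { IF } fields would need extra handling; $doc is an open Document):

```powershell
# Hedged sketch: warn for any section whose primary header lacks a { PAGE } field.
$wdHeaderFooterPrimary = 1   # WdHeaderFooterIndex enum value
$wdFieldPage = 33            # WdFieldType enum value
foreach ($section in $doc.Sections) {
    $header = $section.Headers.Item($wdHeaderFooterPrimary)
    $pageFields = @($header.Range.Fields | Where-Object { $_.Type -eq $wdFieldPage })
    if ($pageFields.Count -eq 0) {
        Write-Warning "Section $($section.Index) has no PAGE field in its primary header"
    }
}
```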
Verify that a version identifier in the page header is identical across all pages.
Similar problem to the above, IMO. It depends partly on how the version identifier might be inserted. Is it just a piece of text? Could it be constructed from a number of fields? Might it reference Document Properties or Variables, or Custom XML content?
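If the identifier is plain text in the header, a crude sketch is to compare the rendered header text across sections (this won't catch per-page differences within a section, or identifiers built from fields that resolve differently when printed):

```powershell
# Hedged sketch: compare primary header text across all sections.
$headerTexts = foreach ($section in $doc.Sections) {
    $section.Headers.Item(1).Range.Text.Trim()   # 1 = wdHeaderFooterPrimary
}
if (@($headerTexts | Select-Object -Unique).Count -gt 1) {
    Write-Warning "Header text differs between sections"
}
```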
Check if the document has a table of contents.
Perhaps it is enough to look for a TOC field that does not have certain options, such as the \c option that a Table of Figures would contain.
Check if the document has a table of figures.
Perhaps it is enough to check for a TOC field that does have a \c option, perhaps with a specific parameter such as "Figure".
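A sketch covering both this check and the previous one: it scans the document's fields, treating a TOC field without \c as a table of contents and one like { TOC \c "Figure" } as a table of figures:

```powershell
# Hedged sketch: detect table-of-contents and table-of-figures fields.
$wdFieldTOC = 13   # WdFieldType enum value
$hasToc = $false; $hasTof = $false
foreach ($field in $doc.Fields) {
    if ($field.Type -ne $wdFieldTOC) { continue }
    $code = $field.Code.Text
    if ($code -match '\\c\s*"?Figure"?') { $hasTof = $true }   # e.g. { TOC \c "Figure" }
    elseif ($code -notmatch '\\c') { $hasToc = $true }
}
if (-not $hasToc) { Write-Warning "No table of contents found" }
if (-not $hasTof) { Write-Warning "No table of figures found" }
```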
Check if every figure has a caption.
Not sure that you can tell whether a particular image is "a Figure". But if you mean "verify that every graphic object has a caption", you could probably iterate through the inline and floating graphics in the document and verify that there is something that looks like a standard Word caption paragraph within a certain distance of the object. Word has two standard field code patterns for captions AFAIK (one where the chapter number is included and one where it isn't), so you could look for those. You could measure the distance between image and caption by ensuring that they are no more than a predefined number of paragraphs apart, or, in the case of a floating image, that the paragraph anchoring the image is no more than so many paragraphs away from the caption.
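A rough sketch of the inline case (floating shapes would need similar handling through $doc.Shapes and their anchoring paragraphs; the style-name check assumes an English UI and Word's default "Caption" style):

```powershell
# Hedged sketch: flag inline images not followed by a caption-like paragraph.
foreach ($shape in $doc.InlineShapes) {
    $para = $shape.Range.Paragraphs.Item(1)
    $next = $para.Next()
    $captioned = $false
    if ($next) {
        $captioned = ($next.Style.NameLocal -eq 'Caption') -or
                     ($next.Range.Text -match '^Figure\b')
    }
    if (-not $captioned) {
        Write-Warning "Inline image without an adjacent caption paragraph"
    }
}
```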
A couple of more general problems that you might have to deal with:
- just because a document contains a certain feature, such as a TOC field, does not mean that it is visible. A TOC field might have been formatted as hidden text. Even harder to detect, it could have been colored white.
- change tracking. You might have to use the Word object model to "accept changes" before checking whether any given feature is actually there or not. Unless you can find existing code that would help you do that using the OOXML representation of the document, that's probably a strong case for doing checks via the object model.
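For example, a minimal sketch of accepting revisions via the object model before running any other checks (best done on a copy of the file, since it modifies the document):

```powershell
# Hedged sketch: accept all tracked changes so checks see the final text.
if ($doc.Revisions.Count -gt 0) {
    $doc.AcceptAllRevisions()   # work on a copy so the original is preserved
}
```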
Some final observations:
- For future checks, it is perhaps worth noting that in principle you could create a "Document Inspector" that users could call from Word's Backstage view to perform checks on a document. I'm not sure you can force users to run it, or that you could create it in PowerShell, but it is perhaps a useful tool.
- Longer term, if you are doing a very large number of checks, it is perhaps worth considering whether you could train an ML model to try to detect problems.

Related

Approach for extracting relevant text using Azure Cognitive Search

Context:
I have a set of documents in SharePoint. I have set up Azure Cognitive Search (Standard tier) with data sources (SharePoint), index and indexers. I have also added a semantic configuration.
Outcome:
Ask a question, and have the search find and return relevant sections from the documents. I will use these sections to feed into OpenAI to construct a cohesive result.
I would like to replicate this Microsoft demo: https://www.youtube.com/watch?v=3t3qZu1Dy1k&t=572s It seems to me that, to create this demo, each document's content is very small, so the pieces could easily be combined and passed into OpenAI.
My experience so far:
The results return the documents and rank them, which seems OK. However, it returns a short 'caption' and the full text. The caption is not necessarily related to my question and therefore cannot be used for the next step. The full document is far too big to be used in OpenAI.
I have managed to get semantic answers; however, the question has to be very precisely phrased to get a result, and the associated text is limited.
What I would like:
I would like the search to return sub-sections of the document, where the results of my question may be. If that is not supported, I feel I need an entirely new approach.
Any ideas? Thanks in advance for your time.
The demo you refer to works by feeding documents to Azure Cognitive Search. A query is then formulated as a question that uses the Semantic Search functionality to return a set of potential semantic answers extracted from the content in the index.
These potential semantic answers are then fed as a prompt to OpenAI's text completion service: https://beta.openai.com/docs/guides/completion
First, you must ensure you can get good semantic answers. Inspect the content you have indexed and verify that it contains content that could semantically be an answer to the questions you test with. Good content should have declarations of facts, i.e. statements that could be used verbatim as an answer to a question. Examples:
The capital of France is Paris.
Forecast for 2022 is expected to be 22%.
The semantic functionality in Azure Search will only respond with a text section containing a potential answer to your question. If you can't get this step to work, you have to work on improving that. Either via semantic configuration, choice of content, or by making sure you process your content so that the items in your index contain the relevant content in the correct properties.
1) Ensure your content is indexed and mapped to properties in a sensible way.
2) Work with the semantic configuration until you get sensible results.
3) Once the previous two steps are OK, submit to OpenAI.
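For the first two steps, a rough PowerShell sketch of requesting semantic answers over the REST API (the service name, index name, semantic configuration name, and API key are placeholders; the api-version shown is one of the previews that supports semantic search):

```powershell
# Hedged sketch: query Azure Cognitive Search for extractive semantic answers.
$serviceName = "my-search-service"      # placeholder
$indexName   = "my-index"               # placeholder
$apiKey      = "<query-api-key>"        # placeholder
$uri = "https://$serviceName.search.windows.net/indexes/$indexName/docs/search?api-version=2021-04-30-Preview"
$body = @{
    search                = "What is the forecast for 2022?"
    queryType             = "semantic"
    queryLanguage         = "en-us"
    semanticConfiguration = "my-semantic-config"   # placeholder
    answers               = "extractive|count-3"
    captions              = "extractive"
} | ConvertTo-Json
$response = Invoke-RestMethod -Uri $uri -Method Post -Body $body `
    -ContentType "application/json" -Headers @{ "api-key" = $apiKey }
$response.'@search.answers' | ForEach-Object { $_.text }   # candidate answers to feed to OpenAI
```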
I have tested the semantic functionality on two different data sets. Both were a combination of website content, PDF and Word documents, etc. The topic and volume of content were essentially the same. From one data set, I could get excellent semantic answers. But the other data set was disappointing.
My conclusion was that the content in the good data set was formulated and structured in a way that fits a semantic scenario. The other data set would often have logic and meaning presented in tables and layouts. As a human reading the content on paper, you would understand it. But, semantically, it would not make as much sense.

How to find relation between two columns of csv (containing labels and related data) file using doc2vec?

I am working on a problem related to doc2vec where I need to find labels that are related to a particular word. For example (CSV file):
Data: In a future world devastated by disease, a convict is sent back in time to gather information about the man-made virus that wiped out most of the human population on the planet.
Label / Tags: sci-fi

Data: “You have slipped under my skin, invaded my blood and seized my heart. That sounds more like a poison than a person,” was all I could say. His confession had both shocked and thrilled me.
Label / Tags: action
Plenty of data like this is available on which the model can be trained. Now I want results like this: when I enter a particular word, like "virus", it gives me the corresponding labels (sci-fi) wherever the word is used, and also those labels (action) where the word "virus" itself is not present but semantically related words (like poison, poisonous) are. The semantically related words can be easily fetched from the model; I just want to list the labels.
I want to know if something other than plain keyword search could be applied here. Is there a particular method that could help me solve this problem?
Thanks

Make a Class Schedule Report

How can I make my Crystal Report look like the attached image? I have had no success creating it with a crosstab.
The short answer is that Crystal Reports isn't really equipped to handle the format you're dealing with. And here's why:
Let's assume for a moment you've already figured out how to interpret your query into something usable. Since we aren't using a Cross Table, the best you could hope for would be setting up a Details section for each individual time slot and arranging a large number of formulas into a grid shape.
The problem is that every Formula would need to be unique, interpreting whether there is a Class at that Time and Date, and which Class it is. There would be up to 168 of those formulas, and you'd have to manually go in and modify each one to check for its own unique combination of Date and Time, which defeats the whole purpose of using a computer: to make repeated tasks easier.
Plus, you'll have difficulty with the formatting: you'd need to program every "cell" to use a unique set of colors based on the displayed Class. That part is technically doable, but there's no way to "merge the cells" when classes last longer than a half hour, so you'd end up with each longer class chopped into separate half-hour blocks.
So don't torture yourself trying to make this happen in Crystal. Even with all the time and effort it would take to formulate the grid, there's no good way to make it look like your screenshot.
That said, it looks as though you managed to put a schedule together in Excel. Is there any reason you can't use Excel instead? It's a much more powerful tool, and a cursory Google search suggests it can handle queries as well.

Assigning Custom Unique IDs to Word 2013 OpenXML Elements

TLDR/Question
How can I best assign unique IDs to (ideally all) of the elements in the XML that describes a Word document such that I can read/write those unique IDs from a Word (2013) Add-In?
Additionally, solutions describing ways I can get a good diff of two Word documents might be helpful but this is not the primary question.
Background
I'm creating an application-level add-in for Word (2013) using VSTO. Part of my task involves diffing an original Word document W with a modified W' so that I can then process the diff for another task. While Word clearly has the capability for diffs/merges (available in the "Review" panel in Word 2013), thus far I have not been able to find a way to programmatically extract the diffs.
Therefore, I plan to get the XML for the documents (e.g. using Range.WordOpenXML) and diff them. There are a number of published algorithms for diffing XML documents (i.e. Diff(W.XML, W'.XML)) where the accuracy of the diff is largely dependent on being able to properly match the XML elements from the two documents.
Proposed Solution and Its Problems
Therefore, I'd like to be able to assign a unique ID for every element in the XML of the Word document that I can access from my Add-In. In this case a solution would be something like importing a custom namespace into the package called mynamespace and adding the attribute mynamespace:ID=*** for every element in the DOCX package. The attribute would then be accessible via Range.WordOpenXML.
However, simply using mce:Ignorable, mce:ProcessContent, and mce:PreserveAttributes as detailed at http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2012/09/21/markup-compatibility-and-extensibility.aspx does not work. The modified Word document loads without any issues; however, I cannot find any of the attributes, and saving the document removes all of the added markup.
From http://openxmldeveloper.org/discussions/formats/f/13/p/8078/163573.aspx it appears that this process of using custom xml via the Markup Compatibility and Extensibility (MCE) portion of the Office Open XML standard has become complicated over the years (patent issues, etc.). Therefore I'm guessing that my issues arise because Word's XML processor just removes all of the markup that it cannot natively process (maybe there is a way to hook into Word's XML processor and give it custom commands?).
For future viewers:
1) There is absolutely no way to set any kind of ID on most elements that will survive in Word (you can use any custom tags or attributes, but after MS Word opens the document, they're gone).
2) Only two element types can be used as IDs: content controls (ContentControl), which have IDs of their own, and bookmarks, whose names can serve as IDs (it is possible to make a hidden bookmark by adding an underscore before its name; this works only from code).
3) If tracking changes is enabled in Word, it is absolutely possible to see the diffs in the XML, using Range.WordOpenXML and getting the actual OpenXML from it, as explained here, for example.
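To illustrate point 2), a small sketch using hidden bookmarks as IDs; shown here from PowerShell against the object model, but the same calls exist in VSTO ($doc is an open Document, and the bookmark name is an arbitrary example):

```powershell
# Hedged sketch: a leading underscore makes the bookmark hidden (code-only).
$range = $doc.Paragraphs.Item(1).Range
$doc.Bookmarks.Add("_myId_0001", $range) | Out-Null

# Reading the IDs back: hidden bookmarks only enumerate with ShowHidden set.
$doc.Bookmarks.ShowHidden = $true
foreach ($bm in $doc.Bookmarks) {
    if ($bm.Name -like "_*") { Write-Host "ID bookmark: $($bm.Name)" }
}
```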

Benefits of RESTful URL

What are the benefits of the URL
http://www.example.com/app/servlet/cat1/cat2/item
over
http://www.example.com/app/servlet?catid=12345
Could there be any problems if we use the first URL? Initially we were using the first URL and changed to the second. This is in the context of large, constantly changing content on a website, where categories can be infinite in number.
In relation to a RESTful application, you should not care about the URL template. The "better" one is the one that is easier for the application to generate.
In relation to indexing and SEO, sorry, but it is unlikely that the search engines are going to understand your hypermedia API to be able to index it.
To get a better understanding in regards to the URLs, have a look at:
Is That REST API Really RPC? Roy Fielding Seems to Think So
Richardson Maturity Model
One difference is that the second URL doesn't name the categories, so client code and indeed human users need to look up some category-name-to-number mapping first, store those mappings, use them all the time, and refresh the list when previously unknown categories are encountered, etc. Given the first URL you necessarily know the categories even if the item page doesn't mention them (but the site may still need a list of categories somewhere anyway).
Another difference is that the first format encodes two levels of categorisation, whereas the second hides the number of levels. That might make things easier or harder depending on how variable you want the depth to be (now or later) and whether someone inappropriately couples code to 2-level depth (for example, by parsing the URLs with a regexp capturing the categories using two subgroups). Of course, the same problem could exist if they couple themselves to the current depth of categories listed in a id->category-path mapping page anyway....
In terms of SEO, if this is something you want indexed by search engines the first is better assuming the category names are descriptive of the content under them. Most engines favor URLs that match the search query. However, if category names can change you likely need to maintain 301 redirects when they do.
The first form will be better indexed by search engines, and is more cache friendly. The latter is both an advantage (you can decrease the load on your server) and a disadvantage (you aren't necessarily aware of people re-visiting your page, and page changes may not propagate immediately to the users: a little care must be taken to achieve this).
The first form also requires (somewhat) heavier processing to get the desired item from the URL.
If you can control the URL syntax, I'd suggest something like:
http://www.example.com/app/servlet/cat1/cat2/item/12345
or better yet, through URL rewrite,
http://www.example.com/cat1/cat2/item/12345
where 12345 is the resource ID. Then when you access the data (which you would have done anyway), you are able to do so quickly; and you just verify that the record does match cat1, cat2 and item. Experiment with page cache settings and be sure to send out ETag (maybe based on the ID?) and Last-Modified headers, as well as checking the If-Modified-Since and If-None-Match request headers.
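To see the validator behaviour from the client side, a quick sketch (the URL is the hypothetical one above; a server that honours If-None-Match should answer 304 Not Modified, which Invoke-WebRequest surfaces as an exception in Windows PowerShell):

```powershell
# Hedged sketch: demonstrate ETag-based conditional requests.
$uri = "http://www.example.com/cat1/cat2/item/12345"   # hypothetical URL
$first = Invoke-WebRequest -Uri $uri
$etag = $first.Headers["ETag"]

try {
    Invoke-WebRequest -Uri $uri -Headers @{ "If-None-Match" = $etag } | Out-Null
    Write-Host "Server returned fresh content"
} catch {
    Write-Host "Status: $($_.Exception.Response.StatusCode)"   # expect NotModified (304)
}
```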
What we have here is not a matter of "better" indexing but of relevancy.
And so, the first URL will mark your page as more relevant to the subject (assuming correlation between page/category name and subject matter).
For example: let's say we both want to rank for "Red Nike shoes", and say (for simplicity's sake) that we both got the same "score" on all SEO factors except for the URL.
In the first case the URL can be http://www.example.com/app/servlet/shoes/nike/red-nike
and in the second, http://www.example.com/app/servlet?itemid=12345.
Just by looking at both strings you can intuitively sense which one is more relevant...
The first one tells you up-front "Heck yes, I'm all about Red Nike Shoes" while the second one kinda mumbles "Red Nike Shoes? Did you mean item code 12345?"
Also, having part of the keyword in the URL will help you get more relevancy, and it can also help you win "long-tail" searches without much work (just having the keyword in the URL can sometimes be enough).
But the issue goes even deeper.
The second type of URL includes parameters, and those can (and 99.9% will) lead to duplicate-content issues. When using parameters you'll have to deal with questions like:
- What happens for a non-existent catid?
- Is there parameter verification (and how foolproof is it)?
- etc.
So why choose the second version? Because sometimes you just don't have a choice... :)