Read word document (*.doc) content with tables etc - powershell

I have a word document (2003). I am using Powershell to parse the content of the document.
The document contains a few lines of text at the top, a dozen tables with differing number of columns and then some more text.
I expect to be able to read the document as something like the below:
Read document (make necessary objects etc)
Get each line of text
    If not part of a table, process as text and Write-Output
    else
        If part of a table
            Get table number (by order) and parse output based on columns
    end if
Below is the powershell script that I have begun to write:
$objWord = New-Object -Com Word.Application
$objWord.Visible = $false
$objDocument = $objWord.Documents.Open($filename)
$paras = $objDocument.Paragraphs
foreach ($para in $paras)
{
    Write-Output $para.Range.Text
}
I am not sure if Paragraphs is what I want. Is there anything more suitable for my purpose?
All I am getting now is the entire content of the document. How do I control what I get? For example, I want to get a line, determine whether it is part of a table, and take an action based on which table (by number) it belongs to.

You can enumerate the tables in a Word document via the Tables collection. The Rows and Columns properties will allow you to determine the number of rows/columns in a given table. Individual cells can be accessed via the table's Cell(row, column) method.
Example that will print the value of the cell in the last row and last column of each table in the document:
$wd = New-Object -ComObject Word.Application
$wd.Visible = $true
$doc = $wd.Documents.Open($filename)
$doc.Tables | ForEach-Object {
    $_.Cell($_.Rows.Count, $_.Columns.Count).Range.Text
}
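To tie this back to the "table number" part of the question, here is a minimal sketch that walks the tables by index and emits every cell together with its table, row and column position. The helper names Remove-CellMarker and Get-WordTableCell are made up for illustration, and the code assumes Word is installed and that the tables are top-level (not nested).

```powershell
# Word ends each cell's text with a carriage return (13) plus an
# end-of-cell marker (7); strip both before using the value.
function Remove-CellMarker {
    param([string]$Text)
    $Text.TrimEnd([char]13, [char]7)
}

# Hypothetical helper: emits one object per table cell, tagged with the
# table's position in the document. Needs Word installed.
function Get-WordTableCell {
    param([Parameter(Mandatory)][string]$Path)

    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        $doc = $word.Documents.Open($Path)
        for ($t = 1; $t -le $doc.Tables.Count; $t++) {
            # Range.Cells walks every cell of the table, so rows with
            # differing numbers of columns are covered as well
            foreach ($cell in $doc.Tables.Item($t).Range.Cells) {
                [pscustomobject]@{
                    Table  = $t
                    Row    = $cell.RowIndex
                    Column = $cell.ColumnIndex
                    Text   = Remove-CellMarker $cell.Range.Text
                }
            }
        }
        $doc.Close()
    }
    finally {
        # always shut down the hidden Word instance
        $word.Quit()
    }
}
```

Something like Get-WordTableCell -Path 'C:\temp\report.doc' | Where-Object Table -eq 3 (the path is a placeholder) would then give only the cells of the third table. For the text outside the tables you can still loop over Paragraphs and skip the ones where $para.Range.Information(12) is true; 12 is the value of wdWithInTable.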

PowerShell - How can I select the last row of the last table in a Word document, copy it, and insert as a new row at the bottom of the table?

I am trying to increase the level of automation within my day-to-day work, part of which involves adding a line to the end of a table within a report that contains largely the same information, with a few cells changed (new dates).
I have a little experience with VB and C++, but I am very much an amateur when it comes to PowerShell, which seems to be the go-to for task automation.
I have a couple of PowerShell scripts that search through the body of the report and change text, but the last part of the report is a record, and needs appending as opposed to amending.
How would I go about this?
I have tried mangling a few bits of PowerShell code I've found online, to no avail. I have gotten as far as selecting a row in the correct table, but I have no idea how I might then select the last row, copy this and insert it beneath as a new row at the bottom of the table:
$objWord = New-Object -ComObject word.application
$objWord.Visible = $True
$objWord.Documents.Open("FILEPATH")
$FindText = "KEYTEXT"
$objWord.Selection.Find.Execute($FindText)
$objWord.Selection.SelectRow()
Here's a lengthy example with explanations that does what you're looking for:
# open document
$objWord = New-Object -ComObject word.application
$objWord.Visible = $True
$doc = $objWord.Documents.Open("C:\temp\temp.docx")
# search for row
$FindText = "Value2"
$result = $objWord.Selection.Find.Execute($FindText)
$objWord.Selection.SelectRow()
# copy the ranges from searched row
$table = $objWord.Selection.Tables[1]
$copiedCells = $objWord.Selection.Cells | select columnindex,rowindex
# add row at the end of table
$table.Rows[$table.Rows.Count].Select()
$objWord.Selection.InsertRowsBelow(1)
# insert copied text into each column of last row
foreach ($cell in $copiedCells) {
    # copy value from cell in copied row
    $copiedText = $table.Cell($cell.rowindex, $cell.columnindex).Range.Text
    # remove last 2 characters (paragraph and end-of-cell)
    $TrimmedText = $copiedText.Remove($copiedText.Length - 2)
    # set value for cell in last row
    $table.Cell($table.Rows.Count,$cell.columnindex).Range.Text = $TrimmedText
}
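If the row you want to copy is always the last one, a shorter variant that avoids the Selection object might look like the sketch below. Copy-LastTableRow is a made-up helper name; it assumes $Table is a Word Table object (for example $doc.Tables.Item(1)) and relies on Rows.Last and on Rows.Add() appending at the end when called without an argument. It copies text only, not per-cell formatting.

```powershell
# Hypothetical helper: copies the text of the last row into a new row
# appended to the same table. $Table is a Word Table COM object.
function Copy-LastTableRow {
    param([Parameter(Mandatory)]$Table)

    $source = $Table.Rows.Last      # current last row
    $new    = $Table.Rows.Add()     # no argument: row is appended at the end

    for ($c = 1; $c -le $source.Cells.Count; $c++) {
        # drop the trailing paragraph mark (13) and end-of-cell marker (7)
        $text = $source.Cells.Item($c).Range.Text.TrimEnd([char]13, [char]7)
        $new.Cells.Item($c).Range.Text = $text
    }

    # return the new row so the caller can change individual cells
    $new
}
```

The returned row can then be adjusted, e.g. $row = Copy-LastTableRow -Table $doc.Tables.Item(1) followed by $row.Cells.Item(2).Range.Text = (Get-Date).ToShortDateString() for a new date in the second column (the table and column numbers are just examples).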

MS Word's range arithmetic in Powershell

I'm trying to automate adding new tables to a Word document using Powershell.
I wrote a Powershell script that adds summary tables to a document in the proper locations. It gathers the information from the file contents and then creates new summary tables in the selected ranges. A table is always inserted at the end of its range (which is a chapter in the document). The range is based on the list headers. However, when adding a new table to the selected range, I cannot make Word keep the next header, which is chosen as the end of the range. It gets deleted.
For example: I have chapters 1.1 to 1.10 in my file and I choose to add a new table at the end of chapter 1.1, right before chapter 1.2. The whole chapter 1.2 header is deleted and chapter 1.3 is now labeled 1.2.
I tried subtracting various numbers from the Range.End property, following the Microsoft documentation (https://learn.microsoft.com/en-us/office/vba/api/word.range.end), however it doesn't seem to have any effect.
The code (abridged):
Add-Type -AssemblyName Microsoft.Office.Interop.Word
$word = New-Object -ComObject Word.application
$report = $word.Documents.Open("C:\file.docx")
#information gathering here
#find the right location and add a new table
$start = $report.Paragraphs | ? {$_.Range.ListFormat.ListString -eq '1.1'} | % {$_.Range}
$end = $report.Paragraphs | ? {$_.Range.ListFormat.ListString -eq '1.2'} | % {$_.Range}
$first_table = $report.Range($start.Start, $end.Start).Tables.Add($end, 24, 4, [ref]$DefaultTableBehavior::wdWord9TableBehavior, [ref]$AutoFitBehavior::wdAutoFitFixed)
#continue with filling up the table
Part of the problem is that if you have a sequence of empty auto-numbered paragraphs, inserting a table into one of them will mean that subsequent paragraphs will be placed inside the table. Note also that Tables.Add replaces the range you pass it unless that range is collapsed, and $end covers the whole 1.2 heading paragraph.
Also, if your 1.1 section can contain material, AFAICS your $start will contain only the range of its first paragraph, which isn't very useful. All you really need is the location of the paragraph you want to insert before.
As an alternative, I suggest that you start by inserting a paragraph mark immediately before the 1.2 heading, then insert the table.
e.g. like this:
#$start = $report.Paragraphs | ? {$_.Range.ListFormat.ListString -eq '1.1'} | % {$_.Range}
$end = $report.Paragraphs | ? {$_.Range.ListFormat.ListString -eq '1.2'} | % {$_.Range}
$place = $report.Range($end.Start - 1, $end.Start-1)
$place.InsertParagraph()
$place = $report.Range($end.Start - 1, $end.Start-1)
$first_table = $place.Tables.Add($place, 24, 4, [ref]$DefaultTableBehavior::wdWord9TableBehavior, [ref]$AutoFitBehavior::wdAutoFitFixed)
If you don't want that extra paragraph mark, you can probably delete it, and if you really want, you could use
$report.Range($end.Start-1, $end.Start).Text = ""
after the table insertion.

Get unique values

We're trying to optimize some code that removes duplicates from an Array as fast as possible. Normally this can be easily done by piping the input to Group-Object and then using only the Name property. But we would like to avoid the pipeline, as it is slower.
However, we tried the following code:
[System.Collections.ArrayList]$uniqueFrom = @()
$From = @('A', 'A', 'B')
$From.Where({-not ($uniqueFrom.Contains($_))}).ForEach({
    $uniqueFrom.Add($_)
})
$uniqueFrom
In theory, this should work. But for one reason or another the output is not the expected @('A', 'B'). Why is it not reevaluating the ArrayList in the .where clause?
In my experience, the 'pipe filtering' needed to get unique values can be reduced by using a DataView. If you are processing an array, you need to convert it to a DataTable first before you can get the values through the DataView.
e.g.
$arr = @('val1','val1','val1','val2','val1','val3'....)
$newDatatable = New-Object System.Data.Datatable
[void]$newDatatable.Columns.Add("FetchUniqueColumn")
foreach($e in $arr)
{
    $row = $newDatatable.NewRow()
    $row.Item('FetchUniqueColumn') = $e
    $newDatatable.Rows.Add($row)
}
$filterDataView = New-Object System.Data.Dataview($newDatatable)
$UniqueDT = $filterDataView.ToTable($true,'FetchUniqueColumn')
$UniqueValues_array = $UniqueDT.Rows.FetchUniqueColumn
Note that this is a whole lot faster if your input is already a DataTable, since you don't have to convert it first. Passing $true as the first argument of ToTable is what requests distinct rows when $UniqueDT is created from the DataView:
$UniqueDT = $filterDataView.ToTable($true,'FetchUniqueColumn')
Tested with a 1-column, 3000-row DataTable queried from SQL.
My results are as follows:
With a 1-column DataTable as input:
Select -Unique - 300 ms
Using DataView - 21 ms
With an @() array as input (SQL results converted to an array prior to benchmarking):
Select -Unique - 262 ms
Using DataView - 106 ms
Disclaimer: in this answer I'm just explaining why the current code isn't working, not attempting to give an alternative solution. For a solution, check the accepted answer.
Why is it not reevaluating the ArrayList in the .where clause?
It's not supposed to. The expression first runs the filter over the whole input:
$From.Where({-not ($uniqueFrom.Contains($_))})
and only afterwards executes
$uniqueFrom.Add($_)
for each element that passed. Because you initialized
[System.Collections.ArrayList]$uniqueFrom = @()
the list is still empty while the filter runs, so $uniqueFrom.Contains($_) returns $false for every element and nothing is filtered out.
Proof:
To verify that what I've written above is true you can do the following:
[System.Collections.ArrayList]$uniqueFrom = @()
$uniqueFrom.add("A")
$From.Where({-not ($uniqueFrom.Contains($_))}).ForEach({
    $uniqueFrom.Add($_)
})
As expected, $uniqueFrom now contains A, B: A was added manually, both A elements were filtered out because that entry already existed in $uniqueFrom, and B was added inside ForEach.
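To make the evaluation order concrete, and to show one pipeline-free way around it, here is a small sketch. The HashSet loop is my own suggestion rather than the accepted answer's code, and the variable names $list, $seen and $unique are made up for the example.

```powershell
$From = @('A', 'A', 'B')

# .Where() finishes filtering the whole array before .ForEach() starts,
# so the list is still empty every time Contains() is called.
$list = [System.Collections.ArrayList]@()
$From.Where({ -not $list.Contains($_) }).ForEach({ [void]$list.Add($_) })
# $list now holds A, A, B: nothing was filtered out

# Single pass instead: HashSet.Add() returns $false when the value is
# already present, so test and insert happen in the same step.
$seen = [System.Collections.Generic.HashSet[string]]::new()
$unique = @(foreach ($item in $From) {
    if ($seen.Add($item)) { $item }
})
# $unique now holds A, B
```

Keep in mind that HashSet[string] compares case-sensitively by default, unlike most PowerShell comparisons; pass [System.StringComparer]::OrdinalIgnoreCase to the constructor if 'a' and 'A' should count as the same value.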

Powershell GUI Imported text file missing newline/carriage return

I have a powershell GUI which imports a text file and displays it in a textbox when a button is clicked.
But even though the text file contains one entry per line, when it gets displayed in the textbox it is all on one line...
The text file looks like this-
But when I import it it looks like this-
This is the code I am using-
$button_hosts = New-Object system.windows.Forms.Button
$button_hosts.Text = "Hosts"
$button_hosts.Width = 60
$button_hosts.Height = 25
$button_hosts.location = new-object system.drawing.point(20,55)
$button_hosts.Font = "Microsoft Sans Serif,10"
$mydocs = [Environment]::GetFolderPath('MyDocuments')
$button_hosts.Add_Click({
    $textBox_hosts.Text = Get-Filename "$mydocs" txt
    $textBox_hostlist.Text = Get-Content $textBox_hosts.Text
})
$GUI.controls.Add($button_hosts)
Any idea how to get it to display the same? I can't add any extra data to the txt file, as it is output from another program.
Set the lines property, not the text property.
$textBox_hostlist.Lines = Get-Content $textBox_hosts.Text
Get-Content reads the content one line at a time and returns a collection of objects, each of which represents a line of content.
This means that you have to join the collection with carriage returns and linefeeds before assigning it to the Text property:
$textBox_hostlist.Text = (Get-Content $textBox_hosts.Text) -join "`r`n"
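Why everything ends up on one line can be reproduced without any GUI. The sketch below uses a throwaway sample file (the file name and host names are invented) to show what happens when the array returned by Get-Content is converted to a single string, and what the explicit join produces instead.

```powershell
# Hypothetical sample file standing in for the other program's output
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'hosts-sample.txt'
Set-Content -Path $path -Value 'host1', 'host2', 'host3'

$lines = Get-Content -Path $path      # string array, one element per line

# What assigning the array to a string property does implicitly:
# the elements are joined with a space (the default $OFS)
[string]$flattened = $lines           # host1 host2 host3

# Explicit join, suitable for the Text property of a multiline TextBox
$joined = $lines -join "`r`n"

Remove-Item -Path $path
```

So the line breaks are not missing from the file; they are lost when the array is turned into one string, which is why either the Lines property or an explicit join fixes the display.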
For your WinForms TextBox, do you have the multiline property set to true?
https://msdn.microsoft.com/en-us/library/12w624ff(v=vs.110).aspx
If not, it defaults to single line.

How can PowerShell be used on Microsoft Word to get the page number that a hyperlink is found on?

I can't seem to find many examples of PowerShell working with Microsoft Word. I've seen plenty on how to put a page number in a footer, but I don't quite grasp the PowerShell select method or object. I also reviewed an example of how to count the pages in a set of documents.
I've dug through quite a few books and only one of them really had anything to do with PowerShell and MS Word. Most gave only a few trivial Excel examples or showed how to create a Word document. I also noticed that Office 365 is the focus of one book and even of an online script building resource, but I could find nothing like that for Office 2013 and prior.
This is the script that I'm working with now which isn't really much to look at.
$objWord = New-Object -ComObject Word.Application;
$objWord.Visible = $false;
$objWord.DisplayAlerts = "wdAlertsNone";
# Create the selection object
$Selection = $objWord.Selection;
$document = $objWord.documents.open("C:\Path\To\Word\Document\file.docx");
$hyperlinks = @($document.Hyperlinks);
#loop through all links in the word document
$hyperlinks | ForEach {
    if($_.Address -ne $null)
    {
        # The character number where the hyperlink text starts
        $startCharNumber = $_.Range.Start;
        # The character number where the hyperlink text ends
        $endCharNumber = $_.Range.End;
        # Here is where to calculate which page number the $startCharNumber is found on. How exactly to do this?
        # For viewing purposes only. To be used to create a report or index.
        Write-Host "Text To Display: " $_.TextToDisplay " URL: " $_.Address " Page Num: " ;
    }
}
$objWord.quit();
You can use Information(wdActiveEndPageNumber) on a range to get the number of the page that contains the end of that range.
$word = [System.Runtime.InteropServices.Marshal]::GetActiveObject('Word.Application')
$wdActiveEndPageNumber = 3
$doc = $word.ActiveDocument
foreach ($h in $doc.Hyperlinks) {
    $page = $h.Range.Information($wdActiveEndPageNumber)
    echo "Page $page : $($h.Address)"
}
Edited following @bibadia's comment.
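The snippet above attaches to a Word instance that is already running. For the scenario in the question (opening a file in a hidden instance and building a report), a sketch along the following lines may be closer to what is needed. Get-WordHyperlinkPage is a made-up function name, Word has to be installed, and as far as I know Marshal.GetActiveObject is not available in PowerShell 7, which is another reason to create the COM object directly.

```powershell
# Hypothetical helper: lists every hyperlink that has an address together
# with the page its range ends on. Needs Word installed.
function Get-WordHyperlinkPage {
    param([Parameter(Mandatory)][string]$Path)

    $wdActiveEndPageNumber = 3    # value of the WdInformation constant

    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        $doc = $word.Documents.Open($Path)
        foreach ($link in $doc.Hyperlinks) {
            # skip links without an address, as the question's script does
            if ($link.Address) {
                [pscustomobject]@{
                    Page = $link.Range.Information($wdActiveEndPageNumber)
                    Text = $link.TextToDisplay
                    Url  = $link.Address
                }
            }
        }
        $doc.Close()
    }
    finally {
        # always shut down the hidden Word instance
        $word.Quit()
    }
}
```

Because the function emits objects rather than calling Write-Host, the result can be sorted or exported, e.g. Get-WordHyperlinkPage -Path 'C:\Path\To\Word\Document\file.docx' | Sort-Object Page | Export-Csv links.csv -NoTypeInformation.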