I'm trying to search for keywords across a large number of MS Word documents and write the results to a file. I've got a working script, but I wasn't aware of the scale, and what I've got isn't nearly efficient enough; it would take days to plod through everything.
The script as it stands takes keywords from CompareData.txt, runs each one through all the files in a specific folder, and appends the results to a file.
So when I'm done I will know how many files have each specific keyword.
[CmdletBinding()]
Param(
    $Path = "C:\willscratch\"
) #end param

$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($findText in $findTexts)
{
    $matchCase = $false
    $matchWholeWord = $true
    $matchWildCards = $false
    $matchSoundsLike = $false
    $matchAllWordForms = $false
    $forward = $true
    $wrap = 1

    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    $docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
    $i = 1
    $totaldocs = 0

    Foreach ($doc in $docs)
    {
        Write-Progress -Activity "Processing files" -Status "Processing $($doc.FullName)" -PercentComplete ($i / $docs.Count * 100)
        $document = $application.Documents.Open($doc.FullName)
        $range = $document.Content
        $null = $range.MoveStart()
        $wordFound = $range.Find.Execute($findText, $matchCase,
            $matchWholeWord, $matchWildCards, $matchSoundsLike,
            $matchAllWordForms, $forward, $wrap)
        if ($wordFound)
        {
            $doc.FullName
            $document.Words.Count
            $totaldocs++
        } #end if $wordFound
        $document.Close()
        $i++
    } #end foreach $doc

    $application.Quit()
    "There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt

    # clean up stuff
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
    Remove-Variable -Name application
    [gc]::Collect()
    [gc]::WaitForPendingFinalizers()
}
What I'd like to do is figure out a way to search each file for everything in CompareData.txt once, rather than iterate through it a bunch of times. If I was dealing with a small set of data, the approach I've got would get the job done - but I've come to find out that both the data in CompareData.txt and the source Word file directory will be very large.
Any ideas on how to optimize this?
Right now you're doing this (pseudocode):
foreach $Keyword {
    create Word Application
    foreach $File {
        load Word Document from $File
        find $Keyword
    }
}
That means that if you have 100 keywords and 10 documents, you're opening and closing 100 instances of Word and loading a thousand Word documents before you're done.
Do this instead:
create Word Application
foreach $File {
    load Word Document from $File
    foreach $Keyword {
        find $Keyword
    }
}
So you only launch one instance of Word and only load each document once.
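Here's a minimal sketch of that restructuring, reusing the variable names and the positional Find.Execute arguments from your script ($Path, the keyword file, and the output file are carried over from the question; adjust as needed):

# One Word instance for the whole run
$application = New-Object -ComObject Word.Application
$application.Visible = $false

$findTexts = Get-Content c:\scratch\CompareData.txt
$docs = Get-ChildItem -Path $Path -Recurse -Include *.docx

# Track a per-keyword total across all documents
$totals = @{}
foreach ($findText in $findTexts) { $totals[$findText] = 0 }

foreach ($doc in $docs) {
    # Each document is opened exactly once
    $document = $application.Documents.Open($doc.FullName)
    foreach ($findText in $findTexts) {
        # Grab a fresh content range per search, since Find.Execute moves the range
        $range = $document.Content
        $wordFound = $range.Find.Execute($findText, $false, $true, $false,
            $false, $false, $true, 1)
        if ($wordFound) { $totals[$findText]++ }
    }
    $document.Close()
}
$application.Quit()

$totals.GetEnumerator() |
    ForEach-Object { "There are $($_.Value) total files with $($_.Key)" } |
    Out-File -Append C:\scratch\output.txt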
As noted in the comments, you may optimize the whole process by using the OpenXML SDK, rather than launching Word:
(assuming you've installed OpenXML SDK in its default location)
# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'

# Grab the keywords and file names
$Keywords = Get-Content C:\scratch\CompareData.txt
$Documents = Get-ChildItem -Path $Path -Recurse -Include *.docx

# hashtable to store results per document
$KeywordMatches = @{}

# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

foreach ($Docx in $Documents)
{
    # create array to hold matched keywords
    $KeywordMatches[$Docx.FullName] = @()

    # open document, wrap content stream in streamreader
    $Document = $WordDoc::Open($Docx.FullName, $false)
    $DocumentStream = $Document.MainDocumentPart.GetStream()
    $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

    # read entire document
    $DocumentContent = $DocumentReader.ReadToEnd()

    # test for each keyword
    foreach ($Keyword in $Keywords)
    {
        $Pattern = [regex]::Escape($Keyword)
        $WordFound = $DocumentContent -match $Pattern
        if ($WordFound)
        {
            $KeywordMatches[$Docx.FullName] += $Keyword
        }
    }

    $DocumentReader.Dispose()
    $Document.Dispose()
}
Now you can show the number of matched keywords for each document:
$KeywordMatches.GetEnumerator() | Select-Object @{n='File';E={$_.Key}}, @{n='Count';E={$_.Value.Count}}
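If you also want per-keyword totals written to a text file, as the original script produced, you can invert the hashtable afterwards (the output path is carried over from the question):

# For each keyword, count how many documents matched it
foreach ($Keyword in $Keywords) {
    $count = @($KeywordMatches.Values | Where-Object { $_ -contains $Keyword }).Count
    "There are $count total files with $Keyword" | Out-File -Append C:\scratch\output.txt
}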
I'm extremely new to PowerShell (as in, this is the first solution I've tried to write). I'm trying to create a script that will change all fonts within a given Word document to Arial. So far I've composed this solution, which works for body text.
$WordExts = '.docx', '.doc', '.docm'
$Word = New-Object -ComObject Word.Application
$Word.Visible = $false
$folder = Get-ChildItem $PSScriptRoot | ? {$_.Extension -in $WordExts}
foreach ($file in $folder) {
    echo $file.FullName
    $worddoc = $Word.Documents.Open($file.FullName)
    $Selection = $Word.Selection
    $worddoc.Select()
    $Selection.Font.Name = "Arial"
    $worddoc.Close()
    $worddoc = $null
}
$Word.Quit()
$Word = $null
This changes the body text of the Word document as intended. However, it is not capturing or making changes to strings within text boxes or the table of contents. Which method do I need to use to target these shapes and change the text within them? I've been trying to look through the PowerShell documentation but have been unable to find the answer. Thanks in advance.
I started with the link in jonsson's comment and was able to figure out how to iterate through headers and footers.
Text boxes took some more work, since they aren't located in the same place. I asked my husband for help, and we used echo to look at what was in all those sections (and then in THOSE sections) until we found the right object. Also, while headers and footers are clearly defined, text boxes are scattered across several different sections, so we had to iterate through each section in the document to look for the textbox object ShapeRange. I also later added a step to remove document protection, because it was holding up the script.
$WordExts = '.docx', '.doc', '.docm'
$Word = New-Object -ComObject Word.Application
$Word.Visible = $false
$folder = Get-ChildItem $PSScriptRoot | ? {$_.Extension -in $WordExts}

# Remove document protection
ForEach ($file in $folder) {
    $file | Unblock-File
}

# Iterate through each file in the parent folder
foreach ($file in $folder) {
    echo $file.FullName
    $worddoc = $Word.Documents.Open($file.FullName)

    # Change the font in the main body of the document
    $Selection = $Word.Selection
    $worddoc.Select()
    $Selection.Font.Name = "Arial"

    # Iterate through each section to locate ShapeRange (textbox) to change font
    $target = $worddoc.Sections.Count
    $count = 1
    While ($count -lt ($target + 1)) {
        ForEach ($section in $worddoc.Sections($count).Range.ShapeRange) {
            ForEach ($item in $section.TextFrame) {
                ForEach ($i in $item.TextRange) {
                    $i.Font.Name = "Arial"
                }
            }
        }
        $count++
    }

    # Iterate through each header and footer to change font
    ForEach ($section in $worddoc.Sections) {
        ForEach ($header in $section.Headers) {
            $header.Range.Font.Name = "Arial"
        }
        ForEach ($footer in $section.Footers) {
            $footer.Range.Font.Name = "Arial"
        }
    }

    $worddoc.Close()
    $worddoc = $null
}
$Word.Quit()
$Word = $null
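One caveat: assigning $null to $Word doesn't actually release the COM object, so a WINWORD.EXE process can linger after the script ends. A short cleanup sketch, borrowing the release pattern from the keyword-search script earlier, to run after $Word.Quit() in place of the $null assignment:

# Release the COM reference so the Word process can actually exit
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($Word) | Out-Null
Remove-Variable -Name Word
[gc]::Collect()
[gc]::WaitForPendingFinalizers()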
Problem: I need to search a large log file that is currently being used by another process. I cannot stop this other process or put a lock on the .log file. I need to quickly search this file, and I can't read it all into memory. I get that StreamReader() is the fastest, but I can't figure out how to avoid it attempting to grab a lock on the file.
$p = "Searchterm:Search"
$files = "\\remoteserver\c\temp\tryingtofigurethisout.log"
$SearchResult = Get-Content -Path $files | Where-Object { $_ -eq $p }
The below doesn't work because I can't get a lock on the file.
$reader = New-Object System.IO.StreamReader($files)
$lines = @()
if ($reader -ne $null) {
    while (!$reader.EndOfStream) {
        $line = $reader.ReadLine()
        if ($line.Contains($p)) {
            $lines += $line
        }
    }
}
$lines | Select-Object -Last 1
This takes too long:
Get-Content $files -ReadCount 500 |
    foreach { $_ -match $p }
I would greatly appreciate any pointers in how I can go about quickly and efficiently (memory wise) searching a large log file.
Perhaps this will work for you. It tries to read the lines of the file as fast as possible, but with a difference from your second approach (which is approximately the same as what [System.IO.File]::ReadAllLines() would do).
To collect the lines, I use a List object, which will perform faster than appending to an array using +=.
$p = "Searchterm:Search"
$path = "\\remoteserver\c$\temp\tryingtofigurethisout.log"

if (!(Test-Path -Path $path -PathType Leaf)) {
    Write-Warning "File '$path' does not exist"
}
else {
    try {
        $fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
        $streamReader = [System.IO.StreamReader]::new($fileStream)

        # or use this syntax:
        # $fileMode = [System.IO.FileMode]::Open
        # $fileAccess = [System.IO.FileAccess]::Read
        # $fileShare = [System.IO.FileShare]::ReadWrite
        # $fileStream = New-Object -TypeName System.IO.FileStream $path, $fileMode, $fileAccess, $fileShare
        # $streamReader = New-Object -TypeName System.IO.StreamReader -ArgumentList $fileStream

        # use a List object of type String or an ArrayList to collect the strings quickly
        $lines = New-Object System.Collections.Generic.List[string]

        # read the lines as fast as you can and add them to the list
        while (!$streamReader.EndOfStream) {
            $lines.Add($streamReader.ReadLine())
        }

        # close and dispose of the objects used
        $streamReader.Close()
        $fileStream.Dispose()

        # do the 'Contains($p)' after reading the file to not slow that part down
        $lines.ToArray() | Where-Object { $_.Contains($p) } | Select-Object -Last 1
    }
    catch [System.IO.IOException] {}
}
Basically, it does what your second code does, but with the difference that using just the StreamReader opens the file with [System.IO.FileShare]::Read, whereas this code opens the file with [System.IO.FileShare]::ReadWrite.
Note that exceptions may still be thrown because another application has write access to the file; hence the try {...} catch {...}.
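If holding every line in memory is a concern (the question says the log is large), a variation on the same idea filters while reading and keeps only the most recent match. This is just a sketch, assuming $p and $path are defined as above:

$fileStream = [System.IO.FileStream]::new($path, [System.IO.FileMode]::Open,
    [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
$streamReader = [System.IO.StreamReader]::new($fileStream)

# Keep only the last matching line instead of buffering the whole file
$lastMatch = $null
while (!$streamReader.EndOfStream) {
    $line = $streamReader.ReadLine()
    if ($line.Contains($p)) { $lastMatch = $line }
}

$streamReader.Close()
$fileStream.Dispose()
$lastMatch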
Hope that helps
SCRIPT PURPOSE
The idea behind the script is to recursively extract the text from a large number of documents and update a field in an Azure SQL database with the extracted text. Basically, we are moving away from Windows Search of document contents to SQL full-text search to improve speed.
ISSUE
When the script encounters an issue opening a file, such as it being password protected, it fails for every single document that follows. Here is the section of the script that processes the files:
foreach ($list in (Get-ChildItem (Join-Path $PSScriptRoot "\FileLists\*") -Include *.txt)) {
    ## Word object
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
    $word.DisplayAlerts = 0

    Write-Output ""
    Write-Output "################# Parsing $list"
    Write-Output ""

    $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
    VALUES "

    foreach ($file in (Get-Content $list)) {
        if ($file -like "*-*" -and $file -notlike "*~*") {
            Write-Output "Processing: $($file)"
            Try {
                $doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
                if ($doc) {
                    $fileName = [io.path]::GetFileNameWithoutExtension($file)
                    $fileName = $filename + ".txt"
                    $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
                    $doc.Close()
                    $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
                    $text = Get-Content -Raw "$env:TEMP\$fileName"
                    $text = $text.replace("'", "''")
                    $query += "
                    ('$text', $4ID),"
                    Remove-Item -Force "$env:TEMP\$fileName"

                    <# Upload to azure #>
                    $query = $query.Substring(0, $query.Length - 1)
                    $query += ";"
                    Invoke-Sqlcmd @params -Query $query -ErrorAction "SilentlyContinue"
                    $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
                    VALUES "
                }
            }
            Catch {
                Write-Host "$($file) failed to process" -ForegroundColor Red
                continue
            }
        }
    }

    Remove-Item -Force $list.FullName

    Write-Output ""
    Write-Output "Uploading to azure"
    Write-Output ""

    <# Upload to azure #>
    Invoke-Sqlcmd @params -Query $setQuery -ErrorAction "SilentlyContinue"
    $word.Quit()
    TASKKILL /F /IM WINWORD.EXE
}
Basically, it parses through a folder of .txt files that each contain a batch of document paths, builds a T-SQL INSERT statement, and runs it against an Azure SQL database after each list is fully parsed. The lists are generated with the following:
if (!($continue)) {
    if ($pdf) {
        $files = (Get-ChildItem -Force -Recurse $documentFolder -Include *.pdf).FullName
    }
    else {
        $files = (Get-ChildItem -Force -Recurse $documentFolder -Include *.doc, *.docx).FullName
    }
    $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
    $i = 0; Get-Content $documentFile -ReadCount $interval | % { $i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt") }
}
The $interval variable defines how many files are extracted for each upload to Azure. Initially I had the Word object created outside the loop and never closed until the end. Unfortunately this doesn't seem to work: every time the script hits a file it cannot open, every file that follows fails until it reaches the end of the inner foreach loop, foreach ($file in (Get-Content $list)) {.
This means that to get the expected outcome I have to run this with an interval of 1, which takes far too long.
This is a shot in the dark.
But to me it sounds like it's failing because the Word COM object is prompting you for some action, since it cannot open the file, so all following items in the loop also fail. This might explain why it works if you set $Interval to 1: the COM object is closed and reopened every time, which takes forever (I ran into the same thing with Excel).
What you can do is, in your catch statement, close and open a new Word COM object, which should let you continue with the loop (though it will be a bit slower if it needs to reopen the COM object a lot).
If you want to debug the problem further, set the COM object to be visible and slowly step through your program without interacting with Word. This will show you what is happening in Word and whether any prompts are causing the application to hang.
Of course, if you want to run it at full speed, you will need to detect beforehand which documents you can't open, or you could multithread it by opening several Word COM objects, which would allow you to load several documents at a time.
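Here's a rough sketch of that catch-and-recreate idea, using the variable names from the question's script; treat it as an outline rather than a drop-in replacement:

Catch {
    Write-Host "$($file) failed to process" -ForegroundColor Red
    # Tear down the (possibly hung) Word instance and start a fresh one
    try { $word.Quit() } catch { }
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($word) | Out-Null
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $word.DisplayAlerts = 0
    continue
}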
As for...
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows.
... then test for this as noted here...
How to check if a word file has a password?
$filename = "C:\path\to\your.doc"
$wd = New-Object -COM "Word.Application"
try {
    $doc = $wd.Documents.Open($filename, $null, $null, $null, "")
} catch {
    Write-Host "$filename is password-protected!"
}
... and skip the file to avoid the failure of the remaining files.
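Folded into the question's inner loop, that check might look something like this sketch (the empty string passed as the password argument is the trick from the linked answer: it makes protected files throw immediately so they can be skipped):

foreach ($file in (Get-Content $list)) {
    $doc = $null
    try {
        # A deliberately wrong password ("") makes password-protected files fail fast here
        $doc = $word.Documents.Open($file, $null, $null, $null, "")
    }
    catch {
        Write-Host "$file is password-protected or unreadable - skipping" -ForegroundColor Red
        continue
    }
    # ... existing SaveAs / text extraction / query building goes here ...
    $doc.Close()
}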
TASK
Extract text from .doc, .docx and .pdf files and upload the content to an Azure SQL database. It needs to be fast, as it's running over millions of documents.
ISSUES
The script starts to fail if one of the documents has an issue. Some that I have come across are:
This file failed to open last time you tried - Open readonly
File is corrupt
SCRIPT
First I generate lists of files, each containing 100 file paths. This is so I can continue execution if I need to stop it and/or it errors out:
## Word object
if (!($continue)) {
    $files = (Get-ChildItem -Force -Recurse $documentFolder -Include *.doc, *.docx).FullName
    $files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
    $i = 0; Get-Content $documentFile -ReadCount 100 | % { $i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt") }
}
Then I create the COM object with the DisplayAlerts flag set to 0 (I thought this would fix it; it didn't):
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
After this, I loop through each file in each list, save the file as .txt to the temp folder, extract the text, and generate a SQL INSERT statement:
foreach ($file in (Get-Content $list)) {
    Try {
        if ($file -like "*-*") {
            Write-Output "Processing: $($file)"
            $doc = $word.Documents.Open($file)
            $fileName = [io.path]::GetFileNameWithoutExtension($file)
            $fileName = $filename + ".txt"
            $doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
            $doc.Close()
            $4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
            $text = Get-Content -Raw "$env:TEMP\$fileName"
            $text = $text.replace("'", "")
            $query += "
            ('$text', $4ID),"
            Remove-Item -Force "$env:TEMP\$fileName"
        }

        <# Upload to azure #>
        $query = $query.Substring(0, $query.Length - 1)
        $query += ";"
        $params = @{
            'Database'       = $TRIS5DATABASENAME
            'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
            'Username'       = $AdminLogin
            'Password'       = $InsecurePassword
            'query'          = $query
        }
        Invoke-Sqlcmd @params -ErrorAction "continue"
        $query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
        VALUES "
    }
    Catch {
        Write-Host "$($file) failed to process" -ForegroundColor Red
    }
}
Remove-Item -Force $list.FullName
ISSUES
As stated above, if something is wrong with one of the files, or a document failed to open properly on a previous run, the script starts failing. Everything in the loop throws errors, starting with:
You cannot call a method on a null-valued expression.
At D:\OneDrive\Scripts\Microsoft Cloud\CachedText-Extraction\CachedText-Extraction.ps1:226 char:13
+ $doc = $word.Documents.Open($file)
Basically, what I want is a way to stop those errors from appearing by simply skipping the file if there is an error with the document. Alternatively, if there is a better way to extract text from document files using PowerShell without Word, that would be good too.
An example of one of the error messages is a Word dialog (screenshot not reproduced here) that causes the file to be locked and execution to pause. The only way to get around it is to kill Word, which then causes the rest of the script to fail.
First time posting, but I think this is a good one, as I've spent two days researching, talked with local experts, and still haven't found this done.
Individual print jobs must be regularly initiated on a large set of files (.txt files), and each must be converted through the print job to a local file (i.e. through a PDF printer) that retains the original base name. Further, the script must be highly portable.
The objective will not be met if the file is simply converted (and not printed), the original base file name is not retained, or the print process requires manual interaction at each print.
After my research, this is what I have so far in PowerShell:
PROBLEM: This script does everything but actually print the contents of the file.
It iterates through the files and "prints" a .pdf while retaining the original file name base, but the .pdf is empty.
I know I'm missing something critical (maybe a stream use?), but after searching and searching I have not been able to find it. Any help is greatly appreciated.
As mentioned in the code, the heart of the print function is gathered from this post:
# The heart of this script (ConvertTo-PDF) is largely taken and slightly modified from https://social.technet.microsoft.com/Forums/ie/en-US/04ddfe8c-a07f-4d9b-afd6-04b147f59e28/automating-printing-to-pdf?forum=winserverpowershell
# The $OutputFolder variable can be disregarded at the moment. It is an added bonus, and a work in progress, but not critical to the objective.
function ConvertTo-PDF {
    param(
        $TextDocumentPath, $OutputFolder
    )
    Write-Host "TextDocumentPath = $TextDocumentPath"
    Write-Host "OutputFolder = $OutputFolder"

    Add-Type -AssemblyName System.Drawing
    $doc = New-Object System.Drawing.Printing.PrintDocument
    $doc.DocumentName = $TextDocumentPath
    $doc.PrinterSettings = New-Object System.Drawing.Printing.PrinterSettings
    $doc.PrinterSettings.PrinterName = 'Microsoft Print to PDF'
    $doc.PrinterSettings.PrintToFile = $true

    $file = [io.fileinfo]$TextDocumentPath
    Write-Host "file = $file"
    $pdf = [io.path]::Combine($file.DirectoryName, $file.BaseName) + '.pdf'
    Write-Host "pdf = $pdf"
    $doc.PrinterSettings.PrintFileName = $pdf
    $doc.Print()
    Write-Host "Attempted Print: $pdf"
    $doc.Dispose()
}
# get the relative path of the TestFiles and OutputFolder folders.
$scriptPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
Write-Host "scriptPath = $scriptPath"
$TestFileFolder = "$scriptPath\TestFiles\"
Write-Host "TestFileFolder = $TestFileFolder"
$OutputFolder = "$scriptPath\OutputFolder\"
Write-Host "OutputFolder = $OutputFolder"

# initialize the files variable with the content of the TestFiles folder (relative to the script location).
$files = Get-ChildItem -Path $TestFileFolder

# Send each test file to the print job
foreach ($testFile in $files)
{
    $testFile = "$TestFileFolder$testFile"
    Write-Host "Attempting Print from: $testFile"
    Write-Host "Attempting Print to  : $OutputFolder"
    ConvertTo-PDF $testFile $OutputFolder
}
You are missing a handler that reads the text file and passes the text to the printer. It is defined as a scriptblock like this:
$PrintPageHandler =
{
param([object]$sender, [System.Drawing.Printing.PrintPageEventArgs]$ev)
# More code here - see below for details
}
and is added to the PrintDocument object like this:
$doc.add_PrintPage($PrintPageHandler)
The full code that you need is below:
$PrintPageHandler =
{
    param([object]$sender, [System.Drawing.Printing.PrintPageEventArgs]$ev)

    $linesPerPage = 0
    $yPos = 0
    $count = 0
    $leftMargin = $ev.MarginBounds.Left
    $topMargin = $ev.MarginBounds.Top
    $line = $null
    $printFont = New-Object System.Drawing.Font "Arial", 10

    # Calculate the number of lines per page.
    $linesPerPage = $ev.MarginBounds.Height / $printFont.GetHeight($ev.Graphics)

    # Print each line of the file.
    while ($count -lt $linesPerPage -and (($line = $streamToPrint.ReadLine()) -ne $null))
    {
        $yPos = $topMargin + ($count * $printFont.GetHeight($ev.Graphics))
        $ev.Graphics.DrawString($line, $printFont, [System.Drawing.Brushes]::Black, $leftMargin, $yPos, (New-Object System.Drawing.StringFormat))
        $count++
    }

    # If more lines exist, print another page.
    if ($line -ne $null)
    {
        $ev.HasMorePages = $true
    }
    else
    {
        $ev.HasMorePages = $false
    }
}
function Out-Pdf
{
    param($InputDocument, $OutputFolder)

    Add-Type -AssemblyName System.Drawing

    $doc = New-Object System.Drawing.Printing.PrintDocument
    $doc.DocumentName = $InputDocument.FullName
    $doc.PrinterSettings = New-Object System.Drawing.Printing.PrinterSettings
    $doc.PrinterSettings.PrinterName = 'Microsoft Print to PDF'
    $doc.PrinterSettings.PrintToFile = $true

    $streamToPrint = New-Object System.IO.StreamReader $InputDocument.FullName
    $doc.add_PrintPage($PrintPageHandler)
    $doc.PrinterSettings.PrintFileName = "$($InputDocument.DirectoryName)\$($InputDocument.BaseName).pdf"
    $doc.Print()
    $streamToPrint.Close()
}

Get-ChildItem -Path "$PSScriptRoot\TextFiles" -File -Filter "*.txt" |
    ForEach-Object { Out-Pdf $_ $_.Directory }
Incidentally, this is based on the official Microsoft C# example here:
PrintDocument Class