Extracting text from Word documents - PowerShell

TASK
Extract text from .doc, .docx and .pdf files and upload the content to an Azure SQL database. It needs to be fast, as it's running over millions of documents.
ISSUES
The script starts to fail if one of the documents has an issue. Some that I have come across are:
This file failed to open last time you tried - Open readonly
File is corrupt
SCRIPT
First, I generate file lists of 100 file paths each. This is so I can continue execution if I need to stop it and/or it errors out:
## Word object
if (!($continue)) {
$files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
$files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
$i=0; Get-Content $documentFile -ReadCount 100 | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
}
Then I create the COM object with the DisplayAlerts flag set to 0 (I thought this would fix it; it didn't):
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
After this, I loop through each file in each list, save the file as .txt to the temp folder, extract the text, and generate a SQL INSERT statement:
foreach ($file in (Get-Content $list)) {
Try {
if ($file -like "*-*") {
Write-Output "Processing: $($file)"
$doc = $word.Documents.Open($file)
$fileName = [io.path]::GetFileNameWithoutExtension($file)
$fileName = $filename + ".txt"
$doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
$doc.Close()
$4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
$text = Get-Content -raw "$env:TEMP\$fileName"
$text = $text.replace("'", "")
$query += "
('$text', $4ID),"
Remove-Item -Force "$env:TEMP\$fileName"
}
<# Upload to azure #>
$query = $query.Substring(0,$query.Length-1)
$query += ";"
$params = @{
'Database' = $TRIS5DATABASENAME
'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
'Username' = $AdminLogin
'Password' = $InsecurePassword
'query' = $query
}
Invoke-Sqlcmd @params -ErrorAction "continue"
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
}
Catch {
Write-Host "$($file) failed to process" -ForegroundColor RED;
}
}
Remove-Item -Force $list.FullName
ISSUES
As stated above, if something is wrong with one of the files, or a document failed to open properly on a previous run, the script starts failing. Everything in the loop then throws errors, starting with:
You cannot call a method on a null-valued expression.
At D:\OneDrive\Scripts\Microsoft Cloud\CachedText-Extraction\CachedText-Extraction.ps1:226 char:13
+ $doc = $word.Documents.Open($file)
Basically, what I want is a way to stop those errors from appearing by simply skipping a file if there is an error opening the document. Alternatively, if there is a better way to extract text from document files using PowerShell without using Word, that would be good too.
An example of one of the error messages (the dialog screenshot is not reproduced here): it causes the file to be locked and execution to pause. The only way to get around it is to kill Word, which then causes the rest of the script to fail.
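A minimal, untested sketch of a per-file guard that would skip problem documents (OpenNoRepairDialog with a dummy password appears in a later revision of this script below and makes protected files throw instead of prompting; the rest is illustrative):
foreach ($file in (Get-Content $list)) {
    $doc = $null
    Try {
        # Dummy password "ttt" makes password-protected files fail fast instead of prompting
        $doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
    }
    Catch {
        Write-Warning "Skipping $($file): $($_.Exception.Message)"
        continue   # move on to the next file instead of cascading failures
    }
    if (!$doc) { continue }   # guard against a null document object
    # ... SaveAs to .txt, build the INSERT, $doc.Close(), as above ...
}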

Print pdf files on different printers depending on their content

I want to print .pdf files on different printers, depending on their content.
How can I check whether a specific single word is present in a file?
To work through a folder's contents I've built the following so far:
Unblock-File -Path S:\test\itextsharp.dll
Add-Type -Path S:\test\itextsharp.dll
$files = Get-ChildItem S:\test\*.pdf
$adobe='C:\Program Files (x86)\Adobe\Acrobat DC\Acrobat\Acrobat.exe'
foreach ($file in $files) {
$reader = [iTextSharp.text.pdf.parser.PdfTextExtractor]
$Extract = $reader::GetTextFromPage($File.FullName,1)
if ($Extract -Contains 'Lieferschein') {
Write-Host -ForegroundColor Yellow "Lieferschein"
$printername='XX1'
$drivername='XX1'
$portname='192.168.X.41'
} else {
Write-Host -ForegroundColor Yellow "Etikett"
$printername='XX2'
$drivername='XX2'
$portname='192.168.X.42'
}
$arglist = '/S /T "' + $file.FullName + '" "' + $printername + '" "' + $drivername + '" "' + $portname + '"'
start-process $adobe -argumentlist $arglist -wait
Start-Sleep -Seconds 15
Remove-Item $file.FullName
}
And for now I got 2 problems with it:
1st: Add-Type -Path itextsharp.dll gives me an error.
Add-Type: One or more types in the assembly cannot be loaded. Get the LoaderExceptions property for more information. In line: 2 character: 1
I've read that it might be due to the file being blocked. There is no information about that in the properties, though, and the Unblock-File command at the start doesn't change/solve anything.
After using $error[0].exception.loaderexceptions[0] I get the information that BouncyCastle.Crypto, Version=1.8.6.0 is missing. Unfortunately I can't find any sources for that yet.
2nd: Will if ($Extract -Contains 'Lieferschein') work as I intend? Will it check for the phrase after the Add-Type gets loaded successfully?
Alternatively: there's also the possibility of making it depend on the content's format. One type of file is DIN A4 sized, for example; the other one is smaller than that. If there's an easier way to check for that, you'd make me happy as well.
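(Side note on the page-size alternative: iTextSharp's PdfReader can report page dimensions, so a check along these lines might work. A rough, untested sketch; A4 is 595 x 842 points:)
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $file.FullName
$size = $reader.GetPageSize(1)   # dimensions of page 1, in points
$reader.Close()
if ([math]::Abs($size.Width - 595) -lt 5 -and [math]::Abs($size.Height - 842) -lt 5) {
    # DIN A4 -> 'Lieferschein' printer
} else {
    # smaller format -> 'Etikett' printer
}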
Thank you in advance!
Searching for a keyword in a PDF using PowerShell and iTextSharp.dll is a very common thing. You then just use your conditional logic to send the file to whatever printer you choose.
So, something like this should do:
Add-Type -Path 'C:\path_to_dll\itextsharp.dll'
$pdfs = Get-ChildItem 'C:\path_to_pdfs' -Filter '*.pdf'
$export = 'D:\Temp\PdfExport.csv'
$results = @()
$keywords = @('Keyword1')
foreach ($pdf in $pdfs)
{
"processing - $($pdf.FullName)"
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
$pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
foreach ($keyword in $keywords)
{
if ($pageText -match $keyword)
{
$response = @{
keyword = $keyword
file = $pdf.FullName
page = $page
}
$results += New-Object PSObject -Property $response
}
}
}
$reader.Close()
}
"`ndone"
$results |
Export-Csv $export -NoTypeInformation
Update
As per your comment, regarding your error.
Again, iTextSharp is a legacy library, and you really need to move to iText7.
Nonetheless, that is not a PowerShell code issue. It is an iTextSharp.dll missing dependency. Even with iText7, you need to ensure you have all the dependencies on your machine and properly loaded.
As noted in this SO Q&A:
How to use Itext7 in powershell V5, Exception when loading pdfWriter
1st:
After finding the correct version (1.8.6) on nuget.org, the Add-Type commands work perfectly. As expected, I didn't even need the unblock command, as the file was not marked as blocked in the properties. Now the script starts with:
Add-Type -Path 'c:\BouncyCastle.Crypto.dll'
Add-Type -Path 'c:\itextsharp.dll'
2nd:
Regarding the content check: I just had to replace -contains with -match in my if clause:
if ($Extract -match 'Lieferschein')
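(The underlying reason: -contains tests collection membership with whole-element equality, so against a single string it only matches the exact full string, while -match does a regex/substring match. A quick illustration:)
'Lieferschein Nr. 4711' -contains 'Lieferschein'   # False: the whole string is compared
'Lieferschein Nr. 4711' -match 'Lieferschein'      # True: substring match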

Powershell: exclusive lock for a file during multiple set-content and get-content operations

I am trying to write a script which runs on multiple client machines and writes to a single text file on a network share.
I want to ensure that only one machine can manipulate the file at any one time, whilst the other machines run a loop to check if the file is available.
The script runs this first:
Set-Content -Path $PathToHost -Value (get-content -Path $PathToHost | Select-String -Pattern "$HostName " -NotMatch) -ErrorAction Stop
This removes some lines if they match the criteria. Then I want to append a new line with this:
Add-Content $PathToHost "$heartbeat$_" -ErrorAction Stop
The problem is that between the execution of those two commands another client has access to the file and begins to write to the file as well.
I have explored the solution here: Locking the file while writing in PowerShell
$PathToHost = "C:\file.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "None"
$file = [System.IO.File]::Open($PathToHost, $mode, $access, $share)
$file.close()
This can definitely lock the file, but I am not sure how to proceed to then read from and write to the file.
Any help is much appreciated.
EDIT: Solution as below thanks to twinlakes' answer
$path = "C:\Users\daniel_mladenov\hostsTEST.txt"
$mode = "Open"
$access = "ReadWrite"
$share = "none"
$file = [System.IO.File]::Open($path, $mode, $access, $share)
$fileread = [System.IO.StreamReader]::new($file, [Text.Encoding]::UTF8)
# Counts number of lines in file
$imax=0
while ($fileread.ReadLine() -ne $null){
$imax++
}
echo $imax
#resets read position to beginning
$fileread.basestream.position = 0
#reads content of whole file and discards matching lines
$content = @()
for ($i=0; $i -lt $imax; $i++){
$ContentLine = $fileread.ReadLine()
If($ContentLine -notmatch "$HostIP\s" -and $ContentLine -notmatch "$HostName\s"){
$content += $ContentLine
}
}
echo $content
#Writes remaining lines back to file
$filewrite = [System.IO.StreamWriter]::new($file)
$filewrite.basestream.position = 0
for ($i=0; $i -lt $content.length; $i++){
$filewrite.WriteLine($content[$i])
}
$filewrite.WriteLine($heartbeat)
$filewrite.Flush()
$file.SetLength($file.Position) #trims file to the content which has been written, discarding any content past that point
$file.close()
$file is a System.IO.FileStream object. You will need to call the write method on that object, which requires a byte array.
$string = "some text"  # the string to write to the file
$bytes = [Text.Encoding]::UTF8.GetBytes($string)
$file.Write($bytes, 0, $bytes.Length)
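(The question also mentions having the other machines loop until the file is available. A minimal sketch of that acquisition loop, assuming the same Open arguments as above; an IOException here signals that another machine currently holds the lock:)
$file = $null
while (-not $file) {
    try {
        $file = [System.IO.File]::Open($PathToHost, 'Open', 'ReadWrite', 'None')
    }
    catch [System.IO.IOException] {
        Start-Sleep -Milliseconds 500   # locked by another machine; wait and retry
    }
}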

Word com object failing

SCRIPT PURPOSE
The idea behind the script is to recursively extract the text from a large number of documents and update a field in an Azure SQL database with the extracted text. Basically, we are moving away from Windows Search of document contents to SQL full-text search to improve speed.
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows. Here is the section of the script that processes the files:
foreach ($list in (Get-ChildItem ( join-path $PSScriptRoot "\FileLists\*" ) -include *.txt )) {
## Word object
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")
$word.DisplayAlerts = 0
Write-Output ""
Write-Output "################# Parsing $list"
Write-Output ""
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
foreach ($file in (Get-Content $list)) {
if ($file -like "*-*" -and $file -notlike "*~*") {
Write-Output "Processing: $($file)"
Try {
$doc = $word.Documents.OpenNoRepairDialog($file, $false, $false, $false, "ttt")
if ($doc) {
$fileName = [io.path]::GetFileNameWithoutExtension($file)
$fileName = $filename + ".txt"
$doc.SaveAs("$env:TEMP\$fileName", [ref]$saveFormat)
$doc.Close()
$4ID = $fileName.split('-')[-1].replace(' ', '').replace(".txt", "")
$text = Get-Content -raw "$env:TEMP\$fileName"
$text = $text.replace("'", "''")
$query += "
('$text', $4ID),"
Remove-Item -Force "$env:TEMP\$fileName"
<# Upload to azure #>
$query = $query.Substring(0,$query.Length-1)
$query += ";"
Invoke-Sqlcmd @params -Query $Query -ErrorAction "SilentlyContinue"
$query = "INSERT INTO tmp_CachedText (tCachedText, tOID)
VALUES "
}
}
Catch {
Write-Host "$($file) failed to process" -ForegroundColor RED;
continue
}
}
}
Remove-Item -Force $list.FullName
Write-Output ""
Write-Output "Uploading to azure"
Write-Output ""
<# Upload to azure #>
Invoke-Sqlcmd @params -Query $setQuery -ErrorAction "SilentlyContinue"
$word.Quit()
TASKKILL /F /IM WINWORD.EXE
}
Basically, it parses through a folder of .txt files that each contain x document paths, creates a T-SQL INSERT statement, and runs it against an Azure SQL database after each file is fully parsed. The files are generated with the following:
if (!($continue)) {
if ($pdf){
$files = (Get-ChildItem -force -recurse $documentFolder -include *.pdf).fullname
}
else {
$files = (Get-ChildItem -force -recurse $documentFolder -include *.doc, *.docx).fullname
}
$files | Out-File (Join-Path $PSScriptRoot "\documents.txt")
$i=0; Get-Content $documentFile -ReadCount $interval | %{$i++; $_ | Out-File (Join-Path $PSScriptRoot "\FileLists\documents_$i.txt")}
}
The $interval variable defines how many files are extracted for each given upload to Azure. Initially I had the Word object created outside the loop and never closed until the end. Unfortunately this doesn't seem to work: every time the script hits a file it cannot open, every file that follows fails until it reaches the end of the inner foreach ($file in (Get-Content $list)) loop.
This means that to get the expected outcome I have to run this with an interval of 1, which takes far too long.
This is a shot in the dark, but to me it sounds like the reason it's failing is that the Word COM object is prompting you for some action because it cannot open the file, so all following items in the loop also fail. This might explain why it works if you set $interval to 1: when it is 1, the COM object is closed and reopened every time, and that takes forever (I ran into the same thing with Excel).
What you can do is, in your catch statement, close the old Word COM object and open a new one, which should let you continue with the loop (but it will be a bit slower if it needs to reopen the COM object a lot); see the sketch below.
If you want to debug the problem even more, set the COM object to be visible and slowly step through your program without interacting with Word. This will show you what is happening in Word and whether any prompts are causing the application to hang.
Of course, if you want to run it at full speed, you will need to detect which documents you can't open beforehand, or you could multithread it by opening several Word COM objects, which would allow you to load several documents at a time.
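A rough sketch of that catch-and-recreate idea (variable names follow the question's script; the Stop-Process fallback for a hung Quit() is an assumption, not something tested here):
Catch {
    Write-Host "$($file) failed to process" -ForegroundColor Red
    # Word may be stuck on a dialog; discard this instance and start fresh
    try { $word.Quit() } catch { }
    Stop-Process -Name WINWORD -Force -ErrorAction SilentlyContinue
    $word = New-Object -ComObject word.application
    $word.Visible = $false
    $word.DisplayAlerts = 0
    continue
}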
As for...
ISSUE
When the script encounters an issue opening the file such as it being password protected, it fails for every single document that follows.
... then test for this as noted here...
How to check if a word file has a password?
$filename = "C:\path\to\your.doc"
$wd = New-Object -COM "Word.Application"
try {
$doc = $wd.Documents.Open($filename, $null, $null, $null, "")
} catch {
Write-Host "$filename is password-protected!"
}
... and skip the file to avoid the failure of the remaining files.

Optimize Word document keyword search

I'm trying to search for keywords across a large number of MS Word documents and return the results to a file. I've got a working script, but I wasn't aware of the scale, and what I've got isn't nearly efficient enough; it would take days to plod through everything.
The script as it stands takes keywords from CompareData.txt, runs each one through all the files in a specific folder, and then appends the results to a file.
So when I'm done I will know how many files have each specific keyword.
[cmdletBinding()]
Param(
$Path = "C:\willscratch\"
) #end param
$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($Findtext in $FindTexts)
{
$matchCase = $false
$matchWholeWord = $true
$matchWildCards = $false
$matchSoundsLike = $false
$matchAllWordForms = $false
$forward = $true
$wrap = 1
$application = New-Object -comobject word.application
$application.visible = $False
$docs = Get-childitem -path $Path -Recurse -Include *.docx
$i = 1
$totaldocs = 0
Foreach ($doc in $docs)
{
Write-Progress -Activity "Processing files" -status "Processing $($doc.FullName)" -PercentComplete ($i /$docs.Count * 100)
$document = $application.documents.open($doc.FullName)
$range = $document.content
$null = $range.movestart()
$wordFound = $range.find.execute($findText,$matchCase,
$matchWholeWord,$matchWildCards,$matchSoundsLike,
$matchAllWordForms,$forward,$wrap)
if($wordFound)
{
$doc.fullname
$document.Words.count
$totaldocs ++
} #end if $wordFound
$document.close()
$i++
} #end foreach $doc
$application.quit()
"There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt
#clean up stuff
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
Remove-Variable -Name application
[gc]::collect()
[gc]::WaitForPendingFinalizers()
}
What I'd like to do is figure out a way to search each file for everything in CompareData.txt once, rather than iterate through it a bunch of times. If I was dealing with a small set of data, the approach I've got would get the job done - but I've come to find out that both the data in CompareData.txt and the source Word file directory will be very large.
Any ideas on how to optimize this?
Right now you're doing this (pseudocode):
foreach $Keyword {
create Word Application
foreach $File {
load Word Document from $File
find $Keyword
}
}
That means that if you have a hundred keywords and 10 documents, you're opening and closing a hundred instances of Word and loading a thousand Word documents before you're done.
Do this instead:
create Word Application
foreach $File {
load Word Document from $File
foreach $Keyword {
find $Keyword
}
}
So you only launch one instance of Word and only load each document once.
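(Applied to the question's own script, the reordering might look roughly like this; a trimmed, untested sketch that drops the optional Find.Execute arguments for brevity:)
$application = New-Object -ComObject word.application
$application.Visible = $false
$findTexts = Get-Content c:\scratch\CompareData.txt
$docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
foreach ($doc in $docs) {
    $document = $application.Documents.Open($doc.FullName)
    foreach ($findText in $findTexts) {
        $range = $document.Content
        if ($range.Find.Execute($findText)) {
            "$($doc.FullName) contains $findText"
        }
    }
    $document.Close($false)   # close without saving
}
$application.Quit()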
As noted in the comments, you may optimize the whole process by using the OpenXML SDK, rather than launching Word:
(assuming you've installed OpenXML SDK in its default location)
# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
# Grab the keywords and file names
$Keywords = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx
# hashtable to store results per document
$KeywordMatches = @{}
# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
foreach($Docx in $Documents)
{
# create array to hold matched keywords
$KeywordMatches[$Docx.FullName] = @()
# open document, wrap content stream in streamreader
$Document = $WordDoc::Open($Docx.FullName, $false)
$DocumentStream = $Document.MainDocumentPart.GetStream()
$DocumentReader = New-Object System.IO.StreamReader $DocumentStream
# read entire document
$DocumentContent = $DocumentReader.ReadToEnd()
# test for each keyword
foreach($Keyword in $Keywords)
{
$Pattern = [regex]::Escape($KeyWord)
$WordFound = $DocumentContent -match $Pattern
if($WordFound)
{
$KeywordMatches[$Docx.FullName] += $Keyword
}
}
$DocumentReader.Dispose()
$Document.Dispose()
}
Now you can show the matched keyword count for each document:
$KeywordMatches.GetEnumerator() | Select-Object @{n="File";e={$_.Key}}, @{n="Count";e={$_.Value.Count}}

Powershell automated deletion of specified SharePoint documents

We have a CSV file with approximately 8,000 SharePoint document file URLs. The files they refer to have to be downloaded to a file share location, then deleted from SharePoint. The files are not located in the same sites, but across several hundred sites in a server farm. We are looking to remove only the specified files, NOT the entire library.
We have the following script to effect the download, which creates the folder structure so that the downloaded files are separated.
param (
[Parameter(Mandatory=$True)]
[string]$base = "C:\Export\",
[Parameter(Mandatory=$True)]
[string]$csvFile = "c:\export.csv"
)
write-host "Commencing Download"
$date = Get-Date
add-content C:\Export\Log.txt "Commencing Download at ${date}:"
$webclient = New-Object System.Net.WebClient
$webclient.UseDefaultCredentials = $true
$files = (import-csv $csvFile | Where-Object {$_.Name -ne ""})
$line=1
Foreach ($file in $files) {
$line = $line + 1
if (($file.SpURL -ne "") -and ($file.path -ne "")) {
$lastBackslash = $file.SpURL.LastIndexOf("/")
if ($lastBackslash -ne -1) {
$fileName = $file.SpURL.substring(1 + $lastBackslash)
$filePath = $base + $file.path.replace("/", "\")
New-Item -ItemType Directory -Force -Path $filePath.substring(0, $filePath.length - 1)
$webclient.DownloadFile($file.SpURL, $filePath + $fileName)
$url=$file.SpURL
add-content C:\Export\Log.txt "INFO: Processing line $line in $csvFile, writing $url to $filePath$fileName"
} else {
$host.ui.WriteErrorLine("Exception: URL has no backslash on $line for filename $csvFile")
}
} else {
$host.ui.WriteErrorLine("Exception: URL or Path is empty on line $line for filename $csvFile")
}
}
write-Host "Download Complete"
Is there a way we could get the versions for each file?
I have been looking for a means to carry out the deletion, using the same csv file as reference - all of the code I have seen refers to deleting entire libraries, which is not desired.
I am very new to PowerShell and am getting lost. Can anyone shed some light?
Many thanks.
This looks like it might be useful. It's a different approach and would need to be modified to pull in the file list from your CSV, but it looks like it generally accomplishes what you are looking to do.
https://sharepoint.stackexchange.com/questions/6511/download-and-delete-documents-using-powershell
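(For the deletion side, a heavily hedged sketch using the SharePoint server object model, run from the SharePoint Management Shell on a farm server. The SpURL column name comes from the download script above; SPSite resolving a file URL to its site collection and SPFile.Versions/Delete are standard server object model behavior, but treat the whole thing as a starting point, not a tested solution:)
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
foreach ($row in (Import-Csv $csvFile | Where-Object { $_.SpURL -ne "" })) {
    try {
        $site = New-Object Microsoft.SharePoint.SPSite($row.SpURL)   # any URL inside the site collection works
        $web = $site.OpenWeb()                                       # the web closest to that URL
        $spFile = $web.GetFile($row.SpURL)
        if ($spFile.Exists) {
            add-content C:\Export\Log.txt "Deleting $($row.SpURL) ($($spFile.Versions.Count) previous versions)"
            $spFile.Delete()   # or $spFile.Recycle() to use the recycle bin instead
        }
        $web.Dispose()
        $site.Dispose()
    } catch {
        $host.ui.WriteErrorLine("Exception deleting $($row.SpURL): $($_.Exception.Message)")
    }
}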