How to read a PDF using Powershell

I am at the beginning of my first real PowerShell project. Right now I am trying to read certain fields from a PDF, specifically the account number and District ID. However, despite scouring the internet for a couple of hours, I have not found an answer that covers iText 7. I tried following an iTextSharp video while substituting what I thought was the correct add-on for iText 7, but that just failed. I am new to reading PDFs and am literally just trying to get the script to dump the PDF's text into a text file. Once I get that, I'll worry about getting the right information. If there is an easier way to pull the values directly from the fields, I'm all ears.
The final task is to pull this information from literally hundreds of copies of the same document. I'm just trying to get through this process in baby steps. The fields are typed in, so reading them should, in theory, be easy enough.
Add-Type -Path "file path\itext.pdfa.dll"
$path= "file path\doc.pdf"
$pdf = New-Object iText.text.pdf.PdfReader -ArgumentList $path
$export=""
foreach($page in 1..($pdf.NumberOfPages)){
    $export = [iText.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
}
$export | Out-File "file path\Test.txt"
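For reference, the type names in the snippet above (iText.text.pdf...) are the iTextSharp 5 ones. In iText 7 the reader and document classes live in iText.Kernel.Pdf (shipped in itext.kernel.dll, not itext.pdfa.dll), and text extraction lives in iText.Kernel.Pdf.Canvas.Parser. Note also that `$export =` inside the loop keeps only the last page. The following is a minimal, untested sketch, assuming the iText 7 DLLs and their dependencies have been unpacked into one folder; the function name `Get-PdfText` and the `$LibDir` parameter are hypothetical.

```powershell
# Hypothetical helper: return the text of every page of a PDF using iText 7.
function Get-PdfText {
    param(
        [Parameter(Mandatory)] [string] $Path,    # full path to the PDF
        [Parameter(Mandatory)] [string] $LibDir   # assumed folder holding the iText 7 DLLs
    )

    # Load every DLL in the folder so itext.kernel.dll can resolve its dependencies.
    Get-ChildItem -Path $LibDir -Filter *.dll |
        ForEach-Object { [void][System.Reflection.Assembly]::LoadFrom($_.FullName) }

    $reader   = New-Object iText.Kernel.Pdf.PdfReader -ArgumentList $Path
    $document = New-Object iText.Kernel.Pdf.PdfDocument -ArgumentList $reader
    try {
        $text = New-Object System.Text.StringBuilder
        for ($page = 1; $page -le $document.GetNumberOfPages(); $page++) {
            # Append each page instead of overwriting the previous one.
            $pageText = [iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor]::GetTextFromPage($document.GetPage($page))
            [void]$text.AppendLine($pageText)
        }
        $text.ToString()
    }
    finally {
        $document.Close()
    }
}

# Example call (placeholder paths):
# Get-PdfText -Path 'C:\docs\doc.pdf' -LibDir 'C:\libs\itext7' | Out-File 'C:\docs\Test.txt'
```

Once the text dump works, the account number and District ID could be picked out of the returned string, for example with a regular expression that matches the labels printed on the form.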

Related

Blank file after using powershell to convert a word file?

I have been trying to use PowerShell to convert some .docx files to .docm. I'm able to convert the file, but it's blank every time I open it.
This is the code I have been using:
Get-ChildItem *.docx | Rename-Item -NewName { $_.name -replace '\.docx$','.docm' }

Adding this here per the comments about it.
A .docm is just a Word document with embedded macros.
What do you expect to see?
In most cases, Word security blocks macro-enabled documents from opening unless you tell Word you accept the macro risk, or you have already disabled that check.
So, if these are not documents with macros, I am not sure what your plan was here.
If you went into Windows Explorer, manually renamed a (non-macro) .docx file to .docm, and then tried to open it, you would get the same result.
So this is not a PowerShell-specific issue. Changing the extension does not make the file a true .docm; it must be saved in that format by Word.
FYI, there are online tools for this conversion, though I have never used or needed them. So, just a heads-up.
However, here is more info from my old notes, in case the goal is to automate this via PowerShell.
If you really want to do this in PowerShell, you need to open the .docx through the MS Office COM interface, add the VBA/macro code to the document, and then save it as a macro-enabled file.
For example, here is an article, "Converting Word document format with PowerShell":
$path = "c:\olddocuments\"
$word_app = New-Object -ComObject Word.Application
$Format = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatXMLDocument
Get-ChildItem -Path $path -Filter '*.doc' |
    ForEach-Object {
        $document = $word_app.Documents.Open($_.FullName)
        $docx_filename = "$($_.DirectoryName)\$($_.BaseName).docx"
        $document.SaveAs([ref] $docx_filename, [ref]$Format)
        $document.Close()
    }
$word_app.Quit()
If you need to convert the documents to PDF, make the following change to the “SaveAs” line in the script. 17 corresponds to the PDF file format when doing a Save As in Microsoft Word.
$document.SaveAs([ref] $docx_filename, [ref]17)
The Microsoft reference for Word file formats is the WdSaveFormat enumeration (Word):
https://learn.microsoft.com/en-us/office/vba/api/Word.WdSaveFormat
wdFormatFlatXMLMacroEnabled (20): Open XML file format with macros enabled, saved as a single XML file.
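In the same WdSaveFormat enumeration, value 13 (wdFormatXMLDocumentMacroEnabled) is the regular .docm format, whereas 20 is the flat single-file XML variant. A rough sketch of re-saving .docx files as real .docm files follows, assuming Word is installed on the machine. The function names and the folder are hypothetical, and the COM part is untested here.

```powershell
# 13 = wdFormatXMLDocumentMacroEnabled (.docm) in the WdSaveFormat enumeration.
$wdFormatXMLDocumentMacroEnabled = 13

# Build the .docm path next to the source .docx (pure string work, no Word needed).
function Get-DocmPath {
    param([Parameter(Mandatory)] [string] $DocxPath)
    [System.IO.Path]::ChangeExtension($DocxPath, '.docm')
}

# Hypothetical helper: re-save every .docx in $Folder as a .docm through Word COM.
function Convert-DocxToDocm {
    param([Parameter(Mandatory)] [string] $Folder)

    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        Get-ChildItem -Path $Folder -Filter *.docx | ForEach-Object {
            $document = $word.Documents.Open($_.FullName)
            $target   = Get-DocmPath -DocxPath $_.FullName
            # Let Word write the file in the macro-enabled format.
            $document.SaveAs([ref] $target, [ref] $wdFormatXMLDocumentMacroEnabled)
            $document.Close()
        }
    }
    finally {
        # Always release Word, even if a document fails to open.
        $word.Quit()
    }
}

# Example call (placeholder folder):
# Convert-DocxToDocm -Folder 'C:\olddocuments'
```

The original .docx files are left in place, so the result can be checked before anything is deleted.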

Create Jenkins Job that searches for strings in files and reports warnings with file and line number

I have builds in Jenkins that produce warnings. These are great for navigating to and addressing issues. I want to search the source for strings and report the matches using the same warning style and, therefore, the same warning navigator. I have looked at the Text Finder plugin, but it is a post-build step that appears to only affect success or failure. I have looked at PowerShell, but I do not see anything that will produce a warning with a path and a line number. I could use PowerShell, a batch file, or even write a small command-line app to do this. I'm hoping something already exists.
Update:
This looks like the best suggestion so far:
$Files = Get-ChildItem \ExampleFolder\ -Recurse; foreach ($file in $files) { $File | select-string ExampleString | % {Write-Warning "$($file.fullname): $($_.linenumber)"}}
I have not had time to try this in the Jenkins PowerShell plugin yet. It works great from a script. I may need to change the output format to match what the Jenkins warnings plugin expects. I have not worked out whether I can search for strings containing punctuation, for example "(void)". I also need to limit the search to source files with the extension .c or .h.
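The two open points above (literal punctuation and restricting to .c/.h files) can be sketched as follows. `Select-String -SimpleMatch` treats the search text literally, and `Get-ChildItem -Include` filters by extension. The function name is hypothetical, and the GCC-style `file:line: warning: message` output is an assumption; the exact format depends on which parser the warnings plugin is configured to use.

```powershell
# Hypothetical helper: emit one GCC-style warning line per match in .c/.h files.
function Find-SourceString {
    param(
        [Parameter(Mandatory)] [string] $Root,   # folder to search recursively
        [Parameter(Mandatory)] [string] $Text    # literal text to look for
    )

    Get-ChildItem -Path $Root -Recurse -File -Include *.c, *.h |
        Select-String -Pattern $Text -SimpleMatch |
        ForEach-Object {
            # -SimpleMatch means "(void)" is matched as-is, with no regex escaping.
            '{0}:{1}: warning: found "{2}"' -f $_.Path, $_.LineNumber, $Text
        }
}

# Example call (placeholder folder):
# Find-SourceString -Root 'C:\ExampleFolder' -Text '(void)'
```

The lines are written to the normal output stream rather than through Write-Warning, so that nothing is prefixed to them before they reach the build log.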

Using Powershell to Strip Content from PDF

Using Powershell to Strip Content from PDF While Keeping PDF Format.
My Task:
I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts, this is anything that comes before ‘%PDF’ in the documents. A script that pulls all the desired data and exports it to a new file was needed. That part was super easy.
The Problem:
The exported data appears to be formatted correctly, except that it no longer opens as a PDF. I can open it in Notepad++ and it looks identical to one that was cleaned manually and works. Examining the raw content of the PowerShell-altered PDF, it appears that the "lines" are much shorter than they should be.
$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = @()
$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue | Get-Content -Encoding default | Out-String -stream
$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object{
    If ($_.contains('%PDF')){
        $LineArr += "%" + $_.Split('%')[1]
    }
    else{
        $LineArr += $_
    }
}
$LineArr | Out-File -Encoding Default -FilePath $Output
I understand the PDF format doesn't really use lines, so that might be where the problem is being created. The PDF format is probably being broken either when the data is first put into an array or when it is written back out. Is there a way to retain the format of the PDF while it is modified and then saved? It's probably the case that I'm missing something simple.

So I was about to start looking at iTextSharp and decided to give an older language a try first, Winbatch. (bleh!) I almost made a screen scraper to do the work but the shame of taking that route got the better of me. So, the function library was the next stop.
This is just a little blurb I spat out, with no error checking or logging at this point. All of that will be added later, along with file searches. All in all, it manages to clear all the unwanted extras from the PDF while keeping the exact format that PDFs require.
strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
    strPDFName = ItemExtract(-1, strPDFdoco, "\")
    strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
    strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file
    hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
    BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
    strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, @FWDSCAN, @FALSE) ; Find start point for copy
    strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
    strCount = strEndIndex - strStartIndex
    strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
    BinaryFree(hndOldPDFFile)
ENDIF
Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch I assume there is some sort of equivalent in AutoIt and it could be reproduced in most basic languages.
There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docos. Hopefully this helps. I've got half a million files to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
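The same byte-level idea can be sketched in PowerShell. The text-based attempt in the question breaks the file because Get-Content and Out-File decode, re-encode, and rewrite line endings, which changes the binary streams and byte offsets inside the PDF. Reading and writing raw bytes avoids that. This is a minimal sketch that strips only the leading junk, like the Winbatch version; the function name and paths are hypothetical.

```powershell
# Hypothetical helper: copy a PDF starting at its first "%PDF" marker, byte for byte.
function Remove-PdfPreamble {
    param(
        [Parameter(Mandatory)] [string] $Path,         # full path of the dirty PDF
        [Parameter(Mandatory)] [string] $Destination   # full path of the cleaned copy
    )

    $bytes  = [System.IO.File]::ReadAllBytes($Path)
    $marker = [System.Text.Encoding]::ASCII.GetBytes('%PDF')

    # Find the first offset where the four marker bytes occur.
    $start = -1
    for ($i = 0; $i -le $bytes.Length - $marker.Length; $i++) {
        $match = $true
        for ($j = 0; $j -lt $marker.Length; $j++) {
            if ($bytes[$i + $j] -ne $marker[$j]) { $match = $false; break }
        }
        if ($match) { $start = $i; break }
    }
    if ($start -lt 0) { throw "No %PDF header found in $Path" }

    # Write everything from the marker to the end of the file, untouched.
    $stream = [System.IO.File]::Create($Destination)
    try     { $stream.Write($bytes, $start, $bytes.Length - $start) }
    finally { $stream.Dispose() }
}

# Example call (placeholder paths; use full paths, since .NET ignores the PowerShell location):
# Remove-PdfPreamble -Path 'C:\TestPDFs\Test.pdf' -Destination 'C:\TestPDFs\Fixed\Test.pdf'
```

For a large batch, the call can be wrapped in a `Get-ChildItem -Filter *.pdf -Recurse | ForEach-Object { ... }` loop that builds a destination path per file.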

preplog.exe ran in foreach log file

I have a folder with some number of web log files, and I need to prep them for bulk import into SQL.
For that I have to run preplog.exe against each one of them.
I want to create a PowerShell script to do this for me. The problem I'm having is that preplog.exe has to be run in CMD, and I need to supply the input path and the output path.
For Example:
D:>preplog c:\blah.log > out.log
I've been playing with Foreach, but I haven't had any luck.
Any pointers would be much appreciated.

I would guess...
Get-ChildItem "C:\Folder\MyLogFiles" | Foreach-Object { preplog $_.FullName | Out-File "preplog.log" -Append }
FYI, it is good practice on this site to post your non-working code so that we at least have some context. Here I assume you're logging into a single file in the current directory.
Additionally you've said you need to run in CMD but you've tagged PowerShell - it pays to be specific. I've assumed PowerShell because it's a LOT easier to script.
I've also had to assume that the folder contains ONLY your log files, otherwise you will need to include a Where statement to filter the items.
In short I've made a lot of assumptions that means this may not be an accurate answer, so keep all this in mind for your next question =)
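If each log needs its own output file, as in the `preplog c:\blah.log > out.log` example from the question, a variation along these lines might work. It is a sketch under assumptions: preplog is taken to be on the PATH and to write its result to standard output, and the function name, output naming scheme, and ASCII encoding are made up for illustration.

```powershell
# Hypothetical helper: run a log-prep tool over every .log file, one output file per input.
function Invoke-PrepLog {
    param(
        [Parameter(Mandatory)] [string] $LogDir,   # folder containing the raw logs
        [Parameter(Mandatory)] [string] $OutDir,   # folder for the prepped copies
        [string] $Tool = 'preplog'                 # assumed to be on PATH; pass a full path otherwise
    )

    if (-not (Test-Path -Path $OutDir)) {
        New-Item -ItemType Directory -Path $OutDir | Out-Null
    }

    # -Filter keeps non-log files in the folder out of the loop.
    Get-ChildItem -Path $LogDir -Filter *.log | ForEach-Object {
        $target = Join-Path $OutDir ($_.BaseName + '.prepped.log')
        # Equivalent of "preplog in.log > out.log" for this one file.
        & $Tool $_.FullName | Out-File -FilePath $target -Encoding ascii
    }
}

# Example call (placeholder folders):
# Invoke-PrepLog -LogDir 'C:\Folder\MyLogFiles' -OutDir 'C:\Folder\Prepped'
```

The explicit `-Encoding ascii` is there because Out-File in Windows PowerShell defaults to UTF-16, which a bulk import may not expect; adjust it if the logs contain non-ASCII characters.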

Working with Word templates with Powershell

I am writing a function that is part of a much larger script that will take input from a web form, check whether that user exists in either our AD or Linux systems, create the account if it doesn't, email the user when it's done, and then create a Word document that we can print out and give them with their credentials (sans temp password), email address, and basic information about our IT services. I have been beating my head against the wall with the Word integration. There is almost ZERO PowerShell documentation online for Word integration. I've had to translate what I can from C# and VB, and even half of that isn't translatable.
I've got it mostly working now, but I'm having problems getting PS to put my text in the correct location in the Word template. I have a Word template with 4 bookmarks where I am inserting the user's name, username, email address, and account expiration. The problem is that PS is placing all of the text at the same bookmark. I've found that if I put the info in the script statically (i.e. $FillName.Text = 'John Doe') it works, but if I use a variable it just sticks all of them at the first bookmark. Here is my code:
Function createWordDocument($fullname,$sam,$mailaddress,$Expiration)
{
    $word = New-Object -ComObject "Word.application"
    $doc = $word.Documents.add("C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\welcome2.dotx")
    $FillName=$doc.Bookmarks.Item("Name").Range
    $FillName.Text="$fullname "
    $FillUser=$doc.Bookmarks.Item("Username").Range
    $FillUser.Text="$sam"
    $FillMail=$doc.Bookmarks.Item("Email").Range
    $FillMail.Text="$mailaddress"
    $FillExpiration=$doc.Bookmarks.Item("Expiration").Range
    $FillExpiration.Text="$Expiration"
    $file = "C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\test1.docx"
    $doc.SaveAs([ref]$file)
    $Word.Quit()
}
The function receives parameters that originated from an Import-Csv. $fullname, $sam, and potentially $mailaddress have all been modified from their original inputs. $Expiration comes raw from the Import-Csv. Any help would be appreciated. This seems to be the most relevant info I could find, and as far as I can tell I've got the same code, but it won't work for multiple bookmarks.

OK, like I suggested, you can set up a Mail Merge base that you can use to create docs for people. It does mean that you would need to output your data to a CSV file, but that is pretty trivial.
Start by setting up a test CSV with the data that you want to include. For simplicity you may want to place it alongside the Word doc that references it. We'll call it mailmerge.csv for now, but you can name it whatever you want. It looks like Name, UserName, Email, and Expiration are the fields you would want. You can use dummy data in those fields for the time being.
Then set up your mail merge in Word and save it someplace. We'll call it Welcome3.docx and stash it in the same place as your last doc. Once it's set up to reference your CSV file and saved, you can launch Word, open the master document, perform the merge, save the resulting file, and away you go.
I'll just use a modified version of your function, which will create the CSV from the parameters provided, open the merge doc, execute the merge, save the new file, and close Word. Then it passes a FileInfo object back so you can use that to send the email, or whatever.
Function createWordDocument($fullname,$sam,$mailaddress,$Expiration)
{
    [PSCustomObject]@{Name=$fullname;Username=$sam;Email=$mailaddress;Expiration=$Expiration}|Export-Csv "C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\mailmerge.csv" -NoTypeInformation -Force
    $word = New-Object -ComObject "Word.application"
    $doc = $word.Documents.Open("C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\welcome3.dotx")
    $doc.MailMerge.Execute()
    $file = "C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\$fullname.docx"
    ($word.documents | ?{$_.Name -Match "Letters1"}).SaveAs([ref]$file)
    $Word.Quit()
    [System.IO.FileInfo]$file
}

TheMadTechnician put me on the right track, but I had to do some tweaking. Here is what I wound up with:
Function createWordDocument($fullname)
{
    $word = New-Object -ComObject "Word.application"
    $doc = $word.Documents.Add("C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\welcome_letter.docx")
    $doc.MailMerge.Execute()
    $file = "C:\Users\smiths\Documents\Powershell Scripts\webformCreateUsers\$fullname.docx"
    ($word.documents | ?{$_.Name -Match "Letters1"}).SaveAs([ref]$file)
    $quitFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveOptions],"wdDoNotSaveChanges")
    $Word.Quit([ref]$quitformat)
}
Instead of passing the arguments to the function, I had the main function create the mailmerge.csv file for me and just have the Word template connect to it. I'm still passing $fullname since that's what I'm naming the file in the end. There were two major hiccups. The first was that every time a mail merge document is opened, Word asks whether you want to connect back to the source data. This meant that when PowerShell tried to open it, Word was waiting for interaction, and PS would then close it when it thought it was done. Of course, this meant that nothing got done. I found that there is a registry value you must create to make Word skip the SQL security check. For posterity's sake, it goes under this key:
HKCU\Software\Microsoft\Office\14.0\Word\Options\ as a DWORD called SQLSecurityCheck with a value of 0. That allowed Word to properly open the template and manipulate the files. The second bit of trouble was that Word wanted to re-save the original file each time it ran and would leave a dialogue box open, which kept Word open and in memory. The last 2 lines of the function force Word to close without saving.
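The registry change described above could also be scripted, so it can be applied on any machine that runs the merge. This is an untested sketch for Windows only; the function names are hypothetical, and the 14.0 segment of the path is the one quoted above, so it would need adjusting for a different Office version.

```powershell
# Hypothetical helper: build the Word options key path for a given Office version.
function Get-WordOptionsKey {
    param([string] $OfficeVersion = '14.0')   # 14.0 is the version quoted above
    "HKCU:\Software\Microsoft\Office\$OfficeVersion\Word\Options"
}

# Create SQLSecurityCheck = 0 (DWORD) so Word skips the data-source prompt.
function Disable-WordSqlSecurityCheck {
    param([string] $OfficeVersion = '14.0')

    $key = Get-WordOptionsKey -OfficeVersion $OfficeVersion
    if (-not (Test-Path -Path $key)) {
        New-Item -Path $key -Force | Out-Null
    }
    # -Force overwrites the value if it already exists.
    New-ItemProperty -Path $key -Name 'SQLSecurityCheck' -PropertyType DWord -Value 0 -Force | Out-Null
}

# Example call, run once per user before the first merge:
# Disable-WordSqlSecurityCheck
```

Because the key is under HKCU, it has to be set for the account that actually runs the script, not just for an administrator.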
