Edit powershell script to merge 2 docx into one PDF - powershell

i have found this script online. It converts docx files to pdf. The thing is, it creates one pdf for each docx. I need to edit this script, to merge 2 docx files into one single PDF file. I have zero knowledge of powershell, but i know batch in linux.
$documents_path = Split-Path -parent $MyInvocation.MyCommand.Path
$word_app = New-Object -ComObject Word.Application
Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
$document = $word_app.Documents.Open($_.FullName)
$pdf_filename = "$($_.DirectoryName)\$($_.BaseName).pdf"
$document.SaveAs([ref] $pdf_filename, [ref] 17)
$document.Close()
}
$word_app.Quit()

This is the design of the script you are using.
Use the more direct approach by merging the .docx files first, then convert to PDF. This means you have to understand the MSWord object model and how to code for it. You're going to have to pick a starting .docx the append other word data to the end.
So, do a search for how to merge Word files. Get that worked out, then you can just use PowerShell to make them .pdfs.
With zero knowledge of PowerShell, you should really take a few quick online training session to get an handle on it all, before you get yourself in a very frustrating position.
Go to the Microsoft Virtual Academy and YouTube and do a search for 'beginning PowerShell'

Related

Blank file after using powershell to convert a word file?

I have been trying to use PowerShell to convert some .docx files to .docm. I'm able to convert the file, but it's blank every time I open it.
This is the code I have been using:
Get-ChildItem *.docx | Rename-Item -NewName { $_.name -replace '\.docx$','.docm' }
Adding this here per other comments regarding it.
.DOCM is just a Word doc with embedded macros.
What do you expect to see?
In most cases, Word security blocks macro docs from opening unless you tell Word you accept the macro risk, or you've already disabled that.
So, if these are not .DOCs with macros, I am not sure of what your plan was here.
If you just went into Windows Explorer and opened a .docx (non-Macro) file, then manually renamed it to .docm, then try and open it, you'd get the same result.
So, not a PS or PS-specific code issue. Changing the extension does not make it a true .docm, it must be saved that way in Word.
... removing the code refactor.
FYI...There are online tools for this conversion.
Though I've never used or needed to use them. So, just a heads up.
However, here is more info after looking at my old notes, if the goal is to automate this via PS.
if you really wanted to do this in PS, you need to use PS to open a .docx using MSOffice COM, add VBA/Macro code to the doc, and then save it as a macro-enabled file.
For example, here is an article regarding
[Converting Word document format with PowerShell][2]
$path = "c:\olddocuments\"
$word_app = New-Object -ComObject Word.Application
$Format = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatXMLDocument
Get-ChildItem -Path $path -Filter '*.doc' |
ForEach-Object {
$document = $word_app.Documents.Open($_.FullName)
$docx_filename = "$($_.DirectoryName)\$($_.BaseName).docx"
$document.SaveAs([ref] $docx_filename, [ref]$Format)
$document.Close()
}
$word_app.Quit()
If you need to convert the documents to PDF, make the following change
to the “SaveAs” line in the script. 17 corresponds to the PDF file
format when doing a Save As in Microsoft Word.
$document.SaveAs([ref] $docx_filename, [ref]17)
Microsoft Word file format tech doc is here:
[WdSaveFormat enumeration (Word)][3]
https://learn.microsoft.com/en-us/office/vba/api/Word.WdSaveFormat
wdFormatFlatXMLMacroEnabled # 20 Open XML file format with macros enabled saved as a single XML file.

PowerShell: Turning/Printing Word Files into PDF using Bullzip

So I have a folder with a bunch of docx files and I want to
turn them into PDFs. I could use the following method in a loop
$Word=NEW-OBJECT –COMOBJECT WORD.APPLICATION
$Doc=$Word.Documents.Open('"C:\[path]\word.docx"')
$Name=($Doc.Fullname).replace(“docx”,”pdf”)
$Doc.saveas([ref] $Name, [ref] 17)
$Doc.close()
$word.quit()
But there are over 100 word files, and so it would take a while.
I installed BullZip to test if it worked better but I don't know
the code in order to use it in powershell. If anyone can provide it along with an explanation on how it works, I would appreciate it.
Thanks

Powershell: Go through all files (PDF's) in a directory and move them based on what's written in the first 6 bytes

I am currently trying to write a powershell script that does the following:
Go through all PDF-Files in the directory in which the script is in
Check the first few bytes of those PDF-Files
If those bytes say something along the lines of "PK", move them to a different location
If the bytes say something else (ex: PDF1.4), dont move them at all and go to the next one.
Context: We have around 70k PDF-Files that cant be opened. After checking them with a certain tool, it looks like around 99% of those are damaged and the remaining 1% are zip files.
The first bytes of a zipped PDF file start with "PK", the first bytes of a broken PDF-File start with PDF1.4 for example.
I need to unzip all zip files and relocate them. Going through 70k PDF-Files by hand is kinda painful, so im looking for a way to automate it.
I know im supposed to provide a code sample, but the truth is that i am absolutely lost. I have written a few powershell scripts before, but i have no idea how to do something like this.
So, if anyone could kindly point me to the right direction or give me a useful function, i would really appreciate it a lot.
You can use Get-Content to get your first 6 bytes as you asked.
We can then tie that into a loop on all the documents and configure a simple if statement to decide what to do next, e.g. move the file to another dir
EDITED BASED ON YOUR COMMENT:
$pdfDirectory = 'C:\Temp\struktur_id_1225\ext_dok'
$newLocation = 'C:\Path\To\New\Folder'
Get-ChildItem "$pdfDirectory" -Filter "*.pdf" | foreach {
if((Get-Content $_.FullName | select -first 1 ) -like "%PDF-1.5*"){
$HL7 = $_.FullName.replace("ext_dok","MDM")
$HL7 = $HL7.replace(".pdf",".hl7")
move $_.FullName $newLocation;
move $HL7 $newLocation
}
}
Try using the above, which is also a bit easier to edit.
$pdfDirectory will need to be set to the folder containing the PDF Files
$newLocation will obviously be the new directory!
And you will still need to change the -like "%PDF-1.5*" to suit your search!
It should do the rest for you, give it a shot
Another Edit
I have mimicked your folder structure on my computer, and placed a few PDF files and matching HL7 files and the script is working perfectly.
Get-Content is not suited for PDF's, you'd want to use iTextSharp to read PDF's.
Download the iTextSharp(found in releases) and put the itextsharp.dll somewhere easy to find (ie. the folder your script is located in).
You can install the .nupkg by using Install-Package, or simply using an archive tool to extract the contents of the .nupkg file (it's basically a .zip file)
The code below adds every word on page 1 for each PDF separated by whitespace to an array. You can then test if the array contains your keyword
Add-Type -Path "C:\path\to\itextsharp.dll"
$pdfs = Get-ChildItem "C:\path\to\pdfs" *.pdf
foreach ($pdf in $pdfs) {
$reader = New-Object itextsharp.text.pdf.pdfreader -ArgumentList $pdf.Fullname
$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,1).Split("")
foreach($line in $text) {
# do your test here
}
}

Using Powershell to Strip Content from PDF

Using Powershell to Strip Content from PDF While Keeping PDF Format.
My Task:
I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts, this is anything that comes before ‘%PDF’ in the documents. A script that pulls all the desired data and exports it to a new file was needed. That part was super easy.
The Problem:
The data that is exported appears to be formatted correctly, except it doesn’t open as a PDF anymore. I can open it in Notepad++ and it looks identical to one that was clean manually and works. Examining the raw code of the Powershell altered PDF it appears that the ‘lines’ are much shorter than they should be.
$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = #()
$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue | Get-Content -Encoding default | Out-String -stream
$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object{
If ($_.contains('%PDF')){
$LineArr += "%" + $_.Split('%')[1]
}
else{
$LineArr += $_
}
}
$LineArr | Out-File -Encoding Default -FilePath $Output
I understand the PDF format doesn't really use lines, so that might be where the problem is being created. Either when the data is being initially put into an array, or when it’s being written the PDF format is probably being broken. Is there a way to retain the format of the PDF while it is modified and then saved? It’s probably the case that I’m missing something simple.
So I was about to start looking at iTextSharp and decided to give an older language a try first, Winbatch. (bleh!) I almost made a screen scraper to do the work but the shame of taking that route got the better of me. So, the function library was the next stop.
This is just a little blurb I spit out with no error checking or logging going on at this point. All that will be added in along with file searches later. All in all it manages to clear all the unwanted extras in the PDF but keeping the exact format that is required by PDFs.
strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
strPDFName = ItemExtract(-1, strPDFdoco, "\")
strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file
hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, #FWDSCAN, #FALSE) ; Find start point for copy
strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
strCount = strEndIndex - strStartIndex
strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
BinaryFree(hndOldPDFFile)
ENDIF
Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch I assume there is some sort of equivalent in AutoIt and it could be reproduced in most basic languages.
There appears to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docos, Hopefully this helps, I've got a half mill to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.

iTextSharp to merge PDF files in PowerShell

I have a folder which contains thousands of PDF files. I need to filter through these files based on file name (which will group these into 2 or more PDF's) and then merge these 2 more more PDF's into 1 PDF.
I'm OK with group the files but not sure the best way of then merging these into 1 PDF. I have researched iTextSharp but have been unable to get this to work in PowerShell.
Is iTextSharp the best way of doing this? Any help with the code for this would be much appreciated.
Many thanks
Paul
Have seen a few of these PowerShell-tagged questions that are also tagged with itextsharp, and always wondered why answers are given in .NET, which can be very confusing unless the person asking the question is proficient in PowerShell to begin with. Anyway, here's a simple working PowerShell script to get you started:
$workingDirectory = Split-Path -Parent $MyInvocation.MyCommand.Path;
$pdfs = ls $workingDirectory -recurse | where {-not $_.PSIsContainer -and $_.Extension -imatch "^\.pdf$"};
[void] [System.Reflection.Assembly]::LoadFrom(
[System.IO.Path]::Combine($workingDirectory, 'itextsharp.dll')
);
$output = [System.IO.Path]::Combine($workingDirectory, 'output.pdf');
$fileStream = New-Object System.IO.FileStream($output, [System.IO.FileMode]::OpenOrCreate);
$document = New-Object iTextSharp.text.Document;
$pdfCopy = New-Object iTextSharp.text.pdf.PdfCopy($document, $fileStream);
$document.Open();
foreach ($pdf in $pdfs) {
$reader = New-Object iTextSharp.text.pdf.PdfReader($pdf.FullName);
$pdfCopy.AddDocument($reader);
$reader.Dispose();
}
$pdfCopy.Dispose();
$document.Dispose();
$fileStream.Dispose();
To test:
Create an empty directory.
Copy code above into a Powershell script file in the directory.
Copy the itextsharp.dll to the directory.
Put some PDF files in the directory.
Not sure how you intend to group filter the PDFs based on file name, or if that's your intention (couldn't tell if you meant just pick out PDFs by extension), but that shouldn't be too hard to add.
Good luck. :)