iTextSharp to merge PDF files in PowerShell

I have a folder which contains thousands of PDF files. I need to filter these files by file name (which will sort them into groups of 2 or more PDFs) and then merge each group of 2 or more PDFs into 1 PDF.
I'm OK with grouping the files, but I'm not sure of the best way to then merge them into 1 PDF. I have researched iTextSharp but have been unable to get it to work in PowerShell.
Is iTextSharp the best way of doing this? Any help with the code for this would be much appreciated.
Many thanks
Paul

Have seen a few of these PowerShell-tagged questions that are also tagged with itextsharp, and always wondered why answers are given in .NET, which can be very confusing unless the person asking the question is proficient in PowerShell to begin with. Anyway, here's a simple working PowerShell script to get you started:
# Folder containing this script, itextsharp.dll and the PDFs to merge.
$workingDirectory = Split-Path -Parent $MyInvocation.MyCommand.Path;
$output = [System.IO.Path]::Combine($workingDirectory, 'output.pdf');
# Collect every PDF, skipping output.pdf from an earlier run so it is not merged into itself.
$pdfs = ls $workingDirectory -recurse | where {-not $_.PSIsContainer -and $_.Extension -imatch "^\.pdf$" -and $_.FullName -ne $output};
[void] [System.Reflection.Assembly]::LoadFrom(
    [System.IO.Path]::Combine($workingDirectory, 'itextsharp.dll')
);
# Create truncates an existing output file; OpenOrCreate would leave stale bytes at the end.
$fileStream = New-Object System.IO.FileStream($output, [System.IO.FileMode]::Create);
$document = New-Object iTextSharp.text.Document;
$pdfCopy = New-Object iTextSharp.text.pdf.PdfCopy($document, $fileStream);
$document.Open();
foreach ($pdf in $pdfs) {
    # Append all pages of each source PDF, then release the reader.
    $reader = New-Object iTextSharp.text.pdf.PdfReader($pdf.FullName);
    $pdfCopy.AddDocument($reader);
    $reader.Dispose();
}
$pdfCopy.Dispose();
$document.Dispose();
$fileStream.Dispose();
To test:
Create an empty directory.
Copy the code above into a PowerShell script file in the directory.
Copy the itextsharp.dll to the directory.
Put some PDF files in the directory.
Not sure how you intend to group or filter the PDFs based on file name, or whether that's even your intention (I couldn't tell if you meant just picking out PDFs by extension), but that shouldn't be too hard to add.
Good luck. :)
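For the grouping step the question mentions, one possible approach is Group-Object with a script block that extracts a key from each file name. This is only a sketch: it assumes a hypothetical naming convention in which everything before the first underscore identifies the group, and the sample names are made up for illustration. Each resulting group could then be fed through a PdfCopy loop like the one above, writing one output file per group.

```powershell
# Hypothetical file names; in practice these would come from Get-ChildItem.
$names = 'A100_1.pdf', 'A100_2.pdf', 'B200_1.pdf', 'B200_2.pdf', 'B200_3.pdf'

# Assumption: the text before the first underscore is the group key.
$groups = $names | Group-Object { ($_ -split '_')[0] }

foreach ($group in $groups) {
    # $group.Name is the key; $group.Group holds the matching file names.
    $outputName = "$($group.Name)_merged.pdf"
    "{0} <- {1}" -f $outputName, ($group.Group -join ', ')
}
```

With real files, the script block would read $_.Name (or $_.BaseName) instead of the bare string, and the key extraction would need adjusting to the actual naming scheme.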

Related

Blank file after using powershell to convert a word file?

I have been trying to use PowerShell to convert some .docx files to .docm. I'm able to convert the file, but it's blank every time I open it.
This is the code I have been using:
Get-ChildItem *.docx | Rename-Item -NewName { $_.name -replace '\.docx$','.docm' }
Adding this here per other comments regarding it.
.DOCM is just a Word doc with embedded macros.
What do you expect to see?
In most cases, Word security blocks macro docs from opening unless you tell Word you accept the macro risk, or you've already disabled that.
So, if these are not documents with macros, I am not sure what your plan was here.
If you just went into Windows Explorer, manually renamed a (non-macro) .docx file to .docm, and then tried to open it, you'd get the same result.
So, this is not a PS-specific code issue. Changing the extension does not make it a true .docm; it must be saved that way in Word.
... removing the code refactor.
FYI...There are online tools for this conversion.
Though I've never used or needed to use them. So, just a heads up.
However, here is more info after looking at my old notes, if the goal is to automate this via PS.
If you really wanted to do this in PS, you would need to use PS to open a .docx using MS Office COM, add VBA/macro code to the doc, and then save it as a macro-enabled file.
For example, here is an article on the topic:
Converting Word document format with PowerShell
$path = "c:\olddocuments\"
$word_app = New-Object -ComObject Word.Application
# 12 = wdFormatXMLDocument (.docx); the literal avoids needing the Word interop assembly loaded.
$Format = 12
Get-ChildItem -Path $path -Filter '*.doc' |
    ForEach-Object {
        $document = $word_app.Documents.Open($_.FullName)
        $docx_filename = "$($_.DirectoryName)\$($_.BaseName).docx"
        $document.SaveAs([ref] $docx_filename, [ref] $Format)
        $document.Close()
    }
$word_app.Quit()
If you need to convert the documents to PDF, make the following change
to the “SaveAs” line in the script. 17 corresponds to the PDF file
format when doing a Save As in Microsoft Word.
$document.SaveAs([ref] $docx_filename, [ref]17)
The Microsoft documentation for the Word save formats is here:
WdSaveFormat enumeration (Word)
https://learn.microsoft.com/en-us/office/vba/api/Word.WdSaveFormat
wdFormatXMLDocumentMacroEnabled # 13 Open XML file format with macros enabled (this is the .docm format).
wdFormatFlatXMLMacroEnabled # 20 Open XML file format with macros enabled saved as a single XML file.

How to search for a specific file name then move all files after that file to another folder using PowerShell

Let's say I have 10 PDF files in a folder named c:\Temp
1440_021662_54268396_1.pdf
1440_028116_19126420_1.pdf
1440_028116_19676803_1.pdf
1440_028116_19697944_1.pdf
1440_028116_19948492_1.pdf
1440_028116_19977334_1.pdf
1440_028116_20500866_1.pdf
1440_028116_20562027_1.pdf
1440_028116_20566871_1.pdf
1440_028116_20573350_1.pdf
In my search, I know I am looking for a file that will match a specific number, for example 19676803 (I'm getting the number to search for from a SQL Query I'm running in my script)
I know how to find that specific file, but what I need to be able to do is move all the files after the searched file has been found to another pre-defined folder. So using the 10 PDFs above as the example files, I need to move all the files "after" the file named 1440_028116_19676803_1.pdf to another folder. I know how to move files using PowerShell, just do not know how to do it after/from a specific file name. Hope that makes sense.
$batchNumCompleted = 'c:\Temp\'
$lastLoanPrinted = $nameQuery.LoanNumber
$fileIndex = Get-ChildItem -path $batchNumCompleted | where {$_.name -match $lastLoanPrinted}
Can anyone provide suggestions/help on accomplishing my goal? I'm not able to provide all code written so far as it contains confidential information. Thank you.
Use the .Where() extension method in SkipUntil mode:
$allFiles = Get-ChildItem -path $batchNumCompleted
$filesToMove = $allFiles.Where({$_.Name -like '*19676803_1.pdf'}, 'SkipUntil') |Select -Skip 1
Remove the Select -Skip 1 command if you want to move the file with 19676803 in the name as well
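To see what SkipUntil does without touching the file system, here is a small sketch that runs the same call on a plain array of names taken from the question. $lastLoanPrinted stands in for the value from the SQL query. Note the approach relies on the listing being in name order; piping Get-ChildItem through Sort-Object Name first would make that explicit.

```powershell
# Sample names from the question; real code would use Get-ChildItem output.
$names = @(
    '1440_028116_19126420_1.pdf',
    '1440_028116_19676803_1.pdf',
    '1440_028116_19697944_1.pdf',
    '1440_028116_19948492_1.pdf'
)
$lastLoanPrinted = '19676803'   # stands in for $nameQuery.LoanNumber

# SkipUntil drops items until the condition first matches, then returns the rest
# (including the matching item), so Select-Object -Skip 1 removes the match itself.
$after = $names.Where({ $_ -like "*$lastLoanPrinted*" }, 'SkipUntil') | Select-Object -Skip 1
$after
```

With FileInfo objects the result can be piped straight to Move-Item -Destination with the target folder.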

Powershell: Go through all files (PDF's) in a directory and move them based on what's written in the first 6 bytes

I am currently trying to write a powershell script that does the following:
Go through all PDF files in the directory the script is in
Check the first few bytes of those PDF files
If those bytes say something along the lines of "PK", move them to a different location
If the bytes say something else (e.g. PDF1.4), don't move them at all and go to the next one.
Context: We have around 70k PDF files that can't be opened. After checking them with a certain tool, it looks like around 99% of those are damaged and the remaining 1% are zip files.
The first bytes of a zipped PDF file are "PK"; the first bytes of a broken PDF file are, for example, PDF1.4.
I need to unzip all zip files and relocate them. Going through 70k PDF files by hand is kind of painful, so I'm looking for a way to automate it.
I know I'm supposed to provide a code sample, but the truth is that I am absolutely lost. I have written a few PowerShell scripts before, but I have no idea how to do something like this.
So, if anyone could kindly point me in the right direction or give me a useful function, I would really appreciate it.
You can use Get-Content to get your first 6 bytes as you asked.
We can then tie that into a loop on all the documents and configure a simple if statement to decide what to do next, e.g. move the file to another dir
EDITED BASED ON YOUR COMMENT:
$pdfDirectory = 'C:\Temp\struktur_id_1225\ext_dok'
$newLocation = 'C:\Path\To\New\Folder'
Get-ChildItem "$pdfDirectory" -Filter "*.pdf" | foreach {
    # Read only the first line of the file; the PDF header (e.g. %PDF-1.5) sits there.
    if ((Get-Content $_.FullName -TotalCount 1) -like "%PDF-1.5*") {
        # Derive the matching .hl7 path: same name, MDM folder instead of ext_dok.
        $HL7 = $_.FullName.replace("ext_dok","MDM")
        $HL7 = $HL7.replace(".pdf",".hl7")
        move $_.FullName $newLocation
        move $HL7 $newLocation
    }
}
Try using the above, which is also a bit easier to edit.
$pdfDirectory will need to be set to the folder containing the PDF Files
$newLocation will obviously be the new directory!
And you will still need to change the -like "%PDF-1.5*" to suit your search!
It should do the rest for you, give it a shot
Another Edit
I have mimicked your folder structure on my computer, and placed a few PDF files and matching HL7 files and the script is working perfectly.
Get-Content is not suited to PDFs; you'd want to use iTextSharp to read them.
Download iTextSharp (found in releases) and put itextsharp.dll somewhere easy to find (i.e. the folder your script is located in).
You can install the .nupkg by using Install-Package, or simply use an archive tool to extract the contents of the .nupkg file (it's basically a .zip file).
The code below splits the text of page 1 of each PDF on whitespace and puts the words into an array. You can then test whether the array contains your keyword.
Add-Type -Path "C:\path\to\itextsharp.dll"
$pdfs = Get-ChildItem "C:\path\to\pdfs" -Filter *.pdf
foreach ($pdf in $pdfs) {
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $pdf.FullName
    # Extract the text of page 1 and split it into words on any whitespace.
    $words = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, 1) -split '\s+'
    $reader.Close()
    foreach ($word in $words) {
        # do your test here
    }
}
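For the check the question actually describes (looking at the leading bytes rather than the page text), a sketch along the following lines may be enough and needs no extra library. Get-FileSignature is a hypothetical helper name, and the two temporary files only stand in for real data; a zip archive starts with the bytes "PK" and a PDF with "%PDF".

```powershell
function Get-FileSignature {
    param([string]$Path, [int]$Count = 4)
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
        # Decode only the bytes actually read (the file may be shorter than $Count).
        [System.Text.Encoding]::ASCII.GetString($buffer, 0, $read)
    }
    finally {
        $stream.Dispose()
    }
}

# Demo with two temporary files standing in for real data.
$zipLike = [System.IO.Path]::GetTempFileName()
$pdfLike = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllBytes($zipLike, [byte[]](0x50, 0x4B, 0x03, 0x04))
[System.IO.File]::WriteAllText($pdfLike, '%PDF-1.4 rest of file')

foreach ($file in $zipLike, $pdfLike) {
    $signature = Get-FileSignature -Path $file
    if ($signature -clike 'PK*') { "$file looks like a zip archive" }
    elseif ($signature -clike '%PDF*') { "$file looks like a PDF" }
    else { "$file has an unknown signature" }
}
Remove-Item $zipLike, $pdfLike
```

In a real run the loop would iterate over Get-ChildItem -Filter *.pdf and call Move-Item in the "PK" branch. Pass full paths to the helper, since .NET file APIs resolve relative paths against the process working directory rather than the current PowerShell location.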

Edit powershell script to merge 2 docx into one PDF

I found this script online. It converts docx files to PDF. The thing is, it creates one PDF for each docx. I need to edit this script to merge 2 docx files into one single PDF file. I have zero knowledge of PowerShell, but I know batch in Linux.
$documents_path = Split-Path -parent $MyInvocation.MyCommand.Path
$word_app = New-Object -ComObject Word.Application
Get-ChildItem -Path $documents_path -Filter *.doc? | ForEach-Object {
$document = $word_app.Documents.Open($_.FullName)
$pdf_filename = "$($_.DirectoryName)\$($_.BaseName).pdf"
$document.SaveAs([ref] $pdf_filename, [ref] 17)
$document.Close()
}
$word_app.Quit()
Producing one PDF per document is how the script you are using is designed.
The more direct approach is to merge the .docx files first, then convert the result to PDF. This means you have to understand the MS Word object model and how to code against it. You're going to have to pick a starting .docx, then append the other Word data to the end.
So, do a search for how to merge Word files. Get that worked out, then you can just use PowerShell to convert the result to .pdf.
With zero knowledge of PowerShell, you should really take a few quick online training sessions to get a handle on it all, before you get yourself into a very frustrating position.
Go to the Microsoft Virtual Academy and YouTube and do a search for 'beginning PowerShell'.

Powershell - rename file using Date Taken attribute

I have a stack load of images and videos on my Samsung phone. I copied these images to a USB then onto my PC.
I want to use Powershell to rename these files based on their Date Taken attribute.
Format required = yyyy-MM-dd HH.mm.ss ddd
I have been using a PowerShell script (see below) that does this beautifully using the Date Modified attribute, but the copy above somehow changed the Date Modified value on me (WTH!), so I can't use that now (as it's not accurate).
Get-ChildItem | Rename-Item -NewName {$_.LastWriteTime.ToString("yyyy-MM-dd HH.mm.ss ddd") + ($_.Extension)}
In summary - is there a way to change the file name based on the Date Taken file attribute? Suggestions I have seen online require use of the .NET System.Drawing.dll and convoluted code (I'm sure it works, but damn it's ugly).
GG
Please check out the Set-PhotographNameAsDateTimeTaken PowerShell module. It extracts the date and time from the picture and renames the picture accordingly.
It supports the -Recurse, -Replace and -Verbose parameters. By default it will create a result folder at the same level as your working directory.
If you need to change the format of the target names, the code can be found here.
I 'glued' together a bunch of other answers to make a bulk script. Credit to those, but Chrome crashed and I lost those other webpages on Stack. This works on photo files only and will rename all files to YYYYMMDD_HHMMSS.jpg format.
Here it is:
$nocomment = [reflection.assembly]::LoadWithPartialName("System.Drawing")
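Get-ChildItem *.jpg | foreach {
    # Use the full path: .NET resolves relative paths against its own working directory.
    $pic = New-Object System.Drawing.Bitmap($_.FullName)
    # 36867 is the EXIF DateTimeOriginal ("Date Taken") tag, stored as null-terminated ASCII.
    $bitearr = $pic.GetPropertyItem(36867).Value
    $string = [System.Text.Encoding]::ASCII.GetString($bitearr)
    $date = [datetime]::ParseExact($string,"yyyy:MM:dd HH:mm:ss`0",$Null)
    # yyyyMMdd gives zero-padded month and day, matching the YYYYMMDD_HHMMSS target format.
    [string] $newfilename = Get-Date $date -Format yyyyMMdd_HHmmss
    $newfilename += ".jpg"
    # Release the file handle before renaming.
    $pic.Dispose()
    Rename-Item $_.FullName $newfilename -Force
    $newfilename
}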
In order to avoid this error:
New-Object : Cannot find type [System.Drawing.Bitmap]: verify that the assembly containing this type is
loaded.
...
Make sure the required assembly is loaded before executing the code above:
add-type -AssemblyName System.Drawing
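To get the exact name format asked for in the question (yyyy-MM-dd HH.mm.ss ddd), only the parsing and formatting step needs to change. The sketch below isolates that step so it can be tried without any image files; the sample timestamp is made up, and it assumes the raw EXIF value has the usual "yyyy:MM:dd HH:mm:ss" layout with a trailing null character, as the script above also assumes.

```powershell
# Sample raw value as EXIF stores it: ASCII text with a trailing null character.
$raw = "2019:03:05 14:07:09`0"
$invariant = [System.Globalization.CultureInfo]::InvariantCulture

# Trim the null terminator, then parse using the fixed EXIF layout.
$taken = [datetime]::ParseExact($raw.TrimEnd([char]0), 'yyyy:MM:dd HH:mm:ss', $invariant)

# Build the new file name in the format requested in the question.
$newName = $taken.ToString('yyyy-MM-dd HH.mm.ss ddd', $invariant) + '.jpg'
$newName
```

Passing the invariant culture keeps the abbreviated day name in English regardless of the machine's regional settings; drop that argument if localized day names are preferred.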