Using Powershell to Strip Content from PDF While Keeping PDF Format.
My Task:
I have been attempting to perform what would be a simple task if the documents were not in PDF format. I have a bunch of PDFs that have unwanted data before the bulk of usable data starts; that is, anything that comes before ‘%PDF’ in the documents. I needed a script that pulls all the desired data and exports it to a new file. That part was super easy.
The Problem:
The exported data appears to be formatted correctly, except it doesn’t open as a PDF anymore. I can open it in Notepad++ and it looks identical to one that was cleaned manually and works. Examining the raw content of the PowerShell-altered PDF, it appears that the ‘lines’ are much shorter than they should be.
$Path = 'C:\FileLocation'
$Output = '.\MyFile.pdf'
$LineArr = @()
$Target = Get-ChildItem -Path $Path -Filter *.pdf -Recurse -ErrorAction SilentlyContinue | Get-Content -Encoding default | Out-String -stream
$Target.Where({ $_ -like '*%PDF*' }, 'SkipUntil') | ForEach-Object{
If ($_.contains('%PDF')){
$LineArr += "%" + $_.Split('%')[1]
}
else{
$LineArr += $_
}
}
$LineArr | Out-File -Encoding Default -FilePath $Output
I understand the PDF format doesn't really use lines, so that might be where the problem is being created. Either when the data is being initially put into an array, or when it’s being written the PDF format is probably being broken. Is there a way to retain the format of the PDF while it is modified and then saved? It’s probably the case that I’m missing something simple.
So I was about to start looking at iTextSharp and decided to give an older language a try first, Winbatch. (bleh!) I almost made a screen scraper to do the work but the shame of taking that route got the better of me. So, the function library was the next stop.
This is just a little blurb I spit out with no error checking or logging going on at this point. All that will be added in along with file searches later. All in all, it manages to clear all the unwanted extras from the PDF while keeping the exact format that PDFs require.
strPDFdoco = "C:\TestPDFs\Test.pdf"
strPDFString = "%%PDF"
strPDFendString = "%%%%END"
If FileExist(strPDFdoco)
strPDFName = ItemExtract(-1, strPDFdoco, "\")
strFixedPDFFullPath = ("C:\TestPDF\Fixed\": strPDFName)
strCurrentPDFFileSize = FileSize(strPDFdoco) ; Get size of PDF file
hndOldPDFFile = BinaryAlloc(strCurrentPDFFileSize) ; Allocate memory for reading PDF file
BinaryRead(hndOldPDFFile, strPDFdoco) ; Read PDF file
strStartIndex = BinaryIndexEx(hndOldPDFFile, 0, strPDFString, @FWDSCAN, @FALSE) ; Find start point for copy
strEndIndex = BinaryEodGet(hndOldPDFFile) ; find eof
strCount = strEndIndex - strStartIndex
strWritePDF = BinaryWriteEx( hndOldPDFFile, strStartIndex, strFixedPDFFullPath, 0, strCount)
BinaryFree(hndOldPDFFile)
ENDIF
Now that I have an idea how this works, making a tool to do this in PS sounds more doable. There's a PS function out there in the wild called Get-HexDump that might be a good base to educate myself on bits and hex in PS. Since this works in Winbatch I assume there is some sort of equivalent in AutoIt and it could be reproduced in most basic languages.
There appear to be a lot of people out there trying to clear crud from before the header and after the end of their PDF docos. Hopefully this helps. I've got half a million files to hit with whatever script I morph this into. I might update with a PS version if I decide to go that route again, and if I remember.
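For anyone who wants to stay in PowerShell, here is a minimal sketch of the same byte-level idea as the Winbatch script above. It is not the original poster's code: the function name Get-PdfPayload is made up for illustration, and trimming everything after the last '%%EOF' is an assumption based on the remark about crud after the end of the file. The point is that the file is handled as bytes and never as lines of text, so nothing gets re-encoded or re-wrapped.

```powershell
# Hypothetical helper (not from the original post): returns the bytes from the
# first '%PDF' marker through the last '%%EOF' marker, or through the end of
# the data when no '%%EOF' is present.
function Get-PdfPayload {
    param([Parameter(Mandatory)][byte[]]$Bytes)

    # ISO-8859-1 maps every byte to exactly one character, so string offsets
    # equal byte offsets. The string is used only for searching.
    $latin1 = [System.Text.Encoding]::GetEncoding(28591)
    $text   = $latin1.GetString($Bytes)

    $start = $text.IndexOf('%PDF', [System.StringComparison]::Ordinal)
    if ($start -lt 0) { throw 'No %PDF header found.' }

    # Assumption: anything after the last '%%EOF' is unwanted trailing data.
    $eof = $text.LastIndexOf('%%EOF', [System.StringComparison]::Ordinal)
    $end = if ($eof -ge $start) { $eof + 5 } else { $Bytes.Length }

    $length  = $end - $start
    $payload = New-Object byte[] $length
    [Array]::Copy($Bytes, $start, $payload, 0, $length)
    return ,$payload   # the leading comma keeps the result a byte[]
}

# Usage (placeholder paths):
# $bytes = [System.IO.File]::ReadAllBytes('C:\TestPDFs\Test.pdf')
# [System.IO.File]::WriteAllBytes('C:\TestPDFs\Fixed\Test.pdf', (Get-PdfPayload -Bytes $bytes))
```

ReadAllBytes and WriteAllBytes are used instead of Get-Content and Out-File because the latter pair decode and re-encode text, which is what breaks the PDF in the original script.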
Related
I have been trying to use PowerShell to convert some .docx files to .docm. I'm able to convert the file, but it's blank every time I open it.
This is the code I have been using:
Get-ChildItem *.docx | Rename-Item -NewName { $_.name -replace '\.docx$','.docm' }
Adding this here per other comments regarding it.
.DOCM is just a Word doc with embedded macros.
What do you expect to see?
In most cases, Word security blocks macro docs from opening unless you tell Word you accept the macro risk, or you've already disabled that.
So, if these are not .DOCs with macros, I am not sure of what your plan was here.
If you just went into Windows Explorer, took a .docx (non-macro) file, manually renamed it to .docm, and then tried to open it, you'd get the same result.
So, not a PS or PS-specific code issue. Changing the extension does not make it a true .docm, it must be saved that way in Word.
... removing the code refactor.
FYI...There are online tools for this conversion.
Though I've never used or needed to use them. So, just a heads up.
However, here is more info after looking at my old notes, if the goal is to automate this via PS.
If you really want to do this in PS, you need to use PS to open the .docx through the MS Office COM interface, add VBA/macro code to the doc, and then save it as a macro-enabled file.
For example, here is an article on converting Word document formats with PowerShell ("Converting Word document format with PowerShell"):
$path = "c:\olddocuments\"
$word_app = New-Object -ComObject Word.Application
$Format = [Microsoft.Office.Interop.Word.WdSaveFormat]::wdFormatXMLDocument
Get-ChildItem -Path $path -Filter '*.doc' |
ForEach-Object {
$document = $word_app.Documents.Open($_.FullName)
$docx_filename = "$($_.DirectoryName)\$($_.BaseName).docx"
$document.SaveAs([ref] $docx_filename, [ref]$Format)
$document.Close()
}
$word_app.Quit()
If you need to convert the documents to PDF, make the following change
to the “SaveAs” line in the script. 17 corresponds to the PDF file
format when doing a Save As in Microsoft Word.
$document.SaveAs([ref] $docx_filename, [ref]17)
The Microsoft Word file format tech doc, "WdSaveFormat enumeration (Word)", is here:
https://learn.microsoft.com/en-us/office/vba/api/Word.WdSaveFormat
wdFormatFlatXMLMacroEnabled # 20 Open XML file format with macros enabled saved as a single XML file.
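Putting those pieces together for the .docx to .docm case, something along these lines might work. This is a sketch, not tested against every Word version: the function names Get-DocmPath and Convert-DocxToDocm are made up here, the value 13 is wdFormatXMLDocumentMacroEnabled from the WdSaveFormat enumeration (the regular .docm format, as opposed to the flat XML value 20 listed above), and it assumes Word is installed on the machine.

```powershell
# Hypothetical helper: builds the .docm path that sits next to a .docx file.
function Get-DocmPath {
    param([Parameter(Mandatory)][string]$DocxPath)
    return [System.IO.Path]::ChangeExtension($DocxPath, '.docm')
}

# Hypothetical helper: opens each .docx in $Folder through Word's COM
# interface and saves a macro-enabled copy. Requires Microsoft Word.
function Convert-DocxToDocm {
    param([Parameter(Mandatory)][string]$Folder)

    $wdFormatXMLDocumentMacroEnabled = 13   # value from the WdSaveFormat enumeration
    $word = New-Object -ComObject Word.Application
    try {
        Get-ChildItem -Path $Folder -Filter '*.docx' | ForEach-Object {
            $doc    = $word.Documents.Open($_.FullName)
            $target = Get-DocmPath -DocxPath $_.FullName
            $doc.SaveAs([ref]$target, [ref]$wdFormatXMLDocumentMacroEnabled)
            $doc.Close()
        }
    }
    finally {
        # Quit Word even if a document fails to open or save.
        $word.Quit()
    }
}

# Usage (placeholder path):
# Convert-DocxToDocm -Folder 'C:\olddocuments'
```

The original .docx files are left in place, so the result can be checked before anything is deleted.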
I get a CSV every week that our finance team puts in a shared drive. I have a script for that CSV that I run once I get it.
The first command of the script is of course Import-Csv.
The problem is, the finance team insists on naming the file differently each time plus they don't always put it in the same location within the drive.
As a result, I have to first hunt for the file, put it into the directory that the script points to and then rename the file.
I've tried talking to the team about putting it in the same location and making sure the filename is the same but they only follow the instructions for a couple of weeks before just doing whatever.
Ideally, I'd like for it so that when I run the script, there would be a popup that would ask me to pick a CSV (Similar to how it looks when you do "Save As" on an Office Document).
Anyway for this to be done within PowerShell?
You can access .Net classes and interface with the forms library to instantiate and take input from the standard FileOpen dialog. Something like below:
Using Namespace System.Windows.Forms
Add-Type -AssemblyName System.Windows.Forms   # the console does not load this assembly by default
$FileBrowser = [OpenFileDialog]::new()
$FileBrowser.InitialDirectory = 'c:\temp'
$FileBrowser.Filter = 'Comma Separated Values (*.csv)|*.csv'
[Void]$FileBrowser.ShowDialog()
$CsvFile = $FileBrowser.FileName
Then use $CsvFile in the Import-Csv command.
You can change the .InitialDirectory property to make navigating a little more convenient.
Use the .Filter property to limit the file open display to CSV files, to make things that much more convenient.
Also, use the [Void] cast to prevent the status return (usually 'OK' or 'Cancel') from echoing to the screen.
Note: A simple Google search will turn up many examples. I refined some of the work from here. That will also document some of the other properties if you want to explore etc.
If you are willing to settle for a selection box that doesn't look as nice as the Save As dialog, you can use Out-Gridview. Something along these lines might help.
$filenames =
@(Get-ChildItem -Path C:\temp -Recurse -Filter *.csv |
Sort-Object LastWriteTime -Descending |
Out-GridView -Title 'Choose a file' -PassThru)
$csvfile = $filenames[0].FullName
Import-Csv $csvfile | More
The -Path specifies a directory that contains all the locations where your csv file might be delivered. The sort just puts the most recently written files at the top of the grid, which should make selection easier. The @() wrapper merely makes sure the result stored in $filenames is an array.
You would do something else with the results of Import-Csv.
Steven's response certainly satisfies your original question, but an alternative would be to let PowerShell do the work. If you know the drive, and you know the name of the file this week, you can pass the name to your script and let it search the drive filtering on the specific csv file you need. Make it recursive, and open the only file that matches. Sorry, didn't have time yesterday to include code. Here's a function that returns the full file path when provided with a top level search path and a filename with possible wildcards.
function gfp { Get-ChildItem -Path $args[0] -Recurse -Include $args[1] | Select-Object -ExpandProperty FullName }  # returns the full path of every match
Example: gfp "d:\rootfolder" "thisweeksfilename.csv"
I am currently trying to write a powershell script that does the following:
Go through all PDF-Files in the directory in which the script is in
Check the first few bytes of those PDF-Files
If those bytes say something along the lines of "PK", move them to a different location
If the bytes say something else (e.g. PDF1.4), don't move them at all and go to the next one.
Context: We have around 70k PDF-Files that can't be opened. After checking them with a certain tool, it looks like around 99% of those are damaged and the remaining 1% are zip files.
The first bytes of a zipped PDF file are "PK"; the first bytes of a broken PDF-File are PDF1.4, for example.
I need to unzip all zip files and relocate them. Going through 70k PDF-Files by hand is kinda painful, so I'm looking for a way to automate it.
I know I'm supposed to provide a code sample, but the truth is that I am absolutely lost. I have written a few PowerShell scripts before, but I have no idea how to do something like this.
So, if anyone could kindly point me in the right direction or give me a useful function, I would really appreciate it a lot.
You can use Get-Content to read the first line of each file, which contains the header you asked about.
We can then tie that into a loop over all the documents and use a simple if statement to decide what to do next, e.g. move the file to another dir.
EDITED BASED ON YOUR COMMENT:
$pdfDirectory = 'C:\Temp\struktur_id_1225\ext_dok'
$newLocation = 'C:\Path\To\New\Folder'
Get-ChildItem "$pdfDirectory" -Filter "*.pdf" | foreach {
if((Get-Content $_.FullName | select -first 1 ) -like "%PDF-1.5*"){
$HL7 = $_.FullName.replace("ext_dok","MDM")
$HL7 = $HL7.replace(".pdf",".hl7")
move $_.FullName $newLocation;
move $HL7 $newLocation
}
}
Try using the above, which is also a bit easier to edit.
$pdfDirectory will need to be set to the folder containing the PDF Files
$newLocation will obviously be the new directory!
And you will still need to change the -like "%PDF-1.5*" to suit your search!
It should do the rest for you, give it a shot
Another Edit
I have mimicked your folder structure on my computer, and placed a few PDF files and matching HL7 files and the script is working perfectly.
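If you would rather look at the actual leading bytes than at the first text line, a sketch along these lines might do it. The function name Get-FileKind and the 'Zip' / 'Pdf' / 'Unknown' labels are made up for illustration. It reads only the first four bytes, so it stays fast across 70k files, and it should be given full paths because .NET file methods resolve relative paths against the process directory rather than the current PowerShell location.

```powershell
# Hypothetical helper: classifies a file by its leading bytes.
function Get-FileKind {
    param([Parameter(Mandatory)][string]$Path)

    $buffer = New-Object byte[] 4
    $stream = [System.IO.File]::OpenRead($Path)
    try     { $read = $stream.Read($buffer, 0, $buffer.Length) }
    finally { $stream.Dispose() }

    # Only the bytes actually read are decoded, so short files are safe.
    $magic = [System.Text.Encoding]::ASCII.GetString($buffer, 0, $read)

    if     ($magic.StartsWith('PK',   [System.StringComparison]::Ordinal)) { 'Zip' }
    elseif ($magic.StartsWith('%PDF', [System.StringComparison]::Ordinal)) { 'Pdf' }
    else                                                                  { 'Unknown' }
}

# Usage with the variables from the script above:
# Get-ChildItem -Path $pdfDirectory -Filter '*.pdf' |
#     Where-Object { (Get-FileKind -Path $_.FullName) -eq 'Zip' } |
#     Move-Item -Destination $newLocation
```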
Get-Content is not suited for PDFs; you'd want to use iTextSharp to read them.
Download iTextSharp (found in releases) and put the itextsharp.dll somewhere easy to find (i.e. the folder your script is located in).
You can install the .nupkg by using Install-Package, or simply use an archive tool to extract the contents of the .nupkg file (it's basically a .zip file).
The code below adds every word on page 1 for each PDF separated by whitespace to an array. You can then test if the array contains your keyword
Add-Type -Path "C:\path\to\itextsharp.dll"
$pdfs = Get-ChildItem "C:\path\to\pdfs" *.pdf
foreach ($pdf in $pdfs) {
$reader = New-Object itextsharp.text.pdf.pdfreader -ArgumentList $pdf.Fullname
$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,1).Split("")
foreach($line in $text) {
# do your test here
}
}
I was trying to find a way in PowerShell to move a file based on its size, and could not find exactly what I was looking for. I found how to move only files of a certain size, and how to do other if/then statements, but not how to move a file to different locations based on its size.
Why did I need/want to do this? An exe I am running creates an output file even if it has no data, so sometimes the file is empty and sometimes it has data. When it has data I need it sent to someone; when it's empty I just want it in a backup folder for reference.
This part let me move a file based on size: -cle is less than or equal to
$BlankFiles = Get-ChildItem c:\test\*.rej | where { $_.Length -cle 0kb}
This part let me check if an empty file exists. After lots of reading I went with System.IO.File over Test-Path:
[System.IO.File]::Exists($BlankFiles)
Putting this all in an IF/ELSE statement was the part I struggled with. The answer I came up with is below.
I am mainly posting this because I could not find this exact scenario, and in case anyone sees a problem with this approach that I missed.
Here is the solution I came up with, and in all the tests I did it appears to be working as intended. Note: I only need to do this on one file at a time, which is why this works and why I left out recursive or loop steps.
If the file is blank, it is moved to a backup folder with the date appended. If it has data, a copy with the date appended goes to the backup folder, and the file itself is moved, with the date appended, to a different location that is accessible to the necessary users.
I was thinking about checking how many lines are in the file rather than the size of the file, but it appears the blank file sometimes has a return in it and sometimes doesn't. So I went with the size method instead.
$BlankFiles = Get-ChildItem c:\test\*.rej | where { $_.Length -cle 0kb}
$date = Get-Date
$fndate = $date.ToString("MMddyyyy")
If ([System.IO.File]::Exists($BlankFiles) -eq "True") {
Move-Item C:\test\*.rej c:\test\blankfiles-"$fndate".rej
}
Else {
Copy-Item c:\test\*.rej c:\test\realfiles-"$fndate".rej
Move-Item c:\test\*.rej c:\user\accessible\realfiles-"$fndate".rej -Force
}
If anyone sees any issues with doing it this way or has a better suggestion, please let me know. As I mentioned, from my tests it appears to be working wonderfully, and I thought I would share.
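One possible variation, offered only as a sketch: wrap the logic in a function that works on a single named file instead of the *.rej wildcard, so that a second .rej file in the folder cannot confuse the check. The function name Move-BySize and all of its parameters are made up for illustration. The -MaxEmptyBytes parameter is an assumption added to cover the case where a "blank" file contains a lone return; setting it to 2 would treat a file holding only a CRLF as empty.

```powershell
# Hypothetical helper: files at or below $MaxEmptyBytes go to $EmptyDir;
# anything larger is copied to $BackupDir and then moved to $DeliveryDir.
# Every destination name gets the date appended, e.g. report-01312024.rej.
function Move-BySize {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$EmptyDir,
        [Parameter(Mandatory)][string]$BackupDir,
        [Parameter(Mandatory)][string]$DeliveryDir,
        [long]$MaxEmptyBytes = 0
    )

    $file  = Get-Item -LiteralPath $Path
    $stamp = (Get-Date).ToString('MMddyyyy')
    $name  = '{0}-{1}{2}' -f $file.BaseName, $stamp, $file.Extension

    if ($file.Length -le $MaxEmptyBytes) {
        Move-Item -LiteralPath $file.FullName -Destination (Join-Path $EmptyDir $name)
    }
    else {
        Copy-Item -LiteralPath $file.FullName -Destination (Join-Path $BackupDir $name)
        Move-Item -LiteralPath $file.FullName -Destination (Join-Path $DeliveryDir $name) -Force
    }
}

# Usage (placeholder paths):
# Move-BySize -Path 'C:\test\output.rej' -EmptyDir 'C:\test\blank' `
#     -BackupDir 'C:\test\real' -DeliveryDir 'C:\user\accessible'
```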
I'm super new at all of this so please excuse my lack of technical elegance and all around idiocy.
dir c:\Users\me\desktop\Test\*.txt | %{ $sourceFile = $_; get-content $_} | Out-File "$sourceFile.results"
How can I modify this command line so that instead of one file with the contents of all the text files I have a one to one ratio so that each output files represents the contents of each text file?
I realize that this object is ridiculous in terms of application but I'm conceptually trying to piece this together bit by bit so I can really understand.
P.S. What's with the %? Haha another ridiculous question, doesn't seem worth a separate post, what does it do?
dir | % { Out-File -FilePath "new_$($_.Name)" -InputObject (gc $_.FullName) }
Only one pipeline is needed. This command prepends "new_" to the filename because I was writing to the same directory. You can remove that if it's not needed.
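As for the P.S.: % is a built-in alias for ForEach-Object, which runs its script block once per item coming down the pipeline, with $_ holding the current item. The original command produced a single file because Out-File sat outside the script block, so it received everything at once and $sourceFile held only one value when the path was evaluated. A spelled-out sketch of the one-to-one version, keeping the ".results" naming from the question, might look like this (the function name Copy-EachTextFile is made up for illustration):

```powershell
# Hypothetical helper: writes one ".results" file next to each .txt file in $Folder.
function Copy-EachTextFile {
    param([Parameter(Mandatory)][string]$Folder)

    Get-ChildItem -Path $Folder -Filter '*.txt' | ForEach-Object {
        # $_ is the current file; capture it so the name stays clear below.
        $source = $_
        Get-Content -LiteralPath $source.FullName |
            Out-File -FilePath "$($source.FullName).results"
    }
}

# Usage (placeholder path):
# Copy-EachTextFile -Folder 'C:\Users\me\desktop\Test'
```

Because Out-File is now inside the script block, it runs once per input file and each output gets its own name.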