I have thousands of PDF documents that I need to comb through to pull out only certain data. I have a script that goes through each PDF, writes its content to a .txt file, and then searches the final .txt for the requested information. The only part I am stuck on is combining the data from every PDF into this one .txt file. Currently, each successive PDF simply overwrites the previous data, so the search is only performed on the last PDF in the folder. How can I alter this code so that each PDF's text is appended to the .txt instead of overwriting it?
$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
$outfile = $f -join ', '
$text = convert-PDFtoText $outfile
}
Here is my entire script for reference:
Start-Process powershell.exe -Verb RunAs {
function convert-PDFtoText {
param(
[Parameter(Mandatory=$true)][string]$file
)
Add-Type -Path "C:\ps\itextsharp.dll"
$pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
$text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
Write-Output $text
}
$pdf.Close()
}
$content = Read-Host "What are we looking for?: "
$file1 = Read-Host "Path to search: "
$all = Get-Childitem -Path $file1 -Recurse -Filter *.pdf
foreach ($f in $all){
$outfile = $f -join ', '
$text = convert-PDFtoText $outfile
}
$text | Out-File "C:\ps\bulk.txt"
Select-String -Path C:\ps\bulk.txt -Pattern $content | Out-File "C:\ps\select.txt"
Start-Sleep -Seconds 60
}
Any help would be greatly appreciated!
To capture the output of all convert-PDFtoText calls in a single output file, use a single pipeline with the ForEach-Object cmdlet:
Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
ForEach-Object { convert-PDFtoText $_.FullName } |
Out-File "C:\ps\bulk.txt"
A tweak to your convert-PDFtoText function would allow for a more concise and efficient solution:
Make convert-PDFtoText accept Get-ChildItem input directly from the pipeline:
function convert-PDFtoText {
param(
[Alias('FullName')]
[Parameter(Mandatory, ValueFromPipelineByPropertyName)]
[string] $file
)
begin {
Add-Type -Path "C:\ps\itextsharp.dll"
}
process {
$pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
}
$pdf.Close()
}
}
This then allows you to simplify the command at the top to:
Get-ChildItem -Path $file1 -Recurse -Filter *.pdf |
convert-PDFtoText |
Out-File "C:\ps\bulk.txt"
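As a side note on why the original loop kept only the last PDF: assigning to $text inside foreach replaces the variable on every pass. The sketch below contrasts that with assigning the output of the whole loop. Get-FakeText is a hypothetical stand-in for convert-PDFtoText, used only so the snippet runs without iTextSharp.

```powershell
# Hypothetical stand-in for convert-PDFtoText: emits two "pages" per file name.
function Get-FakeText {
    param([Parameter(Mandatory)][string] $file)
    "page 1 of $file"
    "page 2 of $file"
}

$names = 'a.pdf', 'b.pdf', 'c.pdf'

# Assigning inside the loop overwrites: only the last file's text survives.
foreach ($n in $names) { $last = Get-FakeText $n }

# Assigning the loop itself collects everything the loop body outputs.
$all = foreach ($n in $names) { Get-FakeText $n }

"Inside-loop assignment kept $(@($last).Count) lines"
"Whole-loop assignment kept $(@($all).Count) lines"
```

The pipeline shown above achieves the same collection without holding all the text in a variable first.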
Related
I am trying to construct a script that moves through specific folders and the log files in it, and filters the error codes. After that it passes them into a new file.
I'm not really sure how to do that with for loops, so I'll leave my code below.
If someone could tell me what I'm doing wrong, that would be greatly appreciated.
$file_name = Read-Host -Prompt 'Name of the new file: '
$path = 'C:\Users\user\Power\log_script\logs'
Add-Type -AssemblyName System.IO.Compression.FileSystem
function Unzip
{
param([string]$zipfile, [string]$outpath)
[System.IO.Compression.ZipFile]::ExtractToDirectory($zipfile, $outpath)
}
if ([System.IO.File]::Exists($path)) {
Remove-Item $path
Unzip 'C:\Users\user\Power\log_script\logs.zip' 'C:\Users\user\Power\log_script'
} else {
Unzip 'C:\Users\user\Power\log_script\logs.zip' 'C:\Users\user\Power\log_script'
}
$folder = Get-ChildItem -Path 'C:\Users\user\Power\log_script\logs\LogFiles'
$files = foreach($logfolder in $folder) {
$content = foreach($line in $files) {
if ($line -match '([ ][4-5][0-5][0-9][ ])') {
echo $line
}
}
}
$content | Out-File $file_name -Force -Encoding ascii
Inside the LogFiles folder are three more folders each containing log files.
Thanks
Expanding on a comment above about recursing the folder structure, and then actually retrieving the content of the files, you could try something like this:
# -File skips the subfolders themselves, which Get-Content cannot read
$allFiles = Get-ChildItem -Path 'C:\Users\user\Power\log_script\logs\LogFiles' -Recurse -File
# iterate the files
$allFiles | ForEach-Object {
    # iterate the content of each file, line by line (FullName avoids relative-path surprises)
    Get-Content $_.FullName | ForEach-Object {
        if ($_ -match '([ ][4-5][0-5][0-9][ ])') {
            $_
        }
    }
}
It looks like your inner loop iterates over a collection ($files) that doesn't exist yet. You assign $files the output of a foreach(...) loop and then try to nest another loop over $files inside it. At that point $files has not been assigned, so there is nothing to loop over.
Regardless, the issue is you are never reading the content of your log files. Even if you managed to loop through the output of Get-ChildItem, you need to look at each line to perform the match.
Obviously I cannot completely test this, but I see a few issues and have rewritten as below:
$file_name = Read-Host -Prompt 'Name of the new file'
$path = 'C:\Users\user\Power\log_script\logs'
$Pattern = '([ ][4-5][0-5][0-9][ ])'
if ( [System.IO.File]::Exists( $path ) ) { Remove-Item $path }
Expand-Archive 'C:\Users\user\Power\log_script\logs.zip' 'C:\Users\user\Power\log_script'
Select-String -Path 'C:\Users\user\Power\log_script\logs\LogFiles\*' -Pattern $Pattern |
Select-Object -ExpandProperty line |
Out-File $file_name -Force -Encoding ascii
Note: Select-String cannot recurse on its own.
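Since the log files sit in subfolders, one way to get recursion is to let Get-ChildItem walk the tree and pipe the files into Select-String. The following is only a sketch: the temporary folder, the W3SVC1 subfolder name, and the sample log lines are made up for the demonstration.

```powershell
# Assumption: a throwaway folder tree under the temp directory stands in for LogFiles.
$root = Join-Path ([System.IO.Path]::GetTempPath()) ("logdemo_" + [guid]::NewGuid().ToString('N'))
$sub  = Join-Path $root 'W3SVC1'
New-Item -ItemType Directory -Path $sub -Force | Out-Null
Set-Content -Path (Join-Path $sub 'a.log') -Value 'GET /x 200 12', 'GET /y 404 7'
Set-Content -Path (Join-Path $root 'b.log') -Value 'GET /z 500 3'

$Pattern = '([ ][4-5][0-5][0-9][ ])'

# Get-ChildItem does the recursion; Select-String only searches the files it receives.
$hits = Get-ChildItem -Path $root -Recurse -File -Filter *.log |
    Select-String -Pattern $Pattern |
    Select-Object -ExpandProperty Line

$hits

# Clean up the demo folder.
Remove-Item $root -Recurse -Force
```

In the real script, $root would be the LogFiles folder and the result would be piped to Out-File as before.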
I'm not sure you need to write your own UnZip function. PowerShell has the Expand-Archive cmdlet which can at least match the functionality thus far:
Expand-Archive -Path <SourceZipPath> -DestinationPath <DestinationFolder>
Note: The -Force parameter allows it to overwrite the destination files if they are already present, which may be a substitute for testing whether the file exists and deleting it if it does.
If you are going to test for the file that section of code can be simplified as:
if ( [System.IO.File]::Exists( $path ) ) { Remove-Item $path }
Unzip 'C:\Users\user\Power\log_script\logs.zip' 'C:\Users\user\Power\log_script'
This is because you were going to run the UnZip command regardless...
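Note: You could also use Test-Path for this. Because $path is a folder here, [System.IO.File]::Exists returns $false for it, whereas Test-Path works for both files and folders.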
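To illustrate the Test-Path suggestion, here is a small sketch that compares the two checks on a folder. The temporary folder and the old.log file are assumptions made up for the demonstration, not paths from the question.

```powershell
# Assumption: a throwaway folder under the temp directory stands in for the logs folder.
$path = Join-Path ([System.IO.Path]::GetTempPath()) ("unzipdemo_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $path | Out-Null
Set-Content -Path (Join-Path $path 'old.log') -Value 'stale'

# File.Exists only answers for files, so it reports False for a folder.
$fileCheck = [System.IO.File]::Exists($path)
# Test-Path answers for files and folders alike.
$pathCheck = Test-Path $path
"File.Exists: $fileCheck  Test-Path: $pathCheck"

# -Recurse is needed to remove a folder that still has content without a prompt.
if (Test-Path $path) { Remove-Item $path -Recurse -Force }
$afterRemoval = Test-Path $path
```

With this check in place, the folder really is removed before the archive is extracted again.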
Also, there are innumerable ways to get the matching lines; here are a couple of extra samples:
Get-ChildItem -Path 'C:\Users\user\Power\log_script\logs\LogFiles' |
ForEach-Object{
( Get-Content $_.FullName ) -match $Pattern
# Using match in this way will echo the lines that matched from each run of
# Get-Content. If nothing matched nothing will output on that iteration.
} |
Out-File $file_name -Force -Encoding ascii
This approach will read the entire file into an array before running the match on it. For large files it may pose a memory issue, however it enabled the clever use of -match.
OR:
Get-ChildItem -Path 'C:\Users\user\Power\log_script\logs\LogFiles' |
Get-Content |
ForEach-Object{ If( $_ -match $Pattern ) { $_ } } |
Out-File $file_name -Force -Encoding ascii
Note: You don't need the alias echo or its real cmdlet Write-Output
UPDATE: After fuzzing around a bit and trying different things I finally got it to work.
I'll include the code below just for demonstration purposes.
Thanks everyone
$start = Get-Date
"`n$start`n"
$file_name = Read-Host -Prompt 'Name of the new file: '
Out-File $file_name -Force -Encoding ascii
Expand-Archive -Path 'C:\Users\User\Power\log_script\logs.zip' -Force
$i = 1
$folders = Get-ChildItem -Path 'C:\Users\User\Power\log_script\logs\logs\LogFiles' -Name -Recurse -Include *.log
foreach($item in $folders) {
$files = 'C:\Users\User\Power\log_script\logs\logs\LogFiles\' + $item
foreach($file in $files){
$content = Get-Content $file
Write-Progress -Activity "Filtering..." -Status "File $i of $($folders.Count)" -PercentComplete (($i / $folders.Count) * 100)
$i++
$output = foreach($line in $content) {
if ($line -match '([ ][4-5][0-5][0-9][ ])') {
Add-Content -Path $file_name -Value $line
}
}
}
}
$end = Get-Date
$time = [int]($end - $start).TotalSeconds
Write-Output ("Runtime: " + $time + " Seconds" -join ' ')
My code below:
$Source = "C:\Users\xxxx"
$targetFolder = "C:\Users\xxx\new"
Get-ChildItem -Path $Source -Include * | forEach-Object{
$fileObject = $_
$filename = $_.BaseName.Substring(26)
Get-ChildItem -Path $Source -Include * | forEach-Object{
$tempObject = $_
$temp = $_.BaseName.Substring(26)
if($temp -eq $filename -And $tempObject -ne $fileObject){
Start-Process $fileObject + $tempObject -Verb Print -ArgumentList $targetFolder
}
}
}
I'm able to successfully match files based on part of their names, now I'd like to print the two matching images vertically on top of each other as a PDF, the Printer name would be "Microsoft Print to PDF"
What I truly want is as long as the images are matching according to my if statement, to be combined into one image, if there's another solution that can make that happen I'm open to it as well!!!
Any ideas????
I am trying to read large log files. In the folder C:\log\1\ I put two .txt files. I need to open and read every .txt file and filter for some text like this: [text]
# Filename: script.ps1
$Files = Get-ChildItem "C:\log\1\" -Filter "*.txt"
foreach ($File in $Files)
{
$StringMatch = $null
$StringMatch = select-string $File -pattern "[Error]"
if ($StringMatch) {out-file -filepath C:\log\outputlog.txt -inputobject $StringMatch}
}
# end of script
It does not work.
Would doing something like a select-string work?
Select-String C:\Scripts\*.txt -pattern "SEARCH STRING HERE" | Format-List
Or if there are multiple files you are wanting to parse maybe use the same select-string but within a loop and output the results.
$Files = Get-ChildItem "C:\log\1\" -Filter "*.txt"
foreach ($File in $Files)
{
$StringMatch = $null
$StringMatch = select-string $File -pattern "SEARCH STRING HERE"
if ($StringMatch) {out-file -filepath c:\outputlog.txt -inputobject $StringMatch -Append}
}
This will print out the file name along with the line number in the file. I hope this is what you are looking for.
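The file name and line number come from the match objects that Select-String returns, which expose Filename, LineNumber, and Line properties. A minimal sketch of formatting them explicitly follows; the temporary folder and the one.txt sample file are made up for the demonstration.

```powershell
# Assumption: a throwaway folder with one sample file stands in for C:\log\1\.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("ssdemo_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'one.txt') -Value 'ok', 'SEARCH STRING HERE', 'ok'

# Build one "file:line: text" string per match from the match object's properties.
$report = Select-String -Path (Join-Path $dir '*.txt') -Pattern 'SEARCH STRING HERE' |
    ForEach-Object { '{0}:{1}: {2}' -f $_.Filename, $_.LineNumber, $_.Line }

$report

# Clean up the demo folder.
Remove-Item $dir -Recurse -Force
```

Because the wildcard path covers every file in the folder, no explicit loop is needed for this variant.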
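# Start with a fresh output file; ignore the error if it does not exist yet
Remove-Item -Path C:\log\outlog.txt -ErrorAction SilentlyContinue
$Files = Get-ChildItem "C:\log\1\" -Filter "*.txt"
foreach ($File in $Files)
{
    # Count lines from 1 so the reported number matches what an editor shows
    $lineNumber = 1
    # FullName is the complete path regardless of the PowerShell version
    $content = Get-Content -Path $File.FullName
    foreach($line in $content)
    {
        if($line.Contains('[Error]'))
        {
            Add-Content -Path C:\log\outlog.txt -Value "$($File.Name) -> $lineNumber"
        }
        $lineNumber++
    }
}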
Code below works
It selects strings in the .txt files in your folder using -SimpleMatch and then appends them to the new.txt file.
However, I do not know how to put two simple matches in one line. Maybe someone does and can post it here.
Select-String -Path C:\log\1\*.txt -SimpleMatch "[Error]" -ca | select -exp line | out-file C:\log\1\new.txt -Append
Select-String -Path C:\log\1\*.txt -SimpleMatch "[File]" -ca | select -exp line | out-file C:\log\1\new.txt -Append
Regards
-----edit-----
If you only want to display the results rather than save them, simply don't pipe the output to Out-File.
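On putting two simple matches in one line: the -Pattern parameter of Select-String accepts several strings, so both literals can be passed in one call. The sketch below assumes a temporary folder and sample lines invented for the demonstration.

```powershell
# Assumption: a throwaway folder with one sample file stands in for C:\log\1\.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("smdemo_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'one.txt') -Value @(
    '[Error] disk full'
    'all good'
    '[File] report.txt'
    'Error without brackets'
)

# -SimpleMatch treats each pattern literally, so the brackets are not a regex character class.
$lines = Select-String -Path (Join-Path $dir '*.txt') -SimpleMatch -Pattern '[Error]', '[File]' |
    Select-Object -ExpandProperty Line

$lines

# Clean up the demo folder.
Remove-Item $dir -Recurse -Force
```

A line is returned when it contains either literal, so a single pass over the files replaces the two separate commands.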
use index then check it :
New-Item C:\log\outputlog.txt
$Files = Get-ChildItem "C:\log\1\" -Filter "*.txt"
foreach ($File in $Files)
{
$StringMatch = $null
$StringMatch = Get-Content $File.FullName -Raw
if($StringMatch.IndexOf("[Error]") -ne -1)
{
Add-Content -Path C:\log\outputlog.txt -Value ($StringMatch+"
-------------------------------------------------------------
")
}
}
# end of script
I'm trying to write a PowerShell script which will take files with particular extensions from different servers. To pass many extensions I know we can use @ followed by the extensions. When I pick them from an input file and pass them to the script, it doesn't work.
$ServerName=Get-content "D:\HOMEWARE\BLRMorningCheck\Jerry\servername.txt"
foreach ($server in $ServerName)
{
$server_host=echo $server | %{$data = $_.split(";"); Write-Output "$($data[0])"}
$Targetfolder=echo $server | %{$data = $_.split(";"); Write-Output "$($data[1])"}
$Ext=echo $server | %{$data = $_.split(";"); Write-Output "$($data[2])"}
$Extension =@($Ext)
$Targetfolder=$Targetfolder.Trim('"')
$Files = Get-Childitem $TargetFolder -Include $Extension -Recurse
echo $Files
}
My extensions are *.log, *.log*7z, *.txt*7z, *.txt*.
$Ext contains a string, probably with a comma-separated list of extensions. However, a comma-separated string doesn't turn into an array (which is what the -Include parameter expects) just because you put it in @(). You need to split the string at the delimiter character:
PS C:\> $Ext = "*.log,*.log*7z,*.txt*7z,*.txt"
PS C:\> $Ext
*.log,*.log*7z,*.txt*7z,*.txt
PS C:\> $Extension = $Ext -split ','
PS C:\> $Extension
*.log
*.log*7z
*.txt*7z
*.txt
Also, like I said in my answer to your previous question, you're probably better off using Import-Csv for reading your input file:
$filename = 'D:\HOMEWARE\BLRMorningCheck\Jerry\servername.txt'
Import-Csv $filename -Delimiter ';' -Header 'ComputerName', 'TargetFolder', 'Ext' |
select TargetFolder, @{n='Extensions';e={$_.Ext -split ','}} |
% { Get-Childitem $_.TargetFolder -Include $_.Extensions -Recurse }
@Ansgar has the correct approach, but in case you're new to PowerShell the more basic syntax below might be easier to understand. If you already know what file extensions you're looking for, you don't need to get them from the file; just create an array that contains them.
$exts = "*.log","*.log*7z","*.txt*7z","*.txt"
$servers = Get-Content "D:\HOMEWARE\BLRMorningCheck\Jerry\servername.txt"
foreach ($server in $servers) {
$serverHost = $($server.split(';'))[0]  # $host is a read-only automatic variable, so use another name
$targetFolder = $($server.split(';'))[1]
$files = Get-Childitem $targetFolder -Include $exts -Recurse
foreach($f in $files) {
Write-Output $f.fullname
}
}
I have a lot of .csv Files and I want to write the filename into the same file, at the end of the last position.
For example in C:\CSV I have:
P0_0050569F52981EE39CEF8C857147E850.csv
P0_0050569F52981EE39CEF8D4825092850.csv
P0_0050569F52981EE39CEF8EE13B954850.csv
...and another thousand more of these files
In every one of these files I have some content:
P0_0050569F52981EE39CEF8C857147E850.csv = 365013;253;9001
I want to transform this: 365013;253;9001 to this:
365013;253;9001;
P0_0050569F52981EE39CEF8C857147E850.csv
I cannot find the error...
Get-ChildItem -Filter "*.csv" -Path "C:\CSV" -Recurse | % {
#Open file
$reader = New-Object System.IO.StreamReader $_.FullName
#Ignore first line
$reader.ReadLine() | out-null
#Get name
$filename = $filename
#Write
Add-Content ";" $filename
#Close stream
$reader.Close()
}
MickyB's option should work. Here's another way to do it:
Get-ChildItem -Filter "*.csv" -Path "C:\CSV" -Recurse | foreach {$contents = get-content $_.fullname; $contents +=";"+"`r`n"+$_.name; Set-Content $_.fullname -Value $contents }
This will append the filename at the end of each CSV.
Get-ChildItem -Filter "*.csv" -Path "C:\CSV" -Recurse | % {
(";"+"`r`n"+$_.name) | Out-File -FilePath $_.fullname -Append
}