Faster way to determine if a file is a PDF - PowerShell

Looking for some pointers / tips to increase the speed and/or efficacy of the below. I'd be open to other methods, but have only dabbled in PowerShell, cmd, and Python.
Also, credit where credit is due: this is a hack-job on the following: https://stackoverflow.com/a/44183234/12834479
Rather than working locally, I'm hitting a network share over VPN with abysmal connection speeds.
Roughly, it's working at 8 secs / PDF.
The goal is to ensure each PDF is readable by Adobe. Images saved as PDF (but not true PDFs) will open in some PDF software, but Adobe hates them. I have the method to convert them; my rate limiter is identifying them.
Adobe PDFs - start with %PDF
Some bank PDFs - start with "blank space", then %PDF
3rd-party software - junk headers, but %PDF is within the document
$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}
$array = @()
$logFile = "RESULTS_$(Get-Date -Format yyyyMMdd).log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -NoNewline -ForegroundColor Yellow $msg
foreach ($item in $items)
{
    trap { Write-Output "Error trapped: $_"; continue }
    try {
        $pdfText = Get-Content $item -Raw
        $ptr3 = '%PDF'
        if ('%PDF' -ne $pdfText.SubString(([System.Math]::Max(0, $pdfText.IndexOf($ptr3))), 4)) {
            $array += "$item |-failed"
            $badCounter += 1
        } else {
            $goodCounter += 1
        }
        continue
    }
    catch [System.Exception] { Write-Output "$item $_" }
}
$totalCounter = $badCounter + $goodCounter
Write-Output $array >> $logFile
1..3 | %{ Write-Output "" >> $logFile }
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
In case it makes any difference, I'm currently running PS version 7.1.3, but I also have 5.1.18 locally.

Actually, PDF files aren't plaintext files at all, but binary files, so you should not read them in as a string.
What you are looking for is the file's FourCC magic number: a four-character code that identifies the file type.
For PDF files, these 4 bytes are 0x25, 0x50, 0x44, 0x46 ("%PDF"), and the file should start with those bytes.
For those true PDF files, you could test with:
[byte[]]$fourCC = Get-Content -Encoding Byte -ReadCount 4 -TotalCount 4 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($fourCC) -ceq '%PDF') {
    Write-Host "This is a true PDF file"
}
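Note that -Encoding Byte only exists in Windows PowerShell (5.1 and earlier); since you mention running 7.1.3, the equivalent there is the -AsByteStream switch:
# PowerShell 6+ removed -Encoding Byte in favor of -AsByteStream
[byte[]]$fourCC = Get-Content -AsByteStream -ReadCount 4 -TotalCount 4 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($fourCC) -ceq '%PDF') {
    Write-Host "This is a true PDF file"
}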
However, as you say some bank PDFs start with a blank space before %PDF, to also consider those files "good", you can do:
[byte[]]$sixCC = Get-Content -Encoding Byte -ReadCount 6 -TotalCount 6 -Path 'X:\TheFile.pdf'
if ([System.Text.Encoding]::ASCII.GetString($sixCC) -cmatch '%PDF') {
    Write-Host "This is a PDF file"
}
If you also want to treat files where "%PDF" is found anywhere in the file as "good", you will need to read the whole file as a string, but with a one-to-one byte-to-character mapping.
For that, you can use the helper function below:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $Stream = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}
Next, you can use that function as:
$binString = ConvertTo-BinaryString -Path 'X:\TheFile.pdf'
if ($binString.IndexOf("%PDF") -ge 0) {
    Write-Host "This is a PDF file"
}
Putting it all together and assuming you want all files marked as .PDF files where the magic number '%PDF' (case-sensitive) can be found anywhere in the file:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding = [Text.Encoding]::GetEncoding(28591)
    $Stream = [System.IO.FileStream]::new($Path, 'Open', 'Read')
    $StreamReader = [System.IO.StreamReader]::new($Stream, $Encoding)
    $BinaryText = $StreamReader.ReadToEnd()
    $StreamReader.Close()
    $Stream.Close()
    return $BinaryText
}
$badCounter = 0
$goodCounter = 0
$logFile = "RESULTS_{0:yyyyMMdd}.log" -f (Get-Date)
# get an array of pdf file FullNames
$files = @(Get-ChildItem -File -Filter '*.pdf').FullName
Write-Host "Processing $($files.Count) files... " -ForegroundColor Yellow
# loop through the array, test if '%PDF' is found and output strings for the log file
$result = foreach ($item in $files) {
    $pdfText = ConvertTo-BinaryString -Path $item
    if ($pdfText.IndexOf("%PDF") -ge 0) {
        $goodCounter++
        "Success - $item"
    }
    else {
        $badCounter++
        "Fail - $item"
    }
}
# write the output to the log file
$result | Set-Content -Path $logFile
"=" * 25 | Add-Content -Path $logFile
"BAD: $badCounter" | Add-Content -Path $logFile
"GOOD: $goodCounter" | Add-Content -Path $logFile
"Total: $($files.Count)" | Add-Content -Path $logFile
Write-Host "DONE!" -ForegroundColor Green


Use PowerShell to find and replace hex values in binary files

UPDATE:
I got a working script to accomplish the task. I needed to batch-process a bunch of files, so it accepts a CSV file formatted as FileName,OriginalHEX,CorrectedHEX. It's very slow even after limiting the search to the first 512 bytes. It could probably be written better and made faster. Thanks for the help.
UPDATE 2: Revised the search method to be faster, but it's nowhere near as fast as a dedicated hex editor. Be aware that it's memory intensive: it peaks at around 32x the size of the file in RAM (10 MB = 320 MB of RAM; 100 MB = 3.2 GB), so I don't recommend it for big files. It also saves to a new file instead of overwriting; the original file is renamed to File.ext_Original#date-time.
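For reference, a hypothetical HEXCorrection.csv would look like this (the file names and hex strings below are made up; each hex string must have an even number of characters, two per byte):
FileName,OriginalHEX,CorrectedHEX
sample1.bin,4A6F686E,4A616E65
sample2.bin,0D0A0D0A,0D0A2020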
Import-CSV $PSScriptRoot\HEXCorrection.csv | ForEach-Object {
    $File = $_.'FileName'
    $Find = $_.'OriginalHEX'
    $Replace = $_.'CorrectedHEX'
    IF (([System.IO.File]::Exists("$PSScriptRoot\$File"))) {
        $Target = (Get-ChildItem -Path $PSScriptRoot\$File)
    } ELSE {
        Write-Host $File "- File Not Found`n" -ForegroundColor 'Red'
        RETURN
    }
    Write-Host "File: "$Target.Name`n"Find: "$Find`n"Replace: "$Replace
    $TargetLWT = $Target.LastWriteTime
    $TargetCT = $Target.CreationTime
    IF ($Target.IsReadOnly) {
        Write-Host $Target.Name "- Is Read-Only`n" -ForegroundColor 'Red'
        RETURN
    }
    $FindLen = $Find.Length
    $ReplaceLen = $Replace.Length
    $TargetLen = (1..$Target.Length)
    IF (!($FindLen % 2 -eq 0) -OR !($ReplaceLen % 2 -eq 0) -OR
        [String]::IsNullOrEmpty($FindLen) -OR [String]::IsNullOrEmpty($ReplaceLen)) {
        Write-Host "Input hex values are not even or empty" -ForegroundColor 'DarkRed'
        RETURN
    } ELSEIF ($FindLen -ne $ReplaceLen) {
        Write-Host "Input hex values are different lengths" -ForegroundColor 'DarkYellow'
        RETURN
    }
    $FindAsBytes = New-Object System.Collections.ArrayList
    $Find -split '(.{2})' | ? {$_} | % { $FindAsBytes += [Convert]::ToInt64($_,16) }
    $ReplaceAsBytes = New-Object System.Collections.ArrayList
    $Replace -split '(.{2})' | ? {$_} | % { $ReplaceAsBytes += [Convert]::ToInt64($_,16) }
    # ^-- convert to base 10
    Write-Host "Starting Search"
    $FileBytes = [IO.File]::ReadAllBytes($Target)
    FOREACH ($Byte in $FileBytes) {
        $ByteCounter++
        IF ($Byte -eq [INT64]$FindAsBytes[0]) {
            TRY {
                (1..([INT64]$FindAsBytes.Count-1)) | % {
                    $Test = ($FileBytes[[INT64]$ByteCounter-1+[INT64]$_] -eq $FindAsBytes[$_])
                    IF ($Test -ne 'True') {
                        THROW
                    }
                }
                Write-Host "Found at Byte:" $ByteCounter -ForegroundColor 'Green'
                (0..($ReplaceAsBytes.Count-1)) | % {
                    $FileBytes[[INT64]$ByteCounter+[INT64]$_-1] = $ReplaceAsBytes[$_] }
                $Found = 'True'
                $BytesReplaces = $BytesReplaces + [INT64]$ReplaceAsBytes.Count
            }
            CATCH {}
        }
    }
    IF ($Found -eq 'True') {
        [IO.File]::WriteAllBytes("$Target-temp", $FileBytes)
        $OriginalName = $Target.Name+'_Original'+'#'+(Get-Date).ToString('yyMMdd-HHmmss')
        Rename-Item -LiteralPath $Target.FullName -NewName $OriginalName
        Rename-Item $Target"-temp" -NewName $Target.Name
        # Preserve Last Modified Time
        $Target.LastWriteTime = $TargetLWT
        $Target.CreationTime = $TargetCT
        Write-Host $BytesReplaces "Bytes Replaced" -ForegroundColor 'Green'
        Write-Host "Original saved as:" $OriginalName
    } ELSE {
        Write-Host "No Matches" -ForegroundColor 'Red'
    }
    Write-Host "Finished Search`n"
    Remove-Variable -Name * -ErrorAction SilentlyContinue
} # end foreach from line 1
PAUSE
Original: This has been asked before, but I found no solutions that perform a simple, straight-up find-hex-value-and-replace-hex-value on large files, 100 MB+.
Even better would be any recommendations for a hex editor with command-line support for this task.
Here's a first crack at it:
(get-content -encoding byte file) -replace '\b10\b',11 -as 'byte[]'
I was checking those other links, but the only answer that does search and replace has some bugs. I voted to reopen. The mklement0 one is close. None of them search and then print the position of the replacement.
Nevermind. Yours is faster and uses less memory.
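For comparison, here is a sketch of a plain byte-array search-and-replace that skips the string conversion entirely; it reads the whole file into memory once, so RAM usage stays near the file size (the path and patterns below are made up, and find/replace are assumed to be equal length):
$bytes   = [IO.File]::ReadAllBytes('C:\test\target.bin')
$find    = [byte[]](0x25,0x50,0x44,0x46)   # pattern to locate
$replace = [byte[]](0x25,0x50,0x44,0x47)   # must be the same length
for ($i = 0; $i -le $bytes.Length - $find.Length; $i++) {
    $match = $true
    for ($j = 0; $j -lt $find.Length; $j++) {
        if ($bytes[$i + $j] -ne $find[$j]) { $match = $false; break }
    }
    if ($match) {
        "Found at byte offset $i"
        [Array]::Copy($replace, 0, $bytes, $i, $replace.Length)
        $i += $find.Length - 1   # skip past the span just replaced
    }
}
[IO.File]::WriteAllBytes('C:\test\target.bin', $bytes)
Unlike the linked answers, this also prints the position of each replacement.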

Prepend "!" to the beginning of the first line of a file

I have several files that I need to add a "!" to the beginning, just on the first line. I still need to keep the first line's content, just add a "!" as the first character.
Any help would be really appreciated.
Thanks!
Edit:
The only thing I could figure out so far was to do the following:
$a = Get-Content 'hh_Regulars3.csv'
$b = '!'
Set-Content 'hh_Regulars3-new.csv' -value $b,$a
This just added the "!" to the top of the file, instead of to the beginning of the first line.
You sent an array to Set-Content with $b,$a. Each array item will be given its own line, as you have seen. It would be displayed the same way at the prompt if executed.
As long as the file is not too big, read it in as one string and add the character in.
$path = 'hh_Regulars3.csv'
"!" + (Get-Content $path -Raw) | Set-Content $path
If you only have PowerShell 2.0, then Out-String would work in place of -Raw:
"!" + (Get-Content $path | Out-String) | Set-Content $path
The parentheses are important to make sure the file is read in full before it goes through the pipeline. That allows us to both read and write the same file in one pipeline.
If the file is larger, look into using StreamReader and StreamWriter. They would also have to be used if the trailing newline created by Add-Content and Set-Content is not wanted.
Late to the party, but thought this might be useful. I needed to perform the operation over a thousand+ large files, and needed something a little more robust and less prone to OOM exceptions. Ended up just writing it leveraging .Net libraries:
function PrependTo-File {
    [cmdletbinding()]
    param(
        [Parameter(
            Position=1,
            ValueFromPipeline=$true,
            Mandatory=$true,
            ValueFromPipelineByPropertyName=$true
        )]
        [System.IO.FileInfo]
        $file,
        [string]
        [Parameter(
            Position=0,
            ValueFromPipeline=$false,
            Mandatory=$true
        )]
        $content
    )
    process{
        if(!$file.exists){
            write-error "$file does not exist";
            return;
        }
        $filepath = $file.fullname;
        $tmptoken = (get-location).path + "\_tmpfile" + $file.name;
        write-verbose "$tmptoken created as buffer";
        $tfs = [System.io.file]::create($tmptoken);
        $fs = [System.IO.File]::Open($file.fullname,[System.IO.FileMode]::Open,[System.IO.FileAccess]::ReadWrite);
        try{
            $msg = $content.tochararray();
            $tfs.write($msg,0,$msg.length);
            $fs.position = 0;
            $fs.copyTo($tfs);
        }
        catch{
            write-verbose $_.Exception.Message;
        }
        finally{
            $tfs.close();
            # close calls dispose and gc.supressfinalize internally
            $fs.close();
            if($error.count -eq 0){
                write-verbose ("updating $filepath");
                [System.io.File]::Delete($filepath);
                [System.io.file]::Move($tmptoken,$filepath);
            }
            else{
                $error.clear();
                write-verbose ("an error occurred, rolling back. $filepath not affected");
                [System.io.file]::Delete($tmptoken);
            }
        }
    }
}
Usage:
PS> get-item fileName.ext | PrependTo-File "contentToAdd`r`n"
This one-liner might work:
get-ChildItem *.txt | % { [System.Collections.ArrayList]$lines = Get-Content $_;
    $lines[0] = $lines[0].Insert(0,"!");
    Set-Content "new_$($_.name)" -Value $lines }
Try this:
$a = get-content "c:\yourfile.csv"
$a[0] = "!" + $a[0]
$a | set-content "c:\newfile.csv"

Powershell editing mp3 infos

I'm searching for a way to edit mp3 files info (like artist, album, etc) in a PowerShell script.
I found a way to get the info I need on the .mp3 files, but not how to modify them.
$songs = Get-ChildItem $dir -Filter *.mp3;
$shell = new-object -com shell.application;
Foreach ($song in $songs) {
    $shellfolder = $shell.namespace($dir);
    $shellfile = $shellfolder.parsename($song);
    $title = $shell.namespace($dir).getdetailsof($shellfile,21);
}
With getDetailsOf (index 21) I'm able to get the song title, but no setDetailsOf exists, so I don't know how I can change the song title to a new one.
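The Shell.Application COM object is read-only for these properties, so writing tags requires another approach. If you are willing to use a library, TagLib# is the usual choice; here is a minimal sketch, assuming you have downloaded taglib-sharp.dll (the DLL path and file names below are made up):
Add-Type -Path 'C:\libs\taglib-sharp.dll'   # assumes the TagLib# DLL is available here
$mp3 = [TagLib.File]::Create('C:\Music\song.mp3')
$mp3.Tag.Title      = 'New Title'
$mp3.Tag.Performers = @('New Artist')       # artists are a string array in TagLib#
$mp3.Tag.Album      = 'New Album'
$mp3.Save()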
If you're like me and prefer not to use a library, here's a pretty simple ID3v1 function I wrote that may serve your needs:
#Set the specified ID3v1 properties of a file by writing the last 128 bytes
Function Set-ID3v1( #All parameters except path are optional; they will not change if not specified.
    [string]$path, #Full path to the file to be updated - wildcards not supported because [] are so stinky and it's only supposed to work on one file at a time.
    [string]$Title = "`0", #a string containing only 0 indicates a parameter not specified.
    [string]$Artist = "`0",
    [string]$Album = "`0",
    [string]$Year = "`0",
    [string]$Comment = "`0",
    [int]$Track = -1,
    [int]$Genre = -1,
    [bool]$BackDate = $true){ #Preserve modification date, but add a minute to indicate it's newer than duplicates
    $CurrentModified = (Get-ChildItem -LiteralPath $path).LastWriteTime #use LiteralPath here to get only one file, even if it has []
    Try{
        $enc = [System.Text.Encoding]::ASCII #Probably wrong, but works occasionally. See https://stackoverflow.com/questions/9857727/text-encoding-in-id3v2-3-tags
        $currentID3Bytes = New-Object byte[] (128)
        $strm = New-Object System.IO.FileStream ($path,[System.IO.FileMode]::Open,[System.IO.FileAccess]::ReadWrite,[System.IO.FileShare]::None)
        $strm.Seek(-128,'End') | Out-Null #Basic ID3v1 info is 128 bytes from EOF
        $strm.Read($currentID3Bytes,0,$currentID3Bytes.Length) | Out-Null
        Write-Host "$path `nCurrentID3: $($enc.GetString($currentID3Bytes))"
        $strm.Seek(-128,'End') | Out-Null #Basic ID3v1 info is 128 bytes from EOF
        If($enc.GetString($currentID3Bytes[0..2]) -ne 'TAG'){
            Write-Warning "No existing ID3v1 found - adding to end of file"
            $strm.Seek(0,'End') | Out-Null
            $currentID3Bytes = $enc.GetBytes(('TAG' + (' ' * (30 + 30 + 30 + 4 + 30)))) #Add a blank tag to the end of the file
            $currentID3Bytes += 255 #empty Genre
            $strm.Write($currentID3Bytes,0,$currentID3Bytes.length)
            $strm.Flush()
            $strm.Close()
            $strm = New-Object System.IO.FileStream ($path,[System.IO.FileMode]::Open,[System.IO.FileAccess]::Write,[System.IO.FileShare]::None)
            $strm.Seek(-128,'End') | Out-Null
        }
        $strm.Seek(3,'Current') | Out-Null #skip over 'TAG' to get to the good stuff
        If($Title -eq "`0"){ $strm.Seek(30,'Current') | Out-Null } #Skip over
        Else{ $strm.Write($enc.GetBytes($Title.PadRight(30,' ').Substring(0,30)),0,30) } #if specified, write 30 space-padded bytes to the stream
        If($Artist -eq "`0"){ $strm.Seek(30,'Current') | Out-Null }
        Else{ $strm.Write($enc.GetBytes($Artist.PadRight(30,' ').Substring(0,30)),0,30) }
        If($Album -eq "`0"){ $strm.Seek(30,'Current') | Out-Null }
        Else{ $strm.Write($enc.GetBytes($Album.PadRight(30,' ').Substring(0,30)),0,30) }
        If($Year -eq "`0"){ $strm.Seek(4,'Current') | Out-Null }
        Else{ $strm.Write($enc.GetBytes($Year.PadRight(4,' ').Substring(0,4)),0,4) }
        If(($Track -ne -1) -or ($currentID3Bytes[125] -eq 0)){ $CommentMaxLen = 28 }Else{ $CommentMaxLen = 30 } #If a Track is specified or present in the file, Comment is 28 chars
        If($Comment -eq "`0"){ $strm.Seek($CommentMaxLen,'Current') | Out-Null }
        Else{ $strm.Write($enc.GetBytes($Comment.PadRight($CommentMaxLen,' ').Substring(0,$CommentMaxLen)),0,$CommentMaxLen) }
        If($Track -eq -1){ $strm.Seek(2,'Current') | Out-Null }
        Else{ $strm.Write([byte[]]@(0,$Track),0,2) } #Track, if present, is preceded by a 0-byte to form the last two bytes of Comment
        If($Genre -ne -1){ $strm.WriteByte([byte]$Genre) } #Genre is the final single byte
    }Catch{
        Write-Error $_.Exception.Message
    }Finally{
        If($strm){
            $strm.Flush()
            $strm.Close()
        }
    }
    If($BackDate){ (Get-ChildItem -LiteralPath $path).LastWriteTime = $CurrentModified.AddMinutes(1) }
}
You call it with the full path to the MP3 file and any ID3V1 attributes you want to change:
Set-ID3v1 -path "c:\users\me\Desktop\Test.mp3" -Year 1996 -Title "This is a test"
Basically it writes the last 128 bytes of the file with Title, Artist, etc.
I might add support for ID3v2.3 which is much more flexible, but less compatible with older devices and software. Check GitHub for the latest version.

powershell binary file comparison

All,
There is an application which generates its export dumps. I need to write a script that will compare the previous day's dump against the latest and, if there are differences between them, do some basic manipulation of the moving-and-deleting sort.
I have tried finding a suitable way of doing it, and the method I tried was:
$var_com = diff (get-content D:\local\prodexport1 -encoding Byte) (get-content D:\local\prodexport2 -encoding Byte)
I tried the Compare-Object cmdlet as well. I notice very high memory usage, and eventually I get a System.OutOfMemoryException after a few minutes. Has one of you done something similar? Some thoughts, please.
There was a thread which mentioned a hash comparison, which I have no idea how to go about.
Thanks in advance, folks
Osp
With PowerShell 4 you can use native cmdlets to do this:
function CompareFiles {
    param(
        [string]$Filepath1,
        [string]$Filepath2
    )
    if ((Get-FileHash $Filepath1).Hash -eq (Get-FileHash $Filepath2).Hash) {
        Write-Host 'Files Match' -ForegroundColor Green
    } else {
        Write-Host 'Files do not match' -ForegroundColor Red
    }
}
PS C:\> CompareFiles .\20131104.csv .\20131104-copy.csv
Files Match
PS C:\> CompareFiles .\20131104.csv .\20131107.csv
Files do not match
You could easily modify the above function to return a $true or $false value if you want to use this programmatically on a large scale.
EDIT
After seeing this answer, I just wanted to supply a larger-scale version that simply returns true or false:
function CompareFiles
{
    param
    (
        [parameter(
            Mandatory = $true,
            HelpMessage = "Specifies the 1st file to compare. Make sure it's an absolute path with the file name and its extension."
        )]
        [string]
        $file1,

        [parameter(
            Mandatory = $true,
            HelpMessage = "Specifies the 2nd file to compare. Make sure it's an absolute path with the file name and its extension."
        )]
        [string]
        $file2
    )

    ( Get-FileHash $file1 ).Hash -eq ( Get-FileHash $file2 ).Hash
}
You could use fc.exe. It comes with Windows. Here's how you would use it:
fc.exe /b d:\local\prodexport2 d:\local\prodexport1 > $null
if (!$?) {
    "The files are different"
}
Another method is to compare the MD5 hashes of the files:
$Filepath1 = 'c:\testfiles\testfile.txt'
$Filepath2 = 'c:\testfiles\testfile1.txt'
$hashes =
    foreach ($Filepath in $Filepath1,$Filepath2)
    {
        $MD5 = [Security.Cryptography.HashAlgorithm]::Create( "MD5" )
        $stream = ([IO.StreamReader]"$Filepath").BaseStream
        -join ($MD5.ComputeHash($stream) |
            ForEach { "{0:x2}" -f $_ })
        $stream.Close()
    }
if ($hashes[0] -eq $hashes[1])
    {'Files Match'}
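On PowerShell 4+ the same comparison is shorter with the built-in cmdlet, since Get-FileHash takes an -Algorithm parameter:
$md5a = (Get-FileHash -Algorithm MD5 'c:\testfiles\testfile.txt').Hash
$md5b = (Get-FileHash -Algorithm MD5 'c:\testfiles\testfile1.txt').Hash
if ($md5a -eq $md5b) { 'Files Match' }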
A while back I wrote an article on a buffered comparison routine to compare two files with PowerShell:
function FilesAreEqual {
    param(
        [System.IO.FileInfo] $first,
        [System.IO.FileInfo] $second,
        [uint32] $bufferSize = 524288)

    if ($first.Length -ne $second.Length) { return $false }
    if ( $bufferSize -eq 0 ) { $bufferSize = 524288 }

    $fs1 = $first.OpenRead()
    $fs2 = $second.OpenRead()
    $one = New-Object byte[] $bufferSize
    $two = New-Object byte[] $bufferSize
    $equal = $true

    do {
        $bytesRead = $fs1.Read($one, 0, $bufferSize)
        $fs2.Read($two, 0, $bufferSize) | out-null
        if ( -Not [System.Linq.Enumerable]::SequenceEqual($one, $two)) {
            $equal = $false
        }
    } while ($equal -and $bytesRead -eq $bufferSize)

    $fs1.Close()
    $fs2.Close()
    return $equal
}
You can use it by:
FilesAreEqual c:\temp\test.html c:\temp\test.html
A hash (like MD5) needs to traverse the entire file to do the hash calculation. This script returns as soon as it sees a difference in the buffer. It compares the buffers using LINQ, which is faster than native PowerShell.
if ( (Get-FileHash c:\testfiles\testfile1.txt).Hash -eq (Get-FileHash c:\testfiles\testfile2.txt).Hash ) {
    Write-Output "Files match"
} else {
    Write-Output "Files do not match"
}

How can I split a text file using PowerShell?

I need to split a large (500 MB) text file (a log4net exception file) into manageable chunks like 100 5 MB files would be fine.
I would think this should be a walk in the park for PowerShell. How can I do it?
A word of warning about some of the existing answers - they will run very slow for very big files. For a 1.6 GB log file I gave up after a couple of hours, realising it would not finish before I returned to work the next day.
Two issues: the call to Add-Content opens, seeks and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the newlines also slows things down, but my guess is that Add-Content is the main culprit.
The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:
$from = "C:\temp\large_log.txt"
$rootName = "C:\temp\large_log_chunk"
$ext = "txt"
$upperBound = 100MB
$fromFile = [io.file]::OpenRead($from)
$buff = new-object byte[] $upperBound
$count = $idx = 0
try {
    do {
        "Reading $upperBound"
        $count = $fromFile.Read($buff, 0, $buff.Length)
        if ($count -gt 0) {
            $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
            $toFile = [io.file]::OpenWrite($to)
            try {
                "Writing $count to $to"
                $tofile.Write($buff, 0, $count)
            } finally {
                $tofile.Close()
            }
        }
        $idx++
    } while ($count -gt 0)
}
finally {
    $fromFile.Close()
}
Simple one-liner to split based on number of lines (100 in this case):
$i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}
This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest to do is use the .NET StreamReader class to read the file line by line in your PowerShell script and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:
$upperBound = 50MB # calculated by Powershell
$ext = "log"
$rootName = "log_"
$reader = new-object System.IO.StreamReader("C:\Exceptions.log")
$count = 1
$fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if((Get-ChildItem -path $fileName).Length -ge $upperBound)
    {
        ++$count
        $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    }
}
$reader.Close()
Same as all the answers here, but using StreamReader/StreamWriter to split on new lines (line by line, instead of trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.
Note: I do very little error checking, so I can't guarantee it'll work smoothly for your case. It did for mine (1.7 GB TXT file of 4 million lines split in 100,000 lines per file in 95 seconds).
#split test
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$filename = "C:\Users\Vincent\Desktop\test.txt"
$rootName = "C:\Users\Vincent\Desktop\result"
$ext = ".txt"
$linesperFile = 100000#100k
$filecount = 1
$reader = $null
try{
$reader = [io.file]::OpenText($filename)
try{
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
while($reader.EndOfStream -ne $true) {
"Reading $linesperFile"
while( ($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)){
$writer.WriteLine($reader.ReadLine());
$linecount++
}
if($reader.EndOfStream -ne $true) {
"Closing file"
$writer.Dispose();
"Creating file number $filecount"
$writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName,$filecount.ToString("000"),$ext))
$filecount++
$linecount = 0
}
}
} finally {
$writer.Dispose();
}
} finally {
$reader.Dispose();
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
Output splitting a 1.7 GB file:
...
Creating file number 45
Reading 100000
Closing file
Creating file number 46
Reading 100000
Closing file
Creating file number 47
Reading 100000
Closing file
Creating file number 48
Reading 100000
Split complete in 95.6308289 seconds
I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Count
# (Or -c) The maximum number of lines in each file.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv 3000 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
    [CmdletBinding(DefaultParameterSetName='Path')]
    param(
        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('c')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Count,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount
    )

    process {
        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{
            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt = [IO.Path]::GetExtension($_)
            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks
            $Part = 1
            Get-Content $_ -ReadCount:$Count | %{
                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # write this chunk to the output file
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $_
                $Part += 1
            }
        }
    }
}
I found this question while trying to split multiple contacts in a single vCard VCF file to separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object and changed null to $null.
$reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
$count = 1
$filename = "C:\Contacts\{0}.vcf" -f ($count)
while(($line = $reader.ReadLine()) -ne $null)
{
    Add-Content -path $fileName -value $line
    if($line -eq "END:VCARD")
    {
        ++$count
        $filename = "C:\Contacts\{0}.vcf" -f ($count)
    }
}
$reader.Close()
Many of these answers were too slow for my source files, which were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.
I found some of the previous answers which use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.
I didn't try Typhlosaurus's answer, but it looks to only do splits by file size, not line count.
The following has suited my purposes.
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
Write-Host "Reading source file..."
$lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
$totalLines = $lines.Length
Write-Host "Total Lines :" $totalLines
$skip = 0
$count = 100000; # Number of lines per file
# File counter, with sort friendly name
$fileNumber = 1
$fileNumberString = $filenumber.ToString("000")
while ($skip -le $totalLines) {
    $upper = $skip + $count - 1
    if ($upper -gt ($lines.Length - 1)) {
        $upper = $lines.Length - 1
    }
    # Write the lines
    [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt",$lines[($skip..$upper)])
    # Increment counters
    $skip += $count
    $fileNumber++
    $fileNumberString = $filenumber.ToString("000")
}
$sw.Stop()
Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"
For a 54 MB file, I get the output...
Reading source file...
Total Lines : 910030
Split complete in 1.7056578 seconds
I hope others looking for a simple, line-based splitting script that matches my requirements will find this useful.
There's also this quick (and somewhat dirty) one-liner:
$linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$I++; $linecount=0 } }
You can tweak the number of first lines per batch by changing the hard-coded 3000 value.
Do this:
FILE 1
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII
FILE 2
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII
FILE 3
Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII
etc…
I've made a little modification to split files based on the size of each part.
##############################################################################
#.SYNOPSIS
# Breaks a text file into multiple text files in a destination, where each
# file contains a maximum number of lines.
#
#.DESCRIPTION
# When working with files that have a header, it is often desirable to have
# the header information repeated in all of the split files. Split-File
# supports this functionality with the -rc (RepeatCount) parameter.
#
#.PARAMETER Path
# Specifies the path to an item. Wildcards are permitted.
#
#.PARAMETER LiteralPath
# Specifies the path to an item. Unlike Path, the value of LiteralPath is
# used exactly as it is typed. No characters are interpreted as wildcards.
# If the path includes escape characters, enclose it in single quotation marks.
# Single quotation marks tell Windows PowerShell not to interpret any
# characters as escape sequences.
#
#.PARAMETER Destination
# (Or -d) The location in which to place the chunked output files.
#
#.PARAMETER Size
# (Or -s) The maximum size of each file. Size must be expressed in MB.
#
#.PARAMETER RepeatCount
# (Or -rc) Specifies the number of "header" lines from the input file that will
# be repeated in each output file. Typically this is 0 or 1 but it can be any
# number of lines.
#
#.EXAMPLE
# Split-File bigfile.csv -s 20 -rc 1
#
#.LINK
# Out-TempFile
##############################################################################
function Split-File {
    [CmdletBinding(DefaultParameterSetName='Path')]
    param(
        [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$Path,

        [Alias("PSPath")]
        [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
        [String[]]$LiteralPath,

        [Alias('s')]
        [Parameter(Position=2,Mandatory=$true)]
        [Int32]$Size,

        [Alias('d')]
        [Parameter(Position=3)]
        [String]$Destination='.',

        [Alias('rc')]
        [Parameter()]
        [Int32]$RepeatCount
    )

    process {
        # yeah! the cmdlet supports wildcards
        if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
        elseif ($Path) { $ResolveArgs = @{Path=$Path} }

        Resolve-Path @ResolveArgs | %{
            $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
            $InputExt = [IO.Path]::GetExtension($_)
            if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

            # get the input file in manageable chunks
            $Part = 1
            $buffer = ""
            Get-Content $_ -ReadCount:1 | %{
                # make an output filename with a suffix
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                # In the first iteration the header will be
                # copied to the output file as usual
                # on subsequent iterations we have to do it
                if ($RepeatCount -and $Part -gt 1) {
                    Set-Content $OutputFile $Header
                }

                # test buffer size and dump data only if buffer is greater than size
                if ($buffer.length -gt ($Size * 1MB)) {
                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer
                    $Part += 1
                    $buffer = ""
                }
                # append the current line (also after a dump, so no line is lost)
                $buffer += $_ + "`r"
            }
            # flush whatever is left in the buffer to a final part file
            if ($buffer.length -gt 0) {
                $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                Write-Host "Writing $OutputFile"
                Add-Content $OutputFile $buffer
            }
        }
    }
}
Sounds like a job for the UNIX command split:
split MyBigFile.csv
It just split my 55 GB CSV file into 21k chunks in less than 10 minutes.
It's not native to PowerShell though, but comes with, for instance, the git for windows package https://git-scm.com/download/win
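For example, to split into 100,000-line chunks with numeric suffixes (the -l and -d options are standard GNU split flags):
split -l 100000 -d MyBigFile.csv part_
This produces part_00, part_01, and so on in the current directory.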
As the lines can be variable in logs, I thought it best to take a number-of-lines-per-file approach. The following code snippet processed a 4-million-line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:
$sourceFile = "c:\myfolder\mylargeTextyFile.csv"
$partNumber = 1
$batchSize = 500000
$pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
[System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one
$fs=New-Object System.IO.FileStream ($sourceFile,"OpenOrCreate", "Read", "ReadWrite",8,"None")
$streamIn=New-Object System.IO.StreamReader($fs, $enc)
$streamout = new-object System.IO.StreamWriter $pathAndFilename
$line = $streamIn.readline()
$counter = 0
while ($line -ne $null)
{
    $streamout.writeline($line)
    $counter += 1
    if ($counter -eq $batchsize)
    {
        $partNumber += 1
        $counter = 0
        $streamOut.close()
        $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
        $streamout = new-object System.IO.StreamWriter $pathAndFilename
    }
    $line = $streamIn.readline()
}
$streamin.close()
$streamout.close()
This can easily be turned into a function or script file with parameters to make it more versatile. It uses a StreamReader and StreamWriter to achieve its speed and tiny memory footprint.
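For instance, here is a sketch of the same logic wrapped as a function (the function name and parameters are illustrative, not from the original post):
function Split-TextFile {
    param(
        [Parameter(Mandatory)][string]$Path,        # source file
        [Parameter(Mandatory)][string]$OutPrefix,   # e.g. 'c:\myfolder\part'
        [int]$LinesPerFile = 500000
    )
    $reader = [System.IO.File]::OpenText($Path)
    $part = 1
    $count = 0
    $writer = New-Object System.IO.StreamWriter "$OutPrefix$part.txt"
    try {
        while (-not $reader.EndOfStream) {
            $writer.WriteLine($reader.ReadLine())
            $count++
            if ($count -eq $LinesPerFile) {
                # roll over to the next part file
                $writer.Close()
                $part++
                $count = 0
                $writer = New-Object System.IO.StreamWriter "$OutPrefix$part.txt"
            }
        }
    }
    finally {
        $reader.Close()
        $writer.Close()
    }
}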
My requirement was a bit different. I often work with Comma Delimited and Tab Delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (whilst preserving the header row).
So, I reverted back to my classic VBScript method and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Windows).
The benefit of this method is that it uses Text Streams, so the underlying data isn't loaded into memory (or, at least, not all at once). The result is that it's exceptionally fast and it doesn't really need much memory to run. The test file I just split using this script on my i7 was about 1 GB in file size, had about 12 million lines of text and was split into 25 part files (each with about 500k lines) – the processing took about 2 minutes and it didn't go over 3 MB of memory used at any point.
The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF) as the Text Stream object uses the "ReadLine" function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.
Option Explicit

Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
Private Const REPEAT_HEADER_ROW = True
Private Const LINES_PER_PART = 500000

Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

sStart = Now()
sFileExt = Right(INPUT_TEXT_FILE,Len(INPUT_TEXT_FILE)-InstrRev(INPUT_TEXT_FILE,".")+1)
iLineCounter = 0
iOutputFile = 1

Set oFileSystem = CreateObject("Scripting.FileSystemObject")
Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

If REPEAT_HEADER_ROW Then
    iLineCounter = 1
    sHeaderLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sHeaderLine)
End If

Do While Not oInputFile.AtEndOfStream
    sLine = oInputFile.ReadLine()
    Call oOutputFile.WriteLine(sLine)
    iLineCounter = iLineCounter + 1
    If iLineCounter Mod LINES_PER_PART = 0 Then
        iOutputFile = iOutputFile + 1
        Call oOutputFile.Close()
        Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
        If REPEAT_HEADER_ROW Then
            Call oOutputFile.WriteLine(sHeaderLine)
        End If
    End If
Loop

Call oInputFile.Close()
Call oOutputFile.Close()
Set oFileSystem = Nothing

Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())
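To run it, save the code as, say, SplitFile.vbs (the name is arbitrary), adjust the constants at the top, and either double-click it or invoke it from a command prompt:
cscript //nologo SplitFile.vbs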
If this helps, it works perfectly for me.
The script checks a folder, parses all CSV files and checks the number of lines per file.
If a file contains more than 55,000 lines, the script splits it into sub-files of 50,000 lines and names them "_1, _2, ...".
At the end of the script, the original file is renamed so it won't be picked up and processed again.
foreach ($MyFile in $MyFolder)
{
    # Read parent CSV
    $InputFilename = $MyFile
    $InputFile = Get-Content $MyFile
    $OutputFilenamePattern = "$MyFile" + "_"

    Write-Host ".........."
    Write-Host ". File to process"
    Write-Host ".........."
    Write-Host "$MyVar_file_Path"
    Write-Host "$InputFilename"
    Write-Host "$OutputFilenamePattern"
    Write-Host ".........."

    $LineLimit = 50000

    # Initialize
    $line = 0
    $i = 0
    $file = 0
    $start = 0
    $nb_lines = (Get-Content $MyFile).Length

    Write-Host ".........."
    Write-Host "$nb_lines lines in the file"
    Write-Host ".........."

    if ($nb_lines -gt 55000)
    {
        # Loop all text lines
        while ($line -le $InputFile.Length)
        {
            # Generate child CSVs
            if ($i -eq $LineLimit -Or $line -eq $InputFile.Length)
            {
                $file++
                $Filename = "$OutputFilenamePattern$file.csv"
                # $InputFile[0] | Out-File $Filename -Force # Writes Header at the beginning of the line.
                If ($file -ne 1) { $InputFile[0] | Out-File $Filename -Force }
                $InputFile[$start..($line - 1)] | Out-File $Filename -Force -Append # Original line 19 with the addition of -Append so it doesn't overwrite the headers you just wrote.
                # $InputFile[$start..($line - 1)] | Out-File $Filename -Force
                $start = $line;
                $i = 0
                Write-Host "$Filename"
            }
            # Increment counters
            $i++;
            $line++
        }

        $Source_name = $MyVar_file_Path2 + "\" + $InputFilename
        $Destination_name = $MyVar_file_Path2 + "\" + "Splitted_" + $InputFilename

        Write-Host ".........."
        Write-Host ". File to rename"
        Write-Host ".........."
        Write-Host "$Source_name"
        Write-Host "$Destination_name"
        Write-Host ".........."

        Rename-Item $Source_name -NewName $Destination_name
    }
    Write-Host "."
    Write-Host "."
}
Here is my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1000 lines each. It's not quick, but it does the job.
$infile = "D:\Malcolm\Test\patch6.txt"
$path = "D:\Malcolm\Test\"
$lineCount = 1
$fileCount = 1

foreach ($computername in get-content $infile)
{
    # ${path} keeps the underscore from being parsed as part of the variable name
    write $computername | out-file -Append "${path}_$fileCount.txt"
    $lineCount++
    if ($lineCount -eq 1000)
    {
        $fileCount++
        $lineCount = 1
    }
}
}