In PowerShell, how do I split a large binary file?

I've seen the answer elsewhere for text files, but I need to do this for a compressed file.
I've got a 6G binary file which needs to be split into 100M chunks. Am I missing the analog for unix's "head" somewhere?

Never mind. Here you go:
function split($inFile, $outPrefix, [Int32] $bufSize) {
    $stream = [System.IO.File]::OpenRead($inFile)
    $chunkNum = 1
    $barr = New-Object byte[] $bufSize
    while ($bytesRead = $stream.Read($barr, 0, $bufSize)) {
        $outFile = "$outPrefix$chunkNum"
        $ostream = [System.IO.File]::OpenWrite($outFile)
        $ostream.Write($barr, 0, $bytesRead)
        $ostream.Close()
        echo "wrote $outFile"
        $chunkNum += 1
    }
    # release the input file handle
    $stream.Close()
}
Assumption: bufSize fits in memory.

The answer to the corollary question: How do you put them back together?
function stitch($infilePrefix, $outFile) {
    $ostream = [System.IO.File]::OpenWrite($outFile)
    $chunkNum = 1
    $infileName = "$infilePrefix$chunkNum"
    while (Test-Path $infileName) {
        $bytes = [System.IO.File]::ReadAllBytes($infileName)
        $ostream.Write($bytes, 0, $bytes.Count)
        Write-Host "read $infileName"
        $chunkNum += 1
        $infileName = "$infilePrefix$chunkNum"
    }
    $ostream.Close()
}
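For reference, a hypothetical invocation (the paths and chunk prefix below are placeholders, not from the original post) could look like this:

# Split a 6 GB file into 100 MB pieces named chunk1, chunk2, ...
split 'D:\backup\huge.bin' 'D:\backup\chunk' 100MB
# ...and later glue them back together
stitch 'D:\backup\chunk' 'D:\backup\huge_rejoined.bin'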

I answered the question alluded to by bernd_k in this question's comments, but in this case I would use -ReadCount instead of -TotalCount, e.g.
Get-Content bigfile.bin -ReadCount 100MB -Encoding byte
This causes Get-Content to read the file one chunk at a time, where the unit of the chunk size is a line for text encodings or a byte for byte encoding. Keep in mind that when it does this, you get an array passed down the pipeline rather than individual bytes or lines of text.
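A minimal sketch of splitting a file this way (hypothetical chunk names; assumes Windows PowerShell, where -Encoding Byte is available, while PowerShell 7+ uses -AsByteStream instead). It is much slower than the stream-based split above:

$chunkNum = 1
Get-Content bigfile.bin -ReadCount 100MB -Encoding Byte | ForEach-Object {
    # each pipeline object is a byte[] of up to 100 MB
    Set-Content -Path "chunk$chunkNum" -Value $_ -Encoding Byte
    $chunkNum++
}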

Related

UTF-8 BOM to UTF-8 Conversion for a large file

Based on the suggestion from this thread, I have used PowerShell to do the UTF-8 conversion. Now I am running into another problem: I have a very large file of around 18 GB which I am trying to convert on a machine with around 50 GB of RAM free, but the conversion process eats up all the RAM and the encoding fails. Is there a way to limit the RAM usage, or to do the conversion in chunks?
Using PowerShell to write a file in UTF-8 without the BOM
BTW, below is the exact code:
foreach ($file in ls -name $Path\CM*.csv)
{
    $file_content = Get-Content "$Path\$file";
    [System.IO.File]::WriteAllLines("$Path\$file", $file_content);
    echo "encoding done : $file"
}
Don't store the file's content in memory. As noted here, doing so requires 3-4 times the file size in RAM. Get-Content is slow but quite memory efficient, so a simple solution may be:
Get-Content -Path <FilePath> | Out-File -FilePath <FilePath> -Encoding UTF8
Note: While I haven't tried this, you may want to use Add-Content instead of Out-File. The latter will sometimes reformat output according to console width; that is characteristic of the Out-* cmdlets, which traverse the for-display formatting system.
Because the content is streamed down the pipe, only one line at a time is stored in RAM. .NET garbage collection runs in the background, releasing and otherwise managing memory.
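Note that the input and output paths must differ; you cannot pipe a file into itself this way. A rough sketch (hypothetical paths; assumes PowerShell 7+, where -Encoding UTF8 writes without a BOM) that streams to a temporary file and then replaces the original:

$in  = 'D:\data\big.csv'          # hypothetical input path
$tmp = 'D:\data\big.csv.utf8.tmp' # hypothetical temporary output path
Get-Content -Path $in | Out-File -FilePath $tmp -Encoding UTF8
Move-Item -Path $tmp -Destination $in -Force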
Note: [System.IO.StreamReader] and [System.IO.StreamWriter] can probably also address this issue. They may be faster, and are just as memory efficient, but they come with a syntax burden that may not be worth it, particularly if this is a one-off. That said, you can instantiate them with a System.Text.Encoding instance, so in theory you can use them for the conversion.
When you know that the input file is always UTF-8 with BOM, you only need to strip the first three bytes (the BOM) from the file.
Using a buffered stream, you only need to load a fraction of the file into memory.
For best performance I would use a FileStream. This is a raw binary stream and thus has the least overhead.
$streamIn = $streamOut = $null
try {
    $streamIn  = [IO.FileStream]::new( $fullPathToInputFile, [IO.FileMode]::Open )
    $streamOut = [IO.FileStream]::new( $fullPathToOutputFile, [IO.FileMode]::Create )
    # Strip 3 bytes (the UTF-8 BOM) from the input file
    $null = $streamIn.Seek( 3, [IO.SeekOrigin]::Begin )
    # Copy the remaining bytes to the output file
    $streamIn.CopyTo( $streamOut )
    # You may try a custom buffer size for better performance:
    # $streamIn.CopyTo( $streamOut, 1MB )
}
finally {
    # Make sure to close the files even in case of an exception
    if( $streamIn )  { $streamIn.Close() }
    if( $streamOut ) { $streamOut.Close() }
}
You may experiment with the FileStream.CopyTo() overload that has a bufferSize parameter. In my experience, a larger buffer size (say 1 MiB) can improve performance considerably, but when it is too large, performance will suffer again because of bad cache use.
You can use a StreamReader and StreamWriter to do the conversion.
The StreamWriter by default outputs UTF8NoBOM.
This will take a lot of disk actions, but will be lean on memory.
Bear in mind that .Net needs full absolute paths.
$sourceFile = 'D:\Test\Blah.txt' # enter your own in- and output files here
$destinationFile = 'D:\Test\out.txt'
$reader = [System.IO.StreamReader]::new($sourceFile, [System.Text.Encoding]::UTF8)
$writer = [System.IO.StreamWriter]::new($destinationFile)
while ($null -ne ($line = $reader.ReadLine())) {
    $writer.WriteLine($line)
}
# clean up
$writer.Flush()
$reader.Dispose()
$writer.Dispose()
The above code will add a final newline to the output file. If that is unwanted, do this instead:
$sourceFile = 'D:\Test\Blah.txt'
$destinationFile = 'D:\Test\out.txt'
$reader = [System.IO.StreamReader]::new($sourceFile, [System.Text.Encoding]::UTF8)
$writer = [System.IO.StreamWriter]::new($destinationFile)
while ($null -ne ($line = $reader.ReadLine())) {
    if ($reader.EndOfStream) {
        $writer.Write($line)
    }
    else {
        $writer.WriteLine($line)
    }
}
# clean up
$writer.Flush()
$reader.Dispose()
$writer.Dispose()

PowerShell 7.0 how to compute hashsum of a big file read in chunks

The script should copy files and compute their hash sums.
My goal is to make a function which will read the file once, instead of 3 times (read_for_copy + read_for_hash + read_for_another_copy), to minimize network load.
So I tried to read a chunk of the file, compute the MD5 hash sum, and write the file out to several places.
The file's size may vary from 100 MB up to 2 TB and maybe more. There is no need to check file identity at this moment; I just need to compute the hash sum for the initial files.
And I am stuck with respect to computing the hash sum:
$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 10mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize
while ( $stream.Position -lt $stream.Length ) {
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)
    # I am stuck here
    $hash = [System.BitConverter]::ToString($md5.ComputeHash($buffer)) -replace "-",""
}
$stream.Close()
$makenew.Close()
$makenew2.Close()
How can I collect the chunks of data to compute the hash of the whole file?
And an extra question: is it possible to calculate the hash and write the data out in parallel? Especially taking into account that workflow { parallel { } } is not supported from PS version 6 onward.
Many thanks
If you want to handle input buffering manually, you need to use the TransformBlock/TransformFinalBlock methods exposed by $md5:
while($bytesRead = $stream.Read($buffer, 0, $bufferSize))
{
    # Write to file copies
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)
    # Feed next chunk to MD5 CSP
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}
# Complete the hashing routine
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
# Grab hash value from CSP
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
My goal is to make a function which will read the file once, instead of 3 times (read_for_copy + read_for_hash + read_for_another_copy), to minimize network load
I'm not entirely sure what you mean by network load here. If the source file is on a remote file share, but the new copies go onto a local file system, you can minimize network load by simply copying the source file once, then use that one copy as the source of the second copy and the hash calculation:
$ifile = "\\remoteMachine\c$\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
# Copy remote -> local
Copy-Item -Path $ifile -Destination $ofile
# Copy local -> local
Copy-Item -Path $ofile -Destination $ofile2
# Hash local file stream
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$stream = [System.IO.File]::OpenRead($ofile)
$hash = [BitConverter]::ToString($md5.ComputeHash($stream)).Replace('-','')
FWIW, passing the file stream object to $md5.ComputeHash($stream) directly is likely going to be faster than manually buffering the input.
Final listing
$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 1mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize
while ( $stream.Position -lt $stream.Length )
{
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)
    # TransformBlock returns the number of bytes processed, not a hash, so discard its return value
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
$hash
$stream.Flush()
$stream.Close()
$makenew.Flush()
$makenew.Close()
$makenew2.Flush()
$makenew2.Close()
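As a quick sanity check (not part of the original listing), the streamed MD5 can be compared against PowerShell's built-in Get-FileHash cmdlet:

# Should print True if the chunked hashing matches the built-in cmdlet
(Get-FileHash -Path $ofile -Algorithm MD5).Hash -eq $hash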

stop script when the target (find and replace text) is reached 1 or 2 or 3 times depending on config

What if some file contains strings like
11111111111111
22222222222222
33333333333333
44444444444444
22222222222222
33333333333333
22222222222222
11111111111111
and I need to find and replace 22222222222222 with 777777777777777 just 2 times while processing the file string by string using foreach, and then save the file with the changes.
$file = 'path_to\somefile.txt'
foreach($string in $file)
{
    $new_string = $string -replace '22222222222222', '777777777777777'
}
$string | Out-file $file
I understand that the script above will replace every occurrence and that the saved file will contain just one string, which does not meet my requirements.
As per the comment, one needs to keep track of how many replacements have been done.
To read a file line by line, use .NET's StreamReader.ReadLine(). In a while loop, keep reading until you are at the end of the file.
While reading lines, there are a few things to do. Keep track of how many times the replaceable string is encountered and, if need be, replace it. The result must be saved in both cases: lines that were replaced and all the other lines too.
It's slow to add content to a file line by line, even in the days of SSDs. A bulk operation is much more efficient. Save the modified data into a StringBuilder, and after the whole file is processed, write the contents in a single operation. If the file is going to be large (a gigabyte or more), consider writing 10,000 rows at a time instead (a sketch of that variant follows the example below). Like so,
$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyFile.csv")
$found = 0
$oldVal = '22222222222222'
$newVal = '777777777777777'
# Read the file line by line
while($null -ne ($line = $reader.ReadLine())) {
    # Only change the first two occurrences
    if( ($found -lt 2) -and ($line -eq $oldVal) ) {
        ++$found
        $line = $newVal
    }
    # Add data into a buffer
    [void]$sb.AppendLine($line)
}
Add-Content "SomeOtherFile.csv" $sb.ToString()
$reader.Close()
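For the very large files mentioned above, an untested sketch of the "10,000 rows at a time" variant (same $oldVal/$newVal values, hypothetical file names) could flush the buffer periodically instead of holding the whole file:

$sb = New-Object Text.StringBuilder
$reader = [IO.File]::OpenText("MyFile.csv")
$found = 0
$lineCount = 0
$oldVal = '22222222222222'
$newVal = '777777777777777'
while($null -ne ($line = $reader.ReadLine())) {
    if( ($found -lt 2) -and ($line -eq $oldVal) ) {
        ++$found
        $line = $newVal
    }
    [void]$sb.AppendLine($line)
    $lineCount++
    # Flush the buffer every 10,000 lines to keep memory usage flat
    if( ($lineCount % 10000) -eq 0 ) {
        Add-Content "SomeOtherFile.csv" $sb.ToString() -NoNewline
        [void]$sb.Clear()
    }
}
# Write whatever is left in the buffer
Add-Content "SomeOtherFile.csv" $sb.ToString() -NoNewline
$reader.Close()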

delete some sequence of bytes in Powershell [duplicate]

This question already has answers here: Methods to hex edit binary files via Powershell (4 answers). Closed 3 years ago.
I have a *.bin file. How can I delete with PowerShell some parts of the bytes (29 bytes, marked yellow), each containing a repeating sequence of bytes (12 bytes, marked with red pen)? Thanks a lot!
Using a very helpful article and accompanying function I found here, it seems it is possible to read a binary file and convert it to a string without altering any of the bytes, by using codepage 28591.
With that (I slightly changed the function), you can do this to delete the bytes in your *.bin file:
function ConvertTo-BinaryString {
    # converts the bytes of a file to a string that has a
    # 1-to-1 mapping back to the file's original bytes.
    # Useful for performing binary regular expressions.
    [OutputType([String])]
    Param (
        [Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
        [ValidateScript( { Test-Path $_ -PathType Leaf } )]
        [String]$Path
    )

    $Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
    # Note: Codepage 28591 returns a 1-to-1 char to byte mapping
    $Encoding     = [Text.Encoding]::GetEncoding(28591)
    $StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
    $BinaryText   = $StreamReader.ReadToEnd()

    $StreamReader.Close()
    $Stream.Close()

    return $BinaryText
}
$inputFile = 'D:\test.bin'
$outputFile = 'D:\test2.bin'
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
$binString = ConvertTo-BinaryString -Path $inputFile
# create your regex: 17 bytes in range of \x00 to \xFF followed by 12 bytes specific range
$re = [Regex]'[\x00-\xFF]{17}\xEB\x6F\xD3\x01\x18\x00{3}\xFF{3}\xFE'
# use a MemoryStream object to store the result
$ms = New-Object System.IO.MemoryStream
$pos = $replacements = 0
$re.Matches($binString) | ForEach-Object {
    # write the part of the byte array before the match to the MemoryStream
    $ms.Write($fileBytes, $pos, $_.Index - $pos)
    # update the 'cursor' position for the next match
    $pos = $_.Index + $_.Length
    # and count the number of replacements done
    $replacements++
}
# write the remainder of the bytes to the stream
$ms.Write($fileBytes, $pos, $fileBytes.Count - $pos)
# save the updated bytes to a new file (will overwrite existing file)
[System.IO.File]::WriteAllBytes($outputFile, $ms.ToArray())
$ms.Dispose()
if ($replacements) {
    Write-Host "$replacements replacement(s) made."
}
else {
    Write-Host "Byte sequence not found. No replacements made."
}
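Since each match spans exactly 29 bytes (17 + 12), a quick hypothetical check is that the output file shrinks by 29 bytes per replacement:

# Should print True: 29 bytes removed per match
((Get-Item $inputFile).Length - (Get-Item $outputFile).Length) -eq (29 * $replacements)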

Find last occurrence of ASCII byte in file and truncate file from that point to end?

In a Windows command line environment, I'd like to be able to search a binary file for the last (final) occurrence of hex 06 char ("Ack") and truncate the file from that char to the end of the file, meaning that the found char is also trimmed off. How can I do that? The files can be several hundred megabytes in size.
EDIT: To be fair, I did quite a lot of Googling for code ideas, but my search terms are not bringing me to some kind of way to tackle this. Something like "search binary file for ASCII char hex 06, find last occurrence of that char and truncate the file from that point on," is so vague as to be essentially useless. I'll keep looking!
If you start reading bytes from the end of the file you will find the last ACK (if there is one). Knowing its position, you can now truncate the file.
I'm not good at PowerShell, so there might be some cmdlet I don't know about, but this achieves what you want:
$filename = "C:\temp\FindAck.txt"
$file = Get-Item $filename
$len = $file.Length
$blockSize = 32768
$buffer = new-object byte[] $blockSize
$found = $false
$blockNum = [math]::floor($len / $blockSize)
$mode = [System.IO.FileMode]::Open
$access = [System.IO.FileAccess]::Read
$sharing = [IO.FileShare]::Read
$fs = New-Object IO.FileStream($filename, $mode, $access, $sharing)
$foundPos = -1
while (!$found -and $blockNum -ge 0) {
    $fs.Position = $blockNum * $blockSize
    $bytesRead = $fs.Read($buffer, 0, $blockSize)
    if ($bytesRead -gt 0) {
        for ($i = $bytesRead - 1; $i -ge 0; $i--) {
            if ($buffer[$i] -eq 6) {
                $foundPos = $blockNum * $blockSize + $i
                $found = $true
                break
            }
        }
    }
    $blockNum--
}
$fs.Dispose()
if ($foundPos -ne -1) {
    $mode = [System.IO.FileMode]::Open
    $access = [System.IO.FileAccess]::Write
    $sharing = [IO.FileShare]::Read
    $fs = New-Object IO.FileStream($filename, $mode, $access, $sharing)
    $fs.SetLength($foundPos)
    $fs.Dispose()
}
Write-Host $foundPos
The idea of reading in 32KB blocks is to get a reasonably sized chunk from the disk to process, rather than reading one byte at a time.
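After the truncation, a hypothetical way to confirm the result is that the file length now equals the reported position (the ACK byte at $foundPos itself has been cut off):

# Should print True when an ACK was found and the file was truncated
(Get-Item $filename).Length -eq $foundPos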
References:
Creating Byte[] in PowerShell
file IO, is this a bug in Powershell?
How to truncate a file in c#?