Find last occurrence of ASCII byte in file and truncate file from that point to end? - truncate

In a Windows command line environment, I'd like to be able to search a binary file for the last (final) occurrence of hex 06 char ("Ack") and truncate the file from that char to the end of the file, meaning that the found char is also trimmed off. How can I do that? The files can be several hundred megabytes in size.
EDIT: To be fair, I did quite a lot of Googling for code ideas, but my search terms aren't leading me to a way to tackle this. Something like "search binary file for ASCII char hex 06, find last occurrence of that char and truncate the file from that point on" is so vague as to be essentially useless. I'll keep looking!

If you start reading bytes from the end of the file you will find the last ACK (if there is one). Knowing its position, you can now truncate the file.
I'm not good at PowerShell, so there might be some cmdlet I don't know about, but this achieves what you want:
$filename = "C:\temp\FindAck.txt"
$file = Get-Item $filename
$len = $file.Length
$blockSize = 32768
$buffer = new-object byte[] $blockSize
$found = $false
$blockNum = [math]::floor($len / $blockSize)
$mode = [System.IO.FileMode]::Open
$access = [System.IO.FileAccess]::Read
$sharing = [IO.FileShare]::Read
$fs = New-Object IO.FileStream($filename, $mode, $access, $sharing)
$foundPos = -1
while (!$found -and $blockNum -ge 0) {
    $fs.Position = $blockNum * $blockSize
    $bytesRead = $fs.Read($buffer, 0, $blockSize)
    if ($bytesRead -gt 0) {
        for ($i = $bytesRead - 1; $i -ge 0; $i--) {
            if ($buffer[$i] -eq 6) {
                $foundPos = $blockNum * $blockSize + $i
                $found = $true
                break
            }
        }
    }
    $blockNum--
}
$fs.Dispose()
if ($foundPos -ne -1) {
    $mode = [System.IO.FileMode]::Open
    $access = [System.IO.FileAccess]::Write
    $sharing = [IO.FileShare]::Read
    $fs = New-Object IO.FileStream($filename, $mode, $access, $sharing)
    $fs.SetLength($foundPos)
    $fs.Dispose()
}
Write-Host $foundPos
The idea of reading in 32KB blocks is to get a reasonably sized chunk from the disk to process, rather than reading one byte at a time.
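For comparison, if the file is small enough to load into memory in one go, the same idea can be written more compactly. This is only a sketch (the path is the example one from above), and for multi-hundred-megabyte files the block-based version above is gentler on memory:
$filename = "C:\temp\FindAck.txt"
$bytes = [System.IO.File]::ReadAllBytes($filename)
# index of the last 0x06 byte, or -1 if there is none
$foundPos = [Array]::LastIndexOf($bytes, [byte]6)
if ($foundPos -ge 0) {
    $fs = [System.IO.File]::Open($filename, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Write)
    $fs.SetLength($foundPos)   # truncate at the ACK so the ACK itself is removed as well
    $fs.Dispose()
}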
References:
Creating Byte[] in PowerShell
file IO, is this a bug in Powershell?
How to truncate a file in c#?
Break

Related

Using Powershell to output characters (not lines) after a match in a large file

I use powershell to parse huge files and easily take a look at a small part of the file where a certain string occurs, like this:
Select-String P120300420059211107104259.txt -Pattern "<ID>9671510841" -Context 0,300
This gives me 300 lines of the file after the occurrence of that ID number.
But I've come across a file that has no carriage returns. Now I would like to do the same thing, but instead of lines being returned, I guess I need characters.
How would I do this?
I have never created scripts in powershell - just ran simple commands like the above.
I would like to see maybe 1000 characters after the matched string, within a huge file.
Thanks!
The problem with using Select-String or [Regex]::Matches() (or -match) to test for the presence of a substring in a single-line file is that you first need to read the whole file into memory at once.
The good news is that you don't need regular expressions to find a substring in a huge single-line text file - instead, you can read the file contents into memory in smaller chunks and then search through those - this way you don't need to store the entire file in memory at once.
Reading buffered text from a file is fairly straightforward:
Open a readable file stream
Create a StreamReader to read from the file stream
Start reading!
Then you just need to check whether:
The target substring is found in each chunk, or
The start of the target substring is partially found at the tail end of the current chunk
And then repeat until you find the substring, at which point you read the following 1000 characters.
Here's an example of how you could implement it as a script function (I've tried to explain the code in more detail in inline comments):
function Find-SubstringWithPostContext {
    [CmdletBinding(DefaultParameterSetName = 'wp')]
    param(
        [Alias('PSPath')]
        [Parameter(Mandatory = $true, ParameterSetName = 'lp', ValueFromPipelineByPropertyName = $true, ValueFromPipeline = $true)]
        [string[]]$LiteralPath,

        [Parameter(Mandatory = $true, ParameterSetName = 'wp', Position = 0)]
        [string[]]$Path,

        [Parameter(Mandatory = $true)]
        [ValidateLength(1, 5000)]
        [string]$Substring,

        [ValidateRange(2, 25000)]
        [int]$PostContext = 1000,

        [switch]$All,

        [System.Text.Encoding]$Encoding
    )

    begin {
        # start by ensuring we'll be using a buffer that's at least 4 times the length of the
        # target substring to avoid too many tail searches
        $bufferSize = 2000
        while ($Substring.Length -gt $bufferSize / 4) {
            $bufferSize *= 2
        }
        $buffer = [char[]]::new($bufferSize)
    }

    process {
        if ($PSCmdlet.ParameterSetName -eq 'wp') {
            # resolve input paths if necessary
            $LiteralPath = $Path | Convert-Path
        }

        :fileLoop
        foreach ($lp in $LiteralPath) {
            $file = Get-Item -LiteralPath $lp

            # skip directories
            if ($file -isnot [System.IO.FileInfo]) { continue }

            try {
                $fileStream = $file.OpenRead()
                $scanner = [System.IO.StreamReader]::new($fileStream, $true)

                do {
                    # remember the current offset in the file, we'll need this later
                    $baseOffset = $fileStream.Position

                    # read a chunk from the file, convert to string
                    $readCount = $scanner.ReadBlock($buffer, 0, $bufferSize)
                    $string = [string]::new($buffer, 0, $readCount)
                    $eof = $readCount -lt $bufferSize

                    # test if target substring is found in the chunk we just read
                    $indexOfTarget = $string.IndexOf($Substring)
                    if ($indexOfTarget -ge 0) {
                        Write-Verbose "Substring found in chunk at local index ${indexOfTarget}"

                        # we found a match, ensure we've read enough post-context ahead of the given index
                        $tail = ''
                        if ($string.Length - $indexOfTarget -lt $PostContext -and $readCount -eq $bufferSize) {
                            # just like above, we read another chunk from the file and convert it to a proper string
                            $tailBuffer = [char[]]::new($PostContext - ($string.Length - $indexOfTarget))
                            $tailCount = $scanner.ReadBlock($tailBuffer, 0, $tailBuffer.Length)
                            $tail = [string]::new($tailBuffer, 0, $tailCount)
                        }

                        # construct and output the full post-context
                        $substringWithPostContext = $string.Substring($indexOfTarget) + $tail
                        if ($substringWithPostContext.Length -gt $PostContext) {
                            $substringWithPostContext = $substringWithPostContext.Remove($PostContext)
                        }

                        Write-Verbose "Writing output object ..."
                        Write-Output $([PSCustomObject]@{
                            FilePath = $file.FullName
                            Offset   = $baseOffset + $indexOfTarget
                            Value    = $substringWithPostContext
                        })

                        if (-not $All) {
                            # no need to search this file any further unless `-All` was specified
                            continue fileLoop
                        }
                        else {
                            # rewind to the position right after this match before the next iteration
                            $rewindOffset = $indexOfTarget - $readCount
                            $null = $scanner.BaseStream.Seek($rewindOffset, [System.IO.SeekOrigin]::Current)
                        }
                    }
                    else {
                        # target was not found, but we may have "clipped" it in half,
                        # so figure out if the target string could start at the end of the current string chunk
                        for ($i = $string.Length - $Substring.Length; $i -lt $string.Length; $i++) {
                            # if the first character of the target substring isn't found then
                            # we might as well skip it immediately
                            if ($string[$i] -ne $Substring[0]) { continue }

                            if ($Substring.StartsWith($string.Substring($i))) {
                                # rewind the file stream to this position so it'll get re-tested on
                                # the next iteration, then break out of the tail search
                                $rewindOffset = $i - $string.Length
                                $null = $scanner.BaseStream.Seek($rewindOffset, [System.IO.SeekOrigin]::Current)
                                break
                            }
                        }
                    }
                } until ($eof)
            }
            finally {
                # remember to clean up after searching each file
                $scanner, $fileStream | Where-Object { $_ -is [System.IDisposable] } | ForEach-Object Dispose
            }
        }
    }
}
Now you can extract exactly 1000 characters after a substring is found with minimal memory allocation:
Get-ChildItem P*.txt |Find-SubstringWithPostContext -Substring '<ID>9671510841'
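A couple of hypothetical variations (the -Substring, -PostContext, -All and -Verbose parameters are the ones defined in the function above):
Get-ChildItem P*.txt | Find-SubstringWithPostContext -Substring '<ID>9671510841' -PostContext 1000 -Verbose
Get-ChildItem P*.txt | Find-SubstringWithPostContext -Substring '<ID>9671510841' -All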
I haven't tested this enough to tell you whether it works properly, but it was definitely fun to code. -Context here will give you context based on characters before and after the match instead of lines. You can give it a try and let me know if it worked :)
Usage:
Get-ChildItem *.txt | Find-String -Pattern 'mypattern'
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -Context 20, 20
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -AllMatches
using namespace System.Text.RegularExpressions
using namespace System.IO
function Find-String {
    param(
        [parameter(ValueFromPipeline, Mandatory)]
        [Alias('PSPath')]
        [FileInfo]$Path,

        [parameter(Mandatory, Position = 0)]
        [string]$Pattern,

        [RegexOptions]$Options = 'IgnoreCase',

        [switch]$AllMatches,

        [int[]]$Context
    )

    process
    {
        $re = [regex]::new($Pattern, $Options)
        $content = [File]::ReadAllText($Path.FullName)
        $match = if ($AllMatches.IsPresent)
        {
            $re.Matches($content)
        }
        else
        {
            $re.Match($content)
        }

        if ($match.Success -notcontains $true) { return }

        foreach ($m in $match)
        {
            $out = [ordered]@{
                Path   = $Path.FullName
                Value  = $m.Value
                Index  = $m.Index
                Length = $m.Length
            }

            if ($PSBoundParameters.ContainsKey('Context'))
            {
                $before = $m.Index
                $after = $m.Index + $m.Length
                $contextBefore = $Context[0]
                $contextAfter = $Context[1]

                # walk outwards character by character, stopping at the string bounds
                while ($contextBefore-- -and $before)
                {
                    $before--
                }
                while ($contextAfter-- -and $after -lt $content.Length)
                {
                    $after++
                }

                $out.Context = (-join $content[$before..$after]).Trim()
            }

            [pscustomobject]$out
        }
    }
}

PowerShell 7.0 how to compute hashsum of a big file read in chunks

The script should copy files and compute their hash sum.
My goal is to make a function which will read the file once instead of 3 times ( read_for_copy + read_for_hash + read_for_another_copy ) to minimize network load.
So I tried to read a chunk of the file, then compute the md5 hash sum and write the file out to several places.
The file's size may vary from 100 MB up to 2 TB and maybe more. There is no need to check file identity at this moment, just to compute the hash sum for the initial files.
And I am stuck with respect to computing the hash sum:
$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 10mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize
while ( $stream.Position -lt $stream.Length ) {
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)
    # I am stuck here
    $hash = [System.BitConverter]::ToString($md5.ComputeHash($buffer)) -replace "-",""
}
$stream.Close()
$makenew.Close()
$makenew2.Close()
How can I collect the chunks of data to compute the hash of the whole file?
And an extra question: is it possible to calculate the hash and write the data out in parallel? Especially taking into account that workflow { parallel { } } is not supported from PS version 6 onwards?
Many thanks
If you want to handle input buffering manually, you need to use the TransformBlock/TransformFinalBlock methods exposed by $md5:
while ($bytesRead = $stream.Read($buffer, 0, $bufferSize))
{
    # Write to file copies
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)

    # Feed next chunk to MD5 CSP
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}
# Complete the hashing routine
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
# Grab hash value from CSP
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
My goal is to make a function which will read the file once instead of 3 times ( read_for_copy + read_for_hash + read_for_another_copy ) to minimize network load
I'm not entirely sure what you mean by network load here. If the source file is on a remote file share, but the new copies go onto a local file system, you can minimize network load by simply copying the source file once, then use that one copy as the source of the second copy and the hash calculation:
$ifile = "\\remoteMachine\c$\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
# Copy remote -> local
Copy-Item -Path $ifile -Destination $ofile
# Copy local -> local
Copy-Item -Path $ofile -Destination $ofile2
# Hash local file stream
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$stream = [System.IO.File]::OpenRead($ofile)
$hash = [BitConverter]::ToString($md5.ComputeHash($stream)).Replace('-','')
FWIW, passing the file stream object to $md5.ComputeHash($stream) directly is likely going to be faster than manually buffering the input.
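As a sketch, if all you need is the hash of an existing local copy, the built-in Get-FileHash cmdlet (PowerShell 4 and later) does the stream handling for you; $ofile here is the same assumed path as above:
Get-FileHash -LiteralPath $ofile -Algorithm MD5 | Select-Object -ExpandProperty Hash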
Final listing
$ifile = "C:\Users\User\Desktop\inputfile"
$ofile = "C:\Users\User\Desktop\outputfile_1"
$ofile2 = "C:\Users\User\Desktop\outputfile_2"
$md5 = new-object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$bufferSize = 1mb
$stream = [System.IO.File]::OpenRead($ifile)
$makenew = [System.IO.File]::OpenWrite($ofile)
$makenew2 = [System.IO.File]::OpenWrite($ofile2)
$buffer = new-object Byte[] $bufferSize
while ( $stream.Position -lt $stream.Length )
{
    $bytesRead = $stream.Read($buffer, 0, $bufferSize)
    $makenew.Write($buffer, 0, $bytesRead)
    $makenew2.Write($buffer, 0, $bytesRead)
    # feed the chunk to the MD5 provider; the final hash is produced by TransformFinalBlock below
    $null = $md5.TransformBlock($buffer, 0, $bytesRead, $null, 0)
}
$md5.TransformFinalBlock([byte[]]::new(0), 0, 0)
$hash = [BitConverter]::ToString($md5.Hash).Replace('-','')
$hash
$stream.Flush()
$stream.Close()
$makenew.Flush()
$makenew.Close()
$makenew2.Flush()
$makenew2.Close()

How can I increase the maximum number of characters read by Read-Host?

I need to get a very long string input (around 9,000 characters), but Read-Host will truncate after around 8,000 characters. How can I extend this limit?
The following are possible workarounds.
Workaround 1 has the advantage that it will work with PowerShell background jobs that require keyboard input. Note that if you are trying to paste clipboard content containing new lines, Read-HostLine will only read the first line, but Read-Host has this same behavior.
Workaround 1:
<#
.SYNOPSIS
Read a line of input from the host.
.DESCRIPTION
Read a line of input from the host.
.EXAMPLE
$s = Read-HostLine -prompt "Enter something"
.NOTES
Read-Host has a limitation of 1022 characters.
This approach is safe to use with background jobs that require input.
If pasting content with embedded newlines, only the first line will be read.
A downside to the ReadKey approach is that it is not possible to easily edit the input string before pressing Enter as with Read-Host.
#>
function Read-HostLine ($prompt = $null) {
    if ($prompt) {
        "${prompt}: " | Write-Host
    }
    $str = ""
    while ($true) {
        $key = $host.UI.RawUI.ReadKey("NoEcho, IncludeKeyDown");
        # Paste the clipboard on CTRL-V
        if (($key.VirtualKeyCode -eq 0x56) -and # 0x56 is V
            (([int]$key.ControlKeyState -band [System.Management.Automation.Host.ControlKeyStates]::LeftCtrlPressed) -or
             ([int]$key.ControlKeyState -band [System.Management.Automation.Host.ControlKeyStates]::RightCtrlPressed))) {
            $clipboard = Get-Clipboard
            $str += $clipboard
            Write-Host $clipboard -NoNewline
            continue
        }
        elseif ($key.VirtualKeyCode -eq 0x08) { # 0x08 is Backspace
            if ($str.Length -gt 0) {
                $str = $str.Substring(0, $str.Length - 1)
                Write-Host "`b `b" -NoNewline
            }
        }
        elseif ($key.VirtualKeyCode -eq 13) { # 13 is Enter
            Write-Host
            break
        }
        elseif ($key.Character -ne 0) {
            $str += $key.Character
            Write-Host $key.Character -NoNewline
        }
    }
    return $str
}
Workaround 2:
$maxLength = 65536
[System.Console]::SetIn([System.IO.StreamReader]::new([System.Console]::OpenStandardInput($maxLength), [System.Console]::InputEncoding, $false, $maxLength))
$s = [System.Console]::ReadLine()
Workaround 3:
function Read-Line($maxLength = 65536) {
    $str = ""
    $inputStream = [System.Console]::OpenStandardInput($maxLength);
    $bytes = [byte[]]::new($maxLength);
    while ($true) {
        $len = $inputStream.Read($bytes, 0, $maxLength);
        $str += [string]::new($bytes, 0, $len)
        if ($str.EndsWith("`r`n")) {
            $str = $str.Substring(0, $str.Length - 2)
            return $str
        }
    }
}
$s = Read-Line
More discussion here:
Console.ReadLine() max length?
Why does Console.Readline() have a limit on the length of text it allows?
https://github.com/PowerShell/PowerShell/issues/16555

strange characters when opening a properties file

I have a requirement to update a properties file for a very old project. The properties file is supposed to display Arabic characters, but it displays something like "Êã ÊÓÌíá ØáÈßã". I wrote a simple program with which I was able to read the correct Arabic values from the file,
Reader r = new InputStreamReader(new FileInputStream("C:\\Labels_ar.properties"), "Windows-1256");
buffered = new BufferedReader(r);
String line;
while ((line = buffered.readLine()) != null) {
    System.out.println("line" + line);
}
but do you have any idea how I can open the file, edit it, and save the changes?
If, as you seem to think, the encoding is Windows-1256, there are editors that will do the job, such as EditPadLite.
If it's not that, the first thing you need to find out is the encoding. Given it's a properties file, it may well be UTF-8 but the easiest way to find out is to get a hex dump of the file and post it here. Under Linux, I'd normally suggest using:
od -xcb Labels_ar.properties
but, given you're on Windows, that's not going to work so well (unless you have CygWin installed).
So, if you have your own favourite hex dump program, just use that. Otherwise you can use the following Powershell one:
function Pf-Dump-Hex-Item([byte[]] $data) {
    $left = "+0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F"
    $right = "0123456789ABCDEF"
    Write-Output "======== $left +$right"
    $addr = 0
    $left = "{0:X8} " -f $addr
    $right = ""
    # Now go through the input bytes
    foreach ($byte in $data) {
        # Add 2-digit hex number then filtered character.
        $left += "{0:x2} " -f $byte
        if (($byte -lt 0x20) -or ($byte -gt 0x7e)) { $byte = "." }
        $right += [char] $byte
        # Increment address and start new line if needed.
        $addr++;
        if (($addr % 16) -eq 0) {
            Write-Output "$left $right"
            $left = "{0:X8} " -f $addr
            $right = "";
        }
    }
    # Flush last line if needed.
    $lastLine = "{0:X8}" -f $addr
    if (($addr % 16) -ne 0) {
        while (($addr % 16) -ne 0) {
            $left += "   "   # three spaces to match the "xx " column width
            $addr++;
        }
        Write-Output "$left $right"
    }
    Write-Output $lastLine
    Write-Output ""
}
function Pf-Dump-Hex {
    param(
        [Parameter (Mandatory = $false, Position = 0)]
        [string] $Path,
        [Parameter (Mandatory = $false, ValueFromPipeline = $true)]
        [Object] $Object
    )
    begin {
        Set-StrictMode -Version Latest
        # Create the array to hold content then do path if given.
        [byte[]] $bytes = $null
        if ($Path) {
            $bytes = [IO.File]::ReadAllBytes((Resolve-Path $Path))
            Pf-Dump-Hex-Item $bytes
        }
    }
    process {
        # Process each object (input/pipe).
        if ($object) {
            foreach ($obj in $object) {
                if ($obj -is [Byte]) {
                    $bytes = $obj
                } else {
                    $inpStr = [string] $obj
                    $bytes = [Text.Encoding]::Unicode.GetBytes($inpStr)
                }
                Pf-Dump-Hex-Item $bytes
            }
        }
    }
}
If you load that into a Powershell session then run:
pf-dump-hex Labels_ar.properties
that should allow you to evaluate the file encoding.
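On PowerShell 5 and later you could also skip the custom functions and peek at the start of the file with the built-in Format-Hex cmdlet (file name as in the question); a UTF-8 file with a BOM, for example, starts with EF BB BF, which is easy to spot in the dump:
Format-Hex -Path .\Labels_ar.properties | Select-Object -First 16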
I think there are two problems:
1. I'm not sure if System.out.println() can print Arabic characters, so try another method, like MessageBox.show(), to be sure the problem is with reading the file.
2. If MessageBox.show() shows the same result, the problem is probably the charset; you can try UTF-8 or something else.
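If you want to sanity-check the charset guess from PowerShell rather than Java, something along these lines may help; the path and the Windows-1256 code page are assumptions taken from the question:
$enc = [System.Text.Encoding]::GetEncoding('windows-1256')
[System.IO.File]::ReadAllText('C:\Labels_ar.properties', $enc)   # should print readable Arabic if the guess is right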

In Powershell, How do I split a large binary file?

I've seen the answer elsewhere for text files, but I need to do this for a compressed file.
I've got a 6G binary file which needs to be split into 100M chunks. Am I missing the analog for unix's "head" somewhere?
Never mind. Here you go:
function split($inFile, $outPrefix, [Int32] $bufSize) {
    $stream = [System.IO.File]::OpenRead($inFile)
    $chunkNum = 1
    $barr = New-Object byte[] $bufSize
    while ( $bytesRead = $stream.Read($barr, 0, $bufSize) ) {
        $outFile = "$outPrefix$chunkNum"
        $ostream = [System.IO.File]::OpenWrite($outFile)
        $ostream.Write($barr, 0, $bytesRead);
        $ostream.close();
        echo "wrote $outFile"
        $chunkNum += 1
    }
    # close the input stream so the source file isn't left locked
    $stream.close()
}
Assumption: bufSize fits in memory.
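A hypothetical invocation, producing 100 MB chunks named bigfile.bin.1, bigfile.bin.2, and so on:
split "C:\data\bigfile.bin" "C:\data\bigfile.bin." 100MB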
The answer to the corollary question: How do you put them back together?
function stitch($infilePrefix, $outFile) {
    $ostream = [System.IO.File]::OpenWrite($outFile)
    $chunkNum = 1
    $infileName = "$infilePrefix$chunkNum"
    $offset = 0
    while (Test-Path $infileName) {
        $bytes = [System.IO.File]::ReadAllBytes($infileName)
        $ostream.Write($bytes, 0, $bytes.Count)
        Write-Host "read $infileName"
        $chunkNum += 1
        $infileName = "$infilePrefix$chunkNum"
    }
    $ostream.close();
}
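And the matching (equally hypothetical) call to reassemble the chunks in order:
stitch "C:\data\bigfile.bin." "C:\data\bigfile_rejoined.bin"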
I answered the question alluded to in this question's comments by bernd_k, but in this case I would use -ReadCount instead of -TotalCount, e.g.
Get-Content bigfile.bin -ReadCount 100MB -Encoding byte
This causes Get-Content to read the file a chunk at a time, where each chunk is -ReadCount records: lines for text encodings, or bytes for byte encoding. Keep in mind that when it does this, you get an array passed down the pipeline rather than individual bytes or lines of text.
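As a rough sketch of how that pipeline could write the chunks back out as numbered files (Windows PowerShell 5.1 syntax; PowerShell 7 replaces -Encoding Byte with -AsByteStream, and this is noticeably slower than the stream-based split function above):
$i = 0
Get-Content bigfile.bin -ReadCount 100MB -Encoding Byte | ForEach-Object {
    $i++
    Set-Content -Path "bigfile.bin.$i" -Value $_ -Encoding Byte
}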