delete some sequence of bytes in Powershell [duplicate] - powershell

This question already has answers here:
Methods to hex edit binary files via Powershell
(4 answers)
Closed 3 years ago.
I have a *.bin file. How can I delete, with PowerShell, a block of bytes (29 bytes, marked yellow) with a repeating sequence of bytes (12 bytes, marked in red pen)? Thanks a lot!!

Using a very helpful article and the function I found here, it seems it is possible to read a binary file and convert it to a string without altering any of the bytes by using code page 28591.
With that (I slightly changed the function), you can do this to delete the bytes in your *.bin file:
function ConvertTo-BinaryString {
# converts the bytes of a file to a string that has a
# 1-to-1 mapping back to the file's original bytes.
# Useful for performing binary regular expressions.
[OutputType([String])]
Param (
[Parameter(Mandatory = $True, ValueFromPipeline = $True, Position = 0)]
[ValidateScript( { Test-Path $_ -PathType Leaf } )]
[String]$Path
)
$Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
# Note: Codepage 28591 returns a 1-to-1 char to byte mapping
$Encoding = [Text.Encoding]::GetEncoding(28591)
$StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
$BinaryText = $StreamReader.ReadToEnd()
$StreamReader.Close()
$Stream.Close()
return $BinaryText
}
$inputFile = 'D:\test.bin'
$outputFile = 'D:\test2.bin'
$fileBytes = [System.IO.File]::ReadAllBytes($inputFile)
$binString = ConvertTo-BinaryString -Path $inputFile
# create your regex: 17 bytes in range of \x00 to \xFF followed by 12 bytes specific range
$re = [Regex]'[\x00-\xFF]{17}\xEB\x6F\xD3\x01\x18\x00{3}\xFF{3}\xFE'
# use a MemoryStream object to store the result
$ms = New-Object System.IO.MemoryStream
$pos = $replacements = 0
$re.Matches($binString) | ForEach-Object {
# write the bytes between the previous match and this one to the MemoryStream
$ms.Write($fileBytes, $pos, $_.Index - $pos)
# move the 'cursor' to the first byte after this match
$pos = $_.Index + $_.Length
# and count the number of replacements done
$replacements++
}
# write the remainder of the bytes to the stream
$ms.Write($fileBytes, $pos, $fileBytes.Count - $pos)
# save the updated bytes to a new file (will overwrite existing file)
[System.IO.File]::WriteAllBytes($outputFile, $ms.ToArray())
$ms.Dispose()
if ($replacements) {
Write-Host "$replacements replacement(s) made."
}
else {
Write-Host "Byte sequence not found. No replacements made."
}
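The 1-to-1 mapping of code page 28591 can be checked on a small in-memory array. The sketch below is only an illustration: the sample bytes and the shortened pattern are made up, and it uses `Regex.Replace` on the decoded string as a compact alternative to the MemoryStream loop above, which is reasonable when the whole file fits in memory.

```powershell
# Code page 28591 (ISO-8859-1) maps every byte 0x00-0xFF to the char with the same value.
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

# Round-trip check over all 256 byte values.
[byte[]]$allBytes = 0..255
[byte[]]$back = $latin1.GetBytes($latin1.GetString($allBytes))
$roundTripOk = ($allBytes -join ',') -eq ($back -join ',')

# Hypothetical sample: delete every "2 arbitrary bytes + EB 6F" block.
[byte[]]$sample = 0x01, 0x02, 0xAA, 0xBB, 0xEB, 0x6F, 0x03, 0x0A, 0x0D, 0xEB, 0x6F, 0x04
$re = [regex]'[\x00-\xFF]{2}\xEB\x6F'
[byte[]]$cleaned = $latin1.GetBytes($re.Replace($latin1.GetString($sample), ''))
```

Using `[\x00-\xFF]` rather than `.` matters here, because `.` does not match the newline character (0x0A) by default.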

Related

PowerShell reading and writing compressed files with byte arrays

Final Update: Turns out I didn't need Binary writer. I could just copy memory streams from one archive to another.
I'm re-writing a PowerShell script which works with archives. I'm using two functions from here
Expand-Archive without Importing and Exporting files
and can successfully read and write files to the archive. I've posted the whole program just in case it makes things clearer for someone to help me.
However, there are three issues (besides the fact that I don't really know what I'm doing).
1.) Most files produce this error when running
Add-ZipEntry -ZipFilePath ($OriginalArchivePath + $PartFileDirectoryName) -EntryPath $entry.FullName -Content $fileBytes}
Cannot convert value "507" to type "System.Byte". Error: "Value was either too large or too small for an unsigned byte." (replace 507 with whatever number from the byte array is there)
2.) When it reads a file and adds it to the zip archive (*.imscc) it adds a character "a" to the beginning of the file contents.
3.) The only files it doesn't error on are text files, when I really want it to handle any file
Thank you for any assistance!
Update: I've tried using System.IO.BinaryWriter, with the same errors.
Add-Type -AssemblyName 'System.Windows.Forms'
Add-Type -AssemblyName 'System.IO.Compression'
Add-Type -AssemblyName 'System.IO.Compression.FileSystem'
function Folder-SuffixGenerator($SplitFileCounter)
{
return ' ('+$usrSuffix+' '+$SplitFileCounter+')'
}
function Get-ZipEntryContent(#returns the bytes of the first matching entry
[string] $ZipFilePath, #optional - specify a ZipStream or path
[IO.Stream] $ZipStream = (New-Object IO.FileStream($ZipFilePath, [IO.FileMode]::Open)),
[string] $EntryPath){
$ZipArchive = New-Object IO.Compression.ZipArchive($ZipStream, [IO.Compression.ZipArchiveMode]::Read)
$buf = New-Object byte[] (0) #return an empty byte array if not found
$ZipArchive.GetEntry($EntryPath) | ?{$_} | %{ #GetEntry returns first matching entry or null if there is no match
$buf = New-Object byte[] ($_.Length)
Write-Verbose " reading: $($_.Name)"
$_.Open().Read($buf,0,$buf.Length)
}
$ZipArchive.Dispose()
$ZipStream.Close()
$ZipStream.Dispose()
return ,$buf
}
function Add-ZipEntry(#Adds an entry to the $ZipStream. Sample call: Add-ZipEntry -ZipFilePath "$PSScriptRoot\temp.zip" -EntryPath Test.xml -Content ([text.encoding]::UTF8.GetBytes("Testing"))
[string] $ZipFilePath, #optional - specify a ZipStream or path
[IO.Stream] $ZipStream = (New-Object IO.FileStream($ZipFilePath, [IO.FileMode]::OpenOrCreate)),
[string] $EntryPath,
[byte[]] $Content,
[switch] $OverWrite, #if specified, will not create a second copy of an existing entry
[switch] $PassThru ){#return a copy of $ZipStream
$ZipArchive = New-Object IO.Compression.ZipArchive($ZipStream, [IO.Compression.ZipArchiveMode]::Update, $true)
$ExistingEntry = $ZipArchive.GetEntry($EntryPath) | ?{$_}
If($OverWrite -and $ExistingEntry){
Write-Verbose " deleting existing $($ExistingEntry.FullName)"
$ExistingEntry.Delete()
}
$Entry = $ZipArchive.CreateEntry($EntryPath)
$WriteStream = New-Object System.IO.StreamWriter($Entry.Open())
$WriteStream.Write($Content,0,$Content.Length)
$WriteStream.Flush()
$WriteStream.Dispose()
$ZipArchive.Dispose()
If($PassThru){
$OutStream = New-Object System.IO.MemoryStream
$ZipStream.Seek(0, 'Begin') | Out-Null
$ZipStream.CopyTo($OutStream)
}
$ZipStream.Close()
$ZipStream.Dispose()
If($PassThru){$OutStream}
}
$NoDeleteFiles = @('files_meta.xml' ,'course_settings.xml', 'assignment_groups.xml', 'canvas_export.txt', 'imsmanifest.xml')
Set-Variable usrSuffix -Option ReadOnly -Value 'part' -Force
$MaxImportFileSize = 1000
$compressionLevel = [System.IO.Compression.CompressionLevel]::Optimal
$SplitFileCounter = 1
$FileBrowser = New-Object System.Windows.Forms.OpenFileDialog
$FileBrowser.filter = "Canvas Export Files (*.imscc)| *.imscc"
[void]$FileBrowser.ShowDialog()
$FileBrowser.FileName
$FilePath = $FileBrowser.FileName
$OriginalArchivePath = $FilePath.Substring(0,$FilePath.Length-6)
$PartFileDirectoryName = $OriginalArchive + (Folder-SuffixGenerator($SplitFileCounter)) + '.imscc'
$CourseZip = [IO.Compression.ZipFile]::OpenRead($FilePath)
$CourseZipFiles = $CourseZip.Entries | Sort Length -Descending
$CourseZip.Dispose()
<#
$SortingTable = $CourseZip.entries | Select Fullname,
@{Name="Size";Expression={$_.length}},
@{Name="CompressedSize";Expression={$_.Compressedlength}},
@{Name="PctZip";Expression={[math]::Round(($_.compressedlength/$_.length)*100,2)}}|
Sort Size -Descending | format-table -AutoSize
#>
# Add mandatory files
ForEach($entry in $CourseZipFiles)
{
if ($NoDeleteFiles.Contains($entry.Name)){
Write-Output "Adding to Zip" + $entry.FullName
# Add to Zip
$fileBytes = Get-ZipEntryContent -ZipFilePath $FilePath -EntryPath $entry.FullName
Add-ZipEntry -ZipFilePath ($OriginalArchivePath + $PartFileDirectoryName) -EntryPath $entry.FullName -Content $fileBytes
}
}
System.IO.StreamWriter is a text writer, and therefore not suitable for writing raw bytes. Cannot convert value "507" to type "System.Byte" indicates that an inappropriate attempt was made to convert text - a .NET string composed of [char] instances which are in effect [uint16] code points (range 0x0 - 0xffff) - to [byte] instances (0x0 - 0xff). Therefore, any Unicode character whose code point is greater than 255 (0xff) will cause this error.
The solution is to use a .NET API that allows writing raw bytes, namely System.IO.BinaryWriter:
$WriteStream = [System.IO.BinaryWriter]::new($Entry.Open())
$WriteStream.Write($Content)
$WriteStream.Flush()
$WriteStream.Dispose()
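As a rough sketch of the same point, raw bytes can also be written straight to the stream returned by `Open()` and copied back out with `CopyTo`, with no text writer involved. The entry name `data.bin` and the payload are hypothetical, and the archive lives in a MemoryStream purely to keep the example self-contained.

```powershell
Add-Type -AssemblyName 'System.IO.Compression'

# Hypothetical payload containing byte values that are not valid text.
[byte[]]$content = 0x00, 0xFF, 0x80, 0x41, 0x0A

$zipStream = [System.IO.MemoryStream]::new()

# Write the entry; the third argument (leaveOpen) keeps $zipStream usable afterwards.
$archive = [System.IO.Compression.ZipArchive]::new($zipStream, [System.IO.Compression.ZipArchiveMode]::Create, $true)
$entryStream = $archive.CreateEntry('data.bin').Open()
$entryStream.Write($content, 0, $content.Length)   # raw bytes, no encoding step
$entryStream.Dispose()
$archive.Dispose()                                  # finalizes the zip directory

# Read the entry back by copying its stream into a MemoryStream.
$zipStream.Position = 0
$archive = [System.IO.Compression.ZipArchive]::new($zipStream, [System.IO.Compression.ZipArchiveMode]::Read)
$readStream = $archive.GetEntry('data.bin').Open()
$copy = [System.IO.MemoryStream]::new()
$readStream.CopyTo($copy)
$readStream.Dispose()
$archive.Dispose()
[byte[]]$roundTrip = $copy.ToArray()
```

Copying entry streams this way is in line with the asker's final update, where memory streams are copied from one archive to another.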

Using Powershell to output characters (not lines) after a match in a large file

I use PowerShell to parse huge files and easily take a look at a small part of the file where a certain string occurs, like this:
Select-String P120300420059211107104259.txt -Pattern "<ID>9671510841" -Context 0,300
This gives me 300 lines of the file after the occurrence of that ID number.
But I've come across a file that has no carriage returns. Now I would like to do the same thing, but instead of lines being returned, I guess I need characters.
How would I do this?
I have never created scripts in powershell - just ran simple commands like the above.
I would like to see maybe 1000 characters after the matched string, within a huge file.
Thanks!
The problem with using Select-String or [Regex]::Matches() (or -match) to test for the presence of a substring in a single-line file is that you first need to read the whole file into memory at once.
The good news is that you don't need regular expressions to find a substring in a huge single-line text file - instead, you can read the file contents into memory in smaller chunks and then search through those - this way you don't need to store the entire file in memory at once.
Reading buffered text from a file is fairly straightforward:
Open a readable file stream
Create a StreamReader to read from the file stream
Start reading!
Then you just need to check whether:
The target substring is found in each chunk, or
The start of the target substring is partially found at the tail end of the current chunk
And then repeat until you find the substring, at which point you read the following 1000 characters.
Here's an example of how you could implement it as a script function (I've tried to explain the code in more detail in inline comments):
function Find-SubstringWithPostContext {
[CmdletBinding(DefaultParameterSetName = 'wp')]
param(
[Alias('PSPath')]
[Parameter(Mandatory = $true, ParameterSetName = 'lp', ValueFromPipelineByPropertyName = $true, ValueFromPipeline = $true)]
[string[]]$LiteralPath,
[Parameter(Mandatory = $true, ParameterSetName = 'wp', Position = 0)]
[string[]]$Path,
[Parameter(Mandatory = $true)]
[ValidateLength(1, 5000)]
[string]$Substring,
[ValidateRange(2, 25000)]
[int]$PostContext = 1000,
[switch]$All,
[System.Text.Encoding]
$Encoding
)
begin {
# start by ensuring we'll be using a buffer that's at least 4 times larger than the
# target substring to avoid too many tail searches
$bufferSize = 2000
while ($Substring.Length -gt $bufferSize / 4) {
$bufferSize *= 2
}
$buffer = [char[]]::new($bufferSize)
}
process {
if ($PSCmdlet.ParameterSetName -eq 'wp') {
# resolve input paths if necessary
$LiteralPath = $Path | Convert-Path
}
:fileLoop
foreach ($lp in $LiteralPath) {
$file = Get-Item -LiteralPath $lp
# skip directories
if ($file -isnot [System.IO.FileInfo]) { continue }
try {
$fileStream = $file.OpenRead()
$scanner = [System.IO.StreamReader]::new($fileStream, $true)
do {
# remember the current offset in the file, we'll need this later
$baseOffset = $fileStream.Position
# read a chunk from the file, convert to string
$readCount = $scanner.ReadBlock($buffer, 0, $bufferSize)
$string = [string]::new($buffer, 0, $readCount)
$eof = $readCount -lt $bufferSize
# test if target substring is found in the chunk we just read
$indexOfTarget = $string.IndexOf($Substring)
if ($indexOfTarget -ge 0) {
Write-Verbose "Substring found in chunk at local index ${indexOfTarget}"
# we found a match, ensure we've read enough post-context ahead of the given index
$tail = ''
if ($string.Length - $indexOfTarget -lt $PostContext -and $readCount -eq $bufferSize) {
# just like above, we read another chunk from the file and convert it to a proper string
$tailBuffer = [char[]]::new($PostContext - ($string.Length - $indexOfTarget))
$tailCount = $scanner.ReadBlock($tailBuffer, 0, $tailBuffer.Length)
$tail = [string]::new($tailBuffer, 0, $tailCount)
}
# construct and output the full post-context
$substringWithPostContext = $string.Substring($indexOfTarget) + $tail
if($substringWithPostContext.Length -gt $PostContext){
$substringWithPostContext = $substringWithPostContext.Remove($PostContext)
}
Write-Verbose "Writing output object ..."
Write-Output $([PSCustomObject]@{
FilePath = $file.FullName
Offset = $baseOffset + $indexOfTarget
Value = $substringWithPostContext
})
if (-not $All) {
# no need to search this file any further unless `-All` was specified
continue fileLoop
}
else {
# rewind to position after this match before next iteration
$rewindOffset = $indexOfTarget - $readCount
$null = $scanner.BaseStream.Seek($rewindOffset, [System.IO.SeekOrigin]::Current)
}
}
else {
# target was not found, but we may have "clipped" it in half,
# so figure out if target string could start at the end of current string chunk
for ($i = [Math]::Max(0, $string.Length - $Substring.Length); $i -lt $string.Length; $i++) {
# if the first character of the target substring isn't found then
# we might as well skip it immediately
if ($string[$i] -ne $Substring[0]) { continue }
if ($Substring.StartsWith($string.Substring($i))) {
# rewind file stream to this position so it'll get re-tested on
# the next iteration, then break out of tail search
$rewindOffset = $i - $string.Length
$null = $scanner.BaseStream.Seek($rewindOffset, [System.IO.SeekOrigin]::Current)
break
}
}
}
} until ($eof)
}
finally {
# remember to clean up after searching each file
$scanner, $fileStream |Where-Object { $_ -is [System.IDisposable] } |ForEach-Object Dispose
}
}
}
}
Now you can extract exactly 1000 characters after a substring is found with minimal memory allocation:
Get-ChildItem P*.txt |Find-SubstringWithPostContext -Substring '<ID>9671510841'
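The chunk-and-tail idea can also be sketched in a more compact form that avoids seeking in the underlying stream: keep the last `Substring.Length - 1` characters of each window and prepend them to the next chunk, so a match split across two reads is still found. `Find-InReader` and its parameters are hypothetical names for this illustration, and it takes any `TextReader` so it can be tried on a string as well as a file.

```powershell
function Find-InReader {
    param(
        [Parameter(Mandatory)][System.IO.TextReader]$Reader,
        [Parameter(Mandatory)][string]$Substring,
        [int]$PostContext = 1000,   # total characters returned, match included
        [int]$ChunkSize = 4096
    )
    $buffer = [char[]]::new($ChunkSize)
    $carry = ''
    while (($read = $Reader.ReadBlock($buffer, 0, $ChunkSize)) -gt 0) {
        # Search the carried-over tail plus the freshly read chunk.
        $window = $carry + [string]::new($buffer, 0, $read)
        $index = $window.IndexOf($Substring, [System.StringComparison]::Ordinal)
        if ($index -ge 0) {
            $result = $window.Substring($index)
            # Keep reading until enough context is available or the input ends.
            while ($result.Length -lt $PostContext -and
                   ($read = $Reader.ReadBlock($buffer, 0, $ChunkSize)) -gt 0) {
                $result += [string]::new($buffer, 0, $read)
            }
            if ($result.Length -gt $PostContext) { $result = $result.Substring(0, $PostContext) }
            return $result
        }
        # A match straddling the boundary can overlap at most Length - 1 chars.
        $keep = [Math]::Min($Substring.Length - 1, $window.Length)
        $carry = $window.Substring($window.Length - $keep)
    }
}

# Example on an in-memory string; a tiny chunk size forces the boundary case.
$text = ('x' * 10) + 'NEEDLE' + ('y' * 50)
$found = Find-InReader -Reader ([System.IO.StringReader]::new($text)) -Substring 'NEEDLE' -PostContext 20 -ChunkSize 13
```

For a file, a `[System.IO.StreamReader]::new($fullPath)` can be passed as `-Reader`; a full path is safer because .NET resolves relative paths against the process working directory rather than the PowerShell location.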
I haven't tested this enough to tell you if it works properly but it definitely was something fun to code. -Context here will give you the context based on characters before and after instead of lines. You can give it a try and let me know if it worked :)
Usage:
Get-ChildItem *.txt | Find-String -Pattern 'mypattern'
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -Context 20, 20
Get-ChildItem *.txt | Find-String -Pattern 'mypattern' -AllMatches
using namespace System.Text.RegularExpressions
using namespace System.IO
function Find-String {
param(
[parameter(ValueFromPipeline, Mandatory)]
[Alias('PSPath')]
[FileInfo]$Path,
[parameter(Mandatory, Position = 0)]
[string]$Pattern,
[RegexOptions]$Options = 'IgnoreCase',
[switch]$AllMatches,
[int[]]$Context
)
process
{
$re = [regex]::new($Pattern, $Options)
$content = [File]::ReadAllText($Path.FullName)
$match = if($AllMatches.IsPresent)
{
$re.Matches($content)
}
else
{
$re.Match($content)
}
if($match.Success -notcontains $true) { return }
foreach($m in $match)
{
$out = [ordered]@{
Path = $path.FullName
Value = $m.Value
Index = $m.Index
Length = $m.Length
}
if($PSBoundParameters.ContainsKey('Context'))
{
$before = $m.Index
$after = $m.Index + $m.Length
$contextBefore = $Context[0]
$contextAfter = $Context[1]
while($contextBefore-- -and $before)
{
$before--
}
while($contextAfter-- -and $after -lt $content.Length)
{
$after++
}
$out.Context = (-join $content[$before..$after]).Trim()
}
[pscustomobject]$out
}
}
}

Stream just part of a file using PowerShell and compute hash

I need to be able to identify some large binary files which have been copied and renamed between secure servers. To do this, I would like to be able to hash the first X bytes and the last X bytes of all the files. I need to do this with only what is available on a standard Windows 10 system with no additional software installed, so PowerShell seems like the right choice.
Some things that don't work:
I cannot read the entire file in, then extract the parts of the file I want to hash. The objective I'm trying to achieve is to minimize the amount of the file I need to read, and reading the entire file defeats that purpose.
Reading moderately large portions of a file into a PowerShell variable appears to be pretty slow, so $hash.ComputeHash($moderatelyLargeVariable) doesn't seem like a viable solution.
I'm pretty sure I need to do $hash.ComputeHash($stream) where $stream only streams part of the file.
Thus far I've tried:
function Get-FileStreamHash {
param (
$FilePath,
$Algorithm
)
$hash = [Security.Cryptography.HashAlgorithm]::Create($Algorithm)
## METHOD 0: See description below
$stream = ([IO.StreamReader]"${FilePath}").BaseStream
$hashValue = $hash.ComputeHash($stream)
## END of part I need help with
# Convert to a hexadecimal string
$hexHashValue = -join ($hashValue | ForEach-Object { "{0:x2}" -f $_ })
$stream.Close()
# return
$hexHashValue
}
Method 0: This works, but it's streaming the whole file and thus doesn't solve my problem. For a 3GB file this takes about 7 seconds on my machine.
Method 1: $hashValue = $hash.ComputeHash((Get-Content -Path $FilePath -Stream "")). This also is streaming the whole file, and it also takes forever. For the same 3GB file it takes something longer than 5 minutes (I cancelled at that point, and don't know what the total duration would be).
Method 2: $hashValue = $hash.ComputeHash((Get-Content -Path $FilePath -Encoding byte -TotalCount $qtyBytes -Stream "")). This is the same as Method 1, except that it limits the content to $qtyBytes. At 1000000 (1MB) it takes 18 seconds. I think that means Method 1 would have taken ~15 hours, 7700x slower than Method 0.
Is there a way to do something like Method 2 (limit what is read) but without the slow down? And if so, is there a good way to do it on just the end of the file?
Thanks!
You could try one (or a combination of both) of the following helper functions to read a number of bytes from the beginning of the file or taken from the end:
function Read-FirstBytes {
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, Position = 0)]
[Alias('FullName', 'FilePath')]
[ValidateScript({ Test-Path -Path $_ -PathType Leaf })]
[string]$Path,
[Parameter(Mandatory=$true, Position = 1)]
[int]$Bytes,
[ValidateSet('ByteArray', 'HexString', 'Base64')]
[string]$As = 'ByteArray'
)
try {
$stream = [System.IO.File]::OpenRead($Path)
$length = [math]::Min([math]::Abs($Bytes), $stream.Length)
$buffer = [byte[]]::new($length)
$null = $stream.Read($buffer, 0, $length)
switch ($As) {
'HexString' { ($buffer | ForEach-Object { "{0:x2}" -f $_ }) -join '' ; break }
'Base64' { [Convert]::ToBase64String($buffer) ; break }
default { ,$buffer }
}
}
catch { throw }
finally { $stream.Dispose() }
}
function Read-LastBytes {
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, Position = 0)]
[Alias('FullName', 'FilePath')]
[ValidateScript({ Test-Path -Path $_ -PathType Leaf })]
[string]$Path,
[Parameter(Mandatory=$true, Position = 1)]
[int]$Bytes,
[ValidateSet('ByteArray', 'HexString', 'Base64')]
[string]$As = 'ByteArray'
)
try {
$stream = [System.IO.File]::OpenRead($Path)
$length = [math]::Min([math]::Abs($Bytes), $stream.Length)
$null = $stream.Seek(-$length, 'End')
$buffer = for ($i = 0; $i -lt $length; $i++) { $stream.ReadByte() }
switch ($As) {
'HexString' { ($buffer | ForEach-Object { "{0:x2}" -f $_ }) -join '' ; break }
'Base64' { [Convert]::ToBase64String($buffer) ; break }
default { ,[Byte[]]$buffer }
}
}
catch { throw }
finally { $stream.Dispose() }
}
Then you can compute a hash value from it and format as you like.
Combinations are possible like
$begin = Read-FirstBytes -Path 'D:\Test\somefile.dat' -Bytes 50 # take the first 50 bytes
$end = Read-LastBytes -Path 'D:\Test\somefile.dat' -Bytes 1000 # and the last 1000 bytes
$Algorithm = 'MD5'
$hash = [Security.Cryptography.HashAlgorithm]::Create($Algorithm)
$hashValue = $hash.ComputeHash($begin + $end)
($hashValue | ForEach-Object { "{0:x2}" -f $_ }) -join ''
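The two reads and the hash can also be folded into one function that fills a single buffer with block reads instead of per-byte `ReadByte()` calls. This is a sketch under assumptions: `Get-EdgeHash` is a hypothetical name, SHA-256 is picked arbitrarily, and `-Path` should be a full path because .NET resolves relative paths against the process working directory.

```powershell
function Get-EdgeHash {
    param(
        [Parameter(Mandatory)][string]$Path,   # full path to the file
        [int]$Count = 1MB                      # bytes taken from each end
    )
    $stream = [System.IO.File]::OpenRead($Path)
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        # Never ask for more bytes than the file holds.
        $n = [int][Math]::Min([long]$Count, $stream.Length)
        $buffer = [byte[]]::new(2 * $n)
        foreach ($part in 0, 1) {
            # Part 0 is the head (stream starts at 0), part 1 is the tail.
            if ($part -eq 1) { $null = $stream.Seek(-$n, [System.IO.SeekOrigin]::End) }
            $offset = $part * $n
            $done = 0
            # Read() may return fewer bytes than requested, so loop until filled.
            while ($done -lt $n) {
                $got = $stream.Read($buffer, $offset + $done, $n - $done)
                if ($got -eq 0) { break }
                $done += $got
            }
        }
        -join ($sha.ComputeHash($buffer) | ForEach-Object { '{0:x2}' -f $_ })
    }
    finally {
        $sha.Dispose()
        $stream.Dispose()
    }
}
```

If the file is shorter than twice `-Count`, the head and tail regions overlap, which is harmless for identification as long as every file is hashed with the same rule.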
I believe this would be a more efficient way of reading the last bytes of your file using System.IO.BinaryReader. You can combine this function with the function you have, it can read all bytes, last n bytes (-Last) or first n bytes (-First).
function Read-Bytes {
[cmdletbinding(DefaultParameterSetName = 'Path')]
param(
[parameter(
Mandatory,
ValueFromPipelineByPropertyName,
ParameterSetName = 'Path',
Position = 0
)][alias('FullName')]
[ValidateScript({
if(Test-Path $_ -PathType Leaf)
{
return $true
}
throw 'Invalid File Path'
})]
[System.IO.FileInfo]$Path,
[parameter(
HelpMessage = 'Specifies the number of Bytes from the beginning of a file.',
ParameterSetName = 'FirstBytes',
Position = 1
)]
[int64]$First,
[parameter(
HelpMessage = 'Specifies the number of Bytes from the end of a file.',
ParameterSetName = 'LastBytes',
Position = 1
)]
[int64]$Last
)
process
{
try
{
$reader = [System.IO.BinaryReader]::new(
[System.IO.File]::Open(
$Path.FullName,
[system.IO.FileMode]::Open,
[System.IO.FileAccess]::Read
)
)
$stream = $reader.BaseStream
$length = (
$stream.Length, $First
)[[int]($First -lt $stream.Length -and $First)]
$stream.Position = (
0, ($length - $Last)
)[[int]($length -gt $Last -and $Last)]
$bytes = while($stream.Position -ne $length)
{
$stream.ReadByte()
}
[pscustomobject]@{
FilePath = $Path.FullName
Length = $length
Bytes = $bytes
}
}
catch
{
Write-Warning $_.Exception.Message
}
finally
{
$reader.Close()
$reader.Dispose()
}
}
}
Usage
Get-ChildItem . -File | Read-Bytes -Last 100: Reads the last 100 bytes of all files on the current folder. If the -Last argument exceeds the file length, it reads the entire file.
Get-ChildItem . -File | Read-Bytes -First 100: Reads the first 100 bytes of all files on the current folder. If the -First argument exceeds the file length, it reads the entire file.
Read-Bytes -Path path/to/file.ext: Reads all bytes of file.ext.
Output
Returns an object with the properties FilePath, Length, Bytes.
FilePath Length Bytes
-------- ------ -----
/home/user/Documents/test/...... 14 {73, 32, 119, 111…}
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 116 {111, 109, 101, 95…}
/home/user/Documents/test/...... 17963 {50, 101, 101, 53…}
/home/user/Documents/test/...... 3617 {105, 32, 110, 111…}
/home/user/Documents/test/...... 638 {101, 109, 112, 116…}
/home/user/Documents/test/...... 0
/home/user/Documents/test/...... 36 {65, 99, 114, 101…}
/home/user/Documents/test/...... 735 {117, 112, 46, 79…}
/home/user/Documents/test/...... 1857 {108, 111, 115, 101…}
/home/user/Documents/test/...... 77 {79, 80, 69, 78…}

strange characters when opening a properties file

I have a requirement to update a properties file for a very old project. The properties file is supposed to display Arabic characters, but it displays something like "Êã ÊÓÌíá ØáÈßã". I wrote a simple program with which I was able to read the correct Arabic values from the file:
Reader r = new InputStreamReader(new FileInputStream("C:\\Labels_ar.properties"), "Windows-1256");
BufferedReader buffered = new BufferedReader(r);
String line;
while ((line = buffered.readLine()) != null) {
System.out.println("line" + line);
}
But do you have any idea how I can open the file, edit it, and save the new changes?
If, as you seem to think, the encoding is Windows-1256, there are editors that will do the job, such as EditPadLite.
If it's not that, the first thing you need to find out is the encoding. Given it's a properties file, it may well be UTF-8 but the easiest way to find out is to get a hex dump of the file and post it here. Under Linux, I'd normally suggest using:
od -xcb Labels_ar.properties
but, given you're on Windows, that's not going to work so well (unless you have Cygwin installed).
So, if you have your own favourite hex dump program, just use that. Otherwise you can use the following Powershell one:
function Pf-Dump-Hex-Item([byte[]] $data) {
$left = "+0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F"
$right = "0123456789ABCDEF"
Write-Output "======== $left +$right"
$addr = 0
$left = "{0:X8} " -f $addr
$right = ""
# Now go through the input bytes
foreach ($byte in $data) {
# Add 2-digit hex number then filtered character.
$left += "{0:x2} " -f $byte
if (($byte -lt 0x20) -or ($byte -gt 0x7e)) { $byte = "." }
$right += [char] $byte
# Increment address and start new line if needed.
$addr++;
if (($addr % 16) -eq 0) {
Write-Output "$left $right"
$left = "{0:X8} " -f $addr
$right = "";
}
}
# Flush last line if needed.
$lastLine = "{0:X8}" -f $addr
if (($addr % 16) -ne 0) {
while (($addr % 16) -ne 0) {
$left += " "
$addr++;
}
Write-Output "$left $right"
}
Write-Output $lastLine
Write-Output ""
}
function Pf-Dump-Hex {
param(
[Parameter (Mandatory = $false, Position = 0)]
[string] $Path,
[Parameter (Mandatory = $false, ValueFromPipeline = $true)]
[Object] $Object
)
begin {
Set-StrictMode -Version Latest
# Create the array to hold content then do path if given.
[byte[]] $bytes = $null
if ($Path) {
$bytes = [IO.File]::ReadAllBytes((Resolve-Path $Path))
Pf-Dump-Hex-Item $bytes
}
}
process {
# Process each object (input/pipe).
if ($object) {
foreach ($obj in $object) {
if ($obj -is [Byte]) {
$bytes = $obj
} else {
$inpStr = [string] $obj
$bytes = [Text.Encoding]::Unicode.GetBytes($inpStr)
}
Pf-Dump-Hex-Item $bytes
}
}
}
}
If you load that into a Powershell session then run:
pf-dump-hex Labels_ar.properties
that should allow you to evaluate the file encoding.
I think there are two possible problems:
1- I'm not sure whether System.out.println() can print Arabic characters on your console, so try another output method such as MessageBox.show() to confirm that the problem really is in reading the file.
2- If MessageBox.show() shows the same result, the problem is probably the charset; you can try UTF-8 or something else.

Searching many large text files in Powershell

I frequently have to search server log files in a directory that may contain 50 or more files of 200+ MB each. I've written a function in Powershell to do this searching. It finds and extracts all the values for a given query parameter. It works great on an individual large file or a collection of small files but totally bites in the above circumstance, a directory of large files.
The function takes a parameter, which consists of the query parameter to be searched.
In pseudo-code:
Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matchers to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining:
if (hash has keys)
enumerate the hash keys,
sort and append to string array
set-content the string array to a file
print summary to console
exit
else
print summary to console
exit
Here's a stripped-down version of the file processing.
$wtmatches = @{};
gci -Filter *.log | Select-String -Pattern $searcher |
%{ $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect(); }
I'm just using an old perl trick of de-duplicating found items by making them the keys of a hash. Perhaps, this is an error, but a typical output of the processing is going to be around 30,000 items at most. More normally, found items is in the low thousands range. From what I can see, the number of keys in the hash does not affect processing time, it is the size and number of the files that breaks it. I recently threw in the GC in desperation, it does have some positive effect but it is marginal.
The issue is that with the large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. It doesn't actually use a lot of CPU, interestingly, but there's a lot of volatile storage going on. Once the RAM usage has gone up over 90%, I can just punch out and go watch TV. It could take hours to complete the processing to produce a file with 15,000 or 20,000 unique values.
I would like advice and/or suggestions for increasing the efficiency, even if that means using a different paradigm to accomplish the processing. I went with what I know. I use this tool on almost a daily basis.
Oh, and I'm committed to using Powershell. ;-) This function is part of a complete module I've written for my job, so, suggestions of Python, perl or other useful languages are not useful in this case.
Thanks.
mp
Update:
Using latkin's ProcessFile function, I used the following wrapper for testing. His function is orders of magnitude faster than my original.
function Find-WtQuery {
<#
.Synopsis
Takes a parameter with a capture regex and a wildcard for files list.
.Description
This function is intended to be used on large collections of large files that have
the potential to take an unacceptably long time to process using other methods. It
requires that a regex capture group be passed in as the value to search for.
.Parameter Target
The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).
.Parameter Files
The file wildcard to search, e.g. '*.log'
.Outputs
An object with an array of unique values and a count of total matched lines.
#>
param(
[Parameter(Mandatory = $true)] [string] $target,
[Parameter(Mandatory = $false)] [string] $files
)
begin{
$stime = Get-Date
}
process{
$results = gci -Filter $files | ProcessFile -Pattern $target -Group 1;
}
end{
$etime = Get-Date;
$ptime = $etime - $stime;
Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f (gci -Filter $files).Count, $ptime.Hours,$ptime.Minutes,$ptime.Seconds);
return $results;
}
}
The output:
clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.
Thanks to all for comments and help.
Here's a function that will hopefully speed up and reduce the memory impact of the file processing part. It will return an object with 2 properties: The total count of lines matched, and a sorted array of unique strings from the match group specified. (From your description it sounds like you don't really care about the count per string, just the string values themselves)
function ProcessFile
{
param(
[Parameter(ValueFromPipeline = $true, Mandatory = $true)]
[System.IO.FileInfo] $File,
[Parameter(Mandatory = $true)]
[string] $Pattern,
[Parameter(Mandatory = $true)]
[int] $Group
)
begin
{
$regex = new-object Regex @($pattern, 'Compiled')
$set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
$totalCount = 0
}
process
{
try
{
$reader = new-object IO.StreamReader $_.FullName
while( ($line = $reader.ReadLine()) -ne $null)
{
$m = $regex.Match($line)
if($m.Success)
{
$set[$m.Groups[$group].Value] = 1
$totalCount++
}
}
}
finally
{
$reader.Close()
}
}
end
{
new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
}
}
You can use it like this:
$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt
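The same streaming-plus-deduplication idea can be sketched more compactly with `[System.IO.File]::ReadLines`, which yields one line at a time, and a `HashSet[string]` for the unique values. `Get-UniqueCapture` and its parameter names are hypothetical, and the paths are assumed to be full paths.

```powershell
function Get-UniqueCapture {
    param(
        [Parameter(Mandatory)][string[]]$Path,    # full paths of the files to scan
        [Parameter(Mandatory)][string]$Pattern,   # regex containing a capture group
        [int]$Group = 1
    )
    $regex = [regex]::new($Pattern, [System.Text.RegularExpressions.RegexOptions]::Compiled)
    $unique = [System.Collections.Generic.HashSet[string]]::new()
    $total = 0
    foreach ($file in $Path) {
        # ReadLines enumerates lazily, so only one line is held at a time.
        foreach ($line in [System.IO.File]::ReadLines($file)) {
            $m = $regex.Match($line)
            if ($m.Success) {
                $null = $unique.Add($m.Groups[$Group].Value)   # Add returns a bool; discard it
                $total++
            }
        }
    }
    [pscustomobject]@{
        TotalCount = $total
        Unique     = @($unique | Sort-Object)
    }
}
```

A call might look like `Get-UniqueCapture -Path (Get-ChildItem *.log).FullName -Pattern 'WT\.ets=([^ &]+)'`. Like the function above, it counts at most one match per line.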
IMO @latkin's approach is the way to go if you want to do this within PowerShell and not use some dedicated tool. I made a few changes though to make the command play better with respect to accepting pipeline input. I also modified the regex to search for all matches on a particular line. Neither approach searches across multiple lines although that scenario would be pretty easy to handle as long as the pattern only ever spanned a few lines. Here's my take on the command (put it in a file called Search-File.ps1):
[CmdletBinding(DefaultParameterSetName="Path")]
param(
[Parameter(Mandatory=$true, Position=0)]
[ValidateNotNullOrEmpty()]
[string]
$Pattern,
[Parameter(Mandatory=$true, Position=1, ParameterSetName="Path",
ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true,
HelpMessage="Path to ...")]
[ValidateNotNullOrEmpty()]
[string[]]
$Path,
[Alias("PSPath")]
[Parameter(Mandatory=$true, Position=1, ParameterSetName="LiteralPath",
ValueFromPipelineByPropertyName=$true,
HelpMessage="Path to ...")]
[ValidateNotNullOrEmpty()]
[string[]]
$LiteralPath,
[Parameter()]
[ValidateRange(0, [int]::MaxValue)]
[int]
$Group = 0
)
Begin
{
Set-StrictMode -Version latest
$count = 0
$matched = @{}
$regex = New-Object System.Text.RegularExpressions.Regex $Pattern,'Compiled'
}
Process
{
if ($psCmdlet.ParameterSetName -eq "Path")
{
# In the -Path (non-literal) case we may need to resolve a wildcarded path
$resolvedPaths = @($Path | Resolve-Path | Convert-Path)
}
else
{
# Must be -LiteralPath
$resolvedPaths = @($LiteralPath | Convert-Path)
}
foreach ($rpath in $resolvedPaths)
{
Write-Verbose "Processing $rpath"
$stream = new-object System.IO.FileStream $rpath,'Open','Read','Read',4096
$reader = new-object System.IO.StreamReader $stream
try
{
while (($line = $reader.ReadLine())-ne $null)
{
$matchColl = $regex.Matches($line)
foreach ($match in $matchColl)
{
$count++
$key = $match.Groups[$Group].Value
if ($matched.ContainsKey($key))
{
$matched[$key]++
}
else
{
$matched[$key] = 1;
}
}
}
}
finally
{
$reader.Close()
}
}
}
End
{
new-object psobject -Property @{TotalCount = $count; Matched = $matched}
}
I ran this against my IIS log dir (8.5 GB and ~1000 files) to find all the IP addresses in all the logs e.g.:
$r = ls . -r *.log | C:\Users\hillr\Search-File.ps1 '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
This took 27 minutes on my system and found 54356330 matches:
$r.Matched.GetEnumerator() | sort Value -Descending | select -f 20
Name Value
---- -----
xxx.140.113.47 22459654
xxx.29.24.217 13430575
xxx.29.24.216 13321196
xxx.140.113.98 4701131
xxx.40.30.254 53724