I am not sure I am phrasing this correctly so I'd rather show.
I am trimming a string in this way:
$input = '12345'
$string = $input.Substring(1,$string.Length-1)
The idea is to remove the first and the final character. It works fine on the first run. On the second run the length is already -1 so two characters are actually trimmed.
However I want the script to always deduct the final character (5) even after the first run. How do I reset it ?
Thank you.
The second parameter of Substring is the length of the substring, not the ending index. Hence you want the string to be 2 characters shorter:
$inputstring = "12345"
$string = $inputstring
while ($string.Length -gt 2)
{
$string
$string = $string.Substring(1,$string.Length-2)
}
$string
This outputs:
12345
234
3
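As a minimal sketch of the same idea without the loop (the variable names here are only illustrative): if the source string is never overwritten, the length passed to Substring stays the same on every run, so the result does not shrink.

```powershell
$inputString = '12345'

# Substring(startIndex, length): start at index 1 and keep Length - 2 characters,
# which drops exactly the first and the last character.
$trimmed = $inputString.Substring(1, $inputString.Length - 2)

# $inputString was not modified, so repeating the call gives the same result.
$trimmedAgain = $inputString.Substring(1, $inputString.Length - 2)

$trimmed        # 234
$trimmedAgain   # 234
```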
Is there a way to determine whether a specified file contains a specified byte array (at any position) in powershell?
Something like:
fgrep --binary-files=binary "$data" "$filepath"
Of course, I can write a naive implementation:
function posOfArrayWithinArray {
param ([byte[]] $arrayA, [byte[]]$arrayB)
if ($arrayB.Length -ge $arrayA.Length) {
foreach ($pos in 0..($arrayB.Length - $arrayA.Length)) {
if ([System.Linq.Enumerable]::SequenceEqual(
$arrayA,
[System.Linq.Enumerable]::Skip($arrayB, $pos).Take($arrayA.Length)
)) {return $pos}
}
}
-1
}
function posOfArrayWithinFile {
param ([byte[]] $array, [string]$filepath)
posOfArrayWithinArray $array (Get-Content $filepath -Raw -AsByteStream)
}
# They return the position or -1, but a simple $false/$true is also enough for me.
— but it's extremely slow.
Sorry for the additional answer. It is not usual to do so, but the universal question intrigues me, and the approach and information in my initial "using -Like" answer are completely different. By the way, if you are waiting for a positive response to "I believe that it must exist in .NET" before accepting an answer, that is probably not going to happen; the same quest exists in StackOverflow searches in combination with C#, .NET or LINQ.
Anyway, given that nobody has been able to find the single assumed .NET command for this so far, it is quite understandable that several semi-.NET solutions are being proposed instead, but I believe that this will cause some undesired overhead for a universal function.
Assume that your ByteArray (the byte array being searched) and SearchArray (the byte array to search for) are completely random. There is only a 1/256 chance that a given byte in the ByteArray matches the first byte of the SearchArray. If it does not match, you don't have to look any further; if it does match, the chance that the second byte matches as well is 1/256^2, etc. This means that the inner loop will only run about 1.004 times as often as the outer loop. In other words, the performance of everything outside the inner loop (but in the outer loop) is almost as important as what is in the inner loop!
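A rough check of that 1.004 figure, assuming uniformly random and independent bytes: the k-th comparison of the inner loop (counting from 0) only happens when the previous k bytes all matched, which has probability (1/256)^k. Summing those probabilities gives the expected number of comparisons per outer-loop position. The variable names are only illustrative.

```powershell
$matchChance  = 1 / 256   # chance that one random byte equals another
$searchLength = 500       # assumed length of the search array

$expectedComparisons = 0.0
for ($k = 0; $k -lt $searchLength; $k++) {
    # Comparison number k is reached only if the k bytes before it all matched.
    $expectedComparisons += [math]::Pow($matchChance, $k)
}

[math]::Round($expectedComparisons, 4)   # 1.0039
```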
Note that this also implies that the chance that a 500Kb random sequence exists in a 100Mb random sequence is virtually zero. (So, how random are your given binary sequences actually? If they are far from random, I think you need to add some more details to your question.) A worst-case scenario for my assumption would be a ByteArray consisting of identical bytes (e.g. 0, 0, 0, ..., 0, 0, 0) and a SearchArray of the same bytes ending with a different byte (e.g. 0, 0, 0, ..., 0, 0, 1).
Based on this, it shows again (I have also shown this in some other answers) that native PowerShell commands aren't that bad and can possibly even outperform .NET/LINQ commands in some cases. In my testing, the Find-Bytes function below is between 20% faster and twice as fast as the function in your question:
Find-Bytes
Returns the index of where the -Search byte sequence is found in the -Bytes byte sequence. If the search sequence is not found a $Null ([System.Management.Automation.Internal.AutomationNull]::Value) is returned.
Parameters
-Bytes
The byte array to be searched
-Search
The byte array to search for
-Start
Defines where to start searching in the Bytes sequence (default: 0)
-All
By default, only the first index found will be returned. Use the -All switch to return the remaining indexes of any other search sequences found.
Function Find-Bytes([byte[]]$Bytes, [byte[]]$Search, [int]$Start, [Switch]$All) {
For ($Index = $Start; $Index -le $Bytes.Length - $Search.Length ; $Index++) {
For ($i = 0; $i -lt $Search.Length -and $Bytes[$Index + $i] -eq $Search[$i]; $i++) {}
If ($i -ge $Search.Length) {
$Index
If (!$All) { Return }
}
}
}
Usage example:
$a = [byte[]]("the quick brown fox jumps over the lazy dog".ToCharArray())
$b = [byte[]]("the".ToCharArray())
Find-Bytes -all $a $b
0
31
Benchmark
Note that you should open a new PowerShell session to properly benchmark this, as LINQ uses a large cache that probably doesn't apply to your use case.
$a = [byte[]](&{ foreach ($i in (0..500Kb)) { Get-Random -Maximum 256 } })
$b = [byte[]](&{ foreach ($i in (0..500)) { Get-Random -Maximum 256 } })
Measure-Command {
$y = Find-Bytes $a $b
}
Measure-Command {
$x = posOfArrayWithinArray $b $a
}
The below code may prove to be faster, but you will have to test that out on your binary files:
function Get-BinaryText {
# converts the bytes of a file to a string that has a
# 1-to-1 mapping back to the file's original bytes.
# Useful for performing binary regular expressions.
Param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
[ValidateScript( { Test-Path $_ -PathType Leaf } )]
[Alias('FullName','FilePath')]
[string]$Path
)
$Stream = New-Object System.IO.FileStream -ArgumentList $Path, 'Open', 'Read'
# Note: Codepage 28591 returns a 1-to-1 char to byte mapping
$Encoding = [Text.Encoding]::GetEncoding(28591)
$StreamReader = New-Object System.IO.StreamReader -ArgumentList $Stream, $Encoding
$BinaryText = $StreamReader.ReadToEnd()
$Stream.Dispose()
$StreamReader.Dispose()
return $BinaryText
}
# enter the byte array to search for here
# for demo, I'll use 'SearchMe' in bytes
[byte[]]$searchArray = 83,101,97,114,99,104,77,101
# create a regex from the $searchArray bytes
# 'SearchMe' --> '\x53\x65\x61\x72\x63\x68\x4D\x65'
$searchString = ($searchArray | ForEach-Object { '\x{0:X2}' -f $_ }) -join ''
$regex = [regex]$searchString
# read the file as binary string
$binString = Get-BinaryText -Path 'D:\test.bin'
# use regex to return the 0-based starting position of the search string
# return -1 if not found
$found = $regex.Match($binString)
if ($found.Success) { $found.Index } else { -1}
Just formalizing my comments and agreeing with your comment:
I dislike the idea of converting byte sequences to character sequences
at all (I'd better have functionality to match byte (or other)
sequences as they are), among the
conversion-to-character-strings-implying solutions this seems to be
one of the quickest
Performance
String manipulations are usually expensive, but re-initializing a LINQ call is apparently pretty expensive as well. I guess you may presume that the native algorithms behind PowerShell's string representation and operators like -Like have been thoroughly optimized by now.
Memory
Aside from some well-founded performance disadvantages, there is a memory disadvantage as well in converting each byte to a decimal string representation. In the proposed solution, each byte will take an average of 2.57 bytes (depending on the number of decimal digits of each byte: (1 * 10 / 256) + (2 * 90 / 256) + (3 * 156 / 256)). In addition, you need an extra byte to separate the numeric representations. In total, this will make the sequence about 3.57 times as long.
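The 2.57 average can be checked with a short sketch that counts the decimal digits of every possible byte value (the variable names are only illustrative):

```powershell
# Number of decimal digits needed for each possible byte value 0..255.
$digitCounts = 0..255 | ForEach-Object { ([string]$_).Length }

$averageDigits = ($digitCounts | Measure-Object -Average).Average
$withSeparator = $averageDigits + 1   # one separator per byte value

$averageDigits   # 2.5703125
$withSeparator   # 3.5703125
```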
You might consider saving bytes by e.g. converting it to hexadecimal and/or combine the separator, but that will likely result in an expensive conversion again.
Easy
Anyways, the easy way is probably still the most effective.
This comes down to the following simplified syntax:
" $Sequence " -Like "* $SubSequence *" # $True if $Sequence contains $SubSequence
(Where $Sequence and $SubSequence are binary arrays of type: [Byte[]])
Note 1: the spaces around the variables are important. This will prevent a false positive in case a 1 (or 2) digit byte representation overlaps with a 2 (or 3) digit byte representation. E.g.: 123 59 74 contains 23 59 7 in the string representation but not in the actual bytes.
Note 2: This syntax will only tell you whether $Sequence contains $SubSequence ($True or $False). It gives no clue as to where $SubSequence actually resides in $Sequence. If you need to know this, or e.g. want to replace $SubSequence with something else, refer to this answer: Methods to hex edit binary files via PowerShell.
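A small sketch of Note 1, using the byte values from that example. It relies on PowerShell joining array elements with a space when an array is expanded inside a double-quoted string (the default $OFS); the variable names for the two results are only illustrative.

```powershell
[byte[]]$Sequence    = 123, 59, 74
[byte[]]$SubSequence = 23, 59, 7

# Without the padding spaces, "23 59 7" is found inside "123 59 74": a false positive.
$falsePositive = "$Sequence" -like "*$SubSequence*"

# With the padding spaces, every number has to match as a whole.
$padded = " $Sequence " -like "* $SubSequence *"

$falsePositive   # True
$padded          # False
```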
I've determined that the following can work as a workaround:
(Get-Content $filepath -Raw -Encoding 28591).IndexOf($fragment)
— i.e. any bytes can be successfully matched by PowerShell strings (in fact, .NET System.Strings) when we specify binary-safe encoding. Of course, we need to use the same encoding for both the file and fragment, and the encoding must be really binary-safe (e.g. 1250, 1000 and 28591 fit, but various species of Unicode (including the default BOM-less UTF-8) don't, because they convert any non-well-formed code-unit to the same replacement character (U+FFFD)). Thanks to Theo for clarification.
On older PowerShell, you can use:
[System.Text.Encoding]::GetEncoding(28591).
GetString([System.IO.File]::ReadAllBytes($filepath)).
IndexOf($fragment)
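A minimal sketch of the same idea on in-memory data, assuming the fragment is available as a byte array. The sample bytes and variable names are made up for illustration. Passing [System.StringComparison]::Ordinal is my own addition: IndexOf with a plain string argument compares culture-sensitively, which may treat some characters as ignorable, whereas an ordinal comparison matches code unit by code unit.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding(28591)

[byte[]]$fileBytes     = 0..255          # stand-in for the file contents
[byte[]]$fragmentBytes = 200, 201, 202   # byte sequence to look for

# Decode both sides with the same binary-safe encoding.
$haystack = $latin1.GetString($fileBytes)
$needle   = $latin1.GetString($fragmentBytes)

# The ordinal comparison keeps the search a plain char-by-char match.
$position = $haystack.IndexOf($needle, [System.StringComparison]::Ordinal)
$position   # 200

# Encoding the string again gives back the original bytes.
$roundTrip = $latin1.GetBytes($haystack)
```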
Sadly, I haven't found a way to match sequences universally (i.e. a common method to match sequences with any item type: integer, object, etc). I believe that it must exist in .NET (especially since a particular implementation for sequences of characters exists). Hopefully, someone will suggest it.
For example, if the string is blahblah02baboon, I need to get the "baboon" separated from the rest, so that the variable contains only the characters "baboon". Every string I need to do this with has alphabet characters first, then 2 numbers, then more alphabet characters, so it should be the same process every time.
Any advice would be greatly appreciated.
My advice is to learn about regular expressions.
'blahblah02baboon' -replace '\D*\d*(\w*)', '$1'
Or use regex
$MyString = "01baaab01blah02baboon"
# Match any character which is not a digit
$Result = [regex]::matches($MyString, "\D+")
# Take the last result
$LastResult = $Result[$Result.Count-1].Value
# Output
Write-Output "My last result = $LastResult"
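Since every string follows the same "letters, two digits, letters" layout, another option is to anchor the pattern and capture the trailing letters with -match. This is only a sketch: the pattern assumes plain ASCII letters, and the variable names are illustrative.

```powershell
$text = 'blahblah02baboon'

# Letters, then exactly two digits, then the trailing letters to capture.
if ($text -match '^[a-zA-Z]+\d{2}([a-zA-Z]+)$') {
    $suffix = $Matches[1]
}

$suffix   # baboon
```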
I have a filename and I wish to extract two portions of this and add into variables so I can compare if they are the same.
$name = 'FILE_20161012_054146_Import_5785_1234.xml'
So I want...
$a = 5785
$b = 1234
if ($a -eq $b) {
# do stuff
}
I have tried to extract the 36th up to the 39th character
Select-Object {$_.Name[35,36,37,38]}
but I get
{5, 7, 8, 5}
Have considered splitting but looks messy.
There are several ways to do this. One of the most straightforward, as PetSerAl suggested, is with .Substring():
$_.name.Substring(35,4)
Another way is with square brackets, as you tried to do, but that gives you an array of [char] objects, not a string. You can use -join to turn it back into a string, and a range to make the indexing easier:
$_.name[35..38] -join ''
For what you're doing, matching a pattern, you could also use a regular expression with capturing groups:
if ($_.name -match '_(\d{4})_(\d{4})\.xml$') {
if ($Matches[1] -eq $Matches[2]) {
# ...
}
}
This way can be very powerful, but you need to learn more about regex if you're not familiar. In this case it's looking for an underscore _ followed by 4 digits (0-9), followed by an underscore, and four more digits, followed by .xml at the end of the string. The digits are wrapped in parentheses so they are captured separately to be referenced later (in $Matches).
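Applied to the file name from the question, the pattern behaves as in the sketch below. Here $name, $first, $second and $same are illustrative variables, whereas the answer above works on $_.name in a pipeline.

```powershell
$name = 'FILE_20161012_054146_Import_5785_1234.xml'

if ($name -match '_(\d{4})_(\d{4})\.xml$') {
    $first  = $Matches[1]   # first captured group
    $second = $Matches[2]   # second captured group
    $same   = $first -eq $second
}

$first    # 5785
$second   # 1234
$same     # False
```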
Yet another set of approaches: each of the four expressions below returns the substring 1234.
$FileName = "FILE_20161012_054146_Import_5785_1234.xml"
# $FileName
$FileName.Substring(33,4) # Substring method (zero-based)
-join $FileName[33..36] # indexing from beginning (zero-based)
-join $FileName[-8..-5] # reverse indexing:
# e.g. $FileName[-1] returns the last character
$FileArr = $FileName.Split("_.") # Split (depends only on filename "pattern template")
$FileArr[$FileArr.Count -2] # does not depend on lengths of tokens
I am having an issue with my PowerShell Program counting the number of sentences in a file I am using. I am using the following code:
foreach ($Sentence in (Get-Content file))
{
$i = $Sentence.Split("?")
$n = $Sentence.Split(".")
$Sentences += $i.Length
$Sentences += $n.Length
}
The total number of sentences I should get is 61 but I am getting 71, could someone please help me out with this? I have Sentences set to zero as well.
Thanks
foreach ($Sentence in (Get-Content file))
{
# -split takes a regular expression; this class matches either terminator
$i = $Sentence -split "[?\.]"
# n terminators yield n + 1 fragments, so subtract 1 and accumulate per line
$Sentences += $i.Length - 1
}
I edited your code a bit.
The .Split() method treats its argument as literal separator characters, so a pattern such as "[?\.]" has to go through the -split operator, which takes a regular expression. In a regular expression, an unescaped . outside a character class means "any character", so it needs to be escaped there.
So you should split the string with -split "[?\.]" or similar.
When counting sentences, what you are looking for is where each sentence ends. Splitting, though, returns a collection of sentence fragments around those end characters, with the ends themselves represented by the gap between elements. Therefore, the number of sentences will equal the number of gaps, which is one less than the number of fragments in the split result.
Of course, as Keith Hill pointed out in a comment above, the actual splitting is unnecessary when you can count the ends directly.
foreach( $Sentence in (Get-Content test.txt) ) {
# Split at every occurrence of '.' and '?', and count the gaps.
$Split = $Sentence.Split( [char[]]'.?' )
$SplitSentences += $Split.Count - 1
# Count every occurrence of '.' and '?'.
$Ends = [char[]]$Sentence -match '[.?]'
$CountedSentences += $Ends.Count
}
Contents of test.txt file:
Is this a sentence? This is a
sentence. Is this a sentence?
This is a sentence. Is this a
very long sentence that spans
multiple lines?
Also, to clarify on the remarks to Vasili's answer: the PowerShell -split operator interprets a string as a regular expression by default, while the .NET Split method only works with literal string values.
For example:
'Unclosed [bracket?' -split '[?]' will treat [?] as a regular expression character class and match the ? character, returning the two strings 'Unclosed [bracket' and ''
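'Unclosed [bracket?'.Split( '[?]' ) will, in Windows PowerShell, call the Split(char[]) overload and match each [, ?, and ] character, returning the three strings 'Unclosed ', 'bracket', and '' (PowerShell 6 and later bind a string argument to a Split(string) overload instead, so cast the argument to [char[]] there to get the same result)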
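The two cases above can be compared side by side with the sketch below. The explicit [char[]] cast is my own addition, so that the .NET call splits on the individual characters regardless of PowerShell version. The variable names are only illustrative.

```powershell
$text = 'Unclosed [bracket?'

# Regex split: [?] is a character class that matches only '?'.
$regexParts = $text -split '[?]'

# Literal split on each of the characters '[', '?' and ']'.
$literalParts = $text.Split([char[]]'[?]')

$regexParts.Count     # 2
$literalParts.Count   # 3
```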