I have created a script that loops through an array and excludes any variables that are found within a second array.
While the code works, it got me wondering whether it could be simplified or piped.
$result = @()
$ItemArray = @("a","b","c","d")
$exclusionArray = @("b","c")
foreach ($Item in $ItemArray)
{
    $matchFailover = $false
    :gohere
    foreach ($ExclusionItem in $exclusionArray)
    {
        if ($Item -eq $ExclusionItem)
        {
            Write-Host "Match: $Item = $ExclusionItem"
            $matchFailover = $true
            break :gohere
        }
        else {
            Write-Host "No Match: $Item != $ExclusionItem"
        }
    }
    if (!($matchFailover))
    {
        Write-Host "Adding $Item to results"
        $result += $Item
    }
}
Write-Host "`nResults are"
$result
To give your task a name: You're looking for the relative complement aka set difference between two arrays:
In set-theory notation, it would be $ItemArray \ $ExclusionArray, i.e., those elements in $ItemArray that aren't also in $ExclusionArray.
This related question is looking for the symmetric difference between two sets, i.e., the set of elements that are unique to either side - at least, that's what the Compare-Object-based solutions there implement, but only under the assumption that each array has no duplicates.
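For reference, here is a Compare-Object-based sketch of the relative complement itself (an illustration only, assuming neither array contains duplicates; $ItemArray and $exclusionArray as defined in the question):
Compare-Object -ReferenceObject $ItemArray -DifferenceObject $exclusionArray |
  Where-Object SideIndicator -eq '<=' |   # '<=' marks elements present only in the reference (left) array
  ForEach-Object InputObject              # -> 'a', 'd'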
EylM's helpful answer is conceptually simple and concise.
A potential problem is performance: a lookup in the exclusion array must be performed for each element in the input array.
With small arrays, this likely won't matter in practice.
With larger arrays, LINQ offers a substantially faster solution:
Note: In order to benefit from the LINQ solution, your arrays should be in memory already, and the benefit is greater the larger the exclusion array is. If your input is streaming via the pipeline, the overhead from executing the pipeline may make attempts to optimize array processing pointless or even counterproductive, in which case sticking with the native PowerShell solution makes sense - see iRon's answer.
# Declare the arrays as [string[]]
# so that calling the LINQ method below works as-is.
# (You could also cast to [string[]] ad hoc.)
[string[]] $ItemArray = 'a','b','c','d'
[string[]] $exclusionArray = 'b','c'
# Return only those elements in $ItemArray that aren't also in $exclusionArray
# and convert the result (a lazy enumerable of type [IEnumerable[string]])
# back to an array to force its evaluation
# (If you directly enumerate the result in a pipeline, that step isn't needed.)
[string[]] [Linq.Enumerable]::Except($ItemArray, $exclusionArray) # -> 'a', 'd'
Note the need to use the LINQ types explicitly, via their static methods, because PowerShell, as of v7, has no support for extension methods.
However, there is a proposal on GitHub to add such support; this related proposal asks for improved support for calling generic methods.
See this answer for an overview of how to currently call LINQ methods from PowerShell.
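As a further illustration of this static-call pattern, here is a sketch reusing the arrays declared above (Intersect returns the elements common to both arrays; the generic type parameter is inferred from the [string[]]-typed arguments):
[string[]] [Linq.Enumerable]::Intersect($ItemArray, $exclusionArray) # -> 'b', 'c'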
Performance comparison:
Tip of the hat to iRon for his input.
The following benchmark code uses the Time-Command function to compare the two approaches, using arrays with roughly 4000 and 2000 elements, respectively, which - as in the question - differ by only 2 elements.
Note that in order to level the playing field, the .Where() array method (PSv4+) is used instead of the pipeline-based Where-Object cmdlet, as .Where() is faster with arrays already in memory.
Here are the results, averaged over 10 runs, from a single-core Windows 10 VM running Windows PowerShell v5.1; note the relative performance, as shown in the Factor column:
Factor Secs (10-run avg.) Command TimeSpan
------ ------------------ ------- --------
1.00 0.046 # LINQ... 00:00:00.0455381
8.40 0.382 # Where ... -notContains... 00:00:00.3824038
The LINQ solution is substantially faster - by a factor of 8+ (though even the much slower solution only took about 0.4 seconds to run).
It seems that the performance gap is even wider in PowerShell Core, where I've seen a factor of around 19 with v7.0.0-preview.4; interestingly, both tests ran faster individually than in Windows PowerShell.
Benchmark code:
# Script block to initialize the arrays.
# The filler arrays are randomized to eliminate caching effects in LINQ.
$init = {
    $fillerArray = 1..1000 | Get-Random -Count 1000
    [string[]] $ItemArray = $fillerArray + 'a' + $fillerArray + 'b' + $fillerArray + 'c' + $fillerArray + 'd'
    [string[]] $exclusionArray = $fillerArray + 'b' + $fillerArray + 'c'
}
# Compare the average of 10 runs.
Time-Command -Count 10 { # LINQ
    . $init
    $result = [string[]] [Linq.Enumerable]::Except($ItemArray, $exclusionArray)
}, { # Where ... -notContains
    . $init
    $result = $ItemArray.Where({ $exclusionArray -notcontains $_ })
}
You can use Where-Object with -notcontains:
$ItemArray | Where-Object { $exclusionArray -notcontains $_ }
Output:
a, d
Advocating native PowerShell:
As per @mklement0's answer, there is no doubt that Language Integrated Query (LINQ) is fast...
But in some circumstances, native PowerShell commands using the pipeline, as suggested by @EylM, can still beat LINQ. This is not just theoretical but can happen in use cases where the process concerned is idle and waiting for slow input, e.g. where the input comes from:
A remote server (e.g. Active Directory)
A slow device
A separate thread that has to make a complex calculation
The internet ...
Although I haven't seen an easy proof of this yet, it is suggested at several sites and can be deduced from sites such as High Performance PowerShell with LINQ and Ins and Outs of the PowerShell Pipeline.
Proof
To prove the above thesis, I have created a small Slack cmdlet that slows down each item dropped into the pipeline by 1 millisecond (by default):
Function Slack-Object ($Delay = 1) {
    process {
        Start-Sleep -Milliseconds $Delay
        Write-Output $_
    }
}; Set-Alias Slack Slack-Object
Now let's see if native PowerShell can actually beat LINQ:
(To get a good performance comparison, caches should be cleared by e.g. starting a fresh PowerShell session.)
[string[]] $InputArray = 1..200
[string[]] $ExclusionArray = 100..300
(Measure-Command {
    $Result = [Linq.Enumerable]::Except([string[]] ($InputArray | Slack), $ExclusionArray)
}).TotalMilliseconds
(Measure-Command {
    $Result = $InputArray | Slack | Where-Object {$ExclusionArray -notcontains $_}
}).TotalMilliseconds
Results:
LINQ: 411,3721
PowerShell: 366,961
To exclude the LINQ cache, a single-run test should be done, but as commented by @mklement0, the results of single runs might vary from run to run.
The results also highly depend on the size of the input arrays, the size of the result, the slack, the test system, etc.
Conclusion:
PowerShell might still be faster than LINQ in some scenarios!
Quoting mklement0's comment:
"Overall, it's fair to say that the difference in performance is so small in this scenario that it's not worth picking the approach based on performance - and it makes sense to go with the more PowerShell-like approach (Where-Object), given that the LINQ approach is far from obvious. The bottom line is: choose LINQ only if you have large arrays that are already in memory. If the pipeline is involved, the pipeline overhead alone may make optimizations pointless."
Related
I have a foreach loop that currently puts three entries in my hashtable:
$result = foreach($key in $serverSpace.Keys){
    if($serverSpace[$key] -lt 80){
        [pscustomobject]@{
            Server = $key
            Space = $serverSpace[$key]}}}
When I use
$result.count
I get 3 as expected.
I changed the foreach loop to exclude the entries less than or equal to one using
$result = foreach($key in $serverSpace.Keys){
    if($serverSpace[$key] -lt 80 -and $serverSpace[$key] -gt 1){
        [pscustomobject]@{
            Server = $key
            Space = $serverSpace[$key]}}}
$result.count should have 1 as its output, but it doesn't recognize .count as a suggested command and $result.count doesn't output anything anymore. I'm assuming when there's only one entry in the hash table it won't allow a count? Not sure what's going on, but my conditions for my script are dependent on the count of $result. Any help would be appreciated.
$result is not a hashtable, so I wrapped it in @(...), making it @($result).count. Thank you to @Theo and @Lee_Dailey.
What you're seeing is a bug in Windows PowerShell (as of the latest and final version, 5.1), which has since been corrected in PowerShell (Core) - see GitHub issue #3671 for the original bug report.
That is, since v3 all objects should have an intrinsic .Count property, not just collections, in the interest of unified treatment of scalars and collections - see this answer for more information.
The workaround for Windows PowerShell is indeed to force a value to be an array via @(...), the array-subexpression operator, which is guaranteed to have a .Count property, as shown in your answer, but it shouldn't be necessary and indeed isn't anymore in PowerShell (Core, v6+):
# !! Due to a BUG, this outputs $null in *Windows PowerShell*,
# !! but correctly outputs 1 in PowerShell (Core).
([pscustomobject] @{}).Count

# Workaround for Windows PowerShell that is effective in *both* editions,
# though potentially wasteful in PowerShell (Core):
@([pscustomobject] @{}).Count
Suppose I have a process that generates a collection of objects. For a very simple example, consider $(1 | get-member). I can get the number of objects generated:
PS C:\WINDOWS\system32> $(1 | get-member).count
21
or I can do something with those objects.
PS C:\WINDOWS\system32> $(1 | get-member) | ForEach-object {write-host $_.name}
CompareTo
Equals
...
With only 21 objects, doing the above is no problem. But what if the process generates hundreds of thousands of objects? Then I don't want to run the process once just to count the objects and then run it again to execute what I want to do with them. So how can I get a count of objects in a collection sent down the pipeline?
A similar question was asked before, and the accepted answer was to use a counter variable inside the script block that works on the collection. The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct. So I don't want to just count inside the script block. I want a separate, independent measure of the size of the collection that I sent down the pipeline. How can I do that?
If processing and counting is needed:
Doing your own counting inside a ForEach-Object script block is your best bet to avoid processing in two passes.
The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct.
ForEach-Object is reliably invoked for each and every input object, including $null values, so there should be no need to double-check.
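A quick sanity check (a minimal sketch; the $seen variable is purely illustrative):
$seen = 0
1, $null, 2 | ForEach-Object { $seen++ }  # the script block fires for $null too
$seen  # -> 3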
If you want a cleaner separation of processing and counting, you can pass multiple -Process script blocks to ForEach-Object (in this example, { $_ + 1 } is the input-processing script block and { ++$count } is the input-counting one):
PS> 1..5 | ForEach-Object -Begin { $count = 0 } `
-Process { $_ + 1 }, { ++$count } `
-End { "--- count: $count" }
2
3
4
5
6
--- count: 5
Note that, due to a quirk in ForEach-Object's parameter binding, passing -Begin and -End script blocks is actually required in order to pass multiple -Process (per-input-object) blocks; pass $null if you don't actually need -Begin and/or -End - see GitHub issue #4513.
Also note that the $count variable lives in the caller's scope, and is not scoped to the ForEach-Object call; that is, $count = 0 potentially updates a preexisting $count variable, and, if it didn't previously exist, lives on after the ForEach-Object call.
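For instance, the following sketch passes $null placeholders for -Begin and -End, because neither initialization nor finalization is needed here (only counting alongside a trivial doubling transformation):
$count = 0
$processed = 1..3 | ForEach-Object -Begin $null -Process { $_ * 2 }, { ++$count } -End $null
"$processed / count: $count"  # -> 2 4 6 / count: 3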
If only counting is needed:
Measure-Object is the cmdlet to use with large, streaming input collections in the pipeline[1]:
The following example generates 100,000 integers one by one and has Measure-Object count them one by one, without collecting the entire input in memory.
PS> (& { $i=0; while ($i -lt 1e5) { (++$i) } } | Measure-Object).Count
100000
Caveat: Measure-Object ignores $null values in the input collection - see GitHub issue #10905.
Note that while counting input objects is Measure-Object's default behavior, it supports a variety of other operations as well, such as summing (-Sum) and averaging (-Average), optionally combined in a single invocation.
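For instance (a quick illustration; Count, Sum, and Average are properties of Measure-Object's output object):
1..5 | Measure-Object -Sum -Average | Select-Object Count, Sum, Average  # -> Count: 5, Sum: 15, Average: 3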
[1] Measure-Object, as a cmdlet, is capable of processing input in a streaming fashion, meaning it counts objects it receives one by one, as they're being received, which means that even very large streaming input sets (those also created one by one, such as enumerating the rows of a large CSV file with Import-Csv) can be processed without the risk of running out of memory - there is no need to load the input collection as a whole into memory. However, if (a) the input collection already is in memory, or (b) it can fit into memory and performance is important, then use (...).Count.
Let's see short PS-snippets:
Write-Output @(1,2,3)
Write-Output @(1,2,3).Count
Get-Process|Write-Output
Write-Output (Get-Process)
$p=7890;Write-Output "Var. is $p"
Write-Output "ABCD78".Length
Function Get-StoppedService {
    Param([string]$Computername = $env:computername)
    $s = Get-Service -ComputerName $Computername
    $stopped = $s | where {$_.Status -eq 'Stopped'}
    Write-Output $Stopped
}
Get-StoppedService
In all the snippets above I can simply remove the string Write-Output (with trailing space, if any) and have exactly the same functionality.
The question is: do you know an example where we CAN'T drop Write-Output? Of course, if you're interested in Write-Output's -NoEnumerate parameter you can't eliminate the cmdlet itself, so let's suppose we don't want/need -NoEnumerate. In that case, do you have an example?
Well, MSDN itself has the following statement:
This cmdlet is typically used in scripts to display strings and other objects on the console. However, because the default behavior is to display the objects at the end of a pipeline, it is generally not necessary to use the cmdlet. For example, get-process | write-output is equivalent to get-process.
One case in which it might be useful is when you build a pipeline with unknown stages, i.e. stages which are given to you via parameters etc., and the user could give you any cmdlet for a stage, e.g. Tee-Object. If the user doesn't want anything special to happen, they can simply pass Write-Output as a kind of "pass-through" stage (of course, you could easily implement that yourself as well), as in the sketch below.
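A minimal sketch of that idea (the function name Invoke-Stage and its parameter are hypothetical, purely for illustration):
# A tiny pipeline whose middle stage is supplied by the caller;
# Write-Output acts as the do-nothing default stage.
function Invoke-Stage {
    param([string] $StageCommand = 'Write-Output')
    1..5 | & $StageCommand
}
Invoke-Stage                              # pass-through: 1 2 3 4 5
Invoke-Stage -StageCommand 'Sort-Object'  # caller-chosen stage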
As is often the case, PetSerAl's pithy examples (by way of a comment on the question, in this case) lead to epiphanies:
While generally not needed (see DAXaholic's helpful answer; you don't even need Write-Output for -NoEnumerate, because that can be emulated with the unary array-construction operator, ,), Write-Output can provide syntactic sugar:
(Write-Output a b c d e | Measure-Object).Count yields 5, which demonstrates that Write-Output accepts an arbitrary number of individual arguments, which are sent through the pipeline / to the screen one by one.
As such, Write-Output can simplify sending a collection of items to the output stream, by not requiring them to be quoted, because, as a cmdlet, Write-Output's parameters are parsed in argument mode:
Thus, instead of a "noisy" array literal such as the following (with its need for quoting the elements and separating them with ,):
"a", "b", "c", "d", "e"
you can simply write (albeit at the expense of performance):
Write-Output a b c d e
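As an aside, a quick sketch of the -NoEnumerate emulation mentioned parenthetically above: the unary array-construction operator "," wraps the array so that the pipeline passes it along as a single object:
(, (1, 2, 3) | Measure-Object).Count  # -> 1, i.e. the whole array was sent as one object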
I am using the TFS PowerTools Cmdlets in PowerShell to try to get at some information about Changesets and related WorkItems from my server. I have boiled the problem down to behavior I don't understand and I am hoping it is not TFS specific (so someone out there might be able to explain the problem to me :) )
Here's the only command that I can get to work:
Get-TfsItemHistory C:\myDir -recurse -stopafter 5 | % { Write-Host $_.WorkItems[0]["Title"] }
It does what I expect - Get-TfsItemHistory returns a list of 5 ChangeSets, and it pipes those to a foreach that prints out the Title of the first associated WorkItem. So what's my problem? I am trying to write a large script, and I prefer to code things to look more like a C# program (powershell syntax makes me cry). Whenever I try to do the above written any other way, the WorkItems collection is null.
The following commands (which I interpret to be logically equivalent) do not work (The WorkItems collection is null):
$items = Get-TfsItemHistory C:\myDir -recurse -stopafter 5
$items | ForEach-Object { Write-Host $_.WorkItems[0]["Title"] }
The one I would really prefer:
$items = Get-TfsItemHistory C:\myDir -recurse -stopafter 5
foreach ($item in $items)
{
$item.WorkItems[0]["Title"]
# do lots of other stuff
}
I read an article about the difference between the 'foreach' operator and the ForEach-Object Cmdlet, but that seems to be more of a performance debate. This really appears to be an issue about when the piping is being used.
I'm not sure why all three of these approaches don't work. Any insight is appreciated.
This is indeed confusing. For now a work-around is to grab the items like so:
$items = @(Get-TfsItemHistory . -r -Stopafter 25 |
           Foreach {$_.WorkItems.Count > $null; $_})
This accesses the WorkItems collection, which seems to cause the property to be populated (I know - WTF?). I tend to use @() to generate an array in cases where I want to use the foreach keyword. The thing with the foreach keyword is that it will iterate a scalar value, including $null. So if the query returns nothing, $items gets assigned $null and the foreach will iterate the loop once with $item set to null. Now PowerShell generally deals with nulls very nicely. However, if you hand that value back to the .NET Framework, it usually isn't as forgiving. The @() will guarantee an array with either 0, 1 or N elements in it. If it has 0 elements, the foreach loop will not execute its body at all.
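A quick illustration of that @(...) guarantee (a minimal sketch; the filter pattern is just an example that matches nothing):
@( Get-ChildItem -Filter 'no-such-file-*' ).Count   # -> 0 (an empty array, not $null)
@( 'single item' ).GetType().Name                   # -> Object[]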
BTW your last approach - foreach ($item in $items) { ... } - should work just fine.