Suppose I have a process that generates a collection of objects. For a very simple example, consider $(1 | get-member). I can get the number of objects generated:
PS C:\WINDOWS\system32> $(1 | get-member).count
21
or I can do something with those objects.
PS C:\WINDOWS\system32> $(1 | get-member) | ForEach-object {write-host $_.name}
CompareTo
Equals
...
With only 21 objects, doing the above is no problem. But what if the process generates hundreds of thousands of objects? Then I don't want to run the process once just to count the objects and then run it again to execute what I want to do with them. So how can I get a count of objects in a collection sent down the pipeline?
A similar question was asked before, and the accepted answer was to use a counter variable inside the script block that works on the collection. The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct. So I don't want to just count inside the script block. I want a separate, independent measure of the size of the collection that I sent down the pipeline. How can I do that?
If processing and counting is needed:
Doing your own counting inside a ForEach-Object script block is your best bet to avoid processing in two passes.
The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct.
ForEach-Object is reliably invoked for each and every input object, including $null values, so there should be no need to double-check.
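As a quick illustrative sketch (not part of the original answer), even $null inputs invoke the script block, so a counter incremented there matches the true input count:
# Every input object, including $null, invokes the script block.
$count = 0
1, $null, 3 | ForEach-Object { $count++ }
$count   # -> 3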
If you want a cleaner separation of processing and counting, you can pass multiple -Process script blocks to ForEach-Object (in this example, { $_ + 1 } is the input-processing script block and { ++$count } is the input-counting one):
PS> 1..5 | ForEach-Object -Begin { $count = 0 } `
-Process { $_ + 1 }, { ++$count } `
-End { "--- count: $count" }
2
3
4
5
6
--- count: 5
Note that, due to a quirk in ForEach-Object's parameter binding, passing -Begin and -End script blocks is actually required in order to pass multiple -Process (per-input-object) blocks; pass $null if you don't actually need -Begin and/or -End - see GitHub issue #4513.
Also note that the $count variable lives in the caller's scope, and is not scoped to the ForEach-Object call; that is, $count = 0 potentially updates a preexisting $count variable, and, if it didn't previously exist, lives on after the ForEach-Object call.
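Here is a small sketch of that scoping behavior (illustrative only; it reuses the multi-script-block technique shown above, passing $null for the unneeded -End block as described):
$count = 42                      # preexisting variable in the caller's scope
1..3 | ForEach-Object -Begin { $count = 0 } -Process { $_ }, { ++$count } -End $null
$count                           # -> 3; the preexisting $count was overwritten and lives on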
If only counting is needed:
Measure-Object is the cmdlet to use with large, streaming input collections in the pipeline[1]:
The following example generates 100,000 integers one by one and has Measure-Object count them one by one, without collecting the entire input in memory.
PS> (& { $i=0; while ($i -lt 1e5) { (++$i) } } | Measure-Object).Count
100000
Caveat: Measure-Object ignores $null values in the input collection - see GitHub issue #10905.
Note that while counting input objects is Measure-Object's default behavior, it supports a variety of other operations as well, such as summing (-Sum) and averaging (-Average), optionally combined in a single invocation.
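For instance, an illustrative one-liner that combines them in a single streaming pass:
# Count, sum, and average 1..100 in one pass.
1..100 | Measure-Object -Sum -Average | Select-Object Count, Sum, Average
# -> Count: 100, Sum: 5050, Average: 50.5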
[1] Measure-Object, as a cmdlet, is capable of processing input in a streaming fashion, meaning it counts objects it receives one by one, as they're being received, which means that even very large streaming input sets (those also created one by one, such as enumerating the rows of a large CSV file with Import-Csv) can be processed without the risk of running out of memory - there is no need to load the input collection as a whole into memory. However, if (a) the input collection already is in memory, or (b) it can fit into memory and performance is important, then use (...).Count.
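To make the contrast concrete (a sketch, not from the original answer):
# Streaming input: counted one by one, never materialized as a whole.
(Get-ChildItem -File -Recurse | Measure-Object).Count

# Collection already in memory: .Count is the cheapest option.
$inMemory = 1..1e5
$inMemory.Count   # -> 100000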
Related
Consider the following function:
function myFunction {
    100
    sleep 1
    200
    sleep 1
    300
    sleep 1
}
As you can see, it will emit those values one by one down the pipeline.
But I want to wait for all the values to be emitted before going on. Like
myFunction | waitForThePreviousCommandToComplete | Format-Table
I want the Format-Table above to receive the entire array, instead of one-by-one items.
Is it even possible in PowerShell?
Use (...), the grouping operator, in order to collect a command's output in full first, before sending it to the success output stream (the pipeline).
# Due to (...), doesn't send myfunction's output to Format-Table until it has run
# to completion and all its output has been collected.
(myFunction) | Format-Table
# Also works for entire pipelines.
(100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 }) | Format-Table
Note:
If you need to up-front collect the output of multiple commands (pipelines) and / or language statements, use $(...), the subexpression operator instead, e.g. $(Get-Date -Year 2020; Get-Date -Year 2030) | Format-Table; the next point applies to it as well.
Whatever output was collected by (...) is enumerated, i.e., if the collected output is an enumerable, its elements are emitted one by one to the success output stream - albeit without any delay at that point.
Note that the collected output is invariably an enumerable (an array of type [object[]]) if two or more output objects were collected, but it can also be one in the unusual event that a single object that is itself an enumerable was collected.
E.g., (Write-Output -NoEnumerate 1, 2, 3) | Measure-Object reports a count of 3, even though Write-Output -NoEnumerate output the given array as a single object (without (...), Measure-Object would report 1).
Typically, commands (cmdlets, functions, scripts) stream their output objects, i.e. emit them one by one to the pipeline, as soon as they are produced, while the command is still running, as your function does, and also act on their pipeline input one by one. However, some cmdlets invariably, themselves collect all input objects first, before they start emitting their output object(s), of conceptual necessity: notable examples are Sort-Object, Group-Object, and Measure-Object, all of which must act on the entirety of their input before they can start emitting results. Ditto for Format-Table when it is passed the -AutoSize switch, discussed next.
In the case of Format-Table, specifically, you can use the -AutoSize switch in order to force it to collect all input first, in order to determine suitable display column widths based on all data (by default, Format-Table waits for 300 milliseconds in order to determine column widths, based on whatever subset of the input data it has received by then).
However, this does not apply to so-called out-of-band-formatted objects, notably strings and primitive .NET types, which are still emitted (by their culture-invariant .ToString() representation) as they're being received.
Only complex objects (those with properties) are collected first, notably hashtables and [pscustomobject] instances; e.g.:
# Because this ForEach-Object call outputs complex objects (hashtables),
# Format-Table, due to -AutoSize, collects them all first,
# before producing its formatted output.
100, 200, 300 | ForEach-Object { @{ num = $_ }; Start-Sleep 1 } |
  Format-Table -AutoSize
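By contrast, an illustrative sketch of the out-of-band behavior described above: string output still streams even with -AutoSize.
# Because strings are formatted out of band, each one appears as soon as
# it is emitted, despite -AutoSize.
'one', 'two', 'three' | ForEach-Object { $_; Start-Sleep 1 } | Format-Table -AutoSize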
If you want to create a custom function that collects all of its pipeline input up front, you have two options:
Create a simple function that uses the automatic $input variable in its function body, which implicitly runs only after all input has been received; e.g.:
# This simple function simply relays its input, but
# implicitly only after all of it has been collected.
function waitForThePreviousCommandToComplete { $input }
# Output doesn't appear until after the ForEach-Object
# call has emitted all its output.
100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 } | waitForThePreviousCommandToComplete
In the context of an advanced function, you'll have to manually collect all input, iteratively in the process block, via a list-type instance allocated in the begin block, which you can then process in the end block.
While using a simple function with $input is obviously simpler, you may still want an advanced one for all the additional benefits it offers (preventing unbound arguments, parameter validation, multiple pipeline-binding parameters, ...).
See this answer for an example.
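For illustration, here's a minimal sketch of such an advanced function (the function name is just a placeholder):
function waitForAllInput {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)] $InputObject
    )
    begin   { $all = [System.Collections.Generic.List[object]]::new() }
    process { $all.Add($InputObject) }
    end     { $all }   # relay the collected input only once all of it has arrived
}

100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 } | waitForAllInput | Format-Table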
Sort waits until it has everything.
myFunction | sort-object
Or:
(myFunction)
$(myfunction1; myFunction2)
myFunction | format-table -autosize
myFunction | more
See also: How to tell PowerShell to wait for each command to end before starting the next?
For some unknown reason, just putting the function inside brackets solved my problem:
(myFunction) | Format-Table
I was searching for a way to read only the first few lines of a CSV file and came across this answer. The accepted answer suggests using
Get-Content "C:\start.csv" | select -First 10 | Out-File "C:\stop.csv"
Another answer suggests using
Get-Content C:\Temp\Test.csv -TotalCount 3
Because my CSV is fairly large I went with the second option. It worked fine. Out of curiosity I decided to try the first option, assuming I could Ctrl+C if it took forever. I was surprised to see that it returned just as quickly.
Is it safe to use the first approach when working with large files? How does PowerShell achieve this?
Yes, Select-Object -First n is "safe" for large files (provided you want to read only a small number of lines, so pipeline overhead will be insignificant, else Get-Content -TotalCount n will be more efficient).
It works like break in a loop, by exiting the pipeline early, when the given number of items have been processed. Internally it throws a special exception that the PowerShell pipeline machinery recognizes.
Here is a demonstration that "abuses" Select-Object to break from a ForEach-Object "loop", which is not possible using a normal break statement.
1..10 | ForEach-Object {
    Write-Host $_            # goes directly to console, so is ignored by Select-Object
    if( $_ -ge 3 ) { $true } # "break" by outputting one item
} | Select-Object -First 1 | Out-Null
Output:
1
2
3
As you can see, Select-Object -First n actually breaks the pipeline instead of first reading all input and then selecting only the specified number of items.
Another, more common use case is when you want to find only a single item in the output of a pipeline. Then it makes sense to exit from the pipeline as soon as you have found that item:
Get-ChildItem -Recurse | Where-Object { SomeCondition } | Select-Object -First 1
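A concrete (hypothetical) variant of the above, which stops enumerating the file system as soon as the first match is found:
# Stops as soon as the first *.log file is found.
Get-ChildItem -Recurse -File | Where-Object Extension -eq '.log' | Select-Object -First 1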
According to Microsoft, the Get-Content cmdlet has a parameter called -ReadCount. Their documentation states:
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.
This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.
Since -ReadCount defaults to 1, Get-Content effectively acts as a generator, reading the file line by line.
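A hedged sketch of how -ReadCount interacts with the Select-Object -First approach above (note that with a batched -ReadCount, each pipeline object is an array of lines, so -First counts batches, not lines):
# Default (-ReadCount 1): lines stream one at a time; only ~10 lines are ever read.
Get-Content "C:\start.csv" | Select-Object -First 10 | Out-File "C:\stop.csv"

# Batched reading: each pipeline object is an array of up to 1000 lines,
# so -First 1 here selects the first *batch* (the first 1000 lines).
Get-Content "C:\start.csv" -ReadCount 1000 | Select-Object -First 1 | Out-File "C:\stop.csv"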
Is there any way of getting the intrinsic numeric index (order #) of the selected/filtered/matched object from a Where-Object operation?
For example:
Get-Process | Where-Object -property id -eq 1024
Without writing further code, is it possible to get the index of the object with ID=4 from some inner/hidden PowerShell mechanism, or to instruct Where-Object to emit the index at which the match(es) took place?
(In this case it would be 1; 0 is the 'Idle' process.)
You could capture the IDs from the Get-Process output as an array and use the IndexOf() method to get the index, or -1 if that ID is not found:
$gp = (Get-Process).Id
$gp.IndexOf(1024)
No, there is no built-in mechanism for what you're looking for as of PowerShell 7.1.
If only one item is to be matched - as in your case - you can use the Array type's static .FindIndex() method:
$processes = Get-Process
# Find the 0-based index of the process with ID 1024.
[Array]::FindIndex($processes, [Predicate[object]] { param($o) $o.Id -eq 1024 })
Note that this returns a zero-based index if a match is found, and -1 otherwise.
The ::FindIndex() method has the advantage of searching only for the first match, unlike Where-Object, which as of PowerShell 7.1 always searches the entire input collection (see below). As a method rather than a pipeline-based cmdlet, it invariably requires the input array to be in memory in full (which the pipeline doesn't require).
While it wouldn't directly address your use case, note that there's conceptually related feature request #13772 on GitHub, which proposes introducing an automatic $PSIndex variable to be made available inside ForEach-Object and Where-Object script blocks.
As an aside:
Note that while [Array]::FindIndex only ever finds the first match's index, Where-Object is limited in the opposite way: as of PowerShell 7.1, it always finds all matches, which is inefficient if you're only looking for one match.
While the related .Where() array method does offer a way to stop processing after the first match (e.g., ('long', 'foo', 'bar').Where({ $_.Length -eq 3 }, 'First')), methods operate on in-memory collections only, so it would be helpful if the pipeline-based Where-Object cmdlet supported such a feature as well - see GitHub feature request #13834.
Or something like this. Add a property called line that keeps increasing for each object.
ps | % { $line = 1 } { $_ | add-member line ($line++) -PassThru } | where id -eq 13924 |
select line,id
line Id
---- --
184 13924
I have created a script that loops through an array and excludes any items that are found within a second array.
While the code works, it got me wondering if it could be simplified or piped.
$result = @()
$ItemArray = @("a","b","c","d")
$exclusionArray = @("b","c")

foreach ($Item in $ItemArray)
{
    $matchFailover = $false
    :gohere
    foreach ($ExclusionItem in $exclusionArray)
    {
        if ($Item -eq $ExclusionItem)
        {
            Write-Host "Match: $Item = $ExclusionItem"
            $matchFailover = $true
            break :gohere
        }
        else {
            Write-Host "No Match: $Item != $ExclusionItem"
        }
    }
    if (!($matchFailover))
    {
        Write-Host "Adding $Item to results"
        $result += $Item
    }
}
Write-Host "`nResults are"
$result
Write-Host "`nResults are"
$result
To give your task a name: You're looking for the relative complement aka set difference between two arrays:
In set-theory notation, it would be $ItemArray \ $ExclusionArray, i.e., those elements in $ItemArray that aren't also in $ExclusionArray.
This related question is looking for the symmetric difference between two sets, i.e., the set of elements that are unique to either side - at least that's what the Compare-Object-based solutions there implement, but only under the assumption that each array has no duplicates.
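As an aside (a sketch, not part of the original answers), Compare-Object can also be coaxed into producing the relative complement, by keeping only the elements unique to the reference (left-hand) array:
$ItemArray = 'a', 'b', 'c', 'd'
$exclusionArray = 'b', 'c'

# '<=' marks elements present only in the reference (left-hand) array.
(Compare-Object $ItemArray $exclusionArray |
    Where-Object SideIndicator -eq '<=').InputObject   # -> 'a', 'd'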
EyIM's helpful answer is conceptually simple and concise.
A potential problem is performance: a lookup in the exclusion array must be performed for each element in the input array.
With small arrays, this likely won't matter in practice.
With larger arrays, LINQ offers a substantially faster solution:
Note: In order to benefit from the LINQ solution, your arrays should be in memory already, and the benefit is greater the larger the exclusion array is. If your input is streaming via the pipeline, the overhead from executing the pipeline may make attempts to optimize array processing pointless or even counterproductive, in which case sticking with the native PowerShell solution makes sense - see iRon's answer.
# Declare the arrays as [string[]]
# so that calling the LINQ method below works as-is.
# (You could also cast to [string[]] ad hoc.)
[string[]] $ItemArray = 'a','b','c','d'
[string[]] $exclusionArray = 'b','c'
# Return only those elements in $ItemArray that aren't also in $exclusionArray
# and convert the result (a lazy enumerable of type [IEnumerable[string]])
# back to an array to force its evaluation
# (If you directly enumerate the result in a pipeline, that step isn't needed.)
[string[]] [Linq.Enumerable]::Except($ItemArray, $exclusionArray) # -> 'a', 'd'
Note the need to use the LINQ types explicitly, via their static methods, because PowerShell, as of v7, has no support for extension methods.
However, there is a proposal on GitHub to add such support; this related proposal asks for improved support for calling generic methods.
See this answer for an overview of how to currently call LINQ methods from PowerShell.
Performance comparison:
Tip of the hat to iRon for his input.
The following benchmark code uses the Time-Command function to compare the two approaches, using arrays with roughly 4000 and 2000 elements, respectively, which - as in the question - differ by only 2 elements.
Note that in order to level the playing field, the .Where() array method (PSv4+) is used instead of the pipeline-based Where-Object cmdlet, as .Where() is faster with arrays already in memory.
Here are the results, averaged over 10 runs, from a single-core Windows 10 VM running Windows PowerShell v5.1; note the relative performance, as shown in the Factor column:
Factor Secs (10-run avg.) Command TimeSpan
------ ------------------ ------- --------
1.00 0.046 # LINQ... 00:00:00.0455381
8.40 0.382 # Where ... -notContains... 00:00:00.3824038
The LINQ solution is substantially faster - by a factor of 8+ (though even the much slower solution only took about 0.4 seconds to run).
It seems that the performance gap is even wider in PowerShell Core, where I've seen a factor of around 19 with v7.0.0-preview.4; interestingly, both tests ran faster individually than in Windows PowerShell.
Benchmark code:
# Script block to initialize the arrays.
# The filler arrays are randomized to eliminate caching effects in LINQ.
$init = {
    $fillerArray = 1..1000 | Get-Random -Count 1000
    [string[]] $ItemArray = $fillerArray + 'a' + $fillerArray + 'b' + $fillerArray + 'c' + $fillerArray + 'd'
    [string[]] $exclusionArray = $fillerArray + 'b' + $fillerArray + 'c'
}
# Compare the average of 10 runs.
Time-Command -Count 10 { # LINQ
    . $init
    $result = [string[]] [Linq.Enumerable]::Except($ItemArray, $exclusionArray)
}, { # Where ... -notContains
    . $init
    $result = $ItemArray.Where({ $exclusionArray -notcontains $_ })
}
You can use Where-Object with -notcontains:
$ItemArray | Where-Object { $exclusionArray -notcontains $_ }
Output:
a, d
Advocating native PowerShell:
As per @mklement0's answer, there is no doubt that Language Integrated Query (LINQ) is fast...
But in some circumstances, native PowerShell commands using the pipeline, as suggested by @EylM, can still beat LINQ. This is not just theoretical but might happen in use cases where the process concerned is idle and waiting for slow input, e.g. where the input comes from:
A remote server (e.g. Active Directory)
A slow device
A separate thread that has to make a complex calculation
The internet ...
Although I haven't seen an easy proof of this yet, it is suggested on several sites and can be deduced from sites such as High Performance PowerShell with LINQ and Ins and Outs of the PowerShell Pipeline.
Proof
To prove the above thesis, I have created a small Slack cmdlet that slows down each item dropped into the pipeline by 1 millisecond (by default):
Function Slack-Object ($Delay = 1) {
    process {
        Start-Sleep -Milliseconds $Delay
        Write-Output $_
    }
}; Set-Alias Slack Slack-Object
Now let's see if native PowerShell can actually beat LINQ:
(To get a good performance comparison, caches should be cleared by e.g. starting a fresh PowerShell session.)
[string[]] $InputArray = 1..200
[string[]] $ExclusionArray = 100..300
(Measure-Command {
    $Result = [Linq.Enumerable]::Except([string[]] ($InputArray | Slack), $ExclusionArray)
}).TotalMilliseconds
(Measure-Command {
    $Result = $InputArray | Slack | Where-Object {$ExclusionArray -notcontains $_}
}).TotalMilliseconds
Results:
LINQ: 411,3721
PowerShell: 366,961
To exclude the LINQ cache, a single-run test should be done, but as commented by @mklement0, the results of single runs might vary from run to run.
The results also highly depend on the size of the input arrays, the size of the result, the slack, the test system, etc.
Conclusion:
PowerShell might still be faster than LINQ in some scenarios!
Quoting mklement0's comment:
"Overall, it's fair to say that the difference in performance is so small in this scenario that it's not worth picking the approach based on performance - and it makes sense to go with the more PowerShell-like approach (Where-Object), given that the LINQ approach is far from obvious. The bottom line is: choose LINQ only if you have large arrays that are already in memory. If the pipeline is involved, the pipeline overhead alone may make optimizations pointless."
I have a file that looks like this:
a,1
b,2
c,3
a,4
b,5
c,6
(...repeat 1,000s of lines)
How can I transpose it into this?
a,b,c
1,2,3
4,5,6
Thanks
Here's a brute-force one-liner from hell that will do it:
PS> Get-Content foo.txt |
Foreach -Begin {$names=@();$values=@();$hdr=$false;$OFS=',';
function output { if (!$hdr) {"$names"; $global:hdr=$true}
"$values";
$global:names=@();$global:values=@()}}
-Process {$n,$v = $_ -split ',';
if ($names -contains $n) {output};
$names+=$n; $values+=$v }
-End {output}
a,b,c
1,2,3
4,5,6
It's not what I'd call elegant but it should get you by. This should copy/paste correctly as-is. However, if you reformat it to what is shown above, you will need to put backticks after the last curly brace on both the Begin and Process scriptblocks. This script requires PowerShell 2.0 as it relies on the new -split operator.
This approach makes heavy use of the Foreach-Object cmdlet. Normally when you use Foreach-Object (alias is Foreach) in the pipeline you specify just one scriptblock like so:
Get-Process | Foreach {$_.HandleCount}
That prints out the handle count for each process. This usage of Foreach-Object uses the -Process scriptblock implicitly, which means it executes once for each object it receives from the pipeline. Now what if we want to total up all the handles for each process? Ignore the fact that you could just use Measure-Object HandleCount -Sum to do this; I'll show you how Foreach-Object can do it. As you see in the original solution to this problem, Foreach can take both a Begin scriptblock, which is executed once before the first object in the pipeline is processed, and an End scriptblock, which executes once there are no more objects in the pipeline. Here's how you can total the handle count using Foreach-Object:
gps | Foreach -Begin {$sum=0} -Process {$sum += $_.HandleCount } -End {$sum}
Relating this back to the problem solution, in the Begin scriptblock I initialize some variables to hold the array of names and values, as well as a bool ($hdr) that tells me whether or not the header has been output (we only want to output it once). The next mildly mind-blowing thing is that I also declare a function (output) in the Begin scriptblock that I call from both the Process and End scriptblocks to output the current set of data stored in $names and $values.
The only other trick is that the Process scriptblock uses the -contains operator to see if the current line's field name has already been seen before. If so, then output the current names and values and reset those arrays to empty. Otherwise just stash the name and value in the appropriate arrays so they can be saved later.
BTW the reason the output function needs to use the global: specifier on the variables is that PowerShell performs a "copy-on-write" approach when a nested scope modifies a variable defined outside its scope. However when we really want that modification to occur at the higher scope, we have to tell PowerShell that by using a modifier like global: or script:.
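A tiny illustrative example of that behavior (assuming an interactive session, where the outer variable lives in the global scope):
$x = 1
& { $x = 2 }          # assigning creates a new, local $x; the outer one is untouched
$x                    # -> 1
& { $global:x = 3 }   # the global: modifier targets the original variable
$x                    # -> 3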