Foreach of a piped number array isn't breaking properly - PowerShell

I don't quite understand this: why doesn't the following code work?
"start"
1..5 | foreach {
"$_"
break
}
"stop"
I've done a couple of tests, and this code does work properly:
"start"
foreach ($num in 1..5){
"$num"
break
}
"stop"
Is there a way to make the first example run properly? The last line of output should be "stop".
Like so:
start
1
stop

First, you should know that you are using two entirely different language features when you use foreach ($thing in $things) {} vs. $things | foreach { }.
The first is the built-in foreach statement, and the second is an alias for ForEach-Object, and they work very differently.
ForEach-Object runs the scriptblock for each of the items, and it works within a pipeline.
ForEach-Object is a cmdlet, not a loop, so a break inside its script block has no loop to bind to. Instead of ending the current iteration, break propagates upward in search of an enclosing loop and, finding none, terminates the entire pipeline and the rest of the script, which is why "stop" is never printed.
How you would go about limiting the results depends on what you want to do.
If you just want to stop producing results, simply don't return anything once the condition is met. You'll still run every iteration, but the results will be correct.
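A minimal sketch of that approach, filtering inside the script block instead of trying to break (the condition here is just an illustration):

```powershell
# Instead of break, emit output only while the condition holds.
# Every iteration still runs, but only matching items are returned.
"start"
1..5 | ForEach-Object {
    if ($_ -le 1) { "$_" }
}
"stop"
# Prints: start, 1, stop - all five iterations execute, but only 1 is emitted.
```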
If you only need to return a certain number of items, like the first N items, the best way (from PowerShell v3 on) is to add Select-Object:
1..10 | ForEach-Object {
$_*2
} | Select-Object -First 5
This will only execute 5 times, and it will return the sequence 2,4,6,8,10.
This is because of how the pipeline works: each object gets sent through each cmdlet in turn, and Select-Object can stop the pipeline so that it doesn't keep executing.
Pre-version 3.0, the pipeline cannot be stopped in that way, and although the results will be correct, you won't have prevented the extra executions.
If you give more details on what your conditions are for exiting, I could give more input as to how you'd want to approach that particular problem (which may involve not using ForEach-Object).

Related

While working in Powershell, how do I pause between list items?

I have been working on this for a while and I cannot find this usage anywhere. I am using a PowerShell array and the foreach method:
@('program 1','program 2','program 3').foreach{winget show $_ -args}
I then want to have a pause between each one, so I added ;sleep 1
This does not work. It pauses for 3 s (based on this example) and then lists the items. What am I missing here?
Indeed, it doesn't seem to respect the order; I don't know the technical reason why. You could either use a normal ForEach-Object:
'program 1','program 2','program 3' | ForEach-Object {
winget show $_
sleep 1
}
or force the output to go to the console instead of being "buffered"
('program 1','program 2','program 3').ForEach{
winget show $_ | Out-Host
sleep 1
}
Doug Maurer's helpful answer provides effective solutions.
As for an explanation of the behavior you saw:
The intrinsic .ForEach() method first collects all success output produced by the successive, per-input-object invocations of the script block ({ ... }) passed to it, and only after having processed all input objects outputs the collected results to the pipeline, i.e. the success output stream, in the form of a [System.Collections.ObjectModel.Collection[psobject]] instance.
In other words:
Unlike its cmdlet counterpart, the ForEach-Object cmdlet, the .ForEach() method does not emit output produced in its script block instantly to the pipeline.
As with any method, success output is only sent to the pipeline when the method returns.
Note that non-PowerShell .NET methods only ever produce success output (which is their return value) and only ever communicate failure via exceptions, which become statement-terminating PowerShell errors.
By contrast, the following do take instant effect inside a .ForEach() call's script block:
Suspending execution temporarily, such as via Start-Sleep
Forcing instant display (host) output, such as via Out-Host or Write-Host.
Note that to-host output with Out-Host prevents capturing the output altogether, whereas Write-Host output, in PowerShell v5+, can only be captured via the information output stream (number 6).
Writing to an output stream other than the success output stream, such as via Write-Error, Write-Warning or Write-Verbose -Verbose.
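To illustrate the Write-Host capture point above, a minimal sketch (PowerShell v5+):

```powershell
# Write-Host displays instantly because it writes to the information
# stream (6), not the success stream; a 6>&1 redirection is the only
# way to capture its output.
$captured = & { Write-Host 'hello' } 6>&1
$captured   # an InformationRecord whose message is 'hello'
```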
Alternatively, you may use the foreach statement, which, like the ForEach-Object cmdlet, also instantly emits output to the success output stream:
# Stand-alone foreach statement: emits each number right away.
foreach ($i in 1..3) { $i; pause }
# In a pipeline, you must call it via & { ... } or . { ... }
# to get streaming behavior.
# $(...), the subexpression operator would NOT stream,
# i.e. it would act like the .ForEach() method.
& { foreach ($i in 1..3) { $i; pause } } | Write-Output

How does powershell lazily evaluate this statement?

I was searching for a way to to read only the first few lines of a csv file and came across this answer. The accepted answer suggests using
Get-Content "C:\start.csv" | select -First 10 | Out-File "C:\stop.csv"
Another answer suggests using
Get-Content C:\Temp\Test.csv -TotalCount 3
Because my csv is fairly large I went with the second option. It worked fine. Out of curiosity I decided to try the first option assuming I could ctrl+c if it took forever. I was surprised to see that it returned just as quickly.
Is it safe to use the first approach when working with large files? How does powershell achieve this?
Yes, Select-Object -First n is "safe" for large files (provided you want to read only a small number of lines, so pipeline overhead will be insignificant, else Get-Content -TotalCount n will be more efficient).
It works like break in a loop, by exiting the pipeline early, when the given number of items have been processed. Internally it throws a special exception that the PowerShell pipeline machinery recognizes.
Here is a demonstration that "abuses" Select-Object to break from a ForEach-Object "loop", which is not possible using a normal break statement:
1..10 | ForEach-Object {
Write-Host $_ # goes directly to console, so is ignored by Select-Object
if( $_ -ge 3 ) { $true } # "break" by outputting one item
} | Select-Object -First 1 | Out-Null
Output:
1
2
3
As you can see, Select-Object -First n actually breaks the pipeline instead of first reading all input and then selecting only the specified number of items.
Another, more common use case is when you want to find only a single item in the output of a pipeline. Then it makes sense to exit from the pipeline as soon as you have found that item:
Get-ChildItem -Recurse | Where-Object { SomeCondition } | Select-Object -First 1
According to Microsoft, the Get-Content cmdlet has a parameter called -ReadCount. Their documentation states:
Specifies how many lines of content are sent through the pipeline at a time. The default value is 1. A value of 0 (zero) sends all of the content at one time.
This parameter does not change the content displayed, but it does affect the time it takes to display the content. As the value of ReadCount increases, the time it takes to return the first line increases, but the total time for the operation decreases. This can make a perceptible difference in large items.
Since -ReadCount defaults to 1, Get-Content effectively acts as a generator for reading a file line by line.
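As a hedged illustration of the -ReadCount behavior described above (the file name here is hypothetical):

```powershell
# Default (-ReadCount 1): each pipeline object is a single line (a String).
Get-Content .\big.csv -ReadCount 1 | Select-Object -First 1 |
    ForEach-Object { $_.GetType().Name }   # String

# -ReadCount 1000: each pipeline object is an array of up to 1000 lines,
# trading latency to the first object for lower total pipeline overhead.
Get-Content .\big.csv -ReadCount 1000 |
    ForEach-Object { "batch of $($_.Count) lines" }
```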

Pipeline semantics aren't propagated into Where-Object

I use the following command to run a pipeline.
.\Find-CalRatioSamples.ps1 data16 `
| ? {-Not (Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -JobSourceDatasetName $_ -Exists -Location UWTeV-linux)}
The first is a custom script of mine and runs very fast (milliseconds). The second is a custom command, also written by me (see https://github.com/LHCAtlas/AtlasSSH/blob/master/PSAtlasDatasetCommands/TestGRIDDataset.cs). It is very slow.
Actually, it isn't so slow processing each line of input. The setup before the first line of input can be processed is very expensive. That done, however, it goes quite quickly. So all the expensive code gets executed once, and only the fairly fast code needs to be executed for each new pipeline input.
Unfortunately, when I use the ? { } construct above, PowerShell doesn't seem to keep the pipeline going as it did before. It now calls my command afresh for each line of input, causing the command to redo all the setup every time.
Is there something I can change in how I invoke the pipeline? Or in how I've coded my cmdlet to prevent this from happening? Or am I stuck because this is just the way Where-Object works?
It is working as designed. You're starting a new (nested) pipeline inside the scriptblock when you call your command.
If your function is doing the expensive code in its Begin block, then you need to directly pipe the first script into your function to get that advantage.
.\Find-CalRatioSamples.ps1 data16 |
Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -Exists -Location UWTeV-linux |
Where-Object { $_ }
But then it seems that you are not returning the objects you want (the originals).
One way you might be able to change Test-GRIDDataset is to implement a -PassThru switch, though you aren't actually accepting the full objects from your original script, so I'm unable to tell if this is feasible; but the code you wrote seems to be retrieving... stuff(?) from somewhere based on the name. Perhaps that would be sufficient? When -PassThru is specified, send the objects through the pipeline if they exist (rather than just a boolean of whether or not they do).
Then your code would look like this:
.\Find-CalRatioSamples.ps1 data16 |
Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -Exists -Location UWTeV-linux -PassThru
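For reference, here is a hypothetical sketch (as an advanced function, not the actual C# Test-GRIDDataset cmdlet) of how the -PassThru pattern keeps the expensive setup in begin {} while emitting the original input objects:

```powershell
# Hypothetical sketch of the -PassThru pattern; Test-Thing and its
# setup/check logic are stand-ins, not the real Test-GRIDDataset code.
function Test-Thing {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)] [string] $Name,
        [switch] $PassThru
    )
    begin {
        $connection = 'expensive one-time setup'   # runs once per pipeline
    }
    process {
        $exists = $Name.Length -gt 0               # stand-in for the real check
        if ($PassThru) {
            if ($exists) { $Name }                 # emit the original object
        } else {
            $exists                                # emit a Boolean
        }
    }
}
```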

How does the PowerShell Pipeline Concept work?

I understand that PowerShell piping works by taking the output of one cmdlet and passing it to another cmdlet as input. But how does it go about doing this?
Does the first cmdlet finish and then pass all the output variables across at once, which are then processed by the next cmdlet?
Or is each output from the first cmdlet taken one at a time and run through all of the remaining piped cmdlets?
You can see how pipeline order works with a simple bit of script:
function a {begin {Write-Host 'begin a'} process {Write-Host "process a: $_"; $_} end {Write-Host 'end a'}}
function b {begin {Write-Host 'begin b'} process {Write-Host "process b: $_"; $_} end {Write-Host 'end b'}}
function c { Write-Host 'c' }
1..3 | a | b | c
Outputs:
begin a
begin b
process a: 1
process b: 1
process a: 2
process b: 2
process a: 3
process b: 3
end a
end b
c
The PowerShell pipeline works asynchronously, meaning that output of the first cmdlet is available to the second cmdlet immediately, one object at a time (even if the first one has not finished executing).
For example if you run the below line:
dir -recurse | out-file C:\a.txt
and then stop the execution by pressing Ctrl+C, you will see that part of the directory listing has been written to the text file.
A better example is the following code (which is indeed useful for deleting all .tmp files on drive C:):
Get-ChildItem C:\ -Include *.tmp -Recurse | ForEach-Object { Remove-Item $_.FullName }
Each time, $_ in the second cmdlet gets the value of a single file.
Both answers thus far give you some good information about pipelining. However, there is more to be said.
First, to directly address your question, you posited two possible ways the pipeline might work. And they are both right... depending on the cmdlets on either side of the pipe!
However, the way the pipeline should work is closer to your second notion: objects are processed one at a time. (Though there's no guarantee that an object will go all the way through before the next one is started because each component in the pipeline is asynchronous, as S Nash mentioned.)
So what do I mean by "it depends on your cmdlets" ?
If you are talking about cmdlets supplied by Microsoft, they likely all work as you would expect, passing each object through the pipeline as efficiently as it can. But if you are talking about cmdlets that you write, it depends on how you write them: it is just as easy to write cmdlets that fail to do proper pipelining as those that succeed!
There are two principal failure modes:
generating all output before emitting any into the pipeline, or
collecting all pipeline input before processing any.
What you want to strive for, of course, is to process each input as soon as it is received and emit its output as soon as it is determined. For detailed examples of all of these see my article, Ins and Outs of the PowerShell Pipeline, just published on Simple-Talk.com.
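As a sketch of that contrast (the function names are made up for illustration): the first function drains the whole pipeline before emitting anything, while the second streams each object as soon as it is processed:

```powershell
# Failure mode 2: collect all pipeline input before processing any.
function Get-DoubledBuffered {
    $all = @($input)                 # drains the entire pipeline first
    foreach ($i in $all) { $i * 2 }  # only then emits results
}

# Proper streaming: each input is processed and emitted as it arrives.
function Get-DoubledStreaming {
    process { $_ * 2 }
}

1..3 | Get-DoubledStreaming          # emits 2, 4, 6 one object at a time
```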

How to transpose data in powershell

I have a file that looks like this:
a,1
b,2
c,3
a,4
b,5
c,6
(...repeat 1,000s of lines)
How can I transpose it into this?
a,b,c
1,2,3
4,5,6
Thanks
Here's a brute-force one-liner from hell that will do it:
PS> Get-Content foo.txt |
Foreach -Begin {$names=@();$values=@();$hdr=$false;$OFS=',';
function output { if (!$hdr) {"$names"; $global:hdr=$true}
"$values";
$global:names=@();$global:values=@()}}
-Process {$n,$v = $_ -split ',';
if ($names -contains $n) {output};
$names+=$n; $values+=$v }
-End {output}
a,b,c
1,2,3
4,5,6
It's not what I'd call elegant, but it should get you by. This should copy/paste correctly as-is. However, if you reformat it to what is shown above, you will need to put back-ticks after the last curly brace on both the Begin and Process scriptblocks. This script requires PowerShell 2.0 as it relies on the new -split operator.
This approach makes heavy use of the Foreach-Object cmdlet. Normally when you use Foreach-Object (alias is Foreach) in the pipeline you specify just one scriptblock like so:
Get-Process | Foreach {$_.HandleCount}
That prints out the handle count for each process. This usage of Foreach-Object uses the -Process scriptblock implicitly, which means it executes once for each object it receives from the pipeline. Now what if we want to total up all the handles for each process? Ignore the fact that you could just use Measure-Object HandleCount -Sum to do this; I'll show you how Foreach-Object can do it. As you see in the original solution to this problem, Foreach can take both a Begin scriptblock, which executes once before the first object arrives from the pipeline, and an End scriptblock, which executes once there are no more objects in the pipeline. Here's how you can total the handle count using Foreach-Object:
gps | Foreach -Begin {$sum=0} -Process {$sum += $_.HandleCount } -End {$sum}
Relating this back to the problem solution: in the Begin scriptblock I initialize some variables to hold the arrays of names and values, as well as a Boolean ($hdr) that tells me whether or not the header has been output (we only want to output it once). The next mildly mind-blowing thing is that I also declare a function (output) in the Begin scriptblock that I call from both the Process and End scriptblocks to output the current set of data stored in $names and $values.
The only other trick is that the Process scriptblock uses the -contains operator to see if the current line's field name has already been seen before. If so, then output the current names and values and reset those arrays to empty. Otherwise just stash the name and value in the appropriate arrays so they can be saved later.
BTW the reason the output function needs to use the global: specifier on the variables is that PowerShell performs a "copy-on-write" approach when a nested scope modifies a variable defined outside its scope. However when we really want that modification to occur at the higher scope, we have to tell PowerShell that by using a modifier like global: or script:.
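If you'd rather avoid the global-variable gymnastics, here is a hedged alternative sketch using a plain foreach statement (it assumes, as in the sample data, that the field names repeat in the same fixed order for every record):

```powershell
# Same algorithm as above, without globals: a repeated field name
# signals that the next record has started.
$names = @(); $values = @(); $rows = @(); $header = $null
foreach ($line in Get-Content foo.txt) {
    $n, $v = $line -split ','
    if ($names -contains $n) {           # seen before: flush finished record
        if (-not $header) { $header = $names -join ',' }
        $rows += ,$values
        $names = @(); $values = @()
    }
    $names += $n
    $values += $v
}
if (-not $header) { $header = $names -join ',' }
$rows += ,$values                        # flush the final record
$header                                  # a,b,c
$rows | ForEach-Object { $_ -join ',' }  # 1,2,3 then 4,5,6
```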