I have a file that looks like this:
a,1
b,2
c,3
a,4
b,5
c,6
(...repeat 1,000s of lines)
How can I transpose it into this?
a,b,c
1,2,3
4,5,6
Thanks
Here's a brute-force one-liner from hell that will do it:
PS> Get-Content foo.txt |
Foreach -Begin {$names=@();$values=@();$hdr=$false;$OFS=',';
function output { if (!$hdr) {"$names"; $global:hdr=$true}
"$values";
$global:names=@();$global:values=@()}}
-Process {$n,$v = $_ -split ',';
if ($names -contains $n) {output};
$names+=$n; $values+=$v }
-End {output}
a,b,c
1,2,3
4,5,6
It's not what I'd call elegant but should get you by. This should copy/paste correctly as-is. However, if you reformat it to what is shown above, you will need to put back-ticks after the last curly brace on both the Begin and Process scriptblocks. This script requires PowerShell 2.0, as it relies on the new -split operator.
This approach makes heavy use of the Foreach-Object cmdlet. Normally when you use Foreach-Object (alias is Foreach) in the pipeline you specify just one scriptblock like so:
Get-Process | Foreach {$_.HandleCount}
That prints out the handle count for each process. This usage of Foreach-Object uses the -Process scriptblock implicitly, which means it executes once for each object it receives from the pipeline. Now what if we want to total up all the handles across processes? Ignore the fact that you could just use Measure-Object HandleCount -Sum to do this; I'll show you how Foreach-Object can do it. As you saw in the original solution to this problem, Foreach can take both a Begin scriptblock that executes once before the first object from the pipeline is processed and an End scriptblock that executes once there are no more objects in the pipeline. Here's how you can total the handle count using Foreach-Object:
gps | Foreach -Begin {$sum=0} -Process {$sum += $_.HandleCount } -End {$sum}
Relating this back to the problem solution, in the Begin scriptblock I initialize some variables to hold the array of names and values as well as a bool ($hdr) that tells me whether or not the header has been output (we only want to output it once). The next mildly mind blowing thing is that I also declare a function (output) in the Begin scriptblock that I call from both the Process and End scriptblocks to output the current set of data stored in $names and $values.
The only other trick is that the Process scriptblock uses the -contains operator to see if the current line's field name has already been seen before. If so, then output the current names and values and reset those arrays to empty. Otherwise just stash the name and value in the appropriate arrays so they can be saved later.
BTW, the reason the output function needs to use the global: specifier on the variables is that PowerShell uses a "copy-on-write" approach when a nested scope modifies a variable defined outside that scope. When we really want that modification to occur at the higher scope, we have to tell PowerShell so by using a modifier like global: or script:.
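Here's a minimal sketch of that copy-on-write behavior at the interactive prompt (the function names are just illustrative):
$count = 0
function Bump     { $count++ }         # increments a *local copy*; the caller's $count is untouched
function BumpReal { $global:count++ }  # explicitly targets the global variable
Bump;     $count   # -> 0
BumpReal; $count   # -> 1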
Consider the following function:
function myFunction {
100
sleep 1
200
sleep 1
300
sleep 1
}
As you can see, it will emit those values one by one down the pipeline.
But I want to wait for all the values to be emitted before going on. Like
myFunction | waitForThePreviousCommandToComplete | Format-Table
I want the Format-Table above to receive the entire array, instead of one-by-one items.
Is it even possible in PowerShell?
Use (...), the grouping operator, in order to collect a command's output in full first, before sending it to the success output stream (the pipeline).
# Due to (...), doesn't send myfunction's output to Format-Table until it has run
# to completion and all its output has been collected.
(myFunction) | Format-Table
# Also works for entire pipelines.
(100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 }) | Format-Table
Note:
If you need to collect the output of multiple commands (pipelines) and/or language statements up front, use $(...), the subexpression operator, instead - e.g., $(Get-Date -Year 2020; Get-Date -Year 2030) | Format-Table; the next point applies to it as well.
Whatever output was collected by (...) is enumerated, i.e., if the collected output is an enumerable, its elements are emitted one by one to the success output stream - albeit without any delay at that point.
Note that the collected output is invariably an enumerable (an array of type [object[]]) if two or more output objects were collected, but it can also be one in the unusual event that a single object that is itself an enumerable was collected.
E.g., (Write-Output -NoEnumerate 1, 2, 3) | Measure-Object reports a count of 3, even though Write-Output -NoEnumerate output the given array as a single object (without (...), Measure-Object would report 1).
Typically, commands (cmdlets, functions, scripts) stream their output objects, i.e. emit them one by one to the pipeline, as soon as they are produced, while the command is still running, as your function does, and also act on their pipeline input one by one. However, some cmdlets, of conceptual necessity, themselves collect all input objects first, before they start emitting their output object(s): notable examples are Sort-Object, Group-Object, and Measure-Object, all of which must act on the entirety of their input before they can start emitting results. Ditto for Format-Table when it is passed the -AutoSize switch, discussed next.
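You can observe this blocking behavior directly (the sample values are arbitrary):
# Nothing appears for ~3 seconds; Sort-Object must see all input first:
100, 300, 200 | ForEach-Object { $_; Start-Sleep 1 } | Sort-Object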
In the case of Format-Table specifically, you can use the -AutoSize switch in order to force it to collect all input first, so that it can determine suitable display column widths based on all data (by default, Format-Table waits 300 msec. to determine column widths, based on whatever subset of the input data it has received by then).
However, this does not apply to so-called out-of-band-formatted objects, notably strings and primitive .NET types, which are still emitted (by their culture-invariant .ToString() representation) as they're being received.
Only complex objects (those with properties) are collected first, notably hashtables and [pscustomobject] instances; e.g.:
# Because this ForEach-Object call outputs complex objects (hashtables),
# Format-Table, due to -AutoSize, collects them all first,
# before producing its formatted output.
100, 200, 300 | ForEach-Object { @{ num = $_ }; Start-Sleep 1 } |
Format-Table -AutoSize
If you want to create a custom function that collects all of its pipeline input up front, you have two options:
Create a simple function that uses the automatic $input variable in its function body, which implicitly runs only after all input has been received; e.g.:
# This simple function simply relays its input, but
# implicitly only after all of it has been collected.
function waitForThePreviousCommandToComplete { $input }
# Output doesn't appear until after the ForEach-Object
# call has emitted all its output.
100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 } | waitForThePreviousCommandToComplete
In the context of an advanced function, you'll have to manually collect all input, iteratively in the process block, via a list-type instance allocated in the begin block, which you can then process in the end block.
While using a simple function with $input is obviously simpler, you may still want an advanced one for all the additional benefits it offers (preventing unbound arguments, parameter validation, multiple pipeline-binding parameters, ...).
See this answer for an example.
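As a rough sketch of that advanced-function approach (the name Wait-AllInput is hypothetical; see the linked answer for a fuller treatment):
function Wait-AllInput {
  [CmdletBinding()]
  param([Parameter(ValueFromPipeline)] $InputObject)
  begin   { $all = [System.Collections.Generic.List[object]]::new() }
  process { $all.Add($InputObject) }  # collect each input object as it arrives
  end     { $all }                    # emitted (and enumerated) only after all input has arrived
}
100, 200, 300 | ForEach-Object { $_; Start-Sleep 1 } | Wait-AllInput | Format-Table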
Sort waits until it has everything.
myFunction | sort-object
Or:
(myFunction)
$(myfunction1; myFunction2)
myFunction | format-table -autosize
myFunction | more
See also: How to tell PowerShell to wait for each command to end before starting the next?
For some unknown reason, just putting the function call inside parentheses solved my problem:
(myFunction) | Format-Table
I'm trying to find an efficient way to read the value of a string variable in a PowerShell .ps1 file and then update the same variable/value in another .ps1 file. In my specific case, I would update a variable for the version # on script one and then I would want to run a script to update it on multiple other .ps1 files. For example:
1_script.ps1 - Script I want to read variable from
$global:scriptVersion = "v1.1"
2_script.ps1 - script I would want to update variable on (Should update to v1.1)
$global:scriptVersion = "v1.0"
I would want to update 2_script.ps1 to set the variable to "v1.1" as read from 1_script.ps1. My current method is using get-content with a regex to find a line starting with my variable, then doing a bunch of replaces to get the portion of the string I want. This does work, but it seems like there is probably a better way I am missing or didn't get working correctly in my tests.
My Modified Regex Solution Based on the Answer by @mklement0:
I slightly modified @mklement0's solution, because dot-sourcing the first script was causing it to run:
$file1 = ".\1_script.ps1"
$file2 = ".\2_script.ps1"
$fileversion = (Get-Content $file1 | Where-Object {$_ -match '(?m)(?<=^\s*\$global:scriptVersion\s*=\s*")[^"]+'}).Split("=")[1].Trim().Replace('"','')
(Get-Content -Raw $file2) -replace '(?m)(?<=^\s*\$global:scriptVersion\s*=\s*")[^"]+',$fileversion | Set-Content $file2 -NoNewLine
Generally, the most robust way to parse PowerShell code is to use the language parser. However, reconstructing source code, with modifications after parsing, may situationally be hampered by the parser not reporting the details of intra-line whitespace - see this answer for an example and a discussion.[1]
Pragmatically speaking, using a regex-based -replace solution is probably good enough in your simple case (note that the value to update is assumed to be enclosed in "..." - but matching could be made more flexible to support '...' quoting too):
# Dot-source the first script in order to obtain the new value.
# Note: This invariably executes *all* top-level code in the script.
. .\1_script.ps1
# Outputs to the display.
# Append
# | Set-Content -Encoding utf8 2_script.ps1
# to save back to the input file.
(Get-Content -Raw 2_script.ps1) -replace '(?m)(?<=^\s*\$global:scriptVersion\s*=\s*")[^"]+', $global:scriptVersion
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
[1] Syntactic elements are reported in terms of line and column position, and columns are character-based, meaning that spaces and tabs are treated the same, so that a difference of, say, 3 character positions can represent 3 spaces, 3 tabs, or any mix of it - the parser won't tell you. However, if your approach allows keeping the source code as a whole while only removing and splicing in certain elements, that won't be a problem, as shown in iRon's helpful answer.
To complement the helpful answer from @mklement0: in case you do go for the PowerShell abstract syntax tree (AST) class, you might use the Extent.StartOffset/Extent.EndOffset properties to reconstruct your script:
using namespace System.Management.Automation.Language
$global:scriptVersion = 'v1.1' # . .\Script1.ps1
$Script2 = { # = Get-Content -Raw .\Script2.ps1
[CmdletBinding()]param()
begin {
$global:scriptVersion = "v1.0"
}
process {
$_
}
end {}
}.ToString()
$Ast = [Parser]::ParseInput($Script2, [ref]$null, [ref]$null)
$Extent = $Ast.Find(
{
$args[0] -is [AssignmentStatementAst] -and
$args[0].Left.VariablePath.UserPath -eq 'global:scriptVersion' -and
$args[0].Operator -eq 'Equals'
}, $true
).Right.Extent
-Join (
$Script2.SubString(0, $Extent.StartOffset),
$global:scriptVersion,
$Script2.SubString($Extent.EndOffset)
) # |Set-Content .\Script2.ps1
I can't find a way to pass the function - just variables.
Any ideas without putting the function inside the ForEach loop?
function CustomFunction {
Param (
$A
)
Write-Host $A
}
$List = "Apple", "Banana", "Grape"
$List | ForEach-Object -Parallel {
Write-Host $using:CustomFunction $_
}
The solution isn't quite as straightforward as one would hope:
# Sample custom function.
function Get-Custom {
Param ($A)
"[$A]"
}
# Get the function's definition *as a string*
$funcDef = ${function:Get-Custom}.ToString()
"Apple", "Banana", "Grape" | ForEach-Object -Parallel {
# Define the function inside this thread...
${function:Get-Custom} = $using:funcDef
# ... and call it.
Get-Custom $_
}
Note: This answer contains an analogous solution for using a script block from the caller's scope in a ForEach-Object -Parallel script block.
Note: If your function were defined in a module that is placed in one of the locations known to the module-autoloading feature, your function calls would work as-is with ForEach-Object -Parallel, without extra effort - but each thread would incur the cost of (implicitly) importing the module.
The above approach is necessary, because - aside from the current location (working directory) and environment variables (which apply process-wide) - the threads that ForEach-Object -Parallel creates do not see the caller's state, notably neither with respect to variables nor functions (and also not custom PS drives and imported modules).
As of PowerShell 7.2.x, an enhancement is being discussed in GitHub issue #12240 to support copying the caller's state to the parallel threads on demand, which would make the caller's functions automatically available.
Note that redefining the function in each thread via a string is crucial, as an attempt to make do without the aux. $funcDef variable and trying to redefine the function with ${function:Get-Custom} = ${using:function:Get-Custom} fails, because ${function:Get-Custom} is a script block, and the use of script blocks with the $using: scope specifier is explicitly disallowed in order to avoid cross-thread (cross-runspace) issues.
However, ${function:Get-Custom} = ${using:function:Get-Custom} would work with Start-Job; see this answer for an example.
It would not work with Start-ThreadJob, which currently syntactically allows you to do & ${using:function:Get-Custom} $_, because ${using:function:Get-Custom} is preserved as a script block (unlike with Start-Job, where it is deserialized as a string - itself surprising behavior; see GitHub issue #11698), even though it shouldn't be: direct cross-thread use of [scriptblock] instances causes obscure failures, which is why ForEach-Object -Parallel prevents it in the first place.
A similar loophole that leads to cross-thread issues even with ForEach-Object -Parallel is using a command-info object obtained in the caller's scope with Get-Command as the function body in each thread via the $using: scope: this too should be prevented, but isn't as of PowerShell 7.2.7 - see this post and GitHub issue #16461.
${function:Get-Custom} is an instance of namespace variable notation, which allows you to both get a function (its body as a [scriptblock] instance) and to set (define) it, by assigning either a [scriptblock] or a string containing the function body.
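A quick illustration of both directions (the function names here are made up):
function Get-Custom { "original" }
$body = ${function:Get-Custom}          # get: the body as a [scriptblock]
${function:Get-Clone} = $body           # set: define a new function from that script block
${function:Get-FromString} = '"hello"'  # set: a string containing the body works too
Get-Clone        # -> original
Get-FromString   # -> hello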
I just figured out another way using get-command, which works with the call operator. $a ends up being a FunctionInfo object.
EDIT: I'm told this isn't thread safe, but I don't understand why.
function hi { 'hi' }
$a = get-command hi
1..3 | foreach -parallel { & $using:a }
hi
hi
hi
So I figured out another little trick that may be useful for people trying to add the functions dynamically, particularly if you might not know the name of it beforehand, such as when the functions are in an array.
# Store the current function list in a variable
$initialFunctions=Get-ChildItem Function:
# Source all .ps1 files in the current folder and all subfolders
Get-ChildItem . -Recurse | Where-Object { $_.Name -like '*.ps1' } |
ForEach-Object { . "$($_.FullName)" }
# Get only the functions that were added above, and store them in an array
$functions = @()
Compare-Object $initialFunctions (Get-ChildItem Function:) -PassThru |
ForEach-Object { $functions = @($functions) + @($_) }
1..3 | ForEach-Object -Parallel {
# Pull the $functions array from the outer scope and set each function
# to its definition
$using:functions | ForEach-Object {
Set-Content "Function:$($_.Name)" -Value $_.Definition
}
# Call one of the functions in the sourced .ps1 files by name
SourcedFunction $_
}
The main "trick" of this is using Set-Content with Function: plus the function name, since PowerShell essentially treats each entry of Function: as a path.
This makes sense when you consider the output of Get-PSDrive: each of those entries can be used as a "drive" in the same way (i.e., with the colon).
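For instance (Greet is a hypothetical function name):
Set-Content Function:Greet -Value '"Hello, $args"'  # define a function via the Function: drive
Greet world                                         # -> Hello, world
Get-Content Function:Greet                          # read the function's body back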
ForEach is documented as an alias of ForEach-Object. When I run Get-Alias ForEach, it tells me that it is an alias of ForEach-Object.
But,
ForEach-Object accepts parameters such as -Begin, -Process, and -End, whereas ForEach doesn't accept them.
When we call ForEach-Object without anything, it prompts for the Process parameter, while calling ForEach leaves a nested prompt with >>.
% and ForEach behave the same, but ForEach-Object doesn't.
Here my questions are:
Is ForEach really an alias of ForEach-Object?
Which is better, ForEach or ForEach-Object?
Please share your thoughts.
Thank you!
There are two distinct underlying constructs:
The ForEach-Object cmdlet
This cmdlet has a built-in alias name: foreach, which just so happens to match the name of the distinct foreach statement (see next point).
The foreach loop statement (akin to the lesser-used for statement).
What foreach refers to depends on the parsing context - see about_Parsing:
In argument mode (in the context of a command), foreach is ForEach-Object's alias.
In expression mode (more strictly: statement mode in this case), foreach is the foreach loop statement.
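For example:
foreach ($n in 1..3) { $n }   # at the start of a statement: the foreach *statement*
1..3 | foreach { $_ }         # as a command name: alias for the ForEach-Object *cmdlet*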
As a cmdlet, ForEach-Object operates on pipeline input.
Use it to process output from other commands in a streaming manner, object by object, as these objects are being received, via the automatic $_ variable.
As a language statement, the foreach loop operates on variables and expressions (which may include output collected from commands).
Use it to process already-collected-in-memory results efficiently, via a self-chosen iterator variable (e.g., $num in foreach ($num in 1..3) { ... }); doing so is noticeably faster than processing via ForEach-Object.[1]
Note that you cannot send outputs from a foreach statement directly to the pipeline, because PowerShell's grammar doesn't permit it; for streaming output to the pipeline, wrap a foreach statement in & { ... }. (By contrast, simple expressions (e.g., 1..3) can be sent directly to the pipeline).
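A sketch of that wrapper technique:
# Broken: a foreach statement cannot start a pipeline.
# foreach ($i in 1..3) { $i } | ForEach-Object { $_ * 10 }
# Works: wrapping it in & { ... } streams its output to the pipeline:
& { foreach ($i in 1..3) { $i } } | ForEach-Object { $_ * 10 }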
For more information, a performance comparison and a discussion of the tradeoffs (memory use vs. performance), including the .ForEach() array method, see this answer.
[1] However, note that the main reason for this performance discrepancy as of PowerShell 7.2.x isn't the pipeline itself, but ForEach-Object's inefficient implementation - see GitHub issue #10982.
Suppose I have a process that generates a collection of objects. For a very simple example, consider $(1 | get-member). I can get the number of objects generated:
PS C:\WINDOWS\system32> $(1 | get-member).count
21
or I can do something with those objects.
PS C:\WINDOWS\system32> $(1 | get-member) | ForEach-object {write-host $_.name}
CompareTo
Equals
...
With only 21 objects, doing the above is no problem. But what if the process generates hundreds of thousands of objects? Then I don't want to run the process once just to count the objects and then run it again to execute what I want to do with them. So how can I get a count of objects in a collection sent down the pipeline?
A similar question was asked before, and the accepted answer was to use a counter variable inside the script block that works on the collection. The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct. So I don't want to just count inside the script block. I want a separate, independent measure of the size of the collection that I sent down the pipeline. How can I do that?
If processing and counting is needed:
Doing your own counting inside a ForEach-Object script block is your best bet to avoid processing in two passes.
The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct.
ForEach-Object is reliably invoked for each and every input object, including $null values, so there should be no need to double-check.
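To illustrate:
1, $null, 3 | ForEach-Object { "got: [$_]" }   # runs 3 times - $null is not skipped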
If you want a cleaner separation of processing and counting, you can pass multiple -Process script blocks to ForEach-Object (in this example, { $_ + 1 } is the input-processing script block and { ++$count } is the input-counting one):
PS> 1..5 | ForEach-Object -Begin { $count = 0 } `
-Process { $_ + 1 }, { ++$count } `
-End { "--- count: $count" }
2
3
4
5
6
--- count: 5
Note that, due to a quirk in ForEach-Object's parameter binding, passing -Begin and -End script blocks is actually required in order to pass multiple -Process (per-input-object) blocks; pass $null if you don't actually need -Begin and/or -End - see GitHub issue #4513.
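That is, if you don't actually need -Begin and -End logic, something like the following sketch works:
$count = 0
1..3 | ForEach-Object -Begin $null -Process { $_ }, { ++$count } -End $null
$count   # -> 3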
Also note that the $count variable lives in the caller's scope, and is not scoped to the ForEach-Object call; that is, $count = 0 potentially updates a preexisting $count variable, and, if it didn't previously exist, lives on after the ForEach-Object call.
If only counting is needed:
Measure-Object is the cmdlet to use with large, streaming input collections in the pipeline[1]:
The following example generates 100,000 integers one by one and has Measure-Object count them one by one, without collecting the entire input in memory.
PS> (& { $i=0; while ($i -lt 1e5) { (++$i) } } | Measure-Object).Count
100000
Caveat: Measure-Object ignores $null values in the input collection - see GitHub issue #10905.
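For example:
(1, $null, 3 | Measure-Object).Count   # -> 2 - the $null input is ignored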
Note that while counting input objects is Measure-Object's default behavior, it supports a variety of other operations as well, such as summing (-Sum) and averaging (-Average), optionally combined in a single invocation.
[1] Measure-Object, as a cmdlet, is capable of processing input in a streaming fashion, meaning it counts objects it receives one by one, as they're being received, which means that even very large streaming input sets (those also created one by one, such as enumerating the rows of a large CSV file with Import-Csv) can be processed without the risk of running out of memory - there is no need to load the input collection as a whole into memory. However, if (a) the input collection already is in memory, or (b) it can fit into memory and performance is important, then use (...).Count.