Get the sum of a specific substring position - PowerShell

How can I use PowerShell to get the sum of numbers extracted from a substring of certain lines in a file, and place that sum at a specific position on a different line, given the following conditions:
Get the sum of the numbers from position 3 to 13 of every line that starts with the character D. Place the sum at positions 10 to 14 on the line that starts with S.
So for example, if I have this file:
F123trial text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S xxxxx
I want to get the sum of 38.95, 18.95 and 18.95 and then place that sum where the xxxxx placeholder appears on the line that starts with S.

PowerShell's switch statement has powerful, but little-known features that allow you to iterate over the lines of a file (-file) and match lines by regular expressions (-regex).
Not only is switch -file convenient, it is also much faster than using cmdlets in a pipeline (see bottom section).
[double] $sum = 0
switch -regex -file file.txt {
# Note: The string to the left of each script block below ({ ... }),
# e.g., '^D', is the regex to match each line against.
# Inside the script blocks, $_ refers to the input line at hand.
# Extract number, add to sum, output the line.
'^D' { $sum += $_.Substring(2, 11); $_; continue }
# Summary line: place sum at character position 10, with 0-padding
# Note: `-replace ',', '.'` is only needed if your culture uses "," as the
# decimal mark.
'^S' { $_.Substring(0, 9) + '{0:000000000000000.00}' -f $sum -replace ',', '.'; continue }
# All other lines: pass them through.
default { $_ }
}
Note:
continue in the script blocks short-circuits further matching for the line at hand; by contrast, if you used break, no further lines would be processed.
Based on a later comment, I'm assuming you want an 18-character 0-left-padded number on the S line at character position 10.
With your sample file, the above yields:
F123trial text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S 000000000000076.85
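As an aside, here's a minimal sketch (against a hypothetical file demo.txt) that contrasts continue and break inside a switch -file statement:
switch -regex -file demo.txt {
    '^#'   { continue }  # continue: stop matching patterns for THIS line, move on to the next line
    '^END' { break }     # break: stop processing the remaining lines of the file entirely
    default { $_ }
}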
Optional reading: Comparing the performance of switch -file ... to Get-Content ... | ForEach-Object ...
Running the following test script:
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds,
(Measure-Command { get-content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
yields the following timings on my machine, for instance (the absolute numbers aren't important, but their ratio should give you a sense):
0.0578924 # switch -file
6.0417638 # Get-Content | ForEach-Object
That is, the pipeline-based solution is about 100 (!) times slower than the switch -file solution.
Digging deeper:
Frode F. points out that Get-Content is slow with large files - though its convenience makes it a popular choice - and mentions using the .NET Framework directly as an alternative:
Using [System.IO.File]::ReadAllLines(); however, given that it reads the entire file into memory, that is only an option with smallish files.
Using [System.IO.StreamReader]'s ReadLine() method in a loop.
However, use of the pipeline in itself, irrespective of the specific cmdlets used, introduces overhead. When performance matters - but only then - you should avoid it.
Here's an updated test that includes commands that use the .NET Framework methods, with and without the pipeline (use of the intrinsic .ForEach() method requires PSv4+):
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds
(Measure-Command { foreach ($line in [IO.File]::ReadLines((Convert-Path $tmpFile))) { $line } }).TotalSeconds
(Measure-Command {
$sr = [IO.StreamReader] (Convert-Path $tmpFile)
while(-not $sr.EndOfStream) { $sr.ReadLine() }
$sr.Close()
}).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)).ForEach({ $_ }) }).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)) | % { $_ } }).TotalSeconds
(Measure-Command { Get-Content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
Sample results, from fastest to slowest:
0.0124441 # switch -file
0.0365348 # [System.IO.File]::ReadLines() in foreach loop
0.0481214 # [System.IO.StreamReader] in a loop
0.1614621 # [System.IO.File]::ReadAllLines() with .ForEach() method
0.2745749 # (pipeline) [System.IO.File]::ReadAllLines() with ForEach-Object
0.5925222 # (pipeline) Get-Content with ForEach-Object
switch -file is the fastest by a factor of around 3, followed by the no-pipeline .NET solutions; using .ForEach() adds another factor of 3.
Simply introducing the pipeline (ForEach-Object instead of .ForEach()) adds another factor of 2; finally, using the pipeline with Get-Content and ForEach-Object adds another factor of 2.

You could try:
-match to find the lines using regex-pattern
The .NET string-method Substring() to extract the values from the "D"-lines
Measure-Object -Sum to calculate the sum
String concatenation with Substring() to insert the value at the right position in the "S"-line.
Ex:
$text = Get-Content -Path file.txt
$total = $text -match '^D' |
#Foreach "D"-line, extract the value and cast to double (to be able to sum it)
ForEach-Object { $_.Substring(2,11) -as [double] } |
#Measure the sum
Measure-Object -Sum | Select-Object -ExpandProperty Sum
$text | ForEach-Object {
if($_ -match '^S') {
#Line starts with S -> insert the sum, right-aligned so it ends at position 17
#(convert the sum to a string first; .Length on a [double] would report 1, not the digit count)
$sumString = [string]$total
$_.Substring(0,(17 - $sumString.Length)) + $sumString + $_.Substring(17)
} else {
#Not "S"-line -> output original content
$_
}
} | Set-Content -Path file.txt

Related

Join every other line in PowerShell

I want to combine every other line from the input below. Here is the input.
ALPHA-FETOPROTEIN ROUTINE CH 0203 001 02/03/2023#10:45 LIVERF3
###-##-#### #######,#### In lab
ALPHA-FETOPROTEIN ROUTINE CH 0203 234 02/03/2023#11:05 LIVER
###-##-#### ########,######## In lab
ANION GAP STAT CH 0203 124 02/03/2023#11:06 DAY
###-##-#### ######,##### #### In lab
BASIC METABOLIC PANE ROUTINE CH 0203 001 02/03/2023#10:45 LIVERF3
###-##-#### #######,#### ###### In lab
This is the desired output
ALPHA-FETOPROTEIN ROUTINE CH 0203 001 02/03/2023#10:45 LIVERF3 ###-##-#### #######,#### In lab
ALPHA-FETOPROTEIN ROUTINE CH 0203 234 02/03/2023#11:05 LIVER ###-##-#### ########,######## In lab
ANION GAP STAT CH 0203 124 02/03/2023#11:06 DAY ###-##-#### ######,##### #### In lab
BASIC METABOLIC PANE ROUTINE CH 0203 001 02/03/2023#10:45 LIVERF3 ###-##-#### #######,#### ###### In lab
The code that I have tried is
for($i = 0; $i -lt $splitLines.Count; $i += 2){
$splitLines[$i,($i+1)] -join ' '
}
It came from Joining every two lines in PowerShell output, but I can't seem to get it to work for me. I'm not well versed in PowerShell, but I'm at the mercy of what's available at work.
Edit: Here is the entire code that I am using as requested.
# SET VARIABLES
$inputfile = "C:\Users\Will\Desktop\testfile.txt"
$outputfile = "C:\Users\Will\Desktop\testfileformatted.txt"
$new_output = "C:\Users\Will\Desktop\new_formatted.txt"
# REMOVE EXTRA CHARACTERS
$remove_beginning_capture = "-------------------------------------------------------------------------------"
$remove_end_capture = "==============================================================================="
$remove_line = "------"
$remove_strings_with_spaces = " \d"
Get-Content $inputfile | Where-Object {$_ -notmatch $remove_beginning_capture} | Where-Object {$_ -notmatch $remove_end_capture} | Where-Object {$_ -notmatch $remove_line} | Where-Object {$_ -notmatch $remove_strings_with_spaces} | ? {$_.trim() -ne "" } | Set-Content $outputfile
# Measures line length for loop
$file_lines = gc $outputfile | Measure-Object
#Remove Whitespace
# $whitespace_removed = (Get-Content $outputfile -Raw) -replace '\s+', ' '| Set-Content -Path C:\Users\Will\Desktop\new_formatted.csv
# Combine every other line
$lines = Get-Content $outputfile -Raw
$newcontent = $lines.Replace("`n","")
Write-Host "Content: $newcontent"
$newcontent | Set-Content $new_output
for($i = 0; $i -lt $splitLines.Count; $i += 2){
$splitLines[$i,($i+1)] -join ' '
}
Just read two lines and then print one
$inputFilename = "c:\temp\test.txt"
$outputFilename = "c:\temp\test1.txt"
$reader = [System.IO.StreamReader]::new($inputFilename)
$writer = [System.IO.StreamWriter]::new($outputFilename)
while(($line = $reader.ReadLine()) -ne $null)
{
$secondLine = ""
if(!$reader.EndOfStream){ $secondLine = $reader.ReadLine() }
$writer.WriteLine($line + $secondLine)
}
$reader.Close()
$writer.Flush()
$writer.Close()
PowerShell-idiomatic solutions:
Use Get-Content with -ReadCount 2 in order to read the lines from your file in pairs, which allows you to process each pair in a ForEach-Object call, where the constituent lines can be joined to form a single output line.
Get-Content -ReadCount 2 yourFile.txt |
ForEach-Object { $_[0] + ' ' + $_[1].TrimStart() }
The above directly outputs the resulting lines (as the for command in your question does), causing them to print to the display by default.
Pipe to Set-Content to save the output to a file:
Get-Content -ReadCount 2 yourFile.txt |
ForEach-Object { $_[0] + ' ' + $_[1].TrimStart() } |
Set-Content yourOutputFile.txt
Performance notes:
Unfortunately (as of PowerShell 7.3.2), Get-Content is quite slow by default - see GitHub issue #7537, and the performance of ForEach-Object and Where-Object could be improved too - see GitHub issue #10982.
At the expense of collecting all inputs and outputs in memory first, you can noticeably improve the performance with the following variation, which avoids the ForEach-Object cmdlet in favor of the intrinsic .ForEach() method, and, instead of piping to Set-Content, passes all output lines via the -Value parameter:
Set-Content yourOutputFile.txt -Value (
(Get-Content -ReadCount 2 yourFile.txt).ForEach({ $_[0] + ' ' + $_[1].TrimStart() })
)
Read on for even faster alternatives, but remember that optimizations are only worth undertaking if actually needed - if the first PowerShell-idiomatic solution above is fast enough in practice, it is worth using for its conceptual elegance and concision.
See this Gist for benchmarks that compare the relative performance of the solutions in this answer as well as that of the solution from jdweng's .NET API-based answer.
A better-performing alternative is to use a switch statement with the -File parameter to process files line by line:
$i = 1
switch -File yourFile.txt {
default {
if ($i++ % 2) { $firstLineInPair = $_ }
else { $firstLineInPair + ' ' + $_.TrimStart() }
}
}
Helper index variable $i and the modulo operation (%) are simply used to identify which line is the start of a (new) pair, and which one is its second half.
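To see the alternation in isolation, a trivial sketch:
1..6 | ForEach-Object { '{0} % 2 = {1}' -f $_, ($_ % 2) }   # yields 1,0,1,0,1,0 - odd counter values mark a pair's first line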
The switch statement is itself streaming, but it cannot be used as-is as pipeline input. By enclosing it in & { ... }, it can, but that forfeits some of the performance benefits, making it only marginally faster than the optimized Get-Content -ReadCount 2 solution:
& {
$i = 1
switch -File yourFile.txt {
default {
if ($i++ % 2) { $firstLineInPair = $_ }
else { $firstLineInPair + ' ' + $_.TrimStart() }
}
}
} | Set-Content yourOutputFile.txt
For the best performance when writing to a file, use Set-Content $outFile -Value $(...), albeit at the expense of collecting all output lines in memory first:
Set-Content yourOutputFile.txt -Value $(
$i = 1
switch -File yourFile.txt {
default {
if ($i++ % 2) { $firstLineInPair = $_ }
else { $firstLineInPair + ' ' + $_.TrimStart() }
}
}
)
The fastest and most concise solution is to use a regex-based approach, which reads the entire file up front:
(Get-Content -Raw yourFile.txt) -replace '(.+)\r?\n(?: *)(.+\r?\n)', '$1 $2'
Note:
The assumption is that all lines are paired, and that the last line has a trailing newline.
The -replace operation matches two consecutive lines, and joins them together with a space, ignoring leading spaces on the second line. For a detailed explanation of the regex and the ability to interact with it, see this regex101.com page.
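To see just the regex at work, here's a sketch applied to a hypothetical two-line string with LF newlines:
"line1`n   line2`n" -replace '(.+)\r?\n(?: *)(.+\r?\n)', '$1 $2'   # -> "line1 line2`n"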
To save the output to a file, you can pipe directly to Set-Content:
(Get-Content -Raw yourFile.txt) -replace '(.+)\r?\n(?: *)(.+\r?\n)', '$1 $2' |
Set-Content yourOutputFile.txt
In this case, because the pipeline input to Set-Content is provided by an expression that doesn't involve for-every-input-line calls to script blocks ({ ... }) (as the switch solution requires), there is virtually no slowdown resulting from use of the pipeline (whose use is generally preferable for conceptual elegance and concision).
As for what you tried:
The $splitLines-based solution in your question is predicated on having assigned all lines of the input file to this self-chosen variable as an array, which your code does not do.
While you could fill the variable $splitLines with an array of lines from your input file via $splitLines = Get-Content yourFile.txt (Get-Content reads text files line by line by default), the switch-based line-by-line solution is more efficient and streams its results, which - if they are saved to a file - keeps memory usage constant; that matters with large input sets, though rarely with text files.
A performance tip when reading all lines at once into an array with Get-Content: use -ReadCount 0, which greatly speeds up the operation:
$splitLines = Get-Content -ReadCount 0 yourFile.txt

Why does my Out-File not write output into the file

$ready = Read-Host "How many you want?: "
$i = 0
do{
(-join(1..12 | ForEach {((65..90)+(97..122)+(".") | % {[char]$_})+(0..9)+(".") | Get-Random}))
$i++
} until ($i -match $ready) Out-File C:/numbers.csv -Append
If I give a value of 10 to the script, it generates 10 random values and shows them in PowerShell. It even creates a new file called numbers.csv. However, it does not add the generated output to the file. Why is that?
Your Out-File C:/numbers.csv -Append call is a completely separate statement from your do loop, and an Out-File call without any input simply creates an empty file.[1]
You need to chain (connect) commands with | in order to make them run in a pipeline.
However, with a statement such as a do { ... } until loop, this won't work as-is, but you can convert such a statement to a command that you can use as part of a pipeline by enclosing it in a script block ({ ... }) and invoking it with &, the call operator (to run in a child scope), or ., the dot-sourcing operator (to run directly in the caller's scope):
[int] $ready = Read-Host "How many you want?"
$i = 0
& {
do{
-join (1..12 | foreach {
(65..90 + 97..122 + '.' | % { [char] $_ }) +(0..9) + '.' | Get-Random
})
$i++
} until ($i -eq $ready)
} | Out-File C:/numbers.csv -Append
Note the [int] type constraint to convert the Read-Host output, which is always a string, to a number, and the use of the -eq operator rather than the text- and regex-based -match operator in the until condition; also, unnecessary grouping with (...) has been removed.
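A quick illustration of why -match is the wrong operator here (both operands are coerced to strings, and the right-hand side is treated as a regex):
10 -match 1   # $true:  '10' matches the regex '1' as a substring
10 -eq 1      # $false: numeric comparison, as intended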
Note: An alternative to the use of a script block with either the & or . operator is to use $(...), the subexpression operator, as shown in MikeM's helpful answer. The difference between the two approaches is that the former streams its output to the pipeline - i.e., outputs objects one by one - whereas $(...) invariably collects all output in memory, up front.
For smallish input sets this won't make much of a difference, but the in-memory collection that $(...) performs can become problematic with large input sets, so the & { ... } / . { ... } approach is generally preferable.
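A minimal sketch of the streaming difference (watch when the downstream "got" messages appear):
& { 1..3 | % { Start-Sleep -Milliseconds 500; $_ } } | % { "got $_" }    # streams: each "got" appears as soon as the value is produced
$( 1..3 | % { Start-Sleep -Milliseconds 500; $_ } ) | % { "got $_" }    # collects: all "got" messages appear only at the end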
Arno van Boven's answer shows a simpler alternative to your do ... until loop based on a for loop.
Combining a foreach loop with .., the range operator, is even more concise and expressive (the cost of constructing the array up front is usually negligible, and overall this still executes noticeably faster):
[int] $ready = Read-Host "How many you want?"
& {
foreach ($i in 1..$ready) {
-join (1..12 | foreach {
([char[]] (65..90 + 97..122)) + 0..9 + '.' | Get-Random
})
}
} | Out-File C:/numbers.csv -Append
The above also shows a simplification of the original command via a [char[]] cast that directly converts an array of code points to an array of characters.
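For instance, as a quick demonstration of the cast:
[char[]] (65..67)           # -> A B C: converts an array of code points to characters in one step
-join ([char[]] (65..70))   # -> 'ABCDEF'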
In PowerShell [Core] 7+, you could further simplify by taking advantage of Get-Random's -Count parameter:
[int] $ready = Read-Host "How many you want?"
& {
foreach ($i in 1..$ready) {
-join (
([char[]] (65..90 + 97..122)) + 0..9 + '.' | Get-Random -Count 12
)
}
} | Out-File C:/numbers.csv -Append
And, finally, you could have avoided a looping statement altogether and used the ForEach-Object cmdlet instead (whose built-in alias, perhaps confusingly, is also foreach, but there's also %), as you're already doing inside your loop (1..12 | foreach ...):
[int] $ready = Read-Host "How many you want?"
1..$ready | ForEach-Object {
-join (1..12 | ForEach-Object {
([char[]] (65..90 + 97..122)) + 0..9 + '.' | Get-Random
})
} | Out-File C:/numbers.csv -Append
[1] In Windows PowerShell, Out-File uses UTF-16LE ("Unicode") encoding by default, so even a conceptually empty file still contains 2 bytes, namely the UTF-16LE BOM. In PowerShell [Core] v6+, BOM-less UTF-8 is the default across all cmdlets, so there you'll truly get an empty (0 bytes) file.
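A sketch that makes the footnote concrete (hypothetical path):
Out-File C:\temp\empty.txt            # no pipeline input - a conceptually empty file
(Get-Item C:\temp\empty.txt).Length   # 2 in Windows PowerShell (the UTF-16LE BOM); 0 in PowerShell 6+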
Another way is to wrap the loop in a sub-expression and pipe it:
$ready = Read-Host "How many you want?: "
$i = 0
$(do{
(-join(1..12 | ForEach {((65..90)+(97..122)+(".") | % {[char]$_})+(0..9)+(".") | Get-Random}))
$i++
} until ($i -match $ready)) | Out-File C:/numbers.csv -Append
I personally avoid Do loops when I can, because I find them hard to read. Combining the two previous answers, I'd write it like this, because I find it easier to tell what is going on. Using a for loop instead, every line becomes its own self-contained piece of logic.
[int]$amount = Read-Host "How many you want?: "
& {
for ($i = 0; $i -lt $amount; $i++) {
-join(1..12 | foreach {((65..90)+(97..122)+(".") | foreach {[char]$_})+(0..9)+(".") | Get-Random})
}
} | Out-File C:\numbers.csv -Append
(Please do not accept this as an answer, this is just showing another way of doing it)

Compare the contents of two files and output the differences along with line numbers

I came upon a problem where we need to compare the contents of two files, a.txt and b.txt, line by line and output any difference found, along with the content and line number.
We should not use Compare-Object in this scenario. Do we have any alternative?
I tried using for loops but was unable to get the desired result.
For example, a.txt:
Hello = "Required"
World = 5678
Environment = "new"
Available = 9080.90
b.txt"
Hello = "Required"
World = 5678.908
Environment = "old"
Available = 6780.90
I need to get the output as:
Line number 2:World is not matching
Line number 3:Environment is not matching
Line number 4:Available is not matching
I tried with the following code snippet but was unsuccessful
$file1 = Get-Content "C:\Users\Desktop\a.txt"
$file2 = Get-Content "C:\Users\Desktop\b.txt"
$result = "C:\Users\Desktop\result.txt"
$file1 | foreach {
$match = $file2 -match $_
if ( $match ){
$match | Out-File -Force $result -Append
}
}
As you seem to have an adverse reaction to Compare-Object, let's try this extremely janky set-up. As you have little to no requirements listed, this will give you the bare minimum to meet your condition of 'any difference found'.
Copy and paste more If statements should you have more lines.
$a = get-content C:\a.txt
$b = get-content C:\b.txt
If($a[0] -ne $b[0]) {
"Line number 1:Hello is not matching" | Out-Host
}
If($a[1] -ne $b[1]) {
"Line number 2:World is not matching" | Out-Host
}
If($a[2] -ne $b[2]) {
"Line number 3:Environment is not matching" | Out-Host
}
If($a[3] -ne $b[3]) {
"Line number 4:Available is not matching" | Out-Host
}
Get-Content returns the file content as an array of strings with a zero-based index.
The array variable has an automatic .Count/.Length property
that you can use to iterate the arrays with a simple counting for loop.
You need to split the line at the ' = ' to separate name and content.
Use the -f format operator to output the results.
## Q:\Test\2019\05\21\SO_56231110.ps1
$Desktop = [environment]::GetFolderPath('Desktop')
$File1 = Get-Content (Join-Path $Desktop "a.txt")
$File2 = Get-Content (Join-Path $Desktop "b.txt")
for ($i=0;$i -lt $File1.Count;$i++){
if($File1[$i] -ne $File2[$i]){
"Line number {0}:{1} is not matching" -f ($i+1),($File1[$i] -split ' = ')[0]
}
}
Sample output:
Line number 2:World is not matching
Line number 3:Environment is not matching
Line number 4:Available is not matching

PowerShell: Is `$matches` guaranteed to carry down the pipeline in sync with the pipeline variable?

First, make some example files:
2010..2015 | % { "" | Set-Content "example $_.txt" }
#example 2010.txt
#example 2011.txt
#example 2012.txt
#example 2013.txt
#example 2014.txt
#example 2015.txt
What I want to do is match the year with a regex capture group, then reference the match with $matches[1] and use it. I can write this to do both in one scriptblock, in one cmdlet, and it works fine:
gci *.txt | foreach {
if ($_ -match '(\d+)') # regex match the year
{ # on the current loop variable
$matches[1] # and use the capture group immediately
}
}
#2010
#2011
#.. etc
I can also write this to do the match in one scriptblock, and then reference $matches in another cmdlet's scriptblock later on:
gci *.txt | where {
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | foreach { # pipeline!
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
}
Which has the same output and it appears to work fine. But is it guaranteed to work, or is it undefined and a coincidence?
Could 'example 2012.txt' get matched, then buffered; 'example 2013.txt' get matched, then buffered; and then | foreach gets to work on 'example 2012.txt' while $matches has already been updated with 2013, so they're out of sync?
I can't make them fall out of sync - but I could still be relying on undefined behaviour.
(FWIW, I prefer the first approach for clarity and readability as well).
There is no synchronization going on, per se. The second example works because of the way the pipeline works. As each single object gets passed along by satisfying the condition in Where-Object, the -Process block in ForEach-Object immediately processes it, so $Matches hasn't yet been overwritten from any other -match operation.
If you were to do something that causes the pipeline to gather objects before passing them on, like sorting, you would be in trouble:
gci *.txt | where {
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | sort | foreach { # pipeline!
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
}
For example, the above should fail, outputting n objects, but they will all be the very last match.
So it's prudent not to rely on that, because it obscures the danger. Someone else (or you a few months later) may not think anything of inserting a sort and then be very confused by the result.
As TheMadTechnician pointed out in the comments, the placement changes things. Put the sort after the part where you reference $Matches (in the foreach), or before you filter with where, and it will still work as expected.
I think that drives home the point that it should be avoided, as it's fairly unclear. If the code changes in parts of the pipeline you don't control, then the behavior may end up being different, unexpectedly.
I like to throw in some verbose output to demonstrate this sometimes:
Original
gci *.txt | where {
"Where-Object: $_" | Write-Verbose -Verbose
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | foreach { # pipeline!
"ForEach-Object: $_" | Write-Verbose -Verbose
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
}
Sorted
gci *.txt | where {
"Where-Object: $_" | Write-Verbose -Verbose
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | sort | foreach { # pipeline!
"ForEach-Object: $_" | Write-Verbose -Verbose
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
}
The difference you'll see is that in the original, as soon as where "clears" an object, foreach gets it right away. In the sorted, you can see all of the wheres happening first, before foreach gets any of them.
sort doesn't have any verbose output so I didn't bother calling it that way, but essentially its Process {} block just collects all of the objects so it can compare (sort!) them, then spits them out in the End {} block.
More examples
First, here's a function that mocks Sort-Object's collection of objects (it doesn't actually sort them or do anything):
function mocksort {
[CmdletBinding()]
param(
[Parameter(
ValueFromPipeline
)]
[Object]
$O
)
Begin {
Write-Verbose "Begin (mocksort)"
$objects = @()
}
Process {
Write-Verbose "Process (mocksort): $O (nothing passed, collecting...)"
$objects += $O
}
End {
Write-Verbose "End (mocksort): returning objects"
$objects
}
}
Then, we can use that with the previous example and some sleep at the end:
gci *.txt | where {
"Where-Object: $_" | Write-Verbose -Verbose
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | mocksort -Verbose | foreach { # pipeline!
"ForEach-Object: $_" | Write-Verbose -Verbose
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
} | % { sleep -milli 500 ; $_ }
To complement briantist's great answer:
Aside from aggregating cmdlets such as Sort-Object (cmdlets that (must) collect all input first, before producing any output), the -OutBuffer common parameter can also break the command:
gci *.txt | where -OutBuffer 100 {
$_ -match '(\d+)' # regex match here, in the Where scriptblock
} | foreach { # pipeline!
$matches[1] # use $matches which was set in the previous
# scriptblock, in a different cmdlet
}
This causes the where (Where-Object) cmdlet to buffer its first 100 output objects until the 101st object is generated, and only then send these 101 objects on, so that $matches[1] in the foreach (ForEach-Object) block will in this case only see the 101st (matching) filename's capture-group value, in each of the (first) 101 iterations.
Generally, with an -OutBuffer value of N, the first N + 1 foreach invocations would all see the same $matches value from the (N + 1)-th input object, and so forth for subsequent batches of N + 1 objects.
From Get-Help about_CommonParameters:
When you use this parameter, Windows PowerShell does not call the
next cmdlet in the pipeline until the number of objects generated
equals OutBuffer + 1. Thereafter, it sends all objects as they are
generated.
Note that the last sentence suggests that only the first N + 1 objects are subject to buffering, which, however, is not true, as the following example (thanks, @briantist) demonstrates:
1..5 | % { Write-Verbose -vb $_; $_ } -OutBuffer 1 | % { "[$_]" }
VERBOSE: 1
VERBOSE: 2
[1]
[2]
VERBOSE: 3
VERBOSE: 4
[3]
[4]
VERBOSE: 5
[5]
That is, -OutBuffer 1 caused all objects output by % (ForEach-Object) to be batched in groups of 2, not just the first 2.

Fastest way to parse thousands of small files in PowerShell

I have over 16000 inventory log files ranging in size from 3-5 KB on a network share.
Sample file looks like this:
## System Info
SystemManufacturer:=:Dell Inc.
SystemModel:=:OptiPlex GX620
SystemType:=:X86-based PC
ChassisType:=:6 (Mini Tower)
## System Type
isLaptop=No
I need to put them into a DB, so I started parsing them and creating a custom object for each that I can later use to check duplicates, normalize etc...
The initial parse, with a code snippet like the one below, took about 7.5 minutes.
Foreach ($invlog in $invlogs) {
$content = gc $invlog.FullName -ReadCount 0
foreach ($line in $content) {
if ($line -match '^#|^\s*$') { continue }
$invitem,$value=$line -split ':=:'
[PSCustomObject]@{Name=$invitem;Value=$value}
}
}
I started optimizing it, and after several rounds of trial and error ended up with this, which takes 2 minutes and 4 seconds:
Foreach ($invlog in $invlogs) {
foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match '^\w') ) {
$invitem,$value=$line -split ':=:'
[PSCustomObject]@{Name=$invitem;Value=$value} # 2.04 mins
}
}
I also tried using a hashtable instead of PSCustomObject, but to my surprise it took much longer (5 minutes 26 seconds):
Foreach ($invlog in $invlogs) {
$hash=@{}
foreach ($line in ([System.IO.File]::ReadLines("$($invlog.FullName)") -match $propertyline) ) {
$invitem,$value=$line -split ':=:'
$hash[$invitem]=$value #5.26mins
}
}
What would be the fastest method to use here?
See if this is any faster:
Foreach ($invlog in $invlogs) {
@(gc $invlog.FullName -ReadCount 0) -notmatch '^#|^\s*$' |
foreach {
$invitem,$value=$_ -split ':=:'
[PSCustomObject]@{Name=$invitem;Value=$value}
}
}
The -match and -notmatch operators, when applied to an array, return all the elements that satisfy the match, so you can eliminate having to test every line for the lines to exclude.
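For example:
'DA0001','# comment','DA0002' -notmatch '^#'   # -> DA0001, DA0002 (array filtering, not a Boolean)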
Are you really wanting to create a PS Object for every line, or just one for every file?
If you want one object per file, see if this is any quicker:
The multi-line regex eliminates the line array, and a filter is used in place of the foreach to create the hash entries.
$regex = [regex]'(?ms)^(\w+):=:([^\r]+)'
filter make-hash { @{$_.groups[1].value = $_.groups[2].value} }
Foreach ($invlog in $invlogs) {
$regex.matches([io.file]::ReadAllText($invlog.fullname)) | make-hash
}
The objective of switching to the multi-line regex and [io.file]::ReadAllText() is to simplify what PowerShell is doing with the file input internally. The result of [io.file]::ReadAllText() will be a string object, which is a much simpler type of object than the array of strings that [io.file]::ReadAllLines() will produce, and requires less overhead to construct internally.
A filter is essentially just the Process block of a function - it will run once for every object that comes to it from the pipeline, so it emulates the action of ForEach-Object, but actually runs slightly faster (I don't know the internals well enough to tell you exactly why).
Both of these changes require more coding and only result in a marginal increase in performance. In my testing, switching to the multi-line regex gained about .1 ms per file, and changing from ForEach-Object to the filter another .1 ms. You probably don't see these techniques used very often because of the low return compared to the additional coding work required, but it becomes significant when you start to multiply those fractions of a ms by 160K iterations.
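A minimal illustration of the filter keyword (a hypothetical Double filter):
filter Double { $_ * 2 }   # equivalent to: function Double { process { $_ * 2 } }
1..3 | Double              # -> 2 4 6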
Try this:
Foreach ($invlog in $invlogs) {
$output = @{}
foreach ($line in ([IO.File]::ReadLines("$($invlog.FullName)") -ne '') ) {
if ($line.Contains(":=:")) {
$item, $value = $line.Split(":=:") -ne ''
$output[$item] = $value
}
}
New-Object PSObject -Property $output
}
As a general rule, regex is sometimes cool, but usually slower than plain string methods such as Contains() and Split().
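A rough sketch of that claim (timings vary by machine; a foreach loop keeps pipeline overhead out of the measurement):
(Measure-Command { foreach ($i in 1..1e5) { $null = 'a:=:b' -match ':=:' } }).TotalSeconds
(Measure-Command { foreach ($i in 1..1e5) { $null = 'a:=:b'.Contains(':=:') } }).TotalSeconds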
Wouldn't you want an object per system, and not per key-value pair? :S
Like this. By replacing Get-Content with the .NET method, you could probably save some time.
Get-ChildItem -Filter *.txt -Path <path to files> | ForEach-Object {
$ht = @{}
Get-Content $_ | Where-Object { $_ -match ':=:' } | ForEach-Object {
$ht[($_ -split ':=:')[0].Trim()] = ($_ -split ':=:')[1].Trim()
}
[pscustomobject]$ht
}
ChassisType SystemManufacturer SystemType SystemModel
----------- ------------------ ---------- -----------
6 (Mini Tower) Dell Inc. X86-based PC OptiPlex GX620