How to write a streaming function in PowerShell

I tried to create a function that emulates Linux's head:
function head {
[CmdletBinding()]
param (
    [Parameter(Mandatory=$false, ValueFromPipeline=$true)] [object[]] $inputs,
    [Parameter(Position=0, Mandatory=$false)] [string] $liness = "10",
    [Parameter(Position=1, ValueFromRemainingArguments=$true)] [string[]] $filess
)
    $lines = 0
    if (![int]::TryParse($liness, [ref]$lines)) {
        $lines = 10
        $filess = ,$liness + @{$true=@();$false=$filess}[$null -eq $filess]
    }
    $read = 0
    $input | Select-Object -First $lines
    if ($filess) {
        Get-Content -TotalCount $lines $filess
    }
}
The problem is that this will actually read all the content (whether from $filess or from $input) and only then print the first lines, whereas I'd want head to read the first lines and forget about the rest, so it can work with large files.
How can this function be rewritten?

Well, as far as I know, you are overdoing it slightly...
"Beginning in Windows PowerShell 3.0, Select-Object includes an optimization feature that prevents commands from creating and processing objects that are not used. When you include a Select-Object command with the First or Index parameter in a command pipeline, Windows PowerShell stops the command that generates the objects as soon as the selected number of objects is generated, even when the command that generates the objects appears before the Select-Object command in the pipeline. To turn off this optimizing behavior, use the Wait parameter."
So all you need to do is:
Get-Content -Path somefile | Select-Object -First 10 # or pass a variable
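If you still want this wrapped up as a head function, note that the early-stop only survives when Select-Object itself runs as part of the caller's pipeline, so a wrapper has to proxy it through a steppable pipeline. A rough sketch of my own (untested; the original's numeric-first-argument juggling is omitted):
function head {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline=$true)] [object] $InputObject,
        [Parameter(Position=0)] [int] $Lines = 10,
        [Parameter(Position=1, ValueFromRemainingArguments=$true)] [string[]] $Files
    )
    begin {
        # Proxy Select-Object so its -First optimization can still stop upstream commands
        $sp = { Select-Object -First $Lines }.GetSteppablePipeline($MyInvocation.CommandOrigin)
        $sp.Begin($PSCmdlet)
    }
    process {
        if (-not $Files) { $sp.Process($InputObject) }
    }
    end {
        # For files, -TotalCount already stops reading after $Lines lines
        if ($Files) { Get-Content -Path $Files -TotalCount $Lines }
        $sp.End()
    }
}
Usage would then be Get-Content huge.log | head 5, or head 5 huge.log for the file case.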

Select-Object has different behavior using piping vs. -InputObject

Can someone shed light on why example 2 does not have the same result as example 1? I would think that $a and $b2 should be the same, but $b2 is null! I am writing a script where using the method in example 2 is preferable.
example 1:
$a = Get-Content $some_text_file | Select-Object -Skip 1
example 2:
$b1 = Get-Content $some_text_file
$b2 = Select-Object -InputObject $b1 -Skip 1
Edit: using this syntax gets me where I need to be.
$b1 = Get-Content $file
$b2 = $b1 | Select-Object -Skip 1
As Lee_Dailey notes, this is expected behavior.
"Can someone shed light on why?"
This has to do with how cmdlets execute in the pipeline.
As you might know, the core functionality of a cmdlet is made up of three methods:
BeginProcessing()
ProcessRecord()
EndProcessing()
(The begin/process/end blocks in an advanced function correspond to these.)
BeginProcessing() and EndProcessing() are always executed exactly once. How many times ProcessRecord() executes depends on whether the cmdlet is the first command in a pipeline or not.
When a cmdlet occurs as the first element in a pipeline (i.e. there is no | sign to the left of it), ProcessRecord() executes once.
When a cmdlet receives input from an upstream command in its pipeline, however, ProcessRecord() is run once for each input item coming in through the pipeline.
With this in mind, please consider this simplified version of Select-Object:
function Select-FakeObject {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object[]]$InputObject,

        [Parameter()]
        [int]$Skip = 0
    )
    begin {
        $skipped = 0
    }
    process {
        if($skipped -lt $Skip){
            $skipped++
            Write-Host "Skipping $InputObject"
        }
        else{
            Write-Host "Selecting $InputObject"
            Write-Output $InputObject
        }
    }
}
Now, let's try both scenarios with this dummy function:
PS C:\> $a = 1,2,3
PS C:\> $b1 = $a | Select-FakeObject -Skip 1
We'll see that PowerShell indeed calls the process block once per input item:
Skipping 1
Selecting 2
Selecting 3
whereas if we pass the object like in your second example:
PS C:\> $a = 1,2,3
PS C:\> $b2 = Select-FakeObject -InputObject $a -Skip 1
Skipping 1 2 3
We now see that the process block executes only once, over all of $a rather than over the individual items. Since the entire array is treated as a single input item, -Skip 1 skips the whole thing, which is why $b2 ends up empty.

How to get output in a desired encoding scheme using PowerShell Out-File

I have a requirement in which I need to read a datafile line by line and do string/character replacement; the data is in Windows Latin-1.
I've written this PowerShell script (my first one), initially using Out-File with the -Encoding option. However, the output file thus created had some characters translated. Then I searched and came across WriteAllLines, but I'm unable to use it in my code.
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$pdsname = "ABCD.XYZ.PQRST"
$datafile = "ABCD.SCHEMA.TABLE.DAT"
Get-Content ABCD.SCHEMA.TABLE.DAT | ForEach-Object {
    $matches = [regex]::Match($_,'ABCD')
    $string_to_be_replaced = $_.Substring($matches.Index, $pdsname.Length+10)
    $string_to_be_replaced = "`"$string_to_be_replaced`""
    $member = [regex]::Match($_,"`"$pdsname\(([^\)]+)\)`"").Groups[1].Value
    $_ -replace $([regex]::Escape($string_to_be_replaced)), $member
} | [System.IO.File]::WriteAllLines("C:\Users\USer01", "ABCD.SCHEMA.TABLE.NEW.DAT", $encoding)
With the help of an answer from @Gzeh Niert, I updated my script as below. However, when I execute it, the output file has just the last record: instead of appending, it overwrote. I then tried using [System.IO.File]::AppendAllText, but this strangely creates a larger file that still shows only the last record. In short, it's likely that empty lines are being written.
param(
    [String]$datafile
)
$pdsname = "ABCD.XYZ.PQRST"
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$datafile = "ABCD.SCHEMA.TABLE.DAT"
$datafile2 = "ABCD.SCHEMA.TABLE.NEW.DAT"
Get-Content $datafile | ForEach-Object {
    $matches = [regex]::Match($_,'ABCD')
    if($matches.Success) {
        $string_to_be_replaced = $_.Substring($matches.Index, $pdsname.Length+10)
        $string_to_be_replaced = "`"$string_to_be_replaced`""
        $member = [regex]::Match($_,"`"$pdsname\(([^\)]+)\)`"").Groups[1].Value
        $replacedContent = $_ -replace $([regex]::Escape($string_to_be_replaced)), $member
        [System.IO.File]::AppendAllText($datafile2, $replacedContent, $encoding)
    }
    else {
        [System.IO.File]::AppendAllText($datafile2, $_, $encoding)
    }
    #[System.IO.File]::WriteAllLines($datafile2, $replacedContent, $encoding)
}
Please help me figure out where I am going wrong.
System.IO.File.WriteAllLines takes either an array of strings or an IEnumerable of strings as its second parameter; it cannot be piped to because it is not a cmdlet that handles pipeline input but a .NET Framework method.
You should try storing your replaced content in a string[] and using it as a parameter when saving the file.
param(
    [String]$file
)
$encoding = [Text.Encoding]::GetEncoding('iso-8859-1')
$replacedContent = [string[]]@(Get-Content $file | ForEach-Object {
    # Do stuff
})
[System.IO.File]::WriteAllLines($file, $replacedContent, $encoding)
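As for the follow-up problem in the question: [System.IO.File]::AppendAllText writes the string exactly as given, with no trailing newline, so successive appends run together and the file looks like one long last record. If you do append inside the loop, add the terminator yourself. A sketch, reusing the variable names from the question:
Get-Content $datafile | ForEach-Object {
    # ... compute $replacedContent as in the question ...
    [System.IO.File]::AppendAllText($datafile2, $replacedContent + [Environment]::NewLine, $encoding)
}
That said, collecting all the lines and calling WriteAllLines once, as shown above, also avoids reopening the file for every record.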

Use PowerShell's TabExpansion function

When I run the statement man tabexpansion it shows that I simply need to provide -line and -lastword.
However, when I run a statement like this:
tabexpansion -line "c:/win" -lastword "c:/win"
It returns nothing.
Shouldn't it show at least C:\Windows? What am I doing wrong?
TabExpansion is a function involved in tab handling during manual command entry. It performs completion of various types of objects, like path names, variable names, and function/cmdlet names. The interface to this function is discussed at this SO post. The book referenced there describes overriding TabExpansion2.
I read elsewhere that TabExpansion is used in PS 2.0, whereas TabExpansion2 is used in 3.0.
As to why TabExpansion doesn't return anything, I can only answer for my system (which has PS 3.0). On my system cat function:tabexpansion gives:
[CmdletBinding()]
param(
    [String] $line,
    [String] $lastWord
)
process {
    if ($line -eq "Install-Module $lastword" -or $line -eq "inmo $lastword"
        -or $line -eq "ismo $lastword" -or $line -eq "upmo $lastword"
        -or $line -eq "Update-Module $lastword") {
        Get-PsGetModuleInfo -ModuleName "$lastword*" | % { $_.Id } | sort -Unique
    }
    elseif ( Test-Path -Path Function:\$tabExpansionBackup ) {
        & $tabExpansionBackup $line $lastWord
    }
}
Unless $line begins with one of those specific tokens, execution falls through to the elseif. There, if the variable $tabExpansionBackup is not defined, the function exits with no output. With the input in the OP, that gives the output you're seeing: none.
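For what it's worth, the hook itself is easy to experiment with: define your own TabExpansion function and the v2-style completer will call it with the current line and last word. A toy illustration of my own (note it clobbers any existing handler unless you chain to it, the way PsGet does with $tabExpansionBackup above):
function TabExpansion([string]$line, [string]$lastWord) {
    # Offer fixed completions for words starting with 'dep'
    if ($lastWord -like 'dep*') {
        'deploy-dev', 'deploy-prod'
    }
    # Returning nothing should let the host fall back to default filename completion
}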

Searching many large text files in PowerShell

I frequently have to search server log files in a directory that may contain 50 or more files of 200+ MB each. I've written a function in Powershell to do this searching. It finds and extracts all the values for a given query parameter. It works great on an individual large file or a collection of small files but totally bites in the above circumstance, a directory of large files.
The function takes a parameter, which consists of the query parameter to be searched.
In pseudo-code:
Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matches to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining:
    if (hash has keys)
        enumerate the hash keys,
        sort and append to string array
        set-content the string array to a file
        print summary to console
        exit
    else
        print summary to console
        exit
Here's a stripped-down version of the file processing.
$wtmatches = @{}
gci -Filter *.log | Select-String -Pattern $searcher |
    % { $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect() }
I'm just using an old Perl trick of de-duplicating found items by making them the keys of a hash. Perhaps this is an error, but a typical output of the processing is going to be around 30,000 items at most; more normally, the found items number in the low thousands. From what I can see, the number of keys in the hash does not affect processing time; it is the size and number of the files that breaks it. I recently threw in the GC call in desperation; it does have some positive effect, but it is marginal.
The issue is that with the large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. It doesn't actually use a lot of CPU, interestingly, but there's a lot of volatile storage going on. Once the RAM usage has gone up over 90%, I can just punch out and go watch TV. It could take hours to complete the processing to produce a file with 15,000 or 20,000 unique values.
I would like advice and/or suggestions for increasing the efficiency, even if that means using a different paradigm to accomplish the processing. I went with what I know. I use this tool on almost a daily basis.
Oh, and I'm committed to using PowerShell. ;-) This function is part of a complete module I've written for my job, so suggestions of Python, Perl, or other useful languages are not useful in this case.
Thanks.
mp
Update:
Using latkin's ProcessFile function, I used the following wrapper for testing. His function is orders of magnitude faster than my original.
function Find-WtQuery {
    <#
    .Synopsis
        Takes a parameter with a capture regex and a wildcard for the files list.
    .Description
        This function is intended to be used on large collections of large files that have
        the potential to take an unacceptably long time to process using other methods. It
        requires that a regex capture group be passed in as the value to search for.
    .Parameter Target
        The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).
    .Parameter Files
        The file wildcard to search, e.g. '*.log'
    .Outputs
        An object with an array of unique values and a count of total matched lines.
    #>
    param(
        [Parameter(Mandatory = $true)] [string] $target,
        [Parameter(Mandatory = $false)] [string] $files
    )
    begin {
        $stime = Get-Date
    }
    process {
        $results = gci -Filter $files | ProcessFile -Pattern $target -Group 1
    }
    end {
        $etime = Get-Date
        $ptime = $etime - $stime
        Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f
            (gci -Filter $files).Count, $ptime.Hours, $ptime.Minutes, $ptime.Seconds)
        return $results
    }
}
The output:
clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.
Thanks to all for comments and help.
Here's a function that will hopefully speed up and reduce the memory impact of the file processing part. It will return an object with 2 properties: The total count of lines matched, and a sorted array of unique strings from the match group specified. (From your description it sounds like you don't really care about the count per string, just the string values themselves)
function ProcessFile
{
    param(
        [Parameter(ValueFromPipeline = $true, Mandatory = $true)]
        [System.IO.FileInfo] $File,

        [Parameter(Mandatory = $true)]
        [string] $Pattern,

        [Parameter(Mandatory = $true)]
        [int] $Group
    )
    begin
    {
        $regex = new-object Regex @($pattern, 'Compiled')
        $set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
        $totalCount = 0
    }
    process
    {
        try
        {
            $reader = new-object IO.StreamReader $_.FullName
            while( ($line = $reader.ReadLine()) -ne $null)
            {
                $m = $regex.Match($line)
                if($m.Success)
                {
                    $set[$m.Groups[$group].Value] = 1
                    $totalCount++
                }
            }
        }
        finally
        {
            $reader.Close()
        }
    }
    end
    {
        new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
    }
}
You can use it like this:
$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt
IMO @latkin's approach is the way to go if you want to do this within PowerShell and not use some dedicated tool. I made a few changes, though, to make the command play better with respect to accepting pipeline input. I also modified the regex to search for all matches on a particular line. Neither approach searches across multiple lines, although that scenario would be pretty easy to handle as long as the pattern only ever spanned a few lines. Here's my take on the command (put it in a file called Search-File.ps1):
[CmdletBinding(DefaultParameterSetName="Path")]
param(
    [Parameter(Mandatory=$true, Position=0)]
    [ValidateNotNullOrEmpty()]
    [string]
    $Pattern,

    [Parameter(Mandatory=$true, Position=1, ParameterSetName="Path",
               ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $Path,

    [Alias("PSPath")]
    [Parameter(Mandatory=$true, Position=1, ParameterSetName="LiteralPath",
               ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $LiteralPath,

    [Parameter()]
    [ValidateRange(0, [int]::MaxValue)]
    [int]
    $Group = 0
)

Begin
{
    Set-StrictMode -Version latest
    $count = 0
    $matched = @{}
    $regex = New-Object System.Text.RegularExpressions.Regex $Pattern,'Compiled'
}

Process
{
    if ($psCmdlet.ParameterSetName -eq "Path")
    {
        # In the -Path (non-literal) case we may need to resolve a wildcarded path
        $resolvedPaths = @($Path | Resolve-Path | Convert-Path)
    }
    else
    {
        # Must be -LiteralPath
        $resolvedPaths = @($LiteralPath | Convert-Path)
    }
    foreach ($rpath in $resolvedPaths)
    {
        Write-Verbose "Processing $rpath"
        $stream = new-object System.IO.FileStream $rpath,'Open','Read','Read',4096
        $reader = new-object System.IO.StreamReader $stream
        try
        {
            while (($line = $reader.ReadLine()) -ne $null)
            {
                $matchColl = $regex.Matches($line)
                foreach ($match in $matchColl)
                {
                    $count++
                    $key = $match.Groups[$Group].Value
                    if ($matched.ContainsKey($key))
                    {
                        $matched[$key]++
                    }
                    else
                    {
                        $matched[$key] = 1
                    }
                }
            }
        }
        finally
        {
            $reader.Close()
        }
    }
}

End
{
    new-object psobject -Property @{TotalCount = $count; Matched = $matched}
}
I ran this against my IIS log dir (8.5 GB and ~1000 files) to find all the IP addresses in all the logs e.g.:
$r = ls . -r *.log | C:\Users\hillr\Search-File.ps1 '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
This took 27 minutes on my system and found 54356330 matches:
$r.Matched.GetEnumerator() | sort Value -Descending | select -f 20
Name Value
---- -----
xxx.140.113.47 22459654
xxx.29.24.217 13430575
xxx.29.24.216 13321196
xxx.140.113.98 4701131
xxx.40.30.254 53724
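(If you only want the sorted unique values, as in the original function, the keys of the Matched hashtable are enough: something like $r.Matched.Keys | sort | Set-Content results.txt.)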

How does Select-Object stop the pipeline in PowerShell v3?

In PowerShell v2, the following line:
1..3 | foreach { Write-Host "Value : $_"; $_ } | select -First 1
Would display:
Value : 1
1
Value : 2
Value : 3
Since all elements were pushed down the pipeline. However, in v3 the above line displays only:
Value : 1
1
The pipeline is stopped before 2 and 3 are sent to Foreach-Object (Note: the -Wait switch for Select-Object allows all elements to reach the foreach block).
How does Select-Object stop the pipeline, and can I now stop the pipeline from a foreach or from my own function?
Edit: I know I can wrap a pipeline in a do...while loop and continue out of the pipeline. I have also found that in v3 I can do something like this (it doesn't work in v2):
function Start-Enumerate ($array) {
    do { $array } while($false)
}
Start-Enumerate (1..3) | foreach { if($_ -ge 2){ break }; $_ }; 'V2 Will Not Get Here'
But Select-Object doesn't require either of these techniques so I was hoping that there was a way to stop the pipeline from a single point in the pipeline.
Check this post on how you can cancel a pipeline:
http://powershell.com/cs/blogs/tobias/archive/2010/01/01/cancelling-a-pipeline.aspx
In PowerShell 3.0 it's an engine improvement. From the CTP1 samples folder ('\Engines Demos\Misc\ConnectBugFixes.ps1'):
# Connect Bug 332685
# Select-Object optimization
# Submitted by Shay Levi
# Connect Suggestion 286219
# PSV2: Lazy pipeline - ability for cmdlets to say "NO MORE"
# Submitted by Karl Prosser
# Stop the pipeline once the objects have been selected
# Useful for commands that return a lot of objects, like dealing with the event log
# In PS 2.0, this took a long time even though we only wanted the first 10 events
Start-Process powershell.exe -Args '-Version 2 -NoExit -Command Get-WinEvent | Select-Object -First 10'
# In PS 3.0, the pipeline stops after retrieving the first 10 objects
Get-WinEvent | Select-Object -First 10
After trying several methods, including throwing StopUpstreamCommandsException, ActionPreferenceStopException, and PipelineClosedException, calling $PSCmdlet.ThrowTerminatingError, and calling $ExecutionContext.Host.Runspace.GetCurrentlyRunningPipeline().stopper.set_IsStopping($true), I finally found that just utilizing Select-Object was the only thing that didn't abort the whole script (versus just the pipeline). [Note that some of the items mentioned above require access to private members, which I accessed via reflection.]
# This looks like it should put a zero in the pipeline, but on PS 3.0 it doesn't
function stop-pipeline {
    $sp = { select-object -f 1 }.GetSteppablePipeline($MyInvocation.CommandOrigin)
    $sp.Begin($true)
    $x = $sp.Process(0) # this call doesn't return
    $sp.End()
}
A new method follows, based on a comment from the OP. Unfortunately this method is a lot more complicated and uses private members. Also, I don't know how robust it is; I just got the OP's example to work and stopped there. So, FWIW:
# wh is an alias for write-host
# sel is an alias for select-object
# The following two use reflection to access private members:
#   invoke-method invokes private methods
#   select-properties is similar to select-object, but it gets private properties

# Get the System.Management.Automation assembly
$smaa = [appdomain]::currentdomain.getassemblies() |
    ? location -like "*system.management.automation*"
# Get the StopUpstreamCommandsException class
$upcet = $smaa.gettypes() | ? name -like "*upstream*"

filter x {
    [CmdletBinding()]
    param(
        [parameter(ValueFromPipeline=$true)]
        [object] $inputObject
    )
    process {
        if ($inputObject -ge 5) {
            # Create a StopUpstreamCommandsException
            $upce = [activator]::CreateInstance($upcet, @($pscmdlet))
            $PipelineProcessor = $pscmdlet.CommandRuntime | select-properties PipelineProcessor
            $commands = $PipelineProcessor | select-properties commands
            $commandProcessor = $commands[0]
            $null = $upce.RequestingCommandProcessor | select-properties *
            $upce.RequestingCommandProcessor.commandinfo =
                $commandProcessor | select-properties commandinfo
            $upce.RequestingCommandProcessor.Commandruntime =
                $commandProcessor | select-properties commandruntime
            $null = $PipelineProcessor |
                invoke-method recordfailure @($upce, $commandProcessor.command)
            1..($commands.count-1) | % {
                $commands[$_] | invoke-method DoComplete
            }
            wh throwing
            throw $upce
        }
        wh "< $inputObject >"
        $inputObject
    } # end process
    end {
        wh in x end
    }
} # end filter x
filter y {
    [CmdletBinding()]
    param(
        [parameter(ValueFromPipeline=$true)]
        [object] $inputObject
    )
    process {
        $inputObject
    }
    end {
        wh in y end
    }
}

1..5 | x | y | measure -Sum
PowerShell code to retrieve PipelineProcessor value through reflection:
$t_cmdRun = $pscmdlet.CommandRuntime.gettype()
# Get pipelineprocessor value ($pipor)
$bindFlags = [Reflection.BindingFlags]"NonPublic,Instance"
$piporProp = $t_cmdRun.getproperty("PipelineProcessor", $bindFlags )
$pipor=$piporProp.GetValue($PSCmdlet.CommandRuntime,$null)
PowerShell code to invoke a method through reflection:
$proc = (gps)[12] # semi-random process
$methinfo = $proc.gettype().getmethod("GetComIUnknown", $bindFlags)
# Return ComIUnknown as an IntPtr
$comIUnknown = $methinfo.Invoke($proc, @($true))
I know that throwing a PipelineStoppedException stops the pipeline. The following example simulates, in v2.0, what you see with Select -First 1 in v3.0:
filter Select-Improved($first) {
    begin {
        $count = 0
    }
    process {
        $_
        $count++
        if($count -ge $first){ throw (new-object System.Management.Automation.PipelineStoppedException) }
    }
}
trap{continue}
1..3 | foreach { Write-Host "Value : $_"; $_ } | Select-Improved -first 1
write-host "after"
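With the trap in place, this should print only:
Value : 1
1
after
since the thrown exception stops the pipeline before Foreach-Object ever sees 2 and 3.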