No garbage collection while PowerShell pipeline is executing

UPDATE: The following bug seems to be resolved with PowerShell 5. The bug remains in 3 and 4. So don't process any huge files with the pipeline unless you're running PowerShell 2 or 5.
Consider the following code snippet:
function Get-DummyData() {
    for ($i = 0; $i -lt 10000000; $i++) {
        "This is freaking huge!! I'm a ninja! More words, yay!"
    }
}
Get-DummyData | Out-Null
This will cause PowerShell memory usage to grow uncontrollably. After executing Get-DummyData | Out-Null a few times, I have seen PowerShell memory usage get all the way up to 4 GB.
According to ANTS Memory Profiler, we have a whole lot of things sitting around in the garbage collector's finalization queue. When I call [GC]::Collect(), the memory goes from 4 GB to a mere 70 MB. So we don't have a memory leak, strictly speaking.
Now, it's not good enough for me to be able to call [GC]::Collect() when I'm finished with a long-lived pipeline operation. I need garbage collection to happen during a pipeline operation. However, if I try to invoke [GC]::Collect() while the pipeline is executing...
function Get-DummyData() {
    for ($i = 0; $i -lt 10000000; $i++) {
        "This is freaking huge!! I'm a ninja! More words, yay!"
        if ($i % 1000000 -eq 0) {
            Write-Host "Prompting a garbage collection..."
            [GC]::Collect()
        }
    }
}
Get-DummyData | Out-Null
... the problem remains. Memory usage grows uncontrollably again. I have tried several variations of this, such as adding [GC]::WaitForPendingFinalizers(), Start-Sleep -Seconds 10, etc. I have tried changing garbage collector latency modes and forcing PowerShell to use server garbage collection to no avail. I just can't get the garbage collector to do its thing while the pipeline is executing.
This isn't a problem at all in PowerShell 2.0. It's also interesting to note that $null = Get-DummyData also seems to work without memory issues. So it seems tied to the pipeline, rather than the fact that we're generating tons of strings.
How can I prevent my memory from growing uncontrollably during long pipelines?
Side note:
My Get-DummyData function is only for demonstration purposes. My real-world problem is that I'm unable to read through large files in PowerShell using Get-Content or Import-Csv. No, I'm not storing the contents of these files in variables. I'm strictly using the pipeline like I'm supposed to. Get-Content .\super-huge-file.txt | Out-Null produces the same problem.

A couple of things to point out here. First, GC calls do work in the pipeline. Here's a pipeline script that only invokes the GC:
1..10 | Foreach {[System.GC]::Collect()}
Here's the perfmon graph of GCs during the time the script ran:
However, just because you invoke the GC it doesn't mean the private memory usage will return to the value you had before your script started. A GC collect will only collect memory that is no longer used. If there is a rooted reference to an object, it is not eligible to be collected (freed). So while GC systems typically don't leak in the C/C++ sense, they can have memory hoards that hold onto objects longer than perhaps they should.
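If you want a quick check of how much managed memory actually remains reachable after a forced collection (as opposed to the process's private bytes, which can stay higher), a minimal sketch is:
[GC]::Collect()
[GC]::WaitForPendingFinalizers()
# GetTotalMemory($true) forces another collection and reports the bytes still reachable.
"{0:N0} bytes of managed memory still in use" -f [GC]::GetTotalMemory($true)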
In looking at this with a memory profiler, it seems the bulk of the excess memory is taken up by copies of the strings with parameter-binding info:
The root for these strings looks like this:
I wonder if there is some logging feature that is causing PowerShell to hang onto a string-ized form of pipeline-bound objects?
BTW, in this specific case it is much more memory efficient to assign the output to $null in order to ignore it:
$null = Get-DummyData
Also, if you need to simply edit a file, check out the Edit-File command in the PowerShell Community Extensions 3.2.0. It should be memory efficient as long as you don't use the SingleString switch parameter.

It's not at all uncommon to find that the native cmdlets don't quite cut it when you're doing something unusual like processing a massive text file. Personally, I've found that working with large files in PowerShell is much better when you script it with System.IO.StreamReader:
$SR = New-Object -TypeName System.IO.StreamReader -ArgumentList 'C:\super-huge-file.txt';
# Compare against $null explicitly so a blank line doesn't end the loop early.
while ($null -ne ($line = $SR.ReadLine())) {
    Do-Stuff $line;
}
$SR.Close() | Out-Null;
Note that you should use the absolute path in the ArgumentList. For me it always seems to assume you're in your home directory with relative paths.
Get-Content is simply meant to read the entire file into memory as an array and then output it. I think it just calls System.IO.File.ReadAllLines().
I don't know of any way to tell PowerShell to discard items from the pipeline immediately upon completion, or to declare that a function may return items asynchronously; instead it preserves order. It may not allow this because it has no natural way to tell that an object isn't going to be used later on, or that later objects won't need to refer to earlier objects.
The other nice thing about Powershell is that you can often adopt the C# answers, too. I've never tried File.ReadLines, but that looks like it might be pretty easy to use, too.
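For reference, a minimal sketch of calling File.ReadLines from PowerShell might look like this; it streams lines lazily, much like the StreamReader loop above (Do-Stuff is the same placeholder function used earlier):
# Streams one line at a time; requires .NET 4+ (available by default in PowerShell 3+).
foreach ($line in [System.IO.File]::ReadLines('C:\super-huge-file.txt')) {
    Do-Stuff $line
}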

Related

How can I run a batch file about 1000 times and then get the average execution time as output? Is this possible?

I tried the following PowerShell command, but then 1000 windows opened and the PowerShell ISE crashed. Is there a way to run the batch file 1000 times in the background? And is there a smarter way to get the average execution time?
That's the code I've tried:
cd C:\scripts
Measure-Command {
    for ($i = 0; $i -lt 1000; $i++) {
        Start-Process -FilePath "C:\scripts\open.bat"
    }
}
Start-Process by default runs programs asynchronously, in a new console window.
Since you want to run your batch file synchronously, in the same console window, invoke it directly (which, since the path is double-quoted - though it doesn't strictly have to be in this case - requires &, the call operator for syntactic reasons):
Measure-Command {
  foreach ($i in 1..1000) {
    & "C:\scripts\open.bat"
  }
}
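Since the goal is the average execution time, you can divide the total elapsed time by the number of runs; a minimal sketch under the same assumptions (1000 runs of the same open.bat path):
$runs = 1000
$elapsed = Measure-Command {
  foreach ($i in 1..$runs) {
    & "C:\scripts\open.bat"
  }
}
# Average duration of a single run, in milliseconds.
'{0:N2} ms per run on average' -f ($elapsed.TotalMilliseconds / $runs)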
Note: Measure-Command discards the success output from the script block being run; if you do want to see it in the console, use the following variation, though note that it will slow down processing:
Measure-Command {
  & {
    foreach ($i in 1..1000) {
      & "C:\scripts\open.bat"
    }
  } | Out-Host
}
This answer explains in more detail why Start-Process is typically the wrong tool for invoking console-based programs and scripts.
Measure-Command is the right tool for performance measurement in PowerShell, but it's important to note that such measurements are far from an exact science, given PowerShell's dynamic nature, which involves many caches and on-demand compilation behind the scenes.
Averaging multiple runs generally makes sense, especially when calling external programs; by contrast, if PowerShell code is executed repeatedly and the repeat count exceeds 16, on-demand compilation occurs and speeds up subsequent executions, which can skew the result.
Time-Command is a friendly wrapper around Measure-Command, available from this MIT-licensed Gist[1]; it can be used to simplify your tests.
# Download and define function `Time-Command` on demand (will prompt).
# To be safe, inspect the source code at the specified URL first.
if (-not (Get-Command -ea Ignore Time-Command)) {
  $gistUrl = 'https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1'
  if ((Read-Host "`n====`n OK to download and define benchmark function ``Time-Command`` from Gist ${gistUrl}?`n=====`n(y/n)?").Trim() -notin 'y', 'yes') { Write-Warning 'Aborted.'; exit 2 }
  Invoke-RestMethod $gistUrl | Invoke-Expression
  if (-not ${function:Time-Command}) { exit 2 }
}
Write-Verbose -Verbose 'Running benchmark...'
# Omit -OutputToHost to run the commands quietly.
Time-Command -Count 1000 -OutputToHost { & "C:\scripts\open.bat" }
Note that while Time-Command is a convenient wrapper even for measuring a single command's performance, it also allows you to compare the performance of multiple commands, passed as separate script blocks ({ ... }).
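For example, a hypothetical comparison of two invocation styles, assuming Time-Command accepts an array of script blocks positionally (check the Gist's comment-based help for the exact syntax):
# Both blocks are run -Count times and their average timings are reported side by side.
Time-Command -Count 100 { & "C:\scripts\open.bat" }, { cmd /c "C:\scripts\open.bat" }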
[1] Assuming you have looked at the linked Gist's source code to ensure that it is safe (which I can personally assure you of, but you should always check), you can install it directly as follows:
irm https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1 | iex

Pipeline semantics aren't propagated into Where-Object

I use the following command to run a pipeline.
.\Find-CalRatioSamples.ps1 data16 `
| ? {-Not (Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -JobSourceDatasetName $_ -Exists -Location UWTeV-linux)}
The first is a custom script of mine, and runs very fast (milliseconds). The second is a custom command, also written by me (see https://github.com/LHCAtlas/AtlasSSH/blob/master/PSAtlasDatasetCommands/TestGRIDDataset.cs). It is very slow.
Actually, it isn't so slow processing each line of input. The setup before the first line of input can be processed is very expensive. That done, however, it goes quite quickly. So all the expensive code gets executed once, and only the fairly fast code needs to be executed for each new pipeline input.
Unfortunately, when I use the ? { } construct above, it seems like PowerShell doesn't keep the pipeline going as it did before. It now calls my command afresh for each line of input, causing the command to redo all the setup for every line.
Is there something I can change in how I invoke the pipeline? Or in how I've coded up my cmdlet, to prevent this from happening? Or am I stuck because this is just the way Where-Object works?
It is working as designed. You're starting a new (nested) pipeline inside the scriptblock when you call your command.
If your function is doing the expensive code in its Begin block, then you need to directly pipe the first script into your function to get that advantage.
.\Find-CalRatioSamples.ps1 data16 |
Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -Exists -Location UWTeV-linux |
Where-Object { $_ }
But then it seems that you are not returning the objects you want (the original).
One way you might be able to change Test-GRIDDataset is to implement a -PassThru switch, though you aren't actually accepting the full objects from your original script, so I'm unable to tell if this is feasible; but the code you wrote seems to be retrieving... stuff(?) from somewhere based on the name. Perhaps that would be sufficient? When -PassThru is specified, send the objects through the pipeline if they exist (rather than just a boolean of whether or not they do).
Then your code would look like this:
.\Find-CalRatioSamples.ps1 data16 |
Test-GRIDDataset -JobName DiVertAnalysis -JobVersion 13 -Exists -Location UWTeV-linux -PassThru
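For illustration, here is a rough PowerShell-function sketch of the Begin/Process split and the hypothetical -PassThru behavior described above (the real Test-GRIDDataset is a compiled C# cmdlet; the function name and the existence check are placeholders):
function Test-GRIDDatasetSketch {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline = $true)] [string] $JobSourceDatasetName,
        [switch] $PassThru
    )
    begin {
        # Expensive one-time setup runs here, once per pipeline invocation.
        Write-Verbose 'Setting up connection (slow)...'
    }
    process {
        # Cheap per-item work runs here for every pipeline input.
        $exists = $true   # placeholder for the real lookup
        if ($PassThru) {
            if ($exists) { $JobSourceDatasetName }   # emit the original input object
        } else {
            $exists
        }
    }
}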

Copy-Item with timeout

I am recovering files from a hard drive wherein some number of the files are unreadable. I'm unable to change the hardware level timeout / ERC, and it's extremely difficult to work around when I have several hundred thousand files, any tens of thousands of which might be unreadable.
The data issues were the result of a controller failure. By buying a matching drive (all the way down), I've been able to access the drive, and I can copy huge swaths of it without issues. However, there are unreadable files dotted throughout the drive that, when accessed, will cause the SATA bus to hang.
I've used various resumable file copy applications like robocopy, RichCopy, and a dozen others, but they all have the same issue: their RETRY count is based on actually getting an error reported from the drive. The problem is that the drive takes an extremely long time to report the error, which means a single file may take up to an hour to fail officially.
I know how fast each file SHOULD be, so I'd like to build a PowerShell cmdlet or similar that will allow me to pass in a source and destination file name, and have it try to copy the file. If, after 5 seconds, the file hasn't copied (or even if it has - this can be a dumb process), I'd like it to quit. I'll write a script that fires off each copy process individually, waiting for the previous one to finish, but so far I've been unable to find a good way of putting a time limit on the process.
Any suggestions you might have would be greatly appreciated!
Edit: I would be happy with spawning a Copy-Item in a new thread, with a new PID, then counting down, then killing that PID. I'm just a novice at PowerShell, and have seen so many conflicting methods for imposing timers that I'm lost on what the best-practice approach would be.
Edit 2: Please note that applications like robocopy will utterly hang when encountering the bad regions of the disk. These are not simple hangs, but bus hangs that windows will try to preserve in order to not lose data. In these instances task manager is unable to kill the process, but Process Explorer IS. I'm not sure what the difference in methodology is, but regardless, it seems relevant.
I'd say the canonical way of doing things like this in PowerShell is background jobs.
$timeout = 300 # seconds
$job = Start-Job -ScriptBlock { Copy-Item ... }
Wait-Job -Job $job -Timeout $timeout
Stop-Job -Job $job
Receive-Job -Job $job
Remove-Job -Job $job
Replace Copy-Item inside the scriptblock with whatever command you want to run. Beware, though, that all variables you want to use inside the scriptblock must either be defined inside the scriptblock, passed in via the -ArgumentList parameter, or prefixed with the $using: scope modifier.
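A rough sketch of the whole pattern with a 5-second timeout (the file paths are placeholders; adjust them to your drive layout):
$src = 'D:\recovered\somefile.bin'   # hypothetical source path
$dst = 'E:\rescued\somefile.bin'     # hypothetical destination path
$timeout = 5                         # seconds

$job = Start-Job -ScriptBlock {
    param($from, $to)
    Copy-Item -LiteralPath $from -Destination $to
} -ArgumentList $src, $dst

Wait-Job -Job $job -Timeout $timeout | Out-Null
if ($job.State -eq 'Running') {
    Write-Warning "Copy of $src did not finish within $timeout seconds; abandoning it."
    Stop-Job -Job $job
}
Remove-Job -Job $job -Force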
An alternative to Wait-Job would be a loop that waits until the job is completed or the timeout is reached:
$timeout = (Get-Date).AddMinutes(5)
do {
    Start-Sleep -Milliseconds 100
} while ($job.State -eq 'Running' -and (Get-Date) -lt $timeout)

Powershell - "Clear-Item variable:" vs "Remove-Variable"

When storing text temporarily in PowerShell variables at runtime, what is the most efficient way of removing a variable's contents from memory when it is no longer needed?
I've used both Clear-Item variable: and Remove-Variable, but how quickly does something get removed from memory with the latter vs. nulling the contents with the former?
EDIT: I should have made it a little clearer why I am asking.
I am automating RDP login for a bunch of application VMs (application doesn't run as a service, outsourced developers, long story).
So, I am developing (largely finished) a script to group launch sessions to each of the VMs.
The idea is that the script function that stores credentials uses Read-Host to prompt for the hostname, then Get-Credential to pick up domain/user/password.
The password is then converted from a secure string using a 256-bit key (a runtime key unique to the machine/user that stored the credentials and runs the group launch).
The VM's name, domain, user and encrypted password are stored in a file. When launching a session, the details are read in, the password is decrypted, the details are passed to cmdkey.exe to store a /generic:TERMSRV credential for that VM, the plaintext password variable is cleared, mstsc is launched to that host, and a few seconds later the credential is removed from the Windows credential store.
(If I passed the password to cmdkey.exe as anything other than plaintext, the RDP session would receive either incorrect or no credentials.)
So, hence the question: I need the password in plaintext to exist in memory for as short a time as possible.
To keep the security guys happy, the script itself is AES-256 encrypted, and a C# wrapper with its own PowerShell host reads, decrypts and runs the script, so there is no plaintext source on the machine that runs this. (The encrypted source is on a file share, so effectively I have a kill switch: I can simply replace the encrypted script with another that displays a message saying the app has been disabled.)
The only way I have been able to clear variable data/contents with certainty is to remove all variables in the current session using:
Remove-Variable -Name * -ErrorAction SilentlyContinue
This removes all variables immediately. In fact, I add this to the end of some of my scripts so that I can be sure that running another script that happens to use the same variable names won't pick up leftover data and cause undesired results.
DRAWBACK: If you only need one variable cleared, which was in my case a few minutes ago, then you need to re-instantiate input variables required by your script.
The most efficient way is to let garbage collection do its job. Remember, PowerShell is all .NET, with its famous memory management. Always control your scope and make sure variables go out of scope as soon as they are no longer needed. For example, if a temporary variable is only needed inside a function or script block, it becomes unreachable once that scope ends, so there is no need to worry about it.
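A minimal sketch of keeping the sensitive value confined to a tight scope (the function, variable name and value are hypothetical):
function Use-PlainTextSecret {
    # $plain exists only inside this function's scope.
    $plain = 'hunter2'   # hypothetical plaintext value
    # ... use $plain here, e.g. pass it to cmdkey.exe ...
}
Use-PlainTextSecret
# Back here, $plain is not defined; the string object becomes eligible for
# collection once no other references to it remain.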
EDIT: Regarding your update, why not just do $yourPasswordVariable = $null? I think it would be much easier to understand. And it should be the fastest way to do it. Because Remove-Item and Clear-Item are kind of all-in-one handlers, they need to process some stuff first, before determining you really wanted to erase a variable.
You can use a stopwatch to get the execution time of the cmdlets. I think there is not really a time difference between these two cmdlets. I normally use Remove-Variable because, in my eyes, it's better to remove the variable completely.
$a = "TestA"
$b = "TestB"
$c = "TestC"
$d = "TestD"
$time = New-Object system.Diagnostics.Stopwatch
Start-Sleep 1
$time.Start()
$time.Stop()
$system = $time.Elapsed.TotalMilliseconds
Write-Host "Stopwatch StartStop" $system
$time.Reset()
Start-Sleep 1
$time.Start()
Clear-Item Variable:a
$time.Stop()
$aTime = $time.Elapsed.TotalMilliseconds - $system
Write-Host "Clear-Item in " $aTime
$time.Reset()
Start-Sleep 1
$time.Start()
Remove-Variable b
$time.Stop()
$bTime = $time.Elapsed.TotalMilliseconds - $system
Write-Host "Remove-Variable in " $bTime
$time.Reset()
Start-Sleep 1
$time.Start()
Clear-Item Variable:c
$time.Stop()
$cTime = $time.Elapsed.TotalMilliseconds - $system
Write-Host "Clear-Item in " $cTime
$time.Reset()
Start-Sleep 1
$time.Start()
Remove-Variable d
$time.Stop()
$dTime = $time.Elapsed.TotalMilliseconds - $system
Write-Host "Remove-Variable in " $dTime
$time.Reset()
Both efficiently remove "a" reference to a .NET object. Now if that reference is the last reference to the object then the GC will determine when the memory for said object is collected. However, if you no longer need the variable then use Remove-Variable to also allow the memory associated with the System.Management.Automation.PSVariable object to be eventually collected as well.
To measure the time it takes to run script blocks and cmdlets, use Measure-Command
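For example, a quick (and admittedly rough, given timer resolution and caching) comparison:
# Each block assigns its own variable so the removal happens in the same scope;
# the assignment is included in the timing, so treat this as a rough comparison only.
(Measure-Command { $v1 = 'TestA'; Clear-Item Variable:v1 }).TotalMilliseconds
(Measure-Command { $v2 = 'TestB'; Remove-Variable v2 }).TotalMilliseconds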

How to iterate over a folder with a large number of files in PowerShell?

I'm trying to write a script that would go through 1.6 million files in a folder and move them to the correct folder based on the file name.
The reason is that NTFS can't handle a large number of files within a single folder without a degrade in performance.
The script calls Get-ChildItem to get all the items within that folder, and as you might expect, this consumes a lot of memory (about 3.8 GB).
I'm curious if there are any other ways to iterate through all the files in a directory without using up so much memory.
If you do
$files = Get-ChildItem $dirWithMillionsOfFiles
#Now, process with $files
you WILL face memory issues.
Use PowerShell piping to process the files:
Get-ChildItem $dirWithMillionsOfFiles | % {
    # process here
}
The second way will consume less memory and should ideally not grow beyond a certain point.
If you need to reduce the memory footprint, you can skip using Get-ChildItem and instead use a .NET API directly. I'm assuming you are on Powershell v2, if so first follow the steps here to enable .NET 4 to load in Powershell v2.
In .NET 4 there are some nice APIs for enumerating files and directories, as opposed to returning them in arrays.
[IO.Directory]::EnumerateFiles("C:\logs") | % { <# move file $_ here #> }
By using this API, instead of [IO.Directory]::GetFiles(), only one file name will be processed at a time, so the memory consumption should be relatively small.
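A slightly fuller sketch of the move-by-name idea on top of EnumerateFiles (the paths and the two-character bucketing rule are assumptions; adapt them to your own naming scheme):
$source   = 'C:\logs'      # hypothetical source folder
$destRoot = 'D:\sorted'    # hypothetical destination root

[IO.Directory]::EnumerateFiles($source) | % {
    $name   = [IO.Path]::GetFileName($_)
    # Bucket by the first two characters of the file name (or fewer for very short names).
    $bucket = Join-Path $destRoot $name.Substring(0, [Math]::Min(2, $name.Length))
    if (-not (Test-Path $bucket)) { New-Item -ItemType Directory -Path $bucket | Out-Null }
    Move-Item -LiteralPath $_ -Destination $bucket
}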
Edit
I was also assuming you had tried a simple pipelined approach like Get-ChildItem |ForEach { process }. If this is enough, I agree it's the way to go.
But I want to clear up a common misconception: In v2, Get-ChildItem (or really, the FileSystem provider) does not truly stream. The implementation uses the APIs Directory.GetDirectories and Directory.GetFiles, which in your case will generate a 1.6M-element array before any processing can occur. Once this is done, then yes, the remainder of the pipeline is streaming. And yes, this initial low-level piece has relatively minimal impact, since it is simply a string array, not an array of rich FileInfo objects. But it is incorrect to claim that O(1) memory is used in this pattern.
Powershell v3, in contrast, is built on .NET 4, and thus takes advantage of the streaming APIs I mention above (Directory.EnumerateDirectories and Directory.EnumerateFiles). This is a nice change, and helps in scenarios just like yours.
This is how I implemented it without using .NET 4.0, using only PowerShell 2.0 and the old-fashioned DIR command:
It's just 2 lines of (easy) code:
cd <source_path>
cmd /c "dir /B"| % { move-item $($_) -destination "<dest_folder>" }
My PowerShell process only uses 15 MB. No changes needed on the old Windows 2008 server!
Cheers!