Copy-Item with timeout - powershell

I am recovering files from a hard drive wherein some number of the files are unreadable. I'm unable to change the hardware level timeout / ERC, and it's extremely difficult to work around when I have several hundred thousand files, any tens of thousands of which might be unreadable.
The data issues were the result of a controller failure. After buying a matching drive (matching all the way down), I've been able to access the drive, and can copy huge swaths of it without issues. However, there are unreadable files dotted throughout the drive that, when accessed, will cause the SATA bus to hang. I've used various resumable file copy applications like robocopy, RichCopy, and a dozen others, but they all have the same issue. They have a RETRY count that is based on actually getting an error reported from the drive. The issue is that the drive is taking an extremely long time to report the error, and this means that a single file may take up to an hour to fail officially. I know how fast each file SHOULD be, so I'd like to build a PowerShell cmdlet or similar that will allow me to pass in a source and destination file name, and have it try to copy the file. If, after 5 seconds, the file hasn't copied (or if it has - this can be a dumb process), I'd like it to quit. I'll write a script that fires off each copy process individually, waiting for the process before it to finish, but I'm so far unable to find a good way of putting a time limit on the process.
Any suggestions you might have would be greatly appreciated!
Edit: I would be happy with spawning a Copy-Item in a new thread, with a new PID, then counting down, then killing that PID. I'm just a novice at PowerShell, and have seen so many conflicting methods for imposing timers that I'm lost on what the best practices way would be.
Edit 2: Please note that applications like robocopy will utterly hang when encountering the bad regions of the disk. These are not simple hangs, but bus hangs that windows will try to preserve in order to not lose data. In these instances task manager is unable to kill the process, but Process Explorer IS. I'm not sure what the difference in methodology is, but regardless, it seems relevant.

I'd say the canonical way of doing things like this in PowerShell is background jobs.
$timeout = 300 # seconds
$job = Start-Job -ScriptBlock { Copy-Item ... }
Wait-Job -Job $job -Timeout $timeout
Stop-Job -Job $job
Receive-Job -Job $job
Remove-Job -Job $job
Replace Copy-Item inside the scriptblock with whatever command you want to run. Beware, though, that all variables you want to use inside the scriptblock must be either defined inside the scriptblock, passed in via the -ArgumentList parameter, or prefixed with the $using: scope modifier.
An alternative to Wait-Job would be a loop that waits until the job is completed or the timeout is reached:
$timeout = (Get-Date).AddMinutes(5)
do {
Start-Sleep -Milliseconds 100
} while ($job.State -eq 'Running' -and (Get-Date) -lt $timeout)
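Applied to the 5-second limit from the question, the job pattern above could be wrapped in a function. This is a minimal sketch rather than a tested recovery tool: Copy-ItemWithTimeout and its parameter names are hypothetical, and it assumes absolute paths, because a background job does not necessarily start in the caller's current directory. Whether Stop-Job can end a job that is stuck on a hung bus (see Edit 2 in the question) is not something this sketch can promise.

```powershell
function Copy-ItemWithTimeout {
    param(
        [Parameter(Mandatory)] [string] $Source,       # absolute path assumed
        [Parameter(Mandatory)] [string] $Destination,  # absolute path assumed
        [int] $TimeoutSeconds = 5
    )

    # Run the copy in a background job; paths are passed via -ArgumentList.
    $job = Start-Job -ScriptBlock {
        param($src, $dst)
        Copy-Item -LiteralPath $src -Destination $dst -ErrorAction Stop
    } -ArgumentList $Source, $Destination

    # Wait-Job returns the job if it finished in time, otherwise nothing.
    $finished = Wait-Job -Job $job -Timeout $TimeoutSeconds
    $ok = [bool]$finished -and $job.State -eq 'Completed'

    if (-not $finished) {
        Stop-Job -Job $job   # timed out: ask the job to stop
    }
    Remove-Job -Job $job -Force

    return $ok
}
```

A driving script could call this once per file and log every source path for which it returns $false, then revisit those files later with a longer timeout.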

Related

add-content, stream not readable, root cause?

I have a script that is using Add-Content to log progress. The log files are written to a network share, and on some networks I have gotten stream not readable errors. Rare, but often enough to want to address the issue.
I found [this thread][1] that seems to offer an answer. And I initially implemented this loop
$isWritten = $false
do {
try {
Add-Content -Path $csv_file -Value $newline -ErrorAction Stop
$isWritten = $true
}
catch {
}
} until ( $isWritten )
but I added a 1 second wait between tries, and limited myself to 20 tries. I figured no network could be such crap that it would timeout for longer than that. But on one network I still have problems, so I bumped the count to 60, and STILL have failures to write to the log. I tried [System.IO.File]::AppendAllText($file, $line) and that seems to solve all the timeouts, at least in 20 some odd tries it hasn't failed, where before I would get 1 or two failures in 10 tries. But the formatting is off, likely I need to set the encoding.
But more importantly, I wonder what is actually the SOURCE of the issue in Add-Content, and why does [System.IO.File]::AppendAllText() not have the issue, and is this a sign of potentially other problems with the network, or with the machines at this one location? Or just a bug in PowerShell that I need to work around. FWIW, it's PS 5.1 on Windows 10 21H2.
Also, FWIW, the logs can get to a few hundred lines long, but I often see the error in the first 10 lines.
[1]: add-content produces stream not readable
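For the formatting problem with AppendAllText, two differences from Add-Content are likely to matter: AppendAllText does not append a line terminator, and its two-argument overload writes UTF-8 without a byte order mark. The sketch below combines the bounded retry loop described above with an explicit newline and encoding. Add-LogLine and its parameters are hypothetical names, and the default encoding chosen here is an assumption; pass whatever encoding the existing log files use. It does not explain the root cause of the Add-Content errors.

```powershell
function Add-LogLine {
    param(
        [Parameter(Mandatory)] [string] $Path,   # absolute path assumed
        [Parameter(Mandatory)] [string] $Line,
        [int] $MaxAttempts = 20,
        [int] $DelayMilliseconds = 1000,
        # Assumed default: UTF-8 without a byte order mark.
        [System.Text.Encoding] $Encoding = (New-Object System.Text.UTF8Encoding($false))
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            # AppendAllText adds no newline of its own, so add one here.
            [System.IO.File]::AppendAllText($Path, $Line + [Environment]::NewLine, $Encoding)
            return $true
        }
        catch {
            # Wait before the next attempt instead of retrying immediately.
            Start-Sleep -Milliseconds $DelayMilliseconds
        }
    }
    return $false   # every attempt failed
}
```

Returning $false instead of looping forever lets the caller decide whether a lost log line should stop the script.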

How to print PDF files in a sequence using Powershell?

I have a bunch of PDF files that I would like to print in sequence on a windows 7 computer using Powershell.
get-childItem "*.pdf" | sort lastWriteTime | foreach-object {start-process $_.Name -verb 'print'}
The printed files are sometimes out of order like 1) A.pdf, 2) C.pdf, 3) B.pdf 4) D.pdf.
Different trials printed out a different sequence of files, thus, I fear the error is related to the printing queue or the start-process command. My guess is that each printing process is fired without waiting for the previous printing process to be completed.
Is there a way to consistently print out PDF files in a sequence that I specify?
You are starting the processes in order, but by default Start-Process does not wait until the command completes before it starts the next one. Since the commands take different amounts of time to complete based on the .PDF file size they print in whatever order they finish in. Try adding the -wait switch to your Start-Process, which will force it to wait until the command completes before starting the next one.
EDIT: Found an article elsewhere on Stack which addresses this. Maybe it will help. https://superuser.com/questions/1277881/batch-printing-pdfs
Additionally, there are a number of PDF solutions out there which are not Adobe, and some of them are much better for automation than the standard Reader. Adobe has licensed .DLL files you can use, and the professional version of Acrobat has hooks into the back end .DLLs as well.
If you must use Acrobat Reader DC (closed system or some such) then I would try opening the file to print and getting a pointer to the process, then waiting some length of time, and forcing the process closed. This will work well if your PDF sizes are known and you can estimate how long it takes to finish printing so you're not killing the process before it finishes. Something like this:
ForEach ($PDF in (gci "*.pdf"))
{
$proc = Start-Process $PDF.FullName -PassThru
Start-Sleep -Seconds $NumberOfSeconds
$proc | Stop-Process
}
EDIT #2: One possible (but untested) optimization is that you might be able to use the ProcessorTime counters $proc.PrivilegedProcessorTime and $proc.UserProcessorTime to see when the process goes idle. Of course, this assumes that the program goes completely idle after printing. I would try something like this:
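ForEach ($PDF in (gci "*.pdf"))
{
    $proc = Start-Process $PDF.FullName -PassThru
    # Reset the baselines for each new process
    $LastPrivTime = [TimeSpan]::Zero
    $LastUserTime = [TimeSpan]::Zero
    Do
    {
        Start-Sleep -Seconds 1
        # Process objects cache their values; Refresh() re-reads the counters
        $proc.Refresh()
        $PrivTimeElapsed = $proc.PrivilegedProcessorTime - $LastPrivTime
        $UserTimeElapsed = $proc.UserProcessorTime - $LastUserTime
        $LastPrivTime = $proc.PrivilegedProcessorTime
        $LastUserTime = $proc.UserProcessorTime
    }
    # Stop once no CPU time was consumed during the last interval
    Until ($PrivTimeElapsed -eq [TimeSpan]::Zero -and $UserTimeElapsed -eq [TimeSpan]::Zero)
    $proc | Stop-Process
}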
If the program still ends too soon, you might need to increase the # of seconds to sleep inside the inner Do loop.
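A middle ground between waiting indefinitely and sleeping for a fixed time is to wait for the process to exit, but only up to a limit. The sketch below uses the .NET Process.WaitForExit(milliseconds) method for that. Wait-ProcessOrKill and its parameters are hypothetical names, and the approach assumes that Start-Process -PassThru actually returns a process object, which may not be the case when the PDF viewer hands the job to an instance that is already running.

```powershell
function Wait-ProcessOrKill {
    param(
        [Parameter(Mandatory)] [System.Diagnostics.Process] $Process,
        [int] $TimeoutSeconds = 60
    )

    # WaitForExit(ms) returns $true if the process exited within the limit.
    if ($Process.WaitForExit($TimeoutSeconds * 1000)) {
        return $true
    }

    # Still running after the limit: force it closed.
    Stop-Process -InputObject $Process -Force
    return $false
}

# Possible use for the print job (Windows only, because of -Verb Print):
# Get-ChildItem "*.pdf" | Sort-Object LastWriteTime | ForEach-Object {
#     $proc = Start-Process -FilePath $_.FullName -Verb Print -PassThru
#     [void](Wait-ProcessOrKill -Process $proc -TimeoutSeconds 60)
# }
```

Because each file is handled before the next one starts, the files are submitted in the sorted order, and a viewer that never exits costs at most the timeout.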

No garbage collection while PowerShell pipeline is executing

UPDATE: The following bug seems to be resolved with PowerShell 5. The bug remains in 3 and 4. So don't process any huge files with the pipeline unless you're running PowerShell 2 or 5.
Consider the following code snippet:
function Get-DummyData() {
for ($i = 0; $i -lt 10000000; $i++) {
"This is freaking huge!! I'm a ninja! More words, yay!"
}
}
Get-DummyData | Out-Null
This will cause PowerShell memory usage to grow uncontrollably. After executing Get-DummyData | Out-Null a few times, I have seen PowerShell memory usage get all the way up to 4 GB.
According to ANTS Memory Profiler, we have a whole lot of things sitting around in the garbage collector's finalization queue. When I call [GC]::Collect(), the memory goes from 4 GB to a mere 70 MB. So we don't have a memory leak, strictly speaking.
Now, it's not good enough for me to be able to call [GC]::Collect() when I'm finished with a long-lived pipeline operation. I need garbage collection to happen during a pipeline operation. However if I try to invoke [GC]::Collect() while the pipeline is executing...
function Get-DummyData() {
for ($i = 0; $i -lt 10000000; $i++) {
"This is freaking huge!! I'm a ninja! More words, yay!"
if ($i % 1000000 -eq 0) {
Write-Host "Prompting a garbage collection..."
[GC]::Collect()
}
}
}
Get-DummyData | Out-Null
... the problem remains. Memory usage grows uncontrollably again. I have tried several variations of this, such as adding [GC]::WaitForPendingFinalizers(), Start-Sleep -Seconds 10, etc. I have tried changing garbage collector latency modes and forcing PowerShell to use server garbage collection to no avail. I just can't get the garbage collector to do its thing while the pipeline is executing.
This isn't a problem at all in PowerShell 2.0. It's also interesting to note that $null = Get-DummyData also seems to work without memory issues. So it seems tied to the pipeline, rather than the fact that we're generating tons of strings.
How can I prevent my memory from growing uncontrollably during long pipelines?
Side note:
My Get-DummyData function is only for demonstration purposes. My real-world problem is that I'm unable to read through large files in PowerShell using Get-Content or Import-Csv. No, I'm not storing the contents of these files in variables. I'm strictly using the pipeline like I'm supposed to. Get-Content .\super-huge-file.txt | Out-Null produces the same problem.
A couple of things to point out here. First, GC calls do work in the pipeline. Here's a pipeline script that only invokes the GC:
1..10 | Foreach {[System.GC]::Collect()}
Here's the perfmon graph of GCs during the time the script ran:
However, just because you invoke the GC it doesn't mean the private memory usage will return to the value you had before your script started. A GC collect will only collect memory that is no longer used. If there is a rooted reference to an object, it is not eligible to be collected (freed). So while GC systems typically don't leak in the C/C++ sense, they can have memory hoards that hold onto objects longer than perhaps they should.
In looking at this with a memory profiler it seems the bulk of the excess memory is taken up by a copy of the string with parameter binding info:
The root for these strings looks like this:
I wonder if there is some logging feature that is causing PowerShell to hang onto a stringized form of pipeline-bound objects?
BTW in this specific case, it is much more memory efficient to assign to $null to ignore the output:
$null = Get-DummyData
Also, if you need to simply edit a file, check out the Edit-File command in the PowerShell Community Extensions 3.2.0. It should be memory efficient as long as you don't use the SingleString switch parameter.
It's not at all uncommon to find that the native cmdlets don't satisfy perfectly when you're doing something unusual like processing a massive text file. Personally, I've found working with large files in Powershell is much better when you script it with System.IO.StreamReader:
$SR = New-Object -TypeName System.IO.StreamReader -ArgumentList 'C:\super-huge-file.txt';
# Compare against $null so that empty lines do not end the loop early
while ($null -ne ($line = $SR.ReadLine())) {
    Do-Stuff $line;
}
$SR.Close() | Out-Null;
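Note that you should use an absolute path in the ArgumentList. .NET methods resolve relative paths against the process's working directory ([Environment]::CurrentDirectory), which PowerShell does not update when you change location, so for me relative paths always seem to be resolved from my home directory.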
Get-Content is simply meant to read the entire object into memory as an array and then outputs it. I think it just calls System.IO.File.ReadAllLines().
I don't know of any way to tell Powershell to discard items from the pipeline immediately upon completion, or that a function may return items asynchronously, so instead it preserves order. It may not allow it because it has no natural way to tell that the object isn't going to be used later on, or that later objects won't need to refer to earlier objects.
The other nice thing about Powershell is that you can often adopt the C# answers, too. I've never tried File.ReadLines, but that looks like it might be pretty easy to use, too.
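As a sketch of the File.ReadLines idea: the method returns a lazy enumerable, so a foreach loop over it handles one line at a time instead of loading the whole file. Get-LineCount is a hypothetical helper that only counts lines; real per-line processing would go in the loop body. Convert-Path is used to turn a PowerShell-relative path into the absolute path that .NET needs.

```powershell
function Get-LineCount {
    param([Parameter(Mandatory)] [string] $Path)

    # .NET resolves relative paths differently from PowerShell,
    # so convert to an absolute file system path first.
    $fullPath = Convert-Path -LiteralPath $Path

    $count = 0
    # ReadLines yields lines lazily; empty lines are returned as ''.
    foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
        $count++   # replace with per-line processing
    }
    return $count
}
```

Using foreach as a statement, rather than piping into ForEach-Object, also keeps the lines out of the pipeline, which is where the question observed the memory growth.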

Powershell script - Increase progressbar in start-job scriptblock

I am pretty new to PowerShell (about 1 week in) and I am trying to create a tool for our helpdesk to import and export printers. The tool is running great, except that the form freezes while the code is being run.
To mitigate the freezing, I found that running it as a job gets the task done, however I am having 2 issues with it.
I am not able to get the progress bar to increase 1 step as a result of the job completing.
I am not able to pass variables to it. (I am not as worried about this as there is a ton of information on it, I just need to figure out the syntax for it. If you could help with that as well, that would be great though.)
start-job -scriptblock {
C:\Windows\system32\spool\tools\PrintBrm.exe -b -f \\filestore\$EXPORTPRINTERS.printerExport
$progressbarexportprinters.PerformStep()
$progressbarexportprinters.TextOverlay = "Printer Export Complete"
}
I found a solution for this. The form is still freezing, but I can show movement on the progress bar, which will be good enough.
C:\Windows\system32\spool\tools\PrintBrm.exe -r -f \\filestore\$EXPORTPRINTERS.printerExport | out-string -Stream | foreach-object {
$richTextBox1.lines = $richTextBox1.lines + $_
$richTextBox1.Select($richTextBox1.Text.Length, 0)
$richTextBox1.ScrollToCaret()
$progressbaraddprinters.PerformStep()
}
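On the two job-related issues from the question: a background job runs in a separate process, so form controls such as the progress bar cannot be updated from inside the scriptblock, and variables from the calling session have to be passed in explicitly. The following is a minimal sketch under those assumptions. $exportName is a hypothetical stand-in for $EXPORTPRINTERS, and a plain string stands in for the PrintBrm.exe call so that the example runs anywhere.

```powershell
$exportName = 'printers'   # hypothetical stand-in for $EXPORTPRINTERS

$job = Start-Job -ScriptBlock {
    # $using: copies the caller's variable into the job (PowerShell 3.0+).
    $name = $using:exportName
    # The real PrintBrm.exe call would go here; a string stands in for it.
    "exported \\filestore\$name.printerExport"
}

# Poll the job state instead of blocking on it. In a form this check
# would live in a Timer Tick handler rather than in a sleep loop.
while ($job.State -in 'NotStarted', 'Running') {
    Start-Sleep -Milliseconds 100
}

$result = Receive-Job -Job $job
Remove-Job -Job $job

# Back in the calling session the controls are reachable again, e.g.:
# $progressbarexportprinters.PerformStep()
```

The progress bar step therefore happens after the job has finished, in the session that owns the form, rather than inside the job.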

Get-Content -wait not working as described in the documentation

I've noticed that when Get-Content path/to/logfile -Wait, the output is actually not refreshed every second as the documentation explains it should. If I go in Windows Explorer to the folder where the log file is and Refresh the folder, then Get-Content would output the latest changes to the log file.
If I try tail -f with cygwin on the same log file (not at the same time as get-content), then it tails as one would expect, refreshing in real time without me having to do anything.
Does anyone have an idea why this happens?
Edit: Bernhard König reports in the comments that this has finally been fixed in Powershell 5.
You are quite right. The -Wait option on Get-Content waits until the file has been closed before it reads more content. It is possible to demonstrate this in Powershell, but can be tricky to get right as loops such as:
while (1){
get-date | add-content c:\testfiles\test1.txt
Start-Sleep -Milliseconds 500
}
will open and close the output file every time round the loop.
To demonstrate the issue open two Powershell windows (or two tabs in the ISE). In one enter this command:
PS C:\> 1..30 | % { "${_}: Write $(Get-Date -Format "hh:mm:ss")"; start-sleep 1 } >C:\temp\t.txt
That will run for 30 seconds writing 1 line into the file each second, but it doesn't close and open the file each time.
In the other window use Get-Content to read the file:
get-content c:\temp\t.txt -tail 1 -wait | % { "$_ read at $(Get-Date -Format "hh:mm:ss")" }
With the -Wait option you need to use Ctrl+C to stop the command. Running that command three times, waiting a few seconds after each of the first two runs and longer after the third, gave me this output:
PS C:\> get-content c:\temp\t.txt -tail 1 -wait | % { "$_ read at $(Get-Date -Format "hh:mm:ss")" }
8: Write 12:15:09 read at 12:15:09
PS C:\> get-content c:\temp\t.txt -tail 1 -wait | % { "$_ read at $(Get-Date -Format "hh:mm:ss")" }
13: Write 12:15:14 read at 12:15:15
PS C:\> get-content c:\temp\t.txt -tail 1 -wait | % { "$_ read at $(Get-Date -Format "hh:mm:ss")" }
19: Write 12:15:20 read at 12:15:20
20: Write 12:15:21 read at 12:15:32
21: Write 12:15:22 read at 12:15:32
22: Write 12:15:23 read at 12:15:32
23: Write 12:15:24 read at 12:15:32
24: Write 12:15:25 read at 12:15:32
25: Write 12:15:26 read at 12:15:32
26: Write 12:15:27 read at 12:15:32
27: Write 12:15:28 read at 12:15:32
28: Write 12:15:29 read at 12:15:32
29: Write 12:15:30 read at 12:15:32
30: Write 12:15:31 read at 12:15:32
From this I can clearly see:
Each time the command is run it gets the latest line written to the file. i.e. There is no problem with caching and no buffers needing flushed.
Only a single line is read and then no further output appears until the command running in the other window completes.
Once it does complete all of the pending lines appear together. This must have been triggered by the source program closing the file.
Also when I repeated the exercise with the Get-Content command running in two other windows one window read line 3 then just waited, the other window read line 6, so the line is definitely being written to the file.
It seems pretty conclusive that the -Wait option is waiting for a file close event, not waiting for the advertised 1 second. The documentation is wrong.
Edit:
I should add, as Adi Inbar seems to insist that I'm wrong, that the examples I gave here use Powershell only as that seemed most appropriate for a Powershell discussion. I did also verify using Python that the behaviour is exactly as I described:
Content written to a file is readable by a new Get-Content -Wait command immediately provided the application has flushed its buffer.
A Powershell instance using Get-Content -Wait will not display new content in the file that is being written even though another Powershell instance, started later, sees the later data. This proves conclusively that the data is accessible to Powershell and Get-Content -Wait is not polling at 1 second intervals but waiting for some trigger event before it next looks for data.
The size of the file as reported by dir is updating while lines are being added, so it is not a case of Powershell waiting for the directory entry size to be updated.
When the process writing the file closes it, the Get-Content -Wait displays the new content almost instantly. If it were waiting until the data was flushed to disk, there would be a delay until Windows flushed its disk cache.
@AdiInbar, I'm afraid you don't understand what Excel does when you save a file. Have a closer look. If you are editing test.xlsx then there is also a hidden file ~test.xlsx in the same folder. Use dir ~test.xlsx -hidden | select CreationTime to see when it was created. Save your file and now test.xlsx will have the creation time from ~test.xlsx. In other words saving in Excel saves to the ~ file then deletes the original, renames the ~ file to the original name and creates a new ~ file. There's a lot of opening and closing going on there.
Before you save, Excel has the file you are looking at open; after you save, a file with that name is still open, but it's a different file. I think Excel is too complex a scenario to say exactly what triggers Get-Content to show new content, but I'm sure you misinterpreted it.
It looks like Powershell is monitoring the file's Last Modified property. The problem is that "for performance reasons" the NTFS metadata containing this property is not automatically updated except under certain circumstances.
One circumstance is when the file handle is closed (hence @Duncan's observations). Another is when the file's information is queried directly, hence the Explorer refresh behaviour mentioned in the question.
You can observe the correlation by having Powershell monitoring a log with Get-Content -Wait and having Explorer open in the folder in details view with Last Modified column visible. Notice that Last Modified doesn't update automatically as the file is modified.
Now get the properties of the file in another window. E.g. at a command prompt, type the file. Or open another Explorer window in the same folder, and right-click the file and get its properties (for me, just right-clicking is enough). As soon as you do that, the first Explorer window will automatically update the Last Modified column and Powershell will notice the update and catch up with the log. In Powershell, touching the LastWriteTime property is enough:
(Get-Item file.log).LastWriteTime = (Get-Item file.log).LastWriteTime
or
(Get-Item file.log).LastWriteTime = Get-Date
So this is now working for me:
Start-Job {
$f=Get-Item full\path\to\log
while (1) {
$f.LastWriteTime = Get-Date
Start-Sleep -Seconds 10
}
}
Get-Content path\to\log -Wait
Can you tell us how to reproduce that?
I can start this script on one PS session:
get-content c:\testfiles\test1.txt -wait
and this in another session:
while (1){
get-date | add-content c:\testfiles\test1.txt
Start-Sleep -Milliseconds 500
}
And I see the new entries being written in the first session.
It appears that get-content -wait only works if the writer goes through the Windows API, and that different ways of appending to a file behave differently:
program.exe > output.txt
And then
get-content output.txt -wait
Will not update. But
program.exe | add-content output.txt
will work with
get-content output.txt -wait
So I guess it depends on how the application does output.
I can assure you that Get-Content -Wait does refresh every second, and shows you changes when the file changes on the disk. I'm not sure what tail -f is doing differently, but based on your description I'm just about certain that this issue is not with PowerShell but with write caching. I can't rule out the possibility that log4net is doing the caching, but I strongly suspect that OS-level caching is the culprit, for two reasons:
The documentation for log4j/log4net says that it flushes the buffer after every append operation by default, and I presume that if you had explicitly configured it not to flush after every append, you'd be aware of that.
I know for a fact that refreshing Windows Explorer triggers a write buffer flush if any files in the directory have changed. That's because it actually reads the file contents, not just the metadata, in order to provide extended information such as thumbnails and previews, and the read operation causes the write buffer to flush. So, if you're seeing the delayed updates every time you refresh the logfile's directory in Windows Explorer, that points strongly in this direction.
Try this: Open Device Manager, expand the Disk Drives node, open the Properties of the disk on which the logfile is stored, switch to the Policies tab, and uncheck Enable write caching on the device. I think you'll find that Get-Content -Wait will now show you the changes as they happen.
As for why tail -f is showing you the changes immediately as it is, I can only speculate. Maybe you're using it to monitor a logfile on a different drive, or perhaps Cygwin requests frequent flushes while you're running tail -f, to address this very issue.
UPDATE:
Duncan commented below that it is an issue with PowerShell, and posted an answer contending that Get-Content -Wait doesn't output new results until the file is closed, contrary to the documentation.
However, based on information already established and further testing, I've confirmed conclusively that it does not wait for the file to be closed, but outputs new data added to the file as soon as it's written to disk, and that the issue the OP is seeing is almost definitely due to write buffering.
To prove this, let the facts be submitted to a candid world:
I created an Excel spreadsheet, and ran Get-Content -Wait against the .xlsx file. When I entered new data into the spreadsheet, the Get-Content -Wait did not produce new output, which is expected while the new information is only in RAM and not on disk. However, whenever I saved the spreadsheet after adding data, new output was produced immediately.
Excel does not close the file when you save it. The file remains open until you close the Window from Excel, or exit Excel. You can verify this by trying to delete, rename, or otherwise modify the .xlsx file after you've saved it, while the window is still open in Excel.
The OP stated that he gets new output when he refreshes the folder in Windows Explorer. Refreshing the folder listing does not close the file. It does flush the write buffer if any of the files have changed. That's because it has to read the file's attributes, and this operation flushes the write buffer. I'll try to find some references for this, but as I noted above, I know for a fact that this is true.
I verified this behavior by running the following modified version of Duncan's test, which runs for 1,000 iterations instead of 30, and displays progress at the console so that you can track exactly how the output in your Get-Content -Wait window relates to the data that the pipeline has added to the file:
1..1000 | %{"${_}: Write $(Get-Date -Format "hh:mm:ss")"; Write-Host -NoNewline "$_..."; Start-Sleep 1} > .\gcwtest.txt
While this was running, I ran Get-Content -Wait .\gcwtest.txt in another window, and opened the directory in Windows Explorer. I found that if I refresh, more output is produced any time the file size in KB changes, and sometimes but not always even if nothing visible has changed. (More on the implications of that inconsistency later...)
Using the same test, I opened a third PowerShell window, and observed that all of the following trigger an immediate update in the Get-Content -Wait listing:
Listing the file's contents with plain old Get-Content .\gcwtest.txt
Reading any of the file's attributes. However, for attributes that don't change, only the first read triggers an update.
For example, (gi .\gcwtest.txt).lastwritetime triggers more output multiple times. On the other hand, (gi .\gcwtest.txt).mode or (gi .\gcwtest.txt).directory trigger more output the first time each, but not if you repeat them. Also note the following:
» This behavior is not 100% consistent. Sometimes, reading Mode or Directory doesn't trigger more output the first time, but it does if you repeat the operation. All subsequent repetitions after the first one that triggers updated output have no effect.
» If you repeat the test, reading attributes that are the same does not trigger output, unless you delete the .txt file before running the pipeline again. In fact, sometimes even (gi .\gcwtest.txt).lastwritetime doesn't trigger more output if you repeat the test without deleting gcwtest.txt.
» If you issue (gi .\gcwtest.txt).lastwritetime multiple times in one second, only the first one triggers output, i.e. only when the result has changed.
Opening the file in a text editor. If you use an editor that keeps the file handle open (notepad does not), you'll see that closing the file without saving does not cause Get-Content -Wait to output the lines added by the pipeline since you opened the file in the editor.
Tab-completing the file's name
After you try any of the tests above a few times, you may find that Get-Content -Wait outputs more lines periodically for the remainder of the pipeline's execution, even if you don't do anything. Not one line at a time, but in batches.
The inconsistency in behavior itself points to buffer flushing, which occurs according to variable criteria that are hard to predict, as opposed to closing, which occurs under clear-cut and consistent circumstances.
Conclusion: Get-Content -Wait works exactly as advertised. New content is displayed as soon as it's physically written to the file on disk*.
It should be noted that my suggestion to disable write caching on the drive did not work for the test above, i.e. it did not result in Get-Content -Wait displaying new lines as soon as they're added to the text file by the pipeline, so perhaps the buffering responsible for the output latency is occurring on a filesystem or OS level as opposed to the disk's write cache. However, write buffering is clearly the explanation for the behavior observed in the OP's question.
* I'm not going to get into this in detail, since it's out of the scope of the question, but Get-Content -Wait does behave oddly if you add content to the file not at the end. It displays data from the end of the file equal in size to the amount of data added. The newly displayed data generally repeats data that was previously displayed, and may or may not include any of the new data, depending on whether the size of the new data exceeds the size of the data that follows it.
I ran into the same issue while trying to watch WindowsUpdate.log in realtime. While not ideal, the code below allowed me to monitor the progress. -Wait didn't work due to the same file-writing limitations discussed above.
Displays the last 10 lines, sleeps for 10 seconds, clears the screen and then displays the last 10 again. CTRL + C to stop stream.
while(1){
Get-Content C:\Windows\WindowsUpdate.log -tail 10
Start-Sleep -Seconds 10
Clear
}
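A variation on the polling loop above avoids clearing the screen and repeating lines by remembering how far into the file the previous read got, and opening a fresh handle on every pass. This relies on the observation earlier in the thread that a newly started reader sees data that a long-running Get-Content -Wait has not shown yet. Read-AppendedText and its parameters are hypothetical names; the sketch assumes the file is only appended to and that the writer does not stop in the middle of a multi-byte character.

```powershell
function Read-AppendedText {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [long] $Position = 0   # byte offset where the previous read stopped
    )

    $fullPath = Convert-Path -LiteralPath $Path
    # Open with FileShare.ReadWrite so the writer is not locked out.
    $stream = [System.IO.File]::Open($fullPath, [System.IO.FileMode]::Open,
        [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
    try {
        # If the file shrank, it was truncated or replaced: start over.
        if ($Position -gt $stream.Length) { $Position = 0 }
        [void]$stream.Seek($Position, [System.IO.SeekOrigin]::Begin)

        $reader = New-Object System.IO.StreamReader($stream)
        $text = $reader.ReadToEnd()

        # Return the new text plus the offset to pass in next time.
        [pscustomobject]@{ Text = $text; Position = $stream.Position }
    }
    finally {
        $stream.Dispose()
    }
}

# Possible polling loop (Ctrl+C to stop):
# $pos = 0
# while (1) {
#     $chunk = Read-AppendedText -Path C:\Windows\WindowsUpdate.log -Position $pos
#     if ($chunk.Text) { Write-Host -NoNewline $chunk.Text }
#     $pos = $chunk.Position
#     Start-Sleep -Seconds 10
# }
```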