PowerShell 5.1: How to iterate over files in parallel

I need to copy files depending on their content. So I get all the files, read the content and check it against a regex. If it matches, I want to copy the file to a certain directory.
My problem is that there are a lot of source files, so I need to execute this in parallel.
I cannot use the PowerShell ForEach-Object -Parallel feature because we are using a PowerShell version < 7.0.
Using a workflow is way too slow.
$folder = "C:\InputFiles"
workflow CopyFiles
{
    foreach -parallel ($file in gci $folder *.* -rec | where { ! $_.PSIsContainer })
    {
        # Get the content and compare it against a regex
        # Copy the file if the regex matches
    }
}
CopyFiles
Any ideas how to run this in parallel with PowerShell?

Another option is using jobs. You'd have to define a ScriptBlock accepting the path and regex as parameters, then run it in parallel in the background. Read about the Start-Job, Receive-Job, Get-Job and Remove-Job cmdlets.
But I don't think it's really going to help:
I don't expect it to be much faster than workflows
You'd have to throttle and control execution of the jobs yourself, which adds complexity to the script
There's substantial overhead to running jobs
Most probably the file system is the bottleneck of this task, so any approach accessing files in parallel isn't really going to help here
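For illustration, here is a minimal sketch of that job-based approach. The destination folder, regex and throttle value are placeholders, not from the question:
$folder      = 'C:\InputFiles'
$destination = 'C:\MatchedFiles'    # assumed target folder
$regex       = 'some-pattern'       # assumed pattern
$throttle    = 4                    # maximum number of concurrent jobs

$scriptBlock = {
    param($file, $pattern, $dest)
    # Read the whole file as one string and copy it if the pattern matches
    if ((Get-Content -Path $file -Raw) -match $pattern) {
        Copy-Item -Path $file -Destination $dest
    }
}

foreach ($file in Get-ChildItem $folder -Recurse -File) {
    # Simple throttling: wait until a slot frees up before starting another job
    while (@(Get-Job -State Running).Count -ge $throttle) {
        Start-Sleep -Milliseconds 200
    }
    Start-Job -ScriptBlock $scriptBlock -ArgumentList $file.FullName, $regex, $destination | Out-Null
}
Get-Job | Wait-Job | Receive-Job
Get-Job | Remove-Job
As noted above, the per-job overhead is substantial, so this mainly shows the mechanics rather than a guaranteed speed-up.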

Can you run the following script on your configuration and see how much time this method takes? It takes about 100 ms for me to find around 2000 occurrences of the text 'PowerShell' in the help files under $PSHOME\en-US.
$starttime = Get-Date;
$RegEx = 'Powershell'
$FilesFound = Get-ChildItem -Path "$PSHOME\en-US\*.txt" | Select-String -Pattern $RegEx
Write-Host "Total occurence found: $($FilesFound.Count)"
$endtime = Get-Date;
Write-Host "Time of execution:" ($endtime - $starttime).Milliseconds "Mili Seconds";

Related

Copy files after time X based on modification date

I need a script that only copies files after 5 minutes, based on the modification date. Does anyone have a solution for this?
I couldn't find any script online.
The answer from jdweng is a good solution to identify the files in scope.
You could make your script something like this to easily re-use it with other paths or a different file age.
# Customizable variables
$Source = 'C:\Temp\Input'
$Destination = 'C:\Temp\Output'
[int32]$FileAgeInMinutes = 5
# Script Execution
Get-ChildItem -Path $Source | Where-Object { $_.LastWriteTime -lt (Get-Date).AddMinutes(-$FileAgeInMinutes) } | Copy-Item -Destination $Destination
You could then run a scheduled task using this script and schedule it to run periodically, depending on your needs.
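For example, here is a rough sketch of registering such a task with the ScheduledTasks module (available on Windows 8 / Server 2012 and later); the script path and interval are assumptions:
# Run the copy script every 5 minutes via the Task Scheduler
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
    -Argument '-NoProfile -ExecutionPolicy Bypass -File "C:\Scripts\Copy-OldFiles.ps1"'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
    -RepetitionInterval (New-TimeSpan -Minutes 5)
Register-ScheduledTask -TaskName 'Copy old files' -Action $action -Trigger $trigger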

Is there any way to get less CPU usage from a PowerShell script like mine?

I wrote a PowerShell script which reads an INI file and creates a new one with the syntax my data needs, in an infinite loop. It looks like this:
for(;;) {
    $fileToCheck = "C:\test\test.ini"
    if (Test-Path $fileToCheck -PathType leaf)
    {
        $ini = Get-Content -Path C:\test\test.ini
        $out = "[Event]`nTop = DISPLAY`n[Display_00000]`nDisplay=" + $ini
        Add-Content -Path C:\test\exit.txt -Value $out
        Remove-Item C:\test\hallo.ini
    }
}
I don't know if this is the best way, but it has to be a PowerShell script. When it is started via the Task Scheduler at computer startup, I can see in Task Manager that it uses more than 30% of the CPU.
Is this normal for a script like this? And is there a way to reduce the amount of CPU usage?
You're reading and writing files in an infinite loop without any delay, so the script performs as many of those operations as it can in a unit of time.
If you want to monitor a file for changes, you can play around with something like this:
Get-Content C:\TMP\in.txt -Wait | Add-Content c:\tmp\out.txt -Force -PassThru
-Wait causes Get-Content to keep the file open and output any new lines to the pipeline. It checks for changes every second, so CPU usage will be low. -PassThru outputs the lines to the console, so it's useful for testing and you can remove it later.
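Alternatively, the simplest mitigation is to keep the original loop but add a delay so it no longer spins at full speed; a sketch using the paths from the question as-is:
for(;;) {
    $fileToCheck = "C:\test\test.ini"
    if (Test-Path $fileToCheck -PathType leaf)
    {
        $ini = Get-Content -Path C:\test\test.ini
        $out = "[Event]`nTop = DISPLAY`n[Display_00000]`nDisplay=" + $ini
        Add-Content -Path C:\test\exit.txt -Value $out
        Remove-Item C:\test\hallo.ini
    }
    Start-Sleep -Seconds 1   # poll once per second instead of as fast as possible
}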

Fastest way to copy files (but not the entire directory) from one location to another

Summary
I am currently tasked with migrating around 6TB of data to a cloud server, and am trying to optimise how fast this can be done.
I would use standard Robocopy to do this usually, but there is a requirement that I am to only transfer files that are present in a filetable in SQL, and not the entire directories (due to a lot of junk being inside these folders that we do not want to migrate).
What I have tried
Feeding individual files from an array into Robocopy is unfeasibly slow, as Robocopy instances were being started sequentially for each file, so I tried to speed up this process in two ways.
It was pointless to have /MT set above 1 if only one file was being transferred, so I attempted to simulate the multithreading feature. I did this by utilising the new ForEach-Object -Parallel feature in PowerShell 7.0 and setting the throttle limit to 4. With this, I was able to pass the array in and run 4 Robocopy jobs in parallel (still starting and stopping for each file), which increased speed a bit.
Secondly, I split the array into 4 equal arrays and ran the above function across each array as a job, which again increased the speed by quite a bit. For clarity, I had 4 equal arrays fed to 4 ForEach-Object -Parallel code blocks that were each running 4 Robocopy instances, so a total of 16 Robocopy instances at once.
Issues
I encountered a few problems.
My simulation of the multithreading feature did not behave in the way the /MT flag works in Robocopy. When examining the running processes, my code executes 16 instances of Robocopy at once, whereas the normal /MT:16 flag of Robocopy would only kick off one Robocopy instance (but still be multithreaded).
Secondly, the code causes a memory leak. The memory usage starts to increase when the jobs run and accumulates over time, until a large portion of memory is being utilised. When the jobs complete, the memory usage stays high until I close PowerShell and the memory is released. Normal Robocopy did not do this.
Finally, I decided to compare the time taken for my method, and then a standard Robocopy of the entire testing directory, and the normal Robocopy was still over 10x faster, and had a better success rate (a lot of the files weren’t copied over with my code, and a lot of the time I was receiving error messages that the files were currently in use and couldn’t be Robocopied, presumably because they were in the process of being Robocopied).
Are there any faster alternatives, or is there a way to manually create a multithreading instance of robocopy that would perform like the /MT flag of the standard robocopy? I appreciate any insight/alternative ways of looking at this. Thanks!
#Item(0) is the Source excluding the filename, Item(2) is the Destination, Item(1) is the filename
$robocopy0 = $tables.Tables[0].Rows
$robocopy1 = $tables.Tables[1].Rows
$robocopy0 | ForEach-Object -Parallel {robocopy $_.Item(0) $_.Item(2) $_.Item(1) /e /w:1 /r:1 /tee /NP /xo /mt:1 /njh /njs /ns
} -ThrottleLimit 4 -AsJob
$robocopy1 | ForEach-Object -Parallel {robocopy $_.Item(0) $_.Item(2) $_.Item(1) /e /w:1 /r:1 /tee /NP /xo /mt:1 /njh /njs /ns
} -ThrottleLimit 4 -AsJob
#*8 for 8 arrays
RunspaceFactory multithreading might be optimally suited for this type of work, with one HUGE caveat. There are quite a few articles out on the net about it. Essentially, you create a scriptblock that takes parameters for the source file to copy and the destination to write to, and uses those parameters to execute robocopy. You create an individual PowerShell instance for each variant of the scriptblock and append it to the RunspaceFactory. The RunspaceFactory will queue up the jobs and work through the probably millions of jobs X at a time, where X is the number of threads you allocate to the pool.
CAVEAT: First and foremost, to queue up millions of jobs for the probable millions of files you have across 6TB, you'll likely need monumental amounts of memory. Assuming an average path length for source and destination of 40 characters (probably very generous) multiplied by a WAG of 50 million files is nearly 4GB of memory by itself, which doesn't include the object structural overhead, the PowerShell instances, etc. You can overcome this either by breaking the job up into smaller chunks or by using a server with 128GB of RAM or better. Additionally, if you don't terminate the jobs once they've been processed, you'll also experience what appears to be a memory leak, but is really just your jobs producing output that you never collect and dispose of when they complete.
Here's a sample from a recent project I did migrating files from an old domain NAS to a new domain NAS -- I'm using Quest SecureCopy instead of RoboCopy but you should be able to easily replace those bits:
## MaxThreads is an arbitrary number I use relative to the hardware I have available to run jobs I'm working on.
$FileRSpace_MaxThreads = 15
$FileRSpace = [runspacefactory]::CreateRunspacePool(1, $FileRSpace_MaxThreads, ([System.Management.Automation.Runspaces.InitialSessionState]::CreateDefault()), $Host)
$FileRSpace.ApartmentState = 'MTA'
$FileRSpace.Open()
## The scriptblock that does the actual work.
$sb = {
param(
$sp,
$dp
)
## This is my output object I'll emit through STDOUT so I can consume the status of the job in the main thread after each instance is completed.
$n = [pscustomobject]@{
'source' = $sp
'dest' = $dp
'status' = $null
'sdtm' = [datetime]::Now
'edtm' = $null
'elapsed' = $null
}
## Remove the Import-Module and SecureCopy cmdlet and replace it with the RoboCopy version
try {
Import-Module "C:\Program Files\Quest\Secure Copy 7\SCYPowerShellCore.dll" -ErrorAction Stop
Start-SecureCopyJob -Database "C:\Program Files\Quest\Secure Copy 7\SecureCopy.ssd" -JobName "Default" -Source $sp -Target $dp -CopySubFolders $true -Quiet $true -ErrorAction Stop | Out-Null
$n.status = $true
} catch {
$n.status = $_
}
$n.edtm = [datetime]::Now
$n.elapsed = ("{0:N2} minutes" -f (($n.edtm - $n.sdtm).TotalMinutes))
$n
}
## The array to hold the individual runspaces and ultimately iterate over to watch for completion.
$FileWorkers = @()
$js = [datetime]::now
log "Job starting at $js"
## $peers is a [pscustomobject] I precreate that just contains every source (property 's') and the destination (property 'd') -- modify to suit your needs as necessary
foreach ($c in $peers) {
try {
log "Configuring migration job for '$($c.s)' and '$($c.d)'"
$runspace = [powershell]::Create()
[void]$runspace.AddScript($sb)
[void]$runspace.AddArgument($c.s)
[void]$runspace.AddArgument($c.d)
$runspace.RunspacePool = $FileRSpace
$FileWorkers += [pscustomobject]@{
'Pipe' = $runspace
'Async' = $runspace.BeginInvoke()
}
log "Successfully created a multi-threading job for '$($c.s)' and '$($c.d)'"
} catch {
log "An error occurred creating a multi-threading job for '$($c.s)' and '$($c.d)'"
}
}
while ($FileWorkers.Async.IsCompleted -contains $false) {
$Completed = $FileWorkers | ? { $_.Async.IsCompleted -eq $true }
[pscustomobject]@{
'Numbers' = ("{0}/{1}" -f $Completed.Count, $FileWorkers.Count)
'PercComplete' = ("{0:P2}" -f ($Completed.Count / $FileWorkers.Count))
'ElapsedMins' = ("{0:N2}" -f ([datetime]::Now - $js).TotalMinutes)
}
$Completed | % { $_.Pipe.EndInvoke($_.Async) } | Export-Csv -NoTypeInformation ".\$($DtmStamp)_SecureCopy_Results.csv"
Start-Sleep -Seconds 15
}
## This handles a race condition where the final job(s) aren't completed before the sleep but are by the time the while condition is re-evaluated
$FileWorkers | % { $_.Pipe.EndInvoke($_.Async) } | Export-Csv -NoTypeInformation ".\$($DtmStamp)_SecureCopy_Results.csv"
Suggested strategies, if you don't have a beefy server to queue up all the jobs simultaneously, are to either batch out the files in statically sized blocks (e.g. 100,000 or whatever your hardware can take) or to group files together to send to each scriptblock (e.g. 100 files per scriptblock), which would minimize the number of jobs to queue up in the runspace factory (but would require some code changes).
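As a rough sketch of the batching idea (the chunk size and the $allFiles variable are assumptions):
# Split a large work list into fixed-size batches before queueing runspace jobs
$chunkSize = 100000                 # adjust to what your hardware can handle
$allFiles  = @($peers)              # the precomputed source/destination pairs
for ($i = 0; $i -lt $allFiles.Count; $i += $chunkSize) {
    $end   = [math]::Min($i + $chunkSize, $allFiles.Count) - 1
    $batch = $allFiles[$i..$end]
    # Feed $batch to the runspace pool as shown above, then collect and dispose
    # of the results before moving on to the next batch
}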
HTH
Edit 1: To address how the input object I'm using is constructed:
$destRoot = '\\destinationserver.com\share'
$peers = @()
$children = @()
$children += (get-childitem '\\sourceserver\share' -Force) | Select -ExpandProperty FullName
foreach ($c in $children) {
    $peers += [pscustomobject]@{
        's' = $c
        'd' = "$($destRoot)\$($c.Split('\')[3])\$($c | Split-Path -Leaf)"
    }
}
In my case, I was taking stuff from \\server1\share1\subfolder1 and moving it to something like \\server2\share1\subfolder1\subfolder2. So in essence, all the '$peers' array does is take the full name of each source item and construct the corresponding destination path (since the source/dest server names are different, and possibly the share name too).
You don't have to do this; you can dynamically construct the destination and just loop through the source folders. I perform this extra step because it gives me a two-property array that I can verify is pre-constructed accurately, and against which I can run tests to ensure things exist and are accessible.
There is a lot of extra bloat in my script due to the custom objects meant to give me output from each thread in the multi-threader, so I can see the status of each copy attempt and track things like which folders were successful or not, how long each individual copy took, etc. If you're using robocopy and dumping the results to a text file, you may not need this. If you want me to pare the script down to its bare-bones components just to get things multi-threading, I can do that if you like.

Powershell Rename Files in a Directory Synchronously

I have a script that renames all the files in a directory from fields/columns after importing a .CSV. My problem is that PowerShell is renaming the files asynchronously rather than synchronously. Is there a better way to get the result I want?
Current file name = 123456789.pdf
New File Name = $documentID_$fileID
I need the new file name to rename the files in order to make the script viable.
Here's my code (I'm new at this):
$csvPath = "C:\Users\dougadmin28\Desktop\Node Modify File Name App\test.csv"
$filePath = "C:\Users\dougadmin28\Desktop\Node Modify File Name App\pdfs"
$csv = Import-Csv $csvPath | Select-Object -Skip 0
$files = Get-ChildItem $filePath
foreach ($item in $csv) {
foreach($file in $files) {
Rename-Item $file.fullname -NewName "$($item.DocumentID +"_"+ ($item.FileID)+($file.extension))" -Verbose
}
}
You may try using workflows, which allow you to execute tasks in parallel:
https://learn.microsoft.com/en-us/powershell/module/psworkflow/about/about_foreach-parallel?view=powershell-5.1
Keep in mind that PowerShell workflows have some limitations:
https://devblogs.microsoft.com/scripting/powershell-workflows-restrictions/
Hope it helps!
I thought synchronous meant sequential, as in 'one after the other', which is what your script is doing now.
If you mean 'in parallel', as in asynchronously or independently of each other, you can look at using:
Background jobs: Start-Job, Wait-Job and Receive-Job. Easiest to work with, but not efficient in terms of performance. Also available in some cmdlets as an -AsJob switch.
PowerShell runspaces: most efficient, but hard to code for.
PowerShell workflows: balanced, but have limitations.
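As a rough sketch of the background-jobs option, the list of files could be split into batches and each batch processed in its own job; the batching and the placeholder processing below are assumptions, not part of the original script:
$files   = Get-ChildItem "C:\Users\dougadmin28\Desktop\Node Modify File Name App\pdfs"
$half    = [math]::Ceiling($files.Count / 2)
$batches = @(($files | Select-Object -First $half), ($files | Select-Object -Skip $half))
$jobs = foreach ($batch in $batches) {
    Start-Job -ScriptBlock {
        param($paths)
        foreach ($p in $paths) {
            # the real rename (e.g. the CSV-driven new name) would go here
            "processed $p"
        }
    } -ArgumentList (,@($batch.FullName))
}
$jobs | Wait-Job | Receive-Job   # wait for all jobs and collect their output
$jobs | Remove-Job               # clean up
Note that running the renames in parallel does not make them happen in any particular order; it only helps throughput.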

Why is this PowerShell script so slow? How can I speed it up?

I developed this script to apply Sitecore workflows to whole swaths of items without having to manually click through the GUI. I'm pretty pleased with how well it works, but it's just slow. Here is the script:
Import-Module 'C:\Subversion\CMS Build\branches\1.2\sitecorepowershell\Sitecore.psd1'
# Hardcoded IDs of workflows and states
$ContentApprovalWfId = "{7005647C-2DAC-4C32-8A09-318000556325}";
$ContentNoApprovalWfId = "{BCBE4080-496F-4DCB-8A3F-6682F303F3B4}";
$SettingsWfId = "{7D2BA7BE-6A0A-445D-AED7-686385145340}";
#new-psdrive *REDACTED*
set-location m-rocks:
function ApplyWorkflows([string]$path, [string]$WfId) {
Write-Host "ApplyWorkflows called: " $path " - " $wfId;
$items = Get-ChildItem -Path $path;
$items | foreach-object {
if($_ -and $_.Name) {
$newPath = $path + '\' + $_.Name;
$newPath;
} else {
Write-host "Name is empty.";
return;
}
if($_.TemplateName -eq "Folder" -or $_TemplateName -eq "Template Folder") {
# don't apply workflows to pure folders, just recurse
Write-Host $_.Name " is a folder, recursing.";
ApplyWorkflows $newPath $wfId;
}
elseif($_.TemplateName -eq "Siteroot" -or $_.TemplateName -eq "InboundSiteroot") {
# Apply content-approval workflow
Set-ItemProperty $newPath -name "__Workflow" $ContentApprovalWfId;
Set-ItemProperty $newPath -name "__Default workflow" $ContentApprovalWfId;
# Apply content-no-approval workflow to children
Write-Host $_.Name " is a siteroot, applying approval workflow and recursing.";
ApplyWorkflows $newPath $ContentNoApprovalWfId;
}
elseif($_.TemplateName -eq "QuotesHomePage") {
# Apply settings workflow to item and children
Write-Host $_.Name " is a quotes item, applying settings worfklow recursing.";
Set-ItemProperty $newPath -name "__Workflow" $SettingsWfId;
Set-ItemProperty $newPath -name "__Default workflow" $SettingsWfId;
ApplyWorkflows $newPath $SettingsWfId;
}
elseif($_.TemplateName -eq "Wildcard")
{
Write-Host $_.Name " is a wildcard, applying workflow (and halting).";
Set-ItemProperty $newPath -name "__Workflow" $ContentApprovalWfId;
Set-ItemProperty $newPath -name "__Default workflow" $ContentApprovalWfId;
}
elseif($_ -and $_.Name) {
# Apply passed in workflow and recurse with passed in workflow
Write-Host $_.Name " is a something else, applying workflow and recursing.";
Set-ItemProperty $newPath -name "__Workflow" $WfId;
Set-ItemProperty $newPath -name "__Default workflow" $WfId;
ApplyWorkflows $newPath $wfId;
}
}
}
ApplyWorkflows "sitecore\Content\" $ContentNoApprovalWfId;
It processes one item in a little less than a second. There are some pauses in its progress - evidence suggests that this is when Get-ChildItem returns a lot of items. There are a number of things I would like to try, but it's still running against one of our sites. It's been about 50 minutes and looks to be maybe 50% done, maybe less. It looks like it's working breadth-first, so it's hard to get a handle on exactly what's done and what's not.
So what's slowing me down?
Is it the path construction and retrieval? I tried to just get the children on the current item via $_ or $_.Name, but it always looks in the current working directory, which is the root, and can't find the item. Would changing the directory on every recursion be any faster?
Is it the output that's bogging it down? Without the output, I have no idea where it is or that it's still working. Is there some other way I could get indication of where it is, how many it has done, etc.?
Is there a better approach where I just use Get-ChildItem -r with filter sets and loop through those? If so, a first attempt at incorporating some of my conditionals in the first script into a filter set would be very much appreciated. I am new to PowerShell, so I'm sure there's more than one or two improvements to be made in my code.
Is it that I always call the recursive bit even if there aren't any children? The content tree here is very broad with a very many leaves with no children. What would be a good check whether or not child items exist?
Finally, the PowerShell provider (PSP) we have is not complete. It does not seem to have a working implementation of Get-Item, which is why everything is almost completely written with Get-ChildItem instead. Our Sitecore.Powershell.dll says it is version 0.1.0.0. Would upgrading that help? Is there a newer one?
Edit: it finally finished. I did a count on the output and came up with 1857 items and it took ~85 minutes to run, for an average of 21 items per minute. Slower than I thought, even...
Edit: My first run was on PowerShell 2.0, using the Windows PowerShell ISE. I've not tried the Sitecore PowerShell plugin module or the community one. I didn't even know it existed until yesterday :-)
I tried another run after upgrading to PowerShell 3.0. Starting locally - running the script from my laptop, connecting to the remote server - there was no noticeable difference. I installed PowerShell 3.0 on the hosting box and ran the script from there and saw maybe a 20-30% increase in speed. So that's not the silver bullet I was hoping it would be - I need an order of magnitude or two's worth of improvement to make this something I won't have to babysit and run in batches. I am now playing around with some of the actual script improvements suggested by the fine answers below. I will post back with what works for me.
Personally, I think the biggest boost would come from using the community PowerShell implementation over the Rocks one.
Let me explain why.
You're traversing the whole tree, which means you have to visit every node in your branch, and every node has to be read and travel over the Rocks web service at least once.
Then every property save is another web service call.
I have run your script in the community console and it took me around 25 seconds for 3724 items.
(I've removed the modifications as the values didn't relate to my system).
A simple
Get-ChildItem -recurse
on my 3724-item tree took 11 seconds in the community console vs 48 seconds in the Rocks implementation.
An additional tweak you could use in the community implementation for your script would be a Sitecore query like:
get-item . -Query '/sitecore/content/wireframe//*[@@TemplateName="Template Folder"]'
and only sending those items into your function.
None of this means the Rocks console is not written right; it just means the design choices in the Rocks console, and its target, are different.
You can find the community console here:
http://bit.ly/PsConScMplc
See this blog post, where the differences between the ForEach-Object cmdlet and the foreach statement are explained.
You could speed it up by piping Get-ChildItem into ForEach-Object:
get-childitem . | foreach-object { ApplyWorkflow($_) }
This causes each object returned by Get-ChildItem to be passed immediately to the next step in the pipeline, so you process the items only once. It should also prevent the long pauses caused by reading all the children first.
Additionally, you can get all the items recursively, filter them by template, and then apply the appropriate workflows, e.g.:
get-childitem -recurse . | where-object { $_.TemplateName -eq "MyTemplate"} | foreach-object { ApplyWorkflowForMyTemplate($_) }
get-childitem -recurse . | where-object { $_.TemplateName -eq "MySecondTemplate"} | foreach-object { ApplyWorkflowForMySecondTemplate($_) }
Still, I would not expect this script to run in seconds anyway; in the end you are going over the whole content tree.
And finally, what library are you using? Is this the Sitecore PowerShell Console (the name of the DLL sounds familiar)? There is a newer version which has lots of new features added.
The most obvious problems I see are that you are iterating over files twice per invocation of the function:
function ApplyWorkflows([string]$path, [string]$WfId) {
Write-Host "ApplyWorkflows called: " $path " - " $wfId;
# this iterates through all files and assigns to $items
$items = Get-ChildItem -Path $path;
# this iterates through $items for a second time
$items | foreach-object { # foreach-object is slower than for (...)
Also, piping to foreach-object is slower than using the traditional for keyword.
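For illustration, a minimal sketch of the two alternatives against a generic provider path (not the Sitecore-specific script):
# Stream items straight into the pipeline instead of collecting them into $items first
Get-ChildItem -Path $path | ForEach-Object {
    # process $_ here
}
# Or, if the collection is already in memory, the foreach statement avoids pipeline overhead
foreach ($item in Get-ChildItem -Path $path) {
    # process $item here
}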