Get-Childitem - improve memory usage and performance - powershell

I would also like to retrieve the file owner, LastAccessTime, LastWriteTime, and CreationTime. Get-ChildItem has known performance issues when scaled to large directory structures.
We had some performance issues while looking for files in a folder that has more than 100,000 subfolders.
Here is my script:
$Dir = get-childitem "W:\DATA" -recurse -force
$Dir | Select-Object Name, FullName, LastAccessTime, LastWriteTime, CreationTime, @{N='Owner';E={$_.GetAccessControl().Owner}} | Export-Csv -Path C:\Scripts\xlsx.csv -NoTypeInformation
thanks in advance,

Memory
PowerShell objects (PSCustomObject) are optimized for streaming (one-at-a-time processing) and are therefore quite heavy.
Using parentheses ((...)) or assigning your stream to a variable (like $Dir = ...) will choke the pipeline and pile all the objects up in memory.
To reduce memory usage, immediately pass your objects through the pipeline by chaining the concerned cmdlets with a pipe character:
Get-childitem "W:\DATA" -recurse -force |
Select-Object Name, FullName, LastAccessTime, LastWriteTime, CreationTime, @{N='Owner';E={$_.GetAccessControl().Owner}} |
Export-Csv -path C:\Scripts\xlsx.csv -NoTypeInformation
Performance
Starting with a quote from PowerShell scripting performance considerations:
PowerShell scripts that leverage .NET directly and avoid the pipeline tend to be faster than idiomatic PowerShell. Idiomatic PowerShell typically uses cmdlets and PowerShell functions heavily, often leveraging the pipeline, and dropping down into .NET only when necessary.
In your case, the performance bottleneck is likely not PowerShell itself but the server and the network, meaning that leveraging .NET directly would probably have little effect on performance.
In fact, using the PowerShell pipeline might even be faster in this case: rather than waiting until the last file-info item has been loaded into memory, the native PowerShell pipeline starts processing the first item immediately while the next items are (slowly) provided by the server.
If you change the last cmdlet (Export-Csv) to ConvertTo-Csv, you will probably see the difference: a (correctly set up) pipeline starts producing output almost on the fly, while other solutions take a while before writing any data to the console.
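For example, a quick way to watch the streaming behaviour is a sketch like the following, which reuses the pipeline from above and only shows the first few rows; the first lines should appear almost immediately:
Get-ChildItem "W:\DATA" -Recurse -Force |
Select-Object Name, FullName, LastAccessTime, LastWriteTime, CreationTime |
ConvertTo-Csv -NoTypeInformation |
Select-Object -First 5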
The numbers tell the tale
(In Dutch: "meten is weten", which literally means: measuring is knowing)
If you aren't sure which technique will give you the best performance, I recommend simply testing it (on a subset), like:
Measure-Command {
Get-childitem "W:\DATA" -recurse -force |
Select-Object Name, FullName, LastAccessTime, LastWriteTime, CreationTime, @{N='Owner';E={$_.GetAccessControl().Owner}} |
Export-Csv -path C:\Scripts\xlsx.csv -NoTypeInformation
} | Select-Object TotalMilliseconds
and compare the results.

Give this a try; it should be faster than Get-ChildItem. You could also use [SearchOption]::AllDirectories and no Collections.Queue, but I'm not certain whether that would consume less memory (see the sketch after the code).
using namespace System.Collections
using namespace System.IO

class InfoProps {
    [string]   $Name
    [string]   $FullName
    [datetime] $LastAccessTime
    [datetime] $LastWriteTime
    [datetime] $CreationTime
    [string]   $Owner

    InfoProps([object] $FileInfo)
    {
        $this.Name           = $FileInfo.Name
        $this.FullName       = $FileInfo.FullName
        $this.LastAccessTime = $FileInfo.LastAccessTime
        $this.LastWriteTime  = $FileInfo.LastWriteTime
        $this.CreationTime   = $FileInfo.CreationTime
        $this.Owner          = $FileInfo.GetAccessControl().Owner
    }
}

$initialDirectory = $pwd.Path
$queue = [Queue]::new()
$queue.Enqueue($initialDirectory)

& {
    while ($queue.Count)
    {
        $target = $queue.Dequeue()
        foreach ($childs in [Directory]::EnumerateDirectories($target)) {
            $queue.Enqueue($childs)
        }
        [InfoProps] [DirectoryInfo] $target # => Remove this line if you want only files!
        [InfoProps[]] [FileInfo[]] [Directory]::GetFiles($target)
    }
} | Export-Csv test.csv -NoTypeInformation
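For comparison, the [SearchOption]::AllDirectories variant mentioned above could look roughly like this (a sketch, not benchmarked; it reuses the InfoProps class defined above and, unlike the queue version, may stop with an error at the first folder it cannot access):
& {
    foreach ($file in [Directory]::EnumerateFiles($pwd.Path, '*', [SearchOption]::AllDirectories)) {
        [InfoProps] [FileInfo] $file
    }
} | Export-Csv test.csv -NoTypeInformation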

Related

Which way is better in PowerShell and why

I am a novice in PowerShell and use it very rarely for some little things.
I am using this one-liner in order to extract emails recursively:
(Get-ChildItem -Include *.txt -Recurse | Get-Content | Select-String -Pattern "(?:[a-zA-Z0-9_\-\.]+)@(?:[a-zA-Z0-9_\-\.]+)\.(?:[a-zA-Z]{2,5})").Matches | Select-Object -ExpandProperty Value -Unique
In order to access the Matches property I've added parentheses. Later I came to this way:
Get-ChildItem -Include *.txt -Recurse | Get-Content | Select-String -Pattern "(?:[a-zA-Z0-9_\-\.]+)@(?:[a-zA-Z0-9_\-\.]+)\.(?:[a-zA-Z]{2,5})" | Select-Object -ExpandProperty Matches -Unique | Select-Object -ExpandProperty Value
I want to ask what exactly the parentheses do in the first version.
Say you have some $output from a function (gci in your case) and you are interested in the field $output.Matches.
If you run $output | select Matches (example 1), you effectively run a ForEach-Object statement against every object in your array. This pipeline uses only a small amount of RAM for a serial calculation, so every object of $output is processed one after the other.
If you run $output.Matches (example 2), you select a field from the whole array at once. This uses a lot of RAM in one go, but the field is processed as one big object instead of many little objects.
When it comes to performance: as always, note that PowerShell is not the way to go if you need high performance. It was never designed to be a fast programming language.
When you're using small objects (like gci $env:userprofile\Desktop), the performance hit will be small. When using large objects or using a lot of nested pipes, the performance hit will be large.
I've just tested it with gci Z:\ -Recurse, where Z:\ is a network drive. Performance dropped by a factor of 20 in this specific case. (Use Measure-Command to test this.)
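A minimal way to compare the two approaches yourself (a sketch; the pattern is the one from the question and the input is whatever *.txt files sit under the current directory):
$pattern = '(?:[a-zA-Z0-9_\-\.]+)@(?:[a-zA-Z0-9_\-\.]+)\.(?:[a-zA-Z]{2,5})'
Measure-Command {
    (Get-ChildItem -Include *.txt -Recurse | Get-Content | Select-String -Pattern $pattern).Matches |
        Select-Object -ExpandProperty Value -Unique
} | Select-Object TotalMilliseconds
Measure-Command {
    Get-ChildItem -Include *.txt -Recurse | Get-Content | Select-String -Pattern $pattern |
        Select-Object -ExpandProperty Matches -Unique |
        Select-Object -ExpandProperty Value
} | Select-Object TotalMilliseconds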

Powershell -- Get-ChildItem Directory full path and lastaccesstime

I am attempting to output full directory path and lastaccesstime in one line.
Needed --
R:\Directory1\Directory2\Directory3, March 10, 1015
What I am getting --
R:\Directory1\Directory2\Directory3
March 10, 1015
Here is my code. It isn't that complicated, but it is beyond me.
Get-ChildItem -Path "R:\" -Directory | foreach-object -process{$_.FullName, $_.LastAccessTime} | Where{ $_.LastAccessTime -lt [datetime]::Today.AddYears(-2) } | Out-File c:\temp\test.csv
I have used ForEach-Object in the past in order to ensure I do not truncate excessively long directory names and paths, but I have never used it when pulling two properties. I would like the information to be all on one line, but I haven't been successful. Thanks in advance for the assist.
I recommend filtering (Where-Object) before selecting the properties you want. Also I think you want to replace ForEach-Object with Select-Object, and lastly I think you want Export-Csv rather than Out-File. Example:
Get-ChildItem -Path "R:\" -Directory |
Where-Object { $_.LastAccessTime -lt [DateTime]::Today.AddYears(-2) } |
Select-Object FullName,LastAccessTime |
Export-Csv C:\temp\test.csv -NoTypeInformation
We can get your output on one line pretty easily, but to make it easy to read we may have to split your script out to multiple lines. I'd recommend saving the script below as a ".ps1" which would allow you to right click and select "run with powershell" to make it easier in the future. This script could be modified to play around with more inputs and variables in order to make it more modular and work in more situations, but for now we'll work with the constants you provided.
$dirs = Get-ChildItem -Path "R:\" -Directory
We'll keep the first line you made, since that is solid and there's nothing to change.
$arr = $dirs | Select-Object FullName, LastAccessTime | Where-Object { $_.LastAccessTime -lt [datetime]::Today.AddYears(-2) }
For the second line, we'll use "Select-Object" instead. In my opinion, it's a lot easier to create an array this way. We'll want to deal with the answers as an array since it'll be easiest to post the key,value pairs next to each other this way. I've expanded your "Where" to "Where-Object" since it's best practice to use the full cmdlet name instead of the alias.
Lastly, we'll want to convert our "$arr" object to csv before putting in the temp out-file.
$arr | ConvertTo-Csv | Out-File "C:\Temp\test.csv"
Putting it all together, your final script will look like this:
$dirs = Get-ChildItem -Path "R:\" -Directory
$arr = $dirs | Select-Object FullName, LastAccessTime | Where-Object { $_.LastAccessTime -lt [datetime]::Today.AddYears(-2) }
$arr | ConvertTo-Csv | Out-File "C:\Temp\test.csv"
Again, you can take this further by creating a function, binding it to a cmdlet, and creating parameters for your path, output file, and all that fun stuff.
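For example, a minimal sketch of such a function (the function and parameter names are illustrative, not from the thread):
function Export-StaleDirectoryReport {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $OutFile,
        [int] $YearsBack = 2
    )
    $dirs = Get-ChildItem -Path $Path -Directory
    $arr  = $dirs | Select-Object FullName, LastAccessTime |
                    Where-Object { $_.LastAccessTime -lt [datetime]::Today.AddYears(-$YearsBack) }
    $arr | ConvertTo-Csv | Out-File $OutFile
}
# Usage:
Export-StaleDirectoryReport -Path 'R:\' -OutFile 'C:\Temp\test.csv'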
Let me know if this helps!

Powershell memory exhaustion using NTFSSecurity module on a deep folder traverse

I have been tasked with reporting all of the ACL's on each folder in our Shared drive structure. Added to that, I need to do a look up on the membership of each unique group that gets returned.
I'm using the NTFSSecurity module in conjunction with the Get-ChildItem2 cmdlet to get past the 260-character path length limit. The path(s) I am traversing are many hundreds of folders deep and long since passed the 260-character limit.
I have been banging on this for a couple of weeks. My first challenge was crafting my script to do my task all at once, but now I'm thinking that's my problem... The issue at hand is resources, specifically memory exhaustion. Once the script gets into one of the deep folders, it consumes all RAM and starts swapping to disk, and I eventually run out of disk space.
Here is the script:
$csvfile = 'C:\users\user1\Documents\acl cleanup\dept2_Dir_List.csv'
foreach ($record in Import-Csv $csvFile)
{
$Groups = get-childitem2 -directory -path $record.FullName -recurse | Get-ntfsaccess | where -property accounttype -eq -value group
$groups2 = $Groups | where -property account -notmatch -value '^builtin|^NT AUTHORITY\\|^Creator|^AD\\Domain'
$groups3 = $groups2 | select account -Unique
$GroupMembers = ForEach ($Group in $Groups3) {
(Get-ADGroup $Group.account.sid | get-adgroupmember | select Name, @{N="GroupName";e={$Group.Account}}
)}
$groups2 | select FullName,Account,AccessControlType,AccessRights,IsInherited | export-csv "C:\Users\user1\Documents\acl cleanup\Dept2\$($record.name).csv"
$GroupMembers | export-csv "C:\Users\user1\Documents\acl cleanup\Dept2\$($record.name)_GroupMembers.csv"
}
NOTE: The dir list it reads in is the top level folders created from a get-childitem2 -directory | export-csv filename.csv
During the run, it appears not to be flushing memory properly. This is just a guess from observation. At the end of each pass through the loop the variables should be getting overwritten, I thought, but memory doesn't go down, which suggests to me that it isn't being released properly. Like I said, a guess... I have been reading about runspaces but I am confused about how to implement that with this script. Is that the right direction for this?
Thanks in advance for any assistance...!
Funny you should post about this, as I just finished a modified version of the script that I think works much better. A friend turned me on to 'function filters', which seem to work well here. I'll test it on the big directories tomorrow to see how much better the memory management is, but so far it looks great.
#Define the function 'filter' here and call it 'GetAcl'. Process is the keyword that tells the function to deal with each item in the pipeline one at a time
Function GetAcl {
PROCESS {
Get-NTFSAccess $_ | where -property accounttype -eq -value group | where -property account -notmatch -value '^builtin|^NT AUTHORITY\\|^Creator|^AD\\Domain'
}
}
#Import the directory top level paths
$Paths = import-csv 'C:\users\rknapp2\Documents\acl cleanup\dept2_Dir_List.csv'
#Process each line from the importcsv one at a time and run GetChilditem against it.
#Notice the second part: I pipe ('|') the results of Get-ChildItem2 to the function, which, because of the type of function it is, handles each item one at a time
#When done, pass the results to Export-Csv and send them to a file named after the path. This puts each dir into its own file.
ForEach ($Path in $paths) {
Get-ChildItem2 -path $path.FullName -Recurse -directory | getacl | export-csv "C:\Users\rknapp2\Documents\acl cleanup\TestFilter\$($path.name).csv" }

Powershell get-childitem needs a lot of memory

My question is pretty much the same as the one posted on metafilter.
I need to use a PowerShell script to scan through a large amount of files. The issue is that it seems that the "Get-ChildItem" function insists on shoving the whole folder and file structure in memory. Since the drive has over a million files in over 30,000 folders, the script needs a lot of memory.
http://ask.metafilter.com/134940/PowerShell-recursive-processing-of-all-files-and-folders-without-OutOfMemory-exception
All I need is the name, size, and location of the files.
What I have been doing until now is:
$filesToIndex = Get-ChildItem -Path $path -Recurse | Where-Object { !$_.PSIsContainer }
It works but I don't want to punish my memory :-)
Best regards,
greenhoorn
If you want to optimize the script to use less memory, you need to properly utilize the pipeline. What you are doing is saving the result of Get-ChildItem -recurse into memory, all of it! What you could do is something like this:
Get-ChildItem -Path $Path -Recurse | Foreach-Object {
if (-not($_.PSIsContainer)) {
# do stuff / get info you need here
}
}
This way you are always streaming the data through the pipeline and you will see that PowerShell will consume less memory (if done correctly).
One thing you can do to help is to reduce the size of the objects you're saving by paring them down to just the properties you're interested in.
$filesToIndex = Get-ChildItem -Path $path -Recurse |
Where-Object { !$_.PSIsContainer } |
Select Name,Fullname,Length
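Combining the two suggestions (a sketch; the output path is just an example), you can trim the properties and keep streaming by writing straight to a file instead of holding everything in $filesToIndex:
Get-ChildItem -Path $path -Recurse |
    Where-Object { -not $_.PSIsContainer } |
    Select-Object Name, FullName, Length |
    Export-Csv -Path 'C:\Temp\fileIndex.csv' -NoTypeInformation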

powershell slow(?) - write names of subfolders to a text file

My PowerShell script seems slow; when I run the code below in the ISE, it keeps running and doesn't stop.
I am trying to write the list of subfolders in a folder (the folder path is in $scratchpath) to a text file. There are >30k subfolders.
$limit = (Get-Date).AddDays(-15)
$path = "E:\Data\PathToScratch.txt"
$scratchpath = Get-Content $path -TotalCount 1
Get-ChildItem -Path $scratchpath -Recurse -Force | Where-Object { $_.PSIsContainer -and $_.CreationTime -lt $limit } | Add-Content C:\Data\eProposal\POC\ScratchContents.txt
Let me know if my approach is not optimal. Ultimately, I will read the text file, zip the subfolders for archival and delete them.
Thanks for your help in advance. I am new to PS and have watched a few videos on MVA.
Add-Content, Set-Content, and even Out-File are notoriously slow in PowerShell. This is because each call opens the file, writes to it, and closes the handle. It never does anything more intelligently than that.
That doesn't sound bad until you consider how pipelines work with Get-ChildItem (and Where-Object and Select-Object). It doesn't wait until it's completed before it begins passing objects into the pipeline. It starts passing objects as soon as the provider returns them. For a large result set, this means that objects are still feeding into the pipeline long after several have finished processing. Generally speaking, this is great! It means the system functions more efficiently, and it's why stuff like this:
$x = Get-ChildItem;
$x | ForEach-Object { [...] };
Is significantly slower than stuff like this:
Get-ChildItem | ForEach-Object { [...] };
And it's why stuff like this appears to stall:
Get-ChildItem | Sort-Object Name | ForEach-Object { [...] };
The Sort-Object cmdlet needs to wait until it has received all pipeline objects before it sorts. It kind of has to, in order to be able to sort. The sort itself is nearly instantaneous; it's just the cmdlet waiting until it has the full results.
The issue with Add-Content is that, well, it experiences the pipeline not as, "Here's a giant string to write once," but instead as, "Here's a string to write. Here's a string to write. Here's a string to write. Here's a string to write." You'll be sending content to Add-Content here line by line. Each line will instantiate a new call to Add-Content, requiring the file to open, write, and close. You'll likely see better performance if you assign the result of Get-ChildItem [...] | Where-Object [...] to a variable, and then write the entire variable to the file at once:
$limit = (Get-Date).AddDays(-15);
$path = "E:\Data\PathToScratch.txt";
$scratchpath = Get-Content $path -TotalCount 1;
$Results = Get-ChildItem -Path $scratchpath -Recurse -Force -Directory | `
Where-Object{$_.CreationTime -lt $limit } | `
Select-Object -ExpandProperty FullName;
Add-Content C:\Data\eProposal\POC\ScratchContents.txt -Value $Results;
However, you might be concerned about memory usage if your results are actually going to be extremely large. You can actually use System.IO.StreamWriter for this purpose, too. My process improved in speed by nearly two orders of magnitude (from 12 hours to 20 minutes) by switching to StreamWriter and also only calling StreamWriter when I had about 250 lines to write (that seemed to be the break-even point for StreamWriter's overhead). But I was parsing all ACLs for user home and group shares for about 10,000 users and nearly 10 TB of data. Your task might not be as large.
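A minimal StreamWriter sketch along those lines, assuming the same $scratchpath and $limit variables as above and using the roughly-250-line buffer mentioned (the file path and buffer size are illustrative):
$writer = New-Object System.IO.StreamWriter 'C:\Data\eProposal\POC\ScratchContents.txt'
$buffer = New-Object System.Collections.Generic.List[string]
try {
    Get-ChildItem -Path $scratchpath -Recurse -Force -Directory |
        Where-Object { $_.CreationTime -lt $limit } |
        ForEach-Object {
            $buffer.Add($_.FullName)
            if ($buffer.Count -ge 250) {
                # Write the batch in one call instead of one write per line
                $writer.WriteLine(($buffer -join [Environment]::NewLine))
                $buffer.Clear()
            }
        }
    if ($buffer.Count) { $writer.WriteLine(($buffer -join [Environment]::NewLine)) }
}
finally {
    $writer.Dispose()
}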
Here's a good blog explaining the issue.
Do you have at least PowerShell 3.0? If you do, you should be able to reduce the time by filtering out the files, since you are returning those as well.
Get-ChildItem -Path $scratchpath -Recurse -Force -Directory | ...
Currently you are returning all files and folders and then filtering out the files with $_.PSIsContainer, which is slower. So you should end up with something like this:
Get-ChildItem -Path $scratchpath -Recurse -Force -Directory |
Where-Object{$_.CreationTime -lt $limit } |
Select-Object -ExpandProperty FullName |
Add-Content C:\Data\eProposal\POC\ScratchContents.txt