Edit Get-DirStats.ps1 to take a depth (subdirectory) parameter - powershell

As I'm new to PowerShell, I need help editing the popular Get-DirStats.ps1 so that it takes a new depth parameter that defines how many levels of subdirectories it reads and reports statistics for.
Currently the script outputs all subdirectories; I need to limit that by depth.
This needs to work in PowerShell 2.0, as I have some Windows 7 machines.
The ps1 can be downloaded here: https://gallery.technet.microsoft.com/scriptcenter/Outputs-directory-size-964d07ff
I learned from the thread "Limit Get-ChildItem recursion depth" that this should be a general solution:
$Depth = 2
$Levels = "*" * $Depth
But adding this to Get-DirStats.ps1 gives errors. Here is the script:
# Written by Bill Stewart
# Outputs file system directory statistics.
#requires -version 2
<#
.SYNOPSIS
Outputs file system directory statistics.
.DESCRIPTION
Outputs file system directory statistics (number of files and the sum of all file sizes) for one or more directories.
.PARAMETER Path
Specifies a path to one or more file system directories. Wildcards are not permitted. The default path is the current directory (.).
.PARAMETER LiteralPath
Specifies a path to one or more file system directories. Unlike Path, the value of LiteralPath is used exactly as it is typed.
.PARAMETER Only
Outputs statistics for a directory but not any of its subdirectories.
.PARAMETER Every
Outputs statistics for every directory in the specified path instead of only the first level of directories.
.PARAMETER FormatNumbers
Formats numbers in the output object to include thousands separators.
.PARAMETER Total
Outputs a summary object after all other output that sums all statistics.
#>
[CmdletBinding(DefaultParameterSetName="Path")]
param(
[parameter(Position=0,Mandatory=$false,ParameterSetName="Path",ValueFromPipeline=$true)]
$Path=(get-location).Path,
[parameter(Position=0,Mandatory=$true,ParameterSetName="LiteralPath")]
[String[]] $LiteralPath,
[Switch] $Only,
[Switch] $Every,
[Switch] $FormatNumbers,
[Switch] $Total
)
begin {
$ParamSetName = $PSCmdlet.ParameterSetName
if ( $ParamSetName -eq "Path" ) {
$PipelineInput = ( -not $PSBoundParameters.ContainsKey("Path") ) -and ( -not $Path )
}
elseif ( $ParamSetName -eq "LiteralPath" ) {
$PipelineInput = $false
}
# Script-level variables used with -Total.
[UInt64] $script:totalcount = 0
[UInt64] $script:totalbytes = 0
# Returns a [System.IO.DirectoryInfo] object if it exists.
function Get-Directory {
param( $item )
if ( $ParamSetName -eq "Path" ) {
if ( Test-Path -Path $item -PathType Container ) {
$item = Get-Item -Path $item -Force
}
}
elseif ( $ParamSetName -eq "LiteralPath" ) {
if ( Test-Path -LiteralPath $item -PathType Container ) {
$item = Get-Item -LiteralPath $item -Force
}
}
if ( $item -and ($item -is [System.IO.DirectoryInfo]) ) {
return $item
}
}
# Filter that outputs the custom object with formatted numbers.
function Format-Output {
process {
$_ | Select-Object Path,
@{Name="Files"; Expression={"{0:N0}" -f $_.Files}},
@{Name="Size"; Expression={"{0:N0}" -f $_.Size}}
}
}
# Outputs directory statistics for the specified directory. With -recurse,
# the function includes files in all subdirectories of the specified
# directory. With -format, numbers in the output objects are formatted with
# the Format-Output filter.
function Get-DirectoryStats {
param( $directory, $recurse, $format )
Write-Progress -Activity "Get-DirStats.ps1" -Status "Reading '$($directory.FullName)'"
$files = $directory | Get-ChildItem -Force -Recurse:$recurse | Where-Object { -not $_.PSIsContainer }
if ( $files ) {
Write-Progress -Activity "Get-DirStats.ps1" -Status "Calculating '$($directory.FullName)'"
$output = $files | Measure-Object -Sum -Property Length | Select-Object `
@{Name="Path"; Expression={$directory.FullName}},
@{Name="Files"; Expression={$_.Count; $script:totalcount += $_.Count}},
@{Name="Size"; Expression={$_.Sum; $script:totalbytes += $_.Sum}}
}
else {
$output = "" | Select-Object `
@{Name="Path"; Expression={$directory.FullName}},
@{Name="Files"; Expression={0}},
@{Name="Size"; Expression={0}}
}
if ( -not $format ) { $output } else { $output | Format-Output }
}
}
process {
# Get the item to process, no matter whether the input comes from the
# pipeline or not.
if ( $PipelineInput ) {
$item = $_
}
else {
if ( $ParamSetName -eq "Path" ) {
$item = $Path
}
elseif ( $ParamSetName -eq "LiteralPath" ) {
$item = $LiteralPath
}
}
# Write an error if the item is not a directory in the file system.
$directory = Get-Directory -item $item
if ( -not $directory ) {
Write-Error -Message "Path '$item' is not a directory in the file system." -Category InvalidType
return
}
# Get the statistics for the first-level directory.
Get-DirectoryStats -directory $directory -recurse:$false -format:$FormatNumbers
# -Only means no further processing past the first-level directory.
if ( $Only ) { return }
# Get the subdirectories of the first-level directory and get the statistics
# for each of them.
$directory | Get-ChildItem -Force -Recurse:$Every |
Where-Object { $_.PSIsContainer } | ForEach-Object {
Get-DirectoryStats -directory $_ -recurse:(-not $Every) -format:$FormatNumbers
}
}
end {
# If -Total specified, output summary object.
if ( $Total ) {
$output = "" | Select-Object `
@{Name="Path"; Expression={"<Total>"}},
@{Name="Files"; Expression={$script:totalcount}},
@{Name="Size"; Expression={$script:totalbytes}}
if ( -not $FormatNumbers ) { $output } else { $output | Format-Output }
}
}
The usage I'm after is:
PS C:\Users\user\Desktop> .\Get-DirStats.ps1 "C:\Program Files" 2
where 2 is the depth.
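One possible way to limit the depth without relying on wildcard tricks is a small recursive helper that stops descending once the requested depth is used up. This is only a sketch: Get-SubdirectoryToDepth is a hypothetical name, not part of Get-DirStats.ps1, and it assumes that depth 1 means "immediate subdirectories only". It uses only features available in PowerShell 2.0.

```powershell
# Hypothetical helper (not part of Get-DirStats.ps1): returns the
# subdirectories of $Path down to $Depth levels below it.
function Get-SubdirectoryToDepth {
    param(
        [string] $Path,
        [int] $Depth = 1
    )
    # Depth exhausted: stop descending.
    if ( $Depth -lt 1 ) { return }
    Get-ChildItem -LiteralPath $Path -Force |
        Where-Object { $_.PSIsContainer } |
        ForEach-Object {
            $_   # emit this directory
            # Then emit its children, with one level less to go.
            Get-SubdirectoryToDepth -Path $_.FullName -Depth ($Depth - 1)
        }
}

# Assumed usage inside the script's process block, in place of the
# "$directory | Get-ChildItem -Force -Recurse:$Every" pipeline, with $Depth
# added as a new [int] parameter of the script:
#   Get-SubdirectoryToDepth -Path $directory.FullName -Depth $Depth |
#       ForEach-Object {
#           Get-DirectoryStats -directory $_ -recurse:$false -format:$FormatNumbers
#       }
```

With -recurse:$false each directory reports only its own files; whether the deepest level should instead include everything below it is a design choice the question leaves open.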

Related

How to make Powershell Extract Metadata script run faster?

I have a script that I use to extract metadata from files in a network directory. It originates here, and I modified it to obtain additional metadata (file size, file hash, creation date, and last write time), but these additions appear to slow the script down to the point that it takes weeks to complete when there are more than 10000 files.
To illustrate the impact of the script additions on the speed, I ran the script on a folder containing five documents:
original script (no get-item or get-file hash lines): 2.9794699 seconds
with 'get-item' lines (size, filehash, created, lastwritetime): 7.6295035 seconds
with 'get-filehash' line: 6.9363834 seconds
with 'get-item' lines and 'get-filehash' lines: 12.4516334 seconds
I tried putting all the get-item lines together in a for-loop thinking that it would be faster to retrieve the file once from the network, then extract the metadata. While this modified script runs at a much faster 8.6488492 seconds, the metadata fields are not included in the output.
Here's the original script:
#Works on Powershell version 5.1
#The filepath of the folder being printed and the filepath where the output file will be placed need to be specified in the last line of script.
Function Get-FolderItem {
[cmdletbinding(DefaultParameterSetName='Filter')]
Param (
[parameter(Position=0,ValueFromPipeline=$True,ValueFromPipelineByPropertyName=$True)]
[Alias('FullName')]
[string[]]$Path = $PWD,
[parameter(ParameterSetName='Filter')]
[string[]]$Filter = '*.*',
[parameter(ParameterSetName='Exclude')]
[string[]]$ExcludeFile,
[parameter()]
[int]$MaxAge,
[parameter()]
[int]$MinAge
)
Begin {
$params = New-Object System.Collections.Arraylist
$params.AddRange(@("/L","/E","/NJH","/NDL","/BYTES","/FP","/NC","/XJ","/R:0","/W:0","T:W"))
If ($PSBoundParameters['MaxAge']) {
$params.Add("/MaxAge:$MaxAge") | Out-Null
}
If ($PSBoundParameters['MinAge']) {
$params.Add("/MinAge:$MinAge") | Out-Null
}
}
Process {
ForEach ($item in $Path) {
Try {
$item = (Resolve-Path -LiteralPath $item -ErrorAction Stop).ProviderPath
If (-Not (Test-Path -LiteralPath $item -Type Container -ErrorAction Stop)) {
Write-Warning ("{0} is not a directory and will be skipped" -f $item)
Return
}
If ($PSBoundParameters['ExcludeFile']) {
$Script = "robocopy `"$item`" NULL $Filter $params /XF $($ExcludeFile -join ',')"
} Else {
$Script = "robocopy `"$item`" NULL $Filter $params"
}
Write-Verbose ("Scanning {0}" -f $item)
Invoke-Expression $Script | ForEach {
Try {
If ($_.Trim() -match "^(?<Children>\d+)\s+(?<FullName>.*)") {
$object = New-Object PSObject -Property @{
FullName = $matches.FullName
Extension = $matches.fullname -replace '.*\.(.*)','$1'
FullPathLength = [int] $matches.FullName.Length
Length = (Get-Item $matches.FullName).length
FileHash = Get-FileHash -Path "\\?\$($matches.FullName)" |Select -Expand Hash
Created = (Get-Item $matches.FullName).creationtime
LastWriteTime = (Get-Item $matches.FullName).LastWriteTime
Owner = (Get-ACL $matches.Fullname).Owner
}
$object.pstypenames.insert(0,'System.IO.RobocopyDirectoryInfo')
Write-Output $object
} Else {
Write-Verbose ("Not matched: {0}" -f $_)
}
} Catch {
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
} Catch {
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
}
}
Get-FolderItem "O:\directory\to\files" | Export-Csv -Path C:\output.csv
Does anyone know how to make the script run faster?
Rather than (Get-Item $matches.FullName).length you could do ([System.IO.FileInfo]$Matches.FullName).length. I see much better performance from that (Get-Item taking about 3.5x longer). Same for LastWriteTime.
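Building on that suggestion, here is a minimal sketch that creates the FileInfo object once and reads all the properties from it, instead of calling Get-Item separately for each field. The variable $fullName is a hypothetical stand-in for $matches.FullName, and the temporary file exists only to make the example self-contained. Whether this helps on a network share is an assumption that would need measuring there.

```powershell
# Stand-in for $matches.FullName: a throwaway empty file.
$fullName = [System.IO.Path]::GetTempFileName()

# Create the FileInfo once; its metadata is loaded on first property access
# and then reused for the other properties.
$info = [System.IO.FileInfo] $fullName

$object = New-Object PSObject -Property @{
    FullName      = $fullName
    Length        = $info.Length
    Created       = $info.CreationTime
    LastWriteTime = $info.LastWriteTime
}
$object

Remove-Item -LiteralPath $fullName   # clean up the throwaway file
```

Get-FileHash still has to read the whole file, so the hash is likely to remain the most expensive column regardless.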

Calculate the hash of a file longer than 256 characters?

I am using Boe Prox's script to print a list of all the files and folders in a directory. I need Prox's script (as opposed to other windows print directory commands) because it uses robocopy to print filepaths longer than 260 characters.
My problem is, I also want the filehash to be printed alongside the file name. I have modified the script to obtain hashes (please see below) and it generally works except when the filepath is longer than 260 characters. Long filepaths get a blank in the hash column of the final output.
Research I have done:
According to this stackoverflow question, Robocopy has several switches that can be modified. I have scoured Boe's blog, as well as the list of robocopy commands, but there is nothing about a filehash switch.
Attempts to fix the problem:
I have also tried to modify the syntax of the filehash to bring it more in line with the rest of the script, i.e. Hash = $matches.Hash
(this returns all blanks in place of the file hashes)
I tried taking off the part of the regex that seems to specify an item rather than the content of the item, i.e. If ($_.Trim() -match "^(?<Children>\d+)\s+") {
(this leads to the error WARNING: Cannot bind argument to parameter 'Path' because it is null.)
I'm pretty hopeful that this can be done, though: the comments in Boe's original script include the line "Version 1.1 -Added ability to calculate file hashes"
Here is my (partially working) script:
Function Get-FolderItem {
[cmdletbinding(DefaultParameterSetName='Filter')]
Param (
[parameter(Position=0,ValueFromPipeline=$True,ValueFromPipelineByPropertyName=$True)]
[Alias('FullName')]
[string[]]$Path = $PWD,
[parameter(ParameterSetName='Filter')]
[string[]]$Filter = '*.*',
[parameter(ParameterSetName='Exclude')]
[string[]]$ExcludeFile,
[parameter()]
[int]$MaxAge,
[parameter()]
[int]$MinAge
)
Begin {
$params = New-Object System.Collections.Arraylist
$params.AddRange(@("/L","/E","/NJH","/BYTES","/FP","/NC","/XJ","/R:0","/W:0","T:W"))
If ($PSBoundParameters['MaxAge']) {
$params.Add("/MaxAge:$MaxAge") | Out-Null
}
If ($PSBoundParameters['MinAge']) {
$params.Add("/MinAge:$MinAge") | Out-Null
}
}
Process {
ForEach ($item in $Path) {
Try {
$item = (Resolve-Path -LiteralPath $item -ErrorAction Stop).ProviderPath
If (-Not (Test-Path -LiteralPath $item -Type Container -ErrorAction Stop)) {
Write-Warning ("{0} is not a directory and will be skipped" -f $item)
Return
}
If ($PSBoundParameters['ExcludeFile']) {
$Script = "robocopy `"$item`" NULL $Filter $params /XF $($ExcludeFile -join ',')"
} Else {
$Script = "robocopy `"$item`" NULL $Filter $params"
}
Write-Verbose ("Scanning {0}" -f $item)
Invoke-Expression $Script | ForEach {
Try {
If ($_.Trim() -match "^(?<Children>\d+)\s+(?<FullName>.*)") {
$object = New-Object PSObject -Property @{
FullName = $matches.FullName
Extension = $matches.fullname -replace '.*\.(.*)','$1'
FullPathLength = [int] $matches.FullName.Length
Length = [int64]$matches.Size
FileHash = (Get-FileHash -Path $matches.FullName).Hash
Created = (Get-Item $matches.FullName).creationtime
LastWriteTime = (Get-Item $matches.FullName).LastWriteTime
}
$object.pstypenames.insert(0,'System.IO.RobocopyDirectoryInfo')
Write-Output $object
} Else {
Write-Verbose ("Not matched: {0}" -f $_)
}
} Catch {
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
} Catch {
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
}
}
Get-FolderItem "C:\TestingFileFolders"
In Windows PowerShell, you can prepend \\?\ to the full path and read the file:
Get-FileHash -Path "\\?\$($matches.FullName)"
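Note that the plain \\?\ prefix applies to drive-letter paths; for UNC paths the Win32 extended-length form is \\?\UNC\server\share\... instead. A small sketch of a helper that picks the right prefix is below. ConvertTo-LongPath is a hypothetical name, it only manipulates strings, and it assumes the input is already an absolute path; whether a given cmdlet accepts the prefixed path depends on the PowerShell version.

```powershell
# Hypothetical helper: adds the extended-length prefix to an absolute path.
function ConvertTo-LongPath {
    param( [string] $FullName )
    # Already prefixed: return unchanged.
    if ( $FullName.StartsWith('\\?\') ) { return $FullName }
    # UNC path (\\server\share\...): drop the leading \\ and use \\?\UNC\.
    if ( $FullName.StartsWith('\\') ) { return '\\?\UNC\' + $FullName.Substring(2) }
    # Drive-letter path (C:\...).
    return '\\?\' + $FullName
}

ConvertTo-LongPath 'C:\TestingFileFolders\file.txt'   # \\?\C:\TestingFileFolders\file.txt
ConvertTo-LongPath '\\server\share\file.txt'          # \\?\UNC\server\share\file.txt
```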

Powershell Script is printing out duplicate entries of the same path

My objective is to write a powershell script that will recursively check a file server for any directories that are "x" (insert days) old or older.
I ran into a few issues initially, and I think I got most of it worked out. One of the issues I ran into was with the path limitation of 248 characters. I found a custom function that I am implementing in my code to bypass this limitation.
The end result is I would like to output the path and LastAccessTime of the folder and export the information into an easy to read csv file.
Currently everything is working properly, but for some reason I get some paths output several times (duplicates, triples, even 4 times). I just want it output once for each directory and subdirectory.
I'd appreciate any guidance I can get. Thanks in advance.
Here's my code
#Add the import and snapin in order to perform AD functions
Add-PSSnapin Quest.ActiveRoles.ADManagement -ea SilentlyContinue
Import-Module ActiveDirectory
#Clear Screen
CLS
Function Get-FolderItem
{
[cmdletbinding(DefaultParameterSetName='Filter')]
Param (
[parameter(Position=0,ValueFromPipeline=$True,ValueFromPipelineByPropertyName=$True)]
[Alias('FullName')]
[string[]]$Path = $PWD,
[parameter(ParameterSetName='Filter')]
[string[]]$Filter = '*.*',
[parameter(ParameterSetName='Exclude')]
[string[]]$ExcludeFile,
[parameter()]
[int]$MaxAge,
[parameter()]
[int]$MinAge
)
Begin
{
$params = New-Object System.Collections.Arraylist
$params.AddRange(@("/L","/S","/NJH","/BYTES","/FP","/NC","/NFL","/TS","/XJ","/R:0","/W:0"))
If ($PSBoundParameters['MaxAge'])
{
$params.Add("/MaxAge:$MaxAge") | Out-Null
}
If ($PSBoundParameters['MinAge'])
{
$params.Add("/MinAge:$MinAge") | Out-Null
}
}
Process
{
ForEach ($item in $Path)
{
Try
{
$item = (Resolve-Path -LiteralPath $item -ErrorAction Stop).ProviderPath
If (-Not (Test-Path -LiteralPath $item -Type Container -ErrorAction Stop))
{
Write-Warning ("{0} is not a directory and will be skipped" -f $item)
Return
}
If ($PSBoundParameters['ExcludeFile'])
{
$Script = "robocopy `"$item`" NULL $Filter $params /XF $($ExcludeFile -join ',')"
}
Else
{
$Script = "robocopy `"$item`" NULL $Filter $params"
}
Write-Verbose ("Scanning {0}" -f $item)
Invoke-Expression $Script | ForEach {
Try
{
If ($_.Trim() -match "^(?<Children>\d+)\s+(?<FullName>.*)")
{
$object = New-Object PSObject -Property @{
ParentFolder = $matches.fullname -replace '(.*\\).*','$1'
FullName = $matches.FullName
Name = $matches.fullname -replace '.*\\(.*)','$1'
}
$object.pstypenames.insert(0,'System.IO.RobocopyDirectoryInfo')
Write-Output $object
}
Else
{
Write-Verbose ("Not matched: {0}" -f $_)
}
}
Catch
{
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
}
Catch
{
Write-Warning ("{0}" -f $_.Exception.Message)
Return
}
}
}
}
Function ExportFolders
{
#================ Global Variables ================
#Path to folders
$Dir = "\\myFileServer\somedir\blah"
#Get all folders
$ParentDir = Get-ChildItem $Dir | Where-Object {$_.PSIsContainer -eq $True}
#Export file to our destination
$ExportedFile = "c:\temp\dirFolders.csv"
#Duration in Days+ the file hasn't triggered "LastAccessTime"
$duration = 800
$cutOffDate = (Get-Date).AddDays(-$duration)
#Used to hold our information
$results = @()
#=============== Done with Variables ===============
ForEach ($SubDir in $ParentDir)
{
$FolderPath = $SubDir.FullName
$folders = Get-ChildItem -Recurse $FolderPath -force -directory| Where-Object { ($_.LastAccessTimeUtc -le $cutOffDate)} | Select-Object FullName, LastAccessTime
ForEach ($folder in $folders)
{
$folderPath = $folder.fullname
$fixedFolderPaths = ($folderPath | Get-FolderItem).fullname
ForEach ($fixedFolderPath in $fixedFolderPaths)
{
#$fixedFolderPath
$getLastAccessTime = $(Get-Item $fixedFolderPath -force).lastaccesstime
#$getLastAccessTime
$details = @{ "Folder Path" = $fixedFolderPath; "LastAccessTime" = $getLastAccessTime}
$results += New-Object PSObject -Property $details
$results
}
}
}
}
ExportFolders
I updated my code a bit and simplified it. Here is the new code.
#Add the import and snapin in order to perform AD functions
Add-PSSnapin Quest.ActiveRoles.ADManagement -ea SilentlyContinue
Import-Module ActiveDirectory
#Clear Screen
CLS
Function ExportFolders
{
#================ Global Variables ================
#Path to user profiles in Barrington
$Dir = "\\myFileServer\somedir\blah"
#Get all user folders
$ParentDir = Get-ChildItem $Dir | Where-Object {$_.PSIsContainer -eq $True} | where {$_.GetFileSystemInfos().Count -eq 0 -or $_.GetFileSystemInfos().Count -gt 0}
#Export file to our destination
$ExportedFile = "c:\temp\dirFolders.csv"
#Duration in Days+ the file hasn't triggered "LastAccessTime"
$duration = 1
$cutOffDate = (Get-Date).AddDays(-$duration)
#Used to hold our information
$results = @()
$details = $null
#=============== Done with Variables ===============
ForEach ($SubDir in $ParentDir)
{
$FolderName = $SubDir.FullName
$FolderInfo = $(Get-Item $FolderName -force) | Select-Object FullName, LastAccessTime #| ft -HideTableHeaders
$FolderLeafs = gci -Recurse $FolderName -force -directory | Where-Object {$_.PSIsContainer -eq $True} | where {$_.GetFileSystemInfos().Count -eq 0 -or $_.GetFileSystemInfos().Count -gt 0} | Select-Object FullName, LastAccessTime #| ft -HideTableHeaders
$details = @{ "LastAccessTime" = $FolderInfo.LastAccessTime; "Folder Path" = $FolderInfo.FullName}
$results += New-Object PSObject -Property $details
ForEach ($FolderLeaf in $FolderLeafs.fullname)
{
$details = @{ "LastAccessTime" = $(Get-Item $FolderLeaf -force).LastAccessTime; "Folder Path" = $FolderLeaf}
$results += New-Object PSObject -Property $details
}
$results
}
}
ExportFolders
The FolderInfo variable is sometimes printed multiple times, but the FolderLeaf variable is printed once from what I can see. The problem is that if I move or remove the results variable from under the details that print the FolderInfo, then the parent directories don't get printed; only the subdirectories are shown. Also, some directories are empty and don't get printed, and I want all directories printed, including empty ones.
The updated code seems to print all directories fine, but as I mentioned I am still getting some duplicate $FolderInfo variables.
I think I have to put in a condition or something to check if it has already been processed, but I'm not sure which condition I would use to do that, so that it wouldn't print out multiple times.
In your ExportFolders you call Get-ChildItem -Recurse and then loop over all of the subfolders, calling Get-FolderItem on each. Then in Get-FolderItem you pass Robocopy the /S flag in $params.AddRange(@("/L", "/S", "/NJH", "/BYTES", "/FP", "/NC", "/NFL", "/TS", "/XJ", "/R:0", "/W:0")). The /S flag means "copy subdirectories, but not empty ones", so you are recursing again. Likely you just need to remove the /S flag so that all of your recursion happens in ExportFolders.
In response to the edit:
Your $results is inside the loop, so it is output again on every iteration: you will get n copies of the entries for the first $SubDir, n-1 copies for the second, and so forth.
ForEach ($SubDir in $ParentDir) {
#skipped code
ForEach ($FolderLeaf in $FolderLeafs.fullname) {
#skipped code
}
$results
}
should be
ForEach ($SubDir in $ParentDir) {
#skipped code
ForEach ($FolderLeaf in $FolderLeafs.fullname) {
#skipped code
}
}
$results
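To see the effect in isolation, here is a tiny sketch with three hypothetical folder names standing in for the real directories. $emittedInside collects what a "$results" line inside the loop would write to the pipeline on each pass.

```powershell
$results = @()
$emittedInside = @()

foreach ($name in 'A', 'B', 'C') {
    $results += New-Object PSObject -Property @{ 'Folder Path' = $name }
    # Equivalent of having "$results" inside the loop: the whole array so far
    # is emitted again on every iteration.
    $emittedInside += $results
}

$emittedInside.Count   # 6: A, then A B, then A B C
$results.Count         # 3: each folder exactly once
```

Once $results is emitted a single time after the loop, it can also be piped straight to the CSV the question asks for, for example with $results | Export-Csv -Path $ExportedFile -NoTypeInformation.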

Retrieve matching strings from two text files

I have a text file file_paths.txt that contains full paths on each line:
C:\MyFolder1\app1.exe
C:\MyFolder2\l1.dll
C:\MyFolder3\app2.exe
C:\MyFolder1\l2.dll
C:\MyFolder5\app3.exe
C:\MyFolder3\app4.exe
C:\MyFolder6\app5.exe
I also have file folders.txt that contains list of folders:
C:\MyFolder1
C:\MyFolder2
C:\MyFolder3
C:\MyFolder4
C:\MyFolder8
I need to iterate through the list of folders in folders.txt, match it with files in file_paths.txt and write the results to a file result.txt like this:
In C:\MyFolder1 more than one files has been found:
C:\MyFolder1\app1.exe
C:\MyFolder1\l2.dll
In C:\MyFolder2 one file has been:
C:\MyFolder2\l1.dll
In C:\MyFolder3 more than one files has been found:
C:\MyFolder3\app2.exe
C:\MyFolder3\app4.exe
In C:\MyFolder4 no files has been found.
In C:\MyFolder8 no files has been found.
My attempt that doesn't work:
$paths = [System.IO.File]::OpenText("file_paths.txt")
$folders = [System.IO.File]::OpenText("folders.txt")
$result = "result.txt"
try {
for(;;) {
$folder = $folders.ReadLine()
if ($folder -eq $null) { break }
"In ">> $folder >> ": `n" >> $result
for(;;) {
$path = $paths.ReadLine()
if ($path -eq $null) { break }
if ($path -contains $folder) {" ">>$path>>"`n">>$result }
}
}
} finally {
$paths.Close()
$folders.Close()
}
I would separate processing from reporting. First build a hashtable from the contents of folders.txt and add the lines from file_paths.txt to the matching keys:
$folders = @{}
Get-Content 'folders.txt' | ForEach-Object { $folders[$_] = @() }
Get-Content 'file_paths.txt' | ForEach-Object {
$line = $_
$($folders.Keys) | Where-Object {
$line -like "$_*"
} | ForEach-Object {
$folders[$_] += $line
}
}
Then you can output the resulting data structure like this:
$folders.Keys | ForEach-Object {
'In {0} {1} files have been found' -f $_, $folders[$_].Count
if ($folders[$_].Count -gt 0) {
$folders[$_] | ForEach-Object { "`t$_" }
}
} | Out-File 'result.txt'
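One caveat with the -like "$_*" test: it is a plain prefix match, so C:\MyFolder1 would also match a path under a folder such as C:\MyFolder10, and characters like [ in a folder name are treated as wildcards. A possible variation, sketched below with inline sample data instead of the two text files, compares against the folder plus a trailing backslash. The C:\MyFolder10 entry is a made-up example added to show the difference.

```powershell
# Inline stand-ins for folders.txt and file_paths.txt.
$folderList = 'C:\MyFolder1', 'C:\MyFolder2'
$pathList   = 'C:\MyFolder1\app1.exe', 'C:\MyFolder10\other.exe', 'C:\MyFolder2\l1.dll'

$folders = @{}
$folderList | ForEach-Object { $folders[$_] = @() }

foreach ($line in $pathList) {
    foreach ($folder in $folderList) {
        # Require the separator after the folder name so that
        # C:\MyFolder1 does not match C:\MyFolder10\...
        $prefix = $folder.TrimEnd('\') + '\'
        if ( $line.StartsWith($prefix, [System.StringComparison]::OrdinalIgnoreCase) ) {
            $folders[$folder] += $line
        }
    }
}

$folders['C:\MyFolder1']   # C:\MyFolder1\app1.exe
```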
Below is a script you can use to do exactly what you need.
Note the $folderPath and $filePath variables. Replace with absolute or relative (to where you execute the script) path of the file_paths.txt and folders.txt files.
$folderPath = 'folders.txt'
$filePath = 'file_paths.txt'
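# Get-Content already returns one string per line, so no Split is needed.
Get-Content $folderPath | ForEach-Object {
$folder = $_
$count = 0
$fileArray = @()
Get-Content $filePath | ForEach-Object {
$file = $_
# -SimpleMatch treats the folder as literal text, not as a regular expression.
if( $file | Select-String $folder -SimpleMatch -Quiet ) {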
$count++
$fileArray += $file
}
}
if($count -ne 0) {
Write-Output "In $folder, $count files has been found."
$fileArray | ForEach-Object {
Write-Output "`t$_"
}
} else {
Write-Output "In $folder, no files has been found."
}
}

Removing duplicate files with Powershell

I have several thousand duplicate files (jar files as an example), and I'd like to use PowerShell to:
Search through the file system recursively
Find the dups (either by name only or a checksum method or both)
Delete all duplicates but one.
I'm new to powershell and am throwing this out there to the PS folks that might be able to help.
try this:
ls *.txt -recurse | get-filehash | group -property hash | where { $_.count -gt 1 } | % { $_.group | select -skip 1 } | del
from: http://n3wjack.net/2015/04/06/find-and-delete-duplicate-files-with-just-powershell/
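Since that one-liner deletes immediately, it may be worth previewing first. Below is a sketch of the same pipeline with Remove-Item -WhatIf, which only reports what would be removed. The temporary folder and its three files are made up so that the example is self-contained, and Get-FileHash requires PowerShell 4.0 or later.

```powershell
# Build a small throwaway folder: a.txt and b.txt are identical, c.txt differs.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -LiteralPath (Join-Path $dir 'a.txt') -Value 'same'
Set-Content -LiteralPath (Join-Path $dir 'b.txt') -Value 'same'
Set-Content -LiteralPath (Join-Path $dir 'c.txt') -Value 'different'

# Group by content hash and keep the first file of each group.
$duplicates = @(Get-ChildItem -LiteralPath $dir |
    Get-FileHash |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object -Skip 1 })

# -WhatIf only reports what would be removed; nothing is deleted here.
$duplicates | ForEach-Object { Remove-Item -LiteralPath $_.Path -WhatIf }

Remove-Item -LiteralPath $dir -Recurse -Force   # clean up the throwaway folder
```

Dropping -WhatIf once the list looks right performs the actual deletion.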
Keep a dictionary of files, delete when the next file name was already encountered before:
$dict = @{};
dir c:\admin -Recurse | foreach {
$key = $_.Name #replace this with your checksum function
$find = $dict[$key];
if($find -ne $null) {
#current file is a duplicate
#Remove-Item -Path $_.FullName ?
}
$dict[$key] = 0; #dummy placeholder to save memory
}
I used file name as a key, but you can use a checksum if you want (or both) - see code comment.
Even though the question is old, I needed to clean up all duplicate files based on content. The idea is simple, but the algorithm is not straightforward. Here is the code, which accepts a "path" parameter to delete duplicates from.
Function Delete-Duplicates {
param(
[Parameter(
Mandatory=$True,
ValueFromPipeline=$True,
ValueFromPipelineByPropertyName=$True
)]
[string[]]$PathDuplicates)
$DuplicatePaths =
Get-ChildItem $PathDuplicates |
Get-FileHash |
Group-Object -Property Hash |
Where-Object -Property Count -gt 1 |
ForEach-Object {
$_.Group.Path |
Select -First ($_.Count -1)}
$TotalCount = (Get-ChildItem $PathDuplicates).Count
Write-Warning ("You are going to delete {0} files out of {1} total. Please confirm the prompt" -f $DuplicatePaths.Count, $TotalCount)
$DuplicatePaths | Remove-Item -Confirm
}
The script
a) Lists all ChildItems
b) Retrieves FileHash from them
c) Groups them by Hash Property (so all the same files are in the single group)
d) Filters out the already-unique files (count of group -eq 1)
e) Loops through each group and lists all but last paths - ensuring one file of each "Hash" always stays
f) Warns before proceeding, saying how many files there are in total and how many are going to be deleted.
Probably not the most performant option (it hashes every file; Get-FileHash uses SHA256 by default), but it ensures each deleted file really is a duplicate.
Works perfectly fine for me :)
Evolution of @KaiWang's answer which:
Avoids calculating hash of every single file by comparing file length first;
Allows choosing which file you want (here it keeps the file with the longest name).
Get-ChildItem *.ttf -Recurse |
Group -Property Length |
Where { $_.Count -gt 1 } |
ForEach { $_.Group } |
ForEach { $_ } |
Get-FileHash -Algorithm 'MD5' |
Group -Property Hash |
Where { $_.Count -gt 1 } |
ForEach {
$_.Group |
Sort -Property @{ Expression = { $_.Path.Length } } |
Select -SkipLast 1
} |
ForEach { $_.Path } |
ForEach {
Write-Host $_
Del -LiteralPath $_
}
Instead of just removing your duplicate files, you can replace them with a shortcut:
#requires -version 3
<#
.SYNOPSIS
Duplicate cleanup script
.DESCRIPTION
Finds duplicates by size, compares their MD5 checksums, and groups them by size and MD5.
Can replace each duplicate with a shortcut to the first file, the original.
.PARAMETER Path
Path in which to search for duplicates
.PARAMETER ReplaceByShortcut
If specified, the duplicates are replaced
.PARAMETER MinLength
Ignores files smaller than this size (in bytes)
.EXAMPLE
.\Clean-Duplicate '\\dfs.adds\donnees\commun'
.EXAMPLE
Searches for duplicates of 10 KB and larger
.\Clean-Duplicate '\\dfs.adds\donnees\commun' -MinLength 10000
.EXAMPLE
.\Clean-Duplicate '\\dpm1\d$\Coaxis\Logiciels' -ReplaceByShortcut
#>
[CmdletBinding()]
param (
[string]$Path = '\\Contoso.adds\share$\path\data',
[switch]$ReplaceByShortcut = $false,
[int]$MinLength = 10*1024*1024 # 10 MB
)
$version = '1.0'
function Create-ShortCut ($ShortcutPath, $shortCutName, $Target) {
$link = "$ShortcutPath\$shortCutName.lnk"
$WshShell = New-Object -ComObject WScript.Shell
$Shortcut = $WshShell.CreateShortcut($link)
$Shortcut.TargetPath = $Target
#$Shortcut.Arguments ="shell32.dll,Control_RunDLL hotplug.dll"
#$Shortcut.IconLocation = "hotplug.dll,0"
$Shortcut.Description ="Copy Doublon"
#$Shortcut.WorkingDirectory ="C:\Windows\System32"
$Shortcut.Save()
# write-host -fore Cyan $link -nonewline; write-host -fore Red ' >> ' -nonewline; write-host -fore Yellow $Target
return $link
}
function Replace-ByShortcut {
Param(
[Parameter(ValueFromPipeline=$true,ValueFromPipelineByPropertyName=$true)]
$SameItems
)
begin{
$result = [pscustomobject][ordered]@{
Replaced = @()
Gain = 0
Count = 0
}
}
Process{
$Original = $SameItems.group[0]
foreach ($doublon in $SameItems.group) {
if ($doublon -ne $Original) {
$result.Replaced += [pscustomobject][ordered]@{
lnk = Create-Shortcut -ShortcutPath $doublon.DirectoryName -shortCutName $doublon.BaseName -Target $Original.FullName
target = $Original.FullName
size = $doublon.Length
}
$result.Gain += $doublon.Length
$result.Count++
Remove-item $doublon.FullName -force
}
}
}
End{
$result
}
}
function Get-MD5 {
param (
[Parameter(Mandatory)]
[string]$Path
)
$HashAlgorithm = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$Stream = [System.IO.File]::OpenRead($Path)
try {
$HashByteArray = $HashAlgorithm.ComputeHash($Stream)
} finally {
$Stream.Dispose()
}
return [System.BitConverter]::ToString($HashByteArray).ToLowerInvariant() -replace '-',''
}
if (-not $Path) {
if ((Get-Location).Provider.Name -ne 'FileSystem') {
Write-Error 'Specify a file system path explicitly, or change the current location to a file system path.'
return
}
$Path = (Get-Location).ProviderPath
}
$DuplicateFiles = Get-ChildItem -Path $Path -Recurse -File |
Where-Object { $_.Length -gt $MinLength } |
Group-Object -Property Length |
Where-Object { $_.Count -gt 1 } |
ForEach-Object {
$_.Group |
ForEach-Object {
$_ | Add-Member -MemberType NoteProperty -Name ContentHash -Value (Get-MD5 -Path $_.FullName)
}
$_.Group |
Group-Object -Property ContentHash |
Where-Object { $_.Count -gt 1 }
}
$somme = ($DuplicateFiles.group | Measure-Object length -Sum).sum
write-host "$($DuplicateFiles.group.count) duplicates, totalling $($somme/1024/1024) MB" -fore cyan
if ($ReplaceByShortcut) {
$DuplicateFiles | Replace-ByShortcut
} else {
$DuplicateFiles
}