PowerShell script to find duplicate files

I found a PowerShell script on TechNet to help locate duplicate files in folders. However, when I run it, I am getting an error on what appears to be every folder\file. Not sure what switch is supposed to be used in this.
$Path = '\\servername\Share\Folders' # define path to folders to find duplicate files
$Files = gci -File -Recurse -Path $Path | Select-Object -Property FullName, Length
$Count = 1
$TotalFiles = $Files.Count
$MatchedSourceFiles = @()
ForEach ($SourceFile in $Files)
{
    Write-Progress -Activity "Processing Files" -Status "Processing File $Count / $TotalFiles" -PercentComplete ($Count / $TotalFiles * 100)
    $MatchingFiles = @()
    $MatchingFiles = $Files | Where-Object {$_.Length -eq $SourceFile.Length}
    Foreach ($TargetFile in $MatchingFiles)
    {
        if (($SourceFile.FullName -ne $TargetFile.FullName) -and !(($MatchedSourceFiles |
            Select-Object -ExpandProperty File) -contains $TargetFile.FullName))
        {
            Write-Verbose "Matching $($SourceFile.FullName) and $($TargetFile.FullName)"
            Write-Verbose "File sizes match."
            if ((fc.exe /A $SourceFile.FullName $TargetFile.FullName) -contains "FC: no differences encountered")
            {
                Write-Verbose "Match found."
                $MatchingFiles += $TargetFile.FullName
            }
        }
    }
    if ($MatchingFiles.Count -gt 0)
    {
        $NewObject = [pscustomobject][ordered]@{
            File = $SourceFile.FullName
            MatchingFiles = $MatchingFiles
        }
        $MatchedSourceFiles += $NewObject
    }
    $Count += 1
}
$MatchedSourceFiles
Errors
FC: Insufficient number of file specifications
fc.exe : FC: Invalid Switch
At line:18 char:12
gci : Could not find a part of the path
At line:2 char:8
fc.exe : FC: Invalid Switch
At line:18 char:12

To fix your fc.exe error and optimize your script, I also recommend @rich-moss's solution.
But if you only want to find duplicates, you can easily accomplish so by checking their hashes.
Example:
$Duplicates = Get-ChildItem -File -Recurse | Get-FileHash | Group-Object -Property Hash | Where-Object Count -gt 1
If ($Duplicates.Count -lt 1) {
    $null # 'No duplicates found. Do stuff ...'
} else {
    $result = foreach ($d in $Duplicates) {
        $d.Group | Select-Object -Property Path, Hash
    }
    $result
}
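Note that hashing reads every file in full. On large trees you can group by size first, so only files that share a size ever get hashed; a sketch of that refinement:
# Only files whose Length matches at least one other file get hashed.
$Duplicates = Get-ChildItem -File -Recurse |
    Group-Object -Property Length | Where-Object Count -gt 1 |
    ForEach-Object { $_.Group } |
    Get-FileHash |
    Group-Object -Property Hash | Where-Object Count -gt 1
$Duplicates.Group | Select-Object -Property Path, Hash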

The script you provided is very inefficient and provides false positives in my tests. It's inefficient because it compares every file twice (Source->Target and Target->Source) and because it iterates through all files regardless of size. Here's a quicker version that gathers the files into groups of similarly sized files and only executes FC.EXE once per pair of files:
$Path = 'C:\Temp'
$SameSizeFiles = gci -Path $Path -File -Recurse | Select FullName, Length | Group-Object Length | ? {$_.Count -gt 1} # the list of files with the same size
$MatchingFiles = @()
$GroupNdx = 1
Foreach ($SizeGroup in ($SameSizeFiles | Select Group)) {
    For ($FromNdx = 0; $FromNdx -lt $SizeGroup.Group.Count - 1; $FromNdx++) {
        For ($ToNdx = $FromNdx + 1; $ToNdx -lt $SizeGroup.Group.Count; $ToNdx++) {
            If ((fc.exe /A $SizeGroup.Group[$FromNdx].FullName $SizeGroup.Group[$ToNdx].FullName) -contains "FC: no differences encountered") {
                $MatchingFiles += [pscustomobject]@{ File = $SizeGroup.Group[$FromNdx].FullName; Match = $SizeGroup.Group[$ToNdx].FullName }
            }
        }
    }
    Write-Progress -Activity "Finding Duplicates" -Status "Processing group $GroupNdx of $($SameSizeFiles.Count)" -PercentComplete ($GroupNdx / $SameSizeFiles.Count * 100)
    $GroupNdx += 1
}
$MatchingFiles
Efficiency will be even more important if you're running it over the network. You may find it quicker to execute the script on the server itself, rather than from a share. There is some discussion here about the fastest way to compare files in .Net.
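As a rough illustration of that .NET idea, here is a sketch of a chunked byte comparison (the function name and buffer size are my own choices; it expects full paths, since .NET does not track PowerShell's current location):
function Test-FilesEqual {
    param([string]$PathA, [string]$PathB)
    # open both files for sequential reading
    $a = [System.IO.File]::OpenRead($PathA)
    $b = [System.IO.File]::OpenRead($PathB)
    try {
        # files of different length can never be equal
        if ($a.Length -ne $b.Length) { return $false }
        $bufA = New-Object byte[] 65536
        $bufB = New-Object byte[] 65536
        while (($readA = $a.Read($bufA, 0, $bufA.Length)) -gt 0) {
            $readB = $b.Read($bufB, 0, $bufB.Length)
            if ($readA -ne $readB) { return $false }
            # stop at the first differing byte
            for ($i = 0; $i -lt $readA; $i++) {
                if ($bufA[$i] -ne $bufB[$i]) { return $false }
            }
        }
        return $true
    }
    finally {
        $a.Dispose()
        $b.Dispose()
    }
}
Usage: Test-FilesEqual 'C:\Temp\a.bin' 'C:\Temp\b.bin' returns $true or $false, and it avoids launching an external process for every pair of files.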

Related

Powershell script to compare two directories (including sub directories and contents) that are supposed to be identical but on different servers

I would like to run a PowerShell script that can be supplied a directory name by the user; it should then check that directory, its sub-directories, and all file contents of those directories to confirm they are identical across servers. There are 8 servers that should all have identical files and contents. The code below does not appear to be doing what I intended. I have seen the use of Compare-Object, Get-ChildItem, and Get-FileHash, but have not found the right combination that I am certain actually accomplishes the task. Any and all help is appreciated!
$35 = "\\server1\"
$36 = "\\server2\"
$37 = "\\server3\"
$38 = "\\server4\"
$45 = "\\server5\"
$46 = "\\server6\"
$47 = "\\server7\"
$48 = "\\server8\"
do {
    Write-Host "|1 : New |"
    Write-Host "|2 : Repeat|"
    Write-Host "|3 : Exit |"
    $choice = Read-Host -Prompt "Please make a selection"
    switch ($choice) {
        1 {
            $App = Read-Host -Prompt "Input Directory Application"
        }
        2 {
            # rerun
        }
        3 {
            exit
        }
    }
    $c35 = $35 + "$App" + "\*"
    $c36 = $36 + "$App" + "\*"
    $c37 = $37 + "$App" + "\*"
    $c38 = $38 + "$App" + "\*"
    $c45 = $45 + "$App" + "\*"
    $c46 = $46 + "$App" + "\*"
    $c47 = $47 + "$App" + "\*"
    $c48 = $48 + "$App" + "\*"
    Write-Host "Comparing Server1 -> Server2"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c36 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server3"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c37 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server4"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c38 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server5"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c45 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server6"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c46 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server7"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c47 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
    Write-Host "Comparing Server1 -> Server8"
    if ((Get-ChildItem $c35 -Recurse | Get-FileHash | Select-Object Hash,Path).hash -eq (Get-ChildItem $c48 -Recurse | Get-FileHash | Select-Object Hash,Path).hash) {"Identical"} else {"NOT Identical"}
} until ($choice -eq 3)
Here is an example function that compares one reference directory against multiple difference directories efficiently. It does so by comparing the most easily available information first and stopping at the first difference.
Get all relevant information about the files in the reference directory once, including hashes (though this could be optimized further by computing hashes only when necessary).
For each difference directory, compare in this order:
file count - if different, then obviously the directories are different
relative file paths - if not all paths from the difference directory can be found in the reference directory, the directories are different
file sizes - should be obvious
file hashes - hashes only need to be calculated if files have equal size
Function Compare-MultipleDirectories {
    param(
        [Parameter(Mandatory)] [string] $ReferencePath,
        [Parameter(Mandatory)] [string[]] $DifferencePath
    )

    # Get basic file information recursively by calling Get-ChildItem with the addition of the relative file path
    Function Get-ChildItemRelative {
        param( [Parameter(Mandatory)] [string] $Path )
        Push-Location $Path # Base path for Get-ChildItem and Resolve-Path
        try {
            Get-ChildItem -File -Recurse |
                Select-Object FullName, Length, @{ n = 'RelativePath'; e = { Resolve-Path $_.FullName -Relative } }
        } finally {
            Pop-Location
        }
    }

    Write-Verbose "Reading reference directory '$ReferencePath'"

    # Create hashtable with all infos of reference directory
    $refFiles = @{}
    Get-ChildItemRelative $ReferencePath |
        Select-Object *, @{ n = 'Hash'; e = { (Get-FileHash $_.FullName -Algorithm MD5).Hash } } |
        ForEach-Object { $refFiles[ $_.RelativePath ] = $_ }

    # Compare content of each directory of $DifferencePath with $ReferencePath
    foreach( $diffPath in $DifferencePath ) {
        Write-Verbose "Comparing directory '$diffPath' with '$ReferencePath'"

        $areDirectoriesEqual = $false
        $differenceType = $null
        $diffFiles = Get-ChildItemRelative $diffPath

        # Directories must have same number of files
        if( $diffFiles.Count -eq $refFiles.Count ) {
            # Find first different path (if any)
            $firstDifferentPath = $diffFiles | Where-Object { -not $refFiles.ContainsKey( $_.RelativePath ) } |
                Select-Object -First 1

            if( -not $firstDifferentPath ) {
                # Find first different content (if any) by file size comparison
                $firstDifferentFileSize = $diffFiles |
                    Where-Object { $refFiles[ $_.RelativePath ].Length -ne $_.Length } |
                    Select-Object -First 1

                if( -not $firstDifferentFileSize ) {
                    # Find first different content (if any) by hash comparison
                    $firstDifferentContent = $diffFiles |
                        Where-Object { $refFiles[ $_.RelativePath ].Hash -ne (Get-FileHash $_.FullName -Algorithm MD5).Hash } |
                        Select-Object -First 1

                    if( -not $firstDifferentContent ) {
                        $areDirectoriesEqual = $true
                    }
                    else {
                        $differenceType = 'Content'
                    }
                }
                else {
                    $differenceType = 'FileSize'
                }
            }
            else {
                $differenceType = 'Path'
            }
        }
        else {
            $differenceType = 'FileCount'
        }

        # Output comparison result
        [PSCustomObject]@{
            ReferencePath  = $ReferencePath
            DifferencePath = $diffPath
            Equal          = $areDirectoriesEqual
            DiffCause      = $differenceType
        }
    }
}
Usage example:
# compare each of directories B, C, D, E, F against A
Compare-MultipleDirectories -ReferencePath 'A' -DifferencePath 'B', 'C', 'D', 'E', 'F' -Verbose
Output example:
ReferencePath DifferencePath Equal DiffCause
------------- -------------- ----- ---------
A B True
A C False FileCount
A D False Path
A E False FileSize
A F False Content
DiffCause column gives you the information why the function thinks the directories are different.
Note:
Select-Object -First 1 is a neat trick to stop searching once we have found the first result. It is efficient because it doesn't process all input and then drop everything except the first item; it actually cancels the upstream pipeline after the first item has been found (see the short demo after these notes).
The $refFiles hashtable stores the file information so it can be looked up quickly by the RelativePath property (Group-Object RelativePath -AsHashTable would be an alternative way to build such a lookup).
Empty sub directories are ignored, because the function only looks at files. E.g. if the reference path contains some empty directories but the difference path does not, and the files in all other directories are equal, the function treats the directories as equal.
I've chosen the MD5 algorithm because it is faster than the default SHA-256 algorithm used by Get-FileHash, but it is insecure. Someone could easily manipulate a file that is different to have the same MD5 hash as the original file. In a trusted environment this won't matter. Remove -Algorithm MD5 if you need a more secure comparison.
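A quick way to see the pipeline cancellation in action (a minimal sketch; the numbers are arbitrary):
# Prints "processing 1" only once: Select-Object -First 1 stops the upstream
# ForEach-Object as soon as the first item arrives (PowerShell 3.0+).
1..1000000 |
    ForEach-Object { Write-Host "processing $_"; $_ } |
    Select-Object -First 1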
A simple place to start:
compare (dir -r dir1) (dir -r dir2) -Property name,length,lastwritetime
You can also add -PassThru to see the original objects, or -IncludeEqual to see the equal elements. Without -SyncWindow, the order of each array doesn't matter. I'm assuming all the LastWriteTimes are in sync, to the millisecond. Don't assume you can skip specifying the properties to compare. See also Comparing folders and content with PowerShell.
I was looking into calculated properties, for example for the relative path, but it looks like you can't name them in Compare-Object, even in PowerShell 7. I'm chopping off the first four path elements, 0..3.
compare (dir -r foo1) (dir -r foo2) -Property length,lastwritetime,@{e={($_.fullname -split '\\')[4..$_.fullname.length] -join '\'}}
length lastwritetime ($_.fullname -split '\\')[4..$_.fullname.length] -join '\' SideIndicator
------ ------------- ---------------------------------------------------------- -------------
16 11/12/2022 11:30:20 AM foo2\file2 =>
18 11/12/2022 11:30:20 AM foo1\file2 <=
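If a named column matters, one workaround (a sketch reusing the same path-chopping expression; the property name RelPath is my own choice) is to compute the property with Select-Object first and then compare on that name:
$a = dir -r foo1 | select length, lastwritetime,
    @{n='RelPath'; e={($_.fullname -split '\\')[4..$_.fullname.length] -join '\'}}
$b = dir -r foo2 | select length, lastwritetime,
    @{n='RelPath'; e={($_.fullname -split '\\')[4..$_.fullname.length] -join '\'}}
compare $a $b -Property length, lastwritetime, RelPath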

How to check multiple files for duplicates using PowerShell?

I want to check for duplicate files. A file counts as a duplicate under this condition: the same base name but a different extension.
AAA18WWQ6BT602.PRO
AAA18WWQ6BT602.XML
I can figure out this case with my script. But I have a problem if there is more than one .XML file, like this:
AAA18WWQ6BT602.PRO
AAA18WWQ6BT602.XML
AAA18WWQ6BT601.XML
AAA18WWQ6BT604.XML
In this case, it will not detect that the files AAA18WWQ6BT602.PRO and AAA18WWQ6BT602.XML are duplicates.
Anyone can help me please.
Thanks
$duplicate = @()
@(Get-ChildItem "$Flag_Path\*.xml") | ForEach-Object { $duplicate += $_.BaseName }
if (Test-Path -Path "$Flag_Path\*$duplicate*" -Exclude *.xml)
{
    Get-ChildItem -Path "$Flag_Path\*$duplicate*" -Include *.xml | Out-File $Flag_Path\Flag_Duplicate
    Write-Host "Flag duplicated, continue for Error_Monitoring"
    pause
    Error_Monitoring
}
else {
    Write-Host "Flag does not duplicate, continue the process"
}
The -Include parameter only works if the path on Get-ChildItem ends in \* OR if the -Recurse switch is used.
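A quick illustration of that behavior (the folder path is hypothetical):
Get-ChildItem -Path 'D:\Flag' -Include '*.xml'          # returns nothing
Get-ChildItem -Path 'D:\Flag\*' -Include '*.xml'        # works: path ends in \*
Get-ChildItem -Path 'D:\Flag' -Recurse -Include '*.xml' # works: -Recurse is used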
The following should do what you want:
$flagFolder = 'D:\*'
$dupeReport = 'D:\Flag_Duplicate.txt'
$duplicates = Get-ChildItem -Path $flagFolder -File -Include '*.xml', '*.pro' |
    Group-Object -Property BaseName | Where-Object { $_.Count -gt 1 }
if ($duplicates) {
    # output the duplicate XML to Flag_Duplicate.txt
    $duplicates.Group | Where-Object { $_.Extension -eq '.xml' } | ForEach-Object {
        $_.FullName | Out-File -FilePath $dupeReport -Append
    }
    # do the rest of your code
    Write-Host "Flag duplicated, continue for Error_Monitoring"
    Error_Monitoring
}
else {
    Write-Host "Flag does not duplicate, continue the process"
}
Your script does not iterate correctly; you need a loop that checks each file individually. The Test-Path logic also looks mixed up to me. I tried to keep as much of your code as possible. This script checks every XML base name against files with any other extension (not only .pro):
$Flag_Path = "C:\dir_to_be_checked"
$xmlFilesArray = #()
$allFilesExceptXml = #() # all files excluding xml files
# Get all the xml files
Get-ChildItem -Path $Flag_Path -Include "*.xml" | ForEach-Object { $xmlFilesArray += $_.basename }
# Get all files from the directory the xml files
Get-ChildItem -Path $Flag_Path -Exclude "*.xml" | ForEach-Object { $allFilesExceptXml += $_.basename }
# Iterate over list of files names without suffix
ForEach ($xmlFile in $xmlFilesArray) {
ForEach ($fileToCheck in $allFilesExceptXml) {
If ($xmlFile -eq $fileToCheck) {
# logging the duplicate file (specifying utf8 or the output would be UTF-16)
Write-Output "$Flag_Path\$xmlFile.xml" | Out-File -Append -Encoding utf8 $Flag_Path\Flag_Duplicate
Write-Host "Flag duplicated, continue with duplicate search"
# pause
Write-Host "Press any key to continue ..."
$x = $host.UI.RawUI.ReadKey("NoEcho,IncludeKeyDown")
Error_Monitoring
} Else {
Write-Host "Flag is not duplicated. Continue with the search."
}
}
}

Powershell search path with wildcard-string-wildcard

I am trying to search a log file for updates that were not installed, then use the returned array to install those updates. The problem is that my files are named like:
Windows6.1-KB3102429-v2-x64.msu
My parsed array has an item of KB3102429. How can I use a wildcard, then the array item, then another wildcard plus .msu?
my code is listed below:
# Read KBLIST.txt and create array of KB updates that were not installed
$Failed = Get-Content -Path C:/Updates/KBLIST.txt | Where-Object { $_ -like '*NOT*' }

# Create a list of all items in the Updates folder
$dir = (Get-Item -Path "C:\Updates" -Verbose).FullName

# Parse the $Failed array down to just the KB#######
for ($i = $Failed.GetLowerBound(0); $i -le $Failed.GetUpperBound(0); $i++)
{
    $Failed[$i][1..9] -join ""

    # Search the $dir list for files that contain KB####### and end in .msu, then quiet-install
    Foreach ($item in (ls $dir *$Failed[$i]*.msu -Name))
    {
        echo $item
        $item = "C:\Updates\" + $item
        wusa $item /quiet /norestart | Out-Null
    }
}
It works down to the Foreach($item in (ls $dir *$Failed[$i]*.msu -Name)) line. If I just use * instead of the wildcard-string-wildcard, it returns a list of all the .msu files, so the basic syntax is correct.
It was hard to follow your work since you used aliases, but I think this should be able to accomplish what you're looking for.
$UpdateFolder = 'C:\Updates'
$FailedUpdates = Get-Content -Path C:/Updates/KBLIST.txt | Where-Object { $_ -like '*NOT*' }
foreach ( $Update in $FailedUpdates )
{
    Write-Host -Object "Update $Update failed"
    $UpdatePath = Get-Item -Path "$UpdateFolder\*$Update*.msu" | Select-Object -ExpandProperty FullName
    Write-Host -Object "`tReinstalling from path: $UpdatePath"
    wusa $UpdatePath /quiet /norestart | Out-Null
}
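As an aside, the reason the original line failed: index expressions such as $Failed[$i] are not evaluated inside a composite bare argument; PowerShell expands $Failed and $i separately and keeps the brackets literally. Capturing the value in a plain variable first, or using a subexpression, fixes it (a sketch built on the question's own variables):
# capture the parsed KB number (the [1..9] slice is from the question) ...
$kb = $Failed[$i][1..9] -join ""
Get-ChildItem $dir "*$kb*.msu" -Name
# ... or inline it with a $( ) subexpression
Get-ChildItem $dir "*$($Failed[$i][1..9] -join '')*.msu" -Name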

Delete massive amount of files without running out of memory

There is a COTS app we have that creates reports and never deletes them, so we need to start cleaning them up. I started doing a foreach and would run out of memory on the server (36 GB) when it got up to 50ish million files. After searching, it seemed you could change it like so:
Get-ChildItem -path $Path -recurse | foreach {
and it won't accumulate everything in memory but will process each item one at a time. That way I can get to 140 million files before I run out of memory.
Clear-Host
# Set age to look for
$TimeLimit = (Get-Date).AddMonths(-4)
$Path = "D:\CC\LocalStorage"
$TotalFileCount = 0
$TotalDeletedCount = 0
Get-ChildItem -Path $Path -Recurse | foreach {
    if ($_.LastWriteTime -le $TimeLimit) {
        $TotalDeletedCount += 1
        $_.Delete()
    }
    $TotalFileCount += 1
    $FileDiv = $TotalFileCount % 10000
    if ($FileDiv -eq 0 -and $TotalFileCount -ne 0) {
        $TF = [string]::Format('{0:N0}', $TotalFileCount)
        $TD = [string]::Format('{0:N0}', $TotalDeletedCount)
        Write-Host "Files Scanned : " -ForegroundColor Green -NoNewline
        Write-Host "$TF" -ForegroundColor Yellow -NoNewline
        Write-Host " Deleted: " -ForegroundColor Green -NoNewline
        Write-Host "$TD" -ForegroundColor Yellow
    }
}
Is there a better way to do this? My only next thought was not to use the -Recurse command but make my own function that calls itself for each directory.
EDIT:
I used the code provided in the first answer and it does not solve the issue. Memory is still growing.
$limit = (Get-Date).Date.AddMonths(-3)
$totalcount = 0
$deletecount = 0
$Path = "D:\CC\"
Get-ChildItem -Path $Path -Recurse -File | Where-Object { $_.LastWriteTime -lt $limit } | Remove-Item -Force
Using ForEach-Object and the pipeline should actually prevent the code from running out of memory. If you're still getting OOM exceptions, I suspect that you're doing something in your code that counters this effect, which you didn't tell us about.
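To illustrate the difference (a minimal sketch):
# The foreach statement evaluates Get-ChildItem completely first, holding
# every FileInfo object in memory at once:
foreach ($file in Get-ChildItem -Path $Path -Recurse -File) { $file.Name }
# Piping into ForEach-Object streams the items one at a time instead:
Get-ChildItem -Path $Path -Recurse -File | ForEach-Object { $_.Name }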
With that said, you should be able to clean up your data directory with something like this:
$limit = (Get-Date).Date.AddMonths(-4)
Get-ChildItem -Path $Path -Recurse -File |
Where-Object { $_.LastWriteTime -lt $limit } |
Remove-Item -Force -WhatIf
Remove the -WhatIf switch after you verified that everything is working.
If you need the total file count and the number of deleted files, add counters like this:
$totalcount = 0
$deletecount = 0
Get-ChildItem -Path $Path -Recurse -File |
ForEach-Object { $totalcount++; $_ } |
Where-Object { $_.LastWriteTime -lt $limit } |
ForEach-Object { $deletecount++; $_ } |
Remove-Item -Force -WhatIf
I don't recommend printing status information to the console when you're bulk-processing large numbers of files. The output could significantly slow down the processing. If you must have that information, write it to a log file and tail that file separately.
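For example, a sketch of logging progress to a file instead (the log path and interval are arbitrary choices):
$log = "$env:TEMP\cleanup.log"
$totalcount = 0
Get-ChildItem -Path $Path -Recurse -File |
    ForEach-Object {
        $totalcount++
        # append one log line per 100,000 files scanned
        if ($totalcount % 100000 -eq 0) {
            Add-Content -Path $log -Value "$(Get-Date -Format s) scanned $totalcount files"
        }
        $_
    } |
    Where-Object { $_.LastWriteTime -lt $limit } |
    Remove-Item -Force -WhatIf
Then tail it from a second console with Get-Content $log -Wait.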

Only recurse subfolder x numbers of levels using PowerShell

The PowerShell script below will list out all shared folders (excluding hidden shared folders), then list out all sub-folders and finally get the ACL information of each of them and export to a CSV file.
However, I'm trying to limit how deep into the sub-folders it drills. For example, if I set the limit to 3, the script should get the ACL information of only the first three levels of sub-folders. How can I do this?
Input:
path=\\server\sharefolder0\subfolder01\subfolder02
path=\\server\sharefolder1\subfolder11\subfolder12\subfolder13\subfolder14
path=\\server\sharefolder2
Expected result:
path=\\server\sharefolder0
path=\\server\sharefolder0\subfolder01
path=\\server\sharefolder0\subfolder01\subfolder02
path=\\server\sharefolder1
path=\\server\sharefolder1\subfolder11
path=\\server\sharefolder1\subfolder11\subfolder12
path=\\server\sharefolder2
This is the code:
$getSRVlist = Get-Content .\server.txt
$outputDirPath = ".\DirPathList.txt"
$outputACLInfo = ".\ACLInfo.CSV"
$header = "FolderPath,IdentityReference,Rights"
Del $outputACLInfo
Add-Content -Value $header -Path $outputACLInfo
foreach ($readSRVlist in $getSRVlist)
{
    foreach ($readShareInfoList in ($getShareInfoList = Get-WmiObject Win32_Share -ComputerName $readSRVlist | Where {$_.Name -notlike "*$"} | %{$_.Name}))
    {
        foreach ($readDirPathList in ($getDirPathList = Get-ChildItem \\$readSRVlist\$readShareInfoList -Recurse | Where {$_.PSIsContainer})) # | %{$_.FullName}
        {
            $getACLList = Get-ACL $readDirPathList.FullName | ForEach-Object {$_.Access}
            foreach ($readACLList in $getACLList)
            {
                $a = $readDirPathList.FullName + "," + $readACLList.IdentityReference + "," + $readACLList.FileSystemRights
                Add-Content -Value $a -Path $outputACLInfo
            }
        }
    }
}
Recursion is your friend. Try this:
$maxDepth = 3

function TraverseFolders($folder, $remainingDepth) {
    Get-ChildItem $folder | Where-Object { $_.PSIsContainer } | ForEach-Object {
        $_.FullName   # output the current folder, then recurse into it
        if ($remainingDepth -gt 1) {
            TraverseFolders $_.FullName ($remainingDepth - 1)
        }
    }
}

TraverseFolders "C:\BASE\PATH" $maxDepth
Edit: Now I see what you mean. For checking the first three parent folders of a given path, try this:
$server = "\\server\"
$path = ($args[0] -replace [regex]::Escape($server), "").Split("\\")[0..2]
for ($i = 0; $i -lt $path.Length; $i++) {
    Get-ACL ($server + [string]::Join("\", $path[0..$i]))
}
In newer versions of PowerShell you can use the -Depth parameter. A one-liner can help:
Get-ChildItem -Path \\server\folder -Depth 2 -Directory | Select-Object -Property Name, FullName
It will recurse two levels of nested folders and return the name and full path of each folder. Tested on PSVersion 5.1.17134.858.
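Combined with the original ACL task, that might look like this (a sketch; the share path and CSV name are hypothetical):
Get-ChildItem -Path \\server\folder -Depth 2 -Directory |
    ForEach-Object {
        $folder = $_.FullName
        (Get-Acl -Path $folder).Access |
            Select-Object @{n='FolderPath'; e={$folder}}, IdentityReference, FileSystemRights
    } |
    Export-Csv .\ACLInfo.csv -NoTypeInformation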