PowerShell to display duplicate files

I have a task to check whether new files were imported for the day into a shared folder, and to alert on any duplicate files; no recursive check is needed.
The code below displays the details and size of all files that are one day old. However, I need only the files with the same size, since I cannot compare them by name.
$Files = Get-ChildItem -Path E:\Script\test |
Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)}
$Files | Select-Object -Property Name, Hash, LastWriteTime, @{N='SizeInKb';E={[double]('{0:N2}' -f ($_.Length/1kb))}}
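For reference, a minimal sketch of the size-only filter being asked for here, using the path from the question (the answers below refine this with hashing):
# Sketch: day-old files that share their size with at least one other file
$Files = Get-ChildItem -Path E:\Script\test -File |
    Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) }
$Files | Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group } |
    Select-Object Name, LastWriteTime, @{N='SizeInKb'; E={ [math]::Round($_.Length / 1kb, 2) }}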

I didn't like the big DOS-like script answer written here, so here's an idiomatic way of doing it in PowerShell:
From the folder in which you want to find duplicates, just run this simple set of pipes:
Get-ChildItem -Recurse -File `
| Group-Object -Property Length `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group } `
| Get-FileHash `
| Group-Object -Property Hash `
| ?{ $_.Count -gt 1 } `
| %{ $_.Group }
This will show all files, with their hashes, whose content matches that of other files.
Each line does the following:
get files
from current directory (use -Path $directory otherwise)
recursively (if not wanted, remove -Recurse)
group based on file size
discard groups with fewer than 2 files
grab all those files
get hashes for each
group based on hash
discard groups with fewer than 2 files
get all those files
Add | %{ $_.path } to just show the paths instead of the hashes.
Add | %{ $_.path -replace "$([regex]::escape($(pwd)))",'' } to only show the relative path from the current directory (useful in recursion).
For the question-asker specifically: don't forget to whack in | Where-Object {$_.CreationTime -gt (Get-Date).AddDays(-1)} right after the gci so you're not comparing files you don't want to consider, which could get very time-consuming if that shared folder has a lot of coincidentally same-length files.
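Putting that together for the asker's scenario, a hedged sketch (the E:\Script\test path is taken from the question):
# Day-old files only, then size groups, then hash groups
Get-ChildItem -Path E:\Script\test -File |
    Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } |
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group } |
    Get-FileHash |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group.Path }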
Finally, if you're like me and just wanted to find dupes based on name, since Google will probably take you here too:
gci -Recurse -file | Group-Object name | Where-Object { $_.Count -gt 1 } | select -ExpandProperty group | %{ $_.fullname }

All the examples here take into account only timestamp, length, and name. That is certainly not enough.
Imagine this example:
You have two files:
c:\test_path\test.txt and c:\test_path\temp\test.txt.
The first one contains 12345. The second contains 54321. In this case these files will be considered identical even though they are not.
I have created a duplicate checker based on hash calculation. It was written off the top of my head, so it is rather crude (but I think you get the idea, and it will be easy to optimize):
Edit: I've decided the source code was "too crude" (a nicer name for incorrect) and I have improved it (removed superfluous code):
# The current directory where the script is executed
$path = (Resolve-Path .\).Path
$duplicities = @{}
# Remove unique records by size (different size = different hash)
# You can select only those you need with e.g. "*.jpg"
$file_names = Get-ChildItem -Path $path -Recurse -Include "*.*" |
    Where-Object { -not $_.PSIsContainer } |
    Group-Object Length |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group |
    Select-Object FullName, Length
# I'm using SHA256 due to the SHA1 collisions found
$hash_details = ForEach ($file in $file_names) {
    Get-FileHash -Path $file.FullName -Algorithm SHA256
}
# Just a counter for the hash table key
$counter = 0
ForEach ($first_file_hash in $hash_details) {
    ForEach ($second_file_hash in $hash_details) {
        If (($first_file_hash.Hash -eq $second_file_hash.Hash) -and ($first_file_hash.Path -ne $second_file_hash.Path)) {
            $duplicities.Add($counter, $second_file_hash)
            $counter += 1
        }
    }
}
# Output the duplicate files
If ($duplicities.Count -gt 0) {
    Write-Output "Duplicate files found:" $duplicities.Values.Path
    $duplicities.Values | Out-File -Encoding UTF8 duplicate_log.txt
} Else {
    Write-Output 'No duplicities found'
}
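Since the author notes the script would be easy to optimize: the nested ForEach compares every pair of files (O(n²), each pair twice). One sketch of that optimization is to reuse Group-Object on the hash results instead:
# Sketch: group by hash instead of comparing every pair
$duplicate_groups = $hash_details | Group-Object -Property Hash | Where-Object { $_.Count -gt 1 }
$duplicate_groups | ForEach-Object { $_.Group.Path }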
I have created a test structure:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> Get-ChildItem -path $path -Recurse
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 9.4.2018 9:58 test
-a--- 9.4.2018 11:06 2067 check_for_duplicities.ps1
-a--- 9.4.2018 11:06 757 duplicate_log.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test
Mode LastWriteTime Length Name
---- ------------- ------ ----
d---- 9.4.2018 9:58 identical_file
d---- 9.4.2018 9:56 t
-a--- 9.4.2018 9:55 5 test.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 9.4.2018 9:55 5 test.txt
Directory: C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\t
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 9.4.2018 9:55 5 test.txt
(Where the file in ..\duplicities\test\t differs from the others.)
The result of running the script:
The console output:
PS C:\prg\PowerShell\_Snippets\_file_operations\duplicities> .\check_for_duplicities.ps1
Duplicate files found:
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
The duplicate_log.txt file contains more detailed information:
Algorithm Hash Path
--------- ---- ----
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\identical_file\test.txt
SHA256 5994471ABB01112AFCC18159F6CC74B4F511B99806DA59B3CAF5A9C173CACFC5 C:\prg\PowerShell\_Snippets\_file_operations\duplicities\test\test.txt
Conclusion
As you can see, the different file is correctly omitted from the result set.

Since it is the file contents that you are determining to be duplicates, it's more prudent to just hash the files and compare the hashes. Name, size, and timestamp would not be prudent attributes for the defined use case, since only the hash tells you whether the files have the same content.
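For example, a minimal sketch comparing two files by content hash (the paths reuse the earlier test example):
# Same hash = same content; different hash = different content
$a = Get-FileHash -Path 'C:\test_path\test.txt' -Algorithm SHA256
$b = Get-FileHash -Path 'C:\test_path\temp\test.txt' -Algorithm SHA256
if ($a.Hash -eq $b.Hash) { 'Files are identical' } else { 'Files differ' }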
See these discussions
Need a way to check if two files are the same? Calculate a hash of
the files. Here is one way to do it:
https://blogs.msdn.microsoft.com/powershell/2006/04/25/duplicate-files
Duplicate File Finder and Remover
And now the moment you have been waiting for....an all PowerShell file
duplicate finder and remover! Now you can clean up all those copies of
pictures, music files, and videos. The script opens a file dialog box
to select the target folder, recursively scans each file for duplica…
https://gallery.technet.microsoft.com/scriptcenter/Duplicate-File-Finder-and-78f40ae9

This might be helpful for you.
$files = Get-ChildItem 'E:\SC' |
    Where-Object { $_.CreationTime -gt (Get-Date).AddDays(-1) } |
    Group-Object -Property Length
foreach ($filegroup in $files)
{
    if ($filegroup.Count -ne 1)
    {
        foreach ($file in $filegroup.Group)
        {
            Invoke-Item $file.FullName
        }
    }
}
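Note that Invoke-Item opens each duplicate candidate in its default application. If you only need an alert, a sketch that reports the paths instead:
# Report duplicate candidates instead of opening them
foreach ($filegroup in $files | Where-Object { $_.Count -gt 1 }) {
    Write-Output $filegroup.Group.FullName
}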

Related

How to compare 2 directories and list only the files which are different?

I have the following code that compares two directories and outputs a side indicator showing the difference between the directory files.
$Folder1 = "source_folder_path"
$Folder2 = "dest_folder_path"
function Get-Directories ($path)
{
    $PathLength = $path.Length
    Get-ChildItem $path -Recurse | % {
        Add-Member -InputObject $_ -MemberType NoteProperty -Name RelativePath -Value $_.FullName.Substring($PathLength + 1)
        $_
    }
}
Compare-Object (Get-Directories $Folder1) (Get-Directories $Folder2) -Property RelativePath | Sort RelativePath, Name -desc
This gives me output with a SideIndicator column for each differing RelativePath:
The side indicator <= means the file or folder exists only in the source. i.e., missing in the destination.
The side indicator => means the file or folder exists only in the destination. i.e., missing in the source.
However, I only want to list the files that are different (i.e. that don't exist in the other directory), and only different by name. Currently, the code above compares files including the extension, so if only the extension differs, e.g. image1.jpg vs image1.png, it marks them as different. I don't want that; instead, I would like it to ignore the extension.
i.e.
image1.png vs image1.jpg = no difference, don't list it
image1.png in folder 1 but not in folder 2 = list it
# Resolve the input folders to full paths first.
$Folder1 = Convert-Path $Folder1
$Folder2 = Convert-Path $Folder2
Compare-Object (Get-ChildItem -File -Recurse $Folder1) `
(Get-ChildItem -File -Recurse $Folder2) `
-Property {
# Determine the relative path and then compare
# the directory path + the file *base* name.
$relativePath = $_.FullName.Substring(
($Folder1.Length, $Folder2.Length)[$_.FullName.StartsWith($Folder2, 'InvariantCultureIgnoreCase')]
)
[IO.Path]::GetDirectoryName($relativePath) + '/' +
[IO.Path]::GetFileNameWithoutExtension($relativePath)
} -PassThru |
Select-Object Name, Directory
The above doesn't just list the name (which could be ambiguous), but also the directory in which a given unique file was found.
If I understand this correctly, compare two folders using the basename property:
Directory: C:\Users\admin\foo\dir1
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/24/2020 4:54 PM 4 image1.png
-a--- 11/24/2020 4:55 PM 4 image2.png
Directory: C:\Users\admin\foo\dir2
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/24/2020 4:54 PM 4 image1.jpg
compare-object (dir dir1) (dir dir2) -property basename
basename SideIndicator
-------- -------------
image2 <=

Search for specific folder name in filtered path

I need to search for a specific folder name in a filtered path.
For example:
I have some folders like these on disk M:
M:\
├───2.46.567
│   └───A
├───3.09.356
│   └───A
├───4.05.123
│   └───A
└───4.05.124
    └───B
I want to search for folder A only in the 4.05.xxx directories. I also want to check whether that directory is the last one (most recently written) that contains a folder A.
I tried something like the following command:
Get-ChildItem -Path m:\* -recurse -filter '*4.05*' | sort -descending LastWriteTime
Can I do this in PowerShell?
Get-ChildItem allows wildcards on several levels of a path, not just the last, so no -Recurse is needed.
Get-ChildItem 'M:\4.05.*\A' -Directory | Sort-Object -Descending LastWriteTime
In the above tree this returns just one entry:
Directory: M:\4.05.123
Mode LastWriteTime Length Name
---- ------------- ------ ----
d----- 2019-08-30 12:10 A
An alternative with the same result based on above tree:
Get-ChildItem -Path 'M:\4.05.*' -Filter A -Recurse -Directory | Sort-Object -Descending LastWriteTime
PowerShell Version 2 variant
Get-ChildItem 'M:\4.05.*\A' | Where-Object {$_.PSIsContainer} |
Sort-Object -Desc LastWriteTime | Select-Object -First 1 | Set-Location
Try this:
param(
    $SourceDir = "M:\"
)
$a = gci $SourceDir | foreach { $i = gci "$SourceDir\$_" -Name; if ($i.Equals("A")) { "$_" } }
for ($h = 0; $h -le $a.Length - 1; $h++) {
    if ($a[$h] -like "4.05.*") {
        $a[$h]
        if ($a[$h].Equals($a[$a.Length - 1])) {
            "It is the last one."
        }
    }
}
This will return all folders that contain a folder "A" and have "4.05." as part of their name. It will also report whether a folder is the last one in the array, and therefore the last folder that contains "A".
You can also use Resolve-Path.
(Resolve-Path "M:\4.05.*\A").ProviderPath
This returns the path strings (not the folder objects!) you're after.
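For the second part of the question (is a given folder the last 4.05.xxx directory containing an A subfolder?), a small sketch building on the commands above:
# Most recently written 4.05.* directory that contains a folder A
$lastWithA = Get-ChildItem 'M:\4.05.*\A' -Directory |
    Sort-Object { $_.Parent.LastWriteTime } -Descending |
    Select-Object -First 1
"Last 4.05.* folder containing A: $($lastWithA.Parent.FullName)"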

Multiple counts from Powershell Get-ChildItem?

How would I use powershell to count the number of files and the number of folders under a folder, emulating the Windows "properties" of that folder?
The FAQ says cross-posting is OK, so the full question can be found at: https://superuser.com/questions/605911/multiple-counts-from-powershell-get-childitem
Basically you need to enumerate all files and folders for a given path and maintain a count of each object type:
$files = $folders = 0
$path = "d:\temp"
dir $path -Recurse | foreach { if ($_.PSIsContainer) { $folders += 1 } else { $files += 1 } }
"'$path' contains $files files, $folders folders"
Edit:
Edited to improve efficiency and stop loops...
This can also be done by using the Measure-Object cmdlet with the Get-ChildItem cmdlet:
$directory = Get-Item "C:\Test"
gci $directory.FullName -Recurse | Measure-Object -Property PSIsContainer -Max -Sum | Select-Object `
    @{Name="Path"; Expression={$directory.FullName}},
    @{Name="Files"; Expression={$_.Count - $_.Sum}},
    @{Name="Folders"; Expression={$_.Sum}}
Output:
Path Files Folders
---- ----- -------
C:\Test 470 19
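On PowerShell 3.0 and later, a simpler sketch using the dedicated -File and -Directory switches (the path is reused from the example above):
# Count files and folders with Get-ChildItem switches (PowerShell 3.0+)
$path = "C:\Test"
$files = (Get-ChildItem $path -Recurse -File).Count
$folders = (Get-ChildItem $path -Recurse -Directory).Count
"'$path' contains $files files, $folders folders"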

Powershell Get-ChildItem most recent file in directory

We produce files with a date in the name.
(* below is the wildcard for the date.)
I want to grab the latest file, and the folder that contains the file also has a date (month only) in its title.
I am using PowerShell and I am scheduling it to run each day. Here is the script so far:
$LastFile = "*_DailyFile"
$compareDate = (Get-Date).AddDays(-1)
$LastFileCaptured = Get-ChildItem -Recurse | Where-Object {$LastFile.LastWriteTime -ge $compareDate}
If you want the latest file in the directory and you are using only the LastWriteTime to determine the latest file, you can do something like below:
gci path | sort LastWriteTime | select -last 1
On the other hand, if you want to rely only on the names that have the dates in them, you should be able to do something similar:
gci path | select -last 1
Also, if there are directories in the directory, you might want to add a ?{-not $_.PsIsContainer}
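Combining those suggestions into one pipeline (the path is a placeholder):
# Latest file by LastWriteTime, excluding directories
Get-ChildItem C:\path |
    Where-Object { -not $_.PSIsContainer } |
    Sort-Object LastWriteTime |
    Select-Object -Last 1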
Yes, I think this would be quicker (note that Sort-Object's -Top parameter requires PowerShell 6 or later):
Get-ChildItem $folder | Sort-Object -Descending -Property LastWriteTime -Top 1
Try:
$latest = (Get-ChildItem -Attributes !Directory | Sort-Object -Descending -Property LastWriteTime | select -First 1)
$latest_filename = $latest.Name
Explanation:
PS C:\Temp> Get-ChildItem -Attributes !Directory *.txt | Sort-Object -Descending -Property LastWriteTime | select -First 1
Directory: C:\Temp
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 5/7/2021 5:51 PM 1802 Prison_Mike_autobiography.txt
Get-ChildItem -Attributes !Directory *.txt or Get-ChildItem or gci : Gets the list of files ONLY (no directories) in the current directory. We can also give a file extension filter as needed, like *.txt. Reference: gci, Get-ChildItem
Sort-Object -Descending -Property LastWriteTime : Sort files by LastWriteTime (modified time) in descending order. Reference
select -First 1 : Gets the first/top record. Reference Select-Object / select
Getting file metadata
PS C:\Temp> $latest.Name
Prison_Mike_autobiography.txt
PS C:\Temp> $latest.DirectoryName
C:\Temp
PS C:\Temp> $latest.FullName
C:\Temp\Prison_Mike_autobiography.txt
PS C:\Temp> $latest.CreationTime
Friday, May 7, 2021 5:51:19 PM
PS C:\Temp> $latest.Mode
-a----
@manojlds's answer is probably the best for the scenario where you are only interested in files within a root directory:
\path
    \file1
    \file2
    \file3
However, if the files you are interested in are part of a tree of files and directories, such as:
\path
    \file1
    \file2
    \dir1
        \file3
    \dir2
        \file4
To find, recursively, the list of the 10 most recently modified files in Windows, you can run:
PS > $Path = pwd # your root directory
PS > $ChildItems = Get-ChildItem $Path -Recurse -File
PS > $ChildItems | Sort-Object LastWriteTime -Descending | Select-Object -First 10 FullName, LastWriteTime
You could try to sort descending with sort LastWriteTime -Descending and then select -First 1. I'm not sure which one is faster.
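If you want to check which is faster for yourself, a quick sketch with Measure-Command (the path is a placeholder):
# Time both variants; compare TotalMilliseconds
Measure-Command { Get-ChildItem C:\Temp | Sort-Object LastWriteTime | Select-Object -Last 1 }
Measure-Command { Get-ChildItem C:\Temp | Sort-Object LastWriteTime -Descending | Select-Object -First 1 }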