I am trying to write a PowerShell script that recursively analyzes a directory and gets the MD5 hash of every file in it, including files inside any subdirectories of the one given.
After that, I want to compare all the hashes against each other to see which files are copies, and then offer an option to delete those copies or not.
At the moment I have this:
$UserInput=Read-Host
Get-ChildItem -Path $UserInput -Recurse
$someFilePath = $UserInput
$md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
$hash = [System.BitConverter]::ToString($md5.ComputeHash([System.IO.File]::ReadAllBytes($someFilePath)))
$hash
The main problem is in the hash part: I get an error when calling "ReadAllBytes".
I am also wondering whether I should create an array: when I compare the hashes and two are equal, I would put the copies into the array so that deleting them later is "easier".
What do you think? (I am also not sure whether I am using "$someFilePath", the MD5 object, or the hash correctly.)
If targeting PowerShell 5.1 on Windows 10, I'd use the Get-FileHash cmdlet and then group them by hash using the Group-Object cmdlet:
$UserInput = Read-Host
$DuplicateFiles = Get-ChildItem -Path $UserInput -Recurse -File |Group {($_|Get-FileHash).Hash} |Where Count -gt 1
foreach($FileGroup in $DuplicateFiles)
{
    Write-Host "These files share hash $($FileGroup.Name)"
    $FileGroup.Group.FullName | Write-Host
}
Try this (set $myFilePath to your root folder first):
$fileHashes = Get-ChildItem -Path $myFilePath -Recurse -File | Get-Filehash -Algorithm MD5
$doubles = $fileHashes | Group hash | ? {$_.count -gt 1} | % {$_.Group}
foreach($item in $doubles) {
    Write-Output $item
}
Just do it in one pipeline:
Get-ChildItem -Path $UserInput -Recurse -File | Get-FileHash | Group Hash | Where Count -gt 1
Short version:
gci -Path $UserInput -R -File | Get-FileHash | Group Hash | ? Count -gt 1
I've been running around like crazy lately with this script that I'm trying to modify to suit my needs. I recently found out that deleting the files using "LastWriteTime" is not what I'm after.
What I need my script to do is delete the files that are older than 30 days using the "CreationTime" property. The problem is that after I modify the script to use it, it deletes the entire folder structure.
How can this small modification change the behavior of the entire script?
This is what I'm using:
$limit = (Get-Date).AddDays(-30)
$del30 = "D:\CompanyX_ftp\users"
$ignore = Get-Content "C:\Users\UserX\Documents\Scripts\ignorelist.txt"
Get-ChildItem $del30 -Recurse |
Where-Object {$_.CreationTime -lt $limit } |
Select-Object -ExpandProperty FullName |
Select-String -SimpleMatch -Pattern $ignore -NotMatch |
Select-Object -ExpandProperty Line |
Remove-Item -Recurse
So if I replace the "CreationTime" property with "LastWriteTime", the script runs and does what it's supposed to, but if I use "CreationTime" it just deletes everything under the folder structure, including the folders themselves and the paths it's supposed to ignore.
UPDATE: The script now works for the actual deletion of the files, but the variant I'm using just to get a report of the files that would be deleted is still including the paths from the ignorelist.txt file.
Please see below script:
$limit = (Get-Date).AddDays(-30)
$del30 = "D:\CompanyX_ftp\users"
#Specify path for ignore-list
$ignore = Get-Content "C:\Users\UserX\Documents\Scripts\ignorelist.txt"
Get-ChildItem $del5 -File -Recurse |
Where-Object {$_.CreationTime -lt $limit } |
Select-Object -ExpandProperty FullName |
Select-String -SimpleMatch -Pattern $ignore -NotMatch |
Select-Object -ExpandProperty Line |
Get-ChildItem -Recurse | Select-Object FullName,CreationTime
ignorelist.txt sample data:
D:\CompanyX_ftp\users\ftp-customerA\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerB\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerC\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerD\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerE\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerF\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerG\Customer Downloads
D:\CompanyX_ftp\users\ftp-customerH\Customer Downloads\
Any ideas on why it's including the paths that I have listed in ignorelist.txt? (I will also provide an image for better illustration.)
Thanks in advance for any help or guidance with this.
//Lennart
I see two problems with the updated code:
Duplicate recursion. The first Get-ChildItem iterates over the contents of the directory recursively. Later in the pipeline, another recursive iteration starts on the items returned by the first Get-ChildItem, causing overlap.
When filtering by $ignore, only paths that exactly match against the $ignore paths are being ignored. Paths that are children of items in the ignore list are not ignored.
Here is how I would do this. Create a function Test-IgnoreFile that matches a given path against an ignore list, checking whether the current path starts with any path in the ignore list. This way child paths are ignored too, and it lets us greatly simplify the pipeline.
Param(
    [switch] $ReportOnly
)

# Returns $true if $File.FullName starts with any path in $Ignore (case-insensitive)
Function Test-IgnoreFile( $File, $Ignore ) {
    foreach( $i in $Ignore ) {
        if( $File.FullName.StartsWith( $i, [StringComparison]::OrdinalIgnoreCase ) ) {
            return $true
        }
    }
    $false
}

$limit  = (Get-Date).AddDays(-30)
$del30  = "D:\CompanyX_ftp\users"
$ignore = Get-Content "C:\Users\UserX\Documents\Scripts\ignorelist.txt"

Get-ChildItem $del30 -File -Recurse |
    Where-Object { $_.CreationTime -lt $limit -and -not ( Test-IgnoreFile $_ $ignore ) } |
    ForEach-Object {
        if( $ReportOnly ) {
            $_ | Select-Object FullName, CreationTime
        }
        else {
            $_ | Remove-Item -Force
        }
    }
I need to delete all the archived files and folders older than 15 days.
I have implemented the solution using a PowerShell script, but it is taking more than a day to delete all the files. The total size of the folder is less than 100 GB.
$StartFolder = "\\Guru\Archive\"
$deletefilesolderthan = "15"
#Get Foldernames for ForEach Loop
$SubFolders = Get-ChildItem -Path $StartFolder |
Where-Object {$_.PSIsContainer -eq "True"} |
Select-Object Name
#Loop through folders
foreach ($Subfolder in $SubFolders) {
    Write-Host "Processing Folder:" $Subfolder
    # For each folder, recurse and delete files older than the specified number of days
    # while leaving the folder structure intact.
    Get-ChildItem -Path $StartFolder$($Subfolder.name) -Include *.* -File -Recurse |
        Where LastWriteTime -lt (Get-Date).AddDays(-$deletefilesolderthan) |
        foreach {$_.Delete()}
    # $dirs will be an array of empty directories returned after filtering; loop until
    # $dirs is empty, excluding the "Inbound" and "Outbound" folders.
    do {
        $dirs = gci $StartFolder$($Subfolder.name) -Exclude Inbound,Outbound -Directory -Recurse |
            Where {(gci $_.FullName).Count -eq 0} |
            select -ExpandProperty FullName
        $dirs | ForEach-Object {Remove-Item $_}
    } while ($dirs.Count -gt 0)
}
Write-Host "Completed" -ForegroundColor Green
#Read-Host -Prompt "Press Enter to exit"
Please suggest some way to optimise the performance.
If you have many small files, the long delete time is not abnormal, because every file's descriptor has to be processed individually. Some improvements can be made depending on your version; I'm going to assume you're on at least v4.
#requires -Version 4
param(
    [string]
    $start = '\\Guru\Archive',

    [int]
    $thresholdDays = 15
)

# getting the name wasn't useful. keep objects as objects
foreach ($folder in Get-ChildItem -Path $start -Directory) {
    "Processing Folder: $folder"

    # get all items once
    $folders, $files = ($folder | Get-ChildItem -Recurse).
        Where({ $_.PSIsContainer }, 'Split')

    # process files
    $files.Where{
        $_.LastWriteTime -lt (Get-Date).AddDays(-$thresholdDays)
    } | Remove-Item -Force

    # process folders
    $folders.Where{
        $_.Name -notin 'Inbound', 'Outbound' -and
        ($_ | Get-ChildItem).Count -eq 0
    } | Remove-Item -Force
}

"Complete!"
The reason it takes so much time is that you are deleting files and folders over the network, which requires additional network communication for every file and folder. You can easily verify that with a network analyzer. The best approach here is to use one of the methods that lets you run the file operations on the remote machine itself; for example, you can try to use:
WinRM
psexec (first copy the code to the remote machine and then execute it using psexec)
remote WMI (using CIM_Datafile)
or even adding the needed task to the Task Scheduler
I would prefer to use WinRM, but psexec is also a good choice (if you don't want to perform the additional configuration WinRM requires).
I am trying to compare files and directories by hash, and it is working, but I now need an easier way to figure out which files are different.
I originally started without comparing the hash, and it worked for files and folders, but it would not tell me anything other than the fact that they exist.
$Source = Get-ChildItem -recurse -Path E:\path | foreach {Get-FileHash -Path $_.FullName}
$Destination = Get-ChildItem -recurse -Path "\\server\e$\path" | foreach {Get-FileHash -Path $_.FullName}
Compare-Object -ReferenceObject $Source.hash -DifferenceObject $Destination.hash
Now this works great, but I also want to list the files associated with each hash. After I get the hashes, I then need to go back to the original directories and match each hash to figure out which file it came from.
InputObject SideIndicator
----------- -------------
CFD1DF3C08A9F7C4D81E22DA7D1CBB35FA12220C3CB85777EBA9BD89362AEDA3 =>
2B098B7FC189A87B41A7706EA7ABFFDB343B8B5AF3712BA6614E04BD3032A977 =>
D8CBDD03564C3547D8189D11A9BAE078FBD70986DBFB485EAEE5170C13113798 =>
F5D7AE29DB432EC3421EE956B70927AE394C0F27CE00FF855666DBC3E14084DB <=
85795253C6CCDC3CC2A4CAE055CC7478946CDB33D35EAE2BB5796C55954205B2 <=
9CE2A42C8FFA2D8001BA2874324987DCEF601173CB2ED8B654A76598F90B126E <=
If you are going by the hash, why not use Group-Object instead of Compare-Object? Something like this:
$Source = Get-ChildItem -recurse -Path E:\path
$Destination = Get-ChildItem -recurse -Path "\\server\e$\path"
$Source + $Destination | Group-Object @{Expression={(Get-FileHash $_.FullName).hash}} | ? {$_.Count -gt 1}
Output would be something like this:
Count Name Group
----- ---- -----
2 DF7E70E5021544F4834BBE... {b.txt, c.txt}
Compare-Object outputs differences by default.
If you want to compare by Hash and Name (without the path), there is the problem that Get-FileHash only outputs Algorithm, Hash, and the complete Path.
You can pipe Get-ChildItem output directly to Get-FileHash, but you need to attach the Name (here using a calculated property).
I'd use the -PassThru parameter and work with the whole objects, specifying the properties Hash and Name for the comparison.
## Q:\Test\2019\06\12\SO_565666700.ps1
$SourceDir = 'E:\path' # 'C:\Bat' #
$TargetDir = '\\server\e$\path' # 'K:\Bat' #
$Source = Get-ChildItem -Path $SourceDir -Recurse -PipelineVariable Item |
    Get-FileHash | Select-Object *, @{n='Name'; e={$Item.Name}}
$Target = Get-ChildItem -Path $TargetDir -Recurse -PipelineVariable Item |
    Get-FileHash | Select-Object *, @{n='Name'; e={$Item.Name}}

Compare-Object -ReferenceObject $Source -Property Name,Hash `
               -DifferenceObject $Target -PassThru |
    Sort-Object Name | Select-Object Hash,Path
I am writing a script that identifies the hashes of all the files under a path (recursively). That part works fine.
My problem comes after I have identified which hashes are the same: I want to save them into an array so that later I can delete the files that share a hash (if I want to), or just print the duplicate files. I have spent all afternoon and evening trying to figure out how to do it.
My code at the moment:
Write-Host "Write a path: "
$UserInput=Read-Host
Get-ChildItem -Path $UserInput -Recurse
#Get-FileHash cmdlet to get the hashes
$files = Get-ChildItem -Path $UserInput -Recurse | where { !$_.PSIsContainer }
$files | % {(Get-FileHash -Path $_.FullName -Algorithm MD5)}
#Creating an array for all the values and an array for the duplicates
$originals=@()
$copies=@()
#grouping the hashes that are duplicated cmdlet Group-Object:
$Duplicates = Get-ChildItem -Path $UserInput -Recurse -File |Group {($_|Get-FileHash).Hash} |Where Count -gt 1
foreach($FileGroup in $Duplicates)
{
    Write-Host "These files share hash : $($FileGroup.Name)"
    $FileGroup.Group.FullName | Write-Host
    $copies += $Duplicates
}
So the last part "$copies+=$Duplicates" does not work properly.
In the beginning I was thinking of saving the first file in the "originals" array; if a second one has the same hash, save that second one in the "copies" array. But I am not sure whether I can do that in the first part of the script, when getting the hashes.
After that, the second array would have the duplicates, so it would be easy to delete them from the computer.
I think you should filter the items. I did it so that I end up with one list containing a single representative of each set of duplicate files and another list containing all the remaining duplicates.
You can use the SHA1 algorithm instead of MD5; SHA1 is much faster than the MD5 algorithm.
$fileHashes = Get-ChildItem -Path $myFilePath -Recurse -File | Get-Filehash -Algorithm SHA1
$duplicates = $fileHashes | Group hash | ? {$_.count -gt 1} | % {$_.Group}
$uniqueItems = @{}
$doubledItems = @()
foreach($item in $duplicates) {
    if(-not $uniqueItems.ContainsKey($item.Hash)){
        $uniqueItems.Add($item.Hash,$item)
    }else{
        $doubledItems += $item
    }
}
# all duplicates files
$doubledItems
# Remove the duplicate files
# $doubledItems | % {Remove-Item $_.Path -Verbose}
# one of the duplicate files
$uniqueItems
Set the search root folder first:
$myFilePath = ''
You should only need to use Get-ChildItem once; once you have all the files, you can compute a hash for each and then group the hashes to find duplicates. See my example code below:
Write-Host "Write a path: "
$UserInput=Read-Host
#Get-FileHash cmdlet to get the hashes
$files = Get-ChildItem -Path $UserInput -Recurse | Where-Object -FilterScript { !$_.PSIsContainer }
$hashes = $files | ForEach-Object -Process {Get-FileHash -Path $_.FullName -Algorithm MD5}
$duplicates = $hashes | Group-Object -Property Hash | Where-Object -FilterScript {$_.Count -gt 1}
foreach($duplicate in $duplicates)
{
    Write-Host -Object "These files share hash : $($duplicate.Group.Path -join ', ')"

    # delete first duplicate
    # Remove-Item -Path $duplicate.Group[0].Path -Force -WhatIf

    # delete second duplicate
    # Remove-Item -Path $duplicate.Group[1].Path -Force -WhatIf

    # delete all duplicates except the first
    # foreach($duplicatePath in ($duplicate.Group.Path | Select-Object -Skip 1))
    # {
    #     Remove-Item -Path $duplicatePath -Force -WhatIf
    # }
}
Uncomment the code at the end to delete duplicates based on your preferences and when you're ready to delete files make sure you also remove the -WhatIf parameter.
This is the output I receive from the above command if I uncomment the "delete all duplicates except the first" block:
Write a path:
H:\
These files share hash : H:\Rename template 2.csv, H:\Rename template.csv
What if: Performing the operation "Remove File" on target "H:\Rename template.csv".
I am trying to count the files in all subfolders in a directory and display them in a list.
For instance, the following directory tree:
TEST
  /VOL01
    file.txt
    file.pic
  /VOL02
    /VOL0201
      file.nu
      /VOL020101
        file.jpg
        file.erp
        file.gif
  /VOL03
    /VOL0301
      file.org
Should give as output:
PS> DirX C:\TEST
Directory                 Count
-------------------------------
VOL01                         2
VOL02                         0
VOL02/VOL0201                 1
VOL02/VOL0201/VOL020101       3
VOL03                         0
VOL03/VOL0301                 1
I started with the following:
Function DirX($directory)
{
    foreach ($file in Get-ChildItem $directory -Recurse)
    {
        Write-Host $file
    }
}
Now I have a question: why is my Function not recursing?
Something like this should work:
dir -recurse | ?{ $_.PSIsContainer } | %{ Write-Host $_.FullName (dir $_.FullName | Measure-Object).Count }
dir -recurse lists all files under current directory and pipes (|) the result to
?{ $_.PSIsContainer } which filters directories only then pipes again the resulting list to
%{ Write-Host $_.FullName (dir $_.FullName | Measure-Object).Count } which is a foreach loop that, for each member of the list ($_) displays the full name and the result of the following expression
(dir $_.FullName | Measure-Object).Count which provides a list of files under the $_.FullName path and counts members through Measure-Object
?{ ... } is an alias for Where-Object
%{ ... } is an alias for ForEach-Object
Similar to David's solution, this will work in PowerShell v3.0, and it does not use aliases in case someone is not familiar with them:
Get-ChildItem -Directory | ForEach-Object { Write-Host $_.FullName $(Get-ChildItem $_ | Measure-Object).Count}
Answer Supplement
Based on a comment about keeping your function and loop structure, I provide the following. Note: I do not condone this solution, as it is ugly and the built-in cmdlets handle this very well. However, I like to help, so here is an update of your script.
Function DirX($directory)
{
    $output = @{}
    foreach ($singleDirectory in (Get-ChildItem $directory -Recurse -Directory))
    {
        $count = 0
        foreach($singleFile in Get-ChildItem $singleDirectory.FullName)
        {
            $count++
        }
        $output.Add($singleDirectory.FullName,$count)
    }
    $output | Out-String
}
For each $singleDirectory, count all files using $count (which gets reset before the next sub-loop) and add each finding to a hash table. At the end, output the hash table as a string. In your question it looked like you wanted object output instead of straight text.
Well, the way you are doing it, the entire Get-ChildItem call needs to complete before the foreach loop can begin iterating. Are you sure you're waiting long enough? If you run that against very large directories (like C:\) it is going to take a pretty long time.
Edit: I saw you asked earlier for a way to make your function do what you want; here you go.
Function DirX($directory)
{
    foreach ($file in Get-ChildItem $directory -Recurse -Directory )
    {
        [pscustomobject] @{
            'Directory' = $File.FullName
            'Count'     = (GCI $File.FullName -Recurse).Count
        }
    }
}
DirX D:\
The foreach loop only gets directories, since that is all we care about; then, inside the loop, a custom object is created for each iteration with the full path of the folder and the count of the items inside it.
Also, please note that this will only work in PowerShell 3.0 or newer, since the -directory parameter did not exist in 2.0
Get-ChildItem $rootFolder -Recurse -Directory |
    Select-Object FullName,
        @{Name="FileCount"; Expression={ (Get-ChildItem $_ -File | Measure-Object).Count }}
My version - slightly cleaner and dumps content to a file
Original - Recursively count files in subfolders
Second Component - Count items in a folder with PowerShell
$FOLDER_ROOT = "F:\"
$OUTPUT_LOCATION = "F:DLS\OUT.txt"
Function DirX($directory)
{
    Remove-Item $OUTPUT_LOCATION
    foreach ($singleDirectory in (Get-ChildItem $directory -Recurse -Directory))
    {
        $count = Get-ChildItem $singleDirectory.FullName -File | Measure-Object | %{$_.Count}
        $summary = $singleDirectory.FullName+" "+$count+" "+$singleDirectory.LastAccessTime
        Add-Content $OUTPUT_LOCATION $summary
    }
}
DirX($FOLDER_ROOT)
I modified David Brabant's solution just a bit so I could evaluate the result:
$FileCounter=gci "$BaseDir" -recurse | ?{ $_.PSIsContainer } | %{ (gci "$($_.FullName)" | Measure-Object).Count }
Write-Host "File Count=$FileCounter"
If($FileCounter -gt 0) {
... take some action...
}