I'm running the following MD5 check on 500 million files to find duplicates. The script is taking forever to run. How could I speed it up? Could I use a try/catch block instead of Contains, so that an error is thrown when the hash already exists? What would you all recommend?
$folder = Read-Host -Prompt 'Enter a folder path'
$hash = @{}
$lineCheck = 0
Get-ChildItem $folder -Recurse | where {! $_.PSIsContainer} | ForEach-Object {
$lineCheck++
Write-Host $lineCheck
$tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash;
if(! $hash.Contains($tempMD5)){
$hash.Add($tempMD5,$_.FullName)
}
else{
Remove-Item -literalPath $_.fullname;
}
}
As suggested in the comments, consider hashing a file only when another file of the same length has already been found. That way the expensive hash method is never invoked for a file whose length is unique.
Note that the Write-Host command is quite expensive by itself, so I would not display every iteration (Write-Host $lineCheck) but only, for example, when a match is found.
$Folder = Read-Host -Prompt 'Enter a folder path'
$FilesBySize = @{}
$FilesByHash = @{}
# Hash one file, record it under its hash, and return $True if that hash was already known
Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}
Get-ChildItem $Folder -Recurse | Where-Object -Not PSIsContainer | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # Second file of this size: hash the first one now, since it was skipped earlier
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        # Pass the full path explicitly rather than relying on FileInfo-to-string conversion
        If ($Files.Count -ge 1) {If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}
Display the found duplicates:
ForEach($Hash in $FilesByHash.GetEnumerator()) {
If ($Hash.Value.Count -gt 1) {
Write-Host 'Hash:' $Hash.Name
ForEach ($File in $Hash.Value) {
Write-Host 'File:' $File
}
}
}
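For smaller trees, the same size-first idea can be sketched as a single pipeline. This is only an illustration under stated assumptions, not a replacement for the script above: Group-Object keeps every file object in memory, so it is unlikely to suit hundreds of millions of files. The temporary folder and the file names are made up for the demo.

```powershell
# Build a throwaway folder with four small files (hypothetical demo data).
$Folder = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
$null = New-Item -ItemType Directory -Path $Folder
Set-Content -LiteralPath (Join-Path $Folder 'a.txt') -Value 'same content'
Set-Content -LiteralPath (Join-Path $Folder 'b.txt') -Value 'same content'
Set-Content -LiteralPath (Join-Path $Folder 'c.txt') -Value 'different!!!'        # same size, other content
Set-Content -LiteralPath (Join-Path $Folder 'd.txt') -Value 'unique length here'  # unique size, never hashed

# Group by size first, hash only the files that share a size, then group by hash.
$duplicateGroups = Get-ChildItem -LiteralPath $Folder -Recurse -File |
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group } |
    Get-FileHash -Algorithm MD5 |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 }

# List each set of duplicates.
foreach ($group in $duplicateGroups) {
    "Hash: $($group.Name)"
    $group.Group | ForEach-Object { "File: $($_.Path)" }
}

# Clean up the demo folder.
Remove-Item -LiteralPath $Folder -Recurse
```

Here only a.txt, b.txt and c.txt are hashed, and only a.txt and b.txt end up in a duplicate group.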
I'd guess that the slowest part of your code is the Get-FileHash invocation, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).
You could try replacing it with a call to a native tool that has a more optimized MD5 implementation and see if it helps.
Could I use a try catch loop instead of contains to throw an error when the hash already exists instead?
Exceptions are slow and using them for flow control is not recommended:
DA0007: Avoid using exceptions for control flow
While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handler as part of the regular program execution logic can be expensive and should be avoided
https://stackoverflow.com/a/162027/4424236
There is the definitive answer to this from the guy who implemented them - Chris Brumme. He wrote an excellent blog article about the subject (warning - its very long)(warning2 - its very well written, if you're a techie you'll read it to the end and then have to make up your hours after work :) )
The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even pass the ring 0 CPU boundary!
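A minimal sketch of the two approaches, using made-up hash strings in place of real file hashes. Both loops count the same duplicates; the second one raises and catches an exception for every duplicate, which is the cost the quotes above warn about. Wrapping each loop in Measure-Command is one way to compare them on your own data.

```powershell
# Hypothetical sample data: 'A1' appears three times, so there are two duplicates.
$hashes = 'A1', 'B2', 'A1', 'C3', 'A1'

# Approach 1: test for the key first, so no exception is ever raised.
$seen = @{}
$dupesByContains = 0
foreach ($h in $hashes) {
    if ($seen.ContainsKey($h)) { $dupesByContains++ }
    else { $seen.Add($h, $true) }
}

# Approach 2: let Add() throw on a duplicate key and catch the error.
$seen = @{}
$dupesByCatch = 0
foreach ($h in $hashes) {
    try { $seen.Add($h, $true) }
    catch { $dupesByCatch++ }
}

"Contains: $dupesByContains, try/catch: $dupesByCatch"
```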
I know this is a PowerShell question, but you can make good use of parallelization in C#. You also mentioned in one of the comments using C# as an alternative, so I thought it wouldn't hurt to post a possible implementation of how it could be done.
You could first create a method to calculate the MD5 Checksum for a file:
private static string CalculateMD5(string filename)
{
using var md5 = MD5.Create();
using var stream = File.OpenRead(filename);
var hash = md5.ComputeHash(stream);
return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}
Then you could write a method which computes all file hashes in parallel using ParallelEnumerable.AsParallel():
private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
{
var allFiles = Directory
.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);
var hashedFiles = allFiles
.AsParallel()
.Select(filename => new FileHash {
FileName = filename,
Hash = CalculateMD5(filename)
});
return hashedFiles;
}
Then you can simply use the above method to delete duplicate files:
private static void DeleteDuplicateFiles(string directoryPath)
{
var fileHashes = new HashSet<string>();
foreach (var fileHash in FindFileHashes(directoryPath))
{
if (!fileHashes.Contains(fileHash.Hash))
{
Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
fileHashes.Add(fileHash.Hash);
continue;
}
Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
File.Delete(fileHash.FileName);
}
}
Full Program:
using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Security.Cryptography;
namespace Test
{
internal class FileHash
{
public string FileName { get; set; }
public string Hash { get; set; }
}
public class Program
{
public static void Main()
{
var path = @"C:\Path\To\Files";
if (Directory.Exists(path))
{
Console.WriteLine($"Deleting duplicate files at {path}");
DeleteDuplicateFiles(path);
}
}
private static void DeleteDuplicateFiles(string directoryPath)
{
var fileHashes = new HashSet<string>();
foreach (var fileHash in FindFileHashes(directoryPath))
{
if (!fileHashes.Contains(fileHash.Hash))
{
Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
fileHashes.Add(fileHash.Hash);
continue;
}
Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
File.Delete(fileHash.FileName);
}
}
private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
{
var allFiles = Directory
.EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);
var hashedFiles = allFiles
.AsParallel()
.Select(filename => new FileHash {
FileName = filename,
Hash = CalculateMD5(filename)
});
return hashedFiles;
}
private static string CalculateMD5(string filename)
{
using var md5 = MD5.Create();
using var stream = File.OpenRead(filename);
var hash = md5.ComputeHash(stream);
return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}
}
}
If you're trying to find duplicates, the fastest way to do this is to use something like jdupes or fdupes. Both are written in C and perform very well.
Related
How do I call methods within workflow?
I am trying to call "Task" from within a workflow, and it seems to be ignored. I have provided watered-down nonsense code to illustrate what I'm trying to do: basically, call a method within the class in parallel and return the results.
Sample Code
Class Something {
[string]Task($item) {
Start-sleep -Seconds 10
Return "Result"
}
[System.Array]GetSomething($list) {
workflow GetWF {
param($listarr)
ForEach -parallel ($item in $listarr) {
$res = InlineScript {
Write-Host("Starting.." + $using:item)
$this.Task($using:item)
}
}
$res
}
Return GetWF -listarr $list
}
}
$list = @('host1','host2','host3','host4')
$Something = [Something]::New()
$Something.GetSomething($list)
Output :
Starting..host1
Starting..host4
Starting..host2
Starting..host3
Desired Result from Example
The issue is: how do I get my results back as an array? In the example above, I would like the final result to be this:
$Result = @("Result","Result","Result","Result")
How would I implement the following JavaScript code snippet in PowerShell?
String.prototype.regexCount = function (pattern) {
if (pattern.flags.indexOf("g") < 0) {
pattern = new RegExp(pattern.source, pattern.flags + "g");
}
return (this.match(pattern) || []).length;
};
I'm thinking it's something like this:
$regexCount = {
param(
$pattern
)
# ??????
if ($pattern.flags.indexOf("g") -lt 0) {
# ????
# $pattern = new RegExp(pattern.source, pattern.flags + "g");
$pattern = [regex]::new($pattern)
}
# ????
return ($this.match($pattern) || []).length;
}
I have almost the entire script converted to PowerShell except for this little nugget of code. Actually, I'm a little bit clueless when JavaScript starts creating lambda functions with regular expression objects.
For instance, what's the significance of String.prototype.somename? Wouldn't you just save the lambda to any variable name?
Using Update-TypeData, create a type-level ScriptMethod ETS member for the .NET string type (System.String):
Update-TypeData -TypeName System.String -MemberName RegexCount -MemberType ScriptMethod -Value {
param([regex] $Regex)
$Regex.Matches($this).Count
}
Now you can call the .RegexCount() method on any string instance, analogous to what your JavaScript code does.
Sample call:
'foo'.RegexCount('.') # -> 3
That is, 3 matches for regex . were found in the input string.
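No counterpart to the JavaScript g flag is needed here, because the .NET Regex.Matches() method always returns every match. If you would rather not extend System.String, the same count can be sketched as a plain function; Get-RegexCount is a hypothetical name chosen for this example, not a built-in command.

```powershell
# Hypothetical helper: count all matches of a pattern in a string.
function Get-RegexCount {
    param(
        [string] $InputString,
        [regex]  $Pattern
    )
    # Matches() returns every match, so no "global" flag is required.
    $Pattern.Matches($InputString).Count
}

# 'fo+' matches both occurrences of 'foo' in the sample string.
$count = Get-RegexCount -InputString 'foo bar foo' -Pattern 'fo+'
$count
```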
I am working on implementing a singleton class to store some regularly accessed status information for my script, including hacking around the issue of $myInvocation only being populated in the main script. All working as planned with this.
class pxStatus {
static [pxStatus] $singleton = $null
[string]$Context = 'machine'
[string]$Path = $null
[datetime]$StartTime = (Get-Date)
pxStatus ([string]$path) {
if ([pxStatus]::singleton -eq $null) {
$this.Path = $path
[pxStatus]::singleton = $this
} else {
Throw "Singleton already initialized"
}
}
static [pxStatus] Get() {
if ([pxStatus]::singleton -eq $null) {
Throw "Singleton not yet initialized"
} else {
return [pxStatus]::singleton
}
}
}
CLS
[void]([pxStatus]::New((Split-Path ($myInvocation.myCommand.path) -parent)))
([pxStatus]::Get()).StartTime
([pxStatus]::Get()).Context
([pxStatus]::Get()).Path
With one exception. Even with that [void] on the [pxStatus]::New() line, I am getting a blank line in the console. Even $null = ([pxStatus]::New((Split-Path ($myInvocation.myCommand.path) -parent))) is echoing a blank line to the console. And for the life of me I can't see what is causing it.
It's not new that causes a blank line but ([pxStatus]::Get()).StartTime.
To fix the issue, you may output it as string, i.e. not formatted, e.g. ([pxStatus]::Get()).StartTime.ToString()
Your problem has already been diagnosed, but I wanted to take a second to show how to actually implement a singleton-like type in PowerShell (see inline comments):
class pxStatus {
# hide backing field from user
hidden static [pxStatus] $singleton = $null
[string]$Context = 'machine'
[string]$Path = $null
[datetime]$StartTime = (Get-Date)
# hide instance constructor, no one should call this directly
hidden pxStatus ([string]$path) {
# Only allow to run if singleton instance doesn't exist already
if ($null -eq [pxStatus]::singleton) {
$this.Path = $path
} else {
Throw "Singleton already initialized - use [pxStatus]::Get()"
}
}
# Use a static constructor to initialize singleton
# guaranteed to only run once, before [pxStatus]::Get() or [pxStatus]::singleton
static pxStatus () {
# grab the path from context, don't rely on user input
if(-not $PSScriptRoot){
throw "[pxStatus] can only be used in scripts!"
}
# this will only succeed once anyway
[pxStatus]::singleton = [pxStatus]::new($PSScriptRoot)
}
static [pxStatus] Get() {
# No need to (double-)check ::singleton, static ctor will have run already
return [pxStatus]::singleton
}
}
[pxStatus]::Get().StartTime
I am refactoring some function based XML reader code to class methods, and seeing some issues. With the function, I can run a test and verify the XML loaded right, then change the XML and test for error conditions. But this class based approach fails due to "the file is open in another program", forcing me to close the console before I can revise the XML.
Initially I was using the path directly in the xmlReader. So I moved to a StreamReader input to the xmlReader. And I even played with creating an all new xmlDocument and importing the root node of the loaded XML into that new xmlDocument. None works.
I suspect the reason the function based version works is because the xmlReader variable is local scope, so it goes out of scope when the function completes. But I'm grasping at straws there. I also read that Garbage Collection could be an issue, so I added [system.gc]::Collect() right after the Dispose and still no change.
class ImportXML {
# Properties
[int]$status = 0
[xml.xmlDocument]$xml = ([xml.xmlDocument]::New())
[collections.arrayList]$message = ([collections.arrayList]::New())
# Methods
[xml.xmlDocument] ImportFile([string]$path) {
$importError = $false
$importFile = ([xml.xmlDocument]::New())
$xmlReaderSettings = [xml.xmlReaderSettings]::New()
$xmlReaderSettings.ignoreComments = $true
$xmlReaderSettings.closeInput = $true
$xmlReaderSettings.prohibitDtd = $false
try {
$streamReader = [io.streamReader]::New($path)
$xmlreader = [xml.xmlreader]::Create($streamReader, $xmlReaderSettings)
[void]$importFile.Load($xmlreader)
$xmlreader.Dispose
$streamReader.Dispose
} catch {
$exceptionName = $_.exception.GetType().name
$exceptionMessage = $_.exception.message
switch ($exceptionName) {
Default {
[void]$this.message.Add("E_$($exceptionName): $exceptionMessage")
$importError = $true
}
}
}
if ($importError) {
$importFile = $null
}
return $importFile
}
}
class SettingsXML : ImportXML {
# Constructor
SettingsXML([string]$path){
if ($this.xml = $this.ImportFile($path)) {
Write-Host "$path!"
} else {
Write-Host "$($this.message)"
}
}
}
$settingsPath = '\\Mac\iCloud Drive\Px Tools\Dev 4.0\Settings.xml'
$settings = [SettingsXML]::New($settingsPath)
EDIT:
I also tried a FileStream rather than a StreamReader, with FileShare of ReadWrite, like so
$fileMode = [System.IO.FileMode]::Open
$fileAccess = [System.IO.FileAccess]::Read
$fileShare = [System.IO.FileShare]::ReadWrite
$fileStream = New-Object -TypeName System.IO.FileStream $path, $fileMode, $fileAccess, $fileShare
Still no luck.
I think you're on the right lines with Dispose, but you're not actually invoking the method - you're just getting a reference to it and then not doing anything with it...
Compare:
PS> $streamReader = [io.streamReader]::New(".\test.xml");
PS> $streamReader.Dispose
OverloadDefinitions
-------------------
void Dispose()
void IDisposable.Dispose()
PS> _
with
PS> $streamReader = [io.streamReader]::New(".\test.xml");
PS> $streamReader.Dispose()
PS> _
You need to add some () after the method name so your code becomes:
$xmlreader.Dispose()
$streamReader.Dispose()
And then it should release the file lock ok.
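To make sure the handle is released even when loading fails, the Dispose() call can be moved into a finally block. The following is a minimal sketch of that pattern, not the asker's class: Import-XmlFile, the temporary file and its contents are all made up for illustration.

```powershell
# Hypothetical helper: load an XML file and always release the file handle.
function Import-XmlFile([string]$Path) {
    $doc = [System.Xml.XmlDocument]::new()
    $streamReader = $null
    try {
        $streamReader = [System.IO.StreamReader]::new($Path)
        $doc.Load($streamReader)
    }
    finally {
        # Note the parentheses: Dispose() invokes the method,
        # whereas Dispose alone only returns a reference to it.
        if ($null -ne $streamReader) { $streamReader.Dispose() }
    }
    return $doc
}

# Usage with a throwaway file
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -LiteralPath $tmp -Value '<settings><mode>demo</mode></settings>'
$xml = Import-XmlFile $tmp
$mode = $xml.settings.mode

# The file can be removed (or edited) straight away because the reader was disposed.
Remove-Item -LiteralPath $tmp
```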
I am trying to parse values from an XML file: one foreach loop adds items to a collection, then another foreach loop adds the items from that collection to a second collection together with an additional value. This is what I have so far:
[xml]$testResults = Get-Content -Path $testResultsPath
$resultsByName = @{}
$resultsByPhone = @{}
$loop = 0
foreach($testCase in $testResults.'test-results'.'test-suite')
{
foreach($testCase in $testResults.'test-results'.'test-suite'[$loop].'results'.'test-suite'.'results'.'test-suite'.'results'.
'test-suite'.'results'.'test-suite'.'results'.'test-suite'.'results'.'test-suite'.'results'.'test-case')
{
$NameWithPone = $testCase.name.ToUpper().Substring($testCase.name.LastIndexOf('.')+1);
$Name =$NameWithPone.Substring(0, $NameWithPone.IndexOf('_'));
$PhoneVersion = $testCase.name.Substring($testCase.name.IndexOf('_')+1);
$resultsByName.Add($PhoneVersion, $Name)
Foreach($resultCase in $resultsByName)
{
$resultsByPhone.Add($resultsByName, $testCase.result)
}
}
$loop++
}
But this only adds the first result, then gives the error "Item has already been added. Key in dictionary: 'System.Collections.Hashtable' Key being added: 'System.Collections.Hashtable'". I think this is because I am adding the same item each time. How can I correct this?
The first collection will look like this:
google_pixel_xl-7_1_1 TESTTHATAREGISTEREDUSERCANLOGINTOTHECUSTOMERAPPSUCCESSFULLY
htc_10-6_0_1 TESTTHATAREGISTEREDUSERCANLOGINTOTHECUSTOMERAPPSUCCESSFULLY
oneplus_one-4_4_4 TESTTHATAREGISTEREDUSERCANLOGINTOTHECUSTOMERAPPSUCCESSFULLY
But I want to add both values together to another collection which would look like:
google_pixel_xl-7_1_1 TESTTHATAREGISTEREDUSERCANLOGINTOTHECUSTOMERAPPSUCCESSFULLY Error
I got this done by doing this:
[xml]$testResults = Get-Content -Path $testResultsPath
foreach($testCase in $testResults.'test-results'.'test-suite')
{
function Get-TestCases($myResults)
{
$testCases = @()
foreach($child in $myResults.ChildNodes)
{
if($child.'test-case' -eq $null)
{
foreach($testCase in Get-TestCases $child)
{
$testCases += $testCase
}
}
else
{
$testCases += $child.'test-case'
}
}
return $testCases
}
$tests = Get-TestCases $testResults.'test-results'.'test-suite'[$loop]
foreach($test in $tests)
{
$PhoneVersion = $test.name.Substring($test.name.IndexOf('_')+1);
$resultsByPhone.Add($PhoneVersion, @{})
$NameWithPhone = $test.name.ToUpper().Substring($test.name.LastIndexOf('.')+1);
$Name =$NameWithPhone.Substring(0, $NameWithPhone.IndexOf('_'));
$resultsByPhone[$phoneVersion].Add($Name, $test.result)
}
$resultsByPhone[$phoneVersion]
$resultsByPhone
$loop++
}
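For reference, the original error comes from $resultsByPhone.Add($resultsByName, ...): the key is the $resultsByName hashtable object itself, so the second call tries to add the same key again. Below is a minimal sketch of the nested-hashtable layout keyed by phone version. The $cases objects are made-up stand-ins for the test-case nodes, and the name format is an assumption based on the sample output in the question.

```powershell
# Hypothetical stand-ins for the XML test-case nodes (name and result only).
$cases = @(
    [pscustomobject]@{ name = 'Tests.Login.TestLogin_google_pixel_xl-7_1_1'; result = 'Error' }
    [pscustomobject]@{ name = 'Tests.Login.TestLogin_htc_10-6_0_1';          result = 'Passed' }
    [pscustomobject]@{ name = 'Tests.Login.TestLogout_htc_10-6_0_1';         result = 'Passed' }
)

$resultsByPhone = @{}
foreach ($case in $cases) {
    # Keep only the part after the last dot, e.g. 'TestLogin_htc_10-6_0_1'.
    $short = $case.name.Substring($case.name.LastIndexOf('.') + 1)
    $split = $short.IndexOf('_')
    $testName = $short.Substring(0, $split).ToUpper()
    $phone = $short.Substring($split + 1)

    # Create the inner hashtable once per phone, then index into it.
    if (-not $resultsByPhone.ContainsKey($phone)) { $resultsByPhone[$phone] = @{} }
    $resultsByPhone[$phone][$testName] = $case.result
}

# One inner table per phone, mapping test name to result.
$resultsByPhone['htc_10-6_0_1']
```

Assigning through the indexer instead of calling Add() also means a repeated test name overwrites the earlier value rather than throwing.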