PowerShell large folder structure compare

I am looking to compare two large folder structures against each other, and also to compare the current state of a folder structure to a known-good historical state, and I need the most performant option possible.
The folder structure in question is an Autodesk "deployment", which is a bloated mess of an installer with upwards of 15 GB of data, 15,000+ files and 2,500+ folders. Deployments are also fragile: seemingly innocuous changes can break them. I ran into one the other day where the absence of a Thumbs.db file caused the entire install of Revit 2018 to fail. So I want to be able both to copy deployments and do a hash compare of source and destination to ensure everything copied properly, and to compare a current state with a historical state to make sure nothing has changed.
My initial thought was that I could use Get-ChildItem to get a full listing of every folder and file name and Get-FileHash that listing as a first step, then apportion all the file paths into an array to dole out to multiple jobs that Get-FileHash all the files, and then combine all those hashes to get a single hash for the folder. That can then be compared with the hash for the other folder, or the historical hash, to determine if anything has changed. I have benchmarked Get-FileHash running as a single-threaded sequence on an example deployment, and it takes a reasonable couple of minutes, so multithreading that and reducing the time by a factor of 4 or so on average would be quite doable.
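Something like this minimal sketch is what I have in mind (it uses ForEach-Object -Parallel, which needs PowerShell 7; jobs would work similarly, and the deployment paths in the comparison at the end are just placeholders):

function Get-TreeHash {
    param([string]$Root, [int]$Threads = 4)
    $rootPath = (Get-Item -LiteralPath $Root).FullName.TrimEnd('\')
    $prefix   = $rootPath.Length + 1          # strip the root so only relative paths are recorded
    $lines = Get-ChildItem -LiteralPath $rootPath -File -Recurse |
        ForEach-Object -Parallel {
            $h = (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256).Hash
            '{0}  {1}' -f $h, $_.FullName.Substring($using:prefix)
        } -ThrottleLimit $Threads
    # Sort so the result does not depend on enumeration order, then hash the
    # combined manifest to get a single fingerprint for the whole tree.
    $bytes = [System.Text.Encoding]::UTF8.GetBytes((($lines | Sort-Object) -join "`n"))
    (Get-FileHash -InputStream ([System.IO.MemoryStream]::new($bytes)) -Algorithm SHA256).Hash
}

# Compare a copied deployment against its source, or against a stored known-good hash:
if ((Get-TreeHash 'D:\Deployments\Revit2018') -eq (Get-TreeHash '\\server\Deployments\Revit2018')) {
    'Deployment copied intact'
}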
That said, comparing two folder structures seems like the kind of thing that might already be implemented, and much faster, in Windows itself. So before I go down that rabbit hole I thought it best to see if it's the right hole in the first place.

Related

Will the depth of a file within a filesystem change the time taken to copy it?

I am trying to figure out whether or not the depth of a file in a filesystem changes the amount of time it takes to execute a "cp" bash command on that file.
By depth I mean how many parent directories it is contained in.
I tried running a few tests, but my results are pretty inconclusive, and when I try to reason it out, I can think of arguments for either outcome.
What is the purpose of this?
Provided nothing is cached, the deeper the directory tree, the more data has to be read from storage to get to the file: you have to look up the name of the second directory, then the third within the second, and so on. On the other hand, if the file is big, the time needed to do this can be negligible in comparison.
Also, the mere startup of a command like cp is not without its cost.
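If you want to measure it rather than reason about it, one rough approach (sketched here in PowerShell with made-up paths; the same idea works with bash's time and cp) is to time the same copy from a shallow and a deeply nested location, bearing in mind that filesystem caching will dominate unless you account for it:

# Time copying the same file from a shallow and a deep path; treat the numbers as indicative only.
Measure-Command { Copy-Item 'C:\test\file.bin' 'C:\out\1.bin' } |
    Select-Object TotalMilliseconds
Measure-Command { Copy-Item 'C:\test\a\b\c\d\e\f\g\h\file.bin' 'C:\out\2.bin' } |
    Select-Object TotalMilliseconds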
If you are interested in how file systems work read this free book: http://www.nobius.org/~dbg/practical-file-system-design.pdf
Performance is a complicated subject, especially so when hard media is involved. Without proper understanding of how this works and proper understanding of statistics, you can't perform a correct test.

A custom directory/folder merge tool

I am thinking about developing a custom directory/folder merge tool, partly to learn functional programming and partly to scratch a very personal itch.
I usually work on three different computers and I tend to accumulate lots of files (text, video, audio) locally and then painstakingly merge them for backup purposes. I am pretty sure I have dupes and unwanted files lying around wasting space. I am moving to a cloud backup solution as a secondary backup source and I want to save as much space as possible by eliminating redundant files.
I have a complex, deeply nested directory structure and I want an automated tool that walks down the folder tree and performs the merge. Another problem is that I use a mix of Linux and Windows and many of my files have spaces in the name...
My initial thought was that I need to generate hashes for every file and compare using hashes rather than file names (spaces in folder names as well as the contents of files could differ between source and target). Is RIPEMD-160 a good balance between performance and collision avoidance, or is SHA-1 enough? Is SHA-256/512 overkill?
Which functional programming env comes with a set of ready made libraries for generating these hashes? I am leaning towards OCaml...
Check out the Unison file synchronizer.
I don't use it myself, but I have heard quite a few positive reviews. It is mature software built on a solid theoretical foundation.
Also, it is written in OCaml.
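If you do end up rolling your own, the core hash-and-compare step is small in any language. As a rough illustration (in PowerShell rather than OCaml, and with made-up paths), grouping files by content hash surfaces duplicates regardless of their names:

# Any hash group with more than one member is a duplicate candidate.
Get-ChildItem 'D:\laptop', 'D:\desktop' -File -Recurse |
    Get-FileHash -Algorithm SHA256 |
    Group-Object Hash |
    Where-Object Count -gt 1 |
    ForEach-Object { $_.Group.Path }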

Cryptographically Secure Backup

Until now I have been doing backups with rsync from my computer to an external drive. The backup data is made of tens of thousands of small files and hundreds of big ones (Maildir email messages and episodes of my favorite series). The problem with this is that if a single sector of my backup disk fails, perhaps a single message may be corrupted, which I find intolerable.
I have thought of an alternative that works as follows. There are three trees: the file tree consisting of the data I wish to backup, the backup tree containing a copy of the file tree at a given moment in time and a hash tree which contains file hashes and metadata hashes of the backup tree. A hash of the whole hash tree is also kept. Prior to a backup, the hash of the hash tree is checked. A failure here invalidates the whole backed up data. After the check succeeds, the hash tree shape is compared to the backup tree shape and the metadata hashes are verified to ensure the backup tree is metadata and shape consistent. If it is not, individual culprits can be listed. After this, the rsync backup traversal is performed. Whenever rsync updates a file, its new hash and metadata hash are computed and inserted into the hash tree. Whenever rsync deletes a file, that file is removed from the hash tree. In the end, the hash of the hash tree is computed and stored.
This process is very useful because the hashes are computed for correct data, meaning even if a file in the file tree is corrupted after it has been inserted in the hash tree, this inconsistency does not invalidate the backup (or future backups). The most important property, however, is that if an attacker corrupts the backup medium however he likes, the information that lies there will be trusted if and only if it is correct, unless the attacker has broken the hash algorithm. Also, the data sent to the backup or restored from it can be verified incrementally.
My question is: is there a reasonable implementation of such a backup scheme? My searches tell me that the only backup schemes available either do full or differential backups (tar based, for instance) or fail to provide a cryptographic correctness guarantee (rsync).
If there are no implementations of anything like that, maybe I will write one, but I would like to avoid reinventing the wheel.
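For reference, the bookkeeping described above can be sketched in a few lines of PowerShell (the paths and manifest file names are made up, and this illustrates only the file-hash part of the scheme, not the metadata hashes):

$backupRoot = 'E:\Backup'
$manifest   = 'E:\hashes.csv'

# Record state: one hash per file, plus a hash of the manifest itself.
Get-ChildItem $backupRoot -File -Recurse |
    Get-FileHash -Algorithm SHA256 |
    Export-Csv $manifest -NoTypeInformation
(Get-FileHash $manifest -Algorithm SHA256).Hash | Set-Content 'E:\hashes.csv.sha256'

# Verify later: check the manifest hash first, then re-hash the files and list culprits.
if ((Get-FileHash $manifest -Algorithm SHA256).Hash -ne (Get-Content 'E:\hashes.csv.sha256')) {
    throw 'The hash tree itself is corrupt; the backup cannot be trusted.'
}
$old = Import-Csv $manifest
$new = Get-ChildItem $backupRoot -File -Recurse | Get-FileHash -Algorithm SHA256
Compare-Object $old $new -Property Path, Hash    # anything listed here has changed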
What you're talking about sounds a lot like Git. I think it would pretty much do what you're describing. Just implement the process of "backing up" as git commit. You can then restore to any previous version with git checkout.
It is amazingly storage efficient and extremely fast for transfering content, which would probably save you a lot of time on your backups. As a bonus, it's free, portable and already debugged!
This sounds almost identical to how the Mercurial storage system works. The 'rsync command' would be implemented using Mercurial's push, which is remarkably network efficient.
If I had to solve the problem, I'd use a RAID array (to prevent corruption) of drives with built-in AES encryption, and then use whatever backup method I am used to.
Git-Annex is the proper solution to this problem given available tools. It is an extension to git which allows robust support for files which are arbitrarily large, synchronizes between datastores automatically, has an optional graphical user interface, tracks how many backups you have and precisely what files are stored where, and allows you to set rules for how it should manage different content. You can also customize what cryptographic hashes are used to validate the integrity of the content.
For needs of drive backups, git-annex has interoperability with bup which has more features tuned towards those looking for regular backups of entire systems.

How many sub-directories should be put into a directory

At SO there has been much discussion about how many files in a directory are appropriate: on older filesystems stay below a few thousand; on newer ones, below a few hundred thousand.
Generally the suggestion is to create sub-directories for every few thousand files.
So the next question is: what is the maximum number of sub-directories I should put into a directory? Nesting them too deep kills directory-tree traversal performance. Is there such a thing as nesting them too shallow?
From a practicality standpoint, applications might not handle directories with large numbers of entries well.
For example, Windows Explorer gets bogged down with several thousand directory entries (I've had Vista crash, but XP seems to handle it better).
Since you mention nesting directories also keep in mind that there are limits to the length of fully qualified (with drive designator and path) filenames (See wikipedia 'filename' entry). This will vary with the operating system file system (See Wikipedia 'comparison on file systems' entry).
For Windows NTFS it is supposed to be 255; however, I have encountered problems with commands and API functions when fully qualified filenames reach about 120 characters. I have also had problems with long path names on mapped network drives (at least with Vista and Internet Explorer 7).
Also there are limitations on the nesting level of subdirectories. For example CD-ROM (ISO 9660) is limited to 8 directory levels (something to keep in mind if you would want to copy your directory structure to a CD-ROM, or another filesystem).
So there is a lot of inconsistency when you push the file system to extremes
(while the file system may be able to handle it theoretically, apps and libraries may not).
It really depends on the OS you are using, since directory manipulations are done using system calls. For Unix-based OSes, inode lookup algorithms are highly efficient and the number of files and folders in a directory does not matter much. Maybe that's why there is no limit on it in Unix-based systems. In Windows, however, it varies from filesystem to filesystem.
Usually modern filesystems (like NTFS or ext3) don't have a problem with accessing files directly (i.e. if you are trying to open /foo/bar/baz.dat). Where you can run into problems is enumerating the subdirectories/files in a given directory (i.e. give me all the files/dirs from /foo). This can occur in multiple scenarios (for example while debugging, or during backup, etc.). I found that keeping the child count at around a couple of hundred at most gave me acceptable response times.
Of course this varies from situation to situation, so do test :-)
My guess is as few as possible.
At the ISP I was working for (back in 2003) we had lots of user emails and web files. We structured them with MD5-hashed usernames, 3 levels deep (i.e. /home/a/b/c/abcuser). This resulted in maybe up to 100 users inside each third-level directory.
You can also make the structure deeper and keep the user directories themselves shallower. The best option is to try and see, but the smaller the directory count, the faster the lookup.
I've come across a similar situation recently. We were using the file system to store serialized trade details. These would only be looked at infrequently and it wasn't worth the pain to store them in a database.
We found that Windows and Linux coped with a thousand or so files but it did get much slower accessing them - we organised them in sub-dirs in a logical grouping and this solved the problem.
It was also easier to grep them. Grepping through thousands of files is slower than changing to the correct sub-dir and grepping through a few hundred.
In the Windows API, the maximum path length is set at 260 characters. The Unicode functions extend this limit to 32,767 characters, which the major file systems support.
I found out the hard way that for UFS2 the limit is around 2^15 sub-directories. So while UFS2 and other modern filesystems work decently with a few hundred thousand files in a directory, they can only handle relatively few sub-directories. The non-obvious error message is "can't create link".
While I haven't tested ext2 I found various mailing list postings where the posters also had issues with more than 2^15 files on an ext2 filesystem.

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
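Concretely, the change can be made like this (a reboot is still required; on newer versions of Windows, fsutil 8dot3name set 1 makes the same change):

# Disable 8.3 short-name creation via the registry value described above.
Set-ItemProperty -Path 'HKLM:\System\CurrentControlSet\Control\FileSystem' `
                 -Name 'NtfsDisable8dot3NameCreation' -Value 1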
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge amount of files in a single directory: once you eliminate that, you should be fine. This isn't a NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is to move the files into folders with names based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
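A rough PowerShell sketch of that layout (the source folder and the three-character piece length are assumptions; short names would need padding as described above):

Get-ChildItem 'C:\data' -File -Filter '*.db' | ForEach-Object {
    $name = $_.BaseName
    if ($name.Length -lt 6) { return }    # pad or special-case short names as needed
    # First three characters give the top-level folder, the next three the second level.
    $target = Join-Path 'C:\data' (Join-Path $name.Substring(0, 3) $name.Substring(3, 3))
    New-Item -ItemType Directory -Path $target -Force | Out-Null
    Move-Item -LiteralPath $_.FullName -Destination $target
}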
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion, with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle the small files very well, and the drive was better able to cache a 1 MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate the names of files, you might be able to sort them into folders by date, so that each folder only has files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally, and again this requires you to be able to calculate names, you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories..
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. In the background, the application would take those files and combine them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), to get rid of the mess you have.
That way you can still make it easy for the researchers to access the files they want, while also having more control over how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
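A possible PowerShell sketch of that mapping (the data root is an assumption): reverse the base name, drop the last two characters, and turn each remaining character into a directory level:

function Get-NestedPath {
    param([string]$FileName, [string]$Root = 'data')
    $base  = [System.IO.Path]::GetFileNameWithoutExtension($FileName)
    $chars = $base.ToCharArray()
    [array]::Reverse($chars)
    # Keep the last two characters in the file name itself, so each leaf holds at most 100 files.
    $levels = if ($chars.Count -gt 2) { $chars[0..($chars.Count - 3)] } else { @() }
    $dir = $Root
    foreach ($c in $levels) { $dir = Join-Path $dir $c }
    Join-Path $dir $FileName
}

Get-NestedPath '2304252.xml'    # -> data\2\5\2\4\0\2304252.xml
Get-NestedPath '24.xml'         # -> data\24.xml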
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical aspects of the data, you could use them to nest the files in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
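A minimal sketch of that date grouping in PowerShell, keyed off each file's LastWriteTime (the source folder is an assumption):

Get-ChildItem 'C:\Readings' -File | ForEach-Object {
    # Build a year\month\day leaf folder from the file's timestamp and move the file into it.
    $leaf = '{0:yyyy}\{0:MM}\{0:dd}' -f $_.LastWriteTime
    $dir  = Join-Path 'C:\Readings' $leaf
    New-Item -ItemType Directory -Path $dir -Force | Out-Null
    Move-Item -LiteralPath $_.FullName -Destination $dir
}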
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
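A minimal sketch of that rotation (paths follow the example above; run it from a task scheduled shortly after midnight, once you are sure the product is not writing at that moment):

$stamp   = (Get-Date).AddDays(-1)                      # label the archive with the day it covers
$archive = Join-Path 'C:\Archive' (Join-Path $stamp.ToString('MMMM') $stamp.ToString('dd'))

New-Item -ItemType Directory -Path (Split-Path $archive) -Force | Out-Null
Move-Item -Path 'C:\Readings' -Destination $archive    # yesterday's readings, out of the way
New-Item -ItemType Directory -Path 'C:\Readings' | Out-Null    # fresh, empty folder for today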
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure grows only with the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )
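Given $s = '123456', that evaluates to 12\34\123456, matching the third example above. A hypothetical wrapper that also creates the folders and moves the file might look like:

function Move-ToSplitPath {
    param([string]$File, [string]$Root = '.')
    $name = [System.IO.Path]::GetFileNameWithoutExtension($File)
    # Same split as the one-liner: two-character pieces, last piece stays with the file name.
    $rel  = ($name -replace '(..)(?!$)', '$1\') -replace '[^\\]*$', ''
    $dir  = if ($rel) { Join-Path $Root $rel } else { $Root }
    New-Item -ItemType Directory -Path $dir -Force | Out-Null
    Move-Item -LiteralPath $File -Destination $dir
}

Move-ToSplitPath '123456.jpg'    # ends up as .\12\34\123456.jpg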