How do you deal with lots of small files? - windows-xp

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?

NTFS actually will perform fine with many more than 10,000 files in a directory, as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory, because Windows has to look at the existing files in the directory to make sure the short name it is creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found under the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry key. It is safe to make this change, as '8 dot 3' names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
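If you prefer to script the change, a minimal sketch in PowerShell (shown only because it is used elsewhere on this page; regedit or the fsutil command-line tool work just as well):
# disable 8.3 short-name generation; takes effect after a reboot
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' -Name 'NtfsDisable8dot3NameCreation' -Value 1
Note that this only stops new short names from being generated; 8.3 names already written to disk are not removed.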

NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.

The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue, is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc, create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
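For example, a rough one-time PowerShell sketch of filing an existing flat folder away by the first two three-character groups of each name (the source path and the .db filter are placeholders for your own; it assumes names are at least six characters long, as in the example):
$src = 'C:\Readings'                                    # hypothetical flat folder full of .db files
Get-ChildItem -Path $src -Filter *.db | ForEach-Object {
    $name = $_.BaseName                                 # e.g. ABCDEFGHI
    $d1   = $name.Substring(0,3)                        # first-level folder,  e.g. ABC
    $d2   = $name.Substring(3,3)                        # second-level folder, e.g. DEF
    $dest = Join-Path $src (Join-Path $d1 $d2)
    if (-not (Test-Path $dest)) { New-Item -ItemType Directory -Path $dest | Out-Null }
    Move-Item -Path $_.FullName -Destination $dest
}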

I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.

I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date level so you don't have lots of small files, etc. All of them were band-aids for the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.

You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/

If you can calculate the names of files, you might be able to sort them into folders by date, so that each folder only has files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally (and again, this requires you to be able to calculate names), you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, saying
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.

One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.

Aside from placing the files in sub-directories..
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are presented as individual files, but in the background actually takes these files and combines them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy) to get rid of the mess you have.
That way you can still make it easy for the researchers to access the files they want, while having more control over how everything is structured.
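As a minimal sketch of the retrieval side, assuming the 64k readings were simply appended in order into one container file per day (the container path and record index below are hypothetical):
$recordSize = 64KB                                   # every reading is exactly 64 KB
$container  = 'C:\Readings\2009-09-22.bin'           # hypothetical combined file for one day
$record     = 42                                     # which reading to pull out
$stream = [System.IO.File]::OpenRead($container)
$buffer = New-Object byte[] $recordSize
$stream.Seek($record * $recordSize, 'Begin') | Out-Null   # jump straight to the record
$stream.Read($buffer, 0, $recordSize) | Out-Null
$stream.Close()
# $buffer now holds reading 42 and can be handed out exactly as if it were its own file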

Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few characters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
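A rough PowerShell sketch of computing the directory for a given name under this scheme (the 'data' root is just the example above):
function Get-BucketPath {
    param([string]$FileName, [string]$Root = 'data')
    $base = [System.IO.Path]::GetFileNameWithoutExtension($FileName)
    if ($base.Length -le 2) { return $Root }            # 1.xml and 24.xml stay in the root
    $chars = $base.Substring(2).ToCharArray()           # the "last few letters" (all but the first two)
    [array]::Reverse($chars)                            # reversed, one directory per character
    return (@($Root) + $chars) -join '\'
}
Get-BucketPath '12331.xml'     # -> data\1\3\3
Get-BucketPath '2304252.xml'   # -> data\2\5\2\4\0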

Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?

If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.

Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
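A minimal sketch of what the scheduled task could run just after midnight (paths follow the example above; you would still need to handle the race with the product writing, as noted):
$live    = 'C:\Readings'                                       # folder the product writes into
$stamp   = (Get-Date).AddDays(-1)                              # the day that just ended
$archive = 'C:\Archive\{0:MMMM}\{0:dd}' -f $stamp              # e.g. C:\Archive\September\22
New-Item -ItemType Directory -Path (Split-Path $archive) -Force | Out-Null
Move-Item -Path $live -Destination $archive                    # yesterday's readings become the archive folder
New-Item -ItemType Directory -Path $live | Out-Null            # fresh empty folder for today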

To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )
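Wrapped in a loop that actually creates the folders and moves the files (the source folder and .jpg filter are just for illustration), it might be used like this:
Get-ChildItem 'C:\images' -Filter *.jpg | ForEach-Object {
    $s    = $_.BaseName
    $rel  = -join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $_.Name )
    $dest = Join-Path 'C:\images' $rel
    if ($dest -ne $_.FullName) {                               # very short names stay where they are
        New-Item -ItemType Directory -Path (Split-Path $dest) -Force | Out-Null
        Move-Item -Path $_.FullName -Destination $dest
    }
}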

Related

Does seafile store synced files anywhere?

I'm using Seafile (on docker) to sync some files to a Synology NAS and it is all working correctly. I've created an external folder that is pointed to the /shared folder in the container.
I think I already know the answer, but are the files synced to the server stored 'normally' somewhere? i.e. If I sync a folder called 'photos' and it has 'a.jpg' in it, will I be able to find that file on the seafile server?
The reason for the question is I would like to back up the original files that are synced, rather than having to back up the Seafile DB, etc.
(I am aware that syncthing does what I want, so I may choose to use that instead, just want to confirm my understanding)
Thanks
TLDR;
No you won't find your a.jpg file on the server. Your files are going to be turned into blocks of bytes.
To understand why, take a look at this part of the data model documentation:
FS
There are two types of FS objects, SeafDir Object and Seafile Object. SeafDir Object represents a directory, and Seafile Object represents a file.
Block
A file is further divided into blocks with variable lengths. We use Content Defined Chunking algorithm to divide file into blocks. A clear overview of this algorithm can be found at http://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf. On average, a block's size is around 1MB.
So backing up files won't be as easy as making a raw copy of the Seafile drive. As mentioned by @JensV, you may still achieve something along those lines using the Seafile drive client.

PowerShell large folder structure compare

I am looking to compare two large folder structures, and also to compare the current state of a folder structure to a known good historical state, and I need the most performant option possible.
The folder structure in question is an Autodesk "deployment", which is a bloated mess of an installer, with upwards of 15GB of data, 15,000+ files and 2500+ folders. And they are fragile, seemingly innocuous changes can break them. Ran into one the other day where the absence of a thumbs.DB file caused the entire install of Revit 2018 to fail. So, I want to be able to both copy deployments and do a hash compare of source and destination to ensure everything copied properly, and compare a current state with a historical state to make sure nothing has changed.
My initial thought was I could use Get-ChildItem to get a full listing of every folder and file name, then Get-FileHash that as a first step, then apportion all the file paths to an array to dole out to multiple jobs that Get-FileHash all the files, and then sum all the hashes to get a single hash for the folder. That can then be compared with the hash for the other folder, or the historical hash, to determine if anything has changed. I have benchmarked Get-FileHash running as a single-threaded sequence on an example deployment, and it takes a reasonable couple of minutes, so multithreading that and reducing it by a factor of 4 or so on average would be quite doable.
That said, comparing two folder structures seems like the kind of thing that might be already implemented, and much faster, in Windows already. So before I go down that rabbit hole I thought it best to see if it's the right hole in the first place.
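For what it's worth, a single-threaded sketch of the approach described above, producing one sorted hash-per-file listing per tree that can be compared or kept as a baseline (the deployment paths are placeholders; pass roots without a trailing backslash; Get-FileHash needs PowerShell 4 or later):
function Get-TreeHashes {
    param([string]$Root)
    # one line per file: its SHA256 hash plus its path relative to $Root
    Get-ChildItem -Path $Root -Recurse -File | ForEach-Object {
        $hash = (Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256).Hash
        '{0}  {1}' -f $hash, $_.FullName.Substring($Root.Length)
    } | Sort-Object
}
$source = Get-TreeHashes 'D:\Deployments\Revit2018'            # placeholder paths
$copy   = Get-TreeHashes '\\server\Deployments\Revit2018'
Compare-Object $source $copy                                   # no output means the two trees match
# $source | Set-Content 'D:\Baselines\Revit2018.sha256'        # or keep one run as the known-good state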

Can a FAT filesystem support multiple references to a file?

Can a FAT based file system be modified to support multiple references to a file (i.e. aliases) by using the same FAT block sequence in directory table entries?
No, because then when any reference was deleted, the file's blocks would be added to free space and possibly reused. This would result in two different files sharing space, with any write to one corrupting the other.
This could work if the file system was immutable. For example if it was written to an unwritable medium.
Sure, you can have directory items point to the same FAT records, but there are two things you should keep in mind:
1) never run any standard check-disk utilities, or they will treat the shared clusters as an error (cross-linked files);
2) you have to implement your own delete operation to remove the directory records which point to the same item that you delete.
UPD: this answer assumes the question's 'can be modified' approach.
The FAT File System stores all information about a file in a single structure inside a directory, except the addresses of disk blocks that contain file data. Disk block numbers of all files are kept in a File Allocation Table (FAT).
Since the link information and file container information are bound together in a single structure, FAT file system does not support multiple links to a single file. It does not support symbolic links either, though it could have. However, Windows supports shortcuts that are similar to symbolic links.

Reason for monolithic data files

Primarily this seems to be a technique used by games, where they have all the sounds in one file, textures in another, etc., with these files commonly reaching GB sizes.
What is the reason behind doing this over maintaining it all in subdirectories as small files (one per texture), which many small games use, with the monolithic system being favoured by larger companies?
Is there some file system overhead with lots of small files?
Are they trying to protect their property - although most just seem to be a compressed file with a new extension?
The reasons we use an "archive" system like this where I work (a game development company):
lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc. (There is a small sketch of this idea at the end of this answer.)
(When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
But those are the mundane reasons. The more fun stuff:
Each .arc registers itself in a list, and file-open attempts know to look in the arcs. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility:
dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
(Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.
It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.
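To make the 'lookup speed' point above concrete, here is a toy sketch of the pack-and-index idea in PowerShell, only because that's the scripting language used elsewhere on this page; the paths, names and the use of a plain hashtable instead of a real filename hash are all just for illustration, and a real engine would do this in its own language and persist the index in the archive header:
# pack: append every asset to one archive, remembering where each one landed
$index   = @{}                                                 # name -> offset/size (a real tool hashes a normalized path)
$archive = [System.IO.File]::Create('C:\game\assets.arc')      # hypothetical archive
Get-ChildItem 'C:\game\assets' -Recurse -File | ForEach-Object {
    $bytes = [System.IO.File]::ReadAllBytes($_.FullName)
    $index[$_.Name.ToLower()] = @{ Offset = $archive.Position; Size = $bytes.Length }
    $archive.Write($bytes, 0, $bytes.Length)
}
$archive.Close()
# lookup: no directory traversal at all, just one table hit and one seek within assets.arc
$entry = $index['player.png']                                  # hypothetical asset name
# ...then open assets.arc once and Seek($entry.Offset) / Read($entry.Size) bytes, as in any packed-file reader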
File systems do have an overhead. Usually, a file takes disk space rounded up to a multiple of the allocation unit (typically 4 KB), so many small files waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing multiple files. E.g. it is usually considerably faster to copy one 400 MB file than 4000 100 KB files.
File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.
On Apple systems, the most common way is to use, as you suggest, directories. They are called Bundles, and are represented in the Finder as just one file, but if you explore them more, they're actually directories. This makes writing code and conserving memory when loading individual items out of this bundle very easy. :-) Also, this makes taking incremental backups of gigantic databases easy, as for instance your iPhoto database is just a bundle, so you just back up changed and new files.
On Windows, however, I believe this is much harder to do, it will look like a directory "no matter what" (I'm sure smart people have found a solution that will make Explorer see certain directories as a single file, but it's not common).
From a games developer's point of view, you're not dealing with files so small that disk space overhead is something you're very much concerned with, so I doubt @doublep's suggestion, since it makes for such a hassle. But a single file does make it much easier if users are to copy an entire game over somewhere: then it's easy to check that the entire set is correct.
And, of course, it's harder to read for people that shouldn't have access to it. But it's also harder to modify, which means harder to patch, and harder to write extensions. A game that relies heavily on extensions prefers the directory structure: The Sims, for example.
Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)
Cheers
Nik
I can think of multiple reasons.
As doublep suggested, files occupy more space on the disc than they require. So an archive saves space. 10k files (of any size) should save you 20MB when packed into an archive. Not exactly a large amount of space nowadays, but still.
The other reason I can think of is disc fragmentation. I suspect a heavily fragmented disc will perform worse when accessing thousands of separate files on a fragmented space. But I'm no expert in this field, so I'd appreciate if someone more experienced verified this.
Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
Or there may be another reason I never thought about :D.
As you know, games, especially from larger companies, try to squeeze out as much performance as they can. One technique is to have all the data in one large file and just DMA it to memory (think of it as a memcpy from CD to RAM). Since all the files are in one large one, there will be no disk seeks, and a large number of files (which would otherwise cause a large number of seeks) can all be loaded quickly because of this technique.

How many sub-directories should be put into a directory

At SO there has been much discussion about how many files in a directory are appropriate: on older filesystems stay below a few thousand, on newer ones stay below a few hundred thousand.
Generally the suggestion is to create sub-directories for every few thousand files.
So the next question is: what is the maximum number of sub-directories I should put into a directory? Nesting them too deep kills directory tree traversal performance. Is there such a thing as nesting them too shallow?
From a practicality standpoint, applications might not handle large directory listings well.
For example, Windows Explorer gets bogged down with several thousand directory entries (I've had Vista crash, but XP seems to handle it better).
Since you mention nesting directories also keep in mind that there are limits to the length of fully qualified (with drive designator and path) filenames (See wikipedia 'filename' entry). This will vary with the operating system file system (See Wikipedia 'comparison on file systems' entry).
For Windows NTFS it is supposed to be 255; however, I have encountered problems with commands and API functions with fully qualified filenames at about 120 characters. I have also had problems with long path names on mapped network drives (at least with Vista and Internet Explorer 7).
Also there are limitations on the nesting level of subdirectories. For example CD-ROM (ISO 9660) is limited to 8 directory levels (something to keep in mind if you would want to copy your directory structure to a CD-ROM, or another filesystem).
So there is a lot of inconsistency when you push the file system to extremes
(while the file system may be able to handle it theoretically, apps and libraries may not).
It really depends on the OS you are using, as directory manipulations are done using system calls. For Unix-based OSes, i-node lookup algorithms are highly efficient and the number of files and folders in a directory does not matter. Maybe that's why there is no limit to it in Unix-based systems. In Windows, however, it varies from file system to file system.
Usually modern filesystems (like NTFS or ext3) don't have a problem with accessing files directly (ie. if you are trying to open /foo/bar/baz.dat). Where you can run into problems is enumerating subdirectories / files in a given directory (ie. give me all the files/dirs from /foo). This can occur in multiple scenarios (for example while debugging, or during backup, etc). I found that keeping childcount around a couple of hundred at most gave me acceptable response times.
Of course this varies from situation to situation, so do test :-)
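As a quick way to do that test, something like the following (the paths are placeholders) shows the difference between enumerating a directory and opening a known file directly:
Measure-Command { Get-ChildItem 'C:\big\folder' | Out-Null }                     # enumerate everything
Measure-Command { Get-Item 'C:\big\folder\some-known-file.dat' | Out-Null }      # open one file directly by name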
My guess is as few as possible.
At ISP I was working for (back in 2003) we had lots of user emails and web files. We structured them with md5 hashed usernames, 3 levels deep (ie. /home/a/b/c/abcuser). This resulted in maybe up to 100 users inside third level directory.
You can make a deeper structure, or keep the user directories in a shallower structure, too. The best option is to try and see, but the smaller the directory count, the faster the lookup is.
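One way to read that scheme, sketched in PowerShell purely for illustration (the /home root, the three-level depth and the use of the first hex digits of the MD5 are assumptions based on the example above):
function Get-UserDir {
    param([string]$User)
    $md5   = [System.Security.Cryptography.MD5]::Create()
    $bytes = $md5.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($User))
    $hex   = -join ($bytes | ForEach-Object { $_.ToString('x2') })
    # the first three hex digits of the hash become the three nesting levels
    '/home/{0}/{1}/{2}/{3}' -f $hex[0], $hex[1], $hex[2], $User
}
Get-UserDir 'abcuser'    # -> /home/<x>/<y>/<z>/abcuser, where x, y, z come from the hash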
I've come across a similar situation recently. We were using the file system to store serialized trade details. These would only be looked at infrequently and it wasn't worth the pain to store them in a database.
We found that Windows and Linux coped with a thousand or so files but it did get much slower accessing them - we organised them in sub-dirs in a logical grouping and this solved the problem.
It was also easier to grep them. Grepping through thousands of files is slower than changing to the correct sub-dir and grepping through a few hundred.
In the Windows API, the maximum path length is set at 260 characters. The Unicode functions extend this limit to 32,767 characters, which is supported by the major file systems.
I found out the hard way that for UFS2 the limit is around 2^15 sub-directories. So while UFS2 and other modern filesystems work decently with a few hundred thousand files in a directory, they can only handle relatively few sub-directories. The non-obvious error message is "can't create link".
While I haven't tested ext2 I found various mailing list postings where the posters also had issues with more than 2^15 files on an ext2 filesystem.