I know Moodle's internal files, such as uploaded images, are stored in the moodledata directory.
Inside, there are several directories:
moodledata/filedir/1c/01/1c01d0b6691ace075042a14416a8db98843b0856
moodledata/filedir/63/
moodledata/filedir/63/89/
moodledata/filedir/63/89/63895ece79c4a91666312d0a24db82fe3017f54d
moodledata/filedir/63/3c/
moodledata/filedir/63/37/
moodledata/filedir/63/a7/
What are these hashes?
What are the reasons behind this design, as opposed to, for example, WordPress's /year/month/file.jpg structure?
https://docs.moodle.org/dev/File_API_internals#File_storage_on_disk
Simple answer - files are stored based on the hash of their content (inspired by the way Git stores files internally).
This means that if you have the same file in multiple places (e.g. the same PDF or image in multiple courses), it is stored only once on the disk, even if the original filename is different.
On real sites this can involve a huge reduction in disk usage (obviously dependent on how much duplication there is on your site).
Moodledata files are stored according to the SHA1 hash of their content, in order to prevent duplication of content (for example, when the same file is uploaded twice with a different name).
For further explanation of how to handle such files, you can read the official documentation of the File API:
https://docs.moodle.org/dev/File_API_internals
especially the File storage on disk part.
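To make the layout above concrete, here is a minimal sketch (in Python, not Moodle's actual PHP code, with a hypothetical helper name) of how a content hash maps to such a path; the two-level prefix directories are just the first four characters of the SHA1 digest:
import hashlib
from pathlib import Path

def content_hash_path(filedir: Path, data: bytes) -> Path:
    # Map file content to a Git-style, two-level hashed path.
    digest = hashlib.sha1(data).hexdigest()          # e.g. '1c01d0b6...'
    return filedir / digest[0:2] / digest[2:4] / digest

# Identical content always maps to the same path, so duplicates collapse into one file:
print(content_hash_path(Path("moodledata/filedir"), b"same bytes"))
print(content_hash_path(Path("moodledata/filedir"), b"same bytes"))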
Is it good to save an image in the database as a BLOB?
Or should I save only the path and copy the image to a specific directory?
Which way is best (I mean good performance for the database and the application), and why?
What are your requirements?
In the vast majority of cases saving the path will be better, simply because of the sheer size of the files compared to the rest of the data (storing images inline can bloat the DB by gigabytes). Consider adding a level of indirection, e.g. save the path as a file name plus a reference to a storage resource (e.g. a storage_id referencing a row in a storages table), with the base path attached to the 'storage'. This way you can easily move files (copy all the files, then update the storage path, rather than updating a million individual paths).
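To illustrate that indirection, a rough sketch with hypothetical table and column names (storages, storage_id, rel_path), using SQLite only to keep the example self-contained:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Images reference a storage root instead of carrying a full absolute path.
    CREATE TABLE storages (
        storage_id INTEGER PRIMARY KEY,
        base_path  TEXT NOT NULL              -- e.g. '/mnt/images01'
    );
    CREATE TABLE images (
        image_id   INTEGER PRIMARY KEY,
        storage_id INTEGER NOT NULL REFERENCES storages(storage_id),
        rel_path   TEXT NOT NULL              -- path relative to the storage root
    );
""")
conn.execute("INSERT INTO storages VALUES (1, '/mnt/images01')")
conn.execute("INSERT INTO images VALUES (1, 1, '2013/07/cat.jpg')")

# Moving every file to a new volume is one UPDATE on storages,
# not one UPDATE per image row.
conn.execute("UPDATE storages SET base_path = '/mnt/images02' WHERE storage_id = 1")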
However, if your requirements include consistent backup/restore and/or disaster recoverability, it is often better to store images in the DB. It is not easier, nor more convenient, but it is simply going to be required. Each DB has its own way of dealing with this problem, e.g. in SQL Server you would use the FILESTREAM type, which allows remote access via the file access API. See FILESTREAM MVC: Download and Upload images from SQL Server for an example.
Also, a somewhat dated but nonetheless interesting paper on the topic: To BLOB or Not to BLOB.
Can a FAT based file system be modified to support multiple references to a file (i.e. aliases) by using the same FAT block sequence in directory table entries?
No, because when any reference was deleted, the file's clusters would be returned to free space and possibly reused. This would result in two different files sharing space, with any write to one corrupting the other.
This could work if the file system were immutable, for example if it were written to a non-writable medium.
Certainly, you can have directory entries point to the same FAT records, but there are two things you should keep in mind:
1) never run any standard check-disk utilities, otherwise they will "repair" the shared entries and get it wrong;
2) you have to implement your own delete operation to handle the directory records that point to the same item you are deleting.
UPD: this answer assumes the question's "can it be modified" premise.
The FAT file system stores all information about a file in a single structure inside a directory, except the addresses of the disk blocks that contain the file data. The disk block numbers of all files are kept in the File Allocation Table (FAT).
Since the link information and the file's metadata are bound together in a single structure, the FAT file system does not support multiple links to a single file. It does not support symbolic links either, though it could have. However, Windows supports shortcuts, which are similar to symbolic links.
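As a toy model (simplified Python, not real FAT code) of why the corruption mentioned above happens: two directory entries can share a cluster chain, but the standard delete frees the whole chain, leaving the surviving entry pointing at clusters that the file system will happily hand to the next file it allocates.
FREE, EOF = 0, -1

# Simplified FAT: index = cluster number, value = next cluster in the chain (or EOF/FREE).
fat = [FREE] * 16
fat[2], fat[3] = 3, EOF                  # one file's data lives in clusters 2 -> 3

directory = {
    "REPORT.TXT": 2,                     # original entry
    "ALIAS.TXT": 2,                      # alias entry pointing at the same chain
}

def delete(name):
    # Standard FAT delete: drop the directory entry and free its whole cluster chain.
    cluster = directory.pop(name)
    while cluster != EOF:
        next_cluster = fat[cluster]
        fat[cluster] = FREE
        cluster = next_cluster

delete("ALIAS.TXT")
print(fat[2], fat[3])                    # 0 0 -> clusters are 'free' and can be reused
print(directory)                         # but REPORT.TXT still points at cluster 2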
I have 2 independent Matlab workers, with FIRST getting/saving data and SECOND reading it (and doing some calculations, etc.).
FIRST saves data as a .mat file on the hard disk while SECOND reads it from there. It takes ~20 seconds to SAVE this data as .mat and 8 ms to DELETE it. Before SAVING data, FIRST deletes the old file and then saves a newer version.
How can SECOND verify that the data exists and is not corrupt? I can use exist, but that doesn't tell me whether the data is corrupt or not. For example, if SECOND tries to read the data exactly when FIRST is saving it, exist passes but load gives an error saying "Data Corrupt", etc.
Thanks.
You can't, without some synchronization mechanism - by the time SECOND completes its check and starts to read the file, FIRST might have started writing it again. You need some sort of lock or mutex.
Two options for base Matlab.
If this is on a local filesystem, you could use a separate lock file sitting next to the data file to manage concurrent access to the data file. Use Java's NIO FileChannel and FileLock objects from within Matlab to lock the first byte of the lock file and use that as a semaphore to control access to the data file, so the reader waits until the writer is finished and vice versa. (If this is on a network filesystem, don't try this - file locking may seem to work but usually is not officially supported and in my experience is unreliable.)
Or you could just put a try/catch around your load() call and have it pause a few seconds and retry if you get a corrupt file error. The .mat file format is such that you won't get a partial read if the writer is still writing it; you'll get that corrupt file error. So you could use this as a lazy sort of collision detection and backoff. This is what I usually do.
To reduce the window of contention, consider having FIRST write to a temporary file in the same directory, and then use a rename to move it to its final destination. That way the file is only unavailable during a quick filesystem move operation, not the 20 seconds of data writing. If you have multiple writers, stick the PID and hostname in the temp file name to avoid collisions.
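A rough sketch of the last two ideas (retry-on-error plus write-to-temp-then-rename), written in Python only for illustration; in Matlab the same pattern is a try/catch around load() with pause() for the backoff, and movefile() for the rename:
import os
import time

DATA = "results.mat"

def writer_save(payload: bytes) -> None:
    # FIRST: write to a temp file, then rename it into place. The rename is a fast
    # filesystem operation, so readers only ever see a complete old or new file.
    tmp = DATA + ".tmp"                  # include PID/hostname here if there are multiple writers
    with open(tmp, "wb") as f:
        f.write(payload)
    os.replace(tmp, DATA)                # atomic when temp and target are on the same filesystem

def reader_load(retries: int = 5, delay_s: float = 2.0) -> bytes:
    # SECOND: retry with a short backoff if the file can't be opened yet; with Matlab's
    # load() the same try/catch also covers the corrupt-file error during a half-finished save.
    for _ in range(retries):
        try:
            with open(DATA, "rb") as f:
                return f.read()
        except OSError:
            time.sleep(delay_s)
    raise RuntimeError("data file never became readable")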
Sounds like a classic resource-sharing problem between 2 threads (reader-writer).
In short, you should find a method of safe inter-worker communication. Check this out.
Also, try to type
showdemo('paralleldemo_communic_prof')
in Matlab
A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.
I have tried turning off the indexing service, but that made no difference. I have also contemplated moving the file content into a database, zip files, or tarballs, but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name it is creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change, as '8 dot 3' file names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
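If you would rather script the change than edit the registry by hand, a minimal sketch using Python's winreg module (requires administrator rights, and the reboot is still needed afterwards):
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\FileSystem"

# Set NtfsDisable8dot3NameCreation = 1 so NTFS stops generating short 8.3 names.
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "NtfsDisable8dot3NameCreation", 0, winreg.REG_DWORD, 1)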
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is to move the files into folders with names based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
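A small sketch of that scheme (a hypothetical helper; it pads short names with zeroes as suggested, using three-character pieces and two directory levels):
from pathlib import Path

def shard_path(root: Path, filename: str, piece: int = 3, levels: int = 2) -> Path:
    # Build e.g. ABC\DEF\ABCDEFGHI.db from the first characters of the file name.
    stem = Path(filename).stem.ljust(piece * levels, "0")   # pad short names with zeroes
    parts = [stem[i * piece:(i + 1) * piece] for i in range(levels)]
    return root.joinpath(*parts, filename)

print(shard_path(Path("."), "ABCDEFGHI.db"))   # ABC\DEF\ABCDEFGHI.db (forward slashes on non-Windows)
print(shard_path(Path("."), "ABCEFGHIJ.db"))   # ABC\EFG\ABCEFGHIJ.db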
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping the files under each date folder so you don't have lots of small files, etc. All of them were band-aids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR-type fashion with simple delimiters to parse them. The disk files went from 1.2 million to a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1 MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate the names of the files, you might be able to sort them into folders by date, so that each folder only has files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally, and again, this requires you to be able to calculate names: you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories...
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. Then, in the background, the application takes these files and combines them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy) to get rid of the mess you have.
That way you can still make it easy for them to access the files they want, but it also lets you have more control over how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several subdirectories instead of keeping all of them in the same directory. You can do this by creating a hierarchy of directories and putting the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
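Here is a sketch matching the examples above, as I read them (reverse everything after the first two characters of the base name, one directory per character):
from pathlib import Path

def reversed_shard_path(root: Path, filename: str) -> Path:
    stem = Path(filename).stem
    dirs = list(reversed(stem[2:]))      # everything after the first two characters, reversed
    return root.joinpath(*dirs, filename)

for name in ["1.xml", "24.xml", "12331.xml", "2304252.xml"]:
    print(reversed_shard_path(Path("data"), name))
# data/1.xml
# data/24.xml
# data/1/3/3/12331.xml
# data/2/5/2/4/0/2304252.xml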
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, which gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into again in another 2 or 3 years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to look only at smaller subsets of the files.
Using tools like 'find' (under Cygwin or MinGW) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, c:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
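A rough sketch of that midnight job (hypothetical paths; run it from a scheduled task shortly after midnight, and only while the product is not writing):
import os
from datetime import date, timedelta
from pathlib import Path

READINGS = Path(r"C:\Readings")
ARCHIVE = Path(r"C:\Archive")

def roll_over() -> None:
    # Archive yesterday's folder under Archive\<Month>\<day> and start a fresh one.
    yesterday = date.today() - timedelta(days=1)
    target = ARCHIVE / yesterday.strftime("%B") / str(yesterday.day)
    target.parent.mkdir(parents=True, exist_ok=True)
    os.rename(READINGS, target)          # one quick move of the whole day's readings
    READINGS.mkdir()                     # empty folder for the new day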
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the folder structure only grows as deep as the filename is long. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
# Builds the nested path for $s, e.g. '123456' -> '12\34\123456'.
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )