Merging text files - merge

I want to merge many text files to a file.
But the normal solutions ( like copy in command prompt or using c# in streaming way ) is not good for me because they are very slow and uses huge amount of system IO.
Is there any way with a low level programing that merge this files without loading in the memory? for example just with new addressing in the disk.
I believe windows defragmentation uses this method for moving block of a file.
Thanks

Related

Reading text matlab without blocking the file

Is there a way to read a text file in matlab, without blocking the file.
so basically i want to read in read-only-mode and avoid blocking the files in case simultaneously another software is trying to updating/modify or delete them.
fopen/textscan will block the files.
You probably want to use Memory Mapping in MATLAB:
Overview of Memory-Mapping
Benefits of Memory-Mapping
The principal benefits of memory-mapping are efficiency, faster file
access, the ability to share memory between applications, and more
efficient coding.
There are some limitations. The file should be pretty well structured. I'm also not sure what happens if other application change the file size.

A custom directory/folder merge tool

I am thinking about developing a custome directory/folder merge tool as part of learning functional programming as well as to scratch a very personal itch.
I usually work on three different computers and I tend to accumulate lots of files (text, video, audio) locally and then painstakingly merge them for backup purposes. I am pretty sure I have dupes and unwanted files lying around wasting space. I am moving to a cloud backup solution as a secondary backup source and I want to save as much space as possible by eliminating redundant files.
I have a complex deeply nested directory structure and I want an automated tool that automatically walks down the folder tree and perform the merge. Another problem is that I use a mix of Linux and Windows and many of my files have spaces in the name...
My initial thought was that I need to generate hashes for every file and compare using hashes rather than file names (spaces in folder name as well as contents of files could be different between source and target). Is RIPEMD-160 a good balance between performance and collision avoidance? or is SHA-1 enough? Is SHA-256/512 overkill?
Which functional programming env comes with a set of ready made libraries for generating these hashes? I am leaning towards OCaml...
Check out the Unison file synchronizer.
I don't use it myself, but I heard quite a few positive reviews. It is a mature software based on some theoretic foundation.
Also, it is written in OCaml.

Reason for monolithic data files

Primarily this seems to be a technique used by games, where they have all the sounds in one file, textures in another etc. With these files commonly reaching the GB size.
What is the reason behind doing this over maintaining it all in subdirectories as small files - one per texture which many small games use this, with the monolithic system being favoured by larger companies?
Is there some file system overhead with lots of small files?
Are they trying to protect their property - although most just seem to be a compressed file with a new extension?
The reasons we use an "archive" system like this where I work (a game development company):
lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc.
(When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
But those are the mundane reasons. The more fun stuff:
Each .arc registers in a list, and attempts to open a file know to look in the arcs. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility:
dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
(Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.
It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.
File systems do have an overhead. Usually, a file takes disk space rounded up to some power of 2 (e.g. up to 4 KB), so many small files would waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing multiple files. E.g. it is usually considerably faster to copy one 400 MB file than 4000 100 KB files.
File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.
On Apple systems, the most common way is to use, as you suggest, directories. They are called Bundles, and are in the Finder represented as just one file, but if you explore them more, they're actually directories. This makes writing code and conserving memory when loading individual items out of this bundle very easy. :-) Also, this makes taking incremental backups of gigantic databases easy, as for instance your iPhoto database is just a bundle, so you just backup changed and new files
On Windows, however, I believe this is much harder to do, it will look like a directory "no matter what" (I'm sure smart people have found a solution that will make Explorer see certain directories as a single file, but it's not common).
From a games developer point of view, you're not dealing with so small files that disk space overhead is something you're very much concerned with, so I doubt #doublep's suggestion, since it makes for such a hassle, but it makes it much easier with a single file if users are to copy an entire game over somewhere, then it's easy to check if the entire set is correct.
And, of course, it's harder to read for people that shouldn't have access to it. But it's also harder to modify, which means harder to patch, and harder to write extensions. Someone that uses extensions a lot, prefers the directory structure: The Sims.
Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)
Cheers
Nik
I can think of multiple reasons.
As doublep suggested, files occupy more space on the disc than they require. So an archive saves space. 10k files (of any size) should save you 20MB when packed into an archive. Not exactly a large amount of space nowadays, but still.
The other reason I can think of is disc fragmentation. I suspect a heavily fragmented disc will perform worse when accessing thousands of separate files on a fragmented space. But I'm no expert in this field, so I'd appreciate if someone more experienced verified this.
Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
Or there may be another reason I never thought about :D.
As you know games, especially with larger companies try to squeeze as much performance as they can. One technique is to have all the data in one large file and just DMA it to memory (think of it as a memcpy from CD to RAM). Since all the files are in one large one there will be no disk seeks and you can have a large number of files (which may cause large amount of seeks) all loaded quicky because of the technique.

How To Create File System Fragmentation?

Risk Factors for File Fragmentation include mostly full Disks and repeated file appends. What are other risk factors for file fragmentation? How would one make a program using common languages like C++/C#/VB/VB.NET to work with files & make new files with the goal of increasing file fragmentation?
WinXP / NTFS is the target
Edit: Would something like this be a good approach? Hard Drive free space = FreeMB_atStart
Would creating files of say 10MB to fill
90% of the remaining hard drive space
Deleting every 3rd created file
making file of size FreeMB_atStart * .92 / 3
This should achieve at least some level of fragmentation on most file systems:
Write numerous small files,
Delete some at random files,
Writing a large file, byte-by-byte.
Writing it byte-by-byte is important, because otherwise if the file system is intelligent, it can just write the large file to a single contiguous place.
Another possibility would be to write several files simultaneously byte-by-byte. This would probably have more effect.

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64k binary files on a NTFS partition (Windows XP). After a year in production there is over 300000 files in a single directory and the number keeps growing. This has made accessing the parent/ancestor directories from windows explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16 bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory because Windows looks at the files in the directory to make sure the name they are creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change as '8 dot 3' name files are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge amount of files in a single directory: once you eliminate that, you should be fine. This isn't a NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue, is moving the files to folders with a name based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc, create a directory structure like this:
ABC\
DEF\
ABCDEFGHI.db
EFG\
ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system were all of the small files for a certain date were appended in a TAR type of fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster because NTFS can't handle the small files very well, and the drive was better able to cache a 1MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally, and again, this requires you to be able to calculate names, you'll find that directly accessing a file is much faster than trying to open it via explorer. For example, saying
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories..
Personally, I would develop an application that keeps the interface to that folder the same, ie all files are displayed as being individual files. Then in the application background actually takes these files and combine them into a larger files(and since the sizes are always 64k getting the data you need should be relatively easy) To get rid of the mess you have.
So you can still make it easy for them to access the files they want, but also lets you have more control how everything is structured.
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several sub directories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and put the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reversing them, and creating one letter directories from that. Consider the following files for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical, aspects of the data you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into in another 2 years, or 3 years... just looking at a list of 0.3-1mil files is going to incur a cost, so it may be better in the long-term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename Reading at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Reading become c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only is deep is it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )