Large Files in Source Control (TFS) - version-control

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.
So my question is, what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter)? Is it OK? Will it "fill up the server" or have some other dire consequence?

tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.
Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.
Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.
Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.
If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions, as in the typical use case with text files, the storage cost should be modest. However, if the entirety of the contents change, as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)
One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately, there's a background "deltafication" job that runs nightly to compute the deltas to trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.
Client storage: Clients will need to have enough disk space to contain these files also (although only the revision that they've downloaded). This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.
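As a rough illustration (not an official recipe), here is a minimal Python sketch that cloaks a large-file folder from a workspace by shelling out to the tf.exe command line; the exact tf.exe options vary between TFS versions, and the server path, workspace name and collection URL below are hypothetical:

    # Minimal sketch: cloak a large-file folder so "Get Latest" skips it.
    # Assumes tf.exe is on PATH; exact option syntax varies by TFS version.
    # The server path, workspace name and collection URL are hypothetical.
    import subprocess

    def cloak_folder(server_path: str, workspace: str, collection: str) -> None:
        """Exclude a server folder from the given workspace's mappings."""
        subprocess.run(
            ["tf", "workfold", "/cloak", server_path,
             f"/workspace:{workspace}", f"/collection:{collection}"],
            check=True,
        )

    cloak_folder(
        "$/MyTeamProject/TestData/LargeXml",        # hypothetical server path
        "MyWorkspace",                              # hypothetical workspace name
        "http://tfs:8080/tfs/DefaultCollection",    # hypothetical collection URL
    )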
Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax the server's memory.

If those files were constantly changing and their deltas were big, I would eventually expect a penalty in overall TFS performance. You clearly state that this is not the case, so, provided that your SQL Server has the capacity to house the storage, I believe you should be able to proceed without any implications. A minor downside you may experience is when you're constructing new workspaces, where you would have to pull those files from the repository. Unfortunately this also happens during TFS Build, so it's possible that your builds will now take that much longer. The severity of this greatly depends on your network configuration and stability.

The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or cloak them out of your mappings. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd advise keeping everything in one team project).
If you have control of the XML format then also consider a few tweaks to make the files smaller. This will improve the performance of store/get operations and also loading speed... Shorten element and attribute names, reduce the number of decimal places you are outputting for floating-point numbers, etc. You will find that simple schemes like this will knock many megabytes off the size of GB-sized files, and it's easy to put together a quick XSLT transform or a bit of code to convert the files over to the new format.
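As a rough sketch of that kind of transform (assuming the files fit in memory; for GB-sized files you would likely stream with iterparse instead), here is a minimal Python example that shortens element and attribute names and rounds floating-point values; the name mappings and precision are hypothetical and would need to match your schema:

    # Minimal sketch: shorten element/attribute names and round float values.
    # NAME_MAP / ATTR_MAP and the precision are hypothetical placeholders.
    import xml.etree.ElementTree as ET

    NAME_MAP = {"measurementValue": "mv", "timestampUtc": "ts"}  # hypothetical
    ATTR_MAP = {"unitOfMeasure": "u"}                            # hypothetical

    def round_text(text: str, precision: int) -> str:
        """Round a float-looking string to the given number of decimal places."""
        try:
            if "." in text:
                return str(round(float(text), precision))
        except ValueError:
            pass
        return text

    def shrink(in_path: str, out_path: str, precision: int = 3) -> None:
        tree = ET.parse(in_path)  # loads the whole file; use iterparse for huge files
        for elem in tree.iter():
            elem.tag = NAME_MAP.get(elem.tag, elem.tag)
            for old, new in ATTR_MAP.items():
                if old in elem.attrib:
                    elem.set(new, elem.attrib.pop(old))
            if elem.text and elem.text.strip():
                elem.text = round_text(elem.text.strip(), precision)
        tree.write(out_path, encoding="utf-8", xml_declaration=True)

    shrink("big_testdata.xml", "big_testdata_small.xml")  # hypothetical file names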

Related

I need to back up database files. Need something like GitHub that is happy with hundreds of gigabytes of data

I want a source-controlled environment for a fairly large amount of database data, in text form, before it's loaded into the DBMS. We've been using GitHub and it's great. But they expect a repository to be less than 1 gigabyte, and we have hundreds.
It could be in CVS or SVN, but tracking versions is important. The data is very static and is accessed only at low rates, say once a week for parts of it, once a month for more.
Any suggested places/services that do this? It doesn't have to be free, we'll happily pay a reasonable amount.
I would argue that this amount of data is incompatible with a version control system (which is made to record history, i.e. the evolution of mostly text files and small binary files).
It is certainly not compatible with a distributed VCS, where any clone copies the entire repo.
You need to look at cloud services for this type of storage.
The OP protests (downvote), stating that:
They would be normal ASCII except that GitHub has such small file size limits that I ran them through ZIP compression.
They rarely change, and when the contents change, it's just a tiny number of lines within the file.
It's exactly what version control is about. Which 0.005% of the ASCII changed? Who changed it? When?
I maintain that:
hundreds of gigabytes is incompatible with most source control repo providers out there (it would even be incompatible with most internal enterprise repos, and I am in a large company)
putting them in a zip file isn't practical in that a version control tool wouldn't be able to record the deltas.
You need to keep separate:
the data (stored "elsewhere" as a large collection of plain text files, certainly not on GitHub)
the metadata you want (author, date of modification), stored in a regular Git repo in association with "shell" data (i.e., files that are actually "references", or a kind of "symlink", to the actual files kept elsewhere)
One system, based on Git, that provides this is git-annex, using your own cloud storage with (where implemented) the git-annex assistant: see its roadmap.

Simple version-control systems or versioning file system or versioning database

I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few Kilobytes each, and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory...) or not. On average, each record is changed once a month, but most changes have diffs less than a Kilobyte so it should be easy to compress versions. However, a naive database with one entry for each version would grow too quickly. I need the following operations:
basic CRUD operations: create, read, update, delete
quick listing of recent changes
quick listing of recent changes of a particular record
query for changes in a given period of time
query for changes by a given user (each edit is associated to some user id and optionally has a commit message as comment)
for write operations there must be a commit hook to validate and reject ill-formed records.
In short, I am looking for a Wiki-like software for simple records or files.
I thought about possible solutions:
Put files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion for a similar task with success?
Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; it would be more work, but I would learn something. This would be my preferred solution if it were just for fun.
Use a versioning file system. This would make setup, replication and access more difficult. Probably I would need to implement my own access API above the file system.
Use a versioning database system. Can you suggest some?
Use some other existing data store with versioning (MediaWiki?, Amazon Cloud Drive?, ...)
Obviously there are many paths. Which paths have been used by others with success for similar or larger amounts of data?
If you're not averse to having a raw copy of each file on your client (which I imagine is OK, if you're considering svn) then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should have close to optimal compression there.
With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
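As a rough sketch of that kind of scripting (the repo path, branch and file names are hypothetical), the following Python drives git's plumbing commands to add a new revision of a record to a bare repository without any checkout. Note that, for brevity, it writes a tree containing only the one file instead of merging with the parent commit's tree, which a real tool would need to do:

    # Minimal sketch: commit one file into a *bare* repo via git plumbing,
    # with no working-tree checkout. Only a flat, single-file tree is built
    # here; a real tool would merge with the parent commit's tree.
    import subprocess
    from typing import Optional

    def git(repo: str, *args: str, stdin: Optional[str] = None) -> str:
        result = subprocess.run(["git", "-C", repo, *args],
                                input=stdin, capture_output=True,
                                text=True, check=True)
        return result.stdout.strip()

    def commit_file(repo: str, branch: str, name: str,
                    local_file: str, msg: str) -> str:
        blob = git(repo, "hash-object", "-w", local_file)        # store content
        tree = git(repo, "mktree", stdin=f"100644 blob {blob}\t{name}\n")
        try:
            parent = git(repo, "rev-parse", "--verify", f"refs/heads/{branch}")
            commit = git(repo, "commit-tree", tree, "-p", parent, "-m", msg)
        except subprocess.CalledProcessError:                     # first commit
            commit = git(repo, "commit-tree", tree, "-m", msg)
        git(repo, "update-ref", f"refs/heads/{branch}", commit)   # advance branch
        return commit

    # e.g. commit_file("/srv/records.git", "main", "record-000123.xml",
    #                  "/tmp/record-000123.xml", "update record 123")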

Reason for monolithic data files

Primarily this seems to be a technique used by games, where they have all the sounds in one file, all the textures in another, etc., with these files commonly reaching gigabyte sizes.
What is the reason for doing this rather than keeping everything as small files in subdirectories (one per texture, say)? Many small games use that approach, while the monolithic system is favoured by larger companies.
Is there some file system overhead with lots of small files?
Are they trying to protect their property - although most just seem to be a compressed file with a new extension?
The reasons we use an "archive" system like this where I work (a game development company):
lookup speed: We rarely need to iterate over files in a directory; we're far more often looking them up directly by name. By using a custom "file allocation table" that is essentially just a sequence of hash( normalized_filename ) -> [ offset, size ], we can look up files very quickly. We can also keep this index in RAM, potentially interleave it with other index tables, etc. (A minimal sketch of such an index follows this list.)
(When we do need to iterate, we can either easily iterate over all files in a .arc, or we can store a list of filenames, a list of hash-of-filenames, or just a list of [ offset, size ] pairs somewhere -- maybe even as a file in the archive. This is usually faster than directory-traversal on a FS.)
metadata: It's easy for us to tuck in any file metadata we want. For example, a single bit in the "size" field indicates whether the file is compressed or not (if it is, it has a header with more details about how to decompress it). We can even vary compression on pieces of a file if we know enough about the structure of the file ahead of time (we do this for sprite archives).
size: One of the devices we use has a "file size must be a multiple of X" requirement, where X is large compared to some of our files. For example, some of our lua scripts end up being just a few hundred bytes when compiled; taking extra overhead per .luc file adds up quickly.
alignment: on the other hand, sometimes we want to waste space. To take advantage of faster streaming (e.g. background DMA) from the filesystem, some of our files do want to obey certain alignment/size requirements. We can take care of that right in the tool, and the align/size we're shooting for doesn't necessarily have to line up with the underlying FS, allowing us to waste space only where we need it.
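To make the "file allocation table" idea above concrete, here is a minimal Python sketch of such an index; the on-disk layout, the 32-bit FNV-1a hash and the lack of compression/alignment flags are all simplifications for illustration, not the actual format used:

    # Minimal sketch of a hash(normalized_filename) -> (offset, size) archive.
    # Layout: [count:u32] [hash,offset,size : u32*3]*count [raw file data...]
    import struct

    def fnv1a(name: str) -> int:
        """Hash a normalized (lower-case, forward-slash) path to 32 bits."""
        h = 0x811C9DC5
        for byte in name.lower().replace("\\", "/").encode("utf-8"):
            h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
        return h

    def build_arc(entries: dict, out_path: str) -> None:
        """Pack {filename: bytes} into the toy .arc format above."""
        offset = 4 + 12 * len(entries)           # data starts after the table
        table, blobs = [], []
        for name, data in entries.items():
            table.append(struct.pack("<III", fnv1a(name), offset, len(data)))
            blobs.append(data)
            offset += len(data)
        with open(out_path, "wb") as f:
            f.write(struct.pack("<I", len(entries)))
            f.writelines(table)
            f.writelines(blobs)

    def read_file(arc_path: str, name: str) -> bytes:
        """Look a file up directly by hashed name -- no directory iteration."""
        key = fnv1a(name)
        with open(arc_path, "rb") as f:
            (count,) = struct.unpack("<I", f.read(4))
            for _ in range(count):
                h, off, size = struct.unpack("<III", f.read(12))
                if h == key:
                    f.seek(off)
                    return f.read(size)
        raise FileNotFoundError(name)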
But those are the mundane reasons. The more fun stuff:
Each .arc registers itself in a list, and file-open attempts know to look in the arcs first. We search already-in-RAM archives first, then archives on the device FS, then the actual device FS. This gives us a ton of flexibility:
dynamic additions to the filesystem: at any time we can stream a new file or archive to the machine in question (over the network or the like) and have it appear as part of the "logical" filesystem; this is great when the actual FS resides in ROM or on a CD, and allows us to iterate much more quickly than we could otherwise.
(Doom's .wad system is a sort of example of the above, which allows modders to more easily override assets and scripts built into the game.)
possibility of no underlying fs: It's possible to use bin2obj to embed an entire arc directly in the executable (.rodata) at link time, at which point you don't ever need to look at the device FS -- we do this for certain small demo builds and the like. We can also send levels across the network or savegame-sneakernet this way. =)
organization and load/unload: since we can load and unload and override virtual "pieces" of our filesystem at any time, we can do some performance tricks with having the number of files in the FS be very small at any given time. We can additionally specify that an entire archive be loaded into memory, index table and data; our file load code is smart enough to know that if the file is already in memory, it doesn't need to do anything to read it other than move a pointer around. Some of the higher level code can actually detect that the file is in ram and just ask for the probably-already-looks-like-a-struct pointer directly.
portability: we only need to figure out how to get a few files on each new device we use, and then the remainder of the FS code is more or less the same. =) We do change the tool output a bit occasionally (for alignment reasons), but most of the processing remains the same.
de-duplication: with smarter archives, such as our sprite archives, we can (and do) de-duplicate data. If "jump" animation's fifth frame and "kick"'s third frame are the same, we can pull apart the file and only store one copy of that frame. We can do the same for whole files.
We ported a PC game to a system with much slower FS access recently. We didn't change the data format, and it turns out iterating through a dir on the raw device FS to load a hundred small XML files was absolutely killing our load times. The solution we used was to take each dir, make it into its own subdir.arc, and stick it in the master game.arc compressed. When the dir was needed (something like opendir was called) we decompressed the entire subdir.arc into RAM, added it to the filesystem, then iterated through it super-quickly.
It's the ability to throw something like this together in a few hours, and to ease the pain of porting across systems, that makes stuff like this worthwhile.
File systems do have an overhead. Usually, a file takes disk space rounded up to some power of 2 (e.g. up to 4 KB), so many small files would waste space. Some modern file systems try to mitigate that, but AFAIK it's not widespread yet. Additionally, file systems are often quite slow when accessing multiple files. E.g. it is usually considerably faster to copy one 400 MB file than 4000 100 KB files.
File systems come in handy when you have to modify files, because they handle changing file sizes much better than any simple home-grown solution. However, that's certainly not the case for constant game data.
On Apple systems, the most common way is to use, as you suggest, directories. They are called bundles, and are represented in the Finder as a single file, but if you look inside, they're actually directories. This makes writing code, and conserving memory when loading individual items out of the bundle, very easy. :-) It also makes incremental backups of gigantic databases easy: your iPhoto database, for instance, is just a bundle, so you only back up changed and new files.
On Windows, however, I believe this is much harder to do; it will look like a directory "no matter what" (I'm sure smart people have found a solution that makes Explorer see certain directories as a single file, but it's not common).
From a games developer's point of view, you're not dealing with files so small that disk space overhead is a major concern, so I doubt doublep's suggestion is the reason, given what a hassle it is. A single file does, however, make it much easier if users need to copy an entire game somewhere, and it makes it easy to check that the entire set is intact.
And, of course, it's harder to read for people who shouldn't have access to it. But it's also harder to modify, which means harder to patch and harder to write extensions for. A game that relies heavily on extensions tends to prefer the directory structure: The Sims, for example.
Were I the games developer, I'd love to go for individual files. Then again, I'd be using bundles as I'd be writing for the Mac ;-)
Cheers
Nik
I can think of multiple reasons.
As doublep suggested, files occupy more space on the disk than they require, so an archive saves space. 10k files (of any size) should save you about 20 MB when packed into an archive, since each file wastes roughly 2 KB of slack on a typical 4 KB-cluster file system. Not exactly a large amount of space nowadays, but still.
The other reason I can think of is disk fragmentation. I suspect a heavily fragmented disk will perform worse when accessing thousands of separate files scattered across it. But I'm no expert in this field, so I'd appreciate it if someone more experienced verified this.
Finally, I think this may also have something to do with restricting access to separate game files. You can have a bunch of Lua scripts exposed, mess with them and break something. Or you could have the outro cinematic/sound/text/whatever exposed and get spoiled by accessing it. I do that myself as well: I encrypt images with a multipass XOR key, pack text files and config variables into a monolithic file (zipped for extra security) and only leave music freely accessible. This way, the game's secrets will remain undiscovered for a bit longer :).
Or there may be another reason I never thought about :D.
As you know, games, especially from larger companies, try to squeeze out as much performance as they can. One technique is to have all the data in one large file and just DMA it to memory (think of it as a memcpy from CD to RAM). Since all the files are in one large one, there are no disk seeks, and a large number of files (which could otherwise cause a large number of seeks) can all be loaded quickly thanks to this technique.

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons), e.g., Git and Dropbox both use SHA256. The file names and dates can be different, but as long as the content gets the same hash generated, it never gets stored more than once.
It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
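To make the idea in the question concrete, here is a minimal Python sketch of file-level content-addressed storage: files whose bytes hash to the same SHA-256 digest are stored only once, and a separate name-to-digest table would map many paths onto one stored object. The directory layout is hypothetical and is not how Git or Dropbox actually lay out their stores:

    # Minimal sketch of content-addressed, file-level deduplication.
    # The "objects" directory name is a hypothetical choice for this sketch.
    import hashlib
    import os
    import shutil

    STORE = "objects"

    def put(path: str) -> str:
        """Store a file's content under its SHA-256 digest; return the digest."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        dest = os.path.join(STORE, digest)
        if not os.path.exists(dest):       # identical content is stored only once
            os.makedirs(STORE, exist_ok=True)
            shutil.copyfile(path, dest)
        return digest

    # A separate table mapping file names (and dates) to digests would provide
    # the "file system" view on top of this store.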
ZFS supports deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call this a "common" filesystem (afaik, it is currently only supported by *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already I/O bound, so the computational cost of hashing is not a bottleneck. If you hashed at the file-system level, all I/O operations, which are already slow, would get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today (and NetApp stands out because they support this on their primary storage as well, compared to other similar products which support it only on their backup or secondary storage; they are too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what filesystem is created in that LUN.
btrfs supports online de-duplication of data at the block level; an external tool is needed, and I'd recommend duperemove.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to provide copy-on-write (COW) semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives currently running less than US$100 per terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on a basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, that's separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count). Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy of the block to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.
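As a rough sketch of the block-level scheme described above (block size, hash choice and in-memory data structures are all simplifications), the following Python keeps one copy of each unique block, counts references, and copies a shared block before writing to it:

    # Minimal sketch of block-level deduplication with reference counts and
    # copy-on-write. Real filesystems persist these tables on disk, of course.
    import hashlib

    BLOCK = 4096
    blocks = {}      # digest -> block data (each unique block stored once)
    refcount = {}    # digest -> number of references from files

    def store(data: bytes) -> list:
        """Split data into blocks; add only blocks not already present."""
        digests = []
        for i in range(0, len(data), BLOCK):
            chunk = data[i:i + BLOCK]
            d = hashlib.sha256(chunk).hexdigest()
            if d not in blocks:
                blocks[d] = chunk
            refcount[d] = refcount.get(d, 0) + 1
            digests.append(d)
        return digests                      # the "file": a list of block digests

    def write_block(file_blocks: list, index: int, new_data: bytes) -> None:
        """Copy-on-write: writing never disturbs other files sharing the block."""
        old = file_blocks[index]
        refcount[old] -= 1                  # this file no longer references it
        d = hashlib.sha256(new_data).hexdigest()
        if d not in blocks:                 # may coalesce with an existing block
            blocks[d] = new_data
        refcount[d] = refcount.get(d, 0) + 1
        file_blocks[index] = d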

How can we manage non-code files in TFS for designers, etc?

Projects normally include a set of non-source-code files such as interface images (PSDs, JPGs, ...). How can we manage these types of files with TFS, and how can graphic designers check their image files in and out to use them in applications like Photoshop?
You can simply add binary files (PSD, JPG etc.) to your tree, with the following caveats:
Large files take more space on the server. A quote from http://social.msdn.microsoft.com/Forums/en-US/tfsversioncontrol/thread/6f642d0f-5459-4a14-a19d-ede34713bcf4 :
TFS does handle large (> 16mb) files differently. It does not perform Delta storage but instead stores a complete copy of each version. This is an optimization to make check-ins faster for those large files. There is no difference between text files and binary files. Small ones are Delta'd, large ones are Stored.
Large files are also slower to download (see the same link above).
If there is a conflict (i.e. two people modify the same binary file at the same time), one of them has to resolve the conflict completely manually, e.g. load all three image versions in an image editor, look at the differences, and merge the changes by hand.