Is there a distributed VCS that can manage large files?

Is there a distributed version control system (git, bazaar, mercurial, darcs etc.) that can handle files larger than available RAM?
I need to be able to commit large binary files (e.g., datasets, source video/images, archives), but I don't need to be able to diff them, just to commit and then update when the file changes.
I last looked at this about a year ago, and none of the obvious candidates allowed this, since they're all designed to diff in memory for speed. That left me with a VCS for managing code and something else ("asset management" software or just rsync and scripts) for large files, which is pretty ugly when the directory structures of the two overlap.

It's been three years since I asked this question, but as of version 2.0, Mercurial includes the largefiles extension, which accomplishes what I was originally looking for:
The largefiles extension allows for tracking large, incompressible binary files in Mercurial without requiring excessive bandwidth for clones and pulls. Files added as largefiles are not tracked directly by Mercurial; rather, their revisions are identified by a checksum, and Mercurial tracks these checksums. This way, when you clone a repository or pull in changesets, the large files in older revisions of the repository are not needed, and only the ones needed to update to the current version are downloaded. This saves both disk space and bandwidth.
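For reference, a minimal sketch of how the extension is used, assuming Mercurial 2.0 or later (the file name is just an example):

    # enable the bundled extension in ~/.hgrc or the repository's .hg/hgrc
    [extensions]
    largefiles =

    hg add --large assets/raw-footage.mov   # tracked by checksum; old revisions aren't pulled
    hg commit -m "add raw footage as a largefile"
    hg update                                # downloads only the largefile revisions it needs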

No free distributed version control system supports this. If you want this feature, you will have to implement it.
You can write off git: the developers are interested in raw performance for the Linux kernel development use case, and it is improbable they would ever accept the performance trade-off of scaling to huge binary files. I do not know about Mercurial, but its developers seem to have made similar choices to git's in coupling their operating model to their storage model for performance.
In principle, Bazaar should be able to support your use case with a plugin that implements tree/branch/repository formats whose on-disk storage and implementation strategy are optimized for your use case. If the internal architecture blocks you and you release useful code, I expect the core developers will help fix it. You could also set up a feature development contract with Canonical.
Probably the most pragmatic approach, irrespective of the specific DVCS would be to build a hybrid system: implement a huge-file store, and store references to blobs in this store into the DVCS of your choice.
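For illustration, a minimal sketch of that hybrid approach, assuming a shared directory as the huge-file store (the /srv/blobstore path and file names are hypothetical) and git for the references:

    # put the big file into a content-addressed store and commit only a small pointer file
    SUM=$(sha256sum video/master.mov | cut -d' ' -f1)
    cp video/master.mov "/srv/blobstore/$SUM"
    echo "$SUM  video/master.mov" > video/master.mov.ref
    echo "video/master.mov" >> .gitignore        # keep the blob itself out of the DVCS
    git add .gitignore video/master.mov.ref
    git commit -m "reference master.mov by checksum"

An update script would do the reverse: read the .ref file and copy the matching blob back out of the store.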
Full disclosure: I am a former employee of Canonical and worked closely with the Bazaar developers.

Yes, Plastic SCM. It's distributed, and it manages huge files in 4 MB blocks, so it isn't limited by having to load them entirely into memory at any time. Find a tutorial on DVCS here:
http://codicesoftware.blogspot.com/2010/03/distributed-development-for-windows.html

BUP might be what you're looking for. It was built as an extension of git functionality for doing backups, but that's effectively the same thing. It breaks files into chunks and uses a rolling hash to make the file content-addressable and store it efficiently. A rough usage sketch follows the links below.
https://github.com/bup/bup
http://blogs.kde.org/node/4440
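This is only a sketch of the typical workflow (the paths are examples; see the bup documentation for details):

    bup init                          # create the default repository under ~/.bup
    bup index ~/datasets              # scan the files to be saved
    bup save -n datasets ~/datasets   # split into chunks, deduplicate, and store
    bup ls datasets/latest            # browse the most recent save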

I think it would be inefficient to store binary files in any form of version control system.
The better idea would be to store metadata text files in the repository that reference the binary objects.

Does it have to be distributed? Supposedly the one big benefit Subversion has over the newer, distributed VCSes is its superior ability to deal with binary files.

I came to the conclusion that the best solution in this case would be to use ZFS.
Yes, ZFS is not a DVCS, but:
You can allocate space for a repository by creating a new filesystem
You can track changes by creating snapshots
You can send snapshots (commits) to another ZFS dataset (see the sketch below)
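A rough sketch of that workflow, assuming a pool named tank and a remote host named backuphost (both hypothetical):

    zfs create tank/myrepo                             # allocate space as a new filesystem
    zfs snapshot tank/myrepo@v1                        # "commit" the current state
    zfs send tank/myrepo@v1 | ssh backuphost zfs receive backup/myrepo                      # full "push"
    zfs snapshot tank/myrepo@v2
    zfs send -i tank/myrepo@v1 tank/myrepo@v2 | ssh backuphost zfs receive backup/myrepo    # incremental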

Related

Migrating from CVS to distributed version control (Mercurial)

Some background: we're working on projects that involve developers across two different countries, and we've been using CVS. Developers in the country not hosting the CVS server take forever to connect to the remote server, so we've set up two separate CVS servers, one in each country, with a sync job that keeps them in sync every hour or so.
Given this, we're looking at migrating to a distributed version control system, mostly because we've been having problems with the sync job failing, and because of the limitation that for a given set of files, only one side can hold the write lock at a time.
We're currently looking at Mercurial for this purpose, so can anyone help tell us if:
a. Will Mercurial be a good fit for our use case above? How easy will it be for devs to make the transition, i.e. will they still be able to work the same way, etc.?
b. Can Mercurial support branching a specific folder only?
c. We also hold a lot of binary docs in version control; will they be suitable for Mercurial?
d. Is there support for taking a "write lock" on particular files? I.e., I want to prevent other people from updating these particular files while I'm working on them.
Thanks!
a/ and d/: yes and no. Yes, a DVCS like Mercurial is a good fit for distributed development, but by nature there is no longer a "write lock", since there is no single "central server" that would be notified each time you want to modify anything.
You will pull (or check incomings) regularly from the remote repo.
b/ no, this isn't how a DVCS works, since a branch is no longer a copy of a directory.
c/ binaries are best kept outside a DVCS (since the repository gets cloned around, and binaries would make its size grow too fast)
See "How is Mercurial/Git worse than Subversion with binary files?"

do any source-control systems use a document database for storage?

One of those questions that's difficult to google.
We were running into issues the other day with the speed of our svn repository. The standard solution to this seems to be "more RAM! more CPU!" etc. Which got me wondering: are there any source-control systems that use a document/NoSQL database (MongoDB, CouchDB, etc.) for storage? It seems like it might be a natural fit, but I'm no expert on source-control database theory. Perhaps there's a way to configure a more recent source control system to use a document DB as storage?
None that I know of do, and they wouldn't want to. Given the difference in degrees of testing, it would likely hurt robustness (a really bad thing for a source code repository). It would probably also end up hurting performance, because of the inability to do delta storage.
Note that Subversion has two very different storage mechanisms, one backed by the embedded Berkeley DB, and the other backed by simple files. One or the other of these might be better suited to your usage.
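For reference, the back end is chosen when the repository is created; a quick sketch (the paths are examples, and FSFS, the file-based back end, has long been the default):

    svnadmin create --fs-type fsfs /var/svn/myrepo    # simple-files (FSFS) backend
    svnadmin create --fs-type bdb  /var/svn/oldrepo   # Berkeley DB backend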
Also, since you posed your question pretty broadly, I'll comment on Git and TFS.
Git uses very efficiently packed files in the filesystem to store the repository. Frequently, the entire history is smaller than a checkout. For one very old project that my lab has, the entire history is 57MiB, and a working tree (not counting history) is 56MiB.
TFS stores a lot (possibly all) of its data in a SQL database.
Git uses memory-mapped files just like MongoDB :)
Though Git doesn't actually use MongoDB, and I don't think it would want to. If you look at Git, it doesn't really need a NoSQL DB; it basically is a DB.
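You can see the key/value nature of Git's object store with a couple of plumbing commands; a small illustration (run inside any git repository):

    echo 'hello' | git hash-object -w --stdin
    # prints the key: ce013625030ba8dba906f756967f9e9ca394464a
    git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
    # prints the value back: hello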
As far as I know, no VCS uses a NoSQL/document-based database. The idea of using CouchDB etc. is not new, but no one has implemented such a thing so far.

Is a VCS appropriate for usage by a designer?

I know that a VCS is absolutely critical for a developer to increase productivity and protect the code, no doubts about it. But what about a designer using, say, Photoshop (though the question isn't specific to any tool; that's just to make my point clearer)?
VCSs use delta compression to store different versions of files. This works very well for code, but for images it's a problem. Raster image files are binary formats, though vector image files are text (SVG comes to mind) and pose no problem. The problem comes with .psd files (and any other image "source" file): those can get pretty big, and since I'm not familiar with the format, I'll consider them binary files. How would a VCS work in this situation?
The repository could be pretty darned big if the VCS server isn't able to diff the files efficiently (or worse, not at all) and over time this can become a really big pain when someone needs to check out the repository (or clone it if using a DVCS).
Have any of you used a VCS for this purpose? How well does it work? I'm mostly interested in Mercurial, though this is a general situation that applies to any VCS.
Designers usually use specialized tools like AlienBrain, Adobe Version Cue or similar, which are essentially version control systems that understand images and other media assets, allowing things like diffing two images.
Designers IMHO should definitely use VCS systems, at least as a means of versioning and backup - their stuff is just as important as specs, documentation, code, deploy scripts and everything else that makes up a project.
I do not know if there are bridges between "asset management systems" like the ones mentioned and developers' VCS systems, though.
Version control systems are useful for ANYONE doing work that they might need an older version of at some later date. That said, I have set all my creative friends up with Subversion (in the past), and now I recommend Git - even those doing video editing with hundreds of gigs of video. They can archive the projects off when they get final payment. Drive space is CHEAP, cheaper than ever before, and size isn't an issue in any modern VCS. Being able to revert to a previous working state, or to experiment with something without losing data and without manually managing multiple "temp" directories, is invaluable if you bill by the hour.
Yes
Don't worry about the size, if you run out of space, just buy a larger hard drive.
Losing information will be far costlier.
In addition to a VCS (any will do, as you won't be needing delta storage), do regular backups.
When doing a checkout you shouldn't be standing at the root of the system, but rather at a specific branch of your project; that way it won't be slower than a simple copy operation of that folder.
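For example, check out just your project's trunk or branch rather than the repository root (the URL is hypothetical):

    svn checkout http://svn.example.com/repos/design/photoshoot-2010/trunk photoshoot-2010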
Definitely recommend using version control for any type of file you care about, or can't afford to lose. Disk space is cheap, and as has already been pointed out, it'd be far worse to lose a bunch of important files than to spend a few extra bucks on a new HDD. I recommend Subversion since it has file locking, an important feature when working with binary files under version control, to prevent ugly or impossible merge conflicts.
I believe so. Especially if you wish to track changes over time or need to rollback to previous versions. Centralized source control may be the way to go if you're worried about the size.

Version control for version control?

I was overseeing branching and merging throughout the last release at my company, and a number of times had to modify our Subversion pre-commit hooks to enforce different requirements on check-in comments and such. I was a bit nervous every time I was editing those files, because (a) they're part of a live production system, albeit only used internally (and we're not a huge organization), and (b) they're not under version control themselves.
I'm curious what sort of fail-safes people have in place on their version control infrastructure. Daily backups? "Meta" version control? I suppose the former is in place here as part of the backup of the whole repository. And the latter would be useful as the complexity of check-in requirements grows...
Natch - the version-control configuration and any other infrastructure code are also under version control, but I would use a separate project from any development project.
I prefer a searchable wiki or similar knowledge-base repository to clogging up your bug-tracking system with things like VCS config.
Most importantly, make sure that the documentation is kept up to date - in my experience, people are vastly better at keeping code docs up to date than admin docs. This may have been down to the individuals concerned. One thing that is often overlooked: if systems are configured according to standard Unix practices or a similar philosophy, that implies a body of knowledge about locations that may not be familiar to an OS X or Windows programmer suddenly faced with fixing a broken script. Without being condescending, make sure basic assumptions about location and interdependency are documented.
You should document all "setup" configuration for all your tools, and these documents should be checked into version control. For tools with text-file configurations which allow comments, you could just check in the config file. But for tools that require using the interface, you should have a full document with images of the dialog boxes showing which choices are selected.
Most importantly though, these documents should say WHY you have set the values chosen (when not taking the default).
Second, as a backup, the same documents should be included in your bug tracking software under a "How do I set up the version control software?" bug. (The bug tracking database is located on a different physical server, right?)
Third, all of this should be backed up off-site. I'm sure there are questions on SO about backup strategies.
What's wrong with using the same version control repository for the commit hooks and other configuration files? That's how I've handled it in the past when I've been responsible for a project's configuration management.
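One way to do it, sketched with hypothetical paths: keep the hook scripts in a normal versioned directory of the same repository, and refresh the live copy on the server whenever they change:

    svn add tools/hooks/pre-commit
    svn commit -m "track the pre-commit hook itself" tools/hooks
    # on the server, export the committed version over the live hook
    svn export --force file:///var/svn/myrepo/trunk/tools/hooks/pre-commit /var/svn/myrepo/hooks/pre-commit
    chmod +x /var/svn/myrepo/hooks/pre-commit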
You should also back up your svn repository. That way if the repository itself becomes corrupted or the server catches fire or something, you can recover both your project and the svn control files.
If you have build scripts that do this (such as NAnt scripts), then you could check those in as well.

Version Management / Backup solution

This is not strictly a technical question, however I feel this will be useful for many technical people as well.
I'm looking for a version management / backup solution which need not be only for source code. This could be for non-text files e.g. images.
The requirement is this -
Every time I save the file from within the application, it should create a version.
I should be able to add comments for say, major revisions.
At any time, there should be only one version current.
I should be able to view previous versions without doing a 'restore'
I should be able to move back and forth between versions.
A calendar feature showing the various versions of a file would be helpful, if I could get to it for a specific file from the Explorer context menu
I don't really need to compare different versions or anything like that.
Windows solutions only. I've looked at NTI Shadow, and it comes somewhat close to what I'm looking for.
Are there any paid / free / open source solutions for the above requirements?
Pretty much any version control system I know of supports binary files. Subversion (SVN for short) is free and pretty popular. If you also download TortoiseSVN, you can handle everything from within Explorer.
The only requirement I cannot help you with is #1, automatic saving from within your application. But you can of course approximate this by saving over the old version of the file in the file system and committing your changes via TortoiseSVN.
PS: for some reason I cannot connect to the SVN site right now; it might be down at the moment. It is still a great product, though :)
[not an actual answer, just a note about DVCS backup capabilities]
I would not advise using a DVCS (Distributed Version Control System) like Git or the like as a backup strategy.
As stated in "DVCS Myths":
So, why make backups of a source control server with so many backups?
It is improbable that many servers will suffer catastrophic hardware failures simultaneously, but it is not impossible.
A more likely scenario might be a particularly nasty computer virus that sinks its teeth into an entire network of vulnerable machines.
In any case, the probability of any or all of your backups becoming suddenly unavailable is really not the point.
The bottom line is that using independent clones as canonical backups (as opposed to temporary stopgaps) is a suboptimal strategy.
Security, for example, should be considered.
If you are using authorization rules to control access to specific portions of your repository, canonicalizing an arbitrary clone of the repository effectively renders those rules useless.
While this would rarely be a matter of practical concern in a controlled corporate environment, it is nonetheless possible.
(My input:) A full data backup is not really possible with a DVCS, since it would imply that all repositories push their changes to a "central" repository, which is not the main use case in a DVCS (whereas with a classical VCS, anything committed is stored in one place).
The key win of DVCS for backups, then, is that you don't really need to invest in a "hot" backup.
When the server inevitably goes down, DVCS will buy you time. Lots of time. You'll essentially be running at full productivity (or very nearly so) while you rebuild your server from backup.
When changesets created during the server downtime are pushed back to the restored server, the freshly restored authorization rules will be reapplied and you'll be back on track.
So, for us:
hot "backup" is actually achieved with SRDF (Symmetrix Remote Data Facility), but that is commercial and is linked to our infrastructure which support LUNs duplication to achieve data replication.
incremental daily backup is achieved for a limited set of repositories (including some "central" Git repos), but in our case, with a custom tool.
I think you're looking for the benefits of a versioning file system that takes immutable snapshots of files upon each write. You could build this on top of a DVCS if something set up watches on the files in the versioned directory (committing each time a file is changed), but that would get ugly, quickly.
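As a crude illustration of that watch-and-commit idea on a Unix-like system (the question asked for Windows, so treat this purely as a sketch of the mechanism, using inotify-tools and Mercurial; the directory is hypothetical):

    # commit automatically every time a file in the watched directory is written
    inotifywait -m -r -e close_write --format '%w%f' ~/designs |
    while read -r changed; do
        hg -R ~/designs commit -A -m "autosave: $changed"
    done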
This topic was also explored in this question. I think your ideal solution would be a DVCS repository that resides on a versioning/copy-on-write file system of some type. This lets you manage revisions of each file independently of the commits that you make in the DVCS.
Unless, of course, toxic revisions would not be an issue for you.
In order for this to be transparent to applications (i.e., so they would not need to implement a different API for saving/loading files to access these backup features), you'd want to do this in the operating system, at its file system layer.
The ZFS filesystem could be wrapped to provide the user-interface capabilities you describe, but it is doubtful this filesystem would ever reach Windows (directly, at least).
A simpler way to think of this is to look at network storage systems which can provide you the features you need.
NetApp Snapshot offers capabilities that could be tapped to do this at the network storage level. It implements CIFS, so it is definitely available on Windows. Open your wallet.
If you think this is an extremely important feature, you may want to consider operating systems other than Windows, where filesystems and filesystem support are more diverse.
I strongly suggest using Subversion. I have used four different version control systems and have found Subversion powerful and easy to use.
For Windows, the easiest server to install is VisualSVN.
And SmartSVN is the best Subversion client I've used.