How can I do a > 5GB commit to Mercurial? - version-control

I'm trying to import an existing project into Mercurial. The project is a bit over 5GB.
When I try to do an hg push I always get an error about being out of buffer space.
Does anyone know of a good way of doing the initial commit?

If you are not tied down to using Mercurial, then another possibility would be to use boar. It is not a DVCS like Mercurial; instead you have a central repository in which you store your data and "check out" versions of files, in much the same way as with Subversion.
The important part is that it is written with the express purpose of storing large, binary files.
I have not used it, so I cannot comment on how good it is at its job, or how stable it is, but it is a possible alternative that may well suit your needs.

For a brief explanation of why storing binary files in Mercurial is discouraged, please read
https://www.mercurial-scm.org/wiki/BinaryFiles and http://kiln.stackexchange.com/questions/1074/why-is-it-bad-to-store-binary-files-in-mercurial
In our case we handle binary files using Dropbox. It allows you to both keep the history of files and sync the folder between team members. If you don't need to keep history of files, you can use rsync to keep binaries sync'ed.

Assuming you do actually need to put such a large commit into Mercurial, I would guess that rather than a few million tiny files, the size of your commit is primarily due to a handful of very large files. In this case you could investigate the largefiles extension, which should suit your needs. When you add a large file, it is tracked by checksum rather than content, so what Mercurial itself tracks is relatively small. The extension will take care of the versions for you.
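For what it's worth, here is a minimal sketch of how the extension is enabled and used (the file name is made up; largefiles ships with modern Mercurial):

    # In your ~/.hgrc (or the repository's .hg/hgrc):
    [extensions]
    largefiles =

    # Then add big files with --large so Mercurial tracks a checksum
    # rather than the content:
    $ hg add --large big-blob.iso
    $ hg commit -m "Add big-blob.iso as a largefile"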
However, as Alex Stuckey mentions, you shouldn't normally be committing things such as compiled binaries (object code, resulting executables, ...), which are the most likely reason you have such a big commit. You would do well to create a decent .hgignore file (one that removes the usual suspects - *.o, *.pdb, whatever, ...), which will help eliminate accidentally adding files like that in the future. I have a standard .hgignore which gets put into nearly all my repositories as the first commit, and has served me well.
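As an illustration, a starting .hgignore along those lines might look like this (the patterns are just the usual suspects; adjust for your own toolchain):

    syntax: glob
    *.o
    *.obj
    *.pyc
    *.pdb
    *.exe
    *.dll
    build/
    bin/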

Related

Starting to version an already medium size project

I am about to start participating in the development of a medium-sized project (~50k lines) that was until now written by a single person, and not versioned; as a result folders are cluttered with different versions of the same file (named file1, file2, file3, etc.).
I proposed to start using a VCS for it (a priori Mercurial, which is the only one I've ever used -- for my personal projects -- but I'm open to suggestions), so I'm taking any good ideas as to how to "start" the repository. E.g., should I make an initial commit with all the existing files, and immediately make a new commit with the unused files removed? Or something else?
(constructive remarks on mercurial vs bazaar vs git vs whatever are also welcome.)
Thanks for your tips.
E.g., should I make an initial commit with all the existing files, and immediately make a new commit with the unused files removed?
If the size of the repository is not a concern, then yes, that is a good starting point. Otherwise you can just commit what's actually used, and go from there.
As for which system, all DVCSes stick to the same core principles. Which one you pick is entirely subjective — the only way to truly know which one you like is to try each one.
I would say use whatever you are most comfortable with and whatever meets your needs. As for where to start, I personally would seed the repo with the current source as-is; that way you can verify that everything builds and runs as expected. You can make this initial seed a branch, so that you can always go back to your starting point before refactoring.
My approach to this was:
create a Mercurial repository in the existing project folder ("existing")
commit all project files to "existing"
create an empty repository in a different location ("new")
As files are tested and QA'd (this was necessary because there was so much dross in "existing"), pull them from "existing" to "new".
Once files have been pulled into "new", delete the corresponding files from "existing". If access is needed to these files while the migration is under way, push them back from "new" to "existing".
This gave me the advantage of putting everything under some sort of control for recovery purposes, plus control over how the project was introduced to the DVCS. Eventually the contents of the existing project folder were completely tested and approved for the project moving forward. At this point the "existing" directory could be deleted or changed into a working folder, and "new" became the actual project folder.
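In Mercurial terms that workflow looks roughly like the following sketch (paths, file names, and messages are made up, and since hg pull moves whole changesets rather than individual files, the per-file migration here is done by copying and committing):

    # Put the existing mess under some sort of control for recovery:
    $ cd existing
    $ hg init
    $ hg addremove
    $ hg commit -m "snapshot of unversioned project"

    # Create the clean repository:
    $ hg init ../new

    # As each file passes QA, move it across and record it on both sides:
    $ cp utils.py ../new/
    $ (cd ../new && hg add utils.py && hg commit -m "migrate utils.py")
    $ hg remove utils.py
    $ hg commit -m "utils.py migrated to new repo"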
I think Mercurial is a good choice. Lightweight, fast, very simple to use and well-integrated with Windows (if that's the platform you're dealing with).
I would probably get rid of all the clutter before the first commit. Delete everything you don't care about, run all the necessary tests and only then do the commit.
Yes, I'm dead set against the 0-day cluttering of repos.
Granted, a 50K SLOC project isn't very big, but if you commit files you already know you won't need, they will make your repo slightly bigger.
Also, remember to check that the tree doesn't contain large binary files. If it does, get rid of them if at all possible.

Working with folders in RCS

I have been following the tutorial http://www.burlingtontelecom.net/~ashawley/rcs/tutorial.html on how to work with files using RCS. This works well but only with one file. Is there a way to create an RCS file with directories as well?
I have a project folder called myproject, and in this directory I have all my files for that project. I want to create a revision control system for the myproject folder and all its files that are inside.
As William's comment says, RCS only works with single files. (It also doesn't seem to be particularly suitable for multiple-user stuff.)
Of course, nothing stops you from putting each (source) file in a directory under RCS control; in fact, this is essentially what CVS does (though in recent versions it handles the RCS data itself, rather than invoking RCS to do it as it used to do). Unfortunately, this fragments the change history rather badly; a commit affecting many files ends up as separate commits to each file, which just happen to have the same commit message (and timestamp?), and in general every file will have a different revision in what the user might like to think of as the "same" revision. (This makes tags quite essential.) CVS also has issues with the atomicity of commits: you could end up with commit A and commit B getting tangled up, such that in file foo commit A precedes commit B, but in file bar commit B precedes commit A!
SVN (Subversion) is an attempt to rectify some of the problems in CVS, though it also brings some new limitations, and keeps many of the existing ones; it is probably wiser (as William implies) to just use a distributed version control system (DVCS) for your multi-file projects. There are many choices:
Darcs uses a unique patch-based model: a repository is treated as a sequence of patches, which can be applied to an empty tree to build the current revision; patches can often be reordered by "commuting" pairs of patches, and cherry-picking patches from other repositories is quite easy. The downside is that the change history is a bit less clear than in most DVCSes. See http://wiki.darcs.net/Using/Model, http://en.wikibooks.org/wiki/Understanding_Darcs/Patch_theory.
Directed-acyclic-graph (DAG) based DVCSes model a repository as a directed acyclic graph of revisions, where each revision can have one parent, two parents, or perhaps more. Each revision has an associated file tree state; sometimes renames are also tracked somehow.
Git, as already mentioned. Has a very simple model, but a very complicated interface: there are many commands, some of which are not really intended for humans to use (owing to many parts of it having been prototyped in shell script, probably), so it can be hard to find the ones you want. Also, its model might be a bit too simple: it doesn't track renames at all.
Bazaar (a.k.a. bzr) has a more complicated model, including support for file/directory renames. It's difficult to say how much more complicated, though, because whatever documentation may exist is not nearly as accessible as Git's. It does, however, have a rather simpler user interface, and there are a number of useful plugins, including a distributed-development-friendly SVN plugin: committing from a branch back to SVN need not interfere with the validity of others' branches of your branches, and bzr metadata is even committed back to SVN. Can make things much less painful if you want to start hacking on an SVN-based project without having commit access, but hope to get your changes committed eventually. Bazaar is my personal favorite DAG-based DVCS.
Mercurial (a.k.a. hg) seems fairly similar to Bazaar, though I think it tracks renames only for individual files, not for directories. It also supports plugins, though its SVN plugin isn't as nice as Bazaar's: it doesn't support lossless commits, so branching from other peoples' branches is unwise. I don't have much experience with it, so I can't really evaluate it in-depth.
As the comments already mention, if you are starting out with version control, you would be well advised to choose a newer system than RCS (git, mercurial, fossil, subversion, ...). That said, RCS still works fine for a single developer working primarily on a single machine - I still use it for my own code because I've not yet worked out how to get the (20+ years of) history I want into git in the way I want it.
Anyway, to use RCS, make sure you have an RCS sub-directory in each directory where you have working source code under RCS management. The RCS files will be placed in the sub-directory automatically, and retrieved automatically. If your version of make is not already aware of RCS, then you can train it so that it is - or get a version of make that does (GNU Make, for example).
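A minimal RCS session looks like this (standard RCS commands; the file name is an example):

    $ mkdir RCS              # RCS stores the ,v history files here automatically
    $ ci -u hello.c          # initial check-in; -u keeps a read-only working copy
    $ co -l hello.c          # check out with a lock before editing
    $ vi hello.c
    $ ci -u hello.c          # check the edits in and keep a working copy
    $ rlog hello.c           # view the revision history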
TL;DR - Look into DCVS as an alternative to RCS. It uses CVS, which uses RCS underneath, but it's more modular, supporting a distributed repository as well as a hierarchy of directories.
I'm currently going through a similar issue, and may have found something worthy of note, especially for people who are being forced to use a light, command-line based revision control system with multiple team members.
My manager will not let go of the idea of using RCS as our version control. But per the specifications, he wants developers to be able to create and edit their own repositories on a local server within our company. There are two issues with this:
RCS does not create, nor hold, any sort of 'repository'. It is software that keeps track of file edits on a per-file basis, meaning that the 'repository' is nothing more than another directory of RCS checked-in files. This is sub-par for team-oriented projects, to say the least.
On a large project with multiple directories and tens of individual working files, even the prospect of creating a top-level RCS directory with symbolic links from the working directories gives rise to complications such as naming conventions, as well as forgetting which file came from which bottom-level working directory.
Along the lines of what SamB posted, even CVS introduces additional problems on top of RCS that we would have to account for, though it does give us a slight ability for some additional hierarchy. But one suggestion he left out was DCVS.
It's essentially an extension of CVS and CVSup that:
contains functionality to distribute CVS repositories with local lines of development and automatically handles synchronization of the distributed repositories in the background.

File history: in the source or let scm handle it?

I'm learning mercurial as my solo scm software. With other management software, you can put change comments into the file header through tags. With hg you comment the change set, and that doesn't get into the source. I'm more used to central control like VSS.
Why should I put the file history into the header of the source file? Should I let mercurial manage the history with my changeset comments?
Let the source control system handle it.
If you put change details in the header it will soon become unwieldy and overwhelm the actual code.
Additionally if the scm has the concept of changelists (where many files are grouped into a single change) then you'll be able to write the comment so that it applies to the whole change and not just the edits in the one file (if that makes sense), giving you a clearer picture of why the edit was required.
Yes; let the source control system handle your changeset comments. The rationale for this is that it makes considerably more sense when you're viewing the change log later, trying to work out what's going on between two versions of a file - the source control system can present the change comment to try and enlighten the situation.
There's no reason to manually maintain a file history when SCM software is much better suited to solve this problem. All too often I see partially-completed file histories in the source, which actually hurts, because people incorrectly assume it is accurate.
The difference is not whether it's a centralized or distributed VCS, it's more about what's being changed.
When I moved to .Net, the number of files updated for any individual change seemed to skyrocket. If I had to log the change in each file, I'd never get any real work done. By commenting on the set of changes, it doesn't matter how many files I had to update.
If I ever needed to identify all of the changes for a particular change, I can diff between the two versions of the project.
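In Mercurial, for instance, identifying everything that changed between two project versions is a single command (the revision numbers here are placeholders):

    $ hg diff -r 41 -r 57      # every change in the project between two revisions
    $ hg log -v -r 41:57       # or the commit messages plus the files each touched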
The biggest difference (and advantage) I saw when switching away from SourceSafe was the switch from file based to project based commits. As soon as I got used to that, I stopped adding change-log type comments to all of my files.
(As a side effect, I've found that my process description comments have gotten better)
I'm not a big proponent of littering the code with change comments. In the event they are needed they can be looked up in the SCM (at least for the SCM variants I have used). If you do want them in the file, consider putting them at the end instead of the beginning. That way you won't have to scroll down past the (uninteresting, to me at least) comments before you get to the actual code.
Another vote for letting the SCM system handle the checkin comments, but I do have one thing to add.
Some systems allow you to use RCS tags in your source code where the SCM can insert the change history directly into the source file being committed automatically. Sounds like a nice balance because the history is then in the SCM system and then automatically put into the source code itself.
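For the curious, RCS-style keyword expansion looks roughly like this: you put a $Log$ keyword in a comment, and the system rewrites that block on check-in (the expanded format shown is approximate):

    $ head -3 version.c
    /*
     * $Log$
     */
    $ ci -u -m"Fix off-by-one in buffer handling." version.c
    $ head -6 version.c
    /*
     * $Log: version.c,v $
     * Revision 1.1  2010/06/01 12:00:00  alice
     * Fix off-by-one in buffer handling.
     */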
The problem is that this process changes the source file. I think that's a bad idea, because the file on disk cannot be final until after your comment is inserted. If you were a good engineer, you built and tested your changes before the commit. If your source changes after the commit, then you've essentially got a build that could be broken - but most engineers won't build after a commit - why should they?
But it's just a comment, you say! True, but I did have a case where code in my source file happened, strangely enough, to look like an RCS header tag, and that section of the code got replaced on check-in, thereby munging my code. Easy enough to fix, but bad that a build got broken for 20+ users.
It's much easier to forget to maintain history in the source. Since (IMO) one should always comment commits to the source control system anyway, that problem disappears. Also, if you change lots of files before a commit, updating the history in every one of them is annoying busywork. This is really one of the points of having an SCM.
I have experience with this. I've had the file history in the comments, and it was awful. Nothing but garbage; sometimes you had to scroll down through almost 1k lines of change comments before you finally got to what you wanted. Not to mention you're slowing down other parts of your build process by adding more kilobytes to your source tree.

Are there version control systems that allow you to permanently delete files?

I need to keep some large files (several GB) under version control.
I don't need, and can't afford, to keep every version of these files.
I want to be able to remove old versions of the large files from my VCS at some point.
The files that I want to keep under version control are big .zip files or ISO images.
These files may contain executable software or data (seismic data, SAR images, GNSS data), and they are provided by my company's software suppliers.
What version control system could I use?
In CVS you can do that by removing the files from the repo. Subversion allows it by dumping the contents of the repo and filtering the dump to remove the files (which is a bit cumbersome). Perforce has an obliterate command for this. Many of the newer distributed VCSes make it rather difficult through their use of hashes all over the place, and the fact that your repo may have been replicated elsewhere complicates things further. Hg has a strip command (part of the mq extension), and I think Git can also do it.
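Concretely, the Subversion dance and the Mercurial command look like this (a sketch only; repository paths, the target file, and the revision number are examples):

    # Subversion: dump the repo, filter the offending path out, reload
    $ svnadmin dump /srv/repo > repo.dump
    $ svndumpfilter exclude trunk/big-image.iso < repo.dump > clean.dump
    $ svnadmin create /srv/repo-clean
    $ svnadmin load /srv/repo-clean < clean.dump

    # Mercurial: remove a changeset (and its descendants) entirely
    $ hg strip 42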
I don't think there's any version control system that allows you to do that routinely, because it goes against everything version control systems stand for.
Perforce generally allows files to be stored in two ways: head revision only (so you'd only ever have one copy) or all revisions. Perforce does have the admin-level obliterate command that can be used to delete revisions. It's up to you to query for a list of files, possibly by date or number of revisions, and to pass the revisions to the obliterate command. As the name suggests, obliterate deletes the revisions permanently from the database, so I always generate scripts to do this and review them before running them. If the obliterate command is NOT run with the -y flag, it only generates a list of what would be obliterated, which is also very useful.
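A sketch of that review-then-run workflow (the depot path and revision range are examples):

    # Without -y, obliterate only reports what it would remove:
    $ p4 obliterate //depot/builds/installer.zip#1,#9
    # After reviewing the report, run it for real:
    $ p4 obliterate -y //depot/builds/installer.zip#1,#9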
Somehow I get the impression that you should not use a version control system at all. As said before, what you're trying to do goes against everything you would need a version control system for in the first place.
I suggest you create a file system directory structure that makes sense for what you're trying to accomplish, so that you can structure your data, and just make backups of those files.
TFS has a destroy command that you can use to permanently delete files or revisions as you see fit.
There is more information at this MSDN article.
Many version control systems can be configured to store only the differences between successive versions of a file, saving space that way.
For example if you have a 1Gig file committed, change a part of it and commit it again, only the changed part will be stored in the version control system.
There won't be 2Gigs used (initial and new file) but only 1Gig+sizeOfChanges.
There's just one downside: if you're storing files whose whole content changes from revision to revision, this can be counter-productive, as the changes take almost as much space as the original version. Archive files are an example of such files, where a small change in the (real) content can lead to a completely changed content of the archive file.
I'd suggest testing several version control systems yourself, with your specific needs and environment, and monitoring on the server side how the storage requirements of each one change.
Some distributed version control systems allow you to create "checkpoints", which let you use a given version as a kind of base revision and save you from pulling all the history before the checkpoint on every checkout. So you can remove the big files, create a checkpoint, and checkout/clone the repository from that checkpoint into a new directory. You then have a new, small repository, but without the history before the checkpoint. If you don't need that history, you can burn the old repository to CD and use the new, partial one from now on.
I've only tested it in darcs, and there it works, but YMMV depending on version control system and use cases.
It sounds to me like you need an intelligent backup system, rather than version control.
I use SyncBackSE; it allows you to keep a number of previous versions, and can also do things like "ignore all files changed more than 30 days ago".
It's one of the few bits of paid-for software I use. I think it's worth checking out.
I think you're talking about something like AlienBrain's "bucket" system, aren't you? That is, the ability to remove some revisions from version control.
If you want to destroy an item, it's normally called "obliterate" and it's supported by a number of systems out there.
Buckets, AFAIK are supported by:
AlienBrain
Accurev
PlasticSCM
I would save such files under a unique name (datestamped, perhaps), and perhaps additionally make a textual reference to the external file in the version control system.
Fossil allows you to do this via the "shun" mechanism. Fossil being a distributed SCM, however, means that this does not affect all repositories (for obvious reasons).

What is the best solution for maintaining backup and revision control on live websites?

As part of my job I work with several live websites. We need an efficient means of maintaining backups of the live folders over time. Additionally, updating these sites can be a pain, especially if a change happens to break in the live environment for whatever reason.
What would be ideal would be hassle-free source control. I implemented SVN for a while which was great as a semi-solution for backup as well as revision control (easy reversion of temporary or breaking changes) etc.
Unfortunately SVN places hidden .svn directories everywhere, which causes problems, especially when other developers make folder structure changes or copy/move website directories. I've heard the argument that this is a matter of education etc., but the approach taken by SVN is simply not a practical solution for us.
I am thinking that maybe an incremental backup solution may be better.
Other possibilities include:
SVK, which is command-line only, and that is a problem for us. Besides, I am unsure how appropriate it would be.
Mercurial, perhaps with some triggers to hide the distributed component which is not required in this case and would be unnecessarily complicated for other developers.
I experimented briefly with Mercurial but couldn't find a nice way to keep the repository separate and constantly in sync with the live folder working copy. Maybe as a source control solution (making the repository and the live folder the same place) combined with another backup solution this could be the way to go.
One downside of Mercurial is that it doesn't place empty folders under source control, which is problematic for websites that often have empty folders as placeholder locations for file uploads etc. (a common workaround is sketched after this list).
Rsync, which I haven't really investigated.
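Regarding the empty-folder limitation mentioned above under Mercurial, the usual workaround is a placeholder file (the name .hgkeep is just a convention):

    $ touch uploads/.hgkeep    # any tracked file makes hg recreate the directory
    $ hg add uploads/.hgkeep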
I'd really appreciate your advice on the best way to maintain backups of live websites, ideally with an easy means of retrieving past versions quickly.
Answer replies:
#Kibbee:
It's not so much about education as no familiarity with anything but VSS and a lack of time/effort to learn anything else.
The xcopy/7-zip approach sounds reasonable, I guess, but wouldn't it quickly take up a lot of room?
As far as source control, I think I'd like the source control to just say that "this is the state of the folder now, I'll deal with that and if I can't match stuff up that's your fault, I'll just start new histories" rather than fail hard.
#Steve M:
Yeah that's a nicer way of doing it but would require a significant cultural change. Having said that I very much like this approach.
#mk:
Nice, I didn't think about using rsync to deploy. Does it only upload the differences? Overwriting the entire live directory every time we make a change would be problematic due to site downtime.
I am still curious to see if there are any more traditional options.
You can still use SVN, but instead of doing a checkout on your live environment, do an export, that way no .svn directories will be created. The downside, of course, is that no code changes on your live environment can take place. This is a good thing.
As a general rule, code changes on production systems should never be allowed. The change should be made and tested in a development/test/UAT environment, then once confirmed as OK, you can tag that code in SVN with something like RELEASE-x-x-x. Then, on the live system, export the code with that tag.
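A sketch of that tag-then-export flow (the repository URL, tag name, and deploy path are examples):

    # Tag the tested revision:
    $ svn copy http://svn.example.com/repo/trunk \
          http://svn.example.com/repo/tags/RELEASE-1-2-0 \
          -m "Tag 1.2.0 for release"
    # Export it (no .svn directories) onto the live system:
    $ svn export http://svn.example.com/repo/tags/RELEASE-1-2-0 /var/www/site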
We use option 3. Rsync. I wrote a bash script to do this along with some extra checking, but here are the basics of what it does.
Make a tag for pushing to live.
Run svn export on that tag.
rsync to live.
So far it has been working out. We don't have to worry about user conflicts or have a separate user for running svn up on the production machine.
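The core of such a script might look like this (a sketch; host names, paths, and flags are illustrative - note that --delete removes live files that no longer exist in the export, so use it with care):

    $ svn export http://svn.example.com/repo/tags/RELEASE-1-2-0 /tmp/site
    # -a preserves permissions and times, -v is verbose;
    # rsync transfers only the differences:
    $ rsync -av --delete /tmp/site/ deploy@www1:/var/www/site/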
Any source control solution you pick is going to have problems if people are moving, deleting, or adding files and not telling the source control system about it. I'm not aware of any source control system that could solve this problem.
In the case where you just can't educate the people working on the project[1], you may have to fall back on daily snapshots. Something as simple as a batch file using xcopy to a network drive, possibly with 7-zip on the command line to compress the copy so it doesn't take up too much space, would probably be the simplest solution.
[1] I would highly disbelieve this; it's probably just a case of people being too stubborn and not willing to learn or do "extra work". Never mind how much time source control could save them when they have to go back to previous versions, or when two people have edited the same file.
rsync will only upload the differences. I haven't personally used it, but Mark Pilgrim wrote a long time ago about how it even handles binary diffs brilliantly.
svn+rsync sounds like a fantastic solution. I'll have to try that in the future.