How to use replication in combination with a version control system?

The situation is as follows:
Our company works across two main production sites that communicate via WAN. We develop software internally which uses about 100 GB of disk space on our servers (application data deployed to our customers, with a lot of images). In order to improve performance, our network administrators chose DFS replication (every 6 hours). This means that our users (people from within the company) do not have to wait (sometimes 2-3 hours) to download the needed files, because they are available locally (over LAN).
The problem is that the algorithm used by DFS replication is "Last Writer Wins". So, in case of simultaneous changes (during development/maintenance), the file with the latest date will win. I would like to avoid such data loss.
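To make the failure mode concrete, here is a minimal Python sketch of a "Last Writer Wins" rule. It only illustrates the rule as described above and is not how DFS replication is actually implemented; `FileVersion` and `last_writer_wins` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FileVersion:
    site: str        # which site produced this copy (illustrative)
    modified: float  # modification time, seconds since the epoch
    content: str


def last_writer_wins(a: FileVersion, b: FileVersion) -> FileVersion:
    """Keep the copy with the later timestamp; the other copy is discarded."""
    return a if a.modified >= b.modified else b


# Two sites edit the same file between two replication runs.
site_a = FileVersion("A", 1000.0, "fix from site A")
site_b = FileVersion("B", 1005.0, "fix from site B")

winner = last_writer_wins(site_a, site_b)
print(winner.content)  # only one edit survives; nothing is merged
```

Whichever edit happens to be saved last survives in full, and the other is dropped without any merge or warning.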
I am the project manager for the overall development process. What I want to do is introduce people to version control systems to tackle the simultaneous modification problem. I plan to use Mercurial for several reasons, mainly because it is distributed, simple to explain, usable for personal use, free, and (most importantly) has great merging capabilities. However, the benefits of the version control system when used locally (LAN) are lost because of the replication process (WAN), which doesn't know how to merge.
Some possible solutions are to:
use only version control over the WAN (hoping that compression will be enough to speed things up)
use only DFS, and track changes manually (error-prone)
find a work-around with both methods
The team is small (about 10 people). Your help and experience are appreciated.

If it were me, I'd have a "central" repository at each location, with the developers from each site working on a different branch. One of those should probably be chosen as the "main" branch (ideally the one that will be making the most changes), although in practice it won't really matter much.
Each team's repo should be synchronized regularly (e.g., daily, on your 6 hour schedule, or even more often) with the repo from the other location, to reflect changes made in that branch. Then they would be merged to the site's branch (ideally this would be done automatically as part of the same update, but the exact details of how that merge will happen may vary, depending on your VCS of choice and your branching model).
Remember: "sync early, sync often"
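As a rough illustration of the per-site synchronization described above, the following Python sketch builds the Mercurial commands one site might run on a schedule. It assumes named branches per site; the branch names and the URL are placeholders, and `sync_commands` is a hypothetical helper, not part of Mercurial.

```python
from typing import List


def sync_commands(local_branch: str, remote_branch: str, remote_url: str) -> List[List[str]]:
    """Return the hg commands that bring the other site's branch into ours."""
    return [
        ["hg", "pull", remote_url],         # fetch the other site's changesets
        ["hg", "update", local_branch],     # make sure we are on our own branch
        ["hg", "merge", remote_branch],     # merge their branch into ours
        ["hg", "commit", "-m", f"Merge {remote_branch} into {local_branch}"],
    ]


# Placeholder branch names and URL, for illustration only.
for cmd in sync_commands("site-a", "site-b", "https://hg.example.invalid/site-b"):
    print(" ".join(cmd))
```

A real script would run each command (for example with `subprocess.run`) and stop on the first failure, since the merge step fails when there is nothing to merge or when conflicts need manual resolution.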

Related

Version Control advice

We've decided on a version control system - using Mercurial clients and Bitbucket for repositories. But it's just occurred to me we have a problem I didn't consider.
We have an internal development LAMP server (Ubuntu) and all the developers work on websites stored on it, which means all developers share a single file source and we are all working from it. It's rare that two different developers will work on the same site at the same time, but it does happen occasionally. This means that two developers can easily overwrite each other's work if they are working on the same file at the same time.
So my question is: what is the best solution to this problem? Bear in mind that we like the convenience of a single internal server so that we can demo sites internally, and it also has a cron job running for backing up the files and databases.
I am guessing each developer would have to run their own LAMP (or WAMP) server on their individual workstation, commit, and push to the Bitbucket repository. And of course, whenever working on a different site, do a pull and resolve any differences as usual. This of course takes away the convenience of other team members (non-developers) being able to browse to 192.168.0.100 (the LAMP server IP address) and look at the progress of websites, not to mention that some clients can also access the same server externally (I've set up a port forward limited to their IP addresses) to see the progress of their websites too.
Any advice will be greatly appreciated.
Thanks in advance.
I think you have to seriously rethink your workflow, because LAMP-per-dev is only slightly better than editing sites in place.
I can't see a place for Bitbucket in serious corporate development; in-house resources are at least more manageable.
I can't see any reason not to use a staging Mercurial server (pseudo-central) together with the staging internal LAMP server (which you already have and use).
I can imagine at least two possible choices (a fast, dirty draft idea, not a ready-to-use solution); both are hook-based.
Solution 1: less manageable, faster to implement
Every developer has a hook in their own local repo which, after (each?) commit, exports the tip and copies the exported files to the related site space. Workflow: commit, then test the results on the internal site.
Advantages: easy and fast to implement.
Disadvantages: cannot prevent (due to the distributed nature) tested code being overwritten by code from another developer.
Solution 2: manageable deploy, harder to implement and manage
The LAMP server also becomes a Mercurial server, which hosts "central" clones of all site repos, updated only by pushes from the developers' local repos. Each repo on this server must get two hooks:
Hook 1: "before-push", which checks whether a push is allowed now or the site is "locked" by the previous developer.
Hook 2: "post-push", which exports and copies the received data and also performs the control function for hook 1: based on conditions (subject of discussion), it locks or unlocks pushes to the repo.
Workflow: commit, push, test results, tag the working copy with a special (moved) tag, commit the tag, push the unlocking changeset into the repo.
Advantages: manageable single-point testing.
Disadvantages: possible delays due to the push workflow and the blocking of pushes; the need to install, configure, and support an additional server; the complexity of the changegroup and pretxnchangegroup hooks.
Final notes and hints for solution 2: I think (not tested) that a special tag (with -f to move it across changesets) can be used as the unlock sign (a bookmark will not satisfy the "move by hand" condition). That is, the developer commits (and pushes) a non-tagged changeset while the tag (for example, "Passed") still marks some older changeset. When testing of the results on the staging server is done, the developer tags the working copy with that tag, commits the tag, and pushes to the central repo. The changegroup hook must detect the push of .hgtags and (in some way) allow future data pushes (control pushes must always be allowed).
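The locking rule for solution 2 could be sketched in Python roughly as follows. This is only the decision logic, written as plain functions with hypothetical names. It is not a working Mercurial hook, and it makes one simplifying assumption: any push that touches .hgtags counts as a control push.

```python
from typing import Iterable


def is_control_push(changed_files: Iterable[str]) -> bool:
    """Assumption: a push that touches .hgtags is the 'unlock' (control) push."""
    return ".hgtags" in set(changed_files)


def allow_push(locked: bool, changed_files: Iterable[str]) -> bool:
    """'Before-push' check: control pushes always pass, data pushes only when unlocked."""
    return is_control_push(changed_files) or not locked


def next_lock_state(changed_files: Iterable[str]) -> bool:
    """'Post-push' step: a data push locks the site, a control push unlocks it."""
    return not is_control_push(changed_files)


# A developer pushes code, which locks the site until it has been tested.
locked = next_lock_state(["index.php", "css/site.css"])
print(allow_push(locked, ["about.php"]))  # a second data push is rejected
print(allow_push(locked, [".hgtags"]))    # the tag push is accepted
```

In a real setup the first function pair would be called from a pretxnchangegroup hook (which can reject the push) and the last from a changegroup hook, with the lock state stored somewhere on the server; how to treat a push that mixes data and tag changes is left open here, as in the notes above.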
Yes, the better solution is probably to set each developer up with a local server. It may seem inconvenient to you because you're apparently used to sharing a server, but consider:
If you're really interested in using a single server as a demo server, it's probably better that people aren't actively developing on it at the same time. They could break stuff that way! And developers shouldn't have to worry about breaking stuff when they're developing. Developing often means experimenting.
Having each developer run their own server will give them the flexibility to, say, work disconnected. You've got a decentralized version control system (Mercurial), but your development process is highly centralized. Even if you don't want people to work remotely, realize that when your single server goes down now, everybody goes down.
Any time a developer commits and pushes those commits, you can automate deployment directly to your demo site. That way, you still have a quite up-to-date source on your demo server.
TL;DR: Keep the demo server, but let your devs work on their own servers.

Automatically triggering merge activity after remote on-site (custom) development?

In our office, the software we create is sent to our client's office along with an engineer and a laptop. They modify the code at the customer site, based on the customer requests, and deploy the exe.
When the engineer returns to the office, the changed/latest code is not updated to the server, thereby causing us all sorts of problems in the source code on the development boxes and laptops.
I tried to use a version control system like SVN, but sometimes the engineer forgets to commit the latest code to the SVN server. Is there an automatic way so that, when the laptop connects to the domain, the version control system checks for changes and either prompts the user to update the code on the server or updates the code on the server automatically?
I think that the key to this is to require the on-site engineers to use a VCS at the customer site, and to make it a condition of their continued employment that the code at the customer site is in fact reloaded into the VCS on return to the office. You could say that the engineers sent on-site need to be trained in their duties, and they should be held accountable for not doing the complete job - the job isn't finished until the paperwork is done (where 'paperwork' in this context includes updating the source repositories with the customer's custom adaptations of the software).
It seems to me that it might be better to use a DVCS such as Git or Mercurial rather than SVN in this context. However, you should be able to work with SVN if the laptop dispatched to the customer site has a suitable working copy created for the customization work.
That said, the question is "can we make this easier and more nearly automatic". In part, that might depend on your infrastructure - it also might depend on Windows capabilities about which I'm clueless. There might be a way to get a particular program to run when the laptop connects to a new domain. An alternative (Unix-ish) approach would be to use some regularly scheduled job that runs, say, every hour and looks to see whether it is on the home domain and whether there are changes that should be submitted to the main repository.
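A minimal sketch of what such a scheduled job could check, in Python. The function names and the domain are hypothetical; the first function only inspects the text that `svn status` prints, and the second compares a host name against an assumed home domain.

```python
# Codes that `svn status` shows in its first two columns for local changes
# (added, deleted, modified, replaced, conflicted, missing, unversioned, obstructed).
PENDING_CODES = set("ADMRC!?~")


def has_pending_changes(status_output: str) -> bool:
    """True if any line of `svn status` output reports a local change."""
    for line in status_output.splitlines():
        if line[:1] in PENDING_CODES or line[1:2] in PENDING_CODES:
            return True
    return False


def on_home_domain(fqdn: str, home_domain: str) -> bool:
    """True if the machine's fully qualified name lies inside the home domain."""
    fqdn = fqdn.lower().rstrip(".")
    home = home_domain.lower()
    return fqdn == home or fqdn.endswith("." + home)


# Illustrative inputs; a real job would obtain them from `svn status`
# (via subprocess) and from socket.getfqdn().
sample_status = "M       src/main.c\n?       notes.txt\n"
if on_home_domain("laptop7.corp.example.com", "corp.example.com") and has_pending_changes(sample_status):
    print("Uncommitted changes found: please commit them to the server.")
```

Prompting is probably safer than committing automatically, because an unattended commit cannot write a meaningful log message or resolve conflicts.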

Can I use "Online Backup" to back up my DVCS instead of pushing to an external repo?

I'm currently signed up with a third-party service that hosts my Mercurial repositories as a central hub to push my changes to, as a sort of backup.
Now I'm looking at a system to back up my laptop and am considering Mozy. I'm a lone developer and work on a laptop that is usually connected to the internet via Wi-Fi and is only really on when I'm working, so I feel something like Mozy is my best option.
My question is: if I'm the only developer, could I get away with just using local Mercurial repos and using Mozy to back everything up, rather than pushing to an external repo?
Many thanks
Matt
Disclaimer: My experience is with git rather than hg, but as I understand it the concepts apply equally to both systems.
An advantage of backing up to a remote repo is that if your local repo becomes corrupted (perhaps due to a problem with the underlying filesystem), that corruption does not get transferred over to the backup, unless the files in your working tree themselves are corrupted.
For example, it's possible for some of the objects in the repository, perhaps those which are rarely accessed because you don't change them, to become corrupted. It could be months before you use one of those files again, and so months before you notice (though I think doing a garbage collection run, e.g. git gc, will detect corruption).
So if you are backing up by pushing commits, you're creating an independent version of those objects, and using checksums (i.e. the commit hash) to verify the transfer of any new files. Whereas if you are backing up to a backup provider, you're duplicating the actual objects in the repo, in whatever state they are in, and duplicating any changes to those files, including corruption of them.
Usually backup providers will give you rollback (SpiderOak seems to be particularly good for this), but you'll still have to sift through a lot of versions to figure out when the corruption happened; also, with some providers, the rollback period is limited (especially for free accounts).
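To show why content checksums catch this kind of damage, here is a small Python sketch that computes the ID git gives a file's contents (a "blob") in a repository using git's traditional SHA-1 object format. The helper name is made up for the example; Mercurial also identifies content by hashes, but its exact scheme differs and is not shown here.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 object ID that git assigns to a blob with this content."""
    header = b"blob %d\0" % len(content)  # git hashes a short header plus the data
    return hashlib.sha1(header + content).hexdigest()


original = b"important source code\n"
corrupted = b"important source c0de\n"  # a single damaged byte

print(git_blob_id(original))
print(git_blob_id(corrupted))  # a different ID, so the damage is detectable
```

A file-level backup copies whatever bytes are on disk, damaged or not, whereas a receiving repository can recompute these IDs and refuse objects that do not match their names.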

Do Distributed Version Control Systems promote poor backup habits?

In a DVCS, each developer has an entire repository on their workstation, to which they can commit all their changes. Then they can merge their repo with someone else's, or clone it, or whatever (as I understand it, I'm not a DVCS user).
To me that flags a side effect: being more vulnerable to forgetting to back up. In a traditional centralised system, both you as a developer and the people in charge know that if you commit something, it's held on a central server which can have decent backup solutions in place.
But using a DVCS, it seems you only have to push your work to a server when you feel like sharing it. It's all very well you have the repo locally so you can work on your feature branch for a month without bothering anyone, but it means (I think) that checking in your code to the repo is not enough, you have to remember to do regular pushes to a backed-up server.
It also means, doesn't it, that a team lead can't see all those nice SVN commit emails to keep a rough idea what's going on in the code-base?
Is any of this a real issue?
I can understand your concern about devs forgetting backups once their local diff is gone (because they've committed locally) and stops nagging them with copious output. I think the solution can lie in better tools, moar tools! You could set up a cron job on each dev's box that pushes every last reachable object in their repository to the central repo, and labels them in the central (backed-up) repo with namespaced branches. I think "git push" can do this, given the correct refspec. That way, the one thing you aren't doing is affecting the state of your public branches.
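One possible shape for that refspec, sketched in Python as a function that builds the command a cron job could run. The `backup/<developer>/` namespace and the helper name are assumptions made for the example, and this version covers local branches only, not every reachable object.

```python
from typing import List


def backup_push_command(remote: str, developer: str) -> List[str]:
    """Build a git push that mirrors all local branches under backup/<developer>/."""
    # The leading '+' forces the update, so rebased branches are still backed up.
    refspec = f"+refs/heads/*:refs/heads/backup/{developer}/*"
    return ["git", "push", remote, refspec]


print(" ".join(backup_push_command("origin", "alice")))
```

Because everything lands under a per-developer prefix on the central repo, the shared branches there are left untouched.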
But do you really need as aggressive a backup process as before, when the repo existed only in one place? With a DVCS, you need a far higher category of catastrophes to lose all your code. You now need an asteroid or a bomb hitting your office (and all your off-site team members), instead of just a hard disk or RAID controller going bad. Note, I'm not advocating sloppiness; I'm advocating equal risk at lower cost.
I don't think anything is automatic here. A distributed or a centralized VCS can each be combined with backup (or not). It's all a question of discipline in the team.
The same goes for commit emails. If the team has the discipline to regularly push changes to the right repositories, you can have a working commit mailing list too.
Bad habits can also grow in a team with a centralized VCS. You always have to fight bad habits.
In most places I imagine that there is probably still a 'central' repository from where builds are made and put to test. If you want your code in the build, it's got to be pushed centrally.
This is also a management issue - tell your team - push regularly (at least daily) so that your code is backed up. If it's not being done, then get out the big stick.
I'd also note, that if you're relying on looking at the commits to see what your staff are doing, you probably have some larger issues that you might look at addressing...
Having a local copy of the repository might encourage poor backup habits, if one were slack. However, your master repository SHOULD be backed up.
The "local copy of the entire repository" has a much more important use than being a backup. It reduces the latency of examining the history of the codebase - say, diffing against the latest version - from being a network round trip to a trip to your local hard drive.
That doesn't sound all that big a deal if your main repository's on your gigabit LAN. If you're a telecommuter, and the repository's a good 600+ ms away over a VPN, it makes a world of difference.
I've never looked into it, but I'm sure both Mercurial and Git support post-commit hooks, allowing you to set up commit mails going to the team lead. Then each developer could set up her repository accordingly, or have an interim repository that permits half-baked features with the commit mails, or whatever.
Edit: Regarding John's comment about a long-running experiment being lost because it wasn't ready to commit to the master repo: work in a separate branch and regularly push your changes to the master. You still get all the benefits of working against a local repository (mainly, for me, very low latency), and still not annoy your colleagues with your half-baked feature... and you can still store your changes off your machine, in a place where your admin can properly back up the repository.

How is AccuRev Performance?

How is performance in the current version (4.7) of AccuRev?
Time to check out per 100 MB, per GB?
Time to commit per number of files or per MB?
Responsiveness of the GUI with 100+ streams?
I just had a demo of AccuRev, and the streams look like a lightweight way to model workflow around code/projects. I've heard people praising AccuRev for the streams back end and complaining about performance. AccuRev appears to have worked on the performance, but I'd like to get some real-world data to make sure it isn't a case of demos-well-runs-less-well.
Does anyone have AccuRev performance anecdotes or (even better) data from testing?
I don't have any numbers but I can tell you where we have noticed performance issues.
Our builds typically use 30-40K files from source control. In my workspace currently there are over 66K files including build intermediate and output files, over 15GB in size. To keep AccuRev working responsively we aggressively use the ignore elements so AccuRev ignores any intermediate files such as *.obj. In addition we use the time stamp optimization. In general running an update is quick, but the project sizes are typically 5-10 people so normally only a couple of dozen files come down if you update daily. Even if someone made changes that touched lots of files speed is not an issue. On the other hand a full populate of all 30K+ files is slow. I don't have a time since I seldom do this and on the rare occasion I do, I run the populate when I'm going to lunch or a meeting. I expect it could be as much as 10 minutes. In general source files come down very quickly, but we have some large binary files, 10-20MB, that take a couple of seconds each.
If the exclude rules and ignore elements are not correctly configured, AccuRev can take a couple of minutes to run an update for workspaces of this size. When I hear of other developers complaining about the speed, I know something is misconfigured and we get it straightened out.
A year or so ago one of the projects updated Boost with 25K+ files and also added Firefox to the repository (I forget the size, but it made Boost look small). They also added ICU, wrote a lot of software and modified countless files. In all, I recall there were approximately 250K+ files sitting in a stream. I unfortunately decided that all their good code should be promoted to the root so all projects could share it. This turned out to be a little beyond what AccuRev could handle well. It was a multi-hour process getting all the changes promoted. As I recall, once Firefox was promoted the rest went smoothly; perhaps a single transaction with over 100K files was the issue?
I recently updated Boost and so had to keep and promote 25K+ files. It took a minute or two, which is not unreasonable considering the number of files and the size of the binaries.
As for the number of streams, we have over 800 streams and workspaces. Performance here is not an issue. In general I find the large number of streams hard to navigate, so I run a filtered view of just my workspaces and just the streams I'm interested in. However, when I need to look at the unfiltered list to find something, performance is fine.
As a final note, AccuRev support is terrific; we call them the voice in the sky. Every now and again we shoot ourselves in the foot using AccuRev and wind up clueless about how to fix things. Almost always we did something dumb and then tried something dumber to fix it. Eventually we place a support request, and the next thing we know they are walking us through the steps to righteousness, either on the phone or in a GoToMeeting session. I've even contacted them for trivial things that I just didn't have time to figure out on a hectic day, and they kindly walked me through it rather than telling me to RTFM.
Edit 2014: We can now get acceptable X-Windows performance by using the commercial version of RealVNC.
Original comment: This answer applies to any version of AccuRev, not just 4.7.
Firstly, GUI performance might be OK if you can use the web client. If you can't use the web client and you want GUI performance, then you'd better be using Windows, or have all your developers in one place, i.e. where the AccuRev server is located. Trying to run the GUI on X-Windows over a WAN? Forget it: our experience has been dozens of seconds or minutes for basic point-and-click operations. This is over a fairly good WAN about 800 miles distant, with an almost optimal ping time. This is not a failing of AccuRev but of X-Windows, and you'll likely have similar problems with other X applications over a WAN. So avoid basic X if you possibly can. Currently we cannot, and our WAN users are forcibly relegated to command-line only.
The basic problem is that AccuRev is centralized and you can't increase the speed of light. I believe you can get around WAN latency by running AccuRev Replication Servers, but that still does not properly address the problem if you have remote developers at single-person offices over VPN. It is ironic that the replication servers somewhat turn this centralized VCS into a form of DVCS.
If you don't have replication servers, then a horrible but somewhat workable work-around is to use a delta-synchronization tool such as rsync to sync your source tree between your local machine, where you can run the GUI (i.e. the GUI running directly on your Windows or Linux laptop), and the machine where you're actually working (e.g. a UNIX machine 1,000 miles away). Another option is to use something like VNC, which works better over a WAN than X, connecting to a virtual desktop at the AccuRev server's location, and use X from there.
At my workplace more than one team has resorted to using Mercurial on the side and promoting to AccuRev only when it's strictly necessary. As Stephen Nutt points out above, other necessary work is to use time-stamp optimization and ignores.
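For the rsync work-around described above, a call could be assembled along these lines in Python. The host and paths are placeholders and `rsync_command` is a hypothetical helper; it only builds the argument list and does not run anything.

```python
from typing import List


def rsync_command(local_dir: str, remote_host: str, remote_dir: str,
                  delete: bool = False) -> List[str]:
    """Build an rsync call that copies local_dir's contents to the remote tree."""
    # Trailing slashes make rsync copy the directory's contents, not the directory itself.
    src = local_dir.rstrip("/") + "/"
    dest = f"{remote_host}:{remote_dir.rstrip('/')}/"
    cmd = ["rsync", "-az"]        # -a: preserve metadata, -z: compress over the WAN
    if delete:
        cmd.append("--delete")    # also remove remote files that no longer exist locally
    return cmd + [src, dest]


# Placeholder host and paths, for illustration only.
print(" ".join(rsync_command("src", "build1.example.invalid", "/home/me/src")))
```

Since rsync transfers only the differences, repeated runs over a slow link stay cheap; `--delete` is off by default in this sketch because it can remove work that exists only on the remote side.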
We also have our AccuRev admins (yes, it requires you to employ people to babysit it) complain when we need to include large numbers of files, despite the fact that they form a core part of our product and MUST be included and version controlled. Draw your own conclusions.