Starting to version an already medium-sized project

I am about to start participating in the development of a medium-sized project (~50k lines) that has until now been written by a single person and never versioned; as a result, its folders are cluttered with different versions of the same file (named file1, file2, file3, and so on).
I proposed to start using a VCS for it (most likely Mercurial, which is the only one I've ever used, for my personal projects, but I'm open to suggestions), so I'm taking any good ideas as to how to "start" the repository. E.g., should I make an initial commit with all the existing files, and immediately make a new commit with the unused files removed? Or something else?
(Constructive remarks on Mercurial vs. Bazaar vs. Git vs. whatever are also welcome.)
Thanks for your tips.

E.g., should I make an initial commit with all the existing files, and immediately make a new commit with the unused files removed?
If the size of the repository is not a concern, then yes, that is a good starting point. Otherwise you can just commit what's actually used, and go from there.
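A minimal sketch of that approach with Mercurial (the removed file names are placeholders for whatever superseded copies you find):

cd project
hg init
hg addremove                          # schedule every existing file for the first commit
hg commit -m "Initial import, warts and all"
hg remove file1.c file2.c             # placeholders for the stale copies
hg commit -m "Remove unused file versions"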
As for which system, all DVCSes stick to the same core principles. Which one you pick is entirely subjective — the only way to truly know which one you like is to try each one.

I would say use whatever you are most comfortable with and that meets your needs. As for where to start, I personally would seed the repo with the current source as-is; that way you can verify that everything builds and runs as expected. You can make this initial seed a branch, so you can always go back to your starting point before refactoring.

My approach to this was:
create a Mercurial repository in the existing project folder ("existing")
commit all project files to "existing"
create an empty repository in a different location ("new")
As files are tested and QA'd (this was necessary because there was so much dross in "existing"), pull them from "existing" to "new".
Once files have been pulled into "new", delete the corresponding files from "existing". If access is needed to these files while the migration is under way, push them back from "new" to "existing".
This gave me the advantage of putting everything under some sort of control for recovery purposes, plus control over how the project was introduced to the DVCS. Eventually everything in the existing project folder was tested and approved for the project moving forward. At that point the "existing" directory could be deleted or turned into a working folder, and "new" became the actual project folder.
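Mercurial's pull moves whole changesets rather than individual files, so in practice the per-file migration looks more like copy-and-commit. A rough sketch, assuming "existing" and "new" are sibling folders and somefile.c is a placeholder:

cd existing
hg init
hg addremove
hg commit -m "Everything, as found"
hg init ../new
cp somefile.c ../new/                 # copy a file over once it has passed QA
(cd ../new && hg add somefile.c && hg commit -m "Import QA'd file")
hg remove somefile.c
hg commit -m "Migrated somefile.c to the new repository"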

I think Mercurial is a good choice. Lightweight, fast, very simple to use and well-integrated with Windows (if that's the platform you're dealing with).
I would probably get rid of all the clutter before the first commit. Delete everything you don't care about, run all the necessary tests and only then do the commit.
Yes, I'm dead set against the 0-day cluttering of repos.
Granted, a 50K SLOC project isn't very big, but if you commit files you already know you won't need, they will make your repo slightly bigger.
Also, remember to check that the tree doesn't contain large binary files. If it does, get rid of them if at all possible.
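On a Unix-like system, a quick way to spot them before the first commit (a sketch; adjust the size threshold to taste):

find . -type f -size +1M -exec ls -lh {} \;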


Project files under version control?

I work on a large project where all the source files are stored in version control except the project files. This was the lead developer's decision. His reasoning was:
It's too time-consuming to reconcile the differences among developers' working directories.
It allows developers to work independently until their changes are stable
Instead, a developer initially gets a copy of a fellow developer's project files. Then, when new files are added, each developer notifies all the rest about the change. This strikes me as far more time-consuming in the long run.
In my opinion, the supposed benefits of not tracking changes to the project files are outweighed by the danger. In addition to references to its needed source files, each project file has configuration settings that would be very time-consuming and error-prone to reproduce if it became corrupted or there was a hardware failure. Some of them have source code embedded in them that would be nearly impossible to recover.
I tried to convince the lead that both of his reasons can be accomplished by:
Agreeing on a standard folder structure
Using relative paths in the project files
Using the version control system more effectively
But so far he's unwilling to heed my suggestions. I checked the svn log and discovered that each major version's history begins with an Add. I have a feeling he doesn't know how to use the branching feature at all.
Am I worrying about nothing or are my concerns valid?
Your concerns are valid. There's no good reason to exclude project files from the repository. They should absolutely be under version control. You'll need to standardize on a directory structure for automated builds as well, so your lead is just postponing the inevitable.
Here are some reasons to check project (*.*proj) files into version control:
Avoid unnecessary build breaks. Relying on individual developers to notify the rest of the team every time they add, remove, or rename a source file is not a sustainable practice. There will be mistakes, and you will end up with broken builds and your team wasting valuable time trying to determine why the build broke.
Maintain an authoritative source configuration. If there are no project files in the repository, you don't have enough information there to reliably build the solution. Is your team planning to deliver a build from one of your developer's machines? If so, which one? The whole point of having a source control repository is to maintain an authoritative source configuration from which you build and deliver releases.
Simplify management of your projects. Having each team member independently updating their individual copies of your various project files gets more complicated when you introduce project types that not everyone is familiar with. What happens if you need to introduce a WiX project to generate an MSI package or a Database project?
I'd also argue that the two points made in defense of this strategy of not checking in project files are easily refuted. Let's take a look at each:
It's too time-consuming to reconcile the differences among developers' working directories.
Source configurations should always be set up with relative paths. If you have hard-coded paths in your source configuration (project files, resource files, etc.), then you're doing it wrong. Choosing to ignore the problem is not going to make it go away.
It allows developers to work independently until their changes are stable
No, using version control lets developers work in isolation until their changes are stable. If you each continue to maintain your own separate copies of the project files, as soon as someone checks in a change that references a class in a new source file, you've broken everyone on the team until they stop what they're doing and carefully update their project files. Compare that experience with just "getting latest" from source control.
Generally, a project checked out of SVN should be working, or there should be tools included to make it work (e.g. autogen.sh). If the project file is missing or you need knowledge about which files should be in the project, there is something missing.
Automatically generated files should not be in SVN, as it is pointless to track the changes to these.
Project files with relative path belong under source control.
Files that don't: for example, in .NET I would not put the .suo (user options), web.config, or app.config files under source control. You may have developers using different connection strings, etc.
In the case of web.config, I like to put a web.config.example in. That way you copy the file to web.config upon initial checkout and tweak what settings you'd like. If you add something that needs to be added to all web.config, you merge those lines into the .example version and notify the team to merge that into their local version.
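With Subversion, that pattern looks roughly like this (a sketch; adapt the file names and ignore mechanism to your VCS):

svn add web.config.example
svn propset svn:ignore "web.config" .        # keep the real file out of version control
svn commit -m "Track the template, ignore the local copy"
cp web.config.example web.config             # each developer copies and tweaks locally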
I think it depends on the IDE and configuration of the project. Some IDEs have hard-coded absolute paths and that's a real problem with multiple developers working on the same code with different local copies and configurations. Avoid absolute path references to libraries, for example, if you can.
In Eclipse (and Java), it's fine to commit .project and .classpath files (so long as the classpath doesn't have absolute references). However, you may find that using tools like Maven can help having some independence from the IDE and individual settings (in which case you wouldn't need to commit .project, .settings and .classpath in Eclipse since m2eclipse would re-create them for you automatically). This might not apply as well to other languages/environments.
In addition, if I need to reference something really specific to my machine (either configuration or file location), I tend to have my own local branch in Git which I rebase when necessary, committing only the common parts to the remote repository. Git diff/rebase works well: it can usually work out the diffs even when the local changes affect files that have been modified remotely, except when those changes conflict, in which case you get the opportunity to merge them manually.
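Sketched out, that workflow is roughly the following (the branch name local-config is arbitrary):

git checkout -b local-config        # branch holding machine-specific tweaks
git commit -am "Point config at my local machine"
git checkout master
git pull origin master              # bring in the team's changes
git checkout local-config
git rebase master                   # replay local tweaks on top; merge manually on conflict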
That's just absurd. With a setup like that, I can have a perfectly working project containing files that are subtly different from everyone else's. Imagine the havoc this would cause if someone accidentally propagated this mess into QA and everyone was trying to figure out what's going on. Imagine the catastrophe that would ensue if it ever got released to the production environment...!

Is the Mercurial .hgignore my only option for handling hundreds of temp files generated when compiling?

I've been all over google and SO looking for someone who has asked this question, but am coming up completely empty. I'll apologize in advance for the lengthy round-about way of asking the question. (If I was able to figure out how to encapsulate the problem, maybe I would have been successful in finding an answer.)
How are large projects managed in Mercurial, when the act of building / compiling generates hundreds of temporary files in order to create the end result? Is .hgignore the only answer?
Example Scenario:
You have a project that wants to use some open source package for some feature, and needs to compile from source. So you go get the package, un-tgz it, and then slap it into its own Mercurial repository so you can start tracking changes. Then you make all your changes, and run a build.
You test your end result, are happy with the results, and are ready to commit back to your local clone of the repository. So you do an hg status to check your changes prior to committing. The hg status results cause you to immediately start using all those words that would make your mother ashamed — because you now have screens and screens of "build cruft".
For the sake of argument say this package is MySQL or Apache: something that
you don't control and will be changing regularly,
leaves a whole lot of cruft behind in a whole lot of places, and
there is no guarantee the cruft won't change each time you get a new version from the external source.
Now what? The particular project causing this angst is going to be worked on by multiple developers in multiple physical locations, and so needs to be as straightforward as possible. If there is too much involved, they're not going to do it, and we'll have a bigger problem on our hands. (Sadly, some old dogs are not keen on learning new tricks...)
One proposed solution was that they would just have to commit everything locally before doing a make, so they have a "clean slate" they would then have to clone from to actually do the build in. That got shot down as (a) too many steps, and (b) not wanting to cruft up the history with a bunch of "time to build now" changesets.
Someone else has proposed that all the cruft just be committed into the Mercurial repository. I am strongly against that because then the next time around those files will turn up as "modified" and therefore be included in the changeset's file list.
We can't possibly be the only people who have run into this problem. So what is the "right" solution? Is our only recourse to try to create a massively intelligent .hgignore file? This makes me uneasy, because if I tell Mercurial to "ignore everything in this directory I haven't already told you about", then what happens if the next applied patch adds files into that ignored directory? (Mercurial will never see that new file, right?)
Hopefully this is not a completely stupid question with an obvious answer. I've compiled things from source many times before, but have never needed to apply version control on top of that. Plus we're new to Mercurial.
Two options:
The best option is to do an out-of-tree build, if you can (sketched after these options). This is a build where you place the object files outside of the source tree. Some build systems, such as CMake, support this directly. For other systems, you need to be lucky, since the upstream project must have added support for this in their Makefile or similar.
A more general option is to tell Mercurial to ignore specific types of files, not entire directories. This works well in my experience.
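For the first option, an out-of-tree build with CMake looks roughly like this (assuming the project ships a CMakeLists.txt):

mkdir build && cd build
cmake ..        # object files and generated Makefiles land here, not in the source tree
make

A single build/ line in .hgignore (or a build directory outside the repository entirely) then covers all of the cruft.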
To test the second option, I wanted to compile Apache. However, it requires APR, so I tested with that instead. After checking in a clean apr-1.3.8.tar.bz2 I did ./configure; make and looked at the output of hg status. The first few patterns were easy:
syntax: glob
*~
*.o
*.lo
*.la
*.so
.libs/*
The remaining new files look like they are specific files generated by the build process. It's easy to add them too:
% hg status --unknown --no-status >> .hgignore
That also added .hgignore since I hadn't yet scheduled it for addition. Removing that, I ended up with this .hgignore file:
syntax: glob
*~
*.o
*.lo
*.la
*.so
.libs/*
.make.dirs
Makefile
apr-1-config
apr-config.out
apr.exp
apr.pc
build/apr_rules.mk
build/apr_rules.out
build/pkg/pkginfo
config.log
config.nice
config.status
export_vars.c
exports.c
include/apr.h
include/arch/unix/apr_private.h
libtool
test/Makefile
test/internal/Makefile
I consider this a quite robust way to go about this in Mercurial or any other revision control system for that matter.
The best solution would be to fix the build process so that it behaves in a 'nice' manner, namely by allowing you to specify a separate directory to store intermediate files in (which could then be completely ignored via a very simple .hgignore entry, or kept outside the version-controlled directory structure altogether).
For what it's worth, I've found that in this situation a smart .hgignore is the only solution that has worked for me so far. With the inclusion of regular expression support, it's very powerful, but tricky, too, since a pattern that is cruft in one directory may well be source in another.
At least you can check in the .hgignore and share it with your developers. That way the work is only done once.
[Edit] At least, however, it's possible -- as noted above by Martin Geisler -- to have full path specifications in your .hgignore file; you can, therefore, have test/Makefile in the .hgignore and still have Mercurial notice a new test2/Makefile.
His process for creating the file should give you almost what you want, and you can tune it from there.
One option you have is to clean your working directory after verifying a build.
make clean
hg status
Of course you may not want to clean your project if it takes more than a few minutes to build.
If the files you want to track are already known to hg, you can hgignore everything. Then you need to use hg import to apply patches, rather than the plain patch command (since hg needs to be aware of any new files that should be tracked).
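For example (the patch name is a placeholder):

hg import ../new-feature.patch    # applies and commits; files the patch creates become tracked

whereas a plain patch -p1 < ../new-feature.patch would leave any newly created files untracked.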
How about a shell (or whatever) script that walks your build directory recursively, finds every file created after your build process started running, and moves all these files (you can specify exceptions, of course) into a cruft_dir subdirectory? Then you can just put cruft_dir/* in .hgignore. (A rough sketch follows below.)
EDIT: I forgot to add, but this is fairly obvious, that this shell script runs automatically as soon as your build finishes. Maybe it's even called as the last command in your Makefile/ant/whatever file.
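Something along these lines, assuming a find(1) that supports -newer; it skips the .hg directory, and you'd add more ! -path exceptions for build outputs you want to keep (note it also flattens paths, so same-named cruft files would collide):

touch .build-stamp                  # run just before the build
make
mkdir -p cruft_dir
find . -type f -newer .build-stamp \
    ! -path './cruft_dir/*' ! -path './.hg/*' \
    -exec mv {} cruft_dir/ \;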

SVN Branching in Eclipse (Conceptual)

I understand the basic concept of a branch and merge. All of the explanations I've found talk about branching your entire trunk to create a branch project and working on it and then merging it back. Is it possible to branch a subset of a project?
I think an example will help me explain best what I want to do. Suppose I have an application with eleven files, file0 through file10. All the files are interdependent, and to be able to test any one file all the others need to be included in the build. I want to work on file0 but don't need to make changes to file1 through file10. Can I branch file0 so changes committed to file0 will update something like myrepos/branches/a-branch/file0 but all the other files in my working copy will simply be from the trunk?
The reason I want to do this is that I'm working on a huge J2EE application with tens of thousands of files, and it seems like branching the entire thing will take a really long time. Also, I'm using Eclipse with Subclipse (and I could be wrong about this), but it seems like if I branch a project in Eclipse then I will have to set up a new Eclipse project to point to the branch. Unfortunately, importing this particular project from SVN into Eclipse takes several hours due to the size of the application. It isn't realistic for me to spend that much time.
I suppose that I could have the concepts wrong. Perhaps branching an entire project doesn't require a new working copy at all?
Thanks for any light shed on this issue.
Branching an entire (even) very large tree in Subversion is a very cheap operation, which does lazy (O(1) time) file copying.
You don't necessarily have to change your entire working copy to work on just one changed file. You can use svn switch to switch one file or one directory in your working copy to be a checked out version of the file on the branch.
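For example, from a working copy of trunk (the repository URL is illustrative):

svn switch http://svn.example.com/repo/branches/a-branch/file0 file0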
In Subversion, making a branch is simply making a copy of a hierarchy of directories. Therefore, you can branch a subset, but only if that subset can be defined by a hierarchy of directories.
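For instance, branching just one subtree is a single cheap copy (URLs illustrative):

svn copy http://svn.example.com/repo/trunk/some/subdir \
         http://svn.example.com/repo/branches/a-branch \
         -m "Branch just this subtree"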
Can I branch file0 so changes committed to file0 will update something like myrepos/branches/a-branch/file0 but all the other files in my working copy will simply be from the trunk?
To answer this question: No, you can't branch a single file. However, what I think you want to do instead is to make a branch and work on file0 there. As you make changes to trunk files, you simply merge them into your branch where you're working on file0.
In this way, you'll always have the latest information from trunk, which will let you test the file0 changes independently of trunk. Then you can use svn switch to move your "file lens" between the trunk and the branch (but beware, Eclipse may complain about such shenanigans).
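A sketch of that rhythm, assuming Subversion 1.5+ merge tracking (URLs illustrative; run the merge from a working copy of the branch):

svn merge http://svn.example.com/repo/trunk .
svn commit -m "Sync branch with trunk"
svn switch http://svn.example.com/repo/trunk/file0 file0    # point file0 back at trunk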
SVN branching is based on a lazy-copy mechanism, so you can safely branch your whole project: that would not take long.
As mentioned in the question "How do I branch an individual file in SVN?", you could branch a subset, but I believe this would be dangerous with the svn:mergeinfo properties mechanism: it works better if that property is set at the root of the project.
Branching in SVN is an O(1) operation. Also, as SVN internally employs lazy copying, you only pay a space penalty for what you change.
So if you are unsure, why not go ahead and branch the whole project?
(As quark mentioned, one problem with branching big projects is that, if you check out several branches/the trunk in parallel, this might take a lot of local disk space.)

TFS - Branching for experimental development: Solution fails to load

Disclaimer: I'm stuck on TFS and I hate it.
My source control structure looks like this:
/dev
/releases
/branches
    /experimental-upgrade
I branched from dev to experimental-upgrade and didn't touch it. I then did some more work in dev and merged to experimental-upgrade. Somehow TFS complained that I had changes in both source and target and I had to resolve them. I chose to "Copy item from source branch" for all 5 items.
I check out the experimental-upgrade to a local folder and try to open the main solution file in there. TFS prompts me:
"Projects have recently been added to this solution. Would you like to get them from source control?
If I say yes it does some stuff but ultimately comes back failing to load a handful of the projects. If I say no I get the same result.
Comparing my sln in both branches tells me that they are equal.
Can anyone let me know what I'm doing wrong? This should be a straightforward branch/merge operation...
TIA.
UPDATE:
I noticed that if I click "yes" on the above dialog, the projects are downloaded to the $/ root of source control... (i.e. out of the dev & branches folders)
If I open up the solution in the branch, remove the dead projects, and try to re-add them (by right-clicking the sln, Add Existing Project, and choosing the project located in the branch folder), it gives me the error:
Cannot load the project c:\sandbox\my_solution\proj1\proj1.csproj, the file has been removed or deleted. The project path I was trying to add is this: c:\sandbox\my_solution\branches\experimental-upgrade\proj1\proj1.csproj
What in the world is pointing these projects outside of their local root? The solution file is identical to the one in the dev branch, and those projects load just fine. I also looked at the .vspscc and .vssscc files but didn't find anything.
Ideas?
#Ben
You can actually do a full delete in TFS, but it is strongly discouraged unless you know what you are doing. You have to do it from the command line with the tf destroy command:
tf destroy [/keephistory] itemspec1 [;versionspec]
[itemspec2...itemspecN] [/stopat:versionspec] [/preview]
[/startcleanup] [/noprompt]
Versionspec:
    Date/Time          Dmm/dd/yyyy (or any .NET Framework-supported format, or any of the local machine's date formats)
    Changeset number   Cnnnnnn
    Label              Llabelname
    Latest version     T
    Workspace          Wworkspacename;workspaceowner
Just before you do this, make sure you try it out with /preview. Also, everybody has their own methodology for branching; mine is to branch releases and do all development in the development or root folder. Finally, it sounded like branching worked fine for you and just the solution file was screwed up, which may be because of a binding issue with the .vssscc file.
#Nick: No changes have been made to this just yet. I may have to delete it and re-branch (however you really can't fully delete in TFS)
And I have to disagree... branching is absolutely a good practice for experimental changes. Shelving is just temporary storage that will get backed up if I don't want to check in yet. But this needs to be developed while we develop real features.
Without knowing more about your solution setup I can't be sure. But, if you have any project references that could explain it. Because you have the "experimental-upgrade" subfolder under "branches" your relative paths have changed.
This means when VS used to look for your referenced projects in ..\..\project\whatever it now has to look in ..\..\..\project\whatever. Note the extra ..\
To fix this you have to re-add your project references. I haven't found a better way. You can either remove them and re-add them, or go to the properties window and change the path to them, then reload them. Either way, you'll have to redo your references to them from any projects.
Also, check your working folders to make sure that it didn't download any of your projects into the wrong folders. This can happen sometimes...
A couple of things: are the folder structures the same? Can you delete and re-add the project references successfully?
If you create a solution and then manually add all of the projects, does that work? (That may not be feasible - we have solutions with over a hundred projects.)
One other thing (and it may be silly) - after you did the branch, did you commit it? I'm wondering if you branched and didn't check it in, and then merged, and then when you tried to check-in then, TFS was mighty confused.
#Kevin:
This means when VS used to look for your referenced projects in ..\..\project\whatever it now has to look in ..\..\..\project\whatever. Note the extra ..\
You may be on to something here, however it doesn't explain why some projects load and others do not. I haven't found a correlation between them yet.
I think I'll try to re-add the projects and see if that works.
#Cory:
I think that's what I'm going to try... I have about 20 projects and 8 or so aren't loading. The folder structures are identical from the root, i.e. there aren't any references outside of DEV.

The theory (and terminology) behind Source Control

I've tried using source control for a couple of projects but still don't really understand it. For these projects, we've used TortoiseSVN and have only had one line of revisions. (No trunk, branches, or any of that.) If there are recommended ways to set up source control systems, what are they? What are the reasons and benefits for setting it up that way? What are the underlying differences between the workings of centralized and distributed source control systems?
Think of source control as a giant "Undo" button for your source code. Every time you check in, you're adding a point to which you can roll back. Even if you don't use branching/merging, this feature alone can be very valuable.
Additionally, by having one 'authoritative' version of the source control, it becomes much easier to back up.
Centralized vs. distributed... the difference is really that in distributed, there isn't necessarily one 'authoritative' version of the source control, although in practice people usually still do have the master tree.
The big advantage to distributed source control is two-fold:
When you use distributed source control, you have the whole source tree on your local machine. You can commit, create branches, and work pretty much as though you were all alone, and then when you're ready to push up your changes, you can promote them from your machine to the master copy. If you're working "offline" a lot, this can be a huge benefit (see the sketch after this list).
You don't have to ask anybody's permission to become a distributor of the source control. If person A is running the project, but person B and C want to make changes, and share those changes with each other, it becomes much easier with distributed source control.
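With Mercurial, for instance, that offline workflow looks like this (URL illustrative):

hg clone http://hg.example.com/project      # the full history lands on your machine
hg commit -m "Work done on the train"       # commits are local and instant
hg push                                     # later, promote your changes to the master copy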
I recommend checking out the following from Eric Sink:
http://www.ericsink.com/scm/source_control.html
Having some sort of revision control system in place is probably the most important tool a programmer has for reviewing code changes and understanding who did what to whom. Even for single-person projects, it is invaluable to be able to diff current code against a previous known working version to understand what might have gone wrong due to a change.
Here are two articles that are very helpful for understanding the basics. Beyond being informative, Sink's company sells a great source control product called Vault that is free for single users (I am not affiliated in any way with that company).
http://www.ericsink.com/scm/source_control.html
http://betterexplained.com/articles/a-visual-guide-to-version-control/
Vault info is at www.sourcegear.com.
Even if you don't branch, you may find it useful to use tags to mark releases.
Imagine that you rolled out a new version of your software yesterday and have started making major changes for the next version. A user calls you to report a serious bug in yesterday's release. You can't just fix it and copy over the changes from your development trunk, because the changes you've just made have left the whole thing unstable.
If you had tagged the release, you could check out a working copy of it and use it to fix the bug.
Then, you might choose to create a branch at the tag and check the bug fix into it. That way, you can fix more bugs on that release while you continue to upgrade the trunk. You can also merge those fixes into the trunk so that they'll be present in the next release.
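In Subversion, that whole dance is a handful of cheap copies (URLs and the revision number are illustrative):

svn copy http://svn.example.com/repo/trunk \
         http://svn.example.com/repo/tags/release-1.0 -m "Tag the release"
svn copy http://svn.example.com/repo/tags/release-1.0 \
         http://svn.example.com/repo/branches/release-1.0-fixes -m "Branch for bug fixes"
# fix the bug on the branch (say it lands as r1234), then from a trunk working copy:
svn merge -c 1234 http://svn.example.com/repo/branches/release-1.0-fixes .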
The common standard for setting up Subversion is to have three folders under the root of your repository: trunk, branches and tags. The trunk folder holds your current "main" line of development. For many shops and situations, this is all they ever use... just a single working repository of code.
The tags folder takes it one step further and allows you to "checkpoint" your code at certain points in time. For example, when you release a new build or sometimes even when you simply make a new build, you "tag" a copy into this folder. This just allows you to know exactly what your code looked like at that point in time.
The branches folder holds different kinds of branches that you might need in special situations. Sometimes a branch is a place to work on an experimental feature or features that might take a long time to get stable (and that you therefore don't want to introduce into your main line just yet). Other times, a branch might represent the "production" copy of your code, which can be edited and deployed independently of your main line of code, which contains changes intended for a future release.
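Creating that layout in a fresh repository is a single command (URL illustrative):

svn mkdir http://svn.example.com/repo/trunk \
          http://svn.example.com/repo/branches \
          http://svn.example.com/repo/tags \
          -m "Create standard trunk/branches/tags layout"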
Anyway, this is just one aspect of how to set up your system, but I think giving some thought to this structure is important.