Performing Historical Builds with Mercurial - date

Background
We use a central repository model to coordinate code submissions between all the developers on my team. Our automated nightly build system has a code submission cut-off of 3AM each morning, when it pulls the latest code from the central repo to its own local repository.
Some weeks ago, a build was performed that included Revision 1 of the repo. At that time, the build system did not in any way track the revision of the repository that was used to perform the build (it does now, thankfully).
-+------- Build Cut-Off Time
|
|
O Revision 1
An hour before the build cut-off time, a developer branched off the repository and committed a new revision in their own local copy. They did NOT push it back to the central repo before the cut-off and so it was not included in the build. This would be Revision 2 in the graph below.
-+------- Build Cut-Off Time
|
| O Revision 2
| |
| |
|/
|
O Revision 1
An hour after the build, the developer pushed their changes back to the central repo.
O Revision 3
|\
| |
-+-+----- Build Cut-Off Time
| |
| O Revision 2
| |
| |
|/
|
O Revision 1
So, Revision 1 made it into the build, while the changes in Revision 2 would've been included in the following morning's build (as part of Revision 3). So far, so good.
Problem
Now, today, I want to reconstruct the original build. The seemingly obvious steps to do this would be to
determine the revision that was in the original build,
update to that revision, and
perform the build.
The problem comes with Step 1. In the absence of a separately recorded repository revision, how can I definitively determine what revision of the repo was used in the original build? All revisions are on the same named branch and no tags are used.
The log command
hg log --date "<cutoff_of_original_build" --limit 1
gives Revision 2 - not Revision 1, which was in the original build!
Now, I understand why it does this - Revision 2 is now the revision closest to the build cut-off time - but it doesn't change the fact that I've failed to identify the correct revision on which to rebuild.
Thus, if I can't use the --date option of the log command to find the correct historical version, what other means are available to determine the correct one?

Considering that whatever history might have been in the undo files is gone by now (the only thing I can think of that could have given an indication), I think the only way to narrow it down to a specific revision will be a brute-force approach.
If the range of possible revisions is fairly large, and the build output changes in size (or some other non-date property that is linear, or close to it), you may be able to use the bisect command to do what amounts to a binary search to narrow down the revision you're looking for (or at least get close to it). At each revision bisect stops to test, you would build at that revision and compare whatever property you're using against what the scheduled build produced that night. Depending on the test, it might not even require building.
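For illustration, a rough sketch of that bisect-driven search, assuming a hypothetical compare-build.sh script that builds the checked-out revision and exits 0 ("good") while the chosen property (say, binary size) is still at or below that of the original nightly output, and non-zero ("bad") once it exceeds it:
hg bisect --reset
hg bisect --good <oldest candidate revision>
hg bisect --bad <newest candidate revision>
hg bisect --command ./compare-build.sh
Mercurial then checks out revisions for you and reports the first "bad" one; the revision just before it is the prime candidate. This only narrows things down reliably if the compared property really does change monotonically across the range.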
If it really is as simple as the graph you depict and the range of possibilities is short, you could just start from the latest revision it might be and walk backwards a few revisions, testing against the original build.
As for a definitive test comparing the two builds, hashing the test build and comparing it to a hash of the original build might work. If a compile on the nightly build machine and a compile on your machine of the same revision do not produce binary-identical builds, you may have to use binary diffing (such as with xdelta or bsdiff) and look for the smallest diff.
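A small sketch of that comparison, with example file names (and assuming sha256sum and xdelta3 are available):
sha256sum rebuilt/app.bin archived-nightly/app.bin     # identical hashes settle it outright
xdelta3 -e -s archived-nightly/app.bin rebuilt/app.bin delta.vcdiff && ls -l delta.vcdiff
Otherwise, the size of the delta gives a rough measure of how close a candidate revision's build is to the archived one.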
Mercurial does not have the information you want:
Mercurial does not, out of the box, make it its business to log and track every action performed on a repository, such as push, pull, or update. If it did, it would produce a lot of logging information. It does, however, provide hooks that can be used to do exactly that if one so desires.
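For example, a minimal sketch of such logging via a hook, added to the central repository's .hg/hgrc (the log path and message format are just illustrations, not anything Mercurial provides by default):
[hooks]
# record every group of incoming changesets (push or pull into this repo)
changegroup.log = echo "$(date -u) changegroup starting at $HG_NODE" >> /var/log/hg-activity.log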
It also does not care what you do with the contents of the working directory, such as opening files or compiling, so of course it is not going to track that at all. It's simply not what Mercurial does.
It was a mistake not to know exactly what the scheduled build was building. You agree implicitly, because you now log that very information. The lack of that information before has simply come back to bite you, and there is no easy way out of it. Mercurial does not have the information you need. If the central repo is just a shared directory, rather than a web-hosted repository that might have tracked activity, the only information about what was built is in the compiled output. Whether it is some metadata declared in the source that becomes part of the build, a naive property like file size, or you truly are stuck hashing files, you can't get your answer without some effort.
Maybe you don't need to test every revision; there may be revisions you can be certain are not candidates. Knowing the time of the compile merely gives you the upper bound of the range of revisions to test: revisions after that time could not possibly be candidates. What you don't know is what had been pushed to the server at the time the build server pulled from it. But you do know that revisions from that day are the most likely. You also know that revisions on parallel unnamed branches are less likely candidates than linear revisions and merges. If there are a lot of parallel unnamed branches and you know all your developers merge in a particular way, you might know whether the revisions under parent1 or parent2 are the ones worth testing.
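A sketch of enumerating those candidates with a revset, using an example branch name and cut-off:
hg log -r "sort(branch(default) and date('<2012-03-15 03:00'), -date)" --template "{rev}:{node|short} {date|isodate}\n"
This lists every revision on the branch committed before the cut-off, newest first. It has the same blind spot as --date (Revision 2 still appears), but it gives you the ordered list of candidates to work through.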
Maybe you don't even need to compile if there is metadata you can parse from the source code to compare with what you know about the specific build.
And you can automate your search. It would be easiest to do so with a linear search: fewer heuristics to design.
The bottom line is simply that Mercurial does not have a magic button to help in this case.

Apologies, it's probably bad form to answer one's own question, but there wasn't enough room to properly respond in a comment box.
To Joel, a couple of things:
First - and I mean this sincerely - thanks for your response. You provided an option that was considered, but which was ultimately rejected because it would be too complex to apply to my build environment.
Second, you got a little preachy there. In the question, it was understood that because a separately recorded repository revision was absent, there would be 'some effort' to figure out the correct revision. In a response to Lance's comment (above), I agree that recording the 40-byte repository hash is the 'correct' way of archiving the necessary build info. However, this question was about what CAN be done IF you do not have that information.
To be clear, I posted my question on StackOverflow for two reasons:
I figured that others must have run into this situation before and that, perhaps, someone may have determined a means to get at the requisite information. So, it was worth a shot.
Information sharing. If others run into this problem in the future, they will have an online reference that clearly explains the problem and discusses viable options for remediation.
Solution
In the end, perhaps my greatest thanks should go to Chris Morgan, who got me thinking to use the central server's mercurial-server logs. Using those logs, and some scripting, I was able to definitively determine the set of revisions that were pushed to the central repository at the time of the build. So, my thanks to Chris and to everyone else who responded.
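For anyone repeating this, the final step is straightforward once the server-side logs have yielded the newest changeset that had been pushed before the cut-off (the variable and build script below are purely illustrative):
hg update -r "$LAST_PUSHED_BEFORE_CUTOFF"
./nightly-build.sh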

As Joel said, it is not possible. However, there are certain solutions that can help you:
maintain a database of nightly build revisions (date + changeset id)
the build server can automatically tag the revision it is based on (nightly/) - see the sketch after this list
switch to Bazaar, which manages version numbers differently (branched versions are in the form REVISION_FORKED.BRANCH_NUMBER.BRANCH_REVISION, so your change number 2 would be 1.1.1)
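A sketch of the second option, run by the build server right after it pulls (the tag naming scheme is only an example):
hg pull -u
hg tag "nightly/$(date +%Y-%m-%d)"   # tags the working directory parent, i.e. what is about to be built
hg push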

Related

Ignore specific commit in svn when showing annotation

I use the "show annotation" functionality quite often. Now, I accidentally crushed the svn and solved it by making a re-commit of everything. Now, every time I use the "show annotation" function, it shows this last commit on every line.
Can I revert this somehow?
I'm assuming you didn't kill the entire SVN repository and "solved" that by starting over from rev 1. I'm assuming some intermediate revision got corrupted and you had to touch and commit every file in a new revision, but older revisions are still visible and accessible in the SVN history. The annotation feature and Plan B below both rely on that.
What the textbook offers
Excluding a single mid-range revision from the annotation is not possible. You can only exclude ranges at the head or tail of the history, by specifying a revision other than 1 for the "From" or other than HEAD for the "To".
Say the "repair" revision you want to exclude is r1000. To exclude it, you can choose to consider either (from-to) r1-r999 or r1001-HEAD, leaving out r1000. So you are confined to either viewing the changes before or after the repair.
You can read up on the possibilities and options of what's internally called svn blame in the SVN documentation.
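A sketch of those two ranges on the command line, using an example file and the r1000 repair revision from above:
svn blame -r 1:999 trunk/somefile.c
svn blame -r 1001:HEAD trunk/somefile.c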
Plan B
Now, that's not really satisfying, I imagine. Here's something else you can try, but please create a backup of your repo first.
With the help of the SVN history viewer, or log viewer, find the last revision before the corrupted revision, say r997.
Make a branch based off that last good revision.
Then delete or move the current trunk, using the corresponding SVN commands.
In the last step, move or branch (= copy) the branch back to the trunk location, as sketched below.
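A sketch of those steps with command-line SVN, assuming an example repository URL and r997 as the last good revision (again, back up the repository first):
svn copy -r 997 http://server/repo/trunk http://server/repo/branches/pre-repair -m "branch off the last good revision"
svn delete http://server/repo/trunk -m "retire the trunk containing the repair commits"
svn copy http://server/repo/branches/pre-repair http://server/repo/trunk -m "promote the clean branch to trunk"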
You have effectively cut out the corrupt revisions. The branch-now-trunk has a "hole" in its revision numbers, because branching off r997 created a new revision younger than the corrupted and repairing revisions. Afterwards, showing annotations on that new trunk will work like before, but won't include the corruption and your "repair".
This operation can screw up some ancestry operations like merging, but I've done it successfully before, even with large merging operations later on, so you might as well try it, too. Good luck!

Mercurial difference between changesets and revisions

I'm new to Mercurial and trying to understand how things work.
I wonder, what is the difference between changesets and revisions?
Thanks.
None.
From the Understanding Mercurial page:
When you commit, the state of the working directory relative to its
parents is recorded as a new changeset (also called a new
"revision")...
and further down the page:
Mercurial groups related changes to multiple files into single atomic
changesets, which are revisions of the whole project.
(emphasis mine)
Even though this is old, someone might stumble on it, and I would say that there is a crucial difference. They are related, as @Edward pointed out. Still, based on Mercurial's FAQ, they are not the same.
A revision number is a simple decimal number that corresponds with the ordering of commits in the local repository.
The important part is local repository and further:
It is important to understand that this ordering can change from machine to machine due to Mercurial's distributed, decentralized architecture. This is where changeset IDs come in. A changeset ID is a 160-bit identifier that uniquely describes a changeset and its position in the change history, regardless of which machine it's on.
You should always use some form of changeset ID rather than the local revision number when discussing revisions with other Mercurial users as they may have different revision numbering on their system.
From experience I can tell that revision numbers do sometimes differ between repositories and are not unique.
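A quick way to see both identifiers side by side for the current revision:
hg log -r . --template "{rev}  {node|short}  {node}\n"
The {rev} column may differ between clones of the same project, while the {node} hash is stable everywhere.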

Storing code metrics

I'd like to write a pre-commit hook that tells you if you've improved/worsened some code metric of a project (i.e. average function length). The hook would have to know what the previous average function length was, and I don't know where to store that information. One option would be to store an additional .metrics file in the repo, but that sounds clunky. Another option would be to git stash, compute the metrics, git stash pop, compute the metrics again and print the delta. I'm inclined to go with the latter. Are there any other solutions?
Disclaimer: I am the author of the Metrix++ tool, which I use in the workflow I describe below. I guess the same workflow can be executed with other tools capable of comparing results.
One of the ideas you suggested works perfectly, if you add a couple of CI checks (see the steps below). I find it solid. Not sure why you are considering it clunky.
I have got a file with metrics results which is updated before each commit and stored in the VCS. Let's name this file metrics.db, and consider automating the following workflow on build/test of a project (a sketch of steps 1-3 follows the list):
1) If metrics.db has not been changed since the last checkout (i.e. it is the original data for the previous/base revision), copy it to metrics-prev.db.
2) Collect metrics for the current code, which produces the metrics.db file again. Note: it is very helpful when a metrics tool can do incremental scans for better performance (i.e. calculate metrics only for updated functions/classes), because it gives you the opportunity to run the metrics tool on every build, including incremental ones.
3) Compare metrics-prev.db with metrics.db. If the metrics identify regressions, fail the build and [optionally] do not allow the commit - a team rule. If the metrics are good, the build is successful and the commit may happen.
4) [optionally] You may run Continuous Integration (CI) that validates that the committed metrics.db file actually corresponds to the committed code for the same revision (i.e. do the same steps 1-3 and make sure the diff is zero at step 3). If the diff is not zero, somebody forgot to update the metrics.db file and presumably did not execute the pre-commit check, so revert the change.
5) [optionally] CI may do steps 1-3 itself if it fetches metrics.db from the previous revision as metrics-prev.db. In this case, CI may also check that the collected metrics.db is the same as the committed one (an alternative or addition to step 4).
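A minimal sketch of steps 1-3, using a hypothetical metrics-tool command as a stand-in for Metrix++ or any other tool that can collect and compare results:
cp metrics.db metrics-prev.db                               # step 1: keep the baseline
metrics-tool collect --output metrics.db                    # step 2: measure the current code (hypothetical CLI)
metrics-tool compare metrics-prev.db metrics.db || exit 1   # step 3: fail on any regression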
Another implementation I have seen: metrics.db files are stored on a separate drive, outside the VCS, and a custom script locates the corresponding metrics.db for a revision. I find this solution unreliable, as the drive can disappear, files can be moved and renamed, and so on. So placing the file in the VCS is the better solution, but either will work.
I have attempted the alternative you suggested: switch to the previous revision and run the metrics tool twice. I abandoned this approach for two reasons: the metrics check script alters your source files (so it is impossible to include it in an incremental rebuild and keep working smoothly with your IDE, which will complain about changed files), and its performance is very poor (compared with incremental re-scans, it is extremely slow).
Hope it helps.

automatically add to each changeset a file that contains the new revision number

Whenever I commit, I want to save in a file the revision number of the changeset that I'm creating. I also want that file to be added to the same changeset.
Note that the revision number of the parent of the working directory is not what I want because the changeset being created will have a higher revision number. Usually it's just the parent revision number + 1, but if someone committed since the time I checked out my working directory, it may be higher.
UPDATE:
It's obviously very strange that I'd be interested in this information since, as the comments below say, it's repo-specific and won't match what others see. However, I am the only developer, using a single repository. I find the repo revision numbers super convenient for keeping track of what code was used to generate various research results. I can see how it's not great, but it works in this specific scenario.
Obviously, I could use the hash, but that's harder to remember and use in a conversation. If I did want to use the hash, my question would still remain: how to get the hash of the changeset that's being committed.
Related:
"mercurial - I want to add some custom code to be run after commit" seems to be unable to achieve the desired outcome.
This article is clearly relevant, but unless I miss something, it relies on the fact that nobody committed to the same repository since the last checkout by the current user.
I'm under Windows 7, TortoiseHG, latest version.
You can probably just put this in there:
TIP=$(hg id --num --rev tip)    # local revision number of the current tip
NEXT=$(($TIP + 1))              # the number the new commit will most likely receive
but please do keep in mind that those numbers are almost entirely meaningless. When someone else clones that repository the revision numbers can change. Only the nodeids have any meaning outside the repository in which you looked them up.
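If you still want to wire this up, here is a sketch of a pre-commit hook; the script and file names are examples, and it may need adjusting if you commit with explicit file arguments:
[hooks]
pre-commit.stamp = sh .hg/stamp-revision.sh
# .hg/stamp-revision.sh (illustrative)
TIP=$(hg id --num --rev tip)
NEXT=$(($TIP + 1))
echo "$NEXT" > revision.txt
hg add revision.txt 2>/dev/null || true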

How to manage source control changesets with multiple overlapping changes and daily rebuilds?

I am working at a company which uses cvs for version control.
We develop and maintain an online system involving around a hundred executables - which share large amounts of code, and a large number of supporting files.
We have a head branch, and live branches taken from the head branch. The live branches represent major releases which are released about every three months.
In addition, there are numerous daily bug fixes which must both be applied to the live branch - so they can be taken to the live environment immediately - and merged back to the head branch, so they will be in the next major release.
Our most obvious difficulty is with the daily fixes. As we have many daily modifications there are always multiple changes on the testing environment. Often when the executables are rebuilt for one task, untested changes to shared code get included in the build and taken to the live environment.
It seems to me we need some tools to better manage changesets.
I'm not the person who does the builds, so I am hoping to find a straightforward process for managing this, as it will make it easier for me to get the build manager interested in adopting it.
I think what you need is a change in repository layout. If I understand correctly, your repository looks like this:
Mainline
|
-- Live branch January (v 1.0)
|
-- Live branch April (v 2.0)
|
-- Live branch July (v 3.0)
So each of the branches contains all your sites (hundreds) as well as folders for shared code.
There is no scientific way to tell exactly the chance of an error appearing after a release, but let's have a look at the two most important factors:
The number of code lines committed per time unit. You cannot / will not want to globally change this, as it is the developers' productivity output.
Test coverage, i.e. how often code gets executed BEFORE going live and how much of your codebase is involved. This could easily be changed by giving people more time to test before a release or by implementing automated tests. It's a resources issue.
If your company wants neither to spend money on extra testing nor to decrease release frequency (not necessarily productivity!), you will indeed have to find a way to release fewer changes, effectively decreasing the number of changed lines of code per release.
As a result of this insight, having all developers committing into the same branch and going live from there multiple times a day doesn't sound like a good idea, does it?
You want increased Isolation.
Isolation in most version control systems is implemented by
Incrementing revision numbers per atomic commit
Branching
You could try to implement a solution that packs changes from multiple revisions into release packages, a bit like the version control system Perforce does. I wouldn't do that, though, as branching is almost always easier. Keep the KISS principle in mind.
So how could branching help?
You could try to isolate changes that have to go live today from changes that might have to go live tomorrow or next week.
Iteration Branches
Mainline
|
-- Live branch July (v 3.0)
|
-- Monday (may result in releases 3.0.1 - 3.0.5)
|
-- Tuesday (may result in releases 3.0.6 - 3.0.8)
|
-- Wednesday (may result in releases 3.0.9 - 3.0.14)
People need to put more thought into "targeting" their changes at the right release, but it could lead to not-so-urgent changes (especially to shared/library code) staying longer OUTSIDE of a release and inside the live branch, where by chance or systematic testing they could be discovered before going live (see the test-coverage factor).
Additional merging down is required, of course, and sometimes cherry-picking of changes out of the live branch into the daily branch.
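For reference, a sketch of that merge-down with plain CVS commands (branch tags, revision numbers and file names are examples):
cvs update -r live-3_0                  # switch the working copy to the live branch
cvs update -j monday-3_0 somefile.c     # merge the Monday branch's changes into it
cvs update -j 1.41 -j 1.42 other.c      # or cherry-pick a single change by its revision pair
cvs commit -m "merge Monday's fixes down to the live branch"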
Now please don't take me too literally with the daily branches. In my company we have 2-week iterations and a release branch for each iteration, and it is already enough overhead to maintain that branch.
Instead of isolating by day you could try to isolate by product/site.
Project Branches
Mainline
|
-- Live branch July (v 3.0)
|
-- Mysite-A (released when something in A changed and only released to the destination of A)
|
-- Mysite-B
|
-- Mysite-C
In this scenario the code of the single site AND all needed shared code and libraries would reside in such a Site-branch.
If shared code has to be altered for something to work within site A, you only change the shared code in site A. You also merge the change down so anyone can catch up on your changes. Catching-up cycles may be a lot longer than releases, so the code has time to "ripen".
In your deploy/build process you have to make sure, of course, that the shared code of site A does NOT overwrite the shared code site B uses. You are effectively "forking" your shared code, with all the implications (incompatibility, overhead for integrating team changes).
Once in a while there should be forced merges down to the live branch (you might want to rename that then, too) to integrate all changes that have been made to shared code. Your 3-month iteration will force you to do that anyway, I guess, but you might find out that 3 months is too long for hassle-free integration.
The third approach is the most extreme.
Project & Iteration Branches
Mainline
|
-- Live branch July (v 3.0)
|
-- Mysite-A
|
-- Week 1
|
-- Week 2
|
-- Week 3
|
-- Mysite-B
|
-- Week 1
|
-- Week 2
|
-- Week 3
|
-- Mysite-C
|
-- Week 1
|
-- Week 2
|
-- Week 3
This certainly brings a huge amount of overhead and potential headaches if you are not paying attention. On the good side, you can very accurately deploy only the changes that are needed NOW for THIS project/site.
I hope this all gives you some ideas.
Applied source control is a lot about risk control for increased product quality.
While the decision about what level of quality your company wants to deliver might not be in your hands, knowing it will help you decide what changes to suggest. It might turn out that your customers are adequately content with your quality and that further efforts to increase it do not pay off.
Good luck.
Christoph