Storing code metrics - code-metrics

I'd like to write a pre-commit hook that tells you if you've improved/worsened some code metric of a project (i.e. average function length). The hook would have to know what the previous average function length was and I don't know where to store that information. One option would be to store an additional .metrics file in the repo but that sounds clunky. Another option would be to git stash, compute the metrics, git stash pop, compute the metrics again and print the delta. I'm inclined to go with the latter. Are the any other solutions?

Disclaimer: I am author of the Metrix++ tool, which I am using in the workflow I described below. I guess the same workflow can be executed with other tools capable to compare the results.
One of the ideas you suggested works perfectly, if you add a couple of CI checks (see the steps below). I find it solid. Not sure why you are considering it clunky.
I have got a file with metrics results which is updated before each commit and stored in VCS. Let's name this file metrics.db, and consider automation of the following workflow on build/test of a project:
1) if metrics.db has not been changed since last checkout (i.e. it is the original data for the previous/base revision), copy it to metrics-prev.db
2) Collect metrics for current code, what produces metrics.db file again. Note: It is very helpful when a metrics tool can do iterative scans for the best performance (i.e. calculate metrics for updated functions/classes), so it gives you the opportunity to run metrics tool on every build, including iterative.
3) Compare metrics-prev.db with metrics.db. If metrics identify regressions, fail the build and [optionally] do not allow to commit - team rule. If metrics are good, build is successful, and commit may happen.
4) [optionally] you may run Continuous Integration (CI) which validates that the actual committed metrics.db file corresponds to the committed code for the same revision (i.e. do the same 1-3 steps and make sure that the diff is zero at the step 3). If diff is not zero, it means somebody forgot to update the metrics.db file, and presumably did not execute pre-commit check, so revert the change.
5) [optionally] CI may do steps 1-3 if you fetch metrics.db as metrics-prev.db from the previous revision. In this case, CI may also check that the collected metrics.db is the same as committed (alternative or addition for the step 4).
Another implementation I have seen: metrics.db files are stored in a separate drive, out of VCS, and custom script is able to locate corresponding metrics.db for a revision. I find this solution unreliable as the drive can disappear, files can be moved and renamed, and so on. So, placing the file in VCS is better solution, but any will work.
I have attempted to do the alternative you suggested: switch to the previous revision and run the metrics tool twice. I abandoned this approach for several reasons: metrics check script alters your source files (so, it is impossible to include it into iterative rebuild and continue to work smoothly with your IDE as it will complain about changed files), and secondly it is very slow performance (comparing with iterative re-scans, it is extremely slow).
Hope it helps.

Related

How to manage PR validation builds for 100+ projects

We have 100+ services/apps in a repository in Azure Devops. We have defined a single CI/CD YAML multistage pipeline for each (build and deployment). This limits blast radius and allows for auditability of each release of each project. We rely on templates for all the real pipeline work so this is easy to maintain; just a small root azure-pipelines.yml file for each project that includes the needed templates.
Now, we'd like to start using PR validation builds. And, as best as I can tell, we have two options:
Create a separate PR build for for every project and use the UI/API for policies to create 100+ policies
Create a single PR build that has stages for all 100+ projects.
I'm not a fan of the 1st option as now we'll have 200+ builds. The 2nd option is possible, but to avoid a 3 hour PR build, we'd need a way to only run needed stages (aka project builds).
Is there a 3rd option I'm missing? If the 2nd option is our best bet, how do we turn off stages for projects not changed in that PR (i.e. what condition would we use)?
(FYI, our policy is to change only one project per PR, but there are, on occasion exceptions to that.)
For personal suggestion, I also recommend the second method. Though the build script would be very large in one configure file, but much better than have hundreds build configuration files.
But the difficulty is these 100+ apps are all in one repository. This means all the normal method will not suitable for you, include using Build.Repository.Name value as the stage condition. Also, there's no more details which describing the source file path stored in the commit.
So, I suggest you and your team developers input the project name info into your commit message. Then, in the build pipeline you could use the variable Build.SourceVersionMessage to get its comment message. Since this is a environment variable which only work in step level(Not work for stage level and the job level), it needs you add one task in the first step and use the condition for it.
The logic of it is add one step as the first one in every stages. This step is only used to conditional judgment. If the Build.SourceVersionMessage matches the prefix or any key contents words, the jobs will be early-exit.
If use the condition like this:
condition: startsWith(variables['Build.SourceVersionMessage'], '[maven-plugin]')
It needs your commit message must follow a strict content writing format, starting with the specified project name.
Another condition can for you consider is:
condition: in(variables['Build.SourceVersionMessage'], 'maven-plugin')
This does not need the strict content writing format, but also need input the project name in the commit message. Thus it could be evaluated in the job condition with the above script.
Hope it could give you some help.

Reading custom metrics from the last build for custom baseline comparisons

I'm planning to introduce linting into a rather massive code base. Fixing all existing issues beforehand is not possible, so seeing thousands of linter errors at start is inevitable.
I'd like to record the number of detected errors each time the build runs for master and treat this number as a success / failure threshold. If a new pull request does not exceed the current baseline, its pipeline passes and so the proposed change is good to go. However, if the number of errors increases, I'd like the pipeline to fail, thus preventing the merge.
This functionality I’ve described narrows down to writing variables to Azure DevOps servers as some side-effects of builds and also reading these values from the previous build. This looks very similar to comparing code coverage, however, I can't seem to find any docs on how to implement the read-write logic manually.
What pipeline task could I use? What else can I leverage to track a custom metric over a number of builds and compare the value with previous? To summarise, my ultimate goal is to gradually lower an arbitrary value from a large number to zero over the course of several months.

What is the standard workflow for applying conflicting patches?

This is a programming language and version control system agnostic question.
There is a source code tree and two patches X and Y. Each of them apply cleanly to the source code tree. But applying one of them (either X or Y first), then another one, results in second patch failing to apply (patches conflict).
Is my only option applying one of them (probably the biggest one, so most of work gets done automatically), then merging the other one by hand and resolving conflicts, or there are better tools/practices to handle this scenario?
The goal is to avoid this situation from happening as there's no easy solution to merge.
In order to avoid, make small commits with their tests and push them to the source repository. Other guys in the team will be forced to pull the latest changes in order to commit their code, and this will ensure that nothing gets broken.
I encourage you to avoid having multiple teams manipulating the same part of the source code. Create a good structure and if possible break down the project into smaller projects.

Why should I combine code and tests in a single commit?

Let's say I write an atomic change of code. I also write some tests to make sure the change works and keeps working.
I think I should commit the code change together with the tests.
PRO:
This makes sure (as far as possible) that every changeset of the branch results in running code and passing tests.
It documents that these tests "belong to" the code change that I did.
CON:
I may want to backout the change without backing out the tests some time in the future
What reasons more are there for doing it one way or the other?
Has either strategy ever bitten you? How exactly?
Somewhat related: Should change to code be committed separately from corresonding change to test suite?
My only rule is to make sure that code builds prior to checking in.
Yes, ideally you want to check in atomic bits of code. However, how do you define atomic? And then what happens if, later, you have to change one of the bits? Your changes are no longer in one commit.
I think I should commit the code change together with the tests.
I never do, but I would, if I used a non-branching SCM.
In branching SCMs (I've worked with mercurial and git) it's a non-issue.
I use a different branch every time I start working or something; I many times have local branch commits that just commit outstanding changes (that don't even compile) when I move to something else for a while, or when I just need a backup point before I start a major code change). I know that most of those intermediary commits will never be checked out again, when I create them. They don't affect anything though and are really useful when needed.
I only merge branches back when I have code that compiles and tests that run successfully on it.

Performing Historical Builds with Mercurial

Background
We use a central repository model to coordinate code submissions between all the developers on my team. Our automated nightly build system has a code submission cut-off of 3AM each morning, when it pulls the latest code from the central repo to its own local repository.
Some weeks ago, a build was performed that included Revision 1 of the repo. At that time, the build system did not in any way track the revision of the repository that was used to perform the build (it does now, thankfully).
-+------- Build Cut-Off Time
|
|
O Revision 1
An hour before the build cut-off time, a developer branched off the repository and committed a new revision in their own local copy. They did NOT push it back to the central repo before the cut-off and so it was not included in the build. This would be Revision 2 in the graph below.
-+------- Build Cut-Off Time
|
| O Revision 2
| |
| |
|/
|
O Revision 1
An hour after the build, the developer pushed their changes back to the central repo.
O Revision 3
|\
| |
-+-+----- Build Cut-Off Time
| |
| O Revision 2
| |
| |
|/
|
O Revision 1
So, Revision 1 made it into the build, while the changes in Revision 2 would've been included in the following morning's build (as part of Revision 3). So far, so good.
Problem
Now, today, I want to reconstruct the original build. The seemingly obvious steps to do this would be to
determine the revision that was in the original build,
update to that revision, and
perform the build.
The problem comes with Step 1. In the absence of a separately recorded repository revision, how can I definitively determine what revision of the repo was used in the original build? All revisions are on the same named branch and no tags are used.
The log command
hg log --date "<cutoff_of_original_build" --limit 1
gives Revision 2 - not Revision 1, which was in the original build!
Now, I understand why it does this - Revision 2 is now the revision closest to the build cut-off time - but it doesn't change the fact that I've failed to identify the correct revision on which to rebuild.
Thus, if I can't use the --date option of the log command to find the correct historical version, what other means are available to determine the correct one?
Considering whatever history might have been in the undo files is gone by now (the only thing I can think of that could give an indication), I think the only way to narrow it down to a specific revision will be a brute force approach.
If the range of possible revisions is a bit large and the product of building changes in size or other non-date aspect that is linear or near enough to linear, you may be able to use the bisect command to basically do a binary search to narrow down what revision you're looking for (or maybe just get close to it). At each revision that bisect stops to test, you would build at that revision and test whatever aspect you're using to compare against what the scheduled build produced that night. Might not even require building, depending on the test.
If it really is as simple as the graph you depict and the range of possibilities is short, you could just start from the latest revision it might be and walk backwards a few revisions, testing against the original build.
As for a definitive test comparing the two builds, hashing the test build and comparing it to a hash of the original build might work. If a compile on the nightly build machine and a compile on your machine of the same revision do not produce binary-identical builds, you may have to use binary diffing (such as with xdelta or bsdiff) and look for the smallest diff.
Mercurial does not have the information you want:
Mercurial does not, out of the box, make it its business to log and track every action performed regarding a repository, such as push, pull, update. If it did, it would be producing a lot of logging information. It does make available hooks that can be used to do that if one so desires.
It also does not care what you do with the contents of the working directory, such as opening files or compiling, so of course it is not going to track that at all. It's simply not what Mercurial does.
It was a mistake to not know exactly what the scheduled build was building. You agree implicitly because you now log that very information. The lack of that information before has simply come back to bite you, and there is no easy way out of it. Mercurial does not have the information you need. If the central repo is just a shared directory rather than a web-hosted repository that might have tracked activity, the only information about what was built is in the compiled version. Whether it is some metadata declared in the source that becomes part of the build, a naive aspect like filesize, or you truly are stuck hashing files, you can't get your answer without some effort.
Maybe you don't need to test every revision; there may be revisions you can be certain are not candidates. Knowing the time of the compile is merely a factor as the upper bound on the range of revisions to test. You know that revisions after that time could not possibly be candidates. What you don't know is what was pushed to the server at the time the build server pulled from it. But you do know that revisions from that day are the most likely. You also know that revisions in parallel unnamed branches are less-likely candidates than linear revisions and merges. If there are a lot of parallel unnamed branches and you know all your developers merge in a particular way, you might know whether the revisions under parent1 or parent2 should be tested based.
Maybe you don't even need to compile if there is metadata you can parse from the source code to compare with what you know about the specific build.
And you can automate your search. It would be easiest to do so with a linear search: less heuristics to design.
The bottom line is simply that Mercurial does not have a magic button to help in this case.
Apologies, it's probably bad form to answer one's own question, but there wasn't enough room to properly respond in a comment box.
To Joel, a couple of things:
First - and I mean this sincerely - thanks for your response. You provided an option that was considered, but which was ultimately rejected because it would be too complex to apply to my build environment.
Second, you got a little preachy there. In the question, it was understood that because a separately recorded repository revision was absent, there would be 'some effort' to figure out the correct revision. In a response to Lance's comment (above), I agree that recording the 40-byte repository hash is the 'correct' way of archiving the necessary build info. However, this question was about what CAN be done IF you do not have that information.
To be clear, I posted my question on StackOverflow for two reasons:
I figured that others must have run into this situation before and that, perhaps, someone may have determined a means to get at the requisite information. So, it was worth a shot.
Information sharing. If others run into this problem in the future, they will have an online reference that clearly explained the problem and discussed viable options for remediation.
Solution
In the end, perhaps my greatest thanks should go to Chris Morgan, who got me thinking to use the central server's mercurial-server logs. Using those logs, and some scripting, I was able to definitively determine the set of revisions that were pushed to the central repository at the time of the build. So, my thanks to Chris and to everyone else who responded.
As Joel said, it is not possible. However there are certain solutions that can help you:
maintain a database of nightly build revisions (date + changeset id)
build server can automatically tag revision it is based on (nightly/)
switch to Bazaar, it manages version numbers differently (branched versions are in form of REVISION_FORKED.BRANCH_NUMBER.BRANCH_REVISION so your change number 2 would be 1.1.1