Search code history for closest matching version based on content

Search code history for closest matching version based on content - version-control

I have a file that was forked from a project at an unknown moment in the past. I want to identify as closely as possible the moment of that fork. The file has been changed since the fork-moment.
Winmerge highlights about about 20% of the lines, with about half of those being just a few characters within the line, a path change or inline function turned into a variable or function call for instance. (20% after ignoring whitespace change and enabling moved-block detection that is, closer to ~40% without that.)
I don't have to worry about branches, the original version control system was CVS. (I don't have access to the CVS file system). I have a git imported version with tags corresponding to the CVS commits, and could generate the same with Mercurial for little effort if need be.
I don't care about matching the specific CSV commit date/time/number/whatever. The goal is to identify when the content of new file started drifting, and step forward through the revision history, cherry picking what to merge to the forked file.
For this project I could brute force it, there only a dozen or so revisions where the fork has mostly likely occurred and the file is less than 500 lines. However it's not hard to imagine a scenario where this is not feasible and I'm curious about what an elegant solution might be.
How would you go about solving this?

"Brute force" sounds as if you were contemplating testing all revisions. Normally one would use a binary search. To decide if it was a good match, I'd normally use just the numbers from diffstat (since you say there are post-fork changes). Accounting for block-moves complicates things, though.

Related

Perforce CLI - how to determine which changelist a file is checked out under?

In the absence of development branches (or sometimes in the overabundance thereof), it is common to have at least half a dozen of other people's shelves unshelved in a workspace, as well as multiple of one's own: both local changes (divided by directory, of course) and shelves from other workspaces.
(As an aside, yes, this is almost as bad as pre-version control email/tarball/fileshare-based nonsense, and no, there's nothing I can do about it.)
As a result of this situation, it is often necessary to identify which of myriad changelists refers to a given file.
Is there a way of retrieving this information, using the p4 command line tool, without recourse to elaborate shell scripting or wrapper programmes?

p4 opened FILE
Note that this tells you which changelist you have the file open in -- it doesn't necessarily correspond to which changelist you may have unshelved from originally. If this is important, make sure that when you unshelve it's into distinct changelists (rather than opening everything in the same changelist) so you can keep track of what came from where.

ClearCase merge-conflict-on-rebase mystery -- why does manual merges are sometimes required when doing a rebase for a file that has NO local changes?

Here's a rather advanced question for true ClearCase experts:
I frequently perform a rebase on a ClearCase snapshot view that has just a very limited number of small changes in few files (e.g. file1.c, file2.c, file3.c).
I do it with the following on the UNIX command line:
cleartool rebase -recommended -complete
Sometimes, while this command runs, out of the blue, and for no exlained reasons (yet), I get prompted for manual input to solve some "merge" conflicts. But they make no sense to me, as they happen in file(s) that I NEVER EVER TOUCHED -- and which ONLY ONE OTHER DEVELOPER EVER TOUCHES.
The "merge" prompts I see when this scenario happens during a rebase look usually like:
"Do you want INSERTION from file x? [yes/no]"
or
"Do you want DELETION from file y? [yes/no]"
or
"Do you want CHANGES from file z? [yes/no]".
Etc.
I have no clue why these "conflicts" are happening. Additionally, it's really hard (see impossible) to make good decisions because the details are shown with a very narrow number of columns, and there's hardly any way to guess right. Using graphical merging is not an option here because this is meant for an automation script that should ideally never ask for user input.
What I do know about this scenario is:
We have a team of 6 developers. 5 of us usually work the same limited number of files... say file1.c, file2.c, file.3.c
I work on a child development stream on these three files. And when I'm done, I normally deliver up to the default parent stream.
On the occasions where the "merge conflicts" on rebase happened, it's always on a totally DIFFERENT FILE -- one that is ONLY EVER TOUCHED by JUST ONE other developer in the team (it's a module that HE owns, NO ONE EVER TOUCHES THAT FILE BUT HIM). Let's call him developer #6.
When this strange "merge conflict" on rebase happens, I've usually been working for an extended time in my own development child stream (always with a snapshot view), and I've done a couple rebases (at least 3) to bring other changes ALL made by other developers (in file1.c, file2.c and file3.c) and which I needed to complete my work.
But, the other developer (#6), the ONLY ONE working on banana.c, had made changes to that file, in at least two of the three rebases activities that were created in the snapshot view of my child development stream.
Again, I repeat, I NEVER touched banana.c, and none of the 5 other developers ever did, except for the guy (#6) who owns the banana.c module.
And there, it happened - ClearCase asked me for manual input to solve a "merge" conflict with banana.c when doing "cleartool rebase -recommended -complete".
How can this be possible???
How can a file require "merging" when doing rebase if there is ONLY EVER one single developer making changes to it?
It's as if ClearCase got confused between different versions of banana.c in at least two of the three rebase activities automatically created in my stream (which both modified banana.c) and prompted me for "merge conflict" resolution (even though I never ever touched banana.c and only one developer (#6) ever did modify that file).
.
.
.
UPDATE: AUGUST 31ST 2015
Here's a log from an occurrence of the problem on August 28th 2015
Needs Merge "/view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp" to /main/MAIN_INT_STREAM/SUB_STREAM/CHECKEDOUT from /main/MAIN_INT_STREAM/SUB_STREAM/MY_DEV_STREAM/37 base /main/
MAIN_INT_STREAM/SUB_STREAM/150
********************************
<<< file 1: /view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp##/main/MYDYNAMICVIEW/SUB_STREAM/150
>>> file 2: /view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp##/main/MYDYNAMICVIEW/SUB_STREAM/MY_DEV_STREAM/37
>>> file 3: /view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp
********************************
...CUT FOR BREVITY...
*** Automatic: Applying DELETION from file 3 [deleting base line 123]
...CUT FOR BREVITY...
*** Automatic: Applying INSERT from file 3 [lines 123-124]
...CUT FOR BREVITY...
*** Automatic: Applying CHANGE from file 3 [lines 1329-1335]
...CUT FOR BREVITY...
*** No Automatic Decision Possible
merge: Warning: *** Aborting...
Missing charsets in String to FontSet conversion
Missing charsets in String to FontSet conversion
Missing charsets in String to FontSet conversion
Cannot convert string "-misc-kochi mincho-medium-r-normal--0-*-*-*-*-*-*-*" to type FontSet
...GMERGE POPPED HERE...
Moved contributor "/view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp" to "/view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp.contrib".
Output of merge is in "/view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp".
Recorded merge of "/view/MYDYNAMICVIEW/vobs/DIRLEVEL1/DIRLEVEL2/SOMEFILE.cpp".
I never touched SOMEFILE.cpp - only ONE other developer ever changes it- why does it require a merge?
My net impression at the moment is that ClearCase's automatic merge is doing a bad job.
Could it be a good idea to think of using the "-qall" or "-qntrivial" options to disable ALL/MOST automatic merging -- and do EVERY/MOST merge manually? (or with an external tool?)

1 & 2 How can this be possible???
This "Do you want the CHANGE made in file 2? [yes] no" message only appears when 2 contributors differ from the base contributor.
In that case, a cleartool lsvtree (not with -graph, since you don't have X-Window server) might help seeing the version involved, and trying to make some cleartool diff to see the difference (compared to the base contributor)
3/: that is one example where, if possible, working collaboratively on the same stream/branch (instead of working each developer in one own's stream) would be better.
Regarding the update of August 2015, the key error message is:
Missing charsets in String to FontSet conversion
See technote "Using GUI results in "Missing charsets in String to FontSet conversion" warning"
Possible causes include:
An improper setting of the locale variable. For example it may be set to UTF-8.
The file of interest is the Linux/palette, which defines the actual fonts used in the environment. This file is read to determine the fonts that can be displayed in the ClearCase GUI.
The palette file does not contain the correct fontset.
This issue was identified as a product defect and logged under APAR PK30799.

tracking file-name version control in a real version control system?

Is there an established method to tell the SCM, mercurial in my case, that files of the pattern foobaz_1_2_3.csv should all be considered versions of foobaz.csv ?
In my application I rely on data tables from an external source that put the version number in the filename. The importance of tracking changes across their versions was made painfully sharp recently when I spend days troubleshooting a bug on my side of the fence, only to discover it was because they changed some data content and notification of said change did not reach me.
If the filename was constant Hg would have informed me immediately of the internal change and I could have responded appropriately in an hour or two, with very little stress. I could just adopt the habit of renaming foobaz_2_3_4 to foobaz myself before checking in, or running diff old new and one or both of those is likely what I'll do from now on.
The whole experience has me wondering though if there might be other methods I've not thought of that don't mess with the external file. (for example what of I have a downstream user who doesn't use SCM and relies on the filename+version number, which I've thrown away?)

If you get data in file with permanently changed name and (possibly) changeable data, you can:
Store data-file under version-control (mercurial is OK)
replace old file with new every time
hg addremove -s nnn (Check Manual hg help addremove) will detect possible rename and include new file in history of old

How to join two files in a version control system

I am doing a refactoring of my C++ project containing many source files.
The current refactoring step includes joining two files (say, x.cpp and y.cpp) into a bigger one (say, xy.cpp) with some code being thrown out, and some more code added to it.
I would like to tell my version control system (Perforce, in my case) that the resulting file is based on two previous files, so in future, when i look at the revision history of xy.cpp, i also see all the changes ever done to x.cpp and y.cpp.
Perforce supports renaming files, so if y.cpp didn't exist i would know exactly what to do. Perforce also supports merging, so if i had 2 different versions of xy.cpp it could create one version from it. From this, i figure out that joining two different files is possible (not sure about it); however, i searched through some documentation on Perforce and other source control systems and didn't find anything useful.
Is what i am trying to do possible at all?
Does it have a conventional name (searching the documentation on "merging" or "joining" was unsuccessful)?

You could try integrating with baseless merges (-i on the command line). If I understand the documentation correctly (and I've never used it myself), this will force the integration of two files. You would then need to resolve the integration however you choose, resulting in something close to the file you are envisioning.
After doing this, I assume the Perforce history would show the integration from the unrelated file in it's integration history, allowing you to track back to that file when desired.

I don't think it can be done in a classic VCS.
Those versioning systems come in two flavors (slide 50+ of Getting git by Scott Chacon):
delta-based history: you take one file, and record its delta. In this case, the unit being the file, you cannot associate its history with another file.
DAG-based history: you take one content and record its patches. In this case, the file itself can vary (it can be renamed/moved at will), and it can be the result of two other contents (so it is close of what you want)... but still within the history of one file (the contents coming from different branches of its DAG).

The easy part would be this:
p4 edit x.cpp y.cpp
p4 move x.cpp xy.cpp
p4 move y.cpp xy.cpp
Then the tricky part becomes resolving the move of y.cpp and doing your refactoring. But this will tell Perforce that the files are combined.

Which Version control?

If a project has multiple people, say, A,B,C working together and they all edit a same source file.
Couple months later, they realize that what A has been doing is wrong and they want to roll back the file in such a way that only parts/functions/lines/... that A "touched" are removed and the work B and C did is still in the roll back version. In other words, the roll back version has only the work of B and C up to the time they decide to remove A's work.
Is there any version/source control software out there (free/commercial) can do that?
Thanks.

Git and a bit of scripting will do that. Probably a bit of hand work too, but you can resort commits using interactive rebase.

Most VCSs should be able to do this -- it's a reverse merge. In Subversion you would identify the revisions made by A and merge them in again, but the other way round. To oversimplify, this means turning line additions into line removals, and vice versa.
# Don't want revision 37 because A made it.
$ svn merge -r 37:36 path
http://svnbook.red-bean.com/en/1.5/svn.branchmerge.basicmerging.html#svn.branchmerge.basicmerging.undo

I use TFS and Git. But, there are a lot of free and open source version control softwares. You can find all the source control softwares here.

In Git, you would probably do something like
git revert `git rev-list --author=A`
[Note: completely untested.]

I bet it can (easily) be done with Monotone by using `mtn local kill_certs selector certname [certval]' command (see reference) which:
This command deletes certs with the given name on revisions that match the given selector. If a value is given, it restricts itself to only delete certs that also have that same value. Like kill_revision, it is a very dangerous command; it permanently and irrevocably deletes historical information from your database.
So, by using A's certificate, the above command will eliminate 'wrong work' done by him.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse