I'm trying to implement a three-way diff/merge algorithm (in python) between, say, the base version X and two different derivative versions A and B, and I'm having trouble figuring out how to handle some changes.
I have a line-by-line diff from X to A, and from X to B. These diffs give, for each line, an "opcode" which is either =, if the line didn't change, + if a line was added, - if a line was removed, or c if the line was changed (which is simply a - immediately followed by a +, indicating a line was removed and then replaced, effectively modified).
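(For concreteness, here is roughly how I produce those opcodes with Python's difflib; the mapping of difflib's tags onto =, +, -, c is my own rough convention:)

import difflib

def line_diff(base, derived):
    # Yield (opcode, base_line, derived_line) tuples; None marks a missing side.
    sm = difflib.SequenceMatcher(None, base, derived)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                yield ("=", base[i], derived[j])
        elif tag == "delete":
            for i in range(i1, i2):
                yield ("-", base[i], None)
        elif tag == "insert":
            for j in range(j1, j2):
                yield ("+", None, derived[j])
        else:  # "replace": a run of '-' immediately followed by '+', i.e. 'c'
            old, new = base[i1:i2], derived[j1:j2]
            for o, n in zip(old, new):
                yield ("c", o, n)
            for o in old[len(new):]:
                yield ("-", o, None)
            for n in new[len(old):]:
                yield ("+", None, n)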
Now I'm comparing corresponding opcodes from the A-diff and the B-diff, to try to decide how to merge them. Some of these opcode combos are easy: = and = means neither version changed the line, so we keep the original. + and = means that a line was added on one side and no change was made on the other, so accept the addition and advance to the next line only on the side that added the line. And - and c is a conflict that the user must resolve, because on one side a line was changed, and on the other side the same line was removed.
However, I'm struggling with what to do with a + and a -, or a + and a c. In the first case for instance, I added a new line on one side, and deleted a subsequent line on the other side. Strictly, I don't think this is a conflict, but what if the addition was relying on that line being there? I guess that applies to the entire thing (something added in one place may rely on something somewhere else to make sense). The second case is similar, I added a line on one side, and on the other side I changed a subsequent line, but the addition may be relying on the original version of the line.
What is the normal approach to handling this?
The usual robust strategy (diff3, git's resolve, ...) is that changes (+, -, c) in one file must be at least N (e.g. 3) context lines away from competing changes in the other file, unless the two changes are exactly equal. Otherwise it's a conflict for manual resolution. This is similar to the requirement of some clean context when applying a patch.
Here is an example where somebody tries some fancy extra strategies, like "# if two Delete actions overlap, take the union of their ranges", to reduce certain conflicts. But that's risky; and conversely, there is no guarantee that concurrent changes do not cause problems even when they are very far apart.
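As a rough sketch of that rule in Python (the hunk representation and helper name here are my own, not taken from any particular tool):

CONTEXT = 3

def conflicts(hunk_a, hunk_b):
    # Each hunk is (start, end, replacement_lines) in base-file (X) line numbers.
    # Identical changes on both sides are not a conflict.
    if hunk_a == hunk_b:
        return False
    a_start, a_end, _ = hunk_a
    b_start, b_end, _ = hunk_b
    # Conflict if the two ranges overlap or come within CONTEXT lines of each other.
    return a_start <= b_end + CONTEXT and b_start <= a_end + CONTEXT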
Related
Typical merge and source control tools mistake added code for changed code, especially if you merge several branches/versions. Is there a solution for merging several versions/branches of code without deleting a symbol, so that the result of the merge is an interleaving of both files?
A use case: one branch adds a field and another branch adds a similar field to a website/API. The changes are practically isomorphic, so they might confuse a source code management tool.
Adding a non-trivial field is typically a tedious, error-prone task due to the MVC pattern and multi-layering: it involves adding some files and modifying other files, lines, and symbols, yet deleting or swapping pieces of code is rarely needed. In fact I am thinking of automating the task by keeping a generic patch/branch and just changing the name and adding custom logic each time. I am aware that one can write code in such a way that adding a typical field is a push-button task, or a few lines of change, yet not everybody likes such code.
Usually I use PyCharm (the Python counterpart of IntelliJ IDEA) or SmartGit for merging, but I'm open to any tools or solutions.
Beyond Compare can't detect it automatically, but in the Text Merge view you can manually take text from both sides using the right-click Take Left Then Right or Take Right Then Left commands.
My question is partly linguistic, but very related to programming (of almost anything, web pages or anything else).
I would like to know why the word refactor was chosen for changing a program or part of it, when another word would probably be more exact and describe the change better.
IDEs (for example NetBeans or Eclipse) use this word only for renaming some part of the chosen program (project), including moving a file to another place (which, from the point of view of the OS, is probably just renaming).
But renaming is not about changing a factor (because the thing being renamed is not itself changed).
Closer to the meaning of the word refactor (as changing a factor) is manually rewriting some part, where the rewritten part has changed behaviour (but not what the program does from the outside - as written in the topic What is refactoring and what is only modifying code?).
The word "Refactoring" is derived from mathematics where you find an equivalent expression by applying factoring again. The equivalent expression does not change the final outcome but it is much easier to understand, use, or reuse.
There are many refactoring techniques and renaming is one of them. Other techniques include extract method, extract class, move method, move class, pull/push method to super/sub-class and many more.
I am posting for a sanity check, so please forgive me if this sounds a little basic. I am trying to learn more about encryption, so I figured a good starter project would be to implement the SHA-1 hashing algorithm. I found a walk-through and have hit a point where I do not know if the walk-through is wrong or my understanding of bitness/rotation/binary operations is wrong.
From the document:
Step 11.2: Put them together
After completing one of the four functions above, each variable will move on to this step before restarting the loop with the next word. For this step we are going to create a new variable called 'temp' and set it equal to: (A left rotate 5) + F + E + K + (the current word).
Notice that other than the left rotate the only operation we're doing is basic addition. Addition in binary is about as simple as it can be.
We'll use the results from the last word (79) as an example for this step.
A lrot 5:
00110001000100010000101101110100
F:
10001011110000011101111100100001
A lrot 5 + F
Out:
110111100110100101110101010010101
Notice that the result of this operation is one bit longer than the two inputs. This is just like adding 5 and 6, you will need a new place value to represent the answer. For everything to work out properly we will need to truncate that extra bit eventually. However, we do not want to do that until the end!
This does not quite work out. What I think would happen is:
A = 00110001000100010000101101110100
F = 10001011110000011101111100100001
A Left rotate 5 = 00100010001000010110111010000110
(A Left Rotate 5) + F = 10101101111000110100110110100111 (which is still 32 bits)
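Here is a quick Python sketch I used to reproduce those numbers (the helper name lrot is mine; it assumes the addition is done modulo 2^32, as the spec's use of U32s implies):

MASK = 0xFFFFFFFF

def lrot(value, count, width=32):
    # Rotate a width-bit value left by count bits, staying within width bits.
    return ((value << count) | (value >> (width - count))) & MASK

A = 0b00110001000100010000101101110100
F = 0b10001011110000011101111100100001

rotated = lrot(A, 5)
total = (rotated + F) & MASK  # addition modulo 2**32, i.e. drop any carry out of bit 31

print(f"{rotated:032b}")  # 00100010001000010110111010000110
print(f"{total:032b}")    # 10101101111000110100110110100111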
What I need is just another set of eyes on this to say "Yes krtzer, you are correct and this document is wrong" or "Your understanding of bitness, endianness, and/or bit rotation is wrong, this is how it works".
Right now I am not sure if my integer representation is wrong (the spec says use U32s, but this section says that I need to keep track of the extra bits), the endianness of my program is messing up the rotation (I use little endian) or there is something else.
Any experience or insight will be appreciated!
You are correct in your understanding of how everything works. The problem was with the article (which I wrote). An extra digit must always be added in the beginning of step 11.2, whether it's necessary or not, and if it's not necessary, it should be set to 1.
The article now reads:
Notice that the result of this operation is one bit longer than the two inputs. After each iteration the new word should be one bit longer than the last. Sometimes this will be a necessary carrier bit (like the extra place value you need to represent the result of adding the two single digit numbers 5 and 6 in base 10), and when that's not needed you must simply prepend a 1. For everything to work out properly we will need to truncate that extra bit eventually; however, we do not want to do that until the end!
The article was also unclear about the fact that an already rotated A was being shown in the example.
I'm working on a new weave-based data structure for storing version control history. This will undoubtedly cause some religious wars about whether it's The Right Way Of Doing Things when it comes out, but that isn't my question right now.
My question has to do with what output blame should give. When a line of code has been added, removed, and merged into itself a number of times, it isn't always clear what revision should get blame for it. Notably, this means that when a section of code is deleted, all record of it having been there is gone, and there is no blame for the removal. Everyone I've gone over this issue with has said that trying to do better simply isn't worth it. Sometimes people put in the hack that the line after the deleted section has its blame changed from whatever it actually was to the revision in which the section got deleted. Presumably if the section is at the end then the last line gets its blame changed, and if the file winds up empty then the blame really does disappear into the aether, because there's literally nowhere left to put blame information. For various technical reasons I won't be using this hack, but I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial (but feel free to flame me and get it out of your system).
Moving on to my actual question. Usually in blame for each line you look at the complete history of where it was added and removed in the history and using three-way merge (or, in the case of criss-cross merges, random bullshit) and based on the relationships between those you determine whether the line should have been there based on its history, and if it shouldn't but is then you mark it as new with the current revision. In the case where a line occurs in multiple ancestors with different blames then it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.
Where my new system diverges is that rather than doing a complicated calculation over the whole history to decide whether a given line should be in the current revision, it simply looks at the immediate ancestors, and if the line is in any of them it picks an arbitrary one to inherit the blame from. I'm making this change for largely technical reasons (and it's entirely possible that other blame implementations do the same thing, for similar technical reasons and a lack of caring), but after thinking about it a bit, part of me actually prefers the new behavior as being more intuitive and predictable than the old one. What does everybody think?
I actually wrote one of the blame implementations out there (Subversion's current one, I believe, unless someone replaced it in the past year or two). I helped with some others as well.
At least most implementations of blame don't do what you describe:
Usually, blame for each line looks at the complete history of where the line was added and removed, and, using three-way merge (or, in the case of criss-cross merges, random bullshit), determines from the relationships between those revisions whether the line should have been there based on its history; if it shouldn't be there but is, it gets marked as new with the current revision. In the case where a line occurs in multiple ancestors with different blames, it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.
Actually, most blames are significantly less complex than this and don't bother trying to use the relationships at all: they just walk the parents in some arbitrary order, using simple delta structures (usually the same internal structure their diff algorithm uses before it turns it into textual output) to see whether the chunk changed, and if so, blame it and mark that line as done.
For example, Mercurial just does an iterative depth first search until all lines are blamed. It doesn't try to take into account whether the relationships make it unlikely it blamed the right one.
Git does do something a bit more complicated, but still, not quite like you describe.
Subversion does what Mercurial does, but the history graph is very simple, so it's even easier.
In turn, what you are suggesting is, in fact, what all of them really do:
Pick an arbitrary ancestor and follow that path down the rabbit hole until it's done, and if it doesn't cause you to have blamed all the lines, arbitrarily pick the next ancestor, continue until all blame is assigned.
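In Python-ish terms, a toy sketch of that walk might look like this (parents and unchanged_lines are hypothetical stand-ins for the VCS's real history graph and delta structures):

def blame(target, num_lines, parents, unchanged_lines):
    # parents(rev) -> list of parent revisions (walked in arbitrary order)
    # unchanged_lines(parent, rev) -> {line number in rev: matching line number in parent}
    result = {}  # line number in target -> revision blamed for it

    def walk(rev, mapping):
        # mapping: {line in target: where that line currently sits in rev}
        remaining = dict(mapping)
        for parent in parents(rev):
            if not remaining:
                return
            unchanged = unchanged_lines(parent, rev)
            inherited = {t: unchanged[r] for t, r in remaining.items() if r in unchanged}
            for t in inherited:
                del remaining[t]
            walk(parent, inherited)  # those lines existed before rev; keep walking
        for t in remaining:          # no parent has the line, so rev introduced it
            result[t] = rev

    walk(target, {i: i for i in range(num_lines)})
    return result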
On a personal level, I prefer your simplified option.
Reason: Blame isn't used very much anyway.
So I don't see a point in wasting a lot of time doing a comprehensive implementation of it.
It's true. Blame has largely turned out to be one of those "pot of gold at the end of the rainbow" features. It looked really cool to those of us standing on the ground, dreaming about a day when we could just click on a file and see who wrote which lines of code. But now that it's widely implemented, most of us have come to realize that it actually isn't very helpful. Check the activity on the blame tag here on Stack Overflow. It is underwhelmingly desolate.
I have run across dozens of "blame-worthy" scenarios in recent months alone, and in most cases I have attempted to use blame first, and found it either cumbersome or utterly unhelpful. Instead, I found the information I needed by doing a simple filtered changelog on the file in question. In some cases, I could have found the information using Blame as well, had I been persistent, but it would have taken much longer.
The main problem is code formatting changes. The first-tier blame for almost everything was listed as... me! Why? Because I'm the one responsible for fixing newlines and tabs, re-sorting function order, splitting functions into separate utility modules, fixing comment typos, and improving or simplifying code flow. And if it wasn't me, someone else had done a whitespace or block-move somewhere along the way as well. In order to get a meaningful blame on anything dating back to a time I can't already remember without the help of blame, I had to roll back revisions and re-blame. And re-blame again. And again.
So in order for blame to actually be a useful time saver in more than the luckiest of situations, it has to be able to heuristically make its way past newline, whitespace, and ideally block copy/move changes. That sounds like a very tall order, especially since scouring the changelog for a single file usually won't yield many diffs anyway, and you can just sift through them by hand fairly quickly. (The notable exception being, perhaps, badly engineered source trees where 90% of the code is stuffed in one or two ginormous files... but who in a collaborative coding environment does much of that anymore?)
Conclusion: Give it a bare-bones implementation of blame, because some people like to see "it can blame!" on the features list. And then move on to things that matter. Enjoy!
The line-merge algorithm is stupider than the developer. If they disagree, that just indicates that the merger is wrong rather than indicating a decision point. So, the simplified logic should actually be more correct.
My goal is coming up with a script to track the point at which a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts). I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed, but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separately from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology: a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (the paragraph starting with "My favorite mode"). Does anyone know what that is? (It seems to be based on Tichy, see bottom.)
Here's more info on the two missing features:
no concept of a change on a line (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b', but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of the edits is the same (just an expanded version of markup_instraline_changes, but comparing edit distance on all equal-sized subsets of old and new lines).
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way, but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check the Levenshtein distance between all added and all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem). A rough sketch of both ideas follows.
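For what it's worth, here's a simplified greedy version of the pairing idea, using difflib's similarity ratio rather than a proper Levenshtein distance; pair_edits and the threshold value are my own invention:

import difflib

def pair_edits(removed, added, threshold=0.6):
    # Pair each removed line with the most similar added line; pairs above the
    # threshold are treated as edits-in-place (or as moves, if run across hunks)
    # rather than an unrelated delete plus add.
    pairs = []
    unmatched_added = list(added)
    for old in removed:
        best, best_ratio = None, threshold
        for new in unmatched_added:
            ratio = difflib.SequenceMatcher(None, old, new).ratio()
            if ratio > best_ratio:
                best, best_ratio = new, ratio
        if best is not None:
            pairs.append((old, best))
            unmatched_added.remove(best)
    return pairs

print(pair_edits(["b"], ["bc", "d"]))  # [('b', 'bc')] -- 'd' stays a pure addition

Run within a single 'replace' block this covers the first feature; run over all removed lines against all added lines in the whole diff it approximates the move detection, with the runtime and false-positive caveats noted above.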
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84, which appears to be The String-to-String Correction Problem with Block Moves (which I haven't read yet), according to Walter Tichy's paper a year later on RCS.
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the language it is in, discovering the code structures, and computing least-Levenshtein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other languages are in the pipe, because the SmartDifferencer is built on top of a language-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust language grammars.
I think the question of how much editing a line can undergo while remaining a descendant of some previously written line is very subjective, and based on context, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity between lines in your program, I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example, incrementing the value of some variable), and this will be quite a common thing, so your desired algorithm won't really give truthful or useful information about a line quite often.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and Strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
Do a Levenshtein distance from the old to new lists ...
... keeping all intermediate data
Find all lines in the new text that were matched with old lines
Mark the line in both new/old original lists as having been matched
Delete the line from the new text (the copy)
Optional: If some matched lines are in a contiguous sequence ...
... in either original text assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts ...
... which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
If any match before and after attributes with new text groups for each
//If they are inside the same area basically
Concatenate all the lines in both groups (separately and in order)
Include a character to represent where the line breaks are
Repeat
Do a Levenshtein distance on these concatenations
If there are any significantly similar subsequences found
//I can't really define this but basically a high proportion
//of matches throughout all lines involved on both sides
For each matched subsequence
Find suitable newline spots to delimit the subsequence
Mark these lines matched in the original text
//Warning splitting+merging of lines possible
//No 1-to-1 correspondence of lines here!
Delete the subsequence from the new text group concat
Delete also from the new text working list of lines
Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat last step
//Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespaced-removed lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
//Newline character added in both cases
Repeat
Do Levenshtein distance on these concatenations
Match similar subsequences in the same way as earlier on
//Don't need to worry deleting from list of new lines any more though
//Similarity criteria should be a fair bit stricter here to avoid
// spurious matchings. Already matched lines in old text might have
// even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
Well, if you try implementing this, tell me how it works out, and what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abysmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one.
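If it helps, here is a bare-bones Python sketch of just the first pass (a simplification: it leans on difflib's matching blocks instead of repeated Levenshtein runs, and first_pass/normalize are names I've made up):

import difflib

def first_pass(old_lines, new_lines):
    # Strip all whitespace inside each line and drop blank lines, as in the first steps.
    normalize = lambda s: "".join(s.split())
    old = [normalize(l) for l in old_lines if l.strip()]
    new = [normalize(l) for l in new_lines if l.strip()]
    matcher = difflib.SequenceMatcher(None, old, new)
    matched_old, matched_new = set(), set()
    for block in matcher.get_matching_blocks():  # (a, b, size) triples
        for k in range(block.size):
            matched_old.add(block.a + k)
            matched_new.add(block.b + k)
    # Whatever is left unmatched in the new text feeds the later, pricier passes.
    unmatched_new = [new[j] for j in range(len(new)) if j not in matched_new]
    return matched_old, matched_new, unmatched_new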