How well does Python's whitespace dependency interact with source control with regards to merging? - version-control

I'm wondering if the need to alter the indentation of code to adjust the nesting has any adverse effects on merging changes in a system like SVN.

I've used python with SVN and Mercurial, and have no hassles merging.
It all depends on how the diffing is done - and I suspect that it is character-by-character, which would notice the difference between one level of indent and another.

It works fine so long as everyone on the project has agreed to use the same whitespace style (spaces or tabs).
But I've seen cases where a developer has converted an entire file from spaces to tabs (I think Eclipse had that as a feature, bound to Ctrl+Tab!), which makes spotting diffs near impossible.

Generally source control systems merge on a line-by-line basis by default. I have found that merging Python code is no different from merging any other source code that is reasonably indented. If one programmer wraps a block of code in an if statement (indenting the whole block), and another programmer modifies something inside the block, then there will be a merge conflict. Fortunately, the conflict in this case is super easy to resolve.
If you use an external merge tool, then your tool may support more detailed textual merging algorithms that take the above scenario into account automatically.

Related

All Inclusive file version merge

Typical merge and source control tools mistake added code for code change especially if you merge several branches/version. Is there solution to merge several versions /branches of code without deleting a symbol? So the result of merge is interleaving of both files?
A use case one branch adds a fields and another branch adds a similar fields to website/api. The changes are practically isomorphic so might confuse a source code management tool.
Adding a non trivial field is typically tedious error prone task due MVC pattern and multi-layering, and involves adding some files, modifying other, lines and symbols yet deletion or swapping pieces of code is rarely needed. In fact I think to automate the task by having a generic patch/branch and just changing name and adding custom logic each time. I am aware one can code in way the adding of typical fields is push button task, or few lines change, yet not everybody likes such code.
Usually I use PyCharm (python version of IntellijIdea) or Smart Git for merging, but open to any tools or solutions.
Beyond Compare can't automatically detect it, but in the Text Merge, you can manually take text from both sides using the right click Take Left Then Right or Take Right Then Left commands.

Is simplified semantics for the 'blame' command a good thing?

I'm working on a new weave-based data structure for storing version control history. This will undoubtedly cause some religious wars about whether it's The Right Way Of Doing Things when it comes out, but that isn't my question right now.
My question has to do with what output blame should give. When a line of code has been added, removed, and merged into itself a number of times, it isn't always clear what revision should get blame for it. Notably this means that when a section of code is deleted, all records of it having been there is gone, and there is no blame for the removal. Everyone I've gone over this issue with has said that trying to do better simply isn't worth it. Sometimes people put in the hack that the line after the section which got deleted has its blame changed from whatever it actually was to the revision when the section got deleted. Presumably if the section is at the end then the last line get its blame changed, and if the file winds up empty then the blame really does disappear into the aether, because there's literally nowhere left to put blame information. For various technical reasons I won't be using this hack, but assume that continuing but with this completely undocumented but de facto standard practice will be uncontroversial (but feel free to flame me and get it out of your system).
Moving on to my actual question. Usually in blame for each line you look at the complete history of where it was added and removed in the history and using three-way merge (or, in the case of criss-cross merges, random bullshit) and based on the relationships between those you determine whether the line should have been there based on its history, and if it shouldn't but is then you mark it as new with the current revision. In the case where a line occurs in multiple ancestors with different blames then it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.
Where my new system diverges is that rather than doing a complicated calculation of whether a given line should be in the current revision based on a complex calculation of the whole history, it simply looks at the immediate ancestors, and if the line is in any of them it picks an arbitrary one to inherit the blame from. I'm making this change for largely technical reasons (and it's entirely possible that other blame implementations do the same thing, for similar technical reasons and a lack of caring) but after thinking about it a bit part of me actually prefers the new behavior as being more intuitive and predictable than the old one. What does everybody think?
I actually wrote a one of the blame implementations out there (Subversion's current one I believe, unless someone replaced it in the past year or two). I helped with some others as well.
At least most implementations of blame don't do what you describe:
Usually in blame for each line you look at the complete history of where it was added and removed in the history and using three way merge (or, in the case of criss-cross merges, random bullshit) and based on the relationships between those you determine whether the line should have been there based on its history, and if it shouldn't but is then you mark it as new with the current revision. In the case where a line occurs in multiple ancestors with different blames then it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.
Actually, most blames are significantly less complex than this and don't bother trying to use the relationships at all, but they just walk parents in some arbitrary order, using simple delta structures (usually the same internal structure whatever diff algorithm they have uses before it turns it into textual output) to see if the chunk changed, and if so, blame it, and mark that line as done.
For example, Mercurial just does an iterative depth first search until all lines are blamed. It doesn't try to take into account whether the relationships make it unlikely it blamed the right one.
Git does do something a bit more complicated, but still, not quite like you describe.
Subversion does what Mercurial does, but the history graph is very simple, so it's even easier.
In turn, what you are suggesting is, in fact, what all of them really do:
Pick an arbitrary ancestor and follow that path down the rabbit hole until it's done, and if it doesn't cause you to have blamed all the lines, arbitrarily pick the next ancestor, continue until all blame is assigned.
On a personal level, I prefer your simplified option.
Reason: Blame isn't used very much anyway.
So I don't see a point in wasting a lot of time doing a comprehensive implementation of it.
It's true. Blame has largely turned out to be one of those "pot of gold at the end of the rainbow" features. It looked really cool from those of us standing on the ground, dreaming about a day when we could just click on a file and see who wrote which lines of code. But now that it's widely implemented, most of us have come to realize that it actually isn't very helpful. Check the activity on the blame tag here on Stack Overflow. It is underwhemingly desolate.
I have run across dozens of "blame-worthy" scenarios in recent months alone, and in most cases I have attempted to use blame first, and found it either cumbersome or utterly unhelpful. Instead, I found the information I needed by doing a simple filtered changelog on the file in question. In some cases, I could have found the information using Blame as well, had I been persistent, but it would have taken much longer.
The main problem is code formatting changes. The first-tier blame for almost everything was listed as... me! Why? Because I'm the one responsible for fixing newlines and tabs, re-sorting function order, splitting functions into separate utility modules, fixing comment typos, and improving or simplifying code flow. And if it wasn't me, someone else had done a whitespace or block-move somewhere along-the-way as well. In order to get a meaningful blame on anything dating back to a time before I can already remember without the help of blame, I had to roll back revisions and re-blame. And re-blame again. And again.
So in order for a blame to actually be a useful time saver for more than the most lucky of situations, the blame has to be able to heuristicly make its way past newline, whitespace, and ideally block copy/move changes. That sounds like a very tall order, especially when scouring the changelog for a single file, most of the time, it won't yield many diffs anyway and you can just sift through by hand fairly quickly. (The notable exception being, perhaps, badly engineered source trees where 90% of the code is stuffed in one or two ginormous files... but who these days in a collaborative coding environment does much of that anymore?).
Conclusion: Give it a bare-bones implementation of blame, because some people like to see "it can blame!" on the features list. And then move on to things that matter. Enjoy!
The line-merge algorithm is stupider than the developer. If they disagree, that just indicates that the merger is wrong rather than indicating a decision point. So, the simplified logic should actually be more correct.

Is there a diff tool (patch) that is aware of indentation?

I'm regularly using the gnu-utils patch and diff. Using git, I often do:
git diff
Often simple changes create a large patch because the only that changed was, for example, adding a if/else loop and everything inside is indented to the right.
Reviewing such a patch can be cumbersome because only line by line manual comparison can indicate if anything has essentially changed within the indented code. We may be speaking about a few lines of code only, or about dozens (or much more) of nested code. (I know: such an hypothetically large function would better be split into smaller functions, but that's beside the point).
Can't GNU diff/patch be aware when the only change within a code block is the indentation and let the developer know as much?
Are there any other diff tools that operate this way?
Edit: Ok, there is --ignore-space-change but then we are in a either/or situation: either we have a human-more-readable patch or we have a complete patch that the machine would know how to read. Can't we have the best of both world with a more elaborate diff tool that would show to the human space changes for what they are while allowing the machine to apply the patch fully?
With GNU diff you can pass -b or --ignore-space-change to ignore changes in the amount of white space in a patch.
If you use emacs and have been sent a patch, you can also use M-x diff-ignore-whitespace-hunk to reformat the patch to ignore white space in a particular hunk. Or diff-refine-hunk to highlight changes at a character by character level, which tends to point out the "meat" of a change.
As for applying patches, you can use the -l or --ignore-whitespace with GNU patch to ignore tabs and spaces changes. Just be careful with Python code :-)
For what is worth, using git difftool with a tool like meld or xxdiff makes the diff much more readable.
I don't know about git diff. But a diff-like tool that understands not just indentation but in fact any layout changes in your target language is our Smart Differencer.
This tool parses the before- and after- versions of your code the same way compiler does, and compares the resulting syntax trees, so it isn't affected by whitespace changes (except semantically important whitespace such as Python indentation) of any kind, inserted or deleted comments, or even change of radix on constants.
The result is report in terms of programmer editing actions ("move, insert, delete, copy, rename") over language structures (expressions, statements, declarations, blocks, methods, ...) rather than "insert line" or "delete line".
I try to not do file-wide indentation changes in the same commit as some other changes. And I commit the indentation changes in a separate commit before or after, with a commit message of "Changed indentation only.", to make it clear so that no manual diff inspection is needed, to see if something else was changed.

Code formatting and source control diffs

What source control products have a "diff" facility that ignores white space, braces, etc., in calculating the difference between checked-in versions? I seem to remember that Clearcase's diff did this but Visual SourceSafe (or at least the version I used) did not.
The reason I ask is probably pretty typical. Four perfectly reasonable developers on a team have four entirely different ways of formatting their code. Upon checking out the code last changed by someone else, each will immediately run some kind of program or editor macro to format things the way they like. They make actual code changes. They check-in their changes. They go on vacation. Two days later that program, which had been running fine for two years, blows up. The developer assigned to the bug does a diff between versions and finds 204 differences, only 3 of which are of any significance, because the diff algorithm is lame.
Yes, you can have coding standards. Most everyone finds them dreadful. A solution where everyone can have their cake and eat it too seems far more preferable.
=========
EDIT: Thanks to everyone for some great suggestions.
What I take away from this is:
(1) A source control system with plug-in type diffs is preferable.
(2) Find a diff with suitable options.
(3) Use a good source formatting program and settle on a check-in standard.
Sounds like a plan. Thanks again.
Git does have these options:
--ignore-space-at-eol
Ignore changes in whitespace at EOL.
-b, --ignore-space-change
Ignore changes in amount of whitespace. This ignores whitespace at line end, and considers all other sequences of one or more
whitespace characters to be equivalent.
-w, --ignore-all-space
Ignore whitespace when comparing lines. This ignores differences even if one line has whitespace where the other line has
none.
I am not sure if brace changes can be ignored using Git's diff.
If it is C/C++ code, you can define Astyle rules and then convert the source code's brace style to the one that you want, using Astyle. A git diff will then produce sane output.
Choose one (dreadful) coding standard, write it down in some official coding standards document, and get on with your life, messing with whitespace is not productive work.
And remember you are a professional developer, it's your job to get the project done, changing anything in the code because of a personal style preference hurts the project - it wont only make diff-ing more difficult, it can also introduce hard to find problems if your source formatter or compiler has bugs (and your fancy diff tool won't save you when two co-worker start fighting over casing).
And if someone just doesn't agree to work with the selected style just remind him (or her) that he is programming as a profession not as an hobby, see http://www.ericsink.com/entries/No_Great_Hackers.html
Maybe you should choose one format and run some indentation tool before checking in so that each person can check out, reformat to his/her own preferences, do the changes, reformat back to the official standard and then check in?
A couple of extra steps but they already use indentation tools when working. Maybe it can be a triggered check-in script?
Edit: this would perhaps also solve the brace problem.
(I haven't tried this solution myself, hence the "perhapes" and "maybes", but I have been in projects with the same problems, and it is a pain to try to go through diffs with hundreds of irrelevant changes that are not limited to whitespace, but includes the formatting itself.)
As explained in Is it possible for git-merge to ignore line-ending differences?, it is more a matter to associate the right diff tool to your favorite VCS, rather than to rely on the right VCS option (even if Git does have some options regarding whitespace, like the one mentioned in Alan's answer, it will always be not as complete as one would like).
DiffMerge is the more complete on those "ignore" options, as it can not only ignore spaces but also other "variations" based on the programming language used in a given file.
Subversion apparently supports this, either natively in the latest versions, or by using an alternate diff like Gnu Diff.
Beyond Compare does this (and much much more) and you can integrate it either in Subversion or Sourcesafe as an external diff tool.

How do you prevent file confusion if version-control keywords are forbidden?

At least two brilliant programmers, Linus Torvalds and Guido von Rossum, disparage the practice of putting keywords into a file that expand to show the version number, last author, etc.
I know how keyword differences clutter up diffs. One of the reasons I like SlickEdit's DiffZilla is because it can be set to skip leading comments.
However, I have vivid memories of team-programming where we had four versions of a file (two different releases, a customer one-off, and the development version) all open for patching at the same time, and was quite helpful to verify with a glance that each time we navigated to an included header we got the proper one, and each time we pasted code the source and destination were what we expected.
There is also the where-did-this-file-come-from problem that arises when a hasty developer copies a file from one place to another using the file system, rather than checking it out of the repository using the tool; or, more defensibly, when files under control in locations A, B, and C need to be marshalled (with cherry-picking) into a distribution location D.
In places where VCS keywords are banned, how do you cope?
I've never used VCS keywords in my entire career, over 30 years. From the most primitive VCS system I've used, up to the present (TFS), I've used some other structure to understand "where I am".
I am rarely in a situation where I've only got one file to work with. I've usually got all the other files necessary to build the project or set of projects. I usually use branching (or streams on one occasion), and I'm working on some slice of the given branch or stream.
If I'm working on multiple branches or streams, I'll have one directory tree for each. All I need to do to know what file I'm working on is check the file path, at the very worst.
At the very best, the version control system will tell you exactly which version of the file you're working on, what the change history is, who else is working on different versions of the file, and anything else you'd care to know.
This doesn't exactly answer your question, but I imagine Linus and Guido have reasons for disliking keywords that don't apply to small-team corporate development.
An $Id$ tag for instance, has what you could consider to be a global version number. Linux and I guess also Python development is fragmented enough that no number can be global. Lots of people have their own repositories all over the place that would fill in their own $Id$ values and then those patches might be sent to Linus or Guido's repositories where they don't make any sense.
However, in your environment, you probably have one central repository which would assign these and it would be fine. Sounds like you're using git. I wonder if it's possible to configure the central git repository to do tag substitution while the local developer repositories don't. Or perhaps it's better to get the commit hash in the tag.