Is there a way to diff files sentence-by-sentence instead of line-by-line?

Just trying to get diff to work better for certain kinds of documents. With LaTeX, for example, I might have a long paragraph that is strictly just one line, but I don't want to see that entire paragraph if just a sentence is changed. Particularly if I'm running some kind of version control and a co-author edits the same paragraph (but not the same sentence) as me. I wouldn't want that to show up as a conflict.
That's a secondary question. The main question is whether I can use diff to look sentence-by-sentence. Thanks.
Edit: wdiff is almost perfect. But is there a merge equivalent, as diff has with diff3?

wdiff will give you a word-by-word diff instead of line-by-line. I'm not aware of any sentence-by-sentence diff programs.

Preprocess the files before diffing them: write a script that puts one sentence per line, and then any line-by-line diff program will work.
I have done this on a C token level for diffing C code in order to make absolutely sure my CVS merge was correct.
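A minimal sketch of that preprocessing idea in Python, in case it helps. The sentence splitter is my own naive assumption (split after ".", "!" or "?" followed by whitespace), and the function names are made up for illustration; a real LaTeX document would need something smarter for abbreviations and macros.

```python
import difflib
import re

# Naive sentence boundary (an assumption): ., ! or ? followed by whitespace.
# Abbreviations such as "e.g." will be split too.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(text):
    """Return the text as a list with one sentence per entry."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def sentence_diff(old, new):
    """Unified diff where each 'line' is one sentence."""
    return list(
        difflib.unified_diff(
            sentences(old), sentences(new), "old", "new", lineterm=""
        )
    )


if __name__ == "__main__":
    old = "First sentence. Second one is here. Third stays."
    new = "First sentence. Second one was changed. Third stays."
    print("\n".join(sentence_diff(old, new)))
```

Only the changed sentence shows up with - and + markers; the rest of the paragraph appears as context.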

Related

Tools to diff, patch or merge on a word-by-word basis

For text files rather than source code, such as LaTeX, Markdown or reStructuredText, single line breaks usually do not matter for the semantics, and paragraphs are frequently refilled to 80 columns. When things are changed, the line breaks might move by quite a lot, so the common line-by-line diff and patch tools do not work very well for them. So I am wondering if there already exist good tools for diffing, patching and even merging this kind of change? wdiff and git diff --color-words do exactly this kind of thing for diffing, but they seem to lack the patching and merging capability. Ideally, if we have got a line
He do not owe us nothing.
and one author changed it into
He do not owe us anything.
and another author changed it into
He does not owe us nothing.
then a merge could give
He does not owe us anything.
without conflict. That is the ideal result. Thanks in advance.
Besides meld, you can also use Beyond Compare or WinMerge.
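For what it's worth, the merge described in the question can be sketched on word tokens with Python's difflib. This is only a rough illustration under my own assumptions: words are whitespace-separated, whitespace is not preserved, and any two edits from different authors that touch the same base position are treated as a conflict. The names edits and merge_words are hypothetical, not from any existing tool.

```python
from difflib import SequenceMatcher


def edits(base, other):
    """Changes that turn `base` into `other`, as (start, end, replacement)."""
    sm = SequenceMatcher(None, base, other, autojunk=False)
    return [
        (i1, i2, tuple(other[j1:j2]))
        for tag, i1, i2, j1, j2 in sm.get_opcodes()
        if tag != "equal"
    ]


def merge_words(base, ours, theirs):
    """Three-way merge on whitespace-separated words (a sketch)."""
    b, o, t = base.split(), ours.split(), theirs.split()
    # Identical edits made by both sides collapse into one via the set.
    all_edits = sorted(set(edits(b, o)) | set(edits(b, t)))
    for (s1, e1, _), (s2, _e2, _) in zip(all_edits, all_edits[1:]):
        if s2 < e1 or s1 == s2:
            raise ValueError(f"conflict near word {s2}")
    merged = list(b)
    # Apply from the end so earlier indices stay valid.
    for start, end, replacement in reversed(all_edits):
        merged[start:end] = replacement
    return " ".join(merged)


if __name__ == "__main__":
    print(
        merge_words(
            "He do not owe us nothing.",
            "He do not owe us anything.",
            "He does not owe us nothing.",
        )
    )
```

With the three sentences from the question this produces the merged sentence without a conflict, because the two authors changed different words.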

Is there a tool to automatically extract common subroutines from files?

I have two 1000+ line programs in Perl, each with about 20 subroutines in the main file. One was forked from the other some time ago and I want to factor out the common parts (before porting features backward.) Is there a diff tool that will treat the subroutines (and preceding comments) as units, and extract the common units into a new file? (if one line of a subroutine is different, the unit doesn’t match.)
My SCM is currently Subversion if that helps. A Perl script that processes the code would be cool.
You can try to use the PPI module; to my knowledge there's no refactoring tool like the one you describe.
If you had 500,000 lines of code it might be useful to have or write such a tool. For 1000 lines, this shouldn't be too hard with a simple visual diff tool, like BeyondCompare ($) or WinMerge (free).
You're trying to compare two different versions of the files?
I use Vim, which comes with a built-in diff mode, vimdiff, and a GUI version called gvimdiff. It will fold the common lines and show you only the lines that differ and where.
With gvim, you can open up three splits in one window (the two versions and a blank) and then copy over the various lines you want. If you're using Subversion, you can use the built-in merge tool (if you're talking about different versions of the same file). The Subversion merge is pretty good and will probably help you with the merge issues.
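If a quick script is acceptable, here is a rough sketch of the "subroutine plus preceding comments as a unit" idea. It is written in Python rather than Perl and does not parse Perl properly (PPI would be the right tool for that). It assumes every subroutine starts at column 0 with "sub name" and that a unit runs until the next unit begins, so any trailing main code gets attached to the last sub. The function names are made up for illustration.

```python
import re

# Assumption: subroutines are declared at column 0 as "sub name".
SUB_START = re.compile(r"^sub\s+(\w+)")


def split_units(source):
    """Map sub name -> text of the sub including its preceding comments."""
    lines = source.splitlines()
    starts = []
    for i, line in enumerate(lines):
        match = SUB_START.match(line)
        if match:
            start = i
            # Pull in the comment lines directly above the sub.
            while start > 0 and lines[start - 1].startswith("#"):
                start -= 1
            starts.append((start, match.group(1)))
    units = {}
    for k, (start, name) in enumerate(starts):
        end = starts[k + 1][0] if k + 1 < len(starts) else len(lines)
        units[name] = "\n".join(lines[start:end]).strip()
    return units


def common_units(source_a, source_b):
    """Units that are byte-for-byte identical in both files."""
    units_a, units_b = split_units(source_a), split_units(source_b)
    return {n: text for n, text in units_a.items() if units_b.get(n) == text}
```

Calling common_units on the contents of the two programs gives the candidates for a shared module; everything else still needs a visual diff.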

Is there a diff tool (patch) that is aware of indentation?

I'm regularly using the gnu-utils patch and diff. Using git, I often do:
git diff
Often simple changes create a large patch because the only thing that changed was, for example, adding an if/else block, so everything inside it is indented to the right.
Reviewing such a patch can be cumbersome because only a manual line-by-line comparison can tell whether anything has essentially changed within the indented code. We may be speaking about a few lines of code only, or about dozens (or many more) of lines of nested code. (I know: such a hypothetically large function would be better split into smaller functions, but that's beside the point.)
Can't GNU diff/patch be aware when the only change within a code block is the indentation and let the developer know as much?
Are there any other diff tools that operate this way?
Edit: OK, there is --ignore-space-change, but then we are in an either/or situation: either we have a more human-readable patch or we have a complete patch that the machine knows how to apply. Can't we have the best of both worlds, with a more elaborate diff tool that would show whitespace changes to the human for what they are while still allowing the machine to apply the patch fully?
With GNU diff you can pass -b or --ignore-space-change to ignore changes in the amount of white space in a patch.
If you use emacs and have been sent a patch, you can also use M-x diff-ignore-whitespace-hunk to reformat the patch to ignore white space in a particular hunk. Or diff-refine-hunk to highlight changes at a character by character level, which tends to point out the "meat" of a change.
As for applying patches, you can use the -l or --ignore-whitespace option of GNU patch to ignore changes in tabs and spaces. Just be careful with Python code :-)
For what it's worth, using git difftool with a tool like meld or xxdiff makes the diff much more readable.
I don't know about git diff. But a diff-like tool that understands not just indentation but in fact any layout changes in your target language is our Smart Differencer.
This tool parses the before and after versions of your code the same way a compiler does, and compares the resulting syntax trees, so it isn't affected by whitespace changes of any kind (except semantically important whitespace such as Python indentation), inserted or deleted comments, or even a change of radix on constants.
The result is reported in terms of programmer editing actions ("move, insert, delete, copy, rename") over language structures (expressions, statements, declarations, blocks, methods, ...) rather than "insert line" or "delete line".
I try to not do file-wide indentation changes in the same commit as some other changes. And I commit the indentation changes in a separate commit before or after, with a commit message of "Changed indentation only.", to make it clear so that no manual diff inspection is needed, to see if something else was changed.
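To illustrate what an indentation-aware report could look like, here is a small Python sketch. It is not how GNU diff or git work; it is just an assumption-laden illustration that compares lines with leading and trailing whitespace stripped and then labels matched lines whose raw text differs as "indent-only". The function name and the labels are made up.

```python
from difflib import SequenceMatcher


def classify_changes(old, new):
    """Label each change as 'added', 'removed' or 'indent-only'."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = SequenceMatcher(
        None,
        [line.strip() for line in old_lines],
        [line.strip() for line in new_lines],
        autojunk=False,
    )
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same content once stripped; differing raw lines changed
            # only in leading/trailing whitespace.
            for before, after in zip(old_lines[i1:i2], new_lines[j1:j2]):
                if before != after:
                    report.append(("indent-only", after.strip()))
        else:
            report.extend(("removed", l.strip()) for l in old_lines[i1:i2])
            report.extend(("added", l.strip()) for l in new_lines[j1:j2])
    return report


if __name__ == "__main__":
    old = "do_a()\ndo_b()\n"
    new = "if cond:\n    do_a()\n    do_b()\n"
    for label, text in classify_changes(old, new):
        print(f"{label:12} {text}")
```

The reviewer sees one real addition (the if line) and two lines flagged as indentation changes, while the normal full patch can still be generated separately for the machine.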

Code formatting and source control diffs

What source control products have a "diff" facility that ignores white space, braces, etc., in calculating the difference between checked-in versions? I seem to remember that ClearCase's diff did this but Visual SourceSafe (or at least the version I used) did not.
The reason I ask is probably pretty typical. Four perfectly reasonable developers on a team have four entirely different ways of formatting their code. Upon checking out the code last changed by someone else, each will immediately run some kind of program or editor macro to format things the way they like. They make actual code changes. They check-in their changes. They go on vacation. Two days later that program, which had been running fine for two years, blows up. The developer assigned to the bug does a diff between versions and finds 204 differences, only 3 of which are of any significance, because the diff algorithm is lame.
Yes, you can have coding standards. Most everyone finds them dreadful. A solution where everyone can have their cake and eat it too seems far more preferable.
=========
EDIT: Thanks to everyone for some great suggestions.
What I take away from this is:
(1) A source control system with plug-in type diffs is preferable.
(2) Find a diff with suitable options.
(3) Use a good source formatting program and settle on a check-in standard.
Sounds like a plan. Thanks again.
Git does have these options:
--ignore-space-at-eol
    Ignore changes in whitespace at EOL.
-b, --ignore-space-change
    Ignore changes in amount of whitespace. This ignores whitespace at line end, and considers all other sequences of one or more whitespace characters to be equivalent.
-w, --ignore-all-space
    Ignore whitespace when comparing lines. This ignores differences even if one line has whitespace where the other line has none.
I am not sure if brace changes can be ignored using Git's diff.
If it is C/C++ code, you can define Astyle rules and then convert the source code's brace style to the one that you want, using Astyle. A git diff will then produce sane output.
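To make the difference between -b and -w concrete, here is a tiny Python sketch of the two comparisons as I read the option descriptions. It is an approximation for illustration, not Git's actual implementation, and the function names are hypothetical.

```python
import re


def key_ignore_space_change(line):
    """Roughly -b: drop whitespace at line end, collapse other runs to one space."""
    return re.sub(r"\s+", " ", line.rstrip())


def key_ignore_all_space(line):
    """Roughly -w: remove all whitespace before comparing."""
    return re.sub(r"\s+", "", line)


if __name__ == "__main__":
    a, b, c = "x  =  1   ", "x = 1", "x=1"
    # Same under -b: only the amount of whitespace differs.
    print(key_ignore_space_change(a) == key_ignore_space_change(b))
    # Different under -b: one line has whitespace where the other has none.
    print(key_ignore_space_change(b) == key_ignore_space_change(c))
    # Same under -w.
    print(key_ignore_all_space(b) == key_ignore_all_space(c))
```

Neither comparison knows anything about braces, which is why a formatter such as Astyle is still needed for brace style.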
Choose one (dreadful) coding standard, write it down in some official coding standards document, and get on with your life. Messing with whitespace is not productive work.
And remember you are a professional developer: it's your job to get the project done. Changing anything in the code because of a personal style preference hurts the project. It won't only make diffing more difficult, it can also introduce hard-to-find problems if your source formatter or compiler has bugs (and your fancy diff tool won't save you when two co-workers start fighting over casing).
And if someone just doesn't agree to work with the selected style, just remind him (or her) that he (or she) is programming as a profession, not as a hobby; see http://www.ericsink.com/entries/No_Great_Hackers.html
Maybe you should choose one format and run some indentation tool before checking in so that each person can check out, reformat to his/her own preferences, do the changes, reformat back to the official standard and then check in?
A couple of extra steps but they already use indentation tools when working. Maybe it can be a triggered check-in script?
Edit: this would perhaps also solve the brace problem.
(I haven't tried this solution myself, hence the "perhaps" and "maybes", but I have been in projects with the same problems, and it is a pain to go through diffs with hundreds of irrelevant changes that are not limited to whitespace but include the formatting itself.)
As explained in Is it possible for git-merge to ignore line-ending differences?, it is more a matter of associating the right diff tool with your favorite VCS than of relying on the right VCS option (even if Git does have some whitespace options, like the ones mentioned in Alan's answer, they will never be as complete as one would like).
DiffMerge is the most complete on those "ignore" options, as it can not only ignore spaces but also other "variations" based on the programming language used in a given file.
Subversion apparently supports this, either natively in the latest versions, or by using an alternate diff like GNU diff.
Beyond Compare does this (and much, much more) and you can integrate it into either Subversion or SourceSafe as an external diff tool.

Why is it bad to commit lines with trailing whitespace into source control?

Why is it bad to check in lines with trailing whitespace to your source control? What kinds of problems could that cause?
False differences, basically. It's helpful if diffs only show "real" changes. Some diff programs will ignore whitespace, but it would be better just to avoid the dummy change in the first place.
Of course, it also doesn't help if it makes the line wrap on a colleague's machine.
Because many people remove them, your lines will show up as modified in diff tools unless all the whitespace options are used (say, with a plain old cvs diff), which means people end up looking at your line for no good reason.
In theory you could also have strings that wrap lines where whitespace would truly be bad, but... probably not your issue.
It's like painting your walls, but not finishing the edges off properly, and going right onto the skirting board.
Some editors automatically remove trailing whitespace, some don't. This creates diff noise and can cause merge conflicts.
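If you want to catch these before they reach the repository, a check can be as small as the Python sketch below. The function name is made up, and wiring it into a commit hook is left out since that depends on your source control.

```python
def trailing_whitespace_lines(text):
    """Return the 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip()
    ]


if __name__ == "__main__":
    sample = "a = 1 \nb = 2\n\tc\t\n"
    print(trailing_whitespace_lines(sample))
```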
Yeah, I sort of agree with the other posts, but I would add that it's not bad per se. It is not a great practice, but that's the sort of thing that happens and you just sort of sigh and get on with things.
Modern diff utilities don't get hung up on whitespace.