Graphically view long line diffs that mostly match - diff

I have two data files. They're both 5000 lines long, and each row is ~10000 characters wide. They are nearly identical, except a few characters on some lines are different. Is there any tool that will jump both vertically and horizontally to the disagreements for both sides, so I can see what's different?
I'm using OSX, but I could use Linux or windows if it was absolutely necessary. I've tried:
FileMerge
KDiff3
DiffMerge
command line diff
Now trying to install meld

vimdiff, or even better its graphical twin gvimdiff, would work if there is at most two edits per line. It highlights the part between the first and last changed character on a line. It also collapses the parts of the files which are identical.
These are part of Vim; I have no idea whether the MacOS versions contain (g)vimdiff.
Note that in my experience Vim sometimes becomes slow with long lines, so that might hurt you for this specific use case.

Araxis Merge may do what you're after (it does character-diffs within lines), and there's an OSX version. It's not cheap, and I've never tried it on 10,000 character wide lines, though.

Related

Rebuild text sequence from base and diff

I have two text sequences in memory and run a diff on them. Later I want to take the base sequence and the diff and reconstitute the second sequence. A Python solution would be perfect but a command line utility would be OK too. The solution must work on Windows, Linux and MacOS. Needs to be efficient because it will be carried out millions of times. My Internet searches have not turned up a good answer.

Tool to compare/diff HTML in bulk

I have a lot of HTML files (10,000's and GBs worth) scraped from a server and I want to check to make sure the server produces the same results after some modifications but ignore kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of number, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program (HTML is special case) files, builds abstract syntax trees representing the essential structure of each files, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and deterimines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes) the CloneDr will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
Doesn't run under Linux, but I assume your problem is hard to enough to solve so that isn't what you are optimizing.
I use winmerge alot in windows and from what i can see some people enjoy meld in linux, so perhaps that could do the trick for you
http://meld.sourceforge.net/
Other examples i saw from a quick googling was Kompare,xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(could only post 1 link so wrote the adresses to xxdiff and kdiff3 as text)
Beyond Compare is purchased software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive, check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!

Code formatting and source control diffs

What source control products have a "diff" facility that ignores white space, braces, etc., in calculating the difference between checked-in versions? I seem to remember that Clearcase's diff did this but Visual SourceSafe (or at least the version I used) did not.
The reason I ask is probably pretty typical. Four perfectly reasonable developers on a team have four entirely different ways of formatting their code. Upon checking out the code last changed by someone else, each will immediately run some kind of program or editor macro to format things the way they like. They make actual code changes. They check-in their changes. They go on vacation. Two days later that program, which had been running fine for two years, blows up. The developer assigned to the bug does a diff between versions and finds 204 differences, only 3 of which are of any significance, because the diff algorithm is lame.
Yes, you can have coding standards. Most everyone finds them dreadful. A solution where everyone can have their cake and eat it too seems far more preferable.
=========
EDIT: Thanks to everyone for some great suggestions.
What I take away from this is:
(1) A source control system with plug-in type diffs is preferable.
(2) Find a diff with suitable options.
(3) Use a good source formatting program and settle on a check-in standard.
Sounds like a plan. Thanks again.
Git does have these options:
--ignore-space-at-eol
Ignore changes in whitespace at EOL.
-b, --ignore-space-change
Ignore changes in amount of whitespace. This ignores whitespace at line end, and considers all other sequences of one or more
whitespace characters to be equivalent.
-w, --ignore-all-space
Ignore whitespace when comparing lines. This ignores differences even if one line has whitespace where the other line has
none.
I am not sure if brace changes can be ignored using Git's diff.
If it is C/C++ code, you can define Astyle rules and then convert the source code's brace style to the one that you want, using Astyle. A git diff will then produce sane output.
Choose one (dreadful) coding standard, write it down in some official coding standards document, and get on with your life, messing with whitespace is not productive work.
And remember you are a professional developer, it's your job to get the project done, changing anything in the code because of a personal style preference hurts the project - it wont only make diff-ing more difficult, it can also introduce hard to find problems if your source formatter or compiler has bugs (and your fancy diff tool won't save you when two co-worker start fighting over casing).
And if someone just doesn't agree to work with the selected style just remind him (or her) that he is programming as a profession not as an hobby, see http://www.ericsink.com/entries/No_Great_Hackers.html
Maybe you should choose one format and run some indentation tool before checking in so that each person can check out, reformat to his/her own preferences, do the changes, reformat back to the official standard and then check in?
A couple of extra steps but they already use indentation tools when working. Maybe it can be a triggered check-in script?
Edit: this would perhaps also solve the brace problem.
(I haven't tried this solution myself, hence the "perhapes" and "maybes", but I have been in projects with the same problems, and it is a pain to try to go through diffs with hundreds of irrelevant changes that are not limited to whitespace, but includes the formatting itself.)
As explained in Is it possible for git-merge to ignore line-ending differences?, it is more a matter to associate the right diff tool to your favorite VCS, rather than to rely on the right VCS option (even if Git does have some options regarding whitespace, like the one mentioned in Alan's answer, it will always be not as complete as one would like).
DiffMerge is the more complete on those "ignore" options, as it can not only ignore spaces but also other "variations" based on the programming language used in a given file.
Subversion apparently supports this, either natively in the latest versions, or by using an alternate diff like Gnu Diff.
Beyond Compare does this (and much much more) and you can integrate it either in Subversion or Sourcesafe as an external diff tool.

Is there a way to diff files sentence-by-sentence instead of line-by-line?

Just trying to get diff to work better for certain kinds of documents. With LaTeX, for example, I might have a long paragraph that is strictly just one line, but I don't want to see that entire paragraph if just a sentence is changed. Particularly if I'm running some kind of version control and a co-author edits the same paragraph (but not the same sentence) as me. I wouldn't want that to show up as a conflict.
That's a secondary question. The main question is whether I can use diff to look sentence-by-sentence. Thanks.
Edit
wdiff is almost perfect. But is there a merge equivalent, as diff has with diff3?
wdiff will give you a word-by-word diff instead of line-by-line. I'm not aware of any sentence-by-sentence diff programs.
Preprocess the files before diffing them. Write a script to write one sentence per line and any line by line diff program will work.
I have done this on a C token level for diffing C code in order to make absolutely sure my CVS merge was correct.

Why is it bad to commit lines with trailing whitespace into source control?

Why is it bad to check in lines with trailing whitespace to your source control? What kinds of problems could that cause?
False differences, basically. It's helpful if diffs only show "real" changes. Some diff programs will ignore whitespace, but it would be better just to avoid the dummy change in the first place.
Of course, it also doesn't help if it makes the line wrap on a colleague's machine.
Because many people remove them you will have them show up as modified lines in diff tools if you don't use all the options (say a plain old cvs diff) which means people see your line for no good reason.
In theory you could also have strings that wrap lines where whitespace would truly be bad, but... probably not your issue.
It's like painting your walls, but not finishing the edges off properly, and going right onto the skirting board.
Some editors automatically remove trailing whitespace, some don't. This creates diff noise and can cause merge conflicts.
Yeah, I sort of agree with the other posts, but I would add that it's not bad per se. It is not a great practice, but that's the sort of thing that happens and you just sort of sigh and get on with things.
Modern diff utilities don't get hung up on whitespace.