command line recursive word-based diff? - word-diff

is there a command line program that gives recursive word-based diff (on 2 directories)?
diff -r is recursive, but it doesn't do word-by-word comparison. wdiff and dwdiff do word-based diffs, but they have no built-in option for recursive diff.
I would like to pipe the result to colordiff so a program that generates output that colordiff understands would be especially useful. Any suggestions? Thanks!

Git can do it and output color:
The following often works:
git diff --color-words path1 path2
but in general you may need to do
git diff --no-index --color-words path1 path2
Neither file even needs to be in a git repository!
--no-index is needed if both your current directory and the paths are inside a git working tree. It can be omitted if you are outside a git working tree, or if at least one of the paths is.
Manpage: https://git-scm.com/docs/git-diff/1.8.5 (and later...)
git diff --no-index [--options] [--] [<path>…]
This form is to compare the given two paths on the filesystem. You can
omit the --no-index option when running the command in a working tree
controlled by Git and at least one of the paths points outside the
working tree, or when running the command outside a working tree
controlled by Git.
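As a minimal sketch of the above, the helper below wraps the --no-index form for two directories. The function name word_diff_dirs is made up for illustration, and it uses --word-diff=plain instead of --color-words so the result is plain text that can be searched or piped further; swap in --color-words for coloured terminal output.

```shell
# Recursive word-level diff of two directories via git's --no-index mode.
# word_diff_dirs is a hypothetical helper name; git must be on PATH.
word_diff_dirs() {
    # --word-diff=plain marks changes inline as [-removed-] and {+added+}.
    # --no-pager keeps the output on stdout so it can be piped.
    git --no-pager diff --no-index --word-diff=plain "$1" "$2"
}
```

Called as word_diff_dirs path1 path2, it walks both directories and reports per-file word changes. Like diff, the command exits with status 1 when differences are found, which matters in scripts that run with set -e.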

Related

How can I convert indentation between spaces and tabs for all files in a workspace in a single action?

How can I use VS Code's Convert Indentation To Spaces or Convert Indentation to Tabs commands on all the files in my workspace in a single action instead of using the command for each file?
I'm not aware of a way to do this with VS Code (at least not without extensions, and I don't know of any such extensions offhand).
But if you're on a posix system (not sure if I'm using "posix" right here), you can do this via command line using a modified version of this:
git ls-files | command grep -E '\.ts$' | awk '{print "expand --tabs=4 --initial", $0, " > /tmp/e; mv /tmp/e ", $0}' | sh
The above command lists all files tracked in the git repo for the current working directory, filters for files with the .ts extension, and then uses awk and expand to replace each leading tab with the specified number of spaces (--initial leaves tabs after the first non-blank character alone).
To go from spaces to tabs, use the unexpand command instead, with --first-only in place of --initial. It then replaces leading groups of 4 spaces with tab characters.
If you're not working with a git repo, you can replace git ls-files with find . -type f (the advantage of git ls-files is that it won't touch anything that's not tracked).
Just change the regular expression in the grep filter to whatever you need, and change the --tabs argument to however many spaces your indentation uses.
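The generated-script approach above breaks on file names containing spaces and reuses a single /tmp/e file. A rough alternative, assuming GNU coreutils expand (whose -i option converts only leading tabs), is a small function that rewrites each file through its own temporary file. The name expand_in_place is hypothetical.

```shell
# expand_in_place WIDTH FILE...
# Hypothetical helper: converts leading tabs to WIDTH spaces in each file.
# Assumes GNU expand; -i leaves tabs after the first non-blank untouched.
expand_in_place() {
    width=$1
    shift
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        if expand -i -t "$width" "$f" > "$tmp"; then
            # Write back with cat so the file keeps its original permissions.
            cat "$tmp" > "$f"
        fi
        rm -f "$tmp"
    done
}
```

One way to drive it is git ls-files '*.ts' | while IFS= read -r f; do expand_in_place 4 "$f"; done. For the reverse direction the same shape should work with unexpand --first-only -t in place of expand -i -t.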

Generate Hg diff (or patch) that Only includes content for files modified by a revision/revset

Given a large codebase and two revisions or revsets (say, a local 'source' and a target) which may or may not share a recent parent and usually contain a large number of non-relevant file deltas;
How can diff output be generated to compare the changes only for files that are modified in the source revset itself?
(This should be the delta between the ancestry; but only for the files contained within the revset.)
Using hg diff -r target -r source shows all the changes in the ancestry, even for files not modified by the source revision.
In addition, this should be done without any knowledge of the directory structure.
If I understand correctly, you want a diff between revs source and target, but you want to restrict it to the files that were modified in changeset source. You can do it in two steps (easily assembled into one with an alias or shell script):
List the files modified in source:
hg log -r source --template '{files}'
This just outputs a list of filenames.
Request a diff for these files:
hg diff -r target -r source $(hg log -r source --template '{files}')
Step 2 is a bash command with "command substitution" $(...), which inserts the output of step one. Other methods are possible.
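The two steps can be folded into one function. This sketch swaps the '{files}' template for hg status --change, which prints one file per line, so that names containing spaces survive; that substitution and the name hg_diff_touched are my own choices, not part of the answer above. It assumes it is run from the repository root and that no file name contains a newline.

```shell
# hg_diff_touched TARGET SOURCE
# Hypothetical helper: diff TARGET against SOURCE, restricted to the files
# that changeset SOURCE itself modified. Run from the repository root.
hg_diff_touched() {
    target=$1
    source_rev=$2
    set --
    # One path per line; collect them as positional parameters.
    while IFS= read -r f; do
        if [ -n "$f" ]; then
            set -- "$@" "$f"
        fi
    done <<EOF
$(hg status --change "$source_rev" --no-status)
EOF
    # With no files, a bare "hg diff" would show everything, so stop here.
    [ $# -gt 0 ] || return 0
    hg diff -r "$target" -r "$source_rev" -- "$@"
}
```

Usage would look like hg_diff_touched target source. Files removed by the source changeset are included in the list too, since hg status --change reports them.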
From hg help diff
If only one revision is specified then that revision is compared to the working directory
i.e. you can try hg up $SOURCE; hg diff -r $TARGET

Perforce: Prevent keywords from being expanded when syncing files out of the depot?

I have a situation where I'd like to diff two branches in Perforce. Normally I'd use diff2 to do a server-side diff but in this case the files on the branches are so large that the diff2 call ends up filling up /tmp on my server trying to diff them and the diff fails.
I can't bring down my server to rectify this so I'm looking at checking out the content to disk and using diff on the command line to inspect and compare the content.
The trouble is: most of the files have RCS keywords in them that are being expanded.
I know I can remove keyword expansion from a file by opening it for edit and removing the +k modifier from its file type, but that seems a bit brute force. I was hoping I could just tell the p4 sync command not to expand the keywords on checkout, but I can't seem to find a way to do this. Is it possible?
As a possible alternative solution, does anyone know if you can tell p4 diff2 which directory to use for temporary space when you call it? If I could tell it to use abundant NAS space instead of /tmp on the Perforce server I might be able to make it work.
I'm using 2010.x version of Perforce if that changes the answer in any way.
There's no way I know of to disable keyword expansion on sync. Here's what I would try:
1) Create a branch spec between the two sets of files
2) Run "p4 files //path/to/files/... | cut -d '#' -f 1 > tmp"
Path to files above should be the right hand side of the branch spec you created
3) p4 -x tmp diff2 -b
This tells p4 to iterate over the lines of text in 'tmp' and treat them as arguments to the command. I think /tmp on your server will get cleared in-between each file this way, preventing it from filling up.
I unfortunately don't have files large enough to test that it works, so this is entirely theoretical.
To change the temp directory that p4d uses, just set TEMP or TMP to a different path and restart p4d. If you're on Windows, make sure to call 'p4 set -S perforce TMP=' to set the variable for the Perforce service; without the -S perforce you'll just set it for the current user.

Creating a script that compares multiple files in multiple servers

I have several different linux servers, all of which are essentially mirrors of each other. However, some of them have gone out of sync (file A in machine 1 is different from file B in machine 2).
I'm in the process of designing a script (shell or Perl only) that will systematically walk through certain directories and diff the corresponding files in the different machines against each other, and generate a meaningful report. Later on, I will try to sync up the files.
These are my thoughts so far on how to approach this:
sftp files to /tmp and diff locally
using ssh and diff
using rsync
My question is: what is the best way to systematically compare two files that are in different machines (but similar directory structure), and are there any built-in Perl utilities that may be helpful?
rsync will figure out the difference and sync your files by sending only the diff. Once two folders get synced, it will be pretty quick. (But the 1st time to sync will take some time)
You can also use git here. One possible workflow: on each machine, check in all the files you want to compare (or complete directories, using git add -A). Then create an empty git repository on your local workstation, which is used to fetch all the other repositories and to do the comparisons:
git init
git remote add firstmachine ssh://user@firstmachine/path/to/directory
git remote add othermachine ssh://user@othermachine/path/to/directory
git fetch --all
Now the contents of two machines may be compared:
git diff remotes/firstmachine/master remotes/othermachine/master
Or just compare the contents of a specific file:
git diff remotes/firstmachine/master remotes/othermachine/master -- file/to/compare
It's not strictly necessary to use a third machine for the comparisons. You can also git-fetch the contents from othermachine to firstmachine.
I worked on a similar tool (written in Python). A cron job ran at a set time each night, brought tar/bzip2 archives of the directories to one server, extracted them, and ran a recursive diff on them. The diff output was then run through some Python scripts, which analysed the diff hunks (+ lines, ! lines, etc.) to measure the amount of change.
Not sure if there are pre-built modules in Perl or Python, but some helper utils might sure be available in one of them.
If you need to know the difference between some local and remote file systems, the following method minimizes the network load:
make a local copy ($C) of the local directory ($D) you want to compare. I.e.:
cp -R $D $C
use rsync to copy the remote directory ($R) you want to compare over $C:
rsync -av --delete $remote_host:$R/ $C/
compare $D to $C:
diff -ur $D $C
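The three steps above might be combined roughly as follows. The function name compare_with_remote and the scratch-directory handling are illustrative assumptions; the remote argument is anything rsync accepts as a source, for example host:/srv/app (a made-up path). rsync runs without -v here so that the only output is the diff.

```shell
# compare_with_remote REMOTE_SRC LOCAL_DIR
# Hypothetical helper: seed a scratch copy from LOCAL_DIR, let rsync turn
# it into a mirror of REMOTE_SRC (only differences cross the network),
# then diff LOCAL_DIR against the mirror.
compare_with_remote() {
    remote_src=$1
    local_dir=$2
    scratch=$(mktemp -d) || return 2
    # "/." copies the directory contents, including dotfiles.
    cp -R "$local_dir/." "$scratch/"
    # Trailing slashes sync contents into scratch, not a nested directory.
    rsync -a --delete "$remote_src/" "$scratch/"
    diff -ru "$local_dir" "$scratch"
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

In the diff output, lines prefixed with - come from the local directory and lines prefixed with + from the remote one. The return status is that of diff: 0 for identical trees, 1 when they differ.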

Mercurial command line client, reading commands, options, and arguments from a file?

Is there a way to ask Mercurial to read most/all of the commands, options, and arguments that I want to give it from a response file, instead of passing them on the command line?
For instance, instead of this:
hg commit -m "commit message" --include file1 --include file2 ...
I would create a text file containing
-m "commit message" --include file1 --include file2 ...
and then ask Mercurial to read it with this (hypothetical) syntax:
hg commit @responses.txt
The reason I'm asking is that I'm creating a wrapper library for .NET around the Mercurial command line client, and this question on SO got me worried that the length of the command line might be a problem for me at some point: “Resulting command line for hg.exe too long” error in Mercurial.
There isn't a built-in way to do this as far as I know, but I think there is a way you can build what you need.
Use the Mercurial internal API and write your own wrapper script. Rather than trying to get it to read any and all commands and options, it'll be a lot easier to stick to your specific goal (i.e. "commit" and the options you need).
(Note the warnings on the API page. If this wrapper you're building is going to be distributed to other people, look into the licensing issue and have a plan for how to handle future Mercurial upgrades, which may break your wrapper.)
Here's a kludgy workaround...
Create a dummy, empty response file in the repo's .hg directory, for example .hg\response.
In the repo's .hg\hgrc, add the line
%include response
Before doing any repository operations, write the command line options to this response file. Use the [defaults] section (I know it's deprecated) to specify your options.
[defaults]
commit = -m "This is a commit message" -I file1 -I file2 ...
(According to Microsoft's support, the maximum command line is 8,191 characters on XP and later. Might be useful to know if you even need to use this trick.)
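For illustration only, writing the response file for the workaround above could look like the sketch below. The helper name write_hg_response is invented, it assumes the repo's .hg/hgrc already contains the %include line, and it assumes Mercurial splits the [defaults] value the way a shell would. No escaping is attempted, so the message and file names must not contain double quotes or newlines.

```shell
# write_hg_response REPO MESSAGE FILE...
# Hypothetical helper: writes a [defaults] entry for "commit" into
# REPO/.hg/response, overwriting whatever was there before.
write_hg_response() {
    repo=$1
    message=$2
    shift 2
    opts="-m \"$message\""
    for f in "$@"; do
        opts="$opts -I \"$f\""
    done
    printf '[defaults]\ncommit = %s\n' "$opts" > "$repo/.hg/response"
}
```

After write_hg_response /path/to/repo "commit message" file1 file2, a plain hg commit in that repository would pick the options up from the included file, so the long argument list never appears on the command line.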