Creating a script that compares multiple files in multiple servers - perl

I have several different linux servers, all of which are essentially mirrors of each other. However, some of them have gone out of sync (file A in machine 1 is different from file B in machine 2).
I'm in the process of designing a script (shell or Perl only) that will systematically walk through certain directories and diff the corresponding files in the different machines against each other, and generate a meaningful report. Later on, I will try to sync up the files.
These are my thoughts so far on how to approach this:
sftp files to /tmp and diff locally
using ssh and diff
using rsync
My question is: what is the best way to systematically compare two files that are in different machines (but similar directory structure), and are there any built-in Perl utilities that may be helpful?

rsync will figure out the difference and sync your files by sending only the diff. Once two folders get synced, it will be pretty quick. (But the 1st time to sync will take some time)

You can also use git here. One possible workflow: just checkin all files you want to compare (or complete directories using git add -A). Then create an empty git repository on your local workstation which is used fetch all the other repositories, and which is used to do the comparisons:
git init
git remote add firstmachine ssh://user#firstmachine/path/to/directory
git remote add othermachine ssh://user#othermachine/path/to/directory
git fetch --all
Now the contents of two machines may be compared:
git diff remotes/firstmachine/master remotes/othermachine/master
Or just compare the contents of a specific file:
git diff remotes/firstmachine/master remotes/othermachine/master -- file/to/compare
It's not strictly necessary to use a third machine for the comparisons. You can also git-fetch the contents from othermachine to firstmachine.

I had worked on a similar tool (which was in python). What it did was, run a cron job, at a given time of the night, which would bring the tar bzipped files to one server, extract the directories and run a recursive diff on it. The diff output was then run through some python scripts, which would analyse the diff hunks (+ lines/! lines etc) to know the amount of change.
Not sure if there are pre-built modules in Perl or Python, but some helper utils might sure be available in one of them.

If you need to know the difference between some local and remote file systems, the following method minimizes the network load:
make a local copy ($C) of the local directory ($D) you want to compare. I.e.:
cp -R $D $C
use rsync to copy the remote directory ($R) you want to compare over $C:
rsync -av --delete $remote_host:$R $C
compare $D to $C:
diff -u $D $C

Related

Perforce: Prevent keywords from being expanded when syncing files out of the depot?

I have a situation where I'd like to diff two branches in Perforce. Normally I'd use diff2 to do a server-side diff but in this case the files on the branches are so large that the diff2 call ends up filling up /tmp on my server trying to diff them and the diff fails.
I can't bring down my server to rectify this so I'm looking at checking out the the content to disk and using diff on the command line to inspect and compare the content.
The trouble is: most of the files have RCS keywords in them that are being expanded.
I know can remove keyword expansion from a file by opening the files for edit and removing the -k attribute from the files in the process, but that seems a bit brute force. I was hoping I could just tell the p4 sync command not to expand the keywords on checkout. I can't seem to find a way to do this? Is it possible?
As a possible alternative solution, does anyone know if you can tell p4 diff2 which directory to use for temporary space when you call it? If I could tell it to use abundant NAS space instead of /tmp on the Perforce server I might be able to make it work.
I'm using 2010.x version of Perforce if that changes the answer in any way.
There's no way I know of to disable keyword expansion on sync. Here's what I would try:
1) Create a branch spec between the two sets of files
2) Run "p4 files //path/to/files/... | cut -d '#' -f 1 > tmp"
Path to files above should be the right hand side of the branch spec you created
3) p4 -x tmp diff2 -b
This tells p4 to iterate over the lines of text in 'tmp' and treat them as arguments to the command. I think /tmp on your server will get cleared in-between each file this way, preventing it from filling up.
I unfortunately don't have files large enough to test that it works, so this is entirely theoretical.
To change the temp directory that p4d uses just TEMP or TMP to a different path and restart p4d. If you're on Windows make sure to call 'p4 set -S perforce TMP=' to set variable for the Perforce service; without the -S perforce you'll just set it for the current user.

How to change (CQ5) VLT repo url/port?

I have checked out vlt repo using:
vlt co http://localhost:4502/crx/-/jcr:root path/to/repo --force
But now, my CQ instance changed location (port). Is there a way to set new URL(port) to vlt?
(without checking out again)
I have tried unzipping path/to/repo/.vlt and changing repository.url file sometimes it works, but in most cases it breaks local repo, or I'm unable to unzip.
I understand you're looking for something like the "svn relocate" command. This is not possible with the VLT tool directly.
Options (any one of these should do it):
I recommend checking out a new copy of the repository and reapplying the changes that show from running "vlt status" over there.
Set up a new CQ server on the old port, then use "vlt rcp". The process would probably be: copy the whole repository from old to new server, push your local stuff to the new server, copy part of the tree from new to old.
The repository.url setting is nested in .vlt files under all subdirectories of the repository. You could try a global/recursive search & replace for all of these. I've never tried this though. For example, something like this: (I get permission denied running this, needs more work.)
find -name .vlt -type f -print0 | xargs -0 sed -i 's/localhost:4502/localhost:4503/g'
Remove all the .vlt files and use the vlt import/export commands to load. See the "Using import/export instead of .vlt control" section of this document: http://wem.help.adobe.com/enterprise/en_US/10-0/core/how_to/how_to_use_the_vlttool.html

flexible merge command for unison to pick newer or older file?

I've been using unison as my file synchronizer of choice and life has been great.
Essentially I could modify any files on any side at any time without ever worrying who's master and slave, etc. It's bidirectional.
However with four roots failing over to each other when each's primary partner cannot be reached, I'm starting to push the limits of this tool. Conflicts arise that halt automatic syncing for the files involved. Aspects of my business logic are distributed across the different hosts, which modify sometimes the same files when run.
The merge option in the configuration file comes into play. It lets you specify different merge commands for different file types.
For example for log files only I like to interpolate their lines with:
merge = Name *.log -> diff3 -m CURRENT1 CURRENTARCH CURRENT2 > NEW || echo "differences detected"
Question: for *.last files only, what merge command would always favor the older copy?
For *.rb *.sh and other source files, I'm not looking to merge but always pick the newer version in case of conflicts. I can do that by default with the prefer = newer global option though.
For *.png files I typically prefer to keep the smaller(optimized) size.
Regarding the .rb and .sh files, you could use the preferpartial = Name *.rb -> newer and the same for .ssh files. For .last files, you can use older instead.
Regarding .png files, you could write your own merge command that checks the size of both files. I would then set merge = Name *.png -> mycmp CURRENT1 CURRENT2 NEW, and have the mycmp command takes three file path, compare the size of the first two, and copy it to the third path.

how to check for activity or lack thereof on a unix file directory using perl or unix commands

Scenario:
I have a process where many files are being copied (scp'd) to a DestinationServer by Host1, Host2, Host3, Host4 for example. Going to the same common directory: DestinationServer:/home/target. All the files are unique so no files will be overwritten. Host1-Host4 will have a cronjob that will launch their scp script to DestinationServer. The caveat is the Hosts are in different time zones, locations. So, they will finish at different times.
Need:
Since the files are being scp'd to Destination:/home/target, what is the best way to programmatically check when those scp's from the other Hosts are done??
Options:
My options are to programmatically do this either in perl or shell if possible.
What do I look for, what unix commands or perl modules could I use to help determine when the processes would finish? Any ideas, examples would be great! Thanks.
Use a Maildir kind of approach: copy all files to a temporary directory, then after the transfer is complete have the originating host perform a rename into the target directory via ssh. That way when a file appears in the target directory, you know that it is complete.
I suggest this because if you just scp files into the target directory and monitor the directory in whatever way, you cannot distinguish a complete transfer from an interrupted scp command or a network failure.
SGI::FAM, Sys::Gamin
Similar but alternative way to Jouni is to use semaphore files. Before scp-ing files originating host puts up semaphore-file and when finished, remove it. So you know, it's time.

How do I search a CVS repository for a particular file?

Is there any way to do it? I only have client access and no access to the server. Is there a command I've missed or some software that I can install locally that can connect and find a file by filename?
You could grep the output of
cvs rlog -Nh .
(note the period character at the end - this effectively means: the whole repository).
That should give you info about the whole shebang including removed files and files added on branches.
You can use
cvs rls -Rde <modulename>
which will give you all files in recursively, e.g.
foo:
/x.py/1.2/Mon Dec 1 23:33:51 2008//
/y.py/1.1/Mon Dec 1 23:33:31 2008//
D/bar////
foo/bar:
/xxx/1.1/Mon Dec 1 23:36:38 2008//
Notice that the -d option gives you also deleted files; not sure whether you
wanted that. Without -e, it only gives you the file names.