Print line numbers after comparison - perl

Can someone tell me the best way to print the number of differing lines between two files? I have two directories with thousands of files, and a Perl script that compares every file in dir1 with the file of the same name in dir2 and writes the differences to a separate file. Now I need to add output like Filename - # of different lines, for example:
File1 - 8
File2 - 30
Right now I am using
my $diff = `diff -y --suppress-common-lines "$DirA/$file" "$DirB/$file"`;
But along with this I also need to print how many lines differ in each of those thousands of files.
Sorry, this is a duplicate of my previous thread, so I would be glad if a moderator could delete the previous one.

Why even use Perl for this?
for i in "$dirA"/*; do file="${i##*/}"; echo "$file - $(diff -y --suppress-common-lines "$i" "$dirB/$file" | wc -l)" ; done > diffs.txt

how to replace with sed when source contains $

I have a file that contains:
$conf['minified_version'] = 100;
I want to increment that 100 with sed, so I have this:
sed -r 's/(.*minified_version.*)([0-9]+)(.*)/echo "\1$((\2+1))\3"/ge'
The problem is that this strips the $conf from the original, along with any indentation spacing. What I have been able to figure out is that it's because it's trying to run:
echo " $conf['minified_version'] = $((100+1));"
so of course it's trying to replace the $conf with a variable which has no value.
Here is an awk version:
$ awk '/minified_version/{$3+=1} 1' file
$conf['minified_version'] = 101
This looks for lines that contain minified_version. Any time such a line is found, the third field, $3, is incremented by 1.
My suggested approach to this would be to have a file on-disk that contained nothing but the minified_version number. Then, incrementing that number would be as simple as:
minified_version=$(< minified_version)
printf '%s\n' "$(( minified_version + 1 ))" >minified_version
...and you could just put a sigil in your source file where that needs to be replaced. Let's say you have a file named foo.conf.in that contains:
$conf['minified_version'] = #MINIFIED_VERSION#
...then you could simply run, in your build process:
sed -e "s/#MINIFIED_VERSION#/$(<minified_version)/g" <foo.conf.in >foo.conf
This has the advantage that you never have code changing foo.conf.in, so you don't need to worry about bugs overwriting the file's contents. It also means that if you're checking your files into source control, then so long as you only check in foo.conf.in and not foo.conf, you avoid potential merge conflicts due to context near the version number changing.
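Putting those pieces together, a complete "bump and regenerate" step might look like this (a sketch that just combines the snippets above, assuming the minified_version and foo.conf.in files already exist):
#!/bin/bash
# Bump the stored version number, then render the template into foo.conf.
minified_version=$(< minified_version)
printf '%s\n' "$(( minified_version + 1 ))" > minified_version
sed -e "s/#MINIFIED_VERSION#/$(< minified_version)/g" < foo.conf.in > foo.conf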
Now, if you did want to do the operation in-place, here's a somewhat overdesigned approach written in pure native bash (reading from infile and writing to outfile; just rename outfile back over infile when successful to make this an in-place replacement):
# Build the literal prefix we expect; the '"'"' dance embeds single quotes.
target='$conf['"'"'minified_version'"'"'] = '
suffix=';'
while IFS= read -r line; do
    if [[ $line = "$target"* ]]; then
        value=${line##*=}              # everything after the last "="
        value=${value%$suffix}         # drop the trailing ";"
        new_value=$(( value + 1 ))
        printf '%s\n' "${target}${new_value}${suffix}"
    else
        printf '%s\n' "$line"          # pass all other lines through unchanged
    fi
done <infile >outfile
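As noted above, the loop writes to outfile; to finish the in-place replacement, overwrite the original only if the loop succeeded (the script name here is just a hypothetical placeholder):
bash increment_version.sh && mv outfile infile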

File Splitting with DataStage (8.5)

I have a job that successfully produces a sequential file (CSV) output with some hundred million rows. Can someone provide an example where the output is written to a hundred separate sequential files, each with a million rows?
What does the sequential file stage look like, how is it configured?
This is to ultimately allow QA to review any one of the individual outputs without a special text editor that can view large text files.
Based on the suggestion from @Mr. Llama and a lack of forthcoming solutions, we decided on a simple script to be executed at the end of the scheduled DataStage event.
#!/bin/bash
# usage:
# sh ./[script] [input]
# check for input:
if [ "$#" -ne 1 ]; then
echo "No input file provided."
exit 1
fi
# directory for output:
mkdir split
# header without content:
head -n 1 "$1" > header.csv
# content without header:
tail -n +2 "$1" > content.csv
# split content into 100000 record files:
split -l 100000 content.csv split/data_
# loop through the new split files, adding the header
# and a '.csv' extension:
for f in split/*; do cat header.csv "$f" > "$f.csv"; rm "$f"; done
# remove the temporary files:
rm header.csv
rm content.csv
Crude but works for us in this case.
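For what it's worth, recent GNU coreutils versions of split can prepend the header and add the .csv extension in one pass, which avoids the temporary content.csv file; a sketch, assuming GNU split 8.16 or later (for --filter and --additional-suffix) and the same header.csv idea:
#!/bin/bash
# Alternative split using GNU split's --filter (sketch; needs GNU coreutils >= 8.16).
mkdir -p split
head -n 1 "$1" > header.csv
# Each chunk is piped through the filter, which prepends the header
# and writes to the final file name held in $FILE.
tail -n +2 "$1" | split -l 100000 -d --additional-suffix=.csv --filter='cat header.csv - > "$FILE"' - split/data_
rm header.csv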

Perl script for normalizing negative numbers

I have several large CSV files in which I would like to replace all numbers less than 100, including negative numbers, with 500 or another positive number.
I'm not a programmer, but I found a nice Perl one-liner to replace whitespace with commas, s/[^\S\n]+/,/g, and I was wondering if there's an easy way to do this as well.
Using Windows quoting (double quotes) for a Perl one-liner:
perl -F/,/ -lane "print join(q{,}, map { /^[-\d.]+$/ && $_ < 100 ? 100 : $_ } @F);" input.csv > output.csv
The following works for me, assuming there are 2 files in the directory:
test1.txt:
201,400,-1
-2.5,677,90.66,30.32
222,18
test2.txt
-1,-1,-1,99,101
3,3,222,190,-999
22,100,100,3
Using the one-liner:
perl -p -i.bak -e 's/(-?\d+\.?\d*)/$1<100?500:$1/ge' *
-p applies the search-and-replace to each line in each file, and -i.bak does the replacement in the original files while backing them up to new files with a .bak extension. The s///ge part finds all the numbers (including negative numbers) and compares each number with 100; if it is less than 100, it is replaced with 500. g means replace all matching numbers, e means the replacement part is treated as Perl code, and * means process all the files in the directory.
After executing this one-liner, I got 4 files in the directory:
test1.txt.bak test1.txt test2.txt.bak test2.txt
and the contents of test1.txt and test2.txt are:
test1.txt
201,400,500
500,677,500,500
222,500
test2.txt
500,500,500,500,101
500,500,222,190,500
500,100,100,500

Extraction of required columns from many files and writing them to a single file

perl -F"\t" -lane '$, = ","; print $F[0], $F[4]' EM2.gcount > Em2gcount.csv
Using this command I was able to extract columns 0 and 4 from one file and write them to a separate .csv file. I have many files and I want to print them all into a single file.
Please help me with what changes I should make.
find . -type f -name "*.gcount" -exec <yourperlcommand> {} \; >> Em2gcount.csv
This will find all .gcount files under your current directory and execute the Perl command on {}, which references each file found; the redirection appends all of the output to Em2gcount.csv.
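For example, substituting the one-liner from the question (the redirection applies to the whole find command, so every file's output lands in one CSV):
find . -type f -name "*.gcount" -exec perl -F"\t" -lane '$, = ","; print $F[0], $F[4]' {} \; > Em2gcount.csv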

Optimize Duplicate Detection

Background
This is an optimization problem. Oracle Forms XML files have elements such as:
<Trigger TriggerName="name" TriggerText="SELECT * FROM DUAL" ... />
Where the TriggerText is arbitrary SQL code. Each SQL statement has been extracted into uniquely named files such as:
sql/module=DIAL_ACCESS+trigger=KEY-LISTVAL+filename=d_access.fmb.sql
sql/module=REP_PAT_SEEN+trigger=KEY-LISTVAL+filename=rep_pat_seen.fmb.sql
I wrote a script to generate a list of exact duplicates using a brute force approach.
Problem
There are 37,497 files to compare against each other; it takes 8 minutes to compare one file against all the others. Logically, if A = B and A = C, then there is no need to check if B = C. So the problem is: how do you eliminate the redundant comparisons?
The script will complete in approximately 208 days.
Script Source Code
The comparison script is as follows:
#!/bin/bash
echo Loading directory ...
for i in $(find sql/ -type f -name \*.sql); do
    echo "Comparing $i ..."
    for j in $(find sql/ -type f -name \*.sql); do
        if [ "$i" = "$j" ]; then
            continue
        fi
        # Case insensitive compare, ignore spaces
        diff -iEbwBaq "$i" "$j" > /dev/null
        # 0 = no difference (i.e., duplicate code)
        if [ $? = 0 ]; then
            echo "$i :: $j" >> clones.txt
        fi
    done
done
Question
How would you optimize the script so that checking for cloned code is a few orders of magnitude faster?
Idea #1
Remove the matching files into another directory so that they don't need to be examined twice.
System Constraints
Using a quad-core CPU with an SSD; trying to avoid using cloud services if possible. The system is a Windows-based machine with Cygwin installed -- algorithms or solutions in other languages are welcome.
Thank you!
Your solution, and sputnick's solution, both take O(n^2) time. This can be done in O(n log n) time by sorting the files and using a list merge. It can be sped up further by comparing MD5 hashes (or any other cryptographically strong hash) of the files, instead of the files themselves.
Assuming you're in the sql directory:
md5sum * | sort > ../md5sums
perl -lane 'print if $F[0] eq $lastMd5; $last = $_; $lastMd5 = $F[0]' < ../md5sums
Using the above code will report only exact byte-for-byte duplicates. If you want to consider two non-identical files to be equivalent for the purposes of this comparison (e.g. if you don't care about case), first create a canonicalised copy of each file (e.g. by converting every character to lower case with tr A-Z a-z < infile > outfile).
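A minimal sketch of that canonicalisation pass, assuming case and runs of whitespace are the only differences you want to ignore (the canonical/ directory name is just a placeholder):
#!/bin/bash
# Normalise each SQL file, then hash the normalised copies.
mkdir -p canonical
for f in sql/*.sql; do
    # lower-case everything and squeeze whitespace runs into single spaces
    tr A-Z a-z < "$f" | tr -s ' \t' ' ' > "canonical/${f##*/}"
done
md5sum canonical/* | sort > md5sums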
The best way to do this is to hash each file (e.g. with SHA-1) and then use a set. I'm not sure bash can do this, but Python can. Although if you want the best performance, C++ is the way to go.
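For what it's worth, bash 4+ can emulate the set with an associative array keyed on the hash; a sketch (sha1sum comes from coreutils, and clones.txt mirrors the original script's output file):
#!/bin/bash
# Report each file whose content hash has already been seen.
declare -A seen
for f in sql/*.sql; do
    hash=$(sha1sum "$f" | cut -d' ' -f1)
    if [[ -n ${seen[$hash]} ]]; then
        echo "${seen[$hash]} :: $f" >> clones.txt
    else
        seen[$hash]=$f
    fi
done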
To optimize the comparison of your files:
#!/bin/bash
for i; do
    for j; do
        [[ "$i" != "$j" ]] &&
            if diff -iEbwBaq "$i" "$j" > /dev/null; then
                echo "$i & $j are the same"
            else
                echo "$i & $j are different"
            fi
    done
done
USAGE
./script /dir/*