Optimize Duplicate Detection - perl

Background
This is an optimization problem. Oracle Forms XML files have elements such as:
<Trigger TriggerName="name" TriggerText="SELECT * FROM DUAL" ... />
where the TriggerText attribute contains arbitrary SQL code. Each SQL statement has been extracted into a uniquely named file, such as:
sql/module=DIAL_ACCESS+trigger=KEY-LISTVAL+filename=d_access.fmb.sql
sql/module=REP_PAT_SEEN+trigger=KEY-LISTVAL+filename=rep_pat_seen.fmb.sql
I wrote a script to generate a list of exact duplicates using a brute force approach.
Problem
There are 37,497 files to compare against each other, and it takes about 8 minutes to compare one file against all the others. Logically, if A = B and A = C, then there is no need to check whether B = C. So the problem is: how do you eliminate the redundant comparisons?
At that rate, the script will complete in approximately 208 days.
Script Source Code
The comparison script is as follows:
#!/bin/bash
echo Loading directory ...
for i in $(find sql/ -type f -name \*.sql); do
    echo Comparing $i ...
    for j in $(find sql/ -type f -name \*.sql); do
        if [ "$i" = "$j" ]; then
            continue;
        fi
        # Case insensitive compare, ignore spaces
        diff -IEbwBaq $i $j > /dev/null
        # 0 = no difference (i.e., duplicate code)
        if [ $? = 0 ]; then
            echo $i :: $j >> clones.txt
        fi
    done
done
Question
How would you optimize the script so that checking for cloned code is a few orders of magnitude faster?
Idea #1
Move matching files into another directory so that they don't need to be examined twice.
System Constraints
Using a quad-core CPU with an SSD; trying to avoid using cloud services if possible. The system is a Windows-based machine with Cygwin installed -- algorithms or solutions in other languages are welcome.
Thank you!

Your solution, and sputnick's solution, both take O(n^2) time. This can be done in O(n log n) time by sorting the files and using a list merge. It can be sped up further by comparing MD5 hashes (or any other cryptographically strong hash) of the files instead of the files themselves.
Assuming you're in the sql directory:
md5sum * | sort > ../md5sums
perl -lane 'print if $F[0] eq $lastMd5; $last = $_; $lastMd5 = $F[0]' < ../md5sums
The above reports only exact byte-for-byte duplicates. If you want to treat two non-identical files as equivalent for the purposes of this comparison (e.g. if you don't care about case), first create a canonicalised copy of each file (e.g. by converting every character to lower case with tr A-Z a-z < infile > outfile).
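If you go the canonicalisation route, a rough sketch of the whole pipeline (the ../canon scratch directory is made up for illustration, and lower-casing plus squeezing whitespace only approximates diff's -i/-b/-w behaviour):
mkdir -p ../canon
for f in sql/*.sql; do
    # lower-case and collapse runs of spaces/tabs so "equivalent" files hash identically
    tr 'A-Z' 'a-z' < "$f" | tr -s ' \t' ' ' > "../canon/$(basename "$f")"
done
# -exec ... + sidesteps the argument-list limit that "md5sum *" could hit with 37,497 files
find ../canon -type f -exec md5sum {} + | sort > ../md5sums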

The best way to do this is to hash each file with something like SHA-1 and then use a set. I'm not sure bash can do this, but Python can. Although if you want the best performance, C++ is the way to go.
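For what it's worth, bash 4 can do the set part with an associative array keyed by the hash; a rough sketch against the question's layout (sha1sum assumed to be available under Cygwin):
declare -A seen                                   # hash -> first file seen with that content
while IFS= read -r -d '' f; do
    h=$(sha1sum "$f" | cut -d' ' -f1)
    if [[ -n ${seen[$h]} ]]; then
        echo "${seen[$h]} :: $f" >> clones.txt    # same content as an earlier file
    else
        seen[$h]=$f
    fi
done < <(find sql/ -type f -name '*.sql' -print0)
This hashes each file once (linear) instead of diffing every pair (quadratic), which is where the orders-of-magnitude win comes from.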

To optimize comparison of your files:
#!/bin/bash
for i; do
    for j; do
        [[ "$i" != "$j" ]] &&
            if diff -IEbwBaq "$i" "$j" > /dev/null; then
                echo "$i & $j are the same"
            else
                echo "$i & $j are different"
            fi
    done
done
USAGE
./script /dir/*

Related

Fish Shell to Truncate list of files

In bash if I wish to truncate a bunch of files in a directory, I would do the following:
for i in *
do
    cat /dev/null > $i
done
In fish, I tried:
for I in *
    cat /dev/null > $I
end
but that gives me the error:
fish: Invalid redirection target: $I
So anyone know how to achieve this?
Thanks.
Works for me. Note that the only way you'll get that error is if variable I is not set. I noticed you used a lowercase letter for your bash example and uppercase for the fish example. Did you perhaps mix the case? For example, this will cause the error you saw:
for i in *
    true > $I
end
P.S., In a POSIX shell it's more efficient to do : > $i. Since fish doesn't support : it's more efficient to do true > $i to avoid spawning an external command and opening /dev/null.
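Putting that together, the fish loop the answer describes (lower-case $i throughout, true instead of cat /dev/null) looks like this:
for i in *
    true > $i
end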

Cross platform shell script to find all files to a maxdepth

Using GNU find, I can use the -maxdepth option to specify a specific depth to search for files. Unfortunately, my command needs to run on HP-UX, AIX, and Solaris as well which don't support the -maxdepth option.
I have found that I can run find /some/path/* -prune to get only files in a single folder, but I want to recurse down n levels, just like the -maxdepth argument allows. Can this be done in a cross platform way?
Edit: I found I can use the -path option to do a similar filter like so
find ./ ! -path "./*/**"
Unfortunately, AIX find does not support the -path option. I'm at least a little bit closer.
This may not be the most performant solution, but it should be quite portable. I tested it on Solaris in addition to OS X and Linux. In essence, it is a recursive, depth-first tree walk using ls. Feel free to tweak and sanitize it to your needs. Hopefully it works on AIX too.
#!/bin/bash
path="$(echo $1 | sed -e 's%/*$%%')"   # remove trailing slashes
maxDepth="$2"                          # maximum search depth
currDepth="$3"                         # current depth
[ -z "$currDepth" ] && currDepth=0     # initialize
[ $currDepth -lt $maxDepth ] && {      # are we allowed to go deeper?
    echo "D: \"$path\""                # show where we are
    IFS=$'\n'                          # split the "ls" output by newlines instead of spaces
    for entry in $(ls -F "$path"); do  # scan directory
        [ -d "$path/$entry" ] && {     # recursively descend if it is a child directory
            $0 "$path/$entry" $maxDepth $((currDepth+1))
            continue
        }
        echo "F: \"$path/$entry\""     # show it if it is not a directory (symlink, file, whatever)
    done
}
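If the recursive-ls approach proves too slow, another portable angle (only a sketch, and unlike -maxdepth it still walks the whole tree before filtering) is to post-filter find output by counting path separators with awk, assuming no newlines in file names:
# /some/path splits into 3 "/"-separated fields, so NF <= 3 + 2 keeps entries
# at most two levels below it -- roughly equivalent to -maxdepth 2
find /some/path -print | awk -F/ 'NF <= 3 + 2'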

Print line numbers after comparison

Can someone tell me the best way to print the number of differing lines in two files? I have two directories with thousands of files, and a Perl script that compares the files in dir1 with the files in dir2 and writes the differences to another file. Now I need to add something like Filename - # of differing lines:
File1 - 8
File2 - 30
Right now I am using
my $diff = `diff -y --suppress-common-lines "$DirA/$file" "$DirB/$file"`;
But along with this I also need to print how many lines are different in each one of those 1000 files.
Sorry, this is a duplicate of my previous thread, so I'd be glad if a moderator could delete the previous one.
Why even use Perl?
for i in "$dirA"/*; do file="${i##*/}"; echo "$file - $(diff -y --suppress-common-lines "$i" "$dirB/$file" | wc -l)" ; done > diffs.txt
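If you'd rather stay inside the existing Perl script, a minimal sketch that reuses the question's $DirA, $DirB and $file variables and counts the newlines in the captured diff output:
my $diff  = `diff -y --suppress-common-lines "$DirA/$file" "$DirB/$file"`;
my $count = ($diff =~ tr/\n//);   # tr in scalar context returns the number of newlines, i.e. differing lines
print "$file - $count\n";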

generate text sequence in powershell

I just had to produce a long XML sequence for testing purposes, with a lot of elements like <hour>2009.10.30.00</hour>.
This made me drop into a linux shell and just run
for day in $(seq -w 1 30); do
    for hour in $(seq -w 0 23); do
        echo "<hour>2009.10.$day.$hour</hour>"
    done
done >out
How would I do the same in PowerShell on Windows?
Pretty similar...
$(foreach ($day in 1..30) {
    foreach ($hour in 0..23) {
        "<hour>2009.10.$day.$hour</hour>"
    }
}) > tmp.txt
Added file redirection. If you are familiar with bash the syntax should be pretty intuitive.
If I were scripting I would probably go with orsogufo's approach for readability. But if I were typing this at the console interactively I would use a pipeline approach - less typing and it fits on a single line e.g.:
1..30 | %{$day=$_;0..23} | %{"<hour>2009.10.$day.$_</hour>"} > tmp.txt
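One difference from the bash version: seq -w zero-pads the numbers (01..30, 00..23) while the PowerShell snippets above do not. If you need the padding, a sketch using .NET format strings:
1..30 | %{$day=$_.ToString('00'); 0..23} | %{"<hour>2009.10.$day.$($_.ToString('00'))</hour>"} > tmp.txt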

How can I change the case of filenames in Perl?

I'm trying to create a process that renames all my filenames to Camel/Capital Case. The closest I have to getting there is this:
perl -i.bak -ple 's/\b([a-z])/\u$1/g;' *.txt # or similar .extension.
This seems to create a backup file (which I'll remove once I've verified it does what I want); but instead of renaming the files, it changes the text inside them. Is there an easier way to do this? The theory is that I have several office documents in various formats, as I'm a bit anal-retentive, and would like them to look like this:
New Document.odt
Roffle.ogg
Etc.Etc
Bob Cat.flac
Cat Dog.avi
Is this possible with perl, or do I need to change to another language/combination of them?
Also, is there any way to make this recursive, such that /foo/foo/documents has all its files renamed, as does /foo/foo/documents/foo?
You need to use rename.
Here is its signature:
rename OLDNAME,NEWNAME
To make it recursive, use it along with File::Find
use strict;
use warnings;
use File::Basename;
use File::Find;

# default searches just in the current directory
my @directories = (".");

find(\&wanted, @directories);

sub wanted {
    # renaming goes here
}
The snippet above will run the code inside wanted against every file that is found. You have to fill in the body of wanted to do the actual renaming.
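For illustration, one way to flesh out wanted (a sketch only; it reuses the capitalisation regex from the question and renames plain files, sidestepping the directory-renaming trouble described in the EDIT below):
sub wanted {
    return unless -f $_;                      # File::Find sets $_ to the basename and chdirs for us
    (my $new = $_) =~ s/\b([a-z])/\u$1/g;     # Camel/Capital Case, as in the question
    rename $_, $new or warn "rename $_ -> $new: $!" if $new ne $_;
}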
EDIT: I tried to accomplish this task using File::Find, and I don't think you can achieve it easily. You can succeed by following these steps:
if the parameter is a dir, capitalize it and obtain all the files
for each file, if it's a dir, go back to the beginning with this file as the argument
if the file is a regular file, capitalize it
Perl just got in my way while writing this script, so I wrote it in Ruby:
require "rubygems"
require "ruby-debug"

# camelcase files
class File
  class << self
    alias :old_rename :rename
  end
  def self.rename(arg1, arg2)
    puts "called with #{arg1} and #{arg2}"
    self.old_rename(arg1, arg2)
  end
end

def capitalize_dir_and_get_files(dir)
  if File.directory?(dir)
    path_c = dir.split(/\//)
    # base = path_c[0, path_c.size - 1].join("/")
    path_c[-1].capitalize!
    new_dir_name = path_c.join("/")
    File.rename(dir, new_dir_name)
    files = Dir.entries(new_dir_name) - [".", ".."]
    files.map! { |file| File.join(new_dir_name, file) }
    return files
  end
  return []
end

def camelize(dir)
  files = capitalize_dir_and_get_files(dir)
  files.each do |file|
    if File.directory?(file)
      camelize(file.clone)
    else
      dir_name = File.dirname(file)
      file_name = File.basename(file)
      extname = File.extname(file)
      file_components = file_name.split(/\s+/)
      file_components.map! { |file_component| file_component.capitalize }
      new_file_name = File.join(dir_name, file_components.join(" "))
      # if extname != ""
      #   new_file_name += extname
      # end
      File.rename(file, new_file_name)
    end
  end
end

camelize(ARGV[0])
I tried the script on my PC and it capitalizes all dirs, subdirs and files by the rule you mentioned. I think this is the behaviour you want. Sorry for not providing a Perl version.
Most systems have the rename command ....
NAME
rename - renames multiple files
SYNOPSIS
rename [ -v ] [ -n ] [ -f ] perlexpr [ files ]
DESCRIPTION
"rename" renames the filenames supplied according to the rule specified as the first argument. The perlexpr argument is a Perl expression which
is expected to modify the $_ string in Perl for at least some of the filenames specified. If a given filename is not modified by the expression,
it will not be renamed. If no filenames are given on the command line, filenames will be read via standard input.
For example, to rename all files matching "*.bak" to strip the extension, you might say
rename 's/\.bak$//' *.bak
To translate uppercase names to lower, you’d use
rename 'y/A-Z/a-z/' *
OPTIONS
-v, --verbose
Verbose: print names of files successfully renamed.
-n, --no-act
No Action: show what files would have been renamed.
-f, --force
Force: overwrite existing files.
AUTHOR
Larry Wall
DIAGNOSTICS
If you give an invalid Perl expression you’ll get a syntax error.
Since Perl runs just fine on multiple platforms, let me warn you that FAT (and FAT32, etc) filesystems will ignore renames that only change the case of the file name. This is true under Windows and Linux and is probably true for other platforms that support the FAT filesystem.
Thus, in addition to Geo's answer, note that you may have to actually change the file name (by adding a character to the end, for example) and then change it back to the name you want with the correct case.
If you will only rename files on NTFS filesystems or only on ext2/3/4 filesystems (or other UNIX/Linux filesystems) then you probably don't need to worry about this. I don't know how the Mac OSX filesystem works, but since it is based on BSDs, I assume it will allow you to rename files by only changing the case of the name.
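In Perl, that workaround is just two rename calls through a temporary name; a sketch, where $old and $new are placeholders for the current and desired names and the trailing ~ is an arbitrary choice:
rename $old, "$old~"  or die "rename to temp name failed: $!";   # e.g. etc.etc -> etc.etc~
rename "$old~", $new  or die "rename to final name failed: $!";  # e.g. etc.etc~ -> Etc.Etc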
I'd just use the find command to recurse through the subdirectories and mv to do the renaming, but still leverage Perl to get the renaming right.
find /foo/foo/documents -type f \
-execdir bash -c 'mv "$0" \
"$(echo "$0" \
| perl -pe "s/\b([[:lower:]])/\u\$1/g; \
s/\.(\w+)$/.\l\$1/;")"' \
{} \;
Cryptic, but it works.
Another one:
find . -type f -exec perl -e'
    map {
        ( $p, $n, $s ) = m|(.*/)([^/]*)(\.[^.]*)$|;
        $n =~ s/(\w+)/ucfirst($1)/ge;
        rename $_, $p . $n . $s;
    } @ARGV
' {} +
Keep in mind that on case-remembering filesystems (FAT/NTFS), you'll need to rename the file to something else first, then to the case change. A direct rename from "etc.etc" to "Etc.Etc" will fail or be ignored, so you'll need to do two renames: "etc.etc" to "etc.etc~" then "etc.etc~" to "Etc.Etc", for example.