I am trying to achieve the following using Perl:
A script that performs bitwise comparison of files from two directories
(the directory names are passed as arguments to the script in the command line).
The script should read all files from the first directory and all its subdirectories, and
compare them to the corresponding files (i.e. files with the same names) in the
second directory.
The result of the script (PASSED or FAILED) is determined as follows:
The result is FAILED when at least one file from the first directory is not bitwise
equal to the corresponding file in the second directory, or the second directory
has no corresponding file.
Otherwise the test is PASSED.
So far I have tried the approach in this thread created by me: Comparing two directories using Perl. At some point I realized I am essentially trying to simulate "diff -r dir1 dir2", which isn't the goal. How can one perform a bitwise comparison of two directories?
EDIT: Test Case
/dir1            /dir2
-- file1         -- file1
-- file2         -- file2
-- file3
-- ....
-- ...
---/subDir1
   -- file1
   -- file2
file1 of dir1 contains: foo bar
file1 of dir2 contains: foo
Result: FAILED
file1 of dir1 contains: foo bar
file1 of dir2 contains: foo bar
Result: PASSED
The script should essentially pair up files with the same names that are present in the two directory trees.
I would do something like this:
Open dir1
Read all filenames into an array
Open dir2
Read all filenames into an array
For any case in which a filename in dir1 matches a filename in dir2 or vice versa, begin compare logic
Use Digest::MD5 here to perform an MD5 comparison of the two files. If even one bit is off, you will get different checksums.
Code example from Digest::MD5...
use Digest::MD5 qw(md5 md5_hex md5_base64);
$digest = md5($data);
$digest = md5_hex($data);
$digest = md5_base64($data);
# OO style
use Digest::MD5;
$ctx = Digest::MD5->new;
$ctx->add($data);
$ctx->addfile(*FILE);
$digest = $ctx->digest;
$digest = $ctx->hexdigest;
$digest = $ctx->b64digest;
Generate an MD5 hash for each file and compare them, then pass or fail accordingly.
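For the question as stated, here is a minimal sketch of the whole comparison (assuming the two directory names arrive as command-line arguments, as described; md5_of is a helper introduced here just for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use Digest::MD5;

my ($dir1, $dir2) = @ARGV;
die "Usage: $0 dir1 dir2\n" unless defined $dir1 && defined $dir2;

# Compute the MD5 digest of a file's raw bytes.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

my $result = 'PASSED';

# Walk dir1 recursively; for each file, look for the file at the same
# relative path under dir2 and compare digests.
find(sub {
    return unless -f $_;
    my $rel   = File::Spec->abs2rel($File::Find::name, $dir1);
    my $other = File::Spec->catfile($dir2, $rel);
    $result = 'FAILED'
        unless -f $other && md5_of($File::Find::name) eq md5_of($other);
}, $dir1);

print "$result\n";

Comparing the raw bytes directly (e.g. with the core File::Compare module) would work just as well; MD5 is simply a convenient way to reduce each file to a short fingerprint.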
As an example I will use different inputs, to keep the privacy of my files and to avoid long text; they are of the following form:
INPUT1.cfg :
TC # aa # D317
TC # bb # D314
TC # cc # D315
TC # dd # D316
INPUT2.cfg:
BL;nn;3
LY;ww;3
LO;xx;3
TC;vv;3
TC;dd;3
OD;pp;3
TC;aa;3
What I want to do is iterate over the name (column 2) in the rows of INPUT1 and compare it with the name (column 2) in the rows of INPUT2; if they match, the corresponding line of INPUT2 goes into an output file, otherwise the script reports that the table is not found. Here is my attempt:
#!/bin/bash
input1="input1.cfg"
input2="input2.cfg"
cat $input1 | while read line
do
    TableNameIN=`echo $line | cut -d"#" -f2`
    cat $input2 | while read line
    do
        TableNameOUT=`echo $line | cut -d";" -f2`
        if echo "$TableNameOUT" | grep -q $TableNameIN
        then
            echo "$line" >> output.txt
        else
            echo "Table $TableNameIN not found"
        fi
    done
done
This is what I get as a result:
Table bb not found
Table bb not found
Table bb not found
Table cc not found
Table cc not found
Table cc not found
I manage to write the lines that match, but the problem with my code is that it outputs "table not found" for each row, whereas I want it written only once, at the end of the comparison of all the lines.
Here is the output I want to get:
Table bb not found
Table cc not found
Can anyone help me with this? PS: I don't want to use awk, because this is just one part of my code and I already use sh.
Assumptions:
for file input2.cfg the 2nd column (table name) is unique
input2.cfg is not so large that we run the risk of using up all memory storing input2.cfg in an associative array (otherwise we could store the table names from input1.cfg, assuming that is the smaller file, in the array and swap the processing order of the two files)
there are no explicit requirements for the data to be sorted (otherwise we may need to add a sort or two)
a bash solution is sufficient (based on the #!/bin/bash shebang in the OP's current code)
There are many ways to slice-n-dice this one (awk being my preference, but the OP doesn't want to use awk). For this particular answer I'll pull the awk steps out into separate bash commands.
NOTE: While we could use a set of nested loops (as in the OP's code), I've opted to use an associative array to store input2.cfg, thus eliminating the need to repeatedly scan input2.cfg.
#!/usr/bin/bash

input1=input1.cfg
input2=input2.cfg

> output.txt                          # clear out the target file

# load ${input2} into an associative array
unset lines
typeset -A lines                      # associative array for storing contents of ${input2}

while read -r line
do
    x="${line%;*}"                    # use parameter expansion
    tabname="${x#*;}"                 # to parse out table name
    lines["${tabname}"]="${line}"     # add to array
done < "${input2}"

# process ${input1}
while read -r c1 c2 tabname rest_of_line
do
    [[ -v lines["${tabname}"] ]] &&                  # if tabname has an entry in our array
    echo "${lines[${tabname}]}" >> output.txt &&     # then dump the associated line (from ${input2}) to output.txt
    continue                                         # process next line from ${input1}

    echo "Table ${tabname} not found"                # otherwise print 'not found' message
done < "${input1}"

# display contents of output.txt
echo "++++++++++++++++ output.txt"
cat output.txt
echo "++++++++++++++++"
This generates the following:
Table bb not found
Table cc not found
++++++++++++++++ output.txt
TC;aa;3
TC;dd;3
++++++++++++++++
I have a Perl script that reads a .txt file and a .bam file and creates an output file called output.txt.
I have a lot of files that are all in different folders, but are only slightly different in the filename and directory path.
All of my txt files are in different subfolders called PointMutation, with the full path being
/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation
The text in the brackets is the part that changes, but the Patient subfolder contains all of my txt files.
My .bam file is located in a subfolder named DNA with a full path of
/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/SequencingData/DNA
Currently I run this script from the terminal:
cd "/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation"
perl ~/Desktop/Scripts/Perl.pl "/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/PointMutation/txtfile.txt" "/Volumes/Lab/Data/Darwin/Patient/[Plate 1/P1H10]/SequencingData/DNA/bamfile.bam"
With only one or two files that is fairly easy, but I would like to automate it once there are many more files. Also, once I have run the script I don't want to run it again, but I will keep getting more information from the same patient; is there a way to block a folder from being read?
I would do something like:
for my $dir (glob "/Volumes/Lab/Data/Darwin/Patient/*/") {
    # skip if not a directory
    if (! -d $dir) {
        next;
    }
    my $txt = "$dir/PointMutation/txtfile.txt";
    my $bam = "$dir/SequencingData/DNA/bamfile.bam";
    # ... your magical stuff here
}
This is assuming that all directories under /Volumes/Lab/Data/Darwin/Patient/ follow the convention.
That said, a more long-term/robust way of organizing analyses with lots of different files all over the place is either 1) to organize all files necessary for each analysis under one directory, or 2) to create meta files (I'd use JSON/YAML) which contain the necessary file names.
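As an illustration of option 2, here is a minimal sketch using the core JSON::PP module; the meta file name (analyses.json), its keys, and the "done" flag used to skip already-processed folders are all invented for the example:

use strict;
use warnings;
use JSON::PP;

# Hypothetical meta file analyses.json, one entry per analysis:
# [
#   { "txt":  "/Volumes/.../PointMutation/txtfile.txt",
#     "bam":  "/Volumes/.../SequencingData/DNA/bamfile.bam",
#     "done": 0 }
# ]
open my $fh, '<', 'analyses.json' or die "Cannot open analyses.json: $!";
my $analyses = JSON::PP->new->decode(do { local $/; <$fh> });

for my $a (@$analyses) {
    next if $a->{done};    # skip analyses that have already been run
    system('perl', "$ENV{HOME}/Desktop/Scripts/Perl.pl", $a->{txt}, $a->{bam});
}

Marking an entry as done (or removing it from the meta file) after a successful run also gives you the "block a folder from being read" behaviour asked about above.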
I have tried to exclude the files whose names contain digits, but it is not happening in my code. Here $output is my directory location; that directory contains multiple folders and subfolders.
From those folders and subfolders I want to pick my .ml files, so that only the .ml files whose names consist of letters are listed.
If file names like ev4.html, ev8.html and so on come up, they should be omitted.
Because the file names here come with digits, I want to exclude every file whose name contains digits and print the expected output.
Here is my code:
use strict;
use warnings;
use File::Find::Rule;

my $output = "/home/location/radio/datas";
my @files = File::Find::Rule->file()
                            ->name('*.ml')
                            #->name(qr/\^ev\d+/)->prune->discard
                            ->in($output);
for my $file (@files) {
    print "file:$file\n";
}
Obtained output:
file:/dacr/dacr.ml
file:/DV/DV.ml
file:DV/ev4/ev4.ml
Expected Output:
file:/dacr/dacr.ml
file:/DV/DV.ml
Your attempt was almost correct, but your regular expression is wrong, and the prune and discard will remove all files, not only the ones matching the regex.
my @files = File::Find::Rule->file()
                            ->name('*.ml')
                            ->name(qr/\^ev\d+/)   # wrong regex
                            ->prune->discard      # throws away all files
                            ->in($output);
The correct regular expression to match files that contain any digit is simply \d. Your pattern says: a literal ^ (because of the backslash), then the letters ev, then at least one digit.
To make File::Find::Rule take all files that end in .ml and then exclude the ones that contain a digit, use not.
my @files = File::Find::Rule->file()
                            ->name('*.ml')
                            ->not( File::Find::Rule->name(qr/\d/) )
                            ->in($output);
This will get all .ml files, and discard any file that has any digit in the name.
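If you want the stricter "letters only" reading of the question, a single regex passed to name() works as well. File::Find::Rule matches name() regexes against the basename, so the anchors below cover the whole file name (a sketch, reusing the same $output as above):

my @files = File::Find::Rule->file()
                            ->name(qr/^[A-Za-z]+\.ml$/)  # nothing but letters before .ml
                            ->in($output);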
I have a collection of concert-media (audio and video). They're all organised using the same pattern:
./ARTISTNAME/[YYYY-MM-DD] VENUE, CITY
Basically I want to write a script which goes through the folders, looks for the [YYYY-MM-DD] folders, gets the information regarding artist (one folder level above), date, venue and location (using the name of the folder it just found), and writes the information into a .nfo file which it saves into the folder it found.
I already did quite a lot of research on this topic and found a similar script, but I am stuck because it searches for files instead of folders:
#!/bin/bash
find . -path "*[????-??-??]*" | while read folder; do       # the script is supposed to look for the concert-folders
->  band_name=`echo $file_name | sed 's/ *-.*$//'`          # Get rid of song name
->  band_name=`echo $band_name | sed 's/^.*\///'`           # Get rid of file path
->  song_name=`echo $file_name | sed 's/^.*- *//'`          # Get rid of band name
->  song_name=`echo $song_name | sed 's/.avi//'`            # Get rid of file extension
->  new_file_name=`echo $file_name | sed 's/.avi/.nfo/'`    # Make new filename
->  echo "Making $new_file_name..."
    echo -en "<musicvideo>\n<venue>$venue_name</venue>\n<city>$city_name</city>\n<date>$date</date>\n<artist>$band_name</artist>\n</musicvideo>\n" > "$new_file_name"
done
After changing the first part of the script (making it look for the folders with "[YYYY-MM-DD]"), I understand that the second part of the script assigns the "tags" (such as artist, date, location, etc.). But I don't know how to make the script take the tags from the folder names. Basically, help is needed after the "->".
In the last part, the script is supposed to write the collected information for the folder into a .nfo file (e.g. FOLDERNAME.nfo).
Here is a simple Python example to get you started. If you want to port it to shell you can.
First, my setup:
test
test/BOB
test/BOB/[3011-01-01] Lollapalooza 213, Saturn Base 5
test/THE WHO
test/THE WHO/[1969-08-17] Woodstock, Woodstock
The code:
#!/usr/bin/env python
import os
import os.path
import re
import sys

def handle_concert(dirname, artist, date, venue, city):
    """Create a NFO file in the directory."""
    # the {} are replaced by each argument in turn, like printf()
    # Using a triple quote is a lot like a here document in shell, i.e.
    #   cat <<EOF
    #   foo
    #   EOF
    nfo_data = """<musicvideo>
<venue>{}</venue>
<city>{}</city>
<date>{}</date>
<artist>{}</artist>
</musicvideo>
""".format(venue, city, date, artist)
    nfo_file = "[{}] {}, {}.nfo".format(date, venue, city)
    # when the with statement is done the file is closed and fully
    # written.
    with open(os.path.join(dirname, nfo_file), "w") as fp:
        fp.write(nfo_data)

# This is a regular expression which matches:
#   */FOO/[YYYY-MM-DD] VENUE, CITY
# Where possible, white space is left intact
concert_re = re.compile(r'.*/(?P<artist>.+)/\[(?P<date>\d{4}-\d{2}-\d{2})\]\s+(?P<venue>.+),\s+(?P<city>.+)')

def handle_artist(dirname):
    """Found an ARTIST directory. Look for concerts.

    If a subdirectory is found, see if it matches a concert.
    When a concert is found, handle it.
    """
    for i in os.listdir(dirname):
        subdir = os.path.join(dirname, i)
        m = concert_re.match(subdir)
        if m:
            print(subdir)  # to watch the progress
            handle_concert(subdir, m.group("artist"), m.group("date"),
                           m.group("venue"), m.group("city"))

def walk_collection(start_dir):
    """Examine contents of start_dir.

    If a directory is found, assume it is an ARTIST.
    """
    for i in os.listdir(start_dir):
        # os.path.join ensures paths are right regardless of OS
        dirname = os.path.join(start_dir, i)
        if os.path.isdir(dirname):
            print(dirname)  # to watch the progress
            handle_artist(dirname)

if __name__ == "__main__":
    collection_dir = sys.argv[1]  # equiv of $1
    walk_collection(collection_dir)
How to run it:
$ nfo-creator.py /path/to/your/collection
The results:
<musicvideo>
<venue>Woodstock</venue>
<city>Woodstock</city>
<date>1969-08-17</date>
<artist>THE WHO</artist>
</musicvideo>
and
<musicvideo>
<venue>Lollapalooza 213</venue>
<city>Saturn Base 5</city>
<date>3011-01-01</date>
<artist>BOB</artist>
</musicvideo>
You can add more handle_ functions to handle differing formats; just define a new regular expression and a handler for it. I kept this really simple for learning purposes, with plenty of shell-oriented comments.
This could totally be done in the shell. But writing this Python code was way easier and allows for more growth in the future.
Enjoy, and feel free to ask questions.
Contents of the remote directory mydir:
blah.myname.1.txt
blah.myname.somethingelse.txt
blah.myname.randomcharacters.txt
blah.notmyname.1.txt
blah.notmyname.2.txt
...
In Perl, I want to download all of the files with myname.
I am failing really hard with the appropriate quoting. Please help.
Failed code:
my @files;
@files = $ftp->ls( '*.myname.*.txt' );   # finds nothing
@files = $ftp->ls( '.*.myname.*.txt' );  # finds nothing
etc..
How do I put the wildcards so that they are interpreted by the ls, but not by Perl? What is going wrong here?
I will assume that you are using the Net::FTP package. Then this part of the docs is interesting:
ls ( [ DIR ] )
Get a directory listing of DIR, or the current directory.
In an array context, returns a list of lines returned from the server. In a scalar context, returns a reference to a list.
This means that if you call this method with no arguments, you get a list of all files from the current directory, else from the directory specified.
There is no word about any patterns, which is not surprising: FTP is just a protocol to transfer files, and this module is only a wrapper around that protocol.
You can do the filtering easily with grep:
my #interesting = grep /pattern/, $ftp->ls();
To select all files that contain the character sequence myname, use grep /myname/, LIST.
To select all files that contain the character sequence .myname., use grep /\.myname\./, LIST.
To select all files that end with the character sequence .txt, use grep /\.txt$/, LIST.
The LIST is either the $ftp->ls or another grep, so you can easily chain multiple filtering steps.
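For example, the two filters can be chained directly (a sketch, assuming $ftp is an already connected Net::FTP object):

my @interesting = grep /\.txt$/,
                  grep /\.myname\./,
                  $ftp->ls();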
Of course, Perl regexes are more powerful than that, and we could do all the filtering in a single /\.myname\.[^.]+\.txt$/ or something similar, depending on your exact requirements. If you are desperate for a globbing syntax, there are tools available to convert glob patterns to regex objects, like Text::Glob, or even to do direct glob matching:
use Text::Glob qw(match_glob);
my #interesting = match_glob ".*.myname.*.txt", $ftp->ls;
However, that is inelegant, to say the least, as regexes are far more powerful and absolutely worth learning.
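Putting it all together, a minimal end-to-end sketch; the host, credentials, and directory name are placeholders:

use strict;
use warnings;
use Net::FTP;

# Placeholder server and credentials; adjust to your setup.
my $ftp = Net::FTP->new('ftp.example.com') or die "Cannot connect: $@";
$ftp->login('user', 'password') or die "Cannot login: ", $ftp->message;
$ftp->cwd('mydir')              or die "Cannot cwd: ", $ftp->message;

# Filter the listing in Perl, then fetch each matching file.
my @interesting = grep /\.myname\.[^.]+\.txt$/, $ftp->ls();
for my $file (@interesting) {
    $ftp->get($file) or warn "Failed to get $file: ", $ftp->message;
}

$ftp->quit;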