Perl file list compare - perl

I need to iterate through and compare all files passed in as command line arguments in a perl script.
For example:
./script f1.txt f2.txt f3.txt
I'll need to compare
f1 & f2,
f1 & f3,
f2 & f3,
So that all files are compared to each other in some way, and not repeated.
I can do the internal 'comparing' of the files just fine, it's the way to get the files paired up which is the problem for me.
Any help to discover a way for this would be muchly appreciated!

You just want to compare every argument against every argument past itself. The ones before it would have been compared already, so you just have to look beyond. Something like this:
use feature 'say';
for (my $i = 0; $i < @ARGV; ++$i)
{
    for (my $j = $i + 1; $j < @ARGV; ++$j)
    {
        my $f1 = $ARGV[$i];
        my $f2 = $ARGV[$j];
        say "Comparing $f1 to $f2";
    }
}

Assuming that comparing "p" and "q" is the same as comparing "q" and "p", then you can do something like this. Here, @filelist is a shrinking list of files that haven't yet been the left-hand side of the compare. In each iteration of the outer loop, we take one element out of that, and compare it against all the rest.
my @filelist = @ARGV;
while (@filelist) {
    my $p = shift @filelist;
    foreach my $q (@filelist) {
        compare($p, $q);
    }
}
You could do the same thing with indices instead. Here, $p counts from 0 to the last index of @ARGV, and $q starts counting from $p + 1.
foreach my $p (0..$#ARGV) {
foreach my $q ($p+1..$#ARGV) {
compare($ARGV[$p], $ARGV[$q]);
}
}
If comparing "p" and "q" is different than comparing "q" and "p", then it gets a bit easier:
foreach my $p (@ARGV) {
    foreach my $q (@ARGV) {
        compare($p, $q) unless $p eq $q;
    }
}

Related

Perl script to pair two array

I want to pair two arrays and put a '/' character between their elements. Let's say the two arrays look like this:
@array1 = (FileA .. FileZ);
@array2 = (FileA.txt .. FileZ.txt);
The output that I want looks like this:
../../../experiment/fileA/fileA.txt
.
.
../../../experiment/fileZ/fileZ.txt
Here is my code:
my @input_name = input();
my $dirname = "../../../experiment/";
# CREATE FOLDER PATH
my @fileDir;
foreach my $input_name (@input_name){
    chomp $input_name;
    $_ = $dirname . $input_name;
    push @fileDir, $_;
}
# CREATE FILE NAME
my @filename;
my $extension = '.txt';
foreach my $input_name (@input_name){
    chomp $input_name;
    $_ = $input_name . $extension;
    push @filename, $_;
}
The code I tried is below, but it doesn't seem to work:
#CREATE FULL PATH
foreach my $test_path (@test_path){
    foreach my $testname (@testname){
        my $test = map "$test_path[$_]/$testname[$_]", 0..$#test_path;
        push @file, $test;
    }
}
print @file;
I assume input() returns something like ('fileA', 'fileB').
The problem with your code is the nested loop here:
foreach my $test_path (@test_path){
    foreach my $testname (@testname){
This combines every $test_path with every possible $testname. You don't want that. Also, it doesn't make much sense to assign the result of map to a scalar: All you'll get is the number of elements in the list created by map.
(Also, you have random chomp calls sprinkled throughout your code. None of those should be there.)
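The point about assigning map to a scalar can be seen in a small sketch. The arrays below are hypothetical sample data standing in for @test_path and @testname; only the difference between scalar and list assignment matters here.

```perl
use strict;
use warnings;

# Hypothetical sample data, standing in for @test_path and @testname.
my @test_path = ('../../../experiment/fileA', '../../../experiment/fileB');
my @testname  = ('fileA.txt', 'fileB.txt');

# Scalar assignment: map runs in scalar context and yields a count.
my $count = map "$test_path[$_]/$testname[$_]", 0 .. $#test_path;

# List assignment: map returns the generated strings themselves.
my @paths = map "$test_path[$_]/$testname[$_]", 0 .. $#test_path;

print "scalar assignment gives: $count\n";
print "list assignment gives:   $_\n" for @paths;
```

With two elements in each array, $count holds the number 2, while @paths holds the two joined path strings.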
You only need a single array and a single loop:
use strict;
use warnings;
sub input {
return ('fileA', 'fileB');
}
my @input = input();
my $dirname = '../../../experiment';
my @files = map "$dirname/$_/$_.txt", @input;
for my $file (@files) {
    print "got $file\n";
}
Here the loop is hidden in the map ..., @input call. If you want to write it as a for loop, it would look like this:
my @files;
for my $input (@input) {
    push @files, "$dirname/$input/$input.txt";
}
The problem is your algorithm. You're iterating all filenames and all dirnames at the same time.
I mean, your code says "For every directory, create every file".
Try something along the lines of this and you'll be fine:
# WRITE TESTFILE
foreach my $filename (@filename){
    chomp $filename;
    if ( -e "$filename/$filename" ){
        print "File already exists\n";
    }
    else {
        open ( my $txt_file, '>', "$filename/$filename" )
            or die "Can't open $filename/$filename: $!";
        print $txt_file "Hello World";
        close $txt_file;
    }
}

Using Perl or Powershell, how to compare 2 CSV files and get only the new rows?

I am comparing two large comma-delimited CSV files File1.csv and File2.csv using the
Text::Diff Perl module.
The Perl program is called from a .bat file and I put the result in a third file Diff.csv
Perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Diff;
my $diffs = diff $ARGV[0] => $ARGV[1];
$diffs =~ s/^(?:[^\n]*+\n){2}//;
$diffs =~ s/^(?:[\@ ][^\n]*+)?+\n//mg;
print $diffs;
This is how I call the Perl script:
perl "C:\diffBetweenTwoFiles.pl" "C:\File1.csv" "C:\File2.csv" > "C:\Diff.csv"
One of the columns in the CSV file is Name.
Currently the result lists all rows whose values in any columns change, but I want only to list new Name rows.
For example:
File1.csv
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","1 Research Rd"
File2.csv
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
Currently, the result lists these (it includes "Two" because its Address changed):
"Name","DOB","Address"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
But, I only want the result to list the new "Name" like this:
"Name","DOB","Address"
"Three","1/3/03","3 Bold Rd"
How can I do that in Perl or Powershell script?
Use Text::CSV in Perl
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my ($file_old, $file_new, $file_diff) =
map { $_ . '.csv' } qw(File1 File2 Diff);
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag();
my ($old, $header) = get_lines($csv, $file_old, 1);
my $new = get_lines($csv, $file_new);
my @lines_with_new_names = @{ new_names($old, $new) };
open my $fh, '>', $file_diff or die "Can't open $file_diff: $!";
$csv->say($fh, $header);
$csv->say($fh, $_) for @lines_with_new_names; # or print with eol set
sub new_names {
    my ($old, $new) = @_;
    my %old = map { $_->[0] => 1 } @$old;
    return [ map { (!exists $old{$_->[0]}) ? $_ : () } @$new ];
}
sub get_lines {
    my ($csv, $file, $return_header) = @_;
    open my $fh, '<', $file or die "Can't open $file $!";
    my $header = $csv->getline($fh); # remove the header line
    return ($return_header)
        ? ( $csv->getline_all($fh), $header )
        : $csv->getline_all($fh);
}
This prints the correct difference with the provided samples.
Variable names tagged with old are related to the file with fewer lines, the other one being new. The "Name" column is taken to be the first one.
Comments
The getline_all method returns an arrayref for all lines, where each is an arrayref with all fields. This is done from a sub, with an option to return the header line as well.
The optional return of another variable here determines whether a single scalar or a list is returned, so it can also be handled using the wantarray builtin:
return wantarray ? ( LIST ) : scalar;
which returns true if the sub is called in a list context. Thus the caller decides by invoking the sub in either the list or scalar context, my ($v1, $v2) = f(...) or my $v = f(...), in which case a flag is not needed in the call. I opted for a more explicit way.
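A minimal sketch of the wantarray variant follows. The sub get_rows and its sample arguments are hypothetical and only illustrate the calling convention; they are not part of the answer's script.

```perl
use strict;
use warnings;

# Hypothetical sub: returns (rows, header) in list context,
# and only the rows in scalar context.
sub get_rows {
    my @lines  = @_;
    my $header = shift @lines;
    return wantarray ? ( \@lines, $header ) : \@lines;
}

my ($rows, $header) = get_rows('Name', 'One', 'Two');   # list context
my $only_rows       = get_rows('Name', 'One', 'Two');   # scalar context

print "header: $header, rows: @$rows\n";
print "rows only: @$only_rows\n";
```

The caller picks the behavior purely through the form of the assignment, which is compact but less visible than an explicit flag.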
The difference in names' list is produced in new_names sub. First a lookup hash is made with all names from the "old" arrayref. Then lines in "new" arrayref are filtered, taking those which don't have a name in the "old" (no such key in the hash), and returned in an arrayref [].
Such use of a hash is a standard technique for finding differences between arrays.
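The lookup-hash technique can be shown in isolation. The two lists below are hypothetical and stand in for the "Name" column of the old and new files.

```perl
use strict;
use warnings;

# Hypothetical data: names seen in the old and the new file.
my @old = qw(One Two);
my @new = qw(One Two Three);

my %seen  = map  { $_ => 1 } @old;      # lookup table of old names
my @added = grep { !$seen{$_} } @new;   # names present only in @new

print "new name: $_\n" for @added;
```

Building the hash takes one pass over @old and each lookup is a constant-time check, so the whole comparison stays linear in the size of the two lists.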
The documented method say used for printing doesn't work on my older version of the module with which this is tested. In that case use print and set eol in the constructor.
Since you're working with large files that are stressing your memory limit, you can try:
Read the first CSV file one line at a time, and use a hashtable to store the file's Name entries.
Read the second CSV file one line at a time and compare its Name entries against the first.
(UPDATED based on comments) A simple example in PowerShell:
$output = New-Object System.Text.StringBuilder;
$file1 = @{};
$header = $null;
# $filePaths is two-element array with full path to CSV files
for ($i = 0; $i -lt $filePaths.Length; ++$i) {
$reader = New-Object System.IO.StreamReader($filePaths[$i]);
while (($line = $reader.ReadLine()) -ne $null) {
if ($line -match '\S') {
if ($header -eq $null) {
$header = $line;
$output.AppendLine($line) | Out-Null;
}
$name = ($line -split ',')[0];
switch ($i) {
0 { $file1.Add($name, $null); }
1 {
if (!$file1.ContainsKey($name)) {
$output.AppendLine($line) | Out-Null;
}
}
}
}
}
$reader.Dispose();
}
$output.ToString() | Out-File -FilePath $outPath;

What am I not getting about foreach loops?

It was always my understanding that
foreach (@arr)
{
....
}
and
for(my $i=0; $i<@arr; $i++)
{
.....
}
were functionally equivalent.
However, in all of my code, whenever I use a foreach loop I run into problems that get fixed when I change to a for loop. It always has to do with comparing the values of two things, usually with nested loops.
Here is an example:
for(my $i=0; $i<@files; $i++)
{
my $sel;
foreach (@selected)
{
if(files[$i] eq selected[$_])
{
$selected='selected';
}
}
<option value=$Files[$i] $sel>$files[$i]</option>
}
The above code falls between select tags in a cgi program.
Basically I am editing the contents of a select box according to user specifications.
But after they add or delete choices I want the choices that were originally selected to remain selected.
The above code is supposed to accomplish this when reassembling the select on the next form. However, with the foreach version it only gets the first choice that's selected and skips the rest. If I switch it to a 3 part for loop, without changing anything else, it will work as intended.
This is only a recent example, so clearly I am missing something here, can anyone help me out?
Let's assume that @files is a list of filenames.
In the following code, $i is the array index (i.e. it's an integer):
for (my $i=0; $i<@files; $i++) { ... }
In the following code, $i is set to each array item in turn (i.e. it's a filename):
foreach my $i (@files) { ... }
So for example:
use strict;
use warnings;
my @files = (
    'foo.txt',
    'bar.txt',
    'baz.txt',
);
print "for...\n";
for (my $i=0; $i<@files; $i++) {
    print "\$i is $i.\n";
}
print "foreach...\n";
foreach my $i (@files) {
    print "\$i is $i.\n";
}
Produces the following output:
for...
$i is 0.
$i is 1.
$i is 2.
foreach...
$i is foo.txt.
$i is bar.txt.
$i is baz.txt.
foreach loops are generally preferred for looping through arrays to avoid accidental off-by-one errors caused by things like for (my $i=1;...;...) or for (my $i=0;$i<=@arr;...).
That said, for and foreach are actually implemented as synonyms in Perl, so the following script produces identical output to my previous example:
use strict;
use warnings;
my @files = (
    'foo.txt',
    'bar.txt',
    'baz.txt',
);
print "for...\n";
foreach (my $i=0; $i<@files; $i++) {
    print "\$i is $i.\n";
}
print "foreach...\n";
for my $i (@files) {
    print "\$i is $i.\n";
}
It is simply customary to refer to the second type of loop as a foreach loop, even if the source code uses the keyword for to perform the loop (as has become quite common).
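Applied to the select-box example from the question, the inner foreach sets $_ (or a named loop variable) to each element of @selected, not to an index, so the element is compared directly. The following is a sketch under that reading; the sample data and the @options array are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical data standing in for the question's @files and @selected.
my @files    = qw(a.txt b.txt c.txt);
my @selected = qw(a.txt c.txt);

my @options;
foreach my $file (@files) {
    my $sel = '';
    foreach my $chosen (@selected) {
        # $chosen is an element of @selected, so compare it directly
        # instead of using it as an array index.
        $sel = ' selected' if $file eq $chosen;
    }
    push @options, qq{<option value="$file"$sel>$file</option>};
}
print "$_\n" for @options;
```

With this data, the options for a.txt and c.txt carry the selected attribute and the one for b.txt does not.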

Perl IF statement not matching variables in REGEX

my $pointer = 0;
foreach (@new1)
{
    my $test = $_;
    foreach (@chk)
    {
        my $check = $_;
        chomp $check;
        delete($new1[$pointer]) if ($test =~ /^$check/i);
    }
    $pointer++;
}
The if statement never matches, even though many entries in the @new1 array (at least 88) do contain $check at the start of the array element.
I am not sure it is the nested loop that is causing the problem, because if I try this it also fails to match:
foreach (@chk)
{
    @final = (grep /^$_/, @new1);
}
@final is empty but I know at least 88 entries for $_ are in @new1.
I wrote this code on a machine running Windows ActivePerl 5.14.2 and the top code works. I then (using a copy of @new1) compare the two and remove any duplicates (also works on 5.14.2). I did try to negate the if match but that seemed to wipe out the @new1 array (so that I didn't need to do a hash compare).
When I try to run this code on a Linux RedHat box with Perl 5.8.0 it seems to struggle with the variable matching in the REGEX. If I hard code the REGEX with an example I know is in @new1 the match works and in the first code the entry is deleted (in the second one value is inserted in @final).
The @chk array is a listing file on the web server and the @new1 array is created by opening two log files on the web server and then pushing one into the other.
I had even gone to the trouble of printing out $test and $check in each loop iteration and manually checking to see if any of the values did match and some of them do.
It has had me baffled for days now and I have had to throw the towel in and ask for help, any ideas?
As tested by user1568538, the solution was to replace
chomp $check;
with
$check =~ s/\r\n//g;
to remove Windows-style line endings from the variable.
Since chomp removes the contents of the input record separator $/ from the end of its argument, you could also change its value:
my $pointer = 0;
foreach (@new1)
{
    my $test = $_;
    foreach (@chk)
    {
        local $/="\r\n";
        my $check = $_;
        chomp $check;
        delete($new1[$pointer]) if ($test =~ /^$check/i);
    }
    $pointer++;
}
However, since $/ also affects other operations (such as reading from a file handle), it is safest to avoid changing $/ unless you are sure that it is safe. Here I limit the change to the foreach loop where the chomp occurs.
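The effect of a leftover carriage return can be checked with a short sketch. The string below is a hypothetical line as it would arrive from a file with Windows line endings read on a system where $/ is "\n".

```perl
use strict;
use warnings;

# Hypothetical line with a Windows-style ending.
my $line = "access.log\r\n";

my $chomped = $line;
chomp $chomped;                  # default $/ is "\n", so the "\r" stays

my $stripped = $line;
$stripped =~ s/\r?\n\z//;        # handles both "\n" and "\r\n"

printf "after chomp: %d characters\n", length $chomped;
printf "after s///:  %d characters\n", length $stripped;
```

The chomped copy is one character longer than the visible text, and that invisible "\r" at the end of $check is enough to make /^$check/ fail against an entry that otherwise looks identical.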
Not knowing what your input data looks like, using \Q might help:
delete($new1[$pointer]) if ($test =~ /^\Q$check/i);
See quotemeta.
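A small sketch of why \Q can matter: if an entry contains regex metacharacters, interpolating it unquoted changes what the pattern means. The two strings are hypothetical examples.

```perl
use strict;
use warnings;

# Hypothetical entry containing regex metacharacters ("+" and ".").
my $check = 'a+b.log';
my $test  = 'a+b.log 200 OK';

my $as_pattern = $test =~ /^$check/i   ? 1 : 0;   # "+" means "one or more a"
my $as_literal = $test =~ /^\Q$check/i ? 1 : 0;   # \Q quotes the metacharacters

print "as pattern: $as_pattern, as literal: $as_literal\n";
```

Here the unquoted form does not match, because the pattern expects a "b" right after the run of "a" characters, while the quoted form matches the literal text.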
It is not clear what you are trying to do. However, you may be trying to only get those elements for which there is no match or vice versa. Adapt the code below for your needs:
#!/usr/bin/perl
use strict; use warnings;
my @item = qw(...); # your @new?
my @check = qw(...); # your @chk?
my @match;
my @nomatch;
ITEM:
foreach my $item (@item) {
    CHECK:
    foreach my $check (@check) {
        # if $check should not be interpreted as a pattern, but as
        # literal characters, use /^\Q$check/ in the test below
        if ($item =~ /^$check/) {
            push @match, $item;
            next ITEM; # there was a match, so this $item is burnt
                       # we don't need to test against other $checks.
        }
    }
    # there was no match, so lets store it:
    push @nomatch, $item;
}
print "matched $_\n" for @match;
print "didn't match $_\n" for @nomatch;
Your code is somewhat difficult to read. Let me tell you what this
foreach (@chk) {
    @final = (grep /^$_/, @new1);
}
does: It is roughly equivalent to
my @final = ();
foreach my $check (@chk) {
    @final = grep /^$check/, @new1;
}
which is equivalent to
my @final = ();
foreach my $check (@chk) {
    # @final = grep /^$check/, @new1;
    @final = ();
    foreach (@new1) {
        if (/^$check/) {
            push @final, $_;
        }
    }
}
So your @final array gets reset on every iteration, possibly emptied, and only the matches for the last entry in @chk survive.
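If the intent was to collect the matches for every entry in the check list, one option is to append to the result instead of assigning to it. This is a sketch with hypothetical sample data, and it assumes the entries should be matched literally (hence the \Q).

```perl
use strict;
use warnings;

# Hypothetical data standing in for @chk and @new1.
my @chk  = qw(index about);
my @new1 = qw(index.html about.html contact.html);

my @final;
foreach my $check (@chk) {
    # push appends this round's matches instead of overwriting @final
    push @final, grep { /^\Q$check/ } @new1;
}
print "$_\n" for @final;
```

Note that an element matching several check entries would be added once per match, so a lookup hash may be needed if duplicates are a concern.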

How to create the next file or folder in a series of progressively numbered files?

Sorry for the bad title but this is the best I could do! :D
I have a script which creates a new project every time the specified function is called.
Each project must be stored in its own folder, with the name of the project. But, if you don't specify a name, the script will just name it "new projectX", where X is a progressive number.
With time the user could rename the folders or delete some, so every time the script runs, it checks for the smallest number available (not used by another folder) and creates the relevant folder.
Now I managed to make a program which I think works as wanted, but I would like to hear from you if it's OK or there's something wrong which I'm unable to spot, given my inexperience with the language.
while ( defined( $file = readdir $projects_dir ) )
{
    # check for files whose name start with "new project"
    if ( $file =~ m/^new project/i )
    {
        push( @files, $file );
    }
}
# remove letters from filenames, only the number is left
foreach $file ( @files )
{
    $file =~ s/[a-z]//ig;
}
@files = sort { $a <=> $b } @files;
# find the smallest number available
my $smallest_number = 0;
foreach $file ( @files )
{
    if ( $smallest_number != $file )
    {
        last;
    }
    $smallest_number += 1;
}
print "Smallest number is $smallest_number";
Here's a basic approach for this sort of problem:
sub next_available_dir {
my $n = 1;
my $d;
$n ++ while -e ($d = "new project$n");
return $d;
}
my $project_dir = next_available_dir();
mkdir $project_dir;
If you're willing to use a naming pattern that plays nicely with Perl's string auto-increment feature, you can simplify the code further, eliminating the need for $n. For example, newproject000.
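A sketch of the auto-increment variant is shown below. The helper name next_available_name is hypothetical, and the existence test is passed in as a code reference so the example runs without touching the filesystem; for real use one could pass sub { -e $_[0] }.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first name in the series for which
# the supplied existence test is false.
sub next_available_name {
    my ($exists) = @_;
    my $name = 'newproject000';
    # Magic string increment: newproject000 -> newproject001 -> ...
    $name++ while $exists->($name);
    return $name;
}

# Simulated set of names that are already taken.
my %taken = map { $_ => 1 } qw(newproject000 newproject001);
print next_available_name( sub { $taken{ $_[0] } } ), "\n";
```

The magic increment only applies to strings made of letters followed by digits, which is why a name containing a space such as "new project1" would not work here. It also carries into the letters once the digits overflow, so the zero padding needs to be wide enough for the expected number of projects.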
I think I would use something like:
use strict;
use warnings;
sub new_project_dir
{
my($base) = @_;
opendir(my $dh, $base) || die "Failed to open directory $base for reading";
my $file;
my @numbers;
while ($file = readdir $dh)
{
$numbers[$1] = 1 if ($file =~ m/^new project(\d+)$/)
}
closedir($dh) || die "Failed to close directory $base";
my $i;
my $max = $#numbers;
for ($i = 0; $i < $max; $i++)
{
next if (defined $numbers[$i]);
# Directory did not exist when we scanned the directory
# But maybe it was created since then!
my $dir = "new project$i";
next unless mkdir "$base/$dir";
return $dir;
}
# All numbers from 0..$max were in use...so try adding new numbers...
while ($i < $max + 100)
{
my $dir = "new project$i";
$i++;
next unless mkdir "$base/$dir";
return $dir;
}
# Still failed - give in...
die "Something is amiss - all directories 0..$i in use?";
}
Test code:
my $basedir = "base";
mkdir $basedir unless -d $basedir;
for (my $j = 0; $j < 10; $j++)
{
my $dir = new_project_dir($basedir);
print "Create: $dir\n";
if ($j % 3 == 2)
{
my $k = int($j / 2);
my $o = "new project$k";
rmdir "$basedir/$o";
print "Remove: $o\n";
}
}
Try this:
#!/usr/bin/env perl
use strict;
use warnings;
# get the current list of files
# see `perldoc -f glob` for details.
my @files = glob( 'some/dir/new\\ project*' );
# set to first name, in case there are none others
my $next_file = 'new project1';
# check for others
if( @files ){
    # a Schwartzian transform
    @files = map { $_->[0] } # get original
             sort { $a->[1] <=> $b->[1] } # sort by second field which are numbers
             map { [ $_, do{ ( my $n = $_ ) =~ s/\D//g; $n } ] } # create an anonymous array with original value and the second field nothing but digits
             @files;
    # last file name is the biggest
    $next_file = $files[-1];
    # add one to the trailing number (a greedy (.*) prefix would leave only the last digit to increment)
    $next_file =~ s/(\d+)$/$1+1/e;
}
print "next file: $next_file\n";
Nothing wrong per se, but that's an awful lot of code to achieve a single objective (get the minimum index of directories).
A core module, a couple of subs and a few Schwartzian transforms will make the code more flexible:
use strict;
use warnings;
use List::Util 'min';
# 'new project4' -> '4', 'new1_project4' -> '14' (!)
sub num { ( my $n = $_[0] ) =~ s|\D+||g; return $n }
sub min_index {
    my ( $dir, $filter ) = @_;
    $filter = qr/./ unless defined $filter; # match all if no filter specified
    opendir my $dirHandle, $dir or die $!;
    my @indices =                 # collect the ...
        grep { length }           # ... non-empty ...
        map  { num($_) }          # ... numerical values ...
        grep { -d "$dir/$_" }     # ... from all directories ...
        grep { /$filter/ }        # ... that match the filter ...
        readdir $dirHandle;       # ... from the directory contents
    closedir $dirHandle;
    return 0 unless @indices;     # nothing numbered yet
    my %used = map { ( $_ + 0 => 1 ) } @indices;
    my $lowest_index = min(@indices) + 0;        # start from the smallest index in use ...
    $lowest_index++ while $used{$lowest_index};  # ... and skip past those already taken
    return $lowest_index;
}
# Ready to use!
my $index = min_index ( 'some/dir' , qr/^new project/ );
my $new_project_name = "new project $index";