Out-of-order diff between two text files in Perl

I basically want to do an out-of-order diff between two text files (in CSV style), comparing the fields in the first two columns (I don't care about the third column's value). I then want to print out the values that file1.txt has but file2.txt doesn't, and vice versa for file2.txt compared to file1.txt.
file1.txt:
cat,val 1,43432
cat,val 2,4342
dog,value,23
cat2,value,2222
hedgehog,input,233
file2.txt:
cat2,value,312
cat,val 2,11
cat,val 3,22
dog,value,23
hedgehog,input,2145
bird,output,9999
Output would be something like this:
file1.txt:
cat,val 1,43432
file2.txt:
cat,val 3,22
bird,output,9999
I'm new to Perl so some of the better, less ugly methods to achieve this are outside of my knowledge currently. Thanks for any help.
current code:
#!/usr/bin/perl -w
use Cwd;
use strict;
use Data::Dumper;
use Getopt::Long;

my $myName = 'MyDiff.pl';
my $usage  = "$myName is blah blah blah";

# retrieve the command line options, set up the environment
use vars qw($file1 $file2);

# grab the specified values or exit program
GetOptions( "file1=s" => \$file1,
            "file2=s" => \$file2 )
    or die $usage;
( $file1 and $file2 ) or die $usage;

open( FH, "< $file1" ) or die "Can't open $file1 for read: $!";
my @array1 = <FH>;
close FH or die "Cannot close $file1: $!";

open( FH, "< $file2" ) or die "Can't open $file2 for read: $!";
my @array2 = <FH>;
close FH or die "Cannot close $file2: $!";

# ...do a sort and match

Use a hash for this, with the first two columns joined as the key.
Once you have these two hashes, you can iterate over them and delete the common entries;
what remains in the respective hashes will be what you are looking for.
Initialize:
my %hash1 = ();
my %hash2 = ();
Read in the first file, join the first two columns to form the key, and save the line in the hash. This assumes fields are comma-separated. You could also use a CSV module for the same purpose (see the sketch at the end of this answer).
open( my $fh1, "<", $file1 ) || die "Can't open $file1: $!";
while ( my $line = <$fh1> ) {
    chomp $line;

    # join first two columns for key
    my $key = join ",", ( split ",", $line )[ 0, 1 ];

    # create hash entry for file1
    $hash1{$key} = $line;
}
Do the same for file2 to create %hash2:
open( my $fh2, "<", $file2 ) || die "Can't open $file2: $!";
while ( my $line = <$fh2> ) {
    chomp $line;

    # join first two columns for key
    my $key = join ",", ( split ",", $line )[ 0, 1 ];

    # create hash entry for file2
    $hash2{$key} = $line;
}
Now go over the entries and delete the common ones:
foreach my $key ( keys %hash1 ) {
    if ( exists $hash2{$key} ) {
        # common entry, delete from both hashes
        delete $hash1{$key};
        delete $hash2{$key};
    }
}
%hash1 will now hold only the lines that are unique to file1, and %hash2 those unique to file2.
You could print them like this:
foreach my $key ( keys %hash1 ) {
    print "$hash1{$key}\n";
}
foreach my $key ( keys %hash2 ) {
    print "$hash2{$key}\n";
}
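As a hedged illustration of the CSV-module alternative mentioned above (assuming Text::CSV is installed), the same key-building loop for file1 could look like the sketch below; handling of quoted fields and embedded commas then comes for free:
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open( my $fh1, "<", $file1 ) || die "Can't open $file1: $!";
while ( my $row = $csv->getline($fh1) ) {
    # join first two columns for key, keep the full line as the value
    my $key = join ",", @{$row}[ 0, 1 ];
    $hash1{$key} = join ",", @$row;
}
close $fh1;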

Perhaps the following will be helpful:
use strict;
use warnings;
my @files = @ARGV;
pop;
my %file1 = map { chomp; /(.+),/; $1 => $_ } <>;
push @ARGV, $files[1];
my %file2 = map { chomp; /(.+),/; $1 => $_ } <>;
print "$files[0]:\n";
print $file1{$_}, "\n" for grep !exists $file2{$_}, keys %file1;
print "\n$files[1]:\n";
print $file2{$_}, "\n" for grep !exists $file1{$_}, keys %file2;
Usage: perl script.pl file1.txt file2.txt
Output on your datasets:
file1.txt:
cat,val 1,43432
file2.txt:
cat,val 3,22
bird,output,9999
This builds a hash for each file. The keys are the first two columns and the associated values are the full lines. grep is used to filter out the shared keys.
Edit: On relatively small files, using map as above to process the file's lines works fine. However, a list of all the file's lines is first created and then passed to map. On larger files, it may be better to use a while (<>) { ... } construct to read one line at a time. The code below does this, generating the same output as above, and uses a hash of hashes (HoH). Because it uses a HoH, you'll note some dereferencing:
use strict;
use warnings;
my %hash;
my @files = @ARGV;
while (<>) {
    chomp;
    $hash{$ARGV}{$1} = $_ if /(.+),/;
}
print "$files[0]:\n";
print $hash{ $files[0] }{$_}, "\n"
    for grep !exists $hash{ $files[1] }{$_}, keys %{ $hash{ $files[0] } };
print "\n$files[1]:\n";
print $hash{ $files[1] }{$_}, "\n"
    for grep !exists $hash{ $files[0] }{$_}, keys %{ $hash{ $files[1] } };

I think the above problem can be solved by either of the following algorithms:
a) Use a hash, as described above.
b) Sort both files by key1 and key2 (using a sort function), then walk them together (a rough Perl sketch follows this outline):
    Iterate through FILE1
        Match the key1 and key2 entry of FILE1 against FILE2
        If they match:
            print the common line to the desired file as required
            move to the next row in FILE1 (continue with the loop)
        If they do not match:
            iterate through FILE2, starting from POS-FILE2, until a match is found
                Match the key1 and key2 entry of FILE1 against FILE2
                If they match:
                    print the common line to the desired file as required
                    set FILE2-END to true
                    exit from the inner loop, noting the position in FILE2
                If they do not match:
                    print the unmatched line to the desired file as required
                    move to the next row in FILE2
    If FILE2-END is true
        the rest of the lines in FILE1 do not exist in FILE2
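A rough, minimal Perl sketch of approach (b), assuming the chomped lines of the two files are already in @lines1 and @lines2 and that the key is the first two comma-separated columns (the variable names here are illustrative):
# Sort both line lists by their two-column key, then walk them in step.
my $key_of = sub { join ",", ( split ",", $_[0] )[ 0, 1 ] };
my @s1 = sort { $key_of->($a) cmp $key_of->($b) } @lines1;
my @s2 = sort { $key_of->($a) cmp $key_of->($b) } @lines2;

my ( $i, $j ) = ( 0, 0 );
while ( $i < @s1 && $j < @s2 ) {
    my $cmp = $key_of->( $s1[$i] ) cmp $key_of->( $s2[$j] );
    if    ( $cmp == 0 ) { $i++; $j++; }                       # common entry
    elsif ( $cmp < 0 )  { print "file1 only: $s1[$i]\n"; $i++; }
    else                { print "file2 only: $s2[$j]\n"; $j++; }
}
# whatever is left over in either list has no match in the other file
print "file1 only: $_\n" for @s1[ $i .. $#s1 ];
print "file2 only: $_\n" for @s2[ $j .. $#s2 ];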

Related

Trying to compare "parts of speech" tags of two files and print the matched tags in a separate file

I am trying to write a Perl program to compare the "parts of speech" tags of two text files and print the matched tags, along with the corresponding words, to a separate file on Windows.
File1:
boy N
went V
loves V
girl N
File2:
boy N
swims V
girl N
loves V
The expected output:
boy N N
girl N N
loves V V
The columns are separated by tabs. The code I have so far:
use strict;
use warnings;

my $filename = 'file1.txt';
open( my $fh, $filename )
    or die "Could not open file '$filename'";

while ( my $row = <$fh> ) {
    chomp $row;
    print "$row\n";
}

my $tagfile = 'file2.txt';
open( my $tg, $tagfile )
    or die "Could not open file '$tagfile'";

while ( my $row = <$tg> ) {
    chomp $row;
    print "$row\n";
}
It's really unclear what you're asking for. But I think this is close.
#!/usr/bin/perl

use strict;
use warnings;

my ( $file1, $file2 ) = @ARGV;

my %words;    # Keep details of the words

while (<>) {    # Read all input files a line at a time
    chomp;
    my ( $word, $pos ) = split;
    $words{$ARGV}{$word}{$pos}++;

    # If we're processing file1 then don't look for a match
    next if $ARGV eq $file1;

    if ( exists $words{$file1}{$word}{$pos} ) {
        print join( ' ', $word, ($pos) x 2 ), "\n";
    }
}
Running it like this:
./pos file1 file2
Gives:
boy N N
girl N N
loves V V
OK, first off, what you want is a hash.
You need to:
read the first file, split each line into "word" and "pos";
save it in a hash;
read the second file, split each line into "word" and "pos";
compare it to the hash you populated, and check that it matches.
Something like this:
#!/usr/bin/env perl

use strict;
use warnings;

# declare our hash:
my %pos_for;

# open the first file
my $filename = 'file1.txt';
open( my $fh, '<', $filename ) or die "Could not open file '$filename'";

while (<$fh>) {
    # remove linefeed from this line.
    # note - both chomp and split default to using $_, which is set by the while loop.
    chomp;

    # split it on whitespace.
    my ( $word, $pos ) = split;

    # record this value in the hash %pos_for
    $pos_for{$word} = $pos;
}
close($fh);

# process second file:
my $tagfile = 'file2.txt';
open( my $tg, '<', $tagfile ) or die "Could not open file '$tagfile'";

while (<$tg>) {
    # remove linefeed from this line.
    chomp;

    # split it on whitespace.
    my ( $word, $pos ) = split;

    # check if this word was in the other file
    if ( defined $pos_for{$word}
        # and that it's the same "pos" value.
        and $pos_for{$word} eq $pos )
    {
        print "$word $pos\n";
    }
}
close($tg);

How to sort CSV file by header?

There are two CSV files I want to compare. However, the order of their headers and rows/values differs.
Here's a simple example:
INPUT FILE1:
NAME,AGE,BDAY
ABC,1,090214
DEF,1,122514
INPUT FILE2:
BDAY,NAME,AGE
122514,DEF,1
090214,ABC,1
INPUT FILE3:
BDAY,NAME,AGE
122514,DEFG,1
090214,ABC,1
Diff FILE1 and FILE2
No diffs.
Diff FILE1 and FILE3
Found diffs in FILE1 and FILE3.
<Any format of diffs is okay.>
I can easily create a perl script for this but before I do, does anyone know if there's an existing script/tool that already does this?
I have tried copying the files from UNIX to Windows, and sorting them using Excel. It works well but I encounter problems saving it.
I also have googled but can't find a reference for this.
Thanks for any inputs.
I think you need some kind of advanced comparison (which requires deeper analysis), so a relational-database approach may be interesting.
In this respect, the module DBD::CSV is helpful. It allows writing SELECT statements, including joins between tables.
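As a minimal sketch of that approach, assuming DBD::CSV is installed and the files live in the current directory with a .csv extension (table and column names follow the sample data above; DBD::CSV's SQL support is limited, so treat this as illustrative): each file is selected back in a fixed column and row order, so the two result sets can be diffed line by line regardless of how the files were laid out.
use strict;
use warnings;
use DBI;

# Each CSV file in f_dir becomes a table named after the file.
my $dbh = DBI->connect( "dbi:CSV:", undef, undef,
    { f_dir => ".", f_ext => ".csv/r", RaiseError => 1 } );

for my $table (qw( file1 file3 )) {
    # Fixed column order in the SELECT, fixed row order via ORDER BY.
    my $sth = $dbh->prepare(
        "SELECT NAME, AGE, BDAY FROM $table ORDER BY NAME, BDAY" );
    $sth->execute;

    print "$table:\n";
    while ( my @row = $sth->fetchrow_array ) {
        print join( ",", @row ), "\n";
    }
}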
Normalize your data
Use Text::CSV to reorder the columns of your CSV file.
Then you can use Perl’s sort or some other utility to reorder the rows of your files.
This also uses Text::Wrap to display the normalized files in a pleasing format:
use strict;
use warnings;
use autodie;

# Setup fake data
my @files;
{
    local $/ = '';    # Paragraph mode
    while (<DATA>) {
        chomp;
        my ( $file, $data ) = split "\n", $_, 2;
        open my $fh, '>', $file;
        print $fh $data, "\n";
        push @files, $file;
    }
}

# Normalize Files by Column Order
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, eol => $/ } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

for my $file (@files) {
    local @ARGV = $file;
    local $^I   = '.bak';

    my @old_order;
    my @new_order;

    while (<>) {
        if ( !$csv->parse($_) ) {
            die "Bad parse $file, line $.: " . $csv->error_diag();
        }
        my @columns = $csv->fields();

        if ( $. == 1 ) {
            @old_order = @columns;
            @new_order = sort @columns;
        }

        my %hash;
        @hash{@old_order} = @columns;

        if ( !$csv->combine( @hash{@new_order} ) ) {
            die "Bad combine $file, line $.: " . $csv->error_diag();
        }
        print $csv->string();
    }

    unlink "$file$^I";    # Optionally delete backup
}

# Normalize Files by Row Order
for my $file (@files) {
    my ( $header, @data ) = do { local @ARGV = $file; <> };
    open my $fh, '>', $file;
    print $fh $header, sort @data;
}

# View Normalized Files
use Text::Wrap;

for my $file (@files) {
    open my $fh, '<', $file;
    print wrap( sprintf( "%-12s", $file ), ' ' x 12, <$fh> ), "\n";
}

__DATA__
file1.csv
NAME,AGE,BDAY
ABC,1,090214
DEF,1,122514

file2.csv
BDAY,NAME,AGE
122514,DEF,1
090214,ABC,1

file3.csv
BDAY,NAME,AGE
122514,DEFG,1
090214,ABC,1
Outputs:
file1.csv   AGE,BDAY,NAME
            1,090214,ABC
            1,122514,DEF

file2.csv   AGE,BDAY,NAME
            1,090214,ABC
            1,122514,DEF

file3.csv   AGE,BDAY,NAME
            1,090214,ABC
            1,122514,DEFG

Match column in 2 CSV files and display output in third file

I have 2 CSV files, file1.csv and file2.csv. I have to pick each row of column 3 in file1 and iterate through column 3 of file2 to find a match; if a match occurs, I must display the complete matched rows (from columns 1, 2 and 3), only from file2.csv, in a third CSV file. My code so far only fetches column 3 from both CSV files. How can I match column 3 of both files and display the matched rows? Please help.
File1:
Comp_Name,Date,Files
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile26;
Component1,2013/04/25,/Com/src2;
File2:
Comp_name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/25,/Com/src2;
Output format:
Comp_Name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/25,/Com/src2;
Code:
use strict;
use warnings;

my $file1 = "C:\\pick\\file1.csv";
my $file2 = "C:\\pick\\file2.csv";
my $file3 = "C:\\pick\\file3.csv";
my $type;
my $type1;
my @fields;
my @fields2;

open( my $fh, '<:encoding(UTF-8)', $file1 )
    or die "Could not open file '$file1' $!";    # Throw error if file doesn't open
while ( my $row = <$fh> )                        # reading each row till end of file
{
    chomp $row;
    @fields = split ",", $row;
    $type   = $fields[2];
    print "\n$type";
}

open( my $fh2, '<:encoding(UTF-8)', $file2 )
    or die "Could not open file '$file2' $!";    # Throw error if file doesn't open
while ( my $row2 = <$fh2> )                      # reading each row till end of file
{
    chomp $row2;
    @fields2 = split ",", $row2;
    $type1   = $fields2[2];
    print "\n$type1";
    foreach ($type) {
        if ( $type eq $type1 ) {
            print $row2;
        }
    }
}
This is not a matter to overcomplicate. I would personally use a module like Text::CSV_XS or, as already mentioned, Tie::Array::CSV here.
If you're having trouble using a module, I suppose this would be an alternative. You can modify it to your wants and needs; I used the data you supplied and got the results you want.
use strict;
use warnings;

open my $fh1, '<', 'file1.csv' or die "failed open: $!";
open my $fh2, '<', 'file2.csv' or die "failed open: $!";
open my $out, '>', 'file3.csv' or die "failed open: $!";

my %hash1 = map { $_ => 1 } <$fh1>;
my %hash2 = map { $_ => 1 } <$fh2>;

close $fh1;
close $fh2;

my @result =
    map  { join ',', $hash1{ $_->[2] } ? () : $_->[0], $_->[1], $_->[2] }
    sort { $a->[1] <=> $b->[1] || $a->[2] cmp $b->[2] || $a->[0] cmp $b->[0] }
    map  { s/\s*$//; [ split /,/ ] } keys %hash2;

print $out "$_\n" for @result;
close $out;
__OUTPUT__
Comp_name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/25,/Com/src2;
This is a job for a hash (my %hash1). Instead of continually opening the files, you can read the contents into hashes:
@fields = split ",", $row;
$type = $fields[2];
$hash1{$type} = $row;
I see you have duplicates too, so the hash entry will be replaced upon duplication.
To handle that, you can store an array of values in the hash:
$hash1{$type} = [] unless $hash1{$type};
push @{ $hash1{$type} }, $row;
Your next problem is how to traverse the arrays inside the hashes.
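A brief sketch of that traversal, assuming each hash value is an array reference of rows as built above (names match the snippets):
# Walk each key, then dereference the array of rows stored under it.
foreach my $type ( keys %hash1 ) {
    foreach my $row ( @{ $hash1{$type} } ) {
        print "$type: $row\n";
    }
}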
Here is an example using my Tie::Array::CSV module. It uses some clever Perl tricks to represent each CSV file as a Perl array of arrayrefs. I use it to make an index of the first file, then to loop over the second file and finally to output to the third.
#!/usr/bin/env perl

use strict;
use warnings;

use Tie::Array::CSV;

tie my @file1,  'Tie::Array::CSV', 'file1'  or die 'Cannot tie file1';
tie my @file2,  'Tie::Array::CSV', 'file2'  or die 'Cannot tie file2';
tie my @output, 'Tie::Array::CSV', 'output' or die 'Cannot tie output';

# setup a match table from file1
my %match = map { ( $_->[-1] => 1 ) } @file1[ 1 .. $#file1 ];

# header
push @output, $file2[0];

# iterate over file2
for my $row ( @file2[ 1 .. $#file2 ] ) {
    next unless $match{ $row->[-1] };    # check for match
    push @output, $row;                  # print to output if match
}
The output I get is different from yours, but I cannot figure out why your output does not include testfile25 and src2.

Parsing a CSV file and Hashing

I am trying to parse a CSV file to read in all the zip codes. I am trying to create a hash where each key is a zip code and the value is the number of times it appears in the file. Then I want to print out the contents as Zip Code - Number. Here is the Perl script I have so far.
use strict;
use warnings;

my %hash = qw(
    zipcode count
);

my $file = $ARGV[0] or die "Need CSV file on command line \n";
open( my $data, '<', $file ) or die "Could not open '$file' $!\n";

while ( my $line = <$data> ) {
    chomp $line;
    my @fields = split ",", $line;
    if ( exists( $hash{ $fields[2] } ) ) {
        $hash{ $fields[1] }++;
    }
    else {
        $hash{ $fields[1] } = 1;
    }
}

my $key;
my $value;
while ( ( $key, $value ) = each(%hash) ) {
    print "$key - $value\n";
}
exit;
You don't say which column your zip code is in, but you are using the third field to check for an existing hash element, and then the second field to increment it.
There is no need to check whether a hash element already exists: Perl will happily create a non-existent hash element and increment it to 1 the first time you access it.
There is also no need to explicitly open any files passed as command line parameters: Perl will open them and read them if you use the <> operator without a file handle.
This reworking of your own program may work. It assumes the zip code is in the second column of the CSV. If it is anywhere else just change ++$hash{$fields[1]} appropriately.
use strict;
use warnings;

@ARGV or die "Need CSV file on command line \n";

my %counts;

while ( my $line = <> ) {
    chomp $line;
    my @fields = split /,/, $line;
    ++$counts{ $fields[1] };
}

while ( my ( $key, $value ) = each %counts ) {
    print "$key - $value\n";
}
Sorry if this is off-topic, but if you're on a system with the standard Unix text processing tools, you could use this command to count the number of occurrences of each value in field #2, and not need to write any code.
cut -d, -f2 filename.csv | sort | uniq -c
which will generate something like this output, where the count is listed first, and the zipcode second:
12 12345
2 56789
34 78912
1 90210

File manipulation in Perl

I have a simple .csv file that I want to extract data out of and write to a new file.
I want to write a script that reads in a file, reads each line, then splits and restructures the columns in a different order; and if a line in the .csv contains 'xxx', it should not output that line to the output file.
I have already managed to read in a file and create a secondary file; however, I am new to Perl and still trying to work out the commands. The following is a test script I wrote to get to grips with Perl, and I was wondering if I could alter this to do what I need:
open( FILE, "c1.csv" )  || die "couldn't open the file!";
open( F1,   ">c2.csv" ) || die "couldn't open the file!";

#print "start\n";

sub trim($);

sub trim($) {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

$a = 0;
$b = 0;

while ( $line = <FILE> ) {
    chop($line);
    if ( $line =~ /xxx/ ) {
        $addr = $line;
        $post = substr( $line, length($line) - 18, 8 );
    }
    $a = $a + 1;
}

print $b;
print " end\n";
Any help is much appreciated.
To manipulate CSV files it is better to use one of the available modules from CPAN. I like Text::CSV:
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, empty_is_undef => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, "<", 'c1.csv' or die "ERROR: $!";
$csv->column_names( 'field1', 'field2' );

while ( my $l = $csv->getline_hr($fh) ) {
    next if ( $l->{'field1'} =~ /xxx/ );
    printf "Field1: %s Field2: %s\n", $l->{'field1'}, $l->{'field2'};
}
close $fh;
If you need to do this only once, and don't need the program later, you can do it with a one-liner:
perl -F, -lane 'next if /xxx/; @n = map { s/(^\s*|\s*$)//g; $_ } @F; print join(",", (map { $n[$_] } qw(2 0 1)));'
Breakdown:
perl -F, -lane
     ^^^ ^    <- split lines at ',' and store the fields in the array @F

next if /xxx/;                          # skip lines that contain xxx

@n = map { s/(^\s*|\s*$)//g; $_ } @F;   # trim spaces from the beginning and end
                                        # of each field and store the result in
                                        # the new array @n

print join(",", (map { $n[$_] } qw(2 0 1)));
                                        # recombine array @n into the new order
                                        # - here 2 0 1 - join the fields with
                                        # commas and print
Of course, for repeated use, or in a bigger project, you should use a CPAN module. The above one-liner has many caveats too; for example, splitting on a bare comma breaks on quoted fields that contain embedded commas.
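For instance, a minimal Text::CSV sketch of the same task (reading c1.csv, dropping lines that contain 'xxx', reordering the columns, and writing c2.csv; the 2, 0, 1 column order simply mirrors the one-liner above):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, eol => "\n" } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $in,  '<', 'c1.csv' or die "Can't open c1.csv: $!";
open my $out, '>', 'c2.csv' or die "Can't open c2.csv: $!";

while ( my $row = $csv->getline($in) ) {
    # skip rows containing 'xxx' in any field
    next if grep { /xxx/ } @$row;

    # reorder the columns (2 0 1, as in the one-liner) and write out
    $csv->print( $out, [ @{$row}[ 2, 0, 1 ] ] );
}

close $in;
close $out;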