Comparing two text files in Perl and outputting the matched result - perl

I want to compare two text files that I generated from a Perl script I wrote.
I want to print out the matched results from those two text files. I tried looking at a couple of questions and answers on Stack Overflow, but it does not work for me. Here is what I have tried.
my $file1 = "Scan1.txt";
my $file2 = "Scan2.txt";
my $OUTPUT = "final_result.txt";
my %results = ();
open FILE1, "$file1" or die "Could not open $file1 \n";
while (my $matchLine = <FILE1>) {
    $results{$matchLine} = 1;
}
close(FILE1);
open FILE2, "$file2" or die "Could not open $file2 \n";
while (my $matchLine = <FILE2>) {
    $results{$matchLine}++;
}
close(FILE2);
open (OUTPUT, ">$OUTPUT") or die "Cannot open $OUTPUT \n";
foreach my $matchLine (keys %results) {
    print OUTPUT $matchLine if $results{$matchLine} ne 1;
}
close OUTPUT;
EXAMPLE OF THE OUTPUT THAT I WANT
FILE1.TXT
data 1
data 2
data 3
FILE2.TXT
data 2
data 1
OUTPUT
data 1
data 2

Your problem is that each key of your hash can now end up in any of the following states:
0 (line not found anywhere),
1 (line found in file1, OR line found once in file2),
2 (line found in file1 and once in file2, OR line found twice in file2),
n (line found in file1 and n-1 times in file2, OR line found n times in file2).
This ambiguity makes your check ($results{$matchLine} ne 1) unreliable. (Note also that ne is a string comparison; for numbers the idiomatic operator is !=.)
The minimal required change to your algorithm would be:
my $file1 = "Scan1.txt";
my $file2 = "Scan2.txt";
my $OUTPUT = "final_result.txt";
my %results = ();
open FILE1, "$file1" or die "Could not open $file1 \n";
while (my $matchLine = <FILE1>) {
    $results{$matchLine} = 1;
}
close(FILE1);
open FILE2, "$file2" or die "Could not open $file2 \n";
while (my $matchLine = <FILE2>) {
    $results{$matchLine} = 2 if $results{$matchLine}; # Only when already found in file1
}
close(FILE2);
open (OUTPUT, ">$OUTPUT") or die "Cannot open $OUTPUT \n";
foreach my $matchLine (keys %results) {
    print OUTPUT $matchLine if $results{$matchLine} ne 1;
}
close OUTPUT;
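For reference, the same intersection can be written with lexical filehandles, three-argument open, and chomp. This is a sketch, not the answer's code: the two files are inlined as in-memory filehandles so it runs as-is; swap the string references for the real Scan1.txt/Scan2.txt paths.

```perl
use strict;
use warnings;

# Stand-ins for Scan1.txt and Scan2.txt so the sketch is self-contained
my $scan1 = "data 1\ndata 2\ndata 3\n";
my $scan2 = "data 2\ndata 1\n";

my %in_file1;
open my $fh1, '<', \$scan1 or die "Could not open file1: $!";
while (my $line = <$fh1>) {
    chomp $line;                 # normalise line endings before comparing
    $in_file1{$line} = 1;
}
close $fh1;

my @matched;
open my $fh2, '<', \$scan2 or die "Could not open file2: $!";
while (my $line = <$fh2>) {
    chomp $line;
    push @matched, $line if $in_file1{$line};   # line present in both files
}
close $fh2;

print "$_\n" for @matched;   # data 2, data 1
```

The chomp matters: without it, a file whose last line lacks a trailing newline would never match the same line elsewhere.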

Related

Perl - Compare two text files and then match only the difference found on the first file

I'm trying to make a script that prints only the lines found in the first file but not in the second file.
For example the first text file contains:
a
b
c
d
While the second file contains:
a
x
y
z
With the script that I'm trying, it prints the difference for both files, which is:
b
c
d
x
y
z
But the result I want, which I can't figure out how to produce, is just:
b
c
d
Here is the code:
use strict;
use warnings;
my $f1 = 'C:\Strawberry\new.raw';
my $f2 = 'C:\Strawberry\orig.raw';
my $outfile = 'C:\Strawberry\mt_deleted.txt';
my %results = ();
open FILE1, "$f1" or die "Could not open file: $! \n";
while (my $line = <FILE1>) {
    $results{$line} = 1;
}
close(FILE1);
open FILE2, "$f2" or die "Could not open file: $! \n";
while (my $line = <FILE2>) {
    $results{$line}++;
}
close(FILE2);
open (OUTFILE, ">$outfile") or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
    print OUTFILE $line if $results{$line} == 1;
}
close OUTFILE;
You need to add chomp, and assign a different value for the keys that come from file2 (so a line seen in file2 can never be mistaken for a file1-only line):
use strict;
use warnings;
my $f1 = 'C:\Strawberry\new.raw';
my $f2 = 'C:\Strawberry\orig.raw';
my $outfile = 'C:\Strawberry\mt_deleted.txt';
my %results = ();
open FILE1, "$f1" or die "Could not open file: $! \n";
while (my $line = <FILE1>) {
    chomp $line;
    $results{$line} = 1;
}
close(FILE1);
open FILE2, "$f2" or die "Could not open file: $! \n";
while (my $line = <FILE2>) {
    chomp $line;
    $results{$line} = 2;
}
close(FILE2);
open (OUTFILE, ">$outfile") or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
    print OUTFILE "$line\n" if $results{$line} == 1;
}
close OUTFILE;
Let's start by counting the number of occurrences of each line in file 2.
my %counts;
while (<$fh2>) {
    chomp;
    ++$counts{$_};
}
To print each line of file 1 not matched by a line in file 2, simply process file 1 line by line, decrementing the count, and printing the line if the count is negative.
while (<$fh1>) {
    chomp;
    say if --$counts{$_} < 0;
}
You said the files could have duplicate lines, but you didn't say how you wanted to handle them. The above handles duplicates as follows:
File 1:
a
a
a
b
c
File 2:
c
a
Output:
a
a
b
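Assembled into a self-contained form, the count-decrement approach looks like this (in-memory filehandles stand in for the two example files; say needs use feature 'say', i.e. Perl 5.10+):

```perl
use strict;
use warnings;
use feature 'say';

# Stand-ins for the duplicate-line example above
my $file2_data = "c\na\n";
my $file1_data = "a\na\na\nb\nc\n";

# Count the occurrences of each line in file 2
open my $fh2, '<', \$file2_data or die $!;
my %counts;
while (<$fh2>) { chomp; ++$counts{$_}; }
close $fh2;

# Each file-2 occurrence "cancels" one file-1 occurrence;
# anything driven below zero is a leftover from file 1
open my $fh1, '<', \$file1_data or die $!;
my @leftover;
while (<$fh1>) {
    chomp;
    push @leftover, $_ if --$counts{$_} < 0;
}
close $fh1;

say for @leftover;   # a, a, b
</imports>
```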
Let's start by forming a lookup table of what's in file 2.
my %seen;
while (<$fh2>) {
    chomp;
    ++$seen{$_};
}
To print each line of file 1 not found in file 2, simply process file 1 line by line, printing each line that is not in the lookup table.
while (<$fh1>) {
    chomp;
    say if !$seen{$_};
}
You said the files could have duplicate lines, but you didn't say how you wanted to handle them. The above handles duplicates as follows:
File 1:
a
a
a
b
c
File 2:
c
a
Output:
b
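The lookup-table variant, assembled the same way into a runnable sketch (in-memory filehandles stand in for the files; say needs Perl 5.10+):

```perl
use strict;
use warnings;
use feature 'say';

# Stand-ins for the two example files
my $file1_data = "a\nb\nc\nd\n";
my $file2_data = "a\nx\ny\nz\n";

# Lookup table of everything in file 2
open my $fh2, '<', \$file2_data or die $!;
my %seen;
while (<$fh2>) {
    chomp;
    ++$seen{$_};
}
close $fh2;

# Keep only the file-1 lines that never appeared in file 2
open my $fh1, '<', \$file1_data or die $!;
my @only_in_file1;
while (<$fh1>) {
    chomp;
    push @only_in_file1, $_ if !$seen{$_};
}
close $fh1;

say for @only_in_file1;   # b, c, d
```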

Getting a "No such file or directory" error in Perl

Here is my code. I am passing the files to a subroutine, but from the subroutine I am not able to open the file; it throws this error:
"Couldn't open
inputFiles/Fundamental.FinancialLineItem.FinancialLineItem.SelfSourcedPublic.SHE.1.2017-01-11-2259.Full.txt
: No such file or directory at Practice_Dubugg.pl line 40."
use strict;
use warnings;
use Getopt::Std;
use FileHandle;

my %opts;
my $optstr = "i:o:";
getopts("$optstr", \%opts);

if ($opts{i} eq '' || $opts{o} eq '') {
    print "usage: perl TextCompare_Fund.pl <-i INPUTFILE> <-o MAPREDUCE OUTPUTFILE>\n";
    die 1;
}

my $inputFilesPath = $opts{i};
my $outputFilesPath = $opts{o};

my @ifiles = `ls $inputFilesPath`;
my @ofiles = `ls $outputFilesPath`;

foreach my $ifile (@ifiles) {
    my $ifile_substr = substr("$ifile", 0, -25);
    foreach my $ofile (@ofiles) {
        my $ofile_substr = substr("$ofile", 0, -21);
        my $result = $ifile_substr cmp $ofile_substr;
        if ($result eq 0) {
            print "$result\n";
            #print "$ifile\n";
            compare($ifile, $ofile)
        }
    }
}
sub compare
{
    my $afile = "$_[0]";
    my $bfile = "$_[1]";
    my $path1 = "$inputFilesPath/$afile";
    my $path2 = "$outputFilesPath/$bfile";

    #open FILE, "<", $path1 or die "$!:$path1";
    open my $infile, "<", $path1 or die "Couldn't open $path1: $!";

    my %a_lines;
    my %b_lines;
    my $count1 = 0;
    while (my $line = <$infile>) {
        chomp $line;
        $a_lines{$line} = undef;
        $count1 = $count1 + 1;
    }
    print "File1 records count : $count1\n";
    close $infile;

    my $file = substr("$afile", 0, -25);
    my $OUTPUT = "/hadoop/user/m6034690/Kishore/$file.comparision_result";
    open my $outfile, "<", $path2 or die "Couldn't open $path2: $!";
    open (OUTPUT, ">$OUTPUT") or die "Cannot open $OUTPUT \n";

    my $count = 0;
    my $count2 = 0;
    while (my $line = <$outfile>) {
        chomp $line;
        $b_lines{$line} = undef;
        $count2 = $count2 + 1;
        next if exists $a_lines{$line};
        $count = $count + 1;
        print OUTPUT "$line \t===> The Line which is selected from file2/arg2 is mismatching/not available in file1\n";
    }
    print "File2 records count : $count2\n";
    print "Total mismatching/unavailable records in file1 : $count\n";
    close $outfile;
    close OUTPUT;
}
Try adding the following lines right after the portion marked by the comment below.
my $afile="$_[0]";
my $bfile="$_[1]";
my $path1="$inputFilesPath/$afile";
my $path2="$outputFilesPath/$bfile";
#comment, add the below chomps right under the above portion.
chomp $path1;
chomp $path2;
My test works now that the path is properly formatted (the trailing newline from ls is gone).
My result:
File1 records count : 1
File2 records count : 1
Total mismatching/unavailable records in file1 : 1
It took a while to figure this issue out; it was confusing because the usage message says -i INPUTFILE and -o OUTPUTFILE, which suggests file paths rather than folder paths. Regardless, the issue should be resolved.
Edit:
An even better option: add the chomp where the ls occurs, so the trailing newlines never enter the arrays at all.
chomp( my @ifiles = `ls $inputFilesPath` );
chomp( my @ofiles = `ls $outputFilesPath` );
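An alternative that sidesteps the newline problem entirely is to let Perl list the directory instead of shelling out to ls, since readdir never appends newlines. A sketch (the demo uses the current directory; in the real script the path would come from -i):

```perl
use strict;
use warnings;

my $inputFilesPath = '.';    # demo path; the real script gets this from -i

# Instead of: my @ifiles = `ls $inputFilesPath`;
opendir my $dh, $inputFilesPath or die "Cannot open $inputFilesPath: $!";
my @ifiles = grep { !/^\.\.?$/ } readdir $dh;   # skip . and ..
closedir $dh;

# No chomp needed: readdir returns bare names with no trailing newline
print scalar(@ifiles), " entries found\n";
```

This also avoids spawning a shell and any quoting problems with unusual path names.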

perl combine specific columns from multiple files

I'd like to create a Perl script that combines columns from multiple files. I have to respect a series of criteria (folder/file structure). I'll try to represent what I have and what I want. I have two folders with a bunch of files. The files inside each folder have the same names.
Folder1: File1, File2, File3, ...
Folder2: File1, File2, File3, ...
Folder1:File1 content looks like this (tab delimited):
aaaaa 233
bbbbb 34
ccccc 853
...
All the other files look like this one, except the numerical values are different. I want to create a single file (a report) that will look like this:
aaaaa value_Folder1:File1 value_Folder2:File1 value_Folder1:File2 value_Folder2:File2 ...
...
It would be nice to have the file name on top of the columns from which the values are coming from (just the file name, the folder is not important).
I have some code evolving, but it's not doing what I want right now! I tried to make it work via loops, but I feel that it might not be the solution... One other problem is that I don't know how to add columns to my report file. In the following code, I just append the value at the end of the file. Even if it's not super nice, here's my code:
#!/usr/bin/perl -w
use strict;
use warnings;

my $outputfile = "/home/duceppemo/Desktop/count.all.txt";
my $queryDir = "/home/duceppemo/Desktop/query_count/";
my $hitDir = "/home/duceppemo/Desktop/hit_count/";

opendir (DIR, "$queryDir") or die "Error opening $queryDir: $!"; # Open the directory containing the files with sequences to look for
my @queryFileNames = readdir (DIR);
opendir (DIR, "$hitDir") or die "Error opening $hitDir: $!"; # Open the directory containing the files with sequences to look for
my @hitFileNames = readdir (DIR);

my $index = 0;
$index++ until $queryFileNames[$index] eq ".";
splice(@queryFileNames, $index, 1);
$index = 0;
$index++ until $queryFileNames[$index] eq "..";
splice(@queryFileNames, $index, 1);
$index = 0;
$index++ until $hitFileNames[$index] eq ".";
splice(@hitFileNames, $index, 1);
$index = 0;
$index++ until $hitFileNames[$index] eq "..";
splice(@hitFileNames, $index, 1);

# counter for query file number opened
my $i = 0;
foreach my $queryFile (@queryFileNames) # adjust the file name according to the subdirectory
{
    $i += 1; # keep track of the file number opened
    $queryFile = $queryDir . $queryFile;
    open (QUERY, "$queryFile") or die "Error opening $queryFile: $!";
    my @query = <QUERY>; # Put the query sequences from the count file into an array
    close (QUERY);
    my $line = 0;
    open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
    foreach my $lineQuery (@query) # look into the query file
    {
        my @columns = split(/\s+/, $lineQuery); # Split each line into a new array at whitespace (including tab)
        if ($i == 1)
        {
            #open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
            print RESULT "$columns[0]\t";
            print RESULT "$columns[1]\n";
            #close (RESULT);
            $line += 1;
        }
        else
        {
            open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
            print RESULT "$columns[1]\n";
            close (RESULT);
            $line += 1;
        }
    }
    $line = 0;
}
close (RESULT);
closedir (DIR);
P.S. Any other advice on code optimisation will be gratefully accepted!
The main problem is that you don't seem to understand what a filehandle is; it's worth researching.
A filehandle is a sort of reference to an open file, and since everything is a file, that can also be a command or a directory.
When you write opendir(DIR, ...), DIR is not a keyword but a filehandle that can have any name. That means your two opendir() calls use the same filehandle, which does not make sense.
It should be more like:
opendir(QDIR, $queryDir) or die "Error opening $queryDir: $!";
my @queryFileNames = readdir(QDIR);
opendir(HDIR, $hitDir) or die "Error opening $hitDir: $!";
my @hitFileNames = readdir(HDIR);
Also, since you should always close every filehandle you open, call close() at the same level, and make sure it will actually be called.
For example, opening the filehandle RESULT inside a loop but closing it only after the loop does not make sense: how many times will you have opened it without closing it?
You probably need to open it once before the loop, and you certainly should not open the same filehandle twice.
In general you want to avoid open/close in loops. You simply open before and close after.
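The open-once pattern the answer describes looks like this (the output path and file list are placeholders, and the per-file column processing is elided):

```perl
use strict;
use warnings;

my $outputfile = "report.txt";   # placeholder output path

# Open once, before the loop over input files...
open my $result, '>', $outputfile or die "Error opening $outputfile: $!";
for my $file ('File1', 'File2', 'File3') {    # placeholder file list
    # ... read $file here and print its columns; one handle, many prints
    print {$result} "columns from $file\n";
}
# ...and close once, after the loop
close $result or die "Error closing $outputfile: $!";
```

Using a lexical handle ($result) instead of a bareword (RESULT) also means the handle cannot silently collide with another open elsewhere in the program.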
That code is doing pretty much what I want:
#!/usr/bin/perl
use strict;
use warnings;

#my $queryDir = "ARGV[0]";
my $queryDir = "C:/Users/Marco/Desktop/query_count/";
opendir (DIR1, "$queryDir") or die "Error opening $queryDir: $!"; # Open the directory containing the files with sequences to look for
my @queryFileName = readdir (DIR1);

#my $hitDir = "ARGV[1]";
my $hitDir = "C:/Users/Marco/Desktop/hit_count/";
opendir (DIR2, "$hitDir") or die "Error opening $hitDir: $!"; # Open the directory containing the files with sequences to look for
my @hitFileName = readdir (DIR2);

my $index = 0;
$index++ until $queryFileName[$index] eq ".";
splice(@queryFileName, $index, 1);
$index = 0;
$index++ until $queryFileName[$index] eq "..";
splice(@queryFileName, $index, 1);
$index = 0;
$index++ until $hitFileName[$index] eq ".";
splice(@hitFileName, $index, 1);
$index = 0;
$index++ until $hitFileName[$index] eq "..";
splice(@hitFileName, $index, 1);

foreach my $queryFile (@queryFileName) # prepend the directory to each query file name
{
    $queryFile = "$queryDir" . $queryFile;
}
foreach my $hitFile (@hitFileName) # prepend the directory to each hit file name
{
    $hitFile = "$hitDir" . $hitFile;
}

my $outputfile = "C:/Users/Marco/Desktop/out.txt";
my %hash;
foreach my $queryFile (@queryFileName)
{
    my $i = 0;
    open (QUERY, "$queryFile") or die "Error opening $queryFile: $!";
    while (<QUERY>)
    {
        chomp;
        my $val = (split /\t/)[1];
        $i++;
        $hash{$i}{$queryFile} = $val;
    }
    close (QUERY);
}
foreach my $hitFile (@hitFileName)
{
    my $i = 0;
    open (HIT, "$hitFile") or die "Error opening $hitFile: $!";
    while (<HIT>)
    {
        chomp;
        my $val = (split /\t/)[1];
        $i++;
        $hash{$i}{$hitFile} = $val;
    }
    close (HIT);
}
open (RESULT, ">>$outputfile") or die "Error opening $outputfile: $!";
foreach my $qfile (@queryFileName)
{
    print RESULT "\t$qfile";
}
foreach my $hfile (@hitFileName)
{
    print RESULT "\t$hfile";
}
print RESULT "\n";
foreach my $id (sort { $a <=> $b } keys %hash) # numeric sort keeps the row order 1, 2, ... 10
{
    print RESULT "$id\t";
    print RESULT "$hash{$id}{$_}\t" foreach (@queryFileName, @hitFileName);
    print RESULT "\n";
}
close (RESULT);

perl match lines from one file to another file then output the current line and the next line to a new file [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
Could any of you modify the code so that the sequence names in file 1 are searched for within file 2, and, on a match, the matching line in file 2 and the line after it are copied to an outfile? Right now the code copies only the matched titles to the outfile, not the next line, which is the sequence. Thanks.
for example:
FILE 1 :
SEQUENCE 1 NAME
SEQUENCE 2 NAME
SEQUENCE 3 NAME
FILE 2:
SEQUENCE 1 NAME
AGTCAGTCAGTCAGTCAGTC
SEQUENCE 2 NAME
AAGGGTTTTCCCCCCAAAAA
SEQUENCE 3 NAME
GGGGTTTTTTTTTTAAAAAC
SEQUENCE 4 NAME
AAGTCCCCCCCCCCAAGGTT
etc.
OUTFILE:
SEQUENCE 1 NAME
AGTCAGTCAGTCAGTCAGTC
SEQUENCE 2 NAME
AAGGGTTTTCCCCCCAAAAA
SEQUENCE 3 NAME
GGGGTTTTTTTTTTAAAAAC
code:
use strict;
use warnings;
my $f1 = 'FILE1.fasta';
open FILE1, "$f1" or die "Could not open file \n";
my $f2 = 'FILE2.fasta';
open FILE2, "$f2" or die "Could not open file \n";
my $outfile = $ARGV[1];
my @outlines;
my $n = 0;
foreach (<FILE1>) {
    my $y = 0;
    my $outer_text = $_;
    seek(FILE2, 0, 0);
    foreach (<FILE2>) {
        my $inner_text = $_;
        if ($outer_text eq $inner_text) {
            print "$outer_text\n";
            push(@outlines, $outer_text);
            $n++;
        }
    }
}
open (OUTFILE, "sequences.fasta") or die "Cannot open $outfile \n";
print OUTFILE @outlines;
close OUTFILE;
For a very large FILE1, the %seen hash could be tied to some kind of DBM storage:
use strict;
use warnings;
my $f1 = 'FILE1.fasta';
open FILE1, "<", $f1 or die $!;
my $f2 = 'FILE2.fasta';
open FILE2, "<", $f2 or die $!;
# my $outfile = $ARGV[1];
open OUTFILE, ">", "sequences.fasta" or die $!;
my %seen;
while (<FILE1>) {
    $seen{$_} = 1;
}
while (<FILE2>) {
    my $next_line = <FILE2>;
    if ($seen{$_}) {
        print OUTFILE $_, $next_line;
    }
}
close OUTFILE;
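A minimal sketch of that tying, using the core SDBM_File module (the database name seen_db is hypothetical); every key stored in %seen then lives on disk instead of in RAM, while the rest of the script stays unchanged:

```perl
use strict;
use warnings;
use Fcntl;        # for the O_RDWR / O_CREAT flags
use SDBM_File;

my %seen;
tie %seen, 'SDBM_File', 'seen_db', O_RDWR | O_CREAT, 0644
    or die "Cannot tie seen_db: $!";

# Populate exactly as before, e.g. while reading FILE1:
$seen{"SEQUENCE 1 NAME"} = 1;

print "found\n" if $seen{"SEQUENCE 1 NAME"};
untie %seen;
```

DB_File or GDBM_File would work the same way and handle larger records, but they require external libraries; SDBM_File ships with Perl.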
I would put the contents of file 2 into a hash, then check if each record from file 1 was in the hash:
#!perl
use strict;
use warnings;
my $f2= 'FILE2.fasta';
open FILE2, "$f2" or die "Could not open file \n";
my $k;
my $v;
my %hash;
while (defined($k = <FILE2>)) {
chomp $k;
$v = <FILE2>;
$hash{$k} = $v;
}
my $f1 = 'FILE1.fasta';
open FILE1, "$f1" or die "Could not open file \n";
open (OUTFILE, ">sequences.fasta") or die "Cannot open seqeneces.fasta\n";
while (<FILE1>) {
chomp;
if (exists($hash{$_})) {
print OUTFILE "$_\n";
print OUTFILE "$hash{$_}\n";
}
}
close OUTFILE;

perl: increasing the counter number every time the script runs

I have a script that compares 2 files and prints out the matching lines. What I want is to add logic that helps me identify how long these devices have been matched. Currently I have added the starting point 1, and I want to increase that number every time the script runs and a line matches.
Example.
inputfile:-########################
retiredDevice.txt
Alpha
Beta
Gamma
Delta
prodDevice.txt
first
second
third
forth
Gamma
Delta
output file :-#######################
final_result.txt
1 Delta
1 Gamma
My objective is to add a counter stamp to each matching line to identify how long "Delta" and "Gamma" have matched. The script runs every week, so every run should add 1; when I audit final_result.txt, the result should look like:
Delta 4
Gamma 3
The result tells me Delta has matched for the last 4 weeks and Gamma for the last 3 weeks.
#! /usr/local/bin/perl
my $ndays = 1;
my $f1 = "/opt/retiredDevice.txt ";
my $f2 = "prodDevice.txt";
my $outfile = "/opt/final_result.txt";
my %results = ();
open FILE1, "$f1" or die "Could not open file: $! \n";
while (my $line = <FILE1>) {
    $results{$line} = 1;
}
close(FILE1);
open FILE2, "$f2" or die "Could not open file: $! \n";
while (my $line = <FILE2>) {
    $results{$line}++;
}
close(FILE2);
open (OUTFILE, ">$outfile") or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
    my $x = $ndays;
    $x++;
    print OUTFILE "$x : ", $line if $results{$line} != 1;
}
close OUTFILE;
Thanks in advance for any help!
Based on your earlier question and comments, perhaps this might work.
use strict;
use warnings;
use autodie;

my $logfile = 'int.txt';
my $f1 = shift || "/opt/test.txt";
my $f2 = shift || "/opt/test1.txt";

my %results;
open my $file1, '<', $f1;
while (my $line = <$file1>) {
    chomp $line;
    $results{$line} = 1;
}
open my $file2, '<', $f2;
while (my $line = <$file2>) {
    chomp $line;
    $results{$line}++;
}

{ ############ added part
    my %c;
    for (keys %results) {
        $c{$_} = $results{$_} if $results{$_} > 1;
    }
    %results = %c;
} ############ end added part

my (%log, $log);
if (-e $logfile) {
    open $log, '<', $logfile;
    while (<$log>) {
        my ($num, $key) = split;
        $log{$key} = $num;
    }
}

open $log, '>', $logfile or die $!;
for my $key (keys %results) {
    my $old = ( $log{$key} || 0 );        # keep old count, or 0 otherwise
    my $new = ( $results{$key} ? 1 : 0 ); # 1 if it exists, 0 otherwise
    print $log $old + $new, " $key\n";
}
Perform this computation in two steps.
Each time you run the comparison between retired and prod, produce an output file that you save with a unique file name, e.g. result-XXX where XXX denotes when you ran the comparison.
Then write a script which iterates over all of the result-XXX files and produces a summary.
I would name the files result-YYYY-MM-DD where YYYY-MM-DD is the date that the comparison was created. Then it will be relatively easy to iterate over a subset of the files (e.g. ones for a certain month).
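A sketch of such a summary pass (the result-YYYY-MM-DD naming and one-device-per-line layout are assumptions; the demo writes two sample weekly files first so it runs as-is):

```perl
use strict;
use warnings;

# Demo data: two weekly result files, as the comparison step would leave them
for my $sample (['result-2013-01-07', "Delta\nGamma\n"],
                ['result-2013-01-14', "Delta\n"]) {
    open my $out, '>', $sample->[0] or die "Cannot write $sample->[0]: $!";
    print {$out} $sample->[1];
    close $out;
}

# Summary: one point per week a device appears in a result file
my %weeks_matched;
for my $file (glob 'result-2013-*') {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $device = <$fh>) {
        chomp $device;
        $weeks_matched{$device}++;
    }
    close $fh;
}

printf "%s %d\n", $_, $weeks_matched{$_} for sort keys %weeks_matched;
# Delta 2
# Gamma 1
```

Because each week is a separate file, re-running the summary is idempotent, and restricting the glob pattern (e.g. result-2013-01-*) gives per-month audits for free.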
Or store the data in a relational database.