Print matching lines without using a hash - perl

I want to match two different files for lines that appear in both files, without using a hash. But I have no clue how to do it. Can you please help?
here is my input:
file1:
delhi
bombay
kolkata
shimla
ghuhati
File2:
london
delhi
jammu
punjab
shimla
output:
delhi
shimla
Stub of code:
#!/usr/bin/perl
use strict;
use warnings;
my @line1 = <file1>;
my @line2 = <file2>;
while (<file1>) {
    # do stuff;
}

You really need a very good reason not to use a hash to solve this problem. Is it a homework assignment?
This will do what you ask. It reads all of file1 into an array, then reads through file2 one line at a time, using grep to check whether the name appears in the array.
use strict;
use warnings;
use autodie;
my @file1 = do {
    open my $fh, '<', 'file1.txt';
    <$fh>;
};
chomp @file1;
open my $fh, '<', 'file2.txt';
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n" if grep { $_ eq $line } @file1;
}
output
delhi
shimla
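One note on the grep above: it compares $line against every element of @file1 even after a match has been found. If your List::Util is recent enough to export any, the read loop can short-circuit at the first hit; a minimal variant using the same @file1:

use List::Util qw(any);

open my $fh, '<', 'file2.txt';
while (my $line = <$fh>) {
    chomp $line;
    # any() stops scanning at the first element for which the block is true
    print "$line\n" if any { $_ eq $line } @file1;
}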

Of course you can do it without using a hash, in a very inefficient way:
my @line1 = <file1>;
my @line2 = <file2>;
foreach (@line1) {
    if (isElement($_, \@line2)) {
        print;
    }
}
sub isElement {
    my ($ele, $arr) = @_;
    foreach (@$arr) {
        if ($ele eq $_) {
            return 1;
        }
    }
    return 0;
}

You don't want to use hashes? Can I use modules?
The easiest way may be to avoid Perl entirely and use the Unix sort and comm utilities:
$ sort -u file1.txt > file1.sort.txt
$ sort -u file2.txt > file2.sort.txt
$ comm -12 file1.sort.txt file2.sort.txt
The comm utility prints three columns: lines found only in the first file, lines found only in the second file, and lines found in both files. The -12 option suppresses columns 1 and 2, giving you only the lines common to both files.
You could use a double loop where you compare each value with each other one, as Lee Duhem did, but that's pretty inefficient: the time taken grows quadratically with the number of items. A hash-free alternative that avoids this is sketched below.
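For the record, there is also an efficient hash-free approach: sort both lists (O(n log n)) and then walk them together in a single merge-style pass. A minimal sketch, assuming the same file1.txt and file2.txt as above and no duplicate lines within a file:

#!/usr/bin/perl
use strict;
use warnings;

my @a = do { open my $fh, '<', 'file1.txt' or die $!; <$fh> };
my @b = do { open my $fh, '<', 'file2.txt' or die $!; <$fh> };
chomp(@a, @b);
@a = sort @a;
@b = sort @b;

# walk both sorted lists in step, advancing whichever side is behind
my ($i, $j) = (0, 0);
while ($i < @a and $j < @b) {
    if    ($a[$i] lt $b[$j]) { $i++ }
    elsif ($a[$i] gt $b[$j]) { $j++ }
    else                     { print "$a[$i]\n"; $i++; $j++ }
}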

Related

Perl : Adding 2 files line by line

I am a beginner in perl, so please bear with me.
I have 2 files:
1
2
3
and
2
4
5
6
I want to create a new file that is the sum of the above 2 files:
output file:
3
6
8
6
What I am doing right now is reading the files as arrays and adding them element by element.
To add the arrays I am using the following:
$asum[@asum] = $array1[@asum] + $array2[@asum] while defined $array1[@asum] or defined $array2[@asum];
But this is giving the following error:
Argument "M-oM-;M-?3" isn't numeric in addition (+) at perl_ii.pl line 30.
Argument "M-oM-;M-?1" isn't numeric in addition (+) at perl_ii.pl line 30.
Use of uninitialized value in addition (+) at perl_ii.pl line 30.
I am using the following code to read files as arrays:
use strict;
use warnings;
my @array1;
open(my $fh, "<", "file1.txt") or die "Failed to open file1\n";
while (<$fh>) {
    chomp;
    push @array1, $_;
}
close $fh;
my @array2;
open(my $fh1, "<", "file2.txt") or die "Failed to open file2\n";
while (<$fh1>) {
    chomp;
    push @array2, $_;
}
close $fh1;
Anyone could tell me how to fix this, or give a better approach altogether?
Here is another Perl solution that makes use of the diamond (<>) file-read operator, which reads the files specified on the command line rather than requiring them to be opened explicitly within the program. (Sorry, I can't find the part of the docs that explains this for a read.)
The command line for this program would look like:
perl myprogram.pl file1 file2 > outputfile
Where file1 and file2 are the 2 input files and outputfile is the file you want to print the results of the addition.
#!/usr/bin/perl
use strict;
use warnings;
my @sums;
my $i = 0;
while (my $num = <>) {
    $sums[$i++] += $num;
    $i = 0 if eof;
}
print "$_\n" for @sums;
Note: $i is reset to zero at the end of file, (in this case after the first file is read). Actually, it is also reset to 0 after the second file is read. This has no effect on the program however, because there are no files to be read after the second file in your example.
The following solution makes the memory footprint of the program independent of the sizes of the files; it depends only on the number of files:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp qw( croak );
use List::Util qw( sum );
use Path::Tiny;
run(@ARGV);
sub run {
    my @readers = map make_reader($_), @_;
    while (my @obs = grep defined, map $_->(), @readers) {
        print sum(@obs), "\n";
    }
    return;
}
sub make_reader {
    my $fname = shift;
    my $fhandle = path( $fname )->openr;
    my $is_readable = 1;
    sub {
        return unless $is_readable;
        my $line = <$fhandle>;
        return $line if defined $line;
        close $fhandle
            or croak "Cannot close '$fname': $!";
        $is_readable = 0;
        return;
    }
}
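This program takes its file names from the command line, so (assuming it is saved as sum_files.pl) a run would look like:
perl sum_files.pl file1.txt file2.txt > outputfile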
You have two different problems with your script now:
First error
Argument "M-oM-;M-?3" isn't numeric in addition (+) at perl_ii.pl line
30
happens because your input files are saved with a UTF-8 BOM, and the first line read from each file still carries the BOM bytes "\xEF\xBB\xBF" (that is what M-oM-;M-? is).
To fix it simply, just re-save the files as plain ASCII/ANSI text. If Unicode is required, then remove these bytes from the first string you read from each file.
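A minimal way to do that, assuming the file is read as raw bytes as in the original code:

open my $fh, '<', 'file1.txt' or die "file1.txt: $!";
my @array1 = <$fh>;
close $fh;
chomp @array1;
# drop the UTF-8 BOM bytes (0xEF 0xBB 0xBF) if the first line starts with them
$array1[0] =~ s/^\xEF\xBB\xBF// if @array1;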
Second error
Use of uninitialized value in addition (+) at perl_ii.pl line 30.
happens because you access the 4th element of the first array, which doesn't exist: the loop runs until the longer of the two input arrays is exhausted. To fix it, guard each element access:
$asum[@asum] = (@asum < @array1 ? $array1[@asum] : 0) + (@asum < @array2 ? $array2[@asum] : 0) while defined $array1[@asum] or defined $array2[@asum];
The logic of reading your two files is the same, and I suggest using a subroutine for that and calling it twice:
#!/usr/bin/env perl
use strict;
use warnings;
my @array1 = read_into_array('file1.txt');
my @array2 = read_into_array('file2.txt');

sub read_into_array
{
    my $filename = shift;
    my @array;
    open(my $fh, "<", $filename) or die "Failed to open $filename: $!\n";
    while (<$fh>) {
        chomp;
        push @array, $_;
    }
    close $fh;
    return @array;
}
But that's just an observation I made and not a solution to your problem. As CodeFuller already said, you should re-save your files as plain ASCII instead of UTF-8.
The second problem, Use of uninitialized value in addition (+), can also be solved with the Logical Defined Or operator // which was introduced in Perl 5.10:
my @asum;
$asum[@asum] = ($array1[@asum] // 0)
             + ($array2[@asum] // 0)
    while defined $array1[@asum] or defined $array2[@asum];
No, this is not a comment, but an operator very similar to ||. The difference is that it triggers when the left-hand-side (lhs) is undef while the || triggers when the lhs is falsy (i.e. 0, '' or undef). Thus
$array1[@asum] // 0
gives 0 if $array1[@asum] is undef. It's the same as
defined($array1[@asum]) ? $array1[@asum] : 0
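A quick illustration of the difference (with throwaway variables):

my $undefined;        # undef
my $zero = 0;         # defined, but false
print $undefined // 'fallback', "\n";   # prints "fallback" because // triggers on undef
print $zero // 'fallback', "\n";        # prints 0 because a defined 0 passes through
print $zero || 'fallback', "\n";        # prints "fallback" because || triggers on any false value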
A different approach altogether:
$ paste -d '+' file1 file2 | sed 's/^+//;s/+$//' | bc
3
6
8
6
The paste command prints the files next to each other, separated by a + sign:
$ paste -d '+' file1 file2
1+2
2+4
3+5
+6
The sed command removes leading and trailing + signs, because those trip up bc:
$ paste -d '+' file1 file2 | sed 's/^+//;s/+$//'
1+2
2+4
3+5
6
And bc finally calculates the sums.
Here’s a rendition of Sinan’s approach in a more Perlish form:
#!/usr/bin/env perl
use 5.010; use strict; use warnings;
use autodie;
use List::Util 'sum';
my @fh = map { open my $fh, '<', $_; $fh } @ARGV;
while ( my @value = grep { defined } map { scalar readline $_ } @fh ) {
    say sum @value;
    @fh = grep { not eof $_ } @fh if @value < @fh;
}

How to find lines containing a match between two files in perl?

I'm a novice at using perl. What I want to do is compare two files. One is my index file that I am calling "temp." I am attempting to use this to search through a main file that I am calling "array." The index file has only numbers in it. There are lines in my array that have those numbers. I've been trying to find the intersection between those two files, but my code is not working. Here's what I've been trying to do.
#!/usr/bin/perl
print "Enter the input file:";
my $filename=<STDIN>;
open (FILE, "$filename") || die "Cannot open file: $!";
my @array = <FILE>;
close(FILE);
print "Enter the index file:";
my $temp = <STDIN>;
open (TEMP, "$temp") || die "Cannot open file: $!";
my @temp = <TEMP>;
close(TEMP);
my %seen = ();
foreach (@array) {
    $seen{$_} = 1;
}
my @intersection = grep($seen{$_}, @temp);
foreach (@intersection) {
    print "$_\n";
}
If I can't use intersection, then what else can I do to move each line that has a match between the two files?
For those of you asking for the main file and the index file:
Main file:
1 CP TRT
...
14 C1 MPE
15 C2 MPE
...
20 CA1 MPE
Index file
20
24
22
17
18
...
I want to put those lines that contain one of the numbers in my index file into a new array. So using this example, only
20 CA1 MPE would be placed into a new array.
My main file and index file are both longer than what I've shown, but that hopefully gives you an idea on what I'm trying to do.
I am assuming something like this?
use strict;
use warnings;
use Data::Dumper;
# creating arrays instead of reading from file just for demo
# based on the assumption that your files are 1 number per line
# and no need for any particular parsing
my @array = qw/1 2 3 20 60 50 4 5 6 7/;
my @index = qw/10 12 5 3 2/;
my @intersection = ();
my %hash1 = map { $_ => 1 } @array;
foreach (@index)
{
    if (defined $hash1{$_})
    {
        push @intersection, $_;
    }
}
print Dumper(\@intersection);
==== Out ====
$VAR1 = [
          '5',
          '3',
          '2'
        ];
A few things:
Always have use strict; and use warnings; in your program. This will catch a lot of possible errors.
Always chomp after reading input. Lines read from a file or STDIN keep their trailing \n; chomp removes it.
Learn a more modern form of Perl.
Use mnemonic variable names. $temp doesn't cut it.
Use spaces to help make your code more readable.
You never stated the errors you were getting. I assume it has to do with the fact that the input from your main file doesn't match your index file.
I build a hash indexed on the first field of each line, extracted with my ($index) = split /\s+/, $line;, so the index file can look lines up directly:
#!/usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
print "Input file name: ";
my $input_file = <STDIN>;
chomp $input_file;    # Chomp Input!
print "Index file name: ";
my $index_file = <STDIN>;
chomp $index_file;    # Chomp Input!
open my $input_fh, "<", $input_file;
my %hash;
while ( my $line = <$input_fh> ) {
    chomp $line;
    #
    # Using split to find the item to index on
    #
    my ($index) = split /\s+/, $line;
    $hash{$index} = $line;
}
close $input_fh;
open my $index_fh, "<", $index_file;
while ( my $index = <$index_fh> ) {
    chomp $index;
    #
    # Now index can look up lines
    #
    if ( exists $hash{$index} ) {
        say qq(Index: $index Line: "$hash{$index}");
    }
    else {
        say qq(Index "$index" doesn't exist in file.);
    }
}
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
@ARGV = 'main_file';
open(my $fh_idx, '<', 'index_file');
chomp(my @idx = <$fh_idx>);
close($fh_idx);
while (defined(my $r = <>)) {
    print $r if grep { $r =~ /^[ \t]*$_/ } @idx;
}
You may wish to replace those hardcoded file names with input from <STDIN>.
FYI: the defined call inside the while condition is optional here, since Perl adds an implicit defined test to while (my $r = <>).

Parsing the large files in Perl

I need to compare a big file (2GB) containing 22 million lines with another file. It was taking too long to process with Tie::File, so I have done it with a plain while loop, but the problem remains. See my code below...
use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT, "<title_Nov19.txt");
# my @arr = <IT>;
# close(IT);
open(RE, ">>res.txt");
open(IN, "<input.txt");
while (my $data = <IN>) {
    chomp($data);
    print "$data\n";
    my $occ = 0;
    open(IT, "<title_Nov19.txt");
    while (my $line2 = <IT>) {
        my $line = $line2;
        chomp($line);
        if ($line =~ m/\b$data\b/is) {
            $occ++;
        }
    }
    print RE "$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
So please help me reduce the run time...
Lots of things wrong with this.
Aside from the usual problems (use warnings commented out, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; and then you can open the first file, input.txt and walk along that once, comparing to things in the array so you don't have to reopen that second file all the time.
Further, I suggest you read some basic articles on style/etc., as your question is likely to gain more attention if it's written to vaguely modern standards.
I tried to build a small example script with a better structure, but I have to say, man, your problem description is really very unclear. It's important not to read the whole comparison file each time, as @LeoNerd explained in his answer. Then I use a hash to keep track of the match count:
#!/usr/bin/env perl
use strict;
use warnings;
# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;
# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();
# compare each line
while (my $title = <$input>) {
    chomp $title;
    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}
# done
close $input;
# output one line per comparison string
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    print $output "$comp\t$count{$comp}\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1
Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
    my %seen;
    !$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
    sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes meta characters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to insure that the input words are only counted once per line.
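As an aside, the \Q...\E quoting is what keeps input words containing regex metacharacters from breaking the alternation. A small illustration with a hypothetical word list:

my @words = ('foo', 'c++');                     # 'c++' contains regex metacharacters
my $regex = join '|', map { quotemeta } @words; # quotemeta is what \Q...\E calls
print "$regex\n";                               # prints: foo|c\+\+
# without the escaping, /foo|c++/ would die with a "Nested quantifiers" error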
Hope this helps!
Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt

Merging two files based on first column and returns multiple values for each key

I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is there are multiple values instead of the one it is returning. Is there a way to loop through the hash to get the 1-10 more values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
The reason I put that last one in is because the second file has a lot of values I don't want, but from file 1 I want all values. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while (<file2>) {
    my ($line) = $_;
    chomp $line;
    my ($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
    $value4 = "$value1|$value2|$value3";
    $file2Hash{$key} = $value4;
}
while (<file1>) {
    my ($line) = $_;
    chomp $line;
    my ($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
    if (exists $file2Hash{$key}) {
        print $line . "|" . $file2Hash{$key} . "\n";
    }
    else {
        print $line . "\n";
    }
}
Thank you for any help you may provide,
Your overall idea is sound. However, in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push @{$file2Hash{$key}}, $value4;
The @{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}) {
    foreach my $second_value (@{$file2Hash{$key}}) {
        print "$line|$second_value\n";
    }
} else {
    print $line . "\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
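Putting those pieces together, a complete version might look like the sketch below (same FILE1.txt/FILE2.txt layout as in the question, with strict reactivated and lexical filehandles):

#!/usr/bin/perl
use strict;
use warnings;

my %file2Hash;

open my $fh2, '<', 'FILE2.txt' or die "FILE2.txt: $!";
while (my $line = <$fh2>) {
    chomp $line;
    my ($key, $rest) = split /\|/, $line, 2;
    push @{ $file2Hash{$key} }, $rest;    # keep every value seen for this key
}
close $fh2;

open my $fh1, '<', 'FILE1.txt' or die "FILE1.txt: $!";
while (my $line = <$fh1>) {
    chomp $line;
    my ($key) = split /\|/, $line, 2;
    if (exists $file2Hash{$key}) {
        print "$line|$_\n" for @{ $file2Hash{$key} };
    }
    else {
        print "$line\n";
    }
}
close $fh1;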
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need, you should just use FILE1.txt to create the hash from (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
    my ($key, $rest) = split /\|/, $_, 2;
    chomp $rest;
    $file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
    my ($key, $rest) = split /\|/, $_, 2;
    if (exists $file1hash{$key}) {
        chomp $rest;
        printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
    }
}
close file2;
exit 0;

Getting unique random line (at each script run) from an text file with perl

Given a text file like the following, called "input.txt":
some field1a | field1b | field1c
...another approx 1000 lines....
fielaNa | field Nb | field Nc
I can choose any field delimiter.
I need a script that, on each discrete run, gets one unique (never repeated) random line from this file, until all lines have been used.
My solution: I added one column to the file, so now I have
0|some field1a | field1b | field1c
...another approx 1000 lines....
0|fielaNa | field Nb | field Nc
and I process it with the following code:
use 5.014;
use warnings;
use utf8;
use List::Util;
use open qw(:std :utf8);
my $file = "./input.txt";
# read all lines into an array and shuffle them
open(my $fh, "<:utf8", $file);
my @lines = List::Util::shuffle map { chomp $_; $_ } <$fh>;
close $fh;
# search for the 1st line that has 0 at the start,
# change the 0 to 1,
# and rewrite the whole file
my $random_line;
for (my $i = 0; $i <= $#lines; $i++) {
    if ( $lines[$i] =~ /^0/ ) {
        $random_line = $lines[$i];
        $lines[$i] =~ s/^0/1/;
        open($fh, ">:utf8", $file);
        print $fh join("\n", @lines);
        close $fh;
        last;
    }
}
$random_line = "1|NO|more|lines" unless $random_line && $random_line =~ /\w/;
do_something_with_the_fields(split /\|/, $random_line);
exit;
It is a working solution, but not a very nice one, because:
the line order changes at each script run
it is not safe for concurrent script runs.
How can I write it more effectively and more elegantly?
What about keeping a shuffled list of the line numbers in a different file, removing the first one each time you use it? Some locking might be needed to ensure concurrent-run safety; a rough sketch of the idea follows.
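Here is a rough sketch of that suggestion (the file names input.txt and indices.txt are assumptions): hold an exclusive flock on the index file for the whole read-modify-write, so two concurrent runs cannot hand out the same line.

use strict;
use warnings;
use Fcntl qw(:flock);
use List::Util 'shuffle';

# open (or create) the index file and lock it for the whole update
open my $idx_fh, '+<', 'indices.txt'
    or open $idx_fh, '+>', 'indices.txt'
    or die "indices.txt: $!";
flock $idx_fh, LOCK_EX or die "flock: $!";

chomp(my @indices = <$idx_fh>);
unless (@indices) {                 # first run (or list exhausted): rebuild it
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    my $count = 0;
    $count++ while <$in>;
    close $in;
    @indices = shuffle 0 .. $count - 1;
}

my $pick = shift @indices;          # remove the first shuffled line number

# rewrite the remaining indices in place, then release the lock
seek $idx_fh, 0, 0;
truncate $idx_fh, 0;
print $idx_fh "$_\n" for @indices;
close $idx_fh;

# fetch line number $pick (0-based) from the data file
open my $in, '<', 'input.txt' or die "input.txt: $!";
while (my $line = <$in>) {
    if ($. == $pick + 1) { print $line; last }
}
close $in;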
From perlfaq5.
How do I select a random line from a file?
Short of loading the file into a database or pre-indexing the lines in
the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file
in. You can find a proof of this method in The Art of Computer
Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
You can use the File::Random module which provides a function for that
algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire
file as an array. Simply access a random array element.
All Perl programmers should take the time to read the FAQ.
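Picking a random element through Tie::File, as the FAQ suggests, is short, although on its own it does not give uniqueness across runs (a minimal sketch, reusing input.txt):

use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
print $lines[ int rand @lines ], "\n";   # one random line; repeats are possible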
Update: To get a unique random line each time you're going to have to store state. The easiest way to store the state is to remove the lines that you've used from the file.
This program uses the Tie::File module to open your input.txt file as well as an indices.txt file.
If indices.txt is empty then it is initialised with the indices of all the records in input.txt in a shuffled order.
Each run, the index at the end of the list is removed and the corresponding input record displayed.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
tie my @input, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
tie my @indices, 'Tie::File', 'indices.txt'
    or die qq(Unable to open "indices.txt": $!);
@indices = shuffle(0..$#input) unless @indices;
my $index = pop @indices;
print $input[$index];
Update
I have modified this solution so that it populates a new indices.txt file only if it doesn't already exist and not, as before, simply when it is empty. That means a new sequence of records can be printed simply by deleting the indices.txt file.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
my ($input_file, $indices_file) = qw( input.txt indices.txt );
tie my @input, 'Tie::File', $input_file
    or die qq(Unable to open "$input_file": $!);
my $first_run = not -f $indices_file;
tie my @indices, 'Tie::File', $indices_file
    or die qq(Unable to open "$indices_file": $!);
@indices = shuffle(0..$#input) if $first_run;
@indices or die "All records have been displayed";
my $index = pop @indices;
print $input[$index];