How to delete common lines from one of 2 files in Perl?

I have 2 files, a small one and a big one. The small file is a subset of the big one.
For instance:
Small file:
solar:1000
alexey:2000
Big File:
andrey:1001
solar:1000
alexander:1003
alexey:2000
I want to delete all the lines from Big.txt which are also present in Small.txt. In other words, I want to delete the lines in Big file which are common to the small File.
So, I wrote a Perl Script as shown below:
#! /usr/bin/perl
use strict;
use warnings;

my ($small, $big, $output) = @ARGV;

open(BIG, "<$big") || die("Couldn't read from the file: $big\n");
my @contents = <BIG>;
close(BIG);

open(SMALL, "<$small") || die("Couldn't read from the file: $small\n");
while (<SMALL>)
{
    chomp $_;
    @contents = grep !/^\Q$_/, @contents;
}
close(SMALL);

open(OUTPUT, ">>$output") || die("Couldn't open the file: $output\n");
print OUTPUT @contents;
close(OUTPUT);
However, this Perl Script does not delete the lines in Big.txt which are common to Small.txt
In this script, I first open the big file stream and copy the entire contents into the array @contents. Then, I iterate over each entry in the small file, check for its presence in the bigger file, filter the matching line out of @contents, and save the result back into the array.
I am not sure why this script does not work. Thanks

Your script does NOT work because grep uses $_ and takes over (for the duration of the grep) the old value of your $_ from the loop (i.e. the $_ you use in the regex is NOT the variable storing the loop value in the while block; they are named the same, but have different scopes: inside grep, $_ is aliased to each element of @contents in turn).
Use a named variable instead (as a rule, NEVER use $_ for any code longer than 1 line, precisely to avoid this type of bug):
while (my $line = <SMALL>) {
    chomp $line;
    @contents = grep !/^\Q$line/, @contents;
}
However, as Oleg pointed out, a more efficient solution is to read the small file's lines into a hash and then process the big file ONCE, checking the hash contents (I also improved the style a bit, using lexical filehandle variables, the 3-arg form of open and I/O error reporting via $!; feel free to study and reuse):
#! /usr/bin/perl
use strict;
use warnings;
use File::Slurp;

my ($small, $big, $output) = @ARGV;

my @small = read_file($small);
chomp @small;   # drop the newlines so the hash keys match the chomped lines below
my %small = map { ($_ => 1) } @small;

open(my $bfh, "<", $big)    or die "Cannot read $big: $!\n";
open(my $ofh, ">", $output) or die "Cannot write to $output: $!\n";
while (my $line = <$bfh>) {
    chomp $line;
    next if $small{$line};  # skip lines common to both files
    print $ofh "$line\n";
}
close($bfh);
close($ofh);

It doesn't work for several reasons. First, the lines in @contents still have their newlines in. And second, when you grep, the $_ in !/^\Q$_/ is set not to the last line read from the small file, but to each element of the @contents array, effectively making it: for each element in the list, return everything except this element, leaving you with an empty list at the end.
This isn't really a good way to do it: you're reading the big file into memory and then reprocessing it several times. Instead, first read the small file and put every line in a hash. Then read the big file inside a while loop, so you won't waste your memory reading it entirely. On each line, check whether the key exists in the previously populated hash; if it does, go to the next iteration, otherwise print the line. A sketch of this approach is shown below.
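For illustration, here is a minimal sketch of that approach (the filenames small.txt and big.txt stand in for the question's files):

#!/usr/bin/perl
use strict;
use warnings;

my %seen;

# remember every line of the small file
open my $sfh, '<', 'small.txt' or die "Can't read small.txt: $!";
while (my $line = <$sfh>) {
    chomp $line;
    $seen{$line} = 1;
}
close $sfh;

# read the big file once, printing only lines not seen in the small file
open my $bfh, '<', 'big.txt' or die "Can't read big.txt: $!";
while (my $line = <$bfh>) {
    chomp $line;
    next if $seen{$line};
    print "$line\n";
}
close $bfh;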

Here is a small and efficient solution to your problem:
#!/usr/bin/perl
use strict;
use warnings;

my ($small, $big, $output) = @ARGV;
my %diffx;

open my $bfh, "<", $big or die "Couldn't read from the file $big: $!\n";
# load the big file's contents
my @big = <$bfh>;
chomp @big;
# build a lookup table keyed by the big file's lines
@diffx{@big} = ();
close $bfh or die "$!\n";

open my $sfh, "<", $small or die "Couldn't read from the file $small: $!\n";
my @small = <$sfh>;
chomp @small;
# delete the elements that exist in the small file from the lookup table
delete @diffx{@small};
close $sfh;

# print join "\n", keys %diffx;

open my $ofh, ">", $output or die "Couldn't open the file $output for writing: $!\n";
# what is left is the unique lines from the big file
print $ofh join "\n", keys %diffx;
close $ofh;

__END__
P.S. I learned this trick and many others from Perl Cookbook, 2nd Edition. Thanks

Related

Perl copying specific lines of VECT File

I want to copy lines 7-12 of files, like this example .vect file, into another .vect file in the same directory.
I want each line, to be copied twice, and the two copies of each line to be pasted consecutively in the new file.
This is the code I have used so far, and would like to continue using these methods/packages in Perl.
use strict;
use warnings;
use feature qw(say);
# This method works for reading a single file
my $dir = "D:\\Downloads";
my $readfile = $dir . "\\2290-00002.vect";
my $writefile = $dir . "\\file2.vect";

# open a file to read
open(DATA1, "<" . $readfile) or die "Can't open '$readfile': $!";
# open a file to write
open(DATA2, ">" . $writefile) or die "Can't open '$writefile': $!";

# Copy data from one file to another.
while ( <DATA1> ) {
    print DATA2 $_;
}
close( DATA1 );
close( DATA2 );
What would be a simple way to do this using the same opening and closing file syntax I have used above?
Just modify the print line to
print DATA2 $_, $_ if 7 .. 12;
See Range Operators in "perlop - Perl operators and precedence" for details.
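In context, the question's copy loop would then read like this (a sketch reusing the question's own filehandles):

while ( <DATA1> ) {
    # the flip-flop 7 .. 12 is tested against $., the input line number,
    # so each of lines 7-12 is printed twice and everything else is skipped
    print DATA2 $_, $_ if 7 .. 12;
}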
It's worth remembering the Tie::File module, which maps a file line by line to a Perl array and allows you to manipulate text files using simple array operations. It can be slow when working with large amounts of data, but it is ideal for the majority of applications involving regular text files.
Copying a range of lines from one file to another becomes a simple matter of copying an array slice. Remember that the file starts with line one in array element zero, so lines 7 to 12 are at indexes 6 to 11.
This is the Perl code to do what you ask
use strict;
use warnings;
use Tie::File;

chdir 'D:\Downloads' or die $!;

tie my @infile,  'Tie::File', '2290-00002.vect' or die $!;
tie my @outfile, 'Tie::File', 'file2.vect'      or die $!;

@outfile = map { $_, $_ } @infile[6..11];
Nothing else is required. Isn't that neat?

Perl Script can't use Tie::File

I'm trying to run a perl script which uses the Tie::File module.
What it basically is supposed to do is read in all the files from the current directory, cut off the last line of the first document, then the first and last line of every other document and the first line of the last document, then write everything to a new document.
When I'm trying to run my script (which might have some mistakes in it... I'd be happy if someone could correct any they find), I'm getting an error message:
Can't locate object method "TIEARRAY" via package "TIE:File" at script.pl line 28, <$fh> line 7.
I've marked line 28 in the code.
I've installed the latest version of Tie::File and checked with
cpan Tie::File
and
cpan Tie::Array
whether everything is installed; I received "Tie::Array is up to date (v1.06)" and "Tie::File is up to date (v1.00)" from the terminal, so they must be installed correctly.
#!/usr/bin/perl
use Cwd;
use Tie::File;
use Tie::Array;
my $cwd = getcwd();
my $buff = '';
# Get all files in cwd.
# my @files = grep { -f && /\.txt$/ } readdir $cwd;
my @files = grep ( -f, <*.txt> );
# Cut off footer of first (files[0]) file
print 'Opening' . $files[0] . "\n";
use Tie::File;
tie (@lines, Tie::File, $files[0]) or die "can't update $file: $!";
delete $lines[-1];
# Cut off header and footer of $files [1] to $files[-2]
for ($a = 1, $a < $#files-1, $a++){
print 'Opening' . $file . "\n";
use Tie::FILE;
tie (@lines, TIE::File, $files[$a]) or die "can't update $file: $!"; ####this is line 28
delete $lines[0];
delete $lines[-1];
open (FILE, "<", $files[$a]) or die $!;
while (my $line =<FILE>) {
$buff .= $line;
}
close FILE;
}
print 'Opening' . $files[-1] . "\n";
use Tie::FILE;
tie (@lines, TIE::File, $files[-1]) or die "can't update $file: $!";
delete $lines[0];
open (lastfile, "<", $files[-1]) or die "can't open $files[-1]: $!";
while (my $line =<lastfile>) {
$buff .= $line;
}
close lastfile;
# Write the buffer to a new file.
my $allfilename = $cwd.'/Trace.txt';
print 'Writing all files into new file: ' . $allfilename . "\n";
open $outputfile, ">".$allfilename or die $!;
# Write the buffer into the output file.
print $outputfile $buff;
close $outputfile;
Perl module names are case sensitive. The module is called Tie::File, not Tie::FILE or TIE::File.
Your program is frankly a bit of a mess. You seem to be trying things in the hope that they work but without any real reasoning.
I have refactored your code to do what I think you want below. Here are the main changes I have made
You must always add use strict and use warnings to every Perl program you write, and declare all your variables with my as close as possible to their first point of use. Those simple measures alone will save you from a lot of simple errors that you will otherwise overlook
You don't need Tie::Array or Cwd. They are irrelevant to this program
Your tie statement needs a string as the second parameter, so you need to use 'Tie::File' instead of Tie::File
Your output file Trace.txt will be found by the <*.txt> glob, so unless you take measures to specifically exclude it, your program will trim the first and last lines of that file and copy its contents to itself. In my program I have simply checked in the for loop whether the current file name is Trace.txt and skipped it if so
There is no point in accumulating the data in a buffer $buff. You may as well just write the data to the file as you encounter it
The lines in the tied array #lines have no trailing newline, so you will presumably want to add one when you write to the file
As has been discussed in the comments, you are using Tie::FILE and TIE::File as well as the correct Tie::File. And you have written use Tie::File (and its variations) four times in total. Sure, it doesn't stop the program from working, but it is a major indication of foggy thinking, and that you are just throwing statements around in the hope that they make your program work
Using delete on anything other than the last element of an array just sets that element to undef: it doesn't delete it, and all that happens in the tied file is that the text is removed, leaving just a newline. You need to use splice instead (see the short demonstration after the program below)
Separating your files into the first, the last, and the rest is unnecessary and makes your code illegible. In my program below I have used a single loop that removes the first line of the file unless it's the first file, and removes the last line of the file unless it's the last file. It's far easier to read that way
Lastly, I'm not at all sure that you want to remove the first and last lines from the existing files, or if you just want all the data copied to your output file except those lines. I have written my program according to your specification, but bear in mind that the files will get shorter by two lines every time you run it, and that probably isn't the effect you want. If you have a different requirement and can't see how to modify the code to achieve it then please ask another question.
I hope this helps you.
use strict;
use warnings;
use Tie::File;

my @files = grep -f, glob '*.txt';

my $all_filename = 'Trace.txt';
open my $out_fh, '>', $all_filename or die qq{Unable to open "$all_filename" for output: $!};

for my $i ( 0 .. $#files ) {
    my $file = $files[$i];
    next if $file eq $all_filename;
    print "Opening $file\n";
    tie my @lines, 'Tie::File', $file or die qq{Can't update "$file": $!};
    splice @lines, 0, 1 unless $i == 0;
    splice @lines, -1, 1 unless $i == $#files;
    print $out_fh "$_\n" for @lines;
}

close $out_fh;
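As an aside, the difference between delete and splice described above is easy to demonstrate on a plain array; the same behaviour applies to an array tied with Tie::File:

use strict;
use warnings;

my @a = qw( one two three );
delete $a[0];             # element 0 merely becomes undef
print scalar @a, "\n";    # still prints 3

my @b = qw( one two three );
splice @b, 0, 1;          # element 0 is actually removed
print scalar @b, "\n";    # prints 2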

foreach condition working wrong in perl?

My Perl script is below:
@array = qw(one two three four five);
sleep (60);
foreach (@array){
open(new,">>$_.txt");
print new "$_ This is testing\n";
}
sleep(120);
open (new2,">>for2.txt");
print new2 "Hai";
In my script, new files are opened by the foreach loop; five files are opened in total.
But my problem is this: the five new files are created first, yet the file for the last array element is not written while the loop runs. Only when the code after the loop executes does that file get written. For example: the files one, two, three, four and five are created, and "$_ This is testing" is written into every file except the file five; only when the file for2 is created and written does the file five receive its line. How do I change this?
Output to a file handle is buffered, and will normally be flushed only when the buffer is full or when the file handle is closed. Perl closes file handles implicitly if you open another file on the same handle, or when the program terminates.
The solution is to set your file handles to autoflush as shown here. If you are running on an older version of Perl, before Perl 5.14, then you will also need to add use IO::Handle; to the top of your program.
foreach (@array){
    open(new, ">>$_.txt");
    new->autoflush;
    print new "$_ This is testing\n";
}
Update
There are a few things that you should take note of that would improve your Perl programming enormously
You must always use strict and use warnings at the top of every Perl program you write, and declare all your variables as late as possible using my
You should use the three-parameter form of open with lexical file handles, and you must always check that an open call succeeded, otherwise subsequent reads and writes will fail and there is little point in continuing
You should use copious amounts of whitespace to lay out your code better and make it more readable
I would write the code in your question like this
use strict;
use warnings;

my @files = qw( one two three four five );

sleep(60);

for my $name ( @files ) {
    my $file = "$name.txt";
    open my ($new_fh), '>>', $file or die "Unable to open '$file' for appending: $!";
    $new_fh->autoflush;
    print $new_fh "$name This is testing\n";
}

sleep(120);

open my ($fh_new2), '>>', 'for2.txt' or die "Unable to open 'for2.txt' for appending: $!";
print $fh_new2 'Hai';
Your filehandles to files one through four are closed and flushed when you open a subsequent filehandle of the same name. However, the handle for the last file is left open until the end of the script.
One way to fix this is to explicitly close your file handles so they are flushed.
foreach (@array){
    open(new, ">>$_.txt");
    print new "$_ This is testing\n";
    close new;
}
The better solution though is to use more Modern Perl techniques. If you use lexical filehandles, they will be automatically closed when they go out of scope.
use strict;
use warnings;
use autodie;

my @array = qw(one two three four five);

sleep(60);

for (@array) {
    open my $fh, '>>', "$_.txt";
    print $fh "$_ This is testing\n";
}

sleep(120);

open my $fh, '>>', "for2.txt";
print $fh "Hai";

Read specific part of a filehandle in PERL

Hi, I have a large file I would like to read. To save resources, I want to read it slowly, one line at a time. However, I'm wondering if there is a way to read a specific line from a filehandle instead. For example, say I have a test.txt file containing a billion numbers starting with 1. Each number is on a separate line.
1
2
3
...
So now, what I currently do to get, say, line 10 is this:
open (FILE, "< test.txt") or die "$!";
@reads = <FILE>;
print $reads[9];
However, is there a way I can access a certain part of the FILE without reading everything into a big array? Say I want line 10 instead,
something like FILE->[9].
Thanks for helping in advance!
Two methods: do line-by-line processing, or skip to the desired line. You can use the Input Line Number variable, $., to help:
use strict;
use warnings;
use autodie;

my $line10 = sub {
    open my $fh, '<', 'text.txt';
    while (<$fh>) {
        return $_ if $. == 10;
    }
}->();
Alternatively, you could use Tie::File as you already noticed. However, while that interface is very convenient, and I'd recommend its use, it will also loop through the file behind the scenes.
use strict;
use warnings;
use autodie;
use Tie::File;
tie my #array, 'Tie::File', 'text.txt' or die "Can't open text.txt: $!";
print $array[9] // die "Line 10 does not exist";
For memory purposes, large files should be read in using a while loop, which will read the file line by line:
open my $fh, '<', 'somefile.txt';
while ( my $line = <$fh> ) {
    # read in the text line by line
}
Either way, to get at that line number you are going to have to read through the file up to that point. I would recommend using the while loop and a counter to print or save the line you are looking for, as sketched below.
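Here is a minimal sketch of that approach, assuming the question's test.txt and a hard-coded target line number:

use strict;
use warnings;

my $target = 10;
open my $fh, '<', 'test.txt' or die "Can't open test.txt: $!";
while ( my $line = <$fh> ) {
    if ( $. == $target ) {    # $. holds the current input line number
        print $line;
        last;                 # stop reading as soon as the line is found
    }
}
close $fh;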

Perl File Handling

Below is the Perl script that I wrote today. It reads the content from one file and writes it to another file. It works, but not completely.
#---------------------------------------------------------------------------
#!/usr/bin/perl
open IFILE, "text3.txt" or die "File not found";
open OFILE, ">text4.txt" or die "File not found";
my $lineno = 0;
while(<IFILE>)
{
@var=<IFILE>;
$lineno++;
print OFILE "#var";
}
close(<IFILE>);
close(<OFILE>);
#---------------------------------------------------------------------------
The issue is, it reads and writes contents, but not all of them.
text3.txt has four lines. The above script reads only from the second line onwards and writes to text4.txt. So, finally, I get only three lines (lines 2 to 4) of text3.txt.
What is wrong with the above program? I don't have any idea how to check the execution flow of Perl scripts. Kindly help me.
I'm completely new to Programming. I believe, learning all these would help me in changing my career path.
Thanks in Advance,
Vijay
<IFILE> reads one line from IFILE (only one because it's in scalar context). So while(<IFILE>) reads the first line, then the <IFILE> in list context within the while block reads the rest. What you want to do is:
# To read each line one by one:
while (!eof(IFILE)) {      # check if end of file is reached instead of reading a line
    my $line = <IFILE>;    # scalar context, reads only one line
    print OFILE $line;
}

# Or to read the whole file at once:
my @content = <IFILE>;     # list context, reads the whole file
print OFILE @content;
The problem is that this line...
while(<IFILE>)
...reads one line from text3.txt, and then this line...
@var=<IFILE>;
...reads ALL of the remaining lines from text3.txt.
You can do it either way, by looping with while or all at once with @var=<IFILE>, but trying to do both won't work.
This is how I would have written the code in your question.
#!/usr/bin/perl
use warnings;
use strict;
use autodie;

# don't need to use "or die ..." when using the autodie module
open my $input, '<', 'text3.txt';
open my $output, '>', 'text4.txt';

while (<$input>){
    my $lineno = $.;
    print {$output} $_;
}

# both files get closed automatically when they go out of scope
# so no need to close them explicitly
I would recommend always putting use strict and use warnings at the beginning of all Perl files. At least until you know exactly why it is recommended.
I used autodie so that I didn't have to check the return value of open manually. ( autodie was added to Core in version 5.10.1 )
I used the three argument form of open because it is more robust.
It is important to note that while (<$input>){ ... } gets transformed into while (defined($_ = <$input>)){ ... } by the compiler. Which means that the current line is in the $_ variable.
I also used the special $. variable to get the current line number, rather than trying to keep track of the number myself.
There are a couple of things you might want to think about. If you are strictly copying a file, you could use the File::Copy module, as sketched below.
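A straight copy would then look something like this (a sketch using the question's filenames):

use strict;
use warnings;
use File::Copy;

# copy() returns true on success and sets $! on failure
copy('text3.txt', 'text4.txt') or die "Copy failed: $!";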
If you are going to process the input before writing it out, you might also consider whether you want to keep both files open at the same time or instead read the whole content of the first file (into memory) first, and then write it to the outfile.
This depends on what you are doing underneath. Also, if you have a huge binary file, each "line" in the while loop might end up huge, so if memory is indeed an issue you might want to use more low-level, stream-based reading; more info on I/O: http://oreilly.com/catalog/cookbook/chapter/ch08.html
My suggestion would be to use the cleaner, PBP-suggested way:
#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);

my $in_file = 'text3.txt';
my $out_file = 'text4.txt';

open my $in_fh, '<', $in_file or die "Unable to open '$in_file': $OS_ERROR";
open my $out_fh, '>', $out_file or die "Unable to open '$out_file': $OS_ERROR";

while (<$in_fh>) {
    # $_ is automatically populated with the current line
    print { $out_fh } $_ or die "Unable to write to '$out_file': $OS_ERROR";
}

close $in_fh or die "Unable to close '$in_file': $OS_ERROR";
close $out_fh or die "Unable to close '$out_file': $OS_ERROR";
OR just print out the whole in-file directly:
#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);

my $in_file = 'text3.txt';
my $out_file = 'text4.txt';

open my $in_fh, '<', $in_file or die "Unable to open '$in_file': $OS_ERROR";
open my $out_fh, '>', $out_file or die "Unable to open '$out_file': $OS_ERROR";

local $INPUT_RECORD_SEPARATOR; # slurp mode, read in all content at once, see: perldoc perlvar
print { $out_fh } <$in_fh> or die "Unable to write to '$out_file': $OS_ERROR";

close $in_fh or die "Unable to close '$in_file': $OS_ERROR";
close $out_fh or die "Unable to close '$out_file': $OS_ERROR";
In addition, if you just want to apply a regular expression or similar to a file quickly, you can look into the -i switch of the perl command: perldoc perlrun
perl -p -i.bak -e 's/foo/bar/g' text3.txt; # replace all foo with bar in text3.txt and save original in text3.txt.bak
When you're closing the files, use just
close(IFILE);
close(OFILE);
When you surround a file handle with angle brackets like <IFILE>, Perl interprets that to mean "read a line of text from the file inside the angle brackets". Instead of reading from the file, you want to close the actual file itself here.