Perl script find and replace not working? - perl

I am trying to create a script in Perl to replace text in all HTML files in a given directory. However, it is not working. Could anyone explain what I'm doing wrong?
my #files = glob "ACM_CCS/*.html";
foreach my $file (#files)
{
open(FILE, $file) || die "File not found";
my #lines = <FILE>;
close(FILE);
my #newlines;
foreach(#lines) {
$_ =~ s/Authors Here/Authors introduced this subject for the first time in this paper./g;
#$_ =~ s/Authors Elsewhere/Authors introduced this subject in a previous paper./g;
#$_ =~ s/D4-/D4: Is the supporting evidence described or cited?/g;
push(#newlines,$_);
}
open(FILE, $file) || die "File not found";
print FILE #newlines;
close(FILE);
}
For example, I'd want to replace "D4-" with "D4: Is the...", etc. Thanks, I'd appreciate any tips.

You are using the two argument version of open. If $file does not start with "<", ">", or ">>", it will be opened as read filehandle. You cannot write to a read file handle. To solve this, use the three argument version of open:
open my $in, "<", $file or die "could not open $file: $!";
open my $out, ">", $file or die "could not open $file: $!";
Also note the use of lexical filehandles ($in) instead of the bareword file handles (FILE). Lexical filehandles have many benefits over bareword filehandles:
They are lexically scoped instead of global
They close when they go out of scope instead of at the end of the program
They are easier to pass to functions (ie you don't have to use a typeglob reference).
You use them just like you would use a bareword filehandle.
Other things you might want to consider:
use the strict pragma
use the warnings pragma
work on files a line or chunk at a time rather than reading them in all at once
use an HTML parser instead of regex
use named variables instead of the default variable ($_)
if you are using the default variable, don't include it where it is already going to be used (eg s/foo/bar/; instead of $_ =~ s/foo/bar/;)
Number 4 may be very important for what you are doing. If you are not certain of the format these HTML files are in, then you could easily miss things. For instance, "Authors Here" and "Authors\nHere" means the same thing to HTML, but your regex will miss the later. You might want to take a look at XML::Twig (I know it says XML, but it handles HTML as well). It is a very easy to use XML/HTML parser.

Related

Open a file and overwrite the file with adjustments and no backup

I have the following three lines:
rename($file_path, $file_fh.'.bak');
open( my $file_IN_fh, '<' , $file_path.'.bak') || die "die message";
open( my $file_OUT_fh, '>' , $file_path) || die "die message";
It works great. It allows me to go through the in file while(<$file_IN_fh>), make a bunch of changes with a script (s///g, if() to determine if the line stays or not, etc), and write to the out file. In the end I get my edited file and the file name is unchanged.
My issue is that I am at a point where I no longer (currently) want the backup files, so I want to replace the code with something similar that won't create the backup file, and comment back and forth the three lines over the years if my needs change.
How do I do this kind of editing in place not from the command line?
One basic way is to read the file line by line and write desired output lines to a temporary file, which is then renamed so to overwrite the original.
use File::Copy qw(move);
open my $fh, '<', $file or die "Can't open $file: $!";
open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
while (<$fh>) {
next if /line_to_skip/;
s/patt/repl/g;
print $fh_out $_;
}
close $_ for ($fh, $fh_out);
move ($outfile, $file) or die "Can't move $outfile to $file: $!";
This is what is normally done by tools that edit files "in place" (with additional safety, checks, and flexibility). Since the $outfile is temporary use File::Temp.
Add checks when close-ing files.
Note that this changes the file's inode number, which may matter for some applications.†
If the file isn't huge you can simplify this and read it in first
open my $fh, '<', $file or die "Can't open $file: $!";
my #lines = <$fh>;
open $fh, '>', $file or die "Can't open $file for writing: $!";
for (#lines) {
next if /line_to_skip/;
s/patt/repl/g;
print $fh_out $_;
}
close $fh;
what preserves the inode number, since > mode truncates the existing inode data.
† If this is indeed a problem, you can still keep the same inode. After the temporary file is written, open it for reading and open the original file for writing; that truncates the contents of that inode. Then copy the temporary file to the original. Close handles and delete the temporary file.
If the file is huge, then I'd question why you'd want to avoid the temporary file. Otherwise, I'd suggest just loading the file into memory, make modifications, then write it back out.
use File::Slurp qw( read_file write_file );
my $in = read_file($qfn, array_ref => 1);
my #out;
while (defined( $_ = shift(#$in) )) {
s/a/b/g; # For example.
push #out, $_ if /c/; # For example.
}
write_file($qfn, \#out);
I avoided using expensive splice by using two arrays.
Note that using Tie::File might save one line of code, but this will be 30x faster[1], and probably use less memory (despite memory-saving being Tie::File's goal). Tie::File is never the answer!!!
This is not necessarily representative of all Tie::File uses, but I have indeed timed Tie::File taking 30x longer than the alternative at some basic task. That means that 2 seconds worth of work would have taken 1 minute with Tie::File!
Take a look at the Tie::File module. It is a core module and so shouldn't need installing, and the code is as simple as
use Tie::File;
tie my #file, 'Tie::File', $filepath or die $!;
Thereafter the array #file will hold the contents of the file, one line per element, and any changes to the array will be reflected in the file. All array operations such as push, splice, etc. will work fine
Note that line one of the file is in element zero of the array etc.

Replace multiple lines in text file

I have text files containing the text below (amongst other text)
DIFF_COEFF= 1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,4.000e+05,
and I need to replace it with the following text:
DIFF_COEFF= 2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,8.000e+05,
Each line above corresponds to a new line in the text file.
After some googling, I thought making use of Perl in the following might work, but it did not. I got the error message
Illegal division by zero at -e line 1, <> chunk 1
s_orig='DIFF_COEFF=*4.000e+05,'
s_new='DIFF_COEFF= 2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,8.000e+05,'
perl -0 -i -pe "s:\Q${s_orig}\E:${s_new}:/igs" file.txt
Does anyone here know the right way to do this?
Edit - some more details: the text after this block is "DIFF_COEFF_Q=" followed by the same set of numbers, so I need to search for and replace the specific lines shown. The text files are not very large in size.
Copy the file over to a new one, except that within the range of text between these markers drop the replacement text instead. Then move that file to replace the original, as it may be needed judging by the attempted perl -0 -i in the question.
Note that when changing a file we have to build new content and then replace the file. There are a few ways to do this and modules that make it easier, shown further below.
The code below uses the range operator and the fact that it returns the counter for lines within the range, 1 for the first and the number ending with E0 for the last. So we don't copy lines inside that region while we write the replacement text (and the post-region-end marker) on the last line.
I consider the region of interest to end right before DIFF_COEFF_Q= line, per the question edit.
use warnings;
use strict;
use feature 'say';
use File::Copy 'move';
my $replacement = "replacement text";
my $file = 'input.txt';
my $out_file = 'new_' . $file;
open my $fh_out, '>', $out_file or die "Can't open $out_file: $!";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
if (my $range_cnt = /^\s*DIFF_COEFF\s*=/ .. /^\s*DIFF_COEFF_Q\s*=/) #/
{
if ($range_cnt =~ /E0$/)
{
print $fh_out $replacement; # may need a newline
print $fh_out $_;
}
}
else {
print $fh_out $_;
}
}
close $fh or die "Can't close $file: $!"; # don't overwrite original
close $fh_out or die "Can't close $out_file: $!"; # if there are problems
#move $out_file, $file or die "Can't move $file to $out_file: $!";
Uncomment the move line once this has been tested well enough on your actual files, if you want to replace the original. You may or may not need a newline after $replacement, depending on it.
An alternative is to use flags for entering/leaving that range. But this won't be cleaner since there are two distinct actions, to stop copying when entering the range and write replacement when leaving. Thus multiple flags need be set and checked, what may end up messier.
If the files can't ever be huge it is simpler to read and process the file in memory. Then open the same file for writing and dump the new content
my $text = do { # slurp file into a scalar
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>
};
$text =~ s/^\s*DIFF_COEFF\s*=.*?(\n\s*DIFF_COEFF_Q)/$replacement$1/ms;
# Change $out_file to $file to overwrite
open my $fh_out, '>', $out_file or die "Can't open $out_file: $!";
print $fh_out $text;
Here /m modifier is for multiline mode in which we can use ^ for the beginning of a line (not the whole string), what is helpful here. The /s makes . match a newline, too. Also note that we can slurp a file with Path::Tiny as simply as: my $text = path($file)->slurp;
Another option is to use Path::Tiny, which in newer versions has edit and edit_lines methods
use Path::Tiny;
# NOTE: edits $file in place (changes it)
path($file)->edit(
sub { s/DIFF_COEFF=.*?(\n\s*DIFF_COEFF_Q)/$replacement$1/s }
);
For more on this see, for example, this post and this post and this post.
The first and last way change the inode number of the file. See this post if that is a problem.
It's an interesting error that you've made and I can see what has led you to make it. But I don't think I've ever seen anyone else make the same mistake :-)
Your substitution statement is this:
s:\Q${s_orig}\E:${s_new}:/igs
So you've decided to use : as the delimiter of the substitution operator. But you want to use the options i, g and s and everywhere you've seen people talk about options on a substitution operator, they talk about using / to introduce the options. So you've added /igs to your substitution operator.
But what you've missed (and I completely understand why) is that the / that comes before the options is actually the closing delimiter of the standard, s/.../.../, version of the substitution operator. If you change the delimiter (as you have done) then your altered closing delimiter is all you need.
In your case, Perl doesn't expect the / as it has already seen the closing delimiter. It, therefore, decides that the / is a division operator and tries to divide the result of your substitution by igs. It interprets igs as zero and you get your error.
The fix is to remove that / so:
s:\Q${s_orig}\E:${s_new}:/igs
becomes:
s:\Q${s_orig}\E:${s_new}:igs

Perl File Handling

The below is the Perl script that I wrote today. This reads the content from one file and writes on the other file. It works but, not completely.
#---------------------------------------------------------------------------
#!/usr/bin/perl
open IFILE, "text3.txt" or die "File not found";
open OFILE, ">text4.txt" or die "File not found";
my $lineno = 0;
while(<IFILE>)
{
#var=<IFILE>;
$lineno++;
print OFILE "#var";
}
close(<IFILE>);
close(<OFILE>);
#---------------------------------------------------------------------------
The issue is, it reads and writes contens, but not all.
text3.txt has four lines. The above script reads only from second line and writes on text4.txt. So, finally I get only three lines (line.no 2 to line.no 4) of text3.txt.
What is wrong with the above program. I don't have any idea about how to check the execution flow on Perl scripts. Kindly help me.
I'm completely new to Programming. I believe, learning all these would help me in changing my career path.
Thanks in Advance,
Vijay
<IFILE> reads one line from IFILE (only one because it's in scalar context). So while(<IFILE>) reads the first line, then the <IFILE> in list context within the while block reads the rest. What you want to do is:
# To read each line one by one:
while(!eof(IFILE)) { # check if end of file is reached instead of reading a line
my $line = <IFILE>; # scalar context, reads only one line
print OFILE $line;
}
# Or to read the whole file at once:
my #content = <IFILE>; # list context, read whole file
print OFILE #content;
The problem is that this line...
while(<IFILE>)
...reads one line from text3.txt, and then this line...
#var=<IFILE>;
...reads ALL of the remaining lines from text3.txt.
You can do it either way, by looping with while or all at once with #var=<IFILE>, but trying to do both won't work.
This is how I would have written the code in your question.
#!/usr/bin/perl
use warnings;
use strict;
use autodie;
# don't need to use "or die ..." when using the autodie module
open my $input, '<', 'text3.txt';
open my $output, '>', 'text4.txt';
while(<$input>){
my $lineno = $.;
print {$output} $_;
}
# both files get closed automatically when they go out of scope
# so no need to close them explicitly
I would recommend always putting use strict and use warnings at the beginning of all Perl files. At least until you know exactly why it is recommended.
I used autodie so that I didn't have to check the return value of open manually. ( autodie was added to Core in version 5.10.1 )
I used the three argument form of open because it is more robust.
It is important to note that while (<$input>){ ... } gets transformed into while (defined($_ = <$input>)){ ... } by the compiler. Which means that the current line is in the $_ variable.
I also used the special $. variable to get the current line number, rather than trying to keep track of the number myself.
There is a couple of questions you might want to think about, if you are strictly copying a file you could use File::Copy module.
If you are going to process the input before writing it out, you might also consider whether you want to keep both files open at the same time or instead read the whole content of the first file (into memory) first, and then write it to the outfile.
This depends on what you are doing underneath. Also if you have a huge binary file, each line in the while-loop might end up huge, so if memory is indeed an issue you might want to use more low-level stream-based reading, more info on I/O: http://oreilly.com/catalog/cookbook/chapter/ch08.html
My suggestion would be to use the cleaner PBP suggested way:
#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);
my $in_file = 'text3.txt';
my $out_file = 'text4.txt';
open my $in_fh, '<', $in_file or die "Unable to open '$in_file': $OS_ERROR";
open my $out_fh, '>', $out_file or die "Unable to open '$out_file': $OS_ERROR";
while (<$in_fh>) {
# $_ is automatically populated with the current line
print { $out_fh } $_ or die "Unable to write to '$out_file': $OS_ERROR";
}
close $in_fh or die "Unable to close '$in_file': $OS_ERROR";
close $out_fh or die "Unable to close '$out_file': $OS_ERROR";
OR just print out the whole in-file directly:
#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);
my $in_file = 'text3.txt';
my $out_file = 'text4.txt';
open my $in_fh, '<', $in_file or die "Unable to open '$in_file': $OS_ERROR";
open my $out_fh, '>', $out_file or die "Unable to open '$out_file': $OS_ERROR";
local $INPUT_RECORD_SEPARATOR; # Slurp mode, read in all content at once, see: perldoc perlvar
print { $out_fh } <$in_fh> or die "Unable to write to '$out_file': $OS_ERROR";;
close $in_fh or die "Unable to close '$in_file': $OS_ERROR";
close $out_fh or die "Unable to close '$out_file': $OS_ERROR";
In addition if you just want to apply a regular expression or similar to a file quickly, you can look into the -i switch of the perl command: perldoc perlrun
perl -p -i.bak -e 's/foo/bar/g' text3.txt; # replace all foo with bar in text3.txt and save original in text3.txt.bak
When you're closing the files, use just
close(IFILE);
close(OFILE);
When you surround a file handle with angle brackets like <IFILE>, Perl interprets that to mean "read a line of text from the file inside the angle brackets". Instead of reading from the file, you want to close the actual file itself here.

How can I create a new file using a variable value as the name in Perl?

Eg:
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
open output,'>$file' or die "Can't open the output file!";
}
This doesn't work. Please suggest a new way.
Everyone here has it right, you are using single quotes in your call to open. Single quotes do not interpolate variables into the quoted string. Double quotes do.
my $foo = 'cat';
print 'Why does the dog chase the $foo?'; # prints: Why does the dog chase the $foo?
print "Why does the dog chase the $foo?"; # prints: Why does the dog chase the cat?
So far, so good. But, the others have neglected to give you some important advice about open.
The open function has evolved over the years, as has the way that Perl works with filehandles. In the old days, open was always called with the mode and the file name combined in the second argument. The first argument was always a global filehandle.
Experience showed that this was a bad idea. Combining the mode and the filename in one argument created security problems. Using global variables, well, is using global variables.
Since Perl 5.6.0 you can use a 3 argument form of open that is much more secure, and you can store your filehandle in a lexically scoped scalar.
open my $fh, '>', $file or die "Can't open $file - $!\n";
print $fh "Goes into the file\n";
There are many nice things about lexical filehandles, but one excellent property is that they are automatically closed when their refcount drops to 0 and they are destroyed. There is no need to explicitly close them.
Something else worth noting is that it is considered by most of the Perl community that it is a good idea to always use the strict and warnings pragmas. Using them helps catch many bugs early in the development process and can be a huge time saver.
use strict;
use warnings;
for my $base ( 10_001..10_003 ) {
my $file = "$base.txt";
print "file: $file\n";
open my $fh,'>', $file or die "Can't open the output file: $!";
# Do stuff with handle.
}
I simplified your code a bit too. I used the range operator to generate your base numbers for the file names. Since we are working with numbers and not strings, I was able to use the _, as the thousands separator to improve readability without impacting the final result. Finally, I used an idiomatic perl for loop instead of the C style for you had.
I hope you find this helpful.
use double quotes: ">$file". single quotes will not interpolate your variable.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
print "file: $file\n";
open $output,">$file" or die "Can't open the output file!";
close($output);
}
The problem is that you're using single quotes for the second argument to open, and single-quoted strings do not interpolate variables mentioned in them. Perl interpreted your code as though you wanted to open a file that really had a dollar sign for the first character of its name. (Check your disk; you should see an empty file named $file there.)
You can avoid the issue by using the three-argument version of open:
open output, '>', $file
Then the file-name argument can't accidentally interfere with the open-mode argument, and there's no unnecessary variable interpolation or concatenation.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable . 'txt';
open output,'>$file' or die "Can't open the output file!";
}
this works
1.txt
2.txt and so on ..
Use a file handle:
my $file = "whatevernameyouwant";
open (MYFILE, ">>$file");
print MYFILE "Bob\n";
close (MYFILE);
print '$file' yields $file, whereas print "$file" yields whatevernameyouwant.
You almost have it right, but there are a couple of issues.
1 - You need to use double quotes around the file you're opening.
open output,">$file" or die[...]
2 - Minor niggles, you don't close the files afterwards.
I'd rewrite your code something like this:
#!/usr/bin/perl
$variable = "1000";
for($i=0; $i<3;$i++) {
$variable++;
$file = $variable."."."txt";
open output,">$file" or die "Can't open the output file!";
}

What is the best way to slurp a file into a string in Perl?

Yes, There's More Than One Way To Do It but there must be a canonical or most efficient or most concise way. I'll add answers I know of and see what percolates to the top.
To be clear, the question is how best to read the contents of a file into a string.
One solution per answer.
How about this:
use File::Slurp;
my $text = read_file($filename);
ETA: note Bug #83126 for File-Slurp: Security hole with encoding(UTF-8). I now recommend using File::Slurper (disclaimer: I wrote it), also because it has better defaults around encodings:
use File::Slurper 'read_text';
my $text = read_text($filename);
or Path::Tiny:
use Path::Tiny;
path($filename)->slurp_utf8;
I like doing this with a do block in which I localize #ARGV so I can use the diamond operator to do the file magic for me.
my $contents = do { local(#ARGV, $/) = $file; <> };
If you need this to be a bit more robust, you can easily turn this into a subroutine.
If you need something really robust that handles all sorts of special cases, use File::Slurp. Even if you aren't going to use it, take a look at the source to see all the wacky situations it has to handle. File::Slurp has a big security problem that doesn't look to have a solution. Part of this is its failure to properly handle encodings. Even my quick answer has that problem. If you need to handle the encoding (maybe because you don't make everything UTF-8 by default), this expands to:
my $contents = do {
open my $fh, '<:encoding(UTF-8)', $file or die '...';
local $/;
<$fh>;
};
If you don't need to change the file, you might be able to use File::Map.
In writing File::Slurp (which is the best way), Uri Guttman did a lot of research in the many ways of slurping and which is most efficient. He wrote down his findings here and incorporated them info File::Slurp.
open(my $f, '<', $filename) or die "OPENING $filename: $!\n";
$string = do { local($/); <$f> };
close($f);
Things to think about (especially when compared with other solutions):
Lexical filehandles
Reduce scope
Reduce magic
So I get:
my $contents = do {
local $/;
open my $fh, $filename or die "Can't open $filename: $!";
<$fh>
};
I'm not a big fan of magic <> except when actually using magic <>. Instead of faking it out, why not just use the open call directly? It's not much more work, and is explicit. (True magic <>, especially when handling "-", is far more work to perfectly emulate, but we aren't using it here anyway.)
mmap (Memory mapping) of strings may be useful when you:
Have very large strings, that you don't want to load into memory
Want a blindly fast initialisation (you get gradual I/O on access)
Have random or lazy access to the string.
May want to update the string, but are only extending it or replacing characters:
#!/usr/bin/perl
use warnings; use strict;
use IO::File;
use Sys::Mmap;
sub sip {
my $file_name = shift;
my $fh;
open ($fh, '+<', $file_name)
or die "Unable to open $file_name: $!";
my $str;
mmap($str, 0, PROT_READ|PROT_WRITE, MAP_SHARED, $fh)
or die "mmap failed: $!";
return $str;
}
my $str = sip('/tmp/words');
print substr($str, 100,20);
Update: May 2012
The following should be pretty well equivalent, after replacing Sys::Mmap with File::Map
#!/usr/bin/perl
use warnings; use strict;
use File::Map qw{map_file};
map_file(my $str => '/tmp/words', '+<');
print substr($str, 100, 20);
use Path::Class;
file('/some/path')->slurp;
{
open F, $filename or die "Can't read $filename: $!";
local $/; # enable slurp mode, locally.
$file = <F>;
close F;
}
This is neither fast, nor platform independent, and really evil, but it's short (and I've seen this in Larry Wall's code ;-):
my $contents = `cat $file`;
Kids, don't do that at home ;-).
use IO::All;
# read into a string (scalar context)
$contents = io($filename)->slurp;
# read all lines an array (array context)
#lines = io($filename)->slurp;
See the summary of Perl6::Slurp which is incredibly flexible and generally does the right thing with very little effort.
For one-liners you can usually use the -0 switch (with -n) to make perl read the whole file at once (if the file doesn't contain any null bytes):
perl -n0e 'print "content is in $_\n"' filename
If it's a binary file, you could use -0777:
perl -n0777e 'print length' filename
Here is a nice comparison of the most popular ways to do it:
http://poundcomment.wordpress.com/2009/08/02/perl-read-entire-file/
Nobody said anything about read or sysread, so here is a simple and fast way:
my $string;
{
open my $fh, '<', $file or die "Can't open $file: $!";
read $fh, $string, -s $file; # or sysread
close $fh;
}
Candidate for the worst way to do it! (See comment.)
open(F, $filename) or die "OPENING $filename: $!\n";
#lines = <F>;
close(F);
$string = join('', #lines);
Adjust the special record separator variable $/
undef $/;
open FH, '<', $filename or die "$!\n";
my $contents = <FH>;
close FH;