How to Access an Array in Perl for Regex - perl

I have two inputs reading into my command prompt, the first being a series of words that are to be searched by the program I'm writing, and the second being the file that contains where the words are to be found. So, for instance, my command prompt reads perl WebScan.pl word WebPage000.htm
Now, I have no trouble accessing either of these inputs for printing, but I am having great difficulty accessing the contents of the webpage so I can perform regular expressions to remove html tags and access the content. I realize that there is a subroutine available to do this without regular expressions that is far more effective, but I need to do with with regular expressions :(.
I can access the html file for printing with no trouble:
open (DATA, $ARGV[1]);
my #file = <DATA>;
print #file;
Which prints the entire code of the html page, but I am unable to pass regular expressions in order to remove html blocks. I keep receiving an error that says "Can't modify array dereference in s/// near," which is where I have my specific regular expression. I'm not sure how to get around this- I've tried converting the array into a scalar but then I am unable to access any of the data in the html at all (and no, it doesn't just print the number of values in the array :P)
How do I access the array's contents so I can use regular expressions to refine the desired output?

It sounds like you are doing something like #file =~ s/find/replace/;. You are getting that error because the left hand side of the regex binding operator imposes scalar context on its argument. An array in scalar context returns its length, but this value is read only. So when your substitution tries to perform the replacement, kaboom.
In order to process all of the lines of the file, you could use a foreach loop:
foreach my $line (#file) {$line =~ s/find/replace/}
or more succinctly as:
s/find/replace/ for #file;
However, if you are running regular expressions on an HTML file, chances are you will need them to match across multiple lines. What you are doing above is reading the entire file in, and storing each line as an element of #file. If you use one of Perl's iterative control structures on the array, you will not be able to match multiple lines. So you should instead read the file into a single scalar. You can then use $file =~ s/// as expected.
You can slurp the file into a single variable by temporarily clearing the input record separator $/:
my $file = do {local $/; <DATA>};
In general, regular expressions are the wrong tool for parsing HTML, but it sounds like this is a homework assignment, so in that case its just practice anyway.
And finally, in modern Perl, you should use the three argument form of open with a lexical file handle and error checking:
open my $DATA, '<', $ARGV[1] or die "open error: $!";
my $file = do {local $/; <$DATA>};

Related

Reading a file line by line in Perl

I want to read a file by one line, but it's reading just the first line. How to read all lines?
My code:
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
Let's start by looking at your code.
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
On the first line you open a file named in $file_E using the bareword filehandle file_E. This should work so long as the file successfully opens. It would be better to also check the success of this operation one of two ways: Either put use autodie; at the top of your script (but then risk applying its semantics in places where your code is incompatible with this level of error handling), or change your open to look like this:
open(file_E, $file_E) or die "Failed to open $file_E: $!\n";
Now if you fail to open the file you will get an error message that will help track down the problem.
Next lets look at the while loop, because it's here where you have an issue that is causing the bug you are experiencing. On the first line of the while loop you have this:
while ( <file_E> ) {
By consulting perldoc perlsyn you will see that line is special-cased to actually do this:
while (defined($_ = <file_E>)) {
So your code is implicitly assigning each line to $_ on successive iterations. Also by consulting perldoc perlop you'll find that when the match operator (/.../ or m/.../) is invoked without binding the match explicitly using =~, the match will bind against $_. Still then, so far so good. However, you are not actually doing anything useful with the match. The match operator will return Boolean truth / falsehood for whether or not the match succeeded. And because your pattern contains capturing parenthesis, it will capture something into the capture variable $1. But you are never testing for match success, nor are you ever referring to $1 again.
On the line that follows, you do this: print $line1. Where, in your code, is $line1 being assigned a value? Because it is never being assigned a value in what you've shown us.
I can only guess that your intent is to iterate over the lines of the file, capture the line but without the trailing newline, and then print it. It seems that you wish to print it without any newlines, so that all of the input file is printed as a single line of output.
open my $input_fh_e, '<', $file_E or die "Failed to open $file_E: $!\n";
while(my $line = <$input_fh_e>) {
chomp $line;
print $line;
}
close $input_fh_e or die "Failed to close $file_E: $!\n";
No need to capture anything -- if all that the capture is doing is just grabbing everything up to the newline, you can simply strip off the newline with chomp to begin with.
In my example I used a lexical filehandle (a file handle that is lexically scoped, declared with my). This is generally a better practice in modern Perl as it avoids using a bareword, avoids possible namespace collisions, and assures that the handle will get closed as soon as the lexical scope closes.
I also used the 'three arg' version of open, which is safer because it eliminates the potential for $file_E to be used to open a pipe or do some other nefarious or simply unintended shell manipulation.
I suggest also starting your script with use strict;, because had you done so, you would have gotten an error message at compiletime telling you that $line1 was never declared. Also start your script with use warnings, so that you would get a warning when you try to print $line1 before assigning a value to it.
Most of the issues in your code will be discussed in perldoc perlintro, which you can arrive at from your command line simply by typing perldoc perlintro, assuming you have Perl installed. It typically takes 20-40 minutes to read through perlintro. If ever there were a document that should constitute required reading before getting started writing Perl code, that reading would probably include perlintro.
Another alternative, note that $_ will include newline so you will need to chomp it if you don't want the newline in $line:
open(file_E, $file_E);
while ( <file_E> ) {
my $line = $_;
print $line;
}
close($file_E);

Perl replacement operator doesn't work under Windows when patterns contain slashes

I want to replace a string with a path:
my $somedir = "D:/somedir/someotherdir";
system("perl -pi.bak -e \"s{STRING_TO_BE_REPLACED}{$somedir}\" $file");
but under Windows it replaces string with random symbols instead of slashes.
What's the problem?
I think it's got something to do with a syntax detail needed on Windows, but can't test now.
However, as you are in a Perl script, why go out with system and run another Perl interpreter? It is far more complex and inefficient since it involves a syscall or a shell, and starts another program. Also, it is far harder to get it right -- you need to deal with syntax details, quoting and escaping, for system, your system's command interpreter, the other instance of Perl, and the regex.
The code below reads the whole file into an array first, which is fine if files aren't huge. In general it is better to process a file line by line, and how to do what you need in that way is discussed in detail in a perlfaq5 page. See the comment at the end, with the link.
use warnings 'all';
use strict;
# your code ...
open my $fh, '<', $file or die "Can't open $file: $!";
my #lines = <$fh>;
# Change #lines in-place. See the comment
s/STRING_TO_BE_REPLACED/$somedir/ for #lines;
open $fh, '>', $file or die "Can't open $file for write: $!";
print $fh #lines;
close $fh;
When we open the $fh the second time it is closed and re-opened, so there is no need for an explicit close. When an existing file is opened for writing ('>') it is clobbered, so this replaces it.
It's more to write but it is better.
Comment on the in-place change to #lines This uses the fact that when iterating over an array if we change the index variable, here $_, the change is made in the original element. The index variable is like an alias for the array element. It says in perlsyn
If any element of LIST is an lvalue, you can modify it by modifying VAR inside the loop. Conversely, if any element of LIST is NOT an lvalue, any attempt to modify that element will fail. In other words, the foreach loop index variable is an implicit alias for each item in the list that you're looping over.
This has the benefit of not copying data and not touching elements that don't change so it is more efficient, potentially a lot more. However, it relies on a subtle property and thus it may be tricky and error prone, so I do not recommend it as a general practice.
To copy the array, with modifications, to a new one
my #lines_new;
foreach my $line (#lines) {
$line =~ s{STRING_TO_BE_REPLACED}{$somedir};
push #lines_new, $line;
}
This also changes #lines. If it need be kept intact do (my $new_line = $line) =~ s/.../. Then write #lines_new to $file. Somewhere in between these two is
#lines = map { s{STRING_TO_BE_REPLACED}{$somedir}; $_ } #lines;
what was posted originally. However, since the map changes elements of #lines and copies data to build the output list, while the whole statement also overwrites the array, on reflection I think it makes more sense to do either the in-place change or an explicit copy to a new array.
In principle it is better to not read the whole file at once but rather to process line by line, unless the file is small enough. In that case open the file for reading and new one for writing, and after you copy (with changes) the file over, move the new one to rewrite $file. See the topic in perlfaq5
The copied file is temporary, to be used to overwrite $file, so it can be named using the core module File::Temp to avoid accidents. To move a file use move from the core module File::Copy.

PERL: String Replacement on file

I am working on a script to do a string replacement in a file and I will read the variables and values and files from a configuration file and do string replacement.
Here is my logic to do a string replacement.
sub expansion($$$){
my $f = shift(#_) ; # file Name
my $vname = shift(#_) ; # variable name for pattern match
my $value = shift(#_) ; # value to replace
my $n = "$f".".new";
open ( O, "<$f") or print( "Can't open $f file: $!");
open ( N ,">$n" ) or print( "Can't open $n file: $!");
while (<O>)
{
$_ =~ s/$vname/$value/g; #check for pattern
print N "$_" ;
}
close (O);
close (N);
}
In my logic am reading line by line in from input file ($f) for the pattern and writing to a new file ($n) .
Instead of write to a new file is there any way to do a string replacement the original file when I try to do the same it has only empty file with no contents.
Do not. Never, ever1. Don't you dare, Don't even think of, do not use subroutine prototyping. It is horribly broken (that is, it doesn't do what you think it does) and is dangerous.
Now, we got that out of the way:
Yes, you can do what you want. You can open a file as both read and writable by using the mode <+. So far, so good.
However, due to buffering, you cannot use the standard read and write methods to read and write to the file. Instead, you need to use sysread and syswrite.
Then, what you need to do is read the line, use sysseek to go back to the start of where you read, and then write to that spot.
Not only is it very complex to do, but it is full of peril. Let's take a simple example. I have a document, and I want to replace my curly quotes with straight quotes.
$line =~ s/“|”/"/g;
That should work. I'm replacing one character with another. What could go wrong?
If this is a UTF-8 file (what Macs and Linux systems use by default), those curly quotes are two-byte characters and that straight quote is a single byte character. I would be writing back a line that was shorter than the line I read in. My buffer is going to be off.
Back in the days when computer memory and storage were measured in kilobytes, and you serial devices like reel-to-reel tapes, this type of operation was quite common. However, in this age where storage is vast, it's simply not worth the complexity and error prone process that this entails. Stick with reading from one file, and writing to another. Then use unlink and rename to delete the original and to rename the copy to the original's name.
A few more pointers:
Don't print if the file can't be opened. Use die. Otherwise, your program will simply continue on blithely unaware that it is not working. Even better, use the pragma use autodie;, and you won't have to worry about testing whether or not a read/write failed.
Use scalars for file handles.
That is instead of
open OUT, ">my_file.txt";
use
open my $out_fh, ">my_file.txt";
And, it is highly recommended to use the three parameter open:
Use
open my $out_fh, ">", "my_file.txt";
If you aren't, always add use strict; and use warnings;.
In fact, your Perl syntax is a bit ancient. You need to get a book on Modern Perl. Perl originally was written as a hack language to replace shell and awk programming. However, Perl has morphed into a full fledge language that can handle complex data types, object orientation, and large projects. Learning the modern syntax of Perl will help you find errors, and become a better developer.
1. Like all rules, this can be broken, but only if you have a clear and careful understanding what is going on. It's like those shows that say "Don't do this at home. We're professionals."
sub inplace_expansion($$$){
my $f = shift(#_) ; # file Name
my $vname = shift(#_) ; # variable name for pattern match
my $value = shift(#_) ; # value to replace
local #ARGV = ( $f );
local $^I = '';
while (<>)
{
s/\Q$vname/$value/g; #check for pattern
print;
}
}
or, my preference would run closer to this (basically equivalent, changes mostly in formatting, variable names, etc.):
use English;
sub inplace_expansion {
my ( $filename, $pattern, $replacement ) = #_;
local #ARGV = ( $filename ),
$INPLACE_EDIT = '';
while ( <> ) {
s/\Q$pattern/$replacement/g;
print;
}
}
The trick with local basically simulates a command-line script (as one would run with perl -e); for more details, see perldoc perlrun. For more on $^I (aka $INPLACE_EDIT), see perldoc perlvar.
(For the business with \Q (in the s// expression), see perldoc -f quotemeta. This is unrelated to your question, but good to know. Also be aware that passing regex patterns around in variables—as opposed to, e.g., using literal regexes exclusively— can be vulnerable to injection attacks; Perl's built-in taint mode is useful here.)
EDIT: David W. is right about prototypes.

Perl extract matches from list

I'm fairly new to perl but not to scripting languages. I have a file, and I'm trying to extract just one portion of each line that matches a regex. For example, given the file:
FLAG(123)
FLAG(456)
Not a flag
FLAG(789)
I'd like to extract the list [123, 456, 789]
The regex is obviously /^FLAG\((\w+)/. My question is, what's an easy way to extract this data in perl?
It's obviously not hard to set up a loop and do a bunch of =~ matches, but I've heard quite a bit about perl's terseness and how it has an operator for everything, so I'm wondering if there's a slick, simple way to do this.
Also, can you point me towards a good perl reference where I can find out slick ways to do other things like this when the opportunity next arises? There are many perl resources on the web, but 90% of them are too simple and the other 10% I seem to lose the signal in the noise.
Thanks!
Here's how I would do it... Did you learn anything new and/or helpful?
my $file_name = "somefile.txt";
open my $fh, '<', $file_name or die "Could not open file $file_name: $!";
my #list;
while (<$fh>)
{
push #list, $1 if /^FLAG\((\w+)/;
}
Things worth pointing out:
In a while loop condition (and ONLY in a while loop condition), reading from a filehandle will set the value to $_ and check that the file was read automatically.
A statement can be modified by attaching an if, unless, for, foreach, while, or until to the end of it. Then it works as a conditional or loop on that one statement.
You probably know that regex capture groups are stored in $1, $2, etc., but you might not have known that the statement will work even if the statement has an if suffix. The if is evaluated first, so push #list, $1 if /some_regex/ makes sense and will do the match first, assigning to $1 before it is needed in the push statement.
Assuming that you have all of the data together in a single string:
my #matches = $data =~ /^FLAG\((\w+)/mg;
The /g modifier means to match as many times as possible, the /m makes ^ match after any newline (not only at the beginning of the string) and a match in list context returns all of the captures for all of those matches.
If you're reading the data in line-by-line then Platinum Azure's solution is the one you want.
map is your friend here.
use strict;
use warnings;
use File::Slurp;
my #matches = map { /^FLAG\((\w+)/ } read_file('file.txt');

Can someone suggest how this Perl script works?

I have to maintain the following Perl script:
#!/usr/bin/perl -w
die "Usage: $0 <file1> <file2>\n" unless scalar(#ARGV)>1;
undef $/;
my #f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
my #f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
foreach my $g (0 .. $#f1) {
print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
}
print STDERR "$#f1 serials found\n";
I know pretty much what it does, but how it's done is difficult to follow. The calls to split() are particulary puzzling.
It's fairly idiomatic Perl and I would be grateful if a Perl expert could make a few clarifying suggestions about how it does it, so that if I need to use it on input files it can't deal with, I can attempt to modify it.
It combines the best results from two datalog files containing test results. The datalog files contain results for various serial numbers and the data for each serial number begins and ends with SERIAL NUMBER: n (I know this because my equipment creates the input files)
I could describe the format of the datalog files, but I think the only important aspect is the SERIAL NUMBER: n because that's all the Perl script checks for
The ternary operator is used to print a value from one input file or the other, so the output can be redirected to a third file.
This may not be what I would call "idiomatic" (that would be use Module::To::Do::Task) but they've certainly (ab)used some language features here. I'll see if I can't demystify some of this for you.
die "Usage: $0 <file1> <file2>\n" unless scalar(#ARGV)>1;
This exits with a usage message if they didn't give us any arguments. Command-line arguments are stored in #ARGV, which is like C's char **argv except the first element is the first argument, not the program name. scalar(#ARGV) converts #ARGV to "scalar context", which means that, while #ARGV is normally a list, we want to know about it's scalar (i.e. non-list) properties. When a list is converted to scalar context, we get the list's length. Therefore, the unless conditional is satisfied only if we passed no arguments.
This is rather misleading, because it will turn out your program needs two arguments. If I wrote this, I would write:
die "Usage: $0 <file1> <file2>\n" unless #ARGV == 2;
Notice I left off the scalar(#ARGV) and just wrote #ARGV. The scalar() function forces scalar context, but if we're comparing equality with a number, Perl can implicitly assume scalar context.
undef $/;
Oof. The $/ variable is a special Perl built-in variable that Perl uses to tell what a "line" of data from a file is. Normally, $/ is set to the string "\n", meaning when Perl tries to read a line it will read up until the next linefeed (or carriage return/linefeed on Windows). Your writer has undef-ed the variable, though, which means when you try to read a "line", Perl will just slurp up the whole file.
my #f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This is a fun one. <> is a special filehandle that reads line-by-line from each file given on the command line. However, since we've told Perl that a "line" is an entire file, calling <> once will read in the entire file given in the first argument, and storing it temporarily as a string.
Then we take that string and split() it up into pieces, using the regex /(?=(?:SERIAL NUMBER:\s+\d+))/. This uses a lookahead, which tells our regex engine "only match if this stuff comes after our match, but this stuff isn't part of our match," essentially allowing us to look ahead of our match to check on more info. It basically splits the file into pieces, where each piece (except possibly the first) begins with "SERIAL NUMBER:", some arbitrary whitespace (the \s+ part), and then some digits (the \d+ part). I can't teach you regexes, so for more info I recommend reading perldoc perlretut - they explain all of that better than I ever will.
Once we've split the string into a list, we store that list in a list called #f1.
my #f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This does the same thing as the last line, only to the second file, because <> has already read the entire first file, and storing the list in another variable called #f2.
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
This line prints an error message if #f1 and #f2 are different sizes. $#f1 is a special syntax for arrays - it returns the index of the last element, which will usually be the size of the list minus one (lists are 0-indexed, like in most languages). He uses this same value in his error message, which may be deceptive, as it will print 1 fewer than might be expected. I would write it as:
die "Error: file $ARGV[0] has ", $#f1 + 1, " serials, file $ARGV[1] has ", $#f2 + 1, "\n"
if $#f1 != $#f2;
Notice I changed "file1" to "file $ARGV[0]" - that way, it will print the name of the file you specified, rather than just the ambiguous "file1". Notice also that I split up the die() function and the if() condition on two lines. I think it's more readable that way. I also might write unless $#f1 == $#f2 instead of if $#f1 != $#f2, but that's just because I happen to think != is an ugly operator. There's more than one way to do it.
foreach my $g (0 .. $#f1) {
This is a common idiom in Perl. We normally use for() or foreach() (same thing, really) to iterate over each element of a list. However, sometimes we need the indices of that list (some languages might use the term "enumerate"), so we've used the range operator (..) to make a list that goes from 0 to $#f1, i.e., through all the indices of our list, since $#f1 is the value of the highest index in our list. Perl will loop through each index, and in each loop, will assign the value of that index to the lexically-scoped variable $g (though why they didn't use $i like any sane programmer, I don't know - come on, people, this tradition has been around since Fortran!). So the first time through the loop, $g will be 0, and the second time it will be 1, and so on until the last time it is $#f1.
print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
This is the body of our loop, which uses the ternary conditional operator ?:. There's nothing wrong with the ternary operator, but if the code gives you trouble we can just change it to an if(). Let's just go ahead and rewrite it to use if():
if($f2[$g] =~ m/RESULT:\s+PASS/) {
print $f2[$g];
} else {
print $f1[$g];
}
Pretty simple - we do a regex check on $f2[$g] (the entry in our second file corresponding to the current entry in our first file) that basically checks whether or not that test passed. If it did, we print $f2[$g] (which will tell us that test passed), otherwise we print $f1[$g] (which will tell us the test that failed).
print STDERR "$#f1 serials found\n";
This just prints an ending diagnostic message telling us how many serials were found (minus one, again).
I personally would rewrite that whole hairy bit where he hacks with $/ and then does two reads from <> to be a loop, because I think that would be more readable, but this code should work fine, so if you don't have to change it too much you should be in good shape.
The undef $/ line deactivates the input record separator. Instead of reading records line by line, the interpreter will read whole files at once after that.
The <>, or 'diamond operator' reads from the files from the command line or standard input, whichever makes sense. In your case, the command line is explicitely checked, so it will be files. Input record separation has been deactivated, so each time you see a <>, you can think of it as a function call returning a whole file as a string.
The split operators take this string and cut it in chunks, each time it meets the regular expression in argument. The (?= ... ) construct means "the delimiter is this, but please keep it in the chunked result anyway."
That's all there is to it. There would always be a few optimizations, simplifications, or "other ways to do it," but this should get you running.
You can get quick glimpse how the script works, by translating it into java or scala. The inccode.com translator delivers following java code:
public class script extends CRoutineProcess implements IInProcess
{
VarArray arrF1 = new VarArray();
VarArray arrF2 = new VarArray();
VarBox call ()
{
// !/usr/bin/perl -w
if (!(BoxSystem.ProgramArguments.scalar().isGT(1)))
{
BoxSystem.die(BoxString.is(VarString.is("Usage: ").join(BoxSystem.foundArgument.get(0
).toString()).join(" <file1> <file2>\n")
)
);
}
BoxSystem.InputRecordSeparator.empty();
arrF1.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
arrF2.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
if ((arrF1.length().isNE(arrF2.length())))
{
BoxSystem.die("Error: file1 has $#f1 serials, file2 has $#f2\n");
}
for (
VarBox varG : VarRange.is(0,arrF1.length()))
{
BoxSystem.print((arrF2.get(varG).like(BoxPattern.is("RESULT:\\s+PASS"))) ? arrF2.get(varG) : arrF1.get(varG)
);
}
return STDERR.print("$#f1 serials found\n");
}
}