Perl: test regex without creating new variable

Sorry if this is a basic question, but I'm somewhat new to perl, and I feel there should be a way to do this, but am having trouble finding any documentation. I'm wondering if you can do the following without the throw-away variable $doto:
my $file="foo/bar.c";
my $doto = $file;
$doto =~ s/\.c$/\.o/;
print ".o exists" if ( -f $doto );
That is, something like:
print ".o exists" if ( -f ($file =~ s/\.c$/\.o/gr) );
(but that creates a compile error of course).
My compile error is as follows:
Bareword found where operator expected at - line 2, near "s/.c$/.o/gr"
This is perl, v5.8.9

Your statement
print ".o exists" if ( -f ($file =~ s/\.c$/\.o/gr) )
works fine on versions of Perl that support the /r modifier—v5.14 or better. (Note that /g is superfluous.)
Without it there is no way to apply a substitution without modifying a variable, although you can make it a very short-lived temporary variable using a block
{
(my $doto = $file) =~ s/\.c$/\.o/;
print ".o exists" if -f $doto;
}
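A minimal sketch of the copy-and-substitute idiom wrapped in a function, so the temporary never leaks into the calling code. The helper name `object_name` is made up for illustration; the `/r` form appears only as a comment because it needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the .o name for a .c path without
# modifying the caller's variable. Works on Perl 5.8.
sub object_name {
    my ($path) = @_;
    # The assignment happens first, then s/// is applied to the copy.
    (my $obj = $path) =~ s/\.c$/.o/;
    return $obj;
}

my $file = "foo/bar.c";
print object_name($file), "\n";   # foo/bar.o
print "$file\n";                  # foo/bar.c (unchanged)

# On Perl 5.14+ the same result is available inline:
#   my $obj = $file =~ s/\.c$/.o/r;
```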

This answer talks about making the actual print if -f lookup code more readable. If you want the code to run faster, this solution is more expensive than your ugly one.
Since in your version of Perl there is no non-destructive substitution all you could do is implement your own function for that. It will not be as nice as the s///r, but it does the job. If you've got several occurrences of this type of code, it will make sense.
sub replace {
my ($text, $pattern, $replacement) = @_;
$text =~ s{$pattern}{$replacement}g; # do you need /g?
return $text;
}
# ... later
print ".o exists" if -f replace($file, qr/\.c$/, '.o');
This already takes care of making a copy for you, much like your temporary variable does, so $file will not actually be altered.
Note that your /g was useless as the filename will only ever have one end of the line, but it might not be useless later. However, it would be better to not fix it there, but to pass in an optional flag as another argument.
replace( $file, qr/.../, '.o', 'g' ); # where 'g' just means any true value
sub replace {
my ($text, $pattern, $replacement, $global) = @_;
if ($global) {
$text =~ s{$pattern}{$replacement}g;
} else {
$text =~ s{$pattern}{$replacement};
}
return $text;
}
You also generally don't need to escape the . in the replacement part because that's not actually a regular expression pattern, just a string.
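To illustrate the difference between the two halves of s///, here is a small sketch. The helper `swap_ext` and its arguments are hypothetical; the point is only that the pattern side needs the escaped dot while the replacement side is an ordinary interpolated string.

```perl
use strict;
use warnings;

# Hypothetical helper: swaps a trailing ".c" for another extension.
sub swap_ext {
    my ($name, $new_ext) = @_;
    # Pattern side: "\." is needed to match a literal dot.
    # Replacement side: plain string, so "." needs no backslash and
    # $new_ext is interpolated.
    (my $copy = $name) =~ s/\.c$/.$new_ext/;
    return $copy;
}

print swap_ext("foo/bar.c", "o"), "\n";   # foo/bar.o

# Leaving the dot unescaped in the pattern matches any character:
(my $oops = "abc") =~ s/.c$/.o/;
print "$oops\n";                          # a.o
```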

I would approach it by adding a function as follows.
sub doto_exists {
my $doto = shift;
$doto =~ s/\.c$/\.o/;
return (-f $doto);
}
$file = "file1.c";
print ".o exists\n" if doto_exists($file) ;

Related

Grep using perl

I'm trying to grep multiple patterns from a log file using Perl. For the first pattern I get the desired captures via the read-only variables ($1, $2, ...). But for the next pattern, the read-only variable returns the previous value rather than the value matching the second pattern.
Here is the code:
$tmp = `grep "solo_video_channel_.*(0): queueing" $log`;
chomp($tmp);
$tmp =~ m/(.*):.*solo_video_channel_write(.*): queueing page (.*).*/;
$chnl = $2;
$page = $3;
$timestamp = $1;
$tmp1 = `grep "(0): DUMP GO" $log`;
chomp($tmp1);
$tmp1 =~ m/(.*): solo_video_channel_write(0): DUMP GO/;
$dmp = $1;
print "dump go time = $1\n";
The value of $tmp1 after the grep is as expected, but $1 keeps the same value as before.
Any suggestions?
Always make sure that you verify that a regex matched before using a captured variable.
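A short sketch of why this matters: a failed match leaves $1 holding whatever the last successful match captured. The helper name `dump_go_time` and the sample strings in the second half are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured timestamp, or undef when
# the line does not match, so a stale $1 can never leak out.
sub dump_go_time {
    my ($line) = @_;
    if ( $line =~ /^(.*): solo_video_channel_write\(0\): DUMP GO/ ) {
        return $1;
    }
    return undef;
}

my $first  = "12:00 first";
my $second = "no timestamp here";

my $matched_first  = $first  =~ /^(\d+:\d+)/;   # succeeds, $1 is "12:00"
my $matched_second = $second =~ /^(\d+:\d+)/;   # fails, $1 is left alone
print "second match failed, but \$1 is still '$1'\n" unless $matched_second;
```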
Additionally, there is no reason to shell out to grep. Use Perl's file processing instead:
use strict;
use warnings;
local @ARGV = $log;
while (<>) {
chomp;
if (/solo_video_channel_.*\(0\): queueing/) {
if ( my ( $timestamp, $chnl, $page ) = m/(.*):.*solo_video_channel_write(.*): queueing page (.*).*/ ) {
print "$. - $timestamp, $chnl, $page\n";
}
}
if ( my ($dmp) = m/(.*): solo_video_channel_write\(0\): DUMP GO/ ) {
print "dump go time = $dmp\n";
}
}
Note, your first set of if's could almost certainly be combined into a single if statement, but I left it as is for now.
Why not use Pure Perl? It's faster than running external greps. Plus, you can grep both regular expressions at once. Faster than looping through the file twice.
Always check the result of your regexp match. Here I'm using if statements to do this. Note too that I am printing all lines that don't match with UNMATCHED LINES. You can remove the else when you see that everything is working, or simply redirect 2> /dev/null.
use strict;
use warnings;
use autodie;
use feature qw(say);
my $log = "log.txt";
open my $log_fh, "<", $log;
while ( my $line = <$log_fh> ) {
my $timestamp;
my $channel;
my $page;
my $gotime;
if ( $line =~ /(.*):.*solo_video_channel_(.*):\s+queueing page (.*)/ ) {
$timestamp = $1;
$channel = $2;
$page = $3;
say qq(Timestamp = "$timestamp" Channel = "$channel" Page = "$page");
}
elsif ( $line =~ /(.*): solo_video_channel_write\(0\): DUMP GO/ ) {
$gotime = $1;
say "Dump Go Time = $1";
}
else {
say STDERR qq(UNMATCHED LINES: "$line");
}
}
close $log_fh;
In the second regexp you need to escape the literal brackets:
$tmp1 =~ m/(.*): solo_video_channel_write\(0\): DUMP GO/
This is because the expression \(0\) matches the literal text (0).
In the example given in this answer, this would match strings such as
37: solo_video_channel_write(0): DUMP GO
In contrast, the expression (0) matches the single character 0 and sets a capture group.
With the regexp given in your original question
$tmp1 =~ m/(.*): solo_video_channel_write(0): DUMP GO/;
matching would occur on strings such as
37: solo_video_channel_write0: DUMP GO
Of course, in the original program the strings are not in this format, so they do not match and $1 is not set.
The regular expression syntax for the shell program grep is (confusingly) different.
In grep's default (basic) syntax, round brackets must be escaped with a backslash to set a capture group, which is the opposite of the syntax in Perl.

A couple of Perl subtleties

I've been programming in Perl for a while, but I never have understood a couple of subtleties about Perl:
The use and the setting/unsetting of the $_ variable confuses me. For instance, why does
# ...
shift @queue;
($item1, @rest) = split /,/;
work, but (at least for me)
# ...
shift @queue;
/some_pattern.*/ or die();
does not seem to work?
Also, I don't understand the difference between iterating through a file using foreach versus while. For instance, I seem to be getting different results for
while(<SOME_FILE>){
# Do something involving $_
}
and
foreach (<SOME_FILE>){
# Do something involving $_
}
Can anyone explain these subtle differences?
shift @queue;
($item1, @rest) = split /,/;
If I understand you correctly, you seem to think that this shifts off an element from @queue to $_. That is not true.
The value that is shifted off of @queue simply disappears. The following split operates on whatever is contained in $_ (which is independent of the shift invocation).
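A small sketch that makes the independence visible. The sample data and variable names are made up for illustration.

```perl
use strict;
use warnings;

my @queue = ("a,b,c", "d,e");
$_ = "x,y,z";

shift @queue;                       # "a,b,c" is discarded; $_ is untouched
my ($item1, @rest) = split /,/;     # splits $_, that is, "x,y,z"
print "$item1 @rest\n";             # x y z

my $element = shift @queue;         # keep the shifted value explicitly
my ($first, @others) = split /,/, $element;
print "$first @others\n";           # d e
```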
while(<SOME_FILE>){
# Do something involving $_
}
Reading from a filehandle in a while statement is special: It is equivalent to
while ( defined( $_ = readline *SOME_FILE ) ) {
This way, you can process even colossal files line-by-line.
On the other hand,
for(<SOME_FILE>){
# Do something involving $_
}
will first load the entire file as a list of lines into memory. Try a 1GB file and see the difference.
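The difference can be observed without a huge file. The sketch below uses an in-memory filehandle and checks, on the first pass through each loop, whether the handle has already reached end-of-file. The helper name `at_eof_on_first_pass` is hypothetical.

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";

# Hypothetical helper: reports whether the handle is already at EOF
# when the loop body runs for the first time.
sub at_eof_on_first_pass {
    my ($use_foreach) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $result;
    if ($use_foreach) {
        for my $line (<$fh>) {          # list context: reads every line first
            $result = eof($fh) ? 1 : 0;
            last;
        }
    }
    else {
        while ( my $line = <$fh> ) {    # scalar context: one line per pass
            $result = eof($fh) ? 1 : 0;
            last;
        }
    }
    close $fh;
    return $result;
}

print "foreach: ", at_eof_on_first_pass(1), "\n";   # foreach: 1
print "while:   ", at_eof_on_first_pass(0), "\n";   # while:   0
```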
Another, albeit subtle, difference between:
while (<FILE>) {
}
and:
foreach (<FILE>) {
}
is that while() will modify the value of $_ outside of its scope, whereas, foreach() makes $_ local. For example, the following will die:
$_ = "test";
while (<FILE1>) {
print "$_";
}
die if $_ ne "test";
whereas, this will not:
$_ = "test";
foreach (<FILE1>) {
print "$_";
}
die if $_ ne "test";
This becomes more important with more complex scripts. Imagine something like:
sub func1() {
while (<$fh2>) { # clobbers $_ set from <$fh1> below
<...>
}
}
while (<$fh1>) {
func1();
<...>
}
Personally, I stay away from using $_ for this reason, in addition to it being less readable, etc.
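One way to sidestep the clobbering problem, sketched under the assumption that both loops can use lexical variables. The function name `count_inner_lines` and the in-memory data are invented for the example.

```perl
use strict;
use warnings;

my $inner = "i1\ni2\n";
my $outer = "o1\no2\n";

# Hypothetical stand-in for func1(): reads its own handle into a
# lexical variable, so the caller's data is never touched.
sub count_inner_lines {
    open my $fh2, '<', \$inner or die "Cannot open: $!";
    my $count = 0;
    while ( my $line = <$fh2> ) {
        $count++;
    }
    close $fh2;
    return $count;
}

open my $fh1, '<', \$outer or die "Cannot open: $!";
my @seen;
while ( my $line = <$fh1> ) {
    count_inner_lines();
    chomp $line;
    push @seen, $line;
}
close $fh1;
print "@seen\n";    # o1 o2
```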
Regarding the 2nd question:
while (<FILE>) {
}
and
foreach (<FILE>) {
}
Have the same functional behavior, including setting $_. The difference is that while() evaluates <FILE> in a scalar context, while foreach() evaluates <FILE> in a list context. Consider the difference between:
$x = <FILE>;
and
@x = <FILE>;
In the first case, $x gets the first line of FILE, and in the second case @x gets the entire file. Each entry in @x is a different line in FILE.
So, if FILE is very big, you'll waste memory slurping it all at once using foreach (<FILE>) compared to while (<FILE>). This may or may not be an issue for you.
The place where it really matters is if FILE is a pipe descriptor, as in:
open FILE, "some_shell_program|";
Now foreach(<FILE>) must wait for some_shell_program to complete before it can enter the loop, while while(<FILE>) can read the output of some_shell_program one line at a time and execute in parallel to some_shell_program.
That said, the behavior with regard to $_ remains unchanged between the two forms.
foreach evaluates the entire list up front. while evaluates the condition to see if it's true on each pass. while should be considered for incremental operations, foreach only for list sources.
For example:
my $t= time() + 10 ;
while ( $t > time() ) {
# do something
}
StackOverflow: What’s the difference between iterating over a file with foreach or while in Perl?
It is to avoid this sort of confusion that it's considered better form to avoid using the implicit $_ constructions.
my $element = shift @queue;
($item,@rest) = split /,/ , $element;
or
($item,@rest) = split /,/, shift @queue;
likewise
while(my $foo = <SOMEFILE>){
# do something
}
or
foreach my $thing(<FILEHANDLE>){
# do something
}
while only checks if the value is true, for also places the value in $_, except in some circumstances. For example <> will set $_ if used in a while loop.
to get similar behaviour of:
foreach(qw'a b c'){
# Do something involving $_
}
You have to set $_ explicitly.
my @items = qw'a b c';
while( defined( $_ = shift @items ) ){
# Do something involving $_
}
It is better to explicitly set your variables
for my $line(<SOME_FILE>){
}
or better yet
while( my $line = <SOME_FILE> ){
}
which will only read in the file one line at a time.
Also, shift doesn't set $_ unless you specifically ask it to:
$_ = shift @_;
And split works on $_ by default. If used in scalar or void context, it will populate @_ (in older versions of Perl; this behaviour was deprecated and later removed).
Please read perldoc perlvar so that you will have an idea of the different variables in Perl.

Perl for loop explanation

I'm looking through perl code and I see this:
sub html_filter {
my $text = shift;
for ($text) {
s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
s/"/&quot;/g;
}
return $text;
}
what does the for loop do in this case and why would you do it this way?
The for loop aliases each element of the list it's looping over to $_. In this case, there is only one element, $text.
Within the body, this allows one to write
s/&/&amp;/g;
etc. instead of having to write
$text =~ s/&/&amp;/g;
repeatedly. See also perldoc perlsyn.
Without an explicit loop variable, the for loop uses the special variable called $_. The substitution statements inside the loop also use the special $_ variable because none other is specified, so this is just a trick to make the source code shorter. I would probably write this function as:
sub html_filter {
my $text = shift;
$text =~ s/&/&amp;/g;
$text =~ s/</&lt;/g;
$text =~ s/>/&gt;/g;
$text =~ s/"/&quot;/g;
return $text;
}
This will have no performance consequences and is readable by people other than Perl.
As Mr Hewgill points out, the code sample is implicitly localizing and aliasing to $_, the magical implied variable.
He offers a substitute that is more readable at the cost of boilerplate code.
There is no reason to sacrifice readability for brevity. Simply replace the implicit localization and assignment with an explicit version:
sub html_filter {
local $_ = shift;
s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
s/"/&quot;/g;
return $_;
}
If I didn't know Perl all that well and came across this code, I'd know that I needed to look at the docs for $_ and local. As a bonus, in perlvar there are a few examples of localizing $_.
For anyone who uses Perl a lot, the above should be easy to understand.
So there is really no reason to sacrifice readability for brevity here.
It's just used to alias $text to $_, the default variable. Done because they're too lazy to use an explicit variable or don't want to waste precious cycles creating a new scalar.
It's cleaning up &, <, > and quote characters and replacing them with the appropriate HTML entities.
It loops through your text and substitutes ampersands (&) with &amp;, < with &lt;, > with &gt; and " with &quot;. You'd do this for output to a .html document; those are the proper character entities.
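A runnable sketch of the filter with the entities written out in full, plus a sample call. The input string is invented for illustration. Note that the ampersand substitution has to run first; otherwise it would re-escape the ampersands introduced by the other three.

```perl
use strict;
use warnings;

# Sketch of the filter using the single-element for loop to alias
# $text to $_ for the four substitutions.
sub html_filter {
    my $text = shift;
    for ($text) {
        s/&/&amp;/g;    # must come first
        s/</&lt;/g;
        s/>/&gt;/g;
        s/"/&quot;/g;
    }
    return $text;
}

print html_filter('<a href="x">Q&A</a>'), "\n";
# &lt;a href=&quot;x&quot;&gt;Q&amp;A&lt;/a&gt;
```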
The original code could be more flexible by using wantarray to test the desired context:
sub html_filter {
my @text = @_;
for (@text) {
s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
s/"/&quot;/g;
}
return wantarray ? @text : "@text";
}
That way you could call it in list context or scalar context and get back the correct results, for example:
my @stuff = html_filter('"','>');
print "$_\n" for @stuff;
my $stuff = html_filter('&');
print $stuff;

Neatest way to remove linebreaks in Perl

I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).
When reading from a file it goes something like this:
@lines = <IN>;
process(\@lines);
...
sub process {
my $lines = shift; # array reference passed by the caller
foreach my $line (@{$lines}) {
chomp $line;
#Handle line by line
}
}
So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks.
I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)
What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?
Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not necessarily read from a file. Each line might have
No trailing linebreaks
Unix-style linebreaks
Windows-style linebreaks
Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
An aggregated set where lines have different styles
After digging through the perlre docs a bit, I'll present my best suggestion so far, which seems to work pretty well. Perl 5.10 added the \R escape as a generalized linebreak:
$line =~ s/\R//g;
It's the same as:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.
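As a rough illustration, here is a chomp-like variant anchored at the end of the string, so that only one trailing line break is removed rather than every line break in the text. The helper name `chomp_any` is hypothetical, and \R requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: removes a single trailing line break of any
# style and leaves the rest of the string alone.
sub chomp_any {
    my ($line) = @_;
    $line =~ s/\R\z//;
    return $line;
}

for my $raw ("unix\n", "windows\r\n", "oldmac\r", "none") {
    printf "[%s]\n", chomp_any($raw);
}
# [unix]
# [windows]
# [oldmac]
# [none]
```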
Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.
sub clean {
my $text = shift;
$text =~ s/\n//g;
$text =~ s/\r//g;
return $text;
}
It may not be fancy but this method has been working flawlessly for me for years.
$line =~ s/[\r\n]+//g;
Reading perlport I'd suggest something like
$line =~ s/\015?\012?$//;
to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.
Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.
extending on your answer
use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;
File::Slurp abstracts away the File IO stuff and just returns a string for you.
NOTE
1. Important to note the addition of /g: without it, given a multi-line string, it will only replace the first offending character.
2. Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line breaks before whatever is meant by $ on this OS.
3. In a multi-line string, $ matches the end of the string, and that would be problematic.
4. Point 3 means that point 2 is made with the assumption that you'd also want to use /m; otherwise '$' would be basically meaningless for anything practical in a string with more than one line, or, when doing single-line processing, would require an OS which actually understands $ and manages to find the \R* that precedes the $.
Examples
while( my $line = <$foo> ){
$line =~ $regex;
}
Given the above loop, an OS that does not recognize your file's '\n' or '\r' delimiters will, with the OS's default delimiter set for $/, read your whole file as one contiguous string (unless the file happens to contain the OS's delimiter, in which case it is split there).
So in this case all of these regex are useless:
s/\R*$// : Will only erase the last sequence of \R in the file
s/\R*// : Will only erase the first sequence of \R in the file
s/\012?\015?// : Will only erase the first \012\015, \012, or \015 sequence; \015\012 will result in either \012 or \015 being emitted.
s/\R*$// : If there happen to be no byte sequences of '\015$OSDELIMITER' in the file, then NO line breaks will be removed except for the OS's own ones.
It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.
#!/usr/bin/perl
use strict;
use warnings;
my $fn = 'TestFile.txt';
my $LF = "\012";
my $CR = "\015";
my $UnixNL = $LF;
my $DOSNL = $CR . $LF;
my $MacNL = $CR;
sub generate {
my $filename = shift;
my $lineDelimiter = shift;
open my $fh, '>', $filename;
for ( 0 .. 10 )
{
print $fh "{0}";
print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
print $fh "{1}";
print $fh $lineDelimiter->();
print $fh "{2}";
}
close $fh;
}
sub parse {
my $filename = shift;
my $osDelimiter = shift;
my $message = shift;
print "Parsing $message File $filename : \n";
local $/ = $osDelimiter;
open my $fh, '<', $filename;
while ( my $line = <$fh> )
{
$line =~ s/\R*$//;
print ">|" . $line . "|<";
}
print "Done.\n\n";
}
my @all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL };
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
return $all[ int(rand(@all)) ];
};
for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
for ( qw( Windows Mac Unix Mixed ) ){
parse $_ . ".txt", @{ $os };
}
}
For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74
Note there are certain combinations that of course work, but they are likely the ones you yourself naively tested.
Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.
and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :
Nothing between {1} and {2}
only |<>| between {1} and {2}
In your example, you can just go:
chomp(@lines);
Or:
$_=join("", @lines);
s/[\r\n]+//g;
Or:
@lines = split /[\r\n]+/, join("", @lines);
Using these directly on a file:
perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less
perl -e 'chomp(@a=<>);print @a' <a.txt |less
To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:
sub cleanLines{
my $text = shift;
$text =~ s/\r/ /g; #replace every \r with a space
$text =~ s/\n/ /g; #replace every \n with a space
$text =~ s/ {2,}/ /g; #collapse runs of spaces into a single space
return $text;
}
The last substitution uses the g 'global' modifier so it continues to find runs of spaces until it has replaced them all (effectively collapsing anything more than a single space).

How can I find the strings from one file in another file in Perl?

The script below takes function names in a text file and scans on a
folder that contains multiple c,h files. It opens those files one-by-one and
reads each line. If the match is found in any part of the files, it prints the
line number and the line that contains the match.
Everything is working fine except that the comparison is not working properly. I would be very grateful to whoever solves my problem.
#program starts:
use FileHandle;
print "ENTER THE PATH OF THE FILE THAT CONTAINS THE FUNCTIONS THAT YOU WANT TO
SEARCH: ";#getting the input file
our $input_path = <STDIN>;
$input_path =~ s/\s+$//;
open(FILE_R1,'<',"$input_path") || die "File open failed!";
print "ENTER THE PATH OF THE FUNCTION MODEL: ";#getting the folder path that
#contains multiple .c,.h files
our $model_path = <STDIN>;
$model_path =~ s/\s+$//;
our $last_dir = uc(substr ( $model_path,rindex( $model_path, "\\" ) +1 ));
our $output = $last_dir."_FUNC_file_names";
while(our $func_name_input = <FILE_R1> )#$func_name_input is the function name
#that is taken as the input
{
$func_name_input=reverse($func_name_input);
$func_name_input=substr($func_name_input,rindex($func_name_input,"\("+1);
$func_name_input=reverse($func_name_input);
$func_name_input=substr($func_name_input,index($func_name_input," ")+1);
#above 4 lines are func_name_input is choped and only part of the function
#name is taken.
opendir FUNC_MODEL,$model_path;
while (our $file = readdir(FUNC_MODEL))
{
next if($file !~ m/\.(c|h)/i);
find_func($file);
}
close(FUNC_MODEL);
}
sub find_func()
{
my $fh1 = FileHandle->new("$model_path//$file") or die "ERROR: $!";
while (!$fh1->eof())
{
my $func_name = $fh1->getline(); #getting the line
**if($func_name =~$func_name_input)**#problem here it does not take the
#match
{
next if($func_name=~m/^\s+/);
print "$.,$func_name\n";
}
}
}
$func_name_input=substr($func_name_input,rindex($func_name_input,"\("+1);
You're missing an ending parenthesis. Should be:
$func_name_input=substr($func_name_input,rindex($func_name_input,"\(")+1);
There's probably an easier way than those four statements, too. But it's a little early to wrap my head around it all. Do you want to match "foo" in "function foo() {"? If so, you could use a regex like /\s+([^( ]+)/.
When you say $func_name =~$func_name_input, any regex metacharacters in $func_name_input are interpreted as special characters rather than matched literally. If this is not what you mean to do, you can use quotemeta (perldoc -f quotemeta): $func_name =~quotemeta($func_name_input) or $func_name =~ qr/\Q$func_name_input\E/.
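A brief sketch of the difference, using a made-up pattern string. Interpolated as-is, the dot acts as a wildcard; wrapped in \Q...\E (or passed through quotemeta) it only matches a literal dot.

```perl
use strict;
use warnings;

my $pattern_text = "a.c";

# Interpolated as-is, "." is a wildcard, so "abc" matches too.
my $loose  = ( "abc" =~ /$pattern_text/ )     ? 1 : 0;
# With \Q...\E the "." only matches a literal dot.
my $strict = ( "abc" =~ /\Q$pattern_text\E/ ) ? 1 : 0;
my $exact  = ( "a.c" =~ /\Q$pattern_text\E/ ) ? 1 : 0;

print "loose=$loose strict=$strict exact=$exact\n";   # loose=1 strict=0 exact=1
```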
Debugging will be easier with strictures (and a syntax-highlighting editor). Also note that, if you're not using those variables in other files, "our" doesn't do anything "my" wouldn't do for file-scoped variables.
find + xargs + grep does 90% of what you want.
find . -name '*.[ch]' | xargs grep -n your_pattern
ack does it even easier.
ack --type=cc your_pattern
Simply take your list of patterns from your file and "or" them together.
ack --type=cc 'foo|bar|baz'
This has the benefit of only searching the files once, and not once for each pattern being searched for as you're doing.
I still think you should just use ack, but your code needed some serious love.
Here is an improved version of your program. It now takes the directory to search and patterns on the command line rather than having to ask for (and the user write) files. It searches all the files under the directory, not just the ones in the directory, using File::Find. It does this in one pass by concatenating all the patterns into regular expressions. It uses regexes instead of index() and substr() and reverse() and oh god. It simply uses built in filehandles rather than the FileHandle module and checking for eof(). Everything is declared lexical (my) instead of global (our). Strict and warnings are on for easier debugging.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
die "Usage: search_directory function ...\n" unless @ARGV >= 2;
my $Search_Dir = shift;
my $Pattern = build_pattern(#ARGV);
find(
{
wanted => sub {
return unless $File::Find::name =~ m/\.(c|h)$/i;
find_func($File::Find::name, $Pattern);
},
no_chdir => 1,
},
$Search_Dir
);
# Join all the function names into one pattern
sub build_pattern {
my @patterns;
for my $name (@_) {
# Turn foo() into foo. This replaces all that reverse() and rindex()
# and substr() stuff.
$name =~ s{\(.*}{};
# Use \Q to protect against regex metacharacters in the input
push @patterns, qr{\Q$name\E};
}
# Join them up into one pattern.
return join "|", @patterns;
}
sub find_func {
my( $file, $pattern ) = @_;
open(my $fh, "<", $file) or die "Can't open $file: $!";
while (my $line = <$fh>) {
# XXX not all functions are unindented, but your choice
next if $line =~ m/^\s+/;
print "$file:$.: $line" if $line =~ $pattern;
}
}