What does this if statement do? (string comparison) - perl

I am trying to understand a piece of code that loops over a file, does various assignments, then enters a set of if statements where a string is seemingly compared to nothing. What are /nonsynonymous/ and /prematureStop/ being compared to here? I am mostly experienced with Python.
open(IN,$file);
while(<IN>){
chomp $_;
my @tmp = split /\t+/,$_;
my $id = join("\t",$tmp[0],$tmp[1]-1);
$id =~ s/chr//;
my @info_field = split /;/,$tmp[2];
my $vat = $info_field[$#info_field];
my $score = 0;
$self -> {VAT} ->{$id}= $vat;
$self ->{GENE} -> {$id} = $tmp[3];
if (/nonsynonymous/ || /prematureStop/){...

It is comparing against the current input line ($_).
By default, Perl matches a regex against the current input line ($_) unless you bind it to another variable with =~.

From http://perldoc.perl.org/perlretut.html
If you're matching against the special default variable $_ , the $_ =~
part can be omitted:
$_ = "Hello World";
if (/World/) {
print "It matches\n";
}
else {
print "It doesn't match\n";
}

Often in Perl, if a specific variable isn't given, it's assumed that you want to use the default variable $_. For instance, the while loop assigns the incoming lines from <IN> to that variable, chomp $_; could just as well have been written chomp;, and the regular expressions in the if statement try to match with $_ as well.
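As a minimal sketch of that equivalence (the sample lines below are made up for illustration and are not from the original file), the implicit and explicit forms of the match select the same lines:

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the file contents.
my @lines = ("chr1\t100\tnonsynonymous", "chr2\t200\tsynonymous");

my @implicit;
my @explicit;
for (@lines) {
    # Implicit form: the regexes are matched against $_.
    push @implicit, $_ if /nonsynonymous/ || /prematureStop/;
    # Explicit form: the same test with $_ =~ written out.
    push @explicit, $_ if $_ =~ /nonsynonymous/ || $_ =~ /prematureStop/;
}
print scalar(@implicit), " implicit, ", scalar(@explicit), " explicit\n";
```

Both counters end up at 1 here, because only the first sample line contains "nonsynonymous".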

Related

Perl: return an array from subroutine

Perl noob.
I am having trouble understanding returning an array value from a subroutine in a module.
The script has the following:
print "Enter your first and last name.\n";
chomp(my $fullname = <STDIN>); #input is 'testing this' all lower case
Jhusbands::Loginpass->check_name($fullname);
print "$fullname\n";
The module includes the following subroutine:
sub check_name {
my $class = shift;
if ($_[0] =~ /^\w+\s+\w+$/ ) {
@_ = split( / /, $_[0]);
foreach $_ (@_) {
$_ = ucfirst lc for @_;
@_ = join(" ", @_);
print Dumper(@_) . "\n";
return @_;
}
}
}
I am taking the name, checking it for only first and last (I'll get to else statements later), splitting it, correcting the case, and joining again. Dumper displays the final array as:
$VAR1 = 'Testing This';
So it appears to be working that far. However, the return value for $fullname in the script displays in all lower case:
testing this
Why is it not taking the corrected uppercase variable that Dumper displays as the last array iteration?
You don't assign the return value to anything. Also, the sub manipulates @_, which it shouldn't be doing, as discussed below. It can also be greatly simplified:
sub check_name {
my ($class, $name) = @_;
if ($name =~ /^\w+\s+\w+$/) {
return join ' ', map { ucfirst lc } split ' ', $name;
}
return; # returns "undef" (as input wasn't in expected format)
}
Then the caller can do
my $fullname = Jhusbands::Loginpass->check_name($name);
print "$fullname\n" if $fullname;
A sub's return should always be checked but in this case even more so, since it processes its input conditionally. I renamed the input to sub (to $name), for clarity.
If the code in the sub is meant to change $fullname by writing directly to @_ (and you had no return for that reason), that fails, since after those specific manipulations $_[0] is no longer aliased to the argument that was passed.
In any case, doing that is very tricky, can lead to opaque code -- and is unneeded. To directly change the argument pass it as a reference and write to it. However, it is probably far clearer and less error prone to return the result in this case.
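For completeness, here is a rough sketch of the pass-by-reference alternative mentioned above. The helper name fix_name_in_place is hypothetical and is not part of the original module:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites the caller's variable through a reference.
# Returns 1 on success, 0 if the name is not in "first last" form.
sub fix_name_in_place {
    my ($name_ref) = @_;
    return 0 unless $$name_ref =~ /^\w+\s+\w+$/;
    $$name_ref = join ' ', map { ucfirst lc } split ' ', $$name_ref;
    return 1;
}

my $fullname = 'testing this';
fix_name_in_place(\$fullname) or warn "unexpected name format\n";
print "$fullname\n";    # the caller's variable now holds "Testing This"
```

The caller has to pass \$fullname explicitly, which at least makes the in-place modification visible at the call site.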
It should be noted that the above name "processing" runs into the standard problems with processing of names, due to their bewildering variety. If this needs to be comprehensive, then the name parsing should be dispatched to a well-rounded library (or procedure) that can deal with the possible diversity.
Thanks to ikegami for comments bringing this up with examples, as well as a more direct way:
$name =~ s/(\w+)/\u\L$1/g;
return $name;
which with /r introduced in v5.14 can be written as
return $name =~ s/(\w+)/\u\L$1/gr;
If $name has no word-characters (\w) and there is no match this returns the same string.

Perl regular expressions and returned array of matched groups

I am new to Perl and I need to do some regexp matching.
I read that when an array is used like an integer value, it gives the count of elements inside.
So I am doing, for example,
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
and I was thinking it should return an empty array when pattern matching fails, but it still gives me an array with 2 elements, with uninitialized values.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
if (defined $params{$result[0]}) {
print STDERR "Cmd option error\n";
}
$params{$result[0]} = (defined $result[1] ? $result[1] : 1);
}
else {
print STDERR "Cmd option error\n";
exit ERROR_CMD;
}
}
It is a regexp pattern for command line options. The options are in long format, preceded by two hyphens and possibly followed by an argument, so --CMD[=ARG]. I want an elegant solution, which is why I want to put it in the if condition without some prolog etc.
EDIT2:
Oh, sorry. I was thinking the groups in the @result array are always counted from 0, but only the groups from the branch where the pattern succeeds are accessible. So if the command in my code is "input", it should be in $result[0], but it is actually in $result[1]. I thought if $result[0] is uninitialized, then the pattern fails and it goes to the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements which resulted from your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl interprets a scalar context for @result, so adds the number of elements to 0. This can also be achieved through scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
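A small sketch of that behavior, using the simpler pattern from the question. The helper capture_count and the sample strings are assumptions added for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: number of captures returned, 0 when the match fails.
sub capture_count {
    my ($text) = @_;
    my @result = $text =~ /(\d)\.(\d)/;
    return scalar @result;
}

for my $text ('4.2', 'no digits') {
    print "captures for ($text): ", capture_count($text), "\n";
    # The list assignment is evaluated in boolean (scalar) context here,
    # so it is true only when the match returned at least one element.
    if (my @result = $text =~ /(\d)\.(\d)/) {
        print "matched ($text): @result\n";
    }
    else {
        print "no match ($text)\n";
    }
}
```

A failed match in list context returns an empty list, so the assignment counts zero elements and the else branch runs.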
@results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
push @result, $1;
push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.

What does $dummy and non-parameter split mean in Perl?

I need some help decoding this Perl script. $dummy is not initialized anywhere else in the script. What does the following line mean in the script? And what does it mean when the split function doesn't have any parameters?
($dummy, $class) = split;
The program is trying to check whether a statement is truth or lie using some statistical classification method. So lets say it calculates and give the following number to "truth-sity" and "falsity" then it checks whether the lie detector is correct or not.
# some code, some code...
$_ = "truth";
# more some code, some code ...
$Truthsity = 9999;
$Falsity = 2134123;
if ($Truthsity > $Falsity) {
$newClass = "truth";
} else {
$newClass = "lie";
}
($dummy, $class) = split;
if ($class eq $newClass) {
print "correct";
} elsif ($class eq "true") {
print "false neg";
} else {
print "false pos"
}
($dummy, $class) = split;
Split returns a list of values. The first is put into $dummy, the second into $class, and any further values are ignored. The first variable is likely named dummy because the author plans to ignore that value. A better option is to use undef to ignore a returned entry: ( undef, $class ) = split;
Perldoc can show you how split functions. When called without arguments, split will operate against $_ and split on whitespace. $_ is the default variable in perl, think of it as an implied "it," as defined by context.
Using an implied $_ can make short code more concise, but it's poor form to use it inside larger blocks. You don't want the reader to get confused about which 'it' you want to work with.
split;                        # split it
for (@list) { foo($_) }       # look at each element of list, foo it.
@new = map { $_ + 2 } @list;  # look at each element of list,
                              # add 2 to it, put it in new list
while (<>) { foo($_) }        # grab each line of input, foo it.
perldoc -f split
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted, splits on
whitespace (after skipping any leading whitespace). Anything matching PATTERN
is taken to be a delimiter separating the fields. (Note that the delimiter may
be longer than one character.)
I'm a big fan of the ternary operator ? : for setting string values and of pushing logic into blocks and subroutines.
my $Truthsity = 9999;
my $Falsity = 2134123;
print test_truthsity( $Truthsity, $Falsity, $_ );
sub test_truthsity {
my ($truthsity, $falsity, $line ) = @_;
my $newClass = $truthsity > $falsity ? 'truth' : 'lie';
my (undef, $class) = split /\s+/, $line ;
my $output = $class eq $newClass ? 'correct'
: $class eq 'true' ? 'false neg'
: 'false pos';
return $output;
}
There may be a subtle bug in this version. split with no args is not exactly the same as split(/\s+/, $_); they behave differently if the line starts with spaces. In the fully qualified split, a blank leading field is returned. split with no args drops the leading spaces.
$_ = " ab cd";
my @a = split;           # @a contains ( 'ab', 'cd' )
my @b = split /\s+/, $_; # @b contains ( '', 'ab', 'cd' )
From the documentation for split:
split /PATTERN/,EXPR
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace). Anything
matching PATTERN is taken to be a delimiter separating the fields.
(Note that the delimiter may be longer than one character.)
So since both the pattern and the expression are omitted, we are splitting the default variable $_ on whitespace.
The purpose of the $dummy variable is to capture the first element of the list returned from split and ignore it, because the code is only interested in the second element, which gets put into $class.
You'll have to look at the surrounding code to find out what $_ is in this context; it may be a loop variable or a list item in a map block, or something else.
If you read the documentation, you'll find that:
The default for the first operand is " ".
The default for the second operand is $_.
The default for the third operand is 0.
so
split
is short for
split " ", $_, 0
and it means:
Take $_, split its value on whitespace, ignoring leading and trailing whitespace.
The first resulting field is placed in $dummy, and the second in $class.
Based on its name, I presume you proceed to never use $dummy again, so it's simply acting as a placeholder. You can get rid of it, though.
my ($dummy, $class) = split;
can be written as
my (undef, $class) = split; # Use undef as a placeholder
or
my $class = ( split )[1]; # Use a list slice to get second item
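The three spellings above can be compared side by side. This is only a sketch: the input line is a made-up example, and the variable names $class_a, $class_b and $class_c are introduced here for illustration:

```perl
use strict;
use warnings;

# Hypothetical input line: an ID followed by a class label.
$_ = "  doc42 truth extra\n";

my ($dummy, $class_a) = split;             # bare split: whitespace, $_
my (undef,  $class_b) = split ' ', $_, 0;  # the same call with defaults spelled out
my $class_c           = ( split )[1];      # list slice picks the second field

print "$class_a $class_b $class_c\n";
```

All three variables receive the second whitespace-separated field, and the leading spaces in the line do not produce an empty first field.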

Perl if equals sign

I need to detect if the first character in a file is an equals sign (=) and display the line number. How should I write the if statement?
$i=0;
while (<INPUT>) {
my($line) = $_;
chomp($line);
$findChar = substr $_, 0, 1;
if($findChar == "=")
$output = "$i\n";
print OUTPUT $output;
$i++;
}
Idiomatic perl would use a regular expression (^ meaning beginning of line) plus one of the dreaded builtin variables which happens to mean "line in file":
while (<INPUT>) {
print "$.\n" if /^=/;
}
See also perldoc -v '$.'
Use $findChar eq "=". In Perl:
== and != are numeric comparisons. They will convert both operands to a number.
eq and ne are string comparisons. They will convert both operands to a string.
Yes, this is confusing. Yes, I still write == when I mean eq ALL THE TIME. Yes, it takes me forever to spot my mistake too.
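To see why == goes wrong here, consider this short sketch. The sample character 'a' is an assumption, and the numeric warning is silenced locally only so the example runs quietly:

```perl
use strict;
use warnings;

my $first_char = 'a';

# Numeric comparison: both 'a' and '=' numify to 0, so this is (wrongly) true.
# With warnings left on, Perl would complain that the arguments are not numeric.
my $numeric_equal = do { no warnings 'numeric'; $first_char == '=' };

# String comparison: compares the characters themselves.
my $string_equal = $first_char eq '=';

print "== says ", ($numeric_equal ? "equal" : "different"), "\n";
print "eq says ", ($string_equal  ? "equal" : "different"), "\n";
```

With ==, every line whose first character is not a number looks "equal" to "=", which is why the original loop reported the wrong lines.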
It looks like you are not using strict and warnings. Use them, especially since you do not know Perl. You might also want to add diagnostics to the list of must-use pragmas.
You are keeping track of the input line number in a separate variable $i. Perl has various builtin variables documented in perlvar. Some of these, such as $., are very useful. Use them.
You are using my($line) = $_; in the body of the while loop. Instead, avoid $_ and assign to $line directly as in while ( my $line = <$input> ).
Note that bareword filehandles such as INPUT are package global. With the exception of the DATA filehandle, you are better off using lexical filehandles to properly limit the scope of your filehandles.
In your posts, include sample data in the __DATA__ section so others can copy, paste and run your code without further work.
With these comments in mind, you can print all lines that do not start with = using:
#!/usr/bin/perl
use strict; use warnings;
while (my $line = <DATA> ) {
my $first_char = substr $line, 0, 1;
if ( $first_char ne '=' ) {
print "$.:$first_char\n";
}
}
__DATA__
=
=
a
=
+
However, I would be inclined to write:
while (my $line = <DATA> ) {
# this will skip blank lines
if ( my ($first_char) = $line =~ /^(.)/ ) {
print "$.:$first_char\n" unless $first_char eq '=';
}
}

A couple of Perl subtleties

I've been programming in Perl for a while, but I never have understood a couple of subtleties about Perl:
The use and the setting/unsetting of the $_ variable confuses me. For instance, why does
# ...
shift @queue;
($item1, @rest) = split /,/;
work, but (at least for me)
# ...
shift @queue;
/some_pattern.*/ or die();
does not seem to work?
Also, I don't understand the difference between iterating through a file using foreach versus while. For instance,I seem to be getting different results for
while(<SOME_FILE>){
# Do something involving $_
}
and
foreach (<SOME_FILE>){
# Do something involving $_
}
Can anyone explain these subtle differences?
shift @queue;
($item1, @rest) = split /,/;
If I understand you correctly, you seem to think that this shifts off an element from @queue to $_. That is not true.
The value that is shifted off of @queue simply disappears. The following split operates on whatever is contained in $_ (which is independent of the shift invocation).
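A minimal sketch of this, with made-up queue contents and a made-up value in $_, shows that the bare split ignores the shifted element unless it is passed explicitly:

```perl
use strict;
use warnings;

# Hypothetical data for illustration.
my @queue = ('a,b,c', 'd,e,f');
$_ = 'x,y,z';

shift @queue;                        # 'a,b,c' is discarded; $_ is untouched
my ($item1, @rest) = split /,/;      # splits $_, i.e. 'x,y,z'

# Passing the shifted value explicitly makes split work on the queue element.
my ($item2, @rest2) = split /,/, shift @queue;   # splits 'd,e,f'

print "$item1 $item2\n";
```

The first split yields 'x' from $_, while the second yields 'd' from the queue.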
while(<SOME_FILE>){
# Do something involving $_
}
Reading from a filehandle in a while statement is special: It is equivalent to
while ( defined( $_ = readline *SOME_FILE ) ) {
This way, you can process even colossal files line-by-line.
On the other hand,
for(<SOME_FILE>){
# Do something involving $_
}
will first load the entire file as a list of lines into memory. Try a 1GB file and see the difference.
Another, albeit subtle, difference between:
while (<FILE>) {
}
and:
foreach (<FILE>) {
}
is that while() will modify the value of $_ outside of its scope, whereas, foreach() makes $_ local. For example, the following will die:
$_ = "test";
while (<FILE1>) {
print "$_";
}
die if $_ ne "test";
whereas, this will not:
$_ = "test";
foreach (<FILE1>) {
print "$_";
}
die if $_ ne "test";
This becomes more important with more complex scripts. Imagine something like:
sub func1() {
while (<$fh2>) { # clobbers $_ set from <$fh1> below
<...>
}
}
while (<$fh1>) {
func1();
<...>
}
Personally, I stay away from using $_ for this reason, in addition to it being less readable, etc.
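The difference in how the two loops treat $_ can be checked with a short sketch. It assumes an in-memory filehandle (opened on a reference to a scalar) in place of FILE1, so that it runs without any external file:

```perl
use strict;
use warnings;

# In-memory data stands in for the file (an assumption for this sketch).
my $data = "one\ntwo\n";

$_ = "test";
open my $fh, '<', \$data or die "open failed: $!";
foreach (<$fh>) { }           # $_ is localized to the loop
my $after_foreach = $_;
close $fh;

open $fh, '<', \$data or die "open failed: $!";
while (<$fh>) { }             # assigns each line to the global $_
my $after_while = $_;
close $fh;

print "after foreach: ", $after_foreach // 'undef', "\n";
print "after while: ",   $after_while   // 'undef', "\n";
```

After the foreach loop $_ still holds "test", while after the while loop it is undefined, because the final read at end of file assigned undef to it.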
Regarding the 2nd question:
while (<FILE>) {
}
and
foreach (<FILE>) {
}
Have the same functional behavior, including setting $_. The difference is that while() evaluates <FILE> in a scalar context, while foreach() evaluates <FILE> in a list context. Consider the difference between:
$x = <FILE>;
and
@x = <FILE>;
In the first case, $x gets the first line of FILE, and in the second case @x gets the entire file. Each entry in @x is a different line in FILE.
So, if FILE is very big, you'll waste memory slurping it all at once using foreach (<FILE>) compared to while (<FILE>). This may or may not be an issue for you.
The place where it really matters is if FILE is a pipe descriptor, as in:
open FILE, "some_shell_program|";
Now foreach(<FILE>) must wait for some_shell_program to complete before it can enter the loop, while while(<FILE>) can read the output of some_shell_program one line at a time and execute in parallel to some_shell_program.
That said, the behavior with regard to $_ remains unchanged between the two forms.
foreach evaluates the entire list up front. while evaluates the condition to see if it's true on each pass. while should be considered for incremental operations, foreach only for list sources.
For example:
my $t= time() + 10 ;
while ( $t > time() ) {
# do something
}
StackOverflow: What’s the difference between iterating over a file with foreach or while in Perl?
It is to avoid this sort of confusion that it's considered better form to avoid using the implicit $_ constructions.
my $element = shift @queue;
($item,@rest) = split /,/ , $element;
or
($item,@rest) = split /,/, shift @queue;
likewise
while(my $foo = <SOMEFILE>){
# do something
}
or
foreach my $thing(<FILEHANDLE>){
# do something
}
while only checks if the value is true, for also places the value in $_, except in some circumstances. For example <> will set $_ if used in a while loop.
to get similar behaviour of:
foreach(qw'a b c'){
# Do something involving $_
}
You have to set $_ explicitly.
my @items = qw'a b c';
while( defined( $_ = shift @items ) ){
# Do something involving $_
}
It is better to explicitly set your variables
for my $line(<SOME_FILE>){
}
or better yet
while( my $line = <SOME_FILE> ){
}
which will only read in the file one line at a time.
Also, shift doesn't set $_ unless you specifically ask it to
$_ = shift @_;
And split works on $_ by default. If used in scalar or void context, it will populate @_ (in Perl versions before 5.12; later versions removed this behavior).
Please read perldoc perlvar so that you will have an idea of the different variables in Perl.