What do $dummy and a split with no arguments mean in Perl?

I need some help decoding this Perl script. $dummy is not initialized anywhere else in the script. What does the following line mean, and what does it mean when the split function is called without any arguments?
($dummy, $class) = split;
The program is trying to check whether a statement is a truth or a lie using some statistical classification method. So let's say it calculates the following numbers for "truthsity" and "falsity"; it then checks whether the lie detector is correct or not.
# some code, some code...
$_ = "truth";
# more some code, some code ...
$Truthsity = 9999;
$Falsity = 2134123;
if ($Truthsity > $Falsity) {
    $newClass = "truth";
} else {
    $newClass = "lie";
}
($dummy, $class) = split;
if ($class eq $newClass) {
    print "correct";
} elsif ($class eq "true") {
    print "false neg";
} else {
    print "false pos";
}

($dummy, $class) = split;
split returns a list of values. The first is put into $dummy, the second into $class, and any further values are ignored. The first variable is likely named dummy because the author plans to ignore that value. A better option is to use undef to ignore a returned entry:
( undef, $class ) = split;
Perldoc can show you how split works. When called without arguments, split operates on $_ and splits on whitespace. $_ is the default variable in Perl; think of it as an implied "it," as defined by context.
Using an implied $_ can make short code more concise, but it's poor form to use it inside larger blocks. You don't want the reader to get confused about which 'it' you want to work with.
split;                         # split it
for (@list) { foo($_) }        # look at each element of list, foo it.
@new = map { $_ + 2 } @list;   # look at each element of list,
                               # add 2 to it, put it in new list
while (<>) { foo($_) }         # grab each line of input, foo it.
perldoc -f split
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted, splits on
whitespace (after skipping any leading whitespace). Anything matching PATTERN
is taken to be a delimiter separating the fields. (Note that the delimiter may
be longer than one character.)
I'm a big fan of the ternary operator ? : for setting string values and of pushing logic into blocks and subroutines.
my $Truthsity = 9999;
my $Falsity   = 2134123;
print test_truthsity( $Truthsity, $Falsity, $_ );
sub test_truthsity {
    my ( $truthsity, $falsity, $line ) = @_;
    my $newClass = $truthsity > $falsity ? 'truth' : 'lie';
    my ( undef, $class ) = split /\s+/, $line;
    my $output = $class eq $newClass ? 'correct'
               : $class eq 'true'    ? 'false neg'
               :                       'false pos';
    return $output;
}
There may be a subtle bug in this version. split with no arguments is not exactly the same as split(/\s+/, $_); they behave differently if the line starts with spaces. With an explicit /\s+/ pattern, an empty leading field is returned; split with no arguments skips the leading whitespace.
$_ = " ab cd";
my @a = split;            # @a contains ( 'ab', 'cd' )
my @b = split /\s+/, $_;  # @b contains ( '', 'ab', 'cd' )

From the documentation for split:
split /PATTERN/,EXPR
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace). Anything
matching PATTERN is taken to be a delimiter separating the fields.
(Note that the delimiter may be longer than one character.)
So since both the pattern and the expression are omitted, we are splitting the default variable $_ on whitespace.
The purpose of the $dummy variable is to capture the first element of the list returned from split and ignore it, because the code is only interested in the second element, which gets put into $class.
You'll have to look at the surrounding code to find out what $_ is in this context; it may be a loop variable or a list item in a map block, or something else.
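As a minimal sketch of where $_ might come from, assume the script loops over lines of the form "id class". The sample lines below are made up for illustration and are not from the original script:

```perl
use strict;
use warnings;

# Hypothetical input lines; the real script's data format is an assumption.
my @lines = ( "s1 truth", "   s2 lie", "s3 truth extra" );

my @classes;
for (@lines) {                      # each line is aliased to $_
    my ( undef, $class ) = split;   # split $_ on whitespace, discard field 1
    push @classes, $class;
}

print "@classes\n";
```

Leading whitespace and any fields after the second one do not affect the value that ends up in $class.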

If you read the documentation, you'll find that:
The default for the first operand is " ".
The default for the second operand is $_.
The default for the third operand is 0.
so
split
is short for
split " ", $_, 0
and it means:
Take $_, split its value on whitespace, ignoring leading and trailing whitespace.
The first resulting field is placed in $dummy, and the second in $class.
Based on its name, I presume you proceed to never use $dummy again, so it's simply acting as a placeholder. You can get rid of it, though.
my ($dummy, $class) = split;
can be written as
my (undef, $class) = split; # Use undef as a placeholder
or
my $class = ( split )[1]; # Use a list slice to get second item
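The equivalence described above can be checked with a short sketch; the sample string is an assumption chosen to include leading and trailing whitespace:

```perl
use strict;
use warnings;

$_ = "  id42   truth  ";

my @implicit = split;             # pattern, string and limit all defaulted
my @explicit = split " ", $_, 0;  # the same call written out in full
my $class    = ( split )[1];      # list slice: second field only

print scalar(@implicit), " fields: @implicit\n";
print "class: $class\n";
```

Both forms should return the same two fields, with no empty field produced by the surrounding whitespace.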

Related

perl Grouping things and hierarchical matching

I've been testing the Perl regex code from the perlrequick section on Grouping things and hierarchical matching.
This is my Perl code:
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
The output is cat, but I expected it to behave as written in the documentation:
/house(cat|)/; # matches either 'housecat' or 'house'
What is wrong? Is there something amiss?
What you're doing with this code
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
is copying the first capture into $m. Parentheses () in the pattern indicate which parts of the matching string to capture and store in the built-in variables $1, $2, etc. You can have as many captures as you like, and they are numbered in the order in which their opening parentheses appear in the pattern.
What perlrequick is talking about is what constitutes a successful match. Normally you would write
my $t = "housecats";
my $success = $t =~ m/house(cat|)/;
print $success ? "matched\n" : "no match\n";
This code produces
matched
as the document describes. If you set $t to housemartin then the result is the same because the regex pattern successfully finds house. But if $t is hosepipe then we see no match because the string contains neither house nor housecat.
If you need to extract parts of the matched string then you must use captures as described above. You can access the whole string that was matched through the built-in variable $&, but doing so causes unacceptable performance degradation in older Perl versions (before 5.20). For backward compatibility you should simply capture the whole pattern by writing
my $t = "housecats";
my ($m) = $t =~ m/(house(cat|))/;
print $m;
which produces housecat as you expected. It also sets the values of $1 and $2 to housecat and cat respectively.
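On Perl 5.10 or later, another option is the /p modifier together with ${^MATCH}, which holds the text matched by the whole pattern without the cost historically associated with $&. A minimal sketch, reusing the string from the question:

```perl
use strict;
use warnings;

my $t = "housecats";
my ( $whole, $group ) = ( '', '' );

if ( $t =~ /house(cat|)/p ) {
    $whole = ${^MATCH};   # text matched by the entire pattern
    $group = $1;          # text matched by the first capture group
}

print "whole=[$whole] group=[$group]\n";
```

Here the capture numbering is unchanged, because no extra pair of parentheses is added around the pattern.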
You probably misunderstood the comment. It means that
for my $t (qw( housecats house )) {
my ($m) = $t =~ /house(cat|)/;
print "[$m]\n";
}
will print
[cat]
[]
i.e. the regex will match both housecat and house. If the pattern didn't match at all then $m would be undef.
my $t = "housecats";
my ($m) = $t=~m/house(cat|)/gn;
print $m;

What does this if statement do? (string comparison)

I am trying to understand a piece of code which loops over a file, does various assignments, then enters a set of if statements where a string is seemingly compared to nothing. What are /nonsynonymous/ and /prematureStop/ being compared to here? I am mostly experienced with Python.
open(IN,$file);
while(<IN>){
    chomp $_;
    my @tmp = split /\t+/,$_;
    my $id = join("\t",$tmp[0],$tmp[1]-1);
    $id =~ s/chr//;
    my @info_field = split /;/,$tmp[2];
    my $vat = $info_field[$#info_field];
    my $score = 0;
    $self -> {VAT} ->{$id}= $vat;
    $self ->{GENE} -> {$id} = $tmp[3];
    if (/nonsynonymous/ || /prematureStop/){...
It is comparing against the current input line ($_).
By default, perl will automatically use the current input line ($_) when doing regex matches unless overridden (with =~).
From http://perldoc.perl.org/perlretut.html
If you're matching against the special default variable $_ , the $_ =~
part can be omitted:
$_ = "Hello World";
if (/World/) {
print "It matches\n";
}
else {
print "It doesn't match\n";
}
Often in Perl, if a specific variable isn't given, it's assumed that you want to use the default variable $_. For instance, the while loop assigns the incoming lines from <IN> to that variable, chomp $_; could just as well have been written chomp;, and the regular expressions in the if statement try to match with $_ as well.
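A small sketch of that equivalence, using made-up tab-separated records in place of the lines read from the file (the record contents are assumptions for illustration):

```perl
use strict;
use warnings;

# Made-up records standing in for lines read from <IN>.
my @records = (
    "chr1\t100\tx;nonsynonymous\tGENE1",
    "chr2\t200\tx;synonymous\tGENE2",
    "chr3\t300\tx;prematureStop\tGENE3",
);

my ( @implicit, @explicit );
for (@records) {
    # A bare /pattern/ is matched against $_ ...
    push @implicit, $_ if /nonsynonymous/ || /prematureStop/;
    # ... which is the same test as binding $_ explicitly with =~.
    push @explicit, $_ if $_ =~ /nonsynonymous/ || $_ =~ /prematureStop/;
}

print scalar(@implicit), " of ", scalar(@records), " records matched\n";
```

Both conditions should select the same records, since the bare patterns are just shorthand for the explicit binding.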

Perl regular expressions and returned array of matched groups

I am new to Perl and I need to do some regexp matching.
I read that when an array is used as an integer value, it gives the count of elements inside.
So I am doing, for example,
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
and I was thinking it should return an empty array when pattern matching fails, but it still gives me an array with 2 elements, with uninitialized values.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
    if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        if (defined $params{$result[0]}) {
            print STDERR "Cmd option error\n";
        }
        $params{$result[0]} = (defined $result[1] ? $result[1] : 1);
    }
    else {
        print STDERR "Cmd option error\n";
        exit ERROR_CMD;
    }
}
It is a regexp pattern for command-line options. The options are in long format, with two preceding hyphens and an optional argument, so
--CMD[=ARG]. I want an elegant solution, which is why I want to put the match in the if condition without any prologue.
EDIT2:
Oh, sorry, I was thinking that the groups in the @result array are always counted from 0 and that only the groups from the branch where the pattern succeeded are accessible. So if the command in my code is "input", I expected it in $result[0], but it is actually in $result[1]. I thought that if $result[0] was uninitialized, the pattern had failed and execution had still entered the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements, which are your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl evaluates @result in scalar context, so it adds the number of elements to 0. This can also be achieved with scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
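A short sketch of why the condition works, and of why a successful match can still leave $result[0] undefined: every capture group gets a slot in the returned list, including groups in the alternation branch that did not take part in the match. The option strings below are assumptions for illustration:

```perl
use strict;
use warnings;

my $re = qr/^--(?:(help|br)|(?:(input|output|format)=(.+)))$/;

my @hit  = "--input=data.txt" =~ $re;   # three groups, the first one undef
my @miss = "--bogus"          =~ $re;   # failed match: empty list

# Keep only the groups that took part in the match.
my @used = grep { defined } @hit;

print "hit: ",  scalar(@hit),  " groups, used: @used\n";
print "miss: ", scalar(@miss), " groups\n";
```

Filtering with grep { defined } is one way to get the option name into position 0 regardless of which branch matched.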
@results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
    push @result, $1;
    push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my @array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?
use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my @params = ();
while ($string) {
    if ($string =~ /^([^(]*?),/) {
        push @params, $1;
        $string =~ s/^\Q$1\E\s*,?\s*//;
    } else {
        my ($ext, $pre);
        ($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
        push @params, "$pre$ext";
        $string =~ s/^\s*,\s*//;
    }
}
This one supports:
nested parentheses;
empty fields;
strings of any length.
Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that this will (unfortunately) fail to match an empty field between two consecutive commas (,,).
Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in Perl regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my @array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.
Limit the number of elements it can be split into:
split(',', $string, 4)
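A quick sketch of the limit approach on the string from the question. Note that it only helps when the parenthesized part is the last field and the number of fields is known in advance:

```perl
use strict;
use warnings;

my $string = 'ab,12,20100401,xyz(A,B)';

# At most 4 fields: everything after the third comma stays in the last one.
my @array = split /,/, $string, 4;

print "$_\n" for @array;
```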
Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(
    [^,]*\([^)]*\)   # comma inside parens is part of the word
    |
    [^,]*)           # split on comma outside parens
    (?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)
Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
    my ($string) = @_;
    my @fields;
    my @comma_separated = split(/,/, $string);
    my @to_be_joined;
    my $depth = 0;
    foreach my $field (@comma_separated) {
        my @brackets = $field =~ /(\(|\))/g;
        foreach (@brackets) {
            $depth++ if /\(/;
            $depth-- if /\)/;
        }
        if ($depth == 0) {
            push @fields, join(",", @to_be_joined, $field);
            @to_be_joined = ();
        } else {
            push @to_be_joined, $field;
        }
    }
    return @fields;
}
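The same depth-counting idea can also be sketched as a single pass over the characters, which avoids splitting and re-joining. This is a variant for illustration, not the code from the answer above, and split_top_level is a hypothetical helper name:

```perl
use strict;
use warnings;

# split_top_level is a hypothetical helper name, not from the answers above.
sub split_top_level {
    my ($string) = @_;
    my @fields = ('');
    my $depth  = 0;
    for my $ch ( split //, $string ) {
        if ( $ch eq ',' && $depth == 0 ) {
            push @fields, '';       # top-level comma: start a new field
            next;
        }
        $depth++ if $ch eq '(';
        $depth-- if $ch eq ')' && $depth > 0;
        $fields[-1] .= $ch;         # everything else joins the current field
    }
    return @fields;
}

my @fields = split_top_level('ab,12,20100401,xyz(A(2,3),B)');
print "$_\n" for @fields;
```

Like the answer's version, this assumes the parentheses are balanced; it keeps empty fields between consecutive top-level commas.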

Beginner calling a Perl subroutine

I am trying to teach myself Perl and I've looked everywhere for an answer to what probably is a very simple problem. I've defined a subroutine that I call to count the number of letters in a word. If I write it out like this:
$sentence="This is a short sentence.";
@words = split(/\s+/, $sentence);
foreach $element (@words) {
    $lngths .= length($element) . "\n";
}
print "$lngths\n";
Then it works like a charm. However, if I wrap it into a subroutine split doesn't split up the input and instead counts the whole sentence as a single input. Here's how I'm defining the subroutine:
sub countWords {
    @words = split(/\s+/, @_);
    foreach $element(@words) {
        $lngths .= length($element) . "\n";
    }
    return $lngths;
}
From all the pages I've read and texts I've consulted this should work but it doesn't.
Thanks in advance!
The problem is your use of @_. This is an array, but you're accessing it like a scalar.
@_ contains all the parameters to this function. The way it looks, you're passing it a sentence, and you want to split it. Here are some possible ways to do it:
@words = split(/\s+/, $_[0]);
which means "take the first parameter to the function and split it".
Or:
my $sentence = shift;
@words = split(/\s+/, $sentence);
Which is pretty much the same, but uses an intermediate variable for readability.
In fact, what you're doing is:
@words = split(/\s+/, @_);
Which means:
interpret @_ as a scalar, which means the number of elements in @_ (1, in this case)
split the string "1" by whitespace
Which returns the array:
@words = ("1");
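The difference can be seen side by side in a small sketch; broken_split and fixed_split are hypothetical names used only for this comparison:

```perl
use strict;
use warnings;

# Hypothetical helper names, used only to compare the two calls.
sub broken_split { my @words = split( /\s+/, @_ );    return @words }
sub fixed_split  { my @words = split( /\s+/, $_[0] ); return @words }

my @broken = broken_split("This is a short sentence.");
my @fixed  = fixed_split("This is a short sentence.");

print "broken: @broken\n";
print "fixed:  @fixed\n";
```

The broken version should yield a single element holding the argument count, while the fixed version yields the five words.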
You've got the main part of the answer from Nathan; the residual observation is that most people don't count punctuation and digits as letters, but your subroutine does. I'd probably go with:
sub countLetters
{
    my($sentence) = @_;
    $sentence =~ s/[^[:alpha:]]//gm;
    return length($sentence);
}
The key point here is the parentheses around the variable list in the my clause. In general, you have several arguments passed into a sub, and you can assign copies of them to variables in your subroutine like this:
my($var1, $var2, $var3) = @_;
The parentheses provide 'list context' and ensure that the first element of @_ is copied to $var1, the second to $var2 and so on. Without the parentheses, you have 'scalar context', and when an array is evaluated in scalar context, the value returned is the number of elements in the array. Thus:
my $var1, $var2, $var3 = @_;
would likely assign 3 to $var3 (because three values were passed to the subroutine), and $var1 and $var2 would both be undef.
The regular expression deletes all non-alphabetic characters from the string; the number of letters is the length of what's left.
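A minimal sketch of the list-versus-scalar distinction described above; first_arg and arg_count are hypothetical names chosen for the illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers: the only difference is the parentheses after my.
sub first_arg { my ($first) = @_; return $first }   # list context: first element
sub arg_count { my $count   = @_; return $count }   # scalar context: element count

my $first = first_arg( 'a', 'b', 'c' );
my $count = arg_count( 'a', 'b', 'c' );

print "first=$first count=$count\n";
```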
When counting characters, perl's transliteration operator often comes in handy.
To count the non-whitespace characters without having to split your string into separate words, you can do:
$lngths = $sentence =~ tr/ \t\f\r\n//c;
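A short sketch applying this to the sentence from the question, alongside a letters-only count for comparison. The second tr call is an addition for illustration and counts ASCII letters only:

```perl
use strict;
use warnings;

my $sentence = "This is a short sentence.";

# With an empty replacement list and no /d, tr only counts; the string is unchanged.
my $non_space = ( $sentence =~ tr/ \t\f\r\n//c );  # /c: complement, i.e. not whitespace
my $letters   = ( $sentence =~ tr/a-zA-Z// );      # ASCII letters only

print "non-whitespace: $non_space, letters: $letters\n";
```

The two counts differ by one here because the trailing period is not whitespace but is not a letter either.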