Perl $1 giving uninitialized value error - perl

I am trying to extract a part of a string and put it into a new variable. The string I am looking at is:
maker-scaffold_26653|ref0016423-snap-gene-0.1
(inside a $gene_name variable)
and the thing I want to match is:
scaffold_26653|ref0016423
I'm using the following piece of code:
my $gene_name;
my $scaffold_name;
if ($gene_name =~ m/scaffold_[0-9]+\|ref[0-9]+/) {
    $scaffold_name = $1;
    print "$scaffold_name\n";
}
I'm getting the following error when trying to execute:
Use of uninitialized value $scaffold_name in concatenation (.) or string
I know that the pattern is right, because if I use $' instead of $1 I get
-snap-gene-0.1
I'm at a bit of a loss: why will $1 not work here?

If you want to use a value from the match, you have to put () around the part of the regex you want to capture.
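A minimal sketch of that fix, using the string from the question. The assignment to $gene_name is an assumption here, since the question does not show where the value comes from:

```perl
use strict;
use warnings;

# Sample value taken from the question.
my $gene_name = 'maker-scaffold_26653|ref0016423-snap-gene-0.1';
my $scaffold_name;

# The parentheses make this a capture group, so a successful match fills $1.
if ($gene_name =~ m/(scaffold_[0-9]+\|ref[0-9]+)/) {
    $scaffold_name = $1;
    print "$scaffold_name\n";    # prints scaffold_26653|ref0016423
}
```

Without the parentheses the match still succeeds (which is why $' looked right), but nothing is captured, so $1 stays undefined.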

To expand on Jens' answer, () in a regex signifies an anonymous capture group. The content matched by each capture group is stored in $1, $2, $3, and so on, numbered from left to right, so for example,
/(..):(..):(..)/
on an HH:MM:SS time string will store hours, minutes, and seconds in $1, $2, $3 respectively. Naturally this begins to become unwieldy and is not self-documenting, so you can assign the results to a list instead:
my ($hours, $mins, $secs) = $time =~ m/(..):(..):(..)/;
So your example could bypass the use of $ variables by doing direct assignment:
my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;
# $scaffold_name now contains 'scaffold_26653|ref0016423'
You can even get rid of the ugly =~ binding by using for as a topicalizer:
my $scaffold_name;
for ($gene_name) {
    ($scaffold_name) = m/(scaffold_\d+[|]ref\d+)/;
    print $scaffold_name;
}
If things start to get more complex, I prefer to use named capture groups (introduced in Perl v5.10.0):
$gene_name =~ m{
    (?<scaffold_name>    # ?<name> creates a named capture group
        scaffold_\d+     # 'scaffold_' and its trailing digits
        [|]              # Literal pipe symbol
        ref\d+           # 'ref' and its trailing digits
    )
}xms;                    # The x flag lets us write more readable regexes
print $+{scaffold_name}, "\n";
The results of named capture groups are stored in the magic hash %+. Access is done just like any other hash lookup, with the capture group names as the keys. %+ is locally scoped in the same way the numbered variables ($1, $2, ...) are, so it can be used as a drop-in replacement for them in most situations.
It's overkill for this particular example, but as regexes start to get larger and more complicated, this saves you the trouble of either having to scroll all the way back up and count anonymous capture groups from left to right to find which of those darn numbered variables is holding the capture you wanted, or scan across a long list assignment to find where to add a new variable to hold a capture that got inserted in the middle.
My personal rule of thumb is to assign the results of anonymous captures to descriptively named, lexically scoped variables for three or fewer captures, then switch to using named captures, comments, and indentation in regexes when more are necessary.

Related

What does $variable{$2}++ mean in Perl?

I have a two-column data set in a tab-separated .txt file, and the perl script reads it as FH and this is the immediate snippet of code that follows:
while (<FH>)
{
    chomp;
    s/\r//;
    /(.+)\t(.+)/;
    $uniq_tar{$2}++;
    $uniq_mir{$1}++;
    push @{$mir_arr{$1}}, $2;
    push @{$target{$2}}, $1;
}
When I try to print any of the above 4 variables, it says the variables are uninitialized.
And when I tried to print $uniq_tar{$2}++; and $uniq_mir{$1}++; it just printed some numbers which I cannot understand.
I would just like to know what this part of the code does in general:
$uniq_tar{$2}++;
The while loop puts each line of your file, in turn, into Perl's special variable $_.
/.../ is the match operator. By default it works on $_.
/(.+)\t(.+)/ is a regular expression inside the match operator. If the regex matches what is in $_, then the bits of the matching string that are inside the two pairs of parentheses are stored in Perl's special variables $1 and $2.
You have hashes called %uniq_tar and %uniq_mir. You access individual elements in a hash using the $hashname{key} syntax. So $uniq_tar{$2} finds the value in %uniq_tar associated with the key that is stored in $2 (that is, the part of your record after the tab).
$variable++ increments the number in $variable. So $uniq_tar{$2}++ increments the value that we found in the previous paragraph.
So, as zdim says, it's a frequency counter. You read each line in the file and extract the bits of data before and after the tab. You then increment the values in two hashes to count the number of occurrences of each of the strings. The numbers you saw when printing are these running counts.
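To see the counter at work, here is a small self-contained sketch. The sample records are hypothetical stand-ins for the lines of the file, and unlike the original loop it skips lines that do not match, so $1 and $2 are never left over from an earlier line:

```perl
use strict;
use warnings;

# Hypothetical two-column, tab-separated records standing in for the file.
my @lines = ("miR-1\tGENE_A\r\n", "miR-1\tGENE_B\n", "miR-2\tGENE_A\n");

my (%uniq_mir, %uniq_tar, %mir_arr, %target);

for (@lines) {
    chomp;
    s/\r//;
    next unless /(.+)\t(.+)/;      # only use $1 and $2 after a successful match
    $uniq_mir{$1}++;               # how often each first-column value occurs
    $uniq_tar{$2}++;               # how often each second-column value occurs
    push @{ $mir_arr{$1} }, $2;    # second-column values seen per first-column value
    push @{ $target{$2} }, $1;     # first-column values seen per second-column value
}

print "$_ => $uniq_mir{$_}\n" for sort keys %uniq_mir;
print "$_ => $uniq_tar{$_}\n" for sort keys %uniq_tar;
```

Printing the hashes after the loop, rather than printing `$uniq_tar{$2}++` inside it, shows the final counts per key.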

Check if an array contains a time - Perl

How do I check if an array contains a time value? I've tried checking like this:
if ( @time =~ /$_:$_:$_/)
But it didn't work. Any ideas?
P.S.: The time is given like this: HH:MM:SS
Matching the time
To check for HH:MM:SS with a regular expression match, the simplest pattern would be
/\d\d:\d\d:\d\d/
If you only want this, add anchors for start (^) and end ($) of the string.
/^\d\d:\d\d:\d\d$/
If you want to make sure that your digits are only 0 to 9 and not digits from any script, use a character class.
/^[0-9]{2}:[0-9]{2}:[0-9]{2}$/
If you also want to make sure the time is a valid time, things get more complicated.
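As a rough sketch of what "more complicated" can look like, the pattern below restricts hours to 00-23 and minutes and seconds to 00-59. The variable name $time_re and the sample strings are made up for illustration, and leap seconds are deliberately not handled:

```perl
use strict;
use warnings;

# Hours 00-23, minutes and seconds 00-59.
my $time_re = qr/^(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$/;

for my $candidate ('23:59:59', '24:00:00', '12:60:00') {
    my $verdict = $candidate =~ $time_re ? 'valid' : 'invalid';
    print "$candidate is $verdict\n";
}
```

Only the first of the three sample strings is reported as valid.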
You might want to read perlre and perlretut. The tag wiki on Regular Expressions here on Stack Overflow has a lot of useful information and links to tools as well.
On arrays and scalars
However, there is no array in the code you've shown. In Perl, a variable with a $ as its sigil is called a scalar and represents a single value. That's the only thing you can pattern match against. An array would start with an @ symbol.
What you can do is match against every element in your array. For that, you have to iterate the array.
A very verbose way to do that would be:
my $matches;
foreach my $time (@times) {
    ++$matches if $time =~ m/\d\d:\d\d:\d\d/;
}
A more Perlish way would be to use grep.
my $matches = grep { m/\d\d:\d\d:\d\d/ } @times;
This works because grep in scalar context returns the number of elements that matched rather than the elements themselves. If all you want is to know whether any of the elements matched, this is enough.
What your code did
The $_ variable is called the topic in Perl, and often contains some kind of default value for certain operators, if no other value is specified. Depending on where in your program you used your line of code, you are matching the number of elements in @time (because of scalar context, see above) against a pattern built up of the content of $_ and colons.
if (
    @time    # number of elements in array @time
    =~       # because this operator forces scalar context
    /
        $_   # value of $_ based on surrounding code, or undef
        :    # a literal colon
        $_   # see above
        :    # a literal colon
        $_   # see above
    /x       # (I added /x to allow comments so this compiles)
) { ... }

Maximum number of captured groups in perl regex

Given a regex in Perl, how do I find the maximum number of captured groups in that regex? I know that I can use $1, $2 etc. to reference the first, second etc. captured groups. But how do I find the maximum number of such groups? By captured groups, I mean the strings matched by the parts of a regex in parentheses. For example: if the regex is (a+)(b+)c+ then the string "abc" matches that regex. And the first captured group will be $1, the second will be $2.
amon hinted at the answer to this question when he mentioned the %+ hash. But what you need is the @+ array:
@+
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against. The nth element of this array holds the offset of the nth submatch, so $+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on. You can use $#+ to determine how many subgroups were in the last successful match. See the examples given for the @- variable. [emphasis added]
$re = "(.)" x 500;
$str = "a" x 500;
$str =~ /$re/;
print "Num captures is $#+"; # outputs "Num captures is 500"
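For a smaller illustration of the offsets described in that quote, here is a sketch that applies the regex from the question to a made-up string. The variable names are just for this example:

```perl
use strict;
use warnings;

my $str = 'xxaabbcc';
my ($groups, $start1, $end1, $end_all);

if ($str =~ /(a+)(b+)c+/) {
    $groups  = $#+;      # last index of @+, i.e. the number of capture groups
    $start1  = $-[1];    # offset where $1 starts
    $end1    = $+[1];    # offset just past where $1 ends
    $end_all = $+[0];    # offset just past the end of the whole match
    print "groups=$groups start1=$start1 end1=$end1 end_all=$end_all\n";
    # prints groups=2 start1=2 end1=4 end_all=8
}
```

Note that @+ and @- are only meaningful after a successful match, which is why everything sits inside the if block.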
The number of captures is effectively unlimited. The numbered variables do not stop at $9: $10, $11, and so on hold the later captures, so you can use as many capture groups as you need.
If you have more than a few capture groups, you might want to use named captures, like
use feature 'say';
my $str = "foobar";
if ($str =~ /(?<name>fo+)/) {
    say $+{name};
}
Output: foo. You can access the values of named captures via the %+ hash.
You can use code like the following to give you a count of capture groups:
$regex = qr/..../; # Some arbitrary regex with capture groups
my @capture = '' =~ /$regex|()/; # A successful match incorporating the regex
my $groups_in_my_regex = scalar(@capture) - 1;
The way it works is that it performs a match which must succeed and then checks how many capture groups were created. (An extra one is created due to the trailing |().)
Edit: Actually, it doesn't seem to be necessary to append an extra capture group. As long as the match is guaranteed to succeed, the array will contain an entry for every capture group.
So we can change the 2nd and 3rd lines to:
my @capture = '' =~ /$regex|/; # A successful match incorporating the regex
my $groups_in_my_regex = scalar(@capture);
One caveat: if $regex contains no capture groups at all, a successful match in list context returns the list (1), so this shorter form reports 1 instead of 0. The |() version does not have that problem.
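Wrapped up as a subroutine, the idea might look like the sketch below. The name count_groups is hypothetical, and it keeps the trailing |() so that a regex without any capture groups is reported as 0:

```perl
use strict;
use warnings;

# Hypothetical helper: counts the capture groups in a compiled regex.
sub count_groups {
    my ($regex) = @_;
    # Matching the empty string against "$regex|()" always succeeds, and in
    # list context the match returns one element per capture group, whether
    # or not that group took part in the match.
    my @capture = '' =~ /$regex|()/;
    return scalar(@capture) - 1;    # subtract the group we appended
}

my $two  = count_groups(qr/(a+)(b+)c+/);
my $none = count_groups(qr/abc/);
print "$two $none\n";    # prints 2 0
```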
See also:
Count the capture groups in a qr regex?

(3 lines) from bash to perl?

I have these three lines in bash that work really nicely. I want to add them to some existing perl script but I have never used perl before ....
Could somebody rewrite them for me? I tried to use them as they are and it didn't work.
Note that $SSH_CLIENT is a run-time parameter you get if you type set in bash (Linux).
users[210]=radek #where 210 is the last octet from my mac's IP
octet=($SSH_CLIENT) # split the value on spaces
somevariable=${users[${octet[0]##*.}]} # extract the last octet from the ip address
These might work for you. I noted my assumptions with each line.
my %users = ( 210 => 'radek' );
I assume that you wanted a sparse array. Hashes are the standard implementation of sparse arrays in Perl.
my @octet = split ' ', $ENV{SSH_CLIENT}; # split the value on spaces
I assume that you still wanted to use the environment variable SSH_CLIENT
my ( $some_var ) = $octet[0] =~ /\.(\d+)$/;
You want the last set of digits from the '.' to the end.
The parens around the variable put the assignment into list context.
In list context, a match creates a list of all the "captured" sequences.
Assigning a list to a parenthesized list of scalars fills only as many scalars as appear on the left-hand side; any further captured values are discarded.
As for your question in the comments, you can get the variable out of the hash, by:
$db = $users{ $some_var };
# OR--this one's kind of clunky...
$db = $users{ [ $octet[0] =~ /\.(\d+)$/ ]->[0] };
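Putting the pieces together, a complete sketch could look like this. The sample value of $ssh_client is an assumption for illustration (SSH normally sets SSH_CLIENT to the client address followed by two port numbers); in the real script you would read $ENV{SSH_CLIENT} instead:

```perl
use strict;
use warnings;

my %users = ( 210 => 'radek' );

# Assumed sample value; in real use this would be $ENV{SSH_CLIENT}.
my $ssh_client = '192.168.1.210 51234 22';

my @octet = split ' ', $ssh_client;            # split the value on spaces
my ($last_octet) = $octet[0] =~ /\.(\d+)$/;    # digits after the final dot

# Look the user up only if the address had the expected shape.
my $user = defined $last_octet ? $users{$last_octet} : undef;
print defined $user ? "$user\n" : "no user for this address\n";
```

With the sample value above this prints radek. The defined checks avoid "uninitialized value" warnings when the address does not match or the octet has no entry in %users.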
Say you have already gotten your IP in a string,
$macip = "10.10.10.123";
@s = split /\./ , $macip;
print $s[-1]; #get last octet
If you don't know Perl and you are required to use it for work, you will have to learn it. Surely you are not going to come to SO and ask every time you need it in Perl right?

Splitting a variable and putting into an array

I have a string like this <name>sekar</name>. I want to split this string (I am using Perl) and take out only sekar, and push it into an array while leaving the other stuff.
I know how to push into an array, but I'm stuck on the splitting part.
Does anyone have any idea how to do this?
push #output, $1 if m|<name>(\w*)</name>|;
Try this:
my($name) = $string =~ m|<name>(.*)</name>|;
From perldoc perlop:
If the "/g" option is not used, "m//" in list context returns a
list consisting of the subexpressions matched by the
parentheses in the pattern, i.e., ($1, $2, $3...).
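If the string may contain several such elements, adding /g in list context returns the captures from every match, which can be pushed into the array in one go. A small sketch, with a made-up input string:

```perl
use strict;
use warnings;

# Hypothetical input with more than one <name> element.
my $string = '<name>sekar</name><name>mohan</name>';
my @output;

# With /g in list context, m// returns the captures from every match.
# The non-greedy .*? stops at the first closing tag instead of the last.
push @output, $string =~ m|<name>(.*?)</name>|g;

print "@output\n";    # prints sekar mohan
```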
Try <(("[^"]*"|'[^']*'|[^'">])*)>(\w+)<\/\1>. Should work, when I get home I'll test it. The idea is that the first capture group finds the contents within a <> and its nested capture group prevents a situation like <blah=">"> matching as <blah=">. The third capture group (\w+) matches the inner word. This may have to be changed depending on the format of the possibilities you can have within the <tag>content</tag>. Lastly the \1 looks back at the content of the first capture group so that this way you will find the proper closing tag.
Edit: I've tested this with perl and it works.