I have a string like this: <name>sekar</name>. I want to split this string (I am using Perl), take out only sekar, and push it into an array while leaving the other stuff.
I know how to push into an array, but I am stuck on the splitting part.
Does anyone have any idea how to do this?
push @output, $1 if m|<name>(\w*)</name>|;
Try this:
my($name) = $string =~ m|<name>(.*)</name>|;
From perldoc perlop:
If the "/g" option is not used, "m//" in list context returns a
list consisting of the subexpressions matched by the
parentheses in the pattern, i.e., ($1, $2, $3...).
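As a rough illustration of that list-context behaviour (the sample string and variable names are made up for this example): without /g the match returns the captures of the first match only, while with /g it returns the captures of every match, which fills an array in one go.

```perl
use strict;
use warnings;

# Hypothetical input containing two <name> elements.
my $string = '<name>sekar</name> and <name>kumar</name>';

# Without /g: list context returns the captures of the first match only.
my ($first) = $string =~ m|<name>(\w*)</name>|;

# With /g: list context returns the captures of every match.
my @output = $string =~ m|<name>(\w*)</name>|g;

print "$first\n";     # sekar
print "@output\n";    # sekar kumar
```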
Try <(("[^"]*"|'[^']*'|[^'">])*)>(\w+)<\/\1>. Should work, when I get home I'll test it. The idea is that the first capture group finds the contents within a <> and its nested capture group prevents a situation like <blah=">"> matching as <blah=">. The third capture group (\w+) matches the inner word. This may have to be changed depending on the format of the possibilities you can have within the <tag>content</tag>. Lastly the \1 looks back at the content of the first capture group so that this way you will find the proper closing tag.
Edit: I've tested this with perl and it works.
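A small sketch of how that pattern could be used to fill an array, assuming simple tags without attributes (the test strings are invented for illustration). The backreference \1 is what rejects the second string, whose closing tag does not match its opening tag.

```perl
use strict;
use warnings;

# The pattern from the answer above, written with qr{} so "/" needs no escaping.
my $re = qr{<(("[^"]*"|'[^']*'|[^'">])*)>(\w+)</\1>};

my @output;
for my $string ('<name>sekar</name>', '<name>sekar</other>') {
    # $1 holds the tag contents, $3 holds the inner word.
    push @output, $3 if $string =~ $re;
}

print "@output\n";   # only the first string matches, so this prints "sekar"
```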
I am trying to select the below value from database:
Reporting that one of #its many problems had been the recent# extended
sales slump in women's apparel, the seven-store retailer said it would
start a three-month liquidation sale in all of its stores.~(A) its
many problems had been the recent~(B) its many problems has been the
recently~(C) its many problems is the recently~(D) their many problems
is the recent~(E) their many problems had been the recent~
I am selecting this value into the variable $ques and then extracting part of the text as below:
$ques=~s/^(.*?)\#(.*?)\#(.*?)$/$2/;
Now, while replacing the ~ character in the string by
$3=~s/~/\n/g; ---->line 171
and running the script, I get this error:
Modification of a read-only value attempted at main.pl line 171
I want to replace every ~ character with a newline ('\n') and print the final value. Please suggest how to do it.
*I have researched this on the net, but I am confused about how to handle these read-only variables.
You've already got a good explanation of the problem from José Castro. But there's another solution if you're using a recent-ish version of Perl (Update: having checked more carefully, I find that means 5.14+). The /r argument to the substitution operator will copy your string, make the substitution on the copy and then return that altered value.
So you could write:
my $new_value = $3 =~ s/~/\n/rg;
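A self-contained sketch of the /r approach, assuming Perl 5.14 or later and a shortened, made-up version of the question's data:

```perl
use v5.14;      # /r needs Perl 5.14 or later
use warnings;

# Hypothetical shortened version of the question's data.
my $ques = 'Intro #marked part# rest of sentence.~(A) first~(B) second~';

my $new_value;
if ($ques =~ /^(.*?)#(.*?)#(.*?)$/) {
    # /r substitutes on a copy and returns it, so the read-only $3 is never modified.
    $new_value = $3 =~ s/~/\n/rg;
}

print $new_value;
```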
It sounds like what you really want in this case is split rather than regular expression capture groups:
my @parts = split(/#/, $ques);
$parts[2] =~ s/~/\n/g;
It makes the intent of your code clearer since you are, in fact, splitting on # symbols.
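For illustration, here is the split version run on a shortened, invented copy of the data. It assumes the text contains exactly two # markers, so the pieces are the text before, the marked text, and the text after.

```perl
use strict;
use warnings;

# Hypothetical shortened version of the question's data.
my $ques = 'Intro #marked part# rest of sentence.~(A) first~(B) second~';

# Splitting on "#" gives: text before, marked text, text after.
my @parts = split /#/, $ques;

# Array elements are ordinary writable scalars, unlike $1, $2, $3.
$parts[2] =~ s/~/\n/g;

print "Marked: $parts[1]\n";
print $parts[2];
```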
Just like you say, the special variables $1, $2, etc., are read-only, and that means that you can't perform that substitution on them.
Performing the substitution on $ques will do what you need:
$ques =~ s/~/\n/g;
print $ques;
Do note that in the earlier substitution that you're performing on $ques you're getting rid of all the ~ characters.
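If the individual pieces are still needed, one possible workaround is to copy the captures into ordinary variables first and modify the copies. This is only a sketch; the variable names and the shortened data are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical shortened version of the question's data.
my $ques = 'Intro #marked part# rest of sentence.~(A) first~(B) second~';

# List assignment copies the captures into writable lexical variables.
my ($before, $marked, $options) = $ques =~ m{^(.*?)\#(.*?)\#(.*)\z}s;

$options =~ s/~/\n/g;    # allowed: $options is a copy, not $3 itself

print $options;
```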
I am still learning Perl and have almost got a program written. My question, simple as it may be, is: if I want to hardcode a string into a field, would the line below do that? Thank you :).
$out[45]="VUS";
In the other lines I use the code below to set the values that go into @out, but the one in question is hardcoded while the others come from a split.
my @vals = split/\t/; # this splits the line at tabs
my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
my ($gene,$transcript,$exon,$coding,$aa);
for (@mutations)
{
($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
grep {$transcript eq $_} keys %nms or next;
}
my @out=($.,@colsleft,$_,@colsright);
$out[2]=$gene;
$out[3]=$nms{$transcript};
$out[4]=$transcript;
$out[15]=$coding;
$out[17]=$aa;
Your line of code, $out[45]="VUS";, is correct: it sets the 46th element of the array @out to the string "VUS". I am trying to understand from your code, however, why you would want to do that. Usually it is better practice not to hardcode values if at all possible; aim to make your program as dynamic as you can.
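One modest step away from bare literals, sketched here with hypothetical constant names (LABEL_COL and DEFAULT_LABEL are not from the original program), is to name the column index and the value once. The example also shows that assigning past the end of an array simply extends it, with the gap filled by undef.

```perl
use strict;
use warnings;

# Hypothetical names: the index and the label are defined in one place
# instead of appearing as bare literals throughout the code.
use constant {
    LABEL_COL     => 45,
    DEFAULT_LABEL => 'VUS',
};

my @out = ('line', 'gene', 'transcript');   # stand-in for the real row

# Assigning past the end extends the array; the gap is filled with undef.
$out[LABEL_COL] = DEFAULT_LABEL;

print scalar(@out), "\n";        # 46
print $out[LABEL_COL], "\n";     # VUS
```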
I am trying to extract a part of a string and put it into a new variable. The string I am looking at is:
maker-scaffold_26653|ref0016423-snap-gene-0.1
(inside a $gene_name variable)
and the thing I want to match is:
scaffold_26653|ref0016423
I'm using the following piece of code:
my $gene_name;
my $scaffold_name;
if ($gene_name =~ m/scaffold_[0-9]+\|ref[0-9]+/) {
$scaffold_name = $1;
print "$scaffold_name\n";
}
I'm getting the following error when trying to execute:
Use of uninitialized value $scaffold_name in concatenation (.) or string
I know that the pattern is right, because if I use $' instead of $1 I get
-snap-gene-0.1
I'm at a bit of a loss: why will $1 not work here?
If you want to use a value from the match, you have to put () around the part of the regex you want to capture.
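Applied to the code in the question, that means wrapping the whole pattern in parentheses so that $1 has something to hold. A minimal sketch, with the variable initialised to the string from the question:

```perl
use strict;
use warnings;

my $gene_name = 'maker-scaffold_26653|ref0016423-snap-gene-0.1';
my $scaffold_name;

# The parentheses create capture group 1, which is what populates $1.
if ($gene_name =~ m/(scaffold_[0-9]+\|ref[0-9]+)/) {
    $scaffold_name = $1;
    print "$scaffold_name\n";    # scaffold_26653|ref0016423
}
```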
To expand on Jens' answer, () in a regex signifies an anonymous capture group. The content matched in a capture group is stored in $1-9+ from left to right, so for example,
/(..):(..):(..)/
on an HH:MM:SS time string will store hours, minutes, and seconds in $1, $2, $3 respectively. Naturally this begins to become unwieldy and is not self-documenting, so you can assign the results to a list instead:
my ($hours, $mins, $secs) = $time =~ m/(..):(..):(..)/;
So your example could bypass the use of $ variables by doing direct assignment:
my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;
# $scaffold_name now contains 'scaffold_26653|ref0016423'
You can even get rid of the ugly =~ binding by using for as a topicalizer:
my $scaffold_name;
for ($gene_name) {
($scaffold_name) = m/(scaffold_\d+[|]ref\d+)/;
print $scaffold_name;
}
If things start to get more complex, I prefer to use named capture groups (introduced in Perl v5.10.0):
$gene_name =~ m{
(?<scaffold_name> # ?<name> creates a named capture group
scaffold_\d+? # 'scaffold' and its trailing digits
[|] # Literal pipe symbol
ref\d+ # 'ref' and its trailing digits
)
}xms; # The x flag lets us write more readable regexes
print $+{scaffold_name}, "\n";
The results of named capture groups are stored in the magic hash %+. Access is done just like any other hash lookup, with the capture group names as the keys. %+ is scoped in the same way the numbered variables ($1, $2, ...) are, so it can be used as a drop-in replacement for them in most situations.
It's overkill for this particular example, but as regexes start to get larger and more complicated, this saves you the trouble of either having to scroll all the way back up and count anonymous capture groups from left to right to find which of those darn $ variables is holding the capture you wanted, or scan across a long list assignment to find where to add a new variable to hold a capture that got inserted in the middle.
My personal rule of thumb is to assign the results of anonymous captures to descriptively named, lexically scoped variables for three or fewer captures, then switch to using named captures, comments, and indentation in regexes when more are necessary.
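As a further sketch, the earlier HH:MM:SS example could be written with named captures like this (the sample time and the %t hash are made up for illustration). Copying %+ into an ordinary hash keeps the values safe from later matches.

```perl
use strict;
use warnings;

my $time = '12:34:56';   # hypothetical HH:MM:SS string

my %t;
if ($time =~ m{ (?<hours>\d\d) : (?<mins>\d\d) : (?<secs>\d\d) }x) {
    # Copy out of %+ right away; the next successful match resets it.
    %t = %+;
}

print "$t{hours}h $t{mins}m $t{secs}s\n";   # 12h 34m 56s
```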
Given a regex in Perl, how do I find the maximum number of captured groups in that regex? I know that I can use $1, $2, etc. to reference the first, second, etc. captured groups. But how do I find the maximum number of such groups? By captured groups, I mean the strings matched by the parts of a regex in parentheses. For example, if the regex is (a+)(b+)c+ then the string "abc" matches that regex, and the first captured group will be $1, the second will be $2.
amon hinted at the answer to this question when he mentioned the %+ hash. But what you need is the @+ array:
@+
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against. The nth element of this array holds the offset of the nth submatch, so $+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on. You can use $#+ to determine how many subgroups were in the last successful match. See the examples given for the @- variable. [emphasis added]
$re = "(.)" x 500;
$str = "a" x 500;
$str =~ /$re/;
print "Num captures is $#+"; # outputs "Num captures is 500"
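To make the offsets concrete, here is a small sketch using the regex from the question on an invented string. It pairs @+ with @-, which holds the start offsets, to cut each group back out of the original string.

```perl
use strict;
use warnings;

my $str = 'xxabbcyy';    # hypothetical input
my @report;

if ($str =~ /(a+)(b+)c+/) {
    # $#+ is the number of capture groups in the pattern that just matched.
    for my $n (1 .. $#+) {
        # $-[$n] is where group $n starts; $+[$n] is the offset just past its end.
        my $text = substr($str, $-[$n], $+[$n] - $-[$n]);
        push @report, sprintf('group %d: %d..%d = %s', $n, $-[$n], $+[$n], $text);
    }
}

print "$_\n" for @report;
# group 1: 2..3 = a
# group 2: 3..5 = bb
```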
The number of captures is effectively unlimited. The numbered variables are not limited to $1 through $9 either: Perl also provides $10, $11, and so on, for as many capture groups as the pattern has.
If you have more than a few capture groups, you might want to use named captures, like
use feature 'say'; # say is not enabled by default
my $str = "foobar";
if ($str =~ /(?<name>fo+)/) {
say $+{name};
}
Output: foo. You can access the values of named captures via the %+ hash.
You can use code like the following to give you a count of capture groups:
$regex = qr/..../; # Some arbitrary regex with capture groups
my @capture = '' =~ /$regex|()/; # A successful match incorporating the regex
my $groups_in_my_regex = scalar(@capture) - 1;
The way it works is that it performs a match which must succeed and then checks how many capture groups were created. (An extra one is created due to the trailing |().)
Edit: Actually, it doesn't seem to be necessary to append an extra capture group. Just so long as the match is guaranteed to succeed then the array will contain an entry for every capture group.
So we can change the 2nd and 3rd lines to:
my @capture = '' =~ /$regex|/; # A successful match incorporating the regex
my $groups_in_my_regex = scalar(@capture);
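The counting trick can be wrapped in a small helper. This sketch uses a hypothetical sub name, count_groups, and keeps the trailing |() variant: a list-context match on a pattern with no parentheses at all returns (1) on success, so the shortened form would report one group for such a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of capture groups in a compiled regex.
sub count_groups {
    my ($regex) = @_;
    # The empty alternative guarantees a successful match against ''.
    # The extra () adds exactly one capture, which is subtracted again.
    my @capture = '' =~ /$regex|()/;
    return scalar(@capture) - 1;
}

print count_groups(qr/(a+)(b+)c+/), "\n";   # 2
print count_groups(qr/abc/), "\n";          # 0
print count_groups(qr/((a)(b))/), "\n";     # 3
```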
See also:
Count the capture groups in a qr regex?
I have a file that I am reading in. I'm using Perl to reformat the date. It is a comma-separated file. In one of the files, I know that element.0 is a zipcode and element.1 is a counter. Each row can have 1 to n cities. I need to know the number of elements from element.3 to the end of the line so that I can reformat them properly. I wanted to use a foreach loop starting at element.3 to format the other elements into a single string.
Any help would be appreciated. Basically I am trying to read in a csv file and create a cpp file that can then be compiled on another platform as a plug-in for that platform.
Best Regards
Michael Gould
You can do something like this to get the fields from a line:
my @fields = split /,/, $line;
To access all elements from 3 to the end, do this:
foreach my $city (@fields[3..$#fields])
{
#do stuff
}
(Note, based on your question I assume you are using zero-based indexing. Thus "element 3" is the 4th element).
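Put together, the loop might look like the sketch below. The sample row, the field at index 2, and the quoting style are all invented for illustration; the real output format depends on what the generated cpp file needs.

```perl
use strict;
use warnings;

# Hypothetical row: zip code, counter, one more field, then the cities.
my $line = '30301,2,GA,Atlanta,Decatur';

my @fields = split /,/, $line;

my @formatted;
foreach my $city (@fields[3 .. $#fields]) {
    push @formatted, qq{"$city"};    # the quoting style is just an example
}

# Join the formatted elements into a single string.
my $cities = join ', ', @formatted;
print scalar(@formatted), " cities: $cities\n";   # 2 cities: "Atlanta", "Decatur"
```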
Alternatively, consider Text::CSV to read your CSV file, especially if you have things like escaped delimiters.
Well if your line is being read into an array, you can get the number of elements in the array by evaluating it in scalar context, for example
my $elems = @line;
or to be really sure
my $elems = scalar(@line);
Although in that case the scalar is redundant, it's handy for forcing scalar context where it would otherwise be list context. You can also find the index of the last element of the array with $#line.
After that, if you want to get everything from element 3 onwards you can use an array slice:
my @threeonwards = @line[3 .. $#line];
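A short sketch of how the count and the slice behave, including a row that has no elements from index 3 onwards (both rows are made up). When $#line is less than 3, the range 3 .. $#line is empty, so the slice is simply an empty list.

```perl
use strict;
use warnings;

my %count_for;
for my $row ('30301,2,GA,Atlanta,Decatur', '30302,0,GA') {
    my @line = split /,/, $row;

    # An array slice from index 3 to the last index; empty if there is no index 3.
    my @threeonwards = @line[3 .. $#line];

    # Assigning an array to a scalar gives its number of elements.
    my $elems = @threeonwards;
    $count_for{$row} = $elems;

    print "$row has $elems element(s) from index 3 onwards\n";
}
```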