How to create dynamically a substitution? - perl

How could I add dynamically a modifier to a substitution?
my $str = 'aaBaabaabaa';
my $mod = 'g';
$str =~ s/b// + $mod;

qr/(?X:).../ doesn't work for operation-only modifiers like /g.
A possible approach would be:
#!/usr/bin/perl
use strict;
use warnings;
my $str = 'aaBaabaabaa';
print "BEFORE: ${str}\n";
# regex configuration
my $pattern = '(.)(.)'; # 'b';
my $replacement = '"$2$1"'; # '';
my $mod = 'ieg'; # 'g';
# extract operation-only modifiers
my $operation_only = qr/([ge])/;
my %special;
$special{$_}++ foreach ($mod =~ /$operation_only/g);
my $mod_filtered;
($mod_filtered = $mod) =~ s/$operation_only//g;
# generate regex
my $regex = "(?${mod_filtered}:${pattern})";
my $compiled = qr/$regex/;
print "s/ ${regex} -> ${compiled} / ${replacement} / $mod -> ", join('', sort keys %special), "\n";
# execute substitution
my $replacement_code = sub { return $replacement };
for (1..$special{e} // 0) {
my $recurse = $replacement_code;
$replacement_code = sub { return eval $recurse->() };
}
if (exists $special{g}) {
$str =~ s/$compiled/$replacement_code->()/ge;
} else {
$str =~ s/$compiled/$replacement_code->()/e;
}
print "AFTER: ${str}\n";
exit 0;
Output:
$ perl dummy.pl
BEFORE: aaBaabaabaa
s/ (?i:(.)(.)) -> (?^:(?i:(.)(.))) / "$2$1" / ieg -> eg
AFTER: aaaBbaaaaba

Related

match string between columns using perl

I want to compare a string in column A with that in column B for every row and print a third column that highlights the differences.
Column A Column B
uuaaugcuaauugugauaggggu uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguu uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguuu uuaaugcuaauugugauaggggu
Desired Result:
Column A Column B Column C
uuaaugcuaauugugauaggggu uuaaugcuaauugugauaggggu ********************
uuaaugcuaauugugauagggguu uuaaugcuaauugugauaggggu ********************u
uuaaugcuaauugugauagggguuu uuaaugcuaauugugauaggggu ********************uu
I have an example script that might work, but how do I do this for every row in the data frame?
use strict;
use warnings;
my $string1 = 'AAABBBBBCCCCCDDDDD';
my $string2 = 'AEABBBBBCCECCDDDDD';
my $result = '';
for(0 .. length($string1)) {
my $char = substr($string2, $_, 1);
if($char ne substr($string1, $_, 1)) {
$result .= "**$char**";
} else {
$result .= $char;
}
}
print $result;
Using bruteforce and substr
use strict;
use warnings;
while (<DATA>) {
my ($str1, $str2) = split;
my $len = length $str1 < length $str2 ? length $str1 : length $str2;
for my $i (0..$len-1) {
my $c1 = substr $str1, $i, 1;
my $c2 = substr $str2, $i, 1;
if ($c1 eq $c2) {
substr $str1, $i, 1, '*';
substr $str2, $i, 1, '*';
}
}
printf "%-30s %s\n", $str1, $str2;
}
__DATA__
Column_A Column_B
uuaaugcuaauugugauaggggu uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguu uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguuu uuaaugcuaauugugauaggggu
AAABBBBBCCCCCDDDDD AEABBBBBCCECCDDDDD
Outputs:
*******A *******B
*********************** ***********************
***********************u ***********************
***********************uu ***********************
*A********C******* *E********E*******
Alternative using XOR
It's also possible to use ^ to find the intersection between two strings.
The following performs the same as the above:
while (<DATA>) {
my ($str1, $str2) = split;
my $intersection = $str1 ^ $str2;
while ($intersection =~ /(\0+)/g) {
my $len = length $1;
my $pos = pos($intersection) - $len;
substr $str1, $pos, $len, '*' x $len;
substr $str2, $pos, $len, '*' x $len;
}
printf "%-30s %s\n", $str1, $str2;
}
I could not resist to provide a modified Miller's solution with regular expressions
use strict;
use warnings;
while (<DATA>) {
my $masked_str1 ="";
my $masked_str2 ="";
my ($str1, $str2) = split;
my $intersection = $str1 ^ $str2;
while ($intersection =~ /(\x00+)/g) {
my $mask = $intersection;
$mask =~ s/\x00/1/g;
$mask =~ s/[^1]/0/g;
while ( $mask =~ /\G(.)/gc ) { # traverse the mask
my $bit = $1;
if ( $str1 =~ /\G(.)/gc ) { # traverse the string1 to be masked
$masked_str1 .= $bit ? '_' : $1;
}
if ( $str2 =~ /\G(.)/gc ) { # traverse the string2 to be masked
$masked_str2 .= $bit ? '_' : $1;
}
}
}
print "=" x 80;
printf "\n%-30s %s\n", $str2, $str1; # Minimum length 30 char, left-justified
printf "%-30s %s\n", $str1, $str2;
printf "%-30s %s\n\n", $masked_str1, $masked_str2;
}

Pattern Matching in perl

I want to parse some information from the file.
Information in the file:
Rita_bike_house_Sha9
Rita_bike_house
I want to have output like dis
$a = Rita_bike_house and $b = Sha9,
$a = Rita_bike_house and $b = "original"
In order to get that I have used the below code:
$name = #_; # This #_ has all the information from the file that I have shown above.
#For matching pattern Rita_bike_house_Sha9
($a, $b) = $name =~ /\w\d+/;
if ($a ne "" and $b ne "" ) { return ($a,$b) }
# this statement doesnot work at all as its first condition
# before the end is not satisified.
Is there any way where I can store "Rita_bike_house" in $a and "Sha9" in $b? I think my regexp is missing with something. Can you suggest anything?
Please don't use the variables $a and $b in your code. There are used by sort and will confuse you.
Try:
while( my $line = <DATA> ){
chomp $line;
if( $line =~ m{ \A ( \w+ ) _ ( [^_]* \d [^_]* ) \z }msx ){
my $first = $1;
my $second = $2;
print "\$a = $first and \$b = $second\n";
}else{
print "\$a = $line and \$b = \"original\"\n";
}
}
__DATA__
Rita_bike_house_Sha9
Rita_bike_house
Not very nice, but the next:
use strict;
use warnings;
while(<DATA>) {
chomp;
next if /^\s*$/;
my #parts = split(/_/);
my $b = pop #parts if $parts[$#parts] =~ /\d/;
$b //= '"original"';
my $a = join('_', #parts);
print "\$a = $a and \$b = $b,\n";
}
__DATA__
Rita_bike_house_Sha9
Rita_bike_house
prints:
$a = Rita_bike_house and $b = Sha9,
$a = Rita_bike_house and $b = "original",
If you are sure that the pattern which is required will always be similar to 'Sha9' and also it will appear at the end then just do a greedy matching....
open FILE, "filename.txt" or die $!;
my #data = <FILE>;
close(<FILE>);
#my $line = "Rita_bike_house_Sha9";
foreach $line (#data)
{
chomp($line);
if ($line =~ m/(.*?)(_([a-zA-Z]+[0-9]+))?$/)
{
$a = $1;
$b = $3 ? $3 : "original";
}
}

Uninitialized variable issue in Perl program

#!/usr/bin/perl
use warnings;
use Scalar::Util qw(looks_like_number);
sub term_value();
sub factor_value();
sub expression_value()
{
$num = #_;
#expression = $_[0];
print "expression[0]: " . $expression[0] . "\n";
$index = $_[$num-1];
print "index: $index\n";
$result = &term_value(#expression, $index);
$more = 1;
while($more)
{
$op = $expression[$index];
print "$op\n";
if ($op eq "+" || $op eq "-")
{
$index++;
$value = &term_value(#expression, $index);
if ($op eq '+')
{
$result = $result + $value;
} else {
$result = $result - $value;
}
}
else
{
$more = 0;
}
}
return $result;
}
sub term_value()
{
$num = #_;
#expression = $_[0];
print "expression[0]: " . $expression[0] . "\n";
$index = $_[$num-1];
print "index: $index\n";
$result = &factor_value(#expression, $index);
$more = 1;
while($more)
{
$op = $expression[$index];
if ($op eq "*" || $op eq "/")
{
$index++;
$value = &factor_value(#expression, $index);
if ($op eq '*')
{
$result = $result * $value;
} else {
$result = $result / $value;
}
} else {
$more = 0;
}
}
return $result;
}
sub factor_value()
{
$num = #_;
#expression = $_[0];
print "expression[0]: " . $expression[0] . "\n";
$index = $_[$num-1];
print "index: $index\n";
$result = 0;
$c = $expression[$index];
if ($c eq '(')
{
$index++;
$result = &expression_value(#expression, $index);
$index++;
} else {
while (looks_like_number($c))
{
$result = 10 * $result + $c - '0';
$index++;
$c = $expression[$index];
}
}
return $result;
}
#Collect argument and separate by character
#one_char = split(//, $ARGV[0]);
$index = 0;
$result = &expression_value(#one_char, $index);
print $result . "\n";
My console returns these warnings:
Use of uninitialized value $op in string eq at eval.pl line 58.
Use of uninitialized value $op in string eq at eval.pl line 58.
Use of uninitialized value $op in string eq at eval.pl line 25.
Use of uninitialized value $op in string eq at eval.pl line 25.
about the $op variable being uninitialized. I'm thinking this may be a scope problem...but I can't figure it out. I've tried everything I could think of (initializing the variable outside of the loop, etc.), but none of it seems to make a difference when running the program. Any suggestions would be greatly appreciated!
You're only using package (~global) variables, which is a huge problem given that you are using recursive functions! Start by adding
use strict;
Primarily, this will identify the variables you haven't declared. Use my to declare them in the appropriate scope.
You're trying to pass arrays to the subs, but you're failing. The only thing that can be passed to a sub is a list of scalars. If you want to pass an array to a sub, you'll need to pass a reference (~pointer) to the array.
sub foo {
my ($expressions, $index) = #_;
print($expressions->[$index], "\n");
}
foo(\#expressions, $index);
This is the reason you're getting the warnings. You are assigning one element to an array (#expression = $_[0]), then you try to index the second or later element.
By using prototype (), you're telling Perl the sub takes no arguments. Then you use & to tell Perl to ignore the prototype so you can pass arguments to your subs. Get rid of both the () after the sub names and & before sub calls.
my $more = 1;
while ($more) {
...
if (cond) {
...
} else {
$more = 0;
}
}
can be reduced to
while (1) {
...
last if !cond;
...
}
Higher Order Perl has a chapter on parsing. See section 8.1.2 for how you would build an expression parser and evaluator from scratch.
You can also take a look at the demo calculator script provided with Parse::RecDescent.
Just out of curiosity, I wanted to see what can be achieved without using parsers. The following script makes a lot of assumptions, but "works" for the simple cases.
#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Common qw(balanced number);
die "Need expression\n" unless #ARGV;
my ($expression) = #ARGV;
my $result = evaluate_expression($expression);
printf(
"'%s' evaluated to %g\n",
$expression, $result
);
my $expected = eval $expression;
unless ($result == $expected) {
die "Wrong result, should have been '$expected'\n";
}
sub evaluate_expression {
my ($expression) = #_;
my $n = qr!$RE{num}{real}!;
my $mul = qr![*/]!;
my $add = qr![+-]!;
my $subexpr = qr!$RE{balanced}{-parens=>'()'}{-keep}!;
1 while
$expression =~ s!
$subexpr
!
my $s = $1;
$s =~ s{(?:^\()|(?:\)\z)}{}g;
evaluate_expression($s)
!gex;
1 while
$expression =~ s!($n) \s* ($mul) \s* ($n)!"$1 $2 $3"!geex;
1 while
$expression =~ s!($n) \s* ($add) \s* ($n)!"$1 $2 $3"!geex;
return $expression;
}
Output:
C:\Temp> z "((1+1)*3 +2)*5"
'((1+1)*3 +2)*5' evaluated to 40
C:\Temp> z "(1+1)*3 + 2*5"
'(1+1)*3 + 2*5' evaluated to 16
But, of course, it's fragile:
C:\Temp> z "2*3+2*5"
'2*3+2*5' evaluated to 610
Wrong result, should have been '16'
As a bit of a corollary to Sinan's answer, here is a "parser" written from the other side of the camel.
use 5.010;
use strict;
use warnings;
my #ops;
use overload map {
my $op = $_;
$op => sub {
my ($x, $y) = #_[$_[2] ? (1, 0) : (0, 1)];
bless [$x, $op, $y]
}
} #ops = qw(+ - / *);
my %ops = map {$_ => eval "sub {\$_[0] $_ \$_[1]}"} #ops;
sub eval {
my $self = shift;
return $$self[0] if #$self == 1;
my ($x, $op, $y) = map {ref eq 'main' ? $_->eval : $_} #$self;
my $ret = $ops{$op}->($x, $y);
say "$ret = $x $op $y";
$ret;
}
BEGIN {overload::constant integer => sub {bless [$_[1]]}}
eval->eval for "#ARGV";
Which when run:
$ perl eval.pl 2*3+2*5
prints:
6 = 2 * 3
10 = 2 * 5
16 = 6 + 10

Perl - How to change every $variable occurrence of ";" in a string

Very new here so be gentle. :)
Here is the jist of what I want to do:
I want to take a string that is made up of numbers separated by semi-colons (ex. 6;7;8;9;1;17;4;5;90) and replace every "X" number of semicolons with a "\n" instead. The "X" number will be defined by the user.
So if:
$string = "6;7;8;9;1;17;4;5;90";
$Nth_number_of_semicolons_to_replace = 3;
The output should be:
6;7;8\n9;1;17\n4;5;90
I've found lots on changing the Nth occurrence of something but I haven't been able to find anything on changing every Nth occurrence of something like I am trying to describe above.
Thanks for all your help!
use List::MoreUtils qw(natatime);
my $input_string = "6;7;8;9;1;17;4;5;90";
my $it = natatime 3, split(";", $input_string);
my $output_string;
while (my #vals = $it->()) {
$output_string .= join(";", #vals)."\n";
}
Here is a quick and dirty answer.
my $input_string = "6;7;8;9;1;17;4;5;90";
my $count = 0;
$input_string =~ s/;/++$count % 3 ? ";" : "\n"/eg;
Don't have time for a full answer now, but this should get you started.
$string = "6;7;8;9;1;17;4;5;90";
$Nth_number_of_semicolons_to_replace = 3;
my $regexp = '(' . ('\d+;' x ($Nth_number_of_semicolons_to_replace - 1)) . '\d+);';
$string =~ s{ $regexp ) ; }{$1\n}xsmg
sub split_x{
my($str,$num,$sep) = #_;
return unless defined $str;
$num ||= 1;
$sep = ';' unless defined $sep;
my #return;
my #tmp = split $sep, $str;
while( #tmp >= $num ){
push #return, join $sep, splice #tmp, 0, $num;
}
push #return, join $sep, #tmp if #tmp;
return #return;
}
print "$_\n" for split_x '6;7;8;9;1;17;4;5;90', 3
print join( ',', split_x( '6;7;8;9;1;17;4;5;90', 3 ) ), "\n";
my $string = "6;7;8;9;1;17;4;5;90";
my $Nth_number_of_semicolons_to_replace = 3;
my $num = $Nth_number_of_semicolons_to_replace - 1;
$string =~ s{ ( (?:[^;]+;){$num} [^;]+ ) ; }{$1\n}gx;
print $string;
prints:
6;7;8
9;1;17
4;5;90
The regex explained:
s{
( # start of capture group 1
(?:[^;]+;){$num} # any number of non ';' characters followed by a ';'
# repeated $num times
[^;]+ # any non ';' characters
) # end of capture group
; # the ';' to replace
}{$1\n}gx; # replace with capture group 1 followed by a new line
If you've got 5.10 or higher, this could do the trick:
#!/usr/bin/perl
use strict;
use warnings;
my $string = '1;2;3;4;5;6;7;8;9;0';
my $n = 3;
my $search = ';.*?' x ($n -1);
print "string before: [$string]\n";
$string =~ s/$search\K;/\n/g;
print "print string after: [$string]\n";
HTH,
Paul

Why does perl "hash of lists" do this?

I have a hash of lists that is not getting populated.
I checked that the block at the end that adds to the hash is in fact being called on input. It should either add a singleton list if the key doesn't exist, or else push to the back of the list (referenced under the right key) if it does.
I understand that the GOTO is ugly, but I've commented it out and it has no effect.
The problem is that when printhits is called, nothing is printed, as if there are no values in the hash. I also tried each (%genomehits), no dice.
THANKS!
#!/usr/bin/perl
use strict;
use warnings;
my $len = 11; # resolution of the peaks
#$ARGV[0] is input file
#$ARGV[1] is call number
# optional -s = spread number from call
# optional -o specify output file name
my $usage = "see arguments";
my $input = shift #ARGV or die $usage;
my $call = shift #ARGV or die $usage;
my $therest = join(" ",#ARGV) . " ";
print "the rest".$therest."\n";
my $spread = 1;
my $output = $input . ".out";
if ($therest =~ /-s\s+(\d+)\s/) {$spread = $1;}
if ($therest =~ /-o\s+(.+)\s/) {$output = $1;}
# initialize master hash
my %genomehits = ();
foreach (split ';', $input) {
my $mygenename = "err_naming";
if ($_ =~ /^(.+)-/) {$mygenename = $1;}
open (INPUT, $_);
my #wiggle = <INPUT>;
&singlegene(\%genomehits, \#wiggle, $mygenename);
close (INPUT);
}
&printhits;
#print %genomehits;
sub printhits {
foreach my $key (%genomehits) {
print "key: $key , values: ";
foreach (#{$genomehits{$key}}) {
print $_ . ";";
}
print "\n";
}
}
sub singlegene {
# let %hash be the mapping hash
# let #mygene be the gene to currently process
# let $mygenename be the name of the gene to currently process
my (%hash) = %{$_[0]};
my (#mygene) = #{$_[1]};
my $mygenename = $_[2];
my $chromosome;
my $leftbound = -2;
my $rightbound = -2;
foreach (#mygene) {
#print "Doing line ". $_ . "\n";
if ($_ =~ "track" or $_ =~ "output" or $_ =~ "#") {next;}
if ($_ =~ "Step") {
if ($_ =~ /chrom=(.+)\s/) {$chromosome = $1;}
if ($_ =~ /span=(\d+)/) {$1 == 1 or die ("don't support span not equal to one, see wig spec")};
$leftbound = -2;
$rightbound = -2;
next;
}
my #line = split /\t/, $_;
my $pos = $line[0];
my $val = $line[-1];
# above threshold for a call
if ($val >= $call) {
# start of range
if ($rightbound != ($pos - 1)) {
$leftbound = $pos;
$rightbound = $pos;
}
# middle of range, increment rightbound
else {
$rightbound = $pos;
}
if (\$_ =~ $mygene[-1]) {goto FORTHELASTONE;}
}
# else reinitialize: not a call
else {
FORTHELASTONE:
# typical case, in an ocean of OFFs
if ($rightbound != ($pos-1)) {
$leftbound = $pos;
}
else {
# register the range
my $range = $rightbound - $leftbound;
for ($spread) {
$leftbound -= $len;
$rightbound += $len;
}
#print $range . "\n";
foreach ($leftbound .. $rightbound) {
my $key = "$chromosome:$_";
if (not defined $hash{$key}) {
$hash{$key} = [$mygenename];
}
else { push #{$hash{$key}}, $mygenename; }
}
}
}
}
}
You are passing a reference to %genomehits to the function singlegene, and then copying it into a new hash when you do my (%hash) = %{$_[0]};. You then add values to %hash which goes away at the end of the function.
To fix it, use the reference directly with arrow notation. E.g.
my $hash = $_[0];
...
$hash->{$key} = yadda yadda;
I think it's this line:
my (%hash) = %{$_[0]};
You're passing in a reference, but this statement is making a copy of your hash. All additions you make in singlegene are then lost when you return.
Leave it as a hash reference and it should work.
PS - Data::Dumper is your friend when large data structures are not behaving as expected. I'd sprinkle a few of these in your code...
use Data::Dumper; print Dumper \%genomehash;