perl, having weird behavior with s/// - perl

I'm embarrassed to ask this because it is so simple, but I can't see what's wrong. I have a routine to clean up input for an ip range. It's sort of brute-force but I don't know of a better way. The problem I am having is when I try to remove inner spaces leaving only a '-' or ',' as separators within a conditional block a single preceding and trailing space is left bracketing the separator. If I clean up the inner spaces outside the conditional block the spaces are properly removed. So, in the sample code if I only have the s/\s+//g on Line 1 it cleans properly, if I only have the s/\s+//g on Lines 2 and 3 spaces are left bracketing the '-' and ','. What the heck is wrong?
use feature qw(say);
use Data::Dumper qw(Dumper);
$input = " 192.168.1.1 198.168.1.254 ";
buildIpRangeArray ($input);
$input = " 192.168.1.1 , 198.168.1.254 ";
buildIpRangeArray ($input);
$input = " 192.168.1.1 - 198.168.1.254 ";
buildIpRangeArray ($input);
sub buildIpRangeArray {
say "input: $input";
$input = shift;
$input =~ s/^\s+//;
$input =~ s/\s+$//;
# $input =~ s/\s+//g; # Line 1.
# Works if this is uncommented
# and lines 2 and 3 are omitted
if ( index($input,' ') >= 0) {
$input =~ s/\s+/ /g; # this works
say "cleaned input 2: $input";
@range = split(/ /,$input);
say Dumper(@range);
}
elsif ( index($input,',') >= 0) {
$input =~ s/\s+//g; # Line 2
say "cleaned input 3: $input";
@range = split(/,/, $input);
say Dumper(@range);
}
elsif ( index($input,'-') >= 0) {
$input =~ s/\s+//g; # Line 3
say "cleaned input 4: $input";
@range = split(/-/, $input);
say Dumper(@range);
}
}
The output:
input: 192.168.1.1 198.168.1.254
cleaned input 2: 192.168.1.1 198.168.1.254
$VAR1 = '192.168.1.1';
$VAR2 = '198.168.1.254';
input: 192.168.1.1 , 198.168.1.254
cleaned input 2: 192.168.1.1 , 198.168.1.254
$VAR1 = '192.168.1.1';
$VAR2 = ',';
$VAR3 = '198.168.1.254';
input: 192.168.1.1 - 198.168.1.254
cleaned input 2: 192.168.1.1 - 198.168.1.254
$VAR1 = '192.168.1.1';
$VAR2 = '-';
$VAR3 = '198.168.1.254';

If you look at your debug output, it's fairly obvious what's going on. Let's take the second block of output with the comma.
input: 192.168.1.1 , 198.168.1.254
cleaned input 2: 192.168.1.1 , 198.168.1.254
$VAR1 = '192.168.1.1';
$VAR2 = ',';
$VAR3 = '198.168.1.254';
After input, there is a check to see if there's a space ' ' in the string. There is, here:
           V
192.168.1.1 , 198.168.1.254
Therefore the code never reaches the elsif for the comma , or the dash -. You can verify that because you always get input 2, never input 3 or input 4.
The next step is cleaning whitespace, which you say yourself works. You replace every run of whitespace with a single space. That leaves the comma in the string, with one space on either side of it. Now you split on a space, giving you
ip
,
ip
Overall, your code is fairly naive. There is a lot of repetition, and you don't have use strict or use warnings, which makes it harder to debug. Depending on how this code is going to be used, I suggest a huge simplification.
sub buildIpRangeArray {
my $input = shift;
say "input: $input";
my @range = grep {$_} split /[^0-9.]+/, $input;
say Dumper @range;
return;
}
We split on lots of characters that can't be in an IP address. This is naive too, as it will not verify you have actual IP addresses, but neither does your code. It will work for any number of whitespace or any delimiters, even if they are text. We need the grep to remove empty strings that occur from leading or trailing whitespace. An empty string "" in Perl evaluates as false, so grep will filter these out.
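As a rough sketch of how that behaves on the three inputs from the question: the version below returns the list instead of printing it (that change, and the use of grep { length } rather than grep {$_}, are my assumptions for illustration, not part of the answer above).

```perl
use strict;
use warnings;

# Sketch only: returns the pieces instead of dumping them.
sub buildIpRangeArray {
    my ($input) = @_;
    # Split on any run of characters that cannot appear in a dotted-quad
    # address, then drop the empty leading field caused by leading whitespace.
    return grep { length } split /[^0-9.]+/, $input;
}

for my $input (
    " 192.168.1.1 198.168.1.254 ",
    " 192.168.1.1 , 198.168.1.254 ",
    " 192.168.1.1 - 198.168.1.254 ",
) {
    my @range = buildIpRangeArray($input);
    print join(' | ', @range), "\n";
}
```

All three inputs should give the same two elements, whatever the separator is.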

Your first condition is always true, as the string always contains spaces.
So you have to write the if/elsif/elsif in another way.
if ( index($input,',') >= 0) {
$input =~ s/\s+//g; # Line 2
say "cleaned input 3: $input";
@range = split(/,/, $input);
say Dumper(@range);
}
elsif ( index($input,'-') >= 0) {
$input =~ s/\s+//g; # Line 3
say "cleaned input 4: $input";
@range = split(/-/, $input);
say Dumper(@range);
}
elsif ( index($input,' ') >= 0) {
$input =~ s/\s+/ /g; # this works
say "cleaned input 2: $input";
@range = split(/ /,$input);
say Dumper(@range);
}
will probably yield the result you want.

Related

How to separate an array in Perl based on pattern

I am trying to write a big script but I am stuck on one part. I want to split an array based on "..".
From the script I got this:
print @coordinates;
gene complement(872..1288)
my desired output:
complement 872 1288
I tried:
1) my @answer = split(.., @coordinates)
print("@answer\n");
2) my @answer = split /../, @coordinates;
3) print +(split /\../)[-1],[-2],[-3] while <@coordinates>
4) foreach my $anwser ( @coordinates )
{$anwser =~ s/../"\t"/;
print $anwser;}
5) my @answer = split(/../, "complement(872..1288)"); #to see if the printed array is problematic.
which prints:
) ) ) ) ) ) ) ) )
6) my @answer = split /"gene "/, @coordinates; # I tried to "catch" the entire output's spaces and tabs
which prints
0000000000000000000000000000000001000000000100000000
But none of them works. Does anyone have any idea how to get past this issue?
Ps, unfortunately, I can't run my script right now on Linux so I used this website to run my script. I hope this is not the reason why I didn't get my desired output.
my $RE_COMPLEMENT = qr{(complement)\((\d+)\.\.(\d+)\)}msx;
for my $item (@coordinates) {
my ($head, $i, $j) = $item =~ $RE_COMPLEMENT;
if (defined($head) && defined($i) && defined($j)) {
print("$head\t$i\t$j\n");
}
}
split operates on a scalar, not on an array.
my $string = 'gene complement(872..1288)';
my @parts = split /\.\./, $string;
print $parts[0]; # gene complement(872
print $parts[1]; # 1288)
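Because split works on one string at a time, an array has to be processed element by element. Here is a minimal sketch of that, assuming @coordinates holds strings like the one in the question (the sample values and the @pairs array are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the asker's @coordinates array.
my @coordinates = ('gene complement(872..1288)', 'gene complement(2000..2500)');

my @pairs;
for my $item (@coordinates) {
    # An unescaped "." matches any character, so the literal ".."
    # separator is written as \.\. here.
    my ($left, $right) = split /\.\./, $item;
    push @pairs, [ $left, $right ];
    print "$left | $right\n";
}
```

Note that the pieces still carry the surrounding text, such as "gene complement(" and the closing parenthesis, so splitting alone does not produce the desired output.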
To get the desired output, you can use a substitution:
my $string = 'gene complement(872..1288)';
$string =~ s/gene +|\)//g;
$string =~ s/\.\./ /;
$string =~ s/\(/ /;
The desired effect can be achieved by using the tr operator to replace each of the characters '(', '.' and ')' with a space, splitting the data string into elements on whitespace, keeping only the required part of the resulting list, and printing those elements joined with tabs.
use strict;
use warnings;
use feature 'say';
my $data = <DATA>;
chomp $data;
$data =~ tr/(.)/ /;
my @elements = (split ' ', $data)[1..3];
say join "\t", @elements;
__DATA__
gene complement(872..1288)
Or as an alternative solution with only substitutions (without splitting data string into array)
use strict;
use warnings;
use feature 'say';
my $data = <DATA>;
chomp $data;
$data =~ s/gene\s+//;
$data =~ s/\)//;
$data =~ s/[(.]+/\t/g;
say $data;
__DATA__
gene complement(872..1288)
Output
complement 872 1288

how to count a repeating string in a line using perl

I have the below file
file1:
abc def host 123 host 869 host
I wrote the script below to count the occurrences of the keyword "host" in each line.
I tried several approaches (see the commented-out lines), but none of them seems to work. The sed command worked on the command line but not inside the Perl script.
#!/usr/bin/perl
open(SOURCE,"</home/amp/surevy01/file1");
open(DESTINATION,"</home/amp/surevy01/file2");
while(my $line = <SOURCE>)
{
while(my $line1 = <DESTINATION>)
{
#chomp($line);
#chomp($line1);
if ($line =~ "host")
{
#my $count = grep {host} $line;
#my $count = `sed -i {s/host/host\n/g} $line1 | grep -c {host}`;
#my $count = `perl -pi -e 's/host/host\n/g' $line1 | grep -c host`;
#my $count grep ("host" ,$line);
print "$count";
print "match found \n";
next;
}
else
{
print "match not found \n";
exit;
}
}
}
I'm a beginner to perl. Looking for your valuable suggestions
Your own solution will match instances like hostages and Shostakovich.
grep is the canonical way to count elements of a list, and split will turn your line into a list of words, giving
my $count = grep { $_ eq 'host' } split ' ', $line;
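To see the difference from a plain substring match, here is a small sketch that wraps the same grep/split expression in a helper. The name count_word and the second test string are assumptions for illustration only.

```perl
use strict;
use warnings;

# count_word is a hypothetical helper name used only for this sketch.
sub count_word {
    my ($line, $word) = @_;
    # grep in scalar context returns the number of matching elements.
    my $count = grep { $_ eq $word } split ' ', $line;
    return $count;
}

print count_word("abc def host 123 host 869 host", "host"), "\n";  # prints 3
print count_word("hostages host Shostakovich", "host"), "\n";      # prints 1
```

Only whitespace-separated words that are exactly equal to the keyword are counted, so a word with punctuation attached, such as "host,", would not be.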
I don't know why you're looping through two files in your example, but you can use the /g (global) flag:
my $line = "abc def host 123 host 869 host";
my $x = 0;
while ($line =~ /host/g){
$x++;
}
print "$x\n"; # 3
When you run a regex with /g in scalar context (as is the conditional in the while statement), it will keep track of the location of the last match and restart from there. Therefore, /host/g in a loop as above will find each occurrence of host. You can also use /g in list context:
my $line = "abc def host 123 host 869 host";
my @matches = $line =~ /host/g;
print scalar @matches; # 3 again
In this case, @matches will contain all matches of the regexp against the string, which will be ('host', 'host', 'host') since the query is a simple string. Then, scalar(@matches) will yield the length of the list.
This produces the number of instances of host in $line:
my $count = () = $line =~ /host/g;
But that also matches hosting. To avoid that, the following will probably do the trick:
my $count = () = $line =~ /\bhost\b/g;
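The =()= construct is informally known as Perl's secret "Goatse" operator. Assigning the match to an empty list forces list context, and assigning that list assignment to a scalar yields the number of elements on its right-hand side.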
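A short sketch comparing the two patterns side by side. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line containing the keyword both alone and inside longer words.
my $line = "host hosting hostages host";

my $substring_count  = () = $line =~ /host/g;      # every occurrence of the substring
my $whole_word_count = () = $line =~ /\bhost\b/g;  # only the standalone word

print "substring matches: $substring_count\n";     # prints 4
print "whole-word matches: $whole_word_count\n";   # prints 2
```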

Ignore the first two lines with ## in perl

Hi all.
I'm a newbie in programming, especially in Perl. I would like to skip the first two lines in my dataset.
This is my code.
while (<PEPTIDELIST>) {
next if $_ !=~ "##";
chomp $_;
@data = split /\t/;
chomp $_;
next if /Sequence/;
chomp $_;
$npeptides++;
# print "debug: 0: $data[0] 1: $data[1] 2: $data[2] 3: $data[3] \n" if ( $debug );
my $pepseq = $data[1];
#print $pepseq."\n";
foreach my $header (keys %sequence) {
#print "looking for $pepseq in $header \n";
if ($sequence{$header} =~ /$pepseq/ ) {
print "matched $pepseq in protein $header" if ( $debug );
# my $in =<STDIN>;
if ( $header =~ /(ENSGALP\S+)\s.+(ENSGALG\S+)/ ) {
print "debug: $1 $2 have the pep = $pepseq \n\n" if ( $debug );
my $lprot = $1;
my $lgene = $2;
$gccount{$lgene}++;
$pccount{$lprot}++;
# print "$1" if($debug);
# print "$2" if ($debug);
print OUT "$pepseq,$1,$2\n";
}
}
}
my $ngenes = keys %gccount;
my $nprots = keys %pccount;
Somehow the peptide is not in the output list. Could you please help point out where it goes wrong?
Thanks.
If you want to skip lines that contain ## anywhere in them:
next if /##/;
If you only want to skip lines that start with ##:
next if /^##/;
If you always want to skip the first two lines, regardless of content:
next if $. < 3;
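next if $_ !=~ "##"; must be next if $_ =~ "##";
That is, skip this line if $_ matches ##. (!=~ is not a match operator: Perl parses it as != followed by ~, so the original line performs a numeric comparison rather than a pattern match.)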
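Here is a minimal sketch of the corrected skip logic in a read loop. The sample data, the in-memory filehandle and the @kept array are assumptions, since the real PEPTIDELIST file is not shown in the question.

```perl
use strict;
use warnings;

# Hypothetical sample input; the real data file is not shown in the question.
my $sample = "## header one\n## header two\nSequence\tScore\nPEPTIDEA\t1\nPEPTIDEB\t2\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @kept;
while (my $line = <$fh>) {
    next if $line =~ /^##/;       # skip lines that start with ##
    next if $line =~ /Sequence/;  # skip the column-header line
    chomp $line;
    my @data = split /\t/, $line;
    push @kept, $data[0];
}
close $fh;

print "$_\n" for @kept;
```

Replacing the first test with next if $. < 3; would skip the first two lines regardless of their content instead.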

How to display user input on one line in a palindrome assignment?

In perl, I have to determine whether user input is a palindrome or not and it must display like this:
Enter in 7 characters: ghghghg #one line here #
Palindrome! #second line answer#
But instead this is what it does:
Enter in 7 characters: g #one line#
h #second line#
g #third line#
h #fourth line#
g #fifth line#
h #sixth line#
g Palindrome! #seventh line#
My problem seems to be in the chomp lines with all the variables, but I just can't figure out what to do and I've been at it for hours. I need a simple solution, but I have not progressed to arrays yet, so I need something simple to fix this. Thanks.
And here is what I have so far. The logic seems to work, but it keeps printing a new line for each character:
use strict;
use warnings;
my ($a, $b, $c, $d, $e, $f, $g);
print "Enter in 7 characters:";
chomp ($a = <>); chomp ($b = <>); chomp ($c = <>); chomp ($d = <>); chomp ($e = <>); chomp ($f = <>); chomp ($g = <>);
if (($a eq $g) && ($b eq $f) && ($c eq $e) && ($d eq $d) && ($e eq $c) && ($f eq $b) && ($g eq $a))
{print "Palindrome! \n";}
else
{print "Not Palindrome! \n";}
If you're going to determine if a word is the same backwards, may I suggest using reverse and lc?
chomp(my $word = <>);
my $reverse = reverse $word;
if (lc($word) eq lc($reverse)) {
print "Palindrome!";
} else {
print "Not palindrome!";
}
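The same idea can be wrapped in a small function so it can be tried on a few words without typing them in. This is only a sketch; the helper name same_backwards and the sample words are assumptions.

```perl
use strict;
use warnings;

# same_backwards is a hypothetical helper name used only for this sketch.
sub same_backwards {
    my ($word) = @_;
    my $folded   = lc $word;         # ignore case
    my $reversed = reverse $folded;  # scalar assignment, so the string itself is reversed
    return $folded eq $reversed ? 1 : 0;
}

for my $word ('ghghghg', 'Level', 'perl') {
    printf "%s: %s\n", $word, same_backwards($word) ? 'Palindrome!' : 'Not palindrome!';
}
```

Reading the whole word with a single chomp(my $word = <>) is what keeps the input on one line; the original code reads seven separate lines.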
Perl is famous for its TIMTOWTDI. Here are two more ways of doing it:
use feature 'say';
print "Enter 7 characters: ";
chomp(my $i= <STDIN>);
say "reverse: ", pal_reverse($i) ? "yes" : "no";
say "regex: ", pal_regex($i) ? "yes" : "no";
sub pal_reverse {
my $i = (@_ ? shift : $_);
return $i eq reverse $i;
}
sub pal_regex {
return (@_ ? shift() : $_) =~ /^(.?|(.)(?1)\2)$/ + 0;
}
use strict;
use warnings;
use feature 'say';
print "Please enter 7 characters : ";
my $input = <>; # Read in input
chomp $input; # To remove trailing "\n"
# Season with input validation
warn 'Expected 7 characters, got ', length $input, ' instead'
unless length $input == 7;
# Determine if it's palindromic or not
say $input eq reverse($input) # Parentheses stop reverse from swallowing the ?: expression
? 'Palindrome'
: 'Not palindrome' ;
TIMTOWTDI for the recursion-prone:
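sub is_palindrome {
my ($string) = @_; # Work on a copy so literal arguments are safe
return 1 if length $string < 2; # Whole string is palindromic
# Strip the first and last characters; not palindromic if they differ
return unless substr($string, 0, 1, '') eq substr($string, -1, 1, '');
@_ = ($string); # Hand the shortened string to the next call
goto &is_palindrome;
}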
say is_palindrome( 'ghghghg' ) ? 'Palindromic' : 'Not palindromic' ;
And perldoc perlretut for those who aren't :)
Recursive patterns
This feature (introduced in Perl 5.10) significantly extends the power
of Perl's pattern matching. By referring to some other capture group
anywhere in the pattern with the construct (?group-ref), the pattern
within the referenced group is used as an independent subpattern in
place of the group reference itself. Because the group reference may
be contained within the group it refers to, it is now possible to
apply pattern matching to tasks that hitherto required a recursive
parser.
To illustrate this feature, we'll design a pattern that matches if a
string contains a palindrome. (This is a word or a sentence that,
while ignoring spaces, interpunctuation and case, reads the same
backwards as forwards. We begin by observing that the empty string or
a string containing just one word character is a palindrome. Otherwise
it must have a word character up front and the same at its end, with
another palindrome in between.
/(?: (\w) (?...Here be a palindrome...) \g{-1} | \w? )/x
Adding \W* at either end to eliminate what is to be ignored, we
already have the full pattern:
my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;
for $s ( "saippuakauppias", "A man, a plan, a canal: Panama!" ){
print "'$s' is a palindrome\n" if $s =~ /$pp/;
}

How can I iterate through nested arrays?

I have created an array as follows
while (defined ($line = <STDIN>))
{
chomp ($line);
push @stack,($line);
}
each line has two numbers.
15 6
2 8
How do I iterate over each item in each line?
i.e. I want to print
15
6
2
8
I understand it's something like
foreach (@{stack}) (@stack){
print "?????
}
This is where I am stuck.
See the perldsc documentation. That's the Perl Data Structures Cookbook, which has examples for dealing with arrays of arrays. From what you're doing though, it doesn't look like you need an array of arrays.
For your problem of taking two numbers per line and outputting one number per line, just turn the whitespace into newlines:
while( <> ) {
s/\s+/\n/g; # turn all whitespace runs into newlines
print; # it's ready to print
}
With Perl 5.10, you can use the new \h character class that matches only horizontal whitespace:
while( <> ) {
s/\h+/\n/g; # turn all horizontal whitespace runs into newlines
print; # it's ready to print
}
As a Perl one-liner, that's just:
% perl -pe 's/\h+/\n/g' file.txt
#!/usr/bin/perl
use strict;
use warnings;
while ( my $data = <DATA> ) {
my @values = split ' ', $data;
print $_, "\n" for @values;
}
__DATA__
15 6
2 8
Output:
C:\Temp> h
15
6
2
8
Alternatively, if you want to store each line in @stack and print it out later:
my @stack = map { [ split ] } grep { chomp; length } <DATA>;
The line above slurps everything coming from the DATA filehandle into a list of lines (because <DATA> happens in list context). The grep chomps each line and filters by length after chomping (to avoid getting any trailing empty lines in the data file; you can skip it if there are none). The map then splits each line on whitespace and creates an anonymous array reference for each line. Finally, those array references are stored as the elements of @stack. You might want to use Data::Dumper to look at @stack to understand what's going on.
print join("\n", @$_), "\n" for @stack;
Now we loop over each entry in @stack, dereferencing each array in turn, then joining the elements of each array with newlines to print one element per line.
Output:
C:\Temp> h
15
6
2
8
The long way of writing essentially the same thing (with less memory consumption) would be:
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, \@values;
}
for my $ref ( @stack ) {
print join("\n", @$ref), "\n";
}
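If you do want to walk the nested structure explicitly, which is what the question title asks, two nested loops will do it. A minimal sketch, assuming the input lines from the question (the @lines array stands in for STDIN or DATA):

```perl
use strict;
use warnings;

# Hypothetical input lines standing in for what would be read from STDIN.
my @lines = ("15 6\n", "2 8\n");

# Build an array of array references, one inner array per input line.
my @stack = map { [ split ' ', $_ ] } @lines;

# The outer loop walks the rows; the inner loop walks the numbers in each row.
for my $row (@stack) {
    for my $number (@$row) {
        print "$number\n";
    }
}
```

Individual items can also be reached by index, for example $stack[1][0] for the first number on the second line.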
Finally, if you wanted to do something other than printing all values, say, sum all the numbers, you should store one value per element of @stack:
use List::Util qw( sum );
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, @values;
}
printf "The sum is %d\n", sum @stack;
#!/usr/bin/perl
while ($line = <STDIN>) {
chomp ($line);
push @stack, $line;
}
# prints each line
foreach $line (@stack) {
print "$line\n";
}
# splits each line into items using ' ' as separator
# and prints the items
foreach $line (@stack) {
@items = split / /, $line;
foreach $item (@items) {
print $item . "\n";
}
}
I use 'for' for "C" style loops, and 'foreach' for iterating over lists.
#!/usr/bin/perl
use strict;
use warnings;
open IN, "< read.txt" or
die "Can't read in 'read.txt'!";
my $content = join '', <IN>;
while ($content =~ m`(\d+)`g) {
print "$1\n";
}