Matching in Perl

I am trying to get text in between two dots of a line, but my program returns the entire line.
For example: I have text which looks like:
My sampledata 1,2 for perl .version 1_1.
I used the following match statement
$x =~ m/(\.)(.*)(\.)/;
My output for $x should be version 1_1, but I am getting the entire line as my match.

In your code, the value of $x will not change after the match.
When $x is successfully matched against m/(\.)(.*)(\.)/, the three capture groups contain '.', 'version 1_1' and '.' respectively (in the order given), so $2 gives you 'version 1_1'.
Since you probably only want the part 'version 1_1', you need not capture the two dots. This code gives you the same result:
$x =~ m/\.(.*)\./;
print $1;

Try this:
my $str = "My sampledata 1,2 for perl .version 1_1.";
$str =~ /\.\K[^.]+(?=\.)/;
print $&;
The period must be escaped outside a character class.
\K discards everything matched so far from the reported match (you can replace it with a lookbehind, (?<=\.)).
[^.] means any character except a period.
For several results, you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.\K[^.]+(?=\.)/g);
print join("\n", @matches);
If you don't want the same period to close one match and open the next, you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.([^.]+)\./g);
print join("\n", @matches)."\n";
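To make the difference between the two patterns concrete, here is a minimal sketch. The helper names are hypothetical and only wrap the two regexes shown above; \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: the lookahead leaves the closing period unconsumed,
# so the same period can also open the next field.
sub fields_sharing_dots {
    my ($str) = @_;
    return $str =~ /\.\K[^.]+(?=\.)/g;
}

# Hypothetical helper: both periods are consumed, so fields are taken
# from disjoint pairs of periods.
sub fields_pairing_dots {
    my ($str) = @_;
    return $str =~ /\.([^.]+)\./g;
}

my $str = "qwerty .target 1.target 2.target 3.";
print join(", ", fields_sharing_dots($str)), "\n";   # target 1, target 2, target 3
print join(", ", fields_pairing_dots($str)), "\n";   # target 1, target 3
```

With the consuming pattern, "target 2" is skipped because the period in front of it was already used to close "target 1".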

It should be simple enough to do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my @tests = (
"test one. get some stuff. extra",
"stuff with only one dot.",
"another test line.capture this. whatever",
"last test . some data you want.",
"stuff with only no dots",
);
for my $test (@tests) {
# For this example, I skip $test if the match fails,
# otherwise, I move on to do stuff with $want
next if $test !~ /\.(.*)\./;
my $want = $1;
print "got: $want\n";
}
Output
$ ./test.pl
got: get some stuff
got: capture this
got: some data you want

Related

How to remove array's newlines and add an element at the beginning of it in Perl?

First of all, I apologize for editing my initial post; after I added my code, the question became unclear.
I have an array (@start_cod) containing lines separated by \n, as follows:
print @start_cod;
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
I need to remove the newlines and add ">text" ONLY at the beginning of the array, as follows:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
I tried:
s/\s+\z// for @start_cod;
print ">text@start_cod";
I also tried chomp:
chomp @start_cod;
print ">text@start_cod";
and
my @start_cod = split("\n",$start_cod);
$start_cod = join("",@start_cod);
print ">text$start_cod";
but I get
aaaaaaaaaaaaaaaaaaa>textcacacacacaacaccacaac>textaattatatattataattatatttat
Any suggestions on how to handle this in Perl Programming?
Here is my code which works 100%.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my %alliloux =();
$/="\n>";
while (<>) {
s/>//g;
my ($onoma, @seq) = split (/\n/, $_);
my ($sp, $head) = split (/\./, $onoma);
push @{ $alliloux{$sp} }, join "\n", ">$onoma", @seq;
}
foreach my $sp (keys %alliloux) {
chomp $sp;
my ($head, $dna) = split(/\t/, $sp);
my @start_cod = substr($dna, 3);
say @start_cod;
}
Input file:
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
>namex aattatatattataattatatttat
output after Perl run
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
Desired output:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
If I understand your question correctly, this should do what you want:
use strict;
use warnings;
my @start_cod = (
'aaaaaaaaaaaaaaaaaa',
'acacacacacaacaccacaac',
'aattatatattataattatatttat',
);
print ">text\n", @start_cod, "\n";
The print first prints ">text" and a newline once, then the @start_cod items run together on one line, and the final "\n" makes sure you have a newline after the last element.
Output:
>text
aaaaaaaaaaaaaaaaaaacacacacacaacaccacaacaattatatattataattatatttat
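If the array elements still carry their trailing newlines, as in the question, a minimal sketch along these lines strips them before printing. The helper name single_record is hypothetical, and the sample data is copied from the question's output.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the line ending from each chunk and build
# one record consisting of a header line and a single sequence line.
sub single_record {
    my ($label, @chunks) = @_;
    chomp @chunks;    # removes a trailing newline from every element
    return ">$label\n" . join('', @chunks) . "\n";
}

my @start_cod = (
    "tatatattataattatatttat\n",
    "cacacacaacaccacaac\n",
    "aaaaaaaaaaaaaaa\n",
);
print single_record('text', @start_cod);
```

Because chomp works on the copies inside the sub, the caller's array is left untouched.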
You might want to see Read FASTA into Hash. It's the same problem and very close to the code I wrote before I read it. Also, there are modules on CPAN that can handle FASTA.
I think you want to combine the sequences that start with the same name, disregarding the numbers. The sequences shouldn't have interior whitespace. In your code, you are constantly adding whitespace. You even join on a newline. So, you go to the doctor and say "My arm hurts when I do this", and the doctor says "So don't do that". :)
When you run into this sort of problem, check the results of your operations at each step to see if you get what you expect. Here's a much simplified version of a program that I think does what you want. I've removed most of the data structures because they complicate your process.
In short, read a line and remove the newline at the end. That's one source of your newlines. Then, extract the sequence and concatenate that to the previous sequence. When you join with newlines, you are adding newlines. So, don't do that:
use v5.14;
use warnings;
use Data::Dumper;
my %alliloux = ();
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
# now split on whitespace, but only up to two parts.
# There's no array here.
my( $name, $seq ) = split /\s+/, $_, 2;
# remove the numbers at the end to get the prefix of the
# name.
my $prefix = $name =~ s/\d+\z//r;
# append the current sequence for this prefix to what we
# have already seen.
$alliloux{$prefix} .= $seq;
}
say Dumper( \%alliloux );
foreach my $base ( keys %alliloux ) {
say ">text $alliloux{$base}";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
You don't need the intermediate array. You can build up your string as you go. You don't need to have all the parts before you do that.
Now, to figure out where you might be going wrong, do a little at a time. Ensure that you've extracted the right thing. It's handy to put characters around the variables you interpolate so you can see whitespace at the beginning or end:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
}
Then, add another step, and ensure that works:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>";
my $prefix = $name =~ s/\d+\z//r;
say "Prefix: <$prefix>";
}
Repeat this process for each step. Then, when you come with a question, you've pinpointed the point where things diverge. Here's the same technique in your program:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
s/>//g;
my ($onoma, @seq) = split (/\n/, $_);
say "Onoma: <$onoma>";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
The output shows that you never had anything in @seq. You are splitting on a newline, but unless you've changed the default line ending, you'll only get a newline at the end:
Onoma: <name aaa>
Onoma: <name2 cccc>
Onoma: <name99 aattaatt>
Now there's nothing in @seq, so a line like join "\n", ">$onoma", @seq; is really just join "\n", ">$onoma". You could have seen that with a little checking.
The description of the problem is unclear.
Looking at the desired output, the following code comes to mind. Please see if it does what you were looking for.
Even from your code it is not clear what you are trying to do; some parts of it do not make much sense.
use strict;
use warnings;
use feature 'say';
my @start_cod;
while( <DATA> ) {
chomp;
next unless />\s?name.?\s+(.*)/;
push @start_cod, $1;
}
print ">text\n" . join('',@start_cod);
__DATA__
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
.
.
.
> namex aattatatattataattatatttat

Perl: Final Output Line in Foreach Loop Prints Twice

I'm trying to write a very simple script that takes two words from STDIN and outputs TRUE if they're anagrams and FALSE if not. My main issue is that if the two words aren't anagrams (this is the final "else" statement in the script), the output looks like:
Sorry, that's not an anagram pair
Sorry, that's not an anagram pair
where I just want:
Sorry, that's not an anagram pair
Other, more minor issues for the especially generous:
I know what the FALSE values are for Perl, but I can't get the script to print FALSE by, for example, setting a variable to '' or 0, etc. or saying "return ''". Ideally, I wouldn't have to put "print TRUE/FALSE" in the script at all.
I put in the last elsif statement in the script to see if it would affect the printing twice problem. It didn't, and now I'm curious why my m// expression doesn't work. It's supposed to find pairs that are identical except that one has more whitespace than the other.
Here's the script! I'm sorry it's so long - again, the problem is at the very end with the final "else" statement. Many thanks!!!
#To Run: Type start.pl on the command line.
#The script should prompt you to enter a word or phrase.
#Once you've done that, it'll prompt you for another one.
#Then you will be told if the two terms are anagrams or not.
#!/usr/bin/perl -w
use strict;
#I have to use this to make STDIN work. IDK why.
$|=1;
#variables
my $aWord;
my $bWord;
my $word;
my $sortWord;
my $sortWords;
my @words;
my %anaHash;
print "\nReady to play the anagram game? Excellent.\n\nType your first word or phrase, then hit Enter.\n\n";
$aWord = <STDIN>;
chomp $aWord;
print "\n\nThanks! Now type your second word or phrase and hit Enter.\n\n";
$bWord = <STDIN>;
chomp $bWord;
#This foreach loop performs the following tasks:
#1. Pushes the two words from STDIN into an array (unsure if this is really necessary)
#2. lowercases everything and removes all characters except for letters & spaces
#3. splits both words into characters, sorts them alphabetically, then joins the sorted letters into a single "word"
#4.pushes the array into a hash
@words = ($bWord, $aWord);
foreach $word (@words) {
$word =~ tr/A-Z/a-z/;
$word =~ s/[^a-z ]//ig;
$sortWord = join '', sort(split(//, $word));
push @{$anaHash{$sortWord}}, $word;
}
#This foreach loop tries to determine if the word pairs are anagrams or not.
foreach $sortWords (values %anaHash) {
#"if you see the same word twice AND the input was two identical words:"
if (1 < @$sortWords &&
@$sortWords[0] eq @$sortWords[1]) {
print "\n\nFALSE: Your phrases are identical!\n\n";
}
#"if you see the same word twice AND the input was two different words (i.e. a real anagram):"
elsif (1 < @$sortWords &&
@$sortWords[0] ne @$sortWords[1]) {
print "\n\nTRUE: @$sortWords[0] and @$sortWords[1] are anagrams!\n\n";
}
#this is a failed attempt to identify pairs that are identical except one has extra spaces. Right now, this fails and falls into the "else" category below.
elsif (@$sortWords[0] =~ m/ +@$sortWords[-1]/ ||
@$sortWords[-1] =~ m/ +@$sortWords[0]/) {
print "\n\nFALSE: @$sortWords[0] and @$sortWords[-1] are NOT anagrams. Spaces are characters, too!\n\n";
}
#This is supposed to identify anything that's not an anagram. But the output prints twice! It's maddening!!!!
else {
print "Sorry, that's not an anagram pair\n";
}
}
It's useful to print out the contents of %anaHash after you've finished building it, but before you start examining it. Using the words "foo" and "bar", I get this result using Data::Dumper.
$VAR1 = {
'abr' => [
'bar'
],
'foo' => [
'foo'
]
};
So the hash has two keys. And as you loop round all of the keys in the hash, you'll get the message twice (once for each key).
I'm not really sure what the hash is for here. I don't think it's necessary. I think that you need to:
Read in the two words
Convert the words to a canonical format
Check if the two strings are the same
Simplified, your code would look like this:
print 'Give me a word: ';
chomp(my $word1 = <STDIN>);
print 'Give me another word: ';
chomp(my $word2 = <STDIN>);
# convert to lower case
$word1 = lc $word1;
$word2 = lc $word2;
# remove non-letters
$word1 =~ s/[^a-z]//g;
$word2 =~ s/[^a-z]//g;
# sort letters
$word1 = join '', sort split //, $word1;
$word2 = join '', sort split //, $word2;
if ($word1 eq $word2) {
# you have an anagram
} else {
# you don't
}
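The three steps above can also be wrapped in a function that returns a true or false value, which speaks to the question about not hard-coding the printed result. This is only a sketch; the names canonical and is_anagram are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case the input, drop everything that is not
# a letter, and return the remaining letters in sorted order.
sub canonical {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/[^a-z]//g;
    return join '', sort(split(//, $word));
}

# Returns 1 when both inputs reduce to the same sorted letters, else 0.
sub is_anagram {
    my ($first, $second) = @_;
    return canonical($first) eq canonical($second) ? 1 : 0;
}

for my $pair (['Listen', 'Silent'], ['foo', 'bar']) {
    my $verdict = is_anagram(@$pair) ? 'TRUE' : 'FALSE';
    print "$pair->[0] / $pair->[1]: $verdict\n";
}
```

Note that two identical inputs also count as anagrams here; if that should be reported as FALSE, compare the cleaned-up words before sorting them.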
My final answer, thanks so much to Dave and @zdim! I'm so happy I could die.
#!/usr/bin/perl -w
use strict;
use feature qw(say);
#I have to use this to make STDIN work. IDK why.
$|=1;
#declare variables below
my ($aWord, $bWord, $aSortWord, $bSortWord);
print "First word?\n";
$aWord = <STDIN>;
chomp $aWord;
print "Second word?\n";
$bWord = <STDIN>;
chomp $bWord;
#clean up input
$aWord =~ tr/A-Z/a-z/;
$bWord =~ tr/A-Z/a-z/;
$aWord =~ s/[^a-z ]//ig;
$bWord =~ s/[^a-z ]//ig;
#if the two inputs are identical, print FALSE and exit
if ($aWord eq $bWord) {
say "\n\nFALSE: Your phrases are identical!\n";
exit;
}
#split each word by character, sort characters alphabetically, join characters
$aSortWord = join '', sort(split(//, $aWord));
$bSortWord = join '', sort(split(//, $bWord));
#if the sorted characters match, you have an anagram
#if not, you don't
if ($aSortWord eq $bSortWord) {
say "\n\nTRUE: Your two terms are anagrams!";
}
else {
say "\n\nFALSE: Your two terms are not anagrams.";
}

perl Grouping things and hierarchical matching

I've been testing the Perl regex code from the perlrequick section on Grouping things and hierarchical matching.
This is my Perl code:
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
The output is cat, but according to the documentation it should be housecat:
/house(cat|)/; # matches either 'housecat' or 'house'
What is wrong? Is there something amiss?
What you're doing with this code
my $t = "housecats";
my ($m) = $t =~ m/house(cat|)/;
print $m;
is copying the first capture into $m. Parentheses () in the pattern indicate which parts of the matching string to capture and store in the built-in variables $1, $2, etc. You can have as many captures as you like, and they are numbered in the order in which their opening parentheses appear in the pattern.
What perlrequick is talking about is what constitutes a successful match. Normally you would write
my $t = "housecats";
my $success = $t =~ m/house(cat|)/;
print $success ? "matched\n" : "no match\n";
This code produces
matched
as the document describes. If you set $t to housemartin then the result is the same because the regex pattern successfully finds house. But if $t is hosepipe then we see no match because the string contains neither house nor housecat
If you need to extract parts of the matched string then you must use captures as described above. You can access the whole string that was matched through the built-in variable $&, but doing so causes a significant performance penalty in Perl versions before 5.20. For backward compatibility you should simply capture the whole pattern by writing
my $t = "housecats";
my ($m) = $t =~ m/(house(cat|))/;
print $m;
which produces housecat as you expected. It also sets the values of $1 and $2 to housecat and cat respectively
You probably misunderstood the comment. It means that
for my $t (qw( housecats house )) {
my ($m) = $t =~ /house(cat|)/;
print "[$m]\n";
}
will print
[cat]
[]
i.e. the regex will match both housecat and house. If the pattern didn't match at all then $m would be undef
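A small sketch that separates the three outcomes (non-empty capture, empty capture, failed match). The helper name first_capture is hypothetical; the pattern is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture group, or undef when the
# pattern does not match at all (a failed match returns an empty list).
sub first_capture {
    my ($text) = @_;
    my ($m) = $text =~ /house(cat|)/;
    return $m;
}

for my $text (qw( housecats house hosepipe )) {
    my $m = first_capture($text);
    my $shown = defined($m) ? "[$m]" : "no match";
    print "$text: $shown\n";
}
```

An empty string and undef are both false, so use defined when you need to tell "matched with an empty capture" apart from "did not match".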
my $t = "housecats";
my ($m) = $t=~m/house(cat|)/gn;
print $m;

Split, insert and join

Here's what I want to achieve. I want to split a comma-separated one-liner, append @domain.com to each item, then join it back into a comma-separated string.
The one-liner contains something like:
username1,username2,username3
and I want it to become something like:
username1@domain.com,username2@domain.com,username3@domain.com
Here is the Perl script I tried, which does not work properly:
my $var ='username1,username2,username3';
my @tkens = split /,/, $var;
my @user;
foreach my $tken (@tkens) {
push (@user, "$tken\@domain.com");
}
my $to = join(',',@user);
Is there a shortcut for this in Perl? Please post a sample. Thanks
Split, transform, stitch:
my $var ='username1,username2,username3';
print join ",", map { "$_\@domain.com" } split(",", $var);
# ==> username1@domain.com,username2@domain.com,username3@domain.com
You could also use a regular expression substitution:
#!/usr/bin/perl
use strict;
use warnings;
my $var = "username1,username2,username3";
# Replace every comma (and the end of the string) with @domain.com followed by a comma
$var =~ s/$|,/\@domain.com,/g;
# Remove extra comma after last item
chop $var;
print "$var\n";
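As a variation on the substitution above, a zero-width lookahead can insert the suffix without producing a trailing comma, so no chop is needed. This is only a sketch; the helper name append_domain and its domain parameter are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: append '@' . $domain to every comma-separated item.
sub append_domain {
    my ($list, $domain) = @_;
    my $suffix = '@' . $domain;
    # Insert the suffix at each position that is followed by a comma
    # or by the end of the string; nothing is removed from the input.
    (my $result = $list) =~ s/(?=,|\z)/$suffix/g;
    return $result;
}

print append_domain('username1,username2,username3', 'domain.com'), "\n";
```

The input string itself is not modified because the substitution runs on a copy.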
You already have good answers, so here I am just explaining why your script is not working. I don't see any print or say line in your code, so nothing is ever output. There is also no need to store the joined string as in your last line. You can simply append @domain.com to each value, push it onto an array, and print the array with join.
#!/usr/bin/perl
use strict;
use warnings;
my $var = 'username1,username2,username3';
my @tkens = split ',', $var;
my @user;
foreach my $tken (@tkens)
{
push @user, $tken."\@domain.com"; # `.` after `$tken` for concatenation
}
print join(',', @user), "\n";
Output:
username1@domain.com,username2@domain.com,username3@domain.com

How to split this string 'gi|216ATGCTGATGCTGTG' into this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '21-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my @strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \@strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ $1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/$1\t/;
Note that this modifies $string in place.
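If you would rather keep the original string and get the two parts back, a capture-based sketch could look like this. The helper name split_gi is hypothetical, and it assumes the identifier is always gi| followed by digits and hyphens, as in the sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return (identifier, sequence) for one line,
# or an empty list when the line does not have the expected shape.
sub split_gi {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
    return $line =~ /^(gi\|[\d-]+)([ACGT]+)$/ ? ($1, $2) : ();
}

for my $line ('gi|216CCAACGAAATGATCGCCACACAA', 'gi|21-GCTGGTTCAGCGACCAAAAGTAGC') {
    my ($id, $seq) = split_gi($line);
    print "$id $seq\n";
}
```

Checking for the empty list lets the caller skip or report malformed lines instead of printing undefined values.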