How to extract the words through pattern matching? - perl

#!/usr/bin/perl
use strict;
use warnings;
my $string = "praveen is a good boy";
my @try = split(/([a,e,i,o,u]).*\1/,$string);
print "@try\n";
I am trying to print all the words in a given string that contain 2 adjacent vowels.
Expected output: "praveen" and "good".
I also tried splitting with a negated character class [^...] to get only the 2 adjacent vowels.

The Perl function split isn't a great fit for finding a list of matches. Instead, I would recommend using the regex modifier g. To process all the matches, you can either loop, using e.g. while, or you can assign the list of matches in one go.
The following example should match all words in a string which contain two adjacent vowels:
my $string = "praveen is a good boy";
while ( $string =~ /(\w*[aeiou]{2}\w*)/g ) {
print "$1\n"
}
Output:
praveen
good
You could also do this:
my @matches = ( $string =~ /\w*[aeiou]{2}\w*/g );
and process the result similar to how you were processing @try in the OP.

You could do something like..
#!/usr/bin/perl
use strict;
use warnings;
my $str
= "praveen is a good boy\n"
. "aaron is a good boy\n"
. "praveen and aaron are good, hoot, ho"
;
while ($str =~ /(\w*([aeiou])\2(?:\w*))/g) {
print $1, "\n";
}
Regular expression:
(                 group and capture to \1:
  \w*             word characters (a-z, A-Z, 0-9, _) (0 or more times)
  (               group and capture to \2:
    [aeiou]       any character of: 'a', 'e', 'i', 'o', 'u'
  )               end of \2
  \2              what was matched by capture \2
  (?:             group, but do not capture:
    \w*           word characters (a-z, A-Z, 0-9, _) (0 or more times)
  )               end of grouping
)                 end of \1
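This is close to, but not quite the same as, /(\w*[aeiou]{2}\w*)/. The backreference \2 only matches the same vowel repeated (ee, oo, aa), whereas [aeiou]{2} matches any two adjacent vowels (for example the "oi" in "boil").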
Output:
praveen
good
aaron
good
praveen
aaron
good
hoot
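To make the difference between a backreference and a plain character class concrete, here is a small sketch. The sub names and the sample text are made up for illustration; only the two patterns come from the answers above.

```perl
use strict;
use warnings;

# Hypothetical helper: words containing the SAME vowel twice in a row (ee, oo, aa).
sub doubled_vowel_words {
    my ($text) = @_;
    my @words;
    # $1 is the whole word, $2 the vowel that must repeat.
    while ($text =~ /\b(\w*([aeiou])\2\w*)/g) {
        push @words, $1;
    }
    return @words;
}

# Hypothetical helper: words containing ANY two adjacent vowels.
sub adjacent_vowel_words {
    my ($text) = @_;
    # No capture groups, so list context returns the whole matches.
    return $text =~ /\b\w*[aeiou]{2}\w*/g;
}

my $text = "good boil hoot rain";
print join(",", doubled_vowel_words($text)),  "\n";   # good,hoot
print join(",", adjacent_vowel_words($text)), "\n";   # good,boil,hoot,rain
```

Which one you want depends on whether "2 adjacent vowels" means a doubled vowel or any vowel pair.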

#!/usr/bin/perl
use strict;
use warnings;
my $string = "praveen is a good boy";
my @try = split(/\s/,$string);
for(@try) {
# if(/[a,e,i,o,u]{2}/) {
if(/[aeiou]{2}/) { # edited after Birei's comment
print "$_\n";
};
};
The first argument of split is a delimiter: split splits the string wherever it matches. So split on whitespace to get the words, then test each word.

Related

Perl: Find Nth occurrence in a string and return the sub-string up to this occurrence

I have the below string in which the delimiter is comma ",":
$str = "abc,123,rty,567,89,,90,gg"
I want to first find the Nth occurrence of "," in the above string, let's say I want to find the 5th occurrence.
In $str this is the comma after element 89.
Then I want to get that portion of the string $str which starts from 0 and ends up to this 5th comma, which would be:
"abc,123,rty,567,89"
Please advise how can I do this with Perl.
Thank you
One simple way using split and list slices:
#!/usr/bin/perl
use strict; use warnings; use 5.010;
my $str = "abc,123,rty,567,89,,90,gg";
say $str;
my $to_fifth = join ',', ( split(',', $str) )[0 .. 4] ;
say $to_fifth;
output
abc,123,rty,567,89,,90,gg
abc,123,rty,567,89
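If you would rather locate the Nth delimiter directly instead of splitting and re-joining, a sketch using index and substr could look like this. The sub name prefix_to_nth is hypothetical, and it assumes a single-character delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $str before the $n-th occurrence
# of $sep, or undef if $sep occurs fewer than $n times.
sub prefix_to_nth {
    my ($str, $sep, $n) = @_;
    my $pos = -1;
    for (1 .. $n) {
        # Search for the next delimiter after the previous hit.
        $pos = index($str, $sep, $pos + 1);
        return if $pos < 0;
    }
    return substr($str, 0, $pos);
}

print prefix_to_nth("abc,123,rty,567,89,,90,gg", ",", 5), "\n";   # abc,123,rty,567,89
```

Unlike the slice, this tells you (by returning undef) when the string has fewer than N delimiters.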

Split string into array of words and numbers using Perl

How can I split a string like "aa132bc4253defg18" to get an array of words and numbers:
aa,132,bc,4253,defg,18
I'm using Perl. The lengths of the substrings are variable.
How about:
use Modern::Perl;
use Data::Dump qw(dump);
my $str = "aa132bc4253defg18";
my @l = split(/(?<=\d)(?=\D)|(?<=\D)(?=\d)/, $str);
dump @l;
Output:
("aa", 132, "bc", 4253, "defg", 18)
It splits between a digit \d and a non-digit \D, in either order.
Something like this will do:
split (/(\d+)/, $x);
With a full working example:
use strict; use warnings;
my $x = 'aa132bc4253defg18';
my @y = split /(\d+)/, $x;
print join ",", @y;
The important sections in perldoc split are:
Anything in EXPR that matches PATTERN is taken to be a separator that separates the EXPR into substrings (called "fields") that do not include the separator.
and
If the PATTERN contains capturing groups, then for each separator, an additional field is produced for each substring captured by a group (in the order in which the groups are specified, as per backreferences);
Edit
If the string starts with a number, split will return an empty leading element. These can be eliminated with a grep (testing length rather than truth, so that a field of "0" is kept):
grep { length } split /(\d+)/, $x;
The example becomes
use strict; use warnings;
my $x = '44aa132bc4253defg18';
my @y = grep { length } split /(\d+)/, $x;
print join ",", @y;
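Another way to sidestep the empty leading element is to match the runs instead of splitting on them. This is only a sketch, and the sub name split_runs is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the alternating runs of digits and non-digits.
sub split_runs {
    my ($s) = @_;
    # In list context, a /g match without capture groups returns every match.
    return $s =~ /\d+|\D+/g;
}

print join(",", split_runs("44aa132bc4253defg18")), "\n";   # 44,aa,132,bc,4253,defg,18
```

Because every match is at least one character long, no empty fields are produced, whichever kind of run the string starts with.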

How to split the string 'gi|216ATGCTGATGCTGTG' into the format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '21-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my @strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \@strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of the first grouping of ACTG:
$string =~ s/([ACTG]+)/ $1/;
or to add a tab after the first grouping of digits and dashes:
$string =~ s/([\d-]+)/$1\t/;
Note that this modifies $string in place, and that $1 (rather than \1) is the right way to refer to a capture in the replacement part.
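Putting the pieces together, a minimal sketch that inserts the space only when the whole line has the expected shape might look like this. The sub name space_out is hypothetical, and the pattern assumes the identifier is "gi|" followed by digits and dashes, as in the two sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper: insert one space between the "gi|..." identifier and
# the nucleotide sequence. Lines that do not match are returned unchanged.
sub space_out {
    my ($line) = @_;
    # Work on a copy so the caller's string is not modified.
    (my $copy = $line) =~ s/^(gi\|[\d-]+)([ACGT]+)$/$1 $2/;
    return $copy;
}

print space_out($_), "\n"
    for qw(gi|216CCAACGAAATGATCGCCACACAA gi|21-GCTGGTTCAGCGACCAAAAGTAGC);
```

Anchoring with ^ and $ means malformed lines pass through untouched rather than being split in an unexpected place.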

How to Split on three different delimiters then ucfirst each result[]?

I am trying to figure out how to split a string that has three possible delimiters (or none) without a million lines of code, while keeping the code legible to a guy like me.
Many possible combinations in the string.
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
There are no spaces in the string and none of these characters:
~`!@#$%^&*()+=\][{}|';:"/?>,<.
The string is already stripped of all but:
0-9
a-Z
-
_
.
There are also no sequential dots, dashes or underscores.
I would like the result to be displayed like Result:
This Is The String
I am really having a difficult time trying to get this going.
I believe I will need to use a hash and I just have not grasped the concept even after hours of trial and error.
I am bewildered at the fact I could possibly split a string on multiple delimiters where the delimiters could be in any order AND/OR three different types (or none at all) AND maintain the order of the result!
Any possibilities?
Split the string into words, capitalise the words, then join the words while inserting spaces between them.
It can be coded quite succinctly:
my $clean = join ' ', map ucfirst lc, split /[_.-]+/, $string;
If you just want to print out the result, you can use
use feature qw( say );
say join ' ', map ucfirst lc, split /[_.-]+/, $string;
or
print join ' ', map ucfirst lc, split /[_.-]+/, $string;
print "\n";
It is simple to use a global regular expression to gather all sequences of characters that are not a dot, dash, or underscore.
After that, lc will lower-case each string and ucfirst will capitalise it. Interpolating an array into a double-quoted string inserts spaces between the elements.
for ( qw/ this-is_the.string this.is.the.string this-is_the_string / ) {
my @string = map {ucfirst lc } /[^-_.]+/g;
print "@string\n";
}
output
This Is The String
This Is The String
This Is The String
" the delimiters could be anywhere AND/OR three different types (or none at all)" ... you need a delimiter to split a string, you can define multiple delimiters with a regular expression to the split function
my @parts = split(/[-_\.]/, $string);
print ucfirst "$_ " foreach @parts;
print "\n";
Here's a solution that will work for all but your last test case. It's extremely hard to split a string without delimiters: you'd need a list of possible words, and even then it would be prone to error.
#!/usr/bin/perl
use strict;
use warnings;
my @strings = qw(
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
);
foreach my $string (@strings) {
print join(q{ }, map {ucfirst($_)} split(m{[_.-]}smx,$string)) . qq{\n};
}
And here's an alternative for the loop that splits everything into separate statements to make it easier to read:
foreach my $string (@strings) {
my @words = split m{[_.-]}smx, $string;
my @upper_case_words = map {ucfirst($_)} @words;
my $string_with_spaces = join q{ }, @upper_case_words;
print $string_with_spaces . qq{\n};
}
And to prove that just because you can, doesn't mean you should :P
$string =~ s{([A-Za-z]+)([_.-]*)?}{ucfirst(lc("$1")).($2?' ':'')}ge;
For all but last possibility:
use strict;
use warnings;
# Stop with a useful message if the file cannot be opened
open my $file, '<', 'testfile' or die "Cannot open testfile: $!";
while (<$file>) {
chomp;
my $newline = join ' ', map ucfirst lc, split /[-_\.]/, $_;
print $newline . "\n";
}
close $file;
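The split / map / join pipeline used in the answers above can be wrapped in a small sub so the behaviour on each of the four sample strings is easy to check. This is a sketch; the name title_case is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of '-', '_' or '.', capitalise each
# word, and join the words with single spaces.
sub title_case {
    my ($s) = @_;
    return join ' ', map { ucfirst lc } split /[-_.]+/, $s;
}

print title_case($_), "\n"
    for qw(this-is_the.string this.is.the.string this-is_the_string thisisthestring);
```

The first three strings come out as "This Is The String". The last one has no delimiters, so it stays a single word, "Thisisthestring".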

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my @array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?
use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my @params = ();
while ($string) {
if ($string =~ /^([^(]*?),/) {
push @params, $1;
$string =~ s/^\Q$1\E\s*,?\s*//;
} else {
my ($ext, $pre);
($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
push @params, "$pre$ext";
$string =~ s/^\s*,\s*//;
}
}
This one supports:
nested parentheses;
empty fields;
strings of any length.
Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that it will (unfortunately) not return an empty field for two adjacent commas (,,).
Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in Perl regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my @array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.
If the parenthesised field is always the last one and the number of fields is known, limit the number of elements it can be split into:
split(',', $string, 4)
Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(
[^,]*\([^)]*\) # comma inside parens is part of the word
|
[^,]*) # split on comma outside parens
(?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)
Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
my ($string) = @_;
my @fields;
my @comma_separated = split(/,/, $string);
my @to_be_joined;
my $depth = 0;
foreach my $field (@comma_separated) {
my @brackets = $field =~ /(\(|\))/g;
foreach (@brackets) {
$depth++ if /\(/;
$depth-- if /\)/;
}
if ($depth == 0) {
push @fields, join(",", @to_be_joined, $field);
@to_be_joined = ();
} else {
push @to_be_joined, $field;
}
}
return @fields;
}
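For comparison, the same depth-counting idea can be written as a single pass over the characters, which also keeps empty fields. This is a sketch under the assumption that parentheses are balanced; the sub name split_top_level is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit at parenthesis depth 0,
# scanning the string one character at a time.
sub split_top_level {
    my ($s) = @_;
    my @fields = ('');
    my $depth  = 0;
    for my $ch (split //, $s) {
        if ($ch eq ',' && $depth == 0) {
            push @fields, '';            # a top-level comma starts a new field
            next;
        }
        $depth++ if $ch eq '(';
        $depth-- if $ch eq ')' && $depth > 0;
        $fields[-1] .= $ch;              # everything else belongs to the current field
    }
    return @fields;
}

print "$_\n" for split_top_level('ab,12,20100401,xyz(A(2,3),B)');
```

For the sample string this prints ab, 12, 20100401 and xyz(A(2,3),B) on separate lines. An input such as 'ab,,xyz(A,B)' yields an empty middle field instead of dropping it.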