Split() on newline AND space characters? - perl

I want to split() a string on both newlines and space characters:
#!/usr/bin/perl
use warnings;
use strict;
my $str = "aa bb cc\ndd ee ff";
my @arr = split(/\s\n/, $str); # Split on ' ' and '\n'
print join("\n", @arr); # Print array, one element per line
Output is this:
aa bb cc
dd ee ff
But, what I want is this:
aa
bb
cc
dd
ee
ff
So my code is splitting on the newline (good) but not the spaces. According to perldoc, whitespace should be matched with \s in a character class, and I would have assumed that a space is whitespace. Am I missing something?

You are splitting on a whitespace character followed by a line feed. To split when either one is encountered, there's
split /[\s\n]/, $str
But \s includes \n, so this can be simplified.
split /\s/, $str
But what if you have two spaces in a row? You could split when a sequence of whitespace is encountered.
split /\s+/, $str
There's a special pattern you can provide, a single-space string, which does the same thing except that it ignores leading whitespace.
split ' ', $str
So,
use v5.14;
use warnings;
my $str = "aa bb cc\ndd ee ff";
my @arr = split ' ', $str;
say for @arr;
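To make the leading-whitespace difference concrete, here is a small sketch. The input string with two leading spaces is made up for illustration and is not from the question.

```perl
use strict;
use warnings;

# Hypothetical input: note the two leading spaces
my $str = "  aa bb\ncc";

# /\s+/ keeps an empty leading field when the string starts with whitespace
my @by_regex = split /\s+/, $str;

# ' ' skips leading whitespace before splitting
my @by_space = split ' ', $str;

print scalar(@by_regex), " fields with /\\s+/\n";   # 4 fields: "", aa, bb, cc
print scalar(@by_space), " fields with ' '\n";      # 3 fields: aa, bb, cc
```

If the input can start with whitespace, the ' ' form saves you from filtering out the empty first field afterwards.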

my code is splitting on the newline (good)
Your code is not splitting on newline; it only seems that way due to how you are printing things. Your array contains one element, not two. The element has a newline in the middle of it, and you are simply printing aa bb cc\ndd ee ff.
\s\n means: any whitespace followed by newline, where whitespace actually includes \n.
Change:
my @arr = split(/\s\n/, $str);
to:
my @arr = split(/\s/, $str);
Using Data::Dumper makes it clear that the array now has 6 elements:
use warnings;
use strict;
use Data::Dumper;
my $str = "aa bb cc\ndd ee ff";
my @arr = split(/\s/, $str);
print Dumper(\@arr);
Prints:
$VAR1 = [
'aa',
'bb',
'cc',
'dd',
'ee',
'ff'
];
The above code works on the input string you provided. It is also common to split on a run of consecutive whitespace characters using:
my @arr = split(/\s+/, $str);

Your question comes from an incorrect analysis of the outcome of your code. You think you have split on newline, when you have not actually split anything at all and are in fact just printing a newline.
If you want to avoid this mistake in the future, and know exactly what your variables contain, you can use the core module Data::Dumper:
use strict;
use warnings;
use Data::Dumper;
my $str = "aa bb cc\ndd ee ff";
my @arr = split(/\s\n/, $str); # split on whitespace followed by newline
$Data::Dumper::Useqq = 1; # show exactly what is printed
print Dumper \@arr; # using Data::Dumper
Output:
$VAR1 = [
"aa bb cc\ndd ee ff"
];
As you would easily be able to tell, you are not printing an array at all, just a single scalar value (inside an array, because you put it there). Data::Dumper is an excellent tool for debugging your data, and a valuable tool for you to learn.

Related

Split string into array of words and numbers using Perl

How can I split a string like "aa132bc4253defg18" to get an array of words and numbers:
aa,132,bc,4253,defg,18
I'm using Perl. The lengths of the substrings are variable.
How about:
use Modern::Perl;
use Data::Dump qw(dump);
my $str = "aa132bc4253defg18";
my @l = split(/(?<=\d)(?=\D)|(?<=\D)(?=\d)/, $str);
dump @l;
Output:
("aa", 132, "bc", 4253, "defg", 18)
It splits between a digit \d and a non-digit \D, in either order.
Something like this will do:
split (/(\d+)/, $x);
With a full working example:
use strict; use warnings;
my $x = 'aa132bc4253defg18';
my @y = split /(\d+)/, $x;
print join ",", @y;
The important sections in perldoc split are:
Anything in EXPR that matches PATTERN is taken to be a separator that separates the EXPR into substrings (called "fields") that do not include the separator.
and
If the PATTERN contains capturing groups, then for each separator, an additional field is produced for each substring captured by a group (in the order in which the groups are specified, as per backreferences);
Edit
If the string starts with a number, split will return an empty leading element. These can be eliminated with a grep:
grep { length } split /(\d+)/, $x;
The example becomes
use strict; use warnings;
my $x = '44aa132bc4253defg18';
my @y = grep { length } split /(\d+)/, $x; # length rather than $_, so a field of "0" is kept
print join ",", @y;
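An alternative sketch, not taken from the answers above: instead of splitting, match the runs of digits and non-digits directly with a global match. This produces no empty fields, so no grep is needed. The variable names are illustrative.

```perl
use strict;
use warnings;

my $x = '44aa132bc4253defg18';

# In list context, /g returns every match: each run of digits or of non-digits
my @tokens = $x =~ /\d+|\D+/g;

print join(",", @tokens), "\n";   # 44,aa,132,bc,4253,defg,18
```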

Perl split every n characters and new lines

I'm new to perl. I know I can split some constant number of characters via unpack or using regexes.
But is there some standard way to split every n characters and new lines?
Here's the string I'm looking to split:
my $str="hello\nworld";
my $num_split_chars=2;
Perhaps the following will be helpful:
use strict;
use warnings;
use Data::Dumper;
my $str = "hello\nworld";
my $num_split_chars = 2;
$num_split_chars--;
my @arr = $str =~ /.{$num_split_chars}.?/g;
print Dumper \@arr;
Output:
$VAR1 = [
'he',
'll',
'o',
'wo',
'rl',
'd'
];
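Since the question mentions unpack, here is a hedged sketch of that route: split on newlines first, then cut each line into fixed-width chunks. It assumes the chunks have no trailing spaces that need preserving, because the A format strips them (the a format would keep them).

```perl
use strict;
use warnings;

my $str = "hello\nworld";
my $num_split_chars = 2;

# "(A2)*" repeats a 2-character field until the line is used up;
# the last chunk of a line may be shorter
my @arr = map { unpack "(A$num_split_chars)*", $_ } split /\n/, $str;

print join(",", @arr), "\n";   # he,ll,o,wo,rl,d
```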

How to extract the words through pattern matching?

#!/usr/bin/perl
use strict;
use warnings;
my $string = "praveen is a good boy";
my @try = split(/([a,e,i,o,u]).*\1/,$string);
print "@try\n";
I am trying to print all words containing 2 adjacent vowels in a given string.
The output has to be "praveen" and "good".
I tried using the negated character class [^] with split to get only the words with 2 adjacent vowels.
The Perl function split isn't a great fit for finding a list of matches. Instead, I would recommend using the regex modifier g. To process all the matches, you can either loop, using e.g. while, or you can assign the list of matches in one go.
The following example should match all words in a string which contain two adjacent vowels:
my $string = "praveen is a good boy";
while ( $string =~ /(\w*[aeiou]{2}\w*)/g ) {
print "$1\n"
}
Output:
praveen
good
You could also do this:
my #matches = ( $string =~ /\w*[aeiou]{2}\w*/g );
and process the result similar to how you were processing @try in the OP.
You could do something like..
#!/usr/bin/perl
use strict;
use warnings;
my $str
= "praveen is a good boy\n"
. "aaron is a good boy\n"
. "praveen and aaron are good, hoot, ho"
;
while ($str =~ /(\w*([aeiou])\2(?:\w*))/g) {
print $1, "\n";
}
Regular expression:
( group and capture to \1:
\w* word characters (a-z, A-Z, 0-9, _) (0 or more times)
( group and capture to \2:
[aeiou] any character of: 'a', 'e', 'i', 'o', 'u'
) end of \2
\2 what was matched by capture \2
(?: group, but do not capture:
\w* word characters (a-z, A-Z, 0-9, _) (0 or more times)
) end of grouping
) end of \1
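For this input that gives the same result as /(\w*([aeiou])[aeiou]+(?:\w*))/. The two are not identical in general: the backreference \2 only matches a doubled vowel (such as "ee"), while [aeiou]+ accepts any vowel after the first (such as "ea").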
Output:
praveen
good
aaron
good
praveen
aaron
good
hoot
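A small sketch of how the backreference version differs from a plain vowel class. The sample words are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample text
my $text = "good beat hoot quiet";

# \2 repeats the captured vowel, so only doubled vowels match
my @doubled;
while ($text =~ /(\w*([aeiou])\2\w*)/g) {
    push @doubled, $1;
}

# a vowel class accepts any two adjacent vowels
my @adjacent = $text =~ /\w*[aeiou]{2}\w*/g;

print "doubled:  @doubled\n";    # good hoot
print "adjacent: @adjacent\n";   # good beat hoot quiet
```

Which one you want depends on whether "two adjacent vowels" means the same vowel twice or any two vowels in a row.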
#!/usr/bin/perl
use strict;
use warnings;
my $string = "praveen is a good boy";
my @try = split(/\s/,$string);
for(@try) {
# if(/[a,e,i,o,u]{2}/) {
if(/[aeiou]{2}/) { # edited after Birei's comment
print "$_\n";
};
};
The first argument of "split" is a delimiter. Split splits (-8

How to split this string 'gi|216ATGCTGATGCTGTG' in this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '21-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my @strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \@strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ $1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/$1\t/;
Note that this will substitute on $string in place.
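Putting the substitution idea together, here is a minimal sketch that produces the format asked for in the question. It assumes every line looks like the two samples: a gi| prefix, then digits or dashes, then only A, C, G and T. Lines that do not fit that shape are left unchanged.

```perl
use strict;
use warnings;

my @lines = (
    'gi|216CCAACGAAATGATCGCCACACAA',
    'gi|21-GCTGGTTCAGCGACCAAAAGTAGC',
);

my @spaced = map {
    # copy the line, then edit the copy so the original stays untouched
    (my $copy = $_) =~ s/^(gi\|[\d-]+)([ACGT]+)$/$1 $2/;
    $copy;
} @lines;

print "$_\n" for @spaced;
# gi|216 CCAACGAAATGATCGCCACACAA
# gi|21- GCTGGTTCAGCGACCAAAAGTAGC
```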

How to Split on three different delimiters then ucfirst each result[]?

I am trying to figure out how to split a string that has three possible delimiters (or none) without a million lines of code, while keeping the code legible to a guy like me.
Many possible combinations in the string.
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
There are no spaces in the string and none of these characters:
~`!@#$%^&*()+=\][{}|';:"/?>,<.
The string is already stripped of all but:
0-9
a-Z
-
_
.
There are also no sequential dots, dashes or underscores.
I would like the result to be displayed like Result:
This Is The String
I am really having a difficult time trying to get this going.
I believe I will need to use a hash and I just have not grasped the concept even after hours of trial and error.
I am bewildered at the fact I could possibly split a string on multiple delimiters where the delimiters could be in any order AND/OR three different types (or none at all) AND maintain the order of the result!
Any possibilities?
Split the string into words, capitalise the words, then join the words while inserting spaces between them.
It can be coded quite succinctly:
my $clean = join ' ', map ucfirst lc, split /[_.-]+/, $string;
If you just want to print out the result, you can use
use feature qw( say );
say join ' ', map ucfirst lc, split /[_.-]+/, $string;
or
print join ' ', map ucfirst lc, split /[_.-]+/, $string;
print "\n";
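As a runnable sketch of the split, capitalise, join pipeline described above, wrapped in a sub so it can be reused. The name titleize is hypothetical and chosen only for this example.

```perl
use strict;
use warnings;

# titleize is an illustrative name, not part of any module
sub titleize {
    my ($string) = @_;
    # split on runs of _ . or -, normalise case, capitalise, rejoin with spaces
    return join ' ', map { ucfirst(lc($_)) } split /[_.-]+/, $string;
}

print titleize($_), "\n" for qw( this-is_the.string thisisthestring );
# This Is The String
# Thisisthestring
```

The second line shows the limit of this approach: with no delimiters there is nothing to split on, so only the first letter is capitalised.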
It is simple to use a global regular expression to gather all sequences of characters that are not a dot, dash, or underscore.
After that, lc will lower-case each string and ucfirst will capitalise it. Stringifying an array will insert spaces between the elements.
for ( qw/ this-is_the.string this.is.the.string this-is_the_string / ) {
my @string = map {ucfirst lc } /[^-_.]+/g;
print "@string\n";
}
output
This Is The String
This Is The String
This Is The String
" the delimiters could be anywhere AND/OR three different types (or none at all)" ... you need a delimiter to split a string, you can define multiple delimiters with a regular expression to the split function
my @parts = split(/[-_\.]/, $string);
print ucfirst "$_ " foreach @parts;
print "\n";
Here's a solution that will work for all but your last test case. It's extremely hard to split a string without delimiters: you'd need to have a list of possible words, and even then it would be prone to error.
#!/usr/bin/perl
use strict;
use warnings;
my @strings = qw(
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
);
foreach my $string (@strings) {
print join(q{ }, map {ucfirst($_)} split(m{[_.-]}smx,$string)) . qq{\n};
}
And here's an alternative for the loop that splits everything into separate statements to make it easier to read:
foreach my $string (@strings) {
my @words = split m{[_.-]}smx, $string;
my @upper_case_words = map {ucfirst($_)} @words;
my $string_with_spaces = join q{ }, @upper_case_words;
print $string_with_spaces . qq{\n};
}
And to prove that just because you can, doesn't mean you should :P
$string =~ s{([A-Za-z]+)([_.-]*)?}{ucfirst(lc("$1")).($2?' ':'')}ge;
For all but last possibility:
use strict;
use warnings;
my $file;
my $newline;
open $file, "<", "testfile" or die "Cannot open testfile: $!";
while (<$file>) {
chomp;
$newline = join ' ', map ucfirst lc, split /[-_\.]/, $_;
print $newline . "\n";
}