How to Split on three different delimiters then ucfirst each result[]? - perl

I am trying to figure out how to split a string that has three possible delimiters (or none) without a million lines of code but, code is still legible to a guy like me.
Many possible combinations in the string.
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
There are no spaces in the string and none of these characters:
~`!##$%^&*()+=\][{}|';:"/?>,<.
The string is already stripped of all but:
0-9
a-Z
-
_
.
There are also no sequential dots, dashes or underscores.
I would like the result to be displayed like Result:
This Is The String
I am really having a difficult time trying to get this going.
I believe I will need to use a hash and I just have not grasped the concept even after hours of trial and error.
I am bewildered at the fact I could possibly split a string on multiple delimiters where the delimiters could be in any order AND/OR three different types (or none at all) AND maintain the order of the result!
Any possibilities?

Split the string into words, capitalise the words, then join the words while inserting spaces between them.
It can be coded quite succinctly:
my $clean = join ' ', map ucfirst lc, split /[_.-]+/, $string;
If you just want to print out the result, you can use
use feature qw( say );
say join ' ', map ucfirst lc, split /[_.-]+/, $string;
or
print join ' ', map ucfirst lc, split /[_.-]+/, $string;
print "\n";

It is simple to use a global regular expression to gather all sequences of characters that are not a dot, dash, or underscore.
After that, lc will lower-case each string and ucfirst will capitalise it. Stringifying an array will insert spaces between the elements.
for ( qw/ this-is_the.string this.is.the.string this-is_the_string / ) {
my #string = map {ucfirst lc } /[^-_.]+/g;
print "#string\n";
}
output
This Is The String
This Is The String
This Is The String

" the delimiters could be anywhere AND/OR three different types (or none at all)" ... you need a delimiter to split a string, you can define multiple delimiters with a regular expression to the split function
my #parts = split(/[-_\.]/, $string);
print ucfirst "$_ " foreach #parts;
print "\n"

Here's a solution that will work for all but your last test case. It's extremely hard to split a string without delimiters, you'd need to have a list of possible words, and even then it would be prone to error.
#!/usr/bin/perl
use strict;
use warnings;
my #strings = qw(
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
);
foreach my $string (#strings) {
print join(q{ }, map {ucfirst($_)} split(m{[_.-]}smx,$string)) . qq{\n};
}
And here's an alternative for the loop that splits everything into separate statements to make it easier to read:
foreach my $string (#strings) {
my #words = split m{[_.-]}smx, $string;
my #upper_case_words = map {ucfirst($_)} #words;
my $string_with_spaces = join q{ }, #upper_case_words;
print $string_with_spaces . qq{\n};
}

And to prove that just because you can, doesn't mean you should :P
$string =~ s{([A-Za-z]+)([_.-]*)?}{ucfirst(lc("$1")).($2?' ':'')}ge;

For all but last possibility:
use strict;
use warnings;
my $file;
my $newline;
open $file, "<", "testfile";
while (<$file>) {
chomp;
$newline = join ' ', map ucfirst lc, split /[-_\.]/, $_;
print $newline . "\n";
}

Related

Splitting Perl string and adding _ in between

Question. I am trying to read in perl string from command line e.g. "abcdef" and then split this into "a_b_c_d_e_f".
I am struggling with logic part. any ideas?
#!/usr/bin/perl
while($line=<STDIN>){
chomp $line;
split $line;
join ("_", $line);
print $line;
}
The split manpage actually includes exactly this example:
print join(':', split('', 'abc')), "\n";
Adjusting to use _ instead of : and $line instead of 'abc', we get:
print join('_', split('', $line)), "\n";
The most important point is that split doesn't modify its arguments, it just returns a list, and join doesn't modify its arguments, it just returns a string. So it never makes sense to call split or join without using the return-value.
What you need is
print join('_', split //, $line), "\n";
One-liner:
print join('_', split('', $line)), '\n';
You can read more about perl's split() function here.
Unless you must use split, you can use a between-character substitution for this:
use strict;
use warnings;
my $string = 'abcdef';
$string =~ s/(?<=.)(?:)(?=.)/_/g;
print $string;
Output:
a_b_c_d_e_f
Hope this helps!

How to split a this string 'gi|216ATGCTGATGCTGTG' in this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '216-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my #strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \#strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ \1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/\1\t/;
note that this will substitute on $string in place.

Using a char variable in tr///

I am trying to count the characters in a string and found an easy solution counting a single character using the tr operator. Now I want to do this with every character from a to z. The following solution doesn't work because tr/// matches every character.
my #chars = ('a' .. 'z');
foreach my $c (#chars)
{
$count{$c} = ($text =~ tr/$c//);
}
How do I correctly use the char variable in tr///?
tr/// doesn't work with variables unless you wrap it in an eval
But there is a nicer way to do this:
$count{$_} = () = $text =~ /$_/g for 'a' .. 'z';
For the TIMTOWTDI:
$count{$_}++ for grep /[a-z]/i, split //, $text;
tr doesn't support variable interpolation (neither in the search list nor in the replacement list). If you want to use variables, you must use eval():
$count{$c} = eval "\$text =~ tr/$c/$c/";
That said, a more efficient (and secure) approach would be to simply iterate over the characters in the string and increment counters for each character, e.g.:
my %count = map { $_ => 0 } 'a' .. 'z';
for my $char (split //, $text) {
$count{$char}++ if defined $count{$char};
}
If you look at the perldoc for tr/SEARCHLIST/REPLACEMENTLIST/cdsr, then you'll see, right at the bottom of the section, the following:
Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $# if $#;
eval "tr/$oldlist/$newlist/, 1" or die $#;
Thus, you would need an eval to generate a new SEARCHLIST.
This is going to be very inefficient... the code might feel neat, but you're processing the complete string 26 times. You're also not counting uppercase characters.
You'd be better off stepping through the string once and just incrementing counters for each character found.
From the perlop documentation:
tr/AAA/XYZ/
will transliterate any A to X.
Because the transliteration table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote
interpolation. That means that if you want to use variables, you must
use an eval()
Alternatively in your case you can use the s/// operator as:
foreach my $c (#chars) {
$count{$c} += ($text =~ s/$c//g);
}
My solution with some modification based from http://www.perlmonks.org/?node_id=446003
sub lowerLetters {
my $string = shift;
my %table;
#table{split //, $letters_uc} = split //, $letters_lc;
my $table_re = join '|', map { quotemeta } reverse sort keys %table;
$string =~ s/($table_re)/$table{$1}/g;
return if not defined $string;
return $string;
}
You may want to use s instead. Substitution is much more powerful than tr
My solution:
$count{$c} =~ s/\$search/$replace/g;
g at the end means "use it globally".
See:
https://blog.james.rcpt.to/2010/10/25/perl-search-and-replace-using-variables/
https://docstore.mik.ua/orelly/perl3/lperl/ch09_06.htm

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my #array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?
use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my #params = ();
while ($string) {
if ($string =~ /^([^(]*?),/) {
push #params, $1;
$string =~ s/^\Q$1\E\s*,?\s*//;
} else {
my ($ext, $pre);
($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
push #params, "$pre$ext";
$string =~ s/^\s*,\s*//;
}
}
This one supports:
nested parentheses;
empty fields;
strings of any length.
Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my #array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that this will (unfortunately) fail to match the lack of space between ,,.
Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in PERL regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my #array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.
Limit the number of elements it can be split into:
split(',', $string, 4)
Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my #array = ($string =~ /(
[^,]*\([^)]*\) # comma inside parens is part of the word
|
[^,]*) # split on comma outside parens
(?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)
Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
my ($string) = #_;
my #fields;
my #comma_separated = split(/,/, $string);
my #to_be_joined;
my $depth = 0;
foreach my $field (#comma_separated) {
my #brackets = $field =~ /(\(|\))/g;
foreach (#brackets) {
$depth++ if /\(/;
$depth-- if /\)/;
}
if ($depth == 0) {
push #fields, join(",", #to_be_joined, $field);
#to_be_joined = ();
} else {
push #to_be_joined, $field;
}
}
return #fields;
}

Reformulate a string query in perl

How do i reformulate a string in perl?
For example consider the string "Where is the Louvre located?"
How can i generate strings like the following:
"the is Louvre located"
"the Louvre is located"
"the Louvre located is"
These are being used as queries to do a web search.
I was trying to do something like this:
Get rid of punctuations and split the sentence into words.
my #words = split / /, $_[0];
I don't need the first word in the string, so getting rid of it.
shift(#words);
And then i need move the next word through out the array - not sure how to do this!!
Finally convert the array of words back to a string.
How can I generate all permutations of an array in Perl?
Then use join to glue each permutation array back together into a single string.
Somewhat more verbose example:
use strict;
use warnings;
use Data::Dumper;
my $str = "Where is the Louvre located?";
# split into words and remove the punctuation
my #words = map {s/\W+//; $_} split / /, $str;
# remove the first two words while storing the second
my $moving = splice #words, 0 ,2;
# generate the variations
my #variants;
foreach my $position (0 .. $#words) {
my #temp = #words;
splice #temp, $position, 0, $moving;
push #variants, \#temp;
}
print Dumper(\#variants);
my #head;
my ($x, #tail) = #words;
while (#tail) {
push #head, shift #tail;
print join " ", #head, $x, #tail;
};
Or you can just "bubble" $x through the array: $words[$n-1] and words[$n]
foreach $n (1..#words-1) {
($words[$n-1, $words[$n]) = ($words[$n], $words[$n-1]);
print join " ", #words, "\n";
};