Split on comma, but only when not in parenthesis - perl

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my @array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?

use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my @params = ();
while ($string) {
    if ($string =~ /^([^(]*?),/) {
        push @params, $1;
        $string =~ s/^\Q$1\E\s*,?\s*//;
    } else {
        my ($ext, $pre);
        ($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
        push @params, "$pre$ext";
        $string =~ s/^\s*,\s*//;
    }
}
This one supports:
nested parentheses;
empty fields;
strings of any length.

Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
One caveat: this pattern needs at least one character per field, so it will (unfortunately) not return an empty element for the empty field between two adjacent commas (,,).

Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here via Google as I did, here's what I finally got. It's a very short answer using only built-in Perl regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my @array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.

Limit the number of elements it can be split into:
split(',', $string, 4)

Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(
    [^,]*\([^)]*\)   # comma inside parens is part of the word
    |
    [^,]*)           # split on comma outside parens
    (?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)

Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
    my ($string) = @_;
    my @fields;
    my @comma_separated = split(/,/, $string);
    my @to_be_joined;
    my $depth = 0;
    foreach my $field (@comma_separated) {
        my @brackets = $field =~ /(\(|\))/g;
        foreach (@brackets) {
            $depth++ if /\(/;
            $depth-- if /\)/;
        }
        if ($depth == 0) {
            push @fields, join(",", @to_be_joined, $field);
            @to_be_joined = ();
        } else {
            push @to_be_joined, $field;
        }
    }
    return @fields;
}
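The depth-counting idea above can also be applied character by character, which avoids splitting and re-joining. This is only a sketch: split_top_level is a hypothetical helper name, and it assumes the parentheses in the input are balanced.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit at parenthesis depth 0.
# Assumes balanced parentheses; a stray ')' at depth 0 is kept as plain text.
sub split_top_level {
    my ($string) = @_;
    my @fields = ('');
    my $depth  = 0;
    for my $char (split //, $string) {
        if ($char eq ',' && $depth == 0) {
            push @fields, '';    # top-level comma: start a new field
            next;
        }
        $depth++ if $char eq '(';
        $depth-- if $char eq ')' && $depth > 0;
        $fields[-1] .= $char;    # everything else belongs to the current field
    }
    return @fields;
}

print "$_\n" for split_top_level('ab,12,20100401,xyz(A(2,3),B)');
```

Empty fields between adjacent top-level commas are kept as empty strings.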

Related

Perl regular expressions and returned array of matched groups

I am new to Perl and need to do some regexp matching.
I read that when an array is used as an integer value, it gives the count of the elements inside.
So I am doing, for example:
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
I thought it would return an empty array when the pattern match fails, but it still gives me an array with 2 elements, just with uninitialized values.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
    if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        if (defined $params{$result[0]}) {
            print STDERR "Cmd option error\n";
        }
        $params{$result[0]} = (defined $result[1] ? $result[1] : 1);
    }
    else {
        print STDERR "Cmd option error\n";
        exit ERROR_CMD;
    }
}
This is a regexp pattern for command-line options. The options are in long format, preceded by two hyphens and optionally followed by an argument, so
--CMD[=ARG]. I want an elegant solution, which is why I want to put the match in the if condition without any prologue.
EDIT2:
Oh, sorry. I thought the groups in the @result array were always counted from 0, so that only the groups from the branch where the pattern succeeded would be present. So if the command in my code is "input", I expected it in $result[0], but it is actually in $result[1]. I thought that if $result[0] is uninitialized, then the pattern had failed and execution still entered the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements, which resulted from your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl evaluates @result in scalar context, so it adds the number of elements to 0. This can also be achieved with scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my #result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
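A minimal sketch of why this works, assuming the option pattern from the question: a list assignment in scalar context yields the number of elements on its right-hand side, and a failed match returns an empty list. On success all three groups are returned, with undef for the groups in the branch that did not participate. parse_option is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (key, value) on success, an empty list on failure.
sub parse_option {
    my ($arg) = @_;
    if (my @result = $arg =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        # Groups from the alternative that did not match are undef.
        my ($flag, $name, $value) = @result;
        return defined $flag ? ($flag, 1) : ($name, $value);
    }
    return;    # the match failed, so @result was empty and the condition was false
}

for my $arg ('--help', '--input=file.txt', 'bogus') {
    if (my ($key, $value) = parse_option($arg)) {
        print "$arg: $key => $value\n";
    }
    else {
        print "$arg: no match\n";
    }
}
```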
@result = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
    push @result, $1;
    push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.

How to split this string 'gi|216ATGCTGATGCTGTG' into this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '21-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
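If the goal is just the two-column output from the question, the same match can be wrapped up as follows. This is a sketch under the assumptions stated above (a gi| prefix, then digits and hyphens, then only A, C, G, T); split_gi is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (identifier, sequence), or an empty list for
# lines that do not look like "gi|<digits and hyphens><nucleotides>".
sub split_gi {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^(gi\|[\d-]+)([ACGT]+)$/;
    return ($1, $2);
}

for my $line ('gi|216CCAACGAAATGATCGCCACACAA', 'gi|21-GCTGGTTCAGCGACCAAAAGTAGC') {
    my ($id, $seq) = split_gi($line) or next;    # skip lines that do not match
    print "$id $seq\n";
}
```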
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
    my @strings =
        grep {m{\A \S+ \z}xms}    # no whitespace tokens
        split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms,    # capture left & right
        $line;
    print Dumper( \@strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ $1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/$1\t/;
note that this will substitute on $string in place.

Why is my Perl code not omitting newlines?

I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
    #print $line;
    @arr = split /\s+/, $line;
    foreach $w (@arr) {
        if ($w !~ /^\s+$/) {
            print $w."\n";
        }
    }
    #print @arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
    my @arr = split /\s+/, $line;
    foreach my $w (@arr) {
        if ($w !~ /^\s+$/) {
            print $w."\n";
        }
    }
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6, so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade.
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is an empty field before the first separator. Imagine if your data was comma-separated: then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my @arr = $line =~ /\S+/g;
and this can be emulated by using the default separator for split, which is a single quoted space (not a regular expression):
my @arr = split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
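The difference between the two separators is easy to see on a line with leading whitespace. A small sketch, with a made-up sample line:

```perl
use strict;
use warnings;

my $line = "   do you see\n";    # sample input with leading whitespace

my @regex_split = split /\s+/, $line;    # leading empty field is kept
my @awk_split   = split ' ',   $line;    # leading whitespace is skipped

print scalar(@regex_split), " fields with /\\s+/\n";    # prints 4
print scalar(@awk_split),   " fields with ' '\n";       # prints 3
```

In both cases the trailing newline produces no extra element, because split drops trailing empty fields by default.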
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
    my @arr = split;
    foreach my $w (@arr) {
        print "$w\n" if $w !~ /^\s+$/;
    }
}
This line is the problem:
@arr=split(/\s+/,$line);
\s+ matches the leading whitespace, so split returns an empty field before it. Use ' ' instead.
@arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to skip printing when the field is empty.
But the "+" in the regex requires at least one whitespace character, so an empty string does not match and still gets printed.
If you change the "\s+" to "\s*", you'll see that it works, because * means zero or more occurrences.

Using a char variable in tr///

I am trying to count the characters in a string and found an easy solution counting a single character using the tr operator. Now I want to do this with every character from a to z. The following solution doesn't work because tr/// matches every character.
my @chars = ('a' .. 'z');
foreach my $c (@chars)
{
    $count{$c} = ($text =~ tr/$c//);
}
How do I correctly use the char variable in tr///?
tr/// doesn't work with variables unless you wrap it in an eval
But there is a nicer way to do this:
$count{$_} = () = $text =~ /$_/g for 'a' .. 'z';
For the TIMTOWTDI:
$count{$_}++ for grep /[a-z]/i, split //, $text;
tr doesn't support variable interpolation (neither in the search list nor in the replacement list). If you want to use variables, you must use eval():
$count{$c} = eval "\$text =~ tr/$c/$c/";
That said, a more efficient (and secure) approach would be to simply iterate over the characters in the string and increment counters for each character, e.g.:
my %count = map { $_ => 0 } 'a' .. 'z';
for my $char (split //, $text) {
$count{$char}++ if defined $count{$char};
}
If you look at the perldoc for tr/SEARCHLIST/REPLACEMENTLIST/cdsr, then you'll see, right at the bottom of the section, the following:
Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $@ if $@;
eval "tr/$oldlist/$newlist/, 1" or die $@;
Thus, you would need an eval to generate a new SEARCHLIST.
This is going to be very inefficient... the code might feel neat, but you're processing the complete string 26 times. You're also not counting uppercase characters.
You'd be better off stepping through the string once and just incrementing counters for each character found.
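A rough sketch of that single pass, folding case with lc so that uppercase letters are counted too. The sample text is made up for illustration.

```perl
use strict;
use warnings;

my $text = 'Hello, World';    # sample input, assumed for illustration

# Pre-seed the table so letters that never occur still report 0.
my %count = map { $_ => 0 } 'a' .. 'z';

# One pass over the lowercased string; anything that is not a-z is ignored.
for my $char (split //, lc $text) {
    $count{$char}++ if exists $count{$char};
}

print "$_: $count{$_}\n" for grep { $count{$_} } sort keys %count;
```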
From the perlop documentation:
tr/AAA/XYZ/
will transliterate any A to X.
Because the transliteration table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote
interpolation. That means that if you want to use variables, you must
use an eval()
Alternatively in your case you can use the s/// operator as:
foreach my $c (@chars) {
    $count{$c} += ($text =~ s/$c//g);
}
My solution with some modification based from http://www.perlmonks.org/?node_id=446003
sub lowerLetters {
my $string = shift;
my %table;
@table{split //, $letters_uc} = split //, $letters_lc;
my $table_re = join '|', map { quotemeta } reverse sort keys %table;
$string =~ s/($table_re)/$table{$1}/g;
return if not defined $string;
return $string;
}
You may want to use s instead. Substitution is much more powerful than tr
My solution:
$count{$c} =~ s/\$search/$replace/g;
g at the end means "use it globally".
See:
https://blog.james.rcpt.to/2010/10/25/perl-search-and-replace-using-variables/
https://docstore.mik.ua/orelly/perl3/lperl/ch09_06.htm

How can I combine Perl's split command with white space trimming?

Repost from Perlmonks for a coworker:
I wrote a Perl script to separate long lists of email addresses separated by semicolons. What I would like to do with the code is combine the split with the trimming of white space so I don't need two arrays. Is there a way to trim while loading the first array? The output is a sorted list of names.
Thanks.
#!/pw/prod/svr4/bin/perl
use warnings;
use strict;
my $file_data =
    'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear, Izzy';
my @email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my @tmp_data = split( /;/, $file_data );
foreach my $entry (@tmp_data) {
    $entry =~ s/^[ \t]+|[ \t]+$//g;
    push( @email_list, $entry );
}
foreach my $name ( sort(@email_list) ) {
    print "$name \n";
}
You don't have to do both operations in one go using the same function. Sometimes performing the actions separately can be more clear. That is, split first, then strip the whitespace off of each element (and then sort the result):
@email_list =
    sort(
        map {
            s/\s*(\S+)\s*/\1/; $_
        }
        split ';', $file_data
    );
EDIT: Stripping more than one part of a string at the same time can lead to pitfalls, e.g. Sinan's point below about leaving trailing spaces in the "Elizabeth" portion. I coded that snippet with the assumption that the name would not have internal whitespace, which is actually quite wrong and would have stood out as incorrect if I had consciously noticed it. The code is much improved (and also more readable) below:
@email_list =
    sort(
        map {
            s/^\s+//;    # strip leading spaces
            s/\s+$//;    # strip trailing spaces
            $_           # return the modified string
        }
        split ';', $file_data
    );
If you don't need to trim the first and final element, this will do the trick:
@email_list = split /\s*;\s*/, $file_data;
If you do need to trim the first and final element, trim $file_data first, then repeat as above. :-P
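Put together, that could look like the sketch below. The sample data is made up, and $trimmed is a name introduced here only for illustration.

```perl
use strict;
use warnings;

my $file_data = '  Builder, Bob ;Stein, Franklin; Boop, Elizabeth  ';

# Trim the whole string once, so the first and last entries are clean too.
(my $trimmed = $file_data) =~ s/^\s+|\s+$//g;

# The separator pattern swallows the whitespace around each semicolon.
my @email_list = sort split /\s*;\s*/, $trimmed;

print "$_\n" for @email_list;
```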
Well, you can do what Chris suggested, but it doesn't handle leading and trailing spaces in $file_data.
You can add handling of these like this:
$file_data =~ s/\A\s+|\s+\z//g;
Also, please note that using 2nd array was not necessary. Check this:
my $file_data = 'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear, Izzy';
my @email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my @tmp_data = split( /;/, $file_data );
foreach my $entry (@tmp_data) {
    $entry =~ s/^[ \t]+|[ \t]+$//g;
}
foreach my $name ( sort(@tmp_data) ) {
    print "$name \n";
}
my @email_list = map { s/^[ \t]+|[ \t]+$//g; $_ } split /;/, $file_data;
or the more elegant:
use Algorithm::Loops "Filter";
my @email_list = Filter { s/^[ \t]+|[ \t]+$//g } split /;/, $file_data;
See How do I strip blank space from the beginning/end of a string? in the FAQ.
@email_list = sort map {
    s/^\s+//; s/\s+$//; $_
} split ';', $file_data;
Now, note also that a for loop aliases each element of an array, so
@email_list = sort split ';', $file_data;
for (@email_list) {
    s/^\s+//;
    s/\s+$//;
}
would also work.
My turn:
my @fields = grep { $_ } split m/\s*(?:;|^|$)\s*/, $record;
This also trims the first and last elements. If grep is overkill for getting rid of the first element:
my ( undef, @fields ) = split m/\s*(?:;|^|$)\s*/, $record;
works if you know that there is a leading space, but that's not likely, so
my @fields = split m/\s*(?:;|^|$)\s*/, $record;
shift @fields unless $fields[0];
is the surest way to do it.
Barring some minor syntax error, this should do the whole work for you. Oh, list operations, how beautiful you are!
print join (" \n", sort map { s/^[ \t]+|[ \t]+$//g; $_ } split (/;/, $file_data));