How can I combine Perl's split command with white space trimming? - perl

Repost from Perlmonks for a coworker:
I wrote a perl script to separate long lists of email separated by a semi colon. What I would like to do with the code is combine the split with the trimming of white space so I don't need two arrays. Is there away to trim while loading the first array. Output is a sorted list of names.
Thanks.
#!/pw/prod/svr4/bin/perl
use warnings;
use strict;
my $file_data =
'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear,
+ Izzy';
my #email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my #tmp_data = split( /;/, $file_data );
foreach my $entry (#tmp_data) {
$entry =~ s/^[ \t]+|[ \t]+$//g;
push( #email_list, $entry );
}
foreach my $name ( sort(#email_list) ) {
print "$name \n";
}

You don't have to do both operations in one go using the same function. Sometimes performing the actions separately can be more clear. That is, split first, then strip the whitespace off of each element (and then sort the result):
#email_list =
sort(
map {
s/\s*(\S+)\s*/\1/; $_
}
split ';', $file_data
);
EDIT: Stripping more than one part of a string at the same time can lead to pitfalls, e.g. Sinan's point below about leaving trailing spaces in the "Elizabeth" portion. I coded that snippet with the assumption that the name would not have internal whitespace, which is actually quite wrong and would have stood out as incorrect if I had consciously noticed it. The code is much improved (and also more readable) below:
#email_list =
sort(
map {
s/^\s+//; # strip leading spaces
s/\s+$//; # strip trailing spaces
$_ # return the modified string
}
split ';', $file_data
);

If you don't need to trim the first and final element, this will do the trick:
#email_list = split /\s*;\s*/, $file_data;
If you do need to trim the first and final element, trim $file_data first, then repeat as above. :-P

Well, you can do what Chris suggested, but it doesn't handle leading and trailing spaces in $file_data.
You can add handling of these like this:
$file_data =~ s/\A\s+|\s+\z//g;
Also, please note that using 2nd array was not necessary. Check this:
my $file_data = 'Builder, Bob ;Stein, Franklin MSW; Boop, Elizabeth PHD Cc: Bear, Izzy';
my #email_list;
$file_data =~ s/CC:/;/ig;
$file_data =~ s/PHD//ig;
$file_data =~ s/MSW//ig;
my #tmp_data = split( /;/, $file_data );
foreach my $entry (#tmp_data) {
$entry =~ s/^[ \t]+|[ \t]+$//g;
}
foreach my $name ( sort(#tmp_data) ) {
print "$name \n";
}

my #email_list = map { s/^[ \t]+|[ \t]+$//g; $_ } split /;/, $file_data;
or the more elegant:
use Algorithm::Loops "Filter";
my #email_list = Filter { s/^[ \t]+|[ \t]+$//g } split /;/, $file_data;

See How do I strip blank space from the beginning/end of a string? in the FAQ.
#email_list = sort map {
s/^\s+//; s/\s+$//; $_
} split ';', $file_data;
Now, note also that a for loop aliases each element of an array, so
#email_list = sort split ';', $file_data;
for (#email_list) {
s/^\s+//;
s/\s+$//;
}
would also work.

My turn:
my #fields = grep { $_ } split m/\s*(?:;|^|$)\s*/, $record;
It also strips the first and last elements as well. If grep is overkill for getting rid of the first element:
my ( undef, #fields ) = split m/\s*(?:;|^|$)\s*/, $record;
works if you know that there is a space, but that's not likely, so
my #fields = split m/\s*(?:;|^|$)\s*/, $record;
shift #fields unless $fields[0];
is the most sure way to do it.

Barring some minor sintax error, this should do the whole work for you. Oh, list operations, how beautiful you are!
print join (" \n", sort { $a <=> $b } map { s/^[ \t]+|[ \t]+$//g } split (/;/, $file_data));

Related

Cannot split line and save to variable using whitespace in perl

I'm having some trouble with parsing a file.
Two lines in the file contain the word ' Mapped', and I would like to extract the number that is in those two lines.
And this is my code:
my %cellHash = ();
my $mapped = 0;
my $alnPairs = 0;
my #mappedReads = ();
while (<ALIGN_SUMMARY>) {
chomp($_);
if (/Mapped/) {
print "\n$_\n";
$mapped = (split / /, $_)[2];
push(#mappedReads, $mapped);
}
if (/Aligned pairs/) {
print "\n$_\n";
$alnPairs = (split / /, $_)[4];
}
}
{ $cellHash{$cellDir} } = (
'MappedR1' => $mappedReads[0] ,
'MappedR2' => $mappedReads[1] ,
'AlnPairs' => $alnPairs ,
);
foreach my $cellName ( keys %cellHash){
print OUTPUT $cellName,
"\t", ${ $cellHash{$cellName} }{"LibSize"},
"\t", ${ $cellHash{$cellName} }{"MappedR1"},
"\t", ${ $cellHash{$cellName} }{"MappedR2"},
"\t", ${ $cellHash{$cellName} }{"AlnPairs"},
"\n";
}
But the OUTPUT file only has the 'AlignedPairs' column and never anything in MappedR1 or MappedR2.
What am I doing wrong? Thanks!
When I look at the file, it looks like there is more than a single space. Here is an example of what I mean and what I did to extract the number.
my $test = "blah : 123455";
my #test_ary = split(/ /, $test);
print scalar #test_ary . "\n"; # Prints the size of the array
$number = $1 if $test =~ m/([0-9]+)/;
print "$number\n"; # Prints the extracted number
Output of run:
Size of array: 8
The extracted number: 123455
Hope this helps.
First off, paste in your actual input and output if you want anyone to actually test somethnig for you, not an image.
Second, you're not splitting on whitespace, you're splitting on a single literal space. Use the special case of
split ' ', $_;
to split on arbitrary length whitespace, discarding leading and trailing whitespace.

Perl script that parses CSV file excluding the contents enclosed in []

Hi there I am struggling with perl script that parses a an eight column CSV line into another CSV line using the split command. But i want to exclude all the text enclosed by square brackets []. The line looks like :
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script but when i print $fields[7] it gives me N. one of the fields inside [] above.but by print "$fields[7]" i want it to be 1399385680 which is the last field in the above line. the script i tried was.
while (my $line = <LOG>) {
chomp $line;
my #fields=grep { !/^[\[.*\]]$/ } split ",", $line;
my $timestamp=$fields[7];
print "$fields[7]";
}
Thanks for your time. I will appreciate your help.
Always include use strict; and use warnings; at the top of EVERY perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
chomp;
s/\[.*?\]//g;
my #fields = split ',', $_;
my $timestamp = $fields[7];
print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my #fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
chomp $line;
$line =~ s,\[.*?\],,; # Delete all text between square brackets
my #fields = split ",", $line;
my $timestamp = $fields[7];
print $fields[7], "\n";
}

How to skip splitting for some part of the line

Say I have a line lead=george wife=jane "his boy"=elroy. I want to split with space but that does not include the "his boy" part. I should be considered as one.
With normal split it is also splitting "his boy" like taking "his" as one and "boy" as second part. How to escape this
Following this i tried
split " ", $_
Just came to know that this will work
use strict; use warnings;
my $string = q(hi my name is 'john doe');
my #parts = $string =~ /'.*?'|\S+/g;
print map { "$_\n" } #parts;
But it does not looks good. Any other simple thing with split itself?
You could use Text::ParseWords for this
use Text::ParseWords;
$list = "lead=george wife=jane \"his boy\"=elroy";
#words = quotewords('\s+', 0, $list);
$i = 0;
foreach (#words) {
print "$i: <$_>\n";
$i++;
}
ouput:
0: <lead=george>
1: <wife=jane>
2: <his boy=elroy>
sub split_space {
my ( $text ) = #_;
while (
$text =~ m/
( # group ($1)
\"([^\"]+)\" # first try find something in quotes ($2)
|
(\S+?) # else minimal non-whitespace run ($3)
)
=
(\S+) # then maximum non-whitespace run ($4)
/xg
) {
my $key = defined($2) ? $2 : $3;
my $value = $4;
print( "key=$key; value=$value\n" );
}
}
split_space( 'lead=george wife=jane "his boy"=elroy' );
Outputs:
key=lead; value=george
key=wife; value=jane
key=his boy; value=elroy
PP posted a good solution. But just to make it sure, that there is a cool other way to do it, comes my solution:
my $string = q~lead=george wife=jane "his boy"=elroy~;
my #split = split / (?=")/,$string;
my #split2;
foreach my $sp (#split) {
if ($sp !~ /"/) {
push #split2, $_ foreach split / /, $sp;
} else {
push #split2,$sp;
}
}
use Data::Dumper;
print Dumper #split2;
Output:
$VAR1 = 'lead=george';
$VAR2 = 'wife=jane';
$VAR3 = '"his boy"=elroy';
I use a Lookahead here for splitting at first the parts which keys are inside quotes " ". After that, i loop through the complete array and split all other parts, which are normal key=values.
You can get the required result using a single regexp, which extract the keys and the values and put the result inside a hash table.
(\w+|"[\w ]+") will match both a single and multiple word in the key side.
The regexp captures only the key and the value, so the result of the match operation will be a list with the following content: key #1, value #1, key #2, value#2, etc.
The hash is automatically initiated with the appropriate keys and values, when the match result is assigned to it.
here is the code
my $str = 'lead=george wife=jane "hello boy"=bye hello=world';
my %hash = ($str =~ m/(?:(\w+|"[\w ]+")=(\w+)(?:\s|$))/g);
## outputs the hash content
foreach $key (keys %hash) {
print "$key => $hash{$key}\n";
}
and here is the output of this script
lead => george
wife => jane
hello => world
"hello boy" => bye

How to split a this string 'gi|216ATGCTGATGCTGTG' in this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '216-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my #strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \#strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ \1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/\1\t/;
note that this will substitute on $string in place.

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my #array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?
use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my #params = ();
while ($string) {
if ($string =~ /^([^(]*?),/) {
push #params, $1;
$string =~ s/^\Q$1\E\s*,?\s*//;
} else {
my ($ext, $pre);
($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
push #params, "$pre$ext";
$string =~ s/^\s*,\s*//;
}
}
This one supports:
nested parentheses;
empty fields;
strings of any length.
Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my #array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that this will (unfortunately) fail to match the lack of space between ,,.
Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in PERL regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my #array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.
Limit the number of elements it can be split into:
split(',', $string, 4)
Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my #array = ($string =~ /(
[^,]*\([^)]*\) # comma inside parens is part of the word
|
[^,]*) # split on comma outside parens
(?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)
Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
my ($string) = #_;
my #fields;
my #comma_separated = split(/,/, $string);
my #to_be_joined;
my $depth = 0;
foreach my $field (#comma_separated) {
my #brackets = $field =~ /(\(|\))/g;
foreach (#brackets) {
$depth++ if /\(/;
$depth-- if /\)/;
}
if ($depth == 0) {
push #fields, join(",", #to_be_joined, $field);
#to_be_joined = ();
} else {
push #to_be_joined, $field;
}
}
return #fields;
}