Odd behavior with split in perl when result includes empty strings - perl

This is my perl 5.16 code
while(<>) {
chomp;
@data = split /a/, $_;
print(join("b",@data),"\n");
}
If I input a file with this in it:
paaaa
paaaaq
I get
p
pbbbbq
But I was expecting
pbbbb
pbbbbq
Why am I wrong to expect the latter behavior?

It is documented that trailing empties are removed unless you specify a third, non-zero argument.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved)
You want
split /a/, $_, -1;
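A minimal sketch of the difference, using the first sample line from the question (the variable names here are only illustrative):

```perl
use strict;
use warnings;

my $line = 'paaaa';

# Default LIMIT: trailing empty fields are stripped, leaving only 'p'.
my @stripped = split /a/, $line;

# LIMIT of -1: every field is kept, including the four trailing empty ones.
my @kept = split /a/, $line, -1;

print join('b', @stripped), "\n";   # p
print join('b', @kept), "\n";       # pbbbb
```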

Take a look at the LIMIT parameter in the split perldoc:
http://perldoc.perl.org/functions/split.html
The relevant section is:
If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).
So to get the behavior you're expecting, try:
while(<>) {
chomp;
@data = split /a/, $_, -1;
print(join("b",@data),"\n");
}

Because after splitting paaaa, you get an array @data that has only one element, p, in it.
Maybe substitution is better:
while(<>) {
chomp;
$_=~s/a/b/g;
print($_,"\n");
}

Related

Work around for split function when last character is a terminator

I have this line of data with 20 fields:
my $data = '54243|601|0|||0|N|0|0|0|0|0||||||99582|';
I'm using this to split the data:
my @data = split ('\|'), $data;
However, instead of 20 pieces of data, you only get 19:
print scalar @data;
I could manually push an empty string onto @data if the last character is a | but I'm wondering if there is a more perlish way.
Do
my @data = split /\|/, $data, -1;
The -1 tells split to include empty trailing fields.
(Your parentheses around the regex are incorrect, and lead to $data not being considered a parameter of split. Also, with one exception, the first argument of split is always a regex, so it is better to specify it as a regex not a string that will be interpreted as a regex.)

In Perl, how can I tell split not to strip empty trailing fields?

Was trying to count the number of lines in a string of text (including empty lines). A little surprised by the behavior of split. Had expected the following to output 2 but it printed 1 on my perl 5.14.2.
$str = "hello\nworld\n\n";
@a = split(/\n/, $str);
print $#a, "\n";
Seems that split() is insensitive to consecutive \n (adding more \n's at the end of the string will not increase the printout). The only way I can get sort of close to the number of lines is
$str = "hello\nworld\n\n";
@a = split(/(\n)/, $str);
printf "%d\n", ($#a + 1)/2, "\n";
But it looks more like a workaround than a straight solution. Any ideas?
perldoc -f split:
If LIMIT is negative, it is treated as if it were instead
arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are
always preserved); if all fields are empty, then all fields are
considered to be trailing (and are thus stripped in this case).
$ perl -E 'my $x = "1\n2\n\n"; my @x = split /\n/, $x, -1; say $#x'
3
Perhaps the problem is that you are using $#a when scalar @a is what you are actually looking for?
I apologize if you are already aware of this or if this is not the issue, but $#a returns the index of the last element of @a and (scalar @a) returns the number of elements that @a contains. Since array indexing starts at 0, $#a is one less than scalar @a.
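As a rough sketch of both points together (keeping trailing fields with -1, and the difference between $#array and scalar @array), using the string from the question; the tr/// count at the end is an alternative way to count newline characters, not something the original answers proposed:

```perl
use strict;
use warnings;

my $str = "hello\nworld\n\n";

# -1 keeps the trailing empty fields: ('hello', 'world', '', '')
my @fields = split /\n/, $str, -1;

my $last_index = $#fields;         # index of the last element
my $count      = scalar @fields;   # number of elements, always $#fields + 1

# Counting the newline characters directly avoids split altogether.
my $newlines = ($str =~ tr/\n//);

print "$last_index $count $newlines\n";   # 3 4 3
```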

What does this Perl syntax mean?

I have one line of code in a pretty massive perl program that I don't understand.
map {$cycle{$_}=1} split(/\s*,\s*/,$cycle);
$cycle is a string, and there is my %cycle declared above this line. I get that the "split" part separates the string into its elements, but what are the s and the slashes for in the second part? And I don't understand the first half at all.
The first half is the really confusing part, what happens to all the elements of that split string?
I've never used Perl before. Thanks for any explanation you can give
It's a misuse of map.
Adding some whitespace it looks like
map {
$cycle{$_} = 1
} split /\s*,\s*/, $cycle;
So $cycle is split at the commas, together with any whitespace that precedes or follows them, and the corresponding element of %cycle is set to 1 for each item from the split.
It should be written
$cycle{$_} = 1 for split /\s*,\s*/, $cycle;
or perhaps
for ( split /\s*,\s*/, $cycle ) {
$cycle{$_} = 1;
}
And if you know who wrote the original code then please give them a slap from me.
split's first parameter is a regular expression used to define what separates the items to return. It is traditionally provided to split as a match operator.
/.../
is the match operator. It's the short form of
m/.../
However,
split /.../, ....
and
split m/.../, ....
actually behave as
split qr/.../, ....
qr/.../ compiles the regex pattern within and returns the compiled form in a scalar.
Operators including m// and qr// are documented in perlop.
As a regex pattern,
\s*,\s*
signifies zero or more whitespace characters (\s*), followed by a comma, followed by zero or more whitespace characters.
Regular expressions are documented in perlre.
The map is used to perform a foreach loop. When the values it returns are ignored
map BLOCK LIST
is just a weird way of writing
for (LIST) BLOCK
so
map { $cycle{$_} = 1 } split(/\s*,\s*/, $cycle);
is the same as
$cycle{$_} = 1 for split(/\s*,\s*/, $cycle);
or
for (split(/\s*,\s*/, $cycle)) {
$cycle{$_} = 1;
}
or
for my $val (split(/\s*,\s*/, $cycle)) {
$cycle{$val} = 1;
}
A more appropriate use of map would have been
my %cycle = map { $_ => 1 } split(/\s*,\s*/, $cycle);
which is equivalent to
my @anon;
for (split(/\s*,\s*/, $cycle)) {
push @anon, $_ => 1;
}
my %cycle = @anon;
though the following is more efficient:
my %cycle; ++$cycle{$_} for split(/\s*,\s*/, $cycle);
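A small runnable sketch of the map form, assuming a made-up input string (the value of $cycle and the name %seen are hypothetical, chosen only for illustration):

```perl
use strict;
use warnings;

my $cycle = 'mon , tue,wed ,  thu';   # hypothetical input

# Here map's return value is used: each item becomes a key => 1 pair.
my %seen = map { $_ => 1 } split /\s*,\s*/, $cycle;

print join(',', sort keys %seen), "\n";   # mon,thu,tue,wed
print exists $seen{tue} ? "yes\n" : "no\n";   # yes
```

The resulting hash works as a lookup table, so membership checks are a single exists test.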

What does the Perl split function return when there is no value between tokens?

I'm trying to split a string using the split function but there isn't always a value between tokens.
Ex: ABC,123,,,,,,XYZ
I don't want to skip the multiple tokens though. These values are in specific positions in the string. However, when I do a split, and then try to step through my resulting array, I get "Use of uninitialized value" warnings.
I've tried comparing the value using $splitvalues[x] eq "" and I've tried using defined($splitvalues[x]) , but I can't for the life of me figure out how to identify what the split function is putting in to my array when there is no value between tokens.
Here's the snippet of my code (now with more crunchy goodness):
my @matrixDetail = ();
#some other processing happens here that is based on matching data from the
#@oldDetail array with the first field of the @matrixLine array. If it does
#match, then I do the split
if($IHaveAMatch)
{
@matrixDetail = split(',', $matrixLine[1]);
}
else
{
@matrixDetail = ('','','','','','','');
}
}
my $newDetailString =
(($matrixDetail[0] eq '') ? $oldDetail[0] : $matrixDetail[0])
. (($matrixDetail[1] eq '') ? $oldDetail[1] : $matrixDetail[1])
.
.
.
. (($matrixDetail[6] eq '') ? $oldDetail[6] : $matrixDetail[6]);
Because these are just snippets, I've left some of the other logic out, but the if statement is inside a sub that technically returns the @matrixDetail array back. If I don't find a match in my matrix and set the array equal to the array of empty strings manually, then I get no warnings. It's only when the split populates @matrixDetail.
Also, I should mention, I've been writing code for nearly 15 years, but only very recently have I needed to work with Perl. The logic in my script is sound (or at least, it works), I'm just being anal about cleaning up my warnings and trying to figure out this little nuance.
#!perl
use warnings;
use strict;
use Data::Dumper;
my $str = "ABC,123,,,,,,XYZ";
my @elems = split ',', $str;
print Dumper \@elems;
This gives:
$VAR1 = [
'ABC',
'123',
'',
'',
'',
'',
'',
'XYZ'
];
It puts in an empty string.
Edit: Note that the documentation for split() states that "by default, empty leading fields are preserved, and empty trailing ones are deleted." Thus, if your string is ABC,123,,,,,,XYZ,,,, then your returned list will be the same as the above example, but if your string is ,,,,ABC,123, then you will have a list with four empty strings in elements 0, 1, 2, and 3 (in addition to 'ABC' and '123').
Edit 2: Try dumping out the @matrixDetail and @oldDetail arrays. It's likely that one of those isn't the length that you think it is. You might also consider checking the number of elements in those two lists before trying to use them to make sure you have as many elements as you're expecting.
I suggest to use Text::CSV from CPAN. It is a ready made solution which already covers all the weird edge cases of parsing CSV formatted files.
Delimiters with nothing between them give empty strings when split. Empty strings evaluate as false in boolean context.
If you know that your "details" input will never contain "0" (or other scalar that evaluates to false), this should work:
my @matrixDetail = split(',', $matrixLine[1]);
die if @matrixDetail > @oldDetail;
my $newDetailString = "";
for my $i (0..$#oldDetail) {
$newDetailString .= $matrixDetail[$i] || $oldDetail[$i]; # thanks canSpice
}
say $newDetailString;
(The scalars that evaluate to false in Perl are undef, the empty string, the number 0, and the string "0".)
TMTOWTDI:
$matrixDetail[$_] ||= $oldDetail[$_] for 0..$#oldDetail;
my $newDetailString = join("", @matrixDetail);
edit: for loops now go from 0 to $#oldDetail instead of $#matrixDetail since trailing ",,," are not returned by split.
edit2: if you can't be sure that real input won't evaluate as false, you could always just test the length of your split elements. This is safer, definitely, though perhaps less elegant ^_^
Empty fields in the middle will be ''. Empty fields on the end will be omitted, unless you specify a third parameter to split large enough (or -1 for all).
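The length-based test mentioned above could look roughly like this. It is a sketch only: the contents of @oldDetail and $line are invented for illustration, and it assumes the goal is to fall back to the old value whenever the new field is empty.

```perl
use strict;
use warnings;

my @oldDetail = ('a', 'b', 'c', 'd');   # hypothetical existing values
my $line      = '0,,X,';                # hypothetical matrix line

# -1 keeps trailing empty fields, so positions line up with @oldDetail.
my @matrixDetail = split /,/, $line, -1;

my $newDetailString = '';
for my $i (0 .. $#oldDetail) {
    my $new = $matrixDetail[$i];
    # Keep the new value when it is defined and non-empty, so "0" survives.
    $newDetailString .= (defined $new && length $new) ? $new : $oldDetail[$i];
}

print "$newDetailString\n";   # 0bXd
```

Unlike the || version, a field containing "0" is kept rather than replaced, and the defined check avoids "uninitialized value" warnings when the split returns fewer fields than expected.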

How do I get the length of a string in Perl?

What is the Perl equivalent of strlen()?
length($string)
perldoc -f length
length EXPR
length Returns the length in characters of the value of EXPR. If EXPR is
omitted, returns length of $_. Note that this cannot be used on an
entire array or hash to find out how many elements these have. For
that, use "scalar @array" and "scalar keys %hash" respectively.
Note the characters: if the EXPR is in Unicode, you will get the number
of characters, not the number of bytes. To get the length in
bytes, use "do { use bytes; length(EXPR) }", see bytes.
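A short sketch of the character-versus-byte distinction. It uses the core Encode module to count UTF-8 bytes rather than the bytes pragma quoted above; that substitution is a choice made for this example, and the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{e9}";   # 'cafe' with an accented final e, written as an escape

my $chars = length($string);                    # counts characters
my $bytes = length(encode('UTF-8', $string));   # counts bytes of the UTF-8 encoding

print "$chars characters, $bytes bytes\n";   # 4 characters, 5 bytes
```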
Although 'length()' is the correct answer that should be used in any sane code, Abigail's length horror should be mentioned, if only for the sake of Perl lore.
Basically, the trick consists of using the return value of the catch-all transliteration operator:
print "foo" =~ y===c; # prints 3
y///c replaces all characters with themselves (thanks to the complement option 'c'), and returns the number of characters replaced (so, effectively, the length of the string).
length($string)
The length() function:
$string ='String Name';
$size=length($string);
You shouldn't use this, since length($string) is simpler and more readable, but I came across some of these while looking through code and was confused, so in case anyone else does, these also get the length of a string:
my $length = map $_, $str =~ /(.)/gs;
my $length = () = $str =~ /(.)/gs;
my $length = split '', $str;
The first two work by using the global flag to match each character in the string, then using the returned list of matches in a scalar context to get the number of characters. The third works similarly by splitting on each character instead of regex-matching and using the resulting list in scalar context.