For what input and arguments will perl split give the result (""), if ever? - perl

To me, it looks like perl split can never give the result (""), i.e. a single-element list whose single element is the empty string. No matter what -- any input, any arguments to split. Can anyone show otherwise? And if not, is this a feature or a bug?
I wanted split to be able to for consistency, but alas:
Note that splitting an EXPR that evaluates to the empty string always
produces zero fields, regardless of the LIMIT specified.
http://perldoc.perl.org/functions/split.html
E.g.:
$ echo ""|perl -ne 'chomp;print 0+split/x/,$_,-1'
0
$ echo "x"|perl -ne 'chomp;print 0+split/x/,$_,-1'
2
$ echo "xx"|perl -ne 'chomp;print 0+split/x/,$_,-1'
3
$ echo "xxx"|perl -ne 'chomp;print 0+split/x/,$_,-1'
4

And if not, is this a feature or a bug?
Not returning an empty string is not a bug. As per the documentation,
Note that splitting an EXPR that evaluates to the empty string always produces zero fields, regardless of the LIMIT specified.
Can anyone show otherwise?
It's highly unlikely that anyone will be able to find an input for which split returns an empty string when it's documented to never return an empty string.
It sounds like you want a list of one item when the input is an empty string, so
length($_) ? split(..., $_, -1) : ""

This is a tentative answer to my own question, pending any further information/correction that may come from others:
There are no inputs and arguments to perl split for which the result will ever be a single-element list containing the empty string.
In order to get a result consistent with the promises 1) "size of result will always be one more than the number of separators (regexp matches)" and 2) "if there are no separators, the result will always be a single-element list whose element is the whole original string", what would normally be a clean function call expression
split /.../
instead needs to be wrapped as follows, including an additional auxiliary array:
@s = split /.../, $_, -1 or push @s, "";
and then @s used where the split /.../ normally would have been.
E.g.:
$ echo ""|perl -ne 'chomp;@s=split/x/,$_,-1 or push @s,"";print 0+@s'
1
$ echo "x"|perl -ne 'chomp;@s=split/x/,$_,-1 or push @s,"";print 0+@s'
2
$ echo "xx"|perl -ne 'chomp;@s=split/x/,$_,-1 or push @s,"";print 0+@s'
3
$ echo "xxx"|perl -ne 'chomp;@s=split/x/,$_,-1 or push @s,"";print 0+@s'
4
Or, alternatively, any code using bare split /.../ and relying on either of the above "promises" needs to be put inside a guard if (length) {...} and the case of length==0 handled in separate code.
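If the wrapping is needed in more than one place, it could be moved into a small helper. This is only a sketch: split_keep_empty is a made-up name, not a built-in, and it assumes the separator is passed as a qr// pattern and that a LIMIT of -1 is always wanted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): behaves like split with LIMIT -1,
# except that an empty input yields one empty field instead of zero fields.
sub split_keep_empty {
    my ($re, $str) = @_;
    my @fields = split $re, $str, -1;
    push @fields, '' unless @fields;   # only happens for the empty string
    return @fields;
}

for my $input ('', 'x', 'xx', 'xxx') {
    my @fields = split_keep_empty(qr/x/, $input);
    printf "%-5s => %d field(s)\n", "'$input'", scalar @fields;
}
```

With this, the field count is always one more than the number of separators, including for the empty string.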

Related

index argument contains . perl

If a string contains . representing any character, index doesn't match on it. What to do so that it takes . as any character?
For ex,
index($str, $substr)
if $substr contains . anywhere, index will always return -1
thanks
carol
That is not possible. The documentation says:
The index function searches for one string within another, but without
the wildcard-like behavior of a full regular-expression pattern match.
...
Keywords you can use for further searching:
perl regular expression wildcard
Update:
If you just want to know whether your string matches, a regular expression could look like this:
my $string = "Hello World!";
if( $string =~ /ll. Worl/ )
{
print "Ahoi! Position: ".($-[0])."\n";
}
This is matching a single character.
$-[0] is the offset into the string of the beginning of the entire
match.
-- http://perldoc.perl.org/perlvar.html
If you want a pattern that matches an arbitrary number of arbitrary characters, you could use a pattern like...
...
if( $string =~ /ll.*orl/ )
{
...
See perlvar for further information about Perl's special variables. You will find the variable @LAST_MATCH_START and some explanation of $-[0] there. Several more variables can help you find submatches and gather other interesting information about your matches.
From perldoc -f index, you can see index() doesn't have any regex syntax:
index STR,SUBSTR
The index function searches for one string within another, but without the wildcard-like behavior of a full regular-
expression pattern match. It returns the position of the first occurrence of SUBSTR in STR at or after POSITION. If
POSITION is omitted, starts searching from the beginning of the string. POSITION before the beginning of the string or after
its end is treated as if it were the beginning or the end, respectively. POSITION and the return value are based at 0 (or
whatever you've set the $[ variable to--but don't do that). If the substring is not found, "index" returns one less than the
base, ordinarily "-1"
A simple test:
$ perl -e 'print index("1234567asdfghj.","j.")'
13
Use regex:
# pos() reports where a /g match ended (and defaults to $_); $-[0] is where the match started
$index = $str =~ /$substr/ ? $-[0] : -1;
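To put the two approaches side by side, here is a small sketch. It reuses the test string from above; the /d.g/ pattern is just an illustration I picked. It shows that index treats the dot literally, that an unescaped dot in a regex matches any character, and that \Q...\E makes a regex behave literally again.

```perl
use strict;
use warnings;

my $str    = "1234567asdfghj.";
my $substr = "j.";

# index() compares characters literally; the dot is just a dot.
my $literal_pos = index($str, $substr);

# In a regex, an unescaped dot matches any character (except newline by default).
# $-[0] is the offset where the last successful match started.
my $regex_pos = $str =~ /d.g/ ? $-[0] : -1;

# \Q...\E quotes metacharacters, so the dot in $substr is literal again.
my $quoted_pos = $str =~ /\Q$substr\E/ ? $-[0] : -1;

print "$literal_pos $regex_pos $quoted_pos\n";   # 13 9 13
```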

Anonymous hash in perl

I am starting to learn Perl, so I am trying to read some posts here at SO. Now I came across this code https://stackoverflow.com/a/22310773/2173773 (simplified somewhat here) :
echo "1 2 3 4" | perl -lane'
$h{@F} ||= [];
print $_ for keys %h;
'
What does this code do, and why does this code print 4?
I have tried to study Perl references at http://perldoc.perl.org/perlreftut.html , but I still could not figure this out.
(I am puzzled about this line: $h{@F} ||= [].. )
The -n option (part of -lane) causes Perl to execute the given code for each individual line of input.
The -a option (when used with the -n or -p option) causes Perl to split every line of input on whitespace and store the fields in the @F variable.
$something ||= [] is equivalent to $something = $something || []; i.e., it assigns [] (a reference to an empty array) to the variable $something if & only if $something is already false or undefined.
$h{@F} is an element of the hash %h. Because this expression begins with $ (rather than @), the subscript @F is evaluated in scalar context, and scalar context for an array makes the array evaluate to its length. As the Perl code is only ever executed on the line 1 2 3 4, which is split into four elements, @F will only be four elements long, so $h{@F} is here equivalent to $h{4} (or, technically, $h{"4"}).
Thus, [] will be assigned to $h{"4"}, and as 4 is the only element of the hash %h in existence, keys %h will return a list containing only "4", and printing the elements of this list will print 4.
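The same steps can be reproduced as a standalone script. This is a sketch that fills @F by hand with split ' ', which is roughly what -a does for this input line.

```perl
use strict;
use warnings;

my @F = split ' ', "1 2 3 4";   # stand-in for what -a does with this line
my %h;

$h{@F} ||= [];                  # @F in scalar context is 4, so this sets $h{4}

print join(",", keys %h), "\n";   # 4
print ref($h{4}), "\n";           # ARRAY
```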

= and , operators in Perl

Please explain this apparently inconsistent behaviour:
$a = b, c;
print $a; # this prints: b
$a = (b, c);
print $a; # this prints: c
The = operator has higher precedence than ,.
And the comma operator throws away its left argument and returns the right one.
Note that the comma operator behaves differently depending on context. From perldoc perlop:
Binary "," is the comma operator. In
scalar context it evaluates its left
argument, throws that value away, then
evaluates its right argument and
returns that value. This is just like
C's comma operator.
In list context, it's just the list
argument separator, and inserts both
its arguments into the list. These
arguments are also evaluated from left
to right.
As eugene's answer seems to leave some questions open for the OP, I'll try to explain based on it:
$a = "b", "c";
print $a;
Here the left argument of the comma is $a = "b": because = has higher precedence than ,, the assignment is evaluated first. After that $a contains "b".
The right argument is "c", and it is what the comma operator returns, as I will show shortly.
At that point when you print $a it is obviously printing b to your screen.
$a = ("b", "c");
print $a;
Here the term ("b","c") will be evaluated first because of the higher precedence of parentheses. It returns "c" and this will be assigned to $a.
So here you print "c".
$var = ($a = "b","c");
print $var;
print $a;
Here $a contains "b" and $var contains "c".
Once you get the precedence rules this is perfectly consistent
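The three cases above can be put into one runnable sketch. The variable names are mine ($a is avoided because it is special to sort), and the 'void' warning category is switched off only to keep the output quiet.

```perl
use strict;
use warnings;
no warnings 'void';    # the discarded constants below would otherwise warn

my ($x, $y, $inner, $outer);

$x = "b", "c";                  # parsed as ($x = "b"), "c"
$y = ("b", "c");                # comma operator in scalar context: "c"
$outer = ($inner = "b", "c");   # $inner is "b", $outer is "c"
my ($first) = ("b", "c");       # list assignment instead: "b"

print "$x $y $inner $outer $first\n";   # b c b c b
```

The last line is there for contrast: with parentheses on the left-hand side the assignment is a list assignment, so the first element is stored rather than the last.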
Since eugene and mugen have answered this question nicely with good examples already, I am going to setup some concepts then ask some conceptual questions of the OP to see if it helps to illuminate some Perl concepts.
The first concept is what the sigils $ and @ mean (we won't discuss % here). @ means multiple items (said "these things"). $ means one item (said "this thing"). To get the first element of an array @a you can do $first = $a[0]; to get the last element: $last = $a[-1]. N.B. not @a[0] or @a[-1]. You can slice by doing @shorter = @longer[1,2].
The second concept is the difference between void, scalar and list context. Perl has the concept of the context in which your containers (scalars, arrays etc.) are used. An easy way to see this is that if you store a list (we will get to this) as an array, @array = ("cow", "sheep", "llama"), and then store the array as a scalar, $size = @array, we get the length of the array. We can also force this behavior by using the scalar operator, such as print scalar @array. I will say it one more time for clarity: an array (not a list) in scalar context will return not an element (as a list does) but rather the length of the array.
Remember from before that you use the $ sigil when you only expect one item, i.e. $first = $a[0]. In this way you know you are in scalar context. Now when you call $length = @array you can see clearly that you are calling the array in scalar context, and thus you trigger the special property of an array in scalar context: you get its length.
This has another nice feature for testing whether there are elements in the array. print '@array contains items' if @array; print '@array is empty' unless @array. The if/unless tests force scalar context on the array, thus the if sees the length of the array, not the elements of it. Since all numerical values are 'truthy' except zero, if the array has non-zero length, the statement if @array evaluates to true and you get the print statement.
Void context means that the return value of some operation is ignored. A useful operation in void context could be something like incrementing. $n = 1; $n++; print $n; In this example $n++ (increment after returning) was in void context in that its return value "1" wasn't used (stored, printed etc).
The third concept is the difference between a list and an array. A list is an ordered set of values, an array is a container that holds an ordered set of values. You can see the difference for example in the gymnastics one must do to get a particular element after using sort without storing the result first (try pop sort { $a cmp $b } @array for example, which doesn't work because pop does not act on a list, only an array).
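A short sketch of these three concepts, using the animal array from above. The list slice at the end is one way around the pop sort problem; the variable names are just for illustration.

```perl
use strict;
use warnings;

my @array = ("cow", "sheep", "llama");

my $size   = @array;                             # scalar context: length of the array
my $status = @array ? "has items" : "is empty";  # the condition also imposes scalar context

# pop needs a real array, but a list can be sliced directly:
my $last_sorted = (sort { $a cmp $b } @array)[-1];

print "$size $status $last_sorted\n";   # 3 has items sheep
```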
Now we can ask, when you attempt your examples, what would you want Perl to do in these cases? As others have said, this depends on precedence.
In your first example, since the = operator has higher precedence than the ,, you haven't actually assigned a list to the variable, you have done something more like ($a = "b"), ("c") which effectively does nothing with the string "c". In fact it was called in void context. With warnings enabled, since this operation does not accomplish anything, Perl attempts to warn you that you probably didn't mean to do that with the message: Useless use of a constant in void context.
Now, what would you want Perl to do when you attempt to store a list to a scalar (or use a list in a scalar context)? It will not store the length of the list, this is only a behavior of an array. Therefore it must store one of the values in the list. While I know it is not canonically true, this example is very close to what happens.
my @animals = ("cow", "sheep", "llama");
my $return;
foreach my $animal (@animals) {
$return = $animal;
}
print $return;
And therefore you get the last element of the list (the canonical difference is that the preceding values were never stored then overwritten, however the logic is similar).
There are ways to store something that looks like a list in a scalar, but this involves references. Read more about that in perldoc perlreftut.
Hopefully this makes things a little more clear. Finally I will say, until you get the hang of Perl's precedence rules, it never hurts to put in explicit parentheses for lists and function's arguments.
There is an easy way to see how Perl handles both of the examples, just run them through with:
perl -MO=Deparse,-p -e'...'
As you can see, the difference is because the order of operations is slightly different than you might suspect.
perl -MO=Deparse,-p -e'$a = a, b;print $a'
(($a = 'a'), '???');
print($a);
perl -MO=Deparse,-p -e'$a = (a, b);print $a'
($a = ('???', 'b'));
print($a);
Note: you see '???', because the original value got optimized away.

What does the Perl split function return when there is no value between tokens?

I'm trying to split a string using the split function but there isn't always a value between tokens.
Ex: ABC,123,,,,,,XYZ
I don't want to skip the multiple tokens though. These values are in specific positions in the string. However, when I do a split, and then try to step through my resulting array, I get "Use of uninitialized value" warnings.
I've tried comparing the value using $splitvalues[x] eq "" and I've tried using defined($splitvalues[x]) , but I can't for the life of me figure out how to identify what the split function is putting in to my array when there is no value between tokens.
Here's the snippet of my code (now with more crunchy goodness):
my @matrixDetail = ();
#some other processing happens here that is based on matching data from the
#@oldDetail array with the first field of the @matrixLine array. If it does
#match, then I do the split
if($IHaveAMatch)
{
@matrixDetail = split(',', $matrixLine[1]);
}
else
{
@matrixDetail = ('','','','','','','');
}
my $newDetailString =
(($matrixDetail[0] eq '') ? $oldDetail[0] : $matrixDetail[0])
. (($matrixDetail[1] eq '') ? $oldDetail[1] : $matrixDetail[1])
.
.
.
. (($matrixDetail[6] eq '') ? $oldDetail[6] : $matrixDetail[6]);
Because this is just snippets, I've left some of the other logic out, but the if statement is inside a sub that technically returns the @matrixDetail array back. If I don't find a match in my matrix and set the array equal to the array of empty strings manually, then I get no warnings. It's only when the split populates the @matrixDetail.
Also, I should mention, I've been writing code for nearly 15 years, but only very recently have I needed to work with Perl. The logic in my script is sound (or at least, it works), I'm just being anal about cleaning up my warnings and trying to figure out this little nuance.
#!perl
use warnings;
use strict;
use Data::Dumper;
my $str = "ABC,123,,,,,,XYZ";
my @elems = split ',', $str;
print Dumper \@elems;
This gives:
$VAR1 = [
'ABC',
'123',
'',
'',
'',
'',
'',
'XYZ'
];
It puts in an empty string.
Edit: Note that the documentation for split() states that "by default, empty leading fields are preserved, and empty trailing ones are deleted." Thus, if your string is ABC,123,,,,,,XYZ,,,, then your returned list will be the same as the above example, but if your string is ,,,,ABC,123, then you will have a list with four empty strings in elements 0 through 3 (in addition to 'ABC' and '123').
Edit 2: Try dumping out the @matrixDetail and @oldDetail arrays. It's likely that one of those isn't the length that you think it is. You might also consider checking the number of elements in those two lists before trying to use them to make sure you have as many elements as you're expecting.
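A quick way to check the leading/trailing behaviour is to count the fields. This is a sketch: the strings are built with the x operator so the comma counts are explicit, and the third call passes a LIMIT of -1, which keeps trailing empty fields.

```perl
use strict;
use warnings;

my $trailing = "ABC,123" . ("," x 6) . "XYZ" . ("," x 3);
my $leading  = ("," x 4) . "ABC,123";

my @default  = split /,/, $trailing;       # trailing empty fields are dropped
my @keep_all = split /,/, $trailing, -1;   # negative LIMIT keeps them
my @lead     = split /,/, $leading;        # leading empty fields are preserved

print scalar(@default), "\n";    # 8
print scalar(@keep_all), "\n";   # 11
print scalar(@lead), "\n";       # 6
```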
I suggest to use Text::CSV from CPAN. It is a ready made solution which already covers all the weird edge cases of parsing CSV formatted files.
delims with nothing between them give empty strings when split. Empty strings evaluate as false in boolean context.
If you know that your "details" input will never contain "0" (or other scalar that evaluates to false), this should work:
my @matrixDetail = split(',', $matrixLine[1]);
die if @matrixDetail > @oldDetail;
my $newDetailString = "";
for my $i (0..$#oldDetail) {
$newDetailString .= $matrixDetail[$i] || $oldDetail[$i]; # thanks canSpice
}
say $newDetailString;
(there are probably other scalars besides empty string and zero that evaluate to false but I couldn't name them off the top of my head.)
TMTOWTDI:
$matrixDetail[$_] ||= $oldDetail[$_] for 0..$#oldDetail;
my $newDetailString = join("", @matrixDetail);
edit: for loops now go from 0 to $#oldDetail instead of $#matrixDetail since trailing ",,," are not returned by split.
edit2: if you can't be sure that real input won't evaluate as false, you could always just test the length of your split elements. This is safer, definitely, though perhaps less elegant ^_^
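A sketch of that length-based variant follows. The sample data is made up for illustration (seven old values, and a new line with a legitimate "0" in it), and -1 is passed to split so that trailing empty fields are kept.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original script.
my @oldDetail   = qw(a b c d e f g);
my $matrix_line = join ',', 'X', '', '0', '', '', '', '';

my @matrixDetail = split /,/, $matrix_line, -1;

my $newDetailString = '';
for my $i (0 .. $#oldDetail) {
    my $new = $matrixDetail[$i];
    # Fall back to the old value only when the new one is missing or empty,
    # so a real "0" survives.
    $newDetailString .= (defined($new) && length($new)) ? $new : $oldDetail[$i];
}

print "$newDetailString\n";   # Xb0defg
```

The defined check also covers the case where split returned fewer fields than @oldDetail has, which is what triggers the "uninitialized value" warnings.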
Empty fields in the middle will be ''. Empty fields on the end will be omitted, unless you specify a third parameter to split large enough (or -1 for all).

Why does Perl's shift complain 'Type of arg 1 to shift must be array (not grep iterator).'?

I've got a data structure that is a hash that contains an array of hashes. I'd like to reach in there and pull out the first hash that matches a value I'm looking for. I tried this:
my $result = shift grep {$_->{name} eq 'foo'} @{$hash_ref->{list}};
But that gives me this error: Type of arg 1 to shift must be array (not grep iterator). I've re-read the perldoc for grep and I think what I'm doing makes sense. grep returns a list, right? Is it in the wrong context?
I'll use a temporary variable for now, but I'd like to figure out why this doesn't work.
A list isn't an array.
my ($result) = grep {$_->{name} eq 'foo'} @{$hash_ref->{list}};
… should do the job though. Take the return from grep in list context, but don't assign any of the values other than the first.
I think a better way to write this would be this:
use List::Util qw/first/;
my $result = first { $_->{name} eq 'foo' } @{ $hash_ref->{list} };
Not only will it be more clear what you're trying to do, it will also be faster because it will stop grepping your array once it has found the matching element.
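Here is a self-contained sketch comparing the two. The data structure is invented for illustration (the id key is not from the question), and the last line shows what grep gives in scalar context, which is a count rather than an element.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Hypothetical data in the shape described in the question.
my $hash_ref = {
    list => [
        { name => 'bar', id => 1 },
        { name => 'foo', id => 2 },
        { name => 'foo', id => 3 },
    ],
};

my $via_first  = first { $_->{name} eq 'foo' } @{ $hash_ref->{list} };
my ($via_grep) = grep  { $_->{name} eq 'foo' } @{ $hash_ref->{list} };
my $count      = grep  { $_->{name} eq 'foo' } @{ $hash_ref->{list} };

print "$via_first->{id} $via_grep->{id} $count\n";   # 2 2 2
```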
Another way to do it:
my $result = (grep {$_->{name} eq 'foo'} @{$hash_ref->{list}})[0];
Note that the curlies around the first argument to grep are redundant in this case, so you can avoid block setup and teardown costs with
my $result = (grep $_->{name} eq 'foo', @{$hash_ref->{list}})[0];
“List value constructors” in perldata documents subscripting of lists:
A list value may also be subscripted like a normal array. You must put the list in parentheses to avoid ambiguity. For example:
# Stat returns list value.
$time = (stat($file))[8];
# SYNTAX ERROR HERE.
$time = stat($file)[8]; # OOPS, FORGOT PARENTHESES
# Find a hex digit.
$hexdigit = ('a','b','c','d','e','f')[$digit-10];
# A "reverse comma operator".
return (pop(@foo),pop(@foo))[0];
As I recall, we got this feature when Randal Schwartz jokingly suggested it, and Chip Salzenberg—who was a patching machine in those days—implemented it that evening.
Update: A bit of searching shows the feature I had in mind was $coderef->(@args). The commit message even logs the conversation!