split does not return empty elements - perl

Why do these not all return bbb?
$ perl -e '$a="  "; print map { "b" } split / /, $a;'
<<nothing>>
$ perl -e '$a=",,"; print map { "b" } split /,/, $a;'
<<nothing>>
$ perl -e '$a="  a"; print map { "b" } split / /, $a;'
bbb
$ perl -e '$a=",,a"; print map { "b" } split /,/, $a;'
bbb
I would have expected split to return an array with 3 elements in all cases.
$ perl -V
Summary of my perl5 (revision 5 version 24 subversion 1) configuration:

split's third parameter says how many elements to produce:
split /PATTERN/,EXPR,LIMIT
...
If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).
It defaults to 0, which means as many as possible but leaving off any trailing empty elements.
You can pass -1 as the third argument to split to suppress this behavior.
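As a minimal sketch of the LIMIT behaviour quoted above, the following counts the fields split returns with the default limit and with -1. The helper name count_fields and the sample strings are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Count the fields split returns for a given string and LIMIT.
# count_fields is a hypothetical helper used only for this illustration.
sub count_fields {
    my ($string, $limit) = @_;
    my @fields = split /,/, $string, $limit;
    return scalar @fields;
}

# ",,"  : 0 fields by default (all fields empty, so all are stripped), 3 with -1
# ",,a" : 3 fields either way (leading empty fields are always kept)
# "a,," : 1 field by default (trailing empties stripped), 3 with -1
for my $string (",,", ",,a", "a,,") {
    printf "%-6s default: %d  limit -1: %d\n",
        qq{"$string"}, count_fields($string, 0), count_fields($string, -1);
}
```

Passing 0 explicitly behaves the same as omitting LIMIT, which is why the first column reproduces the results from the question.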

Related

In Perl, substitution operator for removing space is removing value 0 when removing space

We have code where the input could be either a single value or comma-separated values. We need to remove any spaces before and after each value.
We are doing it as below:
my @var_1 = split /,/,$var;
print "print_1 : @var_1 \n ";
@var_1 = grep {s/^\s+|\s+$//g; $_ } @var_1;
print "print_2 : @var_1 \n ";
$var contains the input value. If $var is 0, print_1 prints the value 0 but print_2 prints nothing. Our requirement was just to remove spaces before and after the value 0. If $var is 1, both print_1 and print_2 correctly print the value 1. If we give the input 1,0, it removes the 0 and prints only the value 1 in print_2.
I am not sure why it is removing the value 0. Is there any correction that can be made to the substitution operator so that it does not remove the value 0?
Thanks in advance!!!
In Perl, only a few distinct values are false. These are primarily
undef
the integer 0
the unsigned integer 0
the floating point number 0
the string 0
the empty string ""
You've got the empty string variant and 0 here.
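As a rough illustration of that list, the sketch below tests a few sample values (chosen for this example, not taken from the question) both by plain truthiness and by an explicit comparison with the empty string.

```perl
use strict;
use warnings;

# Sample values of the kind that come out of split-and-trim code.
my @candidates = ("foo", "0", 0, "", "0.0", "00", " ");

# A bare $_ test keeps only true values, so "0", 0 and "" are all dropped.
# The strings "0.0", "00" and " " are true, because only the exact
# string "0" and the empty string are false.
my @truthy = grep { $_ } @candidates;

# An explicit string comparison drops only the empty string.
my @non_empty = grep { $_ ne "" } @candidates;

# Prints: 4 true values, 6 non-empty values
print scalar(@truthy), " true values, ", scalar(@non_empty), " non-empty values\n";
```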
@var_1 = grep {s/^\s+|\s+$//g; $_ } @var_1;
This code can go in three ways:
$_ gets cleaned up and becomes foo. We want it to pass.
$_ gets cleaned up and becomes 0. We want it to pass.
$_ gets cleaned up and becomes the empty string "". We want it to fail.
But 0 is false, and grep only lets an element through if the last statement in its block is true, so 0 gets dropped as well. Dropping is what we want for the empty string "", but not for 0.
@var_1 = grep {s/^\s+|\s+$//g; $_ ne "" } @var_1;
Now that we check explicitly that the cleaned up value is not the empty string "", zero 0 is allowed.
Here's a complete version with cleaned up variable names (naming is important!).
my $input = q{foo, bar, 1 23 ,,0};
my @values = split /,/,$input;
print "print_1 : @values \n ";
@values = grep {s/^\s+|\s+$//g; $_ ne q{} } @values;
print "print_2 : @values \n ";
The output is:
print_1 : foo  bar  1 23   0
print_2 : foo bar 1 23 0
Note that your grep is not the optimal solution. As always, there is more than one way to do it in Perl. The for loop that Сухой27 suggests in their answer is way more concise and I would go with that.
If you want to split on commas and remove leading and trailing whitespace from each of the resulting strings, that translates pretty literally into code:
my @var = map s/^\s+|\s+\z//gr, split /,/, $var, -1;
/r makes s/// return the result of the substitution (requires perl 5.14+). -1 on the split is required to keep it from ignoring trailing empty fields.
If there are no zero length entries (so not e.g. a,,b), you can just extract what you want (sequences of non-commas that don't start or end with whitespace) directly from the string instead of first splitting it:
@var = $var =~ /(?!\s)[^,]+(?<!\s)/g;
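To compare the two approaches side by side, here is a small sketch run on a sample string (the input is an assumption for illustration). It requires perl 5.14+ for /r.

```perl
use strict;
use warnings;

my $var = "foo, bar, 1 23 ,,0";

# Split first (keeping trailing empty fields), then trim each field.
my @split_then_trim = map { s/^\s+|\s+\z//gr } split /,/, $var, -1;

# Extract runs of non-commas that neither start nor end with whitespace.
my @extracted = $var =~ /(?!\s)[^,]+(?<!\s)/g;

print join("|", @split_then_trim), "\n";   # foo|bar|1 23||0
print join("|", @extracted), "\n";         # foo|bar|1 23|0
```

The only difference in the output is the empty field between the two adjacent commas, which the extraction approach cannot produce.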
You want
@var_1 = map { my $v = s/^\s+|\s+$//gr; length($v) ? $v : () } @var_1;
instead of
@var_1 = grep {s/^\s+|\s+$//g; $_ } @var_1;
grep is used for filtering list elements, and all false values are filtered out (including '', 0, and undef).
I suggest cleaning up the array using map with a regex pattern that matches from the first to the last non-space character.
There's also no need to do the split operation separately.
Like this:
my @var_1 = map { / ( \S (?: .* \S )? ) /x } split /,/, $var;
Note that this method removes empty fields. It's unclear whether that was required or not.
You can also use
@values = map {$_ =~ s/^\s+|\s+$//gr } @values;
or, even more concisely,
@values = map {s/^\s+|\s+$//gr } @values;
to remove spaces in your array.
Do not forget the r, as it is the non-destructive option; otherwise each string is replaced by the number of substitutions made.
That said, it will only work if you use Perl 5.14 or higher,
some documentation here ;)
https://www.perl.com/pub/2011/05/new-features-of-perl-514-non-destructive-substitution.html
I think this syntax is easier to understand since it is closer to the "usual" method of substitution.
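A small sketch of what the r flag changes, using sample values assumed for this illustration: with /r the map returns the trimmed strings, without it the map returns the return value of s///, which is a substitution count.

```perl
use strict;
use warnings;

my @raw = (" foo ", "bar ", "0");

# With /r, s/// returns the modified copy and leaves the original untouched.
my @trimmed = map { s/^\s+|\s+$//gr } @raw;

# Without /r, s/// returns the number of substitutions made (or an empty
# string when nothing matched) and modifies the element in place, so it is
# run here on a copy to keep @raw intact.
my @copy   = @raw;
my @counts = map { s/^\s+|\s+$//g } @copy;

print join("|", @trimmed), "\n";   # foo|bar|0
print join("|", @counts), "\n";    # 2|1|
```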

Can "perl -a" somehow re-join @F using the original whitespace?

My input has a mix of tabs and spaces for readability. I want to modify a field using perl -a, then print out the line in its original form. (The data is from findup, showing me a count of duplicate files and the space they waste.) Input is:
2 * 4096 backup/photos/photo.jpg photos/photo.jpg
2 * 111276032 backup/books/book.pdf book.pdf
The output would convert field 3 to kilobytes, like this:
2 * 4 KB backup/photos/photo.jpg photos/photo.jpg
2 * 108668 KB backup/books/book.pdf book.pdf
In my dream world, this would be my code, since I could just will perl to automatically recombine @F and preserve the original whitespace:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print;'
In real life, joining with a single space seems like my only option:
perl -lanE '$F[2]=int($F[2]/1024)." KB"; print join(" ", @F);'
Is there any automatic variable which remembers the delimiters? If I had a magic array like that, the code would be:
perl -lanE 'BEGIN{use List::Util "reduce";} $F[2]=int($F[2]/1024)." KB"; print reduce { $a . shift(@magic) . $b } @F;'
No, there is no such magic object. You can do it by hand though
perl -wnE'@p = split /(\s+)/; $p[4] = int($p[4]/1024); print @p' input.txt
The capturing parens in split's pattern mean that the separators are also returned, so you catch the exact whitespace. Since the whitespace runs are in the array too, the value we want is now the fifth element, $p[4].
As it turns out, -F has this same property (thanks to Сухой27):
perl -F'(\s+)' -lanE'$F[4] = int($F[4]/1024); say @F' input.txt
Note: with 5.20.0 "-F now implies -a and -a implies -n". Thanks to ysth.
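The capturing-split idea can also be written as a small script. This is only a sketch: the sub name bytes_to_kb_in_place is made up for illustration, and it assumes the line has no leading whitespace (a leading run would add an empty first field and shift the indices by two).

```perl
use strict;
use warnings;

# Convert the third whitespace-separated field from bytes to kilobytes
# while keeping every run of whitespace exactly as it appeared.
# bytes_to_kb_in_place is a hypothetical name for this illustration.
sub bytes_to_kb_in_place {
    my ($line) = @_;
    # Capturing parentheses make split return the separators too, so the
    # list alternates field, separator, field, separator, ...
    my @parts = split /(\s+)/, $line;
    # Fields sit at even indices, so the third field is at index 4.
    $parts[4] = int($parts[4] / 1024) . " KB";
    return join "", @parts;
}

# The tab and the double space in the input survive in the output.
print bytes_to_kb_in_place("2 *\t4096  backup/photos/photo.jpg\tphotos/photo.jpg"), "\n";
```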
You could just find the correct part of the line and modify it:
perl -wpE's/^\s*+(?>\S+\s+){2}\K(\S+)/int($1\/1024) . " KB"/e'

Perl - "/" causing issues for splitting by comma

I'm trying to split a file by ",". It is a CSV file.
However, one "column" has values that includes "/" and spaces. And it seems to freak out with that column and does not print anything after that column but moves on to the next row.
My code is simply:
perl -lane '@values = split(",",$F[0]); print $values[0]."\t".$values[3];' basefile.txt > newfile.txt
The basefile.txt looks like:
"1","text","abc // 123 /// some more text // text","filename1"
"2","text","abc // 123 /// some more text // text","filename2"
"3","text","abc // 123 /// some more text // text","filename3"
My newfile.txt should have an output of:
"1","filename1"
"2","filename2"
"3","filename3"
Instead I get:
"1",
"2",
"3",
Thanks!
It's not the / that is confusing perl here, it's the spaces combined with the -a flag. Try:
perl -lne '@values = split(",",$_); print $values[0]."\t".$values[3]' basefile
Or, better yet, use Text::CSV_XS to do the splitting.
It's not the '/', it's the spaces.
The -a flag causes perl to split each line of input and put the fields into the variable @F. The delimiter for this split operation is whitespace, unless you override it with the -Fdelimiter option on the command line, too.
So for the input
"1","text","abc // 123 /// some more text // text","filename"
with the -lan flags specified, perl sets
$F[0] = '"1","text","abc';
$F[1] = '//';
$F[2] = '123';
$F[3] = '///';
$F[4] = 'some';
etc.
It seems like you just want to do your split operation on the whole line. In which case you should stop using the -a flag and just say
@values = split(",",$_); ...
or leverage the -a and -F... options and say
perl -F/,/ -lane '@values=@F; ...'
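To make the difference concrete, here is a sketch that imitates in plain Perl what -a does to one line of the sample file and compares it with a comma split of the whole line. The variable names are chosen for this illustration. A plain comma split still breaks on commas inside quoted fields, which is the case a CSV module handles.

```perl
use strict;
use warnings;

my $line = '"1","text","abc // 123 /// some more text // text","filename1"';

# Roughly what -a does by default: split the line on whitespace.
my @auto_fields = split ' ', $line;

# What the question's code then did: split only the first of those on commas.
my @from_first_field = split /,/, $auto_fields[0];

# Splitting the whole line on commas instead yields all four columns.
my @from_whole_line = split /,/, $line;

# Prints: 3 vs 4
print scalar(@from_first_field), " vs ", scalar(@from_whole_line), "\n";
print "$from_whole_line[0]\t$from_whole_line[3]\n";
```

With only three elements in the first case, $values[3] is undefined, which is why nothing was printed after the tab.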

In Perl, how can I tell split not to strip empty trailing fields?

Was trying to count the number of lines in a string of text (including empty lines). A little surprised by the behavior of split. Had expected the following to output 2 but it printed 1 on my perl 5.14.2.
$str = "hello\nworld\n\n";
@a = split(/\n/, $str);
print $#a, "\n";
It seems that split() is insensitive to consecutive \n (adding more \n's at the end of the string does not increase the printout). The only way I can get sort of close to the number of lines is
$str = "hello\nworld\n\n";
@a = split(/(\n)/, $str);
printf "%d\n", ($#a + 1)/2, "\n";
But it looks more like a workaround than a straight solution. Any ideas?
perldoc -f split:
If LIMIT is negative, it is treated as if it were instead
arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are
always preserved); if all fields are empty, then all fields are
considered to be trailing (and are thus stripped in this case).
$ perl -E 'my $x = "1\n2\n\n"; my @x = split /\n/, $x, -1; say $#x'
3
Perhaps the problem is that you are using $#a when scalar @a is what you are actually looking for?
I apologize if you are already aware of this or if this is not the issue, but $#a returns the index of the last element of @a and (scalar @a) returns the number of elements that @a contains. Since array indexing starts at 0, $#a is one less than scalar @a.
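Both points (the LIMIT of -1 and the difference between $#a and scalar @a) can be checked together with the string from the question. This is a minimal sketch; the array names are assumptions for illustration.

```perl
use strict;
use warnings;

my $str = "hello\nworld\n\n";

my @default   = split /\n/, $str;       # trailing empty fields stripped
my @unlimited = split /\n/, $str, -1;   # trailing empty fields kept

# $#array is the last index; scalar(@array) is the element count.
printf "default:   last index %d, count %d\n", $#default,   scalar @default;     # 1, 2
printf "unlimited: last index %d, count %d\n", $#unlimited, scalar @unlimited;   # 3, 4
```

With -1 the empty string after the final newline counts as a field too, so for a string that ends in a newline the number of lines is one less than the element count, which here happens to equal $#unlimited.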

Anonymous hash in perl

I am starting to learn Perl, so I am trying to read some posts here at SO. Now I came across this code https://stackoverflow.com/a/22310773/2173773 (simplified somewhat here) :
echo "1 2 3 4" | perl -lane'
$h{@F} ||= [];
print $_ for keys %h;
'
What does this code do, and why does this code print 4?
I have tried to study Perl references at http://perldoc.perl.org/perlreftut.html , but I still could not figure this out.
(I am puzzled about this line: $h{@F} ||= [] )
The -n option (part of -lane) causes Perl to execute the given code for each individual line of input.
The -a option (when used with the -n or -p option) causes Perl to split every line of input on whitespace and store the fields in the @F variable.
$something ||= [] is equivalent to $something = $something || []; i.e., it assigns [] (a reference to an empty array) to the variable $something if & only if $something is already false or undefined.
$h{@F} is an element of the hash %h. Because this expression begins with $ (rather than @), the subscript @F is evaluated in scalar context, and scalar context for an array makes the array evaluate to its length. As the Perl code is only ever executed on the line 1 2 3 4, which is split into four elements, @F will only be four elements long, so $h{@F} is here equivalent to $h{4} (or, technically, $h{"4"}).
Thus, [] will be assigned to $h{"4"}, and as 4 is the only element of the hash %h in existence, keys %h will return a list containing only "4", and printing the elements of this list will print 4.
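The same behaviour can be reproduced in a standalone script. In this sketch @fields stands in for @F, and the hash slice at the end is added only for contrast; it is not part of the original one-liner.

```perl
use strict;
use warnings;

my @fields = split ' ', "1 2 3 4";   # stands in for what -a puts into @F

# Scalar element access: the subscript is evaluated in scalar context,
# so the array yields its length and the key becomes "4".
my %h;
$h{@fields} ||= [];

# Hash slice, for contrast: the subscript is evaluated in list context,
# so every field becomes its own key.
my %slice;
@slice{@fields} = ();

print join(",", sort keys %h), "\n";       # 4
print join(",", sort keys %slice), "\n";   # 1,2,3,4
```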