In Perl, how can I tell split not to strip empty trailing fields? - perl

Was trying to count the number of lines in a string of text (including empty lines). A little surprised by the behavior of split. Had expected the following to output 2 but it printed 1 on my perl 5.14.2.
$str = "hello\
world\n\n";
#a = split(/\n/, $str);
print $#a, "\n";
Seems that split() is insensitive to consecutive \n (add more \n's at the end of the string will not increase the printout). The only I can get it sort of close to giving the number of lines is
$str = "hello\
world\n\n";
#a = split(/(\n)/, $str);
printf "%d\n", ($#a + 1)/2, "\n";
But it looks more like a workaround than a straight solution. Any ideas?

perldoc -f split:
If LIMIT is negative, it is treated as if it were instead
arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are
always preserved); if all fields are empty, then all fields are
considered to be trailing (and are thus stripped in this case).
$ perl -E 'my $x = "1\n2\n\n"; my #x = split /\n/, $x, -1; say $#x'
3

Perhaps the problem is that you are using $#a when scalar #a is what you are actually looking for?
I apologize if you are already aware of this or if this is not the issue, but $#a returns the index of the last element of #a and (scalar #a) returns the number of elements that #a contains. Since array indexing starts at 0, $#a is one less than scalar #a.

Related

How to make initialized value with in #array in regrep compilation

i try to run this code lines,
#checkpoint = split (/\s+/,$array_lcp[0]);
$i=scalar #checkpoint
print NAME " $checkpoint[0] ";
for ($k=0; $k<=i; $k++)
{
if ($array_ARGVTEMP[$d] =~ m/$checkpoint[$k]/i)
{
#array = split (/\s+/,$array_ARGVTEMP[$d]);
print NAME " $checkpoint[$k]| $k||
$checkpoint[0]||| $checkpoint[1] ||||$checkpoint[2]||||| ";
} }
but in the result warnings, it said that:
"Use of uninitialized value within #checkpoint in regexp compilation at new3.pl line 64 (#2)" and line 64 is " if($array_ARGVTEMP[$d] =~ m/$checkpoint[$k]/i) "
Please help me, Thank you
#checkpoint has $i elements numbered 0 to $i-1, but you're accessing the element at index $i.
for (my $k=0; $k<=$i; $k++)
should be
for (my $k=0; $k<$i; $k++)
Actually, it should be
for my $k (0..$#checkpoints)
Actually, it should be
for my $checkpoint (#checkpoints)
Update The question changed. Originally the loop went to $k<=2 (array size wasn't mentioned), what this answer addressed. The main point remains and now it is clear that the loop goes up to the index equal to the array size, so one past the fence; the limit should be $k < $i. Thus "value of $k for which there is no element" mentioned below is the last one looped over.
The $checkpoint[$k], that draws the warning in the regex, are elements of the array #checkpoint with indices 0, 1, 2 -- what $k is in the loop.
The "uninitialized value within..." means that the array #checkpoint doesn't actually have all those elements, so for a value of $k for which there is no element the regex attempts to retrieve an undefined value and complains.
The first split likely returned fewer than three elements. Print out the #checkpoint to see.
A few more comments
Please always have use warnigs; and use strict; at the beginning of a program
Use lexical filehandles, so open files with open my $name, ... (not open NAME, ...)
To loop over numbers from a range a nice way is
for my $k (0..2) { ... }
(update) ... but the question changed, with the loop (intended to be) over all array elements and then there is no reason to use the index. Iterate directly over elements
foreach my $checkpoint (#checkpoints) { .... }
Whenever you use \s+ for the separator pattern in split most likely you should be using a special pattern ' ', which splits by \s+ and also disregards leading and trailing space.

Learning Perl, One-Liner behaves differently

UPDATE
As pointed out in the answer, this question really has to do with Scalar versus List Context in Perl.
## ## ##
I am learning perl via self-taught crash course (primarily with the Llama book and the web). In attempting some byte swap code, I have found a one liner I do not understand completely. A comment in the script explains what I think is happening in the one-liner.
#!/usr/bin/perl --
#
# Script to print byte-swapped hex values
#
use 5.010;
use warnings;
use strict;
# NOTE: I realize I could use a single variable 'data', but the x- y- z- prefixes may help in
# identification (for clarity) in the code for this SO question.
my $xData;
my $yData;
my $zData;
for ( my $ijk = 998; $ijk < 1001; $ijk++ )
{
printf ( "\n%4d is hex " . (sprintf "0x%04X", $ijk) . "\n", $ijk );
# with byte swap
say "These numbers (bytes swapped) should match...";
# do sprintf, match pattern and store to ($1)($2), now reverse them into ($2)($1).
# BindOp leaves $_ alone, match stuffs $_ & is then used as input for reverse, prints.
say reverse ((sprintf "%04X", $ijk) =~ /(..)(..)/) ; # from perlmonks' webpage
$yData = (reverse ((sprintf "%04X", $ijk) =~ /(..)(..)/) );
say $yData; # does NOT match
$xData = sprintf "%04X", $ijk;
$xData =~ s/(..)(..)/$2$1/ ;
say $xData; # does match
$_ = sprintf "%04X", $ijk;
/(..)(..)/;
$zData = $2 . $1 ;
say $zData; # does match
}
exit 0;
OUTPUT:
998 is hex 0x03E6 These numbers (bytes swapped) should match...
E603
6E30
E603
E603
999 is hex 0x03E7 These numbers (bytes swapped) should match...
E703
7E30
E703
E703
1000 is hex 0x03E8 These numbers (bytes swapped) should match...
E803
8E30
E803
E803
Why does the one liner work, and why doesn't the $yData perform the same way? I'm pretty sure I understand why $xData and $zData work, but I would expect $yData to be the closest equivalent non-one-liner. What is the closest equivalent non-one-liner and why? Where is the discrepancy?
The reverse in your print (say) statement comes in the list context, while when assigned to $yData the context is scalar. This function (may) behave considerably differently based on the context.
From perldoc -f reverse
reverse LIST
In list context, returns a list value consisting of the elements of LIST in the opposite order. In scalar context, concatenates the elements of LIST and returns a string value with all characters in the opposite order.
In this case this produces different results.
When called in list context, it swaps the (two) input bytes, keeping each byte intact (represented by two hexadecimal digits matched in a group). When called in scalar context, it joins the input and returns a character string, running in the opposite order. Taken to represent a hex number this would have each byte changed.

Odd behavior with split in perl when result includes empty strings

This is my perl 5.16 code
while(<>) {
chomp;
#data = split /a/, $_;
print(join("b",#data),"\n");
}
If I input a file with this in it:
paaaa
paaaaq
I get
p
pbbbbq
But I was expecting
pbbbb
pbbbbq
Why am I wrong to expect the latter behavior?
It is documented that trailing empties are removed unless you specify a third, non-zero argument.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved)
You want
split /a/, $_, -1;
Take a look at the LIMIT parameter in the split perldoc:
http://perldoc.perl.org/functions/split.html
The relevant section is:
If LIMIT is negative, it is treated as if it were instead arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually treated as if it were instead negative but with the exception that trailing empty fields are stripped (empty leading fields are always preserved); if all fields are empty, then all fields are considered to be trailing (and are thus stripped in this case).
So to get the behavior you're expecting, try:
while(<>) {
chomp;
#data = split /a/, $_, -1;
print(join("b",#data),"\n");
}
Because after splitting paaaa , you got an array #data that has only one elemet p in it.
Maybe substitution is better:
while(<>) {
chomp;
$_=~s/a/b/g;
print($_,"\n");
}

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I havenĀ“t been able to make it work. Eventually i've found a way that works:
my #atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (#atoms,$atomi);
$natoms++;
}
#atoms = sort {$b cmp $a} #atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that sometimes input files are not always identical and they make contain white spaces before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated on the split command and I'm having some trouble to remove it. This is what I have done:
#info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (#info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array #info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my #atoms = sort {$b cmp $a} keys %$atome;
my $natoms = #atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my #atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { /$atom_id/ } #atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}
Lets warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my #atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar #atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in #atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
for your second question you could use this line:
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read alle the lines of the file and use them in a hash with $info[0] as the key and [#info[1..8]] as the value. Then you can lookup the entries by your index.
my %details;
while (<IN_LIG>) {
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ #info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).
thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys

How do I get the length of a string in Perl?

What is the Perl equivalent of strlen()?
length($string)
perldoc -f length
length EXPR
length Returns the length in characters of the value of EXPR. If EXPR is
omitted, returns length of $_. Note that this cannot be used on an
entire array or hash to find out how many elements these have. For
that, use "scalar #array" and "scalar keys %hash" respectively.
Note the characters: if the EXPR is in Unicode, you will get the num-
ber of characters, not the number of bytes. To get the length in
bytes, use "do { use bytes; length(EXPR) }", see bytes.
Although 'length()' is the correct answer that should be used in any sane code, Abigail's length horror should be mentioned, if only for the sake of Perl lore.
Basically, the trick consists of using the return value of the catch-all transliteration operator:
print "foo" =~ y===c; # prints 3
y///c replaces all characters with themselves (thanks to the complement option 'c'), and returns the number of character replaced (so, effectively, the length of the string).
length($string)
The length() function:
$string ='String Name';
$size=length($string);
You shouldn't use this, since length($string) is simpler and more readable, but I came across some of these while looking through code and was confused, so in case anyone else does, these also get the length of a string:
my $length = map $_, $str =~ /(.)/gs;
my $length = () = $str =~ /(.)/gs;
my $length = split '', $str;
The first two work by using the global flag to match each character in the string, then using the returned list of matches in a scalar context to get the number of characters. The third works similarly by splitting on each character instead of regex-matching and using the resulting list in scalar context