Substitution on string in perl changes string to an integer value - perl

I am trying to do delete some characters matching a regex in perl and when I do that it returns an integer value.
I have tried substituting multiple spaces in a string with empty string or basically deleting the space.
#! /usr/intel/bin/perl
my $line = "foo/\\bar car";
print "$line\n";
#$line = ~s/(\\|(\s)+)+//; <--Ultimately need this, where backslash and space needs to be deleted. Tried this, returns integer value
$line = ~s/\s+//; <-- tried this, returns integer value
print "$line\n";
Expected results:
First print: foo/\bar car
Second print: foo/barcar
Actual result:
First print: foo/\\bar car
Second print: 18913234908

The proper solution is
$line =~ s/[\s\\]+//g;
Note:
g flag to substitute all occurrences
no space between = and ~
=~ is a single operator, binding the substitution operator s to the target variable $line.
Inserting a space (as in your code) means s binds to the default target, $_, because there is no explicit target, and then the return value (which is the number of substitutions made) has all its bits inverted (unary ~ is bitwise complement) and is assigned to $line.
In other words,
$line = ~ s/...//
parses as
$line = ~(s/...//)
which is equivalent to
$line = ~($_ =~ s/...//)
If you had enabled use warnings, you would've gotten the following message:
Use of uninitialized value $_ in substitution (s///) at prog.pl line 6.

You've already accepted an answer, but I thought it would be useful to give you a few more details.
As you now know,
$line = ~s/\s+//;
is completely different to:
$line =~ s/\s+//;
You wanted the second, but you typed the first. So what did you end up with?
~ is "bitwise negation operator". That is, it converts its argument to a binary number and then bit-flips that number - all the zeroes become ones and all the ones become zeros.
So you're asking for the bitwise negation of s/\s+//. Which means the bitwise negation works on the value returned by s/\s+//. And the value returned by a substitution is the number of substitutions made.
We can now work out all of the details.
s/\s+// carries out your substitution and returns the number of substitutions made (an integer).
~s/\s+// returns the bitwise negation of the integer returned by the substitution (which is also an integer).
$line = ~s/\s+// takes that second integer and assigns it to the variable $line.
Probably, the first step returns 1 (you don't use /g on your s/.../.../, so only one substitution will be made). It's easy enough to get the bitwise negation of 1.
$ perl -E'say ~1'
18446744073709551614
So that might well be the integer that you're seeing (although it might be different on a 32-bit system).

Related

Perl tr operator is transliterating based on the variable's name not its value

I'm using Perl 5.16.2 to try to count the number of occurrences of a particular delimiter in the $_ string. The delimiter is passed to my Perl program via the #ARGV array. I verify that it is correct within the program. My instruction to count the number of delimiters in the string is:
$dlm_count = tr/$dlm//;
If I hardcode the delimiter, e.g. $dlm_count = tr/,//; the count comes out correctly. But when I use the variable $dlm, the count is wrong. I modified the instruction to say
$dlm_count = tr/$dlm/\t/;
and realized from how the tabs were inserted in the string that the operation was substituting every instance of any of the four characters "$", "d", "l", or "m" to \t — i.e. any of the four characters that made up my variable name $dlm.
Here is a sample program that illustrates the problem:
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = tr/$dlm/\t/;
print "The count is $dlm_count\n";
print "The modified string is $_\n";
There are only two commas in the $_ string, but this program prints the following:
The count is 3
The modified string is abc efghij,k ,nopqrstuvwxyz
Why is the $dlm token being treated as a literal string of four characters instead of as a variable name?
You cannot use tr that way, it doesn't interpolate variables. It runs strictly character by character replacement. So this
$string =~ tr/a$v/123/
is going to replace every a with 1, every $ with 2, and every v with 3. It is not a regex but a transliteration. From perlop
Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $# if $#;
eval "tr/$oldlist/$newlist/, 1" or die $#;
The above example from docs hints how to count. For $dlms in $string
$dlm_count = eval "\$string =~ tr/$dlm//";
The $string is escaped so to not be interpolated before it gets to eval. In your case
$dlm_count = eval "tr/$dlm//";
You can also use tools other than tr (or regex). For example, with string being in $_
my $dlm_count = grep { /$dlm/ } split //;
When split breaks $_ by the pattern that is empty string (//) it returns the list of all characters in it. Then the grep block tests each against $dlm so returning the list of as many $dlm characters as there were in $_. Since this is assigned to a scalar, $dlm_count is set to the length of that list, which is the count of all $dlm.
In the section of the docs on perlop 'Quote Like Operators', it states:
Because the transliteration table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote
interpolation. That means that if you want to use variables, you must
use an eval():
As documented and as you discovered, tr/// doesn't interpolate. The simple solution is to use s/// instead.
my $dlm = ",";
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm_count = s/\Q$dlm/\t/g;
If the transliteration is being performed in a loop, the following might speed things up noticeably:
my $dlm = ",";
my $tr = eval "sub { tr/\Q$dlm\E/\\t/ }";
for (...) {
my $dlm_count = $tr->();
...
}
Although several answers have hinted at the eval() idiom for tr///, none have the form that covers cases where the string has tr syntax characters in it, e.g.- (hyphen):
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = eval sprintf "tr/%s/%s/", map quotemeta, $dlm, "\t";
But as others have noted, there are lots of ways to count characters in Perl that avoid eval(), here's another:
my $dlm_count = () = m/$dlm/go;

How do map and grep work?

I came across this code in a script, can you please explain what map and grep does here?
open FILE, '<', $file or die "Can't open file $file: $!\n";
my #sets = map {
chomp;
$_ =~ m/use (\w+)/;
$1;
}
grep /^use/, ( <FILE> );
close FILE;
The file pointed by $file has:
use set_marvel;
use set_caprion;
and so on...
Despite the fact that your question doesn't show any research effort, I'm going to answer it anyway, because it might be helpful for future readers who come across this page.
According to perldoc, map:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value composed of the results
of each such evaluation. In scalar context, returns the total number
of elements so generated. Evaluates BLOCK or EXPR in list context, so
each element of LIST may produce zero, one, or more elements in the
returned value.
The definition for grep, on the other hand:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value consisting of those
elements for which the expression evaluated to true. In scalar
context, returns the number of times the expression was true.
So they're similar in their input values, their return values, and the fact that they both localize $_.
In your specific code, going from right to left:
<FILE> slurps the lines in the file pointed to by the FILE filehandle and returns a list
In the context of grep, /^use/ looks at each line and returns true for the ones that match the regular expression. The return value of grep, therefore, is a list of lines that that start with use.
In the BLOCK of your map (which is only considering lines that passed the earlier grep test):
chomp removes any trailing string from $_ that corresponds to the current value of $/ (i.e., the newline). This is unnecessary, because as you'll see below, \w will never match a newline.
$_ =~ m/use (\w+)/ is a regular expression that looks for use followed by a space, followed by one or more word characters ([0-9a-zA-Z_]) in a capture group. The $_ =~ is redundant, since the match operator m// binds to $_ by default.
$1 is the first matching capture group from the previous expression. Since it's the last expression in the BLOCK, it bubbles up as the return value for each list item that was evaluated.
The end result is stored in an array named #sets, which should contain 'set_marvel', 'set_caprion', etc.
Equivalently, your code could be rewritten without map and grep like this, which may make it easier for you to understand:
my #sets;
while (<FILE>) {
next unless /^use (\w+)/;
push(#sets, $1);
}
The grep takes the <FILE> as input and uses the regular expression ^use to copy all of the lines that start with use into an array that is passed to map.
The map loops through each array entry and puts each entry in $_, then calls chomp on $_ implicitly. Then $_ =~ m/use (\w+)/; performs a regular expression on $_ that captures the word after the use and puts it into $1. Then the $1 is called to put it in #set.

index argument contains . perl

If a string contains . representing any character, index doesn't match on it. What to do so that it takes . as any character?
For ex,
index($str, $substr)
if $substr contains . anywhere, index will always return -1
thanks
carol
That is not possible. The documentation says:
The index function searches for one string within another, but without
the wildcard-like behavior of a full regular-expression pattern match.
...
The keywords, you can use for further googlings are:
perl regular expression wildcard
Update:
If you just want to know, if your string matches, using a regular expression could look like that:
my $string = "Hello World!";
if( $string =~ /ll. Worl/ )
{
print "Ahoi! Position: ".($-[0])."\n";
}
This is matching a single character.
$-[0] is the offset into the string of the beginning of the entire
match.
-- http://perldoc.perl.org/perlvar.html
If you want to have a pattern, that is matching an arbitary amount of arbitary characters, you could choose a pattern like...
...
if( $string =~ /ll.*orl/ )
{
...
See perlvar for further information about special perl variables. You will find the variable #LAST_MATCH_START and some explanation about $-[0] over there. There are several more variables, that can help you to find sub matches and to gather other interessting information about your matches...
From perldoc -f index, you can see index() doesn't have any regex syntax:
index STR,SUBSTR
The index function searches for one string within another, but without the wildcard-like behavior of a full regular-
expression pattern match. It returns the position of the first occurrence of SUBSTR in STR at or after POSITION. If
POSITION is omitted, starts searching from the beginning of the string. POSITION before the beginning of the string or after
its end is treated as if it were the beginning or the end, respectively. POSITION and the return value are based at 0 (or
whatever you've set the $[ variable to--but don't do that). If the substring is not found, "index" returns one less than the
base, ordinarily "-1"
A simple test:
$ perl -e 'print index("1234567asdfghj.","j.")'
13
Use regex:
$str =~ /$substr/g;
$index = pos();

Why does print ($a = a..c) produce: 1E0

print (a..c) # this prints: abc
print ($a = "abc") # this prints: abc
print ($a = a..c); # this prints: 1E0
I would have thought it would print: abc
use strict;
print ($a = "a".."c"); # this prints 1E0
Why? Is it just my computer?
edit: I've got a partial answer (the range operator .. returns a boolean value in scalar context - thanks) but what I don't understand is:
why does: print ($a = "a"..."c") produce 1 instead of 0
why does: print ($a = "a".."c") produce 1E0 instead of 1 or 0
There are a number of subtle things going on here. The first is that .. is really two completely different operators depending on the context in which it's called. In list context it creates a list of values (incrementing by one) between the given starting and ending points.
#numbers = 1 .. 3; # 1, 2, 3
#letters = 'a' .. 'c'; # a, b, c (Yes, Perl can increment strings)
Because print interprets its arguments in list context
print 'a' .. 'c'; # <-- this
print 'a', 'b', 'c'; # <-- is equivalent to this
In scalar context, .. is flip-flop operator. From Range Operators in perlop:
It is false as long as its left operand is false. Once the left
operand is true, the range operator stays true until the right operand
is true, AFTER which the range operator becomes false again.
Assignment to a scalar value as in $a = ... creates scalar context. That means that the .. in print ($a = 'a' .. 'c') is an instance of the flip-flop operator, not the list creation operator.
The flip-flop operator is designed to be used when filtering lines in a file. e.g.
while (<$fh>) {
print if /first/ .. /last/;
}
would print all of the lines in a file starting with the one that contained first and ending with the one that contained last.
The flip-flop operator has some additional magic designed to make it easy to filter based on the line number.
while (<$fh>) {
print if 10 .. 20;
}
will print lines 10 through 20 of a file. It does this by employing special case behavior:
If either operand of scalar .. is a constant expression, that
operand is considered true if it is equal (==) to the current input
line number (the $. variable).
The strings a and c are both constant expressions so they trigger this special case. They aren't numbers, but they're used as numbers (== is a numeric comparison). Perl will convert scalar values between strings and numbers as needed. In this case, both values nummify to 0. Therefore
print ($a = 'a' .. 'c'); # <-- this
print ($a = 0 .. 0); # <-- is effectively this
print ($a = ($. == 0) .. ($. == 0)); # <-- which is really this
We're getting close to the bottom of the mystery. On to the next bit. More from perlop:
The value returned is either the empty string for false, or a sequence
number (beginning with 1) for true. The sequence number is reset for
each range encountered. The final sequence number in a range has the
string "E0" appended to it
If you haven't read any lines from a file yet, $. will be undef which is 0 in a numerical context. 0 == 0 is true, so the .. returns a true value. It's the first true value, so it's 1. Because both the left-hand and right-hand sides are true the first true value is also the last true value and the E0 "this is the last value" suffix is appended to the return value. That is why print ($a = 'a' .. 'c') prints 1E0. If you were to set $. to a non-zero value the .. would be false and return the empty string.
print ($a = 'a' .. 'c'); # prints "1E0"
$. = 1;
print ($a = 'a' .. 'c'); # prints nothing
The very final piece of the puzzle (and I might be going too far now) is that the assignment operator returns a value. In this case that's the value assigned to $a1 -- 1E0. This value is what is ultimately spit out by the print.
1: Technically, the assignment produces a lvalue for the item assigned to. i.e. it returns an lvalue for the variable $a which then evaluates to 1E0.
It's a matter of list context vs. scalar context, as explained in perldoc perlop:
In scalar context, ".." returns a boolean value. The operator is
bistable, like a flip-flop, and emulates the line-range (comma)
operator of sed, awk, and various editors. Each ".." operator
maintains its own boolean state, even across calls to a subroutine
that contains it. It is false as long as its left operand is false.
Once the left operand is true, the range operator stays true until the
right operand is true, AFTER which the range operator becomes false
again. It doesn't become false till the next time the range operator
is evaluated. It can test the right operand and become false on the
same evaluation it became true (as in awk), but it still returns true
once. If you don't want it to test the right operand until the next
evaluation, as in sed, just use three dots ("...") instead of two. In
all other regards, "..." behaves just like ".." does.
[snip]
The final sequence number in a range has the string "E0" appended to
it, which doesn't affect its numeric value, but gives you something to
search for if you want to exclude the endpoint.
EDIT in response to DanD man's comment:
I find it a bit hard to digest too; frankly, I rarely use the .. operator, and even more rarely in scalar context. But for example, the expression 5..10 in an input loop implicitly compares to the current value of $. (that's part of the description that I didn't quote; see the manual). On lines 5 through 9, it yields a true value (experiment shows that it's a number, but the documentation doesn't say so). On line 10, it yields a number with "E0" appended to it -- i.e., it's in exponential notation, but with the same value it would have without the "E0".
The point of the "E0" tweak is to let you detect whether you're in a specified range and to flag the last line in the range for special treatment. Without the "E0", you wouldn't be able to treat the final match specially.
An example:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
my $dotdot = 2..4;
print "On line $., 2..4 yields \"$dotdot\"\n";
}
Given 5 lines of input, this prints:
On line 1, 2..4 yields ""
On line 2, 2..4 yields "1"
On line 3, 2..4 yields "2"
On line 4, 2..4 yields "3E0"
On line 5, 2..4 yields ""
This lets you detect whether a line is inside or outside the range and when it's the last line in the range.
But scalar .. is probably more commonly used just for its boolean result, often in one-liners; for example, perl -ne 'print if 2..4' will print lines 2, 3, and 4 of whatever input you give it. It's deliberately similar to sed -n '2,4p'.
The answer can be found by consulting perldoc's perlop page:
Binary ".." is the range operator, which is really two different operators depending on the context. In list context, it returns a list of values counting (up by ones) from the left value to the right value...
This is the familiar usage, which is invoked by print "a" .. "c"; because arguments to functions are evaluated in list context. (If they were evaluated in scalar context, then print #list would print the size of #list, which is almost definitely not what people usually want.)
In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated. It can test the right operand and become false on the same evaluation it became true (as in awk), but it still returns true once. If you don't want it to test the right operand until the next evaluation, as in sed, just use three dots ("...") instead of two. In all other regards, "..." behaves just like ".." does.
It goes into further detail, but the bolded sections are the important parts to understanding how the operator works. Scalar context is forced by $a =, i.e. assignment to a scalar lvalue. If you did #a =, it would print what you expect.
Note that "a" .. "b" doesn't produce the string "abc", it produces the list ("a", "b", "c"). You will get similar results if you used the list (though the value printed when the list is forced into scalar context will differ).

How do I get the length of a string in Perl?

What is the Perl equivalent of strlen()?
length($string)
perldoc -f length
length EXPR
length Returns the length in characters of the value of EXPR. If EXPR is
omitted, returns length of $_. Note that this cannot be used on an
entire array or hash to find out how many elements these have. For
that, use "scalar #array" and "scalar keys %hash" respectively.
Note the characters: if the EXPR is in Unicode, you will get the num-
ber of characters, not the number of bytes. To get the length in
bytes, use "do { use bytes; length(EXPR) }", see bytes.
Although 'length()' is the correct answer that should be used in any sane code, Abigail's length horror should be mentioned, if only for the sake of Perl lore.
Basically, the trick consists of using the return value of the catch-all transliteration operator:
print "foo" =~ y===c; # prints 3
y///c replaces all characters with themselves (thanks to the complement option 'c'), and returns the number of character replaced (so, effectively, the length of the string).
length($string)
The length() function:
$string ='String Name';
$size=length($string);
You shouldn't use this, since length($string) is simpler and more readable, but I came across some of these while looking through code and was confused, so in case anyone else does, these also get the length of a string:
my $length = map $_, $str =~ /(.)/gs;
my $length = () = $str =~ /(.)/gs;
my $length = split '', $str;
The first two work by using the global flag to match each character in the string, then using the returned list of matches in a scalar context to get the number of characters. The third works similarly by splitting on each character instead of regex-matching and using the resulting list in scalar context