How does Perl split work with a string exactly?

Quoted from perldoc -f split:
As a special case, specifying a PATTERN of space (' ' ) will split on
white space just as split with no arguments does. Thus, split(' ') can
be used to emulate awk's default behavior, whereas split(/ /) will
give you as many initial null fields (empty string) as there are
leading spaces.
The above is all that's mentioned about how split deals with a string delimiter. What is the general case: are empty leading fields always deleted for string delimiters?

No, only when the delimiter is a string that is a single space. In any other case, the delimiter is interpreted as a regex pattern.
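To make the difference concrete, here is a minimal sketch; the input strings are made up for illustration. It compares the single-space special case with a regex of one space and with another string delimiter:

```perl
use strict;
use warnings;

my $line = "  a b  c";   # hypothetical input: two leading spaces, a double space inside

my @awk_style = split ' ', $line;    # special case: leading whitespace is dropped, runs of whitespace collapse
my @one_space = split / /, $line;    # regex: every single space is a separator
my @comma     = split ',', ",a,b";   # any other string is used as a regex; the leading empty field stays

print scalar(@awk_style), " ", scalar(@one_space), " ", scalar(@comma), "\n";   # prints: 3 6 3
```

The second call returns ('', '', 'a', 'b', '', 'c'), and the third returns ('', 'a', 'b'), so only the ' ' form strips leading empty fields.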

Related

Why does q/\\a/ equal q/\a/?

The following example prints "SAME":
if (q/\\a/ eq q/\a/) {
    print "SAME\n";
}
else {
    print "DIFFERENT\n";
}
I understand this is consistent with the documentation, but I think this behavior is undesirable. Is there a need to escape a backslash literal in a single-quoted string? If I wanted 2 backslashes, I'd need to specify 4; this does not seem convenient.
Shouldn't Perl detect whether a backslash serves as an escape character or not? For instance, when a backslash does not precede a delimiter, it should be treated as a literal; and if that were the case, I wouldn't need 3 backslashes to express two, e.g.,
q<a\\b>
instead of
q<a\\\b>.
Is there a need to escape a backslash in a single-quoted string?
Yes, if the backslash is followed by another backslash, or is the last character in the string:
$ perl -e'print q/C:\/'
Can't find string terminator "/" anywhere before EOF at -e line 1.
$ perl -e'print q/C:\\/'
C:\
This makes it possible to include any character in a single-quoted string, including the delimiter and the escape character.
If I wanted 2 backslashes, I'd need to specify 4; this does not seem convenient.
Actually, you only need three (because the second backslash isn't followed by another backslash). But as an alternative, if your string contains a lot of backslashes you can use a single-quoted heredoc, which requires no escaping:
my $path = <<'END';
C:\a\very\long\path
END
chomp $path;
print $path; # C:\a\very\long\path
Note that the heredoc adds a newline to the end, which you can remove with chomp.
In single-quoted string literals,
A backslash represents a backslash unless followed by the delimiter or another backslash, in which case the delimiter or backslash is interpolated.
In other words,
You must escape delimiters.
You must escape \ that are followed by \ or the delimiter.
You may escape \ that aren't followed by \ or the delimiter.
So,
q/\// ⇒ /
q/\\\\a/ ⇒ \\a
q/\\\a/ ⇒ \\a
q/\\a/ ⇒ \a
q/\a/ ⇒ \a
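One way to check the table above is to look at the length of each string. This is only a sketch, and the variable names are made up:

```perl
use strict;
use warnings;

my $one_a   = q/\a/;      # backslash, a
my $two_a   = q/\\a/;     # escaped backslash, a: the same two characters
my $three_a = q/\\\a/;    # escaped backslash, backslash, a
my $four_a  = q/\\\\a/;   # two escaped backslashes, a
my $slash   = q/\//;      # escaped delimiter

print join(" ", map { length } $one_a, $two_a, $three_a, $four_a, $slash), "\n";   # prints: 2 2 3 3 1
```

The three-backslash and four-backslash forms compare equal with eq, which is why three backslashes are enough to produce two.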
Is there a need to escape a backslash in a single-quoted string?
Yes, if it's followed by another backslash or the delimiter.
If I wanted 2 backslashes, I'd need to specify 4
Three would suffice.
this does not seem convenient.
It's more convenient than double-quoted strings, where backslashes must always be escaped. Single-quoted strings require the minimum amount of escaping possible without losing the ability to produce the delimiter.

Splitting on whitespace character and strip empty fields

($red, $tapinfo) = split(/:/, $line);
@fields = split(/\s+/, $tapinfo);
In the array @fields, I see that whitespace gets added as well. I want to eliminate it so that @fields contains only non-space elements. What could be going wrong?
I assume you are talking about leading whitespace remaining, so that @fields looks something like:
$VAR1 = [
'', # empty field
'foo',
'bar'
];
This is because you are using /\s+/ for your split when you should be using the default ' ' (a single blank space character). This default behaviour will strip leading whitespace before splitting the string. In other words, you should do:
@fields = split(' ', $tapinfo);
This is documented in perldoc -f split:
As another special case, "split" emulates the default behavior
of the command line tool awk when the PATTERN is either omitted
or a *literal string* composed of a single space character (such
as ' ' or "\x20", but not e.g. "/ /"). In this case, any leading
whitespace in EXPR is removed before splitting occurs, and the
PATTERN is instead treated as if it were "/\s+/"; in particular,
this means that *any* contiguous whitespace (not just a single
space character) is used as a separator. However, this special
treatment can be avoided by specifying the pattern "/ /" instead
of the string " ", thereby allowing only a single space
character to be a separator.
What split does by default is the same as
my @list = $string =~ /\S+/g;
i.e. it finds all the contiguous substrings of non-whitespace characters.
You could use the regex, but to get the default behaviour from split, pass a single literal space character as the first parameter, not a regex. The documentation says this:
As another special case, split emulates the default behavior of the command line tool awk when the PATTERN is either omitted or a literal string composed of a single space character (such as ' ' or "\x20" , but not e.g. / / ). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/ ; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
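The three approaches discussed in these answers can be compared side by side. This is a small sketch with a made-up value for $tapinfo:

```perl
use strict;
use warnings;

my $tapinfo = "  foo bar\tbaz ";   # hypothetical value with leading and trailing whitespace

my @with_regex = split /\s+/, $tapinfo;   # the leading empty field is kept
my @with_space = split ' ',   $tapinfo;   # leading whitespace is skipped before splitting
my @with_match = $tapinfo =~ /\S+/g;      # collects runs of non-whitespace characters

print scalar(@with_regex), " ", scalar(@with_space), " ", scalar(@with_match), "\n";   # prints: 4 3 3
```

Only the /\s+/ version returns the extra empty string at index 0. Trailing empty fields are dropped in all cases because no limit is given.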

How to search for a string that contains no whitespace in perl

my $string3 = "anima ls";
my $t3 = $string3 =~ /[^\s]+/;
print "$t3\n";
I wanted to write a regex that searches for a string containing no whitespace. The above code reports a match even if I give it a string with a space.
The regex [^\s]+ searches for at least one character that is not whitespace. It is better written as \S+, though. A regex that matches any string that does not contain a whitespace character is rather
/^\S+$/
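A quick sketch of the difference, using the string from the question. The \A ... \z variant is an addition that is not in the answer above; it is worth considering when a trailing newline should also count as whitespace:

```perl
use strict;
use warnings;

my $string3 = "anima ls";

my $unanchored = $string3 =~ /[^\s]+/ ? 1 : 0;        # 1: "anima" alone satisfies the pattern
my $anchored   = $string3 =~ /^\S+$/  ? 1 : 0;        # 0: the space breaks the anchored match
my $no_space   = "animals" =~ /^\S+$/ ? 1 : 0;        # 1
my $newline_ok = "animals\n" =~ /^\S+$/   ? 1 : 0;    # 1: $ also matches before a final newline
my $strict     = "animals\n" =~ /\A\S+\z/ ? 1 : 0;    # 0: \z only matches at the very end

print "$unanchored $anchored $no_space $newline_ok $strict\n";   # prints: 1 0 1 1 0
```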

Split by dot using Perl

I use the split function in two ways. First way (string argument to split):
my $string = "chr1.txt";
my @array1 = split(".", $string);
print $array1[0];
I get this error:
Use of uninitialized value in print
When I split the second way (regular expression argument to split), I don't get any errors.
my @array1 = split(/\./, $string); print $array1[0];
The first way fails only for the dot.
What is the reason behind this?
"\." is just ., careful with escape sequences.
If you want a backslash and a dot in a double-quoted string, you need "\\.". Or use single quotes: '\.'
If you just want to parse file names and get their suffixes, it is better to use the fileparse() function from File::Basename.
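The options mentioned here could look like the following sketch. The suffix pattern passed to fileparse is an assumption about what counts as an extension, and quotemeta is added as one more way to get a literal dot:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

my $string = "chr1.txt";

my @by_regex  = split /\./, $string;             # regex literal with an escaped dot
my @by_single = split '\.', $string;             # single quotes keep the backslash
my @by_quoted = split quotemeta('.'), $string;   # quotemeta('.') returns the two characters \.

# fileparse returns (name, directory, suffix); qr/\.[^.]*/ is an assumed suffix pattern
my ($name, $dir, $suffix) = fileparse($string, qr/\.[^.]*/);

print "$by_regex[0] $by_single[0] $by_quoted[0] $name $suffix\n";   # prints: chr1 chr1 chr1 chr1 .txt
```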
Additional details to the information provided by Mat:
In split "\.", ... the first parameter to split is first interpreted as a double-quoted string before being passed to the regex engine. As Mat said, inside a double-quoted string, a \ is the escape character, meaning "take the next character literally", e.g. for things like putting double quotes inside a double-quoted string: "\""
So your split gets passed "." as the pattern. A single dot means "split on any character". As you know, the split pattern itself is not part of the results. So you have several empty strings as the result.
But why is the first element undefined instead of empty? The answer lies in the documentation for split: if you don't impose a limit on the number of elements returned by split (its third argument) then it will silently remove empty results from the end of the list. As all items are empty the list is empty, hence the first element doesn't exist and is undefined.
You can see the difference with this particular snippet:
my @p1 = split "\.", "thing";
my @p2 = split "\.", "thing", -1;
print scalar(@p1), ' ', scalar(@p2), "\n";
It outputs 0 6.
The "proper" way to deal with this, however, is what @soulSurfer2010 said in his post.

How do I escape special characters for a substitution in a Perl one-liner?

Is there some way to replace a string containing characters such as # or * or ? or & without needing to put a "\" before each of them?
Example:
perl -pe 'next if /^#/; s/\#d\&/new_value/ if /param5/' test
In this example I need to replace a #d& with new_value but the old value might contain any character, how do I escape only the characters that need to be escaped?
You have several problems:
You are using \b incorrectly
You are replacing code with shell variables
You need to quote metacharacters
From perldoc perlre
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it
Neither of the characters # or & are \w characters. So your match is guaranteed to fail. You may want to use something like s/(^|\s)\#d\&(\s|$)/${1}new text$2/
(^|\s) says to match either the start of the string (^)or a whitespace character (\s).
(\s|$) says to match either the end of the string ($) or a whitespace character (\s).
To solve the second problem, you should use %ENV.
To solve the third problem, you should use the \Q and \E escape sequences to escape the value in $ENV{a}.
Putting it all together we get:
#!/bin/bash
export a='#d&'
export b='new text'
echo 'param5 #d&' |
perl -pe 'next if /^#/; s/(^|\s)\Q$ENV{a}\E(\s|$)/$1$ENV{b}$2/ if /param5/'
Which prints
param5 new text
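The same idea can be tried without the shell. In this sketch the variable names are made up; it shows why the \b version never matches and the whitespace-anchored version does:

```perl
use strict;
use warnings;

my $old  = '#d&';              # value to replace (from the example above)
my $new  = 'new text';
my $line = "param5 #d&\n";     # sample input line

# \b needs a \w character on one side; a space followed by '#' has none, so nothing changes
(my $with_b = $line) =~ s/\b\Q$old\E\b/$new/;

# Anchoring on start/whitespace and whitespace/end works for any characters in $old
(my $with_ws = $line) =~ s/(^|\s)\Q$old\E(\s|$)/$1$new$2/;

print $with_ws;   # prints: param5 new text
```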
As discussed at perldoc perlre:
...Today it is more common to use the quotemeta() function or the "\Q" metaquoting
escape sequence to disable all metacharacters' special meanings like this:
/$unquoted\Q$quoted\E$unquoted/
Beware that if you put literal backslashes (those not inside interpolated variables) between "\Q" and "\E", double-quotish backslash interpolation may
lead to confusing results. If you need to use literal backslashes within "\Q...\E", consult "Gory details of parsing quoted constructs" in perlop.
You can also use ' as the delimiter in the s/// operation to turn off variable interpolation in the pattern and the replacement (regex metacharacters such as * and ? keep their special meaning):
my $text = '#';
$text =~ s'#'1';
print $text;
In your example, you can do (note the single quotes):
perl -pe 's/\b\Q#f&\E\b/new_value/g if m/param5/ and not /^ *#/'
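One caveat on the single-quote delimiter form of s///: as far as I know it only switches off variable interpolation, and the pattern is still compiled as a regex. A small sketch with made-up strings:

```perl
use strict;
use warnings;

# With ' as the delimiter, $name in the replacement is not interpolated:
# the result is the literal five characters $name.
(my $no_interp = 'x') =~ s'x'$name';

# The pattern is still a regex: a*b means "zero or more a, then b",
# so only the final b is replaced.
(my $still_regex = 'a*b') =~ s'a*b'X';

# Escaping the metacharacter gives the literal match.
(my $escaped = 'a*b') =~ s'a\*b'X';

print "$no_interp $still_regex $escaped\n";   # prints: $name a*X X
```

So this form helps with $ and @, but characters such as * and ? still need a backslash or \Q ... \E.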
The other answers have covered the question, now here's your meta-problem: Leaning Toothpick Syndrome. It's when the delimiter and escapes start to blur together:
s/\/foo\/bar\\/\/bar\/baz/
The solution is to use a different delimiter. You can use just about anything, but balanced braces work best. Most editors can parse them and you generally don't have to worry about escaping.
s{/foo/bar\\}{/bar/baz}
Here's your regex with braced delimiters.
s{\#d\&}{new_value}
Much easier on the eyeholes.
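If you really want to avoid typing the backslashes, put your search string into a variable and then use that in your regex instead. For '#d&' you don't need quotemeta or \Q ... \E, because none of its characters is a regex metacharacter; a value that may contain characters such as * or ? still needs them. For example: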
my $s = '#d&';
s/$s/new_value/g;
If you must use this in a one-liner, bear in mind that you will have to escape the dollar signs if you wrap your Perl code in double quotes, or escape the single quotes if you wrap it in single quotes.
If you have a string like
my $var1 = 'abc$123';
and you want to replace it with abcd, then you have to use \Q ... \E. If you don't, Perl never replaces the string, because the unescaped $ in the interpolated pattern acts as an end-of-string anchor.
This is the only thing that worked for me.
$var2 =~ s/\Q$var1\E/abcd/g;   # $var2 holds the text being searched
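A self-contained sketch of that last point; the $text value is made up for illustration:

```perl
use strict;
use warnings;

my $var1 = 'abc$123';          # single quotes: the $ is a literal character here
my $text = 'value=abc$123;';   # hypothetical input

(my $unquoted = $text) =~ s/$var1/abcd/g;       # pattern abc$123: $ is an anchor, so it never matches
(my $quoted   = $text) =~ s/\Q$var1\E/abcd/g;   # pattern abc\$123: matches the literal text

print "$unquoted\n$quoted\n";
# prints:
# value=abc$123;
# value=abcd;
```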