So, I saw in another post that to split using \\ as a delimiter, you need to split on \\\\\\\\. This didn't really make sense to me, but when I attempted to split using \\\\, this happened:
my $string="a\\\\b\\\\c";
my @ra=split("\\\\",$string);
Array is:
a
<empty>
b
<empty>
c
As the other poster said, using \\\\\\\\ works perfectly. Why is this the case?
Also, I got curious and started messing with '' vs "" and got unexpected results. I thought that I understood what the difference is, but I guess I didn't, at least not in the following context:
my $string="a\.\.b\.\.c";
my @ra=split("\.\.",$string);
Array is:
<empty>
<empty>
<empty>
c
Yet,
my $string="a\.\.b\.\.c";
my @ra=split('\.\.',$string);
Array is:
a
b
c
Thanks in advance.
Oh, quoting rules and regexes.
Backslash rules with different quotes
In q() and related, all backslashes are left in the string, unless they escape the string delimiter or another backslash:
say '\a\\b\''; # »\a\b'«
In qq() and related, all backslashes that do not form a known string escape sequence are silently removed:
say "\d\\b\"\."; # »d\b".«
Ditto in qr// and regex literals, except that there are different escapes compared to double quoted strings.
If a string is used in place of a regex, then during compilation the escape rules for that kind of string are performed. However, a second level of escapes is processed when it is used as a regex, hence backslashes have to be double-escaped in the worst cases. Regex literals don't suffer from this problem; there is only one level of escaping.
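A minimal sketch of those two levels, assuming a plain three-character input a\b (the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Level 1: the string literal is parsed. Four typed backslashes leave two.
my $pattern        = "\\\\";
my $pattern_length = length $pattern;

# Level 2: used as a regex, those two characters mean "one literal backslash".
my $input  = "a\\b";                    # the string a\b
my @fields = split $pattern, $input;

# A regex literal has only one level, so two typed backslashes are enough.
my @literal_fields = split /\\/, $input;

print "pattern holds $pattern_length characters\n";
print join('|', @fields), "\n";
print join('|', @literal_fields), "\n";
```

Both split calls hand the same two-character pattern to the regex engine; only the amount of typing differs.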
Explanations for your examples
Therefore, "a\\\\b\\\\c" is a\\b\\c, and "\\\\" is \\ which matches \ as a regex. So it splits on every backslash, thus producing zero-length fields in between the double backslashes.
The '\\\\\\\\' from the other question you mentioned is \\\\ which as a regex matches \\.
The "a\.\.b\.\.c" is a..b..c, and "\.\." is .. which as a regex matches two non-newline characters. It first matches a., then .b, then the remaining two periods. This produces the string fragments "", "", "", "c".
The string '\.\.' is \.\., which as a regex matches two literal periods in sequence.
The solution is to use regexes where regexes are due. split takes a regex as its first argument, as in split /foo/. In other scenarios the regex quote qr/foo/ is useful. This avoids mind-bending[1] double escaping.
[1]: for small values of "mind-bending", once you grok the rules.
In single-quoted string literals,
\ followed by the string delimiter (' by default) results in the string delimiter.
'That\'s fool\'s gold!' -> That's fool's gold!
q!That's fool's gold\!! -> That's fool's gold!
\ followed by \ results in \.
'c:\\foo' -> c:\foo
\ followed by anything else results in those two characters.
'c:\foo' -> c:\foo
In double-quoted string literals,
\ followed by non-word character results in that character.
"c:\\foo" -> c:\foo
"Can't open \"foo\"" -> Can't open "foo"
\ followed by word character has a special meaning.
"foo\n" -> foo{newline}
In regular expression literals,
\ followed by the delimiter results in the delimiter.
qr/\// -> /
\ followed by anything else results in those two characters.
qr/\\/ -> \\
qr/\_/ -> \_
qr/\$/ -> \$
qr/\n/ -> \n
When applying a regular expression,
\ followed by non-word character matches that character.
/c:\\foo/ -> Matches strings containing: c:\foo
\ followed by word character has a special meaning.
/foo\z/ -> Matches strings ending with: foo
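A quick way to convince yourself of these rules is to compare the resulting strings directly. This is only a sketch; the variable names are invented for the example:

```perl
use strict;
use warnings;

# String literals: all three spell the same 6-character string c:\foo.
my $single         = 'c:\foo';
my $single_escaped = 'c:\\foo';
my $double         = "c:\\foo";
my $same = ($single eq $single_escaped && $single eq $double) ? 1 : 0;

# Applying a regex: \\ matches one literal backslash, \z asserts end of string.
my $matches_path    = ($single   =~ /c:\\foo/) ? 1 : 0;
my $ends_in_foo     = ($single   =~ /foo\z/)   ? 1 : 0;
my $bar_ends_in_foo = ('foobar'  =~ /foo\z/)   ? 1 : 0;

print "same string: $same, length: ", length($single), "\n";
print "matches path: $matches_path\n";
print "c:\\foo ends in foo: $ends_in_foo, foobar ends in foo: $bar_ends_in_foo\n";
```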
Looking at your cases:
my $string="a\\\\b\\\\c";
my @ra=split("\\\\",$string);
"\\\\" results in the string \\, so you first create the string a\\b\\c and you pass \\ to split.
The first argument of split is used as a regular expression, and the regex pattern \\ matches a single \. There are 4 \ in a\\b\\c, so it gets split into 4+1 pieces.
If you use regex literals instead of double-quoted string literals, there will be less confusion.
split(/\\/, $string); # Passes pattern \\ to split. Matches singles
split("\\\\", $string); # Passes pattern \\ to split. Matches singles
split(/\\\\/, $string); # Passes pattern \\\\ to split. Matches doubles
split("\\\\\\\\", $string); # Passes pattern \\\\ to split. Matches doubles
In short, don't use split "..."!
Your other two cases should be obvious to you by now.
my $string="a\.\.b\.\.c"; # String a..b..c
my @ra=split("\.\.",$string); # Pattern .., which matches any two chars.
my $string="a\.\.b\.\.c"; # String a..b..c
my @ra=split('\.\.',$string); # Pattern \.\., which matches two periods.
Split using /\\\\/ instead of "\\\\" and avoid all the worries,
e.g.
use Data::Dumper;
my $string= "a\\\\b\\\\c";
my @ra = split /\\\\/, $string;
print Dumper \@ra;
will output
$VAR1 = [
'a',
'b',
'c'
];
/\\\\/ will match two \ in a row
or you can be cute and do
split /\\{2}/, $string
I want to replace specific strings in php files automatically using sed. Some work, and some do not. I already investigated this is not an issue with the replacement string but with the string that is to be replaced. I already tried to escape [ and ] with no success. It seems to be the whitespace within the () - not whitespaces in general. The first whitespaces (around the = ) do not have any problems. Please can someone point me to the problem:
sed -e "1,\$s/$adm = substr($path . rawurlencode($upload['name']) , 16);/$adm = rawurlencode($upload['name']); # fix 23/g" -i administration/identify.php
I already tried to shorten the string which should be replaced and the result was if I cut it directly behind $path it works, with the following whitespace it does not. Escaping whitespace has no effect...
what must be escaped for sed
The following characters have special meaning in sed and have to be escaped with \ for the regex to be taken literally:
\
[
the character used in separating s command parts, ie. / here
.
*
& (only in the replacement string)
The newline character is handled specially as the end of the string, but it can be written as \n.
So first escape all special characters in input and then pass it to sed:
rgx='$adm = substr($path . rawurlencode($upload['"'"'name'"'"']) , 16);' # single quotes keep the shell from expanding $adm, $path and $upload
rgx_escaped=$(sed 's/[\\\[\.\*\/&]/\\&/g' <<<"$rgx")
sed "s/$rgx_escaped/ etc."
See Escape a string for a sed replace pattern for a generic escaping solution.
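If you want to inspect what that escaping step produces without involving the shell, the same character class can be applied in Perl. This is a rough sketch of the idea rather than a replacement for the sed command; the variable names are made up:

```perl
use strict;
use warnings;

# The literal text to search for, taken from the question. q{} does not interpolate.
my $raw = q{$adm = substr($path . rawurlencode($upload['name']) , 16);};

# Put a backslash in front of each character listed above: \ [ . * / &
(my $escaped = $raw) =~ s{([\\\[.*/&])}{\\$1}g;

print "$escaped\n";
```

Only the period after $path and the opening bracket gain a backslash here, because none of the other listed characters occur in this string.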
You may use
sed -i 's/\$adm = substr(\$path \. rawurlencode(\$upload\['"'"'name'"'"']) , 16);/$adm = rawurlencode($upload['"'"'name'"'"']); # fix 23/g' administration/identify.php
Note:
the sed command is basically wrapped in single quotes, the variable expansion won't occur inside single quotes
In the POSIX BRE syntax, ( matches a literal ( and ) matches a literal ), so neither needs escaping, but you do need to escape [ and . when they must match themselves
The single quotes require additional quoting with concatenation.
The following match returns false. How can I change the regular expression to correct it?
"hello$world" -match '^hello$(wo|ab).*$'
"hello$abcde" -match '^hello$(wo|ab).*$'
'hello$world' -match '^hello\$(wo|ab).*$'
'hello$abcde' -match '^hello\$(wo|ab).*$'
You need to quote the left hand side with single quotes so $world isn't treated like variable interpolation. You need to escape the $ in the right hand side so it isn't treated as end of line.
From About Quoting Rules:
When you enclose a string in double quotation marks (a double-quoted string), variable names that are preceded by a dollar sign ($) are replaced with the variable's value before the string is passed to the command for processing.
...
When you enclose a string in single-quotation marks (a single-quoted string), the string is passed to the command exactly as you type it. No substitution is performed.
From About Regular Expressions:
The two commonly used anchors are ^ and $. The carat ^ matches the start of a string, and $, which matches the end of a string. This allows you to match your text at a specific position while also discarding unwanted characters.
...
Escaping characters
The backslash \ is used to escape characters so they are not parsed by the regular expression engine.
The following characters are reserved: []().\^$|?*+{}.
You will need to escape these characters in your patterns to match them in your input strings.
The following example prints "SAME":
if (q/\\a/ eq q/\a/) {
print "SAME\n";
}
else {
print "DIFFERENT\n";
}
I understand this is consistent with the documentation. But I think this behavior is undesirable. Is there a need to escape a backslash literal in a single-quoted string? If I wanted 2 backslashes, I'd need to specify 4; this does not seem convenient.
Shouldn't Perl detect whether a backslash serves as an escape character or not? For instance, when a backslash does not precede a delimiter, it should be treated as a literal; and if that were the case, I wouldn't need 3 backslashes to express two, e.g.,
q<a\\b>
instead of
q<a\\\b>.
Is there a need to escape a backslash in a single-quoted string?
Yes, if the backslash is followed by another backslash, or is the last character in the string:
$ perl -e'print q/C:\/'
Can't find string terminator "/" anywhere before EOF at -e line 1.
$ perl -e'print q/C:\\/'
C:\
This makes it possible to include any character in a single-quoted string, including the delimiter and the escape character.
If I wanted 2 backslashes, I'd need to specify 4; this does not seem convenient.
Actually, you only need three (because the second backslash isn't followed by another backslash). But as an alternative, if your string contains a lot of backslashes you can use a single-quoted heredoc, which requires no escaping:
my $path = <<'END';
C:\a\very\long\path
END
chomp $path;
print $path; # C:\a\very\long\path
Note that the heredoc adds a newline to the end, which you can remove with chomp.
In single-quoted string literals,
A backslash represents a backslash unless followed by the delimiter or another backslash, in which case the delimiter or backslash is interpolated.
In other words,
You must escape delimiters.
You must escape \ that are followed by \ or the delimiter.
You may escape \ that aren't followed by \ or the delimiter.
So,
q/\// ⇒ /
q/\\\\a/ ⇒ \\a
q/\\\a/ ⇒ \\a
q/\\a/ ⇒ \a
q/\a/ ⇒ \a
Is there a need to escape a backslash in a single-quoted string?
Yes, if it's followed by another backslash or the delimiter.
If I wanted 2 backslashes, I'd need to specify 4
Three would suffice.
this does not seem convenient.
It's more convenient than double-quoted strings, where backslashes must always be escaped. Single-quoted strings require the minimum amount of escaping possible without losing the ability to produce the delimiter.
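The equivalences listed above can be checked with a few comparisons. A small sketch, with invented variable names, that also includes a single-quoted heredoc for contrast:

```perl
use strict;
use warnings;

my $three = q/\\\a/;     # \\ collapses to \, then \a is kept as is
my $four  = q/\\\\a/;    # two escaped backslashes
my $two   = q/\\a/;      # one escaped backslash
my $one   = q/\a/;       # backslash kept because a is not special
my $slash = q/\//;       # escaped delimiter

# A single-quoted heredoc does no backslash processing at all.
my $raw = <<'END';
a\\b
END
chomp $raw;

print "three eq four: ", ($three eq $four ? "yes" : "no"), "\n";
print "two eq one: ",    ($two eq $one    ? "yes" : "no"), "\n";
print "lengths: ", join(' ', map { length } $three, $two, $slash, $raw), "\n";
```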
I'm trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* in between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I'd appreciate it.
The circumflex ^ is regex for "start of line", and $ is regex for "end of line"; so sed will only remove lines which contain exactly "EOR:" and nothing else.
The Perl script is basically perl -00 -ne '/(re)g(ex)/ && print "re ex\n"' with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one.
(.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched expression to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are capturing groups; the numbers correspond to the order of the opening parentheses (so if you apply (a(b)) to the string "ab", $1 will be "ab" and $2 will be "b").
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not "tags"?)
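To see the capture numbering and the greedy versus non-greedy difference in isolation, here is a small sketch. The TAG: text is a made-up stand-in, not taken from your data:

```perl
use strict;
use warnings;

# Groups are numbered by the position of their opening parenthesis.
my ($outer, $inner) = 'ab' =~ /(a(b))/;

# With /s the dot also matches newlines, so greediness decides where the capture stops.
my $text = "TAG: one\ntwo\n";
my ($greedy) = $text =~ /TAG:\s+(.*)\n/s;     # runs on to the last newline
my ($lazy)   = $text =~ /TAG:\s+(.*?)\n/s;    # stops at the first newline

print "outer=$outer inner=$inner\n";
print "lazy capture: $lazy\n";
print "greedy capture spans ", ($greedy =~ tr/\n//) + 1, " lines\n";
```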
It doesn't really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.
Let's see...
Yes, sed will empty the lines with EOR:
The first / in the Perl script means a regexp pattern. Concretely, it is searching for a pattern in the form below
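The regex ends with "xs": x lets the pattern be written across several lines with whitespace and comments, and s lets . match newlines, so the regex can match across multiple lines of the input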
The script also will print as output the strings found in the tags (see below). The $1 and $2 mean the elements contained in the first pair of parentheses ($1) and in the second ($2).
The form is this one:
TAGA01:<spaces><string1>
<whatever here>
TAGCC08:<spaces><string2>
In this case, $1 is <string1> and $2 is <string2>.
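As mentioned earlier, Perl could do the sed step itself. Here is a rough sketch of the whole pipeline in one script. The sample input is invented (only the tag names come from the question), and an in-memory filehandle stands in for INPUTFILE:

```perl
use strict;
use warnings;

# Hypothetical sample data; the values are made up for illustration.
my $input = <<'END';
TAGA01:   first value
OTHER:    ignored
TAGCC08:  second value
EOR:
TAGA01:   third value
TAGCC08:  fourth value
EOR:
END

# Same effect as the sed step: empty every line that consists solely of EOR:
(my $blanked = $input) =~ s/^EOR:$//mg;

open my $fh, '<', \$blanked or die "cannot open in-memory handle: $!";
my @pairs;
{
    local $/ = '';    # paragraph mode, the in-script counterpart of -00
    while (my $record = <$fh>) {
        if ($record =~ /
                TAGA01:\s+(.*?)\n     # first capture: rest of the TAGA01 line
                .*                    # anything in between, newlines included
                TAGCC08:\s+(.*?)\n    # second capture: rest of the TAGCC08 line
            /xs) {
            push @pairs, "$1 $2";
        }
    }
}
close $fh;

print "$_\n" for @pairs;
```

The emptied EOR: lines become the blank lines that paragraph mode uses as record separators, which is why that step has to come first.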
Quoted from perldoc -f split:
As a special case, specifying a PATTERN of space (' ' ) will split on
white space just as split with no arguments does. Thus, split(' ') can
be used to emulate awk's default behavior, whereas split(/ /) will
give you as many initial null fields (empty string) as there are
leading spaces.
The above is all that's mentioned about how split deals with a string delimiter, but what's the general case? Are the empty leading fields always deleted for string delimiters?
No, only when the delimiter is a string that is a single space. In any other case, the delimiter is interpreted as a regex pattern.
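A short sketch of the difference, using made-up input strings:

```perl
use strict;
use warnings;

my $line = '  alpha beta';

my @awk_style = split ' ', $line;    # special case: leading whitespace is skipped
my @regex     = split / /, $line;    # one field per leading space, all empty
my @comma     = split ',', ',a,b';   # any other string is just a pattern

print scalar(@awk_style), " fields with ' '\n";
print scalar(@regex),     " fields with / /\n";
print scalar(@comma),     " fields with ','\n";
```

The comma case keeps its leading empty field, just as the / / case does.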