Where can I find a formal grammar for the Perl programming language? - perl

I understand that the Perl syntax is ambiguous and that its disambiguation is non-trivial (sometimes involving execution of code during the compile phase). Regardless, does Perl have a formal grammar (albeit ambiguous and/or context-sensitive)?

From perlfaq7
Can I get a BNF/yacc/RE for the Perl language?
There is no BNF, but you can paw your
way through the yacc grammar in
perly.y in the source distribution if
you're particularly brave. The grammar
relies on very smart tokenizing code,
so be prepared to venture into toke.c
as well.
In the words of Chaim Frenkel: "Perl's
grammar can not be reduced to BNF. The
work of parsing perl is distributed
between yacc, the lexer, smoke and
mirrors."
To see the wonderful set of examples of WHY it's pretty much near impossible to parse Perl due to context influences, please look into Randal Schwartz's post: On Parsing Perl
In addition, please see the discussion in "Perl 5 Internals (Chapter 5. The Lexer and the Parser)" by Simon Cozens.
Please note that the answer is different for Perl6:
There exists a grammar for Perl6
Rakudo Perl has its own version of the grammar

Other people have posted this link before on similar questions, but I think it is fun and has a great case example: Perl Cannot Be Parsed (A Formal Proof).
From that link:
[Consider] the following devilish
snippet of code, concocted by Randal
Schwartz, and determine the correct
parse for it:
whatever / 25 ; # / ; die "this dies!";
Schwartz's Snippet can parse two different ways: if whatever is nullary
(that is, takes no arguments), the
first statement is a division in void
context, and the rest of the line is a
comment. If whatever takes an
argument, Schwartz's Snippet parses as
a call to the whatever function with
the result of a match operator, then a
call to the die() function.
This means that, in order to statically parse Perl, it must be
possible to determine from a string of
Perl 5 code whether it establishes a
nullary prototype for the whatever
subroutine.
I just post this part to show that it gets really hard really quickly.
Alternatively, many code/text editors can do a decent (though never great) job of syntax highlighting so you may start at those specs to see what they do. In fact you have inspired me, I think I will post a related question asking what editor best highlights Perl.

There is no formal grammar in the sense "this is the specification of Perl 5" (The Perl 6 effort is trying to fix that, though). But there is a formal grammar in the Perl 5 source code. Of course, understanding the code is most likely not a trivial undertaking.
Jeffrey Kegler has written some good articles about the perl grammar as well on his blog. In particular see, this post and this one. The rest of the blog has some quite interesting thoughts on parsing in general as well.

Related

Can Marpa be used to speed up Perl interpreter's parsing?

Can the existing Marpa parser be used to improve parsing of Perl 5 (e.g., replace all or chunks of existing Perl interpreter's parser)?
I am asking on a theoretical level, e.g. ignoring practical considerations such as "if it can, it would cost 10,000 man hours of work".
If not, what are the specific issues preventing the use of Marpa? (again, preferably theoretical ones).
For background of why this is interesting, Jeffrey Kegler (Marpa's author) has posted a somewhat famous article "Perl Cannot Be Parsed: A Formal Proof" on PerlMonks in 2008, which was influenced by his then-current work on Marpa.
Thanks for asking. The perlmonks post and my current parsing work address two different if related questions. Question 1: Is Perl parsing, in its full generality, decidable by a Turing machine? Question 2: Can Marpa, as a practical matter, parse Perl 5?
You might compare the two questions: "Is the behavior of every C program decidable?" and "Can machine X run programs compiled in C?" The answers are, respectively, "no" and "yes for all practical purposes and reasonable choices of X". So my perlmonks post (updated here) is about the theoretical question of whether the syntax of Perl programs in, in its full generality, decidable. Note that the decidability of Perl parsing in that context has nothing to do with Marpa, recursive descent, bison, etc. -- it's about Turing machines.
Question 2 is "Can Marpa drive a practical Perl 5 parser?" The current Perl 5 parser is LALR, with a separate lexer and lots of procedural assistance. Marpa is more powerful than LALR, allows a separate lexer, and offers much more help to procedural code than LALR does. I addressed the speed question in a recent blog post: "Is Earley parsing fast enough?" What I've just said is very telegraphic -- but I hope it will do to outline how I'd justify my "yes" answer to Question 2.
No deep architectural problem stands in the way of a Marpa-driven Perl 5 parser. At this point, it's really a question of comfort level.

Is there runtime flow chart for Perl?

I am trying to better understand logic and flow of exceptions. So i got to state that i really feeled lack of understanding how Perl interpretes and runs programs, which phases are involved and what happens on every phase.
For example, I'd like to understand, when are binded STD* IO and when released, what is happening with $SIG{*} things, how they are depended with execepions, how program dies, etc. I'd like to have better insight of internals mechanics.
I am looking for links or books. I prefer some material which has also visual charts involved but this is not mandatory. I'd like to see some "big picture" of whole process, then i have already possibilities to dig further if i find it necessary.
I found Chapter 18th in Programming Perl gives overview of compiling phase and i try to work it trough, but i appreciate other good sources too.
Some alternative sources (there are not very many):
Mannning's Extending and Embedding Perl, which is the go-to reference on Perl's internals outside of the source
The chapter on the Perl internals in Advanced Perl Programming, which may be exactly what you want
Simon Cozens's Perl internals FAQ
Those may be more focused to what you're looking for. I'm not sure any of them explicitly spells out the interpreter's runtime execution order, though. The first one is a better "I want to work with this stuff" book; the second two are probably good introductory references.
Some of the questions you ask are not, as far as I know, explicitly documented - the I/O question being one I can't think of a good source for in particular. Exception handling is documented very well in Try::Tiny's documentation, and it's what we use for exceptions. Signal handling is messy, but perlipc documents it pretty well. With threads, you may be stuck with unsafe signals - I generally avoid threads in favor of multiple processes unless I must have shared memory.
You might start with these topics accessible via the perldoc program:
Internals and C Language Interface
perlembed Perl ways to embed perl in your C or C++ application
perldebguts Perl debugging guts and tips
perlxstut Perl XS tutorial
perlxs Perl XS application programming interface
perlxstypemap Perl XS C/Perl type conversion tools
perlclib Internal replacements for standard C library functions
perlguts Perl internal functions for those doing extensions
perlcall Perl calling conventions from C
perlmroapi Perl method resolution plugin interface
perlreapi Perl regular expression plugin interface
perlreguts Perl regular expression engine internals
perlapi Perl API listing (autogenerated)
perlintern Perl internal functions (autogenerated)
perliol C API for Perl's implementation of IO in Layers
perlapio Perl internal IO abstraction interface
perlhack Perl hackers guide
perlsource Guide to the Perl source tree
perlinterp Overview of the Perl interpreter source and how it works
perlhacktut Walk through the creation of a simple C code patch
perlhacktips Tips for Perl core C code hacking
perlpolicy Perl development policies
perlgit Using git with the Perl repository

Why is there so much "magic" in Perl?

Looking through the perlsub and perlop manpages I've noticed that there are many references to "magic" and "magical" there (just search any of them for "magic"). I wonder why is Perl so rich in them.
Some examples:
print ++($foo = 'zz') # prints 'aaa'
printf "%d: %s", $! = 1, $! # prints '1: Operation not permitted'
while (my $line = <FH>) { ... } # $line is tested for definedness, not truth
use warnings; print "0 but true" + 1 # "0 but true" is a valid number!
When a Perl feature is described as "magic":
It means that that feature is
implemented by NBA star Magic Johnson.
Whenever Perl executes "magic", it is
actually sending an RPC call to a
remote receiver implanted in Magic
himself. He computes the answer, and
then sends a return message. The use
of Mr. Johnson for all the hard parts
of Perl provides a great abstraction
layer and simplifies porting to new
platforms. It's way easier than, say,
the Apache Portable Runtime.
Source: perrin on Perl Monks
It's official! Perl is more magical.
Hits from the following Google searches:
25 site:ruby-doc.org magic
36 site:docs.python.org magic
497 site:perldoc.perl.org magic
Magic, in Perl parlance is simply the word given to attributes applied to variables / functions that allow an extension of their functionality. Some of this functionality is available directly from Perl, and some requires the use of the C api.
A perfect example of magic is the tie interface which allows you to define your own implementation of a variable. Every operation that can be done to a variable (fetching or storing a value for instance) is exposed for reimplementation, allowing for elegant and logical syntactic constructs like a hash with values stored on disk, which are transparently loaded and saved behind the scenes.
Magic can also refer to the special ways that certain builtins can behave, such as how the first argument to map or grep can either be a block or a bare expression:
my #squares = map {$_**2} 1 .. 10;
my #roots = map sqrt, 1 .. 10;
which is not a behavior available to user defined subroutines.
Many other features of Perl, such as operator overloading or variables that can return different values when used with numeric or string operators are implemented with magic. Context could be seen as magic as well.
In a nutshell, magic is any time that a Perl construct behaves differently than a naive interpretation would suggest, an exception to the rule. Magic is of course very powerful, and should not be wielded without great care. Magic Johnson is of course involved in the execution of all magic (see FM's answer), but that is beyond the scope of this explaination.
I wonder why is Perl so rich in them.
To make things easy.
You'll find that most "magic" in Perl is to simplify the syntax for common tasks.
Because perl always Does What I Mean for some values of always.
I think (opinion more than fact) that this has to do with the organic growth viewpoint that Perl's creator Larry Wall has with the Perl language. Python is a study in the opposite approach, whose style often makes Perl hackers cringe at the perception of being forced to conform to a stylistic regime.
Some of it has to do with Perl being designed to be "efficient" at writing quick scripts to do Perl*-ish* tasks, in both wall clock time, and in keystrokes. Some of it has to do with the TMTOWTDI mantra of Perl and its followers.
Programmers tend to be opinionated about Perl's frequent usage of "magic", for some it is an eye-straining visual cacophony of chaos and disrespect for orderliness (which harkens back to the days of computer Priesthood in white lab coats behind a glass window), for others it is a shining example of getting things done efficiently, if not always obviously to the novice or outsider.
Perl's design philosophy is that simple things must be simple. This sounds good,and to some extent it is. However, there's a tradeoff involved: Making every simple thing a one-liner results in tons of special case hacks to save a few lines of code. Different people have different preferences regarding making simple operations within a language simple versus making the language specification simple. Perl is at one extreme. Java is at the other, at least among languages that people actually use. Python and C# are somewhere in between.

Why are Perl source filters bad and when is it OK to use them?

It is "common knowledge" that source filters are bad and should not be used in production code.
When answering a a similar, but more specific question I couldn't find any good references that explain clearly why filters are bad and when they can be safely used. I think now is time to create one.
Why are source filters bad?
When is it OK to use a source filter?
Why source filters are bad:
Nothing but perl can parse Perl. (Source filters are fragile.)
When a source filter breaks pretty much anything can happen. (They can introduce subtle and very hard to find bugs.)
Source filters can break tools that work with source code. (PPI, refactoring, static analysis, etc.)
Source filters are mutually exclusive. (You can't use more than one at a time -- unless you're psychotic).
When they're okay:
You're experimenting.
You're writing throw-away code.
Your name is Damian and you must be allowed to program in latin.
You're programming in Perl 6.
Only perl can parse Perl (see this example):
#result = (dothis $foo, $bar);
# Which of the following is it equivalent to?
#result = (dothis($foo), $bar);
#result = dothis($foo, $bar);
This kind of ambiguity makes it very hard to write source filters that always succeed and do the right thing. When things go wrong, debugging is awkward.
After crashing and burning a few times, I have developed the superstitious approach of never trying to write another source filter.
I do occasionally use Smart::Comments for debugging, though. When I do, I load the module on the command line:
$ perl -MSmart::Comments test.pl
so as to avoid any chance that it might remain enabled in production code.
See also: Perl Cannot Be Parsed: A Formal Proof
I don't like source filters because you can't tell what code is going to do just by reading it. Additionally, things that look like they aren't executable, such as comments, might magically be executable with the filter. You (or more likely your coworkers) could delete what you think isn't important and break things.
Having said that, if you are implementing your own little language that you want to turn into Perl, source filters might be the right tool. However, just don't call it Perl. :)
It's worth mentioning that Devel::Declare keywords (and starting with Perl 5.11.2, pluggable keywords) aren't source filters, and don't run afoul of the "only perl can parse Perl" problem. This is because they're run by the perl parser itself, they take what they need from the input, and then they return control to the very same parser.
For example, when you declare a method in MooseX::Declare like this:
method frob ($bubble, $bobble does coerce) {
... # complicated code
}
The word "method" invokes the method keyword parser, which uses its own grammar to get the method name and parse the method signature (which isn't Perl, but it doesn't need to be -- it just needs to be well-defined). Then it leaves perl to parse the method body as the body of a sub. Anything anywhere in your code that isn't between the word "method" and the end of a method signature doesn't get seen by the method parser at all, so it can't break your code, no matter how tricky you get.
The problem I see is the same problem you encounter with any C/C++ macro more complex than defining a constant: It degrades your ability to understand what the code is doing by looking at it, because you're not looking at the code that actually executes.
In theory, a source filter is no more dangerous than any other module, since you could easily write a module that redefines builtins or other constructs in "unexpected" ways. In practice however, it is quite hard to write a source filter in a way where you can prove that its not going to make a mistake. I tried my hand at writing a source filter that implements the perl6 feed operators in perl5 (Perl6::Feeds on cpan). You can take a look at the regular expressions to see the acrobatics required to simply figure out the boundaries of expression scope. While the filter works, and provides a test bed to experiment with feeds, I wouldn't consider using it in a production environment without many many more hours of testing.
Filter::Simple certainly comes in handy by dealing with 'the gory details of parsing quoted constructs', so I would be wary of any source filter that doesn't start there.
In all, it really depends on the filter you are using, and how broad a scope it tries to match against. If it is something simple like a c macro, then its "probably" ok, but if its something complicated then its a judgement call. I personally can't wait to play around with perl6's macro system. Finally lisp wont have anything on perl :-)
There is a nice example here that shows in what trouble you can get with source filters.
http://shadow.cat/blog/matt-s-trout/show-us-the-whole-code/
They used a module called Switch, which is based on source filters. And because of that, they were unable to find the source of an error message for days.

Why is Perl the best choice for most string manipulation tasks?

I've heard that Perl is the go-to language for string manipulation (and line noise ;). Can someone provide examples and comparisons with other language(s) to show me why?
It is very subjective, so I wouldn't say that Perl is the best choice, but it is certainly a valid choice for string manipulation. Other alternatives are Tcl, Python, AWK, etc.
I like Perl's capabilities because it has excellent support (better than POSIX as pointed out in the comment) for fast regexs and the implicit variables makes it easy to do basic string crunching with very little code.
If you have a *nix background a lot of what you already know will apply to Perl as well, which makes it fairly easy to pick up for a lot of people.
Perl -> Practical Extraction and Reporting Language
Perl's strength(when it comes to string processing) lies in it's very powerful Regular expression engine.
Because of this there are many people in the field of BioInformatics using Perl as their
main tool, hence the large number of posts about BioPerl on PerlMonks . In BioInformatics they work with strings a lot , they call them "sequences"(I don't know much about this).
Perlmonks.org is the heart of the Perl community, check out the immense number of hits
when you search for site:perlmonks.org regex 20,000 hits
You cannot ignore the sheer number of modules on CPAN:
375 modules under the namespace String on CPAN(Perl's module repository)
241 in Regex namespace
156 in Regexp namespace.
This is very clear evidence that Perl is a very powerful language when it comes to string processing.
So if you want to do some string processing and you're using Perl, you've got it covered :)
To address the second part of your question: Perl's reputation for line noise comes from 4 kinds of people:
Overly clever (for their own good) hackers (or sometimes just hacks) who value cleverness and showing off over readability. "If it was hard to write it should be hard to read" is NOT just a mythical attitude.
People who wouldn't know good software development if it hit them over the head with a cluebat. Such as people who save a couple of characters in a program by using $_ instead of a named variable. In a nested scope. Or never heard of comments. Or self-documenting identifiers. Or whitespace.
People who think that software development == code golf. More seriously, that the less the amount of characters in the code, the more readable it is, because they misunderstand what "conciseness" means in code.
(NOTE: first 2 sets are not mutually exclusive)
People who code/hack in perl (e.g. SysAdmins) who have very little training, experience or incentive to do software development. E.g. the percentage of people using Perl who do quick and dirty hacks with bad style and worse code quality is probably higher than, say Python.
Just for reference, 80% of awful Perl "code" in my $work falls under this - it was written by financial analysts who are smart enough to pick up a Perl book and some earlier scripts, clone off a script that does what business need is, and don't have CS/programming background to worry about how readable/maintainable their code was.
In other (and less snide) words, you can write beautiful, incredibly readable and easy to maintain software in Perl. It all depends on who does the writing, what their priorities and skills are. Also, just like with any other language, you can write a miserable write-only mess with it.
The difference from other languages is that very often, the write-onlyness of said mess, when done in Perl, does indeed consist of very high density of non-letter characters (sygils and special characters in poorly written RegExes). This high density can indeed, asymptotically approximate line noise.
Because It is what is perl made for. Because Perl is expressive, powerful and fast. I have beaten many times specialized products with small and dirty script in perl written in few minutes. For example, outer join and large join vs. MySQL (just because can't do merge join), ETL processing vs. Java Hadoop (because I have years experience to write it effectively and perl IO layer is just great) and so and so.
It's a very subjective question. Perhaps the true answer is that Perl has a nice syntax (incl. the regex syntax) that makes people want to sign it high praises over other languages? IMHO, any language that supports a rich regex syntax would be considerablly powerfull at string manipulation.
Kids these days! Back in the day, all we had was SNOBOL -- and we liked it! Try it sometime...you never know, you might want something respectable to fall back on when this Perl fad runs its course!
Perl is widely used for string manipulation tasks as its string manipulation API is easy to learn. And also its regex is widely used. It has been in use for a very long time and anyone with a Unix background would pick up perl very easily. Historically, perl was developed in the late 80's for report processing tasks and was "originally" developed for text processing tasks. So till date, the trend continues as anyone with a string manipulation task or text processing task would opt for perl as the first choice. Its not that other languages like python arent up to the task, but perl's popular in this area.
I like Perl a lot, write books about it, publish a magazine about it, and so on. I don't think I would ever say it's the best language to do anything in. A lot of that has to do with the task you need to do. For many string processing tasks, ETL, data cleanup, and so in, Perl is a very strong and capable language. You wouldn't have that much trouble doing simple tasks.
Your comment sounds like it comes from the early 1990s though, when the rest of the world hadn't caught up. Many of the dynamic languages are now up to task, so you might not have to switch languages. If you decide to use Perl and run into problems, there are plenty of people here who are willing to help, and not all of us will fault you if you choose something else. :)
At the beginning, Perl was developed for easy report processing and dealing with text files, thus it's got a very strong REGEX support. Most of the info on REGEX you can find in perldoc.
Perl was the go-to language for a long time. The problem is it can be pretty messy and difficult to maintain (some people can write Perl that avoids this, but it is very easy to wrote ugly code). I would not tell you to avoid Perl, but many have moved on to some modern alternatives.
I would recommend learning one of the newer scripting languages such as Python or Ruby. Both will work very well for your needs, and can easily handle more difficult tasks later on. They're both quite nice to work in, after having written C and Perl for so long.
In short, Perl would be a good hammer for this nail. Python and Ruby would be nail-guns.
I disagree that Perl is the best language for text processing. Simple things are easy; to replace foo with bar:
$data =~ s/foo/bar/g;
Harder things are not simple, though. Look at Data::SExpression, for example. It is a lot of code to do something very simple.
An similar implementation in Haskell with PArrow looks something like:
import Text.ParserCombinators.PArrow
data Atom = QuotedString String | Symbol String
deriving (Show, Eq)
data Sexp = Sexp [Sexp] | Atom Atom
deriving (Eq)
quotedString :: Char -> Char -> MD a Atom
quotedString quoteChar escapeChar = between q q inside >>^ QuotedString
where q = char quoteChar
inside = many $ (char escapeChar >>> anyChar) <+> notChar quoteChar
doubleQuotedString, symbol :: MD a Atom
doubleQuotedString = quotedString '"' '\\'
symbol = word >>^ Symbol
atom, sexp :: MD a Sexp
atom = (doubleQuotedString <+> symbol) >>^ Atom
sexp = atom <+> (between (char '(') (char ')') sexp' >>^ Sexp)
where sexp' = sepBy1 sexp spaces
Just sayin'. Perl is not the end-all-and-be-all of text manipulation. There are many reasons to prefer Perl to other languages, but parsing is not one of them.