Generate syntactically correct sentences from an ANTLR grammar

I have an Xtext/ANTLR grammar that parses a subset of CoffeeScript. I have some test cases, but I thought of another sort of test:
1) Generate random, syntactically correct snippets from my ANTLR grammar
2) Feed these snippets to the original CoffeeScript parser (calling coffee -ne "the sentence")
3) Check whether each sentence is parsed by CoffeeScript
Thus I could ensure that my parser accepts a proper subset and is not too permissive in some cases. Now I am stuck at the first step: how can I generate sentences from my ANTLR grammar (which also makes heavy use of syntactic predicates)? In other words, I'm interested in the opposite of parsing a sentence.
I found some related attempts, but the answers don't use ANTLR at all; they use a custom grammar in Python, Clojure, or Ruby. I'd prefer a working solution rather than a hint about how it could be implemented.

No, you can't do this. If you look at the code that ANTLR generates, you can see that it's only a recognizer, not a generator.
The links you provided are your best bet: take your ANTLR grammar, strip out the actions and predicates so that only a plain formal grammar remains, and then run it through one of those programs.
Or, if your CoffeeScript subset is very small, you could take the approach of generating strings of random tokens and throwing away all the strings that don't parse.
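For a flavor of what such a generator does, here is a minimal sketch in Python; the toy grammar below is purely illustrative, not your CoffeeScript subset, and a real tool would read the grammar from a file:

import random

# Toy grammar: nonterminals map to lists of alternatives; each
# alternative is a sequence of symbols. Symbols that are not keys
# in the dict are treated as terminals.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["NUMBER"], ["(", "expr", ")"]],
}

def generate(symbol="expr", depth=0):
    if symbol not in GRAMMAR:                 # terminal: emit it as-is
        return "42" if symbol == "NUMBER" else symbol
    alts = GRAMMAR[symbol]
    # Past a depth limit, always take the shortest alternative so the
    # expansion terminates instead of recursing forever.
    alt = random.choice(alts) if depth < 8 else min(alts, key=len)
    return " ".join(generate(s, depth + 1) for s in alt)

for _ in range(5):
    print(generate())

Note that this says nothing about your syntactic predicates: a generator built this way can emit sentences the predicated grammar would reject, which is exactly why filtering the output through a real parser is still necessary.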

Related

Implementing language translators in Racket

I am implementing, in Racket, an interpreter that generates code in another language. As a novice, I'm trying to avoid macros to the extent that I can ;) Hence I came up with the following "interpreter":
(define op (open-output-bytes))

(define (interpret arg)
  (define r
    (syntax-case arg (if)
      [(if a b) #'(fprintf op "if (~a) {~a}" a b)]))
  ; other cases here
  (eval r))
This looks a bit clumsy to me. Is there a "best practice" for doing this? Am I doing a totally crazy thing here?
Short answer: yes, this is a reasonable thing to do. The way in which you do it is going to depend a lot on the specifics of your situation, though.
You're absolutely right to observe that generating programs as strings is an error-prone and fragile way to do it. Avoiding this, though, requires being able to express the target language at a higher level, in order to circumvent that language's parser.
Again, it really has a lot to do with the language that you're targeting, and how good a job you want to do. I've hacked together things like this for generating Python myself, in a situation where I knew I didn't have time to do things right.
EDIT: oh, you're doing Python too? Bleah! :)
You have a number of different choices. Your cleanest choice is to generate a representation of Python AST nodes, so you can either inject them directly or use existing serialization. You're going to ask me whether there are libraries for this, and ... I fergits. I do believe that the current Python architecture includes ... okay, yes, I went and looked, and you're in good shape: Python's parser produces ASTs, and it looks like AST nodes can be constructed directly via the ast module.
https://docs.python.org/3/library/ast.html#module-ast
I'm guessing your cleanest path would be to generate JSON that represents these AST nodes, then write a Python stub that translates that JSON into Python ASTs.
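For instance, on Python 3.8+ (ast.unparse needs 3.9+), AST nodes can be built by hand and compiled directly; a minimal sketch:

import ast

# Build the module `print(42)` node by node.
tree = ast.Module(
    body=[ast.Expr(
        value=ast.Call(
            func=ast.Name(id="print", ctx=ast.Load()),
            args=[ast.Constant(value=42)],
            keywords=[]))],
    type_ignores=[])

ast.fix_missing_locations(tree)   # fill in line/column info the compiler requires
print(ast.unparse(tree))          # -> print(42)
exec(compile(tree, "<generated>", "exec"))   # -> 42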
All of this assumes that you want to take the high road; there's a broad spectrum of in-between approaches involving simple generalizations of Python syntax (e.g.: oh, it looks like this kind of statement has a colon followed by an indented block of code, etc.).
If your source language shares syntax with Racket, then use read-syntax to produce a syntax object representing the input program. Then use recursive descent with syntax-case or syntax-parse to discern between the various constructs.
Instead of printing directly to an output port, I recommend building a tree of elements (strings, numbers, symbols, etc.). The last step is then to print all the elements of the tree. Representing the output as a tree is very flexible: it allows you to handle subexpressions out of order, and to efficiently concatenate output from different sources.
Macros are not needed.
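To illustrate the tree-of-elements idea, here is a minimal sketch (in Python for brevity; emit_if and flatten are hypothetical helpers, not part of any library):

def emit_if(cond, body):
    # Return a tree of fragments instead of printing to a port.
    return ["if (", cond, ") {", body, "}"]

def flatten(tree):
    # Final step: walk the tree and concatenate every leaf.
    if isinstance(tree, list):
        return "".join(flatten(t) for t in tree)
    return str(tree)

print(flatten(emit_if("x > 0", ["y = 1;", " z = 2;"])))
# -> if (x > 0) {y = 1; z = 2;}

Because sub-trees are ordinary values, you can build them in any order and splice them together at the end.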

Can CoffeeScript be compiled to Haxe?

I'm learning Haxe now, and I'm wondering if it's possible for any programming language to be compiled to Haxe (instead of from Haxe). If there isn't any programming language that can be compiled to Haxe in its entirety, then could at least a small subset of a programming language (such as CoffeeScript) be compiled to Haxe?
At this time, there is no way to compile CoffeeScript or anything similar into Haxe.
CoffeeScript is a source-to-source compiler, so you would need to change it from compiling CoffeeScript->JS to compiling CoffeeScript->Haxe.
I'm not sure how difficult that would be, and you have to remember that Haxe has a bunch of features that JavaScript doesn't, all of which would need to be represented in the "new" CoffeeScript: type information, enums, typedefs, iterators, macros, conditional compilation, untyped blocks, metadata, property access, etc. You would need to figure out how to represent each of these in CoffeeScript in a way that doesn't conflict with itself or with existing syntax.
I too have thought it might be nice, as CoffeeScript has such a clean syntax, but after looking at the complexity of getting it to work I decided that curly brackets and semicolons aren't so bad :)

Parsing different files of the same grammar and calculating file-to-file similarities

I've got a bunch of ACPI Source Language files and I want to calculate file-to-file similarities between them. I thought of using something like Perl's Parse::RecDescent,
but I am stuck at:
1) Translating the ACPI grammar (www.acpi.info/DOWNLOADS/ACPIspec40a.pdf) into something Parse::RecDescent would understand
2) Having a metric to compare two parsed files
Any ideas?
To get started with Parse::RecDescent you may look at Pro Perl Parsing, Ch. 5, or at Advanced Perl Programming, Ch. 2.
XML diff tools should be appropriate for comparing hierarchically structured data; perhaps you can apply such a tool to ASTs saved in XML format.
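As a sketch of that idea (in Python, with a hypothetical (label, children) tuple shape for AST nodes), you could serialize each tree with the standard xml.etree module and hand the results to any XML diff tool:

import xml.etree.ElementTree as ET

def ast_to_xml(node):
    # node is a hypothetical (label, [children]) tuple.
    label, children = node
    elem = ET.Element(label)
    for child in children:
        elem.append(ast_to_xml(child))
    return elem

tree = ("Device", [("Name", []), ("Method", [("Return", [])])])
print(ET.tostring(ast_to_xml(tree), encoding="unicode"))
# -> <Device><Name /><Method><Return /></Method></Device>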
So you have two problems:
Parsing ACPI to build an AST. This has the usual troubles: ensuring that you have a well-defined grammar, that your parsing machinery can parse according to that grammar (often you have to bend a good grammar definition to enable the parsing machinery to process it), and building a corresponding AST. You will have these troubles with Perl parsing machinery too, simply because it is a parsing engine like any other.
Comparing the structure of the ASTs and producing a sensible answer. What you are likely to find here is that there is some literature describing roughly how to do this (using, e.g., Levenshtein distance; see "Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction"), but that the details for ASTs matter. Finally, having determined the distance, you need to print out the deltas in some readable form.
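Real tree differencing (e.g., Zhang-Shasha tree edit distance) is considerably more involved, but as a crude first metric you can compare the multisets of node labels of two ASTs; a sketch in Python, using the same hypothetical (label, children) shape as above:

from collections import Counter

def labels(node):
    # Collect a multiset of all node labels in the tree.
    label, children = node
    c = Counter([label])
    for child in children:
        c += labels(child)
    return c

def similarity(a, b):
    # Jaccard-style overlap of label multisets: 1.0 means identical bags.
    ca, cb = labels(a), labels(b)
    inter = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return inter / union if union else 1.0

t1 = ("Device", [("Name", []), ("Method", [("Return", [])])])
t2 = ("Device", [("Name", []), ("Method", [])])
print(similarity(t1, t2))   # -> 0.75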
However, AFAIK, my company is the only one that has reduced this to practice. See our Smart Differencer tool. The SmartDifferencers parse, build ASTs, and report changes in terms of AST elements moved, inserted, deleted, replaced, or modified by consistent identifier substitution. They depend on an underlying very strong GLR parsing engine, which minimizes the problems of accepting new grammars. They work for many common languages but not presently for ACPI.

Where can I find a formal grammar for the Perl programming language?

I understand that the Perl syntax is ambiguous and that its disambiguation is non-trivial (sometimes involving execution of code during the compile phase). Regardless, does Perl have a formal grammar (albeit ambiguous and/or context-sensitive)?
From perlfaq7:
Can I get a BNF/yacc/RE for the Perl language?
There is no BNF, but you can paw your way through the yacc grammar in perly.y in the source distribution if you're particularly brave. The grammar relies on very smart tokenizing code, so be prepared to venture into toke.c as well.
In the words of Chaim Frenkel: "Perl's grammar can not be reduced to BNF. The work of parsing perl is distributed between yacc, the lexer, smoke and mirrors."
To see a wonderful set of examples of WHY it's pretty much impossible to parse Perl due to context influences, please look into Randal Schwartz's post On Parsing Perl.
In addition, please see the discussion in "Perl 5 Internals (Chapter 5. The Lexer and the Parser)" by Simon Cozens.
Please note that the answer is different for Perl 6:
There exists a grammar for Perl 6
Rakudo Perl has its own version of the grammar
Other people have posted this link before on similar questions, but I think it is fun and has a great case example: Perl Cannot Be Parsed (A Formal Proof).
From that link:
[Consider] the following devilish snippet of code, concocted by Randal Schwartz, and determine the correct parse for it:
whatever / 25 ; # / ; die "this dies!";
Schwartz's Snippet can parse two different ways: if whatever is nullary (that is, takes no arguments), the first statement is a division in void context, and the rest of the line is a comment. If whatever takes an argument, Schwartz's Snippet parses as a call to the whatever function with the result of a match operator, then a call to the die() function.
This means that, in order to statically parse Perl, it must be possible to determine from a string of Perl 5 code whether it establishes a nullary prototype for the whatever subroutine.
I just post this part to show that it gets really hard really quickly.
Alternatively, many code/text editors can do a decent (though never great) job of syntax highlighting, so you may start at those specs to see what they do. In fact, you have inspired me; I think I will post a related question asking which editor best highlights Perl.
There is no formal grammar in the sense of "this is the specification of Perl 5" (the Perl 6 effort is trying to fix that, though). But there is a formal grammar in the Perl 5 source code. Of course, understanding the code is most likely not a trivial undertaking.
Jeffrey Kegler has written some good articles about the Perl grammar on his blog as well. In particular, see this post and this one. The rest of the blog has some quite interesting thoughts on parsing in general.

Why are Perl source filters bad and when is it OK to use them?

It is "common knowledge" that source filters are bad and should not be used in production code.
When answering a similar but more specific question, I couldn't find any good references that explain clearly why filters are bad and when they can be safely used. I think now is the time to create one.
Why are source filters bad?
When is it OK to use a source filter?
Why source filters are bad:
Nothing but perl can parse Perl. (Source filters are fragile.)
When a source filter breaks pretty much anything can happen. (They can introduce subtle and very hard to find bugs.)
Source filters can break tools that work with source code. (PPI, refactoring, static analysis, etc.)
Source filters are mutually exclusive. (You can't use more than one at a time -- unless you're psychotic).
When they're okay:
You're experimenting.
You're writing throw-away code.
Your name is Damian and you must be allowed to program in Latin.
You're programming in Perl 6.
Only perl can parse Perl (see this example):
@result = (dothis $foo, $bar);
# Which of the following is it equivalent to?
@result = (dothis($foo), $bar);
@result = dothis($foo, $bar);
This kind of ambiguity makes it very hard to write source filters that always succeed and do the right thing. When things go wrong, debugging is awkward.
After crashing and burning a few times, I have developed the superstitious approach of never trying to write another source filter.
I do occasionally use Smart::Comments for debugging, though. When I do, I load the module on the command line:
$ perl -MSmart::Comments test.pl
so as to avoid any chance that it might remain enabled in production code.
See also: Perl Cannot Be Parsed: A Formal Proof
I don't like source filters because you can't tell what code is going to do just by reading it. Additionally, things that look like they aren't executable, such as comments, might magically be executable with the filter. You (or more likely your coworkers) could delete what you think isn't important and break things.
Having said that, if you are implementing your own little language that you want to turn into Perl, source filters might be the right tool. However, just don't call it Perl. :)
It's worth mentioning that Devel::Declare keywords (and starting with Perl 5.11.2, pluggable keywords) aren't source filters, and don't run afoul of the "only perl can parse Perl" problem. This is because they're run by the perl parser itself, they take what they need from the input, and then they return control to the very same parser.
For example, when you declare a method in MooseX::Declare like this:
method frob ($bubble, $bobble does coerce) {
... # complicated code
}
The word "method" invokes the method keyword parser, which uses its own grammar to get the method name and parse the method signature (which isn't Perl, but it doesn't need to be -- it just needs to be well-defined). Then it leaves perl to parse the method body as the body of a sub. Anything anywhere in your code that isn't between the word "method" and the end of a method signature doesn't get seen by the method parser at all, so it can't break your code, no matter how tricky you get.
The problem I see is the same problem you encounter with any C/C++ macro more complex than defining a constant: It degrades your ability to understand what the code is doing by looking at it, because you're not looking at the code that actually executes.
In theory, a source filter is no more dangerous than any other module, since you could easily write a module that redefines builtins or other constructs in "unexpected" ways. In practice, however, it is quite hard to write a source filter in a way where you can prove that it's not going to make a mistake. I tried my hand at writing a source filter that implements the perl6 feed operators in perl5 (Perl6::Feeds on CPAN). You can take a look at the regular expressions to see the acrobatics required simply to figure out the boundaries of expression scope. While the filter works, and provides a test bed to experiment with feeds, I wouldn't consider using it in a production environment without many, many more hours of testing.
Filter::Simple certainly comes in handy by dealing with 'the gory details of parsing quoted constructs', so I would be wary of any source filter that doesn't start there.
In all, it really depends on the filter you are using and how broad a scope it tries to match against. If it is something simple like a C macro, then it's "probably" OK, but if it's something complicated then it's a judgement call. I personally can't wait to play around with Perl 6's macro system. Finally Lisp won't have anything on Perl :-)
There is a nice example here that shows what trouble you can get into with source filters:
http://shadow.cat/blog/matt-s-trout/show-us-the-whole-code/
They used a module called Switch, which is based on source filters. And because of that, they were unable to find the source of an error message for days.