Invalid character stream macros - macros

The following preprocessor macro:
#define _VARIANT_BOOL /##/
is not actually valid C; roughly speaking, the reason is that the preprocessor is defined as working on a stream of tokens, whereas the above assumes that it works on a stream of characters.
On the other hand, unfortunately the above actually occurs in a Microsoft header file, so I have to handle it anyway. (I'm working on a preprocessor implementation.)
What other cases have people encountered in the wild (including legacy code, however old, as long as it may still be in use) of preprocessor macros that are not actually valid, but work anyway because they were written for compilers with a character-oriented preprocessor implementation?
(Rationale: I'm trying to get some idea in advance how many special cases I'm going to have to hack, if I write a proper clean standard-conforming token oriented implementation.)

The relevant part of the standard (§ The ## operator) says:
If the result is not a valid preprocessing token, the behavior is undefined.
This means that your preprocessor can do anything it likes and still be standard conforming, including emulating the common behaviour.
I think you can still have a "token-based" implementation and support this behaviour, by specifying that when the result of the ## operator is not a valid preprocessing token, the result is the two operand tokens unchanged. You may also want to have your preprocessor emit a warning about the invalid code.
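One way to picture the fallback suggested above, as a minimal sketch in C rather than a real preprocessor: try the concatenation, and if the result is not a single token, hand back both operands unchanged and flag a warning. The names is_single_pp_token and paste_tokens, the 64-byte buffers, and the deliberately tiny token subset are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if s is exactly one token from a small
   subset of C preprocessing tokens (identifiers, simple pp-numbers, a few
   punctuators). A real preprocessor would reuse its own lexer here. */
int is_single_pp_token(const char *s)
{
    static const char *const puncts[] = {
        "+", "-", "*", "/", "<", ">", "=", "#",
        "++", "--", "<<", ">>", "<=", ">=", "==", "->", "##",
    };
    size_t i;

    if (s[0] == '\0')
        return 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        for (i = 1; s[i] != '\0'; i++)
            if (!isalnum((unsigned char)s[i]) && s[i] != '_')
                return 0;
        return 1;
    }
    if (isdigit((unsigned char)s[0])) {
        for (i = 1; s[i] != '\0'; i++)
            if (!isalnum((unsigned char)s[i]) && s[i] != '_' && s[i] != '.')
                return 0;
        return 1;
    }
    for (i = 0; i < sizeof puncts / sizeof puncts[0]; i++)
        if (strcmp(s, puncts[i]) == 0)
            return 1;
    return 0;
}

/* Pastes lhs and rhs as the ## operator would. On success, writes one token
   to out[0] and returns 1. If the concatenation is not a valid token, writes
   the two operands unchanged to out[0] and out[1], sets *warned so the
   caller can emit a diagnostic, and returns 2. */
int paste_tokens(const char *lhs, const char *rhs,
                 char out[2][64], int *warned)
{
    char joined[64];

    assert(strlen(lhs) + strlen(rhs) < sizeof joined);
    snprintf(joined, sizeof joined, "%s%s", lhs, rhs);
    *warned = 0;
    if (is_single_pp_token(joined)) {
        strcpy(out[0], joined);
        return 1;
    }
    *warned = 1;
    strcpy(out[0], lhs);
    strcpy(out[1], rhs);
    return 2;
}
```

In this sketch, pasting foo and bar yields the single token foobar, while pasting / and / (the _VARIANT_BOOL case) yields the two original tokens plus a warning flag.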


How does one implement the typedef hack in an antlr4 grammar

I don't need typedef's exactly. I need aliases (for a shell language). But the hack of looking up an identifier and returning a different token type is what I need to make the grammar work. I don't necessarily need it to be done in the lexer, although that would seem cleanest to me (or in a phase between the lexer and parser).
Here is (a fragment of) the closest I can seem to come to a solution given what I know of ANTLR4, but it requires a whole level of non-terminals for each keyword token. Note that, per ANTLR4 convention, capitalized words are tokens and lower-case words are non-terminals.
aliasstmt: alias ident ident; // rule that makes aliases
ifstmt: if expression then statement; // sample rule with two keywords
// non-terminals converting aliases into keywords
alias: Alias // normal token for keyword
// hack: LookupAlias is the map I need.
| { LookupAlias(_input.LT(1).getText()).equals("alias") }? Ident
;
if : If
| { LookupAlias(_input.LT(1).getText()).equals("if") }? Ident
;
then : Then
| { LookupAlias(_input.LT(1).getText()).equals("then") }? Ident
;
// Non-terminal going the other way, converting keywords to identifiers when needed
ident : Ident
| Alias
| If
| Then
;
Now, I suppose, I could get rid of the Tokens for the keywords and do it all in the parser for this example. It wouldn't completely work in the language I'm parsing because a significant number of the keywords have "normal" spellings like "Set-Alias" or "-Name" which are not legal identifiers (and "Set - Alias" or "Set -Alias" is not the same as "Set-Alias", uggh).
However, I want the LookupAlias() function to be its own Java class, not something just embedded in the parser. There are other times I need to use it that aren't part of parsing, and those uses need to be coordinated. How to do that is a separate question I will ask.
(Caveat... maybe aliases can be used in a shell in places I don’t know about, so this is based on my understanding)
In a shell, an alias is essentially an identifier that is expanded when it’s encountered. It’s only expected where a command could occur, and since you can’t know all the commands in the path, your grammar would likely have an IDENTIFIER token (or the like) at that location in the parser rule.
You’d then check it against a list of built-in commands, commands in your PATH, and aliases (I’m not sure of the precedence, TBH).
So, you’d need to keep a symbol table to look up the alias resolution. I think post-resolution is where things will get “tricky”. IIRC, aliases don’t have to be syntactically complete, so you couldn’t really expect to pre-parse them (they possibly won’t parse correctly). Also, they are pretty much “injected” into the input stream. In this way they’re much more like preprocessor macros. I don’t see much way around detecting them, building an expanded input stream, and lexing/parsing it.
I suppose that you could write a custom TokenStream that detected aliases and responded to getNextToken() (and methods to get the token at a particular index, etc.). That would allow aliases anywhere in the token stream, which could get weird, and it would probably be the devil to provide useful error messages. (I guess you’d just have to point them at the alias itself.) This approach would supply the alias definition tokens in place of the alias as the parser asked for the next token. I don’t see a way that you’ll use actions/predicates to change ANTLR’s mind about what token it just saw :).
I suspect that playing with existing shells a bit, creating invalid alias substitutions on the command line, and observing the error messages might give insight into how other shells handle it. My impression is that the shell preprocesses the input, substituting things like aliases and ENV variables, and then re-parses the result for execution.
I’m pretty sure trying to modify the token stream while the parser is already processing it is either not doable or the path to madness.
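To make the preprocess-then-reparse idea concrete, here is a rough sketch in C (not ANTLR or Java) of expanding an alias in command position before the text ever reaches a lexer. The alias table contents and the names lookup_alias and expand_command_alias are hypothetical; the sketch also assumes the command word starts at the first character of the line and does not re-expand the result.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical alias table; a real shell would fill this at run time. */
struct alias { const char *name; const char *expansion; };
static const struct alias aliases[] = {
    { "ll", "ls -l" },
    { "sa", "Set-Alias" },
};

/* Returns the expansion for word[0..len), or NULL if it is not an alias. */
const char *lookup_alias(const char *word, size_t len)
{
    size_t i;

    for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
        if (strlen(aliases[i].name) == len &&
            strncmp(aliases[i].name, word, len) == 0)
            return aliases[i].expansion;
    return NULL;
}

/* Copies line into out, replacing the command word (the first word) with
   its alias expansion if it has one. Words in other positions are left
   alone. Returns 0 on success, -1 if out is too small. */
int expand_command_alias(const char *line, char *out, size_t out_size)
{
    size_t word_len = strcspn(line, " \t");
    const char *expansion = lookup_alias(line, word_len);
    int n;

    assert(out != NULL && out_size > 0);
    if (expansion != NULL)
        n = snprintf(out, out_size, "%s%s", expansion, line + word_len);
    else
        n = snprintf(out, out_size, "%s", line);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The expanded text would then be lexed and parsed as if the user had typed it, which is why errors inside an expansion are hard to report in terms of the original input.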

gcc precompiler directive __attribute__ ((__cleanup__)) vs ((cleanup)) (with vs without underscores?)

I'm learning about gcc's cleanup attribute, and learning how it calls a function to be run when a variable goes out of scope, and I don't understand why you can use the word "cleanup" with or without underscores. Where is the documentation for, or documentation of, the version with underscores?
The gcc documentation above shows it like this:
__attribute__ ((cleanup(cleanup_function)))
However, most code samples I read, show it like this:
__attribute__ ((__cleanup__(cleanup_function)))
Note that the first example link states they are identical, and of course coding it proves this, but how did he know this originally? Where did this come from?
Why the difference? Where is __cleanup__ defined or documented, as opposed to cleanup?
My fundamental problem lies in the fact that I don't know what I don't know, therefore I am trying to expose some of my unknown unknowns so they become known unknowns, until I can study them and make them known knowns.
My thinking is that perhaps there is some globally-applied principle to gcc preprocessor directives, where you can arbitrarily add underscores before or after any of them? -- Or perhaps only some of them? -- Or perhaps it modifies the preprocessor directive or attribute somehow and there are cases where one method, with or without the extra underscores, is preferred over the other?
You are allowed to define a macro cleanup, as it is not a name that is reserved to the compiler. You are not allowed to define one named __cleanup__. This guarantees that your code using __cleanup__ is unaffected by other code (provided that other code behaves, of course).
As the GCC documentation explains:
You may optionally specify attribute names with __ preceding and following the name. This allows you to use them in header files without being concerned about a possible macro of the same name. For example, you may use the attribute name __noreturn__ instead of noreturn.
(But note that attributes are not preprocessor directives.)
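A small sketch of why the underscored spelling exists, assuming GCC or Clang (the cleanup attribute is a compiler extension). The names note_cleanup, cleanup_calls, plain_spelling and underscored_spelling are made up for the example, and the #define stands in for some unrelated header that happens to define a macro called cleanup.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the handler ran, so the effect is observable. */
int cleanup_calls = 0;

/* The handler receives a pointer to the variable going out of scope. */
void note_cleanup(int *value)
{
    assert(value != NULL);
    cleanup_calls++;
}

void plain_spelling(void)
{
    /* Works only while no macro named cleanup is in effect. */
    int a __attribute__((cleanup(note_cleanup))) = 1;
    (void)a;
}

/* Simulates an unrelated header defining a macro with the same name.
   From here on, writing cleanup(...) inside an attribute would be
   rewritten by the preprocessor and no longer name the attribute. */
#define cleanup some_unrelated_macro_body

void underscored_spelling(void)
{
    /* __cleanup__ is a reserved identifier, so the macro above cannot
       touch it and the attribute is still recognised. */
    int b __attribute__((__cleanup__(note_cleanup))) = 2;
    (void)b;
}

#undef cleanup
```

Calling both functions runs the handler twice, once per variable leaving scope, which shows the two spellings name the same attribute.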

What does internal mean in function names in Emacs Lisp?

Some people use double dash to indicate that the function is subject to change:
What does the double minus (--) convention in function names mean in Emacs Lisp
Does including internal in function names mean similar things?
Two examples
The function where-is-internal has a detailed docstring and is mentioned in the manual as well. Is where-is-internal an exception?
Is there a difference between having -internal as suffix and having internal- as prefix?
Adding to confusion, there are also function names with internal-- (with double dash) as prefix.
The confusion is not just in the naming convention (variability due to history and perhaps sometimes whim). The confusion is in the very notion of "internal" in free software, where the source code is open to everyone to use or modify (even fork) as they please.
To answer your question from (what I think is) the point of view of Emacs Dev, and thus in terms of the underlying intention: "internal" means that someone using such a function is perhaps more likely to encounter future changes in the Emacs-Dev implementation and use of that function than might be the case for a non-"internal" function. IOW, you might not want to count on it remaining as it is now. That's all.
But there's a lot of "perhaps", "more likely", and "might" in there. In practice, some non-"internal" functions change more radically or more quickly than some "internal" functions. It might be the case that for the former there will be a deprecation grace period, during which the pre-change situation is tolerated, i.e., still works. That might not be the case for something "internal". But again, in practice there is some gray between the black of "internal" and the white of non-"internal".
Someone from Emacs Dev (e.g. @Stefan) will perhaps put this differently or correct my interpretation.
My own take: there have sometimes (often) been functions and variables that the author did not expect users to make use of directly, and thus naturally thought of as "internal", which users have nevertheless put to good use, or even "had" to use (modulo rewriting lots of code). Some such have had their "internal" status removed (no, I don't have examples memorized). Or sometimes a new, non-"internal" function has been added to make the behavior available - e.g., a wrapper or function-valued argument has been added (again, I have no offhand examples to give).
IOW, for Emacs Dev too it is not always clear what should be considered "internal". Just take the label as a flag that you might not want to count too much on that function or variable.
Wrt the various notations: My impression is that the -- convention seems recently to be used more (though there is also some old code that uses it); using internal is an older convention, for the most part.
The "internal" and the "--" conventions are similar. Basically "internal" is used when there's no prefix after which to put a double dash (which is usually the case for functions implemented in C).
And yes, as Drew explains, the intention behind the notion of something being "internal" is just to recommend people not use it directly. IOW if they need the corresponding functionality, they should report a bug requesting to promote its status to "non-internal".

Is defining a macro via -D option ALWAYS equivalent to #define MACRO (except precedence)

I have a third-party piece of code that works differently when I add a macro via the Makefile (e.g. -DMacro) instead of doing #define MACRO in a top-level header file (which, as their documentation implies, is included in ALL files).
I Googled if there are any differences in defining it in different ways but could not come up with much except Precedence of -D MACRO and #define MACRO.
I am wondering if I am missing anything about make documentation / C standards before I start debugging and determining the issue.
Thanks for any answers.
Usually, it's exactly the same, but neither make nor the ISO standard has anything to say about it. It's up to the compiler itself; some may not even have a -D option.
To make, it's just running the command (such as gcc) with whatever options it takes. ISO doesn't specify anything about how to run a compiler, just how the compiler (and the things it creates) behaves.
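For gcc, the preprocessor options can be found here: -D name predefines name as a macro with a definition of 1, and -D name=definition is processed as if it appeared in a #define directive, so it is effectively identical to #define.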
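One detail worth checking when behaviour differs: GCC documents -D name as giving the macro a definition of 1, whereas a bare #define MACRO in a header gives it an empty body. The sketch below imitates both cases inside one file so it can be compiled without special flags; FROM_SOURCE, FROM_COMMAND_LINE and the helper functions are invented names, and the second #define is only a stand-in for what the command-line option would do.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification so the macro is expanded before quoting. */
#define STR_(x) #x
#define STR(x) STR_(x)

#define FROM_SOURCE          /* like "#define MACRO" in a header: empty body */
#define FROM_COMMAND_LINE 1  /* stand-in for "-DMACRO", documented as body 1 */

/* Expose each macro's replacement text as a string for inspection. */
const char *source_body(void)  { return STR(FROM_SOURCE); }
const char *cmdline_body(void) { return STR(FROM_COMMAND_LINE); }

/* #ifdef / defined() cannot tell the two apart: both count as defined. */
int both_defined(void)
{
#if defined(FROM_SOURCE) && defined(FROM_COMMAND_LINE)
    return 1;
#else
    return 0;
#endif
}
```

Code that only tests #ifdef MACRO sees no difference, but code that uses the macro's value (for example #if MACRO or MACRO == 1) can behave differently between the two forms.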

Why are Perl source filters bad and when is it OK to use them?

It is "common knowledge" that source filters are bad and should not be used in production code.
When answering a similar, but more specific, question I couldn't find any good references that explain clearly why filters are bad and when they can be safely used. I think now is the time to create one.
Why are source filters bad?
When is it OK to use a source filter?
Why source filters are bad:
Nothing but perl can parse Perl. (Source filters are fragile.)
When a source filter breaks pretty much anything can happen. (They can introduce subtle and very hard to find bugs.)
Source filters can break tools that work with source code. (PPI, refactoring, static analysis, etc.)
Source filters are mutually exclusive. (You can't use more than one at a time -- unless you're psychotic).
When they're okay:
You're experimenting.
You're writing throw-away code.
Your name is Damian and you must be allowed to program in Latin.
You're programming in Perl 6.
Only perl can parse Perl (see this example):
@result = (dothis $foo, $bar);
# Which of the following is it equivalent to?
@result = (dothis($foo), $bar);
@result = dothis($foo, $bar);
This kind of ambiguity makes it very hard to write source filters that always succeed and do the right thing. When things go wrong, debugging is awkward.
After crashing and burning a few times, I have developed the superstitious approach of never trying to write another source filter.
I do occasionally use Smart::Comments for debugging, though. When I do, I load the module on the command line:
$ perl -MSmart::Comments
so as to avoid any chance that it might remain enabled in production code.
See also: Perl Cannot Be Parsed: A Formal Proof
I don't like source filters because you can't tell what code is going to do just by reading it. Additionally, things that look like they aren't executable, such as comments, might magically be executable with the filter. You (or more likely your coworkers) could delete what you think isn't important and break things.
Having said that, if you are implementing your own little language that you want to turn into Perl, source filters might be the right tool. However, just don't call it Perl. :)
It's worth mentioning that Devel::Declare keywords (and starting with Perl 5.11.2, pluggable keywords) aren't source filters, and don't run afoul of the "only perl can parse Perl" problem. This is because they're run by the perl parser itself, they take what they need from the input, and then they return control to the very same parser.
For example, when you declare a method in MooseX::Declare like this:
method frob ($bubble, $bobble does coerce) {
... # complicated code
}
The word "method" invokes the method keyword parser, which uses its own grammar to get the method name and parse the method signature (which isn't Perl, but it doesn't need to be -- it just needs to be well-defined). Then it leaves perl to parse the method body as the body of a sub. Anything anywhere in your code that isn't between the word "method" and the end of a method signature doesn't get seen by the method parser at all, so it can't break your code, no matter how tricky you get.
The problem I see is the same problem you encounter with any C/C++ macro more complex than defining a constant: It degrades your ability to understand what the code is doing by looking at it, because you're not looking at the code that actually executes.
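A classic illustration of that C macro problem, as a small sketch (SQUARE and square_of_sum are example names, not from any particular code base): the call site reads like a function call, but the code that actually executes is the textual expansion.

```c
#include <assert.h>

/* Looks like a function, but is plain textual substitution. */
#define SQUARE(x) x * x

int square_of_sum(void)
{
    /* Reads as "(1 + 2) squared", but expands to 1 + 2 * 1 + 2. */
    int r = SQUARE(1 + 2);

    assert(r == 5); /* not 9, because * binds tighter than + */
    return r;
}
```

A source filter has the same effect on a reader, only with far fewer limits on what the rewritten code can look like.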
In theory, a source filter is no more dangerous than any other module, since you could easily write a module that redefines builtins or other constructs in "unexpected" ways. In practice, however, it is quite hard to write a source filter in a way where you can prove that it's not going to make a mistake. I tried my hand at writing a source filter that implements the Perl 6 feed operators in Perl 5 (Perl6::Feeds on CPAN). You can take a look at the regular expressions to see the acrobatics required to simply figure out the boundaries of expression scope. While the filter works, and provides a test bed to experiment with feeds, I wouldn't consider using it in a production environment without many, many more hours of testing.
Filter::Simple certainly comes in handy by dealing with 'the gory details of parsing quoted constructs', so I would be wary of any source filter that doesn't start there.
In all, it really depends on the filter you are using, and how broad a scope it tries to match against. If it is something simple like a C macro, then it's "probably" OK, but if it's something complicated then it's a judgement call. I personally can't wait to play around with Perl 6's macro system. Finally Lisp won't have anything on Perl :-)
There is a nice example here that shows in what trouble you can get with source filters.
They used a module called Switch, which is based on source filters. And because of that, they were unable to find the source of an error message for days.