Noweb does not cross-reference Perl identifiers delimited on the left by @ - perl

Consider this Noweb source file named quux.nw:
\documentclass{article}
\usepackage{noweb}
\usepackage[colorlinks]{hyperref}
\begin{document}
<<quux.pl>>=
my @foo ;
my $bar ;
my %baz ;
@ %def foo bar baz
\end{document}
and compiled using the commands:
$ noweb quux.nw
$ latexmk -pdf quux.tex
The identifiers bar and baz are properly highlighted as identifiers and cross referenced in the PDF output. The identifier foo is not.
It's my understanding that Noweb has a very simple heuristic for recognizing identifiers. foo should be recognizable as an identifier because, like bar and baz, it begins with an alphanumeric, is delimited on the left by a symbol (at-sign), and is delimited on the right by a delimiter (whitespace).
I considered the possibility that the at-sign was being interpreted by Noweb as an escape and tried doubling it, but that (i) did not solve the problem, and (ii) introduced the syntax error my @@foo into quux.pl. This makes sense because according to the fine manual, a double at-sign is only treated specially in columns 1–2.

Noweb treats @ as alphanumeric, with the rationale that it “helps LaTeX”. I did not find anything about this in the Noweb manual. This is documented only in the Noweb source file finduses.nw, line 24, in Noweb version 2.12.
Apparently, when writing your own LaTeX package, any macro you define has public scope. To write “private” macros, the trick is to temporarily reclass the @ as a letter at the top of the package, incorporate an @ into the name of each “private” macro, and restore the class of @ at the bottom of the package. The macro remains public, but is impossible to call because the name gets broken up into multiple lexemes. (A user can still call such a macro by reclassing @ as a letter before the call, but if they do that, they assume the risk.)
So yes, @ should be included as an alphanumeric character when the code block contains a LaTeX package.
The full list of symbols treated as alphanumeric by Noweb is:
_ ' @ #
The _ is treated as an identifier character in many programming languages, so Noweb is right to treat it as alphanumeric.
The # is treated as alphanumeric to “avoid false hits on C preprocessor directives”.
No explanation is given for treating the ' as alphanumeric.
Ideally, Noweb would support separate character class schemes for each source language. But as I understand it, Noweb has only the one global character class scheme, and no support for changing it (other than modifying the source).
Fortunately, Perl has alternate syntaxes for array identifiers that work around this limitation. Instead of @foo you can write @{foo} or even @ foo and it will work.
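The heuristic described above can be approximated in a few lines of Python. This is only a rough sketch of the behaviour as described in this answer, not Noweb's actual implementation; the names EXTRA_ALNUM, is_alnum and find_uses are made up for illustration.

```python
# Symbols this answer says Noweb treats as alphanumeric, besides letters and digits.
EXTRA_ALNUM = "_'@#"


def is_alnum(ch: str) -> bool:
    """True if ch counts as part of an identifier under the assumed scheme."""
    return ch.isalnum() or ch in EXTRA_ALNUM


def find_uses(line: str, ident: str) -> list[int]:
    """Return offsets where ident occurs with a non-alphanumeric on both sides."""
    hits = []
    start = line.find(ident)
    while start != -1:
        end = start + len(ident)
        left_ok = start == 0 or not is_alnum(line[start - 1])
        right_ok = end == len(line) or not is_alnum(line[end])
        if left_ok and right_ok:
            hits.append(start)
        start = line.find(ident, start + 1)
    return hits


print(find_uses("my @foo ;", "foo"))    # [] because @ is glued to foo
print(find_uses("my $bar ;", "bar"))    # [4]
print(find_uses("my @{foo} ;", "foo"))  # [5]
```

Under these assumptions, @foo is read as one long "word", so foo is never seen on its own, while the brace in @{foo} acts as a delimiter.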

What counts as a newline for Raku *source* files?

I was somewhat surprised to observe that the following code
# comment 
say 1;
# comment 
say 2;
# comment say 3;
# comment say 4;
prints 1, 2, 3, and 4.
Here are the relevant characters after "# comment":
say "

".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>, aka Form Feed, sometimes printed as ^L) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
Is this intentional, or a bug?
If intentional, what's the use-case (other than winning obfuscated code contests!)
Is it just these 4 characters, or are there others? (I found these because they share the mandatory break Unicode property. Does that property (or some other Unicode property?) govern what Raku considers as a newline?)
Just, really, wow.
(I realize #4 is not technically a question, but I feel it needed to be said).
Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
'#' {} \N*
}
That is, it eats everything after the # that is not a newline character. As with all built-in character classes in Raku, \n and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.
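To illustrate the effect outside Raku, here is a small Python sketch of a comment stripper that ends a comment at any of the single-character line boundaries listed in TR18 section 1.6, compared with one that only knows about \n. It is not how Rakudo is implemented; NEWLINES, strip_comments and strip_comments_naive are names invented for this example.

```python
import re

# Single-character line boundaries from TR18 section 1.6:
# LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
NEWLINES = "\n\x0b\x0c\r\x85\u2028\u2029"

# A comment runs from '#' up to (not including) the next logical newline.
COMMENT = re.compile(f"#[^{NEWLINES}]*")


def strip_comments(src: str) -> str:
    """Remove comments, treating every TR18 newline as a comment terminator."""
    return COMMENT.sub("", src)


def strip_comments_naive(src: str) -> str:
    """Remove comments, treating only \\n as a comment terminator."""
    return re.sub(r"#[^\n]*", "", src)


src = "# comment \u2029say 1;\n# comment \x0csay 4;"
print(repr(strip_comments(src)))        # both say statements survive
print(repr(strip_comments_naive(src)))  # both say statements are swallowed
```

The first variant mirrors what the question observed: the code after the invisible separator is no longer part of the comment.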

How to wrap long descriptions in a Flutter pubspec.yaml file?

I noticed that when using the Pubspec Assist plugin, it wraps the description line when updating a dependency.
description: Have you been turned into a newt? Would you like to be? This
  package can help. It has all of the newt-transmogrification functionality you
  have been looking for.
In researching this wrapping, I found that yaml supports wrapping, but indicates to use > (or | for keeping newlines, which probably isn't recommended for Flutter apps?).
The pubspec page on dart.dev shows an example using >-, but its own description doesn't mention anything about long descriptions or wrapping.
description: >-
  Have you been turned into a newt? Would you like to be?
  This package can help. It has all of the
  newt-transmogrification functionality you have been looking
  for.
Does it matter, in a Flutter project, say from an app store's perspective, which method is used for wrapping long descriptions in the pubspec.yaml file? I've always just kept them as one long line.
Wrapping is a YAML syntax feature. Flutter applies semantics to the parsed content of your YAML file.
This means that it doesn't matter to Flutter how you represent your YAML scalars, as long as the result – as defined by the YAML syntax you use – yields a valid value for Flutter.
With some scalars, YAML employs line folding: Single line breaks are transformed into a space, while empty lines are transformed into line breaks. This happens both with plain scalars and folded block scalars:
droggeljug: This is a plain scalar.
  It spans multiple lines but when parsed, contains a single line.
baked_beans: >-
  This is a folded block scalar.
  It also spans multiple lines.

  The previous empty line yields a line break in the parsed value.
There are some differences to consider:
plain scalars get ended by various special characters, such as : (when followed by whitespace). This should be obvious seeing that it forms an implicit key.
folded block scalars only end when content at the indentation of a parent node is encountered. You can safely write any character into a folded block scalar, even # which would otherwise start a comment.
some scalars may be parsed as non-strings when given as plain scalars. For example, true might be a boolean and 42 might be a number. Folded block scalars always yield strings no matter the content.
Apart from these, there are also single- and double-quoted scalars, and literal block scalars (starting with | instead of >). Literal block scalars parse line breaks as-is. Double-quoted scalars parse escape sequences. Single-quoted scalars just don't end until the second ' is encountered. All of these scalar types may be multiline and all except literal block scalars do line folding. You can choose any of them to encode your string value.
As to the question which one you should use, I'd say the folded block scalar >- is the right tool for the job: You can write anything without worrying about YAML special characters, escape sequences, etc.
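For intuition, the line-folding rule described above can be sketched in plain Python. This is a simplification and not a YAML parser: it assumes the lines have already been extracted from the scalar, and it ignores more-indented lines, leading empty lines and trailing line breaks. The function name fold is made up for this example.

```python
def fold(lines: list[str]) -> str:
    """Join scalar lines: a single line break becomes a space,
    each empty line becomes a line break."""
    out: list[str] = []
    pending_blank = 0
    for line in lines:
        text = line.strip()
        if not text:
            # Remember empty lines until the next piece of text arrives.
            pending_blank += 1
            continue
        if out:
            out.append("\n" * pending_blank if pending_blank else " ")
        out.append(text)
        pending_blank = 0
    return "".join(out)


value = fold([
    "This is a folded block scalar.",
    "It also spans multiple lines.",
    "",
    "The previous empty line yields a line break in the parsed value.",
])
print(value)
```

The result is two lines of text: the first two source lines joined by a space, then a line break, then the last sentence.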

How to insert breakpoint using symbols include "<>" (angle brackets)

I want to insert a breakpoint in windbg, using symbols named "TSmartPointer::TSmartPointer".
bp TSmartPointer<class CDataMemberMgr>::TSmartPointer<class CDataMemberMgr>
WinDbg told me that no symbols were found.
I also used the x command to search for the symbol, but it found nothing either:
x TSmartPointer<class CDataMemberMgr>::TSmartPointer<class CDataMemberMgr>
When I replace "<" and ">" with "*", WinDbg can find the symbols:
x TSmartPointer*class CDataMemberMgr*::TSmartPointer*class CDataMemberMgr*
Am I wrong? How can I insert this breakpoint?
I could not find this in WinDbg's internal help, but it is in the Microsoft documentation, which makes me wonder a bit about the spaces as well:
To set a breakpoint on complicated functions, including functions that contain spaces, as well as a member of a C++ public class, enclose the expression in parentheses. For example, use bp (??MyPublic) or bp (operator new).
Furthermore, it explicitly talks about angle brackets:
You must start with the three symbols @!" and end with a quotation mark ("). Without this syntax, you cannot use spaces, angle brackets (<, >), or other special characters in symbol names in the MASM evaluator.
(emphasis mine)
So, in your case, the following should work:
bp @!"TSmartPointer<class CDataMemberMgr>::TSmartPointer<class CDataMemberMgr>"
The quotation marks should take care of the spaces as well.
And to make a comment of @Kurt Hutchinson persistent:
For template classes, it's important to use the exact spacing and angle bracket placement that Windbg wants. Sometimes there will be an extra space in there that is significant. You can tell what it should be by doing a symbol lookup first like x MSHTML!TSmartPointer*CDataMemberMgr*. Windbg should do a wildcard match and print out a bunch of symbol names. Then you should copy and paste the correct name from that list, using the @!"..." quoting. Don't try to retype the symbol name yourself because spaces matter and if you miss one, Windbg won't match it correctly.

Is it possible to change an emacs syntax table based on context?

I'm working on improving an emacs major mode for UnrealScript. One of the (many) quirks is that it allows syntax like this for specifying tooltips in the Unreal editor:
var() int MyEditorVar <Foo=Bar|Tooltip=My tooltip text isn't quoted>;
The angle brackets after the variable declaration denote a pipe-separated list of Key=Value metadata pairs, and the metadata is not quoted but can contain quote marks -- a pipe (|) or right angle bracket (>) denotes the end.
Is there a way I can get the emacs syntax table to recognize this context-dependent syntax in a useful way? I'd like everything except for pipes and right angle brackets to be highlighted in some way inside of these variable metadata declarations, but otherwise retain their normal highlighting.
Right now, the single quote character is set up to be a quote delimiter (syntax designator "), so font-lock-mode interprets such a quote as starting a quoted string, which it's not in this very specific instance, so it mishighlights everything until it finds another supposedly matching single quote.
You'll need to set up a syntax-propertize-function, which lets you apply different syntax designators to different characters in the buffer, depending on their context.
Grep for syntax-propertize-function in Emacs's lisp directory to see various examples (from simple to pretty complex ones).
You'll probably want to mark the "=" chars after your "Foo" and after your "Tooltip" as "generic string delimiter", then do the same with the corresponding terminating "|" and ">". An alternative could be to mark the char before the ">" as a (closing) generic string delimiter, so that you can then mark the "<" and ">" as open&close parens.

What characters are allowed in Perl identifiers?

I'm working on regular expressions homework where one question is:
Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.
I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.
EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".
Perl Integer Constants
Integer constants in Perl can be
in base 16 if they start with ^0x
in base 2 if they start with ^0b
in base 8 if they start with 0
otherwise they are in base 10.
Following that leader is any number of valid digits in that base and also optional underscores.
Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
Please note that any leading minus sign is not part of the integer constant, which is easily proven by:
$ perl -MO=Concise,-exec -le '$x = -3**$y'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <$> const(IV 3) s
4 <$> gvsv(*y) s
5 <2> pow[t1] sK/2
6 <1> negate[t2] sK/1
7 <$> gvsv(*x) s
8 <2> sassign vKS/2
9 <#> leave[1 ref] vKP/REFC
-e syntax OK
See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.
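Since the homework asks for regular expressions, the integer-constant rules above can be sketched as a Python pattern. This is a simplification: it is restricted to ASCII digits and lowercase 0x/0b prefixes, and it allows underscores anywhere after the leader rather than modelling where Perl warns about misplaced ones. The names PERL_INT and is_perl_int are invented for this example.

```python
import re

# One alternative per base, following the list above.
PERL_INT = re.compile(
    r"""
      0x[0-9a-fA-F_]+   # base 16
    | 0b[01_]+          # base 2
    | 0[0-7_]*          # base 8 (this also covers a plain 0)
    | [1-9][0-9_]*      # base 10
    """,
    re.VERBOSE,
)


def is_perl_int(text: str) -> bool:
    """True if the whole string looks like a Perl integer constant (simplified)."""
    return PERL_INT.fullmatch(text) is not None


for sample in ["0xFF", "0b1010", "0755", "1_000_000", "-3", "09"]:
    print(sample, is_perl_int(sample))
```

Note that -3 is rejected, in line with the point that the minus sign is an operator and not part of the constant.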
Perl Identifiers
Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.
For example, 100->(200) calls the function named 100 with the arguments (100, 200).
For another, ${"What’s up, doc?"} refers to the scalar package variable by that name in the current package.
On the other hand, ${"What's up, doc?"} refers to the scalar package variable whose name is ${"s up, doc?"} and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly $Who's is the $s variable in the Who package.
One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.
Identifiers with a single character alone can be a punctuation character, including $$ or %!.
Identifiers can also be of the form $^C, which is either a control character or a circumflex followed by a non-control character.
If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the properties ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:
the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
the Decimal_Number property, which is rather more than merely [0-9]
Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
Any characters with the Connector_Punctuation property, of which underscore is just one such.
So either ^\d+$ or else
^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$
ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
But you should cover the nonsimple ones I describe earlier.
And we haven’t talked about packages yet.
Perl Packages in Identifiers
Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.
The package separator is either :: or ' at your whim.
You do not have to specify a package if it is the first component in a fully qualified identifier, in which case it means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().
Finally, as a special case, a trailing double-colon (but not a single-quote) at the end of a hash is permitted, and this then refers to the symbol table of that name.
Thus %main:: is the main symbol table, and because you can omit main, so too is %::.
Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
Summary
It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.
And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.
There, aren’t you glad you asked? Hope this helps. Or something. ☺
The homework requests that you use the reference manuals, so I'll answer in those terms.
The Perl documentation is available at http://perldoc.perl.org/. The section that deals with variables is perldata. That will easily give you a usable answer.
In reality, I doubt that the complete answer is available in the documentation. There are special variables (see perlvar), and "use utf8;" can greatly affect the definition of "letter" and "number".
$ perl -E'use utf8; $é=123; say $é'
123
[ I only covered the identifier part. I just noticed the question is larger than that ]
The perlvar page of the Perl documentation has a section at the end roughly outlining the allowable syntax. In summary:
Any combination of letters, digits, underscores, and the special sequence :: (or '), provided it starts with a letter or underscore.
A sequence of digits.
A single punctuation character.
A single control character, which can also be written as caret-{letter}, e.g. ^W.
An alphanumeric string starting with a control character.
Note that most of the identifiers other than the ones in set 1 are either given a special meaning by Perl, or are reserved and may gain a special meaning in later versions. But if you're just trying to work out what is a valid identifier, then that doesn't really matter in your case.
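As a starting point for the regular expression, the first four sets in that summary can be sketched in Python. This is an ASCII-only approximation of the summary as given here, not of what the perl parser really accepts; it matches the name without its sigil, writes the control-character form only in caret notation, and leaves out set 5. PERL_IDENT and is_perl_ident are names invented for this example.

```python
import re

PERL_IDENT = re.compile(
    r"""
      (?: :: | ' )? [A-Za-z_] \w* (?: (?: :: | ' ) \w+ )*  # set 1: words, optionally package-qualified
    | \d+                                                  # set 2: a sequence of digits
    | [^\w\s]                                              # set 3: a single punctuation character
    | \^ [A-Z]                                             # set 4: caret-letter, e.g. ^W
    """,
    re.VERBOSE | re.ASCII,
)


def is_perl_ident(name: str) -> bool:
    """True if name (without sigil) fits the simplified summary above."""
    return PERL_IDENT.fullmatch(name) is not None


for sample in ["foo", "main::foo", "isn't_it", "123", "!", "^W", "1abc"]:
    print(sample, is_perl_ident(sample))
```

Only 1abc is rejected here, since it starts with a digit but is not all digits.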
Having no official specification (Perl is whatever the perl interpreter can parse) these can be a little tricky to discern.
This page has examples of all the integer constant formats. The format of identifiers will need to be inferred from various pages in perldoc.