What is \n\nnd supposed to do? - sed

echo n | sed '\n\nnd'
This command prints n with GNU sed. With BSD sed, it doesn't print anything.
The POSIX sed specification says:
In a context address, the construction \cBREc, where c is any character other than <backslash> or <newline>, shall be identical to /BRE/. If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the BRE. For example, in the context address \xabc\xdefx, the second x stands for itself, so that the BRE is abcxdef.
The escape sequence \n shall match a <newline> embedded in the pattern space. A literal <newline> shall not be used in the BRE of a context address or in the substitute function.
but doesn't elaborate any further on these contradictory statements.
So my question is, which behavior is correct? Or is it intentionally left unspecified?

Update: with this commit, GNU sed no longer prints n for the command in the OP.
According to a reply to my email on Austin Group mailing list (quoted below), the standard is unclear on this, and both behaviors are correct. HP-UX and Solaris adopted the GNU behavior too; so it's not a bug in implementations, but a lack of clarity in the standard.
Neither is more correct than the other because, as you said yourself, the standard is unclear. A formal interpretation would say "The standard is unclear on this issue, and no conformance distinction can be made between alternative implementations based on this."
Given that implementations differ, we should probably make the behaviour explicitly unspecified.
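Spelled out, the two competing parses of the address look like this (a sketch; the outputs are the behaviors described above):
# Reading A (GNU sed before the fix): the delimiter is n and the inner \n
# is the escape for an embedded newline, so the address is /\n/. A one-line
# pattern space contains no embedded newline, d never fires, and the line prints:
echo n | sed '\n\nnd'    # prints: n
# Reading B (BSD sed, and GNU sed since that commit): by the first rule
# quoted above, \n inside a \n...n address is the literal character n, so
# the address is /n/. The line matches, d deletes it, and nothing prints:
echo n | sed '\n\nnd'    # prints nothing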


What counts as a newline for Raku *source* files?

I was somewhat surprised to observe that the following code
# comment 
say 1;
# comment 
say 2;
# comment say 3;
# comment say 4;
prints 1, 2, 3, and 4.
Here are the relevant characters after "# comment":
say "

".uninames.raku;
# OUTPUT: «("PARAGRAPH SEPARATOR", "LINE SEPARATOR", "<control-000B>", "<control-000C>").Seq»
Note that many/all of these characters are invisible in most fonts. At least with my editor, none cause the following text to be printed on a new line. And at least one (<control-000C>, aka Form Feed, sometimes printed as ^L) is in fairly wide use in Vim/Emacs as a section separator.
This raises a few questions:
1. Is this intentional, or a bug?
2. If intentional, what's the use-case (other than winning obfuscated code contests!)?
3. Is it just these 4 characters, or are there others? (I found these because they share the mandatory break Unicode property. Does that property (or some other Unicode property?) govern what Raku considers a newline?)
4. Just, really, wow.
(I realize #4 is not technically a question, but I feel it needed to be said).
Raku's syntax is defined as a Raku grammar. The rule for parsing such a comment is:
token comment:sym<#> {
    '#' {} \N*
}
That is, it eats everything after the # that is not a newline character. As with all built-in character classes in Raku, \n and its negation are Unicode-aware. The language design docs state:
\n matches a logical (platform independent) newline, not just \x0a. See TR18 section 1.6 for a list of logical newlines.
Which is a reference to the Unicode standard for regular expressions.
I somewhat doubt there was ever a specific language design discussion along the lines of "let's enable all the kinds of newlines in Unicode, it'll be cool!" Rather, the decisions were that Raku should follow the Unicode regex technical report, and that Raku syntax would be defined in terms of a Raku grammar and thus make use of the Unicode-aware character classes. That a range of different newline characters are supported is a consequence of consistently following those principles.
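One can check this directly (a hedged sketch, not from the original thread):
# Each of the four characters from the question, matched against \n,
# which Raku defines as a logical newline. All four should print True.
for "\x[2029]", "\x[2028]", "\x[0B]", "\x[0C]" -> $c {
    say $c.uniname, " matches \\n: ", so $c ~~ /^ \n $/;
}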

Name of the notation that goes like /<command> [arg0|arg1]

I know there is a notation or convention that describes, among other things, the usage of a command (in a shell, for example).
/<command> [arg0|arg1]
means the following is a right way of using the command: /TheNameOfTheCommand arg0 or /TheNameOfTheCommand arg1.
It is a bit like RegEx or a formal language. Minecraft uses this notation too, to describe the syntax of its commands. And I once heard it mentioned by a professor in a programming lecture. That's why I think it must be a real convention.
Do you know the name of this convention or does it exist at all?
It's a convention, but not a standard. Or maybe it would be more accurate to say that it is a convention adapted from a collection of standards which differ in details, except that the standards derive from the conventional use, and it's more common to find other variants than strict application of the standards.
The use of angle brackets to delimit grammatical variables goes back to John Backus' notation to describe the original Algol (1959); the use of brackets to denote optionality and vertical bars to list alternatives was used in the definition of Pascal (1974) and promoted by Niklaus Wirth in a note published in 1977 (“What Can We Do About the Unnecessary Diversity of Notation for Syntactic Definitions?”).
In the Pascal report, it was called "Extended Backus-Naur Form", and it is one of a number of similar notations which go by that name. I think that's unfortunate because it doesn't acknowledge Wirth's contribution, but if you called it WBNF people would probably think you were talking about a radio station.
(Wirth didn't always use angle brackets. In the published version of the Pascal report, grammatical variables were printed in italics, but in the widely-distributed typed manuscript, angle brackets were used. Similarly, literal tokens were sometimes typeset in boldface and sometimes typed surrounded by quotation marks.)
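For concreteness, the question's usage line could be written in Wirth's 1977 notation roughly like this (a sketch; the letter and digit productions are assumed to be defined elsewhere):
command = "/" name [ arg ] .
arg     = "arg0" | "arg1" .
name    = letter { letter | digit } .
Square brackets mark an optional part, braces mark repetition, quotes mark literal text, and the vertical bar separates alternatives, which is exactly the bracket-and-bar convention the question describes.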

Force CL-Lex to read whole word

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it matches too eagerly. When it finds a word such as "init", it returns an IN token for the first two characters, when it should return a single "CONST" token for the whole word "init".
This is a simple version of the lexer:
(define-string-lexer lexer
  (...)
  ("in" (return (values :in $#)))
  ("[a-z]([a-z]|[A-Z]|\_)" (return (values :const $#))))
How do I force the lexer to fully read the whole word until some whitespace appears?
This is both a correction of Kaz's errors, and a vote of confidence for the OP.
In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:
Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:
The longest match is preferred.
Among rules which matched the same number of characters, the rule given first is preferred.
In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which provide \b as a convenient syntax for the more cumbersome lex approximation of:
%{
#include <stdio.h>
%}
WC [A-Za-z']
NW [^A-Za-z']
%start INW NIW
%%
{WC} { BEGIN INW; REJECT; }
{NW} { BEGIN NIW; REJECT; }
<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }
All things being equal, finding a way to unambiguously match his words is probably better than the alternative.
Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.
Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).
In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.
(Although I can't say, without RTFM, that that is what happens in CL-LEX, it makes a plausible hypothesis for what is happening in this case.)
It looks like you're missing a regex Kleene operator to match a longer token in the second rule.
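Concretely, a hedged sketch of the fix (untested; $# is kept from the question as whatever CL-LEX binds the matched text to):
(define-string-lexer lexer
  ;; keyword rule first: wins ties, e.g. for the exact input "in"
  ("in" (return (values :in $#)))
  ;; with the *, this rule can match a whole word such as "init",
  ;; and the longest match wins over the two-character "in"
  ("[a-z]([a-z]|[A-Z]|_)*" (return (values :const $#))))
With the star, CONST matches all four characters of "init" and wins by longest match, while the bare input "in" still ties at two characters and goes to the earlier keyword rule.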

What characters are allowed in Perl identifiers?

I'm working on regular expressions homework where one question is:
Using language reference manuals online determine the regular expressions for integer numeric constants and identifiers for Java, Python, Perl, and C.
I don't need help on the regular expression, I just have no idea what identifiers look like in Perl. I found pages describing valid identifiers for C, Python and Java, but I can't find anything about Perl.
EDIT: To clarify, finding the documentation was meant to be easy (like doing a Google search for python identifiers). I'm not taking a class in "doing Google searches".
Perl Integer Constants
Integer constants in Perl can be
in base 16 if they start with 0x
in base 2 if they start with 0b
in base 8 if they start with 0
otherwise they are in base 10.
Following that leader is any number of valid digits in that base and also optional underscores.
Note that digit does not mean \p{POSIX_Digit}; it means \p{Decimal_Number}, which is really quite different, you know.
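A quick hedged illustration of the four forms and the underscore separator:
$ perl -le 'print for 0xff, 0b1010, 017, 1_000_000'
255
10
15
1000000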
Please note that any leading minus sign is not part of the integer constant, which is easily proven by:
$ perl -MO=Concise,-exec -le '$x = -3**$y'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <$> const(IV 3) s
4 <$> gvsv(*y) s
5 <2> pow[t1] sK/2
6 <1> negate[t2] sK/1
7 <$> gvsv(*x) s
8 <2> sassign vKS/2
9 <#> leave[1 ref] vKP/REFC
-e syntax OK
See the 3 const, and much later on the negate op-code? That tells you a bunch, including a curiosity of precedence.
Perl Identifiers
Identifiers specified via symbolic dereferencing have absolutely no restriction whatsoever on their names.
For example, 100->(200) calls the function named 100 with the arguments (200).
For another, ${"What’s up, doc?"} (note the curly apostrophe) refers to the scalar package variable by that name in the current package.
On the other hand, ${"What's up, doc?"} (with a straight apostrophe) refers to the scalar package variable whose name is "s up, doc?" and which is not in the current package, but rather in the What package. Well, unless the current package is the What package, of course. Similarly, $Who's is the $s variable in the Who package.
One can also have identifiers of the form ${^identifier}; these are not considered symbolic dereferences into the symbol table.
An identifier consisting of a single character can be a punctuation character, as in $$ or %!.
Identifiers can also be of the form $^C, which is either a literal control character or a circumflex followed by a non-control character.
If none of those things is true, a (non–fully qualified) identifier follows the Unicode rules related to characters with the property ID_Start followed by those with the property ID_Continue. However, it overrules this in allowing all-digit identifiers and identifiers that start with (and perhaps have nothing else beyond) an underscore. You can generally pretend (but it’s really only pretending) that that is like saying \w+, where \w is as described in Annex C of UTS#18. That is, anything that has any of these:
the Alphabetic property — which includes far more than just Letters; it also contains various combining characters and the Letter_Number code points, plus the circled letters
the Decimal_Number property, which is rather more than merely [0-9]
Any and all characters with the Mark property, not just those marks that are deemed Other_Alphabetic
Any characters with the Connector_Punctuation property, of which underscore is just one.
So either ^\d+$ or else
^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$
ought to do it for the really simple ones if you don’t care to explore the intricacies of the Unicode ID_Start and ID_Continue properties. That’s how it’s really done, but I bet your instructor doesn’t know that. Perhaps one shan’t tell him, eh?
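As a hedged sanity check of that simple pattern (sample names invented for illustration):
$ perl -Mutf8 -CO -le 'my $re = qr/^[\p{Alphabetic}\p{Decimal_Number}\p{Mark}\p{Connector_Punctuation}]+$/; print "$_: ", (/^\d+$/ || /$re/) ? "ok" : "no" for "foo", "_bar", "42", "café", "no way"'
foo: ok
_bar: ok
42: ok
café: ok
no way: no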
But you should cover the nonsimple ones I describe earlier.
And we haven’t talked about packages yet.
Perl Packages in Identifiers
Beyond those simple rules, you must also consider that identifiers may be qualified with a package name, and package names themselves follow the rules of identifiers.
The package separator is either :: or ' at your whim.
You do not have to specify a package before the first separator in a fully qualified identifier; an empty first component means the package main. That means things like $::foo and $'foo are equivalent to $main::foo, and isn't_it() is equivalent to isn::t_it().
Finally, as a special case, a trailing double-colon (but not a single-quote) is permitted at the end of a hash name, and this then refers to the symbol table of that name.
Thus %main:: is the main symbol table, and because you can omit main, so too is %::.
Meanwhile %foo:: is the foo symbol table, as is %main::foo:: and also %::foo:: just for perversity’s sake.
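Two hedged one-liners to make that concrete (the variable names are invented):
$ perl -le '$::x = "hi"; print $main::x'    # $::x is $main::x
hi
$ perl -le '$main::y = 1; print for grep { $_ eq "y" } keys %::'    # %:: is %main::
y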
Summary
It’s nice to see instructors giving people non-trivial assignments. The question is whether the instructor realized it was non-trivial. Probably not.
And it’s hardly just Perl, either. Regarding the Java identifiers, did you figure out yet that the textbooks lie? Here’s the demo:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: ^\033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
Yes, it’s true. It is also true for many other code points, especially if you use -encoding UTF-8 on the compile line. Your job is to find the pattern that describes these startlingly unforbidden Java identifiers. Hint: make sure to include code point U+0000.
There, aren’t you glad you asked? Hope this helps. Or something. ☺
The homework requests that you use the reference manuals, so I'll answer in those terms.
The Perl documentation is available at http://perldoc.perl.org/. The section that deals with variables is perldata. That will easily give you a usable answer.
In reality, I doubt that the complete answer is available in the documentation. There are special variables (see perlvar), and "use utf8;" can greatly affect the definition of "letter" and "number".
$ perl -E'use utf8; $é=123; say $é'
123
[ I only covered the identifier part. I just noticed the question is larger than that ]
The perlvar page of the Perl documentation has a section at the end roughly outlining the allowable syntax. In summary:
Any combination of letters, digits, underscores, and the special sequence :: (or '), provided it starts with a letter or underscore.
A sequence of digits.
A single punctuation character.
A single control character, which can also be written as caret-{letter}, e.g. ^W.
An alphanumeric string starting with a control character.
Note that most of the identifiers other than the ones in set 1 are either given a special meaning by Perl, or are reserved and may gain a special meaning in later versions. But if you're just trying to work out what is a valid identifier, then that doesn't really matter in your case.
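For example, $0 (the program name) falls under set 2, $$ (the PID) under set 3, $^W (the warnings flag) under set 4, and ${^TAINT} under set 5. A hedged demo (the PID will vary):
$ perl -le 'print "$0 $$ $^W ${^TAINT}"'
-e 12345 0 0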
Since Perl has no official specification (Perl is whatever the perl interpreter can parse), these can be a little tricky to pin down.
This page has examples of all the integer constant formats. The format of identifiers will need to be inferred from various pages in perldoc.

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code. The rules that describe exactly what is allowed in G type literals, and what is allowed for identifiers, are unclear. The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended the text below for clarity:
Does anyone know the exact rules of G literal formation, and how they (don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where each <name> is a macro standing for another regular expression. Presumably they are named well enough that you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00...X'FF for either byte"
How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it. One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that would mean the RE only rejects literal strings containing single <squote>s. I don't believe that's the problem we are having, as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed with DBCS characters. What is allowed for an identifier, exactly? Again, a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text, and our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. Likely what is happening is that the DBCS data is getting translated as if it were Shift-JIS, and that mucks up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
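The mode switch we would need looks roughly like this (a hypothetical, untested Perl sketch; the function name is invented, and the SO/SI byte values depend on the encoding in play):
# Scan raw, undecoded bytes: between SO and SI, consume DBCS elements
# two bytes at a time so the closing quote can still be located.
sub g_literal_end {
    my ($bytes, $pos) = @_;       # $pos indexes the first byte after G'<SO>
    my $SI = "\x0F";              # shift-in; may differ after transcoding
    while ($pos < length $bytes) {
        if (substr($bytes, $pos, 1) eq $SI) {
            # back in single-byte mode; the closing quote should follow
            return $pos + 2 if substr($bytes, $pos + 1, 1) eq "'";
            $pos += 1;
        }
        else {
            $pos += 2;            # one DBCS character = two raw bytes
        }
    }
    return -1;                    # unterminated G literal
}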
Try allowing a single quote in your rule to see if it passes, by making this change:
<squote><squote> => <squote>{1,2}
If I remember it correctly, one difference between N and G literals is that G allows single quote. Your regular expression doesn't allow that.
EDIT: I thought you had all the other DBCS literals working and were just having issues with the G string, so I just pointed out the difference between N and G. Now I have taken a closer look at your RE, and it has problems. In the COBOL I used, you could mix ASCII with Japanese, for example:
G"ABC<ヲァィ>" (where < and > stand for Shift-Out and Shift-In)
Your RE assumes DBCS content only. I would lose this restriction and try again.
I don't think it's possible to handle G literals entirely in regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, treating it as ASCII will not work. SO/SI sometimes are converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> exclude double quotation marks as well, or just apostrophes? If it excludes only one of them, that would be a problem, as the rule could consume the literal-closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.