How can I define a Regexp::Grammars rule that ignores leading whitespace? - perl

From the Regexp::Grammars documentation:
The difference between a token and a rule is that a token treats any
whitespace within it exactly as a normal Perl regular expression
would. That is, a sequence of whitespace in a token is ignored if the
/x modifier is in effect, or else matches the same literal sequence
of whitespace characters (if /x is not in effect).
In a rule, most sequences of whitespace are treated as matching the
implicit subrule <.ws>, which is automatically predefined to match
optional whitespace (i.e. \s*).
...
In other words, a rule such as:
<rule: sentence> <noun> <verb>
| <verb> <noun>
is equivalent to a token with added non-capturing whitespace matching:
<token: sentence> <.ws> <noun> <.ws> <verb>
| <.ws> <verb> <.ws> <noun>
Is there a way to get the rule to ignore the leading implicit <.ws>? In the example above, it would be equivalent to:
<token: sentence> <noun> <.ws> <verb>
| <verb> <.ws> <noun>

Related

How does the Scala Syntax Specification match an if-else whose one-line expressions end with a semicolon?

I'm learning the Scala Syntax Specification and am confused by the if-else syntax:
Expr1 ::= ‘if’ ‘(’ Expr ‘)’ {nl} Expr [[semi] ‘else’ Expr]
| ...
How can it match the if-else below, where each one-line expression ends with a semicolon?
if (true) // \n
println(1); //\n
else //\n
println(2); //\n
Note that there are 4 lines, each followed by a '\n'. I have these questions:
If the 1st ; after println(1) matches the semi before else ([[semi] ‘else’ Expr]), what matches the 2nd '\n', which follows that ;?
What matches the 3rd '\n', after else?
What matches the 2nd ; and the 4th '\n' after println(2), given that if-else doesn't match any ; or '\n' at its tail?
I think you are being confused by thinking that all newlines must match the nl token. That is not correct.
Newlines are in general simply treated as whitespace. There is a very long subsection on newlines in the Lexical Syntax chapter, section 1.2 Newline Characters, which explains in detail exactly when a newline character is an nl token and when it isn't.
Only the first newline character in your example is an nl token; the other three are just whitespace.
In Scala, semicolons are optional: the end of a line normally terminates a statement.
An if-else statement is simple with braces:
if (true) {
"\n" // evaluated, then discarded
println(1) // executed for its side effect
"\n" // the last expression is the value of the block
} else {
"\n" // evaluated, then discarded
println(2) // executed for its side effect
"\n" // the last expression is the value of the block
}
Or you can omit the braces, but then each branch must be a single expression:
if (true)
"\n" // this will be returned, can not have another line here
else
"\n"
without comments: if (true) "\n" else "\n"

Scala replace multiple strings having common character at once

val str= " This string has " , need to escape with \ .Even string has \ before"
val resultShouldbe=" This string has \" ,need to escape with \\.Even string has \\ before"
str.replace(""""""" , """\"""").replace("\\","\\\\")
The output of the first replace feeds into the second replace, so the backslashes it adds get doubled as well.
Kindly help.
str.replaceAll("([\"\\\\])" , "\\\\$1")
Matching regex:
(...) - capture group: Capture everything that matches this pattern.
[...] - character class: Match any of the given characters.
\"\\\\ - 2 characters: A quote mark (escaped) or a backslash (doubly escaped).
Replacement string:
\\\\$1 - 2 elements: A backslash (doubly escaped) followed by whatever was captured in the 1st capture group. (In this case there was only 1 capture group.)
In other words: For every quote " or backslash \ character, replace it with the same character preceded by a backslash \ character.
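The single-pass idea is not specific to Scala. As a rough sketch in Python (used here only for illustration; the function name is made up, and Python's re.sub writes the back-reference as \1 where Java/Scala replaceAll uses $1), the same character class and capture group look like this, next to the chained replace that double-escapes:

```python
import re

def escape_quotes_and_backslashes(text: str) -> str:
    # One pass: every " or \ is captured and re-emitted with a backslash in front.
    return re.sub(r'(["\\])', r'\\\1', text)

def chained_replace(text: str) -> str:
    # The approach from the question: the second replace also doubles
    # the backslashes that the first replace just added.
    return text.replace('"', '\\"').replace('\\', '\\\\')

sample = 'a"b\\c'                             # the three-part string a"b\c
print(escape_quotes_and_backslashes(sample))  # a\"b\\c
print(chained_replace(sample))                # a\\"b\\c
```

Swapping the order of the two chained replaces (backslashes first, then quotes) would also work, but the single regex keeps the intent in one place.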

BNF to EBNF conversion

I'm trying to convert a given BNF list to EBNF and I'm completely clueless how. Can anyone help?
The BNF is:
<Sentence> :== <NounPhrase><VerbPhrase>
<NounPhrase> :== <Noun>
<NounPhrase> :== <Article><Noun>
<NounPhrase> :== <Article><AdjectiveList><Noun>
<NounPhrase> :== <AdjectiveList><Noun>
<AdjectiveList> :== <Adjective>
<AdjectiveList> :== <Adjective><AdjectiveList>
<VerbPhrase> :== <Verb>
<VerbPhrase> :== <Verb><Adverb>
<Noun> :== frog | grass | goblin
<Article> :== a | the | that
<Adjective> :== purple | green | tiny
<Verb> :== grows | dreams | eats
<Adverb> :== quickly | slowly | badly
Extended BNF grammar uses the following conventions:
A superscript ? after a symbol means it is optional and can appear once or not at all.
A superscript + after a symbol means it must appear at least once but can appear more than once.
A superscript * after a symbol means it can appear not at all, once, or many times.
Paired parentheses can be used to group together symbols for purposes of the: ?, +, * operators.
The angle brackets are typically dropped from non-terminal symbols and a different font is used to distinguish terminals from non-terminals.
This is what I've come up with so far, but I'm not sure it's right.
Sentence :== (<NounPhrase><VerbPhrase>) +
NounPhrase :== <Noun> + (<Article>< AdjectiveList>)?
AdjectiveList :== <Adjective> *
VerbPhrase :== <Verb> + <Adverb>?
Noun :== (frog | grass | goblin)*
Article :== (a | the | that)*
Adjective :== (purple | green | tiny)*
Verb :== (grows | dreams | eats)*
Adverb :== (quickly | slowly | badly)*
What you've come up with isn't correct:
You've not dropped the angle brackets.
In the original, a sentence is a noun phrase followed by a verb phrase; in your rewrite, it is a sequence of one or more 'noun phrase followed by verb phrase'.
In the original, a noun phrase ends with a noun; in your rewrite, it can be followed by a list of zero or one combinations of an article and an adjective list (but not preceded by either an article or an adjective list).
In the original, an adjective list is a sequence of one or more adjectives; in your rewrite, it is a list of zero or more adjectives.
In the original, a verb phrase is a single verb, optionally followed by an adverb; in your rewrite, it is one or more verbs followed by an optional adverb.
In the original, each of noun, article, adjective, verb and adverb is exactly one of three alternative values; in your rewrite, each is a list of zero or more of the corresponding three alternative values.
I'm a little confused as to which brackets to drop. I don't know what the difference is between terminal and non-terminal symbols or how to tell them apart in the above. Would removing the superscript "+" and the parentheses correct it?
Terminal symbols are things that represent themselves. In this context, the words such as 'frog', 'the', 'green', 'dreams' and 'badly' are terminals.
Non-terminal symbols are defined in terms of other symbols, either other non-terminals or in terms of terminals. Things such as <Sentence> and <Noun> are non-terminals.
Angle brackets are the < and > symbols (versus round brackets or parentheses (), square brackets [], or curly brackets or braces {}).
Removing the parentheses and + (and angle brackets) from Sentence :== (<NounPhrase><VerbPhrase>) + would improve it. In standard BNF, the :== symbol is normally ::= and in standard EBNF is replaced by just =, and concatenation is indicated explicitly with a comma:
Sentence = Noun Phrase, Verb Phrase
In standard EBNF, terminals are enclosed in double quotes or single quotes (rather than with a font change). And the 'superscript' isn't necessary, either — the ?, + and * simply appear after the unit that repeats. (Note that standard EBNF uses [ … ] around optional matter and { … } around repeated (zero or more) items, and { … }- around repeated (one or more) items).
NounPhrase = Article ? AdjectiveList ? Noun
Noun = "frog" | "grass" | "goblin"
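To check a conversion like this against sample sentences, one option is to transcribe the EBNF into a regular expression. The following is only a sketch in Python: it assumes words are separated by exactly one space, the constant and function names are invented for the example, and VerbPhrase is taken as Verb followed by an optional Adverb, as in the original BNF.

```python
import re

NOUN = "frog|grass|goblin"
ARTICLE = "a|the|that"
ADJECTIVE = "purple|green|tiny"
VERB = "grows|dreams|eats"
ADVERB = "quickly|slowly|badly"

# NounPhrase = Article? AdjectiveList? Noun, where AdjectiveList = Adjective+
NOUN_PHRASE = rf"(?:(?:{ARTICLE}) )?(?:(?:{ADJECTIVE}) )*(?:{NOUN})"
# VerbPhrase = Verb Adverb?
VERB_PHRASE = rf"(?:{VERB})(?: (?:{ADVERB}))?"
# Sentence = NounPhrase VerbPhrase, exactly once (no + on the group)
SENTENCE = re.compile(rf"{NOUN_PHRASE} {VERB_PHRASE}")

def is_sentence(text: str) -> bool:
    return SENTENCE.fullmatch(text) is not None

print(is_sentence("the tiny green frog eats quickly"))  # True
print(is_sentence("frog dreams dreams"))                # False
```

The second call shows the difference from the first attempt: with Verb + in VerbPhrase, "frog dreams dreams" would have been accepted.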

Forcing gaps between words in a Marpa grammar

I'm trying to set up a grammar that requires that [\w] characters cannot appear directly adjacent to each other if they are not in the same lexeme. That is, words must be separated from each other by a space or punctuation.
Consider the following grammar:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+ ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
Unfortunately, this grammar fails with:
A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.
Although this can be resolved by using a lexeme default statement:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value] ### <== Fix exception by adding lexeme default statement
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This results in the following output:
Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.
That is, the parse has failed due to the fact that there is no gap between 9 and september which is exactly what I want to happen. The only fly in the ointment is that there is an annoying Inaccessible symbol: word message on STDERR because the word lexeme is not used in the actual grammar.
I see that in Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options but I can't do that in Marpa::R2::Scanless.
I also could have done something like the following:
Rule ::= nine september
nine ~ word
september ~ word
then used a pause to use custom code to examine the actual lexeme value and return the appropriate lexeme depending on the value.
What is the best way to construct a grammar that uses keywords or numbers and words but will disallow adjacent lexemes to be run together without white space or punctuation separating them?
Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar
:default ::= action => ::array
:start ::= Rule
Rule ::= '9' (Ws) 'september'
Ws ::= [\s]+
:discard ~ whitespace
whitespace ~ [\s]+
then 9september fails, but 9 september is parsed. Important points to note:
Lexemes can be both discarded and required, when they are both a longest token. This is why the :discard and Ws rule don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
The Ws rule is enclosed in parens, which discards the value – to keep the resulting parse tree clean.
You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. This is meant to be used e.g. for C-like languages where whitespace traditionally does not matter.
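For intuition only, here is a rough regular-expression analogue of that G1-level fix. It is Python, not Marpa, and the names are invented for the example: surrounding whitespace is optional (as with :discard), but at least one whitespace character is required between the two keywords (as with the Ws rule).

```python
import re

# Hypothetical stand-in for: Rule ::= '9' (Ws) 'september'
RULE = re.compile(r"\s*9\s+september\s*")

def parses(text: str) -> bool:
    return RULE.fullmatch(text) is not None

print(parses("9 september"))  # True
print(parses("9september"))   # False
```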

Valid identifier characters in Scala

One thing I find quite confusing is knowing which characters and combinations I can use in method and variable names. For instance
val #^ = 1 // legal
val # = 1 // illegal
val + = 1 // legal
val &+ = 1 // legal
val &2 = 1 // illegal
val £2 = 1 // legal
val ¬ = 1 // legal
As I understand it, there is a distinction between alphanumeric identifiers and operator identifiers. You can mix an match one or the other but not both, unless separated by an underscore (a mixed identifier).
From Programming in Scala section 6.10,
An operator identifier consists of one or more operator characters.
Operator characters are printable ASCII characters such as +, :, ?, ~
or #.
More precisely, an operator character belongs to the Unicode set
of mathematical symbols(Sm) or other symbols(So), or to the 7-bit
ASCII characters that are not letters, digits, parentheses, square
brackets, curly braces, single or double quote, or an underscore,
period, semi-colon, comma, or back tick character.
So we are excluded from using ()[]{}'"_.;, and `
I looked up Unicode mathematical symbols on Wikipedia, but the ones I found didn't include +, :, ? etc. Is there a definitive list somewhere of what the operator characters are?
Also, any ideas why Unicode mathematical operators (rather than symbols) do not count as operators?
Working from the EBNF syntax in the spec:
upper ::= ‘A’ | ... | ‘Z’ | ‘$’ | ‘_’ and Unicode category Lu
lower ::= ‘a’ | ... | ‘z’ and Unicode category Ll
letter ::= upper | lower and Unicode categories Lo, Lt, Nl
digit ::= ‘0’ | ... | ‘9’
opchar ::= “all other characters in \u0020-007F and Unicode
categories Sm, So except parentheses ([]) and periods”
But also taking into account the very beginning on Lexical Syntax that defines:
Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘‘’ | ‘’’ | ‘"’ | ‘.’ | ‘;’ | ‘,’
Here is what I come up with. Working by elimination in the range \u0020-007F, eliminating letters, digits, parentheses and delimiters, we have for opchar... (drumroll):
! # % & * + - / : < = > ? @ \ ^ | ~
and also Sm and So - except for parentheses and periods.
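The elimination can be reproduced mechanically. A small sketch in Python (chosen just for a quick check; the set names are my own, and the exclusion sets are transcribed from the spec extracts quoted above, with space and DEL left out as non-printing):

```python
import string

PARENS = set("()[]{}")
DELIMS = set("`'\".;,")
LETTERS = set(string.ascii_letters) | {"$", "_"}  # the spec counts $ and _ as letters
DIGITS = set(string.digits)
EXCLUDED = PARENS | DELIMS | LETTERS | DIGITS

def ascii_opchars():
    # 0x21..0x7E: the range \u0020-\u007F without space (0x20) and DEL (0x7F)
    return [chr(c) for c in range(0x21, 0x7F) if chr(c) not in EXCLUDED]

print(" ".join(ascii_opchars()))  # ! # % & * + - / : < = > ? @ \ ^ | ~
```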
Edit: in summary, here are some valid examples that highlight all cases. Watch out for \ in the REPL; I had to escape it as \\:
val !#%&*+-/:<=>?@\^|~ = 1 // all simple opchars
val simpleName = 1
val withDigitsAndUnderscores_ab_12_ab12 = 1
val wordEndingInOpChars_!#%&*+-/:<=>?@\^|~ = 1
val !^©® = 1 // opchars and symbols
val abcαβγ_!^©® = 1 // mixing unicode letters and symbols
Note 1:
I found this Unicode category index to figure out Lu, Ll, Lo, Lt, Nl:
Lu (uppercase letters)
Ll (lowercase letters)
Lo (other letters)
Lt (titlecase)
Nl (letter numbers like roman numerals)
Sm (symbol math)
So (symbol other)
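If you would rather query the categories than read them off an index, most languages expose them. A minimal sketch using Python's unicodedata module (the helper name is hypothetical; on the JVM, Character.getType gives the same information):

```python
import unicodedata

def is_symbol_opchar(ch: str) -> bool:
    # True when the character's general category is Sm (math) or So (other symbol).
    return unicodedata.category(ch) in {"Sm", "So"}

# Print the general category of a few characters from the examples.
for ch in "+#£¬©®→α":
    print(repr(ch), unicodedata.category(ch))
```

This only covers the Sm/So part of the opchar definition; the ASCII characters that qualify by elimination (such as #) are not all in those categories.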
Note 2:
val #^ = 1 // legal - two opchars
val # = 1 // illegal - reserved word like class or => or @
val + = 1 // legal - opchar
val &+ = 1 // legal - two opchars
val &2 = 1 // illegal - opchar and letter do not mix arbitrarily
val £2 = 1 // working - £ is part of Sc (Symbol currency) - undefined by spec
val ¬ = 1 // legal - part of Sm
Note 3:
Other operator-looking things that are reserved words: _ : = => <- <: <% >: # @ and also \u21D2 ⇒ and \u2190 ←
The language specification gives the rule in Chapter 1, Lexical Syntax (on page 3):
Operator characters. These consist of all printable ASCII
characters \u0020-\u007F which are in none of the sets above,
mathematical symbols (Sm) and other symbols (So).
This is basically the same as your extract from Programming in Scala. + is not in the Unicode Mathematical Operators block (although its general category is in fact Sm), but either way it is a printable ASCII character in none of the sets listed above (it is not a letter, which includes _ and $, nor a digit, a parenthesis or a delimiter).
In your list:
# is illegal not because the character is not an operator character (#^ is legal), but because it is a reserved word (on page 4), used for type projection.
&2 is illegal because you mix an operator character & and a non-operator character, digit 2
£2 is legal because £ is not an operator character: it is not seven-bit ASCII but part of 8-bit extended ASCII. This is not very consistent, as $ is not an operator character either (it is considered a letter).
Use backticks to escape these limitations and use Unicode symbols:
val `r→f` = 150
println(`r→f`)