BNF to EBNF conversion - discrete-mathematics

I'm trying to convert a given BNF grammar to EBNF and I'm completely clueless about how to do it. Can anyone help?
The BNF is:
<Sentence> :== <NounPhrase><VerbPhrase>
<NounPhrase> :== <Noun>
<NounPhrase> :== <Article><Noun>
<NounPhrase> :== <Article><AdjectiveList><Noun>
<NounPhrase> :== <AdjectiveList><Noun>
<AdjectiveList> :== <Adjective>
<AdjectiveList> :== <Adjective><AdjectiveList>
<VerbPhrase> :== <Verb>
<VerbPhrase> :== <Verb><Adverb>
<Noun> :== frog | grass | goblin
<Article> :== a | the | that
<Adjective> :== purple | green | tiny
<Verb> :== grows | dreams | eats
<Adverb> :== quickly | slowly | badly
Extended BNF grammar uses the following conventions:
A superscript ? after a symbol means it is optional and can appear once or not at all.
A superscript + after a symbol means it must appear at least once but can appear more than once.
A superscript * after a symbol means it can appear not at all, once, or many times.
Paired parentheses can be used to group symbols together for the purposes of the ?, + and * operators.
The angle brackets are typically dropped from non-terminal symbols and a different font is used to distinguish terminals from non-terminals.
This is what I've come up with so far, but I'm not sure it's right.
Sentence :== (<NounPhrase><VerbPhrase>) +
NounPhrase :== <Noun> + (<Article>< AdjectiveList>)?
AdjectiveList :== <Adjective> *
VerbPhrase :== <Verb> + <Adverb>?
Noun :== (frog | grass | goblin)*
Article :== (a | the | that)*
Adjective :== (purple | green | tiny)*
Verb :== (grows | dreams | eats)*
Adverb :== (quickly | slowly | badly)*

What you've come up with isn't correct:
You've not dropped the angle brackets.
In the original, a sentence is a noun phrase followed by a verb phrase; in your rewrite, it is a sequence of one or more 'noun phrase followed by verb phrase'.
In the original, a noun phrase ends with a noun, optionally preceded by an article, an adjective list, or both; in your rewrite, it is one or more nouns optionally followed by an article and an adjective list (and never preceded by either).
In the original, an adjective list is a sequence of one or more adjectives; in your rewrite, it is a list of zero or more adjectives.
In the original, a verb phrase is a single verb, optionally followed by an adverb; in your rewrite, it is one or more verbs optionally followed by an adverb.
In the original, each of noun, article, adjective, verb and adverb is exactly one of three alternative values; in your rewrite, each is a list of zero or more of the corresponding three alternative values.
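One way to see the combined effect of these differences is to translate both grammars into regular expressions over word classes and try a few inputs. This is only a rough Python sketch: the one-letter class codes (N, T, J, V, D) and the helper names are made up for the illustration, and the attempted rules are simplified using the fact that (X*)+ and (X*)? both reduce to X*.

```python
import re

# Hypothetical one-letter codes for the five word classes.
WORD_CLASS = {}
for code, words in {
    "N": "frog grass goblin",      # Noun
    "T": "a the that",             # Article
    "J": "purple green tiny",      # Adjective
    "V": "grows dreams eats",      # Verb
    "D": "quickly slowly badly",   # Adverb
}.items():
    for word in words.split():
        WORD_CLASS[word] = code


def classify(sentence):
    """Map each word to its class code, e.g. 'the frog eats' -> 'TNV'."""
    return "".join(WORD_CLASS[word] for word in sentence.split())


# Original BNF: four NounPhrase alternatives, then Verb with an optional Adverb.
ORIGINAL = re.compile(r"(?:N|TN|TJ+N|J+N)VD?")

# Attempted EBNF: every word class has a trailing *, so each class may match
# zero or more words, and the whole sentence body may repeat.
ATTEMPT = re.compile(r"(?:N*T*J*V*D*)+")


def accepts(pattern, sentence):
    return pattern.fullmatch(classify(sentence)) is not None


print(accepts(ORIGINAL, "the tiny frog eats quickly"))  # True
print(accepts(ORIGINAL, "eats frog the badly"))         # False
print(accepts(ATTEMPT, "eats frog the badly"))          # True
print(accepts(ATTEMPT, ""))                             # True
```

Under this reading, the attempted grammar accepts any sequence of vocabulary words, including the empty one, which is much more than the original allows.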
I'm a little confused as to which brackets to drop. I don't know what the difference is between terminal and non-terminal symbols or how to tell them apart in the above. Would removing the superscript "+" and the parentheses correct it?
Terminal symbols are things that represent themselves. In this context, the words such as 'frog', 'the', 'green', 'dreams' and 'badly' are terminals.
Non-terminal symbols are defined in terms of other symbols, either other non-terminals or in terms of terminals. Things such as <Sentence> and <Noun> are non-terminals.
Angle brackets are the < and > symbols (versus round brackets or parentheses (), square brackets [], or curly brackets or braces {}).
Removing the parentheses and + (and angle brackets) from Sentence :== (<NounPhrase><VerbPhrase>) + would improve it. In standard BNF, the :== symbol is normally ::= and in standard EBNF is replaced by just =, and concatenation is indicated explicitly with a comma:
Sentence = Noun Phrase, Verb Phrase
In standard EBNF, terminals are enclosed in double quotes or single quotes (rather than with a font change). And the 'superscript' isn't necessary, either — the ?, + and * simply appear after the unit that repeats. (Note that standard EBNF uses [ … ] around optional matter and { … } around repeated (zero or more) items, and { … }- around repeated (one or more) items).
NounPhrase = Article ? AdjectiveList ? Noun
Noun = "frog" | "grass" | "goblin"
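As a sanity check, the whole grammar can be transcribed into regular expressions and tried on a few sentences. Only the NounPhrase and Noun rules are spelled out above; the remaining rules in this Python sketch (AdjectiveList = Adjective+, VerbPhrase = Verb Adverb?, Sentence = NounPhrase VerbPhrase) are one reading of the original BNF, and all the names are chosen just for the example.

```python
import re

# Terminals: each word class is exactly one of three alternatives.
NOUN = r"(?:frog|grass|goblin)"
ARTICLE = r"(?:a|the|that)"
ADJECTIVE = r"(?:purple|green|tiny)"
VERB = r"(?:grows|dreams|eats)"
ADVERB = r"(?:quickly|slowly|badly)"

# AdjectiveList = Adjective+ (each adjective is followed by whitespace here)
ADJECTIVE_LIST = rf"(?:{ADJECTIVE}\s+)+"
# NounPhrase = Article? AdjectiveList? Noun
NOUN_PHRASE = rf"(?:{ARTICLE}\s+)?(?:{ADJECTIVE_LIST})?{NOUN}"
# VerbPhrase = Verb Adverb?
VERB_PHRASE = rf"{VERB}(?:\s+{ADVERB})?"
# Sentence = NounPhrase VerbPhrase
SENTENCE = re.compile(rf"{NOUN_PHRASE}\s+{VERB_PHRASE}")


def is_sentence(text):
    """True if the whole text is one sentence of the grammar."""
    return SENTENCE.fullmatch(text) is not None


print(is_sentence("the tiny green frog eats quickly"))  # True
print(is_sentence("frog dreams"))                       # True
print(is_sentence("frog the dreams"))                   # False
print(is_sentence("frog dreams frog dreams"))           # False
```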

Related

Why order of quote character "'" in relation to '#1=' matters in circular lists?

I am experimenting with circular lists and have written a flawed graph viewer. So when I try to draw 2 different graphs
(graph-viewer:graph '#1=(1 2 3 #1#) "a")
(graph-viewer:graph #1='(1 2 3 #1#) "b")
I get two different pictures, with the latter version having a quote symbol included in the graph.
You need to think how the reader does its job.
When it sees #1=, it knows to store whatever comes next and re-use it on #1# - think about it as a "binding" for the "variable 1".
When it sees '<blah>, it reads it as (quote <blah>).
Thus, when it sees '#1=(1 2 3 #1#), it reads it as
(quote (1 2 3 *))
       ^      |
       |      |
       +------+
(quote is outside of "binding of 1")
while #1='(1 2 3 #1#) is read as
(quote (1 2 3 *))
^             |
|             |
+-------------+
(quote is inside the "binding of 1").
The "arrow" in the pics above is the reference, i.e., * points along the arrow.
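The same two shapes can be imitated with nested Python lists, which may make the difference easier to poke at. This is only an analogy: Python lists stand in for Lisp lists, the string "quote" stands in for the quote symbol, and the variable names are made up.

```python
# '#1=(1 2 3 #1#)  reads as (quote L), where L = (1 2 3 L)
inner = [1, 2, 3]
inner.append(inner)                 # the #1# reference points at the inner list
quoted_outside = ["quote", inner]

# #1='(1 2 3 #1#)  reads as F = (quote (1 2 3 F))
quoted_inside = ["quote", [1, 2, 3]]
quoted_inside[1].append(quoted_inside)   # #1# points at the whole quote form

# In the first shape the cycle stays inside the list, so quote is not part of it.
print(quoted_outside[1][3] is quoted_outside[1])   # True

# In the second shape the cycle runs through the quote form itself.
print(quoted_inside[1][3] is quoted_inside)        # True
print(quoted_inside[1][3][0])                      # quote
```

Evaluating the first form strips the outer quote and leaves a list that refers only to itself. Evaluating the second strips one quote, but the list that remains has the whole quote form as its last element, which is why the quote symbol shows up in the picture.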

Characters and digits of Chapter four of the Unicode Standard

In a language specification, there is
name-start-character=
'_' | '\' | ? any code points which are characters as defined by the Unicode character properties, chapter four of the Unicode Standard ?;
Could anyone tell me how to correctly represent "any code points which are characters as defined by the Unicode character properties, chapter four of the Unicode Standard" in a lexer?
Similarly, there is
name-character=
name-start-character | decimal-digit | full-stop | ? any code points which are digits
as defined by the Unicode character properties, chapter four of the Unicode standard ?;
Does anyone know how to faithfully represent "any code points which are digits as defined by the Unicode character properties, chapter four of the Unicode standard" in a lexer?
I have found this, but it is too hard for me to understand.
PS: I use sedlex to write my lexer.
Edit 1:
Previously, I used the following code to make name_start_character. Even though it was not fully complete, it worked more or less.
let first_Latin_identifier_character = [%sedlex.regexp? ('a'..'z') | ('A'..'Z') ]
let subsequent_Latin_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | '\x5F' (* underscore *) | ('0'..'9')]
let latin_identifier = [%sedlex.regexp? first_Latin_identifier_character, (Star subsequent_Latin_identifier_character)]
let cP936_initial_character = [%sedlex.regexp? 0xff21 .. 0xff3a | 0xff41 .. 0xff5a | 0x3001 .. 0x2014 | 0x2016 .. 0x2026 | 0x3014 .. 0x2103 | 0x00a4 .. 0x2605 | 0x2488 .. 0x216b | 0x3041 .. 0xfa29]
let cP936_subsequent_character = [%sedlex.regexp? cP936_initial_character | 0xff3f | 0xff10 .. 0xff19]
let first_sChinese_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | cP936_initial_character]
let subsequent_sChinese_identifier_character = [%sedlex.regexp? subsequent_Latin_identifier_character | cP936_subsequent_character]
let simplified_Chinese_identifier = [%sedlex.regexp? first_sChinese_identifier_character, (Star subsequent_sChinese_identifier_character)]
let cjk_character = [%sedlex.regexp? 0x4E00 .. 0x9FFF | 0x3400 .. 0x4DBF | 0x20000 .. 0x2A6DF | 0x2A700 .. 0x2B73F | 0x2B740 .. 0x2B81F |
0x2B820 .. 0x2CEAF | 0xF900 .. 0xFAFF | 0x2F800 .. 0x2FA1F]
let cjk_identifier = [%sedlex.regexp? (Plus cjk_character)]
let korean_character = [%sedlex.regexp? 0xAC00 .. 0xD7A3]
let korean_identifier = [%sedlex.regexp? Plus korean_character]
let japanese_character = [%sedlex.regexp? 0x3000 .. 0x303f | 0x3040 .. 0x309f | 0x30a0 .. 0x30ff | 0xff00 .. 0xffef] (* except CJK unified ideographs - Common and uncommon kanji (4e00 - 9faf) *)
let japanese_identifier = [%sedlex.regexp? Plus japanese_character]
let cP2_character_874 = [%sedlex.regexp? 0x20ac|0x0081|0x0082|0x0083|0x0084|0x2026|0x0086|0x0087|0x0088|0x0089|0x008a|0x008b|0x008c|0x008d|0x008e|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x0098|0x0099|0x009a|0x009b|0x009c|0x009d|0x009e|0x009f|0x00a0|0x0e01|0x0e02|0x0e03|0x0e04|0x0e05|0x0e06|0x0e07|0x0e08|0x0e09|0x0e0a|0x0e0b|0x0e0c|0x0e0d|0x0e0e|0x0e0f|0x0e10|0x0e11|0x0e12|0x0e13|0x0e14|0x0e15|0x0e16|0x0e17|0x0e18|0x0e19|0x0e1a|0x0e1b|0x0e1c|0x0e1d|0x0e1e|0x0e1f|0x0e20|0x0e21|0x0e22|0x0e23|0x0e24|0x0e25|0x0e26|0x0e27|0x0e28|0x0e29|0x0e2a|0x0e2b|0x0e2c|0x0e2d|0x0e2e|0x0e2f|0x0e30|0x0e31|0x0e32|0x0e33|0x0e34|0x0e35|0x0e36|0x0e37|0x0e38|0x0e39|0x0e3a|0xf8c1|0xf8c2|0xf8c3|0xf8c4|0x0e3f|0x0e40|0x0e41|0x0e42|0x0e43|0x0e44|0x0e45|0x0e46|0x0e47|0x0e48|0x0e49|0x0e4a|0x0e4b|0x0e4c|0x0e4d|0x0e4e|0x0e4f|0x0e50|0x0e51|0x0e52|0x0e53|0x0e54|0x0e55|0x0e56|0x0e57|0x0e58|0x0e59|0x0e5a|0x0e5b|0xf8c5|0xf8c6|0xf8c7|0xf8c8]
let cP2_character_1250 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0083|0x201e|0x2026|0x2020|0x2021|0x0088|0x2030|0x0160|0x2039|0x015a|0x0164|0x017d|0x0179|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x0098|0x2122|0x0161|0x203a|0x015b|0x0165|0x017e|0x017a|0x00a0|0x02c7|0x02d8|0x0141|0x00a4|0x0104|0x00a6|0x00a7|0x00a8|0x00a9|0x015e|0x00ab|0x00ac|0x00ad|0x00ae|0x017b|0x00b0|0x00b1|0x02db|0x0142|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x0105|0x015f|0x00bb|0x013d|0x02dd|0x013e|0x017c|0x0154|0x00c1|0x00c2|0x0102|0x00c4|0x0139|0x0106|0x00c7|0x010c|0x00c9|0x0118|0x00cb|0x011a|0x00cd|0x00ce|0x010e|0x0110|0x0143|0x0147|0x00d3|0x00d4|0x0150|0x00d6|0x00d7|0x0158|0x016e|0x00da|0x0170|0x00dc|0x00dd|0x0162|0x00df|0x0155|0x00e1|0x00e2|0x0103|0x00e4|0x013a|0x0107|0x00e7|0x010d|0x00e9|0x0119|0x00eb|0x011b|0x00ed|0x00ee|0x010f|0x0111|0x0144|0x0148|0x00f3|0x00f4|0x0151|0x00f6|0x00f7|0x0159|0x016f|0x00fa|0x0171|0x00fc|0x00fd|0x0163|0x02d9]
let cP2_character_1251 = [%sedlex.regexp? 0x0402|0x0403|0x201a|0x0453|0x201e|0x2026|0x2020|0x2021|0x20ac|0x2030|0x0409|0x2039|0x040a|0x040c|0x040b|0x040f|0x0452|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x0098|0x2122|0x0459|0x203a|0x045a|0x045c|0x045b|0x045f|0x00a0|0x040e|0x045e|0x0408|0x00a4|0x0490|0x00a6|0x00a7|0x0401|0x00a9|0x0404|0x00ab|0x00ac|0x00ad|0x00ae|0x0407|0x00b0|0x00b1|0x0406|0x0456|0x0491|0x00b5|0x00b6|0x00b7|0x0451|0x2116|0x0454|0x00bb|0x0458|0x0405|0x0455|0x0457|0x0410|0x0411|0x0412|0x0413|0x0414|0x0415|0x0416|0x0417|0x0418|0x0419|0x041a|0x041b|0x041c|0x041d|0x041e|0x041f|0x0420|0x0421|0x0422|0x0423|0x0424|0x0425|0x0426|0x0427|0x0428|0x0429|0x042a|0x042b|0x042c|0x042d|0x042e|0x042f|0x0430|0x0431|0x0432|0x0433|0x0434|0x0435|0x0436|0x0437|0x0438|0x0439|0x043a|0x043b|0x043c|0x043d|0x043e|0x043f|0x0440|0x0441|0x0442|0x0443|0x0444|0x0445|0x0446|0x0447|0x0448|0x0449|0x044a|0x044b|0x044c|0x044d|0x044e|0x044f]
let cP2_character_1252 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x02c6|0x2030|0x0160|0x2039|0x0152|0x008d|0x017d|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x02dc|0x2122|0x0161|0x203a|0x0153|0x009d|0x017e|0x0178|0x00a0|0x00a1|0x00a2|0x00a3|0x00a4|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0x00aa|0x00ab|0x00ac|0x00ad|0x00ae|0x00af|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x00b9|0x00ba|0x00bb|0x00bc|0x00bd|0x00be|0x00bf|0x00c0|0x00c1|0x00c2|0x00c3|0x00c4|0x00c5|0x00c6|0x00c7|0x00c8|0x00c9|0x00ca|0x00cb|0x00cc|0x00cd|0x00ce|0x00cf|0x00d0|0x00d1|0x00d2|0x00d3|0x00d4|0x00d5|0x00d6|0x00d7|0x00d8|0x00d9|0x00da|0x00db|0x00dc|0x00dd|0x00de|0x00df|0x00e0|0x00e1|0x00e2|0x00e3|0x00e4|0x00e5|0x00e6|0x00e7|0x00e8|0x00e9|0x00ea|0x00eb|0x00ec|0x00ed|0x00ee|0x00ef|0x00f0|0x00f1|0x00f2|0x00f3|0x00f4|0x00f5|0x00f6|0x00f7|0x00f8|0x00f9|0x00fa|0x00fb|0x00fc|0x00fd|0x00fe|0x00ff]
let cP2_character_1253 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x0088|0x2030|0x008a|0x2039|0x008c|0x008d|0x008e|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x0098|0x2122|0x009a|0x203a|0x009c|0x009d|0x009e|0x009f|0x00a0|0x0385|0x0386|0x00a3|0x00a4|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0xf8f9|0x00ab|0x00ac|0x00ad|0x00ae|0x2015|0x00b0|0x00b1|0x00b2|0x00b3|0x0384|0x00b5|0x00b6|0x00b7|0x0388|0x0389|0x038a|0x00bb|0x038c|0x00bd|0x038e|0x038f|0x0390|0x0391|0x0392|0x0393|0x0394|0x0395|0x0396|0x0397|0x0398|0x0399|0x039a|0x039b|0x039c|0x039d|0x039e|0x039f|0x03a0|0x03a1|0xf8fa|0x03a3|0x03a4|0x03a5|0x03a6|0x03a7|0x03a8|0x03a9|0x03aa|0x03ab|0x03ac|0x03ad|0x03ae|0x03af|0x03b0|0x03b1|0x03b2|0x03b3|0x03b4|0x03b5|0x03b6|0x03b7|0x03b8|0x03b9|0x03ba|0x03bb|0x03bc|0x03bd|0x03be|0x03bf|0x03c0|0x03c1|0x03c2|0x03c3|0x03c4|0x03c5|0x03c6|0x03c7|0x03c8|0x03c9|0x03ca|0x03cb|0x03cc|0x03cd|0x03ce|0xf8fb]
let cP2_character_1254 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x02c6|0x2030|0x0160|0x2039|0x0152|0x008d|0x008e|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x02dc|0x2122|0x0161|0x203a|0x0153|0x009d|0x009e|0x0178|0x00a0|0x00a1|0x00a2|0x00a3|0x00a4|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0x00aa|0x00ab|0x00ac|0x00ad|0x00ae|0x00af|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x00b9|0x00ba|0x00bb|0x00bc|0x00bd|0x00be|0x00bf|0x00c0|0x00c1|0x00c2|0x00c3|0x00c4|0x00c5|0x00c6|0x00c7|0x00c8|0x00c9|0x00ca|0x00cb|0x00cc|0x00cd|0x00ce|0x00cf|0x011e|0x00d1|0x00d2|0x00d3|0x00d4|0x00d5|0x00d6|0x00d7|0x00d8|0x00d9|0x00da|0x00db|0x00dc|0x0130|0x015e|0x00df|0x00e0|0x00e1|0x00e2|0x00e3|0x00e4|0x00e5|0x00e6|0x00e7|0x00e8|0x00e9|0x00ea|0x00eb|0x00ec|0x00ed|0x00ee|0x00ef|0x011f|0x00f1|0x00f2|0x00f3|0x00f4|0x00f5|0x00f6|0x00f7|0x00f8|0x00f9|0x00fa|0x00fb|0x00fc|0x0131|0x015f|0x00ff]
let cP2_character_1255 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x02c6|0x2030|0x008a|0x2039|0x008c|0x008d|0x008e|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x02dc|0x2122|0x009a|0x203a|0x009c|0x009d|0x009e|0x009f|0x00a0|0x00a1|0x00a2|0x00a3|0x20aa|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0x00d7|0x00ab|0x00ac|0x00ad|0x00ae|0x00af|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x00b9|0x00f7|0x00bb|0x00bc|0x00bd|0x00be|0x00bf|0x05b0|0x05b1|0x05b2|0x05b3|0x05b4|0x05b5|0x05b6|0x05b7|0x05b8|0x05b9|0x05ba|0x05bb|0x05bc|0x05bd|0x05be|0x05bf|0x05c0|0x05c1|0x05c2|0x05c3|0x05f0|0x05f1|0x05f2|0x05f3|0x05f4|0xf88d|0xf88e|0xf88f|0xf890|0xf891|0xf892|0xf893|0x05d0|0x05d1|0x05d2|0x05d3|0x05d4|0x05d5|0x05d6|0x05d7|0x05d8|0x05d9|0x05da|0x05db|0x05dc|0x05dd|0x05de|0x05df|0x05e0|0x05e1|0x05e2|0x05e3|0x05e4|0x05e5|0x05e6|0x05e7|0x05e8|0x05e9|0x05ea|0xf894|0xf895|0x200e|0x200f|0xf896]
let cP2_character_1256 = [%sedlex.regexp? 0x20ac|0x067e|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x02c6|0x2030|0x0679|0x2039|0x0152|0x0686|0x0698|0x0688|0x06af|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x06a9|0x2122|0x0691|0x203a|0x0153|0x200c|0x200d|0x06ba|0x00a0|0x060c|0x00a2|0x00a3|0x00a4|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0x06be|0x00ab|0x00ac|0x00ad|0x00ae|0x00af|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x00b9|0x061b|0x00bb|0x00bc|0x00bd|0x00be|0x061f|0x06c1|0x0621|0x0622|0x0623|0x0624|0x0625|0x0626|0x0627|0x0628|0x0629|0x062a|0x062b|0x062c|0x062d|0x062e|0x062f|0x0630|0x0631|0x0632|0x0633|0x0634|0x0635|0x0636|0x00d7|0x0637|0x0638|0x0639|0x063a|0x0640|0x0641|0x0642|0x0643|0x00e0|0x0644|0x00e2|0x0645|0x0646|0x0647|0x0648|0x00e7|0x00e8|0x00e9|0x00ea|0x00eb|0x0649|0x064a|0x00ee|0x00ef|0x064b|0x064c|0x064d|0x064e|0x00f4|0x064f|0x0650|0x00f7|0x0651|0x00f9|0x0652|0x00fb|0x00fc|0x200e|0x200f|0x06d2]
let cP2_character_1257 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0083|0x201e|0x2026|0x2020|0x2021|0x0088|0x2030|0x008a|0x2039|0x008c|0x00a8|0x02c7|0x00b8|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x0098|0x2122|0x009a|0x203a|0x009c|0x00af|0x02db|0x009f|0x00a0|0xf8fc|0x00a2|0x00a3|0x00a4|0xf8fd|0x00a6|0x00a7|0x00d8|0x00a9|0x0156|0x00ab|0x00ac|0x00ad|0x00ae|0x00c6|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00f8|0x00b9|0x0157|0x00bb|0x00bc|0x00bd|0x00be|0x00e6|0x0104|0x012e|0x0100|0x0106|0x00c4|0x00c5|0x0118|0x0112|0x010c|0x00c9|0x0179|0x0116|0x0122|0x0136|0x012a|0x013b|0x0160|0x0143|0x0145|0x00d3|0x014c|0x00d5|0x00d6|0x00d7|0x0172|0x0141|0x015a|0x016a|0x00dc|0x017b|0x017d|0x00df|0x0105|0x012f|0x0101|0x0107|0x00e4|0x00e5|0x0119|0x0113|0x010d|0x00e9|0x017a|0x0117|0x0123|0x0137|0x012b|0x013c|0x0161|0x0144|0x0146|0x00f3|0x014d|0x00f5|0x00f6|0x00f7|0x0173|0x0142|0x015b|0x016b|0x00fc|0x017c|0x017e|0x02d9]
let cP2_character_1258 = [%sedlex.regexp? 0x20ac|0x0081|0x201a|0x0192|0x201e|0x2026|0x2020|0x2021|0x02c6|0x2030|0x008a|0x2039|0x0152|0x008d|0x008e|0x008f|0x0090|0x2018|0x2019|0x201c|0x201d|0x2022|0x2013|0x2014|0x02dc|0x2122|0x009a|0x203a|0x0153|0x009d|0x009e|0x0178|0x00a0|0x00a1|0x00a2|0x00a3|0x00a4|0x00a5|0x00a6|0x00a7|0x00a8|0x00a9|0x00aa|0x00ab|0x00ac|0x00ad|0x00ae|0x00af|0x00b0|0x00b1|0x00b2|0x00b3|0x00b4|0x00b5|0x00b6|0x00b7|0x00b8|0x00b9|0x00ba|0x00bb|0x00bc|0x00bd|0x00be|0x00bf|0x00c0|0x00c1|0x00c2|0x0102|0x00c4|0x00c5|0x00c6|0x00c7|0x00c8|0x00c9|0x00ca|0x00cb|0x0300|0x00cd|0x00ce|0x00cf|0x0110|0x00d1|0x0309|0x00d3|0x00d4|0x01a0|0x00d6|0x00d7|0x00d8|0x00d9|0x00da|0x00db|0x00dc|0x01af|0x0303|0x00df|0x00e0|0x00e1|0x00e2|0x0103|0x00e4|0x00e5|0x00e6|0x00e7|0x00e8|0x00e9|0x00ea|0x00eb|0x0301|0x00ed|0x00ee|0x00ef|0x0111|0x00f1|0x0323|0x00f3|0x00f4|0x01a1|0x00f6|0x00f7|0x00f8|0x00f9|0x00fa|0x00fb|0x00fc|0x01b0|0x20ab|0x00ff]
let cP2_character = [%sedlex.regexp? cP2_character_874 | cP2_character_1250 | cP2_character_1251 | cP2_character_1252
| cP2_character_1253 | cP2_character_1254 | cP2_character_1255 | cP2_character_1256
| cP2_character_1257 | cP2_character_1258]
let codepage_identifier = [%sedlex.regexp? (first_Latin_identifier_character | cP2_character), Star (subsequent_Latin_identifier_character | cP2_character)]
let name_start_character = [%sedlex.regexp? '_' | '\x5C' |
first_Latin_identifier_character |
cP2_character |
cjk_character |
korean_character |
japanese_character]
Then, I tried rici's simpler solution:
let name_start_character = [%sedlex.regexp? '_' | '\x5C' | Compl (cn | cs)]
It failed with "Fatal error: exception Stack overflow", followed by "Error: Error while running external preprocessor".
In brief, Chapter 4 defines a number of properties which indicate information about characters. This can indicate e.g. "this is a whitespace character", "this is a combining character", etc, and, predictably, "this is a digit". Also, paradoxically, "this is not a character".
How to reliably and robustly codify this information in a lexer depends on that particular lexer's requirements, and also on whether you need to update it when Unicode is updated, or if a static snapshot of the current Unicode standard is enough.
Either way, you will want to download and parse the Unicode Character Database and extract an enumeration of the code points with the properties you are asking about.
For a quick sampler of digits, e.g.
https://www.fileformat.info/search/google.htm?q=nine brings up mostly characters which have the "decimal digit" property. When you visit the individual results, examine the "Category" field near the top of each individual page, and the Character.isDigit() field further down. https://www.fileformat.info/info/unicode/category/Nd/list.htm has a full listing of the members of the category. The parent page https://www.fileformat.info/info/unicode/category/index.htm has a list of all the categories, with links to similar individual category pages with lists of their members.
https://www.unicode.org/faq/private_use.html contains a section which explains and enumerates the stable set of 66 code points which are defined as "noncharacters". Any others would satisfy the first definition in your question.
Sedlex, according to its documentation, has predefined regular expression classes, many of which correspond to the Unicode standard. I believe you can use these to easily satisfy the requirement, assuming that sedlex is built with the same or newer Unicode version as that used by the document you are parsing:
Code points which are characters: Compl (cn | cs)
Digits: nd
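If it helps to experiment outside the lexer, the same two classes can be checked against the general category that Python's unicodedata module reports. This is only a sketch for exploration, not sedlex code: the function names are invented, the results depend on the Unicode version bundled with your Python, and the two name predicates simply transcribe the grammar rules quoted in the question.

```python
import unicodedata


def is_character(ch):
    """Any code point whose general category is neither Cn nor Cs."""
    return unicodedata.category(ch) not in ("Cn", "Cs")


def is_digit(ch):
    """Decimal digits: general category Nd."""
    return unicodedata.category(ch) == "Nd"


def is_name_start_character(ch):
    # '_' and '\' are listed explicitly in the grammar, although both are
    # ordinary assigned characters anyway.
    return ch in "_\\" or is_character(ch)


def is_name_character(ch):
    # decimal-digit and full-stop from the grammar are covered by
    # is_character already; they are kept here to mirror the rule.
    return is_name_start_character(ch) or ch == "." or is_digit(ch)


print(is_name_start_character("\u540d"))    # CJK ideograph, category Lo
print(is_name_start_character("\uffff"))    # noncharacter, category Cn
print(is_name_start_character("\ud800"))    # surrogate, category Cs
print(is_digit("\u0663"))                   # ARABIC-INDIC DIGIT THREE, Nd
print(is_digit("\u2167"))                   # ROMAN NUMERAL EIGHT, Nl
```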
Code points which are characters
The set of "code points which are characters as defined by the Unicode character properties" is actually defined in chapter two of the Unicode standard, not chapter four. In the current version (14.0.0), the definition is shown in Table 2.3 on page 30 (which is page 23 of the linked PDF).
In Unicode, every codepoint has a two-letter "general-category", which is normative although not always very informative. Table 2.3 relates general categories to three possible "character status" values: "Assigned to abstract character", "Cannot be assigned to abstract character" and "Not assigned to abstract character", as follows:
Cannot be assigned to abstract character: Cs
Not assigned to abstract character: Cn
Assigned to abstract character: Everything else.
It's worth noting that the 66 "non-characters" in the standard have category Cn, not Cs as one might expect. Nonetheless, the standard guarantees that these 66 characters will never be assigned. Category Cs refers to codepoints which can only be used in two-codepoint sequences ("surrogate pairs"), and only in UTF-16 encoding. A well-formed UTF-8 or UTF-32 sequence cannot contain surrogate codepoints. But it can contain "non-characters", even though the code doesn't actually map to a character. (Thus, "non-character" codes can be used internally as sentinels or for other purposes; surrogate codes cannot be used for any purpose other than their use in UTF-16 encoding.)
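The distinction is easy to check from Python, whose unicodedata module exposes the general category. The sketch below assumes the usual enumeration of the 66 noncharacters from the Unicode Standard (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes); the helper name is invented for the example.

```python
import unicodedata

# The 66 noncharacters: 32 in the BMP block plus 2 at the end of each plane.
noncharacters = list(range(0xFDD0, 0xFDF0)) + [
    (plane << 16) | low for plane in range(17) for low in (0xFFFE, 0xFFFF)
]
print(len(noncharacters))                                        # 66
print({unicodedata.category(chr(cp)) for cp in noncharacters})   # {'Cn'}
print(unicodedata.category(chr(0xD800)))                         # Cs


def can_encode_utf8(code_point):
    """True if the code point can appear in a well-formed UTF-8 sequence."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_utf8(0xFFFE))   # True: a noncharacter still encodes
print(can_encode_utf8(0xD800))   # False: a lone surrogate does not
```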
In short, you can get the set of codepoints mapped to characters in whatever Unicode version was used by sedlex to create its predefined patterns. That would be the categories cc, cf, co, ll, lm, lo, lt, lu, mc, me, mn, nd, nl, no, pc, pd, pe, pf, pi, po, ps, sc, sk, sm, so, zl, zp, zs, which I believe you can write as Compl (cs | cn). Note that the list of categories is fixed by the stability guarantees of the Unicode standard, but many of the categories may be extended with new characters in future versions.
Digits
The general categories themselves are explained (to some extent) in section 4.5 of Chapter 4. There is no category called "digits"; the closest one would be category Nd, which is "Numbers, decimal digits". I'd recommend using that one (nd in sedlex).

How can I define a Regexp::Grammar rule that ignores leading whitespaces?

From the Regexp::Grammars documentation:
The difference between a token and a rule is that a token treats any
whitespace within it exactly as a normal Perl regular expression
would. That is, a sequence of whitespace in a token is ignored if the
/x modifier is in effect, or else matches the same literal sequence
of whitespace characters (if /x is not in effect).
In a rule, most sequences of whitespace are treated as matching the
implicit subrule <.ws>, which is automatically predefined to match
optional whitespace (i.e. \s*).
...
In other words, a rule such as:
<rule: sentence> <noun> <verb>
| <verb> <noun>
is equivalent to a token with added non-capturing whitespace matching:
<token: sentence> <.ws> <noun> <.ws> <verb>
| <.ws> <verb> <.ws> <noun>
Is there a way to get the rule to ignore the leading implicit <.ws>? In the example above, it would be equivalent to:
<token: sentence> <noun> <.ws> <verb>
| <verb> <.ws> <noun>
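To make the difference concrete, the two forms can be imitated with ordinary regular expressions. This is only an analogy written in Python, not Regexp::Grammars syntax; the noun and verb patterns are invented, and \s* plays the role of <.ws>.

```python
import re

NOUN = r"(?:frog|goblin)"   # hypothetical <noun>
VERB = r"(?:eats|dreams)"   # hypothetical <verb>
WS = r"\s*"                 # stands in for the implicit <.ws>

# What the rule currently behaves like: whitespace allowed before every subrule.
rule_like = re.compile(rf"{WS}{NOUN}{WS}{VERB}|{WS}{VERB}{WS}{NOUN}")

# What I would like: no implicit whitespace before the first subrule.
wanted = re.compile(rf"{NOUN}{WS}{VERB}|{VERB}{WS}{NOUN}")

print(rule_like.match("  frog eats") is not None)   # True
print(wanted.match("  frog eats") is not None)      # False
print(wanted.match("frog  eats") is not None)       # True
```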

when to quote symbol in Emacs Lisp

I've begun learning programming with Emacs Lisp, and I'm confused by symbol quotation.
For example:
(progn
  (setq a '(1 2))
  (prin1 a)
  (add-to-list 'a 3)
  (prin1 a)
  (setcar a 4)
  (prin1 a)
  (push 5 a)
  ""
  )
Why does the add-to-list function need a quoted symbol as its first argument, while the setcar and push functions need no quotation of their argument?
Here's a diagram that represents the symbol a and its value after (setq a '(1 2)). The boxes are elementary data structures (symbols and conses) and the arrows are pointers (where a piece of data references another). (I'm simplifying a little.)
 symbol                     cons              cons
+-------+----------+       +------+------+   +------+------+
|name:  |variable: |       |car:  |cdr:  |   |car:  |cdr:  |
| a     |    |     |       | 1    |  |   |   | 2    | nil  |
+-------+----|-----+       +------+--|---+   +------+------+
             |             ↑         |       ↑
             +-------------+         +-------+
The expression '(1 2) builds the two conses on the right, which make up a two-element list. The expression (setq a '(1 2)) creates the symbol a if it doesn't exist, then makes its “variable slot” (the part that contains the value of the symbol) point to the newly created list. setq is a built-in macro, and (setq a '(1 2)) is shorthand for (set 'a '(1 2)). The first argument of set is the symbol to modify and the second argument is the value to set the symbol's variable slot to.
(add-to-list 'a 3) is equivalent to (set 'a (cons 3 a)) here, because 3 is not in the list. This expression does four things:
Create a new cons cell.
Set the new cons cell's car field to 3.
Set the new cons cell's cdr field to the former (and still current) value of a (i.e. copy the contents of a's variable slot).
Set the variable slot of a to the new cons cell.
After that call, the data structures involved look like this:
 symbol                     cons              cons              cons
+-------+----------+       +------+------+   +------+------+   +------+------+
|name:  |variable: |       |car:  |cdr:  |   |car:  |cdr:  |   |car:  |cdr:  |
| a     |    |     |       | 3    |  |   |   | 1    |  |   |   | 2    | nil  |
+-------+----|-----+       +------+--|---+   +------+--|---+   +------+------+
             |             ↑         |       ↑         |       ↑
             +-------------+         +-------+         +-------+
The call to setcar doesn't create any new data structure, and doesn't act on the symbol a but on its value, which is the cons cell whose car currently contains 3. After (setcar a 4), the data structures look like this:
 symbol                     cons              cons              cons
+-------+----------+       +------+------+   +------+------+   +------+------+
|name:  |variable: |       |car:  |cdr:  |   |car:  |cdr:  |   |car:  |cdr:  |
| a     |    |     |       | 4    |  |   |   | 1    |  |   |   | 2    | nil  |
+-------+----|-----+       +------+--|---+   +------+--|---+   +------+------+
             |             ↑         |       ↑         |       ↑
             +-------------+         +-------+         +-------+
push is a macro; here, (push 5 a) is equivalent to (set 'a (cons 5 a)).
setq and push are macros (setq is a “special form”, which as far as we're concerned here means a macro whose definition is built into the interpreter and not provided in Lisp). Macros receive their arguments unevaluated and can choose to expand them or not. set, setcar and add-to-list are functions, which receive their arguments evaluated. Evaluating a symbol returns the contents of its variable slot, e.g. after the initial (setq a '(1 2)) the value of the symbol a is the cons cell whose car contains 1.
If you're still confused, I suggest experimenting with (setq b a) and seeing for yourself which of the expressions modify b when you act on a (the ones that act on the symbol a) and which don't (the ones that act on the value of the symbol a).
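If you'd rather see that experiment spelled out, here is a small model of it in Python. It is only a sketch of the idea, not how Emacs is implemented: the Cons class, the variable_slot dictionary (symbol name to value) and the function names are all stand-ins invented for the illustration.

```python
class Cons:
    """A cons cell: a pair of pointers named car and cdr."""

    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr


def elements(cell):
    """Collect the cars of a chain of cons cells into a Python list."""
    out = []
    while cell is not None:
        out.append(cell.car)
        cell = cell.cdr
    return out


variable_slot = {}  # symbol name -> contents of its variable slot


def add_to_list(symbol, element):
    """Receives the symbol itself, so it can re-point the variable slot."""
    if element not in elements(variable_slot[symbol]):
        variable_slot[symbol] = Cons(element, variable_slot[symbol])


def setcar(cell, value):
    """Receives a cons cell (the value), and modifies that cell in place."""
    cell.car = value


variable_slot["a"] = Cons(1, Cons(2))       # (setq a '(1 2))
variable_slot["b"] = variable_slot["a"]     # (setq b a)

add_to_list("a", 3)                         # (add-to-list 'a 3)
print(elements(variable_slot["a"]), elements(variable_slot["b"]))  # [3, 1, 2] [1, 2]

setcar(variable_slot["a"], 4)               # (setcar a 4)
print(elements(variable_slot["a"]), elements(variable_slot["b"]))  # [4, 1, 2] [1, 2]

setcar(variable_slot["b"], 9)               # (setcar b 9)
print(elements(variable_slot["a"]), elements(variable_slot["b"]))  # [4, 9, 2] [9, 2]
```

The last step changes both lists because the cons cell that b points to is also the second cell of a's list: acting on a value affects everyone who shares it, while acting on the symbol a never touches b.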
Functions evaluate their arguments before execution, so quote when you need to pass an actual symbol (as a pointer to some data structure, for example) and don't quote when you want to pass the variable's value.
add-to-list sets the value of the variable named by its first argument, so it needs a quoted symbol.
push is not a function but a macro; that is why it is able to accept unquoted arguments without evaluating them. setcar is an ordinary function, but it operates on the value of a (a cons cell) rather than on the symbol a, so its argument is not quoted.
The other answers given so far clarify the use of quote and the difference between functions, on the one hand, and macros and special forms on the other hand.
However, they do not get to another part of the question: why is add-to-list as it is? Why does it require its first argument to be a symbol? That's a separate question from whether or not it evaluates the argument. It's the real question behind the design of add-to-list.
One could imagine that add-to-list evaluated its args, and expected the value of the first arg to be a list, and then added the value of the second arg to that list as an element and returned the result (new list or same list). That would let you do (add-to-list foo 'shoe) to add the symbol shoe to the list that is the value of foo -- say (1 2 buckle) --, to give (1 2 buckle shoe).
The point is that such a function would not be very useful. Why? Because the list value isn't necessarily accessible. The variable foo might be thought of as a way to access it -- a "handle" or "pointer" to it. But that is not true of the list that the function returns. That returned list can be composed of new list structure, and there is typically nothing (no variable) pointing to that list. The function add-to-list never sees the symbol (variable) foo -- it has no way of knowing that the list value it receives as first argument is bound to foo. If add-to-list were designed that way then you would still need to assign its returned result to your list variable.
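A rough Python analogue of that hypothetical design shows the problem. The function name is invented, and a Python list stands in for the Lisp list; this is not how add-to-list actually works.

```python
def add_to_list_by_value(items, element):
    """Hypothetical variant that only sees the list value, never the variable."""
    if element in items:
        return items
    return items + [element]   # new list structure; the caller's variable is untouched


foo = [1, 2, "buckle"]
result = add_to_list_by_value(foo, "shoe")
print(foo)      # [1, 2, 'buckle']
print(result)   # [1, 2, 'buckle', 'shoe']

foo = result    # the extra assignment the caller would always have to write
```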
IOW, add-to-list evaluates its args because it is a function, but that doesn't explain much. It expects a symbol as the value of its first arg. And it expects the value of that variable (symbol) to be a list. It adds the value of its second arg to the list (possibly changing the list structure), and it sets the value of the variable that is the value of its first arg to that list.
Bottom line: It needs a symbol as arg because its job is to assign a new value to that symbol (the new value being the same list value or the same value with the new list element added at the front).
And yes, another way to go would be to use a macro or special form, as in push. It's the same idea: push wants a symbol as its second arg. The difference is that push does not evaluate its args so the symbol need not be quoted. But in both cases (in Emacs Lisp) the code needs to get hold of a symbol in order to set its value to the augmented list.

Valid identifier characters in Scala

One thing I find quite confusing is knowing which characters and combinations I can use in method and variable names. For instance
val #^ = 1 // legal
val # = 1 // illegal
val + = 1 // legal
val &+ = 1 // legal
val &2 = 1 // illegal
val £2 = 1 // legal
val ¬ = 1 // legal
As I understand it, there is a distinction between alphanumeric identifiers and operator identifiers. You can mix and match characters within one kind or the other but not across both, unless they are separated by an underscore (a mixed identifier).
From Programming in Scala section 6.10,
An operator identifier consists of one or more operator characters.
Operator characters are printable ASCII characters such as +, :, ?, ~
or #.
More precisely, an operator character belongs to the Unicode set
of mathematical symbols(Sm) or other symbols(So), or to the 7-bit
ASCII characters that are not letters, digits, parentheses, square
brackets, curly braces, single or double quote, or an underscore,
period, semi-colon, comma, or back tick character.
So we are excluded from using ()[]{}'"_.;, and `
I looked up Unicode mathematical symbols on Wikipedia, but the ones I found didn't include +, :, ? etc. Is there a definitive list somewhere of what the operator characters are?
Also, any ideas why Unicode mathematical operators (rather than symbols) do not count as operators?
Working from the EBNF syntax in the spec:
upper ::= ‘A’ | ... | ‘Z’ | ‘$’ | ‘_’ and Unicode category Lu
lower ::= ‘a’ | ... | ‘z’ and Unicode category Ll
letter ::= upper | lower and Unicode categories Lo, Lt, Nl
digit ::= ‘0’ | ... | ‘9’
opchar ::= “all other characters in \u0020-007F and Unicode
categories Sm, So except parentheses ([]) and periods”
But also taking into account the very beginning on Lexical Syntax that defines:
Parentheses ‘(’ | ‘)’ | ‘[’ | ‘]’ | ‘{’ | ‘}’.
Delimiter characters ‘`’ | ‘'’ | ‘"’ | ‘.’ | ‘;’ | ‘,’
Here is what I came up with. Working by elimination in the range \u0020-007F, eliminating letters, digits, parentheses and delimiters, we have for opchar... (drumroll):
! # % & * + - / : < = > ? @ \ ^ | ~
and also Sm and So - except for parentheses and periods.
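The elimination is mechanical enough to script. The Python sketch below is just a check of the ASCII part: it assumes that $ and _ count as letters (per the upper rule), and it starts at \u0021 and stops at \u007E so that the space and the DEL control character are left out.

```python
import string

PARENTHESES = set("()[]{}")
DELIMITERS = set("`'\".;,")
LETTERS = set(string.ascii_letters) | {"$", "_"}   # upper also includes $ and _
DIGITS = set(string.digits)

excluded = PARENTHESES | DELIMITERS | LETTERS | DIGITS

# Printable, non-space ASCII: \u0021 .. \u007E
opchars = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in excluded)

print(opchars)        # !#%&*+-/:<=>?@\^|~
print(len(opchars))   # 18
```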
(Edit: adding valid examples here.) In summary, here are some valid examples that highlight all cases. Watch out for \ in the REPL; I had to escape it as \\:
val !#%&*+-/:<=>?@\^|~ = 1 // all simple opchars
val simpleName = 1
val withDigitsAndUnderscores_ab_12_ab12 = 1
val wordEndingInOpChars_!#%&*+-/:<=>?@\^|~ = 1
val !^©® = 1 // opchars and symbols
val abcαβγ_!^©® = 1 // mixing unicode letters and symbols
Note 1:
I found this Unicode category index to figure out Lu, Ll, Lo, Lt, Nl:
Lu (uppercase letters)
Ll (lowercase letters)
Lo (other letters)
Lt (titlecase)
Nl (letter numbers like roman numerals)
Sm (symbol math)
So (symbol other)
Note 2:
val #^ = 1 // legal - two opchars
val # = 1 // illegal - reserved word like class or => or @
val + = 1 // legal - opchar
val &+ = 1 // legal - two opchars
val &2 = 1 // illegal - opchar and letter do not mix arbitrarily
val £2 = 1 // working - £ is part of Sc (Symbol currency) - undefined by spec
val ¬ = 1 // legal - part of Sm
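To look up the category of any particular character, Python's unicodedata module is handy. This is just a lookup aid, not part of Scala; the result reflects whatever Unicode version your Python ships with, and the helper name is made up.

```python
import unicodedata


def categories(text):
    """Map each character of text to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in text}


# Characters from the examples above.
for ch, cat in categories("+#^&£¬©®→").items():
    print(f"U+{ord(ch):04X} {ch} {cat}")

found = categories("+£¬©")
print(found["£"])   # Sc
print(found["¬"])   # Sm
print(found["©"])   # So
print(found["+"])   # Sm
```

Note that the ASCII + itself is category Sm, even though it does not live in the Mathematical Operators block.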
Note 3:
Other operator-looking things that are reserved words: _ : = => <- <: <% >: # @ and also \u21D2 ⇒ and \u2190 ←
The language specification gives the rule in Chapter 1, lexical syntax (on page 3):
Operator characters. These consist of all printable ASCII
characters \u0020-\u007F which are in none of the sets above,
mathematical symbols (Sm) and other symbols (So).
This is basically the same as your extract from Programming in Scala. + does not belong to the Unicode Mathematical Operators block (although its general category is in fact Sm), and in any case it is an ASCII printable character not listed above (not a letter, including _ or $, a digit, a parenthesis, or a delimiter).
In your list:
# is illegal not because the character is not an operator character
(#^ is legal), but because it is a reserved word (on page 4), for type projection.
&2 is illegal because you mix an operator character & and a non-operator character, digit 2
£2 is legal because £ is not an operator character: it is not seven-bit ASCII, but 8-bit extended ASCII. That is not nice, as $ is not one either (it is considered a letter).
Use backticks to escape these limitations and use Unicode symbols:
val `r→f` = 150
println(`r→f`)