How does the Scala Syntax Specification match an if-else whose one-line expressions end with semicolons?

I'm studying the Scala Syntax Specification and am confused by the if-else syntax:
Expr1 ::= ‘if’ ‘(’ Expr ‘)’ {nl} Expr [[semi] ‘else’ Expr]
| ...
How does it match the if-else below, in which each one-line expression ends with a semicolon?
if (true) // \n
println(1); //\n
else //\n
println(2); //\n
Note that there are 4 lines, each followed by a '\n'. I have these questions:
If the 1st ; (after println(1)) matches the semi before else ([[semi] ‘else’ Expr]), what matches the 2nd '\n', the one right after that ;?
What matches the 3rd '\n', after else?
What matches the 2nd ; and the 4th '\n' after println(2), given that the if-else production doesn't match any ; or '\n' at its tail?

I think you are confused because you assume that every newline must match the nl token. That is not correct.
Newlines are, in general, simply treated as whitespace. The Lexical Syntax chapter has a very long subsection, 1.2 Newline characters, which explains in detail exactly when a newline character is an nl token and when it isn't.
Only the first newline character in your example is an nl token; the other three are just whitespace, because each of them follows a ; or an else, and neither of those tokens can end a statement.

In Scala, the semicolon ; is optional: a newline normally ends a statement just as well.
With braces, an if-else is simple:
if (true) {
  "\n" // evaluated, then discarded
  println(1) // executed for its side effect
  "\n" // last expression: this is the value of the block
} else {
  "\n" // evaluated, then discarded
  println(2) // executed for its side effect
  "\n" // last expression: this is the value of the block
}
Or you can omit the braces, but then each branch must be a single expression:
if (true)
  "\n" // value of this branch; without braces only one expression fits here
else
  "\n"
Without comments: if (true) "\n" else "\n"
More about if-else in Scala

Related

Greedy negative lookbehind (in Swift)

I'm in need of a regular expression that acts like the following:
matches (any part of foo() in the following statement):
foo()
arg: foo()
foo()
(arg: foo()) {}
does not match:
#foo()
I currently have the following, but it has some problems:
^\s*?(?<!#)((\w+?)\()
^\s*? includes any whitespace at the beginning of the line, which means arg: foo() doesn't match the foo() bit. I had to include this to get the # lookbehind working correctly;
(?<!#) is a lookbehind to discard the match if a # before the thing() is matched;
(\w+?)\( matches the part of thething( correctly, only if there's no # before it.
Without the ^\s*? the regex behaves partly correctly, but not the way it should: it ought to discard the match entirely, not just skip a single character.
It has to discard the match entirely if there is a # before it, yet it must still match this correctly: #Mode foo() (the foo() bit, disregarding the #Mode before it).
If there are any tips to help me out, that would be awesome!
Use
(?<![\w#])\w+\(\)
See regex proof.
EXPLANATION
(?<!     look behind to see if there is not:
[\w#]    any character of: word characters (a-z, A-Z, 0-9, _), '#'
)        end of look-behind
\w+      word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
\(       '('
\)       ')'
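To check the pattern against the examples from the question, here is a minimal sketch. It uses Python's re module as a stand-in for Swift's regex engine (an assumption; for this fixed-width lookbehind the two should agree), and the names CALL and samples are made up for the illustration. Note that the pattern as written only matches calls with empty parentheses.

```python
import re

# Pattern from the answer: a run of word characters followed by "()",
# not immediately preceded by another word character or by '#'.
CALL = re.compile(r"(?<![\w#])\w+\(\)")

# Sample inputs taken from the question.
samples = ["foo()", "arg: foo()", "(arg: foo()) {}", "#foo()", "#Mode foo()"]

for text in samples:
    # findall returns every non-overlapping match in the string
    print(repr(text), "->", CALL.findall(text))
```

Putting \w inside the lookbehind is what stops the engine from skipping the # and the f and then matching oo(): any start position inside a word is rejected, so the whole #foo() is discarded.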

Try/catch item with strange syntax

Strange syntax in this code fragment:
var result =
  try {
    Process(bl).!!
  } catch {
    case e: Exception =>
      log.error(s"Error on query: ${hql}\n")
      "Etc etc" + "Query: " + hql
  }
Why is there no separator such as , or ; after log.error(s"...")?
Does the catch clause return one value or two?
PS: is there a better guide than this one, covering all the Scala syntax alternatives?
Newline characters can terminate statements
semi ::= ‘;’ | nl {nl}
Scala is a line-oriented language where statements may be terminated
by semi-colons or newlines. A newline in a Scala source text is
treated as the special token “nl” ...
IMHO, the newline character \n is just as good a statement terminator as the semicolon ;. It may even have an advantage over ; in that it is invisible to humans, which means less code clutter. It might seem strange because it is invisible, but rest assured it is there, silently doing its job of delimiting statements. Perhaps it becomes less strange if we imagine it like so
1 + 42'\n' // separating with invisible character \n
1 + 42; // separating with visible character ;
Note that we must use semicolons when writing multiple statements on the same line
log.error(s"Error on query: ${hql}\n"); "Etc etc" + "Query: " + hql
Addressing the comment: AFAIU, your confusion stems from misunderstanding how pattern-matching anonymous functions and block expressions work. The handler function
case e: Exception =>
  log.error(s"Error on query: ${hql}\n")
  "Etc etc" + "Query: " + hql
is equivalent to something like
case e: Exception => {
  log.error(s"Error on query: ${hql}\n"); // side-effect statement that just logs an error
  "Etc etc" + "Query: " + hql // final expression becomes the value of the block; no return keyword is needed
}
Hence, "one block with two branches into it" is not the correct understanding; there is only a single code path through your particular function.

Passing delimiter as command line argument in scala and use it to split a string

I have a Scala program that takes "\t" as a command-line input.
Inside the program, I want to split a string on the delimiter passed from the command line.
val splitter = args(0).charAt(0)
if(splitter == '\t')
println("true")
else
println("false")
This prints "false", and splitter is "\".
The same approach works for a "," (comma) delimiter.
Please suggest how I can pass a tab, or any other delimiter, as a command-line parameter and use it for splitting.
It's because if you're passing "\t" in on the command line, then it's coming in as a two-character string \t, not a single-character tab. To do what you want, you can't just take the first character (charAt(0)) since you'll miss the t. Instead you'll have to unescape it by converting from the string \t to the tab character.
An easy way:
val splitter = args(0) match {
  case "\\t" => '\t'
  case x => x.head // same as x.charAt(0)
}
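The same idea extends to more than one escape sequence by using a lookup table. The sketch below is in Python rather than Scala, purely to illustrate the logic; the ESCAPES table and the parse_delimiter helper are hypothetical names, not part of any library.

```python
# Hypothetical table of two-character escapes as they arrive from the shell,
# mapped to the single character they stand for. Extend as needed.
ESCAPES = {r"\t": "\t", r"\n": "\n", r"\\": "\\"}

def parse_delimiter(arg: str) -> str:
    """Return the delimiter character described by a command-line argument."""
    if arg in ESCAPES:
        return ESCAPES[arg]
    # Otherwise fall back to the first character, as the original code did.
    return arg[0]

delimiter = parse_delimiter(r"\t")        # what the program sees for "\t"
print(delimiter == "\t")                  # True
print("a\tb\tc".split(delimiter))         # ['a', 'b', 'c']
```

Like x.head in the Scala version, arg[0] fails on an empty argument, so a real program would probably want to check for that first.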

Forcing gaps between words in a Marpa grammar

I'm trying to set up a grammar that requires that [\w] characters cannot appear directly adjacent to each other if they are not in the same lexeme. That is, words must be separated from each other by a space or punctuation.
Consider the following grammar:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This parses successfully. Now I want to change the grammar to force a separation between 9 and september. I thought of doing this by introducing an unused lexeme that matches [\w]+:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+ ### <== Add unused lexeme to match joined keywords
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
Unfortunately, this grammar fails with:
A lexeme is not accessible from the start symbol: word
Marpa::R2 exception at marpa.pl line 3.
Although this can be resolved by using a lexeme default statement:
use Marpa::R2; use Data::Dump;
my $grammar = Marpa::R2::Scanless::G->new({source => \<<'END_OF_GRAMMAR'});
lexeme default = action => [value] ### <== Fix exception by adding lexeme default statement
:start ::= Rule
Rule ::= '9' 'september'
:discard ~ whitespace
whitespace ~ [\s]+
word ~ [\w]+
END_OF_GRAMMAR
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
dd $recce->read(\'9september');
This results in the following output:
Inaccessible symbol: word
Error in SLIF parse: No lexemes accepted at line 1, column 1
* String before error:
* The error was at line 1, column 1, and at character 0x0039 '9', ...
* here: 9september
Marpa::R2 exception at marpa.pl line 16.
That is, the parse has failed due to the fact that there is no gap between 9 and september which is exactly what I want to happen. The only fly in the ointment is that there is an annoying Inaccessible symbol: word message on STDERR because the word lexeme is not used in the actual grammar.
I see that in Marpa::R2::Grammar I could have declared word as inaccessible_ok in the constructor options but I can't do that in Marpa::R2::Scanless.
I also could have done something like the following:
Rule ::= nine september
nine ~ word
september ~ word
then used a pause to use custom code to examine the actual lexeme value and return the appropriate lexeme depending on the value.
What is the best way to construct a grammar that uses keywords or numbers and words but will disallow adjacent lexemes to be run together without white space or punctuation separating them?
Well, the obvious solution is to require some whitespace in between (on the G1 level). When we use the following grammar
:default ::= action => ::array
:start ::= Rule
Rule ::= '9' (Ws) 'september'
Ws ::= [\s]+
:discard ~ whitespace
whitespace ~ [\s]+
then 9september fails, but 9 september is parsed. Important points to note:
Lexemes can be both discarded and required, when they are both a longest token. This is why the :discard and Ws rule don't interfere with each other. Marpa doesn't mind this kind of “ambiguity”.
The Ws rule is enclosed in parens, which discards the value – to keep the resulting parse tree clean.
You do not usually want to use tricks like phantom lexemes to misguide the parser. That way lies breakage.
When every bit of whitespace is important, you might want to get rid of :discard ~ whitespace. This is meant to be used e.g. for C-like languages where whitespace traditionally does not matter.

Laundering tainted data

When I launder tainted data by checking whether it contains any bad characters, are there Unicode properties that will filter out the bad characters?
User-Defined Character Properties in perlunicode
package Characters::Sid_com;
sub InBad {
return <<"BAD";
0000\t10FFFF
BAD
}
sub InEvil {
return <<"EVIL";
0488
0489
EVIL
}
sub InStupid {
return <<"STUPID";
E630\tE64F
F8D0\tF8FF
STUPID
}
⋮
die 'No.' if $string =~ /
(?: \p{Characters::Sid_com::InBad}
| \p{Characters::Sid_com::InEvil}
| \p{Characters::Sid_com::InStupid}
)
/x;
I think "no" is an understatement for an answer, but there you have it. No, Unicode does not have a concept of "bad" or "good" characters (let alone "ugly" ones).
XML (and thus XHTML) can only contain these characters:
\x09 \x0A \x0D
\x{0020}-\x{D7FF}
\x{E000}-\x{FFFD}
\x{10000}-\x{10FFFF}
Of the above, the following should be avoided:
\x7F-\x84
\x86-\x9F
\x{FDD0}-\x{FDEF}
\x{1FFFE}-\x{1FFFF}
\x{2FFFE}-\x{2FFFF}
\x{3FFFE}-\x{3FFFF}
\x{4FFFE}-\x{4FFFF}
\x{5FFFE}-\x{5FFFF}
\x{6FFFE}-\x{6FFFF}
\x{7FFFE}-\x{7FFFF}
\x{8FFFE}-\x{8FFFF}
\x{9FFFE}-\x{9FFFF}
\x{AFFFE}-\x{AFFFF}
\x{BFFFE}-\x{BFFFF}
\x{CFFFE}-\x{CFFFF}
\x{DFFFE}-\x{DFFFF}
\x{EFFFE}-\x{EFFFF}
\x{FFFFE}-\x{FFFFF}
\x{10FFFE}-\x{10FFFF}
If you are generating XHTML, you need to escape the following:
& ⇒ &amp;
< ⇒ &lt;
> ⇒ &gt; (optional)
" ⇒ &quot; (optional except in attribute values delimited with ")
' ⇒ &apos; (optional except in attribute values delimited with ')
HTML should have the same if not looser requirements, so if you stick to this, you should be safe.
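As a rough illustration of the two steps above (reject characters XML cannot contain, then escape the markup characters), here is a minimal Python sketch. The names INVALID_XML and to_xml_text are hypothetical; escape comes from the standard library's xml.sax.saxutils. The sketch checks only the "allowed" ranges, not the longer list of characters that should merely be avoided.

```python
import re
from xml.sax.saxutils import escape

# Matches any character outside the allowed XML ranges listed above.
INVALID_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def to_xml_text(s: str) -> str:
    """Reject characters XML cannot contain, then escape markup characters."""
    bad = INVALID_XML.search(s)
    if bad:
        raise ValueError(f"character U+{ord(bad.group()):04X} is not allowed in XML")
    # escape() handles &, < and >; the extra mapping covers both quote styles.
    return escape(s, {'"': "&quot;", "'": "&apos;"})

print(to_xml_text('a < b & "c"'))  # a &lt; b &amp; &quot;c&quot;
```

Raising an error is only one possible policy; stripping or replacing the offending characters would be an equally reasonable choice, depending on where the data comes from.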