How to further improve error messages in Scala parser-combinator based parsers? - scala

I've coded a parser based on Scala parser combinators:
class SxmlParser extends RegexParsers with ImplicitConversions with PackratParsers {
    [...]
    lazy val document: PackratParser[AstNodeDocument] =
        ((procinst | element | comment | cdata | whitespace | text)*) ^^ {
            AstNodeDocument(_)
        }
    [...]
}
object SxmlParser {
    def parse(text: String): AstNodeDocument = {
        var ast = AstNodeDocument()
        val parser = new SxmlParser()
        val result = parser.parseAll(parser.document, new CharArrayReader(text.toArray))
        result match {
            case parser.Success(x, _) => ast = x
            case parser.NoSuccess(err, next) => {
                tool.die("failed to parse SXML input " +
                    "(line " + next.pos.line + ", column " + next.pos.column + "):\n" +
                    err + "\n" +
                    next.pos.longString)
            }
        }
        ast
    }
}
Usually the resulting parsing error messages are rather nice. But sometimes all I get is
sxml: ERROR: failed to parse SXML input (line 32, column 1):
`"' expected but `' found
^
This happens if a quote character is not closed and the parser reaches the end of the text (EOT). What I would like to see here is (1) which production the parser was in when it expected the '"' (I have multiple ones) and (2) where in the input this production started parsing (which would indicate where the opening quote is in the input). Does anybody know how I can improve the error messages to include more information about the actual internal parsing state when the error happens (perhaps something like a production-rule stack trace, or whatever can reasonably be given here to better identify the error location)? BTW, the above "line 32, column 1" is actually the EOT position and hence of no use here, of course.

I don't know yet how to deal with (1), but I was also looking for (2) when I found this webpage:
https://wiki.scala-lang.org/plugins/viewsource/viewpagesrc.action?pageId=917624
I'm just copying the information:
A useful enhancement is to record the input position (line number and column number) of the significant tokens. To do this, you must do three things:
1. Make each output type extend scala.util.parsing.input.Positional
2. Invoke the Parsers.positioned() combinator
3. Use a text source that records line and column positions
and
Finally, ensure that the source tracks positions. For streams, you can simply use scala.util.parsing.input.StreamReader; for Strings, use scala.util.parsing.input.CharArrayReader.
I'm currently playing with it so I'll try to add a simple example later
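In the meantime, here is a minimal sketch of those three steps; the node type and the grammar are simplified placeholders, not the actual SXML rules:

import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.Positional

// Simplified, illustrative AST node; extending Positional (step 1) gives it
// a `pos` field that the parser can fill in.
case class TextNode(value: String) extends Positional

class PositionedParser extends RegexParsers {
    // positioned() (step 2) records the start position of the match in the
    // resulting node, so errors can later point at where the production began.
    def text: Parser[TextNode] =
        positioned("""[^<]+""".r ^^ TextNode)
}

object PositionedParserDemo extends App {
    val parser = new PositionedParser
    parser.parseAll(parser.text, "hello world") match {
        case parser.Success(node, _) => println(s"parsed ${node.value} at ${node.pos}")
        case failure                 => println(failure)
    }
}

Step 3 is covered automatically here, because parseAll on a String uses a reader that tracks line and column positions.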

In such cases you may use err, failure and ~! with production rules designed specifically to match the error.
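For instance, a sketch of a quoted-string production (the names are illustrative, not from the question's grammar): ~! commits the parser once the opening quote has matched, and err replaces the generic message when the closing quote is missing.

import scala.util.parsing.combinator.RegexParsers

class QuotedParser extends RegexParsers {
    // Once the opening quote matches, ~! forbids backtracking, so a missing
    // closing quote is reported by err() instead of a generic failure.
    def quoted: Parser[String] =
        "\"" ~! ("""[^"]*""".r <~ ("\"" | err("unclosed string literal (missing closing quote)"))) ^^ {
            case _ ~ body => body
        }
}

object QuotedParserDemo extends App {
    val p = new QuotedParser
    println(p.parseAll(p.quoted, "\"hello\"")) // parses successfully
    println(p.parseAll(p.quoted, "\"hello"))   // reports the custom error
}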

Related

Pre-compile textual replacement macro with arguments

I am trying to create some kind of a dut_error wrapper: something that will take some arguments and construct them in a specific way into a dut_error.
I can't use a method to replace the calls to dut_error because, to my understanding, after check that ... then ... else only a dut_error (or dut_errorf) can come. And indeed, if I try to do something like:
my_dut_error(arg1: string, arg2: string) is {
    dut_error("first argument is ", arg1, " and second argument is ", arg2);
};
check that FALSE else my_dut_error("check1", "check2");
I get an error:
*** Error: Unrecognized exp
[Unrecognized expression 'FALSE else my_dut_error("check1", "check2")']
at line x in main.e
check that FALSE else my_dut_error("check1", "check2");
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
So I thought about defining a macro to simply do a textual replace from my wrapper to an actual dut_error:
define <my_dut_error'exp> "my_dut_error(<arg1'name>, <arg2'name>)" as {
    dut_error("first argument is ", <arg1'name>, " and second argument is ", <arg2'name>)
};
But got the same error.
Then I read about the preprocessor directive #define so tried:
#define my_dut_error(arg1, arg2) dut_error("first argument is ", arg1, " and second argument is ", arg2)
But that just gave a syntax error.
How can I define a pre-compiled textual replacement macro that takes arguments, similar to C?
The reason I want to do that is to achieve some sort of an "interface" to the dut_error so all errors have a consistent structure. This way, different people writing different errors will only pass the arguments necessary by that interface and internally an appropriate message will be created.
Not sure I understood what you want to do in the wrapper, but perhaps you can achieve what you want by using the dut_error_struct.
It has a set of APIs which you can use as hooks (do something when the error is caught) and to query about the specific error.
For example:
extend dut_error_struct {
    pre_error() is also {
        if source_method_name() == "post_generate" and
           source_struct() is a BLUE packet {
            out("\nProblem in generation? ", source_location());
            // do something for error during generation
        };
    };
    write() is first {
        if get_message() ~ "AHB Error..." {
            ahb_monitor::increase_errors();
        };
    };
};
dut_error accepts one parameter, a single string, but you can decide on a "separator" that defines two parts to the message.
E.g. instruct people to write "XXX" in the message, between the "first arg" and the "second arg":
check that legal else dut_error("ONE thing", "XXX", "another thing");
check that x < 7 else dut_error("failure ", "XXX", "of x not 7 but is ", x);
extend dut_error_struct {
    write() is first {
        var message_parts := str_split(get_message(), "XXX");
        if message_parts.size() == 2 {
            out("First part of message is ", message_parts[0],
                "\nand second part of message is ", message_parts[1]
            );
        };
    };
};
I could get pretty close to what I want using the dut_errorf method combined with a preprocessor directive defining the format string:
#define DUT_FORMAT "first argument is %s and second argument is %s"
check that FALSE else dut_errorf(DUT_FORMAT, "check1", "check2");
but I would still prefer a way that doesn't require this DUT_FORMAT directive and instead uses dut_error_struct or something similar.

Where does a variable in a match arm in a loop come from?

I am trying to implement an HTTP client in Rust using this as a starting point. I was sent to this link by the rust-lang.org site via one of the Rust by Example suggestions on their TcpStream page. I'm figuring out how to read from a TcpStream and am trying to follow this code:
fn handle_client(mut stream: TcpStream) {
    // read 20 bytes at a time from stream echoing back to stream
    loop {
        let mut read = [0; 1028];
        match stream.read(&mut read) {
            Ok(n) => {
                if n == 0 {
                    // connection was closed
                    break;
                }
                stream.write(&read[0..n]).unwrap();
            }
            Err(err) => {
                panic!(err);
            }
        }
    }
}
Where does the n variable come from? What exactly is it? The author says it reads 20 bytes at a time; where is this coming from?
I haven't really tried anything yet because I want to understand before I do.
I strongly encourage you to read the documentation for the tools you use. In this case, The match Control Flow Operator from The Rust Programming Language explains what you need to know.
From the Patterns that Bind to Values section:
In the match expression for this code, we add a variable called state to the pattern that matches values of the variant Coin::Quarter. When a Coin::Quarter matches, the state variable will bind to the value of that quarter’s state. Then we can use state in the code for that arm, like so:
fn value_in_cents(coin: Coin) -> u8 {
    match coin {
        Coin::Penny => 1,
        Coin::Nickel => 5,
        Coin::Dime => 10,
        Coin::Quarter(state) => {
            println!("State quarter from {:?}!", state);
            25
        },
    }
}
If we were to call value_in_cents(Coin::Quarter(UsState::Alaska)), coin would be Coin::Quarter(UsState::Alaska). When we compare that value with each of the match arms, none of them match until we reach Coin::Quarter(state). At that point, the binding for state will be the value UsState::Alaska. We can then use that binding in the println! expression, thus getting the inner state value out of the Coin enum variant for Quarter.
There is an entire chapter about the pattern matching syntax available and where it can be used.
Figured it out; this is what's happening:
match stream.read(&mut read) {
This line evaluates stream.read(&mut read) and, when the result is Ok, binds the number of bytes read to n (stream.read returns the number of bytes read). I'm still not sure why the author says 20 bytes are read at a time.

Next steps for debugging lmdbjni access violation

I am using this fork of LMDBjni: https://github.com/deephacks/lmdbjni to form the backend of a medium-sized database project in Scala.
I've been hitting an EXCEPTION_ACCESS_VIOLATION (0xc0000005) in the JNI code for this LMDB, and would like to know whether there is anything obvious I'm doing wrong or what the next steps for debugging should be. I'm not exactly sure what I'm looking for, so I'm going to list as much information about what's happening as I can and hope the symptoms make sense to somebody.
The access violation occurs in a Database with a single key mapping to a single 8-byte value, on approximately the 4000th access to that database (the exact number appears to be the same with each run), suggesting that this is a deterministic problem.
I believe I only have one thread accessing the database at a time, and regardless, in my understanding, as the operation is wrapped in a transaction, concurrent accesses should not matter anyway.
By looking through stack traces and printing values, I have traced the issue to this generic construction I wrote for building transactions.
My code that causes the issue is here; the crash occurs in the marked db.get() call:
def transactionalGetAndSet[A](
    key: Key,
    db: Database
)(
    compute: A => LMDBEither[A]
)(
    implicit sa: Storeable[A],
    env: Env
): LMDBEither[A] = {
    import org.fusesource.lmdbjni.Transaction
    // get a new transaction
    val tx: Transaction = instance.env.createWriteTransaction()
    println("tx = " + tx + " id = " + tx.getId)
    // get the key as an Array[Byte]. This is done by converting the key to a
    // base64 string and then converting that to bytes (so arbitrary objects
    // can be made into keys)
    val k = key.render
    println("Key = " + key + " Rendered = " + new String(k))
    // instantiate a result value, so there is something if it fails
    var res: LMDBEither[A] = NoResult.left // initialise the result as a failure to begin with
    try {
        res = for { // this for comprehension chains together operations that return LMDBEithers into one LMDBEither
            bytes <- LMDBEither(db.get(tx, k)) // error occurs in this Database.get() call
            _ = println("bytes = " + bytes)
            a <- sa.fromBytes(safeRetrieve(bytes)) // sa is effectively a marshaller/unmarshaller object which converts Vector[Byte] => LMDBEither[A]
            _ = println("a = " + a)
            res <- compute(a) // get the next value for the value at the key
            _ = println("res = " + res)
            _ <- LMDBEither(db.put(tx, k, sa.toBytes(res).toArray))
        } yield a // effectively, if all these steps worked, res == Right(a)
        res // return the result
    } finally {
        // make sure you either commit or abort, to avoid resource leaks
        if (res.isRight) tx.commit() // if the result is not an error (ie Either.isRight is true)
        else tx.abort()
        tx.close()
    }
}
Where LMDBEither[A] is an alias for Either[E, A] for a specific error type E, and LMDBEither(x) is a function that lifts an expression that might throw exceptions during execution into an LMDBEither, catching any exceptions.
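For reference, a simplified sketch of that machinery; the error type E and its cases here are illustrative stand-ins, not my actual definitions:

// Illustrative stand-ins for the real error type E and its cases.
sealed trait E
case object NoResult extends E
final case class CaughtException(t: Throwable) extends E

object LMDBEitherSketch {
    type LMDBEither[A] = Either[E, A]

    // Lift an expression that may throw into an LMDBEither, catching exceptions.
    def LMDBEither[A](a: => A): LMDBEither[A] =
        try Right(a)
        catch { case e: Exception => Left(CaughtException(e)) }
}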
The function safeRetrieve converts a possibly-null Array[Byte] into a definitely-not-null Vector[Byte], as follows:
private def safeRetrieve(bytes: Array[Byte]): Vector[Byte] =
    Option(bytes).fold(Vector[Byte]()) { // if the array is null, convert to an empty vector, otherwise convert the array to a vector
        arr =>
            println("Vector = " + arr.toVector)
            arr.toVector
    }
To the best of my knowledge, this does not modify the memory where the array is stored (LMDB's protected memory)
The values printed up to and including the crash are as follows:
tx = org.fusesource.lmdbjni.Transaction@391cec1f id = 15104
Key = Vector(Objects) Rendered = 84507411390877848991196161
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000018002453f, pid=10220, tid=7268
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C [lmdbjni-64-0-7710432736670562378.4+0x2453f]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\dev\PartIIProject\hs_err_pid10220.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
This is more evidence that the mdb_get() call is what fails.
the full contents of the file referenced are here: https://pastebin.com/v6AmFBjq
Again, I would be extremely grateful for the small chance that anyone could point me in the right direction. What next steps should I be taking?

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is represented by a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (also in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) just by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and re-joining the final 10 part files of output into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly, without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output back, as in the sketch below.
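A rough sketch of that mapPartitions approach, assuming rdd is an RDD[String] and "/path/to/your_program" stands in for the real executable (both are assumptions, not part of the original answer):

import java.io.File
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

// For each partition: write the records to a temp file, run the external
// program on it, and stream the program's output file back as records.
val processed = rdd.mapPartitions { iter =>
    val inFile = File.createTempFile("part-in", ".txt")
    val outFile = File.createTempFile("part-out", ".txt")
    Files.write(inFile.toPath, iter.toSeq.asJava)
    // run: your_program <input> <output>; "!" blocks until the process exits
    val exitCode = Seq("/path/to/your_program", inFile.getPath, outFile.getPath).!
    if (exitCode != 0) sys.error(s"external program failed with exit code $exitCode")
    val lines = Files.readAllLines(outFile.toPath).asScala.toList
    inFile.delete(); outFile.delete()
    lines.iterator
}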
If you end up here based on the question title from a Google search, but you don't have the OP's restriction that the external program needs to read from a file (i.e., if your external program can read from stdin), here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer

val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap { case (file, pds) => { // pds is a PortableDataStream
    val rows = new ArrayBuffer[Array[String]]()
    var errors = List[String]()
    val io = new ProcessIO(
        in => { // "in" is an OutputStream; write the encrypted contents of the
                // input file (pds) to this stream
            IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
            in.close
        },
        out => { // "out" is an InputStream; read the decrypted data off this stream.
                 // Even though this runs in another thread, we can write to rows, since it
                 // is part of the closure for this function
            for (line <- scala.io.Source.fromInputStream(out).getLines) {
                // ...decode line here... for my data, it was pipe-delimited
                rows += line.split('|')
            }
            out.close
        },
        err => { // "err" is an InputStream; read any errors off this stream
                 // errors is part of the closure for this function
            errors = scala.io.Source.fromInputStream(err).getLines.toList
            err.close
        }
    )
    val cmd = List("/my/decryption/program", "--decrypt")
    val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
    println(s"-- Results for file $file:")
    if (exitValue != 0) {
        // TBD write to string accumulator instead, so driver can output errors
        // string accumulator from @zero323: https://stackoverflow.com/a/31496694/215945
        println(s"exit code: $exitValue")
        errors.foreach(println)
    } else {
        // TBD, you'll probably want to move this code to the driver, otherwise
        // unless you're using the shell, you won't see this output
        // because it will be sent to stdout of the executor
        println(s"row count: ${rows.size}")
        if (showSampleRows) {
            println("6 sample rows:")
            rows.slice(0, 6).foreach(row => println(" " + row.mkString("|")))
        }
    }
    rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
6 sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

How to indicate a failure for a function with a void result

I have a function in Scala which has no return value (so Unit). This function can sometimes fail (if the user-provided parameters are not valid). If I were on Java, I would simply throw an exception. But in Scala (although the same thing is possible), it is suggested not to use exceptions.
I perfectly know how to use Option or Try, but they only make sense if you have something valid to return.
For example, think of an (imaginary) addPrintJob(printJob: PrintJob): Unit command which adds a print job to a printer. The job definition could be invalid, and the user should be notified of this.
I see the following two alternatives:
Use exceptions anyway
Return something from the method (like a "print job identifier") and then return a Option/Either/Try of that type. But this means adding a return value just for the sake of error handling.
What are the best practices here?
You are too deep into FP :-)
You want to know whether the method is successful or not - return a Boolean!
According to "Throwing exceptions in Scala, what is the official rule?", throwing exceptions in Scala is not advised because it breaks the control flow. In my opinion you should throw an exception in Scala only when something significant has gone wrong and normal flow should not continue.
For all other cases it is generally better to return the status/result of the operation that was performed. Scala's Option and Either serve this purpose. IMHO, a function which does not return any value is bad practice.
For the given example of addPrintJob I would return a job identifier (as suggested by @marstran in the comments); if this is not possible, the status of addPrintJob, as sketched below.
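A minimal sketch of returning a job identifier via Try; PrintJob, its validity rule, and JobId are illustrative assumptions:

import scala.util.{Failure, Success, Try}

// Illustrative types; the real PrintJob and its validity rule would come
// from the actual application.
final case class PrintJob(pages: Int)
final case class JobId(value: Long)

object Printer {
    private var counter = 0L

    // Returning Try[JobId] gives the caller both a failure channel and a
    // value that is useful in its own right.
    def addPrintJob(job: PrintJob): Try[JobId] =
        if (job.pages > 0) { counter += 1; Success(JobId(counter)) }
        else Failure(new IllegalArgumentException(s"invalid print job: $job"))
}

A caller can then pattern match on the result:

Printer.addPrintJob(PrintJob(0)) match {
    case Success(id) => println(s"queued as $id")
    case Failure(e)  => println(s"rejected: ${e.getMessage}")
}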
The problem is that usually, when you have to model things for a specific method, it is not about success or failure (true or false), or (0 or 1, Unix-exit-code wise), or (0 or 1, true/false-interpolation wise), but about returning status info and a message. Thus the simplest technique I use (whenever code-review naysayers/besserwissers are not around) is this:
var msg = "unknown error has occurred during ..."
var ret = 1 // defined at the beginning of the method, means "unknown error"
... // action
ret = 0 // when you finally succeed in doing FULLY what THIS method was supposed to do
msg = "" // you could say something like "ok", but usually end users are not interested in your ok msgs, they want the stuff to work ...
at the end, always return a tuple:
return (ret, msg)
or, if you have data as well (let's say a Spark data frame):
return (ret, msg, Some(df))
Using return is more obvious, although not required (for the purists) ...
Now, because ret is just a stupid int, you could quickly turn more complex status codes into Enums, objects or whatnot, but the point is that you should not introduce more complexity than is needed into your code in the beginning; let it grow organically (a sketch of such growth follows the caller example below) ...
and of course the caller would call it like:
val (ret, msg, maybeDf) = myFancyFunc(someParam, etc)
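If the plain int status ever needs to grow, here is a sketch of what it might grow into (names illustrative, keeping the tuple-returning calling pattern):

sealed trait Status
case object Ok extends Status
final case class UnknownError(msg: String) extends Status
final case class ValidationError(field: String, msg: String) extends Status

// Same shape as before, but the first tuple element is now an ADT instead
// of a bare int.
def myFancierFunc(someParam: String): (Status, String) =
    if (someParam.nonEmpty) (Ok, "ok")
    else (ValidationError("someParam", "must not be empty"), "validation failed")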
Thus exceptions would mean truly error situations and you will avoid messy try catch jungles ...
I know this answer WILL GET down-voted, because, well, there are too many guys from universities with however-bright resumes writing whatever brilliant algos and stuff, ending up in the spaghetti code we are all sick of, and not something as simple as possible but not simpler, and of course something that WORKS.
BUT, if you need only ok/nok control flow and chaining, here is a bit more elaborate ok/nok example, which does really throw an exception that you would of course have to trap at an upper level, and which works for Spark:
/**
 * a not so fancy way of failing asap, on the first failing link in the control chain
 * @return true if valid, false if not
 */
def isValid(): Boolean = {
    val lst = List(
        isValidForEmptyDF() _,
        isValidForFoo() _,
        isValidForBar() _
    )
    !lst.exists(!_()) // and fail asap ...
}

// df, uri and msg below are members of the enclosing class (not shown)
def isValidForEmptyDF()(): Boolean = {
    val specsAreMatched: Boolean = true
    try {
        if (df.rdd.isEmpty) {
            msg = "the file: " + uri + " is empty"
            !specsAreMatched
        } else {
            specsAreMatched
        }
    } catch {
        case jle: java.lang.UnsupportedOperationException => {
            msg = msg + jle.getMessage
            return false
        }
        case e: Exception => {
            msg = msg + e.getMessage()
            return false
        }
    }
}
Disclaimer: my colleague helped me with the fancy functions syntax ...