I am using this fork of LMDBjni: https://github.com/deephacks/lmdbjni to
form the backend of a medium-sized database project in Scala.
I've been hitting an EXCEPTION_ACCESS_VIOLATION (0xc0000005) in the JNI code for this LMDB binding, and would like to know whether there is anything obvious I'm doing wrong or what the next steps for debugging should be. I'm not exactly sure what I'm looking for, so I'm going to list as much information about what's happening as I can and hope the symptoms make sense to somebody.
The access violation occurs in a Database with a single key mapping to a single 8-byte value, on approximately the 4000th access to that database (the exact number appears to be the same with each run), suggesting that this is a deterministic problem.
I believe I only have one thread accessing the database at a time, and regardless, in my understanding, as the operation is wrapped in a transaction, concurrent accesses should not matter anyway.
By looking through stack traces and printing values, I traced the issue to this generic construction I wrote for building transactions.
My code that causes the issue is here, the crash occurs in the marked db.get() call:
def transactionalGetAndSet[A](
key: Key,
db: Database
)(
compute: A => LMDBEither[A]
)(
implicit sa: Storeable[A],
env: Env
): LMDBEither[A] = {
import org.fusesource.lmdbjni.Transaction
// get a new transaction
val tx: Transaction = instance.env.createWriteTransaction()
println("tx = " + tx + " id = " + tx.getId)
// get the key as an Array[Byte]. This is done by converting the key to a base64 string then converting that to bytes (so arbitrary objects can be made into keys)
val k = key.render
println("Key = " + key + " Rendered = " + new String(k))
// instantiate a result value, so there is something if it fails
var res: LMDBEither[A] = NoResult.left // initialise the result as a failure to begin with
try {
res = for { // This for construction chains together operations that return LMDBEithers into one LMDBEither
bytes <- LMDBEither(db.get(tx, k)) // error occurs in this Database.get() call
_ = println("bytes = " + bytes)
a <- sa.fromBytes(safeRetrieve(bytes)) // sa is effectively a marshaller/unmarshaller object which converts Vector[Byte] => LMDBEither[A]
_ = println("a = " + a)
res <- compute(a) // get the next value for the value at the key
_ = println("res = " + res)
_ <- LMDBEither(db.put(tx, k, sa.toBytes(res).toArray))
} yield a // effectively, if all these steps worked, res == Right(a)
res // return the result
} finally {
// Make sure you either commit or rollback to avoid resource leaks.
if (res.isRight) tx.commit() // if the result is not an error (ie Either.isRight is true)
else tx.abort()
tx.close()
}
}
Where LMDBEither[A] is an alias for Either[E, A] for a specific error type E, and LMDBEither(x) is a function that lifts an expression that might throw exceptions during execution into an LMDBEither, catching any exceptions.
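For readers without the project at hand, the lifting helper described above can be sketched with the standard library alone. The error type `LMDBError`, its `Caught` case, and the name `lift` are assumptions made for illustration; they are not the project's actual definitions.

```scala
import scala.util.control.NonFatal

object LMDBEitherSketch {
  // Hypothetical error type; the real project defines its own E.
  sealed trait LMDBError
  final case class Caught(cause: Throwable) extends LMDBError

  type LMDBEither[A] = Either[LMDBError, A]

  // Evaluate `x` lazily and turn any non-fatal exception into a Left.
  def lift[A](x: => A): LMDBEither[A] =
    try Right(x)
    catch { case NonFatal(e) => Left(Caught(e)) }

  def main(args: Array[String]): Unit = {
    println(lift(21 * 2))             // a successful computation
    println(lift("abc".toInt).isLeft) // NumberFormatException becomes a Left
  }
}
```

Note that a wrapper of this kind only sees JVM exceptions. A native crash such as EXCEPTION_ACCESS_VIOLATION terminates the JVM process itself, so it never reaches the catch block.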
The function safeRetrieve converts a possibly null Array[Byte] into a definitely non-null Vector[Byte], as follows:
private def safeRetrieve(bytes: Array[Byte]): Vector[Byte] =
Option(bytes).fold(Vector[Byte]()){ // if the array is null, convert to an empty vector, otherwise call the array's wrapper's vector
arr =>
println("Vector = " + arr.toVector)
arr.toVector
}
To the best of my knowledge, this does not modify the memory where the array is stored (LMDB's protected memory).
The values printed up to and including the crash are as follows:
tx = org.fusesource.lmdbjni.Transaction@391cec1f id = 15104
Key = Vector(Objects) Rendered = 84507411390877848991196161
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000018002453f, pid=10220, tid=7268
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# C [lmdbjni-64-0-7710432736670562378.4+0x2453f]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\dev\PartIIProject\hs_err_pid10220.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
This is further evidence that the mdb_get() call is what fails.
The full contents of the file referenced are here: https://pastebin.com/v6AmFBjq
Again, I would be extremely grateful for the small chance that anyone could point me in the right direction. What next steps should I be taking?
I am refactoring a Scala http4s application to remove some pesky side effects that cause my app to block. I'm replacing .unsafeRunSync with cats.effect.IO. The problem is as follows:
I have 2 lists: alreadyAccessible: IO[List[Page]] and pages: List[Page]
I need to filter out the pages that are not contained in alreadyAccessible.
Then I need to map over the resulting list to "grant access" in the database to these pages (e.g. call another method that hits the database and returns an IO[Page]).
val addable: List[Page] = pages.filter(p => !alreadyAccessible.contains(p))
val added: List[Page] = addable.map((p: Page) => {
pageModel.grantAccess(roleLst.head.id, p.id) match {
case Right(p) => p
}
})
This is close to what I want. However, it does not work because filter requires a function that returns a Boolean, but alreadyAccessible is of type IO[List[Page]], so I cannot call contains on the list inside it directly. I understand you can't take data out of the IO, so maybe I should transform it:
val added: List[IO[Page]] = for(page <- pages) {
val granted = alreadyAccessible.flatMap((aa: List[Page]) => {
if (!aa.contains(page))
pageModel.grantAccess(roleLst.head.id, page.id) match { case Right(p) => p }
else null
})
} yield granted
This unfortunately does not work and fails with the following error:
Error:(62, 7) ';' expected but 'yield' found.
} yield granted
I think it is because I am somehow misusing the for comprehension syntax, but I don't understand why I cannot do what I'm doing.
I know there must be a straightforward solution to such a problem, so any input or advice is greatly appreciated. Thank you for your time in reading this!
granted is going to be an IO[List[Page]]. There's no particular point in having IO inside anything else unless you truly are going to treat the actions like values and reorder them/filter them etc.
val granted: IO[List[Page]] = for {
How do you compute it? Well, the first step is to execute alreadyAccessible to get the actual list. In fact, alreadyAccessible is misnamed. It is not the list of accessible pages; it is an action that gets the list of accessible pages. I would recommend you rename it getAlreadyAccessible.
alreadyAccessible <- getAlreadyAccessible
Then you filter pages with it
required = pages.filterNot(alreadyAccessible.contains)
Now, I cannot decipher what you're doing to these pages. I'm just going to assume you have some kind of function grantAccess: Page => IO[Page]. If you map this function over required, you will get a List[IO[Page]], which is not desirable. Instead, we should traverse with grantAccess, which will produce an IO[List[Page]] that executes each IO[Page] and then assembles all the results into a List[Page].
granted <- required.traverse(grantAccess)
And we're done:
} yield granted
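Put together, the fragments above might look like the following sketch. It assumes cats-effect 3 and cats on the classpath; `Page`, the stubbed `getAlreadyAccessible`, and the stubbed `grantAccess` are hypothetical stand-ins for the database calls in the question, and `grantMissing` is a name chosen here for illustration.

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global // cats-effect 3 runtime, only needed to run the demo
import cats.syntax.all._

object GrantAccessSketch {
  final case class Page(id: Int)

  // Hypothetical stand-ins for the database calls in the question.
  val getAlreadyAccessible: IO[List[Page]] = IO.pure(List(Page(1), Page(2)))
  def grantAccess(page: Page): IO[Page] = IO(page)

  // Run the lookup, drop pages that are already accessible, then grant the rest.
  def grantMissing(pages: List[Page]): IO[List[Page]] =
    for {
      alreadyAccessible <- getAlreadyAccessible
      required = pages.filterNot(alreadyAccessible.contains)
      granted <- required.traverse(grantAccess)
    } yield granted

  def main(args: Array[String]): Unit =
    // Only the outermost layer of the program should run the IO.
    println(grantMissing(List(Page(1), Page(3), Page(4))).unsafeRunSync())
}
```

With the stubbed data, only pages 3 and 4 are passed to grantAccess. Nothing touches the database until the resulting IO is run.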
I have a Spark program which calculates relations between users, i.e. it receives a data set of type:
RDD[(java.lang.Long, Map[(String, String), Integer])]
where the Long is a timestamp and the map holds a score for each pair of users. It should run some function over the scores and return the following type:
Map[String, Map[java.lang.Long, java.lang.Double]]
where the String is the first String in the tuple, and the map holds the results of the function per timeslot.
In my case I have around 2000 users, so the maps I receive are quite big (2000^2 entries per timeslot), and the results also rely on the previous timeslot's results.
I am running the program locally and receiving GC overhead limit exceeded. I increased the heap memory to 14g using: -Xmx14G in vmarguments (I see the java process is occupying more than 12g of memory) but it didn't help.
Currently implemented method
I have tried several directions to decrease the memory consumption and have currently come up with the following idea: since every timeslot relies only on the previous one, I will collect every timeslot separately and keep the previous results on the driver. This way I will run calculations on only part of the data at a time, and hopefully it will not crash the program.
The code:
def calculateScorePerTimeslot(scorePerTimeslotRDD: RDD[(java.lang.Long, Map[(String, String), Integer])]): Map[String, Map[java.lang.Long, java.lang.Double]] = {
var distancesPerTimeslotVarRDD = distancesPerTimeslotRDD.groupBy(_._1).sortBy(_._1)
println("Start collecting all the results - cache the data!!")
distancesPerTimeslotVarRDD.cache()
println("Caching all the data has completed!")
while(!distancesPerTimeslotVarRDD.isEmpty())
{
val dataForTimeslot: (java.lang.Long, Iterable[(java.lang.Long, Map[(String, String), Integer])]) = distancesPerTimeslotVarRDD.first()
println("Retrieved data for timeslot: " + dataForTimeslot._1)
//Code which is irrelevant for question - logic
println("Removing timeslot: " + dataForTimeslot._1)
distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1))
println("Filtering has complete! - without: " + dataForTimeslot._1)
}
}
Summary: the idea is to extract one timeslot at a time, process it, and save the results on the driver. This way I try to reduce the amount of data that passes through collect.
Reason I write this post
Unfortunately, this doesn't help and the program still dies. My question is: does taking the first() item of an RDD and then filtering it out have the effect of iterating over all the items in the RDD? Are there better ways to tackle this kind of problem (ones that do not involve increasing the memory or moving to a real distributed cluster)?
Firstly, RDD[(java.lang.Long, Map[(String, String), Integer])] uses more memory than RDD[(java.lang.Long, Array[(String, String, Integer)])]. You'll save some memory if you can use the latter.
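As a rough illustration of the first point, one timeslot's map can be flattened into an array of triples with plain Scala. The object name `CompactScores` and the helper `toTriples` are hypothetical.

```scala
object CompactScores {
  // Flatten one timeslot's Map[(user1, user2), score] into an array of triples.
  def toTriples(scores: Map[(String, String), Integer]): Array[(String, String, Integer)] =
    scores.iterator.map { case ((u1, u2), s) => (u1, u2, s) }.toArray

  def main(args: Array[String]): Unit = {
    val slot: Map[(String, String), Integer] =
      Map(("alice", "bob") -> Integer.valueOf(3))
    println(toTriples(slot).mkString(", "))
  }
}
```

On the RDD itself this would presumably be applied per record, for example with mapValues(CompactScores.toTriples). Whether the saving is enough depends on how the map is used later, since the array gives up constant-time lookup by user pair.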
Secondly, your loop is pretty inefficient in caching data. Always call unpersist on any RDD you no longer need.
distancesPerTimeslotVarRDD.cache()
var rddSize = distancesPerTimeslotVarRDD.count()
println("Caching all the data has completed!")
while(rddSize > 0) {
val prevRDD = distancesPerTimeslotVarRDD
val dataForTimeslot = distancesPerTimeslotVarRDD.first()
println("Retrieved data for timeslot: " + dataForTimeslot._1)
// Code which is irrelevant for answer - logic
println("Removing timeslot: " + dataForTimeslot._1)
// Cache the new value of distancesPerTimeslotVarRDD
distancesPerTimeslotVarRDD = distancesPerTimeslotVarRDD.filter(t => !t._1.equals(dataForTimeslot._1)).cache()
// Force calculation so we can throw away previous iteration value
rddSize = distancesPerTimeslotVarRDD.count()
println("Filtering has complete! - without: " + dataForTimeslot._1)
// Get rid of previously cached RDD
prevRDD.unpersist(false)
}
Thirdly, you can try using the Kryo serializer, though this sometimes makes things worse. You have to configure the serializer and replace cache with persist(StorageLevel.MEMORY_ONLY_SER).
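A minimal sketch of that third suggestion, assuming a local Spark installation on the classpath. The application name, the `local[*]` master, and the tiny sample RDD are placeholders, not part of the original program.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object KryoCacheSketch {
  // Build a configuration that switches Spark's serializer to Kryo.
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster("local[*]") // placeholder: the question runs locally
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(kryoConf("kryo-cache-sketch"))
    try {
      // Placeholder data in the array-of-triples shape suggested above.
      val rdd = sc.parallelize(Seq((1L, Array(("alice", "bob", 3)))))
      // Store partitions as serialized bytes instead of deserialized objects.
      rdd.persist(StorageLevel.MEMORY_ONLY_SER)
      println(rdd.count()) // an action forces the RDD to be cached
    } finally sc.stop()
  }
}
```

Serialized storage trades CPU time for a smaller heap footprint, which is why it can help with GC pressure but may also slow the job down.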
Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application with something bigger (also in Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) by splitting its input into (for example) 10 part files, running 10 instances of it in memory, and joining the resulting 10 output part files back into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and writes the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so that it can read and write data directly without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
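The mapPartitions variant could be sketched as below, using only the standard library for the per-partition work. The helper name `runViaFiles` and the calling convention `command <inputFile> <outputFile>` are assumptions, since the question does not say how the program receives its file names. The demo uses `cp` as a stand-in tool, so it assumes a Unix-like system.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import scala.sys.process._

object ExternalToolSketch {
  // Write `lines` to a temp file, run `command <in> <out>`, and return the output lines.
  // The `<in> <out>` argument convention is an assumption about the external tool.
  def runViaFiles(lines: Iterator[String], command: Seq[String]): Iterator[String] = {
    val in: Path  = Files.createTempFile("part-in-", ".txt")
    val out: Path = Files.createTempFile("part-out-", ".txt")
    try {
      Files.write(in, lines.mkString("\n").getBytes(UTF_8))
      val exit = (command ++ Seq(in.toString, out.toString)).! // blocks until the tool exits
      if (exit != 0) sys.error(s"${command.mkString(" ")} exited with code $exit")
      val src = scala.io.Source.fromFile(out.toFile, "UTF-8")
      // Materialize the lines before the temp files are deleted.
      try src.getLines().toList.iterator finally src.close()
    } finally {
      Files.deleteIfExists(in)
      Files.deleteIfExists(out)
    }
  }

  def main(args: Array[String]): Unit =
    // `cp` stands in for the real tool here.
    runViaFiles(Iterator("first", "second"), Seq("cp")).foreach(println)
}
```

Inside Spark this would be called once per partition, along the lines of rdd.mapPartitions(part => ExternalToolSketch.runViaFiles(part, Seq("/path/to/tool"))), where the tool path is a placeholder and the program must be installed on every worker.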
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer
val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case(file, pds) => { // pds is a PortableDataStream
val rows = new ArrayBuffer[Array[String]]()
var errors = List[String]()
val io = new ProcessIO (
in => { // "in" is an OutputStream; write the encrypted contents of the
// input file (pds) to this stream
IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
in.close
},
out => { // "out" is an InputStream; read the decrypted data off this stream.
// Even though this runs in another thread, we can write to rows, since it
// is part of the closure for this function
for(line <- scala.io.Source.fromInputStream(out).getLines) {
// ...decode line here... for my data, it was pipe-delimited
rows += line.split('|')
}
out.close
},
err => { // "err" is an InputStream; read any errors off this stream
// errors is part of the closure for this function
errors = scala.io.Source.fromInputStream(err).getLines.toList
err.close
}
)
val cmd = List("/my/decryption/program", "--decrypt")
val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
println(s"-- Results for file $file:")
if (exitValue != 0) {
// TBD write to string accumulator instead, so driver can output errors
// string accumulator from @zero323: https://stackoverflow.com/a/31496694/215945
println(s"exit code: $exitValue")
errors.foreach(println)
} else {
// TBD, you'll probably want to move this code to the driver, otherwise
// unless you're using the shell, you won't see this output
// because it will be sent to stdout of the executor
println(s"row count: ${rows.size}")
if (showSampleRows) {
println("6 sample rows:")
rows.slice(0,6).foreach(row => println(" " + row.mkString("|")))
}
}
rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types
I have a function in Scala which has no return value (so it returns Unit). This function can sometimes fail (if the user-provided parameters are not valid). If I were writing Java, I would simply throw an exception. But in Scala (although the same thing is possible), it is suggested not to use exceptions.
I perfectly know how to use Option or Try, but they all only make sense if you have something valid to return.
For example, think of an (imaginary) addPrintJob(printJob: PrintJob): Unit command which adds a print job to a printer. The job definition could be invalid, and the user should be notified of this.
I see the following two alternatives:
Use exceptions anyway
Return something from the method (like a "print job identifier") and then return an Option/Either/Try of that type. But this means adding a return value just for the sake of error handling.
What are the best practices here?
You are too deep into FP :-)
You want to know whether the method is successful or not - return a Boolean!
According to the question "Throwing exceptions in Scala, what is the "official rule"", throwing exceptions in Scala is not advised because it breaks the control flow. In my opinion, you should throw an exception in Scala only when something significant has gone wrong and the normal flow should not continue.
For all other cases it is generally better to return the status/result of the operation that was performed. Scala's Option and Either serve this purpose. IMHO, a function which does not return any value is bad practice.
For the given example of addPrintJob, I would return a job identifier (as suggested by @marstran in the comments) or, if this is not possible, the status of addPrintJob.
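One way to return a status without inventing a payload is Either[String, Unit], where Left carries the reason for the failure. The sketch below is only an illustration: `PrintJob`, its fields, and the validation rules are made up, since the question only names the method.

```scala
object PrintQueueSketch {
  // Hypothetical job type; the question only names it.
  final case class PrintJob(document: String, copies: Int)

  private var queue = Vector.empty[PrintJob]

  // Left carries the reason for rejection; Right(()) means the job was queued.
  def addPrintJob(job: PrintJob): Either[String, Unit] =
    if (job.document.isEmpty) Left("document must not be empty")
    else if (job.copies < 1) Left(s"copies must be positive, got ${job.copies}")
    else { queue :+= job; Right(()) }

  def main(args: Array[String]): Unit =
    addPrintJob(PrintJob("report.pdf", 0)) match {
      case Left(reason) => println(s"rejected: $reason")
      case Right(())    => println("queued")
    }
}
```

The caller is forced by the type to look at the result, and several such calls can be chained in a for comprehension that stops at the first Left.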
The problem is that when you model a specific method, it is usually not just about success or failure (true or false, or 0 or 1 in the style of Unix exit codes), but about returning status info and a msg. Thus the simplest technique I use (whenever code review naysayers/dickheads/besserwissers are not around) is this:
var msg = "unknown error has occurred during ..."
var ret = 1 // defined in the beginning of the method, means "unknown error"
.... // action
ret = 0 // when you finally succeeded to implement FULLY what THIS method was supposed to do
msg = "" // you could say something like ok , but usually end-users are not interested in your ok msgs , they want the stuff to work ...
at the end always return a tuple
return ( ret , msg )
or if you have a data as well ( lets say a spark data frame )
return ( ret , msg , Some(df))
Using return is more obvious, although not required ( for the purists ) ...
Now because ret is just a stupid int, you could quickly turn more complex status codes into more complex Enums , objects or whatnot , but the point is that you should not introduce more complexity than it is needed into your code in the beginning , let it grow organically ...
and of course the caller would call like
val ( ret , msg , mayBeDf ) = myFancyFunc(someparam, etc)
Thus exceptions would mean truly error situations and you will avoid messy try catch jungles ...
I know this answer WILL GET down-voted, because there are too many guys from universities with bright resumes writing brilliant algos and stuff that ends up as the spaghetti code we are all sick of, rather than something as simple as possible but not simpler, and of course something that WORKS.
BUT, if you need only ok/nok control flow and chaining, here is a slightly more elaborate ok/nok example, which does really throw an exception that you would of course have to trap at an upper level, and which works for Spark:
/**
* a not so fancy way of failing asap, on first failing link in the control chain
* @return true if valid, false if not
*/
def isValid(): Boolean = {
val lst = List(
isValidForEmptyDF() _,
isValidForFoo() _,
isValidForBar() _
)
!lst.exists(!_()) // and fail asap ...
}
def isValidForEmptyDF()(): Boolean = {
val specsAreMatched: Boolean = true
try {
if (df.rdd.isEmpty) {
msg = "the file: " + uri + " is empty"
!specsAreMatched
} else {
specsAreMatched
}
} catch {
case jle: java.lang.UnsupportedOperationException => {
msg = msg + jle.getMessage
return false
}
case e: Exception => {
msg = msg + e.getMessage()
return false
}
}
}
Disclaimer: my colleague helped me with the fancy functions syntax ...
I've coded a parser based on Scala parser combinators:
class SxmlParser extends RegexParsers with ImplicitConversions with PackratParsers {
[...]
lazy val document: PackratParser[AstNodeDocument] =
((procinst | element | comment | cdata | whitespace | text)*) ^^ {
AstNodeDocument(_)
}
[...]
}
object SxmlParser {
def parse(text: String): AstNodeDocument = {
var ast = AstNodeDocument()
val parser = new SxmlParser()
val result = parser.parseAll(parser.document, new CharArrayReader(text.toArray))
result match {
case parser.Success(x, _) => ast = x
case parser.NoSuccess(err, next) => {
tool.die("failed to parse SXML input " +
"(line " + next.pos.line + ", column " + next.pos.column + "):\n" +
err + "\n" +
next.pos.longString)
}
}
ast
}
}
Usually the resulting parse error messages are rather nice. But sometimes the message is just
sxml: ERROR: failed to parse SXML input (line 32, column 1):
`"' expected but `' found
^
This happens if a quote character is not closed and the parser reaches the EOT. What I would like to see here is (1) which production the parser was in when it expected the '"' (I have multiple ones) and (2) where in the input this production started parsing (which indicates where the opening quote is in the input). Does anybody know how I can improve the error messages and include more information about the actual internal parsing state when the error happens (perhaps something like a production rule stack trace, or whatever can reasonably be given here to better identify the error location)? BTW, the above "line 32, column 1" is actually the EOT position and hence of no use here, of course.
I don't know yet how to deal with (1), but I was also looking for (2) when I found this webpage:
https://wiki.scala-lang.org/plugins/viewsource/viewpagesrc.action?pageId=917624
I'm just copying the information:
A useful enhancement is to record the input position (line number and column number) of the significant tokens. To do this, you must do three things:
Make each output type extend scala.util.parsing.input.Positional
Invoke the Parsers.positioned() combinator
Use a text source that records line and column positions
and
Finally, ensure that the source tracks positions. For streams, you can simply use scala.util.parsing.input.StreamReader; for Strings, use scala.util.parsing.input.CharArrayReader.
I'm currently playing with it, so I'll try to add a simple example later.
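A minimal sketch of those three steps, assuming the scala-parser-combinators module is on the classpath. The `Word` node and the tiny grammar are made up for illustration and are unrelated to the SXML parser in the question.

```scala
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.input.Positional

// Step 1: the output type extends Positional, which adds a mutable `pos` field.
final case class Word(text: String) extends Positional

object WordParser extends RegexParsers {
  // Step 2: wrap the production in positioned() so each node records where it started.
  def word: Parser[Word] = positioned("""[a-z]+""".r ^^ (s => Word(s)))
  def words: Parser[List[Word]] = rep(word)

  def main(args: Array[String]): Unit =
    // Step 3: parsing from a String uses a reader that tracks line and column.
    parseAll(words, "ab\n cd") match {
      case Success(ws, _) =>
        ws.foreach(w => println(s"${w.text} at line ${w.pos.line}, column ${w.pos.column}"))
      case other => println(other)
    }
}
```

Once every AST node carries its start position, an error raised while parsing a child can mention where the enclosing node began.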
In such cases you may use err, failure and ~! with production rules designed specifically to match the error.
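As a hedged illustration of the err approach, the sketch below parses a double-quoted string and, when the closing quote is missing, reports where the literal was opened rather than only the end-of-input position. It assumes the scala-parser-combinators module; the grammar and the names `quoted` and `body` are invented for the example and are not taken from the SXML parser.

```scala
import scala.util.parsing.combinator.RegexParsers

object QuotedStringSketch extends RegexParsers {
  // Keep whitespace significant so positions refer to the quote itself.
  override def skipWhitespace: Boolean = false

  private def body: Parser[String] = """[^"]*""".r

  def quoted: Parser[String] = Parser { in =>
    val start = in.pos // remember where this production began
    // If the closing quote is missing, raise a non-backtracking error
    // that names the production and its starting position.
    val closing = literal("\"") | err(
      s"unterminated string literal opened at line ${start.line}, column ${start.column}")
    (literal("\"") ~> body <~ closing)(in)
  }

  def main(args: Array[String]): Unit =
    println(parseAll(quoted, "\"never closed"))
}
```

The same pattern of capturing in.pos at the start of a production could be reused for other bracketed constructs, which covers both the production name and its starting position in the message.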