Spray, Slick, Spark - OutOfMemoryError: PermGen space - scala

I have successfully implemented a simple web service using Spray that passes an incoming request through a Spark ML prediction pipeline. Everything was working fine until I tried to add a data layer. I chose Slick because it seems to be popular.
However, I can't quite get it to work right. I have been basing most of my code on the Hello-Slick Activator template. I use a DAO object like so:
object DataDAO {
  val datum = TableQuery[Datum]

  def dbInit = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum.schema.create
      )), Duration.Inf)
    } finally db.close
  }

  def insertData(data: Data) = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum += data,
        datum.result.map(println)
      )), Duration.Inf)
    } finally db.close
  }
}
case class Data(data1: String, data2: String)

class Datum(tag: Tag) extends Table[Data](tag, "DATUM") {
  def data1 = column[String]("DATA_ONE", O.PrimaryKey)
  def data2 = column[String]("DATA_TWO")
  def * = (data1, data2) <> (Data.tupled, Data.unapply)
}
I initialize my database in my Boot object:
object Boot extends App {
  implicit val system = ActorSystem("raatl-demo")

  Classifier.initializeData
  PredictionDAO.dbInit
  // More service initialization code ...
}
I try to add a record to my database before completing the service request:
val predictionRoute = {
  path("data") {
    get {
      parameter('q) { query =>
        // do Spark stuff to get prediction
        DataDAO.insertData(data)
        respondWithMediaType(`application/json`) {
          complete {
            DataJson(data1, data2)
          }
        }
      }
    }
  }
}
When I send a request to my service, my application crashes with:
java.lang.OutOfMemoryError: PermGen space
I suspect I'm implementing the Slick API incorrectly, but it's hard to tell from the documentation, because it stuffs all the operations into a main method.
Finally, my configuration is the same as in the Activator template:
h2mem1 = {
  url = "jdbc:h2:mem:raatl"
  driver = org.h2.Driver
  connectionPool = disabled
  keepAliveConnection = true
}
Has anyone encountered this before? I'm using Slick 3.1.

java.lang.OutOfMemoryError: PermGen space is normally not caused by how you use a particular API; here is what Oracle says about it:
The detail message PermGen space indicates that the permanent generation is full. The permanent generation is the area of the heap where class and method objects are stored. If an application loads a very large number of classes, then the size of the permanent generation might need to be increased using the -XX:MaxPermSize option.
I do not think this is caused by an incorrect implementation of the Slick API. It more likely happens because you are using multiple frameworks that load many classes.
Your options are:
Increase the permanent generation size with -XX:MaxPermSize (see the sbt sketch below)
Upgrade to Java 8, where PermGen is replaced by Metaspace, which is sized automatically
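If you launch the service through sbt, one way to apply the first option is to fork the run and pass the flags to the forked JVM. A minimal sketch for an sbt 0.13-style build on a pre-Java-8 JVM; the sizes are only illustrative:
// build.sbt (sketch): run in a forked JVM with a larger permanent generation
fork in run := true
javaOptions in run ++= Seq("-XX:PermSize=128m", "-XX:MaxPermSize=256m")
If you start a packaged build with a launcher script instead, the same flags usually go into that script's JVM options (for example JAVA_OPTS).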

Related

NullPointerException in Flink custom SourceFunction

I wanted to create a SourceFunction which reads an HTTP stream.
I used ScalaJ-HTTP, which does what I want (it splits the incoming text on newlines).
The code works outside Flink, but I get a NullPointerException every time I start it as a Flink job (sometimes immediately, sometimes 1-2 seconds after it has transmitted 1-2 elements). It kind of looks like the Http object has some problem.
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.io.Source.fromInputStream
import scalaj.http._

class HttpSource(url: String) extends SourceFunction[String] {
  @volatile var isRunning = true

  override def cancel(): Unit = isRunning = false

  override def run(ctx: SourceFunction.SourceContext[String]): Unit =
    httpStream(ctx.collect)

  private def httpStream(f: String => Unit) = {
    val request = Http(url)
    request
      .execute { inputStream =>
        fromInputStream(inputStream)
          .getLines()
          .takeWhile(_ => isRunning)
          .foreach(f)
      }
  }
}
Here's the exception I usually get (sometimes it's a bit different; for example, when I tried making the request value transient, it was already null by the time the code referred to request):
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
at scala.io.BufferedSource.reader(BufferedSource.scala:24)
at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:25)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader$lzycompute(BufferedSource.scala:35)
at scala.io.BufferedSource.scala$io$BufferedSource$$charReader(BufferedSource.scala:33)
at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:62)
at scala.io.BufferedSource$BufferedLineIterator.<init>(BufferedSource.scala:67)
at scala.io.BufferedSource.getLines(BufferedSource.scala:86)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:21)
at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:19)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:388)
at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:380)
at scala.Option.getOrElse(Option.scala:121)
at scalaj.http.HttpRequest.toResponse(Http.scala:380)
at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:360)
at scalaj.http.HttpRequest.exec(Http.scala:335)
at scalaj.http.HttpRequest.execute(Http.scala:323)
at flinkextension.HttpSource.httpStream(HttpSource.scala:19)
at flinkextension.HttpSource.run(HttpSource.scala:14)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
Everything seems to work fine when I don't use an HTTP request but something else instead, such as reading a file through the same InputStream type, a plain while loop producing strings, or even single, non-streaming HTTP requests.
I feel like I'm missing some theoretical background; maybe Flink does something in the background that destroys the Http object or the InputStream, but I didn't find anything in the documentation.
UPDATE #1:
If I put a null check into the lambda, the job usually exits immediately; sometimes it processes a few elements, and sometimes it times out after hanging for a minute. Here's this version of the httpStream function:
private def httpStream(f: String => Unit) = {
  val request = Http(url)
  request
    .execute { inputStream =>
      if (inputStream == null) println("null inputstream")
      else {
        println("not null inputstream")
        fromInputStream(inputStream)
          .getLines()
          .takeWhile(_ => isRunning)
          .foreach(f)
      }
    }
}
UPDATE #2:
The code actually works in distributed mode and with StreamExecutionEnvironment.createLocalEnvironment().
I only experience the issue when I use start-local.sh and submit the jar to it.

docker container based library to support elastic4s

I'm using elastic4s and I'm also interested in using a Docker-container-based testing environment for my Elasticsearch.
There are a few libraries, such as testcontainers-scala and docker-it-scala, but I can't find out how to integrate elastic4s with them. Has anyone used a Docker-container-based testing environment?
Currently my spec is very simple:
class ElasticSearchApiServiceSpec extends FreeSpec {
  implicit val defaultPatience = PatienceConfig(timeout = Span(100, Seconds), interval = Span(50, Millis))

  val configuration: Configuration = app.injector.instanceOf[Configuration]
  val elasticSearchApiService = new ElasticSearchApiService(configuration)

  override protected def beforeAll(): Unit = {
    elasticSearchApiService.elasticClient.execute {
      index into s"peopleIndex/person" doc StringDocumentSource(PeopleFactory.rawStringGoodPerson)
    }
    // since ES is eventually consistent, give the index time to catch up
    Thread.sleep(3000)
  }

  override protected def afterAll(): Unit = {
    elasticSearchApiService.elasticClient.execute {
      deleteIndex("peopleIndex")
    }
  }

  "ElasticSearchApiService Tests" - {
    "elastic search service should retrieve person info properly - case existing person" in {
      val personInfo = elasticSearchApiService.getPersonInfo("2324").futureValue
      personInfo.get.name shouldBe "john"
    }
  }
}
When I run it, I run Elasticsearch in the background from my terminal, but I want to use containers now so the tests are less dependent on my local setup.
I guess you don't want to depend on an ES server running on your local machine for the tests. Then the simplest approach is to use testcontainers-scala's GenericContainer to run the official ES Docker image like this:
class GenericContainerSpec extends FlatSpec with ForAllTestContainer {
  override val container = GenericContainer("docker.elastic.co/elasticsearch/elasticsearch:5.5.1",
    exposedPorts = Seq(9200),
    waitStrategy = Wait.forHttp("/")
  )

  "GenericContainer" should "start ES and expose 9200 port" in {
    assert(Source.fromInputStream(
      new URL(
        s"http://${container.containerIpAddress}:${container.mappedPort(9200)}/_status")
        .openConnection()
        .getInputStream)
      .mkString
      .contains("ES server is successfully installed"))
  }
}
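If you go this route, the only extra wiring is the testcontainers-scala dependency in the build. A rough sbt sketch; the version is only a placeholder, so check for the current release:
// build.sbt (sketch)
libraryDependencies += "com.dimafeng" %% "testcontainers-scala" % "0.20.0" % "test" // version is a placeholder
Your own spec can then point the elastic4s client at container.containerIpAddress and container.mappedPort(9200), just as the URL in the example above does.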

orientdb: how to fully shut down an in-memory database (Java/Scala API)

I'm trying to write some unit-test utilities for an OrientDB client in Scala.
The following is intended to take a function that operates on a DB and wrap it with code that creates and destroys the DB for a single unit test.
However, there doesn't seem to be much good documentation on how to clean up an in-memory DB (and looking at many open-source projects, people seem to simply leak databases and create new ones on a new port).
Simply calling db.close leaves the DB listening on a port and subsequent tests fail. Calling db.drop seems to work, but only if the function succeeded in adding data to the DB.
So, what cleanup is required in the finally clause?
@Test
def fTest2(): Unit = {
  def withJSONDBLoan(func: ODatabaseDocumentTx => Unit): Unit = {
    val db: ODatabaseDocumentTx = new ODatabaseDocumentTx("memory:jsondb")
    db.create()
    try {
      func(db)
    } finally {
      if (!db.isClosed) {
        db.close // Nope. DB is leaked.
      }
      // db.drop seems to close the DB, but I can't
      // see when it's safe to call it.
    }
  }

  val query1 = "insert into ouser set name='test',password='test', status='ACTIVE'"
  withJSONDBLoan { db =>
    db.command(new OCommandSQL(query1)).execute[ODocument]()
  }

  // Fails at create because the DB already exists.
  val query2 = "insert into ouser set name='test2',password='test2', status='ACTIVE'"
  withJSONDBLoan { db =>
    db.command(new OCommandSQL(query2)).execute[ODocument]()
  }
}
I tried your code and it worked for me.
Hope it helps.

Play reports that it can't get ClosableLazy value after it has been closed

I am trying to run a specification test in a Play/Scala/ReactiveMongo project. The setup is like this:
class FeaturesSpec extends Specification {
  "Features controller" should {
    "create feature from JSON request" in withMongoDb { app =>
      // do test
    }
  }
}
With MongoDbFixture as follows:
object MongoDBTestUtils {
  def withMongoDb[T](block: Application => T): T = {
    implicit val app = FakeApplication(
      additionalConfiguration = Map("mongodb.uri" -> "mongodb://localhost/unittests")
    )
    running(app) {
      def db = ReactiveMongoPlugin.db
      try {
        block(app)
      } finally {
        dropAll(db)
      }
    }
  }

  def dropAll(db: DefaultDB) = {
    Await.ready(Future.sequence(Seq(
      db.collection[JSONCollection]("features").drop()
    )), 2 seconds)
  }
}
When the tests run, the logs are pretty noisy and complain about a resource being already closed. Although the tests work correctly, this is weird and I would like to know why it occurs and how to fix it.
Error:
[info] application - ReactiveMongoPlugin stops, closing connections...
[warn] play - Error stopping plugin
java.lang.IllegalStateException: Can't get ClosableLazy value after it has been closed
at play.core.ClosableLazy.get(ClosableLazy.scala:49) ~[play_2.11-2.3.7.jar:2.3.7]
at play.api.libs.concurrent.AkkaPlugin.applicationSystem(Akka.scala:71) ~[play_2.11-2.3.7.jar:2.3.7]
at play.api.libs.concurrent.Akka$$anonfun$system$1.apply(Akka.scala:29) ~[play_2.11-2.3.7.jar:2.3.7]
at play.api.libs.concurrent.Akka$$anonfun$system$1.apply(Akka.scala:29) ~[play_2.11-2.3.7.jar:2.3.7]
at scala.Option.map(Option.scala:145) [scala-library-2.11.4.jar:na]
The exception means that you are using the ReactiveMongo plugin after the application has stopped.
You might want to try using Around:
class withMongoDb extends Around with Scope {
  val db = ReactiveMongoPlugin.db
  override def around[T: AsResult](t: => T): Result = try {
    val res = t
    AsResult.effectively(res)
  } finally {
    ...
  }
}
You should also take a look at Flapdoodle Embedded Mongo; with that you don't have to delete databases after testing, IIRC.
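If you want to try the embedded approach, the Flapdoodle artifact is added to the build roughly like this (coordinates from memory; the version is only a placeholder):
// build.sbt (sketch)
libraryDependencies += "de.flapdoodle.embed" % "de.flapdoodle.embed.mongo" % "1.50.5" % "test" // version is a placeholder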
This problem likely occurs because your test exercises code that references a closed MongoDB instance. After each Play Specs2 test runs, the MongoDB connection is reset, so your first test may pass while a subsequent test holds a stale reference to the closed instance and fails as a result.
One way to solve this issue is to ensure the following criteria are met in your application:
Avoid using val or lazy val for MongoDB database resources (see the sketch below)
(Re)initialize all database references on application start.
I wrote up a blog post that describes a solution to the problem within the context of a Play controller.
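To illustrate the first criterion, here is a minimal sketch; the repository name is hypothetical and the import paths assume the Play 2.3-era ReactiveMongo plugin already used in withMongoDb above:
import play.api.Application
import play.modules.reactivemongo.ReactiveMongoPlugin
import play.modules.reactivemongo.json.collection._

class FeatureRepository {
  // Avoid: a val is resolved once and cached, so it can outlive the
  // application that created it and end up pointing at a closed connection.
  // val db = ReactiveMongoPlugin.db

  // Prefer: a def re-resolves the database against the currently
  // running application on every call.
  def db(implicit app: Application) = ReactiveMongoPlugin.db
  def features(implicit app: Application) = db.collection[JSONCollection]("features")
}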

How to cache results in Scala?

This page has a description of how to use Map's getOrElseUpdate method:
object WithCache {
  val cacheFun1 = collection.mutable.Map[Int, Int]()
  def fun1(i: Int) = i * i
  def catchedFun1(i: Int) = cacheFun1.getOrElseUpdate(i, fun1(i))
}
So you can use catchedFun1, which will check whether cacheFun1 contains the key and, if so, return the value associated with it. Otherwise, it will invoke fun1, cache fun1's result in cacheFun1, and then return that result.
I can see one potential danger: cacheFun1 can become too large. Can cacheFun1 somehow be cleaned up by the garbage collector?
P.S. What about scala.collection.mutable.WeakHashMap and java.lang.ref.*?
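Regarding the P.S.: the same memoizer can be backed by scala.collection.mutable.WeakHashMap, which lets the garbage collector drop entries whose keys are no longer strongly referenced elsewhere. A minimal sketch mirroring the code above (note that with boxed Int keys eviction is unpredictable, so this only loosely bounds the cache):
object WithWeakCache {
  import scala.collection.mutable

  // Entries may be dropped by the GC once a key becomes weakly reachable.
  private val cacheFun1 = mutable.WeakHashMap[Int, Int]()

  def fun1(i: Int) = i * i
  def catchedFun1(i: Int) = cacheFun1.getOrElseUpdate(i, fun1(i))
}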
See the Memo pattern and the Scalaz implementation of said paper.
Also check out an STM implementation such as Akka's.
Note that this is only local caching, so you might want to look into a distributed cache or STM such as CCSTM, Terracotta, or Hazelcast.
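For the Scalaz Memo mentioned above, usage looks roughly like this; a sketch assuming Scalaz is on the classpath (there are also immutable-map and weak-hash-map backed variants):
import scalaz.Memo

// Memoized squaring function backed by a mutable HashMap.
val cachedFun1: Int => Int = Memo.mutableHashMapMemo[Int, Int] { i => i * i }

cachedFun1(3) // computes and caches 9
cachedFun1(3) // returns the cached value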
Take a look at spray-caching (super simple to use):
http://spray.io/documentation/1.1-SNAPSHOT/spray-caching/
It makes the job easy and has some nice features. For example:
import spray.caching.{LruCache, Cache}

// this is using Play for a controller example getting something from a user and caching it
object CacheExampleWithPlay extends Controller {
  // this will actually create an ExpiringLruCache and hold data for 48 hours
  val myCache: Cache[String] = LruCache(timeToLive = new FiniteDuration(48, HOURS))

  def putSomeThingInTheCache(@PathParam("getSomeThing") someThing: String) = Action {
    // put received data from the user in the cache
    myCache(someThing, () => future(someThing))
    Ok(someThing)
  }

  def checkIfSomeThingInTheCache(@PathParam("checkSomeThing") someThing: String) = Action {
    if (myCache.get(someThing).isDefined)
      Ok(s"just $someThing found this in the cache")
    else
      NotFound(s"$someThing NOT found this in the cache")
  }
}
On the Scala mailing list they sometimes point to the MapMaker in the Google Collections library. You might want to have a look at that.
For simple caching needs, I'm still using the Guava cache solution in Scala as well.
It's lightweight and battle-tested.
If it fits your requirements and constraints, generally outlined below, it could be a great option:
You are willing to spend some memory to improve speed.
You expect that keys will sometimes be queried more than once.
Your cache will not need to store more data than would fit in RAM. (Guava caches are local to a single run of your application. They do not store data in files or on outside servers.)
Using it looks something like this:
lazy val cachedData = CacheBuilder.newBuilder()
  .expireAfterWrite(60, TimeUnit.MINUTES)
  .maximumSize(10)
  .build(
    new CacheLoader[Key, Data] {
      def load(key: Key): Data = {
        veryExpansiveDataCreation(key)
      }
    }
  )
To read from it, you can use something like:
def getCachedData(keyToData: Key): Data = {
  try {
    cachedData.get(keyToData)
  } catch {
    case ee: Exception => throw new YourSpecialException(ee.getMessage)
  }
}
Since it hasn't been mentioned before, let me put on the table the lightweight spray-caching, which can be used independently from Spray and provides expected-size, time-to-live, and time-to-idle eviction strategies.
We are using Scaffeine (Scala + Caffeine); you can read about its pros and cons compared to other frameworks over here.
Add it to your sbt build:
"com.github.blemale" %% "scaffeine" % "4.0.1"
Build your cache:
import com.github.blemale.scaffeine.{Cache, Scaffeine}
import scala.concurrent.duration._

val cachedItems: Cache[String, Int] =
  Scaffeine()
    .recordStats()
    .expireAfterWrite(60.seconds)
    .maximumSize(500)
    .build[String, Int]()

cachedItems.put("key", 1)       // Add an item
cachedItems.getIfPresent("key") // Returns an Option