How to cache results in Scala?

This page has a description of how to use Map's getOrElseUpdate method:
object WithCache {
  val cacheFun1 = collection.mutable.Map[Int, Int]()
  def fun1(i: Int) = i * i
  def cachedFun1(i: Int) = cacheFun1.getOrElseUpdate(i, fun1(i))
}
So you can use cachedFun1, which will check whether cacheFun1 contains the key and return the value associated with it. Otherwise, it will invoke fun1, cache fun1's result in cacheFun1, and return that result.
I can see one potential danger: cacheFun1 can become too large. So cacheFun1 must be cleaned up somehow, e.g. by the garbage collector?
P.S. What about scala.collection.mutable.WeakHashMap and java.lang.ref.*?

See the Memo pattern and the Scalaz implementation of it.
Also check out an STM implementation such as Akka's.
Note that this is only local caching, so you might want to look into a distributed cache or STM such as CCSTM, Terracotta, or Hazelcast.
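For the WeakHashMap question above, a minimal sketch of the memo idea (the memoize helper is illustrative, not the Scalaz implementation):
import scala.collection.mutable

// WeakHashMap holds its KEYS weakly, so the GC may drop entries once the key
// is no longer referenced elsewhere. Caveat: small boxed Ints are interned by
// the JVM, so an Int-keyed weak cache may never actually shrink.
def memoize[K, V](f: K => V): K => V = {
  val cache = mutable.WeakHashMap.empty[K, V]
  k => cache.getOrElseUpdate(k, f(k))
}

val fun1 = memoize((i: Int) => i * i)
fun1(4) // computed and cached
fun1(4) // served from the cache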

Take a look at spray-caching (super simple to use):
http://spray.io/documentation/1.1-SNAPSHOT/spray-caching/
It makes the job easy and has some nice features. For example:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import spray.caching.{Cache, LruCache}

// this example uses Play for a controller, getting something from a user and caching it
object CacheExampleWithPlay extends Controller {

  // this will actually create an ExpiringLruCache that holds data for 48 hours
  val myCache: Cache[String] = LruCache(timeToLive = 48.hours)

  def putSomeThingInTheCache(someThing: String) = Action {
    // put the data received from the user into the cache
    myCache(someThing, () => Future.successful(someThing))
    Ok(someThing)
  }

  def checkIfSomeThingInTheCache(someThing: String) = Action {
    if (myCache.get(someThing).isDefined)
      Ok(s"just found $someThing in the cache")
    else
      NotFound(s"$someThing NOT found in the cache")
  }
}

On the Scala mailing list they sometimes point to MapMaker in the Google Collections library (the predecessor of Guava). You might want to have a look at that.
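For illustration, a hedged sketch of what a MapMaker-based cache might look like (the weakValues choice is an assumption, not a recommendation from the mailing list):
import com.google.common.collect.MapMaker
import java.util.concurrent.ConcurrentMap

// a concurrent map whose values are weakly referenced, so the GC can reclaim
// cached entries once nothing else points at them
val cache: ConcurrentMap[String, AnyRef] =
  new MapMaker().weakValues().makeMap[String, AnyRef]()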

For simple caching needs, I'm still using Guava's cache in Scala as well.
Lightweight and battle-tested.
If it fits the requirements and constraints generally outlined below, it could be a great option:
Willing to spend some memory to improve speed.
Expecting that keys will sometimes get queried more than once.
Your cache will not need to store more data than what would fit in RAM. (Guava caches are local to a single run of your application; they do not store data in files or on outside servers.)
An example of using it would look something like this:
import com.google.common.cache.{CacheBuilder, CacheLoader}
import java.util.concurrent.TimeUnit

lazy val dataCache = CacheBuilder.newBuilder()
  .expireAfterWrite(60, TimeUnit.MINUTES)
  .maximumSize(10)
  .build(
    new CacheLoader[Key, Data] {
      def load(key: Key): Data =
        veryExpensiveDataCreation(key)
    }
  )
To read from it, you can use something like:
def cachedData(keyToData: Key): Data =
  try {
    dataCache.get(keyToData)
  } catch {
    case ee: Exception => throw new YourSpecialException(ee.getMessage)
  }
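As a side note, since the loader above throws no checked exceptions, Guava's LoadingCache.getUnchecked is an alternative read path that avoids the try/catch (dataCache is the cache built above):
// getUnchecked wraps loader failures in an unchecked exception instead of
// the checked ExecutionException thrown by get
def cachedDataUnchecked(keyToData: Key): Data =
  dataCache.getUnchecked(keyToData)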

Since it hasn't been covered in detail yet, let me put on the table the lightweight spray-caching, which can be used independently from Spray and provides expected-size, time-to-live, and time-to-idle eviction strategies.

We are using Scaffeine (Scala + Caffeine); you can read about its pros and cons compared to other frameworks here.
Add the dependency to your build.sbt:
"com.github.blemale" %% "scaffeine" % "4.0.1"
Build your cache:
import com.github.blemale.scaffeine.{Cache, Scaffeine}
import scala.concurrent.duration._

val cachedItems: Cache[String, Int] =
  Scaffeine()
    .recordStats()
    .expireAfterWrite(60.seconds)
    .maximumSize(500)
    .build[String, Int]()

cachedItems.put("key", 1)       // add an item
cachedItems.getIfPresent("key") // returns an Option
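If you would rather have the cache compute missing values itself, Scaffeine can also build a loading cache; a hedged sketch, where the key-length loader is purely illustrative:
import com.github.blemale.scaffeine.{LoadingCache, Scaffeine}
import scala.concurrent.duration._

val loadingCache: LoadingCache[String, Int] =
  Scaffeine()
    .expireAfterWrite(1.hour)
    .maximumSize(500)
    .build((key: String) => key.length) // computes the value on a miss

loadingCache.get("key") // loads and caches on a miss, then serves from cache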

Related

How can I see how many documents were written & handle errors correctly?

From the documentation I can see that I should be able to use WriteResult.ok, WriteResult.code, and WriteResult.n in order to understand errors and the number of updated documents, but this isn't working. Here is a sample of what I'm doing (using the ReactiveMongo Play JSON Collection plugin):
def updateOne(collName: String, id: BSONObjectID, q: Option[String] = None) = Action.async(parse.json) { implicit request: Request[JsValue] =>
  val doc = request.body.as[JsObject]
  val idQueryJso = Json.obj("_id" -> id)
  val query = q match {
    case Some(_) => idQueryJso.deepMerge(Json.parse(q.get).as[JsObject])
    case None    => idQueryJso
  }
  mongoRepo.update(collName)(query, doc, manyBool = false).map(result => writeResultStatus(result))
}
def writeResultStatus(writeResult: WriteResult): Result = {
  // NOT WORKING
  if (writeResult.ok) {
    if (writeResult.n > 0) Accepted else NotModified
  } else BadRequest
}
Can I suggest an alternative approach here? You said:
"in order to understand errors and the number of updated documents but this isn't working"
Why don't you use the logging functionality that Play provides? The general idea is:
You set the logging level (e.g., only warnings and errors, or only errors).
You use the log to output a message in every case, whether things are OK or not.
Play saves the logs of your application while it is running.
You, the maintainer/developer, can look into the logs to check whether there are any errors.
This approach opens up a great possibility for the future: you could ship the logs to a third-party service and put monitoring functionality on top of them.
Now if you look at the documentation here, you can see the different log levels and how to use the logger. A minimal sketch of that idea applied to your writeResultStatus is shown below.
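A hedged sketch, assuming Play's built-in Logger; the log messages and the writeErrors accessor are illustrative:
import play.api.Logger

val logger = Logger(this.getClass)

def writeResultStatus(writeResult: WriteResult): Result =
  if (writeResult.ok) {
    logger.info(s"update acknowledged, documents affected: ${writeResult.n}")
    if (writeResult.n > 0) Accepted else NotModified
  } else {
    // keep the failure visible to the maintainer via the logs
    logger.error(s"update failed: ${writeResult.writeErrors.mkString(", ")}")
    BadRequest
  }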

How to test `Var`s of `scala.rx` with scalatest?

I have a method which connects to a websocket and gets a stream of messages from some real external system.
The simplified version is:
def watchOrders(): Var[Option[Order]] = {
  val value = Var[Option[Order]](None)
  // onMessage(order => value.update(Some(order)))
  value
}
When I test it (with scalatest), I want it to connect to the real outside system and only check the first 4 orders:
test("watchOrders") {
  var result = List.empty[Order]
  val stream = client.watchOrders()
  stream.foreach {
    case Some(order) =>
      result = order :: result
      if (result.size == 4) {    // 1.
        assert(result should ...) // 2.
        stream.kill()            // 3.
      }
    case _ =>
  }
  Thread.sleep(10000) // 4.
}
I have 4 questions:
1. Is this the right way to check the first 4 orders? There is no take(4) method in scala.rx.
2. If the assert fails, the test still passes; how do I fix that?
3. Is this the right way to stop the stream?
4. If the thread doesn't sleep here, the test passes without the code in case Some(order) ever running. Is there a better way to wait?
One approach you might consider for getting a List out of a Var is the .fold combinator.
The other issue you have is dealing with the asynchronous nature of the data. Assuming you really want to talk to this outside real-world system in your test code (i.e., this is closer to the integration-test side of things), you will want to look at scalatest's support for async tests, and will probably do something like constructing a Future out of a Promise that you complete once you have accumulated the 4 elements in your list; a sketch follows after the link.
See: http://www.scalatest.org/user_guide/async_testing
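A hedged sketch of that approach, reusing the API from your snippet (AsyncFunSuite, the Obs handle returned by foreach, and kill() are assumptions based on your code):
import org.scalatest.AsyncFunSuite
import scala.concurrent.Promise

class WatchOrdersSpec extends AsyncFunSuite {
  test("watchOrders sees the first 4 orders") {
    val first4 = Promise[List[Order]]()
    var buffer = List.empty[Order]
    val stream = client.watchOrders()
    val obs = stream.foreach {
      case Some(order) if !first4.isCompleted =>
        buffer = order :: buffer
        if (buffer.size == 4) first4.trySuccess(buffer.reverse)
      case _ =>
    }
    // returning the mapped Future lets scalatest wait without Thread.sleep,
    // and a failing assertion now fails the test
    first4.future.map { orders =>
      obs.kill()
      stream.kill()
      assert(orders.size == 4) // replace with real assertions on the orders
    }
  }
}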

Spray, Slick, Spark - OutOfMemoryError: PermGen space

I have successfully implemented a simple web service using Spray and Slick that passes an incoming request through a Spark ML prediction pipeline. Everything was working fine until I tried to add a data layer. I chose Slick, as it seems to be popular.
However, I can't quite get it to work right. I have based most of my code on the Hello-Slick Activator template. I use a DAO object like so:
object DataDAO {
  val datum = TableQuery[Datum]

  def dbInit = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum.schema.create
      )), Duration.Inf)
    } finally db.close
  }

  def insertData(data: Data) = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum += data,
        datum.result.map(println)
      )), Duration.Inf)
    } finally db.close
  }
}

case class Data(data1: String, data2: String)

class Datum(tag: Tag) extends Table[Data](tag, "DATUM") {
  def data1 = column[String]("DATA_ONE", O.PrimaryKey)
  def data2 = column[String]("DATA_TWO")
  def * = (data1, data2) <> (Data.tupled, Data.unapply)
}
I initialize my database in my Boot object:
object Boot extends App {
  implicit val system = ActorSystem("raatl-demo")
  Classifier.initializeData
  DataDAO.dbInit
  // More service initialization code ...
}
I try to add a record to my database before completing the service request:
val predictionRoute = {
  path("data") {
    get {
      parameter('q) { query =>
        // do Spark stuff to get prediction
        DataDAO.insertData(data)
        respondWithMediaType(`application/json`) {
          complete {
            DataJson(data1, data2)
          }
        }
      }
    }
  }
}
When I send a request to my service, my application crashes with:
java.lang.OutOfMemoryError: PermGen space
I suspect I'm implementing the Slick API incorrectly, but it's hard to tell from the documentation, because it stuffs all the operations into a main method.
Finally, my conf is the same as in the Activator UI:
h2mem1 = {
  url = "jdbc:h2:mem:raatl"
  driver = org.h2.Driver
  connectionPool = disabled
  keepAliveConnection = true
}
Has anyone encountered this before? I'm using Slick 3.1.
java.lang.OutOfMemoryError: PermGen space is normally not a problem with your usage; here is what Oracle says about it:
The detail message PermGen space indicates that the permanent generation is full. The permanent generation is the area of the heap where class and method objects are stored. If an application loads a very large number of classes, then the size of the permanent generation might need to be increased using the -XX:MaxPermSize option.
I do not think this is because of an incorrect implementation of the Slick API. This probably happens because you are using multiple frameworks that load many classes.
Your options are:
Increase the perm gen size with -XX:MaxPermSize (see the sketch after this list).
Upgrade to Java 8, where the perm gen space is replaced by Metaspace, which is tuned automatically.
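For the first option, a hedged sbt example (the 256m value is an arbitrary starting point, and the flag only exists on Java 7 and earlier):
// in build.sbt: fork the application JVM so the option below takes effect
fork in run := true
javaOptions in run += "-XX:MaxPermSize=256m"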

How to properly use spray.io LruCache

I am quite an inexperienced spray/Scala developer. I am trying to properly use spray.io's LruCache, and what I want to achieve is very simple: I have a Kafka consumer, and when it reads something from its topic I want it to put the value it read into the cache.
Then, in one of the routes, I want to read this value; the value is of type String. What I have at the moment looks as follows:
object MyCache {
  val cache: Cache[String] = LruCache(
    maxCapacity = 10000,
    initialCapacity = 100,
    timeToLive = Duration.Inf,
    timeToIdle = Duration(24, TimeUnit.HOURS)
  )
}
To put something into the cache I use the following code:
def message() = Future { new String(singleMessage.message()) }
MyCache.cache(key, message)
Then in one of the routes I try to get something from the cache:
val res = MyCache.cache.get(keyHash)
The problem is that the type of res is Option[Future[String]], and it is quite hard and ugly to access the real value in this case. Could someone please tell me how I can simplify my code to make it better and more readable?
Thanks in advance.
Don't try to get the value out of the Future. Instead, call map on the Future to arrange for work to be done on the value when the Future completes, and then complete the request with that result (which is itself a Future). It should look something like this:
path("foo") {
  complete(MyCache.cache.get(keyHash) map (optMsg => ...))
}
Also, if singleMessage.message does not do I/O or otherwise block, then rather than creating the Future the way you are:
Future { new String(singleMessage.message) }
it would be more efficient to do it like so:
Future.successful(new String(singleMessage.message))
The latter just creates an already-completed Future, bypassing the use of an ExecutionContext to evaluate the function.
If singleMessage.message does do I/O, then ideally you would do that I/O with a library (like spray-client, if it's an HTTP request) that returns a Future, rather than using Future { ... } to tie up another thread that will block.
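Putting the pieces together, a hedged sketch of a complete route; onComplete and StatusCodes are spray-routing/spray-http features, while the fallback message is illustrative:
import scala.concurrent.Future
import scala.util.{Failure, Success}
import spray.http.StatusCodes

path("foo") {
  // unwrap the Option with a fallback, then let onComplete unwrap the Future
  onComplete(MyCache.cache.get(keyHash).getOrElse(Future.successful("no value cached"))) {
    case Success(msg) => complete(msg)
    case Failure(e)   => complete(StatusCodes.InternalServerError, e.getMessage)
  }
}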

Using Streams in Gatling repeat blocks

I've come across the following code in a Gatling scenario (modified for brevity/privacy):
val scn = scenario("X")
  .repeat(numberOfLoops, "loopName") {
    exec((session: Session) => {
      val loopCounter = session.getTypedAttribute[Int]("loopName")
      session.setAttribute("xmlInput", createXml(loopCounter))
    })
    .exec(
      http("X")
        .post("/rest/url")
        .headers(headers)
        .body("${xmlInput}")
    )
  }
It names the loop in the repeat block, gets the counter out of the session, and uses it to create a unique input XML. It then sticks that XML back into the session and extracts it again when posting it.
I would like to do away with the need to name the loop iterator and access the session.
Ideally I'd like to use a Stream to generate the XML.
But Gatling controls the looping and I can't recurse. Do I need to compromise, or can I use Gatling in a functional way (without vars or accessing the session)?
As I see it, neither numberOfLoops nor createXml seems to depend on anything user-related that would have been stored in the session, so the loop can be resolved at build time instead of at runtime:
import com.excilys.ebi.gatling.core.structure.ChainBuilder

def addXmlPost(chain: ChainBuilder, i: Int) =
  chain.exec(
    http("X")
      .post("/rest/url")
      .headers(headers)
      .body(createXml(i))
  )

def addXmlPostLoop(chain: ChainBuilder): ChainBuilder =
  (0 until numberOfLoops).foldLeft(chain)(addXmlPost)
Cheers,
Stéphane
PS: The preferred way to ask something about Gatling is our Google Group: https://groups.google.com/forum/#!forum/gatling