In my application, I have to interact (read-only) with multiple MySQL DBs one by one. For each DB, I need a certain number of connections. Interactions with a DB do not occur in a single stretch: I query the DB, take some time processing the results, query the DB again, process the results again, and so on.
Each of these interactions requires multiple connections (I fire multiple queries concurrently), hence I need a ConnectionPool that spawns when I start interacting with the DB and lives until I'm done with all queries to that DB (including the interim intervals when I'm not querying, only processing the results).
I'm able to successfully create a ConnectionPool with the desired number of connections and obtain the implicit session, as shown below:
def createConnectionPool(poolSize: Int): DBSession = {
  implicit val session: AutoSession.type = AutoSession
  ConnectionPool.singleton(
    url = "myUrl",
    user = "myUser",
    password = "***",
    settings = ConnectionPoolSettings(initialSize = poolSize)
  )
  session
}
I then pass this implicit session through all the methods that need to interact with the DB. That way, I'm able to fire poolSize queries concurrently using this session. Fair enough.
def methodThatCallsAnotherMethod(implicit session: DBSession): Unit = {
  ...
  methodThatInteractsWithDb
  ...
}

def methodThatInteractsWithDb(implicit session: DBSession): Unit = {
  ...
  getResultsParallely(poolSize = 32, fetchSize = 2000000)
  ...
}
def getResultsParallely(poolSize: Int, fetchSize: Int)(implicit session: DBSession): Seq[ResultClass] = {
  import java.util.concurrent.Executors
  import scala.concurrent.duration._
  import scala.concurrent.{Await, ExecutionContext, Future}

  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(poolSize))

  val resultsSequenceFuture: Seq[Future[ResultClass]] =
    (0 until poolSize).map { i =>
      val limit: Long = fetchSize
      val offset: Long = i * fetchSize
      Future(methodThatMakesSingleQuery(limit, offset))
    }
  val resultsFutureSequence: Future[Seq[ResultClass]] = Future.sequence(resultsSequenceFuture)
  Await.result(resultsFutureSequence, 2.minutes)
}
There are 2 problems with this technique:
My application is quite big and has many nested method calls, so passing the implicit session through all the methods like this (as shown above) isn't feasible.
In addition to the said interactions with different DBs one by one, I also need a single connection to another (fixed) DB throughout the lifetime of my entire application. This connection would be used for a small write operation (logging the progress of my interactions with the other DBs) every few minutes. Therefore, I need multiple ConnectionPools, one for each DB.
From what I could make out of ScalikeJdbc's docs, I came up with the following way of doing it, which doesn't require me to pass the implicit session everywhere:
def createConnectionPool(poolName: String, poolSize: Int): Unit = {
  ConnectionPool.add(
    name = poolName,
    url = "myUrl",
    user = "myUser",
    password = "***",
    settings = ConnectionPoolSettings(initialSize = poolSize)
  )
}
def methodThatInteractsWithDb(poolName: String): Unit = {
  ...
  DB(ConnectionPool.get(poolName).borrow()).readOnly { implicit session: DBSession =>
    // interact with DB
    ...
  }
  ...
}
Although this works, I'm no longer able to parallelize the DB interaction. This behaviour is expected, since I'm using the borrow() method, which gets a single connection from the pool. This, in turn, makes me wonder why the AutoSession approach worked earlier: why was I able to fire multiple queries simultaneously using a single implicit session? And if that worked, why doesn't this? I can find no examples of how to obtain a DBSession from a ConnectionPool that supports multiple connections.
To sum up, I have 2 problems and 2 solutions: one for each problem. But I need a single (common) solution that solves both problems.
ScalikeJdbc's limited docs don't offer much help, and blogs/articles on ScalikeJdbc are practically non-existent.
Please suggest the correct way, or some workaround.
Framework versions
Scala 2.11.11
"org.scalikejdbc" %% "scalikejdbc" % "3.2.0"
Thanks to @Dennis Hunziker, I was able to figure out the correct way to release connections borrowed from ScalikeJdbc's ConnectionPool. It can be done as follows:
import java.sql.Connection
import scalikejdbc.{ConnectionPool, using}

using(ConnectionPool.get("poolName").borrow()) { (connection: Connection) =>
  // use the connection (only once) here
}
// connection automatically returned to the pool
With this, now I'm able to parallelize interaction with the pool.
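For illustration, here's a minimal sketch of that parallelization (the pool name and some_table are placeholders): each Future borrows its own connection from the named pool, runs a read-only query, and returns the connection via using.

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import scalikejdbc._

def queryInParallel(poolName: String, poolSize: Int): Seq[Long] = {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(poolSize))
  val futures: Seq[Future[Long]] = (0 until poolSize).map { _ =>
    Future {
      // each Future gets its own connection, auto-returned by `using`
      using(DB(ConnectionPool.get(poolName).borrow())) { db =>
        db.readOnly { implicit session =>
          sql"SELECT COUNT(*) FROM some_table".map(_.long(1)).single.apply().getOrElse(0L)
        }
      }
    }
  }
  Await.result(Future.sequence(futures), 2.minutes)
}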
To solve my problem of managing several ConnectionPools and using connections across several classes, I ended up writing a ConnectionPoolManager, the complete code for which can be found here. By offloading the tasks of
creating pools
borrowing connections from pools
removing pools
to a singleton object that I can use anywhere across my project, I was able to clear a lot of clutter and eliminate the need to pass an implicit session across chains of methods.
EDIT-1
While I've already linked the complete code for the ConnectionPoolManager, here's a quick hint of how you can go about it.
The following method of ConnectionPoolManager lets you borrow connections from the ConnectionPools:
def getDB(dbName: String, poolNameOpt: Option[String] = None): DB = {
  // create a pool for the db (only) if it doesn't exist yet
  addPool(dbName, poolNameOpt)
  val poolName: String = poolNameOpt.getOrElse(dbName)
  DB(ConnectionPool.get(poolName).borrow())
}
Thereafter, throughout your code, you can use the above method to borrow connections from pools and make your queries:
def makeQuery(dbName: String, poolNameOpt: Option[String]) = {
  ConnectionPoolManager.getDB(dbName, poolNameOpt).localTx { implicit session: DBSession =>
    // perform ScalikeJdbc SQL query here
  }
}
Related
Is there any benefit to using partially applied functions vs injecting dependencies into a class? Both approaches, as I understand them, are shown here:
class DB(conn: String) {
  def get(sql: String): List[Any] = ???
}

object DB {
  def get(conn: String)(sql: String): List[Any] = ???
}

object MyApp {
  val conn = "jdbc:..."
  val sql = "select * from employees"

  val db1 = new DB(conn)
  db1.get(sql)

  val db2 = DB.get(conn) _
  db2(sql)
}
Using partially applied functions is somewhat simpler, but conn is passed to the function on each call and could be different every time. The advantage of using a class is that it can perform one-off operations when it is created, such as validation or caching, and retain the results for re-use.
For example, the conn in this code is a String, but it is presumably used to connect to a database of some sort. With the partially applied function, that connection must be made on every call. With a class, the connection can be made when the class is created and simply re-used for each query. The class version can also refuse to be created unless the conn is valid, as sketched below.
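A minimal sketch of that idea, where a simple prefix check stands in for real validation (the check and the ??? bodies are placeholders):

// Hypothetical sketch: validation happens once, at construction time.
// A partially applied function would have to re-validate on every call.
class DB(conn: String) {
  require(conn.startsWith("jdbc:"), s"Invalid connection string: $conn")
  // a real implementation could also open and cache the connection here
  def get(sql: String): List[Any] = ???
}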
The class is usually used when the dependency is longer-lived or used by multiple functions. Partial application is more common when the dependency is shorter-lived, like during a single loop or callback. For example:
list.map(f(context))
def f(context: Context)(element: Int): Result = ???
It wouldn't really make sense to create a class just to hold f. On the other hand, if you have 5 functions that all take context, you should probably put those into a class, as sketched below. In your example, get is unlikely to be the only thing that requires the conn, so a class makes more sense.
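For illustration, a minimal sketch of that grouping (Context, Result, and the method bodies are placeholders):

trait Context
trait Result

// Several functions sharing one dependency fit naturally into a class.
class ElementProcessor(context: Context) {
  def f(element: Int): Result = ???
  def g(element: Int): Result = ???
  def h(element: Int): Result = ???
}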
I have made a factory method which should either start a database (Cassandra) and connect to it, or return an existing session. The connection to the database is a static field.
class EmbeddedCassandraManager {
  def getCassandra() = {
    if (EmbeddedCassandraManager.cassandraConnection.isDefined) // return existing instance
      (EmbeddedCassandraManager.testCassandra, EmbeddedCassandraManager.cassandraConnection)
    else
      EmbeddedCassandraManager.startCassandra()
  }

  def closeCassandra() = {
    EmbeddedCassandraManager.closeCassandra()
  }
}
object EmbeddedCassandraManager {
  val factory = new EmbeddedCassandraFactory

  // can I do the logic without using var?
  var (testCassandra, cassandraConnection) = startCassandra()

  def closeCassandra() = {
    cassandraConnection.get.close()
    cassandraConnection = None
    testCassandra.stop()
  }

  def startCassandra(): (Cassandra, Option[CassandraConnection]) = {
    val testCassandra = factory.create()
    testCassandra.start()
    val cassandraConnectionFactory: DefaultCassandraConnectionFactory = new DefaultCassandraConnectionFactory()
    val localCassandraConnection: Option[CassandraConnection] = try {
      Some(cassandraConnectionFactory.create(testCassandra))
    } catch {
      case exception: Throwable => throw exception
    }
    this.cassandraConnection = localCassandraConnection
    (testCassandra, this.cassandraConnection)
  }
}
The only way I am able to implement this logic is by using a var for cassandraConnection. Is there a pattern I can use to avoid the var?
In one of the tests, I have to stop Cassandra to verify that a connection doesn't get established when the database isn't running. This makes the existing connection stale. Without a var, I am not able to set the value to None to invalidate the connection and then set it to a new value once the database connection is established again.
What is the functional way to write such logic? I need a static value for the connection so that only one connection is created, and I want a way to check that the value is not stale.
Mutability is often unavoidable, because it is an inherent property of the systems we build. However, that doesn't mean that we have to use mutable variables in our code.
There are usually two main ways that you can deal with situations that involve mutable state:
Push the mutable state to a repository outside of your program.
Typical examples of this are "standard" database (if state needs to be persisted) and in-memory storage (if state exists for the duration of your program's lifecycle). Whenever you would fetch a value from such storage, you would treat it as an immutable value. Mutability still exists, but not inside your program, which makes it easier to reason about.
Some people criticize this line of thinking by saying "you're not solving anything, you're just making it someone else's problem", and that's actually true. We are letting the database handle the mutability for us. Why not? That's what a database is designed to do. Besides, the main problem with mutability is reasoning about it, and we are not going to reason about the internal implementation of the database. So pushing the mutability from one of our services to another would indeed be like throwing the hot potato around, but pushing it to an external system that's designed for it is completely fine.
However, all that being said, it doesn't help your case, because it's not really elegant to store database connection objects in an external storage. Which takes me to point number two.
Use the State monad.
If the word "monad" raises some flags for you, pretend I said "use State" (it's actually quite a simple concept, no big words needed). I will be using the implementation of State available in the Cats library, but it exists in other FP libraries as well.
State is a function from some existing state to a new state and some produced value:
S => (S, V)
By going from an existing state to a new state, we achieve the "mutation of state".
Example 1:
Here's some code that uses an integer state which gets incremented by one and produces a string value every time the state changes:
import cats.data.State

val s: State[Int, String] = State((i: Int) => (i + 1, s"Value: $i"))

val program = for {
  produced1 <- s
  _ = println(produced1) // Value: 42
  produced2 <- s
  _ = println(produced2) // Value: 43
  produced3 <- s
  _ = println(produced3) // Value: 44
} yield "Done."

program.run(42).value
That's the gist of it.
Example 2:
For completeness, here's a bigger example which demonstrates a use case similar to yours.
First, let's introduce a simplified model of CassandraConnection (this is just for the sake of the example; the real object would come from the Cassandra library, so no mutability would exist in our own code).
class CassandraConnection() {
  var isOpen: Boolean = false
  def connect(): Unit = isOpen = true
  def close(): Unit = isOpen = false
}
How should we define the state? The mutable object is obviously the CassandraConnection, and the result value that will be used in the for-comprehension can be a simple String.
import cats.data.State
type DbState = State[CassandraConnection, String]
Now let's define some functions for manipulating the state using an existing CassandraConnection object.
val openConnection: DbState = State { connection =>
  if (connection.isOpen) {
    (connection, "Already connected.")
  } else {
    val newConnection = new CassandraConnection()
    newConnection.connect()
    (newConnection, "Connected!")
  }
}

val closeConnection: DbState = State { connection =>
  connection.close()
  (connection, "Closed!")
}

val checkConnection: DbState = State { connection =>
  if (connection.isOpen) (connection, "Connection is open.")
  else (connection, "Connection is closed.")
}
And finally, let's play with these functions in the main program:
val program: DbState =
  for {
    log1 <- checkConnection
    _ = println(log1) // Connection is closed.
    log2 <- openConnection
    _ = println(log2) // Connected!
    log3 <- checkConnection
    _ = println(log3) // Connection is open.
    log4 <- openConnection
    _ = println(log4) // Already connected.
    log5 <- closeConnection
    _ = println(log5) // Closed!
    log6 <- checkConnection
    _ = println(log6) // Connection is closed.
  } yield "Done."

program.run(new CassandraConnection()).value
I know this is not exact code that you could copy/paste into your project and have it work nicely, but I wanted to give a slightly more general answer that might be easier for other readers to understand. With some playing around, I'm sure you can shape it into your own solution. As long as your main program is a for-comprehension over State, you can easily open and close your connections and (re)use the same connection objects.
What did we really achieve with this solution? Why is this better than just having a mutable CassandraConnection value?
One big thing is that we achieve referential transparency, which is why this pattern fits the functional programming paradigm nicely and standard mutability doesn't. Since this answer is already getting a bit long, I will point you towards the Cats documentation, which explains the whole thing in more detail and demonstrates the benefits of using State very nicely.
I've been using doobie (cats) to connect to a PostgreSQL database from a Scalatra application. Recently I noticed that the app was creating a new connection pool for every transaction. I eventually worked around it (see below), but this approach is quite different from the one taken in the 'managing connections' section of the book of doobie, so I was hoping someone could confirm whether it is sensible, or whether there is a better way of setting up the connection pool.
Here's what I had initially - this works but creates a new connection pool on every connection:
import com.zaxxer.hikari.HikariDataSource
import doobie.hikari.hikaritransactor.HikariTransactor
import doobie.imports._

val pgTransactor = HikariTransactor[IOLite](
  "org.postgresql.Driver",
  s"jdbc:postgresql://${postgresDBHost}:${postgresDBPort}/${postgresDBName}",
  postgresDBUser,
  postgresDBPassword
)

// every query goes via this function
def doTransaction[A](update: ConnectionIO[A]): Option[A] = {
  val io = for {
    xa <- pgTransactor
    res <- update.transact(xa) ensuring xa.shutdown
  } yield res
  io.unsafePerformIO
}
My initial assumption was that the problem was having ensuring xa.shutdown on every request, but removing it resulted in connections quickly being used up until there were none left.
This was an attempt to fix the problem: it let me remove ensuring xa.shutdown, but still resulted in the connection pool being repeatedly opened and closed:
val pgTransactor: HikariTransactor[IOLite] = HikariTransactor[IOLite](
  "org.postgresql.Driver",
  s"jdbc:postgresql://${postgresDBHost}:${postgresDBPort}/${postgresDBName}",
  postgresDBUser,
  postgresDBPassword
).unsafePerformIO

def doTransaction[A](update: ConnectionIO[A]): Option[A] = {
  val io = update.transact(pgTransactor)
  io.unsafePerformIO
}
Finally, I got the desired behaviour by creating a HikariDataSource object and then passing it into the HikariTransactor constructor:
val dataSource = new HikariDataSource()
dataSource.setJdbcUrl(s"jdbc:postgresql://${postgresDBHost}:${postgresDBPort}/${postgresDBName}")
dataSource.setUsername(postgresDBUser)
dataSource.setPassword(postgresDBPassword)
val pgTransactor: HikariTransactor[IOLite] = HikariTransactor[IOLite](dataSource)
def doTransaction[A](update: ConnectionIO[A], operationDescription: String): Option[A] = {
  val io = update.transact(pgTransactor)
  io.unsafePerformIO
}
You can do something like this:
val xa = HikariTransactor[IOLite](dataSource).unsafePerformIO
and pass it to your repositories.
.transact applies the transaction boundaries, like Slick's .transactionally.
E.g.:
def interactWithDb = {
  val q: ConnectionIO[Int] = sql"""..."""
  q.transact(xa).unsafePerformIO
}
Yes, the response from Radu gets at the problem. The HikariTransactor (really the underlying HikariDataSource) has internal state, so constructing it is a side effect; you want to do it once when your program starts and pass it around as needed. So your solution works; just note the side effect.
Also, as noted, I don't monitor SO … try the Gitter channel or open an issue if you have questions. :-)
I have a Weka model stored in S3, around 400MB in size.
Now I have a set of records on which I want to run the model and perform predictions.
To perform the predictions, here is what I have tried:
Download and load the model on the driver as a static object, broadcast it to all executors, and perform a map operation on the prediction RDD.
----> Not working: in Weka, the model object needs to be modified during prediction, while a broadcast requires a read-only copy.
Download and load the model on the driver as a static object and send it to the executors in each map operation.
-----> Working, but not efficient, as I am passing a 400MB object in each map operation.
Download the model on the driver and load it on each executor, caching it there. (Don't know how to do that.)
Does anyone have an idea how I can load the model on each executor once and cache it, so that I don't load it again for other records?
You have two options:
1. Create a singleton object with a lazy val representing the data:
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process
  }
}
Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes its own instance of the data. No serialization or broadcasts will be performed for data.
elementsRDD.map { element =>
  // use WekaModel.data here
}
Advantages
It is more efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when you need to initialize a database connection pool, for example (see the sketch after this list).
Disadvantages
Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.
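As an aside, a minimal sketch of that pool-per-JVM idea, reusing the HikariDataSource API that appears elsewhere on this page (the URL is a placeholder):

import com.zaxxer.hikari.HikariDataSource

object ConnectionPoolHolder {
  // The lazy val body runs once per executor JVM, the first time any
  // task on that executor touches it.
  lazy val pool: HikariDataSource = {
    val ds = new HikariDataSource()
    ds.setJdbcUrl("jdbc:postgresql://host:5432/db") // placeholder URL
    ds
  }
}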
2. Use the mapPartitions (or foreachPartition) method on the RDD instead of just map.
This allows you to initialize whatever you need for the entire partition.
elementsRDD.mapPartitions { elements =>
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. There is a single instance of model per partition.
  }
}
Advantages:
Provides more flexibility in the initialization and deinitialization of objects.
Disadvantages
Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
Here's what worked for me even better than the lazy initializer: I created an object-level pointer initialized to null and let each executor initialize it. In the initialization block you can have run-once code. Note that each processing batch will reset local variables, but not the object-level ones.
object Thing1 {
  var bigObject: BigObject = null

  def main(args: Array[String]): Unit = {
    val sc = <spark/scala magic here>
    sc.textFile(infile).map(line => {
      if (bigObject == null) {
        // this takes a minute but runs just once
        bigObject = new BigObject(parameters)
      }
      bigObject.transform(line)
    })
  }
}
This approach creates exactly one big object per executor, rather than the one big object per partition of other approaches.
If you put var bigObject: BigObject = null within the main function namespace, it behaves differently: it runs the BigObject constructor at the beginning of each partition (i.e. each batch). If you have a memory leak, this will eventually kill the executor, and garbage collection will also have to do more work.
Here is what we usually do:
define a singleton client that does this kind of work, to ensure only one client is present in each executor
have a getOrCreate method to create or fetch the client; usually, if you have a common serving platform that has to serve multiple different models, you can use something like a ConcurrentMap with computeIfAbsent to keep it thread-safe (a sketch follows below)
call the getOrCreate method at the RDD level, inside transform or foreachPartition, so that initialization happens at the executor level
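A minimal sketch of that registry, assuming a hypothetical ModelClient class standing in for whatever wraps a loaded model:

import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical stand-in for whatever client wraps a loaded model.
class ModelClient(val modelName: String)

object ClientRegistry {
  private val clients = new ConcurrentHashMap[String, ModelClient]()

  // Explicit java.util.function.Function, since Scala 2.11 lambdas
  // do not auto-convert to Java SAM types.
  private val factory = new JFunction[String, ModelClient] {
    override def apply(name: String): ModelClient = new ModelClient(name)
  }

  // Thread-safe get-or-create: at most one client per model name per JVM.
  def getOrCreate(modelName: String): ModelClient =
    clients.computeIfAbsent(modelName, factory)
}

Calling ClientRegistry.getOrCreate from inside mapPartitions or foreachPartition ensures the map lives on the executors rather than the driver.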
You can achieve this by broadcasting a case object with a lazy val as follows:
case object localSlowTwo {
  lazy val value: Int = { Thread.sleep(1000); 2 }
}

val broadcastSlowTwo = sc.broadcast(localSlowTwo)
(1 to 1000).toDS.repartition(100).map(_ * broadcastSlowTwo.value.value).collect
The event timeline for this on three executors with three threads each shows the one-second initialization happening once per executor. Running the last line again from the same spark-shell session does not trigger any further initialization.
This works for me, and it's thread-safe if you use a singleton and synchronized as shown below:
object singletonObj {
  var data: dataObj = null
  def getDataObj(): dataObj = this.synchronized {
    if (this.data == null) {
      this.data = new dataObj()
    }
    this.data
  }
}

object app {
  def main(args: Array[String]): Unit = {
    lazy val mydata: dataObj = singletonObj.getDataObj()
    df.map(x => { functionA(mydata) })
  }
}
I'm using Slick and have a question about Slick sessions. I'll give an example first.
An Order class contains line items; an Order can fetch its line items or remove one of them, and an Order can also price itself. Below is the pseudocode:
class Order {
  def getLineItems = database withSession {
    // get line items from db repository
  }

  def removeLineItem(itemId: String) = database withTransaction {
    implicit ss: Session =>
      // remove item from db
      // price the order
  }

  def priceOrder() = database withTransaction {
    implicit ss: Session =>
      // getLineItems
      // recalculate order price from each line item
  }
}
So when I try to remove a line item, it will create a new session and transaction, then invoke priceOrder, which will also create a new session and transaction; priceOrder will invoke getLineItems, which creates yet another session.
From the Slick documentation, I know that each session opens a JDBC connection, so this one method invocation creates 3 database connections, which is a waste of connection resources. Is there a way to use only one connection for the whole operation?
I know Slick has a threadLocalSession which binds the session to a thread-local, but from https://groups.google.com/forum/#!topic/scalaquery/Sg42HDEK34Q I see that we should avoid using threadLocalSession.
Please help, thanks.
Instead of creating a new session/transaction for each method, you can use currying to pass an implicit session.
def create(user: User)(implicit session: Session) = {
  val id = Users.returning(Users.map(_.id)).insert(user)
  user.copy(id = Some(id))
}
Then, in a controller or some such place, when you want to call the methods, you set up a session/transaction and it will be used for all database work within that block:
// Create the user.
DB.withTransaction { implicit session: Session =>
  Users.create(user)
}
Implicit sessions are how some of the Slick examples are set up: https://github.com/slick/slick-examples/blob/master/src/main/scala/com/typesafe/slick/examples/lifted/MultiDBCakeExample.scala
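Applied to the Order example from the question, a rough sketch might look like the following, so that removeLineItem, priceOrder, and getLineItems all share one session, and hence one connection (LineItem and the method bodies are placeholders):

case class LineItem(id: String, price: BigDecimal) // placeholder model

class Order {
  def getLineItems(implicit ss: Session): List[LineItem] = ???

  // reuses the caller's session instead of opening its own
  def priceOrder()(implicit ss: Session): BigDecimal =
    getLineItems.map(_.price).sum

  def removeLineItem(itemId: String)(implicit ss: Session): Unit = {
    // remove the item from the db here, then reprice,
    // all on the same connection
    priceOrder()
    ()
  }
}

// One transaction, and therefore one connection, for the whole operation:
// database withTransaction { implicit ss: Session =>
//   order.removeLineItem("42")
// }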