How to create a shared JDBC connection to use on executors? - scala

I have created a Spark JDBC singleton connection in the driver and plan to use the connection in the executors. I get the exception below: org.apache.spark.SparkException: Task not serializable
Inside the Spark main class:
import java.sql.{Connection, DriverManager}

object ExecutorConnection {
  private var connection: Connection = null

  // prop: a java.util.Properties instance loaded elsewhere (not shown in the question)
  val url = prop.getProperty("url")
  val user = prop.getProperty("user")
  val pwd = prop.getProperty("password")
  val driver = prop.getProperty("driver")

  // load the JDBC driver class once, before any connection is requested
  Class.forName(driver)

  def getConnection(url: String, username: String, password: String): Connection = synchronized {
    if (connection == null) {
      connection = DriverManager.getConnection(url, username, password)
      connection.setAutoCommit(false)
    }
    connection
  }

  lazy val createConnection = getConnection(url, user, pwd)
}
I have multiple DataFrames (df1, df2, df3) with different schemas, where I'm planning to create the connection at the driver level, serialize the connection, and use it for all DataFrames.
df1.rdd.repartition(2).mapPartitions(d => Iterator(d)).foreach { partition =>
  val conn = ExecutorConnection.createConnection
  var ps: PreparedStatement = null
  partition.grouped(1).foreach { batch =>
    batch.foreach { x =>
      ps = conn.prepareStatement(SqlString)
      ps.addBatch()
      conn.commit()
    }
  }
}

Use Dataset.foreachPartition:
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
Applies a function f to each partition of this Dataset.
This trick with a Scala object is exactly how you get the connection once per task (and, I think, once per executor as well).
df1.foreachPartition { vs =>
  // use the connection here
}
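A minimal sketch of that trick (the object name, URL, and credentials below are illustrative, not from the question): the object is initialized lazily in each executor JVM on first access, so the connection is created on the executor and never serialized.

import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.Row

object ExecutorSingleton {
  // created at most once per executor JVM, on first access
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pwd")
}

df1.foreachPartition { (rows: Iterator[Row]) =>
  val conn = ExecutorSingleton.connection // shared by every task in this JVM
  rows.foreach { row =>
    // prepare and execute statements with conn here
  }
}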
Use Guava for a cache.
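For instance, a sketch of the Guava idea, caching one connection per JDBC URL (the per-URL keying is an assumption of mine; CacheBuilder and CacheLoader are standard Guava APIs):

import java.sql.{Connection, DriverManager}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

object ConnectionCache {
  // one cached connection per JDBC URL, created on first lookup
  val connections: LoadingCache[String, Connection] =
    CacheBuilder.newBuilder().build(
      new CacheLoader[String, Connection] {
        override def load(url: String): Connection = DriverManager.getConnection(url)
      })
}

// on an executor: val conn = ConnectionCache.connections.get(jdbcUrl)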

Re:
where I'm planning to create the connection at the driver level, serialize the connection
It does not work that way.
You have to create the connections on the executors, otherwise you will keep getting this exception.
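A sketch of that, assuming url, user, pwd, and SqlString are plain strings captured by the closure (imports as in the sketch above); only the strings are serialized to the executor, where the connection itself is opened and closed:

df1.foreachPartition { (rows: Iterator[Row]) =>
  val conn = DriverManager.getConnection(url, user, pwd) // created on the executor
  conn.setAutoCommit(false)
  try {
    val ps = conn.prepareStatement(SqlString)
    rows.foreach { row =>
      // bind values from row to ps here
      ps.addBatch()
    }
    ps.executeBatch()
    conn.commit()
  } finally {
    conn.close()
  }
}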

Related

Add Configurations to a Singleton Object in Scala

I am trying to set up a connection pool to Redis in a singleton Scala object so that I can read/write to Redis while mapping partitions of a DF. I want to be able to configure the host, along with other connection pool variables, when I run my main method. However, this current configuration does not give me my configured REDIS_HOST; it gives me localhost.
When writing this I referenced the "One instance per executor" section of https://able.bio/patrickcording/sharing-objects-in-spark--58x4gbf.
What is the best way to achieve configuring the host while maintaining one RedisClient instance per executor?
object Main {
  def main(args: Array[String]): Unit = {
    val parsedConfig = ConfigFactory.parseFile(new File(args(0)))
    val config = ConfigFactory.load(parsedConfig)
    RedisClient.host = config.getString("REDIS_HOST")
    val main = new Main()
    main.runMain()
  }
}

class Main {
  val df = Seq(...).toDF()
  df.mapPartitions(partitions => {
    partitions.foreach(row => {
      val count = RedisClient.getIdCount(row.getAs("id").asInstanceOf[String])
      // do something
    })
  })
  df.write.save
  RedisClient.close()
}
object RedisClient {
  var host: String = "localhost"
  private val pool = new RedisClientPool(host, 6379)

  def getIdCount(id: String): Option[String] = {
    pool.withClient(client => {
      client.get(id)
    })
  }

  def close(): Unit = {
    pool.close()
  }
}
In Spark, main only runs on the driver, not the executors. RedisClient is not guaranteed to exist on any given executor until you call a method which invokes it, and it will just be initialized with default values.
Accordingly, the only way to ensure that it has the correct host is to, in the same RDD/DF operation, ensure that host is set, e.g.:
df.mapPartitions(partitions => {
  RedisClient.host = config.getString("REDIS_HOST")
  partitions.foreach(row => {
    ...
  })
})
Of course, since main doesn't run on the executors, you'll probably also want to broadcast the config to the executors:
// after setting up the SparkContext
val sc: SparkContext = ???
val broadcastConfig = sc.broadcast(config)
Then you'll pass broadcastConfig around and use broadcastConfig.value in place of config, so the above would become:
df.mapPartitions(partitions => {
  RedisClient.host = broadcastConfig.value.getString("REDIS_HOST")
  partitions.foreach(row => {
    ...
  })
})
As long as you take care to always be assigning the same value to RedisClient.host and to set it before doing anything else with RedisClient, you should be safe.
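One way to make that contract harder to break, sketched below, is to defer pool construction until first use so that host is read only after the executor has assigned it (the same RedisClient as above, assuming the scala-redis client's com.redis.RedisClientPool, with the pool made lazy):

import com.redis.RedisClientPool

object RedisClient {
  var host: String = "localhost"
  // lazy: the pool is built on first use, after host has been set on the
  // executor, rather than at object initialization
  private lazy val pool = new RedisClientPool(host, 6379)

  def getIdCount(id: String): Option[String] =
    pool.withClient(client => client.get(id))

  def close(): Unit = pool.close()
}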

Determining if a MongoDB connection is unavailable and creating a new connection if it is

I'm attempting to improve the below code that creates a MongoDB connection and inserts a document using the insertDocument method:
import com.typesafe.scalalogging.LazyLogging
import org.mongodb.scala.result.InsertOneResult
import org.mongodb.scala.{Document, MongoClient, MongoCollection, MongoDatabase, Observer, SingleObservable}
import play.api.libs.json.JsResult.Exception

object MongoFactory extends LazyLogging {
  val uri: String = "mongodb+srv://*********"
  val client: MongoClient = MongoClient(uri)
  val db: MongoDatabase = client.getDatabase("db")
  val collection: MongoCollection[Document] = db.getCollection("col")

  def insertDocument(document: Document) = {
    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
The primary issue I see with the above code is that if the connection becomes stale, because the MongoDB server goes offline or for some other reason, the connection is not restarted. An improvement to cater for this scenario is:
object MongoFactory extends LazyLogging {
  val uri: String = "mongodb+srv://*********"
  var client: MongoClient = MongoClient(uri)
  var db: MongoDatabase = client.getDatabase("db")
  var collection: MongoCollection[Document] = db.getCollection("col")

  def isDbDown(): Boolean = {
    try {
      client.getDatabase("db")
      false
    } catch {
      case e: Exception =>
        true
    }
  }

  def insertDocument(document: Document) = {
    if (isDbDown()) {
      client = MongoClient(uri)
      db = client.getDatabase("db")
      collection = db.getCollection("col")
    }
    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
I expect this to handle the scenario where the DB connection becomes unavailable, but is there a more idiomatic Scala method of determining this?
Your code does not create connections. It creates MongoClient instances.
As such you cannot "create a new connection". MongoDB drivers do not provide an API for applications to manage connections.
Connections are managed internally by the driver and are created and destroyed automatically as needed in response to application requests/commands. You can configure connection pool size and when stale connections are removed from the pool.
Furthermore, execution of a single application command may involve multiple connections (up to 3 easily, possibly over 5 if encryption is involved), and the connection(s) used depend on the command/query. Checking the health of any one connection, even if it was possible, wouldn't be very useful.
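What you can tune is the driver's pool. A sketch using the mongo-scala-driver settings builder (the option values here are illustrative, not recommendations):

import java.util.concurrent.TimeUnit
import com.mongodb.ConnectionString
import org.mongodb.scala.{MongoClient, MongoClientSettings}

val settings = MongoClientSettings.builder()
  .applyConnectionString(new ConnectionString("mongodb+srv://*********"))
  .applyToConnectionPoolSettings(b =>
    b.maxSize(50)                                   // upper bound on pooled connections
      .maxConnectionIdleTime(60, TimeUnit.SECONDS)) // retire stale connections
  .build()

val client: MongoClient = MongoClient(settings)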

How to create a Scala JDBC program using Option to handle null while returning a connection?

I am trying to write a Scala JDBC program that will run an analyze statement on tables present in our database. To do that, I wrote the code below.
object Trip {
  def main(args: Array[String]): Unit = {
    val gs = new GetStats(args(0))
    gs.run_analyze()
  }
}
-----------------------------------------------------------------
class GetStats {
  var tables = ""

  def this(tables: String) {
    this()
    this.tables = tables
  }

  def run_analyze(): Unit = {
    val tabList = tables.split(",")
    val gpc = new GpConnection()
    val con = gpc.getGpCon()
    val statement = con.get.createStatement()
    try {
      for (t <- tabList) {
        val rs = statement.execute(s"analyze ${t}")
        if (rs.equals(true)) println(s"Analyzed ${t}")
        else println(s"Analyze failed ${t}")
      }
    } catch {
      case pse: PSQLException => pse.printStackTrace()
      case e: Exception => e.printStackTrace()
    }
  }
}
-----------------------------------------------------------------
class GpConnection {
  var gpCon: Option[Connection] = None

  def getGpCon(): Option[Connection] = {
    val url = "jdbc:postgresql://.."
    val driver = "org.postgresql.Driver"
    val username = "user"
    val password = "1239876"
    Class.forName(driver)
    if (gpCon == None || gpCon.get.isClosed) {
      gpCon = DriverManager.getConnection(url, username, password).asInstanceOf[Option[Connection]]
      gpCon
    } else gpCon
  }
}
I create a jar file in my IDE (IntelliJ IDEA) and submit the jar as below.
scala -cp /home/username/jars/postgresql-42.1.4.jar analyzetables_2.11-0.1.jar schema.table
When I submit the jar file, I see the exception ClassCastException as given below.
java.lang.ClassCastException: org.postgresql.jdbc.PgConnection cannot be cast to scala.Option
at com.db.manager.GpConnection.getGpCon(GpConnection.scala:15)
at com.gp.analyze.GetStats.run_analyze(GetStats.scala:19)
at com.runstats.Trip$.main(Trip.scala:8)
at com.runstats.Trip.main(Trip.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader.$anonfun$run$2(ScalaClassLoader.scala:98)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:32)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:30)
at scala.reflect.internal.util.ScalaClassLoader.run$(ScalaClassLoader.scala:98)
at scala.reflect.internal.util.ScalaClassLoader.run(ScalaClassLoader.scala:90)
at scala.tools.nsc.CommonRunner.run$(ObjectRunner.scala:22)
The exception says that the connection cannot be cast to scala.Option, but if I don't use Option, I cannot use null to initialize the connection object, and I see a NullPointerException when I run the code.
Could anyone let me know what mistake I am making here and how I can fix it?
asInstanceOf[] doesn't work that way. It won't just create an Option[] for you.
val x:Option[Int] = 5.asInstanceOf[Option[Int]] //not gonna happen
You have to create the Option[] explicitly.
val x:Option[Int] = Option(5)
You can use an uninitialized var as:
var gpCon: Connection = _
But since you are using scala.Option, which is the better thing to do, do it in a functional way and don't write imperative Java code in Scala, like:
// a singleton object (Scala provided)
object GpConnection {
  private var gpCon: Option[Connection] = None

  // returns a Connection (no Option - as we need it!)
  def getOrCreateCon(): Connection = gpCon match {
    case conOpt if conOpt.isEmpty || conOpt.get.isClosed =>
      // connection not present or already closed, so create a new one
      val url = "jdbc:postgresql://.."
      val driver = "org.postgresql.Driver"
      val username = "user"
      val password = "1239876"
      // may throw an exception - you can even handle this
      Class.forName(driver)
      // may throw an exception - you can even handle this
      gpCon = Option(DriverManager.getConnection(url, username, password))
      gpCon.getOrElse(throw new RuntimeException("Cannot create connection"))
    case Some(con) => con
  }
}
use it like:
val con = GpConnection.getOrCreateCon
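Since Class.forName and getConnection may throw, one way to handle that (my addition, not part of the original answer) is to wrap the call in scala.util.Try:

import scala.util.{Failure, Success, Try}

Try(GpConnection.getOrCreateCon()) match {
  case Success(con) => println(s"got connection, closed=${con.isClosed}") // use con here
  case Failure(e)   => e.printStackTrace()
}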

Play, Scala, JDBC and concurrency

I have a Scala class that accesses a database through JDBC:
class DataAccess {
  def select() = {
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://localhost:3306/db"
    val username = "root"
    val password = "xxxx"
    var connection: Connection = null
    try {
      // make the connection
      Class.forName(driver)
      connection = DriverManager.getConnection(url, username, password)
      // create the statement, and run the select query
      val statement = connection.createStatement()
      val resultSet = statement.executeQuery("SELECT name, descrip FROM table1")
      while (resultSet.next()) {
        val name = resultSet.getString(1)
        val descrip = resultSet.getString(2)
        println("name, descrip = " + name + ", " + descrip)
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (connection != null) connection.close()
    }
  }
}
I access this class in my Play application, like so:
def testSql = Action {
  val da = new DataAccess
  da.select()
  Ok("success")
}
The method testSql may be invoked by several users. Question is: could there be a race condition in the while ( resultSet.next() ) loop (or in any other part of the class)?
Note: I need to use JDBC as the SQL statement will be dynamic.
No, there cannot.
Each thread works with a distinct local instance of ResultSet, so there cannot be concurrent access to the same object.
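To see why, note that every invocation of select() builds its own Connection, Statement, and ResultSet as locals. A quick sketch exercising it from several threads (the thread setup is mine, purely to illustrate):

// each thread constructs its own DataAccess and therefore its own JDBC
// objects; nothing is shared, so the result-set loops cannot interfere
val threads = (1 to 4).map(_ => new Thread(() => new DataAccess().select()))
threads.foreach(_.start())
threads.foreach(_.join())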

Why multiple MongoDB connections with Casbah?

I have to manage multiple database connections to MongoDB, using the Casbah Scala client. I have an approximation that works but opens hundreds of connections.
I want to keep a Map[String, MongoDB] that saves a connection for each database (which is the key). I'm using this in Spark Streaming with a two-node cluster, so I think it is a serialization issue, but I don't know how to fix it.
Take a look at my class.
abstract class AbstractMongoDAO(@transient val config: Config) extends Closeable with Serializable {
  @transient private val mongoConfig = config.getConfig(CONFIG_KEY)
  private val host = mongoConfig.getString(CONFIG_KEY_HOST)
  @transient private var _mongoClient: MongoClient = MongoClient(host)
  private var _dbs: mutable.HashMap[String, MongoDB] = mutable.HashMap()

  protected def dbs(): mutable.HashMap[String, MongoDB] = {
    if (_dbs == null)
      _dbs = mutable.HashMap()
    _dbs
  }

  def mongoClient: MongoClient = {
    if (_mongoClient == null) {
      _mongoClient = MongoClient(host)
    }
    _mongoClient
  }

  def db(dbName: String): MongoDB = {
    if (dbs.get(dbName) == None) {
      _dbs += (dbName -> mongoClient.getDB(dbName))
    }
    _dbs.get(dbName).get
  }

  override def close() = {
    Option(_mongoClient).foreach(_.close())
  }
}

private object AbstractMongoDAO {
  val CONFIG_KEY = "mongo"
  val CONFIG_KEY_HOST = "host"
}
And then I have another class that extends AbstractMongoDAO:
class MongoDAO(override val config: Config)
  extends AbstractMongoDAO(config) with Serializable
And I get a db connection with this simple code, where appName is a variable database name:
val _db = db(appName)
What I'm doing wrong?
Casbah is built on top of the official Java driver. A MongoClient represents an internal pool of db connections to a MongoDB cluster. If you use the same cluster and only change the database name, not the host, you don't need to create multiple MongoClients; one is enough for the whole application.
To configure the MongoClient, check this documentation and the corresponding options. If you have multiple DB hosts, or still want to use multiple MongoClients, then you can build your options and create a MongoClient like this:
val options = MongoClientOptions.builder()
  .connectionsPerHost(1)
  // add other options if needed
  .build()
val _mongoClient = MongoClient(host, options)
In your case, since only the db name needs to change and not the db host, I would change the method that gets the db to this:
def db(dbName: String): MongoDB =
  mongoClient.getDB(dbName) // the db will be created in Mongo on the fly if it does not exist
And you don't need the map anymore.