I am trying to set up a connection pool to Redis in a singleton Scala object so that I can read/write to Redis while mapping partitions of a DataFrame. I want to be able to configure the host, along with other connection pool variables, when I run my main method. However, this current configuration does not give me my configured REDIS_HOST; it gives me localhost.
When writing this I referenced the "One instance per executor" section of https://able.bio/patrickcording/sharing-objects-in-spark--58x4gbf.
What is the best way to achieve configuring the host while maintaining one RedisClient instance per executor?
object Main {
  def main(args: Array[String]): Unit = {
    val parsedConfig = ConfigFactory.parseFile(new File(args(0)))
    val config = ConfigFactory.load(parsedConfig)
    RedisClient.host = config.getString("REDIS_HOST")
    val main = new Main()
    main.runMain()
  }
}
class Main {
  def runMain(): Unit = {
    val df = Seq(...).toDF()
    df.mapPartitions(partitions => {
      partitions.foreach(row => {
        val count = RedisClient.getIdCount(row.getAs[String]("id"))
        // do something
      })
      partitions
    })
    df.write.save
    RedisClient.close()
  }
}
object RedisClient {
  var host: String = "localhost"
  private val pool = new RedisClientPool(host, 6379)

  def getIdCount(id: String): Option[String] = {
    pool.withClient(client => {
      client.get(id)
    })
  }

  def close(): Unit = {
    pool.close()
  }
}
In Spark, main only runs on the driver, not on the executors. RedisClient is not guaranteed to exist on any given executor until you call a method that touches it there, and when it is first accessed it will simply be initialized with its default values.
Accordingly, the only way to ensure that it has the correct host is to set host within the same RDD/DF operation, e.g.:
df.mapPartitions(partitions => {
  RedisClient.host = config.getString("REDIS_HOST")
  partitions.foreach(row => {
    ...
  })
})
Of course, since main doesn't run on the executors, you'll probably also want to broadcast the config to the executors:
// after setting up the SparkContext
val sc: SparkContext = ???
val broadcastConfig = sc.broadcast(config)
Then you'll pass broadcastConfig around and use broadcastConfig.value in place of config, so the above would become:
df.mapPartitions(partitions => {
  RedisClient.host = broadcastConfig.value.getString("REDIS_HOST")
  partitions.foreach(row => {
    ...
  })
})
As long as you take care to always assign the same value to RedisClient.host, and to set it before doing anything else with RedisClient, you should be safe.
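One caveat: in the object as written, pool is a plain val, so it is constructed during the object's one-time initialization, which the host assignment itself triggers, meaning the pool can still capture "localhost". A minimal sketch of a safer variant (the same RedisClient as above, with the pool made lazy so it is only built on first use, after host has been set):
object RedisClient {
  var host: String = "localhost"

  // lazy: the pool is created on first use, which on an executor happens
  // after the mapPartitions closure has assigned the broadcast host
  private lazy val pool = new RedisClientPool(host, 6379)

  def getIdCount(id: String): Option[String] =
    pool.withClient(client => client.get(id))

  def close(): Unit = pool.close()
}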
Related
I am trying to develop a gRPC server with Akka-gRPC and Slick. I also use Airframe for DI.
Source code is here
The issue is that it fails when it receives a request while running as a gRPC server.
If it doesn't start as a gRPC server, but just reads resources from the database, the process succeeds.
What is the difference?
The following code reads objects from the database with Slick.
The ...Component objects are Airframe designs; they are used by the main module.
trait UserRepository {
def getUser: Future[Seq[Tables.UsersRow]]
}
class UserRepositoryImpl(val profile: JdbcProfile, val db: JdbcProfile#Backend#Database) extends UserRepository {
import profile.api._
def getUser: Future[Seq[Tables.UsersRow]] = db.run(Tables.Users.result)
}
trait UserResolveService {
private val repository = bind[UserRepository]
def getAll: Future[Seq[Tables.UsersRow]] =
repository.getUser
}
object userServiceComponent {
val design = newDesign
.bind[UserResolveService]
.toSingleton
}
The following is the gRPC server source code.
trait UserServiceImpl extends UserService {
private val userResolveService = bind[UserResolveService]
private val system: ActorSystem = bind[ActorSystem]
implicit val ec: ExecutionContextExecutor = system.dispatcher
override def getAll(in: GetUserListRequest): Future[GetUserListResponse] = {
userResolveService.getAll.map(us =>
GetUserListResponse(
us.map(u =>
myapp.proto.user.User(
1,
"t_horikoshi#example.com",
"t_horikoshi",
myapp.proto.user.User.UserRole.Admin
)
)
)
)
}
}
trait GRPCServer {
private val userServiceImpl = bind[UserServiceImpl]
implicit val system: ActorSystem = bind[ActorSystem]
def run(): Future[Http.ServerBinding] = {
implicit def ec: ExecutionContext = system.dispatcher
val service: PartialFunction[HttpRequest, Future[HttpResponse]] =
UserServiceHandler.partial(userServiceImpl)
val reflection: PartialFunction[HttpRequest, Future[HttpResponse]] =
ServerReflection.partial(List(UserService))
// Akka HTTP 10.1 requires adapters to accept the new actors APIs
val bound = Http().bindAndHandleAsync(
ServiceHandler.concatOrNotFound(service, reflection),
interface = "127.0.0.1",
port = 8080,
settings = ServerSettings(system)
)
bound.onComplete {
case Success(binding) =>
system.log.info(
s"gRPC Server online at http://${binding.localAddress.getHostName}:${binding.localAddress.getPort}/"
)
case Failure(ex) =>
system.log.error(ex, "occurred error")
}
bound
}
}
object grpcComponent {
val design = newDesign
.bind[UserServiceImpl]
.toSingleton
.bind[GRPCServer]
.toSingleton
}
The following is the main module.
object Main extends App {
val conf = ConfigFactory
.parseString("akka.http.server.preview.enable-http2 = on")
.withFallback(ConfigFactory.defaultApplication())
val system = ActorSystem("GRPCServer", conf)
val dbConfig: DatabaseConfig[JdbcProfile] =
DatabaseConfig.forConfig[JdbcProfile](path = "mydb")
val design = newDesign
.bind[JdbcProfile]
.toInstance(dbConfig.profile)
.bind[JdbcProfile#Backend#Database]
.toInstance(dbConfig.db)
.bind[UserRepository]
.to[UserRepositoryImpl]
.bind[ActorSystem]
.toInstance(system)
.add(userServiceComponent.design)
.add(grpcComponent.design)
design.withSession(s =>
// Await.result(s.build[UserResolveService].getUser, Duration.Inf)) // success
// Await.result(s.build[UserServiceImpl].getAll(GetUserListRequest()), Duration.Inf)) // success
s.build[GRPCServer].run() // causes an IllegalStateException when a request is received
)
}
When UserResolveService and UserServiceImpl are called directly, the process of loading an object from the database is successful.
However, when running the application as a gRPC Server, an error occurs when a request is received.
Though I have been thinking about this all day, I couldn't resolve it.
Could you please help me resolve it?
It is resolved: when executing an asynchronous process, the gRPC server has to be started with newSession instead of withSession.
I fixed it like that.
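A minimal sketch of what that change looks like, assuming the design from the main module above (withSession closes the session as soon as its block returns; an explicitly created session stays open for the running server):
// create the session explicitly so it outlives the enclosing block and
// the asynchronously bound gRPC server can keep using its bindings
val session = design.newSession
session.start
session.build[GRPCServer].run()
// call session.shutdown once the server terminates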
I am trying to create a blaze client with a limited number of threads, like this:
object ReactiveCats extends IOApp {
private val PORT = 8083
private val DELAY_SERVICE_URL = "http://localhost:8080"
// trying to create a client with a limited number of threads
val clientPool: ExecutorService = Executors.newFixedThreadPool(64)
val clientExecutor: ExecutionContextExecutor = ExecutionContext.fromExecutor(clientPool)
private val httpClient = BlazeClientBuilder[IO](clientExecutor).resource
private val httpApp = HttpRoutes.of[IO] {
case GET -> Root / delayMillis =>
httpClient.use { client =>
client
.expect[String](s"$DELAY_SERVICE_URL/$delayMillis")
.flatMap(response => Ok(s"ReactiveCats: $response"))
}
}.orNotFound
// trying to create server on fixed thread pool
val serverPool: ExecutorService = Executors.newFixedThreadPool(64)
val serverExecutor: ExecutionContextExecutor = ExecutionContext.fromExecutor(serverPool)
// start server
override def run(args: List[String]): IO[ExitCode] =
BlazeServerBuilder[IO](serverExecutor)
.bindHttp(port = PORT, host = "localhost")
.withHttpApp(httpApp)
.serve
.compile
.drain
.as(ExitCode.Success)
}
full code and load-tests
But the load-test results look like one thread per request:
How do I restrict the number of threads for my blaze client?
There are two obvious things that are wrong with your code:
you're creating an Executor without shutting it down when you're done.
you're using the use method of the httpClient Resource inside the HTTP route, meaning that every time the route is called, it will create, use and destroy the http client. You should instead create it once during startup.
Executors, like any other resource (e.g. file handles), should always be allocated using Resource.make, like so:
val clientPool: Resource[IO, ExecutorService] = Resource.make(IO(Executors.newFixedThreadPool(64)))(ex => IO(ex.shutdown()))
val clientExecutor: Resource[IO, ExecutionContextExecutor] = clientPool.map(ExecutionContext.fromExecutor)
private val httpClient = clientExecutor.flatMap(ex => BlazeClientBuilder[IO](ex).resource)
The second problem can easily be fixed by allocating the httpClient before building the HTTP app:
private def httpApp(client: Client[IO]): Kleisli[IO, Request[IO], Response[IO]] = HttpRoutes.of[IO] {
case GET -> Root / delayMillis =>
client
.expect[String](s"$DELAY_SERVICE_URL/$delayMillis")
.flatMap(response => Ok(s"ReactiveCats: $response"))
}.orNotFound
…
override def run(args: List[String]): IO[ExitCode] =
httpClient.use { client =>
BlazeServerBuilder[IO](serverExecutor)
.bindHttp(port = PORT, host = "localhost")
.withHttpApp(httpApp(client))
.serve
.compile
.drain
.as(ExitCode.Success)
}
Another potential problem is that you're using IOApp, which comes with its own thread pool. The best way to fix that is probably to mix in the IOApp.WithContext trait and implement this method:
override protected def executionContextResource: Resource[SyncIO, ExecutionContext] = ???
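A sketch of one possible implementation, assuming the same fixed pool size as in the question and reusing the Resource-based allocation shown earlier:
override protected def executionContextResource: Resource[SyncIO, ExecutionContext] =
  Resource
    .make(SyncIO(Executors.newFixedThreadPool(64)))(pool => SyncIO(pool.shutdown()))
    .map(ExecutionContext.fromExecutor) // becomes the app's main EC, shut down on exit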
Copied from my comment.
The answer to the performance issue was properly setting up the Blaze client; for me this was the .withMaxWaitQueueLimit(1024) parameter.
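For instance, applied to the client construction from the answer above (the 1024 value is the one from the comment):
private val httpClient =
  clientExecutor.flatMap(ex =>
    BlazeClientBuilder[IO](ex)
      .withMaxWaitQueueLimit(1024) // let up to 1024 requests queue for a connection
      .resource)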
I have a Spark program as follows:
object A {
var id_set: Set[String] = _
def init(argv: Array[String]) = {
val args = new AArgs(argv)
id_set = args.ids.split(",").toSet
}
def main(argv: Array[String]) {
init(argv)
val conf = new SparkConf().setAppName("some.name")
val rdd1 = getRDD(paras)
val rdd2 = getRDD(paras)
//......
}
def getRDD(paras) = {
//function details
getRDDDtails(paras)
}
def getRDDDtails(paras) = {
//val id_given = id_set
id_set.foreach(println) // works fine here on the driver: not empty
someRDD.filter { x =>
  val someSet = x.getOrElse(...)
  //id_set.foreach(println) ------wrong: on the executor, id_set is just an empty set
  (someSet & id_set).size > 0
}
}
class AArgs(args: Array[String]) extends Serializable {
//parse args
}
I have a global variable id_set. At first it is just an empty set. In main, I call init, which sets id_set to a non-empty set from args. After that, I call the getRDD function, which calls getRDDDtails. In getRDDDtails, I filter an RDD based on the contents of id_set. However, the result seems to be empty. I tried to print id_set in the executor, and it is just an empty line. So the problem seems to be that id_set is not properly initialized (in the init function). However, when I print id_set on the driver (in the first lines of getRDDDtails), it works fine and is not empty.
So I tried adding val id_given = id_set in getRDDDtails and using id_given later, and this seems to fix the problem. But I'm totally confused about why this happens. What is the execution order of a Spark program? Why does my solution work?
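For reference, the workaround described above looks like this (a sketch based on the question's code): the local val is serialized into the filter closure, whereas the object field id_set is re-initialized, empty, on each executor JVM.
def getRDDDtails(paras) = {
  // capture the driver-side value in a local val; Spark serializes this
  // value with the closure instead of touching A.id_set on the executors
  val id_given = id_set
  someRDD.filter { x =>
    val someSet = x.getOrElse(...)
    (someSet & id_given).size > 0
  }
}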
Please take a look at the following Spark Streaming code written in Scala:
object HBase {
var hbaseTable = ""
val hConf = new HBaseConfiguration()
hConf.set("hbase.zookeeper.quorum", "zookeeperhost")
def init(input: (String)) {
hbaseTable = input
}
def display() {
print(hbaseTable)
}
def insertHbase(row: (String)) {
val hTable = new HTable(hConf,hbaseTable)
}
}
object mainHbase {
def main(args : Array[String]) {
if (args.length < 5) {
System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
System.exit(1)
}
val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
HBase.init(hbaseTable)
HBase.display()
val sparkConf = new SparkConf().setAppName("mainHbase")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("checkpoint")
val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
lines.print()
ssc.start()
}
}
I am trying to initialize the parameter hbaseTable in the object HBase by calling the HBase.init method. It was set properly; I confirmed that by calling the HBase.display method on the next line.
However, when the HBase.insertHbase method is called inside foreachRDD, it throws an error saying that hbaseTable is not set.
Update with exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:179)
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:149)
org.apache.hadoop.hbase.TableName.<init>(TableName.java:303)
org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339)
org.apache.hadoop.hbase.TableName.valueOf(TableName.java:426)
org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:156)
Can you please let me know how to make this code work?
"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized at most once per JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing this object with the given value in the VM of the driver.
But when we do rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition of the given RDD. At that point, the object HBase gets initialized again, independently, in each executor's JVM, and no init call has happened on those copies.
There are two options:
We can add an "isInitialized" check to the HBase object and make a (now conditional) call to init on each call to foreach.
Another option would be to use
rdd.foreachPartition { partition =>
  HBase.init(...)
  partition.foreach(elem => HBase.insertHbase(elem))
}
This construction amortizes the initialization cost over the number of elements in each partition. It's also possible to combine it with the initialization check from the first option to prevent unnecessary bootstrap work, as sketched below.
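A minimal sketch combining both options (the isInitialized guard is hypothetical, not part of the question's HBase object; hbaseTable is the String already extracted from args in main, which Spark serializes into the closure):
lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // runs once per partition on an executor; the guard skips
    // re-initialization for subsequent partitions in the same JVM
    if (!HBase.isInitialized) HBase.init(hbaseTable)
    partition.foreach(HBase.insertHbase)
  }
}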
I have to manage multiple database connections to MongoDB, using the Casbah Scala client. I have an approximation that works, but it opens hundreds of connections.
I want to keep a Map[String, MongoDB] that holds one connection per database (the database name is the key). I'm using this in Spark Streaming with a two-node cluster, so I think it is a serialization issue, but I don't know how to fix it.
Take a look at my class.
abstract class AbstractMongoDAO(@transient val config: Config) extends Closeable with Serializable {
@transient private val mongoConfig = config.getConfig(CONFIG_KEY)
private val host = mongoConfig.getString(CONFIG_KEY_HOST)
@transient private var _mongoClient: MongoClient = MongoClient(host)
private var _dbs: mutable.HashMap[String, MongoDB] = mutable.HashMap()
protected def dbs(): mutable.HashMap[String, MongoDB] ={
if(_dbs == null)
_dbs = mutable.HashMap()
_dbs
}
def mongoClient: MongoClient = {
if (_mongoClient == null) {
_mongoClient = MongoClient(host)
}
_mongoClient
}
def db(dbName: String):MongoDB = {
if (dbs.get(dbName) == None) {
_dbs += (dbName -> mongoClient.getDB(dbName))
}
_dbs.get(dbName).get
}
override def close() = {
Option(_mongoClient).foreach(_.close())
}
}
private object AbstractMongoDAO {
val CONFIG_KEY = "mongo"
val CONFIG_KEY_HOST = "host"
}
And then I have another class that extends AbstractMongoDAO:
class MongoDAO (override val config : Config)
extends AbstractMongoDAO(config) with Serializable
And I get a db connection with this simple code, where appName is a variable holding the database name.
val _db = db(appName)
What am I doing wrong?
Casbah is built on top of the official Java driver. A MongoClient represents an internal pool of connections to a MongoDB cluster. If you use the same cluster and only change the database name, not the host, you don't need to create multiple MongoClients; one is enough for the whole application.
To configure the MongoClient, check this documentation and the corresponding options. If you have multiple DB hosts, or still want to use multiple MongoClients, you can build your options and create a MongoClient like this:
val options = MongoClientOptions.builder()
.connectionsPerHost(1)
// add other options if needed
.build()
val _mongoClient = MongoClient(host, options)
In your case, since only the db name needs to change and not the db host, I would change the method that gets the db to this:
def db(dbName: String): MongoDB =
  mongoClient.getDB(dbName) // the db is created in Mongo on the fly if it doesn't exist
And you don't need the map anymore.
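Putting it together, a sketch of the simplified DAO (the same class as in the question; the @transient lazy val pattern is one common way to re-create a non-serializable member after deserialization on each executor):
abstract class AbstractMongoDAO(@transient val config: Config) extends Closeable with Serializable {
  import AbstractMongoDAO._

  // a plain String, computed on the driver and serialized with the instance
  private val host = config.getConfig(CONFIG_KEY).getString(CONFIG_KEY_HOST)

  // @transient + lazy: the client (itself a connection pool) is rebuilt
  // on first use after deserialization, one per JVM
  @transient private lazy val mongoClient: MongoClient = MongoClient(host)

  def db(dbName: String): MongoDB = mongoClient.getDB(dbName)

  override def close(): Unit = mongoClient.close()
}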