How to dispose of database connections when Flink restarts - Scala

I use dbcp2.BasicDataSource as the database connection pool. The database query runs inside a map function to fetch additional information about sensors. I found that when the Flink job restarts due to exceptions, the old DB connections are still active on the server side.
Flink version: 1.7
The BasicDataSource is constructed like this:
import org.apache.commons.dbcp2.BasicDataSource

object DbHelper extends Lazing with Logging {
  private lazy val connectionPool: BasicDataSource = createDataSource()

  private def createDataSource(): BasicDataSource = {
    // props: application properties, loaded elsewhere (not shown)
    val conn_str = props.getProperty("db.url")
    val conn_user = props.getProperty("db.user")
    val conn_pwd = props.getProperty("db.pwd")
    val initialSize = props.getProperty("db.initial.size", "3").toInt

    val bds = new BasicDataSource
    bds.setDriverClassName("org.postgresql.Driver")
    bds.setUrl(conn_str)
    bds.setUsername(conn_user)
    bds.setPassword(conn_pwd)
    bds.setInitialSize(initialSize)
    bds
  }
}

Change your map function to a RichMapFunction. Override the close() method of the RichMapFunction and put the code that closes your database connection there. You should likely open the connection in the open() method as well. Because open() and close() are tied to the task lifecycle, every restart builds a fresh pool and the old one is closed, so connections no longer linger on the database server.
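For illustration, a minimal sketch of that pattern (the class name, query, and connection settings below are placeholders, not taken from the original code):

import java.sql.ResultSet

import org.apache.commons.dbcp2.BasicDataSource
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Illustrative enrichment function: maps a sensor id to "id,info" via a DB lookup.
class EnrichSensorData extends RichMapFunction[String, String] {

  // Transient: the pool is never serialized with the function; it is created per task in open().
  @transient private var pool: BasicDataSource = _

  override def open(parameters: Configuration): Unit = {
    pool = new BasicDataSource
    pool.setDriverClassName("org.postgresql.Driver")
    pool.setUrl("jdbc:postgresql://db-host:5432/sensors") // placeholder connection settings
    pool.setUsername("user")
    pool.setPassword("secret")
    pool.setInitialSize(3)
  }

  override def map(sensorId: String): String = {
    val conn = pool.getConnection
    try {
      val stmt = conn.prepareStatement("SELECT info FROM sensor_info WHERE id = ?")
      stmt.setString(1, sensorId)
      val rs: ResultSet = stmt.executeQuery()
      val info = if (rs.next()) rs.getString("info") else ""
      s"$sensorId,$info"
    } finally conn.close() // returns the connection to the pool
  }

  // Called when the task shuts down (including before restarts), so the pool is disposed properly.
  override def close(): Unit = {
    if (pool != null) pool.close()
  }
}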


Is there a way to avoid cold start with Cloud SQL and Cloud Functions (using JVM/Scala)? [duplicate]

I have implemented a Cloud Function that accesses a Postgres DB, following the documentation, like this...
import java.util.Properties
import javax.sql.DataSource
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource
import io.github.cdimascio.dotenv.Dotenv
import java.sql.Connection
class CoreDataSource {
  def getConnection = {
    println("Getting the connection")
    CoreDataSource.getConnection
  }
}

object CoreDataSource {
  var pool: Option[DataSource] = None

  def getConnection: Option[Connection] = {
    if (pool.isEmpty) {
      println("Getting the datasource")
      pool = getDataSource
    }
    if (pool.isEmpty) {
      None
    } else {
      println("Reusing the connection")
      Some(pool.get.getConnection)
    }
  }

  def getDataSource: Option[DataSource] = {
    Class.forName("org.postgresql.Driver")
    var dbName, dbUser, dbPassword, dbUseIAM, ssoMode, instanceConnectionName = ""

    val dotenv = Dotenv
      .configure()
      .ignoreIfMissing()
      .load()

    dbName = dotenv.get("DB_NAME")
    println("DB Name " + dbName)

    dbUser = dotenv.get("DB_USER")
    println("DB User " + dbUser)

    dbPassword = Option(dotenv.get("DB_PASS")).getOrElse("ignored")

    dbUseIAM = Option(dotenv.get("DB_IAM")).getOrElse("true")
    println("dbUseIAM " + dbUseIAM)

    ssoMode = Option(dotenv.get("DB_SSL")).getOrElse("disable") // TODO: Should this be enabled by default?
    println("ssoMode " + ssoMode)

    instanceConnectionName = dotenv.get("DB_INSTANCE")
    println("instanceConnectionName " + instanceConnectionName)

    val jdbcURL: String = String.format("jdbc:postgresql:///%s", dbName)
    val connProps = new Properties
    connProps.setProperty("user", dbUser)

    // Note: a non-empty string value for the password property must be set. While this property
    // will be ignored when connecting with the Cloud SQL Connector using IAM auth, leaving it
    // empty will cause driver-level validations to fail.
    if (dbUseIAM.equals("true")) {
      println("Using IAM, password is ignored")
      connProps.setProperty("password", "ignored")
    } else {
      println("Using manual, password must be provided")
      connProps.setProperty("password", dbPassword)
    }

    connProps.setProperty("sslmode", ssoMode)
    connProps.setProperty("socketFactory", "com.google.cloud.sql.postgres.SocketFactory")
    connProps.setProperty("cloudSqlInstance", instanceConnectionName)
    connProps.setProperty("enableIamAuth", dbUseIAM)

    // Initialize connection pool
    val config = new HikariConfig
    config.setJdbcUrl(jdbcURL)
    config.setDataSourceProperties(connProps)
    config.setMaximumPoolSize(10)
    config.setMinimumIdle(4)
    config.addDataSourceProperty("ipTypes", "PUBLIC,PRIVATE") // TODO: Make configurable

    println("Config created")
    val pool: DataSource = new HikariDataSource(config) // Do we really need Hikari here if it doesn't need pooling?
    println("Returning the datasource")
    Some(pool)
  }
}
class DoSomething() {
  val ds = new CoreDataSource

  def getUserInformation(): String = {
    println("Getting user information")
    val connOpt = ds.getConnection
    if (connOpt.isEmpty) throw new Error("No Connection Found")
    ...
  }
}

class SomeClass extends HttpFunction {
  override def service(httpRequest: HttpRequest, httpResponse: HttpResponse): Unit = {
    httpResponse.setContentType("application/json")
    httpResponse.getWriter.write(
      GetCorporateInformation.corp.getUserInformation()
    )
  }
}

object GetCorporateInformation {
  val corp = new DoSomething()
}
And I deploy like this...
gcloud functions deploy identity-corporate --entry-point ... --min-instances 2 --runtime java17 --trigger-http --no-allow-unauthenticated --set-secrets '...'
But when first deployed (and after sitting idle for a while) the function takes 25 seconds to return, causing all kinds of issues with SLAs. After the "cold start" it returns quickly, but at least in dev I can't really make sure someone is always hitting it.
Is there a way to mitigate this or do I need to use a VM to make sure it isn't destroyed? Or is there a way to do this without the overhead of pooling?
Since functions are stateless, the execution environment is sometimes initialized from scratch, which is called a cold start. You can minimize its impact by setting a minimum number of instances (this reduces cold starts but does not eliminate them), or by creating a scheduled function warmer that runs every few minutes and calls your high-priority function so an instance is kept warm.
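For illustration only, a sketch of the warmer idea layered on the question's SomeClass (the /warmup path and the early return are assumptions, not an official recommendation): a scheduled job, e.g. Cloud Scheduler, calls a cheap path every few minutes so an instance stays warm without doing real work.

import com.google.cloud.functions.{HttpFunction, HttpRequest, HttpResponse}

class SomeClass extends HttpFunction {
  override def service(httpRequest: HttpRequest, httpResponse: HttpResponse): Unit = {
    // Warm-up requests return immediately: no DB access, no Spark-style work.
    if (httpRequest.getPath == "/warmup") {
      httpResponse.setStatusCode(200)
      return
    }

    httpResponse.setContentType("application/json")
    httpResponse.getWriter.write(
      GetCorporateInformation.corp.getUserInformation()
    )
  }
}

Combined with --min-instances, this keeps at least one instance warm most of the time, though occasional cold starts can still occur, for example after deployments or traffic spikes.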

How to use a persisted StateStore between two Kafka Streams

I'm having some trouble trying to achieve the following with Kafka Streams:
At the startup of the app, the (compacted) topic alpha gets loaded into a key-value StateStore map
A Kafka stream consumes from another topic, reads (.get) from the map above and finally produces a new record into topic alpha
The result is that the in-memory map should be aligned with the underlying topic, even if the streamer is restarted.
My approach is the following:
val builder = new StreamsBuilderS()

val store = Stores.keyValueStoreBuilder(
  Stores.persistentKeyValueStore("store"), kSerde, vSerde
)
builder.addStateStore(store)

val loaderStreamer = new LoaderStreamer(store).startStream()

[...] // I wait a few seconds until the loading is complete and the stream is running

val map = instance.store("store", QueryableStoreTypes.keyValueStore[K, V]()) // !!!!!!!! ERROR HERE !!!!!!!!

builder
  .stream("another-topic")(Consumed.`with`(kSerde, vSerde))
  .doMyAggregationsAndgetFromTheMapAbove
  .transform(() => new StoreTransformer[K, V]("store"), "store")
  .to("alpha")(Produced.`with`(kSerde, vSerde))
LoaderStreamer(store):
[...]
val builder = new StreamsBuilderS()
builder.addStateStore(store)
builder
  .table("alpha")(Consumed.`with`(kSerde, vSerde))
builder.build
[...]
StoreTransformer:
[...]
override def init(context: ProcessorContext): Unit = {
  this.context = context
  this.store =
    context.getStateStore(store).asInstanceOf[KeyValueStore[K, V]]
}

override def transform(key: K, value: V): (K, V) = {
  store.put(key, value)
  (key, value)
}
[...]
...but what I get is:
Caused by: org.apache.kafka.streams.errors.InvalidStateStoreException:
The state store, store, may have migrated to another instance.
while trying to get the store handler.
Any idea on how to achieve this?
Thank you!
You can't share a state store between two Kafka Streams applications.
According to the documentation (https://docs.confluent.io/current/streams/faq.html#interactive-queries), there are two possible reasons for the above exception:
The local KafkaStreams instance is not yet ready and thus its local state stores cannot be queried yet.
The local KafkaStreams instance is ready, but the particular state store was just migrated to another instance behind the scenes.
The easiest way to deal with it is to wait until the state store is queryable:
public static <T> T waitUntilStoreIsQueryable(final String storeName,
                                              final QueryableStoreType<T> queryableStoreType,
                                              final KafkaStreams streams) throws InterruptedException {
    while (true) {
        try {
            return streams.store(storeName, queryableStoreType);
        } catch (InvalidStateStoreException ignored) {
            // store not yet ready for querying
            Thread.sleep(100);
        }
    }
}
The whole example can be found on Confluent's GitHub.
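Adapted to the Scala code in the question, the call site could look roughly like this (a sketch; streams is assumed to be the running KafkaStreams instance behind instance, and the retry loop mirrors the Java helper above):

import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.errors.InvalidStateStoreException
import org.apache.kafka.streams.state.{QueryableStoreType, QueryableStoreTypes}

def waitUntilStoreIsQueryable[T](storeName: String,
                                 queryableStoreType: QueryableStoreType[T],
                                 streams: KafkaStreams): T =
  try streams.store(storeName, queryableStoreType)
  catch {
    case _: InvalidStateStoreException =>
      Thread.sleep(100) // store not yet ready for querying, retry
      waitUntilStoreIsQueryable(storeName, queryableStoreType, streams)
  }

// Instead of calling instance.store(...) directly:
val map = waitUntilStoreIsQueryable("store", QueryableStoreTypes.keyValueStore[K, V](), streams)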

Kafka Streams NPE in MeteredKeyValueStore

I'm trying to run a very basic stream using the Processor API in Scala.
class KafkaProcessor extends Processor[String, GenericRecord] {

  private var kvStore: KeyValueStore[String, GenericRecord] = _

  override def init(processorContext: ProcessorContext): Unit = {
    this.kvStore = Stores
      .keyValueStoreBuilder(
        Stores.persistentKeyValueStore("random-mame"),
        Serdes.String,
        new GenericAvroSerde
      )
  }

  override def process(
      key: String,
      value: GenericRecord
  ): Unit = {
    val currentState = Option(kvStore.get(key)) // NPE
    ...
  }
}
Judging from the error logs, an internal NPE is thrown:
Exception in thread "test-4294024b-1390-4c2f-ba8e-e520cca728ff-StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.get(MeteredKeyValueStore.java:134)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:64)
at writeside.kafka.AggregateKafkaProcessor.process(KafkaProcessor.scala:35)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:115)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:146)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:93)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:84)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:351)
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:104)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:413)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:862)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
It is related to the getTime call inside MeteredKeyValueStore. I'm not sure how this happens or how I can prevent it.
If you want to use a store, you need to declare the store outside of the processor (i.e., add the store to the StreamsBuilder) and connect the store (via StreamsBuilder) to the processor.
Within the processor you use the ProcessorContext to get a handle on the store.
See the docs for more details: https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html
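A minimal sketch of that wiring with the Topology API from the linked docs, reusing the store name from the question (topic and processor names are placeholders, and the sketch assumes the pre-2.6 Processor API used above):

import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.processor.{Processor, ProcessorContext, ProcessorSupplier}
import org.apache.kafka.streams.state.{KeyValueStore, Stores}

// Inside the processor, fetch the already-registered store from the context.
class KafkaProcessor extends Processor[String, GenericRecord] {
  private var kvStore: KeyValueStore[String, GenericRecord] = _

  override def init(context: ProcessorContext): Unit =
    kvStore = context
      .getStateStore("random-mame")
      .asInstanceOf[KeyValueStore[String, GenericRecord]]

  override def process(key: String, value: GenericRecord): Unit = {
    val currentState = Option(kvStore.get(key)) // no NPE: the store is managed by Streams
    // ...
  }

  override def close(): Unit = ()
}

object TopologyWiring {
  // Declare the store outside the processor and connect it to the processor.
  def build(): Topology = {
    val storeBuilder = Stores.keyValueStoreBuilder(
      Stores.persistentKeyValueStore("random-mame"),
      Serdes.String,
      new GenericAvroSerde // must be configured with the schema registry URL before use
    )

    val topology = new Topology()
    topology.addSource("source", "input-topic")
    topology.addProcessor("my-processor", new ProcessorSupplier[String, GenericRecord] {
      override def get(): Processor[String, GenericRecord] = new KafkaProcessor
    }, "source")
    // The store is initialized by Streams and attached to "my-processor" before process() runs.
    topology.addStateStore(storeBuilder, "my-processor")
    topology
  }
}

The key point is that the store is owned by the topology rather than created inside init(); the processor only looks it up by name.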

How to set Spark configuration properties using Apache Livy?

I don't know how to pass SparkSession parameters programmatically when submitting a Spark job to Apache Livy.
This is the Test Spark job:
class Test extends Job[Int] {
  override def call(jc: JobContext): Int = {
    val spark = jc.sparkSession()
    // ...
  }
}
This is how this Spark job is submitted to Livy:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()

try {
  client.uploadJar(new File(testJarPath)).get()
  client.submit(new Test())
} finally {
  client.stop(true)
}
How can I pass the following configuration parameters to SparkSession?
.config("es.nodes","1localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")
You can do that easily through the LivyClientBuilder like this:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("key", "value")
  .build()
Configuration parameters can be set on LivyClientBuilder using
public LivyClientBuilder setConf(String key, String value)
so that your code starts with:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("es.nodes", "1localhost")
  .setConf("es.port", "9200")
  .setConf("es.nodes.wan.only", "true")
  .setConf("es.index.auto.create", "true")
  .build()
LivyClientBuilder.setConf will not work, I think, because Livy modifies all configs that do not start with spark., and Spark cannot read the modified config. See here:
private static File writeConfToFile(RSCConf conf) throws IOException {
    Properties confView = new Properties();
    for (Map.Entry<String, String> e : conf) {
        String key = e.getKey();
        if (!key.startsWith(RSCConf.SPARK_CONF_PREFIX)) {
            key = RSCConf.LIVY_SPARK_PREFIX + key;
        }
        confView.setProperty(key, e.getValue());
    }
    ...
}
So the answer is quite simple: add the spark. prefix to all es configs, like this:
.config("spark.es.nodes","1localhost")
.config("spark.es.port",9200)
.config("spark.es.nodes.wan.only","true")
.config("spark.es.index.auto.create","true")
I don't know whether it is elasticsearch-spark or Spark itself that does the compatibility work; it just works.
PS: I've tried this with the REST API and it works, but not with the programmatic API.
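Putting the last two answers together for the programmatic API, the spark.-prefixed keys would presumably be passed through LivyClientBuilder.setConf, along these lines (an untested sketch; key names and values are taken from the question, and livyUrl is the same value used in the submission code):

import java.net.URI
import org.apache.livy.LivyClientBuilder

val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .setConf("spark.es.nodes", "1localhost")
  .setConf("spark.es.port", "9200")
  .setConf("spark.es.nodes.wan.only", "true")
  .setConf("spark.es.index.auto.create", "true")
  .build()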

Spray, Slick, Spark - OutOfMemoryError: PermGen space

I have successfully implemented a simple web service using Spray and Slick that passes an incoming request through a Spark ML prediction pipeline. Everything was working fine until I tried to add a data layer. I chose Slick because it seems to be popular.
However, I can't quite get it to work right. I have been basing most of my code on the Hello Slick Activator template. I use a DAO object like so:
object dataDAO {
  val datum = TableQuery[Datum]

  def dbInit = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum.schema.create
      )), Duration.Inf)
    } finally db.close
  }

  def insertData(data: Data) = {
    val db = Database.forConfig("h2mem1")
    try {
      Await.result(db.run(DBIO.seq(
        datum += data,
        datum.result.map(println)
      )), Duration.Inf)
    } finally db.close
  }
}

case class Data(data1: String, data2: String)

class Datum(tag: Tag) extends Table[Data](tag, "DATUM") {
  def data1 = column[String]("DATA_ONE", O.PrimaryKey)
  def data2 = column[String]("DATA_TWO")
  def * = (data1, data2) <> (Data.tupled, Data.unapply)
}
I initialize my database in my Boot object
object Boot extends App {
  implicit val system = ActorSystem("raatl-demo")
  Classifier.initializeData
  PredictionDAO.dbInit
  // More service initialization code ...
}
I try to add a record to my database before completing the service request
val predictionRoute = {
  path("data") {
    get {
      parameter('q) { query =>
        // do Spark stuff to get prediction
        DataDAO.insertData(data)
        respondWithMediaType(`application/json`) {
          complete {
            DataJson(data1, data2)
          }
        }
      }
    }
  }
}
When I send a request to my service, my application crashes with
java.lang.OutOfMemoryError: PermGen space
I suspect I'm implementing the Slick API incorrectly. It's hard to tell from the documentation, because it stuffs all the operations into a main method.
Finally, my conf is the same as in the Activator UI:
h2mem1 = {
  url = "jdbc:h2:mem:raatl"
  driver = org.h2.Driver
  connectionPool = disabled
  keepAliveConnection = true
}
Has anyone encountered this before? I'm using Slick 3.1
java.lang.OutOfMemoryError: PermGen space is normally not a problem with your usage; here is what Oracle says about it:
The detail message PermGen space indicates that the permanent generation is full. The permanent generation is the area of the heap where class and method objects are stored. If an application loads a very large number of classes, then the size of the permanent generation might need to be increased using the -XX:MaxPermSize option.
I do not think this is because of an incorrect implementation of the Slick API. It probably happens because you are using multiple frameworks that load many classes.
Your options are:
Increase the permanent generation size with -XX:MaxPermSize (see the sketch below)
Upgrade to Java 8, where the permanent generation has been replaced by Metaspace, which is sized automatically
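For the first option, a minimal sketch of passing the flag, assuming the service is launched from sbt with a forked JVM (the 256m value is only an example):

// build.sbt
fork := true // run the app in a separate JVM so javaOptions are applied

javaOptions ++= Seq(
  "-XX:MaxPermSize=256m" // enlarge the permanent generation; ignored on Java 8+
)

If the app is started with a plain java command instead, the same flag goes directly on that command line.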