Spark connection pooling - is this the right approach? (Scala)

I have a Spark job in Structured Streaming that consumes data from Kafka and saves it to InfluxDB. I have implemented the connection pooling mechanism as follows:
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import org.influxdb.{InfluxDB, InfluxDBFactory}

object InfluxConnectionPool {
  val queue = new LinkedBlockingQueue[InfluxDB]()

  def initialize(database: String): Unit = {
    while (!isConnectionPoolFull) {
      queue.put(createNewConnection(database))
    }
  }

  private def isConnectionPoolFull: Boolean = {
    val MAX_POOL_SIZE = 1000
    queue.size >= MAX_POOL_SIZE
  }

  def getConnectionFromPool: InfluxDB = {
    if (queue.size > 0) {
      queue.take()
    } else {
      System.err.println("InfluxDB connection limit reached.")
      null
    }
  }

  private def createNewConnection(database: String) = {
    val influxDBUrl = "..."
    val influxDB = InfluxDBFactory.connect(...)
    influxDB.enableBatch(10, 100, TimeUnit.MILLISECONDS)
    influxDB.setDatabase(database)
    influxDB.setRetentionPolicy(database + "_rp")
    influxDB
  }

  def returnConnectionToPool(connection: InfluxDB): Unit = {
    queue.put(connection)
  }
}
In my Spark job, I do the following:
def run(): Unit = {
  val spark = SparkSession
    .builder
    .appName("ETL JOB")
    .master("local[4]")
    .getOrCreate()

  ...

  // This is where I create connection pool
  InfluxConnectionPool.initialize("dbname")

  val sdvWriter = new ForeachWriter[record] {
    var influxDB: InfluxDB = _

    def open(partitionId: Long, version: Long): Boolean = {
      influxDB = InfluxConnectionPool.getConnectionFromPool
      true
    }

    def process(record: record) = {
      // this is where I use the connection object and save the data
      MyService.saveData(influxDB, record.topic, record.value)
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }

    def close(errorOrNull: Throwable): Unit = {
    }
  }

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Read data from kafka
  val kafkaStreamingDF = spark
    .readStream
    ....

  val sdvQuery = kafkaStreamingDF
    .writeStream
    .foreach(sdvWriter)
    .start()
}
But when I run the job, I get the following exception:
18/05/07 00:00:43 ERROR StreamExecution: Query [id = 6af3c096-7158-40d9-9523-13a6bffccbb8, runId = 3b620d11-9b93-462b-9929-ccd2b1ae9027] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8, 192.168.222.5, executor 1): java.lang.NullPointerException
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
at com.abc.telemetry.app.influxdb.InfluxConnectionPool$.returnConnectionToPool(InfluxConnectionPool.scala:47)
at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:55)
at com.abc.telemetry.app.ETLappSave$$anon$1.process(ETLappSave.scala:46)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:53)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
The NPE occurs when the connection is returned to the pool in queue.put(connection). What am I missing here? Any help is appreciated.
P.S.: In the regular DStreams approach, I did this with the foreachPartition method (roughly the pattern sketched below). I am not sure how to do connection reuse/pooling with Structured Streaming.
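A simplified sketch of that foreachPartition pattern, for reference (illustrative only; stream stands for a DStream of record, and this is not the actual job):

// Illustrative sketch of the DStreams-era pattern: borrow one connection per
// partition, write all records of the partition with it, then return it.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val influxDB = InfluxConnectionPool.getConnectionFromPool
    try {
      records.foreach(r => MyService.saveData(influxDB, r.topic, r.value))
    } finally {
      InfluxConnectionPool.returnConnectionToPool(influxDB)
    }
  }
}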

I am using ForeachWriter for Redis similarly, with the pool referenced only inside process. Your writer would look something like this:
def open(partitionId: Long, version: Long): Boolean = {
  true
}

def process(record: record) = {
  influxDB = InfluxConnectionPool.getConnectionFromPool
  // this is where I use the connection object and save the data
  MyService.saveData(influxDB, record.topic, record.value)
  InfluxConnectionPool.returnConnectionToPool(influxDB)
}

datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
  }
  def process(record: String) = {
    // write string to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
From the docs of ForeachWriter:
"Each task will get a fresh serialized-deserialized copy of the provided object."
So whatever you initialize outside the ForeachWriter runs only on the driver.
You need to initialize the connection pool and open the connection in the open method.
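A minimal sketch of what that can look like, assuming the InfluxConnectionPool object and record type from the question (illustrative, not a drop-in fix):

// Illustrative sketch: the ForeachWriter is deserialized per task, so the pool
// must be initialized on the executor, e.g. lazily the first time open() runs.
val writer = new ForeachWriter[record] {
  var influxDB: InfluxDB = _

  def open(partitionId: Long, version: Long): Boolean = {
    InfluxConnectionPool.initialize("dbname") // runs executor-side; make this idempotent
    influxDB = InfluxConnectionPool.getConnectionFromPool
    influxDB != null
  }

  def process(record: record): Unit = {
    MyService.saveData(influxDB, record.topic, record.value)
  }

  def close(errorOrNull: Throwable): Unit = {
    if (influxDB != null) InfluxConnectionPool.returnConnectionToPool(influxDB)
  }
}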

Related

Creating a stream from an API in Apache Flink

First, let me describe what I want to do. I have an API that takes a function as an argument (it looks like this: dataFromApi => { /* do sth */ }), and I would like to process this data with Flink. I wrote this code to simulate the API:
val myIterator = new TestIterator
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val th1 = new Thread {
  override def run(): Unit = {
    for (i <- 0 to 10) {
      Thread sleep 1000
      myIterator.addToQueue("test" + i)
    }
  }
}
th1.start()

val texts: DataStream[String] = env
  .fromCollection(new TestIterator)
texts.print()
This is my iterator:
class TestIterator extends Iterator[String] with Serializable {
  private val q: BlockingQueue[String] = new LinkedBlockingQueue[String]

  def addToQueue(s: String): Unit = {
    println("Put")
    q.put(s)
  }

  override def hasNext: Boolean = true

  override def next(): String = {
    println("Wait for queue")
    q.take()
  }
}
My idea was to execute myIterator.addToQueue(dataFromApi) when I receive data, but this code doesn't work. Despite adding to the queue, execution blocks on q.take(). I tried to write my own SourceFunction based on the queue idea, and I also tried this: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/ but I can't get it to work the way I want.
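For context, a queue-backed SourceFunction along the lines the question mentions could look roughly like this (a sketch only; it assumes a single-JVM local execution environment so the queue instance is actually shared, and all names here are illustrative):

import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue, TimeUnit}
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Sketch: a source that drains a shared queue. This only works when the source
// runs in the same JVM as the code feeding the queue (e.g. a local environment).
object SharedQueue {
  val queue: BlockingQueue[String] = new LinkedBlockingQueue[String]
}

class QueueSource extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (running) {
      val next = SharedQueue.queue.poll(100, TimeUnit.MILLISECONDS)
      if (next != null) ctx.collect(next)
    }
  }

  override def cancel(): Unit = running = false
}

The API callback would then call SharedQueue.queue.put(dataFromApi), and the stream would be created with env.addSource(new QueueSource).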

Determining if a MongoDB connection is unavailable and creating a new connection if it is

I'm attempting to improve the below code that creates a MongoDB connection and inserts a document using the insertDocument method:
import com.typesafe.scalalogging.LazyLogging
import org.mongodb.scala.result.InsertOneResult
import org.mongodb.scala.{Document, MongoClient, MongoCollection, MongoDatabase, Observer, SingleObservable}
import play.api.libs.json.JsResult.Exception

object MongoFactory extends LazyLogging {
  val uri: String = "mongodb+srv://*********"
  val client: MongoClient = MongoClient(uri)
  val db: MongoDatabase = client.getDatabase("db")
  val collection: MongoCollection[Document] = db.getCollection("col")

  def insertDocument(document: Document) = {
    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
The primary issue I see with the above code is that if the connection becomes stale, due to the MongoDB server going offline or some other condition, the connection is not re-established. An improvement to cater for this scenario is:
object MongoFactory extends LazyLogging {
  val uri: String = "mongodb+srv://*********"
  var client: MongoClient = MongoClient(uri)
  var db: MongoDatabase = client.getDatabase("db")
  var collection: MongoCollection[Document] = db.getCollection("col")

  def isDbDown(): Boolean = {
    try {
      client.getDatabase("db")
      false
    }
    catch {
      case e: Exception =>
        true
    }
  }

  def insertDocument(document: Document) = {
    if (isDbDown()) {
      client = MongoClient(uri)
      db = client.getDatabase("db")
      collection = db.getCollection("col")
    }

    val singleObservable: SingleObservable[InsertOneResult] = collection.insertOne(document)
    singleObservable.subscribe(new Observer[InsertOneResult] {
      override def onNext(result: InsertOneResult): Unit = println(s"onNext: $result")
      override def onError(e: Throwable): Unit = println(s"onError: $e")
      override def onComplete(): Unit = println("onComplete")
    })
  }
}
I expect this to handle the scenario where the DB connection becomes unavailable, but is there a more idiomatic Scala way of determining this?
Your code does not create connections. It creates MongoClient instances.
As such you cannot "create a new connection". MongoDB drivers do not provide an API for applications to manage connections.
Connections are managed internally by the driver and are created and destroyed automatically as needed in response to application requests/commands. You can configure connection pool size and when stale connections are removed from the pool.
Furthermore, execution of a single application command may involve multiple connections (up to 3 easily, possibly over 5 if encryption is involved), and the connection(s) used depend on the command/query. Checking the health of any one connection, even if it was possible, wouldn't be very useful.
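As an illustration of the "configure the pool" point, the driver's settings builder exposes pool options roughly as follows (a sketch with example values, not a prescription; the URI placeholder is kept from the question):

import java.util.concurrent.TimeUnit
import com.mongodb.{Block, ConnectionString, MongoClientSettings}
import com.mongodb.connection.ConnectionPoolSettings
import org.mongodb.scala.MongoClient

// Example only: let the driver manage connections and tune the pool via
// settings instead of detecting and rebuilding "stale" connections by hand.
val settings: MongoClientSettings = MongoClientSettings.builder()
  .applyConnectionString(new ConnectionString("mongodb+srv://*********"))
  .applyToConnectionPoolSettings(new Block[ConnectionPoolSettings.Builder] {
    override def apply(b: ConnectionPoolSettings.Builder): Unit = {
      b.maxSize(50)                                 // maximum pooled connections
      b.maxConnectionIdleTime(60, TimeUnit.SECONDS) // drop idle connections
      ()
    }
  })
  .build()

val client: MongoClient = MongoClient(settings)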

Kafka tests failing intermittently if not starting/stopping Kafka each time

I'm trying to run some integration tests for a data stream using an embedded Kafka cluster. When executing all the tests in an environment other than my local one, the tests fail because some internal state is not removed properly.
I can get all the tests to pass in the non-local environment when I start/stop the Kafka cluster before/after each test, but I only want to start and stop the cluster once, at the beginning and at the end of the test suite.
I tried to remove the local streams state, but that didn't seem to work:
override protected def afterEach(): Unit = KStreamTestUtils.purgeLocalStreamsState(properties)
Is there a way to get my suite of tests running without having to start/stop the cluster each time?
The relevant classes are right below.
class TweetStreamProcessorSpec extends FeatureSpec
  with MockFactory with GivenWhenThen with Eventually with BeforeAndAfterEach with BeforeAndAfterAll {

  val CLUSTER: EmbeddedKafkaCluster = new EmbeddedKafkaCluster
  val TEST_TOPIC: String = "test_topic"
  val properties = new Properties()

  override def beforeAll(): Unit = {
    CLUSTER.start()
    CLUSTER.createTopic(TEST_TOPIC, 1, 1)
  }

  override def afterAll(): Unit = CLUSTER.stop()

  // if these two lines are uncommented, the tests work
  // override def afterEach(): Unit = CLUSTER.stop()
  // override protected def beforeEach(): Unit = CLUSTER.start()

  def createProducer: KafkaProducer[String, TweetEvent] = {
    val properties = Map(
      KEY_SERIALIZER_CLASS_CONFIG -> classOf[StringSerializer].getName,
      VALUE_SERIALIZER_CLASS_CONFIG -> classOf[ReflectAvroSerializer[TweetEvent]].getName,
      BOOTSTRAP_SERVERS_CONFIG -> CLUSTER.bootstrapServers(),
      SCHEMA_REGISTRY_URL_CONFIG -> CLUSTER.schemaRegistryUrlForcedToLocalhost()
    )
    new KafkaProducer[String, TweetEvent](properties)
  }

  def kafkaConsumerSettings: KafkaConfig = {
    val bootstrapServers = CLUSTER.bootstrapServers()
    val schemaRegistryUrl = CLUSTER.schemaRegistryUrlForcedToLocalhost()
    val zookeeper = CLUSTER.zookeeperConnect()

    KafkaConfig(
      ConfigFactory.parseString(
        s"""
        akka.kafka.bootstrap.servers = "$bootstrapServers"
        akka.kafka.schema.registry.url = "$schemaRegistryUrl"
        akka.kafka.zookeeper.servers = "$zookeeper"
        akka.kafka.topic-name = "$TEST_TOPIC"
        akka.kafka.consumer.kafka-clients.key.deserializer = org.apache.kafka.common.serialization.StringDeserializer
        akka.kafka.consumer.kafka-clients.value.deserializer = ${classOf[ReflectAvroDeserializer[TweetEvent]].getName}
        akka.kafka.consumer.kafka-clients.client.id = client1
        akka.kafka.consumer.wakeup-timeout=20s
        akka.kafka.consumer.max-wakeups=10
        """).withFallback(ConfigFactory.load()).getConfig("akka.kafka")
    )
  }

  feature("Logging tweet data from kafka topic") {
    scenario("log id and payload when consuming an update tweet event") {
      publishEventsToKafka(List(upTweetEvent))
      val logger = Mockito.mock(classOf[Logger])
      val pipeline = new TweetStreamProcessor(kafkaConsumerSettings, logger)
      pipeline.start
      eventually(timeout(Span(5, Seconds))) {
        Mockito.verify(logger, Mockito.times(1)).info(s"updating tweet uuid=${upTweetEvent.getUuid}, payload=${upTweetEvent.getPayload}")
      }
      pipeline.stop
    }

    scenario("log id when consuming a delete tweet event") {
      publishEventsToKafka(List(delTweetEvent))
      val logger = Mockito.mock(classOf[Logger])
      val pipeline = new TweetStreamProcessor(kafkaConsumerSettings, logger)
      pipeline.start
      eventually(timeout(Span(5, Seconds))) {
        Mockito.verify(logger, Mockito.times(1)).info(s"deleting tweet uuid=${delTweetEvent.getUuid}")
      }
      pipeline.stop
    }
  }
}
class TweetStreamProcessor(kafkaConfig: KafkaConfig, logger: Logger)
  extends Lifecycle with TweetStreamProcessor with Logging {

  private var control: Control = _
  private val valueDeserializer: Option[Deserializer[TweetEvent]] = None

  // ...

  def tweetsSource(implicit mat: Materializer): Source[CommittableMessage[String, TweetEvent], Control] =
    Consumer.committableSource(tweetConsumerSettings, Subscriptions.topics(kafkaConfig.topicName))

  override def start: Future[Unit] = {
    control = tweetsSource(materializer)
      .mapAsync(1) { msg =>
        logTweetEvent(msg.record.value())
          .map(_ => msg.committableOffset)
      }
      .batch(max = 20, first => CommittableOffsetBatch.empty.updated(first)) { (batch, elem) =>
        batch.updated(elem)
      }
      .mapAsync(3)(_.commitScaladsl())
      .to(Sink.ignore)
      .run()
    Future.successful()
  }

  override def stop: Future[Unit] = {
    control.shutdown()
      .map(_ => Unit)
  }
}
Any help with this would be much appreciated. Thanks in advance.
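One possible direction, sketched here rather than taken from the original thread (helper names are illustrative): keep the cluster started once in beforeAll, but give every test its own topic and client/group id so state left behind by one scenario cannot leak into the next.

// Illustrative sketch only, reusing the EmbeddedKafkaCluster from the question.
import java.util.UUID

var currentTopic: String = _

override protected def beforeEach(): Unit = {
  currentTopic = s"test_topic_${UUID.randomUUID()}"
  CLUSTER.createTopic(currentTopic, 1, 1)
}

// kafkaConsumerSettings would then reference currentTopic and a unique
// client/group id instead of the fixed TEST_TOPIC and client1, e.g.
//   akka.kafka.topic-name = "$currentTopic"
//   akka.kafka.consumer.kafka-clients.group.id = "group-$currentTopic"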

Akka-streams: how to get flow names in metrics reported by kamon-akka

I've been trying to set up some instrumentation for Akka Streams. I got it working, but even though I named all the Flows that are part of the streams, I still get names like this in the metrics: flow-0-0-unknown-operation
A simple example of what I'm trying to do:
val myflow = Flow[String].named("myflow").map(println)
Source.via(myflow).to(Sink.ignore).run()
I basically want to see the metrics for the Actor that gets created for "myflow", with a proper name.
Is this even possible? Am I missing something?
I had this challenge in my project and solved it using Kamon + Prometheus. However, I had to create a custom Akka Streams Flow whose name I can set via metricName, and from which I export the metric values using a Kamon gauge (val kamonThroughputGauge: Metric.Gauge).
class MonitorProcessingTimerFlow[T](interval: FiniteDuration)(metricName: String = "monitorFlow") extends GraphStage[FlowShape[T, T]] {
  val in = Inlet[T]("MonitorProcessingTimerFlow.in")
  val out = Outlet[T]("MonitorProcessingTimerFlow.out")

  Kamon.init()
  val kamonThroughputGauge: Metric.Gauge = Kamon.gauge("akka-stream-throughput-monitor")

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new TimerGraphStageLogic(shape) {
    // mutable state
    var open = false
    var count = 0
    var start = System.nanoTime

    setHandler(in, new InHandler {
      override def onPush(): Unit = {
        try {
          push(out, grab(in))
          count += 1
          if (!open) {
            open = true
            scheduleOnce(None, interval)
          }
        } catch {
          case e: Throwable => failStage(e)
        }
      }
    })

    setHandler(out, new OutHandler {
      override def onPull(): Unit = {
        pull(in)
      }
    })

    override protected def onTimer(timerKey: Any): Unit = {
      open = false
      val duration = (System.nanoTime - start) / 1e9d
      val throughput = count / duration
      kamonThroughputGauge.withTag("name", metricName).update(throughput)
      count = 0
      start = System.nanoTime
    }
  }

  override def shape: FlowShape[T, T] = FlowShape[T, T](in, out)
}
Then I created a simple stream that uses the MonitorProcessingTimerFlow to export the metrics:
implicit val system = ActorSystem("FirstStreamMonitoring")

val source = Source(Stream.from(1)).throttle(1, 1 second)

/** Simulating workload fluctuation: a Flow that expands each event into a random number of events */
val flow = Flow[Int].extrapolate { element =>
  Stream.continually(Random.nextInt(100)).take(Random.nextInt(100)).iterator
}

val monitorFlow = Flow.fromGraph(new MonitorProcessingTimerFlow[Int](5 seconds)("monitorFlow"))

val sink = Sink.foreach[Int](println)

val graph = source
  .via(flow)
  .via(monitorFlow)
  .to(sink)

graph.run()
with the proper configuration in application.conf:
kamon.instrumentation.akka.filters {
  actors.track {
    includes = [ "FirstStreamMonitoring/user/*" ]
  }
}
I can see the throughput metrics in the Prometheus console under the tag name="monitorFlow".
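For completeness, a dependency setup along these lines is assumed for the Kamon + Prometheus combination above (a sketch; module names follow Kamon's published artifacts, versions are illustrative):

// build.sbt (versions illustrative)
libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core"       % "2.5.8",
  "io.kamon" %% "kamon-akka"       % "2.5.8",
  "io.kamon" %% "kamon-prometheus" % "2.5.8"
)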

Router Hanging in Dealer-Router Setup

Given the following attempt to connect 1 DEALER to 1 ROUTER:
package net.async

import org.zeromq.ZMQ
import org.zeromq.ZMQ.Socket

import scala.annotation.tailrec

object Client {
  val Empty = "".getBytes
  def message(x: Int) = s"HELLO_#$x".getBytes
  val Count = 5
}

class Client(name: String) extends Runnable {
  import Client._
  import AsyncClientServer.Port

  override def run(): Unit = {
    val context = ZMQ.context(1)
    val dealer = context.socket(ZMQ.DEALER)
    dealer.setIdentity(name.getBytes)
    dealer.connect(s"tcp://localhost:$Port")
    runHelper(dealer, Count)
  }

  @tailrec
  private def runHelper(dealer: Socket, count: Int): Unit = {
    dealer.send(dealer.getIdentity, ZMQ.SNDMORE)
    dealer.send(Empty, ZMQ.SNDMORE)
    dealer.send(message(count), 0)
    println(s"Dealer: ${dealer.getIdentity} received message: " + dealer.recv(0))
    runHelper(dealer, count - 1)
  }
}
object AsyncClientServer {
  val Port = 5555
  val context = ZMQ.context(1)
  val router = context.socket(ZMQ.ROUTER)

  def main(args: Array[String]): Unit = {
    router.bind(s"tcp://*:$Port")
    mainHelper()
    new Thread(new Client("Joe")).start()
  }

  private def mainHelper(): Unit = {
    println("Waiting to receive messages from Dealer.")
    val identity = router.recv(0)
    val empty = router.recv(0)
    val message = router.recv(0)
    println(s"Router received message, ${new String(message)} from sender: ${new String(identity)}.")
    mainHelper()
  }
}
I see the following output, hanging on the second message.
[info] Running net.async.AsyncClientServer
[info] Waiting to receive messages from Dealer.
Why is that?
I'm not sure if it's the cause of your problem, but you don't need to send the identity frame from your DEALER; ZeroMQ does this for you. By adding it, you're actually sending a 4-part message:
IDENTITY
IDENTITY
EMPTY
CONTENT
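A minimal sketch of the send loop without the explicit identity frame (illustrative only; it keeps the empty delimiter so it still matches the three recv calls on the ROUTER side):

// Sketch: the DEALER sends only the empty delimiter and the content; the ROUTER
// still receives IDENTITY (added by ZeroMQ), EMPTY, CONTENT.
@tailrec
private def runHelper(dealer: Socket, count: Int): Unit = {
  dealer.send(Empty, ZMQ.SNDMORE)
  dealer.send(message(count), 0)
  println(s"Dealer received: ${new String(dealer.recv(0))}")
  runHelper(dealer, count - 1)
}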