Hi guys,
I'm running a Spark Streaming job that reads from Kafka, does some processing, and writes the results to another Kafka topic (Spark 2.1.1).
I noticed that when a task throws an exception, the foreachRDD block still finishes "normally" and commits the offsets that I handle manually, which, as I understand it, should not happen.
The code looks like this:
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val offsets = offsetStore.getOffsetsFromRDD(rdd)
    doTheThing(rdd)
    offsetStore.persistOffsets(offsets)
  }
}
doTheThing() runs a foreachPartition, and one of its tasks fails and throws an exception, like this:
def doTheThing[T](rdd: RDD[T]) = {
  rdd.foreachPartition { partition =>
    try {
      doAnotherThing()
    } catch {
      case e: NeedToAbortSomeException => throw e
    }
  }
}
What is wrong? According to the Spark documentation (http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html), it is OK to commit a transaction inside the foreachRDD block.
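For reference, a minimal sketch that makes the "commit only on success" intent explicit, assuming the task failure actually surfaces as an exception on the driver (the Try wrapper below is illustrative, not part of the original job):

import scala.util.{Failure, Success, Try}

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val offsets = offsetStore.getOffsetsFromRDD(rdd)
    Try(doTheThing(rdd)) match {
      case Success(_) =>
        offsetStore.persistOffsets(offsets) // commit only when the whole batch succeeded
      case Failure(e) =>
        // skip the commit so the batch can be reprocessed, then let the failure propagate
        throw e
    }
  }
}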
I am fairly new to Spark. I have a case where I don't need the executors and other infrastructure until a condition is met. I have the following code:
def main(args: Array[String]) {
  try {
    val request = args(0).toString
    // Get the spark session
    val spark = getSparkSession()
    log.info("Running etl Job")
    // Pipeline builder
    val pipeline = new PipelineBuilder().build(request)
    pipeline.execute(spark)
    spark.stop()
  } catch {
    case e: Exception =>
      throw new RuntimeException("Failed to successfully run", e)
  }
}
The above code creates a Spark session and executes an ETL pipeline.
However, I have a requirement to start the pipeline only when a condition is met. In the code below, I want to start the SparkSession only if the condition is true.
def main(args: Array[String]) {
  try {
    val request = args(0).toString
    if (condition) {
      val spark = getSparkSession()
      log.info("Running etl Job")
      // Pipeline builder
      val pipeline = new PipelineBuilder().build(request)
      pipeline.execute(spark)
      spark.stop()
    } else {
      // Do nothing
    }
  } catch {
    case e: Exception =>
      throw new RuntimeException("Failed to successfully run", e)
  }
}
Does this ensure that no SparkSession is initiated and no executors are spun up if the condition is false? If not, is there any other way to solve this?
You can make use of lazy evaluation in Scala.
In your getSparkSession() function, define:
lazy val spark: SparkSession = ....
As per Wikipedia, "lazy evaluation is an evaluation strategy which delays the evaluation of an expression until its value is needed".
A few benefits of lazy evaluation:
Lazy evaluation can help to resolve circular dependencies.
It can provide performance enhancement by not doing calculations until needed — and they may not be done at all if the calculation is not used.
It can improve the response time of applications by postponing heavy operations until they are required.
Please refer https://dzone.com/articles/scala-lazy-evaluation to know more.
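A minimal sketch of how that can look here, assuming `condition` stands in for whatever check gates the job and the builder settings are placeholders; the key point is that the lazy val is only dereferenced inside the if branch:

import org.apache.spark.sql.SparkSession

// Nothing happens here yet: the builder only runs on the first access to `spark`.
lazy val spark: SparkSession = SparkSession.builder()
  .appName("etl-job") // placeholder app name
  .getOrCreate()

def main(args: Array[String]): Unit = {
  val request = args(0)
  if (condition) {
    val pipeline = new PipelineBuilder().build(request)
    pipeline.execute(spark) // first reference: the SparkSession (and executors) start here
    spark.stop()
  }
  // If the condition is false, `spark` is never referenced and no session is created.
}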
I have a simple committable source for a Kafka stream wrapped in a RestartSource. It works fine on the happy path, but if I deliberately sever the connection to the Kafka cluster, it throws a connection exception from the underlying Kafka client and reports that the Kafka consumer shut down. My expectation was that it would restart the stream after ~150 seconds, but it doesn't. Is my understanding/usage of RestartSource below incorrect?
val atomicControl = new AtomicReference[Consumer.Control](NoopControl)

val restartablekafkaSourceWithFlow = {
  RestartSource.withBackoff(30.seconds, 120.seconds, 0.2) { () =>
    Consumer.committableSource(consumerSettings.withClientId("clientId"), Subscriptions.topics(Set("someTopic")))
      .mapMaterializedValue(c => atomicControl.set(c))
      .via(someFlow)
      .via(httpFlow)
  }
}

val committerSink: Sink[(Any, ConsumerMessage.CommittableOffset), Future[Done]] =
  Committer.sinkWithOffsetContext(CommitterSettings(actorSystem))

val runnableGraph = restartablekafkaSourceWithFlow.toMat(committerSink)(Keep.both)

val control = runnableGraph.mapMaterializedValue(x => Consumer.DrainingControl.apply(atomicControl.get, x._2)).run()
Maybe you are getting the error outside of the RestartSource.
You can add a recover stage to see the error, and/or create a decider like the one below and use it on the runnableGraph.
private val decider: Supervision.Decider = { e =>
  logger.error("Unhandled exception in stream.", e)
  Supervision.Resume
}

runnableGraph.withAttributes(supervisionStrategy(decider))
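Wired in fully, that might look like the sketch below; note that withAttributes returns a new graph, so the result is the one you have to run (imports shown for clarity, the rest reuses the names from the question):

import akka.stream.{ActorAttributes, Supervision}

val supervisedGraph =
  runnableGraph.withAttributes(ActorAttributes.supervisionStrategy(decider))

val control = supervisedGraph
  .mapMaterializedValue(x => Consumer.DrainingControl.apply(atomicControl.get, x._2))
  .run()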
Please help us implement partitioning/grouping when receiving Event Hub messages in concurrent Azure Databricks jobs, and the right approach to consuming Event Hub messages in a concurrent job.
EventHub Configuration:
Number of partitions is 3 and message retention is 1.
EventHub Producer: Sending messages to the Event Hub using .NET (C#) works fine.
EventHub Consumer: Able to receive messages through the Scala program without any issues.
Problem: We created 3 concurrent jobs in Azure Databricks, uploading the consumer code (written in Scala) as JAR files, and all 3 concurrent jobs receive the same messages. To overcome this we tried consuming the events by partition, but we still receive the same messages in all 3 partitions. We also tried sending messages based on a partition key and creating consumer groups in Event Hubs, yet we still receive the same messages in all the groups. We are not sure how to handle the Event Hub messages in the concurrent jobs.
Producer C# Code:
string eventHubName = ConfigurationManager.AppSettings["eventHubname"];
string connectionString = ConfigurationManager.AppSettings["eventHubconnectionstring"];
eventHubClient = EventHubClient.CreateFromConnectionString(connectionString, eventHubName);

for (var i = 0; i < 100; i++)
{
    var sender = "event hub message 1" + i;
    var data = new EventData(Encoding.UTF8.GetBytes(sender));
    Console.WriteLine($"Sending message: {sender}");
    eventHubClient.SendAsync(data);
}

eventHubClient.CloseAsync();
Console.WriteLine("Press ENTER to exit.");
Console.ReadLine();
Consumer Scala Code:
object ReadEvents {

  val spark = SparkSession.builder()
    .appName("eventhub")
    .getOrCreate()

  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(5))

  def main(args: Array[String]): Unit = {

    val connectionString = ConnectionStringBuilder("ConnectionString").setEventHubName("eventhub1").build

    val positions = Map(new NameAndPartition("eventhub1", 0) -> EventPosition.fromStartOfStream)
    val position2 = Map(new NameAndPartition("eventhub1", 1) -> EventPosition.fromEnqueuedTime(Instant.now()))
    val position3 = Map(new NameAndPartition("eventhub1", 2) -> EventPosition.fromEnqueuedTime(Instant.now()))

    val ehConf = EventHubsConf(connectionString).setStartingPositions(positions)
    val ehConf2 = EventHubsConf(connectionString).setStartingPositions(position2)
    val ehConf3 = EventHubsConf(connectionString).setStartingPositions(position3)

    val stream = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf)

    println("Before the loop")

    stream.foreachRDD(rdd => {
      rdd.collect().foreach(rec => {
        println(String.format("Message is first stream ===>: %s", new String(rec.getBytes(), Charset.defaultCharset())))
      })
    })

    val stream2 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf2)

    stream2.foreachRDD(rdd2 => {
      rdd2.collect().foreach(rec2 => {
        println(String.format("Message second stream is ===>: %s", new String(rec2.getBytes(), Charset.defaultCharset())))
      })
    })

    // Note: this third stream is created with ehConf (not ehConf3), so it uses the same starting positions as the first stream.
    val stream3 = org.apache.spark.eventhubs.EventHubsUtils.createDirectStream(ssc, ehConf)

    stream3.foreachRDD(rdd3 => {
      println("Inside 3rd stream foreach loop")
      rdd3.collect().foreach(rec3 => {
        println(String.format("Message is third stream ===>: %s", new String(rec3.getBytes(), Charset.defaultCharset())))
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
I expect the Event Hub messages to be partitioned properly when they are received by the concurrent jobs running the Scala program.
The code below helps to iterate through all partitions:
eventHubsStream.foreachRDD { rdd =>
  rdd.foreach { message =>
    if (message != null) {
      callYouMethod(new String(message.getBytes()))
    }
  }
}
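If you want to handle each Event Hub partition as a unit, a variation of the same loop (a sketch: with the direct stream each RDD partition typically corresponds to one Event Hub partition, and callYouMethod is the asker's placeholder):

eventHubsStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One iterator per RDD partition, processed on the executors rather than collected to the driver.
    partition.foreach { message =>
      if (message != null) {
        callYouMethod(new String(message.getBytes()))
      }
    }
  }
}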
I have a simple UDF used in Spark on Databricks. I can't use println or log4j or anything like that, because the output ends up on the executors and I need it on the driver. I have a very simple logging setup:
var logMessage = ""

def log(msg: String): Unit = {
  logMessage += msg + "\n"
}

def writeLog(file: String): Unit = {
  println("start write")
  println(logMessage)
  println("end write")
}

def warning(msg: String): Unit = {
  log("*WARNING* " + msg)
}

val CleanText = (s: Int) => {
  log("I am in this UDF")
  s + 2
}

sqlContext.udf.register("CleanText", CleanText)
How can I get this to function properly and log to the driver?
The closest mechanism in Apache Spark to what you're trying to do is accumulators. You can accumulate the log lines on the executors and access the result on the driver:
// create a collection accumulator using the spark context:
val logLines: CollectionAccumulator[String] = sc.collectionAccumulator("log")

// log function adds a line to accumulator
def log(msg: String): Unit = logLines.add(msg)

// driver-side function can print the log using accumulator's *value*
def writeLog() {
  import scala.collection.JavaConverters._
  println("start write")
  logLines.value.asScala.foreach(println)
  println("end write")
}

val CleanText = udf((s: Int) => {
  log(s"I am in this UDF, got: $s")
  s + 2
})

// use UDF in some transformation:
Seq(1, 2).toDF("a").select(CleanText($"a")).show()

writeLog()
// prints:
// start write
// I am in this UDF, got: 1
// I am in this UDF, got: 2
// end write
BUT: this isn't really recommended, especially not for logging purposes. If you log on every record, this accumulator would eventually crash your driver on OutOfMemoryError or just slow you down horribly.
Since you're using Databricks, I would check what options they support for log aggregation, or simply use the Spark UI to view the executor logs.
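If executor logs are acceptable, a minimal sketch of that route (standard log4j, so the messages land in each executor's stderr/log4j output rather than on the driver):

import org.apache.log4j.Logger
import org.apache.spark.sql.functions.udf

val CleanText = udf((s: Int) => {
  // Looked up inside the UDF body, so nothing non-serializable is captured;
  // the message appears in the log of whichever executor runs the task.
  Logger.getLogger("CleanTextUDF").warn(s"I am in this UDF, got: $s")
  s + 2
})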
You can't... unless you want to go crazy and make some sort of Logback appender that ships logs over the network, or something like that.
The code for the UDF is run on all your executors when you evaluate a DataFrame. So you might have 2000 hosts running it, and each of them will log to its own location; that's how Spark works. The driver isn't the one running the code, so it can't be logged to.
You can use YARN log aggregation to pull all the logs from the executors for later analysis, though.
You could probably also write to a Kafka stream, or something creative like that, and read the logs off the stream later.
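For the Kafka route, a rough sketch of what that could look like (the broker address and topic name are placeholders; one producer is created lazily per executor JVM):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.functions.udf

object ExecutorLog {
  // Lazily initialized once per executor JVM.
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }

  def log(msg: String): Unit =
    producer.send(new ProducerRecord[String, String]("udf-logs", msg)) // placeholder topic
}

val CleanText = udf((s: Int) => {
  ExecutorLog.log(s"I am in this UDF, got: $s")
  s + 2
})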
I have a stream that
listens for an HTTP POST receiving a list of events,
mapConcats the list of events into stream elements,
converts the events into Kafka records,
produces the records with reactive-kafka (Akka Streams Kafka producer sink).
Here is the simplified code:
// flow to split group of lines into lines
val splitLines = Flow[List[Evt]].mapConcat(list => list)

// sink to produce kafka records in kafka
val kafkaSink: Sink[Evt, Future[Done]] = Flow[Evt]
  .map(evt => new ProducerRecord[Array[Byte], String](evt.eventType, evt.value))
  .toMat(Producer.plainSink(kafka))(Keep.right)

val routes = {
  path("ingest") {
    post {
      (entity(as[List[ReactiveEvent]]) & extractMaterializer) { (eventIngestList, mat) =>
        val ingest = Source.single(eventIngestList).via(splitLines).runWith(kafkaSink)(mat)
        val result = onComplete(ingest) {
          case Success(value) => complete(s"OK")
          case Failure(ex) => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
        }
        complete("eventList ingested: " + result)
      }
    }
  }
}
Could you clarify for me what runs in parallel and what runs sequentially?
I think mapConcat serializes the events in the stream, so how could I parallelize the stream so that each step after the mapConcat is processed in parallel?
Would a simple mapAsyncUnordered be sufficient, or should I use the GraphDSL with a Balance and a Merge?
In your case it will be sequential, I think. Also, you're reading the whole request before you start pushing data to Kafka. I'd use the extractDataBytes directive, which gives you src: Source[ByteString, Any]. Then I'd process it like this:
src
  .via(Framing.delimiter(ByteString("\n"), 1024 /* max size of a line */, allowTruncation = true).map(_.utf8String))
  .mapConcat { line =>
    line.split(",").toList // mapConcat expects an immutable collection, so convert the Array
  }
  .async
  .runWith(kafkaSink)(mat)
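On the mapAsyncUnordered question: it is usually enough when the per-element work returns a Future and ordering doesn't matter; a Balance/Merge GraphDSL is mainly needed when fanning out whole sub-flows. A rough sketch under that assumption (parseToEvt, the parallelism of 4, and the global execution context are placeholders):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global // placeholder execution context for the sketch

src
  .via(Framing.delimiter(ByteString("\n"), 1024, allowTruncation = true).map(_.utf8String))
  .mapConcat(_.split(",").toList)
  .mapAsyncUnordered(4) { field =>
    // up to 4 of these futures run concurrently; results may complete out of order
    Future(parseToEvt(field))
  }
  .runWith(kafkaSink)(mat)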