When is onQueryTerminated triggered in Apache Spark StreamingQueryListeners?

When is onQueryTerminated triggered in Apache Spark StreamingQueryListeners? - scala

I'm developing a custom StreamingQueryListener and I'd like to trigger its onQueryTerminated method in a test.
This is what I tried implementing:
import org.apache.spark.sql.{ SQLContext, SparkSession }
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{ col, to_date }
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.scalatest.flatspec.AnyFlatSpec
class MyListener extends StreamingQueryListener {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = println(event.exception)
}
class ListenerSpec extends AnyFlatSpec {
it should "trigger onQueryTerminated" in {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.streams.addListener(new MyListener())
implicit val sqlContext: SQLContext = spark.sqlContext
import spark.implicits._
val stream = MemoryStream[Int]
stream.addData(Seq(1, 3, 4))
val query = stream
.toDF()
.withColumn("columnDoesntExist", to_date(col("names")))
.writeStream
.format("console")
.start()
query.awaitTermination()
}
}
However, this doesn't work because it raises an AnalysisException but the onQueryTerminated method isn't triggered by the termination of the streaming query.
In what situations is that method triggered and event.exception is Some(exception)?
Update
The following code successfully triggers the execution of onQueryTerminated:
val exceptionUdf = udf(() => throw new Exception())
val query = stream
.toDF()
.withColumn("exception", exceptionUdf())
.writeStream
.format("console")
.start()
Refer to the accepted answer for an explanation as to why.

According to the book "Stream Processing with Apache Spark" (published by O'Reilly) the onQueryTerminated method gets
"Called when a streaming query is stopped. The event contains id and runId fields that correlate with the start event. It also provides an exception field that contains an exception if the query failed due to an error."
As you are getting an AnalysisException, your query did not even start yet. It only got to the first of the four phases in the Catalyst optimizer, which is the "Analysis" and it has not been transformed into runnable code yet:
More details on the Catalyst Optimizer.
The AnalysisException just means that there are issues in the code related to the Catalog which is exactly what you intended to do: Refer to a column that does not exist (in the Catalog).
If you want to run the execution of the onQueryTermination method you need to implement a working code but have it failed while it is already running (e.g. provide wrong data input type).

Related

Task not serializable in community.cloud.databricks

Databricks community cloud is throwing
an org.apache.spark.SparkException: Task not serializable exception that my local machine is not throwing executing the same code.
The code comes from the Spark in Action book. What the code is doing is reading a json file with github activity data, then reading a file with employees usernames from an invented company, and ultimately ranks the employees by number of pushes.
To avoid extra shuffling, the variable containing the list of employees is broadcasted, however, when its time to return the rank, is when databricks community cloud throws the exception.
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromURL
val spark = SparkSession.builder()
.appName("GitHub push counter")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val inputPath = "/FileStore/tables/2015_03_01_0-a829c.json"
val pushes = spark.read.json(inputPath).filter("type = 'PushEvent'")
val grouped = pushes.groupBy("actor.login").count.orderBy(grouped("count").desc)
val empPath = "https://raw.githubusercontent.com/spark-in-action/first-edition/master/ch03/ghEmployees.txt"
val employees = Set() ++ (for { line <- fromURL(empPath).getLines} yield line.trim)
val bcEmployees = sc.broadcast(employees)
import spark.implicits._
val isEmp = user => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()

If you have custom classes like employee/user then you need to implement serialization interface.Spark when working with custom user objects,needs to serialize them.
Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

Spark Structured Streaming exception handling

I reading data from a MQTT streaming source with Spark Structured Streaming API.
val lines:= spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic", "Employee")
.option("username", "username")
.option("password", "passwork")
.option("clientId", "employee11")
.load("tcp://localhost:8000").as[(String, Timestamp)]
I convert the streaming data to case class Employee
case class Employee(Name: String, Department: String)
val ds = lines.map {
row =>
implicit val format = DefaultFormats
parse(row._1).extract[Employee]
}
....some transformations
df.writeStream
.outputMode("append")
.format("es")
.option("es.resource", "spark/employee")
.option("es.nodes", "localhost")
.option("es.port", 9200)
.start()
.awaitTermination()
Now there were some messages in the queue which had different structure than Employee case class. Lets say some required columns were missing. My streaming job failed with field not found exception.
Now I will like to handle such exception and also will like to send an alert notification for the same. I tried putting a try/catch block.
case class ErrorMessage(row: String)
catch {
case e: Exception =>
val ds = lines.map {
row =>
implicit val format = DefaultFormats
parse(row._1).extract[ErrorMessage]
}
val error = lines.foreach(row => {
sendErrorMail(row._1)
})
}
}
Got the exception that Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
mqtt
Any help on this will be appreciated.

I think you should rather use the return object of the start() method as described in Spark streaming doc. Something like:
val query = df.writeStream. ... .start()
try {
//If the query has terminated with an exception, then the exception will be thrown.
query.awaitTermination()
catch {
case ex: Exception => /*code to send mail*/
}
Implementing your own foreach sink can cause overhead with frequent opening and closing connections.

I created a foreach sink in the catch block and was able to handle the exceptions and send out mail alerts as well.
catch {
case e: Exception =>
val foreachWriter = new ForeachWriter[Row] {
override def open(partitionId: Timestamp, version: Timestamp): Boolean = {
true
}
override def process(value: Row): Unit = {
code for sending mail.........
}
override def close(errorOrNull: Throwable): Unit = {}
}
val df = lines.selectExpr("cast (value as string) as json")
df.writeStream
.foreach(foreachWriter)
.outputMode("append")
.start()
.awaitTermination()
}

If the streaming is writing to delta tables you may use the merge for handling exceptions.
First, create the function tho merge and handle problems.
from delta.tables import DeltaTable
myTable = DeltaTable.forName(spark, "MYTABLE")
# Function to upsert microBatchOutputDF into Delta table using merge
def insertMessages(microBatchOutputDF, batchId):
try:
myTable.alias("trg").merge(
microBatchOutputDF.alias("src"),
"""
src.keyId = trg.keyId and
src.secondKeyId = trg.secondKeyId
""") \
.whenNotMatchedInsertAll() \
.execute()
except Exception as e:
print(f"Exception in writing data to MYTABLE: {e}")
try:
pass # do something with the bad data / log the issue
except:
print(f"Exception in writing bad data / logging the issue: {e}")
Run the stream:
mytable_df.writeStream.format("delta").foreachBatch(insertMessages).outputMode("append").option("checkpointLocation", "/tmp/delta/messages/_checkpoints2/").start()
Important note:
If at least one record in the batch causes an exception (for example NOT NULL constraint) then the whole batch (all records) are not merged. The stream keep working after that issue, it's not breaking.

Spark job completes without executing udf

I've been having an issue with a long, complicated spark job which contains a udf.
The issue I've been having is that the udf doesn't seem to get called properly, although there is no error message.
I know it isn't called properly because the output gets written, only anything the udf was supposed to calculate is NULL, and no print statements appear when debugging locally.
The only lead is that this code previously worked using different input data, meaning the error must have something to do with the input.
The change in inputs mostly means different column names are used, which is addressed in the code.
Print statements are executed given the first, 'working' input.
Both inputs are created using the same series of steps from the same database, and by inspection there doesn't appear to be a problem with either one.
I've never experienced this sort of behaviour before, and any leads on what might cause it would be appreciated.
The code is monolithic and inflexible - I'm working on refactoring, but it's not an easy piece to break apart. This is a short version of what is happening:
package mypackage
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.util._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types._
import scala.collection.{Map => SMap}
object MyObject {
def main(args: Array[String]){
val spark: SparkSession = SparkSession.builder()
.appName("my app")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
val bigInput = spark.read.parquet("inputname.parquet")
val reference_table = spark.read.parquet("reference_table.parquet")
val exchange_rate = spark.read.parquet("reference_table.parquet")
val bigInput2 = bigInput
.filter($"column1" === "condition1")
.join(joinargs)
.drop(dropargs)
val bigInput3 = bigInput
.filter($"column2" === "condition2")
.join(joinargs)
.drop(dropargs)
<continue for many lines...>
def mapper1(
arg1: String,
arg2: Double,
arg3: Integer
): List[Double]{
exchange_rate.map(
List(idx1, idx2, idx3),
r.toSeq.toList
.drop(idx4)
.take(arg2)
)
}
def mapper2(){}
...
def mapper5(){}
def my_udf(
arg0: Integer,
arg1: String,
arg2: Double,
arg3: Integer,
...
arg20: String
): Double = {
println("I'm actually doing something!")
val result1 = mapper1(arg1, arg2, arg3)
val result2 = mapper2(arg4, arg5, arg6, arg7)
...
val result5 = mapper5(arg18, arg19, arg20)
result1.take(arg0)
.zipAll(result1, 0.0, 0.0)
.map(x=>_1*x._2)
....
.zipAll(result5, 0.0, 0.0)
.foldLeft(0.0)(_+_)
}
spark.udf.register("myUDF", my_udf_)
val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
callUDF(
"myUDF",
$"col1",
...
$"col20"
)
)
<postprocessing>
bigResultFinal
.filter(<configs>)
.select(<column names>)
.write
.format("parquet")
}
}
To recap
This code runs to completion on each of two input files.
The udf only appears to execute on the first file.
There are no error messages or anything using the second file, although all non-udf logic appears to complete successfully.
Any help greatly appreciated!

Here the UDF is not being called because spark is
Lazy it does not call the UDF unless you use any action on the dataframe. You can achieve this by forcing dataframe actions.

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ .... toDF()) because I have verified this in the Spark-Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise 'import spark.implicits._, or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}

you need to assign sqlContext to a val for implicits to work . Since your sparkSession is a var, implicits won't work with it
So you need to do
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover you can write functions for your tests so that your test class looks as following
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)

There are many libraries for unit testing of spark, one of the mostly used is
spark-testing-base: By Holden Karau
This library have all with sc as the SparkContext below is a simple example
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, every thing is prepared with sc as a SparkContext
Another approach is to create a TestWrapper and use for the multiple testcases as below
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
And use this TestWrapper for all the tests with Scala-test, playing with BeforeAndAfterAll and BeforeAndAfterEach.
Hope this helps!

Why is AssertionError: assertion failed: Executor has not been attached to this receiver?

I'm trying to build a simple spark streaming custom receiver where messages are stored directly in the spark stream in order to be processed. However I get:
java.lang.AssertionError: assertion failed: Executor has not been attached to this receiver
I'm integrating into a third party java library that generates methods calls based on a socket it is listening to. By implementing an interface in the third party java library I plan to call the store method in the custom spark receiver.
I've created a simple cut down example which has the error, which does not reference the third party java library.
package com.custom.spark
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
object CustomSparkReceiver {
def main(args: Array[String]) {
// create stream with custom receiver
val conf = new SparkConf().setMaster("local[*]").setAppName("CustomSparkReceiver")
val ssc = new StreamingContext(conf, Seconds(1))
val customReceiver = new CustomReceiver
val stream = ssc.receiverStream(customReceiver)
// print values in spark stream
stream.print
// start pushing data into stream
ssc.start
Thread.sleep(1000)
List.range(1, 10)
.foreach { number => customReceiver.store(number) }
Thread.sleep(1000)
ssc.stop()
}
class CustomReceiver extends Receiver[Integer](StorageLevel.MEMORY_AND_DISK_2) {
def onStart() = {}
def onStop() = {}
}
}
I think it is some sort of threading issue but not sure how to fix it. Any pointers on the above would be great.