Databricks Community Cloud is throwing an org.apache.spark.SparkException: Task not serializable exception that my local machine does not throw when executing the same code.
The code comes from the Spark in Action book. It reads a JSON file with GitHub activity data, then reads a file with the usernames of employees of an invented company, and finally ranks the employees by number of pushes.
To avoid extra shuffling, the variable containing the set of employees is broadcast; however, when it is time to return the ranking, Databricks Community Cloud throws the exception.
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromURL
val spark = SparkSession.builder()
.appName("GitHub push counter")
.master("local[*]")
.getOrCreate()
val sc = spark.sparkContext
val inputPath = "/FileStore/tables/2015_03_01_0-a829c.json"
val pushes = spark.read.json(inputPath).filter("type = 'PushEvent'")
val grouped = pushes.groupBy("actor.login").count
val ordered = grouped.orderBy(grouped("count").desc)
val empPath = "https://raw.githubusercontent.com/spark-in-action/first-edition/master/ch03/ghEmployees.txt"
val employees = Set() ++ (for { line <- fromURL(empPath).getLines} yield line.trim)
val bcEmployees = sc.broadcast(employees)
import spark.implicits._
val isEmp = user => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filtered = ordered.filter(isEmployee($"login"))
filtered.show()
If you have custom classes like Employee/User, then you need to implement the Serializable interface. Spark needs to serialize custom user objects when working with them.
See also: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
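As a hedged illustration of that point (the Employee case class and logins below are made up, not taken from the question): anything referenced from a Spark closure or stored in a broadcast variable must be serializable so it can be shipped to the executors.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
// Case classes are Serializable, so instances can travel inside closures and broadcasts.
case class Employee(login: String)
val bcEmployees = sc.broadcast(Set(Employee("alice"), Employee("bob")))
// The UDF's closure only captures the broadcast handle, which Spark can serialize;
// capturing a non-serializable object here is what produces "Task not serializable".
val isEmployee = spark.udf.register("isEmployee",
  (login: String) => bcEmployees.value.contains(Employee(login)))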
I'm developing a custom StreamingQueryListener and I'd like to trigger its onQueryTerminated method in a test.
This is what I tried implementing:
import org.apache.spark.sql.{ SQLContext, SparkSession }
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{ col, to_date }
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.scalatest.flatspec.AnyFlatSpec
class MyListener extends StreamingQueryListener {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = println(event.exception)
}
class ListenerSpec extends AnyFlatSpec {
it should "trigger onQueryTerminated" in {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.streams.addListener(new MyListener())
implicit val sqlContext: SQLContext = spark.sqlContext
import spark.implicits._
val stream = MemoryStream[Int]
stream.addData(Seq(1, 3, 4))
val query = stream
.toDF()
.withColumn("columnDoesntExist", to_date(col("names")))
.writeStream
.format("console")
.start()
query.awaitTermination()
}
}
However, this doesn't work: it raises an AnalysisException, but the onQueryTerminated method isn't triggered by the termination of the streaming query.
In what situations is that method triggered with event.exception being Some(exception)?
Update
The following code successfully triggers the execution of onQueryTerminated:
import org.apache.spark.sql.functions.udf
val exceptionUdf = udf(() => throw new Exception())
val query = stream
.toDF()
.withColumn("exception", exceptionUdf())
.writeStream
.format("console")
.start()
Refer to the accepted answer for an explanation as to why.
According to the book "Stream Processing with Apache Spark" (published by O'Reilly) the onQueryTerminated method gets
"Called when a streaming query is stopped. The event contains id and runId fields that correlate with the start event. It also provides an exception field that contains an exception if the query failed due to an error."
As you are getting an AnalysisException, your query has not even started yet. It only got through the first of the four phases of the Catalyst optimizer, "Analysis", and has not been transformed into runnable code yet:
More details on the Catalyst Optimizer.
The AnalysisException just means that there is an issue in the code related to the Catalog, which is exactly what you intended: you refer to a column that does not exist (in the Catalog).
If you want to trigger the execution of the onQueryTerminated method, you need to write code that starts successfully but fails while it is already running (e.g. by providing input data of the wrong type).
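For illustration, a hedged sketch of that idea, reusing the implicit SQLContext and encoders from the test above; the toInt UDF and the "not-a-number" input are made up. The plan passes analysis, but the first micro-batch fails at runtime, so onQueryTerminated is invoked with event.exception set to Some(...):
import org.apache.spark.sql.functions.{col, udf}
val toInt = udf((s: String) => s.toInt)      // throws NumberFormatException on bad input
val badInput = MemoryStream[String]
badInput.addData("1", "2", "not-a-number")
val query = badInput
  .toDF()
  .withColumn("asInt", toInt(col("value")))  // "value" is the column name MemoryStream produces
  .writeStream
  .format("console")
  .start()
// awaitTermination() rethrows the failure as a StreamingQueryException,
// after the listener's onQueryTerminated has been called.
query.awaitTermination()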
I am trying to write the DataFrame, after filtering, to S3 in CSV format with the Scala code below in the Glue script editor.
Current status:
Does not show any error after the run, but just does not write to S3.
The logs screen prints Start; however, I cannot see the End print.
No particular error message indicating the problem.
Stops at temp.count.
Environment condition: I have admin rights to all of S3.
import com.amazonaws.services.glue.GlueContext
import <others>
object GlueApp {
def main(sysArgs: Array[String]) {
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
// #params: [JOB_NAME]
val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
val datasource0 = glueContext.getCatalogSource(database = "db", tableName = "table", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
val appymapping1 = datasource0.applyMapping(mapping=........)
val temp=appymapping1.toDF.filter(some filtering rules)
print("start")
if (temp.count() <= 0) {
temp.write.format("csv").option("sep", ",").save("s3://directory/error.csv")
}
print("End")
You're writing the DataFrame to S3 inside an if condition (the condition is meant to check whether the DataFrame has one or more rows), but your if condition is inverted: it is only true if the DataFrame has 0 (or fewer) rows. So change that.
Also: Spark always saves files with "part-" names, so change the S3 path to s3://directory/ and add .mode("overwrite").
So your write query should be:
temp.write.format("csv").option("sep", ",").mode("overwrite").save("s3://directory")
I was practising developing a sample model using the online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do I actually run the model in a production environment? Is it via Spark Job server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how we actually run a model in a production environment. I tried to run it via Spark jobserver, but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects it to be vector elements. Can someone guide me on how to achieve this? And also, is this how the data is passed to the model in a production environment, or is it some other way?
Spark Job-server is used in production use cases where you want to design pipelines of Spark jobs, and also (optionally) share the SparkContext across jobs over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is that you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Minimal runner: builds a SparkContext from configuration, trains the SVM model and scores one sample.
 */
object SparkRunner {
def main (args: Array[String]){
val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
val sc: SparkContext = constructSparkContext(config)
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
}
def constructSparkContext(config: Config): SparkContext = {
val conf = new SparkConf()
conf
.setMaster(config.getString("spark.master"))
.setAppName(config.getString("app.name"))
/*Set more configuration values here*/
new SparkContext(conf)
}
}
Optionally, you can also use the wrapper for spark-submit script, SparkSubmit, provided in the Spark library itself.
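If you want to stay programmatic, another option (distinct from the SparkSubmit class mentioned above) is Spark's org.apache.spark.launcher.SparkLauncher, which wraps spark-submit. A hedged sketch; the jar path, main class and master below are placeholders:
import org.apache.spark.launcher.SparkLauncher
val handle = new SparkLauncher()
  .setAppResource("/path/to/app-assembly.jar")   // placeholder: your application jar
  .setMainClass("runner.SparkRunner")
  .setMaster("yarn")                             // or local[*], spark://..., etc.
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .startApplication()                            // returns a SparkAppHandle you can poll for state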
I'm using spark 1.6 and run into the issue above when I run the following code:
// Imports
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import scala.concurrent.ExecutionContext.Implicits.global
import java.util.Properties
import scala.concurrent.Future
// Set up spark on local with 2 threads
val conf = new SparkConf().setMaster("local[2]").setAppName("app")
val sc = new SparkContext(conf)
val sqlCtx = new HiveContext(sc)
// Create fake dataframe
import sqlCtx.implicits._
var df = sc.parallelize(1 to 50000).map { i => (i, i, i, i, i, i, i) }.toDF("a", "b", "c", "d", "e", "f", "g").repartition(2)
// Write it as a parquet file
df.write.parquet("/tmp/parquet1")
df = sqlCtx.read.parquet("/tmp/parquet1")
// JDBC connection
val url = s"jdbc:postgresql://localhost:5432/tempdb"
val prop = new Properties()
prop.setProperty("user", "admin")
prop.setProperty("password", "")
// 4 futures - at least one of them has been consistently failing for
val x1 = Future { df.write.jdbc(url, "temp1", prop) }
val x2 = Future { df.write.jdbc(url, "temp2", prop) }
val x3 = Future { df.write.jdbc(url, "temp3", prop) }
val x4 = Future { df.write.jdbc(url, "temp4", prop) }
Here's the github gist: https://gist.github.com/karanveerm/27d852bf311e39f05491
The error I get is:
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2125) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
at org.apache.spark.sql.DataFrame.foreachPartition(DataFrame.scala:1482) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:247) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:306) ~[org.apache.spark.spark-sql_2.11-1.6.0.jar:1.6.0]
at writer.SQLWriter$.writeDf(Writer.scala:75) ~[temple.temple-1.0-sans-externalized.jar:na]
at writer.Writer$.writeDf(Writer.scala:33) ~[temple.temple-1.0-sans-externalized.jar:na]
at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:460) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
at controllers.Api$$anonfun$downloadTable$1$$anonfun$apply$25.apply(Api.scala:452) ~[temple.temple-1.0-sans-externalized.jar:2.4.6]
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237) ~[org.scala-lang.scala-library-2.11.7.jar:na]
Is this a Spark bug, or am I doing something wrong? Are there any workarounds?
After trying several things, I found that one of the threads created by the global ForkJoinPool gets its spark.sql.execution.id property set to a random value.
I could not identify the process that actually did that but I could work around it by using my own ExecutionContext.
import java.util.concurrent.Executors
import concurrent.ExecutionContext
// A dedicated fixed-size pool backs the Futures instead of the global ForkJoinPool;
// its threads are created up front, at pool construction time.
val executorService = Executors.newFixedThreadPool(4)
implicit val ec = ExecutionContext.fromExecutorService(executorService)
I used code from http://danielwestheide.com/blog/2013/01/16/the-neophytes-guide-to-scala-part-9-promises-and-futures-in-practice.html.
Maybe the ForkJoinPool clones thread attributes when creating new ones, and if that happens within the context of a SQL execution the new thread would get the non-null spark.sql.execution.id value, whereas a FixedThreadPool creates its threads at instantiation.
Please check SPARK-13747.
Consider using Spark version 2.2.0 or higher, if applicable in your environment.
Test 1: Does it help if you run each of the df.write operations serially instead of in parallel futures?
Test 2: Does it help if you persist the dataframe, then do all the df.write operations in parallel, and only unpersist after all of them have completed (see the sketch below)?
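A hedged sketch of that second test, reusing df, url and prop from the question and assuming an implicit ExecutionContext is in scope:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
df.persist()                                     // materialize the DataFrame once
val all = Future.sequence(Seq("temp1", "temp2", "temp3", "temp4").map { table =>
  Future { df.write.jdbc(url, table, prop) }     // the four writes run in parallel
})
Await.result(all, Duration.Inf)                  // block until every write has finished
df.unpersist()                                   // only then release the cache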
Strangely, this doesn't work. Can someone explain the background? I want to understand why it doesn't accept this.
The input files are Parquet files spread across multiple folders. When I print the results, they are structured as I want them to be. But when I call dataframe.count() on the joined DataFrame, the job runs forever. Can anyone help with the details on that?
import org.apache.spark.{SparkContext, SparkConf}
object TEST{
def main(args: Array[String] ) {
val appName = args(0)
val threadMaster = args(1)
val inputPathSent = args(2)
val inputPathClicked = args(3)
// pass spark configuration
val conf = new SparkConf()
.setMaster(threadMaster)
.setAppName(appName)
// Create a new spark context
val sc = new SparkContext(conf)
// Specify a SQL context and pass in the spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create two dataframes for sent and clicked files
val dfSent = sqlContext.read.parquet(inputPathSent)
val dfClicked = sqlContext.read.parquet(inputPathClicked)
// Join them
val dfJoin = dfSent.join(dfClicked, dfSent.col("customer_id")
===dfClicked.col("customer_id") && dfSent.col("campaign_id")===
dfClicked.col("campaign_id"), "left_outer")
dfJoin.show(20) // perfectly shows the first 20 rows
dfJoin.count() //Here we run into trouble and it runs forever
}
}
Use println(dfJoin.count())
You will be able to see the count on your screen.