Task not serializable issue with window rank function in Scala Spark

When I run my code in prod, I don't get any Task not serializable issue, but when I call it from a unit test case I get Task not serializable from the code below. I don't understand what the issue is or why this strange behavior occurs. Can anyone help with that, or suggest a better serializable solution to get the latest row from a Hive table?
val distinctBy = Window.partitionBy("id").orderBy(desc("updated_at"))
val uniqueSellerDf = enrichedDf
  .withColumn("rank", rank().over(distinctBy)) // row_number().over(distinctBy) has the same issue
  .where($"rank" === 1)
  .drop("rank")
uniqueSellerDf.show() // Task not serializable is thrown by the show() action

Do it like this:
@transient val distinctBy = Window.partitionBy("id").orderBy(desc("updated_at"))
It's a long story. See "Task not serializable exception while running apache spark job", https://medium.com/swlh/spark-serialization-errors-e0eebcf0f6e6, and https://nathankleyn.com/2017/12/29/using-transient-and-lazy-vals-to-avoid-spark-serialisation-issues/
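For context, a minimal sketch of that pattern, assuming the failing code lives inside a test class whose instance is not serializable (the class and method names below are made up, not the asker's code). Marking the window spec as a @transient lazy val, as the linked posts describe, keeps the field out of the serialized state of the enclosing instance; it is simply rebuilt on demand wherever it is used.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, rank}

// Hypothetical test harness; the @transient lazy val pattern is the point here.
class UniqueSellerSpec {
  // Excluded from serialization of the enclosing instance; rebuilt lazily where needed.
  @transient lazy val distinctBy = Window.partitionBy("id").orderBy(desc("updated_at"))

  def latestRowPerId(spark: SparkSession, enrichedDf: DataFrame): DataFrame = {
    import spark.implicits._
    enrichedDf
      .withColumn("rank", rank().over(distinctBy))
      .where($"rank" === 1)
      .drop("rank")
  }
}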

Related

How should you end a Spark job inside an if statement?

What is the recommended way to end a spark job inside a conditional statement?
I am doing validation on my data, and if it fails, I want to end the Spark job gracefully.
Right now I have:
if (isValid(data)) {
  sparkSession.sparkContext.stop()
}
However, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: SparkContext has been shutdown
Then it shows a stacktrace.
Is sparkContext.stop() not the proper way to end a spark job gracefully?
Once you stop the SparkSession, its SparkContext is killed on the JVM; sc is no longer active.
So you can't call any SparkContext-related objects or functions to create an RDD/DataFrame or do anything else.
If you use the same SparkSession again later in the program flow, you will get the above exception.
For example:
val rdd = sc.parallelize(Seq(Row("RAMA", "DAS", "25"), Row("smritu", "ranjan", "26")))
val df = spark.createDataFrame(rdd, schema)
df.show() // works fine

if (df.select("fname").collect()(0).getAs[String]("fname") == "MAA") {
  println("continue")
} else {
  spark.stop() // stopping the SparkSession
  println("inside stopping condition")
}

println("code continues")

val rdd1 = sc.parallelize(Seq(Row("afdaf", "DAS", "56"), Row("sadfeafe", "adsadaf", "27")))
// Throws the exception, because the SparkContext was stopped above
val df1 = spark.createDataFrame(rdd1, schema)
df1.show()
There is nothing to say that you can't call stop in an if statement, but there is very little reason to do so, and it is probably a mistake. It seems implicit in your question that you may be attempting to open multiple Spark sessions.
The Spark session is intended to be left open for the life of the program: if you try to start two, Spark throws an exception and prints some background to the logs, including a JIRA ticket that discusses the topic.
If you wish to run multiple Spark tasks, you may submit them to the same context. One context can run multiple tasks at once.
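As an aside, here is a minimal sketch of that last point, with a made-up local session and job bodies purely for illustration: actions submitted from separate threads share the one SparkContext and run concurrently, and the session is stopped only once, after all work is done.
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object MultipleJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multiple-jobs").master("local[*]").getOrCreate()

    // Two independent actions submitted to the same context from different threads.
    val countJob = Future { spark.range(0, 1000000).count() }
    val sumJob   = Future { spark.range(0, 1000000).selectExpr("sum(id)").collect()(0).getLong(0) }

    println(s"count = ${Await.result(countJob, Duration.Inf)}")
    println(s"sum   = ${Await.result(sumJob, Duration.Inf)}")

    // Stop the session only once, after all work on it is finished.
    spark.stop()
  }
}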

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

With the purpose of loading a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
The case class used for the creation of my DataFrame is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I executed my program in Eclipse, it throws the following error:
Is something missing inside the Scala class that I'm using and running with Eclipse? What could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I executed my program in Eclipse, it throws the following error:
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and so it holds your Spark computation back from being serialized and sent across the wire to executors.
As @T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse, you have added some additional lines that you don't show, thinking that they are not necessary, which happens to be quite the contrary. Find the places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the offending method is LeerInsumosAValidar, which does the map transformation, and that's where you should seek the answer.
Your case class must have public scope: you can't have ArchivoProcesar nested inside another class.
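A minimal sketch of that fix (the enclosing object name and the file path handling are assumptions): defining the case class at the top level, outside any class, means its instances carry no hidden reference to an enclosing object, so the map stays serializable.
import org.apache.spark.sql.SparkSession

// Defined at the top level, not nested inside another class,
// so instances carry no hidden reference to an outer object.
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

object CargaInsumos {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("carga-insumos").master("local[*]").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    val dfInsumos = sc.textFile("path/file/")
      .map(_.split("\\|"))
      .map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
      .toDF()

    dfInsumos.show()
    spark.stop()
  }
}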

Using Spark From Within My Application

I am relatively new to Spark. I have created scripts which are run for offline, batch processing tasks but now I am attempting to use it from within a live application (i.e. not using spark-submit).
My application looks like the following:
An HTTP request will be received to initiate a data-processing task. The payload of the request will contain the data needed for the Spark job (time window to look at, data types to ignore, weights to use for ranking algorithms, etc.).
Processing begins, which will eventually produce a result. The result is expected to fit into memory.
The result is sent to an endpoint (Written to a single file, sent to a client through Web Socket or SSE, etc)
The fact that the job is initiated by an HTTP request, and that the Spark job is dependent on data in this request, seems to imply that I can't use spark-submit for this problem. Although this originally seemed fairly straightforward to me, I've run into some issues that no amount of Googling has been able to solve. The main issue is exemplified by the following:
import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

object MyApplication {
  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.load()

    val sparkConf = new SparkConf(true)
      .setMaster(config.getString("spark.master-uri"))
      .setAppName(config.getString("spark.app-name"))

    val sc = new SparkContext(sparkConf)

    val localData = Seq.fill(10000)(1)

    // The following produces ClassNotFoundException for Test
    val data = sc.parallelize[Int](localData).map(i => new Test(i))

    println(s"My Result ${data.count()}")
    sc.stop()
  }
}

class Test(num: Int) extends Serializable
This is then run with:
sbt run
...
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 127.0.0.1): java.lang.ClassNotFoundException: my.package.Test
...
I have classes within my application that I want to create and manipulate while using Spark. I've considered extracting all of these classes into a stand-alone library which I can send using .setJars("my-library-of-classes") but that will be very awkward to work with and deploy. I've also considered not using any classes within the spark logic and only using tuples. That would be just as awkward to use.
Is there a way for me to do this? Have I missed something obvious or are we really forced to use spark-submit to communicate with Spark?
Currently spark-submit is the only way to submit Spark jobs to a cluster, so you have to create a fat jar containing all of your classes and dependencies and use it with spark-submit. You can use command-line arguments to pass either the data or paths to configuration files.
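A minimal sketch of that approach, under the assumption that the fat jar is built with something like sbt-assembly; the object name, config path argument, and config keys below are made up for illustration. The driver reads a job-specific config file whose path is passed on the spark-submit command line.
import java.io.File
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

object BatchJob {
  def main(args: Array[String]): Unit = {
    // Path to a job-specific config file, passed on the spark-submit command line,
    // e.g. spark-submit --class BatchJob my-fat-assembly.jar /etc/myapp/job.conf
    val config = ConfigFactory.parseFile(new File(args(0)))

    val spark = SparkSession.builder()
      .appName(config.getString("job.name"))
      .getOrCreate()

    val input = spark.read.parquet(config.getString("job.input-path"))
    println(s"rows = ${input.count()}")

    spark.stop()
  }
}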

Spark Task not Serializable with simple accumulator?

I am running this simple code:
val accum = sc.accumulator(0, "Progress")
listFilesPar.foreach { filepath =>
  accum += 1
}
listFilesPar is an RDD[String]
which throws the following error:
org.apache.spark.SparkException: Task not serializable
Right now I don't understand what's happening. I use braces rather than parentheses because I need to write a lengthy function, and I am just doing unit testing.
The typical cause of this is that the closure unexpectedly captures something: something you did not include in your paste, because you would never expect it to be serialized.
You can try to reduce your code until you find it. Or just turn on serialization debug logging with -Dsun.io.serialization.extendedDebugInfo=true. You will probably see in the output that Spark tries to serialize something silly.
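For illustration only (the test class, field, and RDD parameter below are made up, not taken from the question): a common shape of this problem in unit tests is a closure that references a field of a non-serializable enclosing class, which drags the whole instance into the task closure; copying the value into a local val before the action avoids the capture.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical, non-serializable test harness; all names are made up.
class ProgressSuite(sc: SparkContext) {
  val label: String = "Progress" // instance field

  def broken(listFilesPar: RDD[String]): Unit = {
    val accum = sc.longAccumulator("Progress")
    listFilesPar.foreach { filepath =>
      // `label` is a field, so the closure captures `this` (the whole ProgressSuite,
      // including its SparkContext) -> org.apache.spark.SparkException: Task not serializable
      if (filepath.startsWith(label)) accum.add(1)
    }
  }

  def fixed(listFilesPar: RDD[String]): Unit = {
    val accum = sc.longAccumulator("Progress")
    val prefix = label // copy the field into a local val before the action
    listFilesPar.foreach { filepath =>
      if (filepath.startsWith(prefix)) accum.add(1) // closure now captures only locals
    }
  }
}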

Scala Spark : java.util.NoSuchElementException: key not found: -1.0

I am running a random forest (RF) model on Spark, following https://spark.apache.org/docs/2.0.0/ml-classification-regression.html#random-forest-classifier
My issue is that if I load 2 different dataframes for train and test, eg:
val Array(trainingData, testData) = Array(convertedVecDF, convertedVecDF_test)
I get the above "java.util.NoSuchElementException: key not found: -1.0" as the cause of the error, but when I do the following, I get no error:
val data = convertedVecDF.union(convertedVecDF_test)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
and run the code in the link, it works fine.
The class of all the variables (data, convertedVecDF, convertedVecDF_test, trainingData, testData) is Class[_ <: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]] = class org.apache.spark.sql.Dataset
When I separate out the variables as in the first case and use very small test data (say 10 points), it works fine.
Why is that? It seems like a resource-access issue, but I can't seem to understand how Spark is working here.
What can I do to run the first case, i.e. run with separate train/test data?
EDIT
This problem was due to this issue: Error when passing data from a Dataframe into an existing ML VectorIndexerModel. Moving to Spark 2.3.1 solved it.
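For what it's worth, a sketch of the likely mechanism, assuming the pipeline fits a VectorIndexer only on the training data (as in the linked RF example); the column names and maxCategories below are assumptions. A categorical value that appears only in the separately loaded test set has no entry in the fitted index map, which can surface as "key not found" on older Spark versions; fitting the indexer on data that covers both sets (effectively what the union + randomSplit approach does), or upgrading as in the edit, avoids it.
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.sql.DataFrame

// Sketch, not the asker's code: both DataFrames are assumed to have a
// "features" vector column, as in the Spark RF classifier example.
def indexFeatures(train: DataFrame, test: DataFrame): (DataFrame, DataFrame) = {
  val indexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)

  // Fitting on the union ensures every category value seen in either set
  // is present in the model's index map, so transform() on the test set
  // cannot hit an unknown key.
  val model = indexer.fit(train.union(test))
  (model.transform(train), model.transform(test))
}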