Spark Error: java.io.NotSerializableException: scala.runtime.LazyRef

I am new to Spark; can you please help with this?
The simple pipeline below, which fits a logistic regression, throws an exception.
The code:
package pipeline.tutorial.com
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.tuning.TrainValidationSplit
object PipelineDemo {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val conf = new SparkConf()
    conf.set("spark.master", "local")
    conf.set("spark.app.name", "PipelineDemo")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().appName("PipelineDemo").getOrCreate()
    val df = spark.read.json("C:/Spark-The-Definitive-Guide-master/data/simple-ml")
    val rForm = new RFormula()
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
    val stages = Array(rForm, lr)
    val pipeline = new Pipeline().setStages(stages)
    val params = new ParamGridBuilder()
      .addGrid(rForm.formula, Array(
        "lab ~ . + color:value1",
        "lab ~ . + color:value1 + color:value2"))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .addGrid(lr.regParam, Array(0.1, 2.0))
      .build()
    val evaluator = new BinaryClassificationEvaluator()
      .setMetricName("areaUnderROC")
      .setRawPredictionCol("prediction")
      .setLabelCol("label")
    val tvs = new TrainValidationSplit()
      .setTrainRatio(0.75)
      .setEstimatorParamMaps(params)
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
    val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
    val model = tvs.fit(train)
    val rs = model.transform(test)
    rs.select("features", "label", "prediction").show()
  }
}
// end code.
The code runs fine from the spark-shell, but when I run it as a Spark application (using the Eclipse Scala IDE) it fails with:
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Thanks.

I solved it by removing the Scala library from the build path. To do this, right-click the Scala Library container > Build Path > Remove from Build Path.
I'm not sure about the root cause, though.

This error can be resolved by changing the Scala version in your project to 2.12.8 or higher; Scala 2.12.8 works and is very stable. In IntelliJ you can do this from the Project Structure dialog (Ctrl+Alt+Shift+S): go to Global Libraries, remove the old Scala SDK with the - button, and add the new version (2.12.8 or higher) with the + button.
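A likely root cause, consistent with both answers above, is a mismatch between the Scala version the project compiles with and the Scala binary version of the Spark artifacts on the classpath. A minimal build.sbt sketch that keeps the two aligned (the Spark version 2.4.5 here is an assumption; adjust it to your setup):

// build.sbt -- keep scalaVersion aligned with the Scala binary version of the Spark artifacts
scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.12), so the artifacts match scalaVersion
  "org.apache.spark" %% "spark-core"  % "2.4.5",  // version is an assumption
  "org.apache.spark" %% "spark-sql"   % "2.4.5",
  "org.apache.spark" %% "spark-mllib" % "2.4.5"
)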

Related

toDF is not working in Spark Scala IDE, but works perfectly in spark-shell [duplicate]

This question already has answers here: Spark 2.0 Scala - RDD.toDF()
I am new to Spark and I am trying to run the commands below from both spark-shell and the Spark Scala Eclipse IDE.
When I run them from the shell, they work perfectly.
But in the IDE I get a compilation error.
Please help.
package sparkWCExample.spWCExample
import org.apache.log4j.Level
import org.apache.spark.sql.{ Dataset, SparkSession, DataFrame, Row }
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql._
object TwitterDatawithDataset {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Spark Scala WordCount Example")
      .setMaster("local[1]")
    val spark = SparkSession.builder()
      .config(conf)
      .appName("CsvExample")
      .master("local")
      .getOrCreate()
    val csvData = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
    val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
    import sqlContext.implicits._
    case class Bank(age: Int, job: String)
    val bankDF = csvData.map(_.split(",")).map(x => Bank(x(0).toInt, x(1)))
    val df = bankDF.toDF()
  }
}
The error appears at compile time itself:
Description Resource Path Location Type
value toDF is not a member of org.apache.spark.rdd.RDD[Bank] TwitterDatawithDataset.scala /spWCExample/src/main/java/sparkWCExample/spWCExample line 35 Scala Problem
To use toDF(), you must enable implicit conversions:
import spark.implicits._
In spark-shell, this import is enabled by default, which is why the code works there. The :imports command shows which imports are already present in your shell:
scala> :imports
1) import org.apache.spark.SparkContext._ (70 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (385 terms)
This works fine for me in Eclipse Scala IDE:
case class Bank(age: Int, job: String)
val u = Array((1, "manager"), (2, "clerk"))
import spark.implicits._
spark.sparkContext.makeRDD(u).map(r => Bank(r._1, r._2)).toDF().show()
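If you adapt the original program along the same lines, two further points usually matter: put import spark.implicits._ (or a stable sqlContext reference) in scope before calling toDF, and define the case class at the top level rather than inside main, since encoder derivation commonly fails for method-local case classes. A rough sketch under those assumptions (the object name BankToDF is illustrative, and the comma delimiter is only assumed from the .csv extension):

import org.apache.spark.sql.SparkSession

// Defined at the top level so that spark.implicits can derive an encoder for it
case class Bank(age: Int, job: String)

object BankToDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("BankToDF")
      .getOrCreate()
    import spark.implicits._

    val df = spark.sparkContext
      .textFile("C:\\Sankha\\Study\\data\\bank_data.csv", 3)
      .map(_.split(","))                  // assumed comma-separated
      .map(x => Bank(x(0).toInt, x(1)))
      .toDF()

    df.show()
  }
}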

Spark ML setParallelism for CrossValidator

So I am trying to set up cross-validation using Spark ML, but I am getting an error saying that
"value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator"
I am currently following the tutorial on the Spark site. I am new to this, so any help is appreciated. Below is my code snippet:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Tokenizer
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
// HashingTF
val hash_tf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
// ML models
val l_regression = new LogisticRegression().setMaxIter(100).setRegParam(0.15)
// Pipeline
val pipe = new Pipeline().setStages(Array(tokenizer, hash_tf, l_regression))
val paramGrid = new ParamGridBuilder()
.addGrid(hash_tf.numFeatures, Array(10,100,1000))
.addGrid(l_regression.regParam, Array(0.1,0.01,0.001))
.build()
val c_validator = new CrossValidator()
.setEstimator(pipe)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
.setParallelism(2)
setParallelism is available only in Spark 2.3 or later; you must be using an earlier version. From the API documentation:
(expert-only) Parameter setters
(...)
def setParallelism(value: Int): CrossValidator.this.type
Set the maximum level of parallelism to evaluate models in parallel. Default is 1 for serial evaluation
Annotations #Since( "2.3.0" )
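If upgrading to Spark 2.3+ is not an option, the same cross-validation still runs on earlier versions by simply dropping the setParallelism call (models are then evaluated serially). A minimal sketch reusing the pipeline and grid from the question:

// Spark < 2.3: identical setup, just without setParallelism (evaluation runs serially)
val c_validator = new CrossValidator()
  .setEstimator(pipe)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)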

Importing Spark libraries using IntelliJ IDEA

I would like to use Spark SQL in an IntelliJ IDEA SBT project.
Even though I have imported the library, the code does not seem to resolve it.
Spark Core seems to be working, however.
You can't create a DataFrame directly from a Scala List[A]. You first need to create an RDD[A] and then transform that into a DataFrame. You also need an SQLContext:
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1,2,3,4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession
object Test {
  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local")
      .appName("some name")
      .getOrCreate()
    import spark.sqlContext.implicits._
  }
}
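Since this is an SBT project, also make sure spark-sql is actually declared as a dependency and the project is reimported; having only spark-core declared would explain why Spark Core resolves but Spark SQL does not. A minimal build.sbt sketch (the version numbers are assumptions; keep them consistent with each other):

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)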

Compilation error saving model written in Scala, Apache Spark

I am running the example source code provided by Apache Spark to create an FPGrowth model. I want to save the model for future use, so I added the final line of this code (model.save):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.mllib.util._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import java.io._
import scala.collection.mutable.Set
object App {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("prediction").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("FPFeatureSeries.txt")
    val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
    val fpg = new FPGrowth()
      .setMinSupport(0.1)
      .setNumPartitions(10)
    val model = fpg.run(transactions)
    val minConfidence = 0.8
    model.generateAssociationRules(minConfidence).collect().foreach { rule =>
      if (rule.confidence > minConfidence) {
        println(
          rule.antecedent.mkString("[", ",", "]")
            + " => " + rule.consequent.mkString("[", ",", "]")
            + ", " + rule.confidence)
      }
    }
    model.save(sc, "FPGrowthModel")
  }
}
The problem is that I get a compilation error: value save is not a member of org.apache.spark.mllib.fpm.FPGrowth
I have tried including libraries and copying the exact examples from the documentation, but I am still getting the same error.
I am using Spark 2.0.0 and Scala 2.10.
I had the same issue. I used this to save the model:
sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
and this to load it back:
val linRegModel = sc.objectFile[LinearRegressionModel]("path").first()
This might help:
what-is-the-right-way-to-save-load-models-in-spark-pyspark
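Applied to the FPGrowth code in this question, the same workaround would look roughly like the sketch below; it reuses the sc and model values from the code above, and the type parameter String matches the RDD[Array[String]] transactions (the path name is only an example):

import org.apache.spark.mllib.fpm.FPGrowthModel

// Save the trained model as a plain object file (workaround, not FPGrowthModel.save)
sc.parallelize(Seq(model), 1).saveAsObjectFile("FPGrowthModel")

// Load it back; the item type must match what the model was trained on
val loadedModel = sc.objectFile[FPGrowthModel[String]]("FPGrowthModel").first()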

Class org.apache.spark.sql.types.SQLUserDefinedType not found - continuing with a stub

I have a basic Spark MLlib program, as follows.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
class Sample {
  val conf = new SparkConf().setAppName("helloApp").setMaster("local")
  val sc = new SparkContext(conf)
  val data = sc.textFile("data/mllib/kmeans_data.txt")
  val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

  // Cluster the data into two classes using KMeans
  val numClusters = 2
  val numIterations = 20
  val clusters = KMeans.train(parsedData, numClusters, numIterations)

  // Export to PMML
  println("PMML Model:\n" + clusters.toPMML)
}
I have manually added spark-core, spark-mllib and spark-sql (all version 1.5.0) to the project classpath through IntelliJ.
I am getting the error below when I run the program. Any idea what's wrong?
Error:scalac: error while loading Vector, Missing dependency 'bad
symbolic reference. A signature in Vector.class refers to term types
in package org.apache.spark.sql which is not available. It may be
completely missing from the current classpath, or the version on the
classpath might be incompatible with the version used when compiling
Vector.class.', required by
/home/fazlann/Downloads/spark-mllib_2.10-1.5.0.jar(org/apache/spark/mllib/linalg/Vector.class
DesirePRG, I have met the same problem as you. The solution is to add a jar which assembles Spark and Hadoop, such as spark-assembly-1.4.1-hadoop2.4.0.jar, to the classpath; then it works properly.
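The message "term types in package org.apache.spark.sql ... not available" indicates that spark-mllib 1.5.0 needs the spark-sql classes of the same version on the compile classpath. Instead of hand-picking an assembly jar, declaring all modules at one consistent version usually resolves it; a build.sbt sketch, assuming sbt and Spark 1.5.0 on Scala 2.10:

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.0",
  "org.apache.spark" %% "spark-sql"   % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0"
)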