Apache Spark throwing deserialization error when using take method on RDD - Scala

I am new to Spark, and I'm using Scala 2.12.8 with Spark 2.4.0. I'm trying to use the Random Forest classifier in Spark MLlib. I can build and train the classifier, and it can predict if I use the first() function on the resulting RDD. However, if I try to use the take(n) function, I get a pretty big, ugly stack trace. Does anyone know what I'm doing wrong? The error occurs on the ".take(3)" line. I am aware that this is the first effectful operation I perform on the RDD, so if anyone can explain why it's failing and how to fix it, I would be really grateful.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ItsABreeze {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .appName("test")
      .getOrCreate()

    // Do stuff to file
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark.sparkContext, "file.svm")

    // Split the data into training and test sets (30% held out for testing)
    val splits: Array[RDD[LabeledPoint]] = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a RandomForest model.
    // Empty categoricalFeaturesInfo indicates all features are continuous
    val numClasses = 4
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model: RandomForestModel = RandomForest.trainClassifier(
      trainingData,
      numClasses,
      categoricalFeaturesInfo,
      numTrees,
      featureSubsetStrategy,
      impurity,
      maxDepth,
      maxBins
    )

    testData
      .map((point: LabeledPoint) => model.predict(point.features))
      .take(3)
      .foreach(println)

    spark.stop()
  }
}
The top portion of the stack trace follows:
java.io.IOException: unexpected exception type
at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1736)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1266)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2078)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1260)
... 25 more
Caused by: java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
at ItsABreeze$.$deserializeLambda$(ItsABreeze.scala)
... 35 more
Caused by: java.lang.NoClassDefFoundError: scala/runtime/LambdaDeserialize
... 36 more
Caused by: java.lang.ClassNotFoundException: scala.runtime.LambdaDeserialize
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

The code that I was trying to run was a slightly modified version of the classification example on this page (from the Spark Machine Learning Library documentation).
Both commenters on my original question were correct: I changed my Scala version from 2.12.8 to 2.11.12, reverted Spark to 2.2.1, and the code ran as-is.
For anyone watching this issue who is qualified to answer, here is a follow-up question: Spark 2.4.0 claims new, experimental support for Scala 2.12.x. Are there many known issues with that support?
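For reference, here is a minimal build.sbt sketch of the version alignment that made the code work; the exact dependency list is an assumption about my project, but the point is that the Scala binary version has to match the one the Spark distribution was built against (the missing scala/runtime/LambdaDeserialize class only exists in the Scala 2.12 runtime):
// Hedged build.sbt sketch (assumed project layout) of the downgrade described above.
// The Scala binary version must match the Scala version Spark itself was built with.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.2.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.2.1" % "provided"
)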

Related

Databricks error java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;

I am trying to run some sample code from this link: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5537430417240233/312903576646278/3506802399907740/latest.html
I am running it in a Databricks notebook on a cluster with runtime 6.3 (includes Apache Spark 2.4.4, Scala 2.11).
I initially create a dataframe using
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._
val df = Seq(
  ("one", 2.0),
  ("two", 1.5),
  ("three", 8.0)
).toDF("id", "val")
and then I try to get a list of strings by running
df.select("id").map(_.getString(0)).collect.toList
and I get the error below
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1275538363433250:2)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1275538363433250:53)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-1275538363433250:55)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-1275538363433250:57)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw$$iw.<init>(command-1275538363433250:59)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw$$iw.<init>(command-1275538363433250:61)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw$$iw.<init>(command-1275538363433250:63)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$$iw.<init>(command-1275538363433250:65)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read.<init>(command-1275538363433250:67)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$.<init>(command-1275538363433250:71)
at line3700fe51392b4abe9744f6b3a059dbfa46.$read$.<clinit>(command-1275538363433250)
at line3700fe51392b4abe9744f6b3a059dbfa46.$eval$.$print$lzycompute(:7)
at line3700fe51392b4abe9744f6b3a059dbfa46.$eval$.$print(:6)
at line3700fe51392b4abe9744f6b3a059dbfa46.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:699)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:652)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:385)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:362)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:251)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:246)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:288)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:362)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
I get the same error when running
df.select("id").collect().map(_(0)).toList
but not when running
df.select("id").rdd.map(_(0)).collect.toList
The command above that runs successfully returns a List[Any], but I need a List[String].
Can someone please advise? I suspect it's a Spark/Scala version mismatch, but I can't figure out what is wrong.
The exception
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
often arises if you mix different Scala versions. Do you have any dependencies loaded for Scala 2.12?
EDIT: I just tested your code in Databricks with the same runtime; it worked just fine.
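If you want to rule out a version mix on the cluster itself, a quick check is to print the Scala version the notebook is actually running and compare it with what your attached libraries were built for:
// Prints something like "version 2.11.12"; if your attached libraries are _2.12
// artifacts while this reports 2.11, that mismatch would explain the NoSuchMethodError.
println(scala.util.Properties.versionString)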
In the .map, convert each value to a string with .toString; the result will then be a List[String].
Example:
df.select("id").collect().map(x => x(0).toString).toList
List[String] = List(one, two, three)
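As a side note, assuming spark.implicits._ is still in scope as in the question, a typed Dataset gives you a List[String] directly without the explicit toString:
// Typed alternative sketch: read the column as Dataset[String] instead of Row.
val ids: List[String] = df.select("id").as[String].collect().toList
// ids: List[String] = List(one, two, three)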

Issue in saving Xgboost model in spark scala

I trained an XGBoost model in Spark Scala as follows:
val xgbParamGrid = new ParamGridBuilder()
  .addGrid(xgb.maxDepth, Array(5, 10))
  .addGrid(xgb.maxBins, Array(2))
  .addGrid(xgb.minChildWeight, Array(0.2))
  .addGrid(xgb.eta, Array(0.015))
  .addGrid(xgb.alpha, Array(0.8, 0.9))
  .addGrid(xgb.lambda, Array(0.9, 1.0))
  .build()

// Create the XGBoost pipeline
val pipeline = new Pipeline().setStages(Array(xgb))

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(xgbParamGrid)
  .setNumFolds(3)

val xgbModel = cv.fit(trainingData)

val xgbBest = xgbModel.bestModel.asInstanceOf[PipelineModel].stages(0).asInstanceOf[XGBoostClassificationModel]
Then I tried to save it to local disk as follows:
xgbBest.write.overwrite.save(modelSavePath)
but got the following error message:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:105)
at org.apache.spark.ml.param.LongParam.jsonEncode(params.scala:480)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$$anonfun$1$$anonfun$3.apply(DefaultXGBoostParamsWriter.scala:73)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$$anonfun$1$$anonfun$3.apply(DefaultXGBoostParamsWriter.scala:71)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$$anonfun$1.apply(DefaultXGBoostParamsWriter.scala:71)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$$anonfun$1.apply(DefaultXGBoostParamsWriter.scala:69)
at scala.Option.getOrElse(Option.scala:121)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.getMetadataToSave(DefaultXGBoostParamsWriter.scala:69)
at ml.dmlc.xgboost4j.scala.spark.params.DefaultXGBoostParamsWriter$.saveMetadata(DefaultXGBoostParamsWriter.scala:51)
at ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel$XGBoostClassificationModelWriter.saveImpl(XGBoostClassifier.scala:480)
at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:103)
at classificationWithXgboost$.main(classificationWithXgboost.scala:125)
at classificationWithXgboost.main(classificationWithXgboost.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Can anyone help with the problem? Thanks
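Not a definitive fix, but one workaround sketch, assuming your xgboost4j-spark version exposes the underlying booster on XGBoostClassificationModel: save the native booster directly, which bypasses the Spark ML params writer where the Integer-to-Long cast fails. The file name below is illustrative:
// Hedged workaround: skip DefaultXGBoostParamsWriter (where the ClassCastException
// is thrown) and persist only the raw booster. Assumes nativeBooster is available.
xgbBest.nativeBooster.saveModel(modelSavePath + "/xgb.model")
// It can later be reloaded with the plain xgboost4j-scala API, e.g.
// ml.dmlc.xgboost4j.scala.XGBoost.loadModel(modelSavePath + "/xgb.model"),
// though that returns a Booster rather than a Spark PipelineModel.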

Logistic Regression on a Dataset from generateLinearRDD fails with java.lang.IllegalArgumentException

As a proof of concept for something, I was trying to generate a DataFrame with sample data from LinearDataGenerator.generateLinearRDD and then perform a logistic regression on it.
Assuming that generateLinearRDD would generate data suitable for performing a linear regression, I stuck it in a pipeline with a Binarizer to create a threshold column suitable for a logistic regression.
My code is as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.util.{LinearDataGenerator, MLUtils}
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

// databricks users can comment out lines between here...
val spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._
// ...and here

val data = {
  val tmp = LinearDataGenerator.generateLinearRDD(spark.sparkContext, 10000, 4, 0.05).toDF()
  MLUtils.convertVectorColumnsToML(tmp, "features").withColumnRenamed("label", "continuousLabel")
}

val binarizer = new Binarizer()
  .setInputCol("continuousLabel")
  .setOutputCol("label")
  .setThreshold(0)

val logisticRegression = new LogisticRegression()

val pipeline = new Pipeline()
  .setStages(Array(binarizer, logisticRegression))

val pipelineModel = pipeline.fit(data)

println(pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel].binarySummary.accuracy)
The stack trace from the exception looks like this:
Exception in thread "main" java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:743)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:742)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:742)
at org.apache.spark.mllib.evaluation.MulticlassMetrics.tpByClass$lzycompute(MulticlassMetrics.scala:48)
at org.apache.spark.mllib.evaluation.MulticlassMetrics.tpByClass(MulticlassMetrics.scala:44)
at org.apache.spark.mllib.evaluation.MulticlassMetrics.accuracy$lzycompute(MulticlassMetrics.scala:168)
at org.apache.spark.mllib.evaluation.MulticlassMetrics.accuracy(MulticlassMetrics.scala:168)
at org.apache.spark.ml.classification.LogisticRegressionSummary$class.accuracy(LogisticRegression.scala:1445)
at org.apache.spark.ml.classification.LogisticRegressionSummaryImpl.accuracy(LogisticRegression.scala:1641)
at crossvalidation_graphs$.delayedEndpoint$crossvalidation_graphs$1(crossvalidation_graphs.scala:35)
at crossvalidation_graphs$delayedInit$body.apply(crossvalidation_graphs.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at crossvalidation_graphs$.main(crossvalidation_graphs.scala:9)
at crossvalidation_graphs.main(crossvalidation_graphs.scala)
My schema currently looks like this:
root
|-- continuousLabel: double (nullable = false)
|-- features: vector (nullable = true)
I'm running Spark 2.3.1 with Scala 2.11.12
Similar to this guy, my actual problem was that I was using Java 10 instead of Java 8. When I switched back to Java 8, my code worked without problems.
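In case it helps anyone else: the actual fix is running sbt/Spark on a Java 8 JVM (for example by pointing JAVA_HOME at a JDK 8 install), since Spark 2.3.x does not support running on Java 9/10. As an extra guard, a small build.sbt sketch (assuming an sbt project) that keeps the emitted bytecode at the 1.8 level:
// Assumed sbt settings, purely as a guard: emit Java 8 bytecode from javac and scalac.
// This does not change the JVM you run on; that still needs to be Java 8 for Spark 2.3.x.
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
scalacOptions += "-target:jvm-1.8"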

NoClassDefFoundError: org/apache/spark/sql/types/GenericArrayData

When I run this code
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val conf = new SparkConf().setMaster("local[2]").setAppName("MLlib")
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://192.168.1.20:8020/user/sparkMLlib/input/testMLlib.csv")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
// Save and load model
clusters.save(sc, "hdfs://192.168.1.20:8020/user/sparkMLlib/output/")
val sameModel = KMeansModel.load(sc, "hdfs://192.168.1.20:8020/user/sparkMLlib/output/")
I get a runtime exception:
16/05/04 14:43:40 ERROR DefaultWriterContainer: Aborting task.
java.lang.NoClassDefFoundError: org/apache/spark/sql/types/GenericArrayData
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:179)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$productToRowRdd$1$$anonfun$apply$1.apply(ExistingRDD.scala:39)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$productToRowRdd$1$$anonfun$apply$1.apply(ExistingRDD.scala:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.types.GenericArrayData
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
I don't understand this at all. Is it an import or dependency problem? I looked everywhere but found nothing. It seems like a really strange exception to get. Can anyone explain?
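One thing worth checking, offered as a guess rather than a confirmed diagnosis: a NoClassDefFoundError for a Spark-internal class like this is often a sign of mismatched Spark jars on the classpath (for example, a spark-mllib built against a different Spark version than the spark-sql that is actually running). A hedged build.sbt sketch with every module pinned to one version (the version number is illustrative):
// Hedged sketch: keep all Spark modules on a single version, resolved with %%
// so they also match the Scala binary version. The version below is only an example.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"   % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.6.1" % "provided"
)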

Running Spark inside IntelliJ IDEA: HttpServletResponse - ClassNotFoundException

I am trying to run Spark with Scala from inside IntelliJ IDEA:
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/kamil/Apps/spark-1.2.1-bin/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Running it with spark-submit works all right. Running it from the IDE results in the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
at org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:74)
at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:61)
at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:61)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1765)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1756)
at org.apache.spark.HttpServer.start(HttpServer.scala:61)
at org.apache.spark.HttpFileServer.initialize(HttpFileServer.scala:46)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:320)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:240)
at SimpleApp$.main(SimpleApp.scala:8)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.ClassNotFoundException: javax.servlet.http.HttpServletResponse
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
SimpleApp.scala:8 is the line that instantiates the SparkContext. As someone suggested, I've already added:
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"
but it didn't help. Do you have any ideas? Thanks in advance.
I've just solved this issue myself. You need to change module settings.
Context Menu -> Open Module Settings -> Dependencies
Change the 'scope' of the missing jar from 'Provided' to 'Compile'.
I see that there is a pull request in the Spark project to solve this issue:
https://github.com/apache/spark/pull/4411
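For sbt-based projects there is an equivalent trick, sketched here under the assumption that Spark is declared as "provided" for spark-submit: keep the "provided" scope for packaging, but put those jars back on the classpath for sbt/IDE runs so javax.servlet and the rest of Spark's transitive dependencies are visible:
// Assumed build.sbt (sbt 0.13 syntax); versions are illustrative.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"

// Re-add "provided" dependencies to the run classpath so sbt run (and an IDE
// importing this build) can see them.
run in Compile := Defaults.runTask(
  fullClasspath in Compile,
  mainClass in (Compile, run),
  runner in (Compile, run)
).evaluated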