Runtime error on Scala Spark 2.0 code

I have the following code:
import org.apache.spark.sql.SparkSession
.
.
.
val spark = SparkSession
.builder()
.appName("PTAMachineLearner")
.getOrCreate()
When it executes, I get the following error:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:750)
at org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:741)
at com.acme.pta.accuracy.ml.PTAMachineLearnerModel.getDF(PTAMachineLearnerModel.scala:52)
The code compiles and builds just fine. Here are the dependencies:
scalaVersion := "2.11.11"
libraryDependencies ++= Seq(
// Spark dependencies
"org.apache.spark" %% "spark-hive" % "2.1.1",
"org.apache.spark" %% "spark-mllib" % "2.1.1",
// Third-party libraries
"net.sf.jopt-simple" % "jopt-simple" % "5.0.3",
"com.amazonaws" % "aws-java-sdk" % "1.3.11",
"org.apache.logging.log4j" % "log4j-api" % "2.8.2",
"org.apache.logging.log4j" % "log4j-core" % "2.8.2",
"org.apache.logging.log4j" %% "log4j-api-scala" % "2.8.2",
"com.typesafe.play" %% "play-ahc-ws-standalone" % "1.0.0-M9",
"net.liftweb" % "lift-json_2.11" % "3.0.1"
)
I am executing the code like this:
/Users/paulreiners/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--class "com.acme.pta.accuracy.ml.CreateRandomForestRegressionModel" \
--master local[4] \
target/scala-2.11/acme-pta-accuracy-ocean.jar \
I had this all running with Spark 1.6. I'm trying to upgrade to Spark 2, but am missing something.

The class ArrowAssoc is indeed present in your Scala library. See this Scala doc. But you are getting the error inside the Spark library, so the Spark build you are running is evidently not compatible with Scala 2.11; it was probably compiled against an older Scala version. If you look at this older Scala API doc, ArrowAssoc has changed a lot, e.g. it is now implicit with lots of implicit dependencies. Make sure your Spark and Scala versions are compatible.
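For Spark 2.1.x the prebuilt binaries are compiled against Scala 2.11, so the sbt build should stay on the same 2.11.x line. A minimal sketch of matching build settings (marking the Spark artifacts as "provided" is an assumption that fits a jar launched through spark-submit):
scalaVersion := "2.11.11"
libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix (_2.11), matching the Spark 2.1.1 install
  "org.apache.spark" %% "spark-hive"  % "2.1.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.1" % "provided"
)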

I found the problem. I had Scala 2.10.5 installed on my system, so either sbt or spark-submit was picking that up while the build expected 2.11.11.

I had the same issue. In my case, the problem was that I deployed the jar to a Spark 1.x cluster whereas the code was written for Spark 2.x.
So, if you see this error, just check the versions of Spark and Scala used in your code against the respective installed versions.
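A quick check is to print, from inside the job itself, which Scala and Spark versions are actually on the runtime classpath; a minimal sketch:
// prints the Scala standard library version loaded at runtime, e.g. "version 2.11.11"
println("Scala runtime: " + scala.util.Properties.versionString)
// prints the version of the Spark jars the application is actually running against
println("Spark runtime: " + org.apache.spark.SPARK_VERSION)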

Related

fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found

I'm trying to read data from S3 with Spark, using the following dependencies and configuration:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.2.1"
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", config.s3AccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", config.s3SecretKey)
spark.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
I'm getting this error:
java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider not found
It was working fine with older versions of Spark and Hadoop. To be exact, I was previously using Spark 2.4.8 and Hadoop 2.8.5.
I was looking forward to using the latest EMR release with Spark 3.2.0 and Hadoop 3.2.1. The issue was caused mainly by Hadoop 3.2.1, so the only option was to fall back to an older EMR release. Spark 2.4.8 and Hadoop 2.10.1 worked for me.
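If staying on Spark 3.2.0 is an option, another approach worth trying is to keep hadoop-aws in lock-step with the hadoop-client version the Spark build actually pulls in; a hedged sketch, assuming Spark 3.2.0 is built against Hadoop 3.3.1:
// assumption: Spark 3.2.0 bundles Hadoop 3.3.1, so every Hadoop artifact is
// pinned to that version to avoid mixing 3.2.x and 3.3.x classes on the classpath
val hadoopVersion = "3.3.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % hadoopVersion
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion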

NoSuchMethodError while creating spark session

I am new to Spark. I am just trying to create a Spark session locally, but I am getting the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.internal.config.package$.SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD()Lorg/apache/spark/internal/config/ConfigEntry;
at org.apache.spark.sql.internal.SQLConf$.<init>(SQLConf.scala:1011)
at org.apache.spark.sql.internal.SQLConf$.<clinit>(SQLConf.scala)
at org.apache.spark.sql.internal.StaticSQLConf$.<init>(StaticSQLConf.scala:31)
at org.apache.spark.sql.internal.StaticSQLConf$.<clinit>(StaticSQLConf.scala)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:938)
A similar error has been posted here: Error while using SparkSession or sqlcontext
I am using the same version for spark-core and spark-sql. Here is my build.sbt:
libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.3.1" % "provided")
libraryDependencies += ("org.apache.spark" %% "spark-sql" % "2.3.1" % "provided")
I am using Scala version 2.11.8.
Can someone explain why I am still getting this error and how to correct it?
If you see a NoSuchMethodError with a signature like Lorg/… in the log, it is usually due to a Spark version mismatch. Do you have Spark 2.3.1 installed on your system? Make sure that the dependencies match your local or cluster Spark version.
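A common way to keep the build and the installed Spark from drifting apart is to declare the version once in build.sbt; a minimal sketch, assuming the local or cluster install really is Spark 2.3.1:
// single source of truth for the Spark version the jar will run against
val sparkVersion = "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"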

How should I log from my custom Spark JAR

Scala/JVM noob here that wants to understand more about logging, specifically when using Apache Spark.
I have written a library in Scala that depends upon a bunch of Spark libraries, here are my dependencies:
import sbt._
object Dependencies {
object Version {
val spark = "2.2.0"
val scalaTest = "3.0.0"
}
val deps = Seq(
"org.apache.spark" %% "spark-core" % Version.spark,
"org.scalatest" %% "scalatest" % Version.scalaTest,
"org.apache.spark" %% "spark-hive" % Version.spark,
"org.apache.spark" %% "spark-sql" % Version.spark,
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
"ch.qos.logback" % "logback-core" % "1.2.3",
"ch.qos.logback" % "logback-classic" % "1.2.3",
"com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
"com.typesafe" % "config" % "1.3.2"
)
val exc = Seq(
ExclusionRule("org.slf4j", "slf4j-log4j12")
)
}
(admittedly I copied a lot of this from elsewhere).
I am able to package my code as a JAR using sbt package which I can then call from Spark by placing the JAR into ${SPARK_HOME}/jars. This is working great.
I now want to implement logging from my code so I do this:
import com.typesafe.scalalogging.Logger
/*
* stuff stuff stuff
*/
val logger : Logger = Logger("name")
logger.info("stuff")
However, when I try to call my library (which I'm doing from Python, not that I think that's relevant here) I get an error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.company.package.class.function.
E : java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger$
Clearly this is because the com.typesafe.scala-logging library is not in my JAR. I know I could solve this by packaging with sbt assembly, but I don't want to do that because it will include all the other dependencies and cause my JAR to be enormous.
Is there a way to selectively include libraries (com.typesafe.scala-logging in this case) in my JAR? Alternatively, should I be attempting to log using another method, perhaps using a logger that is included with Spark?
Thanks to pasha701 in the comments, I attempted packaging my dependencies by using sbt assembly rather than sbt package:
import sbt._
object Dependencies {
object Version {
val spark = "2.2.0"
val scalaTest = "3.0.0"
}
val deps = Seq(
"org.apache.spark" %% "spark-core" % Version.spark % Provided,
"org.scalatest" %% "scalatest" % Version.scalaTest,
"org.apache.spark" %% "spark-hive" % Version.spark % Provided,
"org.apache.spark" %% "spark-sql" % Version.spark % Provided,
"com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test",
"ch.qos.logback" % "logback-core" % "1.2.3",
"ch.qos.logback" % "logback-classic" % "1.2.3",
"com.typesafe.scala-logging" %% "scala-logging" % "3.8.0",
"com.typesafe" % "config" % "1.3.2"
)
val exc = Seq(
ExclusionRule("org.slf4j", "slf4j-log4j12")
)
}
Unfortunately, even with the Spark dependencies marked as Provided, my JAR went from 324K to 12M, so I opted to use println() instead. Here is my commit message:
log using println
I went with the println option because it keeps the size of the JAR small.
I trialled use of com.typesafe.scalalogging.Logger but my tests failed with error:
java.lang.NoClassDefFoundError: com/typesafe/scalalogging/Logger
because that isn't provided with Spark. I attempted to use sbt assembly
instead of sbt package but this caused the size of the JAR to go from
324K to 12M, even with spark dependencies set to Provided. A 12M JAR
isn't worth the trade-off just to use scalaLogging, hence using println
instead.
I note that pasha701 suggested using log4j instead, as that is provided with Spark, so I shall try that next. Any advice on using log4j from Scala when writing a Spark library would be much appreciated.
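For reference, a minimal sketch of what that could look like with the log4j 1.x API that Spark 2.x ships with (the object and logger names here are placeholders):
import org.apache.log4j.LogManager

object MyJob {
  // log4j is already on Spark's classpath, so nothing extra needs to be packaged
  private val log = LogManager.getLogger(getClass.getName)

  def run(): Unit = {
    log.info("stuff")
  }
}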
As you said, sbt assembly will include all the dependencies in your jar.
If you only want certain ones, you have two options (illustrated below):
Download logback-core and logback-classic and add them via the --jars option of spark2-submit
Specify the above dependencies via the --packages option of spark2-submit
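For illustration (the jar paths and class name are placeholders), the two options might look like this:
# option 1: ship the jars explicitly
spark2-submit --class com.company.MyJob \
  --jars /path/to/logback-core-1.2.3.jar,/path/to/logback-classic-1.2.3.jar \
  myjob.jar
# option 2: let Spark resolve them from Maven coordinates
spark2-submit --class com.company.MyJob \
  --packages ch.qos.logback:logback-core:1.2.3,ch.qos.logback:logback-classic:1.2.3 \
  myjob.jar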

Using spark ML models outside of spark [hdfs DistributedFileSystem could not be instantiated]

I've been trying to follow along with this blog post:
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
Using Spark 2.1 with built-in Hadoop 2.7, running locally, I can save a model:
trainedModel.save("mymodel.model")
However, if I try to load the model from a regular Scala (sbt) shell, HDFS fails to load.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.{PipelineModel, Predictor}
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))
val model = PipelineModel.load("mymodel.model")
I get this error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.DistributedFileSystem could not be instantiated
Is it in fact possible to use a spark model without calling spark-submit, or spark-shell? The article I linked to was the only one I'd seen mentioning such functionality.
My build.sbt is using the following dependencies:
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
In both cases I am using Scala 2.11.8.
Edit: Okay, it looks like including this was the source of the problem:
"org.apache.hadoop" % "hadoop-hdfs" % "2.7.0"
I removed that line and the problem went away.
Try:
trainedModel.write.overwrite().save("mymodel.model")
Also, if your model is saved locally, you can drop the HDFS dependency from your configuration. That should prevent Spark from attempting to instantiate HDFS.
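A minimal sketch of that setup, assuming the model directory really is on the local filesystem (the path is a placeholder); the explicit file:// scheme keeps Spark off HDFS entirely:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.PipelineModel

// local-only context; no hadoop-hdfs dependency is needed for a file:// path
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("myApp"))

// load the previously saved pipeline model from a local directory
val model = PipelineModel.load("file:///tmp/mymodel.model")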

Saving data from Spark to Cassandra results in java.lang.ClassCastException

I'm trying to save data from Spark to Cassandra in Scala using saveToCassandra for an RDD or save with a dataframe (both result in the same error). The full message is:
java.lang.ClassCastException:
com.datastax.driver.core.DefaultResultSetFuture cannot be cast to
com.google.common.util.concurrent.ListenableFuture
I've followed along with the code here and still seem to get the error.
I'm using Cassandra 3.6, Spark 1.6.1, and spark-cassandra-connector 1.6. Let me know if there's anything else I can provide to help with the debugging.
I had a similar exception and fixed it by changing the Scala version in build.sbt:
scalaVersion := "2.10.6"
and library dependencies:
libraryDependencies ++= Seq(
"com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0",
"com.datastax.cassandra" % "cassandra-driver-core" % "3.0.2",
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)
With this configuration, the example from the 5-minute quick start guide works fine.
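For context, a minimal sketch of the kind of write the quick start guide does with this setup (the keyspace and table names are placeholders, and the table must already exist in Cassandra):
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// point the connector at a local Cassandra node
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CassandraSaveExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// write (key, value) pairs into an existing table test.kv (key text PRIMARY KEY, value int)
val rdd = sc.parallelize(Seq(("key1", 1), ("key2", 2)))
rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))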