(run-main-0) scala.ScalaReflectionException: class java.sql.Date in JavaMirror with ClasspathFilter - scala

Hi, I have a file given to me by my teacher. It is about Scala and Spark.
When I run the code, it gives me this exception:
(run-main-0) scala.ScalaReflectionException: class java.sql.Date in
JavaMirror with ClasspathFilter
The file itself looks like this:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object Main {

  type Embedding = (String, List[Double])
  type ParsedReview = (Integer, String, Double)

  org.apache.log4j.Logger getLogger "org" setLevel (org.apache.log4j.Level.WARN)
  org.apache.log4j.Logger getLogger "akka" setLevel (org.apache.log4j.Level.WARN)

  val spark = SparkSession.builder
    .appName("Sentiment")
    .master("local[9]")
    .getOrCreate

  import spark.implicits._

  val reviewSchema = StructType(Array(
    StructField("reviewText", StringType, nullable = false),
    StructField("overall", DoubleType, nullable = false),
    StructField("summary", StringType, nullable = false)))

  // Read the file and merge the text and summary into a single text column
  def loadReviews(path: String): Dataset[ParsedReview] =
    spark
      .read
      .schema(reviewSchema)
      .json(path)
      .rdd
      .zipWithUniqueId
      .map[(Integer, String, Double)] { case (row, id) =>
        (id.toInt, s"${row getString 2} ${row getString 0}", row getDouble 1)
      }
      .toDS
      .withColumnRenamed("_1", "id")
      .withColumnRenamed("_2", "text")
      .withColumnRenamed("_3", "overall")
      .as[ParsedReview]

  // Load the GloVe embeddings file
  def loadGlove(path: String): Dataset[Embedding] =
    spark
      .read
      .text(path)
      .map { _ getString 0 split " " }
      .map(r => (r.head, r.tail.toList.map(_.toDouble))) // yuck!
      .withColumnRenamed("_1", "word")
      .withColumnRenamed("_2", "vec")
      .as[Embedding]

  def main(args: Array[String]) = {
    val glove = loadGlove("Data/glove.6B.50d.txt") // take glove
    val reviews = loadReviews("Data/Electronics_5.json") // FIXME

    // replace the following with the project code
    glove.show
    reviews.show

    spark.stop
  }
}
I need to keep the line
import org.apache.spark.sql.Dataset
because some code depends on it, but it is exactly because of this import that the exception is thrown.
My build.sbt file looks like this:
name := "Sentiment Analysis Project"
version := "1.1"
scalaVersion := "2.11.12"
scalacOptions ++= Seq("-unchecked", "-deprecation")
initialCommands in console :=
"""
import Main._
"""
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-mllib" %
"2.3.0"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" %
"test"

The Scala guide recommends compiling with Java 8:
We recommend using Java 8 for compiling Scala code. Since the JVM is backward compatible, it is usually safe to use a newer JVM to run your code compiled by the Scala compiler for older JVM versions.
Although it's only a recommendation, I found that it fixes the issue you mention.
In order to install Java 8 using Homebrew, it's best to use jenv which will help you handle multiple Java versions should you need to.
brew install jenv
Then run the following to add a tap (repository) of alternative versions of casks, since Java 8 is not in the default tap anymore:
brew tap homebrew/cask-versions
To install Java 8:
brew cask install homebrew/cask-versions/adoptopenjdk8
Run the following to add the previously installed Java version to jenv's list of versions:
jenv add /Library/Java/JavaVirtualMachines/<installed_java_version>/Contents/Home
Finally run
jenv global 1.8
or
jenv local 1.8
to use Java 1.8 globally or locally (in the current folder).
For more information, follow the instructions on jenv's website.
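If you also want sbt itself to fail fast when it is launched on the wrong JVM, one possible guard in build.sbt looks like this (just a sketch: initialize is a standard sbt setting, but the check and message are my own addition):

// Sketch: abort the build early if sbt is not running on Java 8
initialize := {
  val _ = initialize.value
  val required = "1.8"
  val current = sys.props("java.specification.version")
  require(current == required, s"This project expects Java $required, but sbt is running on Java $current")
}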

Related

Java Class not Found Exception while doing Spark-submit Scala using sbt

Here is my code that I wrote in Scala:
package normalisation

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem, Path}

object Seasonality {
  val amplitude_list_c1: Array[Nothing] = Array()
  val amplitude_list_c2: Array[Nothing] = Array()

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Normalization")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val line = "MP"
    val ps = "Test"
    val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line, ps)
    val files = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path(location))
    for (each <- files) {
      var ps_data = sqlContext.read.json(each)
    }
    println(ps_data.show())
  }
}
The error I received when I compiled using sbt package is shown in the attached image.
Here is my build.sbt file
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
In Spark versions 2.x and later you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
You should then also be able to do
val spark:SparkSession = ???
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
spark.read.json(location)
to read all the JSON files in the directory.
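For example, a minimal sketch (the app name, variables, and HDFS path are taken from your snippet and are not verified; note that String.format uses %-style placeholders, so the "{0}"/"{1}" markers would not be substituted, which is why an interpolated string is used here):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Normalization")
  .getOrCreate()

val line = "MP"
val ps = "Test"
// s-interpolation instead of format(), since "{0}"/"{1}" are not Java format specifiers
val location = s"hdfs://ipaddress/user/hdfs/$line/ps/$ps/FS/2018-10-17"

val ps_data = spark.read.json(location) // reads every JSON file under that directory
ps_data.show()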
I think you'd also get another compile error at
for (each <- files) {
  var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
because ps_data is out of scope there.
If you need to use SparkContext for some reason, it should indeed be in spark-core. Have you tried restarting your IDE, cleaning caches, etc.?
EDIT: I just noticed that build.sbt is probably not in the directory you call sbt package from, so sbt won't pick it up.

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run a Spark Scala project in IntelliJ IDEA on a Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object Main {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val session = SparkSession
      .builder()
      .appName("StackOverflowSurvey")
      .master("local[1]")
      .getOrCreate()

    val df = session.read
    val responses = df
      .option("header", true)
      .option("inferSchema", true)
      .csv("2016-stack-overflow-survey-responses.csv")

    responses.printSchema()
  }
}
The code runs perfectly (the schema is properly printed) when I run the Main object as shown in the following image:
My Run Configuration is as follows:
The problem is that when I use "Run the program", it shows a huge stack of errors, which is too large to show here. Please see this gist.
How can I solve this issue?

Why does spark-xml fail with NoSuchMethodError with Spark 2.0.0 dependency?

Hi, I am a noob to Scala and IntelliJ, and I am just trying to do this in Scala:
import org.apache.spark
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml.XmlReader

object SparkSample {
  def main(args: Array[String]): Unit = {
    val conf = new spark.SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")

    val sc = new spark.SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "shop")
      .load("shops.xml") /* NoSuchMethod error here */

    val selectedData = df.select("author", "_id")
    df.show
  }
}
Basically I am trying to convert XML into a Spark DataFrame.
I am getting a NoSuchMethodError at .load("shops.xml").
Below is the SBT file:
version := "0.1"
scalaVersion := "2.11.3"
val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.3.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion exclude("jline", "2.12"),
"org.apache.spark" %% "spark-sql" % sparkVersion excludeAll(ExclusionRule(organization = "jline"),ExclusionRule("name","2.12")),
"com.databricks" %% "spark-xml" % sparkXMLVersion,
)
Below is the trace:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.types.DecimalType$.Unlimited()Lorg/apache/spark/sql/types/DecimalType;
at com.databricks.spark.xml.util.InferSchema$.<init>(InferSchema.scala:50)
at com.databricks.spark.xml.util.InferSchema$.<clinit>(InferSchema.scala)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
Can someone point out the error? It seems like a dependency issue to me.
spark-core seems to be working fine, but spark-sql is not.
I had Scala 2.12 before but changed to 2.11 because spark-core would not resolve.
tl;dr I think it's a Spark version mismatch issue (spark-xml 0.3.3 targets Spark 1.x). Use spark-xml 0.4.1.
Quoting spark-xml's Requirements (highlighting mine):
This library requires Spark 2.0+ for 0.4.x.
For version that works with Spark 1.x, please check for branch-0.3.
That says to me that spark-xml 0.3.3 works with Spark 1.x, not the Spark 2.0.0 that you requested.
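A corrected dependency block would then look roughly like this (a sketch based on your build.sbt, with only the spark-xml version changed as the requirements above suggest; the exclude rules are omitted for brevity):

val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.4.1" // 0.4.x is the branch that targets Spark 2.0+

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.databricks" %% "spark-xml" % sparkXMLVersion
)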

Cannot make Spark run inside a scala worksheet in Intellij Idea

The following code runs with no problems if I put it inside an object that extends the App trait and run it using IDEA's run command.
However, when I try running it from a worksheet, I encounter one of these scenarios:
1- If the first line is present, I get:
Task not serializable: java.io.NotSerializableException:A$A34$A$A34
2- If the first line is commented out, I get:
Unable to generate an encoder for inner class A$A35$A$A35$A12 without
access to the scope that this class was defined in.
//First line!
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class AClass(id: Int, f1: Int, f2: Int)

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("Test App")
  .getOrCreate()

import spark.implicits._

val schema = StructType(Array(
  StructField("id", IntegerType),
  StructField("f1", IntegerType),
  StructField("f2", IntegerType)))

val df = spark.read.schema(schema)
  .option("header", "true")
  .csv("dataset.csv")

// Displays the content of the DataFrame to stdout
df.show()

val ads = df.as[AClass]

//This is the line that causes serialization error
ads.foreach(x => println(x))
The project has been created using Idea's Scala plugin, and this is my build.sbt:
...
scalaVersion := "2.10.6"
scalacOptions += "-unchecked"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.10" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
I tried the solution in this answer, but it does not work for IDEA Ultimate 2017.1, which I am using. Also, when I use worksheets, I prefer not to add an extra object to the worksheet if at all possible.
If I use the collect() method on the Dataset object and get an Array of AClass instances, there are no more errors either. It is working with the DS directly that causes the error.
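For reference, the collect-based workaround mentioned above is roughly this (a sketch; it materializes all rows on the driver, so it only makes sense for small data):

// Pull the rows to the driver as a plain Array[AClass], then work with them locally
val collected: Array[AClass] = ads.collect()
collected.foreach(println)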
Use Eclipse compatibility mode (open Preferences -> type "scala" -> in Languages & Frameworks, choose Scala -> choose Worksheet -> select only "eclipse compatibility mode"). See https://gist.github.com/RAbraham/585939e5390d46a7d6f8

Compilation errors with spark cassandra connector and SBT

I'm trying to get the DataStax Spark Cassandra connector working. I've created a new SBT project in IntelliJ and added a single class. The class and my sbt file are given below. Creating Spark contexts seems to work; however, the moment I uncomment the line where I try to create a cassandraTable, I get the following compilation error:
Error:scalac: bad symbolic reference. A signature in CassandraRow.class refers to term catalyst
in package org.apache.spark.sql which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling CassandraRow.class.
Sbt is kind of new to me, and I would appreciate any help in understanding what this error means (and of course, how to resolve it).
name := "cassySpark1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector-java" % "1.1.0-alpha2" withSources() withJavadoc()
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
And my class:
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.cassandra.query.retry.count", "1")

    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "cassandra-hostname")
      .set("spark.cassandra.username", "cassandra")
      .set("spark.cassandra.password", "cassandra")

    val sc = new SparkContext("local", "testingCassy", conf)

    //val foo = sc.cassandraTable("keyspace name", "table name")
    val rdd = sc.parallelize(1 to 100)
    val sum = rdd.reduce(_ + _)
    println(sum)
  }
}
You need to add spark-sql to the dependencies list:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"
Add the library dependency in your project's pom.xml file. It seems they have changed the location of the Vector.class dependencies in the new refactoring.