Cannot make Spark run inside a Scala worksheet in IntelliJ IDEA

The following code runs with no problems if I put it inside an object which extends the App trait and run it using Idea's run command.
However, when I try running it from a worksheet, I encounter one of these scenarios:
1- If the first line is present, I get:
Task not serializable: java.io.NotSerializableException:A$A34$A$A34
2- If the first line is commented out, I get:
Unable to generate an encoder for inner class A$A35$A$A35$A12 without
access to the scope that this class was defined in.
//First line!
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
case class AClass(id: Int, f1: Int, f2: Int)
val spark = SparkSession.builder()
.master("local[*]")
.appName("Test App")
.getOrCreate()
import spark.implicits._
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("f1", IntegerType),
StructField("f2", IntegerType)))
val df = spark.read.schema(schema)
.option("header", "true")
.csv("dataset.csv")
// Displays the content of the DataFrame to stdout
df.show()
val ads = df.as[AClass]
//This is the line that causes serialization error
ads.foreach(x => println(x))
The project has been created using Idea's Scala plugin, and this is my build.sbt:
...
scalaVersion := "2.10.6"
scalacOptions += "-unchecked"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "2.1.0",
"org.apache.spark" % "spark-sql_2.10" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
I tried the solution in this answer, but it does not work in IDEA Ultimate 2017.1, which I am using. Also, when I use worksheets, I prefer not to add an extra object to the worksheet if at all possible.
If I call collect() on the Dataset and get an Array of AClass instances, there are no errors either. It is working with the Dataset directly that causes the error.

Use Eclipse compatibility mode: open Preferences, type "scala" in the search box, choose Scala under Languages & Frameworks, open the Worksheet tab, and select only "eclipse compatibility mode". See https://gist.github.com/RAbraham/585939e5390d46a7d6f8

Related

Exception while running hive support on Spark: Unable to instantiate SparkSession with Hive support because Hive classes are not found

Hello, I am trying to use Hive with Spark, but when I run my program it shows this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
This is my source code
package com.spark.hiveconnect
import java.io.File
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
object sourceToHIve {
case class Record(key: Int, value: String)
def main(args: Array[String]){
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("LOAD DATA LOCAL INPATH '/usr/local/spark3/examples/src/main/resources/kv1.txt' INTO TABLE src")
sql("SELECT * FROM src").show()
spark.close()
}
}
This is my build.sbt file.
name := "SparkHive"
version := "0.1"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
I also have Hive running in the console.
Can anyone help me with this?
Thank you.
Try adding:
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.4.5"
The main problem is that the class "org.apache.hadoop.hive.conf.HiveConf" cannot be loaded.
You can insert the following code to test this:
Class.forName("org.apache.hadoop.hive.conf.HiveConf",true,
Thread.currentThread().getContextClassLoader);
If the class is missing, this line throws an error.
To be exact, the fundamental problem is that your build may be missing Spark's Hive support.
Check for the following dependency (the Scala suffix and the version should match the rest of your build):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.3</version>
</dependency>
The class "org.apache.hadoop.hive.conf.HiveConf" is pulled in by this dependency.
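The class-loading test suggested above can be wrapped so that it reports a result instead of throwing. This is only a minimal sketch in plain Scala; the object and method names (HiveClasspathCheck, isOnClasspath) are made up for illustration and are not part of Spark or Hive.

```scala
import scala.util.Try

// Hypothetical helper, not part of Spark: reports whether a class can be loaded.
object HiveClasspathCheck {

  // Returns true if the named class is visible to the context class loader.
  // `initialize = false` avoids running the class's static initializers.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className, false, Thread.currentThread().getContextClassLoader)).isSuccess

  def main(args: Array[String]): Unit = {
    val hiveConf = "org.apache.hadoop.hive.conf.HiveConf"
    if (isOnClasspath(hiveConf))
      println(s"$hiveConf found: Hive classes are on the classpath")
    else
      println(s"$hiveConf not found: the spark-hive dependency is probably missing")
  }
}
```

Running this from the same sbt project, before calling enableHiveSupport(), should show whether the build actually pulls in the Hive classes.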

(run-main-0) scala.ScalaReflectionException: class java.sql.Date in JavaMirror with ClasspathFilter

Hi, I have a file given to me by my teacher. It is about Scala and Spark.
When I run the code it gives me this exception:
(run-main-0) scala.ScalaReflectionException: class java.sql.Date in
JavaMirror with ClasspathFilter
The file itself looks like this:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object Main {
type Embedding = (String, List[Double])
type ParsedReview = (Integer, String, Double)
org.apache.log4j.Logger getLogger "org" setLevel
(org.apache.log4j.Level.WARN)
org.apache.log4j.Logger getLogger "akka" setLevel
(org.apache.log4j.Level.WARN)
val spark = SparkSession.builder
.appName ("Sentiment")
.master ("local[9]")
.getOrCreate
import spark.implicits._
val reviewSchema = StructType(Array(
StructField ("reviewText", StringType, nullable=false),
StructField ("overall", DoubleType, nullable=false),
StructField ("summary", StringType, nullable=false)))
// Read file and merge the text and summary into a single text column
def loadReviews (path: String): Dataset[ParsedReview] =
spark
.read
.schema (reviewSchema)
.json (path)
.rdd
.zipWithUniqueId
.map[(Integer,String,Double)] { case (row,id) => (id.toInt, s"${row getString 2} ${row getString 0}", row getDouble 1) }
.toDS
.withColumnRenamed ("_1", "id" )
.withColumnRenamed ("_2", "text")
.withColumnRenamed ("_3", "overall")
.as[ParsedReview]
// Load the GLoVe embeddings file
def loadGlove (path: String): Dataset[Embedding] =
spark
.read
.text (path)
.map { _ getString 0 split " " }
.map (r => (r.head, r.tail.toList.map (_.toDouble))) // yuck!
.withColumnRenamed ("_1", "word" )
.withColumnRenamed ("_2", "vec")
.as[Embedding]
def main(args: Array[String]) = {
val glove = loadGlove ("Data/glove.6B.50d.txt") // take glove
val reviews = loadReviews ("Data/Electronics_5.json") // FIXME
// replace the following with the project code
glove.show
reviews.show
spark.stop
}
}
I need to keep the line
import org.apache.spark.sql.Dataset
because some code depends on it, but it is exactly this line that causes the exception to be thrown.
My build.sbt file looks like this:
name := "Sentiment Analysis Project"
version := "1.1"
scalaVersion := "2.11.12"
scalacOptions ++= Seq("-unchecked", "-deprecation")
initialCommands in console :=
"""
import Main._
"""
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.3.0"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
The Scala guide recommends that you compile with Java 8:
We recommend using Java 8 for compiling Scala code. Since the JVM is backward compatible, it is usually safe to use a newer JVM to run your code compiled by the Scala compiler for older JVM versions.
Although it's only a recommendation, I found it to fix the issue you mention.
In order to install Java 8 using Homebrew, it's best to use jenv which will help you handle multiple Java versions should you need to.
brew install jenv
Then run the following to add a tap (repository) of alternative versions of casks, since Java 8 is not in the default tap anymore:
brew tap homebrew/cask-versions
To install Java 8:
brew cask install homebrew/cask-versions/adoptopenjdk8
Run the following to add the previously installed Java version to jenv's list of versions:
jenv add /Library/Java/JavaVirtualMachines/<installed_java_version>/Contents/Home
Finally run
jenv global 1.8
or
jenv local 1.8
to use Java 1.8 globally or locally (in the current folder).
For more information, follow the instructions at jenv's website.
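To confirm which JVM the project actually runs on after switching versions, a small check like the following may help. It is a sketch in plain Scala; the object name JavaVersionCheck and the helper majorVersion are invented for illustration.

```scala
// Hypothetical helper: prints the Java version of the JVM running this code.
object JavaVersionCheck {

  // Extracts the major version from strings such as "1.8.0_292" (Java 8)
  // or "11.0.2" (Java 11). Legacy versions start with "1.".
  def majorVersion(version: String): Int = {
    val parts = version.split("[._-]")
    if (parts.head == "1" && parts.length > 1) parts(1).toInt
    else parts.head.toInt
  }

  def main(args: Array[String]): Unit = {
    val version = System.getProperty("java.version")
    println(s"java.version = $version (major ${majorVersion(version)})")
    if (majorVersion(version) != 8)
      println("Not running on Java 8; the reflection error may reappear.")
  }
}
```

Running it with sbt (for example via runMain) rather than directly from a shell shows the JVM that sbt itself picked up, which is the one that matters here.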

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run a Spark Scala project in IntelliJ IDEA on a Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
object Main {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR)
val session = SparkSession
.builder()
.appName("StackOverflowSurvey")
.master("local[1]")
.getOrCreate()
val df = session.read
val responses = df
.option("header", true)
.option("inferSchema", true)
.csv("2016-stack-overflow-survey-responses.csv")
responses.printSchema()
}
}
The code runs perfectly (the schema is properly printed) when I run the Main object as shown in the following image:
My Run Configuration is as follows:
The problem is that when I run "Run the program", it shows a stack trace that is too large to include here. Please see this gist.
How can I solve this issue?

Why does spark-xml fail with NoSuchMethodError with Spark 2.0.0 dependency?

Hi, I am new to Scala and IntelliJ, and I am just trying to do this in Scala:
import org.apache.spark
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml.XmlReader
object SparkSample {
def main(args: Array[String]): Unit = {
val conf = new spark.SparkConf()
conf.setAppName("Datasets Test")
conf.setMaster("local[2]")
val sc = new spark.SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "shop")
.load("shops.xml") /* NoSuchMethod error here */
val selectedData = df.select("author", "_id")
df.show
}
}
Basically, I am trying to convert XML into a Spark DataFrame.
I am getting a NoSuchMethodError at '.load("shops.xml")'.
Below is the build.sbt:
version := "0.1"
scalaVersion := "2.11.3"
val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.3.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion exclude("jline", "2.12"),
"org.apache.spark" %% "spark-sql" % sparkVersion excludeAll(ExclusionRule(organization = "jline"),ExclusionRule("name","2.12")),
"com.databricks" %% "spark-xml" % sparkXMLVersion,
)
Below is the trace:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.types.DecimalType$.Unlimited()Lorg/apache/spark/sql/types/DecimalType;
at com.databricks.spark.xml.util.InferSchema$.<init>(InferSchema.scala:50)
at com.databricks.spark.xml.util.InferSchema$.<clinit>(InferSchema.scala)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:46)
at scala.Option.getOrElse(Option.scala:120)
at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:45)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
Can someone point out the error? It seems like a dependency issue to me.
spark-core seems to be working fine, but not spark-sql.
I had Scala 2.12 before but changed to 2.11 because spark-core could not be resolved.
tl;dr I think it's a Spark version mismatch issue. Use spark-xml 0.4.1.
Quoting spark-xml's Requirements (highlighting mine):
This library requires Spark 2.0+ for 0.4.x.
For version that works with Spark 1.x, please check for branch-0.3.
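That says to me that spark-xml 0.3.3 works with Spark 1.x (not the Spark 2.0.0 that you requested). The trace fits this: spark-xml 0.3.3 calls DecimalType.Unlimited, which is not present in Spark 2.0.0, hence the NoSuchMethodError.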
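As a sketch, the dependency block from the question with only the spark-xml version changed to the one suggested above might look like this. The jline exclusions from the original build are left out here for brevity, which is an assumption; keep them if your build needs them.

```scala
// build.sbt (sketch): only sparkXMLVersion differs from the question's build.
val sparkVersion = "2.0.0"
val sparkXMLVersion = "0.4.1" // the 0.4.x line requires Spark 2.0+

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion,
  "com.databricks"   %% "spark-xml"  % sparkXMLVersion
)
```

After changing the version, reload the sbt project in IntelliJ so the old spark-xml jar is dropped from the classpath.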

How does DataStax Spark Cassandra connector create SparkContext?

I have run the following Spark test program successfully. In this program I notice the "cassandraTable" and "getOrCreate" methods used with the SparkContext class, but I am not able to find them in the Spark Scala API docs for this class. What am I missing in understanding this code? I am trying to understand how this SparkContext is different when the DataStax connector is in sbt.
Code -
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
object CassandraInt {
def main(args:Array[String]){
val SparkMasterHost = "127.0.0.1"
val CassandraHost = "127.0.0.1"
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.set("spark.cleaner.ttl", "3600")
.setMaster("local[12]")
.setAppName(getClass.getSimpleName)
// Connect to the Spark cluster:
lazy val sc = SparkContext.getOrCreate(conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.map(_.getInt("value")).sum)
}}
The build.sbt file I used is -
name := "Test Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
addCommandAlias("c1", "run-main CassandraInt")
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"
fork in run := true
It is not different. Spark supports only one active SparkContext per JVM, and getOrCreate is a method defined on the SparkContext companion object. Its documentation says:
This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
This method allows not passing a SparkConf (useful if just retrieving).
To summarize:
If there is an active context it returns it.
Otherwise it creates a new one.
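The get-or-create behaviour summarized above can be sketched in plain Scala. This is not Spark's implementation; Context and GetOrCreateSketch are hypothetical stand-ins used only to show the singleton pattern.

```scala
// Hypothetical stand-in for a companion object holding a single active context.
object GetOrCreateSketch {

  // Plays the role of SparkContext in this sketch.
  final class Context(val appName: String)

  // The one active context for this JVM, if any.
  private var active: Option[Context] = None

  // Returns the active context if there is one; otherwise creates,
  // registers, and returns a new one. The by-name parameter is only
  // evaluated when a new context has to be created.
  def getOrCreate(appName: => String): Context = this.synchronized {
    active.getOrElse {
      val created = new Context(appName)
      active = Some(created)
      created
    }
  }

  def main(args: Array[String]): Unit = {
    val a = getOrCreate("first")
    val b = getOrCreate("second") // ignored: a context already exists
    println(a eq b)
    println(b.appName)
  }
}
```

The second call returns the same instance as the first, which mirrors why the settings passed to a later getOrCreate do not replace an already active context.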
cassandraTable is a method of SparkContextFunctions, which the connector exposes on SparkContext through an implicit conversion brought into scope by import com.datastax.spark.connector._.
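The mechanism can be illustrated without Spark or Cassandra. In the sketch below every name (Context, ContextFunctions, connector, tableName) is hypothetical: Context plays the role of SparkContext, ContextFunctions the role of SparkContextFunctions, and the nested connector object the role of the connector's package import.

```scala
import scala.language.implicitConversions

object ImplicitExtensionSketch {

  // Stand-in for a library class we cannot modify (role of SparkContext).
  class Context(val name: String)

  // Stand-in for the wrapper that carries the extra method
  // (role of SparkContextFunctions and its cassandraTable method).
  class ContextFunctions(ctx: Context) {
    def tableName(keyspace: String, table: String): String =
      s"${ctx.name}:$keyspace.$table"
  }

  // Stand-in for the connector package: importing its members
  // brings the implicit conversion into scope.
  object connector {
    implicit def toContextFunctions(ctx: Context): ContextFunctions =
      new ContextFunctions(ctx)
  }

  // tableName is not defined on Context; the compiler rewrites the call
  // to toContextFunctions(ctx).tableName(...) because of the import.
  def demo(): String = {
    import connector._
    val ctx = new Context("local")
    ctx.tableName("test", "kv")
  }

  def main(args: Array[String]): Unit =
    println(demo()) // prints "local:test.kv"
}
```

Without the import of connector._ the call to tableName would not compile, which is why the method does not appear in the API docs of the class itself.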