I am trying to write some Apache Spark code in Scala, but I am stuck in 'Provider Hell'
import org.apache.spark
import org.apache.spark.ml.fpm.FPGrowth
@main def example1() = {
println("Example 1")
val dataset = spark.createDataset(Seq(
"1 2 5",
"1 2 3 5",
"1 2")
).map(t => t.split(" ")).toDF("items")
val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpGrowth.fit(dataset)
// Display frequent itemsets.
model.freqItemsets.show()
// Display generated association rules.
model.associationRules.show()
// transform examines the input items against all the association rules and summarize the
// consequents as prediction
model.transform(dataset).show()
}
When I try to compile this, I get:
sbt:fp-laboratory> compile
[info] compiling 2 Scala sources to /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/target/scala-3.1.2/classes ...
[error] -- [E008] Not Found Error: /Users/eric.kolotyluk/git/autonomous-iam/poc/fp-laboratory/src/main/scala/Example1.scala:7:22
[error] 7 | val dataset = spark.createDataset(Seq(
[error] | ^^^^^^^^^^^^^^^^^^^
[error] | value createDataset is not a member of org.apache.spark
[error] one error found
[error] (Compile / compileIncremental) Compilation failed
[error] Total time: 3 s, completed May 30, 2022, 2:13:18 PM
but this is not really the problem. IntelliJ complains about line 1 of my code, import org.apache.spark, highlighting import so that when I mouse over it I get
'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file
which means absolutely nothing to me. I have no idea why it's looking for that. However, looking at my build.sbt file:
ThisBuild / organization := "com.forgerock"
ThisBuild / scalaVersion := "3.1.2"
ThisBuild / version := "0.1.0-SNAPSHOT"
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies ++= Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.0" % "provided").cross(CrossVersion.for3Use2_13)
)
)
// include the 'provided' Spark dependency on the classpath for `sbt run`
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
When I remove "provided" from the build.sbt file, IntelliJ stops complaining about
'/private/var/folders/h0/9w1gfn9j1qvgs5b_q9c16bj40000gp/T/fp-laboratory-fp-laboratory-target' does not exist or is not a directory or .jar file
and only complains about value createDataset is not a member of org.apache.spark
I suspect there is something fishy about provided scope, as until now, I have zero experience with this.
Consequently, Provider Hell.
My Scala code runs fine when I use
/opt/homebrew/Cellar/apache-spark/3.2.1/bin/spark-shell
So, I suspect there is some super secret magic required to run a Scala program under IntelliJ or SBT to say where spark-mllib and spark-sql are, but I am no wizard.
Can someone please tell me what the provider magic is so I can get out of hell?
Well actually, that's not related to the provided scope. As you mentioned, the code works fine when you use spark-shell, right? That's because spark-shell predefines a value named spark, which is an instance of org.apache.spark.sql.SparkSession and does have a method called createDataset. In your IntelliJ code, however, spark refers to the package org.apache.spark (imported right above), and a package has no createDataset method. Do you see where this is going? Define a SparkSession value and use it in your code, something like this:
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
@main def example1() = {
println("Example 1")
val spark = SparkSession.builder()
/* do your configurations here, like setting master, ... */
.getOrCreate()
import spark.implicits._ // provides the implicit Encoders needed by createDataset and map below
val dataset = spark.createDataset(Seq(
"1 2 5",
"1 2 3 5",
"1 2")
).map(t => t.split(" ")).toDF("items")
// other stuff
}
Now this should work fine.
Okay, it was not Provider Hell. Although I still don't understand provided scope well, I was not setting up my Spark app properly: I did not realize that spark-shell automatically does a lot of work...
In my code I needed to explicitly create a SparkSession first...
// imports needed for the snippets below
import org.apache.spark.sql.SparkSession
import scala.util.Using

def newSparkSession = SparkSession
.builder
.appName("Simple Application")
.config("spark.master", "local")
.getOrCreate()
then
Using(newSparkSession) { sparkSession =>
import sparkSession.implicits._
val dataset = sparkSession.createDataset(Seq(
"1 2 5",
"1 2 3 5",
"1 2")
).map(t => t.split(" ")).toDF("items")
val fpGrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpGrowth.fit(dataset)
// Display frequent itemsets.
model.freqItemsets.show()
// Display generated association rules.
model.associationRules.show()
// transform examines the input items against all the association rules and summarize the
// consequents as prediction
model.transform(dataset).show()
}
Also, the correct build.sbt is
lazy val root = (project in file("."))
.settings(
name := "fp-laboratory",
libraryDependencies ++= Seq(
("org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13),
("org.apache.spark" %% "spark-sql" % "3.2.1" % "provided").cross(CrossVersion.for3Use2_13)
),
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
)
Now I am onto my next Scala problem...
Related
Hi, I have a file given to me by my teacher. It is about Scala and Spark.
When I run the code it gives me this exception:
(run-main-0) scala.ScalaReflectionException: class java.sql.Date in JavaMirror with ClasspathFilter
The file itself looks like this:
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object Main {
type Embedding = (String, List[Double])
type ParsedReview = (Integer, String, Double)
org.apache.log4j.Logger getLogger "org" setLevel (org.apache.log4j.Level.WARN)
org.apache.log4j.Logger getLogger "akka" setLevel (org.apache.log4j.Level.WARN)
val spark = SparkSession.builder
.appName ("Sentiment")
.master ("local[9]")
.getOrCreate
import spark.implicits._
val reviewSchema = StructType(Array(
StructField ("reviewText", StringType, nullable=false),
StructField ("overall", DoubleType, nullable=false),
StructField ("summary", StringType, nullable=false)))
// Read file and merge the text and summary into a single text column
def loadReviews (path: String): Dataset[ParsedReview] =
spark
.read
.schema (reviewSchema)
.json (path)
.rdd
.zipWithUniqueId
.map[(Integer,String,Double)] { case (row,id) => (id.toInt, s"${row getString 2} ${row getString 0}", row getDouble 1) }
.toDS
.withColumnRenamed ("_1", "id" )
.withColumnRenamed ("_2", "text")
.withColumnRenamed ("_3", "overall")
.as[ParsedReview]
// Load the GLoVe embeddings file
def loadGlove (path: String): Dataset[Embedding] =
spark
.read
.text (path)
.map { _ getString 0 split " " }
.map (r => (r.head, r.tail.toList.map (_.toDouble))) // yuck!
.withColumnRenamed ("_1", "word" )
.withColumnRenamed ("_2", "vec")
.as[Embedding]
def main(args: Array[String]) = {
val glove = loadGlove ("Data/glove.6B.50d.txt") // take glove
val reviews = loadReviews ("Data/Electronics_5.json") // FIXME
// replace the following with the project code
glove.show
reviews.show
spark.stop
}
}
I need to keep the line
import org.apache.spark.sql.Dataset
because some code depends on it, but it is exactly because of it that the exception is thrown.
My build.sbt file looks like this:
name := "Sentiment Analysis Project"
version := "1.1"
scalaVersion := "2.11.12"
scalacOptions ++= Seq("-unchecked", "-deprecation")
initialCommands in console :=
"""
import Main._
"""
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-mllib" %
"2.3.0"
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" %
"test"
The Scala guide recommends compiling with Java 8:
We recommend using Java 8 for compiling Scala code. Since the JVM is backward compatible, it is usually safe to use a newer JVM to run your code compiled by the Scala compiler for older JVM versions.
Although it's only a recommendation, I found it to fix the issue you mention.
In order to install Java 8 using Homebrew, it's best to use jenv which will help you handle multiple Java versions should you need to.
brew install jenv
Then run the following to add a tap (repository) of alternative versions of casks, since Java 8 is not in the default tap anymore:
brew tap homebrew/cask-versions
To install Java 8:
brew cask install homebrew/cask-versions/adoptopenjdk8
Run the following to add the previously installed Java version to jenv's list of versions:
jenv add /Library/Java/JavaVirtualMachines/<installed_java_version>/Contents/Home
Finally run
jenv global 1.8
or
jenv local 1.8
to use Java 1.8 globally or locally (in the current folder).
For more information, follow the instructions on jenv's website.
The following code runs with no problems if I put it inside an object which extends the App trait and run it using Idea's run command.
However, when I try running it from a worksheet, I encounter one of these scenarios:
1- If the first line is present, I get:
Task not serializable: java.io.NotSerializableException:A$A34$A$A34
2- If the first line is commented out, I get:
Unable to generate an encoder for inner class A$A35$A$A35$A12 without access to the scope that this class was defined in.
//First line!
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
case class AClass(id: Int, f1: Int, f2: Int)
val spark = SparkSession.builder()
.master("local[*]")
.appName("Test App")
.getOrCreate()
import spark.implicits._
val schema = StructType(Array(
StructField("id", IntegerType),
StructField("f1", IntegerType),
StructField("f2", IntegerType)))
val df = spark.read.schema(schema)
.option("header", "true")
.csv("dataset.csv")
// Displays the content of the DataFrame to stdout
df.show()
val ads = df.as[AClass]
//This is the line that causes serialization error
ads.foreach(x => println(x))
The project has been created using Idea's Scala plugin, and this is my build.sbt:
...
scalaVersion := "2.10.6"
scalacOptions += "-unchecked"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "2.1.0",
"org.apache.spark" % "spark-sql_2.10" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
I tried the solution in this answer, but it is not working for Idea Ultimate 2017.1, which I am using. Also, when I use worksheets, I prefer not to add an extra object to the worksheet if at all possible.
If I use the collect() method on the dataset object and get an Array of AClass instances, there are no more errors either. It is working with the Dataset directly that causes the error.
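For reference, this is roughly what that workaround looks like, reusing the ads value defined above (a sketch only):
// Materialize the Dataset on the driver with collect(), then iterate over the
// resulting plain AClass instances instead of calling foreach on the Dataset itself.
val collected: Array[AClass] = ads.collect()
collected.foreach(x => println(x))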
Use eclipse compatibility mode (open Preferences -> type scala -> in Languages & Frameworks, choose Scala -> choose Worksheet -> select only eclipse compatibility mode). See https://gist.github.com/RAbraham/585939e5390d46a7d6f8
I have a small project where I have the following problem:
scalaTest needs to be added to all three projects (client, server, shared); otherwise the scalatest library is not accessible from all of them.
In other words, if I write
val jvmDependencies = Def.setting(Seq(
"org.scalaz" %% "scalaz-core" % "7.2.8"
)++scalaTest)
then things work fine.
But if I don't append ++scalaTest to each of the three dependency sequences, then it fails like this:
> test
[info] Compiling 1 Scala source to /Users/joco/tmp3/server/target/scala-2.11/test-classes...
[error] /Users/joco/tmp3/server/src/test/scala/Test.scala:1: object specs2 is not a member of package org
[error] import org.specs2.mutable.Specification
[error] ^
[error] /Users/joco/tmp3/server/src/test/scala/Test.scala:3: not found: type Specification
[error] class Test extends Specification {
[error] ^
[error] /Users/joco/tmp3/server/src/test/scala/Test.scala:5: value should is not a member of String
[error] "Test" should {
[error] ^
[error] /Users/joco/tmp3/server/src/test/scala/Test.scala:6: value in is not a member of String
[error] "one is one" in {
[error] ^
[error] /Users/joco/tmp3/server/src/test/scala/Test.scala:8: value === is not a member of Int
[error] 1===one
[error] ^
[error] 5 errors found
[error] (server/test:compileIncremental) Compilation failed
[error] Total time: 4 s, completed Mar 18, 2017 1:56:54 PM
For production (not test) code, however, everything works just fine: I don't have to add the same dependency (in this example autowire) to all three projects; it is enough to add it to the shared project alone, and then I can use that library from all three projects.
For test code, however, as I mentioned above, currently I have to add the same library dependency (scalaTest - below) to all three projects.
Question: Is there a way to avoid this?
Settings.scala:
import org.scalajs.sbtplugin.ScalaJSPlugin.autoImport._
import sbt.Keys._
import sbt._
object Settings {
val scalacOptions = Seq(
"-Xlint",
"-unchecked",
"-deprecation",
"-feature",
"-Yrangepos"
)
object versions {
val scala = "2.11.8"
}
val scalaTest=Seq(
"org.scalatest" %% "scalatest" % "3.0.1" % "test",
"org.specs2" %% "specs2" % "3.7" % "test")
val sharedDependencies = Def.setting(Seq(
"com.lihaoyi" %%% "autowire" % "0.2.6"
)++scalaTest)
val jvmDependencies = Def.setting(Seq(
"org.scalaz" %% "scalaz-core" % "7.2.8"
))
/** Dependencies only used by the JS project (note the use of %%% instead of %%) */
val scalajsDependencies = Def.setting(Seq(
"org.scala-js" %%% "scalajs-dom" % "0.9.1"
)++scalaTest)
}
build.sbt:
import sbt.Keys._
import sbt.Project.projectToRef
import webscalajs.SourceMappings
lazy val shared = (crossProject.crossType(CrossType.Pure) in file("shared")) .settings(
scalaVersion := Settings.versions.scala,
libraryDependencies ++= Settings.sharedDependencies.value,
addCompilerPlugin("org.scalamacros" % "paradise" % "2.1.0" cross CrossVersion.full)
) .jsConfigure(_ enablePlugins ScalaJSWeb)
lazy val sharedJVM = shared.jvm.settings(name := "sharedJVM")
lazy val sharedJS = shared.js.settings(name := "sharedJS")
lazy val elideOptions = settingKey[Seq[String]]("Set limit for elidable functions")
lazy val client: Project = (project in file("client"))
.settings(
scalaVersion := Settings.versions.scala,
scalacOptions ++= Settings.scalacOptions,
libraryDependencies ++= Settings.scalajsDependencies.value,
testFrameworks += new TestFramework("utest.runner.Framework")
)
.enablePlugins(ScalaJSPlugin)
.disablePlugins(RevolverPlugin)
.dependsOn(sharedJS)
lazy val clients = Seq(client)
lazy val server = (project in file("server")) .settings(
scalaVersion := Settings.versions.scala,
scalacOptions ++= Settings.scalacOptions,
libraryDependencies ++= Settings.jvmDependencies.value
)
.enablePlugins(SbtLess,SbtWeb)
.aggregate(clients.map(projectToRef): _*)
.dependsOn(sharedJVM)
onLoad in Global := (Command.process("project server", _: State)) compose (onLoad in Global).value
fork in run := true
cancelable in Global := true
For test code, however, as I mentioned above, currently I have to add the same library dependency (scalaTest - below) to all three projects.
That is expected: test dependencies are not inherited along dependency chains. That makes sense, because you don't want to depend on JUnit just because you depend on a library that happens to be tested using JUnit.
Although yes, that calls for a bit of duplication when you have several projects in the same build, all using the same testing framework. This is why we often find some commonSettings that are added to all projects of an sbt build. This is also where we typically put things like organization, scalaVersion, and many other settings that usually apply to all projects inside one build.
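For example, a minimal sketch of that pattern, reusing the scalaTest, scalacOptions and versions values already defined in your Settings.scala (commonSettings is a name introduced here for illustration; the server definition below mirrors the one in your build.sbt, with the common settings added):
// Declare build-wide settings, including the test dependencies, once...
lazy val commonSettings = Seq(
  scalaVersion := Settings.versions.scala,
  scalacOptions ++= Settings.scalacOptions,
  libraryDependencies ++= Settings.scalaTest
)

// ...and add them to every project instead of appending ++scalaTest three times.
lazy val server = (project in file("server"))
  .settings(commonSettings: _*)
  .settings(libraryDependencies ++= Settings.jvmDependencies.value)
  .enablePlugins(SbtLess, SbtWeb)
  .aggregate(clients.map(projectToRef): _*)
  .dependsOn(sharedJVM)

// client and the shared cross-project get .settings(commonSettings: _*) in the same way;
// for the Scala.js side you may still want the %%% variants of the test libraries.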
I'm trying to write a test case for a simple REST API in Play2/Scala that send/receives JSON. My test looks like the following:
import org.junit.runner.RunWith
import org.specs2.matcher.JsonMatchers
import org.specs2.mutable._
import org.specs2.runner.JUnitRunner
import play.api.libs.json.{Json, JsArray, JsValue}
import play.api.test.Helpers._
import play.api.test._
import play.test.WithApplication
/**
* Add your spec here.
* You can mock out a whole application including requests, plugins etc.
* For more information, consult the wiki.
*/
@RunWith(classOf[JUnitRunner])
class APIv1Spec extends Specification with JsonMatchers {
val registrationJson = Json.parse("""{"device":"576b9cdc-d3c3-4a3d-9689-8cd2a3e84442", |
"firstName":"", "lastName":"Johnny", "email":"justjohnny#test.com", |
"pass":"myPassword", "acceptTermsOfService":true}
""")
def dropJsonElement(json : JsValue, element : String) = (json \ element).get match {
case JsArray(items) => util.dropAt(items, 1)
}
def invalidRegistrationData(remove : String) = {
dropJsonElement(registrationJson,remove)
}
"API" should {
"Return Error on missing first name" in new WithApplication {
val result= route(
FakeRequest(
POST,
"/api/v1/security/register",
FakeHeaders(Seq( ("Content-Type", "application/json") )),
invalidRegistrationData("firstName").toString()
)
).get
status(result) must equalTo(BAD_REQUEST)
contentType(result) must beSome("application/json")
}
...
However when I attempt to run sbt test, I get the following error:
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=384M; support was removed in 8.0
[info] Loading project definition from /home/cassius/brentspace/esalestracker/project
[info] Set current project to eSalesTracker (in build file:/home/cassius/brentspace/esalestracker/)
[info] Compiling 3 Scala sources to /home/cassius/brentspace/esalestracker/target/scala-2.11/test-classes...
[error] /home/cassius/brentspace/esalestracker/test/APIv1Spec.scala:34: could not find implicit value for evidence parameter of type org.specs2.main.CommandLineAsResult[play.test.WithApplication{val result: scala.concurrent.Future[play.api.mvc.Result]}]
[error] "Return Error on missing first name" in new WithApplication {
[error] ^
[error] one error found
[error] (test:compileIncremental) Compilation failed
[error] Total time: 3 s, completed 18/01/2016 9:30:42 PM
I have similar tests in other applications, but it looks like the new version of specs adds a lot of support for Futures and other things that invalidate previous tutorials. I'm on Scala 2.11.6, Activator 1.3.6 and my build.sbt looks like the following:
name := """eSalesTracker"""
version := "1.0-SNAPSHOT"
lazy val root = (project in file(".")).enablePlugins(PlayScala)
scalaVersion := "2.11.6"
libraryDependencies ++= Seq(
jdbc,
cache,
ws,
"com.typesafe.slick" %% "slick" % "3.1.0",
"org.postgresql" % "postgresql" % "9.4-1206-jdbc42",
"org.slf4j" % "slf4j-api" % "1.7.13",
"ch.qos.logback" % "logback-classic" % "1.1.3",
"ch.qos.logback" % "logback-core" % "1.1.3",
evolutions,
specs2 % Test,
"org.specs2" %% "specs2-matcher-extra" % "3.7" % Test
)
resolvers += "scalaz-bintray" at "http://dl.bintray.com/scalaz/releases"
resolvers += Resolver.url("Typesafe Ivy releases", url("https://repo.typesafe.com/typesafe/ivy-releases"))(Resolver.ivyStylePatterns)
// Play provides two styles of routers, one expects its actions to be injected, the
// other, legacy style, accesses its actions statically.
routesGenerator := InjectedRoutesGenerator
I think you are using the wrong WithApplication import.
Use this one:
import play.api.test.WithApplication
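In the spec above, that concretely means dropping the play.test import; the wildcard play.api.test._ import already brings the Scala WithApplication into scope (a sketch of the corrected imports only):
import play.api.test.Helpers._
import play.api.test._               // provides play.api.test.WithApplication
// import play.test.WithApplication  // removed: this is the Java-API helper class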
The last line of the test case should be the assertion/evaluation statement.
For example, before the last } of the failing test case, put the statement false must beEqualTo(true) and run it again.
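In other words, a sketch of the failing test case with such a placeholder assertion as its last statement (false must beEqualTo(true) is only there to prove the block evaluates to a result):
"Return Error on missing first name" in new WithApplication {
  // ... the existing route / status / contentType checks ...
  false must beEqualTo(true) // placeholder: the block must end in an assertion
}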
I am trying to get a simple example working mapping rows from Cassandra to a scala case class using Apache Spark 1.1.1, Cassandra 2.0.11, & the spark-cassandra-connector (v1.1.0). I have reviewed the documentation at the spark-cassandra-connector github page, planetcassandra.org, datastax, and generally searched around; but have not found anyone else encountering this issue. So here goes...
I am building a tiny Spark application using sbt (0.13.5), Scala 2.10.4, and Spark 1.1.1 against Cassandra 2.0.11. Modelling the example from the spark-cassandra-connector docs, the following two lines produce an error in my IDE and fail to compile.
case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
The simple error presented by Eclipse is:
No RowReaderFactory can be found for this type
The compile error is only slightly more verbose:
> compile
[info] Compiling 1 Scala source to /home/bkarels/dev/simple-case/target/scala-2.10/classes...
[error] /home/bkarels/dev/simple-case/src/main/scala/com/bradkarels/simple/SimpleApp.scala:82: No RowReaderFactory can be found for this type
[error] val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 1 s, completed Dec 10, 2014 9:01:30 AM
>
Scala source:
package com.bradkarels.simple
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._
// Likely don't need this import - but throwing darts hits the bullseye once in a while...
import com.datastax.spark.connector.rdd.reader.RowReaderFactory
object CaseStudy {
def main(args: Array[String]) {
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)
case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
}
}
With the bothersome lines removed, everything compiles fine, assembly works, and I can perform other Spark operations normally. For example, if I remove the problem lines and drop in:
val rdd:CassandraRDD[CassandraRow] = sc.cassandraTable("nicecase", "human")
I get back the RDD and work with it as expected. That said, I suspect that my sbt project, assembly plugin, etc. are not contributing to the issues. The working source (less the new attempt to map to a case class as the connector intended) can be found on github here.
But, to be more thorough, my build.sbt:
name := "Simple Case"
version := "0.0.1"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.1",
"org.apache.spark" %% "spark-sql" % "1.1.1",
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0" withSources() withJavadoc()
)
So the question is what have I missed? Hoping this is something silly, but if you have encountered this and can help me get past this puzzling little issue I would very much appreciate it. Please let me know if there are any other details that would be helpful in troubleshooting.
Thank you.
This may be due to my newness with Scala in general, but I resolved this issue by moving the case class declaration out of the main method. So the simplified source now looks like this:
package com.bradkarels.simple
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd._
object CaseStudy {
case class SubHuman(id:String, firstname:String, lastname:String, isGoodPerson:Boolean)
def main(args: Array[String]) {
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "simple", conf)
val foo = sc.cassandraTable[SubHuman]("nicecase", "human").select("id","firstname","lastname","isGoodPerson").toArray
}
}
The complete source (updated & fixed) can be found on github https://github.com/bradkarels/spark-cassandra-to-scala-case-class