How to set up a spark build.sbt file? - scala

I have been trying all day and cannot figure out how to make this work.
I have a common library that will serve as my core library for Spark.
My build.sbt file is not working:
name := "CommonLib"
version := "0.1"
scalaVersion := "2.12.5"
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
// resolvers += "bintray-spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
// resolvers += Resolver.sonatypeRepo("public")
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.hadoop" % "hadoop-common" % "2.7.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
// "org.apache.spark" % "spark-sql_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-hive_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"org.apache.spark" % "spark-yarn_2.10" % "1.6.0" exclude("org.apache.hadoop", "hadoop-yarn-server-web-proxy"),
"com.github.scopt" %% "scopt" % "3.7.0"
)
//addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.6")
//libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
//libraryDependencies ++= {
// val sparkVer = "2.1.0"
// Seq(
// "org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources()
// )
//}
The commented-out lines are all attempts I have already tried, and I don't know what else to do.
My goal is to get Spark 2.3 working and to have scopt available too.
I have sbt 1.1.1 installed.
Thank you.

I think I had two main issues.
First, Spark is not compatible with Scala 2.12 yet, so moving to 2.11.12 solved one issue.
Second, for the IntelliJ sbt console to pick up changes to build.sbt, you need to either kill and restart the console or run the reload command. I didn't know that, so I was not actually using the latest build.sbt file.
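A minimal build.sbt along these lines might look like the sketch below. It assumes Spark 2.3.0 (the version the question asks for), Scala 2.11.12, and the scopt version from the question. The hive and yarn modules are left out here and could be added the same way. After editing the file, run reload in the sbt console so the change is picked up.

```scala
name := "CommonLib"
version := "0.1"
// Spark 2.3 is published for Scala 2.11, not 2.12
scalaVersion := "2.11.12"

// Assumption: 2.3.0 is the target Spark release
val sparkVersion = "2.3.0"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.11) to the artifact name
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "com.github.scopt" %% "scopt" % "3.7.0"
)
```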

There's a Giter8 template that should work nicely:
https://github.com/holdenk/sparkProjectTemplate.g8

Related

Spark dependencies configuration for streaming from Twitter

I am trying to run a Spark application with Twitter streaming, but I keep running into problems with dependencies.
When I use the org.apache.bahir spark-streaming-twitter dependency I get this error:
module not found: org.apache.bahir#spark-streaming-twitter;2.0.0
Here is the corresponding build.sbt file:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6"
)
But when I use the older streaming dependency I get a ClassNotFoundException: org.apache.spark.Logging error.
Here is the corresponding build.sbt:
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.3.0",
"com.typesafe" % "config" % "1.3.0",
"org.twitter4j" % "twitter4j-stream" % "4.0.6",
"org.apache.spark" %% "spark-streaming-twitter" % "1.6.3"
)
To run my application, I run the sbt clean and package commands.
So which dependencies should I use, and how should I configure them to run my application?
The Twitter backend was removed from Spark with the 2.0 release, and the version of bahir you declared doesn't match your Spark version. Also, bahir Twitter already brings in the twitter4j-stream dependency (4.0.4 at the moment). Use:
val sparkVersion = "2.3.0"
libraryDependencies ++= Seq(
"org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion,
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion
)

Error when running jar: Exception in thread "main" java.lang.NoSuchMethodError scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;

I am working on a Spark application (Spark 2.0.0 and Scala 2.11.8), and it works fine within the IntelliJ IDEA environment. I exported the application as a jar file and tried to run it from the jar, but this error was raised in the terminal:
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at org.apache.spark.util.Utils$.getSystemProperties(Utils.scala:1632)
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:65)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:60)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at Main$.main(Main.scala:26)
at Main.main(Main.scala)
I've read discussions and similar questions, but all of them are about mismatched Scala versions. However, my sbt file is this:
name := "BaiscFM"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.0.0"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "2.0.0"
libraryDependencies += "com.typesafe.akka" % "akka-actor_2.11" % "2.4.17"
libraryDependencies += "net.liftweb" % "lift-json_2.11" % "2.6"
libraryDependencies += "com.typesafe.play" % "play-json_2.11" % "2.4.0-M2"
libraryDependencies += "org.json" % "json" % "20090211"
libraryDependencies += "org.scalaj" % "scalaj-http_2.11" % "2.3.0"
libraryDependencies += "org.drools" % "drools-core" % "6.3.0.Final"
libraryDependencies += "org.drools" % "drools-compiler" % "6.3.0.Final"
How to fix this problem?

Jackson version is too old

I have the following build.sbt file:
name := "myProject"
version := "1.0"
scalaVersion := "2.11.8"
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-core" % "2.8.1"
)
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-sql" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-hive" % "2.0.0" % "provided",
"com.databricks" %% "spark-csv" % "1.4.0",
"org.scalactic" %% "scalactic" % "2.2.1",
"org.scalatest" %% "scalatest" % "2.2.1" % "test",
"org.scalacheck" %% "scalacheck" % "1.12.4",
"com.holdenkarau" %% "spark-testing-base" % "2.0.0_0.4.4" % "test",
)
However, when I am running the code, I get this error:
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Jackson version is too old 2.4.4
at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:56)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:549)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
... 58 more
Why is this the case?
I've added a newer version of Jackson to dependencyOverrides (after looking at "Spark Parallelize? (Could not find creator property with name 'id')"), so an older version shouldn't be used.
jackson-core and jackson-databind versions should match (at least up to the minor version, I believe).
So remove the dependencyOverrides and have
libraryDependencies ++= Seq(
...
"com.fasterxml.jackson.core" % "jackson-databind" % "2.8.1"
)
Or specify both in dependencyOverrides:
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-core" % "2.8.1",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.8.1"
)
Though I'm not sure I understand what you are trying to do; the linked question seems to say that you should use an older version (2.4.4).
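For reference, on sbt 1.x dependencyOverrides is a Seq rather than a Set, so the second option could be sketched as below. The 2.8.1 version is simply the one from the question, not a recommendation. Whether a newer or an older Jackson is right depends on what the rest of the classpath expects.

```scala
// Sketch for sbt 1.x, where dependencyOverrides is a Seq[ModuleID]
// Assumption: 2.8.1 is taken from the question; adjust to the version you actually need
val jacksonVersion = "2.8.1"

dependencyOverrides ++= Seq(
  // Keep jackson-core and jackson-databind on the same version
  "com.fasterxml.jackson.core" % "jackson-core" % jacksonVersion,
  "com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion
)
```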

build.sbt: how to add spark dependencies

Hello, I am trying to pull in spark-core, spark-streaming, twitter4j, and spark-streaming-twitter with the build.sbt file below:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.3",
"org.twitter4j" % "twitter4j-stream" % "3.0.3"
)
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
I simply took these libraryDependencies from online, so I am not sure which versions to use.
Can someone please explain how I should fix this .sbt file? I spent a couple of hours trying to figure it out, but none of the suggestions worked. I installed Scala through Homebrew and I am on version 2.11.8.
All of my errors were about:
Modules were resolved with conflicting cross-version suffixes.
The problem is that you are mixing Scala 2.11 and 2.10 artifacts. You have:
scalaVersion := "2.11.8"
And then:
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
That line requires the 2.10 artifact. You are also mixing Spark versions instead of using a consistent one:
// spark 1.6.1
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
// spark 1.4.1
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.4.1"
// spark 0.9.0-incubating
libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % "0.9.0-incubating"
Here is a build.sbt that fixes both problems:
name := "hello"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion
)
You also don't need to manually add twitter4j dependencies since they are added transitively by spark-streaming-twitter.
It works for me:
name := "spark_local"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.twitter4j" % "twitter4j-core" % "3.0.5",
"org.twitter4j" % "twitter4j-stream" % "3.0.5",
"org.apache.spark" %% "spark-core" % "2.0.0",
"org.apache.spark" %% "spark-sql" % "2.0.0",
"org.apache.spark" %% "spark-mllib" % "2.0.0",
"org.apache.spark" %% "spark-streaming" % "2.0.0"
)

Trying to compile gensort.scala, getting: [error] impossible to get artifacts when data has not been loaded. IvyNode =net.java.dev.jets3t#jets3t;0.6.1

New to scala and sbt, not sure how to proceed. Am I missing more dependencies?
Steps to reproduce:
save gensort.scala code in ~/spark-1.3.0/project/
begin build: my-server$ ~/spark-1.3.0/project/sbt
> run
gensort.scala:
gensort source
build definition file in ~/spark-1.3.0/project/build.sbt:
lazy val root = (project in file(".")).
settings(
name := "gensort",
version := "1.0",
scalaVersion := "2.11.6"
)
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-examples_2.10" % "1.1.1",
"org.apache.spark" % "spark-core_2.11" % "1.3.0",
"org.apache.spark" % "spark-streaming-mqtt_2.11" % "1.3.0",
"org.apache.spark" % "spark-streaming_2.11" % "1.3.0",
"org.apache.spark" % "spark-network-common_2.10" % "1.2.0",
"org.apache.spark" % "spark-network-shuffle_2.10" % "1.3.0",
"org.apache.hadoop" % "hadoop-core" % "1.2.1"
)
Greatly appreciate any insight on how to move forward. Thx! -Dennis
You should not mix 2.10 and 2.11; they are not binary compatible. Your libraryDependencies should look like this:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-examples" % "1.1.1",
"org.apache.spark" %% "spark-core" % "1.3.0",
"org.apache.spark" %% "spark-streaming-mqtt" % "1.3.0",
"org.apache.spark" %% "spark-streaming" % "1.3.0",
"org.apache.spark" %% "spark-network-common" % "1.2.0",
"org.apache.spark" %% "spark-network-shuffle" % "1.3.0",
"org.apache.hadoop" % "hadoop-core" % "1.2.1"
)
The %% means that the Scala binary version is appended as a suffix to the library ID. After this change I got an error because a dependency could not be found. It is located in this repository:
resolvers += "poho" at "https://repo.eclipse.org/content/repositories/paho-releases"
Nevertheless, it seems that spark-examples is not available for 2.11. Changing the scalaVersion to
scalaVersion := "2.10.5"
solved all dependency problems, and compilation succeeded.
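To illustrate the %% shorthand: with the scalaVersion above, the two declarations below should resolve to the same artifact. This is only a sketch to show the equivalence; in a real build you would keep just one of them.

```scala
scalaVersion := "2.10.5"

// With %%, sbt appends the Scala binary version (here _2.10) for you
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

// Equivalent explicit form using a single % and a hand-written suffix
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.3.0"
```

Writing the suffix by hand is what makes it easy to mix _2.10 and _2.11 artifacts by accident, which is why %% is usually the safer choice for Scala libraries.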