SBT compile failure for Spark 2.2 - scala

I just started using Spark 2.2 on HDP 2.6 and I am facing issues when trying to run sbt compile.
Error
[info] Updated file /home/maria_dev/structuredstreaming/project/build.properties: set sbt.version to 1.3.0
[info] Loading project definition from /home/maria_dev/structuredstreaming/project
[info] Fetching artifacts of
[info] Fetched artifacts of
[error] lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo1.maven.org/maven2/com/squareup/okhttp3/okhttp-urlconnection/3.7.0/okhttp-urlconnection-3.7.0.jar: download error: Caught java.net.UnknownHostException: repo1.maven.org (repo1.maven.org) while downloading https://repo1.maven.org/maven2/com/squareup/okhttp3/okhttp-urlconnection/3.7.0/okhttp-urlconnection-3.7.0.jar
The build.sbt file is as below:
scalaVersion := "2.11.8"
resolvers ++= Seq(
"Conjars" at "http://conjars.org/repo",
"Hortonworks Releases" at "http://repo.hortonworks.com/content/groups/public"
)
publishMavenStyle := true
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0.2.6.3.0-235",
"org.apache.spark" %% "spark-sql" % "2.2.0.2.6.3.0-235",
"org.apache.phoenix" % "phoenix-spark2" % "4.7.0.2.6.3.0-235",
"org.apache.phoenix" % "phoenix-core" % "4.7.0.2.6.3.0-235",
"org.apache.kafka" % "kafka-clients" % "0.10.1.2.6.3.0-235",
"org.apache.spark" %% "spark-streaming" % "2.0.2" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.2",
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.0.2" % "provided",
"com.typesafe" % "config" % "1.3.1",
"com.typesafe.play" %% "play-json" % "2.7.2",
"com.solarmosaic.client" %% "mail-client" % "0.1.0",
"org.json4s" %% "json4s-jackson" % "3.2.10",
"org.apache.logging.log4j" % "log4j-api-scala_2.11" % "11.0",
"com.databricks" %% "spark-avro" % "3.2.0",
"org.elasticsearch" %% "elasticsearch-spark-20" % "5.0.0-alpha5",
"io.spray" %% "spray-json" % "1.3.3"
)
retrieveManaged := true
fork in run := true

It looks like Coursier is attempting to fetch dependencies from repo1.maven.org, which is being blocked. The Scala Metals people have an explanation here. Basically, you have to set a global Coursier config pointing at your corporate proxy server by setting up a mirror.properties file that looks like this:
central.from=https://repo1.maven.org/maven2
central.to=http://mycorporaterepo.com:8080/nexus/content/groups/public
Depending on your OS, the file will be located at:
Windows: C:\Users\<username>\AppData\Roaming\Coursier\config\mirror.properties
Linux: ~/.config/coursier/mirror.properties
MacOS: ~/Library/Preferences/Coursier/mirror.properties
You might also need to set up SBT to use a proxy for downloading dependencies. For that, you will need to edit this file:
~/.sbt/repositories
Set it to the following:
[repositories]
local
maven-central: http://mycorporaterepo.com:8080/nexus/content/groups/public
The combination of those two settings should take care of everything you need to do to point SBT to the correct places.
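If resolvers declared in build.sbt still cause lookups against the public internet, one optional extra step (a suggestion on top of the answer above, not strictly required) is to force SBT to resolve only through the repositories file:
sbt -Dsbt.override.build.repos=true compile
With that system property set, SBT ignores resolvers declared in the build and goes through ~/.sbt/repositories exclusively.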

Related

Scala - Error java.lang.NoClassDefFoundError: upickle/core/Types$Writer

I'm new to Scala/Spark, so please be easy on me :)
I'm trying to run an EMR cluster on AWS, running the jar file I packaged with sbt package.
When I run the code locally, it is working perfectly fine, but when I'm running it in the AWS EMR cluster, I'm getting an error:
ERROR Client: Application diagnostics message: User class threw exception: java.lang.NoClassDefFoundError: upickle/core/Types$Writer
From what I understand, this error originates from the Scala/Spark dependency versions.
So I'm using Scala 2.12 with Spark 3.0.1, and on AWS I'm using emr-6.2.0.
Here's my build.sbt:
scalaVersion := "2.12.14"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.792"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.792"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1"
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.4.1"
What am I missing?
Thanks!
If you use sbt package, the generated jar will contain only the code of your project, not its dependencies. You need to use sbt assembly to generate a so-called uberjar that includes the dependencies as well.
But in your case, it's recommended to mark the Spark and Hadoop (and maybe AWS) dependencies as Provided - they should already be included in the EMR runtime. Use something like this:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided
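For completeness, here is a minimal sketch of that setup; the sbt-assembly version is an assumption, and the exact set of libraries you mark Provided should match what your EMR release actually ships:
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")
// build.sbt (relevant lines only)
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1" % Provided
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0" % Provided
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1" // not provided by EMR, so it must go into the uberjar
Then build with sbt assembly and submit the jar from target/scala-2.12/ instead of the one produced by sbt package.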

SBT: cannot resolve dependency that used to work before

My build.sbt looks like this:
import sbt._
name := "spark-jobs"
version := "0.1"
scalaVersion := "2.11.8"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided",
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0",
"org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
This used to work until I decided to see what happens if I add another % "provided" at the end of spark-streaming_2.11. It failed to resolve the dependency, so I moved on and reverted the change. But it seems to give me the exception after that as well. Now my build.sbt looks exactly like it used to when everything worked. Still, it gives me this exception:
[error] (*:update) sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-streaming_2.11;2.2.0: org.apache.spark#spark-parent_2.11;2.2.0!spark-parent_2.11.pom(pom.original) origin location must be absolute: file:/home/aswin/.m2/repository/org/apache/spark/spark-parent_2.11/2.2.0/spark-parent_2.11-2.2.0.pom
SBT's behavior is a bit confusing to me. Could someone guide me as to why this could happen? Any good blogs/resources for understanding how exactly SBT works under the hood are also welcome.
Here is my project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
project/build.properties:
sbt.version = 1.0.4
project/plugins.sbt:
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"
Thank you!
If you are in the sbt console, just run the reload command and try again. After you update your dependencies or sbt plugins, you need to reload the project so that the changes take effect.
By the way, instead of encoding the Scala version in the artifact names, you can just use the %% operator and it will fetch the appropriate dependency according to your declared Scala version.
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"
)
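As a concrete illustration of the reload advice (just a typical workflow, not something specific to this error), after editing build.sbt you would run the following inside the sbt shell:
reload
update
reload re-reads the build definition and update re-resolves the dependency graph with the new settings.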

Using cloudera scalats library in project using sbt

I am trying to use the Cloudera scalats library for time series forecasting but am unable to download the library using sbt.
Below is my build.sbt file. I can see the Maven repo has a 0.4.0 version available, so I am not sure what I am doing wrong.
Can anyone please help me understand what I am doing wrong in the sbt file?
import sbt.complete.Parsers._
scalaVersion := "2.11.8"
name := "Forecast Stock Price using Spark TimeSeries library"
val sparkVersion = "1.5.2"
//resolvers ++= Seq("Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming" % sparkVersion withSources(),
"org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
"org.apache.spark" %% "spark-hive" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion withSources(),
"org.apache.spark" %% "spark-mllib" % sparkVersion withSources(),
"com.databricks" %% "spark-csv" % "1.3.0" withSources(),
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
)
Change
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
to
"com.cloudera.sparkts" % "sparkts" % "0.4.0"
sparkts is only distributed for Scala 2.11; it does not encode the Scala version in the artifact name.
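A sketch of the relevant build.sbt lines after the change; if the artifact does not resolve from Maven Central for you, you will most likely also need to uncomment the Cloudera resolver that is already present (but commented out) in your build:
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "com.cloudera.sparkts" % "sparkts" % "0.4.0"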

Spark Streaming Kafka java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.StringDeserializer

I am using Spark Streaming with the Kafka integration. When I run the streaming application from my IDE in local mode, everything works like a charm. However, as soon as I submit it to the cluster I keep getting the following error:
java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.StringDeserializer
I am using sbt assembly to build my project.
My build.sbt is as follows:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0" % Provided,
"org.marc4j" % "marc4j" % "2.8.2",
"net.sf.saxon" % "Saxon-HE" % "9.7.0-20"
)
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated
mainClass in assembly := Some("EstimatorStreamingApp")
I also tried to use the --packages option
attempt 1
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0
attempt 2
--packages org.apache.spark:spark-streaming-kafka-0-10-assembly_2.11:2.2.0
All with no success. Does anyone have anything to suggest?
You need to remove the "provided" flag from the Kafka dependency, as it is not provided out of the box with Spark:
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % Provided,
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0" % Provided,
"org.marc4j" % "marc4j" % "2.8.2",
"net.sf.saxon" % "Saxon-HE" % "9.7.0-20"
)
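After that change, rebuild the uberjar and submit it. A rough sketch of the commands (the assembly jar name and path are assumptions; they depend on your project name and version):
sbt assembly
spark-submit --class EstimatorStreamingApp target/scala-2.11/your-project-assembly-0.1.jar
Since the Kafka classes are now packaged inside the assembly, the --packages workaround is no longer needed.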

How to troubleshoot SBT's library dependency warnings?

I'm trying to build a "hello world"-esque app that uses Spark streaming to stream data from a Kafka broker (this works), filters/processes this data, and pushes it to a (local) web browser using the Scalatra web framework and its supported web sockets functionality from Atmosphere. The Kafka/Spark chunk works independently, and the Scalatra/Atmosphere chunk also works independently. It's when I try to bring the two halves together that I run into issues with library dependencies.
The real question: how do I go about selecting library versions for which Spark will play nice with Scalatra?
A bare bones Scalatra/Atmosphere app works fine as follows:
organization := "com.example"
name := "example app"
version := "0.1.0"
scalaVersion := "2.12.2"
val ScalatraVersion = "2.5.+"
libraryDependencies ++= Seq(
"org.json4s" %% "json4s-jackson" % "3.5.2",
"org.scalatra" %% "scalatra" % ScalatraVersion,
"org.scalatra" %% "scalatra-scalate" % ScalatraVersion,
"org.scalatra" %% "scalatra-specs2" % ScalatraVersion % "test",
"org.scalatra" %% "scalatra-atmosphere" % ScalatraVersion,
"org.eclipse.jetty" % "jetty-webapp" % "9.4.6.v20170531" % "provided",
"javax.servlet" % "javax.servlet-api" % "3.1.0" % "provided"
)
enablePlugins(JettyPlugin)
But if I add new dependencies for Spark and Spark streaming, and knock the Scala version down to 2.11 (required for Spark-Kafka streaming):
organization := "com.example"
name := "example app"
version := "0.1.0"
scalaVersion := "2.11.8"
val ScalatraVersion = "2.5.+"
val SparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.json4s" %% "json4s-jackson" % "3.5.2",
"org.scalatra" %% "scalatra" % ScalatraVersion,
"org.scalatra" %% "scalatra-scalate" % ScalatraVersion,
"org.scalatra" %% "scalatra-specs2" % ScalatraVersion % "test",
"org.scalatra" %% "scalatra-atmosphere" % ScalatraVersion,
"org.eclipse.jetty" % "jetty-webapp" % "9.4.6.v20170531" % "provided",
"javax.servlet" % "javax.servlet-api" % "3.1.0" % "provided"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % SparkVersion,
"org.apache.spark" %% "spark-streaming" % SparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-8" % SparkVersion
)
enablePlugins(JettyPlugin)
The code compiles, but I get SBT's eviction warning:
[warn] There may be incompatibilities among your library dependencies.
[warn] Here are some of the libraries that were evicted:
[warn] * org.json4s:json4s-jackson_2.11:3.2.11 -> 3.5.3
[warn] Run 'evicted' to see detailed eviction warnings
Then finally, when Jetty tries to run the web server, it fails with this error:
WARN:oejuc.AbstractLifeCycle:main: FAILED org.eclipse.jetty.annotations.ServletContainerInitializersStarter#53fb3dab: java.lang.NoClassDefFoundError: com/sun/jersey/spi/inject/InjectableProvider
java.lang.NoClassDefFoundError: com/sun/jersey/spi/inject/InjectableProvider
How do I get to the bottom of this? I'm new to the Scala world, and the intricacies of dependencies are blowing my mind.
One way to remove the eviction warning is to add the library dependency with the required version explicitly, using dependencyOverrides.
Try adding the following to your SBT file and re-build the application:
dependencyOverrides += "org.json4s" % "json4s-jackson_2.11" % "3.5.3"
Check the SBT documentation here.
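To see where the conflicting versions actually come from, you can run the evicted task that the warning mentions, and optionally inspect the graph with the sbt-dependency-graph plugin (the plugin version below is an assumption; adjust it for your sbt version):
// project/plugins.sbt (optional)
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.9.2")
Then, in the sbt shell:
evicted
whatDependsOn org.json4s json4s-jackson_2.11 3.2.11
This shows which of your direct dependencies (Spark versus Scalatra here) pulls in each json4s version, so you know what to pin with dependencyOverrides.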