Question: Spark Scala library dependencies in IntelliJ -- unresolved dependencies path

Checking the logs is usually helpful. When you use:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
you will find a repo link like this in the resolution log:
not found: https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.13/3.1.2/spark-core_2.13-3.1.2.pom
Here you can see why the lookup fails: %% appends your project's Scala binary version (2.13 in this case) to the artifact name, and spark-core 3.1.2 is not published for Scala 2.13, so the POM cannot be found.
Either switch the project to a 2.12.x scalaVersion, or use % on its own and write the _2.12 suffix into the artifact name yourself, as done below for the spark-sql and spark-core dependencies.
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.1.2" % "provided"
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.1.2" % "provided"

Related

Scala - Error java.lang.NoClassDefFoundError: upickle/core/Types$Writer

I'm new to Scala/Spark, so please be easy on me :)
I'm trying to run an EMR cluster on AWS, running the jar file I built with sbt package.
When I run the code locally, it is working perfectly fine, but when I'm running it in the AWS EMR cluster, I'm getting an error:
ERROR Client: Application diagnostics message: User class threw exception: java.lang.NoClassDefFoundError: upickle/core/Types$Writer
From what I understand, this error comes from a mismatch between the Scala/Spark versions in my dependencies.
I'm using Scala 2.12 with Spark 3.0.1, and on AWS I'm using emr-6.2.0.
Here's my build.sbt:
scalaVersion := "2.12.14"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.792"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.792"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
libraryDependencies += "com.lihaoyi" %% "upickle" % "1.4.1"
libraryDependencies += "com.lihaoyi" %% "ujson" % "1.4.1"
What am I missing?
Thanks!
If you use sbt package, the generated jar will contain only the code of your project, not its dependencies. You need to use sbt assembly to generate a so-called uberjar that includes the dependencies as well.
But in your case it's recommended to mark the Spark and Hadoop (and maybe AWS) dependencies as Provided, since they are already included in the EMR runtime. Use something like this:
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1" % Provided

I want to do the Spark tutorial in an SBT project; what libraries do I need to install?

I want to do the Apache Spark quick-start tutorial. I'd like to use a Scala worksheet in IntelliJ for each of the examples.
What do I need to add to my SBT configuration in order to make this work? Currently, I have:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
But that's not sufficient. What else do I need to add to get everything working?
Try these:
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "2.4.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.12" % "2.4.0"

SBT: cannot resolve dependency that used to work before

My build.sbt looks like this:
import sbt._
name := "spark-jobs"
version := "0.1"
scalaVersion := "2.11.8"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.2.0" % "provided",
"org.apache.spark" % "spark-streaming_2.11" % "2.2.0",
"org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided",
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.2.0"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
This used to work until I decided to see what would happen if I added another % "provided" at the end of spark-streaming_2.11. The dependency failed to resolve, so I reverted the change, but the build keeps failing even now. My build.sbt looks exactly like it did when everything worked, yet it still gives me this exception:
[error] (*:update) sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.spark#spark-streaming_2.11;2.2.0: org.apache.spark#spark-parent_2.11;2.2.0!spark-parent_2.11.pom(pom.original) origin location must be absolute: file:/home/aswin/.m2/repository/org/apache/spark/spark-parent_2.11/2.2.0/spark-parent_2.11-2.2.0.pom
SBT's behavior is a bit confusing to me. Could someone explain why this could happen? Any good blogs/resources for understanding how exactly SBT works under the hood are also welcome.
Here is my project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
project/build.properties:
sbt.version = 1.0.4
project/plugins.sbt:
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
resolvers += "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/"
Thank you!
If you are in the sbt console, just run the reload command and try again. After you update your dependencies or sbt plugins, you need to reload the project so that the changes take effect.
By the way, instead of hard-coding the Scala version in your dependencies, you can just use the %% operator and it will fetch the dependency that matches your declared scalaVersion.
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0"
)
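For illustration, this is roughly what that looks like in the sbt shell (the prompt name comes from the project name above; reload re-reads the build definition and update then re-resolves dependencies against it):
sbt:spark-jobs> reload
sbt:spark-jobs> update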

Using cloudera scalats library in project using sbt

I am trying to use the Cloudera scalats library for time series forecasting, but I am unable to download the library using sbt.
Below is my build.sbt file. I can see that the Maven repo has a 0.4.0 release, so I am not sure what I am doing wrong.
Can anyone please help me figure out what is wrong with my sbt file?
import sbt.complete.Parsers._
scalaVersion := "2.11.8"
name := "Forecast Stock Price using Spark TimeSeries library"
val sparkVersion = "1.5.2"
//resolvers ++= Seq("Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming" % sparkVersion withSources(),
"org.apache.spark" %% "spark-sql" % sparkVersion withSources(),
"org.apache.spark" %% "spark-hive" % sparkVersion withSources(),
"org.apache.spark" %% "spark-streaming-twitter" % sparkVersion withSources(),
"org.apache.spark" %% "spark-mllib" % sparkVersion withSources(),
"com.databricks" %% "spark-csv" % "1.3.0" withSources(),
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
)
Change
"com.cloudera.sparkts" %% "sparkts" % "0.4.0"
to
"com.cloudera.sparkts" % "sparkts" % "0.4.0"
sparkts is only distributed for Scala 2.11; it does not encode the Scala version in the artifact name.
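A minimal before/after sketch of why this matters (the commented line shows what %% would have asked sbt to resolve):
// %% appends the Scala binary version, so this looks for an artifact named sparkts_2.11, which does not exist
// "com.cloudera.sparkts" %% "sparkts" % "0.4.0"
// % uses the artifact name verbatim, matching how the library is actually published
"com.cloudera.sparkts" % "sparkts" % "0.4.0"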

Need some help in fixing the Spark streaming dependency (Scala sbt)

I am trying to run a basic Spark Streaming example on my machine using IntelliJ, but I am unable to resolve the dependency issues.
Please help me fix them.
name := "demoSpark"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq("org.apache.spark" % "spark-core_2.11" % "2.1.0",
"org.apache.spark" % "spark-sql_2.10" % "2.1.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.0",
"org.apache.spark" % "spark-mllib_2.10" % "2.1.0"
)
At the very least, all the dependencies must use the same Scala version, not a mix of 2.10 and 2.11. You can use the %% operator in sbt to ensure the right variant is selected (the one matching the scalaVersion you specified).
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-sql" % "2.1.0",
"org.apache.spark" %% "spark-streaming" % "2.1.0",
"org.apache.spark" %% "spark-mllib" % "2.1.0"
)