Scala/Spark: unable to create DataFrame with scala-client dependency

I need to support Couchbase version 6 with Spark 2.3 or 2.4 on Scala 2.11.12, and I am facing an issue while creating a DataFrame.
SBT code snippet:
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"com.couchbase.client" %% "spark-connector" % "2.3.0",
"com.couchbase.client" %% "scala-client" % "1.0.0-alpha.3")
Code
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Example")
.master("local[*]")
.config("spark.couchbase.nodes", "10.12.12.88") // connect to the Couchbase Server at this address
.config("spark.couchbase.username", "abcd") // with given credentials
.config("spark.couchbase.password", "abcd")
.config("spark.couchbase.bucket.beer-sample", "") // open the beer-sample bucket
.getOrCreate()
val sc = spark.sparkContext
import com.couchbase.spark.sql._
val sql = spark.sqlContext
val dataframe = sql.read.couchbase()
val result = dataframe.collect()
Exception
Caused by: java.lang.ClassNotFoundException: com.couchbase.client.core.message.CouchbaseRequest
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
Tried:
As per a suggestion, I added the dependency
"com.couchbase.client" % "core-io" % "1.7.6",
Without the scala-client dependency I am able to get the DataFrame, but with scala-client I am unable to fix the error. Please suggest a solution to this problem.

I have made changes to your build.sbt file and added settings for the sbt-assembly plugin. Note that the scala-client dependency is dropped: spark-connector 2.3.0 is built against the older 1.x core-io line (which is where com.couchbase.client.core.message.CouchbaseRequest lives), and scala-client 1.0.0-alpha.3 most likely pulls in a newer core-io that evicts those classes, which is what produces the ClassNotFoundException.
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % Provided,
"org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
"com.couchbase.client" %% "spark-connector" % "2.3.0")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "reference.conf" => MergeStrategy.concat
case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
case _ => MergeStrategy.first
}
You need to create a file called plugins.sbt in the project directory and add the following line to it:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Once done, run the command sbt clean compile assembly in the project's root directory. It should build your jar.
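A side note, not part of the original answer: with the Spark artifacts marked Provided, sbt run will no longer see the Spark classes on the local classpath. If you also want to run the job from sbt during development, the sbt-assembly documentation suggests putting the provided dependencies back on the run task's classpath, roughly like this (a sketch using the same sbt 0.13-style syntax as above):
// put Provided dependencies back on the classpath for `sbt run` only
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated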

Related

Apache Spark 3.1.2 can't read from S3 via documented spark-hadoop-cloud

The Spark documentation at https://spark.apache.org/docs/latest/cloud-integration.html suggests using spark-hadoop-cloud to read/write from S3.
However, there is no Apache Spark-published artifact for spark-hadoop-cloud. When trying to use the Cloudera-published module instead, the following exception occurs:
Exception in thread "main" java.lang.NoSuchMethodError: 'void com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870)
at org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:428)
This looks like a classpath conflict, which suggests it is not possible to use spark-hadoop-cloud to read from S3 with the vanilla Apache Spark 3.1.2 jars. Here are the build.sbt and the code:
version := "0.0.1"
scalaVersion := "2.12.12"
lazy val app = (project in file("app")).settings(
assemblyPackageScala / assembleArtifact := false,
assembly / assemblyJarName := "uber.jar",
assembly / mainClass := Some("com.example.Main"),
// more settings here ...
)
resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.1.1.3.1.7270.0-253"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1.7.2.7.0-184"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "3.1.1.7.2.7.0-184"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901"
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "0.21.3" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
import org.apache.spark.sql.SparkSession
object SparkApp {
def main(args: Array[String]){
val spark = SparkSession.builder().master("local")
//.config("spark.jars.repositories", "https://repository.cloudera.com/artifactory/cloudera-repos/")
//.config("spark.jars.packages", "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
.appName("spark session").getOrCreate
val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json")
val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv")
jsonDF.show()
csvDF.show()
}
}
To read and write to S3 from Spark you only need these 2 dependencies:
"org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion
Make sure the hadoopVersion is the same one used by your worker nodes, and make sure your worker nodes also have these dependencies available. The rest of your code looks correct.
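As an illustration, a minimal build.sbt sketch (the 3.2.0 below is an assumption; substitute the exact Hadoop version your cluster runs):
val hadoopVersion = "3.2.0" // must match the Hadoop version on the worker nodes
libraryDependencies ++= Seq(
"org.apache.hadoop" % "hadoop-aws" % hadoopVersion,
"org.apache.hadoop" % "hadoop-common" % hadoopVersion
)
Credentials can then come from the default AWS provider chain, or be passed through the standard S3A keys, e.g. spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in the SparkSession config.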

sbt assembly with JUnit tests fails

I am very new to Scala and sbt.
I want to run JUnit tests with sbt assembly.
I have written all my tests and they all run correctly from IntelliJ.
However, when I try to build with the tests, the build always fails with lots of errors.
Here is my build.sbt
name := "updater"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.12"
val sparkVersion = "2.4.0"
libraryDependencies ++= Seq(
//"org.scala-lang" % "scala-reflect" % "2.11.12",
"org.apache.spark" %% "spark-core" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
"com.typesafe" % "config" % "1.3.4",
//Testing
"junit" % "junit" % "4.10" % Test,
"com.novocode" % "junit-interface" % "0.11" % Test
// exclude("junit", "junit-dep")
,
//"org.scalatest" %% "scalatest" % "3.0.7" % Test,
"org.easymock" % "easymock" % "4.0.2" % Test,
//Logging
"ch.qos.logback" % "logback-classic" % "1.2.3",
"com.typesafe.scala-logging" %% "scala-logging" % "3.9.0"
)
assemblyMergeStrategy in assembly := {
case PathList("src/test/resources/library.properties", xs @ _*) => MergeStrategy.discard
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I am attaching the log file, as the problem, to me as a newbie, seems incomprehensible. It is driving me crazy.
Below is my abstract test class, which is supposed to initialize a Spark session with @BeforeClass for every test class. I have only included it because I suspect it could be the cause of the failure.
Do you have any suggestions on how to solve it?
Thanks
I was instantiating the class like so:
import org.apache.spark.sql.SparkSession
import org.junit.{AfterClass, BeforeClass}
abstract class SparkTest {
val spark: SparkSession = SparkTest.spark
}
object SparkTest {
var spark: SparkSession = _
@BeforeClass
def initializeSpark(): Unit = {
spark = SparkSession
.builder()
.appName("TableUpdaterTest")
.master("local")
.getOrCreate()
}
@AfterClass
def stopSpark(): Unit = {
spark.stop()
}
}
Apparently, by commenting out the spark.stop() call, everything started to work.
Does anyone have an idea why?

Scala/Spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging

I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched on Stack Overflow but am still stuck on this issue; maybe something is missing or misconfigured.
Running commands:
me#Mycomputer:~/spark-2.1.0$ bin/spark-submit --class ETLApp /home/me/src/etl/target/scala-2.10/etl-assembly-0.1.0.jar
Output:
...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
build.sbt:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
mainClass := Some("ETLApp")
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided";
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2";
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2";
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.6.0"
libraryDependencies += "org.apache.curator" % "curator-test" % "2.6.0"
libraryDependencies += "args4j" % "args4j" % "2.32"
java -version
java version "1.8.0_101"
scala -version
2.10.5
spark version
2.1.0
Any hints welcome. Thanks.
In that case, your jar must bring all dependent classes along when being submitted to Spark.
In Maven this would be possible with the assembly plugin and the jar-with-dependencies descriptor; with sbt, a quick Google search turns up this: https://github.com/sbt/sbt-assembly
You can change your build.sbt as follows:
name := "etl"
version := "0.1.0"
scalaVersion := "2.10.5"
scalacOptions ++= Seq("-deprecation",
"-feature",
"-Xfuture",
"-encoding",
"UTF-8",
"-unchecked",
"-language:postfixOps")
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-sql" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming" % "1.5.2" % Provided,
"org.apache.spark" %% "spark-streaming-kafka" % "1.5.2" % Provided,
"com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2",
"org.apache.curator" % "curator-recipes" % "2.6.0",
"org.apache.curator" % "curator-test" % "2.6.0",
"args4j" % "args4j" % "2.32")
mainClass in assembly := Some("your.package.name.ETLApp")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "reference.conf" => MergeStrategy.concat
case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
case _ => MergeStrategy.first
}
Add the sbt-assembly plugin to your plugins.sbt file under the project directory in your project's root directory (see the line below). Running sbt assembly in the terminal (Linux) or CMD (Windows) in the root directory of your project will download all the dependencies for you and create an uber JAR.
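For completeness, project/plugins.sbt would contain a single line like the following (the 0.14.x version number is an assumption; use the sbt-assembly release that matches your sbt version):
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")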

Why does sbt assembly in Spark project fail with "Please add any Spark dependencies by supplying the sparkVersion and sparkComponents"?

I work on an sbt-managed Spark project with a spark-cloudant dependency. The code is available on GitHub (on the spark-cloudant-compile-issue branch).
I've added the following line to build.sbt:
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
And so build.sbt looks as follows:
name := "Movie Rating"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies ++= {
val sparkVersion = "1.6.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"org.apache.kafka" % "kafka-log4j-appender" % "0.9.0.0",
"org.apache.kafka" % "kafka-clients" % "0.9.0.0",
"org.apache.kafka" %% "kafka" % "0.9.0.0",
"cloudant-labs" % "spark-cloudant" % "1.6.4-s_2.10" % "provided"
)
}
assemblyMergeStrategy in assembly := {
case PathList("org", "apache", "spark", xs @ _*) => MergeStrategy.first
case PathList("scala", xs @ _*) => MergeStrategy.discard
case PathList("META-INF", "maven", "org.slf4j", xs @ _*) => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
unmanagedBase <<= baseDirectory { base => base / "lib" }
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
When I execute sbt assembly I get the following error:
java.lang.RuntimeException: Please add any Spark dependencies by
supplying the sparkVersion and sparkComponents. Please remove:
org.apache.spark:spark-core:1.6.0:provided
Probably related: https://github.com/databricks/spark-csv/issues/150
Can you try adding spIgnoreProvided := true to your build.sbt?
(This might not be the answer, and I could have just posted a comment, but I don't have enough reputation.)
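For context (an assumption to verify against the plugin's README, not something stated in the answers here): that RuntimeException comes from the sbt-spark-package plugin, which expects Spark to be declared through its own settings rather than as plain "provided" libraryDependencies. A rough sketch of the two alternatives in build.sbt:
// Alternative 1: let the plugin add the Spark dependencies for you
sparkVersion := "1.6.0"
sparkComponents ++= Seq("core", "sql", "streaming", "streaming-kafka", "mllib")
// Alternative 2: keep the explicit "provided" entries and tell the plugin to ignore them
spIgnoreProvided := true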
NOTE: I still can't reproduce the issue, but I think it does not really matter.
java.lang.RuntimeException: Please add any Spark dependencies by supplying the sparkVersion and sparkComponents.
In your case, your build.sbt is missing an sbt resolver to find the spark-cloudant dependency. You should add the following line to build.sbt:
resolvers += "spark-packages" at "https://dl.bintray.com/spark-packages/maven/"
PROTIP: I strongly recommend using spark-shell first, and switching to sbt only once you're comfortable with the package (especially if you're new to sbt and perhaps to the other libraries/dependencies too). It's too much to digest in one bite. Follow https://spark-packages.org/package/cloudant-labs/spark-cloudant.

assemblyMergeStrategy causing scala.MatchError when compiling

I'm new to sbt/assembly. I'm trying to resolve some dependency problems, and it seems the only way to do it is through a custom merge strategy. However, whenever I try to add a merge strategy I get a seemingly random MatchError on compiling:
[error] (*:assembly) scala.MatchError: org/apache/spark/streaming/kafka/KafkaUtilsPythonHelper$$anonfun$13.class (of class java.lang.String)
I'm showing this MatchError for the Kafka library, but if I take out that library altogether, I get a MatchError for another library. If I take out all the libraries, I get a MatchError on my own code. None of this happens if I take out the assemblyMergeStrategy block. I'm clearly missing something incredibly basic, but for the life of me I can't find it, and I can't find anyone else who has this problem. I've tried the older mergeStrategy syntax, but as far as I can tell from the docs and SO, this is the proper way to write it now. Please help?
Here is my project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
And my project.sbt file:
name := "Clerk"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
"org.apache.kafka" %% "kafka" % "0.8.2.1",
"ch.qos.logback" % "logback-classic" % "1.1.7",
"net.logstash.logback" % "logstash-logback-encoder" % "4.6",
"com.typesafe.scala-logging" %% "scala-logging" % "3.1.0",
"org.apache.spark" %% "spark-streaming-kafka" % "1.6.1",
("org.apache.spark" %% "spark-streaming-kafka" % "1.6.1").
exclude("org.spark-project.spark", "unused")
)
assemblyMergeStrategy in assembly := {
case PathList("org.slf4j", "impl", xs @ _*) => MergeStrategy.first
}
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
You're missing a default case in your merge-strategy pattern match. The assemblyMergeStrategy function must return a strategy for every path it is asked about, so any path that doesn't match your single case fails with a scala.MatchError. Add a catch-all case that falls back to the default strategy:
assemblyMergeStrategy in assembly := {
case PathList("org.slf4j", "impl", xs @ _*) => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}