Scala SBT assembly Kafka streaming error

I am new to Scala and SBT. I am using Kafka streaming and storing the data in a Cassandra DB. While trying to build a fat JAR with the sbt assembly command, I get the error below.
How do I resolve this issue and build the fat JAR?
build.sbt
organization := "com.example"
name := "cass-conn"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion ,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
"org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
)
plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
SBT version : 1.0.3
Error
[error] 1 error was encountered during merge
[error] java.lang.RuntimeException: deduplicate: different file contents found in the following:
[error] C:\Users\gnana\.ivy2\cache\org.apache.spark\spark-streaming-kafka-0-10_2.11\jars\spark-streaming-kafka-0-10_2.11-2.2.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] C:\Users\gnana\.ivy2\cache\org.apache.spark\spark-tags_2.11\jars\spark-tags_2.11-2.2.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] C:\Users\gnana\.ivy2\cache\org.spark-project.spark\unused\jars\unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] at sbtassembly.Assembly$.applyStrategies(Assembly.scala:141)
[error] at sbtassembly.Assembly$.x$1$lzycompute$1(Assembly.scala:25)
[error] at sbtassembly.Assembly$.x$1$1(Assembly.scala:23)
[error] at sbtassembly.Assembly$.stratMapping$lzycompute$1(Assembly.scala:23)
[error] at sbtassembly.Assembly$.stratMapping$1(Assembly.scala:23)
[error] at sbtassembly.Assembly$.inputs$lzycompute$1(Assembly.scala:67)
[error] at sbtassembly.Assembly$.inputs$1(Assembly.scala:57)
[error] at sbtassembly.Assembly$.apply(Assembly.scala:84)
[error] at sbtassembly.Assembly$.$anonfun$assemblyTask$1(Assembly.scala:249)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] at sbt.Execute.work(Execute.scala:266)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] C:\Users\gnana\.ivy2\cache\org.apache.spark\spark-streaming-kafka-0-10_2.11\jars\spark-streaming-kafka-0-10_2.11-2.2.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] C:\Users\gnana\.ivy2\cache\org.apache.spark\spark-tags_2.11\jars\spark-tags_2.11-2.2.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] C:\Users\gnana\.ivy2\cache\org.spark-project.spark\unused\jars\unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] Total time: 91 s, completed Mar 11, 2018 6:15:45 PM

You need to define a merge strategy in your build.sbt so that sbt-assembly knows which copy of UnusedStubClass.class to keep:
organization := "com.example"
name := "cass-conn"
version := "0.1"
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-hive" % sparkVersion % "provided",
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion ,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
"org.apache.spark" %% "spark-streaming" % "2.2.0" % "provided",
)
assemblyMergeStrategy in assembly := {
  case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
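MergeStrategy.first is safe here: org.apache.spark.unused.UnusedStubClass is an empty placeholder class that several Spark artifacts happen to ship, so any of the duplicate copies will do.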

Check your Java version. I had the same issue with newer Java versions and fixed it by downgrading to Java 8.
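If the JDK turns out to be the culprit, you can make the requirement explicit in the build. The following is only a sketch (assuming Java 8 is the intended target): it uses sbt's initialize hook to fail fast when sbt itself is running on a different JDK.
// build.sbt (sketch): fail the build early unless sbt runs on Java 8
initialize := {
  val _ = initialize.value // keep whatever initialization was defined before
  val javaVersion = sys.props("java.specification.version")
  require(javaVersion == "1.8", s"This build expects Java 8 (1.8), but sbt is running on Java $javaVersion")
}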

Related

Spark Build Fails Because Of Avro Mapred Dependency

I have a Scala Spark project that fails because of some dependency hell. Here is my build.sbt:
scalaVersion := "2.13.3"
val SPARK_VERSION = "3.2.0"
libraryDependencies ++= Seq(
"com.typesafe" % "config" % "1.3.1",
"com.github.pathikrit" %% "better-files" % "3.9.1",
"org.apache.commons" % "commons-compress" % "1.14",
"commons-io" % "commons-io" % "2.6",
"com.typesafe.scala-logging" %% "scala-logging" % "3.9.4",
"ch.qos.logback" % "logback-classic" % "1.2.3" exclude ("org.slf4j", "*"),
"org.plotly-scala" %% "plotly-render" % "0.8.1",
"org.apache.spark" %% "spark-sql" % SPARK_VERSION,
"org.apache.spark" %% "spark-mllib" % SPARK_VERSION,
// Test dependencies
"org.scalatest" %% "scalatest" % "3.2.10" % Test,
"com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1" % Test,
"org.awaitility" % "awaitility" % "3.0.0" % Test,
"org.apache.spark" %% "spark-core" % SPARK_VERSION % Test,
"org.apache.spark" %% "spark-sql" % SPARK_VERSION % Test
Here is the build failure:
[error] stack trace is suppressed; run 'last update' for the full output
[error] stack trace is suppressed; run 'last ssExtractDependencies' for the full output
[error] (update) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar: not found: https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar
[error] (ssExtractDependencies) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar: not found: https://repo1.maven.org/maven2/org/apache/avro/avro-mapred/1.10.2/avro-mapred-1.10.2-hadoop2.jar
[error] Total time: 5 s, completed Dec 19, 2021, 5:14:33 PM
[info] shutting down sbt server
Is this caused by the fact that I'm using Scala 2.13?
I had to do the inevitable and add this to my build.sbt:
ThisBuild / useCoursier := false
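On sbt 1.3+ Coursier is the default dependency resolver; this setting switches resolution back to Apache Ivy, which in this case got past the missing hadoop2-classified avro-mapred artifact.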

What dependencies to use in build.sbt for neo4j spark connector?

I was running Scala code in spark-shell using this:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jsparkCluster.scala
This runs just fine on a single Spark instance; now I want to run it on a cluster.
I have a build.sbt file as follows:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-sql" % "2.2.0"
)
So I am not sure what I need in libraryDependencies for the libraries I am trying to load, or what the syntax for it should be.
The errors I get with sbt package are:
[info] Compiling 2 Scala sources to /Users/shane.thomas/SparkCourse/spark-sbt-builds/target/scala-2.11/classes...
[error] /Users/shane.thomas/SparkCourse/spark-sbt-builds/neo4jSparkCluster.scala:1: object neo4j is not a member of package org
[error] import org.neo4j.spark._
[error] ^
[error] /Users/shane.thomas/SparkCourse/spark-sbt-builds/neo4jSparkCluster.scala:5: object streaming is not a member of package org.apache.spark
[error] import org.apache.spark.streaming._
[error] ^
[error] /Users/shane.thomas/SparkCourse/spark-sbt-builds/neo4jSparkCluster.scala:6: object streaming is not a member of package org.apache.spark
[error] import org.apache.spark.streaming.StreamingContext._
[error] ^
[error] /Users/shane.thomas/SparkCourse/spark-sbt-builds/neo4jSparkCluster.scala:539: not found: value Neo4j
[error] val neo = Neo4j(sc)
[error] ^
[error] four errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 5 s, completed Dec 7, 2017 2:45:00 PM
Try adding the following
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "neo4j-contrib" % "neo4j-spark-connector" % "2.1.0-M4"
Taken from https://github.com/neo4j-contrib/neo4j-spark-connector under the SBT section
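Note that the compile errors also show the org.apache.spark.streaming imports failing, so spark-streaming needs to be on the classpath as well. A minimal sketch of the dependency section, assuming the same Spark 2.2.0 / Scala 2.11 setup, the resolver from the connector's README, and the GraphFrames version used in the spark-shell command:
// build.sbt (sketch)
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.2.0",
  "org.apache.spark" %% "spark-sql"       % "2.2.0",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",            // for org.apache.spark.streaming._
  "neo4j-contrib"    %  "neo4j-spark-connector" % "2.1.0-M4",   // for org.neo4j.spark._
  "graphframes"      %  "graphframes" % "0.2.0-spark2.0-s_2.11" // same version as in the spark-shell command
)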
These work fine for me:
scalaVersion := "2.12.13"
val spark_version:String = "3.1.0"
resolvers ++= Seq(
"Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % spark_version,
"org.apache.spark" %% "spark-sql" % spark_version,
"org.neo4j" % "neo4j-kernel" % "4.2.3",
"neo4j-contrib" % "neo4j-spark-connector" % "2.4.5-M2",
"graphframes" % "graphframes" % "0.8.1-spark3.0-s_2.12",
)

Build.sbt breaks when adding GraphFrames built with Scala 2.11

I'm trying to add GraphFrames to my Scala Spark application, and this was going fine when I added the one built for Scala 2.10. However, as soon as I tried to build it with the GraphFrames build for Scala 2.11, it breaks.
The problem seems to be that conflicting Scala versions (2.10 and 2.11) are being used. I'm getting the following error:
[error] Modules were resolved with conflicting cross-version suffixes in {file:/E:/Documents/School/LSDE/hadoopcryptoledger/examples/scala-spark-graphx-bitcointransaction/}root:
[error] org.apache.spark:spark-launcher _2.10, _2.11
[error] org.json4s:json4s-ast _2.10, _2.11
[error] org.apache.spark:spark-network-shuffle _2.10, _2.11
[error] com.twitter:chill _2.10, _2.11
[error] org.json4s:json4s-jackson _2.10, _2.11
[error] com.fasterxml.jackson.module:jackson-module-scala _2.10, _2.11
[error] org.json4s:json4s-core _2.10, _2.11
[error] org.apache.spark:spark-unsafe _2.10, _2.11
[error] org.apache.spark:spark-core _2.10, _2.11
[error] org.apache.spark:spark-network-common _2.10, _2.11
However, I can't figure out what causes this. This is my full build.sbt:
import sbt._
import Keys._
import scala._
lazy val root = (project in file("."))
.settings(
name := "example-hcl-spark-scala-graphx-bitcointransaction",
version := "0.1"
)
.configs( IntegrationTest )
.settings( Defaults.itSettings : _*)
scalacOptions += "-target:jvm-1.7"
crossScalaVersions := Seq("2.11.8")
resolvers += Resolver.mavenLocal
fork := true
jacoco.settings
itJacoco.settings
assemblyJarName in assembly := "example-hcl-spark-scala-graphx-bitcointransaction.jar"
libraryDependencies += "com.github.zuinnote" % "hadoopcryptoledger-fileformat" % "1.0.7" % "compile"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.5.0" % "provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.0" % "provided"
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1" % "it"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.0" % "it" classifier "" classifier "tests"
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.7.0" % "it" classifier "" classifier "tests"
libraryDependencies += "org.apache.hadoop" % "hadoop-minicluster" % "2.7.0" % "it"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test,it"
libraryDependencies += "graphframes" % "graphframes" % "0.5.0-spark2.1-s_2.11"
Can anyone pinpoint which dependency is based on Scala 2.10 and is causing the build to fail?
I found out what the problem was. Apparently, if you use:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
the %% operator picks the Scala suffix from the project's scalaVersion, which my build never set (only crossScalaVersions), so it defaulted to 2.10. It all worked once I changed the spark-core and spark-graphx dependencies to:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "2.2.0" % "provided"

Why does sbt assembly of a Spark application lead to "Modules were resolved with conflicting cross-version suffixes"?

I am using a CDH cluster with Spark 2.1 and Scala 2.11.8.
I use sbt 1.0.2.
While running sbt assembly, I get the following error:
[error] java.lang.RuntimeException: Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators
I tried to override the version mismatch using dependencyOverrides and force(), but neither worked.
Error message from sbt assembly
[error] Modules were resolved with conflicting cross-version suffixes in {file:/D:/Tools/scala_ide/test_workspace/test/NewSparkTest/}newsparktest:
[error] org.scala-lang.modules:scala-xml _2.11, _2.12
[error] org.scala-lang.modules:scala-parser-combinators _2.11, _2.12
[error] java.lang.RuntimeException: Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators
[error] at scala.sys.package$.error(package.scala:27)
[error] at sbt.librarymanagement.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:39)
[error] at sbt.librarymanagement.ConflictWarning$.apply(ConflictWarning.scala:19)
[error] at sbt.Classpaths$.$anonfun$ivyBaseSettings$64(Defaults.scala:1971)
[error] at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:42)
[error] at sbt.std.Transform$$anon$4.work(System.scala:64)
[error] at sbt.Execute.$anonfun$submit$2(Execute.scala:257)
[error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] at sbt.Execute.work(Execute.scala:266)
[error] at sbt.Execute.$anonfun$submit$1(Execute.scala:257)
[error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:167)
[error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:32)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:748)
[error] (*:update) Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators
[error] Total time: 413 s, completed Oct 12, 2017 3:28:02 AM
build.sbt
name := "newtest"
version := "0.0.2"
scalaVersion := "2.11.8"
sbtPlugin := true
val sparkVersion = "2.1.0"
mainClass in (Compile, run) := Some("com.testpackage.sq.newsparktest")
assemblyJarName in assembly := "newtest.jar"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-sql_2.11" % "2.1.0" % "provided",
"com.databricks" % "spark-avro_2.11" % "3.2.0",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided")
libraryDependencies +=
"log4j" % "log4j" % "1.2.15" excludeAll(
ExclusionRule(organization = "com.sun.jdmk"),
ExclusionRule(organization = "com.sun.jmx"),
ExclusionRule(organization = "javax.jms")
)
resolvers += "SparkPackages" at "https://dl.bintray.com/spark-packages/maven/"
resolvers += Resolver.url("bintray-sbt-plugins", url("http://dl.bintray.com/sbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
plugins.sbt
dependencyOverrides += ("org.scala-lang.modules" % "scala-xml_2.11" % "1.0.4")
dependencyOverrides += ("org.scala-lang.modules" % "scala-parser-combinators_2.11" % "1.0.4")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
resolvers += Resolver.url("bintray-sbt-plugins", url("https://dl.bintray.com/eed3si9n/sbt-plugins/"))(Resolver.ivyStylePatterns)
tl;dr Remove sbtPlugin := true from build.sbt (that is for sbt plugins not applications).
You should also remove dependencyOverrides from plugins.sbt.
You should change spark-core_2.11 and the other Spark dependencies in libraryDependencies to be as follows:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
The change is to use %% (two percent signs) and drop the Scala version suffix from the artifact name, e.g. spark-core above; %% appends the project's Scala binary version automatically.
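Applied to the whole list from the question, a corrected sketch (same versions, only the notation changes) would be:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided",
  "com.databricks"   %% "spark-avro" % "3.2.0",
  "org.apache.spark" %% "spark-hive" % "2.1.0" % "provided")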

play-json breaks sbt build

Suddenly, as of today, my project has stopped compiling successfully. Upon further investigation I've found out the reason is the play-json library that I include in my dependencies.
Here's my build.sbt:
name := """project-name"""
version := "1.0"
scalaVersion := "2.10.2"
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-actor" % "2.2.1",
"com.typesafe.akka" %% "akka-testkit" % "2.2.1",
"org.scalatest" %% "scalatest" % "1.9.1" % "test",
"org.bouncycastle" % "bcprov-jdk16" % "1.46",
"com.sun.mail" % "javax.mail" % "1.5.1",
"com.typesafe.slick" %% "slick" % "2.0.1",
"org.postgresql" % "postgresql" % "9.3-1101-jdbc41",
"org.slf4j" % "slf4j-nop" % "1.6.4",
"com.drewnoakes" % "metadata-extractor" % "2.6.2",
"com.typesafe.play" %% "play-json" % "2.2.2"
)
resolvers += "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/"
If I try to create a new project in activator with all the lines except "com.typesafe.play" %% "play-json" % "2.2.2" then it compiles successfully. But once I add play-json I get the following error:
[error] References to undefined settings:
[error]
[error] *:playCommonClassloader from echo:run
[error]
[error] docs:managedClasspath from echo:run
[error]
[error] *:playReloaderClassloader from echo:run
[error]
[error] echo:playVersion from echo:echoTracePlayVersion
[error]
[error] *:playRunHooks from echo:playRunHooks
[error] Did you mean echo:playRunHooks ?
[error]
And I keep getting this error even if I remove the play-json line. Why is that? What should I do to fix it?