Assembling a Scala project causes deduplicate errors

I'm trying to assemble my Scala project and can't get rid of some deduplicate errors.
Here is the problematic output:
> [error] 2 errors were encountered during merge [error] stack trace is
> suppressed; run 'last
> ProjectRef(uri("https://hyehezkel@fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"),
> "global") / assembly' for the full output [error]
> (ProjectRef(uri("https://hyehezkel@fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"),
> "global") / assembly) deduplicate: different file contents found in
> the following: [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-buffer\4.1.42.Final\netty-buffer-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-codec\4.1.42.Final\netty-codec-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-common\4.1.42.Final\netty-common-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-handler\4.1.42.Final\netty-handler-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-resolver\4.1.42.Final\netty-resolver-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport-native-epoll\4.1.42.Final\netty-transport-native-epoll-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport-native-unix-common\4.1.42.Final\netty-transport-native-unix-common-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport\4.1.42.Final\netty-transport-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error] deduplicate: different file contents found in the following:
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-annotations\2.10.1\jackson-annotations-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-core\2.10.1\jackson-core-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-databind\2.10.1\jackson-databind-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\dataformat\jackson-dataformat-csv\2.10.0\jackson-dataformat-csv-2.10.0.jar:module-info.class
I have read the following article but didn't manage to solve it:
https://index.scala-lang.org/sbt/sbt-assembly/sbt-assembly/0.14.5?target=_2.12_1.0
This is my plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
And this is my build.sbt:
import sbt.Keys.{dependencyOverrides, libraryDependencies, mappings}
import sbtassembly.AssemblyPlugin.assemblySettings._
name := "classification_endpoint_discovery"
version := "0.1"
organization in ThisBuild := "com.forescout"
scalaVersion in ThisBuild := "2.13.1"
updateOptions := updateOptions.value.withCachedResolution(true)
//classpathTypes += "maven-plugin"
exportJars := true
logLevel := Level.Info
logLevel in assembly := Level.Debug
lazy val commonProject = RootProject(uri("https://hyehezkel@fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"))
lazy val global = project
.in(file("."))
.settings(settings)
.enablePlugins(AssemblyPlugin)
// .disablePlugins(AssemblyPlugin)
.aggregate(
commonProject,
`endpoint-discovery`
)
lazy val `endpoint-discovery` = project
.settings(
name := "endpoint-discovery",
settings,
assemblySettings,
assemblyJarName in assembly := "endpoint-discovery.jar",
assemblyJarName in assemblyPackageDependency := "endpoint-discovery-dep.jar",
libraryDependencies += dependencies.postgresql,
libraryDependencies += "com.lihaoyi" %% "ujson" % "0.7.5",
libraryDependencies += "com.lihaoyi" %% "requests" % "0.2.0",
libraryDependencies += dependencies.`deepLearning4j-core`,
libraryDependencies += dependencies.`deeplearning4j-nn`,
libraryDependencies += dependencies.`nd4j-native-platform`,
excludeDependencies += "commons-logging" % "commons-logging"
// dependencyOverrides += "org.slf4j" % "slf4j-api" % "1.7.5",
// dependencyOverrides += "org.slf4j" % "slf4j-simple" % "1.7.5",
)
.dependsOn(commonProject)
.enablePlugins(AssemblyPlugin)
lazy val dependencies =
new {
val deepLearning4jV = "1.0.0-beta4"
val postgresqlV = "9.1-901.jdbc4"
val `deepLearning4j-core` = "org.deeplearning4j" % "deeplearning4j-core" % deepLearning4jV
val `deeplearning4j-nn` = "org.deeplearning4j" % "deeplearning4j-nn" % deepLearning4jV
val `nd4j-native-platform` = "org.nd4j" % "nd4j-native-platform" % deepLearning4jV
val postgresql = "postgresql" % "postgresql" % postgresqlV
}
// SETTINGS
lazy val settings =
commonSettings
lazy val compilerOptions = Seq(
"-unchecked",
"-feature",
"-language:existentials",
"-language:higherKinds",
"-language:implicitConversions",
"-language:postfixOps",
"-deprecation",
"-encoding",
"utf8"
)
lazy val commonSettings = Seq(
scalacOptions ++= compilerOptions
)
lazy val assemblySettings = Seq(
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false, includeDependency = false),
assemblyMergeStrategy in assembly := {
case PathList("META-INF", "io.netty.versions.properties", xs @ _*) => MergeStrategy.singleOrError
case "module-info.class" => MergeStrategy.singleOrError
case PathList("org", "xmlpull", xs @ _*) => MergeStrategy.discard
case PathList("org", "nd4j", xs @ _*) => MergeStrategy.first
case PathList("org", "bytedeco", xs @ _*) => MergeStrategy.first
case PathList("org.bytedeco", xs @ _*) => MergeStrategy.first
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case "XmlPullParser.class" => MergeStrategy.discard
case "Nd4jBase64.class" => MergeStrategy.discard
case "XmlPullParserException.class" => MergeStrategy.discard
// case n if n.startsWith("rootdoc.txt") => MergeStrategy.discard
// case n if n.startsWith("readme.html") => MergeStrategy.discard
// case n if n.startsWith("readme.txt") => MergeStrategy.discard
case n if n.startsWith("library.properties") => MergeStrategy.discard
case n if n.startsWith("license.html") => MergeStrategy.discard
case n if n.startsWith("about.html") => MergeStrategy.discard
// case _ => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
)
I have tried many merge strategies but nothing works.
What am I missing here?
Any advice?

For META-INF/io.netty.versions.properties
you have:
case PathList("META-INF", "io.netty.versions.properties", xs @ _*) => MergeStrategy.singleOrError
which says that it will error out if there is more than one file with this name.
Try MergeStrategy.first for these files instead.
module-info.class
These files are only relevant to the Java 9 module system. Usually, you can just discard them:
case "module-info.class" => MergeStrategy.discard
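Putting the two suggestions together, the relevant cases in the asker's build could look like this (a sketch against sbt-assembly 0.14.x's `in`-style syntax; the fallback to the previous strategy is kept as in the question):

```scala
assemblyMergeStrategy in assembly := {
  // every netty jar ships its own copy of this file; any one of them will do
  case PathList("META-INF", "io.netty.versions.properties", xs @ _*) => MergeStrategy.first
  // Java 9 module descriptors are irrelevant inside a fat jar
  case "module-info.class" => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
```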

Related

Why does Spark application fail with "ClassNotFoundException: Failed to find data source: jdbc" as uber-jar with sbt assembly?

I'm trying to assemble a Spark application using sbt 1.0.4 with sbt-assembly 0.14.6.
The Spark application works fine when launched in IntelliJ IDEA or spark-submit, but if I run the assembled uber-jar with the command line (cmd in Windows 10):
java -Xmx1024m -jar my-app.jar
I get the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
The Spark application looks as follows.
package spark.main
import java.util.Properties
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]) {
val connectionProperties = new Properties()
connectionProperties.put("user","postgres")
connectionProperties.put("password","postgres")
connectionProperties.put("driver", "org.postgresql.Driver")
val testTable = "test_tbl"
val spark = SparkSession.builder()
.appName("Postgres Test")
.master("local[*]")
.config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
.config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
.getOrCreate()
val dfPg = spark.sqlContext.read.
jdbc("jdbc:postgresql://localhost/testdb",testTable,connectionProperties)
dfPg.show()
}
}
The following is build.sbt.
name := "apache-spark-scala"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
mainClass in Compile := Some("spark.main.Main")
libraryDependencies ++= {
val sparkVer = "2.1.1"
val postgreVer = "42.0.0"
val cassandraConVer = "2.0.2"
val configVer = "1.3.1"
val logbackVer = "1.7.25"
val loggingVer = "3.7.2"
val commonsCodecVer = "1.10"
Seq(
"org.apache.spark" %% "spark-sql" % sparkVer,
"org.apache.spark" %% "spark-core" % sparkVer,
"com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
"org.postgresql" % "postgresql" % postgreVer,
"com.typesafe" % "config" % configVer,
"commons-codec" % "commons-codec" % commonsCodecVer,
"com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
"org.slf4j" % "slf4j-api" % logbackVer
)
}
dependencyOverrides ++= Seq(
"io.netty" % "netty-all" % "4.0.42.Final",
"commons-net" % "commons-net" % "2.2",
"com.google.guava" % "guava" % "14.0.1"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Does anyone have any idea why?
[UPDATE]
Configuration taken from the official GitHub repository did the trick:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) =>
xs map {_.toLowerCase} match {
case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
MergeStrategy.discard
case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
MergeStrategy.discard
case "services" :: _ => MergeStrategy.filterDistinctLines
case _ => MergeStrategy.first
}
case _ => MergeStrategy.first
}
The question is almost a duplicate of Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?, with the difference that the other OP used Apache Maven to create an uber-jar, while here it's about sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister.
For the jdbc alias to work, Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry (there are others):
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
That's what ties jdbc alias up with the data source.
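The lookup itself can be modeled in a few lines. This is only a toy sketch (the trait name matches Spark's, but the real mechanism uses java.util.ServiceLoader over the merged service file), showing why an empty registry makes the alias fail:

```scala
// Toy model of Spark's short-name lookup -- an illustration, not Spark's code.
// The real mechanism uses java.util.ServiceLoader over the entries that survive
// into the uber-jar's META-INF/services/...DataSourceRegister file.
trait DataSourceRegister { def shortName: String }

object JdbcRelationProvider extends DataSourceRegister { def shortName = "jdbc" }

// `registered` stands for whatever implementations the service file listed.
def resolve(alias: String, registered: Seq[DataSourceRegister]): Option[DataSourceRegister] =
  registered.find(_.shortName == alias)
```

With the registration file missing from the uber-jar, `resolve("jdbc", Seq.empty)` returns `None`; that empty result is the toy analogue of the `ClassNotFoundException: Failed to find data source: jdbc` above.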
And you've excluded it from an uber-jar by the following assemblyMergeStrategy.
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Note case PathList("META-INF", xs @ _*), which sends everything under META-INF to MergeStrategy.discard. That's the root cause.
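PathList is just an extractor over path segments, so the effect is easy to see with a self-contained imitation (the stand-ins below are toy types, not sbt-assembly's real ones): the blanket META-INF case claims the services file too.

```scala
// Minimal stand-in for sbt-assembly's PathList extractor: split a path on '/'.
object PathList {
  def unapplySeq(path: String): Option[Seq[String]] = Some(path.split("/").toSeq)
}

sealed trait Strategy
case object Discard extends Strategy
case object First extends Strategy

// The build's strategy, reduced to the two cases that matter here.
def strategyFor(path: String): Strategy = path match {
  case PathList("META-INF", _*) => Discard // swallows META-INF/services as well
  case _                        => First
}
```

Here `strategyFor("META-INF/services/org.apache.spark.sql.sources.DataSourceRegister")` evaluates to `Discard`, so the registration never makes it into the jar.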
Just to check that the "infrastructure" is available and that you can use the jdbc data source by its fully-qualified name (not the alias), try this:
spark.read.
format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
load("jdbc:postgresql://localhost/testdb")
You will see other problems due to missing options like url, but...we're digressing.
A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (that would create an uber-jar with all data sources, incl. the jdbc data source).
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
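For intuition, concat and its deduplicating cousin filterDistinctLines behave roughly like the following line-merging functions (a simplified sketch of the semantics, not sbt-assembly's implementation):

```scala
// `copies` holds the content of each jar's copy of the same path.
// MergeStrategy.concat: keep every line from every copy.
def concat(copies: Seq[String]): String =
  copies.mkString("\n")

// MergeStrategy.filterDistinctLines: concatenate, but keep each line once --
// handy for META-INF/services files, where each line names one implementation.
def filterDistinctLines(copies: Seq[String]): String =
  copies.flatMap(_.linesIterator).distinct.mkString("\n")
```

Either way, every jar's registrations survive into the merged service file, which is exactly what the blanket discard prevented.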

Sbt ( new version 1.0.4) assembly failure

I have been trying to build a fat jar for some time now. I have assembly.sbt in the project folder and it looks like below:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
and my build.sbt looks like below:
name := "cool"
version := "0.1"
scalaVersion := "2.11.8"
resolvers += "Hortonworks Repository" at
"http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at
"http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" %
"provided",
"org.apache.spark" % "spark-streaming-kafka-assembly_2.10" %
"1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I get the error below:
assemblyMergeStrategy in assembly := {
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:14: error: not found:
value assembly
assemblyMergeStrategy in assembly := {
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: not found:
value PathList
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: star patterns
must correspond with varargs parameters
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: not found:
value MergeStrategy
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:16: error: not found:
value MergeStrategy
case x => MergeStrategy.first
^
[error] Type error in expression
I get this type error, and it seems like it won't recognize keys like assemblyMergeStrategy. I'm using sbt 1.0.4 and the latest version of the Scala IDE for Eclipse.
I have tried changing the sbt version with no result, went through the whole document at https://github.com/sbt/sbt-assembly, made sure there were no typos, and the suggestions mentioned in other threads weren't of much help (mostly the questions are about older versions of sbt). If someone could guide me, that would be very helpful. Thanks.

sc.TextFile("") working in Eclipse but not in a JAR

I'm writing code which will run on a Hadoop cluster, but before that I'm testing it locally with local files. The code works great in Eclipse, but when I build a huge JAR with sbt (with the Spark libs etc.) the program works until a textFile(path) call. My code is:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.joda.time.format.DateTimeFormat
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer
object TestCRA2 {
val conf = new SparkConf()
.setMaster("local")
.setAppName("Test")
.set("spark.driver.memory", "4g")
.set("spark.executor.memory", "4g")
val context = new SparkContext(conf)//.master("local")
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
def TimeParse1(path: String) : RDD[(Int,Long,Long)] = {
val data = context.textFile(path).map(_.split(";"))
return data
}
def main(args: Array[String]) {
val data = TimeParse1("file:///home/quentin/Downloads/CRA")
}
}
And here is my error :
Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:341)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1034)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1029)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1029)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
at main.scala.TestCRA2$.TimeParse1(TestCRA.scala:37)
at main.scala.TestCRA2$.main(TestCRA.scala:84)
at main.scala.TestCRA2.main(TestCRA.scala)
I can't put my files into the JAR because they are on the Hadoop cluster, and it works in Eclipse.
Here is my build.sbt :
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
If I don't set up my assemblyMergeStrategy like this, I get a bunch of merge errors.
Actually, I needed to change my build.sbt like this:
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) =>
(xs map {_.toLowerCase}) match {
case "services" :: xs => MergeStrategy.first
case _ => MergeStrategy.discard
}
case x => MergeStrategy.first
}
Thank you @lyomi
Your sbt assembly is probably ignoring some of the required files. Specifically, Hadoop's FileSystem class relies on a service discovery mechanism that looks for ALL META-INF/services/org.apache.hadoop.fs.FileSystem files on the classpath.
On Eclipse it was fine, because each JAR had the corresponding file, but in the uber-jar one might have overridden others, causing the file: scheme to not get recognized.
In your SBT settings, add the following, to concatenate the service discovery files instead of discarding some of them.
val defaultMergeStrategy: String => MergeStrategy = {
case PathList("META-INF", xs @ _*) =>
(xs map {_.toLowerCase}) match {
// ... possibly other settings ...
case "services" :: xs =>
MergeStrategy.filterDistinctLines
case _ => MergeStrategy.deduplicate
}
case _ => MergeStrategy.deduplicate
}
See README.md of sbt-assembly for more info.

Unresolved dependencies path for SBT project in IntelliJ

I'm using IntelliJ to develop a Spark application. I'm following this instruction on how to make IntelliJ work nicely with an SBT project.
As my whole team is using IntelliJ, we can just modify build.sbt, but we got this unresolved dependencies error:
Error:Error while importing SBT project:
[info] Resolving org.apache.thrift#libfb303;0.9.2 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org.apache.spark#spark-streaming_2.10;2.1.0 ...
[info] Resolving org.apache.spark#spark-parent_2.10;2.1.0 ...
[info] Resolving org.scala-lang#jline;2.10.6 ...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: sparrow-to-orc#sparrow-to-orc_2.10;0.1: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] sparrow-to-orc:sparrow-to-orc_2.10:0.1
[warn] +- mainrunner:mainrunner_2.10:0.1-SNAPSHOT
[trace] Stack trace suppressed: run 'last mainRunner/:ssExtractDependencies' for the full output.
[trace] Stack trace suppressed: run 'last mainRunner/:update' for the full output.
[error] (mainRunner/:ssExtractDependencies) sbt.ResolveException: unresolved dependency: sparrow-to-orc#sparrow-to-orc_2.10;0.1: not found
[error] (mainRunner/:update) sbt.ResolveException: unresolved dependency: sparrow-to-orc#sparrow-to-orc_2.10;0.1: not found
[error] Total time: 47 s, completed Jun 10, 2017 8:39:57 AM
And this is my build.sbt
name := "sparrow-to-orc"
version := "0.1"
scalaVersion := "2.11.8"
lazy val sparkDependencies = Seq(
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-sql" % "2.1.0",
"org.apache.spark" %% "spark-hive" % "2.1.0",
"org.apache.spark" %% "spark-streaming" % "2.1.0"
)
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.1"
libraryDependencies ++= sparkDependencies.map(_ % "provided")
lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
libraryDependencies ++= sparkDependencies.map(_ % "compile")
)
assemblyMergeStrategy in assembly := {
case PathList("org","aopalliance", xs @ _*) => MergeStrategy.last
case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("com", "google", xs @ _*) => MergeStrategy.last
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
case "META-INF/mailcap" => MergeStrategy.last
case "META-INF/mimetypes.default" => MergeStrategy.last
case "plugin.properties" => MergeStrategy.last
case "log4j.properties" => MergeStrategy.last
case "overview.html" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
If I don't have this line then the program works fine
lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
libraryDependencies ++= sparkDependencies.map(_ % "compile")
)
But then I won't be able to run the application inside IntelliJ, as the Spark dependencies won't be included in the classpath.
I had the same issue. The solution is to set the Scala version in the mainRunner to be the same as the one declared at the top of the build.sbt file:
lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
libraryDependencies ++= sparkDependencies.map(_ % "compile"),
scalaVersion := "2.11.8"
)
Good luck!

Dependency issue with Scalding and Hadoop with sbt-assembly

I'm trying to build a fat jar with sbt of a simple Hadoop job, in an attempt to run it on Amazon EMR. However, when I run sbt assembly I get the following error:
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /Users/trenthauck/.ivy2/cache/org.mortbay.jetty/jsp-2.1/jars/jsp-2.1-6.1.14.jar:org/apache/jasper/compiler/Node$ChildInfo.class
[error] /Users/trenthauck/.ivy2/cache/tomcat/jasper-compiler/jars/jasper-compiler-5.5.12.jar:org/apache/jasper/compiler/Node$ChildInfo.class
[error] Total time: 10 s, completed Sep 14, 2013 4:49:24 PM
I attempted to follow the suggestion here https://groups.google.com/forum/#!topic/simple-build-tool/tzkq5TioIqM but it didn't work.
My build.sbt looks like:
import AssemblyKeys._
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("org", "apache", "jasper", xs @ _*) => MergeStrategy.last
case x => old(x)
}
}
assemblySettings
name := "Scaling Play"
version := "SNAPSHOT-0.1"
scalaVersion := "2.10.1"
libraryDependencies ++= Seq(
"com.twitter" % "scalding-core_2.10" % "0.8.8",
"com.twitter" % "scalding-args_2.10" % "0.8.8",
"com.twitter" % "scalding-date_2.10" % "0.8.8",
"org.apache.hadoop" % "hadoop-core" % "1.0.0"
)
The order of the directives is important. You update the merge strategy, only to overwrite it again a line later when assemblySettings is applied. First defining assemblySettings and then updating the merge strategy will solve it.
The updated build.sbt:
import AssemblyKeys._
assemblySettings
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("org", "apache", "jasper", xs @ _*) => MergeStrategy.last
case x => old(x)
}
}
…
After that you will discover that there are a lot more conflicting classes and other files. In this case you will require the following merges:
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("project.clj") => MergeStrategy.last
case PathList("overview.html") => MergeStrategy.last
case x => old(x)
Note that using merge strategies for class files may cause problems due to incompatible versions of that specific class. If that is the case, your problem is larger, because the dependencies are incompatible with each other. You then have to resort to removing the dependency and finding/making a compatible version.