java.lang.VerifyError: Operand stack overflow for google-ads API and SBT - scala

I am trying to migrate from the Google AdWords API to the google-ads-v10 API in Spark 3.1.1 on EMR.
I am facing some dependency issues due to conflicts with existing jars.
Initially, we were facing a dependency issue related to the Protobuf jar:
Exception in thread "grpc-default-executor-0" java.lang.IllegalAccessError: tried to access field com.google.protobuf.AbstractMessage.memoizedSize from class com.google.ads.googleads.v10.services.SearchGoogleAdsRequest
at com.google.ads.googleads.v10.services.SearchGoogleAdsRequest.getSerializedSize(SearchGoogleAdsRequest.java:394)
at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
To resolve this, I tried to shade the Protobuf jar and build an uber-jar instead. After the shading, running my project locally in IntelliJ works fine, but when trying to run the executable jar I created, I get the following error:
Exception in thread "main" io.grpc.ManagedChannelProvider$ProviderNotFoundException: No functional channel service provider found. Try adding a dependency on the grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact
I tried adding all those libraries via --spark.jars.packages, but it didn't help.
java.lang.VerifyError: Operand stack overflow
Exception Details:
Location:
io/grpc/internal/TransportTracer.getStats()Lio/grpc/InternalChannelz$TransportStats; ...
...
...
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.<init>(NettyChannelBuilder.java:96)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.forTarget(NettyChannelBuilder.java:169)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.forAddress(NettyChannelBuilder.java:152)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(NettyChannelProvider.java:38)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(NettyChannelProvider.java:24)
at io.grpc.ManagedChannelBuilder.forAddress(ManagedChannelBuilder.java:39)
at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:348)
Has anyone ever encountered such an issue?
Build.sbt
lazy val dependencies = new {
  val sparkRedshift = "io.github.spark-redshift-community" %% "spark-redshift" % "5.0.3" % "provided" excludeAll (ExclusionRule(organization = "com.amazonaws"))
  val jsonSimple = "com.googlecode.json-simple" % "json-simple" % "1.1" % "provided"
  val googleAdsLib = "com.google.api-ads" % "google-ads" % "17.0.1"
  val jedis = "redis.clients" % "jedis" % "3.0.1" % "provided"
  val sparkAvro = "org.apache.spark" %% "spark-avro" % sparkVersion % "provided"
  val queryBuilder = "com.itfsw" % "QueryBuilder" % "1.0.4" % "provided" excludeAll (ExclusionRule(organization = "com.fasterxml.jackson.core"))
  val protobufForGoogleAds = "com.google.protobuf" % "protobuf-java" % "3.18.1"
  val guavaForGoogleAds = "com.google.guava" % "guava" % "31.1-jre"
}

libraryDependencies ++= Seq(
  dependencies.sparkRedshift, dependencies.jsonSimple, dependencies.googleAdsLib, dependencies.guavaForGoogleAds, dependencies.protobufForGoogleAds,
  dependencies.jedis, dependencies.sparkAvro,
  dependencies.queryBuilder
)

dependencyOverrides ++= Set(
  dependencies.guavaForGoogleAds
)

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.protobuf.@1").inAll
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case PathList("module-info.class", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

I had a similar issue and I changed the assembly merge strategy to this:
assemblyMergeStrategy in assembly := {
  case x if x.contains("io.netty.versions.properties") => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
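One likely contributor to the ProviderNotFoundException above is that the blanket META-INF discard in the question's merge strategy also throws away META-INF/services, which gRPC uses (via ServiceLoader) to find grpc-netty-shaded at runtime. A minimal sketch of a merge strategy that keeps those entries, assuming the rest of the build stays as shown:

assemblyMergeStrategy in assembly := {
  // keep ServiceLoader registrations (gRPC channel providers, Spark data sources, ...)
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case x if x.contains("io.netty.versions.properties") => MergeStrategy.discard
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}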

Solved this by using the google-ads-shadowjar as an external jar rather than depending on the google-ads library. This avoids having to deal with the dependency conflicts manually, but it makes your jar size bigger.
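For illustration, one way to wire that in (a sketch; google-ads-shadowjar is the shaded artifact published alongside google-ads, and the version here is assumed to match the 17.0.1 used above):

// replaces the plain google-ads dependency from the question's Build.sbt
val googleAdsLib = "com.google.api-ads" % "google-ads-shadowjar" % "17.0.1"

Alternatively, the downloaded shadow jar can be kept outside the build entirely and passed to spark-submit via --jars, which matches the "external jar" wording above more closely.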

Related

Spark streaming 2.4.0 getting org.apache.spark.sql.AnalysisException: Failed to find data source: kafka

Getting the following error while trying to read data from Kafka. I am using docker-compose to run Kafka and Spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Here is my code for reading:
object Livedata extends App with LazyLogging {

  logger.info("starting livedata...")

  val spark = SparkSession.builder().appName("livedata").master("local[*]").getOrCreate()

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "topic")
    .option("startingOffsets", "latest")
    .load()

  df.printSchema()

  val hadoopConfig = spark.sparkContext.hadoopConfiguration
  hadoopConfig.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  hadoopConfig.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
}
After reading a few answers, I added all the packages to my sbt build.
Here is the build.sbt file:
lazy val root = (project in file(".")).
  settings(
    inThisBuild(List(
      organization := "com.live.data",
      version := "0.1.0",
      scalaVersion := "2.12.2",
      assemblyJarName in assembly := "livedata.jar"
    )),
    name := "livedata",
    libraryDependencies ++= List(
      "org.scalatest" %% "scalatest" % "3.0.5",
      "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
      "org.apache.spark" %% "spark-sql" % "2.4.0",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0" % "provided",
      "org.apache.kafka" % "kafka-clients" % "2.5.0",
      "org.apache.kafka" % "kafka-streams" % "2.5.0",
      "org.apache.kafka" %% "kafka-streams-scala" % "2.5.0"
    )
  )

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Not sure what the main issue is here.
Update:
Finally, I got the solution from here: Error when connecting spark structured streaming + kafka
The main issue was that the org.apache.spark.sql.AnalysisException: Failed to find data source: kafka exception is thrown because the spark-sql-kafka library is not available on the classpath, and Spark is unable to find org.apache.spark.sql.sources.DataSourceRegister inside the META-INF/services folder.
The following code block needs to be added in build.sbt. It will include the org.apache.spark.sql.sources.DataSourceRegister file in the final jar.
// META-INF discarding
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "application.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}
spark-sql-kafka-0-10 is not provided by the Spark runtime, so remove the "provided" flag from that dependency. (spark-sql is provided, though, so you could add the flag to that one.)
You also shouldn't pull in Kafka Streams (it isn't used by Spark), and kafka-clients is pulled in transitively by spark-sql-kafka, so you don't need that either.
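Put together, the dependency list from the question would end up roughly like this (a sketch applying the advice above; versions unchanged):

libraryDependencies ++= List(
  "org.scalatest" %% "scalatest" % "3.0.5",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"
)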

Reading RDF in apache spark

I'm trying to read an RDF/XML file into Apache Spark (Scala 2.11, Apache Spark 1.4.1) using Apache Jena. I wrote this Scala snippet:
val factory = new RdfXmlReaderFactory()
HadoopRdfIORegistry.addReaderFactory(factory)

val conf = new Configuration()
conf.set("rdf.io.input.ignore-bad-tuples", "false")

val data = sc.newAPIHadoopFile(path,
  classOf[RdfXmlInputFormat],
  classOf[LongWritable],   // position
  classOf[TripleWritable], // value
  conf)

data.take(10).foreach(println)
But it throws an error:
INFO readers.AbstractLineBasedNodeTupleReader: Got split with start 0 and length 21765995 for file with total length of 21765995
15/07/23 01:52:42 ERROR readers.AbstractLineBasedNodeTupleReader: Error parsing whole file, aborting further parsing
org.apache.jena.riot.RiotException: Producer failed to ever call start(), declaring producer dead
at org.apache.jena.riot.lang.PipedRDFIterator.hasNext(PipedRDFIterator.java:272)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:242)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
...
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Error parsing whole file at position 0, aborting further parsing
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:285)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
The file is valid, because I can parse it locally. What am I missing?
EDIT
Some information to reproduce the behaviour
Imports:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.jena.hadoop.rdf.io.registry.HadoopRdfIORegistry
import org.apache.jena.hadoop.rdf.io.registry.readers.RdfXmlReaderFactory
import org.apache.jena.hadoop.rdf.types.QuadWritable
import org.apache.spark.SparkContext
scalaVersion := "2.11.7"
dependencies:
"org.apache.hadoop" % "hadoop-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-streaming" % "2.7.1",
"org.apache.spark" % "spark-core_2.11" % "1.4.1",
"com.hp.hpl.jena" % "jena" % "2.6.4",
"org.apache.jena" % "jena-elephas-io" % "0.9.0",
"org.apache.jena" % "jena-elephas-mapreduce" % "0.9.0"
I'm using the sample RDF from here. It's freely available information about John Peel sessions (more info about the dump).
So it appears your problem was down to manually managing your dependencies.
In my environment I was simply passing the following to my Spark shell:
--packages org.apache.jena:jena-elephas-io:0.9.0
This does all the dependency resolution for you.
If you are building an SBT project, it should be sufficient to do the following in your build.sbt:
libraryDependencies += "org.apache.jena" % "jena-elephas-io" % "0.9.0"
Thanks all for the discussion in the comments. The problem was really tricky and not clear from the stack trace: the code needs one extra dependency to work, jena-core, and this dependency must be packaged first.
"org.apache.jena" % "jena-core" % "2.13.0"
"com.hp.hpl.jena" % "jena" % "2.6.4"
I use this assembly strategy:
lazy val strategy = assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) => {
    case PathList("META-INF", xs @ _*) =>
      (xs map { _.toLowerCase }) match {
        case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
        case _ => MergeStrategy.discard
      }
    case x => MergeStrategy.first
  }
}

Conflicting files in uber-jar creation in SBT using sbt-assembly

I am trying to compile and package a fat jar using SBT, and I keep running into the following error. I have tried everything from library dependency excludes to merge strategies.
[trace] Stack trace suppressed: run last *:assembly for the full output.
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /Users/me/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.10.jar:META-INF/maven/org.slf4j/slf4j-api/pom.properties
[error] /Users/me/.ivy2/cache/com.twitter/parquet-format/jars/parquet-format-2.2.0-rc1.jar:META-INF/maven/org.slf4j/slf4j-api/pom.properties
[error] Total time: 113 s, completed Jul 10, 2015 1:57:21 AM
The current incarnation of my build.sbt file is below:
import AssemblyKeys._
assemblySettings
name := "ldaApp"
version := "0.1"
scalaVersion := "2.10.4"
mainClass := Some("myApp")
libraryDependencies +="org.scalanlp" %% "breeze" % "0.11.2"
libraryDependencies +="org.scalanlp" %% "breeze-natives" % "0.11.2"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.3.1"
libraryDependencies +="org.ini4j" % "ini4j" % "0.5.4"
jarName in assembly := "myApp"
net.virtualvoid.sbt.graph.Plugin.graphSettings
libraryDependencies += "org.slf4j" %% "slf4j-api"" % "1.7.10" % "provided"
I realize I am doing something wrong...I just have no idea what.
Here is how you can handle these merge issues.
import sbtassembly.Plugin._
lazy val assemblySettings = sbtassembly.Plugin.assemblySettings ++ Seq(
  publishArtifact in packageScala := false, // Remove scala from the uber jar
  mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
    {
      case PathList("META-INF", "CHANGES.txt") => MergeStrategy.first
      // ...
      case PathList(ps @ _*) if ps.last endsWith "pom.properties" => MergeStrategy.first
      case x => old(x)
    }
  }
)
Then add these settings to your project.
lazy val projectToJar = Project(id = "MyApp", base = file(".")).settings(assemblySettings: _*)
I got your assembly build running by removing spark from the fat jar (mllib is already included in spark).
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.1" % "provided"
Like vitalii said in a comment, this solution was already here. I understand that spending hours on a problem without finding the fix can be frustrating, but please be nice.

Suggestions needed to improve packaging all sources and javadoc of sbt projects

To avoid version-related problems with Scala (2.9, 2.10, 2.11, …), we want to include all the jar files necessary to use Scala in a Java application. To facilitate debugging & development, we want to include the sources & javadocs of all such libraries too.
I know this topic has been asked many times before; however, I haven't found a solution that works for us (Scala 2.11 & sbt 0.13.5).
I managed to prototype an approximate solution with an sbt project configured as follows:
./build.sbt:
val packAllCommand = Command.command("packAll") {
  state =>
    "clean" :: "update" :: "updateClassifiers" ::
      "pack" :: "dependencyGraph" :: "dependencyDot" ::
      state
}

commands += packAllCommand
./project/plugins.sbt:
resolvers +=
"sonatype-releases" at "https://oss.sonatype.org/content/repositories/releases/"
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.4")
./project/Build.scala
import sbt._
import Keys._
import net.virtualvoid.sbt.graph.Plugin.graphSettings
import xerial.sbt.Pack._
/**
* Goal:
*
* use sbt to package all the jars/sources/javadoc for scala & related libraries needed to use scala in a java application
* without requiring scala to be installed on the system.
*
* @author Nicolas.F.Rouquette@jpl.nasa.gov
*/
object BuildWithSourcesAndJavadocs extends Build {
object Versions {
val scala = "2.11.2"
val config = "1.2.1"
val scalaCheck = "1.11.5"
val scalaTest = "2.2.1"
val specs2 = "2.4"
val parboiled = "2.0.0"
}
lazy val scalaLibs: Project = Project(
"scalaLibs",
file( "scalaLibs" ),
settings = Defaults.coreDefaultSettings ++ Defaults.runnerSettings ++ Defaults.baseTasks ++ graphSettings ++ packSettings ++ Seq(
scalaVersion := Versions.scala,
packExpandedClasspath := true,
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % scalaVersion.value % "compile" withSources () withJavadoc (),
"org.scala-lang" % "scala-compiler" % scalaVersion.value % "compile" withSources () withJavadoc (),
"org.scala-lang" % "scala-reflect" % scalaVersion.value % "compile" withJavadoc () withJavadoc () ),
( mappings in pack ) := { extraPackFun.value } ) )
lazy val otherLibs: Project = Project(
"otherLibs",
file( "otherLibs" ),
settings = Defaults.coreDefaultSettings ++ Defaults.runnerSettings ++ Defaults.baseTasks ++ graphSettings ++ packSettings ++ Seq(
scalaVersion := Versions.scala,
packExpandedClasspath := true,
libraryDependencies ++= Seq(
"org.scala-lang" % "scala-library" % Versions.scala % "provided",
"org.scala-lang" % "scala-compiler" % Versions.scala % "provided",
"org.scala-lang" % "scala-reflect" % Versions.scala % "provided",
"com.typesafe" % "config" % Versions.config % "compile" withSources () withJavadoc (),
"org.scalacheck" %% "scalacheck" % Versions.scalaCheck % "compile" withSources () withJavadoc (),
"org.scalatest" %% "scalatest" % Versions.scalaTest % "compile" withSources () withJavadoc (),
"org.specs2" %% "specs2" % Versions.specs2 % "compile" withSources () withJavadoc (),
"org.parboiled" %% "parboiled" % Versions.parboiled % "compile" withSources () withJavadoc () ),
( mappings in pack ) := { extraPackFun.value } ) ).dependsOn( scalaLibs )
lazy val root: Project = Project( "root", file( "." ) ) aggregate ( scalaLibs, otherLibs )
val extraPackFun: Def.Initialize[Task[Seq[( File, String )]]] = Def.task[Seq[( File, String )]] {
def getFileIfExists( f: File, where: String ): Option[( File, String )] = if ( f.exists() ) Some( ( f, s"${where}/${f.getName()}" ) ) else None
val ivyHome: File = Classpaths.bootIvyHome( appConfiguration.value ) getOrElse sys.error( "Launcher did not provide the Ivy home directory." )
// this is a workaround; how should it be done properly in sbt?
// goal: process the list of library dependencies of the project.
// that is, we should be able to tell the classification of each library dependency module as shown in sbt:
//
// > show libraryDependencies
// [info] List(
// org.scala-lang:scala-library:2.11.2,
// org.scala-lang:scala-library:2.11.2:provided,
// org.scala-lang:scala-compiler:2.11.2:provided,
// org.scala-lang:scala-reflect:2.11.2:provided,
// com.typesafe:config:1.2.1:compile,
// org.scalacheck:scalacheck:1.11.5:compile,
// org.scalatest:scalatest:2.2.1:compile,
// org.specs2:specs2:2.4:compile,
// org.parboiled:parboiled:2.0.0:compile)
// but... libraryDependencies is a SettingKey (see ld below)
// I haven't figured out how to get the sequence of modules from it.
val ld: SettingKey[Seq[ModuleID]] = libraryDependencies
// workaround... I found this API that I managed to call...
// this overrides the classification of all jars -- i.e., it is as if all library dependencies had been classified as "compile".
// for now... it's a reasonable approximation of the goal...
val managed: Classpath = Classpaths.managedJars( Compile, classpathTypes.value, update.value )
val result: Seq[( File, String )] = managed flatMap { af: Attributed[File] =>
af.metadata.entries.toList flatMap { e: AttributeEntry[_] =>
e.value match {
case null => Seq()
case m: ModuleID => Seq() ++
getFileIfExists( new File( ivyHome, s"cache/${m.organization}/${m.name}/srcs/${m.name}-${m.revision}-sources.jar" ), "lib.srcs" ) ++
getFileIfExists( new File( ivyHome, s"cache/${m.organization}/${m.name}/docs/${m.name}-${m.revision}-javadoc.jar" ), "lib.javadoc" )
case _ => Seq()
}
}
}
result
}
}
Thanks to the sbt-pack and sbt-dependency-graph plugins, the above produces what I need:
scalaLibs/target/dependencies-compile.dot
scalaLibs/target/pack/lib
scalaLibs/target/pack/lib.srcs
scalaLibs/target/pack/lib.javadoc
otherLibs/target/dependencies-compile.dot
otherLibs/target/pack/lib
otherLibs/target/pack/lib.srcs
otherLibs/target/pack/lib.javadoc
The dot files can be visualized with GraphViz; it helps explain why a particular library is included…
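As an aside on the comment inside extraPackFun about not being able to read the module list from the libraryDependencies SettingKey: the declared modules can be read with .value inside a task or another setting (a minimal sketch, not part of the build above):

val listModules = taskKey[Unit]("Print the declared library dependencies")

listModules := {
  // libraryDependencies is a SettingKey[Seq[ModuleID]], so .value yields the declared modules
  val modules: Seq[ModuleID] = libraryDependencies.value
  modules.foreach { m =>
    println(s"${m.organization}:${m.name}:${m.revision}:${m.configurations.getOrElse("compile")}")
  }
}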
I would like to improve this approach in terms of the following:
some libraries in scalaLibs are duplicated in otherLibs,
this approach ignores library dependency classification & overrides (not used here)
Suggestions?
Nicolas.

How do I exclude/include specific packages using xsbt-proguard-plugin?

I'm using xsbt-proguard-plugin, which is an SBT plugin for working with Proguard.
I'm trying to come up with a Proguard configuration for a Hive Deserializer I've written, which has the following dependencies:
// project/Dependencies.scala
val hadoop = "org.apache.hadoop" % "hadoop-core" % V.hadoop
val hive = "org.apache.hive" % "hive-common" % V.hive
val serde = "org.apache.hive" % "hive-serde" % V.hive
val httpClient = "org.apache.httpcomponents" % "httpclient" % V.http
val logging = "commons-logging" % "commons-logging" % V.logging
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
Plus an unmanaged dependency:
// lib/UserAgentUtils-1.6.jar
Because most of these are either for local unit testing or are available within a Hadoop/Hive environment anyway, I want my minified jarfile to only include:
The Java classes SnowPlowEventDeserializer.class and SnowPlowEventStruct.class
org.apache.httpcomponents.httpclient
commons-logging
lib/UserAgentUtils-1.6.jar
But I'm really struggling to get the syntax right. Should I start from a whitelist of classes I want to keep, or explicitly filter out the Hadoop/Hive/Serde/Specs2 libraries? I'm aware of this SO question but it doesn't seem to apply here.
If I initially try the whitelist approach:
// Should be equivalent to sbt> package
import ProguardPlugin._
lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct"
  )
)
Then I get a Hadoop processing error, so clearly Proguard is still trying to bundle Hadoop:
proguard: java.lang.IllegalArgumentException: Can't find common super class of [[Lorg/apache/hadoop/fs/FileStatus;] and [[Lorg/apache/hadoop/fs/s3/Block;]
Meanwhile if I try Proguard's filtering syntax to build up the blacklist of libraries I don't want to include:
import ProguardPlugin._
lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-injars !*hadoop*.jar"
  )
)
Then this doesn't seem to work either:
proguard: java.io.IOException: Can't read [/home/dev/snowplow-log-deserializers/!*hadoop*.jar] (No such file or directory)
Any help greatly appreciated!
The whitelist is the proper approach: ProGuard should get a complete context, so it can properly shake out classes, fields, and methods that are not needed.
The error "Can't find common super class" suggests that some library is still missing from the input. ProGuard probably warned about it, but the configuration appears to contain the option -ignorewarnings or -dontwarn (which should be avoided). You should add the library with -injars or -libraryjars.
If ProGuard then includes some classes that you weren't expecting in the output, you can get an explanation with "-whyareyoukeeping class somepackage.SomeUnexpectedClass".
Starting from a working configuration, you can still try to filter out classes or entire jars from the input. Filters are added to items in a class path though, not on their own, e.g. "-injars some.jar(!somepackage/**.class)" -- cfr. the manual. This can be useful if the input contains test classes that drag in other unwanted classes.
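For example, plugging that diagnostic into the plugin configuration from the question might look like this (a sketch; the class given to -whyareyoukeeping is only an illustration):

lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct",
    // ask ProGuard to explain why an unexpected class ends up in the output
    "-whyareyoukeeping class org.apache.hadoop.fs.FileStatus"
  )
)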
In the end, I couldn't get past duplicate class errors using Proguard, let alone figure out how to filter out the relevant jars, so I finally switched to a much cleaner sbt-assembly approach:
1. Added the sbt-assembly plugin to my project as per the README
2. Updated the appropriate project dependencies with a "provided" flag to stop them being added into my fat jar:
val hadoop = "org.apache.hadoop" % "hadoop-core" % V.hadoop % "provided"
val hive = "org.apache.hive" % "hive-common" % V.hive % "provided"
val serde = "org.apache.hive" % "hive-serde" % V.hive % "provided"
val httpClient = "org.apache.httpcomponents" % "httpclient" % V.http
val httpCore = "org.apache.httpcomponents" % "httpcore" % V.http
val logging = "commons-logging" % "commons-logging" % V.logging % "provided"
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
3. Added an sbt-assembly configuration like so:
import sbtassembly.Plugin._
import AssemblyKeys._

lazy val sbtAssemblySettings = assemblySettings ++ Seq(
  assembleArtifact in packageScala := false,
  jarName in assembly <<= (name, version) { (name, version) => name + "-" + version + ".jar" },
  mergeStrategy in assembly <<= (mergeStrategy in assembly) {
    (old) => {
      case "META-INF/NOTICE.txt" => MergeStrategy.discard
      case "META-INF/LICENSE.txt" => MergeStrategy.discard
      case x => old(x)
    }
  }
)
Then typing assembly produced a "fat jar" with just the packages I needed in it, including the unmanaged dependency and excluding Hadoop/Hive etc.