Reading RDF in Apache Spark - Scala

I'm trying to read an RDF/XML file into Apache Spark (Scala 2.11, Apache Spark 1.4.1) using Apache Jena. I wrote this Scala snippet:
val factory = new RdfXmlReaderFactory()
HadoopRdfIORegistry.addReaderFactory(factory)

val conf = new Configuration()
conf.set("rdf.io.input.ignore-bad-tuples", "false")

val data = sc.newAPIHadoopFile(path,
  classOf[RdfXmlInputFormat],
  classOf[LongWritable],   // position
  classOf[TripleWritable], // value
  conf)
data.take(10).foreach(println)
But it throws an error:
INFO readers.AbstractLineBasedNodeTupleReader: Got split with start 0 and length 21765995 for file with total length of 21765995
15/07/23 01:52:42 ERROR readers.AbstractLineBasedNodeTupleReader: Error parsing whole file, aborting further parsing
org.apache.jena.riot.RiotException: Producer failed to ever call start(), declaring producer dead
at org.apache.jena.riot.lang.PipedRDFIterator.hasNext(PipedRDFIterator.java:272)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:242)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
...
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Error parsing whole file at position 0, aborting further parsing
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:285)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
The file is fine, because I can parse it locally. What am I missing?
EDIT
Some information to reproduce the behaviour
Imports:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.jena.hadoop.rdf.io.registry.HadoopRdfIORegistry
import org.apache.jena.hadoop.rdf.io.registry.readers.RdfXmlReaderFactory
import org.apache.jena.hadoop.rdf.types.QuadWritable
import org.apache.spark.SparkContext
scalaVersion := "2.11.7"
dependencies:
"org.apache.hadoop" % "hadoop-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-streaming" % "2.7.1",
"org.apache.spark" % "spark-core_2.11" % "1.4.1",
"com.hp.hpl.jena" % "jena" % "2.6.4",
"org.apache.jena" % "jena-elephas-io" % "0.9.0",
"org.apache.jena" % "jena-elephas-mapreduce" % "0.9.0"
I'm using a sample RDF file from here. It's freely available information about John Peel sessions (more info about the dump).

So it appears your problem was down to manually managing your dependencies.
In my environment I was simply passing the following to my Spark shell:
--packages org.apache.jena:jena-elephas-io:0.9.0
This does all the dependency resolution for you.
If you are building an SBT project then it should be sufficient to do the following in your build.sbt:
libraryDependencies += "org.apache.jena" % "jena-elephas-io" % "0.9.0"

Thanks all for the discussion in the comments. The problem was really tricky and not clear from the stack trace: the code needs one extra dependency to work, jena-core, and this dependency must be packaged first.
"org.apache.jena" % "jena-core" % "2.13.0"
"com.hp.hpl.jena" % "jena" % "2.6.4"
I use this assembly strategy:
lazy val strategy = assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) => {
    case PathList("META-INF", xs @ _*) =>
      (xs map {_.toLowerCase}) match {
        case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
        case _ => MergeStrategy.discard
      }
    case x => MergeStrategy.first
  }
}

Related

java.lang.VerifyError: Operand stack overflow for google-ads API and SBT

I am trying to migrate from the Google AdWords API to the google-ads-v10 API in Spark 3.1.1 on EMR.
I am facing some dependency issues due to conflicts with existing jars.
Initially, we were facing a dependency issue related to the Protobuf jar:
Exception in thread "grpc-default-executor-0" java.lang.IllegalAccessError: tried to access field com.google.protobuf.AbstractMessage.memoizedSize from class com.google.ads.googleads.v10.services.SearchGoogleAdsRequest
at com.google.ads.googleads.v10.services.SearchGoogleAdsRequest.getSerializedSize(SearchGoogleAdsRequest.java:394)
at io.grpc.protobuf.lite.ProtoInputStream.available(ProtoInputStream.java:108)
In order to resolve this, I tried to shade the Protobuf jar and build an uber-jar instead. After shading, running my project locally in IntelliJ works fine, but when trying to run the executable jar I created, I get the following error:
Exception in thread "main" io.grpc.ManagedChannelProvider$ProviderNotFoundException: No functional channel service provider found. Try adding a dependency on the grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact
I tried adding all those libraries via --spark.jars.packages, but it didn't help.
java.lang.VerifyError: Operand stack overflow
Exception Details:
Location:
io/grpc/internal/TransportTracer.getStats()Lio/grpc/InternalChannelz$TransportStats; ...
...
...
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.<init>(NettyChannelBuilder.java:96)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.forTarget(NettyChannelBuilder.java:169)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.forAddress(NettyChannelBuilder.java:152)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(NettyChannelProvider.java:38)
at io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(NettyChannelProvider.java:24)
at io.grpc.ManagedChannelBuilder.forAddress(ManagedChannelBuilder.java:39)
at com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:348)
Has anyone ever encountered such an issue?
Build.sbt
lazy val dependencies = new {
  val sparkRedshift = "io.github.spark-redshift-community" %% "spark-redshift" % "5.0.3" % "provided" excludeAll (ExclusionRule(organization = "com.amazonaws"))
  val jsonSimple = "com.googlecode.json-simple" % "json-simple" % "1.1" % "provided"
  val googleAdsLib = "com.google.api-ads" % "google-ads" % "17.0.1"
  val jedis = "redis.clients" % "jedis" % "3.0.1" % "provided"
  val sparkAvro = "org.apache.spark" %% "spark-avro" % sparkVersion % "provided"
  val queryBuilder = "com.itfsw" % "QueryBuilder" % "1.0.4" % "provided" excludeAll (ExclusionRule(organization = "com.fasterxml.jackson.core"))
  val protobufForGoogleAds = "com.google.protobuf" % "protobuf-java" % "3.18.1"
  val guavaForGoogleAds = "com.google.guava" % "guava" % "31.1-jre"
}

libraryDependencies ++= Seq(
  dependencies.sparkRedshift, dependencies.jsonSimple, dependencies.googleAdsLib, dependencies.guavaForGoogleAds, dependencies.protobufForGoogleAds,
  dependencies.jedis, dependencies.sparkAvro,
  dependencies.queryBuilder
)

dependencyOverrides ++= Set(
  dependencies.guavaForGoogleAds
)

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "repackaged.protobuf.@1").inAll
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case PathList("module-info.class", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
I had a similar issue and I changed the assembly merge strategy to this:
assemblyMergeStrategy in assembly := {
  case x if x.contains("io.netty.versions.properties") => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
Solved this by using the google-ads-shadowjar as an external jar rather than having a dependency on the google-ads library. This solves the problem of having to deal with the dependencies manually, but makes your jar size bigger.
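If anyone follows the same route, one way to wire such an external jar into an sbt build is via unmanagedJars; a minimal sketch, assuming the shaded jar has been downloaded into lib/ (the file name below is illustrative):
// Treat the pre-shaded Google Ads jar as an unmanaged dependency instead of a library dependency.
unmanagedJars in Compile += baseDirectory.value / "lib" / "google-ads-shadowjar.jar"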

Spark streaming 2.4.0 getting org.apache.spark.sql.AnalysisException: Failed to find data source: kafka

Getting the following error while trying to read data from Kafka. I am using docker-compose for running Kafka and Spark.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Here is my code for reading:
import com.typesafe.scalalogging.LazyLogging
import org.apache.spark.sql.SparkSession

object Livedata extends App with LazyLogging {
  logger.info("starting livedata...")

  val spark = SparkSession.builder().appName("livedata").master("local[*]").getOrCreate()

  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "topic")
    .option("startingOffsets", "latest")
    .load()

  df.printSchema()

  val hadoopConfig = spark.sparkContext.hadoopConfiguration
  hadoopConfig.set("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  hadoopConfig.set("fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
}
After reading a few answers, I added all the packages to the sbt build.
Here is the build.sbt file:
lazy val root = (project in file(".")).
  settings(
    inThisBuild(List(
      organization := "com.live.data",
      version := "0.1.0",
      scalaVersion := "2.12.2",
      assemblyJarName in assembly := "livedata.jar"
    )),
    name := "livedata",
    libraryDependencies ++= List(
      "org.scalatest" %% "scalatest" % "3.0.5",
      "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
      "org.apache.spark" %% "spark-sql" % "2.4.0",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0" % "provided",
      "org.apache.kafka" % "kafka-clients" % "2.5.0",
      "org.apache.kafka" % "kafka-streams" % "2.5.0",
      "org.apache.kafka" %% "kafka-streams-scala" % "2.5.0"
    )
  )

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Not sure what is the main issue here.
Update:
Finally I got the solution from here: Error when connecting spark structured streaming + kafka
The main issue was that the org.apache.spark.sql.AnalysisException: Failed to find data source: kafka exception occurs because the spark-sql-kafka library is not available on the classpath and Spark cannot find org.apache.spark.sql.sources.DataSourceRegister inside the META-INF/services folder.
The following code block needs to be added in build.sbt. It will include the org.apache.spark.sql.sources.DataSourceRegister file in the final jar.
// META-INF discarding
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case "application.conf" => MergeStrategy.concat
  case _ => MergeStrategy.first
}
spark-sql-kafka-0-10 is not provided, so remove that part of the dependency. (spark-sql is provided, though, so you could add it to that one.)
You also shouldn't pull in Kafka Streams (since that's not used by Spark), and kafka-clients is transitively pulled in by spark-sql-kafka, so you don't need that either.
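Putting that advice together, the dependency list would look roughly like this (my reading of the answer, not a tested build):
libraryDependencies ++= List(
  "org.scalatest" %% "scalatest" % "3.0.5",
  "com.typesafe.scala-logging" %% "scala-logging" % "3.9.0",
  // Spark SQL ships with the cluster, so it can be marked provided...
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  // ...but the Kafka source does not, so it must go into the assembled jar
  // (it also pulls in kafka-clients transitively).
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.0"
)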

Spark-Kafka invalid dependency detected

I have some basic Spark - Kafka code and I'm trying to run the following:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import java.util.regex.Pattern
import java.util.regex.Matcher
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import Utilities._

object WordCount {
  def main(args: Array[String]): Unit = {

    val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))

    setupLogging()

    // Construct a regular expression (regex) to extract fields from raw Apache log lines
    val pattern = apacheLogPattern()

    // hostname:port for Kafka brokers, not Zookeeper
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    // List of topics you want to listen for from Kafka
    val topics = List("testLogs").toSet

    // Create our Kafka stream, which will contain (topic, message) pairs. We tack a
    // map(_._2) at the end in order to only get the messages, which contain individual
    // lines of data.
    val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics).map(_._2)

    // Extract the request field from each log line
    val requests = lines.map(x => {val matcher: Matcher = pattern.matcher(x); if (matcher.matches()) matcher.group(5)})

    // Extract the URL from the request
    val urls = requests.map(x => {val arr = x.toString().split(" "); if (arr.size == 3) arr(1) else "[error]"})

    // Reduce by URL over a 5-minute window sliding every second
    val urlCounts = urls.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))

    // Sort and print the results
    val sortedResults = urlCounts.transform(rdd => rdd.sortBy(x => x._2, false))
    sortedResults.print()

    // Kick it off
    ssc.checkpoint("/home/")
    ssc.start()
    ssc.awaitTermination()
  }
}
I am using the IntelliJ IDE and created the Scala project using sbt. The details of the build.sbt file are as follows:
name := "Sample"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.4.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.4.1",
  "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
However, when I try to build the code, it produces the following error:
Error:scalac: missing or invalid dependency detected while loading class file 'StreamingContext.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'StreamingContext.class' was compiled against an incompatible version of org.apache.spark.
Error:scalac: missing or invalid dependency detected while loading class file 'DStream.class'.
Could not access type Logging in package org.apache.spark,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with -Ylog-classpath to see the problematic classpath.)
A full rebuild may help if 'DStream.class' was compiled against an incompatible version of org.apache.spark.
When using different Spark libraries together, the versions of all the libs should always match.
Also, the version of Kafka you use matters, so it should be, for example: spark-streaming-kafka-0-10_2.11
...
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % sparkVersion,
"org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
)
This is a useful site if you need to check the exact dependencies you should use:
https://search.maven.org/
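Note that moving from the old spark-streaming-kafka artifact to the 0-10 integration also changes the createDirectStream API: it takes a location strategy and a consumer strategy instead of decoder type parameters. A rough sketch of the stream creation with the newer API, for orientation only:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // brokers, not Zookeeper
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "wordcount",
  "auto.offset.reset" -> "latest"
)

// Records are ConsumerRecord[String, String] rather than (topic, message) pairs,
// so map over .value to get the log lines.
val lines = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Set("testLogs"), kafkaParams)
).map(_.value)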

java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError exceptions when running kinesis spark job

I'm having difficulty trying to figure out my Spark (2.1.0) job's Scala dependencies.
My build.sbt file:
name := "test"
version := "0.0.1"
scalaVersion := "2.11.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.1.0"
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.5.1"
assemblyJarName in assembly := "test.jar"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", xs # _*) => MergeStrategy.first
case PathList("org", "apache", xs # _*) => MergeStrategy.first
case PathList("org", "jboss", xs # _*) => MergeStrategy.first
case "about.html" => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
}
exportJars:= true
mainClass in assembly := Some("test.Job")
```
When I run my job it throws java.lang.NoSuchMethodError exceptions.
17/03/08 05:19:15 INFO storage.BlockManager: Removing RDD 87
17/03/08 05:19:15 INFO storage.BlockManager: Removing RDD 86
17/03/08 05:19:15 INFO storage.BlockManager: Removing RDD 85
17/03/08 05:19:15 ERROR worker.Worker: Worker.run caught exception, sleeping for 1000 milli seconds!
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:156)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.consumeShard(ShardConsumer.java:125)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:335)
at org.apache.spark.streaming.kinesis.KinesisReceiver$$anon$1.run(KinesisReceiver.scala:174)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer.checkAndSubmitNextTask(ShardConsumer.java:136)
... 3 more
Caused by: java.lang.NoSuchMethodError: com.amazonaws.services.kinesis.model.GetRecordsResult.getMillisBehindLatest()Ljava/lang/Long;
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResultAndRecordMillisBehindLatest(ProcessTask.java:291)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.getRecordsResult(ProcessTask.java:249)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Are you running this on AWS EMR?
If so, the following links may help:
Spark streaming 1.6.1 is not working with Kinesis asl 1.6.1 and asl 2.0.0-preview
http://www.waitingforcode.com/apache-spark/shading-solution-dependency-hell-spark/read
Essentially, AWS EMR supports protobuf 2.5, while spark-core, spark-streaming, and spark-streaming-kinesis-asl all depend on protobuf 2.6.1. The way we got around this issue when we encountered it was through a shaded jar, and the two links above give good examples of how to set that up.
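For reference, with sbt-assembly the relocation itself can be expressed as a shade rule; a minimal sketch, assuming the fat jar is built with sbt-assembly (the target package name is arbitrary):
// Relocate protobuf classes inside the assembled jar so they cannot clash
// with the protobuf 2.5 that EMR already puts on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)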

How do I exclude/include specific packages using xsbt-proguard-plugin?

I'm using xsbt-proguard-plugin, which is an SBT plugin for working with Proguard.
I'm trying to come up with a Proguard configuration for a Hive Deserializer I've written, which has the following dependencies:
// project/Dependencies.scala
val hadoop = "org.apache.hadoop" % "hadoop-core" % V.hadoop
val hive = "org.apache.hive" % "hive-common" % V.hive
val serde = "org.apache.hive" % "hive-serde" % V.hive
val httpClient = "org.apache.httpcomponents" % "httpclient" % V.http
val logging = "commons-logging" % "commons-logging" % V.logging
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
Plus an unmanaged dependency:
// lib/UserAgentUtils-1.6.jar
Because most of these are either for local unit testing or are available within a Hadoop/Hive environment anyway, I want my minified jarfile to only include:
The Java classes SnowPlowEventDeserializer.class and SnowPlowEventStruct.class
org.apache.httpcomponents.httpclient
commons-logging
lib/UserAgentUtils-1.6.jar
But I'm really struggling to get the syntax right. Should I start from a whitelist of classes I want to keep, or explicitly filter out the Hadoop/Hive/Serde/Specs2 libraries? I'm aware of this SO question but it doesn't seem to apply here.
If I initially try the whitelist approach:
// Should be equivalent to sbt> package
import ProguardPlugin._

lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct"
  )
)
Then I get a Hadoop processing error, so clearly Proguard is still trying to bundle Hadoop:
proguard: java.lang.IllegalArgumentException: Can't find common super class of [[Lorg/apache/hadoop/fs/FileStatus;] and [[Lorg/apache/hadoop/fs/s3/Block;]
Meanwhile if I try Proguard's filtering syntax to build up the blacklist of libraries I don't want to include:
import ProguardPlugin._

lazy val proguard = proguardSettings ++ Seq(
  proguardLibraryJars := Nil,
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    "-injars !*hadoop*.jar"
  )
)
Then this doesn't seem to work either:
proguard: java.io.IOException: Can't read [/home/dev/snowplow-log-deserializers/!*hadoop*.jar] (No such file or directory)
Any help greatly appreciated!
The whitelist is the proper approach: ProGuard should get a complete context, so it can properly shake out classes, fields, and methods that are not needed.
The error "Can't find common super class" suggests that some library is still missing from the input. ProGuard probably warned about it, but the configuration appears to contain the option -ignorewarnings or -dontwarn (which should be avoided). You should add the library with -injars or -libraryjars.
If ProGuard then includes some classes that you weren't expecting in the output, you can get an explanation with "-whyareyoukeeping class somepackage.SomeUnexpectedClass".
Starting from a working configuration, you can still try to filter out classes or entire jars from the input. Filters are added to items in a class path though, not on their own, e.g. "-injars some.jar(!somepackage/**.class)" -- cfr. the manual. This can be useful if the input contains test classes that drag in other unwanted classes.
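To make that concrete in the xsbt-proguard-plugin setup from the question, the options might be combined roughly like this (a sketch of the advice above, not a verified configuration; the library jar path and the "unexpected class" name are assumptions):
lazy val proguard = proguardSettings ++ Seq(
  // Supply runtime-provided libraries as library jars instead of shrinking them in;
  // the path below is illustrative.
  proguardLibraryJars := Seq(file("/path/to/hadoop-core.jar")),
  proguardOptions := Seq(
    "-keepattributes *Annotation*,EnclosingMethod",
    "-dontskipnonpubliclibraryclassmembers",
    "-dontoptimize",
    "-dontshrink",
    // Whitelist: keep the two entry points; ProGuard traces what they need.
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventDeserializer",
    "-keep class com.snowplowanalytics.snowplow.hadoop.hive.SnowPlowEventStruct",
    // Ask ProGuard to explain any class that unexpectedly survives.
    "-whyareyoukeeping class somepackage.SomeUnexpectedClass"
  )
)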
In the end, I couldn't get past duplicate class errors using Proguard, let alone figure out how to filter out the relevant jars, so I finally switched to a much cleaner sbt-assembly approach:
1. Added the sbt-assembly plugin to my project as per the README
2. Updated the appropriate project dependencies with a "provided" flag to stop them being added into my fat jar:
val hadoop = "org.apache.hadoop" % "hadoop-core" % V.hadoop % "provided"
val hive = "org.apache.hive" % "hive-common" % V.hive % "provided"
val serde = "org.apache.hive" % "hive-serde" % V.hive % "provided"
val httpClient = "org.apache.httpcomponents" % "httpclient" % V.http
val httpCore = "org.apache.httpcomponents" % "httpcore" % V.http
val logging = "commons-logging" % "commons-logging" % V.logging % "provided"
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
3. Added an sbt-assembly configuration like so:
import sbtassembly.Plugin._
import AssemblyKeys._

lazy val sbtAssemblySettings = assemblySettings ++ Seq(
  assembleArtifact in packageScala := false,
  jarName in assembly <<= (name, version) { (name, version) => name + "-" + version + ".jar" },
  mergeStrategy in assembly <<= (mergeStrategy in assembly) {
    (old) => {
      case "META-INF/NOTICE.txt" => MergeStrategy.discard
      case "META-INF/LICENSE.txt" => MergeStrategy.discard
      case x => old(x)
    }
  }
)
Then typing assembly produced a "fat jar" with just the packages I needed in it, including the unmanaged dependency and excluding Hadoop/Hive etc.