I'm writing code that will run on a Hadoop cluster, but first I'm testing it locally with local files. The code works fine in Eclipse, but when I build a fat JAR with SBT (including the Spark libraries etc.) the program fails at textFile(path). My code is:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.joda.time.format.DateTimeFormat
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer
object TestCRA2 {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("Test")
    .set("spark.driver.memory", "4g")
    .set("spark.executor.memory", "4g")
  val context = new SparkContext(conf) //.master("local")

  val rootLogger = Logger.getRootLogger()
  rootLogger.setLevel(Level.ERROR)

  // Reads the file and splits each line on ';'
  def TimeParse1(path: String): RDD[Array[String]] = {
    context.textFile(path).map(_.split(";"))
  }

  def main(args: Array[String]) {
    val data = TimeParse1("file:///home/quentin/Downloads/CRA")
  }
}
And here is my error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:341)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1034)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1029)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1029)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
at main.scala.TestCRA2$.TimeParse1(TestCRA.scala:37)
at main.scala.TestCRA2$.main(TestCRA.scala:84)
at main.scala.TestCRA2.main(TestCRA.scala)
I can't put my files into the JAR because they live on the Hadoop cluster, and the same code works fine in Eclipse.
Here is my build.sbt:
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
If I don't set assemblyMergeStrategy like this, I get a bunch of merge errors.
Actually, I needed to change my build.sbt like this:
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      case "services" :: xs => MergeStrategy.first
      case _ => MergeStrategy.discard
    }
  case x => MergeStrategy.first
}
Thank you @lyomi
Your sbt-assembly setup is probably discarding some of the required files. Specifically, Hadoop's FileSystem class relies on a service discovery mechanism that looks for all META-INF/services/org.apache.hadoop.fs.FileSystem files on the classpath.
In Eclipse it was fine because each JAR had its own copy of that file, but in the uber-jar one copy may have overridden the others, causing the file: scheme to not be recognized.
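For reference, each Hadoop artifact ships its own copy of that service file. A hedged illustration (exact entries vary by Hadoop version): hadoop-common's copy typically lists
org.apache.hadoop.fs.LocalFileSystem
org.apache.hadoop.fs.viewfs.ViewFileSystem
org.apache.hadoop.fs.ftp.FTPFileSystem
org.apache.hadoop.fs.HarFileSystem
while hadoop-hdfs contributes org.apache.hadoop.hdfs.DistributedFileSystem. Whichever single copy survives a first/discard merge loses the schemes registered by the discarded copies.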
In your SBT settings, add the following, to concatenate the service discovery files instead of discarding some of them.
val defaultMergeStrategy: String => MergeStrategy = {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      // ... possibly other settings ...
      case "services" :: xs =>
        MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.deduplicate
    }
  case _ => MergeStrategy.deduplicate
}
See README.md of sbt-assembly for more info.
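If rebuilding the jar is inconvenient, another common workaround (a sketch, not part of the original answer; spark.hadoop.* settings are forwarded to the Hadoop configuration, and the hdfs line assumes hadoop-hdfs is on the classpath) is to register the implementations explicitly so the lookup no longer depends on the merged service file:
import org.apache.hadoop.fs.LocalFileSystem
import org.apache.hadoop.hdfs.DistributedFileSystem
import org.apache.spark.SparkConf

// Map the file: and hdfs: schemes to their implementations directly,
// bypassing the META-INF/services discovery that the uber-jar broke.
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("Test")
  .set("spark.hadoop.fs.file.impl", classOf[LocalFileSystem].getName)
  .set("spark.hadoop.fs.hdfs.impl", classOf[DistributedFileSystem].getName)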
Related
I'm trying to assemble my Scala project and can't get rid of some deduplicate errors.
Here is the problematic output:
> [error] 2 errors were encountered during merge [error] stack trace is
> suppressed; run 'last
> ProjectRef(uri("https://hyehezkel#fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"),
> "global") / assembly' for the full output [error]
> (ProjectRef(uri("https://hyehezkel#fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"),
> "global") / assembly) deduplicate: different file contents found in
> the following: [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-buffer\4.1.42.Final\netty-buffer-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-codec\4.1.42.Final\netty-codec-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-common\4.1.42.Final\netty-common-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-handler\4.1.42.Final\netty-handler-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-resolver\4.1.42.Final\netty-resolver-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport-native-epoll\4.1.42.Final\netty-transport-native-epoll-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport-native-unix-common\4.1.42.Final\netty-transport-native-unix-common-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\io\netty\netty-transport\4.1.42.Final\netty-transport-4.1.42.Final.jar:META-INF/io.netty.versions.properties
> [error] deduplicate: different file contents found in the following:
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-annotations\2.10.1\jackson-annotations-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-core\2.10.1\jackson-core-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\core\jackson-databind\2.10.1\jackson-databind-2.10.1.jar:module-info.class
> [error]
> C:\Users\hyehezkel\AppData\Local\Coursier\cache\v1\https\repo1.maven.org\maven2\com\fasterxml\jackson\dataformat\jackson-dataformat-csv\2.10.0\jackson-dataformat-csv-2.10.0.jar:module-info.class
I have read the following article but didn't manage to solve it:
https://index.scala-lang.org/sbt/sbt-assembly/sbt-assembly/0.14.5?target=_2.12_1.0
This is my plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
And this is my build.sbt
import sbt.Keys.{dependencyOverrides, libraryDependencies, mappings}
import sbtassembly.AssemblyPlugin.assemblySettings._
name := "classification_endpoint_discovery"
version := "0.1"
organization in ThisBuild := "com.forescout"
scalaVersion in ThisBuild := "2.13.1"
updateOptions := updateOptions.value.withCachedResolution(true)
//classpathTypes += "maven-plugin"
exportJars := true
logLevel := Level.Info
logLevel in assembly := Level.Debug
lazy val commonProject = RootProject(uri("https://hyehezkel@fs-bitbucket.fsd.forescout.com/scm/~hyehezkel/classification_common.git#test_branch"))
lazy val global = project
.in(file("."))
.settings(settings)
.enablePlugins(AssemblyPlugin)
// .disablePlugins(AssemblyPlugin)
.aggregate(
commonProject,
`endpoint-discovery`
)
lazy val `endpoint-discovery` = project
.settings(
name := "endpoint-discovery",
settings,
assemblySettings,
assemblyJarName in assembly := "endpoint-discovery.jar",
assemblyJarName in assemblyPackageDependency := "endpoint-discovery-dep.jar",
libraryDependencies += dependencies.postgresql,
libraryDependencies += "com.lihaoyi" %% "ujson" % "0.7.5",
libraryDependencies += "com.lihaoyi" %% "requests" % "0.2.0",
libraryDependencies += dependencies.`deepLearning4j-core`,
libraryDependencies += dependencies.`deeplearning4j-nn`,
libraryDependencies += dependencies.`nd4j-native-platform`,
excludeDependencies += "commons-logging" % "commons-logging"
// dependencyOverrides += "org.slf4j" % "slf4j-api" % "1.7.5",
// dependencyOverrides += "org.slf4j" % "slf4j-simple" % "1.7.5",
)
.dependsOn(commonProject)
.enablePlugins(AssemblyPlugin)
lazy val dependencies =
new {
val deepLearning4jV = "1.0.0-beta4"
val postgresqlV = "9.1-901.jdbc4"
val `deepLearning4j-core` = "org.deeplearning4j" % "deeplearning4j-core" % deepLearning4jV
val `deeplearning4j-nn` = "org.deeplearning4j" % "deeplearning4j-nn" % deepLearning4jV
val `nd4j-native-platform` = "org.nd4j" % "nd4j-native-platform" % deepLearning4jV
val postgresql = "postgresql" % "postgresql" % postgresqlV
}
// SETTINGS
lazy val settings =
commonSettings
lazy val compilerOptions = Seq(
"-unchecked",
"-feature",
"-language:existentials",
"-language:higherKinds",
"-language:implicitConversions",
"-language:postfixOps",
"-deprecation",
"-encoding",
"utf8"
)
lazy val commonSettings = Seq(
scalacOptions ++= compilerOptions
)
lazy val assemblySettings = Seq(
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false, includeDependency = false),
assemblyMergeStrategy in assembly := {
case PathList("META-INF", "io.netty.versions.properties", xs # _*) => MergeStrategy.singleOrError
case "module-info.class" => MergeStrategy.singleOrError
case PathList("org", "xmlpull", xs # _*) => MergeStrategy.discard
case PathList("org", "nd4j", xs # _*) => MergeStrategy.first
case PathList("org", "bytedeco", xs # _*) => MergeStrategy.first
case PathList("org.bytedeco", xs # _*) => MergeStrategy.first
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case "XmlPullParser.class" => MergeStrategy.discard
case "Nd4jBase64.class" => MergeStrategy.discard
case "XmlPullParserException.class" => MergeStrategy.discard
// case n if n.startsWith("rootdoc.txt") => MergeStrategy.discard
// case n if n.startsWith("readme.html") => MergeStrategy.discard
// case n if n.startsWith("readme.txt") => MergeStrategy.discard
case n if n.startsWith("library.properties") => MergeStrategy.discard
case n if n.startsWith("license.html") => MergeStrategy.discard
case n if n.startsWith("about.html") => MergeStrategy.discard
// case _ => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
)
I have tried many merge strategies but nothing works.
What am I missing here?
Any advice?
For META-INF/io.netty.versions.properties you have:
case PathList("META-INF", "io.netty.versions.properties", xs # _*) => MergeStrategy.singleOrError
which says that it will error out if there is more than one file with this name.
Try MergeStrategy.first for these files instead.
For module-info.class:
These files are only relevant for the Java 9 module system. Usually, you can just discard them:
case "module-info.class" => MergeStrategy.discard
I'm trying to assemble a Spark application using sbt 1.0.4 with sbt-assembly 0.14.6.
The Spark application works fine when launched from IntelliJ IDEA or with spark-submit, but if I run the assembled uber-jar from the command line (cmd in Windows 10):
java -Xmx1024m -jar my-app.jar
I get the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
The Spark application looks as follows.
package spark.main
import java.util.Properties
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]) {
val connectionProperties = new Properties()
connectionProperties.put("user","postgres")
connectionProperties.put("password","postgres")
connectionProperties.put("driver", "org.postgresql.Driver")
val testTable = "test_tbl"
val spark = SparkSession.builder()
.appName("Postgres Test")
.master("local[*]")
.config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
.config("spark.sql.warehouse.dir", System.getProperty("java.io.tmpdir") + "swd")
.getOrCreate()
val dfPg = spark.sqlContext.read.
jdbc("jdbc:postgresql://localhost/testdb",testTable,connectionProperties)
dfPg.show()
}
}
The following is build.sbt.
name := "apache-spark-scala"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
mainClass in Compile := Some("spark.main.Main")
libraryDependencies ++= {
val sparkVer = "2.1.1"
val postgreVer = "42.0.0"
val cassandraConVer = "2.0.2"
val configVer = "1.3.1"
val logbackVer = "1.7.25"
val loggingVer = "3.7.2"
val commonsCodecVer = "1.10"
Seq(
"org.apache.spark" %% "spark-sql" % sparkVer,
"org.apache.spark" %% "spark-core" % sparkVer,
"com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
"org.postgresql" % "postgresql" % postgreVer,
"com.typesafe" % "config" % configVer,
"commons-codec" % "commons-codec" % commonsCodecVer,
"com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
"org.slf4j" % "slf4j-api" % logbackVer
)
}
dependencyOverrides ++= Seq(
"io.netty" % "netty-all" % "4.0.42.Final",
"commons-net" % "commons-net" % "2.2",
"com.google.guava" % "guava" % "14.0.1"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Does anyone have any idea why?
[UPDATE]
Configuration taken from the official GitHub repository did the trick:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) =>
xs map {_.toLowerCase} match {
case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
MergeStrategy.discard
case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
MergeStrategy.discard
case "services" :: _ => MergeStrategy.filterDistinctLines
case _ => MergeStrategy.first
}
case _ => MergeStrategy.first
}
The question is almost a duplicate of Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?, with the difference that the other OP used Apache Maven to create the uber-jar while here it's about sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file registers a DataSourceRegister.
For the jdbc alias to work, Spark SQL uses a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the following entry (there are others):
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
That's what ties the jdbc alias to the data source.
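Under the hood this is the standard java.util.ServiceLoader mechanism. Roughly speaking (a simplified sketch, not Spark's actual lookup code), the alias resolution amounts to:
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Every entry in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// on the classpath becomes a provider here; the alias is whatever shortName() returns.
val providers = ServiceLoader.load(classOf[DataSourceRegister]).asScala
val jdbcProvider = providers.find(_.shortName().equalsIgnoreCase("jdbc"))
// If the service file was discarded during assembly, jdbcProvider is None and
// Spark reports "Failed to find data source: jdbc".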
And you've excluded it from the uber-jar with the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
Note the PathList("META-INF", xs @ _*) case, which you simply MergeStrategy.discard. That's the root cause.
Just to check that the "infrastructure" is available and that you can use the jdbc data source by its fully-qualified name (not the alias), try this:
spark.read.
format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
load("jdbc:postgresql://localhost/testdb")
You will see other problems due to missing options like url, but...we're digressing.
A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files (that would create an uber-jar with all data sources, incl. the jdbc data source).
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
I have been trying to build a fat JAR for some time now. I have assembly.sbt in the project folder and it looks like this:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
and my build.sbt looks like this:
name := "cool"
version := "0.1"
scalaVersion := "2.11.8"
resolvers += "Hortonworks Repository" at
"http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at
"http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" %
"provided",
"org.apache.spark" % "spark-streaming-kafka-assembly_2.10" %
"1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
I get the error below:
assemblyMergeStrategy in assembly := {
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:14: error: not found:
value assembly
assemblyMergeStrategy in assembly := {
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: not found:
value PathList
case PathList("META-INF", xs # _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: star patterns
must correspond with varargs parameters
case PathList("META-INF", xs # _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:15: error: not found:
value MergeStrategy
case PathList("META-INF", xs # _*) => MergeStrategy.discard
^
C:\Users\sreer\Desktop\workspace\cool\build.sbt:16: error: not found:
value MergeStrategy
case x => MergeStrategy.first
^
[error] Type error in expression
I get this type error, and it seems like sbt won't recognize keys like "assemblyMergeStrategy". I use sbt 1.0.4 and the latest version of the Scala IDE for Eclipse.
I have tried changing the sbt version with no result, went through the whole document at https://github.com/sbt/sbt-assembly, and made sure there were no typos; the suggestions mentioned in other threads weren't of much help (mostly the questions are about older versions of sbt). If someone could guide me, that would be very helpful. Thanks.
I'm fairly new to Scala and am trying to build a Spark job. I've built a job that contains the DataStax connector and assembled it into a fat JAR. When I try to execute it, it fails with a java.lang.NoSuchMethodError. I've cracked open the JAR and can see that the DataStax library is included. Am I missing something obvious? Is there a good tutorial to look at regarding this process?
Thanks
console
$ spark-submit --class org.bobbrez.CasCountJob ./target/scala-2.11/bobbrez-spark-assembly-0.0.1.jar ks tn
...
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
...
build.sbt
name := "soofa-spark"
version := "0.0.1"
scalaVersion := "2.11.7"
// additional libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"
libraryDependencies += "com.typesafe" % "config" % "1.3.0"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
    case m if m.startsWith("META-INF") => MergeStrategy.discard
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
    case "about.html" => MergeStrategy.rename
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  }
}
CasCountJob.scala
package org.bobbrez
// Spark
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
object CasCountJob {
private val AppName = "CasCountJob"
def main(args: Array[String]) {
println("Hello world from " + AppName)
val keyspace = args(0)
val tablename = args(1)
println("Keyspace: " + keyspace)
println("Table: " + tablename)
// Configure and create a Scala Spark Context.
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "HOSTNAME")
.set("spark.cassandra.auth.username", "USERNAME")
.set("spark.cassandra.auth.password", "PASSWORD")
.setAppName(AppName)
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable(keyspace, tablename)
println("Table Count: " + rdd.count)
System.exit(0)
}
}
Cassandra connector for Spark 1.6 is still in development and not released yet.
For integrating Cassandra with Spark you need at least the following dependencies:
Spark-Cassandra connector - Download appropriate version from here
Cassandra Core driver - Download appropriate version from here
Spark-Cassandra Java library - Download appropriate version from here
Other Dependent Jars - jodatime , jodatime-convert, jsr166
The mapping of appropriate versions of the Cassandra libraries and Spark is mentioned here.
Apparently the Cassandra connector for Spark 1.5 is also in development and you may see some compatibility issues. The most stable release of the Cassandra connector is for Spark 1.4, which requires the following JAR files:
Spark-Cassandra connector
Cassandra Core driver
Spark-Cassandra Java library
Other Dependent Jars - jodatime , jodatime-convert, jsr166
Needless to say, all of these JAR files should be configured and available to the executors.
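For example (a sketch with hypothetical jar names and versions; substitute the files you actually downloaded), the extra libraries can be shipped to the executors through spark-submit's --jars option:
spark-submit --class org.bobbrez.CasCountJob \
  --jars spark-cassandra-connector_2.11-1.5.0-M3.jar,cassandra-driver-core.jar,joda-time.jar,joda-convert.jar,jsr166e.jar \
  ./target/scala-2.11/bobbrez-spark-assembly-0.0.1.jar ks tn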
I have written the following Scala code; my platform is Cloudera CDH 5.2.1 on CentOS 6.5.
Tutorial.scala
import org.apache.spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._
object Tutorial {
def main(args: Array[String]) {
val checkpointDir = TutorialHelper.getCheckPointDirectory()
val consumerKey = "..."
val consumerSecret = "..."
val accessToken = "..."
val accessTokenSecret = "..."
try {
TutorialHelper.configureTwitterCredentials(consumerKey, consumerSecret, accessToken, accessTokenSecret)
val ssc = new StreamingContext(new SparkContext(), Seconds(1))
val tweets = TwitterUtils.createStream(ssc, None)
val tweetText = tweets.map(tweet => tweet.getText())
tweetText.print()
ssc.checkpoint(checkpointDir)
ssc.start()
ssc.awaitTermination()
} finally {
//ssc.stop()
}
}
}
My build.sbt file looks like
import AssemblyKeys._ // put this at the top of the file
name := "Tutorial"
scalaVersion := "2.10.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
"org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"
)
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
resourceDirectory in Compile := baseDirectory.value / "resources"
assemblySettings
mergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "log4j.properties" => MergeStrategy.discard
case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
I also created a file called project/plugins.sbt which has the following content:
addSbtPlugin("net.virtual-void" % "sbt-cross-building" % "0.8.1")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")
and project/build.scala
import sbt._
object Plugins extends Build {
lazy val root = Project("root", file(".")) dependsOn(
uri("git://github.com/sbt/sbt-assembly.git#0.9.1")
)
}
After this I can build my "uber" assembly by using
sbt assembly
Now I run my code using
sudo -u hdfs spark-submit --class Tutorial --master local /tmp/Tutorial-assembly-0.1-SNAPSHOT.jar
I get the error
Configuring Twitter OAuth
Property twitter4j.oauth.accessToken set as [...]
Property twitter4j.oauth.consumerSecret set as [...]
Property twitter4j.oauth.accessTokenSecret set as [...]
Property twitter4j.oauth.consumerKey set as [...]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/12/21 16:04:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-------------------------------------------
Time: 1419199472000 ms
-------------------------------------------
-------------------------------------------
Time: 1419199473000 ms
-------------------------------------------
14/12/21 16:04:33 ERROR ReceiverSupervisorImpl: Error stopping receiver 0org.apache.spark.Logging$class.log(Logging.scala:52)
org.apache.spark.streaming.twitter.TwitterReceiver.log(TwitterInputDStream.scala:60)
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
org.apache.spark.streaming.twitter.TwitterReceiver.logInfo(TwitterInputDStream.scala:60)
org.apache.spark.streaming.twitter.TwitterReceiver.onStop(TwitterInputDStream.scala:101)
org.apache.spark.streaming.receiver.ReceiverSupervisor.stopReceiver(ReceiverSupervisor.scala:136)
org.apache.spark.streaming.receiver.ReceiverSupervisor.stop(ReceiverSupervisor.scala:112)
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:127)
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
You need to use the sbt-assembly plugin to prepare an "assembled" JAR file with all dependencies. It should contain all the Twitter util classes.
Links:
1. https://github.com/sbt/sbt-assembly
2. http://prabstechblog.blogspot.com/2014/04/creating-single-jar-for-spark-project.html
3. http://eugenezhulenev.com/blog/2014/10/18/run-tests-in-standalone-spark-cluster/
Or you can take a look at my Spark-Twitter project; it has the sbt-assembly plugin configured: http://eugenezhulenev.com/blog/2014/11/20/twitter-analytics-with-spark/
CDH 5.2 packages Spark 1.1.0, but your build.sbt is using 1.0.0. Updating the versions in this block and rebuilding should fix your problem:
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
"org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"
)
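With the Spark version packaged by CDH 5.2, the updated block would be (a sketch; adjust if your parcel ships a different version):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.1.0"
)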