NoSuchMethodError from spark-cassandra-connector with assembled jar - scala

I'm fairly new to Scala and am trying to build a Spark job. I've built a job that contains the DataStax connector and assembled it into a fat jar. When I try to execute it, it fails with a java.lang.NoSuchMethodError. I've cracked open the JAR and can see that the DataStax library is included. Am I missing something obvious? Is there a good tutorial to look at regarding this process?
$ spark-submit --class org.bobbrez.CasCountJob ./target/scala-2.11/bobbrez-spark-assembly-0.0.1.jar ks tn
Exception in thread "main" java.lang.NoSuchMethodError:;
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
name := "soofa-spark"
version := "0.0.1"
scalaVersion := "2.11.7"
// additional libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M3"
libraryDependencies += "com.typesafe" % "config" % "1.3.0"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case m if m.toLowerCase.endsWith("") => MergeStrategy.discard
    case m if m.startsWith("META-INF") => MergeStrategy.discard
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
    case "about.html" => MergeStrategy.rename
    case "reference.conf" => MergeStrategy.concat
    case _ => MergeStrategy.first
  }
}
package org.bobbrez
// Spark
import org.apache.spark.{SparkContext, SparkConf}
import com.datastax.spark.connector._
object CasCountJob {
private val AppName = "CasCountJob"
def main(args: Array[String]) {
println("Hello world from " + AppName)
val keyspace = args(0)
val tablename = args(1)
println("Keyspace: " + keyspace)
println("Table: " + tablename)
// Configure and create a Scala Spark Context.
val conf = new SparkConf(true)
.set("", "HOSTNAME")
.set("spark.cassandra.auth.username", "USERNAME")
.set("spark.cassandra.auth.password", "PASSWORD")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable(keyspace, tablename)
println("Table Count: " + rdd.count)
}
}

The Cassandra connector for Spark 1.6 is still in development and has not been released yet.
To integrate Cassandra with Spark you need at least the following dependencies:
Spark-Cassandra connector - Download appropriate version from here
Cassandra Core driver - Download appropriate version from here
Spark-Cassandra Java library - Download appropriate version from here
Other Dependent Jars - jodatime , jodatime-convert, jsr166
The mapping between Cassandra library versions and Spark versions is given here
Apparently the Cassandra connector for Spark 1.5 is also still in development, so you may see some compatibility issues. The most stable release of the Cassandra connector is for Spark 1.4, which requires the following jar files:
Spark-Cassandra connector
Cassandra Core driver
Spark-Cassandra Java library
Other Dependent Jars - jodatime , jodatime-convert, jsr166
Needless to say, all these jar files must be configured and available to the executors.


issue running spark-submit : java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key() [duplicate]

I have the following scala code and am using sbt to compile and run this. sbt run works as expected.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{StreamingContext, Seconds}
import com.couchbase.spark.streaming._
object StreamingExample {
def main(args: Array[String]): Unit = {
// Create the Spark Config and instruct to use the travel-sample bucket
// with no password.
val conf = new SparkConf()
.set("", "")
// Initialize StreamingContext with a Batch interval of 5 seconds
val ssc = new StreamingContext(conf, Seconds(5))
// Consume the DCP Stream from the beginning and never stop.
// This counts the messages per interval and prints their count.
ssc
  .couchbaseStream(from = FromBeginning, to = ToInfinity)
  .foreachRDD(rdd => {
    rdd.foreach(message => {
      if (message.isInstanceOf[Mutation]) {
        val document = message.asInstanceOf[Mutation]
        println("mutated: " + document)
      } else if (message.isInstanceOf[Deletion]) {
        val document = message.asInstanceOf[Deletion]
        println("deleted: " + document)
      }
    })
  })
// Start the Stream and await termination
ssc.start()
ssc.awaitTermination()
}
}
But this fails when run as a Spark job, as below:
spark-submit --class "StreamingExample" --master "local[*]" target/scala-2.11/spark-samples_2.11-1.0.jar
The error is java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key()
Following is my build.sbt
lazy val root = (project in file(".")).
  settings(
    name := "spark-samples",
    version := "1.0",
    scalaVersion := "2.11.12",
    mainClass in Compile := Some("StreamingExample")
  )
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-streaming" % "2.4.0",
  "org.apache.spark" %% "spark-sql" % "2.4.0",
  "com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
The spark version running on my machine is 2.4.0 using scala 2.11.12.
I do not see com.couchbase.client_spark-connector_2.11-2.2.0 in my spark jars ( /usr/local/Cellar/apache-spark/2.4.0/libexec/jars ), but the older version com.couchbase.client_spark-connector_2.10-1.2.0.jar exists.
Why is spark-submit not working?
how does sbt manage to run this? where does it download the
Please ensure that both the Scala version and the spark connector library version used by SBT and your spark installation are the same.
I had run into a similar problem when I was trying to run a sample Flink job on my system. It was being caused by version mismatch.
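One quick way to spot this kind of mismatch is to read the Scala binary version that sbt's %% operator encodes in the artifact name (the _2.10 or _2.11 suffix) and compare it with the Scala version in use. The following is only a minimal sketch: the object and method names are hypothetical, and it assumes the usual name_scalaBinaryVersion-version.jar naming seen in the question.

```scala
object ScalaSuffixCheck {
  // Matches the "_2.11-" style suffix that %% adds to artifact names.
  private val Suffix = """_(\d+\.\d+)-""".r

  /** Scala binary version encoded in a jar file name, if any. */
  def binaryVersionOfJar(jarName: String): Option[String] =
    Suffix.findFirstMatchIn(jarName).map(_.group(1))

  /** Reduces a full Scala 2 version such as "2.11.12" to its binary version "2.11". */
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  /** True when the jar name carries the suffix for the given Scala version. */
  def matches(jarName: String, scalaVersion: String): Boolean =
    binaryVersionOfJar(jarName).contains(binaryVersion(scalaVersion))

  def main(args: Array[String]): Unit = {
    // Scala version of the JVM process running this check.
    val runtime = scala.util.Properties.versionNumberString
    Seq(
      "com.couchbase.client_spark-connector_2.10-1.2.0.jar",
      "spark-samples_2.11-1.0.jar"
    ).foreach { jar =>
      val built = binaryVersionOfJar(jar).getOrElse("unknown")
      println(s"$jar: built for Scala $built, running on Scala $runtime")
    }
  }
}
```

With the jar names from the question, the connector found in the Spark jars directory reports 2.10 while the application jar reports 2.11, which is the kind of disagreement to look for.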

Why does Spark application fail with "ClassNotFoundException: Failed to find data source: jdbc" as uber-jar with sbt assembly?

I'm trying to assemble a Spark application using sbt 1.0.4 with sbt-assembly 0.14.6.
The Spark application works fine when launched in IntelliJ IDEA or spark-submit, but if I run the assembled uber-jar with the command line (cmd in Windows 10):
java -Xmx1024m -jar my-app.jar
I get the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at
The Spark application looks as follows.
package spark.main
import java.util.Properties
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]) {
val connectionProperties = new Properties()
connectionProperties.put("driver", "org.postgresql.Driver")
val testTable = "test_tbl"
val spark = SparkSession.builder()
.appName("Postgres Test")
.config("spark.hadoop.fs.file.impl", classOf[org.apache.hadoop.fs.LocalFileSystem].getName)
.config("spark.sql.warehouse.dir", System.getProperty("") + "swd")
val dfPg =
The following is build.sbt.
name := "apache-spark-scala"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
mainClass in Compile := Some("spark.main.Main")
libraryDependencies ++= {
  val sparkVer = "2.1.1"
  val postgreVer = "42.0.0"
  val cassandraConVer = "2.0.2"
  val configVer = "1.3.1"
  val logbackVer = "1.7.25"
  val loggingVer = "3.7.2"
  val commonsCodecVer = "1.10"
  Seq(
    "org.apache.spark" %% "spark-sql" % sparkVer,
    "org.apache.spark" %% "spark-core" % sparkVer,
    "com.datastax.spark" %% "spark-cassandra-connector" % cassandraConVer,
    "org.postgresql" % "postgresql" % postgreVer,
    "com.typesafe" % "config" % configVer,
    "commons-codec" % "commons-codec" % commonsCodecVer,
    "com.typesafe.scala-logging" %% "scala-logging" % loggingVer,
    "org.slf4j" % "slf4j-api" % logbackVer
  )
}
dependencyOverrides ++= Seq(
  "io.netty" % "netty-all" % "4.0.42.Final",
  "commons-net" % "commons-net" % "2.2",
  "" % "guava" % "14.0.1"
)
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Does anyone have any idea why?
The configuration taken from the official GitHub repository did the trick:
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    xs map {_.toLowerCase} match {
      case ("" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
        MergeStrategy.discard
      case ps @ (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
        MergeStrategy.discard
      case "services" :: _ => MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.first
    }
  case _ => MergeStrategy.first
}
The question is almost the same as "Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?", the difference being that the other OP used Apache Maven to create the uber-jar, whereas here it is sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (aka alias) of a data source, e.g. jdbc or kafka, is only available if the corresponding META-INF/services/org.apache.spark.sql.sources.DataSourceRegister registers a DataSourceRegister.
For jdbc alias to work Spark SQL uses META-INF/services/org.apache.spark.sql.sources.DataSourceRegister with the following entry (there are others):
That's what ties jdbc alias up with the data source.
And you've excluded it from an uber-jar by the following assemblyMergeStrategy.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
Note case PathList("META-INF", xs # _*) which you simply MergeStrategy.discard. That's the root cause.
Just to check that the "infrastructure" is available and you could use the jdbc data source by its fully-qualified name (not the alias), try this:
You will see other problems due to missing options like url, but...we're digressing.
A solution is to MergeStrategy.concat all META-INF/services/org.apache.spark.sql.sources.DataSourceRegister (that would create an uber-jar with all data sources, incl. the jdbc data source).
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
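Put together, the suggestion could look like the build.sbt fragment below. This is a sketch rather than a tested configuration: it keeps the sbt-assembly 0.14.x syntax from the question, places the more specific case first so that it is matched before the general META-INF rule, and leaves the rest of the original strategy unchanged.

```scala
// build.sbt fragment (sbt-assembly 0.14.x syntax, as used in the question)
assemblyMergeStrategy in assembly := {
  // Keep every data source registration by concatenating the service files.
  case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" =>
    MergeStrategy.concat
  // Discard the remaining META-INF entries, as the original configuration did.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
```

Order matters here: pattern matching takes the first case that applies, so the DataSourceRegister rule has to come before the catch-all META-INF discard.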

sc.TextFile("") working in Eclipse but not in a JAR

I'm writing code that will run on a Hadoop cluster, but first I test it locally with local files. The code works fine in Eclipse, but when I build a fat JAR with SBT (with the Spark libraries etc.), the program runs until it reaches a textFile(path) call. My code is:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}
import org.joda.time.format.DateTimeFormat
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer
object TestCRA2 {
val conf = new SparkConf()
.set("spark.driver.memory", "4g")
.set("spark.executor.memory", "4g")
val context = new SparkContext(conf)//.master("local")
val rootLogger = Logger.getRootLogger()
def TimeParse1(path: String) : RDD[(Int,Long,Long)] = {
val data = context.textFile(path).map(_.split(";"))
return data
def main(args: Array[String]) {
val data = TimeParse1("file:///home/quentin/Downloads/CRA")
And here is my error :
Exception in thread "main" No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.FileSystem.getLocal(
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1034)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1029)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1029)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
at main.scala.TestCRA2$.TimeParse1(TestCRA.scala:37)
at main.scala.TestCRA2$.main(TestCRA.scala:84)
at main.scala.TestCRA2.main(TestCRA.scala)
I can't put my files into the JAR because they are on the Hadoop cluster, and it works in Eclipse.
Here is my build.sbt :
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
If I don't set my assemblyMergeStrategy like this, I get a bunch of merge errors.
Actually I needed to change my build.sbt like this :
name := "BloomFilters"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "joda-time" % "joda-time" % "2.9.3"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      case "services" :: xs => MergeStrategy.first
      case _ => MergeStrategy.discard
    }
  case x => MergeStrategy.first
}
Thank you @lyomi
Your sbt assembly is probably ignoring some of the required files. Specifically, Hadoop's FileSystem class relies on a service discovery mechanism that looks for ALL META-INF/services/org.apache.hadoop.fs.FileSystem files in the classpath.
On Eclipse it was fine, because each JAR had the corresponding file, but in the uber-jar one might have overridden others, causing the file: scheme to not get recognized.
In your SBT settings, add the following, to concatenate the service discovery files instead of discarding some of them.
val defaultMergeStrategy: String => MergeStrategy = {
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      // ... possibly other settings ...
      case "services" :: xs =>
        MergeStrategy.filterDistinctLines
      case _ => MergeStrategy.deduplicate
    }
  case _ => MergeStrategy.deduplicate
}
See the sbt-assembly documentation for more info.
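To see why the choice of strategy matters for the service files, here is a small plain-Scala illustration that does not use sbt at all. The object and method names are hypothetical, and the file contents are abbreviated stand-ins for what two Hadoop jars ship under META-INF/services/org.apache.hadoop.fs.FileSystem; the real files list more classes, and which copy a "first" strategy keeps depends on the order of the jars.

```scala
object ServiceFileMerge {
  // Abbreviated, illustrative contents of the same service file in two jars.
  val fromHadoopCommon: Seq[String] = Seq("org.apache.hadoop.fs.LocalFileSystem")
  val fromHadoopHdfs: Seq[String] = Seq("org.apache.hadoop.hdfs.DistributedFileSystem")

  /** Mimics a "first" strategy: keep one copy of the file, drop the others. */
  def first(files: Seq[Seq[String]]): Seq[String] =
    files.headOption.getOrElse(Seq.empty)

  /** Mimics a "filterDistinctLines" strategy: keep every distinct line. */
  def filterDistinctLines(files: Seq[Seq[String]]): Seq[String] =
    files.flatten.distinct

  def main(args: Array[String]): Unit = {
    // Assume the HDFS jar happens to come first on the classpath.
    val files = Seq(fromHadoopHdfs, fromHadoopCommon)
    println("first:               " + first(files).mkString(", "))
    println("filterDistinctLines: " + filterDistinctLines(files).mkString(", "))
  }
}
```

In this arrangement the "first" result no longer mentions LocalFileSystem, so nothing registers the file: scheme, whereas merging distinct lines keeps both implementations.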

spark-hbase-connector : ClusterId read in ZooKeeper is null

I'm trying to run a simple program that copies the content of an RDD into an HBase table. I'm using spark-hbase-connector by nerdammer, and I'm running the code with spark-submit on a local cluster on my machine. Spark version is 2.1.
This is the code I'm trying to run:
import org.apache.spark.{SparkConf, SparkContext}
import it.nerdammer.spark.hbase._
object HbaseConnect {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
sparkConf.set("", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(1 to 100)
.map(i => (i.toString, i+1, "Hello"))
rdd.toHBaseTable("mytable").toColumns("column1", "column2")
Here is my build.sbt:
name := "HbaseConnect"
version := "0.1"
scalaVersion := "2.11.8"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3")
The execution gets stuck, showing the following info:
17/11/22 10:20:34 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/11/22 10:20:34 INFO TableOutputFormat: Created table instance for mytable
I am unable to identify the problem with ZooKeeper. HBase clients discover the running HBase cluster using the following two properties:
1. hbase.zookeeper.quorum: used to connect to the ZooKeeper cluster
2. zookeeper.znode.parent: tells which znode keeps the data (and the address of the HMaster) for the cluster
I overrode these two properties in the code with:
sparkConf.set("", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
Another question: there is no spark-hbase-connector_2.11. Can the provided version, spark-hbase-connector_2.10, support Scala 2.11?
Problem solved. I had to override the HMaster port to 16000 (which is my HMaster port number; I'm using Ambari). The default value that sparkConf uses is 60000.
sparkConf.set("hbase.master", "hostname"+":16000")

Not able to execute my SparkStreaming Program

I have written the following Scala code and my platform is Cloudera CDH 5.2.1 on CentOS 6.5
import org.apache.spark
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
import TutorialHelper._
object Tutorial {
def main(args: Array[String]) {
val checkpointDir = TutorialHelper.getCheckPointDirectory()
val consumerKey = "..."
val consumerSecret = "..."
val accessToken = "..."
val accessTokenSecret = "..."
try {
TutorialHelper.configureTwitterCredentials(consumerKey, consumerSecret, accessToken, accessTokenSecret)
val ssc = new StreamingContext(new SparkContext(), Seconds(1))
val tweets = TwitterUtils.createStream(ssc, None)
val tweetText = tweets.map(tweet => tweet.getText())
} finally {
My build.sbt file looks like
import AssemblyKeys._ // put this at the top of the file
name := "Tutorial"
scalaVersion := "2.10.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
"org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"
resolvers += "Akka Repository" at ""
resourceDirectory in Compile := baseDirectory.value / "resources"
mergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "" => MergeStrategy.discard
case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
I also created a file called projects/plugin.sbt which has the following content
addSbtPlugin("net.virtual-void" % "sbt-cross-building" % "0.8.1")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")
and project/build.scala
import sbt._
object Plugins extends Build {
lazy val root = Project("root", file(".")) dependsOn(
after this I can build my "uber" assembly by using
sbt assembly
now I run my code using
sudo -u hdfs spark-submit --class Tutorial --master local /tmp/Tutorial-assembly-0.1-SNAPSHOT.jar
I get the error
Configuring Twitter OAuth
Property twitter4j.oauth.accessToken set as [...]
Property twitter4j.oauth.consumerSecret set as [...]
Property twitter4j.oauth.accessTokenSecret set as [...]
Property twitter4j.oauth.consumerKey set as [...]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-assembly-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/12/21 16:04:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Time: 1419199472000 ms
Time: 1419199473000 ms
14/12/21 16:04:33 ERROR ReceiverSupervisorImpl: Error stopping receiver 0org.apache.spark.Logging$class.log(Logging.scala:52)
You need to use the sbt-assembly plugin to prepare an "assembled" jar file with all dependencies. It should contain all the Twitter util classes.
Or you can take a look at my Spark-Twitter project, it has configured sbt-assembly plugin:
CDH 5.2 packages Spark 1.1.0, but your build.sbt is using 1.0.0. Updating the versions in the lines below to 1.1.0 and rebuilding should fix your problem.
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
"org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"