issue running spark-submit : java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key() [duplicate] - scala

This question already has answers here:
Resolving dependency problems in Apache Spark
(7 answers)
Closed 4 years ago.
I have the following scala code and am using sbt to compile and run this. sbt run works as expected.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{StreamingContext, Seconds}
import com.couchbase.spark.streaming._
object StreamingExample {
def main(args: Array[String]): Unit = {
// Create the Spark Config and instruct to use the travel-sample bucket
// with no password.
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("StreamingExample")
.set("com.couchbase.bucket.travel-sample", "")
// Initialize StreamingContext with a Batch interval of 5 seconds
val ssc = new StreamingContext(conf, Seconds(5))
// Consume the DCP Stream from the beginning and never stop.
// This counts the messages per interval and prints their count.
ssc
.couchbaseStream(from = FromBeginning, to = ToInfinity)
.foreachRDD(rdd => {
rdd.foreach(message => {
//println(message.getClass());
message.getClass();
if(message.isInstanceOf[Mutation]) {
val document = message.asInstanceOf[Mutation].key.map(_.toChar).mkString
println("mutated: " + document);
} else if( message.isInstanceOf[Deletion]) {
val document = message.asInstanceOf[Deletion].key.map(_.toChar).mkString
println("deleted: " + document);
}
})
})
// Start the Stream and await termination
ssc.start()
ssc.awaitTermination()
}
}
but this fails when run as a spark job like below :
spark-submit --class "StreamingExample" --master "local[*]" target/scala-2.11/spark-samples_2.11-1.0.jar
The error is java.lang.NoSuchMethodError: com.couchbase.spark.streaming.Mutation.key()
Following is my build.sbt
lazy val root = (project in file(".")).
settings(
name := "spark-samples",
version := "1.0",
scalaVersion := "2.11.12",
mainClass in Compile := Some("StreamingExample")
)
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.0",
"org.apache.spark" %% "spark-streaming" % "2.4.0",
"org.apache.spark" %% "spark-sql" % "2.4.0",
"com.couchbase.client" %% "spark-connector" % "2.2.0"
)
// META-INF discarding
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
The spark version running on my machine is 2.4.0 using scala 2.11.12.
Observations:
I do not see com.couchbase.client_spark-connector_2.11-2.2.0 in my spark jars ( /usr/local/Cellar/apache-spark/2.4.0/libexec/jars ), but the older version com.couchbase.client_spark-connector_2.10-1.2.0.jar exists.
Why is spark-submit not working?
how does sbt manage to run this? where does it download the
dependencies?

Please ensure that both the Scala version and the spark connector library version used by SBT and your spark installation are the same.
I had run into a similar problem when I was trying to run a sample Flink job on my system. It was being caused by version mismatch.

Related

spark streaming save base64 rdd to json on s3

The scala application below cannot save an rdd in json format onto S3
I have :-
a kinesis stream that has complex objects placed on the stream. This object has had JSON.stringify() applied to it before being placed on the stream as part of the Kinesis PutRecord method.
A scala spark stream job reads these items off the stream,
I cannot seem to save the rdd record that comes off the stream into json file onto an S3 bucket.
In the code i've attempted to convert the RDD[Bytes] to RDD[String] then load with spark.read.json but no luck. I've tried various other combinations and can't seem to output the onto S3 in it's raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
def main(args: Array[String]): Unit = {
val appName = "ScalaStreamExample"
val batchInterval = Milliseconds(2000)
val outPath = "s3://xxx-xx--xxx/xxxx/"
val spark = SparkSession
.builder()
.appName(appName)
.getOrCreate()
val sparkContext = spark.sparkContext
val streamingContext = new StreamingContext(sparkContext, batchInterval)
// Populate the appropriate variables from the given args
val checkpointAppName = "xxx-xx-xx--xx"
val streamName = "cc-cc-c--c--cc"
val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
val regionName = "cc-xxxx-xxx"
val initialPosition = new Latest()
val checkpointInterval = batchInterval
val storageLevel = StorageLevel.MEMORY_AND_DISK_2
val kinesisStream = KinesisInputDStream.builder
.streamingContext(streamingContext)
.endpointUrl(endpointUrl)
.regionName(regionName)
.streamName(streamName)
.initialPosition(initialPosition)
.checkpointAppName(checkpointAppName)
.checkpointInterval(checkpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
kinesisStream.foreachRDD { rdd =>
if (!rdd.isEmpty()){
//**************** . <---------------
// This is where i'm trying to save the raw json object to s3 as json file
// tried various combinations here but no luck.
val dataFrame = rdd.map(record=>new String(record)) // convert bytes to string
dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
//**************** <----------------
}
}
// Start the streaming context and await termination
streamingContext.start()
streamingContext.awaitTermination()
}
}
Any ideas what i'm missing?
So it was complete red herring why it failed to work. Turns out it was a scala version conflict with what is available on EMR.
Many similar questions asked on SO that suggested this may be the issue but whilst the spark documentation lists 2.12.4 is compatible with spark 2.4.4, the EMR instance does not appear to support scala version 2.12.4. So i've updated my build.sbt and deploy script from
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
ibraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE

spark-hbase-connector : ClusterId read in ZooKeeper is null

I'am trying to run a simple program that copies the content of an rdd into a Hbase table. I'am using spark-hbase-connector by nerdammer https://github.com/nerdammer/spark-hbase-connector. I'am running the code using spark-submit on a local cluster on my machine. Spark version is 2.1.
this is the code i'am trying tu run :
import org.apache.spark.{SparkConf, SparkContext}
import it.nerdammer.spark.hbase._
object HbaseConnect {
def main(args: Array[String]) {
val sparkConf = new SparkConf()
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
val sc = new SparkContext(sparkConf)
val rdd = sc.parallelize(1 to 100)
.map(i => (i.toString, i+1, "Hello"))
rdd.toHBaseTable("mytable").toColumns("column1", "column2")
.inColumnFamily("mycf")
.save()
sc.stop
}}
Here is my build.sbt:
name := "HbaseConnect"
version := "0.1"
scalaVersion := "2.11.8"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first}
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
"it.nerdammer.bigdata" % "spark-hbase-connector_2.10" % "1.0.3")
the execution gets stuck showing the following info:
17/11/22 10:20:34 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/11/22 10:20:34 INFO TableOutputFormat: Created table instance for mytable
I am unable to indentify the problem with zookeeper. The HBase clients will discover the running HBase cluster using the following two properties:
1.hbase.zookeeper.quorum: is used to connect to the zookeeper cluster
2.zookeeper.znode.parent. tells which znode keeps the data (and address for HMaster) for the cluster
I overridden these two properties in the code. with
sparkConf.set("spark.hbase.host", "hostname")
sparkConf.set("zookeeper.znode.parent", "/hbase-unsecure")
Another question is that there is no spark-hbase-connector_2.11. can the provided version spark-hbase-connector_2.10 support scala 2.11 ?
Problem is solved. I had to override the Hmaster port to 16000 (wich is my Hmaster port number. I'am using ambari). Default value that sparkConf uses is 60000.
sparkConf.set("hbase.master", "hostname"+":16000").

java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterUtils$

I was building this small demo code for Spark streaming using twitter. I have added the required dependencies as shown by http://bahir.apache.org/docs/spark/2.0.0/spark-streaming-twitter/ and I am using sbt to build jars. The project build successfully and only problem seems to be is- it is not able to find the TwitterUtils class.
The scala code is given below
build.sbt
name := "twitterexample"
version := "1.0"
scalaVersion := "2.11.8"
val sparkVersion = "1.6.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.bahir" %% "spark-streaming-twitter" % "2.1.0",
"org.twitter4j" % "twitter4j-core" % "4.0.4",
"org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
The main scala file is
TwitterCount.scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import twitter4j.Status
object TwitterCount {
def main(args: Array[String]): Unit = {
val consumerKey = "abc"
val consumerSecret ="abc"
val accessToken = "abc"
val accessTokenSecret = "abc"
val lang ="english"
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret",consumerSecret)
System.setProperty("twitter4j.oauth.accessToken",accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret",accessTokenSecret)
val conf = new SparkConf().setAppName("TwitterHashTags")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = TwitterUtils.createStream(ssc,None)
val tweetsFilteredByLang = tweets.filter{tweet => tweet.getLang() == lang}
val statuses = tweetsFilteredByLang.map{ tweet => tweet.getText()}
val words = statuses.map{status => status.split("""\s+""")}
val hashTags = words.filter{ word => word.startsWith("#StarWarsDay")}
val hashcounts = hashTags.count()
hashcounts.print()
ssc.start
ssc.awaitTermination()
}
Then I am building the project using
sbt package
and I submitting the generated jars using
spark-submit --class "TwitterCount" --master local[*] target/scala-2.11/twitterexample_2.11-1.0.jar
Please help me with this.
Thanks
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
You are missing package name in your code. Your spark submit command should be like this.
--class com.spark.examples.TwitterCount
I found the solution at last.
java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags
I have to build the jars using
sbt assembly
but I'm still wondering what's the difference in jars that I make using
sbt package
anyone knows? plz share

updateStateByKey, noClassDefFoundError

I have problem with using updateStateByKey() function. I have following, simple code (written base on book: "Learning Spark - Lighting Fast Data Analysis"):
object hello {
def updateStateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
Some(runningCount.getOrElse(0) + newValues.size)
}
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[5]").setAppName("AndrzejApp")
val ssc = new StreamingContext(conf, Seconds(4))
ssc.checkpoint("/")
val lines7 = ssc.socketTextStream("localhost", 9997)
val keyValueLine7 = lines7.map(line => (line.split(" ")(0), line.split(" ")(1).toInt))
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
ssc.start()
ssc.awaitTermination()
}
}
My build.sbt is:
name := "stream-correlator-spark"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "1.3.1" % "provided"
)
When I build it with sbt assembly command everything goes fine. When I run this on spark cluster in local mode I got error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/dstream/DStream$
at hello$.main(helo.scala:25)
...
line 25 is:
val statefullStream = keyValueLine7.updateStateByKey(updateStateFunction _)
I feel this might be some compatibility version problem but I don't know what might be the reason and how to resolve this.
I would be really grateful for help!
When you are writing "provided" in the SBT this means exactly that your library is provided by the environment and need no to be included in the package.
Try to remove "provided" mark from "spark-streaming" library.
You can add "provided" back when you need to submit your app to a spark cluster to run. The benefit of having "provided" is that the result fat jar will not include classes from the provided dependencies, which will yield a much smaller fat jar, comparing to not having "provided". In my case, the result jar will be around 90M without "provided" and then shrink to 30+M with "provided".

Apache spark-streaming application output not forwarded to the master

I'm trying to run the FlumeEvent example which is the following
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import org.apache.spark.streaming.flume.FlumeUtils
object FlumeEventCount {
def main(args: Array[String]) {
val batchInterval = Milliseconds(2000)
// Create the context and set the batch size
val sparkConf = new SparkConf().setAppName("FlumeEventCount")
.set("spark.cleaner.ttl","3")
val ssc = new StreamingContext(sparkConf, batchInterval)
// Create a flume stream
var stream = FlumeUtils.createStream(ssc, "192.168.1.5",3564, StorageLevel.MEMORY_ONLY_SER_2)
// Print out the count of events received from this server in each batch
stream.count().map(cnt => "Received flume events." + cnt ).print()
stream.count.print()
stream.print()
ssc.start()
ssc.awaitTermination()
}
}
My sbt file is the following
import AssemblyKeys._
assemblySettings
name := "flume-test"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.0.0" exclude("org.apache.spark","spark-core") exclude("org.apache.spark", "spark-streaming_2.10")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
I run the programm with the following command
/tmp/spark-1.0.0-bin-hadoop2/bin/spark-submit --class FlumeEventCount --master local --deploy-mode client /tmp/fooproj/target/scala-2.10/cert-log-manager-assembly-1.0.jar
On the other side, the flume application is sending everything correctly and I can see in the logs that it's received.
I haven't made any changes to spark's configuration nor setup any environment variables, I just downloaded and unpacked the program.
Can someone tell me what am I doing wrong?
//edit: When I execute spark's FlumeEventCount example, it works
//edit2: If I remove the awaiTermination and add an ssc.stop it prints everything one single time, I guess this happens because something is getting flushed
....I should have learned to rtfm more carefully by now,
quoting from this page: https://spark.apache.org/docs/latest/streaming-programming-guide.html
// Spark Streaming needs at least two working thread
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
I've been launching spark with only one thread
also the following works fine
stream.map(event=>"Event: header:"+ event.event.get(0).toString+" body:"+ new String(event.event.getBody.array) ).print