Apache Spark Throws java.lang.IllegalStateException: unread block data - scala

What we are doing:
Installing Spark 0.9.1 according to the documentation on the website, along with the CDH4 (and, on another cluster, the CDH5) distribution of Hadoop/HDFS.
Building a fat jar of a Spark app with sbt, then trying to run it on the cluster.
I've also included code snippets and the sbt dependencies at the bottom.
When I've Googled this, there seem to be two somewhat vague responses:
a) Mismatched Spark versions between the nodes and the user code
b) Needing to add more jars to the SparkConf
Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar).
But I have no idea how to check for (a) - it appears Spark doesn't have any version checks - it would be nice if it checked versions and threw a "mismatching version exception: you have user code using version X and node Y has version Z".
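For what it's worth, the closest I can get to a manual check for (a) is something like the following - a hedged sketch that assumes an already-working SparkContext named sc; on a badly mismatched cluster even this trivial job may die with the same exception, which is itself a strong hint:
// Print which spark-core jar the driver and each executor actually load (plain JVM APIs only).
// getCodeSource can be null under some classloaders, hence the Option wrapping.
val driverSrc = classOf[org.apache.spark.SparkContext].getProtectionDomain.getCodeSource
println("driver spark-core:   " + Option(driverSrc).map(_.getLocation.toString).getOrElse("<unknown>"))

val executorJars = sc.parallelize(1 to 100, 10)
  .mapPartitions { _ =>
    val src = classOf[org.apache.spark.SparkContext].getProtectionDomain.getCodeSource
    Iterator(Option(src).map(_.getLocation.toString).getOrElse("<unknown>"))
  }
  .collect()
  .distinct
executorJars.foreach(jar => println("executor spark-core: " + jar))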
I would be very grateful for advice on this. I've submitted a bug report, because there has to be something wrong with the Spark documentation: I've seen two independent sysadmins hit the exact same problem with different versions of CDH on different clusters. https://issues.apache.org/jira/browse/SPARK-1867
The exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]
My code snippet:
// Build the SparkConf, pointing at the cluster master and shipping the fat jar that contains this class
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))

// Simple smoke test: count the lines of a file on HDFS
println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
My SBT dependencies:
// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
// standard, probably unrelated
"com.github.seratch" %% "awscala" % "[0.2,)",
"org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
"org.specs2" %% "specs2" % "1.14" % "test",
"org.scala-lang" % "scala-reflect" % "2.10.3",
"org.scalaz" %% "scalaz-core" % "7.0.5",
"net.minidev" % "json-smart" % "1.2"

Changing
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
to
"org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0"
in my application code seemed to fix this. I'm not entirely sure why. We have hadoop-yarn on the cluster, so maybe the "mr1" (MRv1) artifact broke things.
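For reference, this is roughly what the relevant dependency lines look like after the change - a minimal sketch, assuming everything else in the build stays as in the question:
// unchanged
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
// hadoop-client 2.3.0-mr1-cdh5.0.0 swapped for the plain CDH5 (YARN-era) artifact
"org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0"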

I recently ran into this issue with CDH 5.2 + Spark 1.1.0.
It turns out the problem was in my spark-submit command: I was using
--master yarn
instead of the new
--master yarn-cluster

Related

Using upgrade.from config in Kafka Streams is causing a "BindException: Address already in use" error in tests using embedded-kafka-schema-registry

I've got a Scala application that uses Kafka Streams - and Embedded Kafka Schema Registry in its integration tests.
I'm currently trying to upgrade Kafka Streams from 2.5.1 to 3.3.1 - and everything is working locally as expected, with all unit and integration tests passing.
However, according to the upgrade guide on the Kafka Streams documentation, when upgrading Kafka Streams, "if upgrading from 3.2 or below, you will need to do two rolling bounces, where during the first rolling bounce phase you set the config upgrade.from="older version" (possible values are "0.10.0" - "3.2") and during the second you remove it".
I've therefore added this upgrade.from config to my code as follows:
val propsMap = Map(
  ...
  UPGRADE_FROM_CONFIG -> "2.5.1"
)
val props = new Properties()
props.putAll(asJava(propsMap))
val streams = new KafkaStreams(topology, props)
However, doing this causes my integration tests to start failing with the following error:
[info] java.net.BindException: Address already in use
[info] at sun.nio.ch.Net.bind0(Native Method)
[info] at sun.nio.ch.Net.bind(Net.java:461)
[info] at sun.nio.ch.Net.bind(Net.java:453)
[info] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:222)
[info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:85)
[info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:78)
[info] at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:676)
[info] at org.apache.zookeeper.server.ServerCnxnFactory.configure(ServerCnxnFactory.java:109)
[info] at org.apache.zookeeper.server.ServerCnxnFactory.configure(ServerCnxnFactory.java:105)
[info] at io.github.embeddedkafka.ops.ZooKeeperOps.startZooKeeper(zooKeeperOps.scala:26)
Does anyone know why that might be happening and how to resolve it? Also, is this use of the upgrade.from config correct?
For additional context, my previous versions of the relevant libraries were:
"org.apache.kafka" %% "kafka-streams-scala" % "2.5.1"
"org.apache.kafka" % "kafka-clients" % "5.5.1-ccs"
"io.confluent" % "kafka-avro-serializer" % "5.5.1"
"io.confluent" % "kafka-schema-registry-client" % "5.5.1"
"org.apache.kafka" %% "kafka" % "2.5.1"
"io.github.embeddedkafka" %% "embedded-kafka-schema-registry" % "5.5.1"
And my updated versions are:
"org.apache.kafka" %% "kafka-streams-scala" % "3.3.1"
"org.apache.kafka" % "kafka-clients" % "7.3.0-ccs"
"io.confluent" % "kafka-avro-serializer" % "7.3.0"
"io.confluent" % "kafka-schema-registry-client" % "7.3.0"
"org.apache.kafka" %% "kafka" % "3.3.1"
"io.github.embeddedkafka" %% "embedded-kafka-schema-registry" % "7.3.0"
My integration tests use Embedded Kafka Schema Registry as follows in their test setup, with specific ports specified for Kafka, Zookeeper and Schema Registry:
class MySpec extends AnyWordSpec
    with EmbeddedKafkaConfig
    with EmbeddedKafka
    with BeforeAndAfterAll {

  override protected def beforeAll(): Unit = {
    super.beforeAll()
    EmbeddedKafka.start()
    ...
  }

  override protected def afterAll(): Unit = {
    ...
    EmbeddedKafka.stop()
    super.afterAll()
  }
}
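The fixed ports are wired in through an implicit EmbeddedKafkaConfig along these lines (a hedged sketch: the parameter names are the ones I recall from the embedded-kafka-schema-registry flavour of EmbeddedKafkaConfig, and the port values are placeholders, so check them against the version in use):
implicit val kafkaConfig: EmbeddedKafkaConfig = EmbeddedKafkaConfig(
  kafkaPort = 6001,          // fixed Kafka port
  zooKeeperPort = 6000,      // fixed Zookeeper port
  schemaRegistryPort = 6002  // fixed Schema Registry port
)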
I'm not quite sure what to try to resolve this issue.
In searching online, I did find this open GitHub issue on scalatest-embedded-kafka, which was the precursor to Embedded Kafka Schema Registry and seems to describe a similar issue. However, it doesn't appear to have been resolved.
Your upgrade.from config value is not valid.
Cf. https://kafka.apache.org/documentation/#streamsconfigs_upgrade.from
It should be "2.5", not "2.5.1".
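To make that concrete, here is a minimal sketch of the corrected setup - it reuses the propsMap and topology names from the question and assumes UPGRADE_FROM_CONFIG is StreamsConfig.UPGRADE_FROM_CONFIG:
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

// upgrade.from expects the major.minor version only, e.g. "2.5", not the full "2.5.1"
val propsMap = Map(
  StreamsConfig.UPGRADE_FROM_CONFIG -> "2.5"
  // ... the rest of your existing settings
)

val props = new Properties()
propsMap.foreach { case (k, v) => props.put(k, v) }

val streams = new KafkaStreams(topology, props) // topology as defined elsewhere in your application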

can't access hadoop cluster master via spark [closed]

We are using Cloudera's distribution of Hadoop. We have a working cluster with 10 nodes. I'm trying to connect to the cluster from a remote host with IntelliJ. I'm using Scala and Spark.
I imported the following libraries via sbt:
libraryDependencies += "org.scalatestplus.play" %% "scalatestplus-play" % "3.1.2" % Test
libraryDependencies += "com.h2database" % "h2" % "1.4.196"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.2.0"
and I'm trying to create a SparkSession with the following code:
val spark = SparkSession
  .builder()
  .appName("API")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://VMClouderaMasterDev01:9083")
  .master("spark://10.150.1.22:9083")
  .enableHiveSupport()
  .getOrCreate()
but I'm getting the following error:
[error] o.a.s.n.c.TransportResponseHandler - Still have 1 requests
outstanding when connection from /10.150.1.22:9083 is closed
[warn] o.a.s.d.c.StandaloneAppClient$ClientEndpoint - Failed to connect to
master 10.150.1.22:9083
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
......
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /10.150.1.22:9083 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
To be honest, I tried to connect with different ports (8022, 9023) but it didn't work. I saw that the default port is 7077, but I don't have any process listening on port 7077 on the master.
Any idea how I can continue? How can I check which port the master is listening on for this type of connection?
If you're using a Hadoop cluster, you shouldn't be using a standalone Spark master; you should be using YARN:
master("yarn")
In which case, you must export a HADOOP_CONF_DIR environment variable pointing to a directory that contains a copy of the yarn-site.xml (and the other client configuration files) from the cluster.
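A minimal sketch of what that could look like in the code from the question (hedged: the warehouse dir and metastore URI are copied from the question, and it assumes the Spark YARN client jars are on the classpath and HADOOP_CONF_DIR points at the cluster's client configuration):
// export HADOOP_CONF_DIR=/path/to/cluster-conf   (yarn-site.xml, core-site.xml, hive-site.xml, ...)
val spark = SparkSession
  .builder()
  .appName("API")
  .master("yarn")  // let YARN locate the cluster; no master host:port in the URL
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://VMClouderaMasterDev01:9083")
  .enableHiveSupport()
  .getOrCreate()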

How to run Spark application assembled with Spark 2.1 on cluster with Spark 1.6?

I've been told that I could build a Spark application with one version of Spark and, as long as I use sbt assembly to build it, I can run it with spark-submit on any Spark cluster.
So I've built my simple application with Spark 2.1.1. You can see my build.sbt file below. Then I start it on my cluster with:
cd spark-1.6.0-bin-hadoop2.6/bin/
spark-submit --class App --master local[*] /home/oracle/spark_test/db-synchronizer.jar
So, as you can see, I'm executing it with Spark 1.6.0,
and I'm getting this error:
17/06/08 06:59:20 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-4] shutting down ActorSystem [sparkDriver]
java.lang.NoSuchMethodError: org.apache.spark.SparkConf.getTimeAsMs(Ljava/lang/String;Ljava/lang/String;)J
at org.apache.spark.streaming.kafka010.KafkaRDD.<init>(KafkaRDD.scala:70)
at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:219)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:243)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:241)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:241)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:177)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
17/06/08 06:59:20 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@ac5b61d,BlockManagerId(<driver>, localhost, 26012))] in 1 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:23 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@ac5b61d,BlockManagerId(<driver>, localhost, 26012))] in 2 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:26 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@ac5b61d,BlockManagerId(<driver>, localhost, 26012))] in 3 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:29 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@ac5b61d,BlockManagerId(<driver>, localhost, 26012))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
... 1 more
17/06/08 06:59:39 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@5e4d0345,BlockManagerId(<driver>, localhost, 26012))] in 1 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:42 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@5e4d0345,BlockManagerId(<driver>, localhost, 26012))] in 2 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:45 WARN AkkaUtils: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@5e4d0345,BlockManagerId(<driver>, localhost, 26012))] in 3 attempts
akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
17/06/08 06:59:48 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(<driver>,[Lscala.Tuple2;@5e4d0345,BlockManagerId(<driver>, localhost, 26012))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-1309342978]] had already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
... 1 more
Based on some reading, I see that a java.lang.NoSuchMethodError is typically connected to mismatched Spark versions. And that might be true here, because I'm using different ones. But shouldn't sbt assembly cover that? Please see my build.sbt and assembly.sbt files below.
build.sbt
name := "spark-db-synchronizator"
//Versions
version := "1.0.0"
scalaVersion := "2.10.6"
val sparkVersion = "2.1.1"
val sl4jVersion = "1.7.10"
val log4jVersion = "1.2.17"
val scalaTestVersion = "2.2.6"
val scalaLoggingVersion = "3.5.0"
val sparkTestingBaseVersion = "1.6.1_0.3.3"
val jodaTimeVersion = "2.9.6"
val jodaConvertVersion = "1.8.1"
val jsonAssertVersion = "1.2.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-hive" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.slf4j" % "slf4j-api" % sl4jVersion,
"org.slf4j" % "slf4j-log4j12" % sl4jVersion exclude("log4j", "log4j"),
"log4j" % "log4j" % log4jVersion % "provided",
"org.joda" % "joda-convert" % jodaConvertVersion,
"joda-time" % "joda-time" % jodaTimeVersion,
"org.scalatest" %% "scalatest" % scalaTestVersion % "test",
"com.holdenkarau" %% "spark-testing-base" % sparkTestingBaseVersion % "test",
"org.skyscreamer" % "jsonassert" % jsonAssertVersion % "test"
)
assemblyJarName in assembly := "db-synchronizer.jar"
run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in(Compile, run), runner in(Compile, run))
runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in(Compile, run))
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
// Spark does not support parallel tests and requires JVM fork
parallelExecution in Test := false
fork in Test := true
javaOptions in Test ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
You're correct and it is possible to run a Spark application with Spark 2.1.1 libraries bundled on some Spark 1.6 environments like Hadoop YARN (in CDH or HDP).
The trick is fairly often used in large corporations where the infrastructure team forces development teams to use some older Spark versions only because CDH (YARN) or HDP (YARN) do not support them.
You should use spark-submit from the newer Spark installation (I'd suggest using the latest and greatest 2.1.1 as of this writing) and bundle all Spark jars as part of your Spark application.
Just sbt assembly your Spark application with Spark 2.1.1 (as you specified in build.sbt) and spark-submit the uberjar using the very same version of Spark 2.1.1 to older Spark environments.
As a matter of fact, Hadoop YARN does not treat Spark any differently from any other application library or framework; it pays no special attention to Spark.
That, however, requires a cluster environment (and I just checked: it won't work with Spark Standalone 1.6 when your Spark application uses Spark 2.1.1).
In your case, when you started your Spark application with the local[*] master URL, it was not supposed to work.
cd spark-1.6.0-bin-hadoop2.6/bin/
spark-submit --class App --master local[*] /home/oracle/spark_test/db-synchronizer.jar
There are two reasons for this:
local[*] is fairly constrained by the CLASSPATH, and trying to convince Spark 1.6.0 to run Spark 2.1.1 on the same JVM might take you a fairly long time (if it is possible at all).
You are using an older version to run the more recent 2.1.1; the opposite could work.
Use Hadoop YARN instead, as... well... it pays no attention to Spark and it has already been tested a few times in my projects.
I was wondering how I can know which version of, e.g., spark-core is used at runtime.
Use the web UI and you should see the version in its top-left corner.
You should also consult web UI's Environment tab where you find the configuration of the runtime environment. That's the most authoritative source about the hosting environment of your Spark application.
Near the bottom you should see the Classpath Entries which should give you the CLASSPATH with jars, files and classes.
Use it to find any CLASSPATH-related issues.
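If you prefer checking from code rather than the web UI, a tiny hedged sketch (assuming an active SparkContext sc; sc.version reports the version of the spark-core classes that are actually loaded):
// Which Spark version is actually running (answers "which spark-core is taken at runtime")
println("runtime Spark version: " + sc.version)

// And which physical jar those classes were loaded from - handy for spotting an older assembly on the CLASSPATH
println("spark-core loaded from: " +
  Option(classOf[org.apache.spark.SparkContext].getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString)
    .getOrElse("<unknown>"))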

Spark 2.0.0 streaming job packed with sbt-assembly lacks Scala runtime methods

When using -> in Spark Streaming 2.0.0 jobs, or using spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting it with spark-submit, I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
I put a brief illustration of this phenomenon to a GitHub repo: spark-2-streaming-nosuchmethod-arrowassoc
Putting only "provided" dependencies into build.sbt:
"org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
using -> anywhere in the driver code, packing it with sbt-assembly and submitting the job results in the error. This isn't a big problem by itself, since using ArrowAssoc can be avoided, but spark-streaming-kafka-0-8_2.11 v2.0.0 has it somewhere inside, and generates the same error.
Doing it like so:
wordCounts.map {
  case (w, c) => Map(w -> c)
}.print()
Then
sbt assembly
Then
spark-2.0.0-bin-hadoop2.7/bin/spark-submit \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--master local[2] \
--deploy-mode client \
./target/scala-2.11/spark-2-streaming-nosuchmethod-arrowassoc-assembly-1.0.jar \
localhost 5555
Spark jobs should be packaged without the Scala runtime; if you're building with sbt-assembly, add this setting: assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
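For completeness, a small sketch of how that fits into a build.sbt together with the provided Spark dependencies (hedged: it uses the sbt-assembly 0.14.x key syntax already shown in this thread, and the Kafka connector version is the one from the question):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0" // not provided, so it ends up in the fat jar
)

// Don't bundle the Scala runtime - spark-submit already puts Spark's own Scala on the CLASSPATH
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)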
In my case, I simply had my SPARK_HOME environment variable pointing to Spark 1.6.2. It doesn't matter where you run spark-submit from, but SPARK_HOME should be set properly.

Flink, Kafka and Zookeeper with an URI

I am trying to connect to Kafka from my local machine:
kafkaParams.setProperty("bootstrap.servers", Defaults.BROKER_URL)
kafkaParams.setProperty("metadata.broker.list", Defaults.BROKER_URL)
kafkaParams.setProperty("group.id", "group_id")
kafkaParams.setProperty("auto.offset.reset", "earliest")
Perfectly fine, but my BROKER_URL is defined as follows: my-server.com:1234/my/subdirectory.
I figured out that this kind of path is called a chroot path.
It throws the following error: Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: my-server.com:1234/my/subdirectory
How do I solve this?
These are my dependencies:
val flinkVersion = "1.0.3"
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka-0.9" % flinkVersion,
Just try the host:port format, without the path context and slashes. If you have more than one server, it would be a list: host1:port1,host2:port2
Reference: http://kafka.apache.org/documentation.html
bootstrap.servers should be a comma-separated list of the form address1:port1,address2:port2,...,addressN:portN. If you only have one Kafka broker, you should use something like localhost:9092 (unless you configured Kafka to run on another port).
You may refer to this post from dataArtisans for more details on how to make Flink and Kafka work together.
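To illustrate where the chroot path belongs, a hedged sketch (the host names are placeholders; zookeeper.connect is only relevant to the old ZooKeeper-based consumer settings, while bootstrap.servers must be a plain host:port list):
// Brokers: plain host:port entries, no path component allowed
kafkaParams.setProperty("bootstrap.servers", "my-server.com:9092")

// ZooKeeper (old ZK-based consumer settings only): this is where a chroot suffix is legal
kafkaParams.setProperty("zookeeper.connect", "my-server.com:2181/my/subdirectory")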
Stupid. Zookeeper != Kafka. As you can see in the code, I used the same URL twice, but it turned out that they should be different.
The corrected configuration:
kafkaParams.setProperty("bootstrap.servers", Defaults.KAFKA_URL)
kafkaParams.setProperty("metadata.broker.list", Defaults.ZOOKEEPER_URL)
kafkaParams.setProperty("group.id", "group_id")
kafkaParams.setProperty("auto.offset.reset", "earliest")