Spark Streaming Kafka: ClassNotFoundException for ByteArrayDeserializer when run with spark-submit - scala

I'm new to Scala / Spark Streaming, and to StackOverflow, so please excuse my formatting. I have made a Scala app that reads log files from a Kafka stream. It runs fine within the IDE, but I'll be damned if I can get it to run using spark-submit. It always fails with:
ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
The line referenced in the exception is the .load() call in this snippet:
val records = spark
  .readStream
  .format("kafka")                                 // <-- use KafkaSource
  .option("subscribe", kafkaTopic)
  .option("kafka.bootstrap.servers", kafkaBroker)  // 192.168.4.86:9092
  .load()
  .selectExpr("CAST(value AS STRING) AS temp")
  .withColumn("record", deSerUDF($"temp"))
IDE: IntelliJ
Spark: 2.2.1
Scala: 2.11.8
Kafka: kafka_2.11-0.10.0.0
Relevant parts of pom.xml:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.8</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<spark.version>2.2.1</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.github.scala-incubator.io</groupId>
<artifactId>scala-io-file_2.11</artifactId>
<version>0.4.3-1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.0.0</version>
<!-- version>2.0.0</version -->
</dependency>
Note: I don't think it is related, but I have to use zip -d BroLogSpark.jar "META-INF/*.SF" and zip -d BroLogSpark.jar "META-INF/*.DSA" to get past complaints about the manifest signatures.
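(Aside: that zip -d step is the usual symptom of signed jars being repacked into a fat jar. If the jar is built with the maven-shade-plugin, which is an assumption since the build section isn't shown, the customary fix is to filter the signature files out at package time instead:)
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <filters>
          <filter>
            <!-- strip signature files from every bundled artifact -->
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>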
My jar file does not include any of the org.apache.kafka classes. I have seen several posts that strongly suggest I have a version mismatch, and I have tried countless permutations of changes to pom.xml and spark-submit. After each change, I confirm that it still runs within the IDE, then try spark-submit again on the same system, as the same user. Below is my most recent attempt, where my BroLogSpark.jar is in the current directory and "192.168.4.86:9092 BroFile" are the input arguments.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.1,org.apache.kafka:kafka-clients:0.10.0.0 BroLogSpark.jar 192.168.4.86:9092 BroFile

Add the dependency below too:
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.0.0</version>
</dependency>
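A related sketch: the code in the question uses the Structured Streaming Kafka source (format("kafka")), which is provided by spark-sql-kafka-0-10 and pulls in kafka-clients (and with it ByteArrayDeserializer) transitively, so another option worth trying is to pass that package to spark-submit:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.1 BroLogSpark.jar 192.168.4.86:9092 BroFile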

Related

Embedded Kafka & Spark 2.3 version mismatch issue

When I use this dependency:
<dependency>
<groupId>net.manub</groupId>
<artifactId>scalatest-embedded-kafka_2.11</artifactId>
<version>2.0.0</version>
<scope>test</scope>
</dependency>
With
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.3.0</version>
</dependency>
I run into this error:
Cause: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.reader.SupportsScanUnsafeRow
I'm trying to figure out which version of 'scalatest-embedded-kafka' will work with Spark 2.3.
Any ideas?

What dependencies do I need to run spark.read.csv("...") in a local deployment?

This is the whole project: https://github.com/sryza/aas/blob/master/ch02-intro/src/main/scala/com/cloudera/datascience/intro/RunIntro.scala
but the dependencies aren't given there. I have replaced the "hdfs:////" paths with local paths, but how can I run it?
For example, how do I run:
val spark = SparkSession.builder
  .appName("Intro")
  .getOrCreate
spark.read.csv("C:/test.csv")
The error is crazy: Exception in thread "main" java.lang.NoSuchMethodError: 'boolean scala.util.Properties$.coloredOutputEnabled()'
I'm not able to google a solution.
My pom.xml file is:
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.1</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.12</artifactId>
<version>3.0.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.4</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.4</version>
</dependency>
</dependencies>
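For the "how can I run it" part, a minimal sketch of a plain local run (assuming the app is launched from the IDE or with java rather than spark-submit, so the master must be set in code; this does not by itself cure the coloredOutputEnabled error, which likely indicates mismatched Scala/Spark/ScalaTest artifact versions) would be:
import org.apache.spark.sql.SparkSession

object RunIntroLocal {
  def main(args: Array[String]): Unit = {
    // Explicit local master, since there is no spark-submit to supply one
    val spark = SparkSession.builder
      .appName("Intro")
      .master("local[*]")
      .getOrCreate()

    // Read the local file that replaced the hdfs path
    val df = spark.read.csv("C:/test.csv")
    df.show(5)

    spark.stop()
  }
}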

missing or invalid dependency detected while loading class file 'SQLTestUtilsBase.class'

I am trying to write unit tests for spark scala code and I found this post: How to write unit tests in Spark 2.0+?
However, when I add those dependencies I get this error when compiling:
Error:scalac: missing or invalid dependency detected while loading class file 'SQLTestUtilsBase.class'.
Could not access type PlanTestBase in package org.apache.spark.sql.catalyst.plans,
because it (or its dependencies) are missing. Check your build definition for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
A full rebuild may help if 'SQLTestUtilsBase.class' was compiled against an incompatible version of org.apache.spark.sql.catalyst.plans.
I tried re-running with -Ylog-classpath but it did not help. These are what I think are the relevant pom lines for the Maven build:
<!-- #### SPARK #### -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.11.8</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.11.8</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.11.8</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.11.8</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.11.8</version>
</dependency>
<!-- #### SPARK #### -->
<!-- #### Testing #### -->
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_2.11</artifactId>
<version>3.0.8</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalamock</groupId>
<artifactId>scalamock_2.11</artifactId>
<version>4.1.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scalactic</groupId>
<artifactId>scalactic_2.11</artifactId>
<version>3.0.8</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.11.8</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.11.8</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<!-- #### Testing #### -->
What could be causing this conflict?
It is resolved by adding a spark-catalyst dependency.
The SQLTestUtilsBase trait extends PlanTestBase, which is defined in the spark-catalyst package.
Also, I think you should change the Spark version from 2.11.8 to an actual Spark release from the Maven repo (2.11.8 is a Scala version, not a Spark version).
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.11</artifactId>
<version>2.4.0</version>
<type>test-jar</type>
<scope>test</scope>
</dependency>
see:
https://mvnrepository.com/artifact/org.apache.spark/spark-core
org/apache/spark/sql/test/SQLTestUtils.scala#L222-L226
org/apache/spark/sql/catalyst/plans/PlanTest.scala
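If the Spark test-jars keep causing classpath trouble, a fallback sketch that avoids SQLTestUtilsBase entirely (assuming only scalatest and spark-sql on the test classpath; the suite name is hypothetical) is to manage a local SparkSession from plain ScalaTest:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class LocalSparkSuite extends FunSuite with BeforeAndAfterAll {

  // One local session shared by the tests in this suite
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-test")
    .getOrCreate()

  override def afterAll(): Unit = {
    spark.stop()
  }

  test("count a small DataFrame") {
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("n")
    assert(df.count() === 3)
  }
}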

Getting an error while trying to run a simple spark streaming kafka example

I am trying to run a simple Kafka Spark Streaming example. Here is the error I am getting:
16/10/02 20:45:43 INFO SparkEnv: Registering OutputCommitCoordinator
Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.$scope()Lscala/xml/TopScope$;
at org.apache.spark.ui.jobs.StagePage.<init>(StagePage.scala:44)
at org.apache.spark.ui.jobs.StagesTab.<init>(StagesTab.scala:34)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:62)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:443)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:836)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:84)
at org.apache.spark.streaming.api.java.JavaStreamingContext.<init>(JavaStreamingContext.scala:138)
at com.application.SparkConsumer.App.main(App.java:27)
I set up this example using the following pom. I have tried to track down the missing scala.Predef method, added the missing dependency on spark-streaming-kafka-0-8-assembly, and I can see the class when I explore that jar.
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8-assembly_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
I have tried a simple Spark word count example and it works fine. When I use spark-streaming-kafka, I run into trouble. I have tried to look up this error, but no luck.
Here is the code snippet.
SparkConf sparkConf = new SparkConf().setAppName("someapp").setMaster("local[2]");
// Create the context with a 2-second batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<String, Integer>();
topicMap.put("fast-messages", 1);
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(jssc, "zoo1", "my-consumer-group", topicMap);
There seems to be a problem when I used the 2.11 build of Kafka 0.8.2.0. After switching to the 2.10 build, it worked fine.
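Whichever Kafka build ends up being used, a hedged guard against the _2.10/_2.11 suffixes drifting apart in the pom is to derive them from a single Maven property, for example:
<properties>
  <scala.compat.version>2.11</scala.compat.version>
</properties>
<!-- ... -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_${scala.compat.version}</artifactId>
  <version>0.8.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka-0-8_${scala.compat.version}</artifactId>
  <version>2.0.0</version>
</dependency>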

NoSuchMethodError with SparkContext

I get a NoSuchMethodError when my Spark application executes val sc = new SparkContext("spark://spark01:7077", "Request Executor"). I am compiling my Spark application with version 1.3.1 and Scala version 2.10.4. The Spark cluster is built with 1.3.1 as well, using the same Scala version.
From looking at the Spark source, getTimeAsSeconds does not exist in Utils.scala until Spark 1.4. Why is it attempting to call a method that does not exist in the version I'm using?
Here are the dependencies from my pom.xml:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.4</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>2.10.4</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>2.10.4</version>
</dependency>
<dependency>
<groupId>com.twitter</groupId>
<artifactId>util-eval_2.10</artifactId>
<version>6.26.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>19.0-rc1</version>
</dependency>
<dependency>
<groupId>org.jvnet.jaxb2_commons</groupId>
<artifactId>jaxb2-basics-runtime</artifactId>
<version>0.7.0</version>
</dependency>
<!-- Jackson JSON Library -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-jaxb-annotations</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-xml</artifactId>
</dependency>
<dependency>
<groupId>org.codehaus.woodstox</groupId>
<artifactId>woodstox-core-asl</artifactId>
<version>4.1.4</version>
</dependency>
<dependency>
<groupId>uk.org.simonsite</groupId>
<artifactId>log4j-rolling-appender</artifactId>
<version>20131024-2017</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-server</artifactId>
<version>9.3.1.v20150714</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>9.3.1.v20150714</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-webapp</artifactId>
<version>9.3.1.v20150714</version>
</dependency>
<dependency>
<groupId>org.antlr</groupId>
<artifactId>antlr4-runtime</artifactId>
<version>4.5</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>5.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
<version>5.2.1</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-jdk14</artifactId>
</exclusion>
<exclusion>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-util</artifactId>
</exclusion>
</exclusions>
</dependency>
Is something in my dependencies causing me to compile with Spark 1.4?
Stacktrace:
java.lang.NoSuchMethodError: org.apache.spark.network.util.JavaUtils.timeStringAsSec(Ljava/lang/String;)J
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1027)
at org.apache.spark.SparkConf.getTimeAsSeconds(SparkConf.scala:194)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:68)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:52)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:247)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:155)
at com.scala.analytics.RequestExecutor$.executeRequest(RequestExecutor.scala:23)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:816)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1113)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1047)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
at org.eclipse.jetty.server.Server.handle(Server.java:517)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:302)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:238)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:57)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
at java.lang.Thread.run(Thread.java:745)
It turns out it was a dumb mistake. I am running this application on a different machine, so I copy over my target directory every time I compile. However, I never cleaned my target directory on the remote machine, so I had an old jar with Spark 1.4.0 sitting there, which I used at one point. Every time my application ran, it would look for a Spark jar and use the 1.4.0 jar instead of the 1.3.1 jar that's also in the directory. The solution was simply to delete the old (1.4.0) jar.
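For completeness, a quick way to confirm which Spark version a Maven build actually resolves (separate from whatever jars are sitting on the remote machine) is the dependency tree, e.g.:
mvn dependency:tree -Dincludes=org.apache.spark
If only 1.3.1 artifacts show up there, a stale jar on the deployment side is the remaining suspect.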