Spark submit error: java.lang.ClassNotFoundException: DirectKafkaWordCount - scala

I want to running a spark streaming example DirectKafkaWordCount.
This is my directory structure:
root#sandbox:/usr/local/spark/test# find
.
./src
./src/main
./src/main/scala
./src/main/scala/DirectKafkaWordCount.scala
./simple.sbt
sbt package is done, everything is ok.
........
[info] Done updating.
[info] Compiling 1 Scala source to /usr/local/spark-1.6.0-bin-hadoop2.6/test/target/scala-2.10.5/classes...
[info] Packaging /usr/local/spark-1.6.0-bin-hadoop2.6/test/target/scala-2.10.5/direct-kafka-word-count_2.10.5-1.0.jar ...
[info] Done packaging.
[success] Total time: 60 s, completed May 12, 2016 1:34:04 AM
but errors found when I run the spark-submit:
root#sandbox:/usr/local/spark# bin/spark-submit --class DirectKafkaWordCount --master local[4] test/target/scala-2.10.5/direct-kafka-word-count_2.10.5-1.0.jar
java.lang.ClassNotFoundException: DirectKafkaWordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am new in Spark, hope someone can help me.

->Actually I think you should include the groupId and artifactId of your class when you run the
spark-submit command like:
spark-submit --class com.balabala.spark.DirectKafkaWordCount --master local[4] test/target/scala-2.10.5/direct-kafka-word-count_2.10.5-1.0.jar
Then spark should be able to find your class.

Related

Can not read avro in DataProc Spark with spark-avro

I have a cluster on Google DataProc (with image 1.4) and I want to read avro files with Spark from google cloud storage. I follow this guide: Spark read avro.
The command I ran is:
gcloud dataproc jobs submit pyspark test.py \
--cluster $CLUSTER_NAME \
--region $REGION \
--properties spark.jars.packages='org.apache.spark:spark-avro_2.12:2.4.1'
test.py is very simple, just
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.read.format("avro").load("gs://mybucket/abc.avro")
df.show()
I got the following error:
Py4JJavaError: An error occurred while calling o196.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 24 more
Even if I ssh to master node and start the shell there with spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.1, running val usersDF = spark.read.format("avro").load("gs://mybucket/abc.avro") has the same error.
Why this happens? Thank you.
Dataproc 1.4 uses Spark 2.4.0, not Spark 2.4.1, which normally wouldn't be expected to be a problem, but whereas Spark 2.4.0 uses Scala 2.11, Spark 2.4.1 uses Scala 2.12.
You can also see the avro artifact on a Dataproc cluster under /usr/lib/spark/external:
$ ls -l /usr/lib/spark/external
total 13656
-rw-r--r-- 1 root root 187385 Mar 6 23:25 spark-avro_2.11-2.4.0.jar
...
So you just need to use:
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0

Issue when writing to Hbase from spark scala scheduled in oozie

Hi From our spark scala app, we are connecting to hbase and writing. When we run the jar through spark-submit it works like a charm.
<action name="spark-action">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>Hbase-Test</name>
<class>org.sample.ConnectorTest</class>
<jar>hdfs://nameservice1/app/MyhbaseConnector.jar</jar>
<spark-opts>--jars ${sparkLib} --files ${files} --driver-class-path ${driverClassPath}
</spark-opts>
<arg>testValue</arg>
</spark>
<ok to="mail"/>
<error to="Kill"/>
</action>
But when the same is scheduled in oozie workflow in a spark-action we are getting the below exception.
We are also passing some spark opts to the action.
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
Download:
https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar
And run:
spark-submit --jars ./spark-core_2.11-1.5.2.logging.jar ...
That is because org.apache.spark.Logging had been canceled at spark 1.6+

how to setup spark environment?

I am new to both scala and spark and I am using Intellij for running spark application.
It is an helloworld to spark using scala.
I got the code from GitHub
and I am getting these errors even after doing setup for spark using maven.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/02 22:31:22 INFO SparkContext: Running Spark version 1.6.0
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:99)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:192)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2136)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
at HelloWorld$.main(HelloWorld.scala:14)
at HelloWorld.main(HelloWorld.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 16 more
I know it is a very simple error, but I tried almost every web-link and still not able to get the solution.

Spark ClassNotFoundException when run on yarn-cluster

my code:
import org.apache.spark.{SparkConf, SparkContext}
object Run extends App {
val conf = new SparkConf().setMaster("yarn-cluster").setAppName("t666")
sc.addJar("hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar")
val sc = new SparkContext(conf)
val b = scalaj.http.Base64.encodeString("刘")
val a = Array[String](b)
sc.parallelize(a).saveAsTextFile("hdfs://10.1.11.99:8020/testdata/t2/")
}
and my submit commend is:
spark-submit --master yarn-cluster --class start.Run run.jar
the log on yarn show:
16/11/04 13:50:01 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
16/11/04 13:50:01 INFO spark.SparkContext: Added JAR hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar at hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar with timestamp 1478238601256
16/11/04 13:50:01 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://YarnAM#192.168.3.49:53976)
16/11/04 13:50:01 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.NoClassDefFoundError: scalaj/http/Base64
java.lang.NoClassDefFoundError: scalaj/http/Base64
at start.Run$delayedInit$body.apply(Run.scala:31)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at start.Run$.main(Run.scala:9)
at start.Run.main(Run.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
Caused by: java.lang.ClassNotFoundException: scalaj.http.Base64
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 15 more
16/11/04 13:50:01 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NoClassDefFoundError: scalaj/http/Base64)
16/11/04 13:50:01 INFO client.RMProxy: Connecting to ResourceManager at slave3/192.168.3.48:8030
16/11/04 13:50:01 INFO yarn.YarnRMClient: Registering the ApplicationMaster
16/11/04 13:50:01 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
16/11/04 13:50:01 INFO spark.SparkContext: Invoking stop() from shutdown hook
the 2nd line show:
INFO spark.SparkContext: Added JAR hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar at hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar with timestamp 1478238601256
it seems already add the jar file into my classpath,but this exception i can't explain.
anyone's answer will be help me a lot!
I believe SparkContext.addJar only adds the JAR to the classpath of the workers, and not the driver. Try adding the JAR using the --jars option in the spark-submit command:
spark-submit --master yarn \
--deploy-mode cluster \
--jars hdfs://10.1.11.99:8020/user/spark/share/scalaj-http_2.10-2.3.0.jar \
--class start.Run run.jar

Scalding tutorial: java.lang.ClassNotFoundException

Please help to run Scalding tutorial.
I have Hadoop 2.2 running on a single node and trying to run Scalding tutorial:
https://github.com/Cascading/scalding-tutorial/
After successfuly buiding 'fat jar' with these commands:
$ git clone git://github.com/Cascading/scalding-tutorial.git
$ cd scalding-tutorial
$ sbt assembly
I try to run tutorial examples as suggested with this command:
$ yarn jar target/scalding-tutorial-0.8.11.jar <TutorialPart> --local <addtional arguments>
Both --local and --hdfs fail with java.lang.ClassNotFoundException:
$ yarn jar target/scala-2.9.3/scalding-assembly-0.10.0.jar 1 --local
Exception in thread "main" java.lang.ClassNotFoundException: 1
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
$ yarn jar target/scala-2.9.3/scalding-assembly-0.10.0.jar 1 --hdfs
Exception in thread "main" java.lang.ClassNotFoundException: 1
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
Update
Changing command argument to 'Tutorial1', 'Tutorial0' does not help either:
$ yarn jar target/scala-2.9.3/scalding-assembly-0.10.0.jar Tutorial1 --local
Exception in thread "main" java.lang.ClassNotFoundException: Tutorial1
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
$ yarn jar target/scala-2.9.3/scalding-assembly-0.10.0.jar Tutorial0 --local
Exception in thread "main" java.lang.ClassNotFoundException: Tutorial0
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
You are passing a wrong name for the main class, that is why it can't find it. It should be Tutorial1 instead of just 1. You can see the error in the stack trace:
Exception in thread "main" java.lang.ClassNotFoundException: 1
There is no class called 1. Try:
$ yarn jar target/scala-2.9.3/scalding-assembly-0.10.0.jar Tutorial1 --local
EDIT: it works just fine to me with this command:
$ yarn jar target/scalding-tutorial-0.8.11.jar Tutorial1 --local