I am trying to connect hbase from spark and I want to run scala jar file in spark-submit. Im not sure how to write classes in scala, can any one help
package com.jeevan.sparkhbase
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.HTable;
class InsertData {
def main(arg: Array[String]) {
val conf = HBaseConfiguration.create()
val tableName = "emp"
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val myTable = new HTable(conf, tableName);
var p = new Put(new String("row999").getBytes());
p.add("cf".getBytes(), "column_name".getBytes(), new String("value999").getBytes());
myTable.put(p);
myTable.flushCommits();
}
}
I used maven to build jar and want to execute this jar file with spark-submit. following is the spark-submit command i used to run the jar
spark-submit --class com.jeevan.sparkhbase.InsertData --master local[*] SHIntegration-0.0.1-SNAPSHOT-jar-with-dependencies.jar
I am getting this error
java.lang.ClassNotFoundException: com.jeevan.sparkhbase.InsertData
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:732)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
can someone how to write this above code with class and object. appreciate your help
A couple things could be wrong here including how you packaged your jar.
First, InsertData should be an object not a class.
object InsertData {
def main(arg: Array[String]) {
// stuff
}
}
Second, you aren't actually connecting to Spark anywhere. You'll need to add something like this in your app:
val spark = SparkSession.builder().appName(jobName).master("local[1]").getOrCreate()
Check out my spark-hello-world for a complete example project.
Related
I've compiled a spark-scala script to a JAR and I want to run it with spark-submit. But I'm having this error:
2020-01-07 13:03:02,190 WARN util.Utils: Your hostname, nifi resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
2020-01-07 13:03:02,192 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
2020-01-07 13:03:03,109 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-07 13:03:03,826 WARN deploy.SparkSubmit$$anon$2: Failed to load hello.
java.lang.ClassNotFoundException: hello
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:806)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2020-01-07 13:03:03,857 INFO util.ShutdownHookManager: Shutdown hook called
2020-01-07 13:03:03,858 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a8cc1ba6-3643-4646-82a3-4b44f4487105
This is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object hello {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local")
.setAppName("quest9")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("quest9").master("local").getOrCreate()
import spark.implicits._
val zip_codes = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/zip.csv")
val census = spark.read.format("csv").option("header", "true").load("/home/hdfs/Documents/quest_9/doc/census.csv")
census.createOrReplaceTempView("census")
zip_codes.createOrReplaceTempView("zip")
val query = census.as("census").join((zip_codes.where($"City" === "Inglewood").where($"County" === "Los Angeles").as("zip")),Seq("Zip_Code"),"inner").select($"census.Total_Males".as("male"),$"census.Total_Females".as("female")).distinct()
query.show()
val queryR = query.repartition(5)
queryR.write.parquet("/home/hdfs/Documents/population/census/IDE/census.parquet")
sc.stop()
}
}
I think my problem is that im using scala object instead of a class, but I'm not sure.
I run the spark-submit like this
spark-submit \
--class hello \
/home/hdfs/IdeaProjects/untitled/out/artifacts/quest_jar/quest.jar
Anyone solved this error before?
I think you need to specify a package name for both spark-submit and your object.
For instance :
spark-submit \
--class com.my.package.hello \
/home/hdfs/IdeaProjects/untitled/out/artifacts/quest_jar/quest.jar
and
package com.my.package
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
object hello {
...
}
I have wrote small program in scala using spark 1.6.0 in intellij IDE, jar got build but its throwing error while running.
Please share your inputs to resolve this issue.
[cloudera#quickstart ~]$ spark-submit --master local --class com.sample.sample.sample /home/cloudera/Desktop/IdeaProjects.jar
Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:286)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:239)
at java.util.jar.JarVerifier.processEntry(JarVerifier.java:317)
at java.util.jar.JarVerifier.update(JarVerifier.java:228)
at java.util.jar.JarFile.initializeVerifier(JarFile.java:348)
at java.util.jar.JarFile.getInputStream(JarFile.java:415)
at sun.misc.JarIndex.getJarIndex(JarIndex.java:137)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:674)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:666)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:665)
at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:638)
at sun.misc.URLClassPath$3.run(URLClassPath.java:366)
at sun.misc.URLClassPath$3.run(URLClassPath.java:356)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:355)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:332)
at sun.misc.URLClassPath.getResource(URLClassPath.java:198)
at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:177)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:688)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have tried options provided in existed thread like delete MANIFEST.MF file..etc , But no luck
package com.sample.sample
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object sample {
def main(args: Array[String]): Unit = {
println("Hello World")
val conf = new SparkConf().setAppName("JDBC App").setMaster("local")
val sc = new SparkContext(conf)
// val sc = new SparkConf()
val sqlContext = new SQLContext(sc)
val url ="jdbc:mysql://localhost:3306/retail_db?user=root&password=cloudera"
val df = sqlContext.read.format("jdbc").option("url", url).option("dbtable", "products").load()
df.printSchema()
}
}
Following is the command I am running:
spark-submit --master local --class com.sample.sample.sample /home/cloudera/Desktop/IdeaProjects.jar
expecting this code should work
I'm trying to find trending hashtags on twitter using Spark streaming.
os -> mac os
spark-version -> 2.2.1
scala-version -> 2.11.8
zeppelin-version -> 0.7.3
Above tools and versions I'm using. I have added three different jars to my zeppelin notebook:
twitter4j-core-4.0.4.jar
twitter4j-stream-4.0.4.jar
spark-streaming-twitter_2.10-1.0.0.jar
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming.StreamingContext._
def setUpTwitter() = {
import scala.io.Source
for (line <- Source.fromFile("/Users/abhijeet/Documents/sparkscala/SparkScala/twitter.txt").getLines) {
val fields = line.split(" ")
if (fields.length == 2){
System.setProperty("twitter4j.oauth." + fields(0), fields(1))
}
}
}
setUpTwitter()
val ssc = new StreamingContext(sc, Seconds(1))
def setupLogging() = {
import org.apache.log4j.{Level, Logger}
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
}
setupLogging()
Upto this point in code zeppelin worked fine
val tweets = TwitterUtils.createStream(ssc, None)
In above line it started throwing error
java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.streaming.twitter.TwitterUtils$.createStream(TwitterUtils.scala:44)
... 56 elided
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 69 more
After that I found org.apache.spark.Logging is available in Spark version 1.5.2 or lower version. It is not in the 2.0.0.
My concern is I'm using Zeppelin. I have no pom file here or anything to change the version.
What is the way to resolve this error?
Any help would be greatly appreciated!
Thank you
I found the solution by adding the jar
spark-core_2.11-1.5.2.logging.jar
Had issues with this and found another solution that I thought people might want to try. I had better luck using the org.apache.bahir version 2.40 if you are in a later version of spark (I'm on 3.0).
<!-- https://mvnrepository.com/artifact/org.apache.bahir/spark-streaming-twitter -->
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_2.12</artifactId>
<version>2.4.0</version>
</dependency>
I am new to Spark streaming and trying to run a example from this tutorial and I am following MAKING AND RUNNING OUR OWN NETWORKWORDCOUNT.
I have completed 8th step and made a jar from sbt.
Now I am trying to run deploy my jar using the command in 9th step like this:
bin/spark-submit --class "NetworkWordCount" --master spark://abc:7077 target/scala-2.11/networkcount_2.11-1.0.jar localhost 9999
but when I run this command I get following exception:
java.lang.ClassNotFoundException: NetworkWordCount
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at
java.lang.ClassLoader.loadClass(ClassLoader.java:424) at
java.lang.ClassLoader.loadClass(ClassLoader.java:357) at
java.lang.Class.forName0(Native Method) at
java.lang.Class.forName(Class.java:348) at
org.apache.spark.util.Utils$.classForName(Utils.scala:229) at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:700)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
jar that I have created contains "NetworkWordCount" class having the following code from the spark examples
package src.main.scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
//StreamingExamples.setStreamingLogLevels()
// Create the context with a 1 second batch size
val sparkConf = new SparkConf().setAppName("MyNetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
I am unable to identify what am I doing wrong.
The spark-submit parameter --class takes a fully qualified class name.
In the case of the code above, it should be src.main.scala.NetworkCount
bin/spark-submit --class src.main.scala.NetworkCount --master spark://abc:7077 target/scala-2.11/networkcount_2.11-1.0.jar localhost 9999
Note: the package name used looks like an IDE setup issue. src/main/scala is the typical root for a scala code base, and not a package name.
make sure you have the "target/scala-2.11/networkcount_2.11-1.0.jar" file in your current dir when executing spark-submit
I am trying to run my first program in Spark with scala. Trying to read a csv file and display.
Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import java.io._
import org.apache.spark.SparkContext._
import org.apache.log4j._
object df extends App{
val spark=SparkSession.builder().getOrCreate()
val drf=spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-
Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
Getting the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/29 23:10:53 INFO SparkContext: Running Spark version 2.1.0
17/04/29 23:10:56 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/04/29 23:10:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration at org.apache.spark.SparkContext.<init>
(SparkContext.scala:379)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
at df$.delayedEndpoint$df$1(df.scala:11)
at df$delayedInit$body.apply(df.scala:9)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at df$.main(df.scala:9)
at df.main(df.scala)
Any suggestions would be helpful
You missed .master() function call. for example if you want to run in local mode following is the solution :
object df extends App{
val spark=SparkSession.builder().master("local").getOrCreate()
val drf=spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-
Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
And the error log clearly says that
17/04/29 23:10:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your
configuration at org.apache.spark.SparkContext.<init>
(SparkContext.scala:379)
Hope it helps
As previous comment said you should setup master for your spark context, in your case it should be local[1] or local[*]. Also you should set a appName.
You can avoid master and appName specification via code using spark-submit with keys.
import org.apache.spark.sql.SparkSession
object df extends App{
override def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
val drf = spark.read.csv("C:/Users/admin/Desktop/scala-datasets/Scala-and-Spark-Bootcamp-master/Spark DataFrames/CitiGroup2006_2008")
drf.head(5)
}
}