Java ClassNotFoundException when running spark-submit on a Scala application built with sbt

Here is the code that I wrote in Scala:
package normalisation
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem, Path}
object Seasonality {
  val amplitude_list_c1: Array[Nothing] = Array()
  val amplitude_list_c2: Array[Nothing] = Array()
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Normalization")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val line = "MP"
    val ps = "Test"
    val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line, ps)
    val files = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path(location))
    for (each <- files) {
      var ps_data = sqlContext.read.json(each)
    }
    println(ps_data.show())
  }
}
The error I received when compiling with sbt package is in the linked image.
Here is my build.sbt file:
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"

In Spark 2.0 and later you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
Then you should be able to do:
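// requires: import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder().appName("Normalization").getOrCreate()
// s-interpolation substitutes line and ps; "{0}".format(...) would leave the braces in place
val location = s"hdfs://ipaddress/user/hdfs/$line/ps/$ps/FS/2018-10-17"
spark.read.json(location)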
to read all the JSON files in the directory.
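One detail in the path construction is worth checking: the format method on Scala strings uses printf-style specifiers such as %s, so the {0} and {1} placeholders in the question's location string are left in the path unchanged. A minimal sketch of the difference, using plain strings only (the object and method names are illustrative and not from the original post):

```scala
object PathFormatting {
  // Mirrors the question: {0}/{1} are not format specifiers for String.format,
  // so the extra arguments are ignored and the braces stay in the result.
  def withBraces(line: String, ps: String): String =
    "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line, ps)

  // printf-style placeholders are substituted by format.
  def withPercentS(line: String, ps: String): String =
    "hdfs://ipaddress/user/hdfs/%s/ps/%s/FS/2018-10-17".format(line, ps)

  // String interpolation is the more idiomatic Scala alternative.
  def withInterpolation(line: String, ps: String): String =
    s"hdfs://ipaddress/user/hdfs/$line/ps/$ps/FS/2018-10-17"

  def main(args: Array[String]): Unit = {
    println(withBraces("MP", "Test"))
    println(withPercentS("MP", "Test"))
    println(withInterpolation("MP", "Test"))
  }
}
```

Either of the last two forms yields a path with the line and ps values filled in.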
I think you would also get another compile error at:
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
because ps_data is declared inside the for loop and is out of scope by the time println is called.
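The scoping problem can be reproduced without Spark. A minimal sketch, with plain strings standing in for the file statuses and DataFrames (loadOne is a hypothetical stand-in for sqlContext.read.json): build the results with map so that they remain visible after the loop.

```scala
object LoopScope {
  // Hypothetical stand-in for sqlContext.read.json(path): returns a label instead of a DataFrame.
  def loadOne(path: String): String = s"data($path)"

  // A val or var declared inside the for body is local to each iteration,
  // so nothing after the loop can refer to it. Mapping over the inputs
  // returns the per-file results to the enclosing scope instead.
  def loadAll(paths: Seq[String]): Seq[String] =
    paths.map(loadOne)

  def main(args: Array[String]): Unit = {
    val files = Seq("part-0.json", "part-1.json")
    val psData = loadAll(files)
    psData.foreach(println)
  }
}
```

With real DataFrames, the per-file results could then be combined, or the whole directory could be passed to a single read as in the snippet earlier in this answer.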
If you need to use SparkContext for some reason, it should indeed be in spark-core. Have you tried restarting your IDE, cleaning its caches, etc.?
EDIT: I just noticed that build.sbt is probably not in the directory from which you call sbt package, so sbt won't pick it up.

Spark Streaming: save base64 RDD to JSON on S3

The Scala application below cannot save an RDD in JSON format to S3.
I have:
a Kinesis stream that has complex objects placed on it. Each object has had JSON.stringify() applied to it before being placed on the stream via the Kinesis PutRecord method.
a Scala Spark Streaming job that reads these items off the stream.
I cannot seem to save the RDD records that come off the stream as a JSON file in an S3 bucket.
In the code I've attempted to convert the RDD[Bytes] to RDD[String] and then load it with spark.read.json, but no luck. I've tried various other combinations and can't seem to write the records to S3 in their raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
object ScalaStream {
  def main(args: Array[String]): Unit = {
    val appName = "ScalaStreamExample"
    val batchInterval = Milliseconds(2000)
    val outPath = "s3://xxx-xx--xxx/xxxx/"
    val spark = SparkSession
      .builder()
      .appName(appName)
      .getOrCreate()
    val sparkContext = spark.sparkContext
    val streamingContext = new StreamingContext(sparkContext, batchInterval)
    // Populate the appropriate variables from the given args
    val checkpointAppName = "xxx-xx-xx--xx"
    val streamName = "cc-cc-c--c--cc"
    val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
    val regionName = "cc-xxxx-xxx"
    val initialPosition = new Latest()
    val checkpointInterval = batchInterval
    val storageLevel = StorageLevel.MEMORY_AND_DISK_2
    val kinesisStream = KinesisInputDStream.builder
      .streamingContext(streamingContext)
      .endpointUrl(endpointUrl)
      .regionName(regionName)
      .streamName(streamName)
      .initialPosition(initialPosition)
      .checkpointAppName(checkpointAppName)
      .checkpointInterval(checkpointInterval)
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()
    kinesisStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // This is where I'm trying to save the raw JSON objects to S3 as a JSON file;
        // tried various combinations here but no luck.
        import spark.implicits._ // provides the Encoder[String] needed by createDataset
        val jsonStrings = rdd.map(record => new String(record, "UTF-8")) // convert bytes to string
        val dataFrame = spark.read.json(spark.createDataset(jsonStrings)) // parse the JSON strings
        dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
      }
    }
    // Start the streaming context and await termination
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Any ideas what I'm missing?
So the reason it failed to work was a complete red herring. It turned out to be a Scala version conflict with what is available on EMR.
Many similar questions on SO suggested this might be the issue. Although the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12.4. So I've updated my build.sbt and deploy script from
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
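The build.sbt change above hinges on how sbt resolves cross-built artifacts: with %% sbt appends the project's Scala binary version to the artifact name, while with a single % the suffix has to be written by hand and kept in sync with scalaVersion. A sketch of the two equivalent forms for Scala 2.11, reusing the coordinates from the answer:

```scala
// build.sbt fragment (sketch)
scalaVersion := "2.11.12"

// %% appends the Scala binary version, so this resolves to spark-sql_2.11
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"

// Equivalent explicit form with a single %: the _2.11 suffix must match scalaVersion
// libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.4.4"
```

Using %% throughout means a later change of scalaVersion cannot leave a stale _2.12 or _2.11 suffix behind in the dependency list.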

File not found exception while loading a properties file on a Scala SBT project

I am trying to learn a Scala-Spark JDBC program on IntelliJ IDEA. In order to do that, I have created a Scala SBT Project and the project structure looks like:
Before writing the JDBC connection parameters in the class, I first tried loading a properties file that contains all my connection properties and printing them to check that they load properly, as below:
connection.properties content:
devUserName=username
devPassword=password
gpDriverClass=org.postgresql.Driver
gpDevUrl=jdbc:url
Code:
package com.yearpartition.obj
import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
object PartitionRetrieval {
  var conf = new SparkConf().setAppName("Spark-JDBC")
  val properties = new Properties()
  properties.load(new FileInputStream("connection.properties"))
  val connectionUrl = properties.getProperty("gpDevUrl")
  val devUserName = properties.getProperty("devUserName")
  val devPassword = properties.getProperty("devPassword")
  val gpDriverClass = properties.getProperty("gpDriverClass")
  println("connectionUrl: " + connectionUrl)
  Class.forName(gpDriverClass).newInstance()
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().config(conf).master("local[2]").getOrCreate()
    println("connectionUrl: " + connectionUrl)
  }
}
Content of build.sbt:
name := "YearPartition"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= {
  val sparkCoreVer = "2.2.0"
  val sparkSqlVer = "2.2.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkCoreVer % "provided" withSources(),
    "org.apache.spark" %% "spark-sql" % sparkSqlVer % "provided" withSources(),
    "org.json4s" %% "json4s-jackson" % "3.2.11" % "provided",
    "org.apache.httpcomponents" % "httpclient" % "4.5.3"
  )
}
Since I am not writing or saving data to any file and am only trying to display the values from the properties file, I executed the code as follows:
SPARK_MAJOR_VERSION=2 spark-submit --class com.yearpartition.obj.PartitionRetrieval yearpartition_2.11-0.1.jar
But I am getting a file not found exception, as below:
Caused by: java.io.FileNotFoundException: connection.properties (No such file or directory)
I tried to fix it in vain. Could anyone let me know what mistake I am making here and how I can correct it?
You must use the full path of your connection.properties file (file:///full_path/connection.properties). With this option, when you submit the job to a cluster and want to read the file from local disk, you must save connection.properties at the same path on every server in the cluster. Alternatively, you can read the file from HDFS. Here is a small example of reading a file on HDFS:
@throws[java.io.IOException]
def readFileFromHdfs(file: String): org.apache.hadoop.fs.FSDataInputStream = {
  val conf = new org.apache.hadoop.conf.Configuration
  // HDFS_HOST is a placeholder for your cluster's HDFS address
  conf.set("fs.default.name", "HDFS_HOST")
  val fileSystem = org.apache.hadoop.fs.FileSystem.get(conf)
  val path = new org.apache.hadoop.fs.Path(file)
  if (!fileSystem.exists(path)) {
    println("File (" + path + ") does not exist.")
    null // callers must check for null before using the stream
  } else {
    // Open the file and return the input stream
    fileSystem.open(path)
  }
}
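For the local-file option, here is a minimal sketch using only the JDK (PropertiesLoader is a hypothetical helper name, not part of the original code). Note that FileInputStream and java.nio paths take a plain filesystem path such as /full_path/connection.properties rather than a file:// URI, and that returning an Option avoids handing null back to the caller:

```scala
import java.nio.file.{Files, Paths}
import java.util.Properties

object PropertiesLoader {
  // Loads a properties file from a filesystem path.
  // Returns None when the path does not point to a regular file.
  def load(path: String): Option[Properties] = {
    val p = Paths.get(path)
    if (!Files.isRegularFile(p)) None
    else {
      val in = Files.newInputStream(p)
      try {
        val props = new Properties()
        props.load(in)
        Some(props)
      } finally {
        in.close() // always release the file handle
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // The path is expected as the first program argument (an assumption of this sketch).
    args.headOption.flatMap(load) match {
      case Some(props) => println("connectionUrl: " + props.getProperty("gpDevUrl"))
      case None        => println("Properties file not found or no path given")
    }
  }
}
```

With spark-submit, the --files option can also ship a local file to the working directory of each executor. Whether a relative path then resolves on the driver depends on the deploy mode, so that is something to verify on your own cluster.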

sc.textFile cannot resolve symbol at textFile

I am using IntelliJ for Scala.
Below is my code:
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object XXXX extends App {
  val sc = new SparkConf()
  val DataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()
}
My build.sbt file contains the following:
name := "XXXXX"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided"
)
The packages are downloaded, but I still see the error "cannot resolve symbol at textFile". Am I missing any library dependencies?
You haven't initialised your SparkContext: sc is a SparkConf, which has no textFile method. Create a SparkContext from the configuration:
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val dataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()

SBT package scala script

I am trying to use spark-submit with a Scala script, but first I need to create my package.
Here is my sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
When I try sbt package, I get these errors:
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:3: object functions is not a member of package org.apache.spark.sql
import org.apache.spark.sql.functions._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:4: object types is not a member of package org.apache.spark.sql
import org.apache.spark.sql.types._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:25: not found: value sc
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:30: not found: value sqlContext
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:36: not found: value udf
val sqlfunc = udf(coder)
^
5 errors found
(compile:compileIncremental) Compilation failed
Has anyone faced these errors?
Thanks for helping.
Regards
Majid
You are trying to use the object org.apache.spark.sql.functions and the package org.apache.spark.sql.types. According to the documentation, functions is available starting from version 1.3.0, and the types package is available since version 1.3.1.
Solution: update the SBT file to:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"
The other errors ("not found: value sc", "not found: value sqlContext", "not found: value udf") are caused by missing variables in your XML_Script_SBT.scala file. They can't be solved without looking at the source code.
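One further point worth checking, offered as an assumption about the setup rather than something stated in the thread: the build mixes spark-core 1.5.2 with a different spark-sql version, and the error output shows a call to sqlContext.read, which (to my knowledge) was only added in Spark 1.4. Keeping all Spark modules on the same version avoids both issues. A sketch of such a build.sbt:

```scala
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// Keep every Spark module on one version (sparkVersion is an illustrative helper name)
val sparkVersion = "1.5.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)
```

The xml data source that appears in the error output is provided by a separate package, which would also need to be on the classpath; its coordinates and version are left out here because the thread does not state them.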
Thanks Sergey, your correction fixes 3 errors. Below is my script:
object SimpleApp {
  def main(args: Array[String]) {
    val today = Calendar.getInstance.getTime
    val curTimeFormat = new SimpleDateFormat("yyyyMMdd-HHmmss")
    val time = curTimeFormat.format(today)
    val destination = "/3.Data/3.Code_Check_Processed/2.XML/" + time + ".extensive.csv"
    val source = "/3.Data/2.Code_Check_Raw/2.XML/Extensive/"
    val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
    val hdfs = FileSystem.get(hconf)
    val iter = hdfs.listLocatedStatus(new Path(source))
    val uri = iter.next.getPath.toUri
    val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
    val df2 = df.selectExpr("explode(check) as e").select("e.#VALUE","e.additionalInformation1","e.columnNumber","e.context","e.errorType","e.filePath","e.lineNumber","e.message","e.severity")
    val coder: (Long => String) = (arg: Long) => {if (arg > -1) time else "nada"}
    val sqlfunc = udf(coder)
    val df3 = df2.withColumn("TimeStamp", sqlfunc(col("columnNumber")))
    df3.write.format("com.databricks.spark.csv").option("header", "false").save(destination)
    hdfs.delete(new Path(uri.toString()), true)
    sys.exit(0)
  }
}

Analyze Twitter data with Spark

Can anyone help me analyze Twitter data based on whatever keys I specify? I found this code, but it gives me an error.
import java.io.File
import com.google.gson.Gson
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Collect at least the specified number of tweets into json text files.
 */
object Collect {
  private var numTweetsCollected = 0L
  private var partNum = 0
  private var gson = new Gson()
  def main(args: Array[String]) {
    // Process program arguments and set properties
    if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        "<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
      System.exit(1)
    }
    val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
      Utils.parseCommandLineWithTwitterCredentials(args)
    val outputDir = new File(outputDirectory.toString)
    if (outputDir.exists()) {
      System.err.println("ERROR - %s already exists: delete or specify another directory".format(
        outputDirectory))
      System.exit(1)
    }
    outputDir.mkdirs()
    println("Initializing Streaming Spark Context...")
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(intervalSecs))
    val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
      .map(gson.toJson(_))
    tweetStream.foreachRDD((rdd, time) => {
      val count = rdd.count()
      if (count > 0) {
        val outputRDD = rdd.repartition(partitionsEachInterval)
        outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
        numTweetsCollected += count
        if (numTweetsCollected > numTweetsToCollect) {
          System.exit(0)
        }
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
The error is:
object gson is not a member of package com.google
If you know of any link about this or how to fix the problem, could you share it with me? I want to analyze Twitter data with Spark.
Thanks. :)
As Peter pointed out, you are missing the gson dependency, so you'll need to add the following dependency to your build.sbt:
libraryDependencies += "com.google.code.gson" % "gson" % "2.4"
You can also define all the dependencies in one sequence:
libraryDependencies ++= Seq(
  "com.google.code.gson" % "gson" % "2.4",
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-streaming" % "1.2.0",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"
)
Bonus: if other dependencies are missing, you can search for them on http://mvnrepository.com/, and if you need to find the jar or dependency that contains a given class, you can also use the findjar website.