Why does Scala compiler fail with "object SparkConf in package spark cannot be accessed in package org.apache.spark"? - scala

I cannot access SparkConf in the package, even though I have already imported org.apache.spark.SparkConf. My code is:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
object SparkStreaming {
def main(arg: Array[String]) = {
val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext( conf, Seconds(1) )
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs_new = words.map( w => (w, 1) )
val wordsCount = pairs_new.reduceByKey(_ + _)
wordsCount.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
The sbt dependencies are:
name := "Spark Streaming"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
"org.apache.spark" %% "spark-mllib" % "1.5.2",
"org.apache.spark" %% "spark-streaming" % "1.5.2"
)
But the error shows that SparkConf cannot be accessed.
[error] /home/cliu/Documents/github/Spark-Streaming/src/main/scala/Spark-Streaming.scala:31: object SparkConf in package spark cannot be accessed in package org.apache.spark
[error] val conf = new SparkConf.setMaster("local[2]").setAppName("NetworkWordCount")
[error] ^

It compiles if you add parentheses after SparkConf:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
The point is that SparkConf is a class and not a function, so the class name can also be used for scoping purposes. When you add parentheses after the class name, you make sure you are calling the class constructor and not using the name as a scope. Here is an example from the Scala shell illustrating the difference:
scala> class C1 { var age = 0; def setAge(a:Int) = {age = a}}
defined class C1
scala> new C1
res18: C1 = $iwC$$iwC$C1@2d33c200
scala> new C1()
res19: C1 = $iwC$$iwC$C1@30822879
scala> new C1.setAge(30) // this doesn't work
<console>:23: error: not found: value C1
new C1.setAge(30)
^
scala> new C1().setAge(30) // this works
scala>

In this case you cannot omit parentheses so it should be:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

Related

SharedSparkSession is not working in Spark MemoryStream scala testing

I have tried to write a Spark MemoryStream unit test case, but SharedSparkSession is not importing in my test program.
import org.apache.spark.sql.test.SharedSparkSession
class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {
....
}
My build.sbt configuration is below:
scalaVersion := "2.12.0"
val sparkVersion = "3.0.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.5" % "test"
libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"**
Do I need to add any other dependency artifacts, or are any scalatest version changes required?
The program below gets an import error for SharedSparkSession.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.test.SharedSparkSession
class MemoryStreamTest extends AnyFunSuite with SharedSparkSession {
test("spark structured streaming can read from memory socket") {
// We can import sql implicits
implicit val sqlCtx = sparkSession.sqlContext
import sqlImplicits._
val events = MemoryStream[String]
val queryName: String = "calleventaggs"
// Add events to MemoryStream as if they came from Kafka
val batch = Seq(
"this is a value to read",
"and this is another value"
)
val currentOffset = events.addData(batch)
val streamingQuery = StreamingDataFrames.writeData(events.toDF(), "memory", queryName)
streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])
val result: DataFrame = sparkSession.table(queryName)
result.show
streamingQuery.awaitTermination(1000L)
assertResult(batch.size)(result.count)
val values = result.take(2)
assertResult(batch(0))(values(0).getString(0))
assertResult(batch(1))(values(1).getString(0))
}
}
SharedSparkSession is an internal test utility of the Apache Spark project and is not accessible through the packages you have listed in your sbt file.
The ScalaDocs do not mention SharedSparkSession.
You will see that the trait SharedSparkSession extends SQLTestUtils, which is another testing utility.
For your unit tests it is usually sufficient to just create a local SparkSession.
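As a rough sketch of that approach (the class name, test name, and local master here are my own placeholders, not from your project), something like this is often enough; the fuller working code below builds on the same idea:
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class LocalSparkSessionSuite extends AnyFunSuite with BeforeAndAfterAll {
  @transient var spark: SparkSession = _

  override def beforeAll(): Unit = {
    // A plain local SparkSession is enough for most unit tests
    spark = SparkSession.builder()
      .appName("LocalSparkSessionSuite")
      .master("local[*]")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    spark.stop()
  }

  test("a trivial DataFrame query runs on the local session") {
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("value")
    assert(df.count() === 3L)
  }
}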
See the working code below:
import module.JsValueToString
import org.apache.log4j.{Level, Logger}
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions.{col, concat, current_timestamp, date_format, from_json, from_unixtime, from_utc_timestamp, lit, regexp_replace, sha2, struct, to_json, to_utc_timestamp, udf}
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.scalatest.BeforeAndAfterAll
import scala.io.Source
import org.apache.log4j.Logger
import org.apache.log4j.Level
class KafkaProducerFlattenerTestCase extends AnyFunSuite with BeforeAndAfterAll {
Logger.getLogger("org").setLevel(Level.ERROR)
@transient var spark : SparkSession = _
override def beforeAll(): Unit = {
spark = SparkSession
.builder()
.appName("KafkaProducerFlattenerTestCase")
.master("local[*]")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
}
override def afterAll(): Unit = {
spark.stop()
}
test("MemoryStream testcase for Flattener JSON") {
implicit val sparkSession: SparkSession = spark
implicit val ctx = spark.sqlContext
import sparkSession.implicits._
val input = MemoryStream[String]
val valueDf = input.toDS().selectExpr("CAST(value As STRING)")
val df2 = (valueDf.select(to_json(struct(col("*"))).alias("content")))
df2.printSchema()
print(" Before Write Stream")
val formatName = ("memory")
val query = df2.writeStream
.queryName("testCustomSinkBasic")
.format(formatName)
.start()
val jasonContent = readJson()
input.addData(jasonContent)
assert(query.isActive === true)
query.processAllAvailable()
assert(query.exception === None)
print("query....... "+query.runId)
val eventName = spark.sql("select * from testCustomSinkBasic")
val actualValString = JsValueToString(eventName)
println("actualValString..... "+actualValString)
assert(actualValString === expectValue())
}
def readJson(): String ={
val fileContents = Source.fromFile("src/resources/Json.txt").getLines().mkString
fileContents
}
def expectValue(): String = {
val expectVal = """{"publishTime":"123","name[0].lastname":"def","name[0].fname":"abc","name[1].lastname":"jkl","name[1].fname":"ghi","lpid":"1234"}"""
expectVal
}
}
The class the test is expected to cover:
import com.usb.transformation.JsFlattener
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, struct, to_json}
import play.api.libs.json.{JsObject, JsValue, Json}
object JsValueToString extends Serializable{
var df3 : String = null
def apply(eventName : DataFrame): (String) ={
eventName.foreach(x => {
val content = x.getAs[String]("content").replace("\\", "")
val subStr = content.substring(10, content.length()-2)
println("content ...."+content)
println("subString "+subStr)
val str2Json: JsValue = Json.parse(subStr)
df3 = JsFlattener(str2Json).as[JsObject].toString
println("df3 Value......"+df3)
})
df3
}
}

Java Class not Found Exception while doing Spark-submit Scala using sbt

Here is the code I wrote in Scala:
package normalisation
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem,Path}
object Seasonality {
val amplitude_list_c1: Array[Nothing] = Array()
val amplitude_list_c2: Array[Nothing] = Array()
def main(args: Array[String]){
val conf = new SparkConf().setAppName("Normalization")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val line = "MP"
val ps = "Test"
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
val files = FileSystem.get(sc.hadoopConfiguration ).listStatus(new Path(location))
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
}
The error I received when compiling with sbt package is shown in the attached image.
Here is my build.sbt file
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
In Spark versions 2 and above you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
Then you should also be able to do
val spark:SparkSession = ???
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
spark.read.json(location)
to read all your json files in the directory.
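In case it helps, a minimal sketch of how that SparkSession could be obtained (the local master is only for running outside spark-submit; the app name is taken from your code):
import org.apache.spark.sql.SparkSession

// Build the session; when submitting with spark-submit the master is
// usually supplied on the command line instead of hard-coded here.
val spark: SparkSession = SparkSession.builder()
  .appName("Normalization")
  .master("local[*]")
  .getOrCreate()

val psData = spark.read.json(location) // reads every JSON file under that directory
psData.show()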
I think you'd also get another compile error at
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
because ps_data is out of scope there.
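One way to restructure that block so the value stays in scope, sketched on the assumption that you just want to show each file as it is read (note that read.json expects a path string rather than a FileStatus):
for (each <- files) {
  // each is a FileStatus, so pass its path as a string to the reader
  val psData = sqlContext.read.json(each.getPath.toString)
  psData.show()
}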
If you need to use SparkContext for some reason, it should indeed be in spark-core. Have you tried restarting your IDE, cleaning caches, etc.?
EDIT: I just noticed that build.sbt is probably not in the directory you call sbt package from, so sbt won't pick it up.

java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame error when running Scala MongoDB connector

I am trying to run a Scala example with SBT to read data from MongoDB. I am getting this error whenever I try to access the data read from Mongo into the RDD.
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethod(Class.java:2128)
at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1431)
at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:494)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
I have imported DataFrame explicitly, even though it is not used in my code. Can anyone help with this issue?
My code:
package stream
import org.apache.spark._
import org.apache.spark.SparkContext._
import com.mongodb.spark._
import com.mongodb.spark.config._
import com.mongodb.spark.rdd.MongoRDD
import org.bson.Document
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame
object SpaceWalk {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("SpaceWalk")
.setMaster("local[*]")
.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/nasa.eva")
.set("spark.mongodb.output.uri", "mongodb://127.0.0.1/nasa.astronautTotals")
val sc = new SparkContext(sparkConf)
val rdd = sc.loadFromMongoDB()
def breakoutCrew ( document: Document ): List[(String,Int)] = {
println("INPUT"+document.get( "Duration").asInstanceOf[String])
var minutes = 0;
val timeString = document.get( "Duration").asInstanceOf[String]
if( timeString != null && !timeString.isEmpty ) {
val time = document.get( "Duration").asInstanceOf[String].split( ":" )
minutes = time(0).toInt * 60 + time(1).toInt
}
import scala.util.matching.Regex
val pattern = new Regex("(\\w+\\s\\w+)")
val names = pattern findAllIn document.get( "Crew" ).asInstanceOf[String]
var tuples : List[(String,Int)] = List()
for ( name <- names ) { tuples = tuples :+ (( name, minutes ) ) }
return tuples
}
val logs = rdd.flatMap( breakoutCrew ).reduceByKey( (m1: Int, m2: Int) => ( m1 + m2 ) )
//logs.foreach(println)
def mapToDocument( tuple: (String, Int ) ): Document = {
val doc = new Document();
doc.put( "name", tuple._1 )
doc.put( "minutes", tuple._2 )
return doc
}
val writeConf = WriteConfig(sc)
val writeConfig = WriteConfig(Map("collection" -> "astronautTotals", "writeConcern.w" -> "majority", "db" -> "nasa"), Some(writeConf))
logs.map( mapToDocument ).saveToMongoDB( writeConfig )
import org.apache.spark.sql.SQLContext
import com.mongodb.spark.sql._
import org.apache.spark.sql.DataFrame
// load the first dataframe "EVAs"
val sqlContext = new SQLContext(sc);
import sqlContext.implicits._
val evadf = sqlContext.read.mongo()
evadf.printSchema()
evadf.registerTempTable("evas")
// load the 2nd dataframe "astronautTotals"
val astronautDF = sqlContext.read.option("collection", "astronautTotals").mongo[astronautTotal]()
astronautDF.printSchema()
astronautDF.registerTempTable("astronautTotals")
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes FROM astronautTotals" ).show()
sqlContext.sql("SELECT astronautTotals.name, astronautTotals.minutes, evas.Vehicle, evas.Duration FROM " +
"astronautTotals JOIN evas ON astronautTotals.name LIKE evas.Crew" ).show()
}
}
case class astronautTotal ( name: String, minutes: Integer )
This is my sbt file -
name := "Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"
//libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.1"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "0.1"
addCommandAlias("c1", "run-main stream.SaveTweets")
addCommandAlias("c2", "run-main stream.SpaceWalk")
outputStrategy := Some(StdoutOutput)
//outputStrategy := Some(LoggedOutput(log: Logger))
fork in run := true
This error occurs because you are using an incompatible connector library that only supports Spark 1.x. You should use mongo-spark-connector 2.0.0+ instead. See: https://docs.mongodb.com/spark-connector/v2.0/
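For example, the dependency line could become something like the following (2.0.0 is just the minimum; pick the connector release that matches your Spark and Scala versions):
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.0.0"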

sc.textFile cannot resolve symbol at textFile

I am using IntelliJ for Scala.
Below is my code:
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object XXXX extends App{
val sc = new SparkConf()
val DataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()
}
My build.sbt file contains the following:
name := "XXXXX"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.5.2" %"provided",
"org.apache.spark" %% "spark-sql" % "1.5.2" % "provided")
Packages are downloaded, but I still see the error "cannot resolve symbol textFile". Am I missing any library dependencies?
You haven't initialised your SparkContext; the sc in your code is a SparkConf, which has no textFile method.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val dataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()

SBT package scala script

I am trying to use spark-submit with a Scala script, but first I need to create my package.
Here is my sbt file:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
When I try sbt package, I am getting these errors:
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:3: object functions is not a member of package org.apache.spark.sql
import org.apache.spark.sql.functions._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:4: object types is not a member of package org.apache.spark.sql
import org.apache.spark.sql.types._
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:25: not found: value sc
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:30: not found: value sqlContext
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
^
/home/i329537/Scripts/PandI/SBT/src/main/scala/XML_Script_SBT.scala:36: not found: value udf
val sqlfunc = udf(coder)
^
5 errors found
(compile:compileIncremental) Compilation failed
Has anyone faced these errors?
Thanks for helping.
Regards
Majid
You are trying to use the class org.apache.spark.sql.functions and the package org.apache.spark.sql.types. According to the functions class documentation it is available starting from version 1.3.0, and the types package is available since version 1.3.1.
Solution: update SBT file to:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"
Other errors: "not found: value sc", "not found: value sqlContext", "not found: value udf" are caused by some missing varibales in your XML_Script_SBT.scala file. Can't solve without looking into source code.
Thanks Sergey, your correction fixes 3 of the errors. Below is my script:
object SimpleApp {
def main(args: Array[String]) {
val today = Calendar.getInstance.getTime
val curTimeFormat = new SimpleDateFormat("yyyyMMdd-HHmmss")
val time = curTimeFormat.format(today)
val destination = "/3.Data/3.Code_Check_Processed/2.XML/" + time + ".extensive.csv"
val source = "/3.Data/2.Code_Check_Raw/2.XML/Extensive/"
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val iter = hdfs.listLocatedStatus(new Path(source))
val uri = iter.next.getPath.toUri
val df = sqlContext.read.format("xml").option("attributePrefix","").option("rowTag", "project").load(uri.toString())
val df2 = df.selectExpr("explode(check) as e").select("e.#VALUE","e.additionalInformation1","e.columnNumber","e.context","e.errorType","e.filePath","e.lineNumber","e.message","e.severity")
val coder: (Long => String) = (arg: Long) => {if (arg > -1) time else "nada"}
val sqlfunc = udf(coder)
val df3 = df2.withColumn("TimeStamp", sqlfunc(col("columnNumber")))
df3.write.format("com.databricks.spark.csv").option("header", "false").save(destination)
hdfs.delete(new Path(uri.toString()), true)
sys.exit(0)
}
}
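For what it's worth, the remaining "not found" errors refer to values the script never defines: sc and sqlContext have to be created inside main, and udf resolves once import org.apache.spark.sql.functions._ compiles against the newer spark-sql. A minimal sketch of that setup, using the object name as a placeholder app name:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Define the values the compiler reports as missing, near the top of main
val sparkConf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)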