sc.textFile cannot resolve symbol at textFile - scala

I am using IntelliJ for Scala. Below is my code:
import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object XXXX extends App {
  val sc = new SparkConf()
  val DataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()
}
My build.sbt file contains the following:
name := "XXXXX"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.5.2" %"provided",
"org.apache.spark" %% "spark-sql" % "1.5.2" % "provided")
The packages are downloaded, but I still see the error "cannot resolve symbol textFile". Am I missing any library dependencies?

You haven't initialised your SparkContext.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val dataRDD = sc.textFile("/Users/itru/Desktop/XXXX.bz2").cache()

Related

spark streaming save base64 rdd to json on s3

The Scala application below cannot save an RDD in JSON format to S3.
I have:
A Kinesis stream that has complex objects placed on it. Each object has JSON.stringify() applied to it before being placed on the stream as part of the Kinesis PutRecord method.
A Scala Spark Streaming job that reads these items off the stream.
I cannot seem to save the RDD records that come off the stream as JSON files in an S3 bucket.
In the code I've attempted to convert the RDD[Array[Byte]] to RDD[String] and then load it with spark.read.json, but no luck. I've tried various other combinations and can't seem to output the JSON onto S3 in its raw format.
import org.apache.spark._
import org.apache.spark.sql._
import java.util.Base64
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest

object ScalaStream {
  def main(args: Array[String]): Unit = {
    val appName = "ScalaStreamExample"
    val batchInterval = Milliseconds(2000)
    val outPath = "s3://xxx-xx--xxx/xxxx/"

    val spark = SparkSession
      .builder()
      .appName(appName)
      .getOrCreate()

    val sparkContext = spark.sparkContext
    val streamingContext = new StreamingContext(sparkContext, batchInterval)

    // Populate the appropriate variables from the given args
    val checkpointAppName = "xxx-xx-xx--xx"
    val streamName = "cc-cc-c--c--cc"
    val endpointUrl = "https://kinesis.xxx-xx-xx.amazonaws.com"
    val regionName = "cc-xxxx-xxx"
    val initialPosition = new Latest()
    val checkpointInterval = batchInterval
    val storageLevel = StorageLevel.MEMORY_AND_DISK_2

    val kinesisStream = KinesisInputDStream.builder
      .streamingContext(streamingContext)
      .endpointUrl(endpointUrl)
      .regionName(regionName)
      .streamName(streamName)
      .initialPosition(initialPosition)
      .checkpointAppName(checkpointAppName)
      .checkpointInterval(checkpointInterval)
      .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
      .build()

    kinesisStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // ****************  <---------------
        // This is where I'm trying to save the raw JSON object to S3 as a JSON file.
        // Tried various combinations here but no luck.
        val dataFrame = rdd.map(record => new String(record)) // convert bytes to string
        dataFrame.write.mode(SaveMode.Append).json(outPath + "/" + rdd.id.toString())
        // **************** <----------------
      }
    }

    // Start the streaming context and await termination
    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Any ideas what I'm missing?
So the reason it failed to work was a complete red herring. It turned out to be a Scala version conflict with what is available on EMR.
Many similar questions on SO suggested this might be the issue, but whilst the Spark documentation lists Scala 2.12.4 as compatible with Spark 2.4.4, the EMR instance does not appear to support Scala 2.12.4. So I've updated my build.sbt and deploy script from:
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming-kinesis-asl_2.12" % "2.4.4"
to:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.4" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.4"
deploy.sh
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=ScalaStream,Args=[\
--class,"ScalaStream",\
--deploy-mode,cluster,\
--master,yarn,\
--packages,\'org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.4\',\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--conf,yarn.log-aggregation-enable=true,\
--conf,spark.dynamicAllocation.enabled=true,\
--conf,spark.cores.max=4,\
--conf,spark.network.timeout=300,\
s3://ccc.xxxx/simple-project_2.11-1.0.jar\
],ActionOnFailure=CONTINUE
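As a side note on the byte-to-string step in the original code: decoding a Kinesis record payload is plain JVM work and can be verified outside Spark. A minimal, Spark-free sketch (DecodeDemo is a hypothetical helper, and UTF-8 is an assumption about the producer's encoding):

```scala
import java.nio.charset.StandardCharsets

object DecodeDemo {
  // Kinesis delivers each record payload as raw bytes; if the producer
  // called JSON.stringify before PutRecord, the payload is UTF-8 JSON text.
  def decode(payload: Array[Byte]): String =
    new String(payload, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val bytes = """{"id":1}""".getBytes(StandardCharsets.UTF_8)
    println(decode(bytes)) // prints the original JSON string
  }
}
```

Testing this step in isolation makes it easier to confirm that any remaining failure lies in the cluster configuration rather than the decoding logic.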

Java Class not Found Exception while doing Spark-submit Scala using sbt

Here is the code that I wrote in Scala:
package normalisation

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.hadoop.fs.{FileSystem, Path}

object Seasonality {
  val amplitude_list_c1: Array[Nothing] = Array()
  val amplitude_list_c2: Array[Nothing] = Array()

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Normalization")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val line = "MP"
    val ps = "Test"
    val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line, ps)
    val files = FileSystem.get(sc.hadoopConfiguration).listStatus(new Path(location))
    for (each <- files) {
      var ps_data = sqlContext.read.json(each)
    }
    println(ps_data.show())
  }
}
The error I received when compiling with sbt package is shown in the attached image.
Here is my build.sbt file
name := "OV"
scalaVersion := "2.11.8"
// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
In Spark versions > 2 you should generally use SparkSession.
See https://spark.apache.org/docs/2.3.1/api/scala/#org.apache.spark.sql.SparkSession
You should then also be able to do
val spark:SparkSession = ???
val location = "hdfs://ipaddress/user/hdfs/{0}/ps/{1}/FS/2018-10-17".format(line,ps)
spark.read.json(location)
to read all your json files in the directory.
I also think you'd get another compile error at
for (each <- files) {
var ps_data = sqlContext.read.json(each)
}
println(ps_data.show())
because ps_data is out of scope.
If you need to use SparkContext for some reason, it should indeed be in spark-core. Have you tried restarting your IDE, cleaning caches, etc.?
EDIT: I just noticed that build.sbt is probably not in the directory you call sbt package from, so sbt won't pick it up.

Setup Scala and Apache Spark with SBT in Intellij

I am trying to run Spark Scala project in IntelliJ Idea on Windows 10 machine.
My build.sbt:
name := "SbtIntellSpark1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
project/build.properties:
sbt.version = 1.0.3
Main.scala:
package example

import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}

object Main {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val session = SparkSession
      .builder()
      .appName("StackOverflowSurvey")
      .master("local[1]")
      .getOrCreate()
    val df = session.read
    val responses = df
      .option("header", true)
      .option("inferSchema", true)
      .csv("2016-stack-overflow-survey-responses.csv")
    responses.printSchema()
  }
}
The code runs perfectly (the schema is properly printed) when I run the Main object as shown in the following image:
My Run Configuration is as follows:
The problem is that when I click "Run the program", it shows a huge stack of errors, too large to show here. Please see this gist.
How can I solve this issue?

Spark different behavior between spark-submit and spark-shell

I'm using Spark 1.3.1 standalone (on Ubuntu 14.04) with sbt 0.13.10, and am trying to execute the following script:
package co.some.sheker

import java.sql.Date
import org.apache.spark.{SparkContext, SparkConf}
import SparkContext._
import org.apache.spark.sql.{Row, SQLContext}
import com.datastax.spark.connector._
import java.sql._
import org.apache.spark.sql._
import org.apache.spark.sql.cassandra.CassandraSQLContext
import java.io.PushbackReader
import java.lang.{StringBuilder => JavaStringBuilder}
import java.io.StringReader
import com.datastax.spark.connector.cql.CassandraConnector
import org.joda.time.{DateTimeConstants}

case class TableKey(key1: String, key2: String)

object myclass {
  def main(args: scala.Array[String]) {
    val conf = ...
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val csc = new CassandraSQLContext(sc)
    val data_x = csc.sql("select distinct key1, key2 from keyspace.table where key1 = 'sheker'").map(row => (row(0).toString, row(1).toString))
    println("Done cross mapping")
    val snapshotsFiltered = data_x.map(x => TableKey(x._1, x._2)).joinWithCassandraTable("keyspace", "table")
    println("Done join")
    val jsons = snapshotsFiltered.map(_._2.getString("json"))
    ...
    sc.stop()
    println("Done.")
  }
}
By using:
/home/user/spark-1.3.1/bin/spark-submit --master spark://1.1.1.1:7077 --driver-class-path /home/user/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar --properties-file prop.conf --class "myclass" "myjar.jar"
The prop.conf file is:
spark.cassandra.connection.host myhost
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled true
spark.eventLog.dir /var/tmp/eventLog
spark.executor.extraClassPath /home/ubuntu/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar
And I get this exception:
Done cross mapping
Exception in thread "main" java.lang.NoSuchMethodError: com.datastax.spark.connector.mapper.ColumnMapper$.defaultColumnMapper(Lscala/reflect/ClassTag;Lscala/reflect/api/TypeTags$TypeTag;)Lcom/datastax/spark/connector/mapper/ColumnMapper;
at co.crowdx.aggregation.CassandraToElasticTransformater$.main(CassandraToElasticTransformater.scala:79)
at co.crowdx.aggregation.CassandraToElasticTransformater.main(CassandraToElasticTransformater.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Done Sending Signal aggregation job to Spark
And the strange part is that when I try to run the commands from the script in the shell, it works fine. I'm using:
/home/user/spark-1.3.1/bin/spark-shell --master spark://1.1.1.1:7077 --driver-class-path /home/ubuntu/spark-cassandra-connector-java-assembly-1.3.1-FAT.jar --properties-file prop.conf
The Build.scala file is:
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object AggregationsBuild extends Build {
  lazy val buildSettings = Defaults.defaultSettings ++ Seq(
    version := "1.0.0",
    organization := "co.sheker",
    scalaVersion := "2.10.4"
  )

  lazy val app = Project(
    "geo-aggregations",
    file("."),
    settings = buildSettings ++ assemblySettings ++ Seq(
      parallelExecution in Test := false,
      libraryDependencies ++= Seq(
        "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.1",
        // spark will already be on classpath when using spark-submit.
        // marked as provided, so that it isn't included in assembly.
        "org.apache.spark" %% "spark-core" % "1.2.1" % "provided",
        "org.apache.spark" %% "spark-catalyst" % "1.2.1" % "provided",
        "org.apache.spark" %% "spark-sql" % "1.2.1" % "provided",
        "org.scalatest" %% "scalatest" % "2.1.5" % "test",
        "org.postgresql" % "postgresql" % "9.4-1201-jdbc41",
        "com.github.nscala-time" %% "nscala-time" % "2.4.0",
        "org.elasticsearch" % "elasticsearch-hadoop" % "2.2.0" % "provided"
      ),
      resolvers += "conjars.org" at "http://conjars.org/repo",
      resolvers += "clojars" at "https://clojars.org/repo"
    )
  )
}
What is wrong? Why does it fail on submit but not in the shell?
You said that you are using Spark 1.3.1, but your build contains Spark 1.2.1 dependencies.
Like I said in the comment, I believe that your Spark driver's version is different from the one in your application, which leads to the error you are getting.
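A hedged sketch of the corresponding build fix, aligning every Spark artifact with the 1.3.1 runtime (the connector version shown is an assumption; check the spark-cassandra-connector compatibility table for the exact 1.3.x release that matches your cluster):

```scala
// Align all Spark dependencies with the Spark 1.3.1 installed on the cluster.
// The 1.3.x connector line targets Spark 1.3; 1.2.1 was built against Spark 1.2.
libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.1", // assumption: 1.3.x release
  "org.apache.spark" %% "spark-core"     % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-catalyst" % "1.3.1" % "provided",
  "org.apache.spark" %% "spark-sql"      % "1.3.1" % "provided"
)
```

Mixing a 1.2.1-compiled connector with a 1.3.1 driver is exactly the kind of binary mismatch that surfaces as NoSuchMethodError at runtime while compiling cleanly, and the shell avoids it because it loads only the FAT jar against the installed Spark.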

SquaredDistance between Vectors in Spark

I am trying to use the squared distance function in Spark, but nothing seems to work. I tried Vector.sqdist but get the error "sqdist is not a member of scala.collections..." (yet the documentation shows it is a member of org.apache.spark.mllib.linalg.Vector, which I imported: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.Vector).
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors

object SimpleApp {
  def main(args: Array[String]) {
    val v1: org.apache.spark.mllib.linalg.Vector = Vectors.dense(5)
    val v2: org.apache.spark.mllib.linalg.Vector = Vectors.dense(5)
    Vectors.sqdist(v1, v2)
  }
}
My sbt build:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.1.0",
"org.apache.spark" % "spark-mllib_2.10" % "1.1.0"
)
Spark Version: 1.5.0
Do you know an alternative way to use this function?
Thanks
scala.collection.immutable.Vector is not the same as org.apache.spark.mllib.linalg.Vector. Moreover, sqdist is a method of the Vectors object, not of Vector. Putting this all together:
import org.apache.spark.mllib.linalg.Vectors
val v1: org.apache.spark.mllib.linalg.Vector = Vectors.dense(5)
val v2: org.apache.spark.mllib.linalg.Vector = Vectors.dense(5)
Vectors.sqdist(v1, v2)
// Double = 0.0
Leaving that aside, you are compiling against Spark 1.1.0 (not 1.5.0), and sqdist was only introduced in 1.3.0.
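For intuition, Vectors.sqdist returns the squared Euclidean distance: the sum of squared element-wise differences. A minimal, Spark-free Scala sketch of the same computation (SqDistDemo is a hypothetical helper, not part of MLlib):

```scala
object SqDistDemo {
  // Squared Euclidean distance: sum over i of (a(i) - b(i))^2,
  // the quantity Vectors.sqdist computes for two MLlib vectors.
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  def main(args: Array[String]): Unit = {
    println(sqdist(Array(1.0, 2.0), Array(4.0, 6.0))) // (1-4)^2 + (2-6)^2 = 25.0
  }
}
```

With Vectors.dense(5) for both inputs, as in the snippet above, the result would be 0.0, since the two vectors are identical.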