Accessing Spark SQL - Scala

I am new to Spark. While following the example below from a book, I found that the command was giving an error. What is the best way to run a Spark SQL command when writing general Spark code?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.

Assuming you're in the spark-shell, first get a SQLContext like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")

So the value spark that is available in spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession)
val spark = SparkSession.builder().getOrCreate()
will give you one.
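For a standalone application (outside the shell), here is a minimal sketch along the same lines; it assumes Spark 2.x and that the transactions DataFrame has already been registered as a temporary view named trans, as in the question (object and app names here are illustrative):
// Minimal standalone sketch (Spark 2.x)
import org.apache.spark.sql.SparkSession

object AccountSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AccountSummary").getOrCreate()
    // assumes the transactions DataFrame was registered earlier, e.g.
    // transDF.createOrReplaceTempView("trans")
    val acSummary = spark.sql(
      "SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
    acSummary.show()
    spark.stop()
  }
}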

What version are you using? It appears you're in the shell, and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.

Related

saveAsTextFile("path") in Scala

I am using Scala in Spark, and I tried to save my file to HDFS, but I got an error.
I have tried rdd.saveAsTextFile("path"), sc.saveAsTextFile("pathe"), and saveAsTextFile("path").
scala> inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")
<console>:28: error: value sc is not a member of Array[String]
inputJPG.map(x=>x.split("")).map(array=>array(0)).sc.saveAsTextFile("/loudacre/iplist")
I'm a new student and still learning Scala, so I am not familiar with RDDs and Scala functions.
Back to my problem: I found it was because I had
val xx = data.collect()
earlier, which returns an Array rather than an RDD, so I could not call data.saveAsTextFile in Spark.
So I removed the .collect() call (val xx = data), and then data.saveAsTextFile("path") works.
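A minimal sketch of that fix; inputJPG and the output path come from the question, inputJPG is assumed to still be an RDD (no collect() before it), and splitting on a single space is an assumption about the log format:
// keep the data as an RDD so saveAsTextFile is available
val ips = inputJPG.map(line => line.split(" ")).map(fields => fields(0))
ips.saveAsTextFile("/loudacre/iplist")
// calling .collect() first would return a local Array[String], which has
// no saveAsTextFile method -- that is what caused the original error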

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under DataSet (link 1). However, I have an RDD, not a DataSet.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called for my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder); you then call toDF on the DatasetHolder.
Yes, you should import the sqlContext implicits like this before calling toDF on your RDDs:
val sqlContext = ... // create the sqlContext
import sqlContext.implicits._
val df = rdd.toDF()
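For Spark 2.x the same idea applies, except the implicits come from the SparkSession; here is a self-contained sketch (the session and app name are illustrative, and the log file path follows the question):
// toDF() becomes available on the RDD once the session's implicits are imported
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ToDfExample").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()   // a single column named "value"
df.show()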
Yes, I finally found peace of mind on this issue; it had been troubling me like hell, and this post is a life saver. I was trying to generically load data from log files into a case class object, collecting it in a mutable List, with the idea of finally converting the list into a DataFrame. However, as it was mutable and Spark 2.1.1 had changed the toDF implementation, for whatever reason the list was not getting converted. I was even considering saving the data to a file and loading it back with .read, but five minutes ago this post saved my day.
I did it exactly as described: after loading the data into the mutable list, I immediately used
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()

Why does the Scala compiler give "value registerKryoClasses is not a member of org.apache.spark.SparkConf" for Spark 1.4?

I tried to register a class for Kryo as follows
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass]))
val sc = new SparkContext(conf)
However, I get the following error
value registerKryoClasses is not a member of org.apache.spark.SparkConf
I also tried conf.registerKryoClasses(classOf[MyClass]), but it still complains with the same error.
What am I doing wrong? I am using Spark 1.4.
The method SparkConf.registerKryoClasses is defined in Spark 1.4 (since 1.2). However, it expects an Array[Class[_]] as an argument. This might be the problem. Try calling conf.registerKryoClasses(Array(classOf[MyClass])) instead.
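A short sketch of that call; MyClass is a stand-in for your own class, and the master and app name are placeholders for the values the question elided:
// registerKryoClasses expects an Array[Class[_]], not a Seq or a single Class
import org.apache.spark.{SparkConf, SparkContext}

case class MyClass(id: Int, name: String)

val conf = new SparkConf()
  .setMaster("local[*]")      // placeholder for setMaster(...)
  .setAppName("KryoExample")  // placeholder for setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass]))
val sc = new SparkContext(conf)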

Aggregate function in spark-sql not found

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project, though, it does not work, throwing the error message
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?
You need this import:
import org.apache.spark.sql.functions._
Alternatively, you can use the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()

Scala code works in the Spark shell but not in Eclipse

The code below works fine when I run it in the Spark shell, but in Eclipse it throws an error. What might be the reason? Please let me know if you need more information.
val IABLabels= IAB.zip(labels)
val temp1 = IABLabels.groupBy(x=>x._2).mapValues( _.map( _._1 ))
Error in Eclipse:
value mapValues is not a member of org.apache.spark.rdd.RDD[(Int, Iterable[(String, Int)])]
The code runs perfectly fine in the Spark shell.
You should use this import to access the extra functions on RDDs of (key, value) pairs through an implicit conversion:
import org.apache.spark.SparkContext._
You can check the API docs for further details.
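A minimal sketch of that fix in a compiled project; the sample data stands in for the question's IAB and labels collections:
import org.apache.spark.SparkContext._   // implicit conversion to pair-RDD functions

val IAB    = sc.parallelize(Seq("a", "b", "c"), 2)
val labels = sc.parallelize(Seq(1, 2, 1), 2)

val IABLabels = IAB.zip(labels)                                 // RDD[(String, Int)]
val temp1 = IABLabels.groupBy(x => x._2).mapValues(_.map(_._1)) // group by the label
temp1.collect().foreach(println)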