Aggregate function in spark-sql not found - scala

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project it does not work, though, and throws the error message:
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?

You need this import:
import org.apache.spark.sql.functions._

You can also call the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()
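For context, here is a minimal standalone sketch of the same aggregation with the required import; the SQLContext setup and the people.json path are assumptions for illustration, not taken from the question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._   // brings sum, avg, etc. into scope

object AggExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AggExample"))
    val sqlContext = new SQLContext(sc)

    // pf is assumed to be a DataFrame with "name" and "days" columns
    val pf = sqlContext.read.json("people.json")

    // agg(sum(...)) compiles once org.apache.spark.sql.functions._ is imported
    pf.groupBy("name").agg(sum("days")).show()

    // equivalent shortcut on GroupedData
    pf.groupBy("name").sum("days").show()
  }
}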

Related

Problem with importing UDF (User Defined Function) in Spark for Scala

I have a problem when I try to import udf using:
import org.apache.spark.sql.functions.udf
The characters "udf" have a strikethrough in the Scala IDE window, yet everything compiles fine! I am using Spark 3.1.2 and Scala 2.12.
Does anyone know how to fix this?
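For reference, a minimal sketch of how that import is typically used to define and apply a UDF in Spark 3.x; the DataFrame and the upper function below are invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("UdfExample").getOrCreate()
import spark.implicits._

// hypothetical DataFrame and UDF, just to show the udf(...) call in use
val df = Seq("alice", "bob").toDF("name")
val upper = udf((s: String) => s.toUpperCase)
df.withColumn("name_upper", upper($"name")).show()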

Getting a not found error when using split with dataframes

I am working with Spark and Scala. The code below works fine in the spark-shell, but when I try to move it over to IntelliJ it throws an error, stating it is unable to find split.
What am I missing? What do I need to import for split to work?
var outputDF = inputDF.withColumn(srcColumn,
split(inputDF.col(srcColumn),splitBy).getItem(selectIndex))
You're probably missing this import:
import org.apache.spark.sql.functions._
(which is provided automatically in a Spark shell)
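For completeness, a short self-contained sketch of split outside the shell; the column name, delimiter, and index below are placeholders, not values from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._   // provides split

val spark = SparkSession.builder().appName("SplitExample").getOrCreate()
import spark.implicits._

val inputDF = Seq("a,b,c").toDF("raw")
// split the "raw" column on commas and take the second element
val outputDF = inputDF.withColumn("second",
  split(inputDF.col("raw"), ",").getItem(1))
outputDF.show()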

Dataframe methods within SBT project

I have the following code that works in the spark-shell:
df1.withColumn("tags_splitted", split($"tags", ",")).withColumn("tag_exploded", explode($"tags_splitted")).select("id", "tag_exploded").show()
But fails in sbt with the following errors:
not found: value split
not found: value explode
My Scala code has the following:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._
Can someone give me a pointer to what is wrong in the sbt environment?
Thanks
The split and explode functions are available in the object org.apache.spark.sql.functions.
So you need to import both:
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode
or simply:
import org.apache.spark.sql.functions._
Hope this helps!
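Putting it together, a minimal sketch of the same pipeline as it might look in an sbt project; the sample data is invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{split, explode}

val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._

// df1 stands in for the asker's DataFrame with "id" and "tags" columns
val df1 = Seq((1, "scala,spark"), (2, "sbt")).toDF("id", "tags")

df1.withColumn("tags_splitted", split($"tags", ","))
   .withColumn("tag_exploded", explode($"tags_splitted"))
   .select("id", "tag_exploded")
   .show()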

Accessing Spark.SQL

I am new to Spark. While following the example below from a book, I found that the command was giving an error. What would be the best way to run a Spark SQL command while coding in Spark in general?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.
Assuming you're in the spark-shell, first get a SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
So the value spark that is available in spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession)
val spark = SparkSession.builder().getOrCreate()
will give you one.
What version are you using? It appears you're in the shell, and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.
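For comparison, a minimal Spark 2+ sketch of the same query via SparkSession; the sample trans data registered as a temp view below is invented for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AcSummary").getOrCreate()
import spark.implicits._

// assumed sample data; the real trans table comes from the book's example
val trans = Seq(("SB10001", 1000.0), ("SB10001", 500.0), ("SB10002", 200.0))
  .toDF("accNo", "tranAmount")
trans.createOrReplaceTempView("trans")

val acSummary = spark.sql(
  "SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
acSummary.show()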

Scala code works on Spark Terminal but not in Eclipse

The code below works fine when I run it in the Spark terminal, but in Eclipse it throws an error. What might be the reason? Please let me know if you need more information.
val IABLabels= IAB.zip(labels)
val temp1 = IABLabels.groupBy(x=>x._2).mapValues( _.map( _._1 ))
Error in Eclipse:
value mapValues is not a member of org.apache.spark.rdd.RDD[(Int, Iterable[(String, Int)])]
The code runs perfectly fine in the Spark shell.
You should use this import to access extra functions on RDDs of (key, value) pairs through an implicit conversion:
import org.apache.spark.SparkContext._
You can check the API docs for further details.
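A minimal sketch of the snippet with that import in place; the IAB and labels collections below are invented stand-ins:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions (needed in older Spark versions)

val sc = new SparkContext(new SparkConf().setAppName("IABLabels"))

// stand-in data: IAB holds strings, labels holds their integer labels
val IAB = sc.parallelize(Seq("a", "b", "c"), 1)
val labels = sc.parallelize(Seq(1, 2, 1), 1)

val IABLabels = IAB.zip(labels)
// group by the label, then mapValues (available via the import above)
val temp1 = IABLabels.groupBy(x => x._2).mapValues(_.map(_._1))
temp1.collect().foreach(println)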