Dataframe methods within SBT project - scala

I have the following code that works in the spark-shell:
df1.withColumn("tags_splitted", split($"tags", ","))
  .withColumn("tag_exploded", explode($"tags_splitted"))
  .select("id", "tag_exploded")
  .show()
But fails in sbt with the following errors:
not found: value split
not found: value explode
My Scala code has the following:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._
Can someone give me a pointer to what is wrong in the sbt environment?
Thanks

The split and explode functions are defined in the object org.apache.spark.sql.functions.
So you need to import both:
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode
Or simply:
import org.apache.spark.sql.functions._
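For reference, here is a minimal self-contained sketch that compiles under sbt; the sample df1 data is an assumption based on your snippet (an "id" column and a comma-separated "tags" column):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{split, explode}

val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._  // for the $"..." column syntax and toDF

// Hypothetical sample data standing in for your real df1
val df1 = Seq((1, "scala,spark"), (2, "sbt")).toDF("id", "tags")

df1.withColumn("tags_splitted", split($"tags", ","))
  .withColumn("tag_exploded", explode($"tags_splitted"))
  .select("id", "tag_exploded")
  .show()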
Hope this helps!

Related

Problem with importing UDF (User Defined Function) in Spark for Scala

I have a problem when I try to import udf using:
import org.apache.spark.sql.functions.udf
The characters "udf" have a strike through them in the Scala IDE window. Everything compiles fine! I am using Spark 3.1.2 and Scala 2.12.
A picture of the problem!
Does anyone know how to fix this?

How to fix "error: encountered unrecoverable cycle resolving import"?

How to resolve the following compile error?
SOApp.scala:7: error: encountered unrecoverable cycle resolving import.
Note: this is often due in part to a class depending on a definition nested within its companion.
If applicable, you may wish to try moving some members into another object.
import spark.implicits._
Code:
object SOApp extends App with Logging {
  // For implicit conversions like converting RDDs to DataFrames
  import spark.implicits._
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()
}
tl;dr Move import spark.implicits._ after val spark = SparkSession...getOrCreate().
The name spark causes a lot of confusion, since it could refer to the org.apache.spark package as well as to the spark value.
Unlike Java, Scala allows for import statements in many more places.
What you could consider a Spark SQL idiom is to create a spark value that gives access to the implicits. In Scala, you can only bring implicits into scope from stable identifiers (such as values), so the following is correct:
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
And as your comment says, it brings in the implicit conversions from RDDs to DataFrames (among other things).
It does not import the org.apache.spark package; it imports the implicit conversions.
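For comparison, here is a minimal sketch of the corrected ordering (the Logging mix-in is omitted for brevity): define the spark value first, then import its implicits.
import org.apache.spark.sql.SparkSession

object SOApp extends App {
  val spark = SparkSession
    .builder()
    .appName("Stackoverflow App")
    .master("local[*]")
    .getOrCreate()

  // For implicit conversions like converting RDDs to DataFrames
  import spark.implicits._
}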

Apache Spark: Can't resolve constructor StreamingContext

I've been trying to establish a StreamingContext in my program but I can't for the life of me figure out what's going on. I added the spark-streaming jar file to the dependencies and imported it in the code but I can't help feeling like I'm missing some small detail somewhere. How should I proceed?
picture of code
You forgot to import StreamingContext in your case.
Use
import org.apache.spark.streaming.StreamingContext
Not
import org.apache.spark.streaming.StreamingContext._
The latter imports the members of the companion object, not the class itself.
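A minimal sketch of constructing a StreamingContext with the correct import; the app name, master, and batch interval below are assumptions, not taken from your code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local configuration for illustration; a streaming app needs at least two cores
val conf = new SparkConf().setAppName("StreamingApp").setMaster("local[2]")

// Create the context with a 5-second batch interval
val ssc = new StreamingContext(conf, Seconds(5))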

Getting a not found error when using split with dataframes

I am working with Spark and Scala. I have the code below, which works fine in the spark-shell, but when I try to move it over to IntelliJ it throws an error, stating that it is unable to find split.
What am I missing? What do I need to import for split to work?
var outputDF = inputDF.withColumn(srcColumn,
  split(inputDF.col(srcColumn), splitBy).getItem(selectIndex))
You're probably missing this import:
import org.apache.spark.sql.functions._
(which is provided automatically in a Spark shell)
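As a self-contained sketch you can paste into an sbt or IntelliJ project (the sample data, column name, and indices below are assumptions for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("SplitExample").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical input: a "name" column containing space-separated tokens
val inputDF = Seq("John Smith", "Jane Doe").toDF("name")

// Split on the space and keep the first token
val outputDF = inputDF.withColumn("name",
  split(inputDF.col("name"), " ").getItem(0))
outputDF.show()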

Aggregate function in spark-sql not found

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project it does not work, though, throwing the error message:
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?
You need this import:
import org.apache.spark.sql.functions._
You can also use the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()
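A minimal sketch showing both forms side by side, assuming pf is the DataFrame from your question with "name" and "days" columns:
import org.apache.spark.sql.functions.sum

// agg(...) with the sum function requires the functions import
val withAgg = pf.groupBy("name").agg(sum("days"))

// the sum method on the grouped data needs no extra import
val withSum = pf.groupBy("name").sum("days")

withAgg.show()
withSum.show()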