Getting a not found error when using split with dataframes - scala

I am working with Spark and Scala. I have the code below, which works fine in spark-shell, but when I try to move it over to IntelliJ it throws an error stating that it is unable to find split.
What am I missing? What do I need to import for split to work?
var outputDF = inputDF.withColumn(srcColumn,
  split(inputDF.col(srcColumn), splitBy).getItem(selectIndex))

You're probably missing this import:
import org.apache.spark.sql.functions._
(which is provided automatically in a Spark shell)
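For reference, here is a minimal sketch of how the question's snippet compiles once the import is in place; the sample data and the srcColumn/splitBy/selectIndex values are made-up placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._  // brings split and the other column functions into scope

object SplitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SplitExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up stand-in for the question's inputDF
    val inputDF = Seq(("a-b-c", 1), ("x-y-z", 2)).toDF("value", "id")

    val srcColumn = "value"
    val splitBy = "-"
    val selectIndex = 1

    val outputDF = inputDF.withColumn(srcColumn,
      split(inputDF.col(srcColumn), splitBy).getItem(selectIndex))

    outputDF.show()
    spark.stop()
  }
}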

Related

Problem with importing UDF (User Defined Function) in Spark for Scala

I have a problem when I try to import udf using:
import org.apache.spark.sql.functions.udf
The characters "udf" have a strikethrough in the Scala IDE window. Everything compiles fine! I am using Spark 3.1.2 and Scala 2.12.
A picture of the problem!
Does anyone know how to fix this?

How to use Spark's implicit conversions (e.g. $) in the IntelliJ debugger's Evaluate Expression

When debugging Spark/Scala code with IntelliJ, using e.g. df.select($"mycol") does not work in the Evaluate Expression window, while df.select(col("mycol")) works fine (but requires a code change):
It says:
Error during generated code invocation:
com.intellij.debugger.engine.evaluation.EvaluateException: Error evaluating method : 'invoke': Method threw 'java.lang.NoSuchFieldError' exception.: Error evaluating method : 'invoke': Method threw 'java.lang.NoSuchFieldError' exception.
Strangely, it seems to work sometimes, especially if the $ is already part of an existing expression in the code that I mark to evaluate. If I write arbitrary expressions (code fragments), it fails consistently.
EDIT: even repeating import spark.implicits._ in the code-fragment window does not help.
Try this workaround:
import spark.implicits._
$""  // dummy use of $ so the import is not flagged as unused
df.select($"diff").show()
It seems that putting import spark.implicits.StringToColumn at the top of the code fragment works.
I think the reason is that IntelliJ does not realize that the import of the implicits is used in the first place (it is rendered gray), therefore it's not available in the Evaluate Expression context.
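So a code fragment for the Evaluate Expression window could look like the following sketch, where df stands in for whatever DataFrame is in scope at the breakpoint:
import spark.implicits.StringToColumn
df.select($"mycol").show()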
I've had a similar issue with NoSuchFieldError because of a missing spark instance within the "Evaluate Expression" dialog. See the example below.
class MyClass(implicit spark: SparkSession) {
  import spark.implicits._

  def myFunc1() = {
    // breakpoint here
  }
}
My workaround was to modify the spark instance declaration in the constructor: I changed "implicit spark" to "implicit val spark".
...and it works :)
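That is, a sketch of the fixed declaration (assuming the rest of the class stays the same):
import org.apache.spark.sql.SparkSession

class MyClass(implicit val spark: SparkSession) {  // val turns spark into a field the debugger can see
  import spark.implicits._

  def myFunc1() = {
    // breakpoint here
  }
}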

Dataframe methods within SBT project

I have the following code that works in the spark-shell
df1.withColumn("tags_splitted", split($"tags", ","))
  .withColumn("tag_exploded", explode($"tags_splitted"))
  .select("id", "tag_exploded")
  .show()
But it fails in sbt with the following errors:
not found: value split
not found: value explode
My Scala code has the following:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Books").getOrCreate()
import spark.implicits._
Can someone give me a pointer to what is wrong in the sbt environment?
Thanks
The split and explode functions are available in the functions object inside the package org.apache.spark.sql.
So you need to import both:
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.explode
or everything at once:
import org.apache.spark.sql.functions._
Hope this helps!
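For reference, a minimal sketch of an SBT-friendly version with the imports in place; the df1 data here is invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object TagsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Books").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented sample data standing in for the question's df1
    val df1 = Seq((1, "scala,spark"), (2, "sbt,maven")).toDF("id", "tags")

    df1.withColumn("tags_splitted", split($"tags", ","))
      .withColumn("tag_exploded", explode($"tags_splitted"))
      .select("id", "tag_exploded")
      .show()

    spark.stop()
  }
}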

Aggregate function in spark-sql not found

I am new to Spark and I am trying to make use of some aggregate features, like sum or avg. My query in spark-shell works perfectly:
val somestats = pf.groupBy("name").agg(sum("days")).show()
When I try to run it from a Scala project it is not working though, throwing the error message:
not found: value sum
I have tried to add
import sqlContext.implicits._
import org.apache.spark.SparkContext._
just before the command, but it does not help. My Spark version is 1.4.1. Am I missing anything?
You need this import:
import org.apache.spark.sql.functions._
Alternatively, you can use the sum method directly on GroupedData (groupBy returns this type):
val somestats = pf.groupBy("name").sum("days").show()
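A sketch showing both variants side by side; it uses the newer SparkSession API rather than the 1.4-era sqlContext from the question, and the pf data is invented:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object AggExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AggExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented sample data standing in for the question's pf
    val pf = Seq(("alice", 3), ("alice", 4), ("bob", 2)).toDF("name", "days")

    pf.groupBy("name").agg(sum("days")).show()  // needs the functions import
    pf.groupBy("name").sum("days").show()       // method on the grouped data, no extra import

    spark.stop()
  }
}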

How to find statistics using stats() in Spark

I'm trying to display the stats() of a Double RDD
val stat = x.map(s => s._2).stats()
where the result of the map is of type RDD[Double].
In spark-shell, this runs fine.
But when running it in Eclipse and compiling with sbt, it throws the following error:
value stats is not a member of org.apache.spark.rdd.RDD[Double]
Try importing the additional functions for RDDs of doubles.
Add import org.apache.spark.SparkContext._ to your source.
This is a common "error" when an implicit conversion for the RDD class is missing.
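Following that suggestion, a minimal sketch (the pair RDD here is invented; on recent Spark versions the implicit conversion is picked up automatically from the RDD companion object, so the explicit import mainly matters on older releases like the one in the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // per the answer; on old Spark versions this brought the DoubleRDDFunctions implicit into scope

object StatsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StatsExample").setMaster("local[*]"))

    // Invented stand-in for the question's pair RDD x
    val x = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 3.0)))

    val stat = x.map(s => s._2).stats()  // StatCounter: count, mean, stdev, min, max
    println(stat)

    sc.stop()
  }
}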