Spark job completes without executing udf - scala

I've been having an issue with a long, complicated Spark job which contains a UDF.
The problem is that the UDF doesn't seem to get called, although there is no error message.
I know it isn't being called because the output gets written, but everything the UDF was supposed to calculate is NULL, and no print statements appear when debugging locally.
The only lead is that this code previously worked using different input data, meaning the error must have something to do with the input.
The change in inputs mostly means different column names are used, which is addressed in the code.
Print statements are executed given the first, 'working' input.
Both inputs are created using the same series of steps from the same database, and by inspection there doesn't appear to be a problem with either one.
I've never experienced this sort of behaviour before, and any leads on what might cause it would be appreciated.
The code is monolithic and inflexible - I'm working on refactoring, but it's not an easy piece to break apart. This is a short version of what is happening:
package mypackage
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.util._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types._
import scala.collection.{Map => SMap}
object MyObject {
def main(args: Array[String]){
val spark: SparkSession = SparkSession.builder()
.appName("my app")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
val bigInput = spark.read.parquet("inputname.parquet")
val reference_table = spark.read.parquet("reference_table.parquet")
val exchange_rate = spark.read.parquet("reference_table.parquet")
val bigInput2 = bigInput
.filter($"column1" === "condition1")
.join(joinargs)
.drop(dropargs)
val bigInput3 = bigInput
.filter($"column2" === "condition2")
.join(joinargs)
.drop(dropargs)
<continue for many lines...>
def mapper1(
arg1: String,
arg2: Double,
arg3: Integer
): List[Double] = {
exchange_rate.map(
List(idx1, idx2, idx3),
r.toSeq.toList
.drop(idx4)
.take(arg2)
)
}
def mapper2(){}
...
def mapper5(){}
def my_udf(
arg0: Integer,
arg1: String,
arg2: Double,
arg3: Integer,
...
arg20: String
): Double = {
println("I'm actually doing something!")
val result1 = mapper1(arg1, arg2, arg3)
val result2 = mapper2(arg4, arg5, arg6, arg7)
...
val result5 = mapper5(arg18, arg19, arg20)
result1.take(arg0)
.zipAll(result1, 0.0, 0.0)
.map(x => x._1 * x._2)
....
.zipAll(result5, 0.0, 0.0)
.foldLeft(0.0)(_+_)
}
spark.udf.register("myUDF", my_udf _)
val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
callUDF(
"myUDF",
$"col1",
...
$"col20"
)
)
<postprocessing>
bigResultFinal
.filter(<configs>)
.select(<column names>)
.write
.format("parquet")
}
}
To recap:
This code runs to completion on each of the two input files.
The UDF only appears to execute on the first file.
With the second file there are no error messages, and all non-UDF logic appears to complete successfully.
Any help greatly appreciated!

Here the UDF is not being called because Spark is lazy: it does not evaluate the UDF unless you invoke an action on the DataFrame. You can force this by triggering an action on the result.
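For example, a minimal sketch of forcing evaluation with an action (variable names taken from the code in the question; the output path is hypothetical):
// Nothing executes until an action is invoked; any of the following forces the UDF to run.
bigResult1.count()                 // materialises the result and runs the UDF
bigResult1.show(5)                 // or inspect a few rows
bigResultFinal
  .write
  .format("parquet")
  .save("/tmp/big_result.parquet") // or write the result out (hypothetical path)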

Related

Spark Scala getting null pointer exception

I'm trying to get mass elevation data from a TIFF image. I have a CSV file that contains latitude, longitude and other attributes. I loop through the CSV file, get the latitude and longitude, and call the elevation method; the code is given below. Reference: RasterFrames extracting location information problem
package main.scala.sample
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
import org.apache.spark.sql.functions.col
object SparkSQLExample {
def main(args: Array[String]) {
implicit val spark = SparkSession.builder()
.master("local[*]").appName("RasterFrames")
.withKryoSerialization.getOrCreate().withRasterFrames
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
val rf = spark.read.raster.from(example).load()
val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
val extent = extentEnc.to[Extent]
Raster(tile, extent).getDoubleValueAtPoint(point)
})
val spark_file:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples")
.getOrCreate()
spark_file.sparkContext.setLogLevel("ERROR")
println("spark read csv files from a directory into RDD")
val rddFromFile = spark_file.sparkContext.textFile("point.csv")
println(rddFromFile.getClass)
def customF(str: String): String = {
val lat = str.split('|')(2).toDouble;
val long = str.split('|')(3).toDouble;
val point = st_makePoint(long, lat)
val test = rf.where(st_intersects(rf_geometry(col("proj_raster")), point))
.select(rf_value_at_point(rf_extent(col("proj_raster")), rf_tile(col("proj_raster")), point) as "value")
return test.toString()
}
val rdd2=rddFromFile.map(f=> customF(f))
rdd2.foreach(t=>println(t))
spark.stop()
}
}
When I run this I get a null pointer exception; any help appreciated.
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:64)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3416)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1490)
at org.apache.spark.sql.Dataset.where(Dataset.scala:1518)
at main.scala.sample.SparkSQLExample$.main$scala$sample$SparkSQLExample$$customF$1(SparkSQLExample.scala:49)
The function which is being mapped over the RDD (customF) is not null safe. Try calling customF(null) and see what happens. If it throws an exception, then you will have to make sure that rddFromFile doesn't contain any null/missing values.
It is a little hard to tell if that is exactly where the issue is. I think the stack trace of the exception is less helpful than usual because the function is being run in Spark tasks on the workers.
If that is the issue, you could rewrite customF to handle the case where str is null or change the parameter type to Option[String] (and tweak the logic accordingly).
By the way, the same thing applies to UDFs. They need to either:
accept Option types as input,
handle the case where each argument is null, or
only be applied to data with no missing values.
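For instance, a minimal sketch of a null-tolerant UDF covering the second option (the DataFrame and the column name "raw" are hypothetical):
import org.apache.spark.sql.functions.{col, udf}
// Wrap the possibly-null String input in Option so null rows yield a default
// value instead of a NullPointerException.
val safeLength = udf { (s: String) =>
  Option(s).map(_.length).getOrElse(0)
}
// someDf.withColumn("raw_length", safeLength(col("raw")))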

Rewrite scala code to be more functional

I am trying to teach myself Scala whilst at the same time trying to write code that is idiomatic of a functional language, i.e. write better, more elegant, functional code.
I have the following code that works OK:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
implicit class DataFrameExtensions(df: DataFrame){
def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
}
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
val dataFrames = Seq(df.featuresGroup1(groupBy, asAt),df.featuresGroup2(groupBy, asAt))
The last line bothers me though. The two functions (featuresGroup1, featuresGroup2) both have the same signature:
scala> :type df.featuresGroup1(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
scala> :type df.featuresGroup2(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
and take the same vals as parameters, so I assume I can write that line in a more functional way (perhaps using .map somehow), writing the parameter list just once and passing it to both functions. I can't figure out the syntax though. I thought maybe I could construct a list of those functions, but that doesn't work:
scala> Seq(featuresGroup1, featuresGroup2)
<console>:23: error: not found: value featuresGroup1
Seq(featuresGroup1, featuresGroup2)
^
<console>:23: error: not found: value featuresGroup2
Seq(featuresGroup1, featuresGroup2)
^
Can anyone help?
I thought maybe I could construct a list of those functions but that doesn't work:
Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
df.featuresGroup1 _ should work as well.
df.featuresGroup1 by itself would work if you had an expected type, e.g.
val dataframes: Seq[(Seq[String], LocalDate) => DataFrame] =
Seq(df.featuresGroup1, df.featuresGroup2)
but in this specific case providing the expected type is more verbose than using lambdas.
I thought maybe I could construct a list of those functions but that doesn't work
You need to explicitly perform eta expansion to turn methods into functions (they are not the same in Scala), by using an underscore operator:
val funcs = Seq(df.featuresGroup1 _, df.featuresGroup2 _)
or by using placeholders:
val funcs = Seq(df.featuresGroup1(_, _), df.featuresGroup2(_, _))
And you are absolutely right about using map operator:
val dataFrames = funcs.map(f => f(groupBy, asAt))
I strongly recommend against using implicit values of plain types like String or Seq: if they are used in multiple places, they lead to subtle bugs that are not immediately obvious from the code, and the code becomes prone to breaking when it is moved around.
If you want to use implicits, wrap them in custom types:
case class DfGrouping(groupBy: Seq[String]) extends AnyVal
implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))
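A hypothetical sketch of how the wrapped implicit might then be consumed, so the grouping columns are carried by a dedicated type rather than a bare Seq[String]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
// The implicit DfGrouping defined above supplies the grouping columns.
def withGroupCounts(df: DataFrame)(implicit grouping: DfGrouping): DataFrame =
  df.groupBy(grouping.groupBy.map(col): _*).count()
// withGroupCounts(df)  // picks up the implicit grouping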
Why not just create a function in DataFrameExtensions to do so?
def getDataframeGroups(groupBy: Seq[String], asAt: LocalDate) = Seq(featuresGroup1(groupBy, asAt), featuresGroup2(groupBy, asAt))
I think you could create a list of functions as below:
val funcs: List[DataFrame => (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] = List(_.featuresGroup1, _.featuresGroup2)
funcs.map(x => x(df)(groupBy, asAt))
It seems you have a list of functions which convert a DataFrame to another DataFrame. If that is the pattern, you could go a little bit further with Endo in Scalaz
I like this answer best, courtesy of Alexey Romanov.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
implicit class DataFrameExtensions(df: DataFrame){
def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = {df}
}
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ through to .toDF()) because I have verified this in the Spark Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise 'import spark.implicits._' or 'toDF()', and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
You need to assign sqlContext to a val for implicits to work. Since your sparkSession is a var, implicits won't work with it.
So you need to do
val sQLContext = spark.sqlContext
import sQLContext.implicits._
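Alternatively, a minimal sketch that assigns the session itself to a local val, which is also a stable identifier the compiler accepts for the import:
val sparkSession = spark           // a val, i.e. a stable identifier, so the import compiles
import sparkSession.implicits._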
Moreover, you can structure your tests so that your test class looks like the following:
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing Spark; one of the most widely used is
spark-testing-base, by Holden Karau.
This library provides everything ready to use, with sc as the SparkContext. Below is a simple example:
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, everything is prepared with sc as the SparkContext.
Another approach is to create a TestWrapper and reuse it for multiple test cases, as below:
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
Then use this TestWrapper for all the tests with ScalaTest, combining it with BeforeAndAfterAll and BeforeAndAfterEach as needed.
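For example, a minimal sketch of a suite mixing in the trait (the suite name and test data are hypothetical):
import org.scalatest.{BeforeAndAfterAll, FunSuite}
class ExampleSparkSpec extends FunSuite with TestSparkWrapper with BeforeAndAfterAll {
  test("toDF produces a data frame of the expected size") {
    import sparkSession.implicits._
    val df = Seq(("Michael", 29), ("Andy", 30), ("Justin", 19)).toDF("name", "age")
    assert(df.count() == 3)
  }
  // Stop the shared session once all tests in the suite have run.
  override def afterAll(): Unit = sparkSession.stop()
}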
Hope this helps!

Removing newlines in a DataFrame field with udf function gives TypeTag Error

val trim: String => String = _.trim.replace("[\\r\\n]", "")
def main(args: Array[String]) {
val spark = ... ...
import spark.implicits._
val trimUDF = udf[String,String](trim)
val df = spark.read.json(df_path) ...
val fixed_dblogs_df = df.withColumn("qp_new", trimUDF('qp)) ...
}
When I run this code I get a compile time error:
No TypeTag available for String
This error occurs where I define the UDF. I have no idea why this is happening; I have used UDFs before, but this one produces the error. I'm using Spark 2.1.1.
The purpose of the code is to remove all the newlines from one of my StringType columns so that it contains no line breaks.
Is there some reason you're using a UDF instead of the regexp_replace builtin?
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace('qp, "[\\r\\n]", "")) ...
UDFs break Spark's plan optimization.
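A minimal, self-contained sketch of the built-in approach, assuming the df and column name qp from the question:
import org.apache.spark.sql.functions.{col, regexp_replace}
// Strip carriage returns and newlines without a UDF, so Spark can still optimise the plan.
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace(col("qp"), "[\\r\\n]", ""))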

How can I call a UDF in a UDF?

Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate using this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM; I want to eventually replace this with a Scala macro, but for now it uses Janino and SimpleCompiler to compile the code and reload the class.
Sometimes these model strings contain variables and functions. The variables are easy to put in since they can be string-replaced in the generated code, and the functions are for the most part easy too, because if their names map to an existing static function then it will just execute that when the model is evaluated. For instance, an exponent maps to Math.pow in scala.math.
So my issue, specifically, is implementing a lag and lead function for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while this function would be called inside a UDF, so I am having trouble referencing that data from inside the UDF.
So I have this code
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}
object Functions {
val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
val ctx: SparkContext = new SparkContext(conf)
val hctx: HiveContext = new HiveContext(ctx)
import hctx.implicits._
def lag(x: Double, window: Int): Double = {
x
}
def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
slag(c, window).over(windowSpec).as(c.toString() + "_lag")
}
def main(args: Array[String]): Unit = {
val funcUdf = udf((f: Column) => lag(f))
val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
implicit val spec: WindowSpec = Window.orderBy($"value")
data.select(funcUdf($"value")).show()
}
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way or do I have to compute lag windows ahead of time and pass them all around?
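As a minimal sketch of the workaround the question itself mentions (computing the lag ahead of time at the DataFrame level and passing it into an ordinary UDF), assuming the data DataFrame from the code above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lit, udf}
// Compute the lagged value with the DataFrame-level lag, defaulting the first row to 0,
// then hand both the current and the lagged value to a plain UDF.
val lagSpec = Window.orderBy(col("value"))
val withLag = data.withColumn("value_lag", coalesce(lag(col("value"), 1).over(lagSpec), lit(0)))
val diff = udf((current: Int, previous: Int) => current - previous)
withLag.select(col("value"), diff(col("value"), col("value_lag")).as("value_diff")).show()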