With the grace of StackOverflow experts I have managed to tinker with one of the provided examples and create a Scala UDAF which gives me the functionality I am looking for. The structure of the UDAF/function etc. is as below :-
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import java.time.LocalDate
import java.time.format.DateTimeFormatter

case class InputRow(ddate: String, ccount: String, iitem: String)
case class Buffer(var max_ddate: String, var ddue_dt: Map[String, String])

object RecursiveAggregatorZ extends Aggregator[InputRow, Buffer, Buffer] {
  override def zero: Buffer = Buffer(null, null)

  override def reduce(buffer: Buffer, currentRow: InputRow): Buffer = {
    <LOGIC HERE>
    buffer
  }

  override def merge(b1: Buffer, b2: Buffer): Buffer = {
    throw new NotImplementedError("should be used only over ordered window")
  }

  override def finish(reduction: Buffer): Buffer = reduction

  override def bufferEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
  override def outputEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
}
I can run the actual code via Zeppelin and then execute the below to register it for use within Spark :-
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val recursiveAggregatorZ = udaf(RecursiveAggregatorZ)
spark.udf.register("recursiveAggregatorZ",recursiveAggregatorZ)
However, I am looking for a way to incorporate this into PySpark so that it can be used beyond Spark SQL alone. I rummaged through most of the Google hits, which say to first package the Scala code into a jar and so on, but most of them use classes as examples rather than an object like the snippet I have provided.
Would really appreciate it if anyone could guide me on:-
(1) How exactly to build a jar out of this scala function
(2) How exactly to push it into PySpark and have it registered within PySpark.
Just for the sake of clarity, within Zeppelin I am able to run queries such as :-
select recursiveAggregatorZ(column1, column2, column3) over (partition by partition1, partition2 order by rn) as output from phase1
and get the output I am looking for.
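On the packaging side, one common pattern (not the only one) is to move the case classes and the Aggregator above into a small sbt project together with a helper object whose only job is to register the UDAF on a given SparkSession. The package and object names below are hypothetical placeholders, so this is just a sketch of the shape of the jar-side code:

package com.example.agg // hypothetical package name

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udaf

// Helper reachable from PySpark over Py4J: registers the Aggregator-backed
// UDAF on whatever SparkSession is handed in.
object RegisterRecursiveAggregatorZ {
  def register(spark: SparkSession): Unit = {
    spark.udf.register("recursiveAggregatorZ", udaf(RecursiveAggregatorZ))
  }
}

After building the jar with sbt package (or sbt assembly if there are extra dependencies), you would start PySpark with --jars path/to/your.jar and call the helper through the JVM gateway, for example spark._jvm.com.example.agg.RegisterRecursiveAggregatorZ.register(spark._jsparkSession); from then on the function should be usable from spark.sql(...) in Python just like in the Zeppelin query above.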
Related
When saving an RDD to S3 in AVRO, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile, so I've dug around and come up with this:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
object AvroUtil {
  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {

    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )

    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
    AvroJob.setOutputKeySchema(job, schema)

    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}
I'm rather baffled, as I don't see what is incorrect; the Avro files seem to be output correctly.
You can override the behaviour of the existing FileOutputCommitter by implementing your own OutputCommitter to make the commit phase more efficient and safe.
Follow this link, where the author has explained a similar approach with an example.
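If replacing the committer is more than you need, a lighter-weight option that is often used with the standard FileOutputCommitter is switching to its version 2 algorithm, which commits task output directly instead of renaming everything again at job commit. A minimal sketch of the configuration (the rest of the session setup is assumed):

import org.apache.spark.sql.SparkSession

// FileOutputCommitter's v2 algorithm skips the second, sequential rename
// pass at job-commit time; evaluate the consistency trade-offs for your
// object store before relying on it.
val spark = SparkSession.builder()
  .appName("avro-writer") // hypothetical app name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

The same key can also be set directly on the job.getConfiguration used inside AvroUtil.write.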
I've been having an issue with a long, complicated spark job which contains a udf.
The issue I've been having is that the udf doesn't seem to get called properly, although there is no error message.
I know it isn't called properly because the output gets written, only anything the udf was supposed to calculate is NULL, and no print statements appear when debugging locally.
The only lead is that this code previously worked using different input data, meaning the error must have something to do with the input.
The change in inputs mostly means different column names are used, which is addressed in the code.
Print statements are executed given the first, 'working' input.
Both inputs are created using the same series of steps from the same database, and by inspection there doesn't appear to be a problem with either one.
I've never experienced this sort of behaviour before, and any leads on what might cause it would be appreciated.
The code is monolithic and inflexible - I'm working on refactoring, but it's not an easy piece to break apart. This is a short version of what is happening:
package mypackage
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.util._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types._
import scala.collection.{Map => SMap}
object MyObject {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession.builder()
      .appName("my app")
      .config("spark.master", "local")
      .getOrCreate()

    import spark.implicits._

    val bigInput = spark.read.parquet("inputname.parquet")
    val reference_table = spark.read.parquet("reference_table.parquet")
    val exchange_rate = spark.read.parquet("reference_table.parquet")

    val bigInput2 = bigInput
      .filter($"column1" === "condition1")
      .join(joinargs)
      .drop(dropargs)

    val bigInput3 = bigInput
      .filter($"column2" === "condition2")
      .join(joinargs)
      .drop(dropargs)

    <continue for many lines...>
    def mapper1(
        arg1: String,
        arg2: Double,
        arg3: Integer
    ): List[Double] = {
      exchange_rate.map(
        List(idx1, idx2, idx3),
        r.toSeq.toList
          .drop(idx4)
          .take(arg2)
      )
    }

    def mapper2() {}
    ...
    def mapper5() {}

    def my_udf(
        arg0: Integer,
        arg1: String,
        arg2: Double,
        arg3: Integer,
        ...
        arg20: String
    ): Double = {
      println("I'm actually doing something!")
      val result1 = mapper1(arg1, arg2, arg3)
      val result2 = mapper2(arg4, arg5, arg6, arg7)
      ...
      val result5 = mapper5(arg18, arg19, arg20)
      result1.take(arg0)
        .zipAll(result1, 0.0, 0.0)
        .map(x => x._1 * x._2)
        ...
        .zipAll(result5, 0.0, 0.0)
        .foldLeft(0.0)(_ + _)
    }

    spark.udf.register("myUDF", my_udf _)
    val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
      callUDF(
        "myUDF",
        $"col1",
        ...
        $"col20"
      )
    )

    <postprocessing>

    bigResultFinal
      .filter(<configs>)
      .select(<column names>)
      .write
      .format("parquet")
  }
}
To recap
This code runs to completion on each of two input files.
The udf only appears to execute on the first file.
There are no error messages or anything using the second file, although all non-udf logic appears to complete successfully.
Any help greatly appreciated!
Here the UDF is not being called because Spark is lazy: it will not evaluate the UDF until an action is run on the DataFrame. You can force the UDF to execute by triggering such an action.
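In other words, make sure something downstream actually consumes the column the UDF produces. A minimal sketch against the (hypothetical) names from the snippet above:

// The withColumn/callUDF lines only build a plan; nothing runs yet.
val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
  callUDF("myUDF", $"col1" /* ... $"col20" */)
)

// Any action forces the plan, and with it the UDF, to execute:
bigResult1.count()

// Finishing the write is itself an action (note the final .save(...),
// which the shortened snippet above stops just short of):
bigResult1.write.format("parquet").save("output.parquet")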
In my testing, I have a test trait to provide spark context:
trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1) {
import spark.implicits._
}
I managed to reduce this to one per file by adding the following to SparkTestTrait:
object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then in the constructor of the implementing file:
import testImplicits._
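Put together, the setup described above ends up looking roughly like this (a sketch assuming ScalaTest's AnyFunSuite and a local master; adjust both to your actual test setup):

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]") // assumed for local test runs
    .getOrCreate()

  // One import per test file instead of one per test function.
  object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

class MyTest extends AnyFunSuite with SparkTestTrait {
  import testImplicits._

  test("test1") {
    assert(Seq(1, 2, 3).toDF("value").count() == 3)
  }
}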
However, I would prefer to have these implicits imported to all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?
My Spark code is cluttered with code like this:
object Transformations {
  def selectI(df: DataFrame): DataFrame = {
    // needed to use $ to generate ColumnName
    import df.sparkSession.implicits._
    df.select($"i")
  }
}
or alternatively
object Transformations {
  def selectI(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    // needed to use $ to generate ColumnName
    import spark.implicits._
    df.select($"i")
  }
}
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like :
object Transformations {
  import org.apache.spark.sql.SQLImplicits._ // does not work

  def selectI(df: DataFrame): DataFrame = {
    df.select($"i")
  }
}
Is there an elegant solution for this problem? My use of the implicits is not limited to $ but also Encoders, .toDF() etc.
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like
Because every Dataset exists in the scope of a specific SparkSession, and a single Spark application can have multiple active SparkSessions.
Theoretically, some of the SparkSession.implicits._ could exist separately from the session instance, like:
import org.apache.spark.sql.implicits._ // For let's say `$` or `Encoders`
import org.apache.spark.sql.SparkSession.builder.getOrCreate.implicits._ // For toDF
but it would have a significant impact on the user code.
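In practice, the closest workaround is the same SQLImplicits trick shown in the test-trait question above: a single importable object bound to whichever session is current. A minimal sketch, assuming there is exactly one active session per application:

import org.apache.spark.sql.{DataFrame, SQLContext, SQLImplicits, SparkSession}

// One shared implicits object; it still resolves *a* session lazily at use
// time, but callers no longer need to thread a SparkSession parameter around.
object Implicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext =
    SparkSession.builder.getOrCreate().sqlContext
}

object Transformations {
  import Implicits._

  def selectI(df: DataFrame): DataFrame = df.select($"i")
}

This keeps the single-import ergonomics, at the cost of silently picking up whatever session getOrCreate returns, which is exactly the multi-session ambiguity described above.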
Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate using this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM; I wanted to eventually replace this with a Scala macro, but for now this uses Janino and SimpleCompiler to compile the code and reload the class.
Sometimes these model strings contain variables and functions. The variables are easy to put in, since they can be string-replaced in the generated code, and the functions are for the most part easy too, because if their names map to an existing static function then it will just be executed when the model is evaluated. For instance, an exponent maps to pow in scala.math.
So my issue, specifically, is implementing a lag and lead function for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while this function would be called inside a UDF, so I am having trouble referencing that data from inside the UDF.
So I have this code:
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}
object Functions {
  val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
  val ctx: SparkContext = new SparkContext(conf)
  val hctx: HiveContext = new HiveContext(ctx)

  import hctx.implicits._

  def lag(x: Double, window: Int): Double = {
    x
  }

  def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
    slag(c, window).over(windowSpec).as(c.toString() + "_lag")
  }

  def main(args: Array[String]): Unit = {
    val funcUdf = udf((f: Column) => lag(f))
    val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
    implicit val spec: WindowSpec = Window.orderBy($"value")
    data.select(funcUdf($"value")).show()
  }
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way or do I have to compute lag windows ahead of time and pass them all around?
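For reference, the "compute the lag ahead of time" route mentioned in the last sentence would look roughly like this: the window function runs at the DataFrame layer, and the UDF then receives the lagged value as an ordinary input column (written here against the SparkSession API rather than the HiveContext style above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, udf}

object LagBeforeUdf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Blah").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(0, 1, 2, 3, 4, 5).toDF("value")

    // The lag is materialized as a plain column before any UDF is involved...
    val spec = Window.orderBy($"value")
    val withLag = data.withColumn("value_lag", lag($"value", 1).over(spec))

    // ...so the UDF only ever sees values, never Columns. (On the first row
    // value_lag is null; with primitive Int parameters Spark returns null for
    // that row without invoking the UDF.)
    val diffUdf = udf((current: Int, previous: Int) => current - previous)

    withLag.withColumn("diff", diffUdf($"value", $"value_lag")).show()
  }
}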