My Spark-Code is cluttered with code like this
object Transformations {
  def selectI(df: DataFrame): DataFrame = {
    // needed to use $ to generate ColumnName
    import df.sparkSession.implicits._
    df.select($"i")
  }
}
or alternatively
object Transformations {
  def selectI(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    // needed to use $ to generate ColumnName
    import spark.implicits._
    df.select($"i")
  }
}
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather do something like this:
object Transformations {
  import org.apache.spark.sql.SQLImplicits._ // does not work
  def selectI(df: DataFrame): DataFrame = {
    df.select($"i")
  }
}
Is there an elegant solution for this problem? My use of the implicits is not limited to $; I also need Encoders, .toDF(), etc.
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like
Because every Dataset exists in the scope of a specific SparkSession, and a single Spark application can have multiple active SparkSessions.
In theory, some of the SparkSession.implicits._ could exist separately from the session instance, for example:
import org.apache.spark.sql.implicits._ // for, say, $ or Encoders
import org.apache.spark.sql.SparkSession.builder.getOrCreate.implicits._ // for toDF
but it would have a significant impact on the user code.
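That said, one common workaround is to back SQLImplicits with whatever session is currently active yourself. A minimal sketch, assuming a Spark version where SQLImplicits exposes an overridable _sqlContext (the object name here is arbitrary):

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

// Back the implicits with the active (or lazily created) session,
// so call sites import one stable object instead of a session's implicits.
object SessionlessImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext =
    SparkSession.builder.getOrCreate().sqlContext
}

With that in scope, Transformations can simply do import SessionlessImplicits._ and use $"i", the case-class Encoders, and .toDF() without threading a session through every method.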
With the help of StackOverflow experts I have managed to tinker with one of the provided examples and create a Scala UDAF that gives me the functionality I am looking for. The structure of the UDAF is as follows:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import java.time.LocalDate
import java.time.format.DateTimeFormatter

case class InputRow(ddate: String, ccount: String, iitem: String)
case class Buffer(var max_ddate: String, var ddue_dt: Map[String, String])

object RecursiveAggregatorZ extends Aggregator[InputRow, Buffer, Buffer] {
  override def zero: Buffer = Buffer(null, null)

  override def reduce(buffer: Buffer, currentRow: InputRow): Buffer = {
    // <LOGIC HERE>
    buffer
  }

  override def merge(b1: Buffer, b2: Buffer): Buffer = {
    throw new NotImplementedError("should be used only over ordered window")
  }

  override def finish(reduction: Buffer): Buffer = reduction

  override def bufferEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
  override def outputEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
}
I can run the actual code via Zeppelin and then execute the following to register it for use within Spark:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}

val recursiveAggregatorZ = udaf(RecursiveAggregatorZ)
spark.udf.register("recursiveAggregatorZ", recursiveAggregatorZ)
However, I am looking for a way to incorporate this into PySpark so that it can be used beyond Spark SQL alone. I rummaged through most of the Google hits, which say the Scala code first has to be packaged into a jar, but most of them use classes as examples rather than an object like the snippet I have provided.
I would really appreciate it if anyone could guide me on:
(1) How exactly to build a jar out of this Scala function.
(2) How exactly to push it into PySpark and register it within PySpark.
Just for the sake of clarity, within Zeppelin I am able to run queries such as:
select recursiveAggregatorZ(column1, column2, column3) over (partition by partition1, partition2 order by rn) as output from phase1
and get the output I am looking for.
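One possible approach, sketched below with assumed names (the package, object, and jar names are not from the question): keep the Aggregator and its case classes in a small sbt project with spark-sql as a "provided" dependency, add a registrar object next to them, and build the jar with sbt package.

// Hypothetical registrar (names are assumptions): a single JVM-callable entry point
// that registers the Aggregator-based UDAF on a given session. It assumes
// RecursiveAggregatorZ, InputRow and Buffer live in the same package and jar.
package com.example.udafs

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udaf

object RecursiveAggregatorZRegistrar {
  def register(spark: SparkSession): Unit =
    spark.udf.register("recursiveAggregatorZ", udaf(RecursiveAggregatorZ))
}

On the PySpark side (again, paths and names are assumptions), ship the jar via --jars or spark.jars and call the registrar through the JVM gateway, e.g. spark.sparkContext._jvm.com.example.udafs.RecursiveAggregatorZRegistrar.register(spark._jsparkSession). After that call, the function should be available to spark.sql(...) queries issued from PySpark, just like the Zeppelin query above.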
When saving an RDD to S3 in AVRO, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile, and therefore I've dug around and come up with this:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

object AvroUtil {
  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {

    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )

    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
    AvroJob.setOutputKeySchema(job, schema)

    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}
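For reference, a minimal usage sketch of this helper (the bucket path, schema, and record shape below are placeholder assumptions, not real data):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.spark.sql.SparkSession

object AvroUtilExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-util-example").getOrCreate()

    // Keep the schema as a string and parse it where needed; Avro's Schema
    // is not reliably serializable into task closures.
    val schemaJson =
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

    val records = spark.sparkContext.parallelize(Seq(1L, 2L, 3L)).mapPartitions { ids =>
      val schema = new Schema.Parser().parse(schemaJson)
      ids.map { id =>
        val rec: GenericRecord = new GenericData.Record(schema)
        rec.put("id", id)
        rec
      }
    }

    AvroUtil.write("s3a://my-bucket/events/", new Schema.Parser().parse(schemaJson), records)
  }
}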
I'm rather baffled, as I don't see what is incorrect; the Avro files seem to be written correctly.
You can override the behaviour of the existing FileOutputCommitter by implementing your own OutputCommitter to make it more efficient and safer.
Follow this link, where the author explains a similar approach with an example.
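Short of writing a custom committer, a lighter-weight mitigation is configuration. A hedged sketch (the exact effect depends on your Hadoop/S3A versions, and the v2 algorithm trades commit safety for speed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-to-s3")
  // v2 commit algorithm: task output is moved to the destination as each task
  // commits, avoiding the slow serial rename during job commit (faster, not atomic).
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // On Hadoop 3.1+ with the S3A connector and the cloud committers on the
  // classpath, a dedicated S3A committer can be selected instead:
  // .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .getOrCreate()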
I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) = {
    // ...
    val releaseDate = tableDF
      .where(tableDF("item") <=> "releaseDate")
      .select("value")
      .map(r => r.getString(0))
      .collect
      .toList
      .head
  }
}
This is all good, except Spark 2.3 reports
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand that I can add
import spark.implicits._
before the process; however, it requires a SparkSession instance.
It's pretty annoying, especially when there is no SparkSession instance available in the method. As a Spark newbie, how do I nicely resolve the implicit Encoder parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import its implicits, which will work for all case classes. This is the easiest way to add encoding. Alternatively, an explicit encoder can be passed using the Encoders class.
val spark = SparkSession.builder
  .appName("name")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._
The other way is to get the SparkSession from the DataFrame via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
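The explicit-encoder alternative mentioned above avoids the import entirely. A minimal sketch of a variant of the method above (the method name is arbitrary):

import org.apache.spark.sql.{DataFrame, Encoders}

def releaseDate(tableDF: DataFrame): String =
  tableDF
    .where(tableDF("item") <=> "releaseDate")
    .select("value")
    .map(r => r.getString(0))(Encoders.STRING) // Encoder passed explicitly, no implicits import
    .collect()
    .head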
In my tests, I have a test trait that provides a SparkSession:
trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1) {
import spark.implicits._
}
I managed to reduce this to one per file by adding the following to SparkTestTrait:
object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then in the constructor of the implementing class:
import testImplicits._
However, I would prefer to have these implicits available in all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?
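Not a full answer to the no-import wish, but one way to consolidate the approach described above is to move the testImplicits object into the trait itself, so every suite that mixes in SparkTestTrait shares the same object and only ever needs the single per-file import. A sketch, assuming a Spark version where SQLImplicits exposes an overridable _sqlContext:

import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}

trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()

  // One import per suite: `import testImplicits._` in the class body.
  protected object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}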
Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate over this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM; I want to eventually replace this with a Scala macro, but for now it uses Janino and SimpleCompiler to compile the code and reload the class.
Sometimes these model strings contain variables and functions. The variables are easy to handle, since they can be string-replaced in the generated code, and the functions are mostly easy too, because if their names map to an existing static function then that function is simply executed when the model is evaluated. For instance, an exponent maps to pow in scala.math.
My issue is specifically implementing a lag and lead function for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while my function would be called inside a UDF, so I am having trouble referencing that data from within the UDF.
So I have this code
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}

object Functions {
  val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
  val ctx: SparkContext = new SparkContext(conf)
  val hctx: HiveContext = new HiveContext(ctx)

  import hctx.implicits._

  def lag(x: Double, window: Int): Double = {
    x
  }

  def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
    slag(c, window).over(windowSpec).as(c.toString() + "_lag")
  }

  def main(args: Array[String]): Unit = {
    val funcUdf = udf((f: Column) => lag(f))
    val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
    implicit val spec: WindowSpec = Window.orderBy($"value")
    data.select(funcUdf($"value")).show()
  }
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way around it, or do I have to compute the lag windows ahead of time and pass them all around?
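A sketch of the usual pattern (not the asker's code, and using a Spark 2+ SparkSession rather than HiveContext): pre-compute the lagged columns with the DataFrame/window API and hand them to the UDF as plain per-row values, since a UDF only ever sees one row's inputs and cannot reach back into the window. The "model" function below is a hypothetical stand-in for the generated code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, udf}

object PrecomputedLagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lag-example").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(0, 1, 2, 3, 4, 5).toDF("value")
    val spec = Window.orderBy($"value")

    // Hypothetical model: it only sees plain Double inputs, never the window itself.
    val model = udf((current: Double, lagged: Double) => current - lagged)

    data
      // Compute the lag once in the DataFrame layer, with a default for the first row.
      .withColumn("value_lag", lag($"value", 1, 0).over(spec))
      // Then feed the pre-computed column into the UDF like any other value.
      .select(model($"value".cast("double"), $"value_lag".cast("double")).as("model_output"))
      .show()
  }
}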