What FileOutputCommitter should be used when writing AVRO files in Spark? - scala

When saving an RDD to S3 in AVRO, I get the following warning in the console:
Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
I haven't been able to find a simple implicit such as saveAsAvroFile, so I've dug around and came up with this:
import org.apache.avro.Schema
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
object AvroUtil {
  def write[T](
      path: String,
      schema: Schema,
      avroRdd: RDD[T],
      job: Job = Job.getInstance()): Unit = {
    val intermediateRdd = avroRdd.mapPartitions(
      f = (iter: Iterator[T]) => iter.map(new AvroKey(_) -> NullWritable.get()),
      preservesPartitioning = true
    )
    job.getConfiguration.set("avro.output.codec", "snappy")
    job.getConfiguration.set("mapreduce.output.fileoutputformat.compress", "true")
    AvroJob.setOutputKeySchema(job, schema)
    intermediateRdd.saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[T]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[T]],
      job.getConfiguration
    )
  }
}
I'm rather baffled, as I don't see what is incorrect: the AVRO files seem to be output correctly.

You can override the behaviour of the existing FileOutputCommitter by implementing your own OutputCommitter to make it more efficient and safe.
Follow this link, where the author explains a similar case with an example.
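For illustration, here is a minimal sketch (not from the linked post) of where such a committer plugs in when using saveAsNewAPIHadoopFile: a custom output format overrides getOutputCommitter. Here it only reconfigures the stock FileOutputCommitter to use its v2 algorithm (Hadoop 2.7+), which renames task output directly into the destination; a fully custom OutputCommitter subclass would be returned from the same hook.

import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat}

// Hypothetical output format: same Avro output, different committer setup.
class V2AvroKeyOutputFormat[T] extends AvroKeyOutputFormat[T] {
  override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter = {
    // Switch the stock committer to the v2 (task-side rename) algorithm.
    context.getConfiguration.setInt(
      FileOutputCommitter.FILEOUTPUTCOMMITTER_ALGORITHM_VERSION, 2)
    new FileOutputCommitter(FileOutputFormat.getOutputPath(context), context)
  }
}

Passing classOf[V2AvroKeyOutputFormat[T]] instead of classOf[AvroKeyOutputFormat[T]] in the write method above would then pick up this committer. Bear in mind that even the v2 algorithm still relies on renames, which are copies on S3, so an S3-specific committer remains the safer long-term option.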

Related

Import custom Scala UDAF into PySpark

With the grace of StackOverflow experts I have managed to tinker with one of the provided examples and create a Scala UDAF which provides the functionality I am looking for. The structure of the UDAF/function etc. is as below :-
case class InputRow(ddate: String, ccount: String, iitem: String)
case class Buffer(var max_ddate: String, var ddue_dt: Map[String,String])
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import java.time.LocalDate
import java.time.format.DateTimeFormatter
object RecursiveAggregatorZ extends Aggregator[InputRow, Buffer, Buffer] {
  override def zero: Buffer = Buffer(null, null)
  override def reduce(buffer: Buffer, currentRow: InputRow): Buffer = {
    <LOGIC HERE>
    buffer
  }
  override def merge(b1: Buffer, b2: Buffer): Buffer = {
    throw new NotImplementedError("should be used only over ordered window")
  }
  override def finish(reduction: Buffer): Buffer = reduction
  override def bufferEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
  override def outputEncoder: Encoder[Buffer] = ExpressionEncoder[Buffer]
}
I can run the actual code via Zeppelin and then execute the below to register it for use within Spark :-
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val recursiveAggregatorZ = udaf(RecursiveAggregatorZ)
spark.udf.register("recursiveAggregatorZ",recursiveAggregatorZ)
However, I am looking for a way to incorporate this into PySpark so that it can be used beyond Spark SQL alone. I rummaged through most of the hits Google provides, wherein we have to first package the Scala code into a jar and so on, but most of them use classes as examples rather than an object like the snippet I have provided.
Would really appreciate it if anyone could guide me on:-
(1) How exactly to build a jar out of this scala function
(2) How exactly to push it into PySpark and have it registered within PySpark.
Just for the sake of clarity, within Zeppelin I am able to run queries such as :-
select recursiveAggregatorZ(column1, column2, column3) over (partition by partition1, partition2 order by rn) as output from phase1
and get the output I am looking for.
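One hedged sketch of what the Scala side of such a jar might look like (the helper object and its name are hypothetical, not taken from any answer): an object that registers the Aggregator against a given SparkSession, so that PySpark can trigger the registration through the JVM gateway.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udaf

// Hypothetical helper, compiled into the same jar as RecursiveAggregatorZ.
object RecursiveAggregatorZRegistrar {
  def register(spark: SparkSession): Unit =
    spark.udf.register("recursiveAggregatorZ", udaf(RecursiveAggregatorZ))
}

With that jar on the classpath (e.g. via spark.jars), the Python side could presumably call something like spark.sparkContext._jvm.<your.package>.RecursiveAggregatorZRegistrar.register(spark._jsparkSession), after which recursiveAggregatorZ should be usable from spark.sql in PySpark.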

Spark Catalyst flatMapGroupsWithState: Group State with sorted collection

I am trying to keep a sorted collection in the state of my groups, and I get an error from Catalyst which I think relates to default instance creation for the collection.
Below is a simplified pipeline that demonstrates the error:
package com.example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import scala.collection.immutable.TreeMap
case class Event
(
  key: String
)

case class KeyState
(
  prop: TreeMap[Long, String]
)

object CatalystIssue {
  def updateState(k: String, vs: Iterator[Event],
      state: GroupState[KeyState]): Iterator[Event] = vs

  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CatalystIssue")
      .getOrCreate()
    import spark.implicits._

    val df = spark.readStream.format("rate")
      .load()
      .select(lit("a").as("key"))
      .as[Event]
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Append(),
        GroupStateTimeout.NoTimeout())(updateState)

    val query = df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("30 seconds")).start()
    query.awaitTermination()
  }
}
Which produces the error:
ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 53, Column 106: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.generic.SortedMapFactory.newBuilder(scala.math.Ordering)"
This might be because sorted maps are not supported as a DataFrame attribute type, although that is not my intention here: I would have thought the KeyState would be opaque to Spark, since you don't actually access it like a DataFrame attribute.
While not very attractive, one option might be to serialize the sorted map into a byte array which is an attribute of the KeyState, i.e.
case class KeyState
(
  prop: Array[Byte]
)
If Java serialization were used, would that preserve the internal tree structure of the TreeMap, so that at least it would not have to be rebuilt? Are there any alternative serialization technologies that would preserve the structure?
It seems useful to be able to keep some sorted collections in the group state, especially as the computation is supposed to be primarily in memory. Is there something about the way Spark works that makes this fundamentally unworkable?

Spark job completes without executing udf

I've been having an issue with a long, complicated spark job which contains a udf.
The issue I've been having is that the udf doesn't seem to get called properly, although there is no error message.
I know it isn't called properly because the output gets written, but anything the udf was supposed to calculate is NULL, and no print statements appear when debugging locally.
The only lead is that this code previously worked using different input data, meaning the error must have something to do with the input.
The change in inputs mostly means different column names are used, which is addressed in the code.
Print statements are executed given the first, 'working' input.
Both inputs are created using the same series of steps from the same database, and by inspection there doesn't appear to be a problem with either one.
I've never experienced this sort of behaviour before, and any leads on what might cause it would be appreciated.
The code is monolithic and inflexible - I'm working on refactoring, but it's not an easy piece to break apart. This is a short version of what is happening:
package mypackage
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.util._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types._
import scala.collection.{Map => SMap}
object MyObject {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession.builder()
      .appName("my app")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    val bigInput = spark.read.parquet("inputname.parquet")
    val reference_table = spark.read.parquet("reference_table.parquet")
    val exchange_rate = spark.read.parquet("reference_table.parquet")

    val bigInput2 = bigInput
      .filter($"column1" === "condition1")
      .join(joinargs)
      .drop(dropargs)

    val bigInput3 = bigInput
      .filter($"column2" === "condition2")
      .join(joinargs)
      .drop(dropargs)

    <continue for many lines...>

    def mapper1(
        arg1: String,
        arg2: Double,
        arg3: Integer
    ): List[Double] = {
      exchange_rate.map(
        List(idx1, idx2, idx3),
        r.toSeq.toList
          .drop(idx4)
          .take(arg2)
      )
    }

    def mapper2(){}
    ...
    def mapper5(){}

    def my_udf(
        arg0: Integer,
        arg1: String,
        arg2: Double,
        arg3: Integer,
        ...
        arg20: String
    ): Double = {
      println("I'm actually doing something!")
      val result1 = mapper1(arg1, arg2, arg3)
      val result2 = mapper2(arg4, arg5, arg6, arg7)
      ...
      val result5 = mapper5(arg18, arg19, arg20)
      result1.take(arg0)
        .zipAll(result1, 0.0, 0.0)
        .map(x => x._1 * x._2)
        ....
        .zipAll(result5, 0.0, 0.0)
        .foldLeft(0.0)(_ + _)
    }

    spark.udf.register("myUDF", my_udf _)

    val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
      callUDF(
        "myUDF",
        $"col1",
        ...
        $"col20"
      )
    )

    <postprocessing>

    bigResultFinal
      .filter(<configs>)
      .select(<column names>)
      .write
      .format("parquet")
  }
}
To recap
This code runs to completion on each of two input files.
The udf only appears to execute on the first file.
There are no error messages or anything using the second file, although all non-udf logic appears to complete successfully.
Any help greatly appreciated!
Here the UDF is not being called because Spark is lazy: it does not evaluate the UDF unless you run an action on the DataFrame. You can force this by calling a DataFrame action.
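For example, a minimal sketch of forcing evaluation (the output path below is hypothetical): the abridged snippet above stops at .format("parquet") without an action, so completing the write, or calling count()/show() while debugging, is what actually runs the plan and therefore the UDF.

// Completing the write is an action and triggers the whole lazy plan, UDF included.
bigResultFinal
  .write
  .format("parquet")
  .save("output/bigResult.parquet") // hypothetical path

// Or, purely to force evaluation while debugging:
bigResultFinal.count()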

import implicit conversions without instance of SparkSession

My Spark code is cluttered with code like this:
object Transformations {
  def selectI(df: DataFrame): DataFrame = {
    // needed to use $ to generate ColumnName
    import df.sparkSession.implicits._
    df.select($"i")
  }
}
or alternatively
object Transformations {
  def selectI(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
    // needed to use $ to generate ColumnName
    import spark.implicits._
    df.select($"i")
  }
}
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like :
object Transformations {
  import org.apache.spark.sql.SQLImplicits._ // does not work
  def selectI(df: DataFrame): DataFrame = {
    df.select($"i")
  }
}
Is there an elegant solution for this problem? My use of the implicits is not limited to $; it also covers Encoders, .toDF(), and so on.
I don't really understand why we need an instance of SparkSession just to import these implicit conversions. I would rather like to do something like
Because every Dataset exists in the scope of a specific SparkSession, and a single Spark application can have multiple active SparkSessions.
Theoretically some of the SparkSession.implicits._ could exist separately from the session instance, for example:
import org.apache.spark.sql.implicits._ // For let's say `$` or `Encoders`
import org.apache.spark.sql.SparkSession.builder.getOrCreate.implicits._ // For toDF
but it would have a significant impact on the user code.
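For completeness, a commonly cited workaround (a sketch, assuming a Spark version where SQLImplicits exposes the protected _sqlContext member) is to define your own object extending SQLImplicits and resolve the session lazily at use time:

import org.apache.spark.sql.{DataFrame, SQLContext, SQLImplicits, SparkSession}

// Hypothetical helper object: resolves the active session whenever an implicit is used.
object Implicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext =
    SparkSession.builder.getOrCreate().sqlContext
}

object Transformations {
  import Implicits._
  def selectI(df: DataFrame): DataFrame = df.select($"i")
}

This keeps $, Encoders and .toDF() importable without threading a SparkSession through every method, at the cost of implicitly depending on whichever session getOrCreate() returns.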

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the Parquet file using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader
object parquetToJson {
  def main(args: Array[String]): Unit = {
    //case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state:String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build()//.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read specific columns.
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection).
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema) where schema is an Avro schema in String form.
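For what it's worth, a minimal sketch of such a projection using the Avro-specific hook AvroReadSupport.setRequestedProjection (the Avro-flavoured counterpart of the read-schema idea above; the two-field schema is hypothetical, based on the commented-out Customer case class):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

object parquetProjection {
  def main(args: Array[String]): Unit = {
    // Hypothetical projection: only `key` and `name` from the Customer records.
    val projection = new Schema.Parser().parse(
      """{"type":"record","name":"Customer","fields":[
        |  {"name":"key","type":"int"},
        |  {"name":"name","type":"string"}
        |]}""".stripMargin)

    val conf = new Configuration()
    AvroReadSupport.setRequestedProjection(conf, projection)

    val reader = AvroParquetReader.builder[GenericRecord](new Path("data/parquet/Customer/"))
      .withConf(conf)
      .build()
    Iterator.continually(reader.read).takeWhile(_ != null).foreach(println)
    reader.close()
  }
}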