Avoid specifying schema twice (Spark/Scala)

I need to iterate over a data frame in a specific order and apply some complex logic to calculate a new column.
Also, my strong preference is to do it in a generic way so I do not have to list all the columns of a row and do df.as[my_record] or case Row(...) => as shown here. Instead, I want to access row columns by their names and just add the result column(s) to the source row.
The approach below works just fine, but I'd like to avoid specifying the schema twice: first so that I can access columns by name while iterating, and a second time to process the output.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val q = """
select 2 part, 1 id
union all select 2 part, 4 id
union all select 2 part, 3 id
union all select 2 part, 2 id
"""
val df = spark.sql(q)
def f_row(iter: Iterator[Row]): Iterator[Row] = {
  if (iter.hasNext) {
    def complex_logic(p: Int): Integer = if (p == 3) null else p * 10
    val head = iter.next
    val schema = StructType(head.schema.fields :+ StructField("result", IntegerType))
    val r =
      new GenericRowWithSchema((head.toSeq :+ complex_logic(head.getAs("id"))).toArray, schema)
    iter.scanLeft(r)((r1, r2) =>
      new GenericRowWithSchema((r2.toSeq :+ complex_logic(r2.getAs("id"))).toArray, schema)
    )
  } else iter
}
val schema = StructType(df.schema.fields :+ StructField("result", IntegerType))
val encoder = RowEncoder(schema)
df.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row)(encoder).show
What information is lost after applying mapPartitions so output cannot be processed without explicit encoder? How to avoid specifying it?

What information is lost after applying mapPartitions so output cannot be processed without explicit encoder?
The information is hardly lost - it wasn't there from the beginning - subclasses of Row or InternalRow are basically untyped, variable-shape containers, which don't provide any useful type information that could be used to derive an Encoder.
schema in GenericRowWithSchema is inconsequential, as it describes content in terms of metadata, not types.
How to avoid specifying it?
Sorry, you're out of luck. If you want to use dynamically typed constructs (a bag of Any) in a statically typed language you have to pay the price, which here is providing an Encoder.
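If the goal is simply not to write the schema literal twice, one option within the question's constraints is to derive the output schema once from df.schema and close over it, so both the rows and the Encoder share the same value. A sketch against the question's code (the Encoder itself still cannot be avoided; the scanLeft is simplified to a plain map, since it only forwards rows):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import spark.implicits._

// Derived once, used for both the output rows and the encoder
val outSchema = StructType(df.schema.fields :+ StructField("result", IntegerType))

def f_row(iter: Iterator[Row]): Iterator[Row] = iter.map { row =>
  def complex_logic(p: Int): Integer = if (p == 3) null else p * 10
  new GenericRowWithSchema(
    (row.toSeq :+ complex_logic(row.getAs[Int]("id"))).toArray, outSchema)
}

df.repartition($"part").sortWithinPartitions($"id")
  .mapPartitions(f_row)(RowEncoder(outSchema))
  .show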

OK - I have checked some of my spark code and using .mapPartitions with the Dataset API does not require me to explicitly build/pass an encoder.
You need something like:
case class Before(part: Int, id: Int)
case class After(part: Int, id: Int, newCol: String)
import spark.implicits._
// Note column names/types must match case class constructor parameters.
val beforeDS = <however you obtain your input DF>.as[Before]
def f_row(it: Iterator[Before]): Iterator[After] = ???
beforeDS.repartition($"part").sortWithinPartitions($"id").mapPartitions(f_row).show
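To make that concrete, here is a minimal f_row body in plain Scala, reusing the question's complex_logic idea (the String newCol and the sample values are just illustrative):

```scala
case class Before(part: Int, id: Int)
case class After(part: Int, id: Int, newCol: String)

// Hypothetical stand-in for the question's complex_logic, rendered as a String column
def complexLogic(id: Int): String =
  if (id == 3) null else (id * 10).toString

// The function passed to mapPartitions: plain Iterator-to-Iterator logic
def f_row(it: Iterator[Before]): Iterator[After] =
  it.map(b => After(b.part, b.id, complexLogic(b.id)))

val out = f_row(Iterator(Before(2, 1), Before(2, 3))).toList
// List(After(2,1,"10"), After(2,3,null))
```

Because Before and After are case classes, spark.implicits._ can derive both Encoders, which is why no explicit encoder argument is needed.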

I found below explanation sufficient, maybe it will be useful for others.
mapPartitions requires an Encoder because otherwise it cannot construct a Dataset from an iterator of Rows. Even though each row has a schema, that schema cannot be derived (used) by the constructor of Dataset[U].
def mapPartitions[U : Encoder](func: Iterator[T] => Iterator[U]): Dataset[U] = {
  new Dataset[U](
    sparkSession,
    MapPartitions[T, U](func, logicalPlan),
    implicitly[Encoder[U]])
}
On the other hand, without calling mapPartitions, Spark can use the schema derived from the initial query because the structure (metadata) of the original columns is not changed.
I described alternatives in this answer: https://stackoverflow.com/a/53177628/7869491.

Related

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a Seq of Rows. I want to append a new field to each record of colA. (And the new field is associated with the former record, so I have to write a udf.)
How should I write this udf?
I have tried to write a udf which takes colA as input and outputs Seq[Row], where each record contains the new field. But the problem is that the udf cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:
val schema = ArrayType(DoubleType)
val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or Seqs thereof) from UDFs.
EDIT : It could be useful if your row contains more than 22 fields (limit of fields for tuples/case classes)
This is an old question, I just wanted to update it according to the new version of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking, and the behavior might be different from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set spark.sql.legacy.allowUntypedScalaUDF configuration to true.
Another solution is to use case class instead of schema. For example,
case class Foo(field1: String, field2: String)
val convertFunction: Seq[Row] => Seq[Foo] = input =>
  input.map { x =>
    ??? // do something with x and convert to Foo
  }
val myUdf = udf(convertFunction)

How do I create a Row with a defined schema in Spark-Scala?

I'd like to create a Row with a schema from a case class to test one of my map functions. The most straightforward way I can think of doing this is:
import org.apache.spark.sql.Row
case class MyCaseClass(foo: String, bar: Option[String])
def buildRowWithSchema(record: MyCaseClass): Row = {
  sparkSession.createDataFrame(Seq(record)).collect.head
}
However, this seemed like a lot of overhead to just get a single Row, so I looked into how I could directly create a Row with a schema. This led me to:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.{Encoders, Row}
def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map { field =>
    field.setAccessible(true)
    field.get(record)
  }
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
Unfortunately, the Row that the second version returns is different from the first Row. Option fields in the first version are reduced to their primitive values, while they are still Options in the second version. Also, the second version is quite unwieldy.
Is there a better way to do this?
The second version returns the Option itself for the bar case class field, which is why you are not getting the primitive value as in the first version. You can use the following code for primitive values:
def buildRowWithSchemaV2(record: MyCaseClass): Row = {
  val recordValues: Array[Any] = record.getClass.getDeclaredFields.map { field =>
    field.setAccessible(true)
    val returnValue = field.get(record)
    if (returnValue.isInstanceOf[Option[_]])
      returnValue.asInstanceOf[Option[Any]].orNull // .get would throw on None
    else
      returnValue
  }
  new GenericRowWithSchema(recordValues, Encoders.product[MyCaseClass].schema)
}
But meanwhile, I would suggest you use a DataFrame or Dataset.
DataFrames and Datasets are themselves collections of Rows with schema.
So when you have a case class defined, you just need to encode your input data into the case class.
For example, let's say you have input data as
val data = Seq(("test1", "value1"),("test2", "value2"),("test3", "value3"),("test4", null))
If you have a text file you can read it with sparkContext.textFile and split according to your need.
Now that you have your data, converting it to a dataframe or dataset is two lines of code:
import sqlContext.implicits._
val dataFrame = data.map(d => MyCaseClass(d._1, Option(d._2))).toDF
.toDS would generate a dataset.
Thus you have a collection of Rows with schema.
For validation you can do the following:
println(dataFrame.schema) // check that there is a schema
println(dataFrame.take(1).getClass.getName) // check that it is a collection of Rows
Hope this is the right answer.

How to create udf containing Array (case class) for complex column in a dataframe

I have a dataframe which has a complex column of type ArrayType of StructType. For transforming this dataframe I have created a udf which can consume this column using Array[case class] as the parameter. The main bottleneck here is that when I create the case class according to the structtype, the structfield name contains special characters, for example "##field". So I give the same name to the case class field (case class(##field)) and attach this to the udf parameter. After being interpreted, the spark udf definition changes the name of the case class field to "$hash$hashfield". When performing the transform using this dataframe it fails because of this mismatch. Please help ...
Due to JVM limitations, Scala stores identifiers in encoded form, and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract fields manually from the raw row (but you need to know the order of the fields in the df; you can use df.schema for that):
val myUdf = udf { (struct: Row) =>
  // Pattern match on the struct:
  struct match {
    case Row(a: String) => Foo(a)
  }
  // .. or extract values from the Row by position
  val `##a` = struct.getAs[String](0)
}
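The encoding involved is Scala's standard name mangling, which you can inspect (and reverse) with scala.reflect.NameTransformer from the standard library:

```scala
import scala.reflect.NameTransformer

// Each '#' in the identifier is mangled to "$hash"
val encoded = NameTransformer.encode("##field")
// "$hash$hashfield" - what ends up in the compiled case class

val decoded = NameTransformer.decode("$hash$hashfield")
// "##field" - back to the original identifier
```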

UDAF in Spark with multiple input columns

I am trying to develop a user-defined aggregate function that computes a linear regression on a series of numbers. I have successfully done a UDAF that calculates confidence intervals of means (with a lot of trial and error and SO!).
Here's what actually runs for me already:
import org.apache.spark.sql._
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, LongType, DataType, ArrayType}

case class RegressionData(intercept: Double, slope: Double)

class Regression {
  import org.apache.commons.math3.stat.regression.SimpleRegression

  def roundAt(p: Int)(n: Double): Double = { val s = math pow (10, p); (math round n * s) / s }

  def getRegression(data: List[Long]): RegressionData = {
    val regression: SimpleRegression = new SimpleRegression()
    data.view.zipWithIndex.foreach { d =>
      regression.addData(d._2.toDouble, d._1.toDouble)
    }
    RegressionData(roundAt(3)(regression.getIntercept()), roundAt(3)(regression.getSlope()))
  }
}

class UDAFRegression extends UserDefinedAggregateFunction {
  import java.util.ArrayList

  def deterministic = true

  def inputSchema: StructType =
    new StructType().add("units", LongType)

  def bufferSchema: StructType =
    new StructType().add("buff", ArrayType(LongType))

  def dataType: DataType =
    new StructType()
      .add("intercept", DoubleType)
      .add("slope", DoubleType)

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, new ArrayList[Long]())
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val longList: ArrayList[Long] = new ArrayList[Long](buffer.getList(0))
    longList.add(input.getLong(0))
    buffer.update(0, longList)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val longList: ArrayList[Long] = new ArrayList[Long](buffer1.getList(0))
    longList.addAll(buffer2.getList(0))
    buffer1.update(0, longList)
  }

  def evaluate(buffer: Row) = {
    import scala.collection.JavaConverters._
    val list = buffer.getList(0).asScala.toList
    val regression = new Regression
    regression.getRegression(list)
  }
}
However, the datasets do not come in order, which is obviously very important here. Hence instead of regression($"longValue") I need a second param: regression($"longValue", $"created_day"). created_day is a sql.types.DateType.
I am pretty confused by DataTypes, StructTypes and what-not, and due to the lack of examples on the web, I got stuck with my trial-and-error attempts here.
What would my bufferSchema look like?
Are those StructTypes overhead in my case? Wouldn't a (mutable) Map just do? Is MapType actually immutable and isn't this rather pointless to be a buffer type?
What would my inputSchema look like?
Does this have to match the type I retrieve in update(), in my case via input.getLong(0)?
Is there a standard way to reset the buffer in initialize()?
I have seen buffer.update(0, 0.0) (when it contains Doubles, obviously), buffer(0) = new WhatEver() and I think even buffer = Nil. Does any of these make a difference?
How do I update the data?
The example above seems overcomplicated. I was expecting to be able to do something like buffer += input.getLong(0) -> input.getDate(1).
Can I expect to access the input this way?
How do I merge the data?
Can I just leave the function block empty, like
def merge(…) = {}?
The challenge of sorting that buffer in evaluate() is something I should be able to figure out, although I am still interested in the most elegant ways of how you guys do this (in a fraction of the time).
Bonus question: what role does dataType play?
I return a case class, not the StructType as defined in dataType, which does not seem to be an issue. Or is it working because it happens to match my case class?
Maybe this will clear things up.
The UDAF APIs work on DataFrame Columns. Everything you are doing has to get serialized just like all the other Columns in the DataFrame. As you note, the only supported Map type is immutable, because that is the only thing you can put in a Column. With immutable collections, you just create a new collection that contains the old collection plus a value:
var map = Map[Long,Long]()
map = map + (0L -> 1234L)
map = map + (1L -> 4567L)
Yes, just like working with any DataFrame, your types have to match. Doing buffer.getInt(0) when there's really a LongType there is going to be a problem.
There's no standard way to reset the buffer, other than whatever makes sense for your data type / use case. Maybe zero is actually last month's balance; maybe zero is a running average from another dataset; maybe zero is null or an empty string; or maybe zero is really zero.
merge is an optimization that only happens in certain circumstances, if I remember correctly: a way to sub-total that the SQL optimizer may use if the circumstances warrant it. I just use the same function I use for update.
A case class will automatically get converted to the appropriate schema, so for the bonus question the answer is yes, it's because the schemas match. Change the dataType to not match and you will get an error.
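Applied back to the question's two-column case, the immutable-Map approach can be sketched in plain Scala outside the UDAF machinery (hypothetical names; the value is keyed by the day's epoch number so evaluate can sort before fitting):

```scala
// Buffer contents: epochDay -> value. An immutable Map, matching what a
// MapType(LongType, LongType) buffer column could hold.
var buffer = Map[Long, Long]()

// What update() would do for each input row (value, epochDay)
def update(value: Long, epochDay: Long): Unit =
  buffer = buffer + (epochDay -> value)

// What evaluate() would do first: order the values by day
def ordered: List[Long] = buffer.toList.sortBy(_._1).map(_._2)

update(100L, 3L)
update(50L, 1L)
update(75L, 2L)
// ordered is now List(50, 75, 100), ready for SimpleRegression
```

In the real UDAF, bufferSchema would declare the MapType, update/merge would rebuild the map via buffer.update(0, ...), and evaluate would sort as above.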

Listing columns on a Slick table

I have a Slick 3.0 table definition similar to the following:
case class Simple(a: String, b: Int, c: Option[String])

trait Tables { this: JdbcDriver =>
  import api._

  class Simples(tag: Tag) extends Table[Simple](tag, "simples") {
    def a = column[String]("a")
    def b = column[Int]("b")
    def c = column[Option[String]]("c")
    def * = (a, b, c) <> (Simple.tupled, Simple.unapply)
  }

  lazy val simples = TableQuery[Simples]
}

object DB extends Tables with MyJdbcDriver
I would like to be able to do 2 things:
Get a list of the column names as Seq[String]
For an instance of Simple, generate a Seq[String] that would correspond to how the data would be inserted into the database using a raw query (e.g. Simple("hello", 1, None) becomes Seq("'hello'", "1", "NULL"))
What would be the best way to do this using the Slick table definition?
First of all, it is not possible to trick Slick and change the order on the left side of the <> operator in the * method without changing the order of values in Simple, the row type of Simples (i.e. what Ben assumed is not possible). The ProvenShape return type of the * projection method ensures that there is a Shape available for translating between the Column-based type in * and the client-side type. If you write def * = (c, b, a) <> (Simple.tupled, Simple.unapply) with Simple defined as case class Simple(a: String, b: Int, c: Option[String]), Slick will complain with the error "No matching Shape found. Slick does not know how to map the given types...". Ergo, you can iterate over all the elements of an instance of Simple with its productIterator.
Secondly, you already have the definition of the Simples table in your code, and querying meta tables to get the same information you already have is not sensible. You can get all your column names with a one-liner: simples.baseTableRow.create_*.map(_.name). Note that the * projection of the table also defines the columns generated when you create the table schema, so the columns not mentioned in the projection are not created, and the statement above is guaranteed to return exactly what you need and not to drop anything.
To recap briefly:
To get a list of the column names of the Simples table as Seq[String] use
simples.baseTableRow.create_*.map(_.name).toSeq
To generate a Seq[String] corresponding to how the data would be inserted into the database using a raw query for aSimple, an instance of Simple, use aSimple.productIterator.toSeq
To get the column names, try this:
db.run(for {
  metaTables <- slick.jdbc.meta.MTable.getTables("simples")
  columns <- metaTables.head.getColumns
} yield columns.map(_.name)) foreach println
This will print
Vector(a, b, c)
And for the case class values, you can use productIterator:
Simple("hello", 1, None).productIterator.toVector
is
Vector(hello, 1, None)
You still have to do the value mapping, and guarantee that the order of the columns in the table and the values in the case class are the same.
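That value mapping (turning Simple("hello", 1, None) into Seq("'hello'", "1", "NULL"), as the original question asked) could look like the sketch below. sqlLiteral is a hypothetical helper; real code should escape quotes or, better, use bind parameters:

```scala
case class Simple(a: String, b: Int, c: Option[String])

// Hypothetical helper: render one case-class field as a raw SQL literal
def sqlLiteral(v: Any): String = v match {
  case None      => "NULL"
  case Some(x)   => sqlLiteral(x) // unwrap the Option and recurse
  case s: String => s"'$s'"       // naive quoting, no escaping
  case other     => other.toString
}

val lits = Simple("hello", 1, None).productIterator.map(sqlLiteral).toList
// List("'hello'", "1", "NULL")
```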