Spark Frameless withColumnRenamed nested field - scala

Let's say I have the following code
case class MyTypeInt(a: String, b: MyType2)
case class MyType2(v: Int)
case class MyTypeLong(a: String, b: MyType3)
case class MyType3(v: Long)
val typedDataset = TypedDataset.create(Seq(MyTypeInt("v", MyType2(1))))
typedDataset.withColumnRenamed(???, typedDataset.colMany('b, 'v).cast[Long]).as[MyTypeLong]
How can I implement this transformation when the field I am trying to transform is nested? The signature of withColumnRenamed asks for a Symbol as its first parameter, so I don't know how to do this.

withColumnRenamed does not allow you to transform a column. To do that, you should use withColumn. One approach would then be to cast the column and recreate the struct.
scala> val new_ds = ds.withColumn("b", struct($"b.v" cast "long" as "v")).as[MyTypeLong]
scala> new_ds.printSchema
root
|-- a: string (nullable = true)
|-- b: struct (nullable = false)
| |-- v: long (nullable = true)
Another approach would be to use map and build the object yourself:
ds.map{ case MyTypeInt(a, MyType2(b)) => MyTypeLong(a, MyType3(b)) }
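For reference, a self-contained sketch of the map approach on a plain Dataset (the Dataset is built directly here for illustration rather than taken from the frameless TypedDataset, and a SparkSession named spark is assumed):

import spark.implicits._

// Same data as the TypedDataset in the question, as a plain Dataset
val ds = spark.createDataset(Seq(MyTypeInt("v", MyType2(1))))

// Rebuild each record with the widened nested type
val widened = ds.map { case MyTypeInt(a, MyType2(v)) => MyTypeLong(a, MyType3(v.toLong)) }
widened.printSchema()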

Related

Apache Spark Null Value when casting incompatible DecimalType vs ClassCastException

Casting DecimalType(10,5) e.g. 99999.99999 to DecimalType(5,4) in Apache Spark silently returns null
Is it possible to change this behavior and have Spark throw an exception (for example, some CastException) in this case and fail the job instead of silently returning null?
As per the comment in the Spark source on GitHub, https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L499
/**
 * Change the precision / scale in a given decimal to those set in decimalType (if any),
 * returning null if it overflows or modifying value in-place and returning it if successful.
 *
 * NOTE: this modifies value in-place, so don't call it on external data.
 */
There is also another thread suggesting there may not be a direct way to fail the job when the cast is not possible: Spark: cast decimal without changing nullable property of column.
So you could try checking for nulls in the cast column and add logic to fail the job if any are found.
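For example, a minimal sketch of that idea (the df DataFrame and the amount column are hypothetical names):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Hypothetical input: a DataFrame df with a column "amount" of DecimalType(10,5)
val casted = df.withColumn("amount_cast", col("amount").cast(DecimalType(5, 4)))

// Rows where the source value was present but the cast silently overflowed to null
val overflowed = casted.filter(col("amount").isNotNull && col("amount_cast").isNull).count()

if (overflowed > 0) {
  throw new IllegalStateException(s"$overflowed value(s) overflowed when casting to DecimalType(5,4)")
}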
As I mentioned in my comment above, you can try to achieve what you want using a UserDefinedFunction. I was facing the same problem and managed to solve it with a UDF: I wanted to cast a column to DoubleType without knowing the type upfront, and I wanted my application to fail when the parsing fails rather than silently return null like you describe.
In the code below I've written a UDF which takes a struct as a parameter and tries to parse the only value in that struct to a double. If this fails, it throws an exception, which causes the job to fail.
import org.apache.spark.SparkException
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}
import spark.implicits._

val cast_to_double = udf((number: Row) => {
  try {
    number.get(0) match {
      case s: String => s.toDouble
      case d: Double => d
      case l: Long => l.toDouble
      case i: Int => i.toDouble
      case _ => throw new NumberFormatException
    }
  } catch {
    case _: NumberFormatException =>
      throw new IllegalArgumentException("Can't parse this so called number of yours.")
  }
})
try {
  val intDF = List(1).toDF("something")
  val secondIntDF = intDF.withColumn("something_else", cast_to_double(struct(col("something"))))
  secondIntDF.printSchema()
  secondIntDF.show()

  val stringIntDF = List("1").toDF("something")
  val secondStringIntDF = stringIntDF.withColumn("something_else", cast_to_double(struct(col("something"))))
  secondStringIntDF.printSchema()
  secondStringIntDF.show()

  val stringDF = List("string").toDF("something")
  val secondStringDF = stringDF.withColumn("something_else", cast_to_double(struct(col("something"))))
  secondStringDF.printSchema()
  secondStringDF.show()
} catch {
  case se: SparkException => println(se.getCause.getMessage)
}
OUTPUT:
root
|-- something: integer (nullable = false)
|-- something_else: double (nullable = false)
+---------+--------------+
|something|something_else|
+---------+--------------+
| 1| 1.0|
+---------+--------------+
root
|-- something: string (nullable = true)
|-- something_else: double (nullable = false)
+---------+--------------+
|something|something_else|
+---------+--------------+
| 1| 1.0|
+---------+--------------+
root
|-- something: string (nullable = true)
|-- something_else: double (nullable = false)
Can't parse this so called number of yours.

zip function with 3 parameters

I want to transpose multiple columns in a Spark SQL table.
I found this solution for only two columns; I want to know how to use the zip function with three columns, varA, varB and varC.
import org.apache.spark.sql.functions.{udf, explode}

val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
  $"userId", $"someString",
  $"vars._1".alias("varA"), $"vars._2".alias("varB")).show
This is my dataframe schema:
root
 |-- owningcustomerid: string (nullable = true)
 |-- event_stoptime: string (nullable = true)
 |-- balancename: string (nullable = false)
 |-- chargedvalue: string (nullable = false)
 |-- newbalance: string (nullable = false)
I tried this code:
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)

df.printSchema

val df4 = df.withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance"))).select(
  $"owningcustomerid", $"event_stoptime",
  $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._2".alias("newbalance"))
I got this error:
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array<string> type, however, '`balancename`' is of string type. argument 2 requires array<string> type, however, '`chargedvalue`' is of string type. argument 3 requires array<string> type, however, '`newbalance`' is of string type.;;
'Project [owningcustomerid#1085, event_stoptime#1086, balancename#1159, chargedvalue#1160, newbalance#1161, explode(UDF(balancename#1159, chargedvalue#1160, newbalance#1161)) AS vars#1167]
In Scala in general, you can use Tuple3.zipped:
val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
  (xs, ys, zs).zipped.toSeq)

zip($"varA", $"varB", $"varC")
Specifically in Spark SQL (>= 2.4) you can use the arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip
arrays_zip($"varA", $"varB", $"varC")
However, note that your data doesn't contain array<string> but plain strings, so arrays_zip and explode are not applicable here and you should parse your data first.
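For example, if the string columns actually hold delimited lists (an assumption; replace the comma with whatever delimiter your data uses), you could split them first, then zip and explode (the zipped struct fields follow the input column names in Spark 2.4+):

import org.apache.spark.sql.functions.{arrays_zip, col, explode, split}

// Assumption: each string column contains comma-separated values
val parsed = df
  .withColumn("balancename", split(col("balancename"), ","))
  .withColumn("chargedvalue", split(col("chargedvalue"), ","))
  .withColumn("newbalance", split(col("newbalance"), ","))

val df4 = parsed
  .withColumn("vars", explode(arrays_zip(col("balancename"), col("chargedvalue"), col("newbalance"))))
  .select(col("owningcustomerid"), col("event_stoptime"),
    col("vars.balancename"), col("vars.chargedvalue"), col("vars.newbalance"))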
Alternatively, a UDF that zips by index generalizes to any number of columns, for example four:
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) => {
  a.indices.map(i => (a(i), b(i), c(i), d(i)))
})

How to create a Row from a given case class?

Imagine that you have the following case classes:
case class B(key: String, value: Int)
case class A(name: String, data: B)
Given an instance of A, how do I create a Spark Row? e.g.
val a = A("a", B("b", 0))
val row = ???
NOTE: Given the row, I need to be able to get the data with:
val name: String = row.getAs[String]("name")
val b: Row = row.getAs[Row]("data")
The following seems to match what you're looking for.
scala> spark.version
res0: String = 2.3.0
scala> val a = A("a", B("b", 0))
a: A = A(a,B(b,0))
import org.apache.spark.sql.Encoders
val schema = Encoders.product[A].schema
scala> schema.printTreeString
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: integer (nullable = false)
val values = a.productIterator.toSeq.toArray
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row: Row = new GenericRowWithSchema(values, schema)
scala> val name: String = row.getAs[String]("name")
name: String = a
// the following won't work since B =!= Row
scala> val b: Row = row.getAs[Row]("data")
java.lang.ClassCastException: B cannot be cast to org.apache.spark.sql.Row
... 55 elided
Very short, but probably not the fastest as it first creates a dataframe and then collects it again:
import session.implicits._
val row = Seq(a).toDF().first()
@Jacek Laskowski's answer is great!
To complete it, here is some syntactic sugar:
val row = Row(a.productIterator.toSeq: _*)
And here is a recursive method if you happen to have nested case classes:
def productToRow(product: Product): Row = {
  val sequence = product.productIterator.toSeq.map {
    case product: Product => productToRow(product)
    case e => e
  }
  Row(sequence: _*)
}
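Note that a Row built with Row(sequence: _*) carries no schema, so the by-name access from the question (row.getAs[String]("name")) will throw an UnsupportedOperationException. A hedged variant of the same idea (an editorial sketch; toRowWithSchema is a made-up helper name) that also attaches the schema could look like this:

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

// Recursively convert a case class (Product) to a Row that carries its schema,
// so getAs by name works at every nesting level. Only nested case classes are
// handled; Options, Maps and collections are out of scope for this sketch.
def toRowWithSchema(product: Product, schema: StructType): Row = {
  val values = product.productIterator.zip(schema.fields.iterator).map {
    case (p: Product, field) => toRowWithSchema(p, field.dataType.asInstanceOf[StructType])
    case (value, _) => value
  }.toArray
  new GenericRowWithSchema(values, schema)
}

val row = toRowWithSchema(A("a", B("b", 0)), Encoders.product[A].schema)
row.getAs[String]("name")                   // "a"
row.getAs[Row]("data").getAs[String]("key") // "b"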
I don't think there exists a public API that can do this directly. Internally Spark uses the Encoder.toRow method to convert objects to org.apache.spark.sql.catalyst.expressions.UnsafeRow, but this method is private. You could try to:
Obtain an Encoder for the class:
val enc: Encoder[A] = ExpressionEncoder()
Use reflection to access the toRow method and make it accessible.
Call it to convert the object to an UnsafeRow.
Obtain a RowEncoder for the expected schema (enc.schema).
Convert the UnsafeRow to a Row.
I haven't tried this, so I cannot guarantee it will work.

Filter an array column based on a provided list

I have the following types in a dataframe:
root
|-- id: string (nullable = true)
|-- items: array (nullable = true)
| |-- element: string (containsNull = true)
input:
val rawData = Seq(("id1",Array("item1","item2","item3","item4")),("id2",Array("item1","item2","item3")))
val data = spark.createDataFrame(rawData)
and a list of items:
val filter_list = List("item1", "item2")
I would like to filter out items that are not in filter_list, similar to how array_contains would function, but it does not work on a provided list of strings, only a single value.
So the output would look like this:
val rawData = Seq(("id1",Array("item1","item2")),("id2",Array("item1","item2")))
val data = spark.createDataFrame(rawData)
I tried solving this with the following UDF, but I am probably mixing types between Scala and Spark:
def filterItems(flist: List[String]) = udf {
  (recs: List[String]) => recs.filter(item => flist.contains(item))
}
I'm using Spark 2.2. Thanks!
Your code is almost right. All you have to do is replace List with Seq:
def filterItems(flist: List[String]) = udf {
  (recs: Seq[String]) => recs.filter(item => flist.contains(item))
}
It would also make sense to change the signature from List[String] => UserDefinedFunction to Seq[String] => UserDefinedFunction, but it is not required.
Reference SQL Programming Guide - Data Types.
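For completeness, a usage sketch (assuming the id and items column names from the schema in the question):

import org.apache.spark.sql.functions.col

// Keep only the items that appear in filter_list
val filtered = data.withColumn("items", filterItems(filter_list)(col("items")))
filtered.show(false)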

Spark - recursive function as udf generates an Exception

I am working with DataFrames whose elements have a schema similar to:
root
|-- NPAData: struct (nullable = true)
| |-- NPADetails: struct (nullable = true)
| | |-- location: string (nullable = true)
| | |-- manager: string (nullable = true)
| |-- service: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- serviceName: string (nullable = true)
| | | |-- serviceCode: string (nullable = true)
|-- NPAHeader: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- date: string (nullable = true)
In my DataFrame I want to group all elements which have the same NPAHeader.code, so to do that I am using the following line:
val groupedNpa = orderedNpa.groupBy($"NPAHeader.code" ).agg(collect_list(struct($"NPAData",$"NPAHeader")).as("npa"))
After this I have a dataframe with the following schema:
StructType(StructField(npaNumber,StringType,true), StructField(npa,ArrayType(StructType(StructField(NPAData...)))))
An example of each Row would be something similar to:
[1234,WrappedArray([npaNew,npaOlder,...npaOldest])]
Now what I want is to generate another DataFrame which picks up just one of the elements in the WrappedArray, so I want an output similar to:
[1234,npaNew]
Note: the chosen element from the WrappedArray is the one that matches complex logic after iterating over the whole WrappedArray. But to simplify the question, I will always pick the last element of the WrappedArray (after iterating over it).
To do so, I want to define a recursive UDF:
import org.apache.spark.sql.functions.udf
def returnRow(elementList: Row)(index: Int): Row = {
  val dif = elementList.size - index
  val row: Row = dif match {
    case 0 => elementList.getAs[Row](index)
    case _ => returnRow(elementList)(index + 1)
  }
  row
}

val returnRow_udf = udf(returnRow _)

groupedNpa.map { row => (row.getAs[String]("npaNumber"), returnRow_udf(groupedNpa("npa")(0))) }
But I am getting the following error in the map:
Exception in thread "main" java.lang.UnsupportedOperationException:
Schema for type Int => Unit is not supported
What am I doing wrong?
As an aside, I am not sure if I am passing the npa column correctly with groupedNpa("npa"). I am accessing the WrappedArray as a Row, because I don't know how to iterate over Array[Row] (the get(index) method is not present in Array[Row]).
TL;DR Just use one of the methods described in How to select the first row of each group?
If you want to use complex logic, and return Row you can skip SQL API and use groupByKey:
val f: (String, Iterator[org.apache.spark.sql.Row]) => Row
val encoder: Encoder
df.groupByKey(_.getAs[String]("NPAHeader.code")).mapGroups(f)(encoder)
or better:
val g: (Row, Row) => Row
df.groupByKey(_.getAs[String]("NPAHeader.code")).reduceGroups(g)
where encoder is a valid RowEncoder (Encoder error while trying to map dataframe row to updated row).
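To make the reduceGroups route concrete, here is a hedged sketch (written against Spark 2.x, keyed on NPAHeader.npaNumber from the schema above, with the date comparison standing in for the real selection logic):

import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val latestPerNpa = orderedNpa
  .groupByKey(row => row.getAs[Row]("NPAHeader").getAs[String]("npaNumber"))(Encoders.STRING)
  .reduceGroups { (a: Row, b: Row) =>
    // keep the row with the greater date (stand-in for the real logic)
    val da = a.getAs[Row]("NPAHeader").getAs[String]("date")
    val db = b.getAs[Row]("NPAHeader").getAs[String]("date")
    if (da.compareTo(db) >= 0) a else b
  }
  .map { case (_, row) => row }(RowEncoder(orderedNpa.schema)) // drop the key, keep the Row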
Your code is faulty in multiple ways:
groupBy doesn't guarantee the order of values. So:
orderBy(...).groupBy(....).agg(collect_list(...))
can have non-deterministic output. If you really decide to go this route, you should skip orderBy and sort the collected array explicitly.
You cannot pass a curried function to udf. You'd have to uncurry it first, but it would require a different order of arguments (see the example below).
Even if you could, this would not be the correct way to call it (note that you omit the second argument):
returnRow_udf(groupedNpa("npa")(0))
To make things worse, you call it inside map, where udfs are not applicable at all.
A udf cannot return Row. It has to return an external Scala type.
The external representation of array<struct> is Seq[Row]. You cannot just substitute it with Row.
SQL arrays can be accessed by index with apply:
df.select($"array"(size($"array") - 1))
but it is not a correct method due to non-determinism. You could apply sort_array, but as pointed out at the beginning, there are more efficient solutions.
Surprisingly, recursion is not so relevant. You could design a function like this:
def size(i: Int = 0)(xs: Seq[Any]): Int = xs match {
  case Seq() => i
  case null => i
  case Seq(h, t @ _*) => size(i + 1)(t)
}

val size_ = udf(size() _)
and it would work just fine:
Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
.select(size_($"array"))
although recursion is overkill if you can just iterate over the Seq.
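For instance, a non-recursive sketch of the same null-safe helper:

import org.apache.spark.sql.functions.udf

// Same behaviour as size() above, but without recursion: null-safe length of a SQL array
val size_iter = udf((xs: Seq[Any]) => if (xs == null) 0 else xs.length)

Seq((1, Seq("a", "b", "c"))).toDF("id", "array")
  .select(size_iter($"array"))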