Spark SQL UDF returning a Scala immutable Map with df.withColumn() - scala

I have a case class
case class MyCaseClass(City : String, Extras : Map[String, String])
and a user-defined function which returns a scala.collection.immutable.Map
def extrasUdf = spark.udf.register(
"extras_udf",
(age : Int, name : String) => Map("age" -> age.toString, "name" -> name)
)
but this breaks with an exception:
import spark.implicits._
spark.read.options(...).load(...)
.select('City, 'Age, 'Name)
.withColumn("Extras", extrasUdf('Age, 'Name))
.drop('Age)
.drop('Name)
.as[MyCaseClass]
I should use Spark SQL's MapType(DataTypes.StringType, DataTypes.IntegerType), but I can't find any working example...
And this works if I use scala.collection.Map, but I need an immutable Map.

There are many problems with your code:
You are using def extrasUdf =, which creates a function for registering a UDF, as opposed to actually creating/registering one. Use val extrasUdf = instead.
You are mixing value types in your map (String and Int), which makes the map a Map[String, Any], as Any is the common supertype of String and Int. Spark does not support Any. You can do at least two things: (a) switch to a string-valued map (using age.toString, in which case you don't need a UDF at all, as you can simply use map(), as sketched below), or (b) switch to a named struct using named_struct() (again, without the need for a UDF). As a rule, only write a UDF if you cannot do what you need with the existing functions. I prefer to look at the Hive documentation because the Spark docs are rather sparse.
Also, keep in mind that type specification in Spark schema (e.g., MapType) is completely different from Scala types (e.g., Map[_, _]) and separate from how types are represented internally and mapped between Scala & Spark data structures. In other words, this has nothing to do with mutable vs. immutable collections.
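As an illustration, here is a minimal sketch of option (a), assuming the loaded DataFrame is bound to a val named df (my name, not from the question) and still has the City, Age and Name columns. Both map values are strings, so the column type is map<string,string>, which lines up with Map[String, String] in MyCaseClass:
import org.apache.spark.sql.functions.{col, lit, map}

val withExtras = df
  .withColumn("Extras", map(
    lit("age"), col("Age").cast("string"),
    lit("name"), col("Name")))
  .drop("Age", "Name")
  .as[MyCaseClass]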
Hope this helps!

Related

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution, I'd be interested to see it...
If you need type-safe aggregation, you can use one of the approaches below:
mapGroups - where you implement a Scala function responsible for aggregating the Iterator of values
implement a custom Aggregator, as suggested above (a rough sketch follows the mapGroups example below)
The first approach needs less code, so below is a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
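For completeness, here is a rough sketch of the second approach as well, i.e. a custom Aggregator; the class name DistinctCount is mine, and the code assumes the Key/SomeType definitions from the question:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Collects the distinct distinguisher values in a Set and returns its size.
class DistinctCount[T](chooseDistinguisher: T => String) extends Aggregator[T, Set[String], Long] {
  def zero: Set[String] = Set.empty[String]
  def reduce(buffer: Set[String], value: T): Set[String] = buffer + chooseDistinguisher(value)
  def merge(b1: Set[String], b2: Set[String]): Set[String] = b1 ++ b2
  def finish(reduction: Set[String]): Long = reduction.size.toLong
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage with the grouped dataset from the question:
// grouped.agg(new DistinctCount[SomeType](_.id).toColumn)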
In my opinion, the type-safe Spark Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I was thinking it could be a good idea to implement a simple-to-use, type-safe aggregation API for Spark Datasets.
Currently, this use case is better handled with a DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}

val colName = "id"

ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
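And, as mentioned above, the untyped result can be converted back into a typed Dataset; here KeyCount is a hypothetical result type and aggregated a hypothetical name for the DataFrame produced by the snippet above:
case class KeyCount(key: String, cntd: Long)

val typed = aggregated.as[KeyCount] // Dataset[KeyCount]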

Get Type of RDD in Scala/Spark

I am not sure if "type" is the right word to use here, but let's say I have an RDD of the following type
RDD[(Long, Array[(Long, Double)])]
Now if I have this RDD, how can I find its type (as mentioned above) at runtime?
I basically want to compare two RDDs at runtime to see if they store the same kind of data (the values themselves might be different) - is there another way to do it? Moreover, I want to get a cached RDD as an instance of that RDD type using the following code
sc.getPersistentRDDs(0).asInstanceOf[RDD[(Long, Array[(Long, Double)])]]
where RDD[(Long, Array[(Long, Double)])] has been found out dynamically at run time based on another RDD of same type.
So, is there a way to get this type at runtime from an RDD?
You can use Scala's TypeTags
import scala.reflect.runtime.universe._
def checkEqualParameters[T1, T2](x: T1, y: T2)(implicit type1: TypeTag[T1], type2: TypeTag[T2]) = {
  type1.tpe.typeArgs == type2.tpe.typeArgs
}
And then compare
checkEqualParameters (rdd1, rdd2)
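For example (a hypothetical session, assuming a SparkContext named sc):
val rdd1 = sc.parallelize(Seq((1L, Array((2L, 0.5)))))
val rdd2 = sc.parallelize(Seq((3L, Array((4L, 1.5)))))
val rdd3 = sc.parallelize(Seq("a", "b"))

checkEqualParameters(rdd1, rdd2) // true  -- both are RDD[(Long, Array[(Long, Double)])]
checkEqualParameters(rdd1, rdd3) // false -- RDD[String] has a different element type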

Schema for type Any is not supported

I'm trying to create a Spark UDF to extract a Map of (key, value) pairs from a user-defined case class.
The Scala function seems to work fine, but when I try to convert it to a UDF in Spark 2.0, I'm running into the "Schema for type Any is not supported" error.
case class myType(c1: String, c2: Int)
def getCaseClassParams(cc: Product): Map[String, Any] = {
  cc
    .getClass
    .getDeclaredFields // all field names
    .map(_.getName)
    .zip(cc.productIterator.toSeq) // zipped with all values
    .toMap
}
But when I try to instantiate a function value as a UDF it results in the following error -
val ccUDF = udf{(cc: Product, i: String) => getCaseClassParams(cc).get(i)}
java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:668)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2841)
The error message says it all: you have an Any in the map. The Spark SQL and Dataset APIs do not support Any in the schema. It has to be one of the supported types, which are basic types such as String, Integer, etc., sequences of supported types, or maps of supported types.
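One hedged way around it (a sketch, not a drop-in fix) is to pass the concrete fields instead of a Product and render every value as a String, so the return type becomes Map[String, String], which Spark maps to MapType(StringType, StringType); c1 and c2 mirror the fields of myType from the question:
import org.apache.spark.sql.functions.udf

val ccUDF = udf { (c1: String, c2: Int, key: String) =>
  Map("c1" -> c1, "c2" -> c2.toString).get(key) // Option[String] is a supported return type
}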

Apache Spark - registering a UDF - returning dataframe

I have a UDF which returns a DataFrame, something like the one below
scala> predict_churn(Vectors.dense(2.0,1.0,0.0,3.0,4.0,4.0,0.0,4.0,5.0,2.0))
res3: org.apache.spark.sql.DataFrame = [noprob: string, yesprob: string, pred: string]
scala> predict_churn(Vectors.dense(2.0,1.0,0.0,3.0,4.0,4.0,0.0,4.0,5.0,2.0)).show
+------------------+------------------+----+
| noprob| yesprob|pred|
+------------------+------------------+----+
|0.3619977592578127|0.6380022407421874| 1.0|
+------------------+------------------+----+
However, when I try to register this as a UDF using the command
hiveContext.udf.register("predict_churn", outerpredict _)
I get an error like
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.DataFrame is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
Is returning a DataFrame not supported? I am using Spark 1.6.1 and Scala 2.10. If this is not supported, how can I return multiple columns to an external program?
Thanks
Bala
Is returning a DataFrame not supported?
Correct - you can't return a DataFrame from a UDF. UDFs should return types that are convertible into the supported column types:
Primitives (Int, String, Boolean, ...)
Tuples of other supported types
Lists, Arrays, Maps of other supported types
Case Classes of other supported types
In your case, you can use a case class:
case class Record(noprob: Double, yesprob: Double, pred: Double)
And have your UDF (predict_churn) return Record.
Then, when applied to a single record (as UDFs are), this case class will be converted into columns named as its members (and with the correct types), resulting in a DataFrame similar to the one currently returned by your function.
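A minimal sketch of what this could look like; the feature type (Seq[Double]) and the placeholder score function are assumptions, since the internals of predict_churn are not shown in the question:
case class Record(noprob: Double, yesprob: Double, pred: Double)

def score(features: Seq[Double]): Double = 0.5 // placeholder for the real model call

def predictChurn(features: Seq[Double]): Record = {
  val yesprob = score(features)
  Record(1.0 - yesprob, yesprob, if (yesprob > 0.5) 1.0 else 0.0)
}

hiveContext.udf.register("predict_churn", predictChurn _)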

Map[String,Any] to compact json string using json4s

I am currently extracting some metrics from different data sources and storing them in a map of type Map[String,Any], where the key corresponds to the metric name and the value corresponds to the metric value. I need this to be more or less generic, which means that value types can be primitive types or lists of primitive types.
I would like to serialize this map to a JSON-formatted string, and for that I am using the json4s library. The thing is, it does not seem to be possible, and I don't see a solution for it. I would expect something like the following to work out of the box :)
val myMap: Map[String,Any] = ... // extract metrics
val json = myMap.reduceLeft(_ ~ _) // create JSON of metrics
Navigating through the source code, I've seen that json4s provides implicit conversions to transform primitive types to JValues, and also to convert Traversable[A]/Map[String,A]/Option[A] to JValues (under the restriction that an implicit conversion from A to JValue is available, which I understand to mean that A is a primitive type). The ~ operator offers a nice way of constructing JObjects out of JFields, where JField is just a type alias for (String, JValue).
In this case, the map's value type is Any, so the implicit conversions don't kick in, and hence the compiler throws the following error:
value ~ is not a member of (String, Any)
[error] val json = r.reduceLeft(_ ~ _)
Is there a solution for what I want to accomplish?
Since you are actually only looking for the JSON string representation of myMap, you can use the Serialization object directly. Here is a small example (if using the native version of json4s change the import to org.json4s.native.Serialization):
import org.json4s.jackson.Serialization
implicit val formats = org.json4s.DefaultFormats
val m: Map[String, Any] = Map(
  "name" -> "joe",
  "children" -> List(
    Map("name" -> "Mary", "age" -> 5),
    Map("name" -> "Mazy", "age" -> 3)
  )
)
// prints {"name":"joe","children":[{"name":"Mary","age":5},{"name":"Mazy","age":3}]}
println(Serialization.write(m))
json4s also has a method for it:
pretty(render(yourMap))
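Note that render/pretty operate on a JValue, so a Map[String, Any] would first have to be decomposed (a sketch, reusing the implicit formats from the example above):
import org.json4s.Extraction
import org.json4s.jackson.JsonMethods.{pretty, render}

println(pretty(render(Extraction.decompose(m))))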