create scala dataframe out of custom object - scala

I am a newbie in Scala. I will try to be as clear as possible. I have the following code:
case class Session (bf: Array[File])
case class File(s: s, a: Option[a], b: Option[b], c: Option[c])
case class s(s1:Int, s2:String)
case class a(a1:Int, a2:String)
case class b(b1:Int, b2:String)
case class c(c1:Int, c2:String)
val x = Session(...) // some values here; many Session objects are grouped in a Dataset, i.e. Dataset[Session]
I want to know how to create DataFrames from a Dataset[Session]. I do not know how to manipulate such a complex structure.
How can I create a DataFrame from a Dataset[Session] that contains only the custom object "a"?
Thank you

A Spark Dataset works much like a regular Scala collection. It has a toDF() operation that creates a DataFrame from it, so you only need to extract the right data first using a few transformations:
1. flatMap it into a Dataset of File
2. filter every File for a non-empty a
3. map every remaining File to its a
4. call toDF() to create a DataFrame
In code this would be:
// requires import spark.implicits._ for the encoders and toDF()
val ds: Dataset[Session] = ...
ds.flatMap(_.bf)
.filter(_.a.isDefined)
.map(_.a.get)
.toDF()
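On a plain Scala collection you could combine the filter and map into a single collect with a partial function. Dataset.collect() takes no arguments and returns the rows to the driver as an Array, so that form is only available on an RDD. On a Dataset, a second flatMap over the Option gives the same result:
ds.flatMap(_.bf).flatMap(_.a.toSeq).toDF()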
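Because the Dataset operations used here mirror the collection API, the same steps can be tried on an ordinary Seq without a Spark session. This is only a minimal sketch of the logic, not Spark code; SessionRec, FileRec and A are hypothetical stand-ins for the question's case classes, renamed so that types and fields do not share names.

```scala
// Plain Scala collections only; no Spark involved.
// SessionRec, FileRec and A are hypothetical stand-ins for Session, File and a.
case class A(a1: Int, a2: String)
case class FileRec(s: String, a: Option[A])
case class SessionRec(bf: Array[FileRec])

val sessions = Seq(
  SessionRec(Array(FileRec("f1", Some(A(1, "x"))), FileRec("f2", None))),
  SessionRec(Array(FileRec("f3", Some(A(2, "y")))))
)

// Step by step: flatten the files, keep those with a defined `a`, unwrap it.
val viaFilterMap: Seq[A] =
  sessions.flatMap(_.bf).filter(_.a.isDefined).map(_.a.get)

// collect with a partial function filters and maps in one step (collections and RDDs).
val viaCollect: Seq[A] =
  sessions.flatMap(_.bf).collect { case FileRec(_, Some(a)) => a }

// flatMap over the Option drops the empty values; this form also carries over to a Dataset.
val viaFlatMap: Seq[A] =
  sessions.flatMap(_.bf).flatMap(_.a)
```

All three values hold the same two A instances; which form to use is mostly a matter of taste.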

Related

Pass case class to Spark UDF

I have a scala-2.11 function which creates a case class from Map based on the provided class type.
import scala.reflect.runtime.universe._

def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
  val rMirror = runtimeMirror(getClass.getClassLoader)
  val myClass = typeOf[T].typeSymbol.asClass
  val cMirror = rMirror.reflectClass(myClass)
  // The primary constructor is the first one
  val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
  val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
  cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}
I'm trying to use this in the context of a spark data frame as a UDF. However, I'm not sure what's the best way to pass the case class. The approach below doesn't seem to work.
def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
  createCaseClass[T](inMap)
}
I'm looking for something like this:
case class MyType(c1: String, c2: Long)
val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}
Thoughts and suggestions to resolve this are appreciated.
"However, I'm not sure what's the best way to pass the case class"
It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.
If you want to operate on statically typed objects, use a statically typed Dataset.
From trial and error I learned that any data structure stored in a DataFrame or Dataset is represented using the types in org.apache.spark.sql.types.
You can see this with:
df.schema.toString
Basic types like Int and Double are stored like this:
StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)
Complex types like case classes are transformed into a combination of nested types:
StructType(StructField(..),StructField(..),StructType(..))
Sample code:
case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema
//Output:
org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))
The UDF parameter type in these cases is Row, or Seq[Row] when you store an array of case classes.
A basic debugging technique is to print the schema as a string:
val myUdf = udf( (r:Row) => r.schema.toString )
Then, to see what happens, inspect a row:
df.take(1).foreach(println)

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
  countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, this is not possible with the agg transformation: it expects a TypedColumn, which is constructed from a Column using the as method, so you have to start from an expression that is not type-safe. If somebody knows a solution, I would be interested to see it.
If you need type-safe aggregation, you can use one of the approaches below:
1. mapGroups, where you implement a Scala function that aggregates the Iterator of values
2. implement a custom Aggregator, as suggested above
The first approach needs less code, so here is a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size
ds
  .groupByKey(s => Key(s.x, s.y))
  // SomeType has no name field; use the id field from the question
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
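The per-group logic can be checked on plain collections before it is handed to mapGroups. The sketch below is not Spark code: it reuses the question's SomeType, Key and chooseDistinguisher, while countDistinctBy and the sample rows are hypothetical names and data made up for illustration.

```scala
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)

// Counts the distinct values produced by chooseDistinguisher within one group.
def countDistinctBy[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSet.size.toLong

// Hypothetical sample data.
val rows = Seq(
  SomeType("a", 1, 1, "p1"),
  SomeType("a", 1, 1, "p2"),
  SomeType("b", 1, 1, "p3"),
  SomeType("c", 2, 3, "p4")
)

val chooseDistinguisher: SomeType => String = _.id

// groupBy plays the role of groupByKey; the map over groups plays the role of mapGroups.
val counts: Map[Key, Long] =
  rows.groupBy(s => Key(s.x, s.y)).map {
    case (k, vs) => k -> countDistinctBy(vs.iterator)(chooseDistinguisher)
  }
// counts == Map(Key(1, 1) -> 2, Key(2, 3) -> 1)
```

Since the function only needs an Iterator and a T => String, the same countDistinctBy could be passed into mapGroups unchanged.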
In my opinion, the type-safe Spark Dataset API is still much less mature than the untyped DataFrame API. Some time ago I thought it could be a good idea to implement a simple type-safe aggregation API for Spark Datasets.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes the SQLContext implicits are in scope
import org.apache.spark.sql.{functions => f}
val colName = "id"
ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  // countDistinct must be qualified, because functions is imported as f
  .agg(f.countDistinct(f.col(colName)).as("cntd"))

In spark, is there a way to convert the RDD objects into case objects

I am new to Spark programming and to case classes, and I have a scenario where I need to use a case class in my RDDs.
For example, I have an RDD of tuples like:
Array[(String,String,String)]
having values like:
Array((20254552,ATM,-5100), (20174649,ATM,5120)........)
Is there any method to convert the above RDD into:
20254552,trans(ATM,-5100)
where trans is a case class?
Yes, you can definitely do that. The following code should help:
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sparkContext.parallelize(array)
val transedRdd = rdd.map(x => (x._1, trans(x._2, x._3)))
You should define the case class outside your current class:
case class trans(atm : String, num: Int)
I hope it helps
This doesn't really answer your question, but I recommend that you use DataFrames and Datasets as much as possible. They bring several benefits: more efficient coding, a well-tested framework with optimizations that reduce memory use, and full use of the Spark engine.
Please refer to A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets for more information about the differences between RDDs, DataFrames and Datasets and their use cases.
Using Datasets, the solution to your problem is very simple:
import spark.implicits._
val ds = Seq((20254552,"ATM",-5100), (20174649,"ATM",5120)).toDS()
val transsedds = ds.map(x => (x._1, trans(x._2, x._3)))
As @Ramesh says, you should define the case class outside your current class:
case class trans(atm : String, num: Int)
Hope it helps.
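Note that the question's tuples hold three strings, while both answers assume an Int id and amount. A minimal sketch of the conversion for the all-String case follows, using a plain Array so it runs without Spark; Trans, kind and amount are hypothetical names, and the mapping function assumes every amount parses as an Int.

```scala
// Hypothetical case class; capitalised to follow the usual Scala naming convention.
case class Trans(kind: String, amount: Int)

// Same shape as the question's data: three strings per tuple.
val raw: Array[(String, String, String)] =
  Array(("20254552", "ATM", "-5100"), ("20174649", "ATM", "5120"))

// Keep the id as the key and parse the amount while building the case class.
// toInt throws NumberFormatException on malformed input.
val keyed: Array[(String, Trans)] =
  raw.map { case (id, kind, amount) => (id, Trans(kind, amount.toInt)) }
```

The same pattern-matching function can be passed to rdd.map or ds.map, since both accept an ordinary Scala function.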

How to create udf containing Array (case class) for complex column in a dataframe

I have a DataFrame with a complex column whose datatype is an ArrayType of StructType. To transform this DataFrame I created a UDF that consumes this column using Array[case class] as its parameter. The main obstacle is that when I create the case class to match the StructType, the StructField name contains special characters, for example "##field". So I give the case class field the same name, like case class (##field), and use it as the UDF parameter. After Spark interprets the UDF definition, the case class field name changes to "$hash$hashfield". Transforming the DataFrame then fails because of this mismatch. Please help ...
Due to JVM limitations, Scala stores identifiers in encoded form, and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract the fields manually from the raw Row (you need to know the order of the fields in df; you can use df.schema for that):
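val myUdf = udf { (struct: Row) =>
  // Option 1: pattern match on the struct (Foo stands for your own case class):
  // struct match {
  //   case Row(a: String) => Foo(a)
  // }
  // Option 2: extract values from the Row by position:
  val `##a` = struct.getAs[String](0)
  // return the extracted value so the UDF has a non-Unit result
  `##a`
}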
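The encoding itself can be inspected with scala.reflect.NameTransformer from the Scala standard library. This small sketch only illustrates the name mangling described above; it does not involve Spark, and the value names are made up for illustration.

```scala
import scala.reflect.NameTransformer

// Scala encodes operator characters so the name is a legal JVM identifier.
val encoded: String = NameTransformer.encode("##field")

// decode reverses the mapping.
val decoded: String = NameTransformer.decode(encoded)
// encoded == "$hash$hashfield", decoded == "##field"
```

This is why the struct field name and the compiled case class field name no longer match.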

Convert Dataframe back to RDD of case class in Spark

I am trying to convert a DataFrame of multiple case classes to an RDD of these case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I have the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
  (randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
  (randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
  (randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have aDF, how can I get anRDD back?
I tried something like this just to get the second column, but it gave an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As Tzach Zohar commented, to get back the full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r:Row => r.getAs[randomClass3]("_2")}