Scala: aggregate based on case class attribute

Aggregate by a case class field and output the result in another case class.
I tried getting the values as a Map or List using recs.groupBy(_.grade).mapValues(_.map(_.student)), but I need the result as a case class. Please advise.
object myApp extends App {
  val input: List[student] = List(student(1, 100), student(1, 101), student(2, 102))
  val output: List[studentsByGrade] = List(studentsByGrade(1, List(100, 101)), studentsByGrade(2, List(102)))
}

case class student(grade: Long, student: Long)

case class studentsByGrade(grade: Long, studentList: List[Long])

This produces the specified result on Scala 2.13+, where groupMap is available:
input.groupMap(_.grade)(_.student).map(studentsByGrade.tupled)
Scala 2.12.x
input.groupBy(_.grade)
  .map { case (grd, ss) => studentsByGrade(grd, ss.map(_.student)) }
  .toList // optional
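
A quick self-contained check of the 2.13 version (a minimal sketch; Map iteration order is unspecified, so the result is sorted before comparing):

object myAppCheck extends App {
  val input: List[student] = List(student(1, 100), student(1, 101), student(2, 102))
  val expected: List[studentsByGrade] = List(studentsByGrade(1, List(100, 101)), studentsByGrade(2, List(102)))
  // groupMap builds Map(1 -> List(100, 101), 2 -> List(102)) in a single pass
  val result = input.groupMap(_.grade)(_.student)
    .map(studentsByGrade.tupled)
    .toList
    .sortBy(_.grade)
  assert(result == expected)
}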

Related

How to handle missing nested fields in Spark?

Given the two case classes:
case class Response(
  responseField: String,
  ...
  items: List[Item])

case class Item(
  itemField: String,
  ...)
I am creating a Response dataset:
val dataset = spark.read.format("parquet")
  .load(inputPath)
  .as[Response]
  .map(x => x)
The issue arises when itemField is not present in any of the rows; Spark then raises org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField were not nested, I could handle it with dataset.withColumn("itemField", lit("")). Is it possible to do the same within the List field?
I assume the following:
Data was written with the following schema:
case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")
Now the schema has changed to:
case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails
For Spark 2.4 please see:
Spark - How to add an element to an array of structs
For Spark 2.3 this should work (imports shown for completeness):
import org.apache.spark.sql.functions.{array, col, lit, udf}

val addNewField: (Array[String], Array[Int]) => Array[Item] =
  (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }
val addNewFieldUdf = udf(addNewField)

spark.read.parquet("/tmp/structTest")
  .withColumn("items", addNewFieldUdf(
    col("items.itemField") as "itemField",
    array(lit(1)) as "newField"
  )).as[Response].map(x => x) // Works

How to Map Array from Case class in Spark Scala

Sample data: 251~jhon~WrappedArray([STD,Health,Duval])
case class xyz(id: String, code: String, County: String)
case class rewards(memId: String, name: String, part: Array[xyz])

val df = spark.read.textFile("file:///data/").rdd.map(r => r.split('~'))
val df2 = df.map(x => { rewards(x(0), x(1), Array[rewards.apply()]) }) // does not compile
I have tried many ways to map the array into the case class, including the apply function.
I am not sure that's what you are looking for, but you can try using pattern matching to transform the arrays into case classes.
import org.apache.spark.rdd.RDD

val data: RDD[rewards] = sc
  .parallelize(Seq("251~jhon~WrappedArray([STD,Health,Duval])"))
  .map(_.split("~"))
  .map { case Array(id, code, part) => (id, code, part
    .replaceFirst("\\s*WrappedArray\\(\\s*\\[\\s*", "")
    .replaceFirst("\\s*\\]\\s*\\)\\s*", "")
  )}
  .map { case (id, name, part) => rewards(id, name, part.split("\\s*,\\s*") match {
    case Array(id, code, county) => Array(xyz(id, code, county))
  })}
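
To sanity-check the result (Array has an unhelpful toString inside case classes, so part is converted to a List for printing):

data.collect().foreach(r => println((r.memId, r.name, r.part.toList)))
// (251,jhon,List(xyz(STD,Health,Duval)))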

Nested Scala case classes to/from CSV

There are many nice libraries for writing/reading Scala case classes to/from CSV files. I'm looking for something that goes beyond that and can handle nested case classes. For example, here a Match has two Players:
case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7), Player("Fred", 23)),
  Match("Rome", Player("Marco", 19), Player("Giulia", 3)),
  Match("Paris", Player("Isabelle", 2), Player("Julien", 5))
)
I'd like to effortlessly (no boilerplate!) write/read matches to/from this CSV:
place,winner.name,winner.ranking,loser.name,loser.ranking
London,Jane,7,Fred,23
Rome,Marco,19,Giulia,3
Paris,Isabelle,2,Julien,5
Note the automated header line using the dot "." to form the column name for a nested field, e.g. winner.ranking. I'd be delighted if someone could demonstrate a simple way to do this (say, using reflection or Shapeless).
[Motivation. During data analysis it's convenient to have a flat CSV to play around with, for sorting, filtering, etc., even when case classes are nested. And it would be nice if you could load nested case classes back from such files.]
Since a case class is a Product, getting the values of the various fields is relatively easy; getting the names of the fields/columns does require Java reflection.
The following function takes a list of case-class instances and returns a list of rows, each a list of strings. It uses recursion to collect the values and headers of nested case-class instances.
def toCsv(p: List[Product]): List[List[String]] = {
  def header(c: Class[_], prefix: String = ""): List[String] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      if (classOf[Product].isAssignableFrom(field.getType)) header(field.getType, name + ".")
      else List(name)
    }
  }
  def flatten(p: Product): List[String] =
    p.productIterator.flatMap {
      case p: Product => flatten(p)
      case v: Any => List(v.toString)
    }.toList
  // note: the header row is hardcoded to Match
  header(classOf[Match]) :: p.map(flatten)
}
However, constructing case classes from CSV is far more involved, requiring reflection to get the types of the various fields, to create the values from the CSV strings, and to construct the case-class instances.
For simplicity (not saying the code is simple, just so it won't be further complicated), I assume that the order of columns in the CSV is the same as if the file was produced by the toCsv(...) function above.
The following function starts by creating a list of instructions for how to process a single CSV row (the instructions are also used to verify that the column headers in the CSV match the case-class properties). The instructions are then used to recursively produce one case-class instance per CSV row.
import scala.reflect.ClassTag

def fromCsv[T <: Product](csv: List[List[String]])(implicit tag: ClassTag[T]): List[T] = {
  trait Instruction {
    val name: String
    val header = true
  }
  case class BeginCaseClassField(name: String, clazz: Class[_]) extends Instruction {
    override val header = false
  }
  case class EndCaseClassField(name: String) extends Instruction {
    override val header = false
  }
  case class IntField(name: String) extends Instruction
  case class StringField(name: String) extends Instruction
  case class DoubleField(name: String) extends Instruction

  def scan(c: Class[_], prefix: String = ""): List[Instruction] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      val fType = field.getType
      if (fType == classOf[Int]) List(IntField(name))
      else if (fType == classOf[Double]) List(DoubleField(name))
      else if (fType == classOf[String]) List(StringField(name))
      else if (classOf[Product].isAssignableFrom(fType)) BeginCaseClassField(name, fType) :: scan(fType, name + ".")
      else throw new IllegalArgumentException(s"Unsupported field type: $fType")
    } :+ EndCaseClassField(prefix)
  }

  def produce(instructions: List[Instruction], row: List[String], argAccumulator: List[Any]): (List[Instruction], List[String], List[Any]) = instructions match {
    case IntField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toInt)
    case StringField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString)
    case DoubleField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toDouble)
    case BeginCaseClassField(_, clazz) :: tail =>
      val (instructionRemaining, rowRemaining, constructorArgs) = produce(tail, row, List.empty)
      val newCaseClass = clazz.getConstructors.head.newInstance(constructorArgs.map(_.asInstanceOf[AnyRef]): _*)
      produce(instructionRemaining, rowRemaining, argAccumulator :+ newCaseClass)
    case EndCaseClassField(_) :: tail => (tail, row, argAccumulator)
    case Nil if row.isEmpty => (Nil, Nil, argAccumulator)
    case Nil => throw new IllegalArgumentException("Not all values from CSV row were used")
  }

  val instructions = BeginCaseClassField(".", tag.runtimeClass) :: scan(tag.runtimeClass)
  assert(csv.head == instructions.filter(_.header).map(_.name), "CSV header doesn't match target case-class fields")
  csv.drop(1).map(row => produce(instructions, row, List.empty)._3.head.asInstanceOf[T])
}
I've tested this using:
case class Player(name: String, ranking: Int, price: Double)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7, 12.5), Player("Fred", 23, 11.1)),
  Match("Rome", Player("Marco", 19, 13.54), Player("Giulia", 3, 41.8)),
  Match("Paris", Player("Isabelle", 2, 31.7), Player("Julien", 5, 16.8))
)
val csv = toCsv(matches)
val matchesFromCsv = fromCsv[Match](csv)
assert(matches == matchesFromCsv)
Obviously this should be optimized and hardened if you ever want to use this for production...

Filtering RDD with CustomObject, Type Mismatch

I have this custom Scala object (basically a Java POJO):
object CustomObject {
  implicit object Mapper extends JavaBeanColumnMapper[CustomObject]
}

class CustomObject extends Serializable {
  @BeanProperty
  var amount: Option[java.lang.Double] = _
  ...
}
In my main class, I've loaded an RDD that contains these CustomObjects.
I am trying to filter them and create a new RDD that contains only the objects that have amount > 5000.
val customObjectRDD = sc.objectFile[CustomObject]( "objectFiles" )
val filteredRdd = customObjectRDD.filter( x => x.amount > 5000 )
println( filteredRdd.count() )
However, my editor says:
Type Mismatch: Expected: (CustomObject) => Boolean, Actual: (CustomObject) => Any
What do I have to do to get this to work?
The > operator is not defined on Option[Double]; your filter predicate will need to handle the Option:
scala> case class A(amount: Option[Double])
defined class A
scala> val myRDD = sc.parallelize(Seq(A(Some(10000d)), A(None), A(Some(5001d)), A(Some(5000d))))
myRDD: org.apache.spark.rdd.RDD[A] = ParallelCollectionRDD[12] at parallelize at <console>:29
scala> myRDD.filter(_.amount.exists(_ > 5000)).foreach{println}
A(Some(10000.0))
A(Some(5001.0))
This assumes that any object with amount = None should fail the filter predicate. See the docs for a definition of Option.exists.
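A quick REPL check of those semantics; if objects with amount = None should instead pass the predicate, Option.forall is the counterpart that returns true for None:
scala> (None: Option[Double]).exists(_ > 5000)
res2: Boolean = false
scala> (None: Option[Double]).forall(_ > 5000)
res3: Boolean = true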

Getting the parameters of a case class through Reflection

As a follow-up to Matt R's question: now that Scala 2.10 has been out for quite some time, what would be the best way to extract the fields and values of a case class? Taking a similar example:
case class Colour(red: Int, green: Int, blue: String) {
  val other: Int = 42
}

val RGB = Colour(1, 3, "isBlue")
I want to get a list (or an array, or any iterator for that matter) that contains the constructor fields as name/value tuples, like this:
[(red, 1),(green, 3),(blue, "isBlue")]
I know there are many examples on the net addressing this issue, but as I said, I want to know the most idiomatic way to extract the required information.
If you use Scala 2.10 reflection, this answer gets you halfway there. It gives you the method symbols of the case class, so you know the order and names of the arguments:
import scala.reflect.runtime.{universe => ru}
import ru._

def getCaseMethods[T: TypeTag] = typeOf[T].members.collect {
  case m: MethodSymbol if m.isCaseAccessor => m
}.toList

case class Person(name: String, age: Int)

getCaseMethods[Person] // -> List(value age, value name)
You can call .name.toString on these methods to get the corresponding method names.
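For example:

getCaseMethods[Person].map(_.name.toString) // -> List(age, name)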
The next step is to invoke these methods on a given instance. You need a runtime mirror for that:
val rm = runtimeMirror(getClass.getClassLoader)
Then you can "mirror" an actual instance:
val p = Person("foo", 33)
val pr = rm.reflect(p)
Then, on pr, you can reflect each method using reflectMethod and execute it via apply. Without going through each step separately, here is the solution altogether (see the val value = line for the mechanism of extracting a parameter's value):
def caseMap[T: TypeTag: reflect.ClassTag](instance: T): List[(String, Any)] = {
  val im = rm.reflect(instance)
  typeOf[T].members.collect {
    case m: MethodSymbol if m.isCaseAccessor =>
      val name = m.name.toString
      val value = im.reflectMethod(m).apply()
      (name, value)
  } (collection.breakOut) // collection.breakOut was removed in Scala 2.13
}
caseMap(p) // -> List(age -> 33, name -> foo)
Every case class instance is a Product, therefore you can use one iterator to get all its parameters' names and another to get all its parameters' values:
case class Colour(red: Int, green: Int, blue: String) {
  val other: Int = 42
}

val rgb = Colour(1, 3, "isBlue")

val names = rgb.productElementNames.toList // List(red, green, blue)
val values = rgb.productIterator.toList // List(1, 3, isBlue)

names.zip(values).foreach(print) // (red,1)(green,3)(blue,isBlue)
By product I mean both Cartesian product and an instance of Product. This requires Scala 2.13.0; although Product was available before, the iterator to get elements' names was only added in version 2.13.0.
Notice that no reflection is needed.
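
On Scala 2.12 and earlier, where productElementNames is unavailable, a minimal fallback sketch pairs Java-reflection field names with productIterator (assuming the usual scalac layout, where constructor parameters come first in declaration order):

// getDeclaredFields also sees the body val other, but zip stops at the
// shorter side, so only the three constructor parameters are paired
val names212 = rgb.getClass.getDeclaredFields.map(_.getName).toList
names212.zip(rgb.productIterator.toList) // List((red,1), (green,3), (blue,isBlue))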