How to handle missing nested fields in spark? - scala

Given the two case classes:
case class Response(
  responseField: String
  ...
  items: List[Item])

case class Item(
  itemField: String
  ...)
I am creating a Response dataset:
val dataset = spark.read.format("parquet")
.load(inputPath)
.as[Response]
.map(x => x)
The issue arises when itemField is not present in any of the rows; Spark then raises org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField were not nested I could handle it with dataset.withColumn("itemField", lit("")). Is it possible to do the same for a field nested inside the List?

I assume the following:
Data was written with the following schema:
case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")
Now schema changed to:
case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails
For Spark 2.4 please see:
Spark - How to add an element to an array of structs
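In short, that approach relies on the transform higher-order function introduced in Spark 2.4. A minimal sketch of it for the example above (untested here; named_struct rebuilds each element, and the default value 1 for newField simply mirrors the Spark 2.3 example below):
import org.apache.spark.sql.functions.expr

spark.read.parquet("/tmp/structTest")
  .withColumn("items",
    // rebuild every array element with the existing itemField plus a defaulted newField
    expr("transform(items, x -> named_struct('itemField', x.itemField, 'newField', 1))"))
  .as[Response].map(x => x)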
For Spark 2.3 this should work:
import org.apache.spark.sql.functions.{array, col, lit, udf}

// Array columns arrive in Scala UDFs as Seq (WrappedArray), so the function is declared
// over Seq rather than Array to avoid a ClassCastException at runtime. Note that zip
// truncates to the shorter collection, so newFields should supply one value per item.
val addNewField: (Seq[String], Seq[Int]) => Seq[Item] =
  (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }
val addNewFieldUdf = udf(addNewField)

spark.read.parquet("/tmp/structTest")
  .withColumn("items", addNewFieldUdf(
    col("items.itemField") as "itemField", // existing values, one per array element
    array(lit(1)) as "newField"            // default value for the missing field
  )).as[Response].map(x => x) // Works

Related

scala aggregate based on case class attribute

Aggregate by a case class field and output the result in another case class.
I tried getting the values as a Map or List using recs.groupBy(_.grade).mapValues(_.map(_.student)), but I need the result as a case class. Please advise.
object myApp extends App {
  val input: List[student] = List(student(1, 100), student(1, 101), student(2, 102))
  val output: List[studentsByGrade] = List(studentsByGrade(1, List(100, 101)), studentsByGrade(2, List(102)))
}

case class student(grade: Long,
                   student: Long)

case class studentsByGrade(grade: Long,
                           studentList: List[Long])
This produces the specified result.
Scala 2.13+ (groupMap):
input.groupMap(_.grade)(_.student).map(studentsByGrade.tupled).toList
Scala 2.12.x
input.groupBy(_.grade)
  .map { case (grd, ss) => studentsByGrade(grd, ss.map(_.student)) }
  .toList // optional
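For example, either version can be checked against the expected output from the question (the sortBy is added here only to make the comparison independent of group ordering):
val result = input.groupBy(_.grade)
  .map { case (grd, ss) => studentsByGrade(grd, ss.map(_.student)) }
  .toList
  .sortBy(_.grade)
assert(result == output)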

Flink scala Case class

I want to know how to replace x._1._2 and x._1._3 with field names by using a case class.
def keyuid(l: Array[String]): (String, Long, String) = {
  //val l = s.split(",")
  val ip = l(3).split(":")(1)
  val values = Array("", 0, 0, 0)
  val uid = l(1).split(":")(1)
  val timestamp = l(2).split(":")(1).toLong * 1000
  val impression = l(4).split(":")(1)
  return (uid, timestamp, ip)
}

val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0)).assignAscendingTimestamps(x => x._1._2)
  .keyBy(x => x._1._3)
  .timeWindow(Time.seconds(10))
  .sum(1)
Use Scala pattern matching when writing lambda functions, using curly braces and the case keyword.
val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0)).assignAscendingTimestamps {
    case ((_, timestamp, _), _) => timestamp
  }
  .keyBy { elem => elem match {
    case ((_, _, ip), _) => ip
  }}
  .timeWindow(Time.seconds(10))
  .sum(1)
More information on Tuples and their pattern matching syntax here: https://docs.scala-lang.org/tour/tuples.html
Pattern matching is indeed a good idea and makes the code more readable.
To answer your question, to make the keyuid function return a case class, you first need to define one, for instance:
case class Click(uid: String, timestamp: Long, ip: String)
Then instead of return (uid, timestamp, ip) you do return Click(uid, timestamp, ip).
Case classes are not specific to Flink; they are a Scala feature: https://docs.scala-lang.org/tour/case-classes.html
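Putting both suggestions together, a sketch of the pipeline using the Click case class above (the field names replace the tuple positions; otherwise this is the question's code):
case class Click(uid: String, timestamp: Long, ip: String)

def keyuid(l: Array[String]): Click = {
  val uid = l(1).split(":")(1)
  val timestamp = l(2).split(":")(1).toLong * 1000
  val ip = l(3).split(":")(1)
  Click(uid, timestamp, ip)
}

val cli_ip = click.map(_.split(","))
  .map(x => (keyuid(x), 1.0))
  .assignAscendingTimestamps { case (c, _) => c.timestamp } // was x._1._2
  .keyBy { case (c, _) => c.ip }                            // was x._1._3
  .timeWindow(Time.seconds(10))
  .sum(1)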

Spark - Merging Multiple Map Columns with Common Keys

I have a series of columns with the schemas Map[(Int, Int), Row] and Map[(Int, Int), String], i.e. int-tuple-keyed maps holding some data (a string or a struct). I am attempting to take these columns, extract the data, and merge it into a single column based on the keys via a UDF, as shown below.
case class PersonConcept(field1: Int, field2: Int, field3: String,
                         field4: String, field5: String)

private def mergeMaps: UserDefinedFunction = {
  val f = (keyedData: Map[Row, Row], keyedNames: Map[Row, String],
           keyedOccupations: Map[Row, String]) => {
    val intKeyedData = keyedData.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    val intKeyedNames = keyedNames.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    val intKeyedOccupations = keyedOccupations.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    intKeyedData
      .map {
        case (k, v) => {
          val value =
            PersonConcept(v.getInt(0),
                          v.getInt(1),
                          v.getString(2),
                          intKeyedOccupations(k), // <--- Grabbing occupation
                          intKeyedNames(k))       // <--- Grabbing name
          (k, value)
        }
      }
  }
  udf(f)
}
I know for a fact that each map contains the exact same set of keys. However, I am encountering a NoSuchElementException.
Caused by: java.util.NoSuchElementException: key not found: (24,75)
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
The UDF works fine when the dataset consists of only one row or one partition, leading me to believe that this is somehow related to cross-partition variable scope or closures (http://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-). What I don't understand is what could be causing this, and whether there is a feasible way to do it with this data model.

How to Map Array from Case class in Spark Scala

Sample data: 251~jhon~WrappedArray([STD,Health,Duval])
case class xyz(id : String, code : String, County : String)
case class rewards(memId : String, name: String, part: Array[xyz])
val df = spark.read.textFile("file:///data/").rdd.map(r => r.split('~'))
val df2 = df.map(x => { rewards(x(0),x(1), Array[rewards.apply()] ) } )
I tried many ways to map an array from the case class, including the apply function.
I am not sure that's what you are looking for, but you can try using pattern matching to transform arrays into case classes.
val data: RDD[rewards] = sc
  .parallelize(Seq("251~jhon~WrappedArray([STD,Health,Duval])"))
  .map(_.split("~"))
  .map { case Array(id, code, part) => (id, code, part
    .replaceFirst("\\s*WrappedArray\\(\\s*\\[\\s*", "")
    .replaceFirst("\\s*\\]\\s*\\)\\s*", "")
  )}
  .map { case (id, name, part) => rewards(id, name, part.split("\\s*,\\s*") match {
    case Array(id, code, county) => Array(xyz(id, code, county))
  })}
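If a Dataset is ultimately wanted (the question starts from spark.read), the resulting RDD[rewards] can be converted with the usual implicits; a small sketch assuming the same spark session:
import spark.implicits._

val ds = data.toDS() // Dataset[rewards]
ds.show(truncate = false)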

Nested Scala case classes to/from CSV

There are many nice libraries for writing/reading Scala case classes to/from CSV files. I'm looking for something that goes beyond that and can handle nested case classes. For example, here a Match has two Players:
case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)
val matches = List(
Match("London", Player("Jane",7), Player("Fred",23)),
Match("Rome", Player("Marco",19), Player("Giulia",3)),
Match("Paris", Player("Isabelle",2), Player("Julien",5))
)
I'd like to effortlessly (no boilerplate!) write/read matches to/from this CSV:
place,winner.name,winner.ranking,loser.name,loser.ranking
London,Jane,7,Fred,23
Rome,Marco,19,Giulia,3
Paris,Isabelle,2,Julien,5
Note the automated header line using the dot "." to form the column name for a nested field, e.g. winner.ranking. I'd be delighted if someone could demonstrate a simple way to do this (say, using reflection or Shapeless).
[Motivation. During data analysis it's convenient to have a flat CSV to play around with, for sorting, filtering, etc., even when case classes are nested. And it would be nice if you could load nested case classes back from such files.]
Since a case-class is a Product, getting the values of the various fields is relatively easy. Getting the names of the fields/columns does require using Java reflection.
The following function takes a list of case-class instances and returns a list of rows, each a list of strings. It uses recursion to get the values and headers of nested case-class instances.
def toCsv(p: List[Product]): List[List[String]] = {
  def header(c: Class[_], prefix: String = ""): List[String] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      if (classOf[Product].isAssignableFrom(field.getType)) header(field.getType, name + ".")
      else List(name)
    }
  }

  def flatten(p: Product): List[String] =
    p.productIterator.flatMap {
      case p: Product => flatten(p)
      case v: Any     => List(v.toString)
    }.toList

  // note: the header row is built from Match specifically; for other types pass the appropriate class
  header(classOf[Match]) :: p.map(flatten)
}
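toCsv returns the rows as lists of strings; rendering the actual CSV text is then just a join (a minimal sketch that ignores quoting/escaping):
val csvText = toCsv(matches).map(_.mkString(",")).mkString("\n")
// place,winner.name,winner.ranking,loser.name,loser.ranking
// London,Jane,7,Fred,23
// ...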
However, constructing case-classes from CSV is far more involved, requiring to use reflection for getting the types of the various fields, for creating the values from the CSV strings and for constructing the case-class instances.
For simplicity (not saying the code is simple, just so it won't be further complicated), I assume that the order of columns in the CSV is the same as if the file was produced by the toCsv(...) function above.
The following function starts by creating a list of "instructions how to process a single CSV row" (the instructions are also used to verify that the column headers in the CSV match the case-class properties). The instructions are then used to recursively process one CSV row at a time.
import scala.reflect.ClassTag

def fromCsv[T <: Product](csv: List[List[String]])(implicit tag: ClassTag[T]): List[T] = {
  trait Instruction {
    val name: String
    val header = true
  }
  case class BeginCaseClassField(name: String, clazz: Class[_]) extends Instruction {
    override val header = false
  }
  case class EndCaseClassField(name: String) extends Instruction {
    override val header = false
  }
  case class IntField(name: String) extends Instruction
  case class StringField(name: String) extends Instruction
  case class DoubleField(name: String) extends Instruction

  def scan(c: Class[_], prefix: String = ""): List[Instruction] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      val fType = field.getType
      if (fType == classOf[Int]) List(IntField(name))
      else if (fType == classOf[Double]) List(DoubleField(name))
      else if (fType == classOf[String]) List(StringField(name))
      else if (classOf[Product].isAssignableFrom(fType)) BeginCaseClassField(name, fType) :: scan(fType, name + ".")
      else throw new IllegalArgumentException(s"Unsupported field type: $fType")
    } :+ EndCaseClassField(prefix)
  }

  def produce(instructions: List[Instruction], row: List[String], argAccumulator: List[Any]): (List[Instruction], List[String], List[Any]) = instructions match {
    case IntField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toInt)
    case StringField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString)
    case DoubleField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toString.toDouble)
    case BeginCaseClassField(_, clazz) :: tail =>
      val (instructionRemaining, rowRemaining, constructorArgs) = produce(tail, row, List.empty)
      val newCaseClass = clazz.getConstructors.head.newInstance(constructorArgs.map(_.asInstanceOf[AnyRef]): _*)
      produce(instructionRemaining, rowRemaining, argAccumulator :+ newCaseClass)
    case EndCaseClassField(_) :: tail => (tail, row, argAccumulator)
    case Nil if row.isEmpty => (Nil, Nil, argAccumulator)
    case Nil => throw new IllegalArgumentException("Not all values from CSV row were used")
  }

  val instructions = BeginCaseClassField(".", tag.runtimeClass) :: scan(tag.runtimeClass)
  assert(csv.head == instructions.filter(_.header).map(_.name), "CSV header doesn't match target case-class fields")
  csv.drop(1).map(row => produce(instructions, row, List.empty)._3.head.asInstanceOf[T])
}
I've tested this using:
case class Player(name: String, ranking: Int, price: Double)
case class Match(place: String, winner: Player, loser: Player)
val matches = List(
Match("London", Player("Jane", 7, 12.5), Player("Fred", 23, 11.1)),
Match("Rome", Player("Marco", 19, 13.54), Player("Giulia", 3, 41.8)),
Match("Paris", Player("Isabelle", 2, 31.7), Player("Julien", 5, 16.8))
)
val csv = toCsv(matches)
val matchesFromCsv = fromCsv[Match](csv)
assert(matches == matchesFromCsv)
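Going through actual CSV text rather than List[List[String]] is just the inverse join/split (again ignoring quoting; csv and matches are from the test above):
val text = csv.map(_.mkString(",")).mkString("\n")
val parsed = text.split("\n").toList.map(_.split(",").toList)
assert(matches == fromCsv[Match](parsed))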
Obviously this should be optimized and hardened if you ever want to use this for production...