How to map an Array from a case class in Spark Scala

Sample data: 251~jhon~WrappedArray([STD,Health,Duval])
case class xyz(id : String, code : String, County : String)
case class rewards(memId : String, name: String, part: Array[xyz])
val df = spark.read.textFile("file:///data/").rdd.map(r => r.split('~'))
val df2 = df.map(x => { rewards(x(0),x(1), Array[rewards.apply()] ) } )
I have tried many ways to map an array from a case class, including the apply function, without success.

I am not sure that's what you are looking for but you can try using pattern matching to transform arrays into case classes.
import org.apache.spark.rdd.RDD
val data: RDD[rewards] = sc
  .parallelize(Seq("251~jhon~WrappedArray([STD,Health,Duval])"))
  .map(_.split("~"))
  // strip the "WrappedArray([" prefix and the "])" suffix from the third field
  .map { case Array(id, code, part) => (id, code, part
    .replaceFirst("\\s*WrappedArray\\(\\s*\\[\\s*", "")
    .replaceFirst("\\s*\\]\\s*\\)\\s*", "")
  )}
  // split the remaining "STD,Health,Duval" into the three xyz fields
  .map { case (id, name, part) => rewards(id, name, part.split("\\s*,\\s*") match {
    case Array(id, code, county) => Array(xyz(id, code, county))
  })}
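For experimenting without a Spark session, the same parsing idea can be sketched in plain Scala. This is only a sketch under assumptions: the capitalised names Xyz and Rewards, the Part regex and parseLine are illustrative and not part of the question, and each line is assumed to carry exactly one three-field entry, as in the sample data.

```scala
// Hypothetical, capitalised counterparts of the question's case classes.
final case class Xyz(id: String, code: String, county: String)
final case class Rewards(memId: String, name: String, part: List[Xyz])

// Matches text such as "WrappedArray([STD,Health,Duval])" and captures the three fields.
val Part = """\s*WrappedArray\(\s*\[\s*([^,\]]*?)\s*,\s*([^,\]]*?)\s*,\s*([^,\]]*?)\s*\]\s*\)\s*""".r

def parseLine(line: String): Option[Rewards] =
  line.split("~", -1) match {
    case Array(memId, name, Part(id, code, county)) =>
      Some(Rewards(memId, name, List(Xyz(id, code, county))))
    case _ =>
      None // wrong number of fields or unexpected array text
  }

val parsed = parseLine("251~jhon~WrappedArray([STD,Health,Duval])")
```

Returning an Option means a malformed line yields None instead of a MatchError. List is used here rather than Array only because arrays compare by reference, which makes results harder to check.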

Related

How to handle missing nested fields in spark?

Given the two case classes:
case class Response(
responseField: String
...
items: List[Item])
case class Item(
itemField: String
...)
I am creating a Response dataset:
val dataset = spark.read.format("parquet")
.load(inputPath)
.as[Response]
.map(x => x)
The issue arises when itemField is not present in any of the rows: Spark raises org.apache.spark.sql.AnalysisException: No such struct field itemField. If itemField were not nested, I could handle it with dataset.withColumn("itemField", lit("")). Is it possible to do the same within the List field?
I assume the following:
Data was written with the following schema:
case class Item(itemField: String)
case class Response(responseField: String, items: List[Item])
Seq(Response("a", List()), Response("b", List())).toDF.write.parquet("/tmp/structTest")
Now schema changed to:
case class Item(itemField: String, newField: Int)
case class Response(responseField: String, items: List[Item])
spark.read.parquet("/tmp/structTest").as[Response].map(x => x) // Fails
For Spark 2.4 please see:
Spark - How to add an element to an array of structs
For Spark 2.3 this should work:
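import org.apache.spark.sql.functions.{array, col, lit, udf}
// Spark passes array columns to a Scala UDF as Seq, not Array
val addNewField: (Seq[String], Seq[Int]) => Seq[Item] = (itemFields, newFields) => itemFields.zip(newFields).map { case (i, n) => Item(i, n) }
val addNewFieldUdf = udf(addNewField)
// note: zip stops at the shorter input, and array(lit(1)) has a single element
spark.read.parquet("/tmp/structTest")
  .withColumn("items", addNewFieldUdf(
    col("items.itemField") as "itemField",
    array(lit(1)) as "newField"
  )).as[Response].map(x => x) // Works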
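The per-row logic inside such a UDF is ordinary collection code, so it can be tried out without Spark. A minimal sketch, assuming hypothetical OldItem and NewItem classes and a withDefault helper that are not in the original answer:

```scala
// Hypothetical stand-ins for the old and new Item schemas.
final case class OldItem(itemField: String)
final case class NewItem(itemField: String, newField: Int)

// Give every existing item the same default value for the new field.
def withDefault(items: Seq[OldItem], default: Int): Seq[NewItem] =
  items.map(i => NewItem(i.itemField, default))

val upgraded = withDefault(Seq(OldItem("a"), OldItem("b")), 1)

// zip stops at the shorter input, so a one-element list of defaults keeps only one item.
val zipped = Seq("a", "b").zip(Seq(1))
```

Because zip truncates to the shorter sequence, mapping each item to a default avoids losing items when a row holds more than one.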

combine two lists of different case class types

Combine two case class lists into a merged case class list:
case class emp(emp_id:Integer,emp_name:String)
case class manager(manager_id:Integer,manager_name :String,emp_id:Integer)
case class combined(emp_id:Integer,emp_name :String,
manager_id:Integer,manager_name :String)
val list1:List[emp]= List(emp(1,"emp1"),emp(2,"emp2"))
val list2:List[manager]= List(manager(101,"mgr1",1),manager(103,"mgr3",1))
Expected output:
val list3 = List(
(1,"emp1",101,"mgr1"),
(1,"emp1",103,"mgr3"),
(2,"emp2",null,null))
It depends. If your data is already sorted by emp_id and you have the same number of managers as employees, you can go with:
list1.zip(list2).map { case (e, m) =>
combined(e.emp_id, e.emp_name, m.manager_id, m.manager_name)
}
However, I suppose that is not the case in a real-life scenario, where you need to match records. Since each manager has an emp_id, you can first run a groupBy on the managers and then iterate over the employees to enrich them with manager data.
val grouped: Map[Integer, List[manager]] = list2.groupBy(_.emp_id)
list1.map { e =>
  // first manager for this employee, if any; null otherwise (as in the expected output)
  val first = grouped.get(e.emp_id).flatMap(_.headOption)
  val manager_id: Integer = first.map(_.manager_id).orNull
  val manager_name: String = first.map(_.manager_name).orNull
  combined(e.emp_id, e.emp_name, manager_id, manager_name)
}
I did not check the syntax, but you should get the idea. Note that this keeps only the first manager per employee.
P.S.
Please use CamelCase with an initial capital letter for class names in Scala.
Here's how I'd be tempted to tackle it.
// types
case class Emp(emp_id:Int, emp_name:String)
case class Manager(manager_id:Int, manager_name:String, emp_id:Int)
case class Combined(emp_id :Int
,emp_name :String
,manager_id :Option[Int]
,manager_name :String)
// input data
val emps :List[Emp] = List(Emp(1,"emp1"),Emp(2,"emp2"))
val mgrs :List[Manager] = List(Manager(101,"mgr1",1),Manager(103,"mgr3",1))
// lookup Emp name by ID
val empName = emps.groupMapReduce(_.emp_id)(_.emp_name)(_+_)
mgrs.map(mgr => Combined(mgr.emp_id
,empName(mgr.emp_id)
,Some(mgr.manager_id)
,mgr.manager_name)
) ++ empName.keySet
.diff(mgrs.map(_.emp_id).toSet)
.map(id => Combined(id, empName(id), None, ""))
//res0: List[Combined] = List(Combined(1, "emp1", Some(101), "mgr1")
// ,Combined(1, "emp1", Some(103), "mgr3")
// ,Combined(2, "emp2", None, ""))
I used Option[Int] and empty string "" to replace null, which Scala style tries to avoid.
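The matching described in these answers is essentially a left join. As a rough sketch of that idea, assuming hypothetical Employee, Boss and Joined classes and a leftJoin helper (none of which come from the thread), one row can be emitted per manager and a single row with None for employees without one:

```scala
// Hypothetical types for illustration; both manager fields are optional.
final case class Employee(empId: Int, empName: String)
final case class Boss(managerId: Int, managerName: String, empId: Int)
final case class Joined(empId: Int, empName: String,
                        managerId: Option[Int], managerName: Option[String])

def leftJoin(emps: List[Employee], bosses: List[Boss]): List[Joined] = {
  // Index managers by the employee they belong to.
  val byEmp: Map[Int, List[Boss]] = bosses.groupBy(_.empId)
  emps.flatMap { e =>
    byEmp.get(e.empId) match {
      case Some(ms) =>
        ms.map(m => Joined(e.empId, e.empName, Some(m.managerId), Some(m.managerName)))
      case None =>
        List(Joined(e.empId, e.empName, None, None)) // employee without a manager
    }
  }
}

val joined = leftJoin(
  List(Employee(1, "emp1"), Employee(2, "emp2")),
  List(Boss(101, "mgr1", 1), Boss(103, "mgr3", 1)))
```

Iterating over the employees keeps the output in employee order, and using Option for both manager fields avoids picking a sentinel such as an empty string.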

How to flatten a case class with a list value to another case class properly with scala

I have case classes of Contact and Person:
case class Contact(id: String, name: String)
case class Person(id: String, name: String, age: Int, contacts: List[Contact])
Let's say I have a list of Person:
val personList = List(
  Person("1", "john", 30, List(Contact("5","mark"),Contact("6","tamy"),Contact("7","mary"))),
  Person("2", "jeff", 40, List(Contact("8","lary"),Contact("9","gary"),Contact("10","sam")))
)
I need to flatten this personList and transform it into a list of:
case class FlattenPerson(personId: String, contactId: Option[String], personName: String)
so the result would be:
val flattenPersonList = List(
  FlattenPerson("1", None, "john"),
  FlattenPerson("1", Some("5"), "mark"),
  FlattenPerson("1", Some("6"), "tamy"),
  FlattenPerson("1", Some("7"), "mary"),
  FlattenPerson("2", None, "jeff"),
  FlattenPerson("2", Some("8"), "lary"),
  FlattenPerson("2", Some("9"), "gary"),
  FlattenPerson("2", Some("10"), "sam")
)
I found one way that looks like it works, but it doesn't seem like the right way. It might break, and Scala probably has a more efficient way.
This is what I could come up with:
val people = personList.map(person => {
  FlattenPerson(person.id, None, person.name)
})
val contacts = personList.flatMap(person => {
  person.contacts.map(contact => {
    FlattenPerson(person.id, Some(contact.id), contact.name)
  })
})
val res = people ++ contacts
This would also perform badly: I need to do it for each API call my app gets, which can be a lot of calls, and I also need to filter res.
I would appreciate some help here.
I think flatMap() can do what you're after.
personList.flatMap{pson =>
FlattenPerson(pson.id, None, pson.name) ::
pson.contacts.map(cntc => FlattenPerson(pson.id, Some(cntc.id), cntc.name))
}
//res0: List[FlattenPerson] = List(FlattenPerson(1,None,john)
// , FlattenPerson(1,Some(5),mark)
// , FlattenPerson(1,Some(6),tamy)
// , FlattenPerson(1,Some(7),mary)
// , FlattenPerson(2,None,jeff)
// , FlattenPerson(2,Some(8),lary)
// , FlattenPerson(2,Some(9),gary)
// , FlattenPerson(2,Some(10),sam))
For reference, here is a recursive version of this algorithm that includes filtering in a single pass. It appears to perform somewhat faster than calling .filter(f) on the result. The non-filtered recursive version has no real performance advantage.
def flattenPeople(people: List[Person], f: FlattenPerson => Boolean): List[FlattenPerson] = {
@annotation.tailrec
def loop(person: Person, contacts: List[Contact], people: List[Person], res: List[FlattenPerson]): List[FlattenPerson] =
contacts match {
case Contact(id, name) :: tail =>
val newPerson = FlattenPerson(person.id, Some(id), name)
if (f(newPerson)) {
loop(person, tail, people, newPerson +: res)
} else {
loop(person, tail, people, res)
}
case _ =>
val newPerson = FlattenPerson(person.id, None, person.name)
val newRes = if (f(newPerson)) newPerson +: res else res
people match {
case p :: tail =>
loop(p, p.contacts, tail, newRes)
case Nil =>
newRes.reverse
}
}
people match {
case p :: tail => loop(p, p.contacts, tail, Nil)
case _ => Nil
}
}
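Another way to flatten and filter in one traversal, without hand-written recursion, is to go through an Iterator. The sketch below is an assumption-laden illustration: flattenFiltered and the sample people value are hypothetical names, and no performance claim is made, so it would need measuring against the versions above.

```scala
final case class Contact(id: String, name: String)
final case class Person(id: String, name: String, age: Int, contacts: List[Contact])
final case class FlattenPerson(personId: String, contactId: Option[String], personName: String)

// Lazily emits the person row followed by its contact rows, filters them,
// and materialises a List only once at the end.
def flattenFiltered(people: List[Person])(keep: FlattenPerson => Boolean): List[FlattenPerson] =
  people.iterator
    .flatMap { p =>
      Iterator.single(FlattenPerson(p.id, None, p.name)) ++
        p.contacts.iterator.map(c => FlattenPerson(p.id, Some(c.id), c.name))
    }
    .filter(keep)
    .toList

val people = List(Person("1", "john", 30, List(Contact("5", "mark"), Contact("6", "tamy"))))
val contactsOnly = flattenFiltered(people)(_.contactId.isDefined)
```

Unlike the recursive version, this keeps the person row ahead of its contacts, which matches the order in the question's expected result.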

Scala alternative to series of if statements that append to a list?

I have a Seq[String] in Scala, and if the Seq contains certain Strings, I append a relevant message to another list.
Is there a more 'scalaesque' way to do this, rather than a series of if statements appending to a list like I have below?
val result = new ListBuffer[Err]()
val malformedParamNames = // A Seq[String]
if (malformedParamNames.contains("$top")) result += IntegerMustBePositive("$top")
if (malformedParamNames.contains("$skip")) result += IntegerMustBePositive("$skip")
if (malformedParamNames.contains("modifiedDate")) result += FormatInvalid("modifiedDate", "yyyy-MM-dd")
...
result.toList
If you want to use some Scala collections sugar, I would use:
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.map { v =>
v match {
case "$top" => Some(IntegerMustBePositive("$top"))
case "$skip" => Some(IntegerMustBePositive("$skip"))
case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
case _ => None
}
}.flatten
result.toList
Be warned: if you ask for a Scala-esque way of doing things, there are many possibilities.
The map function combined with flatten can be simplified by using flatMap:
sealed trait Err
case class IntegerMustBePositive(msg: String) extends Err
case class FormatInvalid(msg: String, format: String) extends Err
val malformedParamNames = Seq[String]("$top", "aa", "$skip", "ccc", "ddd", "modifiedDate")
val result = malformedParamNames.flatMap {
case "$top" => Some(IntegerMustBePositive("$top"))
case "$skip" => Some(IntegerMustBePositive("$skip"))
case "modifiedDate" => Some(FormatInvalid("modifiedDate", "yyyy-MM-dd"))
case _ => None
}
result
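When every branch either produces a value or is skipped, collect with a partial function expresses the same thing without wrapping results in Some and None. A small sketch, reusing the error types from the answers above; the errors value name is only illustrative:

```scala
sealed trait Err
final case class IntegerMustBePositive(msg: String) extends Err
final case class FormatInvalid(msg: String, format: String) extends Err

val malformedParamNames = Seq("$top", "aa", "$skip", "modifiedDate")

// collect keeps only the inputs the partial function is defined for.
val errors: List[Err] = malformedParamNames.collect {
  case p @ ("$top" | "$skip") => IntegerMustBePositive(p)
  case "modifiedDate"         => FormatInvalid("modifiedDate", "yyyy-MM-dd")
}.toList
```

As with the flatMap version, the errors come out in the order of the input names rather than in the fixed order of the original if statements, and a name that appears twice is reported twice.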
Most 'scalesque' version I can think of while keeping it readable would be:
val map = scala.collection.immutable.ListMap(
"$top" -> IntegerMustBePositive("$top"),
"$skip" -> IntegerMustBePositive("$skip"),
"modifiedDate" -> FormatInvalid("modifiedDate", "yyyy-MM-dd"))
val result = for {
(k,v) <- map
if malformedParamNames contains k
} yield v
//or
val result2 = map.filterKeys(malformedParamNames.contains).values.toList
Benoit's is probably the most scala-esque way of doing it, but depending on who's going to be reading the code later, you might want a different approach.
// Some type definitions omitted
val malformations = Seq[(String, Err)](
  ("$top", IntegerMustBePositive("$top")),
  ("$skip", IntegerMustBePositive("$skip")),
  ("modifiedDate", FormatInvalid("modifiedDate", "yyyy-MM-dd"))
)
If you need a list and the order is significant:
val result = (malformations.foldLeft(List.empty[Err]) { (acc, pair) =>
  if (malformedParamNames.contains(pair._1)) {
    pair._2 +: acc // prepend a single element to the list for faster performance
  } else acc
}).reverse // and reverse since we were prepending
If the order isn't significant (in which case you might consider using a Set instead of a List):
val result = (malformations.foldLeft(Set.empty[Err]) { (acc, pair) =>
  if (malformedParamNames.contains(pair._1)) {
    acc + pair._2 // add a single element to the set
  } else acc
}).toList // omit the .toList if you're OK with just a Set
If the predicates in the repeated ifs are more complex/less uniform, then the type for malformations might need to change, as they would if the responses changed, but the basic pattern is very flexible.
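One possible shape for that more general case is to pair each error with an arbitrary predicate over the parameter names. This is a sketch only: the Rule class, the rules list, the check function and the startsWith condition are all made up for illustration.

```scala
sealed trait Err
final case class IntegerMustBePositive(msg: String) extends Err
final case class FormatInvalid(msg: String, format: String) extends Err

// Hypothetical rule type: a predicate over all names plus the error to report.
final case class Rule(applies: Seq[String] => Boolean, err: Err)

val rules: List[Rule] = List(
  Rule(_.contains("$top"), IntegerMustBePositive("$top")),
  Rule(_.contains("$skip"), IntegerMustBePositive("$skip")),
  // a less uniform condition, invented for this example
  Rule(_.exists(_.startsWith("modified")), FormatInvalid("modifiedDate", "yyyy-MM-dd"))
)

// Errors are reported in rule order, like the original chain of if statements.
def check(names: Seq[String]): List[Err] =
  rules.collect { case Rule(applies, err) if applies(names) => err }

val found = check(Seq("$skip", "modifiedSince"))
```

Because the rules are iterated rather than the input, each error appears at most once and the output order is fixed by the rule list.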
In this solution we define a list of operations, each pairing an IF condition (matcher) with a THEN result (action). We then iterate over the input list and emit the action wherever a matcher equals the input.
// IF THEN
case class Operation(matcher :String, action :String)
def processInput(input :List[String]) :List[String] = {
val operations = List(
Operation("$top", "integer must be positive"),
Operation("$skip", "skip value"),
Operation("$modify", "modify the date")
)
input.flatMap { in =>
operations.find(_.matcher == in).map { _.action }
}
}
println(processInput(List("$skip","$modify", "$skip")));
A breakdown
operations.find(_.matcher == in) // find an operation in our
// list matching the input we are
// checking. Returns Some or None
.map { _.action } // if some, replace input with action
// if none, do nothing
input.flatMap { in => // inputs are processed, converted
// to some(action) or none and the
// flatten removes the some/none
// returning just the strings.
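Since each matcher here is an exact string, the find step can also be written as a Map lookup, which returns an Option in the same way. A minimal sketch, assuming the hypothetical names actions and processInputWithMap:

```scala
// Same pairs as the operations list, keyed by matcher.
val actions: Map[String, String] = Map(
  "$top"    -> "integer must be positive",
  "$skip"   -> "skip value",
  "$modify" -> "modify the date"
)

// Map.get yields Some(action) or None; flatMap drops the None results.
def processInputWithMap(input: List[String]): List[String] =
  input.flatMap(in => actions.get(in))

val processed = processInputWithMap(List("$skip", "unknown", "$modify", "$skip"))
```

The list-of-operations form remains the better fit if the matchers later become something other than plain equality.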

pattern matching on a series of values in scala

I'm a Scala beginner and I'm struggling with this piece of code.
Is there a way to use pattern matching to make sure everything I pass to Data is of the correct type? As you can see, I have quite strange data types...
class Data (
val recipient: String,
val templateText: String,
val templateHtml: String,
val blockMaps: Map[String,List[Map[String,String]]],
templateMap: Map[String,String]
)
...
val dataParsed = JSON.parseFull(message)
dataParsed match {
case dataParsed: Map[String, Any] => {
def e(s: String) = dataParsed get s
val templateText = e("template-text")
val templateHtml = e("template-html")
val recipient = e("email")
val templateMap = e("data")
val blockMaps = e("blkdata")
val dependencies = new Data(recipient, templateText, templateHtml, blockMaps, templateMap)
Core.inject ! dependencies
}
...
I guess your problem is that you want to pattern match the map you get from parseFull(), but Map doesn't have an unapply.
So you could pattern match every single value, providing a default if it is not of the correct type:
val templateText: Option[String] = e("template-text") match {
  case Some(s: String) => Some(s) // e returns an Option[Any]
  case _ => None
}
Or temporarily put all the data into some structure that can be pattern matched:
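val data = (e("template-text"), e("template-html"), e("email"), e("data"),
  e("blkdata"))
val dependencies: Option[Data] = data match {
  // e returns Option[Any], so each element is matched as Some(...), in tuple order.
  // Map type arguments are erased at runtime: only "is a Map" is really checked here.
  case (Some(templateText: String),
        Some(templateHtml: String),
        Some(recipient: String),
        Some(templateMap: Map[String,String]),
        Some(blockMaps: Map[String,List[Map[String,String]]])) =>
    Some(new Data(recipient, templateText, templateHtml, blockMaps, templateMap))
  case _ => None
}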
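Because of type erasure, a pattern such as Map[String,String] cannot verify the key and value types, so one option is to check the entries explicitly. The sketch below works on a hand-built Map[String, Any] rather than real parser output. The helpers str and strMap, the Parsed class and the parse function are hypothetical, and only three of the fields are covered.

```scala
// Typed lookup of a String value in an untyped map.
def str(m: Map[String, Any], key: String): Option[String] =
  m.get(key).collect { case s: String => s }

// Typed lookup of a Map[String, String]; every entry is checked individually.
def strMap(m: Map[String, Any], key: String): Option[Map[String, String]] =
  m.get(key)
    .collect { case raw: Map[_, _] => raw.asInstanceOf[Map[Any, Any]] } // safe: only widens erased type arguments
    .flatMap { raw =>
      val pairs = raw.toList.collect { case (k: String, v: String) => k -> v }
      if (pairs.size == raw.size) Some(pairs.toMap) else None
    }

// Hypothetical, reduced version of the Data class.
final case class Parsed(recipient: String, templateText: String, templateMap: Map[String, String])

// The for-comprehension yields None as soon as one field is missing or mistyped.
def parse(m: Map[String, Any]): Option[Parsed] =
  for {
    recipient <- str(m, "email")
    text      <- str(m, "template-text")
    data      <- strMap(m, "data")
  } yield Parsed(recipient, text, data)

val good: Map[String, Any] =
  Map("email" -> "a@b.c", "template-text" -> "hi", "data" -> Map("k" -> "v"))
val bad: Map[String, Any] = good.updated("data", Map("k" -> 1))
```

Here parse(good) produces a Parsed value while parse(bad) is None, since the nested map holds an Int where a String is required.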