Spark - Merging Multiple Map Columns with Common Keys - scala

I have a series of columns with the schemas Map[(Int, Int), Row] and Map[(Int, Int), String], i.e. int-tuple-keyed maps holding some data (a string or a struct). I am attempting to take these columns, extract the data, and insert it into a single column based on the keys via a UDF, as shown below.
case class PersonConcept(field1: Int, field2: Int, field3: String,
                         field4: String, field5: String)

private def mergeMaps: UserDefinedFunction = {
  val f = (keyedData: Map[Row, Row], keyedNames: Map[Row, String],
           keyedOccupations: Map[Row, String]) => {
    val intKeyedData = keyedData.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    val intKeyedNames = keyedNames.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    val intKeyedOccupations = keyedOccupations.map {
      case (row, data) =>
        (row.getInt(0), row.getInt(1)) -> data
    }
    intKeyedData.map {
      case (k, v) =>
        val value =
          PersonConcept(v.getInt(0),
                        v.getInt(1),
                        v.getString(2),
                        intKeyedOccupations(k), // <--- Grabbing occupation
                        intKeyedNames(k))       // <--- Grabbing name
        (k, value)
    }
  }
  udf(f)
}
I know for a fact that each map contains the exact same set of keys. However, I am encountering a NoSuchElementException.
Caused by: java.util.NoSuchElementException: key not found: (24,75)
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
The UDF works fine when the dataset consists of only one row or one partition, leading me to believe that this is somehow related to cross-partition variable scope or closures (http://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-). What I don't understand is what could be causing this, and whether there is a feasible way to do this while keeping this data model.
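One way to make the failure easier to inspect, purely as a diagnostic sketch and not a fix for the underlying cause, is to replace the direct apply() lookups in the final map with getOrElse, so the error reports which map is missing the key:

// Diagnostic variant of the final map in the UDF above: report which lookup
// failed instead of throwing a bare NoSuchElementException inside the task.
intKeyedData.map {
  case (k, v) =>
    val occupation = intKeyedOccupations.getOrElse(
      k, sys.error(s"occupation map is missing key $k"))
    val name = intKeyedNames.getOrElse(
      k, sys.error(s"name map is missing key $k"))
    k -> PersonConcept(v.getInt(0), v.getInt(1), v.getString(2), occupation, name)
}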

Related

Scala map with function call results in references to the function instead of results

I have a list of keys for which I want to fetch data. The data is fetched via a function call for each key. I want to end up with a Map of key -> data. Here's what I've tried:
case class MyDataClass(val1: Int, val2: Boolean)

def getData(key: String): MyDataClass = {
  // Dummy implementation
  MyDataClass(1, true)
}

def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] = {
  val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
  dataMap
}
This results in a type mismatch error:
type mismatch;
found : scala.collection.immutable.Map[String,String => MyDataClass]
required: Map[String,MyDataClass]
val dataMap: Map[String, MyDataClass] = keys.map((_, getData(_))).toMap
Why is it setting the values in the resulting Map to instances of the getData() function, rather than its result? How do I make it actually CALL the getData() function for each key and put the results as the values in the Map?
The code you wrote is the same as the following statements:
keys.map((_, getData(_)))
keys.map(x => (x, getData(_)))
keys.map(x => (x, y => getData(y)))
This should clarify why you obtain the error.
As suggested in the comments, stay away from _ except in simple cases with only one occurrence.
The gist of the issue is that (_, getData(_)) builds a tuple whose second element is itself a function, because each underscore introduces its own lambda parameter. Naming the parameter explicitly makes getData actually get called for each key:
...
val dataMap: Map[String, MyDataClass] = keys.map(key => (key -> getData(key))).toMap
...
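To see the difference directly (a minimal check using the same names as above):

// getData(_) is shorthand for a new anonymous function, not a call:
val f: String => MyDataClass = getData(_)

// Naming the parameter forces the call for each key:
def getDataMapForKeys(keys: Seq[String]): Map[String, MyDataClass] =
  keys.map(key => key -> getData(key)).toMap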

Read HOCON config as a Map[String, String] with keys in dot notation

I have the following HOCON config:
a {
  b.c.d = "val1"
  d.f.g = "val2"
}
HOCON represents the paths "b.c.d" and "d.f.g" as nested objects. I would like to have a reader that reads this config as a Map[String, String], e.g.:
Map("b.c.d" -> "val1", "d.f.g" -> "val2")
I've created a reader and am trying to do it recursively:
import scala.collection.mutable.{Map => MutableMap}

private implicit val mapReader: ConfigReader[Map[String, String]] = ConfigReader.fromCursor(cur => {
  def concat(prefix: String, key: String): String = if (prefix.nonEmpty) s"$prefix.$key" else key

  def toMap(): Map[String, String] = {
    val acc = MutableMap[String, String]()

    def go(
        cur: ConfigCursor,
        prefix: String = EMPTY,
        acc: MutableMap[String, String]
    ): Result[Map[String, Object]] = {
      cur.fluent.mapObject { obj =>
        obj.value.valueType() match {
          case ConfigValueType.OBJECT => go(obj, concat(prefix, obj.pathElems.head), acc)
          case ConfigValueType.STRING =>
            acc += (concat(prefix, obj.pathElems.head) -> obj.asString.right.getOrElse(EMPTY))
        }
        obj.asRight
      }
    }

    go(cur, acc = acc)
    acc.toMap
  }

  toMap().asRight
})
It gives me the correct result but is there a way to avoid MutableMap here?
P.S. I would also like to keep the implementation as a "pureconfig" reader.
The solution given by Ivan Stanislavciuc isn't ideal. If the parsed config object contains values other than strings or objects, you don't get an error message (as you would expect) but instead some very strange output. For instance, if you parse a typesafe config document like this
"a":[1]
The resulting value will look like this:
Map(a -> [
# String: 1
1
])
And even if the input only contains objects and strings, it doesn't work correctly, because it erroneously adds double quotes around all the string values.
So I gave this a shot myself and came up with a recursive solution that reports an error for things like lists or null and doesn't add quotes that shouldn't be there.
implicit val reader: ConfigReader[Map[String, String]] = {
  implicit val r: ConfigReader[String => Map[String, String]] =
    ConfigReader[String]
      .map(v => (prefix: String) => Map(prefix -> v))
      .orElse {
        reader.map { v => (prefix: String) =>
          v.map { case (k, v2) => s"$prefix.$k" -> v2 }
        }
      }
  ConfigReader[Map[String, String => Map[String, String]]].map {
    _.flatMap { case (prefix, v) => v(prefix) }
  }
}
Note that my solution doesn't mention ConfigValue or ConfigReader.Result at all. It only takes existing ConfigReader objects and combines them with combinators like map and orElse. This is, generally speaking, the best way to write ConfigReaders: don't start from scratch with methods like ConfigReader.fromFunction, use existing readers and combine them.
It seems a bit surprising at first that the above code works at all, because I'm using reader within its own definition. But it works because the orElse method takes its argument by name and not by value.
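For reference, a minimal usage sketch of the reader above; this assumes a recent pureconfig with the ConfigSource API:

import pureconfig._

// With the recursive reader in implicit scope, load the block under "a":
val loaded = ConfigSource
  .string("""a { b.c.d = "val1", d.f.g = "val2" }""")
  .at("a")
  .load[Map[String, String]]
// loaded: Right(Map(b.c.d -> val1, d.f.g -> val2))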
You can do the same without recursion, using the method entrySet as follows:
import scala.jdk.CollectionConverters._

val hocon =
  """
    |a {
    |  b.c.d = "val1"
    |  d.f.g = val2
    |}""".stripMargin

val config = ConfigFactory.load(ConfigFactory.parseString(hocon))
val innerConfig = config.getConfig("a")
val map = innerConfig
  .entrySet()
  .asScala
  .map { entry =>
    entry.getKey -> entry.getValue.render()
  }
  .toMap
println(map)
Produces
Map(b.c.d -> "val1", d.f.g -> "val2")
With this in hand, it's possible to define a pureconfig.ConfigReader that reads Map[String, String] as follows:
implicit val reader: ConfigReader[Map[String, String]] = ConfigReader.fromFunction {
  case co: ConfigObject =>
    Right(
      co.toConfig
        .entrySet()
        .asScala
        .map { entry =>
          entry.getKey -> entry.getValue.render()
        }
        .toMap
    )
  case value =>
    // Handle the error case
    Left(
      ConfigReaderFailures(
        ThrowableFailure(
          new RuntimeException("cannot be mapped to map of string -> string"),
          Option(value.origin())
        )
      )
    )
}
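As the other answer points out, render() preserves the HOCON quoting (note the quotes around "val1" in the output above). If plain values are preferred, one option, assuming the leaves are string-compatible, is Typesafe Config's unwrapped() in the mapping step:

// unwrapped() returns the underlying Java value, so strings come back unquoted:
.map { entry =>
  entry.getKey -> entry.getValue.unwrapped().toString
}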
I did not want to write custom readers to get a mapping of key-value pairs. I instead changed my internal data type from a map to a list of pairs (I am using Kotlin), which I can easily turn back into a map at some later internal stage if I need to. My HOCON was then able to look like this:
additionalProperties = [
  {first = "sasl.mechanism", second = "PLAIN"},
  {first = "security.protocol", second = "SASL_SSL"},
]
additionalProducerProperties = [
  {first = "acks", second = "all"},
]
Not the best for humans... but I prefer it to having to build custom parsing components.

Map doesn't add entry in recursive function

I'm working with Scala and want to write a class with a function that recursively adds entries to a map.
class Index(val map: Map[String, String]) {
  def add(e: (String, String)): Index = {
    Index(map + (e._1 -> e._2))
  }

  def addAll(list: List[(String, String)], index: Index = Index()): Index = {
    list match {
      case ::(head, next) => addAll(next, add(head))
      case Nil => index
    }
  }
}

object Index {
  def apply(map: Map[String, String] = Map()) = {
    new Index(map)
  }
}

val index = Index()
val list = List(
  ("e1", "f1"),
  ("e2", "f2"),
  ("e3", "f3"),
)
val newIndex = index.addAll(list)
println(newIndex.map.size.toString())
I expected this code to print 3, since the function is supposed to add 3 entries to the map, but the actual output is 1. What am I doing wrong, and how do I solve it?
Online fiddle: https://scalafiddle.io/sf/eqSxPX9/0
There is a simple error: you are calling add(head) where it should be index.add(head).
However, it is better to use a nested method when writing recursive routines like this, for example:
def addAll(list: List[(String, String)]): Index = {
  @annotation.tailrec
  def loop(rem: List[(String, String)], index: Index): Index = {
    rem match {
      case head :: tail => loop(tail, index.add(head))
      case Nil => index
    }
  }
  loop(list, Index())
}
This allows the function to be tail recursive and optimised by the compiler, and also avoids a spurious argument to the addAll method.
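For completeness, the minimal fix alone, keeping the original signature, looks like this:

def addAll(list: List[(String, String)], index: Index = Index()): Index = {
  list match {
    case ::(head, next) => addAll(next, index.add(head)) // was add(head), which ignored index
    case Nil => index
  }
}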
I see many problems with your code but to answer your question:
Each time you call addAll you create an Index with an empty Map.
In the line case ::(head, next) => addAll(next, add(head)) you are not using the index that you get from the parameter list. Shouldn't that be somehow updated?
Beware that the default Map implementation is immutable, so updating a map means creating a new one with the new value added.

Replacing values from list of custom objects with map values

I have quite an odd problem to solve: I have a String, a custom type, and a Map of Maps.
The string needs to have a few values replaced, based on a mapping from a value in the custom type (which is a key in the map of maps).
This is the current structure:
case class Students(favSubject: String)
val mapping: Map[String, Map[String, String]] = Map("John" -> Map("English" -> "Soccer"))
val studentInfo: List[Students] = List(Students("English"))
val data: String = "John is the favourite hobby"
I tried the following:
mapping.foldLeft(data) { case (outputString, (studentName, favSubject)) =>
  outputString.replace(studentName, favSubject.getOrElse(studentInfo.map(x => x.favSubject).toString, ""))
}
What I need to get is:
"Soccer is the favourite hobby"
What I get is:
" is the favourite hobby"
So it looks like I am getting the map-of-maps traversal right, but the getOrElse part is having issues.
What I would do is first change the structure of mapping so it makes more sense for the problem.
val mapping: Map[String, Map[String, String]] = Map("John" -> Map("English" -> "Soccer"))

val mapping2 =
  mapping.iterator.flatMap {
    case (student, map) => map.iterator.map {
      case (info, value) => (info, student, value)
    }
  }.toList
    .groupBy(_._1)
    .view
    .mapValues { group =>
      group.iterator.map {
        case (_, student, value) => student -> value
      }.toList
    }.toMap
// mapping2: Map[String, List[(String, String)]] = Map("English" -> List(("John", "Soccer")))
Then I would just traverse the student information, making all the necessary replacements.
final case class StudentInfo(favSubject: String)

val studentsInformation: List[StudentInfo] = List(StudentInfo("English"))
val data: String = "John is the favourite hobby"

val result =
  studentsInformation.foldLeft(data) { (acc, info) =>
    mapping2
      .getOrElse(key = info.favSubject, default = List.empty)
      .foldLeft(acc) { (acc2, tuple) =>
        val (key, replace) = tuple
        acc2.replace(key, replace)
      }
  }
// result: String = "Soccer is the favourite hobby"
When you map() a List, you get a List back. Its toString has the format "List(el1, el2, ...)". You cannot use that as a key for your sub-map; you want just el1.
Here is a version of the working code. It might not be a solution you are looking for(!), just a solution to your question:
case class Students(favSubject: String)

val mapping: Map[String, Map[String, String]] = Map("John" -> Map("English" -> "Soccer"))
val studentInfo: List[Students] = List(Students("English"))
val data: String = "John is the favourite hobby"

val res = mapping.foldLeft(data) {
  case (outputString, (studentName, favSubjectDict)) =>
    outputString.replace(
      studentName,
      favSubjectDict.getOrElse(studentInfo.map(x => x.favSubject).head, "?")
    )
}
println(s"$res") // prints "Soccer is the favourite hobby"

val notMatchingSubject = studentInfo.map(x => x.favSubject).toString
println(s"Problem in previous code: '$notMatchingSubject' !== 'English'")
Try it here: https://scastie.scala-lang.org/flQNRrUQSXWPxSTXOPPFgA
The issue
It is a bit unclear why studentInfo is a List in this form... If I had to guess, it was designed to be a list of StudentInfo containing both name and favSubject, and you would need to search it by name to find favSubject. But that is just a guess.
I went with the simplest working solution: taking .head (the first element) of the sequence mapped from the list, which will always be "English" even if you add more Students to the list.
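A quick check of the mismatch described above:

studentInfo.map(_.favSubject).toString // "List(English)", not a key of the sub-map
studentInfo.map(_.favSubject).head     // "English", which matches the key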

Nested Scala case classes to/from CSV

There are many nice libraries for writing/reading Scala case classes to/from CSV files. I'm looking for something that goes beyond that, which can handle nested cases classes. For example, here a Match has two Players:
case class Player(name: String, ranking: Int)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7), Player("Fred", 23)),
  Match("Rome", Player("Marco", 19), Player("Giulia", 3)),
  Match("Paris", Player("Isabelle", 2), Player("Julien", 5))
)
I'd like to effortlessly (no boilerplate!) write/read matches to/from this CSV:
place,winner.name,winner.ranking,loser.name,loser.ranking
London,Jane,7,Fred,23
Rome,Marco,19,Giulia,3
Paris,Isabelle,2,Julien,5
Note the automated header line using the dot "." to form the column name for a nested field, e.g. winner.ranking. I'd be delighted if someone could demonstrate a simple way to do this (say, using reflection or Shapeless).
[Motivation. During data analysis it's convenient to have a flat CSV to play around with, for sorting, filtering, etc., even when case classes are nested. And it would be nice if you could load nested case classes back from such files.]
Since a case-class is a Product, getting the values of the various fields is relatively easy. Getting the names of the fields/columns does require using Java reflection.
The following function takes a list of case-class instances and returns a list of rows, each a list of strings. It uses recursion to get the values and headers of nested case-class instances.
def toCsv(p: List[Product]): List[List[String]] = {
  def header(c: Class[_], prefix: String = ""): List[String] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      if (classOf[Product].isAssignableFrom(field.getType)) header(field.getType, name + ".")
      else List(name)
    }
  }

  def flatten(p: Product): List[String] =
    p.productIterator.flatMap {
      case p: Product => flatten(p)
      case v: Any => List(v.toString)
    }.toList

  header(classOf[Match]) :: p.map(flatten) // note: the header class is hardcoded to Match here
}
However, constructing case-classes from CSV is far more involved, requiring reflection to get the types of the various fields, to create the values from the CSV strings, and to construct the case-class instances.
For simplicity (not saying the code is simple, just so it won't be further complicated), I assume that the order of columns in the CSV is the same as if the file was produced by the toCsv(...) function above.
The following function starts by creating a list of "instructions for how to process a single CSV row" (the instructions are also used to verify that the column headers in the CSV match the case-class properties). The instructions are then used to recursively process one CSV row at a time.
import scala.reflect.ClassTag

def fromCsv[T <: Product](csv: List[List[String]])(implicit tag: ClassTag[T]): List[T] = {
  trait Instruction {
    val name: String
    val header = true
  }
  case class BeginCaseClassField(name: String, clazz: Class[_]) extends Instruction {
    override val header = false
  }
  case class EndCaseClassField(name: String) extends Instruction {
    override val header = false
  }
  case class IntField(name: String) extends Instruction
  case class StringField(name: String) extends Instruction
  case class DoubleField(name: String) extends Instruction

  def scan(c: Class[_], prefix: String = ""): List[Instruction] = {
    c.getDeclaredFields.toList.flatMap { field =>
      val name = prefix + field.getName
      val fType = field.getType
      if (fType == classOf[Int]) List(IntField(name))
      else if (fType == classOf[Double]) List(DoubleField(name))
      else if (fType == classOf[String]) List(StringField(name))
      else if (classOf[Product].isAssignableFrom(fType)) BeginCaseClassField(name, fType) :: scan(fType, name + ".")
      else throw new IllegalArgumentException(s"Unsupported field type: $fType")
    } :+ EndCaseClassField(prefix)
  }

  def produce(instructions: List[Instruction], row: List[String], argAccumulator: List[Any]): (List[Instruction], List[String], List[Any]) = instructions match {
    case IntField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toInt)
    case StringField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head)
    case DoubleField(_) :: tail => produce(tail, row.drop(1), argAccumulator :+ row.head.toDouble)
    case BeginCaseClassField(_, clazz) :: tail =>
      val (instructionRemaining, rowRemaining, constructorArgs) = produce(tail, row, List.empty)
      val newCaseClass = clazz.getConstructors.head.newInstance(constructorArgs.map(_.asInstanceOf[AnyRef]): _*)
      produce(instructionRemaining, rowRemaining, argAccumulator :+ newCaseClass)
    case EndCaseClassField(_) :: tail => (tail, row, argAccumulator)
    case Nil if row.isEmpty => (Nil, Nil, argAccumulator)
    case Nil => throw new IllegalArgumentException("Not all values from CSV row were used")
  }

  val instructions = BeginCaseClassField(".", tag.runtimeClass) :: scan(tag.runtimeClass)
  assert(csv.head == instructions.filter(_.header).map(_.name), "CSV header doesn't match target case-class fields")
  csv.drop(1).map(row => produce(instructions, row, List.empty)._3.head.asInstanceOf[T])
}
I've tested this using:
case class Player(name: String, ranking: Int, price: Double)
case class Match(place: String, winner: Player, loser: Player)

val matches = List(
  Match("London", Player("Jane", 7, 12.5), Player("Fred", 23, 11.1)),
  Match("Rome", Player("Marco", 19, 13.54), Player("Giulia", 3, 41.8)),
  Match("Paris", Player("Isabelle", 2, 31.7), Player("Julien", 5, 16.8))
)

val csv = toCsv(matches)
val matchesFromCsv = fromCsv[Match](csv)
assert(matches == matchesFromCsv)
Obviously this should be optimized and hardened if you ever want to use this for production...
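Since toCsv returns a List[List[String]] rather than CSV text, a trivial rendering step (no quoting or escaping, which is fine for the sample data above) is:

// Join cells with commas and rows with newlines:
val csvText = csv.map(_.mkString(",")).mkString("\n")
println(csvText)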