Intersect and Merge Arbitrary Maps by Id in Scala

After reading two JSON files, I have the following two maps:
val m1 = Map("events" -> List(Map("id" -> "Beatles", "when" -> "Today"), Map("id" -> "Elvis", "when" -> "Tomorrow")))
val m2 = Map("events" -> List(Map("id" -> "Beatles", "desc" -> "The greatest band"), Map("id" -> "BeachBoys", "desc" -> "The second best band")))
I want to merge them in a generic way (without referencing the specific structure of these two particular maps) such that the result would be:
val m3 = Map("events" -> List(Map("id" -> "Beatles", "when" -> "Today", "desc"->"The greatest band")))
That is, first intersect by id and then join the matching maps (both at the same depth level). It would be fine if it only worked for a maximum depth of one, as in this example (though a fully recursive solution that could handle arbitrarily nested lists of maps would of course be even better). This needs to be done in a completely generic way (otherwise it would be trivial), as the keys (like "events", "id", "when", ...) in both source JSON files will change.
I tried the (standard) Monoid/Semigroup addition in Scalaz/Cats; however, this of course only concatenates the list elements and does not intersect/join.
val m3 = m1.combine(m2) // Cats
// Map(events -> List(Map(id -> Beatles, when -> Today), Map(id -> Elvis, when -> Tomorrow), Map(id -> Beatles, desc -> The greatest band), Map(id -> BeachBoys, desc -> The second best band)))
EDIT: The only assumption about the map structure is that there might be an "id" field. If it is present, then intersect and finally join.
Some background: I have two kinds of JSON files, one with static information (e.g. a description of a band) and one with dynamic information (e.g. the date of the next concert). After reading the files, I get the two maps presented above. I want to avoid exploiting the specific structure of the JSON files (e.g. by creating a domain model via case classes), as there are different scenarios with completely different source file structures which are likely subject to change, and I don't want to create a dependency on these file structures in source code. Therefore, I need a generic way to merge these two maps.
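For reference, here is a minimal depth-one sketch of the requested semantics (my own illustration, assuming the values are lists of Map[String, String] and that entries without a matching "id" on the other side are dropped):
def mergeById(
    xs: List[Map[String, String]],
    ys: List[Map[String, String]]
): List[Map[String, String]] = {
  // Index the right-hand side by its "id" field, skipping maps without one.
  val byId = ys.flatMap(m => m.get("id").map(_ -> m)).toMap
  for {
    x  <- xs
    id <- x.get("id").toList
    y  <- byId.get(id).toList // intersect: keep only ids present on both sides
  } yield x ++ y              // join: merge the two maps for that id
}

val m3 = m1.map { case (k, v) => k -> mergeById(v, m2.getOrElse(k, Nil)) }
// Map(events -> List(Map(id -> Beatles, when -> Today, desc -> The greatest band)))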

So you have these two maps.
val m1 = Map("events" -> List(Map("id" -> "Beatles", "when" -> "Today"), Map("id" -> "Elvis", "when" -> "Tomorrow")))
val m2 = Map("events" -> List(Map("id" -> "Beatles", "desc" -> "The greatest band"), Map("id" -> "BeachBoys", "desc" -> "The second best band")))
It looks like you are trying to group the events and form event groups keyed by id.
Your domain model can be represented with the following case classes:
case class EventDetails(title: String, desc: String)
case class Event(subjectId: String, eventDetails: EventDetails)
case class EventGroup(subjectId: String, eventDetailsList: List[EventDetails])
Let's convert our Maps into more meaningful domain objects:
def eventMapToEvent(eventMap: Map[String, String]): Option[Event] = {
  val subjectIdOpt = eventMap.get("id")
  // Treat the first remaining key/value pair as the event's title and description.
  val (titleOpt, descOpt) = (eventMap - "id").toList.headOption match {
    case Some((title, desc)) => (Some(title), Some(desc))
    case _                   => (None, None)
  }
  (subjectIdOpt, titleOpt, descOpt) match {
    case (Some(subjectId), Some(title), Some(desc)) =>
      Some(Event(subjectId, EventDetails(title, desc)))
    case _ => None
  }
}
val m1Events = m1.getOrElse("events", List()).flatMap(eventMapToEvent)
val m2Events = m2.getOrElse("events", List()).flatMap(eventMapToEvent)
val events = m1Events ++ m2Events
Now the world will make more sense than when dealing with raw maps, and we can proceed with the grouping.
val eventGroups = events.groupBy(_.subjectId).map {
  case (subjectId, eventList) =>
    EventGroup(subjectId, eventList.map(_.eventDetails))
}.toList
// eventGroups: List[EventGroup] = List(EventGroup(BeachBoys,List(EventDetails(desc,The second best band))), EventGroup(Elvis,List(EventDetails(when,Tomorrow))), EventGroup(Beatles,List(EventDetails(when,Today), EventDetails(desc,The greatest band))))

Related

Join 3 maps (2 master maps, 1 resultant map) so that the composite keys in the resultant map take their values from the master maps

I have one map containing some master data (id -> description):
val map1: Map[String, String] = Map("001" -> "ABCD", "002" -> "MNOP", "003" -> "WXYZ")
I have another map containing some other master data (id -> description):
val map2: Map[String, String] = Map("100" -> "Ref1", "200" -> "Ref2", "300" -> "Ref3")
I have a resultant map derived from some data set, in which ids from map1 and map2 have been used in combination to form the key; to be precise, it was derived by grouping on the ids from both of the above maps and then accumulating the amounts:
val map3:Map[(String, String),Double] = Map(("001","200")->3452.30,("003","300")->78484.33,("002","777") -> 893.45)
I need the output as a Map, as follows:
Map(("ABCD","Ref2") -> 3452.30, ("WXYZ","Ref3") -> 78484.33, ("MNOP","777") -> 893.45)
I have been trying this:
val map5 = map3.map(obj => {
  (map1 getOrElse(obj._1._1, "noMatchMap1"))
  (map2 getOrElse(obj._1._2, "noMatchMap2"))
} match {
  case "noMatchMap1" => obj
  case "noMatchMap2" => obj
  case value => value -> obj._2
})
This should be it:
map3.map {
  case ((key1, key2), d) => ((map1.getOrElse(key1, key1), map2.getOrElse(key2, key2)), d)
}
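For the sample maps above this evaluates to the expected output; unmatched ids such as "777" simply fall back to the raw key:
val map5 = map3.map {
  case ((key1, key2), d) =>
    ((map1.getOrElse(key1, key1), map2.getOrElse(key2, key2)), d)
}
// map5: Map[(String, String),Double] =
//   Map((ABCD,Ref2) -> 3452.3, (WXYZ,Ref3) -> 78484.33, (MNOP,777) -> 893.45)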
Btw, I invite you to consult https://stackoverflow.com/help/how-to-ask for advice on how to ask good questions, and in particular, please include what you have tried. I'm happy to help you, but this isn't a site where you can just dump your homework/work and get it done :-D

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects contained in a file Array.json, for which the key schema isn't known; that is, the keys aren't known until the file is read.
We then wish to output a CSV file whose header (first line) is a comma-delimited set of the keys from all of the objects.
Each following line of the file should contain the comma-separated values corresponding to each key.
Array.json
[
  { abc: 123, xy: "yz", s12: 13 },
  ...
  { abc: 1, s: 133 }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming, but I suspect that this problem doesn't lend itself well to a functional solution.
I believe this problem requires some state to be kept for the output header, and that each subsequent line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and, if possible, parallelizability. If this isn't possible, can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this, functionally? Say you pass through the array once, in some particular order. From the start the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass...
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? (By functional tools I mean monads, flatMap, filters, etc.) Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but I am open to solutions in Scala, etc. Ideally I would like to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output will change with every new line of input, you must hold it in memory before writing it out. If you consider creating the output text from an internal representation of the CSV file to be another "pass" over the data (the internal representation is practically a Map[String, List[String]], which you must traverse to convert it to text), then it is not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read one item at a time from your JSON file and merge it into the CSV representation until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
and that you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
Or use recursion to get the same result:
import scala.annotation.tailrec

@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: Of course you don't have to create types for CsvFile or Item; you can use Map[String, List[String]] and Map[String, String] respectively.
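For instance, a typeless variant of the same fold might look like this (a sketch under that assumption):
def toCsvMap(items: Stream[Map[String, String]]): Map[String, List[String]] =
  items.zipWithIndex.foldLeft(Map.empty[String, List[String]]) {
    case (csv, (item, row)) =>
      // Pad columns that first appear at this row with empty cells, then append.
      (csv.keySet ++ item.keySet).map { k =>
        k -> (csv.getOrElse(k, List.fill(row)("")) :+ item.getOrElse(k, ""))
      }.toMap
  }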
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    // Columns not seen before get empty cells for all previous rows.
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map { k =>
      (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
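A quick usage sketch of the above (the sample data and the rendering step are my own additions; column order follows the Map's iteration order):
val items = Stream(
  Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"),
  Map("abc" -> "1", "s" -> "133")
)
val csv = items.foldLeft(CsvFile(Map()))(_ merge _)

// Render the header plus one line per merged row.
val header = csv.lines.keys.toList
val text = header.mkString(",") + "\n" +
  (0 until csv.rowCount)
    .map(i => header.map(k => csv.lines(k)(i)).mkString(","))
    .mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133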
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
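To turn that into the actual CSV text (a small completion, not part of the original answer):
val csvText =
  (keys.mkString(",") +: arr.map(x => keys.map(y => x.getOrElse(y, "")).mkString(",")))
    .mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133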
It's completely OK to have state in functional programming, but having mutable state, or mutating state in place, is not. Functional programming advocates creating new, changed state instead of mutating state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or producing side effects.
Coming to the point.
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
Or use flatMap instead of map plus flatten:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed, but immutable state/values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.map(_.keys).flatten.toSet
val keyvalues = input.map(kvs => keys.map(k => k -> kvs.getOrElse(k, "")).toMap)
val values = keyvalues.map(_.values)
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
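For the record, evaluating result on this sample input gives the following (with more than four keys, column order would depend on the Set's iteration order):
println(result)
// abc,xy,s12,s
// 123,yz,13,
// 1,,,33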

Extended groupBy

I have a collection of objects, each having a list inside, let us say:
case class Article(text: String, labels: List[String])
I need to construct a map, such that every key corresponds to one of the labels and its associated value is a list of articles having this label.
I hope that following example makes my goal more clear. I want to transform following list
List(
Article("article1", List("label1", "label2")),
Article("article2", List("label2", "label3"))
)
into a Map[String,List[Article]]:
Map(
"label1" -> List(Article("article1", ...)),
"label2" -> List(Article("article1", ...), Article("article2", ...)),
"label3" -> List(Article("article2", ...))
)
Is there an elegant way to perform this transformation? (I mean using collection methods, without needing to use mutable collections directly.)
How about this?
val ls = List(
Article("article1", List("label1", "label2")),
Article("article2", List("label2", "label3"))
)
val m = ls
  .flatMap { case a @ Article(_, labels) => labels.map(_ -> a) }
  .groupBy(_._1)
  .mapValues(_.map(_._2))
m.foreach(println)
// (label2,List(Article(article1,List(label1, label2)), Article(article2,List(label2, label3))))
// (label1,List(Article(article1,List(label1, label2))))
// (label3,List(Article(article2,List(label2, label3))))
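On Scala 2.13+, where mapValues is deprecated, the same grouping can be written more directly with groupMap (an alternative sketch, not from the original answer):
val byLabel = ls
  .flatMap(a => a.labels.map(_ -> a)) // pair each label with its article
  .groupMap(_._1)(_._2)               // key by label, collect the articles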

Distributed Map in Scala Spark

Does Spark support distributed Map collection types?
So if I have a HashMap[String,String] of key/value pairs, can it be converted to a distributed Map collection type? To access an element I could use "filter", but I doubt this performs as well as a Map?
Since I found some new info, I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function, but I would like to point out that you should be careful: if the RDD's partitioner is None, lookup just uses a filter anyway. In reference to a (K,V) store on top of Spark, it looks like this is in progress, and a usable pull request has been made here. Here is an example usage:
import org.apache.spark.rdd.IndexedRDD
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)
// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))
// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None
It seems like the pull request was well received and will probably be included in future versions of Spark, so it is probably safe to use it in your own code. Here is the JIRA ticket, in case you were curious.
The quick answer: Partially.
You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k,v) pairs, but by doing so you lose the constraint that the keys of a map must form a set; i.e. you lose the semantics of the Map structure.
From a practical perspective, you can still resolve an element to its corresponding value using kvRdd.lookup(element), but the result will be a sequence, since you have no guarantee that there is a single value for the key, as explained before.
A spark-shell example to make things clear:
val englishNumbers = Map(1 -> "one", 2 ->"two" , 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)
englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one)
val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)
val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd
bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)

Scala: batch process Map of index-based form fields

Part of a web app I'm working on handles forms that need to be bound to a collection of model (case class) instances. See this question
So, if I were to add several users at one time, form fields would be named email[0], email[1], password[0], password[1], etc.
Posting the form results in a Map[String, Seq[String]]
Now, what I would like to do is to process the Map in batches, by index, so that for each iteration I can bind a User instance, creating a List[User] as the final result of the bindings.
The hacky approach I'm thinking of is to regex match against "[\d]" in the Map keys and then find the highest index via filter or count; with that, I could then go through the form-field rows with (0..n).toList map { ?? }, calling the binding/validation method (which also takes a Map[String, Seq[String]]) accordingly.
What is a concise way to achieve this?
Assuming that:
All map keys are of the form "field[index]".
There is only one value in the Seq for each key.
If there is an entry for "email[x]" then there is an entry for "password[x]", and vice versa.
I would do something like this:
val request = Map(
  "email[0]" -> Seq("alice@example.com"),
  "email[1]" -> Seq("bob@example.com"),
  "password[0]" -> Seq("%vT*n7#4"),
  "password[1]" -> Seq("Bfts7B&^")
)
case class User(email: String, password: String)
val Field = """(.+)\[(\d+)\]""".r
val userList = request
  .groupBy { case (Field(_, idx), _) => idx.toInt }
  .mapValues { userMap =>
    def extractField(name: String) =
      userMap.collect { case (Field(`name`, _), values) => values.head }.head
    User(extractField("email"), extractField("password"))
  }
  .toList.sortBy(_._1).map(_._2)
REPL output:
request: scala.collection.immutable.Map[String,Seq[String]] = Map(email[0] -> List(alice@example.com),
email[1] -> List(bob@example.com), password[0] -> List(%vT*n7#4), password[1] -> List(Bfts7B&^))
defined class User
Field: scala.util.matching.Regex = (.+)\[(\d+)\]
userList: List[User] = List(User(alice@example.com,%vT*n7#4), User(bob@example.com,Bfts7B&^))