Spark/Scala: transform [map of array] to [map of map]

I am looking to change the way data is stored in one of my DataFrame's columns.
The column content-value currently has this type:
|-- content-value: map (nullable = true)
| |-- key: integer
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
And the data is currently stored like this:
{4 -> [5191, 57, -46, POS2], 5 -> [5413, 56, 48, POS2], 2 -> [5421, -59, 47, POS2], 1 -> [5237, -59, -47, POS2], 3 -> [5153, -10, 42, POS1]}
I would like to change that to a map of map that would look like:
{4 -> {value -> 5191, x -> 57, y -> -46, pos -> POS2}, 5 -> {value -> 5413, x -> 56, y -> 48, pos -> POS2}, 2 -> {value -> 5421, x -> -59, y -> 47, pos -> POS2}, 1 -> {value -> 5237, x -> -59, y -> -47, pos -> POS2}, 3 -> {value -> 5153, x -> -10, y -> 42, pos -> POS1}}
I've tried creating a new column with the keys ["value", "x", "y", "pos"] and using map_from_arrays, without success.
Would love some help!

With a Dataset:
import spark.implicits._

case class Value(value: String, x: String, y: String, pos: String)

val ds = spark.createDataset[Map[Int, Array[String]]](
  Seq(Map(4 -> Array("5191", "57", "-46", "POS2"))))

val dsFinal = ds.map(el => el.flatMap {
  case (key, value) => Map(key -> Value(value(0), value(1), value(2), value(3)))
})
It gives:
+----------------------------+
|value |
+----------------------------+
|{4 -> {5191, 57, -46, POS2}}|
+----------------------------+
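If you would rather stay in the untyped DataFrame API, a sketch using Spark SQL's transform_values (available since Spark 3.0) builds a true map of map directly. This assumes your DataFrame is named df and the column is literally named content-value:

import org.apache.spark.sql.functions.expr

// Rewrite each array value into an inner map keyed by value/x/y/pos.
// Backticks are needed in SQL because of the hyphen in the column name.
val dfFinal = df.withColumn("content-value",
  expr("transform_values(`content-value`, (k, v) -> map('value', v[0], 'x', v[1], 'y', v[2], 'pos', v[3]))"))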

How can I merge an array of maps of maps into a single map?

Current
+-------+---------------------------------------------------------------------------+
|ID |map |
+-------+---------------------------------------------------------------------------+
|105 |[{bia, {4 -> 1}}, {compton, {5 -> 1}}, {alcatraz, {3 -> 6}}] |
|106 |[{compton, {4 -> 5}}] |
|107 |[{compton, {5 -> 99}}] |
|108 |[{bia, {1 -> 5}}, {compton, {1 -> 1}}] |
|101 |[{alcatraz, {1 -> 2}}] |
|102 |[{alcatraz, {1 -> 2}}] |
|103 |[{alcatraz, {1 -> 2}}, {alcatraz, {2 -> 2}}, {alcatraz, {3 -> 2}}] |
|104 |[{alcatraz, {1 -> 4}}, {alcatraz, {2 -> 2}}, {alcatraz, {3 -> 2}}] |
+-------+---------------------------------------------------------------------------+
Desired
+-------+---------------------------------------------------------------------------+
|ID |map |
+-------+---------------------------------------------------------------------------+
|105 |{bia, {4 -> 1}}, {compton, {5 -> 1}}, {alcatraz, {3 -> 6}} |
|106 |{compton, {4 -> 5}} |
|107 |{compton, {5 -> 99}} |
|108 |{bia, {1 -> 5}}, {compton, {1 -> 1}} |
|101 |{alcatraz, {1 -> 2}} |
|102 |{alcatraz, {1 -> 2}} |
|103    |{alcatraz, {1 -> 2, 2 -> 2, 3 -> 2}}                                       |
|104    |{alcatraz, {1 -> 4, 2 -> 2, 3 -> 2}}                                       |
+-------+---------------------------------------------------------------------------+
I want the final map to be such that the first-level key is a location (e.g. alcatraz, bia, compton), the second-level key is a number representing a group, and the ultimate value is a count.
I got the current table by doing something like:
.groupBy(col("ID")).agg(collect_list(map($"LOCATION", map($"GROUP", $"COUNT"))) as "map")
JSON representation of the desired format, for more clarity:
{
  "alcatraz": {
    "1": 100,
    "2": 300
  },
  "bia": {
    "2": 767
  },
  "compton": {
    "1": 888,
    "2": 999,
    "3": 1000
  }
}
I've seen some other Stack Overflow posts about merging simple maps, but since this is a map of maps those solutions didn't work.
I've been playing with a udf but haven't had luck. Is there an easy way to accomplish my goal in Scala?
What I ended up doing is using this defined udf:
val joinMap = udf { values: Seq[Map[String, Map[String, Int]]] =>
  val newMap = scala.collection.mutable.Map[String, Map[String, Int]]()
  for (value <- values) {
    for ((k, v) <- value) {
      for ((sub_k, sub_v) <- v) {
        if (newMap.contains(k)) {
          // location already present: merge in this group -> count entry
          newMap(k) += (sub_k -> sub_v)
        } else {
          // first time this location is seen: start from its inner map
          newMap(k) = v
        }
      }
    }
  }
  newMap.toMap // return an immutable Map so Spark can encode the result
}
You can use this by doing:
.withColumn("map", joinMap(col("map")))
If you have cats in scope, you can do it this way:
def mergeMaps(values: Seq[Map[String, Map[String, Int]]]): Map[String, Map[String, Int]] = {
  import cats.implicits._
  values.foldLeft(Map.empty[String, Map[String, Int]])(_ |+| _)
}
(Note: cats offers even more compact syntax for that, but that would make it even less obvious what's happening under the hood.)
But beware: in case two "inner maps" have the same keys, the above would add the integer values, so Seq(Map("1" -> Map("1" -> 1)), Map("1" -> Map("1" -> 1))) would be merged to Map("1" -> Map("1" -> 2)). If that's a problem and you need the exact same behavior as in your solution, you could modify the solution like this:
def mergeMaps(values: Seq[Map[String, Map[String, Int]]]): Map[String, Map[String, Int]] = {
  import cats.implicits._
  // shadow the default Semigroup[Int] (addition) with "keep the right-hand value"
  implicit val takeRightIntSemigroup: cats.Semigroup[Int] = new cats.Semigroup[Int] {
    def combine(x: Int, y: Int): Int = y
  }
  values.foldLeft(Map.empty[String, Map[String, Int]])(_ |+| _)
}
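To plug either version into the aggregation above, you can wrap it in a udf (a sketch, assuming the "map" column produced by your groupBy):

import org.apache.spark.sql.functions.{col, udf}

val mergeMapsUdf = udf(mergeMaps _)
df.withColumn("map", mergeMapsUdf(col("map")))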

Combine the value part of a Tuple2, which is a map, into a single map, grouping by the key of the Tuple2

I am doing this in Scala and Spark.
I have a Dataset of Tuple2 as Dataset[(String, Map[String, String])].
Below is an example of the values in the Dataset.
(A, {1->100, 2->200, 3->100})
(B, {1->400, 4->300, 5->900})
(C, {6->100, 4->200, 5->100})
(B, {1->500, 9->300, 11->900})
(C, {7->100, 8->200, 5->800})
If you notice, the key (the first element of the Tuple) can be repeated. Also, the maps belonging to the same Tuple key can contain duplicate keys (in the second part of the Tuple2).
I want to create a final Dataset[(String, Map[String, String])]. The output should be as below (from the above example). Where maps for the same key share a map key, the value from the last occurrence is retained (check B and C) and the earlier value is removed.
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
Please let me know if any clarification is required.
Using an RDD:
val rdd = sc.parallelize(
  Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
      ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
      ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
      ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
      ("C", Map(7 -> 100, 8 -> 200, 5 -> 800)))
)

rdd.reduceByKey((a, b) => a ++ b).collect()
// Array((A,Map(1 -> 100, 2 -> 200, 3 -> 100)), (B,Map(5 -> 900, 1 -> 500, 9 -> 300, 11 -> 900, 4 -> 300)), (C,Map(5 -> 800, 6 -> 100, 7 -> 100, 8 -> 200, 4 -> 200)))
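This works because ++ on immutable maps keeps the right-hand operand's value for duplicate keys, which is exactly the last-wins behavior you want. A quick illustration on plain Scala maps:

val first  = Map(1 -> 400, 4 -> 300, 5 -> 900)
val second = Map(1 -> 500, 9 -> 300, 11 -> 900)
first ++ second
// Map(5 -> 900, 1 -> 500, 9 -> 300, 11 -> 900, 4 -> 300): key 1 takes second's value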
And using a DataFrame:
val df = spark.createDataFrame(
  Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
      ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
      ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
      ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
      ("C", Map(7 -> 100, 8 -> 200, 5 -> 800)))
).toDF("key", "map")

// allow duplicate map keys, keeping the last value (Spark 3.0+)
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df.withColumn("map", map_entries($"map"))
  .groupBy("key").agg(collect_list($"map").alias("map"))
  .withColumn("map", flatten($"map"))
  .withColumn("map", map_from_entries($"map"))
  .show(false)
+---+---------------------------------------------------+
|key|map |
+---+---------------------------------------------------+
|B |[1 -> 500, 4 -> 300, 5 -> 900, 9 -> 300, 11 -> 900]|
|C |[6 -> 100, 4 -> 200, 5 -> 800, 7 -> 100, 8 -> 200] |
|A |[1 -> 100, 2 -> 200, 3 -> 100] |
+---+---------------------------------------------------+
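The mapKeyDedupPolicy setting matters here: since Spark 3.0 the default policy is EXCEPTION, so map_from_entries fails as soon as the flattened entries contain a duplicate key. A minimal illustration:

// With the default EXCEPTION policy this throws a "Duplicate map key" error;
// with LAST_WIN it returns {1 -> "b"}.
spark.sql("SELECT map_from_entries(array(struct(1, 'a'), struct(1, 'b')))").show()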
Using DataFrames:
val df = Seq(("A", Map(1 -> 100, 2 -> 200, 3 -> 100)),
             ("B", Map(1 -> 400, 4 -> 300, 5 -> 900)),
             ("C", Map(6 -> 100, 4 -> 200, 5 -> 100)),
             ("B", Map(1 -> 500, 9 -> 300, 11 -> 900)),
             ("C", Map(7 -> 100, 8 -> 200, 5 -> 800))).toDF("a", "b")

val df2 = df.select('a, explode('b))
  .groupBy("a", "key")            // remove the duplicate keys
  .agg(last('value).as("value"))  // and take the last value for each duplicate key
  .groupBy("a")
  .agg(map_from_arrays(collect_list('key), collect_list('value)).as("b"))

df2.show()
which prints:
+---+---------------------------------------------------+
|a |b |
+---+---------------------------------------------------+
|B |[5 -> 900, 9 -> 300, 1 -> 500, 4 -> 300, 11 -> 900]|
|C |[6 -> 100, 8 -> 200, 7 -> 100, 4 -> 200, 5 -> 800] |
|A |[3 -> 100, 1 -> 100, 2 -> 200] |
+---+---------------------------------------------------+
As there are two aggregations involved, the RDD-based answer is likely to be faster. Note also that last is non-deterministic here: Spark does not guarantee row order within groups after a shuffle.

Scala - filter out null values in an Array[Map[String,Int]]

I have an Array[Map[String, Int]] like this:
val orArray = Array(Map("x" -> 24, "y" -> 25, "z" -> 26), null, Map("x" -> 11, "y" -> 22, "z" -> 33), null, Map("x" -> 111, "y" -> 222, "z" -> 333))
I want to remove the null elements in this array, to get something like:
Array[Map[String,Int]] = (Map("x" -> 24, "y" -> 25, "z" -> 26), Map("x" -> 11, "y" -> 22, "z" -> 33), Map("x" -> 111, "y" -> 222, "z" -> 333))
I was trying this so far:
orArray.filterNot(p => p.isEmpty)
But it throws a NullPointerException, since isEmpty gets called on the null elements. How could I filter out those two null values?
You can simply filter out the null values:
orArray.filter(map => map != null)
Output:
Map(x -> 24, y -> 25, z -> 26), Map(x -> 11, y -> 22, z -> 33), Map(x -> 111, y -> 222, z -> 333)
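An equivalent approach, if you would rather avoid an explicit null check, is to wrap each element in an Option and flatten; Option(null) is None, so the nulls simply drop out:

val cleaned = orArray.flatMap(Option(_))
// Array(Map(x -> 24, ...), Map(x -> 11, ...), Map(x -> 111, ...)) with the nulls removed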
Hope this helps!

Rolling up data with nested maps in Scala

New to programming in a more "functional" style. Normally I would write a series of nested foreach loops and use += on running totals.
I have a data structure that looks like:
Map(
  "team1" ->
    Map(
      "2015" -> Map("wins" -> 30, "losses" -> 5),
      "2016" -> Map("wins" -> 3, "losses" -> 7)
    ),
  "team2" ->
    Map(
      "2015" -> Map("wins" -> 22, "losses" -> 1),
      "2016" -> Map("wins" -> 17, "losses" -> 4)
    )
)
What I want is a data structure that simply throws away the year information and adds wins/losses together by team.
Map(
  "team1" -> Map("wins" -> 33, "losses" -> 12),
  "team2" -> Map("wins" -> 39, "losses" -> 5)
)
I've been looking at groupBy, but that only seems useful if I don't have this nested structure.
Any ideas? Or is the more traditional imperative/foreach approach preferable here?
myMap.map(i => i._1 -> i._2.values.flatMap(_.toList).groupBy(_._1).map(i => i._1 -> i._2.map(_._2).sum))
That is:
- get all the inner values
- flatMap them to a single list of (key, count) pairs
- group by key
- sum the grouped values
Define a custom method to add two Maps by key:
def addMap(x: Map[String, Int], y: Map[String, Int]) =
  x ++ y.map { case (k, v) => (k, v + x.getOrElse(k, 0)) }
m.mapValues(_.values.reduce(addMap(_, _)))
// res16: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Int]] =
// Map(team1 -> Map(wins -> 33, losses -> 12), team2 -> Map(wins -> 39, losses -> 5))
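For a single pair of inner maps, addMap combines the counts key by key:

addMap(Map("wins" -> 30, "losses" -> 5), Map("wins" -> 3, "losses" -> 7))
// Map(wins -> 33, losses -> 12)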
Using cats you could do:
import cats.implicits._
// or
// import cats.instances.map._
// import cats.instances.int._
// import cats.syntax.foldable._
teams.mapValues(_.combineAll)
// Map(
// team1 -> Map(wins -> 33, losses -> 12),
// team2 -> Map(wins -> 39, losses -> 5)
// )
combineAll combines the wins/losses maps of every year using a Monoid[Map[String, Int]] instance (also provided by the Cats library, see Monoid documentation), which sums the Ints for every key.
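For intuition, this is what that Monoid instance does for a single pair of year maps:

import cats.implicits._

Map("wins" -> 30, "losses" -> 5) |+| Map("wins" -> 3, "losses" -> 7)
// Map(wins -> 33, losses -> 12): values are summed per key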
Or, naming the source map myMap as above:
myMap.mapValues { _.toSeq
  .flatMap(_._2.toSeq)
  .groupBy(_._1)
  .mapValues(_.foldLeft(0)(_ + _._2)) }
scala> val sourceMap = Map(
| "team1" ->
| Map(
| "2015" -> Map("wins" -> 30, "losses" -> 5),
| "2016" -> Map("wins" -> 3, "losses" -> 7)
| ),
| "team2" ->
| Map(
| "2015" -> Map("wins" -> 22, "losses" -> 1),
| "2016" -> Map("wins" -> 17, "losses" -> 4)
| )
| )
sourceMap: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Int]]] = Map(team1 -> Map(2015 -> Map(wins -> 30, losses -> 5), 2016 -> Map(wins -> 3, losses -> 7)), team2 -> Map(2015 -> Map(wins -> 22, losses -> 1), 2016 -> Map(wins -> 17, losses -> 4)))
scala> sourceMap.map { case (team, innerMap) =>
| val outcomeGroups = innerMap.values.flatten.groupBy(_._1)
| team -> outcomeGroups.map { case (outcome, xs) =>
| val scores = xs.map(_._2).sum
| outcome -> scores
| }
| }
res0: scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Int]] = Map(team1 -> Map(losses -> 12, wins -> 33), team2 -> Map(losses -> 5, wins -> 39))

How to test whether all elements in Lists in Map[String,List[Int]] hold negatives?

How do I check whether all list elements are negative, returning false for a key if any value is positive and true otherwise?
scala> val checkNegative = Map(
| "A" -> List(-1205678557, -1206583677, -1208669605, -1205679913),
| "B" -> List(-396902501, -397202715, -396902501, -396902501, -396902501),
| "C" -> List(-397502289, -397502289, -397502289, -397502289, -397502289),
| "D" -> List(-33902725, -33902725, -412803077, -33902725),
| "E" -> List(-458008664, -30433317),
| "F" -> List(300244, 300244, 300244, -396901292, 300244)
| )
checkNegative: scala.collection.immutable.Map[String,List[Int]] = Map(E -> List(-458008664, -30433317), F -> List(300244, 300244, 300244, -396901292, 300244), A -> List(-1205678557, -1206583677, -1208669605, -1205679913), B -> List(-396902501, -397202715, -396902501, -396902501, -396902501), C -> List(-397502289, -397502289, -397502289, -397502289, -397502289), D -> List(-33902725, -33902725, -412803077, -33902725))
// How to get the value of `output`?
val output = Map("A" -> true, "B" -> true, "C" -> true, "D" -> true, "E" -> true, "F" -> false)
val output = checkNegative.mapValues(_.forall(_ < 0))
val output = for((key, value) <- checkNegative) yield (key, !value.exists(_ > 0))
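One subtlety: the two answers disagree when a list contains 0, since _.forall(_ < 0) returns false for a zero while !_.exists(_ > 0) returns true. Also, mapValues produces a lazy view (and is deprecated since Scala 2.13); a strict equivalent of the first answer:

// materializes a plain Map instead of a lazy mapValues view
val output = checkNegative.map { case (k, v) => k -> v.forall(_ < 0) }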