Spark dataframe to nested map - scala

How can I convert a rather small data frame in Spark (max 300 MB) to a nested map in order to improve Spark's DAG? I believe this operation will be quicker than a join later on (Spark's dynamic DAG is a lot slower than, and different from, a hard-coded DAG), as the transformed values were created during the train step of a custom estimator. Now I just want to apply them quickly during the predict step of the pipeline.
val inputSmall = Seq(
("A", 0.3, "B", 0.25),
("A", 0.3, "g", 0.4),
("d", 0.0, "f", 0.1),
("d", 0.0, "d", 0.7),
("A", 0.3, "d", 0.7),
("d", 0.0, "g", 0.4),
("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
I would rather want something like:
Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))

Edit: removed collect operation from final map
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map

val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)
val cols = inputToMap.columns
val localData = inputToMap.collect
cols.map { colName =>
colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
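Binding that expression to a value, here is a minimal sketch of how the nested map could then be applied at predict time without a join: broadcast the small map and resolve values in a UDF. The SparkSession name spark and the lookupCol1 UDF are illustrative assumptions, not part of the answer above.
import org.apache.spark.sql.functions.udf

val nested: Map[String, Map[String, Double]] = cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap

// Broadcast the small nested map and look values up per row instead of joining
val nestedBc = spark.sparkContext.broadcast(nested)
val lookupCol1 = udf((key: String) => nestedBc.value("column1").get(key))

inputSmall.withColumn("transformedCol1", lookupCol1($"column1")).show()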

I'm not sure I follow the motivation, but I think this is the transformation that would get you the result you're after:
import org.apache.spark.sql.Row

// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()
// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
.grouped(2)
.collect { case Array(k, v) => (k, v) }
.toList
// for each pair, get data and group it by left-column's value, choosing first match
val result: Map[String, Map[String, Double]] = columnPairs
.map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
.toMap
.mapValues(l => l.groupBy(_._1).map { case (c, l2) => l2.head })
result.foreach(println)
// prints:
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))

Related

Scala flatten a map with list as key and string as value

I have a peculiar case where I want to declare a simple configuration like so:
val config = List((("a", "b", "c"), ("first")),
(("d", "e"), ("second")),
(("f"), ("third")))
and at run time I would like to have a map that maps like
"a" -> "first"
"b" -> "first"
"c" -> "first"
"d" -> "second"
"e" -> "second"
"f" -> "third"
Using toMap, I was able to convert the config to a Map
scala> config.toMap
res42: scala.collection.immutable.Map[java.io.Serializable,String] = Map((a,b,c) -> first, (d,e) -> second, f -> third)
But I am not able to figure out how to flatten the tuple of keys into individual keys so that I get the final desired form. How do I solve this?
If you structure your config using List, the code is very simple:
val config = List(
(List("a", "b", "c"), ("first")),
(List("d", "e"), ("second")),
(List("f"), ("third")))
config.flatMap{ case (k, v) => k.map(_ -> v) }.toMap
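Evaluating that one-liner gives exactly the flat map from the question; a quick check using nothing beyond the standard library:
val config = List(
  (List("a", "b", "c"), "first"),
  (List("d", "e"), "second"),
  (List("f"), "third"))

val flat: Map[String, String] = config.flatMap { case (ks, v) => ks.map(_ -> v) }.toMap

assert(flat == Map(
  "a" -> "first", "b" -> "first", "c" -> "first",
  "d" -> "second", "e" -> "second", "f" -> "third"))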
You can try the solution below:
val config = List(
(("a", "b", "c"), ("first")),
(("d", "e"), ("second")),
(("f"), ("third")))
// Round-trip each key tuple through its toString form, strip the parentheses
// and split on commas to recover the individual keys.
val result = config.map { case (k, v) =>
  (k.toString().replace(")", "").replace("(", "").split(","), v)
}

val res = result.map { case (keys, value) =>
  keys.map(data => (data, value)).toList
}.flatten.toMap
If you change the config structure to something like below, the solution is much simpler:
val config1 = List (
(List("a", "b", "c"), "first"),
(List("d", "e"), "second"),
(List("f"), "third")
)
config1.flatMap{
case (k,v) => k.map{data => (data,v)}
}.toMap
I think the above answers are good practical answers. If you're in a situation where you have no control over the input and you're stuck with Tuples instead of Lists, I'd do it this way:
val result: Map[String, String] = config.flatMap {
case (s: String, v) => List(s -> v)
case (ks: Product, v) => ks.productIterator.collect { case s: String => s -> v }
case _ => Nil //Prevent throwing
}.toMap
This will throw away anything that's not a String in the keys.
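One detail worth spelling out: in the original config, ("f") is not a one-element tuple; the parentheses are just grouping, so the key is a bare String. That is exactly why the case (s: String, v) branch is needed alongside the Product branch:
val single = ("f")
single.isInstanceOf[String]          // true: a plain String, not a tuple
Tuple1("f").productIterator.toList   // List(f): only an explicit Tuple1 is a Product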
By using built-in Spark SQL functions:
import org.apache.spark.sql.functions

val config = List((Array("a", "b", "c"), "first"),
  (Array("d", "e"), "second"),
  (Array("f"), "third")).toDF(List("col1", "col2"): _*)
config.withColumn("exploded", functions.explode_outer($"col1")).drop("col1").show()

How to find percentage from value of a (key,value) pair?

I want to find a percentage from the value of a (key, value) pair which is stored in a map.
For example: Map('a' -> 10, 'b' -> 20). I need to find the percentage occurrence of 'a' and 'b'.
Adding to Thilo's answer, you can try the code below. The final result will again be a Map[String, Double].
val map = Map("a" -> 10.0, "b" -> 20.0)
val total = map.values.sum
val mapWithPerc = map.mapValues(x => (x * 100) / total)
println(mapWithPerc)
//prints Map(a -> 33.333333333333336, b -> 66.66666666666667)
def mapToPercentage(key: String)(implicit map: Map[String, Double]) = {
val valuesSum = map.values.sum
(map(key) * 100) / valuesSum
}
implicit val m: Map[String, Double] = Map("a" -> 10, "b" -> 20, "c" -> 30)
println(mapToPercentage("a")) // 16.666666666666668
println(mapToPercentage("b")) // 33.333333333333336
println(mapToPercentage("c")) // 50
See demo here
Note: there is absolutely no need to curry the function parameters or make the map implicit. I just think it looks nicer in this example. Something like def mapToPercentage(key: String, map: Map[String, Double]) = {...} and mapToPercentage("a", m) is also perfectly valid. That being said, if you want to get even fancier:
implicit class MapToPercentage (map: Map[String, Double]) {
def getPercentage(key: String) = {
val valuesSum = map.values.sum
(map(key) * 100) / valuesSum
}
}
val m: Map[String, Double] = Map("a" -> 10, "b" -> 20, "c" -> 30)
println(m.getPercentage("a")) // 16.666666666666668
println(m.getPercentage("b")) // 33.333333333333336
println(m.getPercentage("c")) // 50
See demo here
Point being, the logic behind getting the percentage can be written a few ways:
(map(key) * 100) / valuesSum // get the value corresponding to a given key,
// multiply by 100, divide by the total sum of all values
// - throws NoSuchElementException if the key doesn't exist
(map.getOrElse(key, 0D) * 100) / valuesSum // safer than the above, will not fail
// if the key doesn't exist
map.get(key).map(_ * 100 / valuesSum) // returns None if the key doesn't exist
// and Some(result) if it does
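To see the three behaviours side by side on a small map (plain Scala, illustrative values only):
val m = Map("a" -> 10.0, "b" -> 20.0)
val valuesSum = m.values.sum              // 30.0

m("a") * 100 / valuesSum                  // 33.333333333333336
m.getOrElse("z", 0D) * 100 / valuesSum    // 0.0 - missing key falls back to the default
m.get("z").map(_ * 100 / valuesSum)       // None - missing key short-circuits
m.get("b").map(_ * 100 / valuesSum)       // Some(66.66666666666667)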
val map = Map('a' -> 10, 'b' -> 20)
val total = map.values.sum
map.get('a').map(_ * 100 / total) // gives Some(33)
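The Some(33) rather than Some(33.3...) is integer division: the values here are Ints, so the result is truncated. Converting the total to a Double first keeps the fractional part; the same snippet with that one change:
val map = Map('a' -> 10, 'b' -> 20)
val total = map.values.sum.toDouble
map.get('a').map(_ * 100 / total) // gives Some(33.333333333333336)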

Merge scala map and update common key based on condition

I wrote the following code to merge maps and update common keys. Is there a better way to write this?
case class Test(index: Int, min: Int, max: Int, aggMin: Int, aggMax: Int)
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
val intersect: Map[Int, Test] = oldMap.keySet.intersect(newMap.keySet)
.map(indexKey => indexKey -> (Test(newMap(indexKey).index, newMap(indexKey).min, newMap(indexKey).max,
oldMap(indexKey).aggMin.min(newMap(indexKey).aggMin), oldMap(indexKey).aggMax.max(newMap(indexKey).aggMax)))).toMap
val merge = (oldMap ++ newMap ++ intersect)
merge
}
Here is my test case
it("test my case"){
val oldMap = Map(10 -> Test(10, 1, 2, 1, 2), 25 -> Test(25, 3, 4, 3, 4), 46 -> Test(46, 3, 4, 3, 4), 26 -> Test(26, 1, 2, 1, 2))
val newMap = Map(32 -> Test(32, 5, 6, 5, 6), 26 -> Test(26, 5, 6, 5, 6))
val result = mergeMaps(oldMap, newMap)
//Total element count should be map 1 elements + map 2 elements, with the common key counted only once
assert(result.size == 5)
//Common key element aggMin and aggMax should be updated, keep min aggMin and max aggMax from 2 common key elements and keep min and max of second map key
assert(result.get(26).get.aggMin == 1)//min aggMin -> min(1,5)
assert(result.get(26).get.aggMax == 6)//max aggMax -> max(2,6)
assert(result.get(26).get.min == 5)// 5 from second map
assert(result.get(26).get.max == 6)//6 from second map
}
Here's a slightly different take on a solution.
def mergeMaps(oldMap :Map[Int,Test], newMap :Map[Int,Test]) :Map[Int,Test] =
(oldMap.values ++ newMap.values)
.groupBy(_.index)
.map{ case (k,v) =>
k -> v.reduceLeft((a,b) =>
Test(k, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax))
}
I could have followed the groupBy() with mapValues() instead of map() but that doesn't result in a pure Map.
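Concretely, in Scala 2.12 mapValues returns a lazy wrapper that re-applies the function on every lookup (and in 2.13 it is deprecated in favour of .view.mapValues), so the variant below forces the result back into a strict Map with map(identity). This is only a sketch of that alternative, reusing the Test case class from the question:
def mergeMapsWithMapValues(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] =
  (oldMap.values ++ newMap.values)
    .groupBy(_.index)
    .mapValues(_.reduceLeft((a, b) =>
      Test(a.index, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax)))
    .map(identity) // force the lazy mapValues result into a strict Map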
Another version to do the same task.
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
(newMap ++ oldMap).map(key => {
val _newMapData = newMap.get(key._1)
if (_newMapData.isDefined) {
val _newMapDataValue = _newMapData.get
val oldMapValue = key._2
val result = Test(_newMapDataValue.index, _newMapDataValue.min, _newMapDataValue.max,
oldMapValue.aggMin.min(_newMapDataValue.aggMin), oldMapValue.aggMax.max(_newMapDataValue.aggMax))
(key._1 -> result)
} else (key._1 -> key._2)
})
}

How can I create a mapping function for 1000 elements in scala

I've manually created a mapping function (map) to map some elements of an RDD.
val rdd = sc.parallelize(List(
LabeledPoint(5.0, Vectors.dense(1,2)),
LabeledPoint(12.0, Vectors.dense(1,3)),
LabeledPoint(20.0, Vectors.dense(-1,4))))
val map = Map(5 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)
val trainingData = rdd.map{
case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}
How can I create a mapping function for 1000 elements? These 1000 elements are in an RDD called distinct.
I have tried this
var i=0
val map=distinct.map
{ x=>Map(x -> i)
i=i+1
}
But this cannot be used as a mapping function for
val trainingData = rdd.map {
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}
zipWithIndex and then collectAsMap:
val distinct = sc.parallelize(Seq(5, 12, 20))
distinct.zipWithIndex.collectAsMap
// res2: scala.collection.Map[Int,Long] = Map(20 -> 2, 5 -> 0, 12 -> 1)
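A sketch of how the collected map could then be plugged back into the question's rdd.map: LabeledPoint labels are Doubles, so the Long index is converted, and the map keys are kept as Doubles to match the categories. Assumes the usual sc SparkContext and the spark.mllib LabeledPoint/Vectors from the question.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

val distinct = sc.parallelize(Seq(5.0, 12.0, 20.0))
val labelIndex: Map[Double, Long] = distinct.zipWithIndex.collectAsMap().toMap

val rdd = sc.parallelize(List(
  LabeledPoint(5.0, Vectors.dense(1, 2)),
  LabeledPoint(12.0, Vectors.dense(1, 3)),
  LabeledPoint(20.0, Vectors.dense(-1, 4))))

val trainingData = rdd.map {
  case LabeledPoint(category, features) =>
    LabeledPoint(labelIndex(category).toDouble, features)
}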

Scala. How to create a general method that accepts tuples with different arities?

In my application there are many places where I need to take a list of tuples, group it by the first element of the tuple, and drop that element from the rest. For example, given the tuples
(1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"), the result should be Map(1 -> List(("Joe", "Account"), ("Tom", "Employer")), 2 -> List(("John", "Account")))
It is easily implemented as
data.groupBy(_._1).map { case (k, v) => k -> v.map(f => (f._2, f._3)) }
But I am looking for a general solution, because I can have tuples of different arities: 2, 3, 4 or even 7.
I think Shapeless or Scalaz can help me, but my experience with those libraries is limited, so please point me to some example.
This is easily implemented using shapeless (for simplicity, I won't be generalizing it for all collection types). There is a specific type class for tuples, called IsComposite, that can deconstruct them into head and tail:
import shapeless.ops.tuple.IsComposite
def groupTail[P, H, T](tuples: List[P])(
implicit ic: IsComposite.Aux[P, H, T]): Map[H, List[T]] = {
tuples
.groupBy(ic.head)
.map { case (k, vs) => (k, vs.map(ic.tail)) }
}
This works for your case:
val data =
List((1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"))
assert {
groupTail(data) == Map(
1 -> List(("Joe", "Account"), ("Tom", "Employer")),
2 -> List(("John", "Account"))
)
}
As well as for Tuple4 of different types:
val data2 = List((1, 1, "a", 'a), (1, 2, "b", 'b), (2, 1, "a", 'b))
assert {
groupTail(data2) == Map(
1 -> List((1, "a", 'a), (2, "b", 'b)),
2 -> List((1, "a", 'b))
)
}
Runnable code is available at Scastie