In my application there are many places where I need to take a list of tuples, group it by the first element of each tuple, and remove that element from the rest. For example, given the tuples
(1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"), the result should be Map(1 -> List(("Joe", "Account"), ("Tom", "Employer")), 2 -> List(("John", "Account")))
It is easily implemented as
data.groupBy(_._1).map { case (k, v) => k -> v.map(f => (f._2, f._3)) }
But I am looking for a general solution, because I can have tuples of different arities: 2, 3, 4 or even 7.
I think Shapeless or Scalaz can help here, but my experience with those libraries is limited, so please point me to an example.
This is easily implemented using shapeless (for simplicity, I won't generalize it to all collection types). There is a specific type class for tuples that can deconstruct them into head and tail, called IsComposite:
import shapeless.ops.tuple.IsComposite
def groupTail[P, H, T](tuples: List[P])(
implicit ic: IsComposite.Aux[P, H, T]): Map[H, List[T]] = {
tuples
.groupBy(ic.head)
.map { case (k, vs) => (k, vs.map(ic.tail)) }
}
This works for your case:
val data =
List((1, "Joe", "Account"), (1, "Tom", "Employer"), (2, "John", "Account"))
assert {
groupTail(data) == Map(
1 -> List(("Joe", "Account"), ("Tom", "Employer")),
2 -> List(("John", "Account"))
)
}
As well as for a Tuple4 of different types:
val data2 = List((1, 1, "a", 'a), (1, 2, "b", 'b), (2, 1, "a", 'b))
assert {
groupTail(data2) == Map(
1 -> List((1, "a", 'a), (2, "b", 'b)),
2 -> List((1, "a", 'b))
)
}
Runnable code is available at Scastie
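If you later do need this for collections other than List, a minimal sketch of a generalization (the name groupTailSeq is mine, and this assumes Scala 2-style eta-expansion of ic.head, just like the List version above):
import shapeless.ops.tuple.IsComposite

// same idea as groupTail, but accepts any Seq and keeps the result as a Seq
def groupTailSeq[P, H, T](tuples: Seq[P])(
    implicit ic: IsComposite.Aux[P, H, T]): Map[H, Seq[T]] =
  tuples.groupBy(ic.head).map { case (k, vs) => (k, vs.map(ic.tail)) }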
I have a peculiar case where I want to declare simple configuration like so
val config = List((("a", "b", "c"), ("first")),
(("d", "e"), ("second")),
(("f"), ("third")))
which at run time, I would like to have a map, which maps like
"a" -> "first"
"b" -> "first"
"c" -> "first"
"d" -> "second"
"e" -> "second"
"f" -> "third"
Using toMap, I was able to convert the config to a Map
scala> config.toMap
res42: scala.collection.immutable.Map[java.io.Serializable,String] = Map((a,b,c) -> first, (d,e) -> second, f -> third)
But I am not able to figure out how to flatten the grouped keys into individual keys so I get the final desired form. How do I solve this?
If you structure your config using List the code is very simple:
val config = List(
(List("a", "b", "c"), ("first")),
(List("d", "e"), ("second")),
(List("f"), ("third")))
config.flatMap{ case (k, v) => k.map(_ -> v) }.toMap
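For reference, evaluating that expression gives the map from the question (entry order in the printed Map may differ):
// Map(a -> first, b -> first, c -> first, d -> second, e -> second, f -> third)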
You can try the solution below:
val config = List(
(("a", "b", "c"), ("first")),
(("d", "e"), ("second")),
(("f"), ("third")))
val result = config.map {
  case (k, v) =>
    // strip the parentheses from the tuple's toString and split on commas
    // to recover the individual keys as an Array[String]
    (k.toString().replace(")", "")
      .replace("(", "")
      .split(","), v)
}
val res = result.map {
case (key,value) => key.map{ data =>
(data,value)
}.toList
}.flatten.toMap
If you change the config structure to something like the one below, the solution is much simpler:
val config1 = List (
(List("a", "b", "c"), "first"),
(List("d", "e"), "second"),
(List("f"), "third")
)
config1.flatMap{
case (k,v) => k.map{data => (data,v)}
}.toMap
I think the above answers are good practical answers. If you're in a situation where you have no control over the input and you're stuck with Tuples instead of Lists, I'd do it this way:
val result: Map[String, String] = config.flatMap {
case (s: String, v) => List(s -> v)
case (ks: Product, v) => ks.productIterator.collect { case s: String => s -> v }
case _ => Nil //Prevent throwing
}.toMap
This will throw away anything that's not a String in the keys.
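With the tuple-based config from the question this also produces the expected flattened map (entry order may differ):
// result == Map(a -> first, b -> first, c -> first, d -> second, e -> second, f -> third)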
By using built-in Spark SQL functions:
import org.apache.spark.sql.functions
import spark.implicits._ // assumes a SparkSession named spark is in scope (needed for toDF and $)

val config = List((Array("a", "b", "c"), ("first")),
(Array("d", "e"), ("second")),
(Array("f"), ("third"))).toDF(List("col1","col2") : _*)
config.withColumn("exploded",functions.explode_outer($"col1")).drop("col1").show()
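If the goal is an in-memory Map rather than a DataFrame, one could also collect the exploded frame; a minimal sketch, assuming the same config DataFrame as above:
val exploded = config.withColumn("exploded", functions.explode_outer($"col1")).drop("col1")
val asMap = exploded.collect().map(r => r.getAs[String]("exploded") -> r.getAs[String]("col2")).toMap
// Map(a -> first, b -> first, c -> first, d -> second, e -> second, f -> third)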
I wrote the following code to merge two maps and update the common keys. Is there a better way to write this?
case class Test(index: Int, min: Int, max: Int, aggMin: Int, aggMax: Int)
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
val intersect: Map[Int, Test] = oldMap.keySet.intersect(newMap.keySet)
.map(indexKey => indexKey -> (Test(newMap(indexKey).index, newMap(indexKey).min, newMap(indexKey).max,
oldMap(indexKey).aggMin.min(newMap(indexKey).aggMin), oldMap(indexKey).aggMax.max(newMap(indexKey).aggMax)))).toMap
val merge = (oldMap ++ newMap ++ intersect)
merge
}
Here is my test case
it("test my case"){
val oldMap = Map(10 -> Test(10, 1, 2, 1, 2), 25 -> Test(25, 3, 4, 3, 4), 46 -> Test(46, 3, 4, 3, 4), 26 -> Test(26, 1, 2, 1, 2))
val newMap = Map(32 -> Test(32, 5, 6, 5, 6), 26 -> Test(26, 5, 6, 5, 6))
val result = mergeMaps(oldMap, newMap)
//Total element count should be the number of distinct keys across both maps (4 + 2 - 1 common = 5)
assert(result.size == 5)
//For a common key, keep the min aggMin and max aggMax across both maps, and take min and max from the second map's entry
assert(result.get(26).get.aggMin == 1)//min aggMin -> min(1,5)
assert(result.get(26).get.aggMax == 6)//max aggMax -> max(2,6)
assert(result.get(26).get.min == 5)// 5 from second map
assert(result.get(26).get.max == 6)//6 from second map
}
Here's a slightly different take on a solution.
def mergeMaps(oldMap :Map[Int,Test], newMap :Map[Int,Test]) :Map[Int,Test] =
(oldMap.values ++ newMap.values)
.groupBy(_.index)
.map{ case (k,v) =>
k -> v.reduceLeft((a,b) =>
Test(k, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax))
}
I could have followed the groupBy() with mapValues() instead of map(), but mapValues returns a lazy view over the values rather than a strict Map.
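For completeness, a mapValues variant would need an explicit .toMap at the end (in Scala 2.13 mapValues returns a lazy MapView); the name mergeMapsView is mine:
def mergeMapsView(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] =
  (oldMap.values ++ newMap.values)
    .groupBy(_.index)
    .mapValues(_.reduceLeft((a, b) =>
      Test(b.index, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax)))
    .toMap // forces the lazy view into a strict Map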
Another version to do the same task.
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
(newMap ++ oldMap).map(key => {
val _newMapData = newMap.get(key._1)
if (_newMapData.isDefined) {
val _newMapDataValue = _newMapData.get
val oldMapValue = key._2
val result = Test(_newMapDataValue.index, _newMapDataValue.min, _newMapDataValue.max,
oldMapValue.aggMin.min(_newMapDataValue.aggMin), oldMapValue.aggMax.max(_newMapDataValue.aggMax))
(key._1 -> result)
} else (key._1 -> key._2)
})
}
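Either version passes the test above; for the common key 26 the merged entry keeps min/max from the new map and widens the aggregates:
mergeMaps(oldMap, newMap).get(26)
// Some(Test(26, 5, 6, 1, 6))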
How can I convert a rather small DataFrame in Spark (max 300 MB) to a nested map in order to improve Spark's DAG? I believe this operation will be quicker than a join later on (Spark's dynamic DAG is a lot slower than a hard-coded one), as the transformed values were created during the train step of a custom estimator. Now I just want to apply them quickly during the predict step of the pipeline.
val inputSmall = Seq(
("A", 0.3, "B", 0.25),
("A", 0.3, "g", 0.4),
("d", 0.0, "f", 0.1),
("d", 0.0, "d", 0.7),
("A", 0.3, "d", 0.7),
("d", 0.0, "g", 0.4),
("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives the wrong type of map
val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
I would rather have something like:
Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
Edit: removed collect operation from final map
If you are using Spark 2+, here's a suggestion:
import org.apache.spark.sql.functions.map // the map() column function used below

val inputToMap = inputSmall.select(
map($"column1", $"transformedCol1").as("column1"),
map($"column2", $"transformedCol2").as("column2")
)
val cols = inputToMap.columns
val localData = inputToMap.collect
cols.map { colName =>
colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
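Against the sample data this should yield roughly (Map ordering aside):
// Map(column1 -> Map(A -> 0.3, d -> 0.0, c -> 0.2),
//     column2 -> Map(B -> 0.25, g -> 0.4, d -> 0.7, f -> 0.1))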
I'm not sure I follow the motivation, but I think this is the transformation that would get you the result you're after:
// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()
// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
.grouped(2)
.collect { case Array(k, v) => (k, v) }
.toList
// for each pair, get data and group it by left-column's value, choosing first match
val result: Map[String, Map[String, Double]] = columnPairs
.map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
.toMap
.mapValues(l => l.groupBy(_._1).map { case (c, l2) => l2.head })
result.foreach(println)
// prints:
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))
Let us consider that I have a collection of employees as a List of tuples, where t._1 is the department id, t._2 is the salary, and t._3 is the employee's name.
val employees = List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following, but I am getting an error:
val maxSal1 = employees.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
k._2.foldLeft(0, "dummy")((aa, bb) => {
if (aa._1 > bb._1) aa else bb
})
})
Don't overcomplicate things: avoid unnecessary operations and don't carry redundant information around. Just be explicit, and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))
scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res5: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
As a side note, I'd use a case class Employee with 3 fields rather than a tuple; I believe it's more readable.
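A minimal sketch of that case class variant (the field names are mine):
case class Employee(dept: Int, salary: Int, name: String)

val staff = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

staff.groupBy(_.dept).values.map(_.maxBy(_.salary))
// roughly: List(Employee(2,5000,Pam), Employee(4,999,Robert), Employee(1,9999,Tom))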
I tried this option and it seems to give the right result:
val maxsal1 = employees.map(t => (t._1, t._2, t._3)).groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
If I have a list like this in Scala:
val list = List(
Map("val1" -> 1, "val2" -> 2),
Map("val1" -> 3, "val2" -> 4),
Map("val1" -> 5, "val2" -> 6),
Map("val1" -> 7, "val2" -> 8)
)
And I'd like to create another list whose elements match a certain condition, like this:
val newList = list map { el /*match (el("val1") < 5) here*/ =>
el /*if condition is met, add element to new list*/
}
Then the result would be something like this:
List(
Map("val1" -> 1, "val2" -> 2),
Map("val1" -> 3, "val2" -> 4)
)
Is something like this possible, and if so, how? I'd like to make this work from a functional programming perspective.
Use list.filter:
val filteredList = list.filter(_("val1") < 5)
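If you prefer a single pass that combines the predicate with a transformation (as in the map-based sketch from the question), collect is an equivalent alternative:
val filteredList = list.collect { case el if el("val1") < 5 => el }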