Spark Scala 26035582 Re-visited - scala

Whilst I understand the outcome here, I cannot see how the highlighted aspects work.
Please enlighten me:
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g")
val splitSize = 10000 // specify some number of elements that fit in memory.
val numSplits = (input.size / splitSize) + 1 // has to be > 0.
val groups = sc.parallelize(input, numSplits) // specify the # of splits.
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val result = groups.filter(isHeavy)

weights is a map keyed by strings
scala> weights
res13: scala.collection.immutable.Map[String,Int] = Map(e -> 9, f -> 4, a -> 3, b -> 2, g -> 6, c -> 5, d -> 1)
inp.split(",") will split the string, and the map function iterates over those keys, converting each into the value of the weights map for the respective key.
The underscore is a scala shortcut and can be written as such
inp.split(",").map(x => weights(x))
In other words, val input = List("a,b,c,d") becomes a list of numbers (3,2,5,1), which then get summed, and filtered out for those more than 12
For example,
scala> input.foreach(x => println(x.split(",").mkString))
abcd
bce
acd
eg
scala> input.foreach(x => println(x.split(",").map(weights(_)).mkString(",")))
3,2,5,1
2,5,9
3,5,1
9,6
scala> input.foreach(x => println(x.split(",").map(weights(_)).sum))
11
16
9
15
scala> input.foreach(x => {
| val sum = x.split(",").map(weights(_)).sum
| if (sum > 12) println(sum)
| })
16
15
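For completeness, here is the final step as a plain (non-Spark) sketch: applying the same predicate with an ordinary filter keeps just the heavy strings, which is the local equivalent of what the result RDD contains.
input.filter(isHeavy)
// List(b,c,e, e,g) -- only the strings whose weights sum past 12 remain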

Related

Find common elements in a map of sequences - scala

I have something like this:
val myMap: Map[Int, Seq[Int]] = Map(1 -> Seq(1, 2, 3), 2 -> Seq(2, 3, 4), 3 -> Seq(3, 4, 5), 4 -> Seq(4, 5, 6))
I am trying to find a way to relate all the keys and their common elements in the sequence they are mapped to.
For example:
1 and 2 share (2, 3)
1 and 3 share (3)
2 and 3 share (3, 4)
2 and 4 share (4)
3 and 4 share (4, 5)
I suspect I need to use intersect, but I am not sure how to go about the problem. I am brand new to Scala and functional programming and need a little help getting started on this. I know there are probably easier ways to do this with Spark; however, I am trying to stick to plain Scala.
Any help is greatly appreciated!
Here's one way using flatMap and collect to generate the shared values from every combination of the key pairs via intersect:
val myMap: Map[Int, List[Int]] = Map(
  1 -> List(1, 2, 3), 2 -> List(2, 3, 4), 3 -> List(3, 4, 5), 4 -> List(4, 5, 6)
)
val keys = myMap.keys.toList
keys.flatMap{ i =>
  keys.collect{
    case j if j > i => (i, j, myMap(i) intersect myMap(j))
  }
}
// res1: List[(Int, Int, List[Int])] = List(
// (1,2,List(2, 3)),
// (1,3,List(3)),
// (1,4,List()),
// (2,3,List(3, 4)),
// (2,4,List(4)),
// (3,4,List(4, 5))
// )
The above is essentially the same as the following for comprehension:
for {
  i <- keys
  j <- keys
  if j > i
} yield (i, j, myMap(i) intersect myMap(j))
How do you want the results returned? Do you just want to print them to STDOUT?
myMap.keys.toList.combinations(2).foreach{ case List(a, b) =>
  println(s"$a,$b --> ${myMap(a) intersect myMap(b)}")
}
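With the four-entry map above (small Scala maps iterate in insertion order), this prints:
1,2 --> List(2, 3)
1,3 --> List(3)
1,4 --> List()
2,3 --> List(3, 4)
2,4 --> List(4)
3,4 --> List(4, 5)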
Pretty similar to @jwvh's solution, but with fewer lookups in the map, in case it is big:
val myMap: Map[Int, Seq[Int]] = Map(
  1 -> Seq(1, 2, 3), 2 -> Seq(2, 3, 4), 3 -> Seq(3, 4, 5), 4 -> Seq(4, 5, 6)
)

myMap.toList.combinations(2).foreach {
  case List((i1, s1), (i2, s2)) =>
    val ints = s1.intersect(s2)
    if (ints.nonEmpty) {
      println(s"$i1 and $i2 share (${ints.mkString(", ")})")
    }
  case _ => ???
}
Code run at Scastie.
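If the pairs should be collected rather than printed, a small variant (a sketch, not from the answers above) returns them as a Map keyed by the key pair, with empty intersections dropped by the guard:
myMap.toList.combinations(2).collect {
  case List((i1, s1), (i2, s2)) if s1.intersect(s2).nonEmpty =>
    (i1, i2) -> s1.intersect(s2)
}.toMap
// e.g. (1,2) -> List(2, 3), (2,3) -> List(3, 4), ...; (1,4) is dropped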

Compute the maximum length assigned to each element using scala

For example, this is the content in a file:
20,1,helloworld,alaaa
2,3,world,neww
1,223,ala,12341234
Desired output:
0 -> 2
1 -> 3
2 -> 10
3 -> 8
I want to find the maximum length attained in each column.
This approach extends to any number of columns. First, read the file as a DataFrame:
val df = spark.read.csv("path")
Then create a SQL expression for each column and evaluate it with expr (this assumes import org.apache.spark.sql.functions.{array, expr} and import spark.implicits._ are in scope):
val cols = df.columns.map(c => s"max(length(cast($c as String)))").map(expr(_))
Select the new columns as an array and convert to a Map:
df.select(array(cols:_*)).as[Seq[Int]].collect()
  .head
  .zipWithIndex.map(_.swap)
  .toMap
This should give you the desired Map.
Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Update:
OP's example suggests that all rows will have the same number of columns.
The idea in this answer is to use Spark SQL with max(length()) on the DataFrame columns.
You can do:
val xx = Seq(
  ("20", "1", "helloworld", "alaaa"),
  ("2", "3", "world", "neww"),
  ("1", "223", "ala", "12341234")
).toDF("a", "b", "c", "d")

xx.registerTempTable("yy") // deprecated; createOrReplaceTempView is the modern equivalent

spark.sql("select max(length(a)), max(length(b)), max(length(c)), max(length(d)) from yy")
// the query yields a single row: 2, 3, 10, 8
I would recommend using RDD's aggregate method:
val rdd = sc.textFile("/path/to/textfile").
  map(_.split(","))
// rdd.collect: Array[Array[String]] = Array(
//   Array(20, 1, helloworld, alaaa), Array(2, 3, world, neww), Array(1, 223, ala, 12341234)
// )
val seqOp = (m: Array[Int], r: Array[String]) =>
  (r zip m).map( t => Seq(t._1.length, t._2).max )

val combOp = (m1: Array[Int], m2: Array[Int]) =>
  (m1 zip m2).map( t => Seq(t._1, t._2).max )
val size = rdd.collect.head.size // rdd.first.size would avoid collecting the whole RDD

rdd.
  aggregate( Array.fill[Int](size)(0) )( seqOp, combOp ).
  zipWithIndex.map(_.swap).
  toMap
// res2: scala.collection.immutable.Map[Int,Int] = Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)
Note that aggregate takes:
an array of 0's (of size equal to rdd's row size) as the initial value,
a function seqOp for calculating maximum string lengths within a partition, and
another function combOp to combine results across partitions for the final maximum values.
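For comparison, a shorter sketch of the same idea (assuming, as the OP's example suggests, that every row has the same number of columns): map each row to its column lengths, then reduce with an element-wise max.
val maxLens = rdd.
  map(_.map(_.length)).
  reduce( (a, b) => (a zip b).map { case (x, y) => x max y } )

maxLens.zipWithIndex.map(_.swap).toMap
// Map(0 -> 2, 1 -> 3, 2 -> 10, 3 -> 8)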

How do I do a Map comprehension with Scala?

With Python, I can do something like
listOfLists = [('a', -1), ('b', 0), ('c', 1)]
my_dict = {foo: bar for foo, bar in listOfLists}
my_dict == {'a': -1, 'b': 0, 'c': 1} => True
I know this as a dictionary comprehension. When I look for this operation with Scala, I find this incomprehensible document (pun intended).
Is there an idiomatic way to do this with Scala?
Bonus question: Can I filter with this operation as well like my_dict = {foo: bar for foo, bar in listOfLists if bar > 0}?
First, let's parse your Python code to figure out what it's doing.
my_dict = {
    foo: bar          <-- Key, value names
    for foo, bar      <-- Destructuring a list
    in listOfLists    <-- This is where they came from
}
So you can see that even in this very short example there's actually considerable redundancy and plenty of potential for failure if listOfLists isn't actually what it says it is.
If listOfLists actually is a list of pairs (key, value), then in Scala it's trivial:
listOfPairs.toMap
If, on the other hand, it really is lists, and you want to pull off the first one to make the key and save the rest as a value, it would be something like
listOfLists.map(x => x.head -> x.tail).toMap
You can select some of them by using collect instead. For instance, maybe you only want the lists of length 2 (you could add if x.head > 0 to get your example), in which case you can write
listOfLists.collect{
  case x if x.length == 2 => x.head -> x.last
}.toMap
or if it is literally a List, you could also
listOfLists.collect{
  case key :: value :: Nil => key -> value
}.toMap
I'll compare comprehensions in Scala 2.x and Python 3.x.
1. Sequence
In python:
xs = [x*x for x in range(5)]
#xs = [0, 1, 4, 9, 16]
ys = list(map(lambda x: x*x, range(5)))
#ys = [0, 1, 4, 9, 16]
In Scala:
scala> val xs = for(x <- 0 until 5) yield x*x
xs: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 4, 9, 16)
scala> val ys = (0 until 5) map (x => x*x)
ys: scala.collection.immutable.IndexedSeq[Int] = Vector(0, 1, 4, 9, 16)
Or you really want a list:
scala> import collection.breakOut
scala> val xs: List[Int] = (for(x <- 0 until 5) yield x*x)(breakOut)
xs: List[Int] = List(0, 1, 4, 9, 16)
scala> val ys: List[Int] = (0 until 5).map(x => x*x)(breakOut)
ys: List[Int] = List(0, 1, 4, 9, 16)
scala> val zs = (for(x <- 0 until 5) yield x*x).toList
zs: List[Int] = List(0, 1, 4, 9, 16)
2. Set
In Python
s1 = { x//2 for x in range(10) }
#s1 = {0, 1, 2, 3, 4}
s2 = set(map(lambda x: x//2, range(10)))
#s2 = {0, 1, 2, 3, 4}
In Scala
scala> val s1 = (for(x <- 0 until 10) yield x/2).toSet
s1: scala.collection.immutable.Set[Int] = Set(0, 1, 2, 3, 4)
scala> val s2: Set[Int] = (for(x <- 0 until 10) yield x/2)(breakOut)
s2: Set[Int] = Set(0, 1, 2, 3, 4)
scala> val s3: Set[Int] = (0 until 10).map(_/2)(breakOut)
s3: Set[Int] = Set(0, 1, 2, 3, 4)
scala> val s4 = (0 until 10).map(_/2).toSet
s4: scala.collection.immutable.Set[Int] = Set(0, 1, 2, 3, 4)
3. Dict
In Python:
pairs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
d1 = {k: v*2 for k, v in pairs}
#d1 = {1: 'aa', 2: 'bb', 3: 'cc', 4: 'dd'}
d2 = dict([(k*2, v) for k, v in pairs])
#d2 = {2: 'a', 4: 'b', 6: 'c', 8: 'd'}
In Scala
scala> val pairs = Seq(1->"a", 2->"b", 3->"c", 4->"d")
pairs: Seq[(Int, String)] = List((1,a), (2,b), (3,c), (4,d))
scala> val d1 = (for((k, v) <- pairs) yield (k, v*2)).toMap
d1: scala.collection.immutable.Map[Int,String] = Map(1 -> aa, 2 -> bb, 3 -> cc, 4 -> dd)
scala> val d2 = Map(pairs map { case(k, v) => (k*2, v) } :_*)
d2: scala.collection.immutable.Map[Int,String] = Map(2 -> a, 4 -> b, 6 -> c, 8 -> d)
scala> val d3 = pairs map { case(k, v) => (k*2, v) } toMap
d3: scala.collection.immutable.Map[Int,String] = Map(2 -> a, 4 -> b, 6 -> c, 8 -> d)
scala> val d4: Map[Int, String] = (for((k, v) <- pairs) yield (k, v*2))(breakOut)
d4: Map[Int,String] = Map(1 -> aa, 2 -> bb, 3 -> cc, 4 -> dd)
Here are a few examples:
val listOfLists = Vector(Vector(1,2), Vector(3,4), Vector(5,6))
val m1 = listOfLists.map { case Seq(a,b) => (a,b) }.toMap
val m2 = listOfLists.collect { case Seq(a,b) if b>0 => (a,b) }.toMap
val m3 = (for (Seq(a,b) <- listOfLists) yield (a,b)).toMap
val m4 = (for (Seq(a,b) <- listOfLists if b>0) yield (a,b)).toMap
val m5 = Map(listOfLists.map { case Seq(a,b) => (a,b) }: _*)
val m6 = Map(listOfLists.collect { case Seq(a,b) => (a,b) }: _*)
val m7 = Map((for (Seq(a,b) <- listOfLists) yield (a,b)): _*)
val m8 = Map((for (Seq(a,b) <- listOfLists if b>0) yield (a,b)): _*)
You can create a Map using .toMap or Map(xs: _*). The collect method lets you filter as you map. And a for-comprehension uses syntax most similar to your example.
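To answer the bonus question directly for pair-shaped input, a minimal sketch (the names are illustrative):
val listOfPairs = List("a" -> -1, "b" -> 0, "c" -> 1)
val filtered = listOfPairs.collect { case (foo, bar) if bar > 0 => foo -> bar }.toMap
// Map(c -> 1)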

Scala Collection Specific Implementation

Say I have some data in a Seq in Scala 2.10.2, e.g.:
scala> val data = Seq( 1, 2, 3, 4, 5 )
data: Seq[Int] = List(1, 2, 3, 4, 5)
Now, I perform some operations and convert it to a Map
scala> val pairs = data.map( i => i -> i * 2 )
pairs: Seq[(Int, Int)] = List((1,2), (2,4), (3,6), (4,8), (5,10))
scala> val pairMap = pairs.toMap
pairMap: scala.collection.immutable.Map[Int,Int] = Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
Now say, for performance reasons, I'd like pairMap to use the HashMap implementation of Map. What's the best way to achieve this?
Ways I've considered:
Casting:
pairMap.asInstanceOf[scala.collection.immutable.HashMap[Int,Int]]
This seems a bit horrible.
Manually converting:
var hm = scala.collection.immutable.HashMap[Int,Int]()
pairMap.foreach( p => hm += p )
But this isn't very functional.
Using the builder
scala.collection.immutable.HashMap[Int,Int]( pairMap.toSeq:_* )
This works, but it's not the most readable piece of code.
Is there a better way that I'm missing? If not, which of these is the best approach?
Interesting bit: it already is an immutable HashMap.
scala> val data = Seq( 1, 2, 3, 4, 5 )
data: Seq[Int] = List(1, 2, 3, 4, 5)
scala> val pairs = data.map( i => i -> i * 2 )
pairs: Seq[(Int, Int)] = List((1,2), (2,4), (3,6), (4,8), (5,10))
scala> val pairMap = pairs.toMap
pairMap: scala.collection.immutable.Map[Int,Int] =
Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
scala> pairMap.getClass
res0: Class[_ <: scala.collection.immutable.Map[Int,Int]] =
class scala.collection.immutable.HashMap$HashTrieMap
Note: casting to a HashMap doesn't change the underlying object at all. If you want to guarantee building a HashMap (or some other specific type), then I'd recommend this:
scala> import scala.collection.immutable
import scala.collection.immutable
scala> val pairMap = immutable.HashMap(pairs: _*)
pairMap: scala.collection.immutable.HashMap[Int,Int] =
Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
If you're looking for performance improvements, you should look into using a mutable.HashMap or java.util.HashMap. Most of Scala's collections are outperformed by the native java.util collections.
You can combine:
an explicit result type,
map, to pair each key and value into a tuple, and
breakOut, to "break out" of the sequence of tuples from map and create the target type directly,
like this:
val s = Seq.range(1, 6)
val m: scala.collection.immutable.HashMap[Int, Int] =
s.map(n => (n, n * n))(scala.collection.breakOut)
which creates the HashMap on-the-fly without an intermediate map.
By using breakOut and the explicit result type, an appropriate builder for map is chosen, and your target type is created directly.
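Worth noting: breakOut was removed in Scala 2.13; the closest equivalent there is .to with a collection factory:
import scala.collection.immutable.HashMap
val m = Seq.range(1, 6).map(n => n -> n * n).to(HashMap)
// m: scala.collection.immutable.HashMap[Int,Int]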

Play Scala - groupBy remove repetitive values

I apply the groupBy function to my List collection; however, I want to remove the repetitive values in the value part of the resulting Map. Here is the initial List collection:
PO_ID  PRODUCT_ID  RETURN_QTY
1      1           10
1      1           20
1      2           30
1      2           10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there any way to remove the repetitive part in the Map's values [(1,1), (1,2)]?
Thanks..
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates on each value of the Map produced by groupBy (a List of triplets), and the inner map extracts the third element of each triplet.
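A side note: in Scala 2.13, mapValues on a Map is deprecated in favor of going through a view, so the equivalent there is:
a.groupBy( v => (v._1, v._2) ).view.mapValues( _.map(_._3) ).toMap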
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.
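For what it's worth, Scala 2.13 added groupMap, which does the grouping and the projection in one call:
val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts.groupMap( t => (t._1, t._2) )(_._3)
// Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))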