Scala: map+filter instead of foldLeft

Is it possible to replace the foldLeft function with some combination of map and filter in Scala? Take this task, for example.
Input is a list of triples (student name, course, grade):
val grades = List(("Hans", "db", 2.3), ("Maria", "prog1", 1.0), ("Maria", "prog2", 1.3), ("Maria", "prog3", 1.7), ("Hans", "prog2", 1.7), ("Josef", "prog1", 1.3), ("Maria", "mathe1", 1.3), ("Josef", "db", 3.3), ("Hans", "prog1", 2.0))
Each student should then be mapped to a list of their courses and grades. With foldLeft it looks like this:
grades.foldLeft(Map[String, List[(String, Double)]]())((acc, e) => acc + (e._1 -> (acc.getOrElse(e._1, List()) ::: List((e._2, e._3))))).toList
Output:
List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
How to achieve the same output using only map and filter functions?
So far I have this, but the output is slightly different.
grades.map(x => (x._1, List())).distinct.flatMap(x => grades.map(z => if(!x._2.contains(z._2, z._3)) (x._1, x._2 ::: List((z._2, z._3)))))

Using groupBy is a nice way to tackle this problem:
grades
.groupBy(_._1)
.mapValues(_.map(t => (t._2, t._3)))
(Thanks to m4gic for the improvement to the original version, suggested in a comment.)
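Note that groupBy returns a Map rather than a List; to match the shape of the foldLeft result exactly, a trailing .toList can be appended (a minimal sketch based on the snippet above; Map entry order may differ from the foldLeft version):
grades
  .groupBy(_._1)
  .mapValues(_.map { case (_, course, grade) => (course, grade) })
  .toList
// List[(String, List[(String, Double)])] with the same per-student course lists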

grades
  .map({ case (name, _, _) => name })
  .distinct
  .map(name => {
    val scores = grades
      .filter({ case (nameInternal, _, _) => name == nameInternal })
      .map({ case (_, subject, score) => (subject, score) })
    (name, scores)
  })
// res0: List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
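A small variation on the same idea, not from the original answer: the inner filter plus map can be fused into a single collect, which filters and transforms in one pass (a sketch):
grades
  .map { case (name, _, _) => name }
  .distinct
  .map(name => (name, grades.collect { case (`name`, course, grade) => (course, grade) }))
// same output as above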

Related

How to map with join on an RDD in Scala

I have a list of (id, name-value) pairs, like this:
val input = sc.parallelize(Array(Array(1, "a 10"),
Array(1, "b 11"),
Array(3, "a 12"),
Array(3, "b 13"),
Array(3, "c 14"),
Array(4, "b 15")))
In the map phase, the key is the id and the value is the name-value string:
val rdd = input.map(x => (x(0), x(1)))
My expected result is: for each id, compare the values pairwise by name with a function f().
For example, with id == 3, we would get this result after the reduce phase:
(key: ab, value: f(12,13))
(key: ac, value: f(12,14))
(key: bc, value: f(13,14))
An RDD can be joined with itself to get all pairs, and only the required rows can be kept by filtering:
// split the string value into two parts
val rdd = input.map(x => (x(0), x(1).toString.split(" ")))
  .map({ case (key, parts) => (key, (parts(0), parts(1))) })
// join, filter, and transform to the expected shape
val both = rdd
  .join(rdd)
  .filter({ case (_, (v1, v2)) => v1._1 < v2._1 })
  .map({ case (key, (v1, v2)) => (s"[$key] key: " + v1._1 + v2._1, s"value: f(${v1._2},${v2._2})") })
Output:
([1] key: ab,value: f(10,11))
([3] key: ab,value: f(12,13))
([3] key: ac,value: f(12,14))
([3] key: bc,value: f(13,14))
PS: advanced filtering can be used here.
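If f is an actual function rather than text to interpolate, it can be applied directly in the same pipeline. A sketch, where f below is a hypothetical placeholder for the f() from the question:
// hypothetical placeholder for f()
def f(a: Int, b: Int): Int = a + b

val applied = rdd
  .join(rdd)
  .filter({ case (_, (v1, v2)) => v1._1 < v2._1 })
  .map({ case (key, (v1, v2)) => ((key, v1._1 + v2._1), f(v1._2.toInt, v2._2.toInt)) })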

Optimizing cartesian product using keys in Spark

To avoid computing all possible combinations, I'm trying to group values according to a certain key, and then compute the cartesian product of the values for each key, i.e.:
Input [(k1, v1), (k1, v2), (k2, v3)]
Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
Here is the code I have tried executing:
val input = sc.textFile("data.csv")
val rdd = input.map(s=>s.split(","))
.map(s => (s(1).toString, s(2).toString))
val group_result: RDD[(String, Iterable[String])] = rdd.groupByKey()
group_result.flatMap { t =>
  val stream1 = t._2.toStream
  val stream2 = t._2.toStream
  stream1.flatMap { src =>
    stream2.par.map { trg =>
      src + "," + trg
    }
  }
}
This works fine for very small files, but when the list (Iterable) is of length ~1000 the computation completely freezes.
As @zero323 said, the best way to solve this is by using PairRDDFunctions' join method; however, in order to achieve this you need a pair RDD, which can be obtained by using RDD's keyBy method.
You could do something like:
val rdd = sc.parallelize(Array(("k1", "v1"), ("k1", "v2"), ("k2", "v3"))).keyBy(_._1)
val result = rdd.join(rdd).map{
case (key: String, (x: Tuple2[String, String], y: Tuple2[String, String])) => (x._2, y._2)
}
result.take(20)
// res9: Array[(String, String)] = Array((v1,v1), (v1,v2), (v2,v1), (v2,v2), (v3,v3))
Here I share the notebook with the code.
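If the self-pairs such as (v1,v1) are not actually needed, one more filter after the join drops them (a sketch; keep or remove it depending on the exact pairs required):
val noSelfPairs = rdd.join(rdd)
  .filter { case (_, (x, y)) => x._2 != y._2 }
  .map { case (_, (x, y)) => (x._2, y._2) }
// leaves (v1,v2) and (v2,v1); k2 contributes nothing because (v3,v3) is filtered out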

How to Find Out Per-Key Maximum in a Scala Collection

Let us say I have a collection of employees as a List of tuples, where t._1 represents the department id, t._2 is the salary, and t._3 is the name of the employee:
val employees = List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following but getting an error:
val maxSal1 = emps.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
k._2.foldLeft(0, "dummy")((aa, bb) => {
if (aa._1 > bb._1) aa else bb
})
})
Don't overcomplicate things; avoid doing unnecessary operations and carrying redundant information around. Just be explicit and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))
scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res5: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is a partial function; it throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
As a side note, I'd use case class Employee with 3 fields rather than a tuple. I believe it's more readable.
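A minimal sketch of that case class variant (the identifiers here are illustrative, not from the original post):
case class Employee(dept: Int, salary: Int, name: String)

val staff = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

// per-department maximum by salary; groupBy never produces empty groups, so maxBy is safe here
val maxPerDept = staff.groupBy(_.dept).values.map(_.maxBy(_.salary))
// e.g. List(Employee(2,5000,Pam), Employee(4,999,Robert), Employee(1,9999,Tom))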
I tried this option and it seems to give the result:
val maxsal1 = emps1.map(t => (t._1, t._2, t._3)).groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
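Note that the (0, 1, "dummy") seed only works because every real salary is greater than 1. Since groupBy never produces an empty group, a seed-free reduceLeft avoids that assumption (a sketch, equivalent to the maxBy answer above):
val maxSal2 = employees.groupBy(_._1).values.map(_.reduceLeft((a, b) => if (a._2 > b._2) a else b))
// List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))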

take top N after groupBy and treat them as RDD

I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2,
"bar"->6, "bar"->5, "bar"->4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) =>
    key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
What I'd like to achieve is that topNPerGroup.collect.foreach(println) generates (the expected result!):
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with using groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly, because my Iterable in (K, Iterable<V>) was very large, > 1 million, so the sorting and taking of the top N became very expensive and created potential memory issues.
After some digging (see the references below), here is a full example using combineByKey to accomplish the same task in a way that will perform and scale.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopNForKey {

  var SampleDataset = List(
    (1, ("apple.com", 3L)),
    (1, ("google.com", 4L)),
    (1, ("stackoverflow.com", 10L)),
    (1, ("reddit.com", 15L)),
    (2, ("slashdot.org", 11L)),
    (2, ("samsung.com", 1L)),
    (2, ("apple.com", 9L)),
    (3, ("microsoft.com", 5L)),
    (3, ("yahoo.com", 3L)),
    (3, ("google.com", 4L)))

  // sort and trim a traversable (String, Long) tuple by _2 value of the tuple
  def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
    var ss = List[(String, Long)]()
    var min = Long.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e._2 > min) {
        ss = (e :: ss).sortBy((f) => f._2)
        min = ss.head._2
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head._2
        len -= 1
      }
    }
    ss
  }

  def main(args: Array[String]): Unit = {
    val topN = 2
    val sc = new SparkContext("local", "TopN For Key")
    val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))

    // use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
    val topNForKey = rdd.combineByKey(
      // seed a list for each key to hold your top N's with your first record
      (v) => List[(String, Long)](v),
      // add the incoming value to the accumulating top N list for the key
      (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
      // merge top N lists returned from each partition into a new combined top N list
      (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)

    // print results sorting for pretty
    topNForKey.sortByKey(true).foreach((t) => {
      println(s"key: ${t._1}")
      t._2.foreach((v) => {
        println(s"----- $v")
      })
    })
  }
}
And this is what I get in the returned RDD...
(1, List(("google.com", 4L),
("stackoverflow.com", 10L))
(2, List(("apple.com", 9L),
("slashdot.org", 15L))
(3, List(("google.com", 4L),
("microsoft.com", 5L))
References
https://www.mail-archive.com/user@spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 solves this.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
It uses BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
  self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
    seqOp = (queue, item) => {
      queue += item
    },
    combOp = (queue1, queue2) => {
      queue1 ++= queue2
    }
  ).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
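For instance, flattening in the same way as above would keep the two smallest values per key (a sketch; key order in the collected output may vary):
val bottomTwo: RDD[(String, Int)] =
  data.topByKey(2)(scala.math.Ordering.by[Int, Int](-_)).flatMapValues(x => x)
bottomTwo.collect.foreach(println)
// (foo,1)
// (foo,2)
// (bar,4)
// (bar,5)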

Combining two lists index-wise

First list:
remoteDeviceAndPort===>List(
(1,891w.yourdomain.com,wlan-ap0),
(13,ap,GigabitEthernet0),
(11,Router-3900,GigabitEthernet0/0)
)
Second list:
interfacesList===>List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0)
)
The above are my two lists; now I have to combine these two lists as shown below.
Expected output =>
combineList = List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0,891w.yourdomain.com,wlan-ap0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0,ap,GigabitEthernet0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878,Router-3900,GigabitEthernet0/0),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0,empty,empty),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty)
)
A similar question is here.
case class NetworkDeviceInterfaces(index: Int, params: String*)

val remoteDeviceAndPort = List(
  (1, "891w.yourdomain.com", "wlan-ap0"),
  (13, "ap", "GigabitEthernet0"),
  (11, "Router-3900", "GigabitEthernet0/0")
)

val rdapMap = remoteDeviceAndPort.map { case (k, v1, v2) => k -> (v1, v2) }.toMap

val interfacesList = List(NetworkDeviceInterfaces(1, "UP", "", "0", "0", "0", "0", "UP", "4294", "other", "VoIP-Null0", "0", "0"))

val result = interfacesList map { interface =>
  val (first, second) = rdapMap.getOrElse(interface.index, ("empty", "empty"))
  NetworkDeviceInterfaces(interface.index, (interface.params ++ Seq(first, second)): _*)
}
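A quick way to inspect the result is to print each combined record as a flat tuple-like line (a sketch; only the first interface is populated in this reduced example):
result.foreach(i => println((i.index +: i.params).mkString("(", ",", ")")))
// (1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0,891w.yourdomain.com,wlan-ap0)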