Scala: map+filter instead of foldLeft

Is it possible to replace the foldLeft function with some combination of map and filter in Scala? Take this task, for example.
Input is a list of triples (student name, course, grade):
val grades = List(("Hans", "db", 2.3), ("Maria", "prog1", 1.0), ("Maria", "prog2", 1.3), ("Maria", "prog3", 1.7), ("Hans", "prog2", 1.7), ("Josef", "prog1", 1.3), ("Maria", "mathe1", 1.3), ("Josef", "db", 3.3), ("Hans", "prog1", 2.0))
Each student should then be mapped to a list of their courses and grades. With foldLeft it looks like this:
grades.foldLeft(Map[String, List[(String, Double)]]())((acc, e) => acc + (e._1 -> (acc.getOrElse(e._1, List()) ::: List((e._2, e._3))))).toList
Output:
List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
How to achieve the same output using only map and filter functions?
So far I have this, but the output is slightly different.
grades.map(x => (x._1, List())).distinct.flatMap(x => grades.map(z => if(!x._2.contains(z._2, z._3)) (x._1, x._2 ::: List((z._2, z._3)))))

Using groupBy is a nice way to tackle this problem:
grades
.groupBy(_._1)
.mapValues(_.map(t => (t._2, t._3)))
(Thanks to m4gic for the improvement to the original version, suggested in a comment.)
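Note that groupBy returns a Map rather than a List; to match the shape of the foldLeft result exactly, a trailing .toList can be appended (a minimal sketch based on the snippet above; Map entry order may differ from the foldLeft version):
grades
  .groupBy(_._1)
  .mapValues(_.map { case (_, course, grade) => (course, grade) })
  .toList
// List[(String, List[(String, Double)])] with the same per-student course lists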

grades
  .map({ case (name, _, _) => name })
  .distinct
  .map(name => {
    val scores = grades
      .filter({ case (nameInternal, _, _) => name == nameInternal })
      .map({ case (_, subject, score) => (subject, score) })
    (name, scores)
  })
// res0: List[(String, List[(String, Double)])] = List((Hans,List((db,2.3), (prog2,1.7), (prog1,2.0))), (Maria,List((prog1,1.0), (prog2,1.3), (prog3,1.7), (mathe1,1.3))), (Josef,List((prog1,1.3), (db,3.3))))
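A small variation on the same idea, not from the original answer: the inner filter plus map can be fused into a single collect, which filters and transforms in one pass (a sketch):
grades
  .map { case (name, _, _) => name }
  .distinct
  .map(name => (name, grades.collect { case (`name`, course, grade) => (course, grade) }))
// same output as above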

Related

How to map with join on an RDD in Scala

I have a list of (id, name-value) pairs, like this:
val input = sc.parallelize(Array(Array(1, "a 10"),
Array(1, "b 11"),
Array(3, "a 12"),
Array(3, "b 13"),
Array(3, "c 14"),
Array(4, "b 15")))
In the map phase, the key is the id and the value is the name-value string:
val rdd = input.map(x => (x(0), x(1)))
My expected result is: for each id, compare the values pairwise by name with a function f().
For example, with id == 3, we would get this result after the reduce phase:
(key: ab, value: f(12,13))
(key: ac, value: f(12,14))
(key: bc, value: f(13,14))
An RDD can be joined with itself to get all pairs, and only the required rows can be kept by filtering:
// split the string value into two parts
val rdd = input.map(x => (x(0), x(1).toString.split(" ")))
  .map({ case (key, parts) => (key, (parts(0), parts(1))) })
// join, filter, and transform to the expected shape
val both = rdd
  .join(rdd)
  .filter({ case (_, (v1, v2)) => v1._1 < v2._1 })
  .map({ case (key, (v1, v2)) => (s"[$key] key: " + v1._1 + v2._1, s"value: f(${v1._2},${v2._2})") })
Output:
([1] key: ab,value: f(10,11))
([3] key: ab,value: f(12,13))
([3] key: ac,value: f(12,14))
([3] key: bc,value: f(13,14))
PS: advanced filtering can be used here.
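If f is an actual function rather than text to interpolate, it can be applied directly in the same pipeline. A sketch, where f below is a hypothetical placeholder for the f() from the question:
// hypothetical placeholder for f()
def f(a: Int, b: Int): Int = a + b

val applied = rdd
  .join(rdd)
  .filter({ case (_, (v1, v2)) => v1._1 < v2._1 })
  .map({ case (key, (v1, v2)) => ((key, v1._1 + v2._1), f(v1._2.toInt, v2._2.toInt)) })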

Optimizing cartesian product using keys in Spark

To avoid computing all possible combinations, I'm trying to group values according to a certain key, and then compute the cartesian product of the values for each key, i.e.:
Input [(k1, v1), (k1, v2), (k2, v3)]
Desired output: [(v1, v1), (v1, v2), (v2, v2), (v2, v1), (v3, v3)]
Here is the code I have tried executing:
val input = sc.textFile("data.csv")
val rdd = input.map(s=>s.split(","))
.map(s => (s(1).toString, s(2).toString))
val group_result: RDD[(String, Iterable[String])] = rdd.groupByKey()
group_result.flatMap { t =>
  val stream1 = t._2.toStream
  val stream2 = t._2.toStream
  stream1.flatMap { src =>
    stream2.par.map { trg =>
      src + "," + trg
    }
  }
}
This works fine for very small files, but when the list (Iterable) is of length ~1000 the computation completely freezes.
As @zero323 said, the best way to solve this is by using PairRDDFunctions' join method; however, in order to achieve this you need a pair RDD, which can be obtained by using RDD's keyBy method.
You could do something like:
val rdd = sc.parallelize(Array(("k1", "v1"), ("k1", "v2"), ("k2", "v3"))).keyBy(_._1)
val result = rdd.join(rdd).map{
case (key: String, (x: Tuple2[String, String], y: Tuple2[String, String])) => (x._2, y._2)
}
result.take(20)
// res9: Array[(String, String)] = Array((v1,v1), (v1,v2), (v2,v1), (v2,v2), (v3,v3))
Here I share the notebook with the code.
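If the self-pairs such as (v1,v1) are not actually needed, one more filter after the join drops them (a sketch; keep or remove it depending on the exact pairs required):
val noSelfPairs = rdd.join(rdd)
  .filter { case (_, (x, y)) => x._2 != y._2 }
  .map { case (_, (x, y)) => (x._2, y._2) }
// leaves (v1,v2) and (v2,v1); k2 contributes nothing because (v3,v3) is filtered out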

How to Find Out Per-Key Maximum in a Scala Collection

Let us say I have a collection of employees as a List of tuples, where t._1 represents the department id, t._2 is the salary, and t._3 is the name of the employee:
val employees = List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert"))
Expected result: ((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
I am trying the following but getting an error:
val maxSal1 = emps.map(t => (t._1, (t._2, t._3))).groupBy(a => a._1).map(k => {
k._2.foldLeft(0, "dummy")((aa, bb) => {
if (aa._1 > bb._1) aa else bb
})
})
Don't overcomplicate things; avoid doing unnecessary operations and carrying redundant information around. Just be explicit and spell out the transformations you need at each step. Simplicity is your friend.
employees.groupBy(_._1).values.map(_.maxBy(_._2))
scala> List((1, 8000, "Sally"),(1, 9999, "Tom"), (2, 5000, "Pam"), (4, 500, "NK"), (4, 999, "Robert")).groupBy {
| case (dept, salary, employee) => dept
| }
res6: scala.collection.immutable.Map[Int,List[(Int, Int, String)]] = Map(2 -> List((2,5000,Pam)), 4 -> List((4,500,NK), (4,999,Robert)), 1 -> List((1,8000,Sally), (1,9999,Tom)))
scala> res6.map {
| case (dept, employees) => employees.maxBy(_._2)
| }
res5: scala.collection.immutable.Iterable[(Int, Int, String)] = List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
But note that maxBy is a partial function; it throws on an empty collection:
scala> List[Int]().maxBy(x => x)
java.lang.UnsupportedOperationException: empty.maxBy
As a side note, I'd use case class Employee with 3 fields rather than a tuple. I believe it's more readable.
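A minimal sketch of that case class variant (the identifiers here are illustrative, not from the original post):
case class Employee(dept: Int, salary: Int, name: String)

val staff = List(
  Employee(1, 8000, "Sally"), Employee(1, 9999, "Tom"),
  Employee(2, 5000, "Pam"), Employee(4, 500, "NK"), Employee(4, 999, "Robert"))

// per-department maximum by salary; groupBy never produces empty groups, so maxBy is safe here
val maxPerDept = staff.groupBy(_.dept).values.map(_.maxBy(_.salary))
// e.g. List(Employee(2,5000,Pam), Employee(4,999,Robert), Employee(1,9999,Tom))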
I tried this option and it seems to give the result:
val maxsal1 = emps1.map(t => (t._1, t._2, t._3)).groupBy(_._1).values.map(t => t.foldLeft((0, 1, "dummy"))((aa, bb) => {
if (aa._2 > bb._2) aa else bb
}))
Output: List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))
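Note that the (0, 1, "dummy") seed only works because every real salary is greater than 1. Since groupBy never produces an empty group, a seed-free reduceLeft avoids that assumption (a sketch, equivalent to the maxBy answer above):
val maxSal2 = employees.groupBy(_._1).values.map(_.reduceLeft((a, b) => if (a._2 > b._2) a else b))
// List((2,5000,Pam), (4,999,Robert), (1,9999,Tom))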

take top N after groupBy and treat them as RDD

I'd like to get the top N items after groupByKey on an RDD and convert the type of topNPerGroup (below) to RDD[(String, Int)], where the List[Int] values are flattened.
The data is
val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2,
"bar"->6, "bar"->5, "bar"->4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int])] = data.groupByKey.map {
  case (key, numbers) =>
    key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
What I'd like to achieve is that topNPerGroup.collect.foreach(println) generates (the expected result!):
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)
I've been struggling with this same issue recently, but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler, there is still a scaling/performance issue with using groupByKey, as noted in the documentation:
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly, because my Iterable in (K, Iterable<V>) was very large, > 1 million, so the sorting and taking of the top N became very expensive and created potential memory issues.
After some digging (see the references below), here is a full example using combineByKey to accomplish the same task in a way that will perform and scale.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object TopNForKey {

  var SampleDataset = List(
    (1, ("apple.com", 3L)),
    (1, ("google.com", 4L)),
    (1, ("stackoverflow.com", 10L)),
    (1, ("reddit.com", 15L)),
    (2, ("slashdot.org", 11L)),
    (2, ("samsung.com", 1L)),
    (2, ("apple.com", 9L)),
    (3, ("microsoft.com", 5L)),
    (3, ("yahoo.com", 3L)),
    (3, ("google.com", 4L)))

  // sort and trim a traversable (String, Long) tuple by _2 value of the tuple
  def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
    var ss = List[(String, Long)]()
    var min = Long.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e._2 > min) {
        ss = (e :: ss).sortBy((f) => f._2)
        min = ss.head._2
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head._2
        len -= 1
      }
    }
    ss
  }

  def main(args: Array[String]): Unit = {
    val topN = 2
    val sc = new SparkContext("local", "TopN For Key")
    val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))

    // use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
    val topNForKey = rdd.combineByKey(
      // seed a list for each key to hold your top N's with your first record
      (v) => List[(String, Long)](v),
      // add the incoming value to the accumulating top N list for the key
      (acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
      // merge top N lists returned from each partition into a new combined top N list
      (acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)

    // print results sorting for pretty
    topNForKey.sortByKey(true).foreach((t) => {
      println(s"key: ${t._1}")
      t._2.foreach((v) => {
        println(s"----- $v")
      })
    })
  }
}
And this is what I get in the returned RDD...
(1, List(("google.com", 4L),
("stackoverflow.com", 10L))
(2, List(("apple.com", 9L),
("slashdot.org", 15L))
(3, List(("google.com", 4L),
("microsoft.com", 5L))
References
https://www.mail-archive.com/user@spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
Spark 1.4.0 solves this.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
It uses BoundedPriorityQueue with aggregateByKey:
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
  self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
    seqOp = (queue, item) => {
      queue += item
    },
    combOp = (queue1, queue2) => {
      queue1 ++= queue2
    }
  ).mapValues(_.toArray.sorted(ord.reverse)) // This is a min-heap, so we reverse the order.
}
Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
and in the REPL it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible to provide an alternative Ordering (not required here). For example, if you wanted the n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))
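For instance, flattening in the same way as above would keep the two smallest values per key (a sketch; key order in the collected output may vary):
val bottomTwo: RDD[(String, Int)] =
  data.topByKey(2)(scala.math.Ordering.by[Int, Int](-_)).flatMapValues(x => x)
bottomTwo.collect.foreach(println)
// (foo,1)
// (foo,2)
// (bar,4)
// (bar,5)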

Combining two lists index-wise

First list:
remoteDeviceAndPort===>List(
(1,891w.yourdomain.com,wlan-ap0),
(13,ap,GigabitEthernet0),
(11,Router-3900,GigabitEthernet0/0)
)
Second list:
interfacesList===>List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0)
)
The above are my two lists; now I have to combine these two lists as shown below.
Expected output =>
combineList = List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0,891w.yourdomain.com,wlan-ap0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0,ap,GigabitEthernet0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878,Router-3900,GigabitEthernet0/0),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0,empty,empty),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty)
)
A similar question is here.
case class NetworkDeviceInterfaces(index: Int, params: String*)

val remoteDeviceAndPort = List(
  (1, "891w.yourdomain.com", "wlan-ap0"),
  (13, "ap", "GigabitEthernet0"),
  (11, "Router-3900", "GigabitEthernet0/0")
)

val rdapMap = remoteDeviceAndPort.map { case (k, v1, v2) => k -> (v1, v2) }.toMap

val interfacesList = List(NetworkDeviceInterfaces(1, "UP", "", "0", "0", "0", "0", "UP", "4294", "other", "VoIP-Null0", "0", "0"))

val result = interfacesList map { interface =>
  val (first, second) = rdapMap.getOrElse(interface.index, ("empty", "empty"))
  NetworkDeviceInterfaces(interface.index, (interface.params ++ Seq(first, second)): _*)
}
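A quick way to inspect the result is to print each combined record as a flat tuple-like line (a sketch; only the first interface is populated in this reduced example):
result.foreach(i => println((i.index +: i.params).mkString("(", ",", ")")))
// (1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0,891w.yourdomain.com,wlan-ap0)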