take top N after groupBy and treat them as RDD - scala

I'd like to get top N items after groupByKey of RDD and convert the type of topNPerGroup(in the below) to RDD[(String, Int)] where List[Int] values are flatten
The data is
val data = sc.parallelize(Seq("foo"->3, "foo"->1, "foo"->2,
"bar"->6, "bar"->5, "bar"->4))
The top N items per group are computed as:
val topNPerGroup: RDD[(String, List[Int]) = data.groupByKey.map {
case (key, numbers) =>
key -> numbers.toList.sortBy(-_).take(2)
}
The result is
(bar,List(6, 5))
(foo,List(3, 2))
which was printed by
topNPerGroup.collect.foreach(println)
If I achieve, topNPerGroup.collect.foreach(println) will generate (expected result!)
(bar, 6)
(bar, 5)
(foo, 3)
(foo, 2)

I've been struggling with this same issue recently but my need was a little different in that I needed the top K values per key with a data set like (key: Int, (domain: String, count: Long)). While your dataset is simpler there is still a scaling/performance issue by using groupByKey as noted in the documentation.
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable) pairs. Note: If you are grouping in order to perform an
aggregation (such as a sum or average) over each key, using
reduceByKey or combineByKey will yield much better performance.
In my case I ran into problems very quickly because my Iterable in (K, Iterable<V>) was very large, > 1 million, so the sorting and taking of the top N became very expensive and creates potential memory issues.
After some digging, see references below, here is a full example using combineByKey to accomplish the same task in a way that will perform and scale.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object TopNForKey {
var SampleDataset = List(
(1, ("apple.com", 3L)),
(1, ("google.com", 4L)),
(1, ("stackoverflow.com", 10L)),
(1, ("reddit.com", 15L)),
(2, ("slashdot.org", 11L)),
(2, ("samsung.com", 1L)),
(2, ("apple.com", 9L)),
(3, ("microsoft.com", 5L)),
(3, ("yahoo.com", 3L)),
(3, ("google.com", 4L)))
//sort and trim a traversable (String, Long) tuple by _2 value of the tuple
def topNs(xs: TraversableOnce[(String, Long)], n: Int) = {
var ss = List[(String, Long)]()
var min = Long.MaxValue
var len = 0
xs foreach { e =>
if (len < n || e._2 > min) {
ss = (e :: ss).sortBy((f) => f._2)
min = ss.head._2
len += 1
}
if (len > n) {
ss = ss.tail
min = ss.head._2
len -= 1
}
}
ss
}
def main(args: Array[String]): Unit = {
val topN = 2
val sc = new SparkContext("local", "TopN For Key")
val rdd = sc.parallelize(SampleDataset).map((t) => (t._1, t._2))
//use combineByKey to allow spark to partition the sorting and "trimming" across the cluster
val topNForKey = rdd.combineByKey(
//seed a list for each key to hold your top N's with your first record
(v) => List[(String, Long)](v),
//add the incoming value to the accumulating top N list for the key
(acc: List[(String, Long)], v) => topNs(acc ++ List((v._1, v._2)), topN).toList,
//merge top N lists returned from each partition into a new combined top N list
(acc: List[(String, Long)], acc2: List[(String, Long)]) => topNs(acc ++ acc2, topN).toList)
//print results sorting for pretty
topNForKey.sortByKey(true).foreach((t) => {
println(s"key: ${t._1}")
t._2.foreach((v) => {
println(s"----- $v")
})
})
}
}
And what I get in the returning rdd...
(1, List(("google.com", 4L),
("stackoverflow.com", 10L))
(2, List(("apple.com", 9L),
("slashdot.org", 15L))
(3, List(("google.com", 4L),
("microsoft.com", 5L))
References
https://www.mail-archive.com/user#spark.apache.org/msg16827.html
https://stackoverflow.com/a/8275562/807318
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

Spark 1.4.0 solves the question.
Take a look at https://github.com/apache/spark/commit/5e6ad24ff645a9b0f63d9c0f17193550963aa0a7
This uses BoundedPriorityQueue with aggregateByKey
def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
seqOp = (queue, item) => {
queue += item
},
combOp = (queue1, queue2) => {
queue1 ++= queue2
}
).mapValues(_.toArray.sorted(ord.reverse)) // This is an min-heap, so we reverse the order.
}

Your question is a little confusing, but I think this does what you're looking for:
val flattenedTopNPerGroup =
topNPerGroup.flatMap({case (key, numbers) => numbers.map(key -> _)})
and in the repl it prints out what you want:
flattenedTopNPerGroup.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)

Just use topByKey:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD
val topTwo: RDD[(String, Int)] = data.topByKey(2).flatMapValues(x => x)
topTwo.collect.foreach(println)
(foo,3)
(foo,2)
(bar,6)
(bar,5)
It is also possible provide alternative Ordering (not required here). For example if you wanted n smallest values:
data.topByKey(2)(scala.math.Ordering.by[Int, Int](- _))

Related

Full outer join in Scala

Given a list of lists, where each list has an object that represents the key, I need to write a full outer join that combines all the lists. Each record in the resulting list is the combination of all the fields of all the lists. In case that one key is present in list 1 and not present in list 2, then the fields in list 2 should be null or empty.
One solution I thought of is to embed an in-memory database, create the tables, run a select and get the result. However, I'd like to know if there are any libraries that handle this in a more simpler way. Any ideas?
For example, let's say I have two lists, where the key is the first field in the list:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val allLists = List (list1, list2)
The full outer joined list would be:
val allListsJoined = List ((1,2,"A"), (3,4,None), (5,6,None), (7,None,"B"))
NOTE: the solution needs to work for N lists
def fullOuterJoin[K, V1, V2](xs: List[(K, V1)], ys: List[(K, V2)]): List[(K, Option[V1], Option[V2])] = {
val map1 = xs.toMap
val map2 = ys.toMap
val allKeys = map1.keySet ++ map2.keySet
allKeys.toList.map(k => (k, map1.get(k), map2.get(k)))
}
Example usage:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
println(fullOuterJoin(list1, list2))
Which prints:
List((1,Some(2),Some(A)), (3,Some(4),None), (5,Some(6),None), (7,None,Some(B)))
Edit per suggestion in comments:
If you're interested in joining an arbitrary number of lists and don't care about type info, here's a version that does that:
def fullOuterJoin[K](xs: List[List[(K, Any)]]): List[(K, List[Option[Any]])] = {
val maps = xs.map(_.toMap)
val allKeys = maps.map(_.keySet).reduce(_ ++ _)
allKeys.toList.map(k => (k, maps.map(m => m.get(k))))
}
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val list3 = List((1, 3.5), (7, 4.0))
val lists = List(list1, list2, list3)
println(fullOuterJoin(lists))
which outputs:
List((1,List(Some(2), Some(A), Some(3.5))), (3,List(Some(4), None, None)), (5,List(Some(6), None, None)), (7,List(None, Some(B), Some(4.0))))
If you want both an arbitrary number of lists and well-typed results, that's probably beyond the scope of a stackoverflow answer but could probably be accomplished with shapeless.
Here is a way to do it using collect separately on both list
val list1Ite = list1.collect{
case ele if list2.filter(e=> e._1 == ele._1).size>0 => { //if list2 _1 contains ele._1
val left = list2.find(e=> e._1 == ele._1) //find the available element
(ele._1, ele._2, left.get._2) //perform join
}
case others => (others._1, others._2, None) //others add None as _3
}
//list1Ite: List[(Int, Int, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None))
Do similar operation but exclude the elements which are already available in list1Ite
val list2Ite = list2.collect{
case ele if list1.filter(e=> e._1 == ele._1).size==0 => (ele._1, None , ele._2)
}
//list2Ite: List[(Int, None.type, String)] = List((7,None,B))
Combine both list1Ite and list2Ite to result
val result = list1Ite.++(list2Ite)
result: List[(Int, Any, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None), (7,None,B))

How to count the number of occurences of an element with scala/spark?

I had a file that contained a list of elements like this
00|905000|20160125204123|79644809999||HGMTC|1||22|7905000|56321647569|||34110|I||||||250995210056537|354805064211510||56191|||38704||A|||11|V|81079681404134|5||||SE|||G|144|||||||||||||||Y|b00534589.huawei_anadyr.20151231184912||1|||||79681404134|0|||+##+1{79098509982}2{2}3{2}5{79644809999}6{0000002A7A5AC635}7{79681404134}|20160125|
Through a series of steps, I managed to convert it to a list of elements like this
(902996760100000,CompactBuffer(6, 5, 2, 2, 8, 6, 5, 3))
Where 905000 and 902996760100000 are keys and 6, 5, 2, 2, 8, 6, 5, 3 are values. Values can be numbers from 1 to 8. Are there any ways to count number of occurences of these values using spark, so the result looks like this?
(902996760100000, 0_1, 2_2, 1_3, 0_4, 2_5, 2_6, 0_7, 1_8)
I could do it with if else blocks and staff, but that won't be pretty, so I wondered if there are any instrumets I could use in scala/spark.
This is my code.
class ScalaJob(sc: SparkContext) {
def run(cdrPath: String) : RDD[(String, Iterable[String])] = {
//pass the file
val fileCdr = sc.textFile(cdrPath);
//find values in every raw cdr
val valuesCdr = fileCdr.map{
dataRaw =>
val p = dataRaw.split("[|]",-1)
(p(1), ScalaJob.processType(ScalaJob.processTime(p(2)) + "_" + p(32)))
}
val x = valuesCdr.groupByKey()
return x
}
Any advice on optimizing it would be appreciated. I'm really new to scala/spark.
First, Scala is a type-safe language and so is Spark's RDD API - so it's highly recommended to use the type system instead of going around it by "encoding" everything into Strings.
So I'll suggest a solution that creates an RDD[(String, Seq[(Int, Int)])] (with second item in tuple being a sequence of (ID, count) tuples) and not a RDD[(String, Iterable[String])] which seems less useful.
Here's a simple function that counts the occurrences of 1 to 8 in a given Iterable[Int]:
def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
(1 to 8).map(i => (i, l.count(_ == i)))
}
You can use mapValues with this function (place the function in the object for serializability, like you did with the rest) on an RDD[(String, Iterable[Int])] to get the result:
valuesCdr.groupByKey().mapValues(ScalaJob.countValues)
The entire solution can then be simplified a bit:
class ScalaJob(sc: SparkContext) {
import ScalaJob._
def run(cdrPath: String): RDD[(String, Seq[(Int, Int)])] = {
val valuesCdr = sc.textFile(cdrPath)
.map(_.split("\\|"))
.map(p => (p(1), processType(processTime(p(2)), p(32))))
valuesCdr.groupByKey().mapValues(countValues)
}
}
object ScalaJob {
val dayParts = Map((6 to 11) -> 1, (12 to 18) -> 2, (19 to 23) -> 3, (0 to 5) -> 4)
def processTime(s: String): Int = {
val hour = DateTime.parse(s, DateTimeFormat.forPattern("yyyyMMddHHmmss")).getHourOfDay
dayParts.filterKeys(_.contains(hour)).values.head
}
def processType(dayPart: Int, s: String): Int = s match {
case "S" => 2 * dayPart - 1
case "V" => 2 * dayPart
}
def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
(1 to 8).map(i => (i, l.count(_ == i)))
}
}

Scala spark reduce by key and find common value

I have a file of csv data stored in as a sequenceFile on HDFS, in the format of name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There could be many entries with the same name and I needed to find out what their favourite food was (ie count all the food entries in all the records with that name and return the most popular one. I am new to Scala and Spark and have gone thorough multiple tutorials and scoured the forums but am stuck as to how to proceed. So far I have got the sequence files which had Text into String format and then filtered out the entries
Here is the sample data entries one to a line in the file
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda) since soda appears the most amount of times in Bob's entries.
import org.apache.hadoop.io._
var lines = sc.sequenceFile("path",classOf[LongWritable],classOf[Text]).values.map(x => x.toString())
// converted to string since I could not get filter to run on Text and removing the longwritable
var filtered = lines.filter(_.split(",")(0) == "Bob");
// removed entries with all other users
var f_tuples = filtered.map(line => lines.split(",");
// split all the values
var f_simple = filtered.map(line => (line(0), (line(3), line(4), line(5))
// removed unnecessary fields
This Issue I have now is that I think I have this [<name,[f,f,f]>] structure and don't really know how to proceed to flatten it out and get the most popular food. I need to combine all the entries so I have a entry with a and then get the most common element in the value. Any help would be appreciated. Thanks
I tried this to get it to flatten out, but it seems the more I try, the more convoluted the data structure becomes.
var f_trial = fpairs.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Interable[(String, String, String)]
here is what a println of a record looks like after f_trial
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Parenthesis Breakdown
("Bob",
List(
(Pizza, Soda, <missing value>),
(Chocolate, Cheese, Soda),
(Chocolate, Pizza, Soda)
) // ends List paren
) // ends first paren
I found time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices, and for each make a tuple of ((name, food), 1)
val r2 = records.flatMap { r =>
val Array(name, id, country, food1, food2, food3, color) = r.split(',');
List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (only) is the key
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
Pick the food with the largest count at each step
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
EDIT: To collect all the food items that have the same top count for a person, we just change the last two lines:
Start with a List of items with total:
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the equal case, concatenate the list of food items with that score
val res2 = r5.reduceByKey((x, y) => if (y._2 > x._2) y
else if (y._2 < x._2) x
else (y._1:::x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want the top-3, say, then use aggregateByKey to assemble a list of the favorite foods per person instead of the second reduceByKey
Solutions provided by Paul and mattinbits shuffle your data twice - once to perform reduce-by-name-and-food and once to reduce-by-name. It is possible to solve this problem with only one shuffle.
/**Generate key-food_count pairs from a splitted line**/
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
val key = xs(0)
val map = xs
.drop(3) // Drop name..country
.take(3) // Take food
.filter(_.trim.size !=0) // Ignore empty
.map((_, 1L)) // Generate k-v pairs
.toMap // Convert to Map
.withDefaultValue(0L) // Set default
(key, map)
}
/**Combine two count maps**/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
(m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}
val n: Int = ??? // Number of favorite per user
val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you're not a purist you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
Here's a complete example:
import org.apache.spark.{SparkContext, SparkConf}
object Main extends App {
val data = List(
"Bob,123,USA,Pizza,Soda,,Blue",
"Bob,456,UK,Chocolate,Cheese,Soda,Green",
"Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
"Mary,68,USA,Chips,Pasta,Chocolate,Blue")
val sparkConf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(sparkConf)
val lineRDD = sc.parallelize(data)
val pairedRDD = lineRDD.map { line =>
val fields = line.split(",")
(fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
}.filter(_._1 == "Bob")
/*pairedRDD.collect().foreach(println)
(Bob,List(Pizza, Soda))
(Bob,List(Chocolate, Cheese, Soda))
(Bob,List(Chocolate, Pizza, Soda))
*/
val flatPairsRDD = pairedRDD.flatMap {
case (name, foodList) => foodList.map(food => ((name, food), 1))
}
/*flatPairsRDD.collect().foreach(println)
((Bob,Pizza),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Cheese),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Pizza),1)
((Bob,Soda),1)
*/
val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
/*nameFoodSumRDD.collect().foreach(println)
((Bob,Cheese),1)
((Bob,Soda),3)
((Bob,Pizza),2)
((Bob,Chocolate),2)
*/
val resultsRDD = nameFoodSumRDD.map{
case ((name, food), count) => (name, (food,count))
}.groupByKey.map{
case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
}
resultsRDD.collect().foreach(println)
/*
(Bob,(Soda,3))
*/
sc.stop()
}

Split RDD into RDD's with no repeating values

I have a RDD of Pairs as below :
(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)
I would like to split it into two RDD's such that the first has only the pairs with non-repeating values (for both key and value) and the second will have the rest of the omitted pairs.
So from the above we could get two RDD's
1) (105,918)
(502,516)
2) (105,757) - Omitted as 105 is already included in 1st RDD
(105,137) - Omitted as 105 is already included in 1st RDD
(516,816) - Omitted as 516 is already included in 1st RDD
(350,502) - Omitted as 502 is already included in 1st RDD
Currently I am using a mutable Set variable to track the elements already selected after coalescing the original RDD to a single partition :
val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations
.filter(p => {
if(!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
evalCombinations += p._1;evalCombinations += p._2; true
} else
false
})
This approach is limited by memory of the executor on which the operations run. Is there a better scalable solution for this ?
Here is a version, which will require more memory for driver.
import org.apache.spark.rdd._
import org.apache.spark._
def getUniq(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {
val keys = rdd.keys.distinct
val values = rdd.values.distinct
// these are the keys which appear in value part also.
val both = keys.intersection(values)
val bBoth = sc.broadcast(both.collect.toSet)
// remove those key-value pairs which have value which is also a key.
val uKeys = rdd.filter(x => !bBoth.value.contains(x._2))
.reduceByKey{ case (v1, v2) => v1 } // keep uniq keys
uKeys.map{ case (k, v) => (v, k) } // swap key, value
.reduceByKey{ case (v1, v2) => v1 } // keep uniq value
.map{ case (k, v) => (v, k) } // correct placement
}
def getPartitionedRDDs(rdd: RDD[(Int, Int)], sc: SparkContext) = {
val r = getUniq(rdd, sc)
val remaining = rdd subtract r
val set = r.flatMap { case (k, v) => Array(k, v) }.collect.toSet
val a = remaining.filter{ case (x, y) => !set.contains(x) &&
!set.contains(y) }
val b = getUniq(a, sc)
val part1 = r union b
val part2 = rdd subtract part1
(part1, part2)
}
val rdd = sc.parallelize(Array((105,918),(105,757),(502,516),
(105,137),(516,816),(350,502)))
val (first, second) = getPartitionedRDDs(rdd, sc)
// first.collect: ((516,816), (105,918), (350,502))
// second.collect: ((105,137), (502,516), (105,757))
val rdd1 = sc.parallelize(Array((839,841),(842,843),(840,843),
(839,840),(1,2),(1,3),(4,3)))
val (f, s) = getPartitionedRDDs(rdd1, sc)
//f.collect: ((839,841), (1,2), (840,843), (4,3))

Spark column wise word count

We are trying to generate column wise statistics of our dataset in spark. In addition to using the summary function from statistics library. We are using the following procedure:
We determine the columns with string values
Generate key value pair for the whole dataset, using the column number as key and value of column as value
generate a new map of format
(K,V) ->((K,V),1)
Then we use reduceByKey to find the sum of all unique value in all the columns. We cache this output to reduce further computation time.
In the next step we cycle through the columns using a for loop to find the statistics for all the columns.
We are trying to reduce the for loop by again utilizing the map reduce way but we are unable to find some way to achieve it. Doing so will allow us to generate column statistics for all columns in one execution. The for loop method is running sequentially making it very slow.
Code:
//drops the header
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
def retAtrTuple(x: String) = {
val newX = x.split(",")
for (h <- 0 until newX.length)
yield (h,newX(h))
}
val line = sc.textFile("hdfs://.../myfile.csv")
val withoutHeader: RDD[String] = dropHeader(line)
val kvPairs = withoutHeader.flatMap(retAtrTuple) //generates a key-value pair where key is the column number and value is column's value
var bool_numeric_col = kvPairs.map{case (x,y) => (x,isNumeric(y))}.reduceByKey(_&&_).sortByKey() //this contains column indexes as key and boolean as value (true for numeric and false for string type)
var str_cols = bool_numeric_col.filter{case (x,y) => y == false}.map{case (x,y) => x}
var num_cols = bool_numeric_col.filter{case (x,y) => y == true}.map{case (x,y) => x}
var str_col = str_cols.toArray //array consisting the string col
var num_col = num_cols.toArray //array consisting numeric col
val colCount = kvPairs.map((_,1)).reduceByKey(_+_)
val e1 = colCount.map{case ((x,y),z) => (x,(y,z))}
var numPairs = e1.filter{case (x,(y,z)) => str_col.contains(x) }
//running for loops which needs to be parallelized/optimized as it sequentially operates on each column. Idea is to find the top10, bottom10 and number of distinct elements column wise
for(i <- str_col){
var total = numPairs.filter{case (x,(y,z)) => x==i}.sortBy(_._2._2)
var leastOnes = total.take(10)
println("leastOnes for Col" + i)
leastOnes.foreach(println)
var maxOnes = total.sortBy(-_._2._2).take(10)
println("maxOnes for Col" + i)
maxOnes.foreach(println)
println("distinct for Col" + i + " is " + total.count)
}
Let me simplify your question a bit. (A lot actually.) We have an RDD[(Int, String)] and we want to find the top 10 most common Strings for each Int (which are all in the 0–100 range).
Instead of sorting, as in your example, it is more efficient to use the Spark built-in RDD.top(n) method. Its run-time is linear in the size of the data, and requires moving much less data around than a sort.
Consider the implementation of top in RDD.scala. You want to do the same, but with one priority queue (heap) per Int key. The code becomes fairly complex:
import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private.
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[[(Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for (((k, v), count) <- items) {
heaps(k) += count -> v
}
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case(count, value) => value })
}
This is efficient, but made painful because we need to work with multiple columns at the same time. It would be way easier to have one RDD per column and just call rdd.top(10) on each.
Unfortunately the naive way to split up the RDD into N smaller RDDs does N passes:
def split(together: RDD[(Int, String)], columns: Int): Seq[RDD[String]] = {
together.cache // We will make N passes over this RDD.
(0 until columns).map {
i => together.filter { case (key, value) => key == i }.values
}
}
A more efficient solution could be to write out the data into separate files by key, then load it back into separate RDDs. This is discussed in Write to multiple outputs by key Spark - one Spark job.
Thanks for #Daniel Darabos's answer. But there are some mistakes.
mixed use of Map and collection.mutable.Map
withDefault((i: Int) => new Heap(n)) do not create a new Heap when you set heaps(k) += count -> v
mix uasage of parentheses
Here is the modified code:
//import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private. copy to your own folder and import it
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object BoundedPriorityQueueTest {
// https://stackoverflow.com/questions/28166190/spark-column-wise-word-count
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[((Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[collection.mutable.Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
for (((k, v), count) <- items) {
println("\n---")
println("before add " + ((k, v), count) + ", the map is: ")
println(heaps)
if (!heaps.contains(k)) {
println("not contains key " + k)
heaps(k) = new Heap(n)
println(heaps)
}
heaps(k) += count -> v
println("after add " + ((k, v), count) + ", the map is: ")
println(heaps)
}
println(heaps)
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: collection.mutable.Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap]() //.withDefault((i: Int) => new Heap(n))
println(heaps)
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case (count, value) => value }).toMap
}
def main(args: Array[String]): Unit = {
Logger.getRootLogger().setLevel(Level.FATAL) //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val words = sc.parallelize(List((1, "s11"), (1, "s11"), (1, "s12"), (1, "s13"), (2, "s21"), (2, "s22"), (2, "s22"), (2, "s23")))
println("# words:" + words.count())
val result = top(1, words)
println("\n--result:")
println(result)
sc.stop()
print("DONE")
}
}