How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]] - scala

I searched a solution for a long time but didn't get any correct algorithm.
Using Spark RDDs in scala, how could I transform a RDD[(Key, Value)] into a Map[key, RDD[Value]], knowing that I can't use collect or other methods which may load data into memory ?
In fact, my final goal is to loop on Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value]
For example, if I get :
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like :
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder if it would cost not too much to do it using filter on each key A, B, C of RDD[(Key, Value)], but I don't know if calling filter as much times there are different keys would be efficient ? (off course not, but maybe using cache ?)
Thank you

You should use the code like this (Python):
rdd = sc.parallelize( [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)] ).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
out = rdd.filter(lambda x: x[0] == key).map(lambda (x,y): y)
out.saveAsNewAPIHadoopFile (...)
One RDD cannot be a part of another RDD and you have no option to just collect keys and transform their related values to a separate RDD. In my example you would iterate over the cached RDD which is ok and would work fast

It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]] consider using a MultipleTextOutputFormat similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to only take one pass over the RDD after the shuffle and you get the same result you wanted. If you did this by filtering and creating several IDs as suggested in the other answer (unless your source supported pushdown filters) you would end up taking one pass over the dataset for each individual key which would be way slower.

This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map{v =>
var result_list:List[Int] = Nil
for (i <- v._2) {
result_list ::= i
}
(v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
(acc, value) => value :: acc,
(acc1, acc2) => acc1 ::: acc2 )
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))

Related

spark rdd filter after groupbykey

//create RDD
val rdd = sc.makeRDD(List(("a", (1, "m")), ("b", (1, "m")),
("a", (1, "n")), ("b", (2, "n")), ("c", (1, "m")),
("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n"))))
val groupRDD = rdd.groupByKey()
after groupByKey i want to filter the second element is not equal 1 and get
("b", (1, "m")),("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m"))
groupByKey() is must necessary, could help me, thanks a lot.
add:
but if the second element type is string,filter the second element All of them equal x ,like
("a",("x","m")), ("a",("x","n")), ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m")), ("d",("x","m")), ("d",("x","n"))
and also get the same result ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m"))
You could do:
val groupRDD = rdd
.groupByKey()
.filter(value => value._2.map(tuple => tuple._1).sum != value._2.size)
.flatMapValues(list => list) // to get the result as you like, because right now, they are, e.g. (b, Seq((1, m), (1, n)))
What this does, is that we are first grouping keys through groupByKey, then we are filtering through filter by summing the keys from your grouped entries, and checking whether the sum is as much as the grouped entries size. For example:
(a, Seq((1, m), (1, n)) -> grouped by key
(a, Seq((1, m), (1, n), 2 (the sum of 1 + 1), 2 (size of sequence))
2 = 2, filter this row out
The final result:
(c,(1,m))
(b,(1,m))
(c,(5,m))
(b,(2,n))
Good luck!
EDIT
Under the assumption that key from tuple can be any string; assuming rdd is your data that contains:
(a,(x,m))
(c,(x,m))
(c,(z,m))
(d,(x,m))
(b,(x,m))
(a,(x,n))
(d,(x,n))
(b,(y,n))
Then we can construct uniqueCount as:
val uniqueCount = rdd
// we swap places, we want to check for combination of (a, 1), (b, a), (b, b), (c, a), etc.
.map(entry => ((entry._1, entry._2._1), entry._2._2))
// we count keys, meaning that (a, 1) gives us 2, (b, a) gives us 1, (b, b) gives us 1, etc.
.countByKey()
// we filter out > 2, because they are duplicates
.filter(a => a._2 == 1)
// we get the very keys, so we can filter below
.map(a => a._1._1)
.toList
Then this:
val filteredRDD = rdd.filter(a => uniqueCount.contains(a._1))
Gives this output:
(b,(y,n))
(c,(x,m))
(c,(z,m))
(b,(x,m))

How to sort an array with a custom order in Scala

I have a collection of data like the following:
val data = Seq(
("M", 1),
("F", 2),
("F", 3),
("F/M", 4),
("M", 5),
("M", 6),
("F/M", 7),
("F", 8)
)
I would like to sort this array according to the first value of the tuple. But I don't want to sort them in alphabetic order, I want them to be sorted like that: all Fs first, then all Ms and finally all F/Ms (I don't are about the inner sorting for values with the same key).
I thought about extending the Ordering class, but it feel quite overkilling for such a simple problem. Any idea?
EDIT: See #Eastsun's comment below for an even simpler solution.
I finally came up with a simple solution based on a map:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
data.sortWith((p1, p2) => sortingOrder(p1._1) < sortingOrder(p2._1))
This will of course fail if there is a unknown key in data, but it will be fine in my case.
In order to avoid an error when a new key is met, we can do the following:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
val nKeys = sortingOrder.size
data.sortWith((p1, p2) => sortingOrder.getOrElse(p1._1, nKeys) < sortingOrder.getOrElse(p2._1, nKeys))
This will push tuples with unknown keys at the end of the list.

Reverse a word-frequency map in Scala

I have a word-frequency array like this:
[("hello", 1), ("world", 5), ("globle", 1)]
I have to reverse it such that I get frequency-to-wordCount map like this:
[(1, 2), (5, 1)]
Notice that since two words ("hello" and "globe") have the frequency 1, the value of the reversed mapping is 2. However, since there is only one word with a frequency 5, so, the value of that entry is 1. How can I do this in scala?
Update:
I happened to figure this out as well:
arr.groupBy(_._2).map(x => (x._1,x._2.toList.length))
You can first group by the count, and then just get the size of each group
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))
val reversed = frequencies.groupBy(_._2).mapValues(_.size).toList
res0: List[(Int, Int)] = List((5,1), (1,2))

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. Number of elements to add varies and depends on the number of elements in the key. So that the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution which is obviously not very efficient (note: flatten and aggregation is optional, I'm good with flatten data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
By looking at the current approach, and in order to ensure the scalability of the solution, probably the area of focus should be to come up with a sampling mechanism that can be done in a distributed fashion, removing the need for collecting the keys back to the driver.
In a nutshell, we need a distributed method to a weighted sample of all the values.
What I propose is to create a matrix keys x values where each cell is the probability of the value being chosen for that key. Then, we can randomly score that matrix and pick those values that fall within the probability.
Let's write a spark-based algo for that:
// sample data to guide us.
//Note that I'm using distinguishable data across keys to see how the sample data distributes over the keys
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, Random.nextDouble))}
ScoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that all the algorithm is written as a sequence of transformations on the original data (see DAG below), with minimal shuffling (only the last groupByKey, which is done over a minimal result set), it should be scalable. The only limitation would be the list of values per key in the groupByKey stage, which is only to comply with the representation used the question.

How to filter a RDD according to a function based another RDD in Spark?

I am a beginner of Apache Spark. I want to filter out all groups whose sum of weight is larger than a constant value in a RDD. The "weight" map is also a RDD. Here is a small-size demo, the groups to be filtered is stored in "groups", the constant value is 12:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
val wm = weights.toArray.toMap
def isheavy(inp: String): Boolean = {
val allw = inp.split(",").map(wm(_)).sum
allw > 12
}
val result = groups.filter(isheavy)
When the input data is very large, > 10GB for example, I always encounter a "java heap out of memory" error. I doubted if it's caused by "weights.toArray.toMap", because it convert an distributed RDD to an Java object in JVM. So I tried to filter with RDD directly:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
def isheavy(inp: String): Boolean = {
val items = inp.split(",")
val wm = items.map(x => weights.filter(_._1 == x).first._2)
wm.sum > 12
}
val result = groups.filter(isheavy)
When I ran result.collect after loading this script into spark shell, I got a "java.lang.NullPointerException" error. Someone told me when a RDD is manipulated in another RDD, there will be a nullpointer exception, and suggest me to put the weight into Redis.
So how can I get the "result" without convert "weight" to Map, or put it into Redis? If there is a solution to filter a RDD based on another map-like RDD without the help of external datastore service?
Thanks!
Suppose your group is unique. Otherwise, first make it unique by distinct, etc.
If group or weights is small, it should be easy. If both group and weights are huge, you can try this, which may be more scalable, but also looks complicated.
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
//map groups to be (a, (a,b,c,d)), (b, (a,b,c,d), (c, (a,b,c,d)....
val g1=groups.flatMap(s=>s.split(",").map(x=>(x, Seq(s))))
//j will be (a, ((a,b,c,d),3)...
val j = g1.join(weights)
//k will be ((a,b,c,d), 3), ((a,b,c,d),2) ...
val k = j.map(x=>(x._2._1, x._2._2))
//l will be ((a,b,c,d), (3,2,5,1))...
val l = k.groupByKey()
//filter by sum the 2nd
val m = l.filter(x=>{var sum = 0; x._2.foreach(a=> {sum=sum+a});sum > 12})
//we only need the original list
val result=m.map(x=>x._1)
//don't do this in real product, otherwise, all results go to driver.instead using saveAsTextFile, etc
scala> result.foreach(println)
List(e,g)
List(b,c,e)
The "java out of memory" error is coming because spark uses its spark.default.parallelism property while determining number of splits, which by default is number of cores available.
// From CoarseGrainedSchedulerBackend.scala
override def defaultParallelism(): Int = {
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
When the input becomes large, and you have limited memory, you should increase number of splits.
You can do something as follows:
val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g")
val splitSize = 10000 // specify some number of elements that fit in memory.
val numSplits = (input.size / splitSize) + 1 // has to be > 0.
val groups = sc.parallelize(input, numSplits) // specify the # of splits.
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val result = groups.filter(isHeavy)
You may also consider increasing executor memory size using spark.executor.memory.