transform rdd into pairRDD - scala

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super easy, but I cannot figure out how to do it.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: Enlightened by the answer, I will explain my use case in more depth. I have N (unknown at compile time) different pairRDDs of (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?

First of all, if you just want to sum all the integers except the first one, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want access to the index of each element, you can use the RDD's zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
// zipWithIndex yields (element, index) pairs, so the key is the element at index 0
val result = (indexed.first()._1, indexed.filter(_._2 != 0))
But in your case this feels like overkill.
One more thing I would add: putting the key as the first element of your RDD looks like a questionable design. Why not just use pairs (key, rdd) in your driver program instead? It is quite hard to reason about the order of elements in an RDD, and I cannot think of a natural situation in which the key is computed as the first element of an RDD (of course I don't know your use case, so I can only guess).
EDIT
If you have one RDD of key-value pairs and you want to sum the values by key, then just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can union them before reducing:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0::list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever it comes up at runtime. This works because reduceByKey returns an RDD of the same type, so you can iterate this operation (of course you will have to consider performance).
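For the transformation asked about in the question itself, here is a minimal sketch. It assumes each record arrives as a non-empty Seq[Int] whose first element is the key; the question does not say how the record is represented, so that type is an assumption, and toPair and sumValues are hypothetical helper names. It uses plain Scala collections so it runs without a cluster; the same functions could be passed to rdd.map in spark-shell.

```scala
// Hypothetical record layout: the first element is the key, the rest are values.
// Assumes the record is non-empty (head would throw otherwise).
def toPair(record: Seq[Int]): (Int, Seq[Int]) = (record.head, record.tail)

// Sum the values but not the key.
def sumValues(record: Seq[Int]): (Int, Int) = {
  val (key, values) = toPair(record)
  (key, values.sum)
}

val records = Seq(Seq(7, 1, 2, 3), Seq(9, 10, 20))
val summed = records.map(sumValues)
```

Splitting with head and tail keeps the record length dynamic, which is what the question asks for.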

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
Seq("C","B","F","K"),
Seq("B","A","Z","M"),
Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
.map( _.sortBy(x => {
val index = ordering_br.value.indexOf(x)
if(index == -1) Int.MaxValue else index
}))
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all elements that are not found would end up at the beginning. I understand that you want them at the end, so I replace -1 with a large number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
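A possible variation on the answer above, sketched with plain Scala collections: indexOf scans the ordering for every element, so for a longer ordering it may be cheaper to precompute a lookup table once. The names rank and sortByRank are hypothetical, and the sketch assumes the ordering contains no duplicates. In Spark, the rank map would be the value to broadcast instead of the raw array.

```scala
val ordering = Seq("A", "B", "C", "D")

// Precompute element -> position once, instead of calling indexOf per element.
val rank: Map[String, Int] = ordering.zipWithIndex.toMap

// Elements missing from the ordering get Int.MaxValue and therefore sort last.
def sortByRank(xs: Seq[String]): Seq[String] =
  xs.sortBy(x => rank.getOrElse(x, Int.MaxValue))

val sorted = Seq(
  Seq("C", "B", "F", "K"),
  Seq("B", "A", "Z", "M"),
  Seq("X", "T", "D", "C")
).map(sortByRank)
```

Because sortBy is stable, elements that are not in the ordering keep their original relative order at the end of each list.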

Combine two different RDDs with different key in Scala

I have two text files already loaded as RDDs by the SparkContext.
One of them (rdd1) stores related words:
apple,apples
car,cars
computer,computers
The other one (rdd2) stores the number of items:
(apple,12)
(apples, 50)
(car,5)
(cars,40)
(computer,77)
(computers,11)
I want to combine those two RDDs.
Desired output:
(apple, 62)
(car,45)
(computer,88)
How do I code this?
The meat of the work is to pick a key for the related words. Here I just select the first word but really you could do something more intelligent than just picking a random word.
Explanation:
Create the data
Pick a key for related words
Flatmap the tuples to enable us to join on the key we picked.
Join the RDDs
Map the RDD back into a tuple
Reduce by Key
val s = Seq(("apple","apples"),("car","cars")) // create data
val rdd = sc.parallelize(s)
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40))// create data
val rdd2 = sc.parallelize(t)
val keyed = rdd.flatMap( {case(a,b) => Seq((a, a),(b,a)) } ) // could be replaced with any function that selects the key to use for all of the related words
.join(rdd2) // complete the join
.map({case (_, (a ,b)) => (a,b) }) // recreate a tuple and throw away the related word
.reduceByKey(_ + _)
keyed.foreach(println) // to show it works
Even though this solves your problem, there are more elegant solutions using DataFrames that you may wish to look into. You could also use reduce directly on the RDD and skip the step of mapping back to a tuple. I think that would be a better solution, but I wanted to keep it simple so that it was more illustrative of what I did.
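As an alternative to the join, and only as a sketch: if the related-words table is small, it could be turned into a lookup Map (and broadcast in Spark) so that each count is re-keyed directly. This is a different technique from the join shown above, illustrated here with plain Scala collections; canonical and totals are hypothetical names.

```scala
val related = Seq(("apple", "apples"), ("car", "cars"), ("computer", "computers"))
val counts = Seq(("apple", 12), ("apples", 50), ("car", 5), ("cars", 40),
                 ("computer", 77), ("computers", 11))

// Map every word (including the chosen key itself) to the key for its group.
val canonical: Map[String, String] =
  related.flatMap { case (a, b) => Seq(a -> a, b -> a) }.toMap

// Re-key each count by its group key; words without a mapping are dropped,
// which matches the behaviour of the inner join.
val totals: Map[String, Int] =
  counts
    .flatMap { case (word, n) => canonical.get(word).map(_ -> n) }
    .groupBy(_._1)
    .map { case (word, pairs) => word -> pairs.map(_._2).sum }
```

With RDDs, the groupBy and sum at the end would correspond to reduceByKey(_ + _).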

Getting values of keys from a rdd of maps in scala

I have an RDD which has maps as its elements. I cannot use RDD.get, of course. So, as of now, I do the following to get values for keys from this map:
val x = RDD.collect().flatten.toMap
and then
x.get(key)
to get the value for the key. Now I have a really big RDD, and applying .collect() to it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. How can I do this without applying .collect() to the RDD?
If the elements truly are Maps, then you can do the following:
rdd.flatMap(identity).lookup(key)
This will still return results to the driver, but only the values for that key. So if those fit in memory, you are fine. But if you still want to work with the result as an RDD, then:
rdd.flatMap(identity)
.flatMap{case (key, value) => if(key == myKey) Some(value) else None}
And should you want both the key and the value, you can turn the second flatMap into a filter on key == myKey.
Since you can't fit everything onto your driver, you first need to filter the RDD for the map you need to look into, and then do the get...
val rdd = sc.parallelize(List(Map("a"->1,"b"->2),Map("c"->3,"d"->4)))
val key = "d"
val filteredRDD = rdd.filter(_.keySet contains key)
if (!filteredRDD.isEmpty) filteredRDD.first.get(key) else None
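Both answers come down to looking the key up inside each map rather than merging all the maps first. A minimal local sketch of that idea follows, using plain Scala collections and the hypothetical name valuesFor. The third map is an assumption added to show that a key may occur in more than one element, a case that collect().flatten.toMap would silently collapse to a single value.

```scala
val maps = List(Map("a" -> 1, "b" -> 2), Map("c" -> 3, "d" -> 4), Map("d" -> 5))

// Map.get returns an Option, so flatMap keeps only the maps that contain the key.
def valuesFor(key: String): List[Int] = maps.flatMap(_.get(key))

val found = valuesFor("d")
val missing = valuesFor("z")
```

The same per-element lookup should carry over to an RDD of maps, where only the matching values, and not every map, would need to reach the driver.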

Merge multiple RDD generated in loop

I am calling a function in scala which gives an RDD[(Long,Long,Double)] as its output.
def helperfunction(): RDD[(Long, Long, Double)]
I call this function in loop in another part of the code and I want to merge all the generated RDDs. The loop calling the function looks something like this
for (i <- 1 to n){
val tOp = helperfunction()
// merge the generated tOp
}
What I want to do is something similar to what StringBuilder would do for you in Java when you wanted to merge the strings. I have looked at techniques of merging RDDs, which mostly point to using union function like this
RDD1.union(RDD2)
But this requires both RDDs to be generated before taking their union. I thought of initializing a var RDD1 outside the for loop to accumulate the results, but I am not sure how to initialize an empty RDD of type [(Long,Long,Double)]. Also, I am just starting out with Spark, so I am not even sure whether this is the most elegant way to solve this problem.
Instead of using vars, you can use a functional style to achieve what you want:
val rdd = (1 to n).map(x => helperfunction()).reduce(_ union _)
Also, if you still need to create an empty RDD, you can do it using:
val empty = sc.emptyRDD[(Long, Long, Double)]
You're correct that this might not be the optimal way to do this, but we would need more info on what you're trying to accomplish with generating a new RDD with each call to your helper function.
You could define one RDD prior to the loop, assign it to a var, and then update it inside your loop. Here's an example:
val rdd = sc.parallelize(1 to 100)
val rdd_tuple = rdd.map(x => (x.toLong, (x*10).toLong, x.toDouble))
var new_rdd = rdd_tuple
println("Initial RDD count: " + new_rdd.count())
for (i <- 2 to 4) {
new_rdd = new_rdd.union(rdd_tuple)
}
println("New count after loop: " + new_rdd.count())
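One detail worth noting about the reduce-based approach: reduce throws on an empty collection, so it fails when n is 0. Folding from an empty starting value avoids that. The sketch below shows the pattern with plain Scala sequences; helperBatch is a hypothetical stand-in for helperfunction(), and with RDDs the empty start would be sc.emptyRDD and ++ would be union. SparkContext also offers a union method that takes a whole sequence of RDDs, which may be preferable to a long chain of pairwise unions.

```scala
// Hypothetical stand-in for helperfunction(): each call yields one batch.
def helperBatch(i: Int): Seq[(Long, Long, Double)] =
  Seq((i.toLong, i.toLong * 10, i.toDouble))

// Accumulate all batches, starting from an empty sequence so n = 0 is safe.
def mergeAll(n: Int): Seq[(Long, Long, Double)] =
  (1 to n)
    .map(helperBatch)
    .foldLeft(Seq.empty[(Long, Long, Double)])(_ ++ _)

val merged = mergeAll(3)
```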

Apache Spark's RDD splitting according to the particular size

I am trying to read strings from a text file, but I want to limit each line to a particular size. For example, here is the content of the file:
aaaaa\nbbb\nccccc
When I read this file with sc.textFile, the RDD looks like this:
scala> val rdd = sc.textFile("textFile")
scala> rdd.collect
res1: Array[String] = Array(aaaaa, bbb, ccccc)
But I want to limit the size of each element of this RDD. For example, if the limit is 3, then I should get something like this:
Array[String] = Array(aaa, aab, bbc, ccc, c)
What is the best-performing way to do that?
Not a particularly efficient solution (not terrible either) but you can do something like this:
val pairs = rdd
.flatMap(x => x) // Flatten
.zipWithIndex // Add indices
.keyBy(_._2 / 3) // Key by index / n
// We'll use a range partitioner to minimize the shuffle
// (requires import org.apache.spark.RangePartitioner)
val partitioner = new RangePartitioner(pairs.partitions.size, pairs)
pairs
.groupByKey(partitioner) // group
// Sort, drop index, concat
.mapValues(_.toSeq.sortBy(_._2).map(_._1).mkString(""))
.sortByKey()
.values
It is possible to avoid the shuffle by explicitly passing the data required to fill the partitions, but it takes some effort to code. See my answer to Partition RDD into tuples of length n.
If you can accept some misaligned records on partition boundaries, then a simple mapPartitions with grouped should do the trick at a much lower cost:
rdd.mapPartitions(_.flatMap(x => x).grouped(3).map(_.mkString("")))
It is also possible to use the sliding RDD from MLlib (this requires import org.apache.spark.mllib.rdd.RDDFunctions._):
rdd.flatMap(x => x).sliding(3, 3).map(_.mkString(""))
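To make the intended result concrete, here is a small local sketch of the chunking that the snippets above aim for, written with plain Scala strings rather than RDDs. The helper name chunk is hypothetical, and n is assumed to be positive.

```scala
// Concatenate all lines, then cut the result into pieces of at most n characters.
def chunk(lines: Seq[String], n: Int): List[String] =
  lines.mkString.grouped(n).toList

val chunks = chunk(Seq("aaaaa", "bbb", "ccccc"), 3)
```

Only the last piece can be shorter than n, which is why the distributed versions have to move characters across partition boundaries to reproduce this exactly.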
You will need to read all the data anyhow. There is not much you can do apart from mapping over each line and trimming it:
rdd.map(line => line.take(3)).collect()