Getting values of keys from a rdd of maps in scala - scala

I have an RDD which has maps as its elements. I cannot use RDD.get, of course. So, as of now, I do the following to get values for keys from this map:
val x = RDD.collect().flatten.toMap
and then
x.get(key)
to get the value for the key. Now, have a really big rdd which outputs the error java.lang.OutOfMemoryError: GC overhead limit exceeded as I am applying .collect() on the rdd. How can I do it without applying .collect() on the rdd?

If it is truly Maps then you can do the following:
rdd.flatMap(identity).lookup(key)
Although this will still output to the driver, but only the values from that key. So, if that can fit in memory then you are good with that. But if you want to work with it as an rdd still then:
rdd.flatMap(identity)
.flatMap{case (key, value) => if(key == myKey) Some(value) else None}
And should you want key AND value then you can turn the flatMap into a filter and just filter on key == myKey

Since you can't fit everything onto your driver, you first need to filter the RDD for the map you need to look into, and then do the get...
val rdd = sc.parallelize(List(Map("a"->1,"b"->2),Map("c"->3,"d"->4)))
val key = "d"
val filteredRDD = rdd.filter(_.keySet contains key)
if (!filteredRDD.isEmpty) filteredRDD.first.get(key) else None

Related

Combine two different RDDs with different key in Scala

I have two text file already create as rdd by sparkcontext.
one of them(rdd1) saves related words:
apple,apples
car,cars
computer,computers
Another one(rdd2) saves number of items:
(apple,12)
(apples, 50)
(car,5)
(cars,40)
(computer,77)
(computers,11)
I want to combine those two rdds
disire output:
(apple, 62)
(car,45)
(computer,88)
How to code this?
The meat of the work is to pick a key for the related words. Here I just select the first word but really you could do something more intelligent than just picking a random word.
Explanation:
Create the data
Pick a key for related words
Flatmap the tuples to enable us to join on the key we picked.
Join the RDDs
Map the RDD back into a tuple
Reduce by Key
val s = Seq(("apple","apples"),("car","cars")) // create data
val rdd = sc.parallelize(s)
val t = Seq(("apple",12),("apples", 50),("car",5),("cars",40))// create data
val rdd2 = sc.parallelize(t)
val keyed = rdd.flatMap( {case(a,b) => Seq((a, a),(b,a)) } ) // could be replace with any function that selects the key to use for all of the related words
.join(rdd2) // complete the join
.map({case (_, (a ,b)) => (a,b) }) // recreate a tuple and throw away the related word
.reduceByKey(_ + _)
.foreach(println) // to show it works
Even though this solves your problem there are more elegant solutions that you could use with Dataframes you may wish to look into. You could use reduce directly on RDD and skip the step of mapping back to a tuple. I think that would be a better solution but wanted to keep it simple so that it was more illustrative of what I did.

Apache Spark's RDD splitting according to the particular size

I am trying to read strings from a text file, but I want to limit each line according to a particular size. For example;
Here is my representing the file.
aaaaa\nbbb\nccccc
When trying to read this file by sc.textFile, RDD would appear this one.
scala> val rdd = sc.textFile("textFile")
scala> rdd.collect
res1: Array[String] = Array(aaaaa, bbb, ccccc)
But I want to limit the size of this RDD. For example, if the limit is 3, then I should get like this one.
Array[String] = Array(aaa, aab, bbc, ccc, c)
What is the best performance way to do that?
Not a particularly efficient solution (not terrible either) but you can do something like this:
val pairs = rdd
.flatMap(x => x) // Flatten
.zipWithIndex // Add indices
.keyBy(_._2 / 3) // Key by index / n
// We'll use a range partitioner to minimize the shuffle
val partitioner = new RangePartitioner(pairs.partitions.size, pairs)
pairs
.groupByKey(partitioner) // group
// Sort, drop index, concat
.mapValues(_.toSeq.sortBy(_._2).map(_._1).mkString(""))
.sortByKey()
.values
It is possible to avoid the shuffle by passing data required to fill the partitions explicitly but it takes some effort to code. See my answer to Partition RDD into tuples of length n.
If you can accept some misaligned records on partitions boundaries then simple mapPartitions with grouped should do the trick at much lower cost:
rdd.mapPartitions(_.flatMap(x => x).grouped(3).map(_.mkString("")))
It is also possible to use sliding RDD:
rdd.flatMap(x => x).sliding(3, 3).map(_.mkString(""))
You will need to read all the data anyhow. Not much you can do apart from mapping each line and trim it.
rdd.map(line => line.take(3)).collect()

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Lets say I had an RDD with key/value pairs of type (Int->T). I eventually need to say “compare all values of key 1 with all values of key 2 and compare values of key 3 to the values of key 5 and key 7”, how would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian()
def filterSubset[T] = (b:Int, r:RDD[(Int, T)]) => r.filter{case(name, _) => name == b}
Val keyPairs:(Int, Int) // all key pairs
Val rddPairs = keyPairs.map{
case (a, b) =>
filterSubset(a,r).cartesian(filterSubset(b,r))
}
rddPairs.map{whatever I want to compare…}
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can’t tell about this idea is whether it would be extremely inefficient to set up possibly of hundreds of map jobs and then iterate through them. In this case, would the lazy valuation in spark optimize the data shuffling between all of the maps? If not, can someone please recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect key pairs you want to compare. Lets start with creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
.zipWithIndex.groupBy(_._1)
.map{case (k, vs) => (k -> vs.map(_._2))}
val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
rdd.flatMap{case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v)))}
}
val leftRDD = mapAndReplicate(rddPairs, left)
val rightRDD = mapAndReplicate(rddPairs, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap{case (xs, ys) => for {
(kx, vx) <- xs
(ky, vy) <- ys
if cosineSimilarity(vx, vy) <= threshold
} yield ((kx, vx), (ky, vy)) }
Obviously in the current form this approach is limited. It assumes that values for arbitrary pair of keys can fit into memory and require a significant amount of network traffic. Still it should give you some idea how to proceed.
Another possible approach is to store data in the external system (for example database) and fetch required key-value pairs on demand.
Since you're trying to find similarity between elements I would also consider completely different approach. Instead of naively comparing key-by-key I would try to partition data using custom partitioner which reflects expected similarity between documents. It is far from trivial in general but should give much better results.
Using Dataframe you can easily do the cartesian operation using join:
dataframe1.join(dataframe2, dataframe1("key")===dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create an Dataframe, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes

How to get a subset of a RDD?

I am new to Spark. If I have a RDD consists of key-value pairs, what is the efficient way to return a subset of this RDD containing the keys that appear more than a certain times in the original RDD?
For example, if my original data RDD is like this:
val dataRDD=sc.parallelize(List((1,34),(5,3),(1,64),(3,67),(5,0)),3)
I want to get a new RDD, in which the keys appear more than once in dataRDD. The newRDD should contains these tuples: (1,34),(5,3),(1,64),(5,0). How can I get this new RDD? Thank you very much.
Count keys and filter infrequent:
val counts = dataRDD.keys.map((_, 1)).reduceByKey(_ + _)
val infrequent = counts.filter(_._2 == 1)
If number of infrequent values is to large to be handled in memory you can use PairRDDFunctions.subtractByKey:
dataRDD.subtractByKey(infrequent)
otherwise a broadcast variable:
val infrequentKeysBd = sc.broadcast(infrequent.keys.collect.toSet)
dataRDD.filter{ case(k, _) => !infrequentKeysBd.value.contains(k)}
If number of frequent keys is very low you can filter frequent keys and use a broadcast variable as above:
val frequent = counts.filter(_._2 > 1)
val frequentKeysBd = ??? // As before
dataRDD.filter{case(k, _) => frequentKeysBd.value.contains(k)}

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super-easy but I cannot get how to.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT enlightened by the answer I explain my use case deeplier. I have N (unknown at compile time) different pairRDD (key, value), that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking?
First of all if you just wanna sum all integers but first the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.sum()
val result = rdd.count - first
On the other hand if you want to have access to the index of elements you can use rdd zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
val result = (indexed.first()._2, indexed.filter(_._1 != 1))
But in your case this feels like overkill.
One more thing i would add, this looks like questionable desine to put key as first element of your rdd. Why not just instead use pairs (key, rdd) in your driver program. Its quite hard to reason about order of elements in rdd and i cant not think about natural situation in witch key is computed as first element of rdd (ofc i dont know your usecase so i can only guess).
EDIT
If you have one rdd of key value pairs and you want to sum them by key then do just:
val result = rdd.reduceByKey(_ + _)
If you have many rdds of key value pairs before counting you can just sum them up
val list = List(pairRDD0, pairRDD1, pairRDD2)
//another pairRDD arives in runtime
val newList = anotherPairRDD0::list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
//another pairRDD arives in runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited example. As you can see you can add additional rdd when every it comes up in runtime. This is because reduceByKey returns rdd of the same type so you can iterate this operation (Ofc you will have to consider performence).