How to divide an RDD into multiple RDDs according to every element of the parent RDD - scala

I want to find a way to divide a parent RDD (fatherRDD) into multiple RDDs, one per element.
For example, the elements of fatherRDD are lists. I want to split this fatherRDD into many small RDDs, one based on each element. In other words, if there are n elements in fatherRDD, I want to get n RDDs.
Two days ago, I wrote a function like this:
def splitRDD(rdd1: RDD[List[(String, String)]]): List[RDD[(String, String)]] = {
  var list = List[RDD[(String, String)]]()
  // println(rdd1.take(1).apply(0).apply(0)._1)
  rdd1.foreach(x => {
    list = sc.makeRDD(x) :: list
  })
  list
}
I think the problem is that I cannot use sc.makeRDD(x) here (the SparkContext is not available inside foreach, which runs on the executors). So how do I divide an RDD into multiple RDDs according to each element of the parent RDD?

As per your description, it should look like this:
def splitRDD(rdd1: RDD[List[(String, String)]]): List[RDD[(String, String)]] =
  rdd1.collect().toList.map(x => makeRdd(x))

def makeRdd(ls: List[(String, String)]): RDD[(String, String)] = sc.parallelize(ls)
Try this out on your data. Is that what you want?
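A minimal usage sketch, assuming the two functions above are in scope (the three small lists below are made-up sample data, not from the question):

val fatherRDD: RDD[List[(String, String)]] =
  sc.parallelize(Seq(
    List(("a", "1"), ("b", "2")),
    List(("c", "3")),
    List(("d", "4"), ("e", "5"))
  ))

// splitRDD collects to the driver, so this only suits data that fits in driver memory
val children: List[RDD[(String, String)]] = splitRDD(fatherRDD)
children.foreach(r => println(r.collect().mkString(", ")))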

Related

Passing parameters to foldLeft scala

I am trying to use Scala's foldLeft function to compare a Stream with a List of Lists.
This is a snippet of the code I have
def freqParse(pairs: (Pair, Int, String), record: String): (Pair, Int, String) = {
  val m: Pair = ("", "")
  val t: FreqPairs = Map((m, 0))
  (pairs._1, pairs._2, pairs._3)
}

val freqItems = items.map(v => v._1).toList
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y))) // cross product to get pairs of frequent items
val freq = lines.foldLeft(pairs, 0, delim)(freqParse)
cross is basically a list where each element is a pair of strings: List((a,b), (a,c), ...).
lines is an input stream with a record per line (25902 lines in total).
I want to count how many times each pair (or each element in cross) occurs in the entirety of the stream. Essentially comparing all elements in cross to all elements in lines.
I decided foldLeft because that way I can take each line in the stream and split it at a delimiter and then check if both the elements in cross appear or not.
I am able to split each record in lines but I don't know how to pass the cross variable to the function to begin the comparison.
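One common way to pass cross (a sketch, not from the thread; the Pair alias, the delim handling, and the counting logic are my assumptions) is to curry the extra parameters into the function you hand to foldLeft, so that what remains has the (B, A) => B shape foldLeft expects:

type Pair = (String, String)

def freqParse(cross: List[Pair], delim: String)
             (counts: Map[Pair, Int], record: String): Map[Pair, Int] = {
  val items = record.split(delim).toSet
  cross.foldLeft(counts) { (acc, pair) =>
    // increment the pair's count when both of its elements appear on this line
    if (items.contains(pair._1) && items.contains(pair._2))
      acc.updated(pair, acc.getOrElse(pair, 0) + 1)
    else acc
  }
}

// usage: partially apply the extra parameters, leaving a function for foldLeft
val counts = lines.foldLeft(Map.empty[Pair, Int])(freqParse(cross, delim))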

Sorting an RDD in Spark

I have a dataset listing general items bought by customers. Each record in the csv lists items purchased by a customer, from left to right. For example (shortened sample):
Bicycle, Helmet, Gloves
Shoes, Jumper, Gloves
Television, Hat, Jumper, Playstation 5
I am looking to put this in an RDD in scala, and perform counts on them.
case class SalesItemSummary(SalesItemDesc: String, SalesItemCount: String)
val rdd_1 = sc.textFile("Data/SalesItems.csv")
val rdd_2 = rdd_1.flatMap(line => line.split(",")).countByValue()
Above is a short code sample. The first line is the case class (not used yet).
Line two grabs the data from the csv and puts it in an rdd_1. Easy enough.
Line three does a flatMap, splits the data on the comma, and then counts each value. So, for example, "Gloves" and "Jumper" above would have the number 2 beside them, and the others 1, in what looks like a collection of tuples.
So far so good.
Next, I want to sort rdd_2 to list the top 3 most purchased items.
Can I do this with RDD? Or do I need to transfer RDD into a dataframe to achieve sort?
If so, how do I do it?
How do I apply the case class in line 1 for example to rdd_2, which seems to be a list of tuples? Should I take this approach?
Thanks in advance
The count in the case class should be an integer... and if you want to keep the results as an RDD, I'd suggest using reduceByKey rather than countByValue, which returns a Map[String, Long] rather than an RDD.
Also, I'd suggest splitting by ", " rather than "," to avoid leading spaces in the item names.
case class SalesItemSummary(SalesItemDesc: String, SalesItemCount: Int)

val rdd_1 = sc.textFile("Data/SalesItems.csv")
val rdd_2 = rdd_1.flatMap(_.split(", "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(line => SalesItemSummary(line._1, line._2))

rdd_2.collect()
// Array[SalesItemSummary] = Array(SalesItemSummary(Gloves,2), SalesItemSummary(Shoes,1), SalesItemSummary(Television,1), SalesItemSummary(Bicycle,1), SalesItemSummary(Helmet,1), SalesItemSummary(Hat,1), SalesItemSummary(Jumper,2), SalesItemSummary(Playstation 5,1))
To sort the RDD, you can use sortBy:
val top3 = rdd_2.sortBy(_.SalesItemCount, false).take(3)
top3
// Array[SalesItemSummary] = Array(SalesItemSummary(Gloves,2), SalesItemSummary(Jumper,2), SalesItemSummary(Shoes,1))
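As a side note (not part of the original answer), takeOrdered can fetch the top 3 without sorting the whole RDD, given an explicit ordering on the count:

// negate the count so the "smallest" 3 under this ordering are the most purchased
val top3 = rdd_2.takeOrdered(3)(Ordering.by[SalesItemSummary, Int](-_.SalesItemCount))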

Compare two different RDDs using scala

I have two RDDs: one from the HDFS file system and the other created from a string, as shown below:
val txt = sc.textFile("/tmp/textFile.txt")
val str = "This\nfile is\nallowed"
val strRDD = sc.parallelize(List(str))
Now, I want to compare the data in these two RDDs:
txt.subtract(strRDD)
OR
strRDD.subtract(txt)
The result should be an empty RDD, but that is not the case. Can someone please explain how I should compare the data of these two RDDs?
The values of the two RDDs you've created look the same but are not. This is evident if you count the elements in both RDDs:
txt.collect().count(!_.isEmpty)
//res0: Int = 3
strRDD.collect().count(!_.isEmpty)
//res1: Int = 1
The result should be an empty RDD but that is not the case.
That's the reason the results of txt.subtract(strRDD) and strRDD.subtract(txt) are not the same:
val txt = sc.textFile("/tmp/textFile.txt") gives each line as a separate element in the txt RDD
val str = "This\nfile is\nallowed"
val strRDD = sc.parallelize(List(str)) gives one single \n-separated element in the strRDD RDD
I hope the explanation is clear.
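A possible fix (a sketch, not part of the original answer) is to split the string on newlines before parallelizing, so both RDDs hold one line per element:

// make strRDD line-oriented so subtract compares like with like
val strRDD2 = sc.parallelize(str.split("\n").toList)
txt.subtract(strRDD2).collect()  // empty if the file contains exactly these lines
strRDD2.subtract(txt).collect()  // likewise empty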

Can only zip RDDs with same number of elements in each partition despite repartition

I load a dataset
val data = sc.textFile("/home/kybe/Documents/datasets/img.csv",defp)
I want to put an index on this data, like so:
val nb = data.count.toInt
val tozip = sc.parallelize(1 to nb).repartition(data.getNumPartitions)
val res = tozip.zip(data)
Unfortunately I get the following error:
Can only zip RDDs with same number of elements in each partition
How can I control the number of elements per partition, if that is possible?
Why doesn't it work?
The documentation for zip() states:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
So we need to make sure we meet 2 conditions:
both RDDs have the same number of partitions
respective partitions in those RDDs have exactly the same size
You make sure you have the same number of partitions with repartition(), but Spark doesn't guarantee that you will have the same distribution in each partition for each RDD.
Why is that?
Because there are different types of RDDs, and most of them have different partitioning strategies! For example:
ParallelCollectionRDD is created when you parallelise a collection with sc.parallelize(collection). It sees how many partitions there should be, checks the size of the collection, and calculates the step size. E.g. if you have 15 elements in the list and want 4 partitions, the first 3 partitions will have 4 consecutive elements each and the last one will have the remaining 3.
HadoopRDD, if I remember correctly, has one partition per file block. Even though you are using a local file, Spark internally first creates this kind of RDD when reading it, and then maps that RDD, since it is a pair RDD of <Long, Text> and you just want String :-)
etc. etc.
In your example Spark internally creates different types of RDDs (CoalescedRDD and ShuffledRDD) while doing the repartitioning, but I think you get the general idea that different RDDs have different partitioning strategies :-)
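To see the resulting partition sizes directly (a quick illustration, not from the original answer), glom() exposes each partition as an array:

// glom() groups each partition's elements into an array, making sizes visible
val parts = sc.parallelize(1 to 15, 4).glom().collect()
parts.map(_.length)  // e.g. Array(3, 4, 4, 4); exact sizes depend on Spark's slicing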
Notice that the last part of the zip() doc mentions the map() operation. This operation does not repartition the data, as it's a narrow transformation, so it would preserve both conditions.
Solution
In this simple example, as mentioned, you can simply use data.zipWithIndex. If you need something more complicated, then the new RDD for zip() should be created with map() as mentioned above.
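A minimal sketch of the zipWithIndex approach (note it yields Long indices rather than Int, and the + 1 below merely mirrors the question's 1 to nb):

// zipWithIndex yields (element, index) pairs without requiring equal partition sizes
val res = data.zipWithIndex().map { case (line, i) => (i + 1, line) }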
I solved this by creating an implicit helper like so:
import scala.reflect.ClassTag
import org.apache.spark.rdd.{PairRDDFunctions, RDD}

implicit class RichContext[T](rdd: RDD[T]) {
  def zipShuffle[A](other: RDD[A])(implicit kt: ClassTag[T], vt: ClassTag[A]): RDD[(T, A)] = {
    // key both RDDs by element index, join on the index, then drop it
    val otherKeyed: RDD[(Long, A)] = other.zipWithIndex().map { case (n, i) => i -> n }
    val thisKeyed: RDD[(Long, T)] = rdd.zipWithIndex().map { case (n, i) => i -> n }
    new PairRDDFunctions(thisKeyed).join(otherKeyed).map(_._2)
  }
}
Which can then be used like
val rdd1 = sc.parallelize(Seq(1,2,3))
val rdd2 = sc.parallelize(Seq(2,4,6))
val zipped = rdd1.zipShuffle(rdd2) // Seq((1,2),(2,4),(3,6))
NB: Keep in mind that the join will cause a shuffle.
The following provides a Python answer to this problem by defining a custom_zip method:
Can only zip with RDD which has the same number of partitions error

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super easy, but I cannot figure out how.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: Enlightened by the answer, I'll explain my use case in more depth. I have N (unknown at compile time) different pair RDDs (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?
First of all, if you just want to sum all the integers except the first one, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want access to the index of elements, you can use the RDD zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
val result = (indexed.first()._1, indexed.filter(_._2 != 0))
But in your case this feels like overkill.
One more thing I would add: it looks like questionable design to put the key as the first element of your RDD. Why not instead use pairs (key, rdd) in your driver program? It's quite hard to reason about the order of elements in an RDD, and I can't think of a natural situation in which the key is computed as the first element of an RDD (of course, I don't know your use case, so I can only guess).
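For instance, a hypothetical sketch of that design, with the key kept in the driver and only the values in the RDD:

// the key stays in the driver; the RDD holds only the values to be summed
val keyedValues = ("someKey", sc.parallelize(List(1, 2, 3)))
val total = keyedValues._2.sum()  // 6.0, with no key to exclude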
EDIT
If you have one RDD of key-value pairs and you want to sum them by key, just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can union them before counting:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0 :: list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever it comes up at runtime. This is because reduceByKey returns an RDD of the same type, so you can iterate this operation (of course, you will have to consider performance).