Compare two different RDDs using scala - scala

I have two RDDs: one read from the HDFS file system and the other created from a string, as shown below.
val txt=sc.textFile("/tmp/textFile.txt")
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str))
Now, I want to compare the data in these two RDDs:
txt.subtract(strRDD)
OR
strRDD.subtract(txt)
The result should be an empty RDD but that is not the case. Can someone please explain how I should compare the data of these two RDDs?

The two RDDs that you've created look like they hold the same values, but they do not. This becomes evident if you count the elements in each RDD:
txt.collect().count(!_.isEmpty)
//res0: Int = 3
strRDD.collect().count(!_.isEmpty)
//res1: Int = 1
That is why txt.subtract(strRDD) and strRDD.subtract(txt) do not return an empty RDD:
val txt=sc.textFile("/tmp/textFile.txt") puts each line of the file in a separate element of the txt RDD, whereas
val str="This\nfile is\nallowed"
val strRDD=sc.parallelize(List(str)) puts the whole \n-separated string in a single element of the strRDD RDD.
I hope the explanation is clear.
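One way to make the comparison behave as expected is to split the string on newlines, so that both RDDs hold one line per element. The following is only a minimal sketch: it assumes the file contains exactly the three lines of the string, and it replaces sc.textFile with a parallelized stand-in so that it runs without HDFS. The local SparkSession and the names onlyInTxt and onlyInStr are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("compare-rdds").getOrCreate()
val sc = spark.sparkContext

// Stand-in for sc.textFile("/tmp/textFile.txt"); assumes the file holds these three lines.
val txt = sc.parallelize(Seq("This", "file is", "allowed"))

val str = "This\nfile is\nallowed"
// Split on newlines so each line becomes its own element, as textFile does.
val strRDD = sc.parallelize(str.split("\n").toSeq)

// Both differences are empty when the two RDDs hold the same distinct lines.
val onlyInTxt = txt.subtract(strRDD).collect()
val onlyInStr = strRDD.subtract(txt).collect()
```

Note that subtract only tells you whether the same distinct values occur on both sides; it does not compare line order or how often a line is repeated.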

Related

Converting literal to RDD for subsequent Cartesian Product

I cannot find in the documentation how the result of the following:
val DIM_Key_Max = rddA.map(x => (x._1)).max
can subsequently be converted to a single-entry RDD for JOINing with another RDD, or rather for a cartesian product.
I cannot see that anywhere. Who can help?
max returns a single object. To turn it into a single entry RDD, use parallelize:
sc.parallelize(List(DIM_Key_Max))
This returns an RDD with a single entry that can be used e.g. as an argument to cartesian.
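As a rough illustration of that approach, the sketch below builds a single-entry RDD from the maximum key and takes its cartesian product with a second RDD. The sample data, rddB, and the local SparkSession are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("max-cartesian").getOrCreate()
val sc = spark.sparkContext

// Hypothetical inputs standing in for the RDDs in the question.
val rddA = sc.parallelize(Seq((1, "a"), (3, "b"), (5, "c")))
val rddB = sc.parallelize(Seq("x", "y"))

// max() is an action: it returns a plain Int to the driver, not an RDD.
val dimKeyMax = rddA.map(_._1).max()

// Wrap the value in a single-entry RDD so it can take part in RDD operations.
val maxRDD = sc.parallelize(Seq(dimKeyMax))

// Every element of rddB is paired with the single maximum key.
val product = maxRDD.cartesian(rddB).collect()
```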
You are getting something wrong here. max does not return an RDD that can be joined with another RDD; it returns a plain value:
val rdd=sc.parallelize(Array((1,2),(3,4),(5,6))).map(x=>x._1).max
rdd
rdd: Int = 5
rdd.getClass
res2: Class[Int] = int

How to divide a RDD into multiple RDDs according to every father RDD's element

I want to find a way to split a fatherRDD into multiple RDDs, one for each element of the fatherRDD.
For example, each element of fatherRDD is a list. I want to split this fatherRDD into many small RDDs, one per element. In other words, if there are n elements in the fatherRDD, I want to get n RDDs.
Two days ago, I wrote a function like this:
def splitRDD(rdd1:RDD[List[(String, String)]]):List[RDD[(String, String)]] ={
var list = List[RDD[(String, String)]] ()
//println(rdd1.take(1).apply(0).apply(0)._1)
rdd1.foreach(x =>{
list = sc.makeRDD(x)::list
})
list
}
I think the problem is that I cannot use sc.makeRDD(x) here. So how do I divide an RDD into multiple RDDs, one for each element of the father RDD?
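The SparkContext can only be used on the driver, so sc.makeRDD(x) cannot be called inside rdd1.foreach, which runs on the executors; changes made there to the local list also never reach the driver. Collect the elements to the driver first. As per your description, it should look like this: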
def splitRDD(rdd1:RDD[List[(String, String)]]):List[RDD[(String, String)]] = rdd1.collect().toList.map(x => makeRdd(x))
def makeRdd(ls:List[(String,String)]): RDD[(String, String)] = sc.parallelize(ls)
Try this out with your data. Is that what you want?

how to use select() and map() in spark - scala?

I'm writing code for data migration from MySQL to Cassandra using Spark. I'm trying to generalize it so that, given a conf file, it can migrate any table. I'm stuck in two places:
val dataframe2 = dataframe.select("a","b","c","d","e","f")
After loading the table from MySQL, I wish to select only a few columns. I have the names of these columns as a list. How can that list be used here?
val RDDtuple = dataframe2.map(r => (r.getAs(0), r.getAs(1), r.getAs(2), r.getAs(3), r.getAs(4), r.getAs(5)))
Here again, every table may have a different number of columns, so how can this be achieved?
To use variable number of columns in select(), your list of columns can be converted like this:
val columns = List("a", "b", "c", "d")
val dfSelectedCols = dataFrame.select(columns.head, columns.tail :_*)
Explanation: the first param in DataFrame's select(String, String...) is mandatory, so use columns.head. The remaining part of the list needs to be converted to varargs using columns.tail :_*.
It's not very clear from your example, but I suppose that x is an RDD[Row] and that you are trying to convert it into an RDD of tuples, right? Please give more details and also use meaningful variable names. x, y or z are bad choices, especially if there is no explicit typing.
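An alternative that avoids the head/tail split is to map the names to Column objects and pass them all as varargs. For the second part of the question, Row.toSeq yields the values of a row whatever the number of columns, at the cost of losing the tuple type. This is only a sketch: the sample DataFrame, the column names, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[2]").appName("select-columns").getOrCreate()
import spark.implicits._

// Hypothetical table standing in for the one loaded from MySQL.
val dataFrame = Seq((1, "x", 2.0, true), (2, "y", 3.0, false)).toDF("a", "b", "c", "d")

// The column names would normally come from the conf file.
val columns = List("a", "c")

// select(cols: Column*) accepts the whole list as varargs.
val selected = dataFrame.select(columns.map(col): _*)

// Row.toSeq works for any number of columns, so no fixed-arity tuple is needed.
val rows = selected.rdd.map(_.toSeq).collect()
```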

Can only zip RDDs with same number of elements in each partition despite repartition

I load a dataset
val data = sc.textFile("/home/kybe/Documents/datasets/img.csv",defp)
I want to add an index to this data, so I do:
val nb = data.count.toInt
val tozip = sc.parallelize(1 to nb).repartition(data.getNumPartitions)
val res = tozip.zip(data)
Unfortunately, I get the following error:
Can only zip RDDs with same number of elements in each partition
How can I change the number of elements per partition, if that is possible?
Why it doesn't work?
The documentation for zip() states:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other).
So we need to make sure we meet 2 conditions:
both RDDs have the same number of partitions
respective partitions in those RDDs have exactly the same size
With repartition() you make sure that both RDDs have the same number of partitions, but Spark doesn't guarantee that the elements are distributed across those partitions in the same way in each RDD.
Why is that?
Because there are different types of RDDs and most of them have different partitioning strategies! For example:
ParallelCollectionRDD is created when you parallelise a collection with sc.parallelize(collection). It looks at how many partitions there should be, checks the size of the collection and calculates the step size, so each partition gets a contiguous slice of roughly equal size. For example, 15 elements in 4 partitions gives slices of 3 or 4 consecutive elements.
HadoopRDD: if I remember correctly, one partition per file block. Even when you read a local file, Spark internally first creates this kind of RDD and then maps it, since it is a pair RDD of <Long, Text> and you just want String :-)
etc.etc.
In your example Spark internally does create different types of RDDs (CoalescedRDD and ShuffledRDD) while doing the repartitioning but I think you got the global idea that different RDDs have different partitioning strategies :-)
Notice that the last part of the zip() doc mentions the map() operation. This operation does not repartition the data, as it is a narrow transformation, so it guarantees both conditions.
Solution
In this simple example, as was mentioned, you can simply use data.zipWithIndex. If you need something more complicated, create the second RDD for zip() from the first one with map(), as mentioned above.
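Both options can be sketched as follows. The sample rows and the local SparkSession stand in for the CSV file from the question, and the 1-based index mirrors the 1 to nb range used there.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("zip-with-index").getOrCreate()
val sc = spark.sparkContext

// Stand-in for sc.textFile(...): five rows spread over three partitions.
val data = sc.parallelize(Seq("r1", "r2", "r3", "r4", "r5"), 3)

// zipWithIndex assigns 0-based Long indices in partition order; shift to 1-based.
val indexed = data.zipWithIndex().map { case (line, i) => (i + 1, line) }

// zip() is safe when one RDD is derived from the other with map(),
// because map() keeps the partitioning and the element count per partition.
val lengths = data.map(_.length)
val zipped = data.zip(lengths).collect()
```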
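I solved this by creating an implicit helper like so:
import scala.reflect.ClassTag
import org.apache.spark.rdd.{PairRDDFunctions, RDD}
implicit class RichContext[T](rdd: RDD[T]) {
  // Zips two RDDs by element position, regardless of how they are partitioned.
  def zipShuffle[A](other: RDD[A])(implicit kt: ClassTag[T], vt: ClassTag[A]): RDD[(T, A)] = {
    // Key each element by its index so that matching positions can be joined.
    val otherKeyed: RDD[(Long, A)] = other.zipWithIndex().map { case (n, i) => i -> n }
    val thisKeyed: RDD[(Long, T)] = rdd.zipWithIndex().map { case (n, i) => i -> n }
    // Join on the index, then drop it and keep only the paired values.
    new PairRDDFunctions(thisKeyed).join(otherKeyed).map(_._2)
  }
}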
Which can then be used like
val rdd1 = sc.parallelize(Seq(1,2,3))
val rdd2 = sc.parallelize(Seq(2,4,6))
val zipped = rdd1.zipShuffle(rdd2) // contains (1,2), (2,4), (3,6); element order is not guaranteed after the join
NB: Keep in mind that the join will cause a shuffle.
The following provides a Python answer to this problem by defining a custom_zip method:
Can only zip with RDD which has the same number of partitions error

Why printing inside foreach doesn't reflect an order of elements

Maybe I am missing something, but I expected the data to be sorted by key:
scala> val x=sc.parallelize(Array( "cat", "ant", "1"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[160] at parallelize at <console>:22
scala> val xxx=x.map(v=> (v,v.length))
xxx: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[161] at map at <console>:26
scala> xxx.sortByKey().foreach(println)
(1,1)
(cat,3)
(ant,3)
scala> xxx.sortByKey().foreach(println)
(cat,3)
(1,1)
(ant,3)
It works if I tell Spark to use only 1 partition, as below, but how do I make this work on a cluster or with more than one worker?
scala> xxx.sortByKey(numPartitions=1).foreach(println)
(1,1)
(ant,3)
(cat,3)
UPDATE:
I think I got the answer. The data is being sorted correctly, as it works when I use collect:
scala> xxx.sortByKey().collect
res170: Array[(String, Int)] = Array((1,1), (ant,3), (cat,3))
Keeping the question open to validate my understanding.
That makes sense. foreach runs in parallel across the partitions which creates non-deterministic ordering. The order may be mixed. collect gives you an array of the partitions concatenated in their sorted order.
Have a look at the Spark documentation to see why the collect() method fixed the issue for you.
e.g.
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Calling collect() on the sorted RDD returns the records to the driver as an ordered array.
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Remember doing a collect() action operation on a very large distributed RDD can cause your driver program to run out of memory and crash. So, do not use collect() except for when you are prototyping your Spark program on a small dataset.
Have a look at this article for more details
EDIT:
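sortByKey(): sorts the RDD by key, so that each partition contains a sorted range of the elements. Because the partitions are processed in parallel, possibly on different executors, foreach(println) does not print them in key order. collect() returns the partitions to the driver in order, which gives the fully sorted result.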
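To print the pairs in key order, the sorted data can be brought to the driver before printing. The sketch below reuses the data from the question with a local SparkSession, which is an assumption made so the example runs standalone. toLocalIterator is shown as an option that fetches one partition at a time instead of the whole result.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("sorted-print").getOrCreate()
val sc = spark.sparkContext

val xxx = sc.parallelize(Seq("cat", "ant", "1")).map(v => (v, v.length))
val sorted = xxx.sortByKey()

// collect() concatenates the sorted partitions in order on the driver.
val onDriver = sorted.collect()
onDriver.foreach(println)

// For larger results, toLocalIterator pulls the partitions to the driver one by one.
sorted.toLocalIterator.foreach(println)
```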