Scala: How to get the common part of two RDDs?

There are two RDDs:
val rdd1 = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2 = sc.parallelize(List(("aaa", 2), ("bbb", 5), ("ddd", 2)))
If I want to join them by the first field and get a result like:
List(("aaa", 1,2), ("bbb",4 ,5))
What should I write? Thanks!

You can join the RDDs and map the result to the desired data structure:
val resultRDD = rdd1.join(rdd2).map {
  case (k: String, (v1: Int, v2: Int)) => (k, v1, v2)
}
// resultRDD: org.apache.spark.rdd.RDD[(String, Int, Int)] = MapPartitionsRDD[53] at map at <console>:32
resultRDD.collect
// res1: Array[(String, Int, Int)] = Array((aaa,1,2), (bbb,4,5))
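Note that join is an inner join, so keys that appear in only one RDD ("ccc" and "ddd" here) are dropped, which is exactly the result the question asks for. If you ever need to keep the unmatched keys too, fullOuterJoin wraps both sides in Option; a minimal sketch (outerRDD is just an illustrative name, and the output ordering may vary):
// fullOuterJoin keeps keys from either side, wrapping values in Option
val outerRDD = rdd1.fullOuterJoin(rdd2)
outerRDD.collect
// e.g. Array((aaa,(Some(1),Some(2))), (bbb,(Some(4),Some(5))),
//            (ccc,(Some(3),None)), (ddd,(None,Some(2))))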

As both RDDs are of type RDD[(String, Int)], you can simply use join to join the two RDDs, which gives you RDD[(String, (Int, Int))]. Since you want a List[(String, (Int, Int))], you then need to collect the joined RDD (not recommended if the joined RDD is huge) and convert it to a List. Try the following code:
import org.apache.spark.rdd.RDD

val rdd1: RDD[(String, Int)] = sc.parallelize(List(("aaa", 1), ("bbb", 4), ("ccc", 3)))
val rdd2: RDD[(String, Int)] = sc.parallelize(List(("aaa", 2), ("bbb", 5), ("ddd", 2)))
//simply join two RDDs
val joinedRdd: RDD[(String, (Int, Int))] = rdd1.join(rdd2)
//only if you want List then collect it (It is not recommended for huge RDDs)
val lst: List[(String, (Int, Int))] = joinedRdd.collect().toList
println(lst)
//output
//List((bbb,(4,5)), (aaa,(1,2)))
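If the joined RDD is too big to collect in one go but you still need to iterate over it on the driver, toLocalIterator is a gentler option because it pulls only one partition at a time; a minimal sketch:
// Streams the joined RDD to the driver partition by partition,
// instead of materializing everything at once the way collect() does
joinedRdd.toLocalIterator.foreach(println)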

Related

How can I sort an RDD[(Int, (val1, val2))] by val2 when only sortByKey is available?

I have an RDD[(Int, (val1, val2))] that I want to sort by val2, but the only option that seems available is sortByKey.
Is sortBy available only in older Scala versions?
Is there an option other than collecting it to the driver?
In my code I only do:
val nonslack = slacks.filter(x=> Vlts.contains(x._1))
where Vlts is an Array[Int] and slacks is an RDD read from a file.
There is a sortBy method on RDD:
val rdd = spark.sparkContext.parallelize(Seq(("one", ("one" -> 1)), ("two", ("two" -> 2)), ("three", ("three" -> 3))))
rdd.sortBy(_._2._2).collect().foreach(println(_))
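If you really are limited to sortByKey, the usual workaround is to re-key the RDD by the value you want to sort on and restore the original shape afterwards; a minimal sketch against the same rdd as above:
// Promote the inner Int to the key, sort by it, then restore the shape
rdd.map { case (k, (v1, v2)) => (v2, (k, v1)) }
  .sortByKey()
  .map { case (v2, (k, v1)) => (k, (v1, v2)) }
  .collect()
  .foreach(println)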

Append a row to a pair RDD in Spark

I have a pair RDD of existing values, such as:
(1,2)
(3,4)
(5,6)
I want to append a row (7,8) to the same RDD.
How can I append to the same RDD in Spark?
You can use the union operation.
scala> val rdd1 = sc.parallelize(List((1,2), (3,4), (5,6)))
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List((7, 8)))
rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> val unionOfTwo = rdd1.union(rdd2)
unionOfTwo: org.apache.spark.rdd.RDD[(Int, Int)] = UnionRDD[2] at union at <console>:28
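Collecting the union confirms the appended row; a sketch of what the REPL should show (union concatenates the inputs' partitions, so the appended row comes last here):

scala> unionOfTwo.collect
res0: Array[(Int, Int)] = Array((1,2), (3,4), (5,6), (7,8))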

Filtering RDD1 on the basis of RDD2

I have two RDDs in the format below.
RDD1:
178,1
156,1
23,2
RDD2:
34
178
156
Now I want to filter rdd1 on the basis of the values in rdd2, i.e. if 178 is present in rdd1 and also in rdd2, then it should return those tuples from rdd1.
I have tried
val out = reversedl1.filter({ case(x,y) => x.contains(lines)})
where lines is my second RDD and reversedl1 is the first, but it's not working.
I also tried
val abce = reversedl1.subtractByKey(lines)
val defg = reversedl1.subtractByKey(abce)
This is also not working.
Any suggestions?
You can convert rdd2 to key-value pairs and then join it with rdd1 on the keys:
val rdd1 = sc.parallelize(Seq((178, 1), (156, 1), (23, 2)))
val rdd2 = sc.parallelize(Seq(34, 178, 156))
(rdd1.join(rdd2.distinct().map(k => (k, null))) // create a dummy value to form a pair RDD so you can join
  .map { case (k, (v1, v2)) => (k, v1) }        // drop the dummy value
).collect
// res11: Array[(Int, Int)] = Array((156,1), (178,1))
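If rdd2 is small, an alternative is to broadcast its values as a Set and filter rdd1 directly, avoiding the shuffle that a join triggers. A minimal sketch with the same rdd1 and rdd2 as above (keep and filtered are just illustrative names):
// Collect rdd2's values into a Set, broadcast it to all executors,
// then filter rdd1 locally on each partition -- no shuffle needed
val keep = sc.broadcast(rdd2.distinct().collect().toSet)
val filtered = rdd1.filter { case (k, _) => keep.value.contains(k) }
filtered.collect
// e.g. Array((178,1), (156,1))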

How to perform set transformations on RDDs with different numbers of columns?

I have two RDDs. One RDD is of type RDD[(String, String, String)] and the second is of type RDD[(String, String, String, String, String)]. Whenever I try to perform operations like union, intersection, etc., I get the error:
error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(String, String, String, String, String, String)]
 required: org.apache.spark.rdd.RDD[(String, String, String)]
       uid.union(uid1).first()
How can I perform the set operations in this case? If set operations are not possible at all, what can I do to get the same result as set operations without having the type mismatch problem?
EDIT:
Here's a sample of the first lines from both RDDs:
(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")
(fb_100007609418328,-795000,r316079113_serv60i)
Several operations require the two RDDs to have the same element type.
Let's take union as an example: union basically concatenates two RDDs. As you can imagine, it would be unsound to concatenate the following:
RDD1
(1, 2)
(3, 4)
RDD2
(5, 6, "string1")
(7, 8, "string2")
As you can see, RDD2 has one extra column. One thing you can do is transform RDD1 so that its schema matches that of RDD2, for example by adding a default value:
RDD1
(1, 2)
(3, 4)
RDD1 (AMENDED)
(1, 2, "default")
(3, 4, "default")
RDD2
(5, 6, "string1")
(7, 8, "string2")
UNION
(1, 2, "default")
(3, 4, "default")
(5, 6, "string1")
(7, 8, "string2")
You can achieve this with the following code:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // your SparkContext

val rdd1: RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (3, 4)))
val rdd2: RDD[(Int, Int, String)] =
  sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

val amended: RDD[(Int, Int, String)] =
  rdd1.map(pair => (pair._1, pair._2, "default"))
val union: RDD[(Int, Int, String)] =
  amended.union(rdd2)
If you now print the contents
union.foreach(println)
you will get what we ended up with in the example above.
Of course, the exact semantics of how you want the two RDDs to match depend on your problem.
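Depending on those semantics, the opposite direction may fit better: instead of padding RDD1 with a default, you can project RDD2 down to RDD1's shape by dropping the extra column. A minimal sketch against the same rdd1 and rdd2 as above:
// Drop the extra String column from rdd2 so it matches rdd1's element type
val projected: RDD[(Int, Int)] =
  rdd2.map { case (a, b, _) => (a, b) }
val union2: RDD[(Int, Int)] = rdd1.union(projected)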

How to select values from one field of an RDD only if they are present in the second field of the RDD

I have an RDD with 3 fields, as shown below.
1,2,6
2,4,6
1,4,9
3,4,7
2,3,8
Now, from the above RDD, I want to get the following RDD.
2,4,6
3,4,7
2,3,8
The resulting RDD does not have rows starting with 1, because 1 never appears in the second field of the input RDD.
OK, if I understood correctly what you want to do, there are two ways:
Split your RDD into two, where the first RDD contains the unique values of the second field and the second RDD uses the first field as a key, then join the RDDs together. The drawback of this approach is that distinct and join are slow operations.
val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
  ("1", "2", 6),
  ("2", "4", 6),
  ("1", "4", 9),
  ("3", "4", 7),
  ("2", "3", 8)
))
val uniqueValues: RDD[(String, Unit)] = r.map(x => x._2 -> ()).distinct
val r1: RDD[(String, (String, String, Int))] = r.map(x => x._1 -> x)
val result: RDD[(String, String, Int)] = r1.join(uniqueValues).map { case (_, (x, _)) => x }
result.collect.foreach(println)
If your RDD is relatively small, and the set of second-field values can fit completely in memory on all the nodes, then you can create that in-memory set as a first step, broadcast it to all nodes, and then just filter your RDD:
val r: RDD[(String, String, Int)] = sc.parallelize(Seq(
  ("1", "2", 6),
  ("2", "4", 6),
  ("1", "4", 9),
  ("3", "4", 7),
  ("2", "3", 8)
))
val uniqueValues = sc.broadcast(r.map(x => x._2).distinct.collect.toSet)
val result: RDD[(String, String, Int)] = r.filter(x => uniqueValues.value.contains(x._1))
result.collect.foreach(println)
Both examples output:
(2,4,6)
(2,3,8)
(3,4,7)
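A follow-up on the broadcast variant: once the filtering job has run, you can release the broadcast copies held on the executors. A small sketch using the Broadcast API:
// Remove the broadcast data from the executors when it is no longer needed;
// it would be re-shipped automatically if the broadcast were used again
uniqueValues.unpersist()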