How to merge multiple RDDs in PySpark

I want to merge multiple RDDs into one using a key. Instead of doing the join multiple times, is there an efficient way to do so?
For example:
Rdd_1 = [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
Rdd_2 = [(0, 'aa'), (1, 'bb'), (2, 'cc'), (3, 'dd')]
Rdd_3 = [(0, 'aaa'), (1, 'bbb'), (2, 'ccc'), (3, 'ddd')]
The expected output should look like:
Rdd = [(0, 'a', 'aa', 'aaa'), (1, 'b', 'bb', 'bbb'), (2, 'c', 'cc', 'ccc'), (3, 'd', 'dd', 'ddd')]
Thanks!

Well for completeness here is the join method:
Rdd_1.join(Rdd_2).join(Rdd_3).map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],))
In terms of efficiency, if you explicitly partition each RDD on the key (using partitionBy), all the tuples to be joined will sit in the same partition, which makes the joins more efficient.
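For illustration, here is a minimal runnable sketch of both points, assuming an existing SparkContext sc (the variable and partition-count names are just for this example). Pre-partitioning all three RDDs with the same partitioner keeps records with the same key in the same partition, so the subsequent joins avoid extra shuffles:

# minimal sketch, assuming an existing SparkContext sc
rdd_1 = sc.parallelize([(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')])
rdd_2 = sc.parallelize([(0, 'aa'), (1, 'bb'), (2, 'cc'), (3, 'dd')])
rdd_3 = sc.parallelize([(0, 'aaa'), (1, 'bbb'), (2, 'ccc'), (3, 'ddd')])

# partition every RDD on the key with the same number of partitions
num_partitions = 4
p_1 = rdd_1.partitionBy(num_partitions)
p_2 = rdd_2.partitionBy(num_partitions)
p_3 = rdd_3.partitionBy(num_partitions)

# join twice, then flatten (k, ((v1, v2), v3)) into (k, v1, v2, v3)
merged = (p_1.join(p_2)
             .join(p_3)
             .map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],)))

print(sorted(merged.collect()))
# [(0, 'a', 'aa', 'aaa'), (1, 'b', 'bb', 'bbb'), (2, 'c', 'cc', 'ccc'), (3, 'd', 'dd', 'ddd')]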

Related

How to flatten the results of an RDD.groupBy() from (key, [values]) into (key, values)?

From an RDD of key-value pairs like
[(1, 3), (2, 4), (2, 6)]
I would like to obtain an RDD of tuples like
[(1, 3), (2, 4, 6)]
where the first element of each tuple is the key in the original RDD, and the next element(s) are all values associated with that key in the original RDD.
I have tried this
rdd.groupByKey().mapValues(lambda x:[item for item in x]).collect()
which gives
[(1, [3]), (2, [4, 6])]
but it is not quite what I want. I can't manage to "explode" the list of items in each tuple of the result. The best I came up with is
rdd.groupByKey().mapValues(lambda x: [a for a in x]).map(lambda x: tuple([x[0]] + x[1])).collect()
Could it be made more compact or efficient?
You can unpack the grouped values directly into the tuple:
rdd.groupByKey().map(lambda x: (x[0], *x[1])).collect()
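If efficiency is the concern and the per-key groups are small, one alternative is to build the tuples with reduceByKey instead of groupByKey. A minimal sketch, assuming an existing SparkContext sc (note that the order of values within a key is not guaranteed with either approach):

rdd = sc.parallelize([(1, 3), (2, 4), (2, 6)])
flattened = (rdd.mapValues(lambda v: (v,))          # wrap each value in a 1-tuple
                .reduceByKey(lambda a, b: a + b)    # concatenate the tuples per key
                .map(lambda kv: (kv[0],) + kv[1]))  # prepend the key
print(flattened.collect())
# [(1, 3), (2, 4, 6)]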

How to join two Spark RDDs

I have two Spark RDDs: the first contains a mapping between some indices and ids (which are strings), and the second contains tuples of related indices:
val ids = spark.sparkContext.parallelize(Array[(Int, String)](
(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")
val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
(1, 3), (2, 3), (4, 5))).toDF("index1", "index2")
I want to join these RDDs somehow (or merge them, use SQL, or whatever the best Spark practice is) so that in the end I have the related ids instead.
The combined result should be:
("a", "c"), ("b", "c"), ("d", "e")
Any idea how I can achieve this in an optimal way without loading either RDD into an in-memory map (in my scenario these RDDs can potentially hold millions of records)?
You can approach this by creating two temporary views from the DataFrames as follows:
relationships.createOrReplaceTempView("relationships");
ids.createOrReplaceTempView("ids");
Next, run the following SQL query, which performs an inner join between the relationships and ids views to produce the required result:
val result = spark.sql("""
  select t.index1, id.idx as index2
  from (
    select id.idx as index1, rel.index2
    from relationships rel
    inner join ids id on rel.index1 = id.index
  ) t
  inner join ids id on id.index = t.index2
""")
result.show()
Another approach uses the DataFrame API directly, without creating views:
relationships.as("rel").
join(ids.as("ids"), $"ids.index" === $"rel.index1").as("temp").
join(ids.as("ids"), $"temp.index2"===$"ids.index").
select($"temp.idx".as("index1"), $"ids.idx".as("index2")).show

sortByKey() by composite key in PySpark

In an RDD with a composite key, is it possible to sort in ascending order by the first element and in descending order by the second element when both are strings? I have provided some dummy data below.
z = [(('a','b'), 3), (('a','c'), -2), (('d','b'), 4), (('e','b'), 6), (('a','g'), 8)]
rdd = sc.parallelize(z)
rdd.sortByKey(False).collect()
Maybe there's a more efficient way, but here is one:
str_to_ints = lambda s, i: [ord(c) * i for c in s]
rdd.sortByKey(keyfunc=lambda x: (str_to_ints(x[0], 1), str_to_ints(x[1], -1))).collect()
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]
Basically, this converts the strings in the key to lists of integers, with the character codes of the first element multiplied by 1 and those of the second element multiplied by -1, so the second part sorts in reverse.
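The same idea can be written inline without the helper; a minimal runnable sketch using the dummy data above, assuming an existing SparkContext sc:

z = [(('a', 'b'), 3), (('a', 'c'), -2), (('d', 'b'), 4), (('e', 'b'), 6), (('a', 'g'), 8)]
rdd = sc.parallelize(z)
# first key part ascending, second descending by negating each character code
rdd.sortByKey(keyfunc=lambda k: ([ord(c) for c in k[0]],
                                 [-ord(c) for c in k[1]])).collect()
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]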

Join two lists with unequal length in Scala

I have 2 lists:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
I want to replace the key in the second list with the corresponding value from the first list, resulting in a new list that looks like:
List((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
This is my attempt:
object UniqueTest {
  def main(args: Array[String]) {
    val l_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
    val l_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
    val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
    print(l_3)
  }

  def f(i: Int, list: List[(Int, Int)]): Int = {
    for (pair <- list) {
      if (i == pair._1) {
        return pair._2
      }
    }
    return 0
  }
}
This results in:
((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
Is the program above a good way to do this? Are there built-in functions in Scala to handle this need, or another way to do this manipulation?
The only real over-complication you make is this line:
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
Your f function uses an imperative style to loop over a list to find a key. Any time you find yourself doing this, it's a good indication that what you want is a Map. Running the for loop for every element inflates the computational complexity, whereas a Map lets you fetch the value for a given key in O(1). With a Map, you first convert your list of key-value pairs into a data structure that explicitly supports the key-value relationship.
Thus, the first thing you should do is build your map. Scala provides a really easy way to do this with toMap:
val map_1 = list_1.toMap
Then it is just a matter of 'mapping':
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
This takes each pair in your list_2, looks the first element (the key) up in map_1, retrieves the corresponding value (or the default 0), and puts it as the first element of the resulting tuple.
You can do:
val map = l_1.toMap // transform l_1 to a Map[Int, Int]
// for each (a, b) in l_2, retrieve the new value v of a and return (v, b)
val res = l_2.map { case (a, b) => (map.getOrElse(a, 0), b) }
Zipping the two lists together and transforming the result is idiomatic too, but note that zip pairs elements positionally and truncates to the shorter list, so it only works when the two lists are aligned element by element; for the example above (unequal lengths, repeated keys) the Map-based lookup is the safer approach:
(list_1 zip list_2) map { case ((k1, v1), (k2, v2)) => (v1, v2) }

Merge two arrays based on first element

I have two arrays, an Array[(Int, Int)] and an Array[(Int, List[String])],
for example:
(1, 2) and (1, List("123", "456", "789"))
(2, 8) and (2, List("678", "1000"))
(3, 4) and (3, List("587", "923", "168", "392"))
I would like to merge these two arrays into one Array[(Int, List[String], Int)] like this:
(1, List("123", "456", "789"), 2)
(2, List("678", "1000"), 8)
(3, List("587", "923", "168", "392"), 4)
and I would like Scala to still know that the second element is a List[String].
I have tried many ways to combine these two arrays, but after merging, the second element is treated as Any (or Some) and I cannot traverse it.
I found the solution:
array1.zip(array2).map {
  case ((p1, count), (p2, categoryList)) => (p1, categoryList, count)
}