Input RDD
--------------------
A,123|124|125|126
B,123|124|125|126
From this RDD I need to generate another RDD in the format below:
Output RDD
--------------------
A,123
A,124
A,125
A,126
B,123
B,124
B,125
B,126
flatMapValues does exactly this; for example:
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
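Applied to the format in the question, a minimal sketch could look like this (assuming the input RDD holds plain strings such as "A,123|124|125|126"):
rdd = sc.parallelize(["A,123|124|125|126", "B,123|124|125|126"])
pairs = (rdd.map(lambda line: line.split(",", 1))        # -> ["A", "123|124|125|126"]
            .map(lambda kv: (kv[0], kv[1].split("|")))   # -> ("A", ["123", "124", "125", "126"])
            .flatMapValues(lambda vals: vals))           # -> ("A", "123"), ("A", "124"), ...
pairs.collect()
# [('A', '123'), ('A', '124'), ('A', '125'), ('A', '126'), ('B', '123'), ('B', '124'), ('B', '125'), ('B', '126')]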
I have a problem where I have generated a dataframe from a graph algorithm that I have written. The issue is that I want the value of the underlying component to stay essentially the same after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
[
(1, 'A1'),
(1, 'A2'),
(1, 'A3'),
(2, 'B1'),
(2, 'B2'),
(3, 'B3'),
(4, 'C1'),
(4, 'C2'),
(4, 'C3'),
(4, 'C4'),
(5, 'D1'),
],
['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
[
(2, 'A1'),
(2, 'A2'),
(2, 'A3'),
(3, 'B1'),
(3, 'B2'),
(3, 'B3'),
(1, 'C1'),
(1, 'C2'),
(1, 'C3'),
(1, 'C4'),
(4, 'D1'),
],
['new_comp_id', 'db_id']
)
So what I need to do is compare the two dataframes above and update the component ID based on the associated database ID:
if the database IDs are the same, update the component ID to the one from the first dataframe
if they are different, assign a completely new comp_id (new_comp_id = max(old_comp_id) + 1)
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids,old_ids.old_comp_id == new_ids.new_comp_id,"outer")
joined.withColumn("update_comp", F.when( F.col("new_db_id") == F.col("old_db_id"), F.col('old_comp_id')).otherwise(F.max(F.col("old_comp_id")+1))).show()
In order to use aggregated functions in non-aggregated columns, you should use Windowing Functions.
First, outer-join the DFs on db_id, renaming df2's db_id to new_db_id so the two columns can be told apart:
from pyspark.sql.functions import when, col, max
joinedDF = df.join(df2.withColumnRenamed("db_id", "new_db_id"), df["db_id"] == col("new_db_id"), "outer")
Then, start building the Window specification, where you partition by db_id and order by old_comp_id descending, so that the rows with the highest old_comp_id come first:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
.partitionBy("db_id")\
.orderBy(desc("old_comp_id"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then, you build the max column using the windowSpec
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then, you apply it on the select
joinedDF.select(col("db_id"), when(col("new_db_id").isNotNull(), col("old_comp_id")).otherwise(maxCompId+1).alias("updated_comp")).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps
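Putting the steps above together, the whole snippet would look roughly like this (a sketch; as in the join above, df2's db_id is renamed to new_db_id so the later column references line up):
from pyspark.sql.functions import when, col, max, desc
from pyspark.sql.window import Window

joinedDF = df.join(df2.withColumnRenamed("db_id", "new_db_id"),
                   df["db_id"] == col("new_db_id"), "outer")

windowSpec = Window\
    .partitionBy("db_id")\
    .orderBy(desc("old_comp_id"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

maxCompId = max(col("old_comp_id")).over(windowSpec)

joinedDF.select(col("db_id"),
                when(col("new_db_id").isNotNull(), col("old_comp_id"))
                .otherwise(maxCompId + 1).alias("updated_comp")).show()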
In pyspark, how to transform an input RDD where every key has a list of values to an output RDD where every value has a list of the keys it belongs to?
Input
[(1, ['a','b','c','e']), (2, ['b','d']), (3, ['a','d']), (4, ['b','c'])]
Output
[('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2,3]), ('e', [1])]
Flatten and swap the key value on the rdd first, and then groupByKey:
rdd.flatMap(lambda r: [(k, r[0]) for k in r[1]]).groupByKey().mapValues(list).collect()
# [('a', [1, 3]), ('e', [1]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3])]
I want to merge multiple RDDs into one using a key. Instead of doing join multiple times, is there an efficient way to do so?
For example:
Rdd_1 = [(0, a), (1, b), (2, c), (3, d)]
Rdd_2 = [(0, aa), (1, bb), (2, cc), (3, dd)]
Rdd_3 = [(0, aaa), (1, bbb), (2, ccc), (3, ddd)]
I expected output should look like
Rdd = [(0, a, aa, aaa), (1, b, bb, bbb), (2, c, cc, ccc), (3, d, dd, ddd)]
Thanks!
Well for completeness here is the join method:
Rdd_1.join(Rdd_2).join(Rdd_3).map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],))
In terms of efficiency, if you explicitly partition each RDD on the key (using partitionBy) then all the tuples to be joined will sit in the same partitions, which lets the joins avoid re-shuffling the data.
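For example, a sketch of that idea (the partition count here is arbitrary; the important part is that it is the same for every RDD):
num_partitions = 4  # arbitrary
r1 = Rdd_1.partitionBy(num_partitions)
r2 = Rdd_2.partitionBy(num_partitions)
r3 = Rdd_3.partitionBy(num_partitions)
# co-partitioned RDDs can be joined without re-shuffling each side
merged = r1.join(r2).join(r3).map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],))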
I have a dataframe with 2 columns, of the form
col1 col2
k1 'a'
k2 'b'
k1 'a'
k1 'c'
k2 'c'
k1 'b'
k1 'b'
k2 'c'
k1 'b'
I want the output to be
k1 ['b', 'a', 'c']
k2 ['c', 'b']
So the unique set of entries, sorted by the number of times each entry occurs (in descending order). In the above example, 'b' is associated with k1 thrice, 'a' twice, and 'c' once.
How do I go about doing this?
groupBy($"col1").count()
only looks at the number of times the entries in col1 occur, but that's not what I'm looking for.
You can do the following:
for each key and column value, calculate the count
for each key, calculate a list with all related column values and their counts
use udf to sort the list and drop the counts
Like this (in Scala):
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val sort_by_count_udf = udf {
  arr: mutable.WrappedArray[Row] =>
    arr.map {
      case Row(count: Long, col2: String) => (count, col2)
    }.sortBy(-_._1).map { case (count, col2) => col2 }
}
val df = List(("k1", "a"),
("k1", "a"), ("k1", "c"), ("k1", "b"),
("k2", "b"), ("k2", "c"), ("k2", "c"),
("k1", "b"), ("k1", "b"))
.toDF("col1", "col2")
val grouped = df
.groupBy("col1", "col2")
.count()
.groupBy("col1")
.agg(collect_list(struct("count", "col2")).as("list"))
grouped.withColumn("list_ordered", sort_by_count_udf(col("list"))).show
Here's one (not so pretty) solution using only built-in functions:
df.groupBy($"col1" , $"col2")
.agg(count($"col2").alias("cnt") )
.groupBy($"col1")
.agg(sort_array(collect_list(struct(-$"cnt", $"col2"))).as("list"))
.withColumn("x" , $"list".getItem("col2") )
.show(false)
Since sort_array sorts the elements in ascending order according to their natural ordering, -$"cnt" helps us get the elements sorted in descending order based on their count. getItem is used to get the value of col2 from the struct.
Output:
+----+------------------------+---------+
|col1|list |x |
+----+------------------------+---------+
|k2 |[[-2,c], [-1,b]] |[c, b] |
|k1 |[[-3,b], [-2,a], [-1,c]]|[b, a, c]|
+----+------------------------+---------+
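If you need the same thing in PySpark, a rough equivalent of the built-in-functions approach could be (a sketch; the neg_cnt alias is just an illustrative name, and transform inside expr requires Spark 2.4+):
from pyspark.sql import functions as F

result = (df.groupBy("col1", "col2")
            .count()
            .groupBy("col1")
            .agg(F.sort_array(F.collect_list(F.struct((-F.col("count")).alias("neg_cnt"),
                                                       F.col("col2")))).alias("list"))
            # extract col2 from each struct, keeping the count-descending order
            .withColumn("x", F.expr("transform(list, s -> s.col2)")))
result.show(truncate=False)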
In an RDD with a composite key, is it possible to sort in ascending order by the first element and in descending order by the second element when both of them are strings? I have provided some dummy data below.
z = [(('a','b'), 3), (('a','c'), -2), (('d','b'), 4), (('e','b'), 6), (('a','g'), 8)]
rdd = sc.parallelize(z)
rdd.sortByKey(False).collect()
Maybe there's a more efficient way, but here is one:
str_to_ints = lambda s, i: [ord(c) * i for c in s]
rdd.sortByKey(keyfunc=lambda x: (str_to_ints(x[0], 1), str_to_ints(x[1], -1))).collect()
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]
Basically, convert each string in the key to a list of character codes, with the first element's codes multiplied by 1 and the second element's codes multiplied by -1.
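For instance, for the keys starting with 'a' the keyfunc produces (hand-worked from the helper above):
('a', 'g') -> ([97], [-103])
('a', 'c') -> ([97], [-99])
('a', 'b') -> ([97], [-98])
Sorting those tuples in ascending order puts the second elements in the order g, c, b, i.e. descending, which matches the output shown above.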