Input RDD
--------------------
A,123|124|125|126
B,123|124|125|126
From this RDD I need to generate another RDD in the format below:
Output RDD
--------------------
A,123
A,124
A,125
A,126
B,123
B,124
B,125
B,126
flatMapValues does exactly this; for example:
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
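Applied to the format in the question, a minimal sketch could look like this (assuming the input RDD holds plain strings such as "A,123|124|125|126"):
rdd = sc.parallelize(["A,123|124|125|126", "B,123|124|125|126"])
pairs = (rdd.map(lambda line: line.split(",", 1))        # -> ["A", "123|124|125|126"]
            .map(lambda kv: (kv[0], kv[1].split("|")))   # -> ("A", ["123", "124", "125", "126"])
            .flatMapValues(lambda vals: vals))           # -> ("A", "123"), ("A", "124"), ...
pairs.collect()
# [('A', '123'), ('A', '124'), ('A', '125'), ('A', '126'), ('B', '123'), ('B', '124'), ('B', '125'), ('B', '126')]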
I have a problem where I have generated a dataframe from a graph algorithm that I have written. The issue is that I want the value of the underlying component to stay essentially the same after every run of the graph code.
This is a sample dataframe generated:
df = spark.createDataFrame(
[
(1, 'A1'),
(1, 'A2'),
(1, 'A3'),
(2, 'B1'),
(2, 'B2'),
(3, 'B3'),
(4, 'C1'),
(4, 'C2'),
(4, 'C3'),
(4, 'C4'),
(5, 'D1'),
],
['old_comp_id', 'db_id']
)
After another run the values change completely, so the new run has values like these,
df2 = spark.createDataFrame(
[
(2, 'A1'),
(2, 'A2'),
(2, 'A3'),
(3, 'B1'),
(3, 'B2'),
(3, 'B3'),
(1, 'C1'),
(1, 'C2'),
(1, 'C3'),
(1, 'C4'),
(4, 'D1'),
],
['new_comp_id', 'db_id']
)
So what I need to do is compare the two dataframes above and update the component ID based on the associated database ID:
if the database IDs are the same, update the component ID to the one from the first dataframe
if they are different, assign a completely new comp_id (new_comp_id = max(old_comp_id) + 1)
This is what I have come up with so far:
old_ids = df.groupBy("old_comp_id").agg(F.collect_set(F.col("db_id")).alias("old_db_id"))
new_ids = df2.groupBy("new_comp_id").agg(F.collect_set(F.col("db_id")).alias("new_db_id"))
joined = new_ids.join(old_ids,old_ids.old_comp_id == new_ids.new_comp_id,"outer")
joined.withColumn("update_comp", F.when( F.col("new_db_id") == F.col("old_db_id"), F.col('old_comp_id')).otherwise(F.max(F.col("old_comp_id")+1))).show()
In order to use aggregated functions in non-aggregated columns, you should use Windowing Functions.
First, outer-join the DFs on db_id, renaming df2's db_id to new_db_id so the two columns can be told apart:
from pyspark.sql.functions import when, col, max
joinedDF = df.join(df2.withColumnRenamed("db_id", "new_db_id"), df["db_id"] == col("new_db_id"), "outer")
Then, start building the Window specification, where you partition by db_id and order by old_comp_id descending, so that the rows with the highest old_comp_id come first:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
.partitionBy("db_id")\
.orderBy(desc("old_comp_id"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
Then, you build the max column using the windowSpec
from pyspark.sql.functions import max
maxCompId = max(col("old_comp_id")).over(windowSpec)
Then, you apply it on the select
joinedDF.select(col("db_id"), when(col("new_db_id").isNotNull(), col("old_comp_id")).otherwise(maxCompId+1).alias("updated_comp")).show()
For more information, please refer to the documentation (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window)
Hope this helps
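Putting the steps above together, the whole snippet would look roughly like this (a sketch; as in the join above, df2's db_id is renamed to new_db_id so the later column references line up):
from pyspark.sql.functions import when, col, max, desc
from pyspark.sql.window import Window

joinedDF = df.join(df2.withColumnRenamed("db_id", "new_db_id"),
                   df["db_id"] == col("new_db_id"), "outer")

windowSpec = Window\
    .partitionBy("db_id")\
    .orderBy(desc("old_comp_id"))\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

maxCompId = max(col("old_comp_id")).over(windowSpec)

joinedDF.select(col("db_id"),
                when(col("new_db_id").isNotNull(), col("old_comp_id"))
                .otherwise(maxCompId + 1).alias("updated_comp")).show()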
In pyspark, how to transform an input RDD where every key has a list of values to an output RDD where every value has a list of the keys it belongs to?
Input
[(1, ['a','b','c','e']), (2, ['b','d']), (3, ['a','d']), (4, ['b','c'])]
Output
[('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2,3]), ('e', [1])]
Flatten and swap the key value on the rdd first, and then groupByKey:
rdd.flatMap(lambda r: [(k, r[0]) for k in r[1]]).groupByKey().mapValues(list).collect()
# [('a', [1, 3]), ('e', [1]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3])]
I want to merge multiple RDDs into one using a key. Instead of doing join multiple times, is there an efficient way to do so?
For example:
Rdd_1 = [(0, a), (1, b), (2, c), (3, d)]
Rdd_2 = [(0, aa), (1, bb), (2, cc), (3, dd)]
Rdd_3 = [(0, aaa), (1, bbb), (2, ccc), (3, ddd)]
I expected output should look like
Rdd = [(0, a, aa, aaa), (1, b, bb, bbb), (2, c, cc, ccc), (3, d, dd, ddd)]
Thanks!
Well for completeness here is the join method:
Rdd_1.join(Rdd_2).join(Rdd_3).map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],))
In terms of efficiency, if you explicitly partition each RDD on the key (using partitionBy) then all the tuples to be joined will sit in the same partitions, which lets the joins avoid re-shuffling the data.
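For example, a sketch of that idea (the partition count here is arbitrary; the important part is that it is the same for every RDD):
num_partitions = 4  # arbitrary
r1 = Rdd_1.partitionBy(num_partitions)
r2 = Rdd_2.partitionBy(num_partitions)
r3 = Rdd_3.partitionBy(num_partitions)
# co-partitioned RDDs can be joined without re-shuffling each side
merged = r1.join(r2).join(r3).map(lambda kv: (kv[0],) + kv[1][0] + (kv[1][1],))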
I have a dataframe with 2 columns, of the form
col1 col2
k1 'a'
k2 'b'
k1 'a'
k1 'c'
k2 'c'
k1 'b'
k1 'b'
k2 'c'
k1 'b'
I want the output to be
k1 ['b', 'a', 'c']
k2 ['c', 'b']
So the unique set of entries, sorted by the number of times each entry occurs (in descending order). In the above example, 'b' is associated with k1 thrice, 'a' twice, and 'c' once.
How do I go about doing this?
groupBy($"col1").count()
only looks at the number of times the entries in col1 occur, but that's not what I'm looking for.
You can do the following:
for each key and column value, calculate the count
for each key, calculate a list with all related column values and their counts
use udf to sort the list and drop the counts
Like this (in Scala):
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val sort_by_count_udf = udf {
  arr: mutable.WrappedArray[Row] =>
    arr.map {
      case Row(count: Long, col2: String) => (count, col2)
    }.sortBy(-_._1).map { case (count, col2) => col2 }
}
val df = List(("k1", "a"),
("k1", "a"), ("k1", "c"), ("k1", "b"),
("k2", "b"), ("k2", "c"), ("k2", "c"),
("k1", "b"), ("k1", "b"))
.toDF("col1", "col2")
val grouped = df
.groupBy("col1", "col2")
.count()
.groupBy("col1")
.agg(collect_list(struct("count", "col2")).as("list"))
grouped.withColumn("list_ordered", sort_by_count_udf(col("list"))).show
Here's one (not so pretty) solution using only built-in functions:
df.groupBy($"col1" , $"col2")
.agg(count($"col2").alias("cnt") )
.groupBy($"col1")
.agg(sort_array(collect_list(struct(-$"cnt", $"col2"))).as("list"))
.withColumn("x" , $"list".getItem("col2") )
.show(false)
Since sort_array sorts the elements in ascending order according to their natural ordering, -$"cnt" helps us get the elements sorted in descending order based on their count. getItem is used to get the value of col2 from the struct.
Output:
+----+------------------------+---------+
|col1|list |x |
+----+------------------------+---------+
|k2 |[[-2,c], [-1,b]] |[c, b] |
|k1 |[[-3,b], [-2,a], [-1,c]]|[b, a, c]|
+----+------------------------+---------+
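If you need the same thing in PySpark, a rough equivalent of the built-in-functions approach could be (a sketch; the neg_cnt alias is just an illustrative name, and transform inside expr requires Spark 2.4+):
from pyspark.sql import functions as F

result = (df.groupBy("col1", "col2")
            .count()
            .groupBy("col1")
            .agg(F.sort_array(F.collect_list(F.struct((-F.col("count")).alias("neg_cnt"),
                                                       F.col("col2")))).alias("list"))
            # extract col2 from each struct, keeping the count-descending order
            .withColumn("x", F.expr("transform(list, s -> s.col2)")))
result.show(truncate=False)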
In an RDD with a composite key, is it possible to sort in ascending order by the first element and in descending order by the second element when both of them are strings? I have provided some dummy data below.
z = [(('a','b'), 3), (('a','c'), -2), (('d','b'), 4), (('e','b'), 6), (('a','g'), 8)]
rdd = sc.parallelize(z)
rdd.sortByKey(False).collect()
Maybe there's a more efficient way, but here is one:
str_to_ints = lambda s, i: [ord(c) * i for c in s]
rdd.sortByKey(keyfunc=lambda x: (str_to_ints(x[0], 1), str_to_ints(x[1], -1))).collect()
# [(('a', 'g'), 8), (('a', 'c'), -2), (('a', 'b'), 3), (('d', 'b'), 4), (('e', 'b'), 6)]
Basically, convert each string in the key to a list of character codes, with the first element's codes multiplied by 1 and the second element's codes multiplied by -1.
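For instance, for the keys starting with 'a' the keyfunc produces (hand-worked from the helper above):
('a', 'g') -> ([97], [-103])
('a', 'c') -> ([97], [-99])
('a', 'b') -> ([97], [-98])
Sorting those tuples in ascending order puts the second elements in the order g, c, b, i.e. descending, which matches the output shown above.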