Pyspark - How to duplicate/triplicate rows?

I need to "clone" or "duplicate"/"triplicate" every row from my dataframe.
I couldn't find anything about it; I just know that I need to use explode.
Example:
ID - Name
1 John
2 Maria
3 Charles
Output:
ID - Name
1 John
1 John
2 Maria
2 Maria
3 Charles
3 Charles
Thanks

You could use array_repeat with explode (Spark 2.4+).
For duplicate:
from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.array_repeat("Name",2)))
For triplicate:
df.withColumn("Name", F.explode(F.array_repeat("Name",3)))
For Spark < 2.4:
# duplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 2)))
# triplicate
df.withColumn("Name", F.explode(F.array(*['Name'] * 3)))
UPDATE:
To replicate each row a certain number of times based on another column, Support, you could use this (Spark 2.4+):
df.show()
#+---+-------+-------+
#| ID| Name|Support|
#+---+-------+-------+
#| 1| John| 2|
#| 2| Maria| 4|
#| 3|Charles| 6|
#+---+-------+-------+
from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()
#+---+-------+-------+
#| ID| Name|Support|
#+---+-------+-------+
#| 1| John| 2|
#| 1| John| 2|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#+---+-------+-------+
For Spark 1.5+, use repeat, concat, substring, split and explode:
from pyspark.sql import functions as F
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
.withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()


After joining two dataframes, pick all columns from one dataframe on the basis of a primary key

I have two dataframes; I need to update records in df1 based on new updates available in df2 in PySpark.
DF1:
df1=spark.createDataFrame([(1,2),(2,3),(3,4)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 3| 4|
+---+----+
DF2:
df2=spark.createDataFrame([(1,4),(2,5)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
+---+----+
Then, I'm trying to join the two dataframes:
join_con=(df1["id"] == df2["id"])
jdf=df1.join(df2,join_con,"left")
+---+----+----+----+
| id|val1| id|val1|
+---+----+----+----+
| 1| 2| 1| 4|
| 3| 4|null|null|
| 2| 3| 2| 5|
+---+----+----+----+
Now, I want to pick all columns from df2 if df2["id"] is not null, otherwise pick all columns of df1.
Something like:
jdf.filter(df2["id"].isNull()).select(df1["*"]) \
    .union(jdf.filter(df2["id"].isNotNull()).select(df2["*"]))
so the resultant DF would be:
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
+---+----+
Can someone please help with this?
Your selection expression can, for each column, take df2's value when df2["id"] is not null and fall back to df1's value otherwise:
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 1)], ["id", "val1"])
df2 = spark.createDataFrame([(1, 4), (2, 5), (4, None)], ["id", "val1"])
jdf = df1.join(df2, df1["id"] == df2["id"], "left")

selection_expr = [F.when(df2["id"].isNotNull(), df2[c]).otherwise(df1[c]).alias(c) for c in df2.columns]
jdf.select(*selection_expr).show()
"""
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
| 4|null|
+---+----+
"""
Try the coalesce function, as it returns the first non-null value:
from pyspark.sql.functions import coalesce

expr = zip(df2.columns, df1.columns)
e1 = [coalesce(df2[f[0]], df1[f[1]]).alias(f[0]) for f in expr]
jdf.select(*e1).show()
#+---+----+
#| id|val1|
#+---+----+
#| 1| 4|
#| 2| 5|
#| 3| 4|
#+---+----+
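Note the subtle difference between the two answers: coalesce falls back to df1's value whenever df2's value is null, even if df2 has a matching id, while the when(df2["id"].isNotNull(), ...) version keeps df2's null. A minimal sketch of that difference, assuming the extended df1/df2 (with the extra id 4 row) from the first answer:
from pyspark.sql import functions as F

jdf = df1.join(df2, df1["id"] == df2["id"], "left")
# coalesce keeps df1's val1 (1) for id 4 because df2's val1 is null there
e1 = [F.coalesce(df2[c], df1[c]).alias(c) for c in df2.columns]
jdf.select(*e1).orderBy("id").show()
#+---+----+
#| id|val1|
#+---+----+
#|  1|   4|
#|  2|   5|
#|  3|   4|
#|  4|   1|
#+---+----+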

pyspark: Auto filling in implicit missing values

I have a dataframe
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
You can see that the maximum value of day is 4 and the minimum is 1. I want to fill 0 in the amount column for all missing days of all users, so the above dataframe will become:
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
Here is one approach. You can get the min and max values first, then group on the user column and pivot, fill in any missing columns, fill all nulls with 0, and then stack the columns back:
from pyspark.sql import functions as F

min_max = df.agg(F.min("day"), F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0], min_max[1] + 1)
                if str(i) not in df1.columns]
df1 = df1.select("*", *missing_cols)
df1.show()
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
                         for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
    .select(["user"] + ["kvs.day", "kvs.amount"])
    .orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note: since column day was pivoted, its dtype may have changed, so you may have to cast it back to the original dtype.
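For example, a minimal sketch of that cast, assuming the unpivoted result above is assigned to a variable (result is a hypothetical name here) and day was originally an integer:
from pyspark.sql import functions as F
# "day" comes from the pivoted column names, so it is a string; cast it back to int
result = result.withColumn("day", F.col("day").cast("int"))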
Another way to do this is to use sequence, array functions, and explode (Spark 2.4+).
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.lit(0))
df.withColumn("boundaries", F.sequence(F.min("day").over(w),F.max("day").over(w),F.lit(1)))\
.groupBy("user").agg(F.collect_list("day").alias('day'),F.collect_list("amount").alias('amount')\
,F.first("boundaries").alias("boundaries")).withColumn("boundaries", F.array_except("boundaries","day"))\
.withColumn("day",F.flatten(F.array("day","boundaries"))).drop("boundaries")\
.withColumn("zip", F.explode(F.arrays_zip("day","amount")))\
.select("user","zip.day", F.when(F.col("zip.amount").isNull(),\
F.lit(0)).otherwise(F.col("zip.amount")).alias("amount")).show()
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+
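The key trick in this version is that arrays_zip pads the shorter amount array with nulls for the appended missing days, which the final when(... isNull(), 0) turns into zeros. A tiny isolated sketch of that padding behaviour (Spark 2.4+):
from pyspark.sql import functions as F

# zip a 3-element day array with a 2-element amount array:
# the third struct gets a null amount, which the main query then replaces with 0
spark.range(1).select(
    F.arrays_zip(
        F.array(F.lit(1), F.lit(2), F.lit(3)),   # days, including one filled-in day
        F.array(F.lit(10), F.lit(14))            # amounts, one element short
    ).alias("zip")
).select(F.explode("zip")).show()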

Pyspark multiple records of same key into single record

I have a data frame as follows -
[Row(account_number=1, address_city='NewYork'), Row(account_number=1, address_address1='hotel road'), Row(account_number=1, address_postal='1345'), Row(account_number=2, address_city='NewJersey'),Row(account_number=2, address_postal='3421')]
I'm trying to transform this into -
[Row(account_number=1, address_city='NewYork' ,address_address1='hotel road', address_postal='1345'), Row(account_number=2, address_city='NewJersey', address_postal='3421')]
Please suggest the best possible ways to do this.
Use groupBy on id with .pivot to collapse the records into a single record per id:
df=spark.createDataFrame([("1","address_city","NewYork"),("1","address_address1","hotel road"),("1","address_postal","1345"),("2","address_city","NewJersey"),("2","address_postal","3421")],["id","value","name"])
#+---+----------------+----------+
#| id| value| name|
#+---+----------------+----------+
#| 1| address_city| NewYork|
#| 1|address_address1|hotel road|
#| 1| address_postal| 1345|
#| 2| address_city| NewJersey|
#| 2| address_postal| 3421|
#+---+----------------+----------+
df.groupBy("id").pivot("value").agg(first("name")).show()
#+---+----------------+------------+--------------+
#| id|address_address1|address_city|address_postal|
#+---+----------------+------------+--------------+
#| 1| hotel road| NewYork| 1345|
#| 2| null| NewJersey| 3421|
#+---+----------------+------------+--------------+
df.groupBy("id").pivot("value").agg(first("name")).collect()
#[Row(id=u'1', address_address1=u'hotel road', address_city=u'NewYork', address_postal=u'1345'), Row(id=u'2', address_address1=None, address_city=u'NewJersey', address_postal=u'3421')]
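If you know the attribute names up front, you could also pass them to pivot explicitly, which avoids the extra pass over the data needed to compute the distinct pivot values (a small optional optimization):
from pyspark.sql.functions import first

# passing the pivot values explicitly skips the distinct-values scan
cols = ["address_address1", "address_city", "address_postal"]
df.groupBy("id").pivot("value", cols).agg(first("name")).show()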

Pyspark - add missing values per key?

I have a PySpark dataframe with a non-unique key column key and columns number and value.
For most keys, the number column goes from 1 to 12, but for some of them there are gaps in the numbers (for example, we only have numbers [1, 2, 5, 9]). I would like to add the missing rows, so that for every key we have all the numbers in the range 1-12, populated with the last seen value.
So that for table
key number value
a 1 6
a 2 10
a 5 20
a 9 25
I would like to get
key number value
a 1 6
a 2 10
a 3 10
a 4 10
a 5 20
a 6 20
a 7 20
a 8 20
a 9 25
a 10 25
a 11 25
a 12 25
I thought about creating a table of a and an array of 1-12, exploding the array and joining with my original table, then separately populating the value column with previous value using a window function bounded by current row. However, it seems a bit inelegant and I wonder if there is a better way to achieve what I want?
I do not think your proposed approach is inelegant - but you can achieve the same using range instead of explode.
First create a dataframe with all the numbers in your range. You will also want to cross join this with the distinct key column from your DataFrame.
all_numbers = spark.range(1, 13).withColumnRenamed("id", "number")
all_numbers = all_numbers.crossJoin(df.select("key").distinct()).cache()
all_numbers.show()
#+------+---+
#|number|key|
#+------+---+
#| 1| a|
#| 2| a|
#| 3| a|
#| 4| a|
#| 5| a|
#| 6| a|
#| 7| a|
#| 8| a|
#| 9| a|
#| 10| a|
#| 11| a|
#| 12| a|
#+------+---+
Now you can outer join this to your original DataFrame and forward fill using the last known good value. If the number of keys is small enough, you may be able to broadcast it:
from pyspark.sql.functions import broadcast, last
from pyspark.sql import Window
df.join(broadcast(all_numbers), on=["number", "key"], how="outer")\
.withColumn(
"value",
last(
"value",
ignorenulls=True
).over(
Window.partitionBy("key").orderBy("number")\
.rowsBetween(Window.unboundedPreceding, 0)
)
)\
.show()
#+------+---+-----+
#|number|key|value|
#+------+---+-----+
#| 1| a| 6|
#| 2| a| 10|
#| 3| a| 10|
#| 4| a| 10|
#| 5| a| 20|
#| 6| a| 20|
#| 7| a| 20|
#| 8| a| 20|
#| 9| a| 25|
#| 10| a| 25|
#| 11| a| 25|
#| 12| a| 25|
#+------+---+-----+
You could do this without a join. I have tested this with different gaps and it will always work as long as number 1 is always provided as input (the sequence needs to start from there), and the range always goes up to 12. I used a couple of windows to get a column I could use in the sequence, then built a custom sequence using expressions, and then exploded it to get the desired result. If for some reason you have inputs that do not contain number 1, let me know and I will update my solution.
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window().partitionBy("key").orderBy("number")
w2 = Window().partitionBy("key").orderBy("number").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("number2", F.lag("number").over(w))\
  .withColumn("diff", F.when((F.col("number2").isNotNull()) & ((F.col("number")-F.col("number2")) > 1), (F.col("number")-F.col("number2"))).otherwise(F.lit(0)))\
  .withColumn("diff2", F.lead("diff").over(w))\
  .withColumn("diff2", F.when(F.col("diff2").isNull(), F.lit(0)).otherwise(F.col("diff2")))\
  .withColumn("diff2", F.when(F.col("diff2")!=0, F.col("diff2")-1).otherwise(F.col("diff2")))\
  .withColumn("max", F.max("number").over(w2))\
  .withColumn("diff2", F.when((F.col("number")==F.col("max")) & (F.col("number")<F.lit(12)), F.lit(12)-F.col("number")).otherwise(F.col("diff2")))\
  .withColumn("number2", F.when(F.col("diff2")!=0, F.expr("""sequence(number,number+diff2,1)""")).otherwise(F.expr("""sequence(number,number+diff2,0)""")))\
  .drop("diff","diff2","max")\
  .withColumn("number2", F.explode("number2")).drop("number")\
  .select("key", F.col("number2").alias("number"), "value")\
  .show()
+---+------+-----+
|key|number|value|
+---+------+-----+
| a| 1| 6|
| a| 2| 10|
| a| 3| 10|
| a| 4| 10|
| a| 5| 20|
| a| 6| 20|
| a| 7| 20|
| a| 8| 20|
| a| 9| 25|
| a| 10| 25|
| a| 11| 25|
| a| 12| 25|
+---+------+-----+

Get total row count over a window

In PySpark, would it be possible to obtain the total number of rows in a particular window?
Right now I am using:
w = Window.partitionBy("column_to_partition_by")
F.count(col("column_1")).over(w)
However, this only gives me the incremental row count. What I need is the total number of rows in that particular window partition. Can anyone tell me the command for this?
I think you need to add rowsBetween with your window clause.
Example:
df.show()
#+---+---+
#| i| j|
#+---+---+
#| 1| a|
#| 1| b|
#| 1| c|
#| 2| c|
#+---+---+
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("i").rowsBetween(-sys.maxsize,sys.maxsize)
df.withColumn("count",count(col("j")).over(w)).show()
#+---+---+-----+
#| i| j|count|
#+---+---+-----+
#| 1| a| 3|
#| 1| b| 3|
#| 1| c| 3|
#| 2| c| 1|
#+---+---+-----+
Usually, when a window has an .orderBy clause, you also need to add rowsBetween, because with an orderBy the frame defaults to unboundedPreceding and currentRow.
w = Window.partitionBy("i").orderBy("j")
df.withColumn("count",count(col("j")).over(w)).show()
#incremental count
#+---+---+-----+
#| i| j|count|
#+---+---+-----+
#| 1| a| 1|
#| 1| b| 2|
#| 1| c| 3|
#| 2| c| 1|
#+---+---+-----+
w = Window.partitionBy("i").orderBy("j").rowsBetween(-sys.maxsize,sys.maxsize)
df.withColumn("count",count(col("j")).over(w)).show()
#total number of rows count
#+---+---+-----+
#| i| j|count|
#+---+---+-----+
#| 1| a| 3|
#| 1| b| 3|
#| 1| c| 3|
#| 2| c| 1|
#+---+---+-----+
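As an aside, newer versions of PySpark (2.1+) expose explicit frame-boundary constants on the Window class, which read a little cleaner than sys.maxsize; a small sketch:
from pyspark.sql import Window
from pyspark.sql.functions import col, count

w = Window.partitionBy("i").orderBy("j") \
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("count", count(col("j")).over(w)).show()
# gives the same total per-partition count as the sys.maxsize version above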