How to transpose column to row with PySpark - pyspark

I'm trying to transpose some columns of my table to rows. I found this earlier post: Transpose column to row with Spark
I actually want to go the opposite way. Initially, I have:
+-----+--------+-----------+
| A | col_id | col_value |
+-----+--------+-----------+
| 1 | col_1| 0.0|
| 1 | col_2| 0.6|
| ...| ...| ...|
| 2 | col_1| 0.6|
| 2 | col_2| 0.7|
| ...| ...| ...|
| 3 | col_1| 0.5|
| 3 | col_2| 0.9|
| ...| ...| ...|
And what I want is:
+-----+-----+-----+-------+
| A |col_1|col_2|col_...|
+-----+-----+-----+-------+
| 1 | 0.0| 0.6| ... |
| 2 | 0.6| 0.7| ... |
| 3 | 0.5| 0.9| ... |
| ...| ...| ...| ... |
How can I do it? Thanks!

You can use 'when' to emulate a SQL CASE expression. It redistributes the data over columns: when you are computing col_1 and the row's col_id is col_2, you simply put 0.
After that, a simple sum per group reduces the number of rows.
from pyspark.sql import functions as F
# One output column per col_id value; rows that belong to another col_id contribute 0.
df2 = df.select(
    df.A,
    F.when(df.col_id == 'col_1', df.col_value).otherwise(0).alias('col_1'),
    F.when(df.col_id == 'col_2', df.col_value).otherwise(0).alias('col_2'))
# Summing per A collapses the rows into a single row for each A.
df2.groupBy('A').agg(
    F.sum('col_1').alias('col_1'),
    F.sum('col_2').alias('col_2')).show()

Related

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
0 | u1 | t0 | 0 |
1 | u1 | t1 | 1 |
2 | u2 | t2 | 0 |
3 | u1 | t3 | 1 |
4 | u2 | t4 | 0 |
5 | u2 | t5 | 1 |
6 | u2 | t6 | 1 |
7 | u1 | t7 | 0 |
8 | u1 | t8 | 1 |
9 | u1 | t9 | 0 |
The table after the new column (text_list) should be as follows, storing last K = 2 messages for each user.
Index | user_name | text | label | text_list |
0 | u1 | t0 | 0 | [] |
1 | u1 | t1 | 1 | [] |
2 | u2 | t2 | 0 | [] |
3 | u1 | t3 | 1 | [t1] |
4 | u2 | t4 | 0 | [] |
5 | u2 | t5 | 1 | [] |
6 | u2 | t6 | 1 | [t5] |
7 | u1 | t7 | 0 | [t3, t1] |
8 | u1 | t8 | 1 | [t3, t1] |
9 | u1 | t9 | 0 | [t8, t3] |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
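For reference, the naive per-row queue described above could be sketched in plain Python as follows. It is only meant as a baseline for checking a scalable solution on a small sample; the function name last_k_labelled is made up for this sketch.

```python
from collections import defaultdict, deque

# Sample rows from the example table: (Index, user_name, text, label).
rows = [
    (0, "u1", "t0", 0), (1, "u1", "t1", 1), (2, "u2", "t2", 0),
    (3, "u1", "t3", 1), (4, "u2", "t4", 0), (5, "u2", "t5", 1),
    (6, "u2", "t6", 1), (7, "u1", "t7", 0), (8, "u1", "t8", 1),
    (9, "u1", "t9", 0),
]

def last_k_labelled(rows, k):
    """Return (Index, text_list) pairs; text_list holds the last k label-1 texts, newest first."""
    queues = defaultdict(lambda: deque(maxlen=k))  # one bounded queue per user
    out = []
    for index, user, text, label in rows:
        # Snapshot the queue before adding the current row, so only past rows count.
        out.append((index, list(queues[user])))
        if label == 1:
            queues[user].appendleft(text)  # the oldest entry drops off the right end
    return out

result = last_k_labelled(rows, k=2)
```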
If you are using Spark version >= 2.4, here is an approach you can try. Let's say df is your DataFrame.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
1. Get a list of structs of the text and label columns over a window using collect_list.
2. Filter the array where label = 1 and take the text values, sort the array in descending order using sort_array, and get the first two elements using slice.
It would be something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
.withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
.withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
)
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks for the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and finally arrived at a working solution. Here it is:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, lit, collect_list, slice, reverse
K = 2
# All rows before the current one, per user, in Index order
windowPast = Window.partitionBy("user_name").orderBy("Index").rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
(df
 # collect_list skips nulls, so only texts with label == 1 are collected
 .withColumn("text_list", collect_list(when(col("label") == 1, col("text")).otherwise(lit(None))).over(windowPast))
 # reverse to put the newest first, then keep the K most recent
 .withColumn("text_list", slice(reverse(col("text_list")), 1, K))
 .sort(col("Index"))
 .show())

Pyspark - advanced aggregation of monthly data

I have a table of the following format.
|---------------------|------------------|------------------|
| Customer | Month | Sales |
|---------------------|------------------|------------------|
| A | 3 | 40 |
|---------------------|------------------|------------------|
| A | 2 | 50 |
|---------------------|------------------|------------------|
| B | 1 | 20 |
|---------------------|------------------|------------------|
I need it in the format as below
|---------------------|------------------|------------------|------------------|
| Customer | Month 1 | Month 2 | Month 3 |
|---------------------|------------------|------------------|------------------|
| A | 0 | 50 | 40 |
|---------------------|------------------|------------------|------------------|
| B | 20 | 0 | 0 |
|---------------------|------------------|------------------|------------------|
Can you please help me out to solve this problem in PySpark?
This should help. I am assuming you are using SUM to aggregate values from the original DataFrame.
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2=(df.withColumn('COLUMN_LABELS',F.concat(F.lit('Month '),F.col('Month')))
.groupby('Customer')
.pivot('COLUMN_LABELS')
.agg(F.sum('Sales'))
.fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+

How can I add one column to other columns in PySpark?

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns but it takes respectively 30s-50s-80s... (keeps increasing) per column so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many others?
Using the RDD API, it can be done this way. Given this input:
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
.toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+

Forward Fill New Row to Account for Missing Dates

I currently have a dataset grouped into hourly increments by a variable "aggregator". There are gaps in this hourly data, and what I would ideally like to do is forward fill the rows with the prior row which maps to the variable in column x.
I've seen some solutions to similar problems using pandas, but ideally I would like to understand how best to approach this with a PySpark UDF.
I'd initially thought about something like the following with pandas, but I also struggled to implement this to just fill while ignoring the aggregator as a first pass:
df = df.set_index(keys=[df.timestamp]).resample('1H', fill_method='ffill')
But ideally I'd like to avoid using pandas.
In the example below I have two missing rows of hourly data (labeled as MISSING).
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| MISSING | MISSING |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| MISSING | MISSING |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
The expected output here would be the following:
| timestamp | aggregator |
|----------------------|------------|
| 2018-12-27T09:00:00Z | A |
| 2018-12-27T10:00:00Z | A |
| 2018-12-27T11:00:00Z | A |
| 2018-12-27T12:00:00Z | A |
| 2018-12-27T13:00:00Z | A |
| 2018-12-27T09:00:00Z | B |
| 2018-12-27T10:00:00Z | B |
| 2018-12-27T11:00:00Z | B |
| 2018-12-27T12:00:00Z | B |
| 2018-12-27T13:00:00Z | B |
| 2018-12-27T14:00:00Z | B |
Appreciate the help.
Thanks.
Here is a solution that fills the missing hours using a window, lag, and a UDF. With a little modification it can be extended to days as well.
from pyspark.sql.window import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *
from dateutil.relativedelta import relativedelta
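def missing_hours(t1, t2):
    # Hours strictly between t2 (previous row) and t1 (current row).
    # Only the hour-of-day fields are compared, so both must fall on the same day.
    return [t1 + relativedelta(hours=-x) for x in range(1, t1.hour-t2.hour)]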
missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))
df = spark.read.csv('dates.csv',header=True,inferSchema=True)
window = Window.partitionBy("aggregator").orderBy("timestamp")
df_missing = df.withColumn("prev_timestamp",lag(col("timestamp"),1, None).over(window))\
    .filter(col("prev_timestamp").isNotNull())\
    .withColumn("timestamp", explode(missing_hours_udf(col("timestamp"), col("prev_timestamp"))))\
    .drop("prev_timestamp")
df.union(df_missing).orderBy("aggregator","timestamp").show()
which results in:
+-------------------+----------+
| timestamp|aggregator|
+-------------------+----------+
|2018-12-27 09:00:00| A|
|2018-12-27 10:00:00| A|
|2018-12-27 11:00:00| A|
|2018-12-27 12:00:00| A|
|2018-12-27 13:00:00| A|
|2018-12-27 09:00:00| B|
|2018-12-27 10:00:00| B|
|2018-12-27 11:00:00| B|
|2018-12-27 12:00:00| B|
|2018-12-27 13:00:00| B|
|2018-12-27 14:00:00| B|
+-------------------+----------+
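Because missing_hours compares only the hour-of-day fields, it returns nothing for a gap that crosses midnight. A hedged variant that works on the full time difference, using only the standard library, might look like this; the name missing_hours_between is made up here, and the sketch assumes timestamps sit on whole hours.

```python
from datetime import datetime, timedelta

def missing_hours_between(current, previous):
    """Return the whole-hour timestamps strictly between previous and current."""
    # Use the full elapsed time so gaps spanning midnight are handled too.
    gap_hours = int((current - previous).total_seconds() // 3600)
    return [previous + timedelta(hours=h) for h in range(1, gap_hours)]

same_day = missing_hours_between(datetime(2018, 12, 27, 12), datetime(2018, 12, 27, 10))
overnight = missing_hours_between(datetime(2018, 12, 28, 2), datetime(2018, 12, 27, 23))
adjacent = missing_hours_between(datetime(2018, 12, 27, 10), datetime(2018, 12, 27, 9))
```

It takes the same two arguments in the same order as missing_hours (current timestamp, then previous timestamp), so it could be wrapped in a UDF the same way.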

how to output multiple (key,value) in spark map function

The input data is formatted as follows:
+--------------------+-------------+--------------------+
| StudentID| Right | Wrong |
+--------------------+-------------+--------------------+
| studentNo01 | a,b,c | x,y,z |
+--------------------+-------------+--------------------+
| studentNo02 | c,d | v,w |
+--------------------+-------------+--------------------+
And the output should be formatted as follows:
+--------------------+---------+
| key | value|
+--------------------+---------+
| studentNo01,a | 1 |
+--------------------+---------+
| studentNo01,b | 1 |
+--------------------+---------+
| studentNo01,c | 1 |
+--------------------+---------+
| studentNo01,x | 0 |
+--------------------+---------+
| studentNo01,y | 0 |
+--------------------+---------+
| studentNo01,z | 0 |
+--------------------+---------+
| studentNo02,c | 1 |
+--------------------+---------+
| studentNo02,d | 1 |
+--------------------+---------+
| studentNo02,v | 0 |
+--------------------+---------+
| studentNo02,w | 0 |
+--------------------+---------+
Right means 1 and Wrong means 0.
I want to process this data using a Spark map function or a UDF, but I don't know how to go about it. Can you help me, please? Thank you.
Use split and explode twice, then union the results:
val df = List(
  ("studentNo01","a,b,c","x,y,z"),
  ("studentNo02","c,d","v,w")
).toDF("StudentID","Right","Wrong")
+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+
val pair = (
  df.select('StudentID,explode(split('Right,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(1))
).unionAll(
  df.select('StudentID,explode(split('Wrong,",")))
    .select(concat_ws(",",'StudentID,'col).as("key"))
    .withColumn("value",lit(0))
)
+-------------+-----+
| key|value|
+-------------+-----+
|studentNo01,a| 1|
|studentNo01,b| 1|
|studentNo01,c| 1|
|studentNo02,c| 1|
|studentNo02,d| 1|
|studentNo01,x| 0|
|studentNo01,y| 0|
|studentNo01,z| 0|
|studentNo02,v| 0|
|studentNo02,w| 0|
+-------------+-----+
You can then convert the result to an RDD of (key, value) pairs as follows:
val rdd = pair.rdd.map(r => (r.getString(0), r.getInt(1)))