Finding the max in a dataframe with multiple columns in PySpark

Please, I need help.
I'm new to PySpark and I ran into this problem.
I have a dataframe with 4 columns like this:
A   B   C   D
O1  2   E1  2
O1  3   E1  1
O1  2   E1  0
O1  5   E2  2
O1  2   E2  3
O1  2   E2  2
O1  5   E2  1
O2  8   E1  2
O2  8   E1  0
O2  0   E1  1
O2  2   E1  4
O2  9   E1  2
O2  2   E2  1
O2  9   E2  4
O2  2   E2  2
and I want to get this (the max of D for each (A, C) pair):
A   B   C   D
O1  2   E1  2
O1  2   E2  3
O2  2   E1  4
O2  9   E2  4
I tried
table.groupby("A","C").agg(round(max("D")))
It did work, but the column B is missing.

Why not use partitionBy instead of groupBy? That way you keep all your columns and retain all your records.
Edit: if you want the distinct values of (A, C), just select the columns you want and take the unique rows.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
table1 = table.withColumn("max_D", F.round(F.max('D').over(Window.partitionBy('A','C'))))
table1.select('A','B','C','max_D').distinct().show()
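Note that distinct() still returns one row per distinct value of B within an (A, C) group. If you want B taken only from the row(s) where D actually reaches the maximum, one option (a minimal sketch reusing the imports above; ties on D keep several rows) is to filter against the windowed max instead:
table2 = table.withColumn("max_D", F.max('D').over(Window.partitionBy('A','C')))
table2.filter(F.col('D') == F.col('max_D')).drop('max_D').show()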

You'd want to apply max to the pair (D, B) so you don't lose B when aggregating; Spark compares arrays element-wise, so the max is taken by D first, with the matching B carried along.
from pyspark.sql import functions as F
(df
    .groupBy('a', 'c')
    .agg(F.max(F.array('d', 'b')).alias('max_d'))
    .select(
        F.col('a'),
        F.col('c'),
        F.col('max_d')[1].alias('b'),
        F.col('max_d')[0].alias('d'),
    )
    .show()
)
+---+---+---+---+
| a| c| b| d|
+---+---+---+---+
| O1| E1| 2| 2|
| O1| E2| 2| 3|
| O2| E1| 2| 4|
| O2| E2| 9| 4|
+---+---+---+---+

Related

Is there any easier way to combine 100+ PySpark dataframes with different columns (not merge, but append)?

Suppose I have a lot of dataframes with a similar structure but different columns. I want to combine all of them together; how can I do this in an easier way?
For example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row

def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
df_comb1=customUnion(df1,df2)
df_comb2=customUnion(df_comb1,df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this more easily?
Thanks in advance
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f

def add_missing_cols(df, cols):
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v
completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]
res = completed_dfs[0]
for df in completed_dfs[1:]:
    res = res.unionAll(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
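As an aside, on Spark 3.1+ there is a shorter route: DataFrame.unionByName with allowMissingColumns=True null-fills the absent columns for you, so the alignment step collapses to a single reduce over the list. A minimal sketch, assuming the same dataframes list as above:
from functools import reduce

# unionByName(..., allowMissingColumns=True) requires Spark 3.1 or later;
# it fills columns missing on either side with nulls before the union.
res = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    dataframes,
)
res.show()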

Retrieve data for each partition based on a date range from another column in PySpark

There's a DataFrame in PySpark with data as below:
Original data:
Shop Customer date retrive_days
A C1 15/06/2019 2
A C1 16/06/2019 0
A C1 17/06/2019 0
A C1 18/06/2019 0
B C2 20/07/2019 5
B C2 21/07/2019 0
B C2 23/07/2019 0
B C2 30/07/2019 0
B C2 01/08/2019 6
B C2 02/08/2019 0
B C2 03/08/2019 0
B C2 09/08/2019 0
B C2 10/08/2019 1
B C2 11/08/2019 0
B C2 13/08/2019 0
Each row has the date a customer visited the shop and a retrive_days value; whenever retrive_days is non-zero, that many days of data from that date have to be fetched into the output.
I am trying to get an output in PySpark that looks like the one below, filtered based on the retrive_days value for each customer.
Expected Output:
Shop Customer date retrive_days
A C1 15/06/2019 2
A C1 16/06/2019 0
B C2 20/07/2019 5
B C2 21/07/2019 0
B C2 23/07/2019 0
B C2 01/08/2019 6
B C2 02/08/2019 0
B C2 03/08/2019 0
B C2 10/08/2019 1
B C2 11/08/2019 0
Try this with window functions.
In your example output, the last row (11/08/2019) should be omitted: to be consistent with how you treated retrive_days = 2, 5 and 6, a date equal to the max date (date + retrive_days) should not be included. If that is not what you want, use filter('date1<=max_date') instead.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w1=Window().partitionBy("Shop","Customer").orderBy("date1")
w2=Window().partitionBy("Shop","Customer","partitions")
df.withColumn("date1", F.to_date("date", "dd/MM/yyyy"))\
.withColumn("partitions", F.sum(F.expr("""IF(retrive_days!=0, 1, 0)""")).over(w1))\
.withColumn("max_date", F.max(F.expr("""IF(retrive_days!=0,date_add(date1,retrive_days),null)""")).over(w2))\
.filter('date1<max_date').drop("date1","max_date","partitions").show()
#+----+--------+----------+------------+
#|Shop|Customer| date|retrive_days|
#+----+--------+----------+------------+
#| A| C1|15/06/2019| 2|
#| A| C1|16/06/2019| 0|
#| B| C2|20/07/2019| 5|
#| B| C2|21/07/2019| 0|
#| B| C2|23/07/2019| 0|
#| B| C2|01/08/2019| 6|
#| B| C2|02/08/2019| 0|
#| B| C2|03/08/2019| 0|
#| B| C2|10/08/2019| 1|
#+----+--------+----------+------------+
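For reference, a minimal sketch (assuming an active SparkSession named spark and the column names from the question) to build the sample dataframe and try the snippet above:
data = [
    ("A", "C1", "15/06/2019", 2), ("A", "C1", "16/06/2019", 0),
    ("A", "C1", "17/06/2019", 0), ("A", "C1", "18/06/2019", 0),
    ("B", "C2", "20/07/2019", 5), ("B", "C2", "21/07/2019", 0),
    ("B", "C2", "23/07/2019", 0), ("B", "C2", "30/07/2019", 0),
    ("B", "C2", "01/08/2019", 6), ("B", "C2", "02/08/2019", 0),
    ("B", "C2", "03/08/2019", 0), ("B", "C2", "09/08/2019", 0),
    ("B", "C2", "10/08/2019", 1), ("B", "C2", "11/08/2019", 0),
    ("B", "C2", "13/08/2019", 0),
]
df = spark.createDataFrame(data, ["Shop", "Customer", "date", "retrive_days"])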

How to collect a map after groupBy in a PySpark dataframe?

I have a pyspark dataframe like this:
| id | time | cat |
-------------------------
1 t1 a
1 t2 b
2 t3 b
2 t4 c
2 t5 b
3 t6 a
3 t7 a
3 t8 a
Now, I want to group them by "id" and aggregate them into a Map like this:
| id | cat |
---------------------------
| 1 | a -> 1, b -> 1 |
| 2 | b -> 2, c -> 1 |
| 3 | a -> 3 |
I guess we can use collect_list from pyspark.sql.functions to collect them as a list and then apply some UDF to turn the list into a dict. But is there any other (shorter or more efficient) way to do this?
You can use map_from_entries from pyspark.sql.functions.
Assuming your dataframe is df, you would do this:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id")\
.agg(F.map_from_entries(F.collect_list(F.struct("cat","count"))).alias("cat"))
Similar to yasi's answer:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id")\
    .agg(F.map_from_arrays(F.collect_list("cat"), F.collect_list("count")).alias("cat"))
Here is how I did it.
Code
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,'t1','a'),(1,'t2','b'),(2,'t3','b'),(2,'t4','c'),(2,'t5','b'),\
(3,'t6','a'),(3,'t7','a'),(3,'t8','a')],\
('id','time','cat'))
(df.groupBy(['id', 'cat'])
    .agg(F.count(F.col('cat')).cast(StringType()).alias('counted'))
    .select(['id', F.concat_ws('->', F.col('cat'), F.col('counted')).alias('arrowed')])
    .groupBy('id')
    .agg(F.collect_list('arrowed'))
    .show()
)
Output
+-------+---------------------+
| id|collect_list(arrowed)|
+-------+---------------------+
| 1 | [a -> 1, b -> 1] |
| 3 | [a -> 3] |
| 2 | [b -> 2, c -> 1] |
+-------+---------------------+
Edit
(df.groupBy(['id', 'cat'])
    .count()
    .select(['id', F.create_map('cat', 'count').alias('map')])
    .groupBy('id')
    .agg(F.collect_list('map').alias('cat'))
    .show()
)
#+---+--------------------+
#| id| cat|
#+---+--------------------+
#| 1|[[a -> 1], [b -> 1]]|
#| 3| [[a -> 3]]|
#| 2|[[b -> 2], [c -> 1]]|
#+---+--------------------+

Advanced join of two dataframes in Spark Scala

I have to join two Dataframes.
Sample:
Dataframe1 looks like this
df1_col1 df1_col2
a ex1
b ex4
c ex2
d ex6
e ex3
Dataframe2
df2_col1 df2_col2
1 a,b,c
2 d,c,e
3 a,e,c
In result Dataframe I would like to get result like this
res_col1 res_col2 res_col3
a ex1 1
a ex1 3
b ex4 1
c ex2 1
c ex2 2
c ex2 3
d ex6 2
e ex3 2
e ex3 3
What will be the best way to achieve this join?
I have updated the code below
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF
val df2 = sc.parallelize(Seq(("1","a,b,c"),("2","d,c,e"),("3","a,e,c"))).toDF
df2.withColumn("df2_col2_explode", explode(split($"_2", ",")))
  .select($"_1".as("df2_col1"), $"df2_col2_explode")
  .join(df1.select($"_1".as("df1_col1"), $"_2".as("df1_col2")), $"df1_col1" === $"df2_col2_explode", "inner")
  .show
You just need to split the values, generate multiple rows by exploding the result, and then join with the other dataframe.
You can refer to this link: How to split pipe-separated column into multiple rows?
I used Spark SQL for this join; here is part of the code:
df1.createOrReplaceTempView("temp_v_df1")
df2.createOrReplaceTempView("temp_v_df2")
val df_result = spark.sql("""select
| b.df1_col1 as res_col1,
| b.df1_col2 as res_col2,
| a.df2_col1 as res_col3
| from (select df2_col1, exp_col
| from temp_v_df2
| lateral view explode(split(df2_col2,",")) dummy as exp_col) a
| join temp_v_df1 b on a.exp_col = b.df1_col1""".stripMargin)
I used the Spark Scala DataFrame API to achieve your desired output.
val df1 = sc.parallelize(Seq(("a","ex1"),("b","ex4"),("c","ex2"),("d","ex6"),("e","ex3"))).toDF("df1_col1","df1_col2")
val df2 = sc.parallelize(Seq((1,("a,b,c")),(2,("d,c,e")),(3,("a,e,c")))).toDF("df2_col1","df2_col2")
df2.withColumn("_tmp", explode(split($"df2_col2", "\\,"))).as("temp").join (df1,$"temp._tmp"===df1("df1_col1"),"inner").drop("_tmp","df2_col2").show
Desired output:
+--------+--------+--------+
|df2_col1|df1_col1|df1_col2|
+--------+--------+--------+
| 2| e| ex3|
| 3| e| ex3|
| 2| d| ex6|
| 1| c| ex2|
| 2| c| ex2|
| 3| c| ex2|
| 1| b| ex4|
| 1| a| ex1|
| 3| a| ex1|
+--------+--------+--------+
Rename the columns according to your requirements.

Spark: delete the previous row when some conditions match the next row

I have a dataframe like below:
type f1 f2 value
1 a xy 11
2 b ab 13
3 c na 16
3 c dir 18
3 c ls 23
I have to delete the previous row when some conditions match the next row.
For example, from the above table: when type == type(row-1) && f1 == f1(row-1) && abs(value - value(row-1)) < 2, I want to delete the previous row.
So my table should look like below:
type f1 f2 value
1 a xy 11
2 b ab 13
3 c dir 18
3 c ls 23
I am thinking we can make use of lag or lead, but I am not getting the exact logic.
Yes, it can be done using lead():
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._   // lead, when, abs
// define window specification
val windowSpec = Window.partitionBy($"type", $"f1").orderBy($"type")
val inputDF = sc.parallelize(List((1,"a","xy",11),(2,"b","ab",13),(3,"c","na",16),(3,"c","dir",18),(3,"c","ls",23))).toDF("type","f1","f2","value")
inputDF.withColumn("leadValue", lead($"value", 1).over(windowSpec))
  .withColumn("result", when(abs($"leadValue" - $"value") <= 2, 1).otherwise(0)) // check the condition against the next row
  .filter($"result" === 0)       // keep only rows where the condition does not hold
  .drop("leadValue", "result")   // remove the helper columns
  .orderBy($"type")
  .show
Output:
+----+---+---+-----+
|type| f1| f2|value|
+----+---+---+-----+
| 1| a| xy| 11|
| 2| b| ab| 13|
| 3| c|dir| 18|
| 3| c| ls| 23|
+----+---+---+-----+
Since we already partition by type and f1, we don't need to check their equality explicitly.
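For readers working in PySpark, a rough equivalent of the same lead-based approach (a sketch, assuming a PySpark dataframe inputDF with the same columns):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("type", "f1").orderBy("type")
(inputDF
    .withColumn("leadValue", F.lead("value", 1).over(w))
    # flag a row when the next row in the same (type, f1) group is within 2 of it
    .withColumn("result", F.when(F.abs(F.col("leadValue") - F.col("value")) <= 2, 1).otherwise(0))
    .filter(F.col("result") == 0)
    .drop("leadValue", "result")
    .orderBy("type")
    .show())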