I have a dataframe where I create a new column based on a reduction calculation over existing columns.
I need to add a check so that if the reduction value used is higher than a particular threshold, it is set equal to that threshold, i.e. it should not exceed it.
I've tried wrapping a when statement both inside and after the .withColumn call:
df = df.withColumn('total_new_load',
col('existing_load') * (5 - col('tot_reduced_load')))
Basically I need to add an if-statement of some sort, in PySpark syntax, to my dataframe code, such as:
if tot_reduced_load > 50
then
tot_reduced_load = 50
Try this
from pyspark.sql import functions as F
df.withColumn("tot_reduced_load ", F.when(F.col("tot_reduced_load")>50,50)).otherwise(F.col("tot_reduced_load"))
Try this -
Sample Data:
df = spark.createDataFrame([(1,30),(2,40),(3,60)],['row_id','tot_reduced_load'])
df.show()
#+------+----------------+
#|row_id|tot_reduced_load|
#+------+----------------+
#| 1| 30|
#| 2| 40|
#| 3| 60|
#+------+----------------+
Option1: withColumn
from pyspark.sql import functions as psf
tot_reduced_load_new = psf.when(psf.col("tot_reduced_load") > 50, 50).otherwise(psf.col("tot_reduced_load"))
df.withColumn("tot_reduced_load_new", tot_reduced_load_new).show()
#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#| 1| 30| 30|
#| 2| 40| 40|
#| 3| 60| 50|
#+------+----------------+--------------------+
Option2: selectExpr
df.selectExpr("*","CASE WHEN tot_reduced_load > 50 THEN 50 ELSE tot_reduced_load END AS tot_reduced_load_new").show()
#+------+----------------+--------------------+
#|row_id|tot_reduced_load|tot_reduced_load_new|
#+------+----------------+--------------------+
#| 1| 30| 30|
#| 2| 40| 40|
#| 3| 60| 50|
#+------+----------------+--------------------+
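To tie this back to the original total_new_load calculation, the capped expression can be plugged straight into the question's formula; a sketch assuming the existing_load column from the question:
from pyspark.sql import functions as F
capped_load = F.when(F.col("tot_reduced_load") > 50, 50).otherwise(F.col("tot_reduced_load"))
df = df.withColumn("total_new_load", F.col("existing_load") * (5 - capped_load))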
I have the following sample PySpark dataframe. After a groupby I want to calculate the mean of some columns and the first value of others. In the real case I have hundreds of columns, so I can't do it individually.
sp = spark.createDataFrame([['a',2,4,'cc','anc'], ['a',4,7,'cd','abc'], ['b',6,0,'as','asd'], ['b', 2, 4, 'ad','acb'],
['c', 4, 4, 'sd','acc']], ['id', 'col1', 'col2','col3', 'col4'])
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| a| 2| 4| cc| anc|
| a| 4| 7| cd| abc|
| b| 6| 0| as| asd|
| b| 2| 4| ad| acb|
| c| 4| 4| sd| acc|
+---+----+----+----+----+
This is what I am trying
mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
sc.groupby('id').agg(*[ f.mean for col in mean_cols], *[f.first for col in first_cols])
but it's not working. How can I do it like this with PySpark?
The best way for multiple functions on multiple columns is to use the .agg(*expr) format.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
fn_l = [F.min,F.max,F.mean,F.first]
col_l=['col1','col2','col3']
expr = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
tst_r = tst.groupby('col4').agg(*expr)
The result will be
tst_r.show()
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
|col4|min_col1|min_col2|min_col3|max_col1|max_col2|max_col3|mean_col1|mean_col2|mean_col3|first_col1|first_col2|first_col3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
| 5| 5| 6| 7| 7| 8| 9| 6.0| 7.0| 8.0| 5| 6| 7|
| 4| 1| 2| 3| 3| 4| 5| 2.0| 3.0| 4.0| 1| 2| 3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
For selectively applying functions on columns, you can have multiple expression arrays and concatenate them in aggregation.
fn_l = [F.min,F.max]
fn_2=[F.mean,F.first]
col_l=['col1','col2']
col_2=['col1','col3','col4']
expr1 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
expr2 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_2 for coln in col_2]
tst_r = tst.groupby('col4').agg(*(expr1+expr2))
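As a quick sanity check (not part of the original answer), you can list the output column names before running the job, since the aliases are built from the function and column names; DataFrame.columns only inspects the schema:
# Schema-only check; this does not trigger the aggregation
print(tst_r.columns)
# ['col4', 'min_col1', 'min_col2', 'max_col1', 'max_col2',
#  'mean_col1', 'mean_col3', 'mean_col4', 'first_col1', 'first_col3', 'first_col4']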
A simpler way to do it:
import pyspark.sql.functions as F
# mean_cols / first_cols are the lists of columns to average and to take the first value of
tst_r = (tst.groupby('col4')
            .agg(*[F.mean(col).alias(f"{col}_mean") for col in mean_cols],
                 *[F.first(col).alias(f"{col}_first") for col in first_cols]))
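Applied to the sp DataFrame and the column lists from the question, the same pattern would look like this (a sketch; the alias format is just one choice):
import pyspark.sql.functions as F
mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
sp_agg = sp.groupby('id').agg(
    *[F.mean(c).alias(f"{c}_mean") for c in mean_cols],
    *[F.first(c).alias(f"{c}_first") for c in first_cols]
)
sp_agg.show()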
I need to "clone" or "duplicate"/"triplicate" every row from my dataframe.
I didn't find anything about it; I just know that I need to use explode.
Example:
ID - Name
1 John
2 Maria
3 Charles
Output:
ID - Name
1 John
1 John
2 Maria
2 Maria
3 Charles
3 Charles
Thanks
You could use array_repeat with explode (Spark 2.4+).
For duplicate:
from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.array_repeat("Name",2)))
For triplicate:
df.withColumn("Name", F.explode(F.array_repeat("Name",3)))
For Spark < 2.4:
#duplicate: F.array builds a 2-element array of the Name column, which explode turns into 2 rows
df.withColumn("Name", F.explode(F.array(*(['Name'] * 2))))
#triplicate
df.withColumn("Name", F.explode(F.array(*(['Name'] * 3))))
UPDATE:
In order to use another column, Support, to specify how many times each row should be replicated, you could use this (Spark 2.4+).
df.show()
#+---+-------+-------+
#| ID| Name|Support|
#+---+-------+-------+
#| 1| John| 2|
#| 2| Maria| 4|
#| 3|Charles| 6|
#+---+-------+-------+
from pyspark.sql import functions as F
df.withColumn("Name", F.explode(F.expr("""array_repeat(Name,int(Support))"""))).show()
#+---+-------+-------+
#| ID| Name|Support|
#+---+-------+-------+
#| 1| John| 2|
#| 1| John| 2|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 2| Maria| 4|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#| 3|Charles| 6|
#+---+-------+-------+
For Spark 1.5+, using repeat, concat, substring, split & explode:
from pyspark.sql import functions as F
# repeat(concat(Name,','), Support) yields e.g. 'John,John,'; substring strips the trailing
# comma, and split + explode turns the string into one row per repetition
df.withColumn("Name", F.expr("""repeat(concat(Name,','),Support)"""))\
  .withColumn("Name", F.explode(F.expr("""split(substring(Name,1,length(Name)-1),',')"""))).show()
I have a PySpark dataframe with a non-unique key column key and columns number and value.
For most keys, the number column goes from 1 to 12, but for some of them there are gaps (for example, we have numbers [1, 2, 5, 9]). I would like to add the missing rows so that for every key all the numbers in the range 1-12 are present, populated with the last seen value.
So that for table
key number value
a 1 6
a 2 10
a 5 20
a 9 25
I would like to get
key number value
a 1 6
a 2 10
a 3 10
a 4 10
a 5 20
a 6 20
a 7 20
a 8 20
a 9 25
a 10 25
a 11 25
a 12 25
I thought about creating a table of a and an array of 1-12, exploding the array and joining with my original table, then separately populating the value column with previous value using a window function bounded by current row. However, it seems a bit inelegant and I wonder if there is a better way to achieve what I want?
I do not think your proposed approach is inelegant - but you can achieve the same using range instead of explode.
First create a dataframe with all the numbers in your range. You will also want to cross join this with the distinct key column from your DataFrame.
all_numbers = spark.range(1, 13).withColumnRenamed("id", "number")
all_numbers = all_numbers.crossJoin(df.select("key").distinct()).cache()
all_numbers.show()
#+------+---+
#|number|key|
#+------+---+
#| 1| a|
#| 2| a|
#| 3| a|
#| 4| a|
#| 5| a|
#| 6| a|
#| 7| a|
#| 8| a|
#| 9| a|
#| 10| a|
#| 11| a|
#| 12| a|
#+------+---+
Now you can outer join this to your original DataFrame and forward fill using the last known good value. If the number of distinct keys is small enough, you may be able to broadcast all_numbers to avoid shuffling it in the join.
from pyspark.sql.functions import broadcast, last
from pyspark.sql import Window
df.join(broadcast(all_numbers), on=["number", "key"], how="outer")\
    .withColumn(
        "value",
        last("value", ignorenulls=True).over(
            Window.partitionBy("key").orderBy("number")
                  .rowsBetween(Window.unboundedPreceding, 0)
        )
    )\
    .show()
#+------+---+-----+
#|number|key|value|
#+------+---+-----+
#| 1| a| 6|
#| 2| a| 10|
#| 3| a| 10|
#| 4| a| 10|
#| 5| a| 20|
#| 6| a| 20|
#| 7| a| 20|
#| 8| a| 20|
#| 9| a| 25|
#| 10| a| 25|
#| 11| a| 25|
#| 12| a| 25|
#+------+---+-----+
You could do this without a join. I have tested this with different gaps and it will always work as long as number 1 is always present in the input (the sequence needs to start there) and the range always runs up to 12. I used a couple of windows to derive a column I could use in the sequence, then built a custom sequence with expressions, and finally exploded it to get the desired result. If for some reason you have inputs that do not contain number 1, let me know and I will update my solution.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import when
w=Window().partitionBy("key").orderBy("number")
w2=Window().partitionBy("key").orderBy("number").rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df.withColumn("number2", F.lag("number").over(w)).withColumn("diff", F.when((F.col("number2").isNotNull()) & ((F.col("number")-F.col("number2")) > 1), (F.col("number")-F.col("number2"))).otherwise(F.lit(0)))\
.withColumn("diff2", F.lead("diff").over(w)).withColumn("diff2", F.when(F.col("diff2").isNull(), F.lit(0)).otherwise(F.col("diff2"))).withColumn("diff2", F.when(F.col("diff2")!=0, F.col("diff2")-1).otherwise(F.col("diff2"))).withColumn("max", F.max("number").over(w2))\
.withColumn("diff2", F.when((F.col("number")==F.col("max")) & (F.col("number")<F.lit(12)), F.lit(12)-F.col("number")).otherwise(F.col("diff2")))\
.withColumn("number2", F.when(F.col("diff2")!=0,F.expr("""sequence(number,number+diff2,1)""")).otherwise(F.expr("""sequence(number,number+diff2,0)""")))\
.drop("diff","diff2","max")\
.withColumn("number2", F.explode("number2")).drop("number")\
.select("key", F.col("number2").alias("number"), "value")\
.show()
+---+------+-----+
|key|number|value|
+---+------+-----+
| a| 1| 6|
| a| 2| 10|
| a| 3| 10|
| a| 4| 10|
| a| 5| 20|
| a| 6| 20|
| a| 7| 20|
| a| 8| 20|
| a| 9| 25|
| a| 10| 25|
| a| 11| 25|
| a| 12| 25|
+---+------+-----+
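For what it's worth, the same sequence-and-explode idea can be written more compactly with lead, filling each gap up to (but not including) the next observed number and padding the last row out to 12; a sketch assuming Spark 2.4+ and the key/number/value schema above (next_number is just a temporary helper column):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("key").orderBy("number")

# For each row, generate all numbers up to the next observed number (or 12 for the last row)
filled = (df.withColumn("next_number", F.lead("number").over(w))
            .withColumn("number", F.explode(F.expr(
                "sequence(number, coalesce(next_number - 1, 12))")))
            .drop("next_number"))
filled.show()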
I have the following DataFrame with columns: ["id", "timestamp", "x", "y"]:
+---+----------+---+---+
| id| timestamp| x| y|
+---+----------+---+---+
| 0|1443489380|100| 1|
| 0|1443489390|200| 0|
| 0|1443489400|300| 0|
| 1|1443489410|400| 1|
| 1|1443489550|100| 1|
| 2|1443489560|600| 0|
| 2|1443489570|200| 0|
| 2|1443489580|700| 1|
+---+----------+---+---+
I have defined the following Window:
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy("timestamp")
I would like to extract only the first and last row of data in the window w. How can I accomplish this?
If you want the first and last values on the same row, one way is to use pyspark.sql.functions.first():
from pyspark.sql import Window
from pyspark.sql.functions import col, first
w1 = Window.partitionBy("id").orderBy("timestamp")
w2 = Window.partitionBy("id").orderBy(col("timestamp").desc())  # sort desc
df.select(
"id",
*([first(c).over(w1).alias("first_" + c) for c in df.columns if c != "id"] +
[first(c).over(w2).alias("last_" + c) for c in df.columns if c != "id"])
)\
.distinct()\
.show()
#+---+---------------+-------+-------+--------------+------+------+
#| id|first_timestamp|first_x|first_y|last_timestamp|last_x|last_y|
#+---+---------------+-------+-------+--------------+------+------+
#| 0| 1443489380| 100| 1| 1443489400| 300| 0|
#| 1| 1443489410| 400| 1| 1443489550| 100| 1|
#| 2| 1443489560| 600| 0| 1443489580| 700| 1|
#+---+---------------+-------+-------+--------------+------+------+
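Not part of the original answer, but the two windows can also be collapsed into one by pairing first with last over a frame that spans the whole partition; a sketch assuming the same df:
from pyspark.sql import Window
from pyspark.sql.functions import first, last

# One ascending window whose frame covers the entire partition,
# so first() and last() see all rows for each id
w = (Window.partitionBy("id").orderBy("timestamp")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.select(
    "id",
    *([first(c).over(w).alias("first_" + c) for c in df.columns if c != "id"] +
      [last(c).over(w).alias("last_" + c) for c in df.columns if c != "id"])
).distinct().show()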
I am trying to add a few columns based on the input variable vIssueCols.
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window
vIssueCols=['jobid','locid']
vQuery1 = 'vSrcData2= vSrcData'
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
for x in vIssueCols:
    vQuery1 = vQuery1 + '.withColumn("' + x + '_prev",F.lag(vSrcData.' + x + ').over(vWindow1))'
exec(vQuery1)
Now the loop above generates vQuery1 as shown below, and it is working:
vSrcData2= vSrcData.withColumn("jobid_prev",F.lag(vSrcData.jobid).over(vWindow1)).withColumn("locid_prev",F.lag(vSrcData.locid).over(vWindow1))
But can't I write something like
vSrcData2= vSrcData.withColumn(x+"_prev",F.lag(vSrcData.x).over(vWindow1))for x in vIssueCols
and generate the columns with a loop statement? Some blogs have suggested adding a UDF and calling it, but rather than using a UDF I am using the string-and-exec approach above.
You can build your query using reduce.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
from functools import reduce
#sample data
df = sc.parallelize([[1, 200, '1234', 'asdf'],
                     [1, 50, '2345', 'qwerty'],
                     [1, 100, '4567', 'xyz'],
                     [2, 300, '123', 'prem'],
                     [2, 10, '000', 'ankur']]).\
    toDF(["vKey","vOrderBy","jobid","locid"])
df.show()
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
#your existing processing
df1= df.\
withColumn("jobid_prev",lag(df.jobid).over(vWindow1)).\
withColumn("locid_prev",lag(df.locid).over(vWindow1))
df1.show()
#to-be processing
vIssueCols=['jobid','locid']
df2 = reduce(
    lambda r_df, col_name: r_df.withColumn(col_name + "_prev", lag(r_df[col_name]).over(vWindow1)),
    vIssueCols,
    df
)
df2.show()
Sample data:
+----+--------+-----+------+
|vKey|vOrderBy|jobid| locid|
+----+--------+-----+------+
| 1| 200| 1234| asdf|
| 1| 50| 2345|qwerty|
| 1| 100| 4567| xyz|
| 2| 300| 123| prem|
| 2| 10| 000| ankur|
+----+--------+-----+------+
Output:
+----+--------+-----+------+----------+----------+
|vKey|vOrderBy|jobid| locid|jobid_prev|locid_prev|
+----+--------+-----+------+----------+----------+
| 1| 50| 2345|qwerty| null| null|
| 1| 100| 4567| xyz| 2345| qwerty|
| 1| 200| 1234| asdf| 4567| xyz|
| 2| 10| 000| ankur| null| null|
| 2| 300| 123| prem| 000| ankur|
+----+--------+-----+------+----------+----------+
Hope this helps!
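As a follow-up to the "can't I write ... for x in vIssueCols" idea from the question: a plain Python for loop over withColumn is equivalent to the reduce version and avoids exec entirely; a minimal sketch assuming the same df, vIssueCols and vWindow1 as above:
from pyspark.sql.functions import lag

df2 = df
for c in vIssueCols:
    # Each iteration returns a new DataFrame with one extra *_prev column
    df2 = df2.withColumn(c + "_prev", lag(df2[c]).over(vWindow1))
df2.show()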