'DataFrame' object is not callable in pyspark

I want the names of employees who have a higher salary than their department's average salary, in PySpark.
filt = df3.select('SALARY','Dept_name','First_name','Last_name')
filt.filter(filt('SALARY').geq(filt.groupBy('Dept_name').agg(F.mean('SALARY')))).show()
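The error itself comes from calling the DataFrame like a function: filt('SALARY') is Scala syntax, and a PySpark DataFrame object is not callable, hence "'DataFrame' object is not callable". Columns are referenced as filt['SALARY'] or F.col('SALARY'), and .geq() becomes >=. Comparing each row against a grouped aggregate also needs that aggregate joined back (or a window, as in the answer below). A minimal join-based sketch, reusing the question's filt DataFrame:
from pyspark.sql import functions as F
# Department averages as their own DataFrame ...
dept_avg = filt.groupBy('Dept_name').agg(F.mean('SALARY').alias('Avg_salary'))
# ... joined back so every row can be compared with its department's average
filt.join(dept_avg, on='Dept_name') \
    .filter(F.col('SALARY') > F.col('Avg_salary')) \
    .select('SALARY', 'Dept_name', 'First_name', 'Last_name') \
    .show()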

Creating Sample DataFrame:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
data = [[200, 'Marketing', 'Jane', 'Smith'],
        [140, 'Marketing', 'Jerry', 'Soreky'],
        [120, 'Marketing', 'Justin', 'Sauren'],
        [170, 'Sales', 'Joe', 'Statham'],
        [190, 'Sales', 'Jeremy', 'Sage'],
        [220, 'Sales', 'Jay', 'Sawyer']]
columns = ['SALARY', 'Dept_name', 'First_name', 'Last_name']
df = spark.createDataFrame(data, columns)
df.show()
+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
|   200|Marketing|      Jane|    Smith|
|   140|Marketing|     Jerry|   Soreky|
|   120|Marketing|    Justin|   Sauren|
|   170|    Sales|       Joe|  Statham|
|   190|    Sales|    Jeremy|     Sage|
|   220|    Sales|       Jay|   Sawyer|
+------+---------+----------+---------+
Creating the query to retrieve the people who have a higher salary than their department average:
w = Window.partitionBy("Dept_name")
df.withColumn("Average_Salary", F.avg("SALARY").over(w))\
  .filter(F.col("SALARY") > F.col("Average_Salary"))\
  .select("SALARY", "Dept_name", "First_name", "Last_name")\
  .show()
+------+---------+----------+---------+
|SALARY|Dept_name|First_name|Last_name|
+------+---------+----------+---------+
|   220|    Sales|       Jay|   Sawyer|
|   200|Marketing|      Jane|    Smith|
+------+---------+----------+---------+

pyspark: duplicate row with column value from another row

My input df:
+-------------------+------------+
| windowStart| nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1 |
|2022-03-11 15:00:00|2 |
|2022-03-11 16:00:00|3 |
I would like to duplicate each row and use the windowStart value of the subsequent row, so the output should look like this:
+-------------------+------------+
| windowStart| nodeId|
+-------------------+------------+
|2022-03-11 14:00:00|1 |
|2022-03-11 15:00:00|1 |
|2022-03-11 15:00:00|2 |
|2022-03-11 16:00:00|2 |
|2022-03-11 16:00:00|3 |
How can I achieve that? Thanks!
df = spark.createDataFrame(
    [
        ('2022-03-11 14:00:00', '1'),
        ('2022-03-11 15:00:00', '2'),
        ('2022-03-11 16:00:00', '3')
    ], ['windowStart', 'nodeId'])
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.orderBy('windowStart')
# lead() fetches the next row's windowStart (aliased 'lag' here), which becomes the
# windowStart of the duplicated row; the null lead on the last row is dropped
df_lag = df\
    .withColumn('lag', F.lead(F.col("windowStart"), 1).over(w))\
    .select(F.col('lag').alias('windowStart'), 'nodeId')\
    .filter(F.col('windowStart').isNotNull())
df.union(df_lag)\
    .orderBy('windowStart', 'nodeId')\
    .show()
+-------------------+------+
| windowStart|nodeId|
+-------------------+------+
|2022-03-11 14:00:00| 1|
|2022-03-11 15:00:00| 1|
|2022-03-11 15:00:00| 2|
|2022-03-11 16:00:00| 2|
|2022-03-11 16:00:00| 3|
+-------------------+------+
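A single-pass variant (a sketch on the same df, still over an unpartitioned window, so all rows are shuffled to one partition) is to explode an array holding the current and the next windowStart:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.orderBy('windowStart')
# Pair each row's windowStart with the next row's; explode emits one output row per
# array element, and the trailing null produced by lead() on the last row is filtered out
df.withColumn('starts', F.array('windowStart', F.lead('windowStart', 1).over(w))) \
    .select(F.explode('starts').alias('windowStart'), 'nodeId') \
    .filter(F.col('windowStart').isNotNull()) \
    .orderBy('windowStart', 'nodeId') \
    .show()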

Unsure how to apply row-wise normalization on pyspark dataframe

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    7.5|    0.1|    2.0|
|    2|    0.3|    3.5|   10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    1.0|  0.013|   0.26|
|    2|  0.028|   0.33|    1.0|
+-----+-------+-------+-------+
Any help would be appreciated.
You can find the row-wise maximum from an array of the columns, then loop over the columns to replace each with its normalized value.
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]
# Row-wise maximum across all value columns
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
# Divide every column by that maximum
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
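With roughly 2,900 columns, calling withColumn in a loop can produce a very deep query plan. A sketch of the same normalization done in a single select (assuming, as above, that the first column is the id) keeps the plan to one projection:
from pyspark.sql.functions import array, array_max, col
cols = df.columns[1:]
# Row-wise maximum, reused inside every division
row_max = array_max(array(*[col(c) for c in cols]))
df2 = df.select('id', *[(col(c) / row_max).alias(c) for c in cols])
df2.show()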

How to do regexp_replace in one line in pyspark dataframe?

I have a PySpark dataframe with a Gender column:
df.groupBy('Gender').count().show()
+------+------+
|Gender| count|
+------+------+
| F| 44015|
| null| 42175|
| M|104423|
| | 1|
+------+------+
I am doing regexp_replace:
#df = df.fillna({'Gender':'missing'})
df = df.withColumn('Gender', regexp_replace('Gender', 'F','Female'))
df = df.withColumn('Gender', regexp_replace('Gender', 'M','Male'))
df = df.withColumn('Gender', regexp_replace('Gender', ' ','missing'))
Instead of calling df for each line, can this be done in one line?
If you do not want to use regexp_replace three times, you can use a when/otherwise clause.
from pyspark.sql import functions as F
from pyspark.sql.functions import when
df.withColumn("Gender", F.when(F.col("Gender")=='F',F.lit("Female"))\
.when(F.col("Gender")=='M',F.lit("Male"))\
.otherwise(F.lit("missing"))).show()
+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
|missing| 42175|
| Male|104423|
|missing| 1|
+-------+------+
Or you could do your three regexp_replace in one line like this:
from pyspark.sql.functions import regexp_replace
df.withColumn('Gender', regexp_replace(regexp_replace(regexp_replace('Gender', 'F','Female'),'M','Male'),' ','missing')).show()
+-------+------+
| Gender| count|
+-------+------+
| Female| 44015|
| null| 42175|
| Male|104423|
|missing| 1|
+-------+------+
I think when/otherwise should outperform the three regexp_replace calls, because with regexp_replace you also need fillna to handle the nulls.
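If you prefer to keep the regexp_replace one-liner, a sketch that also folds in the null handling (coalesce playing the role of fillna) gives the same result as the when/otherwise version:
from pyspark.sql import functions as F
# Nested replacements as before, with coalesce turning the remaining nulls into 'missing'
df.withColumn('Gender', F.coalesce(
    F.regexp_replace(F.regexp_replace(F.regexp_replace('Gender', 'F', 'Female'), 'M', 'Male'), ' ', 'missing'),
    F.lit('missing'))).show()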

Forward-fill missing data in PySpark not working

I have a simple dataset, as shown below.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing is happening. The column that is supposed to have its null values filled, temp_filled_spark, remains unchanged, i.e. it is just a copy of the original languages column.
import sys
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?
We can create a window treating the entire dataframe as one partition:
import sys
from pyspark.sql import functions as F
from pyspark.sql import Window
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
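One caveat with the answer above: ordering by F.lit(1) gives Spark no real ordering, so the fill order is not guaranteed. A sketch that orders by the id column instead (assuming id reflects the intended row order) keeps the forward fill deterministic:
from pyspark.sql import Window
from pyspark.sql import functions as F
# Unbounded-preceding window over the whole frame, ordered by a real column
w = Window.orderBy('id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1.withColumn('temp_filled_spark', F.last('languages', ignorenulls=True).over(w)).show()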
Hope this helps!

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  (0, "23-07-2017"),
  (0, "26-07-2017"),
  (0, "01-08-2017"),
  (0, "25-08-2017"),
  (1, "01-01-2016"),
  (1, "04-01-2016"),
  (1, "10-01-2016")
).toDF("Store_id", "Date_d_id")

def daysDiff = udf(
  (d1: String, d2: String) => {
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit.DAYS
    DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
  }
)

val df2 = df.
  withColumn( "Date_ymd",
    regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
  withColumn( "Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn( "Days_diff",
    when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
    otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp for the sorting to work correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give you the dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference as:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
Now the final step is to use groupBy and aggregation to find the average:
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want the rows with a null Day_diff to count towards the denominator as well (the plain avg above ignores them), you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful