How to refer to columns containing f-strings in a Pyspark function?

I am writing a function for a Spark DataFrame that performs operations on columns and gives them a suffix, so that I can run the function twice with two different suffixes and join the results later.
I am having a hard time figuring out the best way to refer to the columns in this particular bit of code, and was wondering what I am missing.
def calc_date(sdf, suffix):
    final_sdf = (
        sdf.withColumn(
            f"lowest_days{suffix}",
            f"sdf.list_of_days_{suffix}"[0],
        )
        .withColumn(
            f"earliest_date_{suffix}",
            f"sdf.list_of_dates_{suffix}"[0],
        )
        .withColumn(
            f"actual_date_{suffix}",
            spark_fns.expr(
                f"date_sub(earliest_date_{suffix}, lowest_days{suffix})"
            ),
        )
    )
Here I am trying to pull the first value from two lists (list_of_days and list_of_dates) and perform a date calculation to create a new variable (actual_date).
I would like to do this in a function so that I don't have to repeat the same set of operations twice (or more), depending on the number of suffixes I have.
But the f-strings give the error col should be Column.
Any help on this would be greatly appreciated!

You need to wrap the second argument with a col().
from pyspark.sql.functions import *

def calc_date(sdf, suffix):
    final_sdf = (
        sdf.withColumn(
            f"lowest_days{suffix}",
            col(f"list_of_days_{suffix}")[0],
        )
        .withColumn(
            f"earliest_date_{suffix}",
            col(f"list_of_dates_{suffix}")[0],
        )
    )
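For reference, here is a sketch of the full function with the date calculation from the question folded back in (assuming the input frame really has array columns named list_of_days_<suffix> and list_of_dates_<suffix>):

from pyspark.sql import functions as F

def calc_date(sdf, suffix):
    # Take the first element of each array column, then derive the actual date.
    return (
        sdf.withColumn(f"lowest_days{suffix}", F.col(f"list_of_days_{suffix}")[0])
        .withColumn(f"earliest_date_{suffix}", F.col(f"list_of_dates_{suffix}")[0])
        .withColumn(
            f"actual_date_{suffix}",
            F.expr(f"date_sub(earliest_date_{suffix}, lowest_days{suffix})"),
        )
    )

# Usage: run once per suffix, then join the two results on a shared key.
# df_a = calc_date(sdf, "_a")
# df_b = calc_date(sdf, "_b")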

Related

Pyspark dynamic column name

I have a dataframe which contains months and will change quite frequently. I am saving the dataframe values as a list, e.g. months = ['202111', '202112', '202201'], and using a for loop to iterate through all list elements, trying to provide dynamic column names with the following code:
for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999)).alias(
                concat("col3_"), f.lit(i.col)
            )
        )
    )
So basically, in the alias I am trying to give the column name as a combination of a constant (minInv_) and a variable (e.g. 202111), but I am getting an error. How can I give a column name as a combination of a fixed string and a variable?
Thanks in advance!
.alias("col3_" + str(i))
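Putting it together, a sketch of the loop with the dynamic alias (condition and adjustment_1_prepared_df are placeholders carried over from the question; each month's result is collected into a list instead of overwriting df on every iteration):

from pyspark.sql import functions as f

months = ["202111", "202112", "202201"]
results = []
for i in months:
    results.append(
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999))
            .alias(f"col3_{i}")  # fixed prefix + loop variable
        )
    )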

How do you use aggregated values within PySpark SQL when() clause?

I am trying to learn PySpark, and have been trying to learn how to use SQL when() clauses to better categorize my data. (See here: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/) What I can't seem to figure out is how to insert actual scalar values into the when() conditions for comparison. The aggregate functions seem to return tabular results rather than actual float() values.
I keep getting the error message unsupported operand type(s) for -: 'method' and 'method'. When I tried running functions to aggregate another column of the original data frame, I noticed the result didn't seem to be a flat scalar so much as a table (agg(select(f.stddev("Col"))) gives a result like "DataFrame[stddev_samp(TAXI_OUT): double]"). Here is a sample you can use to replicate; I was wondering how you might get aggregate values like the standard deviation and mean inside the when() clause so you can use them to categorize the new column:
samp = spark.createDataFrame(
    [("A", "A1", 4, 1.25), ("B", "B3", 3, 2.14), ("C", "C2", 7, 4.24),
     ("A", "A3", 4, 1.25), ("B", "B1", 3, 2.14), ("C", "C1", 7, 4.24)],
    ["Category", "Sub-cat", "quantity", "cost"])
psMean = samp.agg({'quantity': 'mean'})
psStDev = samp.agg({'quantity': 'stddev'})
psCatVect = samp.withColumn('quant_category', .when(samp['quantity'] <= (psMean - psStDev), 'small').otherwise('not small'))
psMean and psStDev in your example are DataFrames; you need to use the collect() method to extract the scalar values:
psMean = samp.agg({'quantity':'mean'}).collect()[0][0]
psStDev = samp.agg({'quantity':'stddev'}).collect()[0][0]
You could also compute all the stats in one variable as a pandas DataFrame and reference it later in the PySpark code:
from pyspark.sql import functions as F

stats = (
    samp.select(
        F.mean("quantity").alias("mean"),
        F.stddev("quantity").alias("std"),
    ).toPandas()
)

(
    samp.withColumn(
        'quant_category',
        F.when(
            samp['quantity'] <= stats["mean"].item() - stats["std"].item(),
            'small',
        ).otherwise('not small'),
    )
    .toPandas()
)
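If you would rather keep everything in Spark (no collect() or toPandas()), one option is to cross-join the single-row aggregate back onto the data. This is a sketch, not part of the original answers:

from pyspark.sql import functions as F

stats = samp.agg(
    F.mean("quantity").alias("q_mean"),
    F.stddev("quantity").alias("q_std"),
)

psCatVect = (
    samp.crossJoin(stats)  # stats is a single row, so the cross join is cheap
    .withColumn(
        "quant_category",
        F.when(F.col("quantity") <= F.col("q_mean") - F.col("q_std"), "small")
        .otherwise("not small"),
    )
    .drop("q_mean", "q_std")
)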

How do I select columns from a Spark dataframe when I also need to use withColumnRenamed?

I have a dataframe:
df = df.select("employee_id", "employee_name", "employee_address")
I need to rename the first two fields, but also still select the third field. So I thought this would work, but this appears to only select employee_address.
df = (
    df.withColumnRenamed("employee_id", "empId")
    .withColumnRenamed("employee_name", "empName")
    .select("employee_address")
)
How do I properly rename the first two fields while also selecting the third field?
I tried a mix of withColumn usages, but that doesn't work. Do I have to use a select on all three fields?
You can use the alias command:
import pyspark.sql.functions as func
df = df.select(
    func.col("employee_id").alias("empId"),
    func.col("employee_name").alias("empName"),
    func.col("employee_address"),
)
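Since withColumnRenamed keeps every column of the dataframe, an equivalent alternative is to rename first and then select by the new names:

df = (
    df.withColumnRenamed("employee_id", "empId")
    .withColumnRenamed("employee_name", "empName")
    .select("empId", "empName", "employee_address")
)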

Pyspark actions on dataframe [duplicate]

I have the following code:
val df_in = sqlcontext.read.json(jsonFile) // the file resides in hdfs
//some operations in here to create df as df_in with two more columns "terms1" and "terms2"
val intersectUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { seq1 intersect seq2 } ) //intersects two sequences
val symmDiffUDF = udf( (seq1:Seq[String], seq2:Seq[String] ) => { (seq1 diff seq2) ++ (seq2 diff seq1) } ) //compute the difference of two sequences
val df1 = (df.withColumn("termsInt", intersectUDF(df("terms1"), df1("terms2")))
  .withColumn("termsDiff", symmDiffUDF(df("terms1"), df1("terms2")))
  .where(size(col("termsInt")) > 0 && size(col("termsDiff")) > 0 && size(col("termsDiff")) <= 2)
  .cache()
) // add the intersection and difference columns and filter the resulting DF
df1.show()
df1.count()
The app works properly and fast up to the show(), but the count() step creates 40,000 tasks.
My understanding is that df1.show() should trigger the full creation of df1, so df1.count() should then be very fast. What am I missing here? Why is count() that slow?
Thank you very much in advance,
Roxana
show is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy to df1's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the number of partitions your DF is split into. Try df1.repartition(<a sensible number here, depending on cluster and DF size>) and run count again.
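For example (shown in PySpark syntax; the same calls exist in the Scala API, and 200 is only an illustrative partition count):

df1.rdd.getNumPartitions()   # inspect how many partitions df1 currently has
df1 = df1.repartition(200)   # pick a number suited to the cluster and data size
df1.count()                  # the count now runs with far fewer tasks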
show() by default displays only 20 rows. If the first partition returns more than 20 rows, the remaining partitions are not executed.
Note that show has several variants: asking for more rows, e.g. show(1000), forces more partitions to be evaluated and may take more time. So show() is equivalent to show(20), which is only a partial action.

Implementing SQL logic via Dataframes using Spark and Scala

I have three columns (c1, c2, c3) in a Hive table t1, and MySQL code that checks whether specific columns are null. I also have a dataframe df built from the same table, with the same three columns c1, c2, c3, and I would like to implement the same logic on the dataframe.
Here is the SQL-
if(
    t1.c1 = 0 Or IsNull(t1.c1),
    if(
        IsNull(t1.c2 / t1.c3),
        1,
        t1.c2 / t1.c3
    ),
    t1.c1
) AS myalias
I drafted the following logic in Scala using "when" as an alternative to SQL's "if", but I am having trouble writing the "Or" logic. How can I write the above SQL logic on a Spark dataframe using Scala?
val df_withalias = df.withColumn("myalias",when(
Or((df("c1") == 0), isnull(df("c1"))),
when(
(isNull((df("c2") == 0)/df("c3")),
)
)
)
How can I write the above logic?
First, you can use Column's || operator to construct logical OR conditions. Also - note that when takes only 2 arguments (condition and value), and if you want to supply an alternative value (to be used if condition isn't met) - you need to use .otherwise:
val df_withalias = df.withColumn("myalias",
when(df("c1") === 0 || isnull(df("c1")),
when(isnull(df("c2")/df("c3")), 1).otherwise(df("c2")/df("c3"))
).otherwise(df("c1"))
)
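For readers working in PySpark rather than Scala, a rough equivalent of the same logic (a sketch, not part of the original answer) would be:

from pyspark.sql import functions as F

df_withalias = df.withColumn(
    "myalias",
    F.when(
        (F.col("c1") == 0) | F.isnull(F.col("c1")),
        F.when(F.isnull(F.col("c2") / F.col("c3")), 1)
         .otherwise(F.col("c2") / F.col("c3")),
    ).otherwise(F.col("c1")),
)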