CDC with PySpark

I am trying to write PySpark code that handles two scenarios.
Scenario 1:
Input data:
col1|col2|date
100|Austin|2021-01-10
100|Newyork|2021-02-15
100|Austin|2021-03-02
Expected output with CDC:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2021-02-15
100|Newyork|2021-02-15|2021-03-02
100|Austin|2021-03-02|2099-12-31
The col2 value changes between consecutive records, and I want to maintain the CDC history.
Scenario 2:
Input:
col1|col2|date
100|Austin|2021-01-10
100|Austin|2021-03-02 -> I want to eliminate this version because there is no change in col1 and col2 values between records.
Expected Output:
col1|col2|start_date|end_date
100|Austin|2021-01-10|2099-12-31
I am looking for the same code to work in both scenarios.
I am trying something like this, but it does not work for both scenarios:
inputdf = inputdf.groupBy('col1', 'col2', 'date').agg(F.min("date").alias("r_date"))
inputdf = inputdf.drop("date").withColumnRenamed("r_date", "start_date")
my_allcolumnwindowasc = Window.partitionBy('col1', 'col2').orderBy("start_date")
inputdf = inputdf.withColumn('dropDuplicates', F.lead(inputdf.start_date).over(my_allcolumnwindowasc)) \
    .where(F.col("dropDuplicates").isNotNull()) \
    .drop('dropDuplicates')
There are more than 20 columns in some of the scenarios.
Thanks for help!

check this out.
Steps:
Use a window function to give the row number
Convert the dataframe to a view
Use a self join (the join conditions are the key)
Use the lead window function wrapped in coalesce so that a null (the latest version) gets the "2099-12-31" value
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession \
    .builder \
    .appName("SO") \
    .getOrCreate()

# Scenario 1 input
df = spark.createDataFrame(
    [(100, "Austin", "2021-01-10"),
     (100, "Newyork", "2021-02-15"),
     (100, "Austin", "2021-03-02"),
     ],
    ['col1', 'col2', 'date']
)

# Scenario 2 input
# df = spark.createDataFrame(
#     [(100, "Austin", "2021-01-10"),
#      (100, "Austin", "2021-03-02"),
#      ],
#     ['col1', 'col2', 'date']
# )

df1 = df.withColumn("start_date", F.to_date("date"))

# Row number per col1, ordered by start_date
w = Window.partitionBy("col1").orderBy("start_date")
df_1 = df1.withColumn("rn", F.row_number().over(w))
df_1.createTempView("temp_1")

# Self join each row to the previous row (rn - 1) of the same col1;
# rows whose col2 did not change are marked "delete"
df_dupe = spark.sql(
    'select temp_1.col1, temp_1.col2, temp_1.start_date, '
    'case when temp_1.col1 = temp_2.col1 and temp_1.col2 = temp_2.col2 then "delete" else "no-delete" end as dupe '
    'from temp_1 left join temp_1 as temp_2 '
    'on temp_1.col1 = temp_2.col1 and temp_1.col2 = temp_2.col2 and temp_1.rn - 1 = temp_2.rn '
    'order by temp_1.start_date')

# Keep only the changed rows and derive end_date from the next start_date,
# defaulting to 2099-12-31 for the latest version
df_dupe.filter(F.col("dupe") == "no-delete").drop("dupe") \
    .withColumn("end_date", F.coalesce(F.lead("start_date").over(w), F.lit("2099-12-31"))).show()
# Result:
# Scenario1:
#+----+-------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+-------+----------+----------+
# | 100| Austin|2021-01-10|2021-02-15|
# | 100|Newyork|2021-02-15|2021-03-02|
# | 100| Austin|2021-03-02|2099-12-31|
# +----+-------+----------+----------+
#
# Scenario 2:
# +----+------+----------+----------+
# |col1| col2|start_date| end_date|
# +----+------+----------+----------+
# | 100|Austin|2021-01-10|2099-12-31|
# +----+------+----------+----------+
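Since some of the tables have 20+ columns, a join condition spelled out column by column gets unwieldy. Below is a minimal alternative sketch (not part of the answer above) that assumes col1 is the key, date is the ordering column, and every other column should trigger a new version when it changes; it flags changes with lag and eqNullSafe, so it works for any number of tracked columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

key_cols = ["col1"]                                        # assumption: the business key
track_cols = [c for c in df.columns if c not in key_cols + ["date"]]

w = Window.partitionBy(*key_cols).orderBy("date")

# A row is "changed" when any tracked column differs from the previous row.
# eqNullSafe treats NULL == NULL as equal, and the first row per key is always
# kept because lag() returns NULL there, which is not null-safe-equal to a value.
changed = F.lit(False)
for c in track_cols:
    changed = changed | ~F.lag(c).over(w).eqNullSafe(F.col(c))

deduped = (df.withColumn("start_date", F.to_date("date"))
             .withColumn("changed", changed)
             .filter("changed")
             .drop("date", "changed"))

w2 = Window.partitionBy(*key_cols).orderBy("start_date")
deduped.withColumn(
    "end_date",
    F.coalesce(F.lead("start_date").over(w2), F.to_date(F.lit("2099-12-31")))
).show()
On the two sample inputs above this should give the same results: three versions for scenario 1 and a single 2021-01-10 to 2099-12-31 row for scenario 2.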

Related

Why do we need to specify the column names twice in PySpark pandas UDF?

I am experimenting with pandas UDFs in PySpark 3.2.1. I have constructed a simple series-to-series example where the return type is a struct. It works OK, but it is not very elegant because the column names in the struct apparently have to be defined twice. Is there a better way to write the UDF to avoid this?
# create spark session
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName('learn')
.config('spark.sql.execution.arrow.pyspark.enabled', True)
.config('spark.sql.execution.arrow.pyspark.fallback.enabled', False)
.getOrCreate()
)
# create dataframe
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
g = np.tile(['group a','group b'], 10)
x = np.linspace(0, 10., 20)
np.random.seed(3) # set seed for reproducibility
y_lin = 2*x + 10*np.random.rand(len(x))
y_qua = 3*x**2 + 10*np.random.rand(len(x))
df = pd.DataFrame({'group': g, 'x': x, 'y_lin': y_lin, 'y_qua': y_qua})
schema = T.StructType([
    T.StructField('group', T.StringType(), nullable=False),
    T.StructField('x', T.DoubleType(), nullable=False),
    T.StructField('y_lin', T.DoubleType(), nullable=False),
    T.StructField('y_qua', T.DoubleType(), nullable=False),
])
df = spark.createDataFrame(df, schema=schema)
def show_frame(df, n=5):
    df.select([F.format_number(F.col(col), 3).alias(col)
               if df.select(col).dtypes[0][1] == 'double'
               else col
               for col in df.columns]).show(truncate=False, n=n)
show_frame(df)
# +-------+-----+------+------+
# |group |x |y_lin |y_qua |
# +-------+-----+------+------+
# |group a|0.000|5.508 |2.835 |
# |group b|0.526|8.134 |7.762 |
# |group a|1.053|5.014 |7.729 |
# |group b|1.579|8.266 |9.048 |
# |group a|2.105|13.140|18.743|
# +-------+-----+------+------+
# only showing top 5 rows
# create and use the pandas UDF
schema = T.StructType([
    T.StructField('y_lin', T.DoubleType()),  # <- first place where columns are named
    T.StructField('y_qua', T.DoubleType()),
])

@F.pandas_udf(schema)
def create_struct(col1: pd.Series, col2: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({'y_lin': col1, 'y_qua': col2})  # <- second place where columns are named
res = df.select(F.col('y_lin'), F.col('y_qua'), create_struct(F.col('y_lin'), F.col('y_qua')).alias('created struct'))
show_frame(res)
# +------+------+--------------------------------------+
# |y_lin |y_qua |created struct |
# +------+------+--------------------------------------+
# |5.508 |2.835 |{5.507979025745755, 2.835250817713187}|
# |8.134 |7.762 |{8.134109805128416, 7.762404113877886}|
# |5.014 |7.729 |{5.014310547024181, 7.728636899699084}|
# |8.266 |9.048 |{8.266170788818735, 9.047901761480935}|
# |13.140|18.743|{13.139995859266019, 18.7428890722852}|
# +------+------+--------------------------------------+
# only showing top 5 rows
Many thanks for your help.
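One possible way to avoid naming the columns twice (just a sketch, and whether it counts as more elegant is a matter of taste) is to keep the output names in a single list and derive both the return schema and the returned pandas DataFrame from it:
import pandas as pd
from pyspark.sql import functions as F
import pyspark.sql.types as T

out_cols = ['y_lin', 'y_qua']                 # named once
out_schema = T.StructType([T.StructField(c, T.DoubleType()) for c in out_cols])

@F.pandas_udf(out_schema)
def create_struct(col1: pd.Series, col2: pd.Series) -> pd.DataFrame:
    # build the result from the shared name list instead of repeating the names
    return pd.DataFrame(dict(zip(out_cols, (col1, col2))))

res = df.select('y_lin', 'y_qua', create_struct('y_lin', 'y_qua').alias('created struct'))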

PySpark Working with Delta tables - For Loop Optimization with Union

I'm currently working in databricks and have a delta table with 20+ columns. I basically need to take a value from 1 column in each row, send it to an api which returns two values/columns, and then create the other 26 to merge the values back to the original delta table. So input is 28 columns and output is 28 columns. Currently my code looks like:
from pyspark.sql.types import *
from pyspark.sql import functions as F
import requests, uuid, json
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.functions import col,lit
from functools import reduce
spark.conf.set("spark.sql.adaptive.enabled","true")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
spark.sql('set spark.sql.execution.arrow.pyspark.enabled = true')
spark.conf.set("spark.databricks.optimizer.dynamicPartitionPruning","true")
spark.conf.set("spark.sql.parquet.compression.codec","gzip")
spark.conf.set("spark.sql.inMemorycolumnarStorage.compressed","true")
spark.conf.set("spark.databricks.optimizer.dynamicFilePruning","true");
output=spark.sql("select * from delta.`table`").cache()
SeriesAppend=[]
for i in output.collect():
    # small mapping fix
    if i['col1'] == 'val1':
        var0 = 'a'
    elif i['col1'] == 'val2':
        var0 = 'b'
    elif i['col1'] == 'val3':
        var0 = 'c'
    elif i['col1'] == 'val4':
        var0 = 'd'
    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)
    # subscription info
    headers = {header}
    body = [{
        'text': i['col2']
    }]
    if len(i['col2']) < 500:
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        dumps = json.dumps(response[0])
        loads = json.loads(dumps)
        json_rdd = sc.parallelize(loads)
        json_df = spark.read.json(json_rdd)
        json_df = json_df.withColumn('col1', lit(i['col1']))
        json_df = json_df.withColumn('col2', lit(i['col2']))
        json_df = json_df.withColumn('col3', lit(i['col3']))
        ...
        SeriesAppend.append(json_df)
    else:
        pass
Series_output=reduce(DataFrame.unionAll, SeriesAppend)
SAMPLE DF with only 3 columns:
df = spark.createDataFrame(
[
("a", "cat","owner1"), # create your data here, be consistent in the types.
("b", "dog","owner2"),
("c", "fish","owner3"),
("d", "fox","owner4"),
("e", "rat","owner5"),
],
["col1", "col2", "col3"]) # add your column names here
I really just need to write the response + other column values to a delta table, so dataframes are not necessarily required, but haven't found a faster way than the above. Right now, I can run 5 inputs, which returns 15 in 25.3 seconds without the unionAll. With the inclusion of the union, it turns into 3 minutes.
The final output would look like:
df = spark.createDataFrame(
[
("a", "cat","owner1","MI", 48003), # create your data here, be consistent in the types.
("b", "dog","owner2", "MI", 48003),
("c", "fish","owner3","MI", 48003),
("d", "fox","owner4","MI", 48003),
("e", "rat","owner5","MI", 48003),
],
["col1", "col2", "col3", "col4", "col5"]) # add your column names here
How can I make this faster in spark?
As mentioned in my comments, you should use a UDF to distribute the workload to the workers instead of collecting and letting a single machine (the driver) run everything. That is simply the wrong approach and it is not scalable.
# This is your main function, pure Python and you can unittest it in any way you want.
# The most important about this function is:
# - everything must be encapsulated inside the function, no global variable works here
def req(col1, col2):
    if col1 == 'val1':
        var0 = 'a'
    elif col1 == 'val2':
        var0 = 'b'
    elif col1 == 'val3':
        var0 = 'c'
    elif col1 == 'val4':
        var0 = 'd'
    var0 = set([var0])
    req_var = set(['a', 'b', 'c', 'd'])
    var_list = list(req_var - var0)
    # subscription info
    headers = {header}  # !!! `header` must be available **inside** this function, a global won't work
    body = [{
        'text': col2
    }]
    if len(col2) < 500:
        # !!! same as `header`, `constructed_url` must be available **inside** this function, a global won't work
        request = requests.post(constructed_url, params=params, headers=headers, json=body)
        response = request.json()
        # the field names here depend on what your API actually returns
        return (response['col4'], response['col5'])
    else:
        return None
# Now you wrap the function above into a Spark UDF.
# I'm using only 2 columns here as input, but you can use as many columns as you wish.
# Same as output, I'm using only a tuple with 2 elements, you can make it as many items as you wish.
df.withColumn('temp', F.udf(req, T.ArrayType(T.StringType()))('col1', 'col2')).show()
# Output
# +----+----+------+------------------+
# |col1|col2| col3| temp|
# +----+----+------+------------------+
# | a| cat|owner1|[foo_cat, bar_cat]|
# | b| dog|owner2|[foo_dog, bar_dog]|
# | c|fish|owner3| null|
# | d| fox|owner4| null|
# | e| rat|owner5| null|
# +----+----+------+------------------+
# Now all you have to do is extract the tuple and assign to separate columns
# (and delete temp column to cleanup)
(df
.withColumn('col4', F.col('temp')[0])
.withColumn('col5', F.col('temp')[1])
.drop('temp')
.show()
)
# Output
# +----+----+------+-------+-------+
# |col1|col2| col3| col4| col5|
# +----+----+------+-------+-------+
# | a| cat|owner1|foo_cat|bar_cat|
# | b| dog|owner2|foo_dog|bar_dog|
# | c|fish|owner3| null| null|
# | d| fox|owner4| null| null|
# | e| rat|owner5| null| null|
# +----+----+------+-------+-------+
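If you prefer named columns over indexing into an array, a variant of the same UDF (a sketch, assuming the API really exposes fields you can map to col4 and col5) is to declare a struct return type and expand it with a nested select:
from pyspark.sql import functions as F
import pyspark.sql.types as T

resp_schema = T.StructType([
    T.StructField('col4', T.StringType()),
    T.StructField('col5', T.StringType()),
])

# req() is the same pure-Python function defined above; a returned tuple is
# mapped to the struct fields by position, and a returned None becomes a null struct
req_udf = F.udf(req, resp_schema)

(df
 .withColumn('resp', req_udf('col1', 'col2'))
 .select('col1', 'col2', 'col3', 'resp.col4', 'resp.col5')
 .show()
)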

Validate data from the same column in different rows with pyspark

How can I change the value of a column depending on a validation between cells? I need to compare the kilometraje values of each customer's (id) records to check whether the kilometraje in the following record is higher.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is there because, for customer (id) 2, the kilometraje value is lower than in the same customer's record for 2/1/2019 (as time passes the car is used, so kilometraje increases; for a record to be error-free, the mileage has to be higher than or equal to the previous one).
I know that with withColumn I can overwrite or create a column that doesn't exist, and that with when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and overwrite error_code with ERROR where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark import StorageLevel
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col

file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))
df = df.withColumn('error_code',
                   F.when((F.col('estado') == 'O') &
                          (F.col('id_cliente') != '') |
                          (F.col('estado') == 'D') &
                          (F.col('id_cliente') != '') |
                          (F.col('estado') == 'A') &
                          (F.col('id_cliente') == ''),
                          F.concat(F.col("error_code"), F.lit(":[{}]".format(error)))
                          )
                   .otherwise(F.col('error_code')))
You can achieve that with the lag window function. lag returns the value from the previous row within the window, so you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 30 ),
('7/1/2019', 3 , 30 ),
('4/1/2019', 2 , 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha, 'dd/MM/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' as the previous value had a smaller kilometraje value (10 < 30). When you want to label all the id's with 'ERROR' which contain at least one corrupted row, perform a left join.
df.drop('error_km').join(df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')), 'id', 'left').show()
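The same flagging written out step by step, in case the one-liner is hard to read (a sketch that should be equivalent):
# ids that contain at least one corrupted row
bad_ids = (df.filter(df.error_km == 'ERROR')
             .select('id').distinct()
             .withColumn('error_km', F.lit('ERROR')))

# left join back, so every row of a corrupted id carries the flag (other ids get null)
df.drop('error_km').join(bad_ids, 'id', 'left').show()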
I use .rangeBetween(Window.unboundedPreceding, 0).
This makes the window frame run from the start of the partition up to the current row, so the running maximum can be compared with the current value.
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 22 ),
('7/1/2019', 1 , 23 ),
('22/1/2019', 2 , 5),
('11/1/2019', 2 , 24),
('13/2/2019', 1 , 16),
('14/2/2019', 2 , 18),
('5/2/2019', 1 , 19),
('6/2/2019', 2 , 23),
('7/2/2019', 1 , 14),
('8/3/2019', 1 , 50),
('8/3/2019', 2 , 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to drop the helper column that holds the running maximum:
df = df.drop('max')
df.show()

Spark Compare two dataframes and find the match count

I have two Spark SQL dataframes, neither of which has a unique column. The first dataframe contains n-grams, the second one contains long text strings (blog posts). I want to find the matches in df2 and add the count to df1.
DF1
------------
words
------------
Stack
Stack Overflow
users
spark scala
DF2
--------
POSTS
--------
Hello, Stack overflow users , Do you know spark scala
Spark scala is very fast
Users in stack are good in spark, users
Expected output
------------ ---------------
words match_count
------------ ---------------
Stack 2
Stack Overflow 1
users 3
spark scala 1
A brute-force approach in Scala follows. It does not work across lines, and everything is treated as lowercase; more handling could be added, but that is for another day. Rather than trying to examine raw strings, it defines n-grams on both sides (since that is what the words really are), generates them, and then JOINs and counts, where only the inner join is relevant. Some extra data is added to prove the matching works.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField,StructType,IntegerType,ArrayType,LongType,StringType}
import spark.implicits._
// Sample data, duplicates and items to check it works.
val dfPostsInit = Seq(
( "Hello!!, Stack overflow users, Do you know spark scala users."),
( "Spark scala is very fast,"),
( "Users in stack are good in spark"),
( "Users in stack are good in spark"),
( "xy z"),
( "x yz"),
( "ABC"),
( "abc"),
( "XYZ,!!YYY##$ Hello Bob..."))
.toDF("posting")
val dfWordsInit = Seq(("Stack"), ("Stack Overflow"),("users"), ("spark scala"), ("xyz"), ("xy"), ("not found"), ("abc")).toDF("words")
val dfWords = dfWordsInit.withColumn("words_perm" ,regexp_replace(dfWordsInit("words"), " ", "^")).withColumn("lower_words_perm" ,lower(regexp_replace(dfWordsInit("words"), " ", "^")))
val dfPostsTemp = dfPostsInit.map(r => (r.getString(0), r.getString(0).split("\\W+").toArray ))
// Tidy Up
val columnsRenamed = Seq("posting", "posting_array")
val dfPosts = dfPostsTemp.toDF(columnsRenamed: _*)
// Generate n-grams up to some limit N (which needs to be set) so that we can count via a direct JOIN comparison. N can be parametrized in the calls below.
// String matching over an Array is not straightforward, and no other answer presented one.
def buildNgrams(inputCol: String = "posting_array", n: Int = 3) = {
val ngrams = (1 to n).map(i =>
new NGram().setN(i)
.setInputCol(inputCol).setOutputCol(s"${i}_grams")
)
new Pipeline().setStages((ngrams).toArray)
}
val suffix:String = "_grams"
var i_grams_Cols:List[String] = Nil
for(i <- 1 to 3) {
val iGCS = i.toString.concat(suffix)
i_grams_Cols = i_grams_Cols ::: List(iGCS)
}
// Generate data to check against later via rows only (not via columns); positional dependency counts, hence the permutations.
val dfPostsNGrams = buildNgrams().fit(dfPosts).transform(dfPosts)
val dummySchema = StructType(
StructField("phrase", StringType, true) :: Nil)
var dfPostsNGrams2 = spark.createDataFrame(sc.emptyRDD[Row], dummySchema)
for (i <- i_grams_Cols) {
val nameCol = col({i})
dfPostsNGrams2 = dfPostsNGrams2.union (dfPostsNGrams.select(explode({nameCol}).as("phrase")).toDF )
}
val dfPostsNGrams3 = dfPostsNGrams2.withColumn("lower_phrase_concatenated",lower(regexp_replace(dfPostsNGrams2("phrase"), " ", "^")))
val result = dfPostsNGrams3.join(dfWords, col("lower_phrase_concatenated") ===
col("lower_words_perm"), "inner")
.groupBy("words_perm", "words")
.agg(count("*").as("match_count"))
result.select("words", "match_count").show(false)
returns:
+--------------+-----------+
|words |match_count|
+--------------+-----------+
|spark scala |2 |
|users |4 |
|abc |2 |
|Stack Overflow|1 |
|xy |1 |
|Stack |3 |
|xyz |1 |
+--------------+-----------+
Seems that join-groupBy-count will do:
df1
.join(df2, expr("lower(posts) rlike lower(words)"))
.groupBy("words")
.agg(count("*").as("match_count"))
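A PySpark equivalent of that join-groupBy-count, for reference (a sketch assuming df1 has a words column and df2 a POSTS column; note that rlike treats each word entry as a regular expression and counts matching posts, not occurrences inside a post):
from pyspark.sql import functions as F

(df1.join(df2, F.expr("lower(POSTS) rlike lower(words)"))
    .groupBy("words")
    .agg(F.count("*").alias("match_count"))
    .show())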
You can use pandas features in PySpark. Here is my solution below:
>>> from pyspark.sql import Row
>>> import pandas as pd
>>>
>>> rdd1 = sc.parallelize(['Stack','Stack Overflow','users','spark scala'])
>>> data1 = rdd1.map(lambda x: Row(x))
>>> df1=spark.createDataFrame(data1,['words'])
>>> df1.show()
+--------------+
| words|
+--------------+
| Stack|
|Stack Overflow|
| users|
| spark scala|
+--------------+
>>> rdd2 = sc.parallelize([
... 'Hello, Stack overflow users , Do you know spark scala',
... 'Spark scala is very fast',
... 'Users in stack are good in spark'
... ])
>>> data2 = rdd2.map(lambda x: Row(x))
>>> df2=spark.createDataFrame(data2,['posts'])
>>> df2.show()
+--------------------+
| posts|
+--------------------+
|Hello, Stack over...|
|Spark scala is ve...|
|Users in stack ar...|
+--------------------+
>>> dfPd1 = df1.toPandas()
>>> dfPd2 = df2.toPandas().apply(lambda x: x.str.lower())
>>>
>>> words = dict((x,0) for x in dfPd1['words'])
>>>
>>> for i in words:
... x = dfPd2['posts'].str.contains(i.lower()).sum()
... if i in words:
... words[i] = x
...
>>>
>>> words
{'Stack': 2, 'Stack Overflow': 1, 'users': 2, 'spark scala': 2}
>>>
>>> data = pd.DataFrame.from_dict(words, orient='index').reset_index()
>>> data.columns = ['words','match_count']
>>>
>>> df = spark.createDataFrame(data)
>>> df.show()
+--------------+-----------+
| words|match_count|
+--------------+-----------+
| Stack| 2|
|Stack Overflow| 1|
| users| 2|
| spark scala| 2|
+--------------+-----------+

Spark Dataframe GroupBy and compute Complex aggregate function

Using a Spark dataframe, I need to compute a percentage with the following
complex formula:
Group by "KEY" and calculate "re_pcnt" as ( sum(sa) / sum( sa / (pct/100) ) ) * 100
For Instance , Input Dataframe is
val values1 = List(List("01", "20000", "45.30"), List("01", "30000", "45.30"))
.map(row => (row(0), row(1), row(2)))
val DS1 = values1.toDF("KEY", "SA", "PCT")
DS1.show()
+---+-----+-----+
|KEY| SA| PCT|
+---+-----+-----+
| 01|20000|45.30|
| 01|30000|45.30|
+---+-----+-----+
Expected Result:
+---+---------------+
|KEY|        re_pcnt|
+---+---------------+
| 01| 45.30000038505|
+---+---------------+
I have tried to calculate as below
val result = DS1.groupBy("KEY").agg(((sum("SA").divide(
sum(
("SA").divide(
("PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
But I am facing the error: Error:(36, 16) value divide is not a member of String ("SA").divide({
Any suggestions on implementing the above logic?
You can try importing spark.implicits._ and then use $ to refer to a column.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val result = DS1.groupBy("KEY")
.agg(((sum($"SA").divide(sum(($"SA").divide(($"PCT").divide(100))))) * 100)
.as("re_pcnt"))
Which will give you the requested output.
If you do not want to import you can always use the col() command instead of $.
It is possible to use a string as input to the agg() function with the use of expr(). However, the input string needs to be changed a bit. The following gives exactly the same result as before, but uses a string instead:
val opr = "sum(SA)/(sum(SA/(PCT/100))) * 100"
val df = DS1.groupBy("KEY").agg(expr(opr).as("re_pcnt"))
Note that .as("re_pcnt") needs to be inside the agg() method; it cannot be outside.
Your code works almost perfectly. You just have to put the '$' symbol in order to specify you're passing a column:
val result = DS1.groupBy($"KEY").agg(((sum($"SA").divide(
sum(
($"SA").divide(
($"PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
Here's the output:
result.show()
+---+-------+
|KEY|re_pcnt|
+---+-------+
| 01| 45.3|
+---+-------+
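For reference, the same aggregation in PySpark (a sketch assuming a DataFrame DS1 with KEY, SA and PCT columns; SA and PCT may need an explicit cast to double if they were read as strings):
from pyspark.sql import functions as F

result = (DS1.groupBy("KEY")
             .agg(((F.sum("SA") / F.sum(F.col("SA") / (F.col("PCT") / 100))) * 100)
                  .alias("re_pcnt")))
result.show()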