PySpark - withColumn

I am new to PySpark and Databricks. I have some basic knowledge of them, but I am having a hard time understanding one expression.
#Calculating workdays using workdaycal()
dfx = (df.withColumn('DAYS_OUTSTANDING_WORKDAYS_1',(workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"),col("CURRENT_DATE_TIME_GMT"))).cast(DoubleType()))
.withColumn('TAT_RESOLVED_WORKDAYS_1',(workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"),col("LAST_RESOLVED_DATE_TIME_GMT_1"))).cast(DoubleType()))
.withColumn('TAT_CLOSED_WORKDAYS_1',(workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"),col("CLOSED_DATE_TIME_GMT_1"))).cast(DoubleType()))
)
In the above code, I am unable to figure out what workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"), col("CURRENT_DATE_TIME_GMT")) does.
workdaycal is a user-defined function that looks like this:
def get_bizday(sdate, edate, holidays_list):
    h1, m1, s1 = sdate.hour, sdate.minute, sdate.second
    # ... further code in the function; it returns a float value at the end
    return float(val)

def workdaycal(holidays):
    return udf(lambda l, e: get_bizday(l, e, holidays))
holidays_list is a list of dates that I am passing to workdaycal.
Could anyone help me figure out what this expression does?

I have simplified your use case to reproduce the results. In the simplified form, the UDF returns a string representation of the input arguments.
At present, your problem statement is as follows:
workdaycal(holidays_list) returns a UDF.
So workdaycal(holidays_list)(col(...), col(...)) first calls workdaycal(holidays_list), which builds and returns that UDF; the UDF is then called with the two col(...) arguments, while holidays_list is captured in the lambda's closure.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
def get_bizday(sdate, edate, holidays_list):
    return f"sdate={sdate}, edate={edate}, holidays_list={holidays_list}"

def workdaycal(holidays):
    return udf(lambda l, e: get_bizday(l, e, holidays))
holidays_list = ["Saturday", "Sunday"]
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", (workdaycal(holidays_list)(col("OPEN_DATE_TIME_GMT"),col("CURRENT_DATE_TIME_GMT"))).cast(StringType()))
You can simplify this into a direct call to get_bizday (now decorated with @udf); but there is a problem: holidays_list is not a Spark column type (it is a Python list).
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
@udf(returnType=StringType())
def get_bizday(sdate, edate, holidays_list):
    return f"sdate={sdate}, edate={edate}, holidays_list={holidays_list}"
holidays_list = ["Saturday", "Sunday"]
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", get_bizday(col("OPEN_DATE_TIME_GMT"), col("CURRENT_DATE_TIME_GMT"), holidays_list))
This results in an error complaining that holidays_list is not one of the accepted Spark column types:
TypeError: Invalid argument, not a string or column: ['Saturday', 'Sunday'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
You can solve this by removing holidays_list from the parameter list, broadcasting it instead, and accessing it inside the UDF with holidays_list_broadcast.value. A broadcast variable is a read-only shared variable that is cached and available on all nodes in the cluster.
df = spark.createDataFrame([["2022-01-01 09:00:00", "2022-01-01 08:00:00"],["2022-01-02 10:00:00", "2022-01-02 07:30:00"]], ["OPEN_DATE_TIME_GMT", "CURRENT_DATE_TIME_GMT"])
holidays_list_broadcast = spark.sparkContext.broadcast(holidays_list)
@udf(returnType=StringType())
def get_bizday(sdate, edate):
    return f"sdate={sdate}, edate={edate}, holidays_list={holidays_list_broadcast.value}"
df = df.withColumn("DAYS_OUTSTANDING_WORKDAYS_1", get_bizday(col("OPEN_DATE_TIME_GMT"),col("CURRENT_DATE_TIME_GMT")))
Now you are calling get_bizday() in a simpler and more readable way.

Related

Spark UDF as function parameter, UDF is not in function scope

I have a few UDFs that I'd like to pass as function arguments along with DataFrames.
One way to do this might be to create the UDF within the function, but that would create and destroy several instances of the UDF without reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
val df = inputDF1
  .withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
  .withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
  df.withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return either a primitive type or a user-defined case class type. Any thoughts/pointers would be much appreciated. Thanks.
Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.UserDefinedFunction
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
df.withColumn("new_col", func(col("col1")))
}
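With that signature in place, the helper can be reused across DataFrames exactly as the question intended (inputDF1 and inputDF2 are the DataFrames from the question):
val lkpUDF = udf { (i: Int) => if (i > 0) 1 else 0 }
val df  = appendCols(inputDF1, lkpUDF)
val df2 = appendCols(inputDF2, lkpUDF)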
The Scala REPL is quite helpful here, since it prints the type of the values you initialize.
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function that you pass into the udf wrapper has an Any return type (which will be the case if the function can return either a primitive or a user-defined case class), the udf call will fail with an exception like this:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
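One way around this (a sketch, not from the answer above) is to keep Any out of the signature by defining a separate udf for each concrete return type; a case class return type then shows up as a struct column:
import org.apache.spark.sql.functions.udf

case class Lookup(code: Int)

// Each udf has a single, concrete return type, so Spark can derive a schema for it
val lkpPrimitiveUDF = udf { (i: Int) => if (i > 0) 1 else 0 }
val lkpCaseClassUDF = udf { (i: Int) => Lookup(if (i > 0) 1 else 0) }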

Scala iterator on pattern match

I need help iterating over this piece of code, written in Spark/Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a DataFrame, it casts a column if its name matches a pattern; otherwise it selects the field unchanged.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);
val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a DataFrame, an Array[String] of column names (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same DataFrame with the right cast applied at the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    case `colName` => df(colName).cast(typeName).as(colName)
    case other => df(other)
  }
  df.select(colsCasted: _*)
}
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  colsWithTypes.foldLeft(df) { (accDf, colAndType) =>
    castOnce(accDf, colAndType._1, colAndType._2)
  }
}
When you have a function that you need to apply repeatedly, threading each result into the next call, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the DataFrame each time and passing the resulting DataFrame on to the next pair, and so on.
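As a usage sketch with the columns from the question (assuming the type names are strings that Spark's cast accepts, e.g. "int", "double"):
val casted = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))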
Based on your edit I filled in the function above. I don't have a compiler to hand, so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the DataFrame, it looks the type up in the map.
If an entry exists, we cast the column to that new type. If the column does not exist in the map, we keep the column from the DataFrame as it is.
We then select these columns from the DataFrame.

Spark: UDF not reading already defined value

I have a function that I am trying to apply to a DataFrame via a UDF. It applies a category based on the value in a particular column. The function makes use of a value defined earlier in my code. The code looks like this:
object myFuncs extends App {
  val sc = new SparkContext()
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import org.apache.spark.sql.functions.{col, udf}

  val categories = List(10.0, 20.0)

  def makeCategory(value: Double): String = {
    if (value < categories(0)) "< 10"
    else if (value >= categories(0) && value < categories(1)) "10 to 20"
    else ">= 20"
  }

  val myFunc = udf(makeCategory _)
  val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet").withColumn("category", myFunc(col("myColumn")))
}
This produces a NullPointerException when the UDF tries to read the categories variable inside the function. It works fine if I explicitly define the categories variable inside the function. Ultimately, I want to pass that in as an argument, so I can't define it inside the function.
Is there any explanation for why it won't read values defined outside the function in the UDF? Any suggestion on how to make this work without explicitly defining the values in the function? I tried using the lit function and passing the list as an argument, but it didn't like having a list as a lit.
The simple solution is to pass the categories in the call itself; then it will work fine. You have to change your function as follows:
def makeCategory(value: Double, categoriesString: String): String = {
  val categories = categoriesString.split(",").map(_.toDouble)
  if (value < categories(0)) "< 10"
  else if (value >= categories(0) && value < categories(1)) "10 to 20"
  else ">= 20"
}
So now you can register this function as a UDF, but you have to use it like the following, passing the categories as a column literal with lit:
val myFunc = udf(makeCategory _)
val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
  .withColumn("category", myFunc(col("myColumn"), lit("10,20")))
Hopefully it will help in your case.
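A closure-based alternative (a sketch, not part of the answer above) is to build the UDF from a helper that takes the categories as a parameter, the same pattern as workdaycal in the first question; the categories are then captured in the UDF's closure at the point where the helper is called, rather than read from a field of the App object:
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: returns a UDF with the category boundaries baked in
def makeCategoryUdf(categories: Seq[Double]) = udf { (value: Double) =>
  if (value < categories(0)) "< 10"
  else if (value < categories(1)) "10 to 20"
  else ">= 20"
}

val categoryUdf = makeCategoryUdf(Seq(10.0, 20.0))
val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
  .withColumn("category", categoryUdf(col("myColumn")))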

Create new column with function in Spark Dataframe

I'm trying to figure out the new DataFrame API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a DataFrame with 2 columns, "ID" and "Amount". As a generic example, say I want to return a new column called "Code" that returns a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then I start getting compile errors in the function because it wants a boolean in the if statement.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.
Let's say you have an "Amt" column in your schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column.
We should avoid defining udf functions as much as possible, due to the overhead of serializing and deserializing the column data. You can achieve the same result with the built-in when function, as below:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))
Another way of doing this:
You can create any function, but given the above error, you should define the function as a udf assigned to a variable.
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))

Formatting the join rdd - Apache Spark

I have two key-value pair RDDs; I join the two RDDs and save the result as a text file. Here is the code:
val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))
val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val output = final_res.saveAsTextFile("C:/out")
my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))
How can I get rid of all the parentheses?
I want my output as follows:
534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None
When outputting to a text file, Spark will just use the toString representation of each element in the RDD. If you want control over the format, you can do one last transform of the data to a String before the call to saveAsTextFile.
Luckily the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
  val (f1, ((f2, f3, f4, f5, f6, f7, f8, f9), f10)) = tuple
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line will take the tuple that is passed into the map function and assign the components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
In cases with fewer fields, or when you're just hacking away at a problem in the REPL, a slight alternative to the above is to use pattern matching in the partial function passed to map.
simpleJoinedRdd.map { case (key, (left, right)) => s"$key,$left,$right" }
While that does allow you to make it a single-line expression, it can throw exceptions at runtime if the data in the RDD don't match the pattern provided, as opposed to the earlier example where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
You can do something like this:
import scala.collection.JavaConversions._
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
  .map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))
I used Guava to format the string, but there is probably a Scala way of doing this.
Do a flatMap before saving. Or, you can write a simple format function and use it in map.
Adding a bit of code, just to show how it can be done; the function formatOnDemand can be anything.
def formatOnDemand(t):
    out = []
    out.append(t[0])
    for tok in t[1][0]:
        out.append(tok)
    out.append(t[1][1])
    return out

test = sc.parallelize([(534309, ((17999, 5161, 45005, 00000, "XYZ", "", 29.95, 0.00), None))])
print(test.collect())
print(test.map(formatOnDemand).collect())
>>>
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]