I am analysing the following piece of code:
from pyspark.sql.functions import udf, col, desc

def error(value, pred):
    return abs(value - pred)

udf_MAE = udf(lambda value, pred: MAE(value=value, pred=pred), FloatType())
I know a UDF is a user-defined function, but I don't understand what that means here, because udf wasn't defined anywhere previously in the code?
In your snippet, udf is not something you define yourself; it is imported from pyspark.sql.functions on the first line. User Defined Functions (UDFs) are useful when you need to define logic specific to your use case and when you need to encapsulate that solution for reuse. They should only be used when there is no clear way to accomplish a task using built-in functions. (From the Azure Databricks documentation.)
Create your function (after you have made sure there is no built-in function that performs a similar task):
def greatingFunc(name):
    return f'hello {name}!'
Then you need to register your function as a UDF by designating the following:
A name for access in Python (myGreatingUDF)
The function itself (greatingFunc)
The return type for the function (StringType)
myGreatingUDF = spark.udf.register("myGreatingUDF",greatingFunc,StringType())
Now you can call your UDF whenever you need it. Note that calling it produces a Column expression, so you use it inside a DataFrame query or a Spark SQL statement, for example:
guest = 'John'
spark.sql(f"SELECT myGreatingUDF('{guest}') AS greeting").show()
I am trying to create a function/UDF for currency conversions that I can reuse in a Spark notebook. It requires a SQL statement; is it possible to include that in the UDF, like the example below? If not, what can I do?
Something like:
def curexch(from_cur, exch_rt, exdate, to_cur):
    if from_cur == 'USD':
        return 1
    else:
        return (select excrt from exchange_rate_tbl a where exdate >= a.exdate and to_cur = a.to_cur)
Do I have to add # to the reference of the join?
As suggested by Skin, and as per this Microsoft document, you can create a UDF for this; here is sample code for it.
Register a function as a UDF:
def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)
You can also specify the return type of your UDF; the default return type is StringType:
from pyspark.sql.types import LongType

def squared_typed(s):
    return s * s

spark.udf.register("squaredWithPython", squared_typed, LongType())
I am writing unit tests for a Scala function: I pass a mocked Spark DataFrame to the function and then use assertSmallDataFrameEquality(actualDF, expectedDF) to check whether my function transforms the data correctly.
Recently I came across a function that takes no arguments and returns a Column. Since it does not expect any arguments, how should I write a test case for this function? My function is given below.
def arriveDateMinusRuleDays: Column = {
expr(s"date_sub(${Columns.ARRIVE_DATE},${Columns.RULE_DAYS})")
}
The test blueprint is written below:
test("arrive date minus rule days") {
import spark.implicits._
val today = Date.valueOf(LocalDate.now)
val inputDF = Seq(
(Y, today, 0, 80852),
(S, today, 1, 18851))
.toDF(FLAG, ARRIVE_DT, RULE_DAYS,ITEM_NBR)
val actualOutput = DataAggJob.arriveDateMinusRuleDays() // How to pass my column values to this function
// val expectedOutput
assertMethod(actualOutput, expectedOutput)
// print(actualOutput)
}
You don't need to test each individual function. The purpose of the unit test is to assert the contract between implementation and downstream consumer, not implementation details.
If your job returns the expected output given the input, then it is working correctly, regardless of what that particular function is doing. The function should really just be made private to avoid confusion.
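For illustration, a test along those lines asserts on the whole job's output rather than on arriveDateMinusRuleDays itself. This is only a sketch: DataAggJob.run and the output schema are assumptions for the example, not taken from the original post.
test("job subtracts rule days from arrive date") {
  import java.sql.Date
  import java.time.LocalDate
  import spark.implicits._
  // Hypothetical public entry point and schema; adapt to the real job.
  val inputDF = Seq(("Y", Date.valueOf(LocalDate.now), 1, 18851))
    .toDF("FLAG", "ARRIVE_DT", "RULE_DAYS", "ITEM_NBR")
  // With RULE_DAYS = 1, the arrive date should come back shifted by one day.
  val expectedDF = Seq(("Y", Date.valueOf(LocalDate.now.minusDays(1)), 1, 18851))
    .toDF("FLAG", "ARRIVE_DT", "RULE_DAYS", "ITEM_NBR")
  assertSmallDataFrameEquality(DataAggJob.run(inputDF), expectedDF)
}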
I'm trying to define a DSL that will dictate how to parse CSVs. I would like to define simple functions that transform values as they are extracted from the CSV. The DSL will be defined in a text file.
For example, if the CSV looks like this:
id,name,amt
1,John Smith,$10.00
2,Bob Uncle,$20.00
I would like to define the following function (and please note that I would like to be able to execute arbitrary code) on the amt column
(x: String) => x.replace("$", "")
Is there a way to evaluate the function above and execute it for each of the amt values?
First, please consider that there's probably a better way to do this. For one thing, it seems concerning that your external DSL contains Scala code. Does this really need to be a DSL?
That said, it's possible to evaluate arbitrary Scala using ToolBox:
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val code = """(x: String) => x.replace("$", "")"""
val toolbox = runtimeMirror(getClass.getClassLoader).mkToolBox()
val func = toolbox.eval(toolbox.parse(code)).asInstanceOf[String => String]
println(func("$10.50")) // prints "10.50"
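For completeness, here is one way the compiled function could then be applied to the amt column of the example CSV from the question (a rough sketch that assumes simple comma-separated rows with no quoting):
val rows = Seq(
  "1,John Smith,$10.00",
  "2,Bob Uncle,$20.00")

val cleaned = rows.map { line =>
  val cols = line.split(",")
  // apply the DSL-supplied transformation to the "amt" column (index 2)
  cols.updated(2, func(cols(2))).mkString(",")
}
cleaned.foreach(println)
// 1,John Smith,10.00
// 2,Bob Uncle,20.00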
I am new to Scala and I have a question.
Imagine this situation: a function that verifies whether what is passed to it contains any digit, without using loops, var or val declarations, or recursion. The function is named hasDigit and must be declared without using type parameters.
def hasDigit.... ???
Can someone explain this to me, please?
What I have so far:
def hasDigit(element: => Any) = element.toString.toList.exists(x => x.isDigit)
I can't tell whether the function above counts as a function without type parameters or not.
I could do this:
def hasDigit_[T](element: T) = element.toString.toList.exists(x => x.isDigit)
But in this exercise I am supposed to create a generic function (which I've already done) and a function without type parameters (that is what I am trying to achieve with my first function, and I don't know if it is correct).
I am a bit new to Scala currying and call-by-name functions, and I am having difficulty understanding the syntax. What is the flow of this function, why is there a need to return f(results), and what function is applied to it afterwards?
def withScan[R](table: Table, scan: Scan)(f: (Seq[Result]) => R): R = {
var resultScanner: ResultScanner = null
try {
resultScanner = table.getScanner(scan)
val it: util.Iterator[Result] = resultScanner.iterator()
val results: mutable.ArrayBuffer[Result] = ArrayBuffer()
while (it.hasNext) {
results += it.next()
}
f(results)
} finally {
if (resultScanner != null)
resultScanner.close()
}
}
Let's look at just the function signature
def withScan[R](table: Table, scan: Scan)(f: (Seq[Result]) => R): R
Firstly, ignore the fancy currying syntax for now as you can always rewrite a curried function into a normal function by putting all the parameters in one parameter list i.e.
def withScan[R](table: Table, scan: Scan, f: Seq[Result] => R): R
Secondly, notice that the last parameter is a function on its own, and we don't know what it does yet: withScan will take whatever function somebody gives it and apply that function to something. Why would someone need such a thing? We constantly deal with resources that need to be opened and closed properly, such as a File, DatabaseConnection, or Socket, so we end up repeating the code that closes them, or worse, forgetting to close them. Hence we factor the boring common code out into a convenient function: if you use withScan to access the table, it will hand you the Results so that you can work on them, and it will make sure the resources are closed properly, letting you focus on the interesting operation. This is called the "loan pattern".
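To make the pattern concrete, here is a minimal, generic sketch of the same idea for any AutoCloseable resource (this helper is only illustrative, not part of the code being discussed):
def using[A <: AutoCloseable, B](resource: A)(f: A => B): B =
  try f(resource) finally resource.close()

// The caller only writes the interesting part; closing is handled for them, e.g.
// val firstByte = using(new java.io.FileInputStream("data.bin"))(in => in.read())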
Now let's go back to the currying syntax. Although currying has other interesting use cases, I believe the reason the function is written in this style is that, in Scala, you can use a curly-brace block to pass the last parameter to the function, i.e. one can use the function above like this:
withScan(myTable, myScan) { results =>
//do whatever you want with the results
}
This looks just like a built-in control structure such as if-else or a for loop!
As I understand it, this is a function which takes some Table (probably a DB table) and tries to scan that table using the scan argument. After the data has been collected using the relevant scanner, the method maps the collected sequence to an object of type R.
The function f is used for that mapping.
You can use this function:
val list: List[Result] = withScan(table, scanner)(results => results.toList)
Or
val list: List[Result] = withScan(table, scanner)(results => ObjectWhichKeepAllData(results))
IMHO, it is not very well written code; I feel it would be better to do the mapping outside of this function. Let the client do the mapping (which, by the way, should be done for every single result) and leave only the scanning to that function.
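A sketch of that alternative, in which the function only scans and returns the raw results while the caller maps them afterwards (the signature here is illustrative):
def scanAll(table: Table, scan: Scan): Seq[Result] = {
  val scanner = table.getScanner(scan)
  try {
    import scala.collection.JavaConverters._
    // collect every Result produced by the scanner
    scanner.iterator().asScala.toList
  } finally {
    scanner.close()
  }
}

// mapping is now the caller's responsibility:
// val list = scanAll(table, scan).map(result => ...)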
This is an example of a higher-order function: a function which takes another function as a parameter.
The function appears to do the following:
- opens the passed in table with the passed in scanner
- parses the table with an iterator, populating entries in a local ArrayBuffer
- calls a function, passed in by the caller, on the sequence of entries that have been parsed.
The function parameter allows this function to be used to carry out any operation on the scanned information, depending on the function passed in.
The function prototype could equally have been declared:
def withScan[R](table: Table, scan: Scan, f: (Seq[Result]) => R): R = {
The function has been declared with two argument lists; this is an example of currying. This is a benefit when calling the function, as it allows the method to be called with a clearer syntax.
Consider a function that might be passed into this function:
def getHighestValueEntry(results: Seq[Result]): R = {
Without currying, the function would be called like this:
withScan[R](table, scan, results => getHighestValueEntry(results))
With currying the function can be called in a manner that makes the function parameter stand out more clearly. This is helped by the ability in Scala to use curly braces instead of parentheses to surround the arguments to a function, if you are only passing in one argument:
withScan(table, scan) { results =>
getHighestValueEntry(results) }
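As a small aside, because the final parameter is just a function over Seq[Result], a method with a matching shape can also be passed directly, without wrapping it in a lambda (assuming getHighestValueEntry has the Seq[Result] => R signature shown above):
withScan(table, scan)(getHighestValueEntry)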