SparkSQL custom defined function in when clause - scala

I have a DataFrame like this:
id val1 val2
------------
1 v11 v12
2 v21 v22
3 v31 v32
4 v41 v42
5 v51 v52
6 v61 v62
Each row represents a person who may belong to one or more groups. I have a function that takes the values for each row and determines whether that person meets the criteria for a particular group:
def isInGroup(group: Int)(id: String, v1: String, v2: String): Boolean
and I'm trying to output a DataFrame like this:
Group1 Group2 Group3 Group4
---------------------------
3 0 6 1
Here's my code so far, which doesn't work. Unfortunately, the when clause only takes a condition of type Column, so my function doesn't work there, and user-defined functions don't work either. I'd really like to stick with the select/struct/as way of doing it if possible.
val summaryDF = dataDF
  .select(struct(
    sum(when(isInGroup(1)($"id", $"val1", $"val2"), value = 1)).as("Group1"),
    sum(when(isInGroup(2)($"id", $"val1", $"val2"), value = 1)).as("Group2"),
    sum(when(isInGroup(3)($"id", $"val1", $"val2"), value = 1)).as("Group3"),
    sum(when(isInGroup(4)($"id", $"val1", $"val2"), value = 1)).as("Group4")
  ))

As I showed in my previous answer, you'll need a UDF:
import org.apache.spark.sql.functions.udf
def isInGroupUDF(group: Int) = udf(isInGroup(group) _)
sum(when(
isInGroupUDF(1)($"id", $"val1", $"val2"), 1
)).as("Group1")
If you want to avoid listing the columns every time, you can, for example, use default arguments:
import org.apache.spark.sql.Column

def isInGroupUDF(group: Int, id: Column = $"id",
                 v1: Column = $"val1", v2: Column = $"val2") = {
  val f = udf(isInGroup(group) _)
  f(id, v1, v2)
}
sum(when(
isInGroupUDF(1), 1
)).as("Group1")

Related

how should count enum in dataframe by scala

I'm new to Scala. I have a Spark DataFrame like the following:
userid | productid | enumXX
1      | 3         | 1
2      | 3         | 1
3      | 4         | 2
1      | 3         | 3
The enumXX values are 1, 2 and 3; it's an enum type, defined as follows:
object enumXX extends Enumeration {
  type EnumXX = Value
  val apple = Value(1)
  val balana = Value(2)
  val orign = Value(3)
}
I would like to group by userid and productid and count how many apple, balana and orign values each group has. How should I do this in Scala?
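
No answer is attached to this question here, but a minimal sketch of one common approach is to group and pivot on the enum column (assuming the DataFrame from the question is called df; the renames map the pivoted enum ids back to the names from enumXX):

// groupBy + pivot + count: one column per enum value, counted per (userid, productid)
val counts = df
  .groupBy("userid", "productid")
  .pivot("enumXX", Seq(1, 2, 3))
  .count()
  .withColumnRenamed("1", "apple")
  .withColumnRenamed("2", "balana")
  .withColumnRenamed("3", "orign")
  .na.fill(0) // pivot yields null where a value never occurs for a group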

Scala: How to do union of data frames in the loop

I would like to union DataFrames inside a recursive method.
I am doing some calculations in the recursive method, filtering the data and storing the result in one variable. In the 2nd iteration I do some calculation and again store the data in the same variable, so when I call the method a second time my first result vanishes. Ideally I should store the result in one temp variable and union all the results until the recursive method finishes executing.
Iteration1 output in df:
Col1
14
35
Iteration2 output in df:
Col1
18
20
Now i need the final output as,
Col1
14
35
18
20
Code:
def myRecursiveMethod(first: List[List[String]],
                      Inputcolumnsdummy: List[List[String]],
                      secondInputcolumns: List[List[String]]) = {
  val ongoingResult = doSomeCalculation(first, Inputcolumnsdummy, secondInputcolumns)
}
I want my code to be something like this:
def myRecursiveMethod(first: List[List[String]],
                      Inputcolumnsdummy: List[List[String]],
                      secondInputcolumns: List[List[String]]) = {
  val ongoingResult = doSomeCalculation(first, Inputcolumnsdummy, secondInputcolumns)
  val temp = temp.union(ongoingResult) // somehow accumulate every result into temp
}
You should use union like this: df1.union(df2) or df1.union(computation(df2, ...)). Example below:
def doCompute(df: DataFrame): DataFrame = {
  val tmp: DataFrame = ... // TODO: call to your computation method
  tmp.show()
  df.union(tmp)
}
val df1: DataFrame = ...
val df2: DataFrame = ...
val df3: DataFrame = ...
var union_df: DataFrame = df1.union(doCompute(df2)).union(doCompute(df3))
One thing I did not understand in your question: how is your function myRecursiveMethod recursive? A recursive function calls itself, by definition, so I'm not sure the question is really clear.
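
To make the accumulation from the question concrete, one common pattern is to thread an accumulator DataFrame through the recursion and union into it on every call. A rough sketch under the question's own placeholders (doSomeCalculation is the question's placeholder; isDone and nextFirst are hypothetical names for the stopping test and the next input):

import org.apache.spark.sql.DataFrame

def myRecursiveMethod(first: List[List[String]],
                      inputColumnsDummy: List[List[String]],
                      secondInputColumns: List[List[String]],
                      acc: DataFrame): DataFrame = {
  val ongoingResult = doSomeCalculation(first, inputColumnsDummy, secondInputColumns)
  val accumulated = acc.union(ongoingResult)       // keep everything collected so far
  if (isDone(first)) accumulated                   // hypothetical stopping condition
  else myRecursiveMethod(nextFirst(first),         // hypothetical next input
                         inputColumnsDummy, secondInputColumns, accumulated)
}

The first call passes the first result (or an empty DataFrame with the same schema) as acc, so that union always sees matching schemas.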

Adding columns in Spark dataframe based on rules

I have a dataframe df which contains the data below:
customers | product | Val_id
1         | A       | 1
2         | B       | X
3         | C       |
4         | D       | Z
I have been provided 2 rules, which are as below:
rule_id | rule_name | product value | priority
123     | ABC       | A,B           | 1
456     | DEF       | A,B,D         | 2
The requirement is to apply these rules on dataframe df in priority order: customers who have passed rule 1 should not be considered for rule 2, and in the final dataframe two more columns, rule_id and rule_name, should be added. I have written the code below to achieve it:
val rule_name = when(col("product").isin("A", "B"), "ABC")
  .otherwise(when(col("product").isin("A", "B", "D"), "DEF").otherwise(""))
val rule_id = when(col("product").isin("A", "B"), "123")
  .otherwise(when(col("product").isin("A", "B", "D"), "456").otherwise(""))
val df1 = df_customers.withColumn("rule_name", rule_name).withColumn("rule_id", rule_id)
df1.show()
Final output looks like below:
customers | product | Val_id | rule_name | rule_id
1         | A       | 1      | ABC       | 123
2         | B       | X      | ABC       | 123
3         | C       |        |           |
4         | D       | Z      | DEF       | 456
Is there any better way to achieve it, adding both columns by going through the entire dataset once instead of going through it twice?
Question: Is there any better way to achieve it, adding both columns by going through the entire dataset once instead of going through it twice?
Answer: you can have a Map return type in Scala...
Limitation: if you use this UDF with withColumn to produce a single column (for example one named ruleIDandRuleName), you can use a single function with a Map data type (or any data type acceptable as a Spark SQL column). Otherwise you can't use the approach shown in the example snippet below.
def ruleNameAndruleId = udf((product: String) => {
  if (Seq("A", "B").contains(product)) Map("ruleName" -> "ABC", "ruleId" -> "123")
  else if (Seq("A", "B", "D").contains(product)) Map("ruleName" -> "DEF", "ruleId" -> "456")
  else Map("ruleName" -> "", "ruleId" -> "")
})
The caller will be:
df.withColumn("ruleIDandRuleName", ruleNameAndruleId(col("product"))) // returns a map containing the rule name and rule id
An alternative to your solution would be to use udf functions. It is almost similar to the when function, as both require serialization and deserialization. It's up to you to test which is faster and more efficient.
def rule_name = udf((product: String) => {
  if (Seq("A", "B").contains(product)) "ABC"
  else if (Seq("A", "B", "D").contains(product)) "DEF"
  else ""
})

def rule_id = udf((product: String) => {
  if (Seq("A", "B").contains(product)) "123"
  else if (Seq("A", "B", "D").contains(product)) "456"
  else ""
})
val df1 = df_customers.withColumn("rule_name" , rule_name(col("product"))).withColumn("rule_id" , rule_id(col("product")))
df1.show()
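
A variation on the same idea, not from either answer: a single UDF that returns a tuple, so the product check runs only once per row and both columns are extracted from the resulting struct (ruleInfo and dfWithRules are illustrative names):

def ruleInfo = udf((product: String) => {
  if (Seq("A", "B").contains(product)) ("ABC", "123")
  else if (Seq("A", "B", "D").contains(product)) ("DEF", "456")
  else ("", "")
})

val dfWithRules = df_customers
  .withColumn("rule", ruleInfo(col("product")))   // struct column with fields _1 and _2
  .withColumn("rule_name", col("rule._1"))
  .withColumn("rule_id", col("rule._2"))
  .drop("rule")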

Spark Scala: Using an external "dataframe" variable in a map

I have two dataframes,
val df1 = sqlContext.csvFile("/data/testData.csv")
val df2 = sqlContext.csvFile("/data/someValues.csv")
df1 =
startTime | name | cause1 | cause2
15679     | CCY  | 5      | 7
15683     |      | 2      | 5
15685     |      | 1      | 9
15690     |      | 9      | 6
df2 =
cause | description | causeType
3     | Xxxxx       | cause1
1     | xxxxx       | cause1
3     | xxxxx       | cause2
4     | xxxxx       |
2     | Xxxxx       |
and I want to apply a complex function getTimeCust to both cause1 and cause2 to determine a final cause, then match the description of this final cause code in df2. I must have a new df (or rdd) with the following columns:
startTime name cause descriptionCause
My solution was
val rdd2 = df1.map(row => {
  val (cause, descriptionCause) = getTimeCust(row.getInt(2), row.getInt(3), df2)
  Row(row(0), row(1), cause, descriptionCause)
})
If I run the code above I get a NullPointerException because df2 is not visible inside the map.
The function getTimeCust(Int, Int, DataFrame) works well outside the map.
Use df1.join(df2, <join condition>) to join your dataframes together then select the fields you need from the joined dataframe.
You can't use spark's distributed structures (rdd, dataframe, etc) in code that runs on an executor (like inside a map).
Try something like this:
import org.apache.spark.sql.functions.udf

def f1(cause1: Int, cause2: Int): Int = ??? // some logic to calculate the cause

val dfCause = df1.withColumn("df1_cause", udf(f1 _)($"cause1", $"cause2"))
val dfJoined = dfCause.join(df2, dfCause("df1_cause") === df2("cause"))
dfJoined.select("cause", "description").show()
Thank you @Assaf. Thanks to your answer and the Spark UDF with DataFrame approach, I have resolved this problem. The solution is:
val getTimeCust = udf((cause1: Any, cause2: Any) => {
  var lastCause = 0
  var categoryCause = ""
  var descCause = ""
  lastCause = .............
  categoryCause = ........
  (lastCause, categoryCause)
})
and then call the UDF as:
val dfWithCause = df1.withColumn("df1_cause", getTimeCust( $"cause1", $"cause2"))
And finally the join:
val dfFinale = dfWithCause.join(df2,
  dfWithCause.col("df1_cause._1") === df2.col("cause") &&
  dfWithCause.col("df1_cause._2") === df2.col("causeType"),
  "outer")
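
To end up with exactly the columns listed earlier (startTime, name, cause, descriptionCause), a final select on the joined result should be enough. A sketch assuming the column names above (result is an illustrative name):

val result = dfFinale.select(
  $"startTime",
  $"name",
  $"df1_cause._1".as("cause"),
  $"description".as("descriptionCause")
)
result.show()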

Lookup in Spark dataframes

I am using Spark 1.6 and I would like to know how to implement a lookup in DataFrames.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to lookup emp id from the employee table to the department table and get the dept name. So, the resultset would be
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this lookup UDF feature in Spark? I don't want to use JOIN on both the dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map
val emp = Seq((1,"John"),(2,"David"))
val deps = Seq((1,"Admin",1),(2,"HR",2))
val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID","Name","EmpID")
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,String]) = udf((empID:Int) => lookupMap.get(empID))
val combinedDF = depsDF
.withColumn("empNames",lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this does of course not work because you cannot have spark actions (i.e. lookup) within transformations (ie. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name,Age), you can use this
val empRdd = empDf.rdd.map{row =>
(row.getInt(0),(row.getString(1),row.getInt(2)))}
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,(String,Int)]) =
udf((empID:Int) => lookupMap.lift(empID))
depsDF
.withColumn("lookup",lookup(lookupMap)($"EmpID"))
.withColumn("empName",$"lookup._1")
.withColumn("empAge",$"lookup._2")
.drop($"lookup")
.show()
As you say you already have DataFrames, it's pretty easy; follow these steps:
1) Create a SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create temporary tables for all your DataFrames, e.g.:
EmployeeDataframe.createOrReplaceTempView("EmpTable")
3) Query them using SQL:
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E inner join DeptTable G on " +
"E.EmpID=g.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
("banana", "yellow"),
("apple", "red"),
("grape", "purple"),
("blueberry","blue")
)).toDF("SomeKeys","SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap
def ThingToColor(key: String): String = {
if (key == null) return ""
val firstword = key.split(" ")(0) // fragile!
val result: String = KeyValueMap.getOrElse(firstword,"not found!")
return (result)
}
val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
("blueberry muffin"),
("grape nuts"),
("apple pie"),
("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, the rlike is doing the matching. And null appears where that does not work. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"),
"left_outer")
Method #2 is to add a column using the UDF
Here, only 1 column is added. And the UDF can return a non-Null value. However, if the lookup data is very large it may fail to "serialize" as required to send to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Which gives you a DataFrame of the things with the looked-up value in the new AddValues column.
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.
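
As a side note on the serialization concern in Method #2 (this is not from the original answers): if the map is awkward to capture in the UDF closure but still fits in memory, broadcasting it ships it to each executor only once. A sketch reusing KeyValueMap from above:

// Broadcast the lookup map so it is serialized once per executor,
// instead of being shipped with every task via the UDF closure.
val kvBroadcast = sc.broadcast(KeyValueMap)

val ThingToColorBroadcastUDF = udf((key: String) => {
  if (key == null) ""
  else kvBroadcast.value.getOrElse(key.split(" ")(0), "not found!")
})

val result_3_DF = thingsDF.withColumn("AddValues", ThingToColorBroadcastUDF($"SomeThings"))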