how to use a custom function in query on Spark Dataframe using Scala - scala

I load data from a database into a Spark DataFrame named DF. Then I must extract some records from the DataFrame whose ID satisfies a special condition. So, I define this function:
def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}
Then, I use the function in this query:
DF.filter(hash_id("ID")===3)
But I receive this error:
value === is not a member of Int
DF has an ID column.
Would you please guide me on how to use a custom function in a where/filter clause?
Any help would be really appreciated.

=== can only be used between Column objects. That's why you get the error value === is not a member of Int: the return type of your function hash_id is an Int, not a Column.
To be able to use your function, you should convert it to a user-defined function (UDF) and apply it to a Column object, as follows:
import org.apache.spark.sql.functions.{col, udf}

def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}

val hash_id_udf = udf((id: String) => hash_id(id))

DF.filter(hash_id_udf(col("ID")) === 3)
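As an aside (not part of the original answer), the same logic can often be expressed with built-in column functions, which avoids the overhead of a UDF; a sketch assuming ID always ends in at least two digits:

import org.apache.spark.sql.functions.{col, substring}

// take the last two characters, cast to int, and compare the modulo -- mirrors hash_id without a UDF
DF.filter(substring(col("ID"), -2, 2).cast("int") % 4 === 3)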

Related

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like the below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this DataFrame into the below DataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each key of the 'values' struct (0.2, 0.4 and 0.6) will be multiplied by 100, prefixed with the letter 'v', and its array value extracted into a separate column.
What would the code look like in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the below code; please see the inline comments for the explanation of each step.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp
      .schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Join DataFrame with the list of nested columns
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
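A quick way to sanity-check the result against the desired output in the question (a usage sketch, run on the sample record above):

dfFinal.printSchema() // expect the columns id, id1, v20, v40, v60
dfFinal.show(false)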
I would split the logic for the column-name change into two parts: the part that handles a numeric value, and the part that leaves the name unchanged.
def stringDecimalToVNumber(colName:String): String =
"v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
Now that we have the function to transform the column names, let's pick the schema dynamically.
val flattenDF = df.select("id","inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically transform all the columns inside inputs.values to their new names and put them next to id.
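As an alternative not in the original answer, the renaming can also be done in a single select by aliasing every column in one pass; a sketch assuming the flattened columns from above (backticks are needed because names like 0.2 contain dots):

import org.apache.spark.sql.functions.col

val finalDF2 = flattenDF.select(
  flattenDF.columns.map(c => col(s"`$c`").as(transformColumnName(c))): _*
)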

Passing struct type to methods or UDFS in spark sql dataframes

I have two dataframes and I have joined them. After joining, in the joined dataframe I have two columns which are of array-of-struct type, basically Array[(String, Int)]. I need to derive a third column based on the elements of these structs.
My code looks like below.
val bdf = Seq(
("a",1,1,10)
,("a",1,2,10)
,("a",1,3,10)
,("a",1,4,10)
,("b",1,1,20)
,("b",1,2,10)
,("a",2,3,10)
,("a",2,4,20)
,("a",2,5,20)
,("c",2,1,10)
,("c",2,2,20)
,("c",2,3,20)
).toDF("contract_number","linenumber","monthdel","open_quant")
val gbdf = bdf.withColumn("bmergedcol",struct(bdf("monthdel"),bdf("open_quant"))).groupBy("contract_number","linenumber").agg(collect_list("bmergedcol"))
val pl = Seq(
("a",1,"FLAT",10)
,("a",1,"FLAT",30)
,("a",1,"NFE",10)
,("b",1,"FLAT",10)
,("b",1,"NFE",10)
,("c",2,"NFE",10)
,("a",3,"NFE",20)
,("c",2,"FLAT",20)).toDF("connum","linnum","type","qnt")
import org.apache.spark.sql.functions._
val gpl = pl.withColumn("mergedcol",struct(pl("type"),pl("qnt"))).groupBy("connum","linnum").agg(collect_list("mergedcol"))
val jdf = gbdf.join(gpl,expr("((contract_number = connum) AND (linenumber = linnum ))"),"left_outer")
My output of jdf is shown in an image (omitted here). I need to understand how I can pass the two struct-type fields to some method and derive a third one from them.
Both arrays of structs should enter your UDF as Seq[Row], which you can then map into tuples by specifying the types of the structs (I think it's (String, Int) in your case). In this example I use pattern matching on Row, but there are also other ways to do it (e.g. using Row#getAs):
val myUDF = udf((arr1: Seq[Row], arr2: Seq[Row]) => {
  // convert to tuples
  val arr1Tup: Seq[(String, Int)] = arr1.map { case Row(s: String, i: Int) => (s, i) }
  val arr2Tup: Seq[(String, Int)] = arr2.map { case Row(s: String, i: Int) => (s, i) }
  // now derive the new quantities from the two tuple sequences
})
Using the two sequences of tuples you can derive your new column.
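For instance, here is a minimal complete sketch, assuming (hypothetically) that the derived column should be the total quantity across both arrays; note that with this question's data the first array's structs are actually (Int, Int), i.e. (monthdel, open_quant), and the left outer join can produce null arrays:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val totalQuantityUDF = udf((arr1: Seq[Row], arr2: Seq[Row]) => {
  // guard against nulls produced by the left outer join
  val a1 = Option(arr1).getOrElse(Seq.empty)
  val a2 = Option(arr2).getOrElse(Seq.empty)
  val openQuantTotal = a1.map { case Row(_: Int, openQuant: Int) => openQuant }.sum
  val qntTotal = a2.map { case Row(_: String, qnt: Int) => qnt }.sum
  openQuantTotal + qntTotal
})

val withDerived = jdf.withColumn("total_qnt",
  totalQuantityUDF(col("collect_list(bmergedcol)"), col("collect_list(mergedcol)")))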
User-Defined Functions (aka UDFs) are a feature of Spark SQL for defining new Column-based functions that transform Datasets. A UDF can be used to pass the two struct-type fields and derive a result.
val customUdf = udf((col1: Seq[Row], col2: Int) => {
  // This is an example.
  col1(1).getAs[String]("type") + "--" + col2
})
val cdf = jdf.withColumn("custom", customUdf(jdf.col("collect_list(mergedcol)"), jdf.col("linnum")))
cdf.show(10)
In the above UDF, col1 is a Seq[Row] because it is an array of struct type. If only a single struct had to be accessed, then simply Row would be used instead.
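As a side note (a stylistic suggestion, not part of the original answer), aliasing the aggregation gives the collected column a friendlier name to reference later:

import org.apache.spark.sql.functions.{collect_list, struct}

val gpl = pl.withColumn("mergedcol", struct(pl("type"), pl("qnt")))
  .groupBy("connum", "linnum")
  .agg(collect_list("mergedcol").as("mergedcols"))
// the UDF call then becomes: customUdf(jdf.col("mergedcols"), jdf.col("linnum"))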

How do I send multiple columns to a udf from a When Clause in Spark dataframe?

I want to join two dataframes with a full outer join, and then add a new column to the joined result set that tells me the matching records, the unmatched records from the left dataframe alone, and the unmatched records from the right dataframe alone.
Here is my spark code:
val creditLoc ="/data/accounts/credits/year=2016/month=06/day=02"
val debitLoc = "/data/accounts/debits/year=2016/month=06/day=02"
val creditDF = sqlContext.read.avro(creditLoc)
val debitDF = sqlContext.read.avro(debitLoc)
val credit = creditDF.withColumnRenamed("account_id","credit_account_id").as("credit")
val debit = debitDF.withColumnRenamed("account_id","debit_account_id").as("debit")
val fullOuterDF = credit.join(debit,credit("credit_account_id") === debit("debit_account_id"),"full_outer")
val CREDIT_DEBIT_CONSOLIDATE_SCHEMA=List(
("credit.credit_account_id","string"),
("credit.channel_name", "string"),
("credit.service_key", "string"),
("credit.trans_id", "string"),
("credit.trans_dt", "string"),
("credit.trans_amount", "string"),
("debit.debit_account_id","string"),
("debit.icf_number","string"),
("debit.debt_amount","string")
)
val columnNamesList = CREDIT_DEBIT_CONSOLIDATE_SCHEMA.map(elem => col(elem._1)).seq
val df = fullOuterDF.select(columnNamesList:_*)
val caseDF = df.withColumn("matching_type",
  when(df("credit_account_id") === df("debit_account_id"), "MATCHING_CREDIT_DEBIT").otherwise(
    when(df("debit_account_id").isNull, "UNMATCHED_CREDIT").otherwise(
      when(df("credit_account_id").isNull, "UNMATCHED_DEBIT").otherwise("INVALID_MATCHING_TYPE")
    )
  )
)
As of now I have applied the logic for "matching_type" inside the when clause itself; written like the above, the code works. But now I want to write the "matching_type" logic inside a UDF.
The below UDFs accept a single column as a parameter. How do I create a UDF that accepts multiple columns and returns a boolean based on the conditions inside that UDF?
val isUnMatchedCREDIT = udf[Boolean, String](credit_account_id => {
  credit_account_id == null
})

val isUnMatchedDEBIT = udf[Boolean, String](debit_account_id => {
  debit_account_id == null
})
val caseDF = df.withColumn("matching_type",
  when(df("credit_account_id") === df("debit_account_id"), "MATCHING_CREDIT_DEBIT").otherwise(
    when(isUnMatchedCREDIT(df("credit_account_id")), "UNMATCHED_CREDIT").otherwise(
      when(isUnMatchedDEBIT(df("debit_account_id")), "UNMATCHED_DEBIT").otherwise("INVALID_MATCHING_TYPE")
    )
  )
)
Basically I want to create another UDF, isMatchedCREDITDEBIT(), that accepts the two columns credit_account_id and debit_account_id and returns true if both values are equal, else false. In simple words, I want to create a UDF for the below logic:
when(df("credit_account_id") === df("debit_account_id"),"MATCHING_CREDIT_DEBIT")
I have tried this, but it is throwing a compile-time error:
val isMatchedCREDITDEBIT()= udf[Boolean, String,String](credit_account_id => {
credit_account_id == debit_account_id
})
Could someone help me on this?
You can create a UDF that takes two columns and performs your logic like this:
val isMatchedCREDITDEBIT = udf((credit_account_id: String, debit_account_id: String) => {
  credit_account_id == debit_account_id
})
which can be called in the when clause
when(isMatchedCREDITDEBIT(df("credit_account_id"), df("debit_account_id")), "MATCHING_CREDIT_DEBIT")
However, it would be easier to create a single udf for all the logic you are performing on the two columns. The udf below takes both columns as input and returns the string you want, instead of a boolean.
val isMatchedCREDITDEBIT = udf((credit_account_id: String, debit_account_id: String) => {
  if (credit_account_id == null) {
    "UNMATCHED_CREDIT"
  } else if (debit_account_id == null) {
    "UNMATCHED_DEBIT"
  } else if (credit_account_id == debit_account_id) {
    "MATCHING_CREDIT_DEBIT"
  } else {
    "INVALID_MATCHING_TYPE"
  }
})
val caseDF = df.withColumn("matching_type",
isMatchedCREDITDEBIT(df("credit_account_id"), df("debit_account_id")))
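As a stylistic aside (not part of the original answers), the nested when(...).otherwise(when(...)) expressions can also be written as a flat chain, which reads more easily; a sketch of the same logic without a UDF:

import org.apache.spark.sql.functions.when

val caseDF = df.withColumn("matching_type",
  when(df("credit_account_id") === df("debit_account_id"), "MATCHING_CREDIT_DEBIT")
    .when(df("debit_account_id").isNull, "UNMATCHED_CREDIT")
    .when(df("credit_account_id").isNull, "UNMATCHED_DEBIT")
    .otherwise("INVALID_MATCHING_TYPE"))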

Apache Spark. UDF Column based on another column without passing its name as argument.

There is a Dataset with a column firm, and I'm adding another column, firm_id, to this Dataset. Here's an example:
private val firms: mutable.Map[String, Integer] = ...
private val firmIdFromCode: (String => Integer) = (code: String) => firms(code)
val firm_id_by_code: UserDefinedFunction = udf(firmIdFromCode)
...
val ds = dataset.withColumn("firm_id", firm_id_by_code($"firm"))
Is there a way to eliminate passing $"firm" as an argument (this column is always present in the Dataset)?
I am looking for something like this:
val ds = dataset.withColumn("firm_id", firm_id_by_code)
You could supply the column it will be using when you define the UDF:
val someUdf = udf{ /*udf code*/}.apply($"colName")
// Usage in dataset
val ds = dataset.withColumn("newColName",someUdf)

display column names into a List[Column] - scala

I want to insert the list of columns from a DataFrame into a List[Column] so I can perform a select request. That means I want to get the list of columns and insert it automatically into a List[Column]. Any help? Thanks.
import org.apache.spark.sql.{Column, SparkSession}

object PCA extends App {
  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  val strPath = "C:/Users/mhattabi/Desktop/testBis2.txt"
  val initial_Data = spark.read.option("header", true).csv(strPath)
  // array of strings containing the column names
  val arrayList = initial_Data.columns
  var colsList = List[Column]()
  // want to insert the column names into colsList, but this is where I'm stuck
  arrayList.foreach(p => colsList.)
  // i want to have something like
  // val colsList = List(col("col1"),col("col2"))
  // initial_Data.select(colsList:_*).show
}
You could use the col function as follows:
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
Remember to import sql functions to use col:
import org.apache.spark.sql.functions._
I would rather build an immutable list by transformation, like below, than use a mutable variable.
val arrayList = initial_Data.columns
val colsList = arrayList.map(col)
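Either way, colsList can then be used in the select from the question (a usage sketch):

initial_Data.select(colsList: _*).show()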