Apache Spark. UDF column based on another column without passing its name as argument - scala

There is a Dataset with a column firm, and I'm adding another column, firm_id, to this Dataset. Here's an example:
private val firms: mutable.Map[String, Integer] = ...
private val firmIdFromCode: (String => Integer) = (code: String) => firms(code)
val firm_id_by_code: UserDefinedFunction = udf(firmIdFromCode)
...
val ds = dataset.withColumn("firm_id", firm_id_by_code($"firm"))
Is there a way to avoid passing $"firm" as an argument (this column is always present in the Dataset)?
I am looking for something like this:
val ds = dataset.withColumn("firm_id", firm_id_by_code)

You could supply the column the UDF will use when you define it:
val someUdf = udf{ /*udf code*/}.apply($"colName")
// Usage in dataset
val ds = dataset.withColumn("newColName",someUdf)

Error while saving map value in Spark dataframe(scala) - Expected column, actual Map[int,string]

I have key-value pairs in a Map[int,string]. I need to save these values in a Hive table using a Spark dataframe, but I am getting the error - Expected column, actual Map[int,string]
Code:
val dbValuePairs = Array(2019,10)
val dbkey = dbValuePairs.map(x => x).zipWithIndex.map(t => (t._2, t._1)).toMap
val dqMetrics = spark.sql("select * from dqMetricsStagingTbl")
.withColumn("Dataset_Name", lit(Dataset_Name))
.withColumn("Key", dbkey)
dqMetrics.createOrReplaceTempView("tempTable")
spark.sql("create table if not exists hivetable AS select * from tempTable")
dqMetrics.write.mode("append").insertInto(hivetable)
Please help! The error is on the withColumn("Key", dbkey) line.
Look at the signature of Spark's withColumn function:
def withColumn(colName: String, col: Column): DataFrame
It takes two arguments: colName as a String and col as a Column.
Your dbkey has type Map[Int, Int], which is not a Column:
val dbkey: Map[Int, Int] = dbValuePairs.map(x => x).zipWithIndex.map(t => (t._2, t._1)).toMap
If you want to store a Map in your table column, you can use the map function, which takes a sequence of Columns:
// from object org.apache.spark.sql.functions
def map(cols: Column*): Column
So you can convert your dbkey to a Seq[Column] and pass it to the withColumn function:
val dbValuePairs = Array(2019, 10)
val dbkey: Map[Int, Int] = dbValuePairs.map(x => x).zipWithIndex.map(t => (t._2, t._1)).toMap
val dbkeyColumnSeq: Seq[Column] = dbkey.flatMap(t => Seq(lit(t._2), lit(t._1))).toSeq
val dqMetrics = spark.sql("select * from dqMetricsStagingTbl")
  .withColumn("Dataset_Name", lit(""))
  .withColumn("Key", map(dbkeyColumnSeq:_*))

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like the one below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each field name under 'values' (0.2, 0.4 and 0.6) is multiplied by 100 and prefixed with the letter 'v', and its array is extracted into a separate column.
What would the code look like in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; please see the inline comments for an explanation.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp DataFrame for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Fold over the nested fields, adding one column per field
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the original nested column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Write the output JSON
  }
}
I would split the column-renaming logic into two parts: one for names that are numeric values, and one for names that stay unchanged.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and then form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r
def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as-is
}
Now that we have the function to transform the column names, let's pick up the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns inside inputs.values and put them next to id.
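One small caveat (an addition, not part of the original answer): the expected output in the question also keeps id1, so if you need it, include it in the flattening select as well:
// Keep id1 alongside id when flattening (matches the expected output in the question)
val flattenDF = df.select("id", "id1", "inputs.values.*")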

spark scala - UDF usage for creating new column

I need to create a new column called hash_id from the uid column of my dataframe. Below is my code:
// 1. Define a hashing function
def calculate_hashid(uid: String): BigInteger = {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  val ha = new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
  return ha
}
//2.Convert function to UDF
val calculate_hashidUDF = udf(calculate_hashid)
//3.Apply udf on spark dataframe
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))
I am getting an error at udf(calculate_hashid) saying:
missing arguments for the method calculate_hashid(string)
I have gone through many examples online but could not resolve it. What am I missing here?
You can register your udf as:
val calculate_hashidUDF = udf[BigInteger, String](calculate_hashid _)
(the return type comes first in the type parameters, and calculate_hashid _ eta-expands the method into a function)
You can also rewrite your udf as
def calculate_hashidUDF = udf(((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
}): String => BigInteger)
Or even without the explicit return type:
def calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})
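For completeness, a self-contained sketch (assuming javax.xml.bind.DatatypeConverter is on the classpath, as in the question, a SparkSession named spark, and a Spark version that maps java.math.BigInteger to a decimal column) could look like:
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"uid" syntax

// Same SHA-1-based hash as in the question, wrapped directly as a UDF
val calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16)
    .mod(BigInteger.valueOf(10000))
})

// Usage on the dataframe from the question
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))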

display column names into a List[Column] - scala

I want to insert the list of columns from a dataframe into a List[Column] so I can perform a select. That is, I want to get the list of columns and insert it automatically into a List[Column]. Any help? Thanks
object PCA extends App {
  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  val strPath = "C:/Users/mhattabi/Desktop/testBis2.txt"
  val intial_Data = spark.read.option("header", true).csv(strPath)
  // array of strings containing the column names
  val arrayList = intial_Data.columns
  var colsList = List[Column]()
  // want to insert the column names into colsList
  arrayList.foreach(p => colsList.)
  // i want to have something like
  // val colsList = List(col("col1"), col("col2"))
  // intial_Data.select(colsList:_*).show
}
You could use the col function as follows:
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
Remember to import sql functions to use col:
import org.apache.spark.sql.functions._
I would rather use an immutable list than a mutable var, building it with a transformation like the one below.
val arrayList = intial_Data.columns
val colsList = arrayList.map(col).toList
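Either way, the resulting list can be used for the select the question hints at:
// Usage sketch: select every column through the built Column list
intial_Data.select(colsList: _*).show()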

how to join two datasets by key in scala spark

I have two datasets, and each dataset has two elements.
Below are examples.
Data1: (name, animal)
('abc,def', 'monkey(1)')
('df,gh', 'zebra')
...
Data2: (name, fruit)
('a,efg', 'apple')
('abc,def', 'banana(1)')
...
Results expected: (name, animal, fruit)
('abc,def', 'monkey(1)', 'banana(1)')
...
I want to join these two datasets using the first column, 'name'. I have tried to do this for a couple of hours, but I couldn't figure it out. Can anyone help me?
val sparkConf = new SparkConf().setAppName("abc").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val text1 = sc.textFile(args(0))
val text2 = sc.textFile(args(1))
val joined = text1.join(text2)
The above code is not working!
join is defined on RDDs of pairs, that is, RDDs of type RDD[(K,V)].
The first step needed is to transform the input data into the right type.
We first need to transform the original data of type String into pairs of (Key, Value):
val parse: String => (String, String) = s => {
  val regex = "^\\('([^']+)',[\\W]*'([^']+)'\\)$".r
  s match {
    case regex(k, v) => (k, v)
    case _ => ("", "")
  }
}
(Note that we can't use a simple split(",") expression because the key contains commas)
Then we use that function to parse the text input data:
val s1 = Seq("('abc,def', 'monkey(1)')","('df,gh', 'zebra')")
val s2 = Seq("('a,efg', 'apple')","('abc,def', 'banana(1)')")
val rdd1 = sparkContext.parallelize(s1)
val rdd2 = sparkContext.parallelize(s2)
val kvRdd1 = rdd1.map(parse)
val kvRdd2 = rdd2.map(parse)
Finally, we use the join method to join the two RDDs
val joined = kvRdd1.join(kvRdd2)
// Let's check out results
joined.collect
// res31: Array[(String, (String, String))] = Array((abc,def,(monkey(1),banana(1))))
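If you want the result in the (name, animal, fruit) shape shown in the question, a small extra map (a sketch on top of the answer above) flattens the nested tuple:
// Flatten (name, (animal, fruit)) into (name, animal, fruit)
val result = joined.map { case (name, (animal, fruit)) => (name, animal, fruit) }
result.collect
// Array((abc,def,monkey(1),banana(1)))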
You have to create pair RDDs for your data sets first, and then apply the join transformation. Your data sets don't look accurate.
Please consider the below example.
Dataset1
a 1
b 2
c 3
Dataset2
a 8
b 4
Your code in Scala should look like the below:
val pairRDD1 = sc.textFile("/path_to_yourfile/first.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val pairRDD2 = sc.textFile("/path_to_yourfile/second.txt").map(line => (line.split(" ")(0),line.split(" ")(1)))
val joinRDD = pairRDD1.join(pairRDD2)
joinRDD.collect
Here is the result from the Scala shell:
res10: Array[(String, (String, String))] = Array((a,(1,8)), (b,(2,4)))