Create a String dynamically based on parameters - Scala

I'm parametrizing a query in Scala.
I have an array of column names called colNames.
I want to build a string where each column name produces A.colName = B.colName, and then join all the items of the array with " AND " between them.
Example of input
val colNames = Array("colName1","colName2")
val table1 = "A"
val table2 = "B"
Example of the desired output
"A.colName1 = B.colName1 AND A.colName2 = B.colName2"
In a non-FP language I would do that with a for loop, but I don't know how to do it functionally in Scala.

You can use the map and mkString methods on Array:
scala> colNames map { colName => s"${table1}.${colName} = ${table2}.${colName}" } mkString " AND "
val res0: String = A.colName1 = B.colName1 AND A.colName2 = B.colName2
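If the for-loop style feels more familiar, the equivalent for-comprehension desugars to the same map call (a minor variation, assuming the same colNames, table1 and table2 as above):
scala> (for (colName <- colNames) yield s"${table1}.${colName} = ${table2}.${colName}").mkString(" AND ")
val res1: String = A.colName1 = B.colName1 AND A.colName2 = B.colName2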

Related

How to rename a dataframe column and change its datatype from another dataframe's values in Spark?

Hi, I have two dataframes like this:
import spark.implicits._
import org.apache.spark.sql._
val transformationDF = Seq(
  ("A_IN", "ain", "String"),
  ("ADDR_HASH", "addressHash", "String")
).toDF("db3Column", "hudiColumn", "hudiDatatype")

val addressDF = Seq(
  ("123", "uyt"),
  ("124", "qwe")
).toDF("A_IN", "ADDR_HASH")
Now I want to rename the columns and change the datatypes based on the values in transformationDF. The hudiColumn name and hudiDatatype from transformationDF will become the column name and datatype of addressDF.
I tried code like this, but it doesn't work:
var db3ColumnName: String = _
var hudiColumnName: String = _
var hudiDatatypeName: String = _

for (row <- transformationDF.rdd.collect) {
  db3ColumnName = row.mkString(",").split(",")(0)
  hudiColumnName = row.mkString(",").split(",")(1)
  hudiDatatypeName = row.mkString(",").split(",")(2)
  addressDF.withColumnRenamed(db3ColumnName, hudiColumnName).withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
Now when I print addressDF, the changes are not reflected.
Can anyone help me with this?
This is a textbook case that calls for using foldLeft:
val finalDF = transformationDF.collect.foldLeft(addressDF) { case (df, row) =>
  val db3ColumnName = row.getString(0)
  val hudiColumnName = row.getString(1)
  val hudiDatatypeName = row.getString(2)
  df.withColumnRenamed(db3ColumnName, hudiColumnName)
    .withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
Datasets in Spark are immutable: each operation that "modifies" a dataset actually returns a new object, leaving the one the operation was called on unchanged. The foldLeft above effectively starts with addressDF and chains all the transformations onto intermediate objects that are passed in as the accumulator (the df parameter of the function in the second argument list). The return value of the current iteration becomes the input of the next iteration, and the return value of the last iteration is the return value of foldLeft itself.
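foldLeft behaves the same way on ordinary Scala collections, which may make the accumulator threading easier to see (a minimal, self-contained sketch, unrelated to Spark):
// the result of each step becomes the accumulator passed to the next step
val words = List("a", "b", "c")
val joined = words.foldLeft("start")((acc, w) => acc + "->" + w)
// joined == "start->a->b->c"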
When you use withColumnRenamed or withColumn, it returns a new Dataset, so you should do it like this:
// note: addressDF must be declared as a var (not a val) for the reassignment below to compile
var db3ColumnName: String = null
var hudiColumnName: String = null
var hudiDatatypeName: String = null
for (row <- transformationDF.rdd.collect) {
  db3ColumnName = row.mkString(",").split(",")(0)
  hudiColumnName = row.mkString(",").split(",")(1)
  hudiDatatypeName = row.mkString(",").split(",")(2)
  addressDF = addressDF.withColumnRenamed(db3ColumnName, hudiColumnName).withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
addressDF.printSchema()
Printing the schema of addressDF will return:
root
|-- ain: string (nullable = true)
|-- addressHash: string (nullable = true)

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like this:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each of the 'values' entries (0.2, 0.4 and 0.6) will be multiplied by 100, prefixed with the letter 'v', and extracted into a separate column.
How would the code look in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the code below; please see the inline comments for the explanation.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp dataframe for fetching the nested values
    val index = dfTemp.schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Join dataframe with the list of nested columns
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mappings
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }
}
I would split the logic for the column-name change into 2 parts: the case where the name is a numeric value, and the case where it doesn't change.
def stringDecimalToVNumber(colName:String): String =
"v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms the name according to the case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it
}
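For example, a quick REPL check (assuming the two definitions above):
scala> transformColumnName("0.2")
res0: String = v20
scala> transformColumnName("id")
res1: String = id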
Now that we have the function to transform the column names, let's pick the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")
val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically transform all the columns inside inputs.values to their new names and put them next to id.

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using a script for CDC merge in Spark streaming. I wish to pass the column names in selectExpr through a parameter, as the column names would change for each table. When I pass the columns and struct field through a string variable, I get the error ==> mismatched input ',' expecting
Below is the piece of code I am trying to parameterize.
var filteredMicroBatchDF = microBatchOutputDF
  .selectExpr("col1", "col2", "struct(offset,KAFKA_TS) as otherCols")
  .groupBy("col1", "col2").agg(max("otherCols").as("latest"))
  .selectExpr("col1", "col2", "latest.*")
Reference to the script I am trying to emulate:
https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
I have tried the below, passing the column names in a variable and then reading them in selectExpr from these variables:
val keyCols = "col1","col2"
val structCols = "struct(offset,KAFKA_TS) as otherCols"

var filteredMicroBatchDF = microBatchOutputDF
  .selectExpr(keyCols, structCols)
  .groupBy(keyCols).agg(max("otherCols").as("latest"))
  .selectExpr(keyCols, "latest.*")
When I run the script, it gives me the error:
org.apache.spark.sql.streaming.StreamingQueryException:
mismatched input ',' expecting <<EOF>>
EDIT
Here is what I have tried after comments by Luis Miguel, which works fine:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols): _*)
    .groupBy(keyCols.head, keyCols.tail: _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*"): _*)

var keyColumns = Seq("COL1", "COL2")
var structColumns = "offset,Kafka_TS"

foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
Note: the below results in an error
foo(microBatchOutputDF)(keyCols = Seq(keyColumns), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The thing about the working code above is that keyColumns were hardcoded. So I tried reading them, firstly from a parameter file and secondly from a widget, which resulted in an error, and it is here that I am looking for advice and suggestions:
First Method
import java.util.Properties
import scala.io.Source

def loadProperties(url: String): Properties = {
  val properties: Properties = new Properties()
  if (url != null) {
    val source = Source.fromURL(url)
    properties.load(source.bufferedReader())
  }
  properties
}

var tableProp: Properties = new Properties()
tableProp = loadProperties("dbfs:/Configs/Databricks/Properties/table/Table.properties")
var keyColumns = Seq(tableProp.getProperty("keyCols"))
var structColumns = tableProp.getProperty("structCols")
keyCols and StructCols are defined in the parameter file as:
keyCols = Col1, Col2 (I also tried assigning these as "Col1","Col2")
StructCols = offset,Kafka_TS
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The code throws the error pointing at the first comma (as if it's taking the columns field as a single argument):
mismatched input ',' expecting <EOF>
== SQL ==
"COL1","COL2""
-----^^^
If I pass just one column in the keyCols property, the code works fine.
E.g. keyCols = Col1
Second Method
Here I tried reading the key columns from a widget, and it's the same error again.
dbutils.widgets.text("prmKeyCols", "","")
val prmKeyCols = dbutils.widgets.get("prmKeyCols")
var keyColumns = Seq(prmKeyCols)
The widget value is passed in as below:
"Col1","Col2"
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
This also gives the same error.
Something like this should work:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols): _*)
    .groupBy(keyCols.head, keyCols.tail: _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*"): _*)
Which you can use like:
foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))
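If the key columns arrive as a single comma-separated string (from a properties file or a widget, as in the question), that string still has to be split into a Seq[String] with one element per column before calling foo; passing the whole string as one element is what triggers the mismatched input ',' error. A minimal sketch, assuming tableProp and structColumns from the question and a keyCols value such as Col1, Col2:
// raw value from the properties file / widget, e.g. "Col1, Col2" or "\"Col1\",\"Col2\""
val rawKeyCols = tableProp.getProperty("keyCols")
// split on commas and strip whitespace and any surrounding quotes
val keyColumns: Seq[String] =
  rawKeyCols.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")).filter(_.nonEmpty).toSeq

foo(microBatchOutputDF)(keyCols = keyColumns, structCols = Seq("struct(" + structColumns + ") as otherCols"))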

How to use if-else condition in Scala's Filter?

I have an ArrayBuffer with data in the following format: period_name:character varying(15) year:bigint. Each entry represents a column name of a table and its datatype. My requirement is to extract the column name (e.g. period_name) and the datatype, just character varying, excluding the substring from "(" to ")", and then send all the elements to a ListBuffer. I came up with the following logic:
for (i <- receivedGpData) {
  gpTypes = i.split("\\:")
  if (gpTypes(1).contains("(")) {
    gpColType = gpTypes(1).substring(0, gpTypes(1).indexOf("("))
    prepList += gpTypes(0) + " " + gpColType
  } else {
    prepList += gpTypes(0) + " " + gpTypes(1)
  }
}
The above code is working, but I am trying to implement the same using Scala's map and filter functions. What I don't understand is how to use the if-else condition after the filter:
var reList = receivedGpData.map(element => element.split(":"))
  .filter { x => x(1).contains("(") }
Could anyone let me know how I can implement the same for-loop code using Scala's map & filter functions?
val receivedGpData = Array("bla:bla(1)", "bla2:cat")
val res = receivedGpData
  .map(_.split(":"))
  .map(s => (s(0), s(1).takeWhile(_ != '(')))
  .map(s => s"${s._1} ${s._2}").toList
println(res)
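which should print:
List(bla bla, bla2 cat)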
Using regex:
val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
val res = data.map{case p(x,y,_)=>x+" "+y}
In Scala REPL:
scala> val data = Array("period_name:character varying(15)","year:bigint")
data: Array[String] = Array(period_name:character varying(15), year:bigint)
scala> val p = "(\\w+):([.[^(]]*)(\\(.*\\))?".r
p: scala.util.matching.Regex = (\w+):([.[^(]]*)(\(.*\))?
scala> val res = data.map{case p(x,y,_)=>x+" "+y}
res: Array[String] = Array(period_name character varying, year bigint)

Map value empty outside a foreach in Scala

I have just started programming in Scala. I'm also using Apache Spark to read a file, moviesFile. In the following code, I'm updating a mutable map inside a foreach function. The map is updated within the foreach, but the values are not present once the foreach exits.
How do I make the values remain in the map variable movieMap?
val movieMap = scala.collection.mutable.Map[String, String]()

val movie = moviesFile.map(_.split("::")).foreach { x =>
  x.mkString(" ")
  val movieid = x(0)
  val title = x(1)
  val genre = x(2)
  val value = title + "," + genre
  movieMap(movieid.toString()) = value.toString()
  println(movieMap.keySet)
}

println(movieMap.keySet)
println(movieMap.get("29"))
I believe that you are using Spark in a very wrong way. If you want to utilize Spark, you will have to use Spark's distributed data structures.
I would suggest staying with Spark's distributed and parallelized data structure (RDDs). RDDs that contain (key, value) pairs are implicitly provided with some Map-like functionality.
import org.apache.spark.SparkContext._
import scala.collection.mutable

// Assume sc is the SparkContext instance
val moviesFileRdd = sc.textFile("movies.txt")

// moviesRdd is RDD[(String, String)], which acts as a Map-like thing of (key, value) pairs
val moviesRdd = moviesFileRdd.map { line =>
  val splitLine = line.split("::")
  val movieId = splitLine(0)
  val title = splitLine(1)
  val genre = splitLine(2)
  val value = title + ", " + genre
  (movieId.toString, value.toString)
}

// You see... RDD[(String, String)] offers some Map-like things.

// get a list of all values with key 29
val listOfValuesWithKey29 = moviesRdd.lookup("29")

// I don't know why, but if you really need a map here then
val moviesMap = moviesRdd.collectAsMap

// moviesMap will be an immutable Map; in case you need a mutable Map,
val moviesMutableMap = mutable.Map(moviesMap.toList: _*)
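For example, assuming movies.txt contains lines such as 29::SomeTitle::SomeGenre (hypothetical sample data), the Map-like operations behave like this:
moviesRdd.lookup("29")        // Seq("SomeTitle, SomeGenre") -- all values stored under key "29"
moviesMutableMap.get("29")    // Some("SomeTitle, SomeGenre") if the key exists, None otherwise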