How to pass column names in selectExpr through one or more string parameters in Spark using Scala?

I am using a script for CDC Merge in Spark Streaming. I wish to pass the column values to selectExpr through a parameter, as the column names would change for each table. When I pass the columns and the struct field through a string variable, I get the error ==> mismatched input ',' expecting
Below is the piece of code I am trying to parameterize.
var filteredMicroBatchDF = microBatchOutputDF
  .selectExpr("col1", "col2", "struct(offset,KAFKA_TS) as otherCols")
  .groupBy("col1", "col2").agg(max("otherCols").as("latest"))
  .selectExpr("col1", "col2", "latest.*")
Reference to the script I am trying to emulate:
https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
I have tried the below, passing the column names in variables and then reading them in selectExpr from those variables:
val keyCols = "col1","col2"
val structCols = "struct(offset,KAFKA_TS) as otherCols"

var filteredMicroBatchDF = microBatchOutputDF
  .selectExpr(keyCols, structCols)
  .groupBy(keyCols).agg(max("otherCols").as("latest"))
  .selectExpr(keyCols, "latest.*")
When I run the script, it gives me this error:
org.apache.spark.sql.streaming.StreamingQueryException:
mismatched input ',' expecting <EOF>
EDIT
Here is what I have tried after the comments by Luis Miguel, which works fine:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols): _*)
    .groupBy(keyCols.head, keyCols.tail: _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*"): _*)
var keyColumns = Seq("COL1", "COL2")
var structColumns = "offset,Kafka_TS"

foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
Note: Below results in an error
foo(microBatchOutputDF)(keyCols = Seq(keyColumns), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The thing about the above working code is that keyColumns is hardcoded there. So I tried reading it (first) from a parameter file and (second) from a widget, both of which resulted in an error, and it is here that I am looking for advice and suggestions:
First Method
import java.util.Properties
import scala.io.Source

def loadProperties(url: String): Properties = {
  val properties: Properties = new Properties()
  if (url != null) {
    val source = Source.fromURL(url)
    properties.load(source.bufferedReader())
  }
  return properties
}
var tableProp: Properties = new Properties()
tableProp = loadProperties("dbfs:/Configs/Databricks/Properties/table/Table.properties")
var keyColumns = Seq(tableProp.getProperty("keyCols"))
var structColumns = tableProp.getProperty("structCols")
keyCols and StructCols are defined in the parameter file as:
keyCols = Col1, Col2 (I also tried assigning these as "Col1","Col2")
StructCols = offset,Kafka_TS
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
The code throws the error pointing at the first comma (as if it is taking the columns field as a single argument):
mismatched input ',' expecting <EOF>
== SQL ==
"COL1","COL2""
-----^^^
If I pass just one column in the keyCols property, the code works fine.
E.g. keyCols = Col1
Second Method
Here I tried reading the key columns from a widget, and it's the same error again.
dbutils.widgets.text("prmKeyCols", "","")
val prmKeyCols = dbutils.widgets.get("prmKeyCols")
var keyColumns = Seq(prmKeyCols)
The widget is passed in as below:
"Col1","Col2"
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns: _*), structCols = Seq("struct(" + structColumns + ") as otherCols"))
This also gives the same error.

Something like this should work:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}

def foo(microBatchOutputDF: DataFrame)
       (keyCols: Seq[String], structCols: Seq[String]): DataFrame =
  microBatchOutputDF
    .selectExpr((keyCols ++ structCols): _*)
    .groupBy(keyCols.head, keyCols.tail: _*).agg(sqlfun.max("otherCols").as("latest"))
    .selectExpr((keyCols :+ "latest.*"): _*)
Which you can use like:
foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))
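As for the properties file and widget variants in the question: the root cause is that the whole string "Col1,Col2" arrives as one element, so Seq(keyColumns) is a single-element Seq and selectExpr sees one expression containing a comma. A minimal sketch of splitting it into a real Seq[String] before calling foo (assuming the property/widget value is a plain comma-separated list; the quote-stripping is only needed if the values are quoted):

val rawKeyCols = tableProp.getProperty("keyCols")      // or dbutils.widgets.get("prmKeyCols")
val rawStructCols = tableProp.getProperty("structCols")

// Split on commas and trim whitespace and stray quotes so each column name becomes its own element
val keyColumns: Seq[String] = rawKeyCols.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")).toSeq

foo(microBatchOutputDF)(keyCols = keyColumns, structCols = Seq("struct(" + rawStructCols + ") as otherCols"))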

Related

How to rename a dataframe column and datatype from another dataframe values in spark?

Hi, I have two dataframes like this:
import spark.implicits._
import org.apache.spark.sql._

val transformationDF = Seq(
  ("A_IN", "ain", "String"),
  ("ADDR_HASH", "addressHash", "String")
).toDF("db3Column", "hudiColumn", "hudiDatatype")

val addressDF = Seq(
  ("123", "uyt"),
  ("124", "qwe")
).toDF("A_IN", "ADDR_HASH")
Now I want to rename the columns and change the datatypes based on the values in transformationDF. The hudiColumn name and hudiDatatype from transformationDF will become the column name and datatype of addressDF.
I tried code like this, but it doesn't work:
var db3ColumnName: String = _
var hudiColumnName: String = _
var hudiDatatypeName: String = _

for (row <- transformationDF.rdd.collect) {
  db3ColumnName = row.mkString(",").split(",")(0)
  hudiColumnName = row.mkString(",").split(",")(1)
  hudiDatatypeName = row.mkString(",").split(",")(2)
  addressDF.withColumnRenamed(db3ColumnName, hudiColumnName).withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
Now when I print addressDF, the changes are not reflected.
Can anyone help me with this?
This is a textbook case that calls for using foldLeft:
val finalDF = transformationDF.collect.foldLeft(addressDF) { case (df, row) =>
  val db3ColumnName = row.getString(0)
  val hudiColumnName = row.getString(1)
  val hudiDatatypeName = row.getString(2)
  df.withColumnRenamed(db3ColumnName, hudiColumnName)
    .withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
Datasets in Spark are immutable and each operation that "modifies" a dataset actually returns a new object leaving the one that the operation was called on unchanged. The above foldLeft effectively starts with addressDF and chains all the transformations onto intermediate objects that get passed as the first argument in the second argument list. The return value of the current iteration becomes the input of the next iteration. The return value of the last iteration is the return value of foldLeft itself.
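For intuition, with the two rows of transformationDF from the question, the fold above unrolls into the same chain you would write by hand, each step feeding the next (a sketch; the intermediate names are only for illustration):

val step1 = addressDF
  .withColumnRenamed("A_IN", "ain")
  .withColumn("ain", col("ain").cast("String"))
val manualFinalDF = step1
  .withColumnRenamed("ADDR_HASH", "addressHash")
  .withColumn("addressHash", col("addressHash").cast("String"))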
When you use withColumnRenamed or withColumn, it returns a new Dataset, so you should do it like this (note that addressDF must be declared as a var for the reassignment to compile):
var db3ColumnName: String = null
var hudiColumnName: String = null
var hudiDatatypeName: String = null

for (row <- transformationDF.rdd.collect) {
  db3ColumnName = row.mkString(",").split(",")(0)
  hudiColumnName = row.mkString(",").split(",")(1)
  hudiDatatypeName = row.mkString(",").split(",")(2)
  addressDF = addressDF.withColumnRenamed(db3ColumnName, hudiColumnName).withColumn(hudiColumnName, col(hudiColumnName).cast(hudiDatatypeName))
}
addressDF.printSchema()
Printing the schema of addressDF will return:
root
|-- ain: string (nullable = true)
|-- addressHash: string (nullable = true)

Dynamic conversion of Array of double columns into multiple columns in nested spark dataframe

My current DataFrame looks like the one below:
{"id":"1","inputs":{"values":{"0.2":[1,1],"0.4":[1,1],"0.6":[1,1]}},"id1":[1,2]}
I want to transform this dataframe into the below dataFrame:
{"id":"1", "v20":[1,1],"v40":[1,1],"v60":[1,1],"id1":[1,2]}
This means that each item of the 'values' array (0.2, 0.4 and 0.6) will be multiplied by 100, prepended with the letter 'v', and extracted into a separate column.
How would the code look in order to achieve this? I have tried withColumn but couldn't achieve it.
Try the below code, and please see the inline comments for the explanation:
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object DynamicCol {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.read.json("src/main/resources/dyamicCol.json") // Load the JSON file
    val dfTemp = df.select(col("inputs.values").as("values")) // Temp dataframe for fetching the nested values
    val index = dfTemp
      .schema.fieldIndex("values")
    val propSchema = dfTemp.schema(index).dataType.asInstanceOf[StructType]
    val dfFinal = propSchema.fields.foldLeft(df)((df, field) => { // Fold the list of nested columns into the dataframe
      val colNameInt = (field.name.toDouble * 100).toInt
      val colName = s"v$colNameInt"
      df.withColumn(colName, col("inputs.values.`" + field.name + "`")) // Add the nested column mapping
    }).drop("inputs") // Drop the extra column
    dfFinal.write.mode(SaveMode.Overwrite).json("src/main/resources/dyamicColOut.json") // Output the JSON file
  }

}
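With the sample record from the question, dfFinal should end up with the columns id, id1, v20, v40 and v60 (inputs has been dropped). A quick way to verify inside main before the write (my addition, not part of the original answer):

dfFinal.printSchema() // expect: id, id1, v20, v40, v60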
I would split the logic for changing the column names into two parts: one for names that are numeric values, and one for names that stay unchanged.
def stringDecimalToVNumber(colName: String): String =
  "v" + (colName.toFloat * 100).toInt.toString
and form a single function that transforms each name according to its case:
val floatRegex = """(\d+\.?\d*)""".r

def transformColumnName(colName: String): String = colName match {
  case floatRegex(v) => stringDecimalToVNumber(v) // it's a float, transform it
  case x => x // keep it as is
}
Now that we have the function to transform the column names, let's pick the schema dynamically.
val flattenDF = df.select("id", "inputs.values.*")

val finalDF = flattenDF
  .schema.names
  .foldLeft(flattenDF)((dfacum, x) => {
    val newName = transformColumnName(x)
    if (newName == x)
      dfacum // the name didn't need to be changed
    else
      dfacum.withColumnRenamed(x, newName)
  })
This will dynamically rename all the columns that came from inputs.values and put them next to id.
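As a quick sanity check of the renaming logic (my own sketch, not part of the original answer):

transformColumnName("0.2") // "v20" (0.2 * 100 = 20, prefixed with "v")
transformColumnName("id")  // "id"  (no float match, kept as is)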

Looping through Map Spark Scala

Within this code we have two files: athletes.csv, which contains names, and twitter.test, which contains tweet messages. We want to find, for every single line in twitter.test, the names that match a name in athletes.csv. We applied a map function to store the names from athletes.csv and want to check every name against every line in the test file.
import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object twitterAthlete {

  def loadAthleteNames(): Map[String, String] = {
    // Handle character encoding issues:
    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    // Create a Map of athlete names to one of their attributes, populated from athletes.csv.
    var athleteInfo: Map[String, String] = Map()
    val lines = Source.fromFile("../athletes.csv").getLines()
    for (line <- lines) {
      var fields = line.split(',')
      if (fields.length > 1) {
        athleteInfo += (fields(1) -> fields(7))
      }
    }
    return athleteInfo
  }

  def parseLine(line: String): (String) = {
    var athleteInfo = loadAthleteNames()
    var hello = new String
    for ((k, v) <- athleteInfo) {
      if (line.toString().contains(k)) {
        hello = k
      }
    }
    return (hello)
  }

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val sc = new SparkContext("local[*]", "twitterAthlete")
    val lines = sc.textFile("../twitter.test")
    var athleteInfo = loadAthleteNames()
    val splitting = lines.map(x => x.split(";")).map(x => if (x.length == 4 && x(2).length <= 140) x(2))
    var hello = new String()
    val container = splitting.map(x => for ((key, value) <- athleteInfo) if (x.toString().contains(key)) { key }).cache
    container.collect().foreach(println)
    // val mapping = container.map(x => (x,1)).reduceByKey(_+_)
    // mapping.collect().foreach(println)
  }
}
The first file looks like:
id,name,nationality,sex,height........
001,Michael,USA,male,1.96 ...
002,Json,GBR,male,1.76 ....
003,Martin,female,1.73 . ...
The second file looks like:
time, id , tweet .....
12:00, 03043, some message that contain some athletes names , .....
02:00, 03023, some message that contain some athletes names , .....
something like this ...
But I got an empty result after running this code; any suggestions are much appreciated.
The result I got is empty:
()....
()...
()...
but the result I expected is something like:
(name,1)
(other name,1)
You need to use yield to return a value from the for comprehension inside your map:
val container = splitting.map(x => for((key,value) <- athleteInfo ; if(x.toString().contains(key)) ) yield (key, 1)).cache
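Each element of container is then a collection of (key, 1) pairs for that line. If you want a flat RDD of pairs so that the commented-out reduceByKey part works, a flatMap along the same lines should do it (my sketch, not part of the original answer):

val pairs = splitting.flatMap(x => for ((key, value) <- athleteInfo if x.toString().contains(key)) yield (key, 1))
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)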
I think you should just start with the simplest option first...
I would use DataFrames so you can use the built-in CSV parsing and leverage Catalyst, Tungsten, etc.
Then you can use the built-in Tokenizer to split the tweets into words, explode, and do a simple join. Depending on how big or small the data with athlete names is, you'll end up with a more optimized broadcast join and avoid a shuffle.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.Tokenizer
val tweets = spark.read.format("csv").load(...)
val athletes = spark.read.format("csv").load(...)
val tokenizer = new Tokenizer()
tokenizer.setInputCol("tweet")
tokenizer.setOutputCol("words")
val tokenized = tokenizer.transform(tweets)
val exploded = tokenized.withColumn("word", explode('words))
val withAthlete = exploded.join(athletes, 'word === 'name)
withAthlete.select(exploded("id"), 'name).show()
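If the DataFrame with the athlete names is small enough, you can also make the broadcast explicit instead of relying on the autoBroadcastJoinThreshold (a sketch, reusing the names from the snippet above):

import org.apache.spark.sql.functions.broadcast
val withAthleteBroadcast = exploded.join(broadcast(athletes), 'word === 'name)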

Apache Spark: UDF column based on another column without passing its name as an argument

There is a Dataset with a column firm; I'm adding another column, firm_id, to this Dataset. Here's an example:
private val firms: mutable.Map[String, Integer] = ...
private val firmIdFromCode: (String => Integer) = (code: String) => firms(code)
val firm_id_by_code: UserDefinedFunction = udf(firmIdFromCode)
...
val ds = dataset.withColumn("firm_id", firm_id_by_code($"firm"))
Is there a way to eliminate passing $"firm" as an argument (this column is always present in the DS)?
I am looking for something like this:
val ds = dataset.withColumn("firm_id", firm_id_by_code)
You could supply the column it will use when you define the UDF:
val someUdf = udf{ /*udf code*/}.apply($"colName")
// Usage in dataset
val ds = dataset.withColumn("newColName",someUdf)
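Applied to the code in the question, that would look roughly like this (a sketch reusing the question's own identifiers):

val firm_id_by_code = udf(firmIdFromCode).apply($"firm") // the column is bound when the UDF value is defined
val ds = dataset.withColumn("firm_id", firm_id_by_code)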

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare it with the set "123,200,300"; if a match is found, output the matching data:
200,300 (from input line 1)
300 (from input line 2)
123 (from input line 4)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object sparkApp {

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map { r => parseLine(r) }
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong.
The problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define the UDF:
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simpler answer, why are we making it so complex? In this case we don't require a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")

rawrdd.map(_.split('|')) // note: split('|') takes a Char; split("|") would treat "|" as a regex and split every character
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
  .foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
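In other words, keep the per-line matching as plain Scala collections and never touch the SparkContext inside a transformation; a minimal sketch of the original parseLine rewritten that way (same idea as the non-UDF answer above):

val matchSet = Set("123", "200", "300")
val datad = rawData.map(line => line.split(",").toSet.intersect(matchSet).mkString(","))
datad.collect().foreach(println)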