Joining multiple tables using a loop concept in Scala?

I am trying to execute the following piece of code but keep getting an error.
My code is given below. I am trying to pass a key-value pair while joining the tables, but the value is not being passed.
val mainDF = Seq(("1","acv","34","a"),("2","fbg","56","b"),("3","rty","78","c"))
  .toDF("id","name","age","DBName")
val deltaDF = Seq(("1","gbj","67","a"),("2","gbj","67","a"),("2","jku","88","b"),("4","jku","88","b"),("5","uuu","12","c"))
  .toDF("id","name","age","DBName")

val nameMap = Map("TT" -> "id")

for ((k, i) <- nameMap) {
  val updatedDF1 = mainDF.as("main")
    .join(deltaDF.as("delta"), $"main.$i" === $"delta.$i" && $"main.DBName" === $"delta.DBName", "outer")
    .select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
  println(s"key: $k, value: $i")
}
updatedDF1.show()
And the error:
Error : <console>:30: error: not found: value updatedDF1
updatedDF1.show()
Can anyone help me or suggest a different way to do the same?

As the comments suggest, the updatedDF1 declaration goes out of scope outside of the for block and is therefore not visible in the enclosing lexical scope. Move the show action into the for block:
val nameMap = Map("TT" -> "id")

for ((k, i) <- nameMap) {
  val updatedDF1 = mainDF.as("main")
    .join(deltaDF.as("delta"), $"main.$i" === $"delta.$i" && $"main.DBName" === $"delta.DBName", "outer")
    .select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
  println(s"key: $k, value: $i")
  updatedDF1.show()
}
To understand why:
"A block is delimited by braces { ... }."
"The definitions inside a block are only visible from within the block."
Points taken from https://www.scala-exercises.org/scala_tutorial/lexical_scopes
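If the joined DataFrame is also needed after the loop, a minimal sketch (assuming the same mainDF, deltaDF and nameMap as above, with spark.implicits._ in scope) is to map over nameMap and keep each result in a collection, so it stays in scope after the iteration:
// Minimal sketch: build each joined DataFrame while mapping over nameMap,
// so the results remain accessible outside the loop body.
val updatedDFs = nameMap.map { case (k, i) =>
  val updated = mainDF.as("main")
    .join(deltaDF.as("delta"), $"main.$i" === $"delta.$i" && $"main.DBName" === $"delta.DBName", "outer")
    .select(mainDF.columns.map(c => coalesce($"delta.$c", $"main.$c") as c): _*)
  k -> updated
}
updatedDFs.foreach { case (k, df) =>
  println(s"key: $k")
  df.show()
}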

Related

Creating a list of StructFields from data frame

I need to ultimately build a schema from a CSV. I can read the CSV into a data frame, and I've got a case class defined.
case class metadata_class(colname: String, datatype: String, length: Option[Int], precision: Option[Int])
val foo = spark.read.format("csv").option("delimiter", ",").option("header", "true")
  .schema(Encoders.product[metadata_class].schema)
  .load("/path/to/file").as[metadata_class].toDF()
Now I'm trying to iterate through that data frame and build a list of StructFields. My current effort:
val sList: List[StructField] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  }
)
That gives me a type mismatch:
found : Unit
required: org.apache.spark.sql.types.StructField
for (m <- foo.as[metadata_class].collect) {
^
What am I doing wrong here? Or am I not even close?
It is not usual to use a for loop like this in Scala. A for loop (without yield) has Unit return type, so in your code the value of sList will be a List[Unit]:
val sList: List[Unit] = List(
  for (m <- foo.as[metadata_class].collect) {
    StructField(m.colname, getType(m.datatype))
  }
)
but you declared sList as List[StructField], which is the cause of the compile error.
I suggest using the map function instead of a for loop to iterate over the metadata_class objects and create StructFields from them:
val structFields: List[StructField] = foo.as[metadata_class]
  .collect
  .map(m => StructField(m.colname, getType(m.datatype)))
  .toList
This way you will get a List[StructField].
In Scala every statement is an expression with a return type; a for loop is an expression too, and its return type is Unit.
read more about statements/expressions:
statement vs expression in scala
statements and expressions in scala
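As a small follow-up sketch (getType is the question's own helper, assumed to map the type-name strings to Spark DataTypes), the resulting list can be wrapped into the StructType schema the question ultimately needs:
import org.apache.spark.sql.types.{StructField, StructType}

// Minimal sketch: once the List[StructField] exists, the CSV schema is just a
// StructType wrapping that list (structFields comes from the snippet above).
val csvSchema: StructType = StructType(structFields)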

Hashing multiple columns of spark dataframe

I need to hash specific columns of a Spark dataframe. Some columns have specific datatypes which are basically extensions of the standard Spark DataType class. The problem is that, for some reason, some conditions inside the when case don't work as expected.
As a hashing configuration I have a map. Let's call it tableConfig:
val tableConfig = Map("a" -> "KEEP", "b" -> "HASH", "c" -> "KEEP", "d" -> "HASH", "e" -> "KEEP")
The salt variable is concatenated with the column value:
val salt = "abc"
The function for hashing looks like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedColumns = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedColumns = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedColumns: _*)
  val colTypes = cleanedDF.dtypes.toMap

  def typeFromString(s: String): DataType = s match {
    case "StringType" => StringType
    case "BooleanType" => BooleanType
    case "IntegerType" => IntegerType
    case "DateType" => DateType
    case "ShortType" => ShortType
    case "DecimalType(15,7)" => DecimalType(15,7)
    case "DecimalType(18,2)" => DecimalType(18,2)
    case "DecimalType(11,7)" => DecimalType(11,7)
    case "DecimalType(17,2)" => DecimalType(17,2)
    case "DecimalType(38,2)" => DecimalType(38,2)
    case _ => throw new TypeNotPresentException(
      "Please check types in the dataframe. The following column type is missing: ".concat(s), null
    )
  }

  val getType = colTypes.map { case (k, _) => (k, typeFromString(colTypes(k))) }

  val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) {
    (memoDF, colName) =>
      memoDF.withColumn(
        colName,
        when(col(colName).isin(hashedColumns: _*) && col(colName).isNull, null)
          .when(col(colName).isin(hashedColumns: _*) && col(colName).isNotNull,
            sha2(concat(col(colName), lit(salt)), 256))
          .otherwise(col(colName))
      )
  }
  hashedDF
}
I am getting an error regarding a specific column. Namely, the error is the following:
org.apache.spark.sql.AnalysisException: cannot resolve '(c IN ('a', 'b', 'd', 'e'))' due to data type mismatch: Arguments must be same type but were: boolean != string;;
Column names are changed.
My search didn't give any clear explanation of why the isin or isNull functions don't work as expected. Also, I follow a specific style of implementation and want to avoid the following approaches:
1) No UDFs. They are painful for me.
2) No for loops over the Spark dataframe columns. The data could contain more than a billion rows and it's going to be a headache in terms of performance.
As mentioned in the comments, the first fix should be to remove the condition col(colName).isin(hashedColumns: _*) && col(colName).isNull, since this check will always be false.
As for the error, it is because of the mismatch between the value type of col(colName) and hashedColumns. The values of hashedColumns are always strings, therefore col(colName) should be a string column as well, but in your case it seems to be a Boolean.
The last issue that I see here has to do with the logic of the foldLeft. If I understood correctly, what you want to achieve is to go through the columns and apply sha2 to those that exist in hashedColumns. To achieve that you must change your code to:
// 1st change: convert each element of hashedColumns from String to a Spark column
val hashArray = hashedColumns.map(lit(_))

val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) {
  (memoDF, colName) =>
    memoDF.withColumn(
      colName,
      // 2nd change: check if colName is in "a", "b", "c", "d" etc; if so apply sha2, otherwise leave the value as it is
      when(col(colName).isNotNull && array_contains(array(hashArray: _*), lit(colName)),
        sha2(concat(col(colName), lit(salt)), 256)
      )
    )
}
UPDATE:
Iterating through all the columns via foldLeft wouldn't be efficient and adds extra overhead, even more so when you have a large number of columns (see the discussion with @baitmbarek below), so I added one more approach that uses a single select instead of foldLeft. In the next code, when is applied only to the hashedColumns. We separate the columns into nonHashedCols and transformedCols, then we concatenate the lists and pass them to the select:
val transformedCols = hashedColumns.map { c =>
  when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
}
val nonHashedCols = (cleanedDF.columns.toSet -- hashedColumns.toSet).map(col(_)).toList

cleanedDF.select((nonHashedCols ++ transformedCols): _*)
(Posted a solution on behalf of the question author, to move it from the question to the answer section).
@AlexandrosBiratsis gave a very good solution in terms of performance and elegant implementation. So the hashColumns function will look like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedCols = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedCols = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedCols: _*)
  val transformedCols = hashedCols.map { c =>
    when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
  }
  val nonHashedCols = (cleanedDF.columns.toSet -- hashedCols.toSet).map(col(_)).toList
  val hashedDF = cleanedDF.select((nonHashedCols ++ transformedCols): _*)
  hashedDF
}
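A minimal usage sketch, assuming spark.implicits._ is in scope and using the tableConfig and salt defined in the question (the sample DataFrame below is hypothetical):
// Hypothetical sample data with the columns a..e from tableConfig:
// b and d get hashed, a, c and e are kept as they are.
val df = Seq(("1", "secret1", "x", "secret2", "y"), ("2", "secret3", "z", "secret4", "w"))
  .toDF("a", "b", "c", "d", "e")
val hashedDF = hashColumns(tableConfig, salt, df)
hashedDF.show(false)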

How to define a function in scala for flatMap

New to Scala, I want to try to rewrite some code using flatMap by calling a function instead of writing the whole process inside the parentheses.
The original code is like:
val longForm = summary.flatMap(row => {
  /* This is the code I want to replace with a function */
  val metric = row.getString(0)
  (1 until row.size).map { i =>
    (metric, schema(i).name, row.getString(i).toDouble)
  }
} /* End of function */)
The function I wrote is:
def tfunc(line: Row): List[Any] = {
  val metric = line.getString(0)
  var res = List[Any]
  for (i <- 1 to line.size) {
    /* Save each iteration result as a List[tuple], then append to the res List. */
    val tup = (metric, schema(i).name, line.getString(i).toDouble)
    val tempList = List(tup)
    res = res :: tempList
  }
  res
}
The function did not passed compilation with the following error:
error: missing argument list for method apply in object List
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing apply _ or apply(_) instead of apply.
var res = List[Any]
What is wrong with this function?
And for flatMap, is this the right way to return the result as a List?
You haven't explained why you want to replace that code block. Is there a particular goal you're after? There are many, many different ways that block could be rewritten. How can we know which would be better at meeting your requirements?
Here's one approach.
def tfunc(line: Row): List[(String, String, Double)] = {
  val metric = line.getString(0)
  List.tabulate(line.size - 1) { idx =>
    (metric, schema(idx + 1).name, line.getString(idx + 1).toDouble)
  }
}
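A sketch of the call site (summary and schema are the question's own values; if summary is a Dataset, spark.implicits._ is assumed to be in scope to provide the encoder for the tuple type):
// The inline block from the original code is replaced by a reference to tfunc.
val longForm = summary.flatMap(tfunc _)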

assign value in scala foreach loop

I am learning Scala but got stuck on a simple problem. I want to assign a value to a variable inside a foreach loop.
for example:
List A
foreach x in A { variable b = x; => then some operation => print result}
Can you please let me know how I can achieve this in Scala?
This is the proper way of running a foreach operation on a List:
val list: List[T] = /* list definition */
list foreach { x => var a = x; /* some operation */ }
1) You can use .map on the list if you want to process it and get a list of something else back (just like in maths, f: A => B).
input set
scala> val initialOrders = List("order1", "order2", "order3")
initialOrders: List[String] = List(order1, order2, order3)
function
scala> def shipOrder(order: Any) = order + " is shipped"
shipOrder: (order: Any)String
process input set and store output
scala> val shippedOrders = initialOrders.map(order => { val myorder = "my" + order; println(s"shipping is ${myorder}"); shipOrder(myorder) })
shipping is myorder1
shipping is myorder2
shipping is myorder3
shippedOrders: List[String] = List(myorder1 is shipped, myorder2 is shipped, myorder3 is shipped)
2) Or you can simply iterate with foreach on the list when you don't care about the output from the function.
scala> initialOrders.foreach(order => { val whateverVariable = order+ "-whatever"; shipOrder(order) })
Note
What is the difference between a var and val definition in Scala?

Type Mismatch in scala case match

I am trying to create multiple dataframes in a single foreach, using Spark, as below.
I get the values delivery and click out of row.getAs("type") when I try to print them.
val check = eachrec.foreach(recrd => recrd.map(row => {
  row.getAs("type") match {
    case "delivery" => val delivery_data = delivery(row.get(0).toString, row.get(1).toString)
    case "click" => val click_data = delivery(row.get(0).toString, row.get(1).toString)
    case _ => "not sure if this impacts"
  }
}))
but getting below error:
Error:(41, 14) type mismatch; found : String("delivery") required: Nothing
case "delivery" => val delivery_data = delivery(row.get(0).toString,row.get(1).toString)
^
My plan is to create dataframes using toDF() once I create these individual delivery objects referenced by delivery_data and click_data, by calling:
delivery_data.toDF() and click_data.toDF().
Please provide any clue regarding the error above (in the match case).
How can I create two DFs using toDF() in val check?
The val declarations make your first two cases' return type Unit, but in the third case you return a String.
For instance, here the type of z was inferred by the compiler as Unit:
def x = {
  val z: Unit = 3 match {
    case 2 => val a = 2
    case _ => val b = 3
  }
}
I think you need to convert the value you are matching on to a String:
row.getAs("type").toString
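A minimal sketch of an alternative to the .toString conversion above: giving getAs an explicit type parameter also keeps the compiler from inferring Nothing for the match target:
// Sketch: specifying the type up front keeps the match target a String.
row.getAs[String]("type") match {
  case "delivery" => // build delivery_data here
  case "click"    => // build click_data here
  case _          => // fall-through
}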