Function to hash specific columns in Spark Scala using SHA2 - scala

I have a streaming DataFrame reading messages from Kafka. Before I initiate writeStream, I want to hash some columns and mask a few others. The columns to be hashed or masked differ from table to table, hence I am making them parameterized.
I am using the below code to mask the selected columns, which is working fine.
var maskColumnsConfig = "COL1, COL2" // reading columns to mask from a config file or widget
var maskColumns = maskColumnsConfig.split(",").map(_.trim) // trim to handle spaces after commas

def maskData(base: DataFrame, maskColumns: Seq[String]) = {
  val maskExpr = base.columns.map { col => if (maskColumns.contains(col)) s"null as ${col}" else col }
  base.selectExpr(maskExpr: _*) // masked columns become null
}

val maskedDF = maskData(myDataFrame, Seq(maskColumns: _*))
Reference: How to mask columns using Spark 2?
For hashing, I am looking to create a function that does something similar to the below:
myDataFrame.withColumn("COL1_hashed",sha2($"COL1",256)).drop($"COL1").withColumnRenamed("COL1_Hashed", "COL1").withColumn("COL2_hashed",sha2($"COL2",256)).drop($"COL2").withColumnRenamed("COL2_Hashed", "COL2")
Edit: Instead, I can just do:
myDataFrame.withColumn("COL1", sha2($"COL1", 256)).withColumn("COL2", sha2($"COL2", 256))
i.e.
1. add a hashed column, 2. drop the original column, 3. rename the hashed column to the name of the original column,
4. repeat for the other columns to be hashed
EDIT: simplified to: 1. replace each column with its hashed values, 2. repeat for the other columns to be hashed
Any suggestions/ideas on how this can be achieved using a function that takes in multiple columns and performs the above operations on all of them? I tried creating a function like below, but it gives an error:
def hashData(base: DataFrame, hashColumns: Seq[String]) = {
  val hashExpr = base.columns.map { col => if (hashColumns.contains(col)) base.withColumn({col}, sha2({col}, 256)) else col }
  base.selectExpr(hashExpr: _*)
}
command-3855877266331823:2: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
val hashExpr = base.columns.map { col => if(hashColumns.contains(col)) base.withColumn({col},sha2({col},256)) else col }
EDIT 2: I tried to imitate the masking function, but it too gives an error.
def hashData(base: DataFrame, hashColumns: Seq[String]) = {
  val hashExpr = base.columns.map { col => if (hashColumns.contains(col)) base.withColumn(col, sha2(base(col), 256)) else col }
  base.selectExpr(hashExpr: _*)
}
Error:
found : Array[java.io.Serializable]
required: Array[_ <: String]
Note: java.io.Serializable >: String, but class Array is invariant in type T.
You may wish to investigate a wildcard type such as `_ >: String`. (SLS 3.2.10)
base.selectExpr(hashExpr: _*)
Ideally, I'd want one function doing both hashing and masking. I'd appreciate any ideas/leads in achieving this.
Scala: 2.11
Spark: 2.4.4
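For reference, one possible shape for such a combined function is sketched below (untested; maskAndHash is just a hypothetical name). It masks the mask columns with nulls and hashes the hash columns with SHA-256 in a single select, leaving everything else untouched:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sha2}

// Sketch: mask maskColumns as nulls, hash hashColumns with SHA-256, keep all other columns as-is
def maskAndHash(base: DataFrame, maskColumns: Seq[String], hashColumns: Seq[String]): DataFrame = {
  val projected = base.columns.map { c =>
    if (maskColumns.contains(c)) lit(null).as(c)                              // masked
    else if (hashColumns.contains(c)) sha2(col(c).cast("string"), 256).as(c)  // hashed (cast to string for sha2)
    else col(c)                                                               // unchanged
  }
  base.select(projected: _*)
}

Since this is a single projection, it can be applied to the streaming DataFrame before writeStream, just like maskData above.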

Related

Hashing multiple columns of spark dataframe

I need to hash specific columns of a Spark dataframe. Some columns have specific datatypes which are basically extensions of the standard Spark DataType class. The problem is that, for some reason, some conditions inside the when case don't work as expected.
As a hashing configuration I have a map. Let's call it tableConfig:
val tableConfig = Map("a" -> "KEEP", "b" -> "HASH", "c" -> "KEEP", "d" -> "HASH", "e" -> "KEEP")
The salt variable is concatenated with the column value before hashing:
val salt = "abc"
The function for hashing looks like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedColumns = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedColumns = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedColumns: _*)
  val colTypes = cleanedDF.dtypes.toMap

  def typeFromString(s: String): DataType = s match {
    case "StringType" => StringType
    case "BooleanType" => BooleanType
    case "IntegerType" => IntegerType
    case "DateType" => DateType
    case "ShortType" => ShortType
    case "DecimalType(15,7)" => DecimalType(15,7)
    case "DecimalType(18,2)" => DecimalType(18,2)
    case "DecimalType(11,7)" => DecimalType(11,7)
    case "DecimalType(17,2)" => DecimalType(17,2)
    case "DecimalType(38,2)" => DecimalType(38,2)
    case _ => throw new TypeNotPresentException(
      "Please check types in the dataframe. The following column type is missing: ".concat(s), null
    )
  }

  val getType = colTypes.map { case (k, _) => (k, typeFromString(colTypes(k))) }

  val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
    memoDF.withColumn(
      colName,
      when(col(colName).isin(hashedColumns: _*) && col(colName).isNull, null).
        when(col(colName).isin(hashedColumns: _*) && col(colName).isNotNull,
          sha2(concat(col(colName), lit(salt)), 256)).otherwise(col(colName))
    )
  }
  hashedDF
}
I am getting an error regarding a specific column. Namely, the error is the following:
org.apache.spark.sql.AnalysisException: cannot resolve '(c IN ('a', 'b', 'd', 'e'))' due to data type mismatch: Arguments must be same type but were: boolean != string;;
Column names are changed.
My search didn't give any clear explanation of why the isin or isNull functions don't work as expected. I also follow a specific style of implementation and want to avoid the following approaches:
1) No UDFs. They are painful for me.
2) No for loops over Spark dataframe columns. The data could contain more than a billion samples and it's going to be a headache in terms of performance.
As mentioned in the comments, the first fix should be to remove the condition col(colName).isin(hashedColumns: _*) && col(colName).isNull, since this check will always be false.
As for the error, it is because of the mismatch between the value type of col(colName) and hashedColumns. The values of hashedColumns are always strings, therefore col(colName) should be a string as well, but in your case it seems to be a boolean.
The last issue that I see here has to do with the logic of the foldLeft. If I understood correctly, what you want to achieve is to go through the columns and apply sha2 to those that exist in hashedColumns. To achieve that you must change your code to:
// 1st change: convert each element of hashedColumns from String to a Spark column
val hashArray = hashedColumns.map(lit(_))
val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
  memoDF.withColumn(
    colName,
    // 2nd change: check whether colName is one of the hashed columns ("b", "d");
    // if so apply sha2, otherwise leave the value as it is
    when(col(colName).isNotNull && array_contains(array(hashArray: _*), lit(colName)),
      sha2(concat(col(colName), lit(salt)), 256)
    ).otherwise(col(colName))
  )
}
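Note that the .otherwise(col(colName)) branch is essential here: without it, the when expression evaluates to null for every non-hashed column.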
UPDATE:
Iterating through all the columns via foldLeft isn't efficient and adds extra overhead, even more so when you have a large number of columns (see the discussion with @baitmbarek below), so I added one more approach that uses a single select instead of foldLeft. In the following code, when is applied only to the hashedColumns. We separate the columns into nonHashedCols and transformedCols, then concatenate the lists and pass them to the select:
val transformedCols = hashedColumns.map { c =>
  when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
}
val nonHashedCols = (cleanedDF.columns.toSet -- hashedColumns.toSet).map(col(_)).toList
cleanedDF.select((nonHashedCols ++ transformedCols): _*)
(Posted a solution on behalf of the question author, to move it from the question to the answer section.)
@AlexandrosBiratsis gave a very good solution in terms of performance and elegant implementation. So the hashColumns function will look like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedCols = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedCols = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedCols: _*)
  val transformedCols = hashedCols.map { c =>
    when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
  }
  val nonHashedCols = (cleanedDF.columns.toSet -- hashedCols.toSet).map(col(_)).toList
  val hashedDF = cleanedDF.select((nonHashedCols ++ transformedCols): _*)
  hashedDF
}
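For example, with the tableConfig and salt defined above, the call is simply (a sketch, with df standing for the input dataframe):

val result = hashColumns(tableConfig, salt, df)
// columns "b" and "d" now hold salted SHA-256 hex digests; the other columns are unchanged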

How to get datatype of column in spark dataframe dynamically

I have a dataframe and converted its dtypes to a map.
val dfTypesMap: Map[String, String] = df.dtypes.toMap
Output:
(PRODUCT_ID,StringType)
(PRODUCT_ID_BSTP_MAP,MapType(StringType,IntegerType,false))
(PRODUCT_ID_CAT_MAP,MapType(StringType,StringType,true))
(PRODUCT_ID_FETR_MAP_END_FR,ArrayType(StringType,true))
When I hardcode the type [String] in row.getAs[String], there is no compilation error.
df.foreach(row => {
  val prdValue = row.getAs[String]("PRODUCT_ID")
})
I want to iterate over the above map dfTypesMap and get the corresponding value type. Is there any way to convert the column types to general types like below?
StringType --> String
MapType(StringType,IntegerType,false) ---> Map[String,Int]
MapType(StringType,StringType,true) ---> Map[String,String]
ArrayType(StringType,true) ---> List[String]
As mentioned, Datasets make it easier to work with types.
A Dataset is basically a collection of strongly-typed JVM objects.
You can map your data to case classes like so:
case class Foo(PRODUCT_ID: String, PRODUCT_NAME: String)
val ds: Dataset[Foo] = df.as[Foo]
Then you can safely operate on your typed objects. In your case you could do:
ds.foreach(foo => {
  val prdValue = foo.PRODUCT_ID
})
For more on Datasets, check out
https://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets
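If you still need the runtime mapping from the question (StringType to String, MapType to Map[...], and so on) rather than switching to Datasets, one possible sketch is to pattern match on the DataType objects from df.schema (toScalaType is a hypothetical helper, not part of the answer above):

import org.apache.spark.sql.types._

// Sketch: render a Spark SQL DataType as a Scala type description
def toScalaType(dt: DataType): String = dt match {
  case StringType         => "String"
  case IntegerType        => "Int"
  case MapType(kt, vt, _) => s"Map[${toScalaType(kt)},${toScalaType(vt)}]"
  case ArrayType(et, _)   => s"List[${toScalaType(et)}]"
  case other              => other.simpleString // fall back to Spark's own name
}

// df.dtypes gives the types as strings; df.schema gives the DataType objects used here
df.schema.fields.foreach(f => println(s"${f.name} -> ${toScalaType(f.dataType)}"))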

How to efficiently delete subset in spark RDD

While conducting research, I find it somewhat difficult to delete all the subsets in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since mike's set (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will leave me with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
It is easy to implement locally in Python with two "for" loops. But when I want to extend this to the cloud with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can escape comparing each element to each other (the equivalent of a double loop in a non-distributed algorithm). The subset operation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter", Set(1,2,3)), ("olga", Set(1,2,3)), ("mike", Set(1,3)), ("anne", Set(7)),
               ("jack", Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, olga, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect {
  case ((name1, set1), (name2, set2)) if (name1 != name2) && (set2.subsetOf(set1)) && (set1 -- set2).nonEmpty => (name2, set2)
}
val superset = userSet.subtract(subsetMembers)
// let's see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((olga,Set(1, 2, 3)), (peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can be achieved by using the RDD.fold function.
In this case the output required is a "List" (ItemList) of superset items. For this, the input should also be converted to a "List" (an RDD of ItemList).
import org.apache.spark.rdd.RDD

// type aliases for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]

// Source RDD
val lst: RDD[Item] = sc.parallelize(List(("peter", Set(1,2,3)), ("mike", Set(1,3)), ("jack", Set(5))))

// Convert each element to a List. This is needed for using the fold function on the RDD,
// since the data type of the parameters must be the same as the output data type of the fold function.
val listOflst: RDD[ItemList] = lst.map(x => List(x))

// For each element in the second ItemList:
// - if it is not a subset of any element in the first ItemList, add it
// - remove from the first ItemList any element that is a subset of the newly added element
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}

listOflst.fold(List())(combiner)
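On the sample RDD above this should leave only ("peter", Set(1,2,3)) and ("jack", Set(5)); "mike" is dropped because its set is a subset of peter's (the order of elements in the result may vary).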
You can use filter after a map.
You can write a map function that returns None for the lines you want to delete. First build the function:
def filter_mike(line):
    if line[1] != {1, 3}:  # keep anything that is not mike's set
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x is not None)
This will work
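In Scala, which the question uses, the same map-then-filter idea collapses into a single filter. A sketch, assuming the RDD is called yourRdd and has type RDD[(String, Set[Int])]:

// keep every pair whose set is not the one to delete
val withoutMike = yourRdd.filter { case (_, s) => s != Set(1, 3) }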

Scala iterator on pattern match

I need help iterating this piece of code written in Spark-Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a dataframe, the function casts the column if there is a pattern match; otherwise it selects all fields as they are.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);

val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a dataframe, an Array[String] of columns (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same dataframe with the right cast at the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    case `colName` => df(colName).cast(typeName).as(colName)
    case other => df(other)
  }
  df.select(colsCasted: _*)
}

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  colsWithTypes.foldLeft(df) { (newDf, colAndType) => castOnce(newDf, colAndType._1, colAndType._2) }
}
When you have a function that you just need to apply many times to the same thing, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the dataframe each time, and then applying the next pair to the resulting dataframe, and so on.
Based on your edit I filled in the function above. I don't have a compiler so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map of column name to the desired type.
Then, for each column in the dataframe, it looks the type up in the map.
If a type exists, we cast the column to that new type. If the column does not exist in the map, then we keep the column from the DataFrame directly.
We then select these columns from the DataFrame.
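For example, applied to the dataframe from the question, the call might look like this (a sketch):

val casted = castMany(df, Array("id_vehicle", "id_size"), Array("int", "int"))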

Why does Scala type inference fail in one case but not the other?

Background: I'm using net.liftweb.record with MongoDB to access a database. At some point, I was in need of drawing a table of a collection of documents from the database (and render them as an ASCII table). I ran into very obscure type inference issues which are very easy to solve but nevertheless made me want to understand why they were happening.
Reproduction: For simplicity, I've reduced the code to (what I think is) an absolute minimum, so that it only depends on net.liftweb.record and none of the Mongo specific types. I've kept the real-life body of the function under question to make the example more realistic.
makeTable takes some apples, and some functions that map apples to columns. Columns can either be mapped to a real field on the apples, or a dynamically computed value (with a name). To be able to mix the two (real fields and dynamic values) in a single Seq, I defined a structural type Col.
To see how the code (below) behaves, try the following variants of the cols parameter to makeTable:
// OK:
cols = Seq(_.isDone)
cols = Seq(job => dynCol1)
cols = Seq(job => dynCol1, job => dynCol2)
// ERROR: found: Seq[Job => Object], required: Seq[Job => Test.Col]
cols = Seq(_.isDone, job => dynCol1)
cols = Seq(_.isDone, job => dynCol2)
cols = Seq(_.isDone, job => dynCol1, job => dynCol2)
...so whenever _.isDone (i.e. the column that maps to a physical field) is mixed with any other "flavor" of column, the error occurs (CASE 1). Alone it behaves well; other flavors of column also behave well when alone or mixed (CASE 2).
Intuitive workaround: marking cols as Seq[Job => Col] ALWAYS fixes the error.
Counter-intuitive workaround: explicitly marking any of the return values of the functions in the Seq as Col, or any of the functions as Job => Col, solves the issue.
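Concretely, against the call at the bottom of the code below, the workarounds look roughly like this (a sketch based on the descriptions above):

// Intuitive workaround: give the whole Seq an explicit element type
makeTable(jobs)(Seq[Job => Col](_.isDone, job => dynCol1, job => dynCol2))

// Counter-intuitive workaround: annotate a single return value as Col ...
makeTable(jobs)(Seq(_.isDone, job => (dynCol1: Col), job => dynCol2))

// ... or a single function as Job => Col
makeTable(jobs)(Seq(((job: Job) => job.isDone): Job => Col, job => dynCol1, job => dynCol2))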
The code:
import net.liftweb.record.{ Record, MetaRecord }
import net.liftweb.record.field.IntField
import scala.language.reflectiveCalls
class Job extends Record[Job] {
  def meta = Job
  object isDone extends IntField(this)
}
object Job extends Job with MetaRecord[Job]
object Test extends App {
type Col = { def name: String; def get: Any }
def makeTable[T](xs: Seq[T])(cols: Seq[T => Col]) = {
  assert(xs.size >= 1)
  val rows = xs map { x => cols map { _(x).get } }
  val header = cols map { _(xs.head).name }
  (header +: rows)
}
val dynCol1 = new { def name = "dyncol1"; def get = "dyn1" }
val dynCol2 = new { def name = "dyncol2"; def get = "dyn2" }
val jobs = Seq(Job.createRecord, Job.createRecord)
makeTable(jobs)(Seq(
_.isDone,
job => dynCol1,
job => dynCol2
))
}
P.S. I'm not adding a lift or lift-record tag because I think this is not related to Lift and is simply a Scala question triggered by what happens to be a Lift-specific situation. Feel free to correct me if I'm wrong.