Hashing multiple columns of a Spark DataFrame - Scala

I need to hash specific columns of a Spark DataFrame. Some columns have specific data types which are basically extensions of Spark's standard DataType class. The problem is that, for some reason, some conditions inside the when case don't work as expected.
As the hashing configuration I have a map. Let's call it tableConfig:
val tableConfig = Map("a" -> "KEEP", "b" -> "HASH", "c" -> "KEEP", "d" -> "HASH", "e" -> "KEEP")
The salt variable is concatenated with the column value before hashing:
val salt = "abc"
The function for hashing looks like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedColumns = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedColumns = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedColumns: _*)
  val colTypes = cleanedDF.dtypes.toMap

  def typeFromString(s: String): DataType = s match {
    case "StringType" => StringType
    case "BooleanType" => BooleanType
    case "IntegerType" => IntegerType
    case "DateType" => DateType
    case "ShortType" => ShortType
    case "DecimalType(15,7)" => DecimalType(15,7)
    case "DecimalType(18,2)" => DecimalType(18,2)
    case "DecimalType(11,7)" => DecimalType(11,7)
    case "DecimalType(17,2)" => DecimalType(17,2)
    case "DecimalType(38,2)" => DecimalType(38,2)
    case _ => throw new TypeNotPresentException(
      "Please check types in the dataframe. The following column type is missing: ".concat(s), null
    )
  }

  val getType = colTypes.map { case (k, _) => (k, typeFromString(colTypes(k))) }

  val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
    memoDF.withColumn(
      colName,
      when(col(colName).isin(hashedColumns: _*) && col(colName).isNull, null).
        when(col(colName).isin(hashedColumns: _*) && col(colName).isNotNull,
          sha2(concat(col(colName), lit(salt)), 256)).otherwise(col(colName))
    )
  }
  hashedDF
}
I am getting an error regarding a specific column. Namely, the error is the following:
org.apache.spark.sql.AnalysisException: cannot resolve '(c IN ('a', 'b', 'd', 'e'))' due to data type mismatch: Arguments must be same type but were: boolean != string;;
The column names have been changed.
My search didn't give any clear explanation of why the isin or isNull functions don't work as expected. I also follow a specific style of implementation and want to avoid the following approaches:
1) No UDFs. They are painful for me.
2) No for loops over the Spark DataFrame columns. The data could contain more than a billion samples, and that would be a headache in terms of performance.

As mentioned in the comments, the first fix should be to remove the condition col(colName).isin(hashedColumns: _*) && col(colName).isNull, since this check will always be false.
As for the error, it is caused by a mismatch between the value type of col(colName) and hashedColumns. The values of hashedColumns are always strings, therefore col(colName) should be a string as well, but in your case it seems to be a Boolean.
The last issue that I see here has to do with the logic of the foldLeft. If I understood correctly, what you want to achieve is to go through the columns and apply sha2 to those that exist in hashedColumns. To achieve that you must change your code to:
// 1st change: convert each element of hashedColumns from String to a Spark column
val hashArray = hashedColumns.map(lit(_))

val hashedDF = cleanedDF.columns.foldLeft(cleanedDF) { (memoDF, colName) =>
  memoDF.withColumn(
    colName,
    // 2nd change: check whether colName is one of "a", "b", "c", "d", etc; if so apply sha2, otherwise leave the value as it is
    when(col(colName).isNotNull && array_contains(array(hashArray: _*), lit(colName)),
      sha2(concat(col(colName), lit(salt)), 256)
    ).otherwise(col(colName))
  )
}
UPDATE:
Iterating through all the columns via foldLeft isn't efficient and adds extra overhead, even more so when you have a large number of columns (see the discussion with @baitmbarek below), so I added one more approach that uses a single select instead of foldLeft. In the next code, when is applied only to the hashedColumns. We separate the columns into nonHashedCols and transformedCols, then concatenate the two lists and pass them to the select:
val transformedCols = hashedColumns.map { c =>
  when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
}
val nonHashedCols = (cleanedDF.columns.toSet -- hashedColumns.toSet).map(col(_)).toList

cleanedDF.select((nonHashedCols ++ transformedCols): _*)

(Posted a solution on behalf of the question author, to move it from the question to the answer section).
@AlexandrosBiratsis gave a very good solution in terms of performance and elegance of implementation. So the hashColumns function will look like this:
def hashColumns(tableConfig: Map[String, String], salt: String, df: DataFrame): DataFrame = {
  val removedCols = tableConfig.filter(_._2 == "REMOVE").keys.toList
  val hashedCols = tableConfig.filter(_._2 == "HASH").keys.toList
  val cleanedDF = df.drop(removedCols: _*)
  val transformedCols = hashedCols.map { c =>
    when(col(c).isNotNull, sha2(concat(col(c), lit(salt)), 256)).as(c)
  }
  val nonHashedCols = (cleanedDF.columns.toSet -- hashedCols.toSet).map(col(_)).toList
  val hashedDF = cleanedDF.select((nonHashedCols ++ transformedCols): _*)
  hashedDF
}
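For illustration, here is a minimal usage sketch. The sample DataFrame below is hypothetical; it assumes a SparkSession in scope as spark, plus the tableConfig and salt defined above:
// hypothetical usage sketch
import spark.implicits._

val df = Seq(
  ("u1", "secret", "x", "pwd1", "keep"),
  ("u2", null, "y", "pwd2", "keep")
).toDF("a", "b", "c", "d", "e")

val result = hashColumns(tableConfig, salt, df)
result.show(false)
// b and d now hold SHA-256 hex strings (or null where the input was null); a, c and e are untouched
Note that the select lists the non-hashed columns first, so the output column order can differ from the input order.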

Related

Scala iterator on pattern match

I need help iterating this piece of code, written in Spark/Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a DataFrame, the function casts a column if there is a pattern match, otherwise it selects all fields as they are.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);
val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice, I need a function that, given a DataFrame, an Array[String] of columns (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same DataFrame with the right cast applied at the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    case `colName` => df(colName).cast(typeName).as(colName) // backticks match the value of colName instead of binding a new variable
    case other => df(other)
  }
  df.select(colsCasted: _*)
}

def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  // foldLeft passes the accumulated DataFrame first and the current (name, type) pair second
  colsWithTypes.foldLeft(df)((newDf, cAndType) => castOnce(newDf, cAndType._1, cAndType._2))
}
When you have a function that you just need to apply many times to the same thing, a fold is often what you want.
The code above zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the DataFrame each time and then applying the next pair to the resulting DataFrame, and so on.
Based on your edit I filled in the function above. I don't have a compiler at hand, so I'm not 100% sure it's correct. Having written it out, I am also left questioning my original approach. Below is a better way, I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the DataFrame, it looks the type up in the map.
If an entry exists, we cast the column to that new type; if the column is not in the map, we take the column from the DataFrame directly.
We then select these columns from the DataFrame.
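For illustration, a hypothetical call against the question's DataFrame (the chosen columns and target types are only examples):
val casted = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))
casted.printSchema() // id_vehicle is cast to IntegerType, id_size to DoubleType, the remaining columns are unchanged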

Type mismatch in Scala case match

Trying to create multiple DataFrames in a single foreach, using Spark, as below.
I get the values delivery and click out of row.getAs("type") when I try to print them.
val check = eachrec.foreach(recrd => recrd.map(row => {
  row.getAs("type") match {
    case "delivery" => val delivery_data = delivery(row.get(0).toString, row.get(1).toString)
    case "click" => val click_data = delivery(row.get(0).toString, row.get(1).toString)
    case _ => "not sure if this impacts"
  }
}))
but I'm getting the error below:
Error:(41, 14) type mismatch; found : String("delivery") required: Nothing
case "delivery" => val delivery_data = delivery(row.get(0).toString,row.get(1).toString)
^
My plan is to create DataFrames using toDF() once I have created these individual delivery objects, referenced by delivery_data and click_data, via
delivery_data.toDF() and click_data.toDF().
Please provide any clue regarding the error above (in the match case).
How can I create two DataFrames using toDF() inside val check?
The val declarations make your first two cases return Unit, but in the third case you return a String.
For instance, here the type of z was inferred by the compiler as Unit:
def x = {
  val z: Unit = 3 match {
    case 2 => val a = 2
    case _ => val b = 3
  }
}
I think you need to cast the value you are matching on to String:
row.getAs("type").toString

How to skip entries that don't match a pattern when constructing an RDD

I'm doing some pattern matching on an RDD and would like to select only those rows/records matching a pattern. Here is what I currently have:
val idPattern = """Id="([^"]*)""".r.unanchored
val typePattern = """PostTypeId="([^"]*)""".r.unanchored
val datePattern = """CreationDate="([^"]*)""".r.unanchored
val tagPattern = """Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.map { line =>
  val id = line match { case idPattern(x) => x }
  val typeId = line match { case typePattern(x) => x }
  val date = line match { case datePattern(x) => x }
  val tags = line match { case tagPattern(x) => x }
  Post(Some(id), Some(typeId), Some(date), Some(tags))
}
case class Post(Id: Option[String], Type: Option[String], CreationDate: Option[String], Tags: Option[String])
I'm only interested in rows/records that match all the patterns (that is, records that have all four of those fields). How can I skip the rows/records that don't satisfy my requirement?
You can use RDD.collect(f: PartialFunction[T, U]) to do the filtering and mapping in one step.
For example, if you know the order of these fields in your input, you can merge the regular expressions into one and use a single case:
val pattern = """Id="([^"]*).*PostTypeId="([^"]*).*CreationDate="([^"]*).*Tags="([^"]*)""".r.unanchored
val projectedPostsAnswers = postsAnswers.collect {
  case pattern(id, typeId, date, tags) => Post(Some(id), Some(typeId), Some(date), Some(tags))
}
The returned RDD[Post] will only contain records for which this case matched. Notice that this collect(PartialFunction) has nothing to do with the parameterless collect(): it does not collect the entire data set into driver memory.
Use the filter API.
val projectedPostAnswers = postsAnswers.filter(line => f(line)).map{....
Create a function f that does your data cleansing for you. Make sure the function returns true or false, as that's what filter uses to decide whether or not to pass the record along.
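A sketch of that approach, reusing the four unanchored patterns and the Post case class from the question (f simply requires every pattern to match, so the extractions that follow cannot fail):
// keep only lines where all four patterns match, then extract the fields
def f(line: String): Boolean =
  idPattern.findFirstIn(line).isDefined &&
    typePattern.findFirstIn(line).isDefined &&
    datePattern.findFirstIn(line).isDefined &&
    tagPattern.findFirstIn(line).isDefined

val projectedPostAnswers = postsAnswers.filter(f).map { line =>
  val id = line match { case idPattern(x) => x }
  val typeId = line match { case typePattern(x) => x }
  val date = line match { case datePattern(x) => x }
  val tags = line match { case tagPattern(x) => x }
  Post(Some(id), Some(typeId), Some(date), Some(tags))
}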

Access tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String), and since they aren't unique I want to see how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me tuples as output of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations, and I essentially want to run word count on this PairCount variable, so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple, and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap{case((x,y),n) => Seq[x->n]}.toMap
for each unique x I get x -> 1.
For example, the above line of code generates mom->1, dad->1 even if the tuples out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)

// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)

val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)

// count in how many pairs each word occurs, across keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code,
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String, String) RDD.

How to create a List of Wildcard elements Scala

I'm trying to write a function that returns a list (for querying purposes) that has some wildcard elements:
def createPattern(query: List[(String,String)]) = {
  val l = List[(_,_,_,_,_,_,_)]
  var iter = query
  while (iter != null) {
    val x = iter.head._1 match {
      case "userId" => 0
      case "userName" => 1
      case "email" => 2
      case "userPassword" => 3
      case "creationDate" => 4
      case "lastLoginDate" => 5
      case "removed" => 6
    }
    l(x) = iter.head._2
    iter = iter.tail
  }
  l
}
So, the user enters some query terms as a list. The function parses through these terms and inserts them into val l. The fields that the user doesn't specify are entered as wildcards.
The val l is causing me trouble. Am I going down the right route, or are there better ways to do this?
Thanks!
Gosh, where to start. I'd begin by getting an IDE (IntelliJ / Eclipse) which will tell you when you're writing nonsense and why.
Read up on how List works. It's an immutable linked list so your attempts to update by index are very misguided.
Don't use tuples - use case classes.
You shouldn't ever need to use null, and I guess here you mean Nil.
Don't use var and while; use a for-expression, or the relevant higher-order functions (foreach, map, etc.).
Your code doesn't make much sense as it is, but it seems you're trying to return a 7-element list with the second element of each tuple in the input list mapped via a lookup to position in the output list.
To improve it... don't do that. What you're doing (as programmers have done since arrays were invented) is using the index as a crude proxy for a Map from Int to whatever. What you want is an actual Map. I don't know what you want to do with it, but wouldn't it be nicer if it were keyed by these key strings themselves, rather than by a number? If so, you can simplify your whole method to
def createPattern(query: List[(String,String)]) = query.toMap
at which point you should realise you probably don't need the method at all, since you can just use toMap at the call site.
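For example, with a hypothetical query list:
val query = List("userId" -> "Fred", "email" -> "fred@example.com")
val pattern = query.toMap // Map(userId -> Fred, email -> fred@example.com)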
If you insist on using an Int index, you could write
def createPattern(query: List[(String, String)]) = {
  def intVal(x: String) = x match {
    case "userId" => 0
    case "userName" => 1
    case "email" => 2
    case "userPassword" => 3
    case "creationDate" => 4
    case "lastLoginDate" => 5
    case "removed" => 6
  }
  val tuples = for ((key, value) <- query) yield (intVal(key), value)
  tuples.toMap
}
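For example, with a hypothetical query list, each key is mapped to its numeric slot:
createPattern(List("userId" -> "Fred", "email" -> "fred@example.com"))
// => Map(0 -> "Fred", 2 -> "fred@example.com")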
Not sure what you want to do with the resulting list, but you can't create a List of wildcards like that.
What do you want to do with the resulting list, and what type should it be?
Here's how you might build something if you wanted the result to be a List[String], and if you wanted wildcards to be "*":
def createPattern(query: List[(String, String)]) = {
  val wildcard = "*"
  // note: the lookup key must match the query key exactly ("userId", not "userID")
  def orElseWildcard(key: String) = query.find(_._1 == key).getOrElse("" -> wildcard)._2
  orElseWildcard("userId") ::
  orElseWildcard("userName") ::
  orElseWildcard("email") ::
  orElseWildcard("userPassword") ::
  orElseWildcard("creationDate") ::
  orElseWildcard("lastLoginDate") ::
  orElseWildcard("removed") ::
  Nil
}
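For example, with a hypothetical query list, unspecified fields fall back to "*":
createPattern(List("userId" -> "Fred", "email" -> "fred@example.com"))
// => List("Fred", "*", "fred@example.com", "*", "*", "*", "*")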
You're not using List, tuples, iterators, or wildcards correctly.
I'd take a different approach - maybe something like this:
case class Pattern(valueMap: Map[String, String]) {
  def this(valueList: List[(String, String)]) = this(valueList.toMap)

  val Seq(
    userId, userName, email, userPassword, creationDate,
    lastLoginDate, removed
  ): Seq[Option[String]] = Seq(
    "userId", "userName", "email", "userPassword", "creationDate",
    "lastLoginDate", "removed"
  ).map(valueMap.get(_))
}
Then you can do something like this:
scala> val pattern = new Pattern( List( "userId" -> "Fred" ) )
pattern: Pattern = Pattern(Map(userId -> Fred))
scala> pattern.email
res2: Option[String] = None
scala> pattern.userId
res3: Option[String] = Some(Fred)
Or just use the map directly.
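For example, reading the underlying map with "*" as a hypothetical wildcard default:
pattern.valueMap.getOrElse("email", "*") // "*"
pattern.valueMap.getOrElse("userId", "*") // "Fred"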