How to mask columns using Spark 2? - scala

I have some tables in which I need to mask some columns. The columns to be masked vary from table to table, and I read their names from an application.conf file.
For example, for employee table as shown below
+----+------+-----+---------+
| id | name | age | address |
+----+------+-----+---------+
| 1 | abcd | 21 | India |
+----+------+-----+---------+
| 2 | qazx | 42 | Germany |
+----+------+-----+---------+
If we want to mask the name and age columns, I get these columns in a sequence:
val mask = Seq("name", "age")
Expected values after masking are:
+----+----------------+----------------+---------+
| id | name | age | address |
+----+----------------+----------------+---------+
| 1 | *** Masked *** | *** Masked *** | India |
+----+----------------+----------------+---------+
| 2 | *** Masked *** | *** Masked *** | Germany |
+----+----------------+----------------+---------+
If I have the employee table as a data frame, what is the way to mask these columns?
Similarly, if I have the payment table shown below and want to mask the name and salary columns, I get the columns to mask as the sequence shown after the table:
+----+------+--------+----------+
| id | name | salary | tax_code |
+----+------+--------+----------+
| 1 | abcd | 12345 | KT10 |
+----+------+--------+----------+
| 2 | qazx | 98765 | AD12d |
+----+------+--------+----------+
val mask = Seq("name", "salary")
I tried something like this: mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) ) but it did not return anything.
Thanks to @philantrovert, I found the solution. Here is what I used:
def maskData(base: DataFrame, maskColumns: Seq[String]) = {
val maskExpr = base.columns.map { col => if(maskColumns.contains(col)) s"'*** Masked ***' as ${col}" else col }
base.selectExpr(maskExpr: _*)
}

The simplest and fastest way would be to use withColumn and simply overwrite the values in the columns with "*** Masked ***". Using your small example dataframe
val df = spark.sparkContext.parallelize( Seq (
(1, "abcd", 12345, "KT10" ),
(2, "qazx", 98765, "AD12d")
)).toDF("id", "name", "salary", "tax_code")
If you have a small number of columns to be masked, with known names, then you can simply do:
val mask = Seq("name", "salary")
df.withColumn("name", lit("*** Masked ***"))
.withColumn("salary", lit("*** Masked ***"))
Otherwise, you need to create a loop:
var df2 = df
for (col <- mask){
df2 = df2.withColumn(col, lit("*** Masked ***"))
}
Both these approaches will give you a result like this:
+---+--------------+--------------+--------+
| id| name| salary|tax_code|
+---+--------------+--------------+--------+
| 1|*** Masked ***|*** Masked ***| KT10|
| 2|*** Masked ***|*** Masked ***| AD12d|
+---+--------------+--------------+--------+
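The mutable loop above can also be written as a fold, which avoids the var. The following is a minimal sketch, not a definitive implementation: the function name maskWithFold, the local SparkSession, and the decision to skip names that are missing from the DataFrame are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical helper: overwrite every listed column with a constant string.
// Names that are not present in the DataFrame are skipped, because
// withColumn would otherwise add them as brand-new columns.
def maskWithFold(df: DataFrame, toMask: Seq[String]): DataFrame =
  toMask
    .filter(name => df.columns.contains(name))
    .foldLeft(df)((acc, name) => acc.withColumn(name, lit("*** Masked ***")))

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("mask-sketch").getOrCreate()
import spark.implicits._

val payments = Seq(
  (1, "abcd", 12345, "KT10"),
  (2, "qazx", 98765, "AD12d")
).toDF("id", "name", "salary", "tax_code")

val masked = maskWithFold(payments, Seq("name", "salary"))
masked.show()
```

Note that a masked numeric column such as salary becomes a string column, since the replacement value is a string literal.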

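Please check the code below. The key is the udf function.
import org.apache.spark.sql.functions.udf
// ss is the SparkSession; the $"..." syntax needs import ss.implicits._
val df = ss.sparkContext.parallelize(Seq(
  ("c1", "JAN-2017", 49),
  ("c1", "MAR-2017", 83)
)).toDF("city", "month", "sales")
df.show()
// the UDF ignores its input and returns the mask for every row
val mask = udf((s: String) => "*** Masked ***")
df.withColumn("city", mask($"city")).show()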

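Your statement
mask.foreach(c => base.withColumn(c, regexp_replace(col(c), "^.*?$", "*** Masked ***" ) ) )
returns Unit. DataFrames are immutable, so each withColumn call builds a new DataFrame that foreach immediately discards, and base itself never changes. (With map instead of foreach you would get a Seq[org.apache.spark.sql.DataFrame], one per masked column, which is not what you want either.)
You can use selectExpr and generate your regexp_replace expressions like this: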
base.show
+---+----+-----+-------+
| id|name| age|address|
+---+----+-----+-------+
| 1|abcd|12345| KT10 |
| 2|qazx|98765| AD12d|
+---+----+-----+-------+
val mask = Seq("name", "age")
val expr = base.columns.map { col =>
  if (mask.contains(col)) s"""regexp_replace(${col}, "^.*", "** Masked **" ) as ${col}"""
  else col
}
This will generate an expression with regexp_replace for the columns that are present in the Sequence mask
Array[String] = Array(id, regexp_replace(name, "^.*", "** Masked **" ) as name, regexp_replace(age, "^.*", "** Masked **" ) as age, address)
Now you can use selectExpr on the generated Sequence
base.selectExpr(expr: _*).show
+---+------------+------------+-------+
| id| name| age|address|
+---+------------+------------+-------+
| 1|** Masked **|** Masked **| KT10 |
| 2|** Masked **|** Masked **| AD12d|
+---+------------+------------+-------+

Related

How to find similar rows by matching column values spark?

So I have a data set like
{"customer":"customer-1","attributes":{"att-a":"att-a-7","att-b":"att-b-3","att-c":"att-c-10","att-d":"att-d-10","att-e":"att-e-15","att-f":"att-f-11","att-g":"att-g-2","att-h":"att-h-7","att-i":"att-i-5","att-j":"att-j-14"}}
{"customer":"customer-2","attributes":{"att-a":"att-a-9","att-b":"att-b-7","att-c":"att-c-12","att-d":"att-d-4","att-e":"att-e-10","att-f":"att-f-4","att-g":"att-g-13","att-h":"att-h-4","att-i":"att-i-1","att-j":"att-j-13"}}
{"customer":"customer-3","attributes":{"att-a":"att-a-10","att-b":"att-b-6","att-c":"att-c-1","att-d":"att-d-1","att-e":"att-e-13","att-f":"att-f-12","att-g":"att-g-9","att-h":"att-h-6","att-i":"att-i-7","att-j":"att-j-4"}}
{"customer":"customer-4","attributes":{"att-a":"att-a-9","att-b":"att-b-14","att-c":"att-c-7","att-d":"att-d-4","att-e":"att-e-8","att-f":"att-f-7","att-g":"att-g-14","att-h":"att-h-9","att-i":"att-i-13","att-j":"att-j-3"}}
I have flattened the data in the DF like this
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a| att-b| att-c| att-d| att-e| att-f| att-g| att-h| att-i| att-j| customer|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+
| att-a-7| att-b-3|att-c-10|att-d-10|att-e-15|att-f-11| att-g-2| att-h-7| att-i-5|att-j-14| customer-1|
| att-a-9| att-b-7|att-c-12| att-d-4|att-e-10| att-f-4|att-g-13| att-h-4| att-i-1|att-j-13| customer-2|
I want to complete the compareColumns function, which compares the columns of the two DataFrames (userDF and flattenedDF) and returns a new DF like the sample output.
How can I do that? For example, compare each row and column in flattenedDF with userDF and increment a count whenever they match, e.g. att-a with att-a, att-b with att-b.
def getCustomer(customerID: String)(dataFrame: DataFrame): DataFrame = {
dataFrame.filter($"customer" === customerID).toDF()
}
def compareColumns(customerID: String)(dataFrame: DataFrame): DataFrame = {
val userDF = dataFrame.transform(getCustomer(customerID))
userDF.printSchema()
userDF
}
Sample Output:
+-------------+------------------+
| customer    | similarity_score |
+-------------+------------------+
| customer-1  | -1               |  (same as the reference customer, so -1 marks it to be ignored)
| customer-12 | 2                |
| customer-3  | 2                |
| customer-44 | 5                |
| customer-5  | 1                |
| customer-6  | 10               |
Thanks
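One way to approach the column-by-column comparison described in the question is to turn each attribute into a 0/1 indicator and sum the indicators. This is only a sketch under assumptions: the function name similarityScores is hypothetical, exactly one row is assumed to exist for the reference customer, all attribute columns are assumed to be strings, and at least one attribute column must be present.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper: count, per row, how many attribute columns equal
// the values of the reference customer's row.
def similarityScores(customerID: String)(flattened: DataFrame): DataFrame = {
  val attributeCols = flattened.columns.filter(_ != "customer")
  // Assumes exactly one matching row; head() throws if there is none.
  val reference = flattened.filter(col("customer") === customerID).head()

  // One 0/1 indicator per attribute, summed into a single score expression.
  val matches = attributeCols
    .map(c => when(col(c) === lit(reference.getAs[String](c)), 1).otherwise(0))
    .reduce(_ + _)

  flattened.select(
    col("customer"),
    // The reference customer itself gets -1, as in the sample output.
    when(col("customer") === customerID, -1).otherwise(matches).as("similarity_score")
  )
}

// Local session and tiny made-up data set for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("similarity-sketch").getOrCreate()
import spark.implicits._

val flattenedDF = Seq(
  ("att-a-1", "att-b-1", "customer-1"),
  ("att-a-1", "att-b-2", "customer-2"),
  ("att-a-3", "att-b-3", "customer-3")
).toDF("att-a", "att-b", "customer")

val scores = flattenedDF.transform(similarityScores("customer-1"))
scores.show()
```

The reference row is collected to the driver once, so the comparison itself runs as a single narrow projection without a join.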

How do I explode a nested Struct in Spark using Scala

I am creating a dataframe using
val snDump = table_raw
.applyMapping(mappings = Seq(
("event_id", "string", "eventid", "string"),
("lot-number", "string", "lotnumber", "string"),
("serial-number", "string", "serialnumber", "string"),
("event-time", "bigint", "eventtime", "bigint"),
("companyid", "string", "companyid", "string")),
caseSensitive = false, transformationContext = "sn")
.toDF()
.groupBy(col("eventid"), col("lotnumber"), col("companyid"))
.agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))
.createOrReplaceTempView("sn")
I have data like this in the df
eventid | lotnumber | companyid | snetlist
123 | 4q22 | tu56ff | [[12345,67438]]
456 | 4q22 | tu56ff | [[12346,67434]]
258 | 4q22 | tu56ff | [[12347,67455], [12333,67455]]
999 | 4q22 | tu56ff | [[12348,67459]]
I want to explode it and put the data into two columns in my table. This is what I am doing:
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("serialN"), explode(col("snetlist")).alias("eventT"), col("companyid"))
Also tried
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), col($"snetlist.serialnumber").alias("serialN"), col($"snetlist.eventtime").alias("eventT"), col("companyid"))
but it turns out that explode can only be used once per select, and I get an error. How do I use explode (or something else) to achieve the output below?
eventid | lotnumber | companyid | serialN | eventT |
123 | 4q22 | tu56ff | 12345 | 67438 |
456 | 4q22 | tu56ff | 12346 | 67434 |
258 | 4q22 | tu56ff | 12347 | 67455 |
258 | 4q22 | tu56ff | 12333 | 67455 |
999 | 4q22 | tu56ff | 12348 | 67459 |
I have looked at a lot of Stack Overflow threads, but none of them helped me. It is possible that this question has already been answered, but my understanding of Scala is limited, which may have kept me from understanding the answer. If this is a duplicate, could someone direct me to the correct answer? Any help is appreciated.
First, explode the array into a temporary struct column, then unpack it:
val serialNumberEvents = snDump
  .withColumn("tmp", explode(col("snetlist")))
  .select(
    col("eventid"),
    col("lotnumber"),
    col("companyid"),
    // unpack struct
    col("tmp.serialnumber").as("serialN"),
    col("tmp.eventtime").as("eventT")
  )
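For readers who want to try the explode-then-unpack pattern without the original source table, here is a self-contained sketch. The flat input rows and the local SparkSession are made up for illustration; the aggregation mirrors the collect_list(struct(...)) call from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical flat rows standing in for the mapped source table.
val flat = Seq(
  ("123", "4q22", "tu56ff", "12345", 67438L),
  ("258", "4q22", "tu56ff", "12347", 67455L),
  ("258", "4q22", "tu56ff", "12333", 67455L)
).toDF("eventid", "lotnumber", "companyid", "serialnumber", "eventtime")

// Same aggregation as in the question: one array of
// (serialnumber, eventtime) structs per group.
val grouped = flat
  .groupBy(col("eventid"), col("lotnumber"), col("companyid"))
  .agg(collect_list(struct("serialnumber", "eventtime")).alias("snetlist"))

// A single explode yields one struct per output row;
// both fields are then read from that struct.
val unpacked = grouped
  .withColumn("tmp", explode(col("snetlist")))
  .select(
    col("eventid"), col("lotnumber"), col("companyid"),
    col("tmp.serialnumber").as("serialN"),
    col("tmp.eventtime").as("eventT")
  )

unpacked.show()
```

Because both output columns come from the same exploded struct, the serial number and event time of each pair stay on the same row.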
The trick is to pack the columns you want to explode in an array (or struct), use explode on the array and then unpack them.
val col_names = Seq("eventid", "lotnumber", "companyid", "snetlist")
val data = Seq(
(123, "4q22", "tu56ff", Seq(Seq(12345,67438))),
(456, "4q22", "tu56ff", Seq(Seq(12346,67434))),
(258, "4q22", "tu56ff", Seq(Seq(12347,67455), Seq(12333,67455))),
(999, "4q22", "tu56ff", Seq(Seq(12348,67459)))
)
val snDump = spark.createDataFrame(data).toDF(col_names: _*)
val serialNumberEvents = snDump.select(col("eventid"), col("lotnumber"), explode(col("snetlist")).alias("snetlist"), col("companyid"))
val exploded = serialNumberEvents.select($"eventid", $"lotnumber", $"snetlist".getItem(0).alias("serialN"), $"snetlist".getItem(1).alias("eventT"), $"companyid")
exploded.show()
Note that my snetlist has the schema Array(Array) rather than Array(Struct). You can get the same shape by creating an array instead of a struct out of your columns.
Another approach, if needing to explode twice, is as follows - for another example, but to demonstrate the point:
val flattened2 = df.select($"director", explode($"films.actors").as("actors_flat"))
val flattened3 = flattened2.select($"director", explode($"actors_flat").as("actors_flattened"))
See Is there an efficient way to join two large Datasets with (deeper) nested array field? for a slightly different context, but same approach applies.
This answer is in response to your assertion that you can only explode once.

Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... } to iterate over each column
Filter summaryDF on Field using col String value
Left join summaryDF onto valuesDF based on current column
withColumn to replace the original character code column from valuesDF with new description column
Assign new DF as a var
Continue loop
However, trying this gave me Cartesian product error (I made sure to define the join as "left").
I tried and failed to pivot summaryDF (as there are no aggregations to do??) then join both dataframes together.
This is the sort of thing I've tried, and always getting a NullPointerException. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because it's small and a "constant" lookup table
summaryBroadcast
  .value
  .foreach { x =>
    // searchValue = Value (e.g. `U`),
    // replaceValue = ValueDescription (e.g. `Unknown`)
    val field = x(0).toString
    val searchValue = x(1).toString
    val replaceValue = x(2).toString
    // error catching, as the summary data does not map exactly onto the field names
    // the joys of business people working in Excel...
    try {
      // I'm using regexp_replace because I'm lazy
      valuesDF = valuesDF
        .withColumn(field, regexp_replace(col(field), searchValue, replaceValue))
    }
    catch { case _: Exception =>
      null
    }
  }
Any ideas? Advice? Thanks.
First, we'll need a function that joins valuesDF with summaryDF by Value and by the respective pair of Favourite* column and Field:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
  sourceDf.as("src") // alias it to help selecting appropriate columns in the result
    // the join
    .join(summaryDF, $"Value" === col(colName) && $"Field" === colName, "left")
    // we do not need the original `Favourite*` column, so drop it
    .drop(colName)
    // select all previous columns, plus the one that contains the match
    .select("src.*", "ValueDesc")
    // rename the resulting column to have the name of the source one
    .withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
In case a value from valuesDF does not match anything in summaryDF, the resulting cell in this solution will contain null. If you just want to replace it with the value Unknown, use the following instead of the .select and .withColumnRenamed lines above:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)

Trim Leading 0's from DataFrame in Scala

I have a Dataframe :
| subcategory | subcategory_label | category |
| 00EEE | 00EEE FFF | Drink |
| 0000EEE | 00EEE FFF | Fruit |
| 0EEE | 000EEE FFF | Meat |
from which I need to remove leading 0's from the columns in Dataframe and need a result like this
| subcategory | subcategory_label | category |
| EEE | EEE FFF | Drink |
| EEE | EEE FFF | Fruit |
| EEE | EEE FFF | Meat |
So far, I am able to remove the leading 0's from one column using
df.withColumn("subcategory", regexp_replace(df("subcategory"), "^0*", "")).show
How to remove the leading 0's from dataframe in one go?
With this as the provided dataframe :
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|0000FFFF |0000EE 000FF |ABC |
+-----------+-----------------+--------+
You can create a regexp_replace for all the columns. Something like :
val regex_all = df.columns.map( c => regexp_replace(col(c), "^0*", "" ).as(c) )
And then, use select since it takes a varargs of type Column :
df.select(regex_all :_* ).show(false)
+-----------+-----------------+--------+
|subcategory|subcategory_label|category|
+-----------+-----------------+--------+
|FFFF |EE 000FF |ABC |
+-----------+-----------------+--------+
EDIT:
Defining a function that returns a Sequence of regexp_replace expressions is straightforward:
/**
 * @param origCols all columns in the DF, pass `df.columns`
 * @param replacedCols `Seq` of columns for which the expression is to be generated
 * @return `Seq[org.apache.spark.sql.Column]` Spark SQL expression
 */
def createRegexReplaceZeroes(origCols: Seq[String], replacedCols: Seq[String]) = {
  origCols.map { c =>
    if (replacedCols.contains(c)) regexp_replace(col(c), "^0*", "").as(c)
    else col(c)
  }
}
This function will return a Seq[org.apache.spark.sql.Column]
Now, store the columns you want to replace in an Array :
val removeZeroes = Array( "subcategory", "subcategory_label" )
And, then call the function with removeZeroes as argument. This will return the regexp_replace statements for the columns available in removeZeroes
df.select( createRegexReplaceZeroes(df.columns, removeZeroes) :_* )
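If the list of columns should not be maintained by hand, the schema can be used to pick the string columns automatically. This is a sketch under assumptions: stripLeadingZeroes is a hypothetical name, the quantity column is made up for the example, and leaving non-string columns untouched is a design choice (regexp_replace on a numeric column would turn it into a string).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: strip leading zeroes from every string column
// and pass all other columns through unchanged.
def stripLeadingZeroes(df: DataFrame): DataFrame = {
  val columns = df.schema.fields.map { field =>
    if (field.dataType == StringType) regexp_replace(col(field.name), "^0*", "").as(field.name)
    else col(field.name)
  }
  df.select(columns: _*)
}

// Local session and sample data for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("zeroes-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("00EEE", "00EEE FFF", 7), ("0000EEE", "000EEE FFF", 10))
  .toDF("subcategory", "subcategory_label", "quantity")

val cleaned = stripLeadingZeroes(df)
cleaned.show()
```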
You can use UDF for doing the same.
I feel it looks more elegant.
scala> val removeLeadingZerosUDF = udf({ x: String => x.replaceAll("^0*", "") })
removeLeadingZerosUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df = Seq( "000012340023", "000123400023", "001234000230", "012340002300", "123400002300" ).toDF("cols")
df: org.apache.spark.sql.DataFrame = [cols: string]
scala> df.show()
+------------+
| cols|
+------------+
|000012340023|
|000123400023|
|001234000230|
|012340002300|
|123400002300|
+------------+
scala> df.withColumn("newCols", removeLeadingZerosUDF($"cols")).show()
+------------+------------+
| cols| newCols|
+------------+------------+
|000012340023| 12340023|
|000123400023| 123400023|
|001234000230| 1234000230|
|012340002300| 12340002300|
|123400002300|123400002300|
+------------+------------+

How to rank the data set having multiple columns in Scala?

I have a data set like this, which I am fetching from a CSV file. How do I store it in Scala so that I can process it?
+-----------+-----------+----------+
| recent | Freq | Monitor |
+-----------+-----------+----------+
| 1 | 1234| 199090|
| 4 | 2553| 198613|
| 6 | 3232 | 199090|
| 1 | 8823 | 498831|
| 7 | 2902 | 890000|
| 8 | 7991 | 081097|
| 9 | 7391 | 432370|
| 12 | 6138 | 864981|
| 7 | 6812 | 749821|
+-----------+-----------+----------+
Actually I need to sort the data and rank it.
I am new to Scala programming.
Thanks
To answer your question: this code reads a CSV file and orders the rows by the third column.
object CSVDemo extends App {
  println("recent, freq, monitor")
  val bufferedSource = io.Source.fromFile("./data.csv")
  val list: Array[Array[String]] = (bufferedSource.getLines map { line => line.split(",").map(_.trim) }).toArray
  val newList = list.sortBy(_(2))
  newList foreach { line => println(line.mkString(" ")) }
  bufferedSource.close
}
You read the file, parse it into an Array[Array[String]], order it by the third column, and print it. Note that sortBy(_(2)) compares the values as strings; use sortBy(_(2).toInt) if you need numeric order.
Here I am using the list and trying to normalize one column at a time and then concatenate them. Is there any other way to iterate column-wise and normalize them? Sorry, my coding is very basic.
// parse the first column as Int once; min/max on the raw strings would compare lexicographically
val col1 = newList.map(line => line.head.toInt)
val mi = col1.min
val ma = col1.max
println("minimum value of first column is: " + mi)
println("maximum value of first column is: " + ma)
// calculate scale for the first column (Double division; Int division would truncate)
val scale = col1.map(x => (x - mi).toDouble / (ma - mi))
println("Here is the normalized range of first column of the data")
scale.foreach(println)
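To iterate column-wise instead of handling one column at a time, the rows can be transposed, scaled per column, and transposed back. The following plain-Scala sketch uses made-up in-memory rows in place of the parsed CSV; the names minMax, normalized and ranks are illustrative, and the ranking shown is a dense rank on the third column, which is only one possible reading of "rank".

```scala
// Hypothetical in-memory rows standing in for the parsed CSV
// (recent, freq, monitor), already converted to Int.
val rows: Vector[Vector[Int]] = Vector(
  Vector(1, 1234, 199090),
  Vector(4, 2553, 198613),
  Vector(6, 3232, 199090)
)

// Min-max scale one column to [0, 1]; a constant column maps to 0.0
// to avoid dividing by zero.
def minMax(column: Seq[Int]): Seq[Double] = {
  val lo = column.min
  val hi = column.max
  if (hi == lo) column.map(_ => 0.0)
  else column.map(x => (x - lo).toDouble / (hi - lo))
}

// transpose turns rows into columns, so each column is scaled independently;
// the second transpose restores the row layout.
val normalized: Vector[Vector[Double]] =
  rows.transpose.map(c => minMax(c).toVector).transpose

// Dense rank by the third column: equal values share a rank.
val distinctSorted = rows.map(_(2)).distinct.sorted
val ranks: Vector[Int] = rows.map(r => distinctSorted.indexOf(r(2)) + 1)

normalized.foreach(r => println(r.mkString(" ")))
println(ranks.mkString(" "))
```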