Adding a new column using other existing columns (Spark/Scala)

I want to add a new column based on other existing columns, but only under certain conditions. This is an example of my DataFrame:
val data = Seq(("WHT20177", "CTHT WO/MTR# : WHT20212/BTI0426; WHT20177/BTH0393"),
("WHT55637", "CTHT WO/MTR# : WHT50747/BTI2699; WHT55637/SQL1239"))
val dataFrame = data.toDF("prev_wo", "ref_wo")
+--------+-------------------------------------------------+
|prev_wo |ref_wo |
+--------+-------------------------------------------------+
|WHT20177|CTHT WO/MTR# : WHT20212/BTI0426; WHT20177/BTH0393|
|WHT55637|CTHT WO/MTR# : WHT50747/BTI2699; WHT55637/SQL1239|
+--------+-------------------------------------------------+
The column "ref_wo" must contain "prev_wo", in that case only, I must put the following element in a new column which I shall name "col1".
For the first line, the value to extract is "BTH0393", for the second line, the value to extract is "SQL1239"
I am trying this in Spark Scala using two different methods. The first one reacts only to the first line, and the second one reacts only to the second line.
First method :
def addNewColumn(df: DataFrame): DataFrame = {
  val prev_wo = dataFrame.select("prev_wo").collectAsList().get(0).mkString(",")
  val regex_extract = ("(?<=" + prev_wo + "\\/)(.{7})").r
  df
    .withColumn("col1",
      when($"ref_wo".contains(col("prev_wo")),
        regexp_extract(col("ref_wo"), regex_extract.toString(), 1))
        .otherwise(null)
    )
}
val new_dataFrame = dataFrame
.transform(addNewColumn)
OUTPUT:
+--------+-------------------------------------------------+-------+
|prev_wo |ref_wo |col1 |
+--------+-------------------------------------------------+-------+
|WHT20177|CTHT WO/MTR# : WHT20212/BTI0426; WHT20177/BTH0393|BTH0393|
|WHT55637|CTHT WO/MTR# : WHT50747/BTI2699; WHT55637/SQL1239| |
+--------+-------------------------------------------------+-------+
Second method:
def addColumn(df: DataFrame): DataFrame = {
  var out = df
  df.collect().foreach(row => {
    val prev_wo = row.getValuesMap(Seq("prev_wo")).get("prev_wo").getOrElse("")
    val regex_extract = ("(?<=" + prev_wo + "\\/)(.{7})").r
    out = out
      .withColumn("col1",
        when($"ref_wo".contains(col("prev_wo")),
          regexp_extract(col("ref_wo"), regex_extract.toString(), 1))
          .otherwise(null)
      )
  })
  out
}
val new_dataFrame = dataFrame
.transform(addColumn)
OUTPUT:
+--------+-------------------------------------------------+-------+
|prev_wo |ref_wo |col1 |
+--------+-------------------------------------------------+-------+
|WHT20177|CTHT WO/MTR# : WHT20212/BTI0426; WHT20177/BTH0393| |
|WHT55637|CTHT WO/MTR# : WHT50747/BTI2699; WHT55637/SQL1239|SQL1239|
+--------+-------------------------------------------------+-------+

You can use regexp_extract with a pattern dynamically generated from prev_wo:
dataFrame.withColumn("col1", expr("regexp_extract(ref_wo, concat(prev_wo, '/(.{7})'), 1)")).show(false)
+--------+-------------------------------------------------+-------+
|prev_wo |ref_wo |col1 |
+--------+-------------------------------------------------+-------+
|WHT20177|CTHT WO/MTR# : WHT20212/BTI0426; WHT20177/BTH0393|BTH0393|
|WHT55637|CTHT WO/MTR# : WHT50747/BTI2699; WHT55637/SQL1239|SQL1239|
+--------+-------------------------------------------------+-------+
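If you also want col1 to stay null when ref_wo does not contain prev_wo, mirroring the otherwise(null) branches of the original attempts, the same dynamically built pattern can be wrapped in a when. A minimal sketch, assuming the dataFrame defined above:
import org.apache.spark.sql.functions.{col, expr, when}

// Sketch: same dynamic pattern, but col1 stays null when ref_wo does not contain prev_wo.
val withCol1 = dataFrame.withColumn(
  "col1",
  when(col("ref_wo").contains(col("prev_wo")),
    expr("regexp_extract(ref_wo, concat(prev_wo, '/(.{7})'), 1)"))
    .otherwise(null)
)
withCol1.show(false)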

Related

Assigning elements in an array into the same DataFrame using Scala and Spark

I pass in an array of emojis, and I want to get their Unicode values and store them in a DataFrame. Here is my code:
def getUnicodeOfEmoji(emojiArray: Array[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  var result: DataFrame = null
  var df: DataFrame = null
  for (i <- 0 until emojiArray.length) {
    df = Seq(emojiArray(i)).toDF("emoji")
    df.show()
    result = df.selectExpr(
      "emoji",
      "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
    )
  }
  result.show(false)
  return result
}
Input: val emojis = "😃😜😍"
Actual output:
+-----+-------+
|emoji|result |
+-----+-------+
|😍 |U+1F60D|
+-----+-------+
But I need all 3 emojis with their specific Unicode values in the DataFrame.
You don't need a for loop to construct the DataFrame. You can convert the array to a Seq and use its toDF method to construct the resulting DataFrame.
def getUnicodeOfEmoji(emojiArray: Array[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toSeq.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toArray
val converted = getUnicodeOfEmoji(input)
+-----+-------+
|emoji|result |
+-----+-------+
|😃 |U+1F603|
|😜 |U+1F61C|
|😍 |U+1F60D|
+-----+-------+
A slight improvement is to convert your string of emojis to a Seq[String] directly before feeding into the function, e.g.
def getUnicodeOfEmoji(emojiArray: Seq[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toDF("emoji")
  val result = df.selectExpr(
    "emoji",
    "'U+' || trim('0' , string(hex(encode(emoji, 'utf-32')))) as result"
  )
  result.show(false)
  return result
}
val emojis = "😃😜😍"
val input = raw"\p{block=Emoticons}".r.findAllIn(emojis).toSeq
val converted = getUnicodeOfEmoji(input)
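If you prefer typed column expressions over selectExpr, the same projection inside getUnicodeOfEmoji could be written with DataFrame API functions. A minimal sketch, assuming Spark 2.3+ for the two-argument trim(column, trimString):
import org.apache.spark.sql.functions.{col, concat, encode, hex, lit, trim}

// Sketch: equivalent of the selectExpr above, written with column functions.
val result = df.select(
  col("emoji"),
  concat(lit("U+"), trim(hex(encode(col("emoji"), "utf-32")), "0")).as("result")
)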

Functional way of writing a huge when/rlike statement

I'm using regexes to identify file types based on the file name extension in a DataFrame.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._
val ignoreCase :String = "(?i)"
val ignoreExtension :String = "(?:\\.[_\\d]+)*(?:|\\.bck|\\.old|\\.orig|\\.bz2|\\.gz|\\.7z|\\.z|\\.zip)*(?:\\.[_\\d]+)*$"
val pictureFileName :String = "image"
val pictureFileType :String = ignoreCase + "^.+(?:\\.gif|\\.ico|\\.jpeg|\\.jpg|\\.png|\\.svg|\\.tga|\\.tif|\\.tiff|\\.xmp)" + ignoreExtension
val videoFileName :String = "video"
val videoFileType :String = ignoreCase + "^.+(?:\\.mod|\\.mp4|\\.mkv|\\.avi|\\.mpg|\\.mpeg|\\.flv)" + ignoreExtension
val otherFileName :String = "other"
def pathToExtension(cl: Column): Column = {
  when(cl.rlike( pictureFileType ), pictureFileName ).
    when(cl.rlike( videoFileType ), videoFileName ).
    otherwise(otherFileName)
}
val df = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4","afile" ).toDF("filename")
val df2 = df.withColumn("filetype", pathToExtension( col( "filename" ) ) )
df2.show
This is only a sample; I have 30 regexes and types to identify, so pathToExtension() becomes really long because I have to add a new when statement for each type.
I can't find a proper way to write this code functionally, with a list or map containing the regex and the name, like this:
val typelist = List((pictureFileName,pictureFileType),(videoFileName,videoFileType))
foreach [need help for this part]
None of the code I've tried so far works properly.
You can use foldLeft to traverse your list of when conditions and chain them as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val default = "other"
def chainedWhen(c: Column, rList: List[(String, String)]): Column = rList.tail.
  foldLeft(when(c rlike rList.head._2, rList.head._1))( (acc, t) =>
    acc.when(c rlike t._2, t._1)
  ).otherwise(default)
Testing the method:
val df = Seq(
(1, "a.txt"), (2, "b.gif"), (3, "c.zip"), (4, "d.oth")
).toDF("id", "file_name")
val rList = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"), ("zip", ".*\\.zip"))
df.withColumn("file_type", chainedWhen($"file_name", rList)).show
// +---+---------+---------+
// | id|file_name|file_type|
// +---+---------+---------+
// | 1| a.txt| text|
// | 2| b.gif| gif|
// | 3| c.zip| zip|
// | 4| d.oth| other|
// +---+---------+---------+
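Applied back to the original regex list from the question, the call would look like the sketch below (assuming pictureFileType, videoFileType, etc. and the filename DataFrame from the question are in scope; the tuple order (name, regex) matches chainedWhen above):
val typeList: List[(String, String)] = List(
  (pictureFileName, pictureFileType),
  (videoFileName, videoFileType)
  // ... the remaining (name, regex) pairs go here
)

// df here is the question's DataFrame with the "filename" column.
val typedDf = df.withColumn("filetype", chainedWhen(col("filename"), typeList))
typedDf.show(false)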

Inline map function instead of loops

I have a DataFrame with three columns: city_name, driver_name, and vehicles, of which vehicles is a list.
I also have other details, such as driver hours and driver contact, for each driver in MySQL. Tables in the database are named in this format: city_name.driver_name.
scala> val tables = """
[
{"vehicles" : ["subaru","mazda"], "city_name" : "seattle", "driver_name" : "x"},
{"city_name" : "seattle", "driver_name" : "y"},
{"city_name" : "newyork", "driver_name" : "x"},
{"city_name" : "dallas", "driver_name" : "y"}
]
""" | | | | | | |
tables: String =
"
[
{"vehicles" : ["subaru","mazda"], "city_name" : "seattle", "driver_name" : "x"},
{"city_name" : "seattle", "driver_name" : "y"},
{"city_name" : "newyork", "driver_name" : "x"},
{"city_name" : "dallas", "driver_name" : "y"}
]
"
scala> val metadataRDD = sc.parallelize(tables.split('\n').map(_.trim.filter(_ >= ' ')).mkString :: Nil)
metadataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:30
scala> val metadataDF = spark.read.json(metadataRDD)
metadataDF: org.apache.spark.sql.DataFrame = [city_name: string, driver_name: string ... 1 more field]
scala> metadataDF.show
+---------+-----------+---------------+
|city_name|driver_name| vehicles|
+---------+-----------+---------------+
| seattle| x|[subaru, mazda]|
| seattle| y| null|
| newyork| x| null|
| dallas| y| null|
+---------+-----------+---------------+
For each of these drivers I need to apply a function and write the result to Parquet. What I am trying to do is use an inline function as below, but I can't get it to work:
metadataDF.map((e) => {
  val path = "s3://test/"
  val df = sparkJdbcReader.option("dbtable",
    e.city_name + "." + e.driver_name).load()
  val dir = path + e.driver_name + e.city_name
  if (e.vehicles)
    do something
  else:
    df.write.mode("overwrite").format("parquet").save(dir)
})
Basically the question is about how to use that inline function.
A call to the map() function always transforms the given input collection of type A into another collection of type B using the supplied function. In your map call you are saving the DataFrame to your storage layer (S3, judging by the path). The save() method defined on the DataFrameWriter class has a return type of Unit (think of it as void in Java). Hence your function will not work as written, because it would map your DataFrame to essentially two different types: the type returned from the if block and the Unit returned from the else block.
You can refactor your code and break it up in two blocks or so:
import org.apache.spark.sql.functions.{concat,concat_ws,lit,col}
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
val metadataRDD: RDD[String] = sc.parallelize(tables.split('\n').map(_.trim.filter(_ >= ' ')).mkString :: Nil)
val metadataDF: DataFrame = spark.read.json(metadataRDD)
val df_new_col: DataFrame = metadataDF
  .withColumn("city_driver", concat_ws(".", col("city_name"), col("driver_name")))
  .withColumn("dir", concat(lit("s3://test/"), col("city_name"), col("driver_name")))
You now have two columns containing each table name and its target path next to it. You can collect them and use them to read each table and store it in Parquet format, as sketched below.
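A possible collect-and-loop sketch; sparkJdbcReader and the vehicles branch are placeholders carried over from the question, not a tested pipeline:
// Collect the small metadata table to the driver and process each entry.
df_new_col
  .select("city_driver", "dir", "vehicles")
  .collect()
  .foreach { row =>
    val table = row.getString(0)
    val dir = row.getString(1)
    val hasVehicles = !row.isNullAt(2) && row.getSeq[String](2).nonEmpty
    val df = sparkJdbcReader.option("dbtable", table).load()
    if (hasVehicles) {
      // "do something" branch from the question
    } else {
      df.write.mode("overwrite").format("parquet").save(dir)
    }
  }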

Spark SQL - how to write a dynamic query in Spark SQL

I have a Hive table, and I want to create dynamic Spark SQL queries. At spark-submit time I specify a rule name, and the query should be generated based on that rule name. For example:
spark-submit <RuleName> IncorrectAge
It should then run this query from my Scala object:
select tablename, filter, condition from all_rules where rulename="IncorrectAge"
My input table (Rules):
+------+------------+---------------+---------------+---------+-------------+-----+
|rowkey|rule_name   |rule_run_status|tablename      |condition|filter       |level|
+------+------------+---------------+---------------+---------+-------------+-----+
|1     |IncorrectAge|In_Progress    |VDP_Vendor_List|age>18   |gender=Male  |NA   |
|2     |Customer_age|In_Progress    |Customer_List  |age<25   |gender=Female|NA   |
+------+------------+---------------+---------------+---------+-------------+-----+
I fetch the rulename:
select tablename, filter, condition from all_rules where rulename="IncorrectAge";
After executing this query, I got the result like this:
+---------------+-----------+---------+
|tablename      |filter     |condition|
+---------------+-----------+---------+
|VDP_Vendor_List|gender=Male|age>18   |
+---------------+-----------+---------+
Now I want to build the Spark SQL queries dynamically:
select count(*) from VDP_Vendor_List // first column --tablename
select count(*) from VDP_Vendor_List where gender=Male --tablename and filter
select * from EMP where gender=Male AND age >18 --tablename, filter, condition
My code (Spark 2.2):
import org.apache.spark.sql.{ Row, SparkSession }
import org.apache.log4j._

object allrules {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local[*]")
      .appName("Spark Hive")
      .enableHiveSupport().getOrCreate();
    import spark.implicits._
    val sampleDF = spark.read.json("C:/software/sampletableCopy.json") // for testing purpose i converted hive table to json data
    sampleDF.registerTempTable("sampletable")
    val allrulesDF = spark.sql("SELECT * FROM sampletable")
    allrulesDF.show()
    val TotalCount: Long = allrulesDF.count()
    println("==============> Total count ======>" + allrulesDF.count())
    val df1 = allrulesDF.select(allrulesDF.col("tablename"), allrulesDF.col("condition"), allrulesDF.col("filter"), allrulesDF.col("rule_name"))
    df1.show()
    val df2 = df1.where(df1.col("rule_name").equalTo("IncorrectAge")).show()
    println(df2)
    // var table_name = ""
    // var condition = ""
    // var filter = "";
    // df1.foreach(row => {
    //   table_name = row.get(1).toString();
    //   condition = row.get(2).toString();
    //   filter = row.get(3).toString();
    // })
  }
}
You can pass arguments from spark-submit to your application:
bin/spark-submit --class allrules something.jar tablename filter condition
Then, in your main function, you will have your params:
def main(args: Array[String]) : Unit = {
// args(0), args(1) ... these are your params
}
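For instance, a minimal sketch that builds the dynamic query straight from those three arguments (the query shape follows the question's third example):
import org.apache.spark.sql.SparkSession

object allrules {
  def main(args: Array[String]): Unit = {
    // args(0) = tablename, args(1) = filter, args(2) = condition
    val Array(tableName, filter, condition) = args
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sql(s"select * from $tableName where $filter and $condition").show()
  }
}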
You can pass your arguments to your driver class like this:
import org.apache.log4j.Logger
import org.apache.spark.sql.SparkSession

object DriverClass {
  val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("yarn").config("spark.sql.warehouse.dir", "path").enableHiveSupport().getOrCreate()
    if (args == null || args.isEmpty || args.length != 2) {
      log.error("Invalid number of arguments passed.")
      log.error("Arguments Usage: <Rule Name> <Rule Type>")
      log.error("Stopping the flow")
      System.exit(1)
    }
    import spark.implicits._
    val ruleName: String = String.valueOf(args(0).trim())
    val ruleType: String = String.valueOf(args(1).trim())
    val processSQL: String = s"select tablename, filter, condition from all_rules where $ruleName='$ruleType'"
    val metadataDF = spark.sql(processSQL)
    val (tblnm, fltr, cndtn) = metadataDF.rdd.map(f => (f.get(0).toString(), f.get(1).toString(), f.get(2).toString())).collect()(0)
    val finalSql_1 = s"select count(*) from $tblnm"                 // tablename only
    val finalSql_2 = s"select count(*) from $tblnm where $fltr"     // tablename and filter
    val finalSql_3 = s"select * from $tblnm where $fltr AND $cndtn" // tablename, filter and condition
    spark.sql(finalSql_1).show()
    spark.sql(finalSql_2).show()
    spark.sql(finalSql_3).show()
  }
}
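Submitting would then look something like this (the jar name is hypothetical; args(0) is the column to filter on and args(1) its value, matching the where clause built above):
spark-submit --class DriverClass --master yarn rules-app.jar rule_name IncorrectAge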

Task serialisation error when using UDF

I use IntelliJ IDEA to execute the code shown below. The content of df is the following:
+------+------+
|nodeId| p_i|
+------+------+
| 26|0.6914|
| 29|0.6914|
| 474| 0.0|
| 65|0.4898|
| 191|0.4445|
| 418|0.4445|
I get Task serialization error at line result.show() when I run this code:
class MyUtils extends Serializable {
  def calculate(spark: SparkSession,
                df: DataFrame): DataFrame = {
    def myFunc(a: Double): String = {
      var result: String = "-"
      if (a > 1) {
        result = "A"
      }
      return result
    }
    val myFuncUdf = udf(myFunc _)
    val result = df.withColumn("role", myFuncUdf(df("a")))
    result.show()
    result
  }
}
Why do I get this error?
Update:
This is how I run the code:
object Processor extends App {
  // ...
  val mu = new MyUtils()
  var result = mu.calculate(spark, df)
}
I had to add extends Serializable to the definition of the class MyUtils.
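An alternative that can avoid making the class serializable at all is to give udf a self-contained function literal, so the closure does not reference the enclosing MyUtils instance. A minimal sketch under that assumption:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

class MyUtils {
  def calculate(spark: SparkSession, df: DataFrame): DataFrame = {
    // The function literal captures nothing from MyUtils, so only the
    // serializable lambda itself is shipped to the executors.
    val myFuncUdf = udf((a: Double) => if (a > 1) "A" else "-")
    val result = df.withColumn("role", myFuncUdf(df("p_i"))) // "p_i" per the sample df; the original referenced df("a")
    result.show()
    result
  }
}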