Functional way of writing a huge when/rlike statement - Scala

I'm using regexes to identify the file type based on the extension in a DataFrame.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._
val ignoreCase :String = "(?i)"
val ignoreExtension :String = "(?:\\.[_\\d]+)*(?:|\\.bck|\\.old|\\.orig|\\.bz2|\\.gz|\\.7z|\\.z|\\.zip)*(?:\\.[_\\d]+)*$"
val pictureFileName :String = "image"
val pictureFileType :String = ignoreCase + "^.+(?:\\.gif|\\.ico|\\.jpeg|\\.jpg|\\.png|\\.svg|\\.tga|\\.tif|\\.tiff|\\.xmp)" + ignoreExtension
val videoFileName :String = "video"
val videoFileType :String = ignoreCase + "^.+(?:\\.mod|\\.mp4|\\.mkv|\\.avi|\\.mpg|\\.mpeg|\\.flv)" + ignoreExtension
val otherFileName :String = "other"
def pathToExtension(cl: Column): Column = {
  when(cl.rlike(pictureFileType), pictureFileName).
    when(cl.rlike(videoFileType), videoFileName).
    otherwise(otherFileName)
}
val df = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4","afile" ).toDF("filename")
val df2 = df.withColumn("filetype", pathToExtension( col( "filename" ) ) )
df2.show
This is only a sample: I actually have about 30 regexes and file types to identify, so pathToExtension() ends up really long because I have to add a new when clause for each type.
I can't find a proper way to write this in a functional style, with a list or map holding the regex and the type name, like this:
val typelist = List((pictureFileName,pictureFileType),(videoFileName,videoFileType))
foreach [need help for this part]
Nothing I've tried so far works properly.

You can use foldLeft to traverse your list of when conditions and chain them as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val default = "other"
def chainedWhen(c: Column, rList: List[(String, String)]): Column = rList.tail.
  foldLeft(when(c rlike rList.head._2, rList.head._1))( (acc, t) =>
    acc.when(c rlike t._2, t._1)
  ).otherwise(default)
Testing the method:
val df = Seq(
(1, "a.txt"), (2, "b.gif"), (3, "c.zip"), (4, "d.oth")
).toDF("id", "file_name")
val rList = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"), ("zip", ".*\\.zip"))
df.withColumn("file_type", chainedWhen($"file_name", rList)).show
// +---+---------+---------+
// | id|file_name|file_type|
// +---+---------+---------+
// | 1| a.txt| text|
// | 2| b.gif| gif|
// | 3| c.zip| zip|
// | 4| d.oth| other|
// +---+---------+---------+
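Applied back to the question's own regex list, the same helper composes directly. A minimal sketch, reusing the pictureFileType / videoFileType values and the filename DataFrame defined in the question:
// Sketch: plug the question's (name, regex) pairs straight into chainedWhen
// (assumes the vals and the df with a "filename" column from the question).
val typelist = List(
  (pictureFileName, pictureFileType),
  (videoFileName, videoFileType)
  // ...append the remaining (name, regex) pairs here
)

val df2 = df.withColumn("filetype", chainedWhen(col("filename"), typelist))
df2.show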

Related

How to call a UDF with multiple arguments (currying) in Spark SQL?

How do I call the below UDF with multiple arguments (currying) on a Spark DataFrame, as shown below?
Read the file and get a List[String]:
val data = sc.textFile("file.csv").flatMap(line => line.split("\n")).collect.toList
Register the UDF:
val getValue = udf(Udfnc.getVal(_: Int, _: String, _: String)(_: List[String]))
Call the UDF on the DataFrame below:
df.withColumn("value",
getValue(df("id"),
df("string1"),
df("string2"))).show()
Here I am missing the List[String] argument, and I am really not sure how I should pass in this argument.
I can make the following assumptions about your requirement based on your question:
a] the UDF should accept a parameter other than a DataFrame column
b] the UDF should take multiple columns as parameters
Let's say you want to concatenate the values from all the columns along with the specified parameter. Here is how you can do it:
import org.apache.spark.sql.functions._
def uDF(strList: List[String]) = udf[String, Int, String, String](
  (value1: Int, value2: String, value3: String) =>
    value1.toString + "_" + value2 + "_" + value3 + "_" + strList.mkString("_")
)
val df = spark.sparkContext.parallelize(Seq((1,"r1c1","r1c2"),(2,"r2c1","r2c2"))).toDF("id","str1","str2")
scala> df.show
+---+----+----+
| id|str1|str2|
+---+----+----+
| 1|r1c1|r1c2|
| 2|r2c1|r2c2|
+---+----+----+
val dummyList = List("dummy1","dummy2")
val result = df.withColumn("new_col", uDF(dummyList)(df("id"),df("str1"),df("str2")))
scala> result.show(2, false)
+---+----+----+-------------------------+
|id |str1|str2|new_col |
+---+----+----+-------------------------+
|1 |r1c1|r1c2|1_r1c1_r1c2_dummy1_dummy2|
|2 |r2c1|r2c2|2_r2c1_r2c2_dummy1_dummy2|
+---+----+----+-------------------------+
Defining a UDF with multiple parameters:
val enrichUDF: UserDefinedFunction = udf((jsonData: String, id: Long) => {
val lastOccurence = jsonData.lastIndexOf('}')
val sid = ",\"site_refresh_stats_id\":" + id+ " }]"
val enrichedJson = jsonData.patch(lastOccurence, sid, sid.length)
enrichedJson
})
Calling the udf to an existing dataframe:
val enrichedDF = EXISTING_DF
.withColumn("enriched_column",
enrichUDF(col("jsonData")
, col("id")))
The following import is also required:
import org.apache.spark.sql.expressions.UserDefinedFunction
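For the original getValue, the same curried pattern can be written as a plain Scala method that takes the List[String] first and returns a udf over the column arguments. A minimal sketch, assuming Udfnc.getVal keeps the curried signature shown in the question and returns a String:
// Sketch only: Udfnc.getVal is the asker's own function and is assumed here to
// return a String; the list is captured by the closure, not passed as a column.
def getValue(list: List[String]) = udf {
  (id: Int, s1: String, s2: String) => Udfnc.getVal(id, s1, s2)(list)
}

df.withColumn("value", getValue(data)(df("id"), df("string1"), df("string2"))).show()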

How to loop through multiple column values in a DataFrame to get a count

I have a list of tables, say x, y, z, and each table has some columns, for example test, test1, test2, test3 for table x, and similarly rem, rem1, rem2 for table y. The same goes for table z. The requirement is to loop through each column in a table and get a row count based on the scenario below.
If test is not NULL and all the others (test1, test2, test3) are NULL, then it counts as 1.
So we have to loop through each table, find the columns like test*, check the above condition, and mark the row as counting 1 if it satisfies that condition.
I'm pretty new to Scala but I thought of the approach below.
for each $tablename {
  val df = sql("select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)) : _*)
  val df_filtered = coldf.map(eachrow => df.filter(s"$eachrow".isNull))
}
It is not working for me, and I can't figure out where to put the count variable. If someone can help with this I would really appreciate it.
I'm using Spark 2 with Scala.
Code update
Below is the code for generating the table list and the table-column mapping list.
val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")
//List of all tables in the db as list of df
val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)
//Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
After this I'm using the methods below.
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfTableList.map{ df =>
  val keyCol = dfColMap(df)
  //println(keyCol)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It's giving me the error below:
createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
at scala.collection.immutable.List.map(List.scala:273)
... 78 elided
Given your requirement, I would suggest taking advantage of RDD's functionality and use a Row-based method that creates your count for each Row per DataFrame:
val dfX = Seq(
("a", "ma", "a1", "a2", "a3"),
("b", "mb", null, null, null),
("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")
val dfY = Seq(
("nd", "d", "d1", null),
("ne", "e", "e1", "e2"),
("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")
val dfZ = Seq(
("g", null, "g1", "g2", "qg"),
("h", "ph", null, null, null),
("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType
val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}

val dfCountList = dfList.map{ df =>
  val keyCol = dfColMap(df)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
dfCountList(0).show
// +----+---+----+----+----+-----+
// | xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// | a| ma| a1| a2| a3| 0|
// | b| mb|null|null|null| 1|
// |null| mc| c1|null| c3| 0|
// +----+---+----+----+----+-----+
dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd| d| d1|null| 0|
// | ne| e| e1| e2| 0|
// | nf| f|null|null| 1|
// +---+---+----+----+-----+
dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz| pp| zz1| zz2| qq|count|
// +---+----+----+----+----+-----+
// | g|null| g1| g2| qg| 0|
// | h| ph|null|null|null| 1|
// | i| pi|null|null| qi| 1|
// +---+----+----+----+----+-----+
[UPDATE]
Note that the above solution works for any number of DataFrames as long as you have them in dfList and their corresponding key columns in dfColMap.
If you have a list of Hive tables instead, simply convert them into DataFrames using spark.table(), as below:
val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)
Now you still have to tell Spark what the key column for each table is. Let's say you have the key column list in the same order as the table list. You can zip the two lists to create dfColMap, and you'll have everything needed to apply the above solution:
val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
[UPDATE #2]
If you have the Hive table names and their corresponding key column names stored in a DataFrame, you can generate dfColMap as follows:
val dfTabColPair = Seq(
("tableX", "xx"),
("tableY", "yy"),
("tableZ", "zz")
).toDF("tabname", "colname")
val tabColPairList = dfTabColPair.rdd.map( r => (r(0).toString, r(1).toString)).
collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
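Since the count depends only on null checks, the same logic can also be expressed with Column functions and withColumn, avoiding the RDD round-trip. A minimal sketch (not part of the original answer), assuming the same dfList and dfColMap as above:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Sketch: column-based equivalent of createCount.
def withCount(df: DataFrame, keyCol: String): DataFrame = {
  val checkCols = df.columns.filter(_.matches(s"$keyCol\\d+"))
  // count = 1 when the key column is non-null and every matching check column is null
  val allChecksNull = checkCols.map(col(_).isNull).reduceOption(_ && _).getOrElse(lit(true))
  df.withColumn("count", when(col(keyCol).isNotNull && allChecksNull, 1).otherwise(0))
}

val dfCountList2 = dfList.map(df => withCount(df, dfColMap(df)))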

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
(1,"{NUM.0002}*{NUM.0003}"),
(2,"{NUM.0004}+{NUM.0003}"),
(3,"END(6)"),
(4,"END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
var newln = ln
while(newln contains "{NUM."){
var num = newln.slice(newln.indexOf("{")+5, newln.indexOf("}")).toInt
var new_value = df.where($"CODE" === num).head.getAs[String](1)
newln = newln.replace(newln.slice(newln.indexOf("{"),newln.indexOf("}")+1), new_value)
}
newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works but is not very optimized, I think. I recursively join against the initial DataFrame to replace the NUM references with their END values. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)
def main(args: Array[String]): Unit = {
val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
val data = Seq(
Data(1,"{NUM.0002}*{NUM.0003}"),
Data(2,"{NUM.0004}+{NUM.0003}"),
Data(3,"END(6)"),
Data(4,"END(4)"),
Data(5,"{NUM.0002}")
)
val initialDF = sparkSession.createDataFrame(data)
val endDF = initialDF.filter(!(col("value") contains "{NUM"))
val numDF = initialDF.filter(col("value") contains "{NUM")
val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
resultDF.show(false)
}
val parseNumUdf = udf((value: String) => {
if (value.contains("{NUM")) {
val regex = """.*?\{NUM\.(\d+)\}.*""".r
value match {
case regex(code) => code.toLong
}
} else {
-1L
}
})
val replaceUdf = udf((value: String, replacement: String) => {
val regex = """\{NUM\.(\d+)\}""".r
regex.replaceFirstIn(value, replacement)
})
def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
if (currentDF.count() == 0) {
currentDF
} else {
val numDFWithCode = currentDF
.withColumn("num_code", parseNumUdf(col("value")))
.withColumnRenamed("code", "code_original")
.withColumnRenamed("value", "value_original")
val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
val endDF = nextDF.filter(!(col("value") contains "{NUM"))
val numDF = nextDF.filter(col("value") contains "{NUM")
endDF.union(replaceNumByEnd(initialDF, numDF))
}
}
If you need more explanation, don't hesitate.
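Another way to avoid the NullPointerException (which comes from referencing the DataFrame df inside a UDF, something Spark cannot do on executors) is to collect the lookup rows to the driver and close over a plain Map instead. A rough sketch, not from the original answer, using the question's df:
import org.apache.spark.sql.functions.{col, udf}

// Sketch: collect CODE -> VALUE into a driver-side Map, then resolve the
// {NUM.xxxx} references inside the UDF without touching the DataFrame itself.
val codeToValue: Map[Int, String] = df.collect().map(r => r.getInt(0) -> r.getString(1)).toMap

val numPattern = """\{NUM\.(\d+)\}""".r

val resolve = udf { (value: String) =>
  var current = value
  var guard = 0
  // keep substituting until no {NUM.xxxx} is left (guard protects against cyclic references)
  while (current.contains("{NUM.") && guard < 100) {
    current = numPattern.replaceAllIn(current, m => codeToValue(m.group(1).toInt))
    guard += 1
  }
  current
}

val resolved = df.withColumn("VALUE", resolve(col("VALUE")))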

Splitting columns in a Spark DataFrame into new rows [Scala]

I have output from a Spark DataFrame like below:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|[2014-12-26, 2013-12-12]|[232323,45466]|
43.45|19840|A345|[2010-03-16, 2013-16-12]|[34343,45454]|
My requirement is to generate output in the below format from the above output:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|2014-12-26|232323
43.45|19840|A345|2013-12-12|45466
43.45|19840|A345|2010-03-16|34343
43.45|19840|A345|2013-16-12|45454
Can somebody help me achieve this?
Is this the thing you're looking for?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sparkSession = ...
import sparkSession.implicits._
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323,45466)),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343,45454))
)).toDF("amt", "id", "num", "start_date", "identifier")
val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
dates.zip(identifiers)
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
.select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))
output.show()
Which returns:
+-----+-----+----+----------+----------+
| amt| id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26| 232323|
|43.45|19840|A345|2013-12-12| 45466|
|43.45|19840|A345|2010-03-16| 34343|
|43.45|19840|A345|2013-16-12| 45454|
+-----+-----+----+----------+----------+
EDIT:
Since you would like to have multiple columns that should be zipped, you should try something like this:
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323","45466"), Seq("123", "234")),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343","45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")
val zipArrays = udf { seqs: Seq[Seq[String]] =>
for(i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}
val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
$"col".getItem(index).as(column.toString())
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns: _*)
output.show()
/*
+-----+-----+----+----------+----------+--------------+
| amt| id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26| 232323| 123|
|43.45|19840|A345|2013-12-12| 45466| 234|
|43.45|19840|A345|2010-03-16| 34343| 345|
|43.45|19840|A345|2013-16-12| 45454| 456|
+-----+-----+----+----------+----------+--------------+
*/
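As a side note, if you are on Spark 2.4 or newer, the built-in arrays_zip can replace the zipping UDF entirely; a short sketch under that assumption, using the first input DataFrame above:
import org.apache.spark.sql.functions.{arrays_zip, explode}

// Sketch (Spark 2.4+): zip the array columns natively, then explode and unpack the struct.
val zipped = input
  .withColumn("zipped", explode(arrays_zip($"start_date", $"identifier")))
  .select($"amt", $"id", $"num",
    $"zipped.start_date".as("start_date"),
    $"zipped.identifier".as("identifier"))

zipped.show()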
If I understand correctly, you want the first elements of col 3 and 4.
Does this make sense?
val newRdd = oldDataFrame.rdd.map { row =>
  val zro = row(0)                      // 43.45
  val one = row(1)                      // 19840
  val two = row(2)                      // A345
  val dates = row.getSeq[String](3)     // [2014-12-26, 2013-12-12]
  val numbers = row.getSeq[Int](4)      // [232323, 45466]
  Row(zro, one, two, dates(0), numbers(0))
}
You could use SparkSQL.
First you create a view with the information we need to process:
df.createOrReplaceTempView("tableTest")
Then you can select the data with the expansions:
sparkSession.sqlContext.sql(
  "SELECT Amt, id, num, expanded_start_date, expanded_id " +
  "FROM tableTest " +
  "LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
  "LATERAL VIEW explode(Identifier) Identifier AS expanded_id")
  .show()

How to count a certain character in Spark

I'd like to count the character 'a' in spark-shell.
I have a somewhat clumsy method: split by 'a', and "length - 1" is what I want.
Here is the code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val test_data = sqlContext.read.json("music.json")
test_data.registerTempTable("test_data")
val temp1 = sqlContext.sql("select user.id_str as userid, text from test_data")
val temp2 = temp1.map(t => (t.getAs[String]("userid"),t.getAs[String]("text").split('a').length-1))
However, someone told me this is not remotely safe. I don't know why. Can you give me a better way to do this?
It is not safe because the value may be NULL:
val df = Seq((1, None), (2, Some("abcda"))).toDF("id", "text")
getAs[String] will return null:
scala> df.first.getAs[String]("text") == null
res1: Boolean = true
and split will give NPE:
scala> df.first.getAs[String]("text").split("a")
java.lang.NullPointerException
...
which is most likely the situation you got in your previous question.
One simple solution:
import org.apache.spark.sql.functions._
val aCnt = coalesce(length(regexp_replace($"text", "[^a]", "")), lit(0))
df.withColumn("a_cnt", aCnt).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// | 1| null| 0|
// | 2|abcda| 2|
// +---+-----+-----+
If you want to make your code relatively safe you should either check for null explicitly:
def countAs1(s: String) = s match {
  case null => 0
  case chars => chars.count(_ == 'a')
}
countAs1(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs1(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
or catch possible exceptions:
import scala.util.Try
def countAs2(s: String) = Try(s.count(_ == 'a')).toOption.getOrElse(0)
countAs2(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs2(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
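If you need the count per row rather than for a single collected value, the same null-safe helper can be wrapped in a UDF and applied column-wise; a brief sketch reusing countAs1 from above:
import org.apache.spark.sql.functions.udf

// Sketch: apply the null-safe counter to the whole column; nulls fall through to 0.
val countAsUdf = udf((s: String) => countAs1(s))

df.withColumn("a_cnt", countAsUdf($"text")).show
// should match the regexp_replace-based a_cnt column shown earlier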