How to count occurrences of a certain character in Spark - Scala

I'd like to count occurrences of the character 'a' in spark-shell.
I have a somewhat clumsy method: split on 'a' and take "length - 1", which gives what I want.
Here is the code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val test_data = sqlContext.read.json("music.json")
test_data.registerTempTable("test_data")
val temp1 = sqlContext.sql("select user.id_str as userid, text from test_data")
val temp2 = temp1.map(t => (t.getAs[String]("userid"),t.getAs[String]("text").split('a').length-1))
However, someone told me this is not remotely safe. I don't know why. Can you give me a better way to do this?

It is not safe because the value may be NULL:
val df = Seq((1, None), (2, Some("abcda"))).toDF("id", "text")
getAs[String] will return null:
scala> df.first.getAs[String]("text") == null
res1: Boolean = true
and calling split on it will throw a NullPointerException:
scala> df.first.getAs[String]("text").split("a")
java.lang.NullPointerException
...
which is most likely the situation you got in your previous question.
One simple solution:
import org.apache.spark.sql.functions._
val aCnt = coalesce(length(regexp_replace($"text", "[^a]", "")), lit(0))
df.withColumn("a_cnt", aCnt).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// | 1| null| 0|
// | 2|abcda| 2|
// +---+-----+-----+
If you want to make your code relatively safe, you should either check for null:
def countAs1(s: String) = s match {
  case null  => 0
  case chars => chars.count(_ == 'a')
}
countAs1(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs1(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
or catch possible exceptions:
import scala.util.Try
def countAs2(s: String) = Try(s.count(_ == 'a')).toOption.getOrElse(0)
countAs2(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs2(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
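Tying this back to the original question: a minimal sketch (assuming temp1 still has the userid and text columns from the query above) can stay entirely at the DataFrame level and never call getAs at all:
import org.apache.spark.sql.functions._
// Strip everything that is not 'a' and measure what is left; coalesce maps the
// null produced for null text values to 0, so no NPE is possible.
val withACount = temp1.withColumn(
  "a_cnt",
  coalesce(length(regexp_replace(col("text"), "[^a]", "")), lit(0))
)
withACount.select("userid", "a_cnt").show()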

Related

Functional way of writing a huge when/rlike statement

I'm using regexes to identify file types based on extension in a DataFrame.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.when
val ignoreCase: String = "(?i)"
val ignoreExtension: String = "(?:\\.[_\\d]+)*(?:|\\.bck|\\.old|\\.orig|\\.bz2|\\.gz|\\.7z|\\.z|\\.zip)*(?:\\.[_\\d]+)*$"
val pictureFileName: String = "image"
val pictureFileType: String = ignoreCase + "^.+(?:\\.gif|\\.ico|\\.jpeg|\\.jpg|\\.png|\\.svg|\\.tga|\\.tif|\\.tiff|\\.xmp)" + ignoreExtension
val videoFileName: String = "video"
val videoFileType: String = ignoreCase + "^.+(?:\\.mod|\\.mp4|\\.mkv|\\.avi|\\.mpg|\\.mpeg|\\.flv)" + ignoreExtension
val otherFileName: String = "other"
def pathToExtension(cl: Column): Column = {
  when(cl.rlike(pictureFileType), pictureFileName).
    when(cl.rlike(videoFileType), videoFileName).
    otherwise(otherFileName)
}
val df = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4","afile" ).toDF("filename")
val df2 = df.withColumn("filetype", pathToExtension( col( "filename" ) ) )
df2.show
This is only a sample; in reality I have about 30 regexes and file types to identify, so pathToExtension() gets really long because I have to add a new when clause for each type.
I can't find a proper way to write this code in a functional style, with a list or map containing the regexes and names, like this:
val typelist = List((pictureFileName,pictureFileType),(videoFileName,videoFileType))
foreach [need help for this part]
None of the code I've tried so far works properly.
You can use foldLeft to traverse your list of when conditions and chain them as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val default = "other"
def chainedWhen(c: Column, rList: List[(String, String)]): Column = rList.tail.
  foldLeft(when(c rlike rList.head._2, rList.head._1))( (acc, t) =>
    acc.when(c rlike t._2, t._1)
  ).otherwise(default)
Testing the method:
val df = Seq(
(1, "a.txt"), (2, "b.gif"), (3, "c.zip"), (4, "d.oth")
).toDF("id", "file_name")
val rList = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"), ("zip", ".*\\.zip"))
df.withColumn("file_type", chainedWhen($"file_name", rList)).show
// +---+---------+---------+
// | id|file_name|file_type|
// +---+---------+---------+
// | 1| a.txt| text|
// | 2| b.gif| gif|
// | 3| c.zip| zip|
// | 4| d.oth| other|
// +---+---------+---------+
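Applied back to the question's own regexes, a usage sketch could look like this (assuming the pictureFileName/pictureFileType and videoFileName/videoFileType vals from the question are in scope, and that default is set to otherFileName, i.e. "other"):
// Precedence follows list order, exactly like the original when chain.
val typeList = List(
  (pictureFileName, pictureFileType),
  (videoFileName, videoFileType)
)
val files = List("file.jpg", "file.avi", "file3.tIf", "file5.AVI.zip", "file4.mp4", "afile").toDF("filename")
files.withColumn("filetype", chainedWhen($"filename", typeList)).show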

Generating a join condition dynamically in Spark/Scala

I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough that the user can pass whatever condition they like.
Here's how I am doing it right now. Although it works, I think it's not clean.
val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a,b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")
Here's the testMethod
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
def testMethod(inputString: String): Column = {
  val splitted = inputString.split("=")
  col(splitted.apply(0)) === col(splitted.apply(1))
}
I need help figuring out a better way of taking input to generate the join condition dynamically.
I'm not sure a custom method like this provides much benefit, but if you must go down that path, I would recommend also making it cover joins on:
columns of the same name (which is rather common)
inequality conditions
Sample code below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def joinDFs(dfL: DataFrame, dfR: DataFrame, conditions: List[String], joinType: String) = {
  val joinConditions = conditions.map( cond => {
    val arr = cond.split("\\s+")
    if (arr.size != 3) throw new Exception("Invalid join conditions!") else
      arr(1) match {
        case "<"  => dfL(arr(0)) < dfR(arr(2))
        case "<=" => dfL(arr(0)) <= dfR(arr(2))
        case "="  => dfL(arr(0)) === dfR(arr(2))
        case ">=" => dfL(arr(0)) >= dfR(arr(2))
        case ">"  => dfL(arr(0)) > dfR(arr(2))
        case "!=" => dfL(arr(0)) =!= dfR(arr(2))
        case _    => throw new Exception("Invalid join conditions!")
      }
  } ).
  reduce(_ and _)
  dfL.join(dfR, joinConditions, joinType)
}
val dfLeft = Seq(
(1, "2018-04-01", "p"),
(1, "2018-04-01", "q"),
(2, "2018-05-01", "r")
).toDF("id", "date", "value")
val dfRight = Seq(
(1, "2018-04-15", "x"),
(2, "2018-04-15", "y")
).toDF("id", "date", "value")
val conditions = List("id = id", "date <= date")
joinDFs(dfLeft, dfRight, conditions, "left_outer").
show
// +---+----------+-----+----+----------+-----+
// | id| date|value| id| date|value|
// +---+----------+-----+----+----------+-----+
// | 1|2018-04-01| p| 1|2018-04-15| x|
// | 1|2018-04-01| q| 1|2018-04-15| x|
// | 2|2018-05-01| r|null| null| null|
// +---+----------+-----+----+----------+-----+
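A side note on the first point above (columns of the same name): for plain equi-joins on identically named columns, Spark's built-in join overload that takes a Seq of column names is often enough and also deduplicates the join columns in the result, so a hedged shortcut for that common case is:
// Equi-join on the shared "id" column only; the output keeps a single id column.
dfLeft.join(dfRight, Seq("id"), "left_outer").show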

How to loop through multiple column values in a DataFrame to get a count

I have a list of tables, say x, y and z, and each table has some columns: for example test, test1, test2, test3 for table x, and likewise rem, rem1, rem2 for table y; the same goes for table z. The requirement is to loop through each column in a table and get a row count based on the scenario below.
If test is not NULL and all the others (test1, test2, test3) are NULL, that row counts as 1.
So we have to loop through each table, find the columns like test*, check the above condition, and count the row as 1 if it satisfies that condition.
I'm pretty new to Scala, but I thought of the below approach.
for each $tablename {
  val df = sql("select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)) : _*)
  val df_filtered = coldf.map(eachrow => df.filter(s"$eachrow".isNull))
}
It is not working for me, and I have no idea where to put the count variable. If someone can help with this I would really appreciate it.
I'm using Spark 2 with Scala.
Code update
Below is the code for generating the table list and the table-column mapping list.
val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")
//List of all tables in the db as list of df
val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)
//Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
After this I'm using the below methods:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfTableList.map{ df =>
  val keyCol = dfColMap(df)
  //println(keyCol)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It's giving me the below error:
createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
at scala.collection.immutable.List.map(List.scala:273)
... 78 elided
Given your requirement, I would suggest taking advantage of RDD functionality and using a Row-based method that creates your count for each Row per DataFrame:
val dfX = Seq(
("a", "ma", "a1", "a2", "a3"),
("b", "mb", null, null, null),
("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")
val dfY = Seq(
("nd", "d", "d1", null),
("ne", "e", "e1", "e2"),
("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")
val dfZ = Seq(
("g", null, "g1", "g2", "qg"),
("h", "ph", null, null, null),
("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType
val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfList.map{ df =>
  val keyCol = dfColMap(df)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
dfCountList(0).show
// +----+---+----+----+----+-----+
// | xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// | a| ma| a1| a2| a3| 0|
// | b| mb|null|null|null| 1|
// |null| mc| c1|null| c3| 0|
// +----+---+----+----+----+-----+
dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd| d| d1|null| 0|
// | ne| e| e1| e2| 0|
// | nf| f|null|null| 1|
// +---+---+----+----+-----+
dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz| pp| zz1| zz2| qq|count|
// +---+----+----+----+----+-----+
// | g|null| g1| g2| qg| 0|
// | h| ph|null|null|null| 1|
// | i| pi|null|null| qi| 1|
// +---+----+----+----+----+-----+
[UPDATE]
Note that the above solution works for any number of DataFrames as long as you have them in dfList and their corresponding key columns in dfColMap.
If you have a list of Hive tables instead, simply convert them into DataFrames using spark.table(), as below:
val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)
Now you still have to tell Spark what the key column for each table is. Let's say you have the key columns in a list in the same order as the table list. You can zip the two lists to create dfColMap and you'll have everything needed to apply the above solution:
val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
[UPDATE #2]
If you have the Hive table names and their corresponding key column names stored in a DataFrame, you can generate dfColMap as follows:
val dfTabColPair = Seq(
("tableX", "xx"),
("tableY", "yy"),
("tableZ", "zz")
).toDF("tabname", "colname")
val tabColPairList = dfTabColPair.rdd.map( r => (r(0).toString, r(1).toString)).
collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
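One caveat worth adding here (my own note, not part of the original answer): spark.table(t) returns a new DataFrame object on every call, and DataFrames used as Map keys are compared by reference. If dfList and dfColMap are built from separate spark.table calls, as in the code update above, dfColMap(df) fails with exactly the kind of NoSuchElementException shown earlier. A minimal sketch of the fix is to build the pairs once and derive both structures from them:
// Reuse the same DataFrame instances for both the list and the map keys.
val dfColPairs = tabColPairList.map { case (t, c) => (spark.table(t), c) }
val dfList     = dfColPairs.map(_._1).toList
val dfColMap   = dfColPairs.toMap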

Spark RDD: handle different fields of each row by field name

I am new to Spark and Scala, and I'm stuck on a problem: how to handle different fields of each row by field name and then turn the result into a new RDD.
This is my pseudo code:
val newRdd = df.rdd.map(x => {
  def Random1 => random(1,10000) //pseudo
  def Random2 => random(10000,20000) //pseudo
  x.schema.map(y => {
    if (y.name == "XXX1")
      x.getAs[y.dataType](y.name)) = Random1
    else if (y.name == "XXX2")
      x.getAs[y.dataType](y.name)) = Random2
    else
      x.getAs[y.dataType](y.name)) //pseudo, keep the same
  })
})
There are at least 2 errors in the above:
in the second map, "x.getAs" is a syntax error
how to get the result into a new RDD
I have been searching the net for a long time, but with no luck. Please help or give me some ideas on how to achieve this.
Thanks Ramesh Maharjan, it works now.
def randomString(len: Int): String = {
  val rand = new scala.util.Random(System.nanoTime)
  val sb = new StringBuilder(len)
  val ab = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  for (i <- 0 until len) {
    sb.append(ab(rand.nextInt(ab.length)))
  }
  sb.toString
}
def testUdf = udf((value: String) =>randomString(2))
val df = sqlContext.createDataFrame(Seq((1,"Android"), (2, "iPhone")))
df.withColumn("_2", testUdf(df("_2")))
+---+---+
| _1| _2|
+---+---+
| 1| F3|
| 2| Ag|
+---+---+
If you are intending to filter certain fields "XXX1" and "XXX2", then a simple select function should do the trick:
df.select("XXX1", "XXX2")
and convert that to an RDD.
If you are intending something else, then your x.getAs should look as below:
val random1 = x.getAs(y.name)
It seems that you are trying to change values in some columns, "XXX1" and "XXX2".
For that, a simple udf function and withColumn should do the trick.
Simple udf function is as below
def testUdf = udf((value: String) => {
//do your logics here and what you return from here would be reflected in the value you passed from the column
})
And you can call the udf function as
df.withColumn("XXX1", testUdf(df("XXX1")))
Similarly, you can do it for "XXX2".
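Putting that together with the random ranges from the question's pseudo code, a minimal sketch might look like this (the column names XXX1/XXX2, the string type of the columns, and the ranges 1 to 10000 and 10000 to 20000 are assumptions taken from the pseudo code, not from a real schema):
import org.apache.spark.sql.functions.udf
import scala.util.Random
// Each UDF ignores the incoming value and returns a random number in the requested range.
def randomInRange(lo: Int, hi: Int) = udf((_: String) => lo + Random.nextInt(hi - lo))
val replaced = df
  .withColumn("XXX1", randomInRange(1, 10000)(df("XXX1")))
  .withColumn("XXX2", randomInRange(10000, 20000)(df("XXX2")))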

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
(1,"{NUM.0002}*{NUM.0003}"),
(2,"{NUM.0004}+{NUM.0003}"),
(3,"END(6)"),
(4,"END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXX} substring with the VALUE string in that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works, though I don't think it's very optimized. I recursively join on the initial DataFrame to replace the NUMs with ENDs. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)
  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")
  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")
    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")
    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.
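Since the question explicitly asks how to "save a copy of df inside the UDF", here is a different hedged sketch (my own, not part of the answer above): if the lookup table is small enough to collect to the driver, the UDF can close over a plain Scala Map instead of the DataFrame, which avoids the NullPointerException caused by referencing a DataFrame inside a UDF. It assumes the {NUM.xxxx} references form no cycles:
import org.apache.spark.sql.functions.{col, udf}
// Collect CODE -> VALUE into a plain Map on the driver; the UDF then only touches local data.
val codeToValue: Map[Int, String] =
  df.collect().map(r => r.getInt(0) -> r.getString(1)).toMap
val numPattern = """\{NUM\.(\d+)\}""".r
val resolve = udf((value: String) => {
  var current = value
  var changed = true
  // Keep substituting until nothing changes; unresolved codes are left as-is, so the loop terminates.
  while (changed && current.contains("{NUM.")) {
    val next = numPattern.replaceAllIn(current, m =>
      scala.util.matching.Regex.quoteReplacement(
        codeToValue.getOrElse(m.group(1).toInt, m.matched)))
    changed = next != current
    current = next
  }
  current
})
val resolved = df.withColumn("VALUE", resolve(col("VALUE")))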