UDF to extract String in scala - scala

I'm trying to extract the last set number from this data type:
urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)
In this example I'm trying to extract 10342800535 as a string.
This is my code in scala,
def extractNestedUrn(urn: String): String = {
val arr = urn.split(":").map(_.trim)
val nested = arr(3)
val clean = nested.substring(1, nested.length -1)
val subarr = clean.split(":").map(_.trim)
val res = subarr(3)
val out = res.split(",").map(_.trim)
val fin = out(1)
fin.toString
}
This is run as an UDF and it throws the following error,
org.apache.spark.SparkException: Failed to execute user defined function
What am I doing wrong?

You can simply use regexp_extract function. Check this
val df = Seq(("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")).toDF("x")
df.show(false)
+-------------------------------------------------------------------+
|x |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+
df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x |NestedUrn |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+

One reason that org.apache.spark.SparkException: Failed to execute user defined function exception are raised is when an exception is raised inside your user defined function.
Analysis
If I try to run your user defined function with the example input you provided, using the code below:
import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._
val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")
def extractNestedUrn(urn: String): String = {
val arr = urn.split(":").map(_.trim)
val nested = arr(3)
val clean = nested.substring(1, nested.length -1)
val subarr = clean.split(":").map(_.trim)
val res = subarr(3)
val out = res.split(",").map(_.trim)
val fin = out(1)
fin.toString
}
val extract_urn = udf(extractNestedUrn _)
dataframe.select(extract_urn(col("urn"))).show(false)
I get this complete stack trace:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
at UdfExtractionError$.main(UdfExtractionError.scala:37)
at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 86 more
The important part of this stack trace is actually:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
This is the exception raised when executing your user defined function code.if we analyse your function code, you split two times the input by :. The result of the first split is actually this array:
["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
and not this array:
["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]
So, if we execute the remaining statements of your function, you get:
val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "urn"
val subarr = ["urn"]
As at the next line you call the fourth element of the array subarr that contains only one element, an ArrayOutOfBound exception is raised and then Spark returns a SparkException
Solution
Although the best solution to your problem is obviously the previous answer with regexp_extract, you can correct your user defined function as below:
def extractNestedUrn(urn: String): String = {
val arr = urn.split(':') // split using character instead of string regexp
val nested = arr.last // get last element of array, here "187236028,10342800535)"
val subarr = nested.split(',')
val res = subarr.last // get last element, here "10342800535)"
val out = res.init // take all the string except the last character, to remove ')'
out // no need to use .toString as out is already a String
}
However, as said before, the best solution is to use spark inner function regexp_extract as explained in first answer. Your code will be easier to understand and more performant

Related

How to pass column names in selectExpr through one or more string parameters in spark using scala?

I am using script for CDC Merge in spark streaming. I wish to pass column values in selectExpr through a parameter as column names for each table would change. When I pass the columns and struct field through a string variable, I am getting error as ==> mismatched input ',' expecting
Below is the piece of code I am trying to parameterize.
var filteredMicroBatchDF=microBatchOutputDF
.selectExpr("col1","col2","struct(offset,KAFKA_TS) as otherCols" )
.groupBy("col1","col2").agg(max("otherCols").as("latest"))
.selectExpr("col1","col2","latest.*")
Reference to the script I am trying to emulate: -
https://docs.databricks.com/_static/notebooks/merge-in-cdc.html
I have tried like below by passing column names in a variable and then reading in the selectExpr from these variables: -
val keyCols = "col1","col2"
val structCols = "struct(offset,KAFKA_TS) as otherCols"
var filteredMicroBatchDF=microBatchOutputDF
.selectExpr(keyCols,structCols )
.groupBy(keyCols).agg(max("otherCols").as("latest"))
.selectExpr(keyCols,"latest.*")
When I run the script it gives me error as
org.apache.spark.sql.streaming.StreamingQueryException:
mismatched input ',' expecting <<EOF>>
EDIT
Here is what I have tried after comments by Luis Miguel which works fine: -
import org.apache.spark.sql.{DataFrame, functions => sqlfun}
def foo(microBatchOutputDF: DataFrame)
(keyCols: Seq[String], structCols: Seq[String]): DataFrame =
microBatchOutputDF
.selectExpr((keyCols ++ structCols) : _*)
.groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
.selectExpr((keyCols :+ "latest.*") : _*)
var keyColumns = Seq("COL1","COL2")
var structColumns = "offset,Kafka_TS"
foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))
Note: Below results in an error
foo(microBatchOutputDF)(keyCols = Seq(keyColumns), structColumns = Seq("struct("+structColumns+") as otherCols"))
The thing about above working code is that, here keyColumns were hardcoded. So, I tried reading (firstly) from parameter file and (Secondly) from widget which resulted in error and it is here I am looking for advice and suggestions: -
First Method
def loadProperties(url: String):Properties = {
val properties: Properties = new Properties()
if (url != null) {
val source = Source.fromURL(url)
properties.load(source.bufferedReader())
}
return properties
}
var tableProp: Properties = new Properties()
tableProp = loadProperties("dbfs:/Configs/Databricks/Properties/table/Table.properties")
var keyColumns = Seq(tableProp.getProperty("keyCols"))
var structColumns = tableProp.getProperty("structCols")
keyCols and StructCols are defined in parameter file as: -
keyCols = Col1, Col2 (I also tried assigning these as "Col1","Col2")
StructCols = offset,Kafka_TS
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))
The code is throwing the error pointing at first comma (as if its taking the columns field as single argument):
mismatched input ',' expecting <EOF>
== SQL ==
"COL1","COL2""
-----^^^
If I pass just one column in the keyCols property, code is working fine.
E.g. keyCols = Col1
Second Method
Here I tried reading key columns from the widget and its the same error again.
dbutils.widgets.text("prmKeyCols", "","")
val prmKeyCols = dbutils.widgets.get("prmKeyCols")
var keyColumns = Seq(prmKeyCols)
The widget is passed in as below
"Col1","Col2"
Then finally,
foo(microBatchOutputDF)(keyCols = Seq(keyColumns:_*), structColumns = Seq("struct("+structColumns+") as otherCols"))
This is also giving same error.
Something like this should work:
import org.apache.spark.sql.{DataFrame, functions => sqlfun}
def foo(microBatchOutputDF: DataFrame)
(keyCols: Seq[String], structCols: Seq[String]): DataFrame =
microBatchOutputDF
.selectExpr((keyCols ++ structCols) : _*)
.groupBy(keyCols.head, keyCols.tail : _*).agg(sqlfun.max("otherCols").as("latest"))
.selectExpr((keyCols :+ "latest.*") : _*)
Which you can use like:
foo(microBatchOutputDF)(keyCols = Seq("col1", "col2"), structCols = Seq("struct(offset,KAFKA_TS) as otherCols"))

Getting java.lang.ArrayIndexOutOfBoundsException: 1 in spark when applying aggregate functions

I am trying to do some transformations on a data set. After reading the data set when performing df.show() operations, I am getting the rows listed in spark shell. But when I try to do df.count or any aggregate functions, I am getting
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")
import scala.util.control.Exception.catching
import java.sql.Timestamp
implicit class StringImprovements(val s:String) {
def toIntSafe = catching(classOf[NumberFormatException])
opt s.toInt
def toLongsafe = catching(classOf[NumberFormatException])
opt s.toLong
def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}
case class Post(commentcount:Option[Int],lastactivitydate:Option[java.sql.Timestamp],ownerUserId:Option[Long],body:String,score:Option[Int],creattiondate:Option[java.sql.Timestamp],viewcount:Option[Int],title:String,tags:String,answerCount:Option[Int],acceptedanswerid:Option[Long],posttypeid:Option[Long],id:Long)
def stringToPost(row:String):Post = {
val r = row.split("~")
Post(r(0).toIntSafe,
r(1).toTimeStampsafe,
r(2).toLongsafe,
r(3),
r(4).toIntSafe,
r(5).toTimeStampsafe,
r(6).toIntSafe,
r(7),
r(8),
r(9).toIntSafe,
r(10).toLongsafe,
r(11).toLongsafe,
r(12).toLong)
}
val itpostsDFcase1 = itpostsrow.map{x=>stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
Your function stringToPost() might cause a Java error ArrayIndexOutOfBoundsException if the text file contains some empty row or if the number of fields after the split is not 13.
Due to Spark's lazy evaluation one notices such errors only when performing an action like count.

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to a every row of a .csv file:
def convert(inString: Array[String]) : String = {
val country = inString(0)
val sellerId = inString(1)
val itemID = inString(2)
try{
val minidf = sqlContext.read.json( sc.makeRDD(inString(3):: Nil) )
.withColumn("country", lit(country))
.withColumn("seller_id", lit(sellerId))
.withColumn("item_id", lit(itemID))
val finalString = minidf.toJSON.collect().mkString(",")
finalString
} catch{
case e: Exception =>println("AN EXCEPTION "+inString.mkString(","))
("this is an exception "+e+" "+inString.mkString(","))
}
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throw a java.lang.NullPointerException.
I included a try catch clause so see where exactly is this failing and it's failing for every single row.
What am I doing wrong here?
You cannot put sqlContext or sparkContext in a Spark map, since that object can only exist on the driver node. Essentially they are in charge of distributing your tasks.
You could rewite the JSON parsing bit using one of these libraries in pure scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
// execute myFunction()
(new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
(new MyClass(word)).myFunction().collect()
}
You cannot create a RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val input = sc.parallelize(Seq("apple", "ananas", "banana"))
// RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
val result = input.flatMap(modifiedFunction)
}
def modifiedFunction(word: String): Seq[String] = {
word.indices map {
index => word.substring(0, index) + word.substring(index+1)
}
}