spark scala - UDF usage for creating new column

spark scala - UDF usage for creating new column - scala

I need to create a new column called hash_id from uid column of my dataframe, Below is my code:
//1.Define a hashing function
def calculate_hashid (uid: String) : BigInteger ={
val md = java.security.MessageDigest.getInstance("SHA-1")
val ha = new BigInteger( DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
return ha
}
//2.Convert function to UDF
val calculate_hashidUDF = udf(calculate_hashid)
//3.Apply udf on spark dataframe
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))
I am getting error at udf(calculate_hashid) saying
missing arguments for the method calculate_hashid(string)
I have gone through many examples online but could not resolve it, what am I missing here.

You can register your udf as
val calculate_hashidUDF = udf[String, BigInteger](calculate_hashidUDF)
You can also rewrite your udf as
def calculate_hashidUDF = udf(((uid: String) => {
val md = java.security.MessageDigest.getInstance("SHA-1")
new BigInteger( DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
}): String => BigInteger)
Or even without return type
def calculate_hashidUDF = udf((uid: String) => {
val md = java.security.MessageDigest.getInstance("SHA-1")
new BigInteger( DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})

Related

How can I dynamically invoke the same scala function in cascading manner with output of previous call goes as input to the next call

I am new to Spark-Scala and trying following thing but I am stuck up and not getting on how to achieve this requirement. I shall be really thankful if someone can really help in this regards.
We have to invoke different rules on different columns of given table. The list of column names and rules is being passed as argument to the program
The resultant of first rule should go as input to the next rule input.
question : How can I execute exec() function in cascading manner with dynamically filling the arguments for as many rules as specified in arguments.
I have developed a code as follows.
object Rules {
def main(args: Array[String]) = {
if (args.length != 3) {
println("Need exactly 3 arguments in format : <sourceTableName> <destTableName> <[<colName>=<Rule> <colName>=<Rule>,...")
println("E.g : INPUT_TABLE OUTPUT_TABLE [NAME=RULE1,ID=RULE2,TRAIT=RULE3]");
System.exit(-1)
}
val conf = new SparkConf().setAppName("My-Rules").setMaster("local");
val sc = new SparkContext(conf);
val srcTableName = args(0).trim();
val destTableName = args(1).trim();
val ruleArguments = StringUtils.substringBetween(args(2).trim(), "[", "]");
val businessRuleMappings = ruleArguments.split(",").map(_.split("=")).map(arr => arr(0) -> arr(1)).toMap;
val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc) ;
val hiveContext : HiveContext = new org.apache.spark.sql.hive.HiveContext(sc);
val dfSourceTbl = hiveContext.table("TEST.INPUT_TABLE");
def exec(dfSource: DataFrame,columnName :String ,funName: String): DataFrame = {
funName match {
case "RULE1" => TransformDF(columnName,dfSource,RULE1);
case "RULE2" => TransformDF(columnName,dfSource,RULE2);
case "RULE3" => TransformDF(columnName,dfSource,RULE3);
case _ =>dfSource;
}
}
def TransformDF(x:String, df:DataFrame, f:(String,DataFrame)=>DataFrame) : DataFrame = {
f(x,df);
}
def RULE1(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE2(column : String, sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
def RULE3(column : String,sourceDF: DataFrame): DataFrame = {
//put businees logic
return sourceDF;
}
// How can I call this exec() function with output casacing and arguments for variable number of rules.
val finalResultDF = exec(exec(exec(dfSourceTbl,"NAME","RULE1"),"ID","RULE2"),"TRAIT","RULE3);
finalResultDF.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto("DB.destTableName")
}
}

I would write all the rules as functions transforming one dataframe to another:
val rules: Seq[(DataFrame) => DataFrame] = Seq(
RULE1("NAME",_:DataFrame),
RULE2("ID",_:DataFrame),
RULE3("TRAIT",_:DataFrame)
)
Not you can apply them using folding
val finalResultDF = rules.foldLeft(dfSourceTbl)(_ transform _)

Add scoped variable per row iteration in Apache Spark

I'm reading multiple html files into a dataframe in Spark.
I'm converting elements of the html to columns in the dataframe using a custom udf
val dataset = spark
.sparkContext
.wholeTextFiles(inputPath)
.toDF("filepath", "filecontent")
.withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
.withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
...
def parseDocValue(cssSelectorQuery: String) =
udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())
Which works perfectly, however each withColumn call will result in the parsing of the html string, which is redundant.
Is there a way (without using lookup tables or such) that I can generate 1 parsed Document (Jsoup.parse(html)) based on the "filecontent" column per row and make that available for all withColumn calls in the dataframe?
Or shouldn't I even try using DataFrames and just use RDD's ?

So the final answer was in fact quite simple:
Just map over the rows and create the object ones there
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
val domObject = document.select(cssSelectorQuery)
val domValue = attr match {
case Some(a) => domObject.attr(a)
case None => domObject.text()
}
domValue match {
case x if x == null || x.isEmpty => None
case y => Some(y)
}
}
val dataset = spark
.sparkContext
.wholeTextFiles(inputPath, minPartitions = 265)
.map {
case (filepath, filecontent) => {
implicit val document = Jsoup.parse(filecontent)
val customDataJson = docJson(filecontent, customJsonRegex)
DataEntry(
biz_name = docValue(".biz-page-title"),
biz_website = docValue(".biz-website a"),
url = docValue("meta[property=og:url]", attr = Some("content")),
...
filename = Some(fileName(filepath)),
fileTimestamp = Some(fileTimestamp(filepath))
)
}
}
.toDS()

I'd probably rewrite it as follows, to do the parsing and selecting in one go and put them in a temporary column:
val dataset = spark
.sparkContext
.wholeTextFiles(inputPath)
.withColumn("temp", parseDocValue(Array(".biz-page-title", ".biz-website a"))('filecontent))
.withColumn("biz_name", col("temp")(0))
.withColumn("biz_website", col("temp")(1))
.drop("temp")
def parseDocValue(cssSelectorQueries: Array[String]) =
udf((html: String) => {
val j = Jsoup.parse(html)
cssSelectorQueries.map(query => j.select(query).text())})

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and trying to build on some samples I found. Essentially I am trying to call a function from within a data frame to get State from zip code using Google API..
I have the code working separately but not together ;(
Here is the piece of code not working...
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf {(zipcode:String) => {
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val shortnames = for {
JObject(address_components) <- address
JField("short_name", short_name) <- address_components
} yield short_name
val state = shortnames(3)
//return state.toString()
val stater = state.toString()
}
}

Thanks for the responses.. I think I figured it out. Here is the code that works. One thing to note is Google API has restriction so some valid zip codes don't have state info.. not an issue for me though.
private def loaduserdata(spark: SparkSession): Unit = {
import spark.implicits._
// Create an RDD of User objects from a text file, convert it to a Dataframe
val userDF = spark.sparkContext
.textFile("examples/src/main/resources/users.csv")
.map(_.split("::"))
.map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
.toDF()
// Register the DataFrame as a temporary view
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc") // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf {(zipcode: String) =>
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val statenm = for {
JObject(statename) <- address
JField("types", JArray(types)) <- statename
JField("short_name", JString(short_name)) <- statename
if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
// if types.head.equals("JString(administrative_area_level_1)")
} yield short_name
val str = if (statenm.isEmpty.toString().equals("true")) "N/A" else statenm.head
}

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong

Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()

For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD

What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
// execute myFunction()
(new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.

Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
(new MyClass(word)).myFunction().collect()
}

You cannot create a RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val input = sc.parallelize(Seq("apple", "ananas", "banana"))
// RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
val result = input.flatMap(modifiedFunction)
}
def modifiedFunction(word: String): Seq[String] = {
word.indices map {
index => word.substring(0, index) + word.substring(index+1)
}
}