pass accumulators to spark udf - scala

This is a simplified version of what I am trying to do. I want to do some counting inside my UDF. One way I can think of is to pass Long accumulators to the UDF and increment them inside the if/else branches of the deserializeProtobuf function, but I can't get the syntax working. Can anyone help me with that? Is there a better way?
def deserializeProtobuf(raw_data: Array[Byte]) = {
  val input_stream = new ByteArrayInputStream(raw_data)
  val parsed_data = CustomClass.parseFrom(input_stream)
  if (condition 1 related to parsed_data) {
    < increment variable1 >
  } else if (condition 2 related to parsed_data) {
    < increment variable2 >
  } else {
    < increment variable3 >
  }
}
val decode = udf(deserializeProtobuf _)
val deserialized_data = ds.withColumn("data", decode(col("protobufData")))

I have done something like this before. If you are doing heavy lifting in your CustomClass, one thing I can suggest is to broadcast it; you can also instantiate the metrics on the broadcast variable.
Now, coming to the counting part: I tried accumulators, but they were quite difficult to manage inside a UDF when trying to get a correct count over a window, so I used spark-metrics instead and sent the counts at a regular interval.
Use this: https://github.com/groupon/spark-metrics
Make sure to initialise the metrics at broadcast-variable creation time; from that point on, the copied variable will report to the same metrics.
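As a rough illustration of the broadcast part only (the parser class and its parse method below are made-up placeholders, and the spark-metrics wiring is omitted), it might look like this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for an expensive-to-build parser (e.g. something wrapping CustomClass).
class HeavyParser extends Serializable {
  def parse(raw: Array[Byte]): String = new String(raw) // placeholder logic
}

val spark = SparkSession.builder().getOrCreate()

// Build the parser once on the driver and broadcast it, so every task reuses the same copy
// instead of constructing it per row.
val parserBc = spark.sparkContext.broadcast(new HeavyParser)

val decode = udf((raw: Array[Byte]) => parserBc.value.parse(raw))

// `ds` is the Dataset from the question with a binary "protobufData" column.
val deserialized = ds.withColumn("data", decode(col("protobufData")))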

You shouldn't have to pass the accumulator to the UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.util.LongAccumulator

var acc1: LongAccumulator = null

def my_udf = udf { (arg1: String) =>
  // ...
  acc1.add(1)
}

val spark = SparkSession...
acc1 = spark.sparkContext.longAccumulator("acc1")

... withColumn("col_name", my_udf(col("...")))
// some action here to cause the withColumn to execute

System.err.println(s"${acc1.value}")
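For completeness, a minimal end-to-end sketch of the same pattern (the column, accumulator, and condition names are made up, and `ds` is assumed to be the question's Dataset): the accumulators are created from the SparkContext before the UDF captures them, and an action has to run before their values are populated. Also bear in mind that accumulator updates performed inside transformations are not guaranteed to be exactly-once if tasks are retried, so treat them as debugging counters rather than exact counts.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().getOrCreate()

// Register the accumulators on the driver before the UDF closure captures them.
val matched   = spark.sparkContext.longAccumulator("matched")
val unmatched = spark.sparkContext.longAccumulator("unmatched")

// Hypothetical stand-in for deserializeProtobuf: count inside the UDF body.
val decode = udf { (raw: Array[Byte]) =>
  if (raw != null && raw.nonEmpty) matched.add(1) else unmatched.add(1)
  raw // return something so the new column has a value
}

val out = ds.withColumn("data", decode(col("protobufData")))
out.count() // an action is needed before the accumulator values are populated
println(s"matched=${matched.value}, unmatched=${unmatched.value}")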

Related

Should I use an object class or broadcast variable

I have a coordinates RDD[(Int,Int)] and I want to create a new RDD[(Int,(Int,Int))]. What is the best practice?
object GlobalVariables {
  private var pointId: Int = 0
  def newPointId(): Int = {
    pointId += 1
    pointId
  }
}
val points = coordinates.map(x => (GlobalVariables.newPointId(), (x._1, x._2)))
Is this code executed on the workers, or should I use a combination of broadcast variables and accumulators?
If the code is executed on the workers, how can I be sure that I will not have any concurrency errors?
You can try another solution that does not need a mutable counter. The zipWithIndex transformation provides a stable index, numbering each element in its original order.
For example:
val myRdd = sc.parallelize(Seq(1, 2, 3))
val zippedWithIndex = myRdd.zipWithIndex // ((1,0),(2,1),(3,2))
After this first transformation you can flip the index and the value:
val result = zippedWithIndex.map { case (value, index) => (index, value) } // ((0,1),(1,2),(2,3))
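Applied to the coordinates RDD from the question, that approach might look like the sketch below. Note that zipWithIndex produces Long indices, so the keys come out as Long rather than Int; the variable names are just placeholders.
// A sketch only; `sc` is assumed to be the SparkContext.
val coordinates: org.apache.spark.rdd.RDD[(Int, Int)] = sc.parallelize(Seq((1, 2), (3, 4), (5, 6)))

// zipWithIndex assigns each element a stable Long index in its original order.
val points: org.apache.spark.rdd.RDD[(Long, (Int, Int))] =
  coordinates.zipWithIndex.map { case (coords, index) => (index, coords) }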

Spark Scala Dataset Validations using UDF and its Performance

I'm new to Spark Scala. I have implemented a solution for Dataset validation of multiple columns using a UDF rather than going through the individual columns in a for loop. But I don't know why it runs faster, and I have to explain why it is the better solution.
The columns for data validation are received at run time, so the column names cannot be hard-coded. Also, the comments column needs to be updated with the column name whenever a column value fails validation.
Old Code,
def doValidate(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
  var ValidDF: Dataset[Row] = data
  var i: Int = 0
  for (s <- columnArray) {
    var list = validValueArrays(i).split(",")
    ValidDF = ValidDF.withColumn("comments",
      when(ValidDF.col(s).isin(list: _*),
        concat(col("comments"), lit(" Error: Invalid Records in: "), lit(s)))
        .otherwise(col("comments")))
    i = i + 1
  }
  return ValidDF
}
New Code,
def validateColumnValues(data: Dataset[Row], columnArray: Array[String], validValueArrays: Array[String]): Dataset[Row] = {
  var ValidDF: Dataset[Row] = data
  val checkValues = udf((row: Row, comment: String) => {
    var newComment = comment
    for (s: Int <- 0 to row.length - 1) {
      val value = row.get(s)
      val list = validValueArrays(s).split(",")
      if (!list.contains(value)) {
        newComment = newComment + " Error: Invalid Records in: " + columnArray(s) + ";"
      }
    }
    newComment
  })
  ValidDF = ValidDF.withColumn("comments", checkValues(struct(columnArray.head, columnArray.tail: _*), col("comments")))
  return ValidDF
}
columnArray --> the list of columns.
validValueArrays --> the valid values corresponding to each position in the column array; multiple valid values are comma-separated.
I want to know which one is better, or whether there is another, better approach. When I tested, the new code looks better. Also, what is the difference between these two approaches, given that I have read a UDF is a black box for Spark? Will the UDF affect performance in this case?
I needed to correct some closing brackets before running it (one '}' had to be removed where you return the ValidDF), and I still get a runtime analysis error.
It is better to avoid UDFs, because a UDF implies deserializing the data to process it in plain Scala and then reserializing it. However, if your requirement cannot be achieved using built-in SQL functions, then you have to go for a UDF, but make sure you review the Spark UI for performance and the execution plan.
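For illustration, here is a rough, untested sketch of how the same check might be expressed with only built-in column functions, following the new code's logic of flagging values that are not in the valid list (the function name is made up). Because everything stays as Column expressions, Catalyst can see and optimise the whole plan, which it cannot do inside a UDF.
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, concat, lit, when}

// A sketch using only built-in functions, assuming the same inputs as the question.
def validateWithBuiltins(data: Dataset[Row],
                         columnArray: Array[String],
                         validValueArrays: Array[String]): Dataset[Row] =
  columnArray.zip(validValueArrays).foldLeft(data) { case (df, (name, valid)) =>
    val validList = valid.split(",").toSeq
    // Append an error marker to "comments" when the value is not in the valid list.
    df.withColumn("comments",
      when(col(name).isin(validList: _*), col("comments"))
        .otherwise(concat(col("comments"), lit(s" Error: Invalid Records in: $name;"))))
  }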

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the code below step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called there is other code involved in reading tables from Hive, mapping, grouping, filtering, etc.):
val compareCols = Set("year", "nominal", "adjusted_for_inflation", "average_private_nonsupervisory_wage")
val key = "year"
def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     //fieldKey
        colName,                      //fieldName
        row1.getAs(colName).toString, //table1Value
        row2.getAs(colName).toString, //table2Value
        row2.getAs(colName).toString) //expectedValue
    }
    else null
  }

  def comparison(): Unit = {
    for (row <- table) {
      var elem1 = row._2.head      //gets the first element in the iterable
      var elem2 = row._2.tail.head //gets the second element in the iterable
      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. Beyond the "what is the code doing" of classic single-node computing, with Spark we also need to consider "where is the code running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list = mutable.ListBuffer[String]()

for { record <- records
      entry  <- record }
{ list += entry }
The scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to executors, where the inner operation will be executed locally. We can rewrite the above like this:
records.foreach { record =>  // RDD.foreach => serializes the closure and executes it remotely
  record.foreach { entry =>  // record.foreach => local operation on the record collection
    list += entry            // this mutable list is updated in each executor but never sent back to the driver; all updates are lost
  }
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it: what's the correct result? Or that each executor ends up with a different value: which one is right?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, //fieldKey
      colName,                  //fieldName
      v1,                       //table1Value
      v2,                       //table2Value
      v2))                      //expectedValue
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (key, rows) =>
  val r1 = rows.head
  val r2 = rows.tail.head
  compareCols.flatMap(col => compareFields(col, r1, r2))
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (key, rows) <- table
  col         <- compareCols
  dis         <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but not valid Spark code. If you replace the RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)
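As a rough sketch only, reusing the question's own compareFields helper and assuming exactly two rows per key, those placeholders could be filled in along these lines:
val finalRDD: RDD[DiscrepancyData] = table
  .flatMap { case (_, rows) =>
    val r1 = rows.head      // first row for this key
    val r2 = rows.tail.head // second row for this key
    compareCols.map(colName =>
      compareFields(r1.getAs[Any](colName).toString, r2.getAs[Any](colName).toString, colName, r1, r2))
  }
  .filter(_ != null) // keep only the actual discrepancies instead of mutating a buffer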

Spark: UDF not reading already defined value

I have a function written that I am trying to apply to a dataframe via a UDF. It applies a category based on the value in a particular column. The function makes use of a value defined earlier in my code. The code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, udf}

object myFuncs extends App {
  val sc = new SparkContext()
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  val categories = List("10", "20")

  def makeCategory(value: Double): String = {
    if (value < categories(0).toDouble) "< 10"
    else if (value >= categories(0).toDouble && value < categories(1).toDouble) "10 to 20"
    else ">= 20"
  }

  val myFunc = udf(makeCategory _)
  val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
    .withColumn("category", myFunc(col("myColumn")))
}
This produces a NullPointerException when it tries to read the categories variable inside the function. This works fine if I explicitly define the categories variable inside the function. Ultimately, I want to pass that in as an arg so I can't define it inside the function.
Can anyone explain why it won't read values defined outside the function in the UDF? Any suggestions on how to make this work without explicitly defining the values inside the function? I tried using the lit function and passing the list as an argument, but it didn't like having a list as a lit.
The simple solution is to pass the categories in the query; then it will work fine. You have to change your function as follows:
def makeCategory(value: Double, categoriesString: String): String = {
  val categories = categoriesString.split(",")
  if (value < categories(0).toDouble) "< 10"
  else if (value >= categories(0).toDouble && value < categories(1).toDouble) "10 to 20"
  else ">= 20"
}
Now you can register this function as a UDF, but you have to use it like the following:
val myFunc = udf(makeCategory _)
val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
  .withColumn("category", myFunc(col("myColumn"), lit("10,20")))
Hopefully it will help in your case.
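Another option, sketched below with assumed names, is to build the UDF from a small factory function that closes over the thresholds supplied at runtime, so nothing is hard-coded inside the function. (A likely contributor to the original NullPointerException is that extends App defers field initialisation, so categories may not be set yet when the closure runs on the executors; constructing the UDF from an explicit parameter sidesteps that.)
import org.apache.spark.sql.functions.{col, udf}

// A sketch: the thresholds are passed in at runtime (e.g. parsed from program arguments).
def makeCategoryUdf(categories: Seq[Double]) = udf { (value: Double) =>
  if (value < categories(0)) "< 10"
  else if (value < categories(1)) "10 to 20"
  else ">= 20"
}

val myFunc = makeCategoryUdf(Seq(10.0, 20.0))
val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
  .withColumn("category", myFunc(col("myColumn")))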

Create new column with function in Spark Dataframe

I'm trying to figure out the new dataframe API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a dataframe with 2 columns, "ID" and "Amount". As a generic example, say I want to return a new column called "code" that returns a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then I start getting compile errors in the function because it wants a Boolean in the if statement.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.
Let's say you have "Amt" column in your Schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column
We should avoid defining udf functions as much as possible, due to the overhead of serializing and deserializing the column data.
You can achieve the same with the simple built-in when function, as below:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))
Another way of doing this:
You can create any function, but given the error above, you should define the function as a variable holding a udf:
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))