What is Spark execution order with function calls in Scala?

I have a spark program as follows:
object A {
  var id_set: Set[String] = _

  def init(argv: Array[String]) = {
    val args = new AArgs(argv)
    id_set = args.ids.split(",").toSet
  }

  def main(argv: Array[String]) {
    init(argv)
    val conf = new SparkConf().setAppName("some.name")
    val rdd1 = getRDD(paras)
    val rdd2 = getRDD(paras)
    //......
  }

  def getRDD(paras) = {
    //function details
    getRDDDtails(paras)
  }

  def getRDDDtails(paras) = {
    //val id_given = id_set
    id_set.foreach(println) // on the driver this works: the set is not empty
    someRDD.filter { x =>
      val someSet = x.getOrElse(...)
      //id_set.foreach(println) // on an executor this is wrong: id_set is just an empty set
      (someSet & id_set).size > 0
    }
  }

  class AArgs(args: Array[String]) extends Serializable {
    //parse args
  }
}
I have a global variable id_set. At first, it is just an empty set. In the main function, I call init, which sets id_set to a non-empty set parsed from the args. After that, I call the getRDD function, which calls getRDDDtails. In getRDDDtails, I filter an RDD based on the contents of id_set. However, the result seems to be empty. I tried to print id_set on an executor, and it is just an empty line, so the problem seems to be that id_set is not properly initialized (in the init function). However, when I print id_set on the driver (in the first lines of getRDDDtails), it is not empty.
So I have tried adding val id_given = id_set in getRDDDtails and using id_given later, and this seems to fix the problem. But I'm totally confused about why this happens. What is the execution order of Spark programs? Why does my workaround work?
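For illustration, here is a minimal, self-contained sketch of the workaround described above: copy the object field into a local val before building the closure, so the closure captures the current value instead of the field of the enclosing object. The names (ClosureCopyExample, records, localIds) are illustrative, not from the original program.
import org.apache.spark.{SparkConf, SparkContext}

object ClosureCopyExample {
  // same shape as in the question: a mutable field on a top-level object
  var id_set: Set[String] = _

  def main(argv: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-copy").setMaster("local[*]"))
    id_set = Set("a", "b") // what init() does in the question

    val records = sc.parallelize(Seq(Set("a", "x"), Set("y", "z")))

    // The workaround: copy the field into a local val; the filter closure below
    // captures this value directly instead of reading the field of the object.
    val localIds = id_set
    val kept = records.filter(r => (r & localIds).nonEmpty)

    kept.collect().foreach(println) // Set(a, x)
    sc.stop()
  }
}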

Related

How to invoke Spark functions (with arguments) from application.properties (config file)?

So, I have a Typesafe Config file named application.properties which contains certain values like:
dev.execution.mode = local
dev.input.base.dir = /Users/debaprc/Documents/QualityCheck/Data
dev.schema.lis = asin StringType,subs_activity_date DateType,marketplace_id DecimalType
I have used these values as Strings in my Spark code like:
import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val props = ConfigFactory.load()
  val envProps = props.getConfig("dev")
  val spark = SparkSession.builder.appName("DataQualityCheckSession")
    .config("spark.master", envProps.getString("execution.mode"))
    .getOrCreate()
Now I have certain functions defined in my spark code (func1, func2, etc...). I want to specify which functions are to be called, along with the respective arguments, in my application.properties file. Something like this:
dev.functions.lis = func1,func2,func2,func3
dev.func1.arg1.lis = arg1,arg2
dev.func2.arg1.lis = arg3,arg4,arg5
dev.func2.arg2.lis = arg6,arg7,arg8
dev.func3.arg1.lis = arg9,arg10,arg11,arg12
Now, once I specify these, what do I do in Spark, to call the functions with the provided arguments? Or do I need to specify the functions and arguments in a different way?
I agree with @cchantep that the approach seems wrong. But if you still want to do something like that, I would decouple the function names in the properties file from the actual functions/methods in your code.
I have tried this and worked fine:
def function1(args: String): Unit = {
  println(s"func1 args: $args")
}

def function2(args: String): Unit = {
  println(s"func2 args: $args")
}

val functionMapper: Map[String, String => Unit] = Map(
  "func1" -> function1,
  "func2" -> function2
)

val args = "arg1,arg2"
functionMapper("func1")(args)
functionMapper("func2")(args)
Output:
func1 args: arg1,arg2
func2 args: arg1,arg2
Edited: Simpler approach with output example.
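To tie this back to the application.properties keys from the question, here is a hedged sketch that reads the function list from the config and dispatches through functionMapper. It assumes Typesafe Config is on the classpath and, for simplicity, only reads each function's arg1.lis key; the names envProps, functionNames and argList are illustrative.
import com.typesafe.config.ConfigFactory

val envProps = ConfigFactory.load().getConfig("dev")

// e.g. dev.functions.lis = func1,func2,func2,func3
val functionNames = envProps.getString("functions.lis").split(",").map(_.trim)

functionNames.foreach { name =>
  // e.g. dev.func1.arg1.lis = arg1,arg2
  val argList = envProps.getString(s"$name.arg1.lis")
  functionMapper.get(name) match {
    case Some(f) => f(argList) // dispatch through the map
    case None    => println(s"No function registered for '$name'")
  }
}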

Problems with spark datasets

When I execute a function inside mapPartitions on a Dataset (executeStrategy()), it returns results that I can verify in the debugger, but when I call dataset.show() it shows me an empty table, and I do not know why this happens.
This is for a data mining job at my school. I'm using Windows 10, Scala 2.11.12 and Spark 2.2.0, which otherwise work without problems.
case class MyState(code: util.ArrayList[Object], evaluation: util.ArrayList[java.lang.Double])

private def executeStrategy(iter: Iterator[Row]): Iterator[(String, MyState)] = {
  val listBest = new util.ArrayList[State]
  Predicate.fuzzyValues = iter.toList
  for (i <- 0 until conf.runNumber) {
    Strategy.executeStrategy(conf.iterByRun, 1, conf.algorithm("algorithm").asInstanceOf[GeneratorType])
    listBest.addAll(Strategy.getStrategy.listBest)
  }
  val result = postMining(listBest)
  result.map(x => (x.getCode.toString, MyState(x.getCode, x.getEvaluation))).iterator
}

def run(sparkSession: SparkSession, n: Int): Unit = {
  import sparkSession.implicits._
  var data0 = conf.dataBase.repartition(n).persist(StorageLevel.MEMORY_AND_DISK_SER)
  var listBest = new util.ArrayList[State]
  implicit def enc1 = Encoders.bean(classOf[(String, MyState)])
  val data1 = data0.mapPartitions(executeStrategy)
  data1.show(3)
}
I expect the dataset to contain the results of processing each partition, which I can see when I debug, but I get an empty dataset.
I have tried an RDD with the same executeStrategy() function, and that returns an RDD with the results. What is the problem with the dataset?
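For reference, a minimal sketch of the RDD variant described above, assuming data0 is the same persisted Dataset[Row] as in run():
// Hedged sketch: the same partition-level logic run through the RDD API,
// which is reported to work; data0 is assumed to be the persisted Dataset[Row] from run().
val resultRdd = data0.rdd.mapPartitions(executeStrategy) // RDD[(String, MyState)]
resultRdd.take(3).foreach(println)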

Can spark-submit be used with named arguments?

I know I can pass arguments to the main function with
spark-submit com.xxx.test 1 2
and read them with:
def main(args: Array[String]): Unit = {
  // read the arguments
  var city = args(0)
  var num = args(1)
but I want to know whether there is a way to pass named arguments, like:
spark-submit com.xxx.test --citys=1 --num=2
and how to read such named arguments in main.scala?
You can write your own custom class which parses the input arguments based on the key, something like below:
object CommandLineUtil {

  // the original snippet assumes a logger named "log"; an slf4j logger is used here
  private val log = org.slf4j.LoggerFactory.getLogger(getClass)

  def getOpts(args: Array[String], usage: String): collection.mutable.Map[String, String] = {
    if (args.length == 0) {
      log.warn(usage)
      System.exit(1)
    }
    // separate "--key=value" options from positional values
    val (opts, vals) = args.partition(_.startsWith("-"))
    val optsMap = collection.mutable.Map[String, String]()
    opts.foreach { x =>
      val pair = x.split("=")
      if (pair.length == 2) {
        // strip the leading "-" or "--" from the key
        optsMap += (pair(0).split("-{1,2}")(1) -> pair(1))
      } else {
        log.warn(usage)
        System.exit(1)
      }
    }
    optsMap
  }
}
Then you can use this method within your Spark application:
val usage = "Usage: [--citys] [--num]"
val optsMap = CommandLineUtil.getOpts(args, usage)
val citysValue = optsMap("citys")
val numValue = optsMap("num")
You can adapt CommandLineUtil as per your requirements.
No.
As you can read in the documentation, you just pass the arguments of your application and then handle them yourself.
So, if you want "named arguments", you need to implement that parsing in your own code (i.e. it will be custom).
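For example, here is a minimal sketch of such custom handling, collecting "--key=value" tokens from args into a Map (the key names citys and num follow the question; malformed tokens are simply skipped):
// Hedged sketch: parse "--key=value" application arguments into an immutable Map.
val named: Map[String, String] = args
  .filter(_.startsWith("--"))
  .map(_.stripPrefix("--").split("=", 2))
  .collect { case Array(key, value) => key -> value }
  .toMap

val city = named.getOrElse("citys", "")
val num  = named.getOrElse("num", "")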

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
  // execute myFunction()
  (new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray) {
  newRDD = sc.union(newRDD, (new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
  (new MyClass(word)).myFunction().collect()
}
You cannot create an RDD from within another RDD.
However, you can rewrite your function myFunction: String => RDD[String], which generates all words from the input with one letter removed, as a function modifiedFunction: String => Seq[String] that can be used from within an RDD transformation. That way it is also executed in parallel on your cluster. With modifiedFunction you can then obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  val input = sc.parallelize(Seq("apple", "ananas", "banana"))
  // RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
  val result = input.flatMap(modifiedFunction)
}

def modifiedFunction(word: String): Seq[String] = {
  word.indices map { index =>
    word.substring(0, index) + word.substring(index + 1)
  }
}

Scala: Adding elements to Set inside 'foreach' doesn't persist

I create a mutable set and iterate over a list using a 'foreach' to populate the set. When I print the set inside the foreach, it prints the contents of the set correctly. However, the set is empty after the end of 'foreach'. I am not able to figure out what I am missing.
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

object SparkTest {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Test")
    val sc = new SparkContext(conf)
    val graph = GraphLoader.edgeListFile(sc, "followers.txt")
    val edgeList = graph.edges
    var mapperResults = iterateMapper(edgeList)
    sc.stop()
  }

  def iterateMapper(edges: EdgeRDD[Int, Int]): scala.collection.mutable.Set[(VertexId, VertexId)] = {
    var mapperResults = scala.collection.mutable.Set[(VertexId, VertexId)]()
    val mappedValues = edges.mapValues(edge => (edge.srcId, edge.dstId)) ++ edges.mapValues(edge => (edge.dstId, edge.srcId))
    mappedValues.foreach { edge =>
      var src = edge.attr._1
      var dst = edge.attr._2
      mapperResults += ((src, dst))
    }
    println(mapperResults)
    return mapperResults
  }
}
This is the code I'm working with. It is a modified example from Spark.
The println(mapperResults) prints out an empty set.
Actually it works, but on the workers.
foreach is a function that exists for its side effects, but it runs on the workers, so you won't see the updated Set on the driver.
Another issue is that Spark is designed around immutability, so do not use a mutable collection there. There is also no need for it; the following code should do what you meant to do:
val mapperResults = mappedValues.map(_.attr).distinct().collect()
It is shorter, cleaner, and does the map work on the workers.
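Put together, a hedged sketch of iterateMapper rewritten along those lines, keeping the EdgeRDD[Int, Int] signature from the question and returning an immutable Set collected on the driver:
// Hedged sketch: build the (src, dst) pairs on the executors and collect only
// the distinct results back to the driver, instead of mutating a driver-side Set.
def iterateMapper(edges: EdgeRDD[Int, Int]): Set[(VertexId, VertexId)] = {
  val mappedValues = edges.mapValues(edge => (edge.srcId, edge.dstId)) ++
    edges.mapValues(edge => (edge.dstId, edge.srcId))
  mappedValues.map(_.attr).distinct().collect().toSet
}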