How to run two SparkSql queries in parallel in Apache Spark - scala

First, let me show the part of the code I want to execute in a .scala file on Spark.
This is my source file. It has structured data with four fields
val inputFile = sc.textFile("hdfs://Hadoop1:9000/user/hduser/test.csv")
I have declared a case class to store the data from the file into a table with four columns:
case class Table1(srcIp: String, destIp: String, srcPrt: Int, destPrt: Int)
val inputValue = inputFile.map(_.split(",")).map(p => Table1(p(0),p(1),p(2).trim.toInt,p(3).trim.toInt)).toDF()
inputValue.registerTempTable("inputValue")
Now, let's say I want to run the following two queries. How can I run them in parallel, given that they are mutually independent? I feel that running them in parallel could reduce the execution time. Right now, they are executed serially.
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp",$"srcIp").show()

Maybe you can look in the direction of Futures/Promises. There is a submitJob method on SparkContext which returns a future with results, so you could fire two jobs and then collect the results from the futures.
I have not tried this method yet. Just an assumption.
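A rough, untested sketch of that idea, using plain Scala Futures around actions rather than submitJob itself (the timeout and the global ExecutionContext are arbitrary choices, and the inputValue temp table is the one from the question):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue").cache()
val primarySrcValues  = sqlContext.sql("SELECT distinct srcIp FROM inputValue").cache()

// count() is an action, so each Future submits its own Spark job;
// SparkContext is thread-safe for concurrent job submission.
val jobs = Seq(Future(primaryDestValues.count()), Future(primarySrcValues.count()))
Await.result(Future.sequence(jobs), 10.minutes)

// Both DataFrames are now materialized in the cache and can be joined as before.
primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp", $"srcIp").show()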

No idea why you want to use sqlContext in the first place instead of keeping things simple.
val inputValue = inputFile.map(_.split(",")).map(p => (p(0),p(1),p(2).trim.toInt,p(3).trim.toInt))
Assuming p(0) = destIp, p(1)=srcIp
val joinedValue = inputValue.map { case (destIp, srcIp, x, y) => (destIp, (x, y)) }
  .join(inputValue.map { case (destIp, srcIp, x, y) => (srcIp, (x, y)) })
  .map { case (ip, ((destX, destY), (srcX, srcY))) => (ip, destX, destY, srcX, srcY) }
Now it will be parallelized, and you can even control the number of partitions using coalesce.
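For instance (an illustrative one-liner; the partition count 4 is arbitrary):
// reduce the number of partitions of the joined RDD without a full shuffle
val joinedCoalesced = joinedValue.coalesce(4)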

You can skip the two DISTINCT and do one at the end:
inputValue.select($"srcIp").join(
inputValue.select($"destIp"),
$"srcIp" === $"destIp"
).distinct().show

That's a nice question. This can be executed in parallel using par on an array; you just have to customize your code accordingly.
Declare an array with two items in it (you can name them as you wish) and write the code you need to execute in parallel inside each case statement.
Array("destIp","srcIp").par.foreach { i =>
{
i match {
case "destIp" => {
val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
}
case "srcIp" => {
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")
}}}
}
Once both case statements have finished executing, the code below runs. Note that the vals declared inside the match are local to it, so retrieve the registered temp tables from the sqlContext before joining:
sqlContext.table("primaryDestValues")
  .join(sqlContext.table("primarySrcValues"), $"destIp" === $"srcIp")
  .select($"destIp", $"srcIp")
  .show()
Note: if you remove par from the code, it will run sequentially.
The other option is to create another SparkSession inside the code and execute the SQL using that session variable. But this is a little risky and has to be used very carefully.
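A hedged sketch of that separate-session idea on Spark 2.x (the question's code is Spark 1.x style; spark below is an assumed SparkSession, and note that plain temp views are not shared between sessions, only global temp views are):
// newSession() shares the SparkContext but keeps its own temp-view catalog,
// so the input must be registered as a *global* temp view to be visible there.
val session2 = spark.newSession()
inputValue.createGlobalTempView("inputValueGlobal")
val destDf = session2.sql("SELECT DISTINCT destIp FROM global_temp.inputValueGlobal")
val srcDf  = spark.sql("SELECT DISTINCT srcIp FROM global_temp.inputValueGlobal")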

Related

Serialising temp collections created in Spark executors during task execution

I'm trying to find an effective way of writing collections created inside tasks to the output files of the job. For example, if we iterate over an RDD using foreach, we can create data structures that are local to the executor, e.g. the ListBuffer arr in the following code snippet. My problem is: how do I serialise arr and write it to a file?
(1) Should I use FileWriter api or Spark saveAsTextFile will work?
(2) What will be the advantages of using one over the other
(3) Is there a better way of achieving the same.
PS: The reason I am using foreach instead of map is that I might not be able to transform all my RDD rows, and I want to avoid getting null values in the output.
val dataSorted: RDD[(Int, Int)] = <Some Operation>
val arr: ListBuffer[(String, String)] = ListBuffer[(String, String)]()
dataSorted.foreach {
case (e, r) => {
if(e > 1000) {
arr += (("a", "b"))
}
}
}
Thanks,
Devj
You should not use driver-side variables, but Accumulators - there are articles about them with code examples here and here; also this question may be helpful - it has a simplified code example of a custom AccumulatorParam.
Write your own accumulator that is able to add (String, String), or use the built-in CollectionAccumulator. This is an implementation of AccumulatorV2, the new accumulator API from Spark 2.
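A minimal sketch of the CollectionAccumulator route (Spark 2.x), assuming sc is the SparkContext and dataSorted is the RDD from the question:
import org.apache.spark.util.CollectionAccumulator

val acc: CollectionAccumulator[(String, String)] =
  sc.collectionAccumulator[(String, String)]("collectedPairs")

dataSorted.foreach { case (e, r) =>
  if (e > 1000) acc.add(("a", "b"))  // executor-side updates are merged back on the driver
}

val pairs: java.util.List[(String, String)] = acc.value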
Another way is to use Spark's built-in filter and map functions - thanks @ImDarrenG for suggesting flatMap, but I think filter and map will be easier:
val result : Array[(String, String)] = someRDD
.filter(x => x._1 > 1000) // filter only good rows
.map (x => ("a", "b"))
.collect() // convert to array
The Spark API saves you some file-handling code but essentially achieves the same thing.
The exception is if you are not using, say, HDFS and do not want your output file to be partitioned (spread across the executors' file systems). In that case you will need to collect the data to the driver and use FileWriter to write to a single file (or files); how you achieve that depends on how much data you have. If you have more data than the driver has memory, you will need to handle it differently.
As mentioned in another answer, you're creating an array in your driver, while adding items from your executors, which will not work in a cluster environment. Something like this might be a better way to map your data and handle nulls:
val outputRDD = dataSorted.flatMap {
  case (e, r) =>
    if (e > 1000) {
      Some(("a", "b"))
    } else {
      None
    }
}
// save outputRDD to file/s here using the appropriate method...
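For example, a sketch of the two saving options discussed above (the output paths are illustrative):
// Option 1: let Spark write partitioned output (one part-file per partition).
outputRDD.map { case (a, b) => s"$a,$b" }.saveAsTextFile("hdfs:///user/hduser/output")

// Option 2: collect to the driver and write a single local file
// (only viable when the result fits in driver memory).
import java.io.{BufferedWriter, FileWriter}
val writer = new BufferedWriter(new FileWriter("/tmp/output.csv"))
try outputRDD.collect().foreach { case (a, b) => writer.write(s"$a,$b\n") }
finally writer.close()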

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called, other code is involved in reading tables from Hive, mapping, grouping, filtering, etc.):
val compareCols = Set("year", "nominal", "adjusted_for_inflation", "average_private_nonsupervisory_wage")
val key = "year"
def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     //fieldKey
        colName,                      //fieldName
        row1.getAs(colName).toString, //table1Value
        row2.getAs(colName).toString, //table2Value
        row2.getAs(colName).toString) //expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      //gets the first element in the iterable
      var elem2 = row._2.tail.head //gets the second element in the iterable

      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. In addition to "what the code is doing", as in classic single-node computing, with Spark we also need to consider "where the code is running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list: mutable.ListBuffer[String] = mutable.ListBuffer()
for { record <- records
      entry <- record }
{ list += entry }
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to executors, where the inner operation will be executed locally. We can rewrite the above like this:
records.foreach { record =>   // RDD.foreach => serializes closure and executes remotely
  record.foreach { entry =>   // record.foreach => local operation on the record collection
    list += entry             // this mutable list object is updated in each executor but never sent back to the driver. All updates are lost
  }
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it, what's the correct result? Or that each executor comes to a different value, which is the right one?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row ops into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, //fieldKey
      colName,                  //fieldName
      v1,                       //table1Value
      v2,                       //table2Value
      v2)                       //expectedValue
    )
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (key, rows) =>
  val row1 = rows.head
  val row2 = rows.tail.head
  compareCols.flatMap(col => compareFields(col, row1, row2))
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (key, rows) <- table
  col <- compareCols
  dis <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but it is not valid Spark code. If you replace the RDD with an Array it will work as expected.
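For example, collecting the grouped data to the driver first would make the original mutable-buffer code behave as expected (a sketch, only sensible when the grouped data fits in driver memory):
val localTable: Array[(String, Iterable[Row])] = table.collect()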
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)

SPARK - Use RDD.foreach to Create a Dataframe and execute actions on the Dataframe

I am new to SPARK and figuring out a better way to achieve the following scenario.
There is a database table containing 3 fields - Category, Amount, Quantity.
First I try to pull all the distinct Categories from the database.
val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
Now for each category I want to execute the Pipeline which essentially creates dataframes from each category and apply some Machine Learning.
categories.foreach(executePipeline)
def executePipeline(category: String): Unit = {
val dfCategory = sqlCtxt.read.jdbc(JDBC_URL,"SELECT * FROM TABLE_NAME WHERE CATEGORY="+category)
dfCategory.show()
}
Is it possible to do something like this ? Or is there any better alternative ?
// You could get all your data with a single query and convert it to an rdd
val data = sqlCtxt.read.jdbc(JDBC_URL,"SELECT * FROM TABLE_NAME).rdd
// then group the data by category
val groupedData = data.groupBy(row => row.getAs[String]("category"))
// then you get an RDD[(String, Iterable[org.apache.spark.sql.Row])]
// and you can iterate over it and execute your pipeline
groupedData.map { case (categoryName, items) =>
//executePipeline(categoryName, items)
}
Your code would fail with a Task not serializable exception, since you're trying to use the SQLContext (which isn't serializable) inside the executePipeline method; that method would have to be serialized and sent to the workers to be executed on each record in the categories RDD.
Assuming you know the number of categories is limited, which means the list of categories isn't too large to fit in your driver memory, you should collect the categories to driver, and iterate over that local collection using foreach:
val categoriesRdd: RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
val categories: Seq[String] = categoriesRdd.collect()
categories.foreach(executePipeline)
Another improvement would be reusing the dataframe that you loaded instead of performing another query, using a filter for each category:
def executePipeline(singleCategoryDf: DataFrame) { /* ... */ }
categories.foreach(cat => {
val filtered = df.filter(col(CATEGORY) === cat)
executePipeline(filtered)
})
NOTE: to make sure the re-use of df doesn't reload it for every execution, make sure you cache() it before collecting the categories.
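For example, a minimal sketch of that caching step (assuming df is the DataFrame loaded from the JDBC source):
df.cache() // keep the loaded table in memory so each per-category filter reuses it
val categories = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString).collect()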

scala + spark: serialize a task for recommendations

I've been working with Scala + Spark and the Movie Recommendation with MLlib tutorial.
After obtaining my predictions I need the top 3 items per user.
val predictions =
model.predict(usersProducts).map { case Rating(user, product, rate) =>
(user, product, rate)
}
I've tried this:
def myPrint(x:(Int, Int, Double)) = println(x._1 + ":" + x._2 + " - " +x._3)
predictions.collect().sortBy(- _._3).groupBy(_._1).foreach( t2 => t2._2.take(3).foreach(myPrint) )
(_._1 is the user, _._2 is the item, _._3 is the rate)
I had to add the "collect()" method to make it work, but I can't serialize this task.
By the way, I added the myPrint method because I don't know how to obtain a collection or map from the last line.
Any idea to make it serializable?
Any idea to get a collection/map from last line?
If I can't do it better, in myPrint I will write to a database and commit after every 1000 inserts.
Thanks.
You could make sure that all the computations are done in RDDs by slightly modifying your approach:
predictions.sortBy(- _.rating).groupBy(_.user)
.flatMap(_._2.take(3)).foreach(println)
A task that calls a method has to serialize the object containing the method. Try using a function value instead:
val myPrint: ((Int, Int, Double)) => Unit = x => ...
You don't want the collect() at the start, that defeats the whole point of using Spark.
I don't understand what you're saying about "get a collection/map". .take(3) already returns a collection.
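For instance, the myPrint method from the question could be rewritten as a function value like this:
val myPrint: ((Int, Int, Double)) => Unit =
  x => println(x._1 + ":" + x._2 + " - " + x._3)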
After reading lmm's answer and doing some research, I resolved my problem this way:
First, I began to work with a Rating object instead of tuples:
val predictions = model.predict(usersProducts)
Then I defined the function value as follows; now I do the take here:
def myPrint: ((Int, Iterable[Rating])) => Unit = x => x._2.take(3).foreach(println)
So, now I mix everything this way:
predictions.sortBy(- _.rating).groupBy(_.user).foreach(myPrint)

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"]:
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
Now I want to iterate over each of those entries to do something with each filename and content.
I can't seem to find any documentation on how to do this, however.
So what I want is this:
foreach occurrence-in-the-rdd{
//do stuff with the array found on loccation n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)

// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))

// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)

// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))

// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)

// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id,text) => text.split("\\s+") }
// give each word a count of 1
val wordsT = words map (x => (x, 1))
//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
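A hedged sketch of that pattern, reusing txtRDD from above; the ExpensiveAnnotator class is hypothetical and stands in for something like a coreNLP pipeline:
val annotated = txtRDD.mapPartitions { iter =>
  val annotator = new ExpensiveAnnotator() // hypothetical: built once per partition, not once per record
  iter.map { case (id, text) => (id, annotator.annotate(text)) }
}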
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge of Scala, so everything I have to offer is Java code. However, it should not be very difficult to translate it into Scala.
JavaRDD<String[]> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
    @Override
    public Iterable<String[]> call(Iterator<String> t) throws Exception {
        ArrayList<String[]> tmpRes = new ArrayList<>();
        while (t.hasNext()) {
            t.next(); // consume the iterator so the loop terminates
            String[] fillData = new String[2];
            fillData[0] = "filename";
            fillData[1] = "content";
            tmpRes.add(fillData);
        }
        return tmpRes;
    }
}).cache();
What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
Or you can convert the pair RDD to an RDD of just the values first:
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files.
for (element <- YourRDD) {
  // do what you want with element in each iteration; if you want the index of the element, use a counter variable in this loop starting from 0
  println(element._1) // this will print all filenames
}