efficiently using union in spark

I am new to Scala and Spark. I have two RDDs, say A is [(1,2),(2,3)] and B is [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10GB. I use sc.union(A,B) but it is slow. In the Spark UI I saw 28308 tasks in this stage.
Is there a more efficient way to do this?

Why don't you convert the two RDDs to DataFrames and use the union function?
Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and call .toDF() with the column names.
For example:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().appName("testings").master("local").getOrCreate()
val sqlContext = sparkSession.sqlContext
val firstTableColumns = Seq("col1", "col2")
val secondTableColumns = Seq("col3", "col4")
import sqlContext.implicits._
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns: _*)
val secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)
// union matches columns by position, so the result keeps firstDF's column names (col1, col2)
firstDF = firstDF.union(secondDF)
It should be much easier for you to work with DataFrames than with RDDs. Converting a DataFrame back to an RDD is easy too: just call .rdd (this gives an RDD[Row]).
val rddData = firstDF.rdd
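
For context on the task count mentioned in the question: an RDD union does not shuffle data. It concatenates the partitions of its inputs, so the stage runs one task per input partition. The sketch below assumes a local SparkSession and made-up partition counts (4 per input); it shows the partition count adding up and uses coalesce to merge partitions afterwards. Whether fewer partitions actually helps for 10GB inputs depends on the cluster and is not something this thread confirms.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell or a notebook `sc` already exists.
val spark = SparkSession.builder().appName("union-partitions").master("local[2]").getOrCreate()
val sc = spark.sparkContext

// Two small RDDs with 4 partitions each (illustrative numbers only).
val a = sc.parallelize(Seq((1, 2), (2, 3)), numSlices = 4)
val b = sc.parallelize(Seq((4, 5), (5, 6)), numSlices = 4)

// union keeps every input partition, so the partition counts add up.
val unioned = a.union(b)
println(unioned.getNumPartitions)

// coalesce merges partitions without a shuffle, which lowers the task count of later stages.
val compacted = unioned.coalesce(2)
println(compacted.getNumPartitions)
```

If the inputs themselves are split into many tiny partitions, reducing the partition count where they are created is likely to matter more than how the union is written.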

Databricks Azure broadcast variables not serializable

So I am trying to create an extremely simple Spark notebook using Azure Databricks and would like to make use of a simple RDD map call.
This is just for messing around, so the example is a bit contrived, but I cannot get a value to work in the RDD map call unless it is a static constant value.
I have tried using a broadcast variable.
Here is a simple example using an Int which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Here is another example where I use a simple serializable singleton object with an Int field, which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
object Foo extends Serializable { val theMultiplier: Int = multiplier}
val fooBroadcast = sparkContext.broadcast(Foo)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => fooBroadcast.value.theMultiplier)
val df = mappedRdd.toDF
df.show()
And finally a List[Int] with a single element, which I broadcast and then try to use in the RDD map:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val listBroadcast = sparkContext.broadcast(List(multiplier))
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => listBroadcast.value.head)
val df = mappedRdd.toDF
df.show()
However, ALL the examples above fail with the error below. As you can see, it points towards an issue with the RDD map value not being serializable. I cannot see the issue; an Int value should be serializable in all of the above examples, I think.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2375)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:379)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:371)
at org.apache.spark.rdd.RDD.map(RDD.scala:378)
However, it does work if I make the value in the RDD map a regular Int literal like this:
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => 6)
val df = mappedRdd.toDF
df.show()
Everything works fine and I see my simple DataFrame shown as expected.
Any ideas, anyone?

From your code, I would assume that you are on Spark 2+. Perhaps there is no need to drop down to the RDD level; you could work with DataFrames instead.
The code below shows how to join two DataFrames and explicitly broadcast the first one (dataDF).
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val data = Seq(1, 2, 3, 4, 5)
val dataDF = data.toDF("id")
val largeDataDF = Seq((0, "Apple"), (1, "Pear"), (2, "Banana")).toDF("id", "value")
val df = largeDataDF.join(broadcast(dataDF), Seq("id"))
df.show()
Typically, small DataFrames are perfect candidates for broadcasting, an optimization whereby they are sent to all executors. spark.sql.autoBroadcastJoinThreshold is a configuration setting that limits the size of DataFrames eligible for automatic broadcast. Additional details can be found in the official Spark documentation.
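
As a rough sketch of how that setting can be inspected and changed at runtime: the session below is an assumption for a standalone run (in a Databricks notebook `spark` already exists), and the 50 MB figure is only an illustrative value.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session for illustration.
val spark = SparkSession.builder().appName("broadcast-threshold").master("local[1]").getOrCreate()

// Current threshold, returned as a string (the exact formatting varies by Spark version).
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Raise the threshold to 50 MB so that larger tables qualify for automatic broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

// -1 disables automatic broadcast joins; an explicit broadcast(df) hint is still honored.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
```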
Note also that with DataFrames, you have access to a handy explain method. With it, you can see the physical plan and it can be useful for debugging.
Running explain() on our example would confirm that Spark is doing a BroadcastHashJoin optimization.
df.explain()
== Physical Plan ==
*Project [id#11, value#12]
+- *BroadcastHashJoin [id#11], [id#3], Inner, BuildRight
   :- LocalTableScan [id#11, value#12]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- LocalTableScan [id#3]
If you need additional help with DataFrames, I provide an extensive list of examples at http://allaboutscala.com/big-data/spark/

So the answer was that you should not capture the SparkContext in a val and then use that val for the broadcast. This is the working code:
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = spark.sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
// use spark.sparkContext directly here as well, rather than a captured val
val dataRdd = spark.sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Thanks to Nadim Bahadoor for this answer.

The questions about Spark dataframe operations [duplicate]

This question already has an answer here:
get TopN of all groups after group by using Spark DataFrame
If I create a DataFrame like this:
val df1 = sc.parallelize(List((1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2), (2, 3)).toDF("key1","key2")
Then I group by "key1" and "key2", and count "key2":
val df2 = df1.groupBy("key1","key2").agg(count("key2") as "k").sort(col("k").desc)
My question is: how do I filter this DataFrame to keep the top 2 values of "k" for each "key1"?
If I don't use window functions, how should I solve this problem?

This can be done using a window function, using row_number() (or rank()/dense_rank(), depending on your requirements):
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
// the $"..." column syntax needs the session's implicits in scope (e.g. import sqlContext.implicits._)
df2
  .withColumn("rnb", row_number().over(Window.partitionBy($"key1").orderBy($"k".desc)))
  .where($"rnb" <= 2).drop($"rnb")
  .show()
EDIT:
Here is a solution using RDDs (which does not require a HiveContext):
import org.apache.spark.sql.Row
df2
  .rdd
  .groupBy(_.getAs[Int]("key1"))
  .flatMap { case (_, rows) =>
    rows.toSeq
      .sortBy(_.getAs[Long]("k")).reverse // highest count first
      .take(2)
      .map { case Row(key1: Int, key2: Int, k: Long) => (key1, key2, k) }
  }
  .toDF("key1", "key2", "k")
  .show()
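
The group, sort, take(2) logic of the RDD version can be checked on plain Scala collections, without Spark. This is only a sketch on the question's sample data. The tiebreaker on key2 is an added assumption: for key1 = 2 two rows share k = 1, and neither row_number() nor the RDD version above defines which of them is kept.

```scala
// Sample data from the question, as plain (key1, key2) pairs.
val pairs = List((1, 1), (1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2), (2, 3))

// Count occurrences of each (key1, key2) pair, like groupBy("key1", "key2").agg(count(...)).
val counts: Map[(Int, Int), Int] = pairs.groupBy(identity).map { case (k, v) => k -> v.size }

// For each key1, keep the two rows with the highest count k.
// Ties on k are broken by the smaller key2 (an assumption made here for determinism).
val top2: List[(Int, Int, Int)] =
  counts.toList
    .map { case ((key1, key2), k) => (key1, key2, k) }
    .groupBy(_._1)
    .toList
    .sortBy(_._1)
    .flatMap { case (_, rows) => rows.sortBy { case (_, key2, k) => (-k, key2) }.take(2) }

top2.foreach(println)
// (1,1,3)
// (1,2,2)
// (2,2,2)
// (2,1,1)
```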

Partial/Full-match value in one RDD to values in another RDD

I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13,"ABX-3354 gsfette"
2, 2017-3-18,"TYET-3423 asdsad"
3, 2017-2-09,"TYET-3423 rewriu"
4, 2017-2-13,"ABX-3354 42324"
5, 2017-4-01,"TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 that fully or partially match each value in RDD2 (matching the 3rd column of RDD1 against the 2nd column of RDD2) and get the count.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?

I am posting a couple of solutions using Spark SQL, focused on accurate pattern matching of the search string in the given text.
1: Using CrossJoin
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.count
val df1 = Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-335442324"), // changed from "ABX-3354 42324"
  (5, "2017-4-01", "aerrTYET-3423") // changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
).toDF("col1", "key")
// match function for filter: true when txt contains key anywhere
def matcher(row: Row): Boolean = row.getAs[String]("txt")
  .contains(row.getAs[String]("key"))
// every txt is paired with every key, so this suits a small df2
val join = df1.crossJoin(df2)
val result = join.filter(matcher _)
  .groupBy("key")
  .agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-3354 42324"),
  (5, "2017-4-01", "aerrTYET-3423"),
  (6, "2017-4-01", "aerrYET-3423") // matches no key
).toDF("id", "dt", "pattern")
// small dataset to broadcast (a plain Seq[String], not a DataFrame)
val keys = Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
).map(_._2) // keep only the second value of each pair
// lookup to use in the UDF
val lookup = spark.sparkContext.broadcast(keys)
// UDF: returns the first key contained in txt, or null when none matches
val matcher = udf((txt: String) => lookup.value.find(txt.contains(_)).orNull)
val result = df1.withColumn("match", matcher($"pattern"))
  .filter($"match".isNotNull) // not interested in non-matching records
  .groupBy("match")
  .agg(count("pattern").as("count"))
Both solutions produce the same counts (the second one names the grouping column match instead of key):
result.show()
+---------+-----+
|      key|count|
+---------+-----+
|TYET-3423|    3|
| ABX-3354|    2|
+---------+-----+

Here is how you can get the result:
val RDD1 = spark.sparkContext.parallelize(Seq(
  (1, "2017-2-13", "ABX-3354 gsfette"),
  (2, "2017-3-18", "TYET-3423 asdsad"),
  (3, "2017-2-09", "TYET-3423 rewriu"),
  (4, "2017-2-13", "ABX-3354 42324"),
  (5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
  ("mfr1", "ABX-3354"),
  ("mfr2", "TYET-3423")
))
// key each record by the first space-separated token of its text,
// so only records whose first token equals a key are counted
RDD1.map(r => (r._3.split(" ")(0), (r._1, r._2, r._3)))
  .join(RDD2.map(r => (r._2, r._1)))
  .groupBy(_._1)
  .map(r => (r._1, r._2.size))
  .collect() // bring the counts to the driver before printing
  .foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
Hope this helps!
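
The answers above use two different notions of "match": contains (the key may appear anywhere in the text) and equality with the first space-separated token. A minimal plain-Scala sketch, using the modified sample texts from the cross join answer, shows where the two differ. The variable names are illustrative only.

```scala
// Sample texts where two keys are embedded inside longer tokens.
val texts = Seq("ABX-3354 gsfette", "TYET-3423 asdsad", "TYET-3423 rewriu", "ABX-335442324", "aerrTYET-3423")
val keys = Seq("ABX-3354", "TYET-3423")

// Partial match: the key may appear anywhere in the text.
val partialCounts: Map[String, Int] =
  keys.map(k => k -> texts.count(_.contains(k))).toMap

// First-token match: only an exact match on the first space-separated token counts.
val firstTokenCounts: Map[String, Int] =
  keys.map(k => k -> texts.count(_.split(" ")(0) == k)).toMap

println(partialCounts)    // ABX-3354 -> 2, TYET-3423 -> 3
println(firstTokenCounts) // ABX-3354 -> 1, TYET-3423 -> 2
```

On the question's original data both approaches give the same counts, because every key is the first token of its text.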

Consistent indexing and categorizing of categorical fields

Suppose I have the following Scala code:
import org.apache.spark.ml.feature.StringIndexer
val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)
Now, suppose I create an org.apache.spark.mllib.tree.model.DecisionTreeModel that uses this indexer and save the model to a file.
How can I ensure that if I do predictions on new data in the future that the indexer will be consistent with the original indexer used on the original data to construct the model?

Persist the fitted indexer along with the model, and reload both at prediction time.
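
A minimal sketch of what that could look like with the fitted StringIndexerModel from the question. The temporary directory and the names indexerPath, restored and newData are assumptions for illustration; in practice the path would sit next to the saved tree model.

```scala
import java.nio.file.Files
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session for illustration.
val spark = SparkSession.builder().appName("indexer-persist").master("local[1]").getOrCreate()
import spark.implicits._

val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")).toDF("id", "category")

// Fit once on the training data; the fitted model holds the label-to-index mapping.
val indexerModel: StringIndexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

// Hypothetical location for the saved indexer.
val indexerPath = Files.createTempDirectory("indexer").resolve("category-indexer").toString
indexerModel.write.overwrite().save(indexerPath)

// Later, at prediction time: load the same mapping instead of fitting a new indexer.
val restored = StringIndexerModel.load(indexerPath)
val newData = Seq((10, "c"), (11, "a")).toDF("id", "category")
val indexedNew = restored.transform(newData)
indexedNew.show()
```

Fitting a fresh StringIndexer on the new data could assign different indices, because by default the indices follow label frequency. Note also that, by default, the loaded model throws on a label it did not see during fitting; setHandleInvalid on the indexer controls that behavior.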

Spark : anti-join two DStreams

I can do JOINs on two Spark DStreams like :
val joinStream = stream1.join(stream2)
Now, what if I need only the records that weren't joined? Essentially, something like stream1.anti-join(stream2). Is this possible somehow?
Thanks, and I appreciate any help!

Assuming you had these:
val rdd1 = sc.parallelize(Array(
  (1, "one"),
  (2, "twow"),
  (3, "three"),
  (4, "four"),
  (5, "five")
))
val rdd2 = sc.parallelize(Array(
  (1, "otherOne"),
  (4, "otherFour"),
  (5, "otherFive"),
  (6, "six"),
  (7, "seven")
))
// keep keys that appear in only one of the two RDDs (unmatched on either side)
val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
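
The answer above works on RDDs and keeps unmatched records from both sides. For a one-sided anti-join on DStreams, one possible sketch combines transformWith with subtractByKey. The helper name antiJoin is hypothetical, not a Spark API, and it only compares records that arrive in the same batch interval.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

// Hypothetical helper: per batch, keeps the records of `left` whose key does not appear in `right`.
def antiJoin[K: ClassTag, V: ClassTag, W: ClassTag](
    left: DStream[(K, V)],
    right: DStream[(K, W)]): DStream[(K, V)] =
  left.transformWith(right, (l: RDD[(K, V)], r: RDD[(K, W)]) => l.subtractByKey(r))

// The same per-batch operation on plain RDDs, using the data from the answer above.
// Assumption: a throwaway local session for illustration.
val sc = SparkSession.builder().appName("anti-join").master("local[1]").getOrCreate().sparkContext
val rdd1 = sc.parallelize(Seq((1, "one"), (2, "twow"), (3, "three"), (4, "four"), (5, "five")))
val rdd2 = sc.parallelize(Seq((1, "otherOne"), (4, "otherFour"), (5, "otherFive"), (6, "six"), (7, "seven")))

val leftOnly = rdd1.subtractByKey(rdd2)
leftOnly.collect().sortBy(_._1).foreach(println)
// (2,twow)
// (3,three)
```

With the streams from the question, the call would be antiJoin(stream1, stream2).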