Working with two RDDs in Apache Spark (Scala)

I am using Calliope, a Spark plugin for connecting to Cassandra. I have created two RDDs, which look like this:
class A
val persistLevel = org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
val cas1 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_column_family_1")
val sc1 = new SparkContext("local", "any app name")
var rdd1 = sc1.cql3Cassandra[SCALACLASS_1](cas1)
var rddResult1 = rdd1.persist(persistLevel)
class B
val cas2 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_column_family_2")
var rdd2 = sc1.cql3Cassandra[SCALACLASS_2](cas2)
var rddResult2 = rdd2.persist(persistLevel)
Somehow the following code, which creates a new RDD from the other two, is not working. Is it not possible to iterate over two RDDs together?
Here is the code snippet which is not working:
case class Report(id: Long, anotherId: Long)
var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **rddResult1.collect().toList**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
However, if I replace the bolded expression with a val initialized outside the closure, like this:
val collection = rddResult1.collect().toList
var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **collection**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
it works. Is there any explanation for this?

You are mixing a transformation with an action. The closure passed to rdd2.flatMap is executed on the workers, while rdd1.collect is an 'action' in Spark lingo and delivers the data back to the driver. So, informally, you could say that the data is not there when you try to flatMap over it. (I don't know enough of the internals yet to pinpoint the exact root cause.)
If you want to operate on both RDDs in a distributed fashion, you should join them using one of the join functions (join, leftOuterJoin, rightOuterJoin, cogroup).
E.g.
val mappedRdd1 = rdd1.map(x => (x.id, x))
val mappedRdd2 = rdd2.map(x => (x.customerId, x))
val joined = mappedRdd1.join(mappedRdd2)
joined.flatMap(...reporting logic..).collect
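Applied to the question above, a minimal sketch of that approach might look like the following (the shared key field id and the fields passed to Report are hypothetical; substitute whatever actually links SCALACLASS_1 and SCALACLASS_2):
// Sketch only: assumes both classes expose a common key field named `id`
val keyed1 = rddResult1.map(a => (a.id, a))
val keyed2 = rddResult2.map(b => (b.id, b))
val reportRDD = keyed1.join(keyed2).map { case (id, (a, b)) =>
  Report(a.id, b.id) // build the Report from whichever fields you actually need
}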

You can operate on RDDs in the application. But you cannot operate on RDDs in the executors (the worker nodes). The executors cannot give commands to drive the cluster. The code inside flatMap runs on the executors.
In the first case, you try to operate on an RDD in the executor. I reckon you would get a NotSerializableException as you cannot even send the RDD object to the executors. In the second case, you pull the RDD contents to the application, and then send this simple List to the executors. (Lambda captures are automatically serialized.)
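If the contents of rddResult1 are small enough to fit on the driver, another option is to collect them once and ship the resulting list to the executors as a broadcast variable instead of capturing an RDD inside the closure. A minimal sketch, again with hypothetical id fields:
// Assumes rddResult1 is small; a broadcast gives every executor a read-only copy
val smallList = rddResult1.collect().toList
val smallListBc = sc1.broadcast(smallList)
val reportRDD = rddResult2.flatMap { f =>
  smallListBc.value.map { v =>
    Report(f.id, v.id) // hypothetical fields; combine f and v as needed
  }
}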

Related

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will drop the i-th entry inside folds and store it separately. Then I will union all the others into another DataFrame, as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))

for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
  for (j <- ts.indices) {
    trainSet = trainSet.union(ts(j))
  }
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function on the remaining entries inside ts to create another DataFrame, something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))

for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = ts.flatten
}
But on the line that initializes trainSet, I get the error: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what your intention is here, but instead of using mutable variables and flattening, you can iterate recursively like this:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) // Array[Dataset[Row]]
val testSet = spark.createDataFrame(Seq.empty)
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)

go(folds, Array.empty)

def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
  case arr @ Array(_, _*) =>
    val res = arr.map { t =>
      trainSet.union(t)
    }
    go(arr.tail, result ++ res)
  case Array() => result
}
From what I can see of the use of testSet, it is not actually used in the method body.
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))
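Putting it together, here is a minimal sketch of the fold loop built around that reduce (it assumes folds comes from randomSplit as above and keeps the index-based split from the question):
val folds = df.randomSplit(Array.fill(5)(1.0 / 5))
for (i <- folds.indices) {
  val testSet = folds(i)
  // all folds except the i-th, unioned into one training DataFrame
  val trainSet = folds.patch(i, Nil, 1).reduce((a, b) => a.union(b))
  // ... fit and evaluate on (trainSet, testSet) here ...
}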

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speed-up by using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is lazy execution the issue, i.e. even with my .count it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)

// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count

// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})

val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

How can I construct a String with the contents of a given DataFrame in Scala

Suppose I have a DataFrame. How can I retrieve the contents of that DataFrame and represent it as a string?
Consider this attempt using the example code below.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
  println("x = ", x)
  sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final StringBuilder contains an empty string.
Any thoughts on how to retrieve a String for a given DataFrame in Scala?
Many thanks
UPD: as mentioned by @user8371915, the solution below will only work in a single JVM, i.e. in development (local) mode. In fact, we cannot modify broadcast variables as if they were globals. You can use accumulators, but that will be quite inefficient. You can also read an answer about read/write global variables here. I hope it helps you.
I think you should read the topic about shared variables in Spark. Link here
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
  println("x = ", x)
  broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
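Given the caveat in the UPD above, a version that works regardless of deployment mode (a sketch; it assumes the data is small enough to collect on the driver) is to pull the rows back first and build the string locally:
// Collect to the driver, then build the string there - no shared mutable state on executors
val str = df.collect().mkString(", ")
println("str = " + str)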
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback; I understand this slightly better now.
The combination of responses resulted in the code below. The requirements have changed slightly, in that I now represent my df as a list of JSON objects. The code below does this without using the broadcast.
import org.apache.spark.sql.DataFrame
import scala.util.parsing.json.JSONObject // assuming scala-parser-combinators' JSONObject

class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
  val jsons = df.limit(limit).collect.map(rowToJson(_))

  def rowToJson(r: org.apache.spark.sql.Row): JSONObject = {
    try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
    catch { case t: Throwable =>
      JSONObject.apply(Map("Row with error" -> t.toString))
    }
  }
}
And here is how I use the class...
val jsons = new HandleDf(df, 100).jsons
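If a single String is still the end goal, a small follow-on step (just a sketch) is to join those JSON objects into one string:
// Turn the collected JSON objects into a single JSON-array-like string
val jsonString = jsons.map(_.toString()).mkString("[", ",", "]")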

How to use COGROUP for large datasets

I have two RDDs, namely val tab_a: RDD[(String, String)] and val tab_b: RDD[(String, String)], and I'm using cogroup on those datasets like this:
val tab_c = tab_a.cogroup(tab_b).collect.toArray
val updated = tab_c.map { x =>
  {
    //somecode
  }
}
I'm using the cogrouped values of tab_c in a map function, and it works fine for small datasets, but for huge datasets it throws an Out Of Memory exception.
I have tried converting the final value to an RDD, but no luck, same error:
val newcos = spark.sparkContext.parallelize(tab_c)
1. How can I use cogroup for large datasets?
2. Can we persist the cogrouped value?
Code
val source_primary_key = source.map(rec => (rec.split(",")(0), rec))
source_primary_key.persist(StorageLevel.DISK_ONLY)

val destination_primary_key = destination.map(rec => (rec.split(",")(0), rec))
destination_primary_key.persist(StorageLevel.DISK_ONLY)

val cos = source_primary_key.cogroup(destination_primary_key).repartition(10).collect()

var srcmis: Array[String] = new Array[String](0)
var destmis: Array[String] = new Array[String](0)
var extrainsrc: Array[String] = new Array[String](0)
var extraindest: Array[String] = new Array[String](0)
var srcs: String = Seq("")(0)
var destt: String = Seq("")(0)

val updated = cos.map { x =>
  {
    val key = x._1
    val value = x._2
    srcs = value._1.mkString(",")
    destt = value._2.mkString(",")
    if (srcs.equalsIgnoreCase(destt) == false && destt != "") {
      srcmis :+= srcs
      destmis :+= destt
    }
    if (srcs == "") {
      extraindest :+= destt.mkString("")
    }
    if (destt == "") {
      extrainsrc :+= srcs.mkString("")
    }
  }
}
Code Updated:
val tab_c = tab_a.cogroup(tab_b).filter(x => x._2._1 != x._2._2)
// tab_c = {1, CompactBuffer(1,john,US), CompactBuffer(1,john,UK)}
//         {2, CompactBuffer(2,john,US), CompactBuffer(2,johnson,UK)} ...
ERROR:
ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(4,3,ResultTask,FetchFailed(null,0,-1,27,org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
ERROR YarnScheduler: Lost executor 8 on datanode1: Container killed by YARN for exceeding memory limits. 1.0 GB of 1020 MB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Thank you
When you use collect() you are basically telling Spark to move all the resulting data back to the driver node, which can easily produce a bottleneck. At that point you are no longer using Spark, just a plain array on a single machine.
To trigger computation, just use something that requires the data at every node; that's why executors live on top of a distributed file system. For instance, saveAsTextFile().
Here are some basic examples.
Remember, the entire objective here (that is, if you have big data) is to move the code to your data and compute there, not to bring all the data to the computation.
TL;DR Don't collect.
To run this code safely, without additional assumptions (on average, the requirements for worker nodes might be significantly smaller), every node (the driver and each executor) would require memory significantly exceeding the total memory requirements of all the data.
If you were to run it outside Spark, you would need only one node, so Spark provides no benefit here.
However, if you skip collect.toArray and make some assumptions about the data distribution, you might run it just fine.
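For instance, a minimal sketch of keeping the whole comparison distributed (the output path is hypothetical) is to compute the differences inside a transformation and let the executors write the result, instead of collecting it:
// Stays on the cluster: no collect(); results are written out by the executors
val diffs = source_primary_key.cogroup(destination_primary_key).flatMap {
  case (key, (srcVals, dstVals)) =>
    val srcs = srcVals.mkString(",")
    val destt = dstVals.mkString(",")
    if (srcs != destt) Some((key, srcs, destt)) else None
}
diffs.saveAsTextFile("hdfs:///tmp/source_dest_diff") // hypothetical output path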

scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I getting it wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})

// Instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class:
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()

val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(userRatings => keywords.map(userRatings))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.
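As a quick usage check (the user id 123 is just a placeholder), one way to pull a single user's predicted ratings back to the driver would be:
// lookup returns all values stored under the key; here, at most one array of ratings
val ratingsForUser = predictionArrays.lookup(123).headOption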