Processing Apache Spark GraphX multiple subgraphs - scala

I have a parent Graph that I want to filter into multiple subgraphs, so I can apply a function to each subgraph and extract some data. My code looks like this:
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = ...
val myEdges = ...
val myGraph = Graph(myVertices, myEdges)
val myResults : RDD[(<Tuple>)] = myTerms.map { x => mySubgraphFunction(myGraph, x) }
Where mySubgraphFunction is a function that creates a subgraph, performs a calculation, and returns a tuple of result data.
When I run this, I get a Java null pointer exception at the point that mySubgraphFunction calls GraphX.subgraph. If I call collect on the RDD of terms, I can get this to work (also added persist on the RDDs for performance):
val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myEdges = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myGraph = Graph(myVertices, myEdges)
val myResults : Array[(<Tuple>)] = myTerms.collect().map { x =>
mySubgraphFunction(myGraph, x) }
Is there a way to get this to work where I don't have to call collect() (i.e. make this a distributed operation)? I'm creating ~1k subgraphs and the performance is slow.

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will be dropping the ith entry inside folds and store it separately. Then I will be using all the others as another DataFrame as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
for (j <- ts.indices) {
trainSet = trainSet.union(ts(j))
}
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function for the remaining inside ts to create another DataFrame using something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
var ts = folds
val testSet = ts(i)
ts = ts.drop(i)
var trainSet = ts.flatten
}
But at the initialization of the trainSet line, I get an error that: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what your intention is here, but instead of using mutable variables and flattening, you can iterate recursively like this:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
val testSet = spark.createDataFrame(Seq.empty)
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
go(folds, Array.empty)
def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
case arr @ Array(_, _*) =>
val res = arr.map { t =>
trainSet.union(t)
}
go(arr.tail, result ++ res)
case Array() => result
}
Looking at how testSet is used, it is never actually referenced in the method body.
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))
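The reduce above can be combined with a per-fold split. Below is a minimal sketch in plain Scala, assuming `Vector[Int]` as a stand-in for `Dataset[Row]` and `++` as a stand-in for `union`; the names `KFoldSketch` and `splits` are hypothetical. It also illustrates that `drop(i)` removes the first `i` elements of a collection rather than the element at index `i`, so `patch(i, Nil, 1)` is used to leave out exactly one fold.

```scala
object KFoldSketch {
  // For each fold index i, return (testSet, trainSet), where trainSet is the
  // concatenation of every fold except fold i.
  // Assumes at least two folds; reduce throws on an empty collection.
  def splits[A](folds: Seq[Vector[A]]): Seq[(Vector[A], Vector[A])] =
    folds.indices.map { i =>
      val testSet = folds(i)
      // patch(i, Nil, 1) removes exactly the element at index i;
      // drop(i) would instead remove the first i elements.
      val others = folds.patch(i, Nil, 1)
      val trainSet = others.reduce(_ ++ _)
      (testSet, trainSet)
    }

  def main(args: Array[String]): Unit = {
    val folds = Seq(Vector(1, 2), Vector(3, 4), Vector(5, 6))
    splits(folds).foreach { case (test, train) =>
      println(s"test=$test train=$train")
    }
  }
}
```

With real DataFrames the same shape should apply, with `others.reduce(_ union _)` in place of `++`.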

Apache Flink - Prediction Handling

I am currently working with Apache Flink's SVM-Class to predict some text data.
The class provides a predict-function which is taking a DataSet[Vector] as an input and gives me a DataSet[Prediction] as result. So far so good.
My problem is that I don't know which prediction belongs to which text, and I can't pass the text into the predict() function to have it available afterwards.
Code:
val tweets: DataSet[(SparseVector, String)] =
source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
.map(tweet => (featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that everything stays together? Without context, the prediction does not help me.
Or maybe there is a way to predict a single vector instead of a DataSet, so that I could call the function inside the map function above.
The SVM predictor expects as input a sub type of Vector. Hence there are two options to solve this problem:
Create a sub type of Vector which contains the tweet text as a tag. It will then be passed through the predictor unchanged. This approach has the advantage that no additional operation is needed. However, one needs to define new classes and utilities to represent the different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
new DenseVectorWithTag(Array(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)
class DenseVectorWithTag(override val data: Array[Double], tag: String)
extends DenseVector(data) {
override def toString: String = "(" + super.toString + ", " + tag + ")"
}
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
(DenseVector(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
.join(predictionResult)
.where(0)
.equalTo(0)
.apply((t, p) => (t._2, p._2))

How can I construct a String with the contents of a given DataFrame in Scala

Suppose I have a dataframe. How can I retrieve the contents of that dataframe and represent them as a string?
Consider I try to do that with the below example code.
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
df.foreach(x => {
println("x = ", x)
sb.append(x)
})
println("sb = ", sb)
The output of the code shows the example dataframe has contents:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(4.875333799256043,2.8363794106756046E-6))
However, the final stringbuilder contains an empty string.
Any thoughts how to retrieve a String for a given dataframe in Scala?
Many thanks
UPD: as mentioned by @user8371915, the solution below will work only in a single JVM in development (local) mode. In fact, we can't modify broadcast variables like globals. You can use accumulators, but that will be quite inefficient. You can also read an answer about reading and writing global variables here. Hope it helps.
I think you should read topic about shared variables in Spark. Link here
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
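Because updates made inside the closure stay on the executors, one alternative is to bring the data back to the driver and build the string there. The following is a minimal sketch, assuming the data is small enough to fit in driver memory; the object name `RddToString`, the helper `render`, and the sample values are hypothetical and not part of the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddToString {
  // Concatenate the elements of a small RDD into one String on the driver.
  // collect() returns the elements to the driver JVM, so nothing has to be
  // mutated from inside a task closure.
  def render(sc: SparkContext, pairs: Seq[(Double, Double)]): String = {
    val rdd = sc.parallelize(pairs)
    rdd.collect().mkString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddToString").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Shortened sample values standing in for tvalues and pvalues.
      val tvalues = Array(1.5, 2.5, 4.0)
      val pvalues = Array(0.5, 0.25, 0.0)
      println("sb = " + render(sc, (tvalues zip pvalues).toSeq))
    } finally {
      sc.stop()
    }
  }
}
```

For large datasets, collecting everything to the driver is not advisable; limiting the rows first (for example with take) keeps the result bounded.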
Let's have a look at broadcast variables. I edited your code:
val tvalues: Array[Double] = Array(1.866393526974307, 2.864048126935307, 4.032486069215076, 7.876169953355888, 4.875333799256043, 14.316322626848278)
val pvalues: Array[Double] = Array(0.064020056478447, 0.004808399479386827, 8.914865448939047E-5, 7.489564524121306E-13, 2.8363794106756046E-6, 0.0)
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]");
val sc = new SparkContext(conf)
val df = sc.parallelize(tvalues zip pvalues)
val sb = StringBuilder.newBuilder
val broadcastVar = sc.broadcast(sb)
df.foreach(x => {
println("x = ", x)
broadcastVar.value.append(x)
})
println("sb = ", broadcastVar.value)
Here I used broadcastVar as a container for a StringBuilder variable sb.
Here is output:
(x = ,(1.866393526974307,0.064020056478447))
(x = ,(2.864048126935307,0.004808399479386827))
(x = ,(4.032486069215076,8.914865448939047E-5))
(x = ,(7.876169953355888,7.489564524121306E-13))
(x = ,(4.875333799256043,2.8363794106756046E-6))
(x = ,(14.316322626848278,0.0))
(sb = ,(7.876169953355888,7.489564524121306E-13)(1.866393526974307,0.064020056478447)(4.875333799256043,2.8363794106756046E-6)(2.864048126935307,0.004808399479386827)(14.316322626848278,0.0)(4.032486069215076,8.914865448939047E-5))
Hope this helps.
Does the output of df.show(false) help? If yes, then this SO answer helps: Is there any way to get the output of Spark's Dataset.show() method as a string?
Thanks everybody for the feedback and for understanding this slightly better.
The combination of responses results in the code below. The requirements have changed slightly in that I now represent my df as a list of JSON objects. The code below does this without using a broadcast variable.
class HandleDf(df: DataFrame, limit: Int) extends java.io.Serializable {
val jsons = df.limit(limit).collect.map(rowToJson(_))
def rowToJson(r: org.apache.spark.sql.Row) : JSONObject = {
try { JSONObject(r.getValuesMap(r.schema.fieldNames)) }
catch { case t: Throwable =>
JSONObject.apply(Map("Row with error" -> t.toString))
}
}
}
The class I use here...
val jsons = new HandleDf(df, 100).jsons

How to efficiently extract a value from HiveContext Query

I am running a query through my HiveContext
Query:
val hiveQuery = s"""SELECT post_domain, post_country, post_geo_city, post_geo_region
FROM $database.$table
WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"""
val hiveQueryObj:DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
// expected Map
val ex = expected.get(column).get
val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
assert(actual.equals(ex))
})
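Each select(column).collectAsList() call in the loop above triggers its own job. One possible way to reduce this, sketched here under the assumption that the query returns at least one row and that only the first row matters, is to fetch that row once and read every column from it. The names `FirstRowValues` and `firstRowAsStrings` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object FirstRowValues {
  // Fetch the first row once and read every requested column from it,
  // so a single job runs regardless of how many columns are checked.
  // head() fails if the query returned no rows.
  def firstRowAsStrings(df: DataFrame, columns: Seq[String]): Map[String, String] = {
    val row = df.head()
    columns.map { c =>
      // Guard against SQL NULLs, which would otherwise throw on toString.
      c -> Option(row.getAs[Any](c)).map(_.toString).getOrElse("null")
    }.toMap
  }
}

// Hypothetical usage with the names from the question:
//   val actual = FirstRowValues.firstRowAsStrings(hiveQueryObj, arr)
//   arr.foreach(column => assert(actual(column) == expected(column)))
```

Calling cache() on the DataFrame before several separate actions is another option, at the cost of keeping the result in memory.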

scala.MatchError: null on spark RDDs

I am relatively new to both spark and scala.
I was trying to implement collaborative filtering using scala on spark.
Below is the code
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) at the last line
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I getting it wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
(user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.