Errors in PageRank of GraphFrames - pyspark

I am new to pyspark and am trying to understand how PageRank works. I am using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (as well as the schema) are in these links: verticesRDD and edgesRDD
My code so far is as follows:
#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *
#Read the csv files
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")
#Renaming the id columns to enable GraphFrame
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")
#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same results, so I'm not sure if this step is really needed
#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)
Now when I run the pageRank function, every variant I try fails:
g.pageRank(resetProbability=0.15, maxIter=10)
Py4JJavaError: An error occurred while calling o98.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")
Py4JJavaError: An error occurred while calling o166.run.: org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])
ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
AttributeError: 'function' object has no attribute 'resetProbability'
ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()
Py4JJavaError: An error occurred while calling o188.run.: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
I am reading about PageRank but don't understand where I'm going wrong. Any help will be appreciated.

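The problem was how I was defining my vertices. I was renaming "station_ID" to "id" when in fact it had to be "name": the src and dst columns of the edges hold station names, so the vertex ids have to be station names too. So this line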
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
has to be
verticesRDD = verticesRDD.withColumnRenamed("name", "id")
pageRank works properly with this change!
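To see why the rename matters, here is a minimal plain-Python sketch (not GraphFrames code) that checks whether every edge endpoint has a matching vertex id. The sample rows are hypothetical, loosely modeled on the trip row shown in the error message, and `dangling_endpoints` is an illustrative helper, not part of any library.

```python
# Hypothetical sample rows; the real data lives in station.csv and trip.csv.
vertices = [
    {"station_id": 50, "name": "Harry Bridges Plaza (Ferry Building)"},
    {"station_id": 70, "name": "San Francisco Caltrain (Townsend at 4th)"},
]
edges = [
    {"src": "Harry Bridges Plaza (Ferry Building)",
     "dst": "San Francisco Caltrain (Townsend at 4th)"},
]


def dangling_endpoints(vertices, edges, id_column):
    """Return the edge endpoints that have no vertex with a matching id."""
    ids = {v[id_column] for v in vertices}
    endpoints = {e["src"] for e in edges} | {e["dst"] for e in edges}
    return endpoints - ids


# Using the numeric station id as the vertex id: no endpoint can match.
missing_by_number = dangling_endpoints(vertices, edges, "station_id")
# Using the station name as the vertex id: every endpoint matches.
missing_by_name = dangling_endpoints(vertices, edges, "name")

print(len(missing_by_number))  # 2
print(len(missing_by_name))    # 0
```

With the numeric id, both endpoints of the edge are dangling, which is consistent with the `[null,null,[...]]` row in the MatchError: the edge row is present but neither of its vertices could be resolved.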

Related

Error comes when using User Defined Case Class in Scala

I need to find the distinct repos accessed by a particular user, using an inverted-index approach in Scala. I have defined a case class to extract the data from the input file, built the RDD, and joined it on the required user. However, when I call distinct and then count on the joined result, I get the error pasted below.
The code is as follows:
import org.apache.spark.rdd.RDD
import java.text.SimpleDateFormat
import java.util.Date
case class LogLine(debug_level: String, timestamp: Date, download_id: String,retrieval_stage: String, rest: String);
val regex = """([^\s]+), ([^\s]+)\+00:00, ([^\s]+) -- ([^\s]+): (.*$)""".r
val rdd = sc.
  textFile("file.txt").
  flatMap ( x => x match {
    case regex(debug_level,dateTime,downloadId,retrievalStage,rest) =>
      val df = new SimpleDateFormat("yyyy-MM-dd:HH:mm:ss")
      Some(LogLine(debug_level, df.parse(dateTime.replace("T", ":")), downloadId, retrievalStage, rest))
    case _ => None
  })
rdd.cache()
val inverted2 = rdd.groupBy(element => element.download_id).cache
val list2 = List[String]("ghtorrent-22")
val user = sc.parallelize(list2).keyBy(x => x)
val poss = inverted2.join(user).flatMap{x => x._2._1}.map(line => line.rest)
After this, whenever I run the next step to evaluate the result, I get an error.
The code and error are as follows:
Code:
poss.distinct.count
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 8.0 failed 1 times, most recent failure: Lost task 5.0 in stage 8.0 (TID 59) (LAPTOP-Q81MN7NI executor driver): java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
at org.apache.spark.rdd.RDD.count(RDD.scala:1253)
... 37 elided
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.base/java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:4055)
at java.base/java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3861)
at java.base/java.io.ObjectInputStream.readString(ObjectInputStream.java:2043)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1661)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readArray(ObjectInputStream.java:2102)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2358)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:493)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:451)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:158)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:514)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:538)
at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.readNextHashCode(ExternalAppendOnlyMap.scala:335)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1(ExternalAppendOnlyMap.scala:315)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.$anonfun$new$1$adapted(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$Lambda$3514/0x0000000101065840.apply(Unknown Source)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.<init>(ExternalAppendOnlyMap.scala:313)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:287)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:43)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:118)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)

Spark SQL create table produces exception "$anonfun$createTransformFunc$2: (string) => array)" when there is an array in the temp view

The code is as follows:
val tokenizer = new RegexTokenizer().setPattern("[\\W_]+").setMinTokenLength(4).setInputCol("sendcontent").setOutputCol("tokens")
var tokenized_df = tokenizer.transform(sourDF)
import org.apache.spark.sql.functions.{concat_ws}
val mkString = udf((arrayCol: Seq[String]) => arrayCol.mkString(","))
tokenized_df=tokenized_df.withColumn("words",mkString($"tokens")).drop("tokens")
tokenized_df.createOrReplaceTempView("tempview")
sql(s"drop table if exists $result_table")
sql(s"create table $result_table as select msgid,sendcontent,cast (words as string) as words from tempview")
The exception is as follows:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, svr14614hw2288.hadoop.sh.ctripcorp.com, executor 2): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array)

NullPointerException when using Word2VecModel with UserDefinedFunction

I am trying to pass a Word2Vec model object to my Spark udf. Basically, I have a test set with movie IDs, and I want to pass the IDs along with the model object to get an array of recommended movies for each row.
def udfGetSynonyms(model: org.apache.spark.ml.feature.Word2VecModel) =
  udf((col : String) => {
    model.findSynonymsArray("20", 1)
  })
However, this gives me a NullPointerException. When I run model.findSynonymsArray("20", 1) outside the udf, I get the expected answer. For some reason the call fails inside the udf but runs fine outside it.
Note: I added "20" here just to get a fixed answer to see if that would work. It does the same when I replace "20" with col.
Thanks for the help!
StackTrace:
SparkException: Job aborted due to stage failure: Task 0 in stage 23127.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23127.0 (TID 4646648, 10.56.243.178, executor 149): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$udfGetSynonyms1$1: (string) => array<struct<_1:string,_2:double>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:111)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Word2VecModel.findSynonymsArray(Word2Vec.scala:273)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:7)
at linebb57ebe901e04c40a4fba9fb7416f724554.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$udfGetSynonyms1$1.apply(command-232354:4)
... 12 more
The SQL and udf API is a bit limited and I am not sure if there is a way to use custom types as columns or as inputs to udfs. A bit of googling didn't turn up anything too useful.
Instead, you can use the Dataset or RDD API and just use a regular Scala function instead of a udf, something like:
val model: Word2VecModel = ...
val inputs: Dataset[String] = ...
inputs.map(movieId => model.findSynonymsArray(movieId, 10))
Alternatively, I guess you could serialize the model to and from a string, but that seems much uglier.
I think this issue happens because wordVectors is a transient variable:
class Word2VecModel private[ml] (
    @Since("1.4.0") override val uid: String,
    @transient private val wordVectors: feature.Word2VecModel)
  extends Model[Word2VecModel] with Word2VecBase with MLWritable {
I solved this by broadcasting w2vModel.getVectors and re-creating the Word2VecModel inside each partition.

scala.MatchError on a tuple

After processing some input data, I have an RDD[(String, String, Long)], say input, in hand.
input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54
The string fields here represent vertices of graph and long field is the weight of the edge.
To create a graph out of this, first I am inserting vertices into a map with a unique id if vertex is not known already. If it was already encountered, I use the vertex id that was assigned previously. Essentially, each vertex is assigned a unique id of type Long and then I want to create Edges.
Here is what I am doing:
var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0 // global vertex id counter
var srcVid : Long = 0 // source vertex id
var dstVid : Long = 0 // destination vertex id
val graphEdges = input.map {
  case Row(src: String, dst: String, weight: Long) => {
    if (vertexMap.contains(src)) {
      srcVid = vertexMap(src)
      if (vertexMap.contains(dst)) {
        dstVid = vertexMap(dst)
      } else {
        vid += 1 // pick a new vertex id
        vertexMap += (dst -> vid)
        dstVid = vid
      }
      Edge(srcVid, dstVid, weight)
    } else {
      vid += 1
      vertexMap(src) = vid
      srcVid = vid
      if (vertexMap.contains(dst)) {
        dstVid = vertexMap(dst)
      } else {
        vid += 1
        vertexMap(dst) = vid
        dstVid = vid
      }
      Edge(srcVid, dstVid, weight)
    }
  }
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);
What I see is
graphEdges is of type RDD[org.apache.spark.graphx.Edge[Long]] and graph is of type Graph[Int,Long]
graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a
but I get the following error, while printing the graph's edge and vertex count.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): ***scala.MatchError: (vertexA, vertexN, 2000
)*** (of class scala.Tuple3)
at $anonfun$1.apply(<console>:64)
at $anonfun$1.apply(<console>:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I don't understand where the mismatch is here.
Thanks @Joe K for the helpful tip. I started using zipWithIndex and the code looks compact now; however, graph instantiation still fails. Here is the updated code:
val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
case (src, dst, weight) =>
Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
So, from the original 3-tuples, I take the union of the first and second fields (which are the vertices), remove duplicates, and assign a unique id to each. I then use those ids when creating the edges. However, it fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks
a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed
inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
at $anonfun$1.apply(<console>:55)
at $anonfun$1.apply(<console>:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
Any thoughts?
This specific error is coming from trying to match a tuple as a Row, which it is not.
Change:
case Row(src: String, dst: String, weight: Long) => {
to just:
case (src, dst, weight) => {
Also, your larger plan for generating vertex ids will not work. All of the logic inside the map will happen in parallel in different executors, which will have different copies of the mutable map.
You should flatMap your edges to get a list of all vertexes, then call .distinct.zipWithIndex to assign each vertex a single unique long value. You would then need to re-join with the original edges.
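The three steps above (collect all vertex names, assign one id per distinct name, join the ids back onto the edges) can be sketched in plain Python. This is only an illustration of the data flow, not Spark code: the sample triples are hypothetical, and the names are sorted here purely to make the ids deterministic, whereas zipWithIndex in Spark gives no such ordering guarantee after distinct.

```python
# Hypothetical input in the shape of RDD[(String, String, Long)].
triples = [
    ("vertexA", "vertexN", 2000),
    ("vertexN", "vertexB", 15),
    ("vertexA", "vertexB", 7),
]

# Step 1: "flatMap" every edge to its two vertex names, then "distinct".
names = sorted({name for src, dst, _ in triples for name in (src, dst)})

# Step 2: "zipWithIndex" gives each distinct name exactly one integer id.
vertex_ids = {name: index for index, name in enumerate(names)}

# Step 3: "join" the ids back onto the original edges.
edges = [(vertex_ids[src], vertex_ids[dst], weight)
         for src, dst, weight in triples]

print(vertex_ids)
print(edges)
```

Because the id table is built once and then joined, every edge sees the same id for the same vertex, which the mutable map inside map cannot guarantee when each executor works on its own copy.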

Scala - Spark : return vertex properties from particular node

I have a Graph and I want to compute the max degree. In particular, I want to know all the properties of the vertex with the max degree.
This is the code snippet:
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
val maxDegrees : (VertexId, Int) = graphX.degrees.reduce(max)
max: (a: (org.apache.spark.graphx.VertexId, Int), b: (org.apache.spark.graphx.VertexId, Int))(org.apache.spark.graphx.VertexId, Int)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (2063726182,56387)
val startVertexRDD = graphX.vertices.filter{case (hash_id, (id, state)) => hash_id == maxDegrees._1}
startVertexRDD.collect()
But it returned this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 1 times, most recent failure: Lost task 0.0 in stage 145.0 (TID 5380, localhost, executor driver): scala.MatchError: (1009147972,null) (of class scala.Tuple2)
How can I fix it?
I think this is the problem. Here:
val startVertexRDD = graphX.vertices.filter{case (hash_id, (id, state)) => hash_id == maxDegrees._1}
The pattern expects every element of graphX.vertices to have this shape:
(hash_id, (id, state))
The filter is applied to every vertex, not only to the one with the max degree, so a scala.MatchError is raised as soon as it meets a Tuple2 whose second element is not a Tuple2 of (id, state).
Be careful with this as well:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 1 times, most recent failure: Lost task 0.0 in stage 145.0 (TID 5380, localhost, executor driver): scala.MatchError: (1009147972,null) (of class scala.Tuple2)
Concretely, here:
scala.MatchError: (1009147972,null)
Vertex 1009147972 has a null attribute instead of an (id, state) pair, most likely because it appears in an edge but has no entry in the vertex data, so the nested pattern cannot match it.
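The same failure mode can be reproduced in plain Python, which may make it easier to see. This is an analogy rather than GraphX code: the vertex properties are hypothetical, and Python raises a TypeError where Scala raises a MatchError.

```python
# Hypothetical vertices: (vertex id, properties). The second one has no properties.
vertices = [
    (2063726182, ("station-1", "CA")),
    (1009147972, None),
]
max_degree_id = 2063726182


def strict_filter(vertices, wanted):
    """Destructure the properties of every vertex, like case (hash_id, (id, state))."""
    result = []
    for vertex in vertices:
        vertex_id, (name, state) = vertex  # fails when the properties are None
        if vertex_id == wanted:
            result.append(vertex)
    return result


def tolerant_filter(vertices, wanted):
    """Look only at the vertex id and leave the properties untouched."""
    return [vertex for vertex in vertices if vertex[0] == wanted]


try:
    strict_filter(vertices, max_degree_id)
    strict_failed = False
except TypeError:
    strict_failed = True

start_vertex = tolerant_filter(vertices, max_degree_id)
print(strict_failed)
print(start_vertex)
```

In the Scala code, the equivalent of tolerant_filter would be to match only on the id, for example case (hash_id, _) => hash_id == maxDegrees._1, so that vertices with a null attribute are simply filtered out instead of breaking the match.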
Hope this helps.