Suppose we have the following input in Apache Spark GraphX:
Vertex RDD:
val vertexArray = Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Charlie"),
(4L, "David"),
(5L, "Ed"),
(6L, "Fran")
)
Edge RDD:
val edgeArray = Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 1),
Edge(3L, 4L, 1),
Edge(5L, 6L, 1)
)
I need all the vertices connected to a node in Apache Spark GraphX, grouped by component, like this:
1,[1,2,3,4]
5,[5,6]
You can use connectedComponents, which returns a graph whose vertex value is the lowest vertex id in the connected component containing that vertex, and then reshape the result:
graph.connectedComponents.vertices.map(_.swap).groupByKey
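Putting it together with the sample data above (a sketch; building the graph and the final collect-and-print step are my additions, assuming a spark-shell style sc):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Build the graph from the sample vertex and edge arrays
val vertexRDD: RDD[(VertexId, String)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph = Graph(vertexRDD, edgeRDD)
// connectedComponents tags each vertex with the lowest id in its component;
// swap to key by component id and group the members
val components = graph.connectedComponents().vertices
  .map(_.swap)
  .groupByKey()
components.collect().foreach { case (id, members) =>
  println(s"$id,${members.toList.sorted.mkString("[", ",", "]")}")
}
// 1,[1,2,3,4]
// 5,[5,6]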
I have the following DataFrame in Spark 2.2:
df =
v_in v_out
123 456
123 789
456 789
This df defines edges of a graph. Each row is a pair of vertices. I want to extract the Array of edges in order to create an RDD of edges as follows:
val edgeArray = Array(
Edge(2L, 1L, 0.0),
Edge(2L, 4L, 0.2),
Edge(3L, 2L, 0.9),
Edge(3L, 6L, 0.1),
Edge(4L, 1L, 0.0),
Edge(5L, 2L, 0.8),
Edge(5L, 3L, 0.7),
Edge(5L, 6L, 0.5)
)
val spark = SparkSession.builder()
.appName("ES")
.master("local[*]")
.getOrCreate()
implicit val sparkContext = spark.sparkContext
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
How can I obtain an edgeArray with the same structure from df? In each Edge, the third value can be any random Double value from 0 to 1.
UPDATE:
I did it this way, but I'm not sure it's the optimal solution:
val edgeArray = df.rdd.collect().map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, 0.0))
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
I don't want to go through an Array, because I might have millions of edges. Can I convert the DataFrame to an RDD of edges more directly?
Given
val df = Seq((123, 456), (123, 789), (456, 789)).toDF("v_in", "v_out")
Import
import org.apache.spark.sql.functions.rand
import org.apache.spark.graphx.Edge
import spark.implicits._  // provides the Edge[Double] encoder used by .as[...] (already in scope in spark-shell)
and convert:
val edgeRDD = df.toDF("srcId", "dstId")
.withColumn("attr", rand)
.as[Edge[Double]].rdd
With graphframes:
spark.jars.packages graphframes:graphframes:X.X.X-sparkY.Y-s_Z.ZZ
where X.X.X is the package version, Y.Y the Spark version and Z.ZZ the Scala version, you can create a Graph like this:
GraphFrame.fromEdges(df.toDF("src", "dst")).toGraphX
but it'll use Row attributes.
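Neither approach materializes the edges on the driver. With the first one, the resulting edge RDD can be fed straight into GraphX; a minimal sketch (the default vertex attribute 0.0 is my choice):
import org.apache.spark.graphx.Graph
// edgeRDD is the RDD[Edge[Double]] produced above; vertices get the default attribute 0.0
val graph = Graph.fromEdges(edgeRDD, 0.0)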
I am relatively new to Spark GraphX. Basically my graph has:
2 vertex types: person and car
edges describe which person owns which car
Given all the person vertices in the graph, I want to traverse the edges and collect a list of cars for each person,
e.g.
person1 -> [car1, car2]
person2 -> [car3]
You can achieve this with a bit of SQL.
Let's assume that you have the following graph:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Create an RDD for the vertices
val v: RDD[(VertexId, String)] =
sc.parallelize(Array((1L, "car1"), (2L, "car2"),
(3L, "car3"), (4L, "person1"), (5L, "person2")))
// Create an RDD for the edges
val e: RDD[Edge[Int]] =
sc.parallelize(Array(Edge(4L, 1L, 1), Edge(4L, 2L, 1),
Edge(5L, 1L, 1)))
val graph = Graph(v, e)
Now extract the edges and vertices into DataFrames:
import spark.implicits._                          // for toDF and the 'col symbol syntax (already in scope in spark-shell)
import org.apache.spark.sql.functions.collect_set
val vDf = graph.vertices.toDF("vId", "vName")
val eDf = graph.edges.toDF("person", "car", "attr")
Transform the data into the desired output:
eDf.drop("attr")
  .join(vDf, 'person === 'vId).drop("vId", "person").withColumnRenamed("vName", "person")
  .join(vDf, 'car === 'vId).drop("car", "vId")
  .groupBy("person")
  .agg(collect_set('vName)).toDF("person", "car")
  .show()
+-------+------------+
| person| car|
+-------+------------+
|person2| [car1]|
|person1|[car2, car1]|
+-------+------------+
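If you prefer to stay in GraphX, the same result can be sketched with collectNeighbors, which gathers each vertex's outgoing neighbours (here, the cars a person owns); the variable names are mine:
import org.apache.spark.graphx._
val carsPerPerson = graph
  .collectNeighbors(EdgeDirection.Out)              // VertexRDD[Array[(VertexId, String)]]
  .join(graph.vertices)                             // attach each vertex's own name
  .filter { case (_, (nbrs, _)) => nbrs.nonEmpty }  // persons are the only vertices with outgoing edges
  .map { case (_, (nbrs, name)) => (name, nbrs.map(_._2).toList) }
carsPerPerson.collect().foreach(println)
// e.g. (person1,List(car1, car2))
//      (person2,List(car1))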
I'm trying to run the KMeans example from here.
This is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.clustering.KMeans

def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[10]")//.set("spark.sql.warehouse.dir", "file:///")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// Creates a DataFrame
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)).toDF("id", "features")
// Trains a k-means model
val kmeans = new KMeans()
.setK(2)
.setFeaturesCol("features")
.setPredictionCol("prediction")
val model = kmeans.fit(dataset)
// Shows the result
println("Final Centers: ")
model.clusterCenters.foreach(println)
}
The error follows:
Information: 2016/9/19 3:36 PM - Compilation completed with 1 error and 0 warnings in 2s 454ms
D:\IdeaProjects\de\src\main\scala\com.te\KMeansExample.scala
Error:Error:line (18)No TypeTag available for (Int, org.apache.spark.mllib.linalg.Vector)
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
Some details:
1. When I run this with Spark 1.6.2 and Scala 2.10.6, compilation fails with the error above. But when I change the Scala version to 2.11.0, it runs fine.
2. I run this code in Hue, which submits the job to my cluster through Livy; the cluster is built with Spark 1.6.2 and Scala 2.10.6.
Can anybody help me? Thanks.
I am not completely sure about the cause of this problem, but I think it is because Scala reflection in older Scala versions could not work out the TypeTag of function parameters whose types had not yet been inferred.
In this case,
val dataset: DataFrame = sqlContext.createDataFrame(Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)).toDF("id", "features")
The argument Seq((1, Vectors.dense(0.0, 0.0, 0.0)), ...) is seen by Scala for the first time right there, so its type has not yet been inferred, and Scala reflection therefore cannot work out the associated TypeTag.
So my guess is that if you just move it out and let Scala infer the type first, it will work:
val vectorSeq = Seq(
(1, Vectors.dense(0.0, 0.0, 0.0)),
(2, Vectors.dense(0.1, 0.1, 0.1)),
(3, Vectors.dense(0.2, 0.2, 0.2)),
(4, Vectors.dense(9.0, 9.0, 9.0)),
(5, Vectors.dense(9.1, 9.1, 9.1)),
(6, Vectors.dense(9.2, 9.2, 9.2))
)
val dataset: DataFrame = sqlContext.createDataFrame(vectorSeq).toDF("id", "features")
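Equivalently, once vectorSeq has its own inferred type, the implicit toDF conversion can replace createDataFrame; a minimal sketch under the same setup:
import sqlContext.implicits._
val dataset: DataFrame = vectorSeq.toDF("id", "features")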
I can do joins on two Spark DStreams like:
val joinStream = stream1.join(stream2)
Now, what if I need to keep only the records that weren't joined, essentially something like stream1.anti-join(stream2)? Is this possible somehow?
Thanks, I appreciate any help!
Assuming you had these:
val rdd1 = sc.parallelize(Array(
(1, "one"),
(2, "twow"),
(3, "three"),
(4, "four"),
(5, "five")
))
val rdd2 = sc.parallelize(Array(
(1, "otherOne"),
(4, "otherFour"),
(5,"otherFive"),
(6,"six"),
(7,"seven")
))
val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
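The same pattern carries over to streaming, since pair DStreams also provide fullOuterJoin; a sketch, assuming stream1 and stream2 are the keyed DStreams from the question:
val antiJoinedStream = stream1.fullOuterJoin(stream2)
  .filter { case (_, (left, right)) => left.isEmpty || right.isEmpty }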
For example, if I have two graphs with vertices and edges like this:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexRdd1: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 65))
))
val edgeRdd1: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 8)
))
val vertexRdd2: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 28)),
(4L, ("d", 27)),
(5L, ("e", 65))
))
val edgeRdd2: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 4),
Edge(3L, 5L, 1),
Edge(2L, 4L, 1)
))
How can I get the number of common edges between these two graphs, without considering the edge attribute? In the above example the number of common edges is 2: Edge(1L, 2L, 1) is common with Edge(1L, 2L, 1), and Edge(2L, 3L, 8) is common with Edge(2L, 3L, 4).
I am programming in Scala.
Assuming you have graph1 (Graph(vertexRdd1, edgeRdd1)) and graph2 (Graph(vertexRdd2, edgeRdd2)), you can map the edges to (srcId, dstId) pairs and then use the intersection method:
val srcDst1 = graph1.edges.map(e => (e.srcId, e.dstId))
val srcDst2 = graph2.edges.map(e => (e.srcId, e.dstId))
srcDst1.intersection(srcDst2).count()
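If you also need the common edges themselves rather than just the count, the intersection can be inspected directly; a small sketch under the same assumptions:
val commonEdges = srcDst1.intersection(srcDst2)   // RDD[(VertexId, VertexId)]
commonEdges.collect().foreach(println)
// e.g. (1,2)
//      (2,3)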