How to display/visualize a graph created by GraphFrame? - pyspark

I have created a graph using GraphFrame
g = GraphFrame(vertices, edges)
Apart from analyzing the graph using the queries and the properties offered by the GraphFrame, I would like to visualize the graph to use in a presentation.
Do you know of any tool/library / API / code that allows this visualization in a simple way?

Not a simple way, but you can use the python-igraph library, https://igraph.org/. I used it from R, but Python should be similar; see the simple example below. The main problem with all such tools is that you have to carefully select a small subgraph to draw.
Install it:
pip install python-igraph
The simplest visualisation:
g = GraphFrame(vertices, edges)
from igraph import *
ig = Graph.TupleList(g.edges.collect(), directed=True)
plot(ig)
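If the full graph is too large to draw, one way to pick a small subgraph (a sketch; the filter predicate and the limit of 500 are just example choices, not from the original answer) is to restrict the edge DataFrame before collecting it to the driver:
# keep only one relationship type and cap the number of edges pulled to the driver
small_edges = g.edges.filter("relationship = 'friend'").limit(500)
ig_small = Graph.TupleList(small_edges.collect(), directed=True)
plot(ig_small)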

Another way is to use the plotting functionality of the networkx graph library:
import networkx as nx
import matplotlib.pyplot as plt
from graphframes import GraphFrame
from pyspark.sql import SparkSession, SQLContext

def PlotGraph(edge_list):
    # pull at most 1000 edges to the driver and add them to a networkx graph
    Gplot = nx.Graph()
    for row in edge_list.select('src', 'dst').take(1000):
        Gplot.add_edge(row['src'], row['dst'])
    plt.subplot(121)
    nx.draw(Gplot)
    plt.show()

spark = SparkSession \
    .builder \
    .appName("PlotAPp") \
    .getOrCreate()
sqlContext = SQLContext(spark.sparkContext)
vertices = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("e1", "Esther2", 32),
("f", "Fanny", 36),
("g", "Gabby", 60),
("h", "Mark", 61),
("i", "Gunter", 62),
("j", "Marit", 63)], ["id", "name", "age"])
edges = sqlContext.createDataFrame([
("a", "b", "friend"),
("b", "a", "follow"),
("c", "a", "follow"),
("c", "f", "follow"),
("g", "h", "follow"),
("h", "i", "friend"),
("h", "j", "friend"),
("j", "h", "friend"),
("e", "e1", "friend")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
PlotGraph(g.edges)
see also PYSPARK: how to visualize a GraphFrame?

Related

Getting the first item for a tuple for each row in a list in Scala

I am looking to do this in Scala, but nothing I try works. In pyspark it obviously works:
from operator import itemgetter
rdd = sc.parallelize([(0, [(0,'a'), (1,'b'), (2,'c')]), (1, [(3,'x'), (5,'y'), (6,'z')])])
# wrap map(...) in list() so the result is a plain list under Python 3
mapped = rdd.mapValues(lambda v: list(map(itemgetter(0), v)))
Output
mapped.collect()
[(0, [0, 1, 2]), (1, [3, 5, 6])]
val rdd = sparkContext.parallelize(List(
  (0, Array((0, "a"), (1, "b"), (2, "c"))),
  (1, Array((3, "x"), (5, "y"), (6, "z")))
))

rdd
  .mapValues(v => v.map(_._1))
  .foreach(v => println(v._1 + "; " + v._2.toSeq.mkString(",")))
Output:
0; 0,1,2
1; 3,5,6
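One caveat worth noting (my addition, not part of the original answer): foreach(println) runs on the executors, so on a real cluster the lines may not appear in the driver's console. A sketch that collects the transformed values back to the driver first:
val collected = rdd
  .mapValues(v => v.map(_._1).toSeq)
  .collect()   // Array[(Int, Seq[Int])] on the driver
collected.foreach { case (k, vs) => println(s"$k; ${vs.mkString(",")}") }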

Function "shared neighbors distribution" in igraph

I wonder if someone could give me a clue about writing a function like "shared neighbors distribution", given clusters. I found this feature in Cytoscape NetworkAnalyzer very informative for my research purposes. Since I have many clusters to analyze, it would be handy to write a script for this job. Suggestions using igraph, networkx, etc. are welcome. Thank you very much!
For example:
edgelist <- read.table(text = "
E B
E A
B D
B F
B C
A C
A F
A D")
library(igraph)
graph <- graph.data.frame(edgelist,directed=F)
plot(graph)
We would see a graph like this:
[image: plot of the example graph]
Any two of the nodes C, D, E, F share nodes A and B as neighbors; there are 6 such pairs. Nodes A and B share the four nodes C, D, E, F.
In total, the summary should look like this:
[image: shared neighbors distribution summary]
Instead of writing a loop (to get the neighbors for each vertex, and compare them), I wonder if there are better solutions.
NetworkX has no built-in function for this problem, so you have to do it manually. Moreover, even if it had one, that function would use node loops internally, so a node loop is actually an optimal or near-optimal solution. You can use a Python defaultdict to make the code simpler:
import networkx as nx
from collections import defaultdict
G = nx.Graph()
G.add_edges_from([
("E", "B"),
("E", "A"),
("B", "D"),
("B", "F"),
("B", "C"),
("A", "C"),
("A", "F"),
("A", "D")
])
snd = defaultdict(int)
for n1 in G.nodes:
    for n2 in G.nodes:
        len_nbrs = len(set(G.neighbors(n1)) & set(G.neighbors(n2)))
        if len_nbrs:
            snd[len_nbrs] += 1
snd
So snd will look like this:
defaultdict(int, {2: 16, 4: 4})
Thank you so much @vurmux for the framework and idea. I just adjusted the pair combinations to avoid self-pairs, duplicates, etc., and then we get the correct answer. Great, cheers!
import networkx as nx
from collections import defaultdict
from itertools import combinations
G = nx.Graph()
G.add_edges_from([
("E", "B"),
("E", "A"),
("B", "D"),
("B", "F"),
("B", "C"),
("A", "C"),
("A", "F"),
("A", "D")
])
snd = defaultdict(int)
l = G.nodes()
comb = combinations(l, 2)  # combinations('ABCD', 2) --> AB AC AD BC BD CD
for i in list(comb):
    len_nbrs = len(set(G.neighbors(i[0])) & set(G.neighbors(i[1])))
    if len_nbrs:
        snd[len_nbrs] += 1
snd
Now we have:
defaultdict(int, {2: 6, 4: 1})
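Since the question mentions analyzing many clusters, here is a minimal sketch (my addition; the shared_neighbors_distribution name and the clusters dict are hypothetical) that wraps the computation above in a reusable function and applies it to each cluster's induced subgraph, reusing the imports from the previous snippet:
def shared_neighbors_distribution(graph):
    # count, for every unordered pair of nodes, how many neighbors they share
    dist = defaultdict(int)
    for u, v in combinations(graph.nodes(), 2):
        shared = len(set(graph.neighbors(u)) & set(graph.neighbors(v)))
        if shared:
            dist[shared] += 1
    return dict(dist)

# clusters is assumed to map a cluster id to its list of member nodes
clusters = {"cluster_1": ["A", "B", "C", "D", "E", "F"]}
for cid, nodes in clusters.items():
    print(cid, shared_neighbors_distribution(G.subgraph(nodes)))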

Spark Scala - but class RDD is invariant in type T

Another question from the uninitiated:
Two RDDs that appear to have the same type but do not. As follows:
val rdd0 = sc.parallelize( List("a", "b", "c", "d", "e"))
val rdd1 = rdd0.map(x => (x, 110 - x.toCharArray()(0).toByte ))
val rdd2 = sc.parallelize( List(("c", 2), ("d, 2)", ("e", 2), ("f", 2))))
//Seemingly the same type but not, how practically to get them to be UNIONed?
val rddunion = rdd1.union(rdd2).collect()
Get this:
<console>:182: error: type mismatch;
found : org.apache.spark.rdd.RDD[Product with Serializable]
required: org.apache.spark.rdd.RDD[(String, Int)]
Note: Product with Serializable >: (String, Int), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
val rddunion = rdd1.union(rdd2).collect()
^
How does a novice get this to work? I can sort of see now why people are a little hesitant with Scala. I have read some of the docs, but it's not entirely clear. How do I get this UNION of RDDs to work?
Very grateful.
You are writing a quote " in the wrong place: ("d, 2)".
so instead of
val rdd2 = sc.parallelize( List(("c", 2), ("d, 2)", ("e", 2), ("f", 2))))
correct one is
val rdd2 = sc.parallelize( List(("c", 2), ("d", 2), ("e", 2), ("f", 2)))

Pyspark - after groupByKey and count distinct value according to the key?

I would like to find how many distinct values according to the key, for example, suppose I have
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
And I have done using groupByKey
sorted(x.groupByKey().map(lambda x : (x[0], list(x[1]))).collect())
x.groupByKey().mapValues(len).collect()
the output will be like:
[('a', [1, 1, 2]), ('b', [1, 2])]
[('a', 3), ('b', 2)]
However, I want to find distinct values in the list, the output should be like,
[('a', [1, 2]), ('b', [1, 2])]
[('a', 2), ('b', 2)]
I am very new to Spark and tried to apply the distinct() function somewhere, but it all failed :-(
Thanks a lot in advance!
You can use set instead of list:
sorted(x.groupByKey().map(lambda x : (x[0], set(x[1]))).collect())
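Building on that (a small sketch of mine, not from the original answer), the distinct count per key is then just the size of each set:
x.groupByKey().mapValues(lambda vals: len(set(vals))).collect()
# [('a', 2), ('b', 2)]  (key order may vary)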
You can try a number of approaches for this. I solved it using the approach below:
from operator import add
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
# key by the (key, value) pair so duplicates collapse, then count pairs per original key
x = x.map(lambda n: ((n[0], n[1]), 1))
x.groupByKey().map(lambda n: (n[0][0], 1)).reduceByKey(add).collect()
Output:
[('b', 2), ('a', 2)]
Hope this helps.
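Since the question mentions trying distinct() without success, here is one more sketch (my addition, not from the original answers): apply distinct() to the pair RDD first, then count per key on the driver:
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
x.distinct().countByKey()   # {'a': 2, 'b': 2}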

Sequence transformations in scala

In Scala, is there a simple way of transforming this sequence
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
into this Seq(("a", 4), ("b", 7), ("c", 4))?
Thanks
I'm not sure if you meant to have Strings in the second ordinate of the tuple. Assuming Seq[(String, Int)], you can use groupBy to group the elements by the first ordinate:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
.groupBy(_._1)
.mapValues(_.map(_._2).sum)
.toSeq
Otherwise, you'll need an extra .toInt.
Here is another way by unzipping and using the second item in the tuple.
val sq = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
sq.groupBy(_._1)
.transform {(k,lt) => lt.unzip._2.sum}
.toSeq
The above code details:
scala> sq.groupBy(_._1)
res01: scala.collection.immutable.Map[String,Seq[(String, Int)]] = Map(b -> List((b,2), (b,5)), a -> List((a,1), (a,3)), c -> List((c,4)))
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}
res02: scala.collection.immutable.Map[String,Int] = Map(b -> 7, a -> 4, c -> 4)
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}.toSeq
res03: Seq[(String, Int)] = ArrayBuffer((b,7), (a,4), (c,4))
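On Scala 2.13 and later (an additional note of mine, assuming that version is available), groupMapReduce collapses the groupBy, map and sum into a single call:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
  .groupMapReduce(_._1)(_._2)(_ + _)
  .toSeq
// Seq((a,4), (b,7), (c,4)) -- ordering of the keys is not guaranteed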