two identical graphs considered not equal - networkx

using networkx,
The MultiGraph and MultiDiGraph classes allow you to add the same
edge twice
So I guess it implies the basic class Graph() ignores multiple edges.
I did a test and found that it does ignore multiple edges; however, adding the same edge twice still makes the two graph objects compare as unequal. Can somebody explain why? Thanks
import networkx as nx
G1 = nx.Graph()
G1.add_edge(1, 2)
G1.edges() # [(1, 2)]
G1.degree(1) # 1
G2 = nx.Graph()
G2.add_edges_from([(1, 2), (1, 2)])
G2.edges() # [(1, 2)]
G2.degree(1) # 1
G1==G2 # False

Your graphs are isomorphic (they have the same structure) but they are different Python objects: Graph does not define __eq__, so == falls back to comparing object identity. You can test isomorphism with nx.is_isomorphic.
import networkx as nx
G1 = nx.Graph()
G1.add_edge(1, 2)
G1.edges() # [(1, 2)]
G1.degree(1) # 1
G2 = nx.Graph()
G2.add_edges_from([(1, 2), (1, 2)])
G2.edges() # [(1, 2)]
G2.degree(1) # 1
print(G1 == G2)  # False
print(repr(G1), repr(G2))
print(nx.is_isomorphic(G1, G2))
#OUTPUT
# False
#<networkx.classes.graph.Graph object at 0x7fda3174c050> <networkx.classes.graph.Graph object at 0x7fda3463e890>
#True
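If what you actually want is a structural equality check rather than an isomorphism test, here is a small sketch (my addition, not part of the original answer) that compares the node and edge sets directly; newer NetworkX releases (2.7+, as far as I know) also ship nx.utils.graphs_equal, which additionally compares attribute data:
# order-insensitive comparison of structure only
print(set(G1.nodes()) == set(G2.nodes()) and set(G1.edges()) == set(G2.edges()))  # True
# only if your NetworkX version provides it (2.7+); also checks node/edge/graph attributes
print(nx.utils.graphs_equal(G1, G2))  # True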

Related

Transform maptype in Pyspark

I have a pyspark dataframe with 500k rows, each row has a maptype with 10k (key, value) items. The keys are the same for each row, e.g., k0, k1, ..., k9999.
What I want is to run some interpolation on the 10k values in each row and then take a percentile (e.g., the 50th). It seems there are two ways to do this:
First explode the maptype into columns, then do the interpolation.
Run the interpolation on the maptype, then explode into columns to get the statistics.
I have used pandas for some time but am quite new to PySpark. I'd very much appreciate it if you could shed some light on:
whether I should explode the maptype first;
how to do the interpolation (either on the maptype or on the columns). This seems to be an easy task with numpy, but I am not sure how to apply it across the maptype/columns with PySpark.
The following is a simple example
What I have
from pyspark.sql.functions import map_values
df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.show(20, False)
+------------------------+
|data |
+------------------------+
|[a -> 1, b -> 3, c -> 2]|
+------------------------+
What I want is to call the interp1d function to get the result/median (see below) for the maptype values [1, 3, 2].
import numpy as np
from scipy.interpolate import interp1d
x = (np.linspace(0, 5, 11), np.linspace(0, 5, 11)**2)
f = interp1d(x[0], x[1], kind = 'linear', fill_value ='extrapolate', assume_sorted = False )
result = f([1,3,2])
median = np.percentile(result, 50)
print(f'result: {result}\nmedian: {median}')
result: [1. 9. 4.]
median: 4.0
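One possible approach, sketched here as an assumption rather than a definitive answer (it assumes Spark 3.x with pandas UDFs available; the function name interp_median is mine), is to keep the maptype and run the interpolation per row on map_values(data):
import numpy as np
import pandas as pd
from pyspark.sql.functions import map_values, pandas_udf
from pyspark.sql.types import DoubleType
from scipy.interpolate import interp1d

@pandas_udf(DoubleType())
def interp_median(vals: pd.Series) -> pd.Series:
    # vals holds one entry per row: the array produced by map_values(data)
    x = np.linspace(0, 5, 11)
    f = interp1d(x, x ** 2, kind='linear', fill_value='extrapolate', assume_sorted=False)
    return vals.apply(lambda v: float(np.percentile(f(v), 50)))

df = spark.sql("SELECT map('a', 1, 'b', 3, 'c', 2) as data")
df.withColumn("median", interp_median(map_values("data"))).show(truncate=False)
With 10k keys per row, keeping the map and interpolating inside the UDF avoids materializing 10k columns, so exploding to columns first is likely the slower route.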

Not getting correct label Graphframe LPA

I am using GraphFrames label propagation (LPA) to find communities, but somehow it's not giving me the expected result.
graph_data = spark.createDataFrame([
    ("a", "d", "friend"),
    ("b", "d", "friend"),
    ("c", "d", "friend")
], ["src", "dst", "relationship"])
My requirement is to get a single community id for all of the vertices a, b, c and d, but I am getting two different community ids: one for a, b, c and one for d.
code:
from graphframes import GraphFrame

df1 = graph_data.selectExpr('src AS id')
df2 = graph_data.selectExpr('dst AS id')
vertices = df1.union(df2).distinct()
edges = graph_data
g = GraphFrame(vertices, edges)
communities = g.labelPropagation(maxIter=5)
Given that d is a root, it ends up with a separate label. To get a single label for the whole component, I recommend using connected components instead; see the docs.
communities = g.connectedComponents()
Note: this requires setting a checkpoint directory beforehand.
sc.setCheckpointDir("some_path")
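Putting it together (a minimal sketch; the checkpoint path below is just an example), all four vertices should then share the same component id:
sc.setCheckpointDir("/tmp/graphframes_checkpoints")  # example path, any writable directory works
g = GraphFrame(vertices, edges)
communities = g.connectedComponents()
communities.show()  # a, b, c and d now share one 'component' value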

NetworkX - Connected Columns

I'm trying to create a data visualization, and maybe NetworkX isn't the best tool, but I would like to have two parallel columns of nodes (two separate groups) connected to each other. I can't figure out how to place the two groups of nodes in this layout; the options I have tried always default to a more 'web-like' layout. I'm trying to create a visualization where customer/company nodes (the keys of a dictionary) have edges drawn to product nodes (the values of the same dictionary).
For example:
d = {"A":[ 1, 2, 3], "B": [2,3], "C": [1,3]
From dictionary 'd', we would have one column of nodes ["A", "B", "C"], a second column [1, 2, 3], and edges drawn between the two:
A 1
B 2
C 3
UPDATE:
So the suggested 'pos' argument helped, but I had difficulties using it with multiple groups of nodes. Here is the method I came up with:
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A","Y"),("C","X"), ("C","Z")]
#This function will take a list of values we want to turn into nodes
# Then it assigns a y-value for a specific value of X creating columns
def create_pos(column, node_list):
pos = {}
y_val = 0
for key in node_list:
pos[key] = (column, y_val)
y_val = y_val+1
return pos
G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos)
Here is the code I wrote with the help of @wolfevokcats to create columns of nodes which are connected.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A", "Y"), ("C", "X"), ("C", "Z")]

# This function takes a list of values we want to turn into nodes,
# then assigns a y-value for a fixed x-value, creating columns
def create_pos(column, node_list):
    pos = {}
    y_val = 0
    for key in node_list:
        pos[key] = (column, y_val)
        y_val = y_val + 1
    return pos

G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos, with_labels=True)
plt.show()
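As an aside (my addition, not part of the original answer): if your NetworkX version is recent enough, nx.bipartite_layout computes this two-column layout for you, so the helper function above is not needed:
pos = nx.bipartite_layout(G, nodes)  # 'nodes' becomes the left-hand column
nx.draw(G, pos, with_labels=True)
plt.show()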

How can I group RDD by key then count per unique string?

I have an RDD like:
[(1, "Western"),
(1, "Western")
(1, "Drama")
(2, "Western")
(2, "Romance")
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in
(1, [("Western", 2), ("Drama", 1)]), ...
After that my intention is to pick the genre with the largest count, thus obtaining the most popular genre per user.
I have tried userGenre.sortByKey().countByValue(), but to no avail; I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
EDIT:
I have tried the following and it seems to have worked, could someone confirm?
userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0,                      # initial value for an accumulator
    lambda r, v: r + v,     # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2  # function that merges/combines two accumulators
)
Here is one way of doing it:
rdd = sc.parallelize([('u1', "Western"),('u2', "Western"),('u1', "Drama"),('u1', "Western"),('u2', "Romance"),('u2', "Romance")])
The occurrence of each movie genre could be
>>> rdd = sc.parallelize(rdd.countByValue().items())
>>> rdd.map(lambda ((x,y),z): (x,(y,z))).groupByKey().map(lambda (x,y): (x, [y for y in y])).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre
>>> rdd.map(lambda (x,y): ((x,y),1)).reduceByKey(lambda x,y: x+y).map(lambda ((x,y),z):(x,(y,z))).groupByKey().mapValues(lambda (x,y): (y)).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now, one could ask: what should the most popular genre be if more than one genre has the same popularity count?
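One possible way to keep ties, sketched here as my own addition rather than part of the answer, is to group the (genre, count) pairs per user and keep every genre whose count equals the maximum:
def keep_ties(genre_counts):
    # genre_counts is the iterable of (genre, count) pairs for one user
    genre_counts = list(genre_counts)
    best = max(count for _, count in genre_counts)
    return [(genre, count) for genre, count in genre_counts if count == best]

counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(keep_ties).collect()
# [('u1', [('Western', 2)]), ('u2', [('Romance', 2)])]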

Computing F-measure for clustering

Can anyone help me calculate the F-measure collectively? I know how to calculate recall and precision, but I don't know how to compute a single F-measure value for a given algorithm.
As an example, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm).
I found one pdf, but it was not useful, since the collective value I got is greater than 1. The reference for the pdf is "F Measure explained". Specifically, I have read research papers in which the authors compare two algorithms on the basis of F-measure, and they get collective values between 0 and 1.
If you read the pdf mentioned above carefully, the formula is F(C, K) = Σ_i (|ci| / N) · max_j F(ci, kj),
where ci is a reference cluster and kj is a cluster created by the other algorithm; i runs from 1 to n and j runs from 1 to m. Say |c1| = 218 and, as per the pdf, N = m·n with m = 12 and n = 10, and the maximum of F(c1, kj) is attained at j = 2. F(c1, k2) is definitely between 0 and 1, but the value computed by the formula above comes out greater than 1.
The term F-measure itself is underspecified. It is the harmonic mean, usually of precision and recall. Strictly, you should even say F1-score if you mean the unweighted version, because you can put different weights on the two input values. But without saying which two values are averaged (not in the sense of the arithmetic mean!), this doesn't say much.
https://en.wikipedia.org/wiki/F1_score
Note that the values must be in the 0-1 value range. Otherwise, you have an error earlier on.
In cluster analysis, the common approach is to apply the F1-Measure to the precision and recall of pairs, often referred to as "pair counting f-measure". But you could compute the same mean on other values, too.
Pair-counting has the nice property that it doesn't directly compare clusters, so the result is well defined when one result has m clusters and the other has n clusters. However, pair counting needs strict partitions. When elements are not clustered or are assigned to more than one cluster, the pair-counting measures can easily fall outside the range 0-1.
E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, A. Zimek:
Evaluation of Clusterings - Metrics and Visual Support.
Int. Conf. on Data Engineering (ICDE 2012).
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDE.2012.128
This paper discusses some of these metrics (including the Rand index and such) and gives a simple explanation of the "pair-counting F-measure".
The paper Characterization and evaluation of similarity measures for pairs of clusterings by Darius Pfitzner, Richard Leibbrandt and David Powers contains a lot of useful information regarding this subject, including the following example:
Given the set,
D = {1, 2, 3, 4, 5, 6}
and the partitions,
P = {1, 2, 3}, {4, 5}, {6}, and
Q = {1, 2, 4}, {3, 5, 6}
where P is the partition created by our algorithm and Q is the partition created by the standard algorithm that we know,
PairsP = {(1, 2), (1, 3), (2, 3), (4, 5)},
PairsQ = {(1, 2), (1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}, and
PairsD = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4),
(2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}
so,
a = |PairsP ∩ PairsQ| = |{(1, 2)}| = 1
b = |PairsP - PairsQ| = |{(1, 3), (2, 3), (4, 5)}| = 3
c = |PairsQ - PairsP| = |{(1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}| = 5
F-measure = 2a / (2a + b + c) = 2 / (2 + 3 + 5) = 0.2
Note: there is an error in the publication on page 364, where a, b, c, and d are computed: the results for b and c are switched. This switch would throw off some other measures; the F-measure, obviously, is unaffected.
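A small sketch (my addition, not taken from the paper) that reproduces the worked example above by enumerating co-clustered pairs:
from itertools import combinations

def pairs(partition):
    # all unordered pairs of elements that share a cluster
    return {frozenset(p) for cluster in partition for p in combinations(cluster, 2)}

P = [{1, 2, 3}, {4, 5}, {6}]
Q = [{1, 2, 4}, {3, 5, 6}]

pairs_P, pairs_Q = pairs(P), pairs(Q)
a = len(pairs_P & pairs_Q)  # 1
b = len(pairs_P - pairs_Q)  # 3
c = len(pairs_Q - pairs_P)  # 5
print(2 * a / (2 * a + b + c))  # 0.2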
The N in your formula, F(C, K) = Σ_i (|ci| / N) · max_j F(ci, kj), is the sum of |ci| over all i, i.e. it is the total number of elements. You are perhaps mistaking it for the number of clusters and therefore getting an answer greater than one. If you make that change, your answer will be between 0 and 1.