I made a simple network:
import networkx as nx
import matplotlib.pyplot as plt
G=nx.Graph()
G.add_node("a")
G.add_nodes_from(["b","c"])
G.add_edge(1,2)
edge = ("d", "e")
G.add_edge(*edge)
edge = ("a", "b")
G.add_edge(*edge)
print("Nodes of graph: ")
print(G.nodes())
print("Edges of graph: ")
print(G.edges())
# adding a list of edges:
G.add_edges_from([("a","c"),("c","d"), ("a",1), (1,"d"), ("a",2)])
nx.draw(G)
plt.show() # display
I would like to make a list of all possible paths that start and end at the same node, for every node. For example, paths starting from node "a" and ending at node "a" could be:
a-2-1-a, a-c-b-1-a, ....
Paths starting from node "2" and ending at node "2" could be:
2-a-1-2, 2-1-d-c-a-2, ....
How can I do it?
A path that starts and ends at the same node is called a cycle. Since a cycle can be traversed repeatedly, as soon as the graph contains a cycle the number of such paths is infinite. You can find a list of cycles that form a basis for the cycles of G by calling nx.cycle_basis(G).
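For example, on a graph built from the question's edges (a minimal sketch; the order of the returned basis cycles may vary):
import networkx as nx
G = nx.Graph()
G.add_edges_from([("d", "e"), ("a", "b"), ("a", "c"), ("c", "d"), ("a", 1), (1, "d"), ("a", 2), (1, 2)])
# A list of cycles forming a basis for the cycles of G
print(nx.cycle_basis(G))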
If you wish to explicitly compute all such paths without repeated nodes, it can be done as follows:
import networkx as nx
G=nx.Graph()
G.add_edges_from([("d", "e"), ("a", "b"), ("a","c"), ("c","d"), ("a",1), (1,"d"), ("a",2), (1,2)])
node_to_cycles = {}
for source in G.nodes():
    paths = []
    for target in G.neighbors(source):
        paths += [l + [source] for l in list(nx.all_simple_paths(G, source=source, target=target)) if len(l) > 2]
    node_to_cycles[source] = paths
print(node_to_cycles['a'])
This implementation is based on nx.all_simple_paths. Since a simple path contains each node at most once, we search for simple paths from the node to each of its neighbors and then append the node itself to close the cycle.
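If you want the cycles printed in the same a-2-1-a style as the question, a small follow-up sketch:
# Render each cycle found for "a" as a dash-separated string, e.g. "a-c-d-1-a"
for cycle in node_to_cycles["a"]:
    print("-".join(str(n) for n in cycle))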
I've created a subgraph_view by applying a filter to edges. When I call nodes() on the subgraph it still shows me all nodes, even if none of the edges use them. I need to get a list of only the nodes that are still used by edges in the subgraph.
G = nx.path_graph(6)
G[2][3]["cross_me"] = False
G[3][4]["cross_me"] = False
def filter_edge(n1, n2):
    return G[n1][n2].get("cross_me", True)
view = nx.subgraph_view(G, filter_edge=filter_edge)
# node 3 is no longer used by any edges in the subgraph
view.edges()
This produces
EdgeView([(0, 1), (1, 2), (4, 5)])
as expected. However, when I run view.nodes() I get
NodeView((0, 1, 2, 3, 4, 5))
What I expect to see is
NodeView((0, 1, 2, 4, 5))
This seems odd. Is there some way to extract only the nodes used by the subgraph?
The confusion stems from the definition of 'graph.' A disconnected node is still a part of a graph. In fact, you could have a graph with no edges at all. So the behavior of subgraph_view() is counterintuitive but correct.
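For instance, a graph can consist of nothing but isolated nodes:
import networkx as nx
G = nx.Graph()
G.add_nodes_from(["x", "y", "z"])  # no edges at all
print(G.number_of_nodes(), G.number_of_edges())  # 3 0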
If, however, you still want to achieve what you're describing, there are lots of potential ways, depending on your tolerance for modifying the original graph. I'll mention two that attempt to stay as close to your current method as possible and avoid deleting edges or nodes from G.
Method 1
The easiest way, using your view object, is to pass its edges into edge_subgraph() (which takes only edges as input) like this:
final_view = view.edge_subgraph(view.edges())
final_view.nodes()
gives
NodeView((0, 1, 2, 4, 5))
Method 2
To me, Method 1 seems clunky and confusing because it defines an intermediate view. If instead we go back up a little and start with G, we can define a filter_node function that checks the edge attributes of each node and filters that node out if
all edges are flagged for removal, or
the node has no edges in the first place.
You could also do this by manually flagging the node itself, as you've done with the edges.
G = nx.path_graph(6)
G[2][3]["cross_me"] = False
G[3][4]["cross_me"] = False
def filter_edge(n1, n2):
    return G[n1][n2].get("cross_me", True)

def filter_node(n):
    return sum([i[2].get("cross_me", True) for i in G.edges(n, data=True)])
view = nx.subgraph_view(G, filter_node=filter_node, filter_edge=filter_edge)
view.nodes()
also gives the expected
NodeView((0, 1, 2, 4, 5))
pair = list(g.edges())
print(pair)
Why is the second node of each edge not an int?
(The printed result shows the second node of each edge as a float.)
I used firstnode = pair[a][0] and secondnode = int(pair[a][1]) to convert the second node from a float to an int.
But I am still confused: why is it a float in the first place?
I am not entirely sure what your code looks like, but since the second node of every edge in your graph is a float and you want to cast it to an integer, I would suggest doing this:
My code example:
import networkx as nx
# Create dummy graph
g = nx.Graph()
g.add_edge(1,2.0)
g.add_edge(5,4.3)
g.add_edge(8,3.9)
# Get list of all edges
pair = list(g.edges())
print(pair)
# Iterate through all edges
for a in range(len(g.edges())):
    # Get first node of edge
    firstnode = pair[a][0]
    # Get second node of edge and type cast it to int
    secondnode = int(pair[a][1])
    # Print nodes / or execute something else here
    print(firstnode, secondnode)
print()
And this is the output:
[(1, 2.0), (5, 4.3), (8, 3.9)]
1 2
5 4
8 3
I hope that helps!
I had the same question. To describe my case: if I print the edges of my graph:
for edge in G.edges():
    print(edge)
Gives:
('1', '11')
('1', '6')
('2', '2')
...etc
Meaning the node IDs were strings, not ints. So, for example, you would need:
print("Node {} has degree {}".format(node_id, G.degree[node_id]))
print("Node {} has degree {}".format('1', G.degree['i']))
I am working on a community detection algorithm that uses the concept of propagating labels to nodes. I have a problem selecting the right type for the Label_counter variable.
There is an algorithm called LPA (label propagation algorithm) that propagates labels to nodes over a series of iterations; think of labels as a node property. The initial label of each node is its node ID, and in each iteration a node updates its label to the most frequent label among its neighbors. The algorithm I am working on is similar to LPA. At first every node has an initial label equal to 0, and then nodes get new labels. As nodes update and get new labels, the Label_counter should, under certain conditions, be incremented by one so that its value can be used as the label for other nodes, for example label = 1, then label = 2, and so on. As an example, the Zachary karate club dataset has 34 nodes and 2 communities.
the initial state is like this:
(1,0)
(2,0)
.
.
.
(34,0)
The first number is the node ID and the second one is the label.
As nodes get new labels, the Label_counter increments, and in later iterations other nodes get new labels and the Label_counter increments again.
(1,1)
(2,1)
(3,1)
.
.
.
(33,3)
(34,3)
Nodes with the same label belong to the same community.
The problem I have is this:
Because the nodes in the RDD and the variables are distributed across the machines (each machine has a copy of the variables) and executors don't communicate with each other, if one executor updates the Label_counter the other executors won't be informed of its new value and nodes may get wrong labels. Is it correct to use an Accumulator as the label counter in this case, since Accumulators are shared variables across machines, or is there another way to handle this problem?
In Spark it is always complicated to compute index-like values because they depend on data that is not present in every partition. I can propose the following idea:
1. Compute the number of times the condition is met per partition.
2. Compute the cumulative increment per partition so that we know the initial increment of each partition.
3. Increment the values of each partition based on that initial increment.
Here is what the code could look like. Let me start by setting up a few things.
// Let's define some condition
def condition(node : Long) = node % 10 == 1
// step 0, generate the data
val rdd = spark.range(34)
  .select('id+1).repartition(10).rdd
  .map(r => (r.getAs[Long](0), 0))
  .sortBy(_._1).cache()
rdd.collect
Array[(Long, Int)] = Array((1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0),
(9,0), (10,0), (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0),
(19,0), (20,0), (21,0), (22,0), (23,0), (24,0), (25,0), (26,0), (27,0), (28,0),
(29,0), (30,0), (31,0), (32,0), (33,0), (34,0))
Then the core of the solution:
// step 1 and 2
val partIncrInit = rdd
  // to each partition, we associate the number of times we need to increment
  .mapPartitionsWithIndex{ case (i, p) =>
    Iterator(i -> p.map(_._1).count(condition))
  }
  .collect.sorted   // sort by partition index
  .map(_._2)        // we don't need the index anymore
  .scanLeft(0)(_+_) // cumulated sum

// step 3, we increment each partition based on this initial increment.
val result = rdd
  .mapPartitionsWithIndex{ case (i, p) =>
    var incr = 0
    p.map{ case (node, value) =>
      if(condition(node))
        incr += 1
      (node, partIncrInit(i) + value + incr)
    }
  }
result.collect
Array[(Long, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1),
(9,1), (10,1), (11,2), (12,2), (13,2), (14,2), (15,2), (16,2), (17,2), (18,2),
(19,2), (20,2), (21,3), (22,3), (23,3), (24,3), (25,3), (26,3), (27,3), (28,3),
(29,3), (30,3), (31,4), (32,4), (33,4), (34,4))
So I am trying to create Google Maps-like software that would evaluate the trip time from point A to point B based on taxi data.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val taxiFile = sc.textFile("/resources/LabData/nyctaxisub.csv")
val header = taxiFile.first()
val taxiNoHead = taxiFile.filter(x=> (x !=header))
//3, 4, 9, 10 are the dropoff and pickup longitude and latitude
val taxiData = taxiNoHead.
  filter(_.split(",")(3) != "").
  filter(_.split(",")(4) != "")

val taxiFence = taxiData.
  filter(_.split(",")(3).toDouble > 40.70).
  filter(_.split(",")(3).toDouble < 40.77).
  filter(_.split(",")(4).toDouble > (-74.02)).
  filter(_.split(",")(4).toDouble < (-73.7)).
  filter(_.split(",")(9).toDouble > 40.70).
  filter(_.split(",")(9).toDouble < 40.77).
  filter(_.split(",")(10).toDouble > (-74.02)).
  filter(_.split(",")(10).toDouble < (-73.7))
val taxi = taxiFence.
  map{ line =>
    Vectors.dense(line.split(',').slice(3, 5).map(_.toDouble))
  }.cache()
val iterationCount=10
val clusterCount=40
val model=KMeans.train(taxi,clusterCount,iterationCount)
val clusterCenters=model.clusterCenters.map(_.toArray)
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
(Here is a link to map with all of my cluster centers http://www.darrinward.com/lat-long/?id=1922446)
Now my problem is that I am not sure how to calculate the average time from one cluster to another. This could be done easily if there were a way to access the data points assigned to each cluster and then take the average of trip_time_in_secs for trips that go from that cluster to the cluster nearest to the dropoff point. In addition, if there were clusters between points A and B, for instance a point C, it would also calculate the average time from A to C and from C to B and then sum these together, similarly to Dijkstra's algorithm.
Is this possible to do? And if yes, how?
I want networkx to find the absolute longest path in my directed, acyclic graph.
I know about Bellman-Ford, so I negated my graph lengths. The problem: networkx's bellman_ford() requires a source node. I want to find the absolute longest path (or the shortest path after negation), not the longest path from a given node.
Of course, I could run bellman_ford() on each node in the graph and sort, but is there a more efficient method?
From what I've read (e.g., http://en.wikipedia.org/wiki/Longest_path_problem) I realize there actually may not be a more efficient method, but I was wondering if anyone had any ideas (and/or had proved P=NP (grin)).
EDIT: all the edge lengths in my graph are +1 (or -1 after negation), so a method that simply visits the most nodes would also work. In general, it won't be possible to visit ALL nodes of course.
EDIT: OK, I just realized I could add an additional node that simply connects to every other node in the graph, and then run bellman_ford from that node. Any other suggestions?
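For reference, a minimal sketch of that extra-node idea (assuming the modern networkx API, unit edge lengths, and an arbitrarily named virtual node):
import networkx as nx

G = nx.DiGraph()
nx.add_path(G, [1, 2, 3, 4])
nx.add_path(G, [1, 20, 30, 31, 32, 4])

# Negated copy of the graph, plus a virtual source tied to every node with weight 0
H = nx.DiGraph()
H.add_weighted_edges_from((u, v, -1) for u, v in G.edges())
H.add_weighted_edges_from(("virtual_source", n, 0) for n in G.nodes())

# One Bellman-Ford run from the virtual source covers all possible start nodes
dist, paths = nx.single_source_bellman_ford(H, "virtual_source")

# The most negative distance marks the endpoint of the longest original path
end = min(dist, key=dist.get)
print(paths[end][1:])  # drop the virtual source, e.g. [1, 20, 30, 31, 32, 4]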
There is a linear-time algorithm mentioned at http://en.wikipedia.org/wiki/Longest_path_problem
Here is a (very lightly tested) implementation
EDIT: this is clearly wrong, see below. +1 for testing more than lightly before posting in the future.
import networkx as nx
def longest_path(G):
    dist = {} # stores [node, distance] pair
    for node in nx.topological_sort(G):
        pairs = [[dist[v][0]+1,v] for v in G.pred[node]] # incoming pairs
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node, max_dist = max(dist.items())
    path = [node]
    while node in dist:
        node, length = dist[node]
        path.append(node)
    return list(reversed(path))

if __name__=='__main__':
    G = nx.DiGraph()
    G.add_path([1,2,3,4])
    print longest_path(G)
EDIT: Corrected version (use at your own risk and please report bugs)
def longest_path(G):
    dist = {} # stores [node, distance] pair
    for node in nx.topological_sort(G):
        # pairs of dist,node for all incoming edges
        pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
        if pairs:
            dist[node] = max(pairs)
        else:
            dist[node] = (0, node)
    node,(length,_) = max(dist.items(), key=lambda x:x[1])
    path = []
    while length > 0:
        path.append(node)
        length,node = dist[node]
    return list(reversed(path))

if __name__=='__main__':
    G = nx.DiGraph()
    G.add_path([1,2,3,4])
    G.add_path([1,20,30,31,32,4])
    # G.add_path([20,2,200,31])
    print longest_path(G)
Aric's revised answer is a good one, and I found that it has been adopted into the networkx library.
However, I found a little flaw in this method.
if pairs:
    dist[node] = max(pairs)
else:
    dist[node] = (0, node)
This is because pairs is a list of tuples of (int, nodetype). When comparing tuples, Python compares the first elements and, if they are equal, proceeds to compare the second elements, which are of nodetype. However, in my case the nodetype is a custom class whose comparison methods are not defined. Python therefore throws an error like 'TypeError: unorderable types: xxx() > xxx()'.
As a possible improvement, the line
dist[node] = max(pairs)
can be replaced by
dist[node] = max(pairs,key=lambda x:x[0])
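A tiny illustration of the failure and the fix (Box here is a hypothetical node class with no ordering defined):
class Box:
    def __init__(self, name):
        self.name = name

pairs = [(2, Box("u")), (2, Box("v"))]
# max(pairs) raises TypeError: the equal first elements force Python to
# compare the Box instances, which define no ordering.
best = max(pairs, key=lambda x: x[0])  # only the distance is compared
print(best[0])  # 2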
Sorry about the formatting; it's my first time posting. I wish I could just post this as a comment below Aric's answer, but the website forbids me from doing so, saying I don't have enough reputation (fine...).