networkx: efficiently find absolute longest path in digraph - networkx

I want networkx to find the absolute longest path in my directed,
acyclic graph.
I know about Bellman-Ford, so I negated my graph lengths. The problem:
networkx's bellman_ford() requires a source node. I want to find the
absolute longest path (or the shortest path after negation), not the
longest path from a given node.
Of course, I could run bellman_ford() on each node in the graph and
sort, but is there a more efficient method?
From what I've read (eg,
http://en.wikipedia.org/wiki/Longest_path_problem) I realize there
actually may not be a more efficient method, but was wondering if
anyone had any ideas (and/or had proved P=NP (grin)).
EDIT: all the edge lengths in my graph are +1 (or -1 after negation), so a method that simply visits the most nodes would also work. In general, it won't be possible to visit ALL nodes of course.
EDIT: OK, I just realized I could add an additional node that simply connects to every other node in the graph, and then run bellman_ford from that node. Any other suggestions?

There is a linear-time algorithm mentioned at http://en.wikipedia.org/wiki/Longest_path_problem
Here is a (very lightly tested) implementation
EDIT, this is clearly wrong, see below. +1 for future testing more than lightly before posting
import networkx as nx
def longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
pairs = [[dist[v][0]+1,v] for v in G.pred[node]] # incoming pairs
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node, max_dist = max(dist.items())
path = [node]
while node in dist:
node, length = dist[node]
path.append(node)
return list(reversed(path))
if __name__=='__main__':
G = nx.DiGraph()
G.add_path([1,2,3,4])
print longest_path(G)
EDIT: Corrected version (use at your own risk and please report bugs)
def longest_path(G):
dist = {} # stores [node, distance] pair
for node in nx.topological_sort(G):
# pairs of dist,node for all incoming edges
pairs = [(dist[v][0]+1,v) for v in G.pred[node]]
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
node,(length,_) = max(dist.items(), key=lambda x:x[1])
path = []
while length > 0:
path.append(node)
length,node = dist[node]
return list(reversed(path))
if __name__=='__main__':
G = nx.DiGraph()
G.add_path([1,2,3,4])
G.add_path([1,20,30,31,32,4])
# G.add_path([20,2,200,31])
print longest_path(G)

Aric's revised answer is a good one and I found it had been adopted by the networkx library link
However, I found a little flaw in this method.
if pairs:
dist[node] = max(pairs)
else:
dist[node] = (0, node)
because pairs is a list of tuples of (int,nodetype). When comparing tuples, python compares the first element and if they are the same, will process to compare the second element, which is nodetype. However, in my case the nodetype is a custom class whos comparing method is not defined. Python therefore throw out an error like 'TypeError: unorderable types: xxx() > xxx()'
For a possible improving, I say the line
dist[node] = max(pairs)
can be replaced by
dist[node] = max(pairs,key=lambda x:x[0])
Sorry about the formatting since it's my first time posting. I wish I could just post below Aric's answer as a comment but the website forbids me to do so stating I don't have enough reputation (fine...)

Related

Issues with ordiplot3d NMDS in 3dvegan package

I am looking for some help here with this 3d NMDS code. I have 3 issues.
The layout of the plot moves significantly each time I execute the code.
The sites and species are sometimes far off of the plot.
The species text is often overlapping. How can I fix this?
I am unsure how to change the plotting environment to ggplot, so that might be out of the question.
library(vegan)
library(vegan3d)
library(tidyverse)
data("dune")
SiteID <- 1:20
NMDS = metaMDS(dune,distance="bray", try=500, wascores = TRUE, k=3)
NMDS1 = NMDS$points[,1]
NMDS2 = NMDS$points[,2]
NMDS3 = NMDS$points[,3]
NMDS = data.frame(NMDS1 = NMDS1, NMDS2 = NMDS2, NMDS3 = NMDS3, SiteID=SiteID)
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- with(NMDS, ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE))
sp <- scores(NMDS_input, choices=1:3, display="species", scaling="symmetric")
si <- scores(NMDS_input, choices=1:3, display="sites", scaling="symmetric")
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
sii <- as.data.frame(cbind(NMDS$SiteID,si))
with(NMDS, orditorp(pl4, labels = sii$V1, air=1, cex = 1))
labels must be character variables in orditorp. We always assumed so, but this was not checked in vegan::orditorp. Latest vegan version in github will take care of this and will also work with numeric labels.
ordiplot3d returns projected coordinates (in 2D) and if you want to plot those, you can just use the pl4 object that you saved and you do not need to use pl4$xyz.convert. This object will also be accepted in orditorp.
If you want to plot points that were not used in the original mock-3D plot, you must use pl4$xyz.convert for their 2D projection. This function will return the projected coordinates in a form that is directly accepted by standard R functions text, points (and some others), but they will not be accepted by orditorp (and I won't change this). You must make these into two-column matrix-like object; data.frame() will work.
Your example code contains a lot of un-needed code. The following is an edit with only necessary lines and fixes that make this example work with current vegan release.
library(vegan)
library(vegan3d)
data(dune)
SiteID <- as.character(1:20) # must be character
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE) # no with(NMDS,...)
sp <- scores(NMDS_input, choices=1:3, display="species") # no arg scaling in scores.metaMDS
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
orditorp(pl4, labels = SiteID, air=1, cex = 1) # character labels w/points in the same location

How To Use kmedoids from pyclustering with set number of clusters

I am trying to use k-medoids to cluster some trajectory data I am working with (multiple points along the trajectory of an aircraft). I want to cluster these into a set number of clusters (as I know how many types of paths there should be).
I have found that k-medoids is implemented inside the pyclustering package, and am trying to use that. I am technically able to get it to cluster, but I do not know how to control the number of clusters. I originally thought it was directly tied to the number of elements inside what I called initial_medoids, but experimentation shows that it is more complicated than this. My relevant code snippet is below.
Note that D holds a list of lists. Each list corresponds to a single trajectory.
def hausdorff( u, v):
d = max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
return d
traj_count = len(traj_lst)
D = np.zeros((traj_count, traj_count))
for i in range(traj_count):
for j in range(i + 1, traj_count):
distance = hausdorff(traj_lst[i], traj_lst[j])
D[i, j] = distance
D[j, i] = distance
from pyclustering.cluster.kmedoids import kmedoids
initial_medoids = [104, 345, 123, 1]
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()[0]
num_clusters = len(np.unique(cluster_lst))
print('There were %i clusters found' %num_clusters)
I have a total of 1900 trajectories, and the above-code finds 1424 clusters. I had expected that I could control the number of clusters through the length of initial_medoids, as I did not see any option to input the number of clusters into the program, but this seems unrelated. Could anyone guide me as to the mistake I am making? How do I choose the number of clusters?
In case of requirement to obtain clusters you need to call get_clusters():
cluster_lst = kmedoids_instance.get_clusters()
Not get_clusters()[0] (in this case it is a list of object indexes in the first cluster):
cluster_lst = kmedoids_instance.get_clusters()[0]
And that is correct, you can control amount of clusters by initial_medoids.
It is true you can control the number of cluster, which correspond to the length of initial_medoids.
The documentation is not clear about this. The get__clusters function "Returns list of medoids of allocated clusters represented by indexes from the input data". so, this function does not return the cluster labels. It returns the index of rows in your original (input) data.
Please check the shape of cluster_lst in your example, using .get_clusters() and not .get_clusters()[0] as annoviko suggested. In your case, this shape should be (4,). So, you have a list of four elements (clusters), each containing the index or rows in your original data.
To get, for example, data from the first cluster, use:
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()
traj_lst_first_cluster = traj_lst[cluster_lst[0]]

Trying to balance my dataset through sample_weight in scikit-learn

I'm using RandomForest for classification, and I got an unbalanced dataset, as: 5830-no, 1006-yes. I try to balance my dataset with class_weight and sample_weight, but I can`t.
My code is:
X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
But I don't get any improvement on my ratios TPR, FPR, ROC when using class_weight and sample_weight.
Why? Am I doing anything wrong?
Nevertheless, if I use the function called balanced_subsample, my ratios obtain a great improvement:
def balanced_subsample(x,y,subsample_size):
class_xs = []
min_elems = None
for yi in np.unique(y):
elems = x[(y == yi)]
class_xs.append((yi, elems))
if min_elems == None or elems.shape[0] < min_elems:
min_elems = elems.shape[0]
use_elems = min_elems
if subsample_size < 1:
use_elems = int(min_elems*subsample_size)
xs = []
ys = []
for ci,this_xs in class_xs:
if len(this_xs) > use_elems:
np.random.shuffle(this_xs)
x_ = this_xs[:use_elems]
y_ = np.empty(use_elems)
y_.fill(ci)
xs.append(x_)
ys.append(y_)
xs = np.concatenate(xs)
ys = np.concatenate(ys)
return xs,ys
My new code is:
X_train_subsampled,y_train_subsampled=balanced_subsample(arrX,y,0.5)
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
This is not a full answer yet, but hopefully it'll help get there.
First some general remarks:
To debug this kind of issue it is often useful to have a deterministic behavior. You can pass the random_state attribute to RandomForestClassifier and various scikit-learn objects that have inherent randomness to get the same result on every run. You'll also need:
import numpy as np
np.random.seed()
import random
random.seed()
for your balanced_subsample function to behave the same way on every run.
Don't grid search on n_estimators: more trees is always better in a random forest.
Note that sample_weight and class_weight have a similar objective: actual sample weights will be sample_weight * weights inferred from class_weight.
Could you try:
Using subsample=1 in your balanced_subsample function. Unless there's a particular reason not to do so we're better off comparing the results on similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.
EDIT: Reading your comment again I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus makes more false positives (while also getting more of those right of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.
There is a imbalanced-learn API that helps with oversampling/undersampling data that might be useful in this situation. You can pass your training set into one of the methods and it will output the oversampled data for you. See simple example below
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
x_oversampled, y_oversampled = ros.fit_sample(orig_x_data, orig_y_data)
Here it the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html
Hope this helps!

Find value in vector "p" that corresponds to maximum value in vector "r = f(p)"

As simple as in title. I have nx1 sized vector p. I'm interested in the maximum value of r = p/foo - floor(p/foo), with foo being a scalar, so I just call:
max_value = max(p/foo-floor(p/foo))
How can I get which value of p gave out max_value?
I thought about calling:
[max_value, max_index] = max(p/foo-floor(p/foo))
but soon I realised that max_index is pretty useless. I'm sorry asking this, real beginner here.
Having dropped the issue to pieces, I realized there's no unique corrispondence between values p and values in my related vector p/foo-floor(p/foo), so there's a logical issue rather than a language one.
However, given my input data, I know that the solution is unique. How can I fix this?
I ended up doing:
result = p(p/foo-floor(p/foo) == max(p/foo-floor(p/foo)))
Looks terrible, so if you know any other way...
Once you have the index, use it:
result = p(max_index)
You can create a new vector with your lets say "transformed" values:
p2 = (p/foo-floor(p/foo))
and then just use find to find the max values on p2:
max_index = find(p2 == max(p2))
that will return the index or indices of p2 with the max value of that operation, and finally just lookup the original value in p
p(max_index)
in 1 line, this is:
p(find((p/foo-floor(p/foo) == max((p/foo-floor(p/foo))))))
which is basically the same thing you did in the end :)

Ullman’s Subgraph Isomorphism Algorithm

Could somebody give me a working Ullman's graph isomorphism problem implementation in MATLAB, or link to it. Or if you have at least in C so I would try to implement it in MATLAB.
Thanks
i'm lookign for it too. I've been loking in the web but with no luck so far, but i've found this:
Algorithm, where the algorithm is explained.
On another hand, i found this:
def search(graph,subgraph,assignments,possible_assignments):
update_possible_assignments(graph,subgraph,possible_assignments)
i=len(assignments)
# Make sure that every edge between assigned vertices in the subgraph is also an
# edge in the graph.
for edge in subgraph.edges:
if edge.first<i and edge.second<i:
if not graph.has_edge(assignments[edge.first],assignments[edge.second]):
return False
# If all the vertices in the subgraph are assigned, then we are done.
if i==subgraph.n_vertices:
return True
for j in possible_assignments[i]:
if j not in assignments:
assignments.append(j)
# Create a new set of possible assignments, where graph node j is the only
# possibility for the assignment of subgraph node i.
new_possible_assignments = deep_copy(possible_assignments)
new_possible_assignments[i] = [j]
if search(graph,subgraph,assignments,new_possible_assignments):
return True
assignments.pop()
possible_assignments[i].remove(j)
update_possible_assignments(graph,subgraph,possible_assignments)
def find_isomporhism(graph,subgraph):
assignments=[]
possible_assignments = [[True]*graph.n_vertices for i in range(subgraph.n_vertices)]
if search(graph,subgraph,asignments,possible_assignments):
return assignments
return None
here: implementation. I do not have the skills to transform this into Matlab, if you have them , i would really appreciate if you could share your code when you're done.