How to generate a triangle free graph in Networkx (with random seed)?

After checking the networkx documentation on triangles, I wondered whether there is a more efficient way of generating a triangle-free graph than randomly spawning graphs until a triangle-free one happens to emerge (in particular if one would like to use a constant random seed).
Below is code that spawns graphs until they are triangle free, yet with varying random seeds. For a graph of 10 nodes it already takes roughly 20 seconds.
def create_triangle_free_graph(show_graphs):
    seed = 42  # note: not passed to the generator, so every attempt draws a different graph
    nr_of_nodes = 10
    probability_of_creating_an_edge = 0.85
    nr_of_triangles = 1  # Initialise at 1 to enter the while loop.
    while nr_of_triangles > 0:
        graph = nx.fast_gnp_random_graph(
            nr_of_nodes, probability_of_creating_an_edge
        )
        triangles = nx.triangles(graph).values()
        nr_of_triangles = sum(triangles) / 3
        print(f"nr_of_triangles={nr_of_triangles}")
    return graph
Hence, I would like to ask:
Are there faster ways to generate triangle free graphs (using random seeds) in networkx?

A triangle exists in a graph iff two vertices connected by an edge share one or more neighbours. A triangle-free graph can be expanded by adding edges between nodes that share no neighbours. The empty graph is triangle-free, so there is a straightforward algorithm to create triangle-free graphs.
#!/usr/bin/env python
"""
Create a triangle free graph.
"""
import random
from itertools import combinations

import networkx as nx


def triangle_free_graph(total_nodes):
    """Construct a triangle free graph."""
    nodes = range(total_nodes)
    g = nx.Graph()
    g.add_nodes_from(nodes)
    edge_candidates = list(combinations(nodes, 2))
    random.shuffle(edge_candidates)
    for (u, v) in edge_candidates:
        if not set(g.neighbors(u)) & set(g.neighbors(v)):
            g.add_edge(u, v)
    return g
g = triangle_free_graph(10)
print(nx.triangles(g))
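If reproducibility is the goal (as the question asks), the shuffle can be driven by a seeded random.Random instance instead of the module-level functions. A minimal sketch, reusing the imports above; the seed value 42 is arbitrary:
def triangle_free_graph_seeded(total_nodes, seed=42):
    """Construct a triangle free graph reproducibly from a given seed."""
    rng = random.Random(seed)          # local RNG so the global random state is untouched
    nodes = range(total_nodes)
    g = nx.Graph()
    g.add_nodes_from(nodes)
    edge_candidates = list(combinations(nodes, 2))
    rng.shuffle(edge_candidates)       # same seed -> same candidate order -> same graph
    for (u, v) in edge_candidates:
        if not set(g.neighbors(u)) & set(g.neighbors(v)):
            g.add_edge(u, v)
    return g

g_fixed = triangle_free_graph_seeded(10, seed=42)
assert sum(nx.triangles(g_fixed).values()) == 0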
The number of edges in the resulting graph is highly dependent on the ordering of edge_candidates. To get a graph with the desired edge density, repeat the process until a graph with equal or higher density is found (and then remove superfluous edges), or until your patience runs out.
import warnings

cutoff = 0.85
max_iterations = 1e4
iteration = 0
while nx.density(g) < cutoff:
    g = triangle_free_graph(10)
    iteration += 1
    if iteration == max_iterations:
        warnings.warn("Maximum number of iterations reached!")
        break
# TODO: remove edges until the desired density is achieved
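One way to complete the TODO: note that by Mantel's theorem a triangle-free graph on n nodes has at most n*n/4 edges (density roughly 0.56 for n = 10), so a cutoff of 0.85 can never be reached; for an achievable cutoff, superfluous edges can be removed at random until the density drops back to the target. A sketch, reusing the imports above:
def trim_to_density(g, target_density, seed=42):
    """Remove random edges until the graph density is at most target_density."""
    rng = random.Random(seed)
    edges = list(g.edges())
    rng.shuffle(edges)
    for (u, v) in edges:
        if nx.density(g) <= target_density:
            break
        g.remove_edge(u, v)   # removing edges can never create a triangle
    return g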

Related

Hierarchical Agglomerative clustering for Spark

I am working on a project using Spark and Scala, and I am looking for a hierarchical clustering algorithm similar to scipy.cluster.hierarchy.fcluster or sklearn.cluster.AgglomerativeClustering that is usable for large amounts of data.
MLlib for Spark implements bisecting k-means, which needs the number of clusters as input. Unfortunately, in my case I don't know the number of clusters, and I would prefer to use a distance threshold as the input parameter instead, as is possible in those two Python implementations above.
If anyone knows the answer, I would be very grateful.
I had the same problem and, after looking high and low, found no answers, so I will post what I did here in the hope that it helps someone else and that maybe someone will build on it.
The basic idea of what I did was to use bisecting k-means recursively, continuing to split clusters in half until all points in a cluster are within a specified distance of the centroid. I was using GPS data, so I have a little bit of extra machinery to deal with that.
The first step is to create a model that will cut the data in half. I used bisecting k-means, but I think this would work with any of the PySpark clustering methods, so long as you can get the distance to the centroid.
import pyspark.sql.functions as f
from pyspark import SparkContext, SQLContext
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import FloatType          # needed for the distance udf below
from pyspark.sql.window import Window            # needed when building the final clustering

bkm = BisectingKMeans().setK(2).setSeed(1)
assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
adf = assembler.transform(locAggDf)  # locAggDf contains my location info
model = bkm.fit(adf)
# predictions will have the original data plus the "prediction" col which assigns a cluster number
predictions = model.transform(adf)
predictions.persist()
The next step is our recursive function. The idea here is that we specify some distance from the centroid, and if any point in a cluster is farther away than that distance, we cut the cluster in half. When a cluster is tight enough to meet the condition, I add it to a result array that I use to build the final clustering.
def bisectToDist(model, predictions, bkm, precision, result=[]):
    centers = model.clusterCenters()
    # row[0] is the predicted cluster number, row[1] is the unit, row[2] the point lat, row[3] the point long
    # centers[row[0]] is the lat/long of the centre: centers[row[0]][0] = lat, centers[row[0]][1] = long
    distUdf = f.udf(
        lambda row: getDistWrapper((centers[row[0]][0], centers[row[0]][1], row[1]), (row[2], row[3], row[1])),
        FloatType())  # getDistWrapper is how I calculate the distance of lat and long, but you can define any distance metric
    predictions = predictions.withColumn('dist', distUdf(
        f.struct(predictions.prediction, predictions.encodedPrecisionUnit, predictions.lat, predictions.long)))
    # create a df of all rows that were in clusters that had a point outside of the threshold
    toBig = predictions.join(
        predictions.groupby('prediction').agg({"dist": "max"}).filter(f.col('max(dist)') > precision).select(
            'prediction'), ['prediction'], 'leftsemi')
    # this could probably be improved
    # get all cluster numbers that were too big
    listids = toBig.select("prediction").distinct().rdd.flatMap(lambda x: x).collect()
    # if all data points are within the specified distance of the centroid we can return the clustering
    if len(listids) == 0:
        return predictions
    # assuming binary clustering, k must be 2
    # if one of the two clusters was small enough we will not have another recursion call for that cluster;
    # we must save it and return it at this depth, while the cluster that was too big is cut in half in the loop below
    if len(listids) == 1:
        ok = predictions.join(
            predictions.groupby('prediction').agg({"dist": "max"}).filter(
                f.col('max(dist)') <= precision).select(
                'prediction'), ['prediction'], 'leftsemi')
    for clusterId in listids:
        # get all of the pieces that were too big
        part = toBig.filter(toBig.prediction == clusterId)
        # we now need to refit on this subset of the data
        assembler = VectorAssembler(inputCols=['lat', 'long'], outputCol="features")
        adf = assembler.transform(part.drop('prediction').drop('features').drop('dist'))
        model = bkm.fit(adf)
        # predictions now holds the new subclustering and we are ready for recursion
        predictions = model.transform(adf)
        result.append(bisectToDist(model, predictions, bkm, precision, result=result))
    # return anything that was given and is already good
    if len(listids) == 1:
        return ok
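getDistWrapper is not shown in the answer. Purely as an illustration, a haversine-style stand-in might look like the sketch below; this is an assumption, and the original may treat the unit element of the tuples differently:
import math

def getDistWrapper(center, point):
    """Hypothetical stand-in: great-circle distance in metres between two (lat, long, unit) tuples."""
    lat1, lon1, _unit1 = center
    lat2, lon2, _unit2 = point
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return float(6371000 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)))  # Earth radius ~6371 km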
Finally, we can call the function and build the resulting dataframe:
result = []
# precision is the maximum allowed distance from a centroid, defined elsewhere
bisectToDist(model, predictions, bkm, precision, result=result)
# drop any Nones; these can happen in recursive (non top-level) calls
result = [r for r in result if r]
r = result[0]
r = r.withColumn('subIdx', f.lit(0))
result = result[1:]
idx = 1
for r1 in result:
    r1 = r1.withColumn('subIdx', f.lit(idx))
    r = r.unionByName(r1)
    idx = idx + 1
# each of the subclusters will have a 0 or 1 classification; to make it 0 - n I added the following
r = r.withColumn('delta', r.subIdx * 100 + r.prediction)
r = r.withColumn('delta', r.delta - f.lag(r.delta, 1).over(Window.orderBy("delta"))).fillna(0)
r = r.withColumn('ddelta', f.when(r.delta != 0, 1).otherwise(0))
r = r.withColumn('spacialLocNum', f.sum('ddelta').over(Window.orderBy(['subIdx', 'prediction'])))
# spacialLocNum should be the final clustering
Admittedly this is quite convoluted and slow, but it does get the job done. Hope this helps!

N-dimensional GP Regression

I'm trying to use GPflow for a multidimensional regression. But I'm confused by the shapes of the mean and variance.
For example: A 2-dimensional input space X of shape (20,20) is supposed to be predicted. My training samples are of shape (8,2) which means 8 training samples overall for the two dimensions. The y-values are of shape (8,1) which of course means one value of the ground truth per combination of the 2 input dimensions.
If I now use model.predict_y(X), I would expect to receive a mean of shape (20,20) but obtain a shape of (20,1). The same goes for the variance. I think this problem comes from the shape of the y-values, but I have no idea how to fix it.
bound = 3
num = 20
X = np.random.uniform(-bound, bound, (num,num))
print(X_sample.shape) # (8,2)
print(Y_sample.shape) # (8,1)
k = gpflow.kernels.RBF(input_dim=2)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
gpflow.train.ScipyOptimizer().minimize(m)
mean, var = m.predict_y(X)
print(mean.shape) # (20, 1)
print(var.shape) # (20, 1)
It sounds like you may be confused between the shape of a grid of input positions and the shape of the numpy arrays: if you want to predict on a 20 x 20 grid in two dimensions, you have 400 points in total, each with 2 values. So X (the one that you pass to m.predict_y()) should have shape (400, 2). (Note that the second dimension needs to have the same shape as X_sample!)
To construct this array of shape (400,2) you can use np.meshgrid (e.g., see What is the purpose of meshgrid in Python / NumPy?).
m.predict_y(X) only predicts the marginal variance at each test point, so the returned mean and var both have shape (400,1) (same length as X). You can of course reshape them to the 20 x 20 values on your grid.
(It is also possible to compute the full covariance, for the latent f this is implemented as m.predict_f_full_cov, which for X of shape (400,2) would return a 400x400 matrix. This is relevant if you want consistent samples from the GP, but I suspect that goes well beyond this question.)
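As a minimal sketch of the reshaping described above (assuming the trained model m and the bounds from the question):
import numpy as np

num = 20
bound = 3
x1 = np.linspace(-bound, bound, num)
x2 = np.linspace(-bound, bound, num)
x1_mesh, x2_mesh = np.meshgrid(x1, x2)
X_test = np.column_stack([x1_mesh.ravel(), x2_mesh.ravel()])  # shape (400, 2)

mean, var = m.predict_y(X_test)              # each has shape (400, 1)
mean_grid = mean.reshape(x1_mesh.shape)      # back to (20, 20) for plotting
var_grid = var.reshape(x1_mesh.shape)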
I was indeed making the mistake of not flattening the arrays, which in turn produced the error. Thank you for the fast response, STJ!
Here is an example of the working code:
# Generate data
bound = 3.
x1 = np.linspace(-bound, bound, num)
x2 = np.linspace(-bound, bound, num)
x1_mesh,x2_mesh = np.meshgrid(x1, x2)
X = np.dstack([x1_mesh, x2_mesh]).reshape(-1, 2)
z = f(x1_mesh, x2_mesh) # evaluation of the function on the grid
# Draw samples from feature vectors and function by a given index
size = 2
np.random.seed(1991)
index = np.random.choice(range(len(x1)), size=(size,X.ndim), replace=False)
samples = utils.sampleFeature([x1,x2], index)
X1_sample = samples[0]
X2_sample = samples[1]
X_sample = np.column_stack((X1_sample, X2_sample))
Y_sample = utils.samplefromFunc(f=z, ind=index)
# Change noise parameter
sigma_n = 0.0
# Construct models with initial guess
k = gpflow.kernels.RBF(2,active_dims=[0,1], lengthscales=1.0,ARD=True)
m = gpflow.models.GPR(X_sample, Y_sample, kern=k)
m.likelihood.variance = sigma_n
m.compile()
#print(X.shape)
mean, var = m.predict_y(X)
mean_square = mean.reshape(x1_mesh.shape) # Shape: (num,num)
var_square = var.reshape(x1_mesh.shape) # Shape: (num,num)
# Plot mean
fig = plt.figure(figsize=(16, 12))
ax = plt.axes(projection='3d')
ax.plot_surface(x1_mesh, x2_mesh, mean_square, cmap=cm.viridis, linewidth=0.5, antialiased=True, alpha=0.8)
cbar = ax.contourf(x1_mesh, x2_mesh, mean_square, zdir='z', offset=offset, cmap=cm.viridis, antialiased=True)
ax.scatter3D(X1_sample, X2_sample, offset, marker='o',edgecolors='k', color='r', s=150)
fig.colorbar(cbar)
for t in ax.zaxis.get_major_ticks(): t.label.set_fontsize(fontsize_ticks)
ax.set_title("$\mu(x_1,x_2)$", fontsize=fontsize_title)
ax.set_xlabel("\n$x_1$", fontsize=fontsize_label)
ax.set_ylabel("\n$x_2$", fontsize=fontsize_label)
ax.set_zlabel('\n\n$\mu(x_1,x_2)$', fontsize=fontsize_label)
plt.xticks(fontsize=fontsize_ticks)
plt.yticks(fontsize=fontsize_ticks)
plt.xlim(left=-bound, right=bound)
plt.ylim(bottom=-bound, top=bound)
ax.set_zlim3d(offset,np.max(z))
which produces the desired surface plot (the red dots are the sample points drawn from the function). Note: code not refactored whatsoever :)

Distinguish classes with different colors Networkx

I have a huge dataset of 80,000 rows, and I want to draw a meaningful graph in networkx using two dataframes (nodes and edges).
In "nodes" I have: actor1, category_id (an int from 0 to 7 describing the type) and fatalities (a float representing the number of injured or killed people).
In "edges" I have: actor1, actor2 and interaction (float64).
My aim is to draw a graph with different colors according to category_id and different node sizes based on the number of fatalities.
I started with this code, which ran perfectly until I tried to retrieve interaction and fatalities to calculate the weights of nodes, as follows:
nodes = ACLED_to_graph[['actor1', 'category_id', 'fatalities']]
edges = ACLED_to_graph[['actor1', 'actor2', 'interaction']]
# Initiate the graph
G4 = nx.Graph()
for index, row in nodes.iterrows():
    G4.add_node(row['actor1'], category_id=row['category_id'], nodesize=row['fatalities'])
for index, row in edges.iterrows():
    G4.add_weighted_edges_from([(row['actor1'], row['actor2'], row['interaction'])])
#greater_than_ = [x for x in G.nodes(data=True) if x[2]['nodesize']>15]
# Sort nodes by degree
sorted(G4.degree, key=lambda x: x[1], reverse=True)
# remove nodes whose degree is >= 200 or < 4
cond1 = [node for node, degree in G4.degree() if degree >= 200]
cond2 = [node for node, degree in G4.degree() if degree < 4]
remove = cond1 + cond2
G4.remove_nodes_from(remove)
G4.remove_edges_from(remove)
# Customize the layout
pos = nx.spring_layout(G4, k=0.25, iterations=50)
# Define color map for classes
color_map = {0: '#f1f0c0', 1: '#f09494', 2: '#eebcbc', 3: '#72bbd0', 4: '#91f0a1', 5: '#629fff', 6: '#bcc2f2',
             7: '#eebcbc'}
plt.figure(figsize=(25, 25))
options = {
    'edge_color': '#FFDEA2',
    'width': 1,
    'with_labels': True,
    'font_weight': 'regular',
}
colors = [color_map[G4.node[node]['category_id']] for node in G4.node]
#sizes = [G.node[node]['interaction'] for node in G]
"""
Using the spring layout :
- k controls the distance between the nodes and varies between 0 and 1
- iterations is the number of times simulated annealing is run
default k=0.1 and iterations=50
"""
#node_color=colors,
#node_size=sizes,
nx.draw(G4, node_color=colors, **options, cmap=plt.get_cmap('jet'))
ax = plt.gca()
ax.collections[0].set_edgecolor("#555555")
I am also removing nodes whose degree is at least 200 or less than 4 to simplify the graph and make it more appealing.
I am getting the following error:
colors = [color_map[G4.node[node]['category_id']] for node in G4.node]
KeyError: 'category_id'
Without the input data it is a bit hard to tell for sure, but it looks as if you are not constructing the graph nodes with a 'category_id' property. In the for index, row in nodes.iterrows(): loop you assign the data in the nodes dataframe under the key 'category_id' to the property "group".
You can confirm this by checking what keys are set for an example node in your graph, e.g. print(G4.node['actor1'].keys()).
To fix this, either
a) change the assignment
for index, row in nodes.iterrows():
    G4.add_node(row['actor1'], category_id=row['category_id'], nodesize=row['interaction'])
or b) change the lookup
colors = [color_map[G4.node[node]['group']] for node in G4.node]
Solving mathematical operations using node attributes can be summarized as follows:
1 - After subsetting the dataframe, we initialize the graph:
nodes = ACLED_to_graph[['actor1','category_id','interaction']]
edges = ACLED_to_graph[['actor1','actor2','fatalities']]
# Initiate the graph
G8 = nx.Graph()
2 - Add edge attributes first (I emphasize the use of from_pandas_edgelist; no iterrows loop is needed, since it consumes the whole dataframe):
G8 = nx.from_pandas_edgelist(edges, 'actor1', 'actor2', ['fatalities'])
3 - Next we add node attributes using add_node; other techniques such as set_node_attributes didn't work with pandas:
for index, row in nodes.iterrows():
    G8.add_node(row['actor1'], category_id=row['category_id'], interaction=row['interaction'])
4 - Sort nodes by degree to select the most connected nodes (I am keeping nodes with degree at least 3 and less than 200):
sorted(G8.degree, key=lambda x: x[1], reverse=True)
# remove nodes whose degree is >= 200 or < 3
cond1 = [node for node,degree in G8.degree() if degree>=200]
cond2 = [node for node,degree in G8.degree() if degree<3]
remove = cond1+cond2
G8.remove_nodes_from(remove)
G8.remove_edges_from(remove)
5 - Set the node color based on degree (calling G8.degree):
node_color = [G8.degree(v) for v in G8]
6 - Set the edge width based on fatalities:
edge_width = [0.15*G8[u][v]['fatalities'] for u, v in G8.edges()]
7 - Set the node size based on interaction:
node_size = list(nx.get_node_attributes(G8, 'interaction').values())
I used nx.get_node_attributes instead of pandas to access the features, which turns the attribute dictionary into a list of values, ready to compute with.
8 - Select the most important edges based on fatalities:
large_edges = [x for x in G8.edges(data=True) if x[2]['fatalities'] >= 3.0]
9 - Finally, compute a layout and draw the network and the highlighted edges separately:
pos = nx.spring_layout(G8, k=0.25, iterations=50)  # pos was not defined in this answer; reusing the layout from the question
nx.draw_networkx(G8, pos, node_size=node_size, node_color=node_color, alpha=0.7, with_labels=False, width=edge_width, edge_color='.4', cmap=plt.cm.Blues)
nx.draw_networkx_edges(G8, pos, edgelist=large_edges, edge_color='r', alpha=0.4, width=6)
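For reference, a minimal self-contained sketch of the same idea on a toy graph: colour nodes by a categorical attribute and size them by a numeric one. The attribute names mirror the ones above, but the data is made up:
import matplotlib.pyplot as plt
import networkx as nx

toy = nx.Graph()
toy.add_node("a", category_id=0, fatalities=5)
toy.add_node("b", category_id=1, fatalities=20)
toy.add_node("c", category_id=1, fatalities=1)
toy.add_edges_from([("a", "b"), ("b", "c")])

color_map = {0: '#72bbd0', 1: '#f09494'}
node_colors = [color_map[toy.nodes[n]['category_id']] for n in toy]   # nodes view API (networkx >= 2.0)
node_sizes = [50 * toy.nodes[n]['fatalities'] for n in toy]

nx.draw(toy, node_color=node_colors, node_size=node_sizes, with_labels=True)
plt.show()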

Plot portfolio composition map in Julia (or Matlab)

I am optimizing a portfolio of N stocks over M levels of expected return. After doing this I get the time series of weights (i.e. an N x M matrix where each row is a combination of stock weights for a particular level of expected return). Weights add up to 1.
Now I want to plot something called a portfolio composition map (the right-hand plot in the picture), which is a plot of these stock weights over all levels of expected return, each stock with a distinct color and the length of each segment (at every level of return) proportional to its weight.
My question is how to do this in Julia (or MATLAB)?
I came across this and the accepted solution seemed so complex. Here's how I would do it:
using Plots

@userplot PortfolioComposition

@recipe function f(pc::PortfolioComposition)
    weights, returns = pc.args
    weights = cumsum(weights, dims=2)
    seriestype := :shape
    for c = 1:size(weights, 2)
        sx = vcat(weights[:, c], c == 1 ? zeros(length(returns)) : reverse(weights[:, c-1]))
        sy = vcat(returns, reverse(returns))
        @series Shape(sx, sy)
    end
end

# fake data
tickers = ["IBM", "Google", "Apple", "Intel"]
N = 10
D = length(tickers)
weights = rand(N, D)
weights ./= sum(weights, dims=2)
returns = sort!((1:N) + D*randn(N))

# plot it
portfoliocomposition(weights, returns, labels = tickers)
matplotlib has a pretty powerful polygon plotting capability, e.g. this link on plotting filled polygons:
plotting filled polygons in python
You can use this from Julia via the excellent PyPlot.jl package.
Note that the syntax for certain things changes; see the PyPlot.jl README and e.g. this set of examples.
You "just" need to calculate the coordinates from your matrix and build up a set of polygons to plot the portfolio composition graph. It would be nice to see the code if you get this working!
So I was able to draw it, and here's my code:
using PyPlot
using PyCall
@pyimport matplotlib.patches as patch

N = 10
D = 4
weights = Array(Float64, N, D)
for i in 1:N
    w = rand(D)
    w = w/sum(w)
    weights[i,:] = w
end
weights = [zeros(Float64, N) weights]
weights = cumsum(weights,2)
returns = sort!([linspace(1,N, N);] + D*randn(N))

##########
#  Plot  #
##########
polygons = Array(PyObject, 4)
colors = ["red","blue","green","cyan"]
labels = ["IBM", "Google", "Apple", "Intel"]
fig, ax = subplots()
fig[:set_size_inches](5, 7)
title("Problem 2.5 part 2")
xlabel("Weights")
ylabel("Return (%)")
ax[:set_autoscale_on](false)
ax[:axis]([0,1,minimum(returns),maximum(returns)])
for i in 1:(size(weights,2)-1)
    xy = [weights[:,i] returns;
          reverse(weights[:,(i+1)]) reverse(returns)]
    polygons[i] = matplotlib[:patches][:Polygon](xy, true, color=colors[i], label = labels[i])
    ax[:add_artist](polygons[i])
end
legend(polygons, labels, bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0)
show()
# savefig("CompositionMap.png", bbox_inches="tight")
Can't say that this is the best way to do it, but at least it is working.

Solving a system of equations using Python/Scipy for a set of measurements

I have a physical measurement instrument (a force platform with load cells) which gives me three values, A, B and C. It happens, though, that these values - which should be orthogonal - are actually somewhat coupled, due to physical characteristics of the measuring device, which cause cross-talk between applied and returned values of force and torque.
Then, it is recommended that a calibration matrix be used to transform the measured values into a better estimate of the actual values, like this: actual(Fz, Mx, My) = C · measured(Fz, Mx, My), where C is a 3x3 calibration matrix.
The problem is that it is necessary to perform a SET of measurements, so that different measured(Fz, Mx, My) and actual(Fz, Mx, My) triplets are least-squared to get a C matrix that works best for the system as a whole.
I can solve Ax = B problems with scipy.linalg.lstsq, or even scipy.linalg.solve (giving an exact solution) for ONE measurement, but how should I proceed to consider a set of different measurements, each one with its own equation giving a potentially different 3x3 matrix?
Any help is much appreciated, thanks for reading.
I posted a similar question containing just the mathematical part of this at math.stackexchange.com, and this answer solved the problem:
math.stackexchange.com/a/232124/27435
In case anyone has a similar problem in the future, here is the almost literal SciPy implementation of that answer (the first lines are initialization boilerplate code):
import numpy
import scipy.linalg

### Origin of the coordinate system: upper left corner!
"""
  1----------2
  |          |
  |          |
  4----------3
"""

platform_width = 600
platform_height = 400

# positions of each load cell (one per corner)
loadcell_positions = numpy.array([[0, 0],
                                  [platform_width, 0],
                                  [platform_width, platform_height],
                                  [0, platform_height]])

platform_origin = numpy.array([platform_width, platform_height]) * 0.5

# applying a known force at known positions and taking the measurements
measurements_per_axis = 5
total_load = 50
results = []
for x in numpy.linspace(0, platform_width, measurements_per_axis):
    for y in numpy.linspace(0, platform_height, measurements_per_axis):
        position = numpy.array([x, y])
        for loadpos in loadcell_positions:
            moments = (platform_origin - loadpos) * total_load
            load = numpy.array([total_load])
            result = numpy.hstack([load, moments])
            results.append(result)
results = numpy.array(results)

noise = numpy.random.rand(*results.shape) - 0.5
measurements = results + noise

# now expand ("stuff") each measured 3-vector into a 3x9 block so that the
# nine entries of C become the unknowns of one big linear system
expands = []
for n in range(measurements.shape[0]):
    k = results[n, :]
    m = measurements[n, :]
    expand = numpy.zeros((3, 9))
    expand[0, 0:3] = m
    expand[1, 3:6] = m
    expand[2, 6:9] = m
    expands.append(expand)
expands = numpy.vstack(expands)

# perform the actual regression: measured values in the design matrix, actual values as the target
C = scipy.linalg.lstsq(expands, results.reshape((-1, 1)))
C = numpy.array(C[0]).reshape((3, 3))

# the result with pure noise (not actual coupling) should be
# very close to a 3x3 identity matrix (and is!)
print(C)
Hope this helps someone!
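For reference, the same stacking can be written more compactly with numpy.kron; a sketch, assuming measurements and results arrays of shape (n_samples, 3) as above:
import numpy

def fit_calibration_matrix(measurements, results):
    """Least-squares fit of C in results_i ~ C @ measurements_i, stacking all samples."""
    # each sample contributes a 3x9 block: kron(I_3, m) @ vec(C) == C @ m  (row-major vec)
    blocks = [numpy.kron(numpy.eye(3), m) for m in measurements]
    A = numpy.vstack(blocks)                      # shape (3*n_samples, 9)
    b = results.reshape(-1)                       # shape (3*n_samples,)
    c_flat, *_ = numpy.linalg.lstsq(A, b, rcond=None)
    return c_flat.reshape(3, 3)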