PyTorch Geometric directed graph shows wrong number of edges when converted to NetworkX undirected format - networkx

I loaded the PyTorch Geometric dataset OGB_MAG, converted it into a homogeneous dataset, and then checked its number of nodes as follows:
import torch
import numpy as np
from torch_geometric.datasets import OGB_MAG
import os.path as osp
import networkx as nx
from torch_geometric.utils import to_networkx, to_undirected
path = osp.join('..', 'data', 'OGB_MAG')
dataset = OGB_MAG(path)
data = dataset[0]
homogeneous_data = data.to_homogeneous()
print(homogeneous_data)
--> Data(node_type=[1939743], edge_index=[2, 21111007], edge_type=[21111007])
print(homogeneous_data.has_isolated_nodes())
--> True
Then I converted it to NetworkX format and checked its number of edges:
nx_data = to_networkx(homogeneous_data)
print(nx_data.number_of_edges())
--> 21111007
nx_data = nx_data.to_undirected(reciprocal=False)
print(nx_data.number_of_nodes(), nx_data.number_of_edges())
--> (1939743, 21091072)
print(len(list(nx.isolates(nx_data))))
--> 0
The number of edges is different (21111007 with PyG vs 21091072 with NetworkX after converting to undirected). The number of nodes is the same though. We also see that PyG says there are isolated nodes, but NetworkX says there are none.
Any insight as to why I'm seeing this discrepancy?
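One hedged guess about the edge count, not confirmed in the post: NetworkX's to_undirected() merges every reciprocal pair (u, v)/(v, u) into a single undirected edge, so the undirected count drops whenever both directions of an edge exist. The number of distinct undirected edges can be computed directly in PyG with the to_undirected utility that is already imported above:
# sketch only: to_undirected() adds reverse edges and coalesces duplicates,
# so (assuming no self-loops) each undirected edge appears exactly twice
und_edge_index = to_undirected(homogeneous_data.edge_index)
print(und_edge_index.size(1) // 2)  # expected to match the NetworkX undirected edge count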

Related

Networkx edge induced subgraph isomorphism

Consider the following snippet:
import networkx as nx
from networkx.algorithms import isomorphism
import matplotlib.pyplot as plt
subg = nx.Graph()
subg.add_nodes_from([0]+[i+1 for i in range(6)])
subg.add_edges_from([(0, i) for i in range(1,7)])
bigg = nx.Graph()
bigg.add_nodes_from([0]+[i+1 for i in range(6)])
bigg.add_edges_from([(0, i) for i in range(1,7)]+[(i,i+1) for i in range(1,6)]+[(1,6)])
nx.draw(subg, with_labels=True)
plt.show()
nx.draw(bigg, with_labels=True)
plt.show()
matcher = isomorphism.GraphMatcher(bigg, subg)
print([x for x in matcher.subgraph_isomorphisms_iter()])
This returns no subgraph isomorphisms because the matcher looks for induced subgraphs: the matched nodes in bigg also carry edges such as (1, 6) that do not exist in subg. How does one get networkx to ignore this? I would like the mappings of nodes from subg to bigg that respect all the relations defined in subg, regardless of whether additional relations exist between those nodes in bigg.
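What is being described is a subgraph monomorphism rather than an induced subgraph isomorphism. A minimal sketch, assuming a networkx version that provides GraphMatcher.subgraph_monomorphisms_iter (2.4 or later):
import networkx as nx
from networkx.algorithms import isomorphism
subg = nx.Graph()
subg.add_edges_from([(0, i) for i in range(1, 7)])
bigg = nx.Graph()
bigg.add_edges_from([(0, i) for i in range(1, 7)] + [(i, i + 1) for i in range(1, 6)] + [(1, 6)])
matcher = isomorphism.GraphMatcher(bigg, subg)
# a monomorphism only requires subg's edges to be present in bigg;
# extra edges between the matched nodes in bigg are ignored
print(next(matcher.subgraph_monomorphisms_iter()))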

Why is k-means better at clustering than topic-modelling algorithms like LDA?

I want to know the advantages of k-means for clustering essays to discover their topics. There are many algorithms for this, such as k-medoids, x-means, LDA, LSA, etc. Please give me a full description of the reasons to choose the k-means algorithm.
I don't think you can draw parallels between all these things. I would highly recommend doing some well-defined Googling on your side and coming back here with a more refined question or questions. In the meantime, I'll share what little I know about these topics. First, let's look at PCA & LDA...
import numpy as np
import pandas as pd
# Importing the Dataset
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
#dataset = pd.read_csv(url, names=names)
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# As mentioned earlier, PCA performs best with a normalized feature set. We will perform standard scalar normalization to normalize our feature set. To do this, execute the following code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Performance Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 12 1]
[ 0 1 5]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Accuracy 0.9333333333333333
# Results with 2 & 3 principal components
#from sklearn.decomposition import PCA
#pca = PCA(n_components=5)
#X_train = pca.fit_transform(X_train)
#X_test = pca.transform(X_test)
# https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
# LINEAR DISCRIMINANT ANALYSIS
# Data Preprocessing
# Once dataset is loaded into a pandas data frame object, the first step is to divide dataset into features and corresponding labels and then divide the resultant dataset into training and test sets. The following code divides data into labels and feature set:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
# As was the case with PCA, we need to perform feature scaling for LDA too. Execute the following script to do so:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 13 0]
[ 0 0 6]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Result:
Accuracy 1.0
# https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/
Does that make sense? Hopefully it does. Now, let's move on to KMeans and PCA...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
ax = sns.scatterplot(x="sepal_length", y="sepal_width", hue="sepal_length", data=dataset)
ax = sns.scatterplot(x="petal_length", y="petal_width", hue="petal_length", data=dataset)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
# ordinarily, when you don't have the actual labels, you might use
# silhouette analysis to determine a good number of clusters k to use.
# i.e. you would just run that same code for different values of k and print the value for
# the silhouette score.
# let's see what that value is for the case we just did, k=3.
from sklearn import metrics
score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)
score
# Result:
# 0.45994823920518646
# note that this is the mean over all the samples - there might be some clusters
# that are well separated and others that are closer together.
# so let's look at the distribution of silhouette scores...
scores = metrics.silhouette_samples(X_scaled, y_cluster_kmeans)
sns.distplot(scores);
# so you can see that the blue species have higher silhouette scores
# (the legend doesn't show the colors though... so the pandas plot is more useful).
# note that if we used the best mean silhouette score to try to find the best
# number of clusters k, we'd end up with 2 clusters, because the mean silhouette
# score in that case would be largest, since the clusters would be better separated.
# but, that's using k-means - gmm might give better results...
# so that was clustering on the original 4d data.
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# Result:
PC1 PC2
90 0.279078 -1.120029
26 -2.051151 0.242164
83 1.061095 -0.633843
135 2.798770 0.856803
54 1.075475 -0.208421
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
y_id_array = pd.factorize(y)[0]  # encode the species labels as integer ids (0, 1, 2) for plotting
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots
df_plot.sample(5)
# Result:
PC1 PC2 ClusterKmeans SpeciesId
132 1.862703 -0.178549 0 2
85 0.429139 0.845582 0 1
139 1.852045 0.676128 0 2
33 -2.446177 2.150728 1 0
147 1.521170 0.269069 0 2
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    """make a scatterplot of the first two principal components of the data, colored by the groupby field"""
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize=(7, 7))
    # color map
    cmap = mpl.cm.get_cmap('prism')
    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax=ax,  # need to pass this so all scatterplots are on same graph
                     kind='scatter',
                     x='PC1', y='PC2',
                     color=cmap(i / (nclusters - 1)),  # cmap maps a number to a color
                     label="%s %i" % (groupby, i),
                     s=30)  # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset")
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
# so those are the *predicted* labels - what about the *actual* labels?
plotData(df_plot, 'SpeciesId')
# so the k-means clustering *did not* find the correct clusterings!
# q. so what do these dimensions mean?
# they're the principal components, which pick out the directions
# of maximal variation in the original data.
# PC1 finds the most variation, PC2 the second-most.
# the rest of the data is basically thrown away when the data is reduced down to 2d.
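Coming back to the original question about clustering essays: with text, k-means is normally run on document vectors rather than on raw text. A minimal sketch of that pipeline, where the TF-IDF vectorizer, the toy corpus, and the cluster count are my own assumptions rather than part of the answer above:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# toy corpus standing in for the essays to be clustered
essays = [
    "the economy and rising inflation",
    "football and the world cup final",
    "monetary policy and interest rates",
    "the champions league semi final",
]
# TF-IDF turns each essay into a sparse vector; k-means then groups similar vectors
X = TfidfVectorizer(stop_words="english").fit_transform(essays)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(km.labels_)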

How to apply clustering on sentences embeddings?

I would like to create a summary with the major points of the original document. To do this, I made sentence embeddings with the Universal Sentence Encoder (https://tfhub.dev/google/universal-sentence-encoder/2). Afterwards, I would like to apply clustering to my vectors.
I've tried with the sklearn library:
import numpy as np
from sklearn.cluster import KMeans
n_clusters = np.ceil(len(encoded)**0.5)
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)
But I get an error message:
'numpy.float64' object cannot be interpreted as an integer
The problem is caused in this line:
n_clusters = np.ceil(len(encoded)**0.5)
KMeans expects the number of clusters to be an integer, so simply cast the result:
n_clusters = int(np.ceil(len(encoded)**0.5))
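For completeness, a corrected version of the snippet; the random array below is just a placeholder standing in for the Universal Sentence Encoder embeddings:
import numpy as np
from sklearn.cluster import KMeans
# placeholder: `encoded` is assumed to be an (n_sentences, embedding_dim) array of embeddings
encoded = np.random.rand(50, 512)
n_clusters = int(np.ceil(len(encoded) ** 0.5))  # KMeans needs an integer cluster count
kmeans = KMeans(n_clusters=n_clusters).fit(encoded)
print(kmeans.labels_[:10])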

How to create heatmap in Python using OSMnx and a list of values corresponding to the graph's nodes?

I've built a simulation model of pedestrians walking on a network using OSMnx, and one of the simulation's outputs is a list "Visits" that corresponds to the nodes in NodesList = list(Graph.nodes).
How can I create a heatmap using those lists and OSMnx?
For example:
NodesList[:5]
Output: [1214630921, 5513510924, 5513510925, 5513510926, 5243527186]
Visits[:5]
Output: [1139, 1143, 1175, 1200, 1226]
P.S.
the type of heatmap is not important (node size, node color, etc.)
Since you specified the type of heatmap is not important, I have come up with the following solution.
import osmnx as ox
address_name='Melbourne'
#Import graph
G=ox.graph_from_address(address_name, distance=300)
#Make geodataframes from graph data
nodes, edges = ox.graph_to_gdfs(G, nodes=True, edges=True)
import numpy as np
#Create a new column in the nodes geodataframe with number of visits
#I have filled it up with random integers
nodes['visits'] = np.random.randint(0,1000, size=len(nodes))
#Now make the same graph, but this time from the geodataframes
#This will help retain the 'visits' columns
G = ox.save_load.gdfs_to_graph(nodes, edges)
#Then plot a graph where node size and node color are related to the number of visits
nc = ox.plot.get_node_colors_by_attr(G,'visits',num_bins = 5)
ox.plot_graph(G,fig_height=8,fig_width=8,node_size=nodes['visits'], node_color=nc)
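A hypothetical follow-up, not part of the answer above: to use the actual simulation output instead of random numbers, the Visits list from the question can be mapped onto the nodes geodataframe by OSM id, assuming NodesList and Visits are aligned as described:
import pandas as pd
# map each node id to its visit count, then look it up for every row of the nodes geodataframe
visits_by_node = pd.Series(Visits, index=NodesList)
nodes['visits'] = nodes.index.map(visits_by_node)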

keep the scaling while drawing a weighted networkx graph

When I draw a weighted networkx graph, the drawing does not really represent the weights in terms of distance. I was curious whether there is a parameter I am missing or some other problem.
So I started by making a simulated dataset as follows:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import euclidean
import networkx as nx
from scipy.spatial.distance import pdist, squareform, cdist
# data generation
data = vstack((rand(5,2) + array([12,12]),rand(5,2)))
a = pdist(data, 'euclidean')
def givexy(index1D, VectorLength):
    return [index1D % VectorLength, index1D // VectorLength]
import matplotlib.pyplot as plt
plt.plot(data[:,0], data[:,1], 'o')
plt.show()
Then I calculate the Euclidean distance between all pairs and use the distance as the edge weight:
G = nx.empty_graph(1)
for cnt, item in enumerate(a):
    print(cnt)
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos)
edge_labels = dict([((u, v), "%.2f" % d['weight'])
                    for u, v, d in G.edges(data=True)])
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
#~ nx.draw(G,pos,edge_labels=edge_labels)
plt.show()
exit()
You might get a different plot; for some unknown reason it is random. My main problem is the distance between nodes: for example, the distance between nodes 4 and 8 is 0.82, but it is drawn longer than the distance between nodes 7 and 0.
Any hint?
Thank you,
The spring layout doesn't explicitly use the weights as distances. Higher weight edges produce shorter edges in general.
Though if you want to specify the positions explicitly you can do that:
from numpy import vstack,array
from numpy.random import rand
from scipy.spatial.distance import euclidean, pdist
import networkx as nx
import matplotlib.pyplot as plt
# data generation
data = vstack((rand(5,2) + array([12,12]),rand(5,2)))
a = pdist(data, 'euclidean')
def givexy(index1D, VectorLength):
    return [index1D % VectorLength, index1D // VectorLength]
plt.plot(data[:,0], data[:,1], 'o')
G = nx.Graph()
for cnt, item in enumerate(a):
    print(cnt)
    G.add_edge(givexy(cnt, 10)[0], givexy(cnt, 10)[1], weight=item, length=0)
pos = {}
for node, row in enumerate(data):
    pos[node] = row
nx.draw_networkx(G, pos)
plt.savefig('drawing.png')
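If the goal is a layout in which edge lengths track the weights more closely, another option worth trying (a hedged suggestion, not part of the original answer) is kamada_kawai_layout, which interprets edge weights as target distances:
import matplotlib.pyplot as plt
import networkx as nx
# G is the weighted graph built above; the Kamada-Kawai layout tries to place
# nodes so that drawn distances match the (weighted) graph distances
pos = nx.kamada_kawai_layout(G, weight='weight')
nx.draw_networkx(G, pos)
plt.savefig('drawing_kamada_kawai.png')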