Spectral clustering using scikit learn on graph generated through networkx - cluster-analysis

I have a 3000x50 feature matrix. I obtained a similarity matrix for it using sklearn.metrics.pairwise_distances and stored it as 'Similarity_Matrix'. I then used networkx to build a graph from that matrix with G=nx.from_numpy_matrix(Similarity_Matrix). I now want to perform spectral clustering on this graph G, but several Google searches have failed to turn up a decent example of scikit-learn spectral clustering on such a graph :( The official documentation shows how spectral clustering can be done on some image data, which is highly unclear, at least to a newbie like myself.
Can anyone give me a code sample for this, or for graph cuts or graph partitioning, using networkx, scikit-learn, etc.?
Thanks a million!

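adj_matrix = nx.to_numpy_array(G) converts the graph into an adjacency matrix, which will be your affinity matrix. (nx.from_numpy_matrix goes the other way, from a matrix to a graph, and the *_numpy_matrix functions were removed in NetworkX 3.0 in favour of the *_numpy_array versions.) You need to feed this matrix to scikit-learn like this: SpectralClustering(affinity='precomputed', assign_labels="discretize", random_state=0, n_clusters=2).fit_predict(adj_matrix)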
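If you don't have a similarity matrix, you can set the 'affinity' param to 'rbf' or 'nearest_neighbors' and pass the feature matrix itself instead. Note that a precomputed affinity must hold similarities (larger means more alike), not distances. The example below walks through the entire spectral clustering pipeline: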
from sklearn.cluster import SpectralClustering
import networkx as nx
import matplotlib.pyplot as plt
'''Graph creation and initialization'''
G=nx.Graph()
G.add_edge(1,2) # default edge weight=1
G.add_edge(3,4,weight=0.2) #weight represents edge weight or affinity
G.add_edge(2,3,weight=0.9)
G.add_edge("Hello", "World", weight= 0.6)
'''Matrix creation'''
adj_matrix = nx.to_numpy_array(G) #Converts graph to an adj matrix where adj_matrix[i][j] is the weight of the edge between nodes i and j (to_numpy_matrix was removed in NetworkX 3.0)
node_list = list(G.nodes()) #returns a list of nodes whose order matches the rows/columns of adj_matrix
'''Spectral Clustering'''
clusters = SpectralClustering(affinity = 'precomputed', assign_labels="discretize",random_state=0,n_clusters=2).fit_predict(adj_matrix)
plt.scatter([str(n) for n in node_list],clusters,c=clusters, s=50, cmap='viridis') #node labels are cast to str because the list mixes ints and strings
plt.show()
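For the situation in the question (a feature matrix plus pairwise_distances), here is a minimal sketch of turning distances into an affinity before clustering. The synthetic two-blob data, the Gaussian (RBF) conversion and the median-distance bandwidth are all assumptions made for illustration, not something taken from the original post:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 3000x50 feature matrix: two well-separated blobs
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 50)),
    rng.normal(6.0, 1.0, size=(100, 50)),
])

# pairwise_distances returns Euclidean DISTANCES (small = similar) by default
distances = pairwise_distances(X)

# Assumed conversion: Gaussian kernel with the median distance as bandwidth,
# so that large values now mean "similar", as affinity='precomputed' expects
sigma = np.median(distances)
affinity = np.exp(-distances ** 2 / (2.0 * sigma ** 2))

labels = SpectralClustering(
    n_clusters=2,
    affinity="precomputed",
    assign_labels="discretize",
    random_state=0,
).fit_predict(affinity)

print(np.bincount(labels))  # number of points assigned to each cluster
```

The bandwidth sigma usually needs tuning on real data. If the graph G is still needed, nx.to_numpy_array(G) should give back the same matrix that was used to build it.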

Related

Networkx - Get probability p(k) from network

I have plotted the histogram of network (dataframe), with count of 'k' node connections, like so:
import seaborn as sns
parameter ='k'
sns.histplot(network[parameter])
But now I need to create a modular random graph using above group distribution with:
from networkx.generators.community import random_partition_graph
random_partition_graph(sizes, p_in, p_out, seed=None, directed=False)
And, instead of counts, I need this value p(k), which must be passed as p_in.
p_in (float)
probability of edges with in groups
How do I get p(k) from my network?
This is how I would handle what you described. First, normalize your histogram so that its bars sum to 1. This can be done by setting the weights argument of your histogram appropriately. The normalized histogram can then be treated as the probability distribution of your degrees. Once you have this distribution, i.e. a list of probabilities (deg_prob in the code), you can randomly sample from it using np.random.choice(np.arange(np.amin(degrees),np.amax(degrees)+1), p=deg_prob, size=N_sampling). From these samples, you can then create a random expected_degree_graph by passing them as the w argument.
You can then compare the degree distribution of your original graph with the one from your random graph.
See below for the code and more details:
import networkx as nx
from networkx.generators.random_graphs import binomial_graph
from networkx.generators.degree_seq import expected_degree_graph
import matplotlib.pyplot as plt
import numpy as np
fig=plt.figure()
N_nodes=1000
G=binomial_graph(n=N_nodes, p=0.01, seed=0) #Creating a random graph as data
degrees = np.array([G.degree(n) for n in G.nodes()])#Computing degrees of nodes
bins_val=np.arange(np.amin(degrees),np.amax(degrees)+2) #Bins
deg_prob,_,_=plt.hist(degrees,bins=bins_val,align='left',weights=np.ones_like(degrees)/N_nodes,
color='tab:orange',alpha=0.3,label='Original distribution')#Histogram
#Sampling from distribution
N_sampling=500
random_sampling=np.random.choice(np.arange(np.amin(degrees),np.amax(degrees)+1), p=deg_prob, size=N_sampling)
#Creating random graph from samples
G_random_sampling=expected_degree_graph(random_sampling,seed=0,selfloops=False)
degrees_random_sampling = np.array([G_random_sampling.degree(n) for n in G_random_sampling.nodes()])
deg_prob_random_sampling,_,_=plt.hist(degrees_random_sampling,bins=bins_val,align='left',
weights=np.ones_like(degrees_random_sampling)/N_sampling,color='tab:blue',label='Sample distribution',alpha=0.3)
#Plotting both histograms
plt.xticks(bins_val)
plt.xlabel('degree')
plt.ylabel('Prob')
plt.legend()
plt.show()
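The output is a single plot that overlays the two normalized degree histograms (original distribution in orange, sample distribution in blue).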
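If only the numbers p(k) are needed and not the plot, a shorter route is possible. This is a minimal sketch assuming the network is available as a networkx graph; the gnp_random_graph call is only hypothetical stand-in data:

```python
import networkx as nx
import numpy as np

# Hypothetical example graph standing in for the real network
G = nx.gnp_random_graph(200, 0.05, seed=0)

# degree_histogram returns counts[k] = number of nodes with degree k
counts = np.array(nx.degree_histogram(G))

# Normalizing the counts gives the empirical degree distribution p(k)
p_k = counts / counts.sum()

# Mean degree computed from the distribution, as a sanity check
k_values = np.arange(len(p_k))
mean_degree = float((k_values * p_k).sum())
print(mean_degree)
```

Keep in mind that p_in in random_partition_graph is documented as a single float, so the whole distribution cannot be passed there as is. How to collapse p(k) into one edge probability depends on the model you have in mind.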

Need to find clusters and their centroids in a h5 crowd density map file

I'm trying to use clustering techniques to find the centroids (or medoids) of each group of people in a density map (derived from a real photo). How can I achieve that? I have already tried k-means, and the centroids it calculated may well be correct. But how can I display them on top of the image?
h5 file: density map of a crowd - points are representing people
Download the ".h5" from here: https://drive.google.com/file/d/1C5xvEQELswr4SJ5zhtYtUEVw2FbP2QWo/view?usp=sharing
I obtain the matrix of this h5 file through this code:
import sys
import numpy
import h5py
import matplotlib.pyplot as plt
from PIL import Image as im
with h5py.File('/content/img001001.h5', 'r') as hf:
    h5_matrix = hf.get('density')[:]  # read the 'density' dataset into a NumPy array
plt.imshow(h5_matrix)
#print(h5_matrix[:, 1])
print(h5_matrix.shape)
The plotted matrix looks like this:
https://drive.google.com/file/d/1f376lUPaWT58iBIg5E693uQfC22g5m3U/view?usp=sharing
What I would like to obtain: a density map with the centroids marked on it.
How can I achieve that?
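One possible way to do this, sketched under assumptions: treat every non-zero pixel as a point, weight it by its density value, run k-means, and draw the centroids on top of imshow. The synthetic density array, the number of clusters and the helper names density_centroids and show_centroids are all hypothetical; the real h5_matrix would take the place of the synthetic array.

```python
import numpy as np
from sklearn.cluster import KMeans

def density_centroids(density, n_clusters, threshold=0.0, seed=0):
    """Weighted k-means on the pixels whose density exceeds threshold."""
    rows, cols = np.nonzero(density > threshold)
    points = np.column_stack([cols, rows]).astype(float)  # (x, y) order
    weights = density[rows, cols]                         # denser pixels count more
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(points, sample_weight=weights)
    return km.cluster_centers_

def show_centroids(density, centroids):
    """Draw the density map and mark each centroid with a red cross."""
    import matplotlib.pyplot as plt
    plt.imshow(density)
    plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=80)
    plt.show()

# Synthetic stand-in for h5_matrix: two square groups of "people"
density = np.zeros((60, 80))
density[10:15, 10:15] = 1.0
density[40:45, 60:65] = 1.0

centroids = density_centroids(density, n_clusters=2)
print(centroids)  # one (x, y) pair per cluster
```

Calling show_centroids(density, centroids) then displays the overlay. k-means needs the number of groups up front, so on a real crowd map this value is a guess that has to be tuned.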

MFCC spectrogram vs SciPy Spectrogram

I am currently working on a Convolution Neural Network (CNN) and started to look at different spectrogram plots:
With regard to the Librosa plot (MFCC), the spectrogram is very different from the other spectrogram plots. I took a look at the comment posted here talking about the "undetailed" MFCC spectrogram. How can I carry out, in Python code, the task described in the solution given there?
Also, would this low-resolution MFCC plot miss any nuances as the images go through the CNN?
Any help in carrying out the Python Code mentioned here will be sincerely appreciated!
Here is my Python code for the comparison of the Spectrograms and here is the location of the wav file being analyzed.
Python Code
# Load various imports
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import scipy.io.wavfile
#24bit accessible version
import wavfile
plt.figure(figsize=(17, 30))
filename = 'AWCK AR AK 47 Attached.wav'
librosa_audio, librosa_sample_rate = librosa.load(filename, sr=None)
plt.subplot(4,1,1)
xmin = 0
plt.title('Original Audio - 24BIT')
fig_1 = plt.plot(librosa_audio)
sr = librosa_sample_rate
plt.subplot(4,1,2)
mfccs = librosa.feature.mfcc(y=librosa_audio, sr=librosa_sample_rate, n_mfcc=40)
librosa.display.specshow(mfccs, sr=librosa_sample_rate, x_axis='time', y_axis='hz')
plt.title('Librosa Plot')
print(mfccs.shape)
plt.subplot(4,1,3)
X = librosa.stft(librosa_audio)
Xdb = librosa.amplitude_to_db(abs(X))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
# plt.colorbar()
# maximum frequency
Fs = 96000.
samplerate, data = scipy.io.wavfile.read(filename)
plt.subplot(4,1,4)
plt.specgram(data, Fs=samplerate)
plt.title('Scipy Plot (Fs=96000)')
plt.show()
MFCCs are not spectrograms (time-frequency), but "cepstrograms" (time-cepstrum). Comparing an MFCC plot with a spectrogram visually is not easy, and I am not sure it is very useful either. If you wish to do so, then invert the MFCC to get back a (mel) spectrogram by applying an inverse DCT. In librosa you can probably use librosa.feature.inverse.mfcc_to_mel for that.
This lets you estimate how much data was lost in the forward MFCC transformation. But it may not say much about how much information relevant to your task was lost, or how much irrelevant noise was removed.
This needs to be evaluated for your task and dataset. The best way is to try different settings, and evaluate performance using the evaluation metrics that you care about.
Note that MFCCs may not be such a great representation for the typical 2D CNNs that are applied to spectrograms. That is because locality has been reduced: in the MFCC domain, frequencies that are close to each other are no longer next to each other on the vertical axis. And because 2D CNNs have kernels with limited locality (typically 3x3 or 5x5 early on), this can reduce the performance of the model.

How to delete a random edge in networkx?

Suppose you have a graph graph = nx.read_gml("x.gml") and you'd like to drop n edges. Is there any quick way to do so?
Here is one approach using the sample function from the random library. I set k, the number of edges to be sampled, to 2.
import networkx as nx
import random
G=nx.Graph()
G.add_edges_from([[1,2],[1,3],[2,3],[2,4],[3,5],[4,5]])
to_remove=random.sample(list(G.edges()),k=2) #random.sample needs a sequence, so the edge view is converted to a list first
G.remove_edges_from(to_remove)
print(G.edges())
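For dropping an arbitrary number n of edges, the same idea can be wrapped in a small helper. This is only a sketch: drop_random_edges is a hypothetical name, and the seed parameter is added here just to make runs reproducible:

```python
import random
import networkx as nx

def drop_random_edges(G, n, seed=None):
    """Remove up to n randomly chosen edges from G in place; return them."""
    rng = random.Random(seed)
    edges = list(G.edges())  # random.sample requires a sequence
    to_remove = rng.sample(edges, k=min(n, len(edges)))
    G.remove_edges_from(to_remove)
    return to_remove

# Example graph: a 10-node cycle has exactly 10 edges
G = nx.cycle_graph(10)
removed = drop_random_edges(G, 3, seed=0)
print(removed, G.number_of_edges())
```

With a graph loaded through nx.read_gml("x.gml"), the call would be drop_random_edges(graph, n).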

Dimensionality reduction in HOG feature vector

I found out the HOG feature vector of the following image in MATLAB.
Input Image
I used the following code.
I = imread('input.jpg');
I = rgb2gray(I);
[features, visualization] = extractHOGFeatures(I,'CellSize',[16 16]);
features comes out to be a 1x1944 vector and I need to reduce the dimensionality of this vector (say to 1x100), what method should I employ for the same?
I thought of Principal Component Analysis and ran the following in MATLAB.
prinvec = pca(features);
prinvec comes out to be an empty matrix (1944x0). Am I doing it wrong? If not PCA, what other methods can I use to reduce the dimension?
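You can't do PCA on this, since you have a single observation with 1944 features. PCA works on the variance across observations, and after centering, n observations yield at most n-1 components, so one observation yields none, which is why prinvec is 1944x0. Get more observations (at least 101 images to keep 100 components, and preferably many more, some 10,000 presumably), stack their HOG vectors as rows, and you can do PCA.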
See PCA in matlab selecting top n components for the more detailed and mathematical explanation as to why this is the case.
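The effect is easy to reproduce outside MATLAB. Below is a Python/NumPy sketch (not MATLAB code) with random numbers standing in for real HOG vectors; pca_reduce is a hypothetical helper written for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(X, n_components):
    """Project the rows of X onto their top principal components.

    Returns the projected data and the number of usable components.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    usable = int(np.sum(s > 1e-10 * max(s.max(), 1.0)))
    k = min(n_components, usable)
    return Xc @ Vt[:k].T, usable

# One observation: centering leaves all zeros, so no components survive
one_image = rng.random((1, 1944))
reduced_one, usable_one = pca_reduce(one_image, 100)

# 150 observations: up to 149 components exist, so 100 can be kept
many_images = rng.random((150, 1944))
reduced_many, usable_many = pca_reduce(many_images, 100)

print(reduced_one.shape, usable_one)
print(reduced_many.shape, usable_many)
```

In MATLAB the equivalent step would be to stack one HOG vector per image into the rows of a matrix before calling pca.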