How to use scipy.interpolate.LinearNDInterpolator with own triangulation - scipy

I have my own triangulation algorithm that creates a triangulation based on both Delaunay's condition and the gradient such that the triangles align with the gradient.
This is an example output:
The above description is not relevant to the question but is necessary for the context.
Now I want to use my triangulation with scipy.interpolate.LinearNDInterpolator to do an interpolation.
With scipy's Delaunay I would do the following:
import numpy as np
import scipy.interpolate
import scipy.spatial
points = np.random.rand(100, 2)
values = np.random.rand(100)
delaunay = scipy.spatial.Delaunay(points)
ip = scipy.interpolate.LinearNDInterpolator(delaunay, values)
This delaunay object has delaunay.points and delaunay.simplices that form the triangulation. I have the exact same information with my own triangulation, but scipy.interpolate.LinearNDInterpolator requires a scipy.spatial.Delaunay object.
I think I would need to subclass scipy.spatial.Delaunay and implement the relevant methods. However, I don't know which ones I need in order to get there.

I wanted to do this very same thing with the Delaunay triangulation offered by the triangle package. The triangle Delaunay code is about eight times faster than the SciPy one on large (~100_000) points. (I encourage other developers to try to beat that :) )
Unfortunately, SciPy's LinearNDInterpolator relies heavily on specific attributes of the SciPy Delaunay triangulation object. These are filled in by the _get_delaunay_info() Cython code, which is difficult to pick apart. Even knowing which attributes are needed (there seem to be many, including things like paraboloid_scale and paraboloid_shift), I'm not sure how I would extract them from a different triangulation library.
Instead, I tried @Patol75's approach (see the comment on the question above), but using LinearTriInterpolator instead of the cubic one. The code runs correctly, but is slower than doing the entire thing in SciPy. Interpolating 400_000 points from a cloud of 400_000 points takes about 3 times longer with the Matplotlib code than with SciPy. The Matplotlib tri code is written in C++, so converting it to, e.g., CuPy is not straightforward. If we could mix the two approaches, we could reduce the total time from 3.65 s (SciPy) or 10.2 s (Matplotlib) to about 1.1 s!
import numpy as np
np.random.seed(1)
N = 400_000
shape = (100, 100)
points = np.random.random((N, 2)) * shape # spread over 100, 100 to avoid float point errors
vals = np.random.random((N,))
interp_points1 = np.random.random((N,2)) * shape
interp_points2 = np.random.random((N,2)) * shape
triangle_input = dict(vertices=points)
### Matplotlib Tri
import triangle as tr
from matplotlib.tri import Triangulation, LinearTriInterpolator
triangle_output = tr.triangulate(triangle_input) # 280 ms
tri = triangle_output['triangles'] # reuse the result instead of triangulating twice
tri = Triangulation(*points.T, tri) # 5 ms
func = LinearTriInterpolator(tri, vals) # 9490 ms
func(*interp_points1.T).data # 116 ms
# returns [0.54467719, 0.35885304, ...]
# total time 10.2 sec
### Scipy interpolate
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator
tri = Delaunay(points) # 2720 ms
func = LinearNDInterpolator(tri, vals) # 1 ms
func(interp_points1) # 925 ms
# returns [0.54467719, 0.35885304, ...]
# total time 3.65 sec
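For what it's worth, the Matplotlib route does accept an arbitrary triangle list, which is the part the original question is after. Below is a minimal sketch, assuming the custom triangulation is available as an (n_triangles, 3) integer array of vertex indices. The name my_triangles is a hypothetical stand-in; it is filled from scipy.spatial.Delaunay here only so that the result can be checked against LinearNDInterpolator on the same triangles.

```python
import numpy as np
from matplotlib.tri import Triangulation, LinearTriInterpolator
from scipy.interpolate import LinearNDInterpolator
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)

# Corner points are included so the convex hull is the whole unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
points = np.vstack([corners, rng.random((200, 2))])
values = rng.random(len(points))

# Stand-in for "my own triangulation": any (n_triangles, 3) array of vertex
# indices would do. Borrowed from SciPy here purely for the comparison.
delaunay = Delaunay(points)
my_triangles = delaunay.simplices

# Matplotlib takes the triangle list directly; no Delaunay object is needed.
triangulation = Triangulation(points[:, 0], points[:, 1], my_triangles)
mpl_interp = LinearTriInterpolator(triangulation, values)

# Query points strictly inside the hull, so nothing comes back masked.
query = 0.05 + 0.9 * rng.random((50, 2))
mpl_result = mpl_interp(query[:, 0], query[:, 1]).filled(np.nan)

# Reference: SciPy's linear interpolator on the same triangles.
scipy_result = LinearNDInterpolator(delaunay, values)(query)

print(np.allclose(mpl_result, scipy_result))
```

This only shows that the two interpolators agree on a shared triangulation; it says nothing about speed, which is the sticking point measured above.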

Related

How to generate a triangle free graph in Networkx (with randomseed)?

After checking the documentation on triangles of networkx, I've wondered if there is a more efficient way of generating a triangle free graph than to randomly spawn graphs until a triangle free one happens to emerge, (in particular if one would like to use a constant random seed).
Below is code that spawns graphs until they are triangle free, yet with varying random seeds. For a graph of 10 nodes it already takes roughly 20 seconds.
import networkx as nx

def create_triangle_free_graph(show_graphs):
    seed = 42  # defined but never passed on, so every call draws a different graph
    nr_of_nodes = 10
    probability_of_creating_an_edge = 0.85
    nr_of_triangles = 1  # Initialise at 1 to initiate while loop.
    while nr_of_triangles > 0:
        graph = nx.fast_gnp_random_graph(
            nr_of_nodes, probability_of_creating_an_edge
        )
        triangles = nx.triangles(graph).values()
        nr_of_triangles = sum(triangles) / 3
        print(f"nr_of_triangles={nr_of_triangles}")
    return graph
Hence, I would like to ask:
Are there faster ways to generate triangle free graphs (using random seeds) in networkx?
A triangle exists in a graph iff two vertices connected by an edge share one or more neighbours. A triangle-free graph can be expanded by adding edges between nodes that share no neighbours. The empty graph is triangle-free, so there is a straightforward algorithm to create triangle-free graphs.
#!/usr/bin/env python
"""
Create a triangle free graph.
"""
import random
import networkx as nx
from itertools import combinations
def triangle_free_graph(total_nodes):
    """Construct a triangle free graph."""
    nodes = range(total_nodes)
    g = nx.Graph()
    g.add_nodes_from(nodes)
    edge_candidates = list(combinations(nodes, 2))
    random.shuffle(edge_candidates)
    for (u, v) in edge_candidates:
        # only connect u and v if they have no neighbour in common
        if not set(g.neighbors(u)) & set(g.neighbors(v)):
            g.add_edge(u, v)
    return g
g = triangle_free_graph(10)
print(nx.triangles(g))
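Since the question asks for a fixed random seed, here is a sketch of the same greedy construction with a private random.Random generator. The function name seeded_triangle_free_graph and its seed parameter are my own additions for illustration, not part of networkx.

```python
import random
from itertools import combinations

import networkx as nx


def seeded_triangle_free_graph(total_nodes, seed):
    """Greedy triangle-free graph; the same seed gives the same graph."""
    rng = random.Random(seed)  # private generator, global random state untouched
    graph = nx.Graph()
    graph.add_nodes_from(range(total_nodes))
    edge_candidates = list(combinations(range(total_nodes), 2))
    rng.shuffle(edge_candidates)
    for u, v in edge_candidates:
        # Adding (u, v) closes a triangle only if u and v share a neighbour.
        if not set(graph.neighbors(u)) & set(graph.neighbors(v)):
            graph.add_edge(u, v)
    return graph


first = seeded_triangle_free_graph(10, seed=42)
second = seeded_triangle_free_graph(10, seed=42)
print(sorted(first.edges()) == sorted(second.edges()))  # same seed, same edges
print(sum(nx.triangles(first).values()))  # triangle count summed over nodes
```

Looping over several seeds and keeping the densest result stays reproducible, because each seed always maps to the same graph.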
The number of edges in the resulting graph is highly dependent on the ordering of edge_candidates. To get a graph with the desired edge density, repeat the process until a graph with equal or higher density is found (and then remove superfluous edges), or until your patience runs out.
import warnings
cutoff = 0.85
max_iterations = 10_000
iteration = 0
while nx.density(g) < cutoff:
    g = triangle_free_graph(10)
    iteration += 1
    if iteration == max_iterations:
        warnings.warn("Maximum number of iterations reached!")
        break
# TODO: remove edges until the desired density is achieved
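The edge-removal step left as a TODO could look roughly like the sketch below. Removing edges can never create a triangle, so any subset of a triangle-free graph's edges is safe to drop. One caveat (my own observation, not from the answer): by Mantel's theorem a triangle-free graph on n nodes has at most floor(n^2 / 4) edges, so for 10 nodes the density cannot exceed 25/45, about 0.56, and a cutoff of 0.85 can never be met. The helper trim_to_density is a hypothetical name, and the complete bipartite graph is used only as a convenient triangle-free input of maximum density.

```python
import random

import networkx as nx


def trim_to_density(graph, target_density, seed=None):
    """Return a copy of graph with random edges removed down to target_density."""
    rng = random.Random(seed)
    trimmed = graph.copy()
    n = trimmed.number_of_nodes()
    possible_edges = n * (n - 1) // 2
    target_edges = int(target_density * possible_edges)  # rounded down
    edges = list(trimmed.edges())
    rng.shuffle(edges)
    surplus = len(edges) - target_edges
    if surplus > 0:
        trimmed.remove_edges_from(edges[:surplus])
    return trimmed


# K_{5,5} is bipartite, hence triangle-free, and attains Mantel's bound.
dense = nx.complete_bipartite_graph(5, 5)
mantel_bound = (10 * 10 // 4) / (10 * 9 // 2)  # maximum density for 10 nodes

trimmed = trim_to_density(dense, target_density=0.4, seed=1)
print(nx.density(dense), mantel_bound, nx.density(trimmed))
```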

Curve fitting of sine function in python using scipy is not yielding desired output

I'm trying to fit a sine function to my data. No errors are shown, but the fit doesn't seem to work.
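import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit as cf
# xdata and ydata are the measured arrays (not shown in the question)
def sin_fun(x,a,b):
    return (a*np.sin(b*x))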
p_opt,p_cov=cf(sin_fun,xdata,ydata)
print(p_opt)
plt.plot(xdata,sin_fun(xdata,*p_opt))
plt.scatter(xdata,ydata)
plt.show()
This is the output I am getting:
I have simulated your data. There are two reasons your code isn't doing what you want. First, your sin_fun needs a y-offset parameter; otherwise the function will always be symmetrical about y = 0. Second, the fit works better if you provide curve_fit with a reasonable initial guess, which is done using the p0 argument. Have a look here:
from scipy.optimize import curve_fit as cf
import numpy as np
from matplotlib import pyplot as plt
# simulate your data
xdata = np.linspace(0, 25000, 256)
ydata = 15000 * np.sin(xdata/2000) + 22000
# add some noise
ydata += np.random.rand(xdata.size) * 2000
# sin function needs a y-offset -> c
def sin_fun(x,a,b,c):
    return a*np.sin(b*x)+c
# need a reasonable guess -> note that the guess is not quite right but curve_fit still works
p_opt,p_cov=cf(sin_fun,xdata,ydata, p0=(10000, 1/2500, 15000))
print(p_opt)
plt.plot(xdata,sin_fun(xdata,*p_opt))
plt.plot(xdata,ydata, 'r.', ms=1)
plt.show()
With these fixes you can get a good fit. You could also add a phase parameter to your function to help fit other sinusoids.
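A sketch of the phase-parameter variant follows. The simulated data, the phase value 0.7 and the way the initial guess is derived (offset from the mean, amplitude from the data range, frequency from the strongest FFT bin) are my own assumptions for illustration; any reasonable guess should do, as noted above.

```python
import numpy as np
from scipy.optimize import curve_fit


def sin_with_phase(x, a, b, phi, c):
    """Sine with amplitude a, angular frequency b, phase phi and offset c."""
    return a * np.sin(b * x + phi) + c


# Simulated data in the spirit of the answer, but with a phase shift.
rng = np.random.default_rng(0)
xdata = np.linspace(0, 25000, 256)
ydata = 15000 * np.sin(xdata / 2000 + 0.7) + 22000
ydata = ydata + rng.normal(0, 500, xdata.size)

# Data-driven initial guess.
c0 = ydata.mean()
a0 = (ydata.max() - ydata.min()) / 2
spectrum = np.abs(np.fft.rfft(ydata - c0))
freqs = np.fft.rfftfreq(xdata.size, d=xdata[1] - xdata[0])
# Skip bin 0 (the constant term) and take the strongest remaining frequency.
b0 = 2 * np.pi * freqs[1:][np.argmax(spectrum[1:])]

p_opt, p_cov = curve_fit(sin_with_phase, xdata, ydata, p0=(a0, b0, 0.0, c0))
print(p_opt)
```

The FFT guess is only as fine as the frequency resolution of the data window, so it is a starting point for curve_fit rather than a replacement for it.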

why k-means is better in clustering than topic modelling algorithms like LDA?

I want to know about the advantages of K-means in clustering essays to discover their topics. There are a lot of algorithms to do it, such as K-medoid, x-means, LDA, LSA, etc. Please give me a full description of the reasons for choosing the k-means algorithm.
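I don't think you can draw parallels between all these things. I would highly recommend doing some well-defined Googling on your side, and coming back here with a more refined question, or questions. In the meantime, I'll share with you what little I know about these topics. Note that the LDA in the code below is Linear Discriminant Analysis, a supervised dimensionality-reduction method, not the Latent Dirichlet Allocation topic model. First, let's look at PCA & LDA...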
import numpy as np
import pandas as pd
# Importing the Dataset
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
#dataset = pd.read_csv(url, names=names)
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop(columns='species')
y = dataset['species']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# As mentioned earlier, PCA performs best with a normalized feature set. We will perform standard scalar normalization to normalize our feature set. To do this, execute the following code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Performance Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[11  0  0]
#  [ 0 12  1]
#  [ 0  1  5]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
# Accuracy 0.9333333333333333
# Results with 2 & 3 principal components
#from sklearn.decomposition import PCA
#pca = PCA(n_components=2) # at most 4 components are possible with the 4 iris features
#X_train = pca.fit_transform(X_train)
#X_test = pca.transform(X_test)
# https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
# LINEAR DISCRIMINANT ANALYSIS
# Data Preprocessing
# Once dataset is loaded into a pandas data frame object, the first step is to divide dataset into features and corresponding labels and then divide the resultant dataset into training and test sets. The following code divides data into labels and feature set:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
# As was the case with PCA, we need to perform feature scaling for LDA too. Execute the following script to do so:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[11  0  0]
#  [ 0 13  0]
#  [ 0  0  6]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
# Result:
# Accuracy 1.0
# https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/
Does that make sense? Hopefully it does. Now, let's move on to KMeans and PCA...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop(columns='species')
y = dataset['species']
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
ax = sns.scatterplot(x="sepal_length", y="sepal_width", hue="sepal_length", data=dataset)
ax = sns.scatterplot(x="petal_length", y="petal_width", hue="petal_length", data=dataset)
# ordinarily, when you don't have the actual labels, you might use
# silhouette analysis to determine a good number of clusters k to use.
# i.e. you would just run that same code for different values of k and print the value for
# the silhouette score.
# let's see what that value is for the case we just did, k=3.
from sklearn import metrics
score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)
score
# Result:
# 0.45994823920518646
# note that this is the mean over all the samples - there might be some clusters
# that are well separated and others that are closer together.
# so let's look at the distribution of silhouette scores...
scores = metrics.silhouette_samples(X_scaled, y_cluster_kmeans)
sns.histplot(scores, kde=True) # distplot is deprecated in newer seaborn versions
# so you can see that the blue species have higher silhouette scores
# (the legend doesn't show the colors though... so the pandas plot is more useful).
# note that if we used the best mean silhouette score to try to find the best
# number of clusters k, we'd end up with 2 clusters, because the mean silhouette
# score in that case would be largest, since the clusters would be better separated.
# but, that's using k-means - gmm might give better results...
# so that was clustering on the original 4d data.
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# Result:
#           PC1       PC2
# 90   0.279078 -1.120029
# 26  -2.051151  0.242164
# 83   1.061095 -0.633843
# 135  2.798770  0.856803
# 54   1.075475 -0.208421
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
y_id_array = pd.Categorical(y).codes # integer species ids (0, 1, 2); this definition was missing
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots
df_plot.sample(5)
# Result:
#           PC1       PC2  ClusterKmeans  SpeciesId
# 132  1.862703 -0.178549              0          2
# 85   0.429139  0.845582              0          1
# 139  1.852045  0.676128              0          2
# 33  -2.446177  2.150728              1          0
# 147  1.521170  0.269069              0          2
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize = (7,7))
    # color map (mpl.cm.get_cmap is no longer available in recent Matplotlib releases)
    cmap = mpl.colormaps['prism']
    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax = ax, # need to pass this so all scatterplots are on same graph
                     kind = 'scatter',
                     x = 'PC1', y = 'PC2',
                     color = cmap(i/(nclusters-1)), # cmap maps a number to a color
                     label = "%s %i" % (groupby, i),
                     s=30) # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset");
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
# so those are the *predicted* labels - what about the *actual* labels?
plotData(df_plot, 'SpeciesId')
# so the k-means clustering *did not* find the correct clusterings!
# q. so what do these dimensions mean?
# they're the principal components, which pick out the directions
# of maximal variation in the original data.
# PC1 finds the most variation, PC2 the second-most.
# the rest of the data is basically thrown away when the data is reduced down to 2d.
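The comments above mention running the same clustering for several values of k and comparing mean silhouette scores. A minimal sketch of that loop follows, assuming scikit-learn's bundled iris data (load_iris) in place of the CSV file and a range of k from 2 to 6, both my own choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardise the four iris features, as in the walkthrough above.
X_scaled = StandardScaler().fit_transform(load_iris().data)

silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

for k, score in silhouette_by_k.items():
    print(f"k={k}: mean silhouette = {score:.3f}")

# The k with the highest mean score is the one silhouette analysis would pick.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("best k by mean silhouette:", best_k)
```

As the comments point out, the highest mean score does not have to coincide with the true number of species, so this is a heuristic rather than a ground truth.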

Why does the HMC sampler return negative values for hyperparameters that need to be positive? [older GPflow versions before 1.0]

I'd like to build a GP with marginalized hyperparameters.
I have seen that this is possible with the HMC sampler provided in gpflow from this notebook
However, when I tried to run the following code as a first step of this (NOTE this is on gpflow 0.5, an older version), the returned samples are negative, even though the lengthscale and variance need to be positive (negative values would be meaningless).
import numpy as np
from matplotlib import pyplot as plt
import gpflow
from gpflow import hmc
X = np.linspace(-3, 3, 20)
Y = np.random.exponential(np.sin(X) ** 2)
Y = (Y - np.mean(Y)) / np.std(Y)
k = gpflow.kernels.Matern32(1, lengthscales=.2, ARD=False)
m = gpflow.gpr.GPR(X[:, None], Y[:, None], k)
m.kern.lengthscales.prior = gpflow.priors.Gamma(1., 1.)
m.kern.variance.prior = gpflow.priors.Gamma(1., 1.)
# don't want the likelihood variance to be a hyperparameter for now, so it is fixed
m.likelihood.variance = 1e-6
m.likelihood.variance.fixed = True
m.optimize(maxiter=1000)
samples = m.sample(500)
print(samples)
Output:
[[-0.43764571 -0.22753325]
[-0.50418501 -0.11070128]
[-0.5932655 0.00821438]
[-0.70217714 0.05077999]
[-0.77745654 0.09362291]
[-0.79404456 0.13649446]
[-0.83989415 0.27118385]
[-0.90355789 0.29589641]
...
I don't know HMC sampling in much detail, but I would expect the sampled posterior hyperparameters to be positive. I've checked the code and it seems this may be related to the Log1pe transform, though I failed to figure it out myself.
Any hint on this?
It would be helpful if you specified which GPflow version you are using, especially since the output you posted suggests a really old version of GPflow (pre-1.0), and this is something that has improved since.
What is happening here (in old GPflow) is that the sample() method returns a single S x P array, where S is the number of samples and P is the number of free parameters [e.g. for an M x M matrix parameter with a lower-triangular transform (such as the Cholesky factor of the covariance of the approximate posterior, q_sqrt), only M * (M + 1)/2 parameters are actually stored and optimised!]. These are the values in the unconstrained space, i.e. they can take any real value. Transforms (see the gpflow.transforms module) provide the mapping between this unconstrained value (anywhere between minus and plus infinity) and the constrained value (e.g. gpflow.transforms.positive for lengthscales and variances).
In old GPflow, the model provides a get_samples_df() method that takes the S x P array returned by sample() and returns a pandas DataFrame with columns for all the trainable parameters, which is what you want. Or, ideally, you would just use a recent version of GPflow, in which the HMC sampler directly returns the DataFrame!
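To make the unconstrained/constrained distinction concrete, here is a small NumPy sketch. It assumes the positive transform is a plain softplus, log(1 + exp(x)), which is what the name Log1pe suggests; GPflow's own implementation may differ in details such as a small lower bound, so treat this as an illustration rather than a reimplementation. The input rows are the first three samples from the question.

```python
import numpy as np


def softplus(unconstrained):
    """Map any real number to a positive one: log(1 + exp(x))."""
    return np.logaddexp(0.0, unconstrained)


def softplus_inverse(positive):
    """Inverse mapping: log(exp(y) - 1), valid for y > 0."""
    return np.log(np.expm1(positive))


# First three rows of the sample() output posted in the question.
raw_samples = np.array([
    [-0.43764571, -0.22753325],
    [-0.50418501, -0.11070128],
    [-0.5932655, 0.00821438],
])

constrained = softplus(raw_samples)
print(constrained)  # every entry is positive
```

Negative entries in the raw array are therefore expected; only the transformed values have to be positive.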

Why is the output different for code ported from MATLAB to Python?

EDIT: After some more testing and a response from the scipy mailing list, the issue appears to be with fspecial(). To get the same output I need to generate the same kind of kernel in Python as the MATLAB fspecial command produces. For now I will try to export the kernel from MATLAB and work from there. Added as an edit since the question has been "closed".
I am trying to port the following MATLAB code to Python. It seems to work, but the output is different from MATLAB's. I think the problem is with applying a "mean" filter to the log(amplitude). Any help appreciated.
The MATLAB code is from: http://www.klab.caltech.edu/~xhou/projects/spectralResidual/spectralresidual.html
%% Read image from file
inImg = im2double(rgb2gray(imread('1.jpg')));
inImg = imresize(inImg, 64/size(inImg, 2));
%% Spectral Residual
myFFT = fft2(inImg);
myLogAmplitude = log(abs(myFFT));
myPhase = angle(myFFT);
mySpectralResidual = myLogAmplitude - imfilter(myLogAmplitude, fspecial('average', 3), 'replicate');
saliencyMap = abs(ifft2(exp(mySpectralResidual + i*myPhase))).^2;
%% After Effect
saliencyMap = mat2gray(imfilter(saliencyMap, fspecial('gaussian', [10, 10], 2.5)));
imshow(saliencyMap);
Here is my attempt in Python:
from skimage import img_as_float
from skimage.io import imread
from skimage.color import rgb2gray
import numpy as np
from scipy import fftpack, ndimage, misc
from scipy.ndimage import uniform_filter
import matplotlib.pyplot as plt
# Read image from file
image = img_as_float(rgb2gray(imread('1.jpg')))
image = misc.imresize(image, 64.0 / image.shape[0])
# Spectral Residual
fft = fftpack.fft2(image)
logAmplitude = np.log(np.abs(fft))
phase = np.angle(fft)
avgLogAmp = uniform_filter(logAmplitude, size=3, mode="nearest") # Is this the same as applying a "mean" filter?
spectralResidual = logAmplitude - avgLogAmp
saliencyMap = np.abs(fftpack.ifft2(np.exp(spectralResidual + 1j * phase))) ** 2
# After Effect
saliencyMap = ndimage.gaussian_filter(saliencyMap, sigma=2.5)
plt.imshow(saliencyMap)
plt.show()
For completeness, here is an input image and the output from MATLAB and Python.
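Following up on the EDIT about fspecial(): a sketch of how the two kernels might be rebuilt in NumPy instead of exporting them. It rests on my reading of the MATLAB documentation: fspecial('average', n) is an n x n array of 1/n^2, and fspecial('gaussian', [r c], sigma) samples exp(-(x^2 + y^2) / (2 sigma^2)) on a grid centred on the kernel and normalises it to sum 1. The function names below are hypothetical. imfilter's 'replicate' option corresponds to mode="nearest" in scipy.ndimage, while imfilter without a boundary option pads with zeros (mode="constant"). The alignment of even-sized kernels may also differ between the two libraries and is worth checking against an exported kernel.

```python
import numpy as np
from scipy import ndimage


def fspecial_average(size):
    """size x size averaging kernel, every entry 1 / size**2."""
    return np.full((size, size), 1.0 / (size * size))


def fspecial_gaussian(shape, sigma):
    """Gaussian sampled on a grid centred on the kernel, normalised to sum 1."""
    rows, cols = shape
    y = np.arange(rows) - (rows - 1) / 2
    x = np.arange(cols) - (cols - 1) / 2
    kernel = np.exp(-(x[None, :] ** 2 + y[:, None] ** 2) / (2 * sigma ** 2))
    kernel[kernel < np.finfo(float).eps * kernel.max()] = 0
    return kernel / kernel.sum()


rng = np.random.default_rng(0)
log_amplitude = rng.random((8, 8))  # stand-in for the real log-amplitude array

# Explicit 3x3 mean kernel with replicated borders versus uniform_filter.
via_kernel = ndimage.correlate(log_amplitude, fspecial_average(3), mode="nearest")
via_uniform = ndimage.uniform_filter(log_amplitude, size=3, mode="nearest")
print(np.allclose(via_kernel, via_uniform))

gauss = fspecial_gaussian((10, 10), 2.5)
print(gauss.shape, gauss.sum())
```

If the two mean-filter results agree, the remaining difference is more likely in the Gaussian step, where gaussian_filter builds its own kernel rather than a fixed 10 x 10 one.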
I doubt anyone will be able to give you a firm answer on this. It could be any number of things... Could be that one FFT is 0-centered while the other isn't, could be a float vs double somewhere, could be mishandling of absolute value, could be a filter setting, ...
If I were you, I'd write out some intermediate values for both computations and find a way to compare them. Start in the middle, if they compare well then move down, if they don't compare well then move up. Maybe write an intermediate value from the python script out to a file, import into matlab, take the element-wise difference, and graph. If they're not the same dimensions, that's clue #1.
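A sketch of that workflow on the Python side: dump an intermediate array to a .mat file that MATLAB can load, and summarise the difference between two arrays. The helper describe_difference, the file name and the variable name logAmplitude are placeholders of mine; scipy.io.savemat and loadmat are the real SciPy functions for .mat files. The round trip below only demonstrates the mechanics, since the MATLAB side is not available here.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat


def describe_difference(reference, candidate):
    """Summarise how two intermediate arrays differ."""
    if reference.shape != candidate.shape:
        # Different dimensions: "clue #1" from the answer above.
        return {"same_shape": False, "max_abs_diff": None}
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    return {"same_shape": True, "max_abs_diff": max_abs_diff}


rng = np.random.default_rng(0)
log_amplitude = rng.random((4, 6))  # stand-in for an intermediate result

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "python_intermediate.mat")
    # In MATLAB: load('python_intermediate.mat') would give the variable logAmplitude.
    savemat(path, {"logAmplitude": log_amplitude})
    reloaded = loadmat(path)["logAmplitude"]

report = describe_difference(log_amplitude, reloaded)
print(report)
```

Repeating this for each intermediate (the FFT, the log amplitude, the filtered array, the saliency map) narrows down the first step at which the two ports diverge.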