Using scipy.stats.gaussian_kde with 2 dimensional data - scipy

I'm trying to use the scipy.stats.gaussian_kde class to smooth out some discrete data collected with latitude and longitude information, so it shows up as somewhat similar to a contour map in the end, where the high densities are the peak and low densities are the valley.
I'm having a hard time putting a two-dimensional dataset into the gaussian_kde class. I've played around to figure out how it works with 1 dimensional data, so I thought 2 dimensional would be something along the lines of:
from scipy import stats
from numpy import array
data = array([[1.1, 1.1],
[1.2, 1.2],
[1.3, 1.3]])
kde = stats.gaussian_kde(data)
kde.evaluate([1,2,3],[1,2,3])
which is saying that I have 3 points at [1.1, 1.1], [1.2, 1.2], [1.3, 1.3], and I want to evaluate the kernel density estimate from 1 to 3 with a step of 1 on the x and y axes.
When creating the gaussian_kde, it keeps giving me this error:
raise LinAlgError("singular matrix")
numpy.linalg.linalg.LinAlgError: singular matrix
Looking into the source code of gaussian_kde, I realize that the way I'm thinking about what dataset means is completely different from how the dimensionality is calculated, but I could not find any sample code showing how multi-dimensional data works with the module. Could someone help me with some sample ways to use gaussian_kde with multi-dimensional data?

This example seems to be what you're looking for:
import numpy as np
import scipy.stats as stats
from matplotlib.pyplot import imshow
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(2000,1)),
stats.norm.rvs(loc=0,scale=3,size=(2000,1)),
axis=1)
kde = stats.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
imshow(z,aspect=np.ptp(x_flat)/np.ptp(y_flat))
Axes need fixing, obviously.
You can also do a scatter plot of the data with
scatter(rvs[:,0],rvs[:,1])

I think you are mixing up kernel density estimation with interpolation or maybe kernel regression. KDE estimates the distribution of points if you have a larger sample of points.
I'm not sure which interpolation you want, but either the splines or rbf in scipy.interpolate will be more appropriate.
If you want one-dimensional kernel regression, then you can find a version in scikits.statsmodels with several different kernels.
update: here is an example (if this is what you want)
>>> data = 2 + 2*np.random.randn(2, 100)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 0.02573917, 0.02470436, 0.03084282])
gaussian_kde has variables in rows and observations in columns, so reversed orientation from the usual in stats. In your example, all three points are on a line, so it has perfect correlation. That is, I guess, the reason for the singular matrix.
Adjusting the array orientation and adding a small noise, the example works, but still looks very concentrated, for example you don't have any sample point near (3,3):
>>> data = np.array([[1.1, 1.1],
[1.2, 1.2],
[1.3, 1.3]]).T
>>> data = data + 0.01*np.random.randn(2,3)
>>> kde = stats.gaussian_kde(data)
>>> kde.evaluate(np.array([[1,2,3],[1,2,3]]))
array([ 7.70204299e+000, 1.96813149e-044, 1.45796523e-251])
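To make the orientation point and the singular-matrix guess above concrete, here is a small sketch using only NumPy. It relies on the fact that gaussian_kde builds its kernel from the sample covariance of the data, with variables in rows; the variable names are just for illustration.

```python
import numpy as np

# The three collinear points from the question, one point per row.
points = np.array([[1.1, 1.1],
                   [1.2, 1.2],
                   [1.3, 1.3]])

# np.cov, like gaussian_kde, treats each ROW as a variable by default.
cov_as_given = np.cov(points)       # read as 3 variables, 2 observations
cov_transposed = np.cov(points.T)   # read as 2 variables, 3 observations

# A covariance matrix with rank below its size cannot be inverted.
rank_as_given = np.linalg.matrix_rank(cov_as_given)
rank_transposed = np.linalg.matrix_rank(cov_transposed)

print(cov_as_given.shape, rank_as_given)
print(cov_transposed.shape, rank_transposed)
```

In both orientations the covariance is rank deficient, which is consistent with the LinAlgError; the transposed data only becomes usable once the points stop lying exactly on a line.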

I found it difficult to understand the SciPy manual's description of how gaussian_kde works with 2D data. Here is an explanation which is intended to complement @endolith's example. I divided the code into several steps with comments to explain the less intuitive bits.
First, the imports:
import numpy as np
import scipy.stats as st
from matplotlib.pyplot import imshow, show
Create some dummy data: these are 1-D arrays of the "X" and "Y" point coordinates.
np.random.seed(142) # for reproducibility
x = st.norm.rvs(loc=2, scale=1, size=2000)
y = st.norm.rvs(loc=0, scale=3, size=2000)
For 2-D density estimation the gaussian_kde object has to be initialised with an array with two rows containing the "X" and "Y" datasets. In NumPy terminology, we "stack them vertically":
xy = np.vstack((x, y))
so the "X" data is in the first row xy[0,:] and the "Y" data are in the second row xy[1,:] and xy.shape is (2, 2000). Now create the gaussian_kde object:
dens = st.gaussian_kde(xy)
We will evaluate the estimated 2-D density PDF on a 2-D grid. There is more than one way of creating such a grid in NumPy. I show here an approach which is different from (but functionally equivalent to) @endolith's method:
gx, gy = np.mgrid[x.min():x.max():128j, y.min():y.max():128j]
gxy = np.dstack((gx, gy)) # shape is (128, 128, 2)
gxy is a 3-D array: the [i, j]-th element of gxy holds the corresponding "X" and "Y" values as a 2-element array, i.e. gxy[i, j] is [gx[i, j], gy[i, j]].
We have to invoke dens() (or dens.pdf() which is the same thing) on each of the 2-D grid points. NumPy has a very elegant function for this purpose:
z = np.apply_along_axis(dens, 2, gxy)
In words, the callable dens (could have been dens.pdf as well) is invoked along axis=2 (the third axis) of the 3-D array gxy and the values should be returned as a 2-D array. The only glitch is that the shape of z will be (128,128,1) and not (128,128) as I expected. Note that the documentation says that:
The shape of out [the return value, L.D.] is identical to the shape of arr, except along the
axis dimension. This axis is removed, and replaced with new dimensions
equal to the shape of the return value of func1d. So if func1d returns
a scalar out will have one fewer dimensions than arr.
Most likely dens() returned a 1-element array rather than the scalar I was hoping for. I didn't investigate the issue any further, because this is easy to fix:
z = z.reshape(128, 128)
after which we can generate the image:
imshow(z, aspect=np.ptp(gx) / np.ptp(gy))
show() # needed if you try this in PyCharm
Here is the image. (Note that I have implemented @endolith's version as well and got an image indistinguishable from this one.)
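As a possible alternative to calling dens once per grid point with apply_along_axis, the whole grid can be flattened into a single (2, M) array and evaluated in one call. This is only a sketch: the grid size n and the use of np.random.default_rng are my own choices, not part of the answer above.

```python
import numpy as np
import scipy.stats as st

# Dummy data with the same distributions as in the walkthrough.
rng = np.random.default_rng(142)
x = rng.normal(loc=2, scale=1, size=2000)
y = rng.normal(loc=0, scale=3, size=2000)
dens = st.gaussian_kde(np.vstack((x, y)))

n = 64  # grid points per axis (hypothetical choice)
gx, gy = np.mgrid[x.min():x.max():n * 1j, y.min():y.max():n * 1j]

# Flatten the grid into a (2, n*n) array: one column per grid point.
positions = np.vstack((gx.ravel(), gy.ravel()))

# A single call evaluates every grid point; reshape back to the grid.
z = dens(positions).reshape(gx.shape)
print(z.shape)
```

Because np.mgrid puts x along the first axis, passing z.T with origin='lower' and extent=[x.min(), x.max(), y.min(), y.max()] to imshow should put x on the horizontal axis with correctly labelled ticks.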

The example posted in the top answer didn't work for me. I had to tweak it a little bit and it works now:
import numpy as np
import scipy.stats as stats
from matplotlib import pyplot as plt
# Create some dummy data
rvs = np.append(stats.norm.rvs(loc=2,scale=1,size=(2000,1)),
stats.norm.rvs(loc=0,scale=3,size=(2000,1)),
axis=1)
kde = stats.gaussian_kde(rvs.T)
# Regular grid to evaluate kde upon
x_flat = np.r_[rvs[:,0].min():rvs[:,0].max():128j]
y_flat = np.r_[rvs[:,1].min():rvs[:,1].max():128j]
x,y = np.meshgrid(x_flat,y_flat)
grid_coords = np.append(x.reshape(-1,1),y.reshape(-1,1),axis=1)
z = kde(grid_coords.T)
z = z.reshape(128,128)
plt.imshow(z,aspect=np.ptp(x_flat)/np.ptp(y_flat))
plt.show()

Related

knnsearch from Matlab to Julia

I am trying to run a nearest neighbour search in Julia using NearestNeighbors.jl package. The corresponding Matlab code is
X = rand(10);
Y = rand(100);
Z = zeros(size(Y));
Z = knnsearch(X, Y);
This generates Z, a vector of length 100, where the i-th element is the index of X whose element is nearest to the i-th element in Y, for all i=1:100.
Could really use some help converting the last line of the Matlab code above to Julia!
Use:
X = rand(1, 10)
Y = rand(1, 100)
nn(KDTree(X), Y)[1]
Storing the intermediate KDTree object would be useful if you wanted to reuse it in the future (as it will improve the efficiency of queries).
Now for the crucial point of my example. NearestNeighbors.jl accepts the following input data:
It can either be:
a matrix of size nd × np with the points to insert in the tree where nd is the dimensionality of the points and np is the number of points
a vector of vectors with fixed dimensionality, nd, which must be part of the type.
I have used the first approach. The point is that observations must be in columns (not in rows as in your original code). Remember that in Julia vectors are columnar, so rand(10) is considered to be 1 observation that has 10 dimensions by NearestNeighbors.jl, while rand(1, 10) is considered to be 10 observations with 1 dimension each.
However, since your original data is small and one-dimensional and you only want the single nearest neighbor, it is enough to write (here I assume X and Y are the original data you have stored in vectors):
[argmin(abs(v - y) for v in X) for y in Y]
without using any extra packages.
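For readers coming from Python rather than Matlab, the same brute-force idea can be sketched with NumPy broadcasting. The random data below is a stand-in for the X and Y of the question, and note that the resulting indices are 0-based, unlike Matlab and Julia.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random(10)    # reference points
Y = rng.random(100)   # query points

# Build the (100, 10) table of |Y[i] - X[j]| and take, for each row,
# the column index of the smallest distance.
nearest = np.abs(Y[:, None] - X[None, :]).argmin(axis=1)
print(nearest[:5])
```

Like the one-liner above, this compares every query point with every reference point, so it is only sensible for small inputs; a tree-based search is the better fit for large or high-dimensional data.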
The NearestNeighbors.jl is very efficient for working with high-dimensional data that has very many elements.

Using numerical methods to plot solution to first-order nonlinear differential equation in Matlab

I have a question about plotting x(t), the solution to the following differential equation knowing that dx/dt equals the expression below. The value of x is 0 at t = 0.
syms x
dxdt = -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37)
I want to plot the solution of this first-order nonlinear differential equation. The analytical solution involves complex numbers so that's not relevant because this equation models a real-life process, but Matlab can solve the equation using numerical methods and plot it. Can someone please suggest how to do this?
in matlab try this
tspan = [0 10];
x0 = 0;
[t,x] = ode45(@(t,x) -(1.0*(6.84e+45*x^2 + 5.24e+32*x - 2.49e+42))/(2.47e+39*x + 7.12e+37), tspan, x0);
plot(t,x,'b')
I tried it and got a plot.
Hope that helps.
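The same kind of numerical integration can be sketched in plain Python with a hand-written fixed-step RK4 scheme, which is a simpler method than the adaptive one ode45 uses. The function names, the end time of 1e-5 and the step count are my own assumptions: with these coefficients the right-hand side at x = 0 is roughly 3.5e4, so x appears to settle within a few microseconds and a tspan of [0 10] would show mostly a flat line.

```python
def dxdt(x):
    """Right-hand side of the ODE from the question (no explicit t dependence)."""
    return -(6.84e45 * x**2 + 5.24e32 * x - 2.49e42) / (2.47e39 * x + 7.12e37)


def rk4(f, x0, t_end, n_steps):
    """Integrate dx/dt = f(x) from t = 0 to t_end with fixed-step RK4."""
    h = t_end / n_steps
    ts, xs = [0.0], [x0]
    x = x0
    for i in range(n_steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        ts.append((i + 1) * h)
        xs.append(x)
    return ts, xs


# Assumed time window and resolution; adjust to the time scale of interest.
ts, xs = rk4(dxdt, x0=0.0, t_end=1e-5, n_steps=2000)
print(xs[-1])
```

The lists ts and xs can then be passed to any plotting routine, for example matplotlib's plot(ts, xs).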
I have written an example of how to use Python with SymPy and matplotlib. SymPy can be used to calculate both definite and indefinite integrals. The idea is to calculate the indefinite integral and add a constant so that it evaluates to 0 at t = 0. Once you have the integral, it is just a matter of plotting. I would define an array from a starting point to an endpoint with 1000 points between (could likely be fewer). You can then calculate the value of the integral with the constant at each time point, which can then be plotted with matplotlib. There are plenty of other questions on how to customize plots with matplotlib.
This displays a basic plot of the indefinite integral of the function dxdt under the assumption x(0) = 0. Varying the tuple passed to Plotting() sets the range of x values to plot. The code below plots 500 data points between the minimum and maximum values given when calling the function.
For more information on customizing the plot, I recommend matplotlib documentation. Documentation on the integral can be found in SymPy documentation.
import pandas as pd
from sympy import *
from sympy.abc import x
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
def Plotting(xValues, dxdt):
    # Calculate integral
    xt = integrate(dxdt,x)
    # Convert to function
    f = lambdify(x, xt)
    C = -f(0)
    # Define x values, last number in linspace corresponding to number of points to plot
    xValues = np.linspace(xValues[0],xValues[1],500)
    yValues = [f(x)+C for x in xValues]
    # Initialize figure
    fig = plt.figure(figsize = (4,3))
    ax = fig.add_axes([0, 0, 1, 1])
    # Plot Data
    ax.plot(xValues, yValues)
    plt.show()
    plt.close("all")
# Define Function
dxdt = -(1.0*(6.84e45*x**2 + 5.24e32*x - 2.49e42))/(2.47e39*x + 7.12e37)
# Run Plotting function, with left and right most points defined as tuple, and function as second argument
Plotting((-0.025, 0.05),dxdt)

Scaling vertex coordinates of marching cubes algorithm output in skimage

I am trying to use skimage.measure.marching_cubes_lewiner to resolve some isosurface f(x,y,z)=0. In my case f is strongly nonlinear, and is best mapped when coordinates are given with logarithmic spacing. Because the marching cubes wants a regular grid, to build the voxels, I am working on a meshgrid of coordinates X,Y,Z which correspond to the log10 of my original coordinates, so that my isosurface is equivalently given by f(10**X,10**Y,10**Z)=0. Everything would be fine, if it were not for the fact that, say I am working with X,Y,Z in [-1.5,2]^3 (equivalent to x,y,z in [0.03,100.]^3), the vertex coordinates of the solution given by skimage.measure.marching_cubes_lewiner, are not in this cube.
Following the answer to another related question on SO, I thought it could be due to the fact that, probably the algorithm works thinking of a unitary volume, so that I need to set the right spacing input argument in my call of skimage.measure.marching_cubes_lewiner. In this fashion, say I am mapping my function f on a grid of N points per coordinate, so that I am increasing exponents by numpy.diff([-1.5,2])/N per coordinate, I accordingly call:
import numpy as np
from skimage import measure as msr
def f(x,y,z):
    val = ... # some lengthy code to define my implicit function
    return val
# Define ranges of my coordinates
xRange = [0.03,100.]
yRange = [0.03,100.]
zRange = [0.03,100.]
XRange = np.log10(xRange)
YRange = np.log10(yRange)
ZRange = np.log10(zRange)
# Create regular grid
N = 50 # number of points per coordinate
X,Y,Z = np.mgrid[XRange[0]:XRange[1]:N*1j,
YRange[0]:YRange[1]:N*1j,
ZRange[0]:ZRange[1]:N*1j]
F = f(10**X,10**Y,10**Z)
sol,_,_,_ = msr.marching_cubes_lewiner(F,0.0,spacing=(np.diff(XRange)/N,np.diff(YRange)/N,np.diff(ZRange)/N))
yet, unexpectedly, the coordinates of the solution points generally seem in [0,Vx]*[0,Vy]*[0,Vz] with Vx>XRange[-1], Vy>YRange[-1] and Vz>ZRange[-1]. I have no clue of why this happens and how I could properly rescale the coordinates of my isosurface solution, to the real units of my problem.
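One way the rescaling might be done is sketched below with NumPy only. It assumes marching cubes is called with the default spacing, so that vertices come back in array-index units starting at 0 (which would match the [0, V] ranges described above), and that the grid has N points per axis including both endpoints. The helper name index_to_physical is hypothetical.

```python
import numpy as np


def index_to_physical(verts_idx, lo, hi, n):
    """Map vertices given in array-index units to physical coordinates.

    verts_idx : (V, 3) array of vertex positions in index units (0 .. n-1)
    lo, hi    : length-3 sequences with the lower/upper log10 bounds per axis
    n         : number of grid points per axis
    """
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    step = (hi - lo) / (n - 1)          # a grid built with n*1j includes both endpoints
    verts_log = lo + verts_idx * step   # scale by the step, then shift by the grid origin
    return 10.0 ** verts_log            # undo the log10 transform


N = 50
log_lo = np.log10([0.03, 0.03, 0.03])
log_hi = np.log10([100.0, 100.0, 100.0])

# Stand-in for the vertex array: the two extreme corners of the index grid.
corners = np.array([[0.0, 0.0, 0.0], [N - 1, N - 1, N - 1]])
physical = index_to_physical(corners, log_lo, log_hi, N)
print(physical)
```

Two details differ from the call in the question: the step between N grid points is the range divided by N - 1 rather than N, and the lower bound of each range has to be added back because the returned vertices are measured from the first voxel.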

separate 3D matrix like numpy

I am converting numpy code to matlab. tensor is a 3D matrix of 6 x 2D matrices of the tensor components. This code appears to then split them back into those 6 separate 2D matrices.
gxx, gxy, gxz, gyy, gyz, gzz = tensor
Can I do this as eloquently in matlab?
re OmG: gxx, etc are the six tensor components of a gravity grid. xx for 2nd derivative of x in the x direction, xy is the 2nd derivative of x in the y direction, etc. Those components will be put through a simple equation to calculate the invariants which will then calculate the depth of the gravity anomaly.
As @Div-iL says, you could simply assign each variable to a slice of the 3D array:
tensor = rand(5,3,6); % Random data to play with
gxx = tensor(:,:,1);
gxy = tensor(:,:,2);
% etc
However if you really wanted to do it automatically you could generate a cell-array of 2D arrays (using mat2cell) and then assign them to variables using a comma-separated list assignment:
[nx,ny,nz] = size(tensor);
ca = mat2cell(tensor, nx, ny, ones(1,nz));
[gxx, gxy, gxz, gyy, gyz, gzz] = ca{:};
However, that all feels a bit hairy to me. If you're looking for a natively-supported one-liner (like your example) then I think you're out of luck.

N-dimensional MatLab Meshgrid

I know I can do this by meshgrid up to 3-dimensional space.
If I do
[X,Y,Z] = meshgrid(1:3,10:14,4:8)
as in http://www.mathworks.com/help/matlab/ref/meshgrid.html, then I will get the grid points on the 3-D space.
But meshgrid can't do this for n-dimensional space.
How should I get grid points (do similar thing like meshgrid) on n-dimensional space (e.g. n=64) ?
To create a grid of n-dimensional data, you will want to use ndgrid
[yy,xx,zz,vv] = ndgrid(yrange, xrange, zrange, vrange);
This can be expanded to any arbitrary number of dimensions.
As Daniel notes, notice that the first two outputs are reversed in their naming since y (rows) are the first dimension in MATLAB.
If you want to go to really high dimensions (such as 64), when the inputs/outputs get unmanageable, you can setup cell arrays for the inputs and outputs and rely on cell array expansion to do the work:
ranges = cell(64, 1);
ranges{1} = xrange;
ranges{2} = yrange;
...
ranges{64} = vals;
outputs = cell(size(ranges));
[outputs{:}] = ndgrid(ranges{:});
As a side note, this can really blow up quickly as your number of dimensions grows. There may be a more elegant solution to what you're ultimately trying to do.
For example if I create example inputs (at 64 dimensions) and for each dimension choose a random number between 1 and 5 for the length, I get a "maximum variable size" error
ranges = arrayfun(@(x)1:randi([1 5]), 1:64, 'uniform', 0);
[xx,yy] = ndgrid(ranges{:});
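To put a number on how quickly a full grid blows up, and to show one way around it, here is a small Python sketch using only the standard library. The ranges are hypothetical (64 dimensions with 3 values each); the point is that grid points can be generated lazily instead of being stored all at once.

```python
import itertools
import math

# Hypothetical per-dimension ranges: 64 dimensions with 3 values each.
ranges = [range(1, 4) for _ in range(64)]

# Number of grid points a full ndgrid/meshgrid would need to materialise.
n_points = math.prod(len(r) for r in ranges)
print(f"{n_points:.3e} grid points")

# itertools.product yields grid points lazily, one tuple at a time,
# so only the points actually consumed are ever held in memory.
first_five = list(itertools.islice(itertools.product(*ranges), 5))
print(first_five[0])
```

Whether lazy iteration helps depends on the task: it avoids the memory error, but visiting every point of a 64-dimensional grid is still out of reach, so sampling or a different formulation is usually needed.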