How is RMSD calculated in the Scipy implementation of the Kabsch algorithm?

Scipy calculates the rmsd like this, and I'll paraphrase it here for convenience (for readability I drop the weights and the max(*, 0))
rmsd = np.sqrt(np.sum(b ** 2 + a ** 2) - 2 * np.sum(s))
To me this does not look like RMSD.
Now from the docs one would infer that the rmsd return value is defined as the square root of double this expression:
The latter is indeed what I would consider to be the RMSD. In fact I went ahead and coded it up (note that this function expects me to apply the estimated transformation to one of the sets of points first, whereas the snippet above does not):
def _calc_rmsd(a: np.ndarray, b_transformed: np.ndarray) -> float:
    distances = np.linalg.norm(a - b_transformed, axis=-1)
    rmsd = np.sqrt((distances ** 2).sum() / len(distances))
    return rmsd
I also plotted out what these would look like for randomly generated point pairs with normally distributed noise (blue is scipy, orange is mine)
Or extending the plot out to 200 point pairs:
So to sum it up:
The definition of rmsd in the docs agrees with what I believe to be the widely accepted notion of RMSD.
The scipy code implementation of rmsd disagrees with the latter; I don't even understand what it's supposed to represent mathematically.
From Monte Carlo simulations, the two implementations clearly produce different results.
So what's going on?

Apparently the SciPy code is not returning the root-mean-squared distance. It sums the squared differences, but it does not divide by the number of vectors before taking the square root. The difference between the SciPy calculation and yours is a factor of sqrt(len(a)). You can verify this with an example such as the following.
In [157]: from scipy.spatial.transform import Rotation
In [158]: def _calc_rmsd(a: np.ndarray, b_transformed: np.ndarray) -> float:
     ...:     distances = np.linalg.norm(a - b_transformed, axis=-1)
     ...:     rmsd = np.sqrt((distances ** 2).sum() / len(distances))
     ...:     return rmsd
     ...:
Some test data:
In [159]: a = np.array([[0, 1, 1], [1, 1, 1.5], [2.0, -1.0, 4.0], [-1, 0, 5]])
In [160]: b = np.array([[0, 1, 1.5], [2, 2, 2], [1, -1, 5], [-3, 0.1, 1]])
Compute the rotation:
In [161]: R, rmsd = Rotation.align_vectors(a, b)
In [162]: rmsd
Out[162]: 3.8753534834716685
Here's your calculation of the RMSD:
In [163]: _calc_rmsd(a, R.apply(b))
Out[163]: 1.9376767417358356
And here is your calculation, multiplied by sqrt(len(a)), so it matches the result returned by Rotation.align_vectors:
In [164]: _calc_rmsd(a, R.apply(b)) * np.sqrt(len(a))
Out[164]: 3.875353483471671
This looks like a documentation issue. If you have a moment, you could create a new issue for this over in https://github.com/scipy/scipy/issues
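In the meantime, if you want the conventional RMSD you can simply divide the value returned by align_vectors by sqrt(N) yourself. A minimal sketch (the wrapper name is mine, not a scipy API):
import numpy as np
from scipy.spatial.transform import Rotation

def align_and_rmsd(a, b):
    # Rotation.align_vectors returns sqrt(sum of squared distances) after alignment;
    # dividing by sqrt(N) converts it to the conventional root-mean-square distance
    R, scipy_value = Rotation.align_vectors(a, b)
    return R, scipy_value / np.sqrt(len(a))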

Related

Is there a way to use scipy.interpolate to do 1d spline interpolation with boundary conditions specified away from data?

There are lots of ways of assembling splines directly from data while specifying derivatives at the boundary, but I do not see an easy way to specify derivative constraints away from the data. For example:
np.random.seed(0)
M = 30
x = np.random.rand(M)
x.sort()
y = np.random.rand(M) + (10 * x) ** 2
xx = np.linspace(x.min(), x.max(), 100) # for plotting
import scipy.interpolate as si
f = si.make_interp_spline(x, y) # ok but how to add derivative conditions away from the x-values?
x_derivs = [-1, 2]      # locations of the extra constraints (outside the data range)
order_derivs = [2, 1]   # order of the derivative constrained at each location
value_derivs = [0, 1]   # value the derivative should take there
More generally, is there a built in method to do this or take derivative constraints separate from value constraints?
I think it's just a matter of creating the operator matrix to solve the linear problem.
Short answer: it's not implemented at the moment.
You can of course modify the scipy source; the relevant part is
https://github.com/scipy/scipy/blob/v1.7.0/scipy/interpolate/_bsplines.py#L1105
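If you don't want to patch scipy, one way to approximate the idea from the question (build the operator matrix yourself) is to evaluate every B-spline basis function, and its derivatives, at the points of interest and solve the resulting linear system. This is only a hand-rolled sketch under my own assumptions (choice of knot vector, and constraint handling by heavily up-weighting the constraint rows), not an existing scipy API:
import numpy as np
import scipy.interpolate as si

np.random.seed(0)
M = 30
x = np.sort(np.random.rand(M))
y = np.random.rand(M) + (10 * x) ** 2

k = 3  # cubic spline
# knot vector chosen so the base interval also covers the constraint points -1 and 2
t = np.r_[[-1.0] * (k + 1), np.linspace(x.min(), x.max(), 8), [2.0] * (k + 1)]
n = len(t) - k - 1  # number of spline coefficients

def design_row(xi, nu=0):
    # value (nu=0) or nu-th derivative of every basis spline, evaluated at xi
    row = np.empty(n)
    for j in range(n):
        c = np.zeros(n)
        c[j] = 1.0
        row[j] = si.BSpline(t, c, k)(xi, nu=nu)
    return row

A = np.vstack([design_row(xi) for xi in x])  # value rows at the data points
x_derivs, order_derivs, value_derivs = [-1, 2], [2, 1], [0, 1]
C = np.vstack([design_row(xd, nu=o) for xd, o in zip(x_derivs, order_derivs)])

# least-squares fit to the data, with the derivative-constraint rows up-weighted
w = 1e6
coeffs, *_ = np.linalg.lstsq(np.vstack([A, w * C]),
                             np.r_[y, w * np.asarray(value_derivs, dtype=float)],
                             rcond=None)
f = si.BSpline(t, coeffs, k)
A cleaner alternative to the row weighting would be to solve the equality-constrained least-squares problem exactly (e.g. with Lagrange multipliers), but the sketch above keeps it short.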

Implementation details of positional encoding in transformer model?

How exactly is this positional encoding calculated?
Let's assume a machine translation scenario and these are input sentences,
english_text = ["this is good", "this is bad"]
german_text = ["das ist gut", "das ist schlecht"]
Now our input vocabulary size is 4 and embedding dimension is 4.
#words #embeddings
this - [0.5, 0.2, 0.3, 0.1]
is - [0.1, 0.2, 0.5, 0.1]
good - [0.9, 0.7, 0.9, 0.1]
bad - [0.7, 0.3, 0.4, 0.1]
As per the transformer paper, we add each word's positional encoding to its word embedding and then pass the result to the encoder, as seen in the image below.
As far as the paper is concerned, it gives this formula for calculating the positional encoding of each word: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
So, this is how I think I can implement it,
import numpy as np

d_model = 4              # embedding dimension
max_sentence_length = 3  # as per my examples above

positional_embeddings = np.zeros((max_sentence_length, d_model))
for position in range(max_sentence_length):
    for i in range(0, d_model, 2):
        positional_embeddings[position, i] = (
            np.sin(position / (10000 ** ((2 * i) / d_model)))
        )
        positional_embeddings[position, i + 1] = (
            np.cos(position / (10000 ** ((2 * (i + 1)) / d_model)))
        )
Then, the new embedding vector will be
[[0.5, 0.2, 0.3, 0.1],
[0.1, 0.2, 0.5, 0.1],
[0.9, 0.7, 0.9, 0.1]] + positional_embeddings = NEW EMBEDDINGS
## shapes
3 x 4 + 3 x 4 = 3 x 4
Is this how the calculation will be carried out in the implementation? Do correct me if there's any mistake in my above pseudo implementation.
If everything is correct, then I have three doubts; I hope someone can clear them up:
1) In the above implementation we use the sin formula for even indices and the cos formula for odd indices, but I couldn't understand the reason behind it. I read that it makes use of cyclic properties, but I couldn't understand how.
2) Is there a reason behind choosing 10000^(2i/d) or 10000^(2(i+1)/d) as the scaling factor in the formula?
3) Not all sentences will be as long as the maximum sentence length, so we may have to pad them; do we also calculate positional encodings for the padding tokens?
Your implementation is basically correct. The typical implementation is to pre-compute the embedding matrix, make a non-trainable embedding layer out of it, and do an embedding lookup of a range. See e.g. the implementation in HuggingFace's Transformers.
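A minimal sketch of that pre-compute-and-look-up pattern (written in PyTorch; the function name and shapes here are my own, not HuggingFace's exact code):
import torch
import torch.nn as nn

def sinusoidal_table(max_len, d_model):
    # pre-compute the table once: even columns get sin, odd columns get cos
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    div_term = 10000.0 ** (torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position / div_term)
    table[:, 1::2] = torch.cos(position / div_term)
    return table

max_len, d_model = 512, 4
pos_emb = nn.Embedding.from_pretrained(sinusoidal_table(max_len, d_model), freeze=True)

token_emb = torch.randn(3, d_model)              # stand-in for the 3 word embeddings above
positions = torch.arange(3)                      # positions 0, 1, 2
new_embeddings = token_emb + pos_emb(positions)  # (3, d_model) + (3, d_model)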
Some hints about the intuition behind the equations are in these threads:
on CrossValidated
on Reddit
But it seems to me that pretty much all decisions about the position encoding were empirical choices.
By cyclic properties, they IMHO mean that, given a dimension of the embedding, the difference of the embedding values between positions with a constant offset is the same regardless of the position in the sequence. For that, using only sine or only cosine might be enough, but some positions would have a much larger norm than the others, therefore they alternate sine and cosine.
I think the scaling factors are empirically estimated to cover the usual length of sentences.
With padding, you indeed also use the positional encodings of the padded positions, but since they are pre-computed, it does not mean a higher computation load: you get the embeddings for the padding symbols anyway.

Scipy.curve_fit() vs. Matlab fit() weighted nonlinear least squares

I have a Matlab reference routine that I am trying to convert to numpy/scipy. I have encountered a curve fitting problem that I cannot solve in Python. So here is a simple example which demonstrates the problem. The data is completely synthetic and not part of the problem.
Let's say I'm trying to fit a straight-line model of noisy data -
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1075, 1.3668, 1.5482, 3.1724, 4.0638, 4.7385, 5.9133, 7.0685, 8.7157, 9.5539]
For the unweighted solution in Matlab, I would code
g = @(m, b, x)(m*x + b)
f = fittype(g)
bestfit = fit(x, y, g)
which produces a solution of bestfit.m = 1.048, bestfit.b = -0.09219
Running this data through scipy.optimize.curve_fit() produces identical results.
If instead the fit uses a decay function to reduce the impact of data points
dw = [0.7290, 0.5120, 0.3430, 0.2160, 0.1250, 0.0640, 0.0270, 0.0080, 0.0010, 0]
weightedfit = fit(x, y, g, 'Weights', dw)
This produces a slope of 0.944 and an offset of 0.1484.
I have not figured out how to conjure this result from scipy.optimize.curve_fit using the sigma parameter. If I pass the weights as provided to Matlab, the '0' causes a divide by zero exception. Clearly Matlab and scipy are thinking very differently about the meaning of the weights in the underlying optimization routine. Is there a simple way of converting between the two that allows me to provide a weighting function which produces identical results?
Ok, so after further investigation I can offer the answer, at least for this simple example.
import numpy as np
import scipy as sp
import scipy.optimize

def modelFun(x, m, b):
    return m * x + b

def testFit():
    # diagonal sigma matrix built from the reciprocals of the Matlab weights
    w = np.diag([1.0, 1/0.7290, 1/0.5120, 1/0.3430, 1/0.2160, 1/0.1250, 1/0.0640, 1/0.0270, 1/0.0080, 1/0.0010])
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([0.1075, 1.3668, 1.5482, 3.1724, 4.0638, 4.7385, 5.9133, 7.0685, 8.7157, 9.5539])
    popt, pcov = sp.optimize.curve_fit(modelFun, x, y, sigma=w)
    print(popt)  # fitted parameters (m, b)
    print(pcov)  # covariance matrix of the fit
Which produces the desired result.
In order to force sp.optimize.curve_fit to minimize the same chisq metric as Matlab using the curve fitting toolbox, you must do two things:
1) Use the reciprocal of the weight factors.
2) Create a diagonal matrix from the new weight factors. According to the scipy reference:
sigma : None or M-length sequence or MxM array, optional
Determines the uncertainty in ydata. If we define residuals as r = ydata - f(xdata, *popt), then the interpretation of sigma depends on its number of dimensions:
A 1-d sigma should contain values of standard deviations of errors in ydata. In this case, the optimized function is chisq = sum((r / sigma) ** 2).
A 2-d sigma should contain the covariance matrix of errors in ydata. In this case, the optimized function is chisq = r.T @ inv(sigma) @ r. New in version 0.19.
None (default) is equivalent of 1-d sigma filled with ones.
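Following that definition, an equivalent and arguably simpler 1-d form is sigma = 1/sqrt(weights), since sum((r / sigma) ** 2) then equals sum(weights * r ** 2), which is the quantity Matlab minimizes. A quick sketch of that route (my own, not from the answer above; note the zero weight from the Matlab example has to be replaced by a small positive number to avoid the division by zero, which is effectively the same as dropping that point):
import numpy as np
from scipy.optimize import curve_fit

x = np.arange(10, dtype=float)
y = np.array([0.1075, 1.3668, 1.5482, 3.1724, 4.0638,
              4.7385, 5.9133, 7.0685, 8.7157, 9.5539])
dw = np.array([0.7290, 0.5120, 0.3430, 0.2160, 0.1250,
               0.0640, 0.0270, 0.0080, 0.0010, 1e-12])  # trailing 0 replaced

# sigma_i = 1/sqrt(w_i) makes curve_fit minimize sum(w_i * r_i**2), like Matlab
popt, pcov = curve_fit(lambda x, m, b: m * x + b, x, y, sigma=1.0 / np.sqrt(dw))
print(popt)  # slope and intercept of the weighted fit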

Pytorch, what are the gradient arguments

I am reading through the documentation of PyTorch and found an example where they write
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
where x was an initial variable, from which y was constructed (a 3-vector). The question is, what are the 0.1, 1.0 and 0.0001 arguments of the gradients tensor? The documentation is not very clear on that.
Explanation
For neural networks, we usually use a loss to assess how well the network has learned to classify the input image (or other tasks). The loss term is usually a scalar value. In order to update the parameters of the network, we need to calculate the gradient of the loss w.r.t. the parameters, which are actually leaf nodes in the computation graph (by the way, these parameters are mostly the weights and biases of various layers such as Convolution, Linear and so on).
According to the chain rule, in order to calculate the gradient of the loss w.r.t. a leaf node, we can compute the derivative of the loss w.r.t. some intermediate variable, and the gradient of that intermediate variable w.r.t. the leaf variable, then do a dot product and sum all of these up.
The gradient argument of a Variable's backward() method is used to calculate a weighted sum of each element of a Variable w.r.t. the leaf Variable. These weights are just the derivatives of the final loss w.r.t. each element of the intermediate variable.
A concrete example
Let's take a concrete and simple example to understand this.
from torch.autograd import Variable
import torch
x = Variable(torch.FloatTensor([[1, 2, 3, 4]]), requires_grad=True)
z = 2*x
loss = z.sum(dim=1)
# do backward for first element of z
z.backward(torch.FloatTensor([[1, 0, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated
# do backward for second element of z
z.backward(torch.FloatTensor([[0, 1, 0, 0]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# do backward for all elements of z, with weight equal to the derivative of
# loss w.r.t z_1, z_2, z_3 and z_4
z.backward(torch.FloatTensor([[1, 1, 1, 1]]), retain_graph=True)
print(x.grad.data)
x.grad.data.zero_()
# or we can directly backprop using loss
loss.backward() # equivalent to loss.backward(torch.FloatTensor([1.0]))
print(x.grad.data)
In the above example, the outcome of first print is
2 0 0 0
[torch.FloatTensor of size 1x4]
which is exactly the derivative of z_1 w.r.t. x.
The outcome of the second print is:
0 2 0 0
[torch.FloatTensor of size 1x4]
which is the derivative of z_2 w.r.t. x.
Now if we use a weight of [1, 1, 1, 1] to calculate the derivative of z w.r.t. x, the outcome is 1*dz_1/dx + 1*dz_2/dx + 1*dz_3/dx + 1*dz_4/dx. So, not surprisingly, the output of the 3rd print is:
2 2 2 2
[torch.FloatTensor of size 1x4]
It should be noted that the weight vector [1, 1, 1, 1] is exactly the derivative of loss w.r.t. z_1, z_2, z_3 and z_4. The derivative of loss w.r.t. x is then calculated as:
d(loss)/dx = d(loss)/dz_1 * dz_1/dx + d(loss)/dz_2 * dz_2/dx + d(loss)/dz_3 * dz_3/dx + d(loss)/dz_4 * dz_4/dx
So the output of 4th print is the same as the 3rd print:
2 2 2 2
[torch.FloatTensor of size 1x4]
Typically, your computational graph has one scalar output, say loss. Then you can compute the gradient of loss w.r.t. the weights (w) by loss.backward(), where the default argument of backward() is 1.0.
If your output has multiple values (e.g. loss=[loss1, loss2, loss3]), you can compute the gradients of loss w.r.t. the weights by loss.backward(torch.FloatTensor([1.0, 1.0, 1.0])).
Furthermore, if you want to add weights or importances to different losses, you can use loss.backward(torch.FloatTensor([-0.1, 1.0, 0.0001])).
This computes the weighted sum -0.1*d(loss1)/dw + 1.0*d(loss2)/dw + 0.0001*d(loss3)/dw in a single backward pass.
Here, the output of forward(), i.e. y, is a 3-vector.
The three values are the gradients at the output of the network. They are usually set to 1.0 if y is the final output, but can have other values as well, especially if y is part of a bigger network.
For example, if x is the input, y = [y1, y2, y3] is an intermediate output which is used to compute the final output z,
Then,
dz/dx = dz/dy1 * dy1/dx + dz/dy2 * dy2/dx + dz/dy3 * dy3/dx
So here, the three values to backward are
[dz/dy1, dz/dy2, dz/dy3]
and then backward() computes dz/dx
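To make that concrete, here is a small self-contained check (my own example, not from the question): passing a vector v to backward() gives the same gradient as reducing y to the scalar (v * y).sum() and calling backward() on that.
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                                # intermediate 3-vector, dy/dx = diag(2*x)
v = torch.tensor([0.1, 1.0, 0.0001])      # the "gradient" argument: dz/dy1, dz/dy2, dz/dy3

y.backward(v, retain_graph=True)          # vector-Jacobian product
print(x.grad)                             # tensor([2.0000e-01, 4.0000e+00, 6.0000e-04])

x.grad = None                             # reset before the equivalent formulation
(v * y).sum().backward()                  # same thing expressed as a scalar "loss"
print(x.grad)                             # identical result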
I can no longer find the original code on the PyTorch website.
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(x.grad)
The problem with the code above is that there is no function from which to calculate the gradients. This means we don't know how many parameters (arguments) the function takes, or their dimensions.
To fully understand this I created an example close to the original:
Example 1:
a = torch.tensor([1.0, 2.0, 3.0], requires_grad = True)
b = torch.tensor([3.0, 4.0, 5.0], requires_grad = True)
c = torch.tensor([6.0, 7.0, 8.0], requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients,retain_graph=True)
print(a.grad) # tensor([3.0000e-01, 3.0000e+00, 3.0000e-04])
print(b.grad) # tensor([1.2000e+00, 1.6000e+01, 2.0000e-03])
print(c.grad) # tensor([1.6667e-02, 1.4286e-01, 1.2500e-05])
I assumed our function is y=3*a + 2*b*b + torch.log(c) and the parameters are tensors with three elements inside.
You can think of gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) as the accumulator.
As you may have heard, the PyTorch autograd system's calculation is equivalent to a Jacobian product.
In case you have a function, like we did:
y=3*a + 2*b*b + torch.log(c)
The Jacobian would be [3, 4*b, 1/c]. However, computing this full Jacobian is not how PyTorch calculates the gradients at a given point.
PyTorch uses forward pass and backward mode automatic differentiation (AD) in tandem.
There is no symbolic math involved and no numerical differentiation.
Numerical differentiation would be to calculate δy/δb, for b=1 and b=1+ε where ε is small.
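If you do want to inspect that Jacobian explicitly, torch.autograd.functional.jacobian can compute it for you (this is a separate utility, not what backward() uses internally). A quick check with the values from Example 1:
import torch
from torch.autograd.functional import jacobian

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([3.0, 4.0, 5.0])
c = torch.tensor([6.0, 7.0, 8.0])

# y is elementwise, so each Jacobian block is diagonal
J_a, J_b, J_c = jacobian(lambda a, b, c: 3 * a + 2 * b * b + torch.log(c), (a, b, c))
print(J_a.diagonal())  # tensor([3., 3., 3.])             -> 3
print(J_b.diagonal())  # tensor([12., 16., 20.])          -> 4*b
print(J_c.diagonal())  # tensor([0.1667, 0.1429, 0.1250]) -> 1/c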
If you don't use gradients in y.backward():
Example 2
a = torch.tensor(0.1, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(0.1, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward()
print(a.grad) # tensor(3.)
print(b.grad) # tensor(4.)
print(c.grad) # tensor(10.)
You will simply get the result at a point, based on how you set your a, b, c tensors initially.
Be careful how you initialize your a, b, c:
Example 3:
a = torch.empty(1, requires_grad = True, pin_memory=True)
b = torch.empty(1, requires_grad = True, pin_memory=True)
c = torch.empty(1, requires_grad = True, pin_memory=True)
y=3*a + 2*b*b + torch.log(c)
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
print(a.grad) # tensor([3.3003])
print(b.grad) # tensor([0.])
print(c.grad) # tensor([inf])
If you use torch.empty() and don't use pin_memory=True you may have different results each time.
Also, note gradients are like accumulators so zero them when needed.
Example 4:
a = torch.tensor(1.0, requires_grad = True)
b = torch.tensor(1.0, requires_grad = True)
c = torch.tensor(1.0, requires_grad = True)
y=3*a + 2*b*b + torch.log(c)
y.backward(retain_graph=True)
y.backward()
print(a.grad) # tensor(6.)
print(b.grad) # tensor(8.)
print(c.grad) # tensor(2.)
Lastly, a few tips on the terms PyTorch uses:
PyTorch creates a dynamic computational graph during the forward pass and then uses it to calculate the gradients. This looks much like a tree.
So you will often hear that the leaves of this tree are the input tensors and the root is the output tensor.
Gradients are calculated by tracing the graph from the root to the leaves and multiplying every gradient along the way using the chain rule. This multiplying occurs in the backward pass.
Some time ago I created a PyTorch Automatic Differentiation tutorial that you may find interesting; it explains all the tiny details of AD.

Simple understanding of Orthogonal Distance Regression (ODR)

I have some data points with errors in both their x and y coordinates. I therefore want to use Python's ODR tool to compute the best-fit slope and the error on this slope. I have tried it on my actual data but did not get good results, so I have first tried ODR with a simple example as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.odr import *

def linear_func(B, x):
    return B[0]*x + B[1]

x_data = np.array([0.0, 1.0, 2.0, 3.0])
y_data = np.array([0.0, 1.0, 2.0, 3.0])
x_err = np.array([1.0, 1.0, 1.0, 1.0])
y_err = np.array([5.0, 5.0, 5.0, 5.0])

linear = Model(linear_func)
data = RealData(x_data, y_data, sx=x_err, sy=y_err)
odr = ODR(data, linear, beta0=[1.0, 0.0])
out = odr.run()
out.pprint()
The pprint() line gives:
Beta: [ 1. 0.]
Beta Std Error: [ 0. 0.]
Beta Covariance: [[ 5.20000039 -7.80000026]
[ -7.80000026 18.1999991 ]]
Residual Variance: 0.0
Inverse Condition #: 0.0315397386692
Reason(s) for Halting:
Sum of squares convergence
The resulting Beta values are shown to be 1.0 and 0.0, which I would expect. But why are the standard errors, Beta Std Error, also both zero if my errors on the data points are quite large? Can anyone offer some insight?
I see no discrepancy here. Your example model fits your data perfectly, so the weights you pass with the data do not matter. Moreover, your initial guess beta0=[1.0, 0.0] is a parameter vector giving an optimal solution, so the ODR machinery cannot find an iterative improvement of the parameters and quits after zero iterations. The associated errors are zero because, for the given data, the solution found is infinitely better than any other possible solution: your sum of squares at B=[1, 0] is zero.
To see what actually happens inside the ODR.run() function, add odr.set_iprint(init=2, iter=2, final=2) before you run the regression. In particular, the following output confirms that ODR reaches the stopping condition immediately:
--- STOPPING CONDITIONS:
INFO = 1 ==> SUM OF SQUARES CONVERGENCE.
NITER = 0 (NUMBER OF ITERATIONS)
Note that the errors will not be zero and NITER will be a positive integer if either your x_data is unequal to y_data or beta0 does not match the optimal solution. In that case, the errors returned by ODR will be nonzero, although still incredibly small.
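For instance, here is a slightly perturbed version of the example (my own numbers) where the line no longer fits exactly, so the standard errors come back nonzero:
import numpy as np
from scipy.odr import Model, RealData, ODR

def linear_func(B, x):
    return B[0] * x + B[1]

x_data = np.array([0.0, 1.0, 2.0, 3.0])
y_data = np.array([0.1, 0.9, 2.2, 2.9])   # no longer exactly on y = x
x_err = np.array([1.0, 1.0, 1.0, 1.0])
y_err = np.array([5.0, 5.0, 5.0, 5.0])

odr = ODR(RealData(x_data, y_data, sx=x_err, sy=y_err), Model(linear_func), beta0=[1.0, 0.0])
out = odr.run()
print(out.beta)     # fitted slope and intercept, close to [1, 0]
print(out.sd_beta)  # standard errors, now nonzero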