I have a list of distances vs energies and I'm trying to fit it to a Morse potential using scipy.optimize.curve_fit. The data is:
distances: [0.7, 0.78, 0.86, 0.94, 1.02, 1.1, 1.18, 1.26]
energies: [-1428.03995379, -1428.13375727, -1428.18294153, -1428.20472839,
-1428.20977469, -1428.2047732, -1428.19393863, -1428.17996123]
and the Morse potential is:
def morsePotential(r, D, alpha, r0):
    return D * (1 - np.exp(-alpha * (r - r0)))**2
When I do
param, cv = curve_fit(morsePotential, distances, energies, np.array([0.005, 10, 1.0]))
D, alpha, r0 = param
scipy says it cannot estimate the covariance and the fit produces completely ridiculous output, for example D << 0, which makes no sense for this potential, where D is the depth of the energy well. I am also having trouble fitting these data to a harmonic potential, which is even more worrying. What can I do to nudge curve_fit in the right direction? Right now it is either returning nonsense values or failing completely.

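As already pointed out by Warren Weckesser, the function y(r) = D * (1 - np.exp(-alpha * (r - r0)))**2 cannot be fitted correctly to the given data as it stands. For D > 0 it is never negative, whereas the energies all lie near -1428, so a constant offset is missing.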
Of course the best way is to find a better model from physical considerations. On the other hand, the function proposed below comes only from mathematical adjustment. This is a second best solution.
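As a minimal sketch of the offset issue (not the function proposed in this answer), the Morse form can be given an additive energy offset before fitting. The name morse_with_offset, the extra parameter E0 and the starting values are assumptions chosen for illustration; the data are the ones from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

distances = np.array([0.7, 0.78, 0.86, 0.94, 1.02, 1.1, 1.18, 1.26])
energies = np.array([-1428.03995379, -1428.13375727, -1428.18294153, -1428.20472839,
                     -1428.20977469, -1428.2047732, -1428.19393863, -1428.17996123])

def morse_with_offset(r, D, alpha, r0, E0):
    # E0 is an assumed additive offset: the energy at the bottom of the well.
    return D * (1 - np.exp(-alpha * (r - r0)))**2 + E0

# Starting values taken from the data itself (an assumption, not a rule):
# well depth ~ spread of the energies, r0 ~ location of the lowest energy.
p0 = [energies.max() - energies.min(), 2.0,
      distances[np.argmin(energies)], energies.min()]

params, cov = curve_fit(morse_with_offset, distances, energies, p0=p0)
D, alpha, r0, E0 = params
residuals = energies - morse_with_offset(distances, *params)
print(D, alpha, r0, E0)
```

Starting the optimizer close to the data in this way is usually what keeps curve_fit from wandering off to unphysical values such as D < 0.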


Implementation details of positional encoding in transformer model?

How exactly is this positional encoding calculated?
Let's assume a machine translation scenario and these are input sentences,
english_text = [this is good, this is bad]
german_text = [das ist gut, das ist schlecht]
Now our input vocabulary size is 4 and embedding dimension is 4.
#words #embeddings
this - [0.5, 0.2, 0.3, 0.1]
is - [0.1, 0.2, 0.5, 0.1]
good - [0.9, 0.7, 0.9, 0.1]
bad - [0.7, 0.3, 0.4, 0.1]
As per the transformer paper, we add each word's positional encoding to its word embedding and then pass the result to the encoder.
The paper gives a formula for calculating the positional encoding of each word.
So, this is how I think I can implement it:
import numpy as np

d_model = 4  # Embedding dimension
max_sentence_length = 3  # as per my examples above
positional_embeddings = np.zeros((max_sentence_length, d_model))
for position in range(max_sentence_length):
    for i in range(0, d_model, 2):
        positional_embeddings[position, i] = (
            np.sin(position / (10000 ** ((2 * i) / d_model)))
        )
        positional_embeddings[position, i + 1] = (
            np.cos(position / (10000 ** ((2 * (i + 1)) / d_model)))
        )
Then, the new embedding vector will be
[[0.5, 0.2, 0.3, 0.1],
[0.1, 0.2, 0.5, 0.1],
[0.9, 0.7, 0.9, 0.1]] + positional_embeddings = NEW EMBEDDINGS
## shapes
3 x 4 + 3 x 4 = 3 x 4
Is this how the calculation is carried out in the implementation? Please correct me if there is any mistake in my pseudo implementation above.
If everything is correct, then I have three doubts that I hope someone can clear up:
1) In the above implementation we use the sin formula for even dimensions and the cos formula for odd dimensions, but I couldn't understand the reason behind it. I read that it makes use of cyclic properties, but I couldn't understand that.
2) Is there a reason behind choosing 10000/(2i/d) or 10000/(2i+1/d) as the scaling factor in the formula?
3) Not every sentence will be as long as the max sentence length, so we might have to pad the sentences. Do we also calculate positional encodings for the padding tokens?
Your implementation is basically correct. The typical implementation pre-computes the embedding matrix, makes a non-trainable embedding layer from it, and does an embedding lookup of a range of positions. See e.g. the implementation in HuggingFace's Transformers.
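The pre-compute-and-look-up idea can be sketched in plain NumPy as follows. This is only an illustration, not the HuggingFace code; the function name sinusoidal_table is an assumption. It follows the formula from the paper, where the sine and cosine of one pair share the same exponent 2i/d_model and i counts pairs, which differs slightly from the loop in the question, where i already steps by 2.

```python
import numpy as np

def sinusoidal_table(max_len, d_model):
    """Precompute a (max_len, d_model) table; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]        # shape (max_len, 1)
    pair_index = np.arange(d_model // 2)[None, :]  # the "i" of the paper, one per sin/cos pair
    angles = positions / (10000 ** (2 * pair_index / d_model))
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions
    table[:, 1::2] = np.cos(angles)  # odd dimensions
    return table

# Word embeddings for "this is good" from the question.
word_embeddings = np.array([[0.5, 0.2, 0.3, 0.1],
                            [0.1, 0.2, 0.5, 0.1],
                            [0.9, 0.7, 0.9, 0.1]])

table = sinusoidal_table(max_len=3, d_model=4)
position_ids = np.arange(word_embeddings.shape[0])     # the "range" that is looked up
new_embeddings = word_embeddings + table[position_ids]  # shape (3, 4)
print(new_embeddings)
```

Because the table depends only on max_len and d_model, it can be computed once and reused for every batch.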
Some hints about the intuition behind the equations are in these threads:
on CrossValidated
on Reddit
But it seems to me that pretty much all decisions about the position encoding were empirical choices.
By cyclic properties, they IMHO mean that, for a given dimension of the embedding, the difference of the embedding values between positions with a constant offset is the same regardless of the position in the sequence. For that, using only sine or only cosine might be enough, but some positions would then have a much larger norm than the others, so they alternate sine and cosine.
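The remark about norms can be checked numerically. The sketch below assumes d_model = 8 and 50 positions purely for illustration: with sine only, the norm of the encoding varies from position to position (it is zero at position 0), whereas pairing each sine with a cosine of the same angle gives every position the same norm, since sin^2 + cos^2 = 1 for each pair.

```python
import numpy as np

d_model = 8                       # assumed size, for illustration only
positions = np.arange(50)[:, None]
freqs = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
angles = positions * freqs        # shape (50, d_model / 2)

sin_only = np.sin(angles)
sin_cos = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

sin_only_norms = np.linalg.norm(sin_only, axis=1)
sin_cos_norms = np.linalg.norm(sin_cos, axis=1)

print(sin_only_norms.min(), sin_only_norms.max())  # varies with position
print(sin_cos_norms.min(), sin_cos_norms.max())    # constant: sqrt(d_model / 2)
```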
I think the scaling factors are empirically estimated to cover the usual length of sentences.
With padding, you do indeed also consider the positional encoding of the padded positions, but since they are pre-computed, it does not mean a higher computational load, because you get the embeddings for the padding symbols anyway.

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The LogLikelihood given by fitnlm in Matlab is in fact the negative log-likelihood. (Note that this would consequently also distort the BIC and AIC computed by Matlab.)
Reasoning: in [1], the same problem is solved through two different approaches: the ML approach, which defines the negative log-likelihood and minimizes it with fminsearch, and the GLS approach, which uses fitnlm.
The negative LogLikelihood after the ML-approach is: 380
The negative LogLikelihood after the GLS-approach is: -406
I imagine the second one should at least be multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = @(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = @(beta,x) 1./(1 + exp(-myf(beta,x)));
x = linspace(-1,1,200)';
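beta = [10;2];
beta0 = [3;3]; % starting values (assumed here; beta0 is used below but not defined in the snippet)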
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = @(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = @(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Please also note that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
This is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
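The point about discrete probabilities versus continuous densities can be illustrated with a small Python sketch (SciPy instead of MATLAB). The values n = 50, p = 0.3, the sample size of 200 and the seed are assumptions for illustration and are not taken from the example above.

```python
import numpy as np
from scipy.stats import binom, norm

rng = np.random.default_rng(0)
n, p = 50, 0.3                      # assumed values, for illustration only
z = rng.binomial(n, p, size=200)    # counts
y = z / n                           # observed proportions

# Discrete probabilities are at most 1, so this log-likelihood is negative.
binom_loglik = binom.logpmf(z, n, p).sum()

# Densities of a narrow normal distribution exceed 1 near the mean,
# so the summed log-density can come out positive.
sigma = np.sqrt(p * (1 - p) / n)
normal_loglik = norm.logpdf(y, loc=p, scale=sigma).sum()

# Peak density for the residual standard deviation quoted above.
peak_density = norm.pdf(0.0, scale=0.0462)

print(binom_loglik, normal_loglik, peak_density)
```

Each proportion y sits on a grid of width 1/n, so multiplying a density by 1/n (subtracting log(n) per observation in the log-likelihood) is roughly the conversion to comparable probabilities that the last sentence describes.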

scipy integrate.quad return an incorrect value

I use scipy integrate.quad to calculate the cdf of a normal distribution:
import math
import numpy as np
from scipy import integrate

def nor(delta, mu, x):
    return 1 / (math.sqrt(2 * math.pi) * delta) * np.exp(-np.square(x - mu) / (2 * np.square(delta)))

delta = 0.1
mu = 0
t = np.arange(4.0, 10.0, 1)
nor_int = lambda t: integrate.quad(lambda x: nor(delta, mu, x), -np.inf, t)
nor_int_vec = np.vectorize(nor_int)
s = nor_int_vec(t)
for i in zip(s[0], s[1]):
    print(i)
It prints the following:
(1.0000000000000002, 1.2506543424265854e-08)
(1.9563704110140217e-11, 3.5403445591955275e-11)
(1.0000000000001916, 1.2616577562700088e-08)
(1.0842532749783998e-34, 1.9621183122960244e-34)
(4.234531567162006e-09, 7.753407284370446e-09)
(1.0000000000001334, 1.757986959115912e-10)
For some x it returns a value close to zero, when it should return 1.
Can somebody tell me what is wrong?
Same reason as in why does quad return both zeros when integrating a simple Gaussian pdf at a very small variance? but seeing as I can't mark it as a duplicate, here goes:
You are integrating a function with tight localization (at scale delta) over a very large (in fact infinite) interval. The integration routine can simply miss the part of the interval where the function is substantially different from 0, judging it to be 0 instead. Some guidance is required. The parameter points can be used to this effect (see the linked question) but since quad over an infinite interval does not support it, the interval has to be manually split, like so:
for t in range(4, 10):
    int1 = integrate.quad(lambda x: nor(delta, mu, x), -np.inf, mu - 10*delta)[0]
    int2 = integrate.quad(lambda x: nor(delta, mu, x), mu - 10*delta, t)[0]
    print(int1 + int2)
This prints 1 or nearly 1 every time. I picked mu-10*delta as a point to split on, figuring most of the function lies to the right of it, no matter what mu and delta are.
Use np.sqrt etc.; there is usually no reason to put math functions in NumPy code. The NumPy versions are available and are vectorized.
Applying np.vectorize to quad is not doing anything besides making the code longer and slightly harder to read. Use a normal Python loop or list comprehension. See NumPy vectorization with integration
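Putting these remarks together, a possible sketch looks like this. The names normal_pdf and normal_cdf_quad are made up for illustration, and splitting on both sides of the peak (at mu - 10*sigma and mu + 10*sigma) is an extra assumption beyond the single split shown above. scipy.stats.norm.cdf is used only as a reference value.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    return np.exp(-np.square(x - mu) / (2 * np.square(sigma))) / (np.sqrt(2 * np.pi) * sigma)

def normal_cdf_quad(t, mu, sigma, n_sigma=10):
    # Split the range so that the narrow peak sits in its own finite interval.
    lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
    edges = [-np.inf] + [e for e in (lo, hi) if e < t] + [t]
    return sum(integrate.quad(normal_pdf, a, b, args=(mu, sigma))[0]
               for a, b in zip(edges[:-1], edges[1:]))

mu, sigma = 0.0, 0.1
ts = [0.05, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
results = [normal_cdf_quad(t, mu, sigma) for t in ts]  # plain list comprehension
for t, value in zip(ts, results):
    print(t, value, norm.cdf(t, loc=mu, scale=sigma))
```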

Optimization with Unknown Number of Variables

Since the original problem is more complicated, the idea is described using a simple example below.
For example, suppose we want to put several router antennas somewhere in a room so that a cellphone gets the strongest signal on the table (received power > Pmax) and the weakest signal on the bed (received power < Pmin). What is the minimum number of antennas that should be used, and where should they be placed, to achieve this goal?
SIGNAL_STRENGTH depends on the variables (x, y, z) and on the number of variables, i.e. the locations and the number of antennas.
Besides, assume
PREDICTION = f((x1, y1, z1), (x2, y2, z2), ... (xi, yi, zi), ... (xn, yn, zn))
where n and (xi, yi, zi) are to be optimized. The goal is to minimize
cost function = ||SIGNAL_STRENGTH - PREDICTION||
I tried to use a GA with mixed integer programming in Matlab to implement that. Two optimization functions are used: the outer function optimizes n, and the inner function optimizes (x, y, z) for a given n. This method is slow, and I haven't seen a single result from it so far.
Does anyone have a more efficient way to solve this problem? Any suggestion is appreciated. Thanks in advance.
Terminology | Problem Definition
An antenna is sending at position a in R^3 with constant power. Its signal strength can be measured by some S: R^3 -> R, where S has a single maximum S_0 at a and the set where S(x) > const is simply connected, e.g. S(x) = S_0 * exp(-const * (x-a)^2).
Given a set of antennas A the resulting signal strength is the maximum of a single antenna
S_A(x) = max{S_a(x) : for all a in A} ,
which means we 'lock' on the strongest antenna, which is what cell phones do.
Let K = R^3 x R denote a space of points (position, intensity). Now consider two finite subsets POI_min and POI_max of K. We want to find the set A with the minimal number of antennas (|A| -> min.) that satisfies
for all (x,w) in POI_min : S_A(x) < w and for all (x,w) in POI_max : S_A(x) > w .
As S(x) > const is simply connected there has to be an antenna in a sphere around the position of each element (x,w) in POI_max with radius r = max{||xi - x|| : for all xi in S(xi) = w}. Which means that if we would put an antenna at the position of (x,w), then the furthest we can go away from x and still have signal strength w is the radius r within which an actual antenna has to be positioned.
With a similar argumentation for POI_min it follows that there is no antenna within r = min{||xi - x|| : for all xi in S(xi) = w}.
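For the Gaussian example S(x) = S_0 * exp(-const * (x-a)^2), this radius has a closed form: setting S = w and solving for the distance gives r = sqrt(ln(S_0 / w) / const). A small sketch, where the function name coverage_radius and the numbers in the usage lines are assumptions for illustration:

```python
import math

def coverage_radius(s0, decay, w):
    """Distance at which S_0 * exp(-decay * r**2) drops to the level w."""
    if not 0 < w <= s0:
        raise ValueError("w must lie in (0, S_0] for a radius to exist")
    return math.sqrt(math.log(s0 / w) / decay)

# Hypothetical numbers: peak strength 10, decay constant 0.5, required level 2.
r = coverage_radius(s0=10.0, decay=0.5, w=2.0)
print(r)
```

The same formula gives the radius for a POI_max point (an antenna must lie within r) and for a POI_min point (no antenna may lie within r).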
Instead of solving a nonlinear optimization task we can intersect spheres to obtain the optimal solution. If k spheres around the POI_max positions intersect, we can place a single antenna in the intersection, reducing the amount of antennas needed by k-1.
However each antenna that is placed must satisfy all constraints given by the elements of POI_min. Assuming that antennas are omnidirectional and thus orientation of an antenna doesn't matter we can do (pseudocode):
min_sphere = {(x_i,r_i) : from POI_min}
spheres_to_cover = {(x_i,r_i) : from POI_max}
A = {}
while not is_empty(spheres_to_cover)
    power_set_score = struct // holds score, k
    PS <- construct power set of spheres_to_cover
    for i = 1:number_of_elements(PS)
        k = PS[i]
        if intersection(k) \ min_sphere is not empty
            power_set_score[i].score = |k|
        else
            power_set_score[i].score = 0
        end if
        power_set_score[i].k = k
    end for
    sort(power_set_score) // sort by score, biggest first
    A <- add arbitrary point in (intersection(power_set_score[1].k) \ min_sphere)
    spheres_to_cover = spheres_to_cover \ power_set_score[1].k
end while
On the other hand you have just given an example problem and thus this solution may not be applicable or broad enough for your case. I did make a few assumptions. So being more specific in the question might give you an even better answer.
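A runnable variant of the pseudocode could look like the sketch below. It swaps in a simpler technique: instead of enumerating the power set and intersecting spheres exactly, it tests a finite set of candidate points (here a grid) and greedily picks the candidate that covers the most remaining POI_max spheres while lying outside all POI_min spheres. The function name, the grid and the example spheres are all assumptions, and a greedy choice is not guaranteed to give the minimum number of antennas.

```python
import numpy as np

def greedy_antenna_placement(max_spheres, min_spheres, candidates):
    """max_spheres / min_spheres: lists of (center, radius); candidates: (M, 3) points."""
    candidates = np.asarray(candidates, dtype=float)
    if not max_spheres:
        return []
    # A candidate is feasible if it lies outside every exclusion sphere (POI_min).
    feasible = np.ones(len(candidates), dtype=bool)
    for center, radius in min_spheres:
        feasible &= np.linalg.norm(candidates - np.asarray(center), axis=1) >= radius
    # covers[j, i] is True if feasible candidate j lies inside sphere i (POI_max).
    covers = np.stack(
        [np.linalg.norm(candidates - np.asarray(c), axis=1) <= r for c, r in max_spheres],
        axis=1,
    )
    covers &= feasible[:, None]
    uncovered = np.ones(len(max_spheres), dtype=bool)
    antennas = []
    while uncovered.any():
        gain = (covers & uncovered).sum(axis=1)  # newly covered spheres per candidate
        best = int(np.argmax(gain))
        if gain[best] == 0:
            raise ValueError("no candidate point covers the remaining spheres")
        antennas.append(candidates[best])
        uncovered &= ~covers[best]
    return antennas

# Hypothetical example: two overlapping spheres, one far away, one exclusion sphere.
max_spheres = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0), ((10.0, 0.0, 0.0), 1.0)]
min_spheres = [((0.75, 0.0, 0.0), 0.2)]
xs, ys = np.meshgrid(np.linspace(-1, 11, 25), np.linspace(-1, 1, 5))
grid = np.column_stack([xs.ravel(), ys.ravel(), np.zeros(xs.size)])

antennas = greedy_antenna_placement(max_spheres, min_spheres, grid)
print(len(antennas), antennas)
```

A finer grid trades run time for a better chance of hitting small intersection regions.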

Calculate posterior distribution of unknown mis-classification with PRTools in MATLAB

I'm using the PRTools MATLAB library to train some classifiers, generating test data and testing the classifiers.
I have the following details:
N: Total # of test examples
k: # of mis-classifications for each classifier and class
I want to do:
Calculate and plot Bayesian posterior distributions of the unknown probabilities of mis-classification (denoted q), that is, as probability density functions over q itself (so, P(q) will be plotted over q, from 0 to 1).
I have that (math formulae, not matlab code!):
Posterior = Likelihood * Prior / Normalization constant =
P(q|k,N) = P(k|q,N) * P(q|N) / P(k|N)
The prior is set to 1, so I only need to calculate the likelihood and normalization constant.
I know that the likelihood can be expressed as (where B(N,k) is the binomial coefficient):
P(k|q,N) = B(N,k) * q^k * (1-q)^(N-k)
... so the normalization constant is simply the integral of the likelihood above (times the prior, which is 1), from 0 to 1:
P(k|N) = B(N,k) * integralFromZeroToOne( q^k * (1-q)^(N-k) )
(The Binomial coefficient ( B(N,k) ) can be omitted though as it appears in both the likelihood and normalization constant)
Now, I've heard that the integral for the normalization constant should be able to be calculated as a series ... something like:
k!(N-k)! / (N+1)!
Is that correct? (I have some lecture notes with this series, but can't figure out if it is for the normalization constant integral, or for the overall distribution of mis-classification (q))
Also, hints are welcome on how to calculate this in practice (factorials easily create truncation errors, right?) and on how to compute the final plot (the posterior distribution over q, from 0 to 1).
I really haven't done much with Bayesian posterior distributions ( and not for a while), but I'll try to help with what you've given. First,
k!(N-k)! / (N+1)! = 1 / (B(N,k) * (N + 1))
and you can calculate the binomial coefficients in Matlab with nchoosek() though it does say in the docs that there can be accuracy problems for large coefficients. How big are N and k?
Second, according to Mathematica,
integralFromZeroToOne( q^k * (1-q)^(N-k) ) = pi * csc((k-N)*pi) * Gamma(1+k)/(Gamma(k-N) * Gamma(2+N))
where csc() is the cosecant function and Gamma() is the gamma function. However, Gamma(x) = (x-1)! which we'll use in a moment. The problem is that we have a function Gamma(k-N) on the bottom and k-N will be negative. However, the reflection formula will help us with that so that we end up with:
= (N-k)! * k! / (N+1)!
Apparently, your notes were correct.
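To sidestep the overflow and accuracy worries with large factorials, the same quantity can be evaluated through log-gamma functions and only exponentiated at the end, if at all. A minimal sketch in Python rather than MATLAB; the function name and the test values are assumptions for illustration:

```python
from math import lgamma, exp

def log_normalization(k, N):
    """log of k! * (N-k)! / (N+1)!, using Gamma(x + 1) = x!."""
    return lgamma(k + 1) + lgamma(N - k + 1) - lgamma(N + 2)

# Small case that can be checked by hand: 2! * 3! / 6! = 12 / 720 = 1 / 60.
small = exp(log_normalization(2, 5))

# A large case stays finite on the log scale even though the factorials would overflow.
large_log = log_normalization(300, 10000)
print(small, large_log)
```

MATLAB offers gammaln for the same purpose.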
Let q be the probability of mis-classification. Then the probability that you would observe k mis-classifications in N runs is given by:
P(k|N,q) = B(N,k) q^k (1-q)^(N-k)
You need to then assume a suitable prior for q which is bounded between 0 and 1. A conjugate prior for the above is the beta distribution. If q ~ Beta(a,b) then the posterior is also a Beta distribution. For your info the posterior is:
f(q|-) ~ Beta(a+k,b+N-k)
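With a uniform prior (a = b = 1) this becomes Beta(k+1, N-k+1), whose density can be evaluated directly on a grid of q values for plotting. A sketch in Python with SciPy; the function name and the values k = 7 and N = 100 are made up for illustration:

```python
import numpy as np
from scipy.stats import beta

def misclassification_posterior(k, N, a=1.0, b=1.0, num=1001):
    """Posterior density of q on a grid over [0, 1] for a Beta(a, b) prior."""
    q = np.linspace(0.0, 1.0, num)
    pdf = beta.pdf(q, a + k, b + N - k)
    return q, pdf

# Hypothetical counts: 7 mis-classifications out of 100 test examples.
q, pdf = misclassification_posterior(k=7, N=100)
area = np.sum(pdf) * (q[1] - q[0])  # should be close to 1
mode = q[np.argmax(pdf)]            # k / N for the uniform prior
print(area, mode)
```

The arrays q and pdf can then be handed to any plotting routine, one curve per classifier and class.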
Hope that helps.