Using SciPy's quad to get the principal value of an integral by integrating to just below and from just above the singular point

I am trying to compute the principal value of an integral (over s) of 1/((s - q02)*(s - q2)) on [Ecut, inf] with q02 < Ecut < q2. Computing the principal value by hand (or with Mathematica), one obtains the general result
ln((q2 - Ecut)/(Ecut - q02)) / (q02 - q2)
In the specific example below this gives the result -1.58637e-11. One should also be able to get the same result by splitting the integral in two, integrating up to q2 - eps and then starting from q2 + eps, and adding the two results (the divergences should cancel). By taking eps smaller and smaller one should recover the value above. When I implement this in scipy using quad, however, my result converges to the wrong value 6.04685e-11, as shown in the plot of eps vs. integral result below.
Why is quad doing this? Even if I set eps = 0 it gives me this wrong result, when I would expect it to raise an error as the integrand blows up...
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad

q02 = 485124412.
Ecut = 17909665929.
q2 = 90000000000.

def integrand(s):
    return 1 / ((s - q02) * (s - q2))

xx = [1., 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001,
      0.00000001, 0.000000001, 0.0000000001, 0.00000000001, 0.]
integral = [0 * y for y in xx]
i = 0
for eps in xx:
    ans1, err = quad(integrand, Ecut, q2 - eps)
    ans2, err = quad(integrand, q2 + eps, np.inf)
    integral[i] = ans1 + ans2
    i = i + 1

plt.semilogx(xx, integral, marker='.')
plt.show()

One should also be able to get the same result by splitting the integral in two, integrating up to q2 - eps and then starting from q2 + eps, and then adding the two results
Only if the computations were perfectly accurate. In numerical practice, what you describe is basically the worst thing one could do: you get two large integrals of opposite signs that very nearly cancel when added, and what is left has more to do with the errors of integration than with the actual value of the integral.
I notice you disregarded the error values err in your script, without even printing them out. Bad idea: they are of size 1e-10, which would already tell you that a final result of "something e-11" is junk.
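For illustration, a sketch of one step of your loop (reusing integrand, Ecut and q2 from the question's script) with the error estimates printed:

eps = 1.0
ans1, err1 = quad(integrand, Ecut, q2 - eps)
ans2, err2 = quad(integrand, q2 + eps, np.inf)
print(err1, err2)    # error estimates of order 1e-10
print(ans1 + ans2)   # the near-cancelling sum is of comparable size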
The Computational Science question Numerical Principal Value Integration - Hilbert like addresses this issue. One of the approaches they indicate is to add the values of the integrand at the points symmetric about the singularity, before trying to integrate it. This requires taking the integral over a symmetric interval centered at the singularity q2 (that is, from Ecut to 2*q2-Ecut), and then adding the contribution of the integral from 2*q2-Ecut to infinity. This split makes sense anyway, because quad treats infinite limits very differently (using Fourier integration), which is yet another thing that will affect the way the singularity cancels out.
So, an implementation of this approach would be
ans1, err = quad(lambda s: integrand(s) + integrand(2*q2-s), Ecut, q2)
ans2, err = quad(integrand, 2*q2-Ecut, np.inf)
No eps is needed. However, the result is still off: it's about -2.5e-11. It turns out the second integral is the culprit. Unfortunately, the Fourier-integral approach doesn't seem to be effective here (or I didn't find a way to make it work). Providing a large but finite value as the upper limit leads to a better result, especially if the option epsabs is also used, e.g. epsabs=1e-20.
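A sketch of that variant (the cutoff 1e9*q2 is just a large finite number; it is the one used again below):

ans1, err1 = quad(lambda s: integrand(s) + integrand(2*q2 - s), Ecut, q2)
ans2, err2 = quad(integrand, 2*q2 - Ecut, 1e9*q2, epsabs=1e-20)
print(ans1 + ans2)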
Better yet, read the documentation of quad extra carefully and notice that it directly supports integrals with Cauchy weight 1/(s-q2), choosing an appropriate numerical method for them. This still requires a finite upper limit, and a small value of epsabs, but the result is pretty accurate:
quad(lambda s: 1/(s - q02), Ecut, 1e9*q2, weight='cauchy', wvar=q2, epsabs=1e-20)
returns -1.5863735715967363e-11, compared to exact value -1.5863735704856253e-11. Notice that the factor 1/(s-q2) does not appear in the integrand above, being relegated to the weight options.
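For reference, the exact value quoted here is just the closed form from the top of the question:

import numpy as np

q02 = 485124412.
Ecut = 17909665929.
q2 = 90000000000.

exact = np.log((q2 - Ecut)/(Ecut - q02))/(q02 - q2)
print(exact)   # ≈ -1.5863735704856e-11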

Fitting data with Voigt profile in lmfit in python - huge errors

I am trying to fit some RIXS data with Voigt profiles (lmfit in Python), and I have defined the Voigt profile in the following way:
def gfunction_norm(x, pos, gwid):
    # normalized Gaussian, rescaled to run from 0 to 1
    gauss = (1/(gwid*np.sqrt(2*np.pi))) * np.exp(-0.5*((x - pos)/gwid)**2)
    return (gauss - gauss.min())/(gauss.max() - gauss.min())

def lfunction_norm(x, pos, lwid):
    # Lorentzian, rescaled to run from 0 to 1 (0.15915 = 1/(2*pi))
    lorentz = (0.15915*lwid)/((x - pos)**2 + 0.25*lwid**2)
    return (lorentz - lorentz.min())/(lorentz.max() - lorentz.min())

def voigt(x, pos, gwid, lwid, int):
    step = 0.005
    x2 = np.arange(pos - 7, pos + 7 + step, step)
    voigt3 = np.convolve(gfunction_norm(x2, pos, gwid),
                         lfunction_norm(x2, pos, lwid), mode='same')
    norm = (voigt3 - voigt3.min())/(voigt3.max() - voigt3.min())
    y = np.interp(energy, x2, norm)  # 'energy' is the global energy grid of the spectra
    return y * int
I have used this definition instead of the popular Voigt profile definition in Python:
def voigt(x, alpha, cen, gamma):
    sigma = alpha/np.sqrt(2*np.log(2))
    # 2.51 approximates sqrt(2*pi)
    return np.real(wofz((x - cen + 1j*gamma)/sigma/np.sqrt(2)))/(sigma*2.51)
because it gives me more clarity on the intensity of the peaks etc.
Now, I have a couple of spectra with 9-10 peaks and I am trying to fit all of them with Voigt profiles (precisely in the way I defined it).
I have a couple of questions:
Do you think my Voigt definition is OK? What (dis)advantages do I have by using the convolution instead of the approximative second definition?
As a result of my fit, sometimes I get crazy large standard deviations. For example, these are best-fit parameters for one of the peaks:
int8: 0.00986265 +/- 0.00113104 (11.47%) (init = 0.05)
pos8: -2.57960013 +/- 0.00790640 (0.31%) (init = -2.6)
gwid8: 0.06613237 +/- 0.02558441 (38.69%) (init = 0.1)
lwid8: 1.0909e-04 +/- 1.48706395 (1363160.91%) (init = 0.001)
(intensity, position, gaussian and lorentzian width respectively).
Does this output mean that this peak should be purely gaussian?
I have noticed that large errors usually happen when the best-fit parameter is very small. Does this have something to do with the Levenberg-Marquardt algorithm that is used by default in lmfit? I should note that I sometimes have the same problem even when I use the other definition of the Voigt profile (and not just for Lorentzian widths).
Here is a part of the code (it is a part of a bigger code and it is in a for loop, meaning I use same initial values for multiple similar spectra):
model = Model(final)
result = model.fit(spectra[:,nb_spectra], params, x=energy)
print(result.fit_report())
"final" is the sum of many voigt profiles that I previously defined.
Thank you!
This seems a duplicate of, or follow-up to, Lmfit fit produces huge uncertainties - please use one SO question per topic.
Do you think my Voigt definition is OK? What (dis)advantages do I have by using the convolution instead of the approximative second definition?
What makes you say that the second definition is approximate? In some sense, all floating-point calculations are approximate, but the Faddeeva function from scipy.special.wofz is the analytic solution for the Voigt profile. Doing the convolution yourself is likely to be a bit slower and is also an approximation (at the machine-precision level).
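As a quick check (a sketch, not from the original post): with the exact sqrt(2*pi) ≈ 2.5066 normalization, which the 2.51 in your version approximates, the wofz-based form integrates to one, i.e. it is the unit-area Voigt profile:

import numpy as np
from scipy.special import wofz
from scipy.integrate import quad

def voigt_wofz(x, sigma, cen, gamma):
    # standard Voigt profile: Re[w(z)]/(sigma*sqrt(2*pi))
    z = (x - cen + 1j*gamma)/(sigma*np.sqrt(2))
    return np.real(wofz(z))/(sigma*np.sqrt(2*np.pi))

area, err = quad(voigt_wofz, -np.inf, np.inf, args=(1.0, 0.0, 0.5))
print(area)   # ≈ 1.0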
Since you are using Lmfit, I would recommend using its VoigtModel which will make your life easier: it uses scipy.special.wofz and parameter names that make it easy to switch to other profiles (say, GaussianModel).
You did not give a very complete example of code (for reference, a minimal, working version of actual code is more or less expected on SO and highly recommended), but that might look something like
from lmfit.models import VoigtModel
model = VoigtModel(prefix='p1_') + VoigtModel(prefix='p2_') + ...
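A slightly fuller, still hypothetical sketch, with energy and spectrum as placeholder names for your data arrays:

from lmfit.models import VoigtModel

# one Voigt component per peak, summed into a composite model
model = VoigtModel(prefix='p1_')
for i in range(2, 10):
    model = model + VoigtModel(prefix=f'p{i}_')

params = model.make_params()
params['p1_center'].set(value=-2.6)   # initial guesses, peak by peak
# result = model.fit(spectrum, params, x=energy)
# print(result.fit_report())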
As a result of my fit, sometimes I get crazy large standard deviations. For example, these are best-fit parameters for one of the peaks:
int8: 0.00986265 +/- 0.00113104 (11.47%) (init = 0.05)
pos8: -2.57960013 +/- 0.00790640 (0.31%) (init = -2.6)
gwid8: 0.06613237 +/- 0.02558441 (38.69%) (init = 0.1)
lwid8: 1.0909e-04 +/- 1.48706395 (1363160.91%) (init = 0.001)
(intensity, position, gaussian and lorentzian width respectively).
Does this output mean that this peak should be purely gaussian?
First, that may not be a "crazy large" standard deviation - it depends on the data and the rest of the fit. Perhaps the value for int8 is really, really small and heavily overlapped with other peaks - it might be highly correlated with other variables. But it may very well mean that the peak is more Gaussian-like.
Since you are analyzing X-ray scattering data, the use of a Voigt function is probably partially justified with the idea (assertion, assumption, expectation?) that the material response would give a Gaussian profile, while the instrumentation (including X-ray source) would give a Lorentzian broadening. That suggests that the Lorentzian width might be the same for the various peaks, or perhaps parameterized as a simple function of the incident and scattering wavelengths or q values. That is, you might be able to (and may be better off) constrain the values of the Lorentzian width (your lwid, I think, or gamma in lmfit.models.VoigtModel) to all be the same.
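A sketch of such a constraint in lmfit (assuming the prefixed VoigtModels from above; note that VoigtModel ties gamma to sigma by default, so it must be freed first):

params = model.make_params()
params['p1_gamma'].set(vary=True, expr='')      # let the first peak's gamma vary freely
for i in range(2, 10):
    params[f'p{i}_gamma'].set(expr='p1_gamma')  # all peaks share one Lorentzian width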

Matrix Multiplication Simplification not giving same results in code

I want to evaluate the expression x^T*A*x - 2*y^T*A*x + y^T*A*y in Matlab.
If my calculations are correct, this should be equivalent to computing (x-y)^T*A*(x-y).
My main problem right now is that I am not getting the same results from the above two expressions. I'm wondering if it's a precision error or a conceptual error.
I tried a very simple example with some random values to check my math was right. The mini example I used is as follows
A = [1,2,0;2,1,2;0,2,1];
x = [1;2;3];
y = [4;4;4];
(transpose(x)*A*x)-(2*transpose(y)*A*x)+(transpose(y)*A*(y))
transpose(x-y)*A*(x-y)
Both give 46.
I then tried a more realistic example of values for what I'm doing and it failed. For example
A = [2.66666666666667,-0.333333333333333,0,-0.333333333333333,-0.333333333333333,0,0,0,0;
-0.333333333333333,2.66666666666667,-0.333333333333333,-0.333333333333333,-0.333333333333333,-0.333333333333333,0,0,0;
0,-0.333333333333333,2.66666666666667,0,-0.333333333333333,-0.333333333333333,0,0,0;
-0.333333333333333,-0.333333333333333,0,2.66666666666667,-0.333333333333333,0,-0.333333333333333,-0.333333333333333,0;
-0.333333333333333,-0.333333333333333,-0.333333333333333,-0.333333333333333,2.66666666666667,-0.333333333333333,-0.333333333333333,-0.333333333333333,-0.333333333333333;
0,-0.333333333333333,-0.333333333333333,0,-0.333333333333333,2.66666666666667,0,-0.333333333333333,-0.333333333333333;
0,0,0,-0.333333333333333,-0.333333333333333,0,2.66666666666667,-0.333333333333333,0;
0,0,0,-0.333333333333333,-0.333333333333333,-0.333333333333333,-0.333333333333333,2.66666666666667,-0.333333333333333;
0,0,0,0,-0.333333333333333,-0.333333333333333,0,-0.333333333333333,2.66666666666667];
x =[1.21585420370805;
1.00388159497757e-16;
-0.405284734569351;
1.36809776609658e-16;
-1.04796659533634e-17;
-7.52459042423650e-17;
-0.607927101854027;
-8.49163704356314e-17;
0.303963550927013];
v =[0.0319067068305797,0.00786616506360615,0.0709811622828447,0.0719615328671117;
1.26150800194897e-17,5.77584497721720e-18,7.89740111567879e-18,7.14396333930938e-18;
-0.158358815125228,-0.876275686098803,0.0539216716399594,0.0450616819309899;
7.90937837037453e-18,3.24196519177793e-18,3.99402664932776e-18,4.17486202509670e-18;
5.35533279761622e-18,-8.91723780863019e-19,-1.56128480212603e-18,1.84423677629470e-19;
-2.18478057810330e-17,-6.63779320738873e-18,-3.21099714760257e-18,-3.93612240449303e-18;
-0.0213963226092681,-0.0168256143048771,-0.0175695110350900,-0.0128155908603601;
-4.06029341772399e-18,-5.65705978843172e-18,-1.80182480882056e-18,-1.59281757789645e-18;
0.221482525259710,-0.0576644539359728,0.0163934384910353,0.0197432976432437];
u = [1.37058079022968;
1.79302486419321;
69.4330887239959;
-52.3949662214410];
y = v*u;
This gives 0 for the first expression and 7.1387e-28 for the second. Is this a precision issue? Which version is better/more accurate to use, and why? Thank you!
I have not checked your simplification, but floating-point math is in general inexact. Even doing the same operations in a different order can give different results. Deviations of the size you are seeing could reasonably come from floating-point roundoff; it is up to you to decide whether they are acceptable for your purpose.
To determine which version is more accurate you would have to compute reference values to compare to, which might prove difficult.
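For what it's worth, the simplification itself requires A to be symmetric (so that y^T*A*x = x^T*A*y), which your A is. Even then, the two evaluations differ at roundoff level, as this small NumPy sketch with made-up data (not your matrices) shows:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((9, 9))
A = (A + A.T)/2                  # symmetric, like the matrix in the question
x = rng.standard_normal(9)
y = rng.standard_normal(9)

expanded = x @ A @ x - 2*(y @ A @ x) + y @ A @ y
factored = (x - y) @ A @ (x - y)
print(expanded - factored)       # tiny but generally nonzero: pure roundoff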

Indefinite integration with Matlab's Symbolic Toolbox - complex solution

I'm using Matlab 2014b. I've tried:
clear all
syms x real
assumeAlso(x>=5)
This returned:
ans =
[ 5 <= x, in(x, 'real')]
Then I tried:
int(sqrt(x^2-25)/x,x)
But this still returned a complex answer:
(x^2 - 25)^(1/2) - log(((x^2 - 25)^(1/2) + 5*i)/x)*5*i
I tried the simplify command, but it still returned a complex answer. Now, this might be fixed in the latest version of Matlab. If so, can people let me know, or offer a suggestion for getting the real answer?
The hand-calculated answer is sqrt(x^2-25)-5*asec(x/5)+C.
This behavior is present in R2017b, though when converted to floating point the imaginary components are different.
Why does this occur?
This occurs because Matlab's int function returns the full general solution when you ask for the indefinite integral. This solution is valid over the entire domain of real values, including your restricted domain of x>=5.
With a bit of math you can show that the solution is always real for x>=5 (see complex logarithm). Or you can use more symbolic math via the isAlways function to show this:
syms x real
assume(x>=5)
y = int(sqrt(x^2-25)/x, x)
isAlways(imag(y)==0)
This returns true (logical 1). Unfortunately, Matlab's simplification routines appear unable to reduce this expression when assumptions are included. You might also submit this case to The MathWorks as a service request in case they'd consider improving the simplification for this and similar equations.
How can this be "fixed"?
If you want to get rid of the zero-valued imaginary part of the solution you can use sym/real:
real(y)
which returns 5*atan2(5, (x^2-25)^(1/2)) + (x^2-25)^(1/2).
Also, as @SardarUsama points out, when the full solution is converted to floating point (or variable precision) there will sometimes be numeric imprecision in the conversion from exact symbolic form. Using the symbolic real form above should avoid this.
The answer is not really complex.
Take a look at this:
clear all; %To clear the conditions of x as real and >=5 (simple clear doesn't clear that)
syms x;
y = int(sqrt(x^2-25)/x, x)
which, as we know, gives:
y =
(x^2 - 25)^(1/2) - log(((x^2 - 25)^(1/2) + 5i)/x)*5i
Now put some real values of x≥5 to check what result it gives:
n = 1004;            % We'll be putting 1000 values of x in y, from 5 to 1004
yk = zeros(1000,1);  % Preallocation
for k = 5:n
    yk(k-4) = subs(y, x, k);  % Putting the value of x
end
Now let's check the imaginary part of the result we have:
>> imag(yk)
ans =
1.0e-70 *
0
0
0
0
0.028298997121333
0.028298997121333
0.028298997121333
%and so on...
Notice the multiplier 1e-70.
Let's check the maximum value of imaginary part in yk.
>> max(imag(yk))
ans =
1.131959884853339e-71
This implies that the imaginary part is extremely small, not a considerable amount to be worried about. Ideally it would be zero; it appears only because of imprecise floating-point calculations. Hence, it is safe to call your result real.

Dimensionality reduction using PCA - MATLAB

I am trying to reduce dimensionality of a training set using PCA.
I have come across two approaches.
[V, U, eigen] = pca(train_x);

eigen_sum = 0;
for lamda = 1:length(eigen)
    eigen_sum = eigen_sum + eigen(lamda, 1);
    if (eigen_sum/sum(eigen) >= 0.90)
        break;
    end
end
train_x = train_x*V(:, 1:lamda);
Here, I simply use the eigenvalue matrix to reconstruct the training set with a smaller number of features, determined by the principal components that describe 90% of the original variance.
The alternate method that I found is almost exactly the same, save the last line, which changes to:
train_x=U(:,1:lamda);
In other words, we take the training set as the principal component representation of the original training set up to some feature lamda.
Both of these methods seem to yield similar results (out-of-sample test error), but there is a difference, however minuscule it may be.
My question is, which one is the right method?
The answer depends on your data, and what you want to do.
Using your variable names: generally speaking, one would expect the outputs of pca to satisfy
U = train_x * V
But this is only true if your data is normalized - specifically, if you have already removed the mean from each component. If not, what one can expect is
U = train_x * V - mean(train_x * V)
And in that regard, whether you want to remove or keep the mean of your data before processing depends on your application.
It's also worth noting that even if you remove the mean before processing, there might be some small difference, but it will be around floating-point precision error:
((train_x * V) - U) ./ U ~~ 1.0e-15
and this error can be safely ignored.
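To make the relation concrete, here is a small sketch in Python/NumPy (random placeholder data; U below corresponds to pca's score output):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))   # stand-in for train_x
Xc = X - X.mean(axis=0)             # mean removed from each component

# principal directions from the SVD of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                            # columns are the principal directions
U = Xc @ V                          # the scores, i.e. pca's second output

print(np.allclose(Xc @ V, U))                        # True: centered data
print(np.allclose(X @ V - (X @ V).mean(axis=0), U))  # True: mean removed after projecting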

atan(c*tan(x)) does not consider x>pi/2

The Problem:
I have a problem in Matlab with a part of a function that calculates the value of
atan(c*tan(x)).
For some real c and x, where x is an angle. Usually this would return values between -pi/2 and pi/2, but I want it to keep track of "how often the angle has wrapped around": for example, with c=1 we get the same result for x=pi/2 and x=3*pi/2.
This is what I want to avoid; in this case I would want it to return the value 3*pi/2 for x=3*pi/2 (I know that for c=1 atan and tan would cancel out, but for arbitrary real c they do not).
In other words, I want to make atan(c*tan(x)) continuous on R in x.
How I tried to fix it:
I simply added a function that kept an eye on the argument of the tangens:
atan(c*tan(x))+pi.*(floor((x./pi)+(1/2)))
This works for values that are not too near the poles of the tangent function, but once x lies very near a pole it breaks down (jumps of height pi are observed). This is especially a problem for my calculations, as x is very near one of the poles all the time.
What still seems to be the problem:
Imho the problem is that the tangent has a "much higher resolution" near its poles than the floor function, making its jump come too early or too late relative to the correcting term.
My question:
Is there another way to fix this problem that is not sensitive to x lying near the poles of the tangent?
There are multiple possible solutions for this problem. A simple one would be to use a special case for x near the poles pi/2 + k*pi, letting atan(c*tan(x)) = +-pi/2 depending on whether x is slightly smaller or slightly larger than the pole, respectively.
function Out = ContTangents(In, RidgeParam)
    InOverPi = In./pi;
    Rounded = round(InOverPi);
    CloseRounded = round(InOverPi + 1/2) - 1/2;
    ArctanResult = zeros(size(In));
    IsCloseVal = abs(InOverPi - CloseRounded) < 0.00001;
    CloseVal = sign(RidgeParam*(CloseRounded - InOverPi)) * pi/2;
    NotCloseVal = atan(RidgeParam*tan(In));
    ArctanResult(IsCloseVal) = CloseVal(IsCloseVal);
    ArctanResult(~IsCloseVal) = NotCloseVal(~IsCloseVal);
    Out = ArctanResult + pi*(Rounded + 1/2);
end
This solution appears to produce nice-looking curves, and as far as I can see it avoids discontinuities and singularities. At least, it seems nice when I run
figure, plot(ContTangents(-10:.001*pi:10,2))
Edit: Some parts of the function have been modified. When running ContTangents(pi/2-exp(-36),2) (the problem value from the OP's example), I found this very peculiar behavior. Is this a bug in Matlab?
K>> InOverPi<0.5
ans =
1
K>> round(InOverPi)
ans =
1
This could be worked around by defining your own round function that doesn't fail at the point .5-eps(.25). In fact, due to the entertaining behavior of floating point at this limit, you can't even define that number directly; you have to construct it as follows.
format hex
(pi/2-exp(-36))/pi
ans =
3fdfffffffffffff
-eps(.25)+.25+.25
ans =
3fdfffffffffffff
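For what it's worth, the same double can be constructed in Python, where the built-in round behaves as expected (a sketch, just to isolate the MATLAB behavior above):

import math

x = (math.pi/2 - math.exp(-36))/math.pi
print(x.hex())   # expected: 0x1.fffffffffffffp-2, the double just below 0.5
print(x < 0.5)   # True
print(round(x))  # 0 here, whereas MATLAB's round returned 1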