Division accuracy - double

I am programming in C/C++ and want to minimize the truncation error in my calculations. Let's take a look at the following example:
double test;
test=1./6.;
Is it already calculating in the right precision or is it doing the calculation in float and casting it in the end to double?
Would the correct way be
test= double(1)/double(6);
?

Related

Variable precision arithmetic for symbolic integral in Matlab

I am trying to calculate some integrals that use very high power exponents. An example equation is:
(-exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)).^2 ...
./( exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)) ...
/ (2*sqrt(pi))
where p is constant (1000 being a typical value), and I need the integral for x=[-inf,inf]. If I use the integral function for numeric integration I get NaN as a result. I can avoid that if I set the limits of the integration to something like [-20,20] and a low p (<100), but ideally I need the full range.
I have also tried setting syms x and using int and vpa, but in this case vpa returns:
1.0 - 1.0*numeric::int((1125899906842624*(exp(-(x - 10*10^(1/2))^2) - exp(-(x + 10*10^(1/2))^2))^2)/(3991211251234741*(exp(-(x - 10*10^(1/2))^2) + exp(-(x + 10*10^(1/2))^2)))
without calculating a value. Again, if I set the limits of the integration to lower values I do get a result (also for low p), but I know that the result that I get is wrong – e.g., if x=[-100,100] and p=1000, the result is >1, which should be wrong as the equation should be asymptotic to 1 (or alternatively the codomain should be [0,1) ).
Am I doing something wrong with vpa or is there another way to calculate high precision values for my integrals?
First, you're doing something that makes solving symbolic problems more difficult and less accurate. The variable pi is a floating-point value, not an exact symbolic representation of the fundamental constant. In Matlab symbolic math code, you should always use sym('pi'). You should do the same for any other special numeric values you use, e.g., sqrt(sym('2')) and exp(sym('1')), or they will get converted to an approximate rational fraction by default (the source of the strange large numbers you see in the code in your question). For further details, I recommend that you read through the documentation for the sym function.
Applying the above, here's a runnable example:
syms x;
p = 1000;
f = (-exp(-(x+sqrt(p)).^2)+exp(-(x-sqrt(p)).^2)).^2./(exp(-(x+sqrt(p)).^2)...
+exp(-(x-sqrt(p)).^2))/(2*sqrt(sym('pi')));
Now vpa(int(f,x,-100,100)) and vpa(int(f,x,-1e3,1e3)) return exactly 1.0 (to 32 digits of precision, see below).
Unfortunately, vpa(int(f,x,-Inf,Inf)) does not return an answer, but a call to the underlying MuPAD function numeric::int. As I explain in this answer, this is what can happen when int cannot obtain a result. Normally, it should try to evaluate the integral numerically, but your function appears to be ill-defined at ±∞, resulting in divide-by-zero issues that the variable precision quadrature methods can't handle well. You can evaluate the integral at wider bounds by increasing the variable precision using the digits function (just remember to set digits back to the default of 32 when done). Setting digits(128) allowed me to evaluate vpa(int(f,x,-1e4,1e4)). You can also more efficiently evaluate your integral over a wider range via 2*vpa(int(f,x,0,1e4)) at lower effective digits settings.
If your goal is to see exactly how much less than one p = 1000 corresponds to, you can use something like vpa(1-2*int(f,x,0,1e4)). At digits(128), this returns
0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000086457415971094118490438229708839420392402555445545519907545198837816908450303280444030703989603548138797600750757834260181259102
Applying double to this shows that it is approximately 8.6e-89.
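To illustrate where the divide-by-zero comes from, here is a rough double-precision sketch in Python (not the symbolic Matlab approach above). Far from x = ±sqrt(p) both exponentials underflow to zero, so the quotient becomes 0/0. Factoring exp(-(x-sqrt(p))^2) out of the numerator and denominator, using (x+s)^2 - (x-s)^2 = 4*x*s, avoids that. The names integrand and integrate_symmetric, the cutoff, and the step size are all assumptions made for this illustration, and plain doubles cannot resolve a difference from 1 as small as 8.6e-89.

```python
import math

def integrand(x, p):
    """Integrand from the question, rearranged to avoid 0/0 on underflow.

    With s = sqrt(p): exp(-(x+s)^2) / exp(-(x-s)^2) = exp(-4*x*s).
    The function is even in x, so abs(x) is used.
    """
    s = math.sqrt(p)
    x = abs(x)
    ratio = math.exp(-4.0 * x * s)   # always in [0, 1], never overflows
    peak = math.exp(-(x - s) ** 2)   # underflows harmlessly to 0.0 far from s
    return peak * (1.0 - ratio) ** 2 / (1.0 + ratio) / (2.0 * math.sqrt(math.pi))

def integrate_symmetric(p, upper=None, step=0.01):
    """Trapezoidal estimate of the full integral as 2 * integral over [0, upper]."""
    if upper is None:
        upper = math.sqrt(p) + 40.0  # assumed cutoff; the tail beyond it is negligible
    n = int(round(upper / step))
    total = 0.5 * (integrand(0.0, p) + integrand(n * step, p))
    for i in range(1, n):
        total += integrand(i * step, p)
    return 2.0 * total * step

print(integrate_symmetric(1000))
```

The factor of 2 uses the same symmetry as 2*vpa(int(f,x,0,1e4)) in the answer.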

Matlab double function outputs infinity when taking a big number of type "sym"

This is literally the number I obtain (from symsum function), which is of type sym:
a=328791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001
I would like to perform mathematical operations (in particular, take a natural log) on this number and so want to convert it to double; however, the output of double(a) is simply "Inf". How can I work around this and convert it from "sym" to a numeric type?
Your number is ~3.3×10^335, but the largest number that can be represented by MATLAB's double-precision floating-point numbers is ~1.8×10^308 (see the output of realmax). Converting your number to double precision causes overflow because the number is larger than can be represented, so MATLAB just returns Inf.
For an exhaustive overview of floating point representations and arithmetic, you can check out this PDF.
Can you count the digits and insert a decimal point before converting to double?
If so, take advantage of the fact that the natural log of a number that overflows may not itself overflow.
Using "^" for power, you can represent your number as 3.28791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001 * (10 ^ 335).
The decimal log of (10^335) is 335. Its natural log is 335*log(10).
The natural log of the original number is:
log(3.287910783449037393637620930603504300769297070447868982919407220528)
+ 335*log(10)
All inputs, intermediate results, and the final result of this calculation are in the double range.
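The digit-counting trick above can be sketched in Python as follows. The helper name log_of_big_integer is made up for this illustration, and the value a is a shortened stand-in with the same leading digits and magnitude as the number in the question, not the number itself.

```python
import math

def log_of_big_integer(n):
    """Natural log of a positive integer that may be too large for a double.

    Splits n into mantissa * 10**exponent using its decimal digits,
    so every intermediate value stays inside the double range.
    """
    digits = str(n)
    exponent = len(digits) - 1                        # position of the decimal point
    mantissa = float(digits[0] + "." + digits[1:18])  # leading digits, in [1, 10)
    return math.log(mantissa) + exponent * math.log(10)

# Stand-in value: 336 digits, roughly 3.29e335, too large for a double.
a = 328791078344903739 * 10**318
print(log_of_big_integer(a))
```

Only the first 17 or so digits are kept for the mantissa because a double cannot hold more than that anyway.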

Making a calculation in objective c

I need a variable a = 6700000^2 * (a - b) (2 + sinf(a)+ sinf(b)), where a and b are floats between -7 and 7. I need all the precision that floats can give me.
Which data type should a be? Is the sinf the proper function to get the best precision out of a and b? And should a and b be in radians or degrees?
Well, I made a mistake when I posted the expression; the correct expression is c=67000000^2*(a-b)(2+sinf(a)+sinf(b)), and my problem is with c. "a" and "b" are floats and they are passed to me as floats; they really are coordinates (latitude and longitude), so that's not my concern. My concern is whether I lose any precision when using sinf on them. And which type should c be so that I don't lose precision? I'm using a long double variable d to store a sum of multiple different c variables, and d comes back as zero when it shouldn't (it should be about 1 or 2), so I was guessing that I was losing some precision when calculating the c terms. I was using double for c. Can it be that I am losing some precision when calculating c?
Thank you very much for your help.
I can't tell you whether float is good enough for your application. If you need more precision, use double, and then use sin() instead of sinf().
The standard trig functions take angles in radians, as you'll discover if you read the relevant documentation.
Instead of using float, you should use double if memory is not a concern. Remember to then change sinf() to sin() and use radians.
If you want the best precision without rolling your own types, you should use double rather than float. In that case, you can just use sin (documented in the sin(3) man page). According to the man page, you should pass the argument in radians.
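As a rough illustration of both points (radians, and float versus double), here is a small Python sketch. Python's float is an IEEE double, so the helper to_single (a made-up name) emulates a C float by round-tripping through struct. The angle value is arbitrary and assumed to arrive in degrees, which the question does not actually state.

```python
import math
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

angle_deg = 6.7                       # hypothetical coordinate, in degrees
angle_rad = math.radians(angle_deg)   # trig functions expect radians

sin_double = math.sin(angle_rad)                         # like sin() on a double
sin_single = to_single(math.sin(to_single(angle_rad)))   # roughly like sinf() on a float

print(sin_double, sin_single, abs(sin_double - sin_single))
```

The difference is small for a single call, but it can matter once such terms are multiplied by a factor as large as 67000000^2 and then summed.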

Getting double precision in fortran 90 using intel 11.1 compiler

I have a very large code that sets up and iteratively solves a system of non-linear partial differential equations, written in Fortran. I need all variables to be double precision. In the additional module that I have written for the code, I declare all variables as the double precision type, but my module still uses variables from the old source code that are declared as type real. So my question is, what happens when a single-precision variable is multiplied by a double precision variable in Fortran? Is the result double precision if the variable used to store the value is declared as double precision? And what if a double precision value is multiplied by a constant without the "D0" at the end? Can I just set a compiler option in Intel 11.1 to make all real/double precision/constants of double precision?
So my question is, what happens when a single-precision variable is multiplied by a double precision variable in Fortran? The single-precision value is promoted to double precision and the operation is done in double precision.
Is the result double precision if the variable used to store the value is declared as double precision? Not necessarily. The right-hand side is an expression that doesn't "know" about the precision of the variable on the left-hand side, into which it will be stored. If you have Double = SingleA * SingleB (using names to indicate the types), the calculation will be performed in single precision, then converted to double for storage. This will NOT gain extra precision for the calculation!
And what if a double precision value is multiplied by a constant without the "D0" at the end? This is just like the first question, the constant will be promoted to double precision and the calculation done in double precision. However, the constant is still single precision and even if you wrote down many digits as for a double-precision constant, the internal storage is single precision and cannot represent that accuracy. For example, DoubleVar * 3.14159265359 will be calculated in double precision, but will be something approximating DoubleVar * 3.14159 done in double precision.
If you want to have the compiler retain many digits in a constant, you must specify the precision of the constant. The Fortran 90 way to do this is to define your own real type with whatever precision you need, e.g., to require at least 14 decimal digits:
integer, parameter :: DoubleReal_K = selected_real_kind (14)
real (DoubleReal_K) :: A
A = 5.0_DoubleReal_K
A = A * 3.14159265359_DoubleReal_K
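The two pitfalls above (Double = SingleA * SingleB, and a constant without a kind suffix) can be imitated outside Fortran. The following Python sketch is only an emulation under stated assumptions: Python floats are doubles, and the made-up helper to_single rounds a value to single precision through struct. The operand values 1.1 and 2.3 are arbitrary.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Double = SingleA * SingleB: the product is formed in single precision first.
# The product of two singles is exact in a double, so one rounding emulates it.
single_a = to_single(1.1)
single_b = to_single(2.3)
single_product = to_single(single_a * single_b)   # then stored in a double unchanged
double_product = 1.1 * 2.3                        # both operands double from the start

# A constant written without a kind suffix is stored as single precision,
# no matter how many digits were typed.
pi_single = to_single(3.14159265359)

print(single_product, double_product)
print(pi_single)
```

Storing single_product in a double keeps exactly the single-precision result; the digits lost in the multiplication do not come back.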
The Fortran standard is very specific about this; other languages are like this, too, and it's really what you'd expect. If an expression contains an operation on two floating-point variables of different precisions, then the expression is of the type of the higher-precision operand, e.g.,
(real variable) + (double variable) -> (double)
(double variable)*(real variable) -> (double)
(double variable)*(real constant) -> (double)
etc.
Now, if you are storing the result in a lower-precision floating point variable, it'll get down-converted again. But if you are storing it in a variable of the higher precision, it'll maintain its precision.
If there are any cases where you're concerned that a single-precision floating point variable is causing a problem, you can force it to be converted to double precision using the DBLE() intrinsic:
DBLE(real variable) -> double
If you write a number in the form 0.1D0, it is treated as a double-precision number; if you write 0.1 instead, it is stored in single precision and precision is lost when it is converted to double.
Here is an example:
program main
implicit none
real(8) a,b,c
a=0.2D0
b=0.2
c=0.1*a
print *,a,b,c
end program
When compiled with
ifort main.f90
I get results:
0.200000000000000 0.200000002980232 2.000000029802322E-002
When compiled with
ifort -r8 main.f90
I get results:
0.200000000000000 0.200000000000000 2.000000000000000E-002
If you use the IBM XLF compiler, the equivalent is
xlf -qautodbl=dbl4 main.f90
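The numbers printed without -r8 can be reproduced by emulating single-precision literals. This is a Python sketch, not Fortran: Python floats are doubles, and to_single is a made-up helper that rounds to single precision via struct.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = 0.2                    # a = 0.2D0: a true double-precision literal
b = to_single(0.2)         # b = 0.2: single literal, then widened to double
c = to_single(0.1) * a     # c = 0.1*a: single literal promoted, product in double

print(a, b, c)
```

The values of b and c agree with the 0.200000002980232 and 2.000000029802322E-002 shown above, which suggests the extra digits come from the single-precision literals rather than from the multiplication.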
Jonathan Dursi's answer is correct - the other part of your question was if there was a way to make all real variables double precision.
You can accomplish this with the ifort compiler by using the -i8 (for integers) and -r8 (for reals) options. I'm not sure if there is a way to force the compiler to interpret literals as double-precision without specifying them as such (e.g. by changing 3.14159265359 to 3.14159265359D0) - we ran into this issue a while back.

iPhone/Obj C: Why does converting float to int with (int) float * 100 not work?

In my code, I am using float to do currency calculation but the rounding has yielded undesired results so I am trying to convert it all to int. With as little change to the infrastructure as possible, in my init functions, I did this:
-(id)initWithPrice:(float)p;
{
[self setPrice:(int)(p*100)];
}
I multiply by 100 because in the main section, the values are given as .xx to 2 decimals. One abnormality I notice is that for float 1.18, the int rounds it to 117. Does anyone know why it does that? The float leaves it as 1.18. I expect it to be 118 for the int equivalent in cents.
Thanks.
Floating point is always a little imprecise. With IEEE floating point encoding, powers of two can be represented exactly (like 4,2,1,0.5,0.25,0.125,0.0625,...) , but numbers like 0.1 are always an approximation (just try representing it as a sum of powers of 2).
Your (int) cast will truncate whatever comes in, so if p*100 is resolving to 117.9999995 due to this imprecision, that will become 117 instead of 118.
A better solution is to use something like roundf on p*100. Even better would be if you can go upstream and fully convert to fixed-point math using integers in the entire program.
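The truncation can be reproduced in a short Python sketch. Python floats are doubles, so the made-up helper to_single emulates the float that holds 1.18. Here int() stands in for the (int) cast and round() for roundf. The multiplication is done in double rather than float, which changes the last digits of the product but not the outcome.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

price = to_single(1.18)          # what a float holding 1.18 actually stores
scaled = price * 100             # slightly below 118

cents_truncated = int(scaled)    # like (int)(p*100): drops the fractional part
cents_rounded = round(scaled)    # like roundf(p*100): nearest whole number

print(scaled, cents_truncated, cents_rounded)
```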