I am writing a program in C++ where I want to find the machine epsilon of my PC.
I want the result in double precision (which is 2.2204460492503131E-16), but instead the output is 1.0842E-19, which is the epsilon for long double precision.
My program is this:
#include <iostream>

double e = 1.0;
double x;

int main()
{
    for (int i = 0; e + 1.0 != 1.0; i++)
    {
        std::cout << e << '\n';
        x = e;
        e /= 2.0;
    }
    std::cout << "The epsilon of this Computer is " << x << '\n';
    return 0;
}
Output std::numeric_limits<double>::epsilon() instead. std::numeric_limits is declared in the standard header <limits>.
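For example, a minimal program that just prints the library-provided value, with enough digits to show the full double precision:
#include <iostream>
#include <limits>

int main()
{
    // epsilon() is the gap between 1.0 and the next representable double,
    // i.e. exactly the value the loop above is trying to find
    std::cout.precision(std::numeric_limits<double>::max_digits10);
    std::cout << std::numeric_limits<double>::epsilon() << '\n';  // 2.2204460492503131e-16
    return 0;
}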
A more usual technique, if you really must calculate it (rather than trusting your standard library to provide a correct value), is
double epsilon = 1.0;
while ((1.0 + 0.5 * epsilon) != 1.0)
    epsilon *= 0.5;
to do the calculation.
Note that (although you haven't shown how you did it) it may actually be your long double calculation that is incorrect, since floating-point literals (like 1.0) default to type double, not long double. That suggests the error is in your calculation of the long double result, not the double one. If you want the result to be of type long double, give all of the literal values (1.0, 0.5) the L suffix, to force them to be of type long double.
Also remember to use appropriate formatting when streaming the resulting value to std::cout, to ensure the output has the precision you need; the default settings (what you get if you don't control the formatting) print only six significant digits.
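For instance, a sketch of the calculation from above combined with explicit precision control via the standard <iomanip> manipulator:
#include <iomanip>
#include <iostream>
#include <limits>

int main()
{
    double epsilon = 1.0;
    while ((1.0 + 0.5 * epsilon) != 1.0)
        epsilon *= 0.5;
    // max_digits10 digits are enough to print any double accurately enough to round-trip it
    std::cout << std::setprecision(std::numeric_limits<double>::max_digits10)
              << epsilon << '\n';
    return 0;
}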
I am calculating the intersection point of two lines given in the polar coordinate system:
typedef ap_fixed<16,3,AP_RND> t_lines_angle;
typedef ap_fixed<16,14,AP_RND> t_lines_rho;

bool get_intersection(
    hls::Polar_<t_lines_angle, t_lines_rho>* lineOne,
    hls::Polar_<t_lines_angle, t_lines_rho>* lineTwo,
    Point* point)
{
    float angleL1 = lineOne->angle.to_float();
    float angleL2 = lineTwo->angle.to_float();
    t_lines_angle rhoL1 = lineOne->rho.to_float();
    t_lines_angle rhoL2 = lineTwo->rho.to_float();

    t_lines_angle ct1 = cosf(angleL1);
    t_lines_angle st1 = sinf(angleL1);
    t_lines_angle ct2 = cosf(angleL2);
    t_lines_angle st2 = sinf(angleL2);
    t_lines_angle d = ct1*st2 - st1*ct2;

    // we make sure that the lines intersect,
    // which means that parallel lines are not possible
    point->X = (int)((st2*rhoL1 - st1*rhoL2)/d);
    point->Y = (int)((-ct2*rhoL1 + ct1*rhoL2)/d);
    return true;
}
After synthesis for our FPGA I saw that the 4 implementations of the float sine (and cosine) take 4800 LUTs per implementation, which sums to about 19000 LUTs for these 4 functions. I want to reduce the LUT count by using a fixed-point sine. I already found an implementation of CORDIC, but I am not sure how to use it. The input of that function is an integer, but I have an ap_fixed datatype. How can I map this ap_fixed to an integer, and how can I map my 3.13 fixed point to the required 2.14 fixed point?
With the help of one of my colleagues I figured out a quite easy solution that does not require any hand-written implementations or manipulation of the fixed-point data:
use #include "hls_math.h" and the hls::sinf() and hls::cosf() functions.
It is important to note that the input of these functions should be ap_fixed<32, I> where I <= 32. The output of the functions can be assigned to a different type, e.g. ap_fixed<16, I>.
Example:
void CalculateSomeTrig(ap_fixed<16,5>* angle, ap_fixed<16,5>* output)
{
    ap_fixed<32,5> functionInput = *angle;
    *output = hls::sinf(functionInput);
}
LUT consumption:
In my case the consumption of LUT was reduced to 400 LUTs for each implementation of the function.
You can use bit-slicing to get the fraction and the integer parts of the ap_fixed variable, and then manipulate them to get the new ap_fixed. Perhaps something like:
constexpr int max(int a, int b) { return a > b ? a : b; }

template <int W2, int I2, int W1, int I1>
ap_fixed<W2, I2> convert(ap_fixed<W1, I1> f)
{
    // Read fraction part as integer:
    ap_fixed<max(W2, W1) + 1, max(I2, I1) + 1> result = f(W1 - I1 - 1, 0);
    // Shift by the original number of bits in the fraction part
    result >>= W1 - I1;
    // Add the integer part
    result += f(W1 - 1, W1 - I1);
    return result;
}
I haven't tested this code well, so take it with a grain of salt.
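For the 3.13 to 2.14 case in the question, a hypothetical call of this convert template could look like the following (equally untested; the header name, variable names and sample value are only for illustration):
#include "ap_fixed.h"

int main()
{
    ap_fixed<16, 3> angle_3_13 = 1.375;                       // 3 integer bits, 13 fraction bits
    ap_fixed<16, 2> angle_2_14 = convert<16, 2>(angle_3_13);  // W1/I1 are deduced from the argument
    return 0;
}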
I have two matrices a = [120.23, 255.23669877,...] and b = [125.000083, 800.0101010,...] with double numbers in [0, 999]. I want to use bitxor on a and b. I cannot use bitxor with round like this:
result = bitxor(round(a(1,j)), round(b(1,j)))
because the decimal parts 0.23, 0.000083, ... are very important to me. I thought maybe I could do a = a*10^k and b = b*10^k, use bitxor, and after that divide the result by 10^k (because I want my result's range to also be [0, 999]). But I do not know the maximum length of the number after the decimal point. Does k = 16 cover the maximum range of double numbers in Matlab? Does bitxor support two 19-digit numbers? Is there a better solution?
This is not really an answer, but a very long comment with embedded code. I don't have a current Matlab installation, and in any case don't know enough to answer the question in that context. Instead, I've written a Java program that I think may do what you are asking for. It uses two Java classes, BigInteger and BigDecimal. BigInteger is an extended integer format. BigDecimal is the combination of a BigInteger and a decimal scale.
Conversion from double to BigDecimal is exact. Conversion in the opposite direction may require rounding.
The function xor in my program converts each of its operands to BigDecimal. It finds a number of decimal digits to move the decimal point by to make both operands integers. After scaling, it converts to BigInteger, does the actual xor, and converts back to BigDecimal undoing the scaling.
The main point of this is for you to look at the results, and see whether they are what you want, and would be useful to you if you could do the same thing in Matlab. Explaining any ways in which the results are not what you want may help clarify your requirements for the Matlab experts.
Here is some test output. The top and bottom rows of each block are in decimal. The middle row shows the scaled integer versions of the inputs and their xor, in hex.
Testing operands 1.100000000000000088817841970012523233890533447265625, 2
2f0a689f1b94a78f11d31b7ab806d40b1014d3f6d59 xor 558749db77f70029c77506823d22bd0000000000000 = 7a8d21446c63a7a6d6a61df88524690b1014d3f6d59
1.1 xor 2.0 = 2.8657425494106605
Testing operands 100, 200.0004999999999881765688769519329071044921875
2cd76fe086b93ce2f768a00b22a00000000000 xor 59aeee72a26b59f6380fcf078b92c4478e8a13 = 7579819224d26514cf676f0ca932c4478e8a13
100.0 xor 200.0005 = 261.9771865509636
Testing operands 120.3250000000000028421709430404007434844970703125, 120.75
d2c39898113a28d484dd867220659fbb45005915 xor d3822c338b76bab08df9fee485d1b00000000000 = 141b4ab9a4c926409247896a5b42fbb45005915
120.325 xor 120.75 = 0.7174277813579485
Testing operands 120.2300000000000039790393202565610408782958984375, 120.0000830000000036079654819332063198089599609375
d298ff20fbed5fd091d87e56002df79fc7007cb7 xor d231e5f39e1db18654cb8c43d579692616a16a1f = a91ad365f0ee56c513f215d5549eb9d1a116a8
120.23 xor 120.000083 = 0.37711627930683345
Here is the Java program:
import java.math.BigDecimal;
import java.math.BigInteger;

public class Test {

    public static double xor(double a, double b) {
        BigDecimal ad = new BigDecimal(a);
        BigDecimal bd = new BigDecimal(b);
        /*
         * Shifting the decimal point right by scale will make both operands
         * integers.
         */
        int scale = Math.max(ad.scale(), bd.scale());
        /*
         * Scale both operands by, in effect, multiplying by the same power of 10.
         */
        BigDecimal aScaled = ad.movePointRight(scale);
        BigDecimal bScaled = bd.movePointRight(scale);
        /*
         * Convert the operands to integers, treating any rounding as an error.
         */
        BigInteger aInt = aScaled.toBigIntegerExact();
        BigInteger bInt = bScaled.toBigIntegerExact();
        BigInteger resultInt = aInt.xor(bInt);
        System.out.println(aInt.toString(16) + " xor " + bInt.toString(16) + " = "
                + resultInt.toString(16));
        /*
         * Undo the decimal point shift, in effect dividing by the same power of 10
         * as was used to scale to integers.
         */
        BigDecimal result = new BigDecimal(resultInt, scale);
        return result.doubleValue();
    }

    public static void test(double a, double b) {
        System.out.println("Testing operands " + new BigDecimal(a) + ", " + new BigDecimal(b));
        double result = xor(a, b);
        System.out.println(a + " xor " + b + " = " + result);
        System.out.println();
    }

    public static void main(String arg[]) {
        test(1.1, 2.0);
        test(100, 200.0005);
        test(120.325, 120.75);
        test(120.23, 120.000083);
    }
}
"But I do not know the max length of number after point ..."
In double precision floating-point you have 15–17 significant decimal digits. If you give bitxor double inputs these must be less than intmax('uint64'): 1.844674407370955e+19. The largest double, realmax (= 1.797693134862316e+308), is much bigger than this, so you can't represent everything in the the way you're using. For example, this means that your value of 800.0101010*10^17 won't work.
If your range is [0, 999], one option is to solve for the largest fractional exponent k and use that: log(double(intmax('uint64'))/999)/log(10) (= 16.266354234268810).
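The same bound can be sanity-checked outside Matlab; here is a small C++ sketch of the equivalent calculation (assuming, as above, that a 64-bit unsigned integer is the limiting type):
#include <cstdint>
#include <cmath>
#include <cstdio>

int main()
{
    // largest exponent k such that 999 * 10^k still fits below the 64-bit unsigned maximum
    double k = std::log10(static_cast<double>(UINT64_MAX) / 999.0);
    std::printf("k = %.15f\n", k);  // roughly 16.27, matching the Matlab expression above
    return 0;
}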
This is the code I have:
int resultInt = [ja.resultCount intValue];
float pages = resultInt / 10;
NSLog(@"%d", resultInt);
NSLog(@"%.2f", pages);
resultInt comes back from a PHP script with the value 3559, so the pages result should be 355.9, but I get the result as 355.00, which isn't right.
Use
float pages = resultInt / 10.0f;
int/int is int
but int/float or float/int is float
Edited for more explanation
It is important to remember that the result of a mathematical operation is governed by the types of its operands, not by the type of the variable that receives it. A division may mathematically yield a fractional value, but if it is assigned to an integer the fractional part will be lost. Equally important, and less obvious, is the effect of an operation performed on several integers and assigned to a non-integer: the result is calculated as an integer before being implicitly converted. This means that although the resulting value is assigned to a floating-point variable, the fractional part is still truncated unless at least one of the operands is explicitly converted first. The following examples illustrate this:
int a = 7;
int b = 3;
int integerResult;
float floatResult;
integerResult = a / b; // integerResult = 2 (truncated)
floatResult = a / b; // floatResult = 2.0 (truncated)
floatResult = (float)a / b; // floatResult = 2.33333325
This has to do with the fact that you're using integers and not floats.
Make the values involved floats and you are done:
int resultInt = [ja.resultCount intValue];
float pages = (float)resultInt / 10.f;
**NSLog(#"%0.2f", 1.345); // output 1.34**
NSLog(#"%0.2f", 1.3451);// output 1.35
NSLog(#"%0.2f", round(1.345 * 100)/100.);//output 1.35
Why the first line output 1.34?!!
=======================================
Updated:
NSLog(#"%.2f", round([#"644.435" doubleValue] * 100) / 100); // output 644.43,
but
NSLog(#"%.2f", round([#"6.435" doubleValue] * 100) / 100); // output 6.44?
If I want to convert a NSString to keep two digit after the point, would you please advise how to convert?
Because 1.345 cannot be represented exactly in IEEE754. It is actually stored as a value slightly below 1.345 (about 1.34499999999999997, as the program below shows), which, when printed to two decimal places, gives you 1.34.
If you search for things like ieee754, precision and floating point, one of the billion or so articles should be able to enlighten you :-)
Look particularly for "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
Going into a little more detail, examine the following C program:
#include <stdio.h>

int main (void) {
    float f = 1.345f;
    double d = 1.345;
    printf ("f = %.20f\nd = %.20lf\n", f, d);
    return 0;
}
The output of this is:
f = 1.34500002861022949219
d = 1.34499999999999997335
So the float value is a little above 1.345 and the double (which is what you get when you specify 1.345 without the f suffix) is a little below.
That explains why you're seeing it truncated to 1.34 rather than rounded to 1.35.
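The same effect is behind the update: whether round(x * 100) / 100 ends up going up or down depends on which side of the exact decimal value x, and then x * 100, land on. A small sketch in the same spirit as the program above (using the values from the update) lets you inspect both cases:
#include <stdio.h>

int main (void) {
    double a = 644.435;
    double b = 6.435;
    /* print enough digits to see which side of the exact decimal each value lands on */
    printf ("644.435       = %.20f\n", a);
    printf ("644.435 * 100 = %.20f\n", a * 100);  /* the reported 644.43 implies this is just below 64443.5 */
    printf ("6.435         = %.20f\n", b);
    printf ("6.435 * 100   = %.20f\n", b * 100);  /* the reported 6.44 implies this is at or above 643.5 */
    return 0;
}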
I've seen several examples in books and around the web where they sometimes use decimal places when declaring float values even if they are whole numbers, and sometimes use an "f" suffix. Is this necessary?
For example:
[UIColor colorWithRed:0.8 green:0.914 blue:0.9 alpha:1.00];
How is this different from:
[UIColor colorWithRed:0.8f green:0.914f blue:0.9f alpha:1.00f];
Does the trailing "f" mean anything special?
Getting rid of the trailing zeros for the alpha value works too, so it becomes:
[UIColor colorWithRed:0.8 green:0.914 blue:0.9 alpha:1];
So are the decimal zeros just there to remind myself and others that the value is a float?
Just one of those things that has puzzled me so any clarification is welcome :)
Decimal literals are treated as double by default. Using 1.0f tells the compiler to use a float (which is smaller than double) instead. In most cases it doesn't really matter whether a number is a double or a float; the compiler will make sure you get the right format for the job in the end. In high-performance code you may want to be explicit, but I'd suggest benchmarking it yourself.
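If you're curious, a quick way to see the difference is to check the size of each kind of literal (a minimal sketch, not tied to the UIColor example):
#include <iostream>

int main()
{
    // an unsuffixed decimal literal is a double; the f suffix makes it a float
    std::cout << sizeof(0.914)  << '\n';  // typically 8 (double)
    std::cout << sizeof(0.914f) << '\n';  // typically 4 (float)
    return 0;
}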
As John said, numbers with a decimal point default to double. TomTom is wrong.
I was curious to know if the compiler would just optimize the double literals to const floats (which I assumed would happen)... it turns out it doesn't, and the idea of the speed increase is actually legit... depending on how much you use it. In a math-heavy application, you probably do want to use this trick.
It must be the case that it is taking the stored float variable, casting it to a double, performing the math against the double (the literal without the f), then casting it back to a float to store it again. That would explain the difference in calculation even though we're storing in floats each time.
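In other words (a minimal sketch of the implied conversions, separate from the benchmark below):
int main(void)
{
    float f = 1.2f;
    f += 1.8368383;   // 1.8368383 is a double: f is widened to double, the add is done
                      // in double, and the result is narrowed back to float for the store
    f += 1.8368383f;  // with the f suffix the whole operation stays in float
    return 0;
}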
The code & raw results:
https://gist.github.com/1880400
I pulled out the relevant benchmark, run on an iPad 1 in the Debug profile (Release resulted in even more of a performance increase from using the f notation):
------------ 10000000 total loops
timeWithDoubles: 1.33593 sec
timeWithFloats: 0.80924 sec
Float speed up: 1.65x
Difference in calculation: -0.000038
Code:
#import <Foundation/Foundation.h>
#include <limits.h>

void runTest(int numIterations);

int main (int argc, const char * argv[]) {
    for (unsigned int magnitude = 100; magnitude < INT_MAX; magnitude *= 10) {
        runTest(magnitude);
    }
    return 0;
}

void runTest(int numIterations) {
    NSTimeInterval startTime = CFAbsoluteTimeGetCurrent();
    float d = 1.2f;
    for (int i = 0; i < numIterations; i++) {
        d += 1.8368383;
        d *= 0.976;
    }
    NSTimeInterval timeWithDoubles = CFAbsoluteTimeGetCurrent() - startTime;

    startTime = CFAbsoluteTimeGetCurrent();
    float f = 1.2f;
    for (int i = 0; i < numIterations; i++) {
        f += 1.8368383f;
        f *= 0.976f;
    }
    NSTimeInterval timeWithFloats = CFAbsoluteTimeGetCurrent() - startTime;

    printf("\n------------ %d total loops\n", numIterations);
    printf("timeWithDoubles: %2.5f sec\n", timeWithDoubles);
    printf("timeWithFloats: %2.5f sec\n", timeWithFloats);
    printf("Float speed up: %2.2fx\n", timeWithDoubles / timeWithFloats);
    printf("Difference in calculation: %f\n", d - f);
}
Trailing f: this is a float.
Trailing f + "." - redundant.
That simple.
8f is 8 as a float.
8.0 is 8 as a float.
8 is 8 as integer.
8.0f is 8 as a float.
Mostly the "f" can be style - to make sure it is a float, not a double.