I got an erroneous result when multiplying an integer by a floating-point number in Dart. Does anyone know why this happens?
void main() {
  double x = 37.8;
  int y = 100;
  var z = x * y;
  print(z);
  // 3779.9999999999995
}
In other languages (C#/C++) I would get the result 3780.0.
This is completely expected: 37.8 (or rather, the 0.8 part) cannot be encoded exactly as a binary fraction in the IEEE 754 formats, so what you get instead is a close approximation that carries an error term in the least significant bits of the fraction.
If you need numbers that are lossless (e.g. if you are handling monetary calculations), then check out the decimal package.
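Purely to illustrate what a lossless decimal type buys you, here is the same calculation sketched with Foundation's Decimal in Swift (this is not the Dart decimal package, just an analogue of the idea):
import Foundation

// Decimal stores base-10 digits, so 37.8 is represented exactly.
let x = Decimal(string: "37.8")!
let z = x * Decimal(100)
print(z) // 3780, with no binary rounding error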
A simpler hack, if your floating-point number has enough bits allocated to the fraction to keep the erroneous bits out of the way, is to round the result of your calculation to the number of decimal places that you care about.
You can use toStringAsFixed to control the number of fraction digits.
void main() {
  double x = 37.8;
  int y = 100;
  var z = x * y;
  print(z.toStringAsFixed(1));
  // 3780.0
}
I've been writing a CAD-type program for fun in Java. The other day I wrote some code to define a line tangent to 2 circles. I've been checking my numbers against a commercial CAD program and they have been fairly close, usually to the 9th decimal place. My results really only need to be stored in an array to 7 decimal places. After successfully defining the line tangent to the 2 circles, I decided to test it and define a point which was the intersection of the line and one of the circles.
In one case I got the result I was looking for; in another case I got no intersection. After looking at a few of the calculations I realized I was getting a very small variation at maybe the 9th or 10th decimal place. I'm thinking of rewriting the code using BigDecimal.
This is a small snippet of some of the code I would need to rewrite. Once I started, it became much more cumbersome than I wanted. I'm thinking about just converting the results using BigDecimal and keeping the original code, unless there is an easy way to convert the following code to a BigDecimal style.
private float[] offsetLine(float lnx1, float lny1, float lnz1, float lnx2, float lny2, float lnz2, String direction, float offset) {
    double deltax = Math.abs(lnx2 - lnx1);
    double deltay = Math.abs(lny2 - lny1);
    double lineLength = Math.sqrt(deltax * deltax + deltay * deltay);
    double stepx = (offset * deltay) / lineLength;
    double stepy = (offset * deltax) / lineLength;
OK, I'll answer my own question. Here's some code I dug up. I could only round to 6 decimal places to get the rounding I wanted. Once I did my calculations in double values, I called the subroutine roundDbl:
double checkRadius1 = Math.sqrt(((cir1x - offsetpts[0]) * (cir1x - offsetpts[0])) + ((cir1y - offsetpts[1]) * (cir1y - offsetpts[1])));
double checkRadiusRounded = roundDbl(checkRadius1, 6); // round to 6 decimal places

public static Float roundDbl(Double dblValue, int decimalPlace) {
    // Go through the decimal string form so BigDecimal rounds the printed value,
    // then round half-up to the requested number of places.
    String tempDblString = Double.toString(dblValue);
    String tempDbl = new BigDecimal(tempDblString).setScale(decimalPlace, RoundingMode.HALF_UP).stripTrailingZeros().toPlainString();
    // Note: returning Float limits the result to roughly 7 significant digits.
    return Float.valueOf(tempDbl);
}
I was converting Float => CGFloat and it gave me the following result. Why does it come out as "0.349999994039536" after the conversion, while Double => CGFloat works fine?
let float: Float = 0.35
let cgFloatFromFloat = CGFloat(float)
print(cgFloatFromFloat)
// 0.349999994039536

let double: Double = 0.35
let cgFloatFromDouble = CGFloat(double)
print(cgFloatFromDouble)
// 0.35
Both converting “.35” to float and converting “.35” to double produce a value that differs from .35, because the floating-point formats use a binary base, so the exact mathematical value must be approximated using powers of two (negative powers of two in this case).
Because the float format uses fewer bits, its result is less precise and, in this case, less accurate. The float value is 0.3499999940395355224609375, and the double value is 0.34999999999999997779553950749686919152736663818359375.
I am not completely conversant with Swift, but I suspect the algorithm it is using to convert a CGFloat to decimal (with default options) is something like:
Produce a fixed number of decimal digits, with correct rounding from the actual value of the CGFloat to that number of digits, and then suppress any trailing zeroes. For example, if the exact mathematical value is 0.34999999999999997…, and the formatting uses 15 significant digits, the intermediate result is “0.350000000000000”, and then this is shortened to “0.35”.
The way this operates with float and double is:
When converted to double, .35 becomes 0.34999999999999997779553950749686919152736663818359375. When printed using the above method, the result is “0.35”.
When converted to float, .35 becomes 0.3499999940395355224609375. When printed using the above method, the result is “0.349999994039536”.
Thus, both the float and double values differ from .35, but the formatting for printing does not use enough digits to show the deviation for the double value, while it does use enough digits to show the deviation for the float value.
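As a quick check of that mechanism, here is a small Swift sketch (the format widths are my own choice for illustration; the printed digits follow from the exact values quoted above):
import Foundation

let f: Float = 0.35
let d: Double = 0.35

// 15 significant digits are enough to expose the float's deviation...
print(String(format: "%.15g", Double(f)))  // 0.349999994039536
// ...but not the double's, so the trailing zeros are suppressed to "0.35".
print(String(format: "%.15g", d))          // 0.35
// Asking for 17 significant digits reveals the double's deviation as well.
print(String(format: "%.17g", d))          // 0.34999999999999998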
I'm trying to find the number of steps between a min and a max value with a given step size, using Swift 2.1.
So we have a min and a max value, both of type Double. The step size is a Double too. If min is 0.0 and max is 0.5 with steps of 0.1, the result is 6, obviously.
But if I start with -0.1 as the minimum value, the result is still 6, when it should be 7, right?
Here is my Playground example:
let min: Double = -0.1
let max: Double = 0.5
let step: Double = 0.1

var steps: Int {
    return Int((max - min) / step) + 1
}

print("steps: \(steps)") // returns "steps: 6", but should be 7
The result is 6.99999999 if we use a Double for the steps variable. But this loss of precision only occurs when our min value is negative.
Do you know a workaround? I just don't want to round() each time I calculate with Doubles.
When you use Int() it truncates your number, which always rounds towards zero. So 6.9999… becomes 6 rather than 7, because the fractional part is simply discarded. If you use round() first it should help:
var steps: Int {
    return Int(round((max - min) / step) + 1.0)
}
It's never a good idea to calculate an integral number of steps from a floating-point range; you'll always run into issues like this, and there won't be much you can do about them.
Instead, I recommend building your logic on integral steps and calculating the double values from the integer values (not the other way around, as you do now). That is, don't derive the integer step count from the range; set the integer number of steps and compute your double step size from it.
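A minimal Swift sketch of that inversion (the variable names are mine, not from the question): choose the integer count up front and derive the Double values from it.
let minValue = -0.1
let maxValue = 0.5
let numberOfSteps = 7                    // decided up front as an Int
let stepSize = (maxValue - minValue) / Double(numberOfSteps - 1)

for i in 0..<numberOfSteps {
    let value = minValue + Double(i) * stepSize
    print(value)                         // ≈ -0.1, 0.0, 0.1, ..., 0.5 (each within ordinary Double rounding error)
}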
In HLSL, how would I go about packing two floats within the range of 0-1 into one float with optimal precision? This would be incredibly useful to compress my GBuffer further.
//Packing
float a = 0.45;
float b = 0.55;
uint aScaled = a * 0xFFFF;
uint bScaled = b * 0xFFFF;
uint abPacked = (aScaled << 16) | (bScaled & 0xFFFF);
float finalFloat = asfloat(abPacked);
//Unpacking
float inputFloat = finalFloat;
uint uintInput = asuint(inputFloat);
float aUnpacked = (uintInput >> 16) / 65535.0f;
float bUnpacked = (uintInput & 0xFFFF) / 65535.0f;
Converting floating-point numbers to fixed-point integers is an error-prone idea, because floats cover much larger magnitudes. Say unpacking sRGB will give you pow(255, 2.2) values, which are larger than 0xffff, and you will need several times that amount for robust HDR. Generally, fixed-point code is fragile, obfuscated and a nightmare to debug. People invented floats for a good reason.
There are several 16-bit float formats. The IEEE 16-bit float is optimized for numbers between -1.0 and 1.0, but also supports numbers up to about 0x10000 in case you need HDR, though you will still need to normalize larger values for it. Then there is bfloat16, which behaves like a normal 32-bit float, just with less precision. IEEE 16-bit floats are widely supported by modern CPUs and GPUs, and can be converted quickly even in software. bfloat16 is only just gaining popularity, so you will have to research whether it is suitable for your needs. Finally, you can introduce your own 16-bit float format using an integer log function, which most CPUs provide as a single instruction.
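If you go the IEEE-half route, HLSL's f32tof16 and f16tof32 intrinsics handle the per-component conversion. As a CPU-side sketch of the same packing idea (not shader code; my own illustration, assuming a target where Swift's Float16 is available, e.g. iOS 14+ or Apple Silicon):
// Pack two values as IEEE half floats into one 32-bit word.
func packHalves(_ a: Float, _ b: Float) -> UInt32 {
    let ha = UInt32(Float16(a).bitPattern)   // high 16 bits
    let hb = UInt32(Float16(b).bitPattern)   // low 16 bits
    return (ha << 16) | hb
}

// Reverse the packing.
func unpackHalves(_ packed: UInt32) -> (Float, Float) {
    let a = Float(Float16(bitPattern: UInt16(truncatingIfNeeded: packed >> 16)))
    let b = Float(Float16(bitPattern: UInt16(truncatingIfNeeded: packed)))
    return (a, b)
}

let packed = packHalves(0.45, 0.55)
print(unpackHalves(packed))   // roughly (0.45, 0.55), to half precision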
We developers very often need to calculate an angle to perform a rotation. Usually we can use the atan2() function, but sometimes we need more precision. What do you do then?
I know that in theory atan2 is precise, but on my system (iOS) it's inaccurate by about 0.05 radians, which is a big difference. And it's not just my problem; I've seen similar reports.
atan2 is used to get an angle a from a vector (x, y). If you then use this angle to apply a rotation, you will use cos(a) and sin(a). You could simply compute cos and sin by normalizing (x, y) and keep those instead of the angle (see the sketch below). Precision will be higher, and you will save a lot of cycles otherwise lost in trigonometric functions.
Edit: if you really want an angle from (x, y), it can be computed using variants of CORDIC to whatever precision you need.
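A minimal Swift sketch of that idea (the vector and point values are arbitrary examples of mine): normalize once, and the components are exactly the cosine and sine you would otherwise get from cos(atan2(y, x)) and sin(atan2(y, x)).
let x = 3.0, y = 4.0
let length = (x * x + y * y).squareRoot()
let cosA = x / length              // same as cos(atan2(y, x))
let sinA = y / length              // same as sin(atan2(y, x))

// Rotate a point (px, py) by that angle without ever computing it.
let (px, py) = (1.0, 0.0)
let rotatedX = px * cosA - py * sinA
let rotatedY = px * sinA + py * cosA
print(rotatedX, rotatedY)          // 0.6 0.8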
You can use atan2l if long double has more precision than double on your system:
long double atan2l(long double y, long double x);
On iOS, I've found that the standard trigonometry operators are precise to roughly 13 or 14 decimal digits, so it sounds very odd that you're seeing errors on the order of 0.05 radians. If you can produce code and specific values that demonstrate this, please file a bug report on the behavior (and post the code here so that we can have a record of it).
That said, if you really need high precision for your trigonometry operators, I've modified a few of the routines that Dave DeLong created for his DDMathParser code. These routines use NSDecimal for performing the math, giving you up to ~34 digits of decimal precision while avoiding your standard floating point problems with representing base 10 decimals. You can download the code for these modified routines from here.
An NSDecimal version of atan() is calculated using the following code:
NSDecimal DDDecimalAtan(NSDecimal x) {
    // from: http://en.wikipedia.org/wiki/Inverse_trigonometric_functions#Infinite_series
    // The normal infinite series diverges if x > 1
    NSDecimal one = DDDecimalOne();
    NSDecimal absX = DDDecimalAbsoluteValue(x);
    NSDecimal z = x;
    if (NSDecimalCompare(&one, &absX) == NSOrderedAscending)
    {
        // y = x / (1 + sqrt(1+x^2))
        // Atan(x) = 2*Atan(y)
        // From: http://www.mathkb.com/Uwe/Forum.aspx/math/14680/faster-Taylor-s-series-of-Atan-x
        NSDecimal interiorOfRoot;
        NSDecimalMultiply(&interiorOfRoot, &x, &x, NSRoundBankers);
        NSDecimalAdd(&interiorOfRoot, &one, &interiorOfRoot, NSRoundBankers);
        NSDecimal denominator = DDDecimalSqrt(interiorOfRoot);
        NSDecimalAdd(&denominator, &one, &denominator, NSRoundBankers);
        NSDecimal y;
        NSDecimalDivide(&y, &x, &denominator, NSRoundBankers);
        NSDecimalMultiply(&interiorOfRoot, &y, &y, NSRoundBankers);
        NSDecimalAdd(&interiorOfRoot, &one, &interiorOfRoot, NSRoundBankers);
        denominator = DDDecimalSqrt(interiorOfRoot);
        NSDecimalAdd(&denominator, &one, &denominator, NSRoundBankers);
        NSDecimal y2;
        NSDecimalDivide(&y2, &y, &denominator, NSRoundBankers);
        // NSDecimal two = DDDecimalTwo();
        NSDecimal four = DDDecimalFromInteger(4);
        NSDecimal firstArctangent = DDDecimalAtan(y2);
        NSDecimalMultiply(&z, &four, &firstArctangent, NSRoundBankers);
    }
    else
    {
        BOOL shouldSubtract = YES;
        for (NSInteger n = 3; n < 150; n += 2) {
            NSDecimal numerator;
            if (NSDecimalPower(&numerator, &x, n, NSRoundBankers) == NSCalculationUnderflow)
            {
                numerator = DDDecimalZero();
                n = 150;
            }
            NSDecimal denominator = DDDecimalFromInteger(n);
            NSDecimal term;
            if (NSDecimalDivide(&term, &numerator, &denominator, NSRoundBankers) == NSCalculationUnderflow)
            {
                term = DDDecimalZero();
                n = 150;
            }
            if (shouldSubtract) {
                NSDecimalSubtract(&z, &z, &term, NSRoundBankers);
            } else {
                NSDecimalAdd(&z, &z, &term, NSRoundBankers);
            }
            shouldSubtract = !shouldSubtract;
        }
    }
    return z;
}
This uses a Taylor series approximation, with some shortcuts to speed convergence. I believe that the precision might not be the full 34 digits at results very close to Pi / 4 radians, so I might still need to fix that.
If you need extreme precision this is an option, but again what you're reporting shouldn't be happening with double values, so there's something odd here.
Use angles very often? No, you don't. Out of 10 times that I have seen a developer use angles, 7 times they should have used linear algebra instead and avoided trigonometric functions altogether.
A rotation is better done with a matrix, not with an angle. See also this question:
CGAffineTranformRotate atan2 inaccuration
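For instance, on iOS the rotation matrix can be built directly from a normalized direction vector, skipping atan2 entirely. A small Swift sketch of that matrix-first approach (my own illustration, with arbitrary values):
import CoreGraphics

let dx: CGFloat = 3, dy: CGFloat = 4
let length = (dx * dx + dy * dy).squareRoot()
let (cosA, sinA) = (dx / length, dy / length)

// Same matrix that CGAffineTransform(rotationAngle: atan2(dy, dx)) would build,
// but without computing the angle: a = cos, b = sin, c = -sin, d = cos.
let rotation = CGAffineTransform(a: cosA, b: sinA, c: -sinA, d: cosA, tx: 0, ty: 0)
let rotated = CGPoint(x: 1, y: 0).applying(rotation)
print(rotated)   // ≈ (0.6, 0.8)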