Why is Swift giving me inaccurate floating point arithmetic results? [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 8 years ago.
Swift floating point arithmetic seems broken compared to C (and therefore Objective-C).
Let's take a simple example.
In C:
double output = 90/100.0; // Gives 0.9
float output = 90/100.0f; // Gives 0.9
In Swift:
var output = Double(90)/Double(100.0) // Gives 0.90000000000000002
var output = Float(90)/Float(100.0) // Gives 0.899999976
What's going on? Is this a bug or am I missing something?
EDIT:
#include <iostream>
int main() {
    double inter = 90/100.0;
    std::cout << inter << std::endl; // Outputs 0.9
    return 0;
}

The issue is simply the different number of digits being printed out.
#include <iostream>
#include <iomanip>
int main() {
    double d = 90.0 / 100.0;
    float f = 90.0f / 100.0f;
    std::cout << d << ' ' << f << '\n';
    std::cout << std::setprecision(20) << d << ' ' << f << '\n';
}
0.9 0.9
0.9000000000000000222 0.89999997615814208984
(I wrote this example in C++, but you will get the same results in every language that uses the hardware's floating point arithmetic and allows this formatting.)
If you want to understand why finite-precision floating point math does not give you exact results, then read:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
and the documentation for Swift's Float type.

Related

Visualize PCD containing custom double point structure

I have created a custom double point type for storing point positions in a PCD file. I required the double data type since my points are in global coordinates with very large values (on the order of 10^6 to 10^7) that need good precision. Since the values are large and the default FLOAT32 precision is limited, there is considerable approximation of the data, which is also visible during visualization.
I created this PCD by transforming the raw point cloud with the initial global reference coordinate from GPS in the data bag that I have. I am using 15-point precision.
I created a separate script for visualizing this custom point-type PCD. But comparing visually, I cannot see any considerable difference between the FLOAT32 and double data-type PCDs.
Raw_float_pcd_visualization
Transformed_float_pcd_visualization
Transformed_double_pcd_visualization
You can see that the transformed_double and transformed_float PCDs are quite similar and approximated, while the raw_float PCD looks quite good compared to these two.
I am attaching the PCD files for reference:
raw_float
transformed_float
transformed_double
I think that I am skipping something while loading the point cloud, and more changes are needed in order to visualize the points with double precision.
I used "pcl_viewer" from pcl_tools for visualizing FLOAT-type PCDs.
Code for visualizing the custom DOUBLE point-structure PCD:
#define PCL_NO_PRECOMPILE
#include <iostream>
// #include "double_viz/pcl_double.h"
#include <pcl-1.7/pcl/common/common.h>
#include <pcl-1.7/pcl/io/pcd_io.h>
#include <pcl-1.7/pcl/visualization/pcl_visualizer.h>
#include <pcl-1.7/pcl/console/parse.h>
#include <pcl-1.7/pcl/point_cloud.h>
#include <pcl-1.7/pcl/point_types.h>

namespace pcl
{
  #define PCL_ADD_UNION_POINT4D_DOUBLE \
    union EIGEN_ALIGN16 {              \
      double data[4];                  \
      struct {                         \
        double x;                      \
        double y;                      \
        double z;                      \
      };                               \
    };

  struct _PointXYZDouble
  {
    PCL_ADD_UNION_POINT4D_DOUBLE; // Adds the members x, y, z, which can also be accessed through data (which is double[4])
    EIGEN_MAKE_ALIGNED_OPERATOR_NEW
  };

  struct EIGEN_ALIGN16 PointXYZDouble : public _PointXYZDouble
  {
    inline PointXYZDouble (const _PointXYZDouble &p)
    {
      x = p.x; y = p.y; z = p.z; data[3] = 1.0;
    }

    inline PointXYZDouble ()
    {
      x = y = z = 0.0;
      data[3] = 1.0;
    }

    inline PointXYZDouble (double _x, double _y, double _z)
    {
      x = _x; y = _y; z = _z;
      data[3] = 1.0;
    }

    EIGEN_MAKE_ALIGNED_OPERATOR_NEW
  };
}

POINT_CLOUD_REGISTER_POINT_STRUCT (pcl::_PointXYZDouble,
                                   (double, x, x)
                                   (double, y, y)
                                   (double, z, z)
)
POINT_CLOUD_REGISTER_POINT_WRAPPER(pcl::PointXYZDouble, pcl::_PointXYZDouble)
// This function displays the help
void
showHelp (char *program_name)
{
  std::cout << std::endl;
  std::cout << "Usage: " << program_name << " cloud_filename.[pcd]" << std::endl;
  std::cout << "-h: Show this help." << std::endl;
}

// This is the main function
int
main (int argc, char** argv)
{
  // Show help
  if (pcl::console::find_switch (argc, argv, "-h") || pcl::console::find_switch (argc, argv, "--help"))
  {
    showHelp (argv[0]);
    return 0;
  }

  // Fetch point cloud filename in arguments | Works with PCD
  std::vector<int> filenames;
  if (filenames.size () != 1)
  {
    filenames = pcl::console::parse_file_extension_argument (argc, argv, ".pcd");
    if (filenames.size () != 1)
    {
      showHelp (argv[0]);
      return -1;
    }
  }

  // Load file | Works with PCD and PLY files
  pcl::PointCloud<pcl::PointXYZDouble>::Ptr source_cloud (new pcl::PointCloud<pcl::PointXYZDouble> ());
  if (pcl::io::loadPCDFile (argv[filenames[0]], *source_cloud) < 0)
  {
    std::cout << "Error loading point cloud " << argv[filenames[0]] << std::endl << std::endl;
    showHelp (argv[0]);
    return -1;
  }

  // Visualization
  // printf ("\nPoint cloud colors : white = original point cloud\n"
  //         "                     red   = transformed point cloud\n");
  pcl::visualization::PCLVisualizer viewer ("Visualize double PCL");

  // Define R,G,B colors for the point cloud
  pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZDouble> source_cloud_color_handler (source_cloud, 100, 100, 100);
  // We add the point cloud to the viewer and pass the color handler
  viewer.addPointCloud (source_cloud, source_cloud_color_handler, "original_cloud");

  viewer.addCoordinateSystem (1.0, "cloud", 0);
  viewer.setBackgroundColor (0.05, 0.05, 0.05, 0); // Setting background to a dark grey
  viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_OPACITY, 1, "original_cloud");
  viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_POINT_SIZE, 1, "original_cloud");
  viewer.setPointCloudRenderingProperties (pcl::visualization::PCL_VISUALIZER_LINE_WIDTH, 1, "original_cloud");
  //viewer.setPosition (800, 400); // Setting visualiser window position

  while (!viewer.wasStopped ()) // Display the visualiser until the 'q' key is pressed
  {
    viewer.spinOnce ();
  }

  return 0;
}
In the raw_float file, the size field has been defined as 4 bytes each: SIZE 4 4 4 4. To be read as double it should be SIZE 8 8 8 8. With your current implementation, each field is being read as a Float32.
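For reference, a header that declares true 8-byte double fields looks roughly like this (the field list and the point counts here are hypothetical; they depend on what the file actually stores):
# Hypothetical .PCD v0.7 header with 8-byte double fields
VERSION 0.7
FIELDS x y z _
SIZE 8 8 8 8
TYPE F F F F
COUNT 1 1 1 1
WIDTH 213
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 213
DATA binary
In PCD headers, TYPE F with SIZE 8 is how a double-precision field is declared; TYPE F with SIZE 4 is a Float32.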

How to emulate *really simple* variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved with 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These only need yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>

typedef union
{
    int32_t i;
    float f;
} uSpec;

void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
    uSpec u;
    for (int i = 0; i < num_loops; i++)
    {
        // Read the low 32 bits of the shift count (little-endian assumed).
        int32_t x = *(int32_t *)&in_array[i];
        // Build the IEEE-754 single 2^x: exponent field = x + 127, mantissa = 0.
        u.i = (127 + x) << 23;
        // Converting the float back to an integer yields 1 << x exactly for x = 0..7.
        int32_t r = (int32_t) u.f;
        out_array[i] = r;
    }
}
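The same exponent trick maps directly onto SSE2 intrinsics. Below is a minimal sketch (the function name and layout are mine, not from the original post) that handles four 32-bit shift counts per call; it is exact for b = 0..7, which is all that case 1) requires:
#include <emmintrin.h> // SSE2

// Compute 1 << b in each 32-bit lane, valid for small b (0..7 here).
static inline __m128i pow2_epi32(__m128i b)
{
    // Place b + 127 in the float exponent field (bits 23..30), mantissa = 0,
    // producing the single-precision value 2^b in each lane.
    __m128i exp = _mm_slli_epi32(_mm_add_epi32(b, _mm_set1_epi32(127)), 23);
    // Reinterpret as float and truncate back to integer: exact for small b.
    return _mm_cvttps_epi32(_mm_castsi128_ps(exp));
}
Case 2) produces values up to 2^56, which no longer fit in 32 bits, so the same idea would have to build the exponent field of a double (bit 52, bias 1023) and work on two 64-bit lanes instead.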

EditorGuiLayout.MaskField issue with large enums

I'm working on an input system that would allow the user to translate input mappings between different input devices and operating systems and potentially define their own.
I'm trying to create a MaskField for an editor window where the user can select from a list of RuntimePlatforms, but selecting individual values results in multiple values being selected.
Mainly for debugging I set it up to generate an equivalent enum RuntimePlatformFlags that it uses instead of RuntimePlatform:
[System.Flags]
public enum RuntimePlatformFlags : long
{
    OSXEditor = (0 << 0),
    OSXPlayer = (0 << 1),
    WindowsPlayer = (0 << 2),
    OSXWebPlayer = (0 << 3),
    OSXDashboardPlayer = (0 << 4),
    WindowsWebPlayer = (0 << 5),
    WindowsEditor = (0 << 6),
    IPhonePlayer = (0 << 7),
    PS3 = (0 << 8),
    XBOX360 = (0 << 9),
    Android = (0 << 10),
    NaCl = (0 << 11),
    LinuxPlayer = (0 << 12),
    FlashPlayer = (0 << 13),
    LinuxEditor = (0 << 14),
    WebGLPlayer = (0 << 15),
    WSAPlayerX86 = (0 << 16),
    MetroPlayerX86 = (0 << 17),
    MetroPlayerX64 = (0 << 18),
    WSAPlayerX64 = (0 << 19),
    MetroPlayerARM = (0 << 20),
    WSAPlayerARM = (0 << 21),
    WP8Player = (0 << 22),
    BB10Player = (0 << 23),
    BlackBerryPlayer = (0 << 24),
    TizenPlayer = (0 << 25),
    PSP2 = (0 << 26),
    PS4 = (0 << 27),
    PSM = (0 << 28),
    XboxOne = (0 << 29),
    SamsungTVPlayer = (0 << 30),
    WiiU = (0 << 31),
    tvOS = (0 << 32),
    Switch = (0 << 33),
    Lumin = (0 << 34),
    BJM = (0 << 35),
}
In this linked screenshot, only the first 4 options were selected. The integer next to "Platforms: " is the mask itself.
I'm not a bitwise wizard by a large margin, but my assumption is that this occurs because EditorGUILayout.MaskField returns a 32-bit int value, and there are over 32 enum options. Are there any workarounds for this, or is something else causing the issue?
First thing I've noticed: all values inside that enum are the same, because you are shifting the value 0 to the left, and 0 shifted by any amount is still 0. You can observe this by logging your values with this script:
// Shifts the value 0 to the left: prints "0" 36 times.
for (int i = 0; i < 36; i++) {
    Debug.Log(System.Convert.ToString((0 << i), 2));
}
// Shifts the value 1L to the left: prints the powers of two up to 2^35.
// (The L suffix matters: a plain int 1 would wrap around at i = 32.)
for (int i = 0; i < 36; i++) {
    Debug.Log(System.Convert.ToString((1L << i), 2));
}
The reason making the enum inherit from long does not work on its own is bit shifting. Check out this example I found about the issue:
UInt32 x = ....;
UInt32 y = ....;
UInt64 result = (x << 32) + y;
The programmer intended to form a 64-bit value from two 32-bit ones by shifting 'x' by 32 bits and adding the most significant and the least significant parts. However, as 'x' is a 32-bit value at the moment when the shift operation is performed, shifting by 32 bits will be equivalent to shifting by 0 bits, which will lead to an incorrect result.
So you should also make the shifted value itself a long, either with the L literal suffix or with a cast. Like this:
public enum RuntimePlatformFlags : long {
    OSXEditor = (1 << 0),
    OSXPlayer = (1 << 1),
    WindowsPlayer = (1 << 2),
    OSXWebPlayer = (1 << 3),
    // With literals.
    tvOS = (1L << 32),
    Switch = (1L << 33),
    // Or with casts.
    Lumin = ((long)1 << 34),
    BJM = ((long)1 << 35),
}

Approximation of 1-exp(-mu*t) when mu*t is very small

I am working on some fairly simple linear attenuation and absorption calculations and from high school math I seem to remember that there is an approximation of:
1-exp(-mu*t)
When
mu*t << 1
Does this approximation exist? I thought it was a Taylor series expansion but could not convince myself after looking through old math textbooks.
Any help or direction is greatly appreciated.
It is mu*t, up to an error term of order (mu*t)^2: 1 - exp(-mu*t) = mu*t + O((mu*t)^2).
To see why, try rewriting this as f(u) = 1-exp(-u), and taking a Taylor series expansion at the point u=0.
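Concretely, expanding around u = 0:
exp(-u) = 1 - u + u^2/2 - u^3/6 + ...
so
1 - exp(-u) = u - u^2/2 + u^3/6 - ... = u + O(u^2)
For mu*t << 1 the quadratic term is negligible, which recovers the mu*t approximation.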
If you are using C++11, for example, it has this function as part of the standard library: expm1.
In your case, you would call it as -expm1(-mu*t).
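To see why this matters numerically, here is a small sketch (the value of u is made up for illustration) comparing the naive expression with expm1 for a tiny argument:
#include <cmath>
#include <cstdio>

int main()
{
    double u = 1e-12; // hypothetical small exponent, e.g. mu*t
    // Naive form: 1.0 and exp(-u) agree in their first ~12 digits,
    // so the subtraction cancels most of the significant bits.
    std::printf("naive 1-exp(-u): %.17g\n", 1.0 - std::exp(-u));
    // expm1 computes exp(x)-1 directly and keeps full precision.
    std::printf("-expm1(-u):      %.17g\n", -std::expm1(-u));
    return 0;
}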
Otherwise, you can derive the Maclaurin series for expm1 easily from the Maclaurin series for exp(x) by simply dropping the first 1. One implementation is given below in expm1_maclaurin.
Comparing this with the built-in expm1:
#include <cmath>
#include <iostream>
#include <limits>
using namespace std;
double expm1_maclaurin( double x )
{
    // Horner evaluation of x*(1 + x/2*(1 + x/3*(...))) up to the given order.
    const int order = 10;
    double retval = 1.0;
    for( int i = order ; 1 < i ; --i ) retval = 1.0 + x*retval/i;
    return x*retval;
}
int main()
{
    cout.precision(numeric_limits<double>::digits10);
    for( int i = 0 ; i <= 32 ; ++i )
    {
        // x runs through 1, 1/2, 1/4, ..., 2^-31, then 0.
        double x = i < 32 ? 1.0 / (1u << i) : 0;
        cout << "x=" << x << ' '
             << expm1(x) << ' '
             << expm1_maclaurin(x) << ' '
             << ( expm1(x) == expm1_maclaurin(x) ) << endl;
    }
    return 0;
}
Output:
x=1 1.71828182845905 1.71828180114638 0
x=0.5 0.648721270700128 0.648721270687366 0
x=0.25 0.284025416687742 0.284025416687735 0
x=0.125 0.133148453066826 0.133148453066826 1
x=0.0625 0.0644944589178594 0.0644944589178594 1
x=0.03125 0.0317434074991027 0.0317434074991027 1
...
For all positive x <= 1/8, the result agrees with the built-in expm1 to full double precision.

What does this line of code do? Const uint32_t goodguys = 0x1 << 0

Can someone tell me what is being done here:
Const uint32_t goodguys = 0x1 << 0
I'm assuming it is C++ and it is assigning a tag to a group, but I have never seen this done. I am a self-taught Objective-C guy and this just looks very foreign to me.
Well, if there are more lines that look like this that follow the one that you posted, then they could be bitmasks.
For example, if you have the following:
const uint32_t bit_0 = 0x1 << 0;
const uint32_t bit_1 = 0x1 << 1;
const uint32_t bit_2 = 0x1 << 2;
...
then you could use the bitwise & operator with bit_0, bit_1, bit_2, ... and another number in order to see which bits in that other number are turned on.
const uint32_t num = 5;
...
bool bit_0_on = (num & bit_0) != 0;
bool bit_1_on = (num & bit_1) != 0;
bool bit_2_on = (num & bit_2) != 0;
...
So your 0x1 is simply a way to designate that goodguys is a bitmask, because the hexadecimal 0x designator shows that the author of the code is thinking specifically about bits, instead of decimal digits. And then the << 0 is used to change exactly what the bitmask is masking (you just change the 0 to a 1, 2, etc.).
Although base 10 is the normal way to write numbers in a program, sometimes you want to express a number in octal or hex. To write numbers in octal, precede the value with a 0: thus 023 really means 19 in base 10. To write numbers in hex, precede the value with 0x or 0X: thus 0x23 really means 35 in base 10.
So
goodguys = 0x1;
really means the same as
goodguys = 1;
The bitwise shift operators shift their first operand left (<<) or right (>>) by the number of positions the second operand specifies. Look at the following two statements
goodguys = 0x1;
goodguys = goodguys << 2;
The first statement is the same as goodguys = 1;
The second statement shifts the bits to the left by 2 positions, so we end up with goodguys holding binary 100, which is the same as goodguys = 4;
Now you can express the two statements
goodguys = 0x1;
goodguys = goodguys << 2;
as a single statement
goodguys = 0x1 << 2;
which is similar to what you have. But if you are unfamiliar with hex notation and bitwise shift operators it will look intimidating.
When const is used with a variable, it uses the following syntax:
const type variable-name = value;
In this case, the const modifier allows you to assign an initial value to a variable that cannot later be changed by the program. For instance
const int POWER_UPS = 4;
will assign 4 to variable POWER_UPS. But if you later try to overwrite this value like
POWER_UPS = 8;
you will get a compilation error.
Finally the uint32_t means 32-bit unsigned int type. You will use it when you want to make sure that your variable is 32 bits long and nothing else.
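Putting it all together, here is a minimal sketch in the spirit of the line from the question (the badguys constant and the affiliation variable are made up for illustration):
#include <cstdint>
#include <iostream>

const uint32_t goodguys = 0x1 << 0; // bit 0: ...0001
const uint32_t badguys  = 0x1 << 1; // bit 1: ...0010 (hypothetical companion flag)

int main()
{
    uint32_t affiliation = 0;
    affiliation |= goodguys;                // turn the goodguys bit on
    if ((affiliation & goodguys) != 0)
        std::cout << "goodguys bit is set\n";
    if ((affiliation & badguys) == 0)
        std::cout << "badguys bit is clear\n";
    return 0;
}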