Trying to understand how casting/conversion is done by the compiler, e.g., when casting from float to int (a downcast)

When a float is cast to an int, how is the cast implemented by the compiler?
Does the compiler mask off some part of the float variable's memory, i.e., which part of the memory is dropped by the compiler so that the rest can be passed to the int variable?
I guess the answer to this lies in how an int and a float are laid out in memory.
But isn't that machine dependent rather than compiler dependent? How does the compiler decide which part of the memory to copy when casting to a narrower type (this is a static cast, right)?
I think I've been confused by some wrong information.
(I read some questions under the downcasting tag where there was a debate over whether this is a cast or a conversion. I'm not much interested in what it's called, since both are performed by the compiler, but in how it is performed.)
...
Thanks

When talking about basic types and not pointers, a conversion is done. Because floating-point and integer representations are very different (usually IEEE-754 and two's complement, respectively), it's more than just masking out some bits.
If you want to see the floating-point number's bits reinterpreted as an int, without any conversion, you can do something like this (in C):
float f = 10.5f;
int i2 = *(int *)&f;       /* reinterpret the bits; strictly this breaks aliasing rules, memcpy is the portable way */
printf("%f %d\n", f, i2);

Most CPU architectures provide a native instruction (or multi-instruction sequence) to do float<->int conversions, and the compiler will generally just generate that instruction. There are often faster methods; this question has some good information: What is the fastest way to convert float to int on x86.
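To make the contrast concrete, here is a minimal sketch in Swift (the language of the related questions below), which exposes both operations directly; the variable names are just for illustration:

let f: Float = 10.5
let converted = Int(f)      // value conversion: truncates toward zero, gives 10
let rawBits = f.bitPattern  // the raw IEEE-754 bits as a UInt32: 1093140480 (0x41280000)
print(converted, rawBits)

The conversion inspects the value; bitPattern just hands back the stored bytes unchanged.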

Related

What are scenarios where you should use Float in Swift?

I'm learning about the difference between Float and Double in Swift, and I can't think of any reason to use Float. I know there are reasons; I'm just not experienced enough to understand them.
So my question is why would you use float in Swift?
why would you use float in Swift
Left to your own devices, you likely never would. But there are situations where you have to. For example, the value of a UISlider is a Float. So when you retrieve that number, you are working with a Float. It’s not up to you.
And so with all the other numerical types. Swift includes a numerical type corresponding to every numerical type that you might possibly encounter as you interface with Cocoa and the outside world.
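As a small sketch of that situation (sliderValue here is a hypothetical stand-in for UISlider.value, which really is declared as Float): once an API hands you a Float, Swift makes you widen explicitly before mixing it with Double values.

// Hypothetical stand-in for UISlider.value, which is a Float
let sliderValue: Float = 0.75

// Swift never converts between Float and Double implicitly
let percent: Double = Double(sliderValue) * 100   // explicit widening
print(percent)                                    // 75.0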
Float is a typealias for Float32. Float32 and Float16 are incredibly useful for GPU programming with Metal. Both will someday feel as archaic on the GPU as they do on the CPU, but that day is years off.
https://developer.apple.com/metal/
Double
Represents a 64-bit floating-point number.
Has a precision of at least 15 decimal digits.
Float
Represents a 32-bit floating-point number.
Has a precision of as little as 6 decimal digits.
The appropriate floating-point type to use depends on the nature and range of values you need to work with in your code. In situations where either type would be appropriate, Double is preferred.
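A quick sketch of that precision difference (the output shown is Swift's shortest round-trip formatting and may look slightly different on other toolchains):

let f: Float  = 1 / 3
let d: Double = 1 / 3
print(f)   // 0.33333334          (about 6-7 significant digits)
print(d)   // 0.3333333333333333  (about 15-16 significant digits)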

Float and Double network byte order

The Swift standard library includes the bigEndian property on integer types (such as Int, UInt, UInt8, UInt64, Int64, etc.) to convert them from host byte order (which might in principle be anything, but realistically will be big or little endian) to network byte order (which is big endian). There are some good SO answers referring to this, and a particularly complete one is here.
However, I've not found a good resource that covers arranging a Float (32-bit) or Double (64-bit) in network byte order. Given that these types don't have a bigEndian property, I'm wondering if there is some subtlety involved? (The linked question does discuss floating-point types, but I'm not sure it definitely covers all the details that might be relevant.)
Specifically, I want to handle the 64 bit Double floating point type. I'd like a solution that will work on any platform where Swift is available.
Thank you.
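One approach that should work anywhere Swift runs: Swift's Double is an IEEE-754 binary64 value, so you can route it through its bit pattern, which is a UInt64 and therefore does have bigEndian. A minimal sketch, with hypothetical helper names:

// Serialize a Double as 8 bytes in network (big-endian) order.
func doubleToNetworkBytes(_ value: Double) -> [UInt8] {
    let bits = value.bitPattern.bigEndian          // UInt64; byte-swapped on little-endian hosts
    return withUnsafeBytes(of: bits) { Array($0) } // copy out the 8 bytes
}

// Reassemble a Double from 8 big-endian bytes.
func doubleFromNetworkBytes(_ bytes: [UInt8]) -> Double {
    let bits = bytes.reduce(UInt64(0)) { ($0 << 8) | UInt64($1) }
    return Double(bitPattern: bits)
}

print(doubleFromNetworkBytes(doubleToNetworkBytes(42.5)))   // 42.5

Float works the same way through its UInt32 bitPattern.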

Does converting UInt8 (or similar types) to Int defeat the purpose of UInt8?

I'm storing many of the integers in my program as UInt8, with a 0-255 range of values. Later on I will be summing many of them to get a result that needs to be stored in an Int. Does the conversion I have to do before adding the UInt8 values into an Int defeat the purpose of using a smaller data type to begin with? I feel it would be faster to just use Int, at the cost of a larger memory footprint. But why go for UInt8 when I then face many conversions, losing speed and increasing memory use as well? Is there something I'm missing, or should smaller data types really only be used with other small data types?
You are talking about saving a few bytes per variable by storing a UInt8 instead of an Int. These data types were conceived very early in the history of computing, when memory was measured in low numbers of KBs. Even the Apple Watch has 512 MB.
Here's what Apple says in the Swift Book:
Unless you need to work with a specific size of integer, always use Int for integer values in your code. This aids code consistency and interoperability. Even on 32-bit platforms, Int can store any value between -2,147,483,648 and 2,147,483,647, and is large enough for many integer ranges.
I use UInt8, UInt16 and UInt32 mainly in code that deals with C. And yes, converting back and forth is a pain in the neck.
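For the summing itself, the usual pattern is to keep the compact storage and widen once per element at the point of aggregation; a minimal sketch with illustrative names:

let samples: [UInt8] = [250, 250, 250]            // 1 byte each in storage
let total = samples.reduce(0) { $0 + Int($1) }    // widen to Int so the sum can exceed 255
print(total)                                      // 750, which wouldn't fit in a UInt8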

double_t in C99

I just read that C99 has double_t, which should be at least as wide as double. Does this imply that it gives more digits of precision after the decimal point than the usual 15 digits for double?
Secondly, how do I use it? Is only including
#include <float.h>
enough? I read that one has to set FLT_EVAL_METHOD to 2 for long double. How do I do this? As I work with numerical methods, I would like maximum precision without using an arbitrary-precision library.
Thanks a lot...
No. double_t is at least as wide as double; i.e., it might be the same as double. Footnote 190 in the C99 standard makes the intent clear:
The types float_t and double_t are intended to be the implementation's most efficient types at least as wide as float and double, respectively.
As Michael Burr noted, you can't set FLT_EVAL_METHOD.
If you want the widest floating-point type on any system available using only C99, use long double. Just be aware that on some platforms it will be the same as double (and could even be the same as float).
Also, if you "work with numerical methods", you should be aware that for many (most even) numerical methods, the approximation error of the method is vastly larger than the rounding error of double precision, so there's often no benefit to using wider types. Exceptions exist, of course. What type of numerical methods are you working on, specifically?
Edit: seriously, either (a) just use long double and call it a day or (b) take a few weeks to learn about how floating-point is actually implemented on the platforms that you're targeting, and what the actual accuracy requirements are for the algorithms that you're implementing.
Note that you don't get to set FLT_EVAL_METHOD; it is set by the compiler's headers to let you determine how the implementation evaluates floating-point operations.
If your code is very sensitive to exactly how floating point operations are performed, you can use the value of that macro to conditionally compile code to handle those differences that might be important to you.
So for example, in general you know that double_t will be at least a double in all cases. If you want your code to do something different if double_t is a long double then your code can test if FLT_EVAL_METHOD == 2 and act accordingly.
Note that if FLT_EVAL_METHOD is something other than 0, 1, or 2 you'll need to look at the compiler's documentation to know exactly what type double_t is.
double_t may simply be defined as typedef double double_t; of course, if you plan to rely on implementation specifics, you need to look at your own implementation.

double precision in Ada?

I'm very new to Ada and was trying to see if it offers a double-precision type. I see that we have Float, and
Put( Integer'Image( Float'digits ) );
on my machine gives a value of 6, which is not enough for numerical computations.
Does Ada have double and long double types as in C?
Thanks a lot...
It is a wee bit more complicated than that.
The only predefined floating-point type that compilers have to support is Float. Compilers may optionally support Short_Float and Long_Float. You should be able to look in Appendix F of your compiler documentation to see what it supports.
In practice, your compiler almost certainly defines Float as a 32-bit IEEE float and Long_Float as a 64-bit one. Note that C pretty much works this way too with its float and double; C doesn't actually define the sizes of those either.
If you absolutely must have a certain precision (e.g., you are sharing the data with something external that must use IEEE 64-bit), then you should probably define your own float type with exactly that precision. That would ensure your code is either portable to any platform or compiler you move it to, or that it produces a compile error so you can fix the issue.
You can create a floating-point type of any precision you like. For a long one, it would be:
type My_Long_Float is digits 11;
Wikibooks is a good reference for things like this.