Floating point bug when writing to file in Matlab? - matlab

Im not sure if Im missing something simple but the following code fails (a and b are meant to be the same):
a=single(2147483584)
f=fopen('test','wb');
fwrite(f,a,'int32')
fclose(f);
f=fopen('test','rb');
b=fread(f,inf,'int32');
fclose(f)
a
b
with output:
a =
2.1475e+009
b =
-2.1475e+009
and the following code succeeds:
a=single(2147483583)
f=fopen('test','wb');
fwrite(f,a,'int32')
fclose(f);
f=fopen('test','rb');
b=fread(f,inf,'int32');
fclose(f)
a
b
with output:
a =
2.1475e+009
b =
2.1475e+009
Does anyone know why?

I don't know Matlab well, but it seems fairly clear what's happening here. You're converting a to a float and then storing the result of that conversion as a 32-bit signed integer. But the nearest single-precision IEEE 754 float to the integer 2147483584 is 2147483648.0, or 2**31. A 32-bit integer can only represent values in the range [-2**31, 2**31-1], so it looks as though when you write this value as an integer, it gets wrapped modulo 2**32 to give -2**31 instead of 2**31.
In contrast, the nearest single-precision float to 2147483583 is 2147483520.0, which does fit in a 32-bit integer.

Related

Long Integer signed Fixed Point to Real convertion SystemVerilog

I need to convert a Long number as Fixed point into a Double rappresentation.
The fixed point math is used into the synthesis process and the Real data type only for validation and simulation.
If I make multiple convertion in chain with multiple datatypes to adjust the format then it is not enough or completely wrong .
In my case with a fixed point mantissa of 44 bit I have 3bit integer+sign bit. Q notation like "sfix_44_48"
As example I am doing this to convert a fixed point integer number into a Real value(getting the number 0.5f ):
logic signed [47:0] fp_number = 48'h0800_0000_0000; // it should be 0.5f
real r_val;
real rr_val;
real rrr_val;
real tmp;
initial
begin
r_val = $itor(fp_number)/(2**44); // doesn't solve the problem.
rr_val = real'{fp_number}/(2**44); // doesn't solve the problem.
$cast(tmp,fp_number>>>44); // doesn't solve the problem
rrr_val = tmp;
end
$itor(...) is limited to 32bit integer part.
As result of above I get zero or NaN, on Modelsim simulation.
No luck during all these convertions.
the SV LRM doesn't seem to have a clear way to do this convertion.
What is the SV workaround to allow simulations to analize data greater than 32bit size easily? please.
C.
You want to use
rr_val = real'(fp_number)/(2.0**44);
Do not use any of the $TtoT functions from Verilog. They have fixed datatype inputs and outputs.
2**44 gets computed as a 32-bit 2-complement value and overflows, giving you 0. You can use 2.0 or real'(2) instead.
thanks to #dave_59 I post this piece of code which show the mess with the convertion.
logic signed [47:0] fp_number = 48'h0800_0000_0000; // it should be 0.5f
logic signed [31:0] fp_number2 = 32'h0800_0000; // it should be 0.5f
real r_val;
real rr_val;
real rrr_val;
real rrrr_val;
initial
begin
$display("48bit fp convertion sfix_44_48");
r_val = real'{fp_number}/(2**44); // doesn't solve the problem (curly braces valid sintax but wrong convertion + wrong convertion on the denominator).
rr_val = real'{fp_number}/(2.0**44); // doesn't solve the problem (curly braces valid sintax but wrong convertion + denominator convertion OK).
rrr_val = real'(fp_number)/(2**44); // doesn't solve the problem (numerator OK + the power operation is not properly converted to a Real number as result).
rrrr_val = real'(fp_number)/(2.0**44); // solve the problem with long integer fixed points convertion (the braces are not curly anymore).
$display("r_val[%08f]",r_val,", rr_val[%08f]",rr_val,", rrr_val[%08f]",rrr_val,", rrrr_val[%08f]",rrrr_val); // it should be 0.5 on the fourth data
$display("32bit fp convertion sfix_28_32");
r_val = real'{fp_number2}/(2**28); // result totally different than previous 48bit operation, doesn't solve the problem (curly braces valid sintax but wrong convertion + wrong convertion on the denominator).
rr_val = real'{fp_number2}/(2.0**28); // doesn't solve the problem (curly braces valid sintax but wrong convertion + denominator convertion OK).
rrr_val = real'(fp_number2)/(2**28); // with a 32bit range it apparently solve the problem (numerator OK + the power operation is OK with this range).
rrrr_val = real'(fp_number2)/(2.0**28); // solve the problem with long integer fixed points convertion (the braces are not curly anymore).
$display("r_val[%08f]",r_val,", rr_val[%08f]",rr_val,", rrr_val[%08f]",rrr_val,", rrrr_val[%08f]",rrrr_val); // it should be 0.5 on the fourth data
end
the only valid convertion at 48bit is the fourth case.
For 32bit the third case is valid and also the fourth case.
First: the classic pow(...) operation must be done with this syntax (2.0**BIT) which will create a Real division and not a integer division using (2**BIT) when a scaling fixed point will be applied.
In this case the operation above is managed as float/double(C style) or real(SystemVerilog)
* Second: the real'() cast operation MUST be used with NO curly braces.
I didn't have a Linting tool to proper check the syntax so I would expect a Warning due to the validity of the operation with the curly braces.
Third: the subdle results are ok if the INTEGER denominator is limited at 32bit operations.
As results shown below:
SIM START.
# 48bit fp convertion sfix_44_48
# ** Error (suppressible): (vsim-8604) ./blocks/sim/test_tb.sv(141): NaN (not a number) resulted from a division operation.
# ** Error (suppressible): (vsim-8630) Infinity results from division operation.
# Time: 0 ps Iteration: 0 Process: /test_tb/#INITIAL#138 File: ./blocks/sim/test_tb.sv Line: 143
# r_val[-1.#IND00], rr_val[0.000000], rrr_val[1.#INF00], rrrr_val[0.500000]
# 32bit fp convertion sfix_28_32
SIM END.
# r_val[0.000000], rr_val[0.000000], rrr_val[0.500000], rrrr_val[0.500000]
The solution is to avoid any not protected casting, like a casting with the braces boundary:
r_val = real'{ byteU,byteH,byteL} / (2.0**44) ; // WRONG
rr_val = real'({byteU,byteH,byteL}) / (2.0**44) ; // CORRECT
If the scaling factor occurs then the operation, generally /, must be done with the same type of operands (real/real).
Unsafe way is (real/long) which leads into a nightmare.

What is the point of writing integer in hexadecimal, octal and binary?

I am well aware that one is able to assign a value to an array or constant in Swift and have those value represented in different formats.
For Integer: One can declare in the formats of decimal, binary, octal or hexadecimal.
For Float or Double: One can declare in the formats of either decimal or hexadecimal and able to make use of the exponent too.
For instance:
var decInt = 17
var binInt = 0b10001
var octInt = 0o21
var hexInt = 0x11
All of the above variables gives the same result which is 17.
But what's the catch? Why bother using those other than decimal?
There are some notations that can be way easier to understand for people even if the result in the end is the same. You can for example think in cases like colour notation (hexadecimal) or file permission notation (octal).
Code is best written in the most meaningful way.
Using the number format that best matches the domain of your program, is just one example. You don't want to obscure domain specific details and want to minimize the mental effort for the reader of your code.
Two other examples:
Do not simplify calculations. For example: To convert a scaled integer value in 1/10000 arc minutes to a floating point in degrees, do not write the conversion factor as 600000.0, but instead write 10000.0 * 60.0.
Chose a code structure that matches the nature of your data. For example: If you have a function with two return values, determine if it's a symmetrical or asymmetrical situation. For a symmetrical situation always write a full if (condition) { return A; } else { return B; }. It's a common mistake to write if (condition) { return A; } return B; (simply because 'it works').
Meaning matters!

convert number string into float with specific precision (without getting rounding errors)

I have a vector of cells (say, size of 50x1, called tokens) , each of which is a struct with properties x,f1,f2 which are strings representing numbers. for example, tokens{15} gives:
x: "-1.4343429"
f1: "15.7947111"
f2: "-5.8196158"
and I am trying to put those numbers into 3 vectors (each is also 50x1) whose type is float. So I create 3 vectors:
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
and that works fine (why wouldn't it?). But then when I try to populate those vectors: (L is a for loop index)
x(L)=tokens{L}.x;
.. also for the other 2
I get :
The following error occurred converting from string to single:
Conversion to single from string is not possible.
Which I can understand; implicit conversion doesn't work for single. It does work if x, f1 and f2 are of type 50x1 double.
The reason I am doing it with floats is because the data I get is from a C program which writes the some floats into a file to be read by matlab. If I try to convert the values into doubles in the C program I get rounding errors...
So, (after what I hope is a good question,) how might I be able to get the numbers in those strings, at the right precision? (all the strings have the same number of decimal places: 7).
The MCVE:
filedata = fopen('fname1.txt','rt');
%fname1.txt is created by a C program. I am quite sure that the problem isn't there.
scanned = textscan(filedata,'%s','Delimiter','\n');
raw = scanned{1};
stringValues = strings(50,1);
for K=1:length(raw)
stringValues(K)=raw{K};
end
clear K %purely for convenience
regex = 'x=(?<x>[\-\.0-9]*),f1=(?<f1>[\-\.0-9]*),f2=(?<f2>[\-\.0-9]*)';
tokens = regexp(stringValues,regex,'names');
x = zeros(50,1,'single');
f1 = zeros(50,1,'single');
f2 = zeros(50,1,'single');
for L=1:length(tokens)
x(L)=tokens{L}.x;
f1(L)=tokens{L}.f1;
f2(L)=tokens{L}.f2;
end
Use function str2double before assigning into yours arrays (and then cast it to single if you want). Strings (char arrays) must be explicitely converted to numbers before using them as numbers.

Scientific notation in MATLAB

Say I have an array that contains the following elements:
1.0e+14 *
1.3325 1.6485 2.0402 1.0485 1.2027 2.0615 1.7432 1.9709 1.4807 0.9012
Now, is there a way to grab 1.0e+14 * (base and exponent) individually?
If I do arr(10), then this will return 9.0120e+13 instead of 0.9012e+14.
Assuming the question is to grab any elements in the array with coefficient less than one. Is there a way to obtain 1.0e+14, so that I could just do arr(i) < 1.0e+14?
I assume you want string output.
Let a denote the input numeric array. You can do it this way, if you don't mind using evalc (a variant of eval, which is considered bad practice):
s = evalc('disp(a)');
s = regexp(s, '[\de+-\.]+', 'match');
This produces a cell array with the desired strings.
Example:
>> a = [1.2e-5 3.4e-6]
a =
1.0e-04 *
0.1200 0.0340
>> s = evalc('disp(a)');
>> s = regexp(s, '[\de+-\.]+', 'match')
s =
'1.0e-04' '0.1200' '0.0340'
Here is the original answer from Alain.
Basic math can tell you that:
floor(log10(N))
The log base 10 of a number tells you approximately how many digits before the decimal are in that number.
For instance, 99987123459823754 is 9.998E+016
log10(99987123459823754) is 16.9999441, the floor of which is 16 - which can basically tell you "the exponent in scientific notation is 16, very close to being 17".
Now you have the exponent of the scientific notation. This should allow you to get to whatever your goal is ;-).
And depending on what you want to do with your exponent and the number, you could also define your own method. An example is described in this thread.

format a number as I want

I have the number: a = 3.860575156847749e+003; and I would show it in a normal manner. So I write b = sprintf('%0.1f' a);. If I print b I will get: 3860.6. This is perfect. Matter of fact, while a is a double type, b has been converted in char.
What can I do to proper format that number and still have a number as final result?
Best regards
Well, you have to distinguish between both the numerical value (the number stored in your computer's memory) and its decimal representation (the string/char array you see on your screen). You can't really impose a format on a number: a number has a value which can be represented as a string in different ways (e.g. 1234 = 1.234e3 = 12.34e2 = 0.1234e4 = ...).
If you want to store a number with less precision, you can use round, floor, ceil to calculate a number which has less precision than the original number.
E.g. if you have a = 3.860575156847749e+003 and you want a number that only has 5 significant digits, you can do so by using round:
a = 3.860575156847749e+003;
p = 0.1; % absolute precision you want
b = p .* round(a./p)
This will yield a variable b = 3.8606e3 which can be represented in different ways, but should contain zeros (in practice: very small values are sometimes unavoidable) after the fifth digit. I think that is what you actually want, but remember that for a computer this number is equal to 3.86060000 as well (it is just another string representation of the same value), so I want to stress again that the decimal representation is not set by rounding the number but by (implicitly) calling a function that converts the double to a string, which happens either by sprintf, disp or possibly some other functions.
Result of sprintf y a text variable. have you tried to declare a variable as integer (for example) and use this as return value for sprintf instruction?
This can be useful to you: http://blogs.mathworks.com/loren/2006/12/27/displaying-numbers-in-matlab/