How is it possible, in Ada, to have a code failure from assigning a 64bit floating point number to a 16bit integer? - type-conversion

I have just discovered this report on the reasons for the failure of the Ariane 5 rocket. According to the report, the failure occurred when a 64-bit floating point number was assigned to a 16-bit integer.
One of the many reasons why Ada is a reliable language is that it uses strong typing and has proper exception handling. I don't understand how it was possible to write code that attempted this conversion and have it compile correctly.
There's also the question of why no exception handler existed for this condition, which is also peculiar but perhaps more a failure of the programmer than of the language, though an Ada project that left live code with potential exceptions but no exception handler is difficult to imagine.
Any ideas?
http://www-users.math.umn.edu/~arnold/disasters/ariane5rep.html

The answer is easy: it is always possible to explicitly convert, and that seems to have been done with the Ariane 5 code:
-- Overflow is correctly handled for the vertical component
L_M_BV_32 := TBD.T_ENTIER_32S((1.0 / C_M_LSB_BV) *
                              G_M_INFO_DERIVE(T_ALG.E_BV));
if L_M_BV_32 > 32767 then
   P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
   P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
   P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TBD.T_ENTIER_16S(L_M_BV_32));
end if;
-- But not for the horizontal one
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TBD.T_ENTIER_16S
                             ((1.0 / C_M_LSB_BH) *
                              G_M_INFO_DERIVE(T_ALG.E_BH)));
Here the T_ENTIER_16S function no doubt converts the floating point value to a 16-bit signed value. Note that "ENTIER" is French for "integer", so the function likely comes from an internal library.
How to handle explicit floating point to integer conversion in Ada is, of course, covered on your favorite Q&A site.
In Ada the conversion is very explicit, so the programmers should have thought about this scenario. But maybe they did and concluded that it could never occur. The calculation probably needed to be fast as well; time is of the essence when flying a rocket, after all.
That the result does not require much precision is clear from the vertical calculation: it simply saturates at the limits of the 16-bit signed integer instead of widening the result to 32 bits. Returning negative values or values close to zero is what led to the failure, as it would warp the calculations too much.
Be warned that saturating a bounded integer value is a stop-gap measure that may not work in every situation. The fix used for this project may not work in yours.
Note that they somehow assumed that returning an almost random value made more sense than raising an exception in the T_ENTIER_16S function. This was possibly for performance reasons: the function could simply copy the corresponding bits, saving the two if statements that check for overflow and underflow.
Exceptions are nice, but they are of little value for this kind of calculation at run time. The problem is that the rocket depends on the function executing correctly. If it doesn't, the rocket will fail and crash. An exception is only useful when there is recourse. Otherwise it may just help with analyzing the error after the fact.
In this case an assertion could have been used as a guard within T_ENTIER_16S. The error would then have been caught during testing, provided the test inputs were adequate. Unfortunately, assertions were not available in Ada at that time.
The in-flight code, with assertions disabled, could then have returned MIN / MAX values instead of a direct bit representation. Anything is better than returning completely unexpected results. That is, as long as the additional range checks are not a problem for the running time of the function.
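To illustrate the idea of "assert during testing, saturate in flight", here is a minimal sketch in Python rather than Ada. The function name to_int16 and the checked flag are hypothetical and are not taken from the Ariane code; the sketch also assumes finite inputs and truncates toward zero, whereas an Ada conversion would round.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16(value: float, checked: bool) -> int:
    """Convert a finite float to a value that fits a 16-bit signed integer.

    checked=True  models a test build: out-of-range input fails loudly.
    checked=False models a flight build: the result saturates at the bounds.
    """
    truncated = int(value)  # truncates toward zero (assumption, see lead-in)
    if checked:
        assert INT16_MIN <= truncated <= INT16_MAX, f"out of range: {value}"
    # Clamp to the representable range instead of copying raw bits.
    return max(INT16_MIN, min(INT16_MAX, truncated))
```

With checked=True an inadequate range shows up as a failing test; with checked=False the caller at least gets the nearest representable value rather than something unrelated to the input.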

In short: this was not purely a programming problem; it was more complex.
The defect on Ariane 5 did not have a single cause. Throughout the development and testing processes, there were many stages at which this defect could have been identified.
• The software module was reused in a new environment where
operating conditions differed from the requirements of the software
module. These requirements were not revised.
• The system identified and recognized the error.
Unfortunately, the specification of the error handling mechanism was
inconsistent and caused the final destruction.
• The erroneous module was never properly tested in the new
environment, neither at the hardware level nor at the system
integration level. Therefore, the flaw in development and
implementation was not detected.
See the diff against the corrected code:
--- LIRE_DERIVE.ads 000 Tue Jun 04 12:00:00 1996
+++ LIRE_DERIVE.ads Fri Jan 29 13:50:00 2010
@@ -3,10 +3,17 @@
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-P_M_DERIVE(T_ALG.E_BH) :=
- UC_16S_EN_16NS (TDB.T_ENTIER_16S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
+L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
+
+if L_M_BH_32 > 32767 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
+elsif L_M_BH_32 < -32768 then
+ P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
+else
+ P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
+end if;
More detailed info:
A)
The “Operand Error” occurred because of an unexpectedly large value of BH (Horizontal Bias), calculated by an internal function from the “horizontal speed” measured by the sensors on the Platform.
The value of BH served as an indicator of the accuracy of the Platform positioning. The BH value was much larger than expected because the early flight path of the Ariane 5 differed significantly from that of the Ariane 4 (where this software module was used earlier), which led to a significantly higher "horizontal speed".
The final action, which had fatal consequences, was the shutdown of the processor. As a result, the entire Navigation System ceased to function, and it was technically impossible to resume its operation.
B)
However, Ariane 5, unlike the previous model, had a fundamentally different procedure for pre-flight actions, so different that running the fatal program module after launch made no sense at all. Nevertheless, the module was reused without any modifications.
The investigation revealed that as many as seven variables were involved in type conversion operations in this software module. It turned out that the developers had analyzed every operation that could potentially raise an exception.
It was their very conscious decision to add proper protection to four variables and leave three, including BH, unprotected. The reason for this decision was the conviction that an overflow was impossible in principle for these three variables.
This confidence was supported by calculations showing that the expected range of the physical flight parameters from which these variables are derived could not lead to an undesirable situation. And that was true, but only for the trajectory calculated for the Ariane 4.
The new-generation Ariane 5 was launched on a completely different trajectory, for which no evaluation was performed. That trajectory (together with a high initial acceleration) was such that the “horizontal speed” was more than five times the value calculated for Ariane 4.
Protection was not provided for all seven variables (including BH) because a maximum workload of 80% had been declared for the IRS computer. The developers had to look for ways to reduce unnecessary computational cost, so they weakened protection where an undesirable situation theoretically could not arise. When it did arise, an exception handling mechanism came into effect that turned out to be completely inadequate.
This mechanism comprised the following three basic actions.
• Information about the occurrence of the emergency must be
transmitted via the bus to the onboard computer (OBC).
• In parallel, this information, along with the entire context, was
recorded in the reprogrammable EEPROM memory (whose contents could be
restored and read during the investigation).
• The IRS processor was to shut down.
The last action turned out to be fatal: it was carried out in a situation that was actually normal (despite the software exception generated by the unprotected overflow), and this led to the catastrophe.
C)
Velocity was represented as a 64-bit float.
A conversion into a 16-bit signed integer caused an overflow.
The current velocity of Ariane 5 was too high to be represented as a 16-bit integer.
Error handling was suppressed for performance reasons.
(Protection was not provided for all seven variables (including BH) because a maximum workload of 80% had been declared for the IRS computer. The developers had to look for ways to reduce unnecessary computational cost, so they weakened protection where an undesirable situation theoretically could not arise.)
According to a presentation by Jean-Jacques Levy (who was part of the team that searched for the source of the problem), the actual Ada source code that caused the problem was as follows:
-- Vertical velocity bias as measured by sensor
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
-- Check if the measured vertical velocity bias can be
-- converted to a 16 bit int. If so, then convert
if L_M_BV_32 > 32767 then
   P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
   P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
   P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
-- Horizontal velocity bias as measured by sensor
-- is converted to a 16 bit int without checking
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS (TDB.T_ENTIER_16S
   ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH)));
The last line (shown here as two lines of text) caused the overflow,
where the conversion from 64 bits to 16 bits unsigned is not protected.
The code before is protected by testing before the assignment if the
number is too big.
The correct code would have been:
L_M_BV_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BV) * G_M_INFO_DERIVE(T_ALG.E_BV));
if L_M_BV_32 > 32767 then
P_M_DERIVE(T_ALG.E_BV) := 16#7FFF#;
elsif L_M_BV_32 < -32768 then
P_M_DERIVE(T_ALG.E_BV) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BV) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BV_32));
end if;
L_M_BH_32 := TBD.T_ENTIER_32S ((1.0/C_M_LSB_BH) * G_M_INFO_DERIVE(T_ALG.E_BH));
if L_M_BH_32 > 32767 then
P_M_DERIVE(T_ALG.E_BH) := 16#7FFF#;
elsif L_M_BH_32 < -32768 then
P_M_DERIVE(T_ALG.E_BH) := 16#8000#;
else
P_M_DERIVE(T_ALG.E_BH) := UC_16S_EN_16NS(TDB.T_ENTIER_16S(L_M_BH_32));
end if;
In other words, the same overflow check should have been present for the horizontal part of the calculation (E_BH) as was already present for the vertical part of the calculation (E_BV).
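The guarded pattern used for both components can be sketched as follows. This is Python, not Ada, and the function name is hypothetical. It assumes, based only on the name and on the constants 16#7FFF# and 16#8000#, that UC_16S_EN_16NS reinterprets a 16-bit signed value as its unsigned two's-complement bit pattern; a plain Python int stands in for the 32-bit intermediate.

```python
def derive_to_u16_pattern(scaled: float) -> int:
    """Return the 16-bit pattern stored for a scaled bias value.

    Mirrors the guarded branch: widen first, compare against the 16-bit
    signed limits, and only then narrow.
    """
    wide = int(scaled)        # stand-in for the 32-bit intermediate
    if wide > 32767:
        return 0x7FFF         # saturate at +32767
    if wide < -32768:
        return 0x8000         # bit pattern of -32768
    return wide & 0xFFFF      # reinterpret signed 16-bit as unsigned bits
```

The point of the wide intermediate is that the comparison happens before any narrowing, so the narrowing step can never see a value it cannot represent.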
Sources:
http://moscova.inria.fr/~levy/talks/10enslongo/enslongo.pdf
(as of 2019-01-24, available at http://para.inria.fr/~levy/talks/10enslongo/enslongo.pdf)
https://habr.com/ru/company/pvs-studio/blog/306748/
https://ja.wikipedia.org/wiki/%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%BC%E3%83%9F%E3%83%83%E3%82%B7%E3%83%A7%E3%83%B3

The purpose of the code was to fly the rocket. What would an exception handler do? Play a funeral dirge as the rocket breaks up? By the time the exception triggered, it was already too late: from that point onwards there was no data available to compute the control equations needed to actually fly the vehicle.

Related

Designing an Asynchronous Down Counter (13 to 1 and reset) using JK-FF with PRN and CLRN

I decided to mess around with asynchronous counters and tried to create an asynchronous down counter using JK flip-flops which counts from 13 to 1 and wraps around. However, I ran into various problems.
My RESET signal expression: Q0.Q1.Q2.Q3 + (Q1.Q2.Q3)' where Q0 is LSB and Q3 is MSB
My circuit is as follows:
However, when I simulated the circuit, it gave me the wrong results.
I hope I have described my problem in enough detail; if I missed anything, please correct me. Thank you very much, and have a wonderful day.
I tried reconnecting my reset signal from PRN to CLRN and vice versa. I have also tried using T, SR and D flip-flops (the implementations were different).
I found a workaround for this problem, though I would not consider it an optimal solution. It does the job for now, and I look forward to receiving insights on the issues I faced.
Instead of designing the RESET signal and having the circuit loop in some undefined state, I designed a MOD-13 counter that goes through all 13 states from 0000 to 1100 and maps the output to the desired values (13 down to 1, respectively).
MOD-13 counters are easier to design, as they make it harder to reach undefined states, and the circuits that map the states to the desired values are simple combinational circuits, which are also easy to design.
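A behavioural sketch of this workaround, in Python, may make the mapping clearer. It is not a gate-level or asynchronous model, and the function names are made up for illustration: the state register counts 0 to 12, and a separate decode step produces the displayed value 13 down to 1.

```python
def mod13_next(state: int) -> int:
    """Next state of the MOD-13 counter: 0, 1, ..., 12, then back to 0."""
    return (state + 1) % 13


def decode(state: int) -> int:
    """Combinational mapping from internal state to the displayed value."""
    return 13 - state


def run(cycles: int, state: int = 0) -> list[int]:
    """Collect the displayed value for a number of clock cycles."""
    outputs = []
    for _ in range(cycles):
        outputs.append(decode(state))
        state = mod13_next(state)
    return outputs
```

Because the internal state never leaves 0..12, the values 0, 14 and 15 that the direct 13-to-1 design has to detect and reset never appear at the output.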
Of course, a much better way to do this probably exists, but I am not able to implement it at the moment. That said, I am always open to discussion, and I look forward to finding out what errors I made in the implementation. Have a wonderful day!

Register Ranges in HLSL?

I am currently refactoring a large chunk of old code and have finally dove into the HLSL section where my knowledge is minimal due to being out of practice. I've come across some documentation online that specifies which registers are to be used for which purposes:
t – for shader resource views (SRV)
s – for samplers
u – for unordered access views (UAV)
b – for constant buffer views (CBV)
This part is pretty self explanatory. If I want to create a constant buffer, I can just declare as:
cbuffer LightBuffer: register(b0) { };
cbuffer CameraBuffer: register(b1) { };
cbuffer MaterialBuffer: register(b2) { };
cbuffer ViewBuffer: register(b3) { };
However, originating from the world of MIPS Assembly I can't help but wonder if there are finite and restricted ranges on these. For example, temporary registers are restricted to a range of t0 - t7 in MIPS Assembly. In the case of HLSL I haven't been able to find any documentation surrounding this topic as everything seems to point to assembly languages and microprocessors (such as the 8051 if you'd like a random topic to read up on).
Is there a set range for the four register types in HLSL or do I just continue as much as needed in a sequential fashion and let the underlying assembly handle the messy details?
Note
I have answered this question partially, as I am unable to find a range for u currently; however, if someone has a better, more detailed answer than what I've given through testing, then feel free to post it and I will mark that as the correct answer. I will leave this question open until December 1st, 2018 to give others a chance to give a better answer for future readers.
Resource slot counts (for D3D11; D3D12 expands them) are specified on the Resource Limits MSDN page.
The ones of interest to you here are:
D3D11_COMMONSHADER_INPUT_RESOURCE_REGISTER_COUNT (which is t) = 128
D3D11_COMMONSHADER_SAMPLER_SLOT_COUNT (which is s) = 16
D3D11_COMMONSHADER_CONSTANT_BUFFER_HW_SLOT_COUNT (which is b) = 15, but one is reserved to store constant data from shaders (if you have a large static const array, for example)
The u case is different, as it depends on the feature level (and, to be honest, is a vendor/OS version mess):
D3D_FEATURE_LEVEL_11_1 or greater: 64 slots.
D3D_FEATURE_LEVEL_11_0: always at least 8. Some cards/drivers support 64, for which you need at least Windows 8 (it might also be available on Windows 7 with a platform update). I do not recall a way to test whether 64 is supported (many NVIDIA cards in the 700 range do, for example).
D3D_FEATURE_LEVEL_10_1: either 0 or 1, depending on whether compute is supported.
You need to perform a feature check:
D3D11_FEATURE_DATA_D3D10_X_HARDWARE_OPTIONS checkData = {};
HRESULT hr = d3dDevice->CheckFeatureSupport(D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS, &checkData, sizeof(checkData));
BOOL computeSupport = SUCCEEDED(hr) && checkData.ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x;
Please note that on some OS/driver versions I have seen this flag return TRUE when compute was not actually supported (Intel did that on Windows 7/8). In that case the only reliable check was to try to create a small Raw / Byte Address buffer or a Structured Buffer and inspect the HRESULT.
As a side note, feature level 10 and below are for quite old configurations nowadays, so except for rare scenarios you can probably safely ignore them (I leave this here for information only).
Since it's usually a long wait time for these types of questions I tested the b register by attempting to create a cbuffer in register b51. This failed as I expected and luckily SharpDX spit out an exception that stated it has a maximum of 14. So for the sake of future readers I am testing all four register types and posting back the ranges I find successful.
b has a range of b0 - b13.
s has a range of s0 - s15.
t has a range of t0 - t127.
u has a range of (not yet determined).
At the moment I am unable to find a range for the u register, as I have no examples of it in my code and have never actually used it. If someone who does have an example comes along, feel free to test it and update this post for future readers.
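For quick reference, the D3D11 ranges above can be collected into a small lookup. This is only a sketch in Python, not part of any shader toolchain: the function and table names are hypothetical, the u counts are taken from the other answer (8 at feature level 11_0, 64 at 11_1), and none of this applies to D3D12, where much larger indices are allowed.

```python
import re

# Usable D3D11 slot counts per register type, as reported in this thread.
D3D11_SLOT_COUNTS = {"t": 128, "s": 16, "b": 14}
# UAV slots depend on the feature level (values from the other answer).
UAV_SLOT_COUNTS = {"11_0": 8, "11_1": 64}


def register_in_range(register: str, feature_level: str = "11_0") -> bool:
    """Check whether a register such as 'b13' or 'u7' fits the D3D11 limits."""
    match = re.fullmatch(r"([tsbu])(\d+)", register)
    if match is None:
        raise ValueError(f"not a t/s/b/u register: {register!r}")
    kind, index = match.group(1), int(match.group(2))
    if kind == "u":
        limit = UAV_SLOT_COUNTS[feature_level]
    else:
        limit = D3D11_SLOT_COUNTS[kind]
    return index < limit
```

For example, register_in_range("b51") is False, which matches the failure I saw when declaring a cbuffer in register b51.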
I did find a contradiction to my findings above in the documentation linked in my question; they have an example using a t register above the noted range in this answer:
Texture2D a[10000] : register(t0);
Texture2D b[10000] : register(t10000);
ConstantBuffer<myConstants> c[10000] : register(b0);
Note
I would like to point out that I am using the SharpDX version of the HLSL compiler and so I am unsure if these ranges vary from compiler to compiler; I heavily doubt that they do, but you can never be too sure until you try to exceed them. GLSL may be the same due to being similar to HLSL, but it could also be very different.

VHDL simulation in real time?

I've written some code that has an RTC component in it. It's a bit difficult to do proper emulation of the code because the clock speed is set to 50MHz so to see any 'real time' events take place would take forever. I did try to do simulation for 2 seconds in modelsim but it ended up crashing.
What would be a better way to do it if I don't have an evaluation board to burn and test using scope?
If you could provide a little more specific example of exactly what you're trying to test and what is chewing up your simulation cycles that would be helpful.
In general, if you have a lot of code that you need to test in simulation, it's helpful if you can create testbenches of the sub-modules and test them first. Often, if you simulate at the top (chip) level and try to stimulate sub-modules that are buried deep in the hierarchy of a design, it takes many clock ticks just to get data into and out of the sub-module. If you simulate the sub-module directly you have direct access to the modules I/O and can test the things you want to test in that module in fewer cycles than if you try to get to it from the top level.
If you are trying to test logic that has very deep fifos that you are trying to fill or a specific count of a large counter you're trying to hit, you can either add logic to your code to help create those conditions in fewer cycles (like a load instruction on the counter) or you can force the values of internal signals of your design from the testbench itself.
These are just a couple of general ideas. Again, if you provide more detail about what it is you're simulating there are probably people on this forum that can provide help that is more specific to your problem.
As Ciano already mentioned, if you provided more information about your design we would be able to give a more accurate answer. However, there are several tips that hardware designers should follow, especially for complex system simulation. Some of them (the ones I use most) are listed below:
Hierarchical simulation (as Ciano already posted): instead of simulating the entire system, try to simulate a smaller set of modules.
Selective configuration: most systems require some initialization, such as a reset initialization time, register initialization of external chips, etc. For simulation purposes some of these are usually not required, and you may use a global constant to skip these stages when simulating, like:
constant SIMULATION_ENABLE : STD_LOGIC := '1';
...;
-- in reset condition:
if SIMULATION_ENABLE = '1' then
currentState <= state_executeSystem; -- jump the initialization procedures
else
currentState <= state_initializeSystem;
end if;
Be careful: do not modify your code directly (hard-coded changes). As the system grows, it becomes impossible to remember which parts you modified for simulation. Use constants instead, as in the example above, to configure modules for a simulation profile.
Scaled time/size constants: instead of always using the real values for times and sizes (such as event times, memory sizes, register file size, etc.), use scaled values whenever possible. For example, if you are building an RTC that generates an interrupt to the main system every 60 seconds, scale your constants (if possible) to generate interrupts every 6 ms or 60 us instead. Of course, the choice of scale depends on your system. In my designs, I use two global configuration files: one for simulation and the other for synthesis. Most constant values are scaled down to reduce simulation time.
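To see why scaling matters, it helps to count clock cycles. The following is a small arithmetic sketch in Python, assuming the 50 MHz clock from the question; the helper name is made up for illustration.

```python
CLOCK_HZ = 50_000_000  # 50 MHz clock from the question


def cycles_for(seconds: float, clock_hz: int = CLOCK_HZ) -> int:
    """Number of clock cycles the simulator must process for an interval."""
    return round(seconds * clock_hz)


real_interval = cycles_for(60)        # one RTC interrupt at real scale
scaled_interval = cycles_for(0.006)   # the same event scaled to 6 ms
speedup = real_interval // scaled_interval
```

Here real_interval is 3,000,000,000 cycles and scaled_interval is 300,000 cycles, so the scaled simulation reaches the same interrupt with 10,000 times fewer clock events. The 2 second run from the question corresponds to 100,000,000 cycles.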
Increase the abstraction: for bigger modules it might be useful to create a simplified, more abstract module that acts as a model of your module. For example, if you have a processor that has this RTC (which you mentioned) as a peripheral, you may create a simplified module of this RTC. Assuming that you only need its interrupt, you may create a simplified model such as:
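-- Delays between successive interrupts of the simplified RTC model
type time_array is array (natural range <>) of time;
constant INTERRUPT_EVENTS : time_array(1 to 2) := (
  32 ns,
  100 ms
);
...
process
begin
  for i in INTERRUPT_EVENTS'range loop
    rtcInterrupt <= '0';
    wait for INTERRUPT_EVENTS(i);
    rtcInterrupt <= '1';
    wait until rising_edge(clk);  -- hold the interrupt until the next clock edge
  end loop;
  wait;  -- all events generated: suspend forever
end process;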

What potential error can be caused by retrieving and changing of a value in the same line?

I was asked what would happen if I retrieve a value passed by reference and then change it within the same line of code. My answer was that nothing will happen, because when I tried this before I did not encounter any compiler errors (at least in C# or Java).
What is the real answer to this?
This is example with the pseudo code:
Module main()
Call changeNumber(10)
End Module
Module changeNumber(Integer Ref number)
Set number = number * number
Display number
End Module
(PS. Sorry for not formatting/creating this post correctly. I'm having a bit of an issue here.)
There would be no unusual side effects, if that's what you're asking. The language specifications dictate a specific order of execution (number * number is evaluated, then set to number), which prevents any issues from occurring.
Nothing would happen, in your particular pseudo code.
In reference to a question you asked after this question -
Actually, there could be issues in some rare instances, but it would depend on your programming language and on how space is allocated for the number. Consider this: you are explicitly declaring the data type as an int. What if the input is close to the largest value an int can hold (or is a large negative number), and you then multiply it by itself? The result may not fit in the space allocated for an int. In C, signed integer overflow is undefined behaviour rather than a reported error. Depending on which language you're using, and whether it is higher level than C, it may have checks for this case, but not always.
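As a rough illustration of the overflow case, here is a Python sketch that mimics 32-bit two's-complement wrap-around, which is what Java's int and C#'s int in its default unchecked context do. Python integers themselves do not overflow, so the wrapping is done by hand, and the function names are made up for this example.

```python
def wrap_int32(value: int) -> int:
    """Reduce an arbitrary integer to 32-bit two's-complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def square_int32(number: int) -> int:
    """Model of 'Set number = number * number' for a 32-bit int."""
    return wrap_int32(number * number)
```

square_int32(10) gives 100 as expected, but square_int32(46341) gives -2147479015, because 46341 squared is just above the largest 32-bit signed value. No error is raised; the result is simply wrong.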

How to reset Ada.Real_Time.Clock?

When reading Ada.Real_Time.Clock right after power-up, it shows a value that isn't close to zero and is sometimes even negative.
As far as I know, Ada.Real_Time.Clock is supposed to reset on power-up.
How can I reset Ada.Real_Time.Clock?
Thanks.
The Ada 2005 LRM declares that "real time is defined to be the physical time as observed in the external environment." [emphasis added--MC]
"It is not specified by the language whether the time values are synchronized with any standard time reference. For example, E can correspond to the time of system initialization or it can correspond to the epoch of some time standard." (D.8[18-19])
As it states, Ada does not require that "E", the start of the epoch serving as the "zero time" for real-time Time values, correspond to any particular starting point; it's left up to the compiler implementer.
Whatever specific numeric values you observe for the instances of Time you're seeing, whether near or far from zero, positive or negative, are dependent solely on the compiler implementer's choice of E, how it represents times values, and how it correspondingly implements the real-time capability.
Therefore you should avoid writing code that depends on specific, knowable values of Time, or that requires Time values to be directly manipulable.
Real_Time.Time values should be considered abstract quantities.
Agreeing with Marc. While I have seen some platforms that use time since boot (particularly on Intel platforms, where I think they like to use the processor's iteration counter), that is entirely up to the compiler vendor.
If you need something like "time since startup" and your platform isn't giving you that, then the thing to do would be to grab Real_Time.Clock when you start up, and subtract that value from all further reads from Real_Time.Clock.
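The "grab the clock at startup and subtract" approach can be sketched like this. It is Python rather than Ada, with time.monotonic standing in for Ada.Real_Time.Clock; the class name and the injectable source parameter are hypothetical and only there to make the idea easy to try out.

```python
import time


class UptimeClock:
    """Reports time elapsed since construction, whatever the clock's epoch."""

    def __init__(self, source=time.monotonic):
        self._source = source
        self._start = source()  # captured once, at startup

    def elapsed(self) -> float:
        """Seconds since startup: the current reading minus the startup reading."""
        return self._source() - self._start
```

Because only differences are used, it does not matter whether the underlying clock starts near zero, far from zero, or at a negative value. Passing a fake source also lets a test control what the code under test sees as the current time.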
You can look at exactly what facilities are defined for the Real_Time package, including all the LRM sections Marc was quoting, on its LRM page here.
It was long ago, but if it helps someone...
I reset the clock by writing 0 to the time base registers of the MCU.
That's a lovely explanation, but what if someone is trying to write unit tests against code which implements the real_Time clock? For instance, I know that my function foo does an internal comparison against Ada.Real_Time.Clock to check for time spans. Before executing foo with the appropriate inputs I want to reset the clock to force foo down a specific path internally and verify the resulting out parameter has changed.
return_value := foo;
assert (return_value = path1, "tested foo path1");
Reset_Clock;
return_value := foo;
assert (return_value = path2, "tested foo path2");