Chaos engineering best practice

I studied the Principles of Chaos and looked at some open-source projects, such as ChaosBlade, which is open-sourced by Alibaba, and Mangle, by VMware.
These are both fault injection tools, and they do no analysis of the tested system.
According to the Principles of Chaos, we should:
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?
Are there any good suggestions or best practices?

As always, the answer is: it depends... It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But normally it makes total sense to introduce metrics to improve observability.
If your hypothesis is something like "Our service can process 120 requests per second, even if one node fails", then yes, you could verify it via metrics, but you could also measure it via the requests you send and the responses you receive back. It is up to you.
But if your hypothesis is "I get a response for a request that was sent before a node goes down", then it makes more sense to verify this directly with the requests and responses.
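As a rough illustration of that direct request/response check, here is a minimal sketch; the endpoint URL and the kill_node() hook are placeholders for your own service and fault injection tooling, not a real API:

```python
# Sketch: verify "a request sent before a node goes down still gets a
# response". The URL and kill_node() are hypothetical placeholders.
import threading
import urllib.request

def kill_node():
    """Placeholder for your fault injection (e.g. a ChaosBlade/Mangle call)."""

def send_request(results):
    try:
        with urllib.request.urlopen("http://service.example/orders", timeout=10) as resp:
            results["status"] = resp.status
    except Exception as exc:
        results["status"] = exc

results = {}
t = threading.Thread(target=send_request, args=(results,))
t.start()        # the request is now in flight
kill_node()      # inject the fault while the request is pending
t.join()
assert results["status"] == 200, f"in-flight request failed: {results['status']}"
```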
On our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML and the related actions to prove it.
So you can say: I have a steady state X, and if I do Y, then steady state X should still hold. The toolkit is also able to verify metrics if you want it to.
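For reference, such an experiment might look roughly like the following; it is shown here as a Python dict that serializes to the JSON form chaostoolkit consumes, and the title, health URL, and kill command are all made-up placeholders:

```python
# A minimal chaostoolkit-style experiment, written as a Python dict that
# serializes to the JSON the toolkit consumes. The names, the health URL
# and the kubectl target below are hypothetical.
import json

experiment = {
    "version": "1.0.0",
    "title": "Service keeps responding when one node fails",
    "description": "Steady state X should still hold after doing Y.",
    "steady-state-hypothesis": {
        "title": "API answers with HTTP 200",
        "probes": [{
            "type": "probe",
            "name": "api-health",
            "tolerance": 200,  # expected HTTP status
            "provider": {"type": "http", "url": "http://service.example/health"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "kill-one-node",
        "provider": {
            "type": "process",
            "path": "kubectl",
            "arguments": "delete pod service-0",
        },
    }],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)  # then run: chaos run experiment.json
```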

The Principles of Chaos sit a bit above the actual testing: they reflect the philosophy of the designed vs. the actual system, and of the system under injection vs. a baseline, but they are a bit too abstract to apply in everyday testing. They are a way of reasoning, not a work-process methodology.
I think the control group vs. experimental group wording is an especially doubtful part: you stage a test (injection) in a controlled environment and try to catch whether there is a user-facing incident, an SLA breach of any kind, or a degradation. I do not see where the control group is if you test on a stand or a dedicated environment.
We use a very linear variety of chaos methodology, which is:
1. Find failure points in the system (based on architecture, critical user scenarios, and the history of incidents).
2. Design chaos test scenarios (may be a single attack or a more elaborate sequence).
3. Run tests, register the results, and reuse the green ones for new releases.
4. Start tasks to fix red tests; verify the solutions when they are available.
One may say we are actually using the Principles of Chaos in steps 1 and 2, but we tend to think of chaos testing as a quite linear and simple process.

Mangle 3.0 was released with an option for analysis using a resiliency score. Detailed documentation is available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

Related

How to track errors in FPGA/ASIC development using post-place-and-route and/or post-synthesis simulation?

I am a bit confused about the usefulness of post-PnR and/or post-synthesis simulations for FPGA/ASIC development. If the synthesis or PnR process completes successfully in the design flow, is there any chance that the respective 'post' simulation will reveal errors in the design? Could someone give an example?
In a typical design flow, post-synthesis and/or post-PnR simulations are not useful, and the aim should be to avoid them.
Post-synthesis simulation can only unearth bugs in the, well, synthesis tool, which are extremely rare in established FPGA tools. Checking for these should not be an integral part of any design flow.
Although there are some very rare cases where the PnR tools might make, e.g., a technology mapping error or fail to warn about a design rule violation, at least 99% of the cases where post-PnR simulation reveals problems are due to design error: most typically clock domain crossing, with memory access race conditions a good, but already very rare, second.
Therefore, the emphasis should be on adhering to the design rules and on a rigorous design methodology to avoid the problems, rather than on trying to catch them in post-PnR simulation.
To your question: if there is no negative slack and the design rule check is OK, there should not be anything more that either of the 'post' simulations can reveal.
One practical use for post-PnR simulation is when you have a complex design that fails occasionally due to timing variation of an external component or a mistake in the I/O constraints, but you don't have a clue about the error mechanism. A combination of an integrated logic analyzer and post-PnR simulation can help find the root cause in the trickiest of situations.
Post-PnR simulations verify not only the functionality but also the timing. The timing information of the circuit can be dumped into the simulation in several formats; the most popular one is the Standard Delay Format (SDF), which is published as IEEE 1497.
What kinds of errors can we catch, then?
Unwanted glitches are hard to catch in RTL simulations. If some outputs are generated by combinational logic, post-PnR simulations are more important than ever.
There may be mistakes in the synthesis and/or PnR constraints. It is always better to double-check everything.
Synthesis/PnR tools may have bugs. Logic Equivalence Checking (LEC) can also catch bugs, but it checks functionality only.
Post-PnR simulations are what industry calls Gate-Level Simulation (GLS). It comes in two types: timing and non-timing. This kind of simulation is used to detect:
Timing paths not checked by STA or timing closure.
Bugs in power and reset operation, since HFNS (high-fanout net synthesis) and CTS (clock tree synthesis) may have caused irregularities in the reset of some resettable flops, causing them to deliver X to the next logic in the path and triggering X-propagation.
Bugs in DFT logic that was not checked during RTL simulation and might have been removed during PnR.
X on logic paths due to reliability issues on clock-domain-crossing paths skipped by STA.
Translating the logic from mapped to PAR is a mostly stable process. But of course, if you are pedantic, you could use LEC for both syn->map and map->PAR.
Post-PAR sims could be useful, though, if you have issues in the lab, maybe because you didn't fully constrain your design for timing, and need to simulate with the back-annotated SDF, as someone else mentioned above. This, of course, does not help you with respect to I/O if you haven't created timing models in your TB and/or constrained your I/O properly, as provided to you by the board designer.
I think it is best practice to run the regression suite at least once against the PAR netlist with back-annotated SDF. It costs you nothing and provides one more confidence data point.

Is there any standard for supporting lock-step processors?

I want to ask about supporting lock-step (lockstep, lock-step) processors at the SW level.
As far as I know, in AUTOSAR ASIL-D, a lock-step processor is used for a fault-tolerant system in the following scenario:
The input signals for a processor are copied to another processor (its lock-step pair).
The output signals from the two processors are compared.
If the two output signals differ, a trap is generated.
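To make the scenario concrete, here is a toy software model of that compare step (my own illustration, not from AUTOSAR or any other standard; in reality the comparison happens in hardware):

```python
# Toy model of the lockstep compare step above: both cores receive the
# same input; diverging outputs raise a trap. Purely illustrative.
class LockstepTrap(Exception):
    pass

def lockstep_step(core, shadow, signal):
    out_a = core(signal)     # primary core
    out_b = shadow(signal)   # lock-step pair, fed a copy of the input
    if out_a != out_b:
        raise LockstepTrap(f"outputs differ: {out_a} != {out_b}")
    return out_a

double = lambda x: 2 * x                 # healthy behaviour
faulty = lambda x: 2 * x + (x == 3)      # models a corrupted core
print(lockstep_step(double, double, 3))  # 6, outputs agree
try:
    lockstep_step(double, faulty, 3)
except LockstepTrap as trap:
    print("trap generated:", trap)       # must be handled somewhere in SW
```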
I think that if a trap is generated, it should be processed somewhere at the SW level.
However, I could not find any standard for this processing.
I have read some of the error-handling-in-SW topics specified in AUTOSAR, but I could not find any satisfying answers.
So, my questions are summarized below:
In AUTOSAR or another standard, where is the right place to process the lock-step trap (SW-C, RTE, or BSW)?
In AUTOSAR or another standard, what is the right action for processing the lock-step trap (RESET or ABORT)?
Thank you.
There are multiple concepts involved here, from different sources.
The ASIL levels are defined by ISO 26262. ASIL-D is the highest level, and using a lockstep CPU is one of the methods typically used to achieve ASIL-D compliance for the whole system. AUTOSAR doesn't define how you achieve ASIL-D, or any ASIL level at all. From an AUTOSAR perspective, lockstep would be an implementation detail of the MCU driver, and AUTOSAR doesn't require MCUs to support lockstep. How a particular lockstep implementation works (whether the outputs are compared after each instruction or not, etc.) depends on the hardware, so you can find those answers in the corresponding hardware manual.
Correspondingly, some decisions have to be made by the people working on the system, including an expert on functional safety. The decision on what to do on lockstep failure is one such decision: how you react to a lockstep trap should be defined at the system level. This is also not defined by AUTOSAR, although the most reasonable option is to reset your microcontroller after saving some information about the error.
As for where in the AUTOSAR stack the trap should be handled, this is also an implementation decision, although the reasonable choice is for this to happen at the MCAL level - to the extent that talking about levels even makes sense here, as the trap will run in interrupt/trap context and not in the normal OS task context. Typically, a trap comes with a higher priority than any interrupt, and typically it's also not possible to disable traps in software. A trap will be handled by a routine that is registered by the OS in the same way it registers ISRs, so you'd want to configure the trap handler in whatever tool you're using for OS configuration. The lockstep trap may (again, depending on the hardware) be considered a non-recoverable trap, meaning that the trap handler should eventually trigger a reset. Calling the standard ShutdownOS() function may be reasonable.

How to represent a part of a BPMN workflow that is automated by a system?

I am documenting a user workflow where part of the flow is automated by a system (e.g., if the order quantity is less than 10, then approve the order immediately rather than sending it to a staff member for review).
I have swim lanes that go from person to person, but I am not sure where to fit this system task/decision path. What's the best practice? Possibly a dumb idea, but I'm inclined to create a new swim lane and call it the "system".
Any thoughts?
The approach of detaching system tasks into a separate lane is quite possible, as the BPMN 2.0 specification does not explicitly define the meaning of lanes and says something like this:
Lanes are used to organize and categorize Activities within a Pool. The meaning of the Lanes is up to the modeler. BPMN does not specify the usage of Lanes. Lanes are often used for such things as internal roles (e.g., Manager, Associate), systems (e.g., an enterprise application), an internal department (e.g., shipping, finance), etc.
So you are completely free to fill them with whatever you want. However, your case is quite clear-cut and doesn't require such separation at all. According to your description, we have a typical conditional activity, which can be expressed via a Service Task or a Sub-Process. These are two different approaches, and they carry different semantics.
According to the BPMN specification, a Service Task is a task that uses some sort of service, which could be a web service or an automated application. That is, it is usually used when the modeller doesn't want to decompose some process and intends to outsource it to an external tool or agent.
Another cup of tea is the Sub-Process, which is typically used when you want to wrap some complex piece of workflow for reuse, or when that piece of workflow can be decomposed into sub-elements.
In your use case, a Sub-Process is the tool of choice. It is highly adjustable, transparent, and maintainable. For example, inside the sub-process you can use a business rules engine for your condition parameter (order quantity) and flexibly adjust its value on the fly.
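To make that concrete, the decision inside such a sub-process boils down to something like the following sketch (the threshold of 10 comes from the question; with a rules engine, the threshold would live outside the code and be tunable on the fly):

```python
# Hypothetical auto-approval rule behind the sub-process. A business
# rules engine would externalize the threshold instead of hard-coding it.
AUTO_APPROVE_THRESHOLD = 10

def route_order(quantity: int) -> str:
    if quantity < AUTO_APPROVE_THRESHOLD:
        return "auto-approved"   # the system approves immediately
    return "manual-review"       # goes to a staff member's lane

assert route_order(5) == "auto-approved"
assert route_order(25) == "manual-review"
```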
You can learn the difference between these approaches in greater detail from this blog.
There is a technique of expressing system tasks/decisions via a dedicated participant/lane; all system tasks are then collocated in a system lane.
System tasks (service tasks in BPMN) are usually done on behalf of an actor, so in my opinion it is useful to position them in the lane for that actor.
Usually such a design also helps keep the diagram easy to read by limiting the number of transitions between the "user" lanes and the "system" lane.

What is the best definition of an RTOS?

I have yet to find a definition of an RTOS that is specific enough to have meaning. The best one I can find is on wiki:
https://en.wikipedia.org/wiki/Real-time_operating_system
However, I have some critical comments/questions:
"Real time" seems to be undefined in all the definitions of RTOS I've found. Nothing can be as fast as actual real time (infinitesimally small!). Therefore, I believe "real time" only makes sense in the context of the observer. Real time for a human using an iPhone might be <20 ms, because human eyesight cannot detect changes faster than that. For an airbag deployment it might be <1 ms. All the definitions on the internet seem to gloss over the definition of "real time"!
If an RTOS is defined by the requirement to execute something within a specific time frame (a "deadline"), why does jitter come into the definition? If the iPhone response jitters between 12 and 14 ms, is it no longer responding in real time? It meets the 20 ms requirement, right? If one time the response went to 100 ms, the user might notice, at which point the system is not an RTOS.
How can there possibly be a "soft" RTOS?! The definition of an RTOS is meeting a particular deadline requirement. If it doesn't meet it, then it's not an RTOS! The very definition of an RTOS prohibits a "soft" RTOS.
To me it seems there is no formal and precise definition of an RTOS. It's a general term for an OS whose main priority is the appearance of "real time" (per the requirement's number) to a particular type of observer. It also seems the name has taken on implementation connotations, such as how things are processed, multi-tasking, message passing, semaphores, etc., all of which may NOT be part of an RTOS at all if the system fails to respond within the "deadline" requirement, right?
Sorry for such a broad question, but I can't get a clear picture in my brain. All the definitions I've found are simply not precise enough, or they cloud the definition with implementation details.
You're right that no definition specifies the exact time bounds. That's not the goal of a definition. Real time isn't dependent on the observer, though, but on the application. As applications differ, time bounds differ, and therefore a definition cannot give that bound as a number.
Jitter is irrelevant as long as the application's time bound is met. You're absolutely right about the example. If the deadline is 20 ms, taking 100 ms is a failure. If the OS is to blame for the delay, it's not an RTOS.
"Soft realtime" has a very specific meaning, and this is probably the only thing you really got wrong. The concept at work here is, what do you do when a task exceeds its deadline? (Note: this could be either the fault of the task itself or the RTOS.) In a hard realtime system, the task simply has no value anymore. A late outcome is as good as no outcome, and you cancel the task. No point in risking other tasks.
A soft RTOS is actually more complex. Finishing the task still has value, although diminished. So the RTOS cannot hard-kill the task, but the OS still has to ensure that other tasks meet their deadlines. That requires extra care, which wouldn't have been necessary if you could just kill the task.
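One way to picture the difference (my own sketch, not from any standard text) is as a value-over-time curve: past a hard deadline the result is worthless immediately, while past a soft deadline its value decays gradually.

```python
# Sketch: value of a task's result as a function of its completion time.
# The linear decay for the soft case is an arbitrary illustrative choice.
def hard_value(completion_ms: float, deadline_ms: float) -> float:
    # A late outcome is as good as no outcome: zero value past the deadline.
    return 1.0 if completion_ms <= deadline_ms else 0.0

def soft_value(completion_ms: float, deadline_ms: float, cutoff_ms: float) -> float:
    if completion_ms <= deadline_ms:
        return 1.0
    if completion_ms >= cutoff_ms:
        return 0.0
    # Diminished but non-zero value between the deadline and the cutoff.
    return (cutoff_ms - completion_ms) / (cutoff_ms - deadline_ms)

print(hard_value(25, deadline_ms=20))                 # 0.0: cancel the task
print(soft_value(25, deadline_ms=20, cutoff_ms=40))   # 0.75: still worth finishing
```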
There is an Embedded Systems Dictionary. Here are some excerpts:
real-time adj. Having timeliness requirements, typically in the form of deadlines that can’t be missed.
real-time operating system n. An operating system designed specifically for use in real-time systems. Abbreviated RTOS.
real-time system n. Any computer system, embedded or otherwise, that has timeliness requirements. The following question can be used to distinguish real-time systems from the rest: “Is a late answer as bad, or even worse, than a wrong answer?” In other words, what happens if the computation doesn’t finish in time? If nothing bad happens, it’s not a real-time system. If someone dies or the mission fails, it’s generally considered “hard” real-time, which is meant to imply that the system has hard deadlines. Everything in between is “soft” real-time.

Analysing and generating statistics on your code

I was wondering if anyone had any ideas or procedures for generating general statistics on your source code.
Off the top of my head, I would love to know how many functions in my project's code are called once or only a few times, or which classes are only instantiated once.
I'm sure there is a ton of other interesting things to be found out.
I could do something like the above using grep magic, but has anyone come across tools or tips?
Coverity is the first thing that comes to mind. One of their products currently offers:
Software DNA Map™ analysis system: Generates a comprehensive representation of the entire build system including a semantically correct parsing of every line of code.
Defect Manager: Intuitive interface makes it easy to establish ownership of defects and resolve them via a customized workflow that mirrors your existing development process.
Local Analysis: Enables code to be analyzed locally on developers’ desktops to ensure quality before sharing with other developers.
Boolean Satisfiability: Translates the code into questions based on Boolean values, then applies SAT solvers for the most accurate defect detection and the lowest false positive rate available. Only Prevent offers the added precision of this proprietary method.
Race Conditions Checker: Features an industry-first race conditions checker built specifically for today’s complex multi-threaded applications.
Path Simulation: Simulates 100% of all values and data paths, enabling detection of the most critical defects.
Statistical & Interprocedural Analysis: Ensures a comprehensive analysis of your entire build system by inferring correct behavior based on previously observed behavior and performing whole-program analysis similar to executing the binary.
False Path Pruning: Efficiently removes false positives to give Prevent an average FP rate of about 15%, with some users reporting FP rates of as low as 5%.
Incremental Analysis: Analyzes source code wholly or incrementally, allowing you to save time by checking only those components that are affected by a change.
Reporting: Measures software quality trends over time via customizable reporting so you can show defects grouped by checker, classification, component, and other defect information.
There are lots of tools that do this, but AFAIK none of them are language-independent (which would be mostly impossible anyway; e.g., some languages might not even have functions).
Generally you will find those tools under the categories of "code coverage tools" or "profilers".
For .NET you can use Visual Studio or CLR Profiler.
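If you only need rough numbers for a single language, a small script gets you surprisingly far before reaching for a commercial tool. For example, here is a Python sketch that flags functions called at most once; it is purely syntactic, so it misses dynamic dispatch, method calls, and aliased callables:

```python
# Rough sketch: count call sites per function name across the Python
# files in the current directory tree. Purely syntactic analysis.
import ast
import pathlib
from collections import Counter

calls = Counter()   # call sites seen per simple function name
defs = set()        # function names defined in the project

for path in pathlib.Path(".").rglob("*.py"):
    tree = ast.parse(path.read_text(encoding="utf-8"))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defs.add(node.name)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls[node.func.id] += 1

for name in sorted(defs):
    if calls[name] <= 1:
        print(f"{name}: called {calls[name]} time(s)")
```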