Difference between branching and select instructions - compiler-optimization

The 'select' instruction is used to choose one value based on a condition, without branching.
I want to know the differences between branch and select instructions (preferably for both x86 and PTX). As far as I know, select is considered more efficient than branching, but I don't have a clear picture of why.

Branching is a general-purpose mechanism used to redirect control flow. It is used to implement most forms of the if statement (when specific optimizations don't apply).
Selection is a specialized instruction, available on some instruction sets, that can implement some forms of the conditional expression
z = (cond) ? x : y;
or
if(cond) z = x;
provided that x and y are plain values (if they were expressions, both would have to be evaluated before the select, which could cost performance or evaluate side effects that a branch would have skipped). Such an instruction is necessarily more limited than branching, but it has the distinct advantage that the instruction pointer doesn't change: since there is no branch, there is nothing to mispredict and no pipeline to flush. Because of this, a select instruction (where available) is usually faster than a branch that is hard to predict.
On SIMT architectures such as NVIDIA GPUs programmed with CUDA/PTX, divergent branches are particularly expensive because the parallel lanes must stay in lock-step. If the threads of a warp disagree at a branch, the warp steps through both paths, with the threads on the not-taken path masked off (effectively executing no-ops). A select instruction, however, doesn't incur this kind of penalty.
Note that most compilers will, with suitable options, generate 'select'-style instructions such as cmov when given a simple enough if statement. Also, in some cases it is possible to use bitwise or logical operations to combine a boolean condition with the candidate values without performing a branch.
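To illustrate that last point, here is a minimal sketch in C of a branch-free select using a bit mask (the helper name select_i32 is made up for this example). A compiler given the plain ternary form may well emit a cmov (x86) or selp (PTX) on its own, so a manual version like this is usually only worth it when profiling shows the branch mispredicts badly.

#include <stdint.h>
#include <stdio.h>

/* Branch-free select: returns x if cond is non-zero, otherwise y.
 * mask is all-ones when cond != 0 and all-zeros otherwise, so the
 * bitwise expression picks exactly one of the two values. */
static int32_t select_i32(int cond, int32_t x, int32_t y)
{
    int32_t mask = -(int32_t)(cond != 0);   /* 0xFFFFFFFF or 0x00000000 */
    return (x & mask) | (y & ~mask);
}

int main(void)
{
    int32_t a = 10, b = 20;
    /* Equivalent to: z = (a < b) ? a : b; but with no branch in the source. */
    int32_t z = select_i32(a < b, a, b);
    printf("%d\n", z);   /* prints 10 */
    return 0;
}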

Related

For black-box analysis of the outcome of a system call, is a complete comparison of before-and-after forensic system images the right way to measure?

I'm doing x86-64 binary obfuscation research, and one of the key challenges in the offense/defense cat-and-mouse game of executing a known-bad program and detecting it (even when obfuscated) is system call sequence analysis.
Put simply, obfuscation is just achieving the same effects on the system through a different sequence of instructions and memory states in order to minimize observable analysis channels. But at the end of the day, you need to execute certain system calls in a certain order to achieve certain input / output behaviors for a program.
Or do you? The question I want to study is this: could the intended outcome of some or all system calls be achieved through different system calls? Let's say that system call D, when executed 3 times consecutively with certain parameters, can be heuristically attributed to malicious behavior. If system calls A, B, and C could be found to achieve the same effect desired from system call D (perhaps with additional side effects), then it would be possible to evade kernel hooks designed to trace and heuristically analyze system call sequences.
To determine how often this system call outcome overlap exists in a given OS, I don't want to use documentation and manual analysis for a few reasons:
undocumented behavior
lots of work, repeated for every OS and even different versions
So instead, I'm interested in performing black-box analysis: fuzz system calls with various arguments and observe the effects. My problem is that I'm not sure how to measure the effects. Once I execute a system call, what mechanism could I use to observe exactly which changes result from it? Is there any reliable way, aside from completely iterating over entire forensic snapshots of the machine before and after?

How to handle the two signals depending on each other?

I read Deprecating the Observer Pattern with Scala.React and found reactive programming very interesting.
But there is a point I can't figure out: the author described the signals as nodes in a DAG (directed acyclic graph). But what if you have two signals (or event sources, or models, whatever) depending on each other? I.e., 'two-way binding', like a model and a view in web front-end programming.
Sometimes it's just inevitable, because the user can change the view, and the back-end (an asynchronous request, for example) can change the model, and you want the other side to reflect the change immediately.
Loop dependencies in a reactive programming language can be handled with a variety of semantics. The one that appears to have been chosen in scala.React is that of synchronous reactive languages, specifically that of Esterel. You can find a good explanation of these semantics and their alternatives in the paper "The synchronous languages 12 years later" by A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone, available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1173191&tag=1 or http://virtualhost.cs.columbia.edu/~sedwards/papers/benveniste2003synchronous.pdf.
Replying to @Matt Carkci here, because a comment wouldn't suffice.
In the paper, section 7.1 Change Propagation, you have:
Our change propagation implementation uses a push-based approach based on a topologically ordered dependency graph. When a propagation turn starts, the propagator puts all nodes that have been invalidated since the last turn into a priority queue which is sorted according to the topological order, briefly level, of the nodes. The propagator dequeues the node on the lowest level and validates it, potentially changing its state and putting its dependent nodes, which are on greater levels, on the queue. The propagator repeats this step until the queue is empty, always keeping track of the current level, which becomes important for level mismatches below. For correctly ordered graphs, this process monotonically proceeds to greater levels, thus ensuring data consistency, i.e., the absence of glitches.
and later at section 7.6 Level Mismatch
We therefore need to prepare for an opaque node n to access another node that is on a higher topological level. Every node that is read from during n’s evaluation, first checks whether the current propagation level which is maintained by the propagator is greater than the node’s level. If it is, it proceeds as usual, otherwise it throws a level mismatch exception containing a reference to itself, which is caught only in the main propagation loop. The propagator then hoists n by first changing its level to a level above the node which threw the exception, reinserting n into the propagation queue (since its level has changed) for later evaluation in the same turn and then transitively hoisting all of n’s dependents.
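To make the quoted 7.1 mechanism concrete, here is a minimal sketch in C (a hypothetical fixed graph, a linear-scan priority queue, and int-valued nodes; scala.React's actual data structures are of course richer) of one push-based propagation turn ordered by topological level:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8
#define MAX_DEPS  4

/* A node in the dependency graph: its topological level, its current value,
 * a recompute function, and the indices of the nodes that depend on it. */
typedef struct Node {
    int  level;
    int  value;
    int  (*recompute)(const struct Node ns[]);
    int  dependents[MAX_DEPS];
    int  n_dependents;
    bool queued;
} Node;

static Node nodes[MAX_NODES];
static int  queue[MAX_NODES];
static int  queue_len = 0;

static void enqueue(int id)
{
    if (!nodes[id].queued) {
        nodes[id].queued = true;
        queue[queue_len++] = id;
    }
}

/* Dequeue the queued node with the lowest topological level (linear scan). */
static int dequeue_lowest_level(void)
{
    int best = 0;
    for (int i = 1; i < queue_len; i++)
        if (nodes[queue[i]].level < nodes[queue[best]].level)
            best = i;
    int id = queue[best];
    queue[best] = queue[--queue_len];
    nodes[id].queued = false;
    return id;
}

/* One propagation turn, as described in section 7.1: repeatedly validate the
 * lowest-level invalidated node and, if its value changed, invalidate its
 * dependents (which sit on strictly greater levels). */
static void propagate(void)
{
    while (queue_len > 0) {
        int id = dequeue_lowest_level();
        int new_value = nodes[id].recompute ? nodes[id].recompute(nodes)
                                            : nodes[id].value;
        if (new_value != nodes[id].value) {
            nodes[id].value = new_value;
            for (int i = 0; i < nodes[id].n_dependents; i++)
                enqueue(nodes[id].dependents[i]);
        }
    }
}

/* Example graph: node 0 is a source, node 1 = node0 + 1, node 2 = node1 * 2. */
static int recompute_node1(const Node ns[]) { return ns[0].value + 1; }
static int recompute_node2(const Node ns[]) { return ns[1].value * 2; }

int main(void)
{
    nodes[0] = (Node){ .level = 0, .value = 10, .dependents = {1}, .n_dependents = 1 };
    nodes[1] = (Node){ .level = 1, .recompute = recompute_node1,
                       .dependents = {2}, .n_dependents = 1 };
    nodes[2] = (Node){ .level = 2, .recompute = recompute_node2 };

    nodes[0].value = 42;   /* external change invalidates the source's dependent */
    enqueue(1);
    propagate();
    printf("node2 = %d\n", nodes[2].value);   /* 86 */
    return 0;
}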
While there's no explicit mention of any topological constraint (cyclic vs. acyclic), something is still not clear, at least to me.
First, there's the question of how the topological order is defined in the presence of a cycle.
Second, the implementation suggests that mutually dependent nodes would loop forever during evaluation, through the level-mismatch exception mechanism explained above.
What do you think?
After scanning the paper, I can't find where they mention that it must be acyclic. There's nothing stopping you from creating cyclic graphs in dataflow/reactive programming. Acyclic graphs only allow you to create Pipeline Dataflow (e.g. Unix command line pipes).
Feedback and cycles are a very powerful mechanism in dataflow. Without them you are restricted in the types of programs you can create. Take a look at Flow-Based Programming - Loop-Type Networks.
Edit after second post by pagoda_5b
One statement in the paper made me take notice...
For correctly ordered graphs, this process monotonically proceeds to greater levels, thus ensuring data consistency, i.e., the absence of glitches.
To me that says that loops are not allowed within the Scala.React framework. A cycle between two nodes would seem to cause the system to continually try to raise the level of both nodes forever.
But that doesn't mean that you have to encode the loops within their framework. It could be possible to have one path from the item you want to observe and then another, separate, path back to the GUI.
To me, it always seems that too much emphasis is placed on a programming system completing and giving one answer. Loops make it difficult to determine when to terminate. Libraries that use the term "reactive" tend to subscribe to this thought process. But that is just a result of the von Neumann architecture of computers... a focus on solving an equation and returning the answer. Libraries that shy away from loops seem to be worried about program termination.
Dataflow doesn't require a program to have one right answer or ever terminate. The answer is the answer at this moment of time due to the inputs at this moment. Feedback and loops are expected if not required. A dataflow system is basically just a big loop that constantly passes data between nodes. To terminate it, you just stop it.
Dataflow doesn't have to be so complicated. It is just a very different way to think about programming. I suggest you look at J. Paul Morrison's book "Flow-Based Programming" for a field-tested version of dataflow, or my book (once it's done).
Check your MVC knowledge. The view doesn't update the model, so it won't send signals to it. The controller updates the model. For a C/F converter, you would have two controllers (one for the F control, one for the C control). Both controllers would send signals to a single model (which stores the only real temperature, Kelvin, in a lossless format). The model sends signals to two separate views (one for the C view, one for the F view). No cycles.
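As a rough sketch of that acyclic C/F-converter layout (hypothetical C code, not tied to Scala.React or any GUI toolkit): the model stores only Kelvin, the controllers translate user input into model updates, and the views derive their display values from the model; nothing ever signals "backwards".

#include <stdio.h>

/* Model: the single source of truth, stored losslessly in Kelvin. */
static double model_kelvin = 273.15;

/* Views: derive their display values from the model; they never write to it. */
static void render_views(void)
{
    printf("C view: %.2f\n", model_kelvin - 273.15);
    printf("F view: %.2f\n", (model_kelvin - 273.15) * 9.0 / 5.0 + 32.0);
}

/* Controllers: translate user input into model updates; the model then
 * "signals" the views (here simply by calling render_views). */
static void celsius_controller(double c)
{
    model_kelvin = c + 273.15;                        /* C -> K */
    render_views();
}

static void fahrenheit_controller(double f)
{
    model_kelvin = (f - 32.0) * 5.0 / 9.0 + 273.15;   /* F -> K */
    render_views();
}

int main(void)
{
    celsius_controller(100.0);     /* user edits the C control */
    fahrenheit_controller(32.0);   /* user edits the F control */
    return 0;
}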
Based on the answer from @pagoda_5b, I'd say that you are likely allowed to have cycles (7.6 should handle it, at the cost of performance), but you must guarantee that there is no infinite regress. For example, you could have the controllers also receive signals from the model, as long as you guaranteed that receipt of said signal never caused a signal to be sent back to the model.
I think the above is a good description, but it uses the word "signal" in a non-FRP style. "Signals" in the above are really messages. If the description in 7.1 is correct and complete, loops in the signal graph would always cause infinite regress as processing the dependents of a node would cause the node to be processed and vice-versa, ad inf.
As @Matt Carkci said, there are FRP frameworks that allow loops, at least to a limited extent. They will either not be push-based, use non-strictness in interesting ways, enforce monotonicity, or introduce "artificial" delays so that when the signal graph is expanded on the temporal dimension (turning it into a value graph) the cycles disappear.

Undoable sets of changes

I'm looking for a way to manage consistent sets of changes across several data sources, including, but not limited to, a database, some network control tools, and probably other SOAP-based services.
If one change fails for some reason (e.g. real-world app says "no", or a database insert fails), I want the whole set to be undone. So that's like transactions, just not limited to a DB.
I came up with a module that stacks up "change" objects which in turn have their init, commit, and rollback methods. When the set is DESTROYed, it rolls uncommitted changes back. This kinda works.
Still, I can't shake the feeling that I'm reinventing the wheel. Is there a standard CPAN module, or a well-described common method, to perform such a task? (At least GoF's "Command" pattern and the RAII principle come to mind...)
There are a couple of approaches to executing a distributed transaction (which is what you're describing):
The standard pattern is called "Two-phase commit protocol".
At the moment I'm not aware of any Perl module which implements two-phase commit, which is kind of surprising and is likely just a lapse in my Googling. The only thing I found was Env::Transaction, but I have no clue how stable/good/functional it is.
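For reference, a bare-bones sketch of the two-phase commit idea (in C, with invented participant callbacks; a real coordinator also has to log its decision and handle timeouts and crashes, which is most of the hard part):

#include <stdbool.h>
#include <stdio.h>

/* A participant exposes three operations: prepare (vote), commit, rollback. */
typedef struct {
    const char *name;
    bool (*prepare)(void);   /* returns true to vote "yes" */
    void (*commit)(void);
    void (*rollback)(void);
} Participant;

/* Phase 1: ask every participant to prepare. Phase 2: if all voted yes,
 * commit everywhere; otherwise roll back everything that prepared. */
static bool two_phase_commit(Participant *parts, int n)
{
    int prepared = 0;
    for (; prepared < n; prepared++)
        if (!parts[prepared].prepare())
            break;

    if (prepared == n) {
        for (int i = 0; i < n; i++) parts[i].commit();
        return true;
    }
    for (int i = 0; i < prepared; i++) parts[i].rollback();
    return false;
}

/* Dummy participants standing in for a database, a network service, etc. */
static bool db_prepare(void)   { puts("db: prepared");        return true;  }
static void db_commit(void)    { puts("db: committed");                     }
static void db_rollback(void)  { puts("db: rolled back");                   }
static bool net_prepare(void)  { puts("net: cannot prepare"); return false; }
static void net_commit(void)   { puts("net: committed");                    }
static void net_rollback(void) { puts("net: rolled back");                  }

int main(void)
{
    Participant parts[] = {
        { "db",  db_prepare,  db_commit,  db_rollback  },
        { "net", net_prepare, net_commit, net_rollback },
    };
    bool ok = two_phase_commit(parts, 2);
    printf("transaction %s\n", ok ? "committed" : "aborted");
    return 0;
}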
For certain cases, a solution involving rollback via "Compensating transactions" is possible.
This is basically a special case of general rollback where, when generating task list A designed to change the target system state from S1 to S2, you at the same time generate a "compensating" task list A-neg designed to change the target system state from S2 back to S1. This is obviously only possible for certain systems, and moreover only a small subset of those are commutative (meaning that you can execute a transaction and its compensating transaction non-contiguously, e.g. the result of A + B + A-neg + B-neg is an invariant state).
Please notice that a compensating transaction does NOT always have to be engineered to be a "transaction" - one clever approach (again, only possible in certain subject domains) involves storing your data with a special "finalized" flag, then periodically harvesting and destroying data whose "finalized" flag is false and whose "originating transaction timestamp" is older than some threshold.
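And a toy sketch of the compensating-transaction idea itself (again in C, with invented step names); the point is only that each forward task in list A is paired with a pre-generated inverse from A-neg that takes the system from S2 back to S1:

#include <stdbool.h>
#include <stdio.h>

/* Each step of task list A carries its compensating step from A-neg. */
typedef struct {
    const char *desc;
    bool (*forward)(void);      /* S1 -> S2 for this step */
    void (*compensate)(void);   /* S2 -> S1 for this step */
} Step;

/* Run the steps in order; if one fails, run the compensating steps of the
 * already-applied ones in reverse order. */
static bool run_with_compensation(Step *steps, int n)
{
    int done = 0;
    for (; done < n; done++)
        if (!steps[done].forward())
            break;
    if (done == n)
        return true;
    for (int i = done - 1; i >= 0; i--)
        steps[i].compensate();
    return false;
}

/* Invented example steps. */
static bool add_row(void)      { puts("insert row");        return true;  }
static void remove_row(void)   { puts("delete row");                      }
static bool call_service(void) { puts("service says no");   return false; }
static void cancel_call(void)  { puts("nothing to cancel");               }

int main(void)
{
    Step steps[] = {
        { "db insert",    add_row,      remove_row  },
        { "service call", call_service, cancel_call },
    };
    printf("result: %s\n", run_with_compensation(steps, 2) ? "ok" : "compensated");
    return 0;
}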

Is there a way to ensure atomicity with an operation in C?

I want this statement (within the body of the if statement) to be atomic:
if(I2C1STATbits.P || cmd_buffer_ptr >= CMD_BUFFER_SIZE - 1)
cmd_buff_full = 1; // should be atomic
My processor (dsPIC33F) supports atomic bit set and clear. It also supports atomic writes for 16-bit registers and memory locations; these are single cycle. How can I be sure the operation will be implemented in an atomic fashion - is there a way to force the compiler to do this? In my case I'm fairly sure it will compile to be atomic, but I don't want it to change in future if I, for example, change some other code and it reorganises things, or if I update the compiler. For example, is there an atomic keyword?
I'm working with GCC v3.23 - more specifically, MPLAB C30, a modified closed source version of GCC. I am working on a microcontroller, which has only interrupts; there is no concept of threads. The only possible problem with atomicity is that an interrupt may be triggered in the middle of a write over two cycles, if that is even possible.
Depending on which other competing operations you want the assignment to be atomic with respect to, you could use sig_atomic_t. Strictly speaking, this protects it only in the presence of signals. In practice, it also provides atomicity wrt. multi-threading.
Edit: if the objective is to guarantee that the store operation is not compiled into two assembler instructions, it will be necessary to use inline assembly - C makes no guarantees in that respect. If the objective is to prevent an interrupt from interfering with the store operation, an alternative is to disable interrupts before the store and re-enable them afterwards.
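A hedged sketch of that second approach; DISABLE_INTERRUPTS/ENABLE_INTERRUPTS below are placeholder macros standing in for whatever mechanism your compiler actually provides on the dsPIC33F (check the C30 documentation rather than taking these names literally):

#include <signal.h>   /* for sig_atomic_t */

/* Placeholder macros: replace with the real interrupt disable/enable
 * mechanism for your target (these names are made up for illustration). */
#define DISABLE_INTERRUPTS()  /* e.g. raise the CPU interrupt priority level */
#define ENABLE_INTERRUPTS()   /* e.g. restore the CPU interrupt priority level */

/* volatile so the compiler always performs the store rather than caching the
 * flag in a register; sig_atomic_t so a single read or write is atomic with
 * respect to signals/interrupts. */
static volatile sig_atomic_t cmd_buff_full = 0;

void mark_buffer_full_if_needed(int stop_condition)
{
    if (stop_condition) {
        DISABLE_INTERRUPTS();  /* no interrupt can fire between here...     */
        cmd_buff_full = 1;     /* ...so no ISR can observe or interrupt a   */
        ENABLE_INTERRUPTS();   /* half-completed store to the flag          */
    }
}

int main(void)
{
    mark_buffer_full_if_needed(1);
    return (int)cmd_buff_full;   /* 1 */
}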
Not in C, but perhaps there's a proprietary library call in the libraries that come with your processor. For example, on Windows there are InterlockedIncrement() and InterlockedDecrement() (to increment/decrement longs), which are guaranteed to be atomic without a lock.

Where in the Fetch-Execute cycle is a value accessed via an addressing mode decoded

I'm currently building a small CPU interpreter that supports several addressing modes, including register-deferred and displacement. It uses the classic IF-ID-EX-MEM-WB RISC pipeline. In what stage of the pipeline is the value for an operand with such an addressing mode decoded? For example:
addw r9, (r2), 8(r3)
In what stage would (r2) and 8(r3) be decoded into their actual values?
It's a funny question.
One property of RISC architectures is register-register operation. That is, all operands for computation instructions such as ADD must already be in registers. This enables RISC implementations to enjoy a regular pipeline such as the IF-ID-EX-MEM-WB pipeline you mention in your question. This constraint also simplifies memory accesses and exceptions. For example, if the only instructions to read data from memory are load instructions, and if these instructions have only a simple addressing mode like register+displacement, then a given instruction can incur at most one memory protection exception.
In contrast, CISC architectures typically permit rich operand addressing modes, such as register indirect, and indexed as in your question. Implementations of these architectures often have an irregular pipeline, which may stall as one or more memory accesses are incurred before the operands are available for the computation (ADD etc.).
Nonetheless, microarchitects have successfully pipelined CISC architectures. For example, the Intel 486 had a pipeline that enabled operands and results to be read/written to memory. So when implementing ADD [eax],42, there was a pipeline stage to read [eax] from the 8 KB d-cache, a pipeline stage to perform the add, and another pipeline stage to write-back the sum to [eax].
Since CISC instruction and operand usage is dynamically quite mixed and irregular, your pipeline design would either have to be rather long to account for the worst case, e.g. multiple memory reads to access operands and a memory write to write-back a result, or it would have to stall the pipeline to insert the additional memory accesses when necessary.
So to accommodate these CISCy addressing modes, you might need an IF-ID-EA-RD1-RD2-EX-WR pipeline (EA = effective address, RD1 = read operand 1, RD2 = read operand 2, WR = write result to RAM or register file).
Happy hacking.
As Jan Gray pointed out, the CISC instruction you mention, addw r9, (r2), 8(r3), does not map directly onto an IF-ID-EX-MEM-WB RISC pipeline.
But rather than creating an IF-ID-EA-RD1-RD2-EX-WR pipeline (which I don't think works for this case anyway, at least not in my notation), you might also consider breaking the CISC instruction up into RISC-like microinstructions:
tmp1 := load Memory[ r2 ]
tmp2 := load Memory[ 8+r3 ]
r9 := addw tmp1 + tmp2
With this uop (micro-operation) decomposition, the address computations for (r2) and 8(r3) would be done in their respective EX pipestages, and the actual memory accesses in and around the MEM pipestage.
As Jan mentions, the i486 had a different pipeline, a so-called load-op pipeline: IF-ID-AGU-MEM-EX-WB, where AGU is the address generation unit / pipestage.
This permits a different uop decomposition
tmp1 := load Memory[ r2 ]
r9 := addw tmp1 + load Memory[ 8+r3 ]
with the address computations (r2) and 8(r3) done in the AGU pipestage.
As Jan Gray mentioned above, the instruction you are trying to execute doesn't really map onto this pipeline. You need to load the data in the MEM stage and do arithmetic on it in the EX stage, which comes before MEM.
To answer another related question though, if you wanted to do:
Load R9, 8(R3)
The value for the address-mode operand - the effective address 8+R3 - is calculated in the EX stage, and the memory access itself happens in the MEM stage.