How can I make an atomic statement in TwinCAT 3 PLC?

I am working with a fast loop (0.5 ms cycle time) and a slow loop (10 ms cycle time) which communicate with each other.
How can I make the inputs and outputs consistent?
Consider the example below: I want the assignments in the SlowLoop to be atomic, so that both referenced inputs from the FastLoop correspond to values from the same cycle.
Example
FastLoop [0.5 ms]
FAST_CNT = some rising edge detection
FAST_RUNIDX += 1
SlowLoop [10 ms]
<-- Atomic Operation
pulseCount = FAST_CNT
elapsedTicks = FAST_RUNIDX
Atomic Operation -->

Any time something needs to be atomic, you need to handle it as a single object (a STRUCT or FUNCTION_BLOCK).
In this case, as there is no associated logic, a STRUCT should do the job nicely.
TYPE st_CommUnit :
STRUCT
    Count : UINT;
    Index : UINT;
END_STRUCT
END_TYPE
You can then expose this STRUCT as either an input or an output between tasks using %Q* and %I* addressing.
- Fast Task -
SourceData AT %Q* : st_CommUnit;
- Slow Task -
TargetData AT %I* : st_CommUnit;
Using this you end up with a linkable object such that you can link:
The entire unit
Each individual component

If you use two different tasks with different cycle times, possibly also running on different cores, you need a way to synchronize the two tasks when doing read/write operations.
To access the data atomically, use the synchronization function blocks that Beckhoff provides, for example FB_IecCriticalSection.
More info on the Beckhoff Infosys website:
https://infosys.beckhoff.com/english.php?content=../content/1033/tc3_plc_intro/45844579955484184843.html&id=
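As a rough sketch (not taken from the linked page), assuming FB_IecCriticalSection with its Enter and Leave methods from the Tc2_System library, a shared function block instance in a GVL visible to both tasks, and locally declared pulseCount/elapsedTicks variables in the slow task:
(* Shared declarations, e.g. in a GVL used by both tasks *)
VAR_GLOBAL
    fbLock      : FB_IecCriticalSection;
    FAST_CNT    : UINT;
    FAST_RUNIDX : UINT;
END_VAR

(* Fast task: update both values inside the critical section *)
fbLock.Enter();
FAST_CNT    := FAST_CNT + 1;    (* e.g. on a detected rising edge *)
FAST_RUNIDX := FAST_RUNIDX + 1;
fbLock.Leave();

(* Slow task: copy both values inside the same critical section, so the
   two reads cannot be interleaved with the fast task's writes *)
fbLock.Enter();
pulseCount   := FAST_CNT;
elapsedTicks := FAST_RUNIDX;
fbLock.Leave();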

Related

NSOperationQueue worse performance than single thread on computation task

My first question!
I am doing CPU-intensive image processing on a video feed, and I wanted to use OperationQueue. However, the results are absolutely horrible. Here's an example. Let's say I have a CPU-intensive operation:
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
    let startTime = DispatchTime.now().uptimeNanoseconds
    for i in data.indices { data[i] = data[i] &+ 1 }
    NSLog("\(DispatchTime.now().uptimeNanoseconds - startTime)")
}
It takes about 40ms on my laptop to execute. I time a hundred runs:
(1...100).forEach { i in run(i) }
They average about 42ms each, for about 4200ms total. I have 4 physical cores, so I try to run it on an OperationQueue:
var q = OperationQueue()
(1...100).forEach { i in
    q.addOperation {
        run(i)
    }
}
q.waitUntilAllOperationsAreFinished()
Interesting things happen depending on q.maxConcurrentOperationCount:
concurrency    single operation    total
1              45ms                4500ms
2              100-250ms           8000ms
3              100-300ms           7200ms
4              250-450ms           9000ms
5              250-650ms           9800ms
6              600-800ms           11300ms
I use the default QoS of .background and can see that the thread priority is the default (0.5). Looking at the CPU utilization with Instruments, I see a lot of wasted cycles (the first part is running it on the main thread, the second is running with OperationQueue):
I wrote a simple thread queue in C and used that from Swift and it scales linearly with the cores, so I'm able to get my 4x speed increase. But what am I doing wrong with Swift?
Update: I think we have concluded that this is a legitimate bug in DispatchQueue. The question then is: what is the correct channel for asking about issues in DispatchQueue code?
You seem to measure the wall-clock time of each run execution. This does not seem to be the right metric: parallelizing the problem does not mean that each run will execute faster; it just means that you can do several runs at once.
Anyhow, let me verify your results.
Your function run seems to take a parameter some of the time only. Let me define a similar function for clarity:
func increment(_ offset : Int) {
for i in data.indices { data[i] = data[i] &+ offset }
}
On my test machine, in release mode, this code takes 0.68 ns per entry or about 2.3 cycles (at 3.4 GHz) per addition. Disabling bound checking helps a bit (down to 0.5 ns per entry).
Anyhow. So next let us parallelize the problem as you seem to suggest:
var q = OperationQueue()
for i in 1...queues {
q.addOperation {
increment(i)
}
}
q.waitUntilAllOperationsAreFinished()
That does not seem particularly safe, but is it fast?
Well, it is faster... I hit 0.3 ns per entry.
Source code : https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/swift/opqueue
.background will run the threads with the lowest priority. If you are looking for fast execution, consider .userInitiated, and make sure you are measuring the performance with compiler optimizations turned on.
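As a minimal sketch of that suggestion (hypothetical code, not the poster's), the queue's quality of service can be raised before adding operations:
import Foundation

let q = OperationQueue()
q.qualityOfService = .userInitiated   // higher priority than the low .background level
q.maxConcurrentOperationCount = 4     // e.g. one operation per physical core
// then q.addOperation { ... } as before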
Also consider using DispatchQueue instead of OperationQueue. It might have less overhead and better performance.
Update based on your comments: try this. It goes from 38 s on my laptop to 14 s or so.
Notable changes:
I made the queue explicitly concurrent
I ran the thing in release mode
I replaced the inner-loop calculation with a random number, since the original got optimized out
I set the QoS to a higher level: QoS now works as expected, and with .background it runs forever
var data = [Int].init(repeating: 0, count: 1_000_000)
func run() {
    let startTime = DispatchTime.now().uptimeNanoseconds
    for i in data.indices { data[i] = Int(arc4random_uniform(1000)) }
    print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
}
let startTime = DispatchTime.now().uptimeNanoseconds
var g = DispatchGroup()
var q = DispatchQueue(label: "myQueue", qos: .userInitiated, attributes: [.concurrent])
(1...100).forEach { i in
    q.async(group: g) {
        run()
    }
}
g.wait()
print("\((DispatchTime.now().uptimeNanoseconds - startTime)/1_000_000)")
Something is still wrong though: a serial queue runs 3x faster even though it does not use all cores.
For the sake of future readers, two observations on multithreaded performance:
There is a modest overhead introduced by multithreading. You need to make sure that there is enough work on each thread to offset this overhead. As the old Concurrency Programming Guide says:
You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25.
And it goes on to say:
Although dispatch queues have very low overhead, there are still costs to scheduling each loop iteration on a thread. Therefore, you should make sure your loop code does enough work to warrant the costs. Exactly how much work you need to do is something you have to measure using the performance tools.
A simple way to increase the amount of work in each loop iteration is to use striding. With striding, you rewrite your block code to perform more than one iteration of the original loop.
You should be wary of using either operations or GCD dispatches to achieve multithreaded algorithms. This can lead to “thread explosion”. You should use DispatchQueue.concurrentPerform (previously known as dispatch_apply). This is a mechanism for performing loops in parallel, while ensuring that the degree of concurrency will not exceed the capabilities of the device.
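As a hedged sketch of both points, not the poster's code (it assumes a run(_:) function that takes the run index, as in the question's forEach loop), striding and concurrentPerform can be combined like this:
import Foundation

let totalRuns = 100
let chunkSize = 10   // stride: each parallel iteration performs 10 of the original runs
DispatchQueue.concurrentPerform(iterations: totalRuns / chunkSize) { chunk in
    for i in 0..<chunkSize {
        run(chunk * chunkSize + i)   // run(_:) is assumed to be the question's work function
    }
}
concurrentPerform blocks until all iterations have finished and limits the degree of concurrency to what the device can sustain, so no explicit queue or group bookkeeping is needed.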

Pause JModelica and Pass Incremental Inputs During Simulation

Hi Modelica Community,
I would like to run two models in parallel in JModelica, but I'm not sure how to pass variables between the models. One model is a Python model and the other is an EnergyPlusToFMU model.
The examples in the JModelica documentation have the full simulation-period inputs defined prior to simulating the model. I don't understand how one would configure a model that pauses for inputs, which is a key feature of FMUs and co-simulation.
Can someone provide me with an example or piece of code that shows how this could be implemented in JModelica?
Do I put the simulate command in a loop? If so, how do I handle warm up periods and initialization without losing data at prior timesteps?
Thank you for your time,
Justin
Late answer, but in case it is picked up by others...
You can indeed put the simulation into a loop; you just need to keep track of the state of your system so that you can re-initialize it at every iteration. Consider the following example:
Ts = 100
x_k = x_0
for k in range(100):
    # Do whatever you need to get your input here
    u_k = ...
    FMU.reset()
    FMU.set(x_k.keys(), x_k.values())
    sim_res = FMU.simulate(
        start_time=k*Ts,
        final_time=(k+1)*Ts,
        input=u_k
    )
    x_k = get_state(FMU, sim_res, -1)  # state at the last stored time point
Now, I have written a small function to grab the state, x_k, of the system:
# Get state names and their values at given index
def get_state(fmu, results, index):
    # Identify states as variables with a _start_ value
    identifier = "_start_"
    keys = fmu.get_model_variables(filter=identifier + "*").keys()
    # Now, loop through all states, get their value and put it in x
    x = {}
    for name in keys:
        x[name] = results[name[len(identifier):]][index]
    # Return state
    return x
This relies on setting the "state_initial_equations": True compile option.
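For reference, a minimal sketch of compiling with that option (the model name "MyModel" and the file "MyModel.mo" are placeholders):
# Compile the FMU with state_initial_equations enabled, so the
# "_start_<state>" parameters used by get_state() are generated.
from pymodelica import compile_fmu
from pyfmi import load_fmu

fmu_path = compile_fmu("MyModel", "MyModel.mo",
                       compiler_options={"state_initial_equations": True})
FMU = load_fmu(fmu_path)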

Slow Performance When Using Scalaz Task

I'm looking to improve the performance of a Monte Carlo simulation I am developing.
I first did an implementation which simulates each path sequentially, as follows:
def simulate() = {
    for (path <- 0 to 30000) {
        (0 to 100).foreach(
            x => // do some computation
        )
    }
}
This basically simulates 30,000 paths, and each path has 100 discretised random steps.
The above function runs very quickly on my machine (about 1s) for the calculation I am doing.
I then thought about speeding it up even further by making the code run in a multithreaded fashion.
I decided to use Task for this and I coded the following:
val simulation = (1 |-> 30000).map(n => Task {
    (1 |-> 100).map(x => /* do some computation */ )
})
I then use this as follows:
Task.gatherUnordered(simulation).run
When I kick this off, I know my machine is doing a lot of work, as I can see that in the Activity Monitor and the machine's fan is going ballistic.
After about two minutes of heavy activity, the work it seems to be doing finishes, but I don't get any value returned (I am expecting a collection of Doubles from each task that was processed).
My questions are:
Why does this take longer than the sequential example? I am more than likely doing something wrong but I can't see it.
Why don't I get any returned collection of values from the tasks that are apparently being processed?
I'm not sure why Task.gatherUnordered is so slow, but if you change Task.gatherUnordered to Nondeterminism.gatherUnordered everything will be fine:
import scalaz.Nondeterminism
Nondeterminism[Task].gatherUnordered(simulation).run
I'm going to create an issue on GitHub about Task.gatherUnordered. This definitely should be fixed.

What is the benefit of automatic variables?

I'm looking for the benefits of "automatic" in SystemVerilog.
I have seen the "automatic" factorial example, but I can't get through it. Does anyone know why we use "automatic"?
Traditionally, Verilog has been used for modelling hardware at the RTL and gate-level abstractions. Since both RTL and gate-level abstractions are static/fixed (non-dynamic), Verilog supported only static variables. So, for example, any reg or wire in Verilog would be instantiated/mapped at the beginning of simulation and would remain mapped in the simulation memory till the end of simulation. As a result, you can take a dump of any wire/reg as a waveform, and the reg/wire would have a value from the beginning till the end, since it is always mapped. From a programmer's perspective, such variables are termed static. In the C/C++ world, to declare such a variable you would use the storage class specifier static. In Verilog, every variable is implicitly static.
Note that until the advent of SystemVerilog, Verilog supported only static variables. Even though Verilog also supported some constructs for modelling at the behavioural abstraction, the support was limited by the absence of an automatic storage class.
automatic (called auto in the software world) storage class variables are mapped on the stack. When a function is called, all the local (non-static) variables declared in the function are mapped to individual locations on the stack. Since such variables exist only on the stack, they cease to exist as soon as the execution of the function is complete and the stack correspondingly shrinks.
Amongst other advantages, one possibility that this storage class enables is recursive functions. In the Verilog world, a function cannot be re-entrant. Recursive (or re-entrant) functions do not serve any useful purpose in a world where the automatic storage class is not available. To understand this, you can imagine a re-entrant function as a function which dynamically makes multiple recursive instantiations of itself. Each instance gets its automatic variables mapped on the stack. As we progress into the recursion, the stack grows and each function call gets to make its computations using its own set of variables. When the function calls return, the computed values are collated and a final result is made available. With only static variables, each function call would store the variable values at the same common locations, thus erasing any benefit of having multiple calls (instantiations).
Coming to the factorial algorithm, it is relatively easy to conceptualize factorial as a recursive algorithm. In maths we write factorial(n) = n * factorial(n-1). So you need to calculate factorial(n-1) in order to know factorial(n). Note that the recursion cannot complete without a terminating case, which in the case of factorial is n = 1.
function automatic int factorial;
    input int n;
    if (n > 1)
        factorial = factorial(n - 1) * n;
    else
        factorial = 1;
endfunction
Without the automatic storage class, since all variables in a function would be mapped to fixed locations, when we call factorial(n-1) from inside factorial(n), the recursive call would overwrite any variable inside the caller's context. In the factorial function as defined in the above code snippet, if we do not specify the storage class as automatic, both n and the result factorial would be overwritten by the recursive call to factorial(n-1). As a result, the variable n would consecutively be overwritten as n-1, n-2, n-3 and so on till we reach the terminating condition of n = 1. The terminating recursive call to factorial would have a value of 1 assigned to n, and when the recursion unwinds, factorial(n-1) * n would evaluate to 1 at each stage.
With the automatic storage class, each recursive function call has its own place in memory (actually on the stack) to store the variable n. Consecutive calls to factorial therefore do not overwrite the caller's variable n, and when the recursion unwinds, we get the right value for factorial(n), namely n * (n-1) * (n-2) * ... * 1.
Note that it is possible to define factorial using iteration as well, and that can be done without use of the automatic storage class. But in many cases, recursion makes it possible for the user to code algorithms in a more intuitive fashion.
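For comparison, a minimal sketch of such an iterative version (factorial_iter is just an illustrative name, not part of the original example); it works with the default static storage because it never re-enters itself:
function int factorial_iter;
    input int n;
    int i;
    begin
        factorial_iter = 1;
        for (i = 2; i <= n; i = i + 1)
            factorial_iter = factorial_iter * i;
    end
endfunction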
I propose the examples below (using fork...join_none):
Ex.1 (without automatic): the output will be "3 3 3 3", because the spawned processes take the latest value of i after the for loop has exited; i is stored in a static memory location. This may be a bug in your code.
initial begin
    for (int i = 0; i <= 3; i++)
        fork
            $write("%d ", i);
        join_none
end
Ex.2 (using automatic): the output will be "0 1 2 3", because in each loop iteration the value of i is copied to k, and fork...join_none spawns a thread with each value of k (each iteration allocates its own memory location for k: k0, k1, k2, k3):
initial begin
    for (int i = 0; i <= 3; i++)
        fork
            automatic int k = i;
            $write("%d ", k);
        join_none
end
Another example is the use of fork...join_none inside a for loop.
Without the use of automatic, the fork...join_none inside the for loop will not work correctly.
for (int i = 0; i < `SOME_VALUE; i++) begin
    automatic int id = i;
    fork
        task/function using the id above ;
        ...
    join_none
end

Convert unsigned int to Time in System-verilog

In a large part of my SystemVerilog code I have used parameters to define different waiting times, such as:
int unsigned HALF_SPI_CLOCK = ((SYSTEM_CLK_PERIOD/2)*DIVISION_FACTOR); //DEFINES THE TIME
Now, since I define a timescale in my files, I can directly use these parameters to introduce waiting cycles:
`timescale 1ns/1ns
initial begin
#HALF_SPI_CLOCK;
end
Now I want to have time-specified delays everywhere, meaning that the simulation will still respect all the timings even if I change the timescale. I would like to keep the parameters, but wherever I have a wait statement, I need to specify the time. Something like
#(HALF_SPI_CLOCK) ns;
But this is not accepted by ModelSim. Is there a way to cast a parameter or an unsigned int to a variable of type time in SystemVerilog? Is there a way to specify the time unit? I have looked around but could not find any workaround.
The reason why I want to have control over the time and make it independent of the timescale is that I intend to change the timescale later to try to make my simulation faster.
Other recommendations or thoughts are very welcome.
It is possible to pass time as a parameter in SystemVerilog, like:
module my_module #(time MY_TIME = 100ns);
    initial begin
        #MY_TIME;
        $display("[%t] End waiting", $time);
    end
endmodule
Or use multiplication to get the right time units, like:
module my_module2 #(longint MY_TIME = 100);
    initial begin
        #(MY_TIME * 1us);
        $display("[%t] End waiting 2", $time);
    end
endmodule
See runnable example on EDA Playground: http://www.edaplayground.com/x/m2
This will simulate and do what you want. While not the most elegant, it works.
task wait_ns(int num);
repeat (num) #1ns;
endtask
...
wait_ns(HALF_SPI_CLOCK);
This could have a negative impact on simulation speed, depending on how the timescale, clock events, and the unit of delay relate to each other.