How to reduce the latency of the Verilog HDL code (hardware) generated by the MATLAB HDL Coder add-on from a given Simulink model? - matlab

Thanks in advance,
I have a simple Simulink model that takes in a 32-bit number in IEEE-754 format and adds it to itself, giving an output that is again a 32-bit IEEE-754 number. I used MATLAB's HDL Coder add-on to generate the Verilog HDL code for it. When I wrote a testbench for the generated code, I found that its latency is 100 ns. Is there a way to reduce this further, say to 10 ns?
Below I am attaching the Simulink model I used to generate the Verilog HDL code, along with the generated Verilog files. Also, I am attaching a screenshot of the simulation in case you don't want to waste your time running the scripts
Link to download the files

my point is how to use pipeline settings before conversion
I am assuming that "pipeline settings" is a MATLAB HDL generator parameter.
Basically what you do is "try": use a pipeline setting and synthesize the code. If you have slack you can:
Reduce the number of pipeline stages.
Increase the clock frequency.
(For negative slack you use the inverse methods)
Now here is where things get tricky:
Most of the time you can't really speed things up. A given piece of functionality needs a certain time to calculate. Some algorithms can be sped up by using more parallel resources, but only up to a limit. An adder is a good example: you can have ripple carry, carry look-ahead and more advanced techniques, but you cannot speed it up infinitely. (Otherwise CPUs these days would be running at terahertz.)
I suspect that in the end you will find it takes some time T to do your IEEE-754 addition. That can be X clock cycles of an A MHz clock or Y clock cycles of a B MHz clock, but X divided by A is about the same as Y divided by B.
What you can do is pump lots of calculations into your pipe so a new one comes out every clock cycle. But the latency will still be there.
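To put numbers on the latency versus throughput point, here is a minimal Python sketch. The stage counts and clock frequencies are hypothetical values picked to match the 100 ns from the question; they are not taken from the generated HDL.

```python
def pipeline_timing(stages, clock_mhz):
    """Return (latency_ns, results_per_us) for a fully pipelined unit."""
    period_ns = 1000.0 / clock_mhz      # one clock period in nanoseconds
    latency_ns = stages * period_ns     # time for one input to reach the output
    results_per_us = clock_mhz          # one result per cycle once the pipe is full
    return latency_ns, results_per_us


def total_time_ns(stages, clock_mhz, n_inputs):
    """Time to push n_inputs back-to-back through the pipeline."""
    period_ns = 1000.0 / clock_mhz
    return (stages + n_inputs - 1) * period_ns


# Hypothetical configurations: 5 stages at 50 MHz versus 10 stages at 100 MHz.
shallow = pipeline_timing(stages=5, clock_mhz=50.0)
deep = pipeline_timing(stages=10, clock_mhz=100.0)
print(shallow, deep)
```

Both configurations have the same latency, but the deeper pipeline delivers twice as many results per microsecond once it is full.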


What is the best practice to enable mixed sample time in Simulink model

An outside library (from PreScan) requires 200 Hz, while my control plant model needs to run at 100 Hz. How can I coordinate these two rates? My concern is that if I use 200 Hz in Simulink, it may compromise my control plant's fidelity.
Is it possible to set the Simulink time step to 1/100 s while keeping the outside library running at 200 Hz?
Simulink works perfectly happily with multi-rate models. The thing (it appears) that you don't understand is the difference between the overall model sample rate - i.e. the settings of your solver - and the sample rate of individual blocks within your model.
It's very typical to have some blocks in your model sampled at, say, 100 Hz, while other parts of your model are sampled at 200 Hz. In this case you would choose a discrete solver and give it a fixed step size of 1/200 s (0.005 s, i.e. 200 Hz). The 200 Hz blocks would get executed at every solver time step, while the 100 Hz blocks would get executed at every second solver time step.
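The idea can be sketched outside Simulink in a few lines of Python. This is only an illustration of how a base rate and slower block rates relate, not how Simulink implements its scheduler; the block names are made up.

```python
def schedule(base_rate_hz, block_rates_hz, n_steps):
    """List which blocks execute at each base-rate step.

    block_rates_hz maps a (hypothetical) block name to its sample rate.
    Each block rate must divide the base rate evenly.
    """
    plan = []
    for step in range(n_steps):
        active = []
        for name, rate in block_rates_hz.items():
            if base_rate_hz % rate != 0:
                raise ValueError(f"{rate} Hz does not divide {base_rate_hz} Hz")
            divider = base_rate_hz // rate   # run every 'divider' base steps
            if step % divider == 0:
                active.append(name)
        plan.append(active)
    return plan


# Base step of 1/200 s: the library runs every step, the plant every second step.
plan = schedule(200, {"prescan_lib": 200, "plant": 100}, n_steps=4)
for step, active in enumerate(plan):
    print(f"t = {step / 200:.3f} s: {active}")
```

The plant still sees a 0.01 s sample time here, so running the solver at 200 Hz does not by itself change the rate at which the plant is evaluated.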
You should look at the Sample Times in Systems section of the documentation.
You can use both explicit and implicit rate control in Simulink.
Sample Time
To set the fundamental sample time go to: Configuration Parameter>Solver>Fixed-step size. You can also use the Simulink API:
get_param(bdroot, 'FixedStep')
set_param(bdroot, 'FixedStep', '0.005') % 200Hz
To activate the Sample Time colors go to: Display>Sample Time>All. The Sample Time Legend will help you understand how the implicit rate control works.
Sample time Option
You can control the tasking and sample time options via: Configuration Parameter>Solver>Tasking and sample time options.
To begin with, you can activate the automatic handling of rate transitions for data transfer. Then you should check what colors your model elements have and place Rate Transition blocks on the signal lines between model elements with different sample rates.
In that setup the rate control is implicit. If you instead use function calls to explicitly call your subsystems at a required rate from a predefined scheduler, then the rate control is explicit.
You can open the built-in Simulink examples to see how it works.

Can I measure the speedup from parallelization in matlab?

If I assume that a problem is a candidate for parallelization, e.g. matrix multiplication or some other problem, and I use an Intel i7 Haswell dual-core, is there some way I can compare a parallel execution to a sequential version of the same program, or will MATLAB optimize a program for my architecture (dual-core, quad-core, ...)? I would like to know the speedup from adding more processors, using a good benchmark parallel program.
Unfortunately there is no such thing as a benchmark parallel program. If you measure a speedup for a benchmark algorithm, that does not mean that all algorithms will benefit from parallelization.
Since your target architecture has only 2 cores, you might be better off avoiding parallelization altogether and letting MATLAB and the operating system optimize the execution. Anyway, here are the steps I followed.
Determine if your problem is apt for parallelization by calculating the theoretical speedup. Some problems like matrix multiplication or Gauss elimination are well studied. Since I assume your problem is more complicated than that, try to decompose your algorithm into simple blocks and determine, block-wise, the advantages of parallelization.
If you find that several parts of your algorithm could profit from parallelization, study those parts separately.
Obtain statistical information of the runtime of your sequential algorithm. That is, run your program X number of times under similar conditions (and similar inputs) and average the running time.
Obtain statistical information of the runtime of your parallel algorithm.
Measure with the profiler. Many people recommend using functions like tic and toc. The profiler will give you a more accurate picture of your running times, as well as detailed information per function. See the documentation for detailed information on how to use the profiler.
Don't forget to take into account the time MATLAB takes to open the pool of workers (I assume you are working with the Parallel Computing Toolbox). Depending on your number of workers, the pool takes more or less time to open, and on some occasions it can take up to 1 minute (2011b)!
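As a rough illustration of the first and third steps (theoretical speedup, then averaged timings), here is a small Python sketch. It uses Amdahl's law for the theoretical bound, which is my choice of model and not something specific to MATLAB; `mean_runtime` is a hypothetical helper, and in MATLAB you would use the profiler or tic/toc instead.

```python
import statistics
import time


def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the runtime parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def mean_runtime(func, repeats=5):
    """Average wall-clock time of func() over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


# If half of the runtime parallelizes perfectly, 2 cores give at most 1.33x.
print(amdahl_speedup(0.5, 2))
# Even with all of it parallel, 2 cores cannot do better than 2x.
print(amdahl_speedup(1.0, 2))
```

The measured speedup is then `mean_runtime(sequential_version) / mean_runtime(parallel_version)`, with the pool start-up inside the timed region if it is part of what you pay in practice.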
You can try "Run and time" feature on MATLAB.
Or simply put a tic at the start and a toc at the end of your code.
Matlab provides a number of timing functions to help you assess the performance of your code: go read the documentation here and select the function that you deem most appropriate in your case! In particular, be aware of the difference between tic toc and the cputime function.

Difference between data processed by CPU and GPU in openCL

I have some matlab code that uses several large MEX functions and I want to speed things up by using openCL ( I am replacing parts of code of the MEX functions with openCL code using openCL API ). I've translated a small part of the code into an openCL kernel and I am already facing difficulties.
Some elements of the resulting matrix after execution on GPU are different from the corresponding elements of the resulting matrix when the original MEX function is called and the error is less than 0.01. This leads to a small error in the final result but I fear the error will accumulate as I translate more code.
This is probably related with different precision of the calculations on CPU and GPU. Does anyone know how to ensure the same precision? I am running 64 bit matlab R2012b on Ubuntu 12.04. The hardware I am using is Intel Core2 Duo E4700 and NVIDIA GeForce GT 520.
The small differences between results on your CPU and GPU are easily explained as arising from differences in floating-point precision if you have modified your code from using double precision (64-bit) f-p numbers on the CPU to using single-precision (32-bit) f-p numbers on the GPU.
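To get a feel for the size of this effect, here is a small Python sketch that emulates single precision by rounding every intermediate result to 32 bits with the standard `struct` module. The accumulation of 0.1 is just an illustrative workload, not the computation from the question.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value, count, single):
    """Add 'value' to a running total 'count' times.

    With single=True every intermediate result is rounded to 32 bits,
    which mimics a kernel that works in float instead of double.
    """
    step = to_single(value) if single else value
    total = 0.0
    for _ in range(count):
        total = total + step
        if single:
            total = to_single(total)
    return total


err_double = abs(accumulate(0.1, 10000, single=False) - 1000.0)
err_single = abs(accumulate(0.1, 10000, single=True) - 1000.0)
print(err_double, err_single)
```

The single-precision total drifts much further from 1000 than the double-precision one, and the gap grows with the number of operations.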
I would not call this difference an error, rather it is an artefact of doing arithmetic on computers with floating-point numbers. The results you were getting on your CPU-only code were already different from any theoretically 'true' result. Much of the art of numerical computing is in keeping the differences between theoretical and actual computations small enough (whatever the heck that means) for the entire duration of a computation. It would take more time and space than I have now to expand on this, but surprises arising from lack of understanding of what floating-point arithmetic is, and isn't, are a rich source of questions here on SO. Some of the answers to those questions are very illuminating. This one should get you started.
If you have taken care to use the same precision on both CPU and GPU then the differences you report may be explained by the non-associativity of floating-point arithmetic: in floating-point arithmetic it is not guaranteed that (a+b)+c == a+(b+c). The order of operations matters; if you have any SIMD going on I'd bet that the order of operations is not identical on the two implementations. Even if you haven't, what have you done to ensure that operations are ordered the same on both GPU and CPU?
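A quick way to convince yourself, sketched in Python with ordinary double-precision floats (the values are arbitrary examples):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c     # group the first two terms
right = a + (b + c)    # group the last two terms
print(left == right)   # the two groupings do not agree

# Swapping the operands of a single addition is harmless.
print((a + b) == (b + a))


def running_sum(values):
    """Plain left-to-right summation, like a sequential CPU loop."""
    total = 0.0
    for v in values:
        total = total + v
    return total


# Same numbers, different order, different result.
forward = running_sum([1e16, 1.0, -1e16])
reordered = running_sum([1e16, -1e16, 1.0])
print(forward, reordered)
```

A parallel reduction on a GPU adds the elements in a different order than a sequential loop, so this kind of discrepancy is expected even at identical precision.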
As to what you should do about it, that's rather up to you. You could (though I personally wouldn't recommend it) write your own routines for doing double-precision f-p arithmetic on the GPU. If you choose to do this, expect to wave goodbye to much of the speed-up that the GPU promises.
A better course of action is to ensure that your single-precision software provides sufficient accuracy for your purposes. For example, in the world I work in our original measurements from the environment are generally not accurate to more than about 3 significant figures, so any results that our codes produce have no validity after about 3 s-f. So if I can keep the errors in the 5th and lower s-fs that's good enough.
Unfortunately, from your point of view, getting enough accuracy from single-precision computations isn't necessarily guaranteed by globally replacing double with float and recompiling. You may (generally would) need to implement different algorithms, ones which take more time to guarantee more accuracy and which do not drift so much as computations proceed. Again, you'll lose some of the speed advantage that GPUs promise.
A common problem is that floating-point values are kept in an 80-bit CPU register instead of being truncated and stored each time. In these cases, the additional precision leads to deviations, so you may want to check what options your compiler offers to counter such issues. It can also be interesting to compare the results of release and debug builds.

Reducing calculation time for derivative blocks in SimMechanics

I have a program in SimMechanics that uses 6 derivative blocks (du/dt). It takes about 24 hours to do 10 secs of simulation. Is there any way to reduce the calculation time of the Simulink derivative blocks?
You don't say what your integration time step is. If it's on the order of milliseconds, and you're simulating a 10 sec total transient time, that means 10,000 time steps.
The stability limit of the time step is determined by the characteristics of the dynamic system you're simulating.
It's also affected by the integration scheme you're using. Explicit integration is well-known to have stability problems for larger time steps, so if you're using an Euler method of integration you'll be forced to use a small time step.
Maybe you can switch your integration scheme to an implicit method, a 5th-order Runge-Kutta method with error control, or Bulirsch-Stoer. See your documentation for details.
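The step-size limit of an explicit scheme is easy to see on the standard test equation y' = -lam*y. The sketch below is plain Python, not Simulink's solver code, and the values of `lam` and `h` are hypothetical.

```python
def explicit_euler(lam, h, steps, y0=1.0):
    """Forward Euler for y' = -lam * y; stable only if h < 2 / lam."""
    y = y0
    for _ in range(steps):
        y = y + h * (-lam * y)
    return y


def implicit_euler(lam, h, steps, y0=1.0):
    """Backward Euler for y' = -lam * y; stable for any h > 0."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + h * lam)   # solves y_new = y + h * (-lam * y_new)
    return y


lam = 1000.0   # fast dynamics, stability limit of forward Euler is h < 0.002

print(explicit_euler(lam, h=0.01, steps=20))    # step too large: blows up
print(implicit_euler(lam, h=0.01, steps=20))    # same step: decays
print(explicit_euler(lam, h=0.0005, steps=20))  # small step: decays
```

The true solution decays to zero, so the first result is pure numerical instability. An implicit method lets you keep the larger step, at the cost of solving an equation at every step.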
You've given no useful information about the physics of the system of interest, the size of the model, or your simulation choices, so all this is an educated guess on my part.
Runge-Kutta methods (ode45 and ode23 in MATLAB) are not always the best choice for mechanical problems, because they perform best in a variable-step setup. Consider moving to a fixed-step setup and selecting the solver according to the order of error you can accept. Refer to the MATLAB documentation (and some Numerical Analysis texts too, :-) ) for deeper detail.
Consider also whether your problem needs a solver designed for stiff systems. Huge constant terms can drive your solver to instability if not properly handled.

Running a Simulink xPC block at a faster rate than the continuous rate

I have a Simulink xPC target application that has blocks with discrete states at several different sample rates and some sections using continuous states. My intention on keeping the continuous states is for better numerical integration.
What creates the problem: one block is reading a device at a very fast rate (500 Hz). The rest of the application can and should run at a slower rate (say, 25 or 50 Hz), because it would be overkill to run it at the highest rate and because the processor simply cannot squeeze a full application cycle into the 0.002 s of the faster rate. So I need both rates. However, in Simulink the continuous states by definition run at the fastest discrete rate of the whole application! This means that everywhere I have continuous states they are now forced to run at 500 Hz when 25 Hz would do!
Is there a way to force the continuous states in xPC target to a rate that is not the fastest in the application? Or alternatively, is there a way to allow certain block to run at a faster speed than the rest of the application?
You are thinking about continuous solvers in the wrong way - continuous doesn't only mean that it's run as fast as possible - it uses a fundamentally different algorithm to solve the equations than discrete. Due to this, they must be run at least as fast as the discrete solvers.
From Using Simulink:
Continuous solvers use numerical integration to compute a model's continuous states at the current time step from the states at previous time steps and the state derivatives. Continuous solvers rely on the model's blocks to compute the values of the model's discrete states at each time step.
Mathematicians have developed a wide variety of numerical integration techniques for solving the ordinary differential equations (ODEs) that represent the continuous states of dynamic systems. Simulink provides an extensive set of fixed-step and variable-step continuous solvers, each implementing a specific ODE solution method (see Solvers).
Discrete solvers exist primarily to solve purely discrete models. They compute the next simulation time step for a model and nothing else. They do not compute continuous states and they rely on the model's blocks to update the model's discrete states.
So the upshot is that no it's not good enough to have the continuous run more slowly than the fastest discrete solvers - otherwise they are, by definition, not continuous. You should reconsider why you are specifying them as continuous.
What are you trying to accomplish by slowing down the continuous solvers? Is this a simulation time/performance issue?
My take on this is that it cannot be done. One way to approach this is to replace the continuous states by discrete ones (perhaps at an intermediate rate, say 100 Hz), and cross my fingers that the loss of precision is bearable.
Maybe it's possible to isolate a block and run it separately at a faster rate somehow, but I don't know.
Truly continuous computation is impossible in a digital processor such as your computer's.
What MATLAB/Simulink means by "continuous" is "I will (dynamically) try to guess what discrete step size is small enough so that discretization error is very small in your application".
If you already know, by knowing your application, that 20ms (50Hz) would be small enough, then use discrete - 50Hz.
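One way to check whether a given fixed rate is "small enough" is to discretize a representative piece of your dynamics at the candidate rates and compare against a known solution. A minimal Python sketch, assuming a first-order lag with a hypothetical time constant of 0.5 s driven by a unit step:

```python
import math


def euler_final_error(rate_hz, tau=0.5, duration=1.0):
    """Forward-Euler error at t = duration for tau * x' = 1 - x, x(0) = 0."""
    h = 1.0 / rate_hz
    steps = round(duration * rate_hz)
    x = 0.0
    for _ in range(steps):
        x = x + h * (1.0 - x) / tau
    exact = 1.0 - math.exp(-duration / tau)   # analytic step response
    return abs(x - exact)


errors = {rate: euler_final_error(rate) for rate in (25, 50, 500)}
for rate, err in errors.items():
    print(f"{rate:4d} Hz: error {err:.5f}")
```

If the error at 50 Hz (or at the intermediate 100 Hz mentioned above) is acceptable for your application, the discrete replacement is probably fine. Your real dynamics will differ from this toy system, so treat the numbers only as an indication of the trend.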