Suggestions for optimising FPGA design - matlab

I need to figure out optimisations for this FPGA design. I've got a few ideas and I'd like to know if they sound reasonable for my design. I'd also like to ask if anyone has any other ideas to improve my designs efficiency.
The design I have to optimise is an ensemble of neurons, I've included two images below.
My current ideas
Add pipeline registers between each neuron and each adder
'Register the inputs and outputs' by inserting registers in-between each logic block
Convert the adder tree into an adder chain
Use time division multiplexers to share the LUT's between logic blocks
Do my current ideas for improving performance make any sense? I don't know very much about FPGA's at all so I'm not sure if my optimisations will make much of an improvement or if they even make sense.
Any help would be greatly appreciated.
Links to PDFs of my neuron and emsemble (The image quality is higher):
https://francismcnamee.com/pdfs/neuron_ensemble.pdf
https://francismcnamee.com/pdfs/single_neuron.pdf
Ensemble of neurons (Each subsystem is a single neuron, the design of each neuron is shown below)
A single neuron

To start with "less area usage and/or faster speed." forget about and: you can optimise of area or speed. Both will not work.
Use time division multiplexers to share the LUT's between logic blocks"
Multiplexers are also build from LUTs so you loose area before you gain some. Then the TDM needs to have a controller, the interim results need to be stored and retrieved. All and all it is not trivial and I would only do that if you are rather good at logic design. You may gain area, but you will lose speed.
Convert the adder tree into an adder chain.
No, you don't touch the adder tree. The FPGA synthesis tool will select the optimal adder configuration for you. It will balance area against speed and come up with something much better result then you can for yourself.
In fact this applies to every part of the design:let the synthesis tool do its work. You will not be able to outperform it.
Add pipeline registers between each neuron and each adder
Register the inputs and outputs' by inserting registers in-between each logic block
Sorry again but: Nope! Working with registers is not that simple.
You need to balance the registers. Ideally the logic delay between each pipeline stage should be the same.
Lets say you multiply takes** 10nS. The adder takes 3nS. Then you should place a pipeline stages after a set of 3 adders. The delay will be ~20nS. If you placed an pipeline stage after each adder, the total delay would be ~40nS.
Now you get to the core of speeding up a design: do you use 4 pipeline stages so you can run at 200 MHz or 2 pipeline stages and run at 100MHz? In both cases the throughput is the same.
Beware that each register stage also cost you time: you need to meet the set-up time of the register. As such the fastest design is the one with no registers: the data falls through at the maximum speed. But then you may need to wait a long time before you can present the next set of data.
As you may gather: balancing registers is not easy and is rather an art. The best way would be to run the design without any registers through the synthesis tool. Then run a timing analysis on it and look at the worst-case timing path. From that try to figure out where to put the register stages. But again, that is easier said then done. To me reading those timing analysis reports is easy, but for a novice they might seem all abracadabra.
Sorry if I let you hanging here but unfortunately there is no "magic trick" in these cases. Ideally you could let an experienced design play a few hours with your code and see what (s)he can do.
**The numbers I use have been made up

Related

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycle. But phantomjs is headless, rendering in real browser is certainly different. Perhaps it's impossible to do real measurements.. but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall and the processor. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, and the browser completely static (i.e. not doing anything). You need to make sure that everything is stead state (SS) meaning start your measurements only after several minutes of idle.
Measure the usage doing the operation you want. Again, you want to avoid any start up and stopping work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue since the browser will execute any termination code after you finish your measurement.
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal to noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is well in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
Also, make sure you understand the underlying electronics. For example, power is VA (voltage*amperage) where both V and A are in phase. I don't think this will be an issue since I'm pretty sure they are in phase for computers. Also, any decent power meter understands the difference.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.

Scala streaming peak detection with reactive events

I am trying to work out the best way to structure an application that in essence is a peak detection program. In my line of work I have been given charge of developing a system that essentially is looking at pulses in a stream of data and doing calculations on the peak data.
At the moment the software is implemented in LabVIEW. I'm sure many of you on here would understand why I'd love to see the end of that environment. I would like to redesign this in Scala (and possibly use Play if I was to make it use a web frontend) but I am not sure how best to approach the initial peak-detection component.
I've seen many tutorials for peak detection in various languages and I understand from a theoretical perspective many of the algorithms. What I am not sure is how would I approach this from the most Scala/Play idiomatic way?
Obviously I don't expect someone to write the code for me but I would really appreciate any pointers as to the direction I should take that makes the most sense. Since I cannot be too specific on the use case I'll try to give an overview of what I'm trying to do below:
Interfacing with data acquisition hardware to send out control voltages and read back "streams" of data.
I should be able to work the hardware side out, but is there a specific structure that would be best for the returned stream? I don't necessarily know ahead of time how much data I'll be reading so a stream that can be buffered and chunked would probably be appropriate.
Scan through the stream to find peaks and measure their height and trigger an event.
Peaks are usually about 20 samples wide or so but that depends on sample rate so I don't want to hard-code anything like that. I assume a sliding window would be necessary so peaks don't get "cut off" on the edge of a buffer. As a peak arrives I need to record and act on it. I think reactive streams and so on may be appropriate but I'm not sure. I will be making live graphs etc with the data so however it is done I need a way to send an event immediately on a successful detection.
The streams can be quite long and are at high sample-rates (minimum of 250ksamples per second) so I'd prefer not to have to buffer the entire stream to memory. The only information that needs to be permanent is the peak voltage data. I will need a way to visualise the raw stream for calibration purposes but I imagine that should be pretty simple.
The full application is much more complex and I'll need to do some initial filtering of noise and drift but I believe I should be able to work that out once I know what kind of implementation I should build on.
I've tried to look into Play's Iteratees and such but they are a little hard to follow. If they are an appropriate fit then I'm happy to work on learning them but since I'm not sure if that is the best way to approach the problem I'd love to know where I should look.
Reactive frameworks and the like certainly look interesting and I can see how I could really easily build the rest of the application around them but I'm just not sure how best to implement a streaming peak detection function on top of them beyond something simple like triggering when a value is over a threshold (as mentioned previously a "peak" can be quite wide and the signal is noisy).
Any advice would be greatly appreciated!
This is not a solution to this question but I'm writing this as an answer because of space/formatting limitations in the comments section.
Since you are exploring options I would suggest the following:
Assuming you have a large enough buffer to keep a window of data in memory (W=tXw) you can calculate the peak for the buffer using your existing algorithm. Next you can collect the next few samples data in a delta buffer (d) (a much smaller window). The delta buffer is the size of your increment. Assuming this is time series data you can easily create the new sliding window by removing the first delta (dXt) values from the buffer W and adding d values to the buffer. This is how Spark-streaming implements reduceByWindow function on a DStream. Iteratee can also help here.
If your system is distributed then you can use stream processing systems (Storm, Spark-streaming) to get better latency and throughput at the cost of distributing the system.
If you are really resource constrained and can live approximate results that bounded I would suggest you look at implementing a combination of probabilistic data structures such as count-min-sketch, hyperloglog and bloom filter.

Signal processing or algorithmic programming for a PLC

I have an application that takes voltages and temperatures as analog inputs and does some processing using an algorithm which involves signal processing such as low-pass filtering,
exponential smoothing, and other steps which might typically be done in a high-level programming language such as C or C++.
I'm curious how I could perform these same steps using a PLC, and in particular, the Allen-Bradley Control-Logix system? It seems to me that the instruction set with ladder logic is too limited for this. Could I perform this using structured text?
Ladder logic can do the computation just fine, although it isn't the nicest programming language in the world. It has a full complement of conditionals, arithmetic, arrays, etc.
Your real problem is fitting your computation into the cyclic execution model that most ladder logic engines (and Control Logix) run: a repeated execution of the program in the control from top to bottom, with each rung or computation being executed just once per scan.
If you need to loop over a set of values repeatedly before producing a result, you will likely have difficulty resolving the ladder engines' desire to execute everything "just once" per scan, and your need to execute a loop to produce a result. I believe in fact that there are FOR loop operators that can repeat a block of ladder code just as conventional loop; you need to ensure that the amount of time spent in your loops/algorithm don't materially affect the scan rate.
What may work well is for you to let the the scan rate act as one of your loops; typically you compute a filter by accepting a new value into an array and then computing a result over that array. Since you basically can't accept values faster than one-per-scan-cycle anyway, you can compute at-most-one-filter-result per scan cycle without losing any precision. If your array is of modest size (e.g., 10 values), you can in effect simply code a polynomial over the array as an equation to produce your filter result, and then code that polynomial (klunkily but straightforwardly) as ladder logic.
Control Logix PLCs do not have to execute on a cyclic sweep. I don't have RSLogix 5000 in front of me right now, but when defining the project, you are required to create one Program that executes on a cyclic sweep. But you can create other programs that do not. You can also run them off a trigger (not useful for regular input scanning) or off a fixed timer (very useful for input scanning). Keep in mind that there is no point in setting the input scan timer faster than your instrumentation updates-modern PLCs can frequently execute a scan much faster than a meter can update the data.
One good technique I have used for this is to create a program called one-second or something similar. This program will scan all your inputs, and perform all your signal processing, and then write to buffered memory locations. The rest of your program looks at those buffered memory locations, and never monitors the inputs directly. You can set the input-buffering program to execute as fast as needed for your process, up to whatever the PLC can handle before it faults.
It would also be a good idea to write your signal processing functions them selves as Add On Instructions, and then call them, with whatever parameters you need.
So you could have an AOI with a call interface like this:
input-1_buffered := input_smooth (low_pass, input-1);
This would call your input_smooth function, using input-1 as the value and input-1_buffered as the final location. low_pass would be used within the input_smooth function to jump to the appropriate logic.
Then you can write your actual smoothing logic in structured text, without anyone needing to understand it, because it will only exist in that one AOI.

Estimating possible # of actors in Scala

How can I estimate the number of actors that a Scala program can handle?
For context, I'm contemplating what is essentially a neural net that will be creating and forgetting cells at a high rate. I'm contemplating making each cell an actor, but there will be millions of them. I'm trying to decide whether this design is worth pursuing, but can't estimate the limits of number of actors. My intent is that it should totally run on one system, so distributed limits don't apply.
For that matter, I haven't definitely settled on Scala, if there's some better choice, but the cells do have state, as in, e.g., their connections to other cells, the weights of the connections, etc. Though this COULD be done as "Each cell is final. Changes mean replacing the current cell with a new one bearing the same id#."
P.S.: I don't know Scala. I'm considering picking it up to do this project. I'm also considering lots of other alternatives, including Java, Object Pascal and Ada. But actors seem a better map to what I'm after than thread-pools (and Java can't handle enough threads to make a thread/cell design feasible.
P.S.: At all times, most of the actors will be quiescent, but there wil need to be a way of cycling through the entire collection of them. If there isn't one built into the language, then this can be managed via first/next links within each cell. (Both links are needed, to allow cells in the middle to be extracted for release.)
With a neural net simulation, the real question is how much of the computational effort will be spent communicating, and how much will be spent computing something within a cell? If most of the effort is in communication then actors are perhaps a good choice for correctness, but not a good choice at all for efficiency (even with Akka, which performs reasonably well; AsyncFP might do the trick, though). Millions of neurons sounds slow--efficiency is probably a significant concern. If the neurons have some pretty heavy-duty computations to do themselves, then the communications overhead is no big deal.
If communications is the bottleneck, and you have lots of tiny messages, then you should design a custom data structure to hold the network, and also custom thread-handling that will take advantage of all the processors you have and minimize the amount of locking that you must do. For example, if you have space, each neuron could hold an array of input values from those neurons linked to it, and it would when calculating its output just read that array directly with no locking and the input neurons would just update the values also with no locking as they went. You can then just dump all your neurons into one big pool and have a master distribute them in chunks of, I don't know, maybe ten thousand at a time, each to its own thread. Scala will work fine for this sort of thing, but expect to do a lot of low-level work yourself, or wait for a really long time for the simulation to finish.

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach at an estimation for the processor load is OK, though, and is sensible: Make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
xCode has instruments to measure performance of each function/operation, you can simply use them.