Why do we need to specify the number of flash wait cycles? - stm32

Especially when working with "faster" devices like the STM32F4xx/F7xx, we need to specify the number of flash wait cycles, based on the supply voltage and the system clock frequency.
When the CPU fetches instructions or constants, this is done through the FLITF (the flash interface). Am I right in assuming that the FLITF holds a CPU request until it can provide the requested data, making it impossible for other bus masters to access flash in the meantime?
If that is true, why is it important for any interface to know the number of flash wait cycles? The cache preloads instructions anyway, regardless of whether it knows how long to wait, no?

Because the flash interface isn't magic.
It has to meet the necessary setup and hold times for addressing and reading out the flash cells, which will vary somewhat depending on voltage. Taking the STM32F411 as an example (because I have that TRM handy), doing some maths with the voltage/frequency/wait-state table implies that a read from flash on one of those takes on the order of ~30 ns above 2.7 V, rising to ~60 ns below 2.1 V.
Since the flash interface doesn't have its own asynchronous nanosecond-precision timekeeping ability (because that would be needlessly complicated, power-hungry, and silly), that translates to asserting its signals for n clock cycles, after which it can assume the data signals from the cells are stable enough to read back*. How does it know what the clock frequency is, and therefore what n should be? Simple: you, as the programmer who set the clock, tell it. Some hardware things are just infinitely easier to let software deal with.
* and then going through the further shenanigans of extracting the relevant 8, 16 or 32 bits out of the 128-bit line it's read, to finally spit that out the other side onto the AHB bus to the waiting CPU, obviously.
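For concreteness, here is a minimal sketch of what "telling it" looks like on an F4-class part, using the CMSIS-style register and bit names from ST's device header (assumed here; check your part's header and the wait-state table in its reference manual). The example assumes an STM32F411 running from 2.7–3.6 V and about to switch SYSCLK to 100 MHz, which calls for 3 wait states per that table:

    #include "stm32f4xx.h"   /* CMSIS device header, assumed available */

    /* Program the flash latency BEFORE raising the clock; when slowing
       down, lower the latency only after lowering the clock. */
    static void flash_set_latency_for_100mhz(void)
    {
        /* 3 wait states per the F411 voltage/frequency table at VDD > 2.7 V
           (an assumption for this sketch; look up your own part and voltage). */
        FLASH->ACR = (FLASH->ACR & ~FLASH_ACR_LATENCY) | FLASH_ACR_LATENCY_3WS;

        /* The reference manual suggests reading the register back to confirm
           the new latency has taken effect before switching the clock. */
        while ((FLASH->ACR & FLASH_ACR_LATENCY) != FLASH_ACR_LATENCY_3WS) {
            /* spin */
        }
    }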

Related

Polling vs. Interrupts with slow/fast I/O devices

I'm learning about the differences between Polling and Interrupts for I/O in my OS class and one of the things my teacher mentioned was that the speed of the I/O device can make a difference in which method would be better. He didn't follow up on it but I've been wracking my brain about it and I can't figure out why. I feel like using Interrupts is almost always better and I just don't see how the speed of the I/O device has anything to do with it.
The only advantage of polling comes when you don't care about every change that occurs.
Assume you have a real-time system that measures the temperature of a vat of molten plastic used for molding. Let's also say that your device can measure to a resolution of 1/1000 of a degree and can take a new temperature reading every 1/10,000 of a second.
However, you only need the temperature every second and you only need to know the temperature within 1/10 of a degree.
In that kind of environment, polling the device might be preferable. Make one polling request every second. If you used interrupts, you could get 10,000 interrupts a second as the temperature moved +/- 1/1000 of a degree.
Polling used to be common with certain I/O devices, such as joysticks and pointing devices.
That said, there is VERY little need for polling and it has pretty much gone away.
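To make the 1 Hz polling idea concrete, here is a small sketch in C; the sensor, actuator, and delay helpers (read_temperature_raw, adjust_heater, sleep_ms) are hypothetical stand-ins, not a real driver API:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t read_temperature_raw(void)   /* hypothetical sensor read */
    {
        return 180123;                            /* e.g. 180.123 deg, in millidegrees */
    }

    static void adjust_heater(uint32_t millideg)  /* hypothetical actuator */
    {
        printf("temperature: %u.%03u deg\n",
               (unsigned)(millideg / 1000), (unsigned)(millideg % 1000));
    }

    static void sleep_ms(uint32_t ms)             /* hypothetical delay */
    {
        (void)ms;   /* on a real target this would block for ms milliseconds */
    }

    int main(void)
    {
        for (int n = 0; n < 3; n++) {             /* bounded here so the sketch terminates */
            /* one sample per second, instead of servicing up to 10,000
               interrupts per second for changes far below the 0.1-degree
               resolution we actually need */
            adjust_heater(read_temperature_raw());
            sleep_ms(1000);
        }
        return 0;
    }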
Generally you would want to use interrupts, because polling can waste a lot of CPU cycles. However, if the events are frequent and synchronous (and other factors apply, e.g. short polling times), polling can be a good alternative, especially because an interrupt creates more overhead than a polling cycle.
You might want to take a look at this thread as well for more detail:
Polling or Interrupt based method

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycles. But phantomjs is headless, and rendering in a real browser is certainly different. Perhaps it's impossible to do real measurements... but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than a 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different amounts of energy per cycle, depending on the instruction.
That said, you can make your own measurements. Stick a power meter between the wall and the computer. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, with the browser completely static (i.e. not doing anything). You need to make sure that everything is in steady state (SS), meaning start your measurements only after several minutes of idle.
Measure the usage while doing the operation you want. Again, you want to avoid any start-up and shutdown work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue, since the browser will execute any termination code after you finish your measurement.
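The arithmetic behind those two steps is just a subtraction and a multiplication; a toy example with made-up numbers (not measurements):

    #include <stdio.h>

    int main(void)
    {
        double p_baseline_w = 35.0;  /* assumed idle draw at the wall, watts */
        double p_loaded_w   = 41.5;  /* assumed draw while the page loads/scrolls */
        double duration_s   = 20.0;  /* length of the measured window, seconds */

        double energy_j = (p_loaded_w - p_baseline_w) * duration_s;
        printf("extra energy for the operation: %.1f J (%.5f Wh)\n",
               energy_j, energy_j / 3600.0);
        return 0;
    }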
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal-to-noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is buried in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
Also, make sure you understand the underlying electronics. For example, real power is voltage times current only when the two are in phase; otherwise you have to account for the power factor. I don't think this will be an issue, since I'm pretty sure they are in phase for computers, and any decent power meter understands the difference anyway.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly comparable from processor to processor, because of architectural differences from generation to generation and from manufacturer to manufacturer.
Good luck.

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach to estimating the processor load is sensible, though: make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one as thorough for other platforms, except processor/instruction-set manuals, but those deal with assembly instructions.)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
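If you do want rough per-operation weights on your own device, one crude empirical approach is to time a tight loop of the operation against an empty-loop baseline. Below is a sketch using POSIX clock_gettime (available on recent iOS/macOS; mach_absolute_time is the traditional Apple alternative), with the usual caveat that compilers, caches, and pipelines make such numbers approximate at best:

    #include <stdio.h>
    #include <time.h>

    #define ITERS 50000000UL

    static double elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        volatile unsigned long acc = 0;
        struct timespec t0, t1, t2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (volatile unsigned long i = 0; i < ITERS; i++) { /* empty baseline loop */ }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (volatile unsigned long i = 0; i < ITERS; i++) { acc += i % 7; } /* op under test */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        double baseline = elapsed_ns(t0, t1);
        double with_op  = elapsed_ns(t1, t2);
        printf("very roughly %.2f ns per '%% then +=' on this machine\n",
               (with_op - baseline) / (double)ITERS);
        (void)acc;
        return 0;
    }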
Xcode has Instruments to measure the performance of each function/operation; you can simply use them.

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
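For what it's worth, the only observable effect the keyword still has in C is that you may not take the variable's address; a trivial sketch:

    #include <stdio.h>

    int main(void)
    {
        register int i = 42;    /* a hint the compiler is free to ignore */

        /* &i would be a compile error: taking the address of a
           register-declared variable is not allowed in C. */
        printf("%d\n", i);
        return 0;
    }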
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just one); as soon as memory is accessed, things get more complex, cache controllers and memory buses get involved, and the operation is going to take considerably more time.
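A toy way to see that gap on a desktop (not a rigorous benchmark, and the exact numbers are platform-dependent): compare a loop whose accumulator the compiler can keep in a register against one forced through memory with volatile. Compile without heavy optimization (-O0 or -O1), otherwise the first loop may be folded away entirely.

    #include <stdio.h>
    #include <time.h>

    #define N 200000000ULL

    static unsigned long long reg_sum(void)
    {
        unsigned long long acc = 0;          /* the compiler keeps this in a register */
        for (unsigned long long i = 0; i < N; i++)
            acc += i;
        return acc;
    }

    static unsigned long long mem_sum(void)
    {
        volatile unsigned long long acc = 0; /* every += becomes a load and a store */
        for (unsigned long long i = 0; i < N; i++)
            acc += i;
        return acc;
    }

    int main(void)
    {
        clock_t t0 = clock();
        unsigned long long a = reg_sum();
        clock_t t1 = clock();
        unsigned long long b = mem_sum();
        clock_t t2 = clock();

        printf("register accumulator: %llu in %.2f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("memory   accumulator: %llu in %.2f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }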
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, Xtensa) do rename registers.
Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid the register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure.
Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register-indirect addressing, with optional post-increment.) Way prediction (and hit speculation in the case of direct-mapped caches) can reduce latency (again with misprediction handling costs).
Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy), and once an access is determined to be to that region a miss is impossible.
The contents of a Knapsack Cache can be treated as part of the context, with the context not considered ready until that cache is filled. Registers could also, in theory, be loaded lazily (particularly for Itanium stacked registers) and so would have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively, sign prediction could be used (likely just speculative zero extension), or sign extension could be treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort are more profitably spent improving the performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) (a rough approximation loosely based on data from SPEC CPU2000 for MIPS), then more than three times as many of the (more timing-critical) reads are from registers than from memory (1.3 per instruction vs. 0.4), and register writes are more than seven times as common as memory writes (0.75 per instruction vs. 0.1).
Complicating Factors
Not all processors are designed for "general purpose" workloads. E.g., a processor using in-memory vectors and targeting dot-product performance, using registers for the vector start address, vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
Every microcontroller has a CPU, as Bill mentioned, which has the basic components of an ALU, some RAM, as well as other forms of memory to assist with its operations. The RAM is what you are referring to as main memory.
The ALU handles all of the arithmetic and logical operations. To operate on any operands and perform these calculations, it loads the operands into registers, performs the operations on them, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the least number of clock cycles.

Communication between processor and high speed peripheral

Considering that a processor runs at 100 MHz and data is coming into the processor from an external device/peripheral at a rate of 1000 Mbit/s (8 bits per clock cycle @ 125 MHz), what is the best way to handle traffic that comes to the processor at a higher speed?
First off, you can't do it in software. There would be no way to sample the digital lines at a sufficient rate, or to do anything useful with the data.
You need to use a hardware FIFO buffer or memory cell. When a data burst comes in, it can be buffered in the high speed FIFO and then read out as needed by the processor.
Drop-in high-speed FIFO chips are surprisingly expensive (though most are dual-ported). To cut cost, you would be best off using an SRAM chip and a hardware adder to increment the address lines on incoming data.
This is not an uncommon situation for software. semaj used the right word: this is a system engineering issue. Other folks have the right answer too. If you want to look at or process all of that data with the 100 MHz processor, it is not going to happen, so don't bother trying. You CAN look at snapshots of it, or have the hardware filter out the specific percentage of it that you are looking for. At the end of the day, though, it is a systems issue: what does the hardware provide, where does it put this data, and what is the software's task for this data? Does it see X buffers of data come in on the goesinta and then notify the goesouta hardware that there are X buffers ready to go? Does the hardware examine and align the buffers so that you can look at a header and then decide where to route the data? Once you do your system engineering you will know whether you can use that processor, and if you can, what its job is and how to do it.
Your direct question: what is the best way to handle it? The best way to handle it is to have hardware (FPGA, ASIC, etc.) move it into and out of some storage device (RAM of some sort, probably). Not necessarily the same RAM the processor runs out of (DMA is a good thing to avoid). The hardware is something the software can talk to, but you cannot examine all of that data, so don't try. Without knowing what kind of data this is, what form it takes, what the software looks at, and how much work you are willing to force the hardware to do, the rest of the answer can't be determined. If you expect a certain (guaranteed) percentage to be bad or not to belong to this processor, have the hardware filter that out, and then you can process what is left.
Networking is a good example of this: PCs have GigE ports but cannot process GigE line-rate data. That is why we use switches now instead of hubs; the hardware slices out a percentage of the data so the PC can handle it, and the protocols take care of the data that cannot be processed by resending it later. The switches' processors don't look at all of the data either; the hardware slices it up so the software can examine just the header. Or sometimes the software simply manages tables that drive the hardware, and the hardware does all the work of processing the data.
Do your system engineering and the answers will simply fall out.
You buffer it. Typically data from a device is written to a memory buffer (a circular queue) using DMA (no CPU involved). The CPU reads from the memory buffer at a constant rate. Usually devices send data in bursts, and the gaps between bursts keep the buffer from filling up. If there is too much data, the buffer overflows.
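Here is a sketch of that circular-queue arrangement in C, with the DMA producer faked in main() so the example is self-contained; the names and sizes are illustrative rather than taken from any particular device:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define RING_SIZE 16u                 /* power of two so wrap is a cheap mask */

    static volatile uint8_t ring[RING_SIZE];
    static volatile size_t write_idx;     /* advanced by the DMA/ISR (simulated below) */
    static size_t read_idx;               /* owned by the consuming CPU code */

    static void process_byte(uint8_t b)   /* stand-in for the real consumer */
    {
        printf("%02x ", b);
    }

    static void drain_ring(void)
    {
        size_t w = write_idx;             /* snapshot the producer's position */
        while (read_idx != w) {
            process_byte(ring[read_idx]);
            read_idx = (read_idx + 1u) & (RING_SIZE - 1u);
        }
        /* If the producer ever laps the consumer, data is silently lost:
           that's the "buffer overflow" case mentioned above. */
    }

    int main(void)
    {
        for (uint8_t i = 0; i < 10; i++) {            /* fake a DMA burst */
            ring[write_idx] = i;
            write_idx = (write_idx + 1u) & (RING_SIZE - 1u);
        }
        drain_ring();
        printf("\n");
        return 0;
    }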
DMA (direct memory access) is possibly the solution; however, it seems unlikely that the memory bus could run faster than the processor core, so the receiving peripheral would have to accept data into a register wider than 8 bits, because 125 MHz could not be sustained. For example, a 16-bit register would allow memory writes at 62.5 MHz, which may be achievable. Also, the receiving device would have to be able to accept an external clock that is both faster than and asynchronous to the core clock. And of course the receiving peripheral must have support for DMA.
Unless you are more specific about your hardware and the communication protocol it is difficult to give anything other than a general answer.