Processor FSB characteristics - cpu-architecture

I've a basic question regarding of Front Side Bus (FSB) characteristics
Consider for instance Pentium 4 FSB: it is a "quad-pumped" bus in which FSB'clock (BCLK) is 100Mhz but data transfert is at 400 MT/s (4 Transfers/Cycle).
AFAIK to complete a transfer to/from memory are needed five phases: Request Phase, Snoop Phase, Response Phase and Data Phase. Some of them (e.g. Request phase) requires one bus clock cycle (BCLK).
So, even if the process of sending data is quad-pumped, the net FSB transfert rate cannot be BCLK x 4 because of the (necessary) previous associated phases (namely Request, Snoop and Response)
Does it make sense ?
At a deeper look to me it seems the "pipelined" FSB architecture is the answer to the original question. Having separate lanes for address, data (64 bit data bus widht) and control, Pentium 4 FSB runs in a pipelined mode maximizing DRAM module throughput
Someone can confirm my understanding ? Thanks.

Related

Is there a trade-off for memory to memory DMA transfer when the data size is small?

I am learning about the STM32 F4 microcontroller. I'm trying to find out about limitations for using DMA.
Per my understanding and research, I know that if the data size is small (that is, the device uses DMA to generate or consume a small amount of data), the overhead is increased because DMA transfer requires the DMA controller to perform operations, thereby unnecessarily increasing system cost.
I did some reaserch and found the following:
Limitation of DMA
CPU puts all its lines at high impedance state so that the DMA controller can then transfer data directly between device and memory without CPU intervention. Clearly, it is more suitable for device with high data transfer rates like a disk.
Over a serial interface, data is transferred one bit at a time which makes it slow to use DMA.
Is that correct? What else do I need to know?
DMA -CPU puts all its lines at high impedance state
I do not know where did you take it from - but you should not use this source any more.
Frequency of the DMA transfers do not matter unless you reach the the BUS throughput. you can transfer one byte per week, month, year, decade ..... and it is absolutely OK.
In the STM32 microcontrollers it is a very important feature as we can transfer data from/to external devices even if the uC is in low power mode with the core (CPU) sleeping. DMA controller can even wake up the core when some conditions are met.
As #Vinci and #0___________ (f.k.a. #P__J__) already pointed out,
A DMA controller works autonomously and doesn't create overhead on the CPU it supplements (at least not by itself). But:
The CPU/software must perform some instructions to configure the DMA and to trigger it or have it triggered by some peripheral. For this, it needs CPU time and program memory space (usually ROM). Besides, it usually needs some additional RAM in variables to manage the software around the DMA.
Hence, you are right, using a DMA comes with some kinds of overhead.
And furthermore,
The DMA transfers make use of the memory bus(es) that connect the involved memories/registers/peripherals to the DMA controller. That is, while the DMA controller does its own work, it may cause the CPU which it tries to offload to stall in the meantime, at least for short moments when the data words are transferred (which in turn sum up for longer transfers...).
On the other hand, a DMA doesn't only help you to reduce the CPU load (regarding total CPU time to implement some feature). If used "in a smart way", it helps you to reduce software latencies to implement different functions because one part of the implementation can be "hidden" behind the DMA-driven data transfer of another part (unless, both rely on the same bus resources - see above...).
The information is right in that using a DMA requires some development work and some runtime to manage the DMA transfer itself (see also
a related question
here), which may not be worth the benefits of using DMA. That is, for small portions of data one doesn't gain as much performance (or latency) as during big transfers. On embedded systems, DMA controllers (and their channels) are limited resources so it is important to consider which part of the function benefits from such a resource most. Therefore, one would usually prefer using DMA for the data transfers to/from disks (if it is about "payload data" such as large files or video streams) over slow serial connections.
The information is wrong, however, in that DMA is not worth using on serial interfaces as those only transfer a single bit at a time. Please note that microcontrollers (as your
STM32F4)
have built-in peripheral components that convert the serial bit-by-bit stream into a byte-by-byte or word-by-word stream, which can easily be tranferred by DMA in a helpful way - especially if the size of the packets is known in advance and software doesn't have to analyse a non-formatted stream. Furthermore, not every serial connection is "slow" at all. If the project uses, e. g., an SPI flash chip, then the SPI serial connection is the one used for data transfer.

What is responsible for changing core's load and frequency in multicore processor

Having looked for a description of the multicore design i keep finding several diagrams, but all of them look somewhat like this:
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions regarding which core will be given a new process and for changing the frequency of the core itself are done either by the operating system or by the control block of the core itself.
My question is: What controls the frequencies of each individual core? Is the job of associating a READY process with the specific core placed upon the operating system or is it done by something within the processor.
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: The decision-making code is called a governor (and also this arch wiki link came up high on google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at different frequencies, the decision-making code doesn't need to coordinate with each other. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

What are the performance and architectural differences between PCIe and QPI?

PCIe 3.0 x16 and QPI 1.1 (20 lanes) have identical effective bandwidth (16 GB/s). So, I wanted to get a rough picture about the differences between the two.
What are the differences between the two in terms of latency and message rate (number of packets or TLPs per second)? For latency, my ballpark numbers are 20 ns for QPI and 200 ns for PCIe 3.0. Are these good estimates? If yes, why is PCIe's latency so much higher - is it due to the wire length?
Are there significant architectural differences between the two, apart from the fact that QPI provides cache snooping? To my knowledge, both use a layered protocol: transport layer through physical layer.
The two have fairly different messaging types due to their different roles. QPI is directly concerned with implementing cache coherency via the MESIF protocol and NUMA via a distributed directory. PCIe has no such notions, although they share in common memory read and write and completion message types (see here for some PCIe basics). They have similar power states and a priority scheme implemented through virtual channels. Both use credit-based flow control but there's no guarantee of any commonality in what sort of credits are maintained by QPI versus PCIe endpoints (as far as I know, the specifics of QPI credits are an Intel trade secret).
Message rate for each is usually expressed in GT/s. Typical QPI rates are 4.8, 6.4 and 8 GT/s, and 5 or 8 GT/s for PCIe.
Your latency estimates for both are probably low. QPI is on the order of a few hundred ns per hop. Note that 4+ socket systems may have more than one QPI hop between pairs of sockets. PCIe might be closer to 500ns, although again, that depends on the system topology. Latency between a processor socket's main memory and a PCIe card hanging directly off that socket (the PEG slot) is going to be lower than between that same memory and a card hanging off the south bridge.

Soft hand off in CDMA cellular networks

Hi,
In the CDMA cellular networks when MS (Mobile Station) need to change a BS(Base Station), exactly necessary for hand-off, i know that is soft hand-off (make a connection with a target BS before leaving current BS-s). But i want to know, because connection of MS remaining within a time with more than one BS, MS use the same code in CDMA to communicate with all BS-s or different code for different BS-s ?
Thanks in advance
For the benefit of everyone, i have touched upon few points before coming to the main point.
Soft Handoff is also termed as "make-before-break" handoff. This technique falls under the category of MAHO (Mobile Assisted Handover). The key theme behind this is having the MS to maintain a simultaneous communication link with two or more BS for ensuring a un-interrupted call.
In DL direction, it is achieved using different transmission codes(transmit same bit stream) on different physical channels in the same frequency by two or more BTS wherein the CDMA phone simultaneously receives the signals from these two or more BTS. In the active set, there can be more than one pilot as there could be three carriers involved in soft hand off. Also, there shall also be a rake receiver that shall do maximal combining of received signals.
In UL direction, MS shall operate on a candidate set where there could be more than 1 pilot that have sufficient signal strength for usage as reported by MS. The BTS shall tag each of the user's data with Frame reliability indicator that can provide details about the transmission quality to BSC. So, even though the signals(MS code channel) are received by both base stations, it is achieved by routing the signals to the BSC along with information of quality of received signals, which shall examine the quality based on the Frame reliability indicator and choose the best quality stream or the best candidate.

Communication between processor and high speed perihperal

Considering that a processor runs at 100 MHz and the data is coming to the processor from an external device/peripheral at the rate of 1000 Mbit/s (8 Bits/Clockcycle # 125 MHz), which is the best way to handle traffic that comes at a higher speed to the processor ?
First off, you can't do it in software. There would be no way to sample the digital lines at a sufficient rate, or to doing anything useful with it.
You need to use a hardware FIFO buffer or memory cell. When a data burst comes in, it can be buffered in the high speed FIFO and then read out as needed by the processor.
Drop in high speed FIFO chips are surprisingly expensive (though most are dual ported). To cut cost, you would be best off using an SRAM chip, and a hardware adder to increment the address lines on incoming data.
This is not an uncommon situation for software. semaj said the right word. This is a system engineering issue. Other folks have the right answer too. If you want to look at or process that data with the 100MHz processor, it is not going to happen, dont bother trying. You CAN look at snapshots of it or have the hardware filter out a specific percentage of it that you are looking for. At the end of the day though it is a systems issue, what does the hardware provide, where does it put this data, what is the softwares task for this data, does it see X buffers of data come in on the goesinta, and the notify the goesouta hardware that there are X buffers ready to go? Does the hardware examine and align the buffers so that you can look at a header, and then decide where to route the hardware? Once you do your system engineering you will know if you can use that processor or not, and if you can use it what its job is and how to do it.
Your direct question. What is the best way to handle it. The best way to handle it is to have hardware (fpga, asic, etc) move it into and out of some storage device (ram of some sort probably). Not necessarily the same ram the processor runs out of (DMA is a good thing to avoid). The hardware is something the software can talk to but you cannot examine all of that data so dont try. Without knowing what kind of data this is, what form, what the software looks at how much work you are willing to force the hardware to do, etc determines the rest of the answer. If you expect a certain (guaranteed) percentage to be bad or not belong to this processor, etc have the hardware filter that out and then what is left you can process.
Networking is a good example of this, PCs have gige ports but cannot process GigE line rate data. That is why we use switches now instead of hubs, hardware slices out a percentage of the data so the pc can handle it, the protocols take care of the data that cannot be processed by resending it later. And the switches processors dont look at all of the data, the hardware slices it up so the software can examine just the header. Or sometimes the software simply manages tables that drive the hardware and the hardware does all the work of processing the data.
Do your system engineering the answers will simply fall out.
You buffer it. Typically data from a device is written to a memory buffer (circular queue) using DMA (no cpu involved). The cpu reads from the memory buffer at a constant rate. Usually devices send data in bursts. This keeps the buffer from filling up. If there is too much data, buffer overflow.
DMA (direct memory access) is possibly the solution, however, it seems unlikely that the memory bus could run faster than the processor core, so the receiving peripheral would have to accept data into a larger register than 8 bit because 125MHz could not be sustained. For example a 16bit register would allow memory writes at 62.5MHz which may be achievable. Also the receiving device would have to be able to accept an external clock that is both faster and asynchronous to the core clock. Also of course the receiving peripheral must have support for DMA.
Unless you are more specific about your hardware and the communication protocol it is difficult to give anything other than a general answer.