Data hazards in hardware platforms

Data hazards in hardware platforms - cpu-architecture

I have got a list of 2 types of hazards:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
I am not able to understand the intuition behind these 2 rules which can help me know the technical terms and also explains me the concept? Any explanations are welcome :)

In most 3-operand ISAs (e.g. MIPS documentation like http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html uses this convention), rd is the destination register, and Rs, Rt are source registers. e.g. add rd, rs, rt. (rs and rt might be second and third, or source and third, IDK).
If you're reading a register (in ID) which was recently written (the instruction writing it hasn't reached the write-back stage), that's a RAW read-after-write true dependency.
Out-of-order exec also introduces the possibility of write-after-write and write-after-read anti-dependency hazards. https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards. But in a scalar in-order pipeline, only true dependencies are a concern, I think. At least if all instructions have fixed 1-cycle latency, so excluding funky stuff like MIPS mult or div that write the hi:lo pair.

Related

ThreadX RAM issue on STM32

I'm currently starting to use ThreadX on a STM32 Nucleo-H723ZG (STM32H723ZG MCU).
I noticed that when loading the Nx_TCP_Echo_Server / Nx_TCP_Echo_Client projects from CubeMX, the RAM gets filled up pretty much to the top, which makes me wonder, how I'm supposed to add my own code and data here.
Since I'm pretty new to RAM partitioning, RTOS and similar, I don't have a perfect feeling for what is wrong or right and how to proceed (and if it is a problem at all).
Nevertheless I wonder, if maybe using a different way of partitioning the RAM or by dropping some non-necessary code-parts, the RAM could be freed-up.
Or a different way of thinking:
Since RAM_D1 got filled, but _D2, _D3 and DTCMRAM are pretty much empty, is there a way to use the free RAM for my own purposes (I would like to let SPI and ADC processing run via DMA, so this needs a place to go ....)
Hope my questions are not too confusing ;)
The system has the following amount of RAM, according to STM:
"SRAM: total 564 Kbytes all with ECC, including 128 Kbytes of data TCM RAM for critical real-time data + 432 Kbytes of system RAM (up to 256 Kbytes can remap on instruction TCM RAM for critical real time instructions) + 4 Kbytes of backup SRAM (available in the lowest-power modes)" (see STMs STM32H723ZG MCU product page)
Down below you'll find screenshots of the current RAM usage, for RAM_D1 especially .tcp_sec eats up most of the RAM.
--> Can .tcp_sec be optimized or kicked-out?
If tcp means here the tcp protocol, maybe this can be a way to optimize this, since I'm not sure whether I need a handshake etc., maybe UDP is sufficient (and faster for the ADC data streaming) ... what do you say?
Edit:
The linker-file shows, that there .tcp_sec (NOLOAD) is written ... is NOLOAD maybe a hint on a "placebo" RAM occupation (pre-allocation / reservation, but no actual usage?)
Linker-script extract:
/* User_heap_stack section, used to check that there is enough RAM left */
._user_heap_stack :
{
. = ALIGN(8);
PROVIDE ( end = . );
PROVIDE ( _end = . );
. = . + _Min_Heap_Size;
. = . + _Min_Stack_Size;
. = ALIGN(8);
} >RAM_D1
.tcp_sec (NOLOAD) : {
. = ABSOLUTE(0x24048000);
*(.RxDecripSection)
. = ABSOLUTE(0x24048060);
*(.TxDecripSection)
} >RAM_D1 AT> FLASH
For context:
I am developing a "system controller", where my plan is to have it running a RTOS, which manages the read-in of analog values, writing control messages via SPI to two other STMs of the same kind and communicating via Ethernet to my desktop application.
The desktop application is then in charge of post-processing the digitized analog values and sending control messages to the system controller. In the best case the system controller digitizes the analog signal on ADC3 with 5 MSPS (at probably 6 Bit resolution = 30 MBit/s) and sends that data hickup-free to my desktop application.
-> Is this plan possible on this MCU?
I tried to buy a higher (more RAM) version of the nucleo I've got, but due to shortages this one is the best one I was able to get.
For the RTOS I'd like to stick with ThreadX, since FreeRTOS support in STM32IDE seems to be phased out now, after ThreadX was employed as the RTOS by STM.
(I like the easy register configuration using CubeMX/STM32 IDE, hence my drive to use that SW universe ... if there are good reasons to use a different RTOS, tell me :) )
Thank you for your time!

I generated the same project on my side and took a look. I believe you should be able to implement what you want in this CPU. You will need to carefully use the available memory.
It seems there is a confusion about the section .tcp_sec. It contains DMA reception and transmission descriptors for the Ethernet controller/driver. These are constrained by the driver and hardware to be at a specific address. The descriptors are rather small, but the buffers are bigger. With some work these can be reduced. If you are using Ethernet you will need this, no mater if you use TCP or not. As I said, the name can be confusing.
The flash has still plenty of space available. In the debug configuration only about 11% is used. The rest is available for your application code.
You can locate you data in other memory regions. Depending on the toolchain you will use is how you will need to tell the compiler/linker where your data goes. You can look towards the top of the main.c file in that example to see how the DMA descriptors are assigned to a specific section for three different toolchains (IAR, ARM MDK, GCC).
In terms of how to most efficiently use and configure the microcontroller peripherals please get in touch with STMicro, they will know best.
This should get you started. Let us know if this helps!

CPUs with instructions with more than two branch destinations

Processors usually come with jmp-instructions to continue from a different fixed location and may depend on some condition. So the out-degree is two at most.
Are there any processors out there that have a single instruction that branches to one of three or more fixed locations?

There are a lot of reasons to assume / guess no, but I'm not familiar with enough ISAs to give a definite no. Especially if we include historical early computers from the 50s and 60s; they often have very odd stuff compared to modern systems.
Normally you just use an indirect branch (target address in a register or from memory, or looked up from a compressed table with ARM tbb) if you need anything other than taken vs. fall-through, so there's very little benefit to spending an opcode on a funky direct branch instruction with 2 non-fallthrough destinations.
Also, you'd need space in the instruction encoding for either 2 separate targets, or else some special rule like fall-through, PC + offset, PC + offset*2 (i.e. jump twice as far forwards or backwards). Using it would require laying out code with targets at specific offsets. You do sometimes make a table of fixed-size blocks of instructions and compute an offset into it (instead of looking up an address from a table of addresses), but having an instruction that forced you to do that sounds unlikely.
The condition itself could be a register being - / 0 / + as a 3-way condition, or FLAGS being less-than, equal, or greater-than. Or something else.
So it sounds very unlikely, and a complication to branch-prediction (unless you just treat it as indirect, in which case why bother).
But I wouldn't be shocked if there's some combination of conditions that make it make sense on some ISA. Maybe if there's a special-case handler address in some special register, and the normal case involves taken or fall-through?
But if we allow one of the target addresses to come from a register or other internal state, any branch that can fault would count. Consider a hypothetical ISA with a compare-and-branch on memory, like Intel with macro-fused cmp [rdi], eax / jne rel32 which decodes to a single internal uop.
Then the possible targets are:
fall-through to RIP
taken to RIP+rel32
#PF fault to the page-fault handler address (loaded from memory on x86-64).

Register Ranges in HLSL?

I am currently refactoring a large chunk of old code and have finally dove into the HLSL section where my knowledge is minimal due to being out of practice. I've come across some documentation online that specifies which registers are to be used for which purposes:
t – for shader resource views (SRV)
s – for samplers
u – for unordered access views (UAV)
b – for constant buffer views (CBV)
This part is pretty self explanatory. If I want to create a constant buffer, I can just declare as:
cbuffer LightBuffer: register(b0) { };
cbuffer CameraBuffer: register(b1) { };
cbuffer MaterialBuffer: register(b2) { };
cbuffer ViewBuffer: register(b3) { };
However, originating from the world of MIPS Assembly I can't help but wonder if there are finite and restricted ranges on these. For example, temporary registers are restricted to a range of t0 - t7 in MIPS Assembly. In the case of HLSL I haven't been able to find any documentation surrounding this topic as everything seems to point to assembly languages and microprocessors (such as the 8051 if you'd like a random topic to read up on).
Is there a set range for the four register types in HLSL or do I just continue as much as needed in a sequential fashion and let the underlying assembly handle the messy details?
Note
I have answered this question partially, as I am unable to find a range for u currently; however, if someone has a better, more detailed answer than what I've given through testing, then feel free to post it and I will mark that as the correct answer. I will leave this question open until December 1st, 2018 to give others a chance to give a better answer for future readers.

Resource slot count (for d3d11, indeed d3d12 case expands that) are specified in Resource Limit msdn page.
The ones which are of interest for you here are :
D3D11_COMMONSHADER_INPUT_RESOURCE_REGISTER_COUNT (which is t) = 128
D3D11_COMMONSHADER_SAMPLER_SLOT_COUNT (which is s) = 16
D3D11_COMMONSHADER_CONSTANT_BUFFER_HW_SLOT_COUNT (which is b) = 15 but one is reserved to eventually store some constant data from shaders (if you have a static const large array for example)
The u case is different, as it depends on Feature Level (and tbh is a vendor/os version mess) :
D3D11_FEATURE_LEVEL_11_1 or greater, this is 64 slots
D3D11_FEATURE_LEVEL_11 : It will always be 8 (but some cards/driver eventually support 64, you need at least windows 8 for it (It might also be available in windows 7 with some platform update too). I do not recall a way to test if 64 is supported (many nvidia in their 700 range do for example).
D3D11_FEATURE_LEVEL_10_1 : either 0 or 1, there's a way to check is compute is supported
You need to perform a feature check:
D3D11_FEATURE_DATA_D3D10_X_HARDWARE_OPTIONS checkData;
d3dDevice->CheckFeatureSupport(D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS, &checkData);
BOOL computeSupport = checkData.ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x
Please note that for some OS/Driver version I had this flag returning TRUE while not supported (Intel was doing that on win7/8), so in that case the only valid solution was to try to either create a small Raw / Byte Address buffer or a Structured Buffer and check the HRESULT
As a side note feature feature level 10 or below are for for quite old configurations nowadays, so except for rare scenarios you can probably safely ignore it (I just leave it for information purpose).

Since it's usually a long wait time for these types of questions I tested the b register by attempting to create a cbuffer in register b51. This failed as I expected and luckily SharpDX spit out an exception that stated it has a maximum of 14. So for the sake of future readers I am testing all four register types and posting back the ranges I find successful.
b has a range of b0 - b13.
s has a range of s0 - s15.
t has a range of t0 - t127.
u has a range of .
At the current moment, I am unable to find a range for the u register as I have no examples of it in my code, and haven't actually ever used it. If someone comes along that does have an example usage then feel free to test it and update this post for future readers.
I did find a contradiction to my findings above in the documentation linked in my question; they have an example using a t register above the noted range in this answer:
Texture2D a[10000] : register(t0);
Texture2D b[10000] : register(t10000);
ConstantBuffer<myConstants> c[10000] : register(b0);
Note
I would like to point out that I am using the SharpDX version of the HLSL compiler and so I am unsure if these ranges vary from compiler to compiler; I heavily doubt that they do, but you can never be too sure until you try to exceed them. GLSL may be the same due to being similar to HLSL, but it could also be very different.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?

Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.

The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-bye prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo.

I think your question is related to this question or this one. I think it is clear that - unless you can rely on a OS or library-function - you will want to use the CPUID instruction, but the question then becomes exactly what information you are looking for. - And of course, AMD's and Intel's implementations don't need to agree. This page suggests using Cpuid.1.EBX[15:8] (i.e., BH) for finding out on Intel and function 80000005h on AMD. In addition, on Intel, CPUID.2... seems to contain the relevant information, but it looks like a real pain to parse out the desired information.
I think, from what I've read, both AMD and Intel CPUID instructions will support CPUID.1.EBX[15:8], which returns the size of one cache line in QUADWORDs as used by the CLFLUSH instruction (which isn't present on all processors, so I don't know whether you'll always find something there). So, after executing CPUID.1, you'd have to multiply BH by 8 to get the cache line size in bytes. This hinges on my implicit assumption (please can anyone say whether it is really valid?) that the definition of one cache line size is always the same for CLFLUSH and PREFETCHh instructions.
Also, Intel's manuals states that PREFETCHh is only a hint, but that, if it prefetches anything, it will always be a minimum of 32 bytes.
EDIT1:
Another useful resource (even if not directly answering your question) for the optimised use of PREFETCHh is Intel's optimisation manual here.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse