Register Ranges in HLSL?

I am currently refactoring a large chunk of old code and have finally dived into the HLSL section, where my knowledge is minimal because I am out of practice. I've come across some documentation online that specifies which registers are to be used for which purposes:
t – for shader resource views (SRV)
s – for samplers
u – for unordered access views (UAV)
b – for constant buffer views (CBV)
This part is pretty self-explanatory. If I want to create a constant buffer, I can just declare it as:
cbuffer LightBuffer: register(b0) { };
cbuffer CameraBuffer: register(b1) { };
cbuffer MaterialBuffer: register(b2) { };
cbuffer ViewBuffer: register(b3) { };
However, coming from the world of MIPS assembly, I can't help but wonder whether these have finite, restricted ranges. For example, temporary registers in MIPS assembly are restricted to the range t0 - t7. In the case of HLSL I haven't been able to find any documentation on this topic, as everything seems to point to assembly languages and microprocessors (such as the 8051, if you'd like a random topic to read up on).
Is there a set range for the four register types in HLSL, or do I just continue sequentially as far as needed and let the underlying assembly handle the messy details?
Note
I have answered this question partially, as I am unable to find a range for u currently; however, if someone has a better, more detailed answer than what I've given through testing, then feel free to post it and I will mark that as the correct answer. I will leave this question open until December 1st, 2018 to give others a chance to give a better answer for future readers.

Resource slot counts for D3D11 (D3D12 expands them) are specified on the Resource Limits MSDN page.
The ones of interest for you here are:
D3D11_COMMONSHADER_INPUT_RESOURCE_REGISTER_COUNT (which is t) = 128
D3D11_COMMONSHADER_SAMPLER_SLOT_COUNT (which is s) = 16
D3D11_COMMONSHADER_CONSTANT_BUFFER_HW_SLOT_COUNT (which is b) = 15, but one slot is reserved to store constant data emitted by the shader itself (for example, a large static const array), which leaves 14 usable from the API side.
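For reference, the b/s/t limits above are also exposed as constants in d3d11.h. Here is a minimal sketch (no device needed) that simply prints them, assuming the constant names from the MSDN page cited above:
#include <d3d11.h>
#include <cstdio>
// Prints the d3d11.h limit constants for the t, s and b register spaces.
int main()
{
    std::printf("t (SRV) registers:      %d\n", D3D11_COMMONSHADER_INPUT_RESOURCE_REGISTER_COUNT); // 128
    std::printf("s (sampler) slots:      %d\n", D3D11_COMMONSHADER_SAMPLER_SLOT_COUNT);            // 16
    std::printf("b (CBV) hardware slots: %d\n", D3D11_COMMONSHADER_CONSTANT_BUFFER_HW_SLOT_COUNT); // 15
    std::printf("b (CBV) API slots:      %d\n", D3D11_COMMONSHADER_CONSTANT_BUFFER_API_SLOT_COUNT); // 14
    return 0;
}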
The u case is different, as it depends on the feature level (and, to be honest, is a vendor/OS-version mess):
D3D_FEATURE_LEVEL_11_1 or greater: 64 slots.
D3D_FEATURE_LEVEL_11_0: normally 8, but some cards/drivers support 64; you need at least Windows 8 for that (it may also be available on Windows 7 with the Platform Update). I do not recall a way to test whether 64 is supported (many NVIDIA cards in the 700 range do, for example).
D3D_FEATURE_LEVEL_10_1: either 0 or 1; there is a way to check whether compute is supported.
You need to perform a feature check:
D3D11_FEATURE_DATA_D3D10_X_HARDWARE_OPTIONS checkData = {};
HRESULT hr = d3dDevice->CheckFeatureSupport(
    D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS, &checkData, sizeof(checkData)); // size argument is required
BOOL computeSupport = SUCCEEDED(hr)
    && checkData.ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x;
Please note that on some OS/driver combinations I have seen this flag return TRUE while the feature was not actually supported (Intel did that on Windows 7/8), so in that case the only reliable solution was to try to create a small raw (byte-address) buffer or a structured buffer and check the HRESULT.
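Something along these lines would do; ProbeRawBufferSupport is a made-up helper name, and this is only a sketch of that create-and-check approach, not the exact code used:
#include <d3d11.h>
#include <wrl/client.h>

// Probes raw-buffer/UAV support by actually creating a tiny raw buffer and a
// raw UAV on it, rather than trusting the capability flag alone.
bool ProbeRawBufferSupport(ID3D11Device* d3dDevice)
{
    using Microsoft::WRL::ComPtr;

    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = 16;                                   // tiny throwaway buffer
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;

    ComPtr<ID3D11Buffer> buffer;
    if (FAILED(d3dDevice->CreateBuffer(&desc, nullptr, &buffer)))
        return false;                                      // raw buffers not supported

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_R32_TYPELESS;             // required for raw views
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 0;
    uavDesc.Buffer.NumElements = 4;                        // 16 bytes / 4-byte elements
    uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_RAW;

    ComPtr<ID3D11UnorderedAccessView> uav;
    return SUCCEEDED(d3dDevice->CreateUnorderedAccessView(buffer.Get(), &uavDesc, &uav));
}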
As a side note, feature level 10 or below covers quite old configurations nowadays, so except for rare scenarios you can probably safely ignore it (I leave it here for information purposes).

Since questions like this usually take a long time to get answers, I tested the b register by attempting to create a cbuffer in register b51. This failed as I expected, and luckily SharpDX spat out an exception stating that the maximum is 14. So, for the sake of future readers, I am testing all four register types and posting back the ranges I find successful.
b has a range of b0 - b13.
s has a range of s0 - s15.
t has a range of t0 - t127.
u has a range I have not yet been able to verify.
At the moment I am unable to find a range for the u register, as I have no examples of it in my code and have never actually used it. If someone comes along who does have an example usage, feel free to test it and update this post for future readers.
I did find a contradiction to my findings above in the documentation linked in my question; it has an example using a t register well above the range noted in this answer (that documentation targets D3D12 / Shader Model 5.1, where the limits are larger, as the accepted answer notes):
Texture2D a[10000] : register(t0);
Texture2D b[10000] : register(t10000);
ConstantBuffer<myConstants> c[10000] : register(b0);
Note
I would like to point out that I am using the SharpDX version of the HLSL compiler, so I am unsure whether these ranges vary from compiler to compiler; I strongly doubt that they do, but you can never be too sure until you try to exceed them. GLSL may be the same due to its similarity to HLSL, but it could also be very different.

Related

Data hazards in hardware platforms

I have a list of two types of hazards:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
I am not able to understand the intuition behind these two rules. Could someone explain the technical terms involved and the concept behind them? Any explanations are welcome :)
In most 3-operand ISAs (e.g. MIPS documentation like http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html uses this convention), rd is the destination register and rs, rt are the source registers, e.g. add rd, rs, rt (I don't recall exactly what the s and t are meant to stand for).
If you're reading a register (in ID) which was recently written (the instruction writing it hasn't reached the write-back stage yet), that's a RAW (read-after-write) true dependency. The comparisons you listed detect exactly that case: they check whether the destination register of an older instruction still sitting in the EX/MEM or MEM/WB pipeline latch matches one of the source registers of the instruction currently in ID/EX, so the result can be forwarded back instead of stalling.
Out-of-order exec also introduces the possibility of write-after-write (output) and write-after-read (anti-) dependency hazards. https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards. But in a scalar in-order pipeline, only true dependencies are a concern, I think. At least if all instructions have fixed 1-cycle latency, so excluding funky stuff like MIPS mult or div that write the hi:lo pair.
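If it helps to see those comparisons spelled out, here is a minimal C++ sketch of the same forwarding conditions as a toy check; all of the struct and field names are made up for illustration:
#include <cstdint>
#include <cstdio>

// Toy model of the pipeline latches involved in the forwarding conditions.
struct PipelineRegs {
    uint8_t ex_mem_rd;   // destination of the instruction currently in EX/MEM
    uint8_t mem_wb_rd;   // destination of the instruction currently in MEM/WB
    uint8_t id_ex_rs;    // first source of the instruction currently in ID/EX
    uint8_t id_ex_rt;    // second source of the instruction currently in ID/EX
};

// Conditions 1a/1b: forward from EX/MEM (result computed one cycle ago).
bool forward_from_ex_mem(const PipelineRegs& p) {
    return p.ex_mem_rd != 0 &&   // register 0 is hardwired to zero in MIPS
           (p.ex_mem_rd == p.id_ex_rs || p.ex_mem_rd == p.id_ex_rt);
}

// Conditions 2a/2b: forward from MEM/WB (result computed two cycles ago).
bool forward_from_mem_wb(const PipelineRegs& p) {
    return p.mem_wb_rd != 0 &&
           (p.mem_wb_rd == p.id_ex_rs || p.mem_wb_rd == p.id_ex_rt);
}

int main() {
    // add $3, $1, $2 followed immediately by sub $4, $3, $5:
    // $3 is still in EX/MEM when sub needs it, so condition 1a fires.
    PipelineRegs p{ /*ex_mem_rd=*/3, /*mem_wb_rd=*/0, /*id_ex_rs=*/3, /*id_ex_rt=*/5 };
    std::printf("forward from EX/MEM: %d, from MEM/WB: %d\n",
                forward_from_ex_mem(p), forward_from_mem_wb(p));
}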

What potential error can be caused by retrieving and changing of a value in the same line?

I was asked what would happen if I try to retrieve a reference value and then change it within the same line of code. My answer was that nothing would happen, because when I had tried this before I did not encounter any compiler errors (at least in C# or Java).
What is the real answer to this?
This is example with the pseudo code:
Module main()
Call changeNumber(10)
End Module
Module changeNumber(Integer Ref number)
Set number = number * number
Display number
End Module
(PS. Sorry for not formatting/creating this post correctly. I'm having a bit of an issue here.)
There would be no unusual side effects, if that's what you're asking. The language specifications dictate a specific order of execution (number * number is evaluated first, then assigned to number), which prevents any issues from occurring.
Nothing would happen, in your particular pseudo code.
In reference to a question you asked after this question -
Actually, there could be issues in some rare instances, but it would depend on how space is allocated for the number and on your language. Consider this: you explicitly declare the data type as an int, but the accepted input is larger than an int can handle (or negative), and you then multiply it by itself. The result no longer fits in the space allocated for the type. In C that is signed integer overflow, which is undefined behavior (it will not necessarily crash, but nothing about the result is guaranteed); a higher-level language may catch this case with compile-time or runtime checks, but not always.
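Here is a small C++ illustration of that overflow concern; changeNumber is written to mirror the pseudo code above, and the guard (widening to long long before multiplying) is just one conventional way to detect the problem:
#include <cstdio>
#include <limits>

// Mirrors the pseudo code: square the referenced value in place. The
// right-hand side is fully evaluated before the assignment, so the
// "read and change in the same line" part is well defined; the real
// risk is that the square no longer fits in the type.
void changeNumber(int& number) {
    long long squared = static_cast<long long>(number) * number;  // widen first
    if (squared > std::numeric_limits<int>::max()) {
        std::printf("squaring %d would overflow an int\n", number);
        return;
    }
    number = static_cast<int>(squared);
    std::printf("%d\n", number);
}

int main() {
    int ok = 10;
    changeNumber(ok);        // prints 100

    int big = 100000;        // 100000 * 100000 does not fit in a 32-bit int
    changeNumber(big);       // reports the overflow instead
}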

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetcht1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for prefetching with intent to write.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of prefetchw mentioned above is that it will bring the line and give you ownership of it, which may take a while if the line was also used by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long before the actual access and don't want the prefetch to get lost in that duration, or alternatively prefetch right before the access and don't want the prefetches to thrash your cache too much.
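For what it's worth, here is a rough sketch of how the read/write split could look on ia32/amd64, assuming a GCC/Clang/ICC-style compiler; the helper names are made up, and the write case relies on __builtin_prefetch letting the compiler emit prefetchw where the target supports it:
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T1 / _MM_HINT_NTA

// Read prefetches use the SSE hints already shown in the question.
static inline void prefetch_read_t1(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T1);
}

static inline void prefetch_read_nta(const void* p) {
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_NTA);
}

// Write prefetch: ask for the line with intent to modify where possible.
static inline void prefetch_write(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, /*rw=*/1, /*locality=*/2);          // locality 2 roughly matches T1
#else
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T1);   // plain read prefetch fallback
#endif
}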
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact through the non-temporal hint. If your prefetcher were suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that throw away lots of useful cache lines. The NTA hint makes them overrun each other, leaving the rest of the cache undamaged.
Of course, this may also just be a bug; I can't tell for sure (only whoever wrote it could), but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be the non-temporal (streaming) stores, where a write can bypass the cache; as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working-set sizes for binaries there, long-term control-flow patterns, etc., and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here, I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless of what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

How to reset Ada.Real_Time.Clock?

When reading Ada.Real_Time.Clock right after power-up, it shows a value that isn't close to zero and is sometimes even negative.
As far as I know, Ada.Real_Time.Clock is supposed to reset on power-up.
How can I reset Ada.Real_Time.Clock?
Thanks.
The Ada 2005 LRM declares that "real time is defined to be the physical time as observed in the external environment. [emphasis added--MC]
"It is not specified by the language whether the time values are synchronized with any standard time reference. For example, E can correspond to the time of system initialization or it can correspond to the epoch of some time standard." (D.8[18-19])
As it states, Ada does not require that "E", the start of the epoch serving as the "zero time" for real-time Time values, correspond to any particular starting point; it's left up to the compiler implementer.
Whatever specific numeric values you observe for the instances of Time you're seeing, whether near or far from zero, positive or negative, are dependent solely on the compiler implementer's choice of E, how it represents times values, and how it correspondingly implements the real-time capability.
Therefore you should avoid writing code that depends on specific, knowable values of Time, or code that requires Time values to be intimately manipulable.
Real_Time.Time values should be considered abstract quantities.
Agreeing with Marc. While I have seen some platforms that use time since boot (particularly on Intel platforms, where I think they like to use the processor's iteration counter), that is entirely up to the compiler vendor.
If you need something like "time since startup" and your platform isn't giving you that, then the thing to do would be to grab Real_Time.Clock when you start up, and subtract that value from all further reads from Real_Time.Clock.
You can look at exactly what facilities are defined for the Real_Time package, including all the LRM sections Marc was quoting, at its LRM page here.
It was long ago, but in case it helps someone...
I reset the clock by writing 0 to the time-base registers of the MCU.
That's a lovely explanation, but what if someone is trying to write unit tests against code that uses the Real_Time clock? For instance, I know that my function foo does an internal comparison against Ada.Real_Time.Clock to check for time spans. Before executing foo with the appropriate inputs, I want to reset the clock to force foo down a specific path internally and verify that the resulting out parameter has changed.
return_value := foo;
assert (return_value = path1, "tested foo path1");
Reset_Clock;
return_value := foo;
assert (return_value = path2, "tested foo path2");

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo.
I think your question is related to this question or this one. I think it is clear that, unless you can rely on an OS or library function, you will want to use the CPUID instruction; but the question then becomes exactly what information you are looking for, and of course AMD's and Intel's implementations don't need to agree. This page suggests using CPUID.1.EBX[15:8] (i.e., BH) for finding out on Intel and function 80000005h on AMD. In addition, on Intel, CPUID.2... seems to contain the relevant information, but it looks like a real pain to parse out the desired information.
I think, from what I've read, both AMD and Intel CPUID instructions will support CPUID.1.EBX[15:8], which returns the size of one cache line in QUADWORDs as used by the CLFLUSH instruction (which isn't present on all processors, so I don't know whether you'll always find something there). So, after executing CPUID.1, you'd have to multiply BH by 8 to get the cache line size in bytes. This hinges on my implicit assumption (please can anyone say whether it is really valid?) that the definition of one cache line size is always the same for CLFLUSH and PREFETCHh instructions.
Also, Intel's manuals state that PREFETCHh is only a hint, but that, if it prefetches anything, it will always fetch a minimum of 32 bytes.
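Here is a minimal sketch of that CPUID.1 query, assuming GCC/Clang's <cpuid.h> (on MSVC you would use the __cpuid intrinsic from <intrin.h> instead):
#include <cpuid.h>
#include <cstdio>

// Queries CPUID leaf 1 and derives the CLFLUSH line size from EBX[15:8],
// which is reported in 8-byte (quadword) units.
int main() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        std::printf("CPUID leaf 1 not supported\n");
        return 1;
    }
    if (edx & (1u << 19)) {  // EDX bit 19: CLFLUSH supported, so EBX[15:8] is valid
        unsigned int lineSize = ((ebx >> 8) & 0xff) * 8;
        std::printf("cache line size: %u bytes\n", lineSize);
    } else {
        std::printf("CLFLUSH not reported; EBX[15:8] may not be meaningful\n");
    }
    return 0;
}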
EDIT1:
Another useful resource (even if not directly answering your question) for the optimised use of PREFETCHh is Intel's optimisation manual here.