So after researching engines a lot, I've been building a 2D framework for the iPhone. As you know, the world of engine architecture is vast, so I've been trying to apply best practices as much as possible.
I've been using:
uint_fast8_t mId;
If I look up the definition of uint_fast8_t I find:
/* 7.18.1.3 Fastest-width integer types */
...
typedef uint8_t uint_fast8_t;
And I've been using these types throughout my code. My question is: is there a performance benefit to using these types? And what exactly is going on behind the scenes? Besides the obvious fact that this is the correct data type (unsigned 8-bit integer) for the data, is it worthwhile to have this peppered throughout my code?
Is this a needless optimization that the compiler would probably take care of anyways?
Thanks.
Edit: No responses/answers, so I'm putting a bounty on this!
the "fast" integer types are defined to be the fastest integer type available with at least the amount of bits required (in this case 8).
If your platform defines uint_fast8_t as uint8_t then there will be absolutely no difference in speed.
The reason these types exist is that there may be architectures that are slower when not using their native word length. E.g. I could find one reference where, for Alpha processors, uint_fast8_t was defined to be "unsigned int".
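If you want to check what your toolchain actually does, a quick sizeof dump is enough. A minimal sketch (my own, not from the question); on iOS/ARM the fast 8-bit type is typically one byte, but the standard only guarantees "at least 8 bits":

#include <cstdint>
#include <cstdio>

int main() {
    std::printf("uint8_t       : %zu byte(s)\n", sizeof(uint8_t));
    std::printf("uint_fast8_t  : %zu byte(s)\n", sizeof(uint_fast8_t));
    std::printf("uint_fast16_t : %zu byte(s)\n", sizeof(uint_fast16_t));
    std::printf("uint_fast32_t : %zu byte(s)\n", sizeof(uint_fast32_t));
    return 0;
}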
A uint_fast8_t is the fastest integer type guaranteed to be at least 8 bits wide. Depending on your platform it could be 8, 16, or 32 bits wide.
It isn't something the compiler takes care of for you by itself; choosing the type is up to you, and on platforms where the fast type differs from uint8_t it can indeed make your program execute faster.
Here are some resources I found; you might already have seen them: http://embeddedgurus.com/stack-overflow/2008/06/efficient-c-tips-1-choosing-the-correct-integer-size/
http://www.mail-archive.com/avr-gcc-list#nongnu.org/msg03149.html
The header in mingw64 said the fast types are "Not actually guaranteed to be fastest for all purposes"
/* 7.18.1.3 Fastest minimum-width integer types
* Not actually guaranteed to be fastest for all purposes <---------------------
* Here we use the exact-width types for 8 and 16-bit ints.
*/
typedef signed char int_fast8_t;
typedef unsigned char uint_fast8_t;
typedef short int_fast16_t;
typedef unsigned short uint_fast16_t;
typedef int int_fast32_t;
typedef unsigned int uint_fast32_t;
__MINGW_EXTENSION typedef long long int_fast64_t;
__MINGW_EXTENSION typedef unsigned long long uint_fast64_t;
And that still applies to ARM and other architectures, because using a narrow type requires zero extension or sign extension in many situations, which is less efficient than using a native int.
However, the narrow types do pay off for large arrays or in the case of slow operations (like division). I'm not sure how slow ARM divisions are, but on x86, 64-bit division is much slower than 32-bit or 8-bit division.
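To make the extension point concrete, here is a minimal sketch (my own example, not from the answer). Note the two functions are not strictly equivalent: if uint_fast8_t is wider than 8 bits, the accumulator no longer wraps at 256, and that freedom is exactly what lets the compiler skip the masking:

#include <cstddef>
#include <cstdint>

// 8-bit accumulator: the compiler must truncate/zero-extend to 8 bits after
// every addition on architectures without native 8-bit arithmetic.
unsigned sum_narrow(const uint8_t* data, std::size_t n) {
    uint8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}

// "Fast" accumulator: may be a full register wide, so no masking is needed.
unsigned sum_fast(const uint8_t* data, std::size_t n) {
    uint_fast8_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
}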
I noticed a HashSet<int> performing very slowly when working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I used toString() to convert all items to strings, it performed 1000x faster.
I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.
How to reproduce:
First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:
Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:
Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:
I've had several friends run the same test code on various machines (Intel i5, Apple M1, Snapdragon); the timings are different but the conclusions are the same.
What's not the answer here:
Here are some things I considered, but after some more tests they couldn't explain what's happening.
Maybe int needs boxing, whereas string is already an object?
That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.
Strings are immutable, so their hash values could be cached?
I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Test code:
The full test code with data is too long for Stack Overflow, so I've put it here instead: https://pastebin.com/raw/4fm2hKQB
So yeah, I'm lost. If anyone could help me understand what's going on, that would be greatly appreciated!
I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.
The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.
Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values), so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.
@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:
int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // Maybe do it again to let bits higher than 20 influence the low bits.
  return hash;
}
Something like this would ideally be built into the core library's hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will likely be prioritized behind security bugs, incorrect-behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible or difficult to do.
A completely different approach would be to use an ordered Set like SplayTreeSet.
I also suspect a hash collision problem.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Well, "all unique" does not mean "there is no collision". For a hash set, the number of bins are much less than the number of hashcode. For example, suppose you have a hash set with 1000 bins, and the mapping from hashCode to bin index is a simple bin index := hashCode % 1000, and suppose your data has hashCode like 0, 1000, 2000, 3000 etc. In this artificial case, your data has all unique hashCode, but they all fall into the first bin out of the 1000 bins. Huge collision!
A simple approach to debug whether the hash code is the problem: re-run the program with LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...). By using such a new hash set, you can test other hashCode-generating functions. If some of them do not result in the same extremely slow speed, it is very likely that the original hashCode function is the one causing collisions.
In addition, you can even use the same hashCode method for both the int and the String case. Then you guarantee that both cases have exactly the same collision behavior, and it is easy to see whether collisions are the cause or unrelated.
Another debug approach: Look at the C++ source code of LinkedHashSet, and see what algorithms it uses to assign data to bins. Then check whether collision as mentioned above happens or not.
A third debug method: compile the pure-Dart program into an executable and run it under a profiler like perf. Then you can see which code is hottest and consumes most of the time. You may need debug symbols for Dart's native C++ code, which should be fetchable.
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};
I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into the shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere near large enough to cause an overflow, yet the value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass structs and classes into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that classes should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all members of the struct/class are defined both for the host and the device side, or at least designed with GPU use in mind;
then it should be safe to pass to a kernel (see the sketch below).
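The first bullet can even be checked at compile time. A minimal sketch (my own check, not part of the question or of any CUDA API), using the protoarray layout from the question:

#include <cstddef>
#include <type_traits>

struct protoarray {
    std::size_t dim1, dim2, dim3;
    float* data;   // raw pointer is fine; what it points to is not copied
};

// Fails to compile if the struct ever stops being trivially copyable.
static_assert(std::is_trivially_copyable<protoarray>::value,
              "protoarray can be passed to a kernel by value");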
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind, so I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However, if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged(), then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
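A minimal sketch of the cudaMallocManaged() route (my own code; the helper name is made up and error handling is omitted): because a managed pointer is valid on both host and device, the struct can be passed to the kernel by value and its data member dereferenced on either side.

#include <cuda_runtime.h>
#include <cstddef>

struct protoarray {
    std::size_t dim1, dim2, dim3;
    float* data;
};

// Hypothetical helper: allocates the payload in unified (managed) memory.
protoarray make_managed_protoarray(std::size_t d1, std::size_t d2, std::size_t d3) {
    protoarray arr{d1, d2, d3, nullptr};
    // Pages migrate automatically between CPU and GPU on access.
    // (Return value of cudaMallocManaged ignored here for brevity.)
    cudaMallocManaged(&arr.data, d1 * d2 * d3 * sizeof(float));
    return arr;
}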
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
I've been reading through this section for a while, but I can't seem to figure it out. I'm on AMD64 ABI Draft 0.99.6, page 18, section 3.32 Parameter Passing, and there's the following text:
Arguments of type __m256 are split into four eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
I'm confused because it sounds like I use three SSEUP registers and only one SSE, but that seems wasteful of the other two SSE registers associated with the SSEUP. Am I misreading it? I probably won't even use this datatype, but I've been confused on this text for quite a while. Can someone give an example of how this would work? I'm probably missing something obvious.
Page 18 just contains a list of definitions necessary for a later discussion of the algorithm used to pass the parameters of a function.
In particular, the SSE class is always passed in a new vector register, the first available of %xmm0-%xmm7.
Note that these names refer to the lower 128-bit parts of the registers, but it's better to think of them as variable-size vector registers %v0-%v7.
The SSEUP class is passed in the next available eightbyte (64-bit chunk) of the last vector register used.
__m256 is then passed, on processors that support AVX, using a single %ymm register: the lowest 64 bits get the SSE class, and hence a new %v0 register, while the other three 64-bit chunks get SSEUP, thereby reusing the %v0 register (a small example follows the quote below).
Here's the relevant quote from the document:
If the class is SSE, the next available vector register is used, the registers are taken in the order from %xmm0 to %xmm7.
If the class is SSEUP, the eightbyte is passed in the next available eightbyte chunk of the last used vector register.
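Putting those two rules together, here is a tiny example of my own (compiled with AVX enabled, e.g. -mavx): the single __m256 parameter is classified as one SSE eightbyte plus three SSEUP eightbytes, so it arrives whole in %ymm0 (the 256-bit view of %xmm0), and the return value travels back in %ymm0 as well.

#include <immintrin.h>

// First (and only) vector argument: SSE + 3x SSEUP -> passed in %ymm0.
__m256 add_one(__m256 v) {
    return _mm256_add_ps(v, _mm256_set1_ps(1.0f));
}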
The SSEUP class was introduced earlier in the ABI and it is still present today.
You can quickly consult Version 0.9 to see the differences: the types __m256 and __m512 were not present, for example.
For compilers that don't support the new ABI with the __m256 type, or for compilers that do support it but target processors with no AVX support, that type is usually an aggregate of two __m128, and thus by the rules described later (particularly the post-merge rules) it is passed in memory:
For compilers using the old ABI:
If the size of an object is larger than two eightbytes, or in C++, is a non-POD structure or union type, or contains unaligned fields, it has class MEMORY.
For compilers using the new ABI:
If the size of the aggregate exceeds two eightbytes and the first eightbyte isn't SSE or any other eightbyte isn't SSEUP, the whole argument is passed in memory.
The standard is admittedly confusing, mostly due to the need to address backward compatibility. The SSE and SSEUP classifications are handy in an architecture where the vector registers keep widening and a broad range of different sizes is already present out there.
My kernel is pretty simple. It tries to see if the codes are valid and then stores only the unique codes according to the prefix scan output:
__kernel void moveValid(__global int* sortCode, __global int* mark, __global int* processorOffsets,
                        __global int* uniqueCode, __global int* numPoints, __global int* pointIndex)
{
    int ig = get_global_id(0);
    int m = mark[ig];
    int j = processorOffsets[ig];
    atomic_inc(&numPoints[j - 1]);
    // select
    if(m == true)
    {
        uniqueCode[j] = sortCode[ig];
        pointIndex[j] = ig;
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
The kernel seems really slow. Is it due to the if statement? Can anyone give any tips as to how the kernel can be improved? Also, can select() be used in this scenario?
So, without looking too deeply at your code, I can give the following feedback about its speed. I am assuming that you are using a GPU as your device. If you are using a CPU as your device, some of the information may still apply.
atomic_inc(&numPoints[j-1]);
Atomic increments on global memory are PAINFULLY slow on most devices, because the data has to be committed to global memory (it can't be cached locally).
barrier(CLK_GLOBAL_MEM_FENCE);
This barrier ensures that all work-items in the work-group have arrived before execution continues. Why do you need this in your code? Especially when nothing is left to do afterwards, there is no reason not to let your threads finish execution. This is also a big performance hit.
if(m == true)
This is not actually the worst if statement I've seen, because it only has one branch. This will still slow down your code, but not as substantially as other things. The condition will be computed for all threads in serial (for some architectures) before continuing.
Overall, in this code you are performing four global memory accesses and one atomic operation, and no math operations. GPUs are the worst type of device for this kind of algorithm, because accesses to global memory are exceptionally slow, especially if the accesses are not coalesced. Could you consider moving some of your arrays to local memory instead?
I'm using an open-source networking framework that makes it easy for developers to communicate over a service found using Bonjour in Objective-C.
There are a few lines that have had me on edge for a while now, even though they never seem to have caused any problems on any machines I've tested, regardless of whether I'm running the 32-bit or 64-bit version of my application:
int packetLength = [rawPacketData length];
[outgoingBuffer appendBytes:&packetLength length:sizeof(int)];
[outgoingBuffer appendData:rawPacketData];
[self writeToStream];
Note that the first piece of information sent is the length of the data packet, which is pretty standard, and then the data itself is sent. What scares me is the length of the length. Will one machine ever assume an int is 4 bytes, while the other machine believes an int to be 8 bytes?
If the two sizes could be different on different machines, what would cause this? Is it dependent on my compiler, or the end-user's machine architecture? And finally, if it is a problem, how can I take an 8-byte int and scrunch it down to 4-bytes to ensure backwards compatibility? (Since I'll never need more than 4 bytes to represent the size of the data packet.)
You can't assume that sizeof(int) will always be four bytes. If the size matters, you should either hard-code a size of 4 (and write code to serialize values into four-byte arrays with the proper endianness), or use types like int32_t defined in <stdint.h>.
(However, as a practical matter, most compiler vendors have decided that int should stay four bytes, so you probably don't need to worry about everything breaking tomorrow. Then again, it wasn't so long ago that many compiler vendors let an int be two bytes, leading to many problems when ints became four bytes, so you really ought to do things the right way to guard against future changes.)
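For the "scrunch it down to 4 bytes" part, a minimal sketch (my own helper names, not part of the framework you're using): pin the header to a uint32_t in a fixed byte order, so both ends agree on its size and layout regardless of their native int.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   // htonl / ntohl

// Write the packet length into a fixed 4-byte header in network byte order.
// (Assumes the length always fits in 32 bits, as stated in the question.)
void write_length_header(uint8_t header[4], uint32_t packetLength) {
    uint32_t len = htonl(packetLength);
    memcpy(header, &len, sizeof(len));
}

// Read it back on the receiving side.
uint32_t read_length_header(const uint8_t header[4]) {
    uint32_t len;
    memcpy(&len, header, sizeof(len));
    return ntohl(len);
}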
It could be different, but this depends on the compiler more than the machine. A different compiler might define int to be 8 bytes.
The size of int depends on the compiler and the machine architecture; historically it has often matched the size of the data bus, unless your C compiler does something special and changes it.
That means the size of int may not be 4 bytes when you compile your program for an 8-bit, 16-bit or 64-bit machine/architecture.
I would define a constant for the buffer size instead of using sizeof(int).
Hope this answers your question.