In my function (or task), I have a constant string that is only used inside that method.
What is the best way to define it (for performance):
const static string stuff = "stuff";
const string stuff = "stuff";
static string stuff = "stuff";
string stuff = "stuff";
Example on EDA Playground: http://www.edaplayground.com/s/4/1090
const will prevent future writes. IEEE Std 1800-2012 § 6.20.6 "Const constants" states "... a const can be set during simulation ...", which suggests it is up to the vendor to decide whether there should be any optimization for performance.
static will put the variable into shared memory. It could help or hurt performance depending on the simulation scenario. The actual performance impact is simulator specific, so you will need to run your own benchmarks. See IEEE Std 1800-2012 § 6.21 "Scope and lifetime" for more.
For a small project, the performance impact will be negligible. For large projects, performance needs to be split into two categories: memory usage and memory access time. static variables can trade a smaller memory footprint (shared memory) for longer look-up times (the memory address of the static variable can be far away from the rest of the object data). const is unlikely to add any performance penalty.
The easiest way to get some basic performance data is to end the simulation with $finish(2); see IEEE Std 1800-2012 Table 20-1, "Diagnostics for $finish". If the simulator follows the standard, this will report the simulation time, location, and statistics about the memory and CPU time used in the simulation.
With the provided example using ModelSim 10.1d, all combinations reported the same memory usage. Run time was only affected by the number of calls to the print method, not by the const/static attributes.
If I had to guess, the performance would be (ordered best to worst):
const string in a static method
string in a static method
const static string
static string
const string
string
I noticed a HashSet<int> performing very slowly when working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I used toString() to convert all items to strings, it performed 1000x faster.
I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.
How to reproduce:
First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:
Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:
Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:
I've had several friends run the same test code on various machines (Intel i5, Apple M1, Snapdragon); the timings differ, but the conclusions are the same.
What's not the answer here:
Here are some things I considered, but after some more tests they couldn't explain what's happening.
Maybe int needs boxing, whereas string is already an object?
That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.
string is immutable so their hash value could be cached?
I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Test code:
The full test code with data is too long for Stack Overflow, so I've put it here instead: https://pastebin.com/raw/4fm2hKQB
So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!
I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.
The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.
Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values) so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.
@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:
int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // Maybe do it again to let bits higher than 20 influence the low bits.
  return hash;
}
Something like this would ideally be built into the core library hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will likely be prioritized behind security bugs, incorrect-behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible or difficult to do.
A completely different approach would be to use an ordered Set like SplayTreeSet.
I am also considering the hash collision problem.
int hash resulted in a lot of collisions?
I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.
Well, "all unique" does not mean "there is no collision". For a hash set, the number of bins is much smaller than the number of possible hash codes. For example, suppose you have a hash set with 1000 bins, the mapping from hashCode to bin index is simply bin index := hashCode % 1000, and your data has hash codes like 0, 1000, 2000, 3000, etc. In this artificial case, your data has all-unique hash codes, but they all fall into the first of the 1000 bins. Huge collision!
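To make that concrete, here is a tiny, language-agnostic illustration (sketched in C++ rather than Dart, since the effect does not depend on the language; the constants are made up). Every key is unique, yet a power-of-two table that derives the bin index from the low bits sends all of them to the same bin:
// Distinct keys with identical low bits all land in bin 0 of an 8-bin table.
#include <cstdio>

int main() {
    const unsigned buckets = 8;                       // power-of-two table
    const unsigned mask = buckets - 1;                // bin index = hash & mask
    const unsigned keys[] = {8, 16, 24, 32, 40, 48};  // unique "hash codes"
    for (unsigned k : keys)
        std::printf("key %2u -> bin %u\n", k, k & mask);  // always bin 0
    return 0;
}
The same keys spread across the bins as soon as the high bits are mixed into the index, which is exactly what a dispersing hashCode is meant to achieve.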
A simple approach to debug whether it is a hashcode problem: re-run the program with LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...). With such a hash set you can test other hashCode-generating functions. If some of them do not result in the same extreme slowness, the cause is very likely the original hashCode function producing collisions.
In addition, you can even use the same hashCode method for both the int and the String case. Then you guarantee that both cases have exactly the same collision behavior, and it becomes easy to see whether collision is the cause or is unrelated.
Another debug approach: look at the source code of LinkedHashSet in the Dart SDK and see what algorithm it uses to assign data to bins. Then check whether the collision scenario described above happens or not.
A third debug method: compile the pure-Dart program into an executable and run it under a profiler like perf. Then you can see which code is hottest and consumes most of the time. You may need the debug symbols of Dart's native C++ code, which should be fetchable.
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};
I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere near large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass structs and classes into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult in case they contain members that point to dynamically allocated memory, and that such classes should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all of its members are defined for both the host and the device side, or are at least designed with GPU use in mind;
then it should be safe to pass to a kernel.
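For illustration, here is a minimal sketch of that pattern (not the asker's actual program; the dimensions, kernel body, and launch configuration are made up). The trivially-copyable struct travels by value into the kernel's parameter space; only its data pointer needs to refer to device memory:
#include <cuda_runtime.h>

struct protoarray {
    size_t dim1, dim2, dim3;
    float* data;                       // must point to device memory when used in a kernel
};

__global__ void fill(protoarray arr, float value) {
    const size_t n = arr.dim1 * arr.dim2 * arr.dim3;
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) arr.data[i] = value;    // dims are read from the by-value copy of the struct
}

int main() {
    protoarray arr{2, 3, 4, nullptr};
    const size_t bytes = arr.dim1 * arr.dim2 * arr.dim3 * sizeof(float);
    cudaMalloc(reinterpret_cast<void**>(&arr.data), bytes);  // device allocation for the payload
    fill<<<1, 256>>>(arr, 1.0f);       // the struct itself is copied; the pointer stays valid
    cudaDeviceSynchronize();
    cudaFree(arr.data);
    return 0;
}
Because the struct is trivially copyable and owns nothing, there is no deep copy to worry about: the device pointer inside it is simply copied bit-for-bit.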
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind, so I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
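As a minimal sketch of the cudaMallocManaged() route (again illustrative, not the asker's code): the payload lives in managed memory, so the same pointer is valid on both host and device, and pages migrate on demand.
#include <cstdio>
#include <cuda_runtime.h>

struct protoarray {
    size_t dim1, dim2, dim3;
    float* data;
};

__global__ void scale(protoarray arr, float factor) {
    const size_t n = arr.dim1 * arr.dim2 * arr.dim3;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        arr.data[i] *= factor;         // grid-stride loop over the managed buffer
}

int main() {
    protoarray arr{2, 3, 4, nullptr};
    const size_t n = arr.dim1 * arr.dim2 * arr.dim3;
    cudaMallocManaged(reinterpret_cast<void**>(&arr.data), n * sizeof(float));
    for (size_t i = 0; i < n; ++i) arr.data[i] = 1.0f;   // initialise on the host
    scale<<<1, 128>>>(arr, 2.0f);                        // struct still passed by value
    cudaDeviceSynchronize();                             // wait before touching data on the host
    std::printf("%f\n", arr.data[0]);                    // prints 2.000000
    cudaFree(arr.data);
    return 0;
}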
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
I've been reading through this section for a while, but I can't seem to figure it out. I'm on AMD64 ABI Draft 0.99.6, page 18, section 3.2.3 "Parameter Passing", and there's the following text:
Arguments of type __m256 are split into four eightbyte chunks. The least significant one belongs to class SSE and all the others to class SSEUP.
I'm confused because it sounds like I use three SSEUP registers and only one SSE, but that seems wasteful of the other two SSE registers associated with the SSEUP. Am I misreading it? I probably won't even use this datatype, but I've been confused on this text for quite a while. Can someone give an example of how this would work? I'm probably missing something obvious.
Page 18 just contains a list of definitions necessary for a later discussion of the algorithm used to pass the parameters of a function.
In particular, the SSE class is always passed in a new vector register, the first available of %xmm0-%xmm7.
Note that these names refer to the lower 128-bit parts of the registers, but it's better to think of them as variable-size vector registers %v0-%v7.
The SSEUP class is passed in the next available eightbyte (64-bit chunk) of the last vector register used.
__m256 is then passed, on processors that support AVX, in a single %ymm register: the least significant eightbyte gets the SSE class - and hence a new %v0 register - while the other three 64-bit chunks get SSEUP, thereby reusing the %v0 register.
Here's the relevant quote from the document:
If the class is SSE, the next available vector register is used, the registers
are taken in the order from %xmm0 to %xmm7.
If the class is SSEUP, the eightbyte is passed in the next available eightbyte
chunk of the last used vector register.
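To see what that means in practice, here is a small illustrative snippet (the function is made up; assume a System V x86-64 target compiled with AVX enabled, e.g. -mavx). Each __m256 argument is classified as one SSE eightbyte plus three SSEUP eightbytes, so the whole value occupies a single %ymm register instead of spilling across four registers:
#include <immintrin.h>

// 'a' -> SSE + SSEUP + SSEUP + SSEUP  => the entire 256-bit value arrives in %ymm0
// 'b' -> same classification          => the entire 256-bit value arrives in %ymm1
__m256 scale_add(__m256 a, __m256 b) {
    // The return value is likewise a single SSE + 3x SSEUP object, returned in %ymm0.
    return _mm256_add_ps(_mm256_mul_ps(a, _mm256_set1_ps(2.0f)), b);
}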
The SSEUP class was introduced early in the ABI's history and is still present today.
You can quickly consult Version 0.9 to see the differences: the types __m256 and __m512 were not present, for example.
For compilers that don't support the new ABI with the __m256 type, or for compilers that do support it but target processors with no AVX support, that type is usually an aggregate of two __m128 and thus, by the rules described later (particularly the post-merge rules), it is passed in memory.
For compilers using the old ABI:
If the size of an object is larger than two eightbytes, or in C++, is a non-POD structure or union type, or contains unaligned fields, it has class MEMORY.
For compilers using the new ABI:
If the size of the aggregate exceeds two eightbytes and the first eightbyte isn't SSE or any other eightbyte isn't SSEUP, the whole argument is passed in memory.
The standard is admittedly confusing, mostly due to the need to address backward compatibility. The SSE and SSEUP classes are handy classifications in an architecture where the vector registers keep widening and a broad range of different sizes is already present out there.
I'm using the MATLAB profiler to observe memory usage with the commands
profile -memory on
profile clear
% my code
profile report
and I got this table.
1. I want to ask about the meaning of Allocated Memory, Freed Memory, Self Memory, and Peak Memory.
2. What is the meaning of negative Self Memory?
After a quick google, it would seem that no-one knows, except perhaps MathWorks and they aren't telling. (I jest, but in truth I found very little information on the subject).
Logically however I would interpret the column names as follows:
Allocated memory = the total amount of memory allocated within the function and any it calls.
Freed memory = the total amount of memory released within the function and any it calls.
Peak Memory = the maximum amount of memory in use at any one time during the execution of the function.
Self Memory = the amount of memory used by the function, but not including any functions it calls.
I would hypothesize that a negative 'Self Memory' indicates that the function frees more memory than it allocates. This could happen when it takes ownership of a piece of data passed to it, which it then clears. E.g.:
function A()
    foo = B();
    clear foo
end

function foo = B()
    foo = rand(10000,10000);
end
In the example above, the data is created in the call to B, and since MATLAB employs lazy copy-on-write memory management, this case works pretty much as pass-by-reference for the return value. So, B allocates the memory, and A frees it.
Indeed, running that code with the profiling method in the question produces the following output, which supports my hypothesis.
When we define a variable with the following syntax, does that mean it is hanging around in memory all the time?
static NSString *const kMyLabel = @"myLabel";
I have 100 constants. Should I go with this or with the #define preprocessor directive, considering that #define will not keep them alive in memory?
Hardcoded strings, in the format @"my string", are baked into the application binary. In order to make it not be permanent, you'd have to do:
static NSString *kMyLabel = nil;
...somewhere else
kMyLabel = [[NSMutableString alloc] initWithString:@"myLabel"];
But that'd be stupid, because then you'd have both @"myLabel" in memory (because it's part of the app binary) AND your allocated string. So double the memory.
In short:
If you have a constant string, there's no way to "unload" it from memory. And unless you're hard coding a few chapters from a book into your binary, it's not going to be something to worry about. Have you measured it as being a performance issue?
There would be no difference between a constant static variable and a #define directive. When using #define, the preprocessor replaces the macro with @"myLabel" every time it is used. This could mean that you have one instance of the string for each use, but the compiler combines duplicate literals so that each string appears only once in the binary. Using the constant static, the code loads the location of the variable when needed. This means #define may be a tiny bit faster, as there is less dereferencing to get the string, but the difference would be unnoticeable.
It will be "in memory", but it will just be a memory mapped section of your application's executable file. If there's memory pressure, that page will be flushed without writing to disk.
Basically, it's "free" except for a tiny bit of IO on startup. Go nuts with them.