EditorGUILayout.MaskField issue with large enums - unity3d

I'm working on an input system that would allow the user to translate input mappings between different input devices and operating systems and potentially define their own.
I'm trying to create a MaskField for an editor window where the user can select from a list of RuntimePlatforms, but selecting individual values results in multiple values being selected.
Mainly for debugging I set it up to generate an equivalent enum RuntimePlatformFlags that it uses instead of RuntimePlatform:
[System.Flags]
public enum RuntimePlatformFlags: long
{
OSXEditor=(0<<0),
OSXPlayer=(0<<1),
WindowsPlayer=(0<<2),
OSXWebPlayer=(0<<3),
OSXDashboardPlayer=(0<<4),
WindowsWebPlayer=(0<<5),
WindowsEditor=(0<<6),
IPhonePlayer=(0<<7),
PS3=(0<<8),
XBOX360=(0<<9),
Android=(0<<10),
NaCl=(0<<11),
LinuxPlayer=(0<<12),
FlashPlayer=(0<<13),
LinuxEditor=(0<<14),
WebGLPlayer=(0<<15),
WSAPlayerX86=(0<<16),
MetroPlayerX86=(0<<17),
MetroPlayerX64=(0<<18),
WSAPlayerX64=(0<<19),
MetroPlayerARM=(0<<20),
WSAPlayerARM=(0<<21),
WP8Player=(0<<22),
BB10Player=(0<<23),
BlackBerryPlayer=(0<<24),
TizenPlayer=(0<<25),
PSP2=(0<<26),
PS4=(0<<27),
PSM=(0<<28),
XboxOne=(0<<29),
SamsungTVPlayer=(0<<30),
WiiU=(0<<31),
tvOS=(0<<32),
Switch=(0<<33),
Lumin=(0<<34),
BJM=(0<<35),
}
In this linked screenshot, only the first 4 options were selected. The integer next to "Platforms: " is the mask itself.
I'm not a bitwise wizard by a large margin, but my assumption is that this occurs because EditorGUILayout.MaskField returns a 32bit int value, and there are over 32 enum options. Are there any workarounds for this or is something else causing the issue?

The first thing I noticed is that every value in that enum is the same (zero), because you are shifting 0 to the left. You can observe this by logging the values with this script.
// Shifts 0 bits to the left, printing "0" 36 times.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((0 << i), 2));
}
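// Shifts 1 to the left, printing values up to 2^35. The literal must be a long (1L):
// with an int, C# uses only the low 5 bits of the shift count, so 1 << 32 is 1 again.
for(int i = 0; i < 36; i++){
Debug.Log(System.Convert.ToString((1L << i), 2));
}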
Inheriting from long is not enough on its own, because the shift is still evaluated as a 32-bit int before the result is widened to long. Check out this example I found about the issue:
UInt32 x = ....;
UInt32 y = ....;
UInt64 result = (x << 32) + y;
The programmer intended to form a 64-bit value from two 32-bit ones by shifting 'x' by 32 bits and adding the most significant and the least significant parts. However, as 'x' is a 32-bit value at the moment when the shift operation is performed, shifting by 32 bits will be equivalent to shifting by 0 bits, which will lead to an incorrect result.
So the value being shifted must itself be a long, either through a literal suffix or a cast. Like this:
public enum RuntimePlatformFlags : long {
OSXEditor = (1 << 0),
OSXPlayer = (1 << 1),
WindowsPlayer = (1 << 2),
OSXWebPlayer = (1 << 3),
// With literals.
tvOS = (1L << 32),
Switch = (1L << 33),
// Or with casts.
Lumin = ((long)1 << 34),
BJM = ((long)1 << 35),
}

Related

Unaligned access performance on Intel x86 vs AMD x86 CPUs

I have implemented a simple linear probing hash map with an array of structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers, but the validity flag takes up only a single byte. Hence, I have also tried packing the struct at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25 % and 50 %, for different successful query rates on the x axis (0,25,50,75,100): As you can see, on Intel, the grey and blue (circle and square) lines almost overlap, the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we have more throughput.
In order to investigate this, I tried to build a smaller microbenchmark. In this microbenchmark, we perform a similar benchmark, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }
template <typename T>
void doNotOptimize(T const& val) {
asm volatile("" : : "r,m"(val) : "memory");
}
struct PaddedStruct {
uint64_t key;
uint64_t value;
bool is_valid;
PaddedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
};
struct PackedStruct {
uint64_t key;
uint64_t value;
uint8_t is_valid;
PackedStruct() { reset(); }
void reset() {
key = uint64_t{};
value = uint64_t{};
is_valid = 0;
}
} __attribute__((__packed__));
int main() {
const uint64_t size = 134217728;
uint16_t repetitions = 0;
uint16_t advancement = 0;
std::cin >> repetitions;
std::cout << "Got " << repetitions << std::endl;
std::cin >> advancement;
std::cout << "Got " << advancement << std::endl;
std::cout << "Initializing." << std::endl;
std::vector<PaddedStruct> padded(size);
std::vector<PackedStruct> unaligned(size);
std::vector<uint64_t> queries(size);
// Initialize the structs with random values + prefault
std::random_device rd;
std::mt19937 gen{rd()};
std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};
for (uint64_t i = 0; i < padded.size(); ++i) {
padded[i].key = dist(gen);
padded[i].value = dist(gen);
padded[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
unaligned[i].key = padded[i].key;
unaligned[i].value = padded[i].value;
unaligned[i].is_valid = 1;
}
for (uint64_t i = 0; i < unaligned.size(); ++i) {
queries[i] = dist2(gen);
}
std::cout << "Running benchmark." << std::endl;
ClobberMemory();
auto start_padded = std::chrono::high_resolution_clock::now();
PaddedStruct* padded_ptr = nullptr;
uint64_t sum = 0;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
padded_ptr = &padded[query + i];
if (padded_ptr->is_valid) [[likely]] {
sum += padded_ptr->value;
}
}
doNotOptimize(sum);
}
}
ClobberMemory();
auto end_padded = std::chrono::high_resolution_clock::now();
uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl; // print sum to avoid that it gets optimized out
ClobberMemory();
auto start_unaligned = std::chrono::high_resolution_clock::now();
uint64_t sum2 = 0;
PackedStruct* packed_ptr = nullptr;
for (uint16_t j = 0; j < repetitions; j++) {
for (const uint64_t& query : queries) {
for (uint16_t i = 0; i < advancement; i++) {
packed_ptr = &unaligned[query + i];
if (packed_ptr->is_valid) [[likely]] {
sum2 += packed_ptr->value;
}
}
doNotOptimize(sum2);
}
}
ClobberMemory();
auto end_unaligned = std::chrono::high_resolution_clock::now();
uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you have to enter 3 (and press newline) and then enter 5 and press enter/newline. I updated the source code to (a) avoid loop unrolling by the compiler because repetition/advancement were hardcoded and (b) switch to pointers into that vector as it more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark, Intel still benefits a little from the unaligned access, for the AMD CPU, both the absolute and relative improvements are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' post, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might somehow have to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/

How to emulate *really simple* variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These only need yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>
typedef union
{
int32_t i;
float f;
} uSpec;
void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
uSpec u;
for (int i=0; i<num_loops; i++)
{
int32_t x = *(int32_t *)&in_array[i];
u.i = (127 + x) << 23;
int32_t r = (int32_t) u.f;
out_array[i] = r;
}
}

What does this line of code do? Const uint32_t goodguys = 0x1 << 0

Can someone tell me what is being done here:
Const uint32_t goodguys = 0x1 << 0
I'm assuming it is c++ and it is assigning a tag to a group but I have never seen this done. I am a self taught objective c guy and this just looks very foreign to me.
Well, if there are more lines that look like this that follow the one that you posted, then they could be bitmasks.
For example, if you have the following:
const uint32_t bit_0 = 0x1 << 0;
const uint32_t bit_1 = 0x1 << 1;
const uint32_t bit_2 = 0x1 << 2;
...
then you could use the bitwise & operator with bit_0, bit_1, bit_2, ... and another number in order to see which bits in that other number are turned on.
const uint32_t num = 5;
...
bool bit_0_on = (num & bit_0) != 0;
bool bit_1_on = (num & bit_1) != 0;
bool bit_2_on = (num & bit_2) != 0;
...
So your 0x1 is simply a way to designate that goodguys is a bitmask, because the hexadecimal 0x designator shows that the author of the code is thinking specifically about bits, instead of decimal digits. And then the << 0 is used to change exactly what the bitmask is masking (you just change the 0 to a 1, 2, etc.).
Although base 10 is a normal way to write numbers in a program, sometimes you want to express the number in octal base or hex base. To write numbers in octal, precede the value with a 0. Thus, 023, really means 19 in base 10. To write numbers in hex, precede the value with a 0x or 0X. Thus, 0x23, really means 35 in base 10.
So
goodguys = 0x1;
really means the same as
goodguys = 1;
The bitwise shift operators shift their first operand left (<<) or right (>>) by the number of positions the second operand specifies. Look at the following two statements
goodguys = 0x1;
goodguys <<= 2;
The first statement is the same as goodguys = 1;
The second statement says that we should shift the bits to the left by 2 positions. So we end up with
goodguys = 0x4
which is 100 in binary, the same as goodguys = 4;
Now you can express the two statements
goodguys = 0x1;
goodguys <<= 2;
as a single statement
goodguys = 0x1 << 2;
which is similar to what you have. But if you are unfamiliar with hex notation and bitwise shift operators it will look intimidating.
When const is used with a variable, it uses the following syntax:
const type variable-name = value;
In this case, the const modifier allows you to assign an initial value to a variable that cannot later be changed by the program. For Instance
const int POWER_UPS = 4;
will assign 4 to variable POWER_UPS. But if you later try to overwrite this value like
POWER_UPS = 8;
you will get a compilation error.
Finally the uint32_t means 32-bit unsigned int type. You will use it when you want to make sure that your variable is 32 bits long and nothing else.

packed structure size in C, is this correct?

I found this in some existing code. It looks problematic, but the code works fine. Can you tell me whether there is anything tricky going on in this piece of code?
Why are two of the unsigned members ignored when calculating the size of the structure?
tmsg_sz = sizeof(plfm_xml_header_t) + sizeof(oid_t) + sizeof(char*)
+ sizeof(unsigned) + sizeof(snmp_varbind_t)*5 ;
tmsg = (snmp_trap_t*) malloc(tmsg_sz);
if (!tmsg) {
PRINTF("malloc failed \n");
free(trap_msg);
return -1;
}
memset (tmsg, 0, tmsg_sz);
tmsg->hdr.type = PLFM_SNMPTRAP_MSG;
copy_oid_oidt(clog_msg_gen_notif_oid, OID_LENGTH(clog_msg_gen_notif_oid), &tmsg->oid);
tmsg->trap_type = SNMP_TRAP_ENTERPRISESPECIFIC;
tmsg->trap_specific = 1;
tmsg->trapmsg = strdup("Trap Message");
tmsg->numofvar = 5;
build_snmp_varbind(&(tmsg->vars[0]), facility, STR_DATA_TYPE, sizeof(facility)+1, clog_hist_facility_oid, 14);
build_snmp_varbind(&(tmsg->vars[1]), &sev, U32_DATA_TYPE, sizeof(sev),clog_hist_severity_oid, 14);
build_snmp_varbind(&(tmsg->vars[2]), name, STR_DATA_TYPE, sizeof(name)+1, clog_hist_msgname_oid, 14);
build_snmp_varbind(&(tmsg->vars[3]), trap_msg, STR_DATA_TYPE, strlen(trap_msg)+1,clog_hist_msgtext_oid, 14);
// get system uptime
long uptime = get_uptime();
build_snmp_varbind(&(tmsg->vars[4]), (long*)&uptime, TMR_DATA_TYPE, sizeof(uptime),clog_hist_timestamp_oid, 14);
typedef struct snmp_trap_s {
plfm_xml_header_t hdr;
oid_t oid; /* trap oid */
unsigned trap_type;
unsigned trap_specific;
char *trapmsg; /* text message for this trap */
unsigned numofvar;
snmp_varbind_t vars[0];
} __attribute__((__packed__)) snmp_trap_t;
Compilers try hard to put multibyte data aligned in various ways. For example, an int variable, in an architecture where sizeof int == 4, may need to be placed in a location divisible by 4. This may be a hard requirement, or this may just make the system more efficient; it depends on the computer. So, consider
typedef struct combo {
char c;
int i;
} combo;
Depending on the architecture, sizeof combo may be 5, 6, or most often 8. Swap the two members, and you might expect the size to be 5, since nothing needs to be aligned after c.
typedef struct combo2 {
int i;
char c;
} combo2;
However, sizeof combo2 is usually still 8 because of trailing padding, which becomes visible in an array of combo2s:
combo2 cb[2];
The size of cb could very well be 16, as 3 bytes of wasted space follow the c member in both cb[0] and cb[1]. This lets cb[1].i start at a location divisible by 4.
A recommendation is to order the members of a structure by size; the 8-byte members should precede the 4-byte members, then the 2-byte members, then the 1-byte members. Of course, you have to be aware of typical sizes, and you can't be working on an oddball architecture where characters are not packed into larger words. Cray? cough-cough.

Three boolean values saved in one tinyint

probably a simple question but I seem to be suffering from programmer's block. :)
I have three boolean values: A, B, and C. I would like to save the state combination as an unsigned tinyint (max 255) into a database and be able to derive the states from the saved integer.
Even though there are only a limited number of combinations, I would like to avoid hard-coding each state combination to a specific value (something like if A=true and B=true has the value 1).
I tried to assign values to the variables so (A=1, B=2, C=3) and then adding, but I can't differentiate between A and B being true from i.e. only C being true.
I am stumped but pretty sure that it is possible.
Thanks
Binary maths I think. Choose a location that's a power of 2 (1, 2, 4, 8 etc.), then you can use the 'bitwise and' operator & to determine the value.
Say A = 1, B = 2 , C= 4
00000111 => A B and C => 7
00000101 => A and C => 5
00000100 => C => 4
then to determine them :
if( val & 4 ) // same as if (C)
if( val & 2 ) // same as if (B)
if( val & 1 ) // same as if (A)
if((val & 4) && (val & 2) ) // same as if (C and B)
No need for a state table.
Edit: to reflect comment
If the tinyint has a maximum value of 255 => you have 8 bits to play with and can store 8 boolean values in there
Binary math, as others have said.
encoding:
myTinyInt = A*1 + B*2 + C*4 (assuming you convert A,B,C to 0 or 1 beforehand)
decoding:
bool A = (myTinyInt & 1) != 0 (& is the bitwise and operator in many languages; the parentheses matter, because != binds tighter than & in C-like languages)
bool B = (myTinyInt & 2) != 0
bool C = (myTinyInt & 4) != 0
I'll add that you should find a way to not use magic numbers. You can build masks into constants using the Left Logical/Bit Shift with a constant bit position that is the position of the flag of interest in the bit field. (Wow... that makes almost no sense.) An example in C++ would be:
enum Flags {
kBitMask_A = (1 << 0),
kBitMask_B = (1 << 1),
kBitMask_C = (1 << 2),
};
uint8_t byte = 0; // byte = 0b00000000
byte |= kBitMask_A; // Set A, byte = 0b00000001
byte |= kBitMask_C; // Set C, byte = 0b00000101
if (byte & kBitMask_A) { // Test A, (0b00000101 & 0b00000001) = T
byte &= ~kBitMask_A; // Clear A, byte = 0b00000100
}
In any case, I would recommend looking for Bitset support in your favorite programming language. Many languages will abstract the logical operations away behind normal arithmetic or "test/set" operations.
Need to use binary...
A = 1,
B = 2,
C = 4,
D = 8,
E = 16,
F = 32,
G = 64,
H = 128
This means A + B = 3 but C = 4. You'll never have two conflicting values. I've listed the maximum you can have for a single byte, 8 values or (bits).