Understanding a false sharing example

I am studying false sharing and I have an example on how to avoid it by padding arrays:
Just making sure I understand correctly: this would only work if we're using 8-byte integers, right? If we're using 4-byte integers, the whole array would be 64 bytes and it wouldn't avoid false sharing.


How to hash two 32 bit integers into one 32 bit integer without collision?

I am looking for a one-way hash to combine two 32-bit integers into one 32-bit integer. Not sure if it's feasible to do without collisions.
Edit:
I think my integers are generally small. One of them rarely takes more than 14 bits, and the other one rarely takes more than 20 bits.
Edit 2: Thanks for the help in the comments. For cases where the combination of the two integers takes more than 32 bits, I can do something different and skip hashing. With that constraint, how should I hash my integers?
Thanks!

Word Addressable Memory

For a 32-bit word-addressable memory, each word is 4 bytes.
If I store a data structure that uses less than 4 bytes of memory, say 2 bytes, are the remaining 2 bytes wasted?
Should we consider the word size when we decide what data structure to use?
There is a similar question here, but it's not exactly what I am asking.
Please help.
On a modern CPU, memory is usually retrieved in chunks called cache lines (64 bytes on x86), but the CPU instruction set can address individual bytes.
If you had some esoteric machine with an instruction set that couldn't address individual bytes, then your compiler would hide that from you.
Whether or not memory is wasted in data structures smaller than a word would depend on the language you use and its implementation, but generally, records are aligned according to the field with the coarsest requirement. If you have an array of 16 bit integers, they will pack together tightly.
If you have 3 or 4 integers, it scarcely matters whether you store them in 2, 4, or 8 bytes.
If you have 3 or 4 billion integers, then it's probably worth considering a more space-efficient structure.
Generally speaking, the natural integer size for a given language implementation is supposed to be optimal in some way, so my advice is: use int unless you know it's not appropriate, and let the compiler worry about it - until you have performance data showing otherwise.

Compressing Sets of Integers Into Smaller Integers

Along the lines of How to encode integers into other integers, I am wondering whether it is possible to encode one integer, or a set of integers, into one smaller integer or a smaller set of integers, and if so, how it is done. For example, encoding an 8-bit integer into a 4-bit integer, or a 256-bit integer into a 16-bit integer. It doesn't seem possible, but perhaps there is something along these lines. Basically, how do you get a set of integers to take up less space? Not necessarily by encoding into another sequence of bytes; maybe even into a data structure that is more compact.
Sure, you can always encode them into fewer bits. However, you won't be able to decode them back to the original bits. Though you neglected to mention that step, I'm guessing it's what you're looking for.

Loading in 256bit vector register in AVX2 Haswell processor

I want to load a 256-bit YMM register with 32 values, each 1 byte long. All the intrinsics I looked into load either doublewords, i.e., 4-byte integers, or quadwords, i.e., 8-byte values.
How do I load data of a smaller size than these?
Is there a mnemonic which does this but doesn't have an equivalent intrinsic?
I don't think there is a way to gather bytes only. But it sounds to me like you need to rethink your problem. Is this pixel data, for example RGBA values? If so, maybe you can change your application so it reads/writes, for example, RRRRGGGGBBBB (SSE). Then you don't have to gather bytes: you can read in 128/256 bits at once, and that is the most efficient use of SIMD.
Note that you may gain efficiency by using short int operations. I mean, extend to 16 bits and use 16-bit integer SSE/AVX instructions.
Here is an example of bilinear interpolation with SSE which reads in integers of four bytes (RGBA) and extends them to 16 bits. This is faster than extending them to 32 bits. The SSE3 example converts RGBARGBARGBARGBA to RRRRGGGGBBBB.
http://fastcpp.blogspot.no/2011/06/bilinear-pixel-interpolation-using-sse.html
This is a rather old question but I think what you might want is the AVX intrinsic __m256i _mm256_set_epi8 which takes 32 chars as input parameters.
There is no instruction which broadcasts a single byte, but you can use the _mm256_set1_epi8 intrinsic to achieve this effect.
You can simply use the _mm256_load_si256 intrinsic with a cast. This intrinsic corresponds to the VMOVDQA instruction.
Here is the code to read bytes from memory, and to store them in memory.
#include <immintrin.h>

char raw[32] __attribute__ ((aligned (32)));
__m256i foo = _mm256_load_si256( (__m256i*) raw ); // read 32 aligned bytes from memory into an AVX register
_mm256_store_si256( (__m256i*) raw, foo );         // store the register contents back to memory
You can also load unaligned bytes, using _mm256_loadu_si256 if you prefer.
Where do you expect the 32 pointers to come from? Unless you're wanting to do 32 parallel lookups into a 256-byte lookup table, you don't really have space in the source operand to write the addresses required for the load.
I think you have to do four 8x32-bit gather operations and then merge the results; the gather operation supports unaligned accesses so you could load from an address adjusted to get the target byte in the right place within the YMM register, then just AND with masks and OR to merge.

Variable-byte encoding clarification

I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way.
I am trying to understand variable-byte encoding. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Variable-width_encoding) as well as a book chapter from an Information Retrieval textbook. I think I understand how to encode a decimal integer. For example, if I wanted to provide variable-byte encoding for the integer 60, I would have the following result:
1 0 1 1 1 1 0 0
(please let me know if the above is incorrect). If I understand the scheme, then I'm not completely sure how the information is compressed. Is it because usually we would use 32 bits to represent an integer, so that representing 60 would result in 1 1 1 1 0 0 preceded by 26 zeros, thus wasting that space as opposed to representing it with just 8 bits instead?
Thank you in advance for the clarifications.
The way you do it is by reserving one of the bits to mean "I'm not done with the value." Usually, that's the most significant bit.
When you read a byte, you process the lower 7 bits. If the most significant bit is 1, then you know there's one more byte to read, and you repeat the process, adding the next 7 bits to the current 7 bits.
The MIDI format uses that exact encoding to represent lengths of MIDI events, in the following manner:
1. ExpectedValue = 0
2. byte = ReadFromFile
3. ExpectedValue = ExpectedValue + (byte AND 0x7F)
4. if byte > 127 then
5.     ExpectedValue = ExpectedValue SHL 7
6.     Goto 2
7. Done
For example, the value 0x80 would be represented using the bytes 0x81 0x00. You can try running the algorithm on those two bytes, and you see you'll get the right value.
UTF-8 works similarly, but it uses a slightly more complex scheme to tell you how many bytes you should be expecting. This allows for some error correction, since you can easily tell if the bytes you're getting match the length claimed. Wikipedia describes their structure quite well.
You hit the nail on the head.
There are many encoding schemes, such as gamma and delta, which are special cases of Elias coding. These are bit-level codes, as opposed to the byte-level code you used, and are useful when you have a strong skew towards small numbers (which can often be achieved by encoding deltas instead of absolute values).
Bit-level encoding schemes are much more difficult to implement than byte-level schemes and the additional CPU burden may outweigh the time saved by having less data to read, though most modern CPUs have "highest-bit" and "lowest-bit" instructions that dramatically improve the performance of bit-level codecs. As CPU speeds continue to outpace RAM speeds, bit-level schemes will become more attractive, though the simplicity of byte-level codecs is a big factor too.
Yes, you are right: you save space by encoding using one byte instead of four.
Generally, you will save memory if the values you are encoding are much smaller than the maximum value that would have fit in your original fixed-width encoding.