Loading a 256-bit vector register on an AVX2 (Haswell) processor - cpu-architecture

I want to load a 256-bit YMM register with 32 values, each 1 byte long. All the intrinsics I looked at load either double words (4-byte integers) or quad words (8-byte values).
How do I load data of a smaller size than these?
Is there a mnemonic which does this but doesn't have an equivalent intrinsic?

I don't think there is a way to gather bytes only. But it sounds to me like you need to rethink your problem. Is this pixel data, for example RGBA values? If so, maybe you can change your application so it reads/writes, for example, RRRRGGGGBBBB (SSE). Then you don't have to gather bytes. You can read in 128/256 bits at once, which is the most efficient use of SIMD.
Note that you may gain efficiency by using short int operations, i.e., extend to 16 bits and use 16-bit integer SSE/AVX instructions.
Here is an example of bilinear interpolation with SSE which reads in integers of four bytes (RGBA) and extends them to 16 bits. This is faster than extending them to 32 bits. The SSE3 example converts RGBARGBARGBARGBA to RRRRGGGGBBBB.
http://fastcpp.blogspot.no/2011/06/bilinear-pixel-interpolation-using-sse.html
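As a minimal sketch of the widening step (assuming AVX2 is available; widen_bytes is just an illustrative name):
#include <immintrin.h>
#include <stdint.h>

// Widen 16 packed bytes into 16 16-bit lanes so 16-bit integer
// arithmetic can be used on them.
__m256i widen_bytes(const uint8_t *p)
{
    __m128i bytes = _mm_loadu_si128((const __m128i*) p); // 16 x u8
    return _mm256_cvtepu8_epi16(bytes);                  // 16 x u16 (vpmovzxbw)
}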

This is a rather old question, but I think what you might want is the AVX intrinsic __m256i _mm256_set_epi8, which takes 32 chars as input parameters.
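For example, a small sketch (note that the first argument corresponds to the highest byte, index 31):
#include <immintrin.h>

// Build a vector holding the bytes 0..31.
__m256i counting_bytes(void)
{
    return _mm256_set_epi8(
        31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
        15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0);
}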

You can use the _mm256_set1_epi8 intrinsic to broadcast a single byte to all 32 lanes; with AVX2 this compiles to the VPBROADCASTB instruction.
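A minimal sketch (broadcast_byte is an illustrative name):
#include <immintrin.h>

__m256i broadcast_byte(char b)
{
    return _mm256_set1_epi8(b); // vpbroadcastb with AVX2
}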

You can simply use the _mm256_load_si256 intrinsic with a cast. This intrinsic corresponds to the VMOVDQA instruction.
Here is the code to read bytes from memory, and to store them back in memory.
#include <immintrin.h>

char raw[32] __attribute__ ((aligned (32)));
__m256i foo = _mm256_load_si256( (__m256i*) raw ); // read 32 aligned bytes from memory into an AVX register
_mm256_store_si256( (__m256i*) raw, foo );         // store the contents of the AVX register back into memory
You can also load unaligned bytes using _mm256_loadu_si256 if you prefer.

Where do you expect the 32 pointers to come from? Unless you want to do 32 parallel lookups into a 256-byte lookup table, you don't really have space in the source operand to write the addresses required for the load.
I think you have to do four 8x32-bit gather operations and then merge the results; the gather operation supports unaligned accesses, so you could load from an address adjusted to get the target byte into the right place within the YMM register, then just AND with masks and OR to merge.
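A rough sketch of one variant of that idea, shifting the gathered dwords instead of adjusting the addresses (assumes AVX2; gather32_bytes is an illustrative name, each gather reads a full dword so base[idx[i]+3] must be readable, and the bytes land in an interleaved lane order rather than in idx order):
#include <immintrin.h>
#include <stdint.h>

// Gather 32 bytes from 32 arbitrary indices using four vpgatherdd operations.
// Gather j fills byte position j of every 32-bit lane, so lane i of the result
// holds the bytes idx[i], idx[8+i], idx[16+i], idx[24+i].
__m256i gather32_bytes(const uint8_t *base, const int32_t idx[32])
{
    __m256i result = _mm256_setzero_si256();
    const __m256i byte_mask = _mm256_set1_epi32(0xFF);
    for (int j = 0; j < 4; ++j) {
        __m256i offs   = _mm256_loadu_si256((const __m256i*)(idx + 8 * j));
        __m256i dwords = _mm256_i32gather_epi32((const int*) base, offs, 1); // scale = 1 byte
        dwords = _mm256_and_si256(dwords, byte_mask);                        // keep only the low byte
        result = _mm256_or_si256(result,
                 _mm256_sll_epi32(dwords, _mm_cvtsi32_si128(8 * j)));        // move it to byte j
    }
    return result;
}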

Related

Word Addressable Memory

For a 32-bit word-addressable memory, the word size is 4 bytes.
If I try to store a data structure that uses less than 4 bytes of memory, say 2 bytes, are the remaining 2 bytes wasted?
Should we consider the word size when we decide what data structure to use?
There's a similar question here, but it's not exactly what I'm asking. Please help.
On a modern CPU, memory is usually retrieved in chunks called cache lines (64 bytes on x86), but the CPU instruction set can address individual bytes.
If you had some esoteric machine with an instruction set that couldn't address individual bytes, then your compiler would hide that from you.
Whether or not memory is wasted in data structures smaller than a word would depend on the language you use and its implementation, but generally, records are aligned according to the field with the coarsest requirement. If you have an array of 16 bit integers, they will pack together tightly.
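A small illustration in C (padding details depend on the ABI; the sizes in the comments are typical for x86-64):
#include <stdio.h>

// 'tag' is followed by 3 padding bytes so that 'value' lands on a 4-byte boundary.
struct padded { char tag; int value; };  // sizeof == 8 on typical ABIs
struct shorts { short a; short b; };     // sizeof == 4, no padding

int main(void)
{
    printf("%zu %zu\n", sizeof(struct padded), sizeof(struct shorts));
    return 0;
}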
If you have 3 or 4 integers, it scarcely matters whether you store them in 2, 4, or 8 bytes.
If you have 3 or 4 billion integers, then it's probably worth considering a more space-efficient structure.
Generally speaking, the natural integer size for a given language implementation is supposed to be optimal in some way, so my general advice is 'use int unless you know it's not appropriate' and let the compiler worry about it - until you have performance data to show otherwise.

How can I learn the size of the EEPROM on a chip if documentation is unavailable?

If I have an EEPROM integrated circuit but documentation is not available for it, how can I find out how much memory is available to me?
My first thought was to write some distinct bytes to the first several sequential addresses and then loop through the memory reading each byte until I read my distinct bytes and count how many bytes exist between reading the distinct bytes the first time and the second time. But then I realised that my unsigned data type could be too small and wrap from its largest value back to zero before the last address in the EEPROM was actually reached.
Any software or hardware tricks to learn this information about an unidentified EEPROM integrated circuit would be very much appreciated.
My solution to this problem ends up being pretty close to the theory stated in my question: I write a recognisable pattern of bytes starting at byte zero of the EEPROM, then loop through the memory, starting at byte zero, and keep track of how many bytes lie between the first time I read that pattern and the second time (the second occurrence appears because the device's addressing wraps around past the top of memory). To ensure that I don't read byte zero a second time before every other byte has been read once (which could happen if my counting variable is too small to count up to the size of the EEPROM), I repeat the scan with a larger counting data type. If the number of bytes between the first and second reads of the pattern is the same for both counter sizes, I know I have found the correct size of the EEPROM in bytes.
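A sketch of that idea in C, assuming hypothetical eeprom_read/eeprom_write driver functions (substitute your own I2C/SPI routines) and that reads past the top of the device wrap around to address zero; using a 32-bit counter also sidesteps the counter-overflow concern:
#include <stdint.h>
#include <string.h>

// Hypothetical bus primitives - substitute your I2C/SPI driver here.
extern uint8_t eeprom_read(uint32_t addr);
extern void    eeprom_write(uint32_t addr, uint8_t value);

// A marker long enough that a chance match elsewhere in memory is unlikely.
static const uint8_t marker[4] = { 0xDE, 0xAD, 0xBE, 0xEF };

// Returns the apparent size in bytes, or 0 if no wraparound occurs below 'limit'.
uint32_t probe_eeprom_size(uint32_t limit)
{
    for (uint32_t i = 0; i < sizeof marker; ++i)
        eeprom_write(i, marker[i]);

    for (uint32_t addr = sizeof marker; addr < limit; ++addr) {
        uint8_t window[sizeof marker];
        for (uint32_t i = 0; i < sizeof marker; ++i)
            window[i] = eeprom_read(addr + i);
        if (memcmp(window, marker, sizeof marker) == 0)
            return addr; // reads at 'addr' aliased back to address 0
    }
    return 0;
}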

Can a CRC32 engine be used for computing CRC16 hashes?

I'm working with a microcontroller with native HW functions to calculate CRC32 hashes from chunks of memory, where the polynomial can be freely defined. It turns out that the system has different data links with different CRC bit lengths, like 16 and 8 bits, and I intend to use the hardware engine for those as well.
In simple tests with online tools I've concluded that it is possible to find a 32-bit polynomial that gives the same result as an 8-bit CRC, for example:
hashing "a sample string" with the 8-bit engine and poly 0xb7 yields the result 0x97
hashing "a sample string" with the 16-bit engine and poly 0xb700 yields the result 0x9700
...32-bit engine and poly 0xb7000000 yields the result 0x97000000
(with zero initial value, zero final XOR, and no reflections)
So, padding the poly with zeros and right-shifting the results seems to work.
But is it 'always' possible to find a set of parameters that makes a 32-bit engine work as a 16- or 8-bit one (including poly, final XOR, init value and inversions)?
To provide more context and prevent 'bypass answers' like 'don't use the native engine': I have a scenario in a safety-critical system where it's necessary to prevent a common design error from propagating to redundant processing nodes. One solution is to have a software-based CRC calculation on one node and a hardware-based one on its pair.
Yes, what you're doing will work in general for CRCs that are not reflected. The pre- and post-conditioning can be done very simply with code around the hardware instruction loop.
Assuming that the hardware CRC doesn't have an option for this, to do a reflected CRC you would need to reflect each input byte and then reflect the final result. That may defeat the purpose of using a hardware CRC. (Though if your purpose is just to have a different implementation, then maybe it wouldn't.)
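For reference, a sketch of the per-byte bit reversal (reflect8 is an illustrative name; the same swap-halves trick extends to reflecting the final 32-bit result):
#include <stdint.h>

// Reverse the bit order of one byte.
static uint8_t reflect8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4));
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}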
You don't have to guess; you can calculate it. Because a CRC is the remainder of a division by its generator polynomial, it's a 1-to-1 function on inputs of its own width.
So CRC-16, for example, has to produce 65536 (64K) unique results if you run it over all values 0 through 65535.
To see if you get the same outcome by taking part of a CRC-32, run it over 0 through 65535, keep the 2 bytes that you want to keep, and then see if there are any collisions.
If your data is 32 bits wide, this should not be an issue. The issue arises if you have numbers narrower than 32 bits and shuffle them around in a 32-bit space; their first and last bytes are not guaranteed to be uniformly distributed.
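A brute-force check along those lines, as a sketch in plain C (bitwise, non-reflected CRC with zero init and zero final XOR, using the padded 0xb700... polynomial from the question):
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Plain MSB-first (non-reflected) CRC-32, zero init, zero final XOR.
static uint32_t crc32_msb(const uint8_t *data, size_t len, uint32_t poly)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint32_t)data[i] << 24;
        for (int b = 0; b < 8; ++b)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ poly : (crc << 1);
    }
    return crc;
}

int main(void)
{
    static bool seen[65536];
    for (uint32_t v = 0; v <= 65535; ++v) {
        uint8_t msg[2] = { (uint8_t)(v >> 8), (uint8_t)v };
        // Keep the top 16 bits, mirroring the right shift of the result.
        uint16_t out = (uint16_t)(crc32_msb(msg, 2, 0xB7000000u) >> 16);
        if (seen[out]) { puts("collision - not a drop-in CRC-16"); return 1; }
        seen[out] = true;
    }
    puts("no collisions over all 2-byte inputs");
    return 0;
}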

Checksumming: CRC or hash?

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or hash truncated to N bytes? That is, which will have a smaller probability of missing an error? Specifically:
CRC32 vs. 4-byte hash
CRC32 vs. 8-byte hash
CRC64 vs. 8-byte hash
Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.
As far as I understand, CRC32 can detect up to 32 bit flips with 100% reliability, but after that its reliability approaches 1-2^(-32) and for some patterns is much worse. A perfect 4-byte hash's reliability is always 1-2^(-32), so go figure.
8-byte hash should have a much better overall reliability (2^(-64) chance to miss an error), so should it be preferred over CRC32? What about CRC64?
I guess the answer depends on type of errors that might be expected in such sort of operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, should not accidental bit flips be taken care of already?
Only you can say whether 1-2^(-32) is good enough or not for your application. The error detection performance between a CRC-n and n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.
Update:
The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.
In order to get the CRC-32C calculation that fast, I had to get fancy with three crc32 instructions on three separate buffers to make use of the three arithmetic logic units in parallel in a single core, and then write the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as CityHash64 or CityHash128.
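For comparison, a minimal single-stream version using the SSE4.2 intrinsic (a sketch, not the three-way interleaved variant described above; assumes x86-64 and compilation with SSE4.2 enabled, e.g. -msse4.2):
#include <nmmintrin.h> // SSE4.2: _mm_crc32_u64 / _mm_crc32_u8
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Single-stream CRC-32C; pass 0 as the initial crc for a fresh computation.
uint32_t crc32c_hw(const uint8_t *buf, size_t len, uint32_t crc)
{
    uint64_t c = ~crc;
    while (len >= 8) {
        uint64_t word;
        memcpy(&word, buf, 8); // safe unaligned load
        c = _mm_crc32_u64(c, word);
        buf += 8;
        len -= 8;
    }
    while (len--)
        c = _mm_crc32_u8((uint32_t)c, *buf++);
    return ~(uint32_t)c;
}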
Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. feeding the functions a chunk at a time and expecting to get the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to be modified to update an intermediate state.
The alternative, and what I did for the quick and dirty testing, is to use the Seed versions of the functions where I would use the CityHash from the previous buffer as the seed for the next buffer. The problem with that is that the result is then dependent on the buffer size. If you feed CityHash different size buffers with this approach, you get different hash values.
Another Update four years later:
Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.
Putting aside "performance" issues, you might want to consider using one of the SHA-2 functions (say, SHA-256).

Binary integral data compression

I need to transmit integral data types over the network but don't want to transfer all 32 (or 64) bits every time - the data fits into just one byte 99% of the time - so it looks like it needs to be compressed somehow: for example, the first bit of a byte is 0 if the other 7 bits hold just a small value (0-127); otherwise (if the first bit is 1), you need to shift these 7 bits left and read the second byte, repeating the same process.
Is there some common way to do this? I don't want to reinvent the wheel...
Thank you.
The scheme you describe (which is essentially a base-128 encoding: each byte is a 7-bit base-128 "digit" plus a single flag bit indicating whether or not it is the final digit) is a common way of doing this; a small sketch follows the references below.
For example, see:
the section on "LEB128" in the DWARF spec (§7.6);
"Base 128 Varints" in Google's protocol buffers;
"Variable Width Integers" in the LLVM bitcode format (various different widths are used in various different places there).
Just about any data compression algorithm would be able to compress that kind of data stream very well. Use whatever compression libraries your language provides.