Number of independent AES-256-CBC decryption operations per second with AES-NI or GPU acceleration

AES-NI seems to be optimized to encrypt/decrypt big chunks of data. However, I'm trying to decrypt a password, and I have many very small pieces of data to try (IV + first CBC block, 32 bytes in total).
I'm using OpenSSL at the moment, calling EVP_DecryptInit_ex and EVP_DecryptUpdate for every attempt (and EVP_CIPHER_CTX_init once per thread).
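For reference, the per-attempt inner loop might look roughly like this (a minimal sketch; the question's EVP_CIPHER_CTX_init corresponds to EVP_CIPHER_CTX_new in OpenSSL 1.1+, and the buffer and key names here are illustrative):

    #include <openssl/evp.h>

    /* Minimal sketch of one decryption attempt: reuse a per-thread ctx and
     * decrypt a single 16-byte CBC block. All names are illustrative. */
    static int try_candidate(EVP_CIPHER_CTX *ctx,
                             const unsigned char key[32],
                             const unsigned char iv[16],
                             const unsigned char ct[16],
                             unsigned char pt[16])
    {
        int len = 0;
        if (EVP_DecryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv) != 1)
            return 0;
        EVP_CIPHER_CTX_set_padding(ctx, 0);  /* one raw block, no padding */
        return EVP_DecryptUpdate(ctx, pt, &len, ct, 16) == 1 && len == 16;
    }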
I can do this around 2 million times per second on a single core.
I assume this is the sort of performance I can expect using AES-NI instructions and I shouldn't worry about optimising this further. Is this correct?
Does anyone have any idea how much faster this might be on a high end GPU or not-too-expensive FPGA?

FPGA: On any reasonable FPGA you can turn an input block into an output block at a throughput of one block every 2 cycles at several hundred MHz, with a latency of 16 cycles. So, possibly 256 Mblocks/s pipelined, or maybe 32 Mblocks/s not pipelined. You could fit maybe 5 of these cores on a reasonably cheap FPGA, or 30+ on an expensive one. YMMV.
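To show the arithmetic behind those figures, at an assumed 512 MHz clock: a fully pipelined core accepting one block every 2 cycles gives 512 MHz / 2 = 256 Mblocks/s, while waiting out the full 16-cycle latency per block gives 512 MHz / 16 = 32 Mblocks/s. Even the unpipelined figure is an order of magnitude above the 2 Mblocks/s quoted for a single CPU core.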

Related

How much faster is sending 16 bit vs 32 bit over BLE?

I am working on a project where I send information over BLE from a phone to a Raspberry Pi Zero. I can fit all the information I need into 16-bit messages; however, down the line I may need more bits, though I probably won't. Would I be better off sending only 16-bit packets rather than 32-bit ones? Is it that much faster for an RPi Zero to receive and parse 16 bits over BLE? I am only entertaining the idea of 32 bits because if I do need more information in the future, updating the code would be much easier.
The packets contain position data from the phone and will be sent every 0.1 seconds. I am using Bleno on the Pi to receive the data.
Dude, those two extra bytes won't kill your energy budget. It's wise to keep reserved space for future use. It enables backwards compatibility and ease of future development.
There's not really any difference in the length of the on-air packet transmission, given the large per-packet overhead of BLE, and you won't experience any difference anyway because of how connection intervals work. We're talking 16 bits / (10^6 bits/s) = 16 µs in 1 Mbps mode, and 8 µs in 2 Mbps mode.

Checksumming: CRC or hash?

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32, or a hash truncated to N bytes? That is, which has the smaller probability of missing an error? Specifically:
CRC32 vs. 4-byte hash
CRC32 vs. 8-byte hash
CRC64 vs. 8-byte hash
Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.
As far as I understand, CRC32 detects any single error burst of up to 32 bits with 100% reliability, but beyond that its reliability approaches 1 − 2^(−32), and for some patterns it is much worse. A perfect 4-byte hash's reliability is always 1 − 2^(−32), so go figure.
An 8-byte hash should have much better overall reliability (a 2^(−64) chance of missing an error), so should it be preferred over CRC32? What about CRC64?
I guess the answer depends on the type of errors to be expected in this sort of operation. Are we likely to see sparse 1-bit flips, or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, shouldn't accidental bit flips be taken care of already?
Only you can say whether 1 − 2^(−32) is good enough for your application. The error-detection performance of a CRC-n and of n bits from a good hash function will be very nearly the same, so pick whichever one is faster. That is likely to be the CRC-n.
Update:
The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.
In order to get the CRC-32C calculation that fast, I had to get fancy, using three crc32 instructions on three separate buffers to exploit the three arithmetic logic units of a single core in parallel, and then writing the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as CityHash64 or CityHash128.
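The interleaved loop was roughly of this shape (a sketch using the SSE4.2 intrinsic rather than the original assembler; folding the three partial CRCs into a single CRC of the whole buffer additionally requires a polynomial-shift step, omitted here):

    #include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u64 */
    #include <stddef.h>
    #include <stdint.h>

    /* Run three independent CRC-32C chains over three slices of the buffer
     * so the three crc32 instructions can issue in parallel each cycle. */
    void crc32c_3way(const uint64_t *buf, size_t words_per_lane,
                     uint32_t crc[3])
    {
        uint64_t a = crc[0], b = crc[1], c = crc[2];
        const uint64_t *p0 = buf;
        const uint64_t *p1 = buf + words_per_lane;
        const uint64_t *p2 = buf + 2 * words_per_lane;
        for (size_t i = 0; i < words_per_lane; i++) {
            a = _mm_crc32_u64(a, p0[i]);   /* three dependency chains */
            b = _mm_crc32_u64(b, p1[i]);
            c = _mm_crc32_u64(c, p2[i]);
        }
        crc[0] = (uint32_t)a;
        crc[1] = (uint32_t)b;
        crc[2] = (uint32_t)c;
    }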
Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary convention would need to be adopted, in order to give the CityHash value a consistent meaning on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. to be fed a chunk at a time while producing the same result as if the entire data set were fed to the function at once. The CityHash functions would need to be modified to update an intermediate state.
The alternative, and what I did for the quick-and-dirty testing, is to use the seeded versions of the functions, where I use the CityHash of the previous buffer as the seed for the next buffer. The problem with that is that the result then depends on the buffer size: feed CityHash different-sized buffers with this approach and you get different hash values.
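The seed-chaining workaround looks like this (a sketch; CityHash64WithSeed is the real CityHash entry point, but the chunk size and chaining convention here are arbitrary choices, which is exactly the problem just described):

    #include <city.h>     /* Google CityHash */
    #include <stddef.h>

    /* Hash each chunk with the previous chunk's hash as the seed. The
     * result depends on the chunk size chosen, per the caveat above. */
    uint64 chained_cityhash(const char *data, size_t len, size_t chunk)
    {
        uint64 h = 0;                       /* arbitrary initial seed */
        for (size_t off = 0; off < len; off += chunk) {
            size_t n = (len - off < chunk) ? (len - off) : chunk;
            h = CityHash64WithSeed(data + off, n, h);
        }
        return h;
    }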
Another Update four years later:
Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.
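One practical advantage worth noting: xxHash ships a streaming API, which avoids the buffering caveat described for CityHash above. A sketch (the chunk loop is illustrative):

    #include <xxhash.h>
    #include <stddef.h>

    /* Streaming xxHash: feeding chunks of any size yields the same value
     * as hashing the whole buffer at once. */
    unsigned long long xxh64_stream(const char *data, size_t len, size_t chunk)
    {
        XXH64_state_t *st = XXH64_createState();
        XXH64_reset(st, 0);                      /* seed 0 */
        for (size_t off = 0; off < len; off += chunk) {
            size_t n = (len - off < chunk) ? (len - off) : chunk;
            XXH64_update(st, data + off, n);
        }
        unsigned long long h = XXH64_digest(st);
        XXH64_freeState(st);
        return h;
    }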
Putting "performance" issues aside, you might want to consider using one of the SHA-2 functions (say, SHA-256).

CISC instruction length

I was wondering, what is the maximum possible length of a CISC instruction on most of today's CISC architectures?
I haven't found a definitive answer yet, but it has been suggested that it is 16 bytes long, in theory.
In the video # (around the 15:00 mark), why does the speaker suggest "in theory", and why exactly 16 bytes?
In practice as well. For x86-64, AMD has limited the allowed instruction length to 15 bytes; beyond that, the instruction decoder gives up and signals an error.
Otherwise, with multiple instruction prefixes and override bytes, there is no telling exactly how long an instruction could get: no limit at all, if we allow redundant repetitions of some prefixes.
Agner Fog describes the problem:
Executing three, four or five instructions simultaneously is not unusual. The limit is not the execution units, which we have plenty of, but the instruction decoder. The length of an instruction can be anywhere from one to fifteen bytes. If we want to decode several instructions simultaneously, then we have a serious problem. We have to know the length of the first instruction before we know where the second instruction begins. So we can't decode the second instruction before we have decoded the first instruction. The decoding is a serial process by nature, and it takes a lot of hardware to be able to decode multiple instructions per clock cycle. In other words, the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are.
See the rest of his blog post here.
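A rough byte budget shows why the cap has to be enforced explicitly rather than falling out of the encoding: one prefix from each of the four legacy prefix groups (4 bytes) + REX prefix (1) + opcode (up to 3) + ModRM (1) + SIB (1) + displacement (up to 4) + immediate (up to 4) already adds up to 18 bytes, so some combinations the format could in principle express are simply rejected by the 15-byte rule.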
CISC is a design philosophy, not an architecture, so there is no such thing as "CISC instruction length", only the instruction length of a specific CISC architecture (like x86 or Motorola 68k).
Talking specifically about x86, the limit is 15 bytes. Theoretically the instruction length could be unbounded, because prefixes can be repeated. That makes life hard for the decoder, however, so starting with the 80286 Intel limited instructions to 10 bytes, and then to 15 bytes in later ISA versions. For more information, read:
x86_64 ASM - maximum bytes for an instruction?
What is the maximum length an Intel 386 instruction without any prefixes?
Also note that RISC doesn't mean fixed-length instructions. Modern MIPS, ARM, and RISC-V all have variable-length instruction modes or compressed encodings (e.g. microMIPS, Thumb-2, and the RISC-V "C" extension) to increase code density.

How to compute a reasonable number of bits for a checksum?

I have around 1500 bytes of data that I want to construct a checksum for, such that if the data gets corrupted, the chance of the checksum still matching the data is less than, say, 1 in 10^15, i.e. a probability low enough that I can treat it as never going to happen.
The question is how many bits I should compute. I have a SHA-1 computation that gives me a 160-bit hash of my data, but I expect this is far larger than necessary. So I'm thinking I could truncate the resulting hash down to, say, the low 40 bits and use that as a bit pattern large enough that if the data gets corrupted, I will most likely detect it.
So the question is twofold: how many bits are good enough, and is taking the low bits of a SHA-1 hash a good approach to take?
You can use the table here to determine approximately how many bits you need for your desired error detection rate.
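As a back-of-the-envelope check: assuming corruption leaves an effectively random value in the checksum, an n-bit checksum misses an error with probability about 2^(−n). You need 2^(−n) < 10^(−15), i.e. n > 15 × log2(10) ≈ 49.8, so about 50 bits. The proposed 40-bit truncation only gets you to roughly 9 × 10^(−13); truncating to 56 bits (7 bytes) would comfortably clear the target.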

What hash algorithms are parallelizable? Optimizing the hashing of large files on multi-core CPUs

I'm interested in optimizing the hashing of some large files (optimizing wall-clock time). The I/O has been optimized well enough already: the I/O device (a local SSD) is only tapped at about 25% of capacity, while one of the CPU cores is completely maxed out.
I have more cores available, and in the future will likely have even more. So far I've only been able to tap the extra cores when I happen to need multiple hashes of the same file, say an MD5 AND a SHA256 at the same time: I can use the same I/O stream to feed two or more hash algorithms, and the faster algorithms come for free, in wall-clock terms (a sketch of this setup follows the questions below). As I understand it, with most hash algorithms each new bit changes the entire result, which makes them inherently hard or impossible to parallelize.
Are any of the mainstream hash algorithms parallelizable?
Are there any non-mainstream hashes that are parallelizable (and that have at least a sample implementation available)?
As future CPUs trend toward more cores and a leveling-off in clock speed, is there any way to improve the performance of file hashing (other than liquid-nitrogen-cooled overclocking), or is it inherently non-parallelizable?
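For context, the "one read, several digests" setup mentioned above might look like this (a sketch using OpenSSL's EVP digest API; the file name and buffer size are arbitrary):

    #include <openssl/evp.h>
    #include <stdio.h>

    /* Sketch: feed one I/O stream to two digests at once, so the cheaper
     * hash comes "for free" in wall-clock terms. */
    int main(void)
    {
        EVP_MD_CTX *md5 = EVP_MD_CTX_new();
        EVP_MD_CTX *sha = EVP_MD_CTX_new();
        EVP_DigestInit_ex(md5, EVP_md5(), NULL);
        EVP_DigestInit_ex(sha, EVP_sha256(), NULL);

        FILE *f = fopen("bigfile.bin", "rb");   /* illustrative path */
        if (!f) return 1;
        unsigned char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
            EVP_DigestUpdate(md5, buf, n);      /* same buffer, both hashes */
            EVP_DigestUpdate(sha, buf, n);
        }
        fclose(f);

        unsigned char d1[EVP_MAX_MD_SIZE], d2[EVP_MAX_MD_SIZE];
        unsigned int l1, l2;
        EVP_DigestFinal_ex(md5, d1, &l1);
        EVP_DigestFinal_ex(sha, d2, &l2);
        EVP_MD_CTX_free(md5);
        EVP_MD_CTX_free(sha);
        return 0;
    }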
There is actually a lot of research going on in this area. The US National Institute of Standards and Technology is currently holding a competition to design the next generation of government-grade hash function. Most of the proposals are parallelizable.
One example: http://www.schneier.com/skein1.2.pdf
Wikipedia's description of current status of the contest: http://en.wikipedia.org/wiki/SHA-3
What kind of SSD do you have? My C implementation of MD5 runs at 400 MB/s on a single Intel Core2 core (2.4 GHz, not the latest Intel). Do you really have an SSD that supports a bandwidth of 1.6 GB/s? I want the same!
Tree hashing can be applied on top of any hash function. There are a few subtleties, and the Skein specification tries to deal with them by integrating some metadata into the function itself (this does not change performance much), but the "tree mode" of Skein is not "the" Skein as submitted to SHA-3. Even if Skein is selected as SHA-3, the output of a tree-mode hash would not be the same as the output of "plain Skein".
Hopefully, a standard will be defined at some point, to describe generic tree hashing. Right now there is none. However, some protocols have been defined with support for a custom tree hashing with the Tiger hash function, under the name "TTH" (Tiger Tree Hash) or "THEX" (Tree Hash Exchange Format). The specification for TTH appears to be a bit elusive; I find some references to drafts which have either moved or disappeared for good.
Still, I am a bit dubious about the concept. It is kind of neat, but it provides a performance boost only if you can read data faster than a single core can process it, and, given the right function and the right implementation, a single core can hash quite a lot of data per second. A tree hash spread over several cores requires sending the data to the proper cores, and 1.6 GB/s is not the smallest bandwidth ever.
SHA-256 and SHA-512 are not very fast. Among the SHA-3 candidates, assuming an x86 processor in 64-bit mode, some of them achieve high speed (more than 300 MB/s on my 2.4 GHz Intel Core2 Q6600, with a single core -- that's what I can get out of SHA-1, too), e.g. BMW, SHABAL or Skein. Cryptographically speaking, these designs are a bit too new, but MD5 and SHA-1 are already cryptographically "broken" (quite effectively in the case of MD5, rather theoretically for SHA-1) so any of the round-2 SHA-3 candidates should be fine.
When I put on my "seer" cap, I foresee that processors will keep becoming faster than RAM, to the point that hashing cost will be dwarfed by memory bandwidth: the CPU will have clock cycles to spare while it waits for data from main RAM. At some point, the whole threading model (one big RAM for many cores) will have to be amended.
You didn't say what you need your hash for.
If you're not gonna exchange it with the outside world but just use it internally, simply divide each file into chunks, then compute and store all the checksums. You can then use many cores just by throwing a chunk at each one.
Two solutions that come to mind are dividing files into fixed-size chunks (simpler, but smaller files will use fewer cores, and those are where you presumably don't need all that power) or into a fixed number of chunks (which uses all the cores for every file). It really depends on what you want to achieve and what your file-size distribution looks like; a sketch of the chunked approach is at the end of this answer.
If, on the other hand, you need hashes for the outside world, then as you can read in the other replies it's not possible with "standard" hashes (e.g. if you want to send out SHA-1 hashes for others to check with different tools), so you must look elsewhere. For example, compute the hash when you store the file for later retrieval, or compute hashes in the background with the "free" cores and store them for later retrieval.
The best solution depends on what your constraints are and where you can invest space, time, or CPU power.
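A minimal sketch of the internal-use chunking idea (a fixed number of chunks, one per core; function and parameter names are illustrative):

    #include <openssl/sha.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    /* Hash each chunk on its own core, then hash the concatenated chunk
     * digests. The result is a hash of hashes, NOT a standard SHA-256 of
     * the file, so it is only usable internally, as discussed above. */
    void hash_parallel(const unsigned char *data, size_t len, int ncores,
                       unsigned char out[SHA256_DIGEST_LENGTH])
    {
        size_t chunk = (len + ncores - 1) / ncores;
        std::vector<unsigned char> digests(ncores * SHA256_DIGEST_LENGTH);
        std::vector<std::thread> workers;
        for (int i = 0; i < ncores; i++) {
            size_t off = (size_t)i * chunk;
            size_t n = off < len ? (len - off < chunk ? len - off : chunk) : 0;
            const unsigned char *src = n ? data + off : data;  /* empty lane */
            workers.emplace_back([&digests, i, src, n] {
                SHA256(src, n, &digests[i * SHA256_DIGEST_LENGTH]);
            });
        }
        for (auto &t : workers) t.join();
        SHA256(digests.data(), digests.size(), out);  /* hash of chunk hashes */
    }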