Faster way to do _mm256_set1_ps - x86-64

Is there a faster way to do _mm256_set1_ps in assembly than the C intrinsic? It appears that the intrinsic compiles down to a sequence of vmovss, vshufps, vmovss, vshufps and vinsertf128, which even the intrinsics guide itself says is inefficient. I am wondering if there are alternative ways to do this. I realize that if a faster sequence exists, Intel has probably already implemented it, but it doesn't hurt to ask.

While this has been partially addressed for some time, I found it as part of dealing with some similar issues and thought a formal answer might be of interest. I'm aware of two main cases.
The constant for _mm256_set1_ps() is in memory at a known address. As @Peter Cordes mentioned above in the comments, AVX vbroadcastss applies in this case.
The constant is already in the low element of a register. AVX2 vbroadcastss (with a register source) is suitable here (with AVX alone, I believe you need vpermilps to splat within the lower 128 bits followed by vperm2f128 to fill the upper 128).
I've encountered inefficient code generation around this for a variety of reasons and have implemented my own variants of _mm_set1_ps() and _mm256_set1_ps() to encourage more efficient compilation. I don't feel I'm in a position to make a more specific recommendation than to check the disassembly you're getting, however.
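For reference, here is a minimal sketch of both cases using the corresponding intrinsics; the helper names are mine, and it assumes AVX for the memory form and AVX2 for the register form:

    #include <immintrin.h>

    /* Case 1: the scalar is in memory -- AVX vbroadcastss with a memory source. */
    static inline __m256 broadcast_from_mem(const float *p)
    {
        return _mm256_broadcast_ss(p);       /* vbroadcastss ymm, m32 */
    }

    /* Case 2: the scalar is already in the low element of an XMM register --
       AVX2 vbroadcastss with a register source. */
    static inline __m256 broadcast_from_reg(__m128 x)
    {
        return _mm256_broadcastss_ps(x);     /* vbroadcastss ymm, xmm */
    }

With a constant operand, compilers are generally free to fold either form back into a load-and-broadcast, so checking the generated disassembly remains the decisive test.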

Related

ILP solvers with small memory footprint

I'm trying to solve a sequence-labelling problem by formulating it as an integer linear program (as an experiment to see how well doing it in that way works). I've already found some suggestions for solvers on SO but I would like to get some more fine-grained advice due to some constraints I'm under (yes, that pun was actually intended).
Using COIN-OR, I'm running out of memory on more than half of my sequences due to their length, although I see no reason why my problem should need so much memory: it is a Boolean linear program, so I would theoretically need only one bit per feature. However, the COIN Open Solver Interface, for example, seems to accept only double values for defining constraints.
Are there any (free) ILP packages which are well-suited for either Boolean problems or at least for problems with a very small range of potential values?
CPLEX seems to be considered approximately the state of the art, and in my experience for hard ILPs it is often better than any free solver I found. Unfortunately, CPLEX is not free, except for academic users; IBM does offer free access to CPLEX for students and researchers at educational institutions, if you fit that description.

Numerical Integral of large numbers in Fortran 90

So I have the following integral that I need to do numerically:
Int[ Exp(0.5*(a*Cos(x) + b*Sin(x) + c*Cos(2*x) + d*Sin(2*x))) ] dx, x = 0..2*Pi
The problem is that the integrand at any given value of x can be extremely large, on the order of e^2000, which is larger than I can deal with in double precision.
I haven't had much luck googling this: how do you deal with large numbers in Fortran? I don't need high precision (I don't care about anything beyond double precision, and at the end I'll just be taking the log), but I do need to be able to handle the large numbers until I can take the log.
Are there integration packages that can handle arbitrarily large numbers? Mathematica clearly can, so there must be something like this out there.
Cheers
This is probably an extended comment rather than an answer but here goes anyway ...
As you've already observed, Fortran isn't equipped, out of the box, to handle numbers as large as e^2000. I think you have 3 options.
Use mathematics to reduce your problem to one which does (or a number of related ones which do) fall within the numerical range that your Fortran compiler can compute.
Use Mathematica or one of the other computer algebra systems (e.g. Maple, SAGE, Maxima). All (I think) of these can be integrated into a Fortran program (with varying degrees of difficulty and tightness of integration).
Use a library for high-precision (often also called arbitrary-precision or multiple-precision) arithmetic. Your favourite search engine will turn up a number of these for you, some written in Fortran (and therefore easy to integrate), some written in C/C++ or other languages (and therefore slightly harder to integrate). You might start your search at Lawrence Berkeley or the GNU bignum library.
(Yes I know that I wrote that you have 3 options, but your question suggests that you aren't ready to consider this yet) You could write your own high-/arbitrary-/multiple-precision functions. Fortran provides everything you need to construct such a library, there is a lot of work already done in the field to learn from, and it might be something of interest to you.
In practice it generally makes sense to apply as much mathematics as possible to a problem before resorting to a computer; that process can not only assist in solving the problem but also guide your selection or construction of a program to solve what's left of it.
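To make the mathematical option concrete, here is a hedged sketch (in C, since this thread mixes languages; the idea carries over directly to Fortran) of the standard rescaling trick: because only the log of the integral is needed, subtract the maximum of the exponent before calling exp, so no intermediate value ever leaves double range. The plain trapezoidal rule and all names are illustrative choices, not part of the answer above.

    #include <math.h>

    /* f(x) = 0.5*(a*cos x + b*sin x + c*cos 2x + d*sin 2x) */
    static double f(double x, double a, double b, double c, double d)
    {
        return 0.5 * (a*cos(x) + b*sin(x) + c*cos(2.0*x) + d*sin(2.0*x));
    }

    /* Returns log of Int_0^{2*pi} exp(f(x)) dx without ever forming e^2000. */
    double log_integral(double a, double b, double c, double d, int n)
    {
        const double h = 2.0 * M_PI / n;
        double fmax = -HUGE_VAL;

        /* First pass: locate the maximum of the exponent. */
        for (int i = 0; i <= n; ++i) {
            double v = f(i * h, a, b, c, d);
            if (v > fmax) fmax = v;
        }

        /* Second pass: integrate exp(f - fmax), which never exceeds 1. */
        double sum = 0.0;
        for (int i = 0; i <= n; ++i) {
            double w = (i == 0 || i == n) ? 0.5 : 1.0;   /* trapezoid weights */
            sum += w * exp(f(i * h, a, b, c, d) - fmax);
        }
        return fmax + log(h * sum);   /* log of the original integral */
    }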
I agree with High Performance Mark that the best option here numerically is to use analysis to scale or simplify the result first.
I will mention that if you do want to brute force it, gfortran (as of 4.6, with the libquadmath library) has support for quadruple precision reals, which you can use by selecting the appropriate kind. As long as your answers (and the intermediate results!) don't get too much bigger than what you're describing, that may work, but it will generally be much slower than double precision.
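For what it's worth, here is a hedged sketch of that brute-force route from C, using GCC's libquadmath (the same library that backs gfortran's quad-precision reals); __float128 tops out around 1.2e4932, so e^2000 (roughly 1e868) fits comfortably. The snippet is illustrative only.

    #include <quadmath.h>   /* GCC libquadmath; link with -lquadmath */
    #include <stdio.h>

    int main(void)
    {
        __float128 big = expq(2000.0Q);      /* overflows a double, fits in quad */
        char buf[64];
        quadmath_snprintf(buf, sizeof buf, "%.3Qe", logq(big));
        printf("log(e^2000) = %s\n", buf);   /* prints 2.000e+03 */
        return 0;
    }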
This requires looking deeper at the problem you are trying to solve and the behavior of the underlying mathematics. To add to the good advice already provided by Mark and Jonathan, consider expanding the exponential and trig functions into Taylor series and truncating to the desired level of precision.
Also, take a step back and ask what you are trying to accomplish by calculating this value. As an example, I recently had to debug why I was getting outlandish results from a property correlation which was calculating the vapor pressure of a fluid to see if condensation was occurring. I spent a long time trying to understand what was wrong with the temperature being fed into the correlation until I realized the case causing the error was a simulation of vapor detonation. The problem was not in the numerics but in the logic of checking for condensation during a literal explosion; physically, a condensation check made no sense. The real problem was that the code was asking an unnecessary question; it already had the answer.
I highly recommend Forman Acton's Numerical Methods That (Usually) Work and Real Computing Made Real. Both focus on problems like this and suggest techniques to tame ill-mannered computations.

Compilation optimization for iPhone : floating point or fixed point?

I'm building a library for the iPhone (Speex, but I'm sure it will apply to a lot of other libs too), and the make script has an option to use fixed point instead of floating point.
As the iPhone's ARM processor has the VFP extension and performs floating point calculations very well, do you think the fixed point option is the better choice?
If someone has already benchmarked this and wants to share, I would be really thankful.
Well, it depends on the setup of your application; here are some guidelines:
First, try setting the optimization level to -Os (Fastest, Smallest)
Turn on Relax IEEE Compliance
If your application can easily process floating point numbers in contiguous memory locations independently, you should look at the ARM NEON intrinsics and assembly instructions; they can process up to 4 floating point numbers in a single instruction (see the sketch after this list).
If you are already heavily using floating point math, try to switch some of your logic to fixed point (but keep in mind that moving from a NEON register to an integer register results in a full pipeline stall)
If you are already heavily using integer math, try changing some of your logic to floating point math.
Remember to profile before optimization
And above all, better algorithms will always beat micro-optimizations such as the above.
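To illustrate the NEON point from the list above, here is a minimal, hedged sketch (the function and variable names are mine) that processes four floats per instruction over a contiguous buffer; it assumes the length is a multiple of 4 and that the target is built with NEON enabled:

    #include <arm_neon.h>

    /* Scale a contiguous buffer of float samples, four at a time. */
    void scale_samples(float *buf, int len, float gain)
    {
        float32x4_t g = vdupq_n_f32(gain);           /* broadcast the gain */
        for (int i = 0; i < len; i += 4) {
            float32x4_t v = vld1q_f32(buf + i);      /* load 4 floats      */
            vst1q_f32(buf + i, vmulq_f32(v, g));     /* multiply and store */
        }
    }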
If you are dealing with large blocks of sequential data, NEON is definitely the way to go.
Float or fixed, that's a good question. NEON is somewhat faster with fixed point, but I'd keep the native input format, since conversions take time and possibly extra memory.
Even if the lib offers a different output format as an option, that almost always means lib-internal conversions. So I guess float is the native one in this case. Stick to it.
No one prevents you from micro-optimizing better algorithms. And usually, the better the algorithm, the more performance can be gained through micro-optimizations, thanks to the pipelining on modern machines.
I'd stay away from intrinsics though. There are so many posts on the net complaining about intrinsics doing something crazy, especially when dealing with immediate values.
It can and will get very troublesome, and you can hardly optimize anything with intrinsics either.

Can one construct a "good" hash function using CRC32C as a base?

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?
Update
How about this? Only 16 bits are suitable for a hash value. Fine. If your table has 65535 entries or fewer, great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set. Then use that as an index into an array of tables. This works if your table is south of 1 million entries. I'd bet that's cheaper/faster than the best-performing hash functions. Now that GCC 4.5 has a CRC32 intrinsic, it should be easy to test... if only I had the copious spare time to work on it.
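A minimal sketch of that scheme (mine, not from any answer, and untested as a hash design; the seed and names are illustrative). Compile with -msse4.2:

    #include <stdint.h>
    #include <immintrin.h>   /* _mm_crc32_u32 (SSE4.2) and _mm_popcnt_u32 */

    typedef struct { unsigned table; unsigned slot; } bucket_t;

    /* CRC32C the key, pick one of 33 sub-tables with POPCNT, and use the
       evenly distributed low 16 bits as the slot within that sub-table. */
    bucket_t pick_bucket(uint32_t key)
    {
        uint32_t crc = _mm_crc32_u32(0xFFFFFFFFu, key);
        bucket_t b;
        b.table = (unsigned)_mm_popcnt_u32(crc);   /* 0..32: which sub-table  */
        b.slot  = crc & 0xFFFFu;                   /* low 16 bits: slot index */
        return b;
    }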
David
Revisited, August 2014
Prompted by Arnaud Bouchez in a recent comment, and in view of other answers and comments, I acknowledge that the original answer needs to be altered or at the very least qualified. I have left the original as-is, at the end, for reference.
First, and maybe most importantly, a fair answer to the question depends on the intended use of the hash code: what does one mean by a "good" hash function? Where/how will the hash be used? (E.g. is it for hashing a relatively short input key? Is it for indexing/lookup purposes, to produce message digests, or for yet other uses? How long is the desired hash code itself: all 32 bits [of CRC32 or derivatives thereof], more bits, fewer... etc.?)
The OP's question calls for "a faster general-purpose hash function", so the focus is on SPEED (something less CPU-intensive and/or something which can make use of parallel processing of various kinds). We may note here that the computation time of the hash code itself is often only part of the problem in an application of hashing (for example, if the size of the hash code or its intrinsic characteristics result in many collisions which require extra cycles to deal with). Also, the requirement for "general purpose" leaves many questions open as to the possible uses.
With this in mind, a short and better answer is, maybe:
Yes, the hardware implementation of CRC32C on newer Intel processors can be used to build faster hash codes; beware, however, that depending on the specific implementation of the hash and on its application, the overall results may be sub-optimal because of the frequency of collisions or the need to use longer codes. Also, for sure, cryptographic uses of the hash should be carefully vetted, because the CRC32 algorithm itself is very weak in this regard.
The original answer cited an article on evaluating hash functions by Bret Mulvey, and as pointed out in Mdlg's answer, the conclusions of this article are erroneous in regard to CRC32, as the implementation of CRC32 it was based on was buggy/flawed. Despite this major error in regard to CRC32, the article provides useful guidance as to the properties of hash algorithms in general. The URL to this article is now defunct; I found it on archive.today, but I don't know whether the author has it at another location, nor whether he has updated it.
Other answers here cite CityHash 1.0 as an example of a hash library that uses CRC32C. Apparently, it is used in the context of some of the longer (than 32 bits) hash codes but not for the CityHash32() function itself. Also, the use of CRC32 by the CityHash functions is relatively small compared with all the shifting, shuffling and other operations performed to produce the hash code. (This is not a critique of CityHash, with which I have no hands-on experience. I'll go out on a limb, from a cursory review of the source code, and say that the CityHash functions produce good, i.e. well-distributed codes, but are not significantly faster than various other hash functions.)
Finally, you may also find insight on this issue in a quasi-duplicate question on SO.
Original answer and edit (April 2010)
A priori, this sounds like a bad idea!
CRC32 was not designed for hashing purposes, and its distribution is likely not to be uniform, making it a relatively poor hash code. Furthermore, its "scrambling" power is relatively weak, making it a very poor one-way hash, as would be used in cryptographic applications.
[BRB: I'm looking for online references to that effect...]
Google's first [keywords = CRC32 distribution] hit seems to confirm this:
Evaluating CRC32 for hash tables
Edit: The page cited above, and indeed the complete article, provides a good basis for what to look for in hash functions.
Reading this article [quickly] confirmed the blanket statement that, in general, CRC32 should not be used as a hash; however, depending on the specific purpose of the hash, it may be possible to use a CRC32, at least in part, as a hash code.
For example, the lower (or higher, depending on the implementation) 16 bits of the CRC32 code have a relatively even distribution, and, provided that one isn't concerned with the cryptographic properties of the hash code (e.g. the fact that similar keys produce very similar codes), it may be possible to build a hash code which uses, say, a concatenation of the lower [or higher] 16 bits of two CRC32 codes produced from the two halves (or whatever division) of the original key.
One would need to run tests to see whether the efficiency of the built-in CRC32 instruction, relative to alternative hash functions, is such that the overhead of calling the instruction twice, splicing the codes together, etc. doesn't result in an overall slower function.
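As an illustration of the split-and-concatenate idea, here is a hedged sketch using the SSE4.2 intrinsics (the seed, names, and byte-at-a-time loop are illustrative, and the mixing has not been vetted); compile with -msse4.2:

    #include <stdint.h>
    #include <stddef.h>
    #include <immintrin.h>   /* _mm_crc32_u8 (SSE4.2) */

    static uint32_t crc32c_bytes(uint32_t crc, const uint8_t *p, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            crc = _mm_crc32_u8(crc, p[i]);
        return crc;
    }

    /* Hash = low 16 bits of CRC32C(first half) and of CRC32C(second half). */
    uint32_t split_crc_hash(const void *key, size_t len)
    {
        const uint8_t *p = (const uint8_t *)key;
        size_t half = len / 2;
        uint32_t lo = crc32c_bytes(0xFFFFFFFFu, p, half);
        uint32_t hi = crc32c_bytes(0xFFFFFFFFu, p + half, len - half);
        return (hi << 16) | (lo & 0xFFFFu);   /* keep the well-distributed halves */
    }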
The article referred to in other answers draws incorrect conclusions based on buggy crc32 code. Google's ranking algorithm does not rank based on scientific accuracy yet.
Contrary to the conclusions of the referenced article "Evaluating CRC32 for hash tables", CRC32 and CRC32C are acceptable for hash table use. The author's sample code has a bug in the CRC32 table generation; fixing the table gives satisfactory results using the same methodology. Also, the speed of the CRC32 instruction makes it the best choice in many contexts: code using the CRC32 instruction is 16x faster at peak than an optimal software implementation. (Note that CRC32 is not exactly the same as CRC32C, which is what the Intel instruction implements.)
CRC32 is obviously not suitable for crypto use. (32 bit is a joke to brute force).
Yes. CityHash 1.0.1 includes some new "good hash functions" that use CRC32 instructions.
As long as you're not after a crypto hash, it just might work.
For cryptographic purposes, CRC32 is a bad foundation because it is linear (over the vector space GF(2)^32) and that is hard to correct. It may work for non-cryptographic purposes.
However, recent Intel cores have the AES-NI instructions, which basically perform 1/10th of an AES block encryption in two clock cycles. They are available on the most recent i5 and i7 processors (see the Wikipedia page for some details). This looks like a good start for building a cryptographic hash function (and a hash function which is good for cryptography will also be good for about anything else).
Indeed, at least one of the SHA-3 "round 2" candidates (the ECHO hash function) is built around the AES elements so that the AES-NI opcodes provide a very substantial performance boost. (Unfortunately, in the absence of AES-NI instruction, ECHO performance somewhat sucks.)

What is the fastest int to float conversion on the iPhone?

I am converting some Int16s and Int32s to float and then back again.
I'm just using a straight cast, but doing this 44100 times per second (any guesses what it's for? :) )
Is a cast efficient? Can it be done any faster?
P.S. Compile for Thumb is turned off.
There are only two ways to know.
1) Read the code the compiler generates for promoting ints to floats in your case.
2) Measure the performance of the code the compiler generates vs. other options.
To do the former, set the SDK to Device and the Active Architecture to arm, and choose Build > Show Assembly Code. Then read the compiler-generated code.
If you are smarter than a compiler then you can write your own assembly code and use it instead. Odds are you aren't.
If you are doing an operation many, many times, Instruments will do a good job at showing you how many processor samples it's taking. But Jim's point is valid, and you shouldn't dismiss it as unhelpful: in an operation involving math on floating-point numbers, compiler type promotion is the least of your worries. Chips are built to do that in two or three cycles, and compilers usually manage to make that happen. But the effects processing you're doing will probably take thousands of cycles. The promotion will be lost in the noise.
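For context, here is a minimal sketch of the kind of conversion under discussion: a plain per-sample cast, which on ARM with VFP typically lowers to a single convert instruction plus a multiply (the buffer and scale names are mine):

    #include <stdint.h>

    /* Convert a buffer of 16-bit samples to normalized floats. */
    void int16_to_float(const int16_t *in, float *out, int n)
    {
        const float scale = 1.0f / 32768.0f;    /* map to [-1, 1) */
        for (int i = 0; i < n; ++i)
            out[i] = (float)in[i] * scale;      /* cast + scale per sample */
    }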
Is a cast efficient? In your case, I'd guess it's efficient enough.
Can it be done faster? Maybe...but would it be worth the effort? Have you benchmarked it and discovered a performance problem due to the cast operations?
If you're doing anything mathematically nontrivial with the floating point sample data, I'd be really surprised if the casts turned out to be a significant bottleneck!