What are packed, unpacked, and extended packed data? - cpu-architecture

I have been going through the Intel Intrinsics, and every function works on integers, floats, or doubles that are packed, unpacked, or extended packed.
It seems like this question should be answered somewhere on the internet, but I can't find the answer at all.
What is that packing thing?

Well, I've just been searching for the answer to the same question, also without success, so I can only guess.
Intel introduced "packed" operations back in their MMX technology. For example, they introduced the function
__m64 _mm_add_pi8 (__m64 a, __m64 b)
At that time there was no such thing as "extended packed". The only data type was __m64 and all operations worked on integers.
With SSE there came 128-bit registers and operations on floating point numbers. However, SSE2 included a superset of MMX operations on integers performed in 128-bit registers. For example,
__m128i _mm_add_epi8 (__m128i a, __m128i b)
Here, for the first time, we see the "ep" ("extended packed") part of the function name. Why was it introduced? I believe this was a solution to the problem of the name _mm_add_pi8 already being taken by the MMX intrinsic listed above. The interface of SSE/AVX is in the C language, where function names cannot be overloaded.
With AVX, Intel chose a different strategy and started to add the register length right after the opening "_mm" letters, cf.:
__m256i _mm256_add_epi8 (__m256i a, __m256i b)
__m512i _mm512_add_epi8 (__m512i a, __m512i b)
Why they chose "ep" here rather than "p" is a mystery, and irrelevant for programmers. In practice, they seem to use "p" for operations on floats and doubles and "ep" for integers:
__m128d _mm_add_pd (__m128d a, __m128d b); // "d": function operates on doubles
__m256 _mm256_add_ps (__m256 a, __m256 b); // "s": function operates on floats
Perhaps this goes back to the transition from MMX to SSE, where "ep" was introduced for integer operations (MMX handled no floats), combined with an attempt to keep the AVX mnemonics as close to the SSE ones as possible.
Thus, basically, from the perspective of a programmer, there's no difference between "ep" ("extended packed") and "p" ("packed"), for we are already aware of the register length that we target in our code.
As for the next part of the question, "unpacking" belongs to a completely different category of notions than "scalar" and "packed". It is rather a colloquial name for a particular data rearrangement (an interleaving shuffle), in the same family as rotations or shifts.
The reason for using "epi" in the name of intrinsics like _mm256_unpackhi_epi16 is that it is a truly vector (not scalar) function on a vector of 16-bit integer elements. Notice that here "unpack" belongs to the part of the function name that describes its action (like mul, add, or permute), whereas "s" / "p" / "ep" (scalar, packed, extended packed) belong to the part describing the operation mode (scalar for "s", vector for "p" or "ep").
(There are no scalar-integer instructions that operate between two XMM registers, but "si" does appear in the intrinsic name for movd eax, xmm0: _mm_cvtsi128_si32. There are a few similar intrinsics.)
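To make the distinction concrete, here is a minimal, self-contained sketch (my own example, not taken from any manual) using the SSE2 intrinsics discussed above; it does a packed ("ep") byte addition and then an "unpack", which simply interleaves the bytes of two registers:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* "epi8": 16 packed 8-bit integers per 128-bit register. */
    __m128i a   = _mm_set1_epi8(1);
    __m128i b   = _mm_set1_epi8(2);
    __m128i sum = _mm_add_epi8(a, b);        /* 16 independent byte adds */

    /* "unpacklo" interleaves the low 8 bytes of a and b: a0,b0,a1,b1,... */
    __m128i ilv = _mm_unpacklo_epi8(a, b);

    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, sum);
    printf("sum lane 0 = %d\n", out[0]);                     /* prints 3 */
    _mm_storeu_si128((__m128i *)out, ilv);
    printf("unpacklo lanes 0,1 = %d,%d\n", out[0], out[1]);  /* prints 1,2 */
    return 0;
}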


Online Algorithm approach for alternating subsequence

Consider a sequence A = a_1, a_2, a_3, ..., a_n of integers. A subsequence B of A is a sequence B = b_1, b_2, ..., b_m which is created from A by removing some elements while keeping the order. Given an integer sequence A, the goal is to compute an alternating subsequence B that is as long as possible, i.e. a sequence b_1, ..., b_m such that for all i in {2, 3, ..., m-1}: if b_{i-1} < b_i then b_i > b_{i+1}, and if b_{i-1} > b_i then b_i < b_{i+1}.
Consider an online version of the problem, where the sequence A is given element-by-element and each time, one needs to directly decide whether to include the next element in the subsequence B. Is it possible to achieve a constant competitive ratio (by using a deterministic online algorithm)? Either give an online algorithm which achieves a constant competitive ratio or show that it is not possible to find such an online algorithm.
Assume sequence [9,8,9,8,9,8, .... , 9,8,9,8,2,1,2,9,8,9, ... , 8,9,8,9,8,9]
My Argumentation:
The algorithm must decide immediately whether to insert an incoming number into the subsequence. If the algorithm now gets the numbers 2, then 1, then 2, it will eventually decide that they are part of the subsequence, and is thus worse than the optimal solution of n-3 by a non-constant factor.
-> No constant competitive ratio!
Is this a proper argumentation?
If I understood what you meant, your argument is correct, but the sequence you gave in the example is wrong: for example, the algorithm may choose all the 9's and 8's.
You can alter your argument slightly to make it more accurate, for example consider the sequence
3,4,3,4,3,4,......, 1/5,2/6,1/5,2/6,....
Explanation:
You start the sequence with 3,4,3,4,... and so on, until the algorithm picks two numbers. If it never does, it is obviously not competitive (it keeps 0 or 1 elements out of a possible n).
If the algorithm picked a 3 and then a 4, it must next take a number lower than 4. By continuing with 5,6,5,6,... the algorithm cannot take another number.
If the algorithm chose to take a 4 and then a 3, by similar reasoning we can easily see how continuing with 1,2,1,2,... prevents the algorithm from taking another number.
Thus, in any case, the algorithm cannot take more than 2 numbers out of n, which, as you stated, is not a constant competitive ratio.
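For what it's worth, here is a hypothetical C sketch of the adversary described above (the names and the naive greedy test algorithm are my own, and degenerate cases such as the algorithm never committing to two elements are glossed over):

#include <stdbool.h>
#include <stdio.h>

/* decide(x) is the online algorithm under attack: it returns true
   iff it irrevocably adds x to its subsequence B. */
typedef bool (*decide_fn)(int x);

/* Feed 3,4,3,4,... until the algorithm has committed to two elements,
   then switch to a tail it can never extend. */
static void adversary(decide_fn decide, int n)
{
    int taken = 0, first = 0, second = 0, i = 0;

    /* Phase 1: alternate 3,4,3,4,... */
    for (; i < n && taken < 2; i++) {
        int x = (i % 2 == 0) ? 3 : 4;
        if (decide(x)) {
            if (taken == 0) first = x; else second = x;
            taken++;
        }
    }

    /* Phase 2: after picking 3 then 4 the algorithm needs something < 4,
       so feed 5,6,5,6,...; after 4 then 3 it needs something > 3, so feed
       1,2,1,2,... (If it never picked two elements it has already lost.) */
    int lo = (first < second) ? 5 : 1;
    int hi = (first < second) ? 6 : 2;
    for (; i < n; i++)
        decide((i % 2 == 0) ? lo : hi);
}

/* A naive greedy online algorithm, only here to exercise the adversary:
   it takes any element that keeps its subsequence strictly alternating. */
static int g_count = 0, g_prev = 0, g_prev2 = 0;
static bool greedy(int x)
{
    bool take;
    if (g_count == 0)           take = true;
    else if (g_count == 1)      take = (x != g_prev);
    else if (g_prev2 < g_prev)  take = (x < g_prev);  /* last step went up */
    else                        take = (x > g_prev);  /* last step went down */
    if (take) { g_prev2 = g_prev; g_prev = x; g_count++; }
    return take;
}

int main(void)
{
    adversary(greedy, 40);
    printf("greedy kept %d of 40 elements\n", g_count);  /* prints 2 */
    return 0;
}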

OpenCL select() function with double

I'm porting some complex engineering code to OpenCL and have run into a problem with the select() ternary function with doubles. I'm just using scalars for now, so I could simply use the C ternary operator (?:), but I plan to move to vector types soon.
My problem is that select with doubles requires a (long) type as the comparison but the scalar relational functions (e.g., isgreater) only return (int) for doubles. The prototypes for these functions are ...
int isgreater (double a, double b);
longn isgreater (doublen a, doublen b);
double select (double a, double b, long cmp);
doublen select (doublen a, doublen b, longn cmp);
I can get the scalar code to compile/run in scalar mode only if I cast the result of isgreater() to long, since select requires the element types to be the same size.
double hi = ...;
double lo = ...;
double res = select (lo, hi, (long)isgreater(T, T_cutoff));
Otherwise, I get a compiler error since select is ambiguous. There seems to be a mismatch in the specification regarding the relational mask types for scalar and vector doubles.
Q1: Is this an oversight in the specification or a bug in the implementation? Both the Intel and AMD OpenCL compilers fail for builds on the CPU, so I'm guessing it's the former.
Q2: OpenCL scalar relational functions return 0/1 and vector relational functions return 0/-1 (that is, all bits set). The (int)->(long) conversion appears to be consistent with this requirement but not (int)->(ulong), right? Is the (int)->(long) conversion costly?
Q3: When (if) I switch to vector doubles, will the compiler toss out the unnecessary explicit conversion? I want to retain both scalar and vector types so I can target CUDA GPUs and SIMD devices (MIC, CPUs) w/o having to keep two massive code sets.
Thanks for any advice here.
Q1:
I'd say that not implicitly converting the result of isgreater into long is an oversight in the specification.
In the single-element case select should work exactly like the ternary operator. That's also the reason isgreater returns 1 in the scalar case. Basically, isgreater should work exactly like > does when scalar operands are used.
In the vectorized case select looks at the MSB of each element, which is the reason isgreater returns -1 (all bits 1, so the MSB is naturally 1 too).
Q2: The int-to-long conversion shouldn't be costly at all. At most it requires one additional instruction.
Q3:
It does not.
This issue annoyingly prevents one from writing code that vectorizes from 1 to n elements; the scalar case requires special handling.

Pearson perfect hashing

I'm trying to write a generator that produces Pearson perfect hashes. Note that I don't need a minimal perfect hash. Wikipedia says that a Pearson perfect hash can be found in O(|S|) time using a randomized algorithm (where S is the set of keys). However, I haven't been able to find such an algorithm online. Is this even possible?
Note: I don't want to use gperf/cmph/etc., I'd rather write my own implementation.
Pearson's original paper outlines an algorithm to construct a permutation table T for perfect hashing:
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between −10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31 in alphabetic order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process:
1. A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
2. One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
3. For each word, the first candidate considered for exchange was T[h[n − 1] ⊕ C[n]], the last table element referenced in the computation of the hash function for that word.
4. A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
5. If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n − 2] ⊕ C[n − 1]].
The procedure is not always successful. For example, using the ASCII character codes, if the word “a” hashes to 0 and the word “i” hashes to 15, it turns out that the word “in” must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).
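For reference, the Pearson hash itself that these rules manipulate is tiny. Here is a minimal C sketch (my own illustration, not code from the paper); T is the permutation table that step 1 fills pseudorandomly and steps 2-5 then tweak:

#include <stdint.h>
#include <stddef.h>

/* T is a permutation of 0..255; fill it with a pseudorandom permutation,
   then adjust it by swaps as described in the quoted procedure. */
static uint8_t T[256];

uint8_t pearson_hash(const char *key, size_t len)
{
    uint8_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = T[h ^ (uint8_t)key[i]];   /* h[n] = T[h[n-1] XOR C[n]] */
    return h;
}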

Fixed point arithmetic

I'm currently using Microchip's Fixed Point Library, but I think this applies to most fixed point libraries. It supports Q15 and Q15.16 types, respectively 16-bit and 32-bit data.
One thing I noticed is that it does not include add, subtract, multiply or divide functions.
How am I supposed to do these? Is it as simple as just adding/subtracting/multiplying/dividing them together using integer math? I can see addition and subtraction working, but multiplying or dividing wouldn't take care of the fractional part...?
The Microchip library includes functions for adding and subtracting that deal with underflow/overflow (_Q15add and _Q15sub).
Multiplication can be implemented as an assembly function (I think the code is good - this is from memory).
C calling prototype is:
extern _Q15 Q15mpy(_Q15 a, _Q15 b);
The routine (placed in a .s source file in your project) is:
.global _Q15mpy
_Q15mpy:
mul.ss w0, w1, w2 ; signed multiply of the parameters; 32-bit result in w3:w2 (w2 = low word)
SL w2, w2 ; shift the low word left, placing its most significant bit in the carry flag
RLC w3, w0 ; rotate the high word left through the carry into w0; result in W0
return ; return value in W0
.end
Remember to include libq.h
This routine does a left-shift of one bit rather than a right-shift of 15 bits on the result. There are no overflow concerns because Q15 numbers always have a magnitude <= 1.
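For comparison, the same Q15 multiply can be written portably in C (a hypothetical helper, not part of the Microchip library): widen to 32 bits, multiply, and shift the Q30 product back down to Q15.

#include <stdint.h>

typedef int16_t q15_t;   /* 1 sign bit, 15 fractional bits */

static inline q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product, fits in 32 bits */
    return (q15_t)(p >> 15);              /* truncate back to Q15 */
    /* Note: like the assembly above, this does not saturate the
       -1 * -1 corner case, whose true result (+1) is not representable. */
}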
It turns out that all basic arithmetic functions can be performed using the native operators, due to how the numbers are represented. For example, divide uses the / operator and multiply uses the * operator, and these compile to simple 32-bit divides and multiplies.

Where can I find a cheat sheet for hungarian notation?

I'm working on a legacy COM C++ project that makes use of Systems Hungarian notation. Because it's maintenance of legacy code, the convention is to code in the original style it was written in - our newer code isn't coded this way. So I'm not interested in changing that standard or having a discussion of our past sins =)
Is there an online cheat-sheet available out there for systems hungarian notation?
The best I can find thus far is a pre-Stack Overflow discussion post, but it doesn't quite have everything I've needed in the past. Does anyone have any other links?
(Making this community wiki in the hope this becomes a self-populating list.)
If this is for a legacy COM project, you'll probably want to follow Microsoft's Hungarian Notation specifications, which are documented on MSDN.
Note that this is Apps Hungarian, i.e. the "good" kind of Hungarian Notation. Systems Hungarian is the "bad" kind, where names are prefixed with their compiler types, e.g. i for int.
Tables from the MSDN article
Table 1. Some examples for procedure names
Name Description
InitSy Takes an sy as its argument and initializes it.
OpenFn fn is the argument. The procedure will "open" the fn. No value is returned.
FcFromBnRn Returns the fc corresponding to the bn,rn pair given. (The names cannot tell us what the types sy, fn, fc, and so on, are.)
The following is a list of standard type constructions. (X and Y stand for arbitrary tags. According to standard punctuation, the actual tags are lowercase.)
Table 2. Standard type constructions
pX Pointer to X.
dX Difference between two instances of type X. X + dX is of type X.
cX Count of instances of type X.
mpXY An array of Ys indexed by X. Read as "map from X to Y."
rgX An array of Xs. Read as "range X." The indices of the array are called:
iX index of the array rgX.
dnX (rare) An array indexed by type X. The elements of the array are called:
eX (rare) Element of the array dnX.
grpX A group of Xs stored one after another in storage. Used when the X elements are of variable size and standard array indexing would not apply. Elements of the group must be referenced by means other than direct indexing. A storage allocation zone, for example, is a grp of blocks.
bX Relative offset to a type X. This is used for field displacements in a data structure with variable size fields. The offset may be given in terms of bytes or words, depending on the base pointer from which the offset is measured.
cbX Size of instances of X in bytes.
cwX Size of instances of X in words.
The following are standard qualifiers. (The letter X stands for any type tag. Actual type tags are in lowercase.)
Table 3. Standard qualifiers
XFirst The first element in an ordered set (interval) of X values.
XLast The last element in an ordered set of X values. XLast is the upper limit of a closed interval, hence the loop continuation condition should be: X <= XLast.
XLim The strict upper limit of an ordered set of X values. Loop continuation should be: X < XLim.
XMax Strict upper limit for all X values (excepting Max, Mac, and Nil) for all other X: X < XMax. If X values start with X=0, XMax is equal to the number of different X values. The allocated length of a dnx vector, for example, will be typically XMax.
XMac The current (as opposed to constant or allocated) upper limit for all X values. If X values start with 0, XMac is the current number of X values. To iterate through a dnx array, for example:
for x=0 step 1 to xMac-1 do ... dnx[x] ...
or
for ix=0 step 1 to ixMac-1 do ... rgx[ix] ...
XNil A distinguished Nil value of type X. The value may or may not be 0 or -1.
XT Temporary X. An easy way to qualify the second quantity of a given type in a scope.
Table 4. Some common primitive types
f Flag (Boolean, logical). If qualifier is used, it should describe the true state of the flag. Exception: the constants fTrue and fFalse.
w Word with arbitrary contents.
ch Character, usually in ASCII text.
b Byte, not necessarily holding a coded character, more akin to w. Distinguished from the b constructor by the capital letter of the qualifier immediately following.
sz Pointer to first character of a zero terminated string.
st Pointer to a string. First byte is the count of characters cch.
h pp (in heap), i.e. a handle: a pointer to a pointer into the heap.
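To see how the constructions and tags above combine in practice, here is a small hypothetical C fragment (the names are made up for illustration, not taken from the MSDN article):

#include <stdbool.h>

char   *szTitle;      /* sz: pointer to a zero-terminated string */
char    rgchBuf[64];  /* rgch: array ("range") of characters */
int     cchBuf;       /* cch: count of characters in rgchBuf */
int     ichCur;       /* ich: index into an array of characters */
int     cbRecord;     /* cb: size of something in bytes */
char  **ppchHead;     /* ppch: pointer to a pointer to a character */
bool    fDone;        /* f: flag; the qualifier names its true state */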
Here's one for 'Systems Hungarian', which in my experience was the more commonly used (and less useful):
http://web.mst.edu/~cpp/common/hungarian.html
But how universally followed this is, I have no idea.
The other form of Hungarian Notation is "Apps Hungarian", which apparently is Simonyi's original intent (the use of the variable was encoded rather than the type). See http://en.wikipedia.org/wiki/Hungarian_notation for some details.
Because this is a legacy project, your software department manager should have a copy of the style guide for whatever version of Hungarian Notation the original programmers used. (I'm assuming that the original programmers have long since fled to more enlightened workplaces.)
You really should reconsider your use of Hungarian notation. It was originally a patch for the lack of strong typing (and compiler type-checking) in C. Modern compilers enforce type-correctness, making Hungarian notation redundant at best, and erroneous otherwise.
There doesn't seem to be any one exhaustive resource for looking up Hungarian Notation prefixes, probably because a lot of it varied from code base to code base. There, of course, were a lot of very commonly used ones.
The best list I could find was here
The rest cover the commonly used conventions such as this entry
MSDN's entry on Hungarian Notation is here
and a couple of short papers on the subject (overlapping each other perhaps) here and here
Your best bet would be to see how the variables are used, and that may help you figure out the definition of the prefixes (though in practice the naming rarely reflected the use of the variable, sadly).
You might be able to piece together some semblance of notation from those various links.
Just to be complete(!), how about Hungarian Object Notation for Visual Basic, from Microsoft Support no less.