SSE instruction to interleave top 16-bits of packed 32-bit blocks of two XMM registers? - x86-64

I can't seem to find an SSE instruction that does exactly this:
r[15:0] = a[31:16]
r[31:16] = a[63:48]
r[47:32] = a[95:80]
r[63:48] = a[127:112]
r[79:64] = b[31:16]
r[95:80] = b[63:48]
r[111:96] = b[95:80]
r[127:112] = b[127:112]
Any help would be greatly appreciated.
I looked through the Intel Intrinsics Guide (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html),
but it's hard to find exactly what you're looking for there, so I may have missed something obvious.
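One possible approach, offered as a sketch rather than a confirmed answer: shift every 32-bit lane right by 16 so its top half lands in the low half, then narrow both registers with a pack. With SSE2 alone, an arithmetic shift (`_mm_srai_epi32`) keeps each lane inside the signed 16-bit range, so the saturating `_mm_packs_epi32` passes the bits through unchanged. The helper name `pack_hi16`, the array-based interface, and the scalar fallback are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: r[0..3] = top 16 bits of a[0..3],
   r[4..7] = top 16 bits of b[0..3]. */
static void pack_hi16(const uint32_t a[4], const uint32_t b[4], uint16_t r[8])
{
    assert(a != NULL && b != NULL && r != NULL);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* The arithmetic shift sign-extends, so every lane fits in int16
       and the signed-saturating pack below never actually saturates. */
    va = _mm_srai_epi32(va, 16);
    vb = _mm_srai_epi32(vb, 16);
    _mm_storeu_si128((__m128i *)r, _mm_packs_epi32(va, vb));
#else
    /* Scalar reference with the same result, for builds without SSE2. */
    for (int i = 0; i < 4; i++) {
        r[i]     = (uint16_t)(a[i] >> 16);
        r[i + 4] = (uint16_t)(b[i] >> 16);
    }
#endif
}
```

If SSE4.1 is available, a logical shift (`_mm_srli_epi32`) followed by `_mm_packus_epi32` should give the same bits.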

Related

AR terms in SUR models - Matlab

I am trying to estimate a SUR model of the form
y_{1,t} = \alpha_1 +\beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 y_{1,t-1} +\epsilon_{1,t}
y_{2,t} = \alpha_2 +\beta_4 x_{1,t} + \beta_5 x_{2,t} + \beta_6 y_{2,t-1} +\epsilon_{2,t}
Define mY = [y_1 y_2] and mX = [x_1 x_2].
For this purpose I am doing
iT = size(mY,1); iN = size(mY,2);
mXsur = kron(mX, eye(iN));
mXsurCell = mat2cell(mXsur, iN*ones(iT,1));
iR = size(mXsur,2);
Mdl = vgxset('n', iN, 'nAR',1, 'nX',iR,'Constant',true);
[SurOutput, SurSDerror, ~,SURcov] = vgxvarx(Mdl, mY, mXsurCell);
The issue is that 'nAR',1 seems to add one lag of both y variables to each equation, whereas I only want each equation to include the lag of its own dependent variable. Is there a quick way to do this?
(Of course I can include the lagged terms manually in the mX matrix, but my question is whether this can be done more quickly via vgxset. Based on my reading of the help file I think not, but I want to double check.) Thanks

NEON: loading uint8_t array into 128 bit register

I need to load values from a uint8_t array into a 128-bit NEON register. There is a similar question, but it has no good answers.
My solution is:
uint8_t arr[4] = {1,2,3,4};
//load 4 of 8-bit vals into 64 bit reg
uint8x8_t _vld1_u8 = vld1_u8(arr);
//convert to 16-bit and move to 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);
//get low 64 bit and move them to 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);
//convert to 32-bit and move to 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);
This works fine, but it seems to me that this approach is not the fastest. Is there a better and faster way to load 8-bit data into a 128-bit register as 32-bit values?
Edit:
Thanks to @FrankH, I've come up with a second version using a hack:
uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z;
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint8x16_t*)&z;
It boils down to this assembly (when compiler optimisations are on):
vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr q9, q4, q4
vzip.8 q8, q9
So there are no redundant stores and it's pretty fast. But it is still about 1.5x slower than the first solution.
You can do a "double zip" with zeroes:
uint16x4_t zero = 0;
uint32x4_t ld32x4 =
vreinterpretq_u32_u16(
vzipq_u8(
vzip_u8(
vld1_u8(arr),
vreinterpret_u8_u16(zero)
),
zero
)
);
Since the vreinterpretq_*() are no-ops, this boils down to three instructions. Don't have a crosscompiler around at the moment, can't validate that :(
Edit:
Don't get me wrong there ... while vreinterpretq_*() isn't resulting in a Neon instruction, it's not a no-op; that's because it stops the compiler from doing the type of funky things you'd see if you'd instead use widerVal.val[0]. All it tells the compiler is, like:
"you've got a uint8x16x2_t but I want to use only half of that as a uint8x16_t, give me half the registers."
Or:
"you have a uint8x16x2_t but I want to use those regs as a uint32x4_t instead."
I.e. it tells the compilers to alias sets of neon registers - preventing stores/loads to/from the stack as you'd get if you do the explicit sub-set access through the .val[...] syntax.
In a way, the .val[...] syntax "is a hack" but the better method, the use of vreinterpretq_*(), "looks like a hack". Not using it results in more instructions and slower/inferior code.
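For comparison, here is a minimal sketch of the question's `vmovl` widening chain, wrapped so that it reads exactly four bytes (`vld1_u8` loads eight, which is more than the four-element `arr` holds). The helper name `widen_u8x4_to_u32x4`, the `memcpy`-based load, and the scalar fallback for non-NEON builds are assumptions for illustration, and the NEON path assumes a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: zero-extend the four bytes at src into four
   32-bit values stored at dst. */
static void widen_u8x4_to_u32x4(const uint8_t *src, uint32_t dst[4])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    uint32_t packed;
    memcpy(&packed, src, sizeof packed);            /* read exactly 4 bytes */
    uint8x8_t  v8  = vreinterpret_u8_u32(vdup_n_u32(packed));
    uint16x8_t v16 = vmovl_u8(v8);                  /* 8 x u8 -> 8 x u16 */
    uint32x4_t v32 = vmovl_u16(vget_low_u16(v16));  /* low 4 x u16 -> 4 x u32 */
    vst1q_u32(dst, v32);
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 4; i++)
        dst[i] = src[i];
#endif
}
```

Whether this is faster than the zip-based variants would need to be measured on the target core.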

Logical indexing of fields within a structure

I have a structure like so:
Basis.FieldsBasisType.fieldsBasisComponents
There are ~13 components to each basis, including 6 asset class IDs.
So, for example
fieldnames(Basis.SalaryIncrease) =
'Constant'
'AWeight'
'AAssetClassID'
'ATimeLag'
'BWeight'
'BAssetClassID'
'BTimeLag'
'CWeight'
'CAssetClassID'
'CTimeLag'
'DWeight'
'DAssetClassID'
'DTimeLag'
'EWeight'
'EAssetClassID'
'ETimeLag'
'FWeight'
'FAssetClassID'
'FTimeLag'
'cap'
'floor'
Now what I want to do is select all unique asset classes used in any basis. I am really struggling to make this neat, though. Currently I am using:
basisNames = fieldnames(Basis);
requiredSeries=[];
for i = 1:size(fieldnames(Basis),1)
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).AAssetClassID)];
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).BAssetClassID)];
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).CAssetClassID)];
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).DAssetClassID)];
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).EAssetClassID)];
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).FAssetClassID)];
end
requiredSeries = unique(requiredSeries)
Which is really ugly in my opinion. I want to do some kind of string compare to find 'AssetClassID' within the fields, so something like:
field = fieldnames(Basis.(basisNames{1}));
strfind(field,'AssetClassID');
And then use that cell array to logically index 'field' and just grab the data from 'AssetClassID' fields. But I am stuck on making that work.
~cellfun('isempty',strfind(field,'AssetClassID'))
gets me the logical index, but how do I apply that to the field names and then use it to get the values?
Any ideas would be appreciated, I feel there should be a neat way of doing it and I am missing something. Hardcoding those fieldnames seems short sighted as a solution.
Edit: I hate myself.
Sorry folks, I came up with a working variant moments after posting this. Apologies for wasting anyone's time!
basisNames = fieldnames(Basis);
for i = 1:size(fieldnames(Basis),1)
field = fieldnames(Basis.(basisNames{i}));
field = cell2mat(field(~cellfun('isempty',strfind(field,'AssetClassID'))));
for j = 1:size(field,1)
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).(field(j,:)))]; % index row j, not row 1, so every matching field is read
end
requiredSeries = unique(requiredSeries)
end
I was missing a necessary cell2mat earlier which caused the inability to get it to bloody work. Anyway, I'd always like to hear improvements to that but otherwise you can shut this down.
Sorry folks, I came up with a working variant 30 mins or so after posting this, so I'm popping it down as an answer as per Michelle's suggestion.
basisNames = fieldnames(Basis);
for i = 1:size(fieldnames(Basis),1)
field = fieldnames(Basis.(basisNames{i}));
field = cell2mat(field(~cellfun('isempty',strfind(field,'AssetClassID'))));
for j = 1:size(field,1)
requiredSeries = [requiredSeries;unique(Basis.(basisNames{i}).(field(j,:)))]; % index row j of field so every matching field is read
end
requiredSeries = unique(requiredSeries)
end
I was missing a necessary cell2mat earlier, which is why I couldn't get it to bloody work. Anyway, I'd always like to hear improvements to that, but otherwise you can ignore this entirely :)

how can i swap value of two variables without third one in objective c

Hi guys, I'd like your suggestions on how to swap the values of two variables without using a third one in Objective-C.
If there is a way, please let me know.
It can be done in any language. Say x and y are two variables and we want to swap them:
{
//lets say x , y are 1 ,2
x = x + y; // 1+2 =3
y = x - y; // 3 -2 = 1
x = x -y; // 3-1 = 2;
}
You can use these equations in any language to achieve this.
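As a hedged C sketch of the add/subtract trick above: with signed integers the intermediate sum can overflow, which is undefined behaviour in C, so this version uses `unsigned`, where arithmetic wraps around and the swap still comes out right. The helper name `swap_add_sub` and the pointer-based interface are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: swap two unsigned ints using only add/subtract.
   Unsigned arithmetic wraps modulo 2^N, so the intermediate sum is
   well defined even when it exceeds the type's range. */
static void swap_add_sub(unsigned *x, unsigned *y)
{
    assert(x != NULL && y != NULL);
    if (x == y)        /* same object: the trick would set it to zero */
        return;
    *x = *x + *y;      /* sum of both values (possibly wrapped) */
    *y = *x - *y;      /* now holds the original *x */
    *x = *x - *y;      /* now holds the original *y */
}
```

The early return matters: if both pointers refer to the same variable, the three statements would otherwise leave it at zero.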
Do you mean exchange the value of two variables, as in the XOR swap algorithm? Unless you're trying to answer a pointless interview question, programming in assembly language, or competing in the IOCCC, don't bother. A good optimizing compiler will probably handle the standard tmp = a; a = b; b = tmp; better than whatever trick you might come up with.
If you are doing one of those things (or are just curious), see the Wikipedia article for more info.
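For the curious, a minimal C sketch of the XOR swap next to the plain temporary-variable version it competes with. The function names `swap_xor` and `swap_tmp` are assumptions for illustration. The aliasing check is needed because XOR-ing a value with itself yields zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: XOR swap of two unsigned ints. */
static void swap_xor(unsigned *a, unsigned *b)
{
    assert(a != NULL && b != NULL);
    if (a == b)     /* same object: x ^ x == 0 would erase the value */
        return;
    *a ^= *b;
    *b ^= *a;       /* now holds the original *a */
    *a ^= *b;       /* now holds the original *b */
}

/* The conventional version, which needs no aliasing check and which
   compilers generally optimize well. */
static void swap_tmp(unsigned *a, unsigned *b)
{
    assert(a != NULL && b != NULL);
    unsigned tmp = *a;
    *a = *b;
    *b = tmp;
}
```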
As far as numbers are concerned, you can swap them without a third variable in any language, whether it's Java, Objective-C, or C/C++.
For more info, see:
Potential Problem in "Swapping values of two variables without using a third variable"
Since this is explicitly for iPhone, you can use the ARM instruction SWP, but it's almost inconceivable why you'd want to. The compiler is much, much better at this kind of optimization. If you just want to avoid the temporary variable in code, write an inline function to handle it. The compiler will optimize it away if it can be done more efficiently.
NSString * first = @"bharath";
NSString * second = @"raj";
first = [NSString stringWithFormat:@"%@%@",first,second];
NSRange needleRange = NSMakeRange(0,
first.length - second.length);
second = [first substringWithRange:needleRange];
first = [first substringFromIndex:second.length];
NSLog(@"first---> %@, Second---> %@",first,second);

Beat Detection on iPhone with wav files and openal

I have tried to build a beat detection engine based on this article: http://www.gamedev.net/reference/articles/article1952.asp
{
ALfloat energy = 0;
ALfloat aEnergy = 0;
ALint beats = 0;
bool init = false;
ALfloat Ei[42];
ALfloat V = 0;
ALfloat C = 0;
ALshort *hold;
hold = new ALshort[[myDat length]/2];
[myDat getBytes:hold length:[myDat length]];
ALuint uiNumSamples;
uiNumSamples = [myDat length]/4;
if(alDatal == NULL)
alDatal = (ALshort *) malloc(uiNumSamples*2);
if(alDatar == NULL)
alDatar = (ALshort *) malloc(uiNumSamples*2);
for (int i = 0; i < uiNumSamples; i++)
{
alDatal[i] = hold[i*2];
alDatar[i] = hold[i*2+1];
}
energy = 0;
for(int start = 0; start<(22050*10); start+=512){
for(int i = start; i<(start+512); i++){
energy+= ((alDatal[i]*alDatal[i]) + (alDatal[i]*alDatar[i]));
}
aEnergy = 0;
for(int i = 41; i>=0; i--){
if(i ==0){
Ei[0] = energy;
}
else {
Ei[i] = Ei[i-1];
}
if(start >= 21504){
aEnergy+=Ei[i];
}
}
aEnergy = aEnergy/43.f;
if (start >= 21504) {
for(int i = 0; i<42; i++){
V += (Ei[i]-aEnergy);
}
V = V/43.f;
C = (-0.0025714*V)+1.5142857;
init = true;
if(energy >(C*aEnergy)) beats++;
}
}
}
alDatal and alDatar are of type short*.
myDat is an NSData object that holds the actual audio data of a WAV file, formatted as
22050 Hz, 16-bit stereo.
This doesn't seem to work correctly. If anyone could help me out, that would be amazing. I've been stuck on this for 3 days.
The desired result is that, after 10 seconds' worth of data has been processed, I can multiply the beat count by 6 to get an estimated beats per minute.
My current result is 389 beats in 10 seconds, i.e. 2334 BPM, but I know the song is right around 120 BPM.
That code really has been smacked about with the ugly stick. If you're going to ask other people to find your bugs for you, it's a good idea to make things presentable first. Strangely enough, this will often help you to find them for yourself too.
So, before I point out some of the more fundamental errors, I have to make a few schoolmarmly suggestions:
Don't sprinkle your code with magic numbers. Is it really that hard to type a few lines like const ALuint SAMPLE_RATE = 22050? Trust me, it makes life a lot easier.
Use variable names that you aren't going to mix up easily. One of your bugs is a substitution of alDatal for alDatar. That probably wouldn't have happened if they were called left and right. Similarly, what is the point of having a meaningful variable name like energy if you're just going to stick it alongside the meaningless but more or less identical aEnergy? Why not something informative like average?
Declare variables close to where you're going to use them and in the appropriate scope. Another of your bugs is that you don't reset your calculated energy sum when you move your averaging window, so the energy will just add up and up. But you don't need the energy outside that loop, and if you declared it inside the problem couldn't happen.
There are some other things I personally find a little irksome, like the random bracing and indentation, and mixing of C and C++ allocations, and odd inconsistent scraps of Hungarian prefixing, but at least some of those may be more a matter of taste so I won't go on.
Anyway, here are some reasons why your code doesn't work:
First up, look at the right hand side of this line:
energy+= ((alDatal[i]*alDatal[i]) + (alDatal[i]*alDatar[i]));
You want the square of each channel value, so it should really say:
energy+= ((alDatal[i]*alDatal[i]) + (alDatar[i]*alDatar[i]));
Spot the difference? Not easy with those names, is it?
Second, you should be computing the total energy over each window of samples, but you're only setting energy = 0 outside the outer loop. So the sum accumulates, and consequently the current window energy will always be the biggest you've ever encountered.
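The first two fixes can be sketched together as one small C function: each channel is squared separately, and the running sum is a local variable, so every window starts from zero. The name `window_energy`, its parameters, and the use of `double` are assumptions for illustration, not the asker's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: energy of one window of stereo samples,
   i.e. the sum of left^2 + right^2 over [start, start + count). */
static double window_energy(const short *left, const short *right,
                            size_t start, size_t count)
{
    assert(left != NULL && right != NULL);
    double energy = 0.0;   /* local, so it cannot leak between windows */
    for (size_t i = start; i < start + count; i++) {
        double l = left[i];
        double r = right[i];
        energy += l * l + r * r;
    }
    return energy;
}
```

The caller would invoke this once per 512-sample window and push the result into the energy history.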
Third, your variance calculation is wrong. You have:
V += (Ei[i]-aEnergy);
But it should be the sum of the squares of the differences from the mean:
V += (Ei[i] - aEnergy) * (Ei[i] - aEnergy);
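A minimal C sketch of that variance calculation over the energy history follows. The name `energy_variance` and its interface are assumptions for illustration. It divides by the number of entries actually stored, whereas the question's code keeps 42 entries in `Ei` but divides by 43.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: mean and variance of the stored window energies.
   The mean is written to *mean_out and the variance is returned. */
static double energy_variance(const double *history, size_t count,
                              double *mean_out)
{
    assert(history != NULL && mean_out != NULL && count > 0);

    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += history[i];
    double mean = sum / (double)count;

    double sq_sum = 0.0;   /* sum of squared differences from the mean */
    for (size_t i = 0; i < count; i++) {
        double d = history[i] - mean;
        sq_sum += d * d;
    }

    *mean_out = mean;
    return sq_sum / (double)count;
}
```

The returned variance would then feed the question's threshold formula, `C = (-0.0025714*V)+1.5142857`, before comparing the current window energy against `C` times the mean.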
There may well be other errors as well. For instance, you don't allocate the data buffers if they're not NULL, but assume that they're the right length -- which you've only just calculated. You may justify that in terms of some consistent usage you've stuck to throughout your code, but from the perspective of what we can see here it looks like a pretty bad idea.