Pinning down a discrepancy in ddot between two machines - scipy

I currently have two machines which produce different outputs for an instance of np.dot on two vectors. Without digging through the many layers of abstraction leading from NumPy to BLAS, I was able to reproduce the discrepancy in scipy.linalg.blas.ddot, so I assume an explanation of the discrepancy in BLAS also explains the discrepancy in NumPy. Concretely, consider the following example:
import numpy as np
from scipy.linalg.blas import ddot
u = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
v = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
a = np.dot(v, u)
b = v[0]*u[0] + v[1]*u[1] + v[2]*u[2]
c = ddot(v, u)
print(f'{a:.25f}')
print(f'{b:.25f}')
print(f'{c:.25f}')
This produces the following outputs:
Machine 1 Machine 2
a 0.3853810478481685120044631 0.3853810478481685675156143
b 0.3853810478481685120044631 0.3853810478481685120044631
c 0.3853810478481685120044631 0.3853810478481685675156143
Similarly, the following piece of Cython gives rise to the same discrepancy:
cimport scipy.linalg.cython_blas
cimport numpy as np
import numpy as np
cdef np.float64_t run_test(np.double_t[:] a, np.double_t[:] b):
    cdef int ix, iy, n
    ix = iy = 1
    n = 3
    return scipy.linalg.cython_blas.ddot(&n, &a[0], &ix, &b[0], &iy)
a = np.array([0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170])
b = np.array([-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354])
print(f'{run_test(a, b):.25f}')
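(To build and run the snippet, a standard Cython build should suffice; here is a minimal setup.py sketch, assuming the snippet above is saved as ddot_test.pyx — the filename is just a placeholder. NumPy's headers are needed because of the cimport numpy line.)
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("ddot_test.pyx"),  # hypothetical filename for the snippet above
    include_dirs=[np.get_include()],         # header path needed for "cimport numpy"
)
Building with python setup.py build_ext --inplace and importing the resulting module then reproduces the same numbers.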
So, I'm trying to understand what could give rise to this.
The machines in question run Windows 10 (Intel(R) Core(TM) i7-5600U) and Windows Server 2016 (Intel(R) Xeon(R) Gold 6140) respectively.
In both cases I have set up fresh conda environments with nothing but numpy, scipy, cython, and their dependencies. I've run checksums on the environments to ensure that the binaries that end up being included agree, and I've verified that the outputs of np.__config__.show() match up. Similarly, I checked that the outputs of mkl.get_version_string() agree on the two machines.
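(For reference, that configuration comparison can be scripted along the following lines; this is just a sketch and assumes the mkl-service package from the conda MKL stack is installed, which is what provides mkl.get_version_string.)
# Dump the configuration details that were compared across the two machines
import numpy as np
import scipy
import mkl  # mkl-service package

print(np.__version__, scipy.__version__)
print(mkl.get_version_string())
np.__config__.show()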
This leads me to think that the problem might lie in hardware differences. I did not look into what instructions end up being executed (lacking a straightforward way to debug the Cython code on Windows/MSVC), but I checked that both machines support AVX2/FMA, which seemed like it could be one source of the discrepancy.
I did find, however, that the two machines support different instruction sets. Concretely:
Machine 1 (i7) Machine 2 (Xeon)
AVX Y Y
AVX2 Y Y
AVX512CD N Y
AVX512ER N N
AVX512F N Y
AVX512PF N N
FMA Y Y
I am, however, not aware of a good way to determine whether this by itself is sufficient to explain the discrepancy, or whether it's a red herring.
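(One way to dump these flags programmatically, rather than reading spec sheets, is the third-party py-cpuinfo package; the sketch below assumes it is installed.)
# List the AVX/FMA-related instruction-set flags the CPU reports
from cpuinfo import get_cpu_info  # package: py-cpuinfo

flags = get_cpu_info()['flags']
print(sorted(f for f in flags if f.startswith('avx') or f == 'fma'))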
So my question becomes:
Starting from the above, what are some natural steps to try to pin down the cause of the discrepancy? Is it assembly time, or is there something more obvious?

Given the excellent comments on the question, it seems clear that the difference between supported instruction sets is ultimately the culprit, and indeed we can use ListDLLs while running the Cython script to find that MKL loads different libraries in the two cases.
For the i7 (machine 1):
>listdlls64 python.exe | wsl grep mkl
0x00000000b9ff0000 0xe7e000 [...]\miniconda3\envs\npfloattest\Library\bin\mkl_rt.dll
0x00000000b80e0000 0x1f05000 [...]\miniconda3\envs\npfloattest\Library\bin\mkl_intel_thread.dll
0x00000000b3b40000 0x43ba000 [...]\miniconda3\envs\npfloattest\Library\bin\mkl_core.dll
0x00000000b0e50000 0x2ce5000 [...]\miniconda3\envs\npfloattest\Library\bin\mkl_avx2.dll
0x00000000b01f0000 0xc58000 [...]\miniconda3\envs\npfloattest\Library\bin\mkl_vml_avx2.dll
0x00000000f88c0000 0x7000 [...]\miniconda3\envs\npfloattest\lib\site-packages\mkl\_mklinit.cp37-win_amd64.pyd
0x00000000afce0000 0x22000 [...]\miniconda3\envs\npfloattest\lib\site-packages\mkl\_py_mkl_service.cp37-win_amd64.pyd
For the Xeon (machine 2):
0x0000000057ec0000 0xe7e000 [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_rt.dll
0x0000000055fb0000 0x1f05000 [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_intel_thread.dll
0x0000000051bf0000 0x43ba000 [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_core.dll
0x000000004e1a0000 0x3a4a000 [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_avx512.dll
0x000000005c6c0000 0xc03000 [...]\Miniconda3\envs\npfloattest\Library\bin\mkl_vml_avx512.dll
0x0000000079a70000 0x7000 [...]\Miniconda3\envs\npfloattest\lib\site-packages\mkl\_mklinit.cp37-win_amd64.pyd
0x000000005e830000 0x22000 [...]\Miniconda3\envs\npfloattest\lib\site-packages\mkl\_py_mkl_service.cp37-win_amd64.pyd
This very strongly indicates that support for AVX512CD/AVX512F is enough to prompt MKL to use a different library, and thereby, presumably, a different set of instructions.
Now it's interesting to see how this actually unfolds: what instructions are emitted, and what that means for the concrete numerical example.
As a start, let us write the equivalent VC++ program to get an idea about what instructions end up being run:
#include <iostream>
#include <windows.h>

typedef double (*func)(int, const double*, int, const double*, int);

int main()
{
    double a[3];
    double b[3];
    std::cin >> a[0];
    std::cin >> a[1];
    std::cin >> a[2];
    std::cin >> b[0];
    std::cin >> b[1];
    std::cin >> b[2];

    func cblas_ddot;
    HINSTANCE rt = LoadLibrary(TEXT("mkl_rt.dll"));
    cblas_ddot = (func)GetProcAddress(rt, "cblas_ddot");

    double res_rt = cblas_ddot(3, a, 1, b, 1);
    std::cout.precision(25);
    std::cout << res_rt;
}
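(For anyone who prefers to stay in Python, roughly the same probe can be done with ctypes; this is a sketch that assumes mkl_rt.dll is on the DLL search path — e.g. the environment's Library\bin — and that MKL uses the default LP64 interface with 32-bit integers.)
import ctypes

mkl_rt = ctypes.CDLL("mkl_rt.dll")
mkl_rt.cblas_ddot.restype = ctypes.c_double
mkl_rt.cblas_ddot.argtypes = [ctypes.c_int,
                              ctypes.POINTER(ctypes.c_double), ctypes.c_int,
                              ctypes.POINTER(ctypes.c_double), ctypes.c_int]

# Same vectors as in the question
u = (ctypes.c_double * 3)(0.13463703107579461093, -0.07773272613450200874, -0.98784132994666418170)
v = (ctypes.c_double * 3)(-0.86246572448831815283, -0.03715105562531360872, -0.50475010960748223354)
print(f'{mkl_rt.cblas_ddot(3, v, 1, u, 1):.25f}')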
Let us try to run this on each machine using Visual Studio's assembly debugger, starting with the i7 (machine 1), i.e. the machine supporting only AVX2. In each case we note the YMM registers modified by the instruction: for example, YMM4 and YMM5 are initialized with the values of a and b respectively, after the vfmadd231pd, YMM3 contains the element-wise product of the two arrays, and after the vaddsd, the lower part of YMM5 contains the result:
vmaskmovpd ymm4,ymm5,ymmword ptr [rbx]
YMM4 = 0000000000000000-BFEF9C656BB84218-BFB3E64ABC939CC1-3FC13BC946A68994
vmaskmovpd ymm5,ymm5,ymmword ptr [r9]
YMM5 = 0000000000000000-BFE026E9B3AD5464-BFA3057691D85EDE-BFEB9951B813250D
vfmadd231pd ymm3,ymm5,ymm4
YMM3 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd ymm1,ymm3,ymm1
YMM1 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd ymm0,ymm2,ymm0
vaddpd ymm2,ymm1,ymm0
YMM2 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vhaddpd ymm3,ymm2,ymm2
YMM3 = 3FDFE946951928C9-3FDFE946951928C9-BFBCFCC53F6313B2-BFBCFCC53F6313B2
vperm2f128 ymm4,ymm3,ymm3,1
YMM4 = BFBCFCC53F6313B2-BFBCFCC53F6313B2-3FDFE946951928C9-3FDFE946951928C9
vaddsd xmm5,xmm3,xmm4
YMM5 = 0000000000000000-0000000000000000-BFBCFCC53F6313B2-3FD8AA15454063DC
vmovsd qword ptr [rsp+90h],xmm5
The same experiment on machine 2, the one supporting AVX-512, gives the following result (where we give only the lower half of the ZMM registers):
vmovupd zmm5{k1}{z},zmmword ptr [r12]
ZMM5 = 0000000000000000-BFEF9C656BB84218-BFB3E64ABC939CC1-3FC13BC946A68994
vmovupd zmm4{k1}{z},zmmword ptr [r9]
ZMM4 = 0000000000000000-BFE026E9B3AD5464-BFA3057691D85EDE-BFEB9951B813250D
vfmadd231pd zmm3,zmm4,zmm5
ZMM3 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd zmm17,zmm1,zmm0
mov eax,0F0h
kmovw k1,eax
vaddpd zmm16,zmm3,zmm2
ZMM16= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vaddpd zmm19,zmm16,zmm17
ZMM19= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
mov eax,0Ch
kmovw k2,eax
vcompresspd zmm18{k1}{z},zmm19
vaddpd zmm21,zmm18,zmm19
ZMM21= 0000000000000000-3FDFE946951928C9-3F67A8442F158742-BFBDBA0760DBBFEC
vcompresspd zmm20{k2}{z},zmm21
ZMM20= 0000000000000000-0000000000000000-0000000000000000-3FDFE946951928C9
vaddpd zmm0,zmm20,zmm21
ZMM0 = 0000000000000000-3FDFE946951928C9-3F67A8442F158742-3FD87AC4BCE238CE
vhaddpd xmm1,xmm0,xmm0
ZMM1 = 0000000000000000-0000000000000000-3FD8AA15454063DD-3FD8AA15454063DD
vmovsd qword ptr [rsp+88h],xmm1
Comparing the two, we first note that the discrepancy is a single bit, 3FD8AA15454063DC vs. 3FD8AA15454063DD, but we now also see how it arises: in the AVX2 case, the horizontal add first combines what corresponds to the 0th and 1st entries of the vectors, while in the AVX-512 case, the 0th and 2nd entries are combined first. That is, the discrepancy simply boils down to the difference between naively computing v[0]*u[0] + v[2]*u[2] + v[1]*u[1] and v[0]*u[0] + v[1]*u[1] + v[2]*u[2]. Indeed, comparing the two we find the exact same discrepancy:
In [34]: '%.25f' % (v[0]*u[0] + v[2]*u[2] + v[1]*u[1])
Out[34]: '0.3853810478481685675156143'
In [35]: '%.25f' % (v[0]*u[0] + v[1]*u[1] + v[2]*u[2])
Out[35]: '0.3853810478481685120044631'
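(As a sanity check that these two decimal printouts really are the two bit patterns from the register dumps above, differing only in the last bit, one can look at the raw IEEE-754 encodings; a small sketch, with u and v as defined at the top of the question:)
import struct

x = v[0]*u[0] + v[2]*u[2] + v[1]*u[1]  # AVX-512 summation order
y = v[0]*u[0] + v[1]*u[1] + v[2]*u[2]  # AVX2 summation order
print(struct.pack('>d', x).hex())      # 3fd8aa15454063dd
print(struct.pack('>d', y).hex())      # 3fd8aa15454063dc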

If you need bit-wise equal answers: did you try the MKL_ENABLE_INSTRUCTIONS variable from the link posted by @bg2b (https://software.intel.com/en-us/mkl-linux-developer-guide-instruction-set-specific-dispatching-on-intel-architectures)? If you import the MKL library first and only then call mkl.enable_instructions, it might be too late.
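(A sketch of what that looks like in practice — an assumption based on the linked documentation, not something verified here: the variable has to be set before NumPy/SciPy, and hence MKL, is loaded, e.g. at the very top of the script or in the shell. On the AVX-512 machine this should force the AVX2 kernels and hence the machine 1 result.)
import os
os.environ["MKL_ENABLE_INSTRUCTIONS"] = "AVX2"  # must happen before MKL is loaded

import numpy as np                  # only import after the variable is set
from scipy.linalg.blas import ddot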
In the double precision (DP) world: the relative difference is -1.4404224470807435333001684021155e-16 (the absolute difference is -5.55111512e-17), which is less than the C++ and Python DP machine epsilon (https://en.wikipedia.org/wiki/Machine_epsilon). So the results are equal as far as Python correctness goes.
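(A quick check of that claim:)
import numpy as np
a = 0.3853810478481685675156143  # machine 2 (AVX-512) result
b = 0.3853810478481685120044631  # machine 1 (AVX2) result
print(abs(a - b) <= np.finfo(np.float64).eps * abs(b))  # True: the results differ by one ulp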
Cheers,
Vladimir

Related

3-layered Neural network doesn't learn properly

So, I'm trying to implement a neural network with 3 layers in Python; however, I am not the brightest person, so anything with more than 2 layers is kinda difficult for me. The problem with this one is that it gets stuck at .5 and does not learn; I have no actual clue where it went wrong. Thank you to anyone with the patience to explain the error to me. (I hope the code makes sense.)
import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))

def reduce(x):
    return x*(1-x)

l0=[np.array([1,1,0,0]),
    np.array([1,0,1,0]),
    np.array([1,1,1,0]),
    np.array([0,1,0,1]),
    np.array([0,0,1,0]),
   ]
output=[0,1,1,0,1]
syn0=np.random.random((4,4))
syn1=np.random.random((4,1))
for justanumber in range(1000):
    for i in range(len(l0)):
        l1=sigmoid(np.dot(l0[i],syn0))
        l2=sigmoid(np.dot(l1,syn1))
        l2_err=output[i]-l2
        l2_delta=reduce(l2_err)
        l1_err=syn1*l2_delta
        l1_delta=reduce(l1_err)
        syn1=syn1.T
        syn1+=l0[i].T*l2_delta
        syn1=syn1.T
        syn0=syn0.T
        syn0+=l0[i].T*l1_delta
        syn0=syn0.T
print l2
PS. I know that it might be a piece of trash as a script but that is why I asked for assistance
Your computations are not fully correct. For example, reduce is called on l1_err and l2_err, whereas it should be called on l1 and l2.
You are performing stochastic gradient descent. With so few parameters it oscillates hugely; use full-batch gradient descent instead.
The bias units are not present, although technically you can still learn without bias.
I tried to rewrite your code with minimal changes. I have commented your lines to show the changes.
#!/usr/bin/python3
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))

def reduce(x):
    return x*(1-x)
l0=np.array ([np.array([1,1,0,0]),
np.array([1,0,1,0]),
np.array([1,1,1,0]),
np.array([0,1,0,1]),
np.array([0,0,1,0]),
]);
output=np.array ([[0],[1],[1],[0],[1]]);
syn0=np.random.random((4,4))
syn1=np.random.random((4,1))
final_err = list ();
gamma = 0.05
maxiter = 100000
for justanumber in range(maxiter):
    syn0_del = np.zeros_like (syn0);
    syn1_del = np.zeros_like (syn1);
    l2_err_sum = 0;
    for i in range(len(l0)):
        this_data = l0[i,np.newaxis];
        l1=sigmoid(np.matmul(this_data,syn0))[:]
        l2=sigmoid(np.matmul(l1,syn1))[:]
        l2_err=(output[i,:]-l2[:])
        #l2_delta=reduce(l2_err)
        l2_delta=np.dot (reduce(l2), l2_err)
        l1_err=np.dot (syn1, l2_delta)
        #l1_delta=reduce(l1_err)
        l1_delta=np.dot(reduce(l1), l1_err)
        # Accumulate gradient for this point for layer 1
        syn1_del += np.matmul(l2_delta, l1).T;
        #syn1=syn1.T
        #syn1+=l1.T*l2_delta
        #syn1=syn1.T
        # Accumulate gradient for this point for layer 0
        syn0_del += np.matmul(l1_delta, this_data).T;
        #syn0=syn0.T
        #syn0-=l0[i,:].T*l1_delta
        #syn0=syn0.T
        # The error for this datapoint. Mean sum of squares
        l2_err_sum += np.mean (l2_err ** 2);
    l2_err_sum /= l0.shape[0]; # Mean sum of squares
    syn0 += gamma * syn0_del;
    syn1 += gamma * syn1_del;
    print ("iter: ", justanumber, "error: ", l2_err_sum);
    final_err.append (l2_err_sum);

# Predicting
l1=sigmoid(np.matmul(l0,syn0))[:]  # 5 x 4 * 4 x 4 = 5 x 4
l2=sigmoid(np.matmul(l1,syn1))[:]  # 5 x 4 * 4 x 1 = 5 x 1
print ("Predicted: \n", l2)
print ("Actual: \n", output)
plt.plot (np.array (final_err));
plt.show ();
The output I get is:
Predicted:
[[0.05214011]
[0.97596354]
[0.97499515]
[0.03771324]
[0.97624119]]
Actual:
[[0]
[1]
[1]
[0]
[1]]
Therefore the network was able to predict all the toy training examples. (Note that on real data you would not want to fit the training data this closely, as that leads to overfitting.) Note that you may get a slightly different result, as the weight initialisations are different. Also, as a rule of thumb, try to initialise the weights between [-0.01, +0.01] when you are not working on a specific problem where you specifically know a better initialisation.
Here is the convergence plot.
Note that you do not actually need to iterate over each example; instead you can do the matrix multiplication for the whole batch at once, which is much faster. Also, the above code does not have bias units. Make sure you have bias units when you re-implement the code.
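As a rough illustration of that point (this sketch is mine, not part of the code above; it keeps the same toy data, learning rate and sigmoid-derivative trick, and still has no bias units), the whole training loop can be written with full-batch matrix products:
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def dsigmoid(y):   # derivative of the sigmoid expressed via its output
    return y*(1-y)

X = np.array([[1,1,0,0],[1,0,1,0],[1,1,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
T = np.array([[0],[1],[1],[0],[1]], dtype=float)
syn0 = np.random.random((4,4))
syn1 = np.random.random((4,1))
gamma = 0.05

for _ in range(100000):
    l1 = sigmoid(X @ syn0)                        # (5,4): hidden activations for all examples
    l2 = sigmoid(l1 @ syn1)                       # (5,1): outputs for all examples
    l2_delta = (T - l2) * dsigmoid(l2)            # output-layer error signal
    l1_delta = (l2_delta @ syn1.T) * dsigmoid(l1) # back-propagated hidden-layer signal
    syn1 += gamma * l1.T @ l2_delta               # full-batch gradient steps
    syn0 += gamma * X.T @ l1_delta

print(l2)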
I would recommend you go through Raul Rojas' Neural Networks, a Systematic Introduction, Chapters 4, 6 and 7. Chapter 7 will tell you how to implement deeper networks in a simple way.

Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using Montgomery multiplication)

A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:
I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).
I'm trying to implement 2048-bit RSA in HLS, using the Tenca-Koç multiple-word radix-2 Montgomery multiplication algorithm, shown below (more algorithm details here):
I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:
#define MWR2MM_m 2048 // Bit-length of operands
#define MWR2MM_w 8    // Word size
#define MWR2MM_e 257  // Number of words per operand

// Type definitions
typedef ap_uint<1> bit_t;              // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t;    // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t; // m-bit operand size

/*
 * Multiple-word radix 2 Montgomery multiplication using carry-propagate adder
 */
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
    // Extend operands by 2 extra words of 0
    ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;

    ap_uint<2> C = 0; // Two carry bits
    bit_t qi = 0;     // An intermediate result bit

    // Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count.
    // Needs w+2 bits: the bottom w bits hold the sum word, the top two bits hold the carry.
    ap_uint<MWR2MM_w + 2> temp_concat = 0;

    // Scan X bit-by-bit
    for (int i=0; i<MWR2MM_m; i++)
    {
        qi = (X[i]*Y[0]) xor S[0];

        // C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat
        temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
        C = temp_concat.range(9,8);
        S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);

        // Scan Y and M word-by-word, for each bit of X
        for (int j=1; j<=MWR2MM_e; j++)
        {
            temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
            C = temp_concat.range(9,8);
            S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);
            S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
        }

        S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
        C = 0;
    }

    // If the final partial sum is greater than the modulus, bring it back to the proper range
    if (S >= M)
        S -= M;

    *out = S;
}
Unfortunately, the LUT utilization is huge.
This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.
Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?
I've already tried the following:
Experimenting with different word lengths
switching the top-level inputs to arrays so they are BRAM (i.e. not using ap_uint<2048>, but instead ap_uint<MWR2MM_w> foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.
However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?
In particular, I've seen papers on implementations of the mwr2mm algorithm that only use one DSP block and one BRAM. Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?
Thanks for the help.

How to do bitwise operation decently?

I'm doing analysis on binary data. Suppose I have two uint8 data values:
a = uint8(0xAB);
b = uint8(0xCD);
I want to take the lower two bits from a, and the whole content of b, to make a 10-bit value. In C style, it would be something like:
(a[2:1] << 8) | b
I tried bitget:
bitget(a,2:-1:1)
But this just gave me separate [1, 1] logical type values, which is not a scalar, and cannot be used in the bitshift operation later.
My current solution is:
Make a|b (a or b):
temp1 = bitor(bitshift(uint16(a), 8), uint16(b));
Left shift six bits to get rid of the higher six bits from a:
temp2 = bitshift(temp1, 6);
Right shift six bits to get rid of lower zeros from the previous result:
temp3 = bitshift(temp2, -6);
Putting all these on one line:
result = bitshift(bitshift(bitor(bitshift(uint16(a), 8), uint16(b)), 6), -6);
This doesn't seem efficient, right? I only want to get (a[2:1] << 8) | b, and it takes a long expression to get the value.
Please let me know if there's well-known solution for this problem.
Since you are using Octave, you can make use of bitpack and bitunpack:
octave> a = bitunpack (uint8 (0xAB))
a =
1 1 0 1 0 1 0 1
octave> B = bitunpack (uint8 (0xCD))
B =
1 0 1 1 0 0 1 1
Once you have them in this form, it's dead easy to do what you want:
octave> [B a(1:2)]
ans =
1 0 1 1 0 0 1 1 1 1
Then simply pad with zeros accordingly and pack it back into an integer:
octave> postpad ([B a(1:2)], 16, false)
ans =
1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0
octave> bitpack (ans, "uint16")
ans = 973
That or is equivalent to an addition here, since the shifted bits from a and the bits of b do not overlap:
result = bitshift(bi2de(bitget(a,1:2)),8) + b;
e.g.
a = 01010111
b = 10010010
result = 00000011 10010010
= a[2]*2^9 + a[1]*2^8 + b
an alternative method could be
result = mod(a,2^x)*2^y + b;
where x is the number of bits you want to extract from a and y is the number of bits of b, in your case:
result = mod(a,4)*256 + b;
an extra alternative solution close to the C solution:
result = bitor(bitshift(bitand(a,3), 8), b);
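(As a quick cross-check of the expected value in plain Python — just to confirm what all of these variants should return for the example in the question:)
a, b = 0xAB, 0xCD
print(((a & 0x03) << 8) | b)  # 973, i.e. 0x3CD, matching the bitpack answer above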
I think it is important to explain exactly what "(a[2:1] << 8) | b" is doing.
In assembly, referencing individual bits is a single operation. Assume all operations take the exact same time and "efficient" a[2:1] starts looking extremely inefficient.
The convenience statement actually does (a & 0x03).
If your compiler actually converts a uint8 to a uint16 based on how much it was shifted, this is not a 'free' operation, per se. Effectively, what your compiler will do is first clear the "memory" to the size of uint16 and then copy "a" into the location. This requires an extra step (clearing the "memory" (register)) that wouldn't normally be needed.
This means your statement actually is (uint16(a & 0x03) << 8) | uint16(b)
Now yes, because you're doing a power-of-two shift, you could just move a into AH, move b into AL, AND AH with 0x03, and move it all out, but that's a compiler optimization and not what your C code said to do.
The point is that directly translating that statement into matlab yields
bitor(bitshift(uint16(bitand(a,3)),8),uint16(b))
But, it should be noted that while it is not as TERSE as (a[2:1] << 8) | b, the number of "high level operations" is the same.
Note that all scripting languages are going to be very slow upon initiating each instruction, but will complete said instruction rapidly. The terse nature of Python isn't because "terse is better" but to create simple structures that the language can recognize so it can easily go into vectorized operations mode and start executing code very quickly.
The point here is that you have an "overhead" cost for calling bitand; but when operating on an array it will use SSE and that "overhead" is only paid once. The JIT (just in time) compiler, which optimizes script languages by reducing overhead calls and creating temporary machine code for currently executing sections of code MAY be able to recognize that the type checks for a chain of bitwise operations need only occur on the initial inputs, hence further reducing runtime.
Very high level languages are quite different (and frustrating) from high level languages such as C. You are giving up a large amount of control over code execution for ease of code production; whether matlab actually has implemented uint8 or if it is actually using a double and truncating it, you do not know. A bitwise operation on a native uint8 is extremely fast, but to convert from float to uint8, perform bitwise operation, and convert back is slow. (Historically, Matlab used doubles for everything and only rounded according to what 'type' you specified)
Even now, Octave 4.0.3 has a compiled bitshift function that, for bitshift(ones('uint32'),-32), results in it wrapping back to 1. BRILLIANT! VHLLs place you at the mercy of the language; it isn't about how terse or how verbose you write the code, it's how the blasted language decides to interpret it and execute machine-level code. So instead of shifting, uint32(floor(ones / (2^32))) is actually FASTER and more accurate.

NEON: loading uint8_t array into 128 bit register

I need to load values from a uint8 array into a 128-bit NEON register. There is a similar question, but it had no good answers.
My solution is:
uint8_t arr[4] = {1,2,3,4};
//load 4 of 8-bit vals into 64 bit reg
uint8x8_t _vld1_u8 = vld1_u8(arr);
//convert to 16-bit and move to 128-bit reg
uint16x8_t _vmovl_u8 = vmovl_u8(_vld1_u8);
//get low 64 bit and move them to 64-bit reg
uint16x4_t _vget_low_u16 = vget_low_u16(_vmovl_u8);
//convert to 32-bit and move to 128-bit reg
uint32x4_t ld32x4 = vmovl_u16(_vget_low_u16);
This works fine, but it seems to me that this approach is not the fastest. Maybe there is a better and faster way to load 8-bit data into a 128-bit register as 32-bit values?
Edit:
Thanks to @FrankH., I've come up with a second version using a bit of a hack:
uint8x16x2_t z = vzipq_u8(vld1q_u8(arr), q_zero);
uint8x16_t rr = *(uint8x16_t*)&z;
z = vzipq_u8(rr, q_zero);
ld32x4 = *(uint8x16_t*)&z;
It boils down to this assembly (when compiler optimisations are on):
vld1.8 {d16, d17}, [r5]
vzip.8 q8, q9
vorr q9, q4, q4
vzip.8 q8, q9
So there are no redundant stores and it's pretty fast. But still it is about x1.5 slower then the first solution.
You can do a "double zip" with zeroes:
uint16x4_t zero = 0;
uint32x4_t ld32x4 =
    vreinterpretq_u32_u16(
        vzipq_u8(
            vzip_u8(
                vld1_u8(arr),
                vreinterpret_u8_u16(zero)
            ),
            zero
        )
    );
Since the vreinterpretq_*() are no-ops, this boils down to three instructions. Don't have a crosscompiler around at the moment, can't validate that :(
Edit:
Don't get me wrong there ... while vreinterpretq_*() isn't resulting in a Neon instruction, it's not a no-op; that's because it stops the compiler from doing the type of funky things you'd see if you'd instead use widerVal.val[0]. All it tells the compiler is, like:
"you've got a uint8x16x2_t but I want to use only half of that as a uint8x16_t, give me half the registers."
Or:
"you have a uint8x16x2_t but I want to use those regs as a uint32x4_t instead."
I.e. it tells the compiler to alias sets of NEON registers - preventing stores/loads to/from the stack as you'd get if you did the explicit sub-set access through the .val[...] syntax.
In a way, the .val[...] syntax "is a hack" but the better method, the use of vreinterpretq_*(), "looks like a hack". Not using it results in more instructions and slower/inferior code.

How to select the last column of numbers from a table created by FoldList in Mathematica

I am new to Mathematica and I am having difficulties with one thing. I have this Table that generates 13 numbers (12 numbers plus 1 starting number) 10,000 times. I need to create a Histogram from all 10,000 of the 13th numbers. I hope it's quite clear; it's quite tricky to explain.
This is the table:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]
A good starting point for the histogram could be a table that collects all the 13th numbers into one list - then it would be quite easy to create a histogram. Maybe with "Select"? Or maybe you know other ways to solve this.
You can access different parts of a list using Part or (depending on what parts you need) some of the more specialised commands, such as First, Rest, Most and (the one you need) Last. As noted in comments, Histogram[Last /@ F] or Histogram[F[[All,-1]]] will work fine.
Although it wasn't part of your question, I would like to note some things you could do for your specific problem that will speed it up enormously. You are defining Mu, Sigma, etc. 10,000 times, because they are inside the Table command. You are also recalculating (Mu - Sigma^2/2)*t + Sigma*Sqrt[t] 120,000 times, even though it is a constant, because you have it inside the FoldList inside the Table.
On my machine:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]; // Timing
{4.19049, Null}
This alternative is ten times faster:
F = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
FoldList[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]]; // Timing
{0.403365, Null}
I use With for the local constants and Module for the things that are either redefined within the Table (Xi) or are calculations based on the local constants (beta). This question on the Mathematica StackExchange will help explain when to use Module versus Block versus With. (I encourage you to explore the Mathematica StackExchange further, as this is where most of the Mathematica experts are hanging out now.)
For your specific code, the use of Part isn't really required. Instead of using FoldList, just use Fold. It only retains the final number in the folding, which is identical to the last number in the output of FoldList. So you could try:
FF = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
Fold[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]];
Histogram[FF]
Calculating FF in this way is even a little faster than the previous version. On my system Timing reports 0.377 seconds - but such a difference from 0.4 seconds is hardly worth worrying about.
Because you are setting the seed with SeedRandom, it is easy to verify that all three code examples produce exactly the same results.
Making my comment an answer:
Histogram[Last /@ F]