systemVerilog signed doesn't work correctly - system-verilog

I have the following function:
function tx_upconv_out_transaction predict(tx_upconv_in_transaction in_trx);
tx_upconv_out_transaction predicted = tx_upconv_out_transaction::type_id::create("predicted");
//-------golden model-----
// predicted.y = (in_trx.xi * in_trx.cos - in_trx.xq * in_trx.sin)/ (2 ** 17);
$display(" xi = %d, cos = %d xq = %d sin = %d", $signed(in_trx.xi),$signed(in_trx.cos),$signed(in_trx.xq),$signed(in_trx.sin) );
predicted.y = ($signed(in_trx.xi) * $signed(in_trx.cos) - $signed(in_trx.xq) * $signed(in_trx.sin))/ (131072);
return predicted;
endfunction: predict
where the fields in in_trx are defined by:
bit [15:0] xi;
bit [15:0] xq;
bit [15:0] sin;
bit [15:0] cos;
For the input:
xi, xq = fffa (hex)
sin = 0
cos = 7ffe (hex)
The output (display) is:
xi = -6, cos = 32766 xq = -6 sin = 0
Where it should be:
xi = -6, cos = -2 xq = -6 sin = 0

You can declare your vectors as signed or unsigned (the default), e.g.:
logic signed [3:0] signed_reg; // a 4-bit vector in range -8 to 7
From then on you will not need the $signed system calls.
Also, if you are using 16-bit 2-state variables, you should consider the built-in
shortint type, which is a 2-state, 16-bit signed integer.
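To make the signed reinterpretation concrete, here is a minimal Python sketch of 16-bit two's-complement decoding applied to the question's inputs. The helper name to_signed16 is hypothetical, and Python only stands in for the simulator here. Note that 0x7ffe has its sign bit clear, so it decodes to 32766 (the value that was displayed); -2 would be 0xfffe.

```python
def to_signed16(bits):
    """Interpret the low 16 bits of `bits` as a two's-complement integer."""
    bits &= 0xFFFF
    # If the sign bit (bit 15) is set, the value is negative.
    return bits - 0x10000 if bits & 0x8000 else bits

xi = to_signed16(0xFFFA)
cos = to_signed16(0x7FFE)
print(xi, cos)  # prints: -6 32766
```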


Log2 approximation in fixed-point

I've already implemented a fixed-point log2 function using a lookup table and low-order polynomial approximation, but I am not quite happy with the accuracy across the entire 32-bit fixed-point range [-1,+1). The input format is s0.31 and the output format is s15.16.
I'm posting this question here so that another user can post his answer (some comments were exchanged in another thread but they prefer to provide comprehensive answer in a separate thread). Any other answers are welcome, I would much appreciate if you could provide some speed vs accuracy details of your algorithm and its implementation.
Thanks.
By simply counting the leading zero bits in a fixed-point number x, one can determine log2(x) to the closest strictly smaller integer. On many processor architectures, there is a "count leading zeros" machine instruction or intrinsic. Where this is not available, a fairly efficient implementation of clz() can be constructed in a variety of ways, one of which is included in the code below.
To compute the fractional part of the logarithm, the two main obvious contenders are interpolation in a table and minimax polynomial approximation. In this specific case, quadratic interpolation in a fairly small table seems to be the more attractive option. Write x = 2^i * (1+f), with 0 ≤ f < 1. We determine i as described above and use the leading bits of f to index into the table. A parabola is fit through this and the two following table entries, computing the parameters of the parabola on the fly. The result is rounded, and a heuristic adjustment is applied to partially compensate for the truncating nature of fixed-point arithmetic. Finally, the integer portion is added, yielding the final result.
It should be noted that the computation involves right shifts of signed integers which may be negative. We need those right shifts to map to arithmetic right shifts at machine code level, something which is not guaranteed by the ISO-C standard. However, in practice most compilers do what is desired. In this case I used the Intel compiler on an x64 platform running Windows.
With a 66-entry table of 32-bit words, the maximum absolute error can be reduced to 8.18251e-6, so full s15.16 accuracy is achieved.
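The decomposition x = 2^i * (1+f) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: split_s0_31 is a hypothetical helper, the input is a positive s0.31 integer, and f is returned as a float rather than kept in fixed point.

```python
def split_s0_31(x):
    """Split a positive s0.31 integer into (i, f) with value = 2**i * (1 + f)."""
    assert 0 < x < (1 << 31)
    lz = 32 - x.bit_length()          # leading zeros in a 32-bit word
    i = -lz                           # integer part of log2 (INT_BITS_IN is 0)
    t = (x << (lz + 1)) & 0xFFFFFFFF  # shift out the leading 1, keep 32 fraction bits
    f = t / 2.0**32                   # fraction in [0, 1)
    return i, f

i, f = split_s0_31(0x60000000)        # 0x60000000 represents 0.75
print(i, f)                           # prints: -1 0.5
```

The top bits of t are what the table-based code uses as the table index; the remaining bits give the offset from the sampling point.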
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#define FRAC_BITS_OUT (16)
#define INT_BITS_OUT (15)
#define FRAC_BITS_IN (31)
#define INT_BITS_IN ( 0)
/* count leading zeros: intrinsic or machine instruction on many architectures */
int32_t clz (uint32_t x)
{
uint32_t n, y;
n = 31 + (!x);
if ((y = (x & 0xffff0000U))) { n -= 16; x = y; }
if ((y = (x & 0xff00ff00U))) { n -= 8; x = y; }
if ((y = (x & 0xf0f0f0f0U))) { n -= 4; x = y; }
if ((y = (x & 0xccccccccU))) { n -= 2; x = y; }
if (( (x & 0xaaaaaaaaU))) { n -= 1; }
return n;
}
#define LOG2_TBL_SIZE (6)
#define TBL_SIZE ((1 << LOG2_TBL_SIZE) + 2)
/* for i = [0,65]: log2(1 + i/64) * (1 << 31) */
const uint32_t log2Tab [TBL_SIZE] =
{
0x00000000, 0x02dcf2d1, 0x05aeb4dd, 0x08759c50,
0x0b31fb7d, 0x0de42120, 0x108c588d, 0x132ae9e2,
0x15c01a3a, 0x184c2bd0, 0x1acf5e2e, 0x1d49ee4c,
0x1fbc16b9, 0x22260fb6, 0x24880f56, 0x26e2499d,
0x2934f098, 0x2b803474, 0x2dc4439b, 0x30014ac6,
0x32377512, 0x3466ec15, 0x368fd7ee, 0x38b25f5a,
0x3acea7c0, 0x3ce4d544, 0x3ef50ad2, 0x40ff6a2e,
0x43041403, 0x450327eb, 0x46fcc47a, 0x48f10751,
0x4ae00d1d, 0x4cc9f1ab, 0x4eaecfeb, 0x508ec1fa,
0x5269e12f, 0x5440461c, 0x5612089a, 0x57df3fd0,
0x59a80239, 0x5b6c65aa, 0x5d2c7f59, 0x5ee863e5,
0x60a02757, 0x6253dd2c, 0x64039858, 0x65af6b4b,
0x675767f5, 0x68fb9fce, 0x6a9c23d6, 0x6c39049b,
0x6dd2523d, 0x6f681c73, 0x70fa728c, 0x72896373,
0x7414fdb5, 0x759d4f81, 0x772266ad, 0x78a450b8,
0x7a231ace, 0x7b9ed1c7, 0x7d17822f, 0x7e8d3846,
0x80000000, 0x816fe50b
};
#define RND_SHIFT (31 - FRAC_BITS_OUT)
#define RND_CONST ((1 << RND_SHIFT) / 2)
#define RND_ADJUST (0x10d) /* established heuristically */
/*
compute log2(x) in s15.16 format, where x is in s0.31 format
maximum absolute error 8.18251e-6 @ 0x20352845 (0.251622232)
*/
int32_t fixed_log2 (int32_t x)
{
int32_t f1, f2, dx, a, b, approx, lz, i, idx;
uint32_t t;
/* x = 2**i * (1 + f), 0 <= f < 1. Find i */
lz = clz (x);
i = INT_BITS_IN - lz;
/* normalize f */
t = (uint32_t)x << (lz + 1);
/* index table of log2 values using LOG2_TBL_SIZE msbs of fraction */
idx = t >> (32 - LOG2_TBL_SIZE);
/* difference between argument and smallest sampling point */
dx = t - (idx << (32 - LOG2_TBL_SIZE));
/* fit parabola through closest three sampling points; find coeffs a, b */
f1 = (log2Tab[idx+1] - log2Tab[idx]);
f2 = (log2Tab[idx+2] - log2Tab[idx]);
a = f2 - (f1 << 1);
b = (f1 << 1) - a;
/* find function value for argument by computing ((a*dx+b)*dx) */
approx = (int32_t)((((int64_t)a)*dx) >> (32 - LOG2_TBL_SIZE)) + b;
approx = (int32_t)((((int64_t)approx)*dx) >> (32 - LOG2_TBL_SIZE + 1));
approx = log2Tab[idx] + approx;
/* round fractional part of result */
approx = (((uint32_t)approx) + RND_CONST + RND_ADJUST) >> RND_SHIFT;
/* combine integer and fractional parts of result */
return (i << FRAC_BITS_OUT) + approx;
}
/* convert from s15.16 fixed point to double-precision floating point */
double fixed_to_float_s15_16 (int32_t a)
{
return a / 65536.0;
}
/* convert from s0.31 fixed point to double-precision floating point */
double fixed_to_float_s0_31 (int32_t a)
{
return a / (65536.0 * 32768.0);
}
int main (void)
{
double a, res, ref, err, maxerr = 0.0;
int32_t x, start, end;
start = 0x00000001;
end = 0x7fffffff;
printf ("testing fixed_log2 with inputs in [%17.10e, %17.10e)\n",
fixed_to_float_s0_31 (start), fixed_to_float_s0_31 (end));
for (x = start; x < end; x++) {
a = fixed_to_float_s0_31 (x);
ref = log2 (a);
res = fixed_to_float_s15_16 (fixed_log2 (x));
err = fabs (res - ref);
if (err > maxerr) {
maxerr = err;
}
}
printf ("max. err = %g\n", maxerr);
return EXIT_SUCCESS;
}
For completeness, I am showing the minimax polynomial approximation below. The coefficients for such approximations can be generated by several tools such as Maple, Mathematica, Sollya or with homebrew code using the Remez algorithm, which is what I used here. The code below shows the original floating-point coefficients, the dynamic scaling used to maximize accuracy in intermediate computation, and the heuristic adjustments applied to mitigate the impact of non-rounding fixed-point arithmetic.
A typical approach for computing log2(x) is to write x = 2^i * (1+f) and approximate log2(1+f) for (1+f) in [√½, √2], which means that we use a polynomial p(f) on the primary approximation interval [√½-1, √2-1].
The intermediate computation scales up operands as far as feasible for improved accuracy under the restriction that we want to use a 32-bit mulhi operation as its basic building block, as this is a native instruction on many 32-bit architectures, accessible either via inline machine code or as an intrinsic. As in the table-based code, there are right shifts of signed data which may be negative, and such right shifts must map to arithmetic right shifts, something that ISO-C doesn't guarantee but most C compilers do.
I managed to get the maximum absolute error for this variant down to 1.11288e-5, so almost full s15.16 accuracy but slightly worse than for the table-based variant. I suspect I should have added one additional term to the polynomial.
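As a floating-point cross-check of the approach, here is a minimal Python sketch that performs the same range reduction to [√½, √2) and evaluates the polynomial with the unscaled coefficients listed in the fixed-point code. The function name log2_poly is hypothetical, and the sketch ignores the scaling and heuristic adjustments of the fixed-point version, so its error is not the same as the figure quoted above.

```python
import math

# Floating-point coefficients as they appear in the fixed-point code,
# highest order first. The final "+ f" term of that code corresponds to
# the implicit leading coefficient 1.0 used in the return statement.
COEFFS = (-0.206191055, 0.318199910, -0.366491705,
          0.479811855, -0.721206390, 0.442701618)

def log2_poly(x):
    """Approximate log2(x) for x > 0 via range reduction plus a polynomial."""
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= m < 1
    if m < math.sqrt(0.5):         # move m into [sqrt(0.5), sqrt(2))
        m *= 2.0
        e -= 1
    f = m - 1.0                    # f in [-0.2929, 0.4142)
    q = 0.0
    for c in COEFFS:               # Horner evaluation of the correction term
        q = q * f + c
    return e + f + f * q           # log2(x) = e + log2(1 + f)

print(abs(log2_poly(0.3) - math.log2(0.3)))
```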
/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
#define RND_SHIFT (25 - FRAC_BITS_OUT)
#define RND_CONST ((1 << RND_SHIFT) / 2)
#define RND_ADJUST (-2) /* established heuristically */
/*
compute log2(x) in s15.16 format, where x is in s0.31 format
maximum absolute error 1.11288e-5 @ 0x5a82689f (0.707104757)
*/
int32_t fixed_log2 (int32_t x)
{
int32_t lz, i, f, p, approx;
uint32_t t;
/* x = 2**i * (1 + f), 0 <= f < 1. Find i */
lz = clz (x);
i = INT_BITS_IN - lz;
/* force (1+f) into range [sqrt(0.5), sqrt(2)] */
t = (uint32_t)x << lz;
if (t > (uint32_t)(1.414213562 * (1U << 31))) {
i++;
t = t >> 1;
}
/* compute log2(1+f) for f in [-0.2929, 0.4142] */
f = t - (1U << 31);
p = + (int32_t)(-0.206191055 * (1U << 31) - 1);
p = mulhi (p, f) + (int32_t)( 0.318199910 * (1U << 30) - 18);
p = mulhi (p, f) + (int32_t)(-0.366491705 * (1U << 29) + 22);
p = mulhi (p, f) + (int32_t)( 0.479811855 * (1U << 28) - 2);
p = mulhi (p, f) + (int32_t)(-0.721206390 * (1U << 27) + 37);
p = mulhi (p, f) + (int32_t)( 0.442701618 * (1U << 26) + 35);
p = mulhi (p, f) + (f >> (31 - 25));
/* round fractional part of the result */
approx = (p + RND_CONST + RND_ADJUST) >> RND_SHIFT;
/* combine integer and fractional parts of result */
return (i << FRAC_BITS_OUT) + approx;
}

How to implement the Softmax derivative independently from any loss function?

For a neural networks library I implemented some activation functions and loss functions and their derivatives. They can be combined arbitrarily and the derivative at the output layers just becomes the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the Softmax activation function independently from any loss function. Due to the normalization i.e. the denominator in the equation, changing a single input activation changes all output activations and not just one.
Here is my Softmax implementation where the derivative fails the gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?
import numpy as np
class Softmax:
def compute(self, incoming):
exps = np.exp(incoming)
return exps / exps.sum()
def delta(self, incoming, outgoing):
exps = np.exp(incoming)
others = exps.sum() - exps
return 1 / (2 + exps / others + others / exps)
activation = Softmax()
cost = SquaredError()
outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)
Mathematically, the derivative of the softmax output σ_j with respect to the logit z_i (for example, W_i*X) is
\frac{\partial \sigma_j}{\partial z_i} = \sigma_j (\delta_{ij} - \sigma_i)
where δ_ij is the Kronecker delta.
If you implement iteratively:
def softmax_grad(s):
# input s is the softmax value of the original input x. Its shape is (n,)
# e.g. s = np.array([0.3,0.7]), x = np.array([0,1])
# make the n-by-n Jacobian matrix.
jacobian_m = np.diag(s)
for i in range(len(jacobian_m)):
for j in range(len(jacobian_m)):
if i == j:
jacobian_m[i][j] = s[i] * (1 - s[i])
else:
jacobian_m[i][j] = -s[i] * s[j]
return jacobian_m
Test:
In [95]: x
Out[95]: array([1, 2])
In [96]: softmax(x)
Out[96]: array([ 0.26894142, 0.73105858])
In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
[-0.19661193, 0.19661193]])
If you implement in a vectorized version:
soft_max = softmax(x)
# reshape softmax to 2d so np.dot gives matrix multiplication
def softmax_grad(softmax):
s = softmax.reshape(-1,1)
return np.diagflat(s) - np.dot(s, s.T)
softmax_grad(soft_max)
#array([[ 0.19661193, -0.19661193],
# [-0.19661193, 0.19661193]])
It should be like this: (x is the input to the softmax layer and dy is the delta coming from the loss above it)
dx = y * dy
s = dx.sum(axis=dx.ndim - 1, keepdims=True)
dx -= y * s
return dx
But the way you compute the error should be:
yact = activation.compute(x)
ycost = cost.compute(yact)
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
Explanation: Because the delta function is a part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code, outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what does this Jacobian look like for softmax [1], and then multiply it from the left by a vector dy, after a bit of algebra you'll find out that you get something that corresponds to my Python code.
[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
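The claim that the delta function only has to multiply dy by the softmax Jacobian can be checked numerically. Below is a minimal sketch in plain Python (no NumPy, to keep it dependency-free); the names softmax_backward and numeric_backward are hypothetical and the single-vector layout is an assumption.

```python
import math

def softmax(x):
    m = max(x)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(y, dy):
    """Jacobian-vector product: dx_i = y_i * (dy_i - sum_j y_j * dy_j)."""
    s = sum(yj * dj for yj, dj in zip(y, dy))
    return [yi * (di - s) for yi, di in zip(y, dy)]

def numeric_backward(x, dy, eps=1e-6):
    """Central-difference estimate of d(sum_j dy_j * y_j) / dx_i."""
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += eps
        xm = list(x)
        xm[i] -= eps
        yp, ym = softmax(xp), softmax(xm)
        grads.append(sum(d * (a - b) for d, a, b in zip(dy, yp, ym)) / (2 * eps))
    return grads

x = [1.0, 2.0, 0.5]
dy = [0.3, -1.2, 0.7]                           # arbitrary upstream gradient
print(softmax_backward(softmax(x), dy))
print(numeric_backward(x, dy))
```

Because dy is arbitrary here, the same backward function works with any loss that supplies its own gradient with respect to the softmax output.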
The other answers are great; I am sharing a simple implementation of forward/backward that is independent of the loss function.
In the image below, it is a brief derivation of the backward for softmax. The 2nd equation is loss function dependent, not part of our implementation.
backward verified by manual grad checking.
import numpy as np
class Softmax:
def forward(self, x):
mx = np.max(x, axis=1, keepdims=True)
x = x - mx # log-sum-exp trick
e = np.exp(x)
probs = e / np.sum(np.exp(x), axis=1, keepdims=True)
return probs
def backward(self, x, probs, bp_err):
dim = x.shape[1]
output = np.empty(x.shape)
for j in range(dim):
d_prob_over_xj = - (probs * probs[:,[j]]) # i.e. prob_k * prob_j, no matter k==j or not
d_prob_over_xj[:,j] += probs[:,j] # i.e. when k==j, +prob_j
output[:,j] = np.sum(bp_err * d_prob_over_xj, axis=1)
return output
def compute_manual_grads(x, pred_fn):
eps = 1e-3
batch_size, dim = x.shape
grads = np.empty(x.shape)
for i in range(batch_size):
for j in range(dim):
x[i,j] += eps
y1 = pred_fn(x)
x[i,j] -= 2*eps
y2 = pred_fn(x)
grads[i,j] = (y1 - y2) / (2*eps)
x[i,j] += eps
return grads
def loss_fn(probs, ys, loss_type):
batch_size = probs.shape[0]
# dummy mse
if loss_type=="mse":
loss = np.sum((np.take_along_axis(probs, ys.reshape(-1,1), axis=1) - 1)**2) / batch_size
values = 2 * (np.take_along_axis(probs, ys.reshape(-1,1), axis=1) - 1) / batch_size
# cross ent
if loss_type=="xent":
loss = - np.sum( np.take_along_axis(np.log(probs), ys.reshape(-1,1), axis=1) ) / batch_size
values = -1 / np.take_along_axis(probs, ys.reshape(-1,1), axis=1) / batch_size
err = np.zeros(probs.shape)
np.put_along_axis(err, ys.reshape(-1,1), values, axis=1)
return loss, err
if __name__ == "__main__":
batch_size = 10
dim = 5
x = np.random.rand(batch_size, dim)
ys = np.random.randint(0, dim, batch_size)
for loss_type in ["mse", "xent"]:
S = Softmax()
probs = S.forward(x)
loss, bp_err = loss_fn(probs, ys, loss_type)
grads = S.backward(x, probs, bp_err)
def pred_fn(x, ys):
pred = S.forward(x)
loss, err = loss_fn(pred, ys, loss_type)
return loss
manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))
# compare both grads
print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads)**2) / batch_size}")
Just in case you are processing in batches, here is an implementation in NumPy (tested against TensorFlow). However, I would suggest avoiding the associated tensor operations by combining the Jacobian with the cross-entropy, which leads to a very simple and efficient expression.
import numpy as np
import tensorflow as tf
def softmax(z):
exps = np.exp(z - np.max(z))
return exps / np.sum(exps, axis=1, keepdims=True)
def softmax_jacob(s):
return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
- np.einsum('ij,ik->ijk', s, s)
def np_softmax_test(z):
return softmax_jacob(softmax(z))
def tf_softmax_test(z):
z = tf.constant(z, dtype=tf.float32)
with tf.GradientTape() as g:
g.watch(z)
a = tf.nn.softmax(z)
jacob = g.batch_jacobian(a, z)
return jacob.numpy()
z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
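The simplification mentioned at the start of this answer is the well-known result that, for softmax followed by cross-entropy, the gradient with respect to the logits is softmax(z) minus the one-hot target. A minimal single-sample sketch in plain Python follows; the function names are hypothetical and an integer class label is assumed.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def xent_loss(z, target):
    """Cross-entropy of softmax(z) against an integer class label."""
    return -math.log(softmax(z)[target])

def xent_grad(z, target):
    """Gradient of the loss w.r.t. the logits: softmax(z) - one_hot(target)."""
    p = softmax(z)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

z = [0.2, -1.0, 1.5]
print(xent_grad(z, target=2))
```

No n-by-n Jacobian is formed here, which is why this route is cheaper than the einsum version when the loss is known to be cross-entropy.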
Here is a C++ vectorized version, using intrinsics (22 times (!) faster than the non-SSE version):
// How many floats fit into __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 32-byte aligned, and
// operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8
//check to quickly see if your rows are divisible by m256.
//you can 'undefine' to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
#define assert_is_m256_multiple(x) assert( (x%F_MULTIPLE_OF_M256) == 0)
#else
#define assert_is_m256_multiple(q)
#endif
// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
const float *sumStart = reinterpret_cast<const float*>(&x);
float sum = 0.0f;
for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
sum += sumStart[i];
}
return sum;
}
f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size,
const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
assert_is_m256_multiple(size);
//allocate vector, where to store output:
f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.
const __m256* end = (const __m256*)(softmaxResult + size);
for(size_t i=0; i<size; ++i){// <--for every row
//go through this i'th row:
__m256 sum = _mm256_set1_ps(0.0f);
const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] );
const __m256 *s = (const __m256*)softmaxResult;
const __m256 *gAbove = (__m256*)gradFromAbove;
for (s; s<end; ){
__m256 mul = _mm256_mul_ps(*s, neg_sft_i); // sftmaxResult_j * (-sftmaxResult_i)
mul = _mm256_mul_ps( mul, *gAbove );
sum = _mm256_add_ps( sum, mul );//adding to the total sum of this row.
++s;
++gAbove;
}
grad_v[i] = slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
}//end for every row
//reset back to start and add a vector, to account for Kronecker delta:
__m256 *g = (__m256*)grad_v._contents;
__m256 *s = (__m256*)softmaxResult;
__m256 *gAbove = (__m256*)gradFromAbove;
for(s; s<end; ){
__m256 mul = _mm256_mul_ps(*s, *gAbove);
*g = _mm256_add_ps( *g, mul );
++s;
++g;
++gAbove;//advance together with s, as in the first loop
}
return grad_v;
}
If for some reason somebody wants a simple (non-SSE) version, here it is:
inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,
const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer
float *gradOutput,
size_t count ){
// every pre-softmax element in a layer contributed to the softmax of every other element
// (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
for(size_t i=0; i<count; ++i){
//go through this i'th row:
float sum = 0.0f;
const float neg_sft_i = -softmaxResult[i];
for(size_t j=0; j<count; ++j){
float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
sum += mul;//adding to the total sum of this row.
}
//NOTICE: equals, overwriting any old values:
gradOutput[i] = sum;
}//end for every row
for(size_t i=0; i<count; ++i){
gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
}
}

Most Efficient Way of Using mexCallMATLAB in Converting Double* to mxArray*

I am writing a MEX code in which I need to use pinv function. I am trying to find a way to pass the array of type double to pinv using mexCallMATLAB in the most efficient way. Let's for the sake of example say the array is named G and its size is 100.
double *G = (double*) mxMalloc( 100 * sizeof(double) );
where
G[0] = G11; G[1] = G12;
G[2] = G21; G[3] = G22;
Which means every four consecutive elements of G is a 2×2 matrix. G stores 25 different values of this 2×2 matrix.
I should note that these 2×2 matrices are not well-conditioned and they may contain all zero in their element. How can I use pinv function to calculate the pseudoinverse in the elements of G? For example, how can I pass the array to mexCallMATLAB in order to calculate the pseudoinverse of the first 2×2 matrix in G?
I thought of the following approach:
mxArray *G_PINV_input = mxCreateDoubleMatrix(2, 2, mxREAL);
mxArray *G_PINV_output = mxCreateDoubleMatrix(2, 2, mxREAL);
double *G_PINV_input_ptr = mxGetPr(G_PINV_input);
memcpy( G_PINV_input_ptr, &G[0], 4 * sizeof(double));
mexCallMATLAB(1, G_PINV_output, 1, G_PINV_input, "pinv");
I am not sure how good this approach is. Copying the values is not economical at all because the total number of elements in G in my actual application is large. Is there any way to skip this copying?
Here is my implementation of the MEX-function:
my_pinv.cpp
#include "mex.h"
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
// validate arguments
if (nrhs!=1 || nlhs>1)
mexErrMsgIdAndTxt("mex:error", "Wrong number of arguments");
if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]))
mexErrMsgIdAndTxt("mex:error", "Input isnt real dense double array");
if (mxGetNumberOfElements(prhs[0]) != 100)
mexErrMsgIdAndTxt("mex:error", "numel() != 100");
// create necessary arrays
mxArray *rhs[1], *lhs[1];
plhs[0] = mxCreateDoubleMatrix(100, 1, mxREAL);
rhs[0] = mxCreateDoubleMatrix(2, 2, mxREAL);
double *in = mxGetPr(prhs[0]);
double *out = mxGetPr(plhs[0]);
double *x = mxGetPr(rhs[0]), *y;
// for each 2x2 matrix
for (mwIndex i=0; i<100; i+=4) {
// copy 2x2 matrix into rhs
x[0] = in[i+0];
x[2] = in[i+1];
x[1] = in[i+2];
x[3] = in[i+3];
// lhs = pinv(rhs)
mexCallMATLAB(1, lhs, 1, rhs, "pinv");
// copy 2x2 matrix from lhs
y = mxGetPr(lhs[0]);
out[i+0] = y[0];
out[i+1] = y[1];
out[i+2] = y[2];
out[i+3] = y[3];
// free array
mxDestroyArray(lhs[0]);
}
// cleanup
mxDestroyArray(rhs[0]);
}
Here is a baseline implementation in MATLAB so that we can verify the results are correct:
my_pinv0.m
function y = my_pinv0(x)
y = zeros(size(x));
for i=1:4:numel(x)
y(i:i+3) = pinv(x([0 1; 2 3]+i));
end
end
Now we test the MEX-function:
% some vector
x = randn(100,1);
% MEX vs. MATLAB function
y = my_pinv0(x);
yy = my_pinv(x);
% compare
assert(isequal(y,yy))
EDIT:
Here is another implementation:
my_pinv2.cpp
#include "mex.h"
inline void call_pinv(const double &a, const double &b, const double &c,
const double &d, double *out)
{
mxArray *rhs[1], *lhs[1];
// create input matrix [a b; c d]
rhs[0] = mxCreateDoubleMatrix(2, 2, mxREAL);
double *x = mxGetPr(rhs[0]);
x[0] = a;
x[1] = c;
x[2] = b;
x[3] = d;
// lhs = pinv(rhs)
mexCallMATLAB(1, lhs, 1, rhs, "pinv");
// get values from output matrix
const double *y = mxGetPr(lhs[0]);
out[0] = y[0];
out[1] = y[1];
out[2] = y[2];
out[3] = y[3];
// cleanup
mxDestroyArray(lhs[0]);
mxDestroyArray(rhs[0]);
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
// validate arguments
if (nrhs!=1 || nlhs>1)
mexErrMsgIdAndTxt("mex:error", "Wrong number of arguments");
if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0]))
mexErrMsgIdAndTxt("mex:error", "Input isnt real dense double array");
if (mxGetNumberOfElements(prhs[0]) != 100)
mexErrMsgIdAndTxt("mex:error", "numel() != 100");
// allocate output
plhs[0] = mxCreateDoubleMatrix(100, 1, mxREAL);
double *out = mxGetPr(plhs[0]);
const double *in = mxGetPr(prhs[0]);
// for each 2x2 matrix
for (mwIndex i=0; i<100; i+=4) {
// 2x2 input matrix [a b; c d], and its determinant
const double a = in[i+0];
const double b = in[i+1];
const double c = in[i+2];
const double d = in[i+3];
const double det = (a*d - b*c);
if (det != 0) {
// inverse of 2x2 matrix [d -b; -c a]/det
out[i+0] = d/det;
out[i+1] = -c/det;
out[i+2] = -b/det;
out[i+3] = a/det;
}
else {
// singular matrix, fallback to pseudo-inverse
call_pinv(a, b, c, d, &out[i]);
}
}
}
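This time we compute the determinant of the 2x2 matrix; if it is non-zero, we calculate the inverse ourselves according to:
\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
Otherwise we fall back to invoking PINV from MATLAB for the pseudo-inverse.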
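For illustration, the determinant test and closed-form inverse can be sketched in Python as below. The singular branch is not what the MEX code does: as an assumption, it swaps in the closed form pinv(A) = A^T / ||A||_F^2, which holds for rank-1 matrices, instead of calling PINV. The name pinv2x2 is hypothetical, and the exact det != 0 test mirrors the MEX code rather than the tolerance-based rank decision that pinv makes.

```python
def pinv2x2(a, b, c, d):
    """Pseudo-inverse of [[a, b], [c, d]] as a row-major tuple (p11, p12, p21, p22)."""
    det = a * d - b * c
    if det != 0.0:
        # Regular case: [d -b; -c a] / det
        return (d / det, -b / det, -c / det, a / det)
    frob2 = a * a + b * b + c * c + d * d
    if frob2 == 0.0:
        # The zero matrix is its own pseudo-inverse.
        return (0.0, 0.0, 0.0, 0.0)
    # Rank-1 case: pinv(A) = A^T / ||A||_F^2
    return (a / frob2, c / frob2, b / frob2, d / frob2)

print(pinv2x2(2.0, 0.0, 0.0, 4.0))   # regular matrix
print(pinv2x2(1.0, 2.0, 2.0, 4.0))   # singular, rank 1
```

Note that the tuple here is row-major, whereas the MEX code writes its output in MATLAB's column-major order.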
Here is a quick benchmark:
% 100x1 vector
x = randn(100,1); % average case, with normal 2x2 matrices
% running time
funcs = {@my_pinv0, @my_pinv1, @my_pinv2};
t = cellfun(@(f) timeit(@() f(x)), funcs, 'Uniform',true);
% compare results
y = cellfun(@(f) f(x), funcs, 'Uniform',false);
assert(isequal(y{1},y{2}))
I get the following timings:
>> fprintf('%.6f\n', t);
0.002111 % MATLAB function
0.001498 % first MEX-file with mexCallMATLAB
0.000010 % second MEX-file with "unrolled" matrix inverse (+ PINV as fallback)
The error is acceptable and within machine precision:
>> norm(y{1}-y{3})
ans =
2.1198e-14
You could also test the worst case, when many of the 2x2 matrices are singular:
x = randi([0 1], [100 1]);
You don't need to allocate the output. Just make the pointer and let pinv create the mxArray automatically.
mxArray *lhs;
Then just use & like,
mexCallMATLAB(1, &lhs, 1, &rhs, "pinv");

Matlab: fast way to sum ones in binary numbers with Sparse structure?

Most answers only address the already-answered question about Hamming weights but ignore the point about find and dealing with the sparsity. Apparently the answer by Shai here addresses the point about find, but I am not yet able to verify it. My answer here does not use the ingenuity of the other answers, such as the bit shifting, but it is a good enough example answer.
Input
>> mlf=sparse([],[],[],2^31+1,1);mlf(1)=10;mlf(10)=111;mlf(77)=1010;
>> transpose(dec2bin(find(mlf)))
ans =
001
000
000
011
001
010
101
Goal
1
0
0
2
1
1
2
Is there a fast way to count the ones in these binary numbers, given the sparse structure?
You can do this in tons of ways. The simplest I think would be
% Example data
F = [268469248 285213696 536904704 553649152];
% Solution 1
sum(dec2bin(F)-'0',2)
And the fastest (as found here):
% Solution 2
w = uint32(F');
p1 = uint32(1431655765);
p2 = uint32(858993459);
p3 = uint32(252645135);
p4 = uint32(16711935);
p5 = uint32(65535);
w = bitand(bitshift(w, -1), p1) + bitand(w, p1);
w = bitand(bitshift(w, -2), p2) + bitand(w, p2);
w = bitand(bitshift(w, -4), p3) + bitand(w, p3);
w = bitand(bitshift(w, -8), p4) + bitand(w, p4);
w = bitand(bitshift(w,-16), p5) + bitand(w, p5);
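The decimal masks in Solution 2 are the usual alternating bit patterns (1431655765 is 0x55555555, 858993459 is 0x33333333, and so on). A minimal Python sketch of the same parallel bit count, with the masks written in hex, may make the pairwise summing easier to follow; the name popcount32 is hypothetical.

```python
def popcount32(w):
    """Count set bits in a 32-bit value by summing adjacent fields in parallel."""
    w &= 0xFFFFFFFF
    w = ((w >> 1) & 0x55555555) + (w & 0x55555555)    # 16 sums of 2 bits
    w = ((w >> 2) & 0x33333333) + (w & 0x33333333)    # 8 sums of 4 bits
    w = ((w >> 4) & 0x0F0F0F0F) + (w & 0x0F0F0F0F)    # 4 sums of 8 bits
    w = ((w >> 8) & 0x00FF00FF) + (w & 0x00FF00FF)    # 2 sums of 16 bits
    w = ((w >> 16) & 0x0000FFFF) + (w & 0x0000FFFF)   # final sum
    return w

F = [268469248, 285213696, 536904704, 553649152]     # example data from above
print([popcount32(v) for v in F])
```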
According to your comments, you convert a vector of numbers to binary string representations using dec2bin. Then you can achieve what you want as follows, where I'm using vector [10 11 12] as an example:
>> sum(dec2bin([10 11 12])=='1',2)
ans =
2
3
2
Or equivalently,
>> sum(dec2bin([10 11 12])-'0',2)
For speed, you could avoid dec2bin like this (uses modulo-2 operations, inspired by the dec2bin code):
>> sum(rem(floor(bsxfun(@times, [10 11 12].', pow2(1-N:0))),2),2)
ans =
2
3
2
where N is the maximum number of binary digits you expect.
If you really want speed, I think a look-up table would be handy. You can simply store, for each value in 0..255, how many ones it has. Do this once, and then you only need to decompose an int into its bytes, look each byte up in the table, and add the results. There is no need to go through strings.
An example:
>> LUT = sum(dec2bin(0:255)-'0',2); % construct the look up table (only once)
>> ii = uint32( find( mlf ) ); % get the numbers
>> vals = LUT( mod( ii, 256 ) + 1 ) + ... % lower bytes
LUT( mod( bitshift( ii, -8 ), 256 ) + 1 ) + ... % bitshift, because integer division rounds
LUT( mod( bitshift( ii, -16 ), 256 ) + 1 ) + ...
LUT( mod( bitshift( ii, -24 ), 256 ) + 1 );
Using typecast (as suggested by Amro):
>> vals = sum( reshape(LUT(double(typecast(ii,'uint8'))+1), 4, [] ), 1 )';
Run time comparison
>> ii = uint32(randi(intmax('uint32'),100000,1));
>> tic; vals1 = sum( reshape(LUT(double(typecast(ii,'uint8'))+1), 4, [] ), 1 )'; toc
>> tic; vals2 = sum(dec2bin(ii)-'0',2); toc
>> dii = double(ii); % type issues
>> tic; vals3 = sum(rem(floor(bsxfun(@times, dii, pow2(1-32:0))),2),2); toc
Results:
Elapsed time is 0.006144 seconds. <-- this answer
Elapsed time is 0.120216 seconds. <-- using dec2bin
Elapsed time is 0.118009 seconds. <-- using rem and bsxfun
Here is an example to show @Shai's idea of using a lookup table:
% build lookup table for 8-bit integers
lut = sum(dec2bin(0:255)-'0', 2);
% get indices
idx = find(mlf);
% break indices into 8-bit integers and apply LUT
nbits = lut(double(typecast(uint32(idx),'uint8')) + 1);
% sum number of bits in each
s = sum(reshape(nbits,4,[]))
You might have to switch to uint64 instead if you have really large sparse arrays with indices outside the 32-bit range.
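The byte-wise lookup idea can also be sketched in Python. This is a minimal illustration, not a translation of the MATLAB code: the name popcount32_lut is hypothetical, and int.to_bytes plays the role that typecast to uint8 plays above.

```python
# Number of set bits for every possible byte value, built once.
LUT = [bin(b).count("1") for b in range(256)]

def popcount32_lut(x):
    """Count set bits of a 32-bit value by summing table entries for its 4 bytes."""
    return sum(LUT[byte] for byte in x.to_bytes(4, "little"))

indices = [1, 10, 77]                 # the non-zero positions of mlf in the question
print([popcount32_lut(i) for i in indices])   # prints: [1, 2, 4]
```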
EDIT:
Here is another solution for you using Java:
idx = find(mlf);
s = arrayfun(@java.lang.Integer.bitCount, idx);
EDIT#2:
Here is yet another solution implemented as C++ MEX function. It relies on std::bitset::count:
bitset_count.cpp
#include "mex.h"
#include <bitset>
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
// validate input/output arguments
if (nrhs != 1) {
mexErrMsgTxt("One input argument required.");
}
if (!mxIsUint32(prhs[0]) || mxIsComplex(prhs[0]) || mxIsSparse(prhs[0])) {
mexErrMsgTxt("Input must be a 32-bit integer dense matrix.");
}
if (nlhs > 1) {
mexErrMsgTxt("Too many output arguments.");
}
// create output array
mwSize N = mxGetNumberOfElements(prhs[0]);
plhs[0] = mxCreateDoubleMatrix(N, 1, mxREAL);
// get pointers to data
double *counts = mxGetPr(plhs[0]);
uint32_T *idx = reinterpret_cast<uint32_T*>(mxGetData(prhs[0]));
// count bits set for each 32-bit integer number
for(mwSize i=0; i<N; i++) {
std::bitset<32> bs(idx[i]);
counts[i] = bs.count();
}
}
Compile the above function as mex -largeArrayDims bitset_count.cpp, then run it as usual:
idx = find(mlf);
s = bitset_count(uint32(idx))
I decided to compare all the solutions mentioned so far:
function [t,v] = testBitsetCount()
% random data (uint32 vector)
x = randi(intmax('uint32'), [1e5,1], 'uint32');
% build lookup table (done once)
LUT = sum(dec2bin(0:255,8)-'0', 2);
% functions to compare
f = {
@() bit_twiddling(x) % bit twiddling method
@() lookup_table(x,LUT); % lookup table method
@() bitset_count(x); % MEX-function (std::bitset::count)
@() dec_to_bin(x); % dec2bin
@() java_bitcount(x); % Java Integer.bitCount
};
% compare timings and check results are valid
t = cellfun(@timeit, f, 'UniformOutput',true);
v = cellfun(@feval, f, 'UniformOutput',false);
assert(isequal(v{:}));
end
function s = lookup_table(x,LUT)
s = sum(reshape(LUT(double(typecast(x,'uint8'))+1),4,[]))';
end
function s = dec_to_bin(x)
s = sum(dec2bin(x,32)-'0', 2);
end
function s = java_bitcount(x)
s = arrayfun(@java.lang.Integer.bitCount, x);
end
function s = bit_twiddling(x)
p1 = uint32(1431655765);
p2 = uint32(858993459);
p3 = uint32(252645135);
p4 = uint32(16711935);
p5 = uint32(65535);
s = x;
s = bitand(bitshift(s, -1), p1) + bitand(s, p1);
s = bitand(bitshift(s, -2), p2) + bitand(s, p2);
s = bitand(bitshift(s, -4), p3) + bitand(s, p3);
s = bitand(bitshift(s, -8), p4) + bitand(s, p4);
s = bitand(bitshift(s,-16), p5) + bitand(s, p5);
end
The times elapsed in seconds:
t =
0.0009 % bit twiddling method
0.0087 % lookup table method
0.0134 % C++ std::bitset::count
0.1946 % MATLAB dec2bin
0.2343 % Java Integer.bitCount
This gives you the rowsums of the binary numbers from the sparse structure.
>> mlf=sparse([],[],[],2^31+1,1);mlf(1)=10;mlf(10)=111;mlf(77)=1010;
>> transpose(dec2bin(find(mlf)))
ans =
001
000
000
011
001
010
101
>> sum(ismember(transpose(dec2bin(find(mlf))),'1'),2)
ans =
1
0
0
2
1
1
2
I hope someone is able to find a faster row summation!
Mex it!
Save this code as countTransBits.cpp:
#include "mex.h"
void mexFunction( int nout, mxArray* pout[], int nin, mxArray* pin[] ) {
mxAssert( nin == 1 && mxIsSparse(pin[0]) && mxGetN( pin[0] ) == 1,
"expecting single sparse column vector input" );
mxAssert( nout == 1, "expecting single output" );
// set output, assuming 32 bits, set to 64 if needed
pout[0] = mxCreateNumericMatrix( 32, 1, mxUINT32_CLASS, mxREAL );
unsigned int* counter = (unsigned int*)mxGetData( pout[0] );
for ( int i = 0; i < 32; i++ ) {
counter[i] = 0;
}
// start working
mwIndex *pIr = mxGetIr( pin[0] );
mwIndex* pJc = mxGetJc( pin[0] );
double* pr = mxGetPr( pin[0] );
for ( mwSize i = pJc[0]; i < pJc[1]; i++ ) {
if ( pr[i] != 0 ) {// make sure entry is non-zero
unsigned int entry = pIr[i] + 1; // cast to unsigned int and add 1 for 1-based indexing in Matlab
int bit = 0;
while ( entry != 0 && bit < 32 ) {
counter[bit] += ( entry & 0x1 ); // count the lsb
bit++;
entry >>= 1; // shift right
}
}
}
}
Compile it in Matlab
>> mex -largeArrayDims -O countTransBits.cpp
Run the code
>> countTransBits( mlf )
Note that the output holds the counts in 32 bins, ordered from LSB to MSB.
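The counting loop can also be tried outside of MATLAB. The following is only a sketch under the assumption that the 0-based row indices of the nonzero entries are already available in a vector; the function name and the container types are illustrative and not part of the mex API.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical helper: for each bit position 0..31 (LSB first), count how
// many of the 1-based indices have that bit set.
std::array<std::uint32_t, 32> count_index_bits(
        const std::vector<std::uint32_t>& zero_based_rows) {
    std::array<std::uint32_t, 32> counter{};  // value-initialised to zeros
    for (std::uint32_t row : zero_based_rows) {
        std::uint32_t entry = row + 1;  // MATLAB indices are 1-based
        for (int bit = 0; entry != 0 && bit < 32; ++bit) {
            counter[bit] += entry & 1u;  // count the current LSB
            entry >>= 1;                 // move on to the next bit
        }
    }
    return counter;
}
```

For the 0-based rows 0, 9 and 76 (MATLAB indices 1, 10 and 77) the first seven bins are 2, 1, 1, 2, 0, 0, 1, which is the dec2bin row sum shown earlier read from bottom to top.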
The bitcount FEX contribution offers a solution based on the lookup table approach, but is better optimized. It runs more than twice as fast as the bit twiddling method (i.e. the fastest pure-MATLAB method reported by Amro) on a vector of 1 million uint32 values, using R2015a on my old laptop.

iPhone FFT with Accelerate framework vDSP

I'm having difficulty implementing an FFT using vDSP. I understand the theory but am looking for a specific code example please.
I have data from a wav file as below:
Question 1. How do I put the audio data into the FFT?
Question 2. How do I get the output data out of the FFT?
Question 3. The ultimate goal is to check for low frequency sounds. How would I do this?
-(OSStatus)open:(CFURLRef)inputURL{
OSStatus result = -1;
result = AudioFileOpenURL (inputURL, kAudioFileReadPermission, 0, &mAudioFile);
if (result == noErr) {
//get format info
UInt32 size = sizeof(mASBD);
result = AudioFileGetProperty(mAudioFile, kAudioFilePropertyDataFormat, &size, &mASBD);
UInt32 dataSize = sizeof packetCount;
result = AudioFileGetProperty(mAudioFile, kAudioFilePropertyAudioDataPacketCount, &dataSize, &packetCount);
NSLog([NSString stringWithFormat:@"File Opened, packet Count: %d", packetCount]);
UInt32 packetsRead = packetCount;
UInt32 numBytesRead = -1;
if (packetCount > 0) {
//allocate buffer
audioData = (SInt16*)malloc( 2 *packetCount);
//read the packets
result = AudioFileReadPackets (mAudioFile, false, &numBytesRead, NULL, 0, &packetsRead, audioData);
NSLog([NSString stringWithFormat:@"Read %d bytes, %d packets", numBytesRead, packetsRead]);
}
}
return result;
}
FFT code below:
log2n = N;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
/* Allocate memory for the input operands and check its availability,
* use the vector version to get 16-byte alignment. */
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
if (originalReal == NULL || A.realp == NULL || A.imagp == NULL) {
printf("\nmalloc failed to allocate memory for the real FFT "
"section of the sample.\n");
exit(0);
}
/* Generate an input signal in the real domain. */
for (i = 0; i < n; i++)
originalReal[i] = (float) (i + 1);
/* Look at the real signal as an interleaved complex vector by
* casting it. Then call the transformation function vDSP_ctoz to
* get a split complex vector, which for a real signal, divides into
* an even-odd configuration. */
vDSP_ctoz((COMPLEX *) originalReal, 2, &A, 1, nOver2);
/* Set up the required memory for the FFT routines and check its
* availability. */
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
if (setupReal == NULL) {
printf("\nFFT_Setup failed to allocate enough memory for "
"the real FFT.\n");
exit(0);
}
/* Carry out a Forward and Inverse FFT transform. */
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);
/* Verify correctness of the results, but first scale it by 2n. */
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
/* The output signal is now in a split real form. Use the function
* vDSP_ztoc to get a split real vector. */
vDSP_ztoc(&A, 1, (COMPLEX *) obtainedReal, 2, nOver2);
/* Check for accuracy by looking at the inverse transform results. */
Compare(originalReal, obtainedReal, n);
Thanks
1. You put your audio sample data into the real part of the input, and zero the imaginary part.
2. If you are just interested in the magnitude of each bin in the frequency domain then you calculate sqrt(re*re + im*im) for each output bin. If you're only interested in relative magnitude then you can drop the sqrt and just calculate the squared magnitude, (re*re + im*im).
3. You would look at the magnitudes of the bin or bins (see (2)) that correspond to your frequency or frequencies of interest. If your sample rate is Fs, and your FFT size is N, then the corresponding frequency for output bin i is given by f = i * Fs / N. Conversely, if you are interested in a specific frequency f then the bin of interest, i, is given by i = N * f / Fs.
Additional note: you will need to apply a suitable window function (e.g. Hann, aka Hanning) to your FFT input data prior to calculating the FFT itself.
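The bin arithmetic, the squared magnitude and the window can be sketched in plain C++ as below. This is a minimal illustration, not vDSP code: all function names are made up here, the Hann window uses the symmetric form, and the re/im vectors are assumed to hold ordinary (unpacked) FFT output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bin index closest to frequency f (Hz): i = N * f / Fs, rounded.
std::size_t bin_for_frequency(double f, double fs, std::size_t n) {
    return static_cast<std::size_t>(std::llround(n * f / fs));
}

// Centre frequency (Hz) of bin i: f = i * Fs / N.
double frequency_for_bin(std::size_t i, double fs, std::size_t n) {
    return i * fs / n;
}

// Symmetric Hann window, w[k] = 0.5 * (1 - cos(2*pi*k / (n - 1))).
std::vector<float> hann_window(std::size_t n) {
    assert(n > 1);
    const double pi = 3.14159265358979323846;
    std::vector<float> w(n);
    for (std::size_t k = 0; k < n; ++k)
        w[k] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * k / (n - 1))));
    return w;
}

// Sum of squared magnitudes over bins 1..last_bin (the DC bin 0 is skipped).
float low_band_energy(const std::vector<float>& re,
                      const std::vector<float>& im,
                      std::size_t last_bin) {
    assert(re.size() == im.size() && last_bin < re.size());
    float sum = 0.0f;
    for (std::size_t i = 1; i <= last_bin; ++i)
        sum += re[i] * re[i] + im[i] * im[i];
    return sum;
}
```

With Fs = 44100 and N = 1024, a 440 Hz tone lands nearest to bin 10, whose centre frequency is about 430.66 Hz. To check for low frequency sounds, one option is to compare low_band_energy up to the bin of a chosen cutoff against a threshold that suits the recording.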
Check Apple's documentation, and pay close attention to how vDSP packs the data.
Here is my example:
// main.cpp
// FFTTest
//
// Created by Harry-Chris Stamatopoulos on 11/23/12.
//
/*
This is an example of a hilbert transformer using
Apple's VDSP fft/ifft & other VDSP calls.
Output signal has a PI/2 phase shift.
COMPLEX_SPLIT vector "B" was used to cross-check
real and imaginary parts coherence with the original vector "A"
that is obtained straight from the fft.
Tested and working.
Cheers!
*/
#include <cmath>   // cos, cosf, sinf, sqrtf, log2f
#include <cstdio>  // printf
#include <cstdlib> // malloc, free
#include <Accelerate/Accelerate.h>
#define PI 3.14159265
#define DEBUG_PRINT 1
int main(int argc, const char * argv[])
{
float fs = 44100; //sample rate
float f0 = 440; //sine frequency
uint32_t i = 0;
uint32_t L = 1024;
/* vector allocations*/
float *input = new float [L];
float *output = new float[L];
float *mag = new float[L/2];
float *phase = new float[L/2];
for (i = 0 ; i < L; i++)
{
input[i] = cos(2*PI*f0*i/fs);
}
uint32_t log2n = log2f((float)L);
uint32_t n = 1 << log2n;
//printf("FFT LENGTH = %lu\n", n);
FFTSetup fftSetup;
COMPLEX_SPLIT A;
COMPLEX_SPLIT B;
A.realp = (float*) malloc(sizeof(float) * L/2);
A.imagp = (float*) malloc(sizeof(float) * L/2);
B.realp = (float*) malloc(sizeof(float) * L/2);
B.imagp = (float*) malloc(sizeof(float) * L/2);
fftSetup = vDSP_create_fftsetup(log2n, FFT_RADIX2);
/* Carry out a Forward and Inverse FFT transform. */
vDSP_ctoz((COMPLEX *) input, 2, &A, 1, L/2);
vDSP_fft_zrip(fftSetup, &A, 1, log2n, FFT_FORWARD);
mag[0] = sqrtf(A.realp[0]*A.realp[0]);
//get phase
vDSP_zvphas (&A, 1, phase, 1, L/2);
phase[0] = 0;
//get magnitude;
for(i = 1; i < L/2; i++){
mag[i] = sqrtf(A.realp[i]*A.realp[i] + A.imagp[i] * A.imagp[i]);
}
//after done with possible phase and mag processing re-pack the vectors in VDSP format
B.realp[0] = mag[0];
B.imagp[0] = mag[L/2 - 1];
//unwrap, process & re-wrap phase
for(i = 1; i < L/2; i++){
phase[i] -= 2*PI*i * fs/L;
phase[i] -= PI / 2 ;
phase[i] += 2*PI*i * fs/L;
}
//construct real & imaginary part of the output packed vector (input to ifft)
for(i = 1; i < L/2; i++){
B.realp[i] = mag[i] * cosf(phase[i]);
B.imagp[i] = mag[i] * sinf(phase[i]);
}
#if DEBUG_PRINT
for (i = 0 ; i < L/2; i++)
{
printf("A REAL = %f \t A IMAG = %f \n", A.realp[i], A.imagp[i]);
printf("B REAL = %f \t B IMAG = %f \n", B.realp[i], B.imagp[i]);
}
#endif
//ifft
vDSP_fft_zrip(fftSetup, &B, 1, log2n, FFT_INVERSE);
//scale factor
float scale = (float) 1.0 / (2*L);
//scale values
vDSP_vsmul(B.realp, 1, &scale, B.realp, 1, L/2);
vDSP_vsmul(B.imagp, 1, &scale, B.imagp, 1, L/2);
//unpack B to real interleaved output
vDSP_ztoc(&B, 1, (COMPLEX *) output, 2, L/2);
// print output signal values to console
printf("Shifted signal x = \n");
for (i = 0 ; i < L/2; i++)
printf("%f\n", output[i]);
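//release resources
delete[] input;   // allocated with new[], so they must not be passed to free()
delete[] output;
delete[] mag;
delete[] phase;
free(A.realp);    // allocated with malloc
free(A.imagp);
free(B.imagp);
free(B.realp);
vDSP_destroy_fftsetup(fftSetup);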
}
One thing you need to be careful about is the DC component of the calculated FFT. I compared my results with the FFT from the fftw library, and the imaginary part of the transform calculated with vDSP always had a different value at index 0 (i.e. frequency 0, the DC bin).
I also had to divide both the real and imaginary parts by a factor of 2 to get matching values. I guess this is due to the algorithm used in the function. Both of these differences occurred in the forward FFT but not in the IFFT.
I used vDSP_fft_zrip.
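Both observations are consistent with the packed output format Apple documents for the real-to-complex routines: as far as I know, the forward vDSP_fft_zrip stores the (purely real) Nyquist value in imagp[0] next to the DC value in realp[0], and returns values that are twice the mathematical DFT. The sketch below unpacks such a result into ordinary complex bins. It works on plain vectors rather than on a DSPSplitComplex, and the function name is made up for illustration.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Converts packed real-FFT output (n/2 values in realp and in imagp) into
// n/2 + 1 ordinary complex bins, undoing the assumed factor of 2.
std::vector<std::complex<float>> unpack_zrip(const std::vector<float>& realp,
                                             const std::vector<float>& imagp) {
    const std::size_t half = realp.size();
    if (half == 0 || imagp.size() != half) return {};  // malformed input
    std::vector<std::complex<float>> bins(half + 1);
    bins[0]    = std::complex<float>(0.5f * realp[0], 0.0f);  // DC is purely real
    bins[half] = std::complex<float>(0.5f * imagp[0], 0.0f);  // Nyquist sat in imagp[0]
    for (std::size_t i = 1; i < half; ++i)
        bins[i] = std::complex<float>(0.5f * realp[i], 0.5f * imagp[i]);
    return bins;
}
```

After unpacking, bins[0] has a zero imaginary part, which is what a library that returns the unpacked half spectrum reports for the DC bin.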