Efficient, but still precise, multiplication of exponentials - Scala

I need to efficiently compute an array of values f(i, a) = exp(-0.5 * (i-1) * i * a) for all i in (0..n), with n up to 20,000 and a a positive value very close to 0.
To avoid calling exp n times, I used an incremental approach like this (in Scala):
def fInc(n: Int, a: Double): Unit = {
  val expA = Math.exp(-a)
  var u = 1.0
  var v = 1.0
  var i = 1
  while (i < n) {
    u *= expA // u = exp(-i * a)
    v *= u    // v = exp(-0.5 * i * (i + 1) * a); in practice I store that value in an array, for all i
    i += 1
  }
}
// reference, calling exp directly for a given i
def fRef(i: Int, a: Double): Double = Math.exp(-0.5 * (i - 1) * i * a)
This is mathematically correct, but the difference from the direct exp computation is too big. Here are some results:
n      a      v (incremental)         Math.exp (reference)    diff
1000   1E-6   0.6068340008761639      0.6068340008714599      4.704014955336788E-12
1000   1E-9   0.9995006247427483      0.9995006247293567      1.3391510123028638E-11
1000   1E-12  0.9999995005111699      0.9999995005001248      1.1045164782785832E-11
1000   1E-15  0.9999999995008992      0.9999999995005         3.992361996552063E-13
10000  1E-6   1.938417748402E-22      1.938417746809E-22      1.5929953847004499E-31
10000  1E-9   0.9512341819777599      0.9512341806597269      1.3180330160622589E-9
10000  1E-12  0.9999500073554776      0.9999500062497292      1.1057483817467073E-9
10000  1E-15  0.9999999500449599      0.9999999500050013      3.995859199079632E-11
As you can see, for some values the difference goes up to 1e-9, while I can accept maybe 1e-13.
So the question:
Is there a way to get a better approximation with an algorithm that is still much more efficient than calling exp for every i?
Notes:
I use Apache Commons FastMath exp, which gives almost the same results as the standard Java exp.
The actual algorithm is more complex, with other such incremental exponentials (not quadratic, though).

Here is the best solution I found:
The error grows (roughly) linearly with each multiplication by the "unit" exp(-a), so the accumulated error of v behaves like err(i) ~= i*i*err0 for some err0; the point is that the error of v is quadratic with respect to i.
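As a rough back-of-the-envelope check (assuming each floating-point multiplication contributes a relative error of about eps ~ 1.1e-16): after i multiplications u carries a relative error of order i*eps, and since v is the running product of the successive values of u, its relative error is of order \sum_{j=1}^{i} j*eps ~ 0.5*i^2*eps. For i = 10000 that is about 5e-9, which is consistent with the ~1e-9 differences in the table above.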
The best I found is:
reset v to the correct value at some chosen frequency (every k iterations)
refresh u every k iterations as well, using an incremental exp computation
val k = 100
val expA = Math.exp(-a)
val expAk = Math.exp(-k * a)
var u = 1.0
var uk = 1.0
var v = 1.0
var i = 1
while (i < n) {
  if (i % k == 0) {
    // periodic refresh: uk tracks exp(-i * a) in steps of k, and v is recomputed exactly
    uk *= expAk
    u = uk
    v = Math.exp(-0.5 * (i + 1) * i * a)
  } else {
    u *= expA
    v *= u
  }
  i += 1
}
This method requires n / k + 2 calls to exp, which is not quite satisfying, but it is the best I have for now. It can probably be improved by choosing the best frequency parameter k.
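For example, with n = 20,000 and k = 100 that is about 20,000 / 100 + 2 = 202 calls to exp instead of 20,000.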


Converting the Euclidean distance to Manhattan distance

The calculation below is given in the Spark MLlib library to find the Euclidean distance:
private[mllib] def fastSquaredDistance(
    v1: Vector,
    norm1: Double,
    v2: Vector,
    norm2: Double,
    precision: Double = 1e-6): Double = {
  val n = v1.size
  require(v2.size == n)
  require(norm1 >= 0.0 && norm2 >= 0.0)
  val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
  val normDiff = norm1 - norm2
  var sqDist = 0.0
  val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
  if (precisionBound1 < precision) {
    sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
  } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
    val dotValue = dot(v1, v2)
    sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
    val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
      (sqDist + EPSILON)
    if (precisionBound2 > precision) {
      sqDist = Vectors.sqdist(v1, v2)
    }
  } else {
    sqDist = Vectors.sqdist(v1, v2)
  }
  sqDist
}
I am very new to machine learning. My question is how to find the Manhattan distance by modifying the above code.
Without any additional context, I'd suggest just implementing the L1 distance in the obvious naive fashion:
d_manhattan(u, v) = sum( abs(u[i] - v[i]), i ) // Pseudocode
Now, I haven't looked at your code much, but it looks like much of it is (1) concerned about precision (which is less of a problem for L1, compared to L2, since there is no square) and (2) uses the L2 norms as inputs (which, to my knowledge, are not useful in computing the L1 anyway). So modifying the current method may not be so useful.
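For what it's worth, here is a minimal Scala sketch of that naive approach, using plain arrays rather than the MLlib Vector type just to keep it self-contained (the name manhattanDistance is only illustrative):
// Naive Manhattan (L1) distance between two equal-length arrays.
def manhattanDistance(u: Array[Double], v: Array[Double]): Double = {
  require(u.length == v.length, "vectors must have the same dimension")
  var sum = 0.0
  var i = 0
  while (i < u.length) {
    sum += math.abs(u(i) - v(i)) // L1 term: |u(i) - v(i)|
    i += 1
  }
  sum
}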
Also, I hear a lot that premature optimization is the root of all evil, so try the simplest thing first, and if it is unacceptable, then try optimizing :)

How to perform Modulo greatest common divisor?

Suppose that gcd(e, m) = g. Find an integer d such that (e * d) = g mod m,
where m and e are greater than or equal to 1.
The problem seems to be solvable algebraically, but when I try it the result is not always an integer. Sometimes the solution for d is an integer and sometimes it isn't. How can I approach this problem?
d can be computed with the extended Euclidean algorithm, see e.g. here:
https://en.wikipedia.org/wiki/Extended_Euclidean_algorithm
The a,b on that page are your e,m, and your d will be the x.
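For illustration, here is a minimal Scala sketch of that algorithm (the helper name extendedGcd is mine, not from the linked page). It returns (g, x, y) with a*x + b*y = g, so with a = e and b = m your d is the x (reduce it mod m if you want it non-negative):
// Extended Euclidean algorithm: returns (g, x, y) such that a*x + b*y = g = gcd(a, b).
def extendedGcd(a: Long, b: Long): (Long, Long, Long) =
  if (b == 0) (a, 1L, 0L)
  else {
    val (g, x, y) = extendedGcd(b, a % b)
    (g, y, x - (a / b) * y) // back-substitute the Bezout coefficients
  }
// Example: e = 10, m = 6 gives g = 2 and 10 * (-1) + 6 * 2 = 2, so d = -1 (i.e. 5 mod 6).
val (g, d, _) = extendedGcd(10, 6)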
Perhaps you are assuming that both e and m are integers, but the problem allows them to be non-integers? There is only one case that gives an integer solution when both e and m are integers.
Why strictly integer output is not a reasonable outcome if e != m:
When you look at a fraction like 3/7 say, and refer to its denominator as the numerator's "divisor", this is a loose sense of the word from a classical math-y perspective. When you talk about the gcd (greatest common divisor), the "d" refers to an integer that divides the numerator (an integer) evenly, resulting in another integer: 4 is a divisor of 8, because 8/4 = 2 and 2 is an integer. A computer science or discrete mathematics perspective might frame a divisor as a number d that for a given number a gives 0 when we take a % d (a mod d for discrete math). Can you see that the absolute value of a divisor can't exceed the absolute value of the numerator? If it did, you would get pieces of pie, instead of whole pies - example:
4 % a = 0 for a in Z (Z being the set of integers) while |a| <= 4 (in math-y notation, that set is: {a ∈ Z : |a| <= 4}), but
4 % a != 0 for a in Z while |a| > 4 (math-y: {a ∈ Z : |a| > 4}
), because when we divide 4 by stuff bigger than it, like 5, we get fractions (i.e. |4/a| < 1 when |a| > 4). Don't worry too much about the absolute value stuff if it throws you off - it is there to account for working with negative numbers since they are integers as well.
So, even the "greatest" of divisors for any given integer will be smaller than the integer. Otherwise it's not a divisor (see above, or Wikipedia on divisors).
Look at gcd(e, m) = g:
By the definition of % (mod for math people), for any two numbers number1 and number2, number1 % number2 never makes number1 bigger: number1 % number2 <= number1.
So substitute: (e * d) = g % m --> (e * d) <= g
By the paragraphs above and definition of gcd being a divisor of both e and m: g <= e, m.
To make (e * d) <= g such that d, g are both integers, knowing that g <= e since g is a divisor of e, we have to make the left side smaller to match g. You can only make an integer smaller with multiplication if the other multiplicand is 0 or a fraction. The problem specifies that d is an integer, so we have one case that works - the d = 0 case - and infinitely many that give a contradiction - namely that e, m, and d cannot all be integers.
If e == m:
This is the d = 0 case:
If e == m, then gcd(e, m) = e = m - example: greatest common divisor of 3 and 3 is 3
Then (e * d) = g % m is (e * d) = m % m and m % m = 0 so (e * d) = 0 implying d = 0
How to code a function that will find d when either of e or m might be NON-integer:
A lot of divisor problems are done iteratively, like "find the gcd" or "find a prime number". That works in part because those problems deal strictly with integers, which are countable. With this problem, we need to allow e or m to be non-integer in order to have a solution for cases other than e = m. The non-integer candidates cannot be enumerated the way the integers can, however, so an iterative search is not practical here. With this problem, you really just want a formula, and possibly some cases. You might set it up like this:
If e == m
    return 0   # since (e * d) = m % m -> d = 0
Else
    return g / e
Lastly:
Another thing that might be useful depending on what you do with this problem is the fact that the right-hand-side is always either g or 0, because g <= m since g is a divisor of m (see all the stuff above). In the cases where g < m, g % m = g. In the case where g == m, g % m = 0.
The #asp answer with the link to the Wikipedia page on the Euclidean Algorithm is good.
The #aidenhjj comment about trying the math-specific version of StackOverflow is good.
In case this is for a math class and you aren't used to coding: <=, >=, ==, and != are computer speak for ≤, ≥, "are equal", and "not equal" respectively.
Good luck.

How to implement the Softmax derivative independently from any loss function?

For a neural networks library I implemented some activation functions and loss functions and their derivatives. They can be combined arbitrarily and the derivative at the output layers just becomes the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the Softmax activation function independently from any loss function. Due to the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
Here is my Softmax implementation where the derivative fails the gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?
import numpy as np

class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)

activation = Softmax()
cost = SquaredError()
outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)
Mathematically, the derivative of the Softmax output σ(j) with respect to the logit z_i (for example, w_i * x) is
∂σ(j)/∂z_i = σ(j) * (δ_ij - σ(i))
where δ_ij is the Kronecker delta.
If you implement it iteratively:
def softmax_grad(s):
    # input s is softmax value of the original input x. Its shape is (1, n)
    # i.e. s = np.array([0.3, 0.7]), x = np.array([0, 1])
    # make the matrix whose size is n^2.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m
Test:
In [95]: x
Out[95]: array([1, 2])
In [96]: softmax(x)
Out[96]: array([ 0.26894142, 0.73105858])
In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
[-0.19661193, 0.19661193]])
If you implement it in a vectorized version:
soft_max = softmax(x)

# reshape softmax to 2d so np.dot gives matrix multiplication
def softmax_grad(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)
#array([[ 0.19661193, -0.19661193],
#       [-0.19661193,  0.19661193]])
It should be like this (x is the input to the softmax layer, y is its output, and dy is the delta coming from the loss above it):
dx = y * dy
s = dx.sum(axis=dx.ndim - 1, keepdims=True)
dx -= y * s
return dx
But the way you compute the error should be:
yact = activation.compute(x)
ycost = cost.compute(yact)
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
Explanation: Because the delta function is a part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1], and then multiply it from the left by a vector dy, after a bit of algebra you'll find that you get something that corresponds to my Python code.
[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
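To spell that product out (a short worked step, using the Jacobian from [1]): with y = softmax(x), the Jacobian is J_ij = y_i * (delta_ij - y_j), so (J * dy)_i = y_i * dy_i - y_i * sum_j(y_j * dy_j) = y_i * (dy_i - sum_j(y_j * dy_j)), which is exactly what the three lines dx = y * dy; s = dx.sum(...); dx -= y * s compute.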
The other answers are great; here I share a simple implementation of the forward/backward pass, regardless of the loss function.
Below is a brief derivation of the backward pass for softmax (originally given as an image). The second equation is loss-function dependent and is not part of our implementation.
The backward pass is verified by manual gradient checking.
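For reference, the key identity the backward pass relies on is: with p = softmax(x), d p_k / d x_j = p_k * (delta_kj - p_j), so dL/dx_j = sum_k( bp_err_k * p_k * (delta_kj - p_j) ), which is what backward below accumulates column by column.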
import numpy as np

class Softmax:
    def forward(self, x):
        mx = np.max(x, axis=1, keepdims=True)
        x = x - mx  # log-sum-exp trick
        e = np.exp(x)
        probs = e / np.sum(np.exp(x), axis=1, keepdims=True)
        return probs

    def backward(self, x, probs, bp_err):
        dim = x.shape[1]
        output = np.empty(x.shape)
        for j in range(dim):
            d_prob_over_xj = - (probs * probs[:, [j]])  # i.e. prob_k * prob_j, no matter k==j or not
            d_prob_over_xj[:, j] += probs[:, j]         # i.e. when k==j, +prob_j
            output[:, j] = np.sum(bp_err * d_prob_over_xj, axis=1)
        return output

def compute_manual_grads(x, pred_fn):
    eps = 1e-3
    batch_size, dim = x.shape
    grads = np.empty(x.shape)
    for i in range(batch_size):
        for j in range(dim):
            x[i, j] += eps
            y1 = pred_fn(x)
            x[i, j] -= 2 * eps
            y2 = pred_fn(x)
            grads[i, j] = (y1 - y2) / (2 * eps)
            x[i, j] += eps
    return grads

def loss_fn(probs, ys, loss_type):
    batch_size = probs.shape[0]
    # dummy mse
    if loss_type == "mse":
        loss = np.sum((np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1)**2) / batch_size
        values = 2 * (np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) / batch_size
    # cross ent
    if loss_type == "xent":
        loss = - np.sum( np.take_along_axis(np.log(probs), ys.reshape(-1, 1), axis=1) ) / batch_size
        values = -1 / np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) / batch_size
    err = np.zeros(probs.shape)
    np.put_along_axis(err, ys.reshape(-1, 1), values, axis=1)
    return loss, err

if __name__ == "__main__":
    batch_size = 10
    dim = 5
    x = np.random.rand(batch_size, dim)
    ys = np.random.randint(0, dim, batch_size)
    for loss_type in ["mse", "xent"]:
        S = Softmax()
        probs = S.forward(x)
        loss, bp_err = loss_fn(probs, ys, loss_type)
        grads = S.backward(x, probs, bp_err)

        def pred_fn(x, ys):
            pred = S.forward(x)
            loss, err = loss_fn(pred, ys, loss_type)
            return loss

        manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))
        # compare both grads
        print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads)**2) / batch_size}")
Just in case you are processing in batches, here is an implementation in NumPy (tested against TensorFlow). However, I would suggest avoiding the associated tensor operations by mixing the Jacobian with the cross-entropy, which leads to a very simple and efficient expression.
import numpy as np
import tensorflow as tf

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps, axis=1, keepdims=True)

def softmax_jacob(s):
    return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
           - np.einsum('ij,ik->ijk', s, s)

def np_softmax_test(z):
    return softmax_jacob(softmax(z))

def tf_softmax_test(z):
    z = tf.constant(z, dtype=tf.float32)
    with tf.GradientTape() as g:
        g.watch(z)
        a = tf.nn.softmax(z)
    jacob = g.batch_jacobian(a, z)
    return jacob.numpy()

z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
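As a side note on mixing the Jacobian with the cross-entropy (a standard identity, stated here for completeness): for L = -sum_k( y_k * log(p_k) ) with one-hot targets y and p = softmax(z), combining the loss gradient with the softmax Jacobian collapses to dL/dz_i = p_i - y_i, so the full Jacobian never needs to be materialized.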
Here is a C++ vectorized version, using intrinsics (22 times (!) faster than the non-SSE version):
// How many floats fit into __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 16-byte aligned, and
// operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8
//check to quickly see if your rows are divisible by m256.
//you can 'undefine' to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
#define assert_is_m256_multiple(x) assert( (x%F_MULTIPLE_OF_M256) == 0)
#else
#define assert_is_m256_multiple(x)
#endif
// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;
    for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
        sum += sumStart[i];
    }
    return sum;
}

f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size,
                             const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
    assert_is_m256_multiple(size);
    //allocate vector, where to store output:
    f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.
    const __m256* end = (const __m256*)(softmaxResult + size);

    for(size_t i=0; i<size; ++i){// <--for every row
        //go through this i'th row:
        __m256 sum = _mm256_set1_ps(0.0f);
        const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] );
        const __m256 *s = (const __m256*)softmaxResult;
        const __m256 *gAbove = (__m256*)gradFromAbove;

        for (s; s<end; ){
            __m256 mul = _mm256_mul_ps(*s, neg_sft_i); // sftmaxResult_j * (-sftmaxResult_i)
            mul = _mm256_mul_ps( mul, *gAbove );
            sum = _mm256_add_ps( sum, mul );//adding to the total sum of this row.
            ++s;
            ++gAbove;
        }
        grad_v[i] = slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
    }//end for every row

    //reset back to start and subtract a vector, to account for Kronecker delta:
    __m256 *g = (__m256*)grad_v._contents;
    __m256 *s = (__m256*)softmaxResult;
    __m256 *gAbove = (__m256*)gradFromAbove;
    for(s; s<end; ){
        __m256 mul = _mm256_mul_ps(*s, *gAbove);
        *g = _mm256_add_ps( *g, mul );
        ++s;
        ++g;
    }
    return grad_v;
}
If for some reason somebody wants a simple (non-SSE) version, here it is:
inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,
                                                 const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer
                                                 float *gradOutput,
                                                 size_t count ){
    // every pre-softmax element in a layer contributed to the softmax of every other element
    // (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
    for(size_t i=0; i<count; ++i){
        //go through this i'th row:
        float sum = 0.0f;
        const float neg_sft_i = -softmaxResult[i];
        for(size_t j=0; j<count; ++j){
            float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
            sum += mul;//adding to the total sum of this row.
        }
        //NOTICE: equals, overwriting any old values:
        gradOutput[i] = sum;
    }//end for every row

    for(size_t i=0; i<count; ++i){
        gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
    }
}

How to solve a linear system of matrices in scala breeze?

How to solve a linear system of matrices in scala breeze? ie, I have Ax = b, where A is a matrix (usually positive definite), and x and b are vectors.
I can see that there is a Cholesky decomposition available, but I couldn't seem to find a solver? (If it were MATLAB I could do x = A \ b. If it were SciPy I could do x = scipy.linalg.solve(A, b).)
Apparently, it is quite simple in fact, and built into scala-breeze as an operator:
x = A \ b
It doesn't use Cholesky; it uses LU decomposition, which I think is about half as fast, but they are both O(n^3), so comparable.
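A minimal usage sketch (the matrix and vector values here are just an example):
import breeze.linalg.{DenseMatrix, DenseVector}
val A = DenseMatrix((4.0, 1.0), (1.0, 3.0)) // symmetric positive definite
val b = DenseVector(1.0, 2.0)
val x = A \ b                               // solves A * x = b
println(A * x)                              // should print something very close to b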
Well, I wrote my own solver in the end. I'm not sure if this is the optimal way to do it, but it doesn't seem unreasonable? :
// Copyright Hugh Perkins 2012
// You can use this under the terms of the Apache Public License 2.0
// http://www.apache.org/licenses/LICENSE-2.0

package root

import breeze.linalg._

object Solver {
  // solve Ax = b, for x, where A = choleskyMatrix * choleskyMatrix.t
  // choleskyMatrix should be lower triangular
  def solve( choleskyMatrix: DenseMatrix[Double], b: DenseVector[Double] ) : DenseVector[Double] = {
    val C = choleskyMatrix
    val size = C.rows
    if( C.rows != C.cols ) {
      // throw exception or something
    }
    if( b.length != size ) {
      // throw exception or something
    }
    // first we solve C * y = b
    // (then we will solve C.t * x = y)
    val y = DenseVector.zeros[Double](size)
    // now we just work our way down from the top of the lower triangular matrix
    for( i <- 0 until size ) {
      var sum = 0.0
      for( j <- 0 until i ) {
        sum += C(i,j) * y(j)
      }
      y(i) = ( b(i) - sum ) / C(i,i)
    }
    // now calculate x
    val x = DenseVector.zeros[Double](size)
    val Ct = C.t
    // work up from bottom this time
    for( i <- size - 1 to 0 by -1 ) {
      var sum = 0.0
      for( j <- i + 1 until size ) {
        sum += Ct(i,j) * x(j)
      }
      x(i) = ( y(i) - sum ) / Ct(i,i)
    }
    x
  }
}

sum the first n prime reciprocals such that the sum exceeds k (Matlab)

I am trying to write a program in MATLAB such that the sum of the reciprocals of the first n prime numbers exceeds a given value k. To clarify, I am trying to make a function
SumPrime(k)
And it is supposed to return an integer n such that
\sum_{i=1}^{n} 1/p_i > k
I tried looking at "sum of primes and reciprocals of them and plot in matlab?", but this does not quite answer my question. Neither did the command
sumInversePrimes = sum(1./primes(n));
Here is my attempt. First I define a function for finding the n-th prime number.
function Y = NthPrime(n)
    if n == 1
        Y = 2;
        return
    end
    if n < 1 || round(n) ~= n
        return
    end
    j = 2;
    u = 0;
    while u < n
        T = primes(j);
        u = numel(T);
        j = 1 + j;
    end
    Y = T(numel(T));
After writing this (lengthy?) code for finding the n-th prime number, the rest is a cakewalk.
function Y = E(u)
    sum = 0
    n = 0
    while sum < u
        n = n + 1
        sum = sum + 1/( NthPrime(n) )
    end
    Y = n;
This returns the proper values and somewhat works. Alas, it is very slow, and I guess this is very bad code. I have only just started learning to code in MATLAB. Could someone please help me either write better code or optimize mine?
XOXOXOX
Nebby
Here's how to precompute the sums then find the first that exceeds a threshold:
>> p = primes(1000);
>> cs = cumsum(1./p);
>> find(cs > 1.8, 1)
ans = 25