How to implement the Softmax derivative independently from any loss function? - neural-network

For a neural networks library I implemented some activation functions and loss functions and their derivatives. They can be combined arbitrarily and the derivative at the output layers just becomes the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the Softmax activation function independently from any loss function. Due to the normalization i.e. the denominator in the equation, changing a single input activation changes all output activations and not just one.
Here is my Softmax implementation where the derivative fails the gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?
import numpy as np

class Softmax:
    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)

activation = Softmax()
cost = SquaredError()
outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)

Mathematically, the derivative of the softmax output σ_j with respect to the logit z_i (for example, z_i = w_i · x) is
$$\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \, (\delta_{ij} - \sigma_i)$$
where δ_ij is the Kronecker delta.
If you implement iteratively:
def softmax_grad(s):
    # Input s is the softmax value of the original input x, a 1-D array of length n,
    # e.g. s is roughly np.array([0.3, 0.7]) for x = np.array([0, 1]).
    # Build the Jacobian matrix, whose size is n x n.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m
In [95]: x
Out[95]: array([1, 2])
In [96]: softmax(x)
Out[96]: array([ 0.26894142, 0.73105858])
In [97]: softmax_grad(softmax(x))
array([[ 0.19661193, -0.19661193],
[-0.19661193, 0.19661193]])
If you implement in a vectorized version:
soft_max = softmax(x)

def softmax_grad(softmax):
    # Reshape the softmax vector to a 2-D column so that np.dot gives the outer product
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)
# array([[ 0.19661193, -0.19661193],
#        [-0.19661193,  0.19661193]])
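As a quick sanity check that is not part of the original answer, the analytic Jacobian can be compared against central finite differences. This is only a sketch; the helper names `softmax_jacobian` and `numeric_jacobian` are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(s):
    # J[i, j] = s_i * (delta_ij - s_j), built from the softmax output s.
    s = s.reshape(-1, 1)
    return np.diagflat(s) - s @ s.T

def numeric_jacobian(f, x, eps=1e-6):
    # Central differences: column j holds d f / d x_j.
    n = x.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([1.0, 2.0, 0.5])
analytic = softmax_jacobian(softmax(x))
numeric = numeric_jacobian(softmax, x)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the two matrices agree to within the finite-difference error, the Jacobian formula is at least consistent with the forward pass. Each column should also sum to roughly zero, because the softmax outputs always sum to one.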

It should be like this (x is the input to the softmax layer, y is its output, and dy is the delta coming from the loss above it):
def delta(self, x, dy):
    y = self.compute(x)
    dx = y * dy
    s = dx.sum(axis=dx.ndim - 1, keepdims=True)
    dx -= y * s
    return dx
But the way you compute the error should be:
yact = activation.compute(x)
ycost = cost.compute(yact)
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
Explanation: The delta function is part of the backpropagation algorithm, so its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by the vector dy, a bit of algebra shows that you get something that corresponds to my Python code.
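The algebra mentioned above can be checked numerically. The sketch below compares an explicit Jacobian-vector product with the shortcut `y * (dy - sum(y * dy))`; the function name `softmax_backward` and the random upstream gradient `dy` are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(y, dy):
    # Shortcut for the Jacobian-vector product without forming the Jacobian:
    # y * dy - y * sum(y * dy)
    dx = y * dy
    return dx - y * dx.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
dy = rng.normal(size=4)   # hypothetical upstream gradient from some loss
y = softmax(x)

jacobian = np.diag(y) - np.outer(y, y)   # explicit (symmetric) softmax Jacobian
explicit = jacobian @ dy
shortcut = softmax_backward(y, dy)
```

Because the softmax Jacobian is symmetric, multiplying from the left or from the right gives the same vector, so `explicit` and `shortcut` should agree up to rounding.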

The other answers are great; I am here to share a simple implementation of forward/backward that is independent of the loss function.
The image below gives a brief derivation of the backward pass for softmax. The 2nd equation depends on the loss function and is not part of our implementation.
The backward pass is verified by manual gradient checking.
import numpy as np

class Softmax:
    def forward(self, x):
        mx = np.max(x, axis=1, keepdims=True)
        x = x - mx  # shift by the row max for numerical stability (log-sum-exp trick)
        e = np.exp(x)
        probs = e / np.sum(e, axis=1, keepdims=True)
        return probs

    def backward(self, x, probs, bp_err):
        dim = x.shape[1]
        output = np.empty(x.shape)
        for j in range(dim):
            d_prob_over_xj = -(probs * probs[:, [j]])  # i.e. -prob_k * prob_j, no matter k==j or not
            d_prob_over_xj[:, j] += probs[:, j]  # i.e. when k==j, +prob_j
            output[:, j] = np.sum(bp_err * d_prob_over_xj, axis=1)
        return output

def compute_manual_grads(x, pred_fn):
    eps = 1e-3
    batch_size, dim = x.shape
    grads = np.empty(x.shape)
    for i in range(batch_size):
        for j in range(dim):
            x[i, j] += eps
            y1 = pred_fn(x)
            x[i, j] -= 2 * eps
            y2 = pred_fn(x)
            grads[i, j] = (y1 - y2) / (2 * eps)
            x[i, j] += eps  # restore the original value
    return grads

def loss_fn(probs, ys, loss_type):
    batch_size = probs.shape[0]
    # dummy mse
    if loss_type == "mse":
        loss = np.sum((np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) ** 2) / batch_size
        values = 2 * (np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) / batch_size
    # cross ent
    if loss_type == "xent":
        loss = -np.sum(np.take_along_axis(np.log(probs), ys.reshape(-1, 1), axis=1)) / batch_size
        values = -1 / np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) / batch_size
    err = np.zeros(probs.shape)
    np.put_along_axis(err, ys.reshape(-1, 1), values, axis=1)
    return loss, err

if __name__ == "__main__":
    batch_size = 10
    dim = 5
    x = np.random.rand(batch_size, dim)
    ys = np.random.randint(0, dim, batch_size)
    for loss_type in ["mse", "xent"]:
        S = Softmax()
        probs = S.forward(x)
        loss, bp_err = loss_fn(probs, ys, loss_type)
        grads = S.backward(x, probs, bp_err)

        def pred_fn(x, ys):
            pred = S.forward(x)
            loss, err = loss_fn(pred, ys, loss_type)
            return loss

        manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))
        # compare both grads
        print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads)**2) / batch_size}")

Just in case you are processing in batches, here is an implementation in NumPy (tested against TensorFlow). However, I would suggest avoiding the associated tensor operations by combining the Jacobian with the cross-entropy, which leads to a very simple and efficient expression.
import numpy as np
import tensorflow as tf

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps, axis=1, keepdims=True)

def softmax_jacob(s):
    # Batched Jacobian: result[i, j, k] = s[i, j] * (delta_jk - s[i, k])
    return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
           - np.einsum('ij,ik->ijk', s, s)

def np_softmax_test(z):
    return softmax_jacob(softmax(z))

def tf_softmax_test(z):
    z = tf.constant(z, dtype=tf.float32)
    with tf.GradientTape() as g:
        g.watch(z)  # constants are not watched automatically
        a = tf.nn.softmax(z)
    jacob = g.batch_jacobian(a, z)
    return jacob.numpy()

z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
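The "very simple and efficient expression" obtained by combining softmax with cross-entropy is usually written as (p - onehot) / batch_size. The sketch below checks that identity against finite differences; the names `cross_entropy` and `fused_grad` are illustrative assumptions, and the loss is assumed to be the mean negative log-likelihood over the batch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, labels):
    # Mean negative log-likelihood of the true classes.
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fused_grad(z, labels):
    # d loss / d z for softmax followed by cross-entropy: (p - onehot) / batch.
    p = softmax(z)
    p[np.arange(len(labels)), labels] -= 1.0
    return p / len(labels)

rng = np.random.default_rng(0)
z = rng.normal(size=(3, 5))
labels = np.array([0, 2, 4])

# Central finite differences as an independent reference.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numeric[i, j] = (cross_entropy(zp, labels) - cross_entropy(zm, labels)) / (2 * eps)

max_abs_diff = np.max(np.abs(fused_grad(z, labels) - numeric))
```

The fused form never builds the (batch, n, n) Jacobian tensor, which is the saving the answer alludes to. The trade-off is that it ties the softmax layer to one specific loss.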

Here is a C++ vectorized version, using intrinsics (22 times (!) faster than the non-SSE version):
// How many floats fit into one __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 32-byte aligned, and
// operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8

// Check to quickly see if your rows are divisible by m256.
// You can redefine it as a no-op to save performance, after everything was verified to be correct.
#define assert_is_m256_multiple(x) assert( ((x)%F_MULTIPLE_OF_M256) == 0)
// #define assert_is_m256_multiple(q)

// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;
    for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
        sum += sumStart[i];
    }
    return sum;
}

f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size,
                             const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
    //allocate vector, where to store output:
    f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.
    const __m256* end = (const __m256*)(softmaxResult + size);

    for(size_t i=0; i<size; ++i){// <--for every row
        //go through this i'th row:
        __m256 sum = _mm256_set1_ps(0.0f);
        const __m256 neg_sft_i = _mm256_set1_ps( -softmaxResult[i] );
        const __m256 *s = (const __m256*)softmaxResult;
        const __m256 *gAbove = (const __m256*)gradFromAbove;
        for (; s<end; ){
            __m256 mul = _mm256_mul_ps(*s, neg_sft_i); // sftmaxResult_j * (-sftmaxResult_i)
            mul = _mm256_mul_ps( mul, *gAbove );
            sum = _mm256_add_ps( sum, mul );//adding to the total sum of this row.
            ++s;
            ++gAbove;
        }
        grad_v[i] = slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
    }//end for every row

    //reset back to start and add a vector, to account for Kronecker delta:
    __m256 *g = (__m256*)grad_v._contents;
    const __m256 *s = (const __m256*)softmaxResult;
    const __m256 *gAbove = (const __m256*)gradFromAbove;
    for(; s<end; ){
        __m256 mul = _mm256_mul_ps(*s, *gAbove);
        *g = _mm256_add_ps( *g, mul );
        ++s;
        ++g;
        ++gAbove;
    }
    return grad_v;
}
If for some reason somebody wants a simple (non-SSE) version, here it is:
inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,
                                                 const float *gradFromAbove, //<--gradient vector, flowing into us from the above layer
                                                 float *gradOutput,
                                                 size_t count ){
    // every pre-softmax element in a layer contributed to the softmax of every other element
    // (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
    for(size_t i=0; i<count; ++i){
        //go through this i'th row:
        float sum = 0.0f;
        const float neg_sft_i = -softmaxResult[i];
        for(size_t j=0; j<count; ++j){
            float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
            sum += mul;//adding to the total sum of this row.
        }
        //NOTICE: equals, overwriting any old values:
        gradOutput[i] = sum;
    }//end for every row

    for(size_t i=0; i<count; ++i){
        gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
    }
}


Calculating turn on a 2d Path [duplicate]

I understand that:
atan2(vector.y, vector.x) = the angle between the vector and the X axis.
But I wanted to know how to get the angle between two vectors using atan2. So I came across this solution:
atan2(vector1.y - vector2.y, vector1.x - vector2.x)
My question is very simple:
Will the two following formulas produce the same number?
atan2(vector1.y - vector2.y, vector1.x - vector2.x)
atan2(vector2.y - vector1.y, vector2.x - vector1.x)
If not: How do I know what vector comes first in the subtractions?
atan2(vector1.y - vector2.y, vector1.x - vector2.x)
is the angle between the difference vector (connecting vector2 and vector1) and the x-axis,
which is probably not what you meant.
The (directed) angle from vector1 to vector2 can be computed as
angle = atan2(vector2.y, vector2.x) - atan2(vector1.y, vector1.x);
and you may want to normalize it to the range [0, 2 π):
if (angle < 0) { angle += 2 * M_PI; }
or to the range (-π, π]:
if (angle > M_PI) { angle -= 2 * M_PI; }
else if (angle <= -M_PI) { angle += 2 * M_PI; }
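The directed angle and its normalization can be sketched in Python as follows. The function names are illustrative assumptions, and the second helper uses the 2-D cross product and dot product as an alternative route to the same value.

```python
import math

def directed_angle(v1, v2):
    # Angle from v1 to v2, wrapped into (-pi, pi].
    angle = math.atan2(v2[1], v2[0]) - math.atan2(v1[1], v1[0])
    if angle > math.pi:
        angle -= 2 * math.pi
    elif angle <= -math.pi:
        angle += 2 * math.pi
    return angle

def directed_angle_cross_dot(v1, v2):
    # Same quantity from the 2-D cross product (sine) and dot product (cosine).
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.atan2(cross, dot)

a = (5.45, 1.12)
b = (-3.86, 4.32)
forward = directed_angle(a, b)    # angle from a to b
backward = directed_angle(b, a)   # same magnitude, opposite sign
```

Swapping the two vectors flips the sign of the result, which is why the order of the arguments matters for a directed angle.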
A robust way to do it is by finding the sine of the angle using the cross product, and the cosine of the angle using the dot product and combining the two with the Atan2() function.
In C# this is:
public struct Vector2
{
    public double X, Y;

    /// <summary>
    /// Returns the angle between two vectors
    /// </summary>
    public static double GetAngle(Vector2 A, Vector2 B)
    {
        // A·B = |A| |B| COS(θ)
        // A×B = |A| |B| SIN(θ)
        return Math.Atan2(Cross(A,B), Dot(A,B));
    }

    public double Magnitude { get { return Math.Sqrt(Dot(this,this)); } }

    public static double Dot(Vector2 A, Vector2 B)
    {
        return A.X*B.X+A.Y*B.Y;
    }
    public static double Cross(Vector2 A, Vector2 B)
    {
        return A.X*B.Y-A.Y*B.X;
    }
}

class Program
{
    static void Main(string[] args)
    {
        Vector2 A=new Vector2() { X=5.45, Y=1.12};
        Vector2 B=new Vector2() { X=-3.86, Y=4.32 };
        double angle=Vector2.GetAngle(A, B) * 180/Math.PI;
        // angle = 120.16850967865749
    }
}
See the test case above in GeoGebra.
I think a better formula was posted here:
angle = atan2(norm(cross(a,b)), dot(a,b))
So this formula works in 2 or 3 dimensions.
For 2 dimensions this formula simplifies to the one stated above.
Nobody pointed out that if you have a single vector, and want to find the angle of the vector from the X axis, you can take advantage of the fact that the argument to atan2() is actually the slope of the line, or (delta Y / delta X). So if you know the slope, you can do the following:
A = angle of the vector/line you wish to determine (from the X axis).
m = signed slope of the vector/line.
A = atan2(m, 1)
Very useful!
If you care about accuracy for small angles, you want to use this:
angle = 2*atan2(|| ||b||a - ||a||b ||, || ||b||a + ||a||b ||)
where "|| v ||" means the norm, i.e. the length, of the vector v.
However, this has the downside that in two dimensions it loses the sign of the angle.
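To illustrate why this matters for small angles, here is a minimal sketch that compares the formula above with the acos-based one for two vectors separated by about 1e-9 radians. The function names are assumptions for illustration.

```python
import math

def angle_kahan(a, b):
    # 2 * atan2(| |b| a - |a| b |, | |b| a + |a| b |); always non-negative.
    na, nb = math.hypot(*a), math.hypot(*b)
    diff = [nb * ai - na * bi for ai, bi in zip(a, b)]
    summ = [nb * ai + na * bi for ai, bi in zip(a, b)]
    return 2 * math.atan2(math.hypot(*diff), math.hypot(*summ))

def angle_acos(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    cos_theta = dot / (math.hypot(*a) * math.hypot(*b))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding

tiny = 1e-9
a = (1.0, 0.0)
b = (math.cos(tiny), math.sin(tiny))
kahan = angle_kahan(a, b)   # stays close to 1e-9
naive = angle_acos(a, b)    # the cosine rounds to 1.0, so the angle is lost
```

The acos route fails here because the cosine of a tiny angle is indistinguishable from 1 in double precision. The atan2 form keeps the small component in its first argument instead.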
As a complement to the answer of @martin-r, one should note that it is possible to use the sum/difference formula for the arctangent. The expression
angle = atan2(vec2.y, vec2.x) - atan2(vec1.y, vec1.x);
is equivalent to
angle = atan2(vec1.x * vec2.y - vec1.y * vec2.x, dot(vec1, vec2))
where dot = vec1.x * vec2.x + vec1.y * vec2.y
Caveat 1: make sure the angle remains within -pi ... +pi
Caveat 2: beware when the vectors are getting very similar; you might get cancellation in the first argument, leading to numerical inaccuracies
You don't have to use atan2 to calculate the angle between two vectors. If you just want the quickest way, you can use dot(v1, v2)=|v1|*|v2|*cos A
to get
A = Math.acos( dot(v1, v2)/(v1.length()*v2.length()) );
xb,yb and xa,ya are the coordinates of the two vectors
The formula, angle(vector.b,vector.a), that I sent, give results
in the four quadrants and for any coordinates xa,ya and xb,yb.
For coordinates xa=ya=0 and or xb=yb=0 is undefined.
The angle can be bigger or smaller than pi, and can be positive
or negative.
Here is a little program in Python that uses the angle between vectors to determine whether a point is inside or outside a certain polygon:
import sys
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from shapely.geometry import Point, Polygon
from pprint import pprint
# Plot variables
x_min, x_max = -6, 12
y_min, y_max = -3, 8
tick_interval = 1
FIG_SIZE = (10, 10)
DELTA_ERROR = 0.00001
IN_BOX_COLOR = 'yellow'
OUT_BOX_COLOR = 'black'
def angle_between(v1, v2):
    """ Returns the angle in radians between vectors 'v1' and 'v2'
        The sign of the angle is dependent on the order of v1 and v2
        so acos(norm(dot(v1, v2))) does not work and atan2 has to be used
    """
    arg1 = np.cross(v1, v2)
    arg2 = np.dot(v1, v2)
    angle = np.arctan2(arg1, arg2)
    return angle

def point_inside(point, border):
    """ Returns True if point is inside border polygon and False if not
        :point: x, y in shapely.geometry.Point type
        :border: [x1 y1, x2 y2, ... , xn yn] in shapely.geometry.Polygon type
    """
    assert len(border.exterior.coords) > 2,\
        'number of points in the polygon must be > 2'
    point = np.array(point.coords[0])  # (x, y) of the shapely Point
    side1 = np.array(border.exterior.coords[0]) - point
    sum_angles = 0
    for border_point in border.exterior.coords[1:]:
        side2 = np.array(border_point) - point
        angle = angle_between(side1, side2)
        sum_angles += angle
        side1 = side2
    # if wn is 1 then the point is inside
    wn = sum_angles / 2 / np.pi
    if abs(wn - 1) < DELTA_ERROR:
        return True
    return False
class MainMap():
def settings(cls, fig_size):
# set the plot outline, including axes going through the origin
cls.fig, = plt.subplots(figsize=fig_size), x_max), y_max)
tick_range_x = np.arange(round(x_min + (10*(x_max - x_min) % tick_interval)/10, 1),
x_max + 0.1, step=tick_interval)
tick_range_y = np.arange(round(y_min + (10*(y_max - y_min) % tick_interval)/10, 1),
y_max + 0.1, step=tick_interval)'both', which='major', labelsize=6)['left'].set_position('zero')['right'].set_color('none')['bottom'].set_position('zero')['top'].set_color('none')
def get_ax(cls):
def plot():
class PlotPointandRectangle(MainMap):
def __init__(self, start_point, rectangle_polygon, tolerance=0):
self.current_object = None
self.currently_dragging = False
self.fig.canvas.mpl_connect('key_press_event', self.on_key)
self.plot_types = ['o', 'o-']
self.plot_type = 1
self.rectangle = rectangle_polygon
# define a point that can be moved around
self.point = patches.Circle((start_point.x, start_point.y), 0.10,
if point_inside(start_point, self.rectangle):
_color = IN_BOX_COLOR
_color = OUT_BOX_COLOR
cv_point = self.point.figure.canvas
cv_point.mpl_connect('button_release_event', self.on_release)
cv_point.mpl_connect('pick_event', self.on_pick)
cv_point.mpl_connect('motion_notify_event', self.on_motion)
def plot_rectangle(self):
x = [point[0] for point in self.rectangle.exterior.coords]
y = [point[1] for point in self.rectangle.exterior.coords]
# y = self.rectangle.y
self.rectangle_plot, =, y,
self.plot_types[self.plot_type], color='r', lw=0.4, markersize=2)
def on_release(self, event):
self.current_object = None
self.currently_dragging = False
def on_pick(self, event):
self.currently_dragging = True
self.current_object = event.artist
def on_motion(self, event):
if not self.currently_dragging:
if self.current_object == None:
point = Point(event.xdata, event.ydata) = point.x, point.y
if point_inside(point, self.rectangle):
_color = IN_BOX_COLOR
_color = OUT_BOX_COLOR
def remove_rectangle_from_plot(self):
except ValueError:
def on_key(self, event):
# with 'space' toggle between just points or points connected with
# lines
if event.key == ' ':
self.plot_type = (self.plot_type + 1) % 2
def main(start_point, rectangle):
plt_me = PlotPointandRectangle(start_point, rectangle) #pylint: disable=unused-variable
if __name__ == "__main__":
    try:
        start_point = Point([float(val) for val in sys.argv[1].split()])
    except IndexError:
        start_point = Point(0, 0)
    border_points = [(-2, -2),
                     (1, 1),
                     (3, -1),
                     (3, 3.5),
                     (4, 1),
                     (5, 1),
                     (4, 3.5),
                     (5, 6),
                     (3, 4),
                     (3, 5),
                     (-0.5, 1),
                     (-3, 1),
                     (-1, -0.5),
                     ]
    border_points_polygon = Polygon(border_points)
    main(start_point, border_points_polygon)
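The core idea, summing signed angles to obtain a winding number, does not need shapely or matplotlib. The following is a minimal standard-library sketch of the same technique; the function names and the test square are assumptions for illustration, and points lying exactly on the boundary are not handled.

```python
import math

def signed_angle(v1, v2):
    # Signed angle from v1 to v2 via the 2-D cross and dot products.
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.atan2(cross, dot)

def winding_number(point, vertices):
    # Sum the signed angles subtended by each polygon edge as seen from point.
    total = 0.0
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]   # wrap around to close the polygon
        total += signed_angle((x1 - point[0], y1 - point[1]),
                              (x2 - point[0], y2 - point[1]))
    return round(total / (2 * math.pi))

def point_inside(point, vertices):
    # A non-zero winding number means inside, for either vertex orientation.
    return winding_number(point, vertices) != 0

square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # counter-clockwise
inside = point_inside((2, 2), square)
outside = point_inside((5, 2), square)
```

Testing for a non-zero winding number, rather than for exactly 1, makes the check independent of whether the polygon vertices are listed clockwise or counter-clockwise.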

adaptive elliptical structuring element in MATLAB

I'm trying to create an adaptive elliptical structuring element for an image, to dilate or erode it. I wrote this code, but unfortunately all of the structuring elements come out as ones(2*M+1).
I = input('Enter the input image: ');
M = input('Enter the maximum allowed semi-major axes length: ');
% determining ellipse parameters from eigenvalue decomposition of LST
row = size(I,1);
col = size(I,2);
SE = cell(row,col);
padI = padarray(I,[M M],'replicate','both');
padrow = size(padI,1);
padcol = size(padI,2);
for m = M+1:padrow-M
    for n = M+1:padcol-M
        a = (l2(m-M,n-M)+eps/l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
        b = (l1(m-M,n-M)+eps/l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
        if e1(m-M,n-M,1)==0
            phi = pi/2;
        else
            phi = atan(e1(m-M,n-M,2)/e1(m-M,n-M,1));
        end
        % defining structuring element for each pixel of image
        x0 = m;
        y0 = n;
        se = zeros(2*M+1);
        row_se = 0;
        for i = x0-M:x0+M
            row_se = row_se+1;
            col_se = 0;
            for j = y0-M:y0+M
                col_se = col_se+1;
                x = j-y0;
                y = x0-i;
                if ((x*cos(phi)+y*sin(phi))^2)/a^2+((x*sin(phi)-y*cos(phi))^2)/b^2 <= 1
                    se(row_se,col_se) = 1;
                end
            end
        end
        SE{m-M,n-M} = se;
    end
end
a and b are the semi-major and semi-minor axis lengths, and phi is the angle between a and the x axis.
I used 2 MATLAB functions to compute the Local Structure Tensor of the image, and then its eigenvalues and eigenvectors for each pixel. These are the matrices l1, l2, e1 and e2.
This is the bit of your code I didn't understand:
a = (l2(m-M,n-M)+eps/l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
b = (l1(m-M,n-M)+eps/l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
I simplified the expression for b to (just removing the indexing):
b = (l1+eps/l1+l2+2*eps)*M;
For l1 and l2 in the normal range we get:
b =(approx)= (l1+0/l1+l2+2*0)*M = (l1+l2)*M;
Thus, b can easily be larger than M, which I don't think is your intention. The eps in this case also doesn't protect against division by zero, which is typically the purpose of adding eps: if l1 is zero, eps/l1 is Inf.
Looking at this expression, it seems to me that you intended this instead:
b = (l1+eps)/(l1+l2+2*eps)*M;
Here, you're adding eps to each of the eigenvalues, making them guaranteed non-zero (the structure tensor is symmetric, positive semi-definite). Then you're dividing l1 by the sum of eigenvalues, and multiplying by M, which leads to a value between 0 and M for each of the axes.
So, this seems to be a case of misplaced parenthesis.
Just for the record, this is what you need in your code:
a = (l2(m-M,n-M)+eps ) / ( l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
b = (l1(m-M,n-M)+eps ) / ( l1(m-M,n-M)+l2(m-M,n-M)+2*eps)*M;
%                    ^   ^
%              added parentheses
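The effect of the misplaced parentheses is easy to reproduce with plain numbers. Operator precedence for `/` and `+` is the same in Python as in MATLAB, so the sketch below uses Python; the eigenvalues `l1`, `l2` and the limit `M` are hypothetical values chosen only for illustration.

```python
import sys

eps = sys.float_info.epsilon   # same magnitude as MATLAB's eps (2**-52)
M = 5                          # hypothetical maximum semi-axis length
l1, l2 = 3.0, 1.0              # hypothetical eigenvalues

# As written in the question: the division binds only to eps and l1.
b_unparenthesized = (l1 + eps / l1 + l2 + 2 * eps) * M

# With the intended grouping: a ratio in (0, 1) scaled by M.
b_parenthesized = (l1 + eps) / (l1 + l2 + 2 * eps) * M
a_parenthesized = (l2 + eps) / (l1 + l2 + 2 * eps) * M
```

With these numbers the unparenthesized form gives roughly (l1 + l2) * M, well above M, while the corrected axes stay within M and sum to M. An ellipse with both axes much larger than the window would cover every pixel in it, which would be consistent with the all-ones structuring elements reported in the question.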
Note that you can simplify your code by defining, outside of the loops:
[se_x,se_y] = meshgrid(-M:M,-M:M);
The inner two loops, over i and j, to construct se can then be written simply as:
se = ((se_x.*cos(phi)+se_y.*sin(phi)).^2)./a.^2 + ...
((se_x.*sin(phi)-se_y.*cos(phi)).^2)./b.^2 <= 1;
(Note the .* and .^ operators, these do element-wise multiplication and power.)
A further slight improvement comes from realizing that phi is first computed from e1(m,n,1) and e1(m,n,2), and then used in calls to cos and sin. If we assume that the eigenvector is properly normalized, then
cos(phi) == e1(m,n,1)
sin(phi) == e1(m,n,2)
But you can always make sure they are normalized:
cos_phi = e1(m-M,n-M,1);
sin_phi = e1(m-M,n-M,2);
len = hypot(cos_phi,sin_phi);
cos_phi = cos_phi / len;
sin_phi = sin_phi / len;
se = ((se_x.*cos_phi+se_y.*sin_phi).^2)./a.^2 + ...
((se_x.*sin_phi-se_y.*cos_phi).^2)./b.^2 <= 1;
Considering trigonometric operations are fairly expensive, this should speed up your code a bit.
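For readers more familiar with NumPy, the same vectorized construction might look like the sketch below. It follows the meshgrid convention of the MATLAB snippet above; the function name `ellipse_mask` and the example parameters are assumptions for illustration.

```python
import numpy as np

def ellipse_mask(M, a, b, cos_phi, sin_phi):
    # Coordinates of the (2M+1) x (2M+1) window, centred on the pixel.
    se_x, se_y = np.meshgrid(np.arange(-M, M + 1), np.arange(-M, M + 1))
    # Normalize the direction in case the eigenvector is not unit length.
    length = np.hypot(cos_phi, sin_phi)
    c, s = cos_phi / length, sin_phi / length
    u = se_x * c + se_y * s      # coordinate along the major axis
    v = se_x * s - se_y * c      # coordinate along the minor axis
    return (u / a) ** 2 + (v / b) ** 2 <= 1

# Horizontal ellipse with semi-axes 3 and 1 inside a 7 x 7 window.
mask = ellipse_mask(M=3, a=3.0, b=1.0, cos_phi=1.0, sin_phi=0.0)
```

No call to `cos`, `sin`, or `atan` is needed, because the eigenvector components are used directly as the rotation.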

Fast CVX solvers in Matlab

I am wondering what is the fastest convex optimizer in Matlab or is there any way to speed up current solvers? I'm using CVX, but it's taking forever to solve the optimization problem I have.
The optimization I have is to solve
minimize norm(Ax-b, 2)
subject to
x >= 0
and x.d <= delta
where the size of A and b are very large.
Is there any way that I can solve this by a least square solver and then transfer it to the constraint version to make it faster?
I'm not sure what x.d <= delta means, but I'll just assume it's supposed to be x <= delta.
You can solve this problem using the projected gradient method or an accelerated projected gradient method (which is just a slight modification of the projected gradient method, which "magically" converges much faster). Here is some python code that shows how to minimize .5|| Ax - b ||^2 subject to the constraint that 0 <= x <= delta using FISTA, which is an accelerated projected gradient method. More details about the projected gradient method and FISTA can be found for example in Boyd's manuscript on proximal algorithms.
import numpy as np
import matplotlib.pyplot as plt

def fista(gradf, proxg, evalf, evalg, x0, params):
    # This code does FISTA with line search
    maxIter = params['maxIter']
    t = params['stepSize']  # Initial step size
    showTrigger = params['showTrigger']
    increaseFactor = 1.25
    decreaseFactor = .5
    costs = np.zeros((maxIter, 1))
    xkm1 = np.copy(x0)
    vkm1 = np.copy(x0)
    for k in range(1, maxIter + 1):
        costs[k-1] = evalf(xkm1) + evalg(xkm1)
        if k % showTrigger == 0:
            print("Iteration: " + str(k) + " cost: " + str(costs[k-1]))
        t = increaseFactor*t
        acceptFlag = False
        while not acceptFlag:
            if k == 1:
                theta = 1
            else:
                a = tkm1
                b = t*(thetakm1**2)
                c = -t*(thetakm1**2)
                theta = (-b + np.sqrt(b**2 - 4*a*c))/(2*a)
            y = (1 - theta)*xkm1 + theta*vkm1
            (gradf_y, fy) = gradf(y)
            x = proxg(y - t*gradf_y, t)
            fx = evalf(x)
            if fx <= fy + np.vdot(gradf_y, x - y) + (.5/t)*np.sum((x - y)**2):
                acceptFlag = True
            else:
                t = decreaseFactor*t
        tkm1 = t
        thetakm1 = theta
        vkm1 = xkm1 + (1/theta)*(x - xkm1)
        xkm1 = x
    return (xkm1, costs)

if __name__ == '__main__':
    delta = 5.0
    numRows = 300
    numCols = 50
    A = np.random.randn(numRows, numCols)
    ATrans = np.transpose(A)
    xTrue = delta*np.random.rand(numCols, 1)
    b = np.dot(A, xTrue)
    noise = .1*np.random.randn(numRows, 1)
    b = b + noise

    def evalf(x):
        AxMinusb = np.dot(A, x) - b
        val = .5 * np.sum(AxMinusb ** 2)
        return val

    def gradf(x):
        AxMinusb = np.dot(A, x) - b
        grad = np.dot(ATrans, AxMinusb)
        val = .5 * np.sum(AxMinusb ** 2)
        return (grad, val)

    def evalg(x):
        return 0.0

    def proxg(x, t):
        # Projection onto the box 0 <= x <= delta
        return np.maximum(np.minimum(x, delta), 0.0)

    x0 = np.zeros((numCols, 1))
    params = {'maxIter': 500, 'stepSize': 1.0, 'showTrigger': 5}
    (x, costs) = fista(gradf, proxg, evalf, evalg, x0, params)
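For comparison, the plain (non-accelerated) projected gradient method mentioned above fits in a few lines. This is only a sketch under stated assumptions: it uses a fixed step size 1/L, with L taken as the squared spectral norm of A, instead of the line search used in the FISTA code, and the function name `projected_gradient` is illustrative.

```python
import numpy as np

def projected_gradient(A, b, delta, num_iters=500):
    # Minimize 0.5 * ||A x - b||^2 subject to 0 <= x <= delta.
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / L, L = largest eigenvalue of A^T A
    x = np.zeros((A.shape[1], 1))
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        x = np.clip(x - step * grad, 0.0, delta)   # gradient step, then projection
    return x

rng = np.random.default_rng(0)
delta = 5.0
A = rng.normal(size=(300, 50))
x_true = delta * rng.random((50, 1))
b = A @ x_true + 0.1 * rng.normal(size=(300, 1))

x_hat = projected_gradient(A, b, delta)
cost_start = 0.5 * np.sum(b ** 2)                  # cost at the starting point x = 0
cost_end = 0.5 * np.sum((A @ x_hat - b) ** 2)
```

The projection onto the box is just a clip, which is what makes this family of methods cheap per iteration for very large A and b.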

Matlab FFT and home brewed FFT

I'm trying to verify an FFT algorithm I need for a project against the same computation in Matlab.
The point is that with my own C FFT function I always get the right-hand (second) part of the double-sided FFT spectrum computed in Matlab, and not the first part as "expected".
For instance, if my third bin has the form a+i*b, the third bin of Matlab's FFT is a-i*b. The a and b values are the same, but I always get the complex conjugate of Matlab's result.
I know that in terms of amplitude and power there's no trouble (because of the absolute value), but I wonder whether, in terms of phase, I'm always going to read wrong angles.
I'm not skilled enough in Matlab to know (and I have not found useful information on the web) whether Matlab's FFT perhaps returns the spectrum with negative frequencies first and then positive ones, or whether I have to fix my FFT algorithm, or whether it is all OK because the phases are unchanged regardless of which part of the FFT we choose as the single-sided spectrum (but I doubt this last option).
If S is the sample array with N=512 samples, Y = fft(S) in Matlab returns the FFT as follows (the signs of the imaginary parts in the first half of the array are random, just to show the complex conjugate difference for the second part):
1 A1 + i*B1 (DC, B1 is always zero)
2 A2 + i*B2
3 A3 - i*B3
4 A4 + i*B4
5 A5 + i*B5
...
253 A253 - i*B253
254 A254 + i*B254
255 A255 + i*B255
256 A256 + i*B256
257 A257 - i*B257 (Nyquist, B257 is always zero)
258 A256 - i*B256
259 A255 - i*B255
260 A254 - i*B254
261 A253 + i*B253
...
509 A5 - i*B5
510 A4 - i*B4
511 A3 + i*B3
512 A2 - i*B2
My FFT implementation returns only 256 values (and that's OK) in the Y array, as:
1 1 A1 + i*B1 (A1 is the DC, B1 is Nyquist, both are purely real numbers)
2 512 A2 - i*B2
3 511 A3 + i*B3
4 510 A4 - i*B4
5 509 A5 + i*B5
...
253 261 A253 + i*B253
254 260 A254 - i*B254
255 259 A255 - i*B255
256 258 A256 - i*B256
where the first column is the index into my Y array and the second is the corresponding row in the Matlab FFT output.
As you can see, my FFT implementation (DC apart) returns the second half of Matlab's FFT (in reverse order).
To summarize: even if I use fftshift as suggested, it seems that my implementation always returns what in the Matlab FFT should be considered the negative part of the spectrum.
Where is the error?
This is the code I use:
Note 1: the FFT array is not declared here and it is modified inside the function. Initially it holds the N samples (real values), and at the end it contains the N/2+1 bins of the single-sided FFT spectrum.
Note 2: the N/2+1 bins are stored in only N/2 complex slots because the DC component is always real (and is stored in FFT[0]) and so is the Nyquist component (stored in FFT[1]). This exception apart, every even element K holds a real part and the odd element K+1 holds the corresponding imaginary part.
void Fft::FastFourierTransform( bool inverseFft ) {
    double twr, twi, twpr, twpi, twtemp, ttheta;
    int i, i1, i2, i3, i4;
    double c1, c2; // floating point: they hold 0.5 and -0.5
    double h1r, h1i, h2r, h2i, wrs, wis;
    int nn, ii, jj, n, mmax, m, j, istep, isign;
    double wtemp, wr, wpr, wpi, wi;
    double theta, tempr, tempi;

    // NS is the number of samples and it must be a power of two
    if( NS == 1 )
        return;

    if( !inverseFft ) {
        ttheta = 2.0 * PI / NS;
        c1 = 0.5;
        c2 = -0.5;
    }
    else {
        ttheta = 2.0 * PI / NS;
        c1 = 0.5;
        c2 = 0.5;
        ttheta = -ttheta;
        twpr = -2.0 * Pow( Sin( 0.5 * ttheta ), 2 );
        twpi = Sin(ttheta);
        twr = 1.0+twpr;
        twi = twpi;
        for( i = 2; i <= NS/4+1; i++ ) {
            i1 = i+i-2;
            i2 = i1+1;
            i3 = NS+1-i2;
            i4 = i3+1;
            wrs = twr;
            wis = twi;
            h1r = c1*(FFT[i1]+FFT[i3]);
            h1i = c1*(FFT[i2]-FFT[i4]);
            h2r = -c2*(FFT[i2]+FFT[i4]);
            h2i = c2*(FFT[i1]-FFT[i3]);
            FFT[i1] = h1r+wrs*h2r-wis*h2i;
            FFT[i2] = h1i+wrs*h2i+wis*h2r;
            FFT[i3] = h1r-wrs*h2r+wis*h2i;
            FFT[i4] = -h1i+wrs*h2i+wis*h2r;
            twtemp = twr;
            twr = twr*twpr-twi*twpi+twr;
            twi = twi*twpr+twtemp*twpi+twi;
        }
        h1r = FFT[0];
        FFT[0] = c1*(h1r+FFT[1]);
        FFT[1] = c1*(h1r-FFT[1]);
    }

    if( inverseFft )
        isign = -1;
    else
        isign = 1;

    n = NS;
    nn = NS/2;
    j = 1;
    // bit-reversal reordering
    for(ii = 1; ii <= nn; ii++) {
        i = 2*ii-1;
        if( j>i ) {
            tempr = FFT[j-1];
            tempi = FFT[j];
            FFT[j-1] = FFT[i-1];
            FFT[j] = FFT[i];
            FFT[i-1] = tempr;
            FFT[i] = tempi;
        }
        m = n/2;
        while( m>=2 && j>m ) {
            j = j-m;
            m = m/2;
        }
        j = j+m;
    }

    // butterfly passes
    mmax = 2;
    while(n>mmax) {
        istep = 2*mmax;
        theta = 2.0 * PI /(isign*mmax);
        wpr = -2.0 * Pow( Sin( 0.5 * theta ), 2 );
        wpi = Sin(theta);
        wr = 1.0;
        wi = 0.0;
        for(ii = 1; ii <= mmax/2; ii++) {
            m = 2*ii-1;
            for(jj = 0; jj <= (n-m)/istep; jj++) {
                i = m+jj*istep;
                j = i+mmax;
                tempr = wr*FFT[j-1]-wi*FFT[j];
                tempi = wr*FFT[j]+wi*FFT[j-1];
                FFT[j-1] = FFT[i-1]-tempr;
                FFT[j] = FFT[i]-tempi;
                FFT[i-1] = FFT[i-1]+tempr;
                FFT[i] = FFT[i]+tempi;
            }
            wtemp = wr;
            wr = wr*wpr-wi*wpi+wr;
            wi = wi*wpr+wtemp*wpi+wi;
        }
        mmax = istep;
    }

    if( inverseFft )
        for(i = 1; i <= 2*nn; i++)
            FFT[i-1] = FFT[i-1]/nn;

    if( !inverseFft ) {
        twpr = -2.0 * Pow( Sin( 0.5 * ttheta ), 2 );
        twpi = Sin(ttheta);
        twr = 1.0+twpr;
        twi = twpi;
        for(i = 2; i <= NS/4+1; i++) {
            i1 = i+i-2;
            i2 = i1+1;
            i3 = NS+1-i2;
            i4 = i3+1;
            wrs = twr;
            wis = twi;
            h1r = c1*(FFT[i1]+FFT[i3]);
            h1i = c1*(FFT[i2]-FFT[i4]);
            h2r = -c2*(FFT[i2]+FFT[i4]);
            h2i = c2*(FFT[i1]-FFT[i3]);
            FFT[i1] = h1r+wrs*h2r-wis*h2i;
            FFT[i2] = h1i+wrs*h2i+wis*h2r;
            FFT[i3] = h1r-wrs*h2r+wis*h2i;
            FFT[i4] = -h1i+wrs*h2i+wis*h2r;
            twtemp = twr;
            twr = twr*twpr-twi*twpi+twr;
            twi = twi*twpr+twtemp*twpi+twi;
        }
        h1r = FFT[0];
        FFT[0] = h1r+FFT[1]; // DC
        FFT[1] = h1r-FFT[1]; // FS/2 (NYQUIST)
    }
}
In Matlab, try using fftshift(fft(...)). Matlab doesn't automatically shift the spectrum after the FFT is called, which is why they implemented the fftshift() function.
It is simply a Matlab formatting thing. Basically, Matlab arranges the Fourier transform in the following order:
DC, DC+1, ..., Nyquist-1, -Nyquist, -Nyquist+1, ..., DC-1
Let's say you have an 8-point sequence: [1 2 3 1 4 5 1 3]
In your signal processing class, your professor probably draws the Fourier spectrum on a Cartesian axis (negative -> positive along the x axis), so your DC should be located at 0 on the x axis (the 4th position in your FFT sequence, assuming the position index here is 0-based).
In Matlab, the DC is the very first element in the FFT sequence, so you need to use fftshift() to swap the first half and second half of the FFT sequence, such that DC is located at the 4th position (0-based index).
I am attaching a graph here so you may have a visual:
where a is the original 8-point sequence; FT(a) is the Fourier transform of a.
The matlab code is here:
a = [1 2 3 1 4 5 1 3];
A = fft(a);
N = length(a);
x = -N/2:N/2-1;
subplot(3,1,1), stem(x, a,'o'); title('a'); xlabel('time')
subplot(3,1,2), stem(x, fftshift(abs(A),2),'o'); title('FT(a) in signal processing'); xlabel('frequency')
subplot(3,1,3), stem(x, abs(A),'o'); title('FT(a) in matlab'); xlabel('frequency')
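The bin ordering and the conjugate relationship discussed in this thread can be illustrated with a naive DFT. This is a sketch only and does not diagnose the C++ code above: it shows that, for real input, flipping the sign of the exponent yields the complex conjugate of every bin, and that bin N-k is the conjugate of bin k. The `dft` and `fftshift` helpers are written here for illustration and are not library calls.

```python
import cmath

def dft(samples, sign=-1):
    # Naive O(N^2) DFT; sign=-1 uses exp(-2*pi*i*k*n/N), the convention
    # documented for Matlab's fft.
    n_samples = len(samples)
    return [sum(x * cmath.exp(sign * 2j * cmath.pi * k * n / n_samples)
                for n, x in enumerate(samples))
            for k in range(n_samples)]

def fftshift(values):
    # Rotate so that the zero-frequency bin sits in the middle.
    half = (len(values) + 1) // 2
    return values[half:] + values[:half]

a = [1, 2, 3, 1, 4, 5, 1, 3]
forward = dft(a, sign=-1)
flipped = dft(a, sign=+1)      # opposite exponent sign
N = len(a)
shifted_order = fftshift(list(range(N)))   # where each original bin ends up
```

One possible reading of the question is therefore that the two implementations differ in the sign convention of the exponent. That would leave magnitudes unchanged and negate every phase.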

Octave backpropagation implementation issues

I wrote code to implement steepest descent backpropagation, and I am having issues with it. I am using the Machine CPU dataset and have scaled the inputs and outputs into the range [0, 1].
The code in Matlab/Octave is as follows:
steepest descent backpropagation
%SGD = Steepest Gradient Descent
function weights = nnSGDTrain (X, y, nhid_units, gamma, max_epoch, X_test, y_test)
  iput_units = columns (X);
  oput_units = columns (y);
  n = rows (X);
  W2 = rand (nhid_units + 1, oput_units);
  W1 = rand (iput_units + 1, nhid_units);
  train_rmse = zeros (1, max_epoch);
  test_rmse = zeros (1, max_epoch);
  for (epoch = 1:max_epoch)
    delW2 = zeros (nhid_units + 1, oput_units)';
    delW1 = zeros (iput_units + 1, nhid_units)';
    for (i = 1:rows(X))
      o1 = sigmoid ([X(i,:), 1] * W1); %1xn+1 * n+1xk = 1xk
      o2 = sigmoid ([o1, 1] * W2); %1xk+1 * k+1xm = 1xm
      D2 = o2 .* (1 - o2);
      D1 = o1 .* (1 - o1);
      e = (y_test(i,:) - o2)';
      delta2 = diag (D2) * e; %mxm * mx1 = mx1
      delta1 = diag (D1) * W2(1:(end-1),:) * delta2; %kxm * mx1 = kx1
      delW2 = delW2 + (delta2 * [o1 1]); %mx1 * 1xk+1 = mxk+1 %already transposed
      delW1 = delW1 + (delta1 * [X(i, :) 1]); %kx1 * 1xn+1 = k*n+1 %already transposed
    end
    delW2 = gamma .* delW2 ./ n;
    delW1 = gamma .* delW1 ./ n;
    W2 = W2 + delW2';
    W1 = W1 + delW1';
    [dummy train_rmse(epoch)] = nnPredict (X, y, nhid_units, [W1(:);W2(:)]);
    [dummy test_rmse(epoch)] = nnPredict (X_test, y_test, nhid_units, [W1(:);W2(:)]);
    printf ('Epoch: %d\tTrain Error: %f\tTest Error: %f\n', epoch, train_rmse(epoch), test_rmse(epoch));
    fflush (stdout);
  end
  weights = [W1(:);W2(:)];
  % plot (1:max_epoch, test_rmse, 1);
  % hold on;
  plot (1:max_epoch, train_rmse(1:end), 2);
  % hold off;
end
%Now SFNN Only
function [o1 rmse] = nnPredict (X, y, nhid_units, weights)
  iput_units = columns (X);
  oput_units = columns (y);
  n = rows (X);
  W1 = reshape (weights(1:((iput_units + 1) * nhid_units),1), iput_units + 1, nhid_units);
  W2 = reshape (weights((((iput_units + 1) * nhid_units) + 1):end,1), nhid_units + 1, oput_units);
  o1 = sigmoid ([X ones(n,1)] * W1); %nxiput_units+1 * iput_units+1xnhid_units = nxnhid_units
  o2 = sigmoid ([o1 ones(n,1)] * W2); %nxnhid_units+1 * nhid_units+1xoput_units = nxoput_units
  rmse = RMSE (y, o2);
end
RMSE function
function rmse = RMSE (a1, a2)
  rmse = sqrt (sum (sum ((a1 - a2).^2))/rows(a1));
end
I have also trained on the same dataset using the mlp function of the R RSNNS package, and the RMSE for the training set (first 100 examples) is around 0.03. But with my implementation I cannot achieve an RMSE lower than 0.14. Sometimes the error grows for higher learning rates, and no learning rate gets me below 0.14. A paper I referred to also reports an RMSE of around 0.03 for the training set.
I would like to know where the problem is in the code. I have followed Raul Rojas's book and confirmed that things are okay.
In the backpropagation code, the line
e = (y_test(i,:) - o2)';
is not correct, because o2 is the output for an example from the training set, while I am taking the difference from an example of the test set, y_test. The line should have been:
e = (y(i,:) - o2)';
which correctly finds the difference between the predicted output by the current model and the target output of the corresponding example.
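A gradient check makes this class of bug easier to localize. The following NumPy sketch of a single-example backward pass for the same two-layer sigmoid network is an illustration only, not a port of the Octave code: the shapes, the random data, and the function names are assumptions. It computes the error against the training target and compares one weight gradient with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    o1 = sigmoid(np.append(x, 1.0) @ W1)     # hidden activations (bias appended)
    o2 = sigmoid(np.append(o1, 1.0) @ W2)    # network output
    return o1, o2

def gradients(x, target, W1, W2):
    # Gradients of 0.5 * ||target - o2||^2 for one training example.
    o1, o2 = forward(x, W1, W2)
    e = target - o2                           # error against the TRAINING target
    delta2 = o2 * (1 - o2) * e
    delta1 = o1 * (1 - o1) * (W2[:-1, :] @ delta2)
    gW2 = -np.outer(np.append(o1, 1.0), delta2)
    gW1 = -np.outer(np.append(x, 1.0), delta1)
    return gW1, gW2

rng = np.random.default_rng(0)
x, target = rng.random(3), rng.random(2)
W1, W2 = rng.random((4, 5)), rng.random((6, 2))
gW1, gW2 = gradients(x, target, W1, W2)

def loss(W1_, W2_):
    return 0.5 * np.sum((target - forward(x, W1_, W2_)[1]) ** 2)

# Finite-difference estimate for one weight of W1.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[1, 2] += eps
W1m[1, 2] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
```

Gradients computed from an error taken against a different target set would generally fail this comparison, so the check can point at the error line before any training run.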
It took me 3 days to find this one. I am fortunate to have found this freaking bug, which had stopped me from moving on to further modifications.