I am working on a CNN whose last layer is supposed to be a cross-entropy layer comparing two images. I am required to use Caffe. I checked the layer catalogue, and there isn't a plain cross-entropy layer, only a sigmoid cross-entropy one. I asked my supervisor, who told me that using it wouldn't do.
So here is my question: is vanilla cross-entropy hidden somewhere?
No, there is no vanilla cross-entropy in Caffe so far. There are only some special instances of cross-entropy, for example SigmoidCrossEntropyLossLayer and MultinomialLogisticLossLayer.
But you can get vanilla cross-entropy by simply modifying the MultinomialLogisticLossLayer. That layer computes cross-entropy assuming that the target probability distribution is a one-hot vector; by changing its computation formula to allow a general target distribution, you get vanilla cross-entropy, i.e. loss = -(1/num) * sum_i sum_j t_ij * log(p_ij). Hope that helps.
Original MultinomialLogisticLossLayer in Caffe:
template <typename Dtype>
void MultinomialLogisticLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* bottom_label = bottom[1]->cpu_data();
  int num = bottom[0]->num();
  int dim = bottom[0]->count() / bottom[0]->num();
  Dtype loss = 0;
  for (int i = 0; i < num; ++i) {
    int label = static_cast<int>(bottom_label[i]);
    Dtype prob = std::max(bottom_data[i * dim + label], Dtype(kLOG_THRESHOLD));
    loss -= log(prob);
  }
  top[0]->mutable_cpu_data()[0] = loss / num;
}
Assuming that bottom[1] contains the target/true distribution, the CrossEntropyLossLayer's forward pass should then look like this:
template <typename Dtype>
void CrossEntropyLossLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* bottom_label = bottom[1]->cpu_data();
  int num = bottom[0]->num();
  int dim = bottom[0]->count() / bottom[0]->num();
  Dtype loss = 0;
  for (int i = 0; i < num; ++i) {
    for (int j = 0; j < dim; ++j) {
      Dtype prob = std::max(bottom_data[i * dim + j], Dtype(kLOG_THRESHOLD));
      loss -= bottom_label[i * dim + j] * log(prob);
    }
  }
  top[0]->mutable_cpu_data()[0] = loss / num;
}
Similarly, the corresponding Backward_cpu function for backpropagation will be:
template <typename Dtype>
void CrossEntropyLossLayer<Dtype>::Backward_cpu(
    const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  if (propagate_down[1]) {
    LOG(FATAL) << this->type()
               << " Layer cannot backpropagate to label inputs.";
  }
  if (propagate_down[0]) {
    const Dtype* bottom_data = bottom[0]->cpu_data();
    const Dtype* bottom_label = bottom[1]->cpu_data();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    int num = bottom[0]->num();
    int dim = bottom[0]->count() / bottom[0]->num();
    caffe_set(bottom[0]->count(), Dtype(0), bottom_diff);
    const Dtype scale = -top[0]->cpu_diff()[0] / num;
    for (int i = 0; i < num; ++i) {
      for (int j = 0; j < dim; ++j) {
        Dtype prob = std::max(bottom_data[i * dim + j], Dtype(kLOG_THRESHOLD));
        bottom_diff[i * dim + j] = scale * bottom_label[i * dim + j] / prob;
      }
    }
  }
}
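If you want to sanity-check the modified layer before wiring it into Caffe, here is a small NumPy sketch (my own illustration, not Caffe code) that mirrors the forward and backward formulas above and compares the analytic gradient against a numeric one:

import numpy as np

LOG_THRESHOLD = 1e-20  # plays the role of Caffe's kLOG_THRESHOLD clamp

def forward(probs, targets):
    # loss = -(1/num) * sum_ij targets_ij * log(probs_ij)
    p = np.maximum(probs, LOG_THRESHOLD)
    return -np.sum(targets * np.log(p)) / probs.shape[0]

def backward(probs, targets):
    # d loss / d probs_ij = -(1/num) * targets_ij / probs_ij
    p = np.maximum(probs, LOG_THRESHOLD)
    return -targets / p / probs.shape[0]

rng = np.random.default_rng(0)
probs = rng.random((4, 3)) + 0.1
targets = rng.random((4, 3))
grad = backward(probs, targets)

eps = 1e-6
numeric = np.empty_like(probs)
for i in range(probs.shape[0]):
    for j in range(probs.shape[1]):
        plus, minus = probs.copy(), probs.copy()
        plus[i, j] += eps
        minus[i, j] -= eps
        numeric[i, j] = (forward(plus, targets) - forward(minus, targets)) / (2 * eps)
print(np.max(np.abs(grad - numeric)))  # tiny, on the order of 1e-9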
For a neural network library I implemented some activation functions and loss functions along with their derivatives. They can be combined arbitrarily, and the derivative at the output layer just becomes the product of the loss derivative and the activation derivative.
However, I failed to implement the derivative of the softmax activation function independently of any loss function. Due to the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
Here is my softmax implementation, where the derivative fails gradient checking by about 1%. How can I implement the softmax derivative so that it can be combined with any loss function?
import numpy as np

class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)

activation = Softmax()
cost = SquaredError()

outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)
Mathematically, the derivative of the softmax output σ(j) with respect to the logit z_i (for example, w_i·x) is

∂σ(j)/∂z_i = σ(j) · (δ_ij − σ(i))

where δ_ij is the Kronecker delta.
If you implement it iteratively:

import numpy as np

def softmax_grad(s):
    # Input s is the softmax value of the original input x, with shape (n,),
    # e.g. s = np.array([0.3, 0.7]) for x = np.array([0, 1]).
    # Build the n-by-n Jacobian matrix.
    jacobian_m = np.diag(s)
    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m
Test:
In [95]: x
Out[95]: array([1, 2])
In [96]: softmax(x)
Out[96]: array([ 0.26894142, 0.73105858])
In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])
If you implement it in a vectorized version:

soft_max = softmax(x)

def softmax_grad(softmax):
    # Reshape the softmax output to 2-D so np.dot gives a matrix multiplication.
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)
# array([[ 0.19661193, -0.19661193],
#        [-0.19661193,  0.19661193]])
It should be like this (x is the input to the softmax layer, dy is the delta coming from the loss above it, and y below is the softmax output, i.e. compute(x)):

def delta(self, x, dy):
    y = self.compute(x)
    dx = y * dy
    s = dx.sum(axis=dx.ndim - 1, keepdims=True)
    dx -= y * s
    return dx
But the way you compute the error should be:
yact = activation.compute(x)
ycost = cost.compute(yact)
dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))
Explanation: Because the delta function is part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1], and then multiply it from the left by a vector dy, after a bit of algebra you'll find that you get something that corresponds to my Python code.
[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network
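To see that this shortcut really is the Jacobian-vector product, here is a quick NumPy check (my own snippet, not part of the answer above) comparing it against the explicit Jacobian diag(y) - y y^T:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.random.randn(5)
dy = np.random.randn(5)
y = softmax(x)

# shortcut: dx = y*dy - y*sum(y*dy)
dx = y * dy - y * np.sum(y * dy)

# explicit Jacobian product (the Jacobian is symmetric, so left- and
# right-multiplication by dy coincide)
J = np.diagflat(y) - np.outer(y, y)
print(np.allclose(dx, J @ dy))  # True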
The other answers are great; here I share a simple implementation of the forward/backward pass, regardless of loss function.
In brief, the backward pass for softmax follows from dprob_k/dx_j = prob_k * (delta_kj - prob_j); the derivative of the loss with respect to probs is loss-function dependent and not part of our implementation.
The backward pass is verified by manual gradient checking.
import numpy as np

class Softmax:
    def forward(self, x):
        mx = np.max(x, axis=1, keepdims=True)
        x = x - mx  # log-sum-exp trick
        e = np.exp(x)
        probs = e / np.sum(e, axis=1, keepdims=True)
        return probs

    def backward(self, x, probs, bp_err):
        dim = x.shape[1]
        output = np.empty(x.shape)
        for j in range(dim):
            d_prob_over_xj = - (probs * probs[:, [j]])  # i.e. -prob_k * prob_j, whether k == j or not
            d_prob_over_xj[:, j] += probs[:, j]         # i.e. when k == j, add prob_j
            output[:, j] = np.sum(bp_err * d_prob_over_xj, axis=1)
        return output

def compute_manual_grads(x, pred_fn):
    eps = 1e-3
    batch_size, dim = x.shape
    grads = np.empty(x.shape)
    for i in range(batch_size):
        for j in range(dim):
            x[i, j] += eps
            y1 = pred_fn(x)
            x[i, j] -= 2 * eps
            y2 = pred_fn(x)
            grads[i, j] = (y1 - y2) / (2 * eps)
            x[i, j] += eps
    return grads

def loss_fn(probs, ys, loss_type):
    batch_size = probs.shape[0]
    # dummy mse
    if loss_type == "mse":
        loss = np.sum((np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) ** 2) / batch_size
        values = 2 * (np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) / batch_size
    # cross entropy
    if loss_type == "xent":
        loss = -np.sum(np.take_along_axis(np.log(probs), ys.reshape(-1, 1), axis=1)) / batch_size
        values = -1 / np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) / batch_size
    err = np.zeros(probs.shape)
    np.put_along_axis(err, ys.reshape(-1, 1), values, axis=1)
    return loss, err

if __name__ == "__main__":
    batch_size = 10
    dim = 5
    x = np.random.rand(batch_size, dim)
    ys = np.random.randint(0, dim, batch_size)

    for loss_type in ["mse", "xent"]:
        S = Softmax()
        probs = S.forward(x)
        loss, bp_err = loss_fn(probs, ys, loss_type)
        grads = S.backward(x, probs, bp_err)

        def pred_fn(x, ys):
            pred = S.forward(x)
            loss, err = loss_fn(pred, ys, loss_type)
            return loss

        manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))

        # compare both grads
        print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads) ** 2) / batch_size}")
Just in case you are processing in batches, here is an implementation in NumPy (tested against TensorFlow). However, I suggest avoiding the associated tensor operations by mixing the Jacobian with the cross-entropy, which leads to a very simple and efficient expression.
import numpy as np
import tensorflow as tf

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps, axis=1, keepdims=True)

def softmax_jacob(s):
    return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
         - np.einsum('ij,ik->ijk', s, s)

def np_softmax_test(z):
    return softmax_jacob(softmax(z))

def tf_softmax_test(z):
    z = tf.constant(z, dtype=tf.float32)
    with tf.GradientTape() as g:
        g.watch(z)
        a = tf.nn.softmax(z)
    jacob = g.batch_jacobian(a, z)
    return jacob.numpy()

z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
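As an illustration of mixing the Jacobian with the cross-entropy, here is a sketch of that fused expression (my own example, assuming integer class labels, one-hot targets, and a mean reduction): the chain rule collapses to probs - onehot, so the batched Jacobian above is never materialized.

import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exps / np.sum(exps, axis=1, keepdims=True)

z = np.random.randn(3, 5)
labels = np.array([1, 4, 0])  # assumed integer class labels
probs = softmax(z)

onehot = np.zeros_like(probs)
onehot[np.arange(len(labels)), labels] = 1.0

# fused gradient of the mean cross-entropy w.r.t. the logits z
grad_fused = (probs - onehot) / len(labels)

# the same gradient via the explicit per-sample Jacobian, for comparison
jacob = np.einsum('ij,jk->ijk', probs, np.eye(probs.shape[-1])) \
      - np.einsum('ij,ik->ijk', probs, probs)
dL_dprobs = -onehot / probs / len(labels)
grad_explicit = np.einsum('ijk,ij->ik', jacob, dL_dprobs)
print(np.allclose(grad_fused, grad_explicit))  # True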
Here is a C++ vectorized version using intrinsics (22 times (!) faster than the non-SSE version):
// How many floats fit into an __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for
// intrinsics. Otherwise, consecutive rows of matrices will not be 32-byte
// aligned, and operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8

// Check to quickly see if your rows are divisible by m256.
// You can 'undefine' this to save performance, after everything was verified
// to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
    #define assert_is_m256_multiple(x) assert( ((x) % F_MULTIPLE_OF_M256) == 0 )
#else
    #define assert_is_m256_multiple(x)
#endif
// Usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x) {
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;
    for (size_t i = 0; i < F_MULTIPLE_OF_M256; ++i) {
        sum += sumStart[i];
    }
    return sum;
}

f_vec SoftmaxGrad_fromResult(const float *softmaxResult, size_t size,
                             const float *gradFromAbove) { //<--gradient vector, flowing into us from the layer above
    assert_is_m256_multiple(size);
    // Allocate the vector where the output will be stored:
    f_vec grad_v(size, true); // true: skip filling with zeros, to save performance.

    const __m256* end = (const __m256*)(softmaxResult + size);

    for (size_t i = 0; i < size; ++i) { // <-- for every row
        // Go through this i'th row:
        __m256 sum = _mm256_set1_ps(0.0f);
        const __m256 neg_sft_i = _mm256_set1_ps(-softmaxResult[i]);
        const __m256 *s = (const __m256*)softmaxResult;
        const __m256 *gAbove = (__m256*)gradFromAbove;

        for (; s < end; ) {
            __m256 mul = _mm256_mul_ps(*s, neg_sft_i); // softmaxResult_j * (-softmaxResult_i)
            mul = _mm256_mul_ps(mul, *gAbove);
            sum = _mm256_add_ps(sum, mul); // adding to the total sum of this row.
            ++s;
            ++gAbove;
        }
        grad_v[i] = slow_hAdd_ps(sum); // collapse the sum into 1 scalar (true sum of this row).
    } // end for every row

    // Reset back to the start and add a vector, to account for the Kronecker delta:
    __m256 *g = (__m256*)grad_v._contents;
    __m256 *s = (__m256*)softmaxResult;
    __m256 *gAbove = (__m256*)gradFromAbove;

    for (; s < end; ) {
        __m256 mul = _mm256_mul_ps(*s, *gAbove);
        *g = _mm256_add_ps(*g, mul);
        ++s;
        ++g;
    }
    return grad_v;
}
If for some reason somebody wants a simple (non-SSE) version, here it is:
inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,
                                                 const float* gradFromAbove, //<--gradient vector, flowing into us from the layer above
                                                 float* gradOutput,
                                                 size_t count) {
    // Every pre-softmax element in a layer contributed to the softmax of every
    // other element (it went into the denominator), so gradient will be
    // distributed from every post-softmax element to every pre-softmax element.
    for (size_t i = 0; i < count; ++i) {
        // Go through this i'th row:
        float sum = 0.0f;
        const float neg_sft_i = -softmaxResult[i];
        for (size_t j = 0; j < count; ++j) {
            float mul = gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
            sum += mul; // adding to the total sum of this row.
        }
        // NOTICE: assignment, overwriting any old values:
        gradOutput[i] = sum;
    } // end for every row

    for (size_t i = 0; i < count; ++i) {
        gradOutput[i] += softmaxResult[i] * gradFromAbove[i];
    }
}
I'm interested in finding the coordinates (X, Y) of the center of mass of my whole binary image, not the CoM of each component separately.
How can I do this efficiently?
I guess regionprops is the way, but I couldn't find the correct way to use it.
You can define all regions as a single region for regionprops
props = regionprops( double( BW ), 'Centroid' );
Depending on the data type of BW, regionprops decides whether it should label each connected component as a different region or treat all non-zeros as a single region with several components.
Alternatively, you can compute the centroid by yourself
[y x] = find( BW );
cent = [mean(x) mean(y)];
Just iterate over all the pixels and calculate the average of their X and Y coordinates:
void centerOfMass(int[][] image, int imageWidth, int imageHeight)
{
    int SumX = 0;
    int SumY = 0;
    int num = 0;
    for (int i = 0; i < imageWidth; i++)
    {
        for (int j = 0; j < imageHeight; j++)
        {
            if (image[i][j] == WHITE)
            {
                SumX = SumX + i;
                SumY = SumY + j;
                num = num + 1;
            }
        }
    }
    SumX = SumX / num;
    SumY = SumY / num;
    // The coordinate (SumX, SumY) is the center of the image mass
}
Extending this method to grayscale images in the range [0..255]: instead of
if (image[i][j] == WHITE)
{
    SumX = SumX + i;
    SumY = SumY + j;
    num = num + 1;
}
use the following calculation:

SumX = SumX + i * image[i][j];
SumY = SumY + j * image[i][j];
num = num + image[i][j];
In this case a pixel of value 100 has a 100 times higher weight than a dark pixel of value 1, so dark pixels contribute only a small fraction to the center-of-mass calculation.
Please note that if your image is large you might hit a 32-bit integer overflow, in which case you should use long sumX, sumY variables instead of int.
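For reference, here is the same weighted calculation as a short NumPy sketch (my own illustration, assuming img is a 2-D grayscale array with 0 for background); it also sidesteps the overflow concern, since the sums are accumulated in floating point:

import numpy as np

def center_of_mass(img):
    # Weighted average of the pixel coordinates, using intensities as weights.
    total = img.sum()
    ys, xs = np.indices(img.shape)
    return (xs * img).sum() / total, (ys * img).sum() / total

img = np.zeros((5, 5))
img[1, 1] = 100.0  # bright pixel
img[3, 3] = 1.0    # dark pixel, contributes ~1% of the weight
print(center_of_mass(img))  # approximately (1.02, 1.02), pulled slightly toward (3, 3)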
This is part of the code of a spectral subtraction algorithm; I'm trying to optimize it for Android. Please help me.
This is the MATLAB code:
function Seg=segment(signal,W,SP,Window)
% SEGMENT chops a signal to overlapping windowed segments
% A= SEGMENT(X,W,SP,WIN) returns a matrix whose columns are segmented
% and windowed frames of the input one-dimensional signal, X. W is the
% number of samples per window, default value W=256. SP is the shift
% percentage, default value SP=0.4. WIN is the window that is multiplied by
% each segment and its length should be W. The default window is a Hamming
% window.
% 06-Sep-04
% Esfandiar Zavarehei

if nargin<3
    SP=.4;
end
if nargin<2
    W=256;
end
if nargin<4
    Window=hamming(W);
end
Window=Window(:); %make it a column vector

L=length(signal);
SP=fix(W.*SP);
N=fix((L-W)/SP +1); %number of segments

Index=(repmat(1:W,N,1)+repmat((0:(N-1))'*SP,1,W))';
hw=repmat(Window,1,N);
Seg=signal(Index).*hw;
And this is our Java code for this function:
public class MatrixAndSegments
{
    public int numberOfSegments;
    public double[][] res;

    public MatrixAndSegments(int numberOfSegments, double[][] res)
    {
        this.numberOfSegments = numberOfSegments;
        this.res = res;
    }
}

public MatrixAndSegments segment(double[] signal_in, int samplesPerWindow, double shiftPercentage, double[] window)
{
    //default shiftPercentage = 0.4
    //default samplesPerWindow = 256 //W
    //default window = hamming

    int L = signal_in.length;
    shiftPercentage = fix(samplesPerWindow * shiftPercentage); //SP
    int numberOfSegments = fix((L - samplesPerWindow) / shiftPercentage + 1); //N

    double[][] reprowMatrix = reprowtrans(samplesPerWindow, numberOfSegments);
    double[][] repcolMatrix = repcoltrans(numberOfSegments, shiftPercentage, samplesPerWindow);

    //Index=(repmat(1:W,N,1)+repmat((0:(N-1))'*SP,1,W))';
    double[][] index = new double[samplesPerWindow + 1][numberOfSegments + 1];
    for (int x = 1; x < samplesPerWindow + 1; x++)
    {
        for (int y = 1; y < numberOfSegments + 1; y++)
        {
            index[x][y] = reprowMatrix[x][y] + repcolMatrix[x][y];
        }
    }

    //hamming window
    double[] hammingWindow = this.HammingWindow(samplesPerWindow);
    double[][] HW = repvector(hammingWindow, numberOfSegments);

    double[][] seg = new double[samplesPerWindow][numberOfSegments];
    for (int y = 1; y < numberOfSegments + 1; y++)
    {
        for (int x = 1; x < samplesPerWindow + 1; x++)
        {
            seg[x - 1][y - 1] = signal_in[(int) index[x][y] - 1] * HW[x - 1][y - 1];
        }
    }

    MatrixAndSegments Matrixseg = new MatrixAndSegments(numberOfSegments, seg);
    return Matrixseg;
}

public int fix(double val) {
    if (val < 0) {
        return (int) Math.ceil(val);
    }
    return (int) Math.floor(val);
}

public double[][] repvector(double[] vec, int replications)
{
    double[][] result = new double[vec.length][replications];
    for (int x = 0; x < vec.length; x++) {
        for (int y = 0; y < replications; y++) {
            result[x][y] = vec[x];
        }
    }
    return result;
}

public double[][] reprowtrans(int end, int replications)
{
    double[][] result = new double[end + 1][replications + 1];
    for (int x = 1; x <= end; x++) {
        for (int y = 1; y <= replications; y++) {
            result[x][y] = x;
        }
    }
    return result;
}

public double[][] repcoltrans(int end, double multiplier, int replications)
{
    double[][] result = new double[replications + 1][end + 1];
    for (int x = 1; x <= replications; x++) {
        for (int y = 1; y <= end; y++) {
            result[x][y] = (y - 1) * multiplier;
        }
    }
    return result;
}

public double[] HammingWindow(int size)
{
    double[] window = new double[size];
    for (int i = 0; i < size; i++)
    {
        window[i] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (size - 1));
    }
    return window;
}
"Porting" Matlab code statement by statement to Java is a bad approach.
Data is rarely manipulated in Matlab using loops and addressing individual elements (because the Matlab interpreter/VM is rather slow), but rather through calls to block processing functions (which have been carefully written and optimized). This leads to a very idiosyncratic programming style in which repmat, reshape, find, fancy indexing et al. are used to do operations which would be much more naturally expressed through Java loops.
For example, to multiply each column of a matrix A by a vector v, you would write in MATLAB:
A = diag(v) * A
or
A = repmat(v', 1, size(A, 2)) .* A
This solution:

for i = 1:size(A, 2),
    A(:, i) = A(:, i) .* v';
end;

is inefficient.
But it would be terribly foolish to try to do the same thing in Java, invoking a matrix product or building a matrix with repeated copies of v. Instead, just do:

for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
        a[i][j] *= v[i];
    }
}
I suggest you try to understand what this MATLAB function is actually doing, instead of focusing on how it is doing it, and reimplement it from scratch in Java, forgetting all of the MATLAB implementation except the specification given in the comments. Indeed, half of the code you have written is useless. Actually, it seems to me that this function may not be needed at all: what it does could be efficiently integrated into the caller's code, as the sketch below suggests.
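To illustrate, here is a compact NumPy sketch of what the function actually does, written from the specification in the comments alone (my own illustration; each statement maps directly onto a plain Java loop):

import numpy as np

def segment(signal, W=256, SP=0.4, window=None):
    # Chop `signal` into overlapping, windowed frames (the columns of the result).
    if window is None:
        window = np.hamming(W)
    shift = int(W * SP)                 # samples between consecutive frame starts
    N = (len(signal) - W) // shift + 1  # number of frames
    seg = np.empty((W, N))
    for n in range(N):
        seg[:, n] = signal[n * shift : n * shift + W] * window
    return seg

frames = segment(np.arange(1000, dtype=float))
print(frames.shape)  # (256, 8) for a 1000-sample signal with the defaults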
I am working on building a little game program, like a sky-diving sim, and I have a lot of the equations down, but my terminal velocity comes out way too high for the given altitude. I have been looking at this and gone over it; the only thing I can think of is that I have one of the measurements wrong. Any help with this will be appreciated.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace GasLaws
{
    class Program
    {
        static void Main(string[] args)
        {
            double temperature;
            double airPressure;
            double airDensity;
            double tV;
            double alt;

            //constants
            const double gasCont = 287.05;
            const double startTemp = 15; //this is C at ground level
            const double aidRate = 0.0065; //adiabatic rate for temperature by elevation

            //human constants
            const double drag = 0.7; //in meters
            const double area = 7.5; //ft^2
            const double weight = 175;

            Console.WriteLine("Enter the alt in feet");
            int tempint = Int32.Parse(Console.ReadLine());
            alt = (double)tempint * 0.3048; //convert feet to meters
            Console.WriteLine("Alt = " + alt + " m");

            //calculate air pressure
            airPressure = AirPressure(alt);
            temperature = CalTemp(startTemp, alt, aidRate);
            airDensity = AirDensity(airPressure, gasCont, temperature);
            tV = TerminalVelocity(weight, drag, airDensity, area);

            //pause screen for a second
            Console.ReadLine();
        }

        //p = 101325(1 - 2.25577 * 10^-5 * alt)^5
        //this calculates correctly
        private static double AirPressure(double al)
        {
            double tempAlt = 1 - 2.25577 * 0.00001 * al;
            Console.WriteLine("Inside eq = " + tempAlt);
            tempAlt = Math.Pow(tempAlt, 5);
            Console.WriteLine("Power of 5 = " + tempAlt);
            tempAlt = 101325 * tempAlt;
            Console.WriteLine("Pressure is = " + tempAlt + " Pascal");
            return tempAlt;
        }

        //temperature calculation
        //use the adiabatic rate to calculate this
        //this is right
        private static double CalTemp(double t, double al, double rate) //start temperature and altitude
        {
            double nTemp = t;
            for (int i = 0; i < al; i++)
            {
                nTemp -= rate;
            }
            Console.WriteLine("Temperature in for loop = " + nTemp);
            return nTemp;
        }

        //calculate like this:
        //density = pressure / (gas constant * temperature)
        //this works fine
        private static double AirDensity(double pres, double gas, double temp)
        {
            temp = temp + 273.15; //convert temperature to kelvins
            double dens = pres / (gas * temp);
            dens = dens / 1000;
            Console.WriteLine("Pressure = " + pres);
            Console.WriteLine("Gas const = " + gas);
            Console.WriteLine("Temperature = " + temp);
            Console.WriteLine("Air Density is: " + dens);
            return dens;
        }

        private static double TerminalVelocity(double w, double cd, double p, double a)
        {
            double v = (2 * w) / (cd * p * a);
            v = Math.Sqrt(v);
            Console.WriteLine("Terminal Velocity = " + v);
            return v;
        }
    }
}
Your formula doesn't seem to include gravitational acceleration, and you are also mixing SI and US/imperial units. What unit will your calculated velocity have? It would probably be easier if you stay with SI units only. Another thing that looks a little strange is this line:
const double drag = 0.7; //in meters
The drag coefficient is a dimensionless number. It shouldn't have a physical unit (like meter).
The correct formula is:
v = sqrt((2 * m * g) / (d * A * C))
The variables, with corresponding SI units, are:
m [kg] - Mass of falling body.
g [m/s^2] - Gravitational acceleration.
d [kg/m^3] - Air density.
A [m^2] - Projected area.
C [-] - Drag coefficient.
If you use these units, the formula will yield the velocity in [m/s]. As an example, let's try the following values with units as above: m = 80, g = 9.8, d = 1, A = 0.7, C = 0.7.
This gives a terminal velocity of v ≈ 57 m/s, which seems reasonable.
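A quick sanity check of the formula in Python (my own snippet, plugging in the example values above):

import math

m = 80.0  # kg, mass of falling body
g = 9.8   # m/s^2, gravitational acceleration
d = 1.0   # kg/m^3, air density
A = 0.7   # m^2, projected area
C = 0.7   # drag coefficient, dimensionless

v = math.sqrt((2 * m * g) / (d * A * C))
print(f"terminal velocity = {v:.1f} m/s")  # about 56.6 m/s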