Conjugate Gradient with a noisy function - neural-network

OK, so I'm doing RMSprop or SGD to get a neural network to learn its parameters. But after a while, both training and validation errors appear to have stagnated (outside of random fluctuations: I'm using dropout).
So I decided to try conjugate gradient to refine the values. I still obviously don't want it to overfit, so I kept the dropout... but of course this makes the objective function noisy. So, I guess my question is: does conjugate gradient (or L-BFGS, etc.) require noiseless functions, or can these methods work in the presence of noise?
Thanks!

Gradient-based optimization algorithms are highly sensitive to noise, because the noise perturbs the objective across its domain and corrupts the derivative calculations (and, for line-search methods like conjugate gradient, the function comparisons) that these methods rely on.
To optimize noisy objective functions, it is better to use heuristic algorithms: genetic algorithms, simulated annealing, ant colony optimization, particle swarm optimization... They are not based on gradients and therefore do not share this weakness. A minimal simulated-annealing sketch follows the references below.
You can read more about these algorithms in the book:
Duc Pham, D. Karaboga, Intelligent Optimisation Techniques. London, United Kingdom: Springer-Verlag London, 2000.
If you are interested in Simulated Annealing, you can also read:
Peter Rossmanith, “Simulated Annealing” in Algorithms Unplugged, Vöcking, B., Alt, H., Dietzfelbinger, M., Reischuk, R., Scheideler, C., Vollmer, H., Wagner, D., Ed. Berlin, Germany: Springer-Verlag Berlin Heidelberg, 2011, ch. 41, pp. 393-400.
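To make the simulated-annealing suggestion concrete, here is a minimal sketch in Python. All names here are made up for illustration, and `noisy_loss` is a toy stand-in for a dropout-perturbed training loss, not the asker's actual objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_loss(w):
    # Toy stand-in for a dropout-perturbed loss: a smooth bowl plus noise.
    return np.sum((w - 1.0) ** 2) + 0.1 * rng.normal()

def anneal(w, loss, n_steps=5000, t0=1.0, t_min=1e-3, step=0.1):
    best_w, best_f = w.copy(), loss(w)
    cur_w, cur_f = best_w.copy(), best_f
    for k in range(n_steps):
        t = max(t_min, t0 * (1 - k / n_steps))          # linear cooling schedule
        cand = cur_w + step * rng.normal(size=w.shape)  # random neighbour
        f = loss(cand)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if f < cur_f or rng.random() < np.exp(-(f - cur_f) / t):
            cur_w, cur_f = cand, f
            if f < best_f:  # note: with a noisy loss, "best observed" is optimistic
                best_w, best_f = cand.copy(), f
    return best_w, best_f

w_opt, f_opt = anneal(np.zeros(4), noisy_loss)
print(w_opt)  # should land near [1, 1, 1, 1] despite the noise
```

Because acceptance depends only on function comparisons (no gradients), occasional noisy evaluations just act like a slightly higher temperature rather than derailing the search.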

Related

MATLAB program to estimate the energy spectrum

MATLAB program to estimate the energy spectrum by non-parametric and parametric methods in the case of EEG signals from epilepsy patients with focal and non-focal area recordings
I have to do this and I have no idea where to start. I have this reference link, https://www.upf.edu/web/ntsa/downloads/-/asset_publisher/xvT6E4pczrBw/content/2012-nonrandomness-nonlinear-dependence-and-nonstationarity-of-electroencephalographic-recordings-from-epilepsy-patients?inheritRedirect=false&redirect=https%3A%2F%2Fwww.upf.edu%2Fweb%2Fntsa%2Fdownloads%3Fp_p_id%3D101_INSTANCE_xvT6E4pczrBw%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1#.X751wmUzaUk
Could someone give me some advice on where to start, please?
I have done both (parametric and non-parametric estimation). The Signal Processing Toolbox and System Identification Toolbox documentation is very good on this topic.
Here is a paper I published on this topic. It is about point processes rather than continuous signals, but it does have sections on parametric versus non-parametric spectrum analysis.
Non-parametric is the normal way. If your algorithm uses the fast Fourier transform, it is non-parametric.
Parametric means poles-and-zeroes type models. Spectra estimated parametrically tend to be a lot smoother than non-parametric estimates. A key question that arises in parametric estimation, and does not arise in non-parametric estimation, is what model structure and what model order to use. There are ways to answer those questions: the Akaike information criterion (AIC) is a standard way to choose the model order, although there are certainly alternatives to AIC which are better in some situations.
As you read the documentation, you will see numerous references to 'estimation'. This may seem strange at first, but it is a very important concept: when you determine the spectrum from a signal, whether you do it parametrically or non-parametrically, you are estimating the true underlying spectrum based on a sample of the process. The data you recorded is that sample.
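As a rough sketch of the two approaches, here is both estimators side by side in Python with SciPy (in MATLAB the analogous functions would be pwelch and pyulear); the signal, sampling rate, and AR order below are invented placeholders, not values from the question:

```python
import numpy as np
from scipy.signal import welch
from scipy.linalg import solve_toeplitz

fs = 256.0                                   # assumed sampling rate, Hz
t = np.arange(0, 8, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)  # toy "EEG"

# Non-parametric: Welch's averaged periodogram (FFT-based).
f_np, pxx_np = welch(x, fs=fs, nperseg=512)

# Parametric: fit an AR(p) model via the Yule-Walker equations,
# then evaluate the model's spectrum on a frequency grid.
p = 12                                        # model order (AIC could choose this)
r = np.correlate(x, x, mode="full")[x.size - 1:] / x.size  # autocorrelation
a = solve_toeplitz(r[:p], r[1:p + 1])        # AR coefficients
sigma2 = r[0] - a @ r[1:p + 1]               # innovation variance
f_ar = np.linspace(0, fs / 2, 512)
z = np.exp(-2j * np.pi * np.outer(f_ar / fs, np.arange(1, p + 1)))
pxx_ar = sigma2 / (fs * np.abs(1 - z @ a) ** 2)
```

Plotting pxx_np against pxx_ar illustrates the point above: the AR spectrum is smooth by construction, while the Welch estimate shows the variance of the raw data.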

How to update clusters without re-clustering the whole dataset after clustering has finished?

I used spectral clustering. What should I do to avoid re-clustering the whole dataset when new data arrives?
There are papers on how to infer the spectral embedding for new data points, e.g.,
Bengio, Y., Paiement, J. F., Vincent, P., Delalleau, O., Roux, N. L., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems (pp. 177-184).
You can then assign them to the nearest k-means cluster and update the means.
But implementing this will require quite some coding work on your part, in particular if you need it to be fast.
Clustering isn't really meant to be updatable, and neither is spectral embedding, so it is worth looking into alternative algorithms, and reconsidering whether you really need this capability at all.
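For a sense of what that coding work looks like, here is a rough Nyström-style out-of-sample sketch in the spirit of Bengio et al. (2004). The function names, the RBF kernel choice, and the normalization are my assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # Pairwise RBF kernel between rows of a and rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_spectral(X, k, gamma=1.0):
    K = rbf(X, X, gamma)
    d = K.sum(1)
    M = K / np.sqrt(np.outer(d, d))       # normalized affinity (symmetric)
    lam, U = np.linalg.eigh(M)
    return X, d, lam[-k:], U[:, -k:]      # keep the top-k eigenpairs

def embed_new(x_new, model, gamma=1.0):
    # Nystrom extension: project new points onto the stored eigenvectors.
    X, d, lam, U = model
    k_new = rbf(x_new, X, gamma)
    m_new = k_new / np.sqrt(np.outer(k_new.sum(1), d))
    return (m_new @ U) / lam              # beware near-zero eigenvalues here

# After embedding new points this way, assign each to the nearest k-means
# centroid from the original embedding and update that centroid's mean.
```

Note the O(n) kernel evaluation per new point against the stored training set; getting this fast for large n is exactly the engineering effort mentioned above.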

Difference between i-vector and d-vector

Could someone please explain the difference between i-vector and d-vector? All I know about them is that they are widely used in speaker/speech recognition systems and that they are kinds of templates for representing speaker information, but I don't know the main differences.
The i-vector is a feature that represents the idiosyncratic characteristics of the frame-level features' distributive pattern. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is not explicitly formed when computing the i-vector). It is extracted in a similar manner to the eigenvoice adaptation scheme or the JFA technique, but is extracted per sentence (or input speech sample).
The d-vector, on the other hand, is extracted using a DNN. To extract a d-vector, one trains a DNN model that takes stacked filterbank features as input (similar to a DNN acoustic model used in ASR) and predicts the one-hot speaker label (or the speaker posterior probabilities) at the output. The d-vector is the averaged activation of the last hidden layer of this DNN. So unlike the i-vector framework, this makes no assumptions about the features' distribution (the i-vector framework assumes that the i-vector, i.e. the latent variable, has a Gaussian distribution).
So in conclusion, these are two distinct features extracted by totally different methods, under different assumptions. I recommend reading these papers:
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4080-4084.
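To illustrate the d-vector extraction step described above, here is a toy numpy sketch. The layer sizes, the 840-dimensional stacked-filterbank input, and the random placeholder weights are all my assumptions; a real system would first train the network with a speaker-classification loss:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Placeholder weights: e.g. 21 stacked frames of 40-dim filterbanks = 840 inputs.
W1, b1 = rng.normal(size=(840, 256)) * 0.05, np.zeros(256)
W2, b2 = rng.normal(size=(256, 256)) * 0.05, np.zeros(256)

def d_vector(frames):
    """frames: (n_frames, 840) stacked filterbank features for one utterance."""
    h1 = relu(frames @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)        # last hidden layer; the softmax layer is dropped
    d = h2.mean(axis=0)            # average over frames -> one vector per utterance
    return d / np.linalg.norm(d)   # length-normalize before cosine scoring

utt = rng.normal(size=(300, 840))  # fake utterance of 300 frames
print(d_vector(utt).shape)         # (256,)
```

The key point the sketch makes concrete: the speaker-classification output layer is used only during training, and enrollment/test utterances are compared via their averaged hidden activations.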
I don't know how to properly characterize the d-vector in plain language, but I can help a little.
The identity vector, or i-vector, is a spectral signature for a particular slice of speech, usually a sliver of a phoneme, rarely (as far as I can see) as large as the entire phoneme. Basically, it's a discrete spectrogram expressed in a form isomorphic to the Gaussian mixture of the time slice.
EDIT
Thanks to those who provided comments and a superior answer. I updated this only to replace the incorrect information from my original attempt.
A d-vector is extracted from a deep neural network: it is the mean of the feature vectors produced in the DNN's final hidden layer. This becomes the model for the speaker, used to compare against other speech samples for identification.

Why do sigmoid functions work in Neural Nets?

I have just started programming neural networks. I am currently working on understanding how a backpropagation (BP) neural net works. While the training algorithm for BP nets is quite straightforward, I was unable to find any text on why the algorithm works. More specifically, I am looking for some mathematical reasoning to justify the use of sigmoid functions in neural nets, and for an explanation of what makes them mimic almost any data distribution thrown at them.
Thanks!
The sigmoid function introduces non-linearity into the network. Without a non-linear activation function, the net can only learn functions which are linear combinations of its inputs; with one, a network with a single hidden layer can approximate essentially any continuous function arbitrarily well. This result is called the universal approximation theorem, or Cybenko's theorem, after the gentleman who proved it in 1989. Wikipedia is a good place to start, and it has a link to the original paper (the proof is somewhat involved, though). The reason why you would use a sigmoid as opposed to something else is that it is continuous and differentiable, its derivative is very cheap to compute from the forward value (σ'(x) = σ(x)(1 − σ(x))), and it has a limited range (from 0 to 1, exclusive); tanh has similar properties, with range (−1, 1).
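A quick numeric sketch of the two claims above: the sigmoid's derivative can be recovered from its forward value, and stacking purely linear layers collapses into a single linear map, so depth buys nothing without a non-linearity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Check the derivative identity s'(x) = s(x) * (1 - s(x)) against a
# numerical derivative; no extra exp() is needed beyond the forward pass.
x = np.linspace(-5, 5, 101)
s = sigmoid(x)
analytic = s * (1 - s)
numeric = np.gradient(s, x)
print(np.allclose(analytic, numeric, atol=1e-3))   # True

# Two linear layers are one linear layer: W2 @ (W1 @ v) == (W2 @ W1) @ v,
# which is why a non-linear activation is needed for universal approximation.
W1, W2 = np.random.randn(3, 4), np.random.randn(2, 3)
v = np.random.randn(4)
print(np.allclose(W2 @ (W1 @ v), (W2 @ W1) @ v))   # True
```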

Neural Network learning rate and batch weight update

I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.
I've read that batch updates of the weights will give a more stable gradient search than an online weight update.
As a test I've created a time series function of 100 points, such that x = [0..99] and y = f(x). I've created a neural network with one input, one output, and two hidden layers of 10 neurons each for testing. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.
I have 100 input points so when I calculate the weight change dw_{ij} for each node it is actually a sum:
dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}
where p = 100 in this case.
Now the weight updates become really huge and therefore my error E bounces around such that it is hard to find a minimum. The only way I got some proper behaviour was when I set the learning rate y to something like 0.7 / p^2.
Is there some general rule for setting the learning rate based on the number of samples?
From the comp.ai.neural-nets FAQ (mirrored at http://francky.me/faqai.php#otherFAQs):
Subject: What learning rate should be used for backprop?

In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and objective function diverge, so there is no learning at all. If the objective function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs with hidden units, the optimal learning rate often changes dramatically during the training process, since the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a tedious process requiring much trial and error. For some examples of how the choice of learning rate and momentum interact with numerical condition in some very simple networks, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html

With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist (see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned under "What are conjugate gradients, Levenberg-Marquardt, etc.?").

Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.

With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the learning rate during training. Various proposals have appeared in the NN literature, but most of them don't work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate. There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running average of the weight values. I have no personal experience with these methods; if you have any solid evidence that these or other methods of automatically setting the learning rate and/or momentum in incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer (saswss#unx.sas.com).
References:
Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, J. E., Hanson, S. J., and Lippmann, R. P., eds., Advances in Neural Information Processing Systems 4, San Mateo, CA: Morgan Kaufmann Publishers, pp. 1009-1016.
Kushner, H. J. and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.
LeCun, Y., Simard, P. Y., and Pearlmutter, B. (1993), "Automatic learning rate maximization by online estimation of the Hessian's eigenvectors," in Hanson, S. J., Cowan, J. D., and Giles, C. L., eds., Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 156-163.
Orr, G. B. and Leen, T. K. (1997), "Using curvature information for fast stochastic search," in Mozer, M. C., Jordan, M. I., and Petsche, T., eds., Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 606-612.
Credits:
Archive-name: ai-faq/neural-nets/part1
Last-modified: 2002-05-17
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss#unx.sas.com (Warren S. Sarle)
Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC, USA.
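To make the FAQ's Quickprop/RPROP point concrete, here is a minimal Rprop-style update in Python. This is my own illustration, not code from the FAQ, and it shows the simple sign-only variant: the per-weight step size depends only on the sign of the gradient, never its magnitude:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One Rprop-style update; step is the per-weight step-size array."""
    sign_change = grad * prev_grad
    # Same gradient sign as last time: grow the step; sign flipped: shrink it.
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - np.sign(grad) * step          # the gradient's magnitude is ignored
    return w, step
```

This is exactly the property the FAQ praises: a tiny gradient far from a minimum can still produce a large step, and a huge gradient near a cliff does not blow the weights up.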
A simple solution would be to take the average weight update over the batch instead of the sum (see the sketch below). This way you can just use a learning rate of 0.7 (or any other value of your liking), without having to worry about optimizing yet another parameter.
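A minimal sketch of this averaging, assuming per-sample gradients are already computed; the names are illustrative and not taken from the asker's Java code:

```python
import numpy as np

def batch_update(w, per_sample_grads, lr=0.7):
    """per_sample_grads: array of shape (p, ...) holding one gradient per sample."""
    g = np.mean(per_sample_grads, axis=0)  # average instead of summing over p samples
    return w - lr * g
```

With the sum, the effective step grows with the batch size p, which is why the asker had to scale the rate down by hand; with the mean, the same learning rate works regardless of p.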
More interesting information about batch updating and learning rates can be found in this article by Wilson (2003).