Understanding the relation between Neural Networks and Hidden Markov Models - neural-network

I've read a few papers about speech recognition based on neural networks, Gaussian mixture models, and hidden Markov models. In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea; however, I still have trouble with some details. I would really appreciate it if someone could enlighten me.
As I understand it, the procedure consists of three elements:
Input
The audio stream gets split up into frames of 10 ms and processed by MFCC extraction, which outputs a feature vector per frame.
DNN
The neural network gets the feature vector as input and processes the features, so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.
HMM
The HMM is a state model, in which each state represents a tri-phone. Each state has a set of probabilities for transitioning to all the other states.
Now the output layer of the DNN produces a feature vector that tells the current state which state it has to change to next.
What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?
I don't need to understand every detail; the basic concept is sufficient for my purpose. I just need to make sure that my basic thinking about the process is right.

In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea; however, I still have trouble with some details.
It is better to read a textbook, not a research paper.
so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.
This sentence does not have a clear meaning, which suggests you are not quite sure yourself. The DNN takes frame features and produces the probabilities for the states.
The HMM is a state model, in which each state represents a tri-phone.
Not necessarily a triphone. Usually there are tied triphones, which means that several triphones correspond to a certain state.
Now the output layer of the DNN produces a feature vector
No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.
that tells the current state which state it has to change to next.
No, the next state is selected by the HMM Viterbi algorithm based on the current state and the DNN probabilities. The DNN alone does not decide the next state.
What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states?
The output layer produces probabilities. It says that phone A at this frame is probable with probability 0.9 and phone B at this frame is probable with probability 0.1.
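To make that concrete, here is a minimal sketch (my own toy numbers, not from the paper) of how per-frame DNN state probabilities and HMM transition probabilities are combined by Viterbi decoding. In a real hybrid system the posteriors are additionally divided by the state priors to obtain scaled likelihoods; that step is omitted here.

```python
# Minimal sketch: combining per-frame DNN state posteriors with HMM
# transition probabilities via Viterbi decoding. All numbers are toy values.
import numpy as np

# dnn_probs[t, s] = probability of HMM state s given the frame at time t
# (this is what the DNN output layer produces, one row per 10 ms frame)
dnn_probs = np.array([[0.9, 0.1],
                      [0.6, 0.4],
                      [0.2, 0.8]])

# trans[i, j] = probability of moving from state i to state j
trans = np.array([[0.7, 0.3],
                  [0.0, 1.0]])
init = np.array([1.0, 0.0])          # start in state 0

T, S = dnn_probs.shape
delta = np.zeros((T, S))             # best path score ending in state s at time t
back = np.zeros((T, S), dtype=int)   # backpointers

delta[0] = np.log(init + 1e-12) + np.log(dnn_probs[0] + 1e-12)
for t in range(1, T):
    for s in range(S):
        scores = delta[t - 1] + np.log(trans[:, s] + 1e-12)
        back[t, s] = np.argmax(scores)
        delta[t, s] = scores[back[t, s]] + np.log(dnn_probs[t, s] + 1e-12)

# Backtrace the best state sequence
path = [int(np.argmax(delta[-1]))]
for t in range(T - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print(list(reversed(path)))          # e.g. [0, 0, 1]
```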
And how is the HMM created in the first place?
Unlike end-to-end systems, which do not use an HMM, the HMM is usually trained as part of a GMM/HMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, then you train the DNN to improve on the GMM.
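If it helps to see that bootstrapping step in code, here is a minimal hedged sketch using the third-party hmmlearn package, whose fit method runs Baum-Welch (EM); it uses single-Gaussian emissions for brevity (hmmlearn also provides GMMHMM), and the feature values are random placeholders. The per-frame state labels it yields are what you would then use as training targets for the DNN.

```python
# Minimal sketch: train an HMM with Baum-Welch (EM), then use its frame/state
# alignment as supervised targets for DNN training. Requires the third-party
# "hmmlearn" package; the feature values below are random placeholders.
import numpy as np
from hmmlearn import hmm

n_states, n_features = 3, 13                 # e.g. 3 HMM states, 13 MFCCs per frame
frames = np.random.randn(200, n_features)    # stand-in for real MFCC frames

model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
model.fit(frames)                            # Baum-Welch re-estimates transitions + emissions

# Forced-alignment-style labels: most likely state per frame (Viterbi)
state_per_frame = model.predict(frames)

# These (frame, state) pairs are the targets a DNN would be trained on
print(model.transmat_.round(2))              # learned transition probabilities
print(state_per_frame[:10])
```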
Where do I get all the information about the probabilities?
It is hard to understand your last question.

Related

Use a trained neural network to imitate its training data

I'm in the early stages of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun, so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient ascent to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal led by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.
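In case a code skeleton helps, here is a hedged minimal sketch of that two-player game in Keras; the layer sizes, dimensions, and names are my own illustrative choices, not from the articles linked above. The generator maps noise to samples and the discriminator learns to separate them from real data.

```python
# A minimal sketch of the two-network GAN setup described above, using Keras.
# Layer sizes, dimensions, and names are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim = 32, 100

# Generator: noise vector -> fake sample
generator = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(data_dim, activation="tanh"),
])

# Discriminator: sample -> probability that it came from the training data
discriminator = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(data_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Stacked model used to train the generator; the discriminator is frozen here
# (trainable status is captured when each model is compiled).
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch):
    n = real_batch.shape[0]
    noise = np.random.normal(size=(n, latent_dim))
    fake_batch = generator.predict(noise, verbose=0)
    # 1) Teach the discriminator: real samples -> 1, generated samples -> 0
    discriminator.train_on_batch(real_batch, np.ones((n, 1)))
    discriminator.train_on_batch(fake_batch, np.zeros((n, 1)))
    # 2) Teach the generator to fool the discriminator (labels flipped to 1)
    gan.train_on_batch(noise, np.ones((n, 1)))
```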

Extracting Patterns using Neural Networks

I am trying to extract common patterns that always appear whenever a certain event occurs.
For example, patients A, B, and C all had a heart attack. Using the readings from their pulse, I want to find the common patterns before the heart attack struck.
In the next stage I want to do this using multiple dimensions. For example, using the readings from the patients' pulse, temperature, and blood pressure, what are the common patterns that occurred across the three dimensions, taking into consideration the time and order between the dimensions?
What is the best way to solve this problem using Neural Networks and which type of network is best?
(Just need some pointing in the right direction)
and thank you all for reading
The described problem looks like a time series prediction problem, that is, a basic prediction problem for a continuous or discrete phenomenon generated by some existing process. As raw data for this problem we will have a sequence of samples x(t), x(t+1), x(t+2), ..., where x() is the output of the considered process and t is some arbitrary timepoint.
For an artificial neural network solution we will consider time series prediction, where we organize our raw data into new sequences. As you should know, we consider X as the matrix of input vectors used in ANN learning. For time series prediction we will construct this new collection according to the following scheme.
In the most basic form your input vector x will be a sequence of samples (x(t-k), x(t-k+1), ..., x(t-1), x(t)): the sample taken at some arbitrary timepoint t with its predecessor samples from timepoints t-k, t-k+1, ..., t-1 prepended. You should generate such an example for every possible timepoint t.
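As a concrete illustration of that construction (the array names and the toy series are my own assumptions), in Python it could look like this:

```python
# Minimal sketch: turn a raw series into (window -> next value) training pairs.
import numpy as np

series = np.sin(np.linspace(0, 20, 500))      # stand-in for e.g. pulse readings
k = 10                                        # window length (past samples per input vector)

X, y = [], []
for t in range(k, len(series) - 1):
    X.append(series[t - k:t + 1])             # x(t-k), ..., x(t)
    y.append(series[t + 1])                   # value to predict, here x(t+1)
X, y = np.array(X), np.array(y)
print(X.shape, y.shape)                       # (489, 11) (489,)
```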
But the key is to preprocess the data so that we get the best prediction results.
Assuming your data (phenomenon) is continuous, you should consider applying some sampling technique. You could start with an experiment using some naive sampling period Δt, but there are stronger methods. See for example the Nyquist–Shannon sampling theorem, where the key idea is that the continuous x(t) can be recovered from the discrete samples x(nΔt). This is reasonable when we consider that we probably expect our ANN to do something similar.
Assuming your data is discrete... you should still try sampling, as this will speed up your computations and might provide better generalization. But the key advice is: do experiments! The best architecture depends on the data, and the data also needs to be preprocessed correctly.
The next thing is the network output layer. From your question it appears that this will be a binary class prediction. But maybe a wider prediction vector is worth considering? How about predicting the future of the considered samples, that is x(t+1), x(t+2), ..., and experimenting with different horizons (lengths of the future)?
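For instance, a hedged Keras sketch of such an output layer (the sizes are illustrative and assume the windowed X built above) could be:

```python
# Minimal sketch: an MLP whose output layer predicts a horizon of h future samples.
from tensorflow import keras
from tensorflow.keras import layers

k, h = 10, 3                                   # window length and prediction horizon
model = keras.Sequential([
    layers.Dense(32, activation="relu", input_shape=(k + 1,)),  # x(t-k)..x(t)
    layers.Dense(h),                           # predicts x(t+1), ..., x(t+h)
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, Y, epochs=10)  # Y would hold h future values per example
```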
Further reading:
Somebody mentioned Python here. Here is a good tutorial on time series prediction with Keras: Victor Schmidt, Keras recurrent tutorial, Deep Learning Tutorials
This paper is good if you need a real example: Fessant, Françoise, Samy Bengio, and Daniel Collobert. "On the prediction of solar activity using different neural network models." Annales Geophysicae, Vol. 14, No. 1, 1996.

MLP with sliding windows = TDNN

I need some confirmation of the following statement.
Are these two equivalent?
1. MLP with sliding time windows
2. Time delay neural network (TDNN)
Can anyone confirm the statement? Possibly with a reference. Thanks.
"Equivalent" is too generalizing but you can roughly say that in terms of architecture (at least regarding their original proposal - there have been more modifications like the MS-TDNN which is even more different from a MLP). The correct phrasing would be that TDNN is an extended MLP architecture [1].
Both use Backpropagation and both are FeedForward nets.
The main idea can probably be phrased like this:
Delaying the inputs of neurons located in a hidden or the output layer
is similar to multiplying the layers beyond and helps with pattern
scaling and translation and is close to integrating the input signal
over time.
What makes it different from the MLP:
However, in order to deal with delayed or scaled input signals, the
original denition of the TDNN required that all (delayed) links of a
neuron that are connected to one input are identical.
This requirement was relaxed in later studies, however, like in [1] where past and present nodes have different weights (which obviously seems reasonable for a number of applications), making it equivalent to an MLP.
That's all regarding architecture comparisons. Let's talk about training. The results will be different: the whole training will differ if you feed the same sequential data into an MLP which only gets the current data one-by-one from a sliding window versus feeding it with current and past data together into the TDNN. The big difference is context. With the MLP you'll have the context of past inputs in past activations. With the TDNN you'll have them in present activations, directly coupled to your present inputs. Again, MLPs have no temporal context capabilities (this is why recurrent neural networks are much more popular for sequential data) and the TDNN is an attempt to solve that. The way I see it, the TDNN is basically an attempt to merge the two worlds of MLPs (basic backprop) and RNNs (context/sequences).
TL;DR: If you strip down the TDNN's purpose, you can say your statement holds true on an architectural level. But if you compare both architectures side by side in action, you will get different observations.
Here is a description of the TDNN taken from the Waibel et al. 1989 paper: "In our TDNN basic unit is modified by introducing delays D1 through Dn as shown in Fig. 1. J inputs of such unit now will be multiplied by several weights, one for each delay." This is essentially an MLP with a sliding window (see also Fig. 2 there).
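To see the architectural claim in code, here is a small hedged sketch (all sizes are illustrative): applying one weight matrix per delay to every sliding window over time, as in the Waibel et al. unit, is the same computation as a dense layer applied to each window, i.e. a 1D convolution over time.

```python
# Minimal sketch: a TDNN layer == a dense layer applied to every sliding window
# over time == a 1D convolution. Sizes below are illustrative.
import numpy as np

T, n_inputs, n_delays, n_units = 50, 16, 3, 8     # frames, features, delays, hidden units
x = np.random.randn(T, n_inputs)                  # input sequence, one feature row per frame
W = np.random.randn(n_delays, n_inputs, n_units)  # one weight matrix per delay
b = np.zeros(n_units)

# TDNN / sliding-window MLP: each output frame sees the current and (n_delays-1) past frames
out = np.zeros((T - n_delays + 1, n_units))
for t in range(n_delays - 1, T):
    window = x[t - n_delays + 1:t + 1]            # shape (n_delays, n_inputs)
    out[t - n_delays + 1] = np.tanh(np.einsum("di,diu->u", window, W) + b)

print(out.shape)                                  # (48, 8): one activation vector per window
```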

Using a learned Artificial Neural Network to solve inputs

I've recently been delving into artificial neural networks again, both evolved and trained. I had a question regarding what methods exist, if any, to solve for inputs that would result in a target output set. Is there a name for this? Everything I try to look for leads me to backpropagation, which isn't necessarily what I need. In my search, the closest thing I've come to expressing my question is
Is it possible to run a neural network in reverse?
which told me that there would indeed be many solutions for networks with varying numbers of nodes in the layers, and that they would not be trivial to solve for. I had the idea of just marching toward an ideal set of inputs using the weights that were established during learning. Does anyone else have experience doing something like this?
In order to elaborate:
Say you have a network with 401 input nodes, which represent a 20x20 grayscale image plus a bias, two hidden layers consisting of 100+25 nodes, as well as 6 output nodes representing a classification (symbols, roman numerals, etc.).
After training a neural network so that it can classify with an acceptable error, I would like to run the network backwards. This would mean I would feed in the classification I would like to see at the output, and the network would imagine a set of inputs that would result in that expected output. So for the roman numeral example, this could mean that I would request it to run the net in reverse for the symbol 'X' and it would generate an image that resembles what the net thinks an 'X' looks like. In this way, I could get a good idea of the features it learned to separate the classifications. I feel it would be very beneficial in understanding how ANNs function and learn in the grand scheme of things.
For a simple feed-forward fully connected NN, it is possible to project a hidden unit's activation into pixel space by taking the inverse of the activation function (for example, the logit for sigmoid units), dividing it by the sum of incoming weights, and then multiplying that value by the weight of each pixel. That gives a visualization of the average pattern recognized by this hidden unit. Summing up these patterns for each hidden unit results in the average pattern that corresponds to this particular set of hidden unit activities. The same procedure can in principle be applied to project output activations into hidden unit activity patterns.
This is indeed useful for analyzing what features the NN learned in image recognition. For more complex methods you can take a look at this paper (among other things, it contains examples of patterns that an NN can learn).
You cannot exactly run an NN in reverse, because it does not remember all the information from the source image - only the patterns it learned to detect. So the network cannot "imagine a set of inputs". However, it is possible to sample a probability distribution (taking the weight as the probability of activation of each pixel) and produce a set of patterns that can be recognized by a particular neuron.
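If it helps, here is a minimal hedged sketch of the back-projection described above, with random weights standing in for a trained network; the sizes mirror the 20x20-pixel example in the question, but nothing here is a definitive implementation.

```python
# Minimal sketch of the back-projection idea: for each hidden unit, invert the
# sigmoid, normalize by its summed incoming weights, and redistribute the value
# over the pixels in proportion to those weights. Random weights stand in for a
# trained 20x20-pixel network with 25 hidden units.
import numpy as np

n_pixels, n_hidden = 400, 25
W = np.random.randn(n_pixels, n_hidden)                 # trained input->hidden weights (placeholder)
hidden_act = np.random.uniform(0.05, 0.95, n_hidden)    # some hidden activation pattern

def logit(a):
    return np.log(a / (1.0 - a))                        # inverse of the sigmoid activation

image = np.zeros(n_pixels)
for j in range(n_hidden):
    z = logit(hidden_act[j])                            # recover the pre-activation of unit j
    avg = z / (W[:, j].sum() + 1e-8)                    # average input implied by that pre-activation
    image += avg * W[:, j]                              # spread it over pixels proportionally to the weights

image = image.reshape(20, 20)                           # visualize e.g. with matplotlib imshow
```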
I know that you can, and I am working on a solution now. I have some code on my github here for imagining the inputs of a neural network that classifies the handwritten digits of the MNIST dataset, but I don't think it is entirely correct. Right now, I simply take a trained network and my desired output and multiply backwards by the learned weights at each layer until I have values for the inputs. This is skipping over the activation function and may have some other errors, but I am getting pretty reasonable images out of it. For example, this is the result of the trained network imagining a 3: number 3
Yes, you can run a probabilistic NN in reverse to get it to 'imagine' inputs that would match an output it's been trained to categorise.
I highly recommend Geoffrey Hinton's coursera course on NN's here:
https://www.coursera.org/course/neuralnets
He demonstrates in his introductory video an NN imagining various "2"s that it would recognise, having been trained to identify the numerals 0 through 9. It's very impressive!
I think it's basically doing exactly what you're looking to do.
Gruff

Continuously train MATLAB ANN, i.e. online training?

I would like to ask for ideas about what options there are for training a MATLAB ANN (artificial neural network) continuously, i.e. without a pre-prepared training set. The idea is to have an "online" data stream; thus, when the network is first created it is completely untrained, but as samples flow in the ANN is trained and converges.
The ANN will be used to classify a set of values, and the implementation would visualize how the training of the ANN improves as samples flow through the system. I.e. each sample is used for training and is then also evaluated by the ANN, and the response is visualized.
The effect that I expect is that for the very first samples the response of the ANN will be more or less random, but as the training progresses the accuracy improves.
Any ideas are most welcome.
Regards, Ola
In MATLAB you can use the adapt function instead of train. You can do this incrementally (changing the weights every time you get a new piece of information) or you can do it every N samples, batch-style.
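Not MATLAB, but as a hedged illustration of the same incremental idea in Python/Keras (assuming Keras is available): each incoming sample is evaluated and then used for a single weight update, which mirrors the adapt-style workflow.

```python
# Minimal sketch of online/incremental training: the network starts untrained
# and is updated from each sample as it streams in, then immediately evaluated.
# This is only a Python analogue of MATLAB's adapt-style updating.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(4,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

def sample_stream(n=1000):
    for _ in range(n):
        x = np.random.randn(1, 4)
        y = np.array([[float(x.sum() > 0)]])   # toy classification target
        yield x, y

for i, (x, y) in enumerate(sample_stream()):
    pred = model.predict(x, verbose=0)         # evaluate/visualize the current response
    model.train_on_batch(x, y)                 # then update the weights with this sample
    if i % 200 == 0:
        print(i, pred[0, 0])
```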
This document gives an in-depth run-down on the different styles of training from the perspective of a time-series problem.
I'd really think about what you're trying to do here, because adaptive learning strategies can be difficult. I found that they like to flail all over compared to their batch counterparts. This was especially true in my case where I work with very noisy signals.
Are you sure that you need adaptive learning? You can't periodically re-train your NN? Or build one that generalizes well enough?