Applying neural network to MFCCs for variable-length speech segments - matlab

I'm currently trying to create and train a neural network to perform simple speech classification using MFCCs.
At the moment, I'm using 26 coefficients for each sample, and a total of 5 different classes - these are five different words with varying numbers of syllables.
While each sample is 2 seconds long, I am unsure how to handle cases where the user can pronounce words either very slowly or very quickly. E.g., the word 'television' spoken within 1 second yields different coefficients than the word spoken within two seconds.
Any advice on how I can solve this problem would be much appreciated!

Simple neural networks are not invariant to input length and are not suited to analyzing time series.
To classify a time series such as a sequence of MFCC frames, you can use a classifier with time invariance. For example, you can use neural networks combined with hidden Markov models (ANN-HMM), Gaussian mixture models with hidden Markov models (GMM-HMM), or recurrent neural networks (RNN). Matlab implementation for RNN is here. Theano implementation is also available. You can find detailed descriptions of these structures with a Google search.
Speech recognition is not a simple thing to implement; it is better to use existing software such as CMUSphinx.
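To illustrate why a recurrent network copes with variable-length input, here is a minimal sketch in plain Python of a simple (Elman-style) recurrent layer. The weights are hypothetical and untrained, and the frames use 3 coefficients instead of 26 for brevity; the only point is that sequences of different lengths are reduced to a state of the same size, which a classifier could then consume.

```python
import math

def rnn_final_state(frames, w_in, w_rec):
    """Run a simple (Elman-style) recurrent layer over a sequence of frames.

    frames: list of feature vectors, one per time step (any length).
    w_in:   hidden x input weight matrix.
    w_rec:  hidden x hidden recurrent weight matrix.
    Returns the hidden state after the last frame.
    """
    hidden = [0.0] * len(w_rec)
    for frame in frames:
        # New state depends on the current frame and on the previous state.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[k], frame))
                + sum(w * h for w, h in zip(w_rec[k], hidden))
            )
            for k in range(len(w_rec))
        ]
    return hidden

# Hypothetical, untrained weights: 3 coefficients per frame, 2 hidden units.
W_IN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_REC = [[0.1, 0.4], [-0.3, 0.2]]

fast_utterance = [[0.2, 0.1, -0.4]] * 4  # word spoken quickly: 4 frames
slow_utterance = [[0.2, 0.1, -0.4]] * 8  # word spoken slowly: 8 frames

state_fast = rnn_final_state(fast_utterance, W_IN, W_REC)
state_slow = rnn_final_state(slow_utterance, W_IN, W_REC)
```

Both `state_fast` and `state_slow` have two elements regardless of the number of frames. A real system would train the weights and would more likely use an LSTM from an existing toolkit.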

Related

Are there deep learning methods for string similarity in machine translation?

I am interested in machine translation and, more specifically, I would like to examine the similarity between two strings. I would like to know if there are deep learning methods for text feature extraction. I have already tried well-known statistical methods such as cosine similarity, Levenshtein distance, word frequency, and others.
Thank you
To find the similarity between two strings, try training a Siamese network on your dataset.
Siamese networks are a special type of neural network architecture. Instead of learning to classify its inputs, the network learns to differentiate between two inputs; that is, it learns the similarity between them.
https://medium.com/#gautam.karmakar/manhattan-lstm-model-for-text-similarity-2351f80d72f1
The link below describes a Kaggle competition in which Siamese networks were used for text similarity:
https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07
Hope this clears your doubts
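As a rough sketch of the Siamese idea, the same encoder is applied to both strings and the two encodings are compared. The comparison below is the Manhattan similarity exp(-||h1 - h2||_1) used by the MaLSTM model in the linked articles. The encoder here is only a stand-in (character bigram frequencies, nothing is trained); in a real Siamese network it would be a trained LSTM with shared weights. All function names are hypothetical.

```python
import math
from collections import Counter

def encode(text, n=2):
    """Stand-in encoder: relative frequencies of character n-grams.

    A trained LSTM with shared weights would take this role in MaLSTM.
    """
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {gram: count / total for gram, count in grams.items()}

def manhattan_similarity(h1, h2):
    """exp(-L1 distance): 1.0 for identical encodings, approaching 0 otherwise."""
    keys = set(h1) | set(h2)
    l1 = sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in keys)
    return math.exp(-l1)

def siamese_similarity(a, b):
    # The same encoder processes both inputs; only the encodings are compared.
    return manhattan_similarity(encode(a), encode(b))

same = siamese_similarity("neural network", "neural network")
close = siamese_similarity("neural network", "neural networks")
far = siamese_similarity("neural network", "speech signal")
```

Here `same` is exactly 1.0, and `close` is larger than `far`. With a learned encoder the score would reflect semantic similarity rather than surface overlap.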

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
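The "start from a random vector and climb the classifier's score" idea can be sketched as follows. This is a toy under stated assumptions: the "trained network" is replaced by a single logistic unit with hypothetical fixed weights, and inputs are plain real-valued vectors rather than words. For a logistic unit p = sigmoid(w·x + b), the gradient of log p with respect to x is (1 - p)·w, which is what the update uses.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Probability that x is 'part of the training data' under the toy model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def ascend(w, b, x, learning_rate=0.5, steps=50):
    """Gradient ascent on log(score) with respect to the input x."""
    x = list(x)
    for _ in range(steps):
        p = score(w, b, x)
        # d/dx log p = (1 - p) * w for a logistic unit.
        x = [xi + learning_rate * (1.0 - p) * wi for xi, wi in zip(x, w)]
    return x

# Hypothetical weights standing in for a trained classifier.
W = [0.8, -1.2, 0.5]
B = -1.0

rng = random.Random(0)
start = [rng.uniform(-1.0, 1.0) for _ in W]
found = ascend(W, B, start)

score_before = score(W, B, start)
score_after = score(W, B, found)
```

`score_after` is higher than `score_before`. With a real multi-layer network the gradient would come from backpropagation to the input, and the search would only reach a local optimum near the starting point.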
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal led by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Parameter settings for neural networks based classification using Matlab

Recently, I have been trying to use Matlab's built-in neural network toolbox to solve my classification problem. However, I have some questions about the parameter settings.
a. The number of neurons in the hidden layer:
The example on this page, Matlab neural networks classification example, shows a two-layer (i.e. one hidden layer and one output layer) feed-forward neural network. In this example, it uses 10 neurons in the hidden layer:
net = patternnet(10);
My first question is how to determine the best number of neurons for my classification problem. Should I use cross-validation on a training data set to find the number of neurons that performs best?
b. Is there a method for choosing networks with three or more layers?
c. There are many different training methods we can use in the neural network toolbox. A list can be found at Training methods list. The page mentions that the fastest training function is generally 'trainlm'; however, generally speaking, which one will perform best? Or does it depend entirely on the data set I am using?
d. In each training method, there is a parameter called 'epochs', which as I understand it is the number of training iterations. For each training method, Matlab defines the maximum number of epochs to train. However, from the example, it seems like 'epochs' is another parameter we can tune. Am I right? Or do we just set the maximum number of epochs or leave it at the default?
Any experience with the Matlab neural network toolbox is welcome. Thanks very much for your reply.
a. You can refer to How to choose number of hidden layers and nodes in neural network? and ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu
You can certainly use cross-validation to determine the best number of neurons. However, it is not recommended, because cross-validation is better suited to the weight-training stage of a given network.
b. Refer to ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hl
For networks with more layers, you can refer to Deep Learning, which has been very popular in recent years and achieves state-of-the-art performance in many pattern recognition tasks.
c. It depends on your data. trainlm performs better on function fitting (nonlinear regression) problems than on pattern recognition problems. For training large networks and pattern recognition networks, trainscg and trainrp are good choices. Generally, scaled conjugate gradient (trainscg) and resilient backpropagation (trainrp) are recommended. A more detailed comparison can be found here: http://www.mathworks.cn/cn/help/nnet/ug/choose-a-multilayer-neural-network-training-function.html
d. Yes, you're right. We can tune the epochs parameter. Generally, you can output the recognition results/accuracy at every epoch, and you will see that the accuracy improves more and more slowly while the computing time keeps growing with the number of epochs. You can make a compromise between accuracy and computation time.
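The accuracy/time compromise in (d) can be made concrete with a small rule of thumb. The sketch below is in Python rather than Matlab and is not part of the toolbox; the function name, the `min_gain` threshold, and the `patience` value are all hypothetical. It takes the accuracy recorded after each epoch (ideally measured on held-out data, which is an assumption here) and returns the last epoch that still brought a meaningful gain.

```python
def pick_stopping_epoch(accuracy_per_epoch, min_gain=0.005, patience=2):
    """Return the 1-based epoch after which further training adds little.

    Training is considered finished once `patience` consecutive epochs
    each improve on the best accuracy so far by less than `min_gain`.
    """
    best = accuracy_per_epoch[0]
    best_epoch = 1
    stale = 0
    for epoch, accuracy in enumerate(accuracy_per_epoch[1:], start=2):
        if accuracy - best >= min_gain:
            best, best_epoch, stale = accuracy, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch

# Made-up accuracy curve that flattens out, as described above.
history = [0.60, 0.75, 0.82, 0.85, 0.853, 0.854, 0.8545]
stop_at = pick_stopping_epoch(history)
```

For this made-up curve `stop_at` is 4: epochs 5 and 6 each add less than half a percentage point, so the extra computing time buys very little.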
For part b of your question:
You can use code like this:
net = patternnet([10 15 20]);
This creates a network with 3 hidden layers, where the first layer has 10 neurons, the second layer has 15 neurons, and the third layer has 20 neurons.

What's the difference between convolutional and recurrent neural networks? [closed]

I'm new to the topic of neural networks. I came across the two terms convolutional neural network and recurrent neural network.
I'm wondering if these two terms are referring to the same thing, or, if not, what would be the difference between them?
The differences between CNN and RNN are as follows:
CNN:
A CNN takes fixed-size inputs and generates fixed-size outputs.
A CNN is a type of feed-forward artificial neural network; CNNs are variations of multilayer perceptrons designed to use minimal amounts of preprocessing.
CNNs use a connectivity pattern between their neurons that is inspired by the organization of the animal visual cortex, whose individual neurons are arranged in such a way that they respond to overlapping regions tiling the visual field.
CNNs are ideal for image and video processing.
RNN:
An RNN can handle arbitrary input/output lengths.
RNNs, unlike feed-forward neural networks, can use their internal memory to process arbitrary sequences of inputs.
Recurrent neural networks use time-series information, i.e. what I said last will affect what I say next.
RNNs are ideal for text and speech analysis.
Convolutional neural networks (CNN) are designed to recognize images. They contain convolutions, which detect features such as the edges of an object in the image. Recurrent neural networks (RNN) are designed to recognize sequences, for example, a speech signal or a text. A recurrent network has cycles inside, which implies the presence of short-term memory in the net. We have applied both CNNs and RNNs, choosing an appropriate machine learning algorithm, to classify EEG signals for BCI: http://rnd.azoft.com/classification-eeg-signals-brain-computer-interface/
These architectures are completely different, so it is rather hard to say "what is the difference", as the only thing they have in common is the fact that they are both neural networks.
Convolutional networks are networks with overlapping "receptive fields" performing convolution tasks.
Recurrent networks are networks with recurrent connections (going in the opposite direction of the "normal" signal flow) which form cycles in the network's topology.
In addition to the other answers: in a CNN we generally slide a square 2D window over the 2D input image and convolve it with the input to identify patterns.
In an RNN we use previously calculated memory. If you are interested, have a look at LSTM (Long Short-Term Memory), which is a special kind of RNN.
CNNs and RNNs have one point in common: both detect patterns and sequences, which means you cannot shuffle the elements of a single input.
Convolutional neural networks (CNNs) are used mainly for computer vision, and recurrent neural networks (RNNs) for natural language processing.
Although both can be applied in other areas, RNNs have the advantage that signals can travel in both directions, because loops are introduced into the network.
Feedback networks are powerful and can get extremely complicated. Computations derived from the previous input are fed back into the network, which gives them a kind of memory. Feedback networks are dynamic: their state is changing continuously until they reach an equilibrium point.
First, we need to know that recursive NN is different from recurrent NN.
By wiki's definition,
A recursive neural network (RNN) is a kind of deep neural network created by applying the same set of weights recursively over a structure
In this sense, CNN is a type of Recursive NN.
On the other hand, recurrent NN is a type of recursive NN based on time difference.
Therefore, in my opinion, CNN and recurrent NN are different but both are derived from recursive NN.
This is the difference between CNN and RNN:
Convolutional Neural Network:
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. ... They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.
Recurrent Neural Networks:
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.
It is more helpful to describe the convolution and recurrent layers first.
Convolution layer:
Includes input, one or more filters (as well as subsampling).
The input can be one-dimensional or n-dimensional (n>1); for example, it can be a two-dimensional image. One or more filters are also defined in each layer. The input is convolved with each filter. The convolution is much like the convolution of filters in image processing. In general, the purpose of this step is to extract from the input the feature that each filter responds to. The output of each convolution is called a feature map.
For example, if a filter is designed for horizontal edges, the result of its convolution with the input is the extraction of the horizontal edges of the input image. Usually, in practice and especially in the first layers, a large number of filters (for example, 60 filters in one layer) are defined. Also, after convolution, a subsampling operation is usually performed; for example, the maximum or average of each pair of neighboring values is selected.
The convolution layer allows important features and patterns to be extracted from the input, and it removes dependencies (linear and nonlinear) in the input data.
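The horizontal-edge example above can be sketched in a few lines of plain Python. The image, the 2x2 filter values, and the function names are made up for illustration; as in most CNN libraries, the "convolution" is implemented without flipping the filter (strictly speaking, a cross-correlation). The subsampling step here takes the maximum over non-overlapping 2x2 blocks.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool_2x2(feature_map):
    """Subsample by keeping the maximum of each non-overlapping 2x2 block."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

# 5x5 image: dark on top, bright below, so there is one horizontal edge.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
# Filter that responds where brightness increases from one row to the next.
horizontal_edge = [[-1, -1],
                   [1, 1]]

feature_map = conv2d_valid(image, horizontal_edge)
pooled = max_pool_2x2(feature_map)
```

The feature map is non-zero only in the row where the edge lies, and the pooled map keeps that response at a quarter of the size.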
Figure: an example of the use of convolutional layers and pattern extraction for classification (https://i.stack.imgur.com/HS4U0.png). Source: Kalhor, A. (2020). Classification and Regression NNs. Lecture.
Advantages of convolutional layers:
Able to remove correlations and reduce input dimensions
Network generalization increases
Network robustness increases against changes because it extracts key features
Very powerful and widely used in supervised learning
...
Recurrent layers:
In these layers, the output of the current layer or the output of the next layers can also be used as the input of the layer. It also can receive time series as input.
The output without using the recurrent layer is as follows (a simple example):
y = f(W * x)
Where x is the input, W is the weight matrix and f is the activation function.
But in recurrent networks it can be as follows:
y = f(W * x)
y = f(W * y)
y = f(W * y)
... until convergence
This means that in these networks the generated output can be fed back as an input, which gives the network memory. Recurrent networks range from simple ones, such as the Discrete Hopfield Net and the Recurrent Auto-Associative Net, to complex ones such as LSTM.
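The "repeat y = f(W * y) until convergence" scheme can be sketched with a tiny discrete Hopfield-style network in plain Python. The stored pattern, the weight construction (outer product with a zero diagonal), the synchronous update, and the function names are choices made for this illustration, not taken from the text above. Here f is the sign function.

```python
def sign(value):
    return 1 if value >= 0 else -1

def store_pattern(pattern):
    """Weights W[i][j] = p[i] * p[j], with a zero diagonal."""
    n = len(pattern)
    return [
        [0 if i == j else pattern[i] * pattern[j] for j in range(n)]
        for i in range(n)
    ]

def recall(weights, x, max_steps=20):
    """Apply y = f(W * y) repeatedly until the output stops changing."""
    y = list(x)
    for _ in range(max_steps):
        new_y = [sign(sum(w * v for w, v in zip(row, y))) for row in weights]
        if new_y == y:  # converged: feeding the output back changes nothing
            break
        y = new_y
    return y

stored = [1, -1, 1, -1, 1]
W = store_pattern(stored)

noisy = [1, -1, -1, -1, 1]  # the stored pattern with one element flipped
recovered = recall(W, noisy)
```

Starting from the corrupted input, the fed-back output settles on the stored pattern, which is the kind of memory behaviour the paragraph describes.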
An example is shown in the image below.
Advantages of Recurrent Layers:
They have memory capability
They can use time series as input.
They can use the generated output for later use.
Widely used in machine translation, voice recognition, and image description
...
Networks that use convolutional layers are called convolutional networks (CNN). Similarly, networks that use recurrent layers are called recurrent networks. It is also possible to use both layers in a network according to the desired application!

How to train on and make a serialized feature vector for a Neural Network?

By serialized I mean that the values for an input arrive at discrete intervals of time and that the size of the vector is not known beforehand.
Conventionally, neural networks employ a fixed number of parallel input neurons and a fixed number of parallel output neurons.
A serialized implementation could be used in speech recognition, where I can feed the network a time series of the waveform and get the phonemes at the output end.
It would be great if someone can point out some existing implementation.
A simple neural network, as a structure, is not invariant to deformation of the time scale, which is why it is impractical to apply it to recognizing time series. To recognize time series, a generic model such as a hidden Markov model (HMM) is usually used. A NN can be used together with an HMM to classify individual frames of speech. In such an HMM-ANN configuration, audio is split into frames, frame slices are passed into the ANN to calculate phoneme probabilities, and then the whole probability sequence is analyzed for the best match using dynamic search with the HMM.
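The last step, the dynamic search over the per-frame probabilities, is typically a Viterbi search. Below is a minimal sketch in plain Python. The two states, the transition and start probabilities, and the per-frame values are all made up, and the per-frame ANN outputs are used directly as emission scores, which is a simplification (hybrid systems usually rescale them by the state priors first).

```python
import math

def viterbi(frame_probs, transitions, start):
    """Most likely state sequence given per-frame state probabilities.

    frame_probs[t][s]: probability the ANN assigns to state s at frame t.
    transitions[p][s]: HMM probability of moving from state p to state s.
    start[s]:          HMM probability of starting in state s.
    All probabilities must be non-zero because logs are taken.
    """
    n_states = len(start)
    score = [math.log(start[s]) + math.log(frame_probs[0][s])
             for s in range(n_states)]
    backpointers = []
    for probs in frame_probs[1:]:
        previous = score
        pointers, score = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best = max(range(n_states),
                       key=lambda p: previous[p] + math.log(transitions[p][s]))
            pointers.append(best)
            score.append(previous[best] + math.log(transitions[best][s])
                         + math.log(probs[s]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-phoneme model and ANN outputs for five frames.
START = [0.9, 0.1]
TRANSITIONS = [[0.7, 0.3],
               [0.1, 0.9]]
FRAME_PROBS = [[0.8, 0.2], [0.6, 0.4], [0.45, 0.55], [0.2, 0.8], [0.1, 0.9]]

best_path = viterbi(FRAME_PROBS, TRANSITIONS, START)
```

For these made-up numbers the search stays in state 0 for the first two frames and then moves to state 1. Because the HMM allows a state to last any number of frames, the same model matches fast and slow pronunciations.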
An HMM-ANN system usually requires initialization from a more robust HMM-GMM system, so there are no standalone HMM-ANN implementations; usually they are part of a complete speech recognition toolkit. Among popular toolkits, Kaldi has implementations of HMM-ANN and even of HMM-DNN (deep neural networks).
There are also neural networks designed to classify time series: recurrent neural networks, which can be used successfully to classify speech. An example can be created with any toolkit that supports RNNs, for example Keras. If you want to start with recurrent neural networks, try long short-term memory networks (LSTM); their architecture enables more stable training. A Keras setup for speech recognition is discussed in Building Speech Dataset for LSTM binary classification.
There are several types of neural networks that are intended to model sequence data; I would say most of these models fit into an equivalence class known as a recurrent neural network, which is generally any neural network model whose connection graph contains a cycle. The cycle in the connection graph can typically be exploited to model some aspect of the past "state" of the network, and different strategies -- for example, Elman/Jordan nets, Echo State Networks, etc. -- have been developed to take advantage of this state information in different ways.
Historically, recurrent nets have been extremely difficult to train effectively. Thanks to lots of recent work in second-order optimization tools for neural networks, along with research from the deep neural networks community, several recent examples of recurrent networks have been developed that show promise in modeling real-world tasks. In my opinion, one of the neatest current examples of such a network is Ilya Sutskever's "Generating text with recurrent neural networks" (ICML 2011), in which a recurrent net is used as a very compact, long-range n-gram character model. (Try the RNN demo on the linked homepage, it's fun.)
As far as I know, recurrent nets have not yet been applied successfully to speech -> phoneme modeling directly, but Alex Graves specifically mentions this task in several of his recent papers. (Actually, it looks like he has a 2013 ICASSP paper on this topic.)